Last Updated on June 7, 2016
There was a recent question that asked “How to not waste-time/procrastinate while ml scripts are running?”.
I think this is an important question. I think answers to this question show a level of organization or maturity in your approach to work.
I left a small comment on this question, but in this post I elaborate on my answer and give you a few perspectives on how to consider this question, minimize it and even avoid it completely.
Run fewer experiments
Consider why you are executing model runs. You are almost certainly performing a form of exploratory data analysis.
You are trying to understand your problem with the aim of achieving a result with a specific accuracy. You may want the result for a report or you may want the model to be operationalized.
Your experiments are intended to teach you something about the problem. As such, you need to be crystal clear on what you intend to learn from each experiment that you execute.
If you do not have a clear unambiguous question that the experimental results will enlighten, consider whether you need to run the experiment at all.
When you get empirical answers to your questions, honor those results. Do your best to integrate the new knowledge into your understanding of the problem. This may be a semi-formal work product such as a daily journal or a technical report.
Run faster experiments
The compile-run-fix loop of modern programming is very efficient. The immediate pay-off lets you continually test ideas and course-correct.
This process was not always so efficient. As engineers, you used to design modules and desk-check their logic by hand with pen and paper. If you do any mathematics in your programming you very likely still use this process.
Unit tests are a useful modern tool that automates the desk-check process, making it repeatable. A maxim for good test design is speed. The more immediate the feedback, the faster you can course-correct and fix bugs.
The lesson here is you want speed.
You want to get the empirical answers to your questions quickly so that you can ask the follow-up questions. This does not mean designing bad experiments. It means making the experiments only large or detailed enough to answer one question.
The simplest way to have faster experiments is to work with reduced samples of your data. It’s so simple a technique that it’s often overlooked.
Often the effect you are looking for scales predictably with the data, whether it is a property of the data itself like outliers or the accuracy from models of the data.
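As a minimal sketch of this idea, the snippet below draws a reproducible random sample of a dataset and repeats the same step at a few sizes. The `subsample` helper, the toy `data` list and the chosen fractions are all assumptions for illustration, not part of any particular library.

```python
import random


def subsample(rows, fraction, seed=0):
    """Return a reproducible random subset holding about `fraction` of `rows`."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so a rerun sees the same sample
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)


# Hypothetical dataset: 10,000 (feature, label) pairs.
data = [(i, i % 2) for i in range(10_000)]

# Ask the question on 1% of the data first, then scale up only if needed.
for fraction in (0.01, 0.1, 1.0):
    sample = subsample(data, fraction)
    print(f"fraction={fraction}: {len(sample)} rows")
```

If the result you care about looks the same at 1% and at 10%, you may not need the full run to answer the question.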
Run tuning as experiments
Some experiments are inherently slow, like tuning hyper-parameters. In fact, tuning can be really addictive when your pursuit is optimized accuracy.
Completely avoid hand tuning any parameters. It’s a trap! My suggestion is to design methodical tuning experiments using a search method like random or grid search.
Collect the results and use the parameters that your experiments suggest are optimal.
If you want better results, design follow-up experiments on reduced hyper-cubes in parameter space and change the search algorithms to use gradient (or quasi-gradient) based methods.
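A methodical tuning experiment might look something like the sketch below: a random search over a parameter space, followed by a second search on a smaller hyper-cube around the best result. The `toy_score` function is a stand-in for training and validating a real model, and the parameter names, ranges and the `shrink` helper are hypothetical.

```python
import random


def random_search(evaluate, space, n_trials, seed=0):
    """Sample `n_trials` configurations uniformly from `space`, best first."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        results.append((evaluate(params), params))
    results.sort(key=lambda r: r[0], reverse=True)  # higher score is better
    return results


def toy_score(params):
    # Placeholder for "train the model and return validation accuracy".
    return -((params["learning_rate"] - 0.1) ** 2) - 0.01 * (params["reg"] - 1.0) ** 2


def shrink(space, center, factor=0.25):
    """Build a smaller hyper-cube around `center`, clipped to the original bounds."""
    narrowed = {}
    for name, (low, high) in space.items():
        half = (high - low) * factor / 2
        narrowed[name] = (max(low, center[name] - half), min(high, center[name] + half))
    return narrowed


space = {"learning_rate": (0.001, 1.0), "reg": (0.0, 10.0)}
results = random_search(toy_score, space, n_trials=50)
best_score, best_params = results[0]

# Follow-up experiment on the reduced hyper-cube.
narrow = shrink(space, best_params)
follow_up = random_search(toy_score, narrow, n_trials=50, seed=1)
print(best_params, follow_up[0][1])
```

The follow-up here reuses random search for brevity. Swapping in a gradient or quasi-gradient based method for that second stage is a separate design choice.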
Run experiments in downtime
Avoid running experiments in your most productive time. If you get your work done in daylight working hours, don’t tie up your machine and focus in that time with a blocking task like a model run.
Schedule your experiments to run when you are not working. Run experiments at night, in your lunch hour and over the weekends.
To run your experiments in your down time means that you will need to schedule them. This becomes a lot easier if you are able to batch your experiments.
You can do this by taking time to design 5-10 experiments in a batch, preparing the model runs and running the experiments sequentially or in parallel in your off-time.
This may require discipline to decouple the question and the answers that your experiments serve. The benefits will be the depth of knowledge you gain about your problem and the increased speed at which you obtain it.
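One way to batch experiments is to describe each one as a small record that pairs the question with its configuration, then hand the whole list to a runner. The sketch below is an illustration under stated assumptions: `run_experiment` has a placeholder body, and the experiment names, questions and values are made up.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def run_experiment(config):
    """Hypothetical experiment: replace the body with a real model run."""
    values = config["values"]
    score = sum(values) / len(values)  # placeholder for a real metric
    return {"name": config["name"], "question": config["question"], "score": score}


def run_batch(configs, parallel=False, max_workers=2):
    """Run every experiment in `configs`, keeping results in input order."""
    if parallel:
        # Threads keep the sketch simple; CPU-bound model runs would more
        # likely use separate processes or separate scripts.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(run_experiment, configs))
    return [run_experiment(config) for config in configs]


batch = [
    {"name": "baseline", "question": "What does the simplest model score?", "values": [0.6, 0.7]},
    {"name": "more-data", "question": "Does doubling the sample help?", "values": [0.7, 0.8]},
]

results = run_batch(batch)
for record in results:
    print(json.dumps(record))  # one line per answer, easy to review the next day
```

Keeping the question next to the score means that when you read the results the next morning, you still know what each run was meant to teach you.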
Run experiments off-site
Some experiments may require days or weeks, meaning that running them on your workstation is practically infeasible.
For long-running experiments you can harness compute servers in the cloud (like EC2 and friends) or a local compute server. Regardless of its locale, the compute server is not to be used in real-time. You feed in questions and receive back answers.
The most efficient use of a compute server is to have a queue of questions and a process for consuming and integrating the answers into your growing knowledge base on the problem.
For example, you may set the goal of running one experiment per day (or night) no matter what. I often try to hold to this pattern on new projects. This can be good for keeping momentum high.
When ideas wane, you can fill the queue with thoughtless optimization experiments to tune the parameters of well-performing models, an ongoing background task that you can always fall back on.
Plan while experiments are running
Sometimes you must run experiments on your workstation in real-time. Your workstation must block while the model runs. The reason will be some pressing real-time requirement that you cannot delay.
When this happens, remember that your project and your thoughts are not blocked, only your workstation.
Pull out a text editor or a pen and paper (preferred so you don’t steal any cycles from your experimental run). Use this time to think deeply about your project. Make lists like:
- List and prioritize experiments you would like to perform.
- List questions, expected answers, the set-up required and the impact the results of each experiment will have.
- List and prioritize assumptions and experiments you can do to dispute them.
- List and prioritize areas of the code you would like to write unit tests for.
- List alternative perspectives and framing of your problem.
Be creative and consider testing long-held beliefs about the project.
I like to do my creative work at the end of the day to allow my subconscious to work on the problems while I sleep. I also like to run experiments on my workstation overnight to let it think alongside my subconscious.
In this post you have discovered some ways to tackle the problem of being productive during machine learning model runs.
Below is a summary of the key tactics that you can use:
- Consider whether each experiment is required using the contribution it will provide to your understanding of the problem as the evaluation criteria.
- Design experiments that run faster and use samples of data to achieve speed ups.
- Never tune hyper-parameters by hand. Always design automated experiments to answer questions of model calibration.
- Run experiments during your down time, such as overnight, lunch breaks and weekends.
- Design experiments in batch so that you can queue and schedule their execution.
- Delegate experimental runs to compute servers off your workstation to increase efficiency.
- If you must run blocking real-time experiments, use that time to think deeply about your problem, design future experiments and challenge base assumptions.
Good stuff. I often find I allow myself to be distracted by Hacker News or whatever when I’m waiting for a process to run, and then before I know I’ve lost more time than I intended. That last tip sounds like a good way out of this.
I hope it helps, Flimm.