It is good practice to have reproducible outcomes in software projects. It might even be standard practice by now; I hope it is.
You can take any developer off the street and they should be able to follow your process to check out the code base from revision control and make a build of the software ready to use. Even better if you have a procedure for setting up an environment and for releasing the software to users/operational environments.
It is the tools and the process that make the outcome reproducible. In this post you will learn that it is just as important to make the outcomes of your machine learning projects reproducible, and that practitioners and academics in the field of machine learning struggle with this.
As a programmer and a developer you already have the tools and the process to leap ahead, if you have the discipline.
Reproducibility of Results in Computational Sciences
Reproducibility of experiments is one of the main principles of the scientific method. You write up what you did, but other scientists don’t have to take your word for it: they follow the same process and expect to get the same result.
Work in the computational sciences involves code running on computers that reads and writes data. Experiments that report results without clearly specifying each of these elements are very likely not easily reproducible. If the experiment cannot be reproduced, then what value is the work?
This is an open problem in computational sciences and is becoming ever more concerning as more fields rely on computational results of experiments. In this section we will review this open problem by looking at a few papers that consider the issue.
Ten Simple Rules for Reproducible Computational Research
This was an article in PLoS Computational Biology in 2013 by Geir Kjetil Sandve, Anton Nekrutenko, James Taylor and Eivind Hovig. In the paper, the authors list 10 simple rules that, if followed, are expected to result in more accessible (reproducible!?) computational research. The rules are summarized below.
- Rule 1: For Every Result, Keep Track of How It Was Produced
- Rule 2: Avoid Manual Data Manipulation Steps
- Rule 3: Archive the Exact Versions of All External Programs Used
- Rule 4: Version Control All Custom Scripts
- Rule 5: Record All Intermediate Results, When Possible in Standardized Formats
- Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds
- Rule 7: Always Store Raw Data behind Plots
- Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
- Rule 9: Connect Textual Statements to Underlying Results
- Rule 10: Provide Public Access to Scripts, Runs, and Results
The authors are commenting from the field of computational biology. Nevertheless, I would argue the rules do not go far enough. I find them descriptive and I would be a lot more prescriptive.
For example, with rule 2 “Avoid Manual Data Manipulation Steps”, I would argue that all data manipulation should be automated. For rule 4 “Version Control All Custom Scripts”, I would argue that the entire automated process that creates the work product should be in revision control.
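As a rough illustration of rules 1 and 6 (and a small step toward rule 3), here is a minimal Python sketch that stores a result together with the seed, parameters and interpreter version that produced it. The function names, the toy analysis and the record fields are my own assumptions for illustration; they do not come from the paper.

```python
import json
import platform
import random
import sys
from datetime import datetime, timezone


def run_experiment(seed, n_samples=1000):
    """Toy analysis (hypothetical): mean of n_samples draws from a seeded generator."""
    rng = random.Random(seed)  # local generator, so the seed fully determines the draws
    draws = [rng.random() for _ in range(n_samples)]
    return sum(draws) / n_samples


def make_record(seed, n_samples, result):
    """Collect what is needed to trace a result back to how it was produced."""
    return {
        "script": sys.argv[0],
        "python_version": platform.python_version(),
        "seed": seed,
        "n_samples": n_samples,
        "result": result,
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    seed, n_samples = 42, 1000
    record = make_record(seed, n_samples, run_experiment(seed, n_samples))
    # In a real project this record would be written next to the result it describes.
    print(json.dumps(record, indent=2))
```

Running the script twice with the same seed gives the same result value, and the record says which seed and parameters to use to get it again.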
If you are a developer familiar with professional process, your mind should be buzzing with how dependency management, build systems, markup systems for documents that can execute embedded code, and continuous integration tools could bring some rigor.
Accessible Reproducible Research
This was an article by Jill Mesirov published in Science magazine in 2010. In this short article, the author introduces terminology for systems that facilitate reproducible computational research by scientists, specifically:
- Reproducible Research System (RRS): Composed of a Reproducible Research Environment and a Reproducible Research Publisher.
- Reproducible Research Environment (RRE): The computational tools, management of data, analyses and results and the ability to package them together for redistribution.
- Reproducible Research Publisher (RRP): The document preparation system which links to the Reproducible Research Environment and provides the ability to embed analyses and results.
A prototype system is described that was developed for Gene Expression analysis experiments called the GenePattern-Word RRS.
Again, looking through the eyes of software development and the tools available, the RRE sounds like revision control plus a build system with dependency management plus a continuous integration server. The RRP sounds like a markup system with linking and a build process.
An invitation to reproducible computational research
This was a paper written by David Donoho in Biostatistics, 2010. This is a great paper, and I really agree with the points it makes. For example:
“Computational reproducibility is not an afterthought — it is something that must be designed into a project from the beginning.”
I could not articulate it more clearly myself. In the paper, the author lists the benefits of building reproducibility into computational research. For the researcher the benefits are:
- Improved work and work habits.
- Improved teamwork.
- Greater impact (less inadvertent competition and more acknowledgement).
- Greater continuity and cumulative impact.
The benefits the author lists for the taxpayer that funds the research are:
- Stewardship of public goods.
- Public access to public goods.
I made some of the same arguments to colleagues off the cuff and it is fantastic to be able to point to this paper that does a much better job of making a case.
Making scientific computations reproducible
This paper was published in Computing in Science & Engineering in 2000 by Matthias Schwab, Martin Karrenbach and Jon Claerbout. The opening sentences of this paper are terrific:
“Commonly research involving scientific computations are reproducible in principle but not in practice. The published documents are merely the advertisement of scholarship whereas the computer programs, input data, parameter values, etc. embody the scholarship itself. Consequently authors are usually unable to reproduce their own work after a few months or years.”
The paper describes the standardization of computational experiments through the adoption of GNU make, standard project structure and the distribution of experimental project files on the web. These practices were standardized in the Stanford Exploration Project (SEP).
The motivating problem addressed by the adoption was the loss of programming effort when a graduate student left the group because of the inability to reproduce and build upon experiments.
The ideas of a standard project structure and build system seem so natural to a developer.
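To make the build-system idea concrete, here is a minimal Python sketch of the rule at the heart of make: rebuild a target only if it is missing or older than one of its sources. This is an illustration of the general idea, not GNU make itself and not the SEP rules from the paper; the function names and the line-counting action are hypothetical.

```python
import os
import tempfile


def is_stale(target, sources):
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(source) > target_time for source in sources)


def build(target, sources, action):
    """Run action(target, sources) only when the target is out of date."""
    if is_stale(target, sources):
        action(target, sources)
        return True
    return False


def count_lines(target, sources):
    """Hypothetical build action: write the total line count of the sources."""
    total = 0
    for source in sources:
        with open(source) as handle:
            total += sum(1 for _ in handle)
    with open(target, "w") as handle:
        handle.write(f"{total}\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        source = os.path.join(workdir, "data.txt")
        target = os.path.join(workdir, "line_count.txt")
        with open(source, "w") as handle:
            handle.write("a\nb\nc\n")
        print(build(target, [source], count_lines))  # target missing, so the action runs
        print(build(target, [source], count_lines))  # target up to date, so nothing runs
```

In practice you would use make or a similar tool rather than write this yourself; the sketch only shows why a result declared as a build target can be regenerated by anyone with the sources.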
Reproducibility by Default in Machine Learning
The key point I want to make is this: when starting in the field of machine learning, do not disregard the excellent practices that have become standard in software development. Use them and build upon them.
I have a blueprint I use for machine learning projects and it’s improving with each project I complete. I hope to share it in the future. Watch this space. Until then, here are some tips for reusing software tools to make reproducibility a default for applied machine learning and machine learning projects in general:
- Use a build system and have all results produced automatically by build targets. If it’s not automated, it’s not part of the project. Have an idea for a graph or an analysis? Automate its generation.
- Automate all data selection, preprocessing and transformations. I even put in wget commands for acquiring data files when working on machine learning competitions. I want to be able to get up and running from scratch on new workstations and fast servers.
- Use revision control and tag milestones.
- Strongly consider checking in dependencies, or at least linking to them.
- Avoid writing code. Write thin scripts, and use standard tools and standard Unix commands to chain things together. Writing heavy-duty code is a last resort during analysis or a last step before operations.
- Use a markup language to create reports for analysis and presentation output products. I like to think up lots of interesting things in a batch, implement them all, and let my build system create them when it next runs. This allows me to evaluate and think deeply about the observations at a later time when I’m not in idea mode.
- Use a Continuous Integration server to run your test harness often (daily or hourly).
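As one example of what automated data preparation from the tips above can look like, here is a minimal Python sketch that cleans a raw CSV file and writes a train/test split using a fixed seed, so the same raw file always yields the same split. The file layout, the cleaning rule (drop rows with empty fields) and the function names are assumptions made for illustration.

```python
import csv
import random


def split_rows(rows, test_fraction=0.2, seed=1):
    """Shuffle a copy of rows with a fixed seed and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def prepare(raw_path, train_path, test_path, test_fraction=0.2, seed=1):
    """Read a raw CSV with a header row, drop incomplete rows, write train and test files."""
    with open(raw_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        # Keep only non-blank rows in which every field has a value
        rows = [row for row in reader if row and all(field.strip() for field in row)]
    train, test = split_rows(rows, test_fraction, seed)
    for path, subset in ((train_path, train), (test_path, test)):
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(subset)
    return len(train), len(test)
```

Because nothing here is done by hand, the whole step can be a build target and rerun from scratch on a new machine.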
I have conditions in my test harness to check for the existence of output products and create them if they are missing. That means that each time I run the harness, only things that have changed or results that are missing are computed. This means I can let my imagination run wild and keep adding algorithms, data transforms and all manner of crazy ideas to the harness and some server somewhere will compute missing outputs on the next run for me to evaluate.
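A minimal Python sketch of this "compute only what is missing" behaviour is shown here. It is an illustration under my own assumptions (one JSON file per experiment, hypothetical experiment names), not the actual harness described in this post.

```python
import json
import os
import tempfile


def run_harness(experiments, output_dir):
    """Compute and store each experiment's result unless its output file already exists.

    experiments maps a name to a zero-argument function returning a JSON-serializable result.
    Returns the names that were computed on this run.
    """
    os.makedirs(output_dir, exist_ok=True)
    computed = []
    for name, experiment in sorted(experiments.items()):
        output_path = os.path.join(output_dir, f"{name}.json")
        if os.path.exists(output_path):
            continue  # result already on disk, skip the computation
        result = experiment()
        # Write under a temporary name first so an interrupted run leaves no partial result
        partial_path = output_path + ".partial"
        with open(partial_path, "w") as handle:
            json.dump(result, handle)
        os.replace(partial_path, output_path)
        computed.append(name)
    return computed


if __name__ == "__main__":
    experiments = {
        "baseline_mean": lambda: {"score": sum([1, 2, 3]) / 3},
        "baseline_max": lambda: {"score": max([1, 2, 3])},
    }
    with tempfile.TemporaryDirectory() as output_dir:
        print(run_harness(experiments, output_dir))  # first run computes both
        experiments["baseline_min"] = lambda: {"score": min([1, 2, 3])}
        print(run_harness(experiments, output_dir))  # second run computes only the new one
```

Deleting a result file (or the whole output directory) is all it takes to force that result to be recomputed on the next run.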
This disconnect I impose between idea generation and result evaluation really speeds up progress on a project.
If I find a bug in my harness, I delete the results and rebuild them all again with confidence on the next cycle.
In this post you have learned that the practice of machine learning is project work with source data, code, computations with intermediate work products and output work products. There are also likely all manner of things in between.
If you manage a machine learning project like a software project, you reap the benefits of reproducibility by default. You will also get the added benefits of speed and confidence, which will result in better outcomes.
If you would like to read further on these issues, I have listed the resources used in the research of this post below.
- Reproducibility Wikipedia page
- Ten Simple Rules for Reproducible Computational Research, Geir Kjetil Sandve, Anton Nekrutenko, James Taylor and Eivind Hovig, 2013
- Accessible Reproducible Research, Jill Mesirov, 2010
- An invitation to reproducible computational research, David Donoho, 2010
- Making scientific computations reproducible, Matthias Schwab, Martin Karrenbach and Jon Claerbout, 2000
- Reproducible Research with R and RStudio (affiliate link) by Christopher Gandrud is a book on this subject using R. I have not read this book at the time of writing, but it’s high on my to-read list.
Have you encountered the challenge of reproducible machine learning projects? Do you have ideas for other software development tools that could aid in this cause? Leave a comment and share your experiences.