Reproducible Machine Learning Results By Default

It is good practice to have reproducible outcomes in software projects. It might even be standard practice by now; I hope it is.

You can take any developer off the street and they should be able to follow your process to check out the code base from revision control and make a build of the software ready to use. Even better if you have a procedure for setting up an environment and for releasing the software to users/operational environments.

It is the tools and the process that make the outcome reproducible. In this post you will learn that it is just as important to make the outcomes of your machine learning projects reproducible, and that practitioners and academics in the field of machine learning struggle with this.

As a programmer and a developer you already have the tools and the process to leap ahead, if you have the discipline.

Reproducible Computational Research

Photo credit ZEISS Microscopy, some rights reserved

Reproducibility of Results in Computational Sciences

Reproducibility of experiments is one of the main principles of the scientific method. You write up what you did, but other scientists don’t have to take your word for it: they follow the same process and expect to get the same result.

Work in the computational sciences involves code running on computers that reads and writes data. Experiments that report results but do not clearly specify each of these elements are very unlikely to be easily reproducible. If the experiment cannot be reproduced, then what value is the work?

This is an open problem in computational sciences and is becoming ever more concerning as more fields rely on computational results of experiments. In this section we will review this open problem by looking at a few papers that consider the issue.

Ten Simple Rules for Reproducible Computational Research

This was an article in PLoS Computational Biology in 2013 by Geir Kjetil Sandve, Anton Nekrutenko, James Taylor and Eivind Hovig. In the paper, the authors list 10 simple rules that, if followed, are expected to result in more accessible (reproducible!?) computational research. The rules are listed below.

  • Rule 1: For Every Result, Keep Track of How It Was Produced
  • Rule 2: Avoid Manual Data Manipulation Steps
  • Rule 3: Archive the Exact Versions of All External Programs Used
  • Rule 4: Version Control All Custom Scripts
  • Rule 5: Record All Intermediate Results, When Possible in Standardized Formats
  • Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds
  • Rule 7: Always Store Raw Data behind Plots
  • Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
  • Rule 9: Connect Textual Statements to Underlying Results
  • Rule 10: Provide Public Access to Scripts, Runs, and Results

The authors are commenting from the field of computational biology. Nevertheless, I would argue the rules do not go far enough. I find them descriptive and I would be a lot more prescriptive.

For example, with rule 2 “Avoid Manual Data Manipulation Steps”, I would argue that all data manipulation be automated. For rule 4 “Version Control All Custom Scripts”, I would argue that the entire automated process to create work product be in revision control.

If you are a developer familiar with professional process, your mind should be buzzing with how dependency management, build systems, markup systems for documents that can execute embedded code, and continuous integration tools could bring some real rigor.
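As a small illustration of rules 1 and 6, here is a minimal Python sketch that stores a result together with the seed, interpreter version and input file checksums that produced it. The function names (`file_sha256`, `run_experiment`, `provenance_record`) and the toy experiment are hypothetical and are not taken from the paper.

```python
import hashlib
import json
import platform
import random
import sys
from datetime import datetime, timezone


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_experiment(seed, n_samples=1000):
    """Toy stand-in for an analysis with randomness: the mean of uniform draws."""
    rng = random.Random(seed)  # local generator, so the seed alone fixes the result
    return sum(rng.random() for _ in range(n_samples)) / n_samples


def provenance_record(seed, result, input_paths=()):
    """Bundle a result with the details needed to reproduce it."""
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": sys.version,
        "platform": platform.platform(),
        "seed": seed,
        "inputs": {path: file_sha256(path) for path in input_paths},
        "result": result,
    }


if __name__ == "__main__":
    seed = 42  # hypothetical seed, recorded alongside the result
    record = provenance_record(seed, run_experiment(seed))
    print(json.dumps(record, indent=2))
```

Writing a record like this next to every output file is one way to keep track of how each result was produced; the revision control identifier of the code could be added as another field.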

Accessible Reproducible Research

This was an article by Jill Mesirov published in Science magazine in 2010. In this short article the author introduces terminology for systems that facilitate reproducible computational research by scientists, specifically:

  • Reproducible Research System (RRS): Comprised of a Reproducible Research Environment and a Reproducible Research Publisher.
  • Reproducible Research Environment (RRE): The computational tools, management of data, analyses and results and the ability to package them together for redistribution.
  • Reproducible Research Publisher (RRP): The document preparation system which links to the Reproducible Research Environment and provides the ability to embed analyses and results.

The article describes a prototype system, called the GenePattern-Word RRS, that was developed for gene expression analysis experiments.

Again, looking through the eyes of software development and the tools available, the RRE sounds like revision control plus a build system with dependency management plus a continuous integration server. The RRP sounds like a markup system with linking and a build process.

An invitation to reproducible computational research

This was a paper written by David Donoho in Biostatistics, 2010. This is a great paper and I really agree with the points it makes. For example:

“Computational reproducibility is not an afterthought — it is something that must be designed into a project from the beginning.”

I could not articulate it more clearly myself. In the paper, the author lists the benefits of building reproducibility into computational research. For the researcher the benefits are:

  • Improved work and work habits.
  • Improved teamwork.
  • Greater impact (less inadvertent competition and more acknowledgement).
  • Greater continuity and cumulative impact.

The benefits the author lists for the taxpayer that funds the research are:

  • Stewardship of public goods.
  • Public access to public goods.

I made some of the same arguments to colleagues off the cuff and it is fantastic to be able to point to this paper that does a much better job of making a case.

Making scientific computations reproducible

This paper was published in Computing in Science & Engineering in 2000 by Matthias Schwab, Martin Karrenbach and Jon Claerbout. The opening sentences of this paper are terrific:

“Commonly research involving scientific computations are reproducible in principle but not in practice. The published documents are merely the advertisement of scholarship whereas the computer programs, input data, parameter values, etc. embody the scholarship itself. Consequently authors are usually unable to reproduce their own work after a few months or years.”

The paper describes the standardization of computational experiments through the adoption of GNU make, standard project structure and the distribution of experimental project files on the web. These practices were standardized in the Stanford Exploration Project (SEP).

The motivating problem was the programming effort lost whenever a graduate student left the group, because those who remained were unable to reproduce and build upon the student’s experiments.

The ideas of a standard project structure and build system seem so natural to a developer.
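To make the build system idea concrete, here is a minimal Python sketch of the rule a tool like make applies: a target is rebuilt only when it is missing or older than one of its sources. This is an illustration of the general technique, not the SEP setup described in the paper, and the names `is_stale`, `build` and `concatenate` are hypothetical.

```python
import os


def is_stale(target, sources):
    """Return True if the target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(source) > target_mtime for source in sources)


def build(target, sources, recipe):
    """Run recipe(target, sources) only when the target is out of date.

    Returns True if the recipe ran, False if the target was already up to date.
    """
    if not is_stale(target, sources):
        return False
    recipe(target, sources)
    return True


def concatenate(target, sources):
    """Example recipe: join the source files into the target."""
    with open(target, "w") as out:
        for source in sources:
            with open(source) as handle:
                out.write(handle.read())
```

A call such as `build("out.txt", ["raw.txt"], concatenate)` does the work once and is then a no-op until `raw.txt` changes. A real build tool adds dependency chains between targets on top of this check.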

Reproducibility by Default in Machine Learning

The key point I want to make is this: when starting in the field of machine learning, do not disregard the excellent practices that have become standard in software development. Use them and build upon them.

I have a blueprint I use for machine learning projects and it’s improving with each project I complete. I hope to share it in the future. Watch this space. Until then, here are some tips for reusing software tools to make reproducibility a default for applied machine learning and machine learning projects in general:

  • Use a build system and have all results produced automatically by build targets. If it’s not automated, it’s not part of the project. Have an idea for a graph or an analysis? Automate its generation.
  • Automate all data selection, preprocessing and transformations. I even put in wget calls for acquiring data files when working on machine learning competitions. I want to be able to get up and running from scratch on new workstations and fast servers.
  • Use revision control and tag milestones.
  • Strongly consider checking in dependencies, or at least linking to them.
  • Avoid writing code. Write thin scripts and use standard tools and standard Unix commands to chain things together. Writing heavy-duty code is a last resort during analysis or a last step before operations.
  • Use a markup system to create reports for analysis and presentation output products. I like to think up lots of interesting things in a batch, implement them all and let my build system create them when it next runs. This allows me to evaluate and think deeply about the observations at a later time when I’m not in idea mode.
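As a rough illustration of the second and last tips, here is a minimal Python sketch of a thin script that performs one preprocessing step with no manual editing and writes a small report in a markup format (Markdown here). The file layout, column name and function names are hypothetical; a real project would have a build target call a script like this.

```python
import csv
import statistics
from pathlib import Path


def standardize_column(raw_path, processed_path, column):
    """Rescale one numeric CSV column to zero mean and unit variance.

    The raw file is never modified; the result goes to processed_path.
    Returns a summary dict that can feed a report.
    """
    with open(raw_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    values = [float(row[column]) for row in rows]
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        raise ValueError(f"column {column!r} is constant and cannot be standardized")
    for row, value in zip(rows, values):
        row[column] = f"{(value - mean) / stdev:.6f}"
    with open(processed_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return {"rows": len(rows), "mean": mean, "stdev": stdev}


def write_report(report_path, column, summary):
    """Write the preprocessing summary as a small Markdown report."""
    lines = [
        f"# Preprocessing report: {column}",
        "",
        f"- Rows processed: {summary['rows']}",
        f"- Mean before scaling: {summary['mean']:.4f}",
        f"- Standard deviation before scaling: {summary['stdev']:.4f}",
    ]
    Path(report_path).write_text("\n".join(lines) + "\n")
```

Because the report is generated from the same summary the transformation returns, the numbers in the text cannot drift from the data they describe.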

Pro Tip

Use a Continuous Integration server to run your test harness often (daily or hourly).

Continuous Integration

Photo credit regocasasnovas, some rights reserved

I have conditions in my test harness to check for the existence of output products and create them if they are missing. That means that each time I run the harness, only things that have changed or results that are missing are computed. This means I can let my imagination run wild and keep adding algorithms, data transforms and all manner of crazy ideas to the harness and some server somewhere will compute missing outputs on the next run for me to evaluate.

This disconnect I impose between idea generation and result evaluation really speeds up progress on a project.

If I find a bug in my harness, I delete the results and rebuild them all again with confidence on the next cycle.
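The idea of a harness that only fills in missing results can be sketched in a few lines of Python. This is not my actual harness, just a minimal illustration under assumed conventions: each experiment is a named zero-argument function, and each result is stored as one JSON file. The name `run_harness` and the file layout are hypothetical.

```python
import json
from pathlib import Path


def run_harness(experiments, results_dir):
    """Compute and store every experiment whose result file is missing.

    experiments maps a name to a zero-argument callable that returns
    JSON-serializable data. Returns the names computed on this run.
    """
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    computed = []
    for name, experiment in sorted(experiments.items()):
        output = results_dir / f"{name}.json"
        if output.exists():
            continue  # result already on disk, so skip the computation
        result = experiment()
        # Write to a temporary file first so an interrupted run
        # never leaves a partial result that looks complete.
        temporary = results_dir / f"{name}.json.tmp"
        temporary.write_text(json.dumps(result, indent=2))
        temporary.replace(output)
        computed.append(name)
    return computed
```

Adding a new idea means adding one entry to the `experiments` mapping; the next scheduled run computes only that entry. Deleting the results directory forces a full rebuild.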


In this post you have learned that the practice of machine learning is project work with source data, code, computations with intermediate work products and output work products. There are also likely all manner of things in between.

If you manage a machine learning project like a software project, you reap the benefits of reproducibility by default. You will also get the added benefits of speed and confidence, which will result in better outcomes.


If you would like to read further on these issues, I have listed the resources used in the research of this post below.

Have you encountered the challenge of reproducible machine learning projects? Do you have ideas for other software development tools that could aid in this cause? Leave a comment and share your experiences.

6 Responses to Reproducible Machine Learning Results By Default

  1. Matt, January 11, 2014 at 6:45 am

    Nice, timely, and comprehensive post.

    What sort of setup do you use for your continuous integration? I’ve been checking this out myself, but have been baffled by the many options out there (many of which are far too heavyweight for my needs — a one man analysis operation in a small bio lab).

  2. jasonb, January 11, 2014 at 7:43 am

    Hey Matt, I’m glad you liked the post.

    I keep my setup as simple as possible. I do a lot of R and use Makefile targets. I use Jenkins for my CI and call make targets every hour (depending on the project). Super simple to set up and configure.

  3. Jesús Martínez, February 15, 2018 at 11:44 am

    I think that having a standard, predictable process is worth gold and diamonds! When I find my progress has plummeted, it is almost always due to the lack of a good process that allows me to iterate faster. Recently I performed really badly in a Kaggle competition as a consequence of being all over the place and not automating most (or all) of the steps I took. Often I found myself redoing stuff… So, yep, reproducibility and automation are critical to excelling at anything!

    Do you have a process that produces good to great results every time you apply it? I’d love to hear about it!


  4. Heloá, July 25, 2020 at 6:58 pm

    Hi Jason, thanks for the post!

    Have you shared the blueprint mentioned in the section before Pro Tip?
