14  Reproducibility

Feedback wanted

This book is work in progress. We are happy to receive your feedback via science-book@christophmolnar.com

The beauty of code: Once written, you can use it as often as you like. If set up correctly, the same code applied to the same data will produce the same results. Ideally, training your machine learning model works the same: at the click of a button, you can run your code again and get the exact same machine learning model. This is known as computational reproducibility, and it comes with many advantages.

But making your code reproducible is much harder than it seems. Computational reproducibility is inherently unstable due to ever-changing computing environments and the unique challenges of machine learning.

Reproducibility versus replicability

You may have heard about the replicability crisis in the social sciences (Collaboration 2015): Many research findings couldn’t be replicated. With reproducibility, we focus on a narrower goal: getting the same results (e.g. the same model) using the same code on the same data. For an overview of definitions, see the Turing Way Book (Arnold et al. 2019).

Reproducibility is not binary but a spectrum. This chapter will help you move up that spectrum and highlights the unique challenges ML poses to reproducibility. Let’s dive in.

The hurricane forecasting system worked like a charm. Until it didn’t. One hot summer day, the server had a meltdown. And with it vanished the model. Fortunately, the researchers still had the code on an old laptop. All they had to do was retrain and deploy the model. It would take an afternoon, Product Raven estimated. But easier said than done. They had trained the model years ago, and the lead developer had retired, leaving them with a mess. A strange folder structure, mysterious file names, and unclear instructions. In what order should they run the scripts? Why does the training produce different models, sometimes even getting stuck in a local optimum? It was a stressful time for the researchers.

14.1 Making any coding project reproducible

Reproducibility is a challenge for any coding project even without machine learning. This chapter focuses more on the ML-specific challenges, but it wouldn’t be complete if we didn’t mention some general tips and tricks for reproducibility (Seibold 2024):

  • Create a clear folder structure.
  • Use good names for files, folders, and functions.
  • Document your project, including a README, metadata, and code documentation.
  • Use version control software such as git to track changes to code and text.
  • Stabilize the computing environment and software using environment management tools such as conda and virtualization tools such as Docker.
  • Automate computations, e.g. with a Makefile or a workflow tool, so that all steps are reproducible with code.
  • Publish your work.
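To make the “stabilize the computing environment” tip concrete: at a minimum, record which Python and package versions produced your results, and publish that record with the project. A minimal sketch using only the standard library (the snapshot format below is our own illustration, not a standard):

```python
import platform
import sys
from importlib import metadata

def environment_snapshot():
    """Collect Python, platform, and package versions to document the environment."""
    snapshot = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    # Record the version of every installed package; in practice you would
    # write this dictionary to a file and commit it alongside your code.
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            snapshot[name] = dist.version
    return snapshot

snapshot = environment_snapshot()
print(snapshot["python"], snapshot["platform"])
```

Dedicated tools like conda (`environment.yml`) or pip (`requirements.txt` with pinned versions) do this more thoroughly, but even a simple snapshot like this beats having no record at all.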

The more tips you follow, the more reproducible and satisfying your project will be. However, it may require learning new tools like git and adopting new habits like documenting your code – this is an initial investment of time and effort, but it will pay off in the long run. Your future self will thank you.

Now let’s move on to the ML-specific challenges.

14.2 Handling non-deterministic code

Running the same code twice can produce different results due to randomness. Machine learning involves a lot of randomness: random weight initialization, stochastic gradient descent, random forests, random data splitting. It’s not a bug, it’s a feature: Randomness is an inherent driver of learning. For example, randomly splitting data into training and testing simulates “drawing from the same distribution” which is a crucial element of generalization (see Chapter 8) and stochastic gradient descent implicitly regularizes the model through its random nature (Bottou 2010).

To introduce randomness into the deterministic world of programming logic, machine learning software relies on “random number generators”. These generators are not truly random – they are based on pseudorandom processes that can be controlled by setting an initial random seed. A random seed initializes the random number generator, and all subsequent “random” numbers are deterministic.

This random seed is also your key to reproducibility for code that involves randomness. Without a random seed, training a machine learning model will likely produce a different model every time. By setting a random seed, you make the random number generation reproducible. There are exceptions: when running multiple processes in parallel, setting a random seed may still not guarantee deterministic results if the order of execution is inconsistent.
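A minimal sketch of seeding in Python, assuming NumPy is available (deep learning frameworks have their own seeding functions, e.g. `torch.manual_seed` in PyTorch, which you would also need to call):

```python
import random
import numpy as np

SEED = 42  # any fixed integer works; 42 is just a convention

# Seed every random number generator your code touches.
random.seed(SEED)
np.random.seed(SEED)

first_run = np.random.rand(3)

# Re-seeding resets the generator, so the "random" numbers repeat exactly.
np.random.seed(SEED)
second_run = np.random.rand(3)

print(np.array_equal(first_run, second_run))  # True
```

The same principle applies to data splitting: functions like scikit-learn’s `train_test_split` accept a `random_state` argument for exactly this reason.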

Besides your computer’s random number generator, there are other reasons why running the same code twice may produce different results, even if you have followed all the tips (e.g., making sure you are using the same software):

  • Operations on GPUs can be non-deterministic, even simple ones like a sum (see this StackOverflow question).
  • Dependence on external systems that change over time (e.g. API calls).
  • Floating point arithmetic can vary between different hardware and software platforms.
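The floating point issue is easy to demonstrate: addition of floats is not associative, so any computation whose order of operations varies between runs, such as a parallel sum on a GPU, can produce slightly different results:

```python
# Floating point addition is not associative: the grouping of operations
# changes the result in the last bits, even for three small numbers.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

print(a == b)  # False
```

On a GPU, the order in which partial sums are combined depends on thread scheduling, so the grouping, and therefore the last bits of the result, can differ from run to run.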

14.3 Jupyter Notebooks

Jupyter notebooks are a blessing and a curse to research.

Jupyter Notebook

The term refers to both the software and the document, just as “Excel” can refer to both the program and a single spreadsheet. A notebook – the software – is essentially a web application where you can manage multiple notebooks (the documents), create new ones, delete, edit, and run them. A notebook – the document – is a collection of “cells”. A cell can contain markdown, code, or plain text. Markdown cells are rendered so that you can structure your notebook like a document, with titles, bold text, and other formatting. You can execute code cells, and the results are embedded in the document, whether it’s a plot, code warnings, or a printout.

Notebooks let you experiment, quickly prototype ideas, and explore data; they encourage documentation; they are great for reporting results. But they make reproducibility difficult. In his provocative talk “I Don’t Like Notebooks”, Joel Grus specifically criticized “hidden states” that make reproducibility difficult:

  • You can run the cells in any order.
  • You can delete cells, but the variables that were created remain in the workspace.
  • You can change a cell’s content without re-running it, so the state in memory reflects the old version of the code.

Imagine you’re working on a machine learning project in a Jupyter notebook. You decide to standardize features before training but forget to run the updated cell. You save the model, but now the code doesn’t match the model, and re-running the notebook produces a different model.

In another scenario, you accidentally delete the cell that generates a weight vector shortly after writing the code. But because you executed the cell, the vector is in memory. You finish the project, check your latest changes into version control, and call it a day. The next person to work on the project gets an error about the missing weight vector.

There are even more problems with notebooks: While you can put them under version control, they are quite verbose because they store all cell output alongside the code, which makes diffing code changes painful.

It is possible to make a research project reproducible, even if it relies on notebooks. But you have to be very strict about running them linearly from the first cell to the last, and always with a fresh Python session.
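One way to enforce that discipline is to check, before committing, that a notebook’s saved execution counts run strictly 1, 2, 3, … from top to bottom. A sketch that inspects the notebook’s JSON structure (the `notebook` dictionary below is a made-up minimal example, not a full `.ipynb` file):

```python
def cells_ran_in_order(notebook_json):
    """Check whether the code cells of a notebook were executed top to bottom.

    A notebook saved after a clean, linear run on a fresh session has
    execution counts 1, 2, 3, ... in cell order.
    """
    counts = [
        cell.get("execution_count")
        for cell in notebook_json["cells"]
        if cell.get("cell_type") == "code"
    ]
    return counts == list(range(1, len(counts) + 1))

# A hypothetical notebook whose cells were run out of order:
notebook = {
    "cells": [
        {"cell_type": "code", "execution_count": 2, "source": "x = 1"},
        {"cell_type": "markdown", "source": "# Notes"},
        {"cell_type": "code", "execution_count": 1, "source": "print(x)"},
    ]
}

print(cells_ran_in_order(notebook))  # False
```

For real notebooks, executing them headlessly from the command line (e.g. with `jupyter nbconvert --execute` or papermill) guarantees a fresh session and linear execution.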

14.4 APIs and proprietary software

Training a machine learning algorithm on someone else’s hardware can make your life a lot easier: You don’t have to worry about hardware and software installations. Hundreds of machine-learning-as-a-service platforms allow you to upload your data and train a model. The problem: lack of reproducibility. The company behind the ML software can change its software without informing you. And even if they don’t: You may not have enough control over the workflow to make it fully reproducible for others. It may not be transparent what algorithms, settings, and software versions the platform uses. Or the company may simply go out of business and you lose access to your training setup.

This problem has gotten much worse with generative AI, especially with large language models like ChatGPT. Whether you are studying ChatGPT for bias or using it to label data: The company behind it is constantly updating the models, and once a model is retired, no one can reproduce your research.

14.5 Hardware-specific challenges

Even if you control the server or use your laptop and have followed all the reproducibility recommendations, your project may still not be 100% reproducible, at least when it comes to running your code on different hardware. The problem is that training can’t be completely abstracted away from the hardware. Increasingly, machine learning, especially deep learning, relies on specialized hardware, such as NVIDIA GPUs and Google TPUs. The more hardware-specific your setup, the harder it is for others to reproduce your models.

14.6 Inaccessible data

Data can be inaccessible for a number of reasons: it may be too large to share, there may be privacy concerns (such as with patient data), or it may be proprietary. This makes reproducibility by others difficult. But even making a project reproducible internally can be hard, especially with large datasets. A possible “incremental” solution is to share a subset of the data or simulated data.
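For instance, if you can publish summary statistics of the real data, you can ship a simulated stand-in drawn from those statistics so that others can at least run the full pipeline end to end. A sketch with made-up numbers (assuming NumPy, and a multivariate normal assumption that may not fit your actual data):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Pretend these were estimated from the real, non-shareable dataset
# (the numbers here are invented for illustration).
means = np.array([50.0, 120.0])
cov = np.array([[25.0, 10.0],
                [10.0, 100.0]])

# Draw a synthetic dataset with the same means and covariance structure.
synthetic = rng.multivariate_normal(means, cov, size=1000)

print(synthetic.shape)  # (1000, 2)
```

The synthetic data won’t reproduce your exact results, but it lets others verify that the code runs and that the pipeline behaves sensibly.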

Reproducible versus Reusable

Reproducibility and reusability are not the same thing: A research result can be reproducible without being reusable. You may be able to reproduce someone else’s project but if, for example, many things are hard-coded, it will take a lot of effort to adapt it to your use case.

14.7 Reproducibility and other issues

Reproducibility is more than just ensuring that others, including your future self, can reproduce your work. It relates to many other aspects we cover in this book:

  • Reporting (see Chapter 15) and reproducibility both rely on documentation and go well together: If you have good reporting, you have also made your project more reproducible.
  • You can think of reproducibility as a form of robustness (see Chapter 12): A reproducible model is robust to the computing environment and to the random number generation.
  • A lack of reproducibility introduces uncertainty (see Chapter 13). Let’s say you forgot to set a random seed. This means that the next time the model is trained, it will be subject to uncertainty due to random splitting of the data and other operations based on a random number generator.