7  Bare-Bones Machine Learning Is Insufficient

Even though machine learning has great merits when it comes to prediction, it clashes with other scientific goals like control, explanations, and reasoning. This chapter highlights more concretely and practically the insufficiencies of bare-bones machine learning as a scientific methodology.

Everyone except Krarah and her fellow scientists was happy about the tornado prediction model. While they loved not being disrupted by tornadoes, they felt that the models were alien creatures they couldn’t communicate with. They feared that Raven Science would reach a dead end of understanding. Over time, the tornado system acted up. False positives and false negatives appeared more often. Something was going on, and the raven scientists felt that they had to patch their system for these insufficiencies.

7.1 Low empirical error on a test set is not enough

As we highlighted in Chapter 5, machine learning has a transparent notion of what makes a good model: a good model has low error on an unseen test set and in consequence a low generalization error on unseen data. However, this notion is empty if it is not equipped with an underlying theory of generalization. When can you infer that the model performance generalizes from just knowing the error on a test set? Are there guarantees that certain learning algorithms must yield models with low expected error?

In Chapter 8, we present statistical learning theory as the main contender for a theory of generalization in machine learning.

However, the standard notion of generalization only reflects the predictive capacities of models on data from the same distribution. Extrapolation to completely new scenarios or data from different distributions is not encompassed in the standard notion of generalization. It therefore has limited bite, especially in science, where the goal is to generalize from models to insights about the phenomenon.
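
The gap between same-distribution generalization and extrapolation can be sketched with a small simulation. Everything in it is an assumption made for illustration rather than an example from this book: the sine-shaped target, the input ranges, the noise level, and the cubic polynomial standing in for a learned model.

```python
import numpy as np


def make_data(rng, n, low, high):
    """Sample inputs uniformly from [low, high]; the target is sin(x) plus noise."""
    x = rng.uniform(low, high, size=n)
    y = np.sin(x) + rng.normal(0.0, 0.1, size=n)
    return x, y


def mse(coefs, x, y):
    """Mean squared error of the fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))


rng = np.random.default_rng(0)
x_train, y_train = make_data(rng, 500, 0.0, 3.0)  # training distribution
x_test, y_test = make_data(rng, 500, 0.0, 3.0)    # test set, same distribution
x_shift, y_shift = make_data(rng, 500, 3.0, 6.0)  # inputs from a new range

# Hypothetical stand-in for a machine learning model: a cubic polynomial fit.
coefs = np.polyfit(x_train, y_train, deg=3)

in_dist_error = mse(coefs, x_test, y_test)
shifted_error = mse(coefs, x_shift, y_shift)
print(f"test error, same distribution:    {in_dist_error:.3f}")
print(f"test error, shifted distribution: {shifted_error:.3f}")
```

In this toy setup, the low error on the held-out test set says little about the error on the shifted inputs, because the test set only probes the range the model was trained on.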

In Chapter 8, we therefore also discuss broader conceptions of generalization. Moreover, our chapter on robustness (Chapter 12) shows how to define generalization under distribution shifts.

7.2 Domain knowledge is overlooked

Science builds on prior knowledge to produce knowledge. Bare-bones machine learning disrupts both ends of this knowledge flow: You expect the model to figure out relations from the data, and you get out an opaque prediction model. The focus shifts from epistemology to utility, from science to engineering. You tinker with the model to improve a metric. If machine learning is fully embraced in its bare-bones form, a lot of domain knowledge is left unused, and there is little new knowledge to gain. However, coherence with background knowledge makes scientific models more valuable. It is smart to incorporate additional information.¹ Compare that to statistical modeling or differential equations: These modeling approaches encourage, even require, the formulation of prior knowledge in terms of distribution assumptions and equations. And you get interpretable estimates back that help you better understand the phenomenon you study.

In Chapter 9, we argue that you can take a domain-knowledge-driven approach with machine learning as well. Even better: Thanks to the focus on predictive performance, you can evaluate your domain knowledge in terms of predictive performance.

7.3 Lack of interpretability and explanations

Interpretability enables you to justify models and reason about phenomena. But machine learning models are generally not inherently interpretable, because they may have a complex functional form that is adjusted to data – they are black boxes. This makes it difficult to understand how the model behaves and what it relies on:

  • What were the most influential features?
  • Which features did the model learn?
  • Why did the model make a certain prediction?
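
The first question can be approached in a model-agnostic way, for example with permutation feature importance: shuffle one feature at a time and measure how much the test error increases. The following is a minimal sketch under stated assumptions. The synthetic data, the least-squares model standing in for a black box, and the function name `permutation_importance` are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
# Assumed data-generating process: feature 0 matters a lot, feature 1 a
# little, and feature 2 not at all.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0.0, 0.1, size=n)

X_train, y_train = X[:700], y[:700]
X_test, y_test = X[700:], y[700:]

# Stand-in for a black-box model: ordinary least squares.
weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def predict(features):
    return features @ weights


def permutation_importance(predict_fn, features, target, rng, n_repeats=10):
    """Mean increase in squared error when a single feature is shuffled."""
    baseline = np.mean((predict_fn(features) - target) ** 2)
    importances = np.zeros(features.shape[1])
    for j in range(features.shape[1]):
        increases = []
        for _ in range(n_repeats):
            permuted = features.copy()
            permuted[:, j] = rng.permutation(permuted[:, j])
            error = np.mean((predict_fn(permuted) - target) ** 2)
            increases.append(error - baseline)
        importances[j] = np.mean(increases)
    return importances


importances = permutation_importance(predict, X_test, y_test, rng)
print(importances)
```

The function only calls `predict_fn`, so the same procedure applies to any model that maps features to predictions. It describes what the model relies on, which is not necessarily what drives the phenomenon itself.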

We show in Chapter 10 how to use interpretability techniques to improve models and gain insights from them.

7.4 Predictive performance is at odds with causality

The world we live and act in is a gigantic causal mechanism. Bare-bones machine learning models, however, ignore the causal dimension of things. All they care about is making better predictions, and they rely on whatever statistical dependency helps them pursue this goal. Do you want to know your COVID risk? Alright, the machine learning model needs your shoe size, salary, and the football ranking of your local club. But scientists want to distinguish between causes, effects, and spuriously correlated features. Physicians want to know why certain people have COVID and others don’t. They want to develop drugs and prescribe treatments that make people healthy. To answer these questions, causality must be taken into account.
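
A small simulation can illustrate how a feature may be predictive without being causal. The setup is assumed purely for illustration: a hidden common cause `z` drives both the feature `x` and the outcome `y`, while `x` itself has no effect on `y`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Observational data: z is a hidden common cause of both x and y.
z = rng.normal(size=n)
x_obs = z + rng.normal(0.0, 0.5, size=n)        # feature, driven by z
y_obs = 2.0 * z + rng.normal(0.0, 0.5, size=n)  # outcome, driven by z only

# A purely predictive fit happily exploits the dependency between x and y.
slope_obs = np.polyfit(x_obs, y_obs, deg=1)[0]

# Intervention: x is set from outside, independently of z.
# The mechanism that generates y is unchanged and still ignores x.
x_do = rng.normal(size=n)
y_do = 2.0 * z + rng.normal(0.0, 0.5, size=n)
slope_do = np.polyfit(x_do, y_do, deg=1)[0]

print(f"slope in observational data: {slope_obs:.2f}")
print(f"slope under intervention:    {slope_do:.2f}")
```

In the observational data, `x` is a useful predictor of `y`. Once `x` is manipulated directly, the association vanishes, which is why a treatment based on the predictive model alone would fail in this scenario.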

We’ll show in Chapter 11 how the combination of causal inference and machine learning allows you to find causal dependencies, learn causal models, and estimate causal effects.

7.5 Lack of robustness

Bare-bones machine learning provides predictive models, but only for a static environment. This means machine learning models are only accurate if:

  • they are applied to data that is similar to the training data, and
  • the data structure remains intact during deployment.

In the wild, phenomena are complex. Data changes constantly, external factors enter, measurement devices produce errors, time moves, and observation targets shift. If you want to use machine learning models in your scientific pipeline, you have to make them robust tools under real-world conditions.

We show in Chapter 12 what types of robustness you may be interested in and discuss strategies such as data augmentation that help you robustify your model.

7.6 No uncertainty quantification

Proper uncertainty quantification is crucial when high-stakes decisions are made in the real world. However, bare-bones machine learning models only provide point predictions. Some models, such as Bayesian models, come with built-in uncertainty quantification, but limiting the model class to these models may result in a loss of predictive performance. If you choose the best-performing model, perhaps a random forest, you might get one without built-in uncertainty quantification. But when you make decisions based on predictions, you need to quantify how certain these predictions are. Even when model outputs look like probabilities because they are between zero and one (yes, we’re talking to you, softmaxers), they often cannot be interpreted as ‘real’ probabilities when models aren’t well calibrated.
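
One way to check whether scores between zero and one behave like probabilities is to compare them with observed frequencies. The sketch below computes a binned calibration error for a binary outcome. The simulated scores, the sharpening factor of 3 that mimics an overconfident model, and the function name `expected_calibration_error` are assumptions made for illustration.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean predicted probability and the
    observed frequency of positives, over equal-width probability bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(probs, edges[1:-1])  # bin index in 0..n_bins-1
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap  # weight by the share of samples in the bin
    return float(ece)


rng = np.random.default_rng(0)
n = 20_000
true_probs = rng.uniform(0.01, 0.99, size=n)
labels = (rng.uniform(size=n) < true_probs).astype(float)

# A calibrated model reports the true probabilities.
calibrated = true_probs

# A hypothetical overconfident model pushes its scores toward 0 and 1.
logits = np.log(true_probs / (1.0 - true_probs))
overconfident = 1.0 / (1.0 + np.exp(-3.0 * logits))

ece_calibrated = expected_calibration_error(calibrated, labels)
ece_overconfident = expected_calibration_error(overconfident, labels)
print(f"calibration error, calibrated scores:    {ece_calibrated:.3f}")
print(f"calibration error, overconfident scores: {ece_overconfident:.3f}")
```

Both sets of scores rank the samples identically, so ranking-based performance metrics cannot tell them apart. Only the calibrated scores can be read as probabilities.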

We discuss the philosophy behind uncertainty in machine learning, calibration, and model-agnostic methods for uncertainty quantification in Chapter 13.

7.7 Lack of standards for reproducibility

There are many standards for how to document code, data, and models for others to reproduce results. Reproducibility is key to:

  • being transparent about what you did to achieve high performance,
  • allowing others to test your work and gain trust in it,
  • reliably building on your results, and
  • making your code reusable for potential applications.

Unfortunately, many papers that use machine learning in science do not allow for reproducibility (McDermott et al. 2021). Often, important information is missing, such as the weight initialization, the number of training epochs, the hyperparameters, or the random seeds. Also, preprocessing steps might not be listed, and the code may be poorly documented or poorly implemented.
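
As a rough illustration of the kind of information that is often missing, the sketch below collects the seed, weight initialization scale, number of epochs, and learning rate in a single configuration that can be stored alongside the results. The toy gradient-descent training loop and all names and values in it are hypothetical and not a recommended standard.

```python
import json

import numpy as np


def run_experiment(config):
    """Train a toy linear model; all randomness flows from config['seed']."""
    rng = np.random.default_rng(config["seed"])
    X = rng.normal(size=(config["n_samples"], 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, size=config["n_samples"])

    # Weight initialization is part of the recorded setup.
    weights = rng.normal(0.0, config["init_scale"], size=3)
    for _ in range(config["epochs"]):
        gradient = 2.0 * X.T @ (X @ weights - y) / len(y)
        weights -= config["learning_rate"] * gradient
    return weights


config = {
    "seed": 42,
    "n_samples": 200,
    "init_scale": 0.1,
    "epochs": 100,
    "learning_rate": 0.05,
}

weights_a = run_experiment(config)
weights_b = run_experiment(config)  # same config, hence the same result

# Serialize the configuration so it can be published with the results.
config_record = json.dumps(config, sort_keys=True)
print(config_record)
print(np.array_equal(weights_a, weights_b))
```

Because the configuration fully determines the run in this sketch, publishing `config_record` together with the code is enough for others to repeat it. Real pipelines involve further sources of variation, such as library versions and hardware, that a seed alone does not control.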

We discuss standards for reproducibility of machine learning in science in Chapter 14.

7.8 Lack of standards for reporting model results

In science, it is pivotal to be transparent about why a scientific model is a good fit and what data it explains. In machine learning, researchers have yet to establish what information is scientifically relevant. Is predictive performance on the test set enough? There might be other interesting information such as performance on alternative metrics or feature importances.

We discuss different standards for reporting model results in Chapter 15. We provide a high-level view of how to group these standards and some best practices for publication.


  1. Usually, scientific theories are informed by a lot of historical data; thus, in a certain way, incorporating background knowledge increases the data support.