7  Bare-Bones Machine Learning Is Insufficient

ML delivers predictive models. Great! But as we discussed philosophically in Chapter 5, scientific modeling has other goals as well: control, explanations, and reasoning. Let’s discuss the insufficiencies of bare-bones ML more concretely and practically.

Everyone except Krarah and her fellow scientists was happy about the tornado prediction model. While they loved not being disrupted by tornadoes, they felt that the model was an alien creature they couldn’t communicate with. They feared that Raven Science would reach a dead end of understanding. Over time, the tornado system acted up. False positives and false negatives appeared more often. Something was going on, and the raven scientists felt that they had to patch their system to address these insufficiencies.

7.1 Domain knowledge is overlooked

Science builds on prior knowledge to produce new knowledge. Bare-bones ML disrupts both ends of this knowledge flow: We expect the model to figure out relations from the data, and what we get back is an opaque prediction model. The focus shifts from epistemology to utility, from science to engineering. We tinker with the model to improve a metric. If ML is fully embraced in its bare-bones form, a lot of domain knowledge is left unused, and we gain little new knowledge. However, coherence with background knowledge makes scientific models more valuable, so it’s smart to incorporate additional information1. Compare that to statistical modeling or differential equations: These modeling approaches encourage, even require, that we formulate our prior knowledge in terms of distribution assumptions and equations. In return, we get interpretable estimates that help us better understand the phenomenon under study.

However, we argue that you can take a domain-knowledge-driven approach with ML as well (see Chapter 9). Even better: Thanks to ML’s focus on prediction, we get to evaluate our domain knowledge in terms of predictive performance.
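
To make this concrete, here is a minimal sketch of one way to encode domain knowledge in an ML model and to evaluate it by its predictive performance: a monotonicity constraint in a gradient boosting model. The library (scikit-learn), the synthetic data, and the specific constraint are illustrative assumptions, not a prescription from this chapter.

```python
# Minimal sketch: encode domain knowledge ("the outcome can only increase
# with feature 0") as a monotonicity constraint. Data and constraint are
# placeholders for illustration.
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=3, random_state=0)

# +1 enforces a monotonically increasing effect, 0 leaves a feature unconstrained.
constrained = HistGradientBoostingRegressor(monotonic_cst=[1, 0, 0], random_state=0)
unconstrained = HistGradientBoostingRegressor(random_state=0)

# Because ML focuses on prediction, we can check whether the encoded
# domain knowledge helps or hurts predictive performance.
print("constrained:  ", cross_val_score(constrained, X, y, cv=5).mean())
print("unconstrained:", cross_val_score(unconstrained, X, y, cv=5).mean())
```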

7.2 Lack of interpretability and explanations

Interpretability enables us to justify models and reason about phenomena. But ML models are generally not inherently interpretable, because they may have a complex functional form that is adjusted to data – they are black boxes. This makes it difficult to understand how the model behaves and what it relies on:

  • What were the most influential features?
  • Which features did the model learn?
  • Why did the model make a certain prediction?

We show in Chapter 10 how to use interpretability techniques to improve models and gain insights from them.
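
To make these questions concrete, here is a minimal sketch of one model-agnostic interpretability technique, permutation feature importance. The library (scikit-learn), the synthetic dataset, and the random forest are placeholder assumptions for illustration.

```python
# Minimal sketch: permutation feature importance, a model-agnostic way to
# find the most influential features. Dataset and model are placeholders.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature on held-out data and record the drop in performance;
# large drops indicate features the model relies on.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i, importance in enumerate(result.importances_mean):
    print(f"feature {i}: {importance:.3f}")
```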

7.3 Predictive performance is at odds with causality

The world we live and act in is a gigantic causal mechanism. Bare-bones ML models, however, ignore the causal dimension of things. All they care about is making better predictions, and they rely on whatever statistical dependencies serve this goal. You want to know your COVID risk? Alright, the ML model needs your shoe size, salary, and the football ranking of your local club. As scientists, we want to distinguish between causes, effects, and spuriously correlated features. We want to know why certain people get COVID and others don’t. We want to develop drugs and prescribe treatments that make people healthy. To answer these questions, causality must be taken into account.

We’ll show in Chapter 11 how the combination of causal inference and ML allows us to find causal dependencies, learn causal models, and estimate causal effects.

7.4 The idea of generalization is too narrow

What does it mean for a model to predict well? Machine learning has a clear answer: We measure the performance on unseen data. And if a model performs well on unseen data instead of merely memorizing the training data, we say that the model generalizes well to new data. A lot of statistical learning theory, the field that studies generalization, is built on this concept, from the bias-variance tradeoff and overfitting to double descent. However, generalization in machine learning is quite narrowly defined and only applies to data from the same distribution. Beyond possible mismatches between the distribution of the evaluation data and the application data, there is more that we want our models to transfer. Especially in science: We might want to generalize the model interpretation or make claims about performance for a certain population.
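
As a minimal sketch of this notion of generalization, the snippet below compares training and test performance on held-out data from the same distribution; the synthetic dataset and the gradient boosting model are placeholder assumptions.

```python
# Minimal sketch: generalization as performance on unseen data from the
# same distribution. Dataset and model are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# A large gap between the two scores signals overfitting; note that the
# test score says nothing about data from a different distribution.
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, model.predict(X_test)))
```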

We discuss generalization and its extensions in Chapter 8.

7.5 No uncertainty quantification

Proper uncertainty quantification is crucial when high-stakes decisions are made in the real world. However, bare-bones ML models only provide point predictions. Some models, such as Bayesian models, come with built-in uncertainty quantification, but limiting the model class to these models may result in a loss of predictive performance. Perhaps the best-performing model is a random forest; if you choose it, you end up with a model without built-in uncertainty quantification. But when we make decisions based on predictions, we need to quantify how certain these predictions are. Even when model outputs look like probabilities because they are between zero and one (yes, we’re talking to you, softmaxers), they often cannot be interpreted as ‘real’ probabilities when models aren’t well calibrated.
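
As a minimal sketch of what calibration means in practice, the snippet below bins a model’s predicted probabilities and compares them to the observed fraction of positives, assuming scikit-learn and placeholder data.

```python
# Minimal sketch: check whether predicted "probabilities" are calibrated by
# comparing them to observed frequencies. Dataset and model are placeholders.
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# For a well-calibrated model, the observed fraction of positives matches
# the mean predicted probability within each bin.
frac_positive, mean_predicted = calibration_curve(y_test, proba, n_bins=10)
for observed, predicted in zip(frac_positive, mean_predicted):
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")
```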

We discuss the philosophy behind uncertainty in machine learning, calibration, and model-agnostic methods for uncertainty quantification in Chapter 12.

7.6 Lack of robustness

Bare-bones ML provides predictive models, but only for a static environment. This means ML models are only accurate if:

  • they are applied to data that is similar to the training data, and
  • the data structure remains intact during deployment.

In the wild, phenomena are complex. Data changes constantly, external factors enter, measurement devices produce errors, time moves, and observation targets shift. If we want to use ML models in our scientific pipeline, we have to make them robust tools under real-world conditions.

We show in Chapter 13 what types of robustness you may be interested in and discuss strategies such as data augmentation that help you robustify your model.
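
As a minimal sketch, and assuming tabular data, scikit-learn, and a simple noise model for measurement error, the snippet below shows one robustness check (evaluating on perturbed test data) and one robustification strategy (augmenting the training data with perturbed copies).

```python
# Minimal sketch: a robustness check (perturbed test data) and a simple
# robustification strategy (data augmentation). Noise model and data are
# placeholders for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Simulate measurement noise that might occur during deployment.
rng = np.random.default_rng(0)
X_test_noisy = X_test + rng.normal(scale=0.3, size=X_test.shape)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("clean test accuracy:", model.score(X_test, y_test))
print("noisy test accuracy:", model.score(X_test_noisy, y_test))

# Augment the training data with perturbed copies so the model already
# sees the kind of noise it will face in deployment.
X_aug = np.vstack([X_train, X_train + rng.normal(scale=0.3, size=X_train.shape)])
y_aug = np.concatenate([y_train, y_train])
model_aug = RandomForestClassifier(random_state=0).fit(X_aug, y_aug)
print("noisy test accuracy (augmented):", model_aug.score(X_test_noisy, y_test))
```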

7.7 Lack of standards for reporting model results

In science, it’s pivotal to be transparent about why a scientific model is a good fit and what data it explains. In ML, we have yet to establish what information is scientifically relevant. Is predictive performance on the test set all we need? There might be other relevant information, such as performance on alternative metrics or feature importances.
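
As a minimal sketch of reporting more than a single number, the snippet below computes several test-set metrics for a placeholder classifier; the dataset, model, and metric choices are illustrative assumptions, not a reporting standard.

```python
# Minimal sketch: report several metrics instead of a single score.
# Dataset, model, and metrics are placeholders, not a reporting standard.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]

# Different metrics answer different questions, especially for imbalanced classes.
print("accuracy:", accuracy_score(y_test, pred))
print("F1:      ", f1_score(y_test, pred))
print("ROC AUC: ", roc_auc_score(y_test, proba))
```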

We discuss different standards for reporting model results in Chapter 15. We provide a high-level view of how to group these standards and some best practices for reporting.

7.8 Lack of standards for reproducibility

There are many standards for how to document code, data, and models for others to reproduce results. Reproducibility is key to:

  • being transparent about what you did to achieve high performance,
  • allowing others to test your work and gain trust in it,
  • reliably building on your results, and
  • making your code reusable for potential applications.

Unfortunately, many papers that use ML in science do not allow for reproducibility (McDermott et al. 2021). Often, important information is missing, such as the weight initialization, the number of training epochs, hyperparameters, or random seeds. Preprocessing steps might not be listed, and the code may be poorly documented or poorly implemented.
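
As a minimal sketch of the kind of information that reproducibility depends on, the snippet below fixes random seeds and writes the run configuration to a file; the file name, settings, and structure are placeholder choices, not a prescribed standard.

```python
# Minimal sketch: fix random seeds and store the run configuration so
# others can rerun the analysis. File name and settings are placeholders.
import json
import random

import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

config = {
    "seed": SEED,
    "model": "RandomForestClassifier",
    "hyperparameters": {"n_estimators": 500, "max_depth": 10},
    "preprocessing": ["standardize features", "drop rows with missing values"],
}

# Store the configuration alongside code and data (here as a JSON file).
with open("run_config.json", "w") as f:
    json.dump(config, f, indent=2)
```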

We discuss standards for reproducibility of ML in science in Chapter 16.


  1. Usually our theory is informed by a lot of historical data; thus, in a certain way we increase the data support by incorporating background knowledge.