6 Bare-Bones Machine Learning is Insufficient
Even though machine learning has great merits when it comes to prediction, it clashes with other scientific goals such as control, explanation, and reasoning. This chapter highlights more concretely and practically the insufficiencies of bare-bones machine learning as a scientific methodology.
The veteran scientists’ concerns about machine learning were not misplaced. Over time, the tornado system began to act up. False positives and false negatives became more common. Krarah consulted Rattle, but even she was puzzled. Rattle concluded that the problems were not specific to tornado prediction, but systemic to machine learning. They had to be fixed.
6.1 Low test error is not enough
As we highlighted in Chapter 4, machine learning has a transparent notion of what makes a good model: a good model has low error on an unseen test set and, consequently, a low generalization error. However, this notion is empty if it is not backed by an underlying theory of generalization. When can you infer, from the error on a test set alone, that the model's performance generalizes? Are there guarantees that certain learning algorithms must yield models with low expected error?
In Chapter 7, we present statistical learning theory as the main contender for a theory of generalization in machine learning.
However, the standard notion of generalization only reflects the predictive capacities of models on data from the same distribution. Extrapolation to completely new scenarios or data from different distributions is not encompassed in the standard notion of generalization. It therefore has limited bite, especially in science, where the goal is to generalize from models to insights about the phenomenon.
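To see how little a standard test error can tell you about new scenarios, consider a minimal sketch (our own toy setup using scikit-learn, not an example from this book's case study): a random forest is fit on inputs from one range and evaluated both on that range and on a shifted one.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def sample(low, high, n=500):
    """Draw inputs from [low, high] and noisy targets from a sine curve."""
    X = rng.uniform(low, high, (n, 1))
    y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.1, n)
    return X, y

X_train, y_train = sample(0, 1)
X_test, y_test = sample(0, 1)    # same distribution as training
X_shift, y_shift = sample(1, 2)  # shifted distribution

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

print("test MSE (same distribution):   ",
      mean_squared_error(y_test, model.predict(X_test)))
print("test MSE (shifted distribution):",
      mean_squared_error(y_shift, model.predict(X_shift)))
```

The test error on the original distribution looks excellent, while the error under the shift is far worse, and nothing in the first number warned you about the second.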
In Chapter 7, we therefore also discuss broader conceptions of generalization. Chapter 11 on robustness, moreover, shows how to define generalization under distribution shifts.
6.2 Domain knowledge is overlooked
Science builds on prior knowledge to produce knowledge. Bare-bones machine learning disrupts both ends of this knowledge flow: you feed in data, expect the model to figure out the relations on its own, and get back an opaque prediction model. The focus shifts from epistemology to utility, from science to engineering. You tinker with the model to improve a metric. If machine learning is fully embraced in its bare-bones form, a lot of domain knowledge is left unused, and there is little new knowledge to gain.
However, coherence with background knowledge makes scientific models more valuable. It is smart to incorporate additional information.¹ Compare that to statistical modeling or differential equations: these modeling approaches encourage, even require, the formulation of prior knowledge in terms of distributional assumptions and equations. And you get interpretable estimates back that help you better understand the phenomenon you study.
In Chapter 8, we argue that you can take a domain-knowledge-driven approach with machine learning as well. Even better: Thanks to the focus on predictive performance, you can evaluate your domain knowledge in terms of predictive performance.
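To give a first taste, here is a minimal sketch, assuming scikit-learn and a made-up crop-yield setting: the domain knowledge "yield does not decrease with irrigation" is handed to a gradient boosting model as a monotonicity constraint.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)

# Hypothetical setting: crop yield grows with irrigation (feature 0),
# while feature 1 is an unconstrained covariate
X = rng.uniform(0, 1, (1000, 2))
y = 2 * X[:, 0] + 0.5 * np.sin(4 * X[:, 1]) + rng.normal(0, 0.3, 1000)

# monotonic_cst encodes the prior: 1 = increasing, -1 = decreasing, 0 = free
model = HistGradientBoostingRegressor(monotonic_cst=[1, 0])
model.fit(X, y)
```

The constraint restricts the model class much like a distributional assumption restricts a statistical model, and whether it helps can be checked by comparing predictive performance with and without it.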
6.3 Predictions cannot be easily explained and interpreted
Interpretability enables you to justify models and reason about phenomena. But machine learning models are generally not inherently interpretable, because they may have a complex functional form that is adjusted to the data – they are black boxes. This makes it difficult to understand how the model behaves and what it relies on:
- What were the most influential features?
- Which features did the model learn?
- Why did the model make a certain prediction?
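Model-agnostic interpretability techniques can answer at least the first of these questions from the outside. A minimal sketch, assuming scikit-learn and its built-in diabetes dataset: permutation feature importance ranks features by how much the test performance drops when their values are shuffled.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time and measure how much performance drops
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
ranking = sorted(zip(X.columns, result.importances_mean),
                 key=lambda pair: -pair[1])
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```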
We show in Chapter 9 how to use interpretability techniques to improve models and gain insights from them.
6.4 Predictive performance is at odds with causality
The world we live and act in is a gigantic causal mechanism. Bare-bones machine learning models, however, ignore the causal dimension of things. All they care about is making better predictions, and they exploit whatever statistical dependencies serve this goal. Do you want to know your COVID risk? Alright, the machine learning model needs your shoe size, salary, and the football ranking of your local club.
But scientists want to distinguish between causes, effects, and spuriously correlated features. Physicians want to know why certain people have COVID and others don’t. They want to develop drugs and prescribe treatments that make people healthy. To answer these questions, causality must be taken into account.
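The shoe-size example can be made concrete with a toy simulation (all numbers made up, scikit-learn assumed): a confounder renders the spurious feature predictive, yet intervening on that feature does not move the outcome, so a model built for prediction gives no handle for control.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 10_000

# Made-up numbers: age confounds both shoe size and COVID risk
age = rng.normal(50, 10, n)
shoe_size = 0.2 * age + rng.normal(0, 1, n)
risk = 0.1 * age + rng.normal(0, 1, n)

model = LinearRegression().fit(shoe_size.reshape(-1, 1), risk)
print("R^2, observational data:", model.score(shoe_size.reshape(-1, 1), risk))

# Intervention: hand everyone bigger shoes; risk is unaffected,
# so the learned association breaks down
shoe_size_do = shoe_size + 5
print("R^2 after intervention: ", model.score(shoe_size_do.reshape(-1, 1), risk))
```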
We’ll show in Chapter 10 how the combination of causal inference and machine learning allows you to find causal dependencies, learn causal models, and estimate causal effects.
6.5 Models lack robustness in deployment
Bare-bones machine learning provides predictive models, but only for a static environment. This means machine learning models are accurate only if:
- they are applied to data that is similar to the training data, and
- the data structure remains intact during deployment.
In the wild, phenomena are complex. Data changes constantly, external factors enter, measurement devices produce errors, time moves on, and observation targets shift. If you want to use machine learning models in your scientific pipeline, you have to make them robust under real-world conditions.
We show in Chapter 11 what types of robustness you may be interested in and discuss strategies such as data augmentation that help you robustify your model.
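As a preview, here is a minimal sketch of one such strategy for tabular data (the noise model and scale are our own toy assumptions): the training set is augmented with jittered copies, so the model also sees the slightly perturbed measurements it may face in deployment.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def augment_with_noise(X, y, n_copies=3, scale=0.05):
    """Append jittered copies of the data, mimicking small
    measurement errors the model may face during deployment."""
    X_aug = np.vstack([X] + [X + rng.normal(0, scale, X.shape)
                             for _ in range(n_copies)])
    y_aug = np.concatenate([y] * (n_copies + 1))
    return X_aug, y_aug

# Toy data: a linear signal with noise
X = rng.uniform(0, 1, (200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 200)

X_aug, y_aug = augment_with_noise(X, y)
model = RandomForestRegressor(random_state=0).fit(X_aug, y_aug)
```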
6.6 Predictions come without uncertainty quantification
Proper uncertainty quantification is crucial when high-stakes decisions are made in the real world. However, bare-bones machine learning models only provide point predictions. Some models, such as Bayesian models, come with built-in uncertainty quantification, but limiting yourself to this model class may cost predictive performance. Perhaps the best-performing model is a random forest; choose it, and you get a model without built-in uncertainty quantification. Even when model outputs look like probabilities because they lie between zero and one (yes, we’re talking to you, softmaxers), they often cannot be interpreted as ‘real’ probabilities when models aren’t well calibrated.
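Calibration can be checked, and often repaired, without giving up the model class. A minimal sketch with scikit-learn on synthetic data (our own toy setup): compare predicted scores against observed frequencies, and wrap the model in a post-hoc calibrator.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Post-hoc calibration: same model class, wrapped rather than replaced
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(random_state=0), method="isotonic", cv=5
).fit(X_train, y_train)

for name, model in [("raw", raw), ("calibrated", calibrated)]:
    frac_pos, mean_pred = calibration_curve(
        y_test, model.predict_proba(X_test)[:, 1], n_bins=10)
    # If scores were real probabilities, frac_pos would match mean_pred
    print(name, "max calibration gap:",
          abs(frac_pos - mean_pred).max().round(3))
```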
We discuss the philosophy behind uncertainty in machine learning, calibration, and model-agnostic methods for uncertainty quantification in Chapter 12.
6.7 No consensus on standards for reproducibility
There are many standards for how to document code, data, and models for others to reproduce results. Reproducibility is key to:
- being transparent about what you did to achieve high performance,
- allowing others to test your work and gain trust in it,
- reliably building on your results, and
- making your code reusable for potential applications.
Unfortunately, many papers that use machine learning in science do not allow for reproducibility [1]. Often, important information is missing, such as the weight initialization, the number of training epochs, the hyperparameters, or the random seeds. Also, preprocessing steps might go unlisted, and the code may be poorly documented or poorly implemented.
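Some of this is cheap to fix. A minimal sketch of pinning and reporting the random seeds (the helper function is our own; deep learning frameworks need additional seed calls of their own):

```python
import os
import random

import numpy as np

SEED = 42  # report this value alongside your results

def set_seed(seed: int) -> None:
    """Pin the common sources of randomness in a Python pipeline.
    Frameworks like PyTorch or TensorFlow need their own seed calls."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)

set_seed(SEED)
```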
We discuss standards for reproducibility of machine learning in science in Chapter 13.
6.8 Missing standards for reporting
In science, it is of central importance to make transparent why a scientific model is suitable and what data it explains. In machine learning, researchers still need to determine what information is scientifically relevant. Predictive performance on the test set is indeed important. But what about the importance of features? Or information about the data collection or the model selection?
We discuss different standards for reporting model results in Chapter 14. We provide a high-level view of how to group these standards and some best practices for publication.
¹ Usually, scientific theories are informed by a lot of historical data; thus, in a certain way, incorporating background knowledge increases the data support.