6  Justification to Use Machine Learning

We have demonstrated that prediction is central to science. From this perspective, ML in science is a fantastic idea. However, we also highlighted that it clashes with other goals such as explanation, control, and reasoning – at least when considering ML in its most basic form.

We need to decide! Is the application of ML in science justified?

This chapter moves away from purely high-level philosophy-of-science stances and addresses the everyday reality of data-driven research. Many researchers already make practical use of ML in fields as diverse as geoscience (Reichstein et al. 2019), materials science (Schmidt et al. 2019), neuroscience (Kell et al. 2018), education science (Luan and Tsai 2021), and archeology (Bickler 2021). Why do they use it?

It’s finally time to justify the use of ML in science, both epistemically and practically.

The first project with ML, launched with the blessing of the Elder Council, was tornado prediction. The ravens despise tornados. Many lost their homes, unborn children, and loved ones to tornados. They installed a simple early warning system years ago, but it performed poorly: false alarms and very late alerts for real dangers led the ravens to largely ignore the warning system. Krarah, the tornado expert of raven science, has been consulted to approach the problem with supervised machine learning. She starts by placing sensors within and around Raven’s district, which constantly produce piles of data in all forms: air measurements, pictures, and audio recordings. The data is messy, multi-modal, complex, high-dimensional, and difficult to overview. But after some months, and with Rattle’s help, Krarah manages to train a highly predictive model that will protect all ravens from dangerous storms. Eureka!

6.1 ML adapts your model to the world and not the world to your model

Most approaches where models are fit to data (e.g., statistical models or simulations) come with a constrained functional form, such as an assumed linear, exponential, inverse, or additive relationship between variables. Such constraints usually reflect the domain knowledge of the researcher. For example, if you know that a certain fertilizer is less effective in dry climates, you may add an interaction term between fertilizer and humidity.

While incorporating domain knowledge sounds smart, it can be dangerous: your modeling success will depend on the reliability of your domain knowledge and your skill in encoding it. A researcher who only knows linear models will assume linearity in all contexts, whether it’s a conscious choice or not. Misspecified models will lead to wrong conclusions and therefore to bad science (Dennis et al. 2019).

In ML, the model classes – like neural networks or random forests – are more flexible and powerful. They can in principle approximate any function.1 You don’t need to know that there is an interaction between humidity and fertilizer effectiveness to build a good model; the algorithm will learn this interaction from the data. This shifts the burden: modeling success no longer rests on your domain knowledge but on the size and quality of your data. Equipped with enough data, even non-domain experts can find highly predictive models. In effect, there is a trade-off: with lots of data, domain knowledge is secondary; with little data, domain knowledge is key.
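
A minimal sketch of this burden shift, using simulated data (the data-generating process and variable names are invented for illustration): a main-effects-only linear model misses the humidity-fertilizer interaction, while a random forest picks it up from the data alone.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000

# Invented data-generating process: fertilizer effectiveness
# depends on humidity (an interaction the analyst may not know about).
humidity = rng.uniform(0, 1, n)
fertilizer = rng.uniform(0, 1, n)
crop_yield = 2 * fertilizer * humidity + 0.5 * humidity + rng.normal(0, 0.1, n)

X = np.column_stack([humidity, fertilizer])
X_train, X_test, y_train, y_test = train_test_split(X, crop_yield, random_state=0)

# Main-effects-only linear model: misspecified, misses the interaction.
linear = LinearRegression().fit(X_train, y_train)

# Random forest: flexible enough to learn the interaction from data alone.
forest = RandomForestRegressor(random_state=0).fit(X_train, y_train)

print("Linear model R^2: ", r2_score(y_test, linear.predict(X_test)))
print("Random forest R^2:", r2_score(y_test, forest.predict(X_test)))
```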

We will discuss the role of domain knowledge and how to incorporate it in ML modeling in Chapter 9.

6.2 Machine learning handles different data structures

The more complex the data structures, the harder it becomes to make assumptions and follow a traditional modeling approach. This becomes evident once we leave the realm of tabular data and consider images, text, and geospatial data. This is where ML can shine: it successfully handles many data types, including tabular (Gao et al. 2020), image (Erickson et al. 2017), geospatial (Ren et al. 2021), sound (Oikarinen et al. 2019), graph (Hu et al. 2020), and text data (Neethu and Rajasree 2013). Scientific data is challenging in several further ways:

  • Most data wasn’t produced by controlled experiments but comes from the messy world.
  • Data is often high-dimensional without indication of what’s relevant and what’s not.
  • Data can come in various forms (text, images, audio, etc.) and is not restricted to tables with clearly interpretable columns.

Unlike other approaches such as classic statistical modeling, ML can deal with all of these challenges. All we need is data! Well, we exaggerate a bit here, but you get the idea. Dealing with diverse data is not merely a practical advantage but also an epistemic one: if diverse data sources point to similar insights, this strongly supports the validity of those insights (Earman 1992). A small illustration follows below.
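
As a small illustration of how uniformly ML tooling treats non-tabular data, here is a hedged sketch of a text classifier in scikit-learn (the toy documents and labels are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for illustration: classify field notes as
# storm-related (1) or not (0).
docs = [
    "dark clouds and rising wind",
    "calm skies over the valley",
    "funnel cloud spotted near the ridge",
    "clear and sunny morning",
]
labels = [1, 0, 1, 0]

# The same fit/predict interface works for tabular, text, or image
# features; only the preprocessing step changes.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs, labels)
print(model.predict(["strong wind and a dark funnel cloud"]))
```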

6.3 ML allows us to work on new questions

ML has enabled us to leverage data for problems that we could not satisfactorily solve before with other approaches:

  • Translating texts: Machine translation is older than machine learning, but, let’s be honest, it was too unreliable to be broadly useful. Before machine learning was mature enough and embraced, rule-based systems and statistical machine translation were the state of the art.
  • Diagnosis of diseases based on X-ray images: Before machine learning, techniques like template matching and rule-based algorithms were used. Only machine learning enabled substantial progress on this task.
  • Drug discovery: Traditionally, drug discovery involved a lot of trial and error. Machine learning has made it possible to analyze huge amounts of data and predict how different drugs might interact with targets in the body. In the future, this may greatly accelerate the process of drug discovery.

For such research questions, ML finally offers routes toward studying them systematically, built on empirical data. Paraphrasing Wittgenstein: the limits of my modeling are the limits of my world. Extending the limits of modeling with ML thus extends the realm of phenomena we can capture in science.

6.4 Don’t undervalue ML because it’s new

Compared to other modeling approaches, ML has a short history. Differential equations and linear models, for instance, have existed for hundreds of years. Historically, new modeling approaches have always had a hard time establishing themselves, and ML is no exception:

  • Einstein’s theory of relativity relied on non-Euclidean geometry – an unusual branch of mathematics for such endeavors – which was one of the reasons it was met with skepticism.
  • Probabilistic models were at first rejected outright in linguistics due to the dominance of logical methods. Today, they are well established.
  • In econometrics, graphical causal models have a hard time asserting themselves against the prevailing framework of potential outcomes.

The novelty of an approach is an argument neither for nor against using it. The essential question is: Can the tool help us answer the question we have set out to address?

6.5 Further justifications for ML

There are further justifications for using ML in science:

  • Time efficiency: Researchers can often build accurate ML models quickly. Less pre-processing is needed compared to other approaches, and large parts of the model selection and training process can be automated (see the first sketch after this list). There is also great support for ML in popular programming languages like Python (e.g., scikit-learn, PyTorch, or TensorFlow) and R (e.g., mlr3 or tidymodels).
  • Computational efficiency: In some situations, ML can be computationally cheaper than traditional modeling. ML models can, for example, serve as fast surrogate models for simulations from physics or materials science that are computationally very intense (Toledo-Marín et al. 2021); see the second sketch after this list. Trained ML weather models can run on your laptop, unlike numerical methods that demand the world’s biggest computing clusters (Lam et al. 2023).
  • Basis for theory: ML may supply the knowledge needed to build classical theory-based models. ML models may show which features contain predictive information and how features interact.
  • Effective for operationalized goals: ML is ideal if our aim can be easily encoded in a single metric. If we are confident in our metric, ML will provide us with the means to optimize for it.
  • Theoretical underpinnings: Theory is slowly catching up with practical success. For many methods we can provide learning guarantees (Vapnik 1999; Bishop and Nasrabadi 2006), and even for deep learning techniques we increasingly have formal intuitions about what makes them work (Belkin 2021).
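
First, a minimal sketch of automated model selection (a hedged example, not a recipe; the dataset and hyperparameter grid are chosen only for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cross-validated search over hyperparameters: the selection loop
# is automated instead of hand-tuned.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```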
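
Second, a hedged sketch of the surrogate-model idea: train a cheap ML model on a limited number of runs of an expensive simulator, then query the surrogate instead (the simulator here is a toy function of our invention).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Stand-in for an expensive simulation; the real one might take hours
# per call, this toy function is invented for illustration.
def expensive_simulation(x):
    return np.sin(3 * x[:, 0]) * np.exp(-x[:, 1] ** 2)

rng = np.random.default_rng(0)

# Run the costly simulator on a modest design of input points ...
X_design = rng.uniform(-2, 2, size=(500, 2))
y_design = expensive_simulation(X_design)

# ... and fit a cheap ML surrogate on those runs.
surrogate = GradientBoostingRegressor().fit(X_design, y_design)

# The surrogate now answers new queries almost instantly.
X_new = rng.uniform(-2, 2, size=(5, 2))
print(surrogate.predict(X_new))
print(expensive_simulation(X_new))  # compare against the true simulator
```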

  1. Many hypothesis classes in ML are universal approximators of some form; that is, they lie dense in the space of (continuous or step) functions (Hornik 1991; Lu and Lu 2020). This property is not unique to ML models, e.g., polynomials also lie dense in the space of continuous functions, but it at least tells us that the class in principle contains a model that approximates any possible relationship well. This differs from classical statistical models, which rely on small model classes; even the best model within such a class may be a bad model of the true relationship between predictors and target.↩︎
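
To make the footnote’s claim concrete, here is one classical form of a universal approximation statement (in the spirit of Hornik 1991; the notation is ours, and the exact conditions on the activation vary across results). For a compact set K and a suitable (e.g., sigmoidal) activation σ, single-hidden-layer networks are dense in the space C(K) of continuous functions on K:

```latex
\[
\forall f \in C(K),\ \forall \varepsilon > 0\ \ \exists N \in \mathbb{N},\
\alpha_i, b_i \in \mathbb{R},\ w_i \in \mathbb{R}^d :\quad
\sup_{x \in K}\left| f(x) - \sum_{i=1}^{N} \alpha_i\,
\sigma\!\left(w_i^\top x + b_i\right) \right| < \varepsilon
\]
```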