5 Justification to Use Machine Learning
We have demonstrated that prediction is one essential goal in science. Great! But how does machine learning help scientists make predictions? What value does it add to existing predictive modeling approaches? What are good reasons for using machine learning in science?
In this chapter, we justify the use of machine learning in science both epistemically and pragmatically.
The first project with ML, launched with the blessing of the Elder Council, was tornado prediction. The ravens despise tornados. Many lost their homes, unborn children, and loved ones to tornados. They installed a simple early warning system years ago, but it performed poorly – false alarms and very late warnings of real dangers have led the ravens to largely ignore it. Krarah, the tornado expert of raven science, has been consulted to approach the problem with supervised machine learning. She starts by placing sensors within and around Raven’s district, which constantly produce piles of data in all forms: air measurements, pictures, and audio recordings. The data is messy, multi-modal, complex, high-dimensional, and difficult to overview. But after some months, and with Rattle’s help, Krarah manages to train a highly predictive model that will protect all ravens from dangerous storms. Eureka!
5.1 Machine learning has a transparent and quantifiable notion of success
In science, the quality of models is often assessed in a lengthy and non-transparent manner. Social aspects such as the reputation of the researchers who developed the model, the opinion of other scientific authorities, or the amount of funding for the research programme can play as important a role in model assessment as the predictive capacity of the model (Kuhn 1997).
In comparison, machine learning offers a very transparent notion of model quality: a good model is one that has a low average prediction error on an unseen test dataset. The idea behind this type of assessment is that a model with a low empirical error on a test set will also have a low expected error on the task to be solved. This inference from the test error to the so-called generalization error is even demonstrably correct if the test data is a random and representative sample of the task.
Furthermore, not only is the goal of modeling very transparent, but the feedback that researchers receive is also quick to obtain and comes in quantified form. In this way, scientists can check their modeling intuitions throughout the modeling process on an almost purely empirical basis and even compare their models with those from entirely different modeling schools.
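To make this concrete, here is a minimal sketch (our own illustration with synthetic data and scikit-learn, not an example from a real study): two models from quite different modeling schools are judged by exactly the same quantitative criterion, their error on a held-out test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in practice this would be your scientific dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two models from different "modeling schools", judged by the same number:
# their accuracy on data they have never seen.
for model in [LogisticRegression(max_iter=1000), GradientBoostingClassifier()]:
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{type(model).__name__}: test accuracy = {acc:.3f}")
```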
This transparency about what makes a good model enables scientists to work towards a common goal. We discuss the concept of generalization in Chapter 8.
5.2 Machine learning adapts your model to the world and not the world to your model
Most approaches where models are fit to data (e.g., statistical models or simulations) come with a constrained functional form, such as an assumed linear, exponential, inverse, or additive relationship between variables. Such constraints usually reflect the domain knowledge of the researcher. For example, if you know that a certain fertilizer is less effective in dry climates, you may add an interaction term.
While incorporating domain knowledge sounds smart, it can be dangerous: Your modeling success will depend on the reliability of your domain knowledge and your skill in encoding it. A researcher who only knows linear models will assume linearity in all contexts, whether it is a conscious choice or not. Misspecified models will lead to wrong conclusions and therefore to bad science (Dennis et al. 2019).
In machine learning, the model classes – like neural networks or random forests – are more flexible and powerful. They can in principle approximate any function.1 You don’t need to know that there is an interaction between humidity and fertilizer effectiveness to build a good model; the algorithm will learn this interaction from the data. This shifts the burden: modeling success no longer rests on your domain knowledge but on the size and quality of your data. Equipped with enough data, even non-domain experts can find highly predictive models. There is, of course, a trade-off: with lots of data, domain knowledge is secondary; with little data, on the other hand, domain knowledge is key.
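To illustrate the fertilizer example, here is a hypothetical sketch (synthetic data with an invented data-generating process): the true crop yield contains a humidity-fertilizer interaction, a linear model without a hand-crafted interaction term misspecifies it, while a random forest can pick it up from the data alone.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
humidity = rng.uniform(0, 1, n)
fertilizer = rng.uniform(0, 1, n)
# Assumed data-generating process: fertilizer helps less in dry (low-humidity)
# conditions, i.e., there is a humidity x fertilizer interaction.
crop_yield = 2 * fertilizer * humidity + 0.5 * humidity + rng.normal(0, 0.1, n)

X = np.column_stack([humidity, fertilizer])
X_train, X_test, y_train, y_test = train_test_split(X, crop_yield, random_state=0)

# Neither model is told about the interaction; only the flexible one can learn it.
for model in [LinearRegression(), RandomForestRegressor(random_state=0)]:
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{type(model).__name__}: test MSE = {mse:.4f}")
```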
We will discuss the role of domain knowledge and how to incorporate it in machine learning modeling in Chapter 9.
5.3 Machine learning handles different data structures
The more complex the data structures become, the more difficult it becomes to make assumptions and follow a traditional modeling approach. This becomes evident if we leave the realm of tabular data and consider images, text, and geospatial data. This is where machine learning can shine. Machine learning successfully handles various data types, such as tabular (Gao et al. 2020), image (Erickson et al. 2017), geospatial (Ren et al. 2021), sound (Oikarinen et al. 2019), graph (Hu et al. 2020), or text data (Neethu and Rajasree 2013). In addition, real-world data poses further challenges:
- Most data wasn’t produced by controlled experiments but comes from the messy world.
- Data is often high-dimensional without indication of what’s relevant and what’s not.
- Data can come in various forms (text, images, audio, etc.) and is not restricted to tables with clearly interpretable columns.
Unlike other approaches such as classic statistical modeling, machine learning can deal with all these problems. All you need is data! Well, we exaggerate a bit here, but you get the idea. The ability to deal with diverse data is not merely a practical advantage but also an epistemic one: if diverse data sources point to similar insights, this strongly supports the validity of these insights (Earman 1992).
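As a toy illustration of how the same supervised workflow carries over to non-tabular data (a made-up mini-example, not a real dataset), raw text snippets can go directly into a model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, invented corpus: field notes labeled as storm-related (1) or not (0).
texts = [
    "dark clouds and strong rotating winds near the nest",
    "calm skies, ravens foraging peacefully",
    "sudden pressure drop, funnel cloud spotted to the west",
    "sunny afternoon with light breeze",
]
labels = [1, 0, 1, 0]

# Raw text in, predictive model out: the vectorizer turns documents into features.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["rotating funnel cloud approaching"]))  # likely predicts [1]
```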
5.4 Machine learning allows you to work on new questions
Machine learning has enabled scientists to leverage data for specific problems that they hadn’t been able to solve before with other approaches:
- Translating texts: Machine translation is older than machine learning, but, let’s be honest, it was too bad to be useful. Before machine learning was mature enough and embraced, rule-based systems and statistical machine translation were used.
- Diagnosis of diseases based on X-ray images: Before machine learning, techniques like template matching and rule-based algorithms were used. Only with machine learning has huge progress on this task become possible.
- Drug discovery: Traditionally, drug discovery involved a lot of trial and error. Machine learning has made it possible to analyze huge amounts of data and predict how different drugs might interact with targets in the body. In the future, this may greatly accelerate the process of drug discovery.
For such research questions, machine learning finally offers routes toward studying them systematically, built on empirical data. Paraphrasing Wittgenstein – the limits of my modeling are the limits of my world. This implies that extending the limits of modeling with machine learning extends the realm of phenomena scientists can investigate.
5.5 Don’t undervalue machine learning because it is new
Compared to other modeling approaches, machine learning has a short history. Differential equations or linear models, for instance, have existed for hundreds of years. Historically, new modeling approaches have always had a hard time establishing themselves:
- Einstein’s theory of relativity relied on non-Euclidean geometry – an unusual branch of mathematics for such endeavors – which was one of the reasons why it was met with skepticism.
- Probabilistic models were first completely rejected in linguistics due to the dominance of logical methods. Today, they are well-established.
- In econometrics, graphical causal models have a hard time asserting themselves against the prevailing framework of potential outcomes.
The novelty of an approach should be taken neither as an argument for nor against using it. The essential question is: Can the tool help answer the question you have set out to address?
5.6 Further justifications for machine learning
There are further justifications for using machine learning in science:
- Time efficiency: Accurate machine learning models can often be built quickly by researchers. Less pre-processing is needed compared to other approaches, and large parts of the model selection and training process can be automated (see the sketch after this list). Also, there is great support for machine learning packages in popular programming languages like Python (e.g., scikit-learn, PyTorch, or TensorFlow) or R (e.g., mlr3 or tidymodels).
- Computational efficiency: In some situations, machine learning can be computationally cheaper than traditional modeling. Machine learning models can, for example, be used as fast surrogate models for simulations from physics or materials science that are computationally very intensive (Toledo-Marín et al. 2021). Machine learning weather prediction can run on your laptop, unlike numerical methods that demand the world’s biggest computing clusters (Lam et al. 2023).
- Basis for theory: Machine learning may provide the knowledge needed to build classical theory-based models, for example by showing which features contain predictive information and how features interact.
- Effective for operationalized goals: Machine learning is ideal if the aim can be easily encoded in a single metric. If you are confident in your metric, machine learning will provide you with the means to optimize for it.
- Theoretical underpinnings: Theory is slowly catching up with practical success. For some methods, there are learning guarantees (Vapnik 1999; Bishop and Nasrabadi 2006), and even for deep learning techniques researchers increasingly have formal intuitions about how they work (Belkin 2021).
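To illustrate the automation mentioned under time efficiency, here is a small sketch with synthetic data (our own example, not a prescribed workflow): cross-validated model selection takes only a few lines in scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic placeholder data standing in for a scientific dataset.
X, y = make_classification(n_samples=500, n_features=15, random_state=0)

# Automated hyperparameter search with 5-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```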
Many hypothesis classes in machine learning are universal approximators of some form; that is, they lie dense in the space of (continuous/step) functions (Hornik 1991; Lu and Lu 2020). While this is not a particularly special property (polynomials, for example, also lie dense in the space of continuous functions), it at least expresses that there is in principle a model that approximates any possible relationship well. This is different from classical statistical models that rely on small model classes; even the best model within such a class may be a bad model for capturing the true relationship between predictors and target.↩︎
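Stated formally (our paraphrase of the density claim for continuous functions, with $\mathcal{H}$ denoting the hypothesis class and $K$ a compact subset of the input space):

$$\forall f \in C(K), \; \forall \varepsilon > 0: \quad \exists h \in \mathcal{H} \quad \text{such that} \quad \sup_{x \in K} |h(x) - f(x)| < \varepsilon.$$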