2 Introduction
This book is a work in progress. We are happy to receive your feedback via science-book@christophmolnar.com.
Machine learning has not only found its way into science, it has revolutionised it. This book explores the intriguing relationship between machine learning and science from two perspectives. In Part I of the book, we take a philosophical perspective: We justify machine learning as a predictive tool, but also point out its shortcomings when it comes to other scientific goals such as explaining phenomena. In Part II, we take a more technical perspective: We show how integrating interpretability, (causal) domain knowledge, and uncertainty into machine learning can address many of its shortcomings and make it a proper scientific methodology.
We think that exploring the relationship between machine learning and science is timely. Scientists from almost every field are nowadays using machine learning in their research:
- There are fields where everyone expects applications of machine learning, such as geoscience, materials science, or neuroscience (Reichstein et al. 2019; Schmidt et al. 2019; Kell et al. 2018).
- But there are also fields where you might be surprised to find machine learning being used, such as anthropology, history, or theoretical physics (Pedersen 2023; Kim and Pavlovic 2014; He 2017).
One thing is certain: machine learning is already changing how science is done and will continue to do so in the future.
Raven Science was stuck. Over the past decade, ravens haven’t progressed on problems of utmost importance for raven society: how to distinguish poisonous from healthy berries? How to predict extreme weather events? Which parks are the most dangerous for ravens? How could the strange noises of humans be translated into proper raven language? It’s not as if there were no hypotheses, but if tested in reality, they often failed. While the ravens had more than enough data – the record-keeping ravens did an excellent job here – they were perplexed about how to deal with it. A peculiar bunch of neuro-stats-computer ravens had recently developed supervised machine learning. They claimed that machine learning models learn by themselves by analyzing data. But this raised a question: Are they still doing science, or have they embarked on a journey toward alchemy?
2.1 Applications of machine learning in science
To get a better feel for how machine learning is being used in science, let’s take a look at some concrete applications, where machine learning plays different scientific roles.
Labeling data: Classifying wildlife animals in the Serengeti
Animal ecologists want to understand and control the complexities of natural ecosystems. To do this, they need to know: What animals live in the ecosystem? How many animals of each species are there? What do the animals do?
For a long time, it was difficult to analyse these questions quantitatively because there was simply no data. Motion sensor cameras, often referred to as camera traps, have made huge amounts of data available. These cheap camera sensors were set up at several representative locations in the Serengeti and took three images in succession as soon as something moved. But to make use of all these images, people had to label them: they had to decide whether an animal was in the picture, determine what species it belonged to, and describe what it was doing. A tedious task!
For this reason, Norouzzadeh et al. (2018) used the already labelled data to train a convolutional neural network that performs this task automatically. It delivers information as shown in Figure 2.1 and achieves an impressive accuracy of 94.9% on the Snapshot Serengeti dataset. Not even human labellers are much better at this task!
The primary goal of the paper is to provide a reliable labeling tool for researchers. The labels for the data should be accurate enough to draw scientific conclusions. Therefore, Norouzzadeh et al. (2018) are very much concerned with the predictive accuracy of the model under changing natural conditions and the uncertainty associated with the predictions.
We move on from a case where machine learning is primarily used for labeling data to a case where machine learning may inform actions.
Informing actions: Forecasting tornadoes to provide early warnings
In the case of severe weather events like tornadoes, every minute counts. You need to find shelter quickly for yourself and your loved ones. But tornadoes are also hard to predict. They form rapidly and the exact conditions required for a tornado to form are not fully understood.
Lagerquist et al. (2020) therefore use machine learning to predict the occurrence of a tornado within the next hour. They trained a convolutional neural network on data from two different sources: storm-centred radar images and short-range soundings. Their model achieves similar performance to ProbSevere, an advanced machine learning system for severe weather forecasting that is already in operation.
The aim of the work is to provide a system that could be deployed to help warn the population of tornadoes. Therefore, the paper centers on the methodology and the assessment of the model’s predictive performance in relevant deployment settings. In particular, the errors of the model are analyzed from a geographical point of view, as are the best and worst predictions for the different types of tornadoes.
Let’s move on to a research project in which the researchers are not only after predictions, but also want to gain scientific insight into the underlying phenomenon.
Gaining insights: Predicting almond yield for better fertilizer use
California is the almond producer of the world: it produces 80% of all the almonds on Earth, probably in the entire Universe. Nitrogen, a fertilizer, plays a key role in growing the nuts.[1] The use of fertilizer is regulated, meaning there’s an upper limit for its use. Before you can develop a good fertilizer strategy, you have to quantify its effect first. Zhang et al. (2019) used machine learning to predict the almond yield of orchard fields, based on weather data, orchard characteristics derived from remote sensing (satellites), and, of course, fertilizer use. The authors examined the model not only for its predictive accuracy but also for how individual features affect the prediction and how important they are for the model’s performance.
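To make "how important a feature is for the model's performance" more concrete, here is a minimal sketch of permutation feature importance, one common way to measure it. It is not the analysis of Zhang et al. (2019): the data are synthetic, the feature names ("nitrogen", "rainfall", "noise") are made up for illustration, and a plain least-squares model stands in for whatever predictor one would actually use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: three made-up features, the last one irrelevant.
feature_names = ["nitrogen", "rainfall", "noise"]
n = 500
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Fit a linear model by least squares as a stand-in for any predictor.
coef, *_ = np.linalg.lstsq(np.column_stack([X, np.ones(n)]), y, rcond=None)


def predict(features):
    """Predict with the fitted linear model (last coefficient is the intercept)."""
    return np.column_stack([features, np.ones(len(features))]) @ coef


def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


def permutation_importance(features, target, n_repeats=10):
    """Average increase in error when one feature column is shuffled."""
    baseline = mse(target, predict(features))
    importances = []
    for j in range(features.shape[1]):
        increases = []
        for _ in range(n_repeats):
            shuffled = features.copy()
            # Shuffling breaks the link between feature j and the target.
            shuffled[:, j] = rng.permutation(shuffled[:, j])
            increases.append(mse(target, predict(shuffled)) - baseline)
        importances.append(float(np.mean(increases)))
    return importances


importances = permutation_importance(X, y)
for name, value in zip(feature_names, importances):
    print(f"{name}: {value:.3f}")
```

For simplicity the sketch measures importance on the training data; in practice one would usually compute it on held-out data. A feature the model relies on heavily shows a large error increase when shuffled, while an irrelevant feature stays close to zero.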
The goal of this research is twofold: prediction and insight. The model may help almond growers make better decisions about fertilizer use but also contribute to the scientific knowledge base. In general, the scientific fields of ecology and agriculture are at the forefront when it comes to adopting machine learning (Lucas 2020; Perry et al. 2022).
The next application of machine learning in science is probably the best known in all disciplines and has become a central scientific building block in bioinformatics.
Providing a scientific building block: Inferring protein structures from amino acid sequences
One of the big goals in bioinformatics is to understand how protein structures are determined by their amino acid sequences. Proteins do a lot of heavy lifting in the body, from building muscles and repairing the body to breaking down food and sending messages. A protein’s structure is largely determined by its sequence of amino acids, the building blocks of protein. The protein’s structure, however, ultimately determines its function. The problem: It is difficult to predict how a protein will be structured if you only have the amino acid sequence. If scientists could do that reliably, it would help them with drug discovery, understanding disease mechanisms, and designing new proteins. Meet AlphaFold (Jumper et al. 2021), a deep neural network that can predict – with rather high performance – protein structures from their amino acid sequence. See Figure 2.2 for one example.
AlphaFold was trained on a dataset of 100,000 protein sequences. Beyond pure prediction, the algorithm aids with molecular replacement (Pereira et al. 2021) and interpreting images of proteins taken with special microscopes (Gupta et al. 2021). There’s now even an entire database, called AlphaFold DB, which stores the predicted protein structures. On its website (“AlphaFold DB Website” 2023) it says:
AlphaFold DB provides open access to over 200 million protein structure predictions to accelerate scientific research.
Not only has the prediction model AlphaFold become part of science, but its predictions serve as a building block for further research. To justify this sensitive role of machine learning, the model needs to be doubly scrutinized.
Understanding concepts: Detecting satire by paying attention
Nowadays, some people have a hard time distinguishing satire from real news. But can an algorithm detect satire?
De Sarkar, Yang, and Mukherjee (2018) trained a neural network on articles from 14 satirical sites, such as The Onion, and on non-satirical news from Google News and The Guardian to detect which is satire and which is not. Words were first embedded and then fed into a transformer, a neural network architecture with an attention mechanism that allows the network to “pay attention” to different parts of the input. And it worked.
Not only is the model usable as a classifier, but the researchers also analyzed which part of the text is important for the classification. A key insight was that the last sentence of the article deserves special attention: it helps to distinguish between satire and real news.
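As a rough illustration of what "paying attention" means computationally, here is a minimal sketch of scaled dot-product self-attention. It is not the architecture used by De Sarkar, Yang, and Mukherjee (2018): the embeddings are random stand-ins for four hypothetical input positions, and the learned projection matrices of a real network are omitted.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def scaled_dot_product_attention(queries, keys, values):
    """Return the attended outputs and the attention weights."""
    d_k = keys.shape[-1]
    # Similarity between every query and every key, scaled by sqrt(d_k).
    scores = queries @ keys.T / np.sqrt(d_k)
    # Each row sums to 1: how much one position attends to all others.
    weights = softmax(scores)
    return weights @ values, weights


rng = np.random.default_rng(0)
# Hypothetical input: 4 positions (e.g. words or sentences), 8 dimensions each.
embeddings = rng.normal(size=(4, 8))

# Self-attention without learned projections: queries = keys = values.
output, weights = scaled_dot_product_attention(embeddings, embeddings, embeddings)
print(weights.round(2))
```

Each row of `weights` describes how strongly one input position draws on every other position. Inspecting such weights is one way to ask which parts of a text a model relied on, although attention weights are not guaranteed to be a faithful explanation of a prediction.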
2.2 What role does machine learning play in science?
You may look at these examples and argue that machine learning is only ever a tool in the scientific process, just like Excel, databases, or Python, or that the scientific goal was only to evaluate the model’s performance.
If this were the case, the value of machine learning in science would be very limited – machine learning could create ‘models’, but these bear little resemblance to scientific models that actually represent phenomena. The most important methodological steps in science – such as creating scientific models or theories, testing them and finding interesting new research problems – would remain unaffected by machine learning.
And indeed, there are good reasons to be cautious about machine learning in science – in its rawest form, machine learning reduces every scientific question to a prediction problem. This caution is echoed by prominent critics of machine learning.
Take Judea Pearl, a proponent of causal inference, who said that
“I view machine learning as a tool to get us from data to probabilities. But then we still have to make two extra steps to go from probabilities into real understanding — two big steps. One is to predict the effect of actions, and the second is counterfactual imagination.” (Pearl 2019)
Pearl refers here to his ladder of causation, which distinguishes three rungs: 1. association, 2. intervention, and 3. counterfactual reasoning (Pearl and Mackenzie 2018). He highlights that machine learning remains on the first rung of this ladder and is only suitable for static prediction.
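The gap between association and intervention can be made concrete with a small simulation. The following sketch assumes a hypothetical linear causal model, invented purely for illustration, in which a confounder Z influences both X and Y and the true causal effect of X on Y is 1.0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical structural causal model: Z -> X, Z -> Y, and X -> Y.
# The true causal effect of X on Y is 1.0 by construction.
z = rng.normal(size=n)
x = 2.0 * z + rng.normal(size=n)
y = 1.0 * x + 3.0 * z + rng.normal(size=n)

# Association: the regression slope of Y on X in observational data.
# Analytically cov(X, Y) / var(X) = (5 + 6) / 5 = 2.2, not 1.0.
assoc_slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)


def simulate_do(x_value):
    """Mean of Y under the intervention do(X = x_value).

    Setting X by hand cuts the arrow Z -> X, so Z no longer co-varies with X.
    """
    z_new = rng.normal(size=n)
    y_new = 1.0 * x_value + 3.0 * z_new + rng.normal(size=n)
    return float(y_new.mean())


# Intervention: change in Y when X is actively moved from 0 to 1.
causal_effect = simulate_do(1.0) - simulate_do(0.0)

print(f"observational slope: {assoc_slope:.2f}")
print(f"interventional effect: {causal_effect:.2f}")
```

A model trained only on the observational data would pick up the confounded slope. That slope is perfectly fine for predicting Y from X under unchanged conditions, but it overstates what happens when someone actually changes X.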
Gary Marcus, one of the most well-known critics of current deep learning, claimed that
“In my judgement, we are unlikely to resolve any of our greatest immediate concerns about AI if we don’t change course. The current paradigm – long on data, but short on knowledge, reasoning and cognitive models – simply isn’t getting us to AI we can trust.” (Marcus 2020)
In the paper, Marcus criticizes the lack of robustness, the ignorance of background knowledge, and the opacity in the current machine learning approach.
Noam Chomsky, one of the founding fathers of modern linguistics, argued:
“Perversely, some machine learning enthusiasts seem to be proud that their creations can generate correct “scientific” predictions (say, about the motion of physical bodies) without making use of explanations (involving, say, Newton’s laws of motion and universal gravitation). But this kind of prediction, even when successful, is pseudoscience. While scientists certainly seek theories that have a high degree of empirical corroboration, as the philosopher Karl Popper noted, ‘we do not seek highly probable theories but explanations; that is to say, powerful and highly improbable theories.’ ” (Chomsky, Roberts, and Watumull 2023)
Chomsky criticizes machine learning for its exclusive focus on prediction rather than developing theories and explanations. In the article, he is particularly doubtful that current large language models provide deep insight into human language.
2.3 Machine learning can be more than a tool
We share these critical views to a certain degree. In their raw form, machine learning algorithms are dumb curve-fitters that lack any notion of causality. Purely predictive models might make bad explanatory models. And machine learning doesn’t produce scientific theories in the form we are used to.
Nevertheless, there are reasons for optimism that these issues can be addressed and that machine learning can play a big role in science. The roles we see machine learning play were already shining through some of the above examples:
- Gain insights: In the case of almond yield prediction, the goal wasn’t only to build a prediction machine, but the researchers also extracted insights from the model.
- Exploration: AlphaFold isn’t a mere proof of concept. For example, protein structure predictions are used by researchers to explore and test new drugs.
- Build theory: Finding out that the last sentence in an article is crucial to identify satire can lead to new theoretical groundwork on what defines satire.
But machine learning cannot fulfil these roles in its raw form. It needs to be equipped with upgrades such as (causal) domain knowledge, interpretability, uncertainty, and so on, to become a new approach for extracting insights from data, generating new hypotheses, or informing people’s actions. We believe that fully upgraded supervised machine learning models have the potential to become a full-blown scientific methodology that helps scientists to better understand phenomena.
The upgrades we need are readily available. Every year, new sub-fields are emerging, ranging from interpretability and causality to robustness and uncertainty quantification. They are just all over the place. If we think of machine learning not in isolation, but in conjunction with these other works, machine learning could provide great scientific value. Like a puzzle, we have to piece all of it together, starting with a justification of why a machine learning approach focused on prediction is a good core idea for research.
But before we can justify machine learning or start solving the puzzle, let’s look at supervised machine learning in its raw form!
[1] Technically, almonds are the seeds of a drupe rather than true nuts, but we’re not biologists and we eat them as if they were nuts.