2  Introduction

Feedback wanted

This book is a work in progress. We are happy to receive your feedback via science-book@christophmolnar.com

Machine learning (ML) has not just entered science but has revolutionized it. This book aims to explore and contextualize this adoption of ML into science, asking whether it’s justified or not.

Almost every scientific discipline uses machine learning today, in fields as diverse as geoscience (Reichstein et al. 2019), materials science (Schmidt et al. 2019), neuroscience (Kell et al. 2018), education science (Luan and Tsai 2021), and archeology (Bickler 2021). You may even find machine learning in unexpected places like anthropology (Pedersen 2023), history (Kim and Pavlovic 2014), or theoretical physics (He 2017). Scientists use ML to analyze different data structures, ranging from tabular data (Gao et al. 2020), images (Erickson et al. 2017), and geospatial data (Ren et al. 2021) to text data (Neethu and Rajasree 2013), graphs (Hu et al. 2020), and sounds (Oikarinen et al. 2019). However, not every discipline adopts ML to the same extent. The adoption rate depends on how well ML suits the discipline’s problems and on the discipline’s methodological culture (Hacking 1992). One thing is certain: machine learning is changing how we do science and it is here to stay.

Before discussing the justification, let’s explore some applications of supervised machine learning in science.

Raven Science was stuck. Over the past decade, ravens hadn’t progressed on problems of utmost importance for raven society: How to distinguish poisonous from healthy berries? How to predict extreme weather events? Which parks are the most dangerous for ravens? How could the strange noises of humans be translated into proper raven language? It’s not as if there were no hypotheses, but when tested in reality, they often failed. While the ravens had more than enough data – the record-keeping ravens did an excellent job here – they were perplexed about how to deal with it. A peculiar bunch of neuro-stats-computer ravens had recently developed supervised machine learning. They claimed that machine learning models learn by themselves by analyzing data. But this raised a question: Are they still doing science, or have they embarked on a journey toward alchemy?

2.1 Forecasting tornadoes to provide early warnings

In the case of severe weather events like tornadoes, every minute counts. But tornadoes are also hard to predict as they form rapidly, and the exact conditions needed for a tornado to form are not fully understood. Lagerquist et al. (2020), therefore, trained a machine learning model to predict the occurrence of a tornado within the next hour. The classification model is a convolutional neural network trained on storm-centered radar images and proximity soundings. Compared to other scientific disciplines, weather forecasting is traditionally a prediction-focused trade, where prediction gets more emphasis than explanation. Using machine learning puts even more emphasis on prediction. The paper was published in “Monthly Weather Review”, a journal focused on “research relevant to the analysis and prediction of observed atmospheric circulations and physics” among others. A search for “machine learning” reveals that only 143 out of 40,500 articles in this journal mention machine learning (as of August 2023). Early days still, at least for this community.

The paper focuses on the methodology behind the prediction model and how well it performs. This more method-centric approach is typical for machine learning research in general and stands in contrast to papers that try to draw conclusions about the underlying phenomenon. Lagerquist et al. (2020) go a small step further than only analyzing the performance: they also analyze the model’s errors by geography and the overall best and worst predictions.

The following example shows a research project in which the researchers are not only after predictions but also want to gain insight into the underlying phenomenon.

2.2 Predicting almond yield for better fertilizer use

California is the almond producer of the world: it produces 80% of all the almonds on Earth, probably in the entire Universe. Nitrogen, a fertilizer, plays a key role in growing the nuts.1 The use of fertilizer is regulated, meaning there’s an upper limit for its use. Before you can develop a good fertilizer strategy, you first have to quantify its effect. Zhang et al. (2019) used machine learning to predict the almond yield of orchard fields based on weather, orchard characteristics obtained via remote sensing (satellites), and, of course, fertilizer use. The authors examined the model not only for its predictive accuracy, but also for how individual features affect the prediction and how important they are for the model’s performance.
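To make "how important a feature is for the model’s performance" concrete, here is a minimal sketch of one common technique, permutation feature importance: shuffle one feature at a time and measure how much the prediction error grows. This is an illustration only, not the method or data of Zhang et al. (2019). The data are synthetic, the feature names ("fertilizer", "rainfall", "noise") are hypothetical, and a plain least-squares model stands in for whatever model the authors used.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in mean squared error when each column of X is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((y - predict(X)) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling column j breaks its link to the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean((y - predict(X_perm)) ** 2) - baseline)
        importances[j] = np.mean(increases)
    return importances

# Synthetic stand-in data (NOT the almond data); feature names are hypothetical.
feature_names = ["fertilizer", "rainfall", "noise"]
rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Simple train/test split; importance is measured on held-out data.
X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

# Ordinary least squares with an intercept, as a stand-in model.
A_train = np.column_stack([X_train, np.ones(len(X_train))])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

def predict(X_new):
    return np.column_stack([X_new, np.ones(len(X_new))]) @ coef

importances = permutation_importance(predict, X_test, y_test)
for name, value in zip(feature_names, importances):
    print(f"{name}: {value:.3f}")
```

In this toy setup the feature with the largest true effect gets the largest importance and the pure-noise feature gets an importance close to zero. With real data, correlated features can make such scores harder to interpret.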

The goal of this research is twofold: prediction and insight. The model may help almond growers make better decisions about fertilizer use and also contribute to the scientific knowledge base. In general, the scientific fields of ecology and agriculture are at the forefront when it comes to adopting machine learning (Lucas 2020; Perry et al. 2022).

The next example highlights yet another role of ML models: ML predictions can become scientific building blocks.

2.3 Predicting protein structures

One of the big goals in bioinformatics is to understand how protein structures are determined by their amino acid sequences. Proteins do a lot of heavy lifting in the body, from building muscles and repairing the body to breaking down food and sending messages. A protein’s structure is largely determined by its sequence of amino acids, the building blocks of proteins, and the structure in turn determines the protein’s function. The problem: it’s difficult to predict how a protein will be structured if you only have the amino acid sequence. If we could do that reliably, it would help us with drug discovery, understanding disease mechanisms, and designing new proteins. Meet AlphaFold (Jumper et al. 2021), a deep neural network that can predict – with rather high accuracy – protein structures from their amino acid sequences. See Figure 2.1 for one example.

Figure 2.1: Protein structure, predicted in blue, determined by experiment in green. Figure by Jumper et al. (2021), CC-BY

AlphaFold was trained on a dataset of 100,000 protein sequences. Beyond pure prediction, the algorithm aids with molecular replacement (Pereira et al. 2021) and interpreting images of proteins taken with special microscopes (Gupta et al. 2021). There’s now even an entire database, called AlphaFold DB, which stores the predicted protein structures. On its website (“AlphaFold DB Website” 2023) it says:

AlphaFold DB provides open access to over 200 million protein structure predictions to accelerate scientific research.

Not only has the prediction model AlphaFold become part of science, but its predictions serve as a building block for further research. To justify this sensitive role of ML, the model needs to be doubly scrutinized.

2.4 Detecting satire by paying attention

Nowadays, some people have a hard time distinguishing satire from real news. But can an algorithm detect satire? De Sarkar, Yang, and Mukherjee (2018) trained a neural network on 14 satirical sites such as The Onion and on non-satirical news from Google News and The Guardian to detect which is satire and which is not. Words were first embedded and then fed into a transformer, which is a neural network architecture with an attention mechanism that allows the network to “pay attention” to different parts of the input. And it worked. Not only is the model usable as a classifier, but the researchers also analyzed which part of the text is important for the classification. A key insight was that the last sentence of the article deserves special attention: it helps to distinguish between satire and real news.
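The idea of “paying attention” can be sketched in a few lines. The example below implements generic scaled dot-product attention, the form commonly used in transformers; it is not the architecture of De Sarkar, Yang, and Mukherjee (2018), whose details may differ. As simplifying assumptions, the embeddings are random placeholders and queries, keys, and values are all set to the same embeddings, without the learned projections a real model would have.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return the attended values and the attention weights."""
    d_k = Q.shape[-1]
    # Similarity between every query and every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Each row is a probability distribution over the input positions.
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Placeholder input: 5 hypothetical tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 8))

# Simplification: no learned projections, so Q = K = V = embeddings.
output, weights = scaled_dot_product_attention(embeddings, embeddings, embeddings)

print(weights.round(2))  # row i shows how strongly token i attends to each token
```

Each row of the weight matrix sums to one, so it can be read as how the model distributes its attention over the input. Inspecting such weights is one way to ask which parts of a text matter for a classification.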

2.5 Is machine learning only a tool?

In all these examples, you could say that machine learning is either a tool in the scientific process, or the scientific goal was simply to evaluate how well the machine learning model worked for folding proteins and predicting tornadoes. One might think that machine learning models are “just” algorithms, mere tools for doing science, just as researchers use Excel, databases, or Python. If this is true, the value of machine learning in science might be very limited – machine learning might produce ‘models’, but they bear little resemblance to scientific models that represent phenomena. The largest methodological steps in science – such as building scientific models or theories, testing them, and finding interesting new research problems – would remain untouched by machine learning.

2.6 Prominent scientists criticize machine learning

We see good reasons to be cautious about supervised ML in science: ML in its rawest form, where the only focus is on prediction, is unsuitable as a building block for science. This caution is echoed by prominent critics of supervised ML.

Take Judea Pearl, a proponent of causal inference, who said that

“I view machine learning as a tool to get us from data to probabilities. But then we still have to make two extra steps to go from probabilities into real understanding — two big steps. One is to predict the effect of actions, and the second is counterfactual imagination.” (Pearl 2019)

Pearl refers here to his ladder of causation, which distinguishes three rungs: 1. association, 2. intervention, and 3. counterfactual reasoning (Pearl and Mackenzie 2018). He highlights that ML remains on the first rung of this ladder and is only suitable for static prediction.

Gary Marcus, one of the most well-known critics of current deep learning, claimed that

“In my judgement, we are unlikely to resolve any of our greatest immediate concerns about AI if we don’t change course. The current paradigm – long on data, but short on knowledge, reasoning and cognitive models – simply isn’t getting us to AI we can trust.” (Marcus 2020)

In the paper, Marcus criticizes the lack of robustness, the ignorance of background knowledge, and the opacity in the current ML approach.

Noam Chomsky, one of the founding fathers of modern linguistics, argued:

“Perversely, some machine learning enthusiasts seem to be proud that their creations can generate correct “scientific” predictions (say, about the motion of physical bodies) without making use of explanations (involving, say, Newton’s laws of motion and universal gravitation). But this kind of prediction, even when successful, is pseudoscience. While scientists certainly seek theories that have a high degree of empirical corroboration, as the philosopher Karl Popper noted, ‘we do not seek highly probable theories but explanations; that is to say, powerful and highly improbable theories.’ ” (Chomsky, Roberts, and Watumull 2023)

Chomsky criticizes ML for its exclusive focus on prediction rather than developing theories and explanations. In the article, he is particularly doubtful that current large language models provide deep insight into human language.

2.7 Machine learning can be more than a tool

We share these critical views to a certain degree. Machine learning algorithms can be dumb curve-fitters that lack any notion of causality. Purely predictive models might make bad explanatory models. And machine learning doesn’t produce scientific theories of the kind we are used to.

Nevertheless, there are reasons for optimism that these issues can be addressed and that machine learning can play a big role in science. This role was already shining through in some of the examples above:

  • Gain insights: In the case of almond yield prediction, the goal wasn’t only to build a prediction machine, but the researchers also extracted insights from the model.
  • Exploration: AlphaFold isn’t a mere proof of concept. For example, protein structure predictions are used by researchers to explore and test new drugs.
  • Build theory: Finding out that the last sentence in an article is crucial for identifying satire can lead to new theoretical groundwork on what defines satire.

Some scientists see supervised ML only as a tool or labeling device. For others, the supervised ML model is the scientific object under investigation or a proof of concept that an event can be predicted in principle. The more upgrades supervised ML gets from causality, interpretability, and so on, the more it will become a new approach to gaining insights from data, generating new hypotheses, or informing people’s actions. Ultimately, we believe that fully upgraded supervised ML models have the potential to be part of scientific models of the phenomena they describe.

We see great potential for “upgraded machine learning” especially in areas with complex, unstructured data types where theory building is difficult, such as neuroscience, linguistics, or medicine.

2.8 Making machine learning useful for science

Problems with supervised ML occur when it’s employed in a raw, prediction-first way. A purely predictive approach will have limited effectiveness in science as it neglects other scientific goals like controlling, explaining, or reasoning about phenomena. However, we believe that with some upgrades, machine learning can become a central piece of science. The upgrades we need are readily available.

New research fields emerge every year, ranging from interpretability and causality to robustness and uncertainty quantification. They are just scattered all over the place. If we think of machine learning not in isolation but in conjunction with this other work, machine learning could provide great scientific value.

As with a puzzle, we have to piece it all together, starting with a justification of why a machine learning approach focused on prediction is a good core idea for research. The piece we want to start with is obvious: let’s talk about supervised ML in its raw form and its main strength – prediction!

  1. Technically, they are the seeds of drupes, but we’re not biologists and we eat them as if they were nuts.