2 Bare-Bones Supervised Machine Learning
This chapter looks at supervised machine learning, stripped down to its bare bones. Why supervised machine learning? Of course, the applications of unsupervised learning and reinforcement learning in science are no less interesting. However, after looking at many scientific applications, we realized that the primary goal was often to solve a supervised learning problem. Unsupervised and reinforcement learning techniques are often used in support of this goal, e.g., by providing powerful representations or by fine-tuning models.
What is supervised machine learning about? Think back to the tornado prediction example from the introduction.
Supervised machine learning produces models that output predictions, for example, concerning the occurrence of a tornado in the next hour. To obtain these predictions, the model must be fed so-called input feature values, e.g., storm-centered radar images and short-range soundings. The search for a good model is carried out by a learning algorithm using a labeled dataset consisting of input-output pairs, also called training data.
A young Raven called Rattle was the first to adopt supervised machine learning. At first, the other Raven scientists were skeptical. Too new. Unproven. Risky. Not the way of Raven Science. Nevertheless, Rattle began to explain to the first interested Ravens what machine learning was all about.
2.1 Describe the prediction task
To use supervised machine learning, you first have to translate your problem into a prediction task with the following ingredients:
- Pick a target variable to predict. For example, the occurrence of tornadoes in a 1-hour time window, coded as 0 (no tornado) and 1 (tornado).
- Define the task. The task is related to the target and can range from classification and regression to survival analysis and recommendation. Depending on how you frame the tornado prediction problem, you end up with different task types:
  - Will a tornado occur within the next hour? → classification (Tornado Y/N)
  - How many tornadoes will occur this year? → regression (0, 1, 2, …)
  - How long until the next tornado occurs? → survival analysis (e.g., 2h)
- Decide on the evaluation metric. This metric defines what counts as a good or bad prediction. To classify tornadoes in 1-hour windows, you could use metrics such as the F1 score¹ or accuracy (see the sketch after this list).
- Choose features from which to predict the target. Features could be hourly measurements from radar stations, like wind speeds and precipitation.
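To see why the choice of metric matters for rare events like tornadoes, here is a minimal sketch, assuming Python with scikit-learn (which the other sketches in this chapter also assume) and made-up toy labels: a classifier that always predicts "no tornado" achieves high accuracy but an F1 score of zero.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels for 100 one-hour windows: 1 = tornado, 0 = no tornado.
# Tornadoes are rare, so the classes are heavily imbalanced.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a degenerate "model" that always predicts "no tornado"

print(accuracy_score(y_true, y_pred))             # 0.95 -- looks great
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0  -- never detects a tornado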
Once the task is defined, you need data.
2.2 Get the data
Machine learning requires data. Data points are represented as pairs $(x^{(i)}, y^{(i)})$ of feature and target values.
After cleaning and pre-processing, the data is typically split randomly into three subsets for machine learning:
- A training dataset, $\mathcal{D}_{\text{train}}$, used to train the model.
- A validation dataset, $\mathcal{D}_{\text{val}}$, used to validate modeling choices and model selection.
- A testing dataset, $\mathcal{D}_{\text{test}}$, used for the final performance evaluation.
The simplest way is to randomly split your dataset into these three buckets. In reality, you have to adapt the splitting mechanism to your data and task. For example, time series and clustered data require splitting schemes that respect the data structure. For classification with imbalanced data, you might want to use mechanisms such as stratified sampling that ensure the minority class occurs often enough in each subset. Anyway, in a completely random 3-way split, each data point $(x^{(i)}, y^{(i)})$ ends up in exactly one of the three subsets.
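A minimal sketch of such a three-way split, assuming Python with scikit-learn and synthetic stand-in data (real radar features would replace the random numbers):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))        # stand-in for radar-derived features
y = rng.binomial(1, 0.05, size=1000)  # rare tornado label, ~5% positives

# First carve off the test set, then split the rest into training/validation.
# stratify keeps the share of the rare class similar in every subset.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest)  # 0.25 * 0.8 = 0.2

# Result: 60% training, 20% validation, 20% test data.
```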
2.3 Train the model
Training a machine learning model means running an algorithm that takes the training data as input and outputs the model. The model is a function that maps input features to predictions.
The training requires making some choices:
- Select a class of models $\mathcal{F}$. For example, decision trees or neural networks. Usually, you have to specify this class further, e.g., by choosing the maximal depth of a tree or the specific architecture of the neural network. Radar readings are spatially structured, so you might pick a convolutional neural network such as ResNet for the tornado prediction task.
- Choose a training algorithm. The training algorithm takes the training data and produces a prediction model $\hat{f}$. For example, neural networks typically use stochastic gradient descent with backpropagation as the training algorithm. But you could also train a neural network using genetic algorithms.
- Set hyperparameters. Hyperparameters control the training algorithm and affect the models that the training produces. Some hyperparameters are related to the model class, like the number of layers in your neural net or the number of neighbors in the k-nearest-neighbors algorithm. Others are connected to the training algorithm, such as the learning rate or the batch size in stochastic gradient descent.
For example, when you train a convolutional neural network to predict tornadoes, you use stochastic gradient descent (SGD) as the training algorithm, with hyperparameters such as the learning rate and batch size. The training process seeks to produce a model $\hat{f}$ that minimizes the loss on the training data:

$$\hat{f} = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{n} L\left(y^{(i)}, f(x^{(i)})\right),$$

where $L$ is the loss function, $\mathcal{F}$ is the chosen model class, and $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, n$, are the training data points.
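As a sketch of these three choices, the following hypothetical scikit-learn example trains a linear classifier with SGD instead of a full CNN, which would be overkill here; model class, training algorithm, and hyperparameters are marked in the comments (the loss name `"log_loss"` assumes a recent scikit-learn version):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(600, 8))        # stand-in training features
y_train = rng.binomial(1, 0.05, size=600)  # stand-in tornado labels

# Model class: linear classifiers with logistic loss.
# Training algorithm: stochastic gradient descent.
# Hyperparameters: constant learning rate eta0, regularization strength alpha.
model = SGDClassifier(loss="log_loss", learning_rate="constant",
                      eta0=0.01, alpha=1e-4)
model.fit(X_train, y_train)         # run the training algorithm on the training data

y_hat = model.predict(X_train[:5])  # the trained model maps features to predictions
```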
2.4 Validate modeling choices
Training doesn’t guarantee the best-performing model. For example, you might not have picked the best model class: perhaps you used linear regression, but the best model for the task is a tree ensemble. And even if you picked the best model class, you might have set the hyperparameters non-optimally. You need a procedure to pick both the model class and the hyperparameters. A naive approach would be to compute the evaluation metric on the training data, but this is a bad idea, since some models overfit the training data.
We say a model overfits when it performs well on the training data but poorly on new, unseen data from the same distribution. A model that overfits is good at reproducing the random noise in the training data but fails to capture the general patterns.
You typically compute the evaluation metric on a separate validation dataset $\mathcal{D}_{\text{val}}$ that was not used for training. Comparing validation scores across model classes and hyperparameter settings lets you pick the most promising combination, as in the sketch below.
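A minimal sketch of this procedure, reusing the hypothetical `X_train`, `X_val`, `y_train`, `y_val` from the splitting sketch above: we try several tree depths and keep the one with the best validation F1 score.

```python
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

best_depth, best_f1 = None, -1.0
for depth in [2, 4, 8, 16]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)                    # train on training data only
    score = f1_score(y_val, model.predict(X_val),  # score on held-out validation data
                     zero_division=0)
    if score > best_f1:
        best_depth, best_f1 = depth, score

print(f"best max_depth: {best_depth}, validation F1: {best_f1:.2f}")
```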
2.5 Evaluate the model on test data
How well would your final model perform? Unfortunately, you can’t rely on the evaluation metrics computed on the training and validation data. Both will be too optimistic, since you already used the training data to train the model and the validation data to make modeling choices. Instead, you have to evaluate the model’s performance on test data $\mathcal{D}_{\text{test}}$, which was used neither for training nor for validation.
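Continuing the hypothetical sketch from the previous sections, the final evaluation touches the test set exactly once:

```python
# Retrain with the chosen hyperparameter, then evaluate once on the test set.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)  # optionally: refit on training + validation data
test_f1 = f1_score(y_test, final_model.predict(X_test), zero_division=0)
print(f"test F1: {test_f1:.2f}")   # report this number; never tune against it
```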
2.6 And repeat: the role of resampling data
Having just one split into training, validation, and testing is not very data-efficient. Typically, the data is split multiple times. A common technique is cross-validation, which splits the data into $k$ different parts. Let’s say you use $k = 10$: in each of 10 rounds, one part serves as test data, and the remaining nine parts are used for training and validation.
You have another choice to make: how to split those nine parts into training and validation data. You could either do a single split or do cross-validation again. Like in the movie Inception, which is about dreams within dreams, you go one level deeper and do cross-validation within cross-validation, a procedure called nested cross-validation [2]. The advantages of (nested) cross-validation are better estimates of the model performance and better models, since you use the data more efficiently.
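A sketch of nested cross-validation with scikit-learn, again on the hypothetical `X`, `y` from the splitting sketch: an inner cross-validated grid search picks the tree depth, and an outer loop estimates the performance of that whole tuning procedure.

```python
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Inner loop: 5-fold CV chooses max_depth; outer loop: 10-fold CV
# estimates the performance of the tuned-model procedure.
inner = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8, 16]},
    scoring="f1",
    cv=5,
)
outer_scores = cross_val_score(inner, X, y, scoring="f1", cv=10)
print(outer_scores.mean())  # performance estimate across the outer folds
```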
2.7 Bare-bones machine learning can be automated
Once you have defined the prediction task and collected your data, the entire training process can be automated. The subfield of machine learning called AutoML aims to do exactly that and make machine learning engineers redundant [3]. Upload your data, pick a column as your target, and pick an evaluation metric. Click a button. And the machine does everything for you: data splitting, hyperparameter tuning, model selection, model evaluation. We call this automatable practice of machine learning “bare-bones machine learning”. The big question is: How does such an optimization-focused approach mix with a complex practice like science?
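As an illustration only: with an AutoML library such as auto-sklearn, the whole loop shrinks to a few lines. This is a hedged sketch, not an endorsement, and the exact API varies between libraries and versions.

```python
import autosklearn.classification
import autosklearn.metrics

# Hand over data, target, and metric -- the library does the splitting,
# model selection, hyperparameter tuning, and ensembling internally.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,    # search budget in seconds
    metric=autosklearn.metrics.f1,  # the evaluation metric we picked above
)
automl.fit(X_train, y_train)
print(automl.score(X_test, y_test))
```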
¹ The F1 score is a metric that balances precision and recall, particularly useful for evaluating performance on imbalanced datasets.