3  Bare-Bones Supervised Machine Learning

This chapter looks at supervised machine learning, stripping it to its bare bones. Supervised machine learning produces models that output predictions from input feature values. The search for (or learning of) models is done via a learning algorithm and a labeled dataset, consisting of input-output pairs, also called training data. We’ll use tornado prediction as an example throughout this chapter.

A young raven called Rattle was the first to adopt supervised machine learning. At first, the other raven scientists were skeptical. Too new. Unproven. Risky. Not the way of Raven Science. Nevertheless, Rattle started explaining to the first interested ravens what this machine learning was all about.

3.1 Describe the prediction task

To use supervised machine learning, you first have to translate your problem into a prediction task, which requires:

  • Pick a target variable \(y\) to predict. For example, the occurrence of tornadoes in a 1-hour time window, coded as 0 (no tornado) and 1 (tornado).
  • Define the task. The task is related to the target and can range from classification and regression to survival analysis and recommendation. Depending on how we frame our tornado prediction problem, we end up with different task types:
    • Will a tornado occur within the next hour? \(\Rightarrow\) classification
    • How many tornadoes will occur this year? \(\Rightarrow\) regression
    • How long until the next tornado occurs? \(\Rightarrow\) survival analysis
  • Decide on the evaluation metric. This metric defines what counts as a good or bad prediction. To classify tornadoes in 1-hour windows, you could use metrics such as F1 score or accuracy.
  • Choose features \(X\) from which to predict the target. Features could be hourly measurements from radar stations.
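
A task definition like this translates almost directly into code. Below is a minimal sketch, assuming hypothetical hourly radar features and tornado labels stored as NumPy arrays (the variable names and synthetic data are purely illustrative):

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical data: one row per 1-hour window over a 3km x 3km patch.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))     # features: hourly radar measurements
y = rng.integers(0, 2, size=1000)  # target: 1 = tornado, 0 = no tornado

# Classification framing with the F1 score as the evaluation metric.
# A constant "no tornado" prediction serves as a trivial baseline.
baseline = np.zeros(1000, dtype=int)
print(f1_score(y, baseline, zero_division=0))
```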

Once the task is defined, we need data.

3.2 Get the data

Machine learning requires data. Data points are enumerated as pairs of feature and target values \((x^{(i)}, y^{(i)})_{i=1, \ldots, n}\). In the tornado case, \(x^{(145)}\) might be radar measurements from 10 AM to 11 AM on May 25th, 2021, in a specific 3km by 3km patch in the United States. Accordingly, \(y^{(145)}\) would indicate whether a tornado occurred in this time and place.

The dataset is typically randomly split into three subsets for machine learning:

  • A training dataset, \(D_{train}\), used to train the model.
  • A validation dataset, \(D_{val}\), used to validate modeling choices, such as the model class and hyperparameters.
  • A testing dataset, \(D_{test}\), used for final performance evaluation.

The simplest approach is to randomly sample data points without replacement for each split. In reality, you have to adapt the splitting mechanism to the data and the task. For example, time series and clustered data require splitting schemes that respect the data structure. For classification with imbalanced data, you might want to use mechanisms that ensure the minority class occurs often enough in each split. In a completely random 3-way split, our data point \((x^{(145)}, y^{(145)})\) would fall into exactly one of these buckets. In practice, we can use resampling techniques such as cross-validation that repeatedly split the data into training, validation, and testing, which makes more efficient use of the data. Approaches like uncertainty quantification, causal effect estimation, and interpretability may require further splits, but this goes beyond “bare bones”.
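
As a minimal sketch of a completely random 3-way split, here is how it might look with scikit-learn's train_test_split, reusing the hypothetical X and y arrays from the task sketch above:

```python
from sklearn.model_selection import train_test_split

# First hold out the test set, then split the rest into training and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)
# Result: 60% training, 20% validation, 20% testing.
```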

3.3 Train the model

Training a machine learning model means running an algorithm that takes as input the training data and outputs the model. The model describes a function that outputs predictions based on input features and is again implemented algorithmically. The training requires making some choices:

  1. Select a class of models \(F\). For example, decision trees or neural networks. Usually, you have to specify this class further, e.g., by choosing the maximal depth of a tree or the specific architecture of the neural network. Radar readings have a spatial structure, so we might pick a convolutional neural network such as a ResNet for the tornado prediction task.
  2. Choose a training algorithm \(I\). The training algorithm takes the training data and produces a prediction model \(\hat{f}\). For example, neural networks typically use stochastic gradient descent with backpropagation as the training algorithm. But you could also train a neural network using genetic algorithms.
  3. Set the hyperparameters \(h \in H\) before the learning process begins. Hyperparameters control the training algorithm \(I\) and affect the models that the training produces. Some hyperparameters are related to the model class, like the number of layers in your neural net or the number of neighbors in the k-nearest-neighbors algorithm. Others are connected to the training algorithm, such as the learning rate or the batch size in stochastic gradient descent.

For example, when we train a convolutional neural network to predict tornadoes, we use stochastic gradient descent (\(I\)) with the training data \(D_{train}\). To do this, we have hyperparameters \(h \in H\) to set. After the training process, we get a trained CNN, which we call \(\hat{f}\), and it’s a model from the model class of CNNs, which we call \(F\). The training process seeks to produce a model \(\hat{f}\) that makes a minimal error on the training data:

\[err_{train}(f) := \frac{1}{|D_{train}|} \sum_{(x,y) \in D_{train}} L(f(x), y),\]

where \(L\) is the loss function the model optimizes for. Some training algorithms can optimize arbitrary (well, some constraints remain) loss functions directly, like neural networks using gradient descent, while other training algorithms have built-in and sometimes less explicit losses they optimize for, like greedy splits in CART decision trees (Breiman 2017).
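
To keep the sketch short, here is the same training recipe with a small decision tree instead of a CNN: the model class \(F\) is decision trees, the hyperparameter \(h\) is the maximal depth, and the training algorithm \(I\) is CART-style greedy splitting. The data arrays are the hypothetical ones from the earlier splitting sketch:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import zero_one_loss

# Model class F: decision trees; hyperparameter h: maximal tree depth.
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# Training algorithm I: CART-style greedy splitting on the training data.
model.fit(X_train, y_train)

# Training error err_train: average loss L (here the 0-1 loss) on D_train.
print(zero_one_loss(y_train, model.predict(X_train)))
```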

3.4 Validate modeling choices

Training a model doesn’t guarantee that it will be the best-performing model. We might not have picked the best model class. For example, we might use a linear regression model when the best model for the task would be a random forest. And even if we picked the best model class, we might have set the hyperparameters non-optimally. We need a procedure to pick both the model class and the hyperparameters. A naive approach would be to compare the evaluation metric on the training data, but this would be a bad idea since some models may overfit the training data.

Overfitting

A model overfits when it performs well on the training data but poorly on new, unseen data from the same distribution.

We compute the evaluation metric on a separate validation dataset \(D_{val}\) and don’t simply choose the model with the highest evaluation metric on the training data. Since the model wasn’t trained on the validation data, we get a fair assessment of its performance. With the validation data, we can compare multiple models and hyperparameter configurations and pick the best-performing one. This allows us to detect under- and overfitting and to guard against them by properly regularizing and tuning the model.
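
A minimal sketch of validation-based selection, assuming the decision tree model class and the hypothetical splits from above; only the maximal depth is tuned here:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# Evaluate each hyperparameter setting on the validation data, not the training data.
results = {}
for depth in [2, 5, 10, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    results[depth] = (f1_score(y_val, model.predict(X_val), zero_division=0), model)

# Pick the configuration with the best validation score.
best_depth = max(results, key=lambda d: results[d][0])
best_model = results[best_depth][1]
```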

3.5 Evaluate the model on test data

How well would our final model perform? Unfortunately, we can’t use the evaluation metrics we have computed so far from training and validation data. Both will be too optimistic since we already used the training data to train the model and the validation data to make modeling choices. Instead, we evaluate the model performance on test data \(D_{test}\). This gives us an honest estimate of the performance.
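
Continuing the sketch from above, the model selected on the validation data is scored once on the held-out test data:

```python
from sklearn.metrics import f1_score

# The test data was used neither for training nor for model selection,
# so this estimate is not biased by those choices.
print(f1_score(y_test, best_model.predict(X_test), zero_division=0))
```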

3.6 And repeat: the role of resampling data

Having just one split into training, validation, and testing is not very data efficient. Typically, the data is split multiple times. A common technique is cross-validation, which splits the data into k different parts, called folds. Let’s say we use k=10. Nine out of ten folds might be used for training and validation, and the tenth for testing. We cycle through the folds so that each fold is used as test data once. This way, we always evaluate the model on “fresh” data that it wasn’t trained on. Other sampling methods such as bootstrapping and subsampling can be used here as well. But even having multiple folds may not give stable results: if we generated the fold split again, we might get different estimates. So another established method is to repeat the resampling.
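
A minimal sketch of plain (non-nested) 10-fold cross-validation with scikit-learn, again using the hypothetical X and y arrays; cross_val_score fits a fresh model for each fold and evaluates it on the held-out fold:

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 10-fold cross-validation: each fold is held out once while the other
# nine folds are used for fitting; the scores are then averaged.
scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), X, y, cv=10, scoring="f1"
)
print(scores.mean(), scores.std())
```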

We have another choice to make: how to split the remaining nine folds into training and validation data. We could either do a single split or run cross-validation again. Like in the movie Inception, which is about dreams within dreams, we go one level deeper and do cross-validation within cross-validation, a procedure called nested cross-validation (Bates, Hastie, and Tibshirani 2023). The advantage of (nested) cross-validation is better estimates of the model performance and better models, since we use the data more efficiently.
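
A minimal sketch of nested cross-validation under the same assumptions: a cross-validated hyperparameter search (the inner loop) is wrapped inside an outer cross-validation that estimates the performance of the whole procedure:

```python
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Inner loop: cross-validated search over the tree depth.
inner_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 5, 10, None]},
    cv=5,
    scoring="f1",
)

# Outer loop: estimate the performance of the tuning-plus-training procedure.
nested_scores = cross_val_score(inner_search, X, y, cv=10, scoring="f1")
print(nested_scores.mean())
```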

3.7 Bare bones machine learning can be automated

Once you have defined the prediction task and gathered your data, the entire training process can be automated. The subfield of machine learning called AutoML aims to fully automate the ML training process and make ML engineers redundant (Hutter, Kotthoff, and Vanschoren 2019). Upload your data, pick a column as your target, and pick an evaluation metric. Click a button. And the machine does everything for you: data splitting, hyperparameter tuning, model selection, model evaluation. We call this automatable practice of machine learning “bare-bones machine learning”.
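
The spirit of this automation can be sketched with plain scikit-learn; real AutoML systems search far larger spaces and also automate preprocessing, feature engineering, and ensembling (the model classes and grids below are arbitrary illustrations):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# A toy "AutoML" loop: try several model classes and hyperparameters,
# selecting the best configuration by cross-validated F1 score.
search_space = [
    (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 5, 10]}),
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
]
searches = [
    GridSearchCV(estimator, grid, cv=5, scoring="f1").fit(X, y)
    for estimator, grid in search_space
]
best = max(searches, key=lambda search: search.best_score_)
print(best.best_estimator_, best.best_score_)
```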