# 10 Interpretability

This book is a work in progress. We are happy to receive your feedback via science-book@christophmolnar.com

What’s your biggest concern about using machine learning for science? A survey by Van Noorden and Perkel (2023) asked 1600 scientists this question. “Leads to more reliance on pattern recognition without understanding” was the top concern, which is well-founded: Supervised machine learning is foremost about prediction, not understanding (see Chapter 3).

To solve that problem, interpretable machine learning was invented. Interpretable machine learning (or explainable AI^{1}) offers a wide range of solutions to tackle the lack of understanding. Interpretability, in the widest sense, is about making the model understandable to humans (Miller 2019). There are plenty of tools for this task, ranging from using decision rules to applying game theory (Shapley values) and analyzing neurons in a neural network. But before we talk about solutions, let’s talk about the problem. Interpretability isn’t a goal in itself; it is a tool that helps you achieve your actual goals, goals that a prediction-focused approach alone can’t satisfy.

ML became part of the standard toolbox of every knowledgeable scientist. Ravens started building bigger and bigger models to investigate more and more complex phenomena. Everything that could be phrased as a prediction problem was eventually tackled with machine learning. But few problems are just about prediction. The models must have learned interesting relationships in the data, and the ravens felt strongly about exploring them, not only for insights but also for improving the models and justifying them to their fellow ravens. Luckily, a Ph.D. raven named Krähstof Olnar collected a lot of material in his book “Interpretable Machine Learning”, and the ravens started uncovering their models’ secrets.

## 10.1 Goals of interpretation

Imagine you model the yield of almond orchards. The model predicts almond yield in a given year based on precipitation, fertilizer use, prior yield, and so on. You are satisfied with the predictive performance, but you have this nagging feeling of incompleteness that no benchmark can fill.

When you find that a task can’t be solved by performance alone, you might need interpretability (Doshi-Velez and Kim 2017). For example, you might be interested in the effect of fertilizer on almond yield. We roughly distinguish three interpretability goals (inspired by Adadi and Berrada 2018):

- **Discover**: The model may have learned something interesting about the studied phenomenon. In science, we can further distinguish between confirmatory (aka inference) and exploratory types of discovery. In the case of almond yield, the goal might be to study the effect of fertilizer.
- **Improve**: Interpretation of a model can help debug and improve the model. This can lead to better performance and higher robustness (see also Chapter 12). Studying feature importance, you might find that the prior yield variable is suspiciously important and detect that a coding error introduced data leakage.
- **Justify**: Interpretation helps to justify a prediction, or the model itself. Justification ranges from justifying a prediction to an end-user, touching upon ethics and fairness, to more formal model audits, but also to building trust through, e.g., showing coherence with physics. For example, you are trying to convince an agency to adopt your model, but first, you have to convince them that your model aligns with common agricultural knowledge.

The goals interact with each other in the life cycle of a machine learning project. For example, to discover knowledge, your model should show good performance. However, building a performant model is an iterative process in which interpretability can help immensely, for example by monitoring the importance of features.

But how do we achieve interpretability for complex machine learning models? To oversimplify, the field of interpretable machine learning knows two paths: interpretability-by-design and post-hoc interpretability. Interpretability-by-design means using only interpretable, aka simple, models. Post-hoc interpretation is the attempt to interpret potentially complex models after they have been trained.

## 10.2 Interpretability-by-design

Interpretability by design is the status quo in many research fields. For example, quantitative medical research often relies on classical statistical models, with logistic regression being a commonly used model. These statistical models are considered inherently interpretable because they often relate the features to the outcome through a linearly weighted sum in one way or another. Statistical models are also dominant in the social sciences and many other fields. The use of a linear regression model to predict almond yield is not unheard of (Basso and Liu 2019). However, models with very different motivations can also be interpretable by design, such as differential equations and physics-based simulations in fields such as physics, meteorology, and ecology.

If we value interpretability, we could say that we only rely on machine learning algorithms that produce interpretable models. We’d still approach the modeling task in a typical performance-first manner, but restrict our solutions to be only models that we see as interpretable. It is still about optimization, but we strongly limit the hypothesis space, which is the pool of models that are deemed okay to be learned.

Interpretable machine learning models may include:

- Anything with linearity. Essentially the entire pantheon of frequentist models. Linear regression, logistic regression, frameworks such as generalized linear models (GLMs) and generalized additive models (GAMs), …
- Decision trees. Tree structures, typically but not necessarily binary, that derive the prediction from a sequence of feature-based splits.
- Decision rule lists
- Combinations of linear models and decision rules such as RuleFit (Friedman and Popescu 2008) and model-based trees (Zeileis, Hothorn, and Hornik 2008)
- Case-based reasoning

Researchers continue to invent new machine-learning approaches that produce models that are interpretable by design (Rudin et al. 2022), but there’s also controversy as to what even constitutes an interpretable model (Lipton 2017). Arguably, not even a linear regression model is interpretable if you have many features. Others argue that whether a particular model can be seen as interpretable depends on the audience and the context. But whatever your definition of interpretation: by restricting the functional form of the models, they become easier to control. Even if you argue that linear regression isn’t always easy to interpret, it is still true that the model contains no interactions (at least on the input space of the model). And models that are interpretable by design, or at least structurally simpler, can help with the goals of discovery, improvement, and justification.

- **Discover:** A logistic regression model allows insights about the features that increase the probability of, e.g., diabetes.
- **Improve:** Regression coefficients with a wrong direction may signal that a feature was wrongly coded.
- **Justify:** A domain expert can manually check decision rules.

Interpretability by design has a cost: restricting the hypothesis space may exclude many good models, and we might end up with a model that performs worse. The winners in machine learning competitions are usually complex gradient-boosted trees (e.g. LightGBM, CatBoost, XGBoost) for tabular data and neural networks (CNNs, transformers) for image and text data. It is not the models that are interpretable by design that are cashing in the prize money.

Besides lower performance, interpretability-by-design has a conceptual problem when it comes to discovery (inference). To extend the interpretation of the model to the real world, you have to establish a connection between your model and the world. In classical statistical modeling, for example, you make lots of assumptions about the data-generating process. What’s the distribution of the target given the features? What’s the process generating missing values? What correlation structures do we have to account for? Other approaches such as physics-based simulations also have a link between the model and the world: it is assumed that certain parts of the simulation represent parts of the world. For interpretability by design, there is no such link. For example, we have no justification for saying that almond yield is produced by decision-tree-like processes.

Another problem is performance: What do you do when you find models with better predictive performance than your interpretable one? A model that represents the world well should also be able to predict it well. You would have to argue why your interpretable model better represents the world, even though it predicts it less well. You might have just been oversimplifying the world for your convenience. This weakens any claims to link the model to reality. An even worse conceptual problem is the Rashomon effect (Breiman 2001). The effect is named after the Japanese movie in which multiple witnesses of a crime tell different stories that explain the facts equally well. The Rashomon effect describes the phenomenon that multiple models may have roughly the same performance but be structurally different. A head-scratcher for interpretation: If optimization leads to a set of equally performant models with different internal structures, then we can’t, in good conscience, pick one over the other based purely on an argument of performance. An example: Decision trees are inherently interpretable (at least if short) but also inherently unstable – small changes in the data may lead to very different trees, even if the performance doesn’t suffer much. If you use a decision tree for discovery, the Rashomon effect makes it difficult to argue that this is the exact tree you should be interpreting.
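The Rashomon effect is easy to reproduce in a toy setting. In the following sketch (NumPy only, all data hypothetical), two strongly correlated features support two structurally different linear models with nearly identical performance, even though only one feature drives the simulated outcome:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly a copy of x1
y = x1 + rng.normal(scale=0.5, size=n)   # the "true" process uses x1 only

def fit_and_mse(feature):
    """Fit a univariate linear model and return its mean squared error."""
    X = np.column_stack([feature, np.ones(n)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((X @ beta - y) ** 2)

mse_a = fit_and_mse(x1)  # model A tells a story about x1
mse_b = fit_and_mse(x2)  # model B tells a story about x2
print(mse_a, mse_b)      # nearly identical performance
```

Performance alone cannot adjudicate between the two stories, which is exactly the head-scratcher described above.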

Our verdict: Inherently interpretable models are much easier to justify than their more complex counterparts. However, using them for insights (inference) has conceptual problems. Fortunately, there’s also the class of post-hoc interpretation methods, for which we have a theory of how their interpretation may be extended to the modeled phenomenon.

## 10.3 Model-specific post-hoc interpretability

Instead of opposing model complexity, we can also embrace it and try to extract insights from complex models. These interpretability methods are called post-hoc, which translates to “after the event”, the event being model training. Post hoc methods can be either model-specific or model-agnostic. While model-specific interpretation methods work with the structure of the model, model-agnostic methods only work with input-output pairs and treat the model as a black box. Model-specific methods have their name because they are tied to a specific model type and require that we inspect the model. This is best illustrated with examples of model-specific methods:

- Gini importance leverages the splitting criterion in decision trees to assign importance values to features
- Transformers, a popular neural network architecture, have attention layers that decide which parts of the (processed) input to attend to for the prediction. Attention visualization is a model-specific post hoc approach to interpreting transformers.
- Activation maximization approaches assign semantic meaning to individual neurons and layers: What concept maximally activates each neuron?
- Gradient-based explanation methods like Grad-CAM or layer-wise relevance propagation (LRP) make use of neural network gradients to highlight the inputs that affected the predictions strongly

Model-specific methods occupy an odd spot in the interpretability space. The underlying models are usually more complex than the interpretability-by-design ones, and the interpretations can only address subsets of the model. For most interpretation goals, model-specific post-hoc interpretation is worse than interpretability-by-design. And it shares the conceptual problem with interpretability-by-design when it comes to inference: If you want to interpret the model in place of the real world, you need a theory that connects your model with the world and allows a transfer of interpretation. Model-specific approaches, as the name says, rely on the structure of the specific model, so such a link between model and reality would require assumptions about why this model’s structure is representative of the world, assumptions which are probably invalid (Freiesleben 2023). However, these methods may still be useful in pointing out associations for forming causal hypotheses (see Chapter 11).

The line is blurred between interpretability by design and model-specific post-hoc interpretation. We might see a linear regression model as inherently interpretable because the coefficients are directly interpretable as feature effects. But what if we log-transform the target? Then we also have to transform the coefficients for interpretation. Many would argue that this is still interpretable by design, but we can add many modifications that make the model stray further from interpretability heaven. And for interpretation, we rely more and more on post-hoc computations, like transforming the coefficients, visualizing splines, etc.
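As a small illustration of such a post-hoc computation, here is a sketch (NumPy only, toy data with a made-up multiplicative process) of transforming a coefficient back to the original scale after log-transforming the target:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 3, size=5000)
# hypothetical multiplicative data-generating process
y = np.exp(0.5 + 0.8 * x + rng.normal(scale=0.1, size=x.size))

# fit a linear model on the log-transformed target
X = np.column_stack([x, np.ones_like(x)])
(b, a), *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

# post-hoc step: on the original scale, a one-unit increase in x
# multiplies the predicted y by exp(b) instead of adding b
multiplicative_effect = np.exp(b)
print(multiplicative_effect)  # close to exp(0.8)
```

The model is still "interpretable by design", but the interpretation already requires a post-hoc computation on top of the raw coefficient.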

Now that we are done talking down on model-specific interpretation, let’s talk about our favorite methods: model-agnostic interpretation techniques. And yes, we make no secret of it: we are quite the fans of model-agnostic interpretation.

## 10.4 Model-agnostic post-hoc interpretation methods

You are tasked to write a manual for the world’s weirdest vending machine. Nobody knows how it works. You’d try to open it but decide against inspection for fear of breaking the machine. But you have an idea of how you could write that manual. You start pressing some of the buttons and levers until, finally, the machine drops an item: a pack of chips with the flavor “Anchovies”. Fortunately, they are past the due date. You feel no remorse throwing them away. The other good news is that you made the first step toward writing that manual. The secret: Fuck around and find out.

Our fictive almond yield model is just like the vending machine. By intervening in the input features, the output (aka the prediction) changes and we can gather information about the model behavior. The recipe is to sample data, intervene in the data, predict, and aggregate the results (Scholbeck et al. 2020). And that’s why model-agnostic interpretation methods are post-hoc – they don’t require you to access the model internals or change the model training process. Model-agnostic interpretation methods have many advantages:

- You can use the same model-agnostic interpretation method for different models and compare the results.
- You are free to include pre- and post-processing steps in the interpretation. For example, when the model uses principal components as inputs (dimensionality reduction), you can still produce the interpretations based on the original features.
- You can apply model-agnostic methods also to inherently interpretable models like using feature importance on decision trees.

One of the simplest model-agnostic methods to explain is permutation feature importance (PFI): Imagine the almond yield researcher wants to know which features were most important for a prediction. First, they measure the performance of the model. Then they take one of the features, say the amount of fertilizer used, and shuffle it, which destroys the relationship between the fertilizer feature and the actual yield outcome. If the model relies on the fertilizer feature, the predictions will change for this manipulated dataset. For these predictions, the researchers again measure the performance. Usually, shuffling makes the performance worse. The larger the drop in performance, the more important the feature. Figure 10.1 shows an example of permutation feature importance from an actual almond yield paper.
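The shuffle-and-measure recipe can be sketched in a few lines of NumPy. Everything here is a hypothetical stand-in: the "model" is a hard-coded function rather than a fitted almond model, and the feature names are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
fertilizer = rng.normal(size=n)
noise_feat = rng.normal(size=n)  # irrelevant feature
y = 2 * fertilizer + rng.normal(scale=0.3, size=n)

def model(X):
    # stand-in for a trained model that relies only on fertilizer
    return 2 * X[:, 0]

def mse(X):
    return np.mean((model(X) - y) ** 2)

X = np.column_stack([fertilizer, noise_feat])
baseline = mse(X)

importances = {}
for j, name in enumerate(["fertilizer", "noise"]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])    # destroy the feature-target link
    importances[name] = mse(Xp) - baseline  # drop in performance

print(importances)  # fertilizer matters, the noise feature doesn't
```

The importance of the noise feature is zero because the model never uses it; shuffling it leaves the predictions untouched.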

PFI is one of many model-agnostic explanation methods. This chapter won’t introduce all of them, because that’s already its own book called Interpretable Machine Learning (Molnar 2022) and you can read it for free!

But still, we’ll provide a brief overview of the interpretability landscape of model-agnostic methods. The biggest differentiator is local versus global methods:

- Local methods explain individual predictions.
- Global methods interpret the average model behavior.

### 10.4.1 Local: Explaining predictions with model-agnostic methods

The almond researchers might want to explain the yield prediction for a particular field and year. Why did the model make this particular prediction? Explaining individual predictions is one of the holy grails of interpretability. Tons of methods are available. Explanations of predictions attribute, in one way or another, the prediction to the individual features.

Here are some examples of local model-agnostic interpretation methods:

- Local surrogate models (LIME) (Ribeiro, Singh, and Guestrin 2016) explain predictions by approximating the complex model locally with a neighborhood-based interpretable model.
- Scoped rules (anchors) (Ribeiro, Singh, and Guestrin 2018) describe which feature values “anchor” a prediction, meaning that within this range, the prediction can’t be changed beyond a chosen threshold.
- Counterfactual explanations (Wachter, Mittelstadt, and Russell 2017) explain a prediction by examining which features to change to achieve a desired counterfactual prediction.
- Shapley values (Štrumbelj and Kononenko 2014) and SHAP (Lundberg and Lee 2017) are attribution methods that assign the prediction to individual features based on game theory.
- Individual conditional expectation curves (Goldstein et al. 2015) describe how changing features individually change the prediction.

### 10.4.2 Global: Interpreting the average model behavior

While local explanations are about data points, global explanations are about datasets: they describe how the model behaves, on average, for a given dataset, and, by extension, the distribution that the dataset represents.

We can further separate global interpretability:

**Feature importance methods** (Fisher, Rudin, and Dominici 2019) rank features by how much they influence the predictions. Examples:

- Permutation feature importance ranks features by how much permuting the feature destroys the information.
- SAGE (Covert, Lundberg, and Lee 2020) ranks features based on predictive performance when the model is retrained.
- SHAP importance (Lundberg and Lee 2017) ranks features based on the average absolute SHAP values.

**Feature effect methods** describe how features influence the prediction. We can further divide them into main and interaction effects: Main feature effects describe how an isolated feature changes the prediction, as shown in Figure 10.2. Interaction effects describe how features interact with each other to influence the predictions. Examples:

- Partial dependence plots (Friedman 2001)
- Accumulated local effect plots (Apley and Zhu 2020)
- SHAP dependence plots (Lundberg and Lee 2017)

We see global interpretation methods as central, especially for discovery.

### 10.4.3 Model-agnostic interpretation for scientific discovery

How useful is model-agnostic interpretation for scientific insights? Let’s start with two observations:

- Interpretation is first and foremost about the model.
- If you want to extend the model interpretation to the world, you need a theoretical link.

Let’s say we model the almond yield with a linear regression model. The coefficients of the linear equation tell us how the features linearly affect the yield. Without further assumptions, this interpretation concerns only the model. In classical statistical modeling, a statistician might assume that the conditional distribution of yield given the features is Gaussian, that the errors are homoscedastic, that the data are representative, and so on, and only then carefully make a statement about the real world. Such results may in turn have real-world consequences, such as deciding fertilizer usage.

In machine learning, we don’t make these assumptions about the data-generating process but let predictive performance guide modeling decisions. But can we establish a theoretical link between the model interpretation and the world in a different way? We’ve discussed that establishing this link isn’t easy when it comes to model-specific interpretation:

- We would have to justify whether the specific model structures represent the phenomenon we study. That’s unreasonable for most model classes: What reason do we have to believe that phenomena in the world are structured like a decision rule list or a transformer model?
- The Rashomon effect presents us with an unresolvable conflict: If different models have similar predictive performance, how can you justify that the structure of one model represents the world, but the others don’t?

But there’s a way to link model and reality via model-agnostic interpretation. With model-agnostic interpretation, we don’t have to link model components to variables of reality. Instead, we interpret the model’s **behavior** and link these interpretations to the phenomenon (Freiesleben et al. 2024). This, of course, also doesn’t come for free.

- We have to assume there is a true function \(f\) that describes the relation between the underlying features and the prediction target (see Section 13.5.2 for theoretical background on this true function).
- The model needs to be a good representation of function \(f\) or at least we should be able to quantify uncertainties.

Let’s make it more concrete: The almond researcher has visualized a partial dependence plot that shows how fertilizer use influences the predicted yield. Now they want to extend the interpretation of this effect curve to the real world. First, they assume that there is some true \(f\) that their model \(\hat{f}\) tries to approximate, a standard assumption in statistical learning theory (Vapnik 1999).

The partial dependence plot is defined as:

\[PDP(x_j) = \mathbb{E}[\hat{f}(x_j, X_{-j})]\]

and estimated as

\[\widehat{PDP}(x_j) = \frac{1}{n} \sum_{i=1}^n \hat{f}(x_j, x_{-j}^{(i)})\]

The index \(j\) indicates the feature of interest, and \(-j\) indicates all the other features. We can’t estimate the true PDP because \(f\) is unknown, but we can define a theoretical partial dependence plot on the true function by replacing the model \(\hat{f}\) with the real \(f\): \(PDP_{true}(x_j) = \mathbb{E}[f(x_j, X_{-j})]\).
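The PDP estimator translates directly into code. A sketch with NumPy, using a hard-coded stand-in for the trained model \(\hat{f}\) (hypothetical, not a fitted model):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
X = rng.normal(size=(n, 2))  # column 0 is the feature of interest x_j

def f_hat(xj, x_other):
    # stand-in for a trained model
    return xj ** 2 + x_other

def pdp(grid, X):
    other = X[:, 1]
    # one average per grid point: (1/n) * sum_i f_hat(x_j, x_{-j}^(i))
    return np.array([np.mean(f_hat(g, other)) for g in grid])

grid = np.linspace(-2, 2, 9)
curve = pdp(grid, X)
print(curve)  # approximately grid**2, since the other feature averages out near 0
```

Note that the estimator averages over the *observed* values of the other feature; this is exactly where the correlation problems discussed later enter.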

That allows us, at least in theory and simulation, to compare the true PDP with the estimated PDP. That’s what we did in Molnar, Freiesleben, et al. (2023) and discussed more philosophically in Freiesleben et al. (2024).^{2} We’ll give an abbreviated sketch of our ideas here. The error between the true PDP and the model PDP consists of three parts:

- Model bias
- Model variance
- Estimation error of the PDP

Ideally, we can either reduce or remove each source of uncertainty, or at least quantify it. The model bias is the toughest part: we can’t easily quantify it because that would require knowing the true function \(f\). A way to reduce the bias is to train and tune the models well and then assume that the remaining bias is negligible, which is of course a strong assumption, especially considering that many machine learning algorithms rely on regularization, which may introduce bias to reduce variance. An example of model bias: if you use a linear model for non-linear data, the PDP will be biased since it can only capture a linear relation.

The model variance is a source of uncertainty that stems from the model itself being a random variable: the model is a function of the training data, which is merely a sample from the distribution. If we sampled a different dataset from the same distribution, the trained model might come out slightly different. If we retrain the model multiple times with different data (but from the same distribution), we get an idea of the model’s variance. This makes model variance at least quantifiable.
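Retraining on fresh samples can be sketched as follows (NumPy only; the "model" is a simple linear fit and the data-generating process is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)

def draw_and_fit():
    # draw a fresh dataset from the same distribution and refit the model
    x = rng.normal(size=200)
    y = 1.5 * x + rng.normal(scale=1.0, size=200)
    X = np.column_stack([x, np.ones_like(x)])
    return np.linalg.lstsq(X, y, rcond=None)[0][0]  # fitted slope

slopes = np.array([draw_and_fit() for _ in range(200)])
print(slopes.mean(), slopes.std())  # spread of the slope quantifies model variance
```

In practice, you usually can't draw fresh datasets at will; resampling schemes such as the bootstrap play the same role.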

The third uncertainty source is the estimation error: The PDP, like other interpretation methods, is estimated with data, so it is subject to variance. The estimation error is the simplest to quantify since the PDP at a given position \(x\) is an average for which we know the variance.

In our paper, we showed that permutation feature importance has the same three sources of uncertainty when comparing the model PFI with the “true” PFI. We conceptually generalized this approach to arbitrary interpretation methods in Freiesleben et al. (2024). While we haven’t tested the approach for all interpretation methods in practice, our intuition says it should work similarly to PDP and PFI. This line of research is rather new, and we have to see where it leads. This alone might be too thin a justification for using machine learning plus post-hoc interpretation for inference about the real world. But then again, many researchers already use machine learning for insights, so having **some** theoretical justification makes us sleep better.

There’s one challenge to interpretability that gives us bad dreams, specifically for the goal of insights: correlated features.

## 10.5 Correlation may destroy interpretability

When features correlate with each other, model-agnostic interpretation may run into issues. To understand why, let’s revisit how most model-agnostic methods work: Sample data, intervene in the data, get model predictions, and aggregate the results. During the intervention steps, new data points are created, often treating features as independent.

For example, to compute the permutation importance of the feature “absolute humidity” for our almond yield model, the feature gets shuffled (permuted) independently of the other features such as temperature. This can result in unrealistic data points of high humidity but low temperatures. These unrealistic new data points are fed into the model and the predictions are used to interpret the model. This may produce misleading interpretations:

- The model is probed with data points from regions of the feature space where the model doesn’t work well (because it was never trained with data from this region) or even makes extreme predictions.
- Unrealistic data points shouldn’t be used for the interpretation, as they aren’t representative of the studied reality.
- Features don’t change independently in the real world.
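The problem is easy to demonstrate. In this sketch (NumPy, hypothetical data in which temperature and humidity are strongly negatively correlated, an assumption made purely for illustration), shuffling humidity independently wipes out the correlation and thereby creates points the model never saw during training:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
temperature = rng.normal(size=n)
# strong negative correlation, assumed for illustration
humidity = -0.9 * temperature + rng.normal(scale=0.3, size=n)

corr_before = np.corrcoef(temperature, humidity)[0, 1]
humidity_shuffled = rng.permutation(humidity)
corr_after = np.corrcoef(temperature, humidity_shuffled)[0, 1]

print(round(corr_before, 2), round(corr_after, 2))
# after shuffling, high-humidity/high-temperature points appear,
# even though they never occur in the original sample
```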

Given these issues, why do we still use so-called “marginal” methods for interpretation that treat features as independent? Two reasons: 1) It is technically much easier to shuffle a feature independently than to shuffle it in a way that preserves the correlations. 2) In an ideal world, we want a disentangled interpretation where we can study each feature in isolation.

In practice, you can avoid or at least reduce the problem of correlated features with these approaches:

- Study correlations between features. If they are low, you can proceed with the usual marginal version of the interpretation methods.
- Interpret feature groups instead of individual features. For most interpretation methods you can also compute the importance or effect for an entire group of features. For example, you can compute the combined PFI for humidity and precipitation by shuffling them together.
- Use conditional versions of the interpretation methods. For example, conditional feature importance (Watson and Wright 2021), conditional SHAP (Aas, Jullum, and Løland 2021), M-Plot and ALE (Apley and Zhu 2020) and subgroup-wise PDP and PFI (Molnar, König, et al. 2023), Leave-one-covariate out (LOCO) (Lei et al. 2018).
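The second option, interpreting feature groups, amounts to shuffling the whole group with the same permutation. A sketch (NumPy, hypothetical features):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1000
humidity = rng.normal(size=n)
precipitation = 0.8 * humidity + rng.normal(scale=0.5, size=n)
temperature = rng.normal(size=n)
X = np.column_stack([humidity, precipitation, temperature])

perm = rng.permutation(n)
Xg = X.copy()
# shuffle the group {humidity, precipitation} jointly;
# temperature (and the target) stay aligned
Xg[:, :2] = X[perm, :2]

corr_orig = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
corr_group = np.corrcoef(Xg[:, 0], Xg[:, 1])[0, 1]
print(round(corr_orig, 3), round(corr_group, 3))  # identical
```

The joint shuffle breaks the link between the group and the target while keeping the correlation within the group intact; the resulting performance drop is the group importance.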

The third option, conditional interpretation, is more than a technical fix.

Let’s say you want to shuffle the humidity feature to get its importance for almond yield. Humidity is correlated with temperature, so we have to ensure that the new data points respect the correlation structure. Instead of shuffling humidity independently, we sample it based on the values of temperature: if the first data point has a high temperature, we draw a humidity value consistent with that temperature. This leads the “permuted” humidity feature to respect the correlation with temperature, while breaking the relation with the yield target.
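A simple way to approximate such a conditional permutation is to shuffle humidity only within bins of temperature. A sketch (NumPy, hypothetical data; real implementations use proper conditional samplers rather than decile bins):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10000
temperature = rng.normal(size=n)
humidity = -0.9 * temperature + rng.normal(scale=0.3, size=n)

# shuffle humidity only within temperature deciles
edges = np.quantile(temperature, np.linspace(0, 1, 11)[1:-1])
bins = np.digitize(temperature, edges)
humidity_cond = humidity.copy()
for b in np.unique(bins):
    idx = np.where(bins == b)[0]
    humidity_cond[idx] = humidity[rng.permutation(idx)]

corr_cond = np.corrcoef(temperature, humidity_cond)[0, 1]
print(round(corr_cond, 2))  # still strongly negative: correlation preserved
```

Within each bin, the link between humidity and the target is destroyed, yet the shuffled humidity values remain plausible given the temperature.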

Conditional intervention changes the interpretation. For example, for permutation feature importance, the interpretation changes to: Given that the model has access to temperature, how important is it to additionally know the humidity? The more complex the dependence structures between features are, the more complex the (conditional) interpretation becomes. To use conditional interpretation, you must understand how the interpretation changes compared to the marginal versions. But on the other hand, we’d argue that conditional interpretation is the way to go, especially for the goal of insights: not only because it fixes the extrapolation problem, but also because for insights you are interested in how the features are related to the target regardless of the model.

## 10.6 Interpretability is just one part

Model interpretability alone might not solve your problems. Rather, it is strongly interlinked with the other topics covered in this book.

- When you justify the model to your boss, your colleagues, or your peers, you do so by showing that the model coheres with domain knowledge (see Chapter 9).
- If you aren’t aware of how the data were collected and which population the dataset represents, it is unclear what to do with the conclusions from model interpretation (see Chapter 8).
- Understanding causality (see Chapter 11) is crucial for interpretation. For example, a feature effect may be rather misleading if confounders are missing from the model.
- Interpretability for scientific insights is tightly linked to generalization (see Chapter 8).

^{1} What do interpretability and explainability exactly mean? Even researchers in this field can’t agree on a definition (Flora et al. 2022). From an application-oriented perspective, it is most useful to treat these terms interchangeably. Under these keywords, you find approaches that allow you to extract information from the model about how it makes predictions. We use the definition from Roscher et al. (2020): Interpretability is about mapping an abstract concept from the models to an understandable form. Explainability is a stronger term that requires interpretability plus additional context.

^{2} Our work on extending interpretation for insights is what motivated us to write the book you are reading.