
I am attempting to make a very simple summary plot for a random forest classification model using SHAP. Just to see if I could get the syntax correct, I generated a toy example and fit a random forest classifier to the data.

shap version: 0.45.0
Python version: 3.10.12
import shap
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Generate synthetic data
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Train a RandomForest model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

From here, I attempted to use SHAP's TreeExplainer to compute SHAP values for this model.

# Create a SHAP TreeExplainer
explainer = shap.TreeExplainer(model)

# Calculate SHAP values for the test set
shap_values = explainer.shap_values(X_test)

According to the documentation this returns the following:

"For models with a single output this returns a matrix of SHAP values (# samples x # features). Each row sums to the difference between the model output for that sample and the expected value of the model output (which is stored in the expected_value attribute of the explainer when it is constant). For models with vector outputs this returns a list of such matrices, one for each output."

I had thought this would be a single-output model (since this is a binary classification problem), but the returned object instead behaves as if the model were multiclass. Checking the shapes gives the following:

X_test shape: (125, 20)
shap_values shape: (125, 20, 2)
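
A quick way to see what that trailing dimension is (a minimal sanity check, reusing the objects above, not part of the original question): for a binary classifier the two predicted class probabilities sum to one, so the two class slices should be exact negatives of each other.

import numpy as np

# The last axis indexes the two classes; because p(class 0) = 1 - p(class 1),
# the per-class SHAP attributions should mirror each other exactly.
print(shap_values.shape)  # (125, 20, 2) — samples x features x classes
print(np.allclose(shap_values[:, :, 0], -shap_values[:, :, 1]))  # True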

Attempting to run the summary plot command using these values gives me a bizarre 2 by 2 image, which I've included below.

shap.summary_plot(shap_values, X_test, plot_type="bar", max_display=None)

[Image: the resulting summary plot, rendered as a 2-by-2 grid of bar charts]

I'm unsure what exactly is causing this, except that it may be explaining the individual class probabilities rather than a single flat prediction.
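
One way to check that suspicion (a small diagnostic sketch, reusing the explainer defined above): the explainer's expected_value attribute holds one base value per explained model output, so for this model it should contain two entries, one per class probability.

# One base (expected) value per model output; two entries here means SHAP
# is explaining both class-probability outputs, not one flat prediction.
print(explainer.expected_value)
# e.g. array([0.496, 0.504]) — illustrative; exact values depend on the data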

Comments:

  • I am having the exact same issue with version 0.45.1. (May 26, 2024)
  • Yes, it seems this was caused by the 0.45.0 update. According to the release notes, one of the changes was to the "type and shape of returned SHAP values in some cases, to be consistent with model outputs". I guess no one has updated the tutorials yet: shap.readthedocs.io/en/latest/release_notes.html (May 16)
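
For code that has to run across shap versions, one option is to normalize the two return formats described in that comment. This is a hypothetical helper (class1_shap_values is not part of shap), written for a binary classifier:

def class1_shap_values(shap_values):
    """Return the class-1 SHAP matrix under either return format."""
    # shap < 0.45: a list of per-class (n_samples, n_features) arrays
    if isinstance(shap_values, list):
        return shap_values[1]
    # shap >= 0.45: a single (n_samples, n_features, n_classes) array
    return shap_values[:, :, 1]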

1 Answer


It turns out that SHAP is in fact treating this as a multiclass problem. To get the proper plot, you need to select the SHAP values for the desired class; for binary classification, that means slicing out the positive class along the last axis:

# Create a Tree SHAP explainer and calculate SHAP values
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Plot only the class-1 slice of the (samples, features, classes) array
shap.summary_plot(shap_values[:, :, 1], X_test)

This makes sense given the (samples, features, classes) shape of the SHAP values.
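
Alternatively (a sketch assuming shap >= 0.45), you can call the explainer directly to get a shap.Explanation object, which supports the same slicing with the newer plots API:

# Calling the explainer returns a shap.Explanation rather than a raw array
explanation = explainer(X_test)

# Beeswarm summary and bar plot for the class-1 output
shap.plots.beeswarm(explanation[:, :, 1])
shap.plots.bar(explanation[:, :, 1])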
