I'm running a random forest model and, to get feature importances, I'm trying to run a SHAP analysis. The problem is that every time I try to plot the SHAP values, I get this error:

DimensionError: Length of features is not equal to the length of shap_values. 

I don't know what's going on. When I run my XGBoost model, everything works fine and I can see the SHAP plot for the data set. It's the exact same data set, but it just won't run with random forest. It's a binary classification problem.

Here is my Python code:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Remove the primary key column 'id' from the features

features = result.drop(columns=['PQ2', 'id'])  # Drop target and ID columns
target = result['PQ2']  # Target variable

# Split data into training and testing sets with 80-20 ratio
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
 
# Initialize Random Forest classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the model on the training data
rf_model.fit(X_train, y_train)

# Make predictions
y_pred = rf_model.predict(X_test)

import shap

# Create a Tree SHAP explainer for the Random Forest model
explainer = shap.TreeExplainer(rf_model)

# Calculate SHAP values for the test set
shap_values = explainer.shap_values(X_test)

# Plot a SHAP summary plot
shap.summary_plot(shap_values, X_test, feature_names=features_names)

# Plot a SHAP bar plot for global feature importance

shap.summary_plot(shap_values, X_test, feature_names=features_names, plot_type="bar")

The shape of the test set is (829, 22), yet the SHAP values consistently come back as (22, 2) for random forest, and I don't know how to fix it. The data set has been preprocessed; the columns are either 0/1 indicators or numerical.

  • What is the size of feature_names? On what line is the error? Commented Mar 18 at 8:09
  • Why use the old API of {shap}? Why decompose 800k predictions? Why run extremely deep trees, knowing about TreeSHAP's memory complexity? Why not provide a working example? Commented Mar 18 at 18:39
  • Can you post some data to reproduce the error, @Starterkit07? Commented Mar 26 at 12:40

1 Answer

I suspect the issue is that the shape of the shap_values output differs depending on the model used (e.g., XGBoost vs. RandomForestClassifier).
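One way to check this suspicion is to print what each explainer returns before plotting. The sketch below uses placeholder zero arrays with the shapes from the question rather than real SHAP output, and `describe_shap_output` is a hypothetical helper name, not part of shap. It also shows that indexing a 3-D array with `[0]` gives a (22, 2) slice, which may be where the (22, 2) in the question comes from.

```python
import numpy as np


def describe_shap_output(shap_values):
    """Summarize the container type and shape of a SHAP result."""
    if isinstance(shap_values, list):
        # Some shap versions return one 2-D array per class.
        return "list of %d arrays, each %s" % (
            len(shap_values),
            np.shape(shap_values[0]),
        )
    return "array of shape %s" % (np.shape(shap_values),)


# Placeholder arrays standing in for real explainer output (assumed shapes).
n_samples, n_features, n_classes = 829, 22, 2
rf_like = np.zeros((n_samples, n_features, n_classes))   # per-class layout
xgb_like = np.zeros((n_samples, n_features))              # single-output layout

print(describe_shap_output(rf_like))
print(describe_shap_output(xgb_like))
print(rf_like[0].shape)  # one sample from the 3-D layout
```

If the random forest output reports three dimensions while the XGBoost output reports two, the plotting error is a layout mismatch and not a problem with the data.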

You can generate the SHAP plots by slicing the shap_values array down to two dimensions, so that its shape matches that of X_test.

Since I don't have your data, I generated a sample dataset to use as an example:

import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Generate sample data
np.random.seed(42)
features = pd.DataFrame({
    "feature_1": np.random.randint(18, 70, size=100),
    "feature_2": np.random.randint(30000, 100000, size=100),
    "feature_3": np.random.randint(1, 4, size=100), 
    "feature_4": np.random.randint(300, 850, size=100),
    "feature_5": np.random.randint(1000, 50000, size=100)
})
target = np.random.randint(0, 2, size=100)
features_names = features.columns.tolist()

# The following code is just like your example.
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
explainer = shap.TreeExplainer(rf_model)
shap_values = explainer.shap_values(X_test)

# Slice the last axis so shap_values is 2-D (n_samples, n_features); index 0 picks the first entry on that axis.
shap.summary_plot(shap_values[:,:,0], X_test, feature_names=features_names)
shap.summary_plot(shap_values[:,:,0], X_test, feature_names=features_names, plot_type="bar")

With the above, you can run the SHAP analysis by changing shap_values to shap_values[:,:,0].
I'll leave it to you to explore what the third dimension of shap_values represents when using RandomForestClassifier.
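On that last point: as far as I know, for a scikit-learn classifier the last axis indexes the output classes, so `[:, :, 0]` explains the first class and `[:, :, 1]` the second. Older shap releases returned a list with one 2-D array per class instead (I believe the change came around 0.45, but treat that version number as an assumption). Below is a minimal sketch of a helper that accepts either layout; `select_class_shap` is a hypothetical name, and the arrays are synthetic stand-ins for real SHAP values.

```python
import numpy as np


def select_class_shap(shap_values, class_index=1):
    """Return a 2-D (n_samples, n_features) SHAP matrix for one class."""
    if isinstance(shap_values, list):
        # List layout: one (n_samples, n_features) array per class.
        return np.asarray(shap_values[class_index])
    values = np.asarray(shap_values)
    if values.ndim == 3:
        # Array layout: (n_samples, n_features, n_classes).
        return values[:, :, class_index]
    if values.ndim == 2:
        # Already a single-output matrix; nothing to select.
        return values
    raise ValueError("Unexpected SHAP values with %d dimension(s)" % values.ndim)


# Synthetic stand-ins for SHAP output (not real model explanations).
rng = np.random.default_rng(0)
class1 = rng.normal(size=(20, 5))
new_style = np.stack([-class1, class1], axis=2)  # shape (20, 5, 2)
old_style = [-class1, class1]                     # list of two (20, 5) arrays

print(select_class_shap(new_style, 1).shape)
print(select_class_shap(old_style, 1).shape)
```

With real output this would be used as `shap.summary_plot(select_class_shap(shap_values, 1), X_test)`. For a binary random forest the two classes usually mirror each other, since the predicted probabilities sum to 1, so the bar plot should look the same for either class and the dot plot is flipped in sign.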
