
I am training a RandomForestClassifier from sklearn.ensemble with the following code:

        import anndata as ad
        import numpy as np
        import pandas as pd
        import matplotlib.pyplot as plt
        import seaborn as sns
        from scipy.sparse import issparse
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.preprocessing import LabelEncoder
        from sklearn.metrics import classification_report, confusion_matrix

        adata = ad.read_h5ad(f'{data_dir}{ct}_clean_log1p_normalized.h5ad')
        adata = adata[:, adata.var.highly_variable]
        print(f'AnnData for {ct}: {adata}')
    
        # Extract feature matrix (X) and target vector (y)
        X = adata.X
        y = adata.obs['clinical_dx']
        
        # Convert sparse matrix to dense so downstream steps handle X uniformly
        if issparse(X):
            X = X.toarray()
        
        # Encode string class labels as integer codes
        le = LabelEncoder()
        y_encoded = le.fit_transform(y)
        
        X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)
        
            
        # Initialize the classifier
        rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
        
        # Train the classifier
        rf_classifier.fit(X_train, y_train)
    
        # Validate on the test set
        y_pred_rf = rf_classifier.predict(X_test)
        
        # View validation report
        validation_report = classification_report(y_test, y_pred_rf, target_names=le.classes_)
        print(validation_report)
    
        with open(f'{rfc_dir}validation_report.txt', "w") as report_file:
            report_file.write(validation_report)
    
        # Generate the confusion matrix
        cm = confusion_matrix(y_test, y_pred_rf, labels=rf_classifier.classes_)
        
        # Make percentage
        cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        
        plt.figure(figsize=(10, 9))
        sns.heatmap(cm_normalized, annot=True, fmt='.1%', cmap='Blues', 
                    xticklabels=le.inverse_transform(rf_classifier.classes_), 
                    yticklabels=le.inverse_transform(rf_classifier.classes_))
        
        plt.xlabel('Predicted Label')
        plt.ylabel('True Label')
        plt.title('Confusion Matrix (Random Forest)')
        
        plt.savefig(f"{rfc_dir}confusion_matrix.png", bbox_inches='tight')
    
        # Get feature importances
        feature_importances_rf = rf_classifier.feature_importances_

        number_of_features = 200
        
        # Create a DataFrame for better visualization
        feature_importance_rf_df = pd.DataFrame({
            'Ensembl': adata.var_names,
            'Importance': feature_importances_rf
        })

        top_features = feature_importance_rf_df.sort_values(
            by='Importance', ascending=False
        ).head(number_of_features)
        top_features.to_csv(f'{rfc_dir}markers.csv', index=False)


Unfortunately, I cannot share the data since it is HIPAA-protected.

For some reason, regardless of which cell type I train a classifier on (there is a separate AnnData object for each of 8 cell types), the importance scores in the CSV file are all incredibly low. The highest importance score in any case is around 0.01. Is this a red flag, or just something to do with my datasets? Has anyone experienced this before? Thanks
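Edit: one thing I noticed while digging is that scikit-learn's `feature_importances_` are normalized to sum to 1 across all features, so with a few thousand highly variable genes the scores are necessarily small on average. Here is a minimal synthetic check of the scale (no real data involved; the dimensions are made up to roughly mimic an HVG matrix, not taken from my dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 500 samples, 2000 features, only 20 informative
X, y = make_classification(n_samples=500, n_features=2000,
                           n_informative=20, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Gini importances are normalized: they sum to 1 over all features
print(rf.feature_importances_.sum())   # ~1.0
print(rf.feature_importances_.max())   # small even for informative features
```

With 2000 features the uniform baseline is 1/2000 = 0.0005, so a top score of 0.01 is still 20x the baseline. I am not sure whether that makes 0.01 "normal" for this kind of data, though.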
