
I am training a RandomForestClassifier from sklearn.ensemble with the following code:

        import anndata as ad
        import numpy as np
        import pandas as pd
        import matplotlib.pyplot as plt
        import seaborn as sns
        from scipy.sparse import issparse
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.preprocessing import LabelEncoder
        from sklearn.metrics import classification_report, confusion_matrix

        adata = ad.read_h5ad(f'{data_dir}{ct}_clean_log1p_normalized.h5ad')
        adata = adata[:, adata.var.highly_variable]
        print(f'AnnData for {ct}: {adata}')
    
        # Extract feature matrix (X) and target vector (y)
        X = adata.X
        y = adata.obs['clinical_dx']
        
        # Convert sparse matrix to dense so downstream steps handle X uniformly
        if issparse(X):
            X = X.toarray()
        
        # Encode string class labels as integer codes
        le = LabelEncoder()
        y_encoded = le.fit_transform(y)
        
        X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)
        
            
        # Initialize the classifier
        rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
        
        # Train the classifier
        rf_classifier.fit(X_train, y_train)
    
        # Validate on the test set
        y_pred_rf = rf_classifier.predict(X_test)
        
        # View validation report
        validation_report = classification_report(y_test, y_pred_rf, target_names=le.classes_)
        print(validation_report)
    
        with open(f'{rfc_dir}validation_report.txt', "w") as report_file:
            report_file.write(validation_report)
    
        # Generate the confusion matrix
        cm = confusion_matrix(y_test, y_pred_rf, labels=rf_classifier.classes_)
        
        # Make percentage
        cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        
        plt.figure(figsize=(10, 9))
        sns.heatmap(cm_normalized, annot=True, fmt='.1%', cmap='Blues', 
                    xticklabels=le.inverse_transform(rf_classifier.classes_), 
                    yticklabels=le.inverse_transform(rf_classifier.classes_))
        
        plt.xlabel('Predicted Label')
        plt.ylabel('True Label')
        plt.title('Confusion Matrix (Random Forest)')
        
        plt.savefig(f"{rfc_dir}confusion_matrix.png", bbox_inches='tight')
    
        # Get feature importances
        feature_importances_rf = rf_classifier.feature_importances_

        number_of_features = 200
        
        # Create a DataFrame for better visualization
        feature_importance_rf_df = pd.DataFrame({
            'Ensembl': adata.var_names,
            'Importance': feature_importances_rf
        })

        top_features = feature_importance_rf_df.sort_values(
            by='Importance', ascending=False
        ).head(number_of_features)
        top_features.to_csv(f'{rfc_dir}markers.csv', index=False)


Unfortunately, I cannot share the data since it is HIPAA-protected.

For some reason, regardless of which cell type I train a classifier on (there is a separate AnnData object for each of 8 cell types), the importance scores in the CSV file are all incredibly low. The highest importance score in any case is around 0.01. Is this a red flag, or just something to do with my datasets? Has anyone experienced this before? Thanks
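Edit: one thing I noticed while digging is that scikit-learn's `feature_importances_` are normalized to sum to 1 across all features, so with a few thousand highly variable genes the scores are necessarily small on average. Here is a minimal synthetic check of the scale (no real data involved; the dimensions are made up to roughly mimic an HVG matrix, not taken from my dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 500 samples, 2000 features, only 20 informative
X, y = make_classification(n_samples=500, n_features=2000,
                           n_informative=20, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Gini importances are normalized: they sum to 1 over all features
print(rf.feature_importances_.sum())   # ~1.0
print(rf.feature_importances_.max())   # small even for informative features
```

With 2000 features the uniform baseline is 1/2000 = 0.0005, so a top score of 0.01 is still 20x the baseline. I am not sure whether that makes 0.01 "normal" for this kind of data, though.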
