incorrect prediction from sklearn DummyClassifier

Question

I'm trying to build a dummy (baseline) classifier on a dataset for a school project. The goal is to find out how frequently different political parties give speeches. I wrote the code as follows:

from sklearn.dummy import DummyClassifier
import pandas as pd
import bz2


with bz2.open("data/ch3/speeches-201718.json.bz2") as source:
    speeches_201718 = pd.read_json(source)

with bz2.open("data/ch3/speeches-201819.json.bz2") as source:
    speeches_201819 = pd.read_json(source)


training_data, test_data = speeches_201718, speeches_201819

train_parties_count = training_data['party'].value_counts()
test_parties_count = test_data['party'].value_counts()
dummy_clf = DummyClassifier(strategy="most_frequent")

X = train_parties_count
y = train_parties_count.index
dummy_clf.fit(X.values, y)
print(X)
print(y)

test_parties_count.index = pd.CategoricalIndex(test_parties_count.index, categories=train_parties_count.index, ordered=True)
X_test = test_parties_count.sort_index()
print(X_test)
pred_mfc = dummy_clf.predict(X_test.values)

print("Urval av prediktioner [0-4]: ", pred_mfc[:5])

I get the following output: [screenshot of the printed output, not available]

As you can see, the prediction is C when it should be S. What could be wrong?

I have tried defining the train and test data in multiple ways with no success.

MuhammedYunus · Accepted Answer · 2024-04-13 18:37:27Z

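The dummy estimators in sklearn are not intended for real problems; they exist to provide baseline measures of performance using very simple rules. With strategy="most_frequent", the classifier ignores X entirely and always predicts the label that occurs most often in the y passed to fit. In your case y is train_parties_count.index, which contains each party exactly once, so every class is tied and the tie goes to the first label in sorted order. That is why the output is always "C", regardless of the input. To get "S", fit on one label per speech (for example training_data['party']) rather than on the aggregated counts.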
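A minimal sketch of the difference between fitting on one row per party and one row per speech. The toy `parties` series and the all-zero feature arrays are made-up stand-ins for the real speeches data, not the asker's dataset:

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical stand-in for training_data['party']: one label per speech.
parties = pd.Series(["S", "M", "S", "C", "S", "M", "SD", "S"])

# What the question does: one row per party, so every class occurs exactly once.
counts = parties.value_counts()
tied_clf = DummyClassifier(strategy="most_frequent")
tied_clf.fit(counts.to_numpy().reshape(-1, 1), counts.index.to_numpy())
tied_pred = tied_clf.predict(np.zeros((3, 1)))

# One row per speech: the features are ignored, so a column of zeros is enough.
X_dummy = np.zeros((len(parties), 1))
clf = DummyClassifier(strategy="most_frequent")
clf.fit(X_dummy, parties.to_numpy())
pred = clf.predict(np.zeros((3, 1)))

print("fitted on counts:  ", tied_pred)  # tie between all classes
print("fitted on speeches:", pred)       # true most frequent label
```

With these toy labels the first classifier falls back to "C" because all classes are tied, while the second predicts "S", the label that actually occurs most often.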

RandomForestClassifier is usually a good 'off-the-shelf' estimator. I'd suggest viewing the train score after you do the training in order to verify that the model is learning something. Then you can assess its performance on data it hasn't seen (a validation set).
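As a rough sketch of that workflow, the snippet below trains a RandomForestClassifier and compares the train score with the score on a held-out validation set. The data is synthetic (generated with make_classification) because the real feature extraction from the speeches is not shown in the question; the sizes and random seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption); real features would come from the speeches.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Hold out a validation set that the model never sees during training.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Train score: checks that the model has learned something at all.
train_score = forest.score(X_train, y_train)
# Validation score: estimates performance on unseen data.
val_score = forest.score(X_val, y_val)

print(f"train accuracy:      {train_score:.3f}")
print(f"validation accuracy: {val_score:.3f}")
```

A large gap between the two numbers usually points to overfitting; a validation score close to the dummy baseline suggests the features carry little signal.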

For the purposes of getting an accuracy score, you could use my_classifier.score(X_data, y_data).
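For instance, the baseline accuracy of a most-frequent dummy classifier could be computed like this. The label arrays are invented for illustration, and the all-zero feature columns are placeholders, since DummyClassifier ignores the features:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical per-speech party labels for the training and test years.
y_train = np.array(["S", "M", "S", "C", "S", "M"])
y_test = np.array(["S", "S", "M", "C", "S"])

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(np.zeros((len(y_train), 1)), y_train)

# score() returns the mean accuracy of the predictions against y_test.
accuracy = baseline.score(np.zeros((len(y_test), 1)), y_test)

# For this strategy the accuracy equals the share of the predicted label in the test set.
share_of_s = np.mean(y_test == "S")

print(f"baseline accuracy: {accuracy:.2f}")
print(f"share of 'S' in test labels: {share_of_s:.2f}")
```

Any real model should beat this number to be worth using.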

edited Apr 13, 2024 at 18:37

answered Apr 13, 2024 at 18:30

MuhammedYunus
