I'm trying to run a dummy (baseline) classification on a dataset for a school project. The goal is to see how frequently the different political parties give speeches. This is the code I have written:
from sklearn.dummy import DummyClassifier
import pandas as pd
import bz2
with bz2.open("data/ch3/speeches-201718.json.bz2") as source:
    speeches_201718 = pd.read_json(source)
with bz2.open("data/ch3/speeches-201819.json.bz2") as source:
    speeches_201819 = pd.read_json(source)
training_data, test_data = speeches_201718, speeches_201819
train_parties_count = training_data['party'].value_counts()
test_parties_count = test_data['party'].value_counts()
dummy_clf = DummyClassifier(strategy="most_frequent")
X = train_parties_count
y = train_parties_count.index
dummy_clf.fit(X.values, y)
print(X)
print(y)
test_parties_count.index = pd.CategoricalIndex(test_parties_count.index, categories=train_parties_count.index, ordered=True)
X_test = test_parties_count.sort_index()
print(X_test)
pred_mfc = dummy_clf.predict(X_test.values)
print("Sample of predictions [0-4]: ", pred_mfc[:5])
I get the following output (originally posted as a screenshot).
As you can see, the prediction is C when it should be S. What could be wrong?
I have tried defining the train and test data in multiple ways with no success.
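For reference, here is a minimal sketch of how `DummyClassifier(strategy="most_frequent")` treats its inputs. It uses a made-up toy DataFrame in place of the speeches files, so the `train` and `test` frames, their contents, and the all-zeros feature arrays are assumptions for illustration only. The classifier ignores `X` and simply counts the labels in `y`. If `y` holds one label per speech, the most common party wins. If `y` is the index of `value_counts()`, every party appears exactly once, the counts are tied, and the tie appears to be resolved in favour of the first class in sorted order, which would explain getting C.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical toy data standing in for the speeches DataFrames;
# only the 'party' column name comes from the question.
train = pd.DataFrame({"party": ["S", "S", "S", "M", "M", "C"]})
test = pd.DataFrame({"party": ["M", "C"]})

# The dummy classifier ignores the features, so a placeholder
# array with one row per speech is enough.
X_train = np.zeros((len(train), 1))
X_test = np.zeros((len(test), 1))

# Case 1: one label per speech. 'S' occurs most often, so it is predicted.
clf = DummyClassifier(strategy="most_frequent")
clf.fit(X_train, train["party"].to_numpy())
pred = clf.predict(X_test)
print(pred)  # every prediction is 'S'

# Case 2: labels taken from the value_counts() index, as in the question.
# Each party now appears exactly once, so all class frequencies are tied.
counts = train["party"].value_counts()
X_counts = counts.to_numpy().reshape(-1, 1)
tied = DummyClassifier(strategy="most_frequent")
tied.fit(X_counts, counts.index.to_numpy())
tied_pred = tied.predict(X_counts)
print(tied_pred)  # every prediction is 'C', the first class in sorted order
```

Under these assumptions, the fix for the original code would be to fit on the per-speech `training_data['party']` column rather than on the aggregated counts.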