I am trying to build a simple decision tree, but I keep getting the same ValueError, and none of the similar threads were of any help. None of my variables are strings, but I still get a conversion error.

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
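# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split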
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics

os.chdir(r"C:\Mlearning")  # raw string so the backslash is not read as an escape

"""
Data Engineering and Analysis
"""
#Load the dataset

AH_data = pd.read_csv("gapminder.csv")

data_clean = AH_data.dropna()

#data_clean.dtypes
#data_clean.describe()


"""
Modeling and Prediction
"""
#Split into training and testing sets

predictors = data_clean[['breastcancerper100th','alcconsumption']]

targets = data_clean.employrate

pred_train, pred_test, tar_train, tar_test  =   train_test_split(predictors, targets, test_size=.4)

pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape

#Build model on training data
classifier=DecisionTreeClassifier()
classifier=classifier.fit(pred_train,tar_train)

predictions=classifier.predict(pred_test)

sklearn.metrics.confusion_matrix(tar_test,predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)

#Displaying the decision tree
from sklearn import tree
from io import StringIO
from IPython.display import Image
out = StringIO()
tree.export_graphviz(classifier, out_file=out)
import pydotplus
graph=pydotplus.graph_from_dot_data(out.getvalue())
graph.write_pdf("graph.pdf")

But the result that I am getting is this one:

   array = np.array(array, dtype=dtype, order=order, copy=copy)

ValueError: could not convert string to float: 
  • Is that error happening in your classifier.fit, or somewhere else? Can you post a sample of the data you are trying to classify? Commented Jun 7, 2016 at 17:21
  • Can you edit your question to show the full traceback? The output of data_clean.dtypes would be useful, too (and perhaps data_clean.head(), if you can share it). Commented Jun 7, 2016 at 17:39
  • It looks to me as though you're trying to predict a floating-point value (employment rate). That's a regression problem, not a classification problem. Try using DecisionTreeRegressor instead. We'll be able to help much better if you post a traceback, so that we can see which line the ValueError is coming from. Commented Jun 8, 2016 at 18:18
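To illustrate the last comment, here is a minimal sketch of fitting a DecisionTreeRegressor to a continuous target. The data is synthetic and made up for illustration (it is not the gapminder file), and the variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: two numeric predictors and a continuous target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(50, 2))
y = 40 + 0.2 * X[:, 0] + rng.normal(0, 1, size=50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# A regressor predicts floating-point values directly.
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)

# For regressors, score() returns R^2 rather than accuracy.
r2 = regressor.score(X_test, y_test)
print(predictions[:5], r2)
```

Note that a confusion matrix and accuracy score do not apply to a continuous target; R^2 or a mean error metric would be the usual replacement.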

2 Answers

1

You can use pd.to_numeric (introduced in version 0.17) to convert a column or a Series to a numeric type. The function can also be applied over multiple columns of a DataFrame using apply.

Importantly, the function also takes an errors keyword argument that lets you force non-numeric values to NaN (errors='coerce'), or leave columns containing such values unchanged (errors='ignore').

Your code will work once all entries have been converted to numeric. I use a small function for this:

def convert_column_numeric(ax):
    # Non-numeric entries become NaN because of errors='coerce'
    predictors[ax] = pd.to_numeric(predictors[ax], errors='coerce')

.....

convert_column_numeric('breastcancerper100th')
convert_column_numeric('alcconsumption')
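As a minimal sketch of the same idea applied to several columns at once: the sample frame below is made up for illustration, and the blank-string cells are an assumed cause of the error (the question does not show the data). Converting before calling dropna() matters, because dropna() only removes real NaN values, not strings.

```python
import pandas as pd

# Hypothetical sample standing in for gapminder.csv; the blank strings
# are an assumption, not something confirmed by the question.
raw = pd.DataFrame({
    "breastcancerper100th": ["26.8", " ", "57.1", "23.5"],
    "alcconsumption": ["0.03", "7.29", "", "0.69"],
    "employrate": ["55.7", "51.4", "50.5", " "],
})

cols = ["breastcancerper100th", "alcconsumption", "employrate"]

# Coerce every listed column; anything non-numeric becomes NaN.
numeric = raw[cols].apply(pd.to_numeric, errors="coerce")

# dropna() can now remove the rows that held blanks, which it could
# not do while those cells were still strings.
data_clean = numeric.dropna()
print(data_clean.dtypes)
print(len(data_clean))
```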

0

It is most likely a problem with the data. Your code never converts anything to float explicitly, so the conversion is happening inside scikit-learn when it turns your input into a numeric array, and at least one value in your data cannot be parsed as a number.
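One way to check this is to list the raw values that fail numeric parsing. The helper below is a hypothetical sketch, not part of pandas, and the sample column is invented for illustration.

```python
import pandas as pd

def non_numeric_values(df, column):
    """Return the distinct non-missing values in `column` that cannot be parsed as numbers."""
    parsed = pd.to_numeric(df[column], errors="coerce")
    # Values that became NaN after parsing but were not missing to begin with.
    bad = parsed.isna() & df[column].notna()
    return sorted(df.loc[bad, column].astype(str).unique())

# Invented sample: a blank cell and a text placeholder hide among the numbers.
sample = pd.DataFrame({"alcconsumption": ["0.03", " ", "7.29", "n/a", None]})

print(sample.dtypes)  # a non-numeric dtype here hints that strings are present
print(non_numeric_values(sample, "alcconsumption"))
```

Running the helper on each predictor and on the target column should show which entries trigger the ValueError.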
