
I have a very simple dataset for binary classification in a CSV file, which looks like this:

"feature1","feature2","label"
1,0,1
0,1,0
...

where the "label" column indicates the class (1 is positive, 0 is negative). The number of features is actually pretty big, but that doesn't matter for this question.

Here is how I read the data:

import pandas

train = pandas.read_csv(TRAINING_FILE)
y_train, X_train = train['label'], train[['feature1', 'feature2']].fillna(0)

test = pandas.read_csv(TEST_FILE)
y_test, X_test = test['label'], test[['feature1', 'feature2']].fillna(0)

I want to run tensorflow.contrib.learn.LinearClassifier and tensorflow.contrib.learn.DNNClassifier on this data. For instance, I initialize the DNNClassifier like this:

classifier = DNNClassifier(hidden_units=[3, 5, 3],
                           n_classes=2,
                           feature_columns=feature_columns,  # ???
                           activation_fn=nn.relu,
                           enable_centered_bias=False,
                           model_dir=MODEL_DIR_DNN)

So how exactly should I create the feature_columns when all the features are also binary (0 or 1 are the only possible values)?

Here is the model training:

classifier.fit(X_train.values,
               y_train.values,
               batch_size=dnn_batch_size,
               steps=dnn_steps)

A solution that replaces the fit() parameters with an input function would also be great.

Thanks!

P.S. I'm using TensorFlow version 1.0.1

  • Unrelated to your question: you're filling missing values with 0, which I don't think is appropriate given that 0 is very meaningful in your dataset: it's a ground-truth label/class, and it's a possible feature value. That means whenever you fill NAs, you're actually creating false training examples and assigning them to your negative (0) class. Commented Mar 23, 2017 at 2:55
  • @Simon thanks for your comment! I realized that too. But I've done some feature engineering and I'm sure there are no missing values in the dataset. Commented Mar 23, 2017 at 2:59

2 Answers


You can directly use tf.feature_column.numeric_column (available in newer TensorFlow versions):

feature_columns = [tf.feature_column.numeric_column(key=key) for key in X_train.columns]
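For the input-function variant the question asks about, here is a minimal sketch of the contract an input_fn has to satisfy: it takes no arguments and returns a (features, labels) pair, where features maps each column name to its values. In real TensorFlow code the values would be tensors (e.g. tf.constant(X[col].values)); plain numpy arrays stand in for them in this sketch so it runs without TensorFlow:

```python
import pandas as pd

def make_input_fn(X, y):
    """Sketch of the input_fn contract used by tf.contrib.learn estimators.

    X: DataFrame of features, y: Series of labels. The returned closure
    takes no arguments and yields (features dict, labels); in actual
    TensorFlow code the values would be wrapped in tf.constant(...).
    """
    def input_fn():
        features = {col: X[col].values for col in X.columns}
        labels = y.values
        return features, labels
    return input_fn

# toy data mirroring the question's CSV
train = pd.DataFrame({'feature1': [1, 0], 'feature2': [0, 1], 'label': [1, 0]})
input_fn = make_input_fn(train[['feature1', 'feature2']], train['label'])
features, labels = input_fn()
```

You would then pass it as classifier.fit(input_fn=input_fn, steps=dnn_steps) instead of passing X/y arrays and a batch size.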



I've just found the solution and it's pretty simple:

feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(X_train)

Apparently infer_real_valued_columns_from_input() also works well with categorical (here, binary) features, since it simply treats every column as real-valued.
