I would like to read in a csv file using genfromtxt. I have six columns that are float, and one column that is a string.

How do I set the datatype so that the float columns will be read in as floats and the string column will be read in as strings? I tried dtype='void' but that is not working.

Suggestions?

Thanks

.csv file

999.9, abc, 34, 78, 12.3
1.3, ghf, 12, 8.4, 23.7
101.7, evf, 89, 2.4, 11.3



import sys
import numpy as np

x = sys.argv[1]
f = open(x, 'r')
y = np.genfromtxt(f, delimiter = ',', dtype=[('f0', '<f8'), ('f1', 'S4'), (\
'f2', '<f8'), ('f3', '<f8'), ('f4', '<f8'), ('f5', '<f8'), ('f6', '<f8')])

ionenergy = y[:,0]
units = y[:,1]

Error:

ionenergy = y[:,0]
IndexError: invalid index

I don't get this error when I specify a single data type.

2 Answers

4

dtype=None tells genfromtxt to guess the appropriate dtype.

From the docs:

dtype: dtype, optional

Data type of the resulting array. If None, the dtypes will be determined by the contents of each column, individually.

(my emphasis.)


Since your data is comma-separated, be sure to include delimiter=',' or else np.genfromtxt will interpret each column (except the last) as including a string character (the comma) and therefore mistakenly assign a string dtype to each of those columns.

For example:

import numpy as np

arr = np.genfromtxt('data', dtype=None, delimiter=',')

print(arr.dtype)
# [('f0', '<f8'), ('f1', 'S4'), ('f2', '<i4'), ('f3', '<f8'), ('f4', '<f8')]

This shows the names and dtypes of each column. For example, ('f2', '<i4') means the third column has name 'f2' and is of dtype '<i4'. The i means it is an integer dtype. If you need the third column to be a float dtype then there are a few options.

  1. You could manually edit the data by adding a decimal point in the third column to force genfromtxt to interpret values in that column to be of a float dtype.
  2. You could supply the dtype explicitly in the call to genfromtxt:

    arr = np.genfromtxt(
        'data', delimiter=',',
        dtype=[('f0', '<f8'), ('f1', 'S4'), ('f2', '<f4'), ('f3', '<f8'), ('f4', '<f8')])
    

With dtype=None, as in the first call above, the third column stays an integer column:

print(arr)
# [(999.9, ' abc', 34, 78.0, 12.3) (1.3, ' ghf', 12, 8.4, 23.7)
#  (101.7, ' evf', 89, 2.4, 11.3)]

print(arr['f2'])
# [34 12 89]
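The points above (the delimiter, dtype=None, and an explicit dtype) can be reproduced without a file on disk. The following is only a sketch: the SAMPLE string, the load helper, the 'U4' string field and the encoding="utf-8" argument are assumptions made here for illustration and are not part of the original answer, and the exact integer and string dtypes that are inferred may differ between NumPy versions and platforms.

```python
import io

import numpy as np

# The sample rows from the question, kept in memory so no file is needed.
SAMPLE = """999.9, abc, 34, 78, 12.3
1.3, ghf, 12, 8.4, 23.7
101.7, evf, 89, 2.4, 11.3
"""


def load(**kwargs):
    """Parse SAMPLE with genfromtxt, forwarding any keyword arguments."""
    # encoding="utf-8" is an assumption of this sketch (text is read as str).
    return np.genfromtxt(io.StringIO(SAMPLE), encoding="utf-8", **kwargs)


# Without delimiter=',' the commas stay attached to the values, so every
# column except the last one is inferred as a string column.
no_delim = load(dtype=None)
print([no_delim.dtype[name].kind for name in no_delim.dtype.names])

# With the delimiter, the third column is inferred as an integer column.
guessed = load(dtype=None, delimiter=",")
print(guessed.dtype)

# An explicit dtype forces the third column to float.
explicit = load(
    delimiter=",",
    dtype=[("f0", "<f8"), ("f1", "U4"), ("f2", "<f8"), ("f3", "<f8"), ("f4", "<f8")],
)
print(explicit["f2"])
```

Checking dtype.kind rather than the full dtype string keeps the comparison independent of whether strings come back as bytes ('S') or text ('U').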

The error message IndexError: invalid index is being generated by the line

ionenergy = y[:,0]

When you have mixed dtypes, np.genfromtxt returns a structured array. You need to read up on structured arrays because the syntax for accessing columns differs from the syntax used for plain arrays of homogeneous dtype.
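A short sketch of why the two cases behave differently, assuming the question's sample rows are held in a string (SAMPLE, mixed and plain are names invented here, and encoding="utf-8" is an added assumption): the structured array has a single axis with one record per row, so a two-index expression has nothing to index, whereas a single float dtype yields an ordinary two-dimensional array.

```python
import io

import numpy as np

SAMPLE = (
    "999.9, abc, 34, 78, 12.3\n"
    "1.3, ghf, 12, 8.4, 23.7\n"
    "101.7, evf, 89, 2.4, 11.3\n"
)

# Mixed column types: a 1-D structured array, one record per row.
# encoding="utf-8" is an assumption of this sketch.
mixed = np.genfromtxt(io.StringIO(SAMPLE), delimiter=",", dtype=None, encoding="utf-8")
print(mixed.shape)  # (3,)

# A single float dtype: an ordinary 2-D array; the text column becomes nan.
plain = np.genfromtxt(io.StringIO(SAMPLE), delimiter=",", dtype=float)
print(plain.shape)  # (3, 5)

try:
    mixed[:, 0]  # two indices on a one-dimensional array
except IndexError as exc:
    print("IndexError:", exc)

print(plain[:, 0])  # works: plain has both rows and columns
```

This is also consistent with the question's remark that the error does not appear when a single data type is specified.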

Instead of y[:, 0], to access the first column of the structured array y, use

y['f0']

Or, better yet, supply the names parameter in np.genfromtxt, so you can use a more relevant column name, like y['ionenergy']:

import numpy as np
arr = np.genfromtxt(
    'data', delimiter=',', dtype=None,
    names=['ionenergy', 'foo', 'bar', 'baz', 'quux'])

print(arr['ionenergy'])
# [ 999.9    1.3  101.7]
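For reference, here is a sketch of a few common ways to pull data out of the structured array once the columns are named. The column names 'c2', 'c3' and 'c4' are placeholders invented for this example, 'units' is borrowed from the variable name in the question, and encoding="utf-8" is an added assumption. The last step uses numpy.lib.recfunctions.structured_to_unstructured, which is available in reasonably recent NumPy releases, to get back a plain 2-D float array for the numeric columns.

```python
import io

import numpy as np
from numpy.lib import recfunctions as rfn

SAMPLE = (
    "999.9, abc, 34, 78, 12.3\n"
    "1.3, ghf, 12, 8.4, 23.7\n"
    "101.7, evf, 89, 2.4, 11.3\n"
)

# Placeholder names for the five sample columns (assumed for illustration).
arr = np.genfromtxt(
    io.StringIO(SAMPLE), delimiter=",", dtype=None, encoding="utf-8",
    names=["ionenergy", "units", "c2", "c3", "c4"],
)

ionenergy = arr["ionenergy"]         # one column, selected by field name
first_row = arr[0]                   # one record, i.e. one row of the file
units = np.char.strip(arr["units"])  # drop the leading space kept from the file

# Select the numeric fields and convert them to an ordinary 2-D float array.
numeric = rfn.structured_to_unstructured(arr[["ionenergy", "c2", "c3", "c4"]])

print(ionenergy)
print(first_row)
print(units)
print(numeric.shape)
```

Once the numeric fields are in a plain array, the usual numeric[:, 0] style of indexing from the question works again.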

6 Comments

What was the resultant dtype? Was there any text in the columns that should be floats? Did you use names or skip_header to deal with the header (if there was one)?
I don't have a header. Using dtype=None, I get an error of an invalid index range when I try to assign each column. Also, there is no text in a column that should be just float.
Please post a sample of the text you are parsing with genfromtxt.
Okay, I tried specifying each column, but am still getting an invalid index error.
Could you post your code and the full traceback error message?
-1

Please try this:

import numpy

ionenergy = y.iloc[:,0]
units = y.iloc[:,1]

1 Comment

Hi @Joye, can you explain what the .iloc construction is doing here, and what that has to do with data types?
