
I am reading in data with

df = pandas.read_csv("file.csv", names=['A','B','C','D','E','F','G', 'H','I','J', 'K'], header=None)

I get

df.dtypes
Out[54]: 
A     int64
B    object
C     int64
D     int64
E    object
F    object
G    object
H    object
I    object
J    object
K    object
dtype: object

The problem is that some fields in the original data were replaced with the string SUPP when their value was less than 6 (but more than 0), so those columns are not read as numerical data types. I tried replacing them with

df.replace('SUPP', 3.0)

but I still don't get numerical data types.

Some typical input data looks like

931,Oxfordshire,9314125,123255,Larkmead School,Abingdon,125,124,20,SUPP,8
931,Oxfordshire,9314126,123256,John Mason School,Abingdon,164,164,25,6,16
931,Oxfordshire,9314127,123257,Fitzharrys School,Abingdon,150,149,9,0,11
931,Oxfordshire,9316076,123298,Our Lady's Abingdon,Abingdon,57,57,SUPP,SUPP,16

The problem can be reproduced by just saving the example above as file.csv.

  • Have you tried df.replace('SUPP', 3.0, inplace=True)? Commented Feb 24, 2014 at 20:29
  • @EdChum That also doesn't help. I still don't get numerical data types. Commented Feb 24, 2014 at 20:30
  • How about reading the values in as NaN, like df = pandas.read_csv("file.csv", names=['A','B','C','D','E','F','G', 'H','I','J', 'K'], header=None, na_values=['SUPP'])? This will replace 'SUPP' with NaN, which you should then be able to replace. Commented Feb 24, 2014 at 20:32
  • @EdChum That works in the sense that you get the right data types but isn't what I want. I want to use the fact that the values are very small in the graph I will plot rather than just ignore them. Commented Feb 24, 2014 at 20:33
  • But you can replace NaN with 3.0, so that should achieve what you want, no? Commented Feb 24, 2014 at 20:34
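The suggestion in these comments can be sketched as follows. This is a minimal illustration rather than a definitive recipe: it inlines the four sample rows from the question through io.StringIO instead of reading file.csv, and the names CSV_TEXT, df_na and numeric_cols are chosen here for illustration only.

```python
import io
import pandas as pd

# Sample rows from the question, inlined so the sketch is self-contained.
CSV_TEXT = """931,Oxfordshire,9314125,123255,Larkmead School,Abingdon,125,124,20,SUPP,8
931,Oxfordshire,9314126,123256,John Mason School,Abingdon,164,164,25,6,16
931,Oxfordshire,9314127,123257,Fitzharrys School,Abingdon,150,149,9,0,11
931,Oxfordshire,9316076,123298,Our Lady's Abingdon,Abingdon,57,57,SUPP,SUPP,16
"""

names = list("ABCDEFGHIJK")

# Treat 'SUPP' as missing while parsing, so the affected columns are read as numbers.
df_na = pd.read_csv(io.StringIO(CSV_TEXT), names=names, header=None, na_values=["SUPP"])

# Fill the gaps with the stand-in value 3.0, but only in the numeric columns,
# so text columns such as the school name are left alone.
numeric_cols = df_na.select_dtypes(include="number").columns
df_na[numeric_cols] = df_na[numeric_cols].fillna(3.0)

print(df_na.dtypes)
```

Columns that contained SUPP come out as floats here, because NaN forces a float dtype before the fill.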

1 Answer

EdChum almost had it in the comments.

In [18]: df.dtypes
Out[18]: 
0      int64
1     object
2      int64
3      int64
4     object
5     object
6      int64
7      int64
8     object
9     object
10     int64
dtype: object

In [19]: df.replace('SUPP', 3, inplace=True)

In [20]: df.dtypes
Out[20]: 
0      int64
1     object
2      int64
3      int64
4     object
5     object
6      int64
7      int64
8     object
9     object
10     int64
dtype: object

In [21]: df = df.convert_objects(convert_numeric=True)

In [22]: df.dtypes
Out[22]: 
0      int64
1     object
2      int64
3      int64
4     object
5     object
6      int64
7      int64
8      int64
9      int64
10     int64
dtype: object

You need to convert_objects because even though you've replaced SUPP with 3, the other values in that column are still strings (object dtype).
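For readers on a current pandas: convert_objects was removed in pandas 1.0, and pd.to_numeric is the usual replacement for the numeric case. The sketch below shows the same replace-then-convert idea under some assumptions: it uses the four sample rows from the question inlined through io.StringIO, and the names CSV_TEXT, df_new and supp_cols are illustrative. It replaces SUPP with the string "3" so the column stays all text until it is parsed.

```python
import io
import pandas as pd

# Sample rows from the question, inlined so the sketch is self-contained.
CSV_TEXT = """931,Oxfordshire,9314125,123255,Larkmead School,Abingdon,125,124,20,SUPP,8
931,Oxfordshire,9314126,123256,John Mason School,Abingdon,164,164,25,6,16
931,Oxfordshire,9314127,123257,Fitzharrys School,Abingdon,150,149,9,0,11
931,Oxfordshire,9316076,123298,Our Lady's Abingdon,Abingdon,57,57,SUPP,SUPP,16
"""

df_new = pd.read_csv(io.StringIO(CSV_TEXT), names=list("ABCDEFGHIJK"), header=None)

# In the sample rows only I and J contain 'SUPP'; with the full file,
# list whichever columns are affected (assumption for this sketch).
supp_cols = ["I", "J"]

for col in supp_cols:
    # Swap the placeholder for "3", then parse the whole column as numbers.
    df_new[col] = pd.to_numeric(df_new[col].replace("SUPP", "3"))

print(df_new.dtypes)
```

Converting column by column keeps the genuinely textual columns (B, E, F) untouched, which a blanket conversion would have to skip or coerce.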


4 Comments

That works thanks but I don't understand the explanation. df = pandas.read_csv("test.csv", na_values=['SUPP'], names=['A','B','C','D','E','F','G', 'H','I','J', 'K'], header=None) also gives the right data types so it can only be the SUPP values that cause the problem surely?
@felix Because of the ambiguity of the data type, it thinks the data type is a string. If you tell it that 'SUPP' is NaN, it can then parse the remaining data as numeric. That is the real problem here.
Oh I see what you mean. The remaining data for that column hasn't been parsed as being numeric.
Yeah. Each pandas column has a single dtype. As the parser goes along it sees the "SUPP" and assigns the dtype object (string) to that column. So every item in that column is a string. You can check by reading in df and doing df.iloc[0]['J'] and seeing that it returns a string.
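The check described in the last comment might look like the sketch below. It again assumes the four sample rows from the question, inlined through io.StringIO; CSV_TEXT, df_raw, first_j and second_j are illustrative names.

```python
import io
import pandas as pd

# Sample rows from the question, inlined so the sketch is self-contained.
CSV_TEXT = """931,Oxfordshire,9314125,123255,Larkmead School,Abingdon,125,124,20,SUPP,8
931,Oxfordshire,9314126,123256,John Mason School,Abingdon,164,164,25,6,16
931,Oxfordshire,9314127,123257,Fitzharrys School,Abingdon,150,149,9,0,11
931,Oxfordshire,9316076,123298,Our Lady's Abingdon,Abingdon,57,57,SUPP,SUPP,16
"""

df_raw = pd.read_csv(io.StringIO(CSV_TEXT), names=list("ABCDEFGHIJK"), header=None)

# Row 0 holds the placeholder itself; row 1 holds what looks like a number.
first_j = df_raw.iloc[0]["J"]
second_j = df_raw.iloc[1]["J"]

# Both are Python strings: the "6" in row 1 was never parsed as an integer.
print(repr(first_j), type(first_j))
print(repr(second_j), type(second_j))
```

This is why replacing SUPP alone does not change the dtype: the other entries in the column still need to be parsed.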
