Replace string by numerical value

Question

I am reading in data with

df = pandas.read_csv("file.csv", names=['A','B','C','D','E','F','G', 'H','I','J', 'K'], header=None)

I get

df.dtypes
Out[54]: 
A     int64
B    object
C     int64
D     int64
E    object
F    object
G    object
H    object
I    object
J    object
K    object
dtype: object

The problem is that some of the fields in the original data have been replaced with the string SUPP when they are less than 6 (but more than 0) so I am not getting numerical data types. I tried replacing them with

df.replace('SUPP', 3.0)

but I still don't get numerical data types.

Some typical input data looks like

931,Oxfordshire,9314125,123255,Larkmead School,Abingdon,125,124,20,SUPP,8
931,Oxfordshire,9314126,123256,John Mason School,Abingdon,164,164,25,6,16
931,Oxfordshire,9314127,123257,Fitzharrys School,Abingdon,150,149,9,0,11
931,Oxfordshire,9316076,123298,Our Lady's Abingdon,Abingdon,57,57,SUPP,SUPP,16

The problem can be reproduced by just saving the example above as file.csv.

@EdChum That also doesn't help. I still don't get numerical data types.
– Simd, Commented Feb 24, 2014 at 20:30
How about reading the values in as NaN, like df = pandas.read_csv("file.csv", names=['A','B','C','D','E','F','G', 'H','I','J', 'K'], header=None, na_values=['SUPP'])? This will replace 'SUPP' with NaN, which you should then be able to replace.
– EdChum, Commented Feb 24, 2014 at 20:32
@EdChum That works in the sense that you get the right data types, but it isn't what I want. In the graph I will plot, I want to use the fact that the values are very small rather than just ignore them.
– Simd, Commented Feb 24, 2014 at 20:33
But you can replace NaN with 3.0, so it should achieve what you want, no?
– EdChum, Commented Feb 24, 2014 at 20:34
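The suggestion in these comments (read SUPP as NaN, then fill the gaps with 3.0) could look roughly like the sketch below. It uses the four sample rows from the question through an in-memory buffer instead of file.csv. Filling only columns I and J is an assumption based on that sample; the real file may need more columns.

```python
import io

import pandas as pd

# The four sample rows from the question, held in memory instead of file.csv.
csv_text = """931,Oxfordshire,9314125,123255,Larkmead School,Abingdon,125,124,20,SUPP,8
931,Oxfordshire,9314126,123256,John Mason School,Abingdon,164,164,25,6,16
931,Oxfordshire,9314127,123257,Fitzharrys School,Abingdon,150,149,9,0,11
931,Oxfordshire,9316076,123298,Our Lady's Abingdon,Abingdon,57,57,SUPP,SUPP,16
"""

names = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K"]

# na_values=["SUPP"] makes the parser treat SUPP as missing, so the
# affected columns are parsed as floats rather than strings.
df = pd.read_csv(io.StringIO(csv_text), names=names, header=None,
                 na_values=["SUPP"])

# Fill the missing entries with 3.0, but only in the columns that held SUPP
# in this sample (assumption), so NaNs in other columns are left untouched.
df = df.fillna({"I": 3.0, "J": 3.0})

print(df.dtypes)
```

Restricting fillna to named columns keeps any real missing values in the text columns from being overwritten with 3.0.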

TomAugspurger · Accepted Answer · 2014-02-24 20:34:56Z

2

EdChum almost had it in the comments.

In [18]: df.dtypes
Out[18]: 
0      int64
1     object
2      int64
3      int64
4     object
5     object
6      int64
7      int64
8     object
9     object
10     int64
dtype: object

In [19]: df.replace('SUPP', 3, inplace=True)

In [20]: df.dtypes
Out[20]: 
0      int64
1     object
2      int64
3      int64
4     object
5     object
6      int64
7      int64
8     object
9     object
10     int64
dtype: object

In [21]: df = df.convert_objects(convert_numeric=True)

In [22]: df.dtypes
Out[22]: 
0      int64
1     object
2      int64
3      int64
4     object
5     object
6      int64
7      int64
8      int64
9      int64
10     int64
dtype: object

You need to call convert_objects because, even though you've replaced SUPP with 3, the other values in that column are still strings (object dtype).
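Note that convert_objects is no longer available in newer pandas releases (as far as I know, it is gone from pandas 1.0 onward). A minimal sketch of the same two steps using pd.to_numeric follows. The small frame and the column list are made up for illustration, and SUPP is replaced with the string "3" rather than the integer 3 so that each column stays uniformly string until it is converted.

```python
import pandas as pd

# Hypothetical miniature of the parsed data: every value in I and J is a
# string, because the parser saw "SUPP" in those columns.
df = pd.DataFrame({
    "F": ["Abingdon", "Abingdon"],
    "I": ["20", "SUPP"],
    "J": ["SUPP", "6"],
})

# Columns known to contain the SUPP marker (assumption for this sketch).
supp_cols = ["I", "J"]

for col in supp_cols:
    # Step 1: swap the marker for a numeric-looking string.
    # Step 2: convert the whole column from strings to numbers.
    df[col] = pd.to_numeric(df[col].replace("SUPP", "3"))

print(df.dtypes)
```

pd.to_numeric raises by default if some other non-numeric string is still present, which is a useful check that SUPP was the only marker in those columns.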


4 Comments

Simd Over a year ago

That works, thanks, but I don't understand the explanation. df = pandas.read_csv("test.csv", na_values=['SUPP'], names=['A','B','C','D','E','F','G', 'H','I','J', 'K'], header=None) also gives the right data types, so it can only be the SUPP values that cause the problem, surely?

EdChum Over a year ago

@felix Because of the ambiguity of the data type, it thinks the data type is a string. If you tell it that 'SUPP' is NaN, it can then parse the remaining data as numeric. That is the real problem here.

Simd Over a year ago

Oh I see what you mean. The remaining data for that column hasn't been parsed as being numeric.

TomAugspurger Over a year ago

Yeah. Each pandas column has a single dtype. As the parser goes along it sees the "SUPP" and assigns the dtype object (string) to that column. So every item in that column is a string. You can check by reading in df and doing df.iloc[0]['J'] and seeing that it returns a string.
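The check described in this comment can be sketched as follows, again using the question's sample rows from an in-memory buffer rather than file.csv. Row 1 is inspected as well as row 0 because its J value looks numeric in the file.

```python
import io

import pandas as pd

# The four sample rows from the question, held in memory instead of file.csv.
csv_text = """931,Oxfordshire,9314125,123255,Larkmead School,Abingdon,125,124,20,SUPP,8
931,Oxfordshire,9314126,123256,John Mason School,Abingdon,164,164,25,6,16
931,Oxfordshire,9314127,123257,Fitzharrys School,Abingdon,150,149,9,0,11
931,Oxfordshire,9316076,123298,Our Lady's Abingdon,Abingdon,57,57,SUPP,SUPP,16
"""

names = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K"]
df = pd.read_csv(io.StringIO(csv_text), names=names, header=None)

# Row 0 holds the SUPP marker itself; row 1 holds "6", which is also a
# string because the whole column was parsed as text.
print(repr(df.iloc[0]["J"]), type(df.iloc[0]["J"]))
print(repr(df.iloc[1]["J"]), type(df.iloc[1]["J"]))
```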
