Pandas df.to_records() returns a 1d numpy array

Question

I apologize in advance if this question seems slightly naive. I am still learning about the interplay between pandas and numpy.

I have a pandas DataFrame that I am trying to convert into an array for analysis using scikit-learn. I have tried df.values and df.to_records() to convert it, but for some reason, it changes the shape during the conversion.

These are the first few rows of the DataFrame (df):

Index           Code1    Code2       Code3
0               99285    5921         5921
1               99284     NaN         5921
2               99284     NaN         4660
3               99285   42789        42789
4               99284   92321        92321
5               99283     NaN        92321
...
[94 rows x 3 columns]

However, if I call df.values, I get the following result, which, as far as I understand, is not an array, since arrays are lists of tuples.

[['99285' '5921' '5921']
['99284' nan '5921']
['99284' nan '4660']
['99285' '42789' '42789']
['99284' '92321' '92321']
['99283' nan '92321']
...

If I call df.to_records(), I get the following result, which is an array, but not of the right shape as shown below.

[(0, '99285', '5921', '5921') (1, '99284', nan, '5921')
(2, '99284', nan, '4660') (3, '99285', '42789', '42789')
(4, '99284', '92321', '92321') (5, '99283', nan, '92321')
...
>>> df.to_records().shape
(94,)

Can someone help me understand what I need to do to get an array with a shape of (94,3)?

Important note: the columns are all strings (and need to stay strings), not ints, if that helps.

df.values does return a np array, where did you learn that an array should be a list of tuples?
– EdChum, Commented Apr 23, 2015 at 19:45
@Alexander bumpy shurely shome mishtake? You mean numpy ;-)
– EdChum, Commented Apr 23, 2015 at 19:46

wflynny · Accepted Answer · 2015-04-23 19:49:10Z


In fact, df.values does return a numpy.ndarray. It only looks like a list of lists because of the way it prints. You can check with type(df.values) or by looking at its shape: df.values.shape == (94, 3), which is the 2-D array you are after.
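As a quick check, here is a minimal sketch using a hypothetical six-row DataFrame that mirrors the rows shown in the question (the real one has 94 rows). The name arr is only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the question: string codes with some NaN.
df = pd.DataFrame({
    "Code1": ["99285", "99284", "99284", "99285", "99284", "99283"],
    "Code2": ["5921", np.nan, np.nan, "42789", "92321", np.nan],
    "Code3": ["5921", "5921", "4660", "42789", "92321", "92321"],
})

arr = df.values  # df.to_numpy() is the newer spelling in recent pandas

print(type(arr))   # an ndarray, despite printing like nested lists
print(arr.shape)   # (6, 3): one row per DataFrame row, one column per column
print(arr[0, 0])   # 99285 (still a string)
```

With the full DataFrame from the question, the same call should give a shape of (94, 3).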

df.to_records(), on the other hand, does not return a plain numpy.ndarray but a numpy.core.records.recarray (also available as numpy.recarray), which is a record-array subclass of ndarray. You can see that it is a recarray by doing

type(df.to_records())

or by noticing that the dtype is odd-looking:

df.to_records().dtype
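A small sketch of both checks, again on a hypothetical six-row sample rather than the asker's actual data. The variable name rec is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the question's columns.
df = pd.DataFrame({
    "Code1": ["99285", "99284", "99284", "99285", "99284", "99283"],
    "Code2": ["5921", np.nan, np.nan, "42789", "92321", np.nan],
    "Code3": ["5921", "5921", "4660", "42789", "92321", "92321"],
})

rec = df.to_records()

print(type(rec))                     # a recarray class
print(isinstance(rec, np.ndarray))   # True: recarray subclasses ndarray
print(rec.dtype.names)               # ('index', 'Code1', 'Code2', 'Code3')
print(rec.shape)                     # (6,): one record per row
```

The dtype is a structured dtype with one named field per column (plus the index), which is why the array itself is one-dimensional.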

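The shape of df.to_records() just indicates how many records there are, in your case 94: each row becomes one record, and the columns (plus the index) become named fields rather than a second axis. Record arrays behave differently from normal numpy arrays. For example, try

df.to_records()['Code1']
df.to_records().Code1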
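To make the difference concrete, the following sketch (hypothetical six-row sample, illustrative variable names) compares field access on the record array with the plain 2-D array. It passes index=False, an existing to_records parameter, so the index is not stored as an extra field.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the question's columns.
df = pd.DataFrame({
    "Code1": ["99285", "99284", "99284", "99285", "99284", "99283"],
    "Code2": ["5921", np.nan, np.nan, "42789", "92321", np.nan],
    "Code3": ["5921", "5921", "4660", "42789", "92321", "92321"],
})

rec = df.to_records(index=False)  # fields: Code1, Code2, Code3

by_key = rec["Code1"]   # field access by name
by_attr = rec.Code1     # same field, attribute style (recarray only)
print(np.array_equal(by_key, by_attr))  # True

# For a regular 2-D array (what scikit-learn expects), skip records entirely.
arr2d = df.to_numpy()
print(rec.shape, arr2d.shape)  # (6,) (6, 3)
```

Note that attribute access is case-sensitive and must match the column name exactly.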

edited Apr 23, 2015 at 19:49

answered Apr 23, 2015 at 19:43


1 Comment

jlmitch Over a year ago

Thanks for helping me understand why to_records wasn't working. This makes things much clearer for me.
