I apologize in advance if this question seems slightly naive. I am still learning about the interplay between pandas and numpy.

I have a pandas DataFrame that I am trying to convert into an array for analysis using scikit-learn. I have tried df.values and df.to_records() to convert it, but for some reason, it changes the shape during the conversion.

These are the first few rows of the DataFrame (df) in pandas.

Index           Code1    Code2       Code3
0               99285    5921         5921
1               99284     NaN         5921
2               99284     NaN         4660
3               99285   42789        42789
4               99284   92321        92321
5               99283     NaN        92321
...
[94 rows x 3 columns]

However, when I call df.values, I get the following result, which, as far as I understand, is not an array, since arrays are lists of tuples.

[['99285' '5921' '5921']
['99284' nan '5921']
['99284' nan '4660']
['99285' '42789' '42789']
['99284' '92321' '92321']
['99283' nan '92321']
...

If I call df.to_records(), I get the following result, which is an array but does not have the right shape:

[(0, '99285', '5921', '5921') (1, '99284', nan, '5921')
(2, '99284', nan, '4660') (3, '99285', '42789', '42789')
(4, '99284', '92321', '92321') (5, '99283', nan, '92321')
...
>>> df.to_records().shape
(94,)

Can someone help me understand what I need to do to get an array with a shape of (94,3)?

Important note: the columns are all strings (and need to stay strings), not ints, if that helps.

  • isn't df.values.shape == (94, 3) ? Commented Apr 23, 2015 at 19:40
  • df.values does return a np array, where did you learn that an array should be a list of tuples? Commented Apr 23, 2015 at 19:45
  • type(df.values) indicates that it is a numpy.ndarray Commented Apr 23, 2015 at 19:45
  • @Alexander bumpy shurely shome mishtake? You mean numpy ;-) Commented Apr 23, 2015 at 19:46
  • typo, but maybe bumpy will stick... Commented Apr 23, 2015 at 19:47

1 Answer

In fact, df.values does return a numpy.ndarray. However, because of the way it prints, it looks like a list of lists. You can check by calling type(df.values) or by looking at its shape: df.values.shape == (94, 3).
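
To make that check concrete, here is a minimal sketch. The six-row DataFrame is a made-up stand-in for your data (your real frame has 94 rows), so only the column names and the first few values come from your question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the DataFrame in the question:
# string codes with some missing values.
df = pd.DataFrame(
    {
        "Code1": ["99285", "99284", "99284", "99285", "99284", "99283"],
        "Code2": ["5921", np.nan, np.nan, "42789", "92321", np.nan],
        "Code3": ["5921", "5921", "4660", "42789", "92321", "92321"],
    }
)

arr = df.values       # recent pandas versions also offer df.to_numpy()

print(type(arr))      # <class 'numpy.ndarray'>
print(arr.shape)      # (6, 3): one row per record, one column per code
print(arr[0, 0])      # the values are still strings
```

With the full 94-row frame, the same call should report a shape of (94, 3).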

However, df.to_records() does not return a plain numpy.ndarray, but a numpy.core.records.recarray (a record array, which is a subclass of ndarray). You can see that it is a recarray by doing

type(df.to_records())

or by noticing that the dtype is odd-looking:

df.to_records().dtype

The shape of df.to_records() just indicates how many records there are, in your case 94; each record holds the index plus the three codes as named fields, so the array is one-dimensional. Record arrays behave differently from normal numpy arrays. For example, try

df.to_records()['Code1']
df.to_records().Code1
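
Here is a short sketch of how the record array differs from the plain 2-D array. The two-row DataFrame is again a made-up stand-in, and index=False is only shown as an optional way to drop the index field.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row stand-in for the DataFrame in the question.
df = pd.DataFrame(
    {
        "Code1": ["99285", "99284"],
        "Code2": ["5921", np.nan],
        "Code3": ["5921", "5921"],
    }
)

rec = df.to_records()            # the index becomes the first field
print(type(rec))
print(rec.shape)                 # (2,): one entry per row
print(rec.dtype.names)           # ('index', 'Code1', 'Code2', 'Code3')

# Fields can be read by key or by attribute; names are case-sensitive.
code1_by_key = rec["Code1"]
code1_by_attr = rec.Code1

# Leaving the index out only changes the fields, not the 1-D shape.
rec_no_index = df.to_records(index=False)
print(rec_no_index.dtype.names)  # ('Code1', 'Code2', 'Code3')

# The plain 2-D array comes from df.values instead.
arr = df.values
print(arr.shape)                 # (2, 3)
```

So if scikit-learn needs a (rows, 3) array, df.values is the one to pass along; the record array stays one-dimensional no matter how many columns there are.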

1 Comment

Thanks for helping me understand why to_records wasn't working. This makes things much clearer for me.
