I apologize in advance if this question seems slightly naive. I am still learning about the interplay between pandas and numpy.
I have a pandas DataFrame that I am trying to convert into an array for analysis using scikit-learn. I have tried df.values and df.to_records() to convert it, but for some reason, it changes the shape during the conversion.
These are the first few rows of the DataFrame (df) in pandas.
Index Code1 Code2 Code3
0 99285 5921 5921
1 99284 NaN 5921
2 99284 NaN 4660
3 99285 42789 42789
4 99284 92321 92321
5 99283 NaN 92321
...
[94 rows x 3 columns]
However, when I call df.values, I get the following result which, as far as I understand, is not an array, since arrays are lists of tuples.
[['99285' '5921' '5921']
['99284' nan '5921']
['99284' nan '4660']
['99285' '42789' '42789']
['99284' '92321' '92321']
['99283' nan '92321']
...
If I call df.to_records(), I get the following result, which is an array but does not have the right shape, as shown below.
[(0, '99285', '5921', '5921') (1, '99284', nan, '5921')
(2, '99284', nan, '4660') (3, '99285', '42789', '42789')
(4, '99284', '92321', '92321') (5, '99283', nan, '92321')
...
>>> df.to_records().shape
(94,)
Can someone help me understand what I need to do to get an array with a shape of (94,3)?
Important note: the columns are all strings, not ints, and need to stay as strings, if that helps.
df.values.shape == (94, 3)? df.values does return a NumPy array. Where did you learn that an array should be a list of tuples?
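To check this quickly, here is a minimal sketch that rebuilds only the six rows shown in the question (the full 94-row frame is not available, so the data below is a stand-in) and compares the two conversions. It assumes the missing entries are NaN and everything else is a string, as described.

```python
import numpy as np
import pandas as pd

# Stand-in for the first six rows shown in the question; the real df has 94 rows.
df = pd.DataFrame(
    {
        "Code1": ["99285", "99284", "99284", "99285", "99284", "99283"],
        "Code2": ["5921", np.nan, np.nan, "42789", "92321", np.nan],
        "Code3": ["5921", "5921", "4660", "42789", "92321", "92321"],
    }
)

# df.values gives a regular 2-D NumPy array: one row per DataFrame row,
# one column per DataFrame column.
arr = df.values
print(type(arr))   # a numpy.ndarray
print(arr.shape)   # (6, 3)

# df.to_records() gives a 1-D record array: each element is one whole row
# (index included), which is why its shape has a single dimension.
rec = df.to_records()
print(rec.shape)   # (6,)
```

The print of a 2-D array shows nested square brackets with no commas, which can look unfamiliar, but it is still an ordinary ndarray. On the full frame the same call should report (94, 3).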