
I have a pandas DataFrame in which one of the columns is made of tuples of floats. When I use arr = df['col_name'].to_numpy(), I end up with a 1D array of tuples, but I need a 2D array of floats.

My solution so far is to use arr = np.array(df['col_name'].to_list()). This works, but it seems inefficient to convert first to a list and then to an array. So I'm wondering, is there a better way to do this?
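
For concreteness, a minimal reproduction of the setup (column name and values are illustrative):

import numpy as np
import pandas as pd

# A column of tuples is stored with object dtype.
df = pd.DataFrame({'col_name': [(2.15, 3.03, 4.07), (1.0, 2.0, 3.0)]})

arr = df['col_name'].to_numpy()              # 1D object array of tuples, shape (2,)
arr2d = np.array(df['col_name'].to_list())   # 2D float array, shape (2, 3)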

This question is related, but the only answer there points to reading a text file differently, which is not an option for me since the data is already in the DataFrame.

  • Is the dtype object? The tolist step is probably fast. An alternative might be vstack (see the sketch after these comments). Commented Dec 22, 2019 at 13:03
  • Yes, both df['col_name'].dtype and arr.dtype return dtype('O'). So I should stick to the current approach? Commented Dec 22, 2019 at 13:15
  • An object array uses references/pointers, just like Python lists, and so does a pandas object-dtype Series. So to_list should be pretty fast. Commented Dec 22, 2019 at 17:16
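
For reference, the vstack alternative mentioned above would look something like this (a sketch, assuming the df from the question):

import numpy as np

# Each tuple becomes one row of a 2D float array.
arr2d = np.vstack(df['col_name'])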

1 Answer


If your col_name column contains actual tuples, run:

pd.DataFrame(df['col_name'].apply(pd.Series))
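
Note that apply(pd.Series) already returns a DataFrame here, so the outer pd.DataFrame(...) is redundant; an equivalent, slightly shorter form, with the final array conversion:

expanded = df['col_name'].apply(pd.Series)   # one column per tuple element
arr2d = expanded.to_numpy()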

But if you have read your DataFrame from, e.g., a CSV file, then each element of col_name actually contains a string composed of:

  • an opening parenthesis,
  • a sequence of numbers, separated by commas,
  • a closing parenthesis,

and it only looks like a tuple.

If this is the case, run:

pd.DataFrame(df['col_name'].apply(lambda txt: pd.Series(eval(txt))))

In both cases the result is a DataFrame. If you need a NumPy array, call to_numpy() on it.
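
As a side note, ast.literal_eval is a safer alternative to eval for parsing such strings, since it only accepts Python literals (a sketch, assuming the strings are plain tuple literals):

import ast

expanded = df['col_name'].apply(lambda txt: pd.Series(ast.literal_eval(txt)))
arr2d = expanded.to_numpy()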

To check whether col_name contains strings or actual tuples, run (e.g., in Jupyter):

df.col_name.iloc[0]

If the result is '(2.15, 3.03, 4.07)' (surrounded by quotes), it is a string. But if you get (2.15, 3.03, 4.07) (without quotes), it is a tuple.

Another way to check is to run type(df.col_name.iloc[0]). You should get either tuple or str.
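
If you want to handle either case without inspecting the data by hand, a small dispatch works (a sketch; it assumes every element of the column has the same type):

import ast
import numpy as np

first = df['col_name'].iloc[0]
if isinstance(first, str):
    # Parse each string like '(2.15, 3.03, 4.07)' into a tuple first.
    values = df['col_name'].apply(ast.literal_eval)
else:
    values = df['col_name']
arr2d = np.array(values.to_list())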


3 Comments

Thank you, it works (I have actual tuples). So to get the numpy array the final call would be pd.DataFrame(df['col_name'].apply(pd.Series)).to_numpy(). To me this looks less readable than my current approach, but I guess the gain is in efficiency, since it's skipping the conversion to list, right?
pandas apply operations are generally slow, since apply is effectively a row iterator, and this operation builds a new DataFrame. timeit to be sure (see the sketch after these comments).
I agree with @hpaulj, this is entirely unnecessary. Doubly so when OP is concerned about the performance of .to_list().
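
For reference, a quick timing harness along the lines suggested above (a sketch; numbers will vary with data shape and machine):

import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name': [(1.0, 2.0, 3.0)] * 100_000})

print(timeit.timeit(lambda: np.array(df['col_name'].to_list()), number=10))
print(timeit.timeit(lambda: np.vstack(df['col_name']), number=10))
# apply(pd.Series) is much slower; time a single run.
print(timeit.timeit(lambda: df['col_name'].apply(pd.Series).to_numpy(), number=1))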
