Setting pandas.DataFrame string dtype (not file based)

Question

I'm having trouble using the dtype argument of the pandas.DataFrame constructor. I'd like to preserve string values, but the following snippets always convert to a numeric type and yield NaNs.

from __future__ import unicode_literals
from __future__ import print_function


import numpy as np
import pandas as pd


def main():
    columns = ['great', 'good', 'average', 'bad', 'horrible']
    # minimal example, dates are coming (as strings) from some
    # non-file source.
    example_data = {
        'alice': ['', '', '', '2016-05-24', ''],
        'bob': ['', '2015-01-02', '', '', '2012-09-15'],
        'eve': ['2011-12-31', '', '1998-08-13', '', ''],
    }

    # first pass, yields dataframe full of NaNs
    df = pd.DataFrame(data=example_data, index=example_data.keys(),
        columns=columns, dtype=str) #or string, 'str', 'string', 'object'
    print(df.dtypes)
    print(df)
    print()

    # based on https://github.com/pydata/pandas/blob/master/pandas/core/frame.py
    # and https://github.com/pydata/pandas/blob/37f95cef85834207db0930e863341efb285e38a2/pandas/types/common.py
    # we're ultimately feeding dtype to numpy's dtype, so let's just use that:
    #     (using np.dtype('S10') and converting to str doesn't work either)
    df = pd.DataFrame(data=example_data, index=example_data.keys(),
        columns=columns, dtype=np.dtype('U'))
    print(df.dtypes)
    print(df) # still full of NaNs... =(



if __name__ == '__main__':
    main()

What value(s) of dtypes will preserve strings in the data frame?

for reference:

$ python --version

2.7.12

$ pip2 list | grep pandas

pandas (0.18.1)

$ pip2 list | grep numpy

numpy (1.11.1)

Alicia Garcia-Raboso · Accepted Answer · 2016-09-20 21:17:00Z

For the particular case in the OP, you can use the DataFrame.from_dict() constructor (see also the Alternate Constructors section of the DataFrame documentation).

from __future__ import unicode_literals
from __future__ import print_function

import pandas as pd

columns = ['great', 'good', 'average', 'bad', 'horrible']
example_data = {
    'alice': ['', '', '', '2016-05-24', ''],
    'bob': ['', '2015-01-02', '', '', '2012-09-15'],
    'eve': ['2011-12-31', '', '1998-08-13', '', ''],
}
df = pd.DataFrame.from_dict(example_data, orient='index')
df.columns = columns

print(df.dtypes)
# great       object
# good        object
# average     object
# bad         object
# horrible    object
# dtype: object

print(df)
#             great        good     average         bad    horrible
# bob                2015-01-02                          2012-09-15
# eve    2011-12-31              1998-08-13                        
# alice                                      2016-05-24

You can even specify dtype=str in DataFrame.from_dict() — though it is not necessary in this example.
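
As a minimal sketch of that option, assuming Python 3 and pandas 0.23 or later (where from_dict() also accepts a columns argument when orient='index'), the dtype and the column labels can be passed in a single call. The variable names are reused from the example above.

```python
import pandas as pd

columns = ['great', 'good', 'average', 'bad', 'horrible']
example_data = {
    'alice': ['', '', '', '2016-05-24', ''],
    'bob': ['', '2015-01-02', '', '', '2012-09-15'],
    'eve': ['2011-12-31', '', '1998-08-13', '', ''],
}

# Each dict key becomes a row label; columns= names the positions in each list.
# Assumption: pandas >= 0.23, where from_dict() accepts columns= with orient='index'.
df = pd.DataFrame.from_dict(example_data, orient='index',
                            dtype=str, columns=columns)

print(df.dtypes)
print(df.loc['bob', 'good'])
```

The exact dtype reported by df.dtypes may differ between pandas versions, but the cell values should remain plain strings, including the empty ones.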

EDIT: The DataFrame constructor interprets a dictionary as a collection of columns:

print(pd.DataFrame(example_data))

#         alice         bob         eve
# 0                          2011-12-31
# 1              2015-01-02            
# 2                          1998-08-13
# 3  2016-05-24                        
# 4              2012-09-15

(I'm dropping the data=, since data is the first argument in the function's signature anyway). Your code confuses rows and columns:

print(pd.DataFrame(example_data, index=example_data.keys(), columns=columns))

#       great good average  bad horrible
# alice   NaN  NaN     NaN  NaN      NaN
# bob     NaN  NaN     NaN  NaN      NaN
# eve     NaN  NaN     NaN  NaN      NaN

(though I'm not exactly sure how it ends up giving you a DataFrame of NaNs). It would be correct to do

print(pd.DataFrame(example_data, columns=example_data.keys(), index=columns))

#                alice         bob         eve
# great                             2011-12-31
# good                  2015-01-02            
# average                           1998-08-13
# bad       2016-05-24                        
# horrible              2012-09-15

Specifying the column names is actually unnecessary — they are already parsed from the dictionary:

print(pd.DataFrame(example_data, index=columns))

#                alice         bob         eve
# great                             2011-12-31
# good                  2015-01-02            
# average                           1998-08-13
# bad       2016-05-24                        
# horrible              2012-09-15

What you want is actually the transpose of this — so you can also take said transpose!

print(pd.DataFrame(data=example_data, index=columns).T)

#             great        good     average         bad    horrible
# alice                                      2016-05-24            
# bob                2015-01-02                          2012-09-15
# eve    2011-12-31              1998-08-13
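
On the all-NaN result seen earlier, one likely explanation (an observation about how the constructor handles dict input, not something confirmed in the answer itself): when data is a dict, its keys are taken as column labels, and columns= then selects among those labels. None of 'great', 'good', and so on is a key of example_data, so every requested column is created empty and filled with NaN, whatever dtype is passed. A small sketch, assuming Python 3 and a recent pandas; the names swapped and partial are only illustrative:

```python
import pandas as pd

columns = ['great', 'good', 'average', 'bad', 'horrible']
example_data = {
    'alice': ['', '', '', '2016-05-24', ''],
    'bob': ['', '2015-01-02', '', '', '2012-09-15'],
    'eve': ['2011-12-31', '', '1998-08-13', '', ''],
}

# Dict keys are column labels; none of them appears in `columns`,
# so no data is selected and every cell is filled with NaN.
swapped = pd.DataFrame(example_data, index=list(example_data), columns=columns)
print(swapped.isna().all().all())

# Requesting a label that does exist as a key keeps that column's data,
# while the unknown label 'great' still comes back as NaN.
partial = pd.DataFrame(example_data, columns=['alice', 'great'])
print(partial['alice'].tolist())
print(partial['great'].isna().all())
```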

Yay, this works. Out of curiosity, do you know why this works while the original doesn't? Given the usual constructor can take dictionaries it's not clear to me why the from_dict constructor behaves so differently.

@everial: after looking at it more carefully, I figured out what it is that you were doing wrong --- see edit.

Thanks for the detailed followup... you can see how experienced with pandas I am. =)

AlvaroP · Accepted Answer · 2016-09-20 19:06:03Z

This is not a proper answer, but until someone else provides one: I've noticed that everything works if you use the read_csv function.

So if you place your data in a .csv file called myData.csv, like this:

great,good,average,bad,horrible
alice,,,,2016-05-24,
bob,,2015-01-02,,,2012-09-15
eve,2011-12-31,,1998-08-13,,

and do

df = pd.read_csv('blablah/myData.csv')

it will keep the strings as they are!

            great        good     average         bad    horrible
alice         NaN         NaN         NaN  2016-05-24         NaN
bob           NaN  2015-01-02         NaN         NaN  2012-09-15
eve    2011-12-31         NaN  1998-08-13         NaN         NaN

If you want, the empty values can be written as a space in the CSV file, or as any other character/marker.
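
If the data is not already in a file, read_csv can also be pointed at an in-memory buffer. The sketch below assumes Python 3 and a recent pandas; keep_default_na=False is used so that empty fields stay as empty strings rather than becoming NaN. The csv_text variable is introduced here only for illustration.

```python
import io

import pandas as pd

# Same content as myData.csv above, held in memory instead of on disk.
csv_text = (
    "great,good,average,bad,horrible\n"
    "alice,,,,2016-05-24,\n"
    "bob,,2015-01-02,,,2012-09-15\n"
    "eve,2011-12-31,,1998-08-13,,\n"
)

# Each data row has one more field than the header, so read_csv uses
# the first field as the index. keep_default_na=False stops '' from
# being parsed as NaN.
df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(df)
```

This still goes through CSV parsing, so it is a workaround rather than a replacement for building the DataFrame directly from the dict.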

everial Over a year ago

Thanks for the suggestion, but as mentioned in the sample the data currently isn't in a file -- I don't really want to write it out just to reread it with read_csv if that can be avoided.
