Split cell data into multiple rows using Python

Question

I want to split the data contained in a cell into multiple rows using Python. An example is given below:

This is my data:

fuel         cert_region  veh_class  air_pollution  city_mpg  hwy_mpg  cmb_mpg  smartway
ethanol/gas  FC           SUV        6/8            9/14      15/20    1/16     yes
ethanol/gas  FC           SUV        6/3            1/14      14/19    10/16    no

I want to convert it into this form:

fuel     cert_region  veh_class  air_pollution  city_mpg  hwy_mpg  cmb_mpg  smartway
ethanol  FC           SUV        6              9         15       1        yes
gas      FC           SUV        8              14        20       16       yes
ethanol  FC           SUV        6              1         14       10       no
gas      FC           SUV        3              14        19       16       no

The following code is returning an error:

import numpy as np
from itertools import chain

# return list from series of comma-separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split('/')))

# calculate lengths of splits
lens = df_08['fuel'].str.split('/').map(len)

# create new dataframe, repeating or chaining as appropriate
res = pd.DataFrame({
    'cert_region': np.repeat(df_08['cert_region'], lens),
    'veh_class': np.repeat(df_08['veh_class'], lens),
    'smartway': np.repeat(df_08['smartway'], lens),
    'fuel': chainer(df_08['fuel']),
    'air_pollution': chainer(df_08['air_pollution']),
    'city_mpg': chainer(df_08['city_mpg']),
    'hwy_mpg': chainer(df_08['hwy_mpg']),
    'cmb_mpg': chainer(df_08['cmb_mpg'])})

It gives me this error:

 TypeError                                 Traceback (most recent call last)
 <ipython-input-31-916fed75eee2> in <module>()
 20                     'fuel': chainer(df_08['fuel']),
 21                     'air_pollution_score': chainer(df_08['air_pollution_score']),
 ---> 22                     'city_mpg': chainer(df_08['city_mpg']),
 23                    'hwy_mpg': chainer(df_08['hwy_mpg']),
 24                    'cmb_mpg': chainer(df_08['cmb_mpg']),

  <ipython-input-31-916fed75eee2> in chainer(s)
  4 # return list from series of comma-separated strings
  5 def chainer(s):
  ----> 6     return list(chain.from_iterable(s.str.split('/')))
  7 
  8 # calculate lengths of splits

  TypeError: 'float' object is not iterable

But city_mpg has the Object data type:

   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 2404 entries, 0 to 2403
   Data columns (total 14 columns):
  fuel                    2404 non-null object
  cert_region             2404 non-null object
  veh_class               2404 non-null object
  air_pollution           2404 non-null object
  city_mpg                2205 non-null object
  hwy_mpg                 2205 non-null object
  cmb_mpg                 2205 non-null object
  smartway                2404 non-null object
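
A possible reading of the traceback, stated as an assumption because the full data is not shown: the info output above reports 2205 non-null values out of 2404 for city_mpg, so the column contains missing values. pandas stores those as float NaN even when the column dtype is object, and the .str accessor passes them through unsplit, which leaves a float among the lists that chain.from_iterable tries to iterate. The sketch below reproduces the same message with a plain Python list standing in for the split column; the values are made up for illustration.

```python
from itertools import chain

# Stand-in for what s.str.split('/') yields when one cell is missing:
# lists for the string cells, a bare float NaN for the missing one.
split_values = [['9', '14'], float('nan'), ['1', '14']]

try:
    list(chain.from_iterable(split_values))
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If that is what is happening, the missing cells need to be filled or skipped before chaining.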

sammywemmy · Accepted Answer · 2020-04-11 00:07:42Z

3

My suggestion is to step out of pandas, do your computation, and put the result back into a dataframe. In my opinion it is much easier to manipulate, and I'd like to believe faster:

from itertools import chain
import pandas as pd

Step 1: convert the dataframe to a list of dicts, one per row:

M = df.to_dict('records')

Step 2: do a list comprehension and split the values:

res = [[(key,*value.split('/'))
       for key,value in d.items()]
       for d in M]
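
To make the intermediate shape concrete, here is the same comprehension run on a single hand-written record. The record is a hypothetical stand-in for one element of M, with fewer columns than the real data.

```python
# One record as df.to_dict('records') would produce it (values assumed to be strings).
M = [{'fuel': 'ethanol/gas', 'cert_region': 'FC', 'city_mpg': '9/14'}]

# Each cell becomes a tuple: the column name followed by its split parts.
res = [[(key, *value.split('/'))
        for key, value in d.items()]
       for d in M]

print(res)
# [[('fuel', 'ethanol', 'gas'), ('cert_region', 'FC'), ('city_mpg', '9', '14')]]
```

Cells without a '/' produce a two-element tuple, which is why the later steps pad the shorter entries.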

Step 3: find the length of the longest row. We need this to ensure all rows are the same length:

longest = max(len(line) for line in chain(*res))
print(longest)
# 3

Step 4: the longest entry has length 3, so entries shorter than that need to be padded by repeating their value:

explode = [[(entry[0], entry[-1], entry[-1])
            if len(entry) < longest else entry for entry in box]
            for box in res]

print(explode)

[[('fuel', 'ethanol', 'gas'),
  ('cert_region', 'FC', 'FC'),
  ('veh_class', 'SUV', 'SUV'),
  ('air_pollution', '6', '8'),
  ('city_mpg', '9', '14'),
  ('hwy_mpg', '15', '20'),
  ('cmb_mpg', '1', '16'),
  ('smartway', 'yes', 'yes')],
 [('fuel', 'ethanol', 'gas'),
  ('cert_region', 'FC', 'FC'),
  ('veh_class', 'SUV', 'SUV'),
  ('air_pollution', '6', '3'),
  ('city_mpg', '1', '14'),
  ('hwy_mpg', '14', '19'),
  ('cmb_mpg', '10', '16'),
  ('smartway', 'no', 'no')]]
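
The padding expression above assumes that a short entry holds exactly one value and that longest is 3. A slightly more general sketch, offered as a guess at the intent rather than as part of the original answer, repeats the last value until every entry reaches longest. The input row is hypothetical.

```python
# One row in the shape produced by Step 2 (hypothetical values).
res = [[('fuel', 'ethanol', 'gas'), ('cert_region', 'FC'), ('city_mpg', '9', '14')]]

longest = max(len(entry) for box in res for entry in box)

# Pad each entry by repeating its last value; full-length entries get nothing added.
explode = [[entry + (entry[-1],) * (longest - len(entry)) for entry in box]
           for box in res]

print(explode)
# [[('fuel', 'ethanol', 'gas'), ('cert_region', 'FC', 'FC'), ('city_mpg', '9', '14')]]
```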

Step 5: now we can pair each key with its values to get a dictionary. Note that unpacking into start and end assumes the data has exactly two rows:

result = {start[0] :(*start[1:],*end[1:])
          for start,end in zip(*explode)}

print(result)

{'fuel': ('ethanol', 'gas', 'ethanol', 'gas'),
 'cert_region': ('FC', 'FC', 'FC', 'FC'),
 'veh_class': ('SUV', 'SUV', 'SUV', 'SUV'),
 'air_pollution': ('6', '8', '6', '3'),
 'city_mpg': ('9', '14', '1', '14'),
 'hwy_mpg': ('15', '20', '14', '19'),
 'cmb_mpg': ('1', '16', '10', '16'),
 'smartway': ('yes', 'yes', 'no', 'no')}
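
Because start, end unpacks exactly two items, the comprehension above raises "too many values to unpack" as soon as the data has more than two rows. A hedged variant that accepts any number of rows is sketched below; it assumes every row lists its columns in the same order, and the three rows are made-up data.

```python
from itertools import chain

# Three exploded rows (hypothetical data) instead of the two shown above.
explode = [
    [('fuel', 'ethanol', 'gas'), ('smartway', 'yes', 'yes')],
    [('fuel', 'ethanol', 'gas'), ('smartway', 'no', 'no')],
    [('fuel', 'diesel', 'diesel'), ('smartway', 'no', 'no')],
]

# zip(*explode) groups the entries for the same column across all rows;
# entry[0] is the column name and entry[1:] are that row's values.
result = {
    entries[0][0]: tuple(chain.from_iterable(entry[1:] for entry in entries))
    for entries in zip(*explode)
}

print(result)
```

With exactly two rows this gives the same dictionary as the original comprehension.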

Finally, read the result into a dataframe:

pd.DataFrame(result)

      fuel cert_region veh_class air_pollution city_mpg hwy_mpg cmb_mpg smartway
0  ethanol          FC       SUV             6        9      15       1      yes
1      gas          FC       SUV             8       14      20      16      yes
2  ethanol          FC       SUV             6        1      14      10       no
3      gas          FC       SUV             3       14      19      16       no
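
For data that also contains missing values or rows with a single fuel, the steps can be folded into one helper. This is only a sketch under stated assumptions, not the method from the answer: split_records is a hypothetical name, non-string cells such as NaN are kept unsplit, and each record is widened only to its own longest cell, so a row with nothing to split stays a single row.

```python
def split_records(records, sep='/'):
    """Expand each record into one row per sep-separated part."""
    rows = []
    for record in records:
        # Split strings; keep NaN and other non-string values as a single part.
        parts = {
            key: value.split(sep) if isinstance(value, str) else [value]
            for key, value in record.items()
        }
        width = max(len(p) for p in parts.values())
        for i in range(width):
            # Cells with fewer parts than width repeat their last part.
            rows.append({key: p[min(i, len(p) - 1)] for key, p in parts.items()})
    return rows


# Hypothetical records in the shape df.to_dict('records') produces.
records = [
    {'fuel': 'ethanol/gas', 'city_mpg': '9/14', 'smartway': 'yes'},
    {'fuel': 'gas', 'city_mpg': float('nan'), 'smartway': 'no'},
]
rows = split_records(records)
for row in rows:
    print(row)
```

If this fits your data, pd.DataFrame(split_records(df.to_dict('records'))) would turn the rows back into a dataframe.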

edited Apr 11, 2020 at 0:07

answered Apr 11, 2020 at 0:02

sammywemmy


7 Comments

Michael Hsi Over a year ago

This is a genius idea, definitely faster than using iterrows.

Umair Mukhtar Over a year ago

It gives me this error: AttributeError: 'float' object has no attribute 'split'

sammywemmy Over a year ago

Convert your dataframe to strings with df = df.astype(str). Do this before exporting to dict. Let me know how it goes.

Umair Mukhtar Over a year ago

I get another error: ValueError: too many values to unpack (expected 2) on this line: result = {start[0] :(*start[1:],*end[1:]) for start,end in zip(*explode)}

sammywemmy Over a year ago

Good to know you have gotten beyond the first error. What are the contents of the explode variable? Are they the same as those shared in my solution?


Michael Hsi · Accepted Answer · 2020-04-12 01:38:18Z

1

I think you're better off constructing a new dataframe

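# Start from an empty frame with the same columns as the source.
result = pd.DataFrame(columns=df_08.columns)
for index, series in df_08.iterrows():
    temp1 = {}
    temp2 = {}
    for key, value in series.items():
        # Only strings can be split; NaN and other non-string values are copied as-is.
        if isinstance(value, str) and '/' in value:
            val1, val2 = value.split('/')
            temp1[key] = [val1]
            temp2[key] = [val2]
        else:
            temp1[key] = temp2[key] = [value]

    # Append the two new rows built from this source row.
    result = pd.concat([result, pd.DataFrame(data=temp1),
                        pd.DataFrame(data=temp2)], axis=0, ignore_index=True)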

edited Apr 12, 2020 at 1:38

answered Apr 10, 2020 at 23:02

Michael Hsi


4 Comments

Umair Mukhtar Over a year ago

This is not working for me because some cells of the fuel column have a single value, so it gives this error: ValueError: not enough values to unpack (expected 2, got 1). Please give me another solution.

Michael Hsi Over a year ago

Added an if statement to check whether a split is required.

Umair Mukhtar Over a year ago

I get another error: ValueError: If using all scalar values, you must pass an index, on this line:

result = pd.concat([result, pd.DataFrame(data=temp1), pd.DataFrame(data=temp2)], axis=0, ignore_index=True)

Michael Hsi Over a year ago

Oh yeah, I forgot to change it.
