
I have a DataFrame like this:

import pandas as pd

cols = ['a', 'b']
df = pd.DataFrame([[[3, 1, 5, 7], [42, 31]], [[], [44]], [[44, 3, 5, 5, 5, 10], []], [[], [44324, 3]]],
                  columns=cols)

As you can see, there is a list in every cell. I want to do the following to each element:

  1. Calculate the mean of the list and add 5
  2. If the result is <= 0, put 1 in place of the list
  3. If the list is empty, put 0 in place of the list

My working solution:

def convert_list(x):
    # mean + 5; clamp non-positive results to 1; empty lists become 0
    if len(x) != 0:
        res = sum(x) / len(x) + 5
        if res <= 0:
            res = 1
        return res
    return 0

for col in cols:
    df[col] = df[col].apply(convert_list)

Desired output:

      a        b
0   9.0     41.5
1   0.0     49.0
2  17.0      0.0
3   0.0  22168.5

It works, but it's very slow (the original df has about 50k columns and 100k rows, and the lists can contain many elements). Is there a more efficient solution? I also tried converting it to a NumPy array and doing some vectorized operations, but the problem is that every list might have a different length, so I can't convert it (unless I pad the other lists with many elements...).

  • Not a complete solution but a pointer: given a series of lists where M is the maximum length, you could expand it to a DataFrame in which each n-sized list becomes a row of M elements, with everything after the first n set to NA; then you should be able to apply those operations efficiently. If you start from a DataFrame, you can temporarily convert it to a series using pivoting operations (stack/unstack). Commented Nov 17, 2021 at 13:50
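A rough sketch of that pointer (the sample series is hypothetical, mirroring column 'a' above; `pd.DataFrame(s.tolist())` does the NA padding):

```python
import pandas as pd

# Hypothetical sample: a series of ragged lists (column 'a' from the question).
s = pd.Series([[3, 1, 5, 7], [], [44, 3, 5, 5, 5, 10], []])

# Each n-sized list becomes a row of M columns, padded with NaN past n.
wide = pd.DataFrame(s.tolist(), index=s.index)

# Row-wise mean skips the NaN padding, so ragged lengths are fine.
res = wide.mean(axis=1) + 5
res[res <= 0] = 1    # result <= 0 -> 1
res = res.fillna(0)  # empty list (all-NaN row) -> 0
# res -> 9.0, 0.0, 17.0, 0.0
```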

2 Answers


You could use applymap with np.mean to average every cell and add 5. After adding 5, any value that is still <= 0 came from a mean of -5 or less, which matches the question's "result <= 0" rule, and the NaNs produced by empty lists can be filled with 0.

import pandas as pd
import numpy as np

cols = ['a', 'b']
df = pd.DataFrame([[[3, 1, 5, 7], [42, 31]], [[], [44]], [[44, 3, 5, 5, 5, 10], []], [[], [44324, 3]]],
                  columns=cols)

df = df.applymap(np.mean) + 5
df[df <= 0] = 1    # result <= 0 -> 1
df = df.fillna(0)  # empty lists give a NaN mean -> 0

Output

      a        b
0   9.0     41.5
1   0.0     49.0
2  17.0      0.0
3   0.0  22168.5

1 Comment

Thanks, it's about 4 times faster than my solution

You could try using explode then an aggregation, with the idea of avoiding apply.

Something like this, though it is a bit ugly.

def correct_column(col):
    cole = col.explode()                               # one row per element; [] becomes NaN
    m_empty = cole.isna().groupby(cole.index).first()  # rows that held empty lists
    groups = cole.groupby(cole.index)
    col = groups.sum() / groups.count() + 5            # mean + 5, per original row
    m_negative = col <= 0
    col[m_empty] = 0
    col[m_negative] = 1
    return col

for col in df.columns:
    df[col] = correct_column(df[col])

Note: groups.sum() / groups.count() may look like a strange way to compute the mean, but for empty lists (NaN after explode) groups.mean() raises an error, whereas groups.sum() simply returns 0.
