
I have a DataFrame like this:

import pandas as pd

cols = ['a', 'b']
df = pd.DataFrame([[[3, 1, 5, 7], [42, 31]], [[], [44]], [[44, 3, 5, 5, 5, 10], []], [[], [44324, 3]]],
                  columns=cols)

As you can see, there is a list in every cell. I want to do the following to each element:

  1. Calculate the mean of the list and add 5
  2. If the result is <= 0, put 1 in place of the list
  3. If the list is empty, put 0 in place of the list

My working solution:

def convert_list(x):
    # mean + 5; clamp non-positive results to 1; empty lists become 0
    if len(x) != 0:
        res = sum(x) / len(x) + 5
        if res <= 0:
            res = 1
        return res
    return 0

for col in cols:
    df[col] = df[col].apply(convert_list)

Desired output:

      a        b
0   9.0     41.5
1   0.0     49.0
2  17.0      0.0
3   0.0  22168.5

It works, but it's very slow (the original df has about 50k columns and 100k rows, and the lists can contain many elements). Is there a more efficient solution? I also tried converting it to a NumPy array and doing some vectorized operations, but the problem is that every list might have a different length, so I can't convert it (unless I pad the other lists with many elements...).

  • Not a complete solution but a pointer: given a series of lists where M is the maximum length, you could expand it to a DataFrame in which each n-sized list becomes a row of M elements, with everything after the first n set to NA; then you should be able to apply those operations efficiently. If you start from a DataFrame, you can temporarily convert it to a series using pivoting operations (stack/unstack). Commented Nov 17, 2021 at 13:50
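A rough sketch of that pointer (the sample series is hypothetical, mirroring column 'a' above; `pd.DataFrame(s.tolist())` does the NA padding):

```python
import pandas as pd

# Hypothetical sample: a series of ragged lists (column 'a' from the question).
s = pd.Series([[3, 1, 5, 7], [], [44, 3, 5, 5, 5, 10], []])

# Each n-sized list becomes a row of M columns, padded with NaN past n.
wide = pd.DataFrame(s.tolist(), index=s.index)

# Row-wise mean skips the NaN padding, so ragged lengths are fine.
res = wide.mean(axis=1) + 5
res[res <= 0] = 1    # result <= 0 -> 1
res = res.fillna(0)  # empty list (all-NaN row) -> 0
# res -> 9.0, 0.0, 17.0, 0.0
```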

2 Answers


You could use applymap with np.mean to average every cell and add 5. After adding 5, any value that is still <= 0 came from a mean of -5 or less, which matches the question's "result <= 0" rule, and the NaNs produced by empty lists can be filled with 0.

import pandas as pd
import numpy as np

cols = ['a', 'b']
df = pd.DataFrame([[[3, 1, 5, 7], [42, 31]], [[], [44]], [[44, 3, 5, 5, 5, 10], []], [[], [44324, 3]]],
                  columns=cols)

df = df.applymap(np.mean) + 5
df[df <= 0] = 1    # result <= 0 -> 1
df = df.fillna(0)  # empty lists give a NaN mean -> 0

Output

      a        b
0   9.0     41.5
1   0.0     49.0
2  17.0      0.0
3   0.0  22168.5

1 Comment

Thanks, it's about 4 times faster than my solution

You could try using explode then an aggregation, with the idea of avoiding apply.

Something like this, though it is a bit ugly.

def correct_column(col):
    cole = col.explode()                               # one row per element; [] becomes NaN
    m_empty = cole.isna().groupby(cole.index).first()  # rows that held empty lists
    groups = cole.groupby(cole.index)
    col = groups.sum() / groups.count() + 5            # mean + 5, per original row
    m_negative = col <= 0
    col[m_empty] = 0
    col[m_negative] = 1
    return col

for col in df.columns:
    df[col] = correct_column(df[col])

Note: groups.sum() / groups.count() may look like a strange way to compute the mean, but for empty lists (NaN after explode) groups.mean() raises an error, whereas groups.sum() simply returns 0.
