How to use a function that returns numpy array within pandas apply

Question

I have a DataFrame that looks like this:

import pandas as pd
df_dict = {'var1': {(1, 1.0, 'obj1'): 1.0, (1, 1.0, 'obj4'): 1.0, (1, 1.0, 'obj3'): 2.0, (1, 1.0, 'obj5'): 2.0, (1, 1.0, 'obj2'): 3.0, (1, 2.0, 'obj1'): 1.0, (1, 2.0, 'obj4'): 1.0, (1, 2.0, 'obj3'): 2.0, (1, 2.0, 'obj5'): 2.0, (1, 2.0, 'obj2'): 3.0, (1, 3.0, 'obj1'): 1.0, (1, 3.0, 'obj4'): 1.0, (1, 3.0, 'obj3'): 2.0, (1, 3.0, 'obj5'): 2.0, (1, 3.0, 'obj2'): 3.0, (1, 4.0, 'obj1'): 1.0, (1, 4.0, 'obj4'): 1.0, (1, 4.0, 'obj3'): 2.0, (1, 4.0, 'obj5'): 2.0, (1, 4.0, 'obj2'): 3.0}, 'var2': {(1, 1.0, 'obj1'): -0.9799804687499858, (1, 1.0, 'obj4'): 0.009998139880948997, (1, 1.0, 'obj3'): -1.0299944196428612, (1, 1.0, 'obj5'): 0.029994419642846992, (1, 1.0, 'obj2'): 1.9999999999999574, (1, 2.0, 'obj1'): -1.0200195312500426, (1, 2.0, 'obj4'): 0.07001023065477341, (1, 2.0, 'obj3'): -0.6900111607143344, (1, 2.0, 'obj5'): -0.03999255952379599, (1, 2.0, 'obj2'): 1.9400111607142634, (1, 3.0, 'obj1'): -1.0599888392857082, (1, 3.0, 'obj4'): 0.1399972098214164, (1, 3.0, 'obj3'): -0.36002604166661456, (1, 3.0, 'obj5'): -0.12002418154757777, (1, 3.0, 'obj2'): 1.8699776785714306, (1, 4.0, 'obj1'): -1.09000651041665, (1, 4.0, 'obj4'): 0.1900111607142918, (1, 4.0, 'obj3'): -0.029994419642918047, (1, 4.0, 'obj5'): -0.2000093005952408, (1, 4.0, 'obj2'): 1.8099888392857366}, 'var3': {(1, 1.0, 'obj1'): 0.0, (1, 1.0, 'obj4'): -1.9899974149816302, (1, 1.0, 'obj3'): -0.020033892463189318, (1, 1.0, 'obj5'): -0.03999597886028994, (1, 1.0, 'obj2'): -0.029979032628659752, (1, 2.0, 'obj1'): 0.050012925091920124, (1, 2.0, 'obj4'): -1.999978458180145, (1, 2.0, 'obj3'): 0.19003475413597926, (1, 2.0, 'obj5'): 0.18996294806989056, (1, 2.0, 'obj2'): -0.029979032628730806, (1, 3.0, 'obj1'): 0.10002585018380472, (1, 3.0, 'obj4'): -2.03001134535846, (1, 3.0, 'obj3'): 0.3900146484375, (1, 3.0, 'obj5'): 0.41001263786760944, (1, 3.0, 'obj2'): -0.040031881893369814, (1, 4.0, 'obj1'): 0.1499669692095651, (1, 4.0, 'obj4'): -2.040010340073515, (1, 4.0, 'obj3'): 0.5999755859375, (1, 4.0, 'obj5'): 0.6100284352022101, 
(1, 4.0, 'obj2'): -0.05999396829039938}}
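df = pd.DataFrame.from_dict(df_dict)
# Name the index levels so the groupby calls below can refer to them
df.index.names = ['measurement_id', 'repeat_id', 'object']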

                                 var1      var2      var3
measurement_id repeat_id object                          
1              1.0       obj1     1.0 -0.979980  0.000000
                         obj4     1.0  0.009998 -1.989997
                         obj3     2.0 -1.029994 -0.020034
                         obj5     2.0  0.029994 -0.039996
                         obj2     3.0  2.000000 -0.029979
               2.0       obj1     1.0 -1.020020  0.050013
                         obj4     1.0  0.070010 -1.999978
                         obj3     2.0 -0.690011  0.190035
                         obj5     2.0 -0.039993  0.189963
                         obj2     3.0  1.940011 -0.029979
               3.0       obj1     1.0 -1.059989  0.100026
                         obj4     1.0  0.139997 -2.030011
                         obj3     2.0 -0.360026  0.390015
                         obj5     2.0 -0.120024  0.410013
                         obj2     3.0  1.869978 -0.040032
               4.0       obj1     1.0 -1.090007  0.149967
                         obj4     1.0  0.190011 -2.040010
                         obj3     2.0 -0.029994  0.599976
                         obj5     2.0 -0.200009  0.610028
                         obj2     3.0  1.809989 -0.059994

I'd like to smooth var2 with scipy.signal.savgol_filter, but I need to do this for each object separately. So my call looks like this:

import scipy.signal as signal
df.groupby(['measurement_id', 'object'])['var2'].apply(lambda x: signal.savgol_filter(x, window_length=3, polyorder=2))

measurement_id  object
1               obj1      [-0.9799804687499857, -1.0200195312500429, -1....
                obj2      [1.9999999999999565, 1.9400111607142636, 1.869...
                obj3      [-1.0299944196428608, -0.6900111607143345, -0....
                obj4      [0.009998139880949027, 0.07001023065477342, 0....
                obj5      [0.02999441964284698, -0.039992559523796, -0.1...
Name: var2, dtype: object

However, as the output of savgol_filter is an np.ndarray, I'm not sure how to properly assign the output as a new column var4. I have tried pandas explode, but I still lack the row order needed to assign the values back correctly.

jezrael · Accepted Answer · 2020-04-24 06:43:53Z

I think you need GroupBy.transform, which converts the returned numpy array to a Series aligned with the original index:

df['var4'] = (df.groupby(['measurement_id', 'object'])['var2']
                .transform(lambda x: signal.savgol_filter(x, window_length=3, polyorder=2)))
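
To illustrate why transform fits here: it expects the function to return something of the same length as each group and places the values back on the original index, so an ndarray result lines up row by row. A minimal sketch on a made-up frame (the names toy, key, val and val_smooth are illustrative, not from the question):

```python
import pandas as pd
import scipy.signal as signal

# Hypothetical toy data: two interleaved groups, not the frame from the question
toy = pd.DataFrame({
    'key': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
    'val': [1.0, 10.0, 2.0, 30.0, 4.0, 20.0, 3.0, 40.0],
})

# polyorder=1 is used here so the filter visibly changes the values
smoothed = toy.groupby('key')['val'].transform(
    lambda x: signal.savgol_filter(x, window_length=3, polyorder=1)
)

# The result carries the same index as the input, so direct assignment works
toy['val_smooth'] = smoothed
print(toy)
```

The rows of group a and group b stay in their original interleaved positions; only the values are replaced by the per-group filter output.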

Another idea is to create a custom function that assigns the new column within each group:

import scipy.signal as signal

def func(x):
    x['var4'] = signal.savgol_filter(x['var2'], window_length=3, polyorder=2)
    return x

# group_keys=False keeps the original index instead of prepending the group keys
df = df.groupby(['measurement_id', 'object'], group_keys=False).apply(func)

print(df)
                                 var1      var2      var3      var4
measurement_id repeat_id object                                    
1              1.0       obj1     1.0 -0.979980  0.000000 -0.979980
                         obj4     1.0  0.009998 -1.989997  0.009998
                         obj3     2.0 -1.029994 -0.020034 -1.029994
                         obj5     2.0  0.029994 -0.039996  0.029994
                         obj2     3.0  2.000000 -0.029979  2.000000
               2.0       obj1     1.0 -1.020020  0.050013 -1.020020
                         obj4     1.0  0.070010 -1.999978  0.070010
                         obj3     2.0 -0.690011  0.190035 -0.690011
                         obj5     2.0 -0.039993  0.189963 -0.039993
                         obj2     3.0  1.940011 -0.029979  1.940011
               3.0       obj1     1.0 -1.059989  0.100026 -1.059989
                         obj4     1.0  0.139997 -2.030011  0.139997
                         obj3     2.0 -0.360026  0.390015 -0.360026
                         obj5     2.0 -0.120024  0.410013 -0.120024
                         obj2     3.0  1.869978 -0.040032  1.869978
               4.0       obj1     1.0 -1.090007  0.149967 -1.090007
                         obj4     1.0  0.190011 -2.040010  0.190011
                         obj3     2.0 -0.029994  0.599976 -0.029994
                         obj5     2.0 -0.200009  0.610028 -0.200009
                         obj2     3.0  1.809989 -0.059994  1.809989
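
Note that var4 equals var2 in the printed frame. That is expected for these parameters rather than a sign that the filter was skipped: with window_length=3 and polyorder=2, a quadratic is fitted to each window of three points, and a quadratic through three points reproduces them exactly. A quick check on made-up values (the array x is illustrative):

```python
import numpy as np
import scipy.signal as signal

# Illustrative values: 4 samples, like one object in the question
x = np.array([1.0, 2.0, 4.0, 3.0])

exact = signal.savgol_filter(x, window_length=3, polyorder=2)
smoothed = signal.savgol_filter(x, window_length=3, polyorder=1)

print(np.allclose(exact, x))     # quadratic through 3 points reproduces the input
print(np.allclose(smoothed, x))  # linear fit over 3 points changes the values
```

To get actual smoothing, polyorder has to be lower than window_length - 1, or the window has to be longer.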

Albo · Accepted Answer · 2020-04-28 05:40:20Z


import scipy.signal as signal
df['var4'] = signal.savgol_filter(df.var2.values, window_length=3, polyorder=2)

gives you this output as well:

print(df)

                                 var1      var2      var3      var4
measurement_id repeat_id object                                    
1              1.0       obj1     1.0 -0.979980  0.000000 -0.979980
                         obj4     1.0  0.009998 -1.989997  0.009998
                         obj3     2.0 -1.029994 -0.020034 -1.029994
                         obj5     2.0  0.029994 -0.039996  0.029994
                         obj2     3.0  2.000000 -0.029979  2.000000
               2.0       obj1     1.0 -1.020020  0.050013 -1.020020
                         obj4     1.0  0.070010 -1.999978  0.070010
                         obj3     2.0 -0.690011  0.190035 -0.690011
                         obj5     2.0 -0.039993  0.189963 -0.039993
                         obj2     3.0  1.940011 -0.029979  1.940011
               3.0       obj1     1.0 -1.059989  0.100026 -1.059989
                         obj4     1.0  0.139997 -2.030011  0.139997
                         obj3     2.0 -0.360026  0.390015 -0.360026
                         obj5     2.0 -0.120024  0.410013 -0.120024
                         obj2     3.0  1.869978 -0.040032  1.869978
               4.0       obj1     1.0 -1.090007  0.149967 -1.090007
                         obj4     1.0  0.190011 -2.040010  0.190011
                         obj3     2.0 -0.029994  0.599976 -0.029994
                         obj5     2.0 -0.200009  0.610028 -0.200009
                         obj2     3.0  1.809989 -0.059994  1.809989

UPDATE

The signature of scipy's savgol_filter looks like this:

scipy.signal.savgol_filter(x, window_length, polyorder, deriv=0, delta=1.0, axis=-1, mode='interp', cval=0.0)

It uses axis=-1 by default, which means the filter is applied along the last axis of the passed numpy array x. Here df.var2.values is one-dimensional (df.var2.values.shape -> (20,)), so the last axis is axis 0. This results in the same as

signal.savgol_filter(df.var2.values, window_length=3, polyorder=2, axis=0)

because, after several internal calls in scipy/signal/_savitzky_golay.py (see github), correlate1d([...]) in ndimage/filters.py is called with axis=-1; there _check_axis(axis=-1, input.ndim=1) returns 0 (df.var2.values.ndim is 1), so axis=0 gives the same result in signal.savgol_filter. Thus I recommend sorting the whole array first:

# sort_index returns a new DataFrame, so the result has to be assigned back
df = df.sort_index(axis=0, level='repeat_id')
df['var4'] = signal.savgol_filter(df.var2.values, window_length=3, polyorder=2)

which then returns:

                                 var1      var2      var3      var4
measurement_id repeat_id object                                    
1              1.0       obj1     1.0 -0.979980  0.000000 -0.979980
                         obj2     3.0  2.000000 -0.029979  2.000000
                         obj3     2.0 -1.029994 -0.020034 -1.029994
                         obj4     1.0  0.009998 -1.989997  0.009998
                         obj5     2.0  0.029994 -0.039996  0.029994
               2.0       obj1     1.0 -1.020020  0.050013 -1.020020
                         obj2     3.0  1.940011 -0.029979  1.940011
                         obj3     2.0 -0.690011  0.190035 -0.690011
                         obj4     1.0  0.070010 -1.999978  0.070010
                         obj5     2.0 -0.039993  0.189963 -0.039993
               3.0       obj1     1.0 -1.059989  0.100026 -1.059989
                         obj2     3.0  1.869978 -0.040032  1.869978
                         obj3     2.0 -0.360026  0.390015 -0.360026
                         obj4     1.0  0.139997 -2.030011  0.139997
                         obj5     2.0 -0.120024  0.410013 -0.120024
               4.0       obj1     1.0 -1.090007  0.149967 -1.090007
                         obj2     3.0  1.809989 -0.059994  1.809989
                         obj3     2.0 -0.029994  0.599976 -0.029994
                         obj4     1.0  0.190011 -2.040010  0.190011
                         obj5     2.0 -0.200009  0.610028 -0.200009
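
The axis remark in the UPDATE can be checked directly. For a one-dimensional array the last axis (axis=-1) is axis 0, so both calls filter the same sequence. For a two-dimensional array the default filters each row on its own and keeps the shape. A small sketch with made-up values (x and x2 are illustrative):

```python
import numpy as np
import scipy.signal as signal

x = np.array([1.0, 2.0, 4.0, 3.0, 5.0, 4.0])  # illustrative 1-D input

default_axis = signal.savgol_filter(x, window_length=3, polyorder=1)  # axis=-1
explicit_axis = signal.savgol_filter(x, window_length=3, polyorder=1, axis=0)
print(x.ndim, np.allclose(default_axis, explicit_axis))

# 2-D input: axis=-1 means "along each row"; the output keeps the 2-D shape
x2 = x.reshape(2, 3)
rows = signal.savgol_filter(x2, window_length=3, polyorder=1)
print(rows.shape)
```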

edited Apr 28, 2020 at 5:40

answered Apr 24, 2020 at 12:32

Albo


6 Comments

Xaume Over a year ago

It seems to give the same output indeed, but I can't really grasp how exactly a function knows it should be applied to each object separately?

jezrael Over a year ago

@Xaume - Change the solution to signal.savgol_filter(df.var2.values, window_length=3, polyorder=1) and you will then get a difference between processing per group and processing the whole column as in this answer.

jezrael Over a year ago

Hmm, if you check my comment above, why is there a difference?

Albo Over a year ago

I'm sorry, what exactly do you mean?

jezrael Over a year ago

I mean the second comment under this answer. Why is there a difference?
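
The difference asked about in these comments can be reproduced on a small made-up frame (the names toy, obj, val and the helper smooth are illustrative). Filtering the whole column treats neighbouring rows as one signal even when they belong to different objects, whereas the groupby version only combines values of the same object. With window_length=3 and polyorder=2 both variants simply reproduce the input, which hides the difference; with polyorder=1 it shows up:

```python
import numpy as np
import pandas as pd
import scipy.signal as signal

# Hypothetical toy data with two interleaved objects
toy = pd.DataFrame({
    'obj': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
    'val': [1.0, 10.0, 2.0, 30.0, 4.0, 20.0, 3.0, 40.0],
})

def smooth(values, polyorder):
    """Savitzky-Golay filter with a fixed window of 3 samples."""
    return signal.savgol_filter(values, window_length=3, polyorder=polyorder)

for polyorder in (2, 1):
    # Whole column at once: windows span rows of different objects
    whole = smooth(toy['val'].to_numpy(), polyorder)
    # Per object: each window only sees values of one object
    per_group = (toy.groupby('obj')['val']
                    .transform(lambda x: smooth(x, polyorder))
                    .to_numpy())
    print(polyorder, np.allclose(whole, per_group))
```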
