I have a DataFrame that looks like this:

import pandas as pd
df_dict = {'var1': {(1, 1.0, 'obj1'): 1.0, (1, 1.0, 'obj4'): 1.0, (1, 1.0, 'obj3'): 2.0, (1, 1.0, 'obj5'): 2.0, (1, 1.0, 'obj2'): 3.0, (1, 2.0, 'obj1'): 1.0, (1, 2.0, 'obj4'): 1.0, (1, 2.0, 'obj3'): 2.0, (1, 2.0, 'obj5'): 2.0, (1, 2.0, 'obj2'): 3.0, (1, 3.0, 'obj1'): 1.0, (1, 3.0, 'obj4'): 1.0, (1, 3.0, 'obj3'): 2.0, (1, 3.0, 'obj5'): 2.0, (1, 3.0, 'obj2'): 3.0, (1, 4.0, 'obj1'): 1.0, (1, 4.0, 'obj4'): 1.0, (1, 4.0, 'obj3'): 2.0, (1, 4.0, 'obj5'): 2.0, (1, 4.0, 'obj2'): 3.0}, 'var2': {(1, 1.0, 'obj1'): -0.9799804687499858, (1, 1.0, 'obj4'): 0.009998139880948997, (1, 1.0, 'obj3'): -1.0299944196428612, (1, 1.0, 'obj5'): 0.029994419642846992, (1, 1.0, 'obj2'): 1.9999999999999574, (1, 2.0, 'obj1'): -1.0200195312500426, (1, 2.0, 'obj4'): 0.07001023065477341, (1, 2.0, 'obj3'): -0.6900111607143344, (1, 2.0, 'obj5'): -0.03999255952379599, (1, 2.0, 'obj2'): 1.9400111607142634, (1, 3.0, 'obj1'): -1.0599888392857082, (1, 3.0, 'obj4'): 0.1399972098214164, (1, 3.0, 'obj3'): -0.36002604166661456, (1, 3.0, 'obj5'): -0.12002418154757777, (1, 3.0, 'obj2'): 1.8699776785714306, (1, 4.0, 'obj1'): -1.09000651041665, (1, 4.0, 'obj4'): 0.1900111607142918, (1, 4.0, 'obj3'): -0.029994419642918047, (1, 4.0, 'obj5'): -0.2000093005952408, (1, 4.0, 'obj2'): 1.8099888392857366}, 'var3': {(1, 1.0, 'obj1'): 0.0, (1, 1.0, 'obj4'): -1.9899974149816302, (1, 1.0, 'obj3'): -0.020033892463189318, (1, 1.0, 'obj5'): -0.03999597886028994, (1, 1.0, 'obj2'): -0.029979032628659752, (1, 2.0, 'obj1'): 0.050012925091920124, (1, 2.0, 'obj4'): -1.999978458180145, (1, 2.0, 'obj3'): 0.19003475413597926, (1, 2.0, 'obj5'): 0.18996294806989056, (1, 2.0, 'obj2'): -0.029979032628730806, (1, 3.0, 'obj1'): 0.10002585018380472, (1, 3.0, 'obj4'): -2.03001134535846, (1, 3.0, 'obj3'): 0.3900146484375, (1, 3.0, 'obj5'): 0.41001263786760944, (1, 3.0, 'obj2'): -0.040031881893369814, (1, 4.0, 'obj1'): 0.1499669692095651, (1, 4.0, 'obj4'): -2.040010340073515, (1, 4.0, 'obj3'): 0.5999755859375, (1, 4.0, 'obj5'): 0.6100284352022101, 
(1, 4.0, 'obj2'): -0.05999396829039938}}
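df = pd.DataFrame.from_dict(df_dict)
# name the index levels so they match the printout below and can be used in groupby
df.index.names = ['measurement_id', 'repeat_id', 'object']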

                                 var1      var2      var3
measurement_id repeat_id object                          
1              1.0       obj1     1.0 -0.979980  0.000000
                         obj4     1.0  0.009998 -1.989997
                         obj3     2.0 -1.029994 -0.020034
                         obj5     2.0  0.029994 -0.039996
                         obj2     3.0  2.000000 -0.029979
               2.0       obj1     1.0 -1.020020  0.050013
                         obj4     1.0  0.070010 -1.999978
                         obj3     2.0 -0.690011  0.190035
                         obj5     2.0 -0.039993  0.189963
                         obj2     3.0  1.940011 -0.029979
               3.0       obj1     1.0 -1.059989  0.100026
                         obj4     1.0  0.139997 -2.030011
                         obj3     2.0 -0.360026  0.390015
                         obj5     2.0 -0.120024  0.410013
                         obj2     3.0  1.869978 -0.040032
               4.0       obj1     1.0 -1.090007  0.149967
                         obj4     1.0  0.190011 -2.040010
                         obj3     2.0 -0.029994  0.599976
                         obj5     2.0 -0.200009  0.610028
                         obj2     3.0  1.809989 -0.059994

I'd like to smooth var2 with scipy.signal.savgol_filter, but I need to do this for each object separately. So my call looks like this:

import scipy.signal as signal
df.groupby(['measurement_id', 'object'])['var2'].apply(lambda x: signal.savgol_filter(x, window_length=3, polyorder=2))

measurement_id  object
1               obj1      [-0.9799804687499857, -1.0200195312500429, -1....
                obj2      [1.9999999999999565, 1.9400111607142636, 1.869...
                obj3      [-1.0299944196428608, -0.6900111607143345, -0....
                obj4      [0.009998139880949027, 0.07001023065477342, 0....
                obj5      [0.02999441964284698, -0.039992559523796, -0.1...
Name: var2, dtype: object

However, since savgol_filter returns an np.ndarray per group, I'm not sure how to assign the output as a new column var4. I have tried pandas explode, but the exploded result no longer carries the original row order, so I can't align it for the assignment.

2 Answers

I think you need GroupBy.transform, which converts the numpy array returned for each group back into a Series aligned with the original index:

df['var4'] = (df.groupby(['measurement_id', 'object'])['var2']
                .transform(lambda x: signal.savgol_filter(x, window_length=3, polyorder=2)))

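Another idea is to create a custom function that assigns the new column:

import scipy.signal as signal

def func(x):
    # work on a copy so the group passed in by pandas is not mutated
    x = x.copy()
    x['var4'] = signal.savgol_filter(x['var2'], window_length=3, polyorder=2)
    return x

# group_keys=False keeps the original index instead of prepending the group keys
df = df.groupby(['measurement_id', 'object'], group_keys=False).apply(func)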

print (df)
                                 var1      var2      var3      var4
measurement_id repeat_id object                                    
1              1.0       obj1     1.0 -0.979980  0.000000 -0.979980
                         obj4     1.0  0.009998 -1.989997  0.009998
                         obj3     2.0 -1.029994 -0.020034 -1.029994
                         obj5     2.0  0.029994 -0.039996  0.029994
                         obj2     3.0  2.000000 -0.029979  2.000000
               2.0       obj1     1.0 -1.020020  0.050013 -1.020020
                         obj4     1.0  0.070010 -1.999978  0.070010
                         obj3     2.0 -0.690011  0.190035 -0.690011
                         obj5     2.0 -0.039993  0.189963 -0.039993
                         obj2     3.0  1.940011 -0.029979  1.940011
               3.0       obj1     1.0 -1.059989  0.100026 -1.059989
                         obj4     1.0  0.139997 -2.030011  0.139997
                         obj3     2.0 -0.360026  0.390015 -0.360026
                         obj5     2.0 -0.120024  0.410013 -0.120024
                         obj2     3.0  1.869978 -0.040032  1.869978
               4.0       obj1     1.0 -1.090007  0.149967 -1.090007
                         obj4     1.0  0.190011 -2.040010  0.190011
                         obj3     2.0 -0.029994  0.599976 -0.029994
                         obj5     2.0 -0.200009  0.610028 -0.200009
                         obj2     3.0  1.809989 -0.059994  1.809989
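Note that var4 is identical to var2 in the output above. A quadratic fitted to a window of three points passes through all of them, so window_length=3 with polyorder=2 leaves the data unchanged. A minimal sketch of this, using made-up values rather than the data from the question:

```python
import numpy as np
from scipy.signal import savgol_filter

# Hypothetical sample values, not taken from the question's data
x = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# polyorder=2 on a 3-point window: the fit is exact, so nothing is smoothed
same = savgol_filter(x, window_length=3, polyorder=2)
print(np.allclose(same, x))

# polyorder=1 on the same window actually changes the values
smoothed = savgol_filter(x, window_length=3, polyorder=1)
print(np.allclose(smoothed, x))
```

To get visible smoothing, a lower polyorder or a longer window_length would be needed.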

2
import scipy.signal as signal
df['var4'] = signal.savgol_filter(df.var2.values, window_length=3, polyorder=2)

gives you the same output here, because with window_length=3 and polyorder=2 the filter reproduces its input exactly:

print(df)

                                  var1      var2      var3      var4
measurement_id repeat_id object                                    
1              1.0       obj1     1.0 -0.979980  0.000000 -0.979980
                         obj4     1.0  0.009998 -1.989997  0.009998
                         obj3     2.0 -1.029994 -0.020034 -1.029994
                         obj5     2.0  0.029994 -0.039996  0.029994
                         obj2     3.0  2.000000 -0.029979  2.000000
               2.0       obj1     1.0 -1.020020  0.050013 -1.020020
                         obj4     1.0  0.070010 -1.999978  0.070010
                         obj3     2.0 -0.690011  0.190035 -0.690011
                         obj5     2.0 -0.039993  0.189963 -0.039993
                         obj2     3.0  1.940011 -0.029979  1.940011
               3.0       obj1     1.0 -1.059989  0.100026 -1.059989
                         obj4     1.0  0.139997 -2.030011  0.139997
                         obj3     2.0 -0.360026  0.390015 -0.360026
                         obj5     2.0 -0.120024  0.410013 -0.120024
                         obj2     3.0  1.869978 -0.040032  1.869978
               4.0       obj1     1.0 -1.090007  0.149967 -1.090007
                         obj4     1.0  0.190011 -2.040010  0.190011
                         obj3     2.0 -0.029994  0.599976 -0.029994
                         obj5     2.0 -0.200009  0.610028 -0.200009
                         obj2     3.0  1.809989 -0.059994  1.809989

UPDATE

The signature of scipy's savgol_filter looks like this:

scipy.signal.savgol_filter(x, window_length, polyorder, deriv=0, delta=1.0, axis=-1, mode='interp', cval=0.0)

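It uses axis=-1 by default, which means the filter is applied along the last axis of x; the array is not flattened. Since df.var2.values is already one-dimensional (df.var2.values.shape -> (20,)), the last axis is the only axis, so the call gives the same result as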

signal.savgol_filter(df.var2.values, window_length=3, polyorder=2, axis=0)

This follows from the call chain in scipy/signal/_savitzky_golay.py (see github): correlate1d([...]) in ndimage/filters.py is eventually called with axis=-1, and _check_axis(axis=-1, input.ndim=1) (df.var2.values.ndim=1) returns 0, so axis=0 gives the same result in signal.savgol_filter. The column is therefore filtered as a single sequence in row order, so I recommend sorting the whole frame first:

# sort_index returns a new frame, so the result has to be assigned
df = df.sort_index(axis=0, level='repeat_id')
df['var4'] = signal.savgol_filter(df.var2.values, window_length=3, polyorder=2)

which then returns:

                                  var1      var2      var3      var4
measurement_id repeat_id object                                    
1              1.0       obj1     1.0 -0.979980  0.000000 -0.979980
                         obj2     3.0  2.000000 -0.029979  2.000000
                         obj3     2.0 -1.029994 -0.020034 -1.029994
                         obj4     1.0  0.009998 -1.989997  0.009998
                         obj5     2.0  0.029994 -0.039996  0.029994
               2.0       obj1     1.0 -1.020020  0.050013 -1.020020
                         obj2     3.0  1.940011 -0.029979  1.940011
                         obj3     2.0 -0.690011  0.190035 -0.690011
                         obj4     1.0  0.070010 -1.999978  0.070010
                         obj5     2.0 -0.039993  0.189963 -0.039993
               3.0       obj1     1.0 -1.059989  0.100026 -1.059989
                         obj2     3.0  1.869978 -0.040032  1.869978
                         obj3     2.0 -0.360026  0.390015 -0.360026
                         obj4     1.0  0.139997 -2.030011  0.139997
                         obj5     2.0 -0.120024  0.410013 -0.120024
               4.0       obj1     1.0 -1.090007  0.149967 -1.090007
                         obj2     3.0  1.809989 -0.059994  1.809989
                         obj3     2.0 -0.029994  0.599976 -0.029994
                         obj4     1.0  0.190011 -2.040010  0.190011
                         obj5     2.0 -0.200009  0.610028 -0.200009
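The axis behaviour can be checked on a small array. This is only a sketch with made-up numbers; polyorder=1 is used so that the filter actually changes the values. It compares filtering a 2-D array along its last axis with filtering the flattened array:

```python
import numpy as np
from scipy.signal import savgol_filter

# Hypothetical 2-D input: two independent signals, one per row
a = np.array([[1.0, 3.0, 2.0, 5.0],
              [10.0, 30.0, 20.0, 50.0]])

# Default axis=-1: each row is filtered on its own
rows = savgol_filter(a, window_length=3, polyorder=1)

# Filtering the flattened array lets the window run across the row boundary
flat = savgol_filter(a.reshape(-1), window_length=3, polyorder=1).reshape(a.shape)
print(np.allclose(rows, flat))

# For a 1-D array, axis=-1 and axis=0 refer to the same (only) axis
v = a[0]
same_axis = np.allclose(savgol_filter(v, 3, 1), savgol_filter(v, 3, 1, axis=0))
print(same_axis)
```

The first comparison differs near the row boundary, which is the same effect as filtering the whole var2 column without grouping by object.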

6 Comments

It does seem to give the same output, but I can't grasp how the function would know that it should be applied to each object separately.
@Xaume - Change the call to signal.savgol_filter(df.var2.values, window_length=3, polyorder=1) and you will get a difference between processing per group and processing without groups, as this answer does.
Hmm, if you check my comment above, why is there a difference?
I'm sorry, what exactly do you mean?
I mean the second comment under this answer. Why is there a difference?
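The difference discussed in these comments can be reproduced with a small sketch. The frame below is hypothetical (two objects, made-up values) and only mirrors the index layout from the question:

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

# Hypothetical frame: rows of obj1 and obj2 alternate, as in the question
idx = pd.MultiIndex.from_product(
    [[1], [1.0, 2.0, 3.0, 4.0], ['obj1', 'obj2']],
    names=['measurement_id', 'repeat_id', 'object'],
)
df = pd.DataFrame({'var2': [-1.0, 2.0, -1.1, 1.9, -1.3, 1.7, -1.2, 1.8]}, index=idx)

# Per-object filtering: each window only sees values of one object
grouped = df.groupby(['measurement_id', 'object'])['var2'].transform(
    lambda s: savgol_filter(s.to_numpy(), window_length=3, polyorder=1)
)

# Whole-column filtering: each window mixes neighbouring rows of different objects
ungrouped = savgol_filter(df['var2'].to_numpy(), window_length=3, polyorder=1)

print(np.allclose(grouped.to_numpy(), ungrouped))
```

With polyorder=1 the two results differ, because the ungrouped call has no notion of objects and simply slides the window over consecutive rows.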