Vectorizing for-loop

Question

I have a very large dataframe (~10^8 rows) in which I need to change some values. The algorithm I use is complex, so I have broken the issue down into a simple example below. I have mostly programmed in C++, so I keep thinking in for-loops. I know I should vectorize, but I am new to Python and very new to pandas and cannot come up with a better solution. Any solution that improves performance is welcome.

#!/usr/bin/python3


import numpy as np
import pandas as pd

data = {'eventID': [1, 1, 1, 2, 2, 3, 4, 5, 6, 6, 6, 6, 7, 8],
        'types':    [0, -1, -1, -1, 1, 0, 0, 0, -1, -1, -1, 1, -1, -1]
        }


mydf = pd.DataFrame(data, columns=['eventID', 'types'])
print(mydf)

MyIntegerCodes = np.array([0, 1])
eventIDs = np.unique(mydf.eventID.values)  # can be up to 10^8 values

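for val in eventIDs:
    # Types belonging to the current event
    currentTypes = mydf[mydf.eventID == val].types.values

    # Event contains a 0 but no 1: set the whole event to 0
    if (0 in currentTypes) and not (1 in currentTypes):
        mydf.loc[mydf.eventID == val, 'types'] = 0

    # Event contains a 1 but no 0: set the whole event to 1
    if (1 in currentTypes) and not (0 in currentTypes):
        mydf.loc[mydf.eventID == val, 'types'] = 1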


print(mydf)

Any ideas?

EDIT: I was asked to explain what my for-loop does. For every eventID I want to know whether the corresponding types contain a 1, a 0, or both. If they contain a 1, all values equal to -1 should be changed to 1. If they contain a 0, all values equal to -1 should be changed to 0. My problem is doing this efficiently for each eventID independently. There can be one or multiple entries per eventID.

Input of example:

    eventID  types
0         1      0
1         1     -1
2         1     -1
3         2     -1
4         2      1
5         3      0
6         4      0
7         5      0
8         6     -1
9         6     -1
10        6     -1
11        6      1
12        7     -1
13        8     -1

Output of example:

    eventID  types
0         1      0
1         1      0
2         1      0
3         2      1
4         2      1
5         3      0
6         4      0
7         5      0
8         6      1
9         6      1
10        6      1
11        6      1
12        7     -1
13        8     -1

Is it possible for an event to contain a zero and a one? Or neither zero nor one? Just leave it alone?
– Scott Boston, Commented Jun 23, 2020 at 13:43
Yes, one event can contain many different types. Unfortunately, I cannot leave this alone ^^
– Andi, Commented Jun 23, 2020 at 13:51
Can you share the expected output for the above dataframe mydf?
– Shubham Sharma, Commented Jun 23, 2020 at 14:15

Shubham Sharma · Accepted Answer · 2020-06-23 14:35:09Z


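First, create boolean masks m1 and m2 using Series.eq, group each mask by eventID with Series.groupby, and transform with 'any' so that every row knows whether its event contains a 1 (m1) or a 0 (m2). Then use np.select to choose 1 or 0 depending on which of the conditions m1 or m2 holds, keeping the original value where neither does: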

m1 = mydf['types'].eq(1).groupby(mydf['eventID']).transform('any')
m2 = mydf['types'].eq(0).groupby(mydf['eventID']).transform('any')
mydf['types'] = np.select([m1, m2], [1, 0], mydf['types'])

Result:

# print(mydf)

    eventID  types
0         1      0
1         1      0
2         1      0
3         2      1
4         2      1
5         3      0
6         4      0
7         5      0
8         6      1
9         6      1
10        6      1
11        6      1
12        7     -1
13        8     -1
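
One caveat, given the comment above that a single event can contain several types: np.select uses the first condition that is true, so with the order [m1, m2] an event containing both a 0 and a 1 would be set entirely to 1, whereas the loop in the question leaves such an event untouched. The following is only a sketch of a variant that mirrors the loop in that respect. The helper name fill_types and the mask names are made up for illustration, and the to_numpy() calls are just a precaution so that np.select receives plain arrays.

```python
import numpy as np
import pandas as pd


def fill_types(df):
    """Hypothetical helper: reproduce the question's loop without iterating."""
    # Per-row flags: does this row's event contain a 1 / a 0 anywhere?
    has_one = df['types'].eq(1).groupby(df['eventID']).transform('any')
    has_zero = df['types'].eq(0).groupby(df['eventID']).transform('any')

    # Only act when exactly one of the two codes is present in the event
    only_one = (has_one & ~has_zero).to_numpy()
    only_zero = (has_zero & ~has_one).to_numpy()

    out = df.copy()
    # Rows matching neither condition keep their original value
    out['types'] = np.select([only_one, only_zero], [1, 0],
                             default=df['types'].to_numpy())
    return out


mydf = pd.DataFrame({
    'eventID': [1, 1, 1, 2, 2, 3, 4, 5, 6, 6, 6, 6, 7, 8],
    'types':   [0, -1, -1, -1, 1, 0, 0, 0, -1, -1, -1, 1, -1, -1],
})
print(fill_types(mydf))
```

On the example data this gives the same output as shown above. The two versions only differ for events that contain both a 0 and a 1, so which one is right depends on what should happen in that case.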


4 Comments

Andi Over a year ago

First of all, thank you very much. I will need some time to check it out and transfer it to my code. I will let you know tomorrow.

Andi Over a year ago

Yes, your answer helped me. But let's say we want to group by an additional column like volumeID, how would you do that? I tried: m1 = mydf['types'].eq(1).groupby(mydf[['eventID', 'volumeID']]).transform('any') but now there is an error because it is not 1-dimensional.

Shubham Sharma Over a year ago

Use: m1 = mydf['types'].eq(1).groupby([mydf['eventID'], mydf['volumeID']]).transform('any')
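
To make the difference concrete, here is a small sketch with made-up data (the volumeID values are purely illustrative). A list of Series is treated as several grouping keys, one per Series, while a two-column DataFrame is a single two-dimensional object and is not accepted as a key. The second form, which adds a temporary column is_one (a name chosen here) and groups by column names, is an assumed equivalent that some may find easier to read.

```python
import pandas as pd

df = pd.DataFrame({
    'eventID':  [1, 1, 1, 1, 2, 2],
    'volumeID': [10, 10, 20, 20, 10, 10],  # hypothetical values
    'types':    [1, -1, -1, -1, 0, -1],
})

# Group by the pair (eventID, volumeID) by passing a list of Series
m1 = df['types'].eq(1).groupby([df['eventID'], df['volumeID']]).transform('any')

# Same idea using column names on the DataFrame itself
m1_alt = (
    df.assign(is_one=df['types'].eq(1))
      .groupby(['eventID', 'volumeID'])['is_one']
      .transform('any')
)

print(m1.tolist())
print(m1_alt.tolist())
```

Only the first two rows share a group, (1, 10), with a 1, so only they are flagged; the rows in (1, 20) are not, even though they belong to the same eventID.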

Andi Over a year ago

Wow, I did not know that groupby() worked this way. Thank you. I have been looking for this kind of vectorization forever. I checked the reference manual pandas.pydata.org/pandas-docs/stable/reference/api/… and a bunch of tutorials, but no one ever used it like this. I guess I just do not fully understand how groupby() works.
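
For readers with the same question about groupby(), the following sketch (toy data, variable names chosen here) may help show what transform('any') adds. A plain aggregation returns one value per group, while transform broadcasts that per-group value back to every row of the original frame, which is what makes it usable as a row mask.

```python
import pandas as pd

df = pd.DataFrame({'eventID': [1, 1, 2, 3], 'types': [0, -1, -1, 1]})
is_one = df['types'].eq(1)

# Aggregation: one result per group, indexed by eventID
per_group = is_one.groupby(df['eventID']).any()

# Transform: the group result repeated for each row, aligned with df
per_row = is_one.groupby(df['eventID']).transform('any')

# The transform is equivalent to looking up each row's group result
looked_up = df['eventID'].map(per_group)

print(per_group)
print(per_row)
```

Here per_group has three entries (one per eventID) and per_row has four (one per row), with the same values as looked_up.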
