I have a very large dataframe (~10^8 rows) in which I need to change some values. The algorithm I use is complex, so I have reduced the issue to the simple example below. I have mostly programmed in C++, so I keep thinking in for-loops. I know I should vectorize, but I am new to Python and very new to pandas and cannot come up with a better solution. Any solution that improves performance is welcome.

#!/usr/bin/python3


import numpy as np
import pandas as pd

data = {'eventID': [1, 1, 1, 2, 2, 3, 4, 5, 6, 6, 6, 6, 7, 8],
        'types':    [0, -1, -1, -1, 1, 0, 0, 0, -1, -1, -1, 1, -1, -1]
        }


mydf = pd.DataFrame(data, columns=['eventID', 'types'])
print(mydf)

MyIntegerCodes = np.array([0, 1])
eventIDs = np.unique(mydf.eventID.values)  # can be up to 10^8 values

for val in eventIDs:
    # Boolean mask selecting the rows of the current event
    mask = mydf.eventID == val
    currentTypes = mydf.loc[mask, 'types'].values

    # `~` on a Python bool is bitwise (~True == -2), so use `and`/`not` here
    has0 = 0 in currentTypes
    has1 = 1 in currentTypes

    if has0 and not has1:
        mydf.loc[mask, 'types'] = 0
    elif has1 and not has0:
        mydf.loc[mask, 'types'] = 1


print(mydf)

Any ideas?

EDIT: I was asked to explain what I do with my for-loops. For every eventID, I want to know whether the corresponding types contain a 1, a 0, or both. If they contain a 1, all values equal to -1 should be changed to 1. If they contain a 0, all values equal to -1 should be changed to 0. My problem is doing this efficiently for each eventID independently. There can be one or multiple entries per eventID.

Input of example:

    eventID  types
0         1      0
1         1     -1
2         1     -1
3         2     -1
4         2      1
5         3      0
6         4      0
7         5      0
8         6     -1
9         6     -1
10        6     -1
11        6      1
12        7     -1
13        8     -1

Output of example:

    eventID  types
0         1      0
1         1      0
2         1      0
3         2      1
4         2      1
5         3      0
6         4      0
7         5      0
8         6      1
9         6      1
10        6      1
11        6      1
12        7     -1
13        8     -1
  • Please explain what you are doing with your loops. Commented Jun 23, 2020 at 13:40
  • Is it possible for an event to contain a zero and a one? or neither zero nor one? Just leave it alone? Commented Jun 23, 2020 at 13:43
  • Yes, one event can contain many different types. Unfortunately, I cannot leave this alone ^^ Commented Jun 23, 2020 at 13:51
  • Can you share the expected output for the above dataframe mydf? Commented Jun 23, 2020 at 14:15
  • @ShubhamSharma Yes, of course. See the edit above. Commented Jun 23, 2020 at 14:32

1 Answer

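First, create boolean masks m1 and m2 with Series.eq, group each mask by eventID with Series.groupby, and transform with 'any', so that every row records whether its event contains a 1 (m1) or a 0 (m2). Then use np.select to choose 1 or 0 according to m1 or m2, keeping the original value where neither condition holds: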

m1 = mydf['types'].eq(1).groupby(mydf['eventID']).transform('any')
m2 = mydf['types'].eq(0).groupby(mydf['eventID']).transform('any')
mydf['types'] = np.select([m1 , m2], [1, 0], mydf['types'])

Result:

# print(mydf)

    eventID  types
0         1      0
1         1      0
2         1      0
3         2      1
4         2      1
5         3      0
6         4      0
7         5      0
8         6      1
9         6      1
10        6      1
11        6      1
12        7     -1
13        8     -1
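
One detail worth knowing about the snippet above: np.select takes the first condition that is true, so an event containing both a 0 and a 1 has all of its rows set to 1, including the existing 0. The loop in the question leaves such events untouched. The following is a minimal sketch of a variant that only fills rows equal to -1 and skips mixed events. It assumes that reading of the requirement, and the extra event 9 (containing 0, 1 and -1) is hypothetical data added for illustration:

```python
import numpy as np
import pandas as pd

# Example data from the question, plus a hypothetical mixed event (eventID 9)
mydf = pd.DataFrame({
    'eventID': [1, 1, 1, 2, 2, 3, 4, 5, 6, 6, 6, 6, 7, 8, 9, 9, 9],
    'types':   [0, -1, -1, -1, 1, 0, 0, 0, -1, -1, -1, 1, -1, -1, 0, 1, -1],
})

# Per-row flags: does this row's event contain a 1 / a 0 anywhere?
has1 = mydf['types'].eq(1).groupby(mydf['eventID']).transform('any')
has0 = mydf['types'].eq(0).groupby(mydf['eventID']).transform('any')

# Only rows that are still -1 are candidates for replacement
unset = mydf['types'].eq(-1)

# Fill with 1 (or 0) only when the event contains exactly one of the two codes
mydf['types'] = np.select(
    [unset & has1 & ~has0, unset & has0 & ~has1],
    [1, 0],
    default=mydf['types'],
)

print(mydf)
```

Here `~` is applied to boolean Series, where it is an element-wise logical NOT, unlike on a plain Python bool.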

4 Comments

First of all, thank you very much. I will need some time to check it out and transfer it to my code. I will let you know tomorrow.
Yes, your answer helped me. But let's say we want to group by an additional column like volumeID, how would you do that? I tried: m1 = mydf['types'].eq(1).groupby(mydf[['eventID', 'volumeID']]).transform('any') but now there is an error because it is not 1-dimensional
Use: m1 = mydf['types'].eq(1).groupby([mydf['eventID'], mydf['volumeID']]).transform('any')
Wow, I did not know that groupby() worked this way. Thank you. I was looking for this kind of vectorization for forever. I mean I checked out the reference manual pandas.pydata.org/pandas-docs/stable/reference/api/… a bunch of tutorials, but no one ever used it like this. I guess I just do not fully understand how groupby() works
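
The multi-key grouping suggested in the comments can be sketched as follows. Each grouping key must be one-dimensional, so the keys are passed as a list of Series rather than as a two-column DataFrame. The small frame and its volumeID values are made up for illustration, since the question does not show that column:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the volumeID column is not part of the original example
df = pd.DataFrame({
    'eventID':  [1, 1, 1, 1, 2, 2],
    'volumeID': [10, 10, 20, 20, 10, 10],
    'types':    [0, -1, -1, 1, -1, -1],
})

# A list of 1-D Series groups by the (eventID, volumeID) combination
keys = [df['eventID'], df['volumeID']]

m1 = df['types'].eq(1).groupby(keys).transform('any')
m2 = df['types'].eq(0).groupby(keys).transform('any')

# Same selection as in the answer: 1 where m1, 0 where m2, else unchanged
df['types'] = np.select([m1, m2], [1, 0], df['types'])

print(df)
```

Because transform returns a result aligned with the original rows instead of one row per group, the masks can be used directly for row-wise selection.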
