Objective: For every cell in Vendor, I want to check whether EVERY word in the vendor's name exists in the Names Database, i.e. Adam AND Smith must BOTH exist to make IsPerson = TRUE.

I know that I can do this using apply() with a lambda and other ways, but all of them are loop-based. I'd like to make this as fast and efficient as possible because I have 1.2 million rows. I've heard about NumPy vectorization, but I'm not sure how to use it when I need to run a routine on the individual contents of each cell. Thanks.
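For illustration, a minimal sketch of the data layout this implies (the column names Vendor and Persons are assumptions, matching the answer below, not the real 1.2M-row data):

import pandas as pd

# Hypothetical example data: df1 holds vendor names, df2 is the names database
df1 = pd.DataFrame({'Vendor': ['Adam Smith', 'Acme Corp', 'Jane Doe']})
df2 = pd.DataFrame({'Persons': ['Adam', 'Smith', 'Jane', 'Doe']})

# Expected IsPerson: True, False, True
# ('Adam Smith' and 'Jane Doe' consist only of known names; 'Acme' and 'Corp' do not.)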

1 Answer

Unfortunately, when working with strings in numpy/pandas, there are always loops under the hood.

The idea is to create a DataFrame by splitting on whitespace, forward filling the last values, filtering with isin, and finally testing whether all values in each row are True:

df1['IsPerson'] = (df1['Vendor'].str.split(expand=True)
                                .ffill(axis=1)
                                .isin(df2['Persons'].tolist())
                                .all(axis=1))
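To make the pipeline concrete, here is a small sketch with toy data showing what each step produces (frame and column names as above; purely illustrative):

import pandas as pd

# Toy data: the second vendor has three words, so the first row gets a
# missing cell after split(expand=True), which ffill then fills with 'Smith'.
df1 = pd.DataFrame({'Vendor': ['Adam Smith', 'Acme Corp Ltd']})
df2 = pd.DataFrame({'Persons': ['Adam', 'Smith']})

split_df = df1['Vendor'].str.split(expand=True)   # 3 columns; row 0 has a missing cell in column 2
filled = split_df.ffill(axis=1)                   # the gap becomes the last word, so it cannot break all()
matched = filled.isin(df2['Persons'].tolist())    # True where the word is a known name
df1['IsPerson'] = matched.all(axis=1)             # row 0 -> True, row 1 -> False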

A solution with sets:

s = set(df2['Persons'])
df1['IsPerson'] = ~df1['Vendor'].map(lambda x: s.isdisjoint(x.split()))
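One caveat worth noting: isdisjoint only checks whether the vendor shares at least one word with the set, so ~isdisjoint becomes True as soon as any word matches. If the requirement is strictly that every word must be a known name, a set-based sketch along these lines (not benchmarked below) enforces that instead:

s = set(df2['Persons'])
# issuperset is True only if every word of the vendor string is in the names set
df1['IsPerson'] = df1['Vendor'].map(lambda x: s.issuperset(x.split()))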

Performance

Performance depends on the lengths of both DataFrames, the number of unique values, and the number of matched values, so results on real data may differ.

import numpy as np
import pandas as pd

np.random.seed(123)

N = 100000
L = list('abcdefghijklmno ')

df1 = pd.DataFrame({'Vendor': [''.join(x) for x in np.random.choice(L, (N, 5))]})
df2 = pd.DataFrame({'Persons': [''.join(x) for x in np.random.choice(L, (N * 10, 5))]})

In [133]: %%timeit
     ...: s = set(df2['Persons'])
     ...: df1['IsPerson1'] = ~df1['Vendor'].map(lambda x: s.isdisjoint(x.split()))
     ...: 
470 ms ± 7.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [134]: %%timeit
     ...: df1['IsPerson2'] = (df1['Vendor'].str.split(expand=True)
     ...:                                 .ffill(axis=1)
     ...:                                 .isin(df2['Persons'].tolist())
     ...:                                 .all(axis=1))
     ...:                                 
858 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

2 Comments

Watch out for typos in the new column names.
Wow! That's a very clever and creative technique. Interestingly enough, I've tried all three of your methods on 100,000 rows. The one you deleted in your edit yesterday is actually the fastest: %timeit df['IsPerson'] = (df['CleanName'].str.split(expand=True).isin(PersonsDB['NAME'].tolist() + [None]).all(axis=1)) took 4.27 seconds, the ffill one took 5.10 seconds, and the isdisjoint one took 9.75 seconds (reconstructed as a block below). Great stuff. Thanks!
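For readability, the deleted variant quoted in the comment above, laid out as a block (reconstructed from the comment; the names df, CleanName, PersonsDB and NAME come from the commenter's own data). Including None in the isin list appears to let the empty cells produced by expand=True count as matches, playing the same role as ffill in the answer above:

df['IsPerson'] = (df['CleanName'].str.split(expand=True)
                                 .isin(PersonsDB['NAME'].tolist() + [None])
                                 .all(axis=1))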
