Objective: For every cell in Vendor, I want to check whether EVERY word in the vendor's name exists in the Names Database, i.e. Adam AND Smith must BOTH exist to make IsPerson = TRUE.

I know that I can do this using apply() with a lambda and other ways, but all of them are loop-based. I'd like to make this as fast and efficient as possible because I have 1.2 million rows. I've heard about NumPy vectorization, but I'm not sure how to use it when I need to run a routine on the individual contents of each cell. Thanks.
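For illustration, a minimal sketch of the data layout this implies (the column names Vendor and Persons are assumptions, matching the answer below, not the real 1.2M-row data):

import pandas as pd

# Hypothetical example data: df1 holds vendor names, df2 is the names database
df1 = pd.DataFrame({'Vendor': ['Adam Smith', 'Acme Corp', 'Jane Doe']})
df2 = pd.DataFrame({'Persons': ['Adam', 'Smith', 'Jane', 'Doe']})

# Expected IsPerson: True, False, True
# ('Adam Smith' and 'Jane Doe' consist only of known names; 'Acme' and 'Corp' do not.)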

1 Answer

Unfortunately, when working with strings in numpy/pandas, there are always loops under the hood.

The idea is to create a DataFrame by splitting on whitespace, forward filling the last values, filtering with isin, and finally testing whether all values in each row are True:

df1['IsPerson'] = (df1['Vendor'].str.split(expand=True)
                                .ffill(axis=1)
                                .isin(df2['Persons'].tolist())
                                .all(axis=1))
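To make the pipeline concrete, here is a small sketch with toy data showing what each step produces (frame and column names as above; purely illustrative):

import pandas as pd

# Toy data: the second vendor has three words, so the first row gets a
# missing cell after split(expand=True), which ffill then fills with 'Smith'.
df1 = pd.DataFrame({'Vendor': ['Adam Smith', 'Acme Corp Ltd']})
df2 = pd.DataFrame({'Persons': ['Adam', 'Smith']})

split_df = df1['Vendor'].str.split(expand=True)   # 3 columns; row 0 has a missing cell in column 2
filled = split_df.ffill(axis=1)                   # the gap becomes the last word, so it cannot break all()
matched = filled.isin(df2['Persons'].tolist())    # True where the word is a known name
df1['IsPerson'] = matched.all(axis=1)             # row 0 -> True, row 1 -> False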

A solution with sets:

s = set(df2['Persons'])
df1['IsPerson'] = ~df1['Vendor'].map(lambda x: s.isdisjoint(x.split()))
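One caveat worth noting: isdisjoint only checks whether the vendor shares at least one word with the set, so ~isdisjoint becomes True as soon as any word matches. If the requirement is strictly that every word must be a known name, a set-based sketch along these lines (not benchmarked below) enforces that instead:

s = set(df2['Persons'])
# issuperset is True only if every word of the vendor string is in the names set
df1['IsPerson'] = df1['Vendor'].map(lambda x: s.issuperset(x.split()))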

Performance

Performance depends on the lengths of both DataFrames, the number of unique values, and the number of matched values, so results on real data may differ.

import numpy as np
import pandas as pd

np.random.seed(123)

N = 100000
L = list('abcdefghijklmno ')

df1 = pd.DataFrame({'Vendor': [''.join(x) for x in np.random.choice(L, (N, 5))]})
df2 = pd.DataFrame({'Persons': [''.join(x) for x in np.random.choice(L, (N * 10, 5))]})

In [133]: %%timeit
     ...: s = set(df2['Persons'])
     ...: df1['IsPerson1'] = ~df1['Vendor'].map(lambda x: s.isdisjoint(x.split()))
     ...: 
470 ms ± 7.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [134]: %%timeit
     ...: df1['IsPerson2'] = (df1['Vendor'].str.split(expand=True)
     ...:                                 .ffill(axis=1)
     ...:                                 .isin(df2['Persons'].tolist())
     ...:                                 .all(axis=1))
     ...:                                 
858 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

2 Comments

Watch out for typos in the new column names.
Wow! That's a very clever and creative technique. Interestingly enough, I've tried all three of your methods on 100,000 rows. The one you deleted in your edit yesterday is actually the fastest: %timeit df['IsPerson'] = (df['CleanName'].str.split(expand=True).isin(PersonsDB['NAME'].tolist() + [None]).all(axis=1)) took 4.27 seconds, the ffill one took 5.10 seconds, and the isdisjoint one took 9.75 seconds (reconstructed as a block below). Great stuff. Thanks!
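For readability, the deleted variant quoted in the comment above, laid out as a block (reconstructed from the comment; the names df, CleanName, PersonsDB and NAME come from the commenter's own data). Including None in the isin list appears to let the empty cells produced by expand=True count as matches, playing the same role as ffill in the answer above:

df['IsPerson'] = (df['CleanName'].str.split(expand=True)
                                 .isin(PersonsDB['NAME'].tolist() + [None])
                                 .all(axis=1))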
