Detecting almost duplicate rows with mixed variable types

Question

Say, for example, I am trying to find duplicate rows in this set, based on Name, Age and Country:

NAME     AGE  COUNTRY      PROFESSION
'Fred'   23   America      Banker
'Paula'  78   Germany      Retired
'Fred'   23   America      Banker
'Fred'   22   America      Student
'Fred'   23   Brazil       Police Officer
'Bingo'  36   New Zealand  Money

To find the exact duplicates I have used:

dupDF = df[df.duplicated(['NAME', 'AGE', 'COUNTRY'], keep=False)]

Which would give me:

NAME    AGE  COUNTRY  PROFESSION
'Fred'  23   America  Banker
'Fred'  23   America  Banker
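
For anyone who wants to reproduce this, here is a minimal sketch of the setup. It assumes the names are stored without the surrounding quotes and that AGE is an integer column; the column names are taken from the table above.

```python
import pandas as pd

# Sample data from the question (names stored without quotes: an assumption)
df = pd.DataFrame({
    "NAME": ["Fred", "Paula", "Fred", "Fred", "Fred", "Bingo"],
    "AGE": [23, 78, 23, 22, 23, 36],
    "COUNTRY": ["America", "Germany", "America", "America", "Brazil", "New Zealand"],
    "PROFESSION": ["Banker", "Retired", "Banker", "Student", "Police Officer", "Money"],
})

# keep=False marks every member of a duplicate group, not just the later ones
dupDF = df[df.duplicated(["NAME", "AGE", "COUNTRY"], keep=False)]
print(dupDF)
```

Only the two identical Fred/23/America rows are selected, because duplicated compares the listed columns for exact equality.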

What I really want is to match on Name, Age(+/-1) and Country, so as to return:

NAME    AGE  COUNTRY  PROFESSION
'Fred'  23   America  Banker
'Fred'  23   America  Banker
'Fred'  22   America  Student

I have tried to use the solutions provided here: Detecting almost duplicate rows

However I am struggling to adapt the solution to accept non-integer values.

I have also tried creating an array (as in: https://stackoverflow.com/a/43160595/10816095) that contains the Age +/-1 in hopes to use that to match but I can't seem to append it to the dataframe.

How can I do this?

jezrael · Accepted Answer · 2019-05-19 11:22:44Z

Sort with DataFrame.sort_values by all 3 columns, putting the integer column (AGE) last in the list. Then group by the columns that must match exactly, take Series.diff within each group and back-fill the first value, compare with Series.lt (<), restore the original row order with Series.sort_index, and pass the result to boolean indexing:

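# Sort so rows sharing NAME and COUNTRY are adjacent and ordered by AGE
mask = (df.sort_values(['NAME','COUNTRY','AGE'])
          .groupby(['NAME','COUNTRY'])['AGE']
          # transform keeps the original row index; apply may prepend
          # the group keys to the index in recent pandas versions
          .transform(lambda x: x.diff().bfill())
          # NaN (single-row groups) compares as False
          .lt(2)
          .sort_index())

df = df[mask]
print(df)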
     NAME  AGE  COUNTRY PROFESSION
0  'Fred'   23  America     Banker
2  'Fred'   23  America     Banker
3  'Fred'   22  America    Student
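
One caveat: the mask above only looks at the gap to the previous row in each group, so with ages 22, 50, 51 it would keep 51 but not 50. A possible variant that checks both neighbours is sketched below. The helper name near_duplicates and its tol parameter are hypothetical, and the sample frame assumes names are stored without quotes.

```python
import pandas as pd

df = pd.DataFrame({
    "NAME": ["Fred", "Paula", "Fred", "Fred", "Fred", "Bingo"],
    "AGE": [23, 78, 23, 22, 23, 36],
    "COUNTRY": ["America", "Germany", "America", "America", "Brazil", "New Zealand"],
    "PROFESSION": ["Banker", "Retired", "Banker", "Student", "Police Officer", "Money"],
})

def near_duplicates(frame, keys, num_col, tol=1):
    """Rows that match on `keys` and have a neighbour within `tol` on `num_col`."""
    ordered = frame.sort_values(keys + [num_col])
    grouped = ordered.groupby(keys)[num_col]
    gap_prev = grouped.diff().abs()     # distance to the previous row in the group
    gap_next = grouped.diff(-1).abs()   # distance to the next row in the group
    # NaN gaps (first/last row of a group) compare as False
    mask = (gap_prev <= tol) | (gap_next <= tol)
    # Realign the mask to the original row order before indexing
    return frame[mask.reindex(frame.index)]

result = near_duplicates(df, ["NAME", "COUNTRY"], "AGE")
print(result)
```

Note that this links rows in chains: ages 22, 23, 24 in one group would all be returned, even though 22 and 24 differ by 2. If that is not wanted, a self-merge on the key columns followed by a filter on the absolute age difference may be a better fit.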
