Say, for example, I am trying to find duplicate rows in this data set, based on Name, Age and Country:

NAME     AGE  COUNTRY      PROFESSION
'Fred'   23   America      Banker
'Paula'  78   Germany      Retired
'Fred'   23   America      Banker
'Fred'   22   America      Student
'Fred'   23   Brazil       Police Officer
'Bingo'  36   New Zealand  Money

To find the exact duplicates I have used:

dupDF = df[df.duplicated(['NAME', 'AGE', 'COUNTRY'], keep=False)]

Which would give me:

NAME     AGE  COUNTRY      PROFESSION
'Fred'   23   America      Banker
'Fred'   23   America      Banker
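
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample data and runs the exact-duplicate check. The frame is typed in by hand from the table above (with plain strings for the names, which is an assumption about how the real data is stored), so treat it as an illustration rather than the actual data set.

```python
import pandas as pd

# Sample data retyped from the table in the question (assumed layout).
df = pd.DataFrame({
    "NAME": ["Fred", "Paula", "Fred", "Fred", "Fred", "Bingo"],
    "AGE": [23, 78, 23, 22, 23, 36],
    "COUNTRY": ["America", "Germany", "America", "America", "Brazil", "New Zealand"],
    "PROFESSION": ["Banker", "Retired", "Banker", "Student", "Police Officer", "Money"],
})

# keep=False marks every member of a duplicate group, not just the repeats.
dupDF = df[df.duplicated(["NAME", "AGE", "COUNTRY"], keep=False)]
print(dupDF)
```

Only the two rows that agree on all three columns come back; the 22-year-old Fred is excluded because `duplicated` compares values for exact equality.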

What I really want is to match on Name, Age (+/-1) and Country, so that it returns:

NAME     AGE  COUNTRY      PROFESSION
'Fred'   23   America      Banker
'Fred'   23   America      Banker
'Fred'   22   America      Student

I have tried to use the solutions provided here: Detecting almost duplicate rows

However, I am struggling to adapt that solution to accept non-integer values.

I have also tried creating an array (as in: https://stackoverflow.com/a/43160595/10816095) that contains the Age +/-1, in the hope of using it for matching, but I can't seem to append it to the DataFrame.

How can I do this?

1 Answer

Use DataFrame.sort_values on all 3 columns, with the integer column last in the list. Then group by the columns that must match exactly, take Series.diff within each group and back-fill the first value (which has no previous row to compare against), compare with Series.lt (that is, <), restore the original row order with Series.sort_index, and pass the result to boolean indexing:

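# Sort so that, within each NAME/COUNTRY group, the closest ages sit next to each other
mask = (df.sort_values(['NAME','COUNTRY','AGE'])
          # group_keys=False keeps the original row index (needed on pandas >= 2.0)
          .groupby(['NAME','COUNTRY'], group_keys=False)['AGE']
          .apply(lambda x: x.diff().bfill())
          .lt(2)
          .sort_index())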

df = df[mask]
print (df)
     NAME  AGE  COUNTRY PROFESSION
0  'Fred'   23  America     Banker
2  'Fred'   23  America     Banker
3  'Fred'   22  America    Student
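
One caveat: back-filling only the first difference can miss a row whose nearest match comes after it. For ages 22, 30, 31 in the same group, the differences are NaN, 8, 1, which back-fill to 8, 8, 1, so only the 31 row passes the `< 2` test even though 30 is within 1 of it. Below is a sketch of a variant that checks both the previous and the next row in each group. The helper name `near_duplicates` and its parameters are made up for illustration, and it assumes the DataFrame has a unique index.

```python
import pandas as pd

def near_duplicates(df, keys, num_col, tol=1):
    """Rows that share `keys` with another row whose `num_col` is within `tol`.

    Hypothetical helper; assumes df.index is unique.
    """
    ordered = df.sort_values(keys + [num_col])
    grouped = ordered.groupby(keys, sort=False)[num_col]
    gap_prev = grouped.diff().abs()    # distance to the previous row in the group
    gap_next = grouped.diff(-1).abs()  # distance to the next row in the group
    # NaN (no neighbor on that side) compares as False, so singletons drop out.
    mask = gap_prev.le(tol) | gap_next.le(tol)
    return df[mask.reindex(df.index)]

# Sample data retyped from the question (assumed layout).
df = pd.DataFrame({
    "NAME": ["Fred", "Paula", "Fred", "Fred", "Fred", "Bingo"],
    "AGE": [23, 78, 23, 22, 23, 36],
    "COUNTRY": ["America", "Germany", "America", "America", "Brazil", "New Zealand"],
    "PROFESSION": ["Banker", "Retired", "Banker", "Student", "Police Officer", "Money"],
})

near_dups = near_duplicates(df, ["NAME", "COUNTRY"], "AGE", tol=1)
print(near_dups)
```

Because the comparison is on absolute differences, this should also work for non-integer numeric columns. Note that matches chain: ages 22, 23, 24 are all returned, since each has a neighbor within 1, even though 22 and 24 differ by 2.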