For example, say I am trying to find duplicate rows in this data set, based on Name, Age and Country:
NAME     AGE  COUNTRY      PROFESSION
'Fred'   23   America      Banker
'Paula'  78   Germany      Retired
'Fred'   23   America      Banker
'Fred'   22   America      Student
'Fred'   23   Brazil       Police Officer
'Bingo'  36   New Zealand  Money
To find the exact duplicates I have used:
dupDF = df[df.duplicated(['NAME', 'AGE', 'COUNTRY'], keep=False)]
This gives me:
NAME     AGE  COUNTRY      PROFESSION
'Fred'   23   America      Banker
'Fred'   23   America      Banker
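For anyone who wants to reproduce this, here is a minimal, self-contained version of the exact-duplicate step. It assumes the quotes around the names in the table are only display formatting, so the values are plain strings, and that the frame uses the default integer index.

```python
import pandas as pd

# Sample data from the table above (names assumed to be plain strings).
df = pd.DataFrame({
    "NAME": ["Fred", "Paula", "Fred", "Fred", "Fred", "Bingo"],
    "AGE": [23, 78, 23, 22, 23, 36],
    "COUNTRY": ["America", "Germany", "America", "America", "Brazil", "New Zealand"],
    "PROFESSION": ["Banker", "Retired", "Banker", "Student", "Police Officer", "Money"],
})

# keep=False marks every member of a duplicate group, not just the repeats.
dupDF = df[df.duplicated(["NAME", "AGE", "COUNTRY"], keep=False)]
print(dupDF)
```

Only the two rows that agree exactly on all three columns are selected; the 22-year-old Fred is left out because AGE must match exactly.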
What I really want is to match on Name, Age (+/- 1) and Country, so as to return:
NAME     AGE  COUNTRY      PROFESSION
'Fred'   23   America      Banker
'Fred'   23   America      Banker
'Fred'   22   America      Student
I have tried to use the solutions provided here: "Detecting almost duplicate rows".
However, I am struggling to adapt that solution to accept non-integer values.
I have also tried creating an array (as in: https://stackoverflow.com/a/43160595/10816095) containing Age +/- 1, hoping to use it for matching, but I can't seem to append it to the DataFrame.
How can I do this?
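One possible direction, offered as a rough sketch rather than a definitive answer: join the frame to itself on the columns that must match exactly (NAME and COUNTRY), then keep pairs of different rows whose ages differ by at most 1. The names `row`, `pairs`, `close` and `near_dupDF` are made up for illustration, the sample data is rebuilt with names as plain strings, and a default integer index is assumed.

```python
import pandas as pd

# Sample data from the question (names assumed to be plain strings).
df = pd.DataFrame({
    "NAME": ["Fred", "Paula", "Fred", "Fred", "Fred", "Bingo"],
    "AGE": [23, 78, 23, 22, 23, 36],
    "COUNTRY": ["America", "Germany", "America", "America", "Brazil", "New Zealand"],
    "PROFESSION": ["Banker", "Retired", "Banker", "Student", "Police Officer", "Money"],
})

# Carry the original row label along so matches can be selected afterwards.
indexed = df.reset_index().rename(columns={"index": "row"})

# Pair each row with every row sharing the exact-match columns (including itself).
pairs = indexed.merge(indexed, on=["NAME", "COUNTRY"], suffixes=("", "_other"))

# Drop self-pairs, then keep pairs whose ages are within +/- 1 of each other.
is_other_row = pairs["row"] != pairs["row_other"]
is_close_age = (pairs["AGE"] - pairs["AGE_other"]).abs() <= 1
close = pairs[is_other_row & is_close_age]

# Select every original row that has at least one near-duplicate partner.
near_dupDF = df.loc[sorted(close["row"].unique())]
print(near_dupDF)
```

Two caveats with this sketch: a row is flagged as soon as it has any partner within 1 year, so ages 22, 23 and 24 in the same group would all be returned even though 22 and 24 differ by 2; and the self-join grows quadratically with the size of each NAME/COUNTRY group, so it may be slow on large data.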