
I am trying to work out which entries in my data store are near-duplicates using approximate string matching.

Is there any implementation of the following approach in Python, or do I need to try and roll my own?

Thanks :)

From Wikipedia:

...

A brute-force approach would be to compute the edit distance to P for all substrings of T, and then choose the substring with the minimum distance. However, this algorithm would have the running time O(n³m).

A better solution[3][4], utilizing dynamic programming, uses an alternative formulation of the problem: for each position j in the text T and each position i in the pattern P, compute the minimum edit distance between the first i characters of the pattern, P_i, and any substring T_{j',j} of T that ends at position j.
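For reference, a minimal sketch of that dynamic-programming formulation in plain Python (the function name is made up for illustration, not taken from any library):

def approx_substring_distance(pattern, text):
    """Minimum edit distance between pattern and any substring of text."""
    m, n = len(pattern), len(text)
    # prev holds row i-1 of the DP table; row 0 is all zeros because the
    # empty pattern matches a substring ending anywhere at cost 0.
    prev = [0] * (n + 1)
    for i in range(1, m + 1):
        curr = [i] + [0] * n  # matching pattern[:i] against nothing costs i
        for j in range(1, n + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # skip a pattern character
                          curr[j - 1] + 1,     # skip a text character
                          prev[j - 1] + cost)  # match or substitute
        prev = curr
    return min(prev)  # the best match may end at any position j

print(approx_substring_distance("brown", "the quick brwn fox"))  # 1

This runs in O(mn) time per string, as opposed to O(n³m) for the brute-force version.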

What is the most efficient way to apply this to many strings?

4 Answers


Yes.

google("python levenshtein")



difflib.get_close_matches should do the job.
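A minimal sketch, using the example word list from the difflib documentation:

from difflib import get_close_matches

# Returns up to n candidates from the list whose similarity to the query
# is at least cutoff (both are real parameters of get_close_matches).
print(get_close_matches("appel", ["ape", "apple", "peach", "puppy"], n=2, cutoff=0.6))
# ['apple', 'ape']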



difflib might be the answer, e.g.:

from difflib import context_diff

a = 'acaacbaaca'
b = 'accabcaacc'

# context_diff treats its inputs as sequences of "lines"; passing plain
# strings compares them character by character.
print('\n'.join(context_diff(a, b)))
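For scoring near-duplicates directly, difflib's SequenceMatcher may be a better fit than a diff; a minimal sketch using the same two strings:

from difflib import SequenceMatcher

a = 'acaacbaaca'
b = 'accabcaacc'

# ratio() returns a similarity score in [0, 1]; 1.0 means identical.
print(SequenceMatcher(None, a, b).ratio())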



Levenshtein distance performs very similarly to fuzzywuzzy's standard ratio() function; fuzzywuzzy uses difflib under the hood: http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python

An example from the fuzzywuzzy documentation: https://github.com/seatgeek/fuzzywuzzy

from fuzzywuzzy import fuzz

fuzz.ratio("this is a test", "this is a test!")
# 96
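To apply this to many strings at once, fuzzywuzzy also ships a process module. A minimal sketch (the choices list is made up for illustration; the exact scores may vary):

from fuzzywuzzy import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

# extract() scores the query against every choice and returns the best
# matches as (string, score) pairs.
print(process.extract("new york jets", choices, limit=2))
# e.g. [('New York Jets', 100), ('New York Giants', 78)]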

