Currently you are checking each key against every other key, for a total of O(n^2) comparisons. The insight is that each key only needs to be checked against a very small fraction of the others.
Suppose the alphabet from which the characters of each key are drawn has k distinct values. For example, if your keys are simple ASCII strings consisting of a-z and 0-9, then k = 26 + 10 = 36.
Given any key, we can generate every possible key that is one character away: there are 127 * k such strings. Whereas before you were comparing each key against ~150,000 other keys, now you only need to compare it against 127 * k, which is 4572 when k = 36. This reduces the overall time complexity from O(n^2) to O(n * k), where k is a constant independent of n, and that is where the real speedup comes from as you scale up n.
Here's some code to generate all possible neighbors of a key:
def generate_neighbors(key, alphabet):
    # Yield every string that differs from `key` in exactly one position.
    for i in range(len(key)):
        left, right = key[:i], key[i+1:]
        for char in alphabet:
            if char != key[i]:
                yield left + char + right
So, for example:
>>> set(generate_neighbors('ab', {'a', 'b', 'c', 'd'}))
{'aa', 'ac', 'ad', 'bb', 'cb', 'db'}
Now we compute the neighborhoods of each key:
def compute_neighborhoods(data, alphabet):
    keyset = set(data.keys())
    for key in data:
        # Keep only the candidate neighbors that actually occur in the data.
        possible_neighbors = set(generate_neighbors(key, alphabet))
        neighbors = possible_neighbors & keyset
        # Record this key's identifier in each neighbor's list.
        identifier = data[key][0]
        for neighbor in neighbors:
            data[neighbor][1].append(identifier)
Now an example. Suppose
data = {
    '0a': [4, []],
    '1f': [9, []],
    '27': [3, []],
    '32': [8, []],
    '3f': [6, []],
    '47': [1, []],
    '7c': [2, []],
    'a1': [0, []],
    'c8': [7, []],
    'e2': [5, []]
}
Then:
>>> alphabet = set('abcdef0123456789')
>>> compute_neighborhoods(data, alphabet)
>>> data
{'0a': [4, []],
'1f': [9, [6]],
'27': [3, [1]],
'32': [8, [5, 6]],
'3f': [6, [8, 9]],
'47': [1, [3]],
'7c': [2, []],
'a1': [0, []],
'c8': [7, []],
'e2': [5, [8]]}
There are a few more optimizations that I haven't implemented here. First, you say that the keys mostly differ in their later characters, and that they differ in at most 11 positions. This means we can be smarter both about generating the neighborhood and about computing the intersection possible_neighbors & keyset. First, amend generate_neighbors so that it modifies the trailing characters of the key first. Then, instead of generating the whole set of neighbors at once, generate them one at a time and check each for membership in the data dictionary, keeping count of how many we find; once we have found 11, we break.
The reason I have not implemented this in my answer is that I'm not certain it will result in a significant speedup; it might in fact be slower, since it means replacing an optimized Python builtin (set intersection) with a pure-Python loop.
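If you do want to benchmark that variant, a rough sketch might look something like this (the function names and the max_neighbors parameter are just illustrative; the default of 11 is the limit you mention):
def generate_neighbors_reversed(key, alphabet):
    # Same as generate_neighbors, but try the trailing positions first,
    # since those are the positions most likely to differ.
    for i in reversed(range(len(key))):
        left, right = key[:i], key[i+1:]
        for char in alphabet:
            if char != key[i]:
                yield left + char + right

def compute_neighborhoods_early_exit(data, alphabet, max_neighbors=11):
    keyset = set(data.keys())
    for key in data:
        identifier = data[key][0]
        found = 0
        # Check candidates one at a time instead of building the whole set,
        # and stop as soon as max_neighbors of them have been found.
        for candidate in generate_neighbors_reversed(key, alphabet):
            if candidate in keyset:
                data[candidate][1].append(identifier)
                found += 1
                if found == max_neighbors:
                    break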