Currently you are checking each key against every other key, for a total of O(n^2) comparisons. The insight is that each key only needs to be checked against a very small fraction of the others.
Suppose the alphabet from which the characters of each key are drawn has k distinct values. For example, if your keys are simple ASCII strings consisting of a-z and 0-9, then k = 26 + 10 = 36.
Given any key, we can generate every possible key that is one character away: there are 127 * k such strings. Whereas before you were comparing each key against ~150,000 other keys, now you only need to compare it against 127 * k, which is 4572 when k = 36. This reduces the overall time complexity from O(n^2) to O(n * k), where k is a constant independent of n, and that is where the real speedup comes from as you scale up n.
Here's some code to generate all possible neighbors of a key:
def generate_neighbors(key, alphabet):
    # Yield every string that differs from `key` in exactly one position.
    for i in range(len(key)):
        left, right = key[:i], key[i+1:]
        for char in alphabet:
            if char != key[i]:
                yield left + char + right
So, for example:
>>> set(generate_neighbors('ab', {'a', 'b', 'c', 'd'}))
{'aa', 'ac', 'ad', 'bb', 'cb', 'db'}
Now we compute the neighborhoods of each key:
def compute_neighborhoods(data, alphabet):
    keyset = set(data.keys())
    for key in data:
        # Keep only the candidate neighbors that actually occur in the data.
        possible_neighbors = set(generate_neighbors(key, alphabet))
        neighbors = possible_neighbors & keyset
        # Record this key's identifier in each neighbor's list.
        identifier = data[key][0]
        for neighbor in neighbors:
            data[neighbor][1].append(identifier)
Now an example. Suppose
data = {
    '0a': [4, []],
    '1f': [9, []],
    '27': [3, []],
    '32': [8, []],
    '3f': [6, []],
    '47': [1, []],
    '7c': [2, []],
    'a1': [0, []],
    'c8': [7, []],
    'e2': [5, []]
}
Then:
>>> alphabet = set('abcdef0123456789')
>>> compute_neighborhoods(data, alphabet)
>>> data
{'0a': [4, []],
'1f': [9, [6]],
'27': [3, [1]],
'32': [8, [5, 6]],
'3f': [6, [8, 9]],
'47': [1, [3]],
'7c': [2, []],
'a1': [0, []],
'c8': [7, []],
'e2': [5, [8]]}
There are a few more optimizations that I haven't implemented here. First, you say that the keys mostly differ in their later characters, and that they differ in at most 11 positions. This means we can be smarter both about generating the neighborhood and about computing the intersection possible_neighbors & keyset. First, amend generate_neighbors so that it modifies the trailing characters of the key first. Then, instead of generating the whole set of neighbors at once, generate them one at a time and check each for membership in the data dictionary, keeping count of how many we find; once we have found 11, we break.
The reason I have not implemented this in my answer is that I'm not certain it will result in a significant speedup; it might in fact be slower, since it means replacing an optimized Python builtin (set intersection) with a pure-Python loop.
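If you do want to benchmark that variant, a rough sketch might look something like this (the function names and the max_neighbors parameter are just illustrative; the default of 11 is the limit you mention):
def generate_neighbors_reversed(key, alphabet):
    # Same as generate_neighbors, but try the trailing positions first,
    # since those are the positions most likely to differ.
    for i in reversed(range(len(key))):
        left, right = key[:i], key[i+1:]
        for char in alphabet:
            if char != key[i]:
                yield left + char + right

def compute_neighborhoods_early_exit(data, alphabet, max_neighbors=11):
    keyset = set(data.keys())
    for key in data:
        identifier = data[key][0]
        found = 0
        # Check candidates one at a time instead of building the whole set,
        # and stop as soon as max_neighbors of them have been found.
        for candidate in generate_neighbors_reversed(key, alphabet):
            if candidate in keyset:
                data[candidate][1].append(identifier)
                found += 1
                if found == max_neighbors:
                    break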