1

I am wondering how to efficiently calculate the distribution of word counts in one array, based on the words from another array.

We are given an array of words, test. The task is to aggregate the occurrences of the words from test in another array, s.

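mydict = {}  # maps an occurrence count to the number of words with that count
for word in test:
    if word not in s:
        # word never appears in s: one more word with count 0
        mydict[0] = mydict.get(0, 0) + 1
    else:
        n = s.count(word)  # full scan of s for every word
        mydict[n] = mydict.get(n, 0) + 1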

This code is very slow, partly because it has no performance optimizations and partly because iteration in Python is inherently slow.

What is the best way to improve the above code?

2 Answers

1

You repeat the count iteration over s for every word in test, and the if word not in s check adds another lookup on top of that. An improvement is to calculate the counts once:

from collections import Counter
counts = Counter(s)

then build the histogram in a second pass:

distribution = Counter(counts[v] for v in set(test))

Demo:

>>> test = list('abcdef')
>>> s = list('here comes the sun')
>>> counts = Counter(s)
>>> distribution = Counter(counts[v] for v in set(test))
>>> distribution
Counter({0: 4, 1: 1, 4: 1})
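
The two passes above can be wrapped in a small helper. This is only a sketch: the name count_distribution is hypothetical, and it assumes, as the demo does, that each distinct word in test should be counted once.

```python
from collections import Counter

def count_distribution(test, s):
    """Histogram of how often the distinct words of `test` occur in `s`.

    `count_distribution` is an illustrative name, not part of any library.
    """
    counts = Counter(s)  # single pass over s
    # A Counter returns 0 for missing keys, so absent words land in bucket 0.
    return Counter(counts[word] for word in set(test))

test = ["the", "sun", "moon", "the"]
s = "here comes the sun and the rain".split()
print(count_distribution(test, s))
```

Building counts costs one pass over s, and each later lookup is a dictionary access, so the total work grows with len(s) + len(test) rather than with their product.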

1

You can use Counter; this is exactly what it is for.

from collections import Counter
print(Counter(Counter(test).values()))

For example,

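from collections import Counter
test = ["the", "sun", "rises", "in", "the", "sun"]
print(Counter(test))                   # occurrences of each word
print(Counter(Counter(test).values())) # how many words share each occurrence count

Output (Python 3.7+; the order of tied entries may differ on older versions)

Counter({'the': 2, 'sun': 2, 'rises': 1, 'in': 1})
Counter({2: 2, 1: 2})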
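
The order in which a Counter prints its entries can vary between Python versions, so when the histogram is meant to be read it can help to sort the buckets explicitly. A minimal sketch, reusing the same test list (the variable name histogram is only illustrative):

```python
from collections import Counter

test = ["the", "sun", "rises", "in", "the", "sun"]
histogram = Counter(Counter(test).values())

# Sort by occurrence count so the display does not depend on Counter's ordering.
for occurrences, n_words in sorted(histogram.items()):
    print("%d word(s) occur %d time(s)" % (n_words, occurrences))
```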
