
I have two numpy arrays, looking like:

field = np.array([5,1,3,3,2,1,6])    
counts = np.array([100,210,300,150,20,90,170])

They are not sorted (and shouldn't be changed). I now want to calculate a third array (of the same length and order) that contains the sum of the counts for all entries sharing the same field. Here the result should be:

field_counts = np.array([100,300,450,450,20,300,170])

The arrays are very long, so iterating through them (and looking up the matching fields each time) is far too inefficient. Maybe I am just not seeing the wood for the trees... I hope someone can help me out on this!

  • Aside: when you find yourself needing a groupby operation, that's often a sign you should be using pandas instead of numpy; your operation would be something like df.groupby("field")["counts"].transform(sum). Commented Mar 26, 2015 at 20:53
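The pandas approach suggested in the comment might look like this (a sketch, assuming the data fits in a DataFrame):

```python
import pandas as pd

df = pd.DataFrame({
    "field":  [5, 1, 3, 3, 2, 1, 6],
    "counts": [100, 210, 300, 150, 20, 90, 170],
})

# transform("sum") broadcasts each group's sum back to the original rows,
# preserving the input order.
field_counts = df.groupby("field")["counts"].transform("sum").to_numpy()
# array([100, 300, 450, 450,  20, 300, 170])
```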

3 Answers


I don't know if it will be efficient enough (since I do iterate over field), but here is a suggestion. I first build a dictionary of field/counts values, then create the output array from it.

from collections import defaultdict
dic = defaultdict(int)
for j, f in enumerate(field):
    dic[f] += counts[j]

field_counts = np.array([dic[f] for f in field])
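For completeness, a self-contained version of the above with the sample arrays from the question (a sketch; only the imports and sample data are added):

```python
import numpy as np
from collections import defaultdict

field = np.array([5, 1, 3, 3, 2, 1, 6])
counts = np.array([100, 210, 300, 150, 20, 90, 170])

# One pass to accumulate the total per field value...
dic = defaultdict(int)
for f, c in zip(field, counts):
    dic[f] += c

# ...and one pass to map the totals back to the original order.
field_counts = np.array([dic[f] for f in field])
# array([100, 300, 450, 450,  20, 300, 170])
```

This is O(N) rather than O(N²), since each array is traversed only twice.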



Use the following list comprehension:

>>> [np.sum(counts[np.where(field==i)]) for i in field]
[100, 300, 450, 450, 20, 300, 170]

You can get the indices of equal elements in field with np.where:

>>> [np.where(field==i) for i in field]
[(array([0]),), (array([1, 5]),), (array([2, 3]),), (array([2, 3]),), (array([4]),), (array([1, 5]),), (array([6]),)]

Then get the corresponding elements of counts by indexing, and calculate the sum with np.sum.

1 Comment

This will be very slow if the arrays are long; you've made this an N^2 calculation.

This problem can be solved in a fully vectorized manner using the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi
g = npi.group_by(field)
field_counts = g.sum(counts)[1][g.inverse]

g.sum computes the sums for each group of unique fields, and g.inverse maps those values back to the original fields.
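For readers without the package, roughly the same grouping can be sketched in plain numpy with np.unique and np.bincount (my approximation of the idea, not the package's actual internals):

```python
import numpy as np

field = np.array([5, 1, 3, 3, 2, 1, 6])
counts = np.array([100, 210, 300, 150, 20, 90, 170])

# inverse[i] is the index of field[i] within the sorted unique values.
_, inverse = np.unique(field, return_inverse=True)

# bincount with weights sums counts per group; indexing by inverse
# broadcasts the group sums back to the original positions.
sums = np.bincount(inverse, weights=counts)
field_counts = sums[inverse].astype(counts.dtype)
# array([100, 300, 450, 450,  20, 300, 170])
```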

5 Comments

There is a reason I went through the hassle of packaging this functionality: there are indeed many questions of this type. In my perception, all of those questions stand to benefit from my answers, as does this one. It substantially improves upon the currently accepted answer in several respects. It is my understanding that the sections you refer to are aimed at commercial purposes, whereas this is a free-as-in-beer open-source package; correct me if I'm wrong. My only selfish motive here is getting it better tested :).
Subjectively, it feels more like self-promotion to me if I do mention my authorship; but thank you for the heads-up. Do you happen to have a link to any resources that are a bit more explicit about the distinction between commercial and non-commercial purposes?
Some of them are duplicates I would say, yes. I will follow your suggestion to disclose authorship then, thanks.
I do appreciate the feedback
Awesome @EelcoHoogendoorn I see you added disclosure :). Please do the same for your other answers as well. As a side-note, if some of them are duplicates, feel free to flag them as such! I will delete my previous comments to clean up.
