Split NumPy array by column value while keeping track of row indexes

Question

Say I have a numpy array:

Y.shape = (n, 3)

where n is the number of rows in the numpy array.

I split Y based on the values of the second column by following this thread:

distances = [Y[Y[:, 1] == k] for k in np.unique(Y[:, 1])]

distances is now a list of N numpy arrays, where N is the number of unique values in the second column. I then loop over distances and split each array again in the same way, this time by the last column:

for idx, dist in enumerate(distances):
    conditions = [dist[dist[:, 2] == k] for k in np.unique(dist[:, 2])]
    # Save conditions list and do something with it

How can I get, in numpy, the row indexes of the original Y array that correspond to each array in conditions?

For me, the snippet you posted to find conditions loses every array in distances that contains only one row. E.g. if I start with Y = np.array([[1,2,3], [3,4,5], [5,6,7], [7,8,9], [10,8,3], [11,3,2]]), the step to find distances keeps all rows, but the final step leaves me with conditions = [array([[10, 8, 3]]), array([[7, 8, 9]])] and discards all other rows. Is this supposed to happen?
– AJH, Commented Mar 10, 2022 at 18:59
Yes, this is correct, as I will iterate through each numpy array in the distances list.
– n_lyons10, Commented Mar 10, 2022 at 19:18
I meant that even with the enumerate statement, rows are discarded because conditions is overwritten on every iteration of the loop. The whole for idx, dist in enumerate(distances) section only keeps rows from my Y array where 2+ rows have the same value in the middle column.
– AJH, Commented Mar 10, 2022 at 19:22
Updated the question. I am saving the conditions list after each iteration of the loop; what I am looking for is the matching original indexes of Y.
– n_lyons10, Commented Mar 10, 2022 at 19:27
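
For reference, the overwriting behaviour discussed in these comments can be reproduced with AJH's example array. This is only a minimal sketch: the names last_only and all_conditions are illustrative, and the second loop shows one possible way of saving each conditions list rather than the asker's actual code.

```python
import numpy as np

# Example array from the comment above.
Y = np.array([[1, 2, 3], [3, 4, 5], [5, 6, 7], [7, 8, 9], [10, 8, 3], [11, 3, 2]])

# Split by the second column, as in the question.
distances = [Y[Y[:, 1] == k] for k in np.unique(Y[:, 1])]

# Overwriting: after the loop, conditions only holds the split of the
# last array in distances (the rows whose middle value is 8).
for dist in distances:
    conditions = [dist[dist[:, 2] == k] for k in np.unique(dist[:, 2])]
last_only = conditions

# Accumulating: every row of Y ends up in exactly one sub-array.
all_conditions = []
for dist in distances:
    all_conditions.append([dist[dist[:, 2] == k] for k in np.unique(dist[:, 2])])

print(last_only)
print(sum(len(arr) for group in all_conditions for arr in group) == len(Y))
```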

AJH · Accepted Answer · 2022-03-10 19:37:02Z

Assuming you're storing conditions in another list (I used all_conditions in my code), then this is a potential start-to-finish solution:

import operator
from functools import reduce

import numpy as np

# The code you posted: split Y by the values in its second column.
distances = [Y[Y[:, 1] == k] for k in np.unique(Y[:, 1])]

# Collect the conditions list from every iteration instead of overwriting it.
all_conditions = []
for idx, dist in enumerate(distances):
    conditions = [dist[dist[:, 2] == k] for k in np.unique(dist[:, 2])]
    all_conditions.append(conditions)

# This step flattens all_conditions so there are no nested lists.
all_conditions = reduce(operator.concat, list(all_conditions))

# Each element of all_conditions is a 2-D array of shape (m, 3), which is
# why every row of 3 sits inside an extra bracket. Indexing [0] takes the
# first row of each element; any further rows in that element are not copied.
# There is probably a more efficient way to extract them than a for loop,
# but this is the best I can come up with.
indices = np.zeros((len(all_conditions), 3), dtype=int)
for i in range(len(all_conditions)):
    indices[i] = all_conditions[i][0]

# Select the values from X using the rows of indices as (i, j, k) positions.
# X is not defined in the question; it is assumed here to be a 3-D array.
selected = X[tuple(indices.T)]
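
If what you need is the row positions in Y rather than the row values, one possible alternative is to build the boolean masks against Y itself and convert them to positions with np.flatnonzero. The following is a minimal sketch, assuming integer columns and using the example array from the comments; the groups dictionary and its (second-column value, last-column value) keys are illustrative names, not something from the question.

```python
import numpy as np

# Example array from the comments; replace with your own Y.
Y = np.array([[1, 2, 3], [3, 4, 5], [5, 6, 7], [7, 8, 9], [10, 8, 3], [11, 3, 2]])

# Map (second-column value, last-column value) -> row indexes in Y.
groups = {}
for d in np.unique(Y[:, 1]):
    in_d = Y[:, 1] == d                  # mask over ALL rows of Y
    for c in np.unique(Y[in_d, 2]):
        mask = in_d & (Y[:, 2] == c)     # rows matching both values
        groups[(d.item(), c.item())] = np.flatnonzero(mask)

for key, rows in groups.items():
    # Y[rows] holds the same rows as the matching array in conditions.
    print(key, rows, Y[rows].tolist())
```

Because the masks are always computed over the full array, the positions never need to be mapped back from a sub-array, and Y[rows] gives the rows of each group in their original order.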

Let me know if there's anything that needs clarification.
