Removing elements from an array that are in another array

Question

Say I have these 2D arrays A and B.

How can I remove elements from A that are in B. (Complement in set theory: A-B)

A=np.asarray([[1,1,1], [1,1,2], [1,1,3], [1,1,4]])
B=np.asarray([[0,0,0], [1,0,2], [1,0,3], [1,0,4], [1,1,0], [1,1,1], [1,1,4]])
#output = [[1,1,2], [1,1,3]]

To be more precise, I would like to do something like this.

data = some numpy array
label = some numpy array
A = np.argwhere(label==0) #[[1 1 1], [1 1 2], [1 1 3], [1 1 4]]
B = np.argwhere(data>1.5) #[[0 0 0], [1 0 2], [1 0 3], [1 0 4], [1 1 0], [1 1 1], [1 1 4]]
out = np.argwhere(label==0 and data>1.5) #[[1 1 2], [1 1 3]]

didn't == will work, I am just guessing, i don't know much about numpy arrays, from my python console i got this >>>[1,1,1]==[1,1,1] >>>True — Lokesh Sanapalli
– Lokesh Sanapalli, Commented Oct 15, 2016 at 6:38
A simple non-numpy solution - [i for i in A for j in B if i==j] — JRodDynamite
– JRodDynamite, Commented Oct 15, 2016 at 6:40

FObersteiner · Accepted Answer · 2022-02-02 10:41:03Z

36

there is an easy solution with a list comprehension,

A = [i for i in A if i not in B]

Result

[[1, 1, 2], [1, 1, 3]]

List comprehension is not removing the elements from the array, it's just reassigning - if you want to remove the elements, use this method:

for i in B:
     if i in A:
     A.remove(i)

edited Feb 2, 2022 at 10:41

FObersteiner

26.1k9 gold badges60 silver badges94 bronze badges

answered Oct 15, 2016 at 6:47

Rahul K P

16.2k4 gold badges40 silver badges56 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Community · Accepted Answer · 2017-05-23 12:09:48Z

19

Based on this solution to Find the row indexes of several values in a numpy array, here's a NumPy based solution with less memory footprint and could be beneficial when working with large arrays -

dims = np.maximum(B.max(0),A.max(0))+1
out = A[~np.in1d(np.ravel_multi_index(A.T,dims),np.ravel_multi_index(B.T,dims))]

Sample run -

In [38]: A
Out[38]: 
array([[1, 1, 1],
       [1, 1, 2],
       [1, 1, 3],
       [1, 1, 4]])

In [39]: B
Out[39]: 
array([[0, 0, 0],
       [1, 0, 2],
       [1, 0, 3],
       [1, 0, 4],
       [1, 1, 0],
       [1, 1, 1],
       [1, 1, 4]])

In [40]: out
Out[40]: 
array([[1, 1, 2],
       [1, 1, 3]])

Runtime test on large arrays -

In [107]: def in1d_approach(A,B):
     ...:     dims = np.maximum(B.max(0),A.max(0))+1
     ...:     return A[~np.in1d(np.ravel_multi_index(A.T,dims),\
     ...:                     np.ravel_multi_index(B.T,dims))]
     ...: 

In [108]: # Setup arrays with B as large array and A contains some of B's rows
     ...: B = np.random.randint(0,9,(1000,3))
     ...: A = np.random.randint(0,9,(100,3))
     ...: A_idx = np.random.choice(np.arange(A.shape[0]),size=10,replace=0)
     ...: B_idx = np.random.choice(np.arange(B.shape[0]),size=10,replace=0)
     ...: A[A_idx] = B[B_idx]
     ...:

Timings with broadcasting based solutions -

In [109]: %timeit A[np.all(np.any((A-B[:, None]), axis=2), axis=0)]
100 loops, best of 3: 4.64 ms per loop # @Kasramvd's soln

In [110]: %timeit A[~((A[:,None,:] == B).all(-1)).any(1)]
100 loops, best of 3: 3.66 ms per loop

Timing with less memory footprint based solution -

In [111]: %timeit in1d_approach(A,B)
1000 loops, best of 3: 231 µs per loop

Further performance boost

in1d_approach reduces each row by considering each row as an indexing tuple. We can do the same a bit more efficiently by introducing matrix-multiplication with np.dot, like so -

def in1d_dot_approach(A,B):
    cumdims = (np.maximum(A.max(),B.max())+1)**np.arange(B.shape[1])
    return A[~np.in1d(A.dot(cumdims),B.dot(cumdims))]

Let's test it against the previous on much larger arrays -

In [251]: # Setup arrays with B as large array and A contains some of B's rows
     ...: B = np.random.randint(0,9,(10000,3))
     ...: A = np.random.randint(0,9,(1000,3))
     ...: A_idx = np.random.choice(np.arange(A.shape[0]),size=10,replace=0)
     ...: B_idx = np.random.choice(np.arange(B.shape[0]),size=10,replace=0)
     ...: A[A_idx] = B[B_idx]
     ...: 

In [252]: %timeit in1d_approach(A,B)
1000 loops, best of 3: 1.28 ms per loop

In [253]: %timeit in1d_dot_approach(A, B)
1000 loops, best of 3: 1.2 ms per loop

edited May 23, 2017 at 12:09

CommunityBot

11 silver badge

answered Oct 15, 2016 at 7:19

Divakar

222k19 gold badges273 silver badges374 bronze badges

4 Comments

Jee Seok Yoon Over a year ago

Your in1d_approach takes 30 sec, in1d_dot_approach takes 45 sec in my program. My numpy array uses dtype=np.uint8. So I tested it with your exact code with dtype=np.uint8 parameter for A, B. Dot function gives me 567 nano seconds, original takes me 539 nano seconds. Any explanation why smaller data type gives the original function better timing?

Divakar Over a year ago

@JeeSeokYoon Well simply because less precision datatypes would occupy less memory in terms of their binary bits and thus would incur less memory occupancy and that in most cases translates to faster processing as its processing less data because lesser number of binary bits are used to represent each number. You gotta keep in mind that at the lowest level, CPUs process binary data. Hope that made sense!

Jee Seok Yoon Over a year ago

I'm asking why is in1d_approach(A,B) slower than in1d_dot_approach(A, B) when dealing with floating point, but faster for integer? Is it just how numpy was built? Why does matrix multiplication perform better with floating point/ worse with integers (compared to other methods)?

SZIEBERTH Ádám Over a year ago

Pay attention to the possibility of dims = np.maximum(B.max(0),A.max(0))+1 overrun. You might have to convert the dims array to a longer dtype before the +1 addition.

Kasravnd · Accepted Answer · 2016-10-15 09:43:26Z

14

Here is a Numpythonic approach with broadcasting:

In [83]: A[np.all(np.any((A-B[:, None]), axis=2), axis=0)]
Out[83]: 
array([[1, 1, 2],
       [1, 1, 3]])

Here is a timeit with other answer:

In [90]: def cal_diff(A, B):
   ....:     A_rows = A.view([('', A.dtype)] * A.shape[1])
   ....:     B_rows = B.view([('', B.dtype)] * B.shape[1])
   ....:     return np.setdiff1d(A_rows, B_rows).view(A.dtype).reshape(-1, A.shape[1])
   ....: 

In [93]: %timeit cal_diff(A, B)
10000 loops, best of 3: 54.1 µs per loop

In [94]: %timeit A[np.all(np.any((A-B[:, None]), axis=2), axis=0)]
100000 loops, best of 3: 9.41 µs per loop

# Even better with Divakar's suggestion
In [97]: %timeit A[~((A[:,None,:] == B).all(-1)).any(1)]
100000 loops, best of 3: 7.41 µs per loop

Well, if you are looking for a faster way you should looking for ways that reduce the number of comparisons. In this case (without considering the order) you can generate a unique number from your rows and compare the numbers which can be done with summing the items power of two.

Here is the benchmark with Divakar's in1d approach:

In [144]: def in1d_approach(A,B):
   .....:         dims = np.maximum(B.max(0),A.max(0))+1
   .....:         return A[~np.in1d(np.ravel_multi_index(A.T,dims),\
   .....:                          np.ravel_multi_index(B.T,dims))]
   .....: 

In [146]: %timeit in1d_approach(A, B)
10000 loops, best of 3: 23.8 µs per loop

In [145]: %timeit A[~np.in1d(np.power(A, 2).sum(1), np.power(B, 2).sum(1))]
10000 loops, best of 3: 20.2 µs per loop

You can use np.diff to get the an order independent result:

In [194]: B=np.array([[0, 0, 0,], [1, 0, 2,], [1, 0, 3,], [1, 0, 4,], [1, 1, 0,], [1, 1, 1,], [1, 1, 4,], [4, 1, 1]])

In [195]: A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
Out[195]: 
array([[1, 1, 2],
       [1, 1, 3]])

In [196]: %timeit A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
10000 loops, best of 3: 30.7 µs per loop

Benchmark with Divakar's setup:

In [198]: B = np.random.randint(0,9,(1000,3))

In [199]: A = np.random.randint(0,9,(100,3))

In [200]: A_idx = np.random.choice(np.arange(A.shape[0]),size=10,replace=0)

In [201]: B_idx = np.random.choice(np.arange(B.shape[0]),size=10,replace=0)

In [202]: A[A_idx] = B[B_idx]

In [203]: %timeit A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
10000 loops, best of 3: 137 µs per loop

In [204]: %timeit A[~np.in1d(np.power(A, 2).sum(1), np.power(B, 2).sum(1))]
10000 loops, best of 3: 112 µs per loop

In [205]: %timeit in1d_approach(A, B)
10000 loops, best of 3: 115 µs per loop

Timing with larger arrays (Divakar's solution is slightly faster):

In [231]: %timeit A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
1000 loops, best of 3: 1.01 ms per loop

In [232]: %timeit A[~np.in1d(np.power(A, 2).sum(1), np.power(B, 2).sum(1))]
1000 loops, best of 3: 880 µs per loop

In [233]:  %timeit in1d_approach(A, B)
1000 loops, best of 3: 807 µs per loop

edited Oct 15, 2016 at 9:43

answered Oct 15, 2016 at 7:07

Kasravnd

108k19 gold badges167 silver badges195 bronze badges

9 Comments

Divakar Over a year ago

Nice! Was about to post the same!

Divakar Over a year ago

Actually using equality might be better on performance : A[~((A[:,None,:] == B).all(-1)).any(1)].

Rahul K P Over a year ago

Great Solution :)

Kasravnd Over a year ago

@Divakar Indeed, that's Nice!

R. S. Nikhil Krishna Over a year ago

Nice Answer! Kindly answer to this too regarding the speedup stackoverflow.com/questions/40056275/…

|

R. S. Nikhil Krishna · Accepted Answer · 2016-10-15 07:01:21Z

8

If you want to do it the numpy way,

import numpy as np

A = np.array([[1, 1, 1,], [1, 1, 2], [1, 1, 3], [1, 1, 4]])
B = np.array([[0, 0, 0], [1, 0, 2], [1, 0, 3], [1, 0, 4], [1, 1, 0], [1, 1, 1], [1, 1, 4]])
A_rows = A.view([('', A.dtype)] * A.shape[1])
B_rows = B.view([('', B.dtype)] * B.shape[1])

diff_array = np.setdiff1d(A_rows, B_rows).view(A.dtype).reshape(-1, A.shape[1])

As @Rahul suggested, for a non numpy easy solution,

diff_array = [i for i in A if i not in B]

edited Oct 15, 2016 at 7:01

answered Oct 15, 2016 at 6:43

R. S. Nikhil Krishna

4,2801 gold badge16 silver badges28 bronze badges

1 Comment

R. S. Nikhil Krishna Over a year ago

Thanks for the heads up. Updated.

Yihe · Accepted Answer · 2016-10-15 06:48:09Z

4

Another non-numpy solution:

[i for i in A if i not in B]

answered Oct 15, 2016 at 6:48

Yihe

4,2642 gold badges22 silver badges22 bronze badges

Collectives™ on Stack Overflow

Removing elements from an array that are in another array

5 Answers 5

Comments

4 Comments

9 Comments

1 Comment

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

Comments

4 Comments

9 Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related