40

Say I have these 2D arrays A and B.

How can I remove elements from A that are in B. (Complement in set theory: A-B)

A=np.asarray([[1,1,1], [1,1,2], [1,1,3], [1,1,4]])
B=np.asarray([[0,0,0], [1,0,2], [1,0,3], [1,0,4], [1,1,0], [1,1,1], [1,1,4]])
#output = [[1,1,2], [1,1,3]]

To be more precise, I would like to do something like this.

data = some numpy array
label = some numpy array
A = np.argwhere(label==0) #[[1 1 1], [1 1 2], [1 1 3], [1 1 4]]
B = np.argwhere(data>1.5) #[[0 0 0], [1 0 2], [1 0 3], [1 0 4], [1 1 0], [1 1 1], [1 1 4]]
out = np.argwhere(label==0 and data>1.5) #[[1 1 2], [1 1 3]]
2
  • didn't == will work, I am just guessing, i don't know much about numpy arrays, from my python console i got this >>>[1,1,1]==[1,1,1] >>>True Commented Oct 15, 2016 at 6:38
  • A simple non-numpy solution - [i for i in A for j in B if i==j] Commented Oct 15, 2016 at 6:40

5 Answers 5

36

there is an easy solution with a list comprehension,

A = [i for i in A if i not in B]

Result

[[1, 1, 2], [1, 1, 3]]

List comprehension is not removing the elements from the array, it's just reassigning - if you want to remove the elements, use this method:

for i in B:
     if i in A:
     A.remove(i)
Sign up to request clarification or add additional context in comments.

Comments

19

Based on this solution to Find the row indexes of several values in a numpy array, here's a NumPy based solution with less memory footprint and could be beneficial when working with large arrays -

dims = np.maximum(B.max(0),A.max(0))+1
out = A[~np.in1d(np.ravel_multi_index(A.T,dims),np.ravel_multi_index(B.T,dims))]

Sample run -

In [38]: A
Out[38]: 
array([[1, 1, 1],
       [1, 1, 2],
       [1, 1, 3],
       [1, 1, 4]])

In [39]: B
Out[39]: 
array([[0, 0, 0],
       [1, 0, 2],
       [1, 0, 3],
       [1, 0, 4],
       [1, 1, 0],
       [1, 1, 1],
       [1, 1, 4]])

In [40]: out
Out[40]: 
array([[1, 1, 2],
       [1, 1, 3]])

Runtime test on large arrays -

In [107]: def in1d_approach(A,B):
     ...:     dims = np.maximum(B.max(0),A.max(0))+1
     ...:     return A[~np.in1d(np.ravel_multi_index(A.T,dims),\
     ...:                     np.ravel_multi_index(B.T,dims))]
     ...: 

In [108]: # Setup arrays with B as large array and A contains some of B's rows
     ...: B = np.random.randint(0,9,(1000,3))
     ...: A = np.random.randint(0,9,(100,3))
     ...: A_idx = np.random.choice(np.arange(A.shape[0]),size=10,replace=0)
     ...: B_idx = np.random.choice(np.arange(B.shape[0]),size=10,replace=0)
     ...: A[A_idx] = B[B_idx]
     ...: 

Timings with broadcasting based solutions -

In [109]: %timeit A[np.all(np.any((A-B[:, None]), axis=2), axis=0)]
100 loops, best of 3: 4.64 ms per loop # @Kasramvd's soln

In [110]: %timeit A[~((A[:,None,:] == B).all(-1)).any(1)]
100 loops, best of 3: 3.66 ms per loop

Timing with less memory footprint based solution -

In [111]: %timeit in1d_approach(A,B)
1000 loops, best of 3: 231 µs per loop

Further performance boost

in1d_approach reduces each row by considering each row as an indexing tuple. We can do the same a bit more efficiently by introducing matrix-multiplication with np.dot, like so -

def in1d_dot_approach(A,B):
    cumdims = (np.maximum(A.max(),B.max())+1)**np.arange(B.shape[1])
    return A[~np.in1d(A.dot(cumdims),B.dot(cumdims))]

Let's test it against the previous on much larger arrays -

In [251]: # Setup arrays with B as large array and A contains some of B's rows
     ...: B = np.random.randint(0,9,(10000,3))
     ...: A = np.random.randint(0,9,(1000,3))
     ...: A_idx = np.random.choice(np.arange(A.shape[0]),size=10,replace=0)
     ...: B_idx = np.random.choice(np.arange(B.shape[0]),size=10,replace=0)
     ...: A[A_idx] = B[B_idx]
     ...: 

In [252]: %timeit in1d_approach(A,B)
1000 loops, best of 3: 1.28 ms per loop

In [253]: %timeit in1d_dot_approach(A, B)
1000 loops, best of 3: 1.2 ms per loop

4 Comments

Your in1d_approach takes 30 sec, in1d_dot_approach takes 45 sec in my program. My numpy array uses dtype=np.uint8. So I tested it with your exact code with dtype=np.uint8 parameter for A, B. Dot function gives me 567 nano seconds, original takes me 539 nano seconds. Any explanation why smaller data type gives the original function better timing?
@JeeSeokYoon Well simply because less precision datatypes would occupy less memory in terms of their binary bits and thus would incur less memory occupancy and that in most cases translates to faster processing as its processing less data because lesser number of binary bits are used to represent each number. You gotta keep in mind that at the lowest level, CPUs process binary data. Hope that made sense!
I'm asking why is in1d_approach(A,B) slower than in1d_dot_approach(A, B) when dealing with floating point, but faster for integer? Is it just how numpy was built? Why does matrix multiplication perform better with floating point/ worse with integers (compared to other methods)?
Pay attention to the possibility of dims = np.maximum(B.max(0),A.max(0))+1 overrun. You might have to convert the dims array to a longer dtype before the +1 addition.
14

Here is a Numpythonic approach with broadcasting:

In [83]: A[np.all(np.any((A-B[:, None]), axis=2), axis=0)]
Out[83]: 
array([[1, 1, 2],
       [1, 1, 3]])

Here is a timeit with other answer:

In [90]: def cal_diff(A, B):
   ....:     A_rows = A.view([('', A.dtype)] * A.shape[1])
   ....:     B_rows = B.view([('', B.dtype)] * B.shape[1])
   ....:     return np.setdiff1d(A_rows, B_rows).view(A.dtype).reshape(-1, A.shape[1])
   ....: 

In [93]: %timeit cal_diff(A, B)
10000 loops, best of 3: 54.1 µs per loop

In [94]: %timeit A[np.all(np.any((A-B[:, None]), axis=2), axis=0)]
100000 loops, best of 3: 9.41 µs per loop

# Even better with Divakar's suggestion
In [97]: %timeit A[~((A[:,None,:] == B).all(-1)).any(1)]
100000 loops, best of 3: 7.41 µs per loop

Well, if you are looking for a faster way you should looking for ways that reduce the number of comparisons. In this case (without considering the order) you can generate a unique number from your rows and compare the numbers which can be done with summing the items power of two.

Here is the benchmark with Divakar's in1d approach:

In [144]: def in1d_approach(A,B):
   .....:         dims = np.maximum(B.max(0),A.max(0))+1
   .....:         return A[~np.in1d(np.ravel_multi_index(A.T,dims),\
   .....:                          np.ravel_multi_index(B.T,dims))]
   .....: 

In [146]: %timeit in1d_approach(A, B)
10000 loops, best of 3: 23.8 µs per loop

In [145]: %timeit A[~np.in1d(np.power(A, 2).sum(1), np.power(B, 2).sum(1))]
10000 loops, best of 3: 20.2 µs per loop

You can use np.diff to get the an order independent result:

In [194]: B=np.array([[0, 0, 0,], [1, 0, 2,], [1, 0, 3,], [1, 0, 4,], [1, 1, 0,], [1, 1, 1,], [1, 1, 4,], [4, 1, 1]])

In [195]: A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
Out[195]: 
array([[1, 1, 2],
       [1, 1, 3]])

In [196]: %timeit A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
10000 loops, best of 3: 30.7 µs per loop

Benchmark with Divakar's setup:

In [198]: B = np.random.randint(0,9,(1000,3))

In [199]: A = np.random.randint(0,9,(100,3))

In [200]: A_idx = np.random.choice(np.arange(A.shape[0]),size=10,replace=0)

In [201]: B_idx = np.random.choice(np.arange(B.shape[0]),size=10,replace=0)

In [202]: A[A_idx] = B[B_idx]

In [203]: %timeit A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
10000 loops, best of 3: 137 µs per loop

In [204]: %timeit A[~np.in1d(np.power(A, 2).sum(1), np.power(B, 2).sum(1))]
10000 loops, best of 3: 112 µs per loop

In [205]: %timeit in1d_approach(A, B)
10000 loops, best of 3: 115 µs per loop

Timing with larger arrays (Divakar's solution is slightly faster):

In [231]: %timeit A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
1000 loops, best of 3: 1.01 ms per loop

In [232]: %timeit A[~np.in1d(np.power(A, 2).sum(1), np.power(B, 2).sum(1))]
1000 loops, best of 3: 880 µs per loop

In [233]:  %timeit in1d_approach(A, B)
1000 loops, best of 3: 807 µs per loop

9 Comments

Nice! Was about to post the same!
Actually using equality might be better on performance : A[~((A[:,None,:] == B).all(-1)).any(1)].
Great Solution :)
@Divakar Indeed, that's Nice!
Nice Answer! Kindly answer to this too regarding the speedup stackoverflow.com/questions/40056275/…
|
8

If you want to do it the numpy way,

import numpy as np

A = np.array([[1, 1, 1,], [1, 1, 2], [1, 1, 3], [1, 1, 4]])
B = np.array([[0, 0, 0], [1, 0, 2], [1, 0, 3], [1, 0, 4], [1, 1, 0], [1, 1, 1], [1, 1, 4]])
A_rows = A.view([('', A.dtype)] * A.shape[1])
B_rows = B.view([('', B.dtype)] * B.shape[1])

diff_array = np.setdiff1d(A_rows, B_rows).view(A.dtype).reshape(-1, A.shape[1])

As @Rahul suggested, for a non numpy easy solution,

diff_array = [i for i in A if i not in B]

1 Comment

Thanks for the heads up. Updated.
4

Another non-numpy solution:

[i for i in A if i not in B]

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.