I have a large two-dimensional NumPy array a of characters (dtype='a1') and want to find the invariant columns, i.e. those that contain the same character in every row. The following code works (it flags the columns that vary), but it is quite slow.

import numpy as np

# var_col[c] is True when column c contains more than one distinct character
var_col = np.zeros(a.shape[1], dtype='bool')
for c in range(a.shape[1]):
    if not all(a[:,c] == a[0,c]):
        var_col[c] = True

Is there a faster solution to this problem? Thanks!

1 Answer

Here's one way, using broadcasting with the == operator.

First create a test array.

In [27]: np.random.seed(1)

In [28]: a = np.random.choice(list("AABC"), size=(3,9))

In [29]: a
Out[29]: 
array([['A', 'C', 'A', 'A', 'C', 'A', 'C', 'A', 'C'],
       ['A', 'A', 'A', 'A', 'C', 'A', 'A', 'B', 'A'],
       ['B', 'A', 'B', 'A', 'B', 'A', 'C', 'A', 'B']], 
      dtype='|S1')

Compare each element to the element at the top of its column. a[0] is the first row; it is a 1d array (shape is (9,)). When we use == with two arrays like this, a[0] is "broadcast" to act like an array with shape (3,9), filled with copies of the first row.

In [30]: a == a[0]
Out[30]: 
array([[ True,  True,  True,  True,  True,  True,  True,  True,  True],
       [ True, False,  True,  True,  True,  True, False, False, False],
       [False, False, False,  True, False,  True,  True,  True, False]], dtype=bool)
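
To make the broadcasting step concrete, here is a minimal sketch that compares the implicit form with an explicit one. The small array and the names implicit, explicit and first_row_repeated are made up for illustration; np.broadcast_to is used only to show what the == comparison does behind the scenes.

```python
import numpy as np

# Hypothetical 3x4 array of single characters, for illustration only.
a = np.array([list("ABCA"), list("ABBA"), list("ACCA")])

# Implicit broadcasting: a[0] has shape (4,), a has shape (3, 4).
implicit = a == a[0]

# Explicit equivalent: view the first row as if it were repeated
# for every row of a, then compare element by element.
first_row_repeated = np.broadcast_to(a[0], a.shape)
explicit = a == first_row_repeated

print(first_row_repeated.shape)
print(np.array_equal(implicit, explicit))
```

Both comparisons should give the same boolean array; the implicit form simply avoids writing out the repeated first row.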

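Now apply np.all along the first axis (axis=0) of the comparison result. This gives True for each column whose entries all match the one at the top, i.e. the invariant columns (the inverse of var_col in the question).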

In [31]: np.all(a == a[0], axis=0)
Out[31]: array([False, False, False,  True, False,  True, False, False, False], dtype=bool)
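
Since the question's var_col marks the columns that vary, the mask above has to be inverted to match it. A minimal sketch of that, assuming a has at least one row; the function names variable_columns and variable_columns_loop are hypothetical, and the loop version is included only to check the result against the original approach.

```python
import numpy as np

def variable_columns(a):
    """Return a boolean mask that is True for columns holding more than one distinct value."""
    # a[0] is broadcast against every row; a column varies if any
    # entry differs from the entry at the top of that column.
    return np.any(a != a[0], axis=0)

def variable_columns_loop(a):
    """Column-by-column version, following the loop in the question."""
    var_col = np.zeros(a.shape[1], dtype=bool)
    for c in range(a.shape[1]):
        if not np.all(a[:, c] == a[0, c]):
            var_col[c] = True
    return var_col

# Same values as the test array shown above.
a = np.array([list("ACAACACAC"),
              list("AAAACAABA"),
              list("BABABACAB")])

mask = variable_columns(a)
print(mask)
print(np.array_equal(mask, variable_columns_loop(a)))
```

np.any(a != a[0], axis=0) is the same as ~np.all(a == a[0], axis=0); either form replaces the Python-level loop over columns with a single array comparison.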

1 Comment

great 'vectorized' solution! That's what I was looking for, thanks!
