I would like to assign a label to each element of an array, depending on whether it is below a threshold or not, and do this with boolean indexing:

import numpy as np

def easy_labeling(arr, thresh=5):
  # boolean masks for the two classes
  negative_mask = arr < thresh
  positive_mask = arr >= thresh
  labels = np.empty_like(arr, dtype=str)
  labels[negative_mask] = 'N'
  labels[positive_mask] = 'P'
  return labels

So far so good. I created some dummy arrays to check whether it works:

test_arr1 = np.arange(24).reshape((12,2))
test_arr1
>>> test_arr1
array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11],
       [12, 13],
       [14, 15],
       [16, 17],
       [18, 19],
       [20, 21],
       [22, 23]])
easy_labeling(test_arr1)
>>> array([['N', 'N'],
           ['N', 'N'],
           ['N', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P']], dtype='<U1')
test_arr2 = np.random.randint(12, size=(12,2))
test_arr2
>>> array([[ 1, 11],
           [ 5,  6],
           [11,  7],
           [ 9,  4],
           [11,  3],
           [ 0,  9],
           [ 0,  4],
           [11,  8],
           [ 3,  6],
           [ 0,  1],
           [ 5,  8],
           [10,  4]])
easy_labeling(test_arr2)
>>> array([['N', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'N'],
           ['P', 'N'],
           ['N', 'P'],
           ['N', 'N'],
           ['P', 'P'],
           ['N', 'P'],
           ['N', 'N'],
           ['P', 'P'],
           ['P', 'N']], dtype='<U1')

... and it seems that it does.

However, in my actual application I ran into other arrays with the same shape, type and dtype that produce a different outcome:

test_arr3 = np.array([[ 2,  0,  4,  4], [ 0,  2,  9, 11], [ 4,  4,  6, 10],
                      [11,  5, 10, 15], [ 5,  8,  0,  8], [ 3,  6,  5, 11],
                      [ 6,  7,  2,  9], [ 1,  1,  1,  2], [ 9, 11,  3, 14],
                      [ 8, 10,  7, 17], [10,  3, 11, 14], [ 7,  9,  8, 17]])
test_arr3 = test_arr3[:, 1:3]
test_arr3
>>> array([[ 0,  4],
           [ 2,  9],
           [ 4,  6],
           [ 5, 10],
           [ 8,  0],
           [ 6,  5],
           [ 7,  2],
           [ 1,  1],
           [11,  3],
           [10,  7],
           [ 3, 11],
           [ 9,  8]])
easy_labeling(test_arr3)
>>> array([['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P'],
           ['P', 'P']], dtype='<U1')

All of a sudden, every element is labeled positive, even though the array clearly contains numbers below 5. As far as I can see, indexing still works: if I ask for arr[mask], I get the correct elements. Assigning to labels[mask], however, produces this incorrect result.

It gets even weirder: While writing down this question I wanted to simplify the above expression and not have to do the "test_arr3 = test_arr3[:, 1:3]" part, so I entered the array I wanted to have directly:

test_arr4 = np.array([[0,  4], [2,  9], [4,  6], [5, 10], [8,  0], [6,  5], [7,  2], [1,  1], 
[11,  3], [10,  7], [3, 11], [9,  8]])
test_arr4
>>> array([[ 0,  4],
           [ 2,  9],
           [ 4,  6],
           [ 5, 10],
           [ 8,  0],
           [ 6,  5],
           [ 7,  2],
           [ 1,  1],
           [11,  3],
           [10,  7],
           [ 3, 11],
           [ 9,  8]])
easy_labeling(test_arr4)
>>> array([['N', 'N'],
           ['N', 'P'],
           ['N', 'P'],
           ['P', 'P'],
           ['P', 'N'],
           ['P', 'P'],
           ['P', 'N'],
           ['N', 'N'],
           ['P', 'N'],
           ['P', 'P'],
           ['N', 'P'],
           ['P', 'P']], dtype='<U1')

... and suddenly it works. Even though the arrays are the same (at least it seems so)!

I made sure that all test arrays have identical type, shape and dtype:

for x in [test_arr1, test_arr2, test_arr3, test_arr4]:
...   print(type(x), x.shape, x.dtype)
>>> <class 'numpy.ndarray'> (12, 2) int32
    <class 'numpy.ndarray'> (12, 2) int32
    <class 'numpy.ndarray'> (12, 2) int32
    <class 'numpy.ndarray'> (12, 2) int32

I assume that the arrays have some kind of hidden attribute that I am not aware of. The whole thing makes very little sense to me. Does anybody have an idea?


A workaround seems to be to use np.chararray(arr.shape, unicode=True) instead of np.empty_like(arr, dtype=str), however I would still like to know what is wrong with the other solution.

1 Answer

This looks like a bug in how empty_like handles dtype=str when the input array is not contiguous. (Update: I created a numpy bug report for this issue. The fix has been merged in the main development branch and will be in the next release (NumPy 1.22.0).)
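Whether an array is contiguous can be checked through its flags. The following is only a minimal sketch (the names full, sliced and direct are made up for illustration, they are not from the question): a column slice like test_arr3[:, 1:3] is a non-contiguous view, while an array typed in directly with the same values is contiguous.

```python
import numpy as np

# Hypothetical names, chosen for this illustration only.
full = np.arange(48).reshape(12, 4)
sliced = full[:, 1:3]                # view into `full`, skipping columns 0 and 3
direct = np.array(sliced.tolist())   # freshly allocated array with the same values

print(np.array_equal(sliced, direct))   # True: same contents
print(sliced.flags['C_CONTIGUOUS'])     # False: rows are not adjacent in memory
print(direct.flags['C_CONTIGUOUS'])     # True
print(sliced.base is not None)          # True: `sliced` does not own its data
```

Shape, type and dtype are identical for both arrays, which would explain why the check in the question could not tell them apart.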

Here's a simple example of the surprising behavior:

In [66]: a = np.arange(9).reshape(3, 3)

In [67]: b = a[:, ::2]

In [68]: b
Out[68]: 
array([[0, 2],
       [3, 5],
       [6, 8]])

In [69]: x = np.empty_like(b, dtype=str)

In [70]: x
Out[70]: 
array([['', ''],
       ['', ''],
       ['', '']], dtype='<U1')

In [71]: x.strides
Out[71]: (0, 0)

The strides attribute of x should not be (0, 0).
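To see why zero strides lead to the "everything is 'P'" result, the situation can be reproduced on purpose. This is a sketch under the assumption that the buggy array behaves like any other array with strides (0, 0); it uses np.lib.stride_tricks.as_strided to build such an array deliberately, so it does not depend on the NumPy version (cell and values are illustrative names):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# Build a (3, 2) view with strides (0, 0) on purpose: moving along
# either axis advances zero bytes, so every index refers to the one
# element stored in `cell`.
cell = np.empty(1, dtype='U1')
x = as_strided(cell, shape=(3, 2), strides=(0, 0))

values = np.array([[0, 2], [3, 5], [6, 8]])
x[values < 5] = 'N'    # writes 'N' into the single shared cell
x[values >= 5] = 'P'   # overwrites that same cell with 'P'

print(x)  # every entry reads 'P'
```

Because all six positions alias one memory cell, the last assignment wins everywhere, which matches the output seen for test_arr3.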

Another work-around (in addition to the one you suggested) is to use an explicit NumPy data type instead of str in the call of empty_like:

In [72]: x = np.empty_like(b, dtype='U1')

In [73]: x
Out[73]: 
array([['', ''],
       ['', ''],
       ['', '']], dtype='<U1')

In [74]: x.strides
Out[74]: (8, 4)
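
Applied to the function from the question, this work-around could look as follows. It is a sketch, not a tested drop-in replacement for your application; the only change is the dtype argument.

```python
import numpy as np

def easy_labeling(arr, thresh=5):
    # 'U1' (one unicode character) instead of str, as in the work-around above
    labels = np.empty_like(arr, dtype='U1')
    labels[arr < thresh] = 'N'
    labels[arr >= thresh] = 'P'
    return labels

a = np.arange(9).reshape(3, 3)
b = a[:, ::2]              # non-contiguous view: [[0, 2], [3, 5], [6, 8]]
print(easy_labeling(b))    # rows: N N / N P / P P
```

A different technique that avoids preallocating the label array altogether would be np.where(arr < thresh, 'N', 'P').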