slicing a numpy array with characters

Question

I have a text file made as:

0.01 1 0.1 1 10 100 a
0.02 3 0.2 2 20 200 b
0.03 2 0.3 3 30 300 c
0.04 1 0.4 4 40 400 d

I read it as a list A and then converted to a numpy array, that is:

>>> A
array([['0.01', '1', '0.1', '1', '10', '100', 'a'],
       ['0.02', '3', '0.2', '2', '20', '200', 'b'],
       ['0.03', '2', '0.3', '3', '30', '300', 'c'],
       ['0.04', '1', '0.4', '4', '40', '400', 'd']], 
      dtype='|S4')

I just want to extract a sub-array B, made of the rows of A whose entry at index 4 is lower than 30. It should look something like:

B = array([['0.01', '1', '0.1', '1', '10', '100', 'a'],
           ['0.02', '3', '0.2', '2', '20', '200', 'b']])

When dealing with arrays, I usually do simply B = A[A[:,4]<30], but in this case (maybe due to the presence of characters/strings I've never worked with) it doesn't work, giving me this:

>>> A[A[:,4]<30]
array(['0.01', '1', '0.1', '1', '10', '100', 'a'], 
      dtype='|S4')

and I can't figure out the reason. This is not my own code, and I don't think I can switch all of it to structures or dictionaries. Any suggestion for doing this with numpy arrays? Thank you very much in advance!

rafaelc · Accepted Answer · 2018-04-29 19:27:17Z

3

You have to compare int to int

A[A[:,4].astype(int)<30]

or str to str

A[A[:,4]<'30']

However, note that the latter works in your specific example but won't work in general, because it compares strings by their character ordering (for example, '110' < '30' returns True, but 110 < 30 returns False).
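To make the ordering pitfall concrete, here is a minimal sketch. The column values are made up for illustration (they are not the question's data) and include '110' so that the two comparisons disagree.

```python
import numpy as np

# Hypothetical column of numeric strings; '110' is included to expose the pitfall.
col = np.array(['10', '20', '110', '40'])

# String comparison is lexicographic: '110' < '30' because '1' sorts before '3'.
str_mask = col < '30'

# Numeric comparison after converting the strings to integers.
int_mask = col.astype(int) < 30

print(str_mask)
print(int_mask)
```

The two masks differ only at the '110' entry, which is exactly the case where the str-to-str comparison selects a row it should not.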

numpy will infer your elements' types from your data. In this case, it assigned the dtype '|S4' to your elements, meaning they are strings of length up to 4. This is probably a consequence of the underlying C code (which is what makes numpy fast) requiring all elements of an array to share one fixed type.

To illustrate this difference, check the following code:

>>> np.array([['0.01', '1', '0.1', '1', '10', '100', 'a']])
array([['0.01', '1', '0.1', '1', '10', '100', 'a']], dtype='|S4')

The inferred type is a string of length 4, which is the maximum length of your elements (as in '0.01'). Now, if you explicitly define the array to hold general Python objects, it will do what you want:

>>> np.array([[0.01, 1, 0.1, 1, 10, 100, 'a']], dtype=object)
array([[0.01, 1, 0.1, 1, 10, 100, 'a']], dtype=object)

and your code A[A[:,4]<30] would work properly.
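As a rough sketch of that idea applied to the whole table, assuming the file was already parsed into a list of lists with mixed Python types (the names rows, A_str, A_obj and B are only illustrative):

```python
import numpy as np

# Rows as they might look after parsing the file into Python numbers and strings.
rows = [[0.01, 1, 0.1, 1, 10, 100, 'a'],
        [0.02, 3, 0.2, 2, 20, 200, 'b'],
        [0.03, 2, 0.3, 3, 30, 300, 'c'],
        [0.04, 1, 0.4, 4, 40, 400, 'd']]

# Without an explicit dtype, the mixed values are all converted to strings.
A_str = np.array(rows)
print(A_str.dtype.kind)   # 'U' (unicode) on Python 3, 'S' (bytes) on Python 2

# With dtype=object, each element keeps its original Python type.
A_obj = np.array(rows, dtype=object)

# Column 4 now holds real ints, so the numeric comparison yields a per-row mask.
B = A_obj[A_obj[:, 4] < 30]
print(B)
```

Keep in mind that object arrays give up most of numpy's speed advantages, so converting only the column you test with astype(int) may be preferable for large data.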

For more information, this is a very complete guide

edited Apr 29, 2018 at 19:27

answered Apr 29, 2018 at 18:59

rafaelc


3 Comments

urgeo Over a year ago

But when I deal with the file, I read them as integer and float, why do they become strings when I pass to a numpy array?

rafaelc Over a year ago

It converts to str because your array has elements of different types. NumPy tries to infer a single type that can hold all of your elements, and here that type is str.

urgeo Over a year ago

Omg, I didn't notice that my array was made of strings! When I read the file I create a list of lists and I read each entry as integer, float, or string. I don't get why numpy changes them all to strings...

hpaulj · Accepted Answer · 2018-04-29 21:00:06Z

In [86]: txt='''0.01 1 0.1 1 10 100 a
    ...: 0.02 3 0.2 2 20 200 b
    ...: 0.03 2 0.3 3 30 300 c
    ...: 0.04 1 0.4 4 40 400 d'''
In [87]: A = np.genfromtxt(txt.splitlines(), dtype=str)
In [88]: A
Out[88]: 
array([['0.01', '1', '0.1', '1', '10', '100', 'a'],
       ['0.02', '3', '0.2', '2', '20', '200', 'b'],
       ['0.03', '2', '0.3', '3', '30', '300', 'c'],
       ['0.04', '1', '0.4', '4', '40', '400', 'd']], dtype='<U4')
In [89]: A[:,4]
Out[89]: array(['10', '20', '30', '40'], dtype='<U4')

By default, genfromtxt tries to make floats, but in that case the character column would be nan. Instead, I specified a str dtype.
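For comparison, here is a small sketch of what the default float parsing gives on the same text. The names A_float, letters_lost and B_float are only illustrative.

```python
import numpy as np

txt = """0.01 1 0.1 1 10 100 a
0.02 3 0.2 2 20 200 b
0.03 2 0.3 3 30 300 c
0.04 1 0.4 4 40 400 d"""

# Default dtype is float: fields that cannot be parsed as numbers become nan.
A_float = np.genfromtxt(txt.splitlines())

# The last column held letters, so every entry in it is nan.
letters_lost = np.isnan(A_float[:, 6]).all()
print(letters_lost)

# The numeric test works directly, but the letters are gone from the result.
B_float = A_float[A_float[:, 4] < 30]
print(B_float)
```

This is fine if the character column is not needed; otherwise the str array used here, or a structured array, keeps it.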

So a numeric test would require converting the column to numbers:

In [90]: A[:,4].astype(int)
Out[90]: array([10, 20, 30, 40])
In [91]: A[:,4].astype(int)<30
Out[91]: array([ True,  True, False, False])

In this case a string comparison also works:

In [99]: A[:,4]<'30'
Out[99]: array([ True,  True, False, False])

Or if we use dtype=None, it infers dtype by column and makes a structured array:

In [93]: A1 = np.genfromtxt(txt.splitlines(), dtype=None,encoding=None)
In [94]: A1
Out[94]: 
array([(0.01, 1, 0.1, 1, 10, 100, 'a'), (0.02, 3, 0.2, 2, 20, 200, 'b'),
       (0.03, 2, 0.3, 3, 30, 300, 'c'), (0.04, 1, 0.4, 4, 40, 400, 'd')],
      dtype=[('f0', '<f8'), ('f1', '<i8'), ('f2', '<f8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8'), ('f6', '<U1')])

Now we can select a field by name, and test it:

In [95]: A1['f4']
Out[95]: array([10, 20, 30, 40])

Either way we can select rows based on the True/False mask or the corresponding row indices:

In [96]: A[[0,1],:]
Out[96]: 
array([['0.01', '1', '0.1', '1', '10', '100', 'a'],
       ['0.02', '3', '0.2', '2', '20', '200', 'b']], dtype='<U4')

In [98]: A1[[0,1]]     # A1 is 1d
Out[98]: 
array([(0.01, 1, 0.1, 1, 10, 100, 'a'), (0.02, 3, 0.2, 2, 20, 200, 'b')],
      dtype=[('f0', '<f8'), ('f1', '<i8'), ('f2', '<f8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8'), ('f6', '<U1')])
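Putting the pieces together, a sketch of the mask-based selection for both arrays could look like this. It rebuilds A and A1 the same way as above; the names mask, idx, B and B1 are only illustrative.

```python
import numpy as np

txt = """0.01 1 0.1 1 10 100 a
0.02 3 0.2 2 20 200 b
0.03 2 0.3 3 30 300 c
0.04 1 0.4 4 40 400 d"""

A = np.genfromtxt(txt.splitlines(), dtype=str)                    # 2d array of strings
A1 = np.genfromtxt(txt.splitlines(), dtype=None, encoding=None)   # 1d structured array

# String array: convert the tested column to int, then index rows with the mask.
mask = A[:, 4].astype(int) < 30
B = A[mask]

# The same selection expressed as row indices instead of a boolean mask.
idx = np.flatnonzero(mask)

# Structured array: the field is already numeric, so it can be tested directly.
B1 = A1[A1['f4'] < 30]

print(B)
print(idx)
print(B1)
```

A[idx] and A[mask] pick the same rows; the mask form is usually the shorter one to write.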
