
I have a text file made as:

0.01 1 0.1 1 10 100 a
0.02 3 0.2 2 20 200 b
0.03 2 0.3 3 30 300 c
0.04 1 0.4 4 40 400 d

I read it as a list A and then converted to a numpy array, that is:

>>> A
array([['0.01', '1', '0.1', '1', '10', '100', 'a'],
       ['0.02', '3', '0.2', '2', '20', '200', 'b'],
       ['0.03', '2', '0.3', '3', '30', '300', 'c'],
       ['0.04', '1', '0.4', '4', '40', '400', 'd']], 
      dtype='|S4')

I just want to extract a sub-array B, made of the rows of A whose entry in column index 4 (the fifth column) is lower than 30, which should look something like:

B = array([['0.01', '1', '0.1', '1', '10', '100', 'a'],
           ['0.02', '3', '0.2', '2', '20', '200', 'b']])

When dealing with arrays, I usually do simply B = A[A[:,4]<30], but in this case (maybe due to the presence of characters/strings I've never worked with) it doesn't work, giving me this:

>>> A[A[:,4]<30]
array(['0.01', '1', '0.1', '1', '10', '100', 'a'], 
      dtype='|S4')

and I can't figure out the reason. This is not my own code and I don't think I can switch all of it to structures or dictionaries. Any suggestion for doing this with numpy arrays? Thank you very much in advance!

2 Answers


You have to compare int to int

A[A[:,4].astype(int)<30]

or str to str

A[A[:,4]<'30'] 

However, notice that the latter works in your specific example but won't work in general, because it compares strings lexicographically (for example, '110' < '30' returns True, but 110 < 30 returns False).
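As a minimal sketch of both comparisons, assuming the array from the question is rebuilt by hand (the names mask, values, string_mask and numeric_mask are only illustrative):

```python
import numpy as np

# The array from the question, rebuilt by hand; every entry is a string.
A = np.array([
    ['0.01', '1', '0.1', '1', '10', '100', 'a'],
    ['0.02', '3', '0.2', '2', '20', '200', 'b'],
    ['0.03', '2', '0.3', '3', '30', '300', 'c'],
    ['0.04', '1', '0.4', '4', '40', '400', 'd'],
])

# Convert only the tested column to int, then use the boolean mask on A.
mask = A[:, 4].astype(int) < 30
B = A[mask]
print(B)

# Why str-to-str comparison is fragile: strings are ordered character by character.
values = np.array(['10', '110', '30'])
string_mask = values < '30'               # '110' sorts before '30'
numeric_mask = values.astype(int) < 30    # 110 is not below 30
print(string_mask, numeric_mask)
```

The rows of B stay strings; only the temporary column used for the test is converted.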


numpy will infer your elements' types from your data. In this case, it assigned dtype='|S4' to your elements, meaning they are strings of length 4. This is probably a consequence of the underlying C code (which is what makes numpy fast), which requires all elements of an array to share one fixed type.

To illustrate this difference, check the following code:

>>> np.array([['0.01', '1', '0.1', '1', '10', '100', 'a']])
array([['0.01', '1', '0.1', '1', '10', '100', 'a']],
      dtype='|S4')

The inferred type is strings of length 4, which is the maximum length among your elements (that of '0.01'). Now, if you explicitly define the array to hold general Python objects, it will do what you want:

>>> np.array([[0.01, 1, 0.1, 1, 10, 100, 'a']], dtype=object)
array([[0.01, 1, 0.1, 1, 10, 100, 'a']], dtype=object)

and your code A[A[:,4]<30] would work properly.
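Here is a small sketch of that difference, assuming the file was parsed into a list of lists with real ints, floats and strings (the names rows, as_strings and as_objects are made up for illustration):

```python
import numpy as np

# Hypothetical result of parsing the file: each row mixes floats, ints and a str.
rows = [
    [0.01, 1, 0.1, 1, 10, 100, 'a'],
    [0.02, 3, 0.2, 2, 20, 200, 'b'],
    [0.03, 2, 0.3, 3, 30, 300, 'c'],
    [0.04, 1, 0.4, 4, 40, 400, 'd'],
]

# Without an explicit dtype, NumPy picks one common type for all elements,
# and with a str present that common type is a string type.
as_strings = np.array(rows)
print(as_strings.dtype.kind)   # a string kind: 'U' on Python 3

# With dtype=object each element keeps its original Python type,
# so the numeric comparison on column 4 behaves as expected.
as_objects = np.array(rows, dtype=object)
B = as_objects[as_objects[:, 4] < 30]
print(B)
```

Keep in mind that object arrays give up most of numpy's speed, since every element is an ordinary Python object.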

For more information, this is a very complete guide


3 Comments

But when I deal with the file, I read them as integer and float, why do they become strings when I pass to a numpy array?
It converts to str because your array has elements of different types. NumPy tries to infer a single type that can hold all of your elements.
Omg, I didn't notice that my array was made of strings! When I read the file I create a list of lists and I read each entry as integer, float, or string. I don't get why numpy changes them all to strings...
In [86]: txt='''0.01 1 0.1 1 10 100 a
    ...: 0.02 3 0.2 2 20 200 b
    ...: 0.03 2 0.3 3 30 300 c
    ...: 0.04 1 0.4 4 40 400 d'''
In [87]: A = np.genfromtxt(txt.splitlines(), dtype=str)
In [88]: A
Out[88]: 
array([['0.01', '1', '0.1', '1', '10', '100', 'a'],
       ['0.02', '3', '0.2', '2', '20', '200', 'b'],
       ['0.03', '2', '0.3', '3', '30', '300', 'c'],
       ['0.04', '1', '0.4', '4', '40', '400', 'd']], dtype='<U4')
In [89]: A[:,4]
Out[89]: array(['10', '20', '30', '40'], dtype='<U4')

By default, genfromtxt tries to parse every field as a float, in which case the character column would come out as nan. Here I specified dtype=str instead.

So a numeric test would require converting the column to numbers:

In [90]: A[:,4].astype(int)
Out[90]: array([10, 20, 30, 40])
In [91]: A[:,4].astype(int)<30
Out[91]: array([ True,  True, False, False])

In this case a string comparison also works, but only because all the values in that column have the same number of digits:

In [99]: A[:,4]<'30'
Out[99]: array([ True,  True, False, False])

Or if we use dtype=None, it infers dtype by column and makes a structured array:

In [93]: A1 = np.genfromtxt(txt.splitlines(), dtype=None,encoding=None)
In [94]: A1
Out[94]: 
array([(0.01, 1, 0.1, 1, 10, 100, 'a'), (0.02, 3, 0.2, 2, 20, 200, 'b'),
       (0.03, 2, 0.3, 3, 30, 300, 'c'), (0.04, 1, 0.4, 4, 40, 400, 'd')],
      dtype=[('f0', '<f8'), ('f1', '<i8'), ('f2', '<f8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8'), ('f6', '<U1')])

Now we can select a field by name, and test it:

In [95]: A1['f4']
Out[95]: array([10, 20, 30, 40])

Either way we can select rows based on the True/False mask or the corresponding row indices:

In [96]: A[[0,1],:]
Out[96]: 
array([['0.01', '1', '0.1', '1', '10', '100', 'a'],
       ['0.02', '3', '0.2', '2', '20', '200', 'b']], dtype='<U4')

In [98]: A1[[0,1]]     # A1 is 1d
Out[98]: 
array([(0.01, 1, 0.1, 1, 10, 100, 'a'), (0.02, 3, 0.2, 2, 20, 200, 'b')],
      dtype=[('f0', '<f8'), ('f1', '<i8'), ('f2', '<f8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8'), ('f6', '<U1')])
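Putting the structured-array pieces together, a minimal sketch of the whole selection could look like this (the field name f4 is the default that genfromtxt assigns; B1 and idx are illustrative names):

```python
import numpy as np

txt = """0.01 1 0.1 1 10 100 a
0.02 3 0.2 2 20 200 b
0.03 2 0.3 3 30 300 c
0.04 1 0.4 4 40 400 d"""

# dtype=None lets genfromtxt infer a dtype per column (structured array).
A1 = np.genfromtxt(txt.splitlines(), dtype=None, encoding=None)

# The field is already numeric, so it can be tested directly.
mask = A1['f4'] < 30
B1 = A1[mask]

# The same selection expressed as row indices instead of a boolean mask.
idx = np.nonzero(mask)[0]
print(B1)
print(idx)
```

The same mask also indexes the string array A from earlier, since both have one entry per line of the file.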
