
In Python 3, I have the following NumPy array of strings.

Each string in the NumPy array is in the form b'MD18EE' instead of 'MD18EE'.

For example:

import numpy as np
print(array1)
(b'first_element', b'element',...)

Normally, one would use .decode('UTF-8') to decode these elements.

However, if I try:

array1 = array1.decode('UTF-8')

I get the following error:

AttributeError: 'numpy.ndarray' object has no attribute 'decode'

How do I decode these elements from a NumPy array? (That is, I don't want b'')

EDIT:

Let's say I was dealing with a Pandas DataFrame with only certain columns that were encoded in this manner. For example:

import pandas as pd
df = pd.DataFrame(...)

df
        COL1          ....
0   b'entry1'         ...
1   b'entry2'
2   b'entry3'
3   b'entry4'
4   b'entry5'
5   b'entry6'

2 Answers

22

You have an array of bytestrings; dtype is S:

In [338]: arr=np.array((b'first_element', b'element'))
In [339]: arr
Out[339]: 
array([b'first_element', b'element'], 
      dtype='|S13')

astype easily converts them to unicode, the default string type for Py3.

In [340]: arr.astype('U13')
Out[340]: 
array(['first_element', 'element'], 
      dtype='<U13')

There is also a library of string functions, np.char, which applies the corresponding str or bytes method to each element of a string array:

In [341]: np.char.decode(arr)
Out[341]: 
array(['first_element', 'element'], 
      dtype='<U13')

astype is faster, but np.char.decode lets you specify an encoding.

See also How to decode a numpy array of dtype=numpy.string_?
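
To illustrate the encoding point, here is a minimal sketch. The sample byte strings are made up for illustration, and the try/except assumes that astype('U') treats the bytes as ASCII, which matches the error reported in the comments on this answer:

```python
import numpy as np

# Hypothetical sample data: 'café' encoded as UTF-8 and as Latin-1.
utf8_arr = np.array([b'caf\xc3\xa9', b'element'])
latin1_arr = np.array([b'caf\xe9', b'element'])

# np.char.decode accepts the encoding as its second argument.
from_utf8 = np.char.decode(utf8_arr, 'utf-8')
from_latin1 = np.char.decode(latin1_arr, 'latin-1')

# astype('U') has no encoding parameter, so non-ASCII bytes
# are expected to raise rather than decode.
try:
    utf8_arr.astype('U')
    astype_failed = False
except UnicodeDecodeError:
    astype_failed = True
```

So if the data may contain anything beyond plain ASCII, np.char.decode with an explicit encoding looks like the safer choice.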


3 Comments

The astype method seems too specific with the byte length information. For instance what if my input dtype is '|S1' rather than '|S13'?
@John, it looks like we don't have to specify the length: np.array('one', 'S7').astype('U')
I tried astype('U') on some bytearray and got UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128). However np.char.decode(arr) worked alright.
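
A quick sketch of the point about omitting the length: with a bare 'U', NumPy picks the string width from the input dtype. The arrays here are sample data chosen for illustration:

```python
import numpy as np

arr = np.array([b'first_element', b'element'])  # dtype S13
short = np.array([b'a', b'b'])                  # dtype S1

# No length given: the unicode width follows the byte-string width.
arr_u = arr.astype('U')
short_u = short.astype('U')

print(arr_u.dtype, short_u.dtype)
```

Note that this only covers ASCII content; for other encodings np.char.decode is the one that worked in the comment above.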
6

If you want the result to be a (Python) list of strings, you can use a list comprehension:

>>> l = [el.decode('UTF-8') for el in array1]
>>> print(l)
['element', 'element 2']
>>> print(type(l))
<class 'list'>

Alternatively, if you want to keep it as a NumPy array, you can use np.vectorize to make a vectorized decoder function:

>>> decoder = np.vectorize(lambda x: x.decode('UTF-8'))
>>> array2 = decoder(array1)
>>> print(array2)
['element' 'element 2']
>>> print(type(array2))
<class 'numpy.ndarray'>
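
For anyone who wants to run both approaches end to end, here is a self-contained sketch. The contents of array1 are an assumption chosen to match the output shown above. Keep in mind that np.vectorize is documented as a convenience wrapper around a Python loop, so it is not expected to be faster than the list comprehension:

```python
import numpy as np

# Assumed sample input, matching the printed output above.
array1 = np.array([b'element', b'element 2'])

# Approach 1: plain Python list of str.
as_list = [el.decode('UTF-8') for el in array1]

# Approach 2: NumPy array of str via np.vectorize.
decoder = np.vectorize(lambda x: x.decode('UTF-8'))
as_array = decoder(array1)

# For comparison: NumPy's own element-wise decode gives the same values.
via_char = np.char.decode(array1, 'UTF-8')
```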

7 Comments

Thanks! I'm taking the numpy array and putting it into a pandas dataframe. Maybe there are quicker shortcuts? Convert by column?
Do you mean quicker as in 'runs faster' or quicker as in 'less code'? Because both methods are oneliners, the print statements are just to show that they work :)
:) I was thinking run faster. However, I think this method works fine---this appears to be a Python2/Python3 side effect, so I suspect others have run into this issue.
In any sense, using decoder gives me this error: AttributeError: 'numpy.void' object has no attribute 'decode'
Hmm, in that case, it looks like your array is not an array of strings at all, but rather an array of strings and voids - but I'm sure you'll be able to modify the decoder to handle those as well. At any rate, I think the best (and probably fastest) way to approach this would be to make sure you use strings everywhere, rather than bytes. How you would do that depends on where your data is coming from and how you read it.
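
On the "convert by column" question for pandas: a minimal sketch, assuming the affected columns hold Python bytes objects in an object-dtype column. The frame and the column name COL2 are made up for illustration; Series.str.decode is the pandas method doing the work:

```python
import pandas as pd

# Hypothetical frame: COL1 holds bytes, COL2 holds ordinary integers.
df = pd.DataFrame({
    'COL1': [b'entry1', b'entry2', b'entry3'],
    'COL2': [1, 2, 3],
})

for col in df.columns:
    # Only touch columns whose values are all bytes objects.
    if df[col].map(lambda v: isinstance(v, bytes)).all():
        df[col] = df[col].str.decode('utf-8')
```

Columns that mix bytes with other types are left alone by this check, so those would need separate handling.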