Compact string data type in numpy

Question

How can I compact the dtype (and thereby reduce memory usage) of a numpy string array when the longest string it holds, say 5 characters, is shorter than the width declared in its dtype, say U10?

One way to "compact" the dtype is to cast it explicitly to U5, but can this be done automatically by a function call, without manually inspecting the string lengths?

For example:

>>> import numpy as np
>>> a = np.array([['a', 'bb', 'cc'], ['aaabc', 'ccc', 'b']], dtype='U10')

# Non-existing pseudo function
>>> a = compact(a)
>>> a.dtype
dtype('<U5')

So the function compact would shrink the oversized U10 to U5.

Thanks in advance!

John Zwinck · Accepted Answer · 2021-04-21 08:28:38Z


This does it for both unicode/str ('U') and bytes ('S') arrays:

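import numpy as np

def compact(str_arr):
    # Keep the kind ('U' or 'S'); use the longest element's length as the new width.
    dtype = (str_arr.dtype.kind, np.char.str_len(str_arr).max())
    return np.asarray(str_arr, dtype)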
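
As a quick check, here is a minimal sketch that applies the function to the array from the question and to a small bytes array. The variable names (a, b, s, t, width) and the int() conversion of the maximum length are illustrative additions, not part of the original answer.

```python
import numpy as np

def compact(str_arr):
    # Same approach as the answer: keep the kind, shrink the width.
    # int() converts the numpy integer to a plain Python int (an added precaution).
    width = int(np.char.str_len(str_arr).max())
    return np.asarray(str_arr, dtype=(str_arr.dtype.kind, width))

# Unicode array from the question, declared wider than needed.
a = np.array([['a', 'bb', 'cc'], ['aaabc', 'ccc', 'b']], dtype='U10')
b = compact(a)
print(a.dtype.itemsize, a.nbytes)  # 40 bytes per element, 240 in total
print(b.dtype.itemsize, b.nbytes)  # 20 bytes per element, 120 in total

# Bytes array: the kind 'S' is preserved.
s = np.array([b'ab', b'abcd'], dtype='S8')
t = compact(s)
print(t.dtype == np.dtype('S4'))   # True
```

Each 'U' character takes 4 bytes, so halving the width from 10 to 5 halves the memory used here, while the contents compare equal element by element.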
