How to get field of nested numpy structured array (advanced indexing)

Question

I have a complex nested structured array (often used as a recarray). It's simplified for this example, but in the real case there are multiple levels.

import numpy as np

c = [('x','f8'),('y','f8')]
A = [('data_string','|S20'),('data_val', c, 2)]
zeros = np.zeros(1, dtype=A)
print(zeros["data_val"]["x"])

I am trying to index the "x" field of the nested array's dtype without naming the preceding fields. I was hoping something like print(zeros[:,"x"]) would let me slice all of the top-level data, but it doesn't work.

Is there a way to do fancy indexing on nested structured arrays by field name?

Each field level has to be indexed separately. You can't combine them into one.
– hpaulj, Commented Jan 12, 2022 at 15:39
@hpaulj so it's not possible to treat it as multi-dimensional and index the top level as "all" or [:] in order to access the lowest level? Meaning I do need to know what the preceding level field names are?
– 001001, Commented Jan 12, 2022 at 16:05
If there were, it would be documented on the numpy.org/doc/stable/user/basics.rec.html page. Indexing fields is more like dict indexing than multidimensional array indexing. You have defined a nested dtype, not a multidimensional dtype.
– hpaulj, Commented Jan 12, 2022 at 16:19
It looks like you want a dataframe data structure like the one provided by Pandas (but not Numpy).
– Jérôme Richard, Commented Jan 12, 2022 at 16:32

hpaulj · Accepted Answer · 2022-01-12 18:53:48Z

I don't know if displaying the resulting array helps you visualize the nesting or not.

In [279]: c = [('x','f8'),('y','f8')]
     ...: A = [('data_string','|S20'),('data_val', c, 2)]
     ...: arr = np.zeros(2, dtype=A)
In [280]: arr
Out[280]: 
array([(b'', [(0., 0.), (0., 0.)]), (b'', [(0., 0.), (0., 0.)])],
      dtype=[('data_string', 'S20'), ('data_val', [('x', '<f8'), ('y', '<f8')], (2,))])

Note how the nesting of () and [] reflects the nesting of the fields.

arr.dtype only has direct access to the top-level field names:

In [281]: arr.dtype.names
Out[281]: ('data_string', 'data_val')
In [282]: arr['data_val']
Out[282]: 
array([[(0., 0.), (0., 0.)],
       [(0., 0.), (0., 0.)]], dtype=[('x', '<f8'), ('y', '<f8')])

But having accessed one field, we can then look at its fields:

In [283]: arr['data_val'].dtype.names
Out[283]: ('x', 'y')
In [284]: arr['data_val']['x']
Out[284]: 
array([[0., 0.],
       [0., 0.]])
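
If the intermediate field names are not known in advance, that level-by-level lookup can be automated. The following is only a sketch: find_field is a hypothetical helper (not part of numpy) that walks dtype.names depth-first and returns the first field with a matching name. It assumes the name is unique enough that "first match" is what you want.

```python
import numpy as np

def find_field(a, name):
    """Hypothetical helper: return the first nested field called `name`.

    Searches depth-first, one field level at a time, because numpy
    requires each level to be indexed separately. Returns None if no
    field with that name exists.
    """
    names = a.dtype.names
    if names is None:          # plain (non-structured) dtype: nothing to search
        return None
    if name in names:          # found at this level
        return a[name]
    for top in names:          # otherwise descend into each field
        found = find_field(a[top], name)
        if found is not None:
            return found
    return None

c = [('x', 'f8'), ('y', 'f8')]
A = [('data_string', 'S20'), ('data_val', c, 2)]
arr = np.zeros(2, dtype=A)

x = find_field(arr, 'x')       # same as arr['data_val']['x']
x[1] = [1, 2]                  # x is a view, so this writes into arr
print(arr['data_val']['x'])
```

Because each step is ordinary field indexing, the result is still a view into arr rather than a copy.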

Record number indexing is separate, and can be multidimensional in the usual sense:

In [285]: arr[1]['data_val']['x'] = [1,2]
In [286]: arr[0]['data_val']['y'] = [3,4]
In [287]: arr
Out[287]: 
array([(b'', [(0., 3.), (0., 4.)]), (b'', [(1., 0.), (2., 0.)])],
      dtype=[('data_string', 'S20'), ('data_val', [('x', '<f8'), ('y', '<f8')], (2,))])

Since the data_val field has a (2,) shape, we can mix/match that index with the (2,) shape of arr:

In [289]: arr['data_val']['x']
Out[289]: 
array([[0., 0.],
       [1., 2.]])
In [290]: arr['data_val']['x'][[0,1],[0,1]]
Out[290]: array([0., 2.])
In [291]: arr['data_val'][[0,1],[0,1]]
Out[291]: array([(0., 3.), (2., 0.)], dtype=[('x', '<f8'), ('y', '<f8')])

I mentioned that field indexing is like dict indexing. Note this display of the fields:

In [294]: arr.dtype.fields
Out[294]: 
mappingproxy({'data_string': (dtype('S20'), 0),
              'data_val': (dtype(([('x', '<f8'), ('y', '<f8')], (2,))), 20)})

Each record is stored as a block of 52 bytes:

In [299]: arr.itemsize
Out[299]: 52
In [300]: arr.dtype.str
Out[300]: '|V52'

20 of those bytes are data_string, and 32 are the two c records (16 bytes each):

In [303]: arr['data_val'].dtype.str
Out[303]: '|V16'
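
The same byte arithmetic can be read off the dtype programmatically. This sketch uses dtype.fields and dtype.subdtype to recover the offset, element size and subarray shape of data_val, and then shows how they turn into the strides of the x view. The printed values assume the default packed (align=False) layout used above.

```python
import numpy as np

c = [('x', 'f8'), ('y', 'f8')]
A = [('data_string', 'S20'), ('data_val', c, 2)]
arr = np.zeros(2, dtype=A)

# fields maps name -> (dtype, byte offset[, title])
sub_dtype, offset = arr.dtype.fields['data_val'][:2]
# a subarray dtype splits into its base dtype and its shape
base, shape = sub_dtype.subdtype

print(offset)          # 20: data_val starts after the 20-byte string
print(base.itemsize)   # 16: one (x, y) pair
print(shape)           # (2,)

# Stepping to the next record jumps 52 bytes; stepping to the next
# (x, y) pair inside a record jumps 16 bytes.
x = arr['data_val']['x']
print(x.strides)       # (52, 16)
```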

You can ask for a list of fields, and get a special kind of view. Its dtype display is a little different:

In [306]: arr[['data_val']]
Out[306]: 
array([([(0., 3.), (0., 4.)],), ([(1., 0.), (2., 0.)],)],
      dtype={'names': ['data_val'], 'formats': [([('x', '<f8'), ('y', '<f8')], (2,))], 'offsets': [20], 'itemsize': 52})

In [311]: arr['data_val'][['y']]
Out[311]: 
array([[(3.,), (4.,)],
       [(0.,), (0.,)]],
      dtype={'names': ['y'], 'formats': ['<f8'], 'offsets': [8], 'itemsize': 16})

Each 'data_val' starts 20 bytes into the 52-byte record. And each 'y' starts 8 bytes into its 16-byte record.
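
If the padding in such a multi-field view gets in the way, numpy.lib.recfunctions.repack_fields can produce a packed version. This is a small sketch of that, reusing the values assigned earlier; note that, for this input, the packed result is a new array rather than a view.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

c = [('x', 'f8'), ('y', 'f8')]
A = [('data_string', 'S20'), ('data_val', c, 2)]
arr = np.zeros(2, dtype=A)
arr['data_val']['y'][0] = [3, 4]

# Multi-field index: keeps the original offsets, so 'y' sits at
# byte 8 of a 16-byte item.
y_view = arr['data_val'][['y']]
print(y_view.dtype.itemsize)   # 16

# repack_fields drops the unused bytes between fields.
packed = rfn.repack_fields(y_view)
print(packed.dtype.itemsize)   # 8
print(packed['y'])
```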

This is really helpful, although it also raises more questions for me outside the scope of the original question.

Mad Physicist · Accepted Answer · 2022-01-12 16:43:11Z

The statement zeros['data_val'] creates a view into the array, which may already be non-contiguous at that point. You can extract multiple values of x because c is an array type, meaning that x has clearly defined strides and shape. The semantics of the statement zeros[:, 'x'] are very unclear. For example, what happens to data_string, which has no x? I would expect an error; you might expect something else.

The only way I can see the index being simplified is if you expand c into A directly, sort of like an anonymous structure in C, except you can't do that easily with an array.
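
One possible reading of that idea, offered as a sketch rather than a recommendation: build a second, flat dtype whose fields sit at the same byte offsets as the nested ones, and view the array through it. The names x0, y0, x1, y1 are made up for this example, and the offsets are written by hand for this specific packed layout (20-byte string followed by two 16-byte pairs), which illustrates why it does not scale easily to arrays.

```python
import numpy as np

c = [('x', 'f8'), ('y', 'f8')]
A = [('data_string', 'S20'), ('data_val', c, 2)]
arr = np.zeros(2, dtype=A)
arr['data_val']['x'] = [[1, 2], [3, 4]]

# Flat dtype over the same 52 bytes; field names are hypothetical.
flat = np.dtype({
    'names':    ['data_string', 'x0', 'y0', 'x1', 'y1'],
    'formats':  ['S20', 'f8', 'f8', 'f8', 'f8'],
    'offsets':  [0, 20, 28, 36, 44],
    'itemsize': arr.dtype.itemsize,
})

flat_view = arr.view(flat)     # same memory, different field names
print(flat_view['x0'])         # first x of every record
print(flat_view['x1'])         # second x of every record

flat_view['y1'][0] = 9         # writes through to the nested array
print(arr['data_val']['y'])
```

Each element of the subarray needs its own field name here, so this only stays manageable for small, fixed shapes.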

