I have a complicated set of data that I have to do distance calculations on. Each record in the data set contains many different data types, so a record array or structured array appears to be the way to go. The problem is that the scipy.spatial.distance functions take ordinary arrays, whereas the elements of a record array are numpy.void objects. How do I make a record array of numpy arrays instead of numpy voids? Below is a very simple example of what I'm talking about.

import numpy
import scipy.spatial.distance as scidist


input_data = [
    ('340.9', '7548.2', '1192.4', 'set001.txt'),
    ('546.7', '9039.9', '5546.1', 'set002.txt'),
    ('456.3', '2234.8', '2198.8', 'set003.txt'),
    ('332.1', '1144.2', '2344.5', 'set004.txt'),
]

record_array = numpy.array(input_data,
                           dtype=[('d1', 'float64'), ('d2', 'float64'), ('d3', 'float64'), ('file', '|S20')])

The following code fails...

this_fails_and_makes_me_cry = record_array[['d1', 'd2', 'd3']]
scidist.pdist(this_fails_and_makes_me_cry)

I get this error....

Traceback (most recent call last):
  File "/home/someguy/working_datasets/trial003/scrap.py", line 16, in <module>
    scidist.pdist(record_array[['d1', 'd2', 'd3']])
  File "/usr/lib/python2.7/dist-packages/scipy/spatial/distance.py", line 1093, in pdist
    raise ValueError('A 2-dimensional array must be passed.');
ValueError: A 2-dimensional array must be passed.

The error occurs because this_fails_and_makes_me_cry is an array of numpy.voids. To get it to work I have to convert each time like this...

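# A list comprehension works on both Python 2 and Python 3 (map() returns an iterator on Python 3)
this_works = numpy.array([list(row) for row in record_array[['d1', 'd2', 'd3']]])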
scidist.pdist(this_works)

Is it possible to create a record array of numpy arrays to begin with? Or is a numpy record/structured array restricted to numpy voids? It would be handy for the record array to contain the data in a format compatible with scipy's spatial distance functions so that I don't have to convert each time. Is this possible?

  • My understanding is that NumPy structured arrays can only contain fields of discrete types (plus fixed-length strings), so no, you cannot store an array. You could turn that conversion into a function to make it easier... and use some standard way to convert the data to a 2D array (like array.view), see here Commented Aug 13, 2014 at 13:03
  • Bummer. I was hoping that wasn't the case because I have to do this a TON of times due to the large number of distance calculations and the large data set that I have. Thanks for the link. Commented Aug 13, 2014 at 13:45

1 Answer

this_fails_and_makes_me_cry = record_array[['d1', 'd2', 'd3']]

creates a one-dimensional structured array, with fields d1, d2 and d3. pdist expects a two-dimensional array. Here's one way to create that two-dimensional array containing only the d fields of record_array.

(Note: The following won't work if the fields that you want to use for the distance calculation are not contiguous within the data type of the structured array record_array. See below for an alternative in that case.)

First, we make a new dtype, in which d1, d2 and d3 become a single field called d containing three floating point values. (The session below assumes import numpy as np and from scipy.spatial.distance import pdist.)

In [61]: dt2 = np.dtype([('d', 'f8', 3), ('file', 'S20')])

Next, use the view method to create a view of record_array using this dtype:

In [62]: rav = record_array.view(dt2)

In [63]: rav
Out[63]: 
array([([340.9, 7548.2, 1192.4], 'set001.txt'),
       ([546.7, 9039.9, 5546.1], 'set002.txt'),
       ([456.3, 2234.8, 2198.8], 'set003.txt'),
       ([332.1, 1144.2, 2344.5], 'set004.txt')], 
      dtype=[('d', '<f8', (3,)), ('file', 'S20')])

rav is not a copy--it is a view of the same block of memory used by record_array.
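
As a quick check of the claim that the view shares memory, here is a minimal sketch that rebuilds the question's array (with float literals and byte strings, which is my own simplification of the input) and writes through the view. The sentinel value -1.0 is purely illustrative.

```python
import numpy as np

# Same layout as the question's record_array: three floats plus a 20-byte string.
record_array = np.array(
    [
        (340.9, 7548.2, 1192.4, b'set001.txt'),
        (546.7, 9039.9, 5546.1, b'set002.txt'),
        (456.3, 2234.8, 2198.8, b'set003.txt'),
        (332.1, 1144.2, 2344.5, b'set004.txt'),
    ],
    dtype=[('d1', 'f8'), ('d2', 'f8'), ('d3', 'f8'), ('file', 'S20')],
)

# Reinterpret the three adjacent float fields as one field holding 3 floats.
dt2 = np.dtype([('d', 'f8', 3), ('file', 'S20')])
rav = record_array.view(dt2)

# Writing through the view changes the original array, because no copy was made.
rav['d'][0, 0] = -1.0
print(record_array['d1'][0])
```

The view only works here because both dtypes have the same item size (3 * 8 + 20 bytes) and the float fields sit next to each other.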

Now access field d to get the two-dimensional array:

In [64]: d = rav['d']

In [65]: d
Out[65]: 
array([[  340.9,  7548.2,  1192.4],
       [  546.7,  9039.9,  5546.1],
       [  456.3,  2234.8,  2198.8],
       [  332.1,  1144.2,  2344.5]])

d can be passed to pdist:

In [66]: pdist(d)
Out[66]: 
array([ 4606.75875427,  5409.10137454,  6506.81395539,  7584.32432455,
        8522.8149229 ,  1107.27706108])

Note that instead of converting record_array to rav, you could use dt2 as the data type of record_array from the start, and just write d = record_array['d'].
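
A minimal sketch of that last suggestion, assuming the input is available as nested tuples of floats (the question's input uses strings and a flat layout, so this reshaping of the input is an assumption on my part):

```python
import numpy as np
from scipy.spatial.distance import pdist

# One sub-array field 'd' of three floats, plus the file name.
dt2 = np.dtype([('d', 'f8', 3), ('file', 'S20')])

record_array = np.array(
    [
        ((340.9, 7548.2, 1192.4), b'set001.txt'),
        ((546.7, 9039.9, 5546.1), b'set002.txt'),
        ((456.3, 2234.8, 2198.8), b'set003.txt'),
        ((332.1, 1144.2, 2344.5), b'set004.txt'),
    ],
    dtype=dt2,
)

# Field access on a sub-array field yields an ordinary 2-D float array.
d = record_array['d']
distances = pdist(d)
print(d.shape, distances.shape)
```

With this layout no conversion step is needed before each distance calculation; record_array['d'] can be handed to pdist directly.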


If the fields in record_array that are used for the distance calculation are not contiguous in the structure, you'll first have to pull them out into a new array so they are contiguous:

In [83]: arr = record_array[['d1','d2','d3']]

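Then take a view of arr and reshape to make it two-dimensional. (This step relies on multi-field indexing returning a packed copy, as it did in NumPy versions before 1.16. In later versions the result is a view that keeps the original item size, and viewing it as float64 raises an error.)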

In [84]: d = arr.view(np.float64).reshape(-1,3)

In [85]: d
Out[85]: 
array([[  340.9,  7548.2,  1192.4],
       [  546.7,  9039.9,  5546.1],
       [  456.3,  2234.8,  2198.8],
       [  332.1,  1144.2,  2344.5]])

You can combine those into a single line, if that's more convenient:

In [86]: d = record_array[['d1', 'd2', 'd3']].view(np.float64).reshape(-1, 3)
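
On NumPy 1.16 or later, one alternative to the view-and-reshape step is numpy.lib.recfunctions.structured_to_unstructured. This is not part of the original answer; the sketch below is my own suggestion and uses float literals and byte strings for the input.

```python
import numpy as np
from numpy.lib import recfunctions as rfn
from scipy.spatial.distance import pdist

record_array = np.array(
    [
        (340.9, 7548.2, 1192.4, b'set001.txt'),
        (546.7, 9039.9, 5546.1, b'set002.txt'),
        (456.3, 2234.8, 2198.8, b'set003.txt'),
        (332.1, 1144.2, 2344.5, b'set004.txt'),
    ],
    dtype=[('d1', 'f8'), ('d2', 'f8'), ('d3', 'f8'), ('file', 'S20')],
)

# Select the numeric fields, then convert them to a plain (n, 3) float array.
d = rfn.structured_to_unstructured(record_array[['d1', 'd2', 'd3']])
distances = pdist(d)
print(d.shape, d.dtype)
```

This also works when the selected fields are not adjacent in the dtype, since it does not depend on the memory layout of the structure.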

4 Comments

That is very clever. I wasn't aware you could do this with numpy. So the view method is just a different way of formatting an existing numpy object without creating a new one?
Thanks. Also, is it possible to view slices of an array? What if I wanted to create a view of the first two elements of the record array and another separate view of the last two elements without creating two new numpy objects? Is that possible?
Never mind, I just realized numpy slices don't copy the object like Python list slices do.
