I want to write to a vlen HDF5 dataset, and I am using h5py.Dataset.write_direct to speed up the process. Suppose I have a list of numpy arrays (e.g. as returned by cv2.findContours) and my dataset:
dataset = h5file.create_dataset(
    'dataset',
    shape=...,
    dtype=h5py.special_dtype(vlen=np.dtype('int32')))
contours = [numpy array, ...]
To write contours to a destination given by the slice dest, I first have to convert contours to a numpy array of numpy arrays:
contours = numpy.array(contours) # shape=(len(contours),); dtype=object
dataset.write_direct(contours, None, dest)
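Putting these pieces together, here is a minimal runnable sketch of what I am doing (the file name, dataset shape, and dest slice are placeholders, not my real values):
import h5py
import numpy as np

with h5py.File('contours.h5', 'w') as h5file:
    dataset = h5file.create_dataset(
        'dataset', shape=(100,),
        dtype=h5py.special_dtype(vlen=np.dtype('int32')))

    # Ragged contours, so the conversion below yields an object array.
    contours = [np.arange(i + 1, dtype='int32') for i in range(10)]
    contours = np.array(contours, dtype=object)  # shape=(10,); dtype=object

    dest = np.s_[0:10]
    dataset.write_direct(contours, None, dest)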
But this only works if the numpy arrays in contours have different shapes; when they all share the same shape, numpy collapses them into a single multidimensional array instead of an object array, e.g.:
contours = [np.zeros((10,), 'int32'), np.zeros((10,), 'int32')]
contours = numpy.array(contours) # shape=(2,10); dtype='int32'
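For contrast, when the shapes differ, the conversion does produce an object array (on newer numpy versions dtype=object has to be passed explicitly, otherwise the ragged conversion warns or fails):
contours = [np.zeros((10,), 'int32'), np.zeros((5,), 'int32')]
contours = np.array(contours, dtype=object)  # shape=(2,); dtype=object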
The question is: How can I tell numpy to create an array of objects?
Possible solutions:
Manual creation:
contours_np = np.empty((len(contours),), dtype=object)
for i, contour in enumerate(contours):
    contours_np[i] = contour
But Python loops are super slow, so I use map instead:
list(map(lambda ic: contours_np.__setitem__(*ic), enumerate(contours)))
I have tested a second option, which is twice as fast as the above but also super ugly: appending None keeps numpy from collapsing the list into a multidimensional array (None is not an array), so it falls back to dtype=object, and slicing the None off afterwards leaves only the contours:
contours = np.array(contours + [None])[:-1]
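A third variant that might be worth testing (a sketch; I have not benchmarked it against the two options above) is to preallocate the object array and fill it with a single slice assignment:
contours_np = np.empty((len(contours),), dtype=object)
contours_np[:] = contours  # element-wise assignment of the list into the object array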
Here are the micro benchmarks:
import time
import numpy as np

l = [np.random.normal(size=100) for _ in range(1000)]
Option 1:
$ start = time.time(); l_array = np.zeros(shape=(len(l),), dtype='O'); list(map(lambda ic: l_array.__setitem__(*ic), enumerate(l))); end = time.time(); print("%fms" % ((end - start) * 10**3))
0.950098ms
Option 2:
$ start = time.time(); np.array(l + [None])[:-1]; end = time.time(); print("%fms" % ((end - start) * 10**3))
0.409842ms
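For what it's worth, these numbers come from single runs; a timeit version of the same measurements (a sketch, reusing the same list l as above) should give more stable timings:
import timeit

setup = ("import numpy as np\n"
         "l = [np.random.normal(size=100) for _ in range(1000)]")

options = {
    "option 1": ("l_array = np.zeros(shape=(len(l),), dtype='O')\n"
                 "for i, c in enumerate(l):\n"
                 "    l_array[i] = c"),
    "option 2": "l_array = np.array(l + [None])[:-1]",
}

for name, stmt in options.items():
    # With number=1000, the total time in seconds equals ms per run.
    print("%s: %fms" % (name, timeit.timeit(stmt, setup=setup, number=1000)))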
This still looks kind of ugly; does anyone have other suggestions?