Saving numpy arrays as a part of larger objects?

Question

I'm working on some imaging-related ML tasks, and as a result of the preprocessing required, I'm creating objects of a class that contains important metadata attributes, along with a 3D numpy array of image data. I'd like to reduce the size of these objects and increase the speed at which they're written and read.

As it stands, each object is saved to a file using pickle, but this does not seem like the most efficient method. The dill library is supposed to be better at saving numpy items, but I need to process many files and its overall performance is slower, so it doesn't seem helpful.

I've also heard of the numpy.save function, but I'm not sure how to use it as part of my pickling process. I currently pickle objects using pickle.dump and pickle.load.

pickle will use np.save underneath the hood, I'm pretty sure. pickle saving your class isn't your bottleneck
– juanpa.arrivillaga, Commented Jan 18, 2023 at 21:05

hpaulj · Accepted Answer · 2023-01-18 21:04:25Z

2

pickle depends on a "pickle" method for each object, whether it's a list, dict, or something else. The pickle format for numpy arrays is essentially the same as the one np.save writes, so speed and file size should be similar. Conversely, np.save uses pickle to format non-array arguments, or arrays that contain objects (note the allow_pickle parameter in save/load).

In [56]: import numpy as np
In [57]: import pickle
In [58]: x = np.ones((100,100,100))
In [59]: np.save('test.npy',x)
In [60]: !dir test.npy
 Volume in drive C is Windows
 Volume Serial Number is 4EEB-1BF0

 Directory of C:\Users\paul

01/18/2023  12:57 PM         8,000,128 test.npy
               1 File(s)      8,000,128 bytes
               0 Dir(s)  18,489,139,200 bytes free

In [61]: astr=pickle.dumps(x)
In [62]: len(astr)
Out[62]: 8000163
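The same comparison can be repeated for an object that merely holds an array as an attribute. This is only a sketch: the Scan class and its attribute names are hypothetical stand-ins for the asker's class, and the exact overhead will vary with the pickle protocol and the metadata stored.

```python
import pickle

import numpy as np


class Scan:
    """Hypothetical container: a few metadata attributes plus a 3D image array."""

    def __init__(self, patient_id, spacing, image):
        self.patient_id = patient_id
        self.spacing = spacing
        self.image = image


scan = Scan("case-001", (1.0, 1.0, 2.5), np.ones((100, 100, 100)))

# A binary protocol stores the array's data buffer as raw bytes.
blob = pickle.dumps(scan, protocol=pickle.HIGHEST_PROTOCOL)

# Whatever is not raw array data: class reference, attribute names, metadata.
overhead = len(blob) - scan.image.nbytes
print(f"array data: {scan.image.nbytes} bytes, pickle: {len(blob)} bytes, overhead: {overhead} bytes")

# Round trip to confirm nothing was lost.
restored = pickle.loads(blob)
```

The overhead should be tiny next to the array itself, which suggests that wrapping the array in a class is not what makes the files large.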

I've seen that some ML projects use HDF5/h5py to save the model and data, but I haven't paid much attention to that. I have answered h5py questions, but haven't tried it for large projects where speed and compression matter.

Multiple numpy arrays can also be saved with np.savez (or the compressed version, np.savez_compressed). That saves each array as an npy file in a zip archive.
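One way this could be applied to an object with metadata is to store the image and a JSON string of the metadata side by side in a single archive. This is a minimal sketch, assuming the metadata is JSON-serializable; save_scan, load_scan, and the metadata keys are hypothetical names, not part of numpy.

```python
import io
import json

import numpy as np


def save_scan(file, image, metadata):
    """Write the image and JSON-encoded metadata into one compressed .npz archive."""
    # The JSON text becomes a 0-d string array, so no pickling is involved.
    np.savez_compressed(file, image=image, metadata=np.array(json.dumps(metadata)))


def load_scan(file):
    """Read back the (image, metadata) pair written by save_scan."""
    with np.load(file) as archive:  # allow_pickle keeps its default of False
        image = archive["image"]
        metadata = json.loads(archive["metadata"].item())
    return image, metadata


# An in-memory buffer stands in for a file on disk; a filename works the same way.
buffer = io.BytesIO()
save_scan(
    buffer,
    np.zeros((4, 4, 4), dtype=np.float32),
    {"patient_id": "case-001", "spacing": [1.0, 1.0, 2.5]},
)
buffer.seek(0)
image, metadata = load_scan(buffer)
```

Keeping the metadata as JSON rather than as a pickled object means the file can be read with allow_pickle left off.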

np.save is the most efficient means of saving an array. It essentially consists of a small header block, plus a byte copy of the array's data buffer. Unless the array has lots of the same values, there's little room for compression.
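The effect of repeated values on compression can be checked directly. The sketch below uses np.savez_compressed on an in-memory buffer; compressed_size is a hypothetical helper, and the sizes depend on the data and the numpy/zlib versions, so none are quoted here.

```python
import io

import numpy as np


def compressed_size(array):
    """Return the byte size of a compressed .npz archive holding one array."""
    buffer = io.BytesIO()
    np.savez_compressed(buffer, data=array)
    return len(buffer.getvalue())


constant = np.ones((50, 50, 50))        # every value identical
rng = np.random.default_rng(0)
noisy = rng.random((50, 50, 50))        # essentially no repeated values

constant_size = compressed_size(constant)
noisy_size = compressed_size(noisy)

print(f"raw: {constant.nbytes} bytes each")
print(f"compressed constant array: {constant_size} bytes")
print(f"compressed random array:   {noisy_size} bytes")
```

Real image volumes usually fall somewhere between these two extremes, so it is worth measuring on a few representative files before paying the extra CPU time for compression.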

edited Jan 18, 2023 at 21:04

answered Jan 18, 2023 at 20:59


2 Comments

Tomas Premoli Muniagurria Over a year ago

Keep in mind though, I'm not pickling numpy arrays; I'm pickling custom objects with a numpy array as one of their attributes.

hpaulj Over a year ago

And so? Pickled objects have to pickle each of their attributes, again using whatever methods those attributes provide. It is possible to specify special pickling methods for custom object classes, but otherwise pickle uses a generic approach that works with most user-defined classes. But you can study the pickle docs just as well as I can.
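As an illustration of such special pickling methods, a class can define __getstate__ and __setstate__ to control how its array attribute is stored. This is a sketch under assumptions, not the asker's code: CompressedScan and the image_npy_zlib key are hypothetical, and zlib compression is just one possible choice that only pays off when the image data is repetitive.

```python
import io
import pickle
import zlib

import numpy as np


class CompressedScan:
    """Hypothetical class whose image attribute is zlib-compressed when pickled."""

    def __init__(self, patient_id, image):
        self.patient_id = patient_id
        self.image = image

    def __getstate__(self):
        # Copy the attributes, then swap the array for compressed .npy bytes.
        state = self.__dict__.copy()
        buffer = io.BytesIO()
        np.save(buffer, state.pop("image"), allow_pickle=False)
        state["image_npy_zlib"] = zlib.compress(buffer.getvalue())
        return state

    def __setstate__(self, state):
        # Reverse of __getstate__: decompress and rebuild the array.
        raw = zlib.decompress(state.pop("image_npy_zlib"))
        self.__dict__.update(state)
        self.image = np.load(io.BytesIO(raw), allow_pickle=False)


scan = CompressedScan("case-001", np.zeros((64, 64, 64), dtype=np.float32))
blob = pickle.dumps(scan, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)

print(f"array data: {scan.image.nbytes} bytes, pickled object: {len(blob)} bytes")
```

The calling code still uses plain pickle.dump and pickle.load; only the class changes. Compression trades file size for CPU time, so it can make reads and writes slower even when the files get smaller.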
