
I'm pickling an object which has the following structure:

obj
  |---metadata
  |---large numpy array

I'd like to be able to access the metadata. However, if I pickle.load() each object while iterating over a directory (say, because I'm looking for some specific metadata to determine which one to return), then it gets lengthy. I'm guessing pickle wants to load, well, the whole object.
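Roughly, the lookup does something like this (names made up, just to illustrate the pattern):

import pathlib
import pickle

def find_object(directory, wanted):
    for path in pathlib.Path(directory).glob("*.pickle"):
        with path.open("rb") as f:
            obj = pickle.load(f)   # loads the large array too, just to read the metadata
        if obj.metadata == wanted:
            return obj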

Is there a way to access only the top-level metadata of the object without having to load the whole thing?

I thought about maintaining an index, but then I'd have to implement its logic and keep it current, which I'd rather avoid if there's a simpler solution...

4 Comments

  • "I'm guessing pickle wants to load, well, the whole object." Use a database instead? Commented Mar 13, 2020 at 15:40
  • Well, yes, but that would be way too much work at the moment. See it as an image gallery (just with rather large 3d images). True that managing it with a DB would be better than just some file system structure, but users need to be able to access those individually. If I implement a DB, I must implement an interface for users to interact with it. So good idea but not an option. I'd maintain an index that I would perhaps check during downtimes & update at night... files are generated by users on a machine. Commented Mar 13, 2020 at 15:55
  • Do you control the code that pickles this object? If so, I suggest finding a better way to store it in the first place, whether that's a DB or something else. Commented Mar 13, 2020 at 16:03
  • @Code-Apprentice: again, not an option at this point. Commented Mar 18, 2020 at 13:02

1 Answer


Yes, ordinary pickle will load everything. In Python 3.8, the new pickle protocol 5 (PEP 574) allows one to control how objects are serialized and to use a side channel for the large part of the data, but that is mainly useful when using pickle in inter-process communication. It would also require a custom implementation of the pickling for your objects.
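For reference, a minimal sketch of that protocol 5 side channel, assuming a NumPy version that supports out-of-band pickling - note that by itself it keeps the large buffers in memory, it does not lazily load them from disk:

import pickle
import numpy as np

obj = {"meta_data": {"name": "example"}, "data": np.zeros(1_000_000)}

buffers = []
# numpy arrays support out-of-band serialization with protocol 5:
payload = pickle.dumps(obj, protocol=5, buffer_callback=buffers.append)

# the payload stays small; the array bytes travel through the side channel
restored = pickle.loads(payload, buffers=buffers)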

However, even with older Python versions it is possible to customize how to serialize your objects to disk.

For example, instead of having your arrays as ordinary members of your objects, you could have them "living" in another data structure - say, a dictionary, and implement data-access to your arrays indirectly, through that dictionary.

In Python versions prior to 3.8, this will require you to "cheat" on the pickle customization, in the sense that, upon serialization of your object, the custom method should save the separate data as a side effect. But other than that, it should be straightforward.

In more concrete terms, when you have something like:


from typing import Any

import numpy as np

class MyObject:
    def __init__(self, data: np.ndarray, meta_data: Any):
        self.data = data
        self.meta_data = meta_data

Augment it this way - you should still be able to do whatever you do with your objects, but pickling will now only pickle the metadata - the numpy arrays will "live" in a separate data structure that won't be automatically serialized:


from typing import Any
from uuid import uuid4

import numpy as np

# The arrays live here, keyed by uuid, instead of inside the instances themselves:
VAULT = dict()

class SeparateSerializationDescriptor:
    def __set_name__(self, owner, name):
        self.name = name

    def __set__(self, instance, value):
        # store only a uuid string on the instance; the real value goes to the vault
        id = instance.__dict__[self.name] = str(uuid4())
        VAULT[id] = value

    def __get__(self, instance, owner):
        if instance is None:
            return self
        return VAULT[instance.__dict__[self.name]]

    def __delete__(self, instance):
        del VAULT[instance.__dict__[self.name]]
        del instance.__dict__[self.name]

class MyObject:

    data = SeparateSerializationDescriptor()

    def __init__(self, data: np.ndarray, meta_data: Any):
        self.data = data
        self.meta_data = meta_data

Really - that is all that is needed to customize the attribute access: all ordinary uses of the self.data attribute will retrieve the original numpy array seamlessly - self.data[0:10] will just work. But pickle, at this point, will retrieve the contents of the instance's __dict__ - which only contains a key to the real data in the "vault" object.

Besides allowing you to serialize the metadata and data to separate files, it also allows you fine-grained control of the data in memory, by manipulating the VAULT.
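A quick interactive sketch of what that buys you (assuming the classes above; values are made up):

import numpy as np

obj = MyObject(data=np.arange(100), meta_data={"name": "example"})

print(obj.data[0:10])   # seamless access through the descriptor
print(obj.__dict__)     # only the metadata and a uuid string - no array here

del obj.data            # evicts the array from the VAULT without touching the rest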

And now, customize the pickling of the class so that it will save the data separately to disk, and retrieve it on reading. In Python 3.8, this can probably be done "within the rules" (I will take the time, since I am answering this, to take a look at that). For traditional pickle, we "break the rules", in that we save the extra data to disk, and load it, as side effects of serialization.

Actually, it just occurred to me that ordinarily customizing the methods used directly by the pickle protocol, like __reduce_ex__ and __setstate__, while it would work, would, again, automatically unpickle the whole object from disk.

A way to go is: upon serialization, save the full data in a separate file, and create some more metadata so that the array file can be found. Upon deserialization, always load only the metadata - and build into the descriptor above a mechanism to lazily load the arrays as needed.

So, we provide a Mixin class whose dump method should be called instead of pickle.dump, so the data is written to separate files. To unpickle the object, use Python's pickle.load normally: it will retrieve only the "normal" attributes of the object. The object's .load() method can then be called explicitly to load all the arrays, or it will be called automatically when the data is first accessed, in a lazy way:

import pathlib
from uuid import uuid4
import pickle

VAULT = dict()

class SeparateSerializationDescriptor:
    def __set_name__(self, owner, name):
        self.name = name

    def __set__(self, instance, value):
        id = instance.__dict__[self.name] = str(uuid4())
        VAULT[id] = value

    def __get__(self, instance, owner):
        if instance is None:
            return self
        try:
            return VAULT[instance.__dict__[self.name]]
        except KeyError:
            # attempt to silently load missing data from disk upon first array access after unpickling:
            instance.load()
            return VAULT[instance.__dict__[self.name]]

    def __delete__(self, instance):
        del VAULT[instance.__dict__[self.name]]
        del instance.__dict__[self.name]


class SeparateSerializationMixin:

    def _iter_descriptors(self, data_dir):
        # yield (descriptor, vault-id, data-file-path) for each separately
        # serialized attribute of this instance
        for attr in self.__class__.__dict__.values():
            if not isinstance(attr, SeparateSerializationDescriptor):
                continue
            id = self.__dict__[attr.name]
            if not data_dir:
                # use the absolute path recorded at dump time instead of a passed-in folder
                data_path = pathlib.Path(self.__dict__[attr.name + "_path"])
            else:
                data_path = data_dir / (id + ".pickle")
            yield attr, id, data_path

    def dump(self, file, protocol=None, **kwargs):
        data_dir = pathlib.Path(file.name).absolute().parent

        # Record each array's file path on the instance and pickle the arrays
        # into separate files:
        for attr, id, data_path in self._iter_descriptors(data_dir):
            self.__dict__[attr.name + "_path"] = str(data_path)
            with data_path.open("wb") as data_file:
                pickle.dump(getattr(self, attr.name), data_file, protocol=protocol)

        # Pickle the metadata (the instance itself, minus the arrays) as originally intended:
        pickle.dump(self, file, protocol, **kwargs)


    def load(self, data_dir=None):
        """Load all saved arrays associated with this object.

        If data_dir is not passed, the absolute path recorded at dump time is
        used. Otherwise, the files are looked up by name in the given folder.
        """
        if data_dir:
            data_dir = pathlib.Path(data_dir)

        for attr, id, data_path in self._iter_descriptors(data_dir):
            with data_path.open("rb") as data_file:
                VAULT[id] = pickle.load(data_file)

    def __del__(self):
        # Drop this instance's arrays from the vault. This deliberately avoids
        # _iter_descriptors: an object that was never dumped has no recorded
        # file paths, and raising inside __del__ should be avoided.
        for attr in self.__class__.__dict__.values():
            if isinstance(attr, SeparateSerializationDescriptor):
                VAULT.pop(self.__dict__.get(attr.name), None)
        try:
            super().__del__()
        except AttributeError:
            pass

class MyObject(SeparateSerializationMixin):

    data = SeparateSerializationDescriptor()

    def __init__(self, data, meta_data):
        self.data = data
        self.meta_data = meta_data

Of course this is not perfect, and there are likely corner cases. I included some safeguards for the case where the data files are moved to another directory - but I did not test that.

Other than that, using these in an interactive session went smoothly: I could create a MyObject instance that was pickled separately from its data attribute, which was then loaded only when needed after unpickling.
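Such a session might look roughly like this (file names are made up; note that within the same process the arrays may still sit in the VAULT, so the lazy load really shows when unpickling in a fresh process):

import pickle
import numpy as np

obj = MyObject(data=np.arange(1_000_000), meta_data={"label": "scan-42"})

# dump() writes one small metadata pickle plus one file per array:
with open("obj.pickle", "wb") as f:
    obj.dump(f)

# later (e.g. in a fresh process), plain pickle.load reads only the metadata:
with open("obj.pickle", "rb") as f:
    restored = pickle.load(f)
print(restored.meta_data)   # cheap - the array file is not touched

# first access to .data triggers the lazy load from disk:
print(restored.data[:5])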

As for the suggestion to just "keep stuff in a database" - some of the code here can be used just as well if your objects live in a database and you prefer to leave the raw data on the filesystem rather than in a "blob column" in the database.


1 Comment

Woah, really involved answer, thanks. I'd need a bit of time to play around and test it. So, essentially, I would now load a SeparateSerializationDescriptor upon first unpickling a saved file. Then, if I try to access the data, the descriptor defers to the mixin class, which retrieves the data, or I can load it explicitly. The descriptor basically gives me another level of abstraction so I can better control data access... Will check it out!
