Yes, ordinary pickle will load everything. In Python 3.8, the new pickle protocol 5 allows one to control how objects are serialized and to use a side channel for the large part of the data, but that is mainly useful when using pickle in inter-process communication. It would require a custom implementation of the pickling for your objects.
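As a quick, hedged sketch of that protocol-5 side channel (Python 3.8+; recent numpy versions support out-of-band pickling of arrays - the sizes below are approximate):

```python
import pickle

import numpy as np

arr = np.arange(100_000, dtype=np.float64)   # ~800 kB of raw data

buffers = []
# With protocol 5, large buffers are handed to buffer_callback instead of
# being copied into the pickle stream itself:
payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)
restored = pickle.loads(payload, buffers=buffers)

assert (restored == arr).all()
assert len(payload) < 1_000    # the stream itself holds only metadata
```

The buffers list can then travel over any side channel - a file, shared memory, a socket - which is why this mechanism shines in inter-process communication.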
However, even with older Python versions it is possible to customize how to serialize your objects to disk.
For example, instead of having your arrays as ordinary members of your objects, you could have them "living" in another data structure - say, a dictionary, and implement data-access to your arrays indirectly, through that dictionary.
In Python versions prior to 3.8, this will require you to "cheat" on the pickle customization, in the sense that, upon serialization of your object, the custom method should save the separate data as a side effect. But other than that, it should be straightforward.
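For reference, the standard customization hooks available in all Python 3 versions are `__getstate__`/`__setstate__`; here is a minimal sketch (class and field names are just illustrative) of keeping a large array out of the pickle stream - a real version would write it elsewhere, as the side effect described above:

```python
import pickle

import numpy as np

class Measurement:                     # illustrative name
    def __init__(self, data, meta_data):
        self.data = data
        self.meta_data = meta_data

    def __getstate__(self):
        # Copy the instance dict and drop the big array from the pickled
        # state; a real implementation would save it to a side file here.
        state = self.__dict__.copy()
        state["data"] = None
        return state

obj = Measurement(np.zeros(1_000_000), {"origin": "sensor"})
payload = pickle.dumps(obj)

assert len(payload) < 500              # the ~8 MB array stayed out
restored = pickle.loads(payload)
assert restored.data is None           # must be re-attached separately
```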
In more concrete terms, when you have something like:
    from typing import Any

    import numpy as np

    class MyObject:
        def __init__(self, data: np.ndarray, meta_data: Any):
            self.data = data
            self.meta_data = meta_data
Augment it this way - you should still be able to do whatever you do with your objects, but pickling will now pickle only the metadata - the numpy arrays will "live" in a separate data structure that won't be automatically serialized:
    from typing import Any
    from uuid import uuid4

    import numpy as np

    VAULT = dict()

    class SeparateSerializationDescriptor:
        def __set_name__(self, owner, name):
            self.name = name

        def __set__(self, instance, value):
            id = instance.__dict__[self.name] = str(uuid4())
            VAULT[id] = value

        def __get__(self, instance, owner):
            if instance is None:
                return self
            return VAULT[instance.__dict__[self.name]]

        def __delete__(self, instance):
            del VAULT[instance.__dict__[self.name]]
            del instance.__dict__[self.name]

    class MyObject:
        data = SeparateSerializationDescriptor()

        def __init__(self, data: np.ndarray, meta_data: Any):
            self.data = data
            self.meta_data = meta_data
Really - that is all that is needed to customize the attribute access: all ordinary uses of the self.data attribute will retrieve the original numpy array seamlessly - self.data[0:10] will just work. But pickle, at this point, will retrieve the contents of the instance's __dict__ - which contains only a key to the real data in the "vault" object.
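A quick check of that claim - this snippet repeats the descriptor from above so it runs standalone:

```python
import pickle
from uuid import uuid4

import numpy as np

VAULT = dict()

class SeparateSerializationDescriptor:
    def __set_name__(self, owner, name):
        self.name = name
    def __set__(self, instance, value):
        key = instance.__dict__[self.name] = str(uuid4())
        VAULT[key] = value
    def __get__(self, instance, owner):
        if instance is None:
            return self
        return VAULT[instance.__dict__[self.name]]

class MyObject:
    data = SeparateSerializationDescriptor()
    def __init__(self, data, meta_data):
        self.data = data
        self.meta_data = meta_data

obj = MyObject(np.arange(1_000_000), {"origin": "demo"})
assert obj.data[0:10].sum() == 45   # ordinary attribute access just works
payload = pickle.dumps(obj)
assert len(payload) < 500           # only the uuid key and the metadata
```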
Besides allowing you to serialize the metadata and the data to separate files, it also gives you fine-grained control over the data in memory, by manipulating the VAULT.
And now, customize the pickling of the class so that it saves the data separately to disk, and retrieves it on reading. On Python 3.8, this can probably be done "within the rules" (I will take the time, since I am answering this, to take a look at that). For traditional pickle, we "break the rules", in that we save the extra data to disk, and load it back, as side effects of serialization.
Actually, it just occurred to me that ordinarily customizing the methods used directly by the pickle protocol, like __reduce_ex__ and __setstate__, while it would work, would again automatically unpickle the whole object from disk.
A way to go is: upon serialization, save the full data in a separate file, and create some more metadata so that the array file can be found. Upon deserialization, always load only the metadata - and build into the descriptor above a mechanism to lazily load the arrays as needed.
So, we provide a mixin class whose dump method should be called instead of pickle.dump, so that the data is written to separate files. To unpickle the object, use Python's pickle.load normally: it will retrieve only the "normal" attributes of the object. The object's .load() method can then be called explicitly to load all the arrays, or it will be called automatically, in a lazy way, when the data is first accessed:
    import pathlib
    import pickle
    from uuid import uuid4

    VAULT = dict()

    class SeparateSerializationDescriptor:
        def __set_name__(self, owner, name):
            self.name = name

        def __set__(self, instance, value):
            id = instance.__dict__[self.name] = str(uuid4())
            VAULT[id] = value

        def __get__(self, instance, owner):
            if instance is None:
                return self
            try:
                return VAULT[instance.__dict__[self.name]]
            except KeyError:
                # Attempt to silently load the missing data from disk upon
                # first array access after unpickling:
                instance.load()
                return VAULT[instance.__dict__[self.name]]

        def __delete__(self, instance):
            del VAULT[instance.__dict__[self.name]]
            del instance.__dict__[self.name]

    class SeparateSerializationMixin:
        def _iter_descriptors(self, data_dir):
            for attr in self.__class__.__dict__.values():
                if not isinstance(attr, SeparateSerializationDescriptor):
                    continue
                id = self.__dict__[attr.name]
                if not data_dir:
                    # Use the absolute path recorded at dump time instead
                    # of a passed-in folder:
                    data_path = pathlib.Path(self.__dict__[attr.name + "_path"])
                else:
                    data_path = data_dir / (id + ".pickle")
                yield attr, id, data_path

        def dump(self, file, protocol=None, **kwargs):
            data_dir = pathlib.Path(file.name).absolute().parent
            # Record the paths and pickle all numpy arrays into separate files:
            for attr, id, data_path in self._iter_descriptors(data_dir):
                self.__dict__[attr.name + "_path"] = str(data_path)
                with data_path.open("wb") as data_file:
                    pickle.dump(getattr(self, attr.name), data_file, protocol=protocol)
            # Pickle the metadata as originally intended:
            pickle.dump(self, file, protocol, **kwargs)

        def load(self, data_dir=None):
            """Load all saved arrays associated with this object.

            If data_dir is not passed, the absolute path recorded at dump
            time is used. Otherwise the files are searched for, by name,
            in the given folder.
            """
            if data_dir:
                data_dir = pathlib.Path(data_dir)
            for attr, id, data_path in self._iter_descriptors(data_dir):
                with data_path.open("rb") as data_file:
                    VAULT[id] = pickle.load(data_file)

        def __del__(self):
            for attr, id, path in self._iter_descriptors(None):
                VAULT.pop(id, None)
            try:
                super().__del__()
            except AttributeError:
                pass

    class MyObject(SeparateSerializationMixin):
        data = SeparateSerializationDescriptor()

        def __init__(self, data, meta_data):
            self.data = data
            self.meta_data = meta_data
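For a concrete round trip, here is a condensed, self-contained variant of the classes above (one attribute, no deletion handling; the names are illustrative), showing dump-to-a-separate-file followed by lazy reload:

```python
import pathlib
import pickle
import tempfile
from uuid import uuid4

import numpy as np

VAULT = dict()

class LazyArrayDescriptor:
    # Condensed version of the descriptor above: loads from disk on a miss.
    def __set_name__(self, owner, name):
        self.name = name

    def __set__(self, instance, value):
        key = instance.__dict__[self.name] = str(uuid4())
        VAULT[key] = value

    def __get__(self, instance, owner):
        if instance is None:
            return self
        key = instance.__dict__[self.name]
        if key not in VAULT:   # first access after unpickling
            path = pathlib.Path(instance.__dict__[self.name + "_path"])
            with path.open("rb") as file:
                VAULT[key] = pickle.load(file)
        return VAULT[key]

class DemoObject:
    data = LazyArrayDescriptor()

    def __init__(self, data, meta_data):
        self.data = data
        self.meta_data = meta_data

    def dump(self, file):
        # Write the array to "<uuid>.pickle" next to the metadata file:
        data_dir = pathlib.Path(file.name).absolute().parent
        key = self.__dict__["data"]
        data_path = data_dir / (key + ".pickle")
        self.__dict__["data_path"] = str(data_path)
        with data_path.open("wb") as data_file:
            pickle.dump(self.data, data_file)
        pickle.dump(self, file)

tmp_dir = pathlib.Path(tempfile.mkdtemp())
obj = DemoObject(np.arange(10), {"origin": "demo"})
with (tmp_dir / "obj.pickle").open("wb") as f:
    obj.dump(f)

VAULT.clear()   # simulate a fresh interpreter session
with (tmp_dir / "obj.pickle").open("rb") as f:
    restored = pickle.load(f)

print(restored.data[:3].tolist())   # lazy-loaded from disk: [0, 1, 2]
```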
Of course this is not perfect, and there are likely corner cases.
I included some safeguards in case the data-files are moved to another directory - but I did not test that.
Other than that, using these classes in an interactive session here went smoothly, and I could create a MyObject instance that was pickled separately from its data attribute, which was then loaded only when needed after unpickling.
As for the suggestion to just "keep stuff in a database": some of the code here can be used just as well if your objects live in a database and you prefer to keep the raw data on the filesystem rather than in a "blob" column in the database.