Danny Price recently left a comment to let me know about a new Python package he’s developing called hickle. The goal of hickle is to create a module that works like Python’s pickle module but stores its data in the HDF5 binary file format. I think this is a promising approach, because I advocate storing binary data in HDF5 files whenever possible instead of creating yet another one-off binary file format that nobody will be able to read in ten years. The immediate advantage of using HDF5 to store pickled Python objects is that HDF5 files are portable across many platforms, while ordinary pickled objects may not be readable on a different platform.
The hickle developers have made a good start, but they have a long way to go before hickle will be useful to a wider audience. Right now, hickle can only store NumPy ndarrays and Python list objects. If you only need to store lists and arrays, you might as well use HDF5 bindings for Python such as PyTables or h5py. The power of the pickle module is that you can immediately serialize almost any Python object of arbitrary complexity, store it on disk, and retrieve it. hickle will only achieve its full potential once it replicates this functionality, and I’m not sure how difficult this will be. Ideally, you might be able to derive a class from Pickler that uses Pickler’s methods to serialize an object, and then add your own method to write the serialized object to an HDF5 file.
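To make that last idea concrete, here is a minimal sketch of what such a subclass might look like. It is not how hickle works; the class name BufferPickler, its dumps and dump_to_hdf5 methods, the load_from_hdf5 helper, and the dataset name "pickled" are all made up for illustration, and I am assuming h5py and NumPy are installed for the HDF5 half.

```python
import io
import pickle


class BufferPickler(pickle.Pickler):
    """Hypothetical Pickler subclass that can write its output to HDF5."""

    def __init__(self, protocol=pickle.HIGHEST_PROTOCOL):
        # Pickler needs a file-like object; an in-memory buffer stands in.
        self._buffer = io.BytesIO()
        super().__init__(self._buffer, protocol)

    def dumps(self, obj):
        """Serialize obj with the inherited dump() and return the bytes."""
        self._buffer.seek(0)
        self._buffer.truncate()
        self.clear_memo()  # forget objects seen in earlier calls
        self.dump(obj)
        return self._buffer.getvalue()

    def dump_to_hdf5(self, obj, path, name="pickled"):
        """Store the pickle byte stream as a uint8 dataset in an HDF5 file."""
        # Imported here so the serialization half works without h5py.
        import h5py
        import numpy as np

        payload = np.frombuffer(self.dumps(obj), dtype=np.uint8)
        with h5py.File(path, "w") as f:
            f.create_dataset(name, data=payload)


def load_from_hdf5(path, name="pickled"):
    """Read the uint8 dataset back and unpickle it."""
    import h5py

    with h5py.File(path, "r") as f:
        return pickle.loads(f[name][()].tobytes())
```

Note that this only wraps an opaque pickle byte stream in an HDF5 container. The file still cannot be interpreted without Python's pickle module, so a real solution would presumably need to map objects onto native HDF5 groups and datasets.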
In a future post, I’ll describe some of the practical problems with using pickle files to store data, and try to organize some thoughts about how they might be solved.
Hi Craig, just came across your post, my thoughts exactly! Hickle does indeed use h5py under the hood, so you’re right that it doesn’t do anything that can’t already be done (with a few extra lines). Nonetheless, I’ve always found pickle to be amazingly useful as it’s so simple, so wanted to try and make a “drop-in” replacement.
I finally got around to adding some (basic) support for dictionaries, and in particular dictionaries of numpy arrays. I was pleasantly surprised when I did some performance comparisons: by not serializing your data there are ~100x speedups!
I noticed that you said it could only do lists and numpy arrays at the point of your writing, but is it closer to being a hybrid of pickle and HDF5 today?
Hi Fintan — yes it’s progressed quite a bit, thanks to requests from the community. It still doesn’t do everything, but for things it doesn’t understand it will pickle first then store the serialized data.
Here’s the current version: https://github.com/telegraphic/hickle
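The "pickle first, then store" fallback described above could be sketched roughly as follows. This is only an illustration of the dispatch idea, not hickle's actual code: the names NATIVE_TYPES, encode, and decode are hypothetical, and the set of types treated as natively storable is an assumption chosen to keep the example short.

```python
import pickle

# Types this sketch treats as directly storable (an assumption; the real
# list of types a library like hickle understands will differ).
NATIVE_TYPES = (int, float, str, list)


def encode(obj):
    """Return (kind, payload): native objects pass through, others are pickled."""
    if isinstance(obj, NATIVE_TYPES):
        return "native", obj
    return "pickled", pickle.dumps(obj)


def decode(kind, payload):
    """Invert encode(): unpickle only the payloads that were pickled."""
    if kind == "pickled":
        return pickle.loads(payload)
    return payload


# A list is stored as-is, while a set falls back to a pickle byte string.
list_kind, list_payload = encode([1, 2, 3])
set_kind, set_payload = encode({1, 2, 3})
```

In a real HDF5-backed store, the kind tag would probably be kept as an attribute on the dataset so the loader knows whether to unpickle what it reads.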