Storing large Numpy arrays on disk: Python Pickle vs. HDF5

In a previous post, I described how Python’s Pickle module is fast and convenient for storing all sorts of data on disk. More recently, I showed how to profile the memory usage of Python code. In recent weeks, I’ve uncovered a serious limitation of the Pickle module when storing large amounts of data: pickling requires a great deal of additional memory while it saves a data structure to disk. Fortunately, there is an open standard called HDF, which defines a binary file format designed to store large scientific data sets efficiently. I will demonstrate both approaches and profile them to see how much memory each requires. I am writing the HDF file through the PyTables interface. Here’s the little test program I’ve been using:

#!/usr/bin/env python
import numpy as np

array_len = 10000000

# Ten million float64 zeros -- about 76 MiB of data.
a = np.zeros(array_len, dtype=float)

# --- Pickle version (uncomment to profile) ---
#import pickle
#f = open('test.pickle', 'wb')
#pickle.dump(a, f, pickle.HIGHEST_PROTOCOL)
#f.close()

# --- HDF5 version, via PyTables ---
import tables
h5file = tables.open_file('test.h5', mode='w', title="Test Array")
root = h5file.root
h5file.create_array(root, "test", a)
h5file.close()

I first ran the program with both the Pickle and the HDF code commented out, and profiled its RAM usage with Valgrind’s Massif tool (see my post about profiling the memory usage of Python code). Massif shows that the program uses about 80 MB of RAM. I then uncommented the Pickle code and profiled again. Look at how the memory usage almost triples!

    MB
232.2^                                                                  #
     |                                                                  #
     |                                                                  #
     |                                                                  #
     |                                                                  #
     |                                                                  #
     |                                                                  #
     |                                                         @        #
     |                                                         @        #
     |                                                         @        #
     |                                                         @        #
     |                                                         @        #
     |                                                         @        #
     |                                                ,        @        #  .
     |                                                @        @        #  :
     |                                                @        @        #  :
     |                                                @        @        #  :
     |                                                @        @        #  :
     |                                                @        @        #  :
     |                                                @        @        #  :
   0 +----------------------------------------------------------------------->Mi
     0                                                                   161.9
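You don’t strictly need Valgrind to see this growth; peak RSS can also be read from inside the process. Here’s a rough, stdlib-only sketch (Unix-only, and a `bytearray` stands in for the NumPy array so it runs anywhere):

```python
import pickle
import resource

# ~76 MiB of zeros; a bytearray stands in for the float64 NumPy array.
a = bytearray(80_000_000)

before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
payload = pickle.dumps(a, pickle.HIGHEST_PROTOCOL)
after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# ru_maxrss is in KiB on Linux (bytes on macOS). pickle.dumps must
# materialize a complete serialized copy, so the peak should grow by
# roughly the size of the data.
print(after >= before)
```

This is cruder than Massif (it only reports the high-water mark), but it’s a quick way to confirm that the serialized copy really does cost a second array’s worth of memory.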

I then commented out the Pickle code and uncommented the HDF code, and ran the profile again. Notice how efficient the HDF library is:

    MB
81.62^                              ,.., .....,.. .,...,... .,......,..,:  .
     |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
     |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
     |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
     |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
     |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
     |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
     |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
     |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
     |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
     |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
     |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
     |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
     |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
     |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
     |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
     |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
     |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
     |                              @::@ :::::@:: :@:::@::: :@::::::@::#:  :
     |              .,..,. .....    @::@ :::::@:: :@:::@::: :@::::::@::#:  :::
   0 +----------------------------------------------------------------------->Mi
     0                                                                   246.6

I didn’t benchmark the speed because, for my application, it doesn’t really matter: the data gets dumped to disk relatively infrequently. It’s interesting to note that the data files on disk are almost identical in size: 76.29 MB each.
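That size is exactly what the raw data dictates: ten million float64 values at 8 bytes each, with both formats adding only negligible metadata. A quick sanity check:

```python
# Ten million float64 values, 8 bytes each -- the payload that both
# file formats have to store.
array_len = 10_000_000
raw_bytes = array_len * 8

print(raw_bytes)                    # 80000000 bytes
print(round(raw_bytes / 2**20, 2))  # 76.29 (MiB, as file managers report it)
```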

Why does Pickle consume so much more memory? The reason is that HDF is a binary data pipe, while Pickle is an object serialization protocol. Pickle actually consists of a simple virtual machine (VM) that translates an object into a series of opcodes and writes them to disk. To unpickle something, the VM reads and interprets the opcodes and reconstructs the object. The downside of this approach is that the VM has to construct a complete copy of the serialized data in memory before it writes anything to disk. I’m not sure why pickling my example array seems to require three times as much memory as the array itself; my guess is that the original array, the serialized byte stream, and intermediate buffering each account for roughly one copy. Ninety-nine percent of the time, memory usage isn’t a problem, but it becomes a real issue when you have hundreds of megabytes of data in a single array. Fortunately, HDF exists precisely to handle large binary data sets efficiently, and the PyTables package makes it easy to access them in a very Pythonic way.
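You can see the opcode stream for yourself with the standard library’s `pickletools` module; disassembling a tiny pickle shows the little program the unpickling VM executes to rebuild the object:

```python
import pickle
import pickletools

# Serialize a small object and list the opcodes the pickle VM will
# interpret when reconstructing it.
payload = pickle.dumps([1, 2, 3], pickle.HIGHEST_PROTOCOL)

# genops yields (opcode, argument, position) tuples.
opcodes = [op.name for op, arg, pos in pickletools.genops(payload)]
print(opcodes[0], opcodes[-1])  # stream begins with PROTO and ends with STOP
```

(`pickletools.dis(payload)` prints the same stream in annotated, human-readable form.)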


8 thoughts on “Storing large Numpy arrays on disk: Python Pickle vs. HDF5”

  1. Hey – just stumbled upon your blog googling HDF and Pickle comparisons. That’s some neat profiling!

    If you’re interested, I’ve been writing a Python module called “hickle”, which aims to be exactly the same in usage as pickle, but uses HDF for data storage:
    https://github.com/telegraphic/hickle

    Would be great to get some feedback, I’m hoping it comes in useful for more than just me 😉


  2. Hi, I have been using mpi4py to do a calculation in parallel. One option for sending data between different processes is pickle. I ran into errors using it, and I wonder if it could be because of the large amount of memory the actual pickling process consumes. The data itself is not so large it can’t fit in memory (a complex array with dimensions of 50x50x200). The problem was resolved when I switched to the other option to send data between processes, which is as a numpy array (via some C method I believe). Any thoughts?


    1. Ashley, I think your hypothesis is correct. Pickling consumes a lot of memory: in my example, pickling an object required an amount of memory equal to three times the size of the object. NumPy stores data in binary C arrays, which are very efficient. Passing data via NumPy arrays is efficient because MPI doesn’t have to transform the data; it just copies the block of memory.

      Pickle is fine for quick hacks, but I don’t use pickle in production code because it’s potentially insecure and inefficient.
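      The cost of that transformation is easy to demonstrate even without NumPy or mpi4py. A small stdlib-only sketch (the `array` here is just a stand-in for the commenter’s 50x50x200 complex array, roughly the same number of raw bytes):

```python
import array
import pickle

# Stand-in for a 50x50x200 complex128 array (~8 MB of raw data),
# built from the stdlib so the sketch runs anywhere.
buf = array.array('d', [0.0]) * 1_000_000  # one million float64 zeros

# pickle.dumps must materialize the entire serialized copy in memory
# before a single byte can be sent over MPI or written to disk.
payload = pickle.dumps(buf, pickle.HIGHEST_PROTOCOL)

# The serialized copy is at least as large as the raw data itself.
print(len(payload) >= buf.itemsize * len(buf))
```

      A buffer-based send avoids that second copy entirely, which is consistent with the switch fixing the problem.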

