Service account credentials with the Python client for the Google Drive API (v3)

There are numerous ways to authenticate against the Google Drive API. If you have an application running on Google Compute Engine that needs to access Drive, a Service Account is probably the easiest way to do it. One use case is for an application to write reports or log files to Drive so that users can see them without logging into a server.

Before you try this example, go through all of the steps in Google’s Using OAuth 2.0 for Server to Server Applications guide and save your service account’s private key locally in JSON format.

Getting credentials from a service account file is easy:

from google.oauth2 import service_account

SCOPES = ['https://www.googleapis.com/auth/drive']
SERVICE_ACCOUNT_FILE = './name-of-service-account-key.json'

credentials = service_account.Credentials.from_service_account_file(
    SERVICE_ACCOUNT_FILE, scopes=SCOPES)
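The snippet above stops at the credentials; as a sketch of the next step (assuming the google-api-python-client package is installed), you can build a Drive v3 client like this:

from googleapiclient.discovery import build

# Build a Drive v3 client from the service account credentials
service = build('drive', 'v3', credentials=credentials)

# Sample call: list a few files visible to the service account
results = service.files().list(pageSize=10, fields="files(id, name)").execute()
for f in results.get('files', []):
    print(f['name'], f['id'])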



Removing an axis or both axes from a matplotlib plot

Sometimes, the frame around a matplotlib plot can detract from the information you are trying to convey.  How do you remove the frame, ticks, or axes from a matplotlib plot?

[Figure: a matplotlib plot without a y axis]

The full example is available on github.
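The github example has the details, but the core ideas look something like this (a minimal sketch using standard matplotlib calls):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 8])

# Hide just the y axis (ticks, tick labels, and axis label)
ax.yaxis.set_visible(False)

# Hide the frame by making all four spines invisible
for spine in ax.spines.values():
    spine.set_visible(False)

# Or wipe out everything at once: frame, ticks, and both axes
# ax.set_axis_off()

plt.show()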


Storing large Numpy arrays on disk: Python Pickle vs. HDF5

In a previous post, I described how Python’s Pickle module is fast and convenient for storing all sorts of data on disk. More recently, I showed how to profile the memory usage of Python code.  In recent weeks, I’ve uncovered a serious limitation in the Pickle module when storing large amounts of data: Pickle requires a large amount of memory to save a data structure to disk. Fortunately, there is an open standard called HDF, which defines a binary file format that is designed to efficiently store large scientific data sets. I will demonstrate both approaches, and profile them to see how much memory is required. I am writing the HDF file using the PyTables interface. Here’s the little test program I’ve been using:

#!/usr/bin/env python
import numpy as np

array_len = 10000000

# A 10-million-element float64 array: about 76 MB of data
a = np.zeros(array_len, dtype=float)

# Pickle version (commented out while profiling the HDF path):
#import pickle
#f = open('test.pickle', 'wb')
#pickle.dump(a, f, pickle.HIGHEST_PROTOCOL)
#f.close()

# HDF version, via the PyTables interface
import tables
h5file = tables.openFile('test.h5', mode='w', title="Test Array")
root = h5file.root
h5file.createArray(root, "test", a)
h5file.close()
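(As an aside, newer PyTables releases renamed these camelCase methods to PEP 8 style; with PyTables 3.x the equivalent sketch would be:)

import tables

h5file = tables.open_file('test.h5', mode='w', title="Test Array")
h5file.create_array(h5file.root, "test", a)
h5file.close()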

I first ran the program with both the pickle and the HDF code commented out, and profiled RAM usage with Valgrind and Massif (see my post about profiling memory usage of Python code). Massif shows that the program uses about 80MB of RAM. I then uncommented the Pickle code, and profiled the program again. Look at how the memory usage almost triples!

[Massif heap profile while pickling: memory climbs to a peak of 232.2 MB near the end of the run]

I then commented out the Pickle code and uncommented the HDF code, and ran the profile again. Notice how efficient the HDF library is:

[Massif heap profile with HDF: memory stays essentially flat at about 81.62 MB for the entire run]

I didn’t benchmark the speed, because for my application it doesn’t really matter: the data gets dumped to disk relatively infrequently. It’s interesting to note that the data files on disk are almost identical in size: 76.29 MB.

Why does Pickle consume so much more memory? The reason is that HDF streams the array’s raw bytes more or less directly to disk, while Pickle is a general object serialization protocol. Pickle is built around a simple virtual machine (VM) that translates an object into a series of opcodes and writes them to disk. To unpickle something, the VM reads and interprets the opcodes and reconstructs the object. The downside of this approach is that the VM has to construct a complete copy of the object in memory before it writes it to disk. I’m not sure why the pickle operation in my example code seems to require three times as much memory as the array itself. 99% of the time, memory usage isn’t a problem, but it becomes a real issue when you have hundreds of megabytes of data in a single array. Fortunately, HDF exists to efficiently handle large binary data sets, and the PyTables package makes it easy to access them in a very Pythonic way.
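If you’re curious about those opcodes, the standard library’s pickletools module will disassemble a pickle stream; a quick sketch:

import pickle
import pickletools

data = [1, 2.5, "three"]

# Disassemble the pickle byte stream into human-readable opcodes
pickletools.dis(pickle.dumps(data, pickle.HIGHEST_PROTOCOL))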

Installing numpy with the Intel Math Kernel Library (mkl)

Today I installed numpy on a cluster. Normally, as a Gentoo admin, I just install things with emerge, and all the details are taken care of automagically. However, this cluster runs Red Hat Enterprise, and I don’t have admin privileges, so I had to install numpy in my home directory. I installed 1.0.4, to match the version used on another system. You may not need to do this for more recent versions of numpy, which may have an improved setup script. The overall process is:

  1. Unpack the tar file with: tar xfvz numpy-1.0.4.tar.gz
  2. cd numpy-1.0.4
  3. cp site.cfg.example site.cfg
  4. vi site.cfg
  5. python setup.py install --home=~

Numpy tries to find the installed system linear algebra libraries BLAS, ATLAS, and LAPACK. If it can’t find them, you will miss out on some dramatic speed improvements. This cluster happens to have the Intel Math Kernel Library (mkl), which replaces the open source versions. To tell numpy where to find mkl, you have to edit the file site.cfg. It is not immediately obvious what to do: at first I tried to add the mkl path and library names to the blas_opt and lapack_opt sections, but that didn’t work. I had to add a section as follows:

[mkl]
library_dirs = /opt/software/intel/mkl/10.0.1.014/lib/em64t
mkl_libs = mkl, guide
lapack_libs = mkl_lapack
include_dirs = /opt/software/intel/mkl/10.0.1.014/include

I didn’t have to uncomment the blas_opt or lapack_opt sections at all. I got this idea from a file I found on the scipy web site.
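One way to check whether numpy actually picked up mkl is to print its build configuration (the exact output format varies across numpy versions):

import numpy

# Lists the BLAS/LAPACK libraries numpy was built against;
# the mkl libraries should appear here if site.cfg was read correctly
numpy.show_config()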

Leave a comment if this does or does not work for you.

Python threads are easy (with example)

It’s remarkably easy to spawn a Python thread.  However, before doing so, I caution you that a Python thread won’t behave the way you might expect: CPython threads are real OS threads, but the interpreter’s Global Interpreter Lock (GIL) allows only one of them to execute Python bytecode at a time.  The reasons why have already been explained elsewhere, so I refer you to the thread module documentation to learn about the Global Interpreter Lock.  You probably have objections to this state of affairs, and I assure you they have already been voiced by Juergen Brendel and responded to by Guido van Rossum (creator of Python).  Anyway, the upshot is that pure-Python code can only utilize one core of a multi-core CPU.  This isn’t such a big deal for me because I’m a scientific programmer, and if I really need to write parallel code it’s going to have to run on a cluster or a grid.  Threads don’t help with that.

Having said all that, threads in Python are still useful.  I will detail one example in which I spawn a thread to load a large binary file.  While this doesn’t spread the work across multiple CPU cores, it does enable the GUI to remain interactive while the file loads. One easy way to create a Python thread is to write a class derived from Thread. In the example below, I derived a class called Loader, which “wraps” the function that actually reads the binary files. The __init__ method accepts the filename and other options as arguments. The run() method is required. Don’t call run() directly; instead, call the start() method (inherited from the base class) to start the thread.

from threading import Thread

class Loader(Thread):
    """The Loader class is a derivative of Thread that allows file loading to
    take place in a separate thread from the GUI."""

    def __init__(self, fileName, readSecondDerivatives, monitor, notifyWindow):
        Thread.__init__(self)
        self._notifyWindow = notifyWindow
        self.fileName = fileName
        self.readSecondDerivatives = readSecondDerivatives
        self.monitor = monitor

    def run(self):
        """Required method, invoked by Thread.start(); calls the function that
        loads the binary file."""
        # readWGMfile, DataLoadedEvent, and wx are defined elsewhere in the GUI code
        binaryFile = open(self.fileName, mode='rb')
        self.data = readWGMfile(binaryFile, self.readSecondDerivatives, self.monitor)
        binaryFile.close()
        wx.PostEvent(self._notifyWindow, DataLoadedEvent(self.data))


Using the class:

loader = Loader(os.path.join(self.path, self.fileName), readSecondDerivatives, monitor, self)
loader.start()


Even though this can’t run on more than one core at a time, it keeps the GUI from freezing, and allows the GUI to receive and display status updates from the file loading.

Redirecting text output from Python functions

Two posts ago, I described how I wrote a function in Python that reads in a binary file from Labview. In my last post, I described using wxPython to write a GUI to process the data from those binary files. Naturally, I called the binary-file-reader function from the GUI. The problem is that the file reader prints a lot of information to the terminal, using Python print statements. None of this goes to the GUI, requiring the user to run the GUI from a terminal and keep an eye on the text output, which is inconvenient. However, I don’t want to modify the file reader to include GUI-specific code, because that would be less modular and less re-usable. Instead, I learned that Python has a very easy facility to redirect stdout, the default destination of the print statement. I modified the file reader as follows:

def readWGMfile(binaryFile, readSecondDerivatives, output=None):
    if output is not None:
        # allow the caller to hand in an alternative to standard out--for example, if
        # the caller is a GUI, redirect "print" statements to a GUI control
        print "Redirecting output..."
        import sys
        sys.stdout = output

    # rest of the function... 

I added an optional argument to the file reader, allowing the caller to specify an object to replace stdout. The sys module is then used to redirect stdout to the object specified by the caller. There’s no change to the default usage of the function, but it’s a lot more flexible. Here is the object that I defined to capture the text:

class Monitor:
    """The Monitor class defines an object which is passed to the function
    readWGMfile(). Standard output is redirected to this object."""

    def __init__(self, notifyWindow):
        import sys
        self.out = sys.stdout
        self._notifyWindow = notifyWindow   # keep track of which window receives events

    def write(self, s):
        """Required method that replaces stdout."""
        if s.rstrip() != "":        # don't need to be passing blank messages
            # self.out.write("Posting string: " + s.rstrip() + "\n")
            wx.PostEvent(self._notifyWindow, TextMessageEvent(s.rstrip()))

The only requirement of the Monitor class is that it implement a write(self, string) method that accepts a string that would otherwise have gone to standard out. In my case, I post a special event which is sent to the specified window. Here is the definition of that event:

class TextMessageEvent(wx.PyEvent):
    """Event to notify GUI that somebody wants to pass along a message in the
    form of a string."""
    def __init__(self, message):
        wx.PyEvent.__init__(self)
        self.SetEventType(TEXT_MESSAGE)   # TEXT_MESSAGE is an event ID defined elsewhere
        self.message = message
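If you want to try the redirection idea without wx, here is a minimal, GUI-free sketch; EchoMonitor is a hypothetical stand-in for the Monitor class above:

import sys

class EchoMonitor:
    """Minimal stand-in for Monitor: collects redirected print output."""
    def __init__(self):
        self.lines = []

    def write(self, s):
        if s.rstrip() != "":        # skip the bare newlines that print emits
            self.lines.append(s.rstrip())

    def flush(self):                # some code flushes stdout; make it a no-op
        pass

monitor = EchoMonitor()
old_stdout = sys.stdout
sys.stdout = monitor                # everything printed now goes to monitor.write()
print("captured by the monitor")
sys.stdout = old_stdout             # always restore the real stdout
print(monitor.lines)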

Once again, I’m impressed with the way that Python makes hard things easy and very hard things possible. In my next post, I’ll give a quick example of how easy it is to “thread” things in Python.

Fun with threads in Python and wxPython

I have finally gotten back to programming in the last couple of days.  Our project has started to generate a lot of data, so I’ve been refactoring and improving my code that reads data stored in LabView binaries.  Today I spent a lot of time creating a GUI for browsing data.  Arguably, this wasn’t the best use of my time, but I learned a lot about multi-threaded Python GUI programming with wxPython.  You can find a gold mine of information on multi-threaded wx programming at the wxPython wiki.  Because the LabView binary data has to be read sequentially, and the files are rather large, it takes a long time to read in a file.  I spawn a thread to handle the file reading while allowing the GUI to remain responsive.  The thread posts messages to the GUI window, which are used to update the user on the status of the file-reading operation.  When the file is read, a final message containing the data is posted to the window.  It’s really pretty slick now that I’ve figured out how to do it.

I will soon post a clever scheme to capture text output from the file-reading function, and display it in the GUI, without making substantial changes to the file-reading function.

Reading Labview binary files with Python

My research group uses Labview 7.1 to write custom data acquisition (DAQ) software. I code everything else in Python, so I need to get data from Labview into Python for processing. Our DAQ program produces Labview binary files, so I had to find a way to read them with Python. Binary files are nice because they are a compact way to store numerical data as compared to ASCII or (heaven forbid) XML, but they are much harder to read. The binary format used by Labview is documented only indirectly, so I had to hack a little.

The first thing to realize is that the Labview binary file is a direct dump of the data that was stored in RAM. How Labview stores data in memory is documented here. Indirectly, this documents how binary files are stored on disk. Our DAQ program writes a rather complex “cluster” (Labview’s version of a C structure) to disk. The elements of the cluster are stored contiguously as a sequence of bytes, and there’s no way to know which byte goes with which element, unless you know the size of each element and the order in which they are stored in the cluster. So, the first step is to document the cluster that’s being written to disk. You can use the context help in Labview to view the data type of the wire that leads to the VI that writes the file. With this in hand, you are ready to write Python code.

First, make sure you open the file in binary mode:

binaryFile = open("Measurement_4.bin", mode='rb')

Now you can use the file.read(n) command to read n bytes of data from the file. This data is read as a string, so you will need the struct module to interpret the string as packed binary data. This is what it looks like:

(data.offset,) = struct.unpack('>d', binaryFile.read(8))

The first argument to struct.unpack() is a format string. The ‘>’ tells struct to interpret the data as big-endian (the default for Labview) and the ‘d’ tells it to produce a double. The size of the data type has to match the number of bytes read in by file.read(). Remember that we’re reading the file sequentially, and there are no delimiters, so if you read the wrong number of bytes for any field, all the subsequent data will be junk.

One final note about arrays: a 1-D array is stored as a 32-bit dimension size, followed by the data. If the array contains no elements, it is stored as 32 zero bits with no other data. It would probably be a good idea to use the Python array module (or numpy) if you need to read in large arrays efficiently; a sketch follows.
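Here’s a minimal sketch of reading that layout, assuming a 1-D array of big-endian doubles (read_labview_array is a hypothetical helper, not part of any Labview library):

import struct
import numpy as np

def read_labview_array(binaryFile):
    """Read a 1-D Labview array of doubles: a big-endian 32-bit length
    followed by that many 8-byte values."""
    (n,) = struct.unpack('>i', binaryFile.read(4))    # 32-bit dimension size
    if n == 0:
        return np.empty(0)        # empty array: just the zero length, no data
    # '>f8' means big-endian 64-bit float, matching Labview's default byte order
    return np.frombuffer(binaryFile.read(8 * n), dtype='>f8')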