Digging into Python’s PYC files

One of the first things we needed to do when we started working on Testuff was to figure out how we were going to update the installed desktop clients. This is one of those problems that usually seems to fall under the NIH syndrome, and like many others before me, I invented my own scheme. The gist of it is a version.xml file that sits alongside the setup file for the newest release and looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<update-info version="0.8.0[1212]">
    <update file="TestuffSetup.exe" from-version="all"/>
    <update file="TestuffUpdate.exe" from-version="0.7.1[1110]"/>
        <file md5="3a23dd6eff6fd6c1d0fbfcbfb0d57221" path="async.pyc"/>
        <file md5="0d1ea490a18c65cec7ba8715b5ea9e69" path="atexit.pyc"/>
        <file md5="166723a4330a98b573119326fc689322" path="base64.pyc"/>
        <file md5="01c1bda049936de570ed922424c057a8" path="BeautifulSoup.pyc"/>

When the Testuff client launches, it gets the version.xml file from the server and compares its own version to the version attribute of the update-info tag. If the two do not match, it checks the update tags to see which update it should download and install. We generate two separate setup files – one, called TestuffUpdate.exe, to update the most recent version to the new one, and another, TestuffSetup.exe, to update all the other (older) versions.
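The client-side check can be sketched roughly as follows. This is only an illustration in modern Python 3, not Testuff's actual code: the function name choose_update, the closed-off sample XML and the rule that an exact from-version match wins over "all" are assumptions.

```python
import xml.etree.ElementTree as ET

# Sample data modelled on the snippet above, closed off so that it parses.
SAMPLE_VERSION_XML = """<update-info version="0.8.0[1212]">
    <update file="TestuffSetup.exe" from-version="all"/>
    <update file="TestuffUpdate.exe" from-version="0.7.1[1110]"/>
</update-info>
"""

def choose_update(version_xml, installed_version):
    """Return the update file to download, or None if already up to date."""
    root = ET.fromstring(version_xml)
    if root.get("version") == installed_version:
        return None
    fallback = None
    for update in root.findall("update"):
        source = update.get("from-version")
        if source == installed_version:
            # Assumed rule: an exact match beats the catch-all entry.
            return update.get("file")
        if source == "all":
            fallback = update.get("file")
    return fallback

print(choose_update(SAMPLE_VERSION_XML, "0.8.0[1212]"))  # None
print(choose_update(SAMPLE_VERSION_XML, "0.7.1[1110]"))  # TestuffUpdate.exe
print(choose_update(SAMPLE_VERSION_XML, "0.6.0[900]"))   # TestuffSetup.exe
```

The version "0.6.0[900]" is made up to show the fallback case.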

Aside from the info about which version of the client should use which update file, version.xml also contains the MD5 hashes for each file in the distribution. That might seem like a lot of wasted space and time, but it’s actually there for a very good reason. When our setup building script is creating TestuffUpdate.exe, it too downloads version.xml from our server. It then tries to determine which files have changed or have been added since the last version by comparing the MD5 hashes in version.xml to the hashes of the actual files that have been generated by the build. Any file that is different is added to the update so we can be sure we haven’t missed any essential component in the update.
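A minimal sketch of that comparison, in modern Python 3, might look like this. The helper names md5_of_file and changed_paths are hypothetical, and the hashes for the "built" side are invented for the example; the real build script is not shown in this post.

```python
import hashlib

def md5_of_file(path):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def changed_paths(published_hashes, built_hashes):
    """Return paths that are new, or whose hash differs from the published one."""
    return sorted(
        path
        for path, md5 in built_hashes.items()
        if published_hashes.get(path) != md5
    )

# Hashes as listed in version.xml for the previous release.
published = {
    "async.pyc": "3a23dd6eff6fd6c1d0fbfcbfb0d57221",
    "atexit.pyc": "0d1ea490a18c65cec7ba8715b5ea9e69",
}
# Hashes of the freshly built files (the last two values are made up).
built = {
    "async.pyc": "3a23dd6eff6fd6c1d0fbfcbfb0d57221",
    "atexit.pyc": "ffffffffffffffffffffffffffffffff",
    "newmodule.pyc": "00000000000000000000000000000000",
}

print(changed_paths(published, built))  # ['atexit.pyc', 'newmodule.pyc']
```

In a real build, the second dictionary would be filled by calling md5_of_file on every generated file.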

Recently I discovered that our update files are much larger than they should be. We release a new version with just a couple of fixes in a single module, and the size of the update is half the size of the full install. As it turned out, most of the PYC files were marked as changed and added to the update. That didn’t seem right, especially for things like threading.pyc, which is a Python module that shouldn’t change unless you upgrade to a different version of Python, which we didn’t (still stuck at 2.4.4 I’m afraid). That got me curious enough to go digging in the apparently undocumented binary structure of the PYC files.

The documentation for the marshal module has this to say: “This module contains functions that can read and write Python values in a binary format. The format is specific to Python, but independent of machine architecture issues (e.g., you can write a Python value to a file on a PC, transport the file to a Sun, and read it back there). Details of the format are undocumented on purpose; it may change between Python versions (although it rarely does).”

The first thing I did was compare the two threading.pyc files – the one from the current distribution and the one just generated by the build script. The result showed there was a difference in only two bytes:

D:\Gooli\Dev\Temp\compare>fc /b threading-old.pyc threading-new.pyc
Comparing files threading-old.pyc and threading-new.pyc
00000004: 6E CE
00000005: F7 6A

Only two bytes differ, and they are right at the beginning of the file? That looks suspiciously like a version or a timestamp in the file header. Since the PYC file structure is undocumented, I went looking for the details in Python’s source code, but the answer was actually closer to home – in the compiler package. A file called pycodegen.py in Python\Lib\compiler contains the following code:

def getPycHeader(self):
    mtime = os.path.getmtime(self.filename)
    mtime = struct.pack('<i', mtime)
    return self.MAGIC + mtime

So, the PYC file header contains a magic number that identifies the Python release, followed by the modification time of the original source file as the number of seconds since the epoch. That shouldn’t be a problem – the threading module hasn’t changed and should have the same timestamp. But as we’ve seen, the PYC files were different. How can that be?
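To make the layout concrete, here is a small Python 3 sketch that builds and splits such an 8-byte header. It mirrors what getPycHeader does, but the function names, the placeholder magic number and the sample timestamp are assumptions for illustration. Note also that this is the layout of the Python 2.4 era; later Python versions use a longer header.

```python
import struct

# Placeholder magic number, for illustration only. Real values differ per
# Python release and are not reproduced here.
PLACEHOLDER_MAGIC = b"\x00\x00\r\n"

def build_header(magic, source_mtime):
    """Mimic getPycHeader: 4 magic bytes plus a little-endian 32-bit mtime."""
    return magic + struct.pack("<i", int(source_mtime))

def split_header(header):
    """Return (magic, mtime) from the first 8 bytes of an old-style PYC file."""
    magic = header[:4]
    (mtime,) = struct.unpack("<i", header[4:8])
    return magic, mtime

header = build_header(PLACEHOLDER_MAGIC, 1000000000)  # arbitrary sample mtime
print(len(header))           # 8
print(split_header(header))  # (b'\x00\x00\r\n', 1000000000)
```

Because the timestamp occupies bytes 4 to 7, a change in the source file's recorded modification time alters the PYC file even when the compiled code is identical.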

Acting on a hunch, I wrote a short script to read the header from the PYC file and print the embedded date:

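import struct
import time

def print_internal_date(filename):
    # Bytes 0-3 are the magic number; bytes 4-7 hold the source file's
    # modification time as a little-endian 32-bit integer.
    f = open(filename, "rb")
    data = f.read(8)
    f.close()
    mtime = struct.unpack("<i", data[4:])
    print(time.asctime(time.gmtime(mtime[0])))

print_internal_date("threading-old.pyc")
print_internal_date("threading-new.pyc")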

Which printed the following results:

Mon Mar 13 22:51:26 2006
Mon Mar 13 12:51:26 2006

Notice anything odd about them? They are exactly 10 hours apart. At first I thought I might actually be looking at two different versions of threading.py, but the chance of two edits being exactly 10 hours apart, right down to the second, is practically non-existent. It had to be something to do with time zones. I live and work in Israel, which is at GMT+2:00. The default timezone for Windows is Pacific time, which is GMT-8:00. Exactly 10 hours apart. However, no matter how I tweak the Regional Settings on my computer, all the PYC files I generate here have the same timestamp. Perhaps it has to do with the timezone you have set when you install Python. If I ever find out, I’ll let you know.
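As a sanity check, the two printed dates can be converted back to seconds since the epoch and compared with the bytes that fc reported at offsets 4 and 5. The sketch below is Python 3 and assumes, as the script does, that the printed dates are in UTC; the variable names are mine.

```python
import calendar
import struct

# The two dates printed by the script, as (year, month, day, hour, min, sec).
old = calendar.timegm((2006, 3, 13, 22, 51, 26))  # threading-old.pyc
new = calendar.timegm((2006, 3, 13, 12, 51, 26))  # threading-new.pyc

print(old - new)           # 36000 seconds
print((old - new) / 3600)  # 10.0 hours

# The first two bytes of the little-endian timestamp sit at file offsets
# 4 and 5, which is exactly where fc found the differences.
print(struct.pack("<i", old)[:2].hex())  # 6ef7
print(struct.pack("<i", new)[:2].hex())  # ce6a
```

The remaining two timestamp bytes are identical for both values, which is why fc reported only two differing bytes.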

But that wasn’t the point of this post. The point was to figure out what PYC files look like inside, and we did that, at least in part – they start with a magic number that is different for each Python version (check out the comments in import.c), followed by an embedded timestamp of the source file they were generated from. The rest is generated by the marshal module and can be read by it to get the code objects and the global data in the module.
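Here is a rough Python 3 sketch of that last point: it assembles an in-memory blob with the 8-byte header layout described in this post, then skips the header and hands the rest to marshal. The helper names, the placeholder magic number and the zero timestamp are assumptions, and the blob is built by the same interpreter that reads it, since the marshal format can change between Python versions.

```python
import marshal
import struct

LEGACY_HEADER_SIZE = 8  # 4 bytes of magic + 4 bytes of mtime, as described above

def make_legacy_style_pyc(source, filename, magic, mtime):
    """Compile source and prepend an old-style 8-byte header."""
    code = compile(source, filename, "exec")
    return magic + struct.pack("<i", mtime) + marshal.dumps(code)

def load_code(pyc_bytes, header_size=LEGACY_HEADER_SIZE):
    """Skip the header and unmarshal the module's code object."""
    return marshal.loads(pyc_bytes[header_size:])

# Placeholder magic and timestamp, for illustration only.
pyc = make_legacy_style_pyc("answer = 6 * 7\n", "demo.py", b"\x00\x00\r\n", 0)
code = load_code(pyc)

namespace = {}
exec(code, namespace)
print(code.co_filename)     # demo.py
print(namespace["answer"])  # 42
```

Only unmarshal data you trust; marshal is not designed to be safe against malicious input.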

Another thing to be learned from this is that we really should always build the Testuff client on the same machine, which is why I’m heading to the office right now to burn a copy of the VMware image I created with everything needed to build Testuff. We’ve got a new version with a couple of important fixes to our Mantis support to release today.

Comments on “Digging into Python’s PYC files”

By Dan Price. April 22nd, 2008 at 01:54

This post was very helpful – we’ve been developing a packaging system and hit exactly this problem with pyc files.

By Markus. May 18th, 2008 at 19:09

It would be easier to skip the first 8 bytes when calculating the MD5 hash, possibly adding an offset attribute to the tags.
If the file hasn’t changed elsewhere, the timestamp is irrelevant.

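import hashlib

def filehash(filename, offset=0):
    # Skip the first `offset` bytes (8 for the PYC header) before hashing.
    with open(filename, 'rb') as f:
        f.seek(offset)
        data = f.read()
    return hashlib.md5(data).hexdigest()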

By gooli. May 18th, 2008 at 19:28

You’re right, but then my version.xml file wouldn’t really contain hashes of the files, but of some modified version of the files. I’m nitpicking I guess, but this is code that somebody else will probably have to maintain some day and that’s just a little too confusing for my taste.
