One of the first things we needed to do when we started working on Testuff was to figure out how we were going to update the installed desktop clients. This is one of those problems that usually seems to fall under the NIH syndrome, and like many others before me, I invented my own scheme. The gist of it is a version.xml file that sits alongside the setup file for the newest release and looks something like this:
<?xml version="1.0" encoding="UTF-8"?>
<update-info version="0.8.0">
  <update file="TestuffSetup.exe" from-version="all"/>
  <update file="TestuffUpdate.exe" from-version="0.7.1"/>
  <md5hashes>
    <file md5="3a23dd6eff6fd6c1d0fbfcbfb0d57221" path="async.pyc"/>
    <file md5="0d1ea490a18c65cec7ba8715b5ea9e69" path="atexit.pyc"/>
    <file md5="166723a4330a98b573119326fc689322" path="base64.pyc"/>
    <file md5="01c1bda049936de570ed922424c057a8" path="BeautifulSoup.pyc"/>
  </md5hashes>
</update-info>
When the Testuff client launches, it gets the version.xml file from the server and compares its own version to the version attribute of the update-info tag. If the versions differ, it checks the update tags to see which update it should download and install. We generate two separate setup files: one, called TestuffUpdate.exe, updates the most recent version to the new one, and the other, TestuffSetup.exe, updates all the other (older) versions.
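To make the lookup concrete, here is a minimal sketch of how a client might pick its update file. It is not the actual Testuff code: the function name choose_update is made up for illustration, and I'm assuming a plain string comparison of versions and that "all" acts as the fallback entry.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the version.xml shown above (hashes omitted).
VERSION_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<update-info version="0.8.0">
  <update file="TestuffSetup.exe" from-version="all"/>
  <update file="TestuffUpdate.exe" from-version="0.7.1"/>
</update-info>"""


def choose_update(xml_text, installed_version):
    """Return the update file to download, or None if already up to date.

    Hypothetical helper: assumes versions are compared as plain strings.
    """
    root = ET.fromstring(xml_text)
    if root.get("version") == installed_version:
        return None  # nothing to do
    fallback = None
    for update in root.findall("update"):
        source = update.get("from-version")
        if source == installed_version:
            return update.get("file")  # exact match wins
        if source == "all":
            fallback = update.get("file")  # used when no exact match exists
    return fallback


print(choose_update(VERSION_XML, "0.7.1"))  # TestuffUpdate.exe
print(choose_update(VERSION_XML, "0.6.0"))  # TestuffSetup.exe
print(choose_update(VERSION_XML, "0.8.0"))  # None
```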
Aside from the info about which version of the client should use which update file, version.xml also contains the MD5 hashes for each file in the distribution. That might seem like a lot of wasted space and time, but it's actually there for a very good reason. When our setup building script is creating TestuffUpdate.exe, it too downloads version.xml from our server. It then determines which files have changed or been added since the last version by comparing the MD5 hashes in version.xml to the hashes of the actual files generated by the build. Any file that is different is added to the update, so we can be sure we haven't missed any essential component.
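The comparison step could look roughly like the sketch below. The helper names (md5_of_file, published_hashes, changed_files) are hypothetical, and I'm assuming the path attributes are relative to the build directory and use forward slashes; our real build script differs in the details.

```python
import hashlib
import os
import xml.etree.ElementTree as ET


def md5_of_file(path):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def published_hashes(xml_text):
    """Map each path listed under <md5hashes> to its published MD5."""
    root = ET.fromstring(xml_text)
    return dict((node.get("path"), node.get("md5"))
                for node in root.findall("md5hashes/file"))


def changed_files(build_dir, old_hashes):
    """List files under build_dir that are new or whose MD5 differs."""
    changed = []
    for dirpath, _dirs, filenames in os.walk(build_dir):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Assumption: paths in version.xml are relative, with "/" separators.
            rel = os.path.relpath(full, build_dir).replace(os.sep, "/")
            if old_hashes.get(rel) != md5_of_file(full):
                changed.append(rel)
    return sorted(changed)
```

Note that this compares whole files byte for byte (via the hash), so even a single differing header byte marks a file as changed.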
Recently I discovered that our update files are much larger than they should be. We release a new version with just a couple of fixes in a single module, and the size of the update is half the size of the full install. As it turned out, most of the PYC files were marked as changed and added to the update. That didn't seem right, especially for things like threading.pyc, which is a Python module that shouldn't change unless you upgrade to a different version of Python, which we didn't (still stuck at 2.4.4, I'm afraid). That got me curious enough to go digging into the apparently undocumented binary structure of PYC files.
The documentation for the marshal module says as much: "This module contains functions that can read and write Python values in a binary format. The format is specific to Python, but independent of machine architecture issues (e.g., you can write a Python value to a file on a PC, transport the file to a Sun, and read it back there). Details of the format are undocumented on purpose; it may change between Python versions (although it rarely does)."
The first thing I did was compare the two threading.pyc files: the one from the current distribution and the one just generated by the build script. The result showed a difference in only two bytes:
D:GooliDevTempcompare>fc /b threading-old.pyc threading-new.pyc
Comparing files threading-old.pyc and threading-new.pyc
00000004: 6E CE
00000005: F7 6A
Only two bytes differ, and they are right at the beginning of the file? That looks suspiciously like a version or a timestamp in the file header. Since the PYC file structure is undocumented, I went looking for the details in Python’s source code, but the answer was actually closer to home – in the compiler package. A file called pycodegen.py in Python\Lib\compiler contains the following code:
mtime = os.path.getmtime(self.filename)
mtime = struct.pack('<i', mtime)
return self.MAGIC + mtime
So, the PYC file header contains a magic number that identifies the Python release, followed by the modification time of the original source file as the number of seconds since the epoch, packed as a 4-byte little-endian integer. That shouldn't be a problem: the threading module hasn't changed and should have the same timestamp. But as we've seen, the PYC files were different. How can that be?
Acting on a hunch, I wrote a short script to read the header from the PYC file and print the embedded date:
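import struct
import time

def print_internal_date(filename):
    f = open(filename, "rb")
    data = f.read(8)  # 4-byte magic number followed by the 4-byte timestamp
    f.close()
    mtime = struct.unpack("<i", data[4:])[0]
    # Assumption: the date is shown as UTC, which is consistent with the output below
    print(time.asctime(time.gmtime(mtime)))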
Which printed the following results:
Mon Mar 13 22:51:26 2006
Mon Mar 13 12:51:26 2006
Notice anything odd about them? They are exactly 10 hours apart. At first I thought I might actually be looking at two different versions of threading.py, but the chances of two edits being exactly 10 hours apart, right down to the second, are practically non-existent. It had to be something to do with time zones. I live and work in Israel, which is at GMT+2:00. The default timezone for Windows is Pacific time, which is GMT-8:00. Exactly 10 hours apart. However, no matter how I tweak the Regional Settings on my computer, all the PYC files I generate here have the same timestamp. Perhaps it has to do with the timezone you have set when you install Python. If I ever find out, I'll let you know.
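As a sanity check, the two printed dates can be tied back to the two bytes that fc reported. The sketch below assumes the dates are UTC (the variable and function names are mine); it packs both timestamps the way the PYC header does and lists the header offsets where they differ.

```python
import calendar
import struct
import time

OLD_DATE = "Mon Mar 13 22:51:26 2006"  # from threading-old.pyc
NEW_DATE = "Mon Mar 13 12:51:26 2006"  # from threading-new.pyc


def to_epoch(asctime_text):
    """Parse an asctime-style date as UTC and return seconds since the epoch."""
    return calendar.timegm(time.strptime(asctime_text, "%a %b %d %H:%M:%S %Y"))


old_mtime = to_epoch(OLD_DATE)
new_mtime = to_epoch(NEW_DATE)

# The header stores the timestamp as a little-endian 32-bit integer.
old_bytes = bytearray(struct.pack("<i", old_mtime))
new_bytes = bytearray(struct.pack("<i", new_mtime))

# The timestamp starts at offset 4 of the file, right after the magic number.
differing = [(4 + i, a, b)
             for i, (a, b) in enumerate(zip(old_bytes, new_bytes))
             if a != b]

for offset, a, b in differing:
    print("%08X: %02X %02X" % (offset, a, b))
# 00000004: 6E CE
# 00000005: F7 6A

print(old_mtime - new_mtime)  # 36000 seconds, i.e. exactly 10 hours
```

Only the two low-order bytes change because a 36,000-second shift doesn't carry into the upper half of the integer here.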
But that wasn't the point of this post. The point was to figure out what PYC files look like inside, and we did that, at least in part. They start with a magic number that is different for each Python version (check out the comments in import.c), followed by an embedded timestamp of the source file they were generated from. The rest is generated by the marshal module and can be read by it to get the code objects and the global data in the module.
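Here is a small round trip that mimics that layout, written for Python 3. It is only a sketch: the magic value is a placeholder rather than a real release's number, the function names are invented, and the 8-byte header matches the old format discussed here. Newer Python releases use a longer header (a source size field was added in 3.3 and a flags field in 3.7), so don't point this at a modern PYC file as is.

```python
import marshal
import struct

# Placeholder: real magic numbers differ per Python version (see import.c).
PLACEHOLDER_MAGIC = b"\x00\x00\r\n"


def build_old_style_pyc(source, mtime, magic=PLACEHOLDER_MAGIC):
    """Return bytes laid out as: 4-byte magic, 4-byte mtime, marshalled code."""
    code = compile(source, "<example>", "exec")
    return magic + struct.pack("<i", mtime) + marshal.dumps(code)


def split_old_style_pyc(data):
    """Split old-layout PYC bytes into (magic, mtime, code object)."""
    magic = data[:4]
    mtime = struct.unpack("<i", data[4:8])[0]
    code = marshal.loads(data[8:])  # everything after the header is marshal data
    return magic, mtime, code


data = build_old_style_pyc("answer = 6 * 7", 1142290286)
magic, mtime, code = split_old_style_pyc(data)

namespace = {}
exec(code, namespace)  # run the recovered module-level code object
print(mtime, namespace["answer"])
```

Marshal data is only guaranteed to load in the same Python version that wrote it, which is why the example builds and reads the bytes in one process.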
Another thing to be learned from this is that we really should always build the Testuff client on the same machine, which is why I'm heading to the office right now to burn a copy of the VMware image I created with everything needed to build Testuff. We've got a new version with a couple of important fixes to our Mantis support to release today.