November 16, 2009

Deltarpm problems (Part I)

In my last post, I talked a bit about how deltarpm’s delta algorithm actually works, especially in relation to binary files.

Today, I want to look at one of the real-world problems that has a large effect on the size of some of our deltas.

Background

Our problem stems from how RPM deals with multiple arches. All binary files in an RPM are labeled with a “color”, that is, the architecture the RPM is built for. When two different arches of the same package are installed and binary files conflict, the binary files whose color matches the system color are the ones that are kept.

For example, let’s imagine that you install samba-common.x86_64 on an x86_64 Fedora installation. You then install samba-common.i686 because some program requires 32-bit libnetapi.so. This isn’t a problem because the 64-bit version of libnetapi.so is in /usr/lib64 while the 32-bit version is in /usr/lib.

But there’s no such thing as /usr/bin64, so what about the files installed into /usr/bin by samba-common? Because our system “color” is x86_64, the 32-bit /usr/bin executables from samba-common.i686 never actually get installed. If you run the file command on /usr/bin/pdbedit, it will tell you that it’s a 64-bit binary.

The Problem

While this is (generally) what you want on your system, it leaves us in a bit of a pickle for deltarpms. One of the requirements of a deltarpm is that it must build back perfectly into the original rpm.

Let’s imagine that we are now wanting to upgrade to a newer version of samba-common.i686 with the only change being a minor typo being fixed in the documentation. (Why you would want to upgrade for that is beyond the scope of this article.)

At first glance, this is an excellent place for a deltarpm. A small change in the documentation should result in a very small deltarpm. But when we try to apply our deltarpm to the currently installed samba-common.i686, we run into a major problem:

The currently installed 64-bit /usr/bin/pdbedit doesn’t match the 32-bit /usr/bin/pdbedit we need for our new rpm.

Bam! The delta fails, and yum-presto proceeds to download the full 14MB samba-common.i686 package.

The Solution

We became aware of this problem a few years back when 32-bit packages were common in 64-bit installs (this was back in the day when OpenOffice.org was still 32-bit only). We ended up pushing a patch into deltarpm that forces all colored executables not installed into multilib directories to be included in the deltarpm. This way, we never use the locally installed executables and the whole color problem disappears.

The Tradeoff

But this leaves us with another problem. We end up losing all delta savings on any binaries in /usr/bin. For samba-common, that’s 27MB (uncompressed) of binaries that we have to redownload every time we get an update.

Which is exactly the problem that deltarpm is supposed to solve.

Broken bicycle credit: unwanted & undesired … just let it rot by notsogoodphotography on Flickr. Used under a Creative Commons Attribution 2.0 license.

November 6, 2009

On binary delta algorithms

Yesterday I was reading this update (warning: still subscriber-only) on the Courgette delta algorithm that will be used by Google to push Chrome updates.

The article focused on potential patent problems with the Courgette method, but also described how the Courgette method works and also suggested that deltarpm (which I maintain in Fedora) could learn from it.

While not wanting to argue that particular point (I would love to get a hold of the code and see how much it would actually help our deltarpms), I do want to correct a couple of misconceptions about deltarpm.

Courgette, deltarpm and bsdiff (the tool) are all based on the bsdiff algorithm, which is very efficient. The problem comes when dealing with binaries. One small change in the source code will normally result in many small changes through the binary as the locations of pointers change.

(In the graphs provided, location two is a byte of data that has changed, while locations 3-6 and 21-24 represent two relative jump pointers to the same location. In the new binary, the location has changed by 74 bytes.)

The first graph is that of a naive delta using the bsdiff algorithm.

Note that the delta is four bytes of not-very-compressible data (out of 12). Not so great.

The second graph is roughly what courgette does with the same data. It does some disassembly to convert the relative pointers into labels. When applying the delta, courgette will disassemble the old file, apply the delta, and then reassemble the resulting not-quite-machine code into machine code.

Note that we’re down to a one-byte delta! Wonderful!

Now, the final example is how both the bsdiff tool and deltarpm work. They use the algorithm Colin Percival describes here which has an interesting twist when dealing with binaries.

When the locations change in a binary, they tend to do so in a consistent way. As deltarpm is doing its comparison, it adds an extra step. If less than half of the bytes in a segment have changed, deltarpm will do a byte-wise subtraction of the old binary’s bytes from the new binary’s bytes. This is then stored in an “add block”, which will mostly contain zeros.

While this does give us a block the size of the old binary, this block is very highly compressible (especially using bzip2).

While we may not be getting the compression levels that we might using courgette, we are doing far better than we would be using a naive bsdiff algorithm.

While courgette obviously has the advantage of size, it’s primary disadvantage is that it is very archicture-specific. The algorithm must know how to both assemble and disassemble binaries for each architecture it’s built for.

The advantage of the “add block” method is that it is generic to all architectures, but its disadvantage is that it won’t reach the same delta levels as courgette.

Having said that, I still haven’t had a chance to do any direct comparisons, which may tell us how much better courgette is than deltarpm.

(Updated: Tried to make it clear that bsdiff (the tool) doesn’t just use a naive bsdiff algorithm)