Brittle deltas – a possible solution?

Deltarpm is brittle. When it works correctly, it’s brilliant. But, like a tightrope walker crossing Niagara Falls while balancing an egg on his head, all it takes is one slip and…*splat*.

At the beginning of the Fedora 15 release cycle, a new version of xz was pushed in which the defaults for compression level 3 were changed (as far as I can tell, to what used to be level 4). This doesn’t cause any problems for newly compressed data, but if you decompress an rpm whose payload was compressed using old level 3 (like makedeltarpm does) and then recompress it with new level 3 (like applydeltarpm does), the compressed files no longer match. *Splat*.
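
Python’s lzma module wraps the same liblzma that xz uses, so the failure is easy to demonstrate. I’m not trying to recreate the exact old and new level 3 settings here; any two presets that differ show the same effect:

    import lzma

    payload = b"pretend this is an rpm payload" * 4096

    # makedeltarpm effectively remembers the original compressed payload...
    old = lzma.compress(payload, preset=1)

    # ...and applydeltarpm recompresses the reconstructed data. If the
    # settings have changed out from under us, the data itself is fine...
    new = lzma.compress(lzma.decompress(old), preset=9)
    print(lzma.decompress(new) == payload)  # True

    # ...but the compressed bytes no longer match, so the rebuilt rpm
    # fails its checksum. *Splat*.
    print(new == old)  # False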

I wrote about the root problem here over a year ago, but to summarize: almost no compression tool guarantees that, across releases, it will produce the same compressed output for the same uncompressed input.

Our fix for Fedora 15 was pretty simple: delete all of the old deltarpms in Rawhide. As long as users have the new xz before doing a yum update, all new deltarpms will work correctly. Yay.

The problem is that this is all still extremely fragile. Take Fedora bugs #524720, #548523, and #677578, for example. All three cropped up because of mistakes in handling changes in the compression format, and it’s all a bit ridiculous. Would anyone use gzip if an old version couldn’t decompress data compressed with a newer version?

A possible solution?

There is no simple solution. So what if we change the rules? Instead of trying to keep the compression algorithms static, what if we stored just enough information in the deltas to recompress using the exact same settings, whatever they are?

For gzip, this would mean recording things like each block size, dictionary, etc. For xz, it would mean recording the LZMA2 settings. The problem is that this information is different for each compression type and the functions to extract the needed information haven’t been included in any compression libraries (to my knowledge).
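
As a sketch of what this could look like for xz, Python’s lzma module already lets you replay an explicit LZMA2 filter chain. The parameters below are the ones liblzma exposes; whether they’re sufficient for every stream in the wild is exactly the open question, and the missing piece is still a function that recovers them from an existing stream:

    import lzma

    # A hypothetical record of the settings the original payload was
    # compressed with (these values match liblzma's preset 6 defaults).
    settings = {
        "id": lzma.FILTER_LZMA2,
        "dict_size": 8 * 1024 * 1024,
        "lc": 3, "lp": 0, "pb": 2,
        "mode": lzma.MODE_NORMAL,
        "nice_len": 64,
        "mf": lzma.MF_BT4,
        "depth": 0,
    }

    payload = b"pretend this is an rpm payload" * 4096
    original = lzma.compress(payload, format=lzma.FORMAT_XZ,
                             filters=[settings])

    # Replaying exactly the same settings (with the same liblzma) gives
    # a byte-for-byte identical stream, no matter what the defaults are.
    roundtrip = lzma.compress(lzma.decompress(original),
                              format=lzma.FORMAT_XZ, filters=[settings])
    print(roundtrip == original)  # True

The container-level details (like which integrity check the stream uses) would need recording too, but the principle is the same.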

However, if we could write these functions and get them into the upstream libraries, it would benefit all programs that try to generate deltas. Deltarpm would continue to work when compression algorithms change. Rsync could actually delta gzipped files, even if gzip’s “--rsyncable” switch hadn’t been used.

There are a couple of possible problems with this solution. First, I’m not sure how much extra information would need to be stored. It obviously differs for each compression format, but unless it’s at most 1/100th the size of the uncompressed file, storing the extra data in the deltarpm probably won’t be worth the effort.

Second, no code has actually been written. In an open source world of “Show me the code”, this is obviously a major issue. I’d love to do a reference implementation for one of the simpler compression formats (like zlib), but just haven’t had the time yet.
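
For what it’s worth, the replay half is the easy part. Here’s roughly what it might look like for zlib, with the “recipe” simply assumed to be known in advance; recovering it (and any flush points) from an existing stream is the part nobody has written:

    import zlib

    # Hypothetical record of the parameters zlib's compressobj() takes.
    recipe = {"level": 6, "method": zlib.DEFLATED, "wbits": 15,
              "memLevel": 8, "strategy": zlib.Z_DEFAULT_STRATEGY}

    def recompress(data, recipe):
        c = zlib.compressobj(recipe["level"], recipe["method"],
                             recipe["wbits"], recipe["memLevel"],
                             recipe["strategy"])
        return c.compress(data) + c.flush()

    payload = b"pretend this is a gzipped file" * 4096
    original = recompress(payload, recipe)

    # Replaying the recipe reproduces the stream exactly.
    print(recompress(zlib.decompress(original), recipe) == original)  # True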

Obviously, the best solution would be for the various upstreams to provide the necessary functions, as they understand both their algorithms and what information should be stored. However, most upstreams have enough on their plates without needing extra stuff thrown in from random blogs.

Another good solution would be for someone who is interested in deltas and compression to take on this project themselves. Any volunteers? 🙂

Broken eggs credit: Broken Eggs by kyle tsui. Used under CC BY-NC-ND

Config Caching Filesystem (ccfs)

One of the problems we’ve had to deal with on our servers is high load on the fileserver that holds the user directories. I haven’t worked out if it’s because we’re using standard workstation hardware for our servers, or if it’s a btrfs problem.

The strange thing is that the load will shoot up at random times when the network shouldn’t be that taxed, and then be fine when every computer in the school has someone logged into it.

Anyhow, we reached a point where the load on the server hit something like 60 and the workstations would lock for sixty seconds (or more) while waiting for the NFS server to respond again. This seemed to happen most often when all of the students in the computer room opened Firefox at the same time.

In a fit of desperation, I threw together a Python FUSE filesystem that I have cunningly called the Config Caching Filesystem (or ccfs for short). The concept is simple: a user’s home directory at /netshare/users/[username] is essentially bind-mounted to /home/[username] using ccfs.

The thing that separates ccfs from a simple fuse bind-mount is that every time a configuration file (one that starts with a “.”) is opened for writing, it is copied to a per-user cache directory in /tmp and opened for writing there. When the user logs out, /home/[username] is unmounted, and all of the files in the cache are copied back to /netshare/users/[username] using rsync. Any normal files are written directly to /netshare/users/[username], bypassing the cache.
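
This isn’t the actual ccfs code, but the core path-routing idea fits in a few lines of fusepy. All names and paths here are invented, a real filesystem needs the rest of the Operations methods (getattr, readdir, and so on), and the real thing only caches on opens for writing:

    import os
    import shutil
    from fuse import FUSE, Operations

    class Ccfs(Operations):
        def __init__(self, backing, cache):
            self.backing = backing  # /netshare/users/[username]
            self.cache = cache      # per-user cache directory in /tmp

        def _real(self, path):
            rel = path.lstrip("/")
            # Configuration files live in the cache; everything else
            # goes straight through to the NFS server.
            if os.path.basename(path).startswith("."):
                cached = os.path.join(self.cache, rel)
                if not os.path.exists(cached):
                    os.makedirs(os.path.dirname(cached), exist_ok=True)
                    src = os.path.join(self.backing, rel)
                    if os.path.exists(src):
                        shutil.copy2(src, cached)  # first touch: copy in
                return cached
            return os.path.join(self.backing, rel)

        def open(self, path, flags):
            return os.open(self._real(path), flags)

        def create(self, path, mode, fi=None):
            return os.open(self._real(path),
                           os.O_WRONLY | os.O_CREAT | os.O_TRUNC, mode)

    # Mounted per-user at login, something like:
    # FUSE(Ccfs("/netshare/users/bob", "/tmp/ccfs-bob"), "/home/bob")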

Now the only time the server is being written to is when someone actually saves a file or when they log out. The load on the server rarely goes above five, and even then it’s only when everyone is logging out simultaneously, and the server recovers quickly.

A few bugs have cropped up, but I think I’ve got the main ones. The biggest was that some students were resetting their desktops when the system didn’t log out quickly enough, and were ending up with corrupted configuration directories, mainly for Firefox. I fixed that by using --delay-updates with rsync, so you either get the fully updated configuration files or you’re left with the ones that were there when you logged in.
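
The logout-time copy-back then boils down to a single rsync call, something like this (paths invented):

    import subprocess

    # --delay-updates stages the transferred files on the destination and
    # renames them into place at the end of the run, so an interrupted
    # sync leaves the old configuration intact rather than a mix of both.
    subprocess.run(["rsync", "-a", "--delay-updates",
                    "/tmp/ccfs-bob/", "/netshare/users/bob/"], check=True)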

I do think this solution is a bit hacky, but it’s had a great effect on the responsiveness of our workstations, so I’ll just have to live with it.

Ccfs is available here for those interested, but if it breaks, you get to keep both pieces.

Jetpack credit: Fly with U.S. poster by Tom Whalen. Used under CC BY-NC-ND