The return of Mr. Sparky

Power strip

As I mentioned in a previous post, I am most definitely not an electrician. Take a close look at the picture on the right. This is a normal power strip. Sorry, this was a normal power strip. If you look closely, you’ll notice what looks like a black line running along it, with occasional ruptures and what look like silver beads coming out of it. There’s a story behind that line…

So I’m finally finishing up the computer room in the mountains, and the last problem to deal with is the computer room UPS’s. They’ve been there for years and need to be replaced. The crazy thing is that it’s cheaper to buy several small UPS’s (one for every two computers) than a couple of large UPS’s. So we get ten small UPS’s and put them on top of every other computer in the room. I then grab the plugs from the three sections of the computer room.

Now, at this point I should clarify that I did not design the computer room, and it wasn’t my choice to have the power leads for all three sections come out in one place. However, I am the idiot who chooses to plug the three leads into a single power strip. A single thin-cabled power strip. Specifically the single thin-cabled power strip pictured above. I then plug the power strip into the wall. I turn to my wife (who is helping, that’s love for you) and say, “Ok, turn on a few of the computers.” All of the monitors (most of which are CRTs) are already on. As the first computer starts booting, it happens…

*BZRT*

*fizzle* *fizzle*

Before my eyes, the power strip starts smoking! I immediately reach in to pull it out of the wall. A spark shoots out right next to my hand, and then an open flame that distinctly resembles a flame-thrower. I change my mind and pull my hand back.

By this point smoke is pouring out the whole length of the power strip’s cable. And I’m praying that it burns itself out before shorting out the school’s (and possibly the village’s) electricity supply. Finally, the fizzling dies away and I reach in and pull out the strip. The entire wire has been burned through, and there are ruptures every foot or so, with beaded metal that used to be the wire.

So, I go and grab two more power strips. One thick one for the UPS’s and one thinner one for the monitors. I plug them into the wall, and…we have power! And no sparks. And no fizzles.

The computer room

So, the school’s computer room is now finished. Twenty-two computers, eleven Core 2 Duo’s with 15″ widescreen LCDs and eleven mishmash Celerons with CRTs that look like they were stolen from the ark. But all of them are running Fedora 13 with gnome-shell, and all in all, the room looks good. And now I’m back to my normal job which doesn’t include messing around with electricity (at least, beyond the point of plugging my laptop in).

I am not an electrician. I most definitely do not wish to be an electrician.

btrfs on the server

As mentioned back here and here, our current server setup looks something like this:

Current server configuration

One thing not noted in the diagram is that fileserver, our DNS server, LDAP server, web server, and a few others all run as virtual machines on storage-server01 and storage-server02.

The drawback to this is that when disk I/O gets heavy, our virtual machines start struggling, even though they’re on separate hard drives.

Another problem with our current system is that we don’t have a good method of backup. Replication, yes, but if a student accidentally runs rm ./ -rf in their home directory, it’s gone.

So, with a bit of time over the summer after I’ve set up the school’s Fedora 13 image, I thought I’d tackle these problems. We now have three new “servers” (well, 2GB desktop systems with lots of big hard drives shoved in them). Our data has been split into three parts, and each server is primary for one part and backup for another.
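The primary/backup split can be pictured as a ring, where each server also holds the backup copy of a neighbour's part. The sketch below is only an illustration of that idea: the server names, the part names, and the choice of which neighbour holds which backup are all assumptions, not the actual layout.

```python
def ring_layout(servers, parts):
    """Give each data part a primary server and a backup on the next server in the ring."""
    if len(servers) != len(parts):
        raise ValueError("this sketch expects exactly one data part per server")
    layout = {}
    for i, part in enumerate(parts):
        primary = servers[i]
        # The backup lives on the next server, wrapping around at the end.
        backup = servers[(i + 1) % len(servers)]
        layout[part] = {"primary": primary, "backup": backup}
    return layout


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    servers = ["datastore01", "datastore02", "datastore03"]
    parts = ["part-a", "part-b", "part-c"]
    for part, roles in ring_layout(servers, parts).items():
        print(f"{part}: primary on {roles['primary']}, backup on {roles['backup']}")
```

With three servers, every machine ends up primary for exactly one part and backup for exactly one other, so losing a single server never takes out both copies of any part.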

The advantage? Now our virtual machines have full use of the (now misnamed) storage-servers01-2, both of which are still running CentOS 5.5. Our three new datastore servers, running Fedora 13, now share the load that was being put on one storage-server.

But this doesn’t solve the backup problem. A few years back, I experimented with LVM snapshots, but they were just way too slow. Ever since then, though, I’ve been very interested in the idea of snapshots and btrfs has them for free (at least in terms of extra IO, and I’m not too worried about space). Btrfs also handles multiple devices just fine, which means goodbye LVM. With btrfs, our new setup looks something like this:

New server configuration

I have hit a couple of problems, though. By default, btrfs will RAID1 metadata if you have more than one device in a btrfs filesystem. I’m not sure whether my problem was related to this, but when I tried to manually balance the user filesystem, which was spread across a 2TB and a 1TB disk, I got -ENOSPC, a kernel panic, and a filesystem that was essentially read-only. This happened when the data on the drive was under 800GB (though most of the files are small hidden files in our users’ home directories). After checking out the btrfs wiki, I upgraded the kernel to the latest 2.6.34 available from koji (at that point in time), and then copied the data over to a newly created filesystem with RAID0 metadata and data (after all, my drives are already RAID1 using DRBD). A subsequent manual balance had no problems at all.
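For reference, here is a rough sketch of what creating such a filesystem and balancing it might look like. It only builds the command lines and does not run them. The device paths and mount point are hypothetical, and the balance syntax shown is the one used by current btrfs-progs, which may differ from the tools that shipped alongside a 2.6.34 kernel.

```python
def mkfs_command(devices, profile="raid0"):
    """Build an mkfs.btrfs command using one profile for both data and metadata."""
    if len(devices) < 2:
        raise ValueError("this sketch expects a multi-device filesystem")
    # -d sets the data profile, -m sets the metadata profile.
    return ["mkfs.btrfs", "-d", profile, "-m", profile, *devices]


def balance_command(mount_point):
    """Build a manual balance command (current btrfs-progs syntax)."""
    return ["btrfs", "balance", "start", mount_point]


if __name__ == "__main__":
    # Hypothetical DRBD-backed devices and mount point, for illustration only.
    print(" ".join(mkfs_command(["/dev/drbd0", "/dev/drbd1"])))
    print(" ".join(balance_command("/srv/users")))
```

Passing the same profile to both `-d` and `-m` is what overrides the default of mirrored metadata on a multi-device filesystem.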

The second problem is not so easily solved. I wanted to do a speed comparison between our new configuration and our old one, so I ran bonnie++ on all of the computers in our main computer lab. I set it up so each computer was running its own instance in a different directory on the NFS share (/networld/bonnie/$HOSTNAME).
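A minimal sketch of the per-host setup, assuming each lab machine runs something like this itself. The helper name is made up for illustration, and only the `-d` option (the directory bonnie++ writes its test files into) is used; every other bonnie++ setting is left at its default, which may not match the original test.

```python
import socket


def bonnie_command(base_dir, hostname=None):
    """Build a bonnie++ command line that works inside a per-host directory."""
    host = hostname or socket.gethostname()
    target = f"{base_dir.rstrip('/')}/{host}"
    # -d selects the directory bonnie++ uses for its test files.
    return ["bonnie++", "-d", target]


if __name__ == "__main__":
    # The per-host directory has to exist before bonnie++ is started.
    print(" ".join(bonnie_command("/networld/bonnie")))
```

Giving every client its own directory keeps the machines from overwriting each other's test files, so they only compete for the server's disk and network.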

Yes, I knew it would take a while (and stress-test the server), but that’s the point, right? The server froze after a few minutes. No hard drive activity. No network activity. The flashing cursor on the display stopped flashing (and, yes, it’s in runlevel 3). Num lock and caps lock don’t change color. Nothing in any logs. Frozen dead.

I rebooted the server, and tried the latest 2.6.33 kernel. After a few minutes of the stress test, it was doing a great imitation of an ice cube. I tried a 2.6.35 Fedora 14 kernel rebuilt for Fedora 13 that I had discarded because of a major drop in DRBD sync speed. This time the stress test barely made it 30 seconds.

So where does that leave me? Tomorrow I plan on running the stress test on our old CentOS server. If it freezes too, then I’m not going to worry too much. It hasn’t ever frozen like that with normal use, so I’ll just put it down to NFS disliking 30+ computers writing gigabytes of data at the same time. I did file this bug report, but I’m not sure if I’ll hear anything on it. It’s kind of hard to track down a problem if there aren’t any error messages on screen or in the logs.

The good news is that I do have daily snapshots set up, shared read-only over NFS, that get deleted after a week. So now we have replication and backups.
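The daily rotation could be sketched roughly like this. It is not the actual script: the `daily-YYYY-MM-DD` naming scheme, the paths, and the function names are assumptions, and the commands are only built, not executed. The `btrfs subvolume snapshot -r` and `btrfs subvolume delete` forms are those of current btrfs-progs.

```python
from datetime import date, timedelta

PREFIX = "daily-"  # hypothetical naming scheme: daily-YYYY-MM-DD


def snapshot_name(day):
    """Name the snapshot taken on the given day."""
    return PREFIX + day.isoformat()


def expired_snapshots(names, today, keep_days=7):
    """Return the snapshot names taken keep_days or more before today."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        if not name.startswith(PREFIX):
            continue  # not one of our daily snapshots
        try:
            taken = date.fromisoformat(name[len(PREFIX):])
        except ValueError:
            continue  # name does not end in a valid date
        if taken <= cutoff:
            expired.append(name)
    return sorted(expired)


def plan_commands(subvolume, snapshot_dir, existing, today, keep_days=7):
    """Build the commands for today's read-only snapshot and for the expired ones."""
    commands = [["btrfs", "subvolume", "snapshot", "-r", subvolume,
                 f"{snapshot_dir}/{snapshot_name(today)}"]]
    for name in expired_snapshots(existing, today, keep_days):
        commands.append(["btrfs", "subvolume", "delete", f"{snapshot_dir}/{name}"])
    return commands


if __name__ == "__main__":
    # Hypothetical paths and snapshot list, for illustration only.
    today = date(2010, 8, 10)
    existing = [snapshot_name(today - timedelta(days=n)) for n in range(1, 10)]
    for cmd in plan_commands("/srv/users", "/srv/snapshots", existing, today):
        print(" ".join(cmd))
```

Keeping the expiry logic separate from the command building makes it easy to check which snapshots would be removed before anything is actually deleted.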

I’d like to keep this configuration, but that depends on whether the server freeze bug will show up in real-world use. If it does, we’ll go back to CentOS on the three servers, and probably use ext4 as the base filesystem.

Update: 08/26/2010 After adding a few boot options, I finally got the logs of the freeze from the server. It looks like it’s a combination of relatively low RAM and either a lousy network card design or a poor driver. Switching the motherboard has mitigated the problem, and I’m hoping to get some more up-to-date servers with loads more RAM.