Yubikey - PIV vs Security Key

At my day job, we’ve just purchased Yubikeys for my team to help in the never-ending process of securing our infrastructure. While we’re looking at implementing MFA in a number of places, the starting point is securing our SSH connections to our servers. We use FreeIPA to manage authorization and authentication through SSH, so key management is pretty straightforward. The real question is how best to secure an SSH key using a Yubikey. There are two main options: setting up a PIV key on the Yubikey, or creating an OpenSSH Security Key (SK) key that requires the Yubikey to log in.

I tried out the SK key first because the documentation made it look like the easier option to set up, and (perhaps surprisingly) it was! Generating the key was a piece of cake. From a security point of view, I prefer it because the key is stored on my laptop and can be protected with a passphrase, so theft of the Yubikey alone isn’t enough to compromise the key. Using the key is simple too: I just need to have my Yubikey plugged into my laptop and tap it after initiating the SSH session.
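For reference, generating an SK-backed key looks something like this (the file name, comment, and host are just examples):

    # Generate an ed25519 SK key; the Yubikey must be plugged in and
    # touched when prompted. The passphrase protects the key file on disk.
    ssh-keygen -t ed25519-sk -f ~/.ssh/id_ed25519_sk -C "work-yubikey"

    # Use it like any other key; you'll be asked to touch the Yubikey on login.
    ssh -i ~/.ssh/id_ed25519_sk user@server.example.com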

The first problem that came up is that our servers run an in-house rpm-ostree distribution based on AlmaLinux 8, and the version of OpenSSH available there doesn’t support SK keys. This problem was easily resolved by taking Fedora’s OpenSSH builds and rebuilding them for our distribution.

The second problem could not be solved as easily, and has, unfortunately, caused me to abandon SK keys. My team uses Ansible extensively, and we always deploy our changes using our own SSH keys so we can audit who has performed the changes. Because Ansible re-uses SSH connections, you only have to tap the Yubikey once when deploying a change to a single server. However, when deploying a change to many servers (we have over 100 call servers around the world), you have to tap the Yubikey for every. single. server. This turns a minor speed bump into an insurmountable roadblock.
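For context, Ansible’s connection reuse comes from its default SSH settings, which look roughly like this in ansible.cfg (values shown here are illustrative, not our exact config):

    [ssh_connection]
    # ControlMaster/ControlPersist keep one SSH connection open per host,
    # which is why a single server only needs one touch - but the control
    # socket is per host, so 100 servers still means 100 touches.
    ssh_args = -C -o ControlMaster=auto -o ControlPersist=60s
    pipelining = True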

This brought me to our second option, PIV keys. I’d passed them up because setting them up is anything but simple, but most of the pain can be abstracted away, and the extra libraries are only required on the system that has the Yubikey connected. The downside is that PIV keys are stored directly on the Yubikey (as a certificate, if I understand correctly), which means I now need to set a PIN on the Yubikey (otherwise someone could just plug it into their computer and use my SSH key) and run extra commands to load the key into my SSH agent every time I insert the Yubikey. I’m also limited to storing a single SSH key on my Yubikey.
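I won’t reproduce our full setup here, but a rough sketch of what’s involved, using ykman and the Yubico PKCS#11 module, looks like this (slot, algorithm, and library path are illustrative, and the exact ykman subcommands vary between versions):

    # Change the PIV PIN from the factory default (123456).
    ykman piv access change-pin

    # Generate a key in the 9a (authentication) slot and give it a
    # self-signed certificate so the PIV tools will pick it up.
    ykman piv keys generate --algorithm ECCP256 9a pubkey.pem
    ykman piv certificates generate --subject "CN=SSH key" 9a pubkey.pem

    # Print the OpenSSH-format public key, then load the key into the
    # SSH agent via the PKCS#11 module (path varies by distribution).
    ssh-keygen -D /usr/lib64/libykcs11.so
    ssh-add -s /usr/lib64/libykcs11.so

The ssh-add step (and the PIN prompt that goes with it) is the part that has to be repeated every time the Yubikey is reinserted.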

PIV keys are more difficult to set up and maintain than OpenSSH SK keys, but they have one major advantage: the Yubikey supports a touch “cache” with PIV authentication. This means that any SSH connections made within fifteen seconds of a Yubikey touch are allowed through without requiring another touch. After configuring Ansible to perform up to 200 simultaneous connections, this reduces a full deployment from 100+ touches to 3, all within the first minute of the deployment.
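The touch policy is a property of the key, set when it’s generated, and the parallelism is plain Ansible configuration. A sketch, using the same illustrative slot and algorithm as above:

    # Generate the PIV key with the "cached" touch policy, so one touch
    # covers any connections made in the next ~15 seconds.
    ykman piv keys generate --algorithm ECCP256 --touch-policy cached 9a pubkey.pem

    # ansible.cfg: enough parallel connections that a full deployment
    # fits into a handful of touch windows.
    [defaults]
    forks = 200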

If we could somehow get either the Yubikey or OpenSSH to support a touch cache for SK keys, I would switch to them in a heartbeat, but, until that feature is added (or we can find a workaround), we’re going to have to stick with PIV authentication.

As always, if you have any suggestions or comments, please either email me or ping me on Twitter.

The Paddy's Day bug

Last Sunday, I got a message from a coworker and good friend, Dhaval, that his Fedora laptop was stuck during the boot process. His work laptop, also running Fedora, also failed to boot. I checked my work laptop and personal laptop, and both of them rebooted just fine, so we started going through the normal troubleshooting process on his work laptop. There were no error messages on the screen, just a hang after “Update mlocate database every day”. We booted a live environment, mounted the laptop’s filesystem, and checked the journal, and, again, there was nothing to see. Google didn’t turn anything up either. There wasn’t much installed on his personal laptop, so Dhaval suggested reinstalling Fedora on it. Booting into the live USB worked perfectly and the reinstall happened without a hitch. We then rebooted into the newly installed system and… it hung. Again.
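If you ever need to do the same, reading an installed system’s journal from a live environment looks something like this (the device name and paths are examples; adjust for your layout):

    # Mount the installed system's root filesystem from the live environment.
    sudo mount /dev/nvme0n1p3 /mnt

    # Read the mounted system's journal, listing its boots first.
    sudo journalctl -D /mnt/var/log/journal --list-boots
    sudo journalctl -D /mnt/var/log/journal -b -1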

What? This was a completely clean install! Was it possible that Dhaval’s laptops both had some kind of boot sector virus? But his work laptop had secure boot on, which is supposed to protect against that kind of attack. I decided to compare the systemd services that start on my laptop with those on his. After going through multiple services, I noticed that raid-check.timer was set to start on Dhaval’s laptop, but wasn’t enabled on mine. I started the timer on my laptop… and the system immediately became unresponsive! Using the live environment, I then disabled the timer on Dhaval’s laptop. One reboot later… and his system booted perfectly!
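For anyone wanting to do the same comparison, the timers are easy to list, and a unit can be disabled on the installed system straight from the live environment (the mount point here is an example):

    # Compare which timers exist and are active on each machine.
    systemctl list-timers --all

    # From the live environment, disable the offending timer on the
    # installed system mounted at /mnt.
    sudo systemctl --root=/mnt disable raid-check.timer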

Neither of our laptops has RAID, and manually starting raid-check.service didn’t kill the system, so the problem seemed to be in systemd itself rather than in the RAID service. What really concerned me, though, was that the problem occurred on a fresh install of Fedora. It turns out that, on F33, raid-check.timer is enabled by default. My laptops had both been upgraded from previous Fedora releases, where this wasn’t the case, but Dhaval had performed fresh installs on his systems. Further testing confirmed that the bug only affected systems running raid-check.timer after 1:00PM on Mar 21st.

As far as I could see, this was going to affect everyone when they booted Fedora, and this had me worried. I figured the best place to start getting the word out was to open a bug report, and I then left messages in both #fedora-devel on IRC and the Fedora development mailing list. tomhughes was the first to respond on IRC with a handy kernel command-line option I’d never seen before that temporarily masks the timer (systemd.mask=raid-check.timer). cmurf and nirik then started trying to work out where the bug was coming from. nirik was the first to realize that daylight saving time might be the problem, since after 1:00AM on the 21st, the next trigger would be after the clocks change here in Ireland. But it was chrisawi who gave us the first ray of hope when they pointed out that the bug was only triggered if you were in the “Europe/Dublin” time zone.

This was the first indication I had that the bug wasn’t affecting everyone, and that was a huge relief. I had feared that everyone who had installed Fedora 33 was going to have to work around this bug, but this dropped the number of affected people down to the Fedora users in Ireland. tomhughes pointed out that Ireland is unique in the world in that summer time is the “normal” time and the switch back during the winter is the “saving” offset. Apparently systemd was having problems with this negative time offset, though it’s unclear to me why it hadn’t been triggered in previous years.
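You can see this in the timezone data itself; assuming your tzdata is built with the modern representation, something like the following shows IST as the standard offset, with winter GMT carried as a negative “DST” offset (exact output depends on how your tzdata was built):

    # List the 2021 transitions for Ireland; with the negative-DST data,
    # the winter (GMT) period is the one flagged as DST, not the summer.
    zdump -v Europe/Dublin | grep 2021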

Once it was clear that the bug was related to DST in Ireland, zbyszek was able to figure out that the problem involved an infinite loop, and he created a fix. After an initial test update that caused major networking problems, we now have systemd-246.13 pushed to stable, which fully resolves the problem.

One of the things I realized while doing the initial troubleshooting is how long it’s been since I’ve seen this kind of bug in Fedora. I think the last time I saw a bug of this severity was somewhere around ten years ago, though before that these kinds of bugs seemed to show up annually. It’s a tribute to the QA team and to the processes that have been established around creating and pushing updates that show-stopping bugs like this are so rare. This bug is a reminder, though, that no matter how good your testing is, there will always be some bugs that fall through the cracks.

I would like to say a huge thank you again to zbyszek for fixing the bug and adamw for stopping the broken systemd update before it made it to stable.

If you’re in Ireland and doing a clean install of Fedora 33, you’ll need to work around the bug as follows:

  • In grub, type e to edit the current boot entry
  • Move the cursor to the line that starts with linux or linuxefi and add systemd.mask=raid-check.timer to the end of the line
  • Press Ctrl+X to boot
  • Once the system has booted, update systemd to 246.13 (or later)
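Once the system is up, the update (plus an optional persistent mask, in case you need to reboot before the update lands) looks something like:

    # Optional: keep the timer masked until systemd has been updated.
    sudo systemctl mask raid-check.timer

    # Pull in the fixed systemd, then the mask can be removed.
    sudo dnf upgrade --refresh systemd
    sudo systemctl unmask raid-check.timer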