codeblog code is freedom — patching my itch

April 19, 2018

UEFI booting and RAID1

Filed under: Debian,Kernel,Ubuntu,Ubuntu-Server — kees @ 5:34 pm

I spent some time yesterday building out a UEFI server that didn’t have on-board hardware RAID for its system drives. In these situations, I always use Linux’s md RAID1 for the root filesystem (and/or /boot). This worked well for BIOS booting since BIOS just transfers control blindly to the MBR of whatever disk it sees (modulo finding a “bootable partition” flag, etc, etc). This means that BIOS doesn’t really care what’s on the drive, it’ll hand over control to the GRUB code in the MBR.

With UEFI, the boot firmware is actually examining the GPT partition table, looking for the partition marked with the “EFI System Partition” (ESP) UUID. Then it looks for a FAT32 filesystem there, and does more things like looking at NVRAM boot entries, or just running BOOT/EFI/BOOTX64.EFI from the FAT32. Under Linux, this .EFI code is either GRUB itself, or Shim which loads GRUB.

So, if I want RAID1 for my root filesystem, that’s fine (GRUB will read md, LVM, etc), but how do I handle /boot/efi (the UEFI ESP)? In everything I found answering this question, the answer was “oh, just manually make an ESP on each drive in your RAID and copy the files around, add a separate NVRAM entry (with efibootmgr) for each drive, and you’re fine!” I did not like this one bit since it meant things could get out of sync between the copies, etc.

The current implementation of Linux’s md RAID puts metadata at the front of a partition. This solves more problems than it creates, but it means the RAID isn’t “invisible” to something that doesn’t know about the metadata. In fact, mdadm warns about this pretty loudly:

# mdadm --create /dev/md0 --level 1 --raid-disks 2 /dev/sda1 /dev/sdb1 mdadm: Note: this array has metadata at the start and may not be suitable as a boot device. If you plan to store '/boot' on this device please ensure that your boot-loader understands md/v1.x metadata, or use --metadata=0.90

Reading from the mdadm man page:

-e, --metadata= ... 1, 1.0, 1.1, 1.2 default Use the new version-1 format superblock. This has fewer restrictions. It can easily be moved between hosts with different endian-ness, and a recovery operation can be checkpointed and restarted. The different sub-versions store the superblock at different locations on the device, either at the end (for 1.0), at the start (for 1.1) or 4K from the start (for 1.2). "1" is equivalent to "1.2" (the commonly preferred 1.x format). "default" is equivalent to "1.2".

First we toss a FAT32 on the RAID (mkfs.fat -F32 /dev/md0), and looking at the results, the first 4K is entirely zeros, and file doesn’t see a filesystem:

# dd if=/dev/sda1 bs=1K count=5 status=none | hexdump -C 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| * 00001000 fc 4e 2b a9 01 00 00 00 00 00 00 00 00 00 00 00 |.N+.............| ... # file -s /dev/sda1 /dev/sda1: Linux Software RAID version 1.2 ...

So, instead, we’ll use --metadata 1.0 to put the RAID metadata at the end:

# mdadm --create /dev/md0 --level 1 --raid-disks 2 --metadata 1.0 /dev/sda1 /dev/sdb1 ... # mkfs.fat -F32 /dev/md0 # dd if=/dev/sda1 bs=1 skip=80 count=16 status=none | xxd 00000000: 2020 4641 5433 3220 2020 0e1f be77 7cac FAT32 ...w|. # file -s /dev/sda1 /dev/sda1: ... FAT (32 bit)

Now we have a visible FAT32 filesystem on the ESP. UEFI should be able to boot whatever disk hasn’t failed, and grub-install will write to the RAID mounted at /boot/efi.

However, we’re left with a new problem: on (at least) Debian and Ubuntu, grub-install attempts to run efibootmgr to record which disk UEFI should boot from. This fails, though, since it expects a single disk, not a RAID set. In fact, it returns nothing, and tries to run efibootmgr with an empty -d argument:

Installing for x86_64-efi platform. efibootmgr: option requires an argument -- 'd' ... grub-install: error: efibootmgr failed to register the boot entry: Operation not permitted. Failed: grub-install --target=x86_64-efi WARNING: Bootloader is not properly installed, system may not be bootable

Luckily my UEFI boots without NVRAM entries, and I can disable the NVRAM writing via the “Update NVRAM variables to automatically boot into Debian?” debconf prompt when running: dpkg-reconfigure -p low grub-efi-amd64

So, now my system will boot with both or either drive present, and updates from Linux to /boot/efi are visible on all RAID members at boot-time. HOWEVER there is one nasty risk with this setup: if UEFI writes anything to one of the drives (which this firmware did when it wrote out a “boot variable cache” file), it may lead to corrupted results once Linux mounts the RAID (since the member drives won’t have identical block-level copies of the FAT32 any more).

To deal with this “external write” situation, I see some solutions:

  • Make the partition read-only when not under Linux. (I don’t think this is a thing.)
  • Create higher-level knowledge of the root-filesystem RAID configuration is needed to keep a collection of filesystems manually synchronized instead of doing block-level RAID. (Seems like a lot of work and would need redesign of /boot/efi into something like /boot/efi/booted, /boot/efi/spare1, /boot/efi/spare2, etc)
  • Prefer one RAID member’s copy of /boot/efi and rebuild the RAID at every boot. If there were no external writes, there’s no issue. (Though what’s really the right way to pick the copy to prefer?)

Since mdadm has the “--update=resync” assembly option, I can actually do the latter option. This required updating /etc/mdadm/mdadm.conf to add <ignore> on the RAID’s ARRAY line to keep it from auto-starting:

ARRAY <ignore> metadata=1.0 UUID=123...

(Since it’s ignored, I’ve chosen /dev/md100 for the manual assembly below.) Then I added the noauto option to the /boot/efi entry in /etc/fstab:

/dev/md100 /boot/efi vfat noauto,defaults 0 0

And finally I added a systemd oneshot service that assembles the RAID with resync and mounts it:

[Unit] Description=Resync /boot/efi RAID DefaultDependencies=no After=local-fs.target [Service] Type=oneshot ExecStart=/sbin/mdadm -A /dev/md100 --uuid=123... --update=resync ExecStart=/bin/mount /boot/efi RemainAfterExit=yes [Install] WantedBy=sysinit.target

(And don’t forget to run “update-initramfs -u” so the initramfs has an updated copy of /dev/mdadm/mdadm.conf.)

If mdadm.conf supported an “update=” option for ARRAY lines, this would have been trivial. Looking at the source, though, that kind of change doesn’t look easy. I can dream!

And if I wanted to keep a “pristine” version of /boot/efi that UEFI couldn’t update I could rearrange things more dramatically to keep the primary RAID member as a loopback device on a file in the root filesystem (e.g. /boot/efi.img). This would make all external changes in the real ESPs disappear after resync. Something like:

# truncate --size 512M /boot/efi.img # losetup -f --show /boot/efi.img /dev/loop0 # mdadm --create /dev/md100 --level 1 --raid-disks 3 --metadata 1.0 /dev/loop0 /dev/sda1 /dev/sdb1

And at boot just rebuild it from /dev/loop0, though I’m not sure how to “prefer” that partition…

© 2018, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

5 Comments

  1. Use a boot partition that is not raid. Much much easier.

    Comment by odi — April 19, 2018 @ 9:55 pm

  2. But that doesn’t handle any of my requirements. :P

    Comment by kees — April 19, 2018 @ 11:21 pm

  3. It looks like the UEFI thinktank completely sidestepped the issue of ESP redundancy and left it all to the hardware vendors. Which predictably leads to vendor specific (often braindead) solutions, which is unfortunate.

    Comment by yaneti — April 19, 2018 @ 11:44 pm

  4. > With UEFI, the boot firmware is actually examining the GPT partition table, looking for the partition marked with the “EFI System Partition” (ESP) UUID.

    The UEFI specification doesn’t require this.

    Instead it’s perfectly fine for a UEFI firmware to skip this step and just look for a GPT partition with a FAT filesystem on it that has the right files at the right place.

    The UEFI firmwares I used so far all don’t care about the GPT ESP UUID (i.e. Supermicro, Lenovo, Open Virtual Machine Firmware – OVMF).

    Thus, you can also set the GPT partition type of the ESP to a ‘Linux RAID’ UUID – thus making it explicit that the FS is part of a RAID.

    Perhaps this even helps against the firmware writing a ‘boot variable cache file’ (although I haven’t noticed such a cache file even with a non-RAID setup, though).

    FWIW, when selecting a mirror configuration in the Fedora installer it also creates the ESP on a RAID-1 (with superblock format 1.0) and sets the GPT partition type UUID to ‘Linux RAID’. The Fedora efibootmgr works fine in such a configuration.

    Comment by Georg Sauthoff — April 20, 2018 @ 12:18 am

  5. Reading between the lines, my UEFI firmware has Intel Matrix Raid driver built-in hence it can read ESP off RAID. So all we need is a clean-room BSD-2 implementation of an mdadm driver for OVMF/EDKII and ship that in the OEM firmwares, no? It can be quite basic, skip this much of the header and start reading from an offset. And maybe not write to anything, or write in all drives if it must.

    Comment by DImitri John Ledkov — June 12, 2018 @ 2:12 am

Powered by WordPress