April « 2018 « codeblog

April 19, 2018

UEFI booting and RAID1

Filed under: Debian,Kernel,Ubuntu,Ubuntu-Server — kees @ 5:34 pm

I spent some time yesterday building out a UEFI server that didn’t have on-board hardware RAID for its system drives. In these situations, I always use Linux’s md RAID1 for the root filesystem (and/or /boot). This worked well for BIOS booting since BIOS just transfers control blindly to the MBR of whatever disk it sees (modulo finding a “bootable partition” flag, etc, etc). This means that BIOS doesn’t really care what’s on the drive, it’ll hand over control to the GRUB code in the MBR.

With UEFI, the boot firmware is actually examining the GPT partition table, looking for the partition marked with the “EFI System Partition” (ESP) type GUID. Then it looks for a FAT32 filesystem there, and does more things like looking at NVRAM boot entries, or just running EFI/BOOT/BOOTX64.EFI from the FAT32. Under Linux, this .EFI code is either GRUB itself, or Shim which loads GRUB.
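
To make the "marked as ESP" part concrete, here is a minimal sketch of checking a partition's GPT type GUID from Linux. The helper name is_esp_guid and the device name in the usage comment are assumptions for illustration; the GUID itself is the standard ESP type.

```shell
#!/bin/sh
# GPT partition-type GUID that marks an EFI System Partition.
ESP_GUID="c12a7328-f81f-11d2-ba4b-00a0c93ec93b"

# is_esp_guid GUID: succeed if GUID (in any letter case) is the ESP type GUID.
# (Hypothetical helper, not part of any existing tool.)
is_esp_guid() {
    guid=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    [ "$guid" = "$ESP_GUID" ]
}

# Example against a real partition (the device name is an assumption):
#   is_esp_guid "$(lsblk -no PARTTYPE /dev/sda1)" && echo "sda1 is an ESP"
```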

So, if I want RAID1 for my root filesystem, that’s fine (GRUB will read md, LVM, etc), but how do I handle /boot/efi (the UEFI ESP)? In everything I found answering this question, the answer was “oh, just manually make an ESP on each drive in your RAID and copy the files around, add a separate NVRAM entry (with efibootmgr) for each drive, and you’re fine!” I did not like this one bit since it meant things could get out of sync between the copies, etc.
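
For comparison, the manual approach boils down to one NVRAM entry per disk. The following is only a sketch that prints the efibootmgr commands rather than running them. It assumes the ESP is partition 1 on every disk; the function name, the label, and the loader path in the usage line are illustrative and would need to match the actual install.

```shell
#!/bin/sh
# print_esp_boot_entries LOADER DISK...
# Print (do not run) one "efibootmgr --create" style command per disk.
# Assumes the ESP is partition 1 on each disk; the label is illustrative.
print_esp_boot_entries() {
    loader=$1
    shift
    for disk in "$@"; do
        # -c create entry, -d disk, -p partition number, -L label, -l loader
        printf 'efibootmgr -c -d %s -p 1 -L "debian (%s)" -l %s\n' \
            "$disk" "${disk##*/}" "'$loader'"
    done
}
```

Calling it as print_esp_boot_entries '\EFI\debian\shimx64.efi' /dev/sda /dev/sdb prints two commands to review before running them as root. The files on each ESP would still have to be copied and kept in sync by hand, which is the part I wanted to avoid.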

The current implementation of Linux’s md RAID puts metadata at the front of a partition. This solves more problems than it creates, but it means the RAID isn’t “invisible” to something that doesn’t know about the metadata. In fact, mdadm warns about this pretty loudly:

# mdadm --create /dev/md0 --level 1 --raid-disks 2 /dev/sda1 /dev/sdb1
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90

Reading from the mdadm man page:

   -e, --metadata=
...
          1, 1.0, 1.1, 1.2 default
                 Use  the new version-1 format superblock.  This has fewer
                 restrictions.  It can easily be moved between hosts  with
                 different  endian-ness,  and  a recovery operation can be
                 checkpointed and restarted.  The  different  sub-versions
                 store  the  superblock  at  different  locations  on  the
                 device, either at the end (for 1.0), at  the  start  (for
                 1.1)  or  4K from the start (for 1.2).  "1" is equivalent
                 to "1.2" (the commonly preferred 1.x format).   "default"
                 is equivalent to "1.2".
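
The man page text can be condensed into a small lookup. This is a sketch with a hypothetical function name that only restates the version-1 sub-versions quoted above:

```shell
#!/bin/sh
# md_superblock_location VERSION
# Print where mdadm stores the version-1 superblock for VERSION,
# following the man page excerpt. Returns non-zero for anything else.
md_superblock_location() {
    case "$1" in
        1.0)           echo "end" ;;
        1.1)           echo "start" ;;
        1|1.2|default) echo "4K from start" ;;
        *)             echo "unknown"; return 1 ;;
    esac
}
```

Only the "end" case leaves the start of each member partition looking like the start of the filesystem stored on the array.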

First we toss a FAT32 on the RAID (mkfs.fat -F32 /dev/md0), and looking at the results, the first 4K is entirely zeros, and file doesn’t see a filesystem:

# dd if=/dev/sda1 bs=1K count=5 status=none | hexdump -C
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000  fc 4e 2b a9 01 00 00 00  00 00 00 00 00 00 00 00  |.N+.............|
...
# file -s /dev/sda1
/dev/sda1: Linux Software RAID version 1.2 ...

So, instead, we’ll use --metadata 1.0 to put the RAID metadata at the end:

# mdadm --create /dev/md0 --level 1 --raid-disks 2 --metadata 1.0 /dev/sda1 /dev/sdb1
...
# mkfs.fat -F32 /dev/md0
# dd if=/dev/sda1 bs=1 skip=80 count=16 status=none | xxd
00000000: 2020 4641 5433 3220 2020 0e1f be77 7cac    FAT32   ...w|.
# file -s /dev/sda1
/dev/sda1: ... FAT (32 bit)
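
The same check can be scripted for each member. A minimal sketch, assuming the FAT32 filesystem-type string sits at byte offset 82 of the boot sector (which matches the dump above, where it appears two bytes into the read that starts at offset 80); the function name is made up for illustration:

```shell
#!/bin/sh
# has_fat32_signature PATH
# Succeed if the 8 bytes at offset 82 of PATH are "FAT32   ".
# NUL bytes are dropped first so an all-zero region compares as empty.
has_fat32_signature() {
    sig=$(dd if="$1" bs=1 skip=82 count=8 2>/dev/null | tr -d '\000')
    [ "$sig" = "FAT32   " ]
}

# Example (device names are assumptions, needs read access to the devices):
#   for m in /dev/sda1 /dev/sdb1; do
#       has_fat32_signature "$m" && echo "$m: FAT32 visible"
#   done
```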

Now we have a visible FAT32 filesystem on the ESP. UEFI should be able to boot whatever disk hasn’t failed, and grub-install will write to the RAID mounted at /boot/efi.

However, we’re left with a new problem: on (at least) Debian and Ubuntu, grub-install attempts to run efibootmgr to record which disk UEFI should boot from. This fails, though, since it expects a single disk, not a RAID set. In fact, the disk lookup returns nothing, and grub-install tries to run efibootmgr with an empty -d argument:

Installing for x86_64-efi platform.
efibootmgr: option requires an argument -- 'd'
...
grub-install: error: efibootmgr failed to register the boot entry: Operation not permitted.
Failed: grub-install --target=x86_64-efi  
WARNING: Bootloader is not properly installed, system may not be bootable

Luckily my UEFI boots without NVRAM entries, and I can disable the NVRAM writing via the “Update NVRAM variables to automatically boot into Debian?” debconf prompt when running: dpkg-reconfigure -p low grub-efi-amd64

So, now my system will boot with both or either drive present, and updates from Linux to /boot/efi are visible on all RAID members at boot-time. HOWEVER there is one nasty risk with this setup: if UEFI writes anything to one of the drives (which this firmware did when it wrote out a “boot variable cache” file), it may lead to corrupted results once Linux mounts the RAID (since the member drives won’t have identical block-level copies of the FAT32 any more).
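
One way to notice such an external write is to compare the members directly. This is a sketch, not something from my actual setup: the function name is hypothetical, it relies on GNU cmp's -n option, and it assumes the caller passes the size of the data area so that the per-member metadata at the end (which legitimately differs) is excluded.

```shell
#!/bin/sh
# members_match BYTES MEMBER_A MEMBER_B
# Succeed if the first BYTES bytes of both members are identical.
# BYTES should cover only the filesystem data: with metadata 1.0 the
# superblocks live at the end of each member and differ between them.
members_match() {
    cmp -s -n "$1" "$2" "$3"
}

# Example (assumed device names; run while nothing is writing to the array):
#   size=$(blockdev --getsize64 /dev/md0)
#   members_match "$size" /dev/sda1 /dev/sdb1 || echo "members differ"
```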

To deal with this “external write” situation, I see some solutions:

Make the partition read-only when not under Linux. (I don’t think this is a thing.)
Create higher-level knowledge of the root-filesystem RAID configuration to keep a collection of filesystems manually synchronized instead of doing block-level RAID. (Seems like a lot of work and would need a redesign of /boot/efi into something like /boot/efi/booted, /boot/efi/spare1, /boot/efi/spare2, etc.)
Prefer one RAID member’s copy of /boot/efi and rebuild the RAID at every boot. If there were no external writes, there’s no issue. (Though what’s really the right way to pick the copy to prefer?)

Since mdadm has the “--update=resync” assembly option, I can actually do the last option. This required updating /etc/mdadm/mdadm.conf to add <ignore> on the RAID’s ARRAY line to keep it from auto-starting:

ARRAY <ignore> metadata=1.0 UUID=123...

(Since it’s ignored, I’ve chosen /dev/md100 for the manual assembly below.) Then I added the noauto option to the /boot/efi entry in /etc/fstab:

/dev/md100 /boot/efi vfat noauto,defaults 0 0

And finally I added a systemd oneshot service that assembles the RAID with resync and mounts it:

[Unit]
Description=Resync /boot/efi RAID
DefaultDependencies=no
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/sbin/mdadm -A /dev/md100 --uuid=123... --update=resync
ExecStart=/bin/mount /boot/efi
RemainAfterExit=yes

[Install]
WantedBy=sysinit.target
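
After a reboot it is worth confirming that the array came up with every member present. A minimal sketch that parses /proc/mdstat; the function name is hypothetical, and it only checks the member map (such as [UU]), not whether a resync is still running:

```shell
#!/bin/sh
# md_array_all_members_up NAME [MDSTAT_FILE]
# Succeed if array NAME appears in MDSTAT_FILE (default /proc/mdstat)
# and the status line after its header ends in a map of only "U"s.
md_array_all_members_up() {
    awk -v name="$1" '
        $1 == name && $2 == ":" { found = 1; next }
        found {
            # Status line looks like: "524224 blocks super 1.0 [2/2] [UU]"
            ok = ($NF ~ /^\[U+\]$/)
            exit
        }
        END { exit !ok }
    ' "${2:-/proc/mdstat}"
}

# Example: md_array_all_members_up md100 || echo "md100 is degraded or missing"
```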

(And don’t forget to run “update-initramfs -u” so the initramfs has an updated copy of /etc/mdadm/mdadm.conf.)

If mdadm.conf supported an “update=” option for ARRAY lines, this would have been trivial. Looking at the source, though, that kind of change doesn’t look easy. I can dream!

And if I wanted to keep a “pristine” version of /boot/efi that UEFI couldn’t update I could rearrange things more dramatically to keep the primary RAID member as a loopback device on a file in the root filesystem (e.g. /boot/efi.img). This would make all external changes in the real ESPs disappear after resync. Something like:


# truncate --size 512M /boot/efi.img
# losetup -f --show /boot/efi.img
/dev/loop0
# mdadm --create /dev/md100 --level 1 --raid-disks 3 --metadata 1.0 /dev/loop0 /dev/sda1 /dev/sdb1

And at boot just rebuild it from /dev/loop0, though I’m not sure how to “prefer” that partition…

Notes: commands I used to muck with grub-install:

echo "grub-pc grub2/update_nvram boolean false" | debconf-set-selections
echo "grub-pc grub-efi/install_devices multiselect /dev/md0" | debconf-set-selections && dpkg-reconfigure -p low grub-efi-amd64-signed

Update: Ubuntu Focal 20.04 now provides a way for GRUB to install to an arbitrary collection of devices, so no RAID needed. Whew.


April 12, 2018

security things in Linux v4.16

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 5:04 pm

Previously: v4.15.

Linux kernel v4.16 was released last week. I really should write these posts in advance, otherwise I get distracted by the merge window. Regardless, here are some of the security things I think are interesting:

KPTI on arm64

Will Deacon, Catalin Marinas, and several other folks brought Kernel Page Table Isolation (via CONFIG_UNMAP_KERNEL_AT_EL0) to arm64. While most ARMv8+ CPUs were not vulnerable to the primary Meltdown flaw, the Cortex-A75 does need KPTI to be safe from memory content leaks. It’s worth noting, though, that KPTI does protect other ARMv8+ CPU models from having privileged register contents exposed. So, whatever your threat model, it’s very nice to have this clean isolation between kernel and userspace page tables for all ARMv8+ CPUs.

hardened usercopy whitelisting

While whole-object bounds checking was implemented in CONFIG_HARDENED_USERCOPY already, David Windsor and I finished another part of the porting work of grsecurity’s PAX_USERCOPY protection: usercopy whitelisting. This further tightens the scope of slab allocations that can be copied to/from userspace. Now, instead of allowing all objects in slab memory to be copied, only the whitelisted areas (where a subsystem has specifically marked the memory region allowed) can be copied. For example, only the auxv array out of the larger mm_struct.

As mentioned in the first commit from the series, this reduces the scope of slab memory that could be copied out of the kernel in the face of a bug to under 15%. As the breakdown below shows, one area of remaining work is the kmalloc regions. Those are regularly used for copying things in and out of userspace, but they’re also used for small simple allocations that aren’t meant to be exposed to userspace. Working to separate these kmalloc users needs some careful auditing.

Total Slab Memory:           48074720
Usercopyable Memory:          6367532  13.2%
         task_struct                    0.2%         4480/1630720
         RAW                            0.3%            300/96000
         RAWv6                          2.1%           1408/64768
         ext4_inode_cache               3.0%       269760/8740224
         dentry                        11.1%       585984/5273856
         mm_struct                     29.1%         54912/188448
         kmalloc-8                    100.0%          24576/24576
         kmalloc-16                   100.0%          28672/28672
         kmalloc-32                   100.0%          81920/81920
         kmalloc-192                  100.0%          96768/96768
         kmalloc-128                  100.0%        143360/143360
         names_cache                  100.0%        163840/163840
         kmalloc-64                   100.0%        167936/167936
         kmalloc-256                  100.0%        339968/339968
         kmalloc-512                  100.0%        350720/350720
         kmalloc-96                   100.0%        455616/455616
         kmalloc-8192                 100.0%        655360/655360
         kmalloc-1024                 100.0%        812032/812032
         kmalloc-4096                 100.0%        819200/819200
         kmalloc-2048                 100.0%      1310720/1310720

This series took quite a while to land (you can see David’s original patch date as back in June of last year). Partly this was due to having to spend a lot of time researching the code paths so that each whitelist could be explained for commit logs, partly due to making various adjustments from maintainer feedback, and partly due to the short merge window in v4.15 (when it was originally proposed for merging) combined with some last-minute glitches that made Linus nervous. After baking in linux-next for almost two full development cycles, it finally landed. (Though be sure to disable CONFIG_HARDENED_USERCOPY_FALLBACK to gain enforcement of the whitelists — by default it only warns and falls back to the full-object checking.)
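
To see whether a given kernel build actually enforces the whitelists, the build config can be checked. This is a sketch with a hypothetical function name; the config path in the usage comment follows the Debian/Ubuntu convention and is an assumption.

```shell
#!/bin/sh
# usercopy_whitelist_enforced CONFIG_FILE
# Succeed if CONFIG_FILE enables CONFIG_HARDENED_USERCOPY and does not
# enable CONFIG_HARDENED_USERCOPY_FALLBACK.
usercopy_whitelist_enforced() {
    grep -q '^CONFIG_HARDENED_USERCOPY=y$' "$1" || return 1
    ! grep -q '^CONFIG_HARDENED_USERCOPY_FALLBACK=y$' "$1"
}

# Example (path is an assumption):
#   usercopy_whitelist_enforced "/boot/config-$(uname -r)" \
#       && echo "whitelists enforced" || echo "warn-and-fallback, or not enabled"
```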

automatic stack-protector

While the stack-protector features of the kernel have existed for quite some time, they have never been enabled by default. This was mainly due to needing to evaluate compiler support for the feature, and Kconfig didn’t have a way to check the compiler features before offering CONFIG_* options. As a defense technology, the stack protector is pretty mature. Having it on by default would have greatly reduced the impact of things like the BlueBorne attack (CVE-2017-1000251), as fewer systems would have lacked the defense.

After spending quite a bit of time fighting with ancient compiler versions (*cough*GCC 4.4.4*cough*), I landed CONFIG_CC_STACKPROTECTOR_AUTO, which is default on, and tries to use the stack protector if it is available. The implementation of the solution, however, did not please Linus, though he allowed it to be merged. In the future, Kconfig will gain the knowledge to make better decisions, which will let the kernel expose the availability of (the now default) stack protector directly in Kconfig, rather than depending on rather ugly Makefile hacks.

execute-only memory for PowerPC

Similar to the Protection Keys (pkeys) hardware support that landed in v4.6 for x86, Ram Pai landed pkeys support for Power7/8/9. This should expand the scope of what’s possible in the dynamic loader to avoid having arbitrary read flaws allow an exploit to read out all of executable memory in order to find ROP gadgets.

That’s it for now; let me know if you think I should add anything! The v4.17 merge window is open. :)

Edit: added details on ARM register leaks, thanks to Daniel Micay.

Edit: added section on protection keys for POWER, thanks to Florian Weimer.

