Previously: v4.20.
Linux kernel v5.0 was released last week! Looking through the changes, here are some security-related things I found interesting:
read-only linear mapping, arm64
While x86 has had a read-only linear mapping (or “Low Kernel Mapping” as shown in /sys/kernel/debug/page_tables/kernel under CONFIG_X86_PTDUMP=y) for a while, Ard Biesheuvel has now added one to arm64. This means that ranges in the linear mapping that contain executable code (e.g. modules, JIT, etc.) are no longer directly writable by attackers. On arm64, this is visible as “Linear mapping” in /sys/kernel/debug/kernel_page_tables under CONFIG_ARM64_PTDUMP=y, where you can now see the page-level granularity:
---[ Linear mapping ]---
...
0xffffb07cfc402000-0xffffb07cfc403000    4K PTE ro NX SHD AF NG UXN MEM/NORMAL
0xffffb07cfc403000-0xffffb07cfc4d0000  820K PTE RW NX SHD AF NG UXN MEM/NORMAL
0xffffb07cfc4d0000-0xffffb07cfc4d1000    4K PTE ro NX SHD AF NG UXN MEM/NORMAL
0xffffb07cfc4d1000-0xffffb07cfc79d000 2864K PTE RW NX SHD AF NG UXN MEM/NORMAL
per-task stack canary, arm
ARM has supported stack buffer overflow protection for a long time (currently via the compiler’s -fstack-protector-strong option). However, on ARM, the compiler uses a global variable, __stack_chk_guard, for comparing the canary value. This meant that every task in the kernel had to use the same canary value. If an attacker could expose a canary value in one task, it could be spoofed during a buffer overflow in another task. On x86, the canary is in Thread Local Storage (TLS, defined as %gs:20 on 32-bit and %gs:40 on 64-bit), which means it’s possible to have a different canary for every task since the %gs segment points to per-task structures. To solve this for ARM, Ard Biesheuvel built a GCC plugin to replace the global canary checking code with a per-task relative reference to a new canary in struct thread_info. As he describes in his blog post, the plugin results in replacing:
8010fad8: e30c4488 movw r4, #50312 ; 0xc488
8010fadc: e34840d0 movt r4, #32976 ; 0x80d0
...
8010fb1c: e51b2030 ldr r2, [fp, #-48] ; 0xffffffd0
8010fb20: e5943000 ldr r3, [r4]
8010fb24: e1520003 cmp r2, r3
8010fb28: 1a000020 bne 8010fbb0
...
8010fbb0: eb006738 bl 80129898 <__stack_chk_fail>
with:
8010fc18: e1a0300d mov r3, sp
8010fc1c: e3c34d7f bic r4, r3, #8128 ; 0x1fc0
...
8010fc60: e51b2030 ldr r2, [fp, #-48] ; 0xffffffd0
8010fc64: e5943018 ldr r3, [r4, #24]
8010fc68: e1520003 cmp r2, r3
8010fc6c: 1a000020 bne 8010fcf4
...
8010fcf4: eb006757 bl 80129a58 <__stack_chk_fail>
r2 holds the canary saved on the stack and r3 the known-good canary to check against. In the former, r3 is loaded through r4 at a fixed address (0x80d0c488, which “readelf -s vmlinux” confirms is the global __stack_chk_guard). In the latter, it’s coming from offset 24 (0x18) in struct thread_info (which “pahole -C thread_info vmlinux” confirms is the “stack_canary” field).
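The address arithmetic in the second listing can be sketched in plain C. This is only an illustration under stated assumptions: 8 KiB kernel stacks aligned to their own size with struct thread_info at the base (the listing elides what is presumably a second bic that clears the remaining low bits), and a made-up sketch_thread_info layout in which only the offset of the canary (24) is taken from the disassembly. All sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 8 KiB kernel stacks, aligned to their own size. */
#define SKETCH_THREAD_SIZE 8192u

/* Hypothetical stand-in for struct thread_info; only the offset matters. */
struct sketch_thread_info {
    uint8_t  other_fields[24];
    uint32_t stack_canary;      /* offset 24, as in "ldr r3, [r4, #24]" */
};

/* Round a stack pointer down to the base of the stack it points into. */
static uintptr_t sketch_stack_base(uintptr_t sp)
{
    return sp & ~((uintptr_t)SKETCH_THREAD_SIZE - 1);
}

/* Fetch the per-task canary stored at the base of the current stack. */
static uint32_t sketch_task_canary(uintptr_t sp)
{
    const unsigned char *base = (const unsigned char *)sketch_stack_base(sp);
    uint32_t canary;

    memcpy(&canary,
           base + offsetof(struct sketch_thread_info, stack_canary),
           sizeof(canary));
    return canary;
}
```

Because the lookup starts from the stack pointer rather than from a fixed symbol address, each task naturally reads its own canary, which is the property the global __stack_chk_guard could not provide.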
per-task stack canary, arm64
arm64 lacked a per-task canary too. Ard Biesheuvel solved this differently by coordinating with GCC developer Ramana Radhakrishnan to add support for a register-based offset option (specifically “-mstack-protector-guard=sysreg -mstack-protector-guard-reg=sp_el0 -mstack-protector-guard-offset=...”). With this feature, the canary can be found relative to sp_el0, since that register holds the pointer to the struct task_struct, which contains the canary. I’m hoping there will be a workable Clang solution soon too (for this and 32-bit ARM). (And it’s also worth noting that, unfortunately, this support isn’t yet in a released version of GCC. It’s expected for 9.0, likely this coming May.)
top-byte-ignore, arm64
Andrey Konovalov has been laying the groundwork with his Top Byte Ignore (TBI) series which will also help support ARMv8.3’s Pointer Authentication (PAC) and ARMv8.5’s Memory Tagging (MTE). While TBI technically conflicts with PAC, both rely on using “non-VA-space” (Virtual Address) bits in memory addresses, and on getting the kernel ready to deal with ignoring non-VA bits. PAC stores signatures for checking things like return addresses on the stack or stored function pointers on the heap, both to stop overwrites of control flow information. MTE stores a “tag” (or, depending on your dialect, a “color” or “version”) to mark separate memory allocation regions to stop use-after-free and linear overflows. For either of these to work, the CPU has to be put into some form of the TBI addressing mode (though for MTE, it’ll be a “check the tag” mode), otherwise the addresses would resolve into totally the wrong place in memory. Even without PAC and MTE, this byte can be used to store bits that can be checked by software (which is what the rest of Andrey’s series does: adding this logic to speed up KASan).
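To make the “non-VA bits” idea concrete, here is a small sketch of storing a software-checked tag in the top byte of a 64-bit address and stripping it again. It is not code from the series: the sketch_* helpers are hypothetical, and it assumes a userspace-style address whose top byte is otherwise zero (kernel addresses would need their top bits restored to ones instead).

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_TAG_SHIFT 56
#define SKETCH_TAG_MASK  (0xffULL << SKETCH_TAG_SHIFT)

/* Place an 8-bit tag in bits 63:56 of an address. */
static uint64_t sketch_set_tag(uint64_t addr, uint8_t tag)
{
    return (addr & ~SKETCH_TAG_MASK) | ((uint64_t)tag << SKETCH_TAG_SHIFT);
}

/* Read the tag back out of a tagged address. */
static uint8_t sketch_get_tag(uint64_t addr)
{
    return (uint8_t)(addr >> SKETCH_TAG_SHIFT);
}

/*
 * Recover the plain address. With TBI enabled the CPU ignores the top
 * byte during translation, so software only needs this when comparing
 * or doing arithmetic on addresses.
 */
static uint64_t sketch_untag(uint64_t addr)
{
    return addr & ~SKETCH_TAG_MASK;
}
```

A software checker can then compare the tag carried in a pointer against a tag recorded for the memory it points at, which is the general shape of the check that hardware MTE is meant to perform.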
ongoing: implicit fall-through removal
An area of active work in the kernel is the removal of all implicit fall-through in switch statements. While the C language has a statement to indicate the end of a switch case (“break”), it doesn’t have a statement to indicate that execution should fall through to the next case statement. The mere absence of a “break” is taken to mean fall-through, but that is not always what the author intended, and such “implicit fall-through” may lead to bugs. Gustavo Silva has been the driving force behind fixing these since at least v4.14, with well over 300 patches on the topic alone (and over 20 missing break statements found and fixed as a result of the work). The goal is to be able to add -Wimplicit-fallthrough to the build so that the kernel will stay entirely free of this class of bug going forward. From roughly 2300 warnings, the kernel is now down to about 200. It’s also worth noting that with Stephen Rothwell’s help, this class of bug has been kept out of linux-next by him sending warning emails to any tree maintainers where a new instance is introduced (for example, here’s a bug introduced on Feb 20th and fixed on Feb 21st).
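As a small illustration of what the warning distinguishes, the sketch below shows a case that ends with “break” and a case that deliberately falls through and says so. The enum and function names are invented for this example. GCC’s -Wimplicit-fallthrough accepts a marker comment such as “fall through” at its default level; deleting that comment from sketch_flags() would produce a warning at the following case label.

```c
#include <assert.h>

enum sketch_level { SKETCH_LOW, SKETCH_MID, SKETCH_HIGH };

/* Every case ends with break: no fall-through at all. */
static int sketch_cost(enum sketch_level level)
{
    int cost = 0;

    switch (level) {
    case SKETCH_LOW:
        cost = 1;
        break;  /* dropping this break would silently yield 5 */
    case SKETCH_MID:
        cost = 5;
        break;
    case SKETCH_HIGH:
        cost = 10;
        break;
    }
    return cost;
}

/* SKETCH_HIGH intentionally shares SKETCH_MID's work, and says so. */
static int sketch_flags(enum sketch_level level)
{
    int flags = 0;

    switch (level) {
    case SKETCH_HIGH:
        flags |= 0x4;
        /* fall through */
    case SKETCH_MID:
        flags |= 0x2;
        break;
    case SKETCH_LOW:
        flags |= 0x1;
        break;
    }
    return flags;
}
```

With every intentional fall-through marked, anything the compiler still flags is either a missing annotation or a missing “break”, which is how the missing break statements mentioned above were found.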
ongoing: refcount_t conversions
There also continues to be work converting reference counters from atomic_t to refcount_t so they can gain overflow protections. There have been 18 more conversions since v4.15 from Elena Reshetova, Trond Myklebust, Kirill Tkhai, Eric Biggers, and Björn Töpel. While there are more complex cases, the minimum goal is to reduce the Coccinelle warnings from scripts/coccinelle/api/atomic_as_refcounter.cocci to zero. As of v5.0, there are 131 warnings, with the bulk of the remaining areas in fs/ (49), drivers/ (41), and kernel/ (21).
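The difference between a plain counter and an overflow-protected one can be sketched in userspace C. This is not the kernel’s refcount_t implementation: it is non-atomic for brevity, the sketch_* names are hypothetical, and the choice of UINT_MAX as the saturation value is an assumption made for the example. The point is only the behavior: a plain counter wraps to zero (which can lead to a premature free), while a saturating one pins itself and leaks the object instead.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

struct sketch_refcount {
    unsigned int refs;
};

/* Plain counter: silently wraps from UINT_MAX back to 0. */
static void sketch_plain_inc(unsigned int *v)
{
    (*v)++;
}

/* Protected increment: refuses to resurrect 0 and never wraps. */
static void sketch_refcount_inc(struct sketch_refcount *r)
{
    if (r->refs == 0 || r->refs == UINT_MAX)
        return;     /* 0: likely use-after-free; UINT_MAX: saturated */
    r->refs++;
}

/* Returns true only when the last reference was dropped. */
static bool sketch_refcount_dec_and_test(struct sketch_refcount *r)
{
    if (r->refs == 0 || r->refs == UINT_MAX)
        return false;   /* underflow or saturated: never report "free" */
    return --r->refs == 0;
}
```

Once saturated, the sketch counter can never reach zero again, so the object is leaked rather than freed while references may still exist.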
userspace PAC, arm64
Mark Rutland and Kristina Martsenko enabled kernel support for ARMv8.3 PAC in userspace. As mentioned earlier about PAC, this will give userspace the ability to block a wide variety of function pointer overwrites by “signing” function pointers before storing them to memory. The kernel manages the keys (i.e. selects random keys and sets them up), but it’s up to userspace to detect and use the new CPU instructions. The “paca” and “pacg” flags will be visible in /proc/cpuinfo for CPUs that support it.
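One simple way for a userspace tool to look for those flags is to scan the feature line from /proc/cpuinfo for whole-word matches. The helper below is a hypothetical sketch of that check; it operates on a string the caller has already read, and the sample lines in the usage note are made up for illustration rather than copied from real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if `flag` appears in `features` as a complete
 * whitespace-delimited token (so "paca" does not match "pacax").
 */
static bool sketch_has_cpu_flag(const char *features, const char *flag)
{
    size_t len = strlen(flag);
    const char *p = features;

    if (len == 0)
        return false;

    while ((p = strstr(p, flag)) != NULL) {
        bool starts = (p == features) || p[-1] == ' ' || p[-1] == '\t';
        bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

        if (starts && ends)
            return true;
        p += len;   /* keep scanning after this partial match */
    }
    return false;
}
```

For example, given a line such as "Features : fp asimd paca pacg", the helper reports both flags as present, while a line without them reports neither.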
platform keyring
Nayna Jain introduced the trusted platform keyring, which cannot be updated by userspace. This can be used to verify platform or boot-time things like firmware, initramfs, or kexec kernel signatures, etc.
SECCOMP_RET_USER_NOTIF
Tycho Andersen added the new SECCOMP_RET_USER_NOTIF return type to seccomp which allows process monitors to receive notifications over a file descriptor when a seccomp filter has been hit. This is a much more lightweight method than ptrace for performing tasks on behalf of a process. It is especially useful for performing various admin-only tasks for unprivileged containers (like mounting filesystems). The process monitor can perform the requested task (after doing some ToCToU dances to read process memory for syscall arguments), or reject the syscall.
Edit: added userspace PAC and platform keyring, suggested by Alexander Popov
Edit: tried to clarify TBI vs PAC vs MTE
Edit: added details on SECCOMP_RET_USER_NOTIF
That’s it for now; please let me know if I missed anything. The v5.1 merge window is open, so off we go! :)
© 2019 – 2020, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.