codeblog code is freedom — patching my itch

March 12, 2019

security things in Linux v5.0

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 4:04 pm

Previously: v4.20.

Linux kernel v5.0 was released last week! Looking through the changes, here are some security-related things I found interesting:

read-only linear mapping, arm64
While x86 has had a read-only linear mapping (or “Low Kernel Mapping” as shown in /sys/kernel/debug/page_tables/kernel under CONFIG_X86_PTDUMP=y) for a while, Ard Biesheuvel has added them to arm64 now. This means that ranges in the linear mapping that contain executable code (e.g. modules, JIT, etc), are not directly writable any more by attackers. On arm64, this is visible as “Linear mapping” in /sys/kernel/debug/kernel_page_tables under CONFIG_ARM64_PTDUMP=y, where you can now see the page-level granularity:

---[ Linear mapping ]---
0xffffb07cfc402000-0xffffb07cfc403000    4K PTE   ro NX SHD AF NG    UXN MEM/NORMAL
0xffffb07cfc403000-0xffffb07cfc4d0000  820K PTE   RW NX SHD AF NG    UXN MEM/NORMAL
0xffffb07cfc4d0000-0xffffb07cfc4d1000    4K PTE   ro NX SHD AF NG    UXN MEM/NORMAL
0xffffb07cfc4d1000-0xffffb07cfc79d000 2864K PTE   RW NX SHD AF NG    UXN MEM/NORMAL

per-task stack canary, arm
ARM has supported stack buffer overflow protection for a long time (currently via the compiler’s -fstack-protector-strong option). However, on ARM, the compiler uses a global variable for comparing the canary value, __stack_chk_guard. This meant that everywhere in the kernel needed to use the same canary value. If an attacker could expose a canary value in one task, it could be spoofed during a buffer overflow in another task. On x86, the canary is in Thread Local Storage (TLS, defined as %gs:20 on 32-bit and %gs:40 on 64-bit), which means it’s possible to have a different canary for every task since the %gs segment points to per-task structures. To solve this for ARM, Ard Biesheuvel built a GCC plugin to replace the global canary checking code with a per-task relative reference to a new canary in struct thread_info. As he describes in his blog post, the plugin results in replacing:

8010fad8:       e30c4488        movw    r4, #50312      ; 0xc488
8010fadc:       e34840d0        movt    r4, #32976      ; 0x80d0
8010fb1c:       e51b2030        ldr     r2, [fp, #-48]  ; 0xffffffd0
8010fb20:       e5943000        ldr     r3, [r4]
8010fb24:       e1520003        cmp     r2, r3
8010fb28:       1a000020        bne     8010fbb0
8010fbb0:       eb006738        bl      80129898 <__stack_chk_fail>


8010fc18:       e1a0300d        mov     r3, sp
8010fc1c:       e3c34d7f        bic     r4, r3, #8128   ; 0x1fc0
8010fc60:       e51b2030        ldr     r2, [fp, #-48]  ; 0xffffffd0
8010fc64:       e5943018        ldr     r3, [r4, #24]
8010fc68:       e1520003        cmp     r2, r3
8010fc6c:       1a000020        bne     8010fcf4
8010fcf4:       eb006757        bl      80129a58 <__stack_chk_fail>

r2 holds the canary saved on the stack and r3 the known-good canary to check against. In the former, r3 is loaded through r4 at a fixed address (0x80d0c488, which “readelf -s vmlinux” confirms is the global __stack_chk_guard). In the latter, it’s coming from offset 0x24 in struct thread_info (which “pahole -C thread_info vmlinux” confirms is the “stack_canary” field).

per-task stack canary, arm64
The lack of per-task canary existed on arm64 too. Ard Biesheuvel solved this differently by coordinating with GCC developer Ramana Radhakrishnan to add support for a register-based offset option (specifically “-mstack-protector-guard=sysreg -mstack-protector-guard-reg=sp_el0 -mstack-protector-guard-offset=...“). With this feature, the canary can be found relative to sp_el0, since that register holds the pointer to the struct task_struct, which contains the canary. I’m hoping there will be a workable Clang solution soon too (for this and 32-bit ARM). (And it’s also worth noting that, unfortunately, this support isn’t yet in a released version of GCC. It’s expected for 9.0, likely this coming May.)

top-byte-ignore, arm64
Andrey Konovalov has been laying the groundwork with his Top Byte Ignore (TBI) series which will also help support ARMv8.3’s Pointer Authentication (PAC) and ARMv8.5’s Memory Tagging (MTE). While TBI technically conflicts with PAC, both rely on using “non-VA-space” (Virtual Address) bits in memory addresses, and getting the kernel ready to deal with ignoring non-VA bits. PAC stores signatures for checking things like return addresses on the stack or stored function pointers on heap, both to stop overwrites of control flow information. MTE stores a “tag” (or, depending on your dialect, a “color” or “version”) to mark separate memory allocation regions to stop use-after-tree and linear overflows. For either of these to work, the CPU has to be put into some form of the TBI addressing mode (though for MTE, it’ll be a “check the tag” mode), otherwise the addresses would resolve into totally the wrong place in memory. Even without PAC and MTE, this byte can be used to store bits that can be checked by software (which is what the rest of Andrey’s series does: adding this logic to speed up KASan).

ongoing: implicit fall-through removal
An area of active work in the kernel is the removal of all implicit fall-through in switch statements. While the C language has a statement to indicate the end of a switch case (“break“), it doesn’t have a statement to indicate that execution should fall through to the next case statement (just the lack of a “break” is used to indicate it should fall through — but this is not always the case), and such “implicit fall-through” may lead to bugs. Gustavo Silva has been the driving force behind fixing these since at least v4.14, with well over 300 patches on the topic alone (and over 20 missing break statements found and fixed as a result of the work). The goal is to be able to add -Wimplicit-fallthrough to the build so that the kernel will stay entirely free of this class of bug going forward. From roughly 2300 warnings, the kernel is now down to about 200. It’s also worth noting that with Stephen Rothwell’s help, this bug has been kept out of linux-next by him sending warning emails to any tree maintainers where a new instance is introduced (for example, here’s a bug introduced on Feb 20th and fixed on Feb 21st).

ongoing: refcount_t conversions
There also continues to be work converting reference counters from atomic_t to refcount_t so they can gain overflow protections. There have been 18 more conversions since v4.15 from Elena Reshetova, Trond Myklebust, Kirill Tkhai, Eric Biggers, and Björn Töpel. While there are more complex cases, the minimum goal is to reduce the Coccinelle warnings from scripts/coccinelle/api/atomic_as_refcounter.cocci to zero. As of v5.0, there are 131 warnings, with the bulk of the remaining areas in fs/ (49), drivers/ (41), and kernel/ (21).

userspace PAC, arm64
Mark Rutland and Kristina Martsenko enabled kernel support for ARMv8.3 PAC in userspace. As mentioned earlier about PAC, this will give userspace the ability to block a wide variety of function pointer overwrites by “signing” function pointers before storing them to memory. The kernel manages the keys (i.e. selects random keys and sets them up), but it’s up to userspace to detect and use the new CPU instructions. The “paca” and “pacg” flags will be visible in /proc/cpuinfo for CPUs that support it.

platform keyring
Nayna Jain introduced the trusted platform keyring, which cannot be updated by userspace. This can be used to verify platform or boot-time things like firmware, initramfs, or kexec kernel signatures, etc.

Tycho Andersen added the new SECCOMP_RET_USER_NOTIF return type to seccomp which allows process monitors to receive notifications over a file descriptor when a seccomp filter has been hit. This is a much more light-weight method than ptrace for performing tasks on behalf of a process. It is especially useful for performing various admin-only tasks for unprivileged containers (like mounting filesystems). The process monitor can perform the requested task (after doing some ToCToU dances to read process memory for syscall arguments), or reject the syscall.

Edit: added userspace PAC and platform keyring, suggested by Alexander Popov
Edit: tried to clarify TBI vs PAC vs MTE
Edit: added details on SECCOMP_RET_USER_NOTIF

That’s it for now; please let me know if I missed anything. The v5.1 merge window is open, so off we go! :)

© 2019 – 2020, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

No Comments

No comments yet.

Powered by WordPress