Previously: v4.7. Here are a bunch of security things I’m excited about in Linux v4.8:
SLUB freelist ASLR
x86_64 KASLR text base offset physical/virtual decoupling
On x86_64, to implement the KASLR text base offset, the physical memory location of the kernel was randomized, which resulted in the virtual address being offset as well. Due to how the kernel’s “-2GB” addressing works (gcc‘s “
-mcmodel=kernel“), it wasn’t possible to randomize the physical location beyond the 2GB limit, leaving any additional physical memory unused as a randomization target. In order to decouple the physical and virtual location of the kernel (to make physical address exposures less valuable to attackers), the physical location of the kernel needed to be randomized separately from the virtual location. This required a lot of work for handling very large addresses spanning terabytes of address space. Yinghai Lu, Baoquan He, and I landed a series of patches that ultimately did this (and in the process fixed some other bugs too). This expands the physical offset entropy to roughly
$physical_memory_size_of_system / 2MB bits.
x86_64 KASLR memory base offset
Thomas Garnier rolled out KASLR to the kernel’s various statically located memory ranges, randomizing their locations with CONFIG_RANDOMIZE_MEMORY. One of the more notable things randomized is the physical memory mapping, which is a known target for attacks. Also randomized is the vmalloc area, which makes attacks against targets vmalloced during boot (which tend to always end up in the same location on a given system) are now harder to locate. (The vmemmap region randomization accidentally missed the v4.8 window and will appear in v4.9.)
x86_64 KASLR with hibernation
Rafael Wysocki (with Thomas Garnier, Borislav Petkov, Yinghai Lu, Logan Gunthorpe, and myself) worked on a number of fixes to hibernation code that, even without KASLR, were coincidentally exposed by the earlier W^X fix. With that original problem fixed, then memory KASLR exposed more problems. I’m very grateful everyone was able to help out fixing these, especially Rafael and Thomas. It’s a hard place to debug. The bottom line, now, is that hibernation and KASLR are no longer mutually exclusive.
gcc plugin infrastructure
Emese Revfy ported the PaX/Grsecurity gcc plugin infrastructure to upstream. If you want to perform compiler-based magic on kernel builds, now it’s much easier with
CONFIG_GCC_PLUGINS! The plugins live in
scripts/gcc-plugins/. Current plugins are a short example called “Cyclic Complexity” which just emits the complexity of functions as they’re compiled, and “Sanitizer Coverage” which provides the same functionality as gcc’s recent “-fsanitize-coverage=trace-pc” but back through gcc 4.5. Another notable detail about this work is that it was the first Linux kernel security work funded by Linux Foundation’s Core Infrastructure Initiative. I’m looking forward to more plugins!
If you’re on Debian or Ubuntu, the required gcc plugin headers are available via the
gcc-$N-plugin-dev package (and similarly for all cross-compiler packages).
Along with work from Rik van Riel, Laura Abbott, Casey Schaufler, and many other folks doing testing on the KSPP mailing list, I ported part of PAX_USERCOPY (the basic runtime bounds checking) to upstream as CONFIG_HARDENED_USERCOPY. One of the interface boundaries between the kernel and user-space are the
copy_from_user() family of functions. Frequently, the size of a copy is known at compile-time (“built-in constant”), so there’s not much benefit in checking those sizes (hardened usercopy avoids these cases). In the case of dynamic sizes, hardened usercopy checks for 3 areas of memory: slab allocations, stack allocations, and kernel text. Direct kernel text copying is simply disallowed. Stack copying is allowed as long as it is entirely contained by the current stack memory range (and on x86, only if it does not include the saved stack frame and instruction pointers). For slab allocations (e.g. those allocated through
kmem_cache_alloc() and the
kmalloc()-family of functions), the copy size is compared against the size of the object being copied. For example, if
copy_from_user() is writing to a structure that was allocated as size 64, but the copy gets tricked into trying to write 65 bytes, hardened usercopy will catch it and kill the process.
For testing hardened usercopy, lkdtm gained several new tests: USERCOPY_HEAP_SIZE_TO, USERCOPY_HEAP_SIZE_FROM, USERCOPY_STACK_FRAME_TO,
USERCOPY_STACK_FRAME_FROM, USERCOPY_STACK_BEYOND, and USERCOPY_KERNEL. Additionally, USERCOPY_HEAP_FLAG_TO and USERCOPY_HEAP_FLAG_FROM were added to test what will be coming next for hardened usercopy: flagging slab memory as “safe for copy to/from user-space”, effectively whitelisting certainly slab caches, as done by PAX_USERCOPY. This further reduces the scope of what’s allowed to be copied to/from, since most kernel memory is not intended to ever be exposed to user-space. Adding this logic will require some reorganization of usercopy code to add some new APIs, as PAX_USERCOPY’s approach to handling special-cases is to add bounce-copies (copy from slab to stack, then copy to userspace) as needed, which is unlikely to be acceptable upstream.
seccomp reordered after ptrace
By its original design, seccomp filtering happened before ptrace so that seccomp-based ptracers (i.e.
SECCOMP_RET_TRACE) could explicitly bypass seccomp filtering and force a desired syscall. Nothing actually used this feature, and as it turns out, it’s not compatible with process launchers that install seccomp filters (e.g. systemd, lxc) since as long as the
fork syscalls are allowed (and
fork is needed for any sensible container environment), a process could spawn a tracer to help bypass a filter by injecting syscalls. After Andy Lutomirski convinced me that ordering ptrace first does not change the attack surface of a running process (unless all syscalls are blacklisted, the entire ptrace attack surface will always be exposed), I rearranged things. Now there is no (expected) way to bypass seccomp filters, and containers with seccomp filters can allow ptrace again.
Edit: missed this next feature when I originally posted
NX stack and heap on MIPS
Other architectures have had a non-executable stack and heap for a while now, and now MIPS has caught up, thanks to Paul Barton. The primary reason for the delay was finding a way to cleanly deal with branch delay slot instructions which needed a place to write instructions. Traditionally this was on the stack, but now it’s handled with a per-mm page.
That’s it for v4.8! The merge window is open for v4.9…
© 2016 – 2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.