Previously: v4.13.
Linux kernel v4.14 was released this last Sunday, and there’s a bunch of security things I think are interesting:
vmapped kernel stack on arm64
Similar to the same feature on x86, Mark Rutland and Ard Biesheuvel implemented CONFIG_VMAP_STACK
for arm64, which moves the kernel stack to an isolated and guard-paged vmap area. With traditional stacks, there were two major risks when exhausting the stack: overwriting the thread_info
structure (which contained the addr_limit
field which is checked during copy_to/from_user()
), and overwriting neighboring stacks (or other things allocated next to the stack). While arm64 previously moved its thread_info off the stack to deal with the former issue, this vmap change adds the last bit of protection by nature of the vmap guard pages. If the kernel tries to write past the end of the stack, it will hit the guard page and fault. (Testing for this is now possible via LKDTM’s STACK_GUARD_PAGE_LEADING/TRAILING
tests.)
One aspect of the guard page protection that will need further attention (on all architectures) is that if the stack grew because of a giant Variable Length Array on the stack (effectively an implicit alloca()
call), it might be possible to jump over the guard page entirely (as seen in the userspace Stack Clash attacks). Thankfully the use of VLAs is rare in the kernel. In the future, hopefully we’ll see the addition of PaX/grsecurity’s STACKLEAK plugin which, in addition to its primary purpose of clearing the kernel stack on return to userspace, makes sure stack expansion cannot skip over guard pages. This “stack probing” ability will likely also become directly available from the compiler as well.
set_fs()
balance checking
Related to the addr_limit
field mentioned above, another class of bug is finding a way to force the kernel into accidentally leaving addr_limit
open to kernel memory through an unbalanced call to set_fs()
. In some areas of the kernel, in order to reuse userspace routines (usually VFS or compat related), code will do something like: set_fs(KERNEL_DS); ...some code here...; set_fs(USER_DS);
. When the USER_DS
call goes missing (usually due to a buggy error path or exception), subsequent system calls can suddenly start writing into kernel memory via copy_to_user
(where the “to user” really means “within the addr_limit
range”).
Thomas Garnier implemented USER_DS checking at syscall exit time for x86, arm, and arm64 (with a small fix). This means that a broken set_fs()
setting will not extend beyond the buggy syscall that fails to set it back to USER_DS
. Additionally, as part of the discussion on the best way to deal with this feature, Christoph Hellwig and Al Viro (and others) have been making extensive changes to avoid the need for set_fs()
being used at all, which should greatly reduce the number of places where it might be possible to introduce such a bug in the future.
SLUB freelist hardening
A common class of heap attacks is overwriting the freelist pointers stored inline in the unallocated SLUB cache objects. PaX/grsecurity developed an inexpensive defense that XORs the freelist pointer with a global random value (and the storage address). Daniel Micay improved on this by using a per-cache random value, and I refactored the code a bit more. The resulting feature, enabled with CONFIG_SLAB_FREELIST_HARDENED
, makes freelist pointer overwrites very hard to exploit unless an attacker has found a way to expose both the random value and the pointer location. This should render blind heap overflow bugs much more difficult to exploit.
Additionally, Alexander Popov implemented a simple double-free defense, similar to the “fasttop” check in the GNU C library, which will catch sequential free()
s of the same pointer. (And has already uncovered a bug.)
Future work would be to provide similar metadata protections to the SLAB allocator (though SLAB doesn’t store its freelist within the individual unused objects, so it has a different set of exposures compared to SLUB).
setuid-exec stack limitation
Continuing the various additional defenses to protect against future problems related to userspace memory layout manipulation (as shown most recently in the Stack Clash attacks), I implemented an 8MiB stack limit for privileged (i.e. setuid) execs, inspired by a similar protection in grsecurity, after reworking the secureexec handling by LSMs. This complements the unconditional limit to the size of exec arguments that landed in v4.13.
randstruct automatic struct selection
While the bulk of the port of the randstruct gcc plugin from grsecurity landed in v4.13, the last of the work needed to enable automatic struct selection landed in v4.14. This means that the coverage of randomized structures, via CONFIG_GCC_PLUGIN_RANDSTRUCT
, now includes one of the major targets of exploits: function pointer structures. Without knowing the build-randomized location of a callback pointer an attacker needs to overwrite in a structure, exploits become much less reliable.
structleak passed-by-reference variable initialization
Ard Biesheuvel enhanced the structleak gcc plugin to initialize all variables on the stack that are passed by reference when built with CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL
. Normally the compiler will yell if a variable is used before being initialized, but it silences this warning if the variable’s address is passed into a function call first, as it has no way to tell if the function did actually initialize the contents. So the plugin now zero-initializes such variables (if they hadn’t already been initialized) before the function call that takes their address. Enabling this feature has a small performance impact, but solves many stack content exposure flaws. (In fact at least one such flaw reported during the v4.15 development cycle was mitigated by this plugin.)
improved boot entropy
Laura Abbott and Daniel Micay improved early boot entropy available to the stack protector by both moving the stack protector setup later in the boot, and including the kernel command line in boot entropy collection (since with some devices it changes on each boot).
eBPF JIT for 32-bit ARM
The ARM BPF JIT had been around a while, but it didn’t support eBPF (and, as a result, did not provide constant value blinding, which meant it was exposed to being used by an attacker to build arbitrary machine code with BPF constant values). Shubham Bansal spent a bunch of time building a full eBPF JIT for 32-bit ARM which both speeds up eBPF and brings it up to date on JIT exploit defenses in the kernel.
seccomp improvements
Tyler Hicks addressed a long-standing deficiency in how seccomp could log action results. In addition to creating a way to mark a specific seccomp filter as needing to be logged with SECCOMP_FILTER_FLAG_LOG
, he added a new action result, SECCOMP_RET_LOG
. With these changes in place, it should be much easier for developers to inspect the results of seccomp filters, and for process launchers to generate logs for their child processes operating under a seccomp filter.
Additionally, I finally found a way to implement an often-requested feature for seccomp, which was to kill an entire process instead of just the offending thread. This was done by creating the SECCOMP_RET_ACTION_FULL
mask (née SECCOMP_RET_ACTION
) and implementing SECCOMP_RET_KILL_PROCESS
.
That’s it for now; please let me know if I missed anything. The v4.15 merge window is now open!
© 2017 – 2019, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.