Previously: v4.5. The v4.6 Linux kernel release included a bunch of stuff, with much more of it under the KSPP umbrella.
seccomp support for parisc
Helge Deller added seccomp support for parisc, which including plumbing support for PTRACE_GETREGSET to get the self-tests working.
x86 32-bit mmap ASLR vs unlimited stack fixed
Hector Marco-Gisbert removed a long-standing limitation to mmap ASLR on 32-bit x86, where setting an unlimited stack (e.g. “
ulimit -s unlimited“) would turn off mmap ASLR (which provided a way to bypass ASLR when executing setuid processes). Given that ASLR entropy can now be controlled directly (see the v4.5 post), and that the cases where this created an actual problem are very rare, means that if a system sees collisions between unlimited stack and mmap ASLR, they can just adjust the 32-bit ASLR entropy instead.
x86 execute-only memory
Dave Hansen added Protection Key support for future x86 CPUs and, as part of this, implemented support for “execute only” memory in user-space. On pkeys-supporting CPUs, using
mmap(..., PROT_EXEC) (i.e. without
PROT_READ) will mean that the memory can be executed but cannot be read (or written). This provides some mitigation against automated ROP gadget finding where an executable is read out of memory to find places that can be used to build a malicious execution path. Using this will require changing some linker behavior (to avoid putting data in executable areas), but seems to otherwise Just Work. I’m looking forward to either emulated QEmu support or access to one of these fancy CPUs.
CONFIG_DEBUG_RODATA enabled by default on arm and arm64, and mandatory on x86
Ard Biesheuvel (arm64) and I (arm) made the poorly-named CONFIG_DEBUG_RODATA enabled by default. This feature controls whether the kernel enforces proper memory protections on its own memory regions (code memory is executable and read-only, read-only data is actually read-only and non-executable, and writable data is non-executable). This protection is a fundamental security primitive for kernel self-protection, so making it on-by-default is required to start any kind of attack surface reduction within the kernel.
On x86 CONFIG_DEBUG_RODATA was already enabled by default, but, at Ingo Molnar’s suggestion, I made it mandatory: CONFIG_DEBUG_RODATA cannot be turned off on x86. I expect we’ll get there with arm and arm64 too, but the protection is still somewhat new on these architectures, so it’s reasonable to continue to leave an “out” for developers that find themselves tripping over it.
arm64 KASLR text base offset
Ard Biesheuvel reworked a ton of arm64 infrastructure to support kernel relocation and, building on that, Kernel Address Space Layout Randomization of the kernel text base offset (and module base offset). As with x86 text base KASLR, this is a probabilistic defense that raises the bar for kernel attacks where finding the KASLR offset must be added to the chain of exploits used for a successful attack. One big difference from x86 is that the entropy for the KASLR must come either from Device Tree (in the “
/chosen/kaslr-seed” property) or from UEFI (via EFI_RNG_PROTOCOL), so if you’re building arm64 devices, make sure you have a strong source of early-boot entropy that you can expose through your boot-firmware or boot-loader.
zero-poison after free
Laura Abbott reworked a bunch of the kernel memory management debugging code to add zeroing of freed memory, similar to PaX/Grsecurity’s PAX_MEMORY_SANITIZE feature. This feature means that memory is cleared at free, wiping any sensitive data so it doesn’t have an opportunity to leak in various ways (e.g. accidentally uninitialized structures or padding), and that certain types of use-after-free flaws cannot be exploited since the memory has been wiped. To take things even a step further, the poisoning can be verified at allocation time to make sure that nothing wrote to it between free and allocation (called “sanity checking”), which can catch another small subset of flaws.
To understand the pieces of this, it’s worth describing that the kernel’s higher level allocator, the “page allocator” (e.g.
__get_free_pages()) is used by the finer-grained “slab allocator” (e.g.
kmalloc()). Poisoning is handled separately in both allocators. The zero-poisoning happens at the page allocator level. Since the slab allocators tend to do their own allocation/freeing, their poisoning happens separately (since on slab free nothing has been freed up to the page allocator).
Only limited performance tuning has been done, so the penalty is rather high at the moment, at about 9% when doing a kernel build workload. Future work will include some exclusion of frequently-freed caches (similar to PAX_MEMORY_SANITIZE), and making the options entirely CONFIG controlled (right now both CONFIGs are needed to build in the code, and a kernel command line is needed to activate it). Performing the sanity checking (mentioned above) adds another roughly 3% penalty. In the general case (and once the performance of the poisoning is improved), the security value of the sanity checking isn’t worth the performance trade-off.
Tests for the features can be found in lkdtm as
READ_BUDDY_AFTER_FREE. If you’re feeling especially paranoid and have enabled sanity-checking,
WRITE_BUDDY_AFTER_FREE can test these as well.
To perform zero-poisoning of page allocations and (currently non-zero) poisoning of slab allocations, build with:
and enable the page allocator poisoning and slab allocator poisoning at boot with this on the kernel command line:
To add sanity-checking, change
PAGE_POISONING_NO_SANITY=n, and add “
slub_debug as “
read-only after init
I added the infrastructure to support making certain kernel memory read-only after kernel initialization (inspired by a small part of PaX/Grsecurity’s KERNEXEC functionality). The goal is to continue to reduce the attack surface within the kernel by making even more of the memory, especially function pointer tables, read-only (which depends on
Function pointer tables (and similar structures) are frequently targeted by attackers when redirecting execution. While many are already declared “
const” in the kernel source code, making them read-only (and therefore unavailable to attackers) for their entire lifetime, there is a class of variables that get initialized during kernel (and module) start-up (i.e. written to during functions that are marked “
__init“) and then never (intentionally) written to again. Some examples are things like the VDSO, vector tables, arch-specific callbacks, etc.
As it turns out, most architectures with kernel memory protection already delay making their data read-only until after
mark_rodata_ro()), so it’s trivial to declare a new data section (“
.data..ro_after_init“) and add it to the existing read-only data section (“
.rodata“). Kernel structures can be annotated with the new section (via the “
__ro_after_init” macro), and they’ll become read-only once boot has finished.
The next step for attack surface reduction infrastructure will be to create a kernel memory region that is passively read-only, but can be made temporarily writable (by a single un-preemptable CPU), for storing sensitive structures that are written to only very rarely. Once this is done, much more of the kernel’s attack surface can be made read-only for the majority of its lifetime.
As people identify places where
__ro_after_init can be used, we can grow the protection. A good place to start is to look through the PaX/Grsecurity patch to find uses of
__read_only on variables that are only written to during
__init functions. The rest are places that will need the temporarily-writable infrastructure (PaX/Grsecurity uses
pax_close_kernel() for these).
That’s it for v4.6, next up will be v4.7!
© 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.