codeblog code is freedom — patching my itch


security things in Linux v4.6

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 11:45 pm

Previously: v4.5. The v4.6 Linux kernel release included a bunch of stuff, with much more of it under the KSPP umbrella.

seccomp support for parisc

Helge Deller added seccomp support for parisc, which included plumbing support for PTRACE_GETREGSET to get the self-tests working.

x86 32-bit mmap ASLR vs unlimited stack fixed

Hector Marco-Gisbert removed a long-standing limitation to mmap ASLR on 32-bit x86, where setting an unlimited stack (e.g. “ulimit -s unlimited”) would turn off mmap ASLR (which provided a way to bypass ASLR when executing setuid processes). Given that ASLR entropy can now be controlled directly (see the v4.5 post), and that the cases where this created an actual problem are very rare, a system that sees collisions between an unlimited stack and mmap ASLR can just adjust the 32-bit ASLR entropy instead.

x86 execute-only memory

Dave Hansen added Protection Key support for future x86 CPUs and, as part of this, implemented support for “execute only” memory in user-space. On pkeys-supporting CPUs, using mmap(..., PROT_EXEC) (i.e. without PROT_READ) will mean that the memory can be executed but cannot be read (or written). This provides some mitigation against automated ROP gadget finding where an executable is read out of memory to find places that can be used to build a malicious execution path. Using this will require changing some linker behavior (to avoid putting data in executable areas), but seems to otherwise Just Work. I’m looking forward to either emulated QEMU support or access to one of these fancy CPUs.
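
To see whether a given machine could enforce execute-only mappings at all, one option is to look for the protection-keys CPU flags. This is a minimal sketch, assuming the kernel reports the feature as “pku” (hardware support) and “ospke” (enabled by the OS) in /proc/cpuinfo; cpu_has_flag is a hypothetical helper name used only for illustration:

```shell
# cpu_has_flag FLAG [CPUINFO]: succeed if FLAG appears on a "flags" line.
# CPUINFO defaults to /proc/cpuinfo; it is a parameter so that a saved copy
# from another machine can be checked too.
cpu_has_flag() {
    grep -E '^flags[[:space:]]*:' "${2:-/proc/cpuinfo}" 2>/dev/null | grep -qw -- "$1"
}

# Assumption: "pku" and "ospke" are the flag names for protection keys.
if cpu_has_flag pku && cpu_has_flag ospke; then
    echo "pkeys available: PROT_EXEC-only mappings can be execute-only"
else
    echo "no pkeys: PROT_EXEC-only mappings will likely remain readable"
fi
```

On a CPU without pkeys the mmap() call itself still succeeds; the mapping just does not get the extra read protection.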

CONFIG_DEBUG_RODATA enabled by default on arm and arm64, and mandatory on x86

Ard Biesheuvel (arm64) and I (arm) made the poorly-named CONFIG_DEBUG_RODATA enabled by default. This feature controls whether the kernel enforces proper memory protections on its own memory regions (code memory is executable and read-only, read-only data is actually read-only and non-executable, and writable data is non-executable). This protection is a fundamental security primitive for kernel self-protection, so making it on-by-default is required to start any kind of attack surface reduction within the kernel.

On x86 CONFIG_DEBUG_RODATA was already enabled by default, but, at Ingo Molnar’s suggestion, I made it mandatory: CONFIG_DEBUG_RODATA cannot be turned off on x86. I expect we’ll get there with arm and arm64 too, but the protection is still somewhat new on these architectures, so it’s reasonable to continue to leave an “out” for developers that find themselves tripping over it.
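
To check how a particular kernel was built, the config file shipped next to it can be searched. A small sketch, assuming the distro installs the config as /boot/config-<release> (common on Debian and Ubuntu, but not universal); kconfig_enabled is a hypothetical helper name:

```shell
# kconfig_enabled OPTION [CONFIG_FILE]: succeed if OPTION is built in (=y).
# The default path is an assumption; pass another file if your distro
# keeps the kernel config elsewhere.
kconfig_enabled() {
    grep -qx -- "$1=y" "${2:-/boot/config-$(uname -r)}" 2>/dev/null
}

if kconfig_enabled CONFIG_DEBUG_RODATA; then
    echo "CONFIG_DEBUG_RODATA=y"
else
    echo "CONFIG_DEBUG_RODATA not set (or no config file found)"
fi
```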

arm64 KASLR text base offset

Ard Biesheuvel reworked a ton of arm64 infrastructure to support kernel relocation and, building on that, Kernel Address Space Layout Randomization of the kernel text base offset (and module base offset). As with x86 text base KASLR, this is a probabilistic defense that raises the bar for kernel attacks where finding the KASLR offset must be added to the chain of exploits used for a successful attack. One big difference from x86 is that the entropy for the KASLR must come either from Device Tree (in the “/chosen/kaslr-seed” property) or from UEFI (via EFI_RNG_PROTOCOL), so if you’re building arm64 devices, make sure you have a strong source of early-boot entropy that you can expose through your boot-firmware or boot-loader.

zero-poison after free

Laura Abbott reworked a bunch of the kernel memory management debugging code to add zeroing of freed memory, similar to PaX/Grsecurity’s PAX_MEMORY_SANITIZE feature. This feature means that memory is cleared at free, wiping any sensitive data so it doesn’t have an opportunity to leak in various ways (e.g. accidentally uninitialized structures or padding), and that certain types of use-after-free flaws cannot be exploited since the memory has been wiped. To take things even a step further, the poisoning can be verified at allocation time to make sure that nothing wrote to it between free and allocation (called “sanity checking”), which can catch another small subset of flaws.

To understand the pieces of this, it’s worth describing that the kernel’s higher level allocator, the “page allocator” (e.g. __get_free_pages()) is used by the finer-grained “slab allocator” (e.g. kmem_cache_alloc(), kmalloc()). Poisoning is handled separately in both allocators. The zero-poisoning happens at the page allocator level. Since the slab allocators tend to do their own allocation/freeing, their poisoning happens separately (since on slab free nothing has been freed up to the page allocator).

Only limited performance tuning has been done, so the penalty is rather high at the moment, at about 9% when doing a kernel build workload. Future work will include some exclusion of frequently-freed caches (similar to PAX_MEMORY_SANITIZE), and making the options entirely CONFIG controlled (right now both CONFIGs are needed to build in the code, and a kernel command line is needed to activate it). Performing the sanity checking (mentioned above) adds another roughly 3% penalty. In the general case (and once the performance of the poisoning is improved), the security value of the sanity checking isn’t worth the performance trade-off.

Tests for the features can be found in lkdtm as READ_AFTER_FREE and READ_BUDDY_AFTER_FREE. If you’re feeling especially paranoid and have enabled sanity-checking, WRITE_AFTER_FREE and WRITE_BUDDY_AFTER_FREE can test these as well.
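
Running one of those lkdtm tests might look like the sketch below. It assumes lkdtm is built in or loaded, that debugfs is mounted at /sys/kernel/debug, and that the test names are written to the provoke-crash/DIRECT file there; lkdtm_trigger is a hypothetical wrapper. Only try this on a disposable test machine, since the whole point is to provoke a kernel fault.

```shell
# lkdtm_trigger NAME [DIRECT_FILE]: ask lkdtm to run the named test right away.
# DIRECT_FILE defaults to the debugfs path assumed above.
lkdtm_trigger() {
    direct=${2:-/sys/kernel/debug/provoke-crash/DIRECT}
    [ -w "$direct" ] || { echo "cannot write $direct" >&2; return 1; }
    echo "$1" > "$direct"
}

# Example (as root, on a test machine), then look at the kernel log:
#   lkdtm_trigger READ_AFTER_FREE
#   lkdtm_trigger READ_BUDDY_AFTER_FREE
#   dmesg | tail
```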

To perform zero-poisoning of page allocations and (currently non-zero) poisoning of slab allocations, build with:


and enable the page allocator poisoning and slab allocator poisoning at boot with this on the kernel command line:

page_poison=on slub_debug=P

To add sanity-checking, set CONFIG_PAGE_POISONING_NO_SANITY=n, and add “F” to slub_debug as “slub_debug=PF”.
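
After rebooting, it is easy to confirm that the options actually made it onto the kernel command line. A rough sketch, with cmdline_value and poisoning_active as hypothetical helper names; it only inspects the command line and does not prove the code was built in:

```shell
# cmdline_value KEY [CMDLINE_FILE]: print the value of KEY=value from the
# kernel command line, or fail if KEY is absent. If KEY is repeated, this
# helper reports the last occurrence.
cmdline_value() {
    file=${2:-/proc/cmdline}
    [ -r "$file" ] || return 1
    val=$(tr ' ' '\n' < "$file" | sed -n "s/^$1=//p" | tail -n 1)
    [ -n "$val" ] && printf '%s\n' "$val"
}

# poisoning_active [CMDLINE_FILE]: succeed if page_poison=on is present and
# the slub_debug flags (the part before any comma) include "P".
poisoning_active() {
    [ "$(cmdline_value page_poison "$@")" = on ] || return 1
    flags=$(cmdline_value slub_debug "$@")
    case "${flags%%,*}" in
        *P*) return 0 ;;
        *)   return 1 ;;
    esac
}

if poisoning_active; then
    echo "page and slab poisoning requested at boot"
else
    echo "poisoning not requested on this boot"
fi
```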

read-only after init

I added the infrastructure to support making certain kernel memory read-only after kernel initialization (inspired by a small part of PaX/Grsecurity’s KERNEXEC functionality). The goal is to continue to reduce the attack surface within the kernel by making even more of the memory, especially function pointer tables, read-only (which depends on CONFIG_DEBUG_RODATA above).

Function pointer tables (and similar structures) are frequently targeted by attackers when redirecting execution. While many are already declared “const” in the kernel source code, making them read-only (and therefore unavailable to attackers) for their entire lifetime, there is a class of variables that get initialized during kernel (and module) start-up (i.e. written to during functions that are marked “__init“) and then never (intentionally) written to again. Some examples are things like the VDSO, vector tables, arch-specific callbacks, etc.

As it turns out, most architectures with kernel memory protection already delay making their data read-only until after __init (see mark_rodata_ro()), so it’s trivial to declare a new data section (“.data..ro_after_init“) and add it to the existing read-only data section (“.rodata“). Kernel structures can be annotated with the new section (via the “__ro_after_init” macro), and they’ll become read-only once boot has finished.

The next step for attack surface reduction infrastructure will be to create a kernel memory region that is passively read-only, but can be made temporarily writable (by a single un-preemptable CPU), for storing sensitive structures that are written to only very rarely. Once this is done, much more of the kernel’s attack surface can be made read-only for the majority of its lifetime.

As people identify places where __ro_after_init can be used, we can grow the protection. A good place to start is to look through the PaX/Grsecurity patch to find uses of __read_only on variables that are only written to during __init functions. The rest are places that will need the temporarily-writable infrastructure (PaX/Grsecurity uses pax_open_kernel()/pax_close_kernel() for these).

That’s it for v4.6, next up will be v4.7!

© 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License


security things in Linux v4.5

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 1:58 pm

Previously: v4.4. Some things I found interesting in the Linux kernel v4.5:

CONFIG_IO_STRICT_DEVMEM


The CONFIG_STRICT_DEVMEM setting that has existed for a long time already protects system RAM from being accessible through the /dev/mem device node to root in user-space. Dan Williams added CONFIG_IO_STRICT_DEVMEM to extend this so that if a kernel driver has reserved a device memory region for use, it will become unavailable to /dev/mem also. The reservation in the kernel was to keep other kernel things from using the memory, so this is just common sense to make sure user-space can’t stomp on it either. Everyone should have this enabled. (And if you have a system where you discover you need IO memory access from userspace, you can boot with “iomem=relaxed” to disable this at runtime.)

If you’re looking to draw a very bright line that keeps user-space away from device memory, it’s worth noting that if a device driver is a module, a malicious root user can just unload the module (freeing the kernel memory reservation), fiddle with the device memory, and then reload the driver module. So either just leave out /dev/mem entirely (not currently possible with upstream), build a monolithic kernel (no modules), or otherwise block (un)loading of modules (/proc/sys/kernel/modules_disabled).
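
For the last option, a quick check of the current state could look like this. It is a sketch only; modules_locked is a hypothetical helper name, and the sysctl path is the one mentioned above:

```shell
# modules_locked [SYSCTL_FILE]: succeed if module (un)loading has been disabled.
modules_locked() {
    [ "$(cat "${1:-/proc/sys/kernel/modules_disabled}" 2>/dev/null)" = 1 ]
}

if modules_locked; then
    echo "module loading and unloading is disabled"
else
    echo "modules can still be (un)loaded"
    # To lock it down, as root (this cannot be undone without a reboot):
    #   echo 1 > /proc/sys/kernel/modules_disabled
fi
```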

ptrace fsuid checking

Jann Horn fixed some corner-cases in how ptrace access checks were handled on special files in /proc. For example, prior to this fix, if a setuid process temporarily dropped privileges to perform actions as a regular user, the ptrace checks would not notice the reduced privilege, possibly allowing a regular user to trick a privileged process into disclosing things out of /proc (ASLR offsets, restricted directories, etc) that they normally would be restricted from seeing.

ASLR entropy sysctl

Daniel Cashman standardized the way architectures declare their maximum user-space ASLR entropy (CONFIG_ARCH_MMAP_RND_BITS_MAX) and then created a sysctl (/proc/sys/vm/mmap_rnd_bits) so that system owners could crank up entropy. For example, the default entropy on 32-bit ARM was 8 bits, but the maximum could be as much as 16. If your 64-bit kernel is built with CONFIG_COMPAT, there’s a compat version of the sysctl as well, for controlling the ASLR entropy of 32-bit processes: /proc/sys/vm/mmap_rnd_compat_bits.

Here’s how to crank your entropy to the max, without regard to what architecture you’re on:

for i in "" "compat_"; do f=/proc/sys/vm/mmap_rnd_${i}bits; [ -w "$f" ] || continue; n=$(cat "$f"); while echo "$n" 2>/dev/null > "$f"; do n=$(( n + 1 )); done; done
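
Since that loop changes system state, it can help to look at the values before and after. A small read-only sketch; show_aslr_bits is a hypothetical helper name, and the directory is a parameter only so the function can be pointed at a copy:

```shell
# show_aslr_bits [DIR]: print the current mmap ASLR entropy (in bits) for
# native and compat processes. DIR defaults to /proc/sys/vm.
show_aslr_bits() {
    dir=${1:-/proc/sys/vm}
    for name in mmap_rnd_bits mmap_rnd_compat_bits; do
        if [ -r "$dir/$name" ]; then
            printf '%s = %s\n' "$name" "$(cat "$dir/$name")"
        else
            printf '%s = (not available)\n' "$name"
        fi
    done
}

show_aslr_bits
```

The compat entry is reported as not available on kernels built without CONFIG_COMPAT.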

strict sysctl writes

Two years ago I added a sysctl for treating sysctl writes more like regular files (i.e. what’s written first is what appears at the start), rather than like a ring-buffer (what’s written last is what appears first). At the time it wasn’t clear what might break if this was enabled, so a WARN was added to the kernel. Since only one such string showed up in searches over the last two years, the strict writing mode was made the default. The setting remains available as /proc/sys/kernel/sysctl_writes_strict.
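
To see which mode a system is in, the sysctl can simply be read back. The sketch below assumes the three values documented for this sysctl as I understand them (-1 and 0 for the legacy behavior without and with the warning, 1 for strict); describe_sysctl_writes is a hypothetical helper name:

```shell
# describe_sysctl_writes [SYSCTL_FILE]: explain the current write-handling mode.
describe_sysctl_writes() {
    case "$(cat "${1:-/proc/sys/kernel/sysctl_writes_strict}" 2>/dev/null)" in
        1)  echo "strict: writes respect the file position" ;;
        0)  echo "legacy: file position ignored, warning on non-zero position" ;;
        -1) echo "legacy: file position ignored, no warning" ;;
        *)  echo "unknown (sysctl not present?)" ;;
    esac
}

describe_sysctl_writes
```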

seccomp UM support

Mickaël Salaün added seccomp support (and selftests) for user-mode Linux. Moar architectures!

seccomp NNP vs TSYNC fix

Jann Horn noticed and fixed a problem where if a seccomp filter was already in place on a process (after being installed by a privileged process like systemd, a container launcher, etc) then the setting of the “no new privs” flag could be bypassed when adding filters with the SECCOMP_FILTER_FLAG_TSYNC flag set. Bypassing NNP meant it might be possible to trick a buggy setuid program into doing things as root after a seccomp filter forced a privilege drop to fail (generally referred to as the “sendmail setuid flaw”). With NNP set, a setuid program can’t be run in the first place.
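
To see where a process stands, /proc/<pid>/status can be consulted. This sketch assumes the running kernel exposes “NoNewPrivs” and “Seccomp” lines there (older kernels may lack one or both, in which case the helper prints nothing); status_field is a hypothetical helper name:

```shell
# status_field NAME [STATUS_FILE]: print the value of a "NAME:<tab>value" line.
# STATUS_FILE defaults to /proc/self/status; use /proc/<pid>/status for
# another process.
status_field() {
    sed -n "s/^$1:[[:space:]]*//p" "${2:-/proc/self/status}" 2>/dev/null
}

# NoNewPrivs: 1 means the flag is set.
# Seccomp: 0 = disabled, 1 = strict mode, 2 = filter mode.
echo "NoNewPrivs: $(status_field NoNewPrivs)"
echo "Seccomp:    $(status_field Seccomp)"
```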

That’s it! Next I’ll cover v4.6.

Edit: Added notes about “iomem=…”



security things in Linux v4.4

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 2:47 pm

Previously: v4.3. Continuing with interesting security things in the Linux kernel, here’s v4.4. As before, if you think there’s stuff I missed that should get some attention, please let me know.

seccomp Checkpoint/Restore-In-Userspace

Tycho Andersen added a way to extract and restore seccomp filters from running processes via PTRACE_SECCOMP_GET_FILTER under CONFIG_CHECKPOINT_RESTORE. This is a continuation of his work (that I failed to mention in my prior post) from v4.3, which introduced a way to suspend and resume seccomp filters. As I mentioned at the time (and for which he continues to quote me) “this feature gives me the creeps.” :)

x86 W^X detection

Stephen Smalley noticed that there was still a range of kernel memory (just past the end of the kernel code itself) that was incorrectly marked writable and executable, defeating the point of CONFIG_DEBUG_RODATA which seeks to eliminate these kinds of memory ranges. He corrected this in v4.3 and added CONFIG_DEBUG_WX in v4.4 which performs a scan of memory at boot time and yells loudly if unexpected memory protections are found. To nobody’s delight, it was shortly discovered that UEFI leaves chunks of memory in this state too, which posed an ugly-to-solve problem (which Matt Fleming addressed in v4.6).
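
To find the result of that boot-time scan, the kernel log can be searched. This is a sketch under an assumption: that the scan mentions the literal string “W+X” in its messages, which is how I remember them but is not stated above. wx_report is a hypothetical helper name:

```shell
# wx_report [LOG_FILE]: print W+X scan lines from a saved kernel log, or from
# dmesg when no file is given. Matching on "W+X" is an assumption about the
# message wording.
wx_report() {
    if [ -n "$1" ]; then
        grep -F 'W+X' "$1"
    else
        dmesg 2>/dev/null | grep -F 'W+X'
    fi
}

wx_report || echo "no W+X scan output found (CONFIG_DEBUG_WX off, or log unavailable?)"
```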

x86_64 vsyscall CONFIG

I introduced a way to control the mode of the x86_64 vsyscall with a build-time CONFIG selection, though the choice I really care about is CONFIG_LEGACY_VSYSCALL_NONE, to force the vsyscall memory region off by default. The vsyscall memory region was always mapped into process memory at a fixed location, and it originally posed a security risk as a ROP gadget execution target. The vsyscall emulation mode was added to mitigate the problem, but it still left fixed-position static memory content in all processes, which could still pose a security risk. The good news is that glibc since version 2.15 doesn’t need vsyscall at all, so it can just be removed entirely. Anyone running a kernel built this way who discovers they need to support a pre-2.15 glibc can still re-enable it at the kernel command line with “vsyscall=emulate”.
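
Whether the region is still mapped on a running system shows up in any process’s memory map. A minimal sketch; has_vsyscall is a hypothetical helper name:

```shell
# has_vsyscall [MAPS_FILE]: succeed if a [vsyscall] mapping is present.
# MAPS_FILE defaults to /proc/self/maps.
has_vsyscall() {
    grep -qF '[vsyscall]' "${1:-/proc/self/maps}" 2>/dev/null
}

if has_vsyscall; then
    echo "vsyscall page is mapped"
else
    echo "no vsyscall page in this process"
fi
```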

That’s it for v4.4. Tune in tomorrow for v4.5!



security things in Linux v4.3

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 2:54 pm

When I gave my State of the Kernel Self-Protection Project presentation at the 2016 Linux Security Summit, I included some slides covering some quick bullet points on things I found of interest in recent Linux kernel releases. Since there wasn’t a lot of time to talk about them all, I figured I’d make some short blog posts here about the stuff I was paying attention to, along with links to more information. This certainly isn’t everything security-related or generally of interest, but they’re the things I thought needed to be pointed out. If there’s something security-related you think I should cover from v4.3, please mention it in the comments. I’m sure I haven’t caught everything. :)

A note on timing and context: the momentum for starting the Kernel Self Protection Project got rolling well before it was officially announced on November 5th last year. To that end, I included stuff from v4.3 (which was developed in the months leading up to November) under the umbrella of the project, since the goals of KSPP aren’t unique to the project nor must the goals be met by people that are explicitly participating in it. Additionally, not everything I think worth mentioning here technically falls under the “kernel self-protection” ideal anyway — some things are just really interesting userspace-facing features.

So, to that end, here are things I found interesting in v4.3:


Russell King implemented a feature for ARM that provides emulated segregation of user-space memory when running in kernel mode, by using the ARM Domain access control feature. This is similar to a combination of Privileged eXecute Never (PXN, in later ARMv7 CPUs) and Privileged Access Never (PAN, coming in future ARMv8.1 CPUs): the kernel cannot execute user-space memory, and cannot read/write user-space memory unless it was explicitly prepared to do so. This stops a huge set of common kernel exploitation methods, where either a malicious executable payload has been built in user-space memory and the kernel was redirected to run it, or where malicious data structures have been built in user-space memory and the kernel was tricked into dereferencing the memory, ultimately leading to a redirection of execution flow.

This raises the bar for attackers since they can no longer trivially build code or structures in user-space where they control the memory layout, locations, etc. Instead, an attacker must find areas in kernel memory that are writable (and in the case of code, executable), where they can discover the location as well. For an attacker, there are vastly fewer places where this is possible in kernel memory as opposed to user-space memory. And as we continue to reduce the attack surface of the kernel, these opportunities will continue to shrink.

While hardware support for this kind of segregation exists in s390 (natively separate memory spaces), ARM (PXN and PAN as mentioned above), and very recent x86 (SMEP since Ivy-Bridge, SMAP since Skylake), ARM is the first upstream architecture to provide this emulation for existing hardware. Everyone running ARMv7 CPUs with this kernel feature enabled suddenly gains the protection. Similar emulation protections (PAX_MEMORY_UDEREF) have been available in PaX/Grsecurity for a while, and I’m delighted to see a form of this land in upstream finally.

To test this kernel protection, the ACCESS_USERSPACE and EXEC_USERSPACE triggers for lkdtm have existed since Linux v3.13, when they were introduced in anticipation of the x86 SMEP and SMAP features.

Ambient Capabilities

Andy Lutomirski (with Christoph Lameter and Serge Hallyn) implemented a way for processes to pass capabilities across exec() in a sensible manner. Until Ambient Capabilities, any capabilities available to a process would only be passed to a child process if the new executable was correctly marked with filesystem capability bits. This turns out to be a real headache for anyone trying to build an even marginally complex “least privilege” execution environment. The case that Chrome OS ran into was having a network service daemon responsible for calling out to helper tools that would perform various networking operations. Keeping the daemon not running as root and retaining the needed capabilities in children required conflicting or crazy filesystem capabilities organized across all the binaries in the expected tree of privileged processes. (For example you may need to set filesystem capabilities on bash!) By being able to explicitly pass capabilities at runtime (instead of based on filesystem markings), this becomes much easier.

For more details, the commit message is well-written, almost twice as long as the code changes, and contains a test case. If that isn’t enough, there is a self-test available in tools/testing/selftests/capabilities/ too.
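
To see whether a process actually received ambient capabilities, its status file can be inspected. The sketch below assumes the kernel exposes a “CapAmb” hex mask in /proc/<pid>/status; cap_is_ambient is a hypothetical helper name, and the capability numbers (10, 12, 13) are the values of CAP_NET_BIND_SERVICE, CAP_NET_ADMIN and CAP_NET_RAW from the kernel headers:

```shell
# cap_is_ambient CAP_NUMBER [STATUS_FILE]: succeed if that capability bit is
# set in the ambient set. STATUS_FILE defaults to /proc/self/status.
cap_is_ambient() {
    mask=$(sed -n 's/^CapAmb:[[:space:]]*//p' "${2:-/proc/self/status}" 2>/dev/null)
    [ -n "$mask" ] || return 1
    [ $(( (0x$mask >> $1) & 1 )) -eq 1 ]
}

for cap in 10 12 13; do   # CAP_NET_BIND_SERVICE, CAP_NET_ADMIN, CAP_NET_RAW
    if cap_is_ambient "$cap"; then
        echo "capability $cap is ambient"
    else
        echo "capability $cap is not ambient"
    fi
done
```

Run from a helper launched by a daemon like the one described above, this shows at a glance which capabilities survived the exec().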

PowerPC and Tile support for seccomp filter

Michael Ellerman added support for seccomp to PowerPC, and Chris Metcalf added support to Tile. As the seccomp maintainer, I get excited when an architecture adds support, so here we are with two. Also included were updates to the seccomp self-tests (in tools/testing/selftests/seccomp), to help make sure everything continues working correctly.

That’s it for v4.3. If I missed stuff you found interesting, please let me know! I’m going to try to get more per-version posts out in time to catch up to v4.8, which appears to be tentatively scheduled for release this coming weekend. Next: v4.4.

