Kernel « codeblog

October 26, 2023

Enable MTE on Pixel 8

Filed under: Blogging,Kernel,Security — kees @ 12:19 pm

The Pixel 8 hardware (Tensor G3) supports the ARM Memory Tagging Extension (MTE), and software support is available both in Android userspace and the Linux kernel. This feature is a powerful defense against linear buffer overflows and many types of use-after-free flaws. I’m extremely happy to see this hardware finally available in the real world.

Turning it on for userspace is already wired up the Android UI: Settings / System / Developer options / Memory Tagging Extension / Enable MTE until you turn if off. Once enabled it will internally change an Android “system property” named “arm64.memtag.bootctl” by adding the option “memtag“.

Turning it on for the kernel is slightly more involved, but not difficult at all. This requires manually setting the “arm64.memtag.bootctl” property mentioned above to include “memtag-kernel” as well:

Plug your phone into a system that can run the adb tool
If not already installed, install adb. For example on Debian/Ubuntu: sudo apt install adb
Turn on “USB Debugging” in the phone’s “Developer options” menu, and accept the debugging session confirmation that will pop up when you first run adb
Verify the current setting: adb shell getprop | grep memtag.bootctl

[arm64.memtag.bootctl]: [memtag]

Enable kernel MTE: adb shell setprop arm64.memtag.bootctl memtag,memtag-kernel
Check the results: adb shell getprop | grep memtag.bootctl

[arm64.memtag.bootctl]: [memtag,memtag-kernel]

Reboot your phone

To check that MTE is enabled for the kernel (which is implemented using Kernel Address Sanitizer’s Hardware Tagging mode), you can check the kernel command line after rebooting:

$ mkdir foo && cd foo
$ adb bugreport
...
$ mkdir unpacked && cd unpacked
$ unzip ../bugreport*.zip
...
$ grep kasan= bugreport*.txt
...: Command line: ... kasan=off ... kasan=on ...

The latter “kasan=on” overrides the earlier “kasan=off“.

Enjoy!

Comments Off

June 24, 2022

finding binary differences

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 1:11 pm

As part of the continuing work to replace 1-element arrays in the Linux kernel, it’s very handy to show that a source change has had no executable code difference. For example, if you started with this:

struct foo {
    unsigned long flags;
    u32 length;
    u32 data[1];
};

void foo_init(int count)
{
    struct foo *instance;
    size_t bytes = sizeof(*instance) + sizeof(u32) * (count - 1);
    ...
    instance = kmalloc(bytes, GFP_KERNEL);
    ...
};

And you changed only the struct definition:

-    u32 data[1];
+    u32 data[];

The bytes calculation is going to be incorrect, since it is still subtracting 1 element’s worth of space from the desired count. (And let’s ignore for the moment the open-coded calculation that may end up with an arithmetic over/underflow here; that can be solved separately by using the struct_size() helper or the size_mul(), size_add(), etc family of helpers.)

The missed adjustment to the size calculation is relatively easy to find in this example, but sometimes it’s much less obvious how structure sizes might be woven into the code. I’ve been checking for issues by using the fantastic diffoscope tool. It can produce a LOT of noise if you try to compare builds without keeping in mind the issues solved by reproducible builds, with some additional notes. I prepare my build with the “known to disrupt code layout” options disabled, but with debug info enabled:

$ KBF="KBUILD_BUILD_TIMESTAMP=1980-01-01 KBUILD_BUILD_USER=user KBUILD_BUILD_HOST=host KBUILD_BUILD_VERSION=1"
$ OUT=gcc
$ make $KBF O=$OUT allmodconfig
$ ./scripts/config --file $OUT/.config \
        -d GCOV_KERNEL -d KCOV -d GCC_PLUGINS -d IKHEADERS -d KASAN -d UBSAN \
        -d DEBUG_INFO_NONE -e DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT
$ make $KBF O=$OUT olddefconfig

Then I build a stock target, saving the output in “before”. In this case, I’m examining drivers/scsi/megaraid/:

$ make -jN $KBF O=$OUT drivers/scsi/megaraid/
$ mkdir -p $OUT/before
$ cp $OUT/drivers/scsi/megaraid/*.o $OUT/before/

Then I patch and build a modified target, saving the output in “after”:

$ vi the/source/code.c
$ make -jN $KBF O=$OUT drivers/scsi/megaraid/
$ mkdir -p $OUT/after
$ cp $OUT/drivers/scsi/megaraid/*.o $OUT/after/

And then run diffoscope:

$ diffoscope $OUT/before/ $OUT/after/

If diffoscope output reports nothing, then we’re done. 🥳

Usually, though, when source lines move around other stuff will shift too (e.g. WARN macros rely on line numbers, so the bug table may change contents a bit, etc), and diffoscope output will look noisy. To examine just the executable code, the command that diffoscope used is reported in the output, and we can run it directly, but with possibly shifted line numbers not reported. i.e. running objdump without --line-numbers:

$ ARGS="--disassemble --demangle --reloc --no-show-raw-insn --section=.text"
$ for i in $(cd $OUT/before && echo *.o); do
        echo $i
        diff -u <(objdump $ARGS $OUT/before/$i | sed "0,/^Disassembly/d") \
                <(objdump $ARGS $OUT/after/$i  | sed "0,/^Disassembly/d")
done

If I see an unexpected difference, for example:

-    c120:      movq   $0x0,0x800(%rbx)
+    c120:      movq   $0x0,0x7f8(%rbx)

Then I'll search for the pattern with line numbers added to the objdump output:

$ vi <(objdump --line-numbers $ARGS $OUT/after/megaraid_sas_fp.o)

I'd search for "0x0,0x7f8", find the source file and line number above it, open that source file at that position, and look to see where something was being miscalculated:

$ vi drivers/scsi/megaraid/megaraid_sas_fp.c +329

Once tracked down, I'd start over at the "patch and build a modified target" step above, repeating until there were no differences. For example, in the starting example, I'd also need to make this change:

-    size_t bytes = sizeof(*instance) + sizeof(u32) * (count - 1);
+    size_t bytes = sizeof(*instance) + sizeof(u32) * count;

Though, as hinted earlier, better yet would be:

-    size_t bytes = sizeof(*instance) + sizeof(u32) * (count - 1);
+    size_t bytes = struct_size(instance, data, count);

But sometimes adding the helper usage will add binary output differences since they're performing overflow checking that might saturate at SIZE_MAX. To help with patch clarity, those changes can be done separately from fixing the array declaration.

Comments Off

April 4, 2022

security things in Linux v5.10

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 5:01 pm

Previously: v5.9

Linux v5.10 was released in December, 2020. Here’s my summary of various security things that I found interesting:

AMD SEV-ES
While guest VM memory encryption with AMD SEV has been supported for a while, Joerg Roedel, Thomas Lendacky, and others added register state encryption (SEV-ES). This means it’s even harder for a VM host to reconstruct a guest VM’s state.

x86 static calls
Josh Poimboeuf and Peter Zijlstra implemented static calls for x86, which operates very similarly to the “static branch” infrastructure in the kernel. With static branches, an if/else choice can be hard-coded, instead of being run-time evaluated every time. Such branches can be updated too (the kernel just rewrites the code to switch around the “branch”). All these principles apply to static calls as well, but they’re for replacing indirect function calls (i.e. a call through a function pointer) with a direct call (i.e. a hard-coded call address). This eliminates the need for Spectre mitigations (e.g. RETPOLINE) for these indirect calls, and avoids a memory lookup for the pointer. For hot-path code (like the scheduler), this has a measurable performance impact. It also serves as a kind of Control Flow Integrity implementation: an indirect call got removed, and the potential destinations have been explicitly identified at compile-time.

network RNG improvements
In an effort to improve the pseudo-random number generator used by the network subsystem (for things like port numbers and packet sequence numbers), Linux’s home-grown pRNG has been replaced by the SipHash round function, and perturbed by (hopefully) hard-to-predict internal kernel states. This should make it very hard to brute force the internal state of the pRNG and make predictions about future random numbers just from examining network traffic. Similarly, ICMP’s global rate limiter was adjusted to avoid leaking details of network state, as a start to fixing recent DNS Cache Poisoning attacks.

SafeSetID handles GID
Thomas Cedeno improved the SafeSetID LSM to handle group IDs (which required teaching the kernel about which syscalls were actually performing setgid.) Like the earlier setuid policy, this lets the system owner define an explicit list of allowed group ID transitions under CAP_SETGID (instead of to just any group), providing a way to keep the power of granting this capability much more limited. (This isn’t complete yet, though, since handling setgroups() is still needed.)

improve kernel’s internal checking of file contents
The kernel provides LSMs (like the Integrity subsystem) with details about files as they’re loaded. (For example, loading modules, new kernel images for kexec, and firmware.) There wasn’t very good coverage for cases where the contents were coming from things that weren’t files. To deal with this, new hooks were added that allow the LSMs to introspect the contents directly, and to do partial reads. This will give the LSMs much finer grain visibility into these kinds of operations.

set_fs removal continues
With the earlier work landed to free the core kernel code from set_fs(), Christoph Hellwig made it possible for set_fs() to be optional for an architecture. Subsequently, he then removed set_fs() entirely for x86, riscv, and powerpc. These architectures will now be free from the entire class of “kernel address limit” attacks that only needed to corrupt a single value in struct thead_info.

sysfs_emit() replaces sprintf() in /sys
Joe Perches tackled one of the most common bug classes with sprintf() and snprintf() in /sys handlers by creating a new helper, sysfs_emit(). This will handle the cases where kernel code was not correctly dealing with the length results from sprintf() calls, which might lead to buffer overflows in the PAGE_SIZE buffer that /sys handlers operate on. With the helper in place, it was possible to start the refactoring of the many sprintf() callers.

nosymfollow mount option
Mattias Nissler and Ross Zwisler implemented the nosymfollow mount option. This entirely disables symlink resolution for the given filesystem, similar to other mount options where noexec disallows execve(), nosuid disallows setid bits, and nodev disallows device files. Quoting the patch, it is “useful as a defensive measure for systems that need to deal with untrusted file systems in privileged contexts.” (i.e. for when /proc/sys/fs/protected_symlinks isn’t a big enough hammer.) Chrome OS uses this option for its stateful filesystem, as symlink traversal as been a common attack-persistence vector.

ARMv8.5 Memory Tagging Extension support
Vincenzo Frascino added support to arm64 for the coming Memory Tagging Extension, which will be available for ARMv8.5 and later chips. It provides 4 bits of tags (covering multiples of 16 byte spans of the address space). This is enough to deterministically eliminate all linear heap buffer overflow flaws (1 tag for “free”, and then rotate even values and odd values for neighboring allocations), which is probably one of the most common bugs being currently exploited. It also makes use-after-free and over/under indexing much more difficult for attackers (but still possible if the target’s tag bits can be exposed). Maybe some day we can switch to 128 bit virtual memory addresses and have fully versioned allocations. But for now, 16 tag values is better than none, though we do still need to wait for anyone to actually be shipping ARMv8.5 hardware.

fixes for flaws found by UBSAN
The work to make UBSAN generally usable under syzkaller continues to bear fruit, with various fixes all over the kernel for stuff like shift-out-of-bounds, divide-by-zero, and integer overflow. Seeing these kinds of patches land reinforces the the rationale of shifting the burden of these kinds of checks to the toolchain: these run-time bugs continue to pop up.

flexible array conversions
The work on flexible array conversions continues. Gustavo A. R. Silva and others continued to grind on the conversions, getting the kernel ever closer to being able to enable the -Warray-bounds compiler flag and clear the path for saner bounds checking of array indexes and memcpy() usage.

That’s it for now! Please let me know if you think anything else needs some attention. Next up is Linux v5.11.

Comments (4)

April 5, 2021

security things in Linux v5.9

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 4:24 pm

Previously: v5.8

Linux v5.9 was released in October, 2020. Here’s my summary of various security things that I found interesting:

seccomp user_notif file descriptor injection
Sargun Dhillon added the ability for SECCOMP_RET_USER_NOTIF filters to inject file descriptors into the target process using SECCOMP_IOCTL_NOTIF_ADDFD. This lets container managers fully emulate syscalls like open() and connect(), where an actual file descriptor is expected to be available after a successful syscall. In the process I fixed a couple bugs and refactored the file descriptor receiving code.

zero-initialize stack variables with Clang
When Alexander Potapenko landed support for Clang’s automatic variable initialization, it did so with a byte pattern designed to really stand out in kernel crashes. Now he’s added support for doing zero initialization via CONFIG_INIT_STACK_ALL_ZERO, which besides actually being faster, has a few behavior benefits as well. “Unlike pattern initialization, which has a higher chance of triggering existing bugs, zero initialization provides safe defaults for strings, pointers, indexes, and sizes.” Like the pattern initialization, this feature stops entire classes of uninitialized stack variable flaws.

common syscall entry/exit routines
Thomas Gleixner created architecture-independent code to do syscall entry/exit, since much of the kernel’s work during a syscall entry and exit is the same. There was no need to repeat this in each architecture, and having it implemented separately meant bugs (or features) might only get fixed (or implemented) in a handful of architectures. It means that features like seccomp become much easier to build since it wouldn’t need per-architecture implementations any more. Presently only x86 has switched over to the common routines.

SLAB kfree() hardening
To reach CONFIG_SLAB_FREELIST_HARDENED feature-parity with the SLUB heap allocator, I added naive double-free detection and the ability to detect cross-cache freeing in the SLAB allocator. This should keep a class of type-confusion bugs from biting kernels using SLAB. (Most distro kernels use SLUB, but some smaller devices prefer the slightly more compact SLAB, so this hardening is mostly aimed at those systems.)

new CAP_CHECKPOINT_RESTORE capability
Adrian Reber added the new CAP_CHECKPOINT_RESTORE capability, splitting this functionality off of CAP_SYS_ADMIN. The needs for the kernel to correctly checkpoint and restore a process (e.g. used to move processes between containers) continues to grow, and it became clear that the security implications were lower than those of CAP_SYS_ADMIN yet distinct from other capabilities. Using this capability is now the preferred method for doing things like changing /proc/self/exe.

debugfs boot-time visibility restriction
Peter Enderborg added the debugfs boot parameter to control the visibility of the kernel’s debug filesystem. The contents of debugfs continue to be a common area of sensitive information being exposed to attackers. While this was effectively possible by unsetting CONFIG_DEBUG_FS, that wasn’t a great approach for system builders needing a single set of kernel configs (e.g. a distro kernel), so now it can be disabled at boot time.

more seccomp architecture support
Michael Karcher implemented the SuperH seccomp hooks, Guo Ren implemented the C-SKY seccomp hooks, and Max Filippov implemented the xtensa seccomp hooks. Each of these included the ever-important updates to the seccomp regression testing suite in the kernel selftests.

stack protector support for RISC-V
Guo Ren implemented -fstack-protector (and -fstack-protector-strong) support for RISC-V. This is the initial global-canary support while the patches to GCC to support per-task canaries is getting finished (similar to the per-task canaries done for arm64). This will mean nearly all stack frame write overflows are no longer useful to attackers on this architecture. It’s nice to see this finally land for RISC-V, which is quickly approaching architecture feature parity with the other major architectures in the kernel.

new tasklet API
Romain Perier and Allen Pais introduced a new tasklet API to make their use safer. Much like the timer_list refactoring work done earlier, the tasklet API is also a potential source of simple function-pointer-and-first-argument controlled exploits via linear heap overwrites. It’s a smaller attack surface since it’s used much less in the kernel, but it is the same weak design, making it a sensible thing to replace. While the use of the tasklet API is considered deprecated (replaced by threaded IRQs), it’s not always a simple mechanical refactoring, so the old API still needs refactoring (since that CAN be done mechanically is most cases).

x86 FSGSBASE implementation
Sasha Levin, Andy Lutomirski, Chang S. Bae, Andi Kleen, Tony Luck, Thomas Gleixner, and others landed the long-awaited FSGSBASE series. This provides task switching performance improvements while keeping the kernel safe from modules accidentally (or maliciously) trying to use the features directly (which exposed an unprivileged direct kernel access hole).

filter x86 MSR writes
While it’s been long understood that writing to CPU Model-Specific Registers (MSRs) from userspace was a bad idea, it has been left enabled for things like MSR_IA32_ENERGY_PERF_BIAS. Boris Petkov has decided enough is enough and has now enabled logging and kernel tainting (TAINT_CPU_OUT_OF_SPEC) by default and a way to disable MSR writes at runtime. (However, since this is controlled by a normal module parameter and the root user can just turn writes back on, I continue to recommend that people build with CONFIG_X86_MSR=n.) The expectation is that userspace MSR writes will be entirely removed in future kernels.

uninitialized_var() macro removed
I made treewide changes to remove the uninitialized_var() macro, which had been used to silence compiler warnings. The rationale for this macro was weak to begin with (“the compiler is reporting an uninitialized variable that is clearly initialized”) since it was mainly papering over compiler bugs. However, it creates a much more fragile situation in the kernel since now such uses can actually disable automatic stack variable initialization, as well as mask legitimate “unused variable” warnings. The proper solution is to just initialize variables the compiler warns about.

function pointer cast removals
Oscar Carter has started removing function pointer casts from the kernel, in an effort to allow the kernel to build with -Wcast-function-type. The future use of Control Flow Integrity checking (which does validation of function prototypes matching between the caller and the target) tends not to work well with function casts, so it’d be nice to get rid of these before CFI lands.

flexible array conversions
As part of Gustavo A. R. Silva’s on-going work to replace zero-length and one-element arrays with flexible arrays, he has documented the details of the flexible array conversions, and the various helpers to be used in kernel code. Every commit gets the kernel closer to building with -Warray-bounds, which catches a lot of potential buffer overflows at compile time.

That’s it for now! Please let me know if you think anything else needs some attention. Next up is Linux v5.10.

Comments (3)

February 8, 2021

security things in Linux v5.8

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 4:47 pm

Previously: v5.7

Linux v5.8 was released in August, 2020. Here’s my summary of various security things that caught my attention:

arm64 Branch Target Identification
Dave Martin added support for ARMv8.5’s Branch Target Instructions (BTI), which are enabled in userspace at execve() time, and all the time in the kernel (which required manually marking up a lot of non-C code, like assembly and JIT code).

With this in place, Jump-Oriented Programming (JOP, where code gadgets are chained together with jumps and calls) is no longer available to the attacker. An attacker’s code must make direct function calls. This basically reduces the “usable” code available to an attacker from every word in the kernel text to only function entries (or jump targets). This is a “low granularity” forward-edge Control Flow Integrity (CFI) feature, which is important (since it greatly reduces the potential targets that can be used in an attack) and cheap (implemented in hardware). It’s a good first step to strong CFI, but (as we’ve seen with things like CFG) it isn’t usually strong enough to stop a motivated attacker. “High granularity” CFI (which uses a more specific branch-target characteristic, like function prototypes, to track expected call sites) is not yet a hardware supported feature, but the software version will be coming in the future by way of Clang’s CFI implementation.

arm64 Shadow Call Stack
Sami Tolvanen landed the kernel implementation of Clang’s Shadow Call Stack (SCS), which protects the kernel against Return-Oriented Programming (ROP) attacks (where code gadgets are chained together with returns). This backward-edge CFI protection is implemented by keeping a second dedicated stack pointer register (x18) and keeping a copy of the return addresses stored in a separate “shadow stack”. In this way, manipulating the regular stack’s return addresses will have no effect. (And since a copy of the return address continues to live in the regular stack, no changes are needed for back trace dumps, etc.)

It’s worth noting that unlike BTI (which is hardware based), this is a software defense that relies on the location of the Shadow Stack (i.e. the value of x18) staying secret, since the memory could be written to directly. Intel’s hardware ROP defense (CET) uses a hardware shadow stack that isn’t directly writable. ARM’s hardware defense against ROP is PAC (which is actually designed as an arbitrary CFI defense — it can be used for forward-edge too), but that depends on having ARMv8.3 hardware. The expectation is that SCS will be used until PAC is available.

Kernel Concurrency Sanitizer infrastructure added
Marco Elver landed support for the Kernel Concurrency Sanitizer, which is a new debugging infrastructure to find data races in the kernel, via CONFIG_KCSAN. This immediately found real bugs, with some fixes having already landed too. For more details, see the KCSAN documentation.

new capabilities
Alexey Budankov added CAP_PERFMON, which is designed to allow access to perf(). The idea is that this capability gives a process access to only read aspects of the running kernel and system. No longer will access be needed through the much more powerful abilities of CAP_SYS_ADMIN, which has many ways to change kernel internals. This allows for a split between controls over the confidentiality (read access via CAP_PERFMON) of the kernel vs control over integrity (write access via CAP_SYS_ADMIN).

Alexei Starovoitov added CAP_BPF, which is designed to separate BPF access from the all-powerful CAP_SYS_ADMIN. It is designed to be used in combination with CAP_PERFMON for tracing-like activities and CAP_NET_ADMIN for networking-related activities. For things that could change kernel integrity (i.e. write access), CAP_SYS_ADMIN is still required.

network random number generator improvements
Willy Tarreau made the network code’s random number generator less predictable. This will further frustrate any attacker’s attempts to recover the state of the RNG externally, which might lead to the ability to hijack network sessions (by correctly guessing packet states).

fix various kernel address exposures to non-CAP_SYSLOG
I fixed several situations where kernel addresses were still being exposed to unprivileged (i.e. non-CAP_SYSLOG) users, though usually only through odd corner cases. After refactoring how capabilities were being checked for files in /sys and /proc, the kernel modules sections, kprobes, and BPF exposures got fixed. (Though in doing so, I briefly made things much worse before getting it properly fixed. Yikes!)

RISCV W^X detection
Following up on his recent work to enable strict kernel memory protections on RISCV, Zong Li has now added support for CONFIG_DEBUG_WX as seen for other architectures. Any writable and executable memory regions in the kernel (which are lovely targets for attackers) will be loudly noted at boot so they can get corrected.

execve() refactoring continues
Eric W. Biederman continued working on execve() refactoring, including getting rid of the frequently problematic recursion used to locate binary handlers. I used the opportunity to dust off some old binfmt_script regression tests and get them into the kernel selftests.

multiple /proc instances
Alexey Gladkov modernized /proc internals and provided a way to have multiple /proc instances mounted in the same PID namespace. This allows for having multiple views of /proc, with different features enabled. (Including the newly added hidepid=4 and subset=pid mount options.)

set_fs() removal continues
Christoph Hellwig, with Eric W. Biederman, Arnd Bergmann, and others, have been diligently working to entirely remove the kernel’s set_fs() interface, which has long been a source of security flaws due to weird confusions about which address space the kernel thought it should be accessing. Beyond things like the lower-level per-architecture signal handling code, this has needed to touch various parts of the ELF loader, and networking code too.

READ_IMPLIES_EXEC is no more for native 64-bit
The READ_IMPLIES_EXEC flag was a work-around for dealing with the addition of non-executable (NX) memory when x86_64 was introduced. It was designed as a way to mark a memory region as “well, since we don’t know if this memory region was expected to be executable, we must assume that if we need to read it, we need to be allowed to execute it too”. It was designed mostly for stack memory (where trampoline code might live), but it would carry over into all mmap() allocations, which would mean sometimes exposing a large attack surface to an attacker looking to find executable memory. While normally this didn’t cause problems on modern systems that correctly marked their ELF sections as NX, there were still some awkward corner-cases. I fixed this by splitting READ_IMPLIES_EXEC from the ELF PT_GNU_STACK marking on x86 and arm/arm64, and declaring that a native 64-bit process would never gain READ_IMPLIES_EXEC on x86_64 and arm64, which matches the behavior of other native 64-bit architectures that correctly didn’t ever implement READ_IMPLIES_EXEC in the first place.

array index bounds checking continues
As part of the ongoing work to use modern flexible arrays in the kernel, Gustavo A. R. Silva added the flex_array_size() helper (as a cousin to struct_size()). The zero/one-member into flex array conversions continue with over a hundred commits as we slowly get closer to being able to build with -Warray-bounds.

scnprintf() replacement continues
Chen Zhou joined Takashi Iwai in continuing to replace potentially unsafe uses of sprintf() with scnprintf(). Fixing all of these will make sure the kernel avoids nasty buffer concatenation surprises.

That’s it for now! Let me know if there is anything else you think I should mention here. Next up: Linux v5.9.

Comments Off

September 21, 2020

security things in Linux v5.7

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 4:32 pm

Previously: v5.6

Linux v5.7 was released at the end of May. Here’s my summary of various security things that caught my attention:

arm64 kernel pointer authentication
While the ARMv8.3 CPU “Pointer Authentication” (PAC) feature landed for userspace already, Kristina Martsenko has now landed PAC support in kernel mode. The current implementation uses PACIASP which protects the saved stack pointer, similar to the existing CONFIG_STACKPROTECTOR feature, only faster. This also paves the way to sign and check pointers stored in the heap, as a way to defeat function pointer overwrites in those memory regions too. Since the behavior is different from the traditional stack protector, Amit Daniel Kachhap added an LKDTM test for PAC as well.

BPF LSM
The kernel’s Linux Security Module (LSM) API provide a way to write security modules that have traditionally implemented various Mandatory Access Control (MAC) systems like SELinux, AppArmor, etc. The LSM hooks are numerous and no one LSM uses them all, as some hooks are much more specialized (like those used by IMA, Yama, LoadPin, etc). There was not, however, any way to externally attach to these hooks (not even through a regular loadable kernel module) nor build fully dynamic security policy, until KP Singh landed the API for building LSM policy using BPF. With CONFIG_BPF_LSM=y, it is possible (for a privileged process) to write kernel LSM hooks in BPF, allowing for totally custom security policy (and reporting).

execve() deadlock refactoring
There have been a number of long-standing races in the kernel’s process launching code where ptrace could deadlock. Fixing these has been attempted several times over the last many years, but Eric W. Biederman and Ernd Edlinger decided to dive in, and successfully landed the a series of refactorings, splitting up the problematic locking and refactoring their uses to remove the deadlocks. While he was at it, Eric also extended the exec_id counter to 64 bits to avoid the possibility of the counter wrapping and allowing an attacker to send arbitrary signals to processes they normally shouldn’t be able to.

slub freelist obfuscation improvements
After Silvio Cesare observed some weaknesses in the implementation of CONFIG_SLAB_FREELIST_HARDENED‘s freelist pointer content obfuscation, I improved their bit diffusion, which makes attacks require significantly more memory content exposures to defeat the obfuscation. As part of the conversation, Vitaly Nikolenko pointed out that the freelist pointer’s location made it relatively easy to target too (for either disclosures or overwrites), so I moved it away from the edge of the slab, making it harder to reach through small-sized overflows (which usually target the freelist pointer). As it turns out, there were a few assumptions in the kernel about the location of the freelist pointer, which had to also get cleaned up.

RISCV strict kernel memory protections
Zong Li implemented STRICT_KERNEL_RWX support for RISCV. And to validate the results, following v5.6’s generic page table dumping work, he landed the RISCV page dumping code. This means it’s much easier to examine the kernel’s page table layout when running a debug kernel (built with PTDUMP_DEBUGFS), visible in /sys/kernel/debug/kernel_page_tables.

array index bounds checking
This is a pretty large area of work that touches a lot of overlapping elements (and history) in the Linux kernel. The short version is: C is bad at noticing when it uses an array index beyond the bounds of the declared array, and we need to fix that. For example, don’t do this:

int foo[5];
...
foo[8] = bar;

The long version gets complicated by the evolution of “flexible array” structure members, so we’ll pause for a moment and skim the surface of this topic. While things like CONFIG_FORTIFY_SOURCE try to catch these kinds of cases in the memcpy() and strcpy() family of functions, it doesn’t catch it in open-coded array indexing, as seen in the code above. GCC has a warning (-Warray-bounds) for these cases, but it was disabled by Linus because of all the false positives seen due to “fake” flexible array members. Before flexible arrays were standardized, GNU C supported “zero sized” array members. And before that, C code would use a 1-element array. These were all designed so that some structure could be the “header” in front of some data blob that could be addressable through the last structure member:

/* 1-element array */
struct foo {
    ...
    char contents[1];
};

/* GNU C extension: 0-element array */
struct foo {
    ...
    char contents[0];
};

/* C standard: flexible array */
struct foo {
    ...
    char contents[];
};

instance = kmalloc(sizeof(struct foo) + content_size);

Converting all the zero- and one-element array members to flexible arrays is one of Gustavo A. R. Silva’s goals, and hundreds of these changes started landing. Once fixed, -Warray-bounds can be re-enabled. Much more detail can be found in the kernel’s deprecation docs.

However, that will only catch the “visible at compile time” cases. For runtime checking, the Undefined Behavior Sanitizer has an option for adding runtime array bounds checking for catching things like this where the compiler cannot perform a static analysis of the index values:

int foo[5];
...
for (i = 0; i < some_argument; i++) {
    ...
    foo[i] = bar;
    ...
}

It was, however, not separate (via kernel Kconfig) until Elena Petrova and I split it out into CONFIG_UBSAN_BOUNDS, which is fast enough for production kernel use. With this enabled, it's now possible to instrument the kernel to catch these conditions, which seem to come up with some regularity in Wi-Fi and Bluetooth drivers for some reason. Since UBSAN (and the other Sanitizers) only WARN() by default, system owners need to set panic_on_warn=1 too if they want to defend against attacks targeting these kinds of flaws. Because of this, and to avoid bloating the kernel image with all the warning messages, I introduced CONFIG_UBSAN_TRAP which effectively turns these conditions into a BUG() without needing additional sysctl settings.

Fixing "additive" snprintf() usage
A common idiom in C for building up strings is to use sprintf()'s return value to increment a pointer into a string, and build a string with more sprintf() calls:

/* safe if strlen(foo) + 1 < sizeof(string) */
wrote  = sprintf(string, "Foo: %s\n", foo);
/* overflows if strlen(foo) + strlen(bar) > sizeof(string) */
wrote += sprintf(string + wrote, "Bar: %s\n", bar);
/* writing way beyond the end of "string" now ... */
wrote += sprintf(string + wrote, "Baz: %s\n", baz);

The risk is that if these calls eventually walk off the end of the string buffer, it will start writing into other memory and create some bad situations. Switching these to snprintf() does not, however, make anything safer, since snprintf() returns how much it would have written:

/* safe, assuming available <= sizeof(string), and for this example
 * assume strlen(foo) < sizeof(string) */
wrote  = snprintf(string, available, "Foo: %s\n", foo);
/* if (strlen(bar) > available - wrote), this is still safe since the
 * write into "string" will be truncated, but now "wrote" has been
 * incremented by how much snprintf() *would* have written, so "wrote"
 * is now larger than "available". */
wrote += snprintf(string + wrote, available - wrote, "Bar: %s\n", bar);
/* string + wrote is beyond the end of string, and availabe - wrote wraps
 * around to a giant positive value, making the write effectively 
 * unbounded. */
wrote += snprintf(string + wrote, available - wrote, "Baz: %s\n", baz);

So while the first overflowing call would be safe, the next one would be targeting beyond the end of the array and the size calculation will have wrapped around to a giant limit. Replacing this idiom with scnprintf() solves the issue because it only reports what was actually written. To this end, Takashi Iwai has been landing a bunch scnprintf() fixes.

That's it for now! Let me know if there is anything else you think I should mention here. Next up: Linux v5.8.

Comments Off

September 2, 2020

security things in Linux v5.6

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 4:22 pm

Previously: v5.5.

Linux v5.6 was released back in March. Here’s my quick summary of various features that caught my attention:

WireGuard
The widely used WireGuard VPN has been out-of-tree for a very long time. After 3 1/2 years since its initial upstream RFC, Ard Biesheuvel and Jason Donenfeld finished the work getting all the crypto prerequisites sorted out for the v5.5 kernel. For this release, Jason has gotten WireGuard itself landed. It was a twisty road, and I’m grateful to everyone involved for sticking it out and navigating the compromises and alternative solutions.

openat2() syscall and RESOLVE_* flags
Aleksa Sarai has added a number of important path resolution “scoping” options to the kernel’s open() handling, covering things like not walking above a specific point in a path hierarchy (RESOLVE_BENEATH), disabling the resolution of various “magic links” (RESOLVE_NO_MAGICLINKS) in procfs (e.g. /proc/$pid/exe) and other pseudo-filesystems, and treating a given lookup as happening relative to a different root directory (as if it were in a chroot, RESOLVE_IN_ROOT). As part of this, it became clear that there wasn’t a way to correctly extend the existing openat() syscall, so he added openat2() (which is a good example of the efforts being made to codify “Extensible Syscall” arguments). The RESOLVE_* set of flags also cover prior behaviors like RESOLVE_NO_XDEV and RESOLVE_NO_SYMLINKS.

pidfd_getfd() syscall
In the continuing growth of the much-needed pidfd APIs, Sargun Dhillon has added the pidfd_getfd() syscall which is a way to gain access to file descriptors of a process in a race-less way (or when /proc is not mounted). Before, it wasn’t always possible make sure that opening file descriptors via /proc/$pid/fd/$N was actually going to be associated with the correct PID. Much more detail about this has been written up at LWN.

openat() via io_uring
With my “attack surface reduction” hat on, I remain personally suspicious of the io_uring() family of APIs, but I can’t deny their utility for certain kinds of workloads. Being able to pipeline reads and writes without the overhead of actually making syscalls is pretty great for performance. Jens Axboe has added the IORING_OP_OPENAT command so that existing io_urings can open files to be added on the fly to the mapping of available read/write targets of a given io_uring. While LSMs are still happily able to intercept these actions, I remain wary of the growing “syscall multiplexer” that io_uring is becoming. I am, of course, glad to see that it has a comprehensive (if “out of tree”) test suite as part of liburing.

removal of blocking random pool
After making algorithmic changes to obviate separate entropy pools for random numbers, Andy Lutomirski removed the blocking random pool. This simplifies the kernel pRNG code significantly without compromising the userspace interfaces designed to fetch “cryptographically secure” random numbers. To quote Andy, “This series should not break any existing programs. /dev/urandom is unchanged. /dev/random will still block just after booting, but it will block less than it used to.” See LWN for more details on the history and discussion of the series.

arm64 support for on-chip RNG
Mark Brown added support for the future ARMv8.5’s RNG (SYS_RNDR_EL0), which is, from the kernel’s perspective, similar to x86’s RDRAND instruction. This will provide a bootloader-independent way to add entropy to the kernel’s pRNG for early boot randomness (e.g. stack canary values, memory ASLR offsets, etc). Until folks are running on ARMv8.5 systems, they can continue to depend on the bootloader for randomness (via the UEFI RNG interface) on arm64.

arm64 E0PD
Mark Brown added support for the future ARMv8.5’s E0PD feature (TCR_E0PD1), which causes all memory accesses from userspace into kernel space to fault in constant time. This is an attempt to remove any possible timing side-channel signals when probing kernel memory layout from userspace, as an alternative way to protect against Meltdown-style attacks. The expectation is that E0PD would be used instead of the more expensive Kernel Page Table Isolation (KPTI) features on arm64.

powerpc32 VMAP_STACK
Christophe Leroy added VMAP_STACK support to powerpc32, joining x86, arm64, and s390. This helps protect against the various classes of attacks that depend on exhausting the kernel stack in order to collide with neighboring kernel stacks. (Another common target, the sensitive thread_info, had already been moved away from the bottom of the stack by Christophe Leroy in Linux v5.1.)

generic Page Table dumping
Related to RISCV’s work to add page table dumping (via /sys/fs/debug/kernel_page_tables), Steven Price extracted the existing implementations from multiple architectures and created a common page table dumping framework (and then refactored all the other architectures to use it). I’m delighted to have this because I still remember when not having a working page table dumper for ARM delayed me for a while when trying to implement upstream kernel memory protections there. Anything that makes it easier for architectures to get their kernel memory protection working correctly makes me happy.

That’s in for now; let me know if there’s anything you think I missed. Next up: Linux v5.7.

Comments (1)

May 27, 2020

security things in Linux v5.5

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 1:04 pm

Previously: v5.4.

I got a bit behind on this blog post series! Let’s get caught up. Here are a bunch of security things I found interesting in the Linux kernel v5.5 release:

restrict perf_event_open() from LSM
Given the recurring flaws in the perf subsystem, there has been a strong desire to be able to entirely disable the interface. While the kernel.perf_event_paranoid sysctl knob has existed for a while, attempts to extend its control to “block all perf_event_open() calls” have failed in the past. Distribution kernels have carried the rejected sysctl patch for many years, but now Joel Fernandes has implemented a solution that was deemed acceptable: instead of extending the sysctl, add LSM hooks so that LSMs (e.g. SELinux, Apparmor, etc) can make these choices as part of their overall system policy.

generic fast full refcount_t
Will Deacon took the recent refcount_t hardening work for both x86 and arm64 and distilled the implementations into a single architecture-agnostic C version. The result was almost as fast as the x86 assembly version, but it covered more cases (e.g. increment-from-zero), and is now available by default for all architectures. (There is no longer any Kconfig associated with refcount_t; the use of the primitive provides full coverage.)

linker script cleanup for exception tables
When Rick Edgecombe presented his work on building Execute-Only memory under a hypervisor, he noted a region of memory that the kernel was attempting to read directly (instead of execute). He rearranged things for his x86-only patch series to work around the issue. Since I’d just been working in this area, I realized the root cause of this problem was the location of the exception table (which is strictly a lookup table and is never executed) and built a fix for the issue and applied it to all architectures, since it turns out the exception tables for almost all architectures are just a data table. Hopefully this will help clear the path for more Execute-Only memory work on all architectures. In the process of this, I also updated the section fill bytes on x86 to be a trap (0xCC, int3), instead of a NOP instruction so functions would need to be targeted more precisely by attacks.

KASLR for 32-bit PowerPC
Joining many other architectures, Jason Yan added kernel text base-address offset randomization (KASLR) to 32-bit PowerPC.

seccomp for RISC-V
After a bit of long road, David Abdurachmanov has added seccomp support to the RISC-V architecture. The series uncovered some more corner cases in the seccomp self tests code, which is always nice since then we get to make it more robust for the future!

seccomp USER_NOTIF continuation
When the seccomp SECCOMP_RET_USER_NOTIF interface was added, it seemed like it would only be used in very limited conditions, so the idea of needing to handle “normal” requests didn’t seem very onerous. However, since then, it has become clear that the overhead of a monitor process needing to perform lots of “normal” open() calls on behalf of the monitored process started to look more and more slow and fragile. To deal with this, it became clear that there needed to be a way for the USER_NOTIF interface to indicate that seccomp should just continue as normal and allow the syscall without any special handling. Christian Brauner implemented SECCOMP_USER_NOTIF_FLAG_CONTINUE to get this done. It comes with a bit of a disclaimer due to the chance that monitors may use it in places where ToCToU is a risk, and for possible conflicts with SECCOMP_RET_TRACE. But overall, this is a net win for container monitoring tools.

EFI_RNG_PROTOCOL for x86
Some EFI systems provide a Random Number Generator interface, which is useful for gaining some entropy in the kernel during very early boot. The arm64 boot stub has been using this for a while now, but Dominik Brodowski has now added support for x86 to do the same. This entropy is useful for kernel subsystems performing very earlier initialization whre random numbers are needed (like randomizing aspects of the SLUB memory allocator).

FORTIFY_SOURCE for MIPS
As has been enabled on many other architectures, Dmitry Korotin got MIPS building with CONFIG_FORTIFY_SOURCE, so compile-time (and some run-time) buffer overflows during calls to the memcpy() and strcpy() families of functions will be detected.

limit copy_{to,from}_user() size to INT_MAX
As done for VFS, vsnprintf(), and strscpy(), I went ahead and limited the size of copy_to_user() and copy_from_user() calls to INT_MAX in order to catch any weird overflows in size calculations.

Other things
Alexander Popov pointed out some more v5.5 features that I missed in this blog post. I’m repeating them here, with some minor edits/clarifications. Thank you Alexander!

KASan support for vmap memory expands KASan’s features to include analysis of memory allocated with vmalloc(). More specifically, this means KASan can examine the stack again, since it can be in that region since CONFIG_VMAP_STACK was introduced.
MIPS can build with GCC plugins and KCOV now, providing those systems with the associated features like stack wiping, coverage analysis, etc.
userfaultfd requires CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK. This was done to block unprivileged users from abusing the userfaultfd feature, as it is implemented in a way that provides the ability to inject file descriptors into unsuspecting processes.

Edit: added Alexander Popov’s notes

That’s it for v5.5! Let me know if there’s anything else that I should call out here. Next up: Linux v5.6.

Comments (1)

February 18, 2020

security things in Linux v5.4

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 4:37 pm

Previously: v5.3.

Linux kernel v5.4 was released in late November. The holidays got the best of me, but better late than never! ;) Here are some security-related things I found interesting:

waitid() gains P_PIDFD
Christian Brauner has continued his pidfd work by adding a critical mode to waitid(): P_PIDFD. This makes it possible to reap child processes via a pidfd, and completes the interfaces needed for the bulk of programs performing process lifecycle management. (i.e. a pidfd can come from /proc or clone(), and can be waited on with waitid().)

kernel lockdown
After something on the order of 8 years, Linux can now draw a bright line between “ring 0” (kernel memory) and “uid 0” (highest privilege level in userspace). The “kernel lockdown” feature, which has been an out-of-tree patch series in most Linux distros for almost as many years, attempts to enumerate all the intentional ways (i.e. interfaces not flaws) userspace might be able to read or modify kernel memory (or execute in kernel space), and disable them. While Matthew Garrett made the internal details fine-grained controllable, the basic lockdown LSM can be set to either disabled, “integrity” (kernel memory can be read but not written), or “confidentiality” (no kernel memory reads or writes). Beyond closing the many holes between userspace and the kernel, if new interfaces are added to the kernel that might violate kernel integrity or confidentiality, now there is a place to put the access control to make everyone happy and there doesn’t need to be a rehashing of the age old fight between “but root has full kernel access” vs “not in some system configurations”.

tagged memory relaxed syscall ABI
Andrey Konovalov (with Catalin Marinas and others) introduced a way to enable a “relaxed” tagged memory syscall ABI in the kernel. This means programs running on hardware that supports memory tags (or “versioning”, or “coloring”) in the upper (non-VMA) bits of a pointer address can use these addresses with the kernel without things going crazy. This is effectively teaching the kernel to ignore these high bits in places where they make no sense (i.e. mathematical comparisons) and keeping them in place where they have meaning (i.e. pointer dereferences).

As an example, if a userspace memory allocator had returned the address 0x0f00000010000000 (VMA address 0x10000000, with, say, a “high bits” tag of 0x0f), and a program used this range during a syscall that ultimately called copy_from_user() on it, the initial range check would fail if the tag bits were left in place: “that’s not a userspace address; it is greater than TASK_SIZE (0x0000800000000000)!”, so they are stripped for that check. During the actual copy into kernel memory, the tag is left in place so that when the hardware dereferences the pointer, the pointer tag can be checked against the expected tag assigned to referenced memory region. If there is a mismatch, the hardware will trigger the memory tagging protection.

Right now programs running on Sparc M7 CPUs with ADI (Application Data Integrity) can use this for hardware tagged memory, ARMv8 CPUs can use TBI (Top Byte Ignore) for software memory tagging, and eventually there will be ARMv8.5-A CPUs with MTE (Memory Tagging Extension).

boot entropy improvement
Thomas Gleixner got fed up with poor boot-time entropy and trolled Linus into coming up with reasonable way to add entropy on modern CPUs, taking advantage of timing noise, cycle counter jitter, and perhaps even the variability of speculative execution. This means that there shouldn’t be mysterious multi-second (or multi-minute!) hangs at boot when some systems don’t have enough entropy to service getrandom() syscalls from systemd or the like.

userspace writes to swap files blocked
From the department of “how did this go unnoticed for so long?”, Darrick J. Wong fixed the kernel to not allow writes from userspace to active swap files. Without this, it was possible for a user (usually root) with write access to a swap file to modify its contents, thereby changing memory contents of a process once it got paged back in. While root normally could just use CAP_PTRACE to modify a running process directly, this was a loophole that allowed lesser-privileged users (e.g. anyone in the “disk” group) without the needed capabilities to still bypass ptrace restrictions.

limit strscpy() sizes to INT_MAX
Generally speaking, if a size variable ends up larger than INT_MAX, some calculation somewhere has overflowed. And even if not, it’s probably going to hit code somewhere nearby that won’t deal well with the result. As already done in the VFS core, and vsprintf(), I added a check to strscpy() to reject sizes larger than INT_MAX.

ld.gold support removed
Thomas Gleixner removed support for the gold linker. While this isn’t providing a direct security benefit, ld.gold has been a constant source of weird bugs. Specifically where I’ve noticed, it had been pain while developing KASLR, and has more recently been causing problems while stabilizing building the kernel with Clang. Having this linker support removed makes things much easier going forward. There are enough weird bugs to fix in Clang and ld.lld. ;)

Intel TSX disabled
Given the use of Intel’s Transactional Synchronization Extensions (TSX) CPU feature by attackers to exploit speculation flaws, Pawan Gupta disabled the feature by default on CPUs that support disabling TSX.

That’s all I have for this version. Let me know if I missed anything. :) Next up is Linux v5.5!

Comments (2)

November 20, 2019

experimenting with Clang CFI on upstream Linux

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 9:09 pm

While much of the work on kernel Control Flow Integrity (CFI) is focused on arm64 (since kernel CFI is available on Android), a significant portion is in the core kernel itself (and especially the build system). Recently I got a sane build and boot on x86 with everything enabled, and I’ve been picking through some of the remaining pieces. I figured now would be a good time to document everything I do to get a build working in case other people want to play with it and find stuff that needs fixing.

First, everything is based on Sami Tolvanen’s upstream port of Clang’s forward-edge CFI, which includes his Link Time Optimization (LTO) work, which CFI requires. This tree also includes his backward-edge CFI work on arm64 with Clang’s Shadow Call Stack (SCS).

On top of that, I’ve got a few x86-specific patches that get me far enough to boot a kernel without warnings pouring across the console. Along with that are general linker script cleanups, CFI cast fixes, and x86 crypto fixes, all in various states of getting upstreamed. The resulting tree is here.

On the compiler side, you need a very recent Clang and LLD (i.e. “Clang 10”, or what I do is build from the latest git). For example, here’s how to get started. First, checkout, configure, and build Clang (leave out “--depth=1” if you want the full git history):

# Check out latest LLVM
mkdir -p $HOME/src
cd $HOME/src
git clone --depth=1 https://github.com/llvm/llvm-project.git
mkdir -p llvm-build
cd llvm-build
# Configure
mkdir -p $HOME/bin/clang-release
cmake -G Ninja \
      -DCMAKE_BUILD_TYPE=Release \
      -DLLVM_ENABLE_PROJECTS='clang;lld;compiler-rt' \
      -DCMAKE_INSTALL_PREFIX="$HOME/bin/clang-release" \
      ../llvm-project/llvm
# Build!
ninja install

Then checkout, configure, and build the CFI tree. (This assumes you’ve already got a checkout of Linus’s tree.)

# Check out my branch
cd ../linux
git remote add kees https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git
git fetch kees
git checkout kees/kspp/cfi/x86 -b test/cfi
# Use the above built Clang path first for the needed binaries: clang, ld.lld, and llvm-ar.
PATH="$HOME/bin/clang-release/bin:$PATH"
# Configure (this uses "defconfig" but you could use "menuconfig"), but you must
# include CC and LD in the make args or your .config won't know about Clang.
make defconfig CC=clang LD=ld.lld
# Enable LTO and CFI.
scripts/config \
     -e CONFIG_LTO \
     -e CONFIG_THINLTO \
     -d CONFIG_LTO_NONE \
     -e CONFIG_LTO_CLANG \
     -e CONFIG_CFI_CLANG \
     -e CONFIG_CFI_PERMISSIVE \
     -e CONFIG_CFI_CLANG_SHADOW
# Enable LKDTM if you want runtime fault testing:
scripts/config -e CONFIG_LKDTM
# Build!
make -j$(getconf _NPROCESSORS_ONLN) CC=clang LD=ld.lld

Do not be alarmed by various warnings, such as:

ld.lld: warning: cannot find entry symbol _start; defaulting to 0x1000
llvm-ar: error: unable to load 'arch/x86/kernel/head_64.o': file too small to be an archive
llvm-ar: error: unable to load 'arch/x86/kernel/head64.o': file too small to be an archive
llvm-ar: error: unable to load 'arch/x86/kernel/ebda.o': file too small to be an archive
llvm-ar: error: unable to load 'arch/x86/kernel/platform-quirks.o': file too small to be an archive
WARNING: EXPORT symbol "page_offset_base" [vmlinux] version generation failed, symbol will not be versioned.
WARNING: EXPORT symbol "vmalloc_base" [vmlinux] version generation failed, symbol will not be versioned.
WARNING: EXPORT symbol "vmemmap_base" [vmlinux] version generation failed, symbol will not be versioned.
WARNING: "__memcat_p" [vmlinux] is a static (unknown)
no symbols

Adjust your .config as you want (but, again, make sure the CC and LD args are pointed at Clang and LLD respectively). This should(!) result in a happy bootable x86 CFI-enabled kernel. If you want to see what a CFI failure looks like, you can poke LKDTM:

# Log into the booted system as root, then:
cat <(echo CFI_FORWARD_PROTO) >/sys/kernel/debug/provoke-crash/DIRECT
dmesg

Here’s the CFI splat I see on the console:

[   16.288372] lkdtm: Performing direct entry CFI_FORWARD_PROTO
[   16.290563] lkdtm: Calling matched prototype ...
[   16.292367] lkdtm: Calling mismatched prototype ...
[   16.293696] ------------[ cut here ]------------
[   16.294581] CFI failure (target: lkdtm_increment_int$53641d38e2dc4a151b75cbe816cbb86b.cfi_jt+0x0/0x10):
[   16.296288] WARNING: CPU: 3 PID: 2612 at kernel/cfi.c:29 __cfi_check_fail+0x38/0x40
...
[   16.346873] ---[ end trace 386b3874d294d2f7 ]---
[   16.347669] lkdtm: Fail: survived mismatched prototype function call!

The claim of “Fail: survived …” is due to CONFIG_CFI_PERMISSIVE=y. This allows the kernel to warn but continue with the bad call anyway. This is handy for debugging. In a production kernel that would be removed and the offending kernel thread would be killed. If you run this again with the config disabled, there will be no continuation from LKDTM. :)

Enjoy! And if you can figure out before me why there is still CFI instrumentation in the KPTI entry handler, please let me know and help us fix it. ;)

Comments (2)

November 14, 2019

security things in Linux v5.3

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 5:36 pm

Previously: v5.2.

Linux kernel v5.3 was released! I let this blog post get away from me, but it’s up now! :) Here are some security-related things I found interesting:

heap variable initialization
In the continuing work to remove “uninitialized” variables from the kernel, Alexander Potapenko added new “init_on_alloc” and “init_on_free” boot parameters (with associated Kconfig defaults) to perform zeroing of heap memory either at allocation time (i.e. all kmalloc()s effectively become kzalloc()s), at free time (i.e. all kfree()s effectively become kzfree()s), or both. The performance impact of the former under most workloads appears to be under 1%, if it’s measurable at all. The “init_on_free” option, however, is more costly but adds the benefit of reducing the lifetime of heap contents after they have been freed (which might be useful for some use-after-free attacks or side-channel attacks). Everyone should enable CONFIG_INIT_ON_ALLOC_DEFAULT_ON=1 (or boot with “init_on_alloc=1“), and the more paranoid system builders should add CONFIG_INIT_ON_FREE_DEFAULT_ON=1 (or “init_on_free=1” at boot). As workloads are found that cause performance concerns, tweaks to the initialization coverage can be added.

pidfd_open() added
Christian Brauner has continued his pidfd work by creating the next needed syscall: pidfd_open(), which takes a pid and returns a pidfd. This is useful for cases where process creation isn’t yet using CLONE_PIDFD, and where /proc may not be mounted.

-Wimplicit-fallthrough enabled globally
Gustavo A.R. Silva landed the last handful of implicit fallthrough fixes left in the kernel, which allows for -Wimplicit-fallthrough to be globally enabled for all kernel builds. This will keep any new instances of this bad code pattern from entering the kernel again. With several hundred implicit fallthroughs identified and fixed, something like 1 in 10 were missing breaks, which is way higher than I was expecting, making this work even more well justified.

x86 CR4 & CR0 pinning
In recent exploits, one of the steps for making the attacker’s life easier is to disable CPU protections like Supervisor Mode Access (and Execute) Prevention (SMAP and SMEP) by finding a way to write to CPU control registers to disable these features. For example, CR4 controls SMAP and SMEP, where disabling those would let an attacker access and execute userspace memory from kernel code again, opening up the attack to much greater flexibility. CR0 controls Write Protect (WP), which when disabled would allow an attacker to write to read-only memory like the kernel code itself. Attacks have been using the kernel’s CR4 and CR0 writing functions to make these changes (since it’s easier to gain that level of execute control), but now the kernel will attempt to “pin” sensitive bits in CR4 and CR0 to avoid them getting disabled. This forces attacks to do more work to enact such register changes going forward. (I’d like to see KVM enforce this too, which would actually protect guest kernels from all attempts to change protected register bits.)

additional kfree() sanity checking
In order to avoid corrupted pointers doing crazy things when they’re freed (as seen in recent exploits), I added additional sanity checks to verify kmem cache membership and to make sure that objects actually belong to the kernel slab heap. As a reminder, everyone should be building with CONFIG_SLAB_FREELIST_HARDENED=1.

KASLR enabled by default on arm64
Just as Kernel Address Space Layout Randomization (KASLR) was enabled by default on x86, now KASLR has been enabled by default on arm64 too. It’s worth noting, though, that in order to benefit from this setting, the bootloader used for such arm64 systems needs to either support the UEFI RNG function or provide entropy via the “/chosen/kaslr-seed” Device Tree property.

hardware security embargo documentation
As there continues to be a long tail of hardware flaws that need to be reported to the Linux kernel community under embargo, a well-defined process has been documented. This will let vendors unfamiliar with how to handle things follow the established best practices for interacting with the Linux kernel community in a way that lets mitigations get developed before embargoes are lifted. The latest (and HTML rendered) version of this process should always be available here.

Those are the things I had on my radar. Please let me know if there are other things I should add! Linux v5.4 is almost here…

Comments (6)

July 17, 2019

security things in Linux v5.2

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 5:07 pm

Previously: v5.1.

Linux kernel v5.2 was released last week! Here are some security-related things I found interesting:

page allocator freelist randomization
While the SLUB and SLAB allocator freelists have been randomized for a while now, the overarching page allocator itself wasn’t. This meant that anything doing allocation outside of the kmem_cache/kmalloc() would have deterministic placement in memory. This is bad both for security and for some cache management cases. Dan Williams implemented this randomization under CONFIG_SHUFFLE_PAGE_ALLOCATOR now, which provides additional uncertainty to memory layouts, though at a rather low granularity of 4MB (see SHUFFLE_ORDER). Also note that this feature needs to be enabled at boot time with page_alloc.shuffle=1 unless you have direct-mapped memory-side-cache (you can check the state at /sys/module/page_alloc/parameters/shuffle).

stack variable initialization with Clang
Alexander Potapenko added support via CONFIG_INIT_STACK_ALL for Clang’s -ftrivial-auto-var-init=pattern option that enables automatic initialization of stack variables. This provides even greater coverage than the prior GCC plugin for stack variable initialization, as Clang’s implementation also covers variables not passed by reference. (In theory, the kernel build should still warn about these instances, but even if they exist, Clang will initialize them.) Another notable difference between the GCC plugins and Clang’s implementation is that Clang initializes with a repeating 0xAA byte pattern, rather than zero. (Though this changes under certain situations, like for 32-bit pointers which are initialized with 0x000000AA.) As with the GCC plugin, the benefit is that the entire class of uninitialized stack variable flaws goes away.

Kernel Userspace Access Prevention on powerpc
Like SMAP on x86 and PAN on ARM, Michael Ellerman and Russell Currey have landed support for disallowing access to userspace without explicit markings in the kernel (KUAP) on Power9 and later PPC CPUs under CONFIG_PPC_RADIX_MMU=y (which is the default). This is the continuation of the execute protection (KUEP) in v4.10. Now if an attacker tries to trick the kernel into any kind of unexpected access from userspace (not just executing code), the kernel will fault.

Microarchitectural Data Sampling mitigations on x86
Another set of cache memory side-channel attacks came to light, and were consolidated together under the name Microarchitectural Data Sampling (MDS). MDS is weaker than other cache side-channels (less control over target address), but memory contents can still be exposed. Much like L1TF, when one’s threat model includes untrusted code running under Symmetric Multi Threading (SMT: more logical cores than physical cores), the only full mitigation is to disable hyperthreading (boot with “nosmt“). For all the other variations of the MDS family, Andi Kleen (and others) implemented various flushing mechanisms to avoid cache leakage.

unprivileged userfaultfd sysctl knob
Both FUSE and userfaultfd provide attackers with a way to stall a kernel thread in the middle of memory accesses from userspace by initiating an access on an unmapped page. While FUSE is usually behind some kind of access controls, userfaultfd hadn’t been. To avoid various heap grooming and heap spraying techniques for exploiting Use-after-Free flaws, Peter Xu added the new “vm.unprivileged_userfaultfd” sysctl knob to disallow unprivileged access to the userfaultfd syscall.

temporary mm for text poking on x86
The kernel regularly performs self-modification with things like text_poke() (during stuff like alternatives, ftrace, etc). Before, this was done with fixed mappings (“fixmap”) where a specific fixed address at the high end of memory was used to map physical pages as needed. However, this resulted in some temporal risks: other CPUs could write to the fixmap, or there might be stale TLB entries on removal that other CPUs might still be able to write through to change the target contents. Instead, Nadav Amit has created a separate memory map for kernel text writes, as if the kernel is trying to make writes to userspace. This mapping ends up staying local to the current CPU, and the poking address is randomized, unlike the old fixmap.

ongoing: implicit fall-through removal
Gustavo A. R. Silva is nearly done with marking (and fixing) all the implicit fall-through cases in the kernel. Based on the pull request from Gustavo, it looks very much like v5.3 will see -Wimplicit-fallthrough added to the global build flags and then this class of bug should stay extinct in the kernel.

CLONE_PIDFD added
Christian Brauner added the new CLONE_PIDFD flag to the clone() system call, which complements the pidfd work in v5.1 so that programs can now gain a handle for a process ID right at fork() (actually clone()) time, instead of needing to get the handle from /proc after process creation. With signals and forking now enabled, the next major piece (already in linux-next) will be adding P_PIDFD to the waitid() system call, and common process management can be done entirely with pidfd.

Other things
Alexander Popov pointed out some more v5.2 features that I missed in this blog post. I’m repeating them here, with some minor edits/clarifications. Thank you Alexander!

x86-64 IRQ/exception/debug stack now has guard pages to detect stack overflows immediately and deterministically from those contexts.
mincore(2) is made more conservative to avoid information exposures about memory cache states and similar.
x86’s objtool can now validate callers of uaccess helpers to make sure SMAP is not left disabled.
powerpc/32 now supports KASAN.
powerpc now warns if W+X pages are found on boot.

Edit: added CLONE_PIDFD notes, as reminded by Christian Brauner. :)
Edit: added Alexander Popov’s notes

That’s it for now; let me know if you think I should add anything here. We’re almost to -rc1 for v5.3!

Comments (4)

May 27, 2019

security things in Linux v5.1

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 8:49 pm

Previously: v5.0.

Linux kernel v5.1 has been released! Here are some security-related things that stood out to me:

introduction of pidfd
Christian Brauner landed the first portion of his work to remove pid races from the kernel: using a file descriptor to reference a process (“pidfd”). Now /proc/$pid can be opened and used as an argument for sending signals with the new pidfd_send_signal() syscall. This handle will only refer to the original process at the time the open() happened, and not to any later “reused” pid if the process dies and a new process is assigned the same pid. Using this method, it’s now possible to racelessly send signals to exactly the intended process without having to worry about pid reuse. (BTW, this commit wins the 2019 award for Most Well Documented Commit Log Justification.)

explicitly test for userspace mappings of heap memory
During Linux Conf AU 2019 Kernel Hardening BoF, Matthew Wilcox noted that there wasn’t anything in the kernel actually sanity-checking when userspace mappings were being applied to kernel heap memory (which would allow attackers to bypass the copy_{to,from}_user() infrastructure). Driver bugs or attackers able to confuse mappings wouldn’t get caught, so he added checks. To quote the commit logs: “It’s never appropriate to map a page allocated by SLAB into userspace” and “Pages which use page_type must never be mapped to userspace as it would destroy their page type”. The latter check almost immediately caught a bad case, which was quickly fixed to avoid page type corruption.

LSM stacking: shared security blobs
Casey Schaufler has landed one of the major pieces of getting multiple Linux Security Modules (LSMs) running at the same time (called “stacking”). It is now possible for LSMs to share the security-specific storage “blobs” associated with various core structures (e.g. inodes, tasks, etc) that LSMs can use for saving their state (e.g. storing which profile a given task confined under). The kernel originally gave only the single active “major” LSM (e.g. SELinux, Apprmor, etc) full control over the entire blob of storage. With “shared” security blobs, the LSM infrastructure does the allocation and management of the memory, and LSMs use an offset for reading/writing their portion of it. This unblocks the way for “medium sized” LSMs (like SARA and Landlock) to get stacked with a “major” LSM as they need to store much more state than the “minor” LSMs (e.g. Yama, LoadPin) which could already stack because they didn’t need blob storage.

SafeSetID LSM
Micah Morton added the new SafeSetID LSM, which provides a way to narrow the power associated with the CAP_SETUID capability. Normally a process with CAP_SETUID can become any user on the system, including root, which makes it a meaningless capability to hand out to non-root users in order for them to “drop privileges” to some less powerful user. There are trees of processes under Chrome OS that need to operate under different user IDs and other methods of accomplishing these transitions safely weren’t sufficient. Instead, this provides a way to create a system-wide policy for user ID transitions via setuid() (and group transitions via setgid()) when a process has the CAP_SETUID capability, making it a much more useful capability to hand out to non-root processes that need to make uid or gid transitions.

ongoing: refcount_t conversions
Elena Reshetova continued landing more refcount_t conversions in core kernel code (e.g. scheduler, futex, perf), with an additional conversion in btrfs from Anand Jain. The existing conversions, mainly when combined with syzkaller, continue to show their utility at finding bugs all over the kernel.

ongoing: implicit fall-through removal
Gustavo A. R. Silva continued to make progress on marking more implicit fall-through cases. What’s so impressive to me about this work, like refcount_t, is how many bugs it has been finding (see all the “missing break” patches). It really shows how quickly the kernel benefits from adding -Wimplicit-fallthrough to keep this class of bug from ever returning.

stack variable initialization includes scalars
The structleak gcc plugin (originally ported from PaX) had its “by reference” coverage improved to initialize scalar types as well (making “structleak” a bit of a misnomer: it now stops leaks from more than structs). Barring compiler bugs, this means that all stack variables in the kernel can be initialized before use at function entry. For variables not passed to functions by reference, the -Wuninitialized compiler flag (enabled via -Wall) already makes sure the kernel isn’t building with local-only uninitialized stack variables. And now with CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL enabled, all variables passed by reference will be initialized as well. This should eliminate most, if not all, uninitialized stack flaws with very minimal performance cost (for most workloads it is lost in the noise), though it does not have the stack data lifetime reduction benefits of GCC_PLUGIN_STACKLEAK, which wipes the stack at syscall exit. Clang has recently gained similar automatic stack initialization support, and I’d love to this feature in native gcc. To evaluate the coverage of the various stack auto-initialization features, I also wrote regression tests in lib/test_stackinit.c.

That’s it for now; please let me know if I missed anything. The v5.2 kernel development cycle is off and running already. :)

Comments Off

March 12, 2019

security things in Linux v5.0

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 4:04 pm

Previously: v4.20.

Linux kernel v5.0 was released last week! Looking through the changes, here are some security-related things I found interesting:

read-only linear mapping, arm64
While x86 has had a read-only linear mapping (or “Low Kernel Mapping” as shown in /sys/kernel/debug/page_tables/kernel under CONFIG_X86_PTDUMP=y) for a while, Ard Biesheuvel has added them to arm64 now. This means that ranges in the linear mapping that contain executable code (e.g. modules, JIT, etc), are not directly writable any more by attackers. On arm64, this is visible as “Linear mapping” in /sys/kernel/debug/kernel_page_tables under CONFIG_ARM64_PTDUMP=y, where you can now see the page-level granularity:

---[ Linear mapping ]---
...
0xffffb07cfc402000-0xffffb07cfc403000    4K PTE   ro NX SHD AF NG    UXN MEM/NORMAL
0xffffb07cfc403000-0xffffb07cfc4d0000  820K PTE   RW NX SHD AF NG    UXN MEM/NORMAL
0xffffb07cfc4d0000-0xffffb07cfc4d1000    4K PTE   ro NX SHD AF NG    UXN MEM/NORMAL
0xffffb07cfc4d1000-0xffffb07cfc79d000 2864K PTE   RW NX SHD AF NG    UXN MEM/NORMAL

per-task stack canary, arm
ARM has supported stack buffer overflow protection for a long time (currently via the compiler’s -fstack-protector-strong option). However, on ARM, the compiler uses a global variable for comparing the canary value, __stack_chk_guard. This meant that everywhere in the kernel needed to use the same canary value. If an attacker could expose a canary value in one task, it could be spoofed during a buffer overflow in another task. On x86, the canary is in Thread Local Storage (TLS, defined as %gs:20 on 32-bit and %gs:40 on 64-bit), which means it’s possible to have a different canary for every task since the %gs segment points to per-task structures. To solve this for ARM, Ard Biesheuvel built a GCC plugin to replace the global canary checking code with a per-task relative reference to a new canary in struct thread_info. As he describes in his blog post, the plugin results in replacing:

8010fad8:       e30c4488        movw    r4, #50312      ; 0xc488
8010fadc:       e34840d0        movt    r4, #32976      ; 0x80d0
...
8010fb1c:       e51b2030        ldr     r2, [fp, #-48]  ; 0xffffffd0
8010fb20:       e5943000        ldr     r3, [r4]
8010fb24:       e1520003        cmp     r2, r3
8010fb28:       1a000020        bne     8010fbb0
...
8010fbb0:       eb006738        bl      80129898 <__stack_chk_fail>

with:

8010fc18:       e1a0300d        mov     r3, sp
8010fc1c:       e3c34d7f        bic     r4, r3, #8128   ; 0x1fc0
...
8010fc60:       e51b2030        ldr     r2, [fp, #-48]  ; 0xffffffd0
8010fc64:       e5943018        ldr     r3, [r4, #24]
8010fc68:       e1520003        cmp     r2, r3
8010fc6c:       1a000020        bne     8010fcf4
...
8010fcf4:       eb006757        bl      80129a58 <__stack_chk_fail>

r2 holds the canary saved on the stack and r3 the known-good canary to check against. In the former, r3 is loaded through r4 at a fixed address (0x80d0c488, which “readelf -s vmlinux” confirms is the global __stack_chk_guard). In the latter, it’s coming from offset 0x24 in struct thread_info (which “pahole -C thread_info vmlinux” confirms is the “stack_canary” field).

per-task stack canary, arm64
The lack of per-task canary existed on arm64 too. Ard Biesheuvel solved this differently by coordinating with GCC developer Ramana Radhakrishnan to add support for a register-based offset option (specifically “-mstack-protector-guard=sysreg -mstack-protector-guard-reg=sp_el0 -mstack-protector-guard-offset=...“). With this feature, the canary can be found relative to sp_el0, since that register holds the pointer to the struct task_struct, which contains the canary. I’m hoping there will be a workable Clang solution soon too (for this and 32-bit ARM). (And it’s also worth noting that, unfortunately, this support isn’t yet in a released version of GCC. It’s expected for 9.0, likely this coming May.)

top-byte-ignore, arm64
Andrey Konovalov has been laying the groundwork with his Top Byte Ignore (TBI) series which will also help support ARMv8.3’s Pointer Authentication (PAC) and ARMv8.5’s Memory Tagging (MTE). While TBI technically conflicts with PAC, both rely on using “non-VA-space” (Virtual Address) bits in memory addresses, and getting the kernel ready to deal with ignoring non-VA bits. PAC stores signatures for checking things like return addresses on the stack or stored function pointers on heap, both to stop overwrites of control flow information. MTE stores a “tag” (or, depending on your dialect, a “color” or “version”) to mark separate memory allocation regions to stop use-after-tree and linear overflows. For either of these to work, the CPU has to be put into some form of the TBI addressing mode (though for MTE, it’ll be a “check the tag” mode), otherwise the addresses would resolve into totally the wrong place in memory. Even without PAC and MTE, this byte can be used to store bits that can be checked by software (which is what the rest of Andrey’s series does: adding this logic to speed up KASan).

ongoing: implicit fall-through removal
An area of active work in the kernel is the removal of all implicit fall-through in switch statements. While the C language has a statement to indicate the end of a switch case (“break“), it doesn’t have a statement to indicate that execution should fall through to the next case statement (just the lack of a “break” is used to indicate it should fall through — but this is not always the case), and such “implicit fall-through” may lead to bugs. Gustavo Silva has been the driving force behind fixing these since at least v4.14, with well over 300 patches on the topic alone (and over 20 missing break statements found and fixed as a result of the work). The goal is to be able to add -Wimplicit-fallthrough to the build so that the kernel will stay entirely free of this class of bug going forward. From roughly 2300 warnings, the kernel is now down to about 200. It’s also worth noting that with Stephen Rothwell’s help, this bug has been kept out of linux-next by him sending warning emails to any tree maintainers where a new instance is introduced (for example, here’s a bug introduced on Feb 20th and fixed on Feb 21st).

ongoing: refcount_t conversions
There also continues to be work converting reference counters from atomic_t to refcount_t so they can gain overflow protections. There have been 18 more conversions since v4.15 from Elena Reshetova, Trond Myklebust, Kirill Tkhai, Eric Biggers, and Björn Töpel. While there are more complex cases, the minimum goal is to reduce the Coccinelle warnings from scripts/coccinelle/api/atomic_as_refcounter.cocci to zero. As of v5.0, there are 131 warnings, with the bulk of the remaining areas in fs/ (49), drivers/ (41), and kernel/ (21).

userspace PAC, arm64
Mark Rutland and Kristina Martsenko enabled kernel support for ARMv8.3 PAC in userspace. As mentioned earlier about PAC, this will give userspace the ability to block a wide variety of function pointer overwrites by “signing” function pointers before storing them to memory. The kernel manages the keys (i.e. selects random keys and sets them up), but it’s up to userspace to detect and use the new CPU instructions. The “paca” and “pacg” flags will be visible in /proc/cpuinfo for CPUs that support it.

platform keyring
Nayna Jain introduced the trusted platform keyring, which cannot be updated by userspace. This can be used to verify platform or boot-time things like firmware, initramfs, or kexec kernel signatures, etc.

SECCOMP_RET_USER_NOTIF
Tycho Andersen added the new SECCOMP_RET_USER_NOTIF return type to seccomp which allows process monitors to receive notifications over a file descriptor when a seccomp filter has been hit. This is a much more light-weight method than ptrace for performing tasks on behalf of a process. It is especially useful for performing various admin-only tasks for unprivileged containers (like mounting filesystems). The process monitor can perform the requested task (after doing some ToCToU dances to read process memory for syscall arguments), or reject the syscall.

Edit: added userspace PAC and platform keyring, suggested by Alexander Popov
Edit: tried to clarify TBI vs PAC vs MTE
Edit: added details on SECCOMP_RET_USER_NOTIF

That’s it for now; please let me know if I missed anything. The v5.1 merge window is open, so off we go! :)

Comments Off

December 24, 2018

security things in Linux v4.20

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 3:59 pm

Previously: v4.19.

Linux kernel v4.20 has been released today! Looking through the changes, here are some security-related things I found interesting:

stackleak plugin

Alexander Popov’s work to port the grsecurity STACKLEAK plugin to the upstream kernel came to fruition. While it had received Acks from x86 (and arm64) maintainers, it has been rejected a few times by Linus. With everything matching Linus’s expectations now, it and the x86 glue have landed. (The arch-specific portions for arm64 from Laura Abbott actually landed in v4.19.) The plugin tracks function calls (with a sufficiently large stack usage) to mark the maximum depth of the stack used during a syscall. With this information, at the end of a syscall, the stack can be efficiently poisoned (i.e. instead of clearing the entire stack, only the portion that was actually used during the syscall needs to be written). There are two main benefits from the stack getting wiped after every syscall. First, there are no longer “uninitialized” values left over on the stack that an attacker might be able to use in the next syscall. Next, the lifetime of any sensitive data on the stack is reduced to only being live during the syscall itself. This is mainly interesting because any information exposures or side-channel attacks from other kernel threads need to be much more carefully timed to catch the stack data before it gets wiped.

Enabling CONFIG_GCC_PLUGIN_STACKLEAK=y means almost all uninitialized variable flaws go away, with only a very minor performance hit (it appears to be under 1% for most workloads). It’s still possible that, within a single syscall, a later buggy function call could use “uninitialized” bytes from the stack from an earlier function. Fixing this will need compiler support for pre-initialization (this is under development already for Clang, for example), but that may have larger performance implications.

raise faults for kernel addresses in copy_*_user()

Jann Horn reworked x86 memory exception handling to loudly notice when copy_{to,from}_user() tries to access unmapped kernel memory. Prior this, those accesses would result in a silent error (usually visible to callers as EFAULT), making it indistinguishable from a “regular” userspace memory exception. The purpose of this is to catch cases where, for example, the unchecked __copy_to_user() is called against a kernel address. Fuzzers like syzcaller weren’t able to notice very nasty bugs because writes to kernel addresses would either corrupt memory (which may or may not get detected at a later time) or return an EFAULT that looked like things were operating normally. With this change, it’s now possible to much more easily notice missing access_ok() checks. This has already caught two other corner cases even during v4.20 in HID and Xen.

spectre v2 userspace mitigation

The support for Single Thread Indirect Branch Predictors (STIBP) has been merged. This allowed CPUs that support STIBP to effectively disable Hyper-Threading to avoid indirect branch prediction side-channels to expose information between userspace threads on the same physical CPU. Since this was a very expensive solution, this protection was made opt-in (via explicit prctl() or implicitly under seccomp()). LWN has a nice write-up of the details.

jump labels read-only after init

Ard Biesheuvel noticed that jump labels don’t need to be writable after initialization, so their data structures were made read-only. Since they point to kernel code, they might be used by attackers to manipulate the jump targets as a way to change kernel code that wasn’t intended to be changed. Better to just move everything into the read-only memory region to remove it from the possible kernel targets for attackers.

VLA removal finished

As detailed earlier for v4.17, v4.18, and v4.19, a whole bunch of people answered my call to remove Variable Length Arrays (VLAs) from the kernel. I count at least 153 commits having been added to the kernel since v4.16 to remove VLAs, with a big thanks to Gustavo A. R. Silva, Laura Abbott, Salvatore Mesoraca, Kyle Spiers, Tobin C. Harding, Stephen Kitt, Geert Uytterhoeven, Arnd Bergmann, Takashi Iwai, Suraj Jitindar Singh, Tycho Andersen, Thomas Gleixner, Stefan Wahren, Prashant Bhole, Nikolay Borisov, Nicolas Pitre, Martin Schwidefsky, Martin KaFai Lau, Lorenzo Bianconi, Himanshu Jha, Chris Wilson, Christian Lamparter, Boris Brezillon, Ard Biesheuvel, and Antoine Tenart. With all that done, “-Wvla” has been added to the top-level Makefile so we don’t get any more added back in the future.

per-task stack canaries, powerpc
For a long time, only x86 has had per-task kernel stack canaries. Other architectures would generate a single canary for the life of the boot and use it in every task. This meant that exposing a canary from one task would give an attacker everything they needed to spoof a canary in a separate attack in a different task. Christophe Leroy has solved this on powerpc now, integrating the new GCC support for the -mstack-protector-guard-reg and -mstack-protector-guard-offset options.

Given the holidays, Linus opened the merge window before v4.20 was released, letting everyone send in pull requests in the week leading up to the release. v4.21 is in the making. :) Happy New Year everyone!

Edit: clarified stackleak details, thanks to Alexander Popov. Added per-task canaries note too.

Comments (1)

October 22, 2018

security things in Linux v4.19

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 4:17 pm

Previously: v4.18.

Linux kernel v4.19 was released today. Here are some security-related things I found interesting:

L1 Terminal Fault (L1TF)

While it seems like ages ago, the fixes for L1TF actually landed at the start of the v4.19 merge window. As with the other speculation flaw fixes, lots of people were involved, and the scope was pretty wide: bare metal machines, virtualized machines, etc. LWN has a great write-up on the L1TF flaw and the kernel’s documentation on L1TF defenses is equally detailed. I like how clean the solution is for bare-metal machines: when a page table entry should be marked invalid, instead of only changing the “Present” flag, it also inverts the address portion so even a speculative lookup ignoring the “Present” flag will land in an unmapped area.

protected regular and fifo files

Salvatore Mesoraca implemented an O_CREAT restriction in /tmp directories for FIFOs and regular files. This is similar to the existing symlink restrictions, which take effect in sticky world-writable directories (e.g. /tmp) when the opening user does not match the owner of the existing file (or directory). When a program opens a FIFO or regular file with O_CREAT and this kind of user mismatch, it is treated like it was also opened with O_EXCL: it gets rejected because there is already a file there, and the kernel wants to protect the program from writing possibly sensitive contents to a file owned by a different user. This has become a more common attack vector now that symlink and hardlink races have been eliminated.

syscall register clearing, arm64

One of the ways attackers can influence potential speculative execution flaws in the kernel is to leak information into the kernel via “unused” register contents. Most syscalls take only a few arguments, so all the other calling-convention-defined registers can be cleared instead of just left with whatever contents they had in userspace. As it turns out, clearing registers is very fast. Similar to what was done on x86, Mark Rutland implemented a full register-clearing syscall wrapper on arm64.

Variable Length Array removals, part 3

As mentioned in part 1 and part 2, VLAs continue to be removed from the kernel. While CONFIG_THREAD_INFO_IN_TASK and CONFIG_VMAP_STACK cover most issues with stack exhaustion attacks, not all architectures have those features, so getting rid of VLAs makes sure we keep a few classes of flaws out of all kernel architectures and configurations. It’s been a long road, and it’s shaping up to be a 4-part saga with the remaining VLA removals landing in the next kernel. For v4.19, several folks continued to help grind away at the problem: Arnd Bergmann, Kyle Spiers, Laura Abbott, Martin Schwidefsky, Salvatore Mesoraca, and myself.

shift overflow helper
Jason Gunthorpe noticed that while the kernel recently gained add/sub/mul/div helpers to check for arithmetic overflow, we didn’t have anything for shift-left. He added check_shl_overflow() to round out the toolbox and Leon Romanovsky immediately put it to use to solve an overflow in RDMA.

Edit: I forgot to mention this next feature when I first posted:

trusted architecture-supported RNG initialization

The Random Number Generator in the kernel seeds its pools from many entropy sources, including any architecture-specific sources (e.g. x86’s RDRAND). Due to many people not wanting to trust the architecture-specific source due to the inability to audit its operation, entropy from those sources was not credited to RNG initialization, which wants to gather “enough” entropy before claiming to be initialized. However, because some systems don’t generate enough entropy at boot time, it was taking a while to gather enough system entropy (e.g. from interrupts) before the RNG became usable, which might block userspace from starting (e.g. systemd wants to get early entropy). To help these cases, Ted T’so introduced a toggle to trust the architecture-specific entropy completely (i.e. RNG is considered fully initialized as soon as it gets the architecture-specific entropy). To use this, the kernel can be built with CONFIG_RANDOM_TRUST_CPU=y (or booted with “random.trust_cpu=on“).

That’s it for now; thanks for reading. The merge window is open for v4.20! Wish us luck. :)

Comments Off

August 20, 2018

security things in Linux v4.18

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 11:29 am

Previously: v4.17.

Linux kernel v4.18 was released last week. Here are details on some of the security things I found interesting:

allocation overflow detection helpers

One of the many ways C can be dangerous to use is that it lacks strong primitives to deal with arithmetic overflow. A developer can’t just wrap a series of calculations in a try/catch block to trap any calculations that might overflow (or underflow). Instead, C will happily wrap values back around, causing all kinds of flaws. Some time ago GCC added a set of single-operation helpers that will efficiently detect overflow, so Rasmus Villemoes suggested implementing these (with fallbacks) in the kernel. While it still requires explicit use by developers, it’s much more fool-proof than doing open-coded type-sensitive bounds checking before every calculation. As a first-use of these routines, Matthew Wilcox created wrappers for common size calculations, mainly for use during memory allocations.

removing open-coded multiplication from memory allocation arguments

A common flaw in the kernel is integer overflow during memory allocation size calculations. As mentioned above, C doesn’t provide much in the way of protection, so it’s on the developer to get it right. In an effort to reduce the frequency of these bugs, and inspired by a couple flaws found by Silvio Cesare, I did a first-pass sweep of the kernel to move from open-coded multiplications during memory allocations into either their 2-factor API counterparts (e.g. kmalloc(a * b, GFP...) -> kmalloc_array(a, b, GFP...)), or to use the new overflow-checking helpers (e.g. vmalloc(a * b) -> vmalloc(array_size(a, b))). There’s still lots more work to be done here, since frequently an allocation size will be calculated earlier in a variable rather than in the allocation arguments, and overflows happen in way more places than just memory allocation. Better yet would be to have exceptions raised on overflows where no wrap-around was expected (e.g. Emese Revfy’s size_overflow GCC plugin).

Variable Length Array removals, part 2

As discussed previously, VLAs continue to get removed from the kernel. For v4.18, we continued to get help from a bunch of lovely folks: Andreas Christoforou, Antoine Tenart, Chris Wilson, Gustavo A. R. Silva, Kyle Spiers, Laura Abbott, Salvatore Mesoraca, Stephan Wahren, Thomas Gleixner, Tobin C. Harding, and Tycho Andersen. Almost all the rest of the VLA removals have been queued for v4.19, but it looks like the very last of them (deep in the crypto subsystem) won’t land until v4.20. I’m so looking forward to being able to add -Wvla globally to the kernel build so we can be free from the classes of flaws that VLAs enable, like stack exhaustion and stack guard page jumping. Eliminating VLAs also simplifies the porting work of the stackleak GCC plugin from grsecurity, since it no longer has to hook and check VLA creation.

Kconfig compiler detection

While not strictly a security thing, Masahiro Yamada made giant improvements to the kernel’s Kconfig subsystem so that kernel build configuration now knows what compiler you’re using (among other things) so that configuration is no longer separate from the compiler features. For example, in the past, one could select CONFIG_CC_STACKPROTECTOR_STRONG even if the compiler didn’t support it, and later the build would fail. Or in other cases, configurations would silently down-grade to what was available, potentially leading to confusing kernel images where the compiler would change the meaning of a configuration. Going forward now, configurations that aren’t available to the compiler will simply be unselectable in Kconfig. This makes configuration much more consistent, though in some cases, it makes it harder to discover why some configuration is missing (e.g. CONFIG_GCC_PLUGINS no longer gives you a hint about needing to install the plugin development packages).

That’s it for now! Please let me know if you think I missed anything. Stay tuned for v4.19; the merge window is open. :)

Comments (2)

June 14, 2018

security things in Linux v4.17

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 4:23 pm

Previously: v4.16.

Linux kernel v4.17 was released last week, and here are some of the security things I think are interesting:

Jailhouse hypervisor

Jan Kiszka landed Jailhouse hypervisor support, which uses static partitioning (i.e. no resource over-committing), where the root “cell” spawns new jails by shrinking its own CPU/memory/etc resources and hands them over to the new jail. There’s a nice write-up of the hypervisor on LWN from 2014.

Sparc ADI

Khalid Aziz landed the userspace support for Sparc Application Data Integrity (ADI or SSM: Silicon Secured Memory), which is the hardware memory coloring (tagging) feature in Sparc M7. I’d love to see this extended into the kernel itself, as it would kill linear overflows between allocations, since the base pointer being used is tagged to belong to only a certain allocation (sized to a multiple of cache lines). Any attempt to increment beyond, into memory with a different tag, raises an exception. Enrico Perla has some great write-ups on using ADI in allocators and a comparison of ADI to Intel’s MPX.

new kernel stacks cleared on fork

It was possible that old memory contents would live in a new process’s kernel stack. While normally not visible, “uninitialized” memory read flaws or read overflows could expose these contents (especially stuff “deeper” in the stack that may never get overwritten for the life of the process). To avoid this, I made sure that new stacks were always zeroed. Oddly, this “priming” of the cache appeared to actually improve performance, though it was mostly in the noise.

MAP_FIXED_NOREPLACE

As part of further defense in depth against attacks like Stack Clash, Michal Hocko created MAP_FIXED_NOREPLACE. The regular MAP_FIXED has a subtle behavior not normally noticed (but used by some, so it couldn’t just be fixed): it will replace any overlapping portion of a pre-existing mapping. This means the kernel would silently overlap the stack into mmap or text regions, since MAP_FIXED was being used to build a new process’s memory layout. Instead, MAP_FIXED_NOREPLACE has all the features of MAP_FIXED without the replacement behavior: it will fail if a pre-existing mapping overlaps with the newly requested one. The ELF loader has been switched to use MAP_FIXED_NOREPLACE, and it’s available to userspace too, for similar use-cases.

pin stack limit during exec

I used a big hammer and pinned the RLIMIT_STACK values during exec. There were multiple methods to change the limit (through at least setrlimit() and prlimit()), and there were multiple places the limit got used to make decisions, so it seemed best to just pin the values for the life of the exec so no games could get played with them. Too much assumed the value wasn’t changing, so better to make that assumption actually true. Hopefully this is the last of the fixes for these bad interactions between stack limits and memory layouts during exec (which have all been defensive measures against flaws like Stack Clash).

Variable Length Array removals start

Following some discussion over Alexander Popov’s ongoing port of the stackleak GCC plugin, Linus declared that Variable Length Arrays (VLAs) should be eliminated from the kernel entirely. This is great because it kills several stack exhaustion attacks, including weird stuff like stepping over guard pages with giant stack allocations. However, with several hundred uses in the kernel, this wasn’t going to be an easy job. Thankfully, a whole bunch of people stepped up to help out: Gustavo A. R. Silva, Himanshu Jha, Joern Engel, Kyle Spiers, Laura Abbott, Lorenzo Bianconi, Nikolay Borisov, Salvatore Mesoraca, Stephen Kitt, Takashi Iwai, Tobin C. Harding, and Tycho Andersen. With Linus Torvalds and Martin Uecker, I also helped rewrite the max() macro to eliminate false positives seen by the -Wvla compiler option. Overall, about 1/3rd of the VLA instances were solved for v4.17, with many more coming for v4.18. I’m hoping we’ll have entirely eliminated VLAs by the time v4.19 ships.

Edit: I missed this next feature when I first published this post:

syscall register clearing, x86

One of the ways attackers can influence potential speculative execution flaws in the kernel is to leak information into the kernel via “unused” register contents. Most syscalls take only a few arguments, so all the other calling-convention-defined registers can be cleared instead of just left with whatever contents they had in userspace. As it turns out, clearing registers is very fast. Dominik Brodowski expanded a proof-of-concept from Linus Torvalds into a full register-clearing syscall wrapper on x86.

That’s it for now! Please let me know if you think I missed anything. Stay tuned for v4.18; the merge window is open. :)

Comments Off

April 19, 2018

UEFI booting and RAID1

Filed under: Debian,Kernel,Ubuntu,Ubuntu-Server — kees @ 5:34 pm

I spent some time yesterday building out a UEFI server that didn’t have on-board hardware RAID for its system drives. In these situations, I always use Linux’s md RAID1 for the root filesystem (and/or /boot). This worked well for BIOS booting since BIOS just transfers control blindly to the MBR of whatever disk it sees (modulo finding a “bootable partition” flag, etc, etc). This means that BIOS doesn’t really care what’s on the drive, it’ll hand over control to the GRUB code in the MBR.

With UEFI, the boot firmware is actually examining the GPT partition table, looking for the partition marked with the “EFI System Partition” (ESP) UUID. Then it looks for a FAT32 filesystem there, and does more things like looking at NVRAM boot entries, or just running BOOT/EFI/BOOTX64.EFI from the FAT32. Under Linux, this .EFI code is either GRUB itself, or Shim which loads GRUB.

So, if I want RAID1 for my root filesystem, that’s fine (GRUB will read md, LVM, etc), but how do I handle /boot/efi (the UEFI ESP)? In everything I found answering this question, the answer was “oh, just manually make an ESP on each drive in your RAID and copy the files around, add a separate NVRAM entry (with efibootmgr) for each drive, and you’re fine!” I did not like this one bit since it meant things could get out of sync between the copies, etc.

The current implementation of Linux’s md RAID puts metadata at the front of a partition. This solves more problems than it creates, but it means the RAID isn’t “invisible” to something that doesn’t know about the metadata. In fact, mdadm warns about this pretty loudly:

# mdadm --create /dev/md0 --level 1 --raid-disks 2 /dev/sda1 /dev/sdb1
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90

Reading from the mdadm man page:

   -e, --metadata=
...
          1, 1.0, 1.1, 1.2 default
                 Use  the new version-1 format superblock.  This has fewer
                 restrictions.  It can easily be moved between hosts  with
                 different  endian-ness,  and  a recovery operation can be
                 checkpointed and restarted.  The  different  sub-versions
                 store  the  superblock  at  different  locations  on  the
                 device, either at the end (for 1.0), at  the  start  (for
                 1.1)  or  4K from the start (for 1.2).  "1" is equivalent
                 to "1.2" (the commonly preferred 1.x format).   "default"
                 is equivalent to "1.2".

First we toss a FAT32 on the RAID (mkfs.fat -F32 /dev/md0), and looking at the results, the first 4K is entirely zeros, and file doesn’t see a filesystem:

# dd if=/dev/sda1 bs=1K count=5 status=none | hexdump -C
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000  fc 4e 2b a9 01 00 00 00  00 00 00 00 00 00 00 00  |.N+.............|
...
# file -s /dev/sda1
/dev/sda1: Linux Software RAID version 1.2 ...

So, instead, we’ll use --metadata 1.0 to put the RAID metadata at the end:

# mdadm --create /dev/md0 --level 1 --raid-disks 2 --metadata 1.0 /dev/sda1 /dev/sdb1
...
# mkfs.fat -F32 /dev/md0
# dd if=/dev/sda1 bs=1 skip=80 count=16 status=none | xxd
00000000: 2020 4641 5433 3220 2020 0e1f be77 7cac    FAT32   ...w|.
# file -s /dev/sda1
/dev/sda1: ... FAT (32 bit)

Now we have a visible FAT32 filesystem on the ESP. UEFI should be able to boot whatever disk hasn’t failed, and grub-install will write to the RAID mounted at /boot/efi.

However, we’re left with a new problem: on (at least) Debian and Ubuntu, grub-install attempts to run efibootmgr to record which disk UEFI should boot from. This fails, though, since it expects a single disk, not a RAID set. In fact, it returns nothing, and tries to run efibootmgr with an empty -d argument:

Installing for x86_64-efi platform.
efibootmgr: option requires an argument -- 'd'
...
grub-install: error: efibootmgr failed to register the boot entry: Operation not permitted.
Failed: grub-install --target=x86_64-efi  
WARNING: Bootloader is not properly installed, system may not be bootable

Luckily my UEFI boots without NVRAM entries, and I can disable the NVRAM writing via the “Update NVRAM variables to automatically boot into Debian?” debconf prompt when running: dpkg-reconfigure -p low grub-efi-amd64

So, now my system will boot with both or either drive present, and updates from Linux to /boot/efi are visible on all RAID members at boot-time. HOWEVER there is one nasty risk with this setup: if UEFI writes anything to one of the drives (which this firmware did when it wrote out a “boot variable cache” file), it may lead to corrupted results once Linux mounts the RAID (since the member drives won’t have identical block-level copies of the FAT32 any more).

To deal with this “external write” situation, I see some solutions:

Make the partition read-only when not under Linux. (I don’t think this is a thing.)
Create higher-level knowledge of the root-filesystem RAID configuration is needed to keep a collection of filesystems manually synchronized instead of doing block-level RAID. (Seems like a lot of work and would need redesign of /boot/efi into something like /boot/efi/booted, /boot/efi/spare1, /boot/efi/spare2, etc)
Prefer one RAID member’s copy of /boot/efi and rebuild the RAID at every boot. If there were no external writes, there’s no issue. (Though what’s really the right way to pick the copy to prefer?)

Since mdadm has the “--update=resync” assembly option, I can actually do the latter option. This required updating /etc/mdadm/mdadm.conf to add <ignore> on the RAID’s ARRAY line to keep it from auto-starting:

ARRAY <ignore> metadata=1.0 UUID=123...

(Since it’s ignored, I’ve chosen /dev/md100 for the manual assembly below.) Then I added the noauto option to the /boot/efi entry in /etc/fstab:

/dev/md100 /boot/efi vfat noauto,defaults 0 0

And finally I added a systemd oneshot service that assembles the RAID with resync and mounts it:

[Unit]
Description=Resync /boot/efi RAID
DefaultDependencies=no
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/sbin/mdadm -A /dev/md100 --uuid=123... --update=resync
ExecStart=/bin/mount /boot/efi
RemainAfterExit=yes

[Install]
WantedBy=sysinit.target

(And don’t forget to run “update-initramfs -u” so the initramfs has an updated copy of /dev/mdadm/mdadm.conf.)

If mdadm.conf supported an “update=” option for ARRAY lines, this would have been trivial. Looking at the source, though, that kind of change doesn’t look easy. I can dream!

And if I wanted to keep a “pristine” version of /boot/efi that UEFI couldn’t update I could rearrange things more dramatically to keep the primary RAID member as a loopback device on a file in the root filesystem (e.g. /boot/efi.img). This would make all external changes in the real ESPs disappear after resync. Something like:


# truncate --size 512M /boot/efi.img
# losetup -f --show /boot/efi.img
/dev/loop0
# mdadm --create /dev/md100 --level 1 --raid-disks 3 --metadata 1.0 /dev/loop0 /dev/sda1 /dev/sdb1

And at boot just rebuild it from /dev/loop0, though I’m not sure how to “prefer” that partition…

Notes: commands I used to muck with grub-install:

echo "grub-pc grub2/update_nvram boolean false" | debconf-set-selections
echo "grub-pc grub-efi/install_devices multiselect /dev/md0" | debconf-set-selections && dpkg-reconfigure -p low grub-efi-amd64-signed

Update: Ubuntu Focal 20.04 now provides a way for GRUB to install to an arbitrary collection of devices, so no RAID needed. Whew.

Comments (5)

April 12, 2018

security things in Linux v4.16

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 5:04 pm

Previously: v4.15.

Linux kernel v4.16 was released last week. I really should write these posts in advance, otherwise I get distracted by the merge window. Regardless, here are some of the security things I think are interesting:

KPTI on arm64

Will Deacon, Catalin Marinas, and several other folks brought Kernel Page Table Isolation (via CONFIG_UNMAP_KERNEL_AT_EL0) to arm64. While most ARMv8+ CPUs were not vulnerable to the primary Meltdown flaw, the Cortex-A75 does need KPTI to be safe from memory content leaks. It’s worth noting, though, that KPTI does protect other ARMv8+ CPU models from having privileged register contents exposed. So, whatever your threat model, it’s very nice to have this clean isolation between kernel and userspace page tables for all ARMv8+ CPUs.

hardened usercopy whitelisting

While whole-object bounds checking was implemented in CONFIG_HARDENED_USERCOPY already, David Windsor and I finished another part of the porting work of grsecurity’s PAX_USERCOPY protection: usercopy whitelisting. This further tightens the scope of slab allocations that can be copied to/from userspace. Now, instead of allowing all objects in slab memory to be copied, only the whitelisted areas (where a subsystem has specifically marked the memory region allowed) can be copied. For example, only the auxv array out of the larger mm_struct.

As mentioned in the first commit from the series, this reduces the scope of slab memory that could be copied out of the kernel in the face of a bug to under 15%. As can be seen, one area of work remaining are the kmalloc regions. Those are regularly used for copying things in and out of userspace, but they’re also used for small simple allocations that aren’t meant to be exposed to userspace. Working to separate these kmalloc users needs some careful auditing.

Total Slab Memory:           48074720
Usercopyable Memory:          6367532  13.2%
         task_struct                    0.2%         4480/1630720
         RAW                            0.3%            300/96000
         RAWv6                          2.1%           1408/64768
         ext4_inode_cache               3.0%       269760/8740224
         dentry                        11.1%       585984/5273856
         mm_struct                     29.1%         54912/188448
         kmalloc-8                    100.0%          24576/24576
         kmalloc-16                   100.0%          28672/28672
         kmalloc-32                   100.0%          81920/81920
         kmalloc-192                  100.0%          96768/96768
         kmalloc-128                  100.0%        143360/143360
         names_cache                  100.0%        163840/163840
         kmalloc-64                   100.0%        167936/167936
         kmalloc-256                  100.0%        339968/339968
         kmalloc-512                  100.0%        350720/350720
         kmalloc-96                   100.0%        455616/455616
         kmalloc-8192                 100.0%        655360/655360
         kmalloc-1024                 100.0%        812032/812032
         kmalloc-4096                 100.0%        819200/819200
         kmalloc-2048                 100.0%      1310720/1310720

This series took quite a while to land (you can see David’s original patch date as back in June of last year). Partly this was due to having to spend a lot of time researching the code paths so that each whitelist could be explained for commit logs, partly due to making various adjustments from maintainer feedback, and partly due to the short merge window in v4.15 (when it was originally proposed for merging) combined with some last-minute glitches that made Linus nervous. After baking in linux-next for almost two full development cycles, it finally landed. (Though be sure to disable CONFIG_HARDENED_USERCOPY_FALLBACK to gain enforcement of the whitelists — by default it only warns and falls back to the full-object checking.)

automatic stack-protector

While the stack-protector features of the kernel have existed for quite some time, it has never been enabled by default. This was mainly due to needing to evaluate compiler support for the feature, and Kconfig didn’t have a way to check the compiler features before offering CONFIG_* options. As a defense technology, the stack protector is pretty mature. Having it on by default would have greatly reduced the impact of things like the BlueBorne attack (CVE-2017-1000251), as fewer systems would have lacked the defense.

After spending quite a bit of time fighting with ancient compiler versions (*cough*GCC 4.4.4*cough*), I landed CONFIG_CC_STACKPROTECTOR_AUTO, which is default on, and tries to use the stack protector if it is available. The implementation of the solution, however, did not please Linus, though he allowed it to be merged. In the future, Kconfig will gain the knowledge to make better decisions which lets the kernel expose the availability of (the now default) stack protector directly in Kconfig, rather than depending on rather ugly Makefile hacks.

execute-only memory for PowerPC

Similar to the Protection Keys (pkeys) hardware support that landed in v4.6 for x86, Ram Pai landed pkeys support for Power7/8/9. This should expand the scope of what’s possible in the dynamic loader to avoid having arbitrary read flaws allow an exploit to read out all of executable memory in order to find ROP gadgets.

That’s it for now; let me know if you think I should add anything! The v4.17 merge window is open. :)

Edit: added details on ARM register leaks, thanks to Daniel Micay.

Edit: added section on protection keys for POWER, thanks to Florian Weimer.

Comments (2)

February 5, 2018

security things in Linux v4.15

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 3:45 pm

Previously: v4.14.

Linux kernel v4.15 was released last week, and there’s a bunch of security things I think are interesting:

Kernel Page Table Isolation
PTI has already gotten plenty of reporting, but to summarize, it is mainly to protect against CPU cache timing side-channel attacks that can expose kernel memory contents to userspace (CVE-2017-5754, the speculative execution “rogue data cache load” or “Meltdown” flaw).

Even for just x86_64 (as CONFIG_PAGE_TABLE_ISOLATION), this was a giant amount of work, and tons of people helped with it over several months. PowerPC also had mitigations land, and arm64 (as CONFIG_UNMAP_KERNEL_AT_EL0) will have PTI in v4.16 (though only the Cortex-A75 is vulnerable). For anyone with really old hardware, x86_32 is under development, too.

An additional benefit of the x86_64 PTI is that since there are now two copies of the page tables, the kernel-mode copy of the userspace mappings can be marked entirely non-executable, which means pre-SMEP hardware now gains SMEP emulation. Kernel exploits that try to jump into userspace memory to continue running malicious code are dead (even if the attacker manages to turn SMEP off first). With some more work, SMAP emulation could also be introduced (to stop even just reading malicious userspace memory), which would close the door on these common attack vectors. It’s worth noting that arm64 has had the equivalent (PAN emulation) since v4.10.

retpoline
In addition to the PTI work above, the retpoline kernel mitigations for CVE-2017-5715 (“branch target injection” or “Spectre variant 2”) started landing. (Note that to gain full retpoline support, you’ll need a patched compiler, as appearing in gcc 7.3/8+, and currently queued for release in clang.)

This work continues to evolve, and clean-ups are continuing into v4.16. Also in v4.16 we’ll start to see mitigations for the other speculative execution variant (i.e. CVE-2017-5753, “bounds check bypass” or “Spectre variant 1”).

x86 fast refcount_t overflow protection
In v4.13 the CONFIG_REFCOUNT_FULL code was added to stop many types of reference counting flaws (with a tiny performance loss). In v4.14 the infrastructure for a fast overflow-only refcount_t protection on x86 (based on grsecurity’s PAX_REFCOUNT) landed, but it was disabled at the last minute due to a bug that was finally fixed in v4.15. Since it was a tiny change, the fast refcount_t protection was backported and enabled for the Longterm maintenance kernel in v4.14.5. Conversions from atomic_t to refcount_t have also continued, and are now above 168, with a handful remaining.

%p hashing
One of the many sources of kernel information exposures has been the use of the %p format string specifier. The strings end up in all kinds of places (dmesg, /sys files, /proc files, etc), and usage is scattered through-out the kernel, which had made it a very hard exposure to fix. Earlier efforts like kptr_restrict‘s %pK didn’t really work since it was opt-in. While a few recent attempts (by William C Roberts, Greg KH, and others) had been made to provide toggles for %p to act like %pK, Linus finally stepped in and declared that %p should be used so rarely that it shouldn’t used at all, and Tobin Harding took on the task of finding the right path forward, which resulted in %p output getting hashed with a per-boot secret. The result is that simple debugging continues to work (two reports of the same hash value can confirm the same address without saying what the address actually is) but frustrates attacker’s ability to use such information exposures as building blocks for exploits.

For developers needing an unhashed %p, %px was introduced but, as Linus cautioned, either your %p remains useful when hashed, your %p was never actually useful to begin with and should be removed, or you need to strongly justify using %px with sane permissions.

It remains to be seen if we’ve just kicked the information exposure can down the road and in 5 years we’ll be fighting with %px and %lx, but hopefully the attitudes about such exposures will have changed enough to better guide developers and their code.

struct timer_list refactoring
The kernel’s timer (struct timer_list) infrastructure is, unsurprisingly, used to create callbacks that execute after a certain amount of time. They are one of the more fundamental pieces of the kernel, and as such have existed for a very long time, with over 1000 call sites. Improvements to the API have been made over time, but old ways of doing things have stuck around. Modern callbacks in the kernel take an argument pointing to the structure associated with the callback, so that a callback has context for which instance of the callback has been triggered. The timer callbacks didn’t, and took an unsigned long that was cast back to whatever arbitrary context the code setting up the timer wanted to associate with the callback, and this variable was stored in struct timer_list along with the function pointer for the callback. This creates an opportunity for an attacker looking to exploit a memory corruption vulnerability (e.g. heap overflow), where they’re able to overwrite not only the function pointer, but also the argument, as stored in memory. This elevates the attack into a weak ROP, and has been used as the basis for disabling SMEP in modern exploits (see retire_blk_timer). To remove this weakness in the kernel’s design, I refactored the timer callback API and and all its callers, for a whopping:

1128 files changed, 4834 insertions(+), 5926 deletions(-)

Another benefit of the refactoring is that once the kernel starts getting built by compilers with Control Flow Integrity support, timer callbacks won’t be lumped together with all the other functions that take a single unsigned long argument. (In other words, some CFI implementations wouldn’t have caught the kind of attack described above since the attacker’s target function still matched its original prototype.)

That’s it for now; please let me know if I missed anything. The v4.16 merge window is now open!

Comments (3)

January 4, 2018

SMEP emulation in PTI

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 1:43 pm

An nice additional benefit of the recent Kernel Page Table Isolation (CONFIG_PAGE_TABLE_ISOLATION) patches (to defend against CVE-2017-5754, the speculative execution “rogue data cache load” or “Meltdown” flaw) is that the userspace page tables visible while running in kernel mode lack the executable bit. As a result, systems without the SMEP CPU feature (before Ivy-Bridge) get it emulated for “free”.

Here’s a non-SMEP system with PTI disabled (booted with “pti=off“), running the EXEC_USERSPACE LKDTM test:

# grep smep /proc/cpuinfo
# dmesg -c | grep isolation
[    0.000000] Kernel/User page tables isolation: disabled on command line.
# cat <(echo EXEC_USERSPACE) > /sys/kernel/debug/provoke-crash/DIRECT
# dmesg
[   17.883754] lkdtm: Performing direct entry EXEC_USERSPACE
[   17.885149] lkdtm: attempting ok execution at ffffffff9f6293a0
[   17.886350] lkdtm: attempting bad execution at 00007f6a2f84d000

No crash! The kernel was happily executing userspace memory.

But with PTI enabled:

# grep smep /proc/cpuinfo
# dmesg -c | grep isolation
[    0.000000] Kernel/User page tables isolation: enabled
# cat <(echo EXEC_USERSPACE) > /sys/kernel/debug/provoke-crash/DIRECT
Killed
# dmesg
[   33.657695] lkdtm: Performing direct entry EXEC_USERSPACE
[   33.658800] lkdtm: attempting ok execution at ffffffff926293a0
[   33.660110] lkdtm: attempting bad execution at 00007f7c64546000
[   33.661301] BUG: unable to handle kernel paging request at 00007f7c64546000
[   33.662554] IP: 0x7f7c64546000
...

It should only take a little more work to leave the userspace page tables entirely unmapped while in kernel mode, and only map them in during copy_to_user()/copy_from_user() as ARM already does with ARM64_SW_TTBR0_PAN (or CONFIG_CPU_SW_DOMAIN_PAN on arm32).

Comments (3)

November 14, 2017

security things in Linux v4.14

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 9:23 pm

Previously: v4.13.

Linux kernel v4.14 was released this last Sunday, and there’s a bunch of security things I think are interesting:

vmapped kernel stack on arm64
Similar to the same feature on x86, Mark Rutland and Ard Biesheuvel implemented CONFIG_VMAP_STACK for arm64, which moves the kernel stack to an isolated and guard-paged vmap area. With traditional stacks, there were two major risks when exhausting the stack: overwriting the thread_info structure (which contained the addr_limit field which is checked during copy_to/from_user()), and overwriting neighboring stacks (or other things allocated next to the stack). While arm64 previously moved its thread_info off the stack to deal with the former issue, this vmap change adds the last bit of protection by nature of the vmap guard pages. If the kernel tries to write past the end of the stack, it will hit the guard page and fault. (Testing for this is now possible via LKDTM’s STACK_GUARD_PAGE_LEADING/TRAILING tests.)

One aspect of the guard page protection that will need further attention (on all architectures) is that if the stack grew because of a giant Variable Length Array on the stack (effectively an implicit alloca() call), it might be possible to jump over the guard page entirely (as seen in the userspace Stack Clash attacks). Thankfully the use of VLAs is rare in the kernel. In the future, hopefully we’ll see the addition of PaX/grsecurity’s STACKLEAK plugin which, in addition to its primary purpose of clearing the kernel stack on return to userspace, makes sure stack expansion cannot skip over guard pages. This “stack probing” ability will likely also become directly available from the compiler as well.

set_fs() balance checking
Related to the addr_limit field mentioned above, another class of bug is finding a way to force the kernel into accidentally leaving addr_limit open to kernel memory through an unbalanced call to set_fs(). In some areas of the kernel, in order to reuse userspace routines (usually VFS or compat related), code will do something like: set_fs(KERNEL_DS); ...some code here...; set_fs(USER_DS);. When the USER_DS call goes missing (usually due to a buggy error path or exception), subsequent system calls can suddenly start writing into kernel memory via copy_to_user (where the “to user” really means “within the addr_limit range”).

Thomas Garnier implemented USER_DS checking at syscall exit time for x86, arm, and arm64 (with a small fix). This means that a broken set_fs() setting will not extend beyond the buggy syscall that fails to set it back to USER_DS. Additionally, as part of the discussion on the best way to deal with this feature, Christoph Hellwig and Al Viro (and others) have been making extensive changes to avoid the need for set_fs() being used at all, which should greatly reduce the number of places where it might be possible to introduce such a bug in the future.

SLUB freelist hardening
A common class of heap attacks is overwriting the freelist pointers stored inline in the unallocated SLUB cache objects. PaX/grsecurity developed an inexpensive defense that XORs the freelist pointer with a global random value (and the storage address). Daniel Micay improved on this by using a per-cache random value, and I refactored the code a bit more. The resulting feature, enabled with CONFIG_SLAB_FREELIST_HARDENED, makes freelist pointer overwrites very hard to exploit unless an attacker has found a way to expose both the random value and the pointer location. This should render blind heap overflow bugs much more difficult to exploit.

Additionally, Alexander Popov implemented a simple double-free defense, similar to the “fasttop” check in the GNU C library, which will catch sequential free()s of the same pointer. (And has already uncovered a bug.)

Future work would be to provide similar metadata protections to the SLAB allocator (though SLAB doesn’t store its freelist within the individual unused objects, so it has a different set of exposures compared to SLUB).

setuid-exec stack limitation
Continuing the various additional defenses to protect against future problems related to userspace memory layout manipulation (as shown most recently in the Stack Clash attacks), I implemented an 8MiB stack limit for privileged (i.e. setuid) execs, inspired by a similar protection in grsecurity, after reworking the secureexec handling by LSMs. This complements the unconditional limit to the size of exec arguments that landed in v4.13.

randstruct automatic struct selection
While the bulk of the port of the randstruct gcc plugin from grsecurity landed in v4.13, the last of the work needed to enable automatic struct selection landed in v4.14. This means that the coverage of randomized structures, via CONFIG_GCC_PLUGIN_RANDSTRUCT, now includes one of the major targets of exploits: function pointer structures. Without knowing the build-randomized location of a callback pointer an attacker needs to overwrite in a structure, exploits become much less reliable.

structleak passed-by-reference variable initialization
Ard Biesheuvel enhanced the structleak gcc plugin to initialize all variables on the stack that are passed by reference when built with CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL. Normally the compiler will yell if a variable is used before being initialized, but it silences this warning if the variable’s address is passed into a function call first, as it has no way to tell if the function did actually initialize the contents. So the plugin now zero-initializes such variables (if they hadn’t already been initialized) before the function call that takes their address. Enabling this feature has a small performance impact, but solves many stack content exposure flaws. (In fact at least one such flaw reported during the v4.15 development cycle was mitigated by this plugin.)

improved boot entropy
Laura Abbott and Daniel Micay improved early boot entropy available to the stack protector by both moving the stack protector setup later in the boot, and including the kernel command line in boot entropy collection (since with some devices it changes on each boot).

eBPF JIT for 32-bit ARM
The ARM BPF JIT had been around a while, but it didn’t support eBPF (and, as a result, did not provide constant value blinding, which meant it was exposed to being used by an attacker to build arbitrary machine code with BPF constant values). Shubham Bansal spent a bunch of time building a full eBPF JIT for 32-bit ARM which both speeds up eBPF and brings it up to date on JIT exploit defenses in the kernel.

seccomp improvements
Tyler Hicks addressed a long-standing deficiency in how seccomp could log action results. In addition to creating a way to mark a specific seccomp filter as needing to be logged with SECCOMP_FILTER_FLAG_LOG, he added a new action result, SECCOMP_RET_LOG. With these changes in place, it should be much easier for developers to inspect the results of seccomp filters, and for process launchers to generate logs for their child processes operating under a seccomp filter.

Additionally, I finally found a way to implement an often-requested feature for seccomp, which was to kill an entire process instead of just the offending thread. This was done by creating the SECCOMP_RET_ACTION_FULL mask (née SECCOMP_RET_ACTION) and implementing SECCOMP_RET_KILL_PROCESS.

That’s it for now; please let me know if I missed anything. The v4.15 merge window is now open!

Comments Off

September 5, 2017

security things in Linux v4.13

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 4:01 pm

Previously: v4.12.

Here’s a short summary of some of interesting security things in Sunday’s v4.13 release of the Linux kernel:

security documentation ReSTification
The kernel has been switching to formatting documentation with ReST, and I noticed that none of the Documentation/security/ tree had been converted yet. I took the opportunity to take a few passes at formatting the existing documentation and, at Jon Corbet’s recommendation, split it up between end-user documentation (which is mainly how to use LSMs) and developer documentation (which is mainly how to use various internal APIs). A bunch of these docs need some updating, so maybe with the improved visibility, they’ll get some extra attention.

CONFIG_REFCOUNT_FULL
Since Peter Zijlstra implemented the refcount_t API in v4.11, Elena Reshetova (with Hans Liljestrand and David Windsor) has been systematically replacing atomic_t reference counters with refcount_t. As of v4.13, there are now close to 125 conversions with many more to come. However, there were concerns over the performance characteristics of the refcount_t implementation from the maintainers of the net, mm, and block subsystems. In order to assuage these concerns and help the conversion progress continue, I added an “unchecked” refcount_t implementation (identical to the earlier atomic_t implementation) as the default, with the fully checked implementation now available under CONFIG_REFCOUNT_FULL. The plan is that for v4.14 and beyond, the kernel can grow per-architecture implementations of refcount_t that have performance characteristics on par with atomic_t (as done in grsecurity’s PAX_REFCOUNT).

CONFIG_FORTIFY_SOURCE
Daniel Micay created a version of glibc’s FORTIFY_SOURCE compile-time and run-time protection for finding overflows in the common string (e.g. strcpy, strcmp) and memory (e.g. memcpy, memcmp) functions. The idea is that since the compiler already knows the size of many of the buffer arguments used by these functions, it can already build in checks for buffer overflows. When all the sizes are known at compile time, this can actually allow the compiler to fail the build instead of continuing with a proven overflow. When only some of the sizes are known (e.g. destination size is known at compile-time, but source size is only known at run-time) run-time checks are added to catch any cases where an overflow might happen. Adding this found several places where minor leaks were happening, and Daniel and I chased down fixes for them.

One interesting note about this protection is that is only examines the size of the whole object for its size (via __builtin_object_size(..., 0)). If you have a string within a structure, CONFIG_FORTIFY_SOURCE as currently implemented will make sure only that you can’t copy beyond the structure (but therefore, you can still overflow the string within the structure). The next step in enhancing this protection is to switch from 0 (above) to 1, which will use the closest surrounding subobject (e.g. the string). However, there are a lot of cases where the kernel intentionally copies across multiple structure fields, which means more fixes before this higher level can be enabled.

NULL-prefixed stack canary
Rik van Riel and Daniel Micay changed how the stack canary is defined on 64-bit systems to always make sure that the leading byte is zero. This provides a deterministic defense against overflowing string functions (e.g. strcpy), since they will either stop an overflowing read at the NULL byte, or be unable to write a NULL byte, thereby always triggering the canary check. This does reduce the entropy from 64 bits to 56 bits for overflow cases where NULL bytes can be written (e.g. memcpy), but the trade-off is worth it. (Besdies, x86_64’s canary was 32-bits until recently.)

IPC refactoring
Partially in support of allowing IPC structure layouts to be randomized by the randstruct plugin, Manfred Spraul and I reorganized the internal layout of how IPC is tracked in the kernel. The resulting allocations are smaller and much easier to deal with, even if I initially missed a few needed container_of() uses.

randstruct gcc plugin
I ported grsecurity’s clever randstruct gcc plugin to upstream. This plugin allows structure layouts to be randomized on a per-build basis, providing a probabilistic defense against attacks that need to know the location of sensitive structure fields in kernel memory (which is most attacks). By moving things around in this fashion, attackers need to perform much more work to determine the resulting layout before they can mount a reliable attack.

Unfortunately, due to the timing of the development cycle, only the “manual” mode of randstruct landed in upstream (i.e. marking structures with __randomize_layout). v4.14 will also have the automatic mode enabled, which randomizes all structures that contain only function pointers.

A large number of fixes to support randstruct have been landing from v4.10 through v4.13, most of which were already identified and fixed by grsecurity, but many were novel, either in newly added drivers, as whitelisted cross-structure casts, refactorings (like IPC noted above), or in a corner case on ARM found during upstream testing.

lower ELF_ET_DYN_BASE
One of the issues identified from the Stack Clash set of vulnerabilities was that it was possible to collide stack memory with the highest portion of a PIE program’s text memory since the default ELF_ET_DYN_BASE (the lowest possible random position of a PIE executable in memory) was already so high in the memory layout (specifically, 2/3rds of the way through the address space). Fixing this required teaching the ELF loader how to load interpreters as shared objects in the mmap region instead of as a PIE executable (to avoid potentially colliding with the binary it was loading). As a result, the PIE default could be moved down to ET_EXEC (0x400000) on 32-bit, entirely avoiding the subset of Stack Clash attacks. 64-bit could be moved to just above the 32-bit address space (0x100000000), leaving the entire 32-bit region open for VMs to do 32-bit addressing, but late in the cycle it was discovered that Address Sanitizer couldn’t handle it moving. With most of the Stack Clash risk only applicable to 32-bit, fixing 64-bit has been deferred until there is a way to teach Address Sanitizer how to load itself as a shared object instead of as a PIE binary.

early device randomness
I noticed that early device randomness wasn’t actually getting added to the kernel entropy pools, so I fixed that to improve the effectiveness of the latent_entropy gcc plugin.

That’s it for now; please let me know if I missed anything. As a side note, I was rather alarmed to discover that due to all my trivial ReSTification formatting, and tiny FORTIFY_SOURCE and randstruct fixes, I made it into the most active 4.13 developers list (by patch count) at LWN with 76 patches: a whopping 0.6% of the cycle’s patches. ;)

Anyway, the v4.14 merge window is open!

Comments (1)

July 10, 2017

security things in Linux v4.12

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 1:24 am

Previously: v4.11.

Here’s a quick summary of some of the interesting security things in last week’s v4.12 release of the Linux kernel:

x86 read-only and fixed-location GDT
With kernel memory base randomization, it was stil possible to figure out the per-cpu base address via the “sgdt” instruction, since it would reveal the per-cpu GDT location. To solve this, Thomas Garnier moved the GDT to a fixed location. And to solve the risk of an attacker targeting the GDT directly with a kernel bug, he also made it read-only.

usercopy consolidation
After hardened usercopy landed, Al Viro decided to take a closer look at all the usercopy routines and then consolidated the per-architecture uaccess code into a single implementation. The per-architecture code was functionally very similar to each other, so it made sense to remove the redundancy. In the process, he uncovered a number of unhandled corner cases in various architectures (that got fixed by the consolidation), and made hardened usercopy available on all remaining architectures.

ASLR entropy sysctl on PowerPC
Continuing to expand architecture support for the ASLR entropy sysctl, Michael Ellerman implemented the calculations needed for PowerPC. This lets userspace choose to crank up the entropy used for memory layouts.

LSM structures read-only
James Morris used __ro_after_init to make the LSM structures read-only after boot. This removes them as a desirable target for attackers. Since the hooks are called from all kinds of places in the kernel this was a favorite method for attackers to use to hijack execution of the kernel. (A similar target used to be the system call table, but that has long since been made read-only.) Be wary that CONFIG_SECURITY_SELINUX_DISABLE removes this protection, so make sure that config stays disabled.

KASLR enabled by default on x86
With many distros already enabling KASLR on x86 with CONFIG_RANDOMIZE_BASE and CONFIG_RANDOMIZE_MEMORY, Ingo Molnar felt the feature was mature enough to be enabled by default.

Expand stack canary to 64 bits on 64-bit systems
The stack canary values used by CONFIG_CC_STACKPROTECTOR is most powerful on x86 since it is different per task. (Other architectures run with a single canary for all tasks.) While the first canary chosen on x86 (and other architectures) was a full unsigned long, the subsequent canaries chosen per-task for x86 were being truncated to 32-bits. Daniel Micay fixed this so now x86 (and future architectures that gain per-task canary support) have significantly increased entropy for stack-protector.

Expanded stack/heap gap
Hugh Dickens, with input from many other folks, improved the kernel’s mitigation against having the stack and heap crash into each other. This is a stop-gap measure to help defend against the Stack Clash attacks. Additional hardening needs to come from the compiler to produce “stack probes” when doing large stack expansions. Any Variable Length Arrays on the stack or alloca() usage needs to have machine code generated to touch each page of memory within those areas to let the kernel know that the stack is expanding, but with single-page granularity.

That’s it for now; please let me know if I missed anything. The v4.13 merge window is open!

Edit: Brad Spengler pointed out that I failed to mention the CONFIG_SECURITY_SELINUX_DISABLE issue with read-only LSM structures. This has been added now.

Comments (3)

May 2, 2017

security things in Linux v4.11

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 2:17 pm

Previously: v4.10.

Here’s a quick summary of some of the interesting security things in this week’s v4.11 release of the Linux kernel:

refcount_t infrastructure

Building on the efforts of Elena Reshetova, Hans Liljestrand, and David Windsor to port PaX’s PAX_REFCOUNT protection, Peter Zijlstra implemented a new kernel API for reference counting with the addition of the refcount_t type. Until now, all reference counters were implemented in the kernel using the atomic_t type, but it has a wide and general-purpose API that offers no reasonable way to provide protection against reference counter overflow vulnerabilities. With a dedicated type, a specialized API can be designed so that reference counting can be sanity-checked and provide a way to block overflows. With 2016 alone seeing at least a couple public exploitable reference counting vulnerabilities (e.g. CVE-2016-0728, CVE-2016-4558), this is going to be a welcome addition to the kernel. The arduous task of converting all the atomic_t reference counters to refcount_t will continue for a while to come.

CONFIG_DEBUG_RODATA renamed to CONFIG_STRICT_KERNEL_RWX

Laura Abbott landed changes to rename the kernel memory protection feature. The protection hadn’t been “debug” for over a decade, and it covers all kernel memory sections, not just “rodata”. Getting it consolidated under the top-level arch Kconfig file also brings some sanity to what was a per-architecture config, and signals that this is a fundamental kernel protection needed to be enabled on all architectures.

read-only usermodehelper

A common way attackers use to escape confinement is by rewriting the user-mode helper sysctls (e.g. /proc/sys/kernel/modprobe) to run something of their choosing in the init namespace. To reduce attack surface within the kernel, Greg KH introduced CONFIG_STATIC_USERMODEHELPER, which switches all user-mode helper binaries to a single read-only path (which defaults to /sbin/usermode-helper). Userspace will need to support this with a new helper tool that can demultiplex the kernel request to a set of known binaries.

seccomp coredumps

Mike Frysinger noticed that it wasn’t possible to get coredumps out of processes killed by seccomp, which could make debugging frustrating, especially for automated crash dump analysis tools. In keeping with the existing documentation for SIGSYS, which says a coredump should be generated, he added support to dump core on seccomp SECCOMP_RET_KILL results.

structleak plugin

Ported from PaX, I landed the structleak plugin which enforces that any structure containing a __user annotation is fully initialized to 0 so that stack content exposures of these kinds of structures are entirely eliminated from the kernel. This was originally designed to stop a specific vulnerability, and will now continue to block similar exposures.

ASLR entropy sysctl on MIPS
Matt Redfearn implemented the ASLR entropy sysctl for MIPS, letting userspace choose to crank up the entropy used for memory layouts.

NX brk on powerpc

Denys Vlasenko fixed a long standing bug where the kernel made assumptions about ELF memory layouts and defaulted the the brk section on powerpc to be executable. Now it’s not, and that’ll keep process heap from being abused.

That’s it for now; please let me know if I missed anything. The v4.12 merge window is open!

Comments (2)

February 27, 2017

security things in Linux v4.10

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 10:31 pm

Previously: v4.9.

Here’s a quick summary of some of the interesting security things in last week’s v4.10 release of the Linux kernel:

PAN emulation on arm64

Catalin Marinas introduced ARM64_SW_TTBR0_PAN, which is functionally the arm64 equivalent of arm’s CONFIG_CPU_SW_DOMAIN_PAN. While Privileged eXecute Never (PXN) has been available in ARM hardware for a while now, Privileged Access Never (PAN) will only be available in hardware once vendors start manufacturing ARMv8.1 or later CPUs. Right now, everything is still ARMv8.0, which left a bit of a gap in security flaw mitigations on ARM since CONFIG_CPU_SW_DOMAIN_PAN can only provide PAN coverage on ARMv7 systems, but nothing existed on ARMv8.0. This solves that problem and closes a common exploitation method for arm64 systems.

thread_info relocation on arm64

As done earlier for x86, Mark Rutland has moved thread_info off the kernel stack on arm64. With thread_info no longer on the stack, it’s more difficult for attackers to find it, which makes it harder to subvert the very sensitive addr_limit field.

linked list hardening
I added CONFIG_BUG_ON_DATA_CORRUPTION to restore the original CONFIG_DEBUG_LIST behavior that existed prior to v2.6.27 (9 years ago): if list metadata corruption is detected, the kernel refuses to perform the operation, rather than just WARNing and continuing with the corrupted operation anyway. Since linked list corruption (usually via heap overflows) are a common method for attackers to gain a write-what-where primitive, it’s important to stop the list add/del operation if the metadata is obviously corrupted.

seeding kernel RNG from UEFI

A problem for many architectures is finding a viable source of early boot entropy to initialize the kernel Random Number Generator. For x86, this is mainly solved with the RDRAND instruction. On ARM, however, the solutions continue to be very vendor-specific. As it turns out, UEFI is supposed to hide various vendor-specific things behind a common set of APIs. The EFI_RNG_PROTOCOL call is designed to provide entropy, but it can’t be called when the kernel is running. To get entropy into the kernel, Ard Biesheuvel created a UEFI config table (LINUX_EFI_RANDOM_SEED_TABLE_GUID) that is populated during the UEFI boot stub and fed into the kernel entropy pool during early boot.

arm64 W^X detection

As done earlier for x86, Laura Abbott implemented CONFIG_DEBUG_WX on arm64. Now any dangerous arm64 kernel memory protections will be loudly reported at boot time.

64-bit get_user() zeroing fix on arm
While the fix itself is pretty minor, I like that this bug was found through a combined improvement to the usercopy test code in lib/test_user_copy.c. Hoeun Ryu added zeroing-on-failure testing, and I expanded the get_user()/put_user() tests to include all sizes. Neither improvement alone would have found the ARM bug, but together they uncovered a typo in a corner case.

no-new-privs visible in /proc/$pid/status
This is a tiny change, but I like being able to introspect processes externally. Prior to this, I wasn’t able to trivially answer the question “is that process setting the no-new-privs flag?” To address this, I exposed the flag in /proc/$pid/status, as NoNewPrivs.

Kernel Userspace Execute Prevention on powerpc
Like SMEP on x86 and PXN on ARM, Balbir Singh implemented the protection on Power9 and later PPC CPUs under CONFIG_PPC_RADIX_MMU=y (which is the default). This means that if the kernel has a function pointer overwritten, it can no longer point into arbitrary code in userspace memory. An attacker’s ability is reduced to only using executable kernel memory, which is much less under their control (and has more limited visibility).

That’s all for now! Please let me know if you saw anything else you think needs to be called out. :) I’m already excited about the v4.11 merge window opening…

Edit: Added note about SMEP on Power9

Comments (1)

December 12, 2016

security things in Linux v4.9

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 11:05 am

Previously: v4.8.

Here are a bunch of security things I’m excited about in the newly released Linux v4.9:

Latent Entropy GCC plugin

Building on her earlier work to bring GCC plugin support to the Linux kernel, Emese Revfy ported PaX’s Latent Entropy GCC plugin to upstream. This plugin is significantly more complex than the others that have already been ported, and performs extensive instrumentation of functions marked with __latent_entropy. These functions have their branches and loops adjusted to mix random values (selected at build time) into a global entropy gathering variable. Since the branch and loop ordering is very specific to boot conditions, CPU quirks, memory layout, etc, this provides some additional uncertainty to the kernel’s entropy pool. Since the entropy actually gathered is hard to measure, no entropy is “credited”, but rather used to mix the existing pool further. Probably the best place to enable this plugin is on small devices without other strong sources of entropy.

vmapped kernel stack and thread_info relocation on x86

Normally, kernel stacks are mapped together in memory. This meant that attackers could use forms of stack exhaustion (or stack buffer overflows) to reach past the end of a stack and start writing over another process’s stack. This is bad, and one way to stop it is to provide guard pages between stacks, which is provided by vmalloced memory. Andy Lutomirski did a bunch of work to move to vmapped kernel stack via CONFIG_VMAP_STACK on x86_64. Now when writing past the end of the stack, the kernel will immediately fault instead of just continuing to blindly write.

Related to this, the kernel was storing thread_info (which contained sensitive values like addr_limit) at the bottom of the kernel stack, which was an easy target for attackers to hit. Between a combination of explicitly moving targets out of thread_info, removing needless fields, and entirely moving thread_info off the stack, Andy Lutomirski and Linus Torvalds created CONFIG_THREAD_INFO_IN_TASK for x86.

CONFIG_DEBUG_RODATA mandatory on arm64

As recently done for x86, Mark Rutland made CONFIG_DEBUG_RODATA mandatory on arm64. This feature controls whether the kernel enforces proper memory protections on its own memory regions (code memory is executable and read-only, read-only data is actually read-only and non-executable, and writable data is non-executable). This protection is a fundamental security primitive for kernel self-protection, so there’s no reason to make the protection optional.

random_page() cleanup

Cleaning up the code around the userspace ASLR implementations makes them easier to reason about. This has been happening for things like the recent consolidation on arch_mmap_rnd() for ET_DYN and during the addition of the entropy sysctl. Both uncovered some awkward uses of get_random_int() (or similar) in and around arch_mmap_rnd() (which is used for mmap (and therefore shared library) and PIE ASLR), as well as in randomize_stack_top() (which is used for stack ASLR). Jason Cooper cleaned things up further by doing away with randomize_range() entirely and replacing it with the saner random_page(), making the per-architecture arch_randomize_brk() (responsible for brk ASLR) much easier to understand.

Edit: missed this next feature when I first posted!

User namespace restrictions

Eric Biederman landed a sysctl to control how many user namespaces can be created (on a per-user and per-namespace basis) via /proc/sys/user/max_user_namespaces. The ability to limit access to user namespaces has been desired for a while, since a large amount of attack surface gets exposed from within a user namespace, due to it allowing various interfaces to operate within the namespace that were originally limited to the true root user before. A system can now disable user namespaces by writing 0 to the sysctl.

That’s it for now! Let me know if there are other fun things to call attention to in v4.10.

Comments Off

October 20, 2016

CVE-2016-5195

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 4:02 pm

My prior post showed my research from earlier in the year at the 2016 Linux Security Summit on kernel security flaw lifetimes. Now that CVE-2016-5195 is public, here are updated graphs and statistics. Due to their rarity, the Critical bug average has now jumped from 3.3 years to 5.2 years. There aren’t many, but, as I mentioned, they still exist, whether you know about them or not. CVE-2016-5195 was sitting on everyone’s machine when I gave my LSS talk, and there are still other flaws on all our Linux machines right now. (And, I should note, this problem is not unique to Linux.) Dealing with knowing that there are always going to be bugs present requires proactive kernel self-protection (to minimize the effects of possible flaws) and vendors dedicated to updating their devices regularly and quickly (to keep the exposure window minimized once a flaw is widely known).

So, here are the graphs updated for the 668 CVEs known today:

Critical: 3 @ 5.2 years average
High: 44 @ 6.2 years average
Medium: 404 @ 5.3 years average
Low: 216 @ 5.5 years average

Comments (3)

October 18, 2016

Security bug lifetime

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 9:46 pm

In several of my recent presentations, I’ve discussed the lifetime of security flaws in the Linux kernel. Jon Corbet did an analysis in 2010, and found that security bugs appeared to have roughly a 5 year lifetime. As in, the flaw gets introduced in a Linux release, and then goes unnoticed by upstream developers until another release 5 years later, on average. I updated this research for 2011 through 2016, and used the Ubuntu Security Team’s CVE Tracker to assist in the process. The Ubuntu kernel team already does the hard work of trying to identify when flaws were introduced in the kernel, so I didn’t have to re-do this for the 557 kernel CVEs since 2011.

As the README details, the raw CVE data is spread across the active/, retired/, and ignored/ directories. By scanning through the CVE files to find any that contain the line “Patches_linux:”, I can extract the details on when a flaw was introduced and when it was fixed. For example CVE-2016-0728 shows:

Patches_linux:
 break-fix: 3a50597de8635cd05133bd12c95681c82fe7b878 23567fd052a9abb6d67fe8e7a9ccdd9800a540f2

This means that CVE-2016-0728 is believed to have been introduced by commit 3a50597de8635cd05133bd12c95681c82fe7b878 and fixed by commit 23567fd052a9abb6d67fe8e7a9ccdd9800a540f2. If there are multiple lines, then there may be multiple SHAs identified as contributing to the flaw or the fix. And a “-” is just short-hand for the start of Linux git history.

Then for each SHA, I queried git to find its corresponding release, and made a mapping of release version to release date, wrote out the raw data, and rendered graphs. Each vertical line shows a given CVE from when it was introduced to when it was fixed. Red is “Critical”, orange is “High”, blue is “Medium”, and black is “Low”:

CVE lifetimes 2011-2016

And here it is zoomed in to just Critical and High:

Critical and High CVE lifetimes 2011-2016

The line in the middle is the date from which I started the CVE search (2011). The vertical axis is actually linear time, but it’s labeled with kernel releases (which are pretty regular). The numerical summary is:

Critical: 2 @ 3.3 years
High: 34 @ 6.4 years
Medium: 334 @ 5.2 years
Low: 186 @ 5.0 years

This comes out to roughly 5 years lifetime again, so not much has changed from Jon’s 2010 analysis.

While we’re getting better at fixing bugs, we’re also adding more bugs. And for many devices that have been built on a given kernel version, there haven’t been frequent (or some times any) security updates, so the bug lifetime for those devices is even longer. To really create a safe kernel, we need to get proactive about self-protection technologies. The systems using a Linux kernel are right now running with security flaws. Those flaws are just not known to the developers yet, but they’re likely known to attackers, as there have been prior boasts/gray-market advertisements for at least CVE-2010-3081 and CVE-2013-2888.

(Edit: see my updated graphs that include CVE-2016-5195.)

Comments (8)

Older Posts »

codeblog code is freedom — patching my itch

October 26, 2023

June 24, 2022

April 4, 2022

April 5, 2021

February 8, 2021

September 21, 2020

September 2, 2020

May 27, 2020

February 18, 2020

November 20, 2019

November 14, 2019

July 17, 2019

May 27, 2019

March 12, 2019

December 24, 2018

October 22, 2018

August 20, 2018

June 14, 2018

April 19, 2018

April 12, 2018

February 5, 2018

January 4, 2018

November 14, 2017

September 5, 2017

July 10, 2017

May 2, 2017

February 27, 2017

December 12, 2016

October 20, 2016

October 18, 2016