codeblog code is freedom — patching my itch

September 21, 2020

security things in Linux v5.7

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 4:32 pm

Previously: v5.6

Linux v5.7 was released at the end of May. Here’s my summary of various security things that caught my attention:

arm64 kernel pointer authentication
While the ARMv8.3 CPU “Pointer Authentication” (PAC) feature landed for userspace already, Kristina Martsenko has now landed PAC support in kernel mode. The current implementation uses PACIASP which protects the saved stack pointer, similar to the existing CONFIG_STACKPROTECTOR feature, only faster. This also paves the way to sign and check pointers stored in the heap, as a way to defeat function pointer overwrites in those memory regions too. Since the behavior is different from the traditional stack protector, Amit Daniel Kachhap added an LKDTM test for PAC as well.

BPF LSM
The kernel’s Linux Security Module (LSM) API provide a way to write security modules that have traditionally implemented various Mandatory Access Control (MAC) systems like SELinux, AppArmor, etc. The LSM hooks are numerous and no one LSM uses them all, as some hooks are much more specialized (like those used by IMA, Yama, LoadPin, etc). There was not, however, any way to externally attach to these hooks (not even through a regular loadable kernel module) nor build fully dynamic security policy, until KP Singh landed the API for building LSM policy using BPF. With CONFIG_BPF_LSM=y, it is possible (for a privileged process) to write kernel LSM hooks in BPF, allowing for totally custom security policy (and reporting).

execve() deadlock refactoring
There have been a number of long-standing races in the kernel’s process launching code where ptrace could deadlock. Fixing these has been attempted several times over the last many years, but Eric W. Biederman and Ernd Edlinger decided to dive in, and successfully landed the a series of refactorings, splitting up the problematic locking and refactoring their uses to remove the deadlocks. While he was at it, Eric also extended the exec_id counter to 64 bits to avoid the possibility of the counter wrapping and allowing an attacker to send arbitrary signals to processes they normally shouldn’t be able to.

slub freelist obfuscation improvements
After Silvio Cesare observed some weaknesses in the implementation of CONFIG_SLAB_FREELIST_HARDENED‘s freelist pointer content obfuscation, I improved their bit diffusion, which makes attacks require significantly more memory content exposures to defeat the obfuscation. As part of the conversation, Vitaly Nikolenko pointed out that the freelist pointer’s location made it relatively easy to target too (for either disclosures or overwrites), so I moved it away from the edge of the slab, making it harder to reach through small-sized overflows (which usually target the freelist pointer). As it turns out, there were a few assumptions in the kernel about the location of the freelist pointer, which had to also get cleaned up.

RISCV strict kernel memory protections
Zong Li implemented STRICT_KERNEL_RWX support for RISCV. And to validate the results, following v5.6’s generic page table dumping work, he landed the RISCV page dumping code. This means it’s much easier to examine the kernel’s page table layout when running a debug kernel (built with PTDUMP_DEBUGFS), visible in /sys/kernel/debug/kernel_page_tables.

array index bounds checking
This is a pretty large area of work that touches a lot of overlapping elements (and history) in the Linux kernel. The short version is: C is bad at noticing when it uses an array index beyond the bounds of the declared array, and we need to fix that. For example, don’t do this:

int foo[5];
...
foo[8] = bar;

The long version gets complicated by the evolution of “flexible array” structure members, so we’ll pause for a moment and skim the surface of this topic. While things like CONFIG_FORTIFY_SOURCE try to catch these kinds of cases in the memcpy() and strcpy() family of functions, it doesn’t catch it in open-coded array indexing, as seen in the code above. GCC has a warning (-Warray-bounds) for these cases, but it was disabled by Linus because of all the false positives seen due to “fake” flexible array members. Before flexible arrays were standardized, GNU C supported “zero sized” array members. And before that, C code would use a 1-element array. These were all designed so that some structure could be the “header” in front of some data blob that could be addressable through the last structure member:

/* 1-element array */
struct foo {
    ...
    char contents[1];
};

/* GNU C extension: 0-element array */
struct foo {
    ...
    char contents[0];
};

/* C standard: flexible array */
struct foo {
    ...
    char contents[];
};

instance = kmalloc(sizeof(struct foo) + content_size);

Converting all the zero- and one-element array members to flexible arrays is one of Gustavo A. R. Silva’s goals, and hundreds of these changes started landing. Once fixed, -Warray-bounds can be re-enabled. Much more detail can be found in the kernel’s deprecation docs.

However, that will only catch the “visible at compile time” cases. For runtime checking, the Undefined Behavior Sanitizer has an option for adding runtime array bounds checking for catching things like this where the compiler cannot perform a static analysis of the index values:

int foo[5];
...
for (i = 0; i < some_argument; i++) {
    ...
    foo[i] = bar;
    ...
}

It was, however, not separate (via kernel Kconfig) until Elena Petrova and I split it out into CONFIG_UBSAN_BOUNDS, which is fast enough for production kernel use. With this enabled, it's now possible to instrument the kernel to catch these conditions, which seem to come up with some regularity in Wi-Fi and Bluetooth drivers for some reason. Since UBSAN (and the other Sanitizers) only WARN() by default, system owners need to set panic_on_warn=1 too if they want to defend against attacks targeting these kinds of flaws. Because of this, and to avoid bloating the kernel image with all the warning messages, I introduced CONFIG_UBSAN_TRAP which effectively turns these conditions into a BUG() without needing additional sysctl settings.

Fixing "additive" snprintf() usage
A common idiom in C for building up strings is to use sprintf()'s return value to increment a pointer into a string, and build a string with more sprintf() calls:

/* safe if strlen(foo) + 1 < sizeof(string) */
wrote  = sprintf(string, "Foo: %s\n", foo);
/* overflows if strlen(foo) + strlen(bar) > sizeof(string) */
wrote += sprintf(string + wrote, "Bar: %s\n", bar);
/* writing way beyond the end of "string" now ... */
wrote += sprintf(string + wrote, "Baz: %s\n", baz);

The risk is that if these calls eventually walk off the end of the string buffer, it will start writing into other memory and create some bad situations. Switching these to snprintf() does not, however, make anything safer, since snprintf() returns how much it would have written:

/* safe, assuming available <= sizeof(string), and for this example
 * assume strlen(foo) < sizeof(string) */
wrote  = snprintf(string, available, "Foo: %s\n", foo);
/* if (strlen(bar) > available - wrote), this is still safe since the
 * write into "string" will be truncated, but now "wrote" has been
 * incremented by how much snprintf() *would* have written, so "wrote"
 * is now larger than "available". */
wrote += snprintf(string + wrote, available - wrote, "Bar: %s\n", bar);
/* string + wrote is beyond the end of string, and availabe - wrote wraps
 * around to a giant positive value, making the write effectively 
 * unbounded. */
wrote += snprintf(string + wrote, available - wrote, "Baz: %s\n", baz);

So while the first overflowing call would be safe, the next one would be targeting beyond the end of the array and the size calculation will have wrapped around to a giant limit. Replacing this idiom with scnprintf() solves the issue because it only reports what was actually written. To this end, Takashi Iwai has been landing a bunch scnprintf() fixes.

That's it for now! Let me know if there is anything else you think I should mention here. Next up: Linux v5.8.

© 2020 - 2022, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

September 2, 2020

security things in Linux v5.6

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 4:22 pm

Previously: v5.5.

Linux v5.6 was released back in March. Here’s my quick summary of various features that caught my attention:

WireGuard
The widely used WireGuard VPN has been out-of-tree for a very long time. After 3 1/2 years since its initial upstream RFC, Ard Biesheuvel and Jason Donenfeld finished the work getting all the crypto prerequisites sorted out for the v5.5 kernel. For this release, Jason has gotten WireGuard itself landed. It was a twisty road, and I’m grateful to everyone involved for sticking it out and navigating the compromises and alternative solutions.

openat2() syscall and RESOLVE_* flags
Aleksa Sarai has added a number of important path resolution “scoping” options to the kernel’s open() handling, covering things like not walking above a specific point in a path hierarchy (RESOLVE_BENEATH), disabling the resolution of various “magic links” (RESOLVE_NO_MAGICLINKS) in procfs (e.g. /proc/$pid/exe) and other pseudo-filesystems, and treating a given lookup as happening relative to a different root directory (as if it were in a chroot, RESOLVE_IN_ROOT). As part of this, it became clear that there wasn’t a way to correctly extend the existing openat() syscall, so he added openat2() (which is a good example of the efforts being made to codify “Extensible Syscall” arguments). The RESOLVE_* set of flags also cover prior behaviors like RESOLVE_NO_XDEV and RESOLVE_NO_SYMLINKS.

pidfd_getfd() syscall
In the continuing growth of the much-needed pidfd APIs, Sargun Dhillon has added the pidfd_getfd() syscall which is a way to gain access to file descriptors of a process in a race-less way (or when /proc is not mounted). Before, it wasn’t always possible make sure that opening file descriptors via /proc/$pid/fd/$N was actually going to be associated with the correct PID. Much more detail about this has been written up at LWN.

openat() via io_uring
With my “attack surface reduction” hat on, I remain personally suspicious of the io_uring() family of APIs, but I can’t deny their utility for certain kinds of workloads. Being able to pipeline reads and writes without the overhead of actually making syscalls is pretty great for performance. Jens Axboe has added the IORING_OP_OPENAT command so that existing io_urings can open files to be added on the fly to the mapping of available read/write targets of a given io_uring. While LSMs are still happily able to intercept these actions, I remain wary of the growing “syscall multiplexer” that io_uring is becoming. I am, of course, glad to see that it has a comprehensive (if “out of tree”) test suite as part of liburing.

removal of blocking random pool
After making algorithmic changes to obviate separate entropy pools for random numbers, Andy Lutomirski removed the blocking random pool. This simplifies the kernel pRNG code significantly without compromising the userspace interfaces designed to fetch “cryptographically secure” random numbers. To quote Andy, “This series should not break any existing programs. /dev/urandom is unchanged. /dev/random will still block just after booting, but it will block less than it used to.” See LWN for more details on the history and discussion of the series.

arm64 support for on-chip RNG
Mark Brown added support for the future ARMv8.5’s RNG (SYS_RNDR_EL0), which is, from the kernel’s perspective, similar to x86’s RDRAND instruction. This will provide a bootloader-independent way to add entropy to the kernel’s pRNG for early boot randomness (e.g. stack canary values, memory ASLR offsets, etc). Until folks are running on ARMv8.5 systems, they can continue to depend on the bootloader for randomness (via the UEFI RNG interface) on arm64.

arm64 E0PD
Mark Brown added support for the future ARMv8.5’s E0PD feature (TCR_E0PD1), which causes all memory accesses from userspace into kernel space to fault in constant time. This is an attempt to remove any possible timing side-channel signals when probing kernel memory layout from userspace, as an alternative way to protect against Meltdown-style attacks. The expectation is that E0PD would be used instead of the more expensive Kernel Page Table Isolation (KPTI) features on arm64.

powerpc32 VMAP_STACK
Christophe Leroy added VMAP_STACK support to powerpc32, joining x86, arm64, and s390. This helps protect against the various classes of attacks that depend on exhausting the kernel stack in order to collide with neighboring kernel stacks. (Another common target, the sensitive thread_info, had already been moved away from the bottom of the stack by Christophe Leroy in Linux v5.1.)

generic Page Table dumping
Related to RISCV’s work to add page table dumping (via /sys/fs/debug/kernel_page_tables), Steven Price extracted the existing implementations from multiple architectures and created a common page table dumping framework (and then refactored all the other architectures to use it). I’m delighted to have this because I still remember when not having a working page table dumper for ARM delayed me for a while when trying to implement upstream kernel memory protections there. Anything that makes it easier for architectures to get their kernel memory protection working correctly makes me happy.

That’s in for now; let me know if there’s anything you think I missed. Next up: Linux v5.7.

© 2020 – 2022, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

Powered by WordPress