Previously: v5.3.
Linux kernel v5.4 was released in late November. The holidays got the best of me, but better late than never! ;) Here are some security-related things I found interesting:
waitid()
gains P_PIDFD
Christian Brauner has continued his pidfd
work by adding a critical mode to waitid()
: P_PIDFD
. This makes it possible to reap child processes via a pidfd
, and completes the interfaces needed for the bulk of programs performing process lifecycle management. (i.e. a pidfd
can come from /proc
or clone()
, and can be waited on with waitid()
.)
kernel lockdown
After something on the order of 8 years, Linux can now draw a bright line between “ring 0” (kernel memory) and “uid 0” (highest privilege level in userspace). The “kernel lockdown” feature, which has been an out-of-tree patch series in most Linux distros for almost as many years, attempts to enumerate all the intentional ways (i.e. interfaces not flaws) userspace might be able to read or modify kernel memory (or execute in kernel space), and disable them. While Matthew Garrett made the internal details fine-grained controllable, the basic lockdown LSM can be set to either disabled, “integrity” (kernel memory can be read but not written), or “confidentiality” (no kernel memory reads or writes). Beyond closing the many holes between userspace and the kernel, if new interfaces are added to the kernel that might violate kernel integrity or confidentiality, now there is a place to put the access control to make everyone happy and there doesn’t need to be a rehashing of the age old fight between “but root has full kernel access” vs “not in some system configurations”.
tagged memory relaxed syscall ABI
Andrey Konovalov (with Catalin Marinas and others) introduced a way to enable a “relaxed” tagged memory syscall ABI in the kernel. This means programs running on hardware that supports memory tags (or “versioning”, or “coloring”) in the upper (non-VMA) bits of a pointer address can use these addresses with the kernel without things going crazy. This is effectively teaching the kernel to ignore these high bits in places where they make no sense (i.e. mathematical comparisons) and keeping them in place where they have meaning (i.e. pointer dereferences).
As an example, if a userspace memory allocator had returned the address 0x0f00000010000000 (VMA address 0x10000000, with, say, a “high bits” tag of 0x0f), and a program used this range during a syscall that ultimately called copy_from_user()
on it, the initial range check would fail if the tag bits were left in place: “that’s not a userspace address; it is greater than TASK_SIZE
(0x0000800000000000)!”, so they are stripped for that check. During the actual copy into kernel memory, the tag is left in place so that when the hardware dereferences the pointer, the pointer tag can be checked against the expected tag assigned to referenced memory region. If there is a mismatch, the hardware will trigger the memory tagging protection.
Right now programs running on Sparc M7 CPUs with ADI (Application Data Integrity) can use this for hardware tagged memory, ARMv8 CPUs can use TBI (Top Byte Ignore) for software memory tagging, and eventually there will be ARMv8.5-A CPUs with MTE (Memory Tagging Extension).
boot entropy improvement
Thomas Gleixner got fed up with poor boot-time entropy and trolled Linus into coming up with reasonable way to add entropy on modern CPUs, taking advantage of timing noise, cycle counter jitter, and perhaps even the variability of speculative execution. This means that there shouldn’t be mysterious multi-second (or multi-minute!) hangs at boot when some systems don’t have enough entropy to service getrandom()
syscalls from systemd
or the like.
userspace writes to swap files blocked
From the department of “how did this go unnoticed for so long?”, Darrick J. Wong fixed the kernel to not allow writes from userspace to active swap files. Without this, it was possible for a user (usually root) with write access to a swap file to modify its contents, thereby changing memory contents of a process once it got paged back in. While root normally could just use CAP_PTRACE
to modify a running process directly, this was a loophole that allowed lesser-privileged users (e.g. anyone in the “disk” group) without the needed capabilities to still bypass ptrace restrictions.
limit strscpy()
sizes to INT_MAX
Generally speaking, if a size variable ends up larger than INT_MAX
, some calculation somewhere has overflowed. And even if not, it’s probably going to hit code somewhere nearby that won’t deal well with the result. As already done in the VFS core, and vsprintf()
, I added a check to strscpy()
to reject sizes larger than INT_MAX
.
ld.gold
support removed
Thomas Gleixner removed support for the gold linker. While this isn’t providing a direct security benefit, ld.gold
has been a constant source of weird bugs. Specifically where I’ve noticed, it had been pain while developing KASLR, and has more recently been causing problems while stabilizing building the kernel with Clang. Having this linker support removed makes things much easier going forward. There are enough weird bugs to fix in Clang and ld.lld
. ;)
Intel TSX disabled
Given the use of Intel’s Transactional Synchronization Extensions (TSX) CPU feature by attackers to exploit speculation flaws, Pawan Gupta disabled the feature by default on CPUs that support disabling TSX.
That’s all I have for this version. Let me know if I missed anything. :) Next up is Linux v5.5!
© 2020, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.