Previously: v4.14.
Linux kernel v4.15 was released last week, and there’s a bunch of security things I think are interesting:
Kernel Page Table Isolation
PTI has already gotten plenty of reporting, but to summarize, it is mainly to protect against CPU cache timing side-channel attacks that can expose kernel memory contents to userspace (CVE-2017-5754, the speculative execution “rogue data cache load” or “Meltdown” flaw).
Even for just x86_64 (as CONFIG_PAGE_TABLE_ISOLATION), this was a giant amount of work, and tons of people helped with it over several months. PowerPC also had mitigations land, and arm64 (as CONFIG_UNMAP_KERNEL_AT_EL0) will have PTI in v4.16 (though only the Cortex-A75 is vulnerable). For anyone with really old hardware, x86_32 is under development, too.
An additional benefit of the x86_64 PTI is that since there are now two copies of the page tables, the kernel-mode copy of the userspace mappings can be marked entirely non-executable, which means pre-SMEP hardware now gains SMEP emulation. Kernel exploits that try to jump into userspace memory to continue running malicious code are dead (even if the attacker manages to turn SMEP off first). With some more work, SMAP emulation could also be introduced (to stop even just reading malicious userspace memory), which would close the door on these common attack vectors. It’s worth noting that arm64 has had the equivalent (PAN emulation) since v4.10.
retpoline
In addition to the PTI work above, the retpoline kernel mitigations for CVE-2017-5715 (“branch target injection” or “Spectre variant 2”) started landing. (Note that to gain full retpoline support, you’ll need a patched compiler, which is appearing in gcc 7.3 and 8+, and is currently queued for release in clang.)
This work continues to evolve, and clean-ups are continuing into v4.16. Also in v4.16 we’ll start to see mitigations for the other speculative execution variant (i.e. CVE-2017-5753, “bounds check bypass” or “Spectre variant 1”).
x86 fast refcount_t overflow protection
In v4.13 the CONFIG_REFCOUNT_FULL code was added to stop many types of reference counting flaws (with a tiny performance loss). In v4.14 the infrastructure for a fast overflow-only refcount_t protection on x86 (based on grsecurity’s PAX_REFCOUNT) landed, but it was disabled at the last minute due to a bug that was finally fixed in v4.15. Since it was a tiny change, the fast refcount_t protection was backported and enabled for the Longterm maintenance kernel in v4.14.5. Conversions from atomic_t to refcount_t have also continued, and are now above 168, with a handful remaining.
%p hashing
One of the many sources of kernel information exposures has been the use of the %p format string specifier. The strings end up in all kinds of places (dmesg, /sys files, /proc files, etc.), and usage is scattered throughout the kernel, which had made it a very hard exposure to fix. Earlier efforts like kptr_restrict‘s %pK didn’t really work since they were opt-in. While a few recent attempts (by William C Roberts, Greg KH, and others) had been made to provide toggles for %p to act like %pK, Linus finally stepped in and declared that %p should be used so rarely that it shouldn’t be used at all, and Tobin Harding took on the task of finding the right path forward, which resulted in %p output getting hashed with a per-boot secret. The result is that simple debugging continues to work (two reports of the same hash value can confirm the same address without saying what the address actually is) but frustrates attackers’ ability to use such information exposures as building blocks for exploits.
For developers needing an unhashed %p, %px was introduced. But, as Linus cautioned: either your %p remains useful when hashed, or your %p was never actually useful to begin with and should be removed, or you need to strongly justify using %px with sane permissions.
It remains to be seen if we’ve just kicked the information exposure can down the road and in 5 years we’ll be fighting with %px and %lx, but hopefully the attitudes about such exposures will have changed enough to better guide developers and their code.
struct timer_list refactoring
The kernel’s timer (struct timer_list) infrastructure is, unsurprisingly, used to create callbacks that execute after a certain amount of time. They are one of the more fundamental pieces of the kernel, and as such have existed for a very long time, with over 1000 call sites. Improvements to the API have been made over time, but old ways of doing things have stuck around. Modern callbacks in the kernel take an argument pointing to the structure associated with the callback, so that a callback has context for which instance of the callback has been triggered. The timer callbacks didn’t, and took an unsigned long that was cast back to whatever arbitrary context the code setting up the timer wanted to associate with the callback, and this variable was stored in struct timer_list along with the function pointer for the callback. This creates an opportunity for an attacker looking to exploit a memory corruption vulnerability (e.g. a heap overflow), where they’re able to overwrite not only the function pointer but also the argument, as stored in memory. This elevates the attack into a weak ROP, and has been used as the basis for disabling SMEP in modern exploits (see retire_blk_timer). To remove this weakness in the kernel’s design, I refactored the timer callback API and all its callers, for a whopping:
1128 files changed, 4834 insertions(+), 5926 deletions(-)
Another benefit of the refactoring is that once the kernel starts getting built by compilers with Control Flow Integrity support, timer callbacks won’t be lumped together with all the other functions that take a single unsigned long argument. (In other words, some CFI implementations wouldn’t have caught the kind of attack described above, since the attacker’s target function still matched its original prototype.)
That’s it for now; please let me know if I missed anything. The v4.16 merge window is now open!
© 2018 – 2021, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
Hello. I have a question regarding KPTI. Would it be advisable to enable KPTI on AMD CPUs anyway? Especially since KPTI was originally meant to better protect KASLR, and separating user and kernel space is fundamentally better, despite the performance losses.
Comment by Nick — April 7, 2018 @ 2:48 am
For CPUs without SMEP, I would say it’s worth it to keep KPTI on unconditionally. For SMEP machines, I would still keep it enabled, just because it provides a nice separation for cache-timing attacks of all kinds (including many common KASLR leaks). In the end, though, I would say it depends on your workloads. If you can handle the small change in performance, go for it.
Comment by kees — April 7, 2018 @ 5:05 am
This was also my first thought: don’t do half measures when it comes to security, even if AMD CPUs are excluded from KPTI by default. I think one can also assume that there will be optimizations in the future that further reduce the performance losses from KPTI. Thanks for the fast response.
Comment by Nick — April 7, 2018 @ 7:56 am