1. 17
    1. 6

      This is an architecture where the size of a pointer is larger than that of an integer covering all addressable memory. In other words, intptr_t is larger than size_t.

      It exposes what is IMHO a mistake in Rust’s design: Rust declared that its usize is equal to intptr_t, but then used usize for indexing into collections, which is size_t’s job. So in the best case, assuming everyone knew about the gotcha and wrote their programs conforming to Rust’s design, CHERI-Rust is going to wastefully perform 128-bit arithmetic on a 64-bit address space. In practice it’s even more screwed, because many people quite reasonably assumed Rust’s type called “unsigned size” is the same as C’s “unsigned size” type, and this has been true in practice on every Rust-supported platform so far.
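
      To make the mismatch concrete, here is a minimal C sketch (my illustration, assuming a hypothetical 64-bit CHERI pure-capability target; not output from real hardware):

        #include <stdint.h>
        #include <stdio.h>

        int main(void) {
            /* On conventional 64-bit targets both values are 8.  On a
             * CHERI pure-capability target, size_t stays at 8 (a 64-bit
             * address space) while uintptr_t grows to 16 (a 128-bit
             * capability), so code that assumes the two types are
             * interchangeable either wastes space or breaks. */
            printf("sizeof(size_t)    = %zu\n", sizeof(size_t));
            printf("sizeof(uintptr_t) = %zu\n", sizeof(uintptr_t));
            return 0;
        }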

      1. 13

        [ I am one of the authors of CHERI ]

        There’s an open issue on the Rust bug tracker about this. Fortunately, it only matters in unsafe code because that’s the only place where a usize to pointer cast is allowed. One proposal that I like is to introduce a uptr type into Rust and start warning about usize to pointer casts, then disallow them on CHERI platforms and eventually everywhere. For most projects it’s a small change to just replace the few usizes that are converted to pointers with uptr and define only uptr + usize arithmetic, not uptr + uptr.

        Note that CHERI C doesn’t do 128-bit arithmetic on uintptr_t (I am responsible for most of CHERI C/C++ and the interpretation of [u]intptr_t in particular). It’s quite a fun type: effectively (on 64-bit platforms) a tagged union of a pointer and a 64-bit integer that supports the full range of C arithmetic, which operates on the address. (I originally had it operate on the offset, because I was hoping that we’d be able to support copying GC in C; Alex Richardson changed it to work on the address, which is a much better choice for compatibility.) All arithmetic works as if it were on 64-bit integers.

        Logically, the behaviour of intptr_t depends on whether the current value is a capability (was it constructed by casting from a valid pointer) or not. If not, then arithmetic operates on the low 64 bits as if it were a 64-bit integer with 64 bits of padding. If so, then arithmetic operates on the address of the capability (which is currently the low 64 bits, but may end up being the low 56 or 60 bits or something) and the other metadata (base, top, permissions) is preserved. If the arithmetic goes out of the representable bounds for a tagged capability then the tag is cleared and you end up with a thing that can never be converted back to a capability.
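
        To make that concrete, here is a minimal sketch of what those semantics mean at the source level (the comments paraphrase the behaviour described above; this is my illustration, not code from the CHERI toolchain):

          #include <stdint.h>

          char buf[32];

          char *round_trip(void) {
              /* p is a valid capability: tag set, bounds covering buf. */
              char *p = buf;

              /* Casting to uintptr_t keeps the whole capability rather
               * than collapsing it to a plain 64-bit integer. */
              uintptr_t u = (uintptr_t)p;

              /* Arithmetic operates on the 64-bit address; base, top and
               * permissions ride along unchanged, unless the result leaves
               * the representable bounds, in which case the tag is cleared. */
              u += 8;

              /* The cast back yields a usable pointer to &buf[8] because
               * the capability metadata survived the round trip. */
              return (char *)u;
          }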

        In LLVM IR, uintptr_t arithmetic becomes a sequence of get-address, arithmetic op, set-address. On Morello, most of these sequences can be folded into a single instruction.

    2. 1

      I’m curious about what “world domination” for CHERI would look like. Could CHERI for amd64 ever exist, or is that really really difficult for some reason, like the fact that non-RISC processors have loads of different instructions that touch memory, not just LDR and STR?

      Could the story one day become roughly “we doubled the size of pointers, thought very very hard about the problem and then made most of the unsafeties in C go away”?

      Relatedly, is CHERI good for hardening against fault injection attacks like firing ionising radiation at RAM / are the normally unforgeable capability bits also infeasible to forge even for an adversary who can flip bits they’re not supposed to be able to touch? :)

      1. 2

        I’m curious about what “world domination” for CHERI would look like.

        Every ISA ships with CHERI extensions, all operating systems default to a pure-capability ABI, all software is transitioned to this ABI.

        Could CHERI for amd64 ever exist, or is that really really difficult for some reason like the fact that non RISC processors have loads of different instructions that touch memory, not just LDR and STR?

        Yes. These are normally cracked to micro-ops that separate the load or store part internally because you don’t want a load-store pipeline interleaved with your ALU pipelines. The load/store unit would do the capability checks. The impact on the ISA would be larger because all operations that take memory operands would need updating to take capabilities instead. We’ve done a few sketches of possible ways of extending x86 to support CHERI. My preference would be to make the segment registers and RIP into capability registers and add instructions for manipulating them, but I’ve thought less about this than people who came to different conclusions.

        Could the story one day become roughly “we doubled the size of pointers, thought very very hard about the problem and then made most of the unsafeties in C go away”?

        More or less. There are a couple of big caveats:

        • Temporal safety is still a bit of an open problem. My team has a prototype for heap temporal safety that we hope can be optimised sufficiently for production use. Stack temporal safety is harder (Lawrence Esswood’s PhD thesis had a nice model but it’s not clear how well this will perform at scale). It may not matter in a given threat model. Stack allocations have lifetimes that are relatively easy for static analysis to reason about, so it’s not clear how much this matters in real-world deployed code.
        • Type confusion still exists. We prevent confusing T* with any non-pointer type, but we don’t prevent confusing int with float or confusing T* with U* for any T and U (though we do prevent this from extending the bounds, so we’ll catch a cast to a larger object if you access any of the fields off the end of the real target). We can’t really fix this without breaking C because type punning via unions is allowed (see the sketch after this list).
        • The default policy for CHERI traps is to kill the process. See the section about safe languages in our blog about Morello. This means that safe languages have less of a security advantage but still have a big availability advantage: by construction, they prevent a load of things that would cause a C/C++ program to abort at run time.
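
        As a small illustration of the union caveat above (my example, not from the thread): the following is legal C, violates no bounds or tag check, and is therefore invisible to CHERI.

          #include <stdio.h>

          /* Type punning through a union: the bits of the float are
           * reinterpreted as an int.  No capability check fails, so the
           * type confusion cannot be detected. */
          union pun {
              float f;
              int   i;
          };

          int main(void) {
              union pun p;
              p.f = 1.0f;
              printf("float 1.0 reinterpreted as int: 0x%08x\n", p.i);
              return 0;
          }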

        Relatedly, is CHERI good for hardening against fault injection attacks like firing ionising radiation at RAM / are the normally unforgeable capability bits also infeasible to forge even for an adversary who can flip bits they’re not supposed to be able to touch? :)

        This is a question for the implementation, rather than the ISA, but it’s a good one. The ISA does guarantee that capabilities are unforgeable and we have formal proofs that this property holds (except in Morello if you’re in EL1, where there’s an instruction intended for swapping that lets you set the tag bit on arbitrary data - it still holds at EL0).

        Whether a particular implementation correctly implements the ISA in the presence of hardware failures is a different question, and the answer in the general case is ‘no’. A bit flip in an ALU may result in the tag bit being erroneously set, a permission being added, or the bounds being increased. It may be possible to use rowhammer to flip some of the bits in capabilities (I hope ECC prevents this). It’s hard to rowhammer the tag bits because they’re behind an extra layer of caching that makes creating eviction sets incredibly hard, but you could try toggling a type bit to let you unseal a capability that you shouldn’t be able to unseal, or toggling a permission or length bit to extend your access.

        1. 1

          The default policy for CHERI traps is to kill the process.

          I think this has been empirically demonstrated to be a desirable thing. RCE leads to ransomware attacks which are hugely expensive to remediate, while clean segfaults are merely expensive and intensely annoying. :)

          More or less. There are a couple of big caveats

          Thanks, this is a great answer. Actually, I flip-flopped on writing “most” vs “about half” here. :)

          The reason I am curious about rowhammer/radiation hardening is that I suspect that they make it necessary to keep using hardware security protection (like e.g. MMU process isolation) in practice even if all code was proven safe.

          I believe it’s possible for a capability system to be radiation safe by making the capabilities be cryptographic signatures, since the ionising radiation is not going to get 2^-128 lucky and guess a valid key or signature. ;) But that’s maybe kinda expensive so I wouldn’t be surprised if practical implementations of non-networked capability systems didn’t want to do that? and I was wondering if CHERI does. :)

          (OTOH when there is a network involved you may as well use cryptography because computing SHA-2 and modern ECC signature is relatively cheap compared to the cost of shoving bytes into an Ethernet card, and networks mean cryptography is the only way for things to really be unforgeable anyway.)

          1. 1

            The reason I am curious about rowhammer/radiation hardening is that I suspect that they make it necessary to keep using hardware security protection (like e.g. MMU process isolation) in practice even if all code was proven safe.

            CHERI is also a hardware protection, but it’s layered in front of virtual memory. Accesses are checked based on the virtual address and the permissions of the capability, then they’re sent to the MMU. The kind of attacks that work to corrupt a capability can also be used to corrupt page tables and therefore break MMU isolation as well.

            I believe it’s possible for a capability system to be radiation safe by making the capabilities be cryptographic signatures, since the ionising radiation is not going to get 2^-128 lucky and guess a valid key or signature. ;) But that’s maybe kinda expensive so I wouldn’t be surprised if practical implementations of non-networked capability systems didn’t want to do that? and I was wondering if CHERI does. :)

            We considered this and some other capability systems have gone this route. The problem is that you really do need about 128 bits of signature to guarantee the integrity and that means capabilities have to be at least 256 bits. Some of the first feedback that we got from Arm was that 256-bit data paths in a modern CPU are going to make microarchitects deeply unhappy (even ignoring the overhead). Our original implementation used 256-bit capabilities (which gave us lots of space to play with for research) and shrinking to 128 bits was one of the major requirements for even considering adoption in a mainstream architecture.

            Using crypto would also be likely to significantly increase the load-to-use delay. It’s very important for performance that you can load a capability in one cycle and use it as the base for a load in the next (without this, any pointer chasing, even within the cache, gets slower and you see it in macrobenchmarks very quickly). Doing 128-bit signature verification in a single cycle sounds difficult. Most hardware crypto engines are pipelined so you may be able to issue an operation every cycle, but you have to wait several cycles for the next result. 5 years ago, I’d have said to just do the signature check in parallel and abort the load in speculation if it fails, but that’s not really an option now that speculative side channels are top of mind for CPU designers.

            1. 1

              The kind of attacks that work to corrupt a capability can also be used to corrupt page tables and therefore break MMU isolation as well.

              Yeah, I was only comparing MMU isolation to proof-carrying-code MMU-less isolation. I believe in that case that MMU isolation is a little less likely to go wrong because breaking it requires corrupting more stuff (than PCC, which may require as little as corrupting one pointer anywhere?)

              Using crypto would also be likely to significantly increase the load-to-use delay…

              It’s fun to think about but sure. I’m sure the power consumption would scuttle it too, even if you could hide all the latency with speculative execution.

              1. 1

                The kind of attacks that work to corrupt a capability can also be used to corrupt page tables and therefore break MMU isolation as well.

                Yeah, I was only comparing MMU isolation to proof-carrying-code MMU-less isolation. I believe in that case that MMU isolation is a little less likely to go wrong because breaking it requires corrupting more stuff (than PCC, which may require as little as corrupting one pointer anywhere?)

                Corrupting PCC lets you execute code from anywhere but that doesn’t give you the ability to modify memory. If you want to access data elsewhere then you need to corrupt a data capability to point to the bit of the address space that you want to access. This composes with MMU protections, so if you’re using capabilities and you’re using MMU-based process isolation then breaking CHERI wouldn’t help much because you’re still stuck in your address space. But if you’re using an MMU then corrupting a couple of bits in a page table can give you access to any page in physical memory. If you do it in the top-level then you can get 1 GiB mappings to physical memory.

                Basically, if you can corrupt a specific few bits in a specific 64-bit range of physical memory, then you can break either CHERI or MMU-based protection. If you attack CHERI, the MMU gives you some defence in depth (you can’t see memory in a different address space). If you attack the MMU, then CHERI gives you some defence in depth (it limits the PTEs that are useful for you to corrupt).

                Completely random bit flips are unlikely to help with either. The page tables are a fairly small part of total memory. If memory is unaliased and page tables don’t contain missing entries, then you have roughly 16 bytes of page table per 4096 bytes of memory, so you’ve got around a 0.4% probability of random corruption hitting a page table. Roughly 80% of all pages don’t contain capabilities, so even if you assume that the rest are all full of capabilities then you’ve got at most a 20% chance of a random bit flip corrupting a capability. A lot of those are benign though. Anything in the low bits of the address won’t impact the security model. Toggling bits from 1->0 will often also not help (as with some bits of the page tables). Bit flips in the top or bottom fields will extend the range, and 0->1 flips in the permissions bits will cause problems.
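
                The back-of-envelope arithmetic behind those estimates (my restatement of the figures above, assuming roughly 16 bytes of page-table state per 4 KiB page):

                  #include <stdio.h>

                  int main(void) {
                      /* ~16 bytes of page-table entries per 4096-byte page. */
                      double p_pagetable = 16.0 / 4096.0;   /* ~0.4% */

                      /* Worst case: assume the ~20% of pages that may hold
                       * capabilities are entirely full of them. */
                      double p_capability = 0.20;

                      printf("P(flip hits a page table)   = %.2f%%\n",
                             p_pagetable * 100.0);
                      printf("P(flip hits a capability)  <= %.0f%%\n",
                             p_capability * 100.0);
                      return 0;
                  }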

                If you have random-discharge or RowHammer attacks that enable only a particular bit-flip direction, then you could do an XOR of every 16-byte chunk with a valid pattern that would ensure that you couldn’t add permissions or increase the range with a single bit flip. Unfortunately, corrupting the high bits of the address (in either direction) would still allow you to move the range that a capability points to (the CHERI Concentrate paper has a good overview of how the encoding works; the newer versions are minor variations on this).

                1. 1

                  Corrupting PCC lets you execute code from anywhere but that doesn’t give you the ability to modify memory.

                  Ah, what? If I have a bug or corruption that gives me arbitrary code execution, I can execute code that has pointer arithmetic, loads and stores in it.

                  1. 1

                    Lemme just clarify:

                    • If I have a MMU-less CPU
                    • and the only way that capabilities or privilege separation are enforced is that a program loader checks program code (before letting it run) to ensure that it doesn’t corrupt any pointers
                    • but a bug or fault lets me corrupt the instruction pointer, and make the PC jump into some data
                    • then I could put code into that data that doesn’t abide by the no-corrupting-pointers rule
                    • and jump into it
                    • and now all the rules are broken

                    IIRC I’ve seen a blog post where someone got total control of one of those Java smartcards by using a bug in the card’s stdlib that let them get 2 fields to alias, one with a long and the other with an object pointer. That let them violate all the JVM’s assumptions.

                    1. 1

                      Sorry, I was confused by your use of PCC (program counter capability) - I guess you mean PC (program counter)? If you can control the PC on a non-CHERI system then you have total control (assuming sufficient gadgets to build a Turing-complete weird machine, which is normally a safe assumption). If you can control the PCC on a CHERI system then you can use it as a memory-disclosure gadget by trying to execute random things, but that’s all. You can use PCC-relative addressing to load constants, but that doesn’t let you perform writes to arbitrary memory.

                      1. 1

                        Oh no I meant something totally different. To me PCC stands for “proof carrying code” which is the technique where you verify that code can’t violate virtual machine invariants before running it. The point is to be able to enforce the virtual machine’s invariants (such as code not being able to create pointers that weren’t handed to it) without needing hardware to enforce them at runtime. AIUI JavaCard relies on this kind of thing a bunch.

                        Also I was using “PC” and “instruction pointer” interchangeably above

    3. 1

      How might a CHERI-dependent Linux process prevent a compartment from bypassing capabilities through /proc/self/mem? (Assuming /proc is covered by Linux stability guarantees (https://lwn.net/Articles/309298), though I don’t know if that matters for new ABIs)

      1. 1

        To clarify, I mean leaning on CHERI to help sandbox untrusted code

      2. 1

        The short answer is either ‘it depends’ or ‘it’s complicated’. It depends mostly on your threat model. There are a few interesting ones for CHERI, but most are variants of two high-level ones:

        • You trust the code in your process to not be actively malicious but you assume an attacker is trying to take control over it. You are using CHERI for memory safety. If violating memory safety requires being able to issue a system call that the program was not issuing anyway, then you don’t care: an attacker who can issue arbitrary system calls doesn’t need to violate memory safety, because they’ve already won.
        • You are running some untrusted code within an unprivileged compartment. You assume that the code is either actively malicious from the start, or is probably compromised by an adversary who is able to gain arbitrary code execution within the compartment. You want to prevent them from doing any damage other than corrupting the explicit inputs and outputs of the compartment (which can be externally validated).

        In the first case, the answer is easy: you don’t restrict /proc/self/mem at all. If a program leaves a file descriptor to it open then that’s potentially a problem (if an attacker can influence the descriptor argument to a write system call then they can corrupt memory), but otherwise an attacker needs to be able to open an arbitrary file and write data to specific offsets, and if they can do that then there are easier attacks (for example, overwrite .bashrc to run their malicious payload).

        The second case is a bit more interesting. The simple answer here is that you use CHERI to protect things inside the process and OS sandboxing mechanisms to protect the OS interfaces. In the initial prototypes, compartments that don’t run with the full authority of the address space can’t make system calls at all. Instead, the bottom half of their libc invokes a system object that implements some subset of the system calls. It can interpose on open and either refuse to open /proc/self/mem at all, or provide some proxy that prevents reading / writing / mapping anything except ranges that the sandbox ought to be able to access. Brooks has been experimenting with a variant of this where unprivileged code gets a version of the Capsicum system call table that takes sealed capabilities instead of integer file descriptors as its file-descriptor arguments. This would let untrusted code issue fast-path system calls (read, write, send, and so on) directly, but would still require proxying to the host environment for anything that touched a global namespace (connect, open, bind, and so on). Capsicum is a great tool for this; the Linux sandboxing tools are a horrible mess in comparison.
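
        For the OS-interface side, here is a minimal Capsicum sketch (standard FreeBSD APIs; my illustration, not the actual CHERI compartment runtime) of how untrusted code can be denied access to global namespaces such as /proc/self/mem:

          #include <sys/capsicum.h>
          #include <fcntl.h>
          #include <unistd.h>
          #include <err.h>

          int main(void) {
              /* Acquire the only descriptor this code should use, before
               * entering capability mode. */
              int fd = open("input.dat", O_RDONLY);
              if (fd < 0)
                  err(1, "open");

              /* Restrict the descriptor to read and seek only. */
              cap_rights_t rights;
              cap_rights_init(&rights, CAP_READ, CAP_SEEK);
              if (cap_rights_limit(fd, &rights) < 0)
                  err(1, "cap_rights_limit");

              /* From here on, any open() of a global path (including
               * /proc/self/mem) fails with ECAPMODE; only operations on
               * descriptors already held, or delegated later, work. */
              if (cap_enter() < 0)
                  err(1, "cap_enter");

              char buf[128];
              read(fd, buf, sizeof(buf));
              return 0;
          }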

        In the colocated-process (coprocess) model, the OS understands explicitly which coprocess owns any given part of the address space and so can exclude anything from /proc/self/mem that is not owned by the process identified by the self part (which, if I remember correctly, is a virtual symlink to the PID).

        Note that /proc/self/mem is only one of the ways of doing this kind of thing. The ptrace APIs exist on most *NIX systems, and Windows has a pair of system calls (ReadProcessMemory and WriteProcessMemory) to read or write memory from a process identified by a HANDLE. Some of these can be fixed by requiring a process descriptor / handle that conveys the right authority; others may require some more thought.

        The first model is enough to deterministically mitigate somewhere between 45% and 70% of security vulnerabilities (depending on the software mitigations composed with CHERI), so I’d be happy if we could start there.

        If you use WebAssembly as your abstract machine for sandboxes and treat CHERI as an acceleration technology, then WASI already exposes the same model as Capsicum and doesn’t define any of the awkward interfaces. I think this is one of the more promising paths: use WAsm for sandboxing, use CHERI to make the sandboxes fast.