At Materialize we wrote ~140 kloc of Rust in the first 14 months while growing the team from ~3 to ~20 people. It’s a complex system with high demands on both throughput and latency. We reached that point with (IIRC) only 9 unsafe blocks, all of which were in a single module and existed to work around a performance bug in the equivalent safe API. Despite heavy generative testing and fuzzing, we only discovered one memory safety bug (in the unsafe module, naturally), which was easy to debug and fix.
By comparison, in several much smaller and much simpler Zig codebases where I am the only developer, I run into multiple memory safety bugs per week. This isn’t a perfect comparison, because my throwaway research projects in Zig are not written carefully (=> more bugs added) but are also not tested thoroughly (=> fewer bugs detected). But it does make me doubt my ability to ship secure Zig programs without substantial additional mitigations.
In at least one of those codebases, the memory safety bugs are outnumbered 20:1 by bounds-check panics. So I assume that if I wrote the same project in idiomatic C (i.e. without bounds checks) I would encounter at least 20x as many memory safety bugs per week.
My goodness, actual metrics, even if sloppy ones? I wish more projects could do this.
The notion that Zig’s level of safety is OK if you compile to wasm seems like unintended consequences waiting to happen: software originally meant for wasm deployment gets reused outside the sandbox. If the argument is performance, in the sense that you are willing to pay the cost of wasm overhead but nothing on top of that, you can write Rust and provide a panic handler that makes panics UB, which lets LLVM reason that panics don’t happen and eliminate the panicking paths. (Don’t do this if you are not targeting wasm!)
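A minimal sketch of that panic-handler trick, assuming a no_std crate built as a cdylib for a wasm32 sandbox target (the exported function is just an illustration):

#![no_std]

use core::panic::PanicInfo;

// Every panic now hits unreachable_unchecked(), i.e. panicking is declared
// impossible, so LLVM may delete the panicking branches (such as failed
// bounds checks). Actually reaching a panic is undefined behavior, which is
// why this is only defensible inside a wasm sandbox.
#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    unsafe { core::hint::unreachable_unchecked() }
}

// The bounds check on xs[i] would normally emit a panic path; with the
// handler above that path is assumed unreachable and optimized away.
#[no_mangle]
pub extern "C" fn get(xs: &[u32; 8], i: usize) -> u32 {
    xs[i]
}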
That leaves the “small language” argument, which I personally don’t find persuasive but others really seem to like small languages.
Hm, also wasm running compiled code has fewer protections than the OS running native code (stack and heap protection), so Zig in wasm is probably worse along some dimensions than C on the OS:
https://lobste.rs/s/pzr5ip/everything_old_is_new_again_binary
https://old.reddit.com/r/ProgrammingLanguages/comments/icb9ve/everything_old_is_new_again_binary_security_of/
I’m not a Rust user but it does seem conveniently appropriate for WASM :)
I don’t see how compiling to wasm really helps here. I guess the logic is that because the client runs it, it doesn’t matter if it has vulnerabilities or crashes?
But look at something like iMessage and all of the vulnerabilities that it has. Clearly even for single-user applications safety is still important as soon as you interact with untrusted data.
wasm runs in a strong sandbox, with no ability to do I/O, make syscalls, etc.; it can only call functions that have been passed in. Also, it’s a weird architecture where instructions and the return stack aren’t present in the main address space, so traditional arbitrary code execution isn’t really possible. All of that makes it difficult to escalate a memory safety bug in wasm into an attack on the wider system.
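To make “only call functions that have been passed in” concrete, here is a hedged host-side sketch using the wasmtime crate; the guest file, import names, and run export are all made up for illustration:

use wasmtime::{Engine, Linker, Module, Store};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    let module = Module::from_file(&engine, "guest.wasm")?;
    let mut store = Store::new(&engine, ());

    // The guest has no ambient authority: no filesystem, sockets, or syscalls.
    // It can only call the imports explicitly wired up on this linker.
    let mut linker = Linker::new(&engine);
    linker.func_wrap("host", "log_i32", |x: i32| println!("guest says: {x}"))?;

    let instance = linker.instantiate(&mut store, &module)?;
    let run = instance.get_typed_func::<(), i32>(&mut store, "run")?;
    println!("guest returned {}", run.call(&mut store, ())?);
    Ok(())
}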
But if the wasm module is “the entire” system it doesn’t matter much. If this is my document editor that handles sensitive data it is still very important that it isn’t compromised even if “the wider system” is protected.
Sandboxes are great but they are just a mitigation. They just reduce the blast radius of a vulnerability, so my wasm document editor can’t read my email, it only has access to my documents. But if I want to safely open sensitive and untrusted documents in my document editor I need higher protection than this basic containment.
It reminds me a bit of https://xkcd.com/1200/. I see wasm as a great tool for creating sandboxes, which can be used to build a multi-level defense system. But just compiling a monolithic application to wasm doesn’t obviate the need for secure and reliable applications in most cases.
Certainly agreed.
I think CHERI is a big unknown here. If CHERI works, language level memory safety is less valuable, and Zig will be more attractive and Rust less.
I am pretty optimistic about CHERI. The technology is solid, and its necessity is clear. There is just no way we will rewrite existing C and C++ code. So we will have CHERI for C and C++, and Zig will be an unintended beneficiary.
For desktop and mobile applications, I’d prefer a safety solution that doesn’t require a billion or more people to throw out the hardware they already have. So whatever we do, I don’t think relying exclusively on CHERI is a good solution.
People throw away their hardware, at least on average, once every decade. I’d much rather a solution that didn’t require rewriting trillions of dollars of software.
True, the software upgrade treadmill forces them to do that. But not everyone can keep up. Around 2017, I made friends with a poor person whose only current computer was a lower-end smartphone; they had a mid-2000s desktop PC in storage, and at some point also had a PowerPC iMac. It would be great if current software, meeting current security standards, were usable on such old computers. Of course, there has to be a cut-off point somewhere; programming for ancient 16-bit machines isn’t practical. I’m afraid that 3D-accelerated graphics hardware might be another hard requirement for modern GUIs; I was disappointed that GNOME 3 chose fancy visual effects over desktop Linux’s historical advantage of running well on older hardware. But let’s try not to keep introducing new hardware requirements and leaving people behind.
Wouldn’t CHERI still discover these issues at runtime versus compile time? Do not get me wrong, I’m still bullish on CHERI and it would be a material improvement, but I do think finding these bugs earlier in the lifecycle is part of the benefit of safety as a language feature.
That’s why I said “less valuable” instead of “nearly useless”.
Makes sense, thank you, just checking my understanding
Is there a performance overhead from using CHERI?
The CheriABI paper measured 6.8% overhead for a PostgreSQL benchmark running on FreeBSD in 2019. It mostly comes from the larger pointers (128 bits) and their effect on the cache.
Note that those numbers were from the CHERI/MIPS prototype, which was an in-order core with a fairly small cache but disproportionately fast DRAM (cache misses cost around 30ish cycles). Setting the bounds on a stack allocation was disproportionately expensive, for example, because the CPU couldn’t do anything else that cycle, whereas a more modern system would do that in parallel with other operations and so we typically see that as being in the noise on Morello. It also had a software-managed TLB and so couldn’t speculatively execute on any paths involving cache misses.
The numbers that we’re getting from Morello are a bit more realistic, though with the caveat that Arm made minimal modifications to the Neoverse N1 for Morello and so couldn’t widen data paths of queues in a couple of places where the performance win would have been huge for CHERI workloads relative to the power / area that they cost.
We’re starting to get data on Morello, though it’s not quite as realistic a microarchitecture as we’d like; Arm had to cut a few corners to ship it on time. Generally, most of the overhead comes from doubling pointer sizes, so it varies from almost nothing (for weird reasons, a few things get 5-10% faster) to 20% for very pointer-dense workloads. Adding temporal safety on top costs, on the four worst-affected SPEC CPU benchmarks, about 1% for two of them and closer to 20% for the others (switching from glibc’s malloc to snmalloc made one of those 30% faster on non-CHERI platforms; some of SPEC is really a tight loop around the allocator). We have some thoughts about improving performance here.
It’s worth noting that any microarchitecture tends to be tuned for specific workloads. One designed for CHERI would show different curves, because some structures would be sized at the point where they give big wins for CHERI but diminishing returns for everything else. The folks working on Rust are guessing that Rust would be about 10% faster with CHERI. I believe wasm will see a similar speedup, and MSWasm could be 50% or more faster than with software enforcement.
If you happen to have any easily-explained concrete examples, I’d be curious to hear about these weird reasons…
I don’t know if anyone has done root-cause analysis on them yet, but typically it’s things like the larger pointers reducing cache aliasing. I’ve seen one of the SPEC benchmarks get faster (I probably can’t share how much) when you enable MTE on one vendor’s core, because they disable a prefetcher with MTE, and that prefetcher happens to hit a pathological case in that one benchmark and slow things down.
It’s one of the annoying things you hit working on hardware security features. Modern CPUs are so complex that changing anything is likely to have a performance change of up to around 10% for any given workload, so when you expect your overhead to be around 5% on average you’re going to see a bunch of things that are faster, slower, or about the same. Some things have big differences for truly weird reasons. I’ve seen one thing go a lot faster because a change made the read-only data segment slightly bigger, which made two branch instructions on a hot path land in slightly different places and no longer alias in the branch predictor.
My favourite weird performance story was from some Apple folks. Apparently they got samples of a newer iPhone chip (this is probably about 10 years ago now), which was meant to be a lot faster, and they found that a core part of iOS ran much, much slower. It turned out that, with the old core, it was always mispredicting a branch, which was issuing a load, and then being cancelled after 10 cycles or so. In the newer core, the branch was correctly predicted and so the load wasn’t issued. The non-speculative path needed that load a couple of hundred cycles later and ended up stalling for 100-200 cycles waiting for memory. The cost of the memory wait was over an order of magnitude higher than the cost of the branch misprediction. They were able to add an explicit prefetch to regain performance (and get the expected benefit from the better core), but it’s a nice example of how improving one bit of the hardware can cause a huge systemic regression in performance.
Interesting, thanks – reminds me of some of the effects described in this paper (performance impacts of environment size and link order).
Once, doing some benchmarking for a research project circa 2015 or so, I found a MySQL workload that somehow got consistently somewhat faster when I attached strace to it, though I unfortunately never figured out exactly why or how that happened…
There was another similar paper at ASPLOS a few years later where they compiled with function and data sections and randomised the order of a bunch of benchmarks. They found that this gave a 20% perf delta and that a lot of papers about compiler optimisations were seeing a speed up simply as a result of this effect. Apparently the same team later produced a tool for properly evaluating optimisations that would do this randomisation and apply some statistics to see if your speed up is actually statistically significant.
I feel like zig should add an allocator that uses conservative heap and stack scanning to the stdlib selection of allocators - see https://security.googleblog.com/2022/05/retrofitting-temporal-memory-safety-on-c.html . I did a hare version for fun and it wasn’t so hard to do.
This basically would let you make zig programs selectively memory safe via conservative garbage collection - the linked article shows the overheads are quite low. Then you could just turn it on for portions of programs and deployments that require that extra bit of safety.
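Not Zig, but as a rough sketch of the quarantine half of that idea in Rust (the conservative stack/heap scan from the linked article is omitted, and the cap is chosen arbitrarily):

use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::Mutex;

// Freed blocks are parked in a quarantine instead of being reused immediately,
// so a stale pointer keeps pointing at memory that hasn't been handed out again.
// A real implementation would only release a block after a conservative scan of
// stacks and heap finds no remaining references to it.
struct QuarantineAlloc {
    quarantine: Mutex<Vec<(usize, Layout)>>,
}

unsafe impl GlobalAlloc for QuarantineAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        System.alloc(layout)
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        let mut q = self.quarantine.lock().unwrap();
        q.push((ptr as usize, layout));
        // Arbitrary cap: once the quarantine is "full", release the oldest block.
        if q.len() > 1024 {
            let (old_ptr, old_layout) = q.remove(0);
            System.dealloc(old_ptr as *mut u8, old_layout);
        }
    }
}

#[global_allocator]
static QUARANTINE: QuarantineAlloc = QuarantineAlloc {
    quarantine: Mutex::new(Vec::new()),
};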
I also think you could just expose it as a GC allocator rather than a quarantine allocator and let people take advantage of GC when they want it.
I’m sorry, Zig uses a memory ownership model that has been repeatedly demonstrated to be insufficient. Manual reference counting simply does not produce safe code, as demonstrated by decades of software vulnerabilities. In this regard idiomatic C++ is safer: the various smart pointers handle reference counting correctly; it only goes wrong when people take things out of said smart pointers and rely on manually managing them correctly.
As for the “we have safe allocators” argument, so do the Mac and iOS system allocators: they have double-free and use-after-free mitigations, and they zero-initialize returned memory as of iOS 16. The WebKit and Chrome custom allocators have similar protections as well. Custom allocators are not a new concept, and in C++ at least they are trivial and transparent.
I would argue fairly strongly that C++ is a safer language than Zig.
I agree C++ has good safety features, but this is, like, not true? If m is a hash map, m[i] = m[j] is a memory safety bug. This is not theoretical, people routinely get bitten by this. “C++ is safe if you use smart pointers and collections” is just a dangerous fantasy.
The natural Zig code equivalent does not have the bug exhibited by the C++ code: the value for j is copied out of the map before anything is inserted for i, so no pointer into the map is held across an insertion. However, if we bend over backwards, we can cause the same problem. Here I’ll run it in a small unit test:

const std = @import("std");

test {
    var m = std.AutoHashMap(i32, i32).init(std.testing.allocator);
    defer m.deinit();
    {
        // Here we prepopulate the hash map so that the 2nd next insertion will re-allocate.
        var x: i32 = 0;
        while (x < 5) : (x += 1) {
            try m.put(x, x + 1);
        }
    }

    const i = 100;
    const j = 200;
    // Equivalent of C++ `m[i] = m[j]` (buggy):
    const rhs_ptr = (try m.getOrPut(j)).value_ptr;
    const lhs_ptr = (try m.getOrPut(i)).value_ptr;
    lhs_ptr.* = rhs_ptr.*;
}

Output:

Test [1/1] test_0... Segmentation fault at address 0x7fbe41197054
./test.zig:19:5: 0x213cd8 in test_0 (test)
    lhs_ptr.* = rhs_ptr.*;
    ^
/home/andy/Downloads/zig/lib/test_runner.zig:62:28: 0x21a4ba in main (test)
    } else test_fn.func();
           ^
/home/andy/Downloads/zig/lib/std/start.zig:568:22: 0x2147ad in posixCallMainAndExit (test)
    root.main();
    ^
We get a stack trace pointing directly at the bug.
Note that Undefined Behavior did not occur here because std.testing.allocator is being used, which is backed by std.heap.GeneralPurposeAllocator, providing memory safety on use-after-free.
Pointing to the output of the debug allocator isn’t relevant - C, C++, and every other language have that. What matters for security is the production allocators, and on Mac+iOS at least the production allocators do endeavor to catch those errors. In the face of malicious data, however, any stochastic protection will have a failure state, and the goal of the allocator is to make that state as hard as possible to reach. Failing that, have crash metrics so a memory error is detectable.
I see that there’s a bug, but what’s happening?
When i is not in m, m[i] inserts i into m with a default-constructed value. (This is because C++ handles index getters and index setters together, which turned out to be a mistake; other languages handle them separately.) The insertion can cause reallocation, which can invalidate m[j], causing a use-after-free.
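For contrast, a small Rust sketch of what handling them separately looks like with std::collections::HashMap (keys and values here are arbitrary):

use std::collections::HashMap;

fn main() {
    let mut m: HashMap<i32, i32> = HashMap::new();
    m.insert(1, 10);

    // Reading is separate from inserting: indexing a missing key panics
    // instead of silently inserting a default value like C++ operator[].
    let existing = m[&1];
    // let missing = m[&2]; // would panic: key not present

    // Insertion has to be spelled out explicitly.
    let slot = m.entry(2).or_insert(0);
    *slot += existing;

    println!("{m:?}");
}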
First, I did not say C++ was a safe language - that’s clearly and objectively false :D. But the latter issue is not a memory safety issue unless the containers are incorrect: if std::map does not handle self-assignment correctly, then that’s a bug in the container. I recognize that the library authors seem to think that correct behavior isn’t required just because the standard says it is UB - but that’s just a result of poor library implementation. If I were implementing the standard containers, it would never occur to me that rehashing should be performed unsafely, but that’s just me.
So the C++ standard is buggy, common C++ implementations are buggy, but the ethereal essence of C++ is not buggy, got it. It would not be buggy if I were implementing the standard containers. But Zig’s standard testing allocator doesn’t count, maybe because it wasn’t implemented by you?
I am really trying hard to understand this but I just can’t.
No, Zig is unsafe because it is a language that does not support automatic lifetime management, for no good reason, despite lifetime errors being the most commonly abused errors in security exploits.
Zig’s standard testing allocator does not count because it is the testing allocator - every C/C++ environment also has testing allocators that are more aggressive than the default system allocators, because yes, catching errors before they ship is good, but Zig does not have anything special here. What matters for end-user security is the behaviour of the allocator[s] used in production. The fact is that production allocators cannot do full UaF etc. checking as aggressively as these testing and debug allocators, and so they are not relevant.
I think Zig can be a tiny bit smarter with allocators and partition allocations by type (not just size bucket), so instead of type confusion UAF you may get lucky and only get instances of the same type mixed up.
But after having a taste of Rust’s compile-time correctness, I’m unsatisfied with crash faster solutions. Such mitigations technically improve safety, but the programs are still as buggy as ever.
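A tiny Rust sketch of the per-type pooling idea above (names invented; a real allocator would work on raw memory rather than Box, but the point is that a freed slot can only come back as the same type):

use std::cell::RefCell;

struct TypedPool<T> {
    free: RefCell<Vec<Box<T>>>,
}

impl<T: Default> TypedPool<T> {
    fn new() -> Self {
        TypedPool { free: RefCell::new(Vec::new()) }
    }

    fn alloc(&self) -> Box<T> {
        // Reuse a slot that previously held a T, or fall back to the heap.
        self.free.borrow_mut().pop().unwrap_or_else(|| Box::new(T::default()))
    }

    fn free(&self, mut slot: Box<T>) {
        *slot = T::default(); // scrub before parking the slot for reuse
        self.free.borrow_mut().push(slot);
    }
}

fn main() {
    let pool: TypedPool<[u8; 16]> = TypedPool::new();
    let a = pool.alloc();
    pool.free(a);
    // A dangling pointer into `a` could now only ever alias another [u8; 16],
    // never an unrelated type - which is the mitigation being described.
    let _b = pool.alloc();
}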
This was common for performance in the early ‘90s until Hans Boehm showed that per-type pooling was bad for performance of anything that isn’t a microbenchmark. It is generally an improvement for security, because it prevents use-after-free from becoming a type-safety violation, but it still has some interesting exploit possibilities. In particular, for any object that represents something like a security context, being able to alias an instance that authorises things with one that doesn’t is painful. If I can open a file as me, and then cause the kernel to free my rights structure and then reuse the same memory for a root-user’s rights structure, then I now have privilege elevation without any type confusion. This kind of attack is actually easier with type-based pooling than without, because you can guarantee that you’ll alias some other valid instance of the same structure and so just need to ensure that the next entity to open a file is more privileged than you.
I am pretty okay with “crash faster”. It is equivalent to Rust panics.
No, it’s not. Rust doesn’t panic on UAF, it doesn’t have UAF (with the usual disclaimer about broken unsafe).
Rust, Zig, and C all allow you to write perfectly valid programs that never crash or panic, and all allow you to write a buggy, crashy mess. In objective, binary yes/no proof-by-contradiction terms they are all technically equivalent. But the real difference is a vague notion of how they deal with human error, and how idiomatic code steers programmers away from the crashy parts.
Rust can panic if you .unwrap() all over the place, but it has ? and a bunch of other features to steer you away from that. Rust can panic if you use arr[i] indexing, but it has iterators and helper methods to discourage that. The language, the tooling, and the community are focused on eliminating things before they even become runtime problems.
Panics still happen, but lots of errors are prevented even before they could become panics/crashes. Rust has the borrow checker, so a whole class of crash-faster bugs doesn’t compile. It has Send/Sync, so another whole class of crashy bugs doesn’t compile.
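Tying this back to the hash-map example earlier in the thread, a small Rust sketch (names arbitrary): holding a reference into the map across an insertion is rejected at compile time, so you end up copying the value out first.

use std::collections::HashMap;

fn main() {
    let mut m: HashMap<i32, i32> = HashMap::new();
    m.insert(200, 7);

    // Keeping `let rhs = m.get(&200);` alive across the insert below would be
    // rejected with E0502 (cannot borrow `m` as mutable because it is also
    // borrowed as immutable) - the dangling-pointer version doesn't compile.
    let rhs = m.get(&200).copied().unwrap_or_default();
    m.insert(100, rhs); // the `m[i] = m[j]` equivalent, with no pointer held across it

    println!("{m:?}");
}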
One of the big things that everybody overlooks is that Rust forces certain architectural patterns on you, and, if you cannot abide that, you are in for deep, deep problems. See: “Giving up on wlroots-rs” http://way-cooler.org/blog/2019/04/29/rewriting-way-cooler-in-c.html
Zig is not this opinionated. It will let you work with a weird abstraction at the cost that you can blow your foot off.
To me, Rust is fine when I can encapsulate - i.e., everything feeds through network sockets or files or … and I don’t have to cooperate with something else. If I have to cooperate with a kind of whacky abstraction, Rust starts feeling really nasty.
It will be interesting to see if Zig can sew up the gamedev and embedded programming arenas. Rust has struggled there from the beginning, and it hasn’t gotten very much better over time. If Zig (or anything else, for that matter) can slice those pieces off of Rust, Rust will wind up in a very tough place between the low-level which it’s not very good at and the GC languages which get continuously better with time.