How We Built a Self-Healing System to Survive a Terrifying Concurrency Bug At Netflix

Tags: debugging, devops, practices. Source: pushtoprod.substack.com
via briankung 3 months ago | 10 comments

1. 9
  
  mjb 3 months ago
  
  This was a rare but brutal example of how writing non-thread-safe code can cripple your systems.
  
  I’ve come to strongly dislike Java’s “silently allow badness” approach to thread safety, and appreciate Rust’s much more explicit approach (with things like Send+Sync). I introduced a HashMap concurrency bug that sounds a whole lot like the one the poster is talking about into a Java codebase in my first year at AWS (so 2009-ish). Code compiled, code ran fine in test, code ran fine on low-scale machines, code blew up in production (latently!). Ugh.
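
For readers who have not hit this class of bug, here is a minimal sketch of the usual fix on the Java side, assuming the shared structure is a map of counters. The class name, method name, and the "requests" key are made up for illustration and are not from the article or from any codebase mentioned in this thread. It swaps a plain HashMap for ConcurrentHashMap and uses merge() so each increment is a single atomic step.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical example class; not taken from the article or any commenter's codebase.
class SharedCounterSketch {

    // Runs `threads` workers that each bump the same key `incrementsPerThread` times.
    static int countWithConcurrentMap(int threads, int incrementsPerThread) {
        // ConcurrentHashMap is designed for concurrent writers; a plain HashMap is not.
        Map<String, Integer> hits = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    // merge() does the read-modify-write atomically for this key.
                    hits.merge("requests", 1, Integer::sum);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return hits.get("requests");
    }

    public static void main(String[] args) {
        // 8 workers * 10_000 increments each; no update is lost.
        System.out.println(countWithConcurrentMap(8, 10_000)); // prints 80000
    }
}
```

Note that nothing in the compiler forces this choice: the same code with `new HashMap<>()` still compiles, which is the complaint above.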
  
  There are some nice ways Java improved on the thread safety status quo over C and C++ (like a proper memory model and a rich standard library of good concurrent algorithms and primitives), some of which have made their way into those languages. But the fundamental approach is still way too easy to get wrong.
  
  Go is an interesting case here because in one sense it doesn’t improve over Java. It’ll let you make all the mistakes. But, if you follow the conventions for structuring multi-threaded code you can very effectively avoid mistakes. So it’s a kind of half solution, but a very useful one.
  
  It’s easy to sit back and say “oh, I simply wouldn’t have written those bugs”, but that’s a naive view. For example, it assumes that every piece of code you write is either fully thread safe or never ever will be refactored into a multithreaded context. Without the compiler helping you with that, I don’t know how that’s generally possible. It also assumes that you’re willing to spend a bunch of brainpower keeping track of things that a compiler can keep track of for you, and that seems inefficient at best. Computers are great at “apply these rules everywhere” in a way that humans just aren’t, and I’ve never understood the reluctance of programmers to take advantage of that.
  
  The solution of automatically terminating random instances felt like a terrible engineering practice.
  
  Doesn’t sound like a terrible engineering practice to me. Sounds like a pragmatic solution making the most of the tools available in the environment, which is ultimately what good engineering looks like. You don’t want to gather too many workarounds like this, because each adds complexity to a system and makes it harder to reason about, but as a short-term mitigation it’s nothing to be embarrassed about at all.
  
  Why not just reboot them? Terminating was faster.
  
  I’ve really come to appreciate the approach of cycling capacity and replacing it with freshly imaged new capacity continuously. There’s an efficiency trade-off on the frequency, but if you pick a good frequency it effectively eliminates whole classes of resource exhaustion problems in cloud systems. There’s a sense of craftsmanship in “our systems shouldn’t have those resource exhaustion bugs in the first place” which isn’t entirely wrong, but does distract from spending the same resources preventing more impactful bugs (like correctness bugs).
  
  The interesting thing about the post is that this was also a correctness bug from what it sounds like, and the explanation for why it’s OK to have that out in production for so long isn’t clear from the post. But it’s not generally true that resource exhaustion bugs are correctness bugs, and hitting them with the heavy hammer of replacing machines (or VMs or MicroVMs) works very well.
  1. 2
    
    quicksilver03 3 months ago
    
    Sorry for the random comment, but that brought back a moment quite a while ago (it could’ve been 2007 or 2008): I was called for a performance problem on an ATG e-commerce website and in a couple of hours on the first day I found that they had a droplet which was instantiated on multiple threads, and all of those threads were calling HashMap.containsKey() on a variable also modified by multiple threads… Nothing like that to achieve wizard-like status in the eyes of your client.
  2. 1
    
    kkiri 3 months ago
    
    I’ve come to strongly dislike Java’s “silently allow badness” approach to thread safety
    
    You’re far from the first. Thread safety is a problem older than most people realize (definitely older than Java). Actors are a very good tried-and-true method for sharing data without race conditions if you don’t have Rust’s ownership/trait system, and you also get the “just restart it” behavior out of the box. Only downside is, of course, everything is runtime defined and you don’t get any help from the compiler with coordinating access like you can with Rust.
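
To make the actor idea concrete, here is a rough sketch in plain Java. It is not Akka's or Kotlin's API; the class and method names are invented for illustration. A single-threaded executor stands in for the actor's mailbox, so the map is only ever touched by one thread and callers interact with it through messages that return futures.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical actor-style wrapper; a sketch of the pattern, not a real actor library.
class CacheActorSketch {
    // Only the mailbox thread ever touches this map, so a plain HashMap is safe here.
    private final Map<String, String> state = new HashMap<>();
    // A single-threaded executor acts as the mailbox: messages run one at a time, in order.
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();

    // "Send" a put message; the future completes once the actor has processed it.
    CompletableFuture<Void> put(String key, String value) {
        return CompletableFuture.runAsync(() -> state.put(key, value), mailbox);
    }

    // "Send" a query message; the answer comes back through the future.
    CompletableFuture<Boolean> containsKey(String key) {
        return CompletableFuture.supplyAsync(() -> state.containsKey(key), mailbox);
    }

    // Stop accepting messages so the mailbox thread can exit.
    void stop() {
        mailbox.shutdown();
    }

    public static void main(String[] args) {
        CacheActorSketch actor = new CacheActorSketch();
        actor.put("session-1", "alice");
        // The query is queued behind the put, so it sees the new entry.
        System.out.println(actor.containsKey("session-1").join()); // prints true
        actor.stop();
    }
}
```

As the comment says, the safety here is a runtime convention: nothing stops someone from adding a method that reads `state` directly from the caller's thread.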
    
    Java itself has the third-party Akka framework, but Kotlin and Clojure also have their own built-in systems for highly parallel share-nothing workloads (I know Kotlin has actors, Clojure doesn’t but it’s because it doesn’t need them, but I’m not sure what they use instead.) Of course, everyone should be using anything but Java if targeting the JVM, there’s not really a good excuse to use Java directly anymore (well, there never was, but Sun’s marketing department made sure we got stuck with Java anyway.) Kotlin’s basically done to Java what Rust has done to C++.
    1. 3
      
      mjb 3 months ago
      
      Shared-nothing patterns are nice when you can get them, but most real-world systems end up needing to share state (distributed and durable state like work queues and databases, local and ephemeral state like connections and caches, etc). The thread safety bugs come in with that shared state. If you’re in a domain where there’s no shared state, then great, take advantage of that. If you’re in a domain that can be modeled with actors or CSP or whatever, great, take advantage of that. If you’re in a domain where shared state can all be made somebody else’s problem through an interface (e.g. an RDBMS through an ACID transaction interface), then take advantage of that.
      
      But that’s not all applications, and fairly few systems programs, and so lower-level primitives for state sharing are important and useful.
      
      well, there never was, but Sun’s marketing department made sure we got stuck with Java anyway
      
      Java became popular at a time when most of the alternatives were significantly worse, lacking at least one of: memory safety, a real memory model, a rich standard library, a wide variety of open-source and proprietary libraries, built-in monitoring and observability, a sane cross-platform story, or decent performance. There are tons of better choices today (on and off the JVM), but the idea that Java’s only popular because Sun bamboozled us isn’t right.
    2. 2
      
      c-- edited 3 months ago
      
      Kotlin has actors, Clojure doesn’t but it’s because it doesn’t need them, but I’m not sure what they use instead.
      
      Because Clojure data structures are immutable and persistent, concurrency is a lot less painful to deal with than it is in some other languages. I wouldn’t say it is painless but it is pretty good.
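
A loose Java analogue, for anyone who wants to see why immutability helps: the sketch below keeps an immutable map behind an AtomicReference and replaces it wholesale on every write. This is copy-on-write, not a persistent data structure with structural sharing like Clojure's, so writes are more expensive; the class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical copy-on-write holder; only an approximation of Clojure's approach.
class ImmutableSnapshotSketch {
    // The reference always points at an unmodifiable map.
    private final AtomicReference<Map<String, Integer>> current =
            new AtomicReference<>(Map.of());

    // Writers build a new map and swap it in atomically; the old map is never mutated.
    void put(String key, int value) {
        current.updateAndGet(old -> {
            Map<String, Integer> copy = new HashMap<>(old);
            copy.put(key, value);
            return Map.copyOf(copy); // unmodifiable copy (Java 10+)
        });
    }

    // Readers get a stable snapshot that no other thread can change underneath them.
    Map<String, Integer> snapshot() {
        return current.get();
    }

    public static void main(String[] args) {
        ImmutableSnapshotSketch store = new ImmutableSnapshotSketch();
        store.put("a", 1);
        Map<String, Integer> before = store.snapshot();
        store.put("b", 2);
        System.out.println(before.size());           // prints 1: the old snapshot is unchanged
        System.out.println(store.snapshot().size()); // prints 2
    }
}
```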
      
      If you want something more CSP-like, Clojure also has core.async, which provides lightweight threads that communicate over channels. It looks quite a lot like Go: the lightweight threads are even declared using “go blocks”.
      
      (Edit: removed conflation of actors and channels)
      1. 4
        
        tonyg 3 months ago
        
        Nit: channels and actors are notably different.
        
        Mixing Metaphors: Actors as Channels and Channels as Actors, Simon Fowler, Sam Lindley, and Philip Wadler, 2017: https://simonjf.com/writing/acca.pdf
        
        “Channel- and actor-based programming languages are both used in practice, but the two are often confused”
        
        3
        
        c-- 3 months ago
        
        Good point! I think I read that paper when it first came out and then gradually slid back into conflating the two. I’ll edit my comment.
    3. 1
      
      valenterry 3 months ago
      
      Akka is one thing (it is btw written in Scala and not Java, but you can use it from Java).
      
      The other solution is immutability / functional programming, which comes at a performance cost though.
2. 3
  
  pronoiac 3 months ago
  
  This feels familiar. At a previous job, we had to / got to set up zombie reaper jobs to deal with this. It feels like this would be detectable by looking for cores / CPUs pegged at 100% over some period of time, and you could order and prioritize instances by which ones were worst hit.
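
A rough sketch of that detection idea, under some loudly stated assumptions: CPU samples per instance are already collected somewhere, an instance counts as "pegged" only if every sample in the window is at or above a threshold, and "worst hit" means highest average. The instance IDs, the threshold, and the method name are all hypothetical; this is not how the article's system or any real reaper job is known to work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper for picking reaper candidates from CPU samples.
class PeggedInstanceSketch {

    // Returns IDs of instances whose CPU stayed at or above `threshold` for the
    // whole sample window, ordered from highest to lowest average CPU.
    static List<String> rankPegged(Map<String, double[]> cpuSamples, double threshold) {
        List<String> pegged = new ArrayList<>();
        for (Map.Entry<String, double[]> e : cpuSamples.entrySet()) {
            double[] samples = e.getValue();
            // "Sustained" means every sample in the window is above the threshold.
            boolean sustained = samples.length > 0
                    && Arrays.stream(samples).allMatch(s -> s >= threshold);
            if (sustained) {
                pegged.add(e.getKey());
            }
        }
        // Worst-hit instances first, so they can be terminated with priority.
        pegged.sort(Comparator.comparingDouble(
                (String id) -> Arrays.stream(cpuSamples.get(id)).average().orElse(0.0))
                .reversed());
        return pegged;
    }

    public static void main(String[] args) {
        Map<String, double[]> samples = Map.of(
                "i-a", new double[] {99, 100, 98},    // pegged, average 99
                "i-b", new double[] {40, 100, 35},    // one spike only, not sustained
                "i-c", new double[] {100, 100, 100}); // pegged, average 100
        System.out.println(rankPegged(samples, 95.0)); // prints [i-c, i-a]
    }
}
```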
3. 2
  
  drmorr 3 months ago
  
  This isn’t directly related to the post, but the over-the-top language in the title and throughout caused me to bounce off pretty hard. “terrifying”? “carnage”? Really?