Great post and 100% agree, especially as someone who opened more than a few tickets with Heroku asking them to allow our apps to not have to restart. We (Stitch Fix) ended up building much more resilient everything because of this.
But what about tasks that are going to take a long amount of time anyway? For these, we must instead decompose into smaller tasks.
Yes! Or, you can use a platform that doesn’t require you to jump through a bunch of totally unnecessary hoops if you don’t want to, when it isn’t giving you any direct benefit you’re personally interested in.
Just as most people don’t need or want Kubernetes, most people don’t need to make their batch processing fully checkpointed and resumable. It’s a lot of extra complexity and mechanism when frankly it can be hard enough to write the part of the software that actually does the business logic correctly.
It’s like an Erlang thing. Does every program need to be made out of a bunch of little restartable autonomous systems with a supervision hierarchy? No, but there are certainly programs that do benefit from such an architecture, quite significantly.
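To make that concrete, here is a rough Python sketch of the restart-under-supervision idea (this is not OTP; the worker, the backoff numbers, and the restart limit are all made up for illustration):

    import multiprocessing
    import time

    def worker() -> None:
        # Pretend workload; any uncaught exception or crash ends the process,
        # and it becomes the supervisor's job to notice and restart it.
        while True:
            time.sleep(1)

    def supervise(target, max_restarts: int = 5) -> None:
        # Restart `target` whenever it exits abnormally, with a simple backoff.
        restarts = 0
        while restarts <= max_restarts:
            proc = multiprocessing.Process(target=target)
            proc.start()
            proc.join()  # block until the child exits
            if proc.exitcode == 0:
                return  # clean exit: nothing left to supervise
            restarts += 1
            print(f"worker died (exit {proc.exitcode}); restart {restarts}/{max_restarts}")
            time.sleep(min(2 ** restarts, 30))  # back off so a crash loop doesn't spin
        raise RuntimeError("worker kept crashing; escalate to a human")

    if __name__ == "__main__":
        supervise(worker)

A real supervision hierarchy nests these: supervisors supervising supervisors, each deciding whether a crash is retried locally or escalated upward.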
Sometimes when I’m forced to deconstruct part of my program into smaller parts (a reentrant component, splitting monolithic code into libraries, refactoring big functions/types, whatever), I’ve found the actual solution to the problem comes out clearer and better structured than if I’d been allowed to make the entire subsystem one big ball of mud. It’s an example of rubber duck debugging: you’re forcing yourself to understand the problem to a depth that’s uncomfortable for you, but very necessary for the computer.
Agreed, although I’ve been experimenting with a supervision tree on the front end with a lot of success: https://starfx.bower.sh
It’s a flexible and robust way to build front-end apps.
Author (and Heroku employee) here.
Totally agree. This property ends up being a hidden “cost of doing business” on the platform. The restart behaviour has had enough feedback that it has become part of a larger roadmap item.
Personally speaking, I feel the pros outweigh the cons - higher resiliency pays off in the long run - but I totally understand why folks wouldn’t want to deal with the added complexity.
This varies so much depending on the context. I have built distributable job systems, crash-safe coordinators, restartable computations, etc. But all this complexity comes at a very real cost. So you need to look at the problem you’re trying to solve, and the underlying business need.
Keeping a big Linux server running for a year is a solved problem. “Big dumb batch jobs” are a perfectly legitimate part of the solution space. They’re easy to write, easy to maintain, and they don’t require distributed systems knowledge. They don’t scale well horizontally, so they get bottlenecked by how big a server you can buy. But they work.
Don’t get me wrong—I love 12-factor design, and having all state stored externally. I like using my servers to run ephemeral containers. But it’s not the right tool for every business problem. For many things, perhaps “higher resiliency pays off in the long run”. But for many tasks, that “long run” might be hopelessly hypothetical and not occur for 5 or 10 years. As Keynes put it, “The long run is a misleading guide to current affairs. In the long run we are all dead.”
I may be missing some context here. Why is that even a problem? I mean, if I had a batch process lasting more than the 24 hours you mentioned above, I would definitely want it checkpointed. And if it’s a small, few-minutes thing, then restarting it probably won’t be a problem anyway, and not relying on it finishing within exactly X minutes is probably smart. I fully agree there are cons, but the pros in these cases completely outweigh them.
As was discussed elsewhere in these comments - why should you have to re-write an existing process that “just” works to use a specific platform? Lift-and-shift projects often only have a limited budget, after all.
Customers typically want to be able to customise a lot. Heroku has strong opinions - sometimes these clash, and this is one of the areas where we have clashed a lot.
A system that is intentionally restarted regularly is going to have certain things figured out, like load shedding or a maintenance window. But it is also good to know that something can keep on ticking: no leaks, no excessive fragmentation, whatever. Ideally you’d figure out a way to know you can do both.
I wonder: you could just keep one instance (or a tenth of your instances, or something) on no-restart, and hopefully you can pick up/record/instrument and address whatever weirdness comes up on them without having to worry about your whole fleet being subject to it.
See also “Microreboot – A Technique for Cheap Recovery”, Candea et al., OSDI 2004, https://www.usenix.org/legacy/publications/library/proceedings/osdi04/tech/full_papers/candea/candea.pdf
Abstract: “A significant fraction of software failures in large-scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expensive, causing nontrivial service disruption or downtime even when clusters and failover are employed. In this work we use separation of process recovery from data recovery to enable microrebooting – a fine-grain technique for surgically recovering faulty application components, without disturbing the rest of the application. […]”
For these, we must instead decompose into smaller tasks. Providing an intermediate representation that can be stored (again, not locally!) somewhere and loaded, while keeping track of where in the sequence we are to resume processing after cancellation.
Shopify has a gem precisely for that: https://github.com/shopify/job-iteration. They even have another gem that gives a UI for running the one-off tasks one often needs, which takes advantage of that ‘resumability’.
I believe this is the gem you were referring to:
https://github.com/Shopify/maintenance_tasks
Sidekiq now also has iterable jobs built in: https://github.com/sidekiq/sidekiq/wiki/Iteration
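To make the pattern concrete, here is a minimal, framework-free Python sketch of what those gems do: iterate in small batches, persist a cursor after each batch, and pick up from the cursor after a restart. The file-based checkpoint store and the names are purely illustrative to keep the sketch self-contained; in practice the cursor would live in your database or another shared store, i.e. not on the dyno’s local disk.

    import json
    import os

    CHECKPOINT_PATH = "job_cursor.json"  # illustrative; use a shared store in real life

    def load_cursor() -> int:
        # 0 on the first run, the last saved checkpoint after a restart.
        if os.path.exists(CHECKPOINT_PATH):
            with open(CHECKPOINT_PATH) as f:
                return json.load(f)["cursor"]
        return 0

    def save_cursor(cursor: int) -> None:
        with open(CHECKPOINT_PATH, "w") as f:
            json.dump({"cursor": cursor}, f)

    def process(record: int) -> None:
        pass  # real business logic goes here

    def run(records: list[int], batch_size: int = 100) -> None:
        cursor = load_cursor()
        while cursor < len(records):
            batch = records[cursor:cursor + batch_size]
            for record in batch:
                process(record)
            cursor += len(batch)
            save_cursor(cursor)  # checkpoint between batches, not between records

    if __name__ == "__main__":
        run(list(range(10_000)))

Checkpointing per batch rather than per record keeps the bookkeeping overhead small while still bounding how much work a restart can throw away.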
Multiple companies I worked at would only notice memory leaks during the holiday code freeze every year. Weekdays without deployments would be rare and some slow leaks wouldn’t be noticed over weekends. But leaving the same processes running for a few weeks would reveal some slow memory growth.
At some point, is it even worth fixing? Restarting the process (or the machine) is the ultimate GC. As long as the growth can’t be triggered and abused by users for DoS, it should be fine.
They are good in the same way a crutch is good. That doesn’t mean they’re bad; a crutch can be a lifesaver. It’s a pragmatic choice for dealing with crappy systems (which is all of them).
But not all systems are equally crappy.
I’ve written code that I didn’t think was worth monitoring carefully; some of it probably had memory leaks. But I’ve also written computationally-bound simulations that run without memory leaks or crashes for weeks or even months!
My point: I (and many others) like having choice. I want to be able to provision machines that have some guaranteed uptime. And without unannounced restarts. Sure, I understand the points above: they are part of what I would call the relevant “decision graph”. But not everyone weighs the decision points the same.
P.S. Fuzz testing is amazing. If you haven’t tried it yet, I highly recommend it. e.g. https://releases.llvm.org/3.8.0/docs/LibFuzzer.html
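The linked page is the C/C++ libFuzzer; as a rough illustration of the same workflow in Python, a harness with Atheris (Google’s libFuzzer-based fuzzer for Python, pip install atheris) looks something like this, with parse_record standing in for whatever parsing code you actually want to fuzz:

    import sys

    import atheris

    def parse_record(data: bytes) -> tuple[bytes, bytes]:
        # Toy target with a real bug: it assumes every input contains a ":",
        # so b"no colon here" raises IndexError, the kind of edge case fuzzing finds.
        parts = data.split(b":")
        return parts[0], parts[1]

    def test_one_input(data: bytes) -> None:
        parse_record(data)

    if __name__ == "__main__":
        atheris.instrument_all()  # enable coverage guidance for loaded code
        atheris.Setup(sys.argv, test_one_input)
        atheris.Fuzz()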
Another argument for regular restarts, which came up at https://rachelbythebay.com/w/2024/03/05/outage/ in the bit about three day weekends:
If you are often restarting services to roll out updates because you are doing rapid continuous delivery, then you should restart services automatically at roughly the same cadence even when no deploys are happening.
The region of your software’s state space which is near its freshly-restarted state is very well explored and is known to mostly work. The region of state space which your program moves through over longer periods is not well explored. You don’t have anywhere near as much evidence that it doesn’t contain wedged or broken states.
Memory leaks changing the memory pressure are the most obvious thing that regular restarts mitigate. You can also run into more subtle issues like heap fragmentation, cache fullness, rare deadlocks very slowly emptying thread pools, or software holding onto an old DNS result indefinitely regardless of the TTL.
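One cheap way to get that “restart at the deploy cadence even when you aren’t deploying” behaviour: have each worker exit cleanly after a randomized maximum lifetime and let whatever supervises it (Heroku, systemd, Kubernetes, ...) start a fresh one. A rough Python sketch, with the 20-28 hour window and the names purely illustrative:

    import random
    import time

    # Jitter the lifetime so a whole fleet doesn't restart at the same moment.
    MAX_LIFETIME_SECONDS = random.uniform(20 * 3600, 28 * 3600)

    def handle_one_unit_of_work() -> None:
        time.sleep(1)  # stand-in for pulling a job off a queue, serving a request, etc.

    def main() -> None:
        started = time.monotonic()
        while time.monotonic() - started < MAX_LIFETIME_SECONDS:
            handle_one_unit_of_work()
        # Exit between units of work, never mid-task, so the restart is invisible
        # to users; the process manager notices the exit and launches a replacement.

    if __name__ == "__main__":
        main()

Gunicorn’s max_requests and max_requests_jitter settings are the same idea, counted in requests handled rather than wall-clock time.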