Agreed. I’m 100% behind the idea that people should use FreeBSD, but this isn’t entirely a FreeBSD success story. The key lessons are:
If things slow down over time, don’t just throw more hardware at the problem. You may have something that is accidentally O(n) in the time it has been running (as is the case with the file-based storage here; I’ve seen other interesting cases of CPU leaks). Autoscaling around that just means your costs will go up every month. Fixing it will let your costs scale with the number of users. This is OS-independent. (A rough sketch of this failure mode follows at the end of this comment.)
If you are using the same platform as everyone else, generic attacks will work on you. If an attack needs to work at multiple levels in the stack, changing some of it can help. For example, x86 malware won’t run on Arm, PHP malware won’t run on Ruby, and Linux malware may not work on FreeBSD (though you may need to make sure you haven’t loaded the Linux ABI kernel module - it’s embarrassing if its only user is malware).
The second of these used to be one of the key arguments people used for promoting Linux: a monoculture is vulnerable, diversity is good. As soon as the Windows monoculture was replaced with a Linux monoculture, Linux advocates forgot this.
Using jails is not necessarily an advantage. With the relative growth in application stacks versus kernels, the overhead of a VM is fairly low now. Managing jails directly feels like a step back if you’re doing container orchestration (using podman to manage containers with jails as the underlying isolation mechanism works quite nicely).
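To make the first lesson concrete, here is a minimal sketch of the failure mode, assuming a file-per-session layout (Python, with invented paths and function names - this is not the article’s actual code): every request walks the whole session directory, so the work per request grows with however many session files have piled up since the server started.

```python
import os
import time

SESSION_DIR = "/var/db/sessions"  # hypothetical path, not from the article


def purge_expired_sessions(max_age_seconds=3600):
    """Naive cleanup that walks *every* session file on *every* request.

    The scan is O(n) in the number of files that have accumulated since
    the server started, so each request gets slower the longer the box
    runs - the "accidentally O(n) in uptime" failure mode described above.
    """
    now = time.time()
    for name in os.listdir(SESSION_DIR):          # cost grows with uptime
        path = os.path.join(SESSION_DIR, name)
        try:
            if now - os.path.getmtime(path) > max_age_seconds:
                os.unlink(path)
        except FileNotFoundError:
            pass  # another worker already removed it


def load_session(session_id):
    purge_expired_sessions()                      # hidden O(n) on the hot path
    path = os.path.join(SESSION_DIR, session_id)
    with open(path, "rb") as f:
        return f.read()
```

One common fix (not necessarily the one the article’s team chose) is to move the cleanup off the request path entirely - a periodic job, or a session store that expires keys by itself - so the per-request cost stops growing with uptime.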
Article would be more compelling if it explained the actual cost differences, the size of the development team, or the traffic volume of the sites being served.
Being a test server, we didn’t consider to implement a proper HA - at the moment, it wouldn’t make sense.
Seems to be a lot buried in this line. First, any test server that is actively used in development processes does need proper HA, because businesses can’t make money if their developers can’t develop. Sounds like there’s a cold standby server, but is it properly configured to take over quickly if the main server needs a repair?
What about for firmware updates? Is the failover process non-disruptive enough to apply firmware updates regularly?
I trust the author knows about these things, but remember that there’s an entire generation of tech workers who only know “cloud”, where firmware updates and hardware failures are mostly invisible. These are the people I often see trying to revive self-hosting, but then forgetting to account for the various “unhappy paths” that the cloud handles for you.
First, any test server that is actively used in development processes does need proper HA, because businesses can’t make money if their developers can’t develop.
As with all things, this still depends a lot on the circumstances. High availability, especially if it’s allegedly automatic, is a source of inherent and irreducible complexity: it requires correct configuration, the HA software itself can and will have bugs, it often needs constant attention from a person or a team, and so on. On the other hand, contemporary server hardware has very few moving parts and is extremely reliable, and modern operating systems are similarly unlikely to just fall over. Security updates are frequently quick to apply, and reboots are often on the order of two minutes.
When you inevitably have incidents caused by the complexity of the HA system, it’s likely they will exceed two minutes in time to resolution. At that point there arguably hasn’t really been any economic benefit over the simpler single system. If you have a handful of developers making moderate use of the test system, this might really be all that’s justified. Obviously if you have a hundred or a thousand engineers, things may be different at that scale; though a hundred servers managed uniformly (over which you distribute your users) may still be better than a single large scale HA system.
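As a rough back-of-the-envelope check on that trade-off, with purely illustrative numbers that are not from the article:

```python
# Illustrative downtime budget for a single, non-HA server.
# Assumed numbers (not from the article): one reboot per month for
# updates, two minutes per reboot, plus one unplanned hour a year.
reboots_per_year = 12
minutes_per_reboot = 2
unplanned_minutes_per_year = 60

total_minutes = reboots_per_year * minutes_per_reboot + unplanned_minutes_per_year
minutes_per_year = 365 * 24 * 60
availability = 1 - total_minutes / minutes_per_year

print(f"{total_minutes} minutes down per year "
      f"=> {availability:.4%} availability")
# -> 84 minutes down per year => 99.9840% availability
```

A single HA-induced incident that takes an afternoon to untangle blows through that entire yearly budget, which is the point being made here.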
Plus if you’ve done a good job at making your test setup reproducible, then in the unlikely event that the server literally explodes and you need to replace it, you can just do that.
It’s not accurate to compare a well-designed single-server setup with a poorly-designed HA setup. Migrating a VM to another node is such a mundane activity nowadays. The big cloud providers do it thousands of times a day.
Sure, the average HA system is complex, but the average single-server setup often doesn’t have good reproducibility, since those systems don’t prioritize immutability and infra-as-code. Failover is rare, so little time is invested in the process, making it error-prone and leading to long recovery times.
My point is just that both aren’t simple to do right, and cloud-minded devs are likely to not even consider all the non-simple parts of self-hosting. I want people to be explicit about it instead of glossing over it.
It’s not accurate to compare a well-designed single-server setup with a poorly-designed HA setup. Migrating a VM to another node is such a mundane activity nowadays.
In my comment, the “well-designed” part was an optional extra in the last paragraph. Even just having backups that you have confirmed you can restore is probably enough for many or most small-scale deployments. You don’t strictly need infrastructure as code or any of the other fashionable stuff. You need backups on the HA system anyway, so that cost applies to both setups.
My point stands either way: even a “poorly” designed standalone system is likely to be more reliable than all but the best-staffed HA systems.
I want people to be explicit about it instead of glossing over it.
I do too!
Sure, the average HA system is complex, but the average single-server setup often doesn’t have good reproducibility, since those systems don’t prioritize immutability and infra-as-code. Failover is rare, so little time is invested in the process, making it error-prone and leading to long recovery times.
Every HA system is complex; it’s unavoidable. These systems require some form of consensus and liveness monitoring, and need to deal with a vast array of failure modes (like, say, network partitions indistinguishable from a hard failure of a peer system) that simply don’t impact standalone systems in the same way. For virtual machine hosting they tend to involve some kind of replicated network storage, which produces a whole new class of performance bottleneck as well. All of this is significant extra complexity that the operator needs to learn about and understand in order to be able to operate the system and have any hope of avoiding the sorts of problems that lead to HA systems being unreliable.
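To illustrate the partition problem with a deliberately naive sketch (Python, invented addresses and thresholds, not modelled on any particular HA product): a plain heartbeat check only ever observes “no reply in time”, which looks exactly the same whether the peer has died or the network between you has.

```python
import socket
import time

PEER = ("10.0.0.2", 9000)        # hypothetical peer address
TIMEOUT_S = 1.0
MISSES_BEFORE_FAILOVER = 3


def peer_answered() -> bool:
    """Send one heartbeat datagram and wait briefly for any reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(TIMEOUT_S)
        try:
            s.sendto(b"ping", PEER)
            s.recvfrom(16)
            return True
        except (socket.timeout, OSError):
            return False


def monitor():
    misses = 0
    while True:
        if peer_answered():
            misses = 0
        else:
            misses += 1
        if misses >= MISSES_BEFORE_FAILOVER:
            # At this point we *cannot* tell a crashed peer from a network
            # partition. If we promote ourselves and the peer is actually
            # alive on the other side of the partition, we get split brain.
            # Real HA systems need quorum/fencing to resolve this, which is
            # a large part of the complexity described above.
            print("peer unreachable - failover decision needed")
            misses = 0
        time.sleep(1)
```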
The big cloud providers do it thousands of times a day.
HA systems like this are literally their core business. They generally have a large staff of experts at all levels working on these systems all the time. They invest substantially in this in a way that is not necessary in smaller scale deployments; as people are fond of saying: you are not Google/Amazon/etc!
I trust the author knows about these things, but remember that there’s an entire generation of tech workers who only know “cloud”, where firmware updates and hardware failures are mostly invisible.
I would take that with a grain of salt. I’ve been CC’d on enough discussions with AWS/GCS/Azure support and heard enough (believable) anecdotes to know that you really shouldn’t take that for granted. AWS does not magically solve hardware failures, nor should you assume that there are no “forgotten” servers (that one actually feels really scary), or that cloud providers have more competent employees than everyone else. They are big, they move fast, and of course things will fail - sometimes really big things, and sometimes it’s the failures themselves that scale.
The idea that “they surely know better” is often a myth. And even if they actually do, that likely means you’ll have a really hard time should such an issue arise - or you get lucky and don’t.
I don’t want to spread FUD here, but I really dislike it when people assume that big companies have some magic barrier protecting them from failure, incompetence, hardware faults, etc. These companies are, of course, also still developing their products. There is truth to “combined experience” and so on, but don’t assume that the person who has that experience will necessarily be working on anything relevant to you. A lot of that effort goes into building AWS itself, into making such a system work at all - including quite a bit of legacy stuff hanging around because some important customer still needs it. If your product is not a big cloud business, that is all effort you simply don’t have to deal with.
The whole “if Google/Amazon/… can’t manage it, you certainly won’t be able to” line is pure marketing. It’s usually rooted in comparisons that don’t make sense: the goals you have and the goals they have are pretty different if you take a slightly closer look.
That doesn’t mean you should never use their stuff, only that you shouldn’t start believing in and assuming magic at these companies.
Running a generic commercial product for others and running something just so that your own software has a place to live are completely different worlds in terms of the effort that has to go into them. If you run your own stuff, you only face a small subset of vaguely similar challenges - and largely challenges that people solved decades ago, where there is plenty of experience and where “thinking cloud” (separating state from applications, and so on) really helped. Before cloud services emerged, sysadmins (or what we call SREs today) would have loved to be able to tell developers that they cannot do this and that. It took big companies to point to for that to become possible, and for ecosystems to develop in which horrible practices (which were largely already bad practice back then) were rooted out.
I trust the author knows about these things, but remember that there’s an entire generation of tech workers who only know “cloud”, where firmware updates and hardware failures are mostly invisible
To be honest, I think a lot of people didn’t know this kind of stuff even before “cloud”, at least in my experience. Devs would dev, BOFHs would operate, and that was it. Now we call them devops rather than BOFHs, but those are still the people dealing with this stuff, rarely the developers.
Nice post but clickbait-ish submission title
Article would be more compelling if it explained the actual cost differences, the size of the development team, or the traffic volume of the sites being served.
True, I wish they had outlined all of those variables.
But you can at least derive the rough costs:
Let’s say each of those two servers costs roughly 100 euros a month.
Then they are probably spending less than 200 euros a month now, and if that is 1/10 of the old bill, they were spending around 2,000 euros a month before.
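Spelling that estimate out with the same assumed numbers (roughly 100 euros per server per month, which is a guess rather than a figure from the article):

```python
# Back-of-the-envelope from the figures assumed above (not from the article).
servers = 2
cost_per_server_eur = 100          # assumed monthly price per dedicated server

current_monthly = servers * cost_per_server_eur      # ~200 EUR/month now
previous_monthly = current_monthly * 10              # if that is 1/10 of before
yearly_saving = (previous_monthly - current_monthly) * 12

print(current_monthly, previous_monthly, yearly_saving)
# -> 200 2000 21600
```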
Would love to see all “switched to X and saved Y%” articles banned on lobsters; the titles are always misleading.
Note that the original title of the post is not of that shape. I’m not sure why @vermaden decided to make it even more clickbaity than it already was.
As I mentioned in one of the comments earlier:
IMHO that is the meritum/TL;DR/outcome of the article.
Except it isn’t. Most of the saving came from fixing how sessions were stored so that they didn’t need to keep scaling up the compute. They could have kept running Linux (on-prem or in the cloud) with that fixed model and kept the same cost saving.
I wonder whether they would have found the issue on the old setup. I tend to believe they would have, but only at some point in the distant future, when autoscaling could no longer hide it.