The only good cloud is a google cloud
First, this is not marketing. I’m not paid nor do I have any incentive to say what comes next. My personal opinion overall is that we’re all ~dumb as fuck~ for giving so much of our foundational infrastructure to three US companies. That said:
Yours truly once dared to suggest on the hallowed grounds of HackerNews that Amazon’s cloud computing juggernaut, AWS, had cleverly exploited the very shortsightedness that plagued operations teams and caused people to believe that we could not do better internally. A stroke of genius, for sure.
But, let’s be frank: outside of S3, have any of their so-called “innovations” truly served anyone beyond lining the pockets of some fat cats in Seattle?
I’ve traversed our industry as a counter-culture goblin for a long time (though I prefer the term “independent thinker”), and I’ve heard every snake oil pitch imaginable about the cloud being the true technological messiah. Meanwhile, I, for daring to suggest that perhaps virtual machines shouldn’t cost a king’s ransom [ HN comments ], am painted as a vaudevillian villain, cartoonishly outdated, or, in one case, told that I lacked the testicles to learn. (Though I am certainly unsure how your brain is wired if it depends on your testicles.)
“Why would you give people the Sisyphean task of operating servers, that is for peasants! Economies of scale make it cheaper to operate in a cloud!” As if wrangling vendor-locked Terraform and stodgy YAML manifests is somehow intellectually superior, or as if we couldn’t possibly measure our outcomes related to cost.
I won’t waste column inches lambasting AWS for failing to deliver on their promise of leaner tech teams (though statistics prove me right). Instead, allow me to air a few grievances, then make a bold claim: Google Cloud Platform is the superior choice in every way that counts for a cloud.
Why? Because I refuse to believe this *gestures broadly at AWS console* is as good as it gets, but I’ve come to believe that we are a bunch of technological copycats, too afraid to deviate from the herd. But today, I, your friendly neighborhood goblin, will hopefully piss you off enough to respond to me angrily, and in doing so, force you to confront the true reality of why you choose inferiority and mediocrity.
Of course, Google isn’t without its faults. They’ve been known to pull the rug out from under their users. But desperate times call for desperate measures, and sometimes you have to risk a nibble from the alligator to escape the clutches of the wolf. I contend that this is actually a good thing, as it prevents you from being too comfortable sticking around.
The lure of cloud: MIRAGE OR MIRACLE? #
What is it that draws folks to this nebulous “cloud,” anyway? If you’re a pain in the ass like me you hear it constantly, and if you boil it down, you’ll find the same trio of promises: “it’s reliable” (supposedly), “it’s easy” (they say), and “it’s economical” (or so we’re told).
Well, I’m calling their bluff. If that’s all the cloud’s got going for it, then Google’s offering leaves Amazon’s in the dust. And if you think I’m wrong, I’m all ears. Name one more genuine advantage AWS has over GCP when compared to colocation of rented machines, and I’ll eat my hat.
This post is an indirect response to: “it’s hard to recommend Google Cloud” [ lobste.rs comments ]
1. Reliability #
VPCs #
AWS is zonal by default, that much is clear. VPCs require manual networking intervention (outside the happy path) to set up a “peered” network. This, by itself and with the context of “cloud helps you do the right thing for HA”, is fucking moronic.
You could try to argue that it’s hard to do networking right, or that they have a legacy to take care of. But why the fuck is that my problem if I’m paying them hand over fist to commoditise my compute? As soon as Google released global routing, Bezos should have been commanding improvement from the telescreens that I am certain are placed on every engineer’s desk (and in their cars, and homes).
To add insult to injury, doing it properly, for years, cost more, and it is so complex there’s an actual certification for it!
(I guess Google has one too, but I’m not aware of people actually taking it or needing it. AWS claims 1.31M certificate holders, though.)
Newcomers often assume that basic networking within a cloud environment would be region-wide or at least span multiple Availability Zones (AZs) for redundancy.
EC2 instances are isolated within a single AZ by default, which is not obvious to newcomers. It violates the principle of least surprise when talking about a “cloud” that’s supposed to be helping you become more reliable. And again, doing it properly cost more, meaning people’s dev environments naturally drifted away from prod.
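If you want to catch the single-AZ trap before it catches you, the check itself is tiny. What follows is a minimal sketch, not anything AWS ships: it assumes you have already exported your inventory into a list of dicts, and the `group` and `zone` field names (and the sample fleet) are made up for illustration.

```python
from collections import defaultdict


def single_zone_groups(instances):
    """Return the groups whose instances all sit in one availability zone."""
    zones = defaultdict(set)
    for inst in instances:
        zones[inst["group"]].add(inst["zone"])
    # A group seen in exactly one zone has no zonal redundancy at all.
    return sorted(group for group, seen in zones.items() if len(seen) == 1)


# Hypothetical inventory; field names and values are invented for this example.
fleet = [
    {"group": "web", "zone": "eu-central-1a"},
    {"group": "web", "zone": "eu-central-1b"},
    {"group": "db", "zone": "eu-central-1a"},
    {"group": "db", "zone": "eu-central-1a"},
]

print(single_zone_groups(fleet))
```

Anything this returns is one zone outage away from being down, which is exactly the kind of thing a newcomer would assume the cloud had already handled.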
ELBs #
The classic ELB (now considered a legacy service) has a default configuration that doesn’t distribute traffic across multiple AZs.
This might now be a legacy service, but let me repeat that.
THE FUCKING LOAD BALANCER -FOR 15 YEARS- DID NOT LOAD BALANCE ACROSS ZONES BY DEFAULT.
You need to choose the “Application Load Balancer” or “Network Load Balancer” and configure them appropriately for high availability.
Obviously this is not the case in GCP: networking is global by default and load balancers actually balance across zones. Disks can also be regional; it’s just one click away (and it costs double, but it works, and this is what I’d expect).
I guess they consider it a legacy service, so they finally fixed it after a decade and a half of service, so I should give them some credit…
But the new “ALB” still has weird quirks, even though it does the right thing by default:
After you disable an Availability Zone, the targets in that Availability Zone remain registered with the load balancer. However, even though they remain registered, the load balancer does not route traffic to them.
WHAT?! WHY?!
EC2 #
Reliability is hard when the underlying systems become unavailable. Google was the first to support live migration between hosts; I didn’t even notice it happening, and I ran some performance-sensitive applications.
AWS, though? Notice of termination: you have 4-24 hours to comply. Or none. Sometimes.
In some cases, Amazon will notice their hardware is in a degraded state and tell you to get off of it (stop and start your instance) by a certain date or it will be stopped automatically.
In some cases, there will be no warning and it will just stop. Or not enter STOP state, and simply become unreachable. It may or may not reboot after they take care of it. Sometimes, there will be an apology mail after the fact.
You probably think that this shit is “good enough”, or maybe that we’ve received enough collective brain damage over time that it has become “good enough” to get by… even if it’s not perfect… well…
2. “Good Enough” != Ease of Use #
This section used to be titled: “Good enough” helps you limp along or: you can’t 60% your way to good UX.
The insidious nature of the “good enough” approach is that it works 20-40% of the time, and that’s why it’s pervasive. However, people like to take the “good enough” approach every time, and on a long enough timeline you end up with AWS.
I mean, it works, right, but why can’t I see what project I’m in? (sorry: “Account”)…
Look, there was a really big reason why Windows kicked the absolute crap out of everything else back in the day, and part of that reason was a slavish devotion to user interface design, with an eye for accessibility and consistency. They forgot this, but it remains a large part of what put them on top. [ HN Comments ]
Clearly AWS doesn’t believe this.
I mean, what is wrong with my SSH key?
There’s no “help” dialog, or information about supported key types, or examples, literally sweet fuck-all.
Additionally, why would I care about “terminated” machines? I can’t do anything with the info here. If I need to know what’s terminated, shouldn’t I just look at logs? I can’t recover anything from these; it’s not like I can “un”-terminate them or recover the drive…
And.. while I’m here… why can’t I see instances hosted in other regions?
Clearly UI weirdness isn’t a deal-breaker, but it’s a constant reminder that AWS settles for ‘good enough’ instead of striving for excellence, which they should be able to afford.
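For what it’s worth, a global view is not hard to build yourself, which makes its absence all the more baffling. Here is a minimal sketch of the idea, under loud assumptions: `list_region` is a hypothetical callable you would back with a real per-region API call, and the fake inventory below (IDs, states, regions) is invented so the example runs on its own.

```python
def list_everywhere(regions, list_region):
    """Merge per-region instance listings into one list, tagging each row with its region."""
    merged = []
    for region in regions:
        for instance in list_region(region):
            merged.append({**instance, "region": region})
    # Sort so the combined view is stable regardless of the order regions were queried in.
    return sorted(merged, key=lambda row: (row["region"], row["id"]))


# Stand-in for a real per-region listing call; every value here is made up.
FAKE_INVENTORY = {
    "eu-west-1": [{"id": "i-aaa", "state": "running"}],
    "eu-central-1": [{"id": "i-bbb", "state": "running"}],  # the forgotten one in Frankfurt
    "us-east-1": [],
}


def fake_list_region(region):
    return FAKE_INVENTORY.get(region, [])


for row in list_everywhere(list(FAKE_INVENTORY), fake_list_region):
    print(row["region"], row["id"], row["state"])
```

Injecting the lister keeps the merge logic testable without credentials. The point stands, though: this is a loop and a sort, and you are paying a cloud vendor so that you don’t have to write it.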
Let’s see how Google handles those things:
First, logging in allows you to have a standard Google account or a federated identity…
and… projects/workspaces are human readable…
Better yet, my VM list doesn’t have dead resources and is actually global by default, meaning no random resources in Frankfurt that are chewing our bill silently for years on end that we never noticed.. (Yes, that happened because we took our eye off the ball).
(ignore the fact they’re in the same zone, they have regional disks :’) )
And they even have human readable names, which are used by the API and CLI too…
… Oh wait, what the fuck is that!? In three places in that image I see the word “Save”!
3. Cost optimisations #
AWS cost management is basically a meme at this point. They even invented a term for it: FinOps (hey, weren’t we supposed to be reducing operations?! We just invented a whole fucking role, and a complicated, boring one!).
Meanwhile Google bakes cost saving into its offering with a comprehensive pricing calculator, sustained-use discounts (in addition to committed use, like in AWS), and a “FinOps hub”🤮 to help you organise and further reduce your spend…
The actual cost reporting system is pretty nice too, and you can export it to BigQuery and get extremely detailed reporting if you really want.
Yeah, it’s still a goddamn rip-off! Cloud’s gonna bleed you dry compared to a colo, no question. You’ll be paying five to even eleven times the price sometimes.
At least they’re not sneaky bastards about it, though. They’ll show you how to trim some fat off that bill.
And let’s face it, one engineer can actually move mountains on this platform. No bullshit, no hidden gotchas: it’s compute the way it should be, commoditised and without brainrot.
Imagine getting paged at 3am. You’re bleary-eyed, trying to figure out if i-0256162531f6a2ed or i-0256162531f6a2ec is the problem VM. They both have the same ‘Name’ label/tag, and you accidentally opened the dev environment instead of production. You need some obscure browser extension just to tell them apart!
And don’t even get me started on naming conventions. Instances can share the same ‘Name’ label, so you can’t trust it, and if you want to auto-generate unique names in an autoscaling group, you need to write a Lambda function. Seriously, I laughed out loud when I saw that was the main recommended solution. It’s ridiculous!
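To make the 3am problem concrete, here is a rough sketch of what “can I trust the Name tag?” looks like as code. It assumes you already have an inventory as a list of dicts; the `name` and `id` field names and the `api` name are hypothetical, and `unique_name` is just one possible convention I am making up here, not the recommended Lambda approach.

```python
from collections import defaultdict


def ambiguous_names(instances):
    """Map each Name shared by more than one instance to the IDs hiding behind it."""
    ids_by_name = defaultdict(list)
    for inst in instances:
        ids_by_name[inst["name"]].append(inst["id"])
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}


def unique_name(base, instance_id):
    """Derive a distinct, readable name by appending the tail of the instance ID."""
    return f"{base}-{instance_id[-4:]}"


# Hypothetical inventory: two instances sharing one Name, plus one that is fine.
inventory = [
    {"name": "api", "id": "i-0256162531f6a2ed"},
    {"name": "api", "id": "i-0256162531f6a2ec"},
    {"name": "worker", "id": "i-0000000000000001"},
]

print(ambiguous_names(inventory))
print([unique_name(inst["name"], inst["id"]) for inst in inventory[:2]])
```

Suffixing with the ID tail is a blunt instrument, and two IDs could in principle share a tail, but it shows how little it takes to get names a half-asleep human can tell apart.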
Google is scary though #
Google is a scary proposition for two reasons:
- Google itself tends to deprecate fucking everything it seems.
- It’s not popular enough, so “enhancements” are AWS first, like cloudcraft.
Point #2 is self-fulfilling, so let’s not even bother talking about it.
Point #1 however, is important.
Windows is well-known for its commitment to backward compatibility, sometimes to a fault. This can lead to frustrating situations, like having 32 different USER_INFO struct “levels” for interacting with win32 functions. The problem is, many of these structures don’t play nicely with group functions because groups don’t let you work with usernames directly. Groups in Windows fundamentally only understand security identifiers (SIDs), not usernames, which can make things tricky for developers as there’s no USER_INFO struct level that understands SIDs. A fucking nightmare. (Ask how I know, go on.)
A slavish devotion to backwards compatibility stifles innovation; this is largely the reason Moxie Marlinspike did not enjoy the idea of federating Signal.
Thus, we should allow our clouds to shed the dead weight. Crucially, it’s better for us too, since more work can go into better developer experience instead of complicated cruft that makes things slower, more complex and more idiosyncratic over time (like Windows).
Maybe I’m biased, because as a game developer I’m used to moving fast and abandoning things if they’re not working.
Cloud in general #
Look, the cloud is a rip-off. All that talk about its benefits is just short-term thinking in disguise. Even GCP, which I think is the best of the bunch, is still highway robbery. Sure, it’s great for handling sudden traffic spikes or figuring out your needs, but in the long run, you’ll save a fortune by ditching the cloud and running your own hardware. My build servers paid for themselves in two damn months! And don’t give me that “humans cost money” shit: AWS experts cost way more than old-school sysadmins - and I haven’t touched my build servers in almost two years. It’s all a scam to funnel your startup’s cash straight to Bezos.
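The payback maths is worth doing for your own numbers rather than taking my word for it. A back-of-the-envelope sketch follows; every figure in it is invented purely to show the arithmetic and none of them are my actual costs.

```python
def payback_months(hardware_cost, monthly_cloud_bill, monthly_self_hosted_cost):
    """Months until buying hardware outright beats continuing to rent the cloud equivalent."""
    saving = monthly_cloud_bill - monthly_self_hosted_cost
    if saving <= 0:
        raise ValueError("self-hosting never pays back if it costs as much per month")
    return hardware_cost / saving


# Hypothetical figures: 10,000 of hardware, a 6,000/month cloud bill,
# and 1,000/month for colo space, power and bandwidth.
print(payback_months(10_000, 6_000, 1_000))
```

Put your own sysadmin time into `monthly_self_hosted_cost` if you like; the whole argument is that the number usually still comes out embarrassingly small.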
The whole cloud industry is a racket designed to suck your startup dry because you’re too stupid to think beyond 6 months and you have heard stories regarding heavily underfunded ops teams. This is made worse by slick consultants, professional services salesmen and developer evangelists whispering sweet nothings in your ear about “scale” and “abstraction” while you rack up a bill you can’t afford, and learn things you can never apply outside. Hopefully they at least lubed you up with $100,000 in startup credits, if not, what were you fucking thinking?
You traded freedom for a little comfort in being told you were smart. Dumbass.
GCP might have its quirks, but at least it doesn’t try to rewire your brain like AWS. You can actually get shit done without memorizing their entire goddamn encyclopedia, and the skills you learn are useful elsewhere, unlike AWS’s proprietary nonsense. GCP is a tool, not a religion – except maybe for their IAM, which is still less painful than AWS’s convoluted mess.
Azure? #
“Jan, you’re talking a lot about AWS vs GCP (and, no cloud at all) but what about Azure?”
Well… If you’re using Azure, it’s because your CFO decided that, since you already have a deal with Microsoft for Office, you might as well put it on the same bill instead of having another relationship with another vendor.
Nobody chooses Azure for Azure. Even the Azure representatives didn’t recommend Azure to me, so they’re clearly aware of this (which, honestly, endears me to other service offerings from Microsoft…).
Conclusion #
So, GCP makes it easier to do the right thing by default, has significantly improved DX/UX (no missing instances because they’re in another castle, no garbage errors, no missing help, and you can reference instances by human-readable names) and does its best to help you understand costs and save money… the three reasons you would even use a cloud in the first place (supposedly).
Issues raised with AWS could be answered with “you’re holding it wrong”, but isn’t it the entire fucking point that it should be easy to hold it right? This is supposed to be replacing Ops specialisation, not changing ops specialisation to be vendor specific instead!
Isn’t the point that “the cloud” is meant to be a clean abstraction over a commodity system, so we spend less time and energy on this cruft? That it requires less staff, less time and more focus on the product itself?
If that’s the case, then… Google Cloud is the only game in town.
But, they might deprecate your shit.
(In fairness, Google Domains was not a Google Cloud product (which was actually extremely annoying when using it before); it’s a notable distinction because Google Cloud products seem to always have worthwhile replacements waiting for them. That said, yes, it pissed me off. I recommend porkbun as they suck less.)
Personally, I see that as a good thing: the absolute best use of cloud is to scale a bit for peaks, or to save you time and energy up front when you are figuring things out. Staying in any cloud long-term is a money-losing strategy on all fronts.