I think the migration is cool, but I'd love to hear how GitHub circumvented the pretty fundamental issues with MySQL scalability and its penchant for silently corrupting data. I'm sure they solved it, but it must have taken a really long time.
I'd love to spend time with a database engineer at GitHub. I'm aware that Facebook also scaled MySQL pretty widely, but Facebook only used MySQL as dumb storage for various things, which is not the same.
I was the first database engineer at GitHub and eventually VP of Engineering. We did not once have an issue with MySQL corruption. We were able to leverage MySQL's high connection limits (not possible with Postgres because of its fundamentally flawed per-process model).
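For reference, the knob in question is a one-liner to inspect and raise; a minimal sketch, with a purely illustrative value rather than anything GitHub is known to have run:

    SHOW VARIABLES LIKE 'max_connections';  -- MySQL's default is 151
    SET GLOBAL max_connections = 5000;      -- each connection is a thread, not a forked process
    SHOW STATUS LIKE 'Threads_connected';   -- how many connections are in use right now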
MySQL is not just used at Facebook and GitHub; it is ubiquitous among the largest deployments in the world: Yahoo, Slack, HubSpot, LinkedIn, Twitter, Roblox, Etsy, Shopify, Blizzard, EA, and so many more. I think this disproves the claim that there are fundamental scaling issues.
Postgres didn't even have real replication when GitHub was founded.
You strike me as someone who has had to defend this choice many times.
For what it’s worth, I never noticed the silent corruption until I did; then it was everywhere and it was too late. Not catching it is not a sign it wasn't there: it easily could have been, and either nobody checked or the errors were assumed to be bugs further up the stack.
I didn't intend a discussion about Postgres vs. MySQL, not only because that's a false dichotomy but also because it elides any discussion of working around the limitations of either and becomes a sort of holy war.
Could you link a bug report for the corruption issues?
Most of the common issues, like silent truncation and casting, are slowly being fixed over time, helped by toggles like strict mode (I'm not sure when that became available).
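A minimal sketch of the truncation case being described, using a throwaway table:

    CREATE TABLE t (name VARCHAR(3));
    SET SESSION sql_mode = '';                    -- permissive: silent coercion
    INSERT INTO t VALUES ('abcdef');              -- stored as 'abc', warning 1265 only
    SET SESSION sql_mode = 'STRICT_TRANS_TABLES'; -- strict mode (in the defaults since 5.7)
    INSERT INTO t VALUES ('abcdef');              -- ERROR 1406: Data too long for column 'name'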
Row-based replication is still a huge footgun, though.
But I guess you are trying for a gotcha, because ultimately there are ways to avoid the corruption; what I am talking about goes deeper than that, and comes from a time when a lot of these footguns were much more exposed.
RDBMSs tend to choose between:
App does the checks (DB flexible)
Database does the checks (DB strict)
In MySQL land it has always erred toward the former: things like returning a cached value for count(), avoiding ACID compliance, or permitting obviously invalid states (February 30th). So these checks are not enforced, which leads to a lot of accidental casting and acceptance of invalid data; when you access that data, your app blows up, or your state is impossible in some other way.
In this way: these are not bugs, this is a design decision.
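To make the February 30th case concrete, a small sketch with a throwaway table and non-strict settings:

    CREATE TABLE events (happened_on DATE);
    SET SESSION sql_mode = 'ALLOW_INVALID_DATES';
    INSERT INTO events VALUES ('2012-02-30');  -- accepted and stored as-is
    SET SESSION sql_mode = '';
    INSERT INTO events VALUES ('2012-02-30');  -- coerced to '0000-00-00' with only a warning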
Anyway, since you asked, I didn't have to look far to find an actual issue; here's one I found for MySQL 8:
https://forums.freebsd.org/threads/dont-upgrade-to-mysql-8-0-29-corruption-issues.85412/
Not really a gotcha, just my pet peeve about people posting about big issues without receipts, so you never know if it's an ancient-version thing, what they mean by "corruption", etc. This is a massive problem with btrfs posts (what version? did you use unstable features? how did you verify it's not hardware? etc.)
I'm glad you mentioned the design decisions, because they're both really important to know about and something I wouldn't classify as corruption.
I appreciate you weighing in, and apart from the success GitHub has had with MySQL, I think opting for it was probably a sound decision at the time, if scaling and replication were expected.
That said, regarding:
Postgres didn't even have real replication when GitHub was founded.
If TFA's "15 years ago" is correct, that would put the choice between (I think) the proven MySQL 3.x, or the newer 4/5/5.1, and PostgreSQL 8.3-ish?
The latter already had solutions like Slony, which for some uses was probably better than MySQL?
https://www.postgresql.org/docs/8.4/release-8-3.html
https://wiki.postgresql.org/wiki/Slony
https://en.m.wikipedia.org/wiki/MySQL
https://stackoverflow.com/questions/142068/which-database-has-the-best-support-for-replication
15 years ago I worked at a company doing a few hundred million a year in revenue, and they used Bucardo trigger-based replication with Postgres.
I'm not criticizing; they (the GitHub employee in the parent comment) aren't experts in a technology they didn't use 15 years ago. But Postgres definitely had replication circa 8.1 (released in 2005).
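For anyone who hasn't seen the trigger-based approach Slony and Bucardo build on, the core idea fits in a sketch; the accounts table and log schema here are invented for illustration:

    -- triggers capture every change into a queue table,
    -- and a separate daemon ships that queue to the replicas
    CREATE TABLE accounts (id INT PRIMARY KEY, balance NUMERIC);
    CREATE TABLE accounts_log (op TEXT, row_id INT, changed_at TIMESTAMPTZ DEFAULT now());

    CREATE FUNCTION log_accounts() RETURNS trigger AS $$
    BEGIN
      INSERT INTO accounts_log (op, row_id) VALUES (TG_OP, NEW.id);
      RETURN NEW;
    END $$ LANGUAGE plpgsql;

    CREATE TRIGGER accounts_capture AFTER INSERT OR UPDATE ON accounts
      FOR EACH ROW EXECUTE PROCEDURE log_accounts();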
Add Booking.com to that list as well. I know quite a handful of DB engineers from Booking went to GitHub 🤗
the pretty fundamental issues with MySQL scalability
Any details on what you mean here? If anything, MySQL used to have a much better reputation for scaling than Postgres.
As for how, they're now using Vitess, and before that they used quite a bit of manual sharding if I'm not mistaken.
its penchant for silently corrupting data
Assuming you mean all the MySQL idiosyncrasies that truncate or cast inserted values with just a warning: they're pretty much all fixed as of MySQL 5.7, as long as you run in strict mode.
I think you're right that MySQL has a better reputation for scalability compared to DB2 or Postgres, but it falls somewhat short compared to MSSQL or Oracle (I know, yuck).
However, the limitations of MySQL that I discovered are related to memory contention on large systems. And that is completely ignoring the replication story, which is ostensibly statement-based and provides no protection against replicas drifting out of sync; replication can stall outright when statements become impossible to apply.
Oddly, MySQL enthusiasts bill this as a feature and insist that you are holding it wrong if it ever happens to you. I still consider it one of the major factors limiting scale: you can't trust your read replicas to be consistent with the primaries.
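A small sketch of the class of statement that makes statement-based replication drift; the accounts table is hypothetical:

    SET SESSION binlog_format = 'STATEMENT';
    -- no ORDER BY: which 10 rows get updated is undefined, so the primary and a
    -- replica replaying the statement can pick different rows; MySQL itself
    -- flags this as unsafe for statement-based logging (warning 1592)
    UPDATE accounts SET flagged = 1 LIMIT 10;
    SET SESSION binlog_format = 'ROW';  -- row-based logging records the actual changed rows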
the replication story which is ostensibly statement based
So I'm no DBA, but I work somewhat close-ish to the database team at a company that runs a huge MySQL infra (comparable to GitHub's).
I may be totally off, but I'm pretty sure all serious MySQL deployments use row-based replication, not statement-based.
As samlambert said, numerous companies make MySQL work at humongous scale. I'm not gonna tell you you are holding it wrong, but it's certainly possible to make it work ¯\_(ツ)_/¯
To be clear, what I said is that by default the scalability and consistency story for MySQL is weak. I am interested to understand how these large-scale companies make it work, given the obvious issues I have personally witnessed even at large companies.
And the answer seems to be: “large companies made it work so there are no issues”.
So, to reiterate: I am aware that people have made it scale and I presume they have avoided the corruption pitfalls.
I am interested in HOW.
One scaling-related thing I read about GB is that apparently they don't use foreign keys at all.
What's "GB" here? If you mean GitHub, that is not (at all, or in any way) correct.
(I am going to continue the assumption that GB is GitHub although I've never seen it shortened in that way previously.)
https://github.com/github/gh-ost/issues/331#issuecomment-266027731
Imagine using the features of your database. The horror.
The point about sharding is valid, if you are at GitHub scale.
I've seen cascade deletes that were set up completely backwards and ended up deleting a whole table, which then cascade-deleted most or all of the active database. I highly suggest avoiding them; if you do use them, you need to test them fully.
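To make that failure mode concrete, a toy schema (invented, nobody's real one) where a single-row DELETE fans out:

    CREATE TABLE users  (id INT PRIMARY KEY);
    CREATE TABLE repos  (id INT PRIMARY KEY, user_id INT,
                         FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE);
    CREATE TABLE issues (id INT PRIMARY KEY, repo_id INT,
                         FOREIGN KEY (repo_id) REFERENCES repos(id) ON DELETE CASCADE);
    -- deleting one user silently removes all of their repos and every issue in them
    DELETE FROM users WHERE id = 1;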
Ok, how are they different from an ORM doing it for you?
I'd be intrigued to learn what sort of functionality those queries are used for.
I’m pretty sure we were using a WHERE IN query for heavy bulk deletion at a previous job. I can’t remember if that worked well, unfortunately. I guess it did.
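Presumably something along these lines, batched to keep each transaction small (table and ids invented for the sketch):

    DELETE FROM issues WHERE id IN (101, 102, 103);    -- explicit batch of ids
    DELETE FROM issues WHERE repo_id = 42 LIMIT 1000;  -- or MySQL's DELETE ... LIMIT, repeated until no rows remain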