This document discusses Craigslist's migration from older MySQL database servers to new servers equipped with Fusion-io SSDs. It describes Craigslist's high database load of over 100 million postings and 1 billion daily page views. The migration involved replacing 14 older, less performant servers with just 3 new servers using Fusion-io SSDs. This reduced total power usage from 4,500 watts to 570 watts while greatly increasing I/O performance and reducing query response times.
3. Some Numbers
  ~100,000,000 postings in live database
  Over 1,000,000,000 page views daily
  High churn rate (avg lifetime ~14 days)
  ~350-500GB on disk
  MySQL 5.5.x and InnoDB Compression
  Used to be ~100-150GB larger (or more!)
  All records touched multiple times
  98% of queries are OLTP
4. The Posting Cache
  Web Server Tier (apache/mod_perl)
  Posting Cache Tier (memcached + perl)
  Database Tier (MySQL)
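The three tiers above follow a read-through pattern: the web tier asks the cache first, and only a miss reaches the database. The sketch below is a minimal illustration of that flow, not Craigslist's code. `PostingCache` is a hypothetical name, and plain dictionaries stand in for memcached and MySQL.

```python
class PostingCache:
    """Read-through cache: try the cache first, fall back to the database on a miss."""

    def __init__(self, db):
        self.db = db      # stand-in for the MySQL tier (a plain dict here)
        self.cache = {}   # stand-in for memcached
        self.hits = 0
        self.misses = 0

    def get(self, posting_id):
        if posting_id in self.cache:
            self.hits += 1
            return self.cache[posting_id]
        # Cache miss: this request goes all the way back to the database tier.
        self.misses += 1
        posting = self.db.get(posting_id)
        if posting is not None:
            self.cache[posting_id] = posting  # populate so the next read is a hit
        return posting


# Hypothetical sample data, for illustration only.
db = {1: "bike for sale", 2: "apartment for rent"}
cache = PostingCache(db)
cache.get(1)  # miss, loaded from the database
cache.get(1)  # hit
cache.get(2)  # miss
print(cache.hits, cache.misses)
```

Every miss costs one database query, which is why the miss rate drives the load on the database tier.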
5. The Problem
  Adding more memcached nodes
  Lots of cache misses initially
  MySQL boxes take a big query load
  (time passes)
  MySQL boxes pegged many hours later
  (time passes)
  Next day: WTF?!
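One common reason that adding cache nodes causes an initial burst of misses is that keys get reassigned to different nodes. The slides do not say how keys were distributed across memcached nodes, so the sketch below assumes simple modulo hashing purely for illustration. The node counts (8 and 10) and the key format are hypothetical. It explains only the initial burst, not why the misses persisted into the next day.

```python
import hashlib


def node_for(key: str, node_count: int) -> int:
    """Pick a node by hashing the key and taking it modulo the node count."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % node_count


def remapped_fraction(keys, old_nodes: int, new_nodes: int) -> float:
    """Fraction of keys that land on a different node after the pool is resized."""
    moved = sum(1 for k in keys if node_for(k, old_nodes) != node_for(k, new_nodes))
    return moved / len(keys)


# Hypothetical keys and pool sizes, for illustration only.
keys = [f"posting:{i}" for i in range(10_000)]
frac = remapped_fraction(keys, old_nodes=8, new_nodes=10)
print(f"{frac:.0%} of keys map to a different node")
```

Under this assumption, about 80% of keys move when the pool grows from 8 to 10 nodes, and each moved key is a miss until it is fetched again. Consistent hashing would move far fewer keys.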
6. Fire!
  Web Server Tier (apache/mod_perl)
  Posting Cache Tier (memcached + perl)
  Database Tier (MySQL)
  We were sending nearly 30% of requests all the way back to the DB tier instead of the normal 2-5%
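To put the 30% figure in perspective, the arithmetic below compares it against the normal 2-5% range. It is a back-of-the-envelope sketch that assumes database load scales linearly with the share of requests reaching the DB tier. The function name is hypothetical.

```python
def load_multiplier(incident_miss_pct: float, normal_miss_pct: float) -> float:
    """How many times more requests reach the database than in normal operation."""
    return incident_miss_pct / normal_miss_pct


# Percentages taken from the slide: ~30% during the incident vs. the normal 2-5%.
low = load_multiplier(30, 5)   # compared with the high end of normal
high = load_multiplier(30, 2)  # compared with the low end of normal
print(f"DB tier saw roughly {low:.0f}x to {high:.0f}x its normal request volume")
```

Under that linear assumption, the database tier was handling roughly 6 to 15 times its usual request volume.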
7. Solution
  Let’s put the New Hardware in the pool
  Add 4 machines
  And it still sucked…
  The 4 were fast but only took ~20% of the hits
  Remove all the Old Hardware
  Remove 14 of 18 machines
  Sounds totally sane, right?
8. Old Hardware
  3 years old
  3U, Dual AMD 2218 HE
  32GB RAM
  16 15k RPM SAS disks
  RAID-10
  ~2,000 iops
  ~325 watts
10. Before and After
  [Charts: Load Average; I/O Capacity of Data Disks]
  Night and Day!
  Old boxes return to “steady state”
11. This Chart Should be Green
  Average power for Fusion-io equipped server: ~200 watts.
  It was closer to 160 when replicating but not serving traffic.
12. Fusion-io FTW
  Can you tell when this machine started getting live traffic?
  OLTP means disk matters WAY more than CPU
  Into the fire!
13. The Numbers
  Old: 2,000 iops / 325W = 6.15 iops/watt
  New: 40,000 iops / 200W = 200 iops/watt
  Conservatively assumes a lot of degradation
  33-66x performance/watt
  But let’s just call it 50x
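The per-watt figures above can be reproduced with a few lines of arithmetic. This sketch uses only the numbers stated on the slide. The function name is hypothetical.

```python
def iops_per_watt(iops: float, watts: float) -> float:
    """I/O operations per second delivered per watt of power drawn."""
    return iops / watts


old = iops_per_watt(2_000, 325)    # old 15k RPM SAS server
new = iops_per_watt(40_000, 200)   # Fusion-io server, conservative IOPS figure
ratio = new / old

print(f"old: {old:.2f} iops/watt")
print(f"new: {new:.0f} iops/watt")
print(f"improvement: {ratio:.1f}x")
```

With the conservative 40,000 iops figure the improvement comes out at about 32.5x, which matches the low end of the 33-66x range after rounding.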
14. Epilogue
  A week later, we re-purposed 1 Fusion-io box
  The cache eventually did fill
  Poor slab size configuration had been causing early expiration of cached objects
  14 “old” servers: 4,500 watts
  28,000 iops capacity
  3 “new” servers: 570 watts
  240,000+ iops capacity
  What to do with 10+ spare “db class” boxes?
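The fleet-level totals in the epilogue can be compared the same way. The sketch below uses only the wattage and iops figures stated on the slide. The `Fleet` class is a hypothetical helper, and 240,000 is treated as a lower bound because the slide says "240,000+".

```python
from dataclasses import dataclass


@dataclass
class Fleet:
    servers: int
    watts: float
    iops: float

    @property
    def iops_per_watt(self) -> float:
        return self.iops / self.watts


old_fleet = Fleet(servers=14, watts=4_500, iops=28_000)
new_fleet = Fleet(servers=3, watts=570, iops=240_000)  # iops is a lower bound

power_saving = 1 - new_fleet.watts / old_fleet.watts
iops_gain = new_fleet.iops / old_fleet.iops

print(f"power reduction: {power_saving:.1%}")
print(f"iops capacity gain: at least {iops_gain:.1f}x")
print(f"iops/watt: {old_fleet.iops_per_watt:.1f} -> {new_fleet.iops_per_watt:.1f}")
```

On these figures, the three new servers draw roughly 87% less power than the fourteen old ones while offering more than eight times the I/O capacity.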