This week Facebook will complete its roll-out of a new photo storage system designed to reduce the social network’s reliance on expensive proprietary solutions from NetApp and Akamai. The new large blob storage system, named Haystack, is a custom-built file system solution for the over 850 million photos uploaded to the site each month (500 GB per day!). Jason Sobel, a former NetApp engineer, led Facebook’s effort to design a more cost-effective and high-performance storage system for their unique needs. Robert Johnson, Facebook’s Director of Engineering, mentioned the new storage system rollout in a Computerworld interview last week. Most of what we know about Haystack comes from a Stanford ACM presentation by Jason Sobel in June 2008. Haystack will allow Facebook to operate its massive photo archive from commodity hardware while reducing its dependence on CDNs in the United States.
The old Facebook system
Facebook has two main types of photo storage: profile photos and photo libraries. Members upload photos to Facebook and treat them as a digital archive, with very few deletions and only intermittent reads. Profile photos are a per-member representation stored in multiple viewing sizes (150px, 75px, etc.). The previous Facebook system relied heavily on CDNs from Akamai and Limelight to protect its origin servers from a barrage of expensive requests and improve latency.
Facebook profile photo access is accelerated by Cachr, an image server powered by evhttp with a memcached backing store. Cachr protects the file system from new requests for heavily-accessed files.
The old photo storage system relied on a file handle cache placed in front of NetApp to quickly translate file name requests into an inode mapping. When a Facebook member deletes a photo, its index entry is removed but the file still exists within the backing file system. Facebook photos’ file handle cache is powered by lighttpd with a memcached storage layer to reduce load on the NetApp filers.
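To make that pattern concrete, here is a minimal sketch of a file handle cache: cache the expensive filename-to-inode resolution in memory so repeat requests skip the NetApp metadata walk. The names below, and the plain dictionary standing in for memcached, are illustrative assumptions, not Facebook’s actual code.

```python
# Minimal sketch of a file handle cache (illustrative; a dict stands in for memcached).
handle_cache = {}

def resolve_inode(path):
    """Stand-in for the expensive filer metadata walk that maps a file name to an inode."""
    return hash(path) & 0xFFFFFFFF

def lookup_handle(path):
    inode = handle_cache.get(path)
    if inode is None:               # cache miss: pay the metadata cost once
        inode = resolve_inode(path)
        handle_cache[path] = inode
    return inode                    # the image server reads photo bytes via this handle
```

When a photo is deleted, only the index or cache entry needs to be dropped; the bytes can remain untouched on the filer, matching the behavior described above.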
No need for POSIX
Facebook photographs are viewable by anyone in the world aware of the full asset URL. Each URL contains a profile ID, photo asset ID, requested size, and a magic hash to protect against brute-force access attempts.
/[pvid]_[key]_[magic]_[size].jpg
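As a rough illustration, the template above can be split into its four fields before the magic hash is verified. The regular expression and field types below are assumptions based only on the template shown, not Facebook’s real validation code.

```python
import re

# Assumed shape of the URL template shown above; the real format may differ.
PHOTO_URL = re.compile(r"^/(?P<pvid>\d+)_(?P<key>\d+)_(?P<magic>[0-9a-f]+)_(?P<size>\w+)\.jpg$")

def parse_photo_url(path):
    match = PHOTO_URL.match(path)
    if match is None:
        raise ValueError("not a photo URL: %s" % path)
    return match.groupdict()

# parse_photo_url("/1234_5678_9abcdef0_n.jpg")
# -> {'pvid': '1234', 'key': '5678', 'magic': '9abcdef0', 'size': 'n'}
```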
Traditional file systems are governed by the POSIX standard, which defines metadata and access methods for each file. These file systems are designed for access control and accountability within a shared system. An Internet storage system whose files are written once, never deleted, and readable by the world has little need for such overhead. A POSIX-compliant inode must specifically contain:
- File length
- Device ID
- Storage block pointers
- File owner
- Group owner
- Access rights for owner, group, and others: read, write, execute
- Change time
- Modification time
- Last access time
- Reference counts
Only the top three POSIX requirements matter to a file system such as Facebook’s. Its servers care where the file is located and its total length but have little concern for file ownership, access rights, timestamps, or the possibility of linked references. The additional overhead of POSIX-compliant metadata storage and lookup on NetApp filers led to three disk I/O operations for each photo read. Facebook simply needs a fast blob store but was stuck inside a general-purpose file system.
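A sketch of how little per-photo metadata that read path actually needs, with illustrative field names rather than Haystack’s real on-disk format:

```python
from dataclasses import dataclass

@dataclass
class PhotoRecord:
    # The three pieces of metadata the read path actually uses.
    volume_id: int  # which storage device/volume holds the blob
    offset: int     # byte offset of the blob within that volume
    length: int     # total length of the blob in bytes
```

Everything else on the POSIX list above (owners, mode bits, three timestamps, link counts) is stored and maintained by a general-purpose file system but never consulted when serving a world-readable, write-once photo.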
Haystack file storage
Haystack stores photo data inside 10 GB buckets with 1 MB of metadata for every GB stored. The metadata is guaranteed to be memory-resident, leading to only one disk seek for each photo read. Haystack servers are built from commodity machines and disks assembled by Facebook to reduce the costs associated with proprietary systems.
The Haystack index stores metadata about the one needle it needs to find within the Haystack. Incoming requests for a given photo asset are interpreted as before, but now contain a direct reference to the storage offset containing the appropriate data.
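A minimal sketch of that lookup, under assumed names: an in-memory index maps a photo key to an (offset, size) pair inside one large haystack file, so a read costs a single seek. This illustrates the idea rather than Haystack’s actual format.

```python
class HaystackVolume:
    def __init__(self, data_path, index):
        self.data = open(data_path, "rb")  # one large file of concatenated photo blobs
        self.index = index                 # dict: photo_key -> (offset, size), kept in RAM

    def read_photo(self, photo_key):
        entry = self.index.get(photo_key)
        if entry is None:
            return None                    # unknown or deleted photo
        offset, size = entry
        self.data.seek(offset)             # the single disk seek per photo read
        return self.data.read(size)
```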
Cachr remains the first line of defense for Haystack lookups, quickly processing requests and loading images from memcached where appropriate. Haystack provides a fast and reliable file backing for these specialized requests.
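The serving path then looks roughly like the cache-aside sketch below, with memcached absorbing hot photos and Haystack handling the misses; the function names and the dict standing in for memcached are assumptions for illustration.

```python
image_cache = {}  # stand-in for memcached, keyed by the request path

def read_from_haystack(path):
    """Placeholder for a Haystack volume read (single seek, as above)."""
    return b"\xff\xd8...jpeg bytes..."

def serve_photo(path):
    body = image_cache.get(path)
    if body is None:                  # cold photo: fall through to Haystack once
        body = read_from_haystack(path)
        image_cache[path] = body      # subsequent requests are served from memory
    return body
```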
Reduced CDN costs
The high performance of Haystack, combined with a new data center presence on the east and west coasts of the United States, reduces Facebook’s reliance on costly CDNs. Facebook does not currently have the points of presence to match a specialist such as Akamai, but speed-of-light latency plus fast file access should be performant enough to reduce CDN use in areas where Facebook already has data center assets. Facebook can partner with specialized CDN operators in markets such as Asia, where it has no foreseeable physical presence, to improve access times for those markets.
Summary
Facebook has invested in its own large blob storage solution to replace expensive proprietary offerings from NetApp and others. The new server structure should reduce Facebook’s total cost per photo for both storage and delivery moving forward.
Big companies don’t always listen to the growing needs of application specialists such as Facebook. Yet you can always hire away their engineering talent to build you a new custom solution in-house, which is what Facebook has done.
Facebook has hinted at releasing more details about Haystack later this month, which may include an open-source roadmap.
Update April 30, 2009: Facebook officially announced Haystack and further details.
Hmm, seems remarkably similar to Flickr’s architecture, except they use Squid instead of lighty. Using a CDN in front of such a long-tailed collection of resources doesn’t make any sense; I’m sure Akamai et al loved them for the revenue flow, but they weren’t providing much value there, at all.
Hey Niall, Great article. Thank you for writing this up. I’d be very curious how they’re setting up those commodity servers if you have any insights on that.
It would be nice to see Haystack open-sourced. With more tools for high-scale in the open source domain, fewer companies who have great ideas but not the resources to scale will die. That’s a hot topic currently, but even in the good times, anything that lowers the cost of building new stuff levels the playing field.
This is a pretty fascinating look at what Facebook has done; they know their business and have optimized for it. I wonder how much this applies to most other companies, though – most of us are not Facebook.
Facebook made their make or buy decision, choosing make, but most of us would worry about our ability to support that over time. If we construct our own one-off solutions using our own Jason Sobels, what about when they leave? I’ve had to abandon perfectly good applications which were core to my company when we could not get sufficient support.
The hacker in me loves this though – this seems like a brilliant hack
Am I the only one thinking it’s about time? Any one with this much overhead and this many engineers should be jumping at ways to do things themselves. A project/plan that saves even 1% cost is worth it when dealing with these incredible usage numbers.
The idea of this scalable filesystem using commodity hardware is what intrigues me. Small companies can get something like this from Amazon and other cloud storage platforms, but the costs – both fiscal and latency – become unsustainable long term. Furthermore, when S3 goes down, so does your app (and likely your revenue). Not that these solutions work at Facebook scale anyway – but you get the idea.
Are most photo sharing collections long-tailed like this? I wonder what is the distribution of calls to a typical photo collection over a given year. For example, do they get 90% of the calls within the first 2 months of upload?
What are they using for backup in Haystack?
It would be interesting to see this compared with MogileFS. I’m especially curious to know how they handle replicas between data centers and how they deal with commodity hardware failing at regular intervals.
Where can I find this “Cachr”? I can’t find anything on it online. I want to try it out. Cheers, DT