Earlier this month, Vox launched its own liveblogging platform: Syllabus. On Monday of this week, Syllabus was put to the test in its first highly-trafficked event, The Verge’s coverage of the WWDC keynote, where it delivered near-realtime updates to 300,000+ unique clients. Syllabus was well received by "the industry" and, more importantly, our writers and our users.
We’re pretty proud of Syllabus, and would like to show you how it works.
A Quick Tour
This is the public-facing part of Syllabus — the actual liveblog itself. While an event is underway, updates are automatically streamed into view. The highlighted posts at the top of the stream are "pinned" posts: major announcements or key information about the event.
The client is responsively designed, so it works well across desktop, tablet, and mobile browsers (seen above). We’ve put a lot of effort into optimizing Syllabus for speed, as well — the initial page weight is less than 100 KB, and the JSON file polled by the client is around 2 KB after gzipping.
This is the Syllabus CMS, where writers compose and publish liveblog updates. Like the public-facing client, the CMS UI is lightweight and designed for speed. Writers can publish updates, upload images, pin and unpin posts, and see updates as they’re published by other writers (all without refreshing the page). They also have the option of using Markdown to format their updates. Naturally, Syllabus delegates authentication to Vox’s main publishing platform, Chorus, using our own OmniAuth strategy.
As you can see, there is nothing terribly complicated going on here. A CRUD interface, a means of displaying updates, and a healthy dose of Ajax to make it as fast as possible. The tricky part is making the content available to hundreds of thousands of users simultaneously.
Designed for Stability
To provide some context, Vox Media previously used a third-party liveblogging platform that services a number of well-known media companies. Unfortunately, that platform experienced outages during the last two highly-trafficked events covered by The Verge (the iPhone 4S and iPad 3 announcements), requiring that we find a more dependable alternative.
After some discussion and exploration following the iPad 3 event back in March, we determined that providing a great liveblog experience was important enough to our sites and users that building our own platform was worthwhile — and (importantly) that it was feasible to do so in time for the next big event to be covered by The Verge, WWDC 2012 in June.
Stability and uptime were our number one requirement. We knew from our Google Analytics numbers during the iPad 3 liveblog that this system had to be able to handle loads of at least 800,000 concurrent users.
We worked closely with writers from The Verge to determine what they needed from the CMS and the public-facing client. Given a timeframe of about one month, Syllabus is what we came up with.
CMS Architecture
When our third-party liveblog provider suffered outages during those recent highly-trafficked events, it seemed that excessive load on the client front-end also caused the CMS back-end to fail. Not only was the liveblog not performing properly for users; our writers weren’t able to keep covering the event, either. So we knew that we wanted to decouple these systems, such that an outage or slowness in one would not affect the other. We realized early that the best way to accomplish this would be to build a writer-facing CMS app to handle CRUD, plus a separate, scalable distribution layer that would receive updates generated by the CMS and deliver them to the client.
Before long, we realized that we already had access to a scalable distribution layer for static files: Amazon S3. If we could reduce the client-side portion of Syllabus to a static set of HTML and JavaScript files that would consume a set of constantly-updated JSON files, then we could just serve the whole thing from S3.
Being able to leverage Amazon’s 99.9% availability is great, but there was one big question: If a client is polling a single file, how long would it take for changes to that file to become visible to all clients?
Some S3 buckets provide read-after-write consistency — which means that writes are available to all clients immediately — but that level of consistency only applies to initial creates. For updates to existing files, Amazon only guarantees eventual consistency — which means that updates to files will eventually be available to all users, but with no guarantee on how soon.
So, in practice, we wondered, how long would it actually take for updates to propagate to users? We had guesses, but no hard data. If distribution was slow, there was no point in liveblogging, and users would surely go elsewhere.
Despite our uncertainty, we thought this was a promising enough avenue to pursue. If S3 ended up being a dead end, we’d just iterate and figure something else out. (For example, Google Cloud Storage advertises "strong" read-after-write consistency for both creates and updates, and may have been a viable alternative.)
We’ll go into how S3 worked out in this respect a bit later.
Given all of these considerations, this was the architecture we designed:
The first couple of items should be no surprise: the client-side CMS app talks to Rails, which persists data to MySQL. We also persist most of our models into Redis using a generalized set of ActiveRecord callbacks. As part of these callbacks, we publish a Pub/Sub signal whenever an entry is saved. Meanwhile, a separate JsonPublisher process subscribes to this signal. Whenever it is triggered, it writes to S3 whichever JSON files need to change. Usually that means writing only two files, and we’ve optimized the process to take about 250ms to execute.
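As a rough illustration, here is a minimal sketch of that pipeline in Ruby. The class name, channel name, and helpers are ours for illustration (not the actual Syllabus code), and the real JsonPublisher does more bookkeeping than shown here — pages, pinned entries, and the settings hash, for starters.

    require "redis"
    require "json"

    class Entry < ActiveRecord::Base
      after_save :publish_change

      private

      # Signal the JsonPublisher process that the liveblog's JSON needs rebuilding.
      def publish_change
        Redis.new.publish("syllabus:entries", id.to_s)
      end
    end

    # Runs as a separate long-lived process (e.g. via `rails runner`), decoupled
    # from the web request cycle. It listens for publish signals and rewrites the
    # affected files on S3: usually just live.json and the newest page file.
    Redis.new.subscribe("syllabus:entries") do |on|
      on.message do |_channel, _entry_id|
        latest  = Entry.order("sequence DESC").limit(20)
        payload = JSON.generate("entries" => latest.map(&:as_json))
        S3Uploader.upload("live.json", payload) # see the upload sketch below
      end
    end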
S3 Setup
In order to use a custom domain and exercise tighter control over routing, we use Amazon’s CloudFront CDN in front of all S3 requests. Since CloudFront can’t differentiate between requests from clients that accept gzip and those that don’t, we opted to gzip all HTML/JSON before uploading and set the Content-Encoding header manually. We felt that the benefit of the decreased transfer size outweighed the very small number of browsers that don’t support gzip (we haven’t received any complaints about this).
Obviously, we didn’t want CloudFront caching our JSON for extended periods of time; otherwise updates would never be seen by users. Thus, when uploading assets to S3, we set cache_control to ‘public, max-age=1’ and always request those assets using a timestamped querystring. The non-zero expiration was recommended by our AWS representative and seems to have worked well: the CDN was able to serve many requests directly (and thus probably respond faster for many clients).
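For illustration, here is a hedged sketch of what that upload step can look like. The gem choice (aws-sdk-s3), module name, and bucket name are ours, not necessarily what Syllabus uses, but the moving parts are the ones described above: gzip the body before upload, set Content-Encoding by hand, and give the object a short but non-zero Cache-Control.

    require "aws-sdk-s3"
    require "zlib"
    require "stringio"

    module S3Uploader
      BUCKET = "syllabus-liveblog" # illustrative bucket name

      def self.upload(key, body, content_type: "application/json")
        # Pre-compress the payload; CloudFront serves the stored bytes as-is.
        io = StringIO.new
        gz = Zlib::GzipWriter.new(io)
        gz.write(body)
        gz.close

        Aws::S3::Client.new.put_object(
          bucket: BUCKET,
          key: key,
          body: io.string,
          content_type: content_type,
          content_encoding: "gzip",           # tell browsers the body is gzipped
          cache_control: "public, max-age=1", # short, non-zero edge TTL
          acl: "public-read"
        )
      end
    end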
Client Architecture
Not surprisingly, the client is just an HTML page that bootstraps a Backbone.js application, which polls our CloudFront instance for JSON updates. We chose Backbone because we’d already used it successfully for a couple of projects at Vox, and it provides great facilities for managing, updating, and rendering collections of things (like posts) entirely on the client. (Of course, since we don’t have the benefit of a server-side application to call home to, facilitating rich client-side interactivity was key.)
There are two types of JSON files that the client tries to retrieve. The first is live.json, which contains the N most recent updates (we ended up with N=20), and is polled on an interval (we found 3 seconds to be a good number). Updates to this file are how newly-published posts are delivered to the liveblog.
The second kind of JSON file is the paginated archive of entries, starting with page0.json. Each entry is assigned a sequence number, which in turn determines the page it resides on. Each page contains N entries. By paginating the entries, we allow the user to lazy load updates from the entire event by scrolling down the page.
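One plausible reading of that scheme, sketched on the publishing side (the constant and helper names are ours for illustration, and they reuse the S3Uploader sketch above): order the entries by sequence number and slice them into fixed-size pages. In practice only the newest page needs to be rewritten when an entry is published.

    PER_PAGE = 20 # N, the same window size used for live.json

    # Illustrative only: rebuild every archive page for an event.
    def write_archive_pages(entries)
      entries.sort_by(&:sequence).each_slice(PER_PAGE).with_index do |page, i|
        payload = JSON.generate("entries" => page.map(&:as_json))
        S3Uploader.upload("page#{i}.json", payload)
      end
    end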
Here is an example of what live.json looks like:
{ "entries":[ { "body":"Tim Cook announces iOS 6
", "id":1788, "pinned":false, "sequence":318000, "timestamp":"2012-05-29T15:40:43Z", "published_at":"2012-05-29 14:26:38 -0400" }, ... /* 19 more updates here */ ], "deletedEntryIds":[ 1784 ], "pinnedEntries":[ ], "settings":{ "version":"14", "status":"live", "pollInterval":3000, "published_at":"2012/05/29 14:26:38 -0400" } }
As you can see, we have an array of all of the latest entries with all the information we need to render them, and nothing more. We keep a list of deleted entry IDs, as well, so that when an entry is deleted, the client knows to remove that element. Since deletes are rare, this list doesn’t grow too large.
The pinnedEntries array always contains all of the pinned entries, regardless of age. This way, even if an entry moves out of the latest entries array, the client can still render it without an additional query.
Finally, there is a settings hash. This data structure provides information to the client about whether the event is still live, how often to poll for updates, and so on. We bump settings.version when changes are made to the settings or the client HTML file, which causes a message to appear, telling the user to kindly refresh the page to get the new version. Thus we have the ability to tweak settings and get the update out to clients as fast as possible.
Wait, what about S3’s read-after-update consistency?!
Right, I did promise to talk about that.
Once we had built out a testable version of the platform, we needed to benchmark how long it would take for updates to be distributed to users. The first part of this test was adding logic to our client to determine the time between when it received an update and when the update was actually published. Determining when the JSON was published was as easy as injecting a timestamp into it, and we used the Date header on the jqXHR response to determine when the client received the update. We POST that delta back into Syllabus, where it gets pushed onto a Redis list. (This is a great use case for Redis because of the high volume of writes, and RPUSHing to a list is O(1).)
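The collection endpoint can be as small as this hedged sketch suggests (the controller, route, parameter, and key names are illustrative): take the reported delta and RPUSH it onto a Redis list.

    class DeltasController < ApplicationController
      # POST /deltas with a `delta` parameter in milliseconds.
      def create
        Redis.new.rpush("syllabus:propagation_deltas", params[:delta].to_i)
        head :ok
      end
    end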
Then, once we had a means of collecting data, we needed to get the client running in as many browsers as possible. To do this, we used a technique known as dark launching. Specifically, we embedded a hidden iframe in the footer of various Vox Media sites for a brief period of time, and then started publishing updates in Syllabus. Within about 15 minutes, we had gathered 20,000 data points. Here’s a histogram of that data:
From this test, we were excited to discover that 97.26% of all updates were received by clients within five seconds of publication. Given this knowledge, we were confident that S3 would be able to deliver on both uptime and speed of delivery.
Syllabus in the Wild
The first use of Syllabus was the Microsoft E3 keynote on June 4th, which provided a great testing ground before the gauntlet of WWDC. 23,000+ unique users viewed the blog, and everything went smoothly. Since we knew traffic wouldn’t be crazy for this event, we left our delay reporting in place. We collected significantly more data points than before and still saw a satisfactory distribution, leaving us optimistic about WWDC.
Fortunately, WWDC went as smoothly as we could have imagined. Load on our CMS servers was low, graphs looked good, and JSON publishing and distribution went off without a hitch. The future of Syllabus is bright, and we hope to continue to improve it while bringing more liveblogging goodness to all of our Chorus-powered sites.