Big Data in Real-Time
at Twitter
by Nick Kallen (@nk)
Follow along
http://www.slideshare.net/nkallen/qcon
What is Real-Time Data?
• On-line queries for a single web request
• Off-line computations with very low latency
• Latency and throughput are equally important
• Not talking about Hadoop and other high-latency,
Big Data tools
The four data problems
• Tweets
• Timelines
• Social graphs
• Search indices
Big Data in Real-Time at Twitter
What is a Tweet?
• 140 character message, plus some metadata
• Query patterns:
  • by id
  • by author
  • (also @replies, but not discussed here)
• Row Storage
Find by primary key: 4376167936
Find all by user_id: 749863
Original Implementation
   id   user_id   text                                  created_at
   20   12        just setting up my twttr              2006-03-21 20:50:14
   29   12        inviting coworkers                    2006-03-21 21:02:56
   34   16        Oh shit, I just twittered a little.   2006-03-21 21:08:09

• Relational
• Single table, vertically scaled
• Master-Slave replication and Memcached for read throughput (read path sketched below)
Original Implementation
 Master-Slave Replication   Memcached for reads
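A minimal sketch of this read path (illustrative Python, not Twitter's code): check Memcached first, fall back to a MySQL replica on a miss, and fill the cache on the way out. The cache class, TTL, and db_lookup parameter are assumptions for illustration.

    CACHE_TTL_SECONDS = 300    # assumed TTL; the real value is not stated

    class DictCache:
        """In-memory stand-in for a Memcached client (get/set only)."""
        def __init__(self):
            self.data = {}
        def get(self, key):
            return self.data.get(key)
        def set(self, key, value, ttl=None):
            self.data[key] = value

    def get_tweet(tweet_id, cache, db_lookup):
        """Read-through cache: serve from Memcached, fall back to a MySQL replica."""
        key = "tweet:%d" % tweet_id
        tweet = cache.get(key)                       # ~1ms when the row is hot
        if tweet is None:
            tweet = db_lookup(tweet_id)              # primary-key lookup on a replica
            if tweet is not None:
                cache.set(key, tweet, CACHE_TTL_SECONDS)
        return tweet

    print(get_tweet(20, DictCache(), {20: "just setting up my twttr"}.get))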
Problems w/ solution
• Disk space: did not want to support disk arrays larger
than 800GB
• At 2,954,291,678 tweets, disk was over 90% utilized.
PARTITION
Possible implementations

Partition by primary key

   Partition 1            Partition 2

   id      user_id        id      user_id
   20      ...            21      ...
   22      ...            23      ...
   24      ...            25      ...

Finding recent tweets by user_id queries N partitions
Possible implementations

Partition by user id

   Partition 1            Partition 2

   id      user_id        id      user_id
   ...     1              21      2
   ...     1              23      2
   ...     3              25      2

Finding a tweet by id queries N partitions
Current Implementation

Partition by time

   Partition 2      id   user_id
                    24   ...
                    23   ...

   Partition 1      id   user_id
                    22   ...
                    21   ...

Queries try each partition in order until enough data is accumulated (sketched below)
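A sketch of that query strategy (illustrative Python; partitions are modeled as in-memory lists ordered newest first, and all names are made up):

    def recent_tweets_by_user(user_id, partitions, limit=20):
        """Scan temporal partitions newest-first until `limit` tweets are found.
        Because new tweets are requested most often, the first partition
        usually suffices."""
        results = []
        for partition in partitions:              # ordered newest -> oldest
            for tweet in partition:
                if tweet["user_id"] == user_id:
                    results.append(tweet)
                    if len(results) == limit:
                        return results
        return results

    partition_2 = [{"id": 24, "user_id": 16}, {"id": 23, "user_id": 12}]
    partition_1 = [{"id": 22, "user_id": 12}, {"id": 21, "user_id": 16}]
    print(recent_tweets_by_user(12, [partition_2, partition_1], limit=2))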
LOCALITY
Low Latency
                PK Lookup
   Memcached    1ms
   MySQL        <10ms*

* Depends on the number of partitions searched
Principles
• Partition and index
• Exploit locality (in this case, temporal locality)
  • New tweets are requested most frequently, so
usually only 1 partition is checked
Problems w/ solution
• Write throughput
• Have encountered deadlocks in MySQL at crazy
tweet velocity
• Creating a new temporal shard is a manual process
and takes too long; it involves setting up a parallel
replication hierarchy. Our DBA hates us
Future solution
   Partition k1              Partition k2

   id      user_id           id      user_id
   20      ...               21      ...
   22      ...               23      ...

   Partition u1              Partition u2

   user_id   ids             user_id   ids
   12        20, 21, ...     13        48, 27, ...
   14        25, 32, ...     15        23, 51, ...

• Cassandra (non-relational)
• Primary Key partitioning
• Manual secondary index on user_id (write path sketched below)
• Memcached for 90+% of reads
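A schematic of the write path for this design, with plain dicts standing in for the two column families; this is not the Cassandra client API, just the shape of a manually maintained secondary index:

    # Dicts stand in for two column families; not an actual Cassandra client.
    tweets_by_id = {}         # primary partition: tweet id -> row
    tweet_ids_by_user = {}    # manual secondary index: user_id -> [tweet ids]

    def store_tweet(tweet_id, user_id, text):
        # Write the primary row, partitioned by tweet id ...
        tweets_by_id[tweet_id] = {"user_id": user_id, "text": text}
        # ... and maintain the secondary index by hand.
        tweet_ids_by_user.setdefault(user_id, []).append(tweet_id)

    def tweets_for_user(user_id):
        return [tweets_by_id[i] for i in tweet_ids_by_user.get(user_id, [])]

    store_tweet(20, 12, "just setting up my twttr")
    store_tweet(29, 12, "inviting coworkers")
    print(len(tweets_for_user(12)))    # 2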
The four data problems
• Tweets
• Timelines
• Social graphs
• Search indices
Big Data in Real-Time at Twitter
What is a Timeline?
• Sequence of tweet ids
• Query pattern: get by user_id
• Operations:
  • append
  • merge
  • truncate
• High-velocity bounded vector
• Space-based (in-place mutation)
Tweets from 3 different people
Original Implementation

    SELECT * FROM tweets
    WHERE user_id IN
      (SELECT source_id
       FROM followers
       WHERE destination_id = ?)
    ORDER BY created_at DESC
    LIMIT 20

Crazy slow if you have lots of friends or indices can’t be kept in RAM
OFF-LINE VS.
ONLINE
COMPUTATION
Current Implementation

• Sequences stored in Memcached
• Fanout off-line, but has a low latency SLA
• Truncate at random intervals to ensure bounded length
• On cache miss, merge user timelines (both paths sketched below)
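A sketch of the fanout and cache-miss paths (illustrative Python; a dict stands in for Memcached, and the truncation bound and 10% truncation rate are assumed values):

    import heapq
    import random
    from itertools import islice

    MAX_TIMELINE_LENGTH = 800       # assumed bound; the real value is not stated

    timelines = {}                  # stand-in for Memcached: user_id -> [tweet ids, newest first]

    def fanout(tweet_id, follower_ids):
        """Off-line delivery: prepend the new tweet id to each follower's cached timeline."""
        for follower_id in follower_ids:
            timeline = timelines.setdefault(follower_id, [])
            timeline.insert(0, tweet_id)
            # Truncate only occasionally, not on every append, to keep the vector bounded.
            if len(timeline) > MAX_TIMELINE_LENGTH and random.random() < 0.1:
                del timeline[MAX_TIMELINE_LENGTH:]

    def rebuild_timeline(user_id, followed_timelines):
        """Cache miss: merge the (newest-first) timelines of every followed user."""
        merged = heapq.merge(*followed_timelines, reverse=True)
        timelines[user_id] = list(islice(merged, MAX_TIMELINE_LENGTH))
        return timelines[user_id]

    fanout(42, follower_ids=[1, 2, 3])
    print(timelines[1])                              # [42]
    print(rebuild_timeline(9, [[42, 7], [40, 5]]))   # [42, 40, 7, 5]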
Throughput Statistics

   date        average tps   peak tps   fanout ratio   deliveries
   10/7/2008   30            120        175:1          21,000
   4/15/2010   700           2,000      600:1          1,200,000

1.2m deliveries per second
MEMORY
HIERARCHY
Possible implementations

• Fanout to disk
  • Ridonculous number of IOPS required, even with fancy buffering techniques
  • Cost of rebuilding data from other durable stores not too expensive
• Fanout to memory
  • Good if cardinality of corpus * bytes/datum is not too many GB (back-of-envelope below)
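A back-of-envelope version of that sizing check. The ~3.2 KB per timeline comes from the summary table at the end of the deck; the number of cached timelines is a made-up illustration, so only the arithmetic is the point:

    bytes_per_timeline = 3_200         # ~3.2 KB/item for timelines (summary table below)
    cached_timelines = 150_000_000     # hypothetical number of hot timelines
    total_gb = bytes_per_timeline * cached_timelines / 1e9
    print("~%d GB of RAM for fanout-to-memory" % total_gb)   # ~480 GB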
Low Latency

   get    append   fanout
   1ms    1ms      <1s*

* Depends on the number of followers of the tweeter
Principles
• Off-line vs. Online computation
  • The answer to some problems can be pre-computed if the amount of work is bounded and the query pattern is very limited
• Keep the memory hierarchy in mind
• The efficiency of a system includes the cost of generating data from another source (such as a backup) times the probability of needing to do so
The four data problems
• Tweets
• Timelines
• Social graphs
• Search indices
Big Data in Real-Time at Twitter
What is a Social Graph?
• List of who follows whom, who blocks whom, etc.
• Operations:
  • Enumerate by time
  • Intersection, Union, Difference
  • Inclusion
  • Cardinality
  • Mass-deletes for spam
• Medium-velocity unbounded vectors
• Complex, predetermined queries
Temporal enumeration
Inclusion
Cardinality
Intersection: Deliver to people who follow both @aplusk and @foursquare
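Delivery-time intersection boils down to intersecting sorted follower lists; a two-pointer sketch with made-up ids (not the production code):

    def intersect_sorted(a, b):
        """Intersect two ascending lists of user ids in O(len(a) + len(b))."""
        i = j = 0
        out = []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out

    followers_of_aplusk = [3, 7, 12, 16, 20, 31]      # hypothetical ids
    followers_of_foursquare = [7, 9, 12, 20, 44]
    print(intersect_sorted(followers_of_aplusk, followers_of_foursquare))   # [7, 12, 20]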
Original Implementation

   source_id   destination_id
   20          12
   29          12
   34          16

• Single table, vertically scaled
• Master-Slave replication
• Indexed on both source_id and destination_id
Problems w/ solution
• Write throughput
• Indices couldn’t be kept in RAM
Current solution

   Forward                                             Backward

   source_id   destination_id   updated_at   x         destination_id   source_id   updated_at   x
   20          12               20:50:14     x         12               20          20:50:14     x
   20          13               20:51:32               12               32          20:51:32
   20          16                                      12               16

• Partitioned by user id
• Edges stored in “forward” and “backward” directions (write path sketched below)
• Indexed by time
• Indexed by element (for set algebra)
• Denormalized cardinality
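A sketch of the double write described above, with dicts standing in for the partitioned forward and backward stores and for the denormalized counts (names are illustrative, not FlockDB's code):

    from collections import defaultdict

    forward = defaultdict(dict)         # source_id -> {destination_id: updated_at}
    backward = defaultdict(dict)        # destination_id -> {source_id: updated_at}
    follower_count = defaultdict(int)   # denormalized cardinality

    def add_edge(source_id, destination_id, updated_at):
        """Store the edge in both directions so either endpoint can enumerate locally."""
        forward[source_id][destination_id] = updated_at
        if source_id not in backward[destination_id]:
            follower_count[destination_id] += 1      # keep cardinality denormalized
        backward[destination_id][source_id] = updated_at

    add_edge(20, 12, "20:50:14")
    add_edge(32, 12, "20:51:32")
    print(sorted(backward[12]))        # followers of user 12 -> [20, 32]
    print(follower_count[12])          # 2, answered without scanning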
Challenges
• Data consistency in the presence of failures
• Write operations are idempotent: retry until success
• Last-Write Wins for edges
  • (with an ordering relation on State for time conflicts; see the sketch below)
• Other commutative strategies for mass-writes
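One way to read those rules, as a sketch; the state names and their ranking below are assumptions used only to illustrate tie-breaking:

    # Hypothetical ordering on edge state, used only to break exact-time ties.
    STATE_RANK = {"normal": 0, "removed": 1, "archived": 2}

    def resolve(edge_a, edge_b):
        """Last-Write-Wins on updated_at, with the state ordering breaking ties.
        Replaying the same write resolves to the same edge, so retries are safe."""
        key_a = (edge_a["updated_at"], STATE_RANK[edge_a["state"]])
        key_b = (edge_b["updated_at"], STATE_RANK[edge_b["state"]])
        return edge_a if key_a >= key_b else edge_b

    a = {"updated_at": 100, "state": "normal"}
    b = {"updated_at": 100, "state": "removed"}
    print(resolve(a, b))   # same timestamp: the 'removed' write wins the tie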
Low Latency

   cardinality   iteration        write ack   write materialize   inclusion
   1ms           100 edges/ms*    1ms         16ms                1ms

* 2ms lower bound
Principles
• It is not possible to pre-compute set algebra queries
• Simple distributed coordination techniques work
• Partition, replicate, index. Many efficiency and
scalability problems are solved the same way
The four data problems
• Tweets
• Timelines
• Social graphs
• Search indices
Big Data in Real-Time at Twitter
What is a Search Index?
   • “Find me all tweets with these words in it...”
   • Posting list
   • Boolean and/or queries
   • Complex, ad hoc queries
   • Relevance is recency*



* Note: there is a non-real-time component to search, but it is not discussed here
Intersection of three posting lists
Original Implementation
   term_id   doc_id
   20        12
   20        86
   34        16

• Single table, vertically scaled
• Master-Slave replication for read throughput
Problems w/ solution
• Index could not be kept in memory
Current Implementation
   Partition 2      term_id   doc_id
                    24        ...
                    23        ...

   Partition 1      term_id   doc_id
                    22        ...
                    21        ...

• Partitioned by time
• Uses MySQL
• Uses delayed key-write
Problems w/ solution
• Write throughput
• Queries for rare terms need to search many
partitions
• Space efficiency/recall
  • MySQL requires lots of memory
DATA NEAR
COMPUTATION
Future solution

• Document partitioning
• Time partitioning too
• Merge layer (sketched below)
• May use Lucene instead of MySQL
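A sketch of the merge layer: fan the query out to every document partition, then merge the per-partition hits by recency (larger id = newer) and keep the top results. The partition interface here is hypothetical, not Lucene's or MySQL's API:

    import heapq
    from itertools import islice

    class FakePartition:
        """Stand-in for one document partition's inverted index."""
        def __init__(self, postings):
            self.postings = postings                 # term -> [doc ids, newest first]
        def search(self, term):
            return self.postings.get(term, [])

    def search(term, partitions, limit=20):
        """Scatter the query to every document partition, then merge the hits
        by recency (ids descending) and keep the first `limit`."""
        hits = [p.search(term) for p in partitions]
        return list(islice(heapq.merge(*hits, reverse=True), limit))

    p1 = FakePartition({"qcon": [40, 31, 20]})
    p2 = FakePartition({"qcon": [38, 25]})
    print(search("qcon", [p1, p2]))                  # [40, 38, 31, 25, 20]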
Principles
• Partition so that work can be parallelized
• Temporal locality is not always enough
The four data problems
• Tweets
• Timelines
• Social graphs
• Search indices
Summary Statistics

               reads/second   writes/second   cardinality   bytes/item   durability
   Tweets      100k           850             12b           300b         durable
   Timelines   80k            1.2m            a lot         3.2k         non-durable
   Graphs      100k           20k             13b           110          durable
   Search      13k            21k†            315m‡         1k           durable

† tps * 25 postings
‡ 75 partitions * 4.2m tweets
Principles
• All engineering solutions are transient
• Nothing’s perfect but some solutions are good enough
for a while
• Scalability solutions aren’t magic. They involve
partitioning, indexing, and replication
• All data for real-time queries MUST be in memory.
Disk is for writes only.
• Some problems can be solved with pre-computation,
but a lot can’t
• Exploit locality where possible
Appendix
