Big Data in Real-Time
at Twitter
by Nick Kallen (@nk)
Follow along
http://www.slideshare.net/nkallen/qcon
What is Real-Time Data?
• On-line queries for a single web request
• Off-line computations with very low latency
• Latency and throughput are equally important
• Not talking about Hadoop and other high-latency,
Big Data tools
The four data problems
• Tweets
• Timelines
• Social graphs
• Search indices
What is a Tweet?
• 140-character message, plus some metadata
• Query patterns:
  • by id
  • by author
  • (also @replies, but not discussed here)
• Row Storage
Find by primary key: 4376167936
Find all by user_id: 749863
Original Implementation
id      user_id                  text                       created_at

20        12          just setting up my twttr          2006-03-21 20:50:14

29        12             inviting coworkers             2006-03-21 21:02:56

34        16      Oh shit, I just twittered a little.   2006-03-21 21:08:09



     • Relational
     • Single table, vertically scaled
     • Master-Slave replication and Memcached for
     read throughput.
Original Implementation
 Master-Slave Replication   Memcached for reads
Problems w/ solution
• Disk space: did not want to support disk arrays larger
than 800GB
• At 2,954,291,678 tweets, disk was over 90% utilized.
PARTITION
Possible implementations
           Partition by primary key
        Partition 1               Partition 2

   id         user_id        id         user_id

   20            ...         21            ...

   22            ...         23            ...

   24            ...         25            ...

Finding recent tweets by user_id queries N partitions
Possible implementations
                Partition by user id
         Partition 1               Partition 2

   id          user_id        id         user_id

   ...             1          21             2

   ...             1          23             2

   ...             3          25             2

Finding a tweet by id queries N partitions
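
The trade-off between the two schemes above can be sketched in Python. This is a minimal illustration, not Twitter's code: the modulo routing, `NUM_PARTITIONS`, the function names, and the in-memory lists standing in for databases are all assumptions made for the example.

```python
# Hypothetical sketch: compare partitioning tweets by id with partitioning by user_id.
NUM_PARTITIONS = 2  # assumed value, for illustration only


def partition_rows(rows, scheme):
    """Route each row to a partition using the column named by `scheme`."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for row in rows:
        partitions[row[scheme] % NUM_PARTITIONS].append(row)
    return partitions


def partitions_to_query(scheme, lookup_key, value):
    """Return the partition numbers a lookup has to touch."""
    if scheme == lookup_key:
        return [value % NUM_PARTITIONS]   # routed straight to one partition
    return list(range(NUM_PARTITIONS))    # no routing information: ask all N


def find(partitions, scheme, lookup_key, value):
    """Collect matching rows from every partition the lookup must touch."""
    hits = []
    for number in partitions_to_query(scheme, lookup_key, value):
        hits.extend(r for r in partitions[number] if r[lookup_key] == value)
    return hits


rows = [
    {"id": 20, "user_id": 12},
    {"id": 21, "user_id": 12},
    {"id": 22, "user_id": 16},
]
by_id = partition_rows(rows, "id")
by_user = partition_rows(rows, "user_id")
```

Whichever column routes the writes gets single-partition reads; the other query pattern has to scatter to every partition.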
Current Implementation
             Partition by time
                   id    user_id
     Partition 2   24      ...
                   23      ...

                   id    user_id
     Partition 1   22      ...
                   21      ...

Queries try each partition in order until enough data is accumulated
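
A rough Python sketch of the "try each partition in order" query follows. The data layout (a list of partitions, newest first, each holding `(tweet_id, user_id)` pairs newest first) and the function name are assumptions for illustration; the real system stores these rows in MySQL.

```python
# Hypothetical sketch of querying time-partitioned tweets, newest partition first.
def recent_tweets_by_user(partitions, user_id, limit):
    """Return (tweet_ids, partitions_searched) for a user's most recent tweets.

    Stops as soon as `limit` tweets have been accumulated, so requests for
    recent data usually touch only the newest partition.
    """
    results = []
    searched = 0
    for partition in partitions:
        searched += 1
        for tweet_id, owner in partition:
            if owner == user_id:
                results.append(tweet_id)
                if len(results) == limit:
                    return results, searched
    return results, searched


# Partition 2 (newest) comes first, then Partition 1.
partitions = [
    [(24, 12), (23, 16)],
    [(22, 12), (21, 12)],
]
```

With this sample data, asking for one tweet by user 12 stops after the first partition, while asking for three has to continue into the older one.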
LOCALITY
Low Latency
                   PK Lookup
   Memcached          1ms
   MySQL            <10ms*

   * Depends on the number of partitions searched
Principles
• Partition and index
• Exploit locality (in this case, temporal locality)
  • New tweets are requested most frequently, so
usually only 1 partition is checked
Problems w/ solution
• Write throughput
• Have encountered deadlocks in MySQL at crazy
tweet velocity
• Creating a new temporal shard is a manual process
and takes too long; it involves setting up a parallel
replication hierarchy. Our DBA hates us
Future solution
         Partition k1               Partition k2
    id          user_id        id          user_id
    20             ...         21             ...
    22             ...         23             ...
         Partition u1               Partition u2
  user_id         ids        user_id         ids
    12         20, 21, ...     13         48, 27, ...

    14         25, 32, ...     15         23, 51, ...


   • Cassandra (non-relational)
   • Primary Key partitioning
   • Manual secondary index on user_id
   • Memcached for 90+% of reads
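
The combination of primary-key partitioning and a manually maintained secondary index might look roughly like the sketch below. It uses plain Python dictionaries rather than any Cassandra API; `NUM_PARTITIONS`, the modulo routing, and all function names are assumptions for illustration.

```python
# Hypothetical sketch: tweets partitioned by id, plus a hand-maintained
# user_id -> tweet ids index that is partitioned by user_id.
NUM_PARTITIONS = 2  # assumed value

tweet_partitions = [{} for _ in range(NUM_PARTITIONS)]  # id -> tweet row
index_partitions = [{} for _ in range(NUM_PARTITIONS)]  # user_id -> [ids]


def store_tweet(tweet_id, user_id, text):
    """Write the row, then update the secondary index by hand."""
    row = {"id": tweet_id, "user_id": user_id, "text": text}
    tweet_partitions[tweet_id % NUM_PARTITIONS][tweet_id] = row
    ids = index_partitions[user_id % NUM_PARTITIONS].setdefault(user_id, [])
    ids.append(tweet_id)


def find_by_id(tweet_id):
    """Primary-key lookup: exactly one partition is consulted."""
    return tweet_partitions[tweet_id % NUM_PARTITIONS].get(tweet_id)


def find_ids_by_user(user_id):
    """Secondary-index lookup: also exactly one partition."""
    return list(index_partitions[user_id % NUM_PARTITIONS].get(user_id, []))


store_tweet(20, 12, "just setting up my twttr")
store_tweet(29, 12, "inviting coworkers")
store_tweet(34, 16, "Oh shit, I just twittered a little.")
```

The price of single-partition reads for both query patterns is that every write touches two places, and the application is responsible for keeping them in step.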
The four data problems
• Tweets
• Timelines
• Social graphs
• Search indices
What is a Timeline?
• Sequence of tweet ids
• Query pattern: get by user_id
• Operations:
  • append
  • merge
  • truncate
• High-velocity bounded vector
• Space-based (in-place mutation)
Tweets from 3
different people
Original Implementation
    SELECT * FROM tweets
    WHERE user_id IN
      (SELECT source_id
       FROM followers
       WHERE destination_id = ?)
    ORDER BY created_at DESC
    LIMIT 20

Crazy slow if you have lots of friends or indices can’t be kept in RAM
OFF-LINE VS. ONLINE COMPUTATION
Current Implementation



• Sequences stored in Memcached
• Fanout off-line, but has a low-latency SLA
• Truncate at random intervals to ensure bounded
length
• On cache miss, merge user timelines
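
The four points above can be sketched in Python as follows. A dictionary stands in for Memcached, and `MAX_LENGTH`, the truncation probability, and the function names are assumptions for illustration. The rebuild step also assumes tweet ids increase over time, so that merging by id yields chronological order.

```python
import heapq
import random

MAX_LENGTH = 3   # assumed bound on timeline length
timelines = {}   # stand-in for Memcached: user_id -> tweet ids, oldest first


def fanout(tweet_id, follower_ids):
    """Off-line step: append a new tweet id to every follower's timeline."""
    for follower in follower_ids:
        timelines.setdefault(follower, []).append(tweet_id)


def maybe_truncate(user_id, probability=0.1):
    """Truncate at random intervals so timelines stay bounded on average."""
    if user_id in timelines and random.random() < probability:
        timelines[user_id] = timelines[user_id][-MAX_LENGTH:]


def get_timeline(user_id, following, user_tweets):
    """Read path: serve from cache, or on a miss merge the followed users'
    own tweet lists (each sorted by id) and cache the newest MAX_LENGTH ids."""
    cached = timelines.get(user_id)
    if cached is not None:
        return cached
    merged = list(heapq.merge(*(user_tweets.get(u, []) for u in following)))
    timelines[user_id] = merged[-MAX_LENGTH:]
    return timelines[user_id]
```

Reads are a single cache lookup; the expensive work happens once per tweet at write time rather than once per page view.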
Throughput Statistics

   date        average tps   peak tps   fanout ratio   deliveries/second
   10/7/2008        30          120        175:1               21,000
   4/15/2010       700        2,000        600:1            1,200,000

1.2m deliveries per second
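
The deliveries figures appear to be the peak tweet rate multiplied by the fanout ratio. That reading is an inference from the numbers in the table, not something the slides state; the helper name below is hypothetical.

```python
def deliveries_per_second(peak_tps, fanout_ratio):
    """Timeline deliveries generated per second at peak load."""
    return peak_tps * fanout_ratio


# (date, peak tps, fanout ratio) taken from the table above
rows = [("10/7/2008", 120, 175), ("4/15/2010", 2000, 600)]
for date, peak_tps, fanout_ratio in rows:
    print(date, deliveries_per_second(peak_tps, fanout_ratio))
```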
MEMORY HIERARCHY
Possible implementations
• Fanout to disk
  • Ridonculous number of IOPS required, even with
fancy buffering techniques
  • Cost of rebuilding data from other durable stores not
too expensive
• Fanout to memory
 • Good if cardinality of corpus * bytes/datum not too
many GB
Low Latency

          get            append          fanout
          1ms             1ms             <1s*

   * Depends on the number of followers of the tweeter
Principles
• Off-line vs. Online computation
  • The answer to some problems can be pre-computed
if the amount of work is bounded and the query
pattern is very limited
• Keep the memory hierarchy in mind
• The efficiency of a system includes the cost of
generating data from another source (such as a
backup) times the probability of needing to do so
The four data problems
• Tweets
• Timelines
• Social graphs
• Search indices
What is a Social Graph?
• List of who follows whom, who blocks whom, etc.
• Operations:
  • Enumerate by time
  • Intersection, Union, Difference
  • Inclusion
  • Cardinality
  • Mass-deletes for spam
• Medium-velocity unbounded vectors
• Complex, predetermined queries
Temporal enumeration
Inclusion
Cardinality
Intersection: Deliver to people who follow both @aplusk and @foursquare
Original Implementation
      source_id                     destination_id
         20                              12
         29                              12
         34                              16
        Index                           Index

• Single table, vertically scaled
• Master-Slave replication
Problems w/ solution
• Write throughput
• Indices couldn’t be kept in RAM
Current solution
               Forward                                     Backward
source_id destination_id   updated_at   x   destination_id source_id   updated_at   x
   20           12          20:50:14    x        12           20        20:50:14    x
   20           13          20:51:32             12           32        20:51:32
   20           16                               12           16

        • Partitioned by user id
        • Edges stored in “forward” and “backward” directions
        • Indexed by time
        • Indexed by element (for set algebra)
        • Denormalized cardinality
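
A minimal in-memory sketch of an edge store with these properties is shown below. The class and method names are hypothetical, partitioning by user id is left out for brevity, and the `x` column from the table is not modelled.

```python
class EdgeStore:
    """Hypothetical sketch: edges kept in both directions with a stored count."""

    def __init__(self):
        self.forward = {}        # source_id -> {destination_id: updated_at}
        self.backward = {}       # destination_id -> {source_id: updated_at}
        self.forward_count = {}  # denormalized cardinality per source_id

    def add(self, source_id, destination_id, updated_at):
        row = self.forward.setdefault(source_id, {})
        if destination_id not in row:
            # Only count an edge the first time it is written.
            self.forward_count[source_id] = self.forward_count.get(source_id, 0) + 1
        row[destination_id] = updated_at
        self.backward.setdefault(destination_id, {})[source_id] = updated_at

    def includes(self, source_id, destination_id):
        """Inclusion: one lookup in the element index."""
        return destination_id in self.forward.get(source_id, {})

    def cardinality(self, source_id):
        """Cardinality: read the stored counter instead of counting rows."""
        return self.forward_count.get(source_id, 0)

    def newest_first(self, source_id):
        """Temporal enumeration: destinations ordered by updated_at, newest first."""
        row = self.forward.get(source_id, {})
        return sorted(row, key=row.get, reverse=True)

    def intersection(self, source_a, source_b):
        """Set algebra: destinations that both sources have an edge to."""
        return set(self.forward.get(source_a, {})) & set(self.forward.get(source_b, {}))
```

Storing each edge twice doubles the write cost, but it lets both "edges from X" and "edges to X" be answered from the partition that owns X.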
Challenges
• Data consistency in the presence of failures
• Write operations are idempotent: retry until success
• Last-Write Wins for edges
  • (with an ordering relation on State for time
conflicts)
• Other commutative strategies for mass-writes
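
One way to read the idempotent, last-write-wins strategy is sketched below. The state names, their ranking, and the integer timestamps are assumptions for illustration; the slides only say that an ordering relation on State resolves timestamp conflicts.

```python
from dataclasses import dataclass

# Assumed ordering on states, used only to break timestamp ties.
STATE_RANK = {"normal": 0, "removed": 1}


@dataclass(frozen=True)
class Edge:
    state: str
    updated_at: int


def merge(current, incoming):
    """Last write wins; on equal timestamps the higher-ranked state wins."""
    if current is None:
        return incoming
    return max(current, incoming, key=lambda e: (e.updated_at, STATE_RANK[e.state]))


def apply_write(store, key, incoming):
    """Idempotent and commutative: replays and reordering do not change the result."""
    store[key] = merge(store.get(key), incoming)
```

Because `merge` picks a winner from the pair without depending on arrival order, a failed write can simply be retried until it succeeds, and replicas that see the writes in different orders still converge.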
Low Latency

cardinality      iteration       write ack    write materialize    inclusion
    1ms        100 edges/ms*        1ms             16ms              1ms

   * 2ms lower bound
Principles
• It is not possible to pre-compute set algebra queries
• Simple distributed coordination techniques work
• Partition, replicate, index. Many efficiency and
scalability problems are solved the same way
The four data problems
• Tweets
• Timelines
• Social graphs
• Search indices
What is a Search Index?
   • “Find me all tweets with these words in them...”
   • Posting list
   • Boolean and/or queries
   • Complex, ad hoc queries
   • Relevance is recency*



* Note: there is a non-real-time component to search, but it is not discussed here
Intersection of
three posting lists
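
A boolean AND query over posting lists can be sketched as a sorted-list intersection. The sketch assumes each posting list holds doc ids in descending order, with larger ids being more recent, as a stand-in for "relevance is recency"; the function names and sample index are hypothetical.

```python
def intersect(a, b):
    """Intersect two posting lists that are sorted by doc id, descending."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] > b[j]:
            i += 1   # a is ahead (newer); advance it
        else:
            j += 1   # b is ahead (newer); advance it
    return out


def search_all(terms, index):
    """AND query: docs containing every term, newest first."""
    lists = sorted((index.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]            # start from the shortest list
    for other in lists[1:]:
        result = intersect(result, other)
    return result


index = {
    "big": [9, 7, 5, 2],
    "data": [9, 5, 4],
    "twitter": [8, 5, 2],
}
```

Starting from the shortest list keeps the intermediate result small, which matters when one term is far more common than the others.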
Original Implementation
      term_id                      doc_id

        20                          12

        20                          86

        34                          16



• Single table, vertically scaled
• Master-Slave replication for read throughput
Problems w/ solution
• Index could not be kept in memory
Current Implementation
                   term_id   doc_id
                     24        ...
     Partition 2
                     23        ...

                   term_id   doc_id
     Partition 1     22        ...
                     21        ...


       • Partitioned by time
       • Uses MySQL
       • Uses delayed key-write
Problems w/ solution
• Write throughput
• Queries for rare terms need to search many
partitions
• Space efficiency/recall
  • MySQL requires lots of memory
DATA NEAR COMPUTATION
Future solution




   • Document partitioning
   • Time partitioning too
   • Merge layer
   • May use Lucene instead of MySQL
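
A merge layer over document partitions might work roughly as sketched here: each partition answers the query from its own slice of the documents, and the merge layer combines the partial results. The layout (one term-to-doc-ids dictionary per partition, ids descending) and the function names are assumptions, not a description of the planned system.

```python
import heapq
from itertools import islice


def search_partition(partition_index, term, limit):
    """Each partition returns its own newest matches, newest first."""
    return partition_index.get(term, [])[:limit]


def merged_search(partitions, term, limit):
    """Merge layer: query every partition, then merge the sorted partial results."""
    partial = [search_partition(p, term, limit) for p in partitions]
    merged = heapq.merge(*partial, reverse=True)  # inputs are sorted descending
    return list(islice(merged, limit))


# Two document partitions, each indexing a disjoint set of doc ids.
partitions = [
    {"qcon": [9, 4]},
    {"qcon": [8, 7, 1]},
]
```

Every partition does a share of the work for every query, so the per-partition searches can run in parallel and only the final merge is serial.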
Principles
• Partition so that work can be parallelized
• Temporal locality is not always enough
The four data problems
• Tweets
• Timelines
• Social graphs
• Search indices
Summary Statistics
              reads/second   writes/second   cardinality   bytes/item   durability
 Tweets           100k            850            12b          300b       durable
 Timelines         80k           1.2m           a lot         3.2k       non
 Graphs           100k            20k            13b          110        durable
 Search            13k            21k†          315m‡          1k        durable

                       † tps * 25 postings
                       ‡ 75 partitions * 4.2m tweets
Principles
• All engineering solutions are transient
• Nothing’s perfect but some solutions are good enough
for a while
• Scalability solutions aren’t magic. They involve
partitioning, indexing, and replication
• All data for real-time queries MUST be in memory.
Disk is for writes only.
• Some problems can be solved with pre-computation,
but a lot can’t
• Exploit locality where possible
Appendix
