The Secrets of Building Realtime Big Data Systems

Nathan Marz
@nathanmarz
Who am I?
(Upcoming book)
BackType

• >30 TB of data
• Process 100M messages / day
• Serve 300 requests / sec
• 100 to 200 machine cluster
• 3 full-time employees, 2 interns
Built on open-source

• Thrift
• Cascading
• Scribe
• ZeroMQ
• Zookeeper
• Pallet
What is a data system?

Raw data -> View 1
Raw data -> View 2
Raw data -> View 3
What is a data system?

Tweets -> # Tweets / URL
Tweets -> Influence scores
Tweets -> Trending topics
Everything else (schemas, databases, indexing, etc.) is implementation
Essential properties of a data system
1. Robust to machine failure and human error
2. Low latency reads and updates
3. Scalable
4. General
5. Extensible
6. Allows ad-hoc analysis
7. Minimal maintenance
8. Debuggable
Layered Architecture

       Speed Layer




       Batch Layer
Let’s pretend temporarily that
update latency doesn’t matter
Let’s pretend it’s OK for a view to
         lag by a few hours
Batch layer

• Arbitrary computation
• Horizontally scalable
• High latency
Batch layer

Hadoop: not the end-all-be-all of batch computation, but the most general
Hadoop

Distributed filesystem (input files) -> MapReduce -> Distributed filesystem (output files)
Hadoop


• Express your computation in terms of
  MapReduce
• Get parallelism and scalability “for free”
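
As a rough illustration of what "expressing a computation in terms of MapReduce" means, here is a minimal single-process sketch in Python. It is not Hadoop's API: `map_reduce`, `url_mapper`, `count_reducer`, and the sample tweets are hypothetical names and data chosen for this example. It only shows the shape of the model (map, group by key, reduce) that a framework such as Hadoop can then run in parallel across machines.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy single-process MapReduce: map each record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical input: (tweet_id, url) pairs.
tweets = [
    (1, "http://a.example"),
    (2, "http://b.example"),
    (3, "http://a.example"),
]

def url_mapper(tweet):
    """Emit (url, 1) for every tweet."""
    _tweet_id, url = tweet
    yield url, 1

def count_reducer(_url, counts):
    """Sum the 1s emitted for a single URL."""
    return sum(counts)

tweets_per_url = map_reduce(tweets, url_mapper, count_reducer)
```

Because the mapper looks at one record at a time and the reducer at one key at a time, both phases can be split across machines without changing the user's code.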
Batch layer


• Store master copy of dataset
• Master dataset is append-only
Batch layer


view = fn(master dataset)
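
A minimal Python sketch of this idea, assuming a hypothetical master dataset of tweet facts and a hypothetical view function `tweets_per_url_view`. The point it tries to show is that the view is a pure function of the append-only master dataset, so it can always be thrown away and recomputed.

```python
def tweets_per_url_view(master_dataset):
    """Pure function of the master dataset: number of tweets per URL."""
    view = {}
    for fact in master_dataset:
        view[fact["url"]] = view.get(fact["url"], 0) + 1
    return view

# Hypothetical append-only master dataset: records are added, never modified.
master_dataset = [
    {"tweet_id": 1, "url": "http://a.example"},
    {"tweet_id": 2, "url": "http://b.example"},
]
view = tweets_per_url_view(master_dataset)

# New data is appended; the view is simply recomputed from scratch.
master_dataset.append({"tweet_id": 3, "url": "http://a.example"})
view = tweets_per_url_view(master_dataset)
```

If a bug in the view function produces a bad view, the fix is to correct the function and rerun it. The master dataset itself is untouched.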
Batch layer

Master dataset -> MapReduce -> Batch View 1
Master dataset -> MapReduce -> Batch View 2
Master dataset -> MapReduce -> Batch View 3
Batch layer

• In practice, too expensive to fully
  recompute each view to get updates
• A production batch workflow adds
  minimum amount of incrementalization
  necessary for performance
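
One possible shape of such incrementalization, sketched in Python under the assumption that the view is a simple per-URL count (the function names and sample URLs are hypothetical). The new records are appended to the master dataset and folded into the existing view, and the result matches what a full recomputation would give.

```python
from collections import Counter

def recompute_view(all_data):
    """Full recomputation: count tweets per URL over the entire master dataset."""
    return Counter(all_data)

def update_view(batch_view, new_data):
    """Incremental maintenance: fold only the new records into a copy of the view."""
    updated = Counter(batch_view)
    updated.update(new_data)
    return updated

all_data = ["http://a.example", "http://b.example"]
batch_view = recompute_view(all_data)

new_data = ["http://a.example", "http://c.example"]
all_data.extend(new_data)                       # master dataset stays append-only
batch_view = update_view(batch_view, new_data)  # touch only the new records
```

Counts are easy to update incrementally. Views that are harder to update may still need a periodic full recomputation from the master dataset.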
Incremental batch layer

[Diagram: new data enters the batch workflow, which appends to and queries the "All data" store; a view maintenance step then updates Batch View 1, Batch View 2, and Batch View 3.]
Batch layer
Robust and fault-tolerant to both machine
and human error.
Low latency reads.
Low latency updates.
Scalable to increases in data or traffic.
Extensible to support new features or related
services.
Generalizes to diverse types of data and requests.
Allows ad hoc queries.
Minimal maintenance.
Debuggable: can trace how any value in the
system came to be.
Speed layer


Compensate for high latency
 of updates to batch layer
Speed layer

Key point: Only needs to compensate for
  data not yet absorbed in serving layer

  Hours of data instead of years of data
Application-level Queries

  Batch Layer   Query

                        Merge

  Speed Layer   Query
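
A minimal sketch of the merge step in Python, assuming both layers expose a per-URL tweet count and that the two counts cover disjoint time ranges so they can be added. The function name `tweets_for_url` and the sample values are hypothetical.

```python
def tweets_for_url(url, batch_view, speed_view):
    """Answer an application query by merging the batch and speed layer results."""
    batch_count = batch_view.get(url, 0)  # everything absorbed by the last batch run
    speed_count = speed_view.get(url, 0)  # only the data that arrived since then
    return batch_count + speed_count

# Hypothetical views: the batch view covers years of data, the speed view a few hours.
batch_view = {"http://a.example": 1200, "http://b.example": 75}
speed_view = {"http://a.example": 4, "http://c.example": 2}

total = tweets_for_url("http://a.example", batch_view, speed_view)
```

For other kinds of views the merge function differs (for example a set union or a maximum), but the structure of the query stays the same.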
Speed layer

Once data is absorbed into batch layer, can
       discard speed layer results
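
One way this could look, sketched in Python under the assumption that speed layer results are bucketed by hour so that absorbed buckets can be dropped wholesale. The bucketing scheme and the name `discard_absorbed` are illustrative assumptions, not something the slides specify.

```python
def discard_absorbed(speed_view, absorbed_up_to):
    """Keep only the hour buckets the batch layer has not absorbed yet."""
    return {
        hour: counts
        for hour, counts in speed_view.items()
        if hour > absorbed_up_to
    }

# Hypothetical speed view: hour bucket -> per-URL counts for that hour.
speed_view = {
    10: {"http://a.example": 3},
    11: {"http://a.example": 1, "http://b.example": 2},
    12: {"http://c.example": 5},
}

# The latest batch run absorbed everything up to and including hour 11.
remaining = discard_absorbed(speed_view, absorbed_up_to=11)
```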
Speed layer
• Message passing
• Incremental algorithms
• Read/Write databases
    • Riak
    • Cassandra
    • HBase
    • etc.
Speed layer


Significantly more complex
    than the batch layer
Speed layer


But the batch layer eventually
  overrides the speed layer
Speed layer


So that complexity is transient
Flexibility in layered
      architecture
• Do slow and accurate algorithm in batch
  layer
• Do fast but approximate algorithm in speed
  layer
• “Eventual accuracy”
Data model

Every record is a single, discrete
   fact at a moment in time
Data model

• Alice lives in San Francisco as of time 12345
• Bob and Gary are friends as of time 13723
• Alice lives in New York as of time 19827
Data model

• Remember: master dataset is append-only
• A person can have multiple location
  records
• “Current location” is a view on this data:
  pick location with most recent timestamp
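
The "current location" view can be sketched in Python as follows, using the Alice facts from the earlier slide. The tuple layout, the function name `current_location`, and Bob's location record are assumptions made for this example.

```python
# Append-only location facts: (person, location, timestamp).
location_facts = [
    ("alice", "San Francisco", 12345),
    ("bob", "Chicago", 15000),        # hypothetical record, added for illustration
    ("alice", "New York", 19827),
]

def current_location(facts, person):
    """View over the facts: the location with the most recent timestamp."""
    records = [fact for fact in facts if fact[0] == person]
    if not records:
        return None
    _person, location, _timestamp = max(records, key=lambda fact: fact[2])
    return location

alice_now = current_location(location_facts, "alice")
```

The older San Francisco record is never deleted, so the full history stays available for analytics and for recovering from mistakes.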
Data model

• Extremely useful having the full history for
  each entity
    • Doing analytics
    • Recovering from mistakes (like writing
      bad data)
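Data model

[Graph diagram. Nodes: Alice, Bob, Tweet: 123, Tweet: 456. Properties: "Gender: female" on Alice; "Content: RT @bob Data is fun!" on Tweet: 123; "Content: Data is fun!" on Tweet: 456; "Reshare: true" on the Reaction edge. Edges: Reactor (Alice, Tweet: 123), Reactor (Bob, Tweet: 456), Reaction (Tweet: 123, Tweet: 456).]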
Questions?

   Twitter: @nathanmarz

Email: nathan.marz@gmail.com

 Web: http://nathanmarz.com
