The architectural principles behind building systems that scale to vast amounts of data and operate on that data in realtime.
Presented at POSSCON '11.
The Secrets of Building Realtime Big Data Systems
1. The Secrets of Building Realtime Big Data Systems
Nathan Marz
@nathanmarz
32. Batch layer
• In practice, it is too expensive to fully recompute each view to get updates
• A production batch workflow adds the minimum amount of incrementalization necessary for performance
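A minimal sketch of the trade-off on this slide, assuming a hypothetical pageview-count view. The function names `recompute_view` and `update_view` and the sample records are illustrative and do not come from the talk. The full recompute scans every record, while the incremental update folds in only the new ones; both should yield the same view.

```python
from collections import Counter

def recompute_view(all_data):
    """Full recompute: rebuild the view by scanning every record."""
    return Counter(url for url, _ in all_data)

def update_view(view, new_data):
    """Incremental update: fold only the new records into a copy of the view."""
    updated = Counter(view)
    updated.update(url for url, _ in new_data)
    return updated

# Hypothetical (url, timestamp) pageview records.
all_data = [("/home", 1), ("/about", 2), ("/home", 3)]
view = recompute_view(all_data)

new_data = [("/home", 4), ("/docs", 5)]
incremental_view = update_view(view, new_data)

# New data is appended to the full dataset; nothing is overwritten.
all_data = all_data + new_data
full_view = recompute_view(all_data)
```

The incremental path touches two records instead of five here. Keeping the full recompute available is what allows the incremental result to be checked or rebuilt.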
33. Incremental batch layer
[Diagram. Labels: New data, Append, All data, Query, Batch view maintenance workflow, Batch View 1, Batch View 2, Batch View 3]
34. Batch layer
Robust and fault-tolerant to both machine and human error.
Low latency reads.
Low latency updates.
Scalable to increases in data or traffic.
Extensible to support new features or related services.
Generalizes to diverse types of data and requests.
Allows ad hoc queries.
Minimal maintenance.
Debuggable: can trace how any value in the system came to be.
44. Flexibility in layered architecture
• Run a slow but accurate algorithm in the batch layer
• Run a fast but approximate algorithm in the speed layer
• “Eventual accuracy”
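One way to picture "eventual accuracy" is a unique-visitor count. This is a sketch under stated assumptions, not the method from the talk: the batch layer deduplicates exactly, while a hypothetical `SpeedLayerCounter` simply counts every recent event and therefore overcounts repeat visitors. All names and sample data are invented for illustration.

```python
def batch_unique_visitors(events):
    """Slow and accurate: exact distinct count over everything the batch layer has seen."""
    return len(set(events))

class SpeedLayerCounter:
    """Fast but approximate: counts every recent event without deduplicating."""

    def __init__(self):
        self.count = 0

    def record(self, visitor):
        self.count += 1

    def reset(self):
        self.count = 0

def query(batch_count, speed):
    """Answer a query by combining the batch view with the speed layer's estimate."""
    return batch_count + speed.count

master = ["alice", "bob", "alice"]
batch_count = batch_unique_visitors(master)

speed = SpeedLayerCounter()
recent = ["bob", "carol"]
for visitor in recent:
    speed.record(visitor)

# "bob" is counted twice, so this answer is too high by one.
approximate = query(batch_count, speed)

# The next batch run absorbs the recent events and the speed state is discarded.
master = master + recent
batch_count = batch_unique_visitors(master)
speed.reset()
corrected = query(batch_count, speed)
```

The approximate answer is wrong only for the window the batch layer has not yet processed; once the batch run catches up, the exact result replaces it.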
46. Data model
• Alice lives in San Francisco as of time 12345
• Bob and Gary are friends as of time 13723
• Alice lives in New York as of time 19827
47. Data model
• Remember: master dataset is append-only
• A person can have multiple location records
• “Current location” is a view on this data: pick location with most recent timestamp
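The "current location" view can be sketched as follows, using the location facts from the earlier data model slide. The `LocationRecord` type, the in-memory list standing in for the master dataset, and the function names are assumptions made for illustration only.

```python
from typing import NamedTuple, Optional

class LocationRecord(NamedTuple):
    person: str
    location: str
    timestamp: int

# Stand-in for the master dataset: records are only ever appended.
master_dataset = []

def append(record):
    """Add a new fact; existing records are never updated or deleted."""
    master_dataset.append(record)

def current_location(person) -> Optional[str]:
    """View: pick the location with the most recent timestamp."""
    records = [r for r in master_dataset if r.person == person]
    if not records:
        return None
    return max(records, key=lambda r: r.timestamp).location

append(LocationRecord("Alice", "San Francisco", 12345))
append(LocationRecord("Alice", "New York", 19827))

result = current_location("Alice")
```

Both of Alice's records remain in the dataset; the move to New York is a new fact, and only the view decides which one counts as current.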
48. Data model
• Having the full history for each entity is extremely useful for:
  • Doing analytics
  • Recovering from mistakes (like writing bad data)
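A small sketch of both uses, assuming records are plain `(person, location, timestamp)` tuples. The "Atlantis" record and its timestamp are a made-up bad write, and filtering it out before recomputing is one possible recovery strategy, not necessarily the one used in the talk.

```python
records = [
    ("Alice", "San Francisco", 12345),
    ("Alice", "New York", 19827),
    ("Alice", "Atlantis", 20000),  # hypothetical bad write
]

def location_history(records, person):
    """Analytics: every location recorded for a person, oldest first."""
    ordered = sorted(records, key=lambda r: r[2])
    return [location for p, location, _ in ordered if p == person]

def current_locations(records):
    """View: latest location per person, rebuilt from the full history."""
    latest = {}
    for person, location, _ in sorted(records, key=lambda r: r[2]):
        latest[person] = location  # later timestamps overwrite earlier ones
    return latest

# The view built from all records reflects the bad write.
corrupted_view = current_locations(records)

# Recovery: exclude the records known to be bad and recompute the view.
bad = {("Alice", "Atlantis", 20000)}
good_records = [r for r in records if r not in bad]
repaired_view = current_locations(good_records)
history = location_history(good_records, "Alice")
```

Because the earlier records were never overwritten, the correct view can be rebuilt from them once the bad record is identified.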
49. Data model
[Diagram: graph schema. Nodes: Alice, Bob, Tweet: 123, Tweet: 456. Edges: Reactor, Reactor, Reaction. Properties: Gender: female; Reshare: true; Content: RT @bob Data is fun!; Content: Data is fun!]