The Secrets of Building Realtime Big Data Systems

Nathan Marz
@nathanmarz

BackType

• >30 TB of data
• Process 100M messages / day
• Serve 300 requests / sec
• 100 to 200 machine cluster
• 3 full-time employees, 2 interns

Built on open-source
• Thrift
• Cascading
• Scribe
• ZeroMQ
• ZooKeeper
• Pallet

What is a data system?

Raw data -> View 1, View 2, View 3

What is a data system?

Tweets -> # Tweets / URL, Influence scores, Trending topics

Everything else (schemas, databases,
indexing, etc.) is implementation

Essential properties of
a data system

1. Robust to machine failure and human error

2. Low latency reads and updates

6. Allows ad-hoc analysis

Layered Architecture

Speed Layer

Batch Layer

Let’s pretend temporarily that
update latency doesn’t matter

Let’s pretend it’s OK for a view to
lag by a few hours

Batch layer

• Arbitrary computation
• Horizontally scalable
• High latency

Batch layer

Hadoop is not the be-all and end-all of batch
computation, but it is the most general

Hadoop

Distributed filesystem (input files) -> MapReduce -> Distributed filesystem (output files)

Hadoop

• Express your computation in terms of
MapReduce
• Get parallelism and scalability “for free”
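
To make "express your computation in terms of MapReduce" concrete, here is a minimal single-process sketch in Python. It is not Hadoop's API: `map_reduce`, `url_mapper`, `count_reducer`, and the sample tweets are hypothetical names and data chosen for illustration. On a real cluster the same mapper and reducer logic would run in parallel across machines.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Apply mapper to each record, group emitted values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

def url_mapper(tweet):
    # Emit (url, 1) once for every URL mentioned in a tweet (hypothetical schema).
    for url in tweet["urls"]:
        yield url, 1

def count_reducer(url, counts):
    # Sum the ones emitted for a single URL.
    return sum(counts)

# Hypothetical sample input.
tweets = [
    {"id": 1, "urls": ["http://a.example"]},
    {"id": 2, "urls": ["http://a.example", "http://b.example"]},
    {"id": 3, "urls": []},
]

tweets_per_url = map_reduce(tweets, url_mapper, count_reducer)
print(tweets_per_url)
```

Because the mapper looks at one record at a time and the reducer at one key at a time, the work can be split across machines without changing either function.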

Batch layer

• Store master copy of dataset
• Master dataset is append-only

Batch layer

view = fn(master dataset)
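
A minimal sketch of "view = fn(master dataset)" in Python, assuming a hypothetical record schema with a `topics` field. The function `topic_counts_view` is an illustrative name, not something from the talk. The point is that the view is recomputed from the full dataset with no state carried between runs.

```python
from collections import Counter

def topic_counts_view(master_dataset):
    """Pure function: recompute the whole view from the full master dataset."""
    counts = Counter()
    for record in master_dataset:
        counts.update(record["topics"])
    return dict(counts)

# Hypothetical append-only master dataset.
master_dataset = [
    {"tweet_id": 1, "topics": ["hadoop"]},
    {"tweet_id": 2, "topics": ["hadoop", "thrift"]},
]
view = topic_counts_view(master_dataset)

# New data is only ever appended; the view is refreshed by rerunning the function.
master_dataset.append({"tweet_id": 3, "topics": ["thrift"]})
refreshed_view = topic_counts_view(master_dataset)
```

If a bug in the function or bad data corrupts a view, fixing the cause and rerunning the function over the master dataset rebuilds the view.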

Batch layer

Master dataset -> MapReduce -> Batch View 1
Master dataset -> MapReduce -> Batch View 2
Master dataset -> MapReduce -> Batch View 3

Batch layer

• In practice, it is too expensive to fully
recompute each view to get updates
• A production batch workflow adds the
minimum amount of incrementalization
necessary for performance
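
As an illustration of incrementalization, the sketch below folds only the new records into an existing view and checks that the result matches a full recompute. It assumes a view that is a simple count per URL, which happens to be easy to update incrementally. The function names and sample data are hypothetical.

```python
from collections import Counter

def full_recompute(all_data):
    """Rebuild the view from every record in the master dataset."""
    return Counter(record["url"] for record in all_data)

def incremental_update(batch_view, new_data):
    """Fold only the new records into a copy of the existing view."""
    updated = Counter(batch_view)
    updated.update(record["url"] for record in new_data)
    return updated

# Hypothetical data: what the last batch run saw, and what arrived since.
all_data = [{"url": "http://a.example"}, {"url": "http://b.example"}]
new_data = [{"url": "http://a.example"}]

batch_view = full_recompute(all_data)
batch_view = incremental_update(batch_view, new_data)

# The master dataset stays append-only; the incremental result matches a full recompute.
all_data.extend(new_data)
matches_full_recompute = batch_view == full_recompute(all_data)
```

Not every view can be updated this cheaply, which is one reason to add only as much incrementalization as performance requires.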

Incremental batch layer

Diagram: new data and all data feed the batch view maintenance workflow (query, append), which produces Batch View 1, Batch View 2, and Batch View 3.

Batch layer
• Robust and fault-tolerant to both machine
and human error
• Low latency reads
• Low latency updates
• Scalable to increases in data or traffic
• Extensible to support new features or related
services
• Generalizes to diverse types of data and requests
• Allows ad hoc queries
• Minimal maintenance
• Debuggable: can trace how any value in the
system came to be

Speed layer

Compensate for high latency
of updates to batch layer

Speed layer

Key point: Only needs to compensate for
data not yet absorbed in serving layer

Hours of data instead of years of data

Application-level Queries

Batch Layer Query + Speed Layer Query -> Merge
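
The merge of a batch layer query and a speed layer query can be sketched as follows, assuming both views store tweet counts per URL and that the speed view covers only data the batch view has not yet absorbed. `query_tweet_count` and the sample numbers are hypothetical.

```python
def query_tweet_count(url, batch_view, speed_view):
    """Answer an application query by merging the batch and speed results."""
    return batch_view.get(url, 0) + speed_view.get(url, 0)

# Hypothetical views: years of data in the batch view, hours in the speed view.
batch_view = {"http://a.example": 1200}
speed_view = {"http://a.example": 7, "http://b.example": 2}

total_a = query_tweet_count("http://a.example", batch_view, speed_view)
total_b = query_tweet_count("http://b.example", batch_view, speed_view)
total_c = query_tweet_count("http://c.example", batch_view, speed_view)
```

For counts the merge is addition. Other views need a merge function suited to their data, such as a union for sets.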

Speed layer

Once data is absorbed into batch layer, can
discard speed layer results
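
One way to picture discarding speed layer results is a sketch in which the speed view is kept in time buckets and every bucket up to the last absorbed point is dropped. The hour-numbered buckets and the name `discard_absorbed` are assumptions for illustration, not a design described in the talk.

```python
def discard_absorbed(speed_view, absorbed_through):
    """Keep only the speed-layer buckets newer than the last batch run."""
    return {
        bucket: counts
        for bucket, counts in speed_view.items()
        if bucket > absorbed_through
    }

# Hypothetical speed view keyed by hour number.
speed_view = {
    10: {"http://a.example": 3},
    11: {"http://a.example": 1},
    12: {"http://b.example": 4},
}

# The batch layer has now absorbed everything through hour 11.
remaining = discard_absorbed(speed_view, absorbed_through=11)
```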

Speed layer
• Message passing
• Incremental algorithms
• Read/Write databases
• Riak
• Cassandra
• HBase
• etc.

Speed layer

Significantly more complex
than the batch layer

Speed layer

But the batch layer eventually
overrides the speed layer

Speed layer

So that complexity is transient

Flexibility in layered
architecture
• Run a slow but accurate algorithm in the batch
layer
• Run a fast but approximate algorithm in the speed
layer
• “Eventual accuracy”

Data model

Every record is a single, discrete
fact at a moment in time

Data model

• Alice lives in San Francisco as of time 12345
• Bob and Gary are friends as of time 13723
• Alice lives in New York as of time 19827

Data model

• Remember: master dataset is append-only
• A person can have multiple location
records
• “Current location” is a view on this data:
pick location with most recent timestamp
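
The "current location" view can be sketched in Python using the location facts from the earlier slide. The dictionary layout and the function name `current_locations` are assumptions for illustration.

```python
def current_locations(location_facts):
    """For each person, pick the location with the most recent timestamp."""
    latest = {}
    for fact in location_facts:
        best = latest.get(fact["person"])
        if best is None or fact["time"] > best["time"]:
            latest[fact["person"]] = fact
    return {person: fact["location"] for person, fact in latest.items()}

# Append-only facts: the older record is never updated or deleted.
facts = [
    {"person": "Alice", "location": "San Francisco", "time": 12345},
    {"person": "Alice", "location": "New York", "time": 19827},
]

locations = current_locations(facts)
print(locations)  # {'Alice': 'New York'}
```

Both records stay in the master dataset. Only the view decides which one counts as current.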

Data model

• Having the full history for each entity is
extremely useful for:
  • Doing analytics
  • Recovering from mistakes (like writing
  bad data)

Data model

Diagram: a graph of facts.
• Nodes: Alice, Bob, Tweet: 123, Tweet: 456
• Edges: Reactor (two), Reaction
• Properties: Gender: female; Reshare: true; Content: "Data is fun!"; Content: "RT @bob Data is fun!"

Questions?

Twitter: @nathanmarz

Email: nathan.marz@gmail.com

Web: http://nathanmarz.com
