Netflix is migrating its datacenter infrastructure from Oracle databases to a globally distributed Apache Cassandra database on AWS. This will allow Netflix to scale more easily and deploy new features faster without being limited by the capacity of its own datacenters. The migration involves transitionally replicating data between Oracle and AWS services like SimpleDB while new services are deployed directly on Cassandra. This will cut Netflix's dependence on its existing datacenters and allow it to fully leverage the elasticity of the public cloud.
Migrating Netflix from Datacenter Oracle to Global Cassandra
1. Replacing Datacenter Oracle with Global Apache Cassandra on AWS
July 11, 2011
Adrian Cockcroft
@adrianco #netflixcloud
http://www.linkedin.com/in/adriancockcroft
2. Netflix Inc.
With more than 23 million subscribers in the United States and Canada, Netflix, Inc. is the world’s leading Internet subscription service for enjoying movies and TV shows.
International Expansion
We plan to expand into an additional market in the second half of 2011… If the second market meets our expectations… we will continue to invest and expand aggressively in 2012.
Source: http://ir.netflix.com
3. Building a Global Netflix Service
Netflix Cloud Migration
Data Migration to Cassandra
Highly Available and Globally Distributed Data
Backups and Archives in the Cloud
Monitoring Cassandra
Contributions and Organization
8. Data Center
Netflix could not build new datacenters fast enough
Capacity growth is accelerating, unpredictable
Product launch spikes - iPhone, Wii, PS3, XBox
10. Out-Growing Data Center
http://techblog.netflix.com/2011/02/redesigning-netflix-api.html
37x Growth Jan 2010-Jan 2011
Datacenter Capacity
11. Netflix.com is now ~100% Cloud
Account sign-up is currently being moved to cloud
All international product is cloud based
USA specific logistics remains in the Datacenter
12. Netflix Choice was AWS with our own platform and tools
Unique platform requirements and extreme agility and flexibility
13. Leverage AWS Scale “the biggest public cloud”
AWS investment in features and automation
Use AWS zones and regions for high availability, scalability and global deployment
14. We want to use clouds, we don’t have time to build them
Public cloud for agility and scale
AWS because they are big enough to allocate thousands of instances per hour when we need to
15. Netflix Deployed on AWS
Content: Video Masters, EC2, S3, CDN
Logs: S3, EMR Hadoop, Hive, Business Intelligence
Play: DRM, CDN routing, Bookmarks, Logging
WWW: Sign-Up, Search, Movie Choosing, Ratings
API: Metadata, Device Config, TV Movie Choosing, Mobile iPhone
16. Port to Cloud Architecture
Short term investment, long term payback!
Pay down technical debt
Robust patterns
17. Transition
• The Goals
– Faster, Scalable, Available and Productive
• Anti-patterns and Cloud Architecture
– The things we wanted to change and why
• Data Migration
– Minimizing datacenter dependencies
18. Datacenter Anti-Patterns
What do we currently do in the datacenter that prevents us from meeting our goals?
19. Old Datacenter vs. New Cloud Arch
Central SQL Database vs. Distributed Key/Value NoSQL
Sticky In-Memory Session vs. Shared Memcached Session
Chatty Protocols vs. Latency Tolerant Protocols
Tangled Service Interfaces vs. Layered Service Interfaces
Instrumented Code vs. Instrumented Service Patterns
Fat Complex Objects vs. Lightweight Serializable Objects
Components as Jar Files vs. Components as Services
20. The Central SQL Database
• Datacenter has central Oracle databases
– Everything in one place is convenient until it fails
– Customers, movies, history, configuration
• Schema changes require downtime
Anti-pattern impacts scalability, availability
21. The Distributed Key-Value Store
• Cloud has many key-value data stores
– More complex to keep track of, do backups etc.
– Each store is much simpler to administer
– Joins take place in Java code
• No schema to change, no scheduled downtime
• Latency for typical queries
– Memcached is dominated by network latency <1ms
– Cassandra replication takes a few milliseconds
– Oracle for simple queries is a few milliseconds
– SimpleDB replication and REST auth overheads >10ms
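The point that joins move out of the database and into application code can be sketched as below. This is only an illustration in Python (the slide refers to Java code); the three dictionaries are hypothetical stand-ins for separate key-value stores, and every key and field name is made up for the example.

```python
# Hypothetical stand-ins for three independent key-value stores.
customers = {"c1": {"name": "Alice"}, "c2": {"name": "Bob"}}
history = {"c1": ["m10", "m11"], "c2": []}  # customer id -> viewed movie ids
movies = {"m10": {"title": "Movie A"}, "m11": {"title": "Movie B"}}


def viewing_history(customer_id):
    """Join customer, history and movie records in application code.

    Each lookup is a separate key-value read; no store knows about
    the others, so the "join" is just sequential lookups.
    """
    customer = customers[customer_id]
    movie_ids = history.get(customer_id, [])
    titles = [movies[m]["title"] for m in movie_ids if m in movies]
    return {"name": customer["name"], "titles": titles}
```

With real stores each lookup is a network round trip, so the latencies listed on the slide add up per lookup.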
23. Transitional Steps
• Bidirectional Replication
– Oracle to SimpleDB
– Queued reverse path using SQS
– Backups remain in Datacenter via Oracle
• New Cloud-Only Data Sources
– Cassandra based
– No replication to Datacenter
– Backups performed in the cloud
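A rough sketch of the bidirectional replication idea: a direct forward path and a queued reverse path. The slides do not describe the actual mechanism, so everything here is an assumption: the dictionaries stand in for Oracle and SimpleDB, an in-process `Queue` stands in for SQS, and conflict handling is ignored.

```python
from queue import Queue

oracle = {}              # stand-in for the datacenter Oracle tables
simpledb = {}            # stand-in for the SimpleDB copy in the cloud
reverse_queue = Queue()  # stand-in for the SQS reverse path


def datacenter_write(key, value):
    """Write in the datacenter and replicate forward to the cloud copy."""
    oracle[key] = value
    simpledb[key] = value


def cloud_write(key, value):
    """Write in the cloud and queue the change for the datacenter."""
    simpledb[key] = value
    reverse_queue.put((key, value))


def drain_reverse_queue():
    """Apply queued cloud changes to the datacenter copy; return the count."""
    applied = 0
    while not reverse_queue.empty():
        key, value = reverse_queue.get()
        oracle[key] = value
        applied += 1
    return applied
```

The queue decouples cloud writes from datacenter availability: a cloud write succeeds immediately and the datacenter copy catches up when the queue is drained.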
24. API
Diagram labels: AWS EC2; Front End Load Balancer; Discovery Service; API Proxy; API etc.; Load Balancer; Component Services; SQS; Oracle; Cassandra; memcached; Replication; EC2 Internal Disks; Netflix Data Center; S3; SimpleDB
25. Cutting the Umbilical
• Transition Oracle Data Sources to Cassandra
– Offload Datacenter Oracle hardware
– Free up capacity for growth of remaining services
• Transition SimpleDB+Memcached to Cassandra
– Primary data sources that need backup
– Keep simple use cases like configuration service
• New challenges
– Backup, restore, archive, business continuity
– Business Intelligence integration
26. API
Diagram labels: AWS EC2; Front End Load Balancer; Discovery Service; API Proxy; Load Balancer; Component Services; memcached; Cassandra; EC2 Internal Disks; Backup; S3; SimpleDB
27. High Availability
• Cassandra stores 3 local copies, 1 per zone
– Synchronous access, durable, highly available
– Read/Write One fastest, least consistent - ~1ms
– Read/Write Quorum 2 of 3, consistent - ~3ms
• AWS Availability Zones
– Separate buildings
– Separate power etc.
– Close together
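Why "Quorum 2 of 3" is consistent while "One" is not can be checked with a small calculation: a read is guaranteed to overlap the latest write when the number of replicas read plus the number written exceeds the replication factor. The function names below are illustrative and not part of any Cassandra API.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_overlaps_write(read_replicas, write_replicas, replication_factor):
    """True when every read set must share a replica with every write set."""
    return read_replicas + write_replicas > replication_factor


print(quorum(3))                     # 2
print(read_overlaps_write(2, 2, 3))  # True: quorum read + quorum write
print(read_overlaps_write(1, 1, 3))  # False: read/write One may miss each other
```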
28. Remote Copies
• Cassandra duplicates across AWS regions
– Asynchronous write, replicates at destination
– Doesn’t directly affect local read/write latency
• Global Coverage
– Business agility
– Follow AWS…
• Local Access
– Better latency
– Fault Isolation
29. Cassandra Backup
• Full Backup
– Cron on each node
– Snapshot -> tar.gz -> S3
• Incremental
– SSTable write triggers copy to S3
• Continuous
– Scrape commit log
– Write to EBS every 30s
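The "Snapshot -> tar.gz" step of the full backup can be sketched with the Python standard library. This is a hedged illustration only: the function name and paths are hypothetical, taking the snapshot and uploading the archive to S3 are out of scope, and it is not the tooling Netflix used.

```python
import tarfile
from pathlib import Path


def archive_snapshot(snapshot_dir, archive_path):
    """Pack a snapshot directory into a gzip-compressed tarball.

    Returns the archive path so a later step (for example an upload,
    not shown here) can pick it up.
    """
    snapshot_dir = Path(snapshot_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # Store entries relative to the snapshot directory name.
        tar.add(str(snapshot_dir), arcname=snapshot_dir.name)
    return Path(archive_path)
```

A cron entry on each node could call such a function after the snapshot is taken and then hand the resulting file to the upload step.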
30. Cassandra Restore
• Full Restore
– Replace previous data
• New Ring from Backup
– New name, old data
– One line command!
31. Cassandra Data Extraction
• Business Intelligence
– Re-normalize data using Hadoop job
• Daily Extraction
– Create Brisk ring
– Extract backup
– Run Hadoop job
– Remove Brisk ring
– Under 1hr…
32. Cassandra Online BI
• Intra-Day Extraction
– Use split Brisk ring
– Size each separately
– Hourly Hadoop job
33. Cassandra Archive
Appropriate level of paranoia needed…
• Archive could be un-readable
– Base on restored S3 backup and BI extracted data
• Archive could be stolen
– Encrypt archive
• AWS East Region could have a problem
– Copy data to AWS West
• Production AWS Account could have an issue
– Separate Archive account with no-delete S3 ACL
• AWS S3 could have a global problem
– Create an extra copy on a different cloud vendor
34. Tools and Automation
• Developer and Build Tools
– Jira, Perforce, Eclipse, Jenkins, Ivy, Artifactory
– Builds, creates .war file, .rpm, bakes AMI and launches
• Custom Netflix Application Console
– AWS Features at Enterprise Scale (hide the AWS security keys!)
– Auto Scaler Group is unit of deployment to production
• Open Source + Support
– Apache, Tomcat, Cassandra, Hadoop, OpenJDK, CentOS
– Datastax support for Cassandra, AWS support for Hadoop via EMR
• Monitoring Tools
– Datastax Opscenter for monitoring Cassandra
– AppDynamics – Developer focus for cloud http://appdynamics.com
35. Developer Migration
• Detailed SQL to NoSQL Transition Advice
– Sid Anand - QConSF Nov 5th – Netflix’ Transition to High Availability Storage Systems
– Blog - http://practicalcloudcomputing.com/
– Download Paper PDF - http://bit.ly/bhOTLu
• Mark Atwood, "Guide to NoSQL, redux"
– YouTube http://youtu.be/zAbFRiyT3LU
36. Cloud Operations
Cassandra Use Cases
Model Driven Architecture
Capacity Planning & Monitoring
Chaos Monkey
37. Cassandra Use Cases
• Key by Customer
– Several separate Cassandra rings, read-intensive
– Sized to fit in memory using m2.4xl Instances
• Key by Customer:Movie – e.g. Viewing History
– Growing fast, write intensive – m1.xl instances
– Sized to hold hot data in memory only
• Large scale data logging – lots of writes
– Column data expires after time period
– Working on using distributed counters…
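The "column data expires after time period" behaviour for logging data can be illustrated with a toy in-memory store. This sketch only models the idea of a per-column time-to-live; it is not Cassandra's implementation, the class and method names are hypothetical, and the clock is passed in explicitly so the behaviour is easy to test.

```python
class ExpiringColumns:
    """Toy row whose columns disappear after a per-column time-to-live."""

    def __init__(self):
        self._columns = {}  # name -> (value, expiry time in seconds)

    def put(self, name, value, ttl_seconds, now):
        """Store a column that expires ttl_seconds after now."""
        self._columns[name] = (value, now + ttl_seconds)

    def get(self, name, now):
        """Return the value, or None if the column is missing or expired."""
        entry = self._columns.get(name)
        if entry is None or now >= entry[1]:
            return None
        return entry[0]


row = ExpiringColumns()
row.put("event", "play", ttl_seconds=60, now=0)
print(row.get("event", now=59))  # play
print(row.get("event", now=60))  # None
```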
38. Model Driven Architecture
• Datacenter Practices
– Lots of unique hand-tweaked systems
– Hard to enforce patterns
• Model Driven Cloud Architecture
– Perforce/Ivy/Jenkins based builds for everything
– Every production instance is a pre-baked AMI
– Every application is managed by an Autoscaler
Every change is a new AMI
39. Netflix Platform Cassandra AMI
• Tomcat server
– Always running, registers with platform
– Manages Cassandra state, tokens, backups
• SimpleDB configuration
– Stores token slots and options
– Avoids circular “bootstrap problems”
• Removed Root Disk Dependency on EBS
– Use S3 backed AMI for stateful services
– Normally use EBS backed AMI for fast provisioning
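To make "token slots" concrete: with Cassandra's RandomPartitioner, tokens lie in the range 0 to 2^127, and a common way to balance a ring of N nodes is to space the tokens evenly. The sketch below computes such slots. Whether the Netflix helper assigned tokens exactly this way is an assumption, and the function name is hypothetical.

```python
RING_SIZE = 2 ** 127  # token space of Cassandra's RandomPartitioner


def token_slots(node_count):
    """Evenly spaced tokens for a ring with node_count nodes."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return [i * RING_SIZE // node_count for i in range(node_count)]
```

Keeping a precomputed slot list in an external store would let a replacement node look up a free slot at boot instead of negotiating with the ring it is trying to join.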
42. Chaos Monkey
• Make sure systems are resilient
– Allow any instance to fail without customer impact
• Chaos Monkey hours
– Monday-Thursday 9am-3pm random instance kill
• Application configuration option
– Apps now have to opt-out from Chaos Monkey
• Computers (Datacenter or AWS) randomly die
– Fact of life, but too infrequent to test resiliency
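The rules above (Monday-Thursday, 9am-3pm, opt-out per application, random choice) can be sketched as a small selection function. This is not the actual Chaos Monkey code: the function names, the instance dictionaries and the opt-out set are hypothetical, and actually terminating an instance is left out.

```python
import random
from datetime import datetime


def in_chaos_hours(when):
    """True on Monday-Thursday between 9am and 3pm (local time)."""
    return when.weekday() <= 3 and 9 <= when.hour < 15


def pick_victim(instances, opted_out_apps, when, rng=random):
    """Pick one random instance id to kill, or None if nothing qualifies."""
    if not in_chaos_hours(when):
        return None
    candidates = [i for i in instances if i["app"] not in opted_out_apps]
    if not candidates:
        return None
    return rng.choice(candidates)["id"]


fleet = [{"id": "i-1", "app": "api"}, {"id": "i-2", "app": "billing"}]
print(pick_victim(fleet, {"billing"}, datetime(2011, 7, 11, 10, 0)))  # i-1
```

Restricting kills to working hours means engineers are around to respond when an instance failure does expose a weakness.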
44. Capacity Planning in Clouds
(a few things have changed…)
• Capacity is expensive
• Capacity takes time to buy and provision
• Capacity only increases, can’t be shrunk easily
• Capacity comes in big chunks, paid up front
• Planning errors can cause big problems
• Systems are clearly defined assets
• Systems can be instrumented in detail
• Depreciate assets over 3 years (reservations!)
45. Data Sources
External Testing
• External URL availability and latency alerts and reports – Keynote
• Stress testing - SOASTA
Request Trace Logging
• Netflix REST calls – Chukwa to DataOven with GUID transaction identifier
• Generic HTTP – AppDynamics service tier aggregation, end to end tracking
Application logging
• Tracers and counters – log4j, tracer central, Chukwa to DataOven
• Trackid and Audit/Debug logging – DataOven, AppDynamics GUID cross reference
JMX Metrics
• Application specific real time – Datastax Opscenter, AppDynamics
• Service and SLA percentiles – AppDynamics, Epic logged to DataOven
Tomcat and Apache logs
• Stdout logs – S3 – DataOven
• Standard format Access and Error logs – S3 – DataOven
JVM
• Garbage Collection – AppDynamics
• Memory usage, call stacks, resource/call - AppDynamics
Linux
• System CPU/Net/RAM/Disk metrics – AppDynamics
• SNMP metrics – Epic, Network flows – boundary.com
AWS
• Load balancer traffic – Amazon Cloudwatch, SimpleDB usage stats
• System configuration - CPU count/speed and RAM size, overall usage - AWS
46. AppDynamics
How to look deep inside your cloud applications
• Automatic Monitoring
– Base AMI bakes in all monitoring tools
– Outbound calls only – no discovery/polling issues
– Inactive instances removed after a few days
• Incident Alarms (deviation from baseline)
– Business Transaction latency and error rate
– Alarm thresholds discover their own baseline
– Email contains URL to Incident Workbench UI
49. Netflix Contributions to Cassandra
• Cassandra as a mutable toolkit
– Cassandra is in Java, pluggable, well structured
– Netflix has a building full of Java engineers….
• Actual Contributions delivered in 0.8
– First prototype of off-heap row cache (Vijay)
– Incremental backup SSTable write callback
• Work In Progress
– AWS integration and backup using Tomcat helper
– Total re-write of Hector Java client library (Eran)
50. Netflix “NoOps” Organization
Marketing & Advertising Site for Customer Acquisition
Member Site Personalization for Customer Retention
Cloud Ops Reliability Engineering
Build Tools and Automation
Database Engineering
Platform Development
Cloud Performance
Cloud Solutions
Perforce, Jenkins, Cassandra, AWS
51. Takeaway
Netflix is using Cassandra on AWS as a key infrastructure component of its globally distributed streaming product.
http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud
52. Amazon Cloud Terminology Reference
See http://aws.amazon.com/ This is not a full list of Amazon Web Service features
• AWS – Amazon Web Services (common name for Amazon cloud)
• AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)
• EC2 – Elastic Compute Cloud
– Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations.
– Instance – a running computer system. Ephemeral, when it is de-allocated nothing is kept.
– Reserved Instances – pre-paid to reduce cost for long term usage
– Availability Zone – datacenter with own power and cooling hosting cloud instances
– Region – group of Availability Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan
• ASG – Auto Scaling Group (instances booting from the same AMI)
• S3 – Simple Storage Service (http access)
• EBS – Elastic Block Store (network disk filesystem can be mounted on an instance)
• RDS – Relational Database Service (managed MySQL master and slaves)
• SDB – Simple Data Base (hosted http based NoSQL data store)
• SQS – Simple Queue Service (http based message queue)
• SNS – Simple Notification Service (http and email based topics and messages)
• EMR – Elastic Map Reduce (automatically managed Hadoop cluster)
• ELB – Elastic Load Balancer
• EIP – Elastic IP (stable IP address mapping assigned to instance or ELB)
• VPC – Virtual Private Cloud (extension of enterprise datacenter network into cloud)
• IAM – Identity and Access Management (fine grain role based security keys)