Netflix Architecture Tutorial at Gluecon

Cloud Architecture Tutorial

Constructing Cloud Architecture the Netflix Way

Gluecon May 23rd, 2012

Adrian Cockcroft

@adrianco #netflixcloud

http://www.linkedin.com/in/adriancockcroft

Tutorial Abstract – Set Context

•  Dispensing with the usual questions: “Why Netflix, why cloud, why AWS?” as they are old hat now.

•  This tutorial explains how developers use the Netflix cloud, and how it is built and operated.

•  The real meat of the tutorial comes when we look at how to construct an application with a host of important properties: elastic, dynamic, scalable, agile, fast, cheap, robust, durable, observable and secure. Over the last three years Netflix has figured out cloud-based solutions with these properties, deployed them globally at large scale and refined them into a global Java-oriented Platform as a Service. The PaaS is based on low-cost open source building blocks such as Apache Tomcat, Apache Cassandra, and Memcached. Components of this platform are in the process of being open-sourced by Netflix, so that other companies can get a start on building their own customized PaaS that leverages advanced features of AWS and supports rapid agile development.

•  The architecture is described in terms of anti-patterns: things to avoid in the datacenter-to-cloud transition. A scalable global persistence tier based on Cassandra provides a highly available and durable underpinning. Lessons learned will cover solutions to common problems, availability and robustness, and observability. Attendees should leave the tutorial with a clear understanding of what is different about the Netflix cloud architecture, how it empowers and supports developers, and a set of flexible and scalable open source building blocks that can be used to construct their own cloud platform.

Presentation vs. Tutorial

•  Presentation
–  Short duration, focused subject
–  One presenter to a large anonymous audience
–  A few questions at the end

•  Tutorial
–  Time to explore in and around the subject
–  Tutor gets to know the audience
–  Discussion, rat-holes, “bring out your dead”

Cloud Tutorial Sections

Intro: Who are you, what are your questions?

Part 1 – Writing and Performing
Developer Viewpoint

Part 2 – Running the Show
Operator Viewpoint

Part 3 – Making the Instruments
Builder Viewpoint

Adrian Cockcroft

•  Director, Architecture for Cloud Systems, Netflix Inc.
–  Previously Director for Personalization Platform

•  Distinguished Availability Engineer, eBay Inc. 2004-7
–  Founding member of eBay Research Labs

•  Distinguished Engineer, Sun Microsystems Inc. 1988-2004
–  2003-4 Chief Architect High Performance Technical Computing
–  2001 Author: Capacity Planning for Web Services
–  1999 Author: Resource Management
–  1995 & 1998 Author: Sun Performance and Tuning
–  1996 Japanese Edition of Sun Performance and Tuning: SPARC & Solaris Performance Tuning (SunSoft Press Series), original title SPARC & Solarisパフォーマンスチューニング (サンソフトプレスシリーズ)

•  Heavy Metal Bass Guitarist in “Black Tiger” 1980-1982
–  Influenced by Van Halen, Yesterday & Today, AC/DC

•  More
–  Twitter @adrianco – Blog http://perfcap.blogspot.com
–  Presentations at http://www.slideshare.net/adrianco

Attendee Introductions

•  Who are you, where do you work?

•  Why are you here today, what do you need?

•  “Bring out your dead”
–  Do you have a specific problem or question?
–  One sentence elevator pitch

•  What instrument do you play?

Writing and Performing

Developer Viewpoint

Part 1 of 3

Van Halen

Audience and Fans listen to Songs and Albums written and played by Van Halen using Instruments and Studios

Developers
(Toons from gapingvoid.com)

Customers use Products built by Developers that run on Infrastructure

Why Use Cloud?
“Runnin’ with the Devil – Van Halen”

Things we don’t do
“Unchained – Van Halen”

What do developers care about?
“Right Now – Van Halen”

Keeping up with Developer Trends

In production at Netflix:

•  Big Data/Hadoop 2009
•  Cloud 2009
•  Application Performance Management 2010
•  Integrated DevOps Practices 2010
•  Continuous Integration/Delivery 2010
•  NoSQL 2010
•  Platform as a Service 2010
•  Social coding, open development/github 2011

AWS-specific feature dependence…
“Why can’t this be love? – Van Halen”

Portability vs. Functionality

•  Portability – the Operations focus
–  Avoid vendor lock-in
–  Support datacenter-based use cases
–  Possible operations cost savings

•  Functionality – the Developer focus
–  Less complex test and debug, one mature supplier
–  Faster time to market for your products
–  Possible developer cost savings

Portable PaaS

•  Portable IaaS Base - some AWS compatibility
–  Eucalyptus – AWS licensed compatible subset
–  CloudStack – Citrix Apache project
–  OpenStack – Rackspace, Cloudscaling, HP etc.

•  Portable PaaS
–  Cloud Foundry - run it yourself in your DC
–  AppFog and Stackato – Cloud Foundry/OpenStack
–  Vendor options: Rightscale, Enstratus, Smartscale

Functional PaaS

•  IaaS base - all the features of AWS
–  Very large scale, mature, global, evolving rapidly
–  ELB, Autoscale, VPC, SQS, EIP, EMR, DynamoDB etc.
–  Large files and multipart writes in S3

•  Functional PaaS – based on Netflix features
–  Very large scale, mature, flexible, customizable
–  Asgard console, Monkeys, Big data tools
–  Cassandra/Zookeeper data store automation

Developers choose Functional

Don’t let the roadie write the set list!
(yes you do need all those guitars on tour…)

Freedom and Responsibility

•  Developers leverage cloud to get freedom
–  Agility of a single organization, no silos

•  But now developers are responsible
–  For compliance, performance, availability etc.

“As far as my rehab is concerned, it is within my ability to change and change for the better” - Eddie Van Halen

Amazon Cloud Terminology Reference
See http://aws.amazon.com/ (this is not a full list of Amazon Web Services features)

•  AWS – Amazon Web Services (common name for Amazon cloud)

•  AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)

•  EC2 – Elastic Compute Cloud
–  Range of virtual machine types m1, m2, c1, cc, cg, with varying memory, CPU and disk configurations
–  Instance – a running computer system. Ephemeral: when it is de-allocated, nothing is kept.
–  Reserved Instances – pre-paid to reduce cost for long term usage
–  Availability Zone – datacenter with its own power and cooling, hosting cloud instances
–  Region – group of Availability Zones: US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan, SA-Brazil, US-Gov

•  ASG – Auto Scaling Group (instances booting from the same AMI)

•  S3 – Simple Storage Service (http access)

•  EBS – Elastic Block Store (network-attached disk volume that can be mounted on an instance)

•  RDS – Relational Database Service (managed MySQL master and slaves)

•  DynamoDB/SDB – DynamoDB and SimpleDB (hosted http-based NoSQL datastores; DynamoDB replaces SDB)

•  SQS – Simple Queue Service (http-based message queue)

•  SNS – Simple Notification Service (http and email based topics and messages)

•  EMR – Elastic Map Reduce (automatically managed Hadoop cluster)

•  ELB – Elastic Load Balancer

•  EIP – Elastic IP (stable IP address mapping assigned to an instance or ELB)

•  VPC – Virtual Private Cloud (single tenant, more flexible network and security constructs)

•  DirectConnect – secure pipe from AWS VPC to an external datacenter

•  IAM – Identity and Access Management (fine-grained role-based security keys)

Netflix Deployed on AWS

•  2009 Content: Content Management, EC2 Encoding, S3 Petabytes
•  2009 Logs: S3 Terabytes, EMR, Hive & Pig, Business Intelligence
•  2010 Play: DRM, CDN routing, Bookmarks, Logging
•  2010 WWW: Sign-Up, Search, Movie Choosing, Ratings
•  2010 API: Metadata, Device Config, TV Movie Choosing, Social Facebook
•  2011 CS: International CS lookup, Diagnostics & Actions, Customer Call Log, CS Analytics

[Diagram: delivery to Customers via CDNs and ISPs, at Terabits scale]

Datacenter to Cloud Transition Goals
“Go ahead and Jump – Van Halen”

•  Faster
–  Lower latency than the equivalent datacenter web pages and API calls
–  Measured as mean and 99th percentile
–  For both first hit (e.g. home page) and in-session hits for the same user

•  Scalable
–  Avoid needing any more datacenter capacity as subscriber count increases
–  No central vertically scaled databases
–  Leverage AWS elastic capacity effectively

•  Available
–  Substantially higher robustness and availability than datacenter services
–  Leverage multiple AWS availability zones
–  No scheduled down time, no central database schema to change

•  Productive
–  Optimize agility of a large development team with automation and tools
–  Leave behind complex tangled datacenter code base (~8 year old architecture)
–  Enforce clean layered interfaces and re-usable components

Datacenter Anti-Patterns

What do we currently do in the datacenter that prevents us from meeting our goals?

“Me Wise Magic – Van Halen”

Netflix Datacenter vs. Cloud Arch

Datacenter                        Cloud
Central SQL Database              Distributed Key/Value NoSQL
Sticky In-Memory Session          Shared Memcached Session
Chatty Protocols                  Latency Tolerant Protocols
Tangled Service Interfaces        Layered Service Interfaces
Instrumented Code                 Instrumented Service Patterns
Fat Complex Objects               Lightweight Serializable Objects
Components as Jar Files           Components as Services

The Central SQL Database

•  Datacenter has a central database
–  Everything in one place is convenient until it fails

•  Schema changes require downtime
–  Customers, movies, history, configuration

Anti-pattern impacts scalability, availability

The Distributed Key-Value Store

•  Cloud has many key-value data stores
–  More complex to keep track of, do backups etc.
–  Each store is much simpler to administer
–  Joins take place in java code
–  No schema to change, no scheduled downtime

•  Minimum Latency for Simple Requests
–  Memcached is dominated by network latency <1ms
–  Cassandra cross zone replication around one millisecond
–  DynamoDB replication and auth overheads around 5ms
–  SimpleDB higher replication and auth overhead >10ms
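With no central SQL database, a join that used to be one query becomes separate key-value lookups stitched together in application code. A minimal sketch, assuming both stores are modelled as in-memory maps and using hypothetical `Customer` and `Rating` record types (not Netflix code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Sketch of an application-side join across two key-value stores.
 * The stores are modelled as in-memory Maps/Lists; record types and
 * field names are hypothetical.
 */
final class JavaSideJoin {

    record Customer(String id, String name) {}
    record Rating(String customerId, String movieId, int stars) {}
    record CustomerRating(String customerName, String movieId, int stars) {}

    /**
     * For each rating, look up the owning customer by key (the "join").
     * Ratings whose customer is missing are skipped rather than failing,
     * because separate stores are not updated transactionally.
     */
    static List<CustomerRating> join(Map<String, Customer> customersById,
                                     List<Rating> ratings) {
        List<CustomerRating> out = new ArrayList<>();
        for (Rating r : ratings) {
            Customer c = customersById.get(r.customerId());
            if (c != null) {
                out.add(new CustomerRating(c.name(), r.movieId(), r.stars()));
            }
        }
        return out;
    }
}
```

Skipping dangling keys is only one possible policy; the point is that the application, not the database, now owns that decision.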

The Sticky Session

•  Datacenter Sticky Load Balancing
–  Efficient caching for low latency
–  Tricky session handling code

•  Encourages concentrated functionality
–  One service that does everything
–  Middle tier load balancer had issues in practice

Anti-pattern impacts productivity, availability

Shared Session State

•  Elastic Load Balancer
–  We don’t use the cookie based routing option
–  External “session caching” with memcached

•  More flexible fine grain services
–  Any instance can serve any request
–  Works better with auto-scaled instance counts

Chatty Opaque and Brittle Protocols

•  Datacenter service protocols
–  Assumed low latency for many simple requests

•  Based on serializing existing java objects
–  Inefficient formats
–  Incompatible when definitions change

Anti-pattern causes productivity, latency and availability issues

Robust and Flexible Protocols

•  Cloud service protocols
–  JSR311/Jersey is used for REST/HTTP service calls
–  Custom client code includes service discovery
–  Support complex data types in a single request

•  Apache Avro
–  Evolved from Protocol Buffers and Thrift
–  Includes JSON header defining key/value protocol
–  Avro serialization is half the size and several times faster than Java serialization, but more work to code

Persisted Protocols

•  Persist Avro in Memcached
–  Save space/latency (zigzag encoding, half the size)
–  New keys are ignored
–  Missing keys are handled cleanly

•  Avro protocol definitions
–  Less brittle across versions
–  Can be written in JSON or generated from POJOs
–  It’s hard, needs better tooling
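Zigzag encoding is part of Avro's binary format for int and long values: the sign is folded into the lowest bit, and the result is written as a variable-length integer, 7 bits per byte. Values of small magnitude, positive or negative, therefore take a single byte instead of a fixed 4 or 8. This sketch shows only that encoding (the class and method names are our own, not Avro's API). The "half the size" figure on the slide is the author's measurement and is not reproduced here.

```java
import java.io.ByteArrayOutputStream;

/** Sketch of zigzag + variable-length integer encoding, as used for Avro int/long values. */
final class ZigZagVarint {

    /** Maps signed to unsigned so that 0, -1, 1, -2, 2 ... become 0, 1, 2, 3, 4 ... */
    static long zigzag(long n) {
        return (n << 1) ^ (n >> 63);
    }

    /** Inverse of zigzag(). */
    static long unzigzag(long z) {
        return (z >>> 1) ^ -(z & 1);
    }

    /** Writes 7 bits per byte, low-order group first; a set high bit means more bytes follow. */
    static byte[] encode(long n) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long z = zigzag(n);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }

    /** Decodes a value previously produced by encode(). */
    static long decode(byte[] bytes) {
        long z = 0;
        int shift = 0;
        for (byte b : bytes) {
            z |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        return unzigzag(z);
    }
}
```

For example, -64 and 63 both encode to one byte, while 64 needs two.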

Tangled Service Interfaces

•  Datacenter implementation is exposed
–  Oracle SQL queries mixed into business logic

•  Tangled code
–  Deep dependencies, false sharing

•  Data providers with sideways dependencies
–  Everything depends on everything else

Anti-pattern affects productivity, availability

Untangled Service Interfaces

•  New Cloud Code With Strict Layering
–  Compile against interface jar
–  Can use spring runtime binding to enforce
–  Fine grain services as components

•  Service interface is the service
–  Implementation is completely hidden
–  Can be implemented locally or remotely
–  Implementation can evolve independently

Untangled Service Interfaces
“Poundcake – Van Halen”

Two layers:

•  SAL - Service Access Library
–  Basic serialization and error handling
–  REST or POJOs defined by data provider

•  ESL - Extended Service Library
–  Caching, conveniences, can combine several SALs
–  Exposes faceted type system (described later)
–  Interface defined by data consumer in many cases
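The two-layer split can be made concrete with a minimal sketch. All names here (`RatingsSal`, `FakeRatingsSal`, `RatingsEsl`) are hypothetical. The SAL is a thin interface published by the data provider. A fake implementation lets consumers start work before the real service exists. The ESL adds caching and a consumer-oriented convenience method on top.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** SAL: the thin interface a data provider publishes (hypothetical example). */
interface RatingsSal {
    /** Returns the star rating a customer gave a movie, if any. */
    Optional<Integer> getRating(String customerId, String movieId);
}

/** Fake SAL backed by a fixed test data set, so consumers can start early. */
final class FakeRatingsSal implements RatingsSal {
    private final Map<String, Integer> data = new HashMap<>();
    int calls = 0; // counts backend lookups; not thread-safe, only for demonstration

    FakeRatingsSal put(String customerId, String movieId, int stars) {
        data.put(customerId + "/" + movieId, stars);
        return this;
    }

    @Override
    public Optional<Integer> getRating(String customerId, String movieId) {
        calls++;
        return Optional.ofNullable(data.get(customerId + "/" + movieId));
    }
}

/** ESL: caching and conveniences layered over a SAL. */
final class RatingsEsl {
    private final RatingsSal sal;
    private final Map<String, Optional<Integer>> cache = new ConcurrentHashMap<>();

    RatingsEsl(RatingsSal sal) {
        this.sal = sal;
    }

    /** Cached lookup: repeated calls for the same key do not hit the SAL again. */
    Optional<Integer> rating(String customerId, String movieId) {
        return cache.computeIfAbsent(customerId + "/" + movieId,
                k -> sal.getRating(customerId, movieId));
    }

    /** Convenience a consumer might ask for: treat "not rated" as zero stars. */
    int starsOrZero(String customerId, String movieId) {
        return rating(customerId, movieId).orElse(0);
    }
}
```

Swapping `FakeRatingsSal` for the real provider's SAL later does not change any ESL or consumer code, which is the isolation the layering is meant to give.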

Service Interaction Pattern

Sample Swimlane Diagram

Service Architecture Patterns

•  Internal Interfaces Between Services
–  Common patterns as templates
–  Highly instrumented, observable, analytics
–  Service Level Agreements – SLAs

•  Library templates for generic features
–  Instrumented Netflix Base Servlet template
–  Instrumented generic client interface template
–  Instrumented S3, SimpleDB, Memcached clients

[Diagram: Service Request – Instruments Every Step in the call. A timestamp is recorded at each step:
CLIENT: request start; client outbound serialize start and end; client network send
SERVICE: service network receive; service inbound deserialize start and end; execute request start and end; service outbound serialize start and end; service network send
CLIENT: client network receive; client inbound deserialize start and end; request end]
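Instrumenting every step amounts to recording a timestamp at each point in the diagram and deriving durations (serialization, network, execution) from pairs of them. This is a minimal sketch with a hypothetical `RequestTimeline` class. The clock is injectable so the arithmetic can be tested; by default it uses `System.nanoTime`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/**
 * Sketch of per-request step instrumentation: mark a timestamp at each
 * named step, then compute elapsed time between any two steps.
 * Step names are illustrative only.
 */
final class RequestTimeline {
    private final LongSupplier clockNanos;
    private final Map<String, Long> marks = new LinkedHashMap<>();

    RequestTimeline(LongSupplier clockNanos) {
        this.clockNanos = clockNanos;
    }

    RequestTimeline() {
        this(System::nanoTime);
    }

    /** Records the current time for a named step, e.g. "outbound serialize start". */
    void mark(String step) {
        marks.put(step, clockNanos.getAsLong());
    }

    /** Elapsed nanoseconds between two recorded steps. */
    long between(String from, String to) {
        Long a = marks.get(from), b = marks.get(to);
        if (a == null || b == null) {
            throw new IllegalArgumentException("missing mark: " + (a == null ? from : to));
        }
        return b - a;
    }
}
```

For example, client-side "request start" to "request end", minus the service's "execute request start" to "execute request end", approximates the serialization and network overhead of the call (the two sides use different clocks, so only durations, not absolute times, should be compared).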

Boundary Interfaces

•  Isolate teams from external dependencies
–  Fake SAL built by cloud team
–  Real SAL provided by data provider team later
–  ESL built by cloud team using faceted objects

•  Fake data sources allow development to start
–  e.g. Fake Identity SAL for a test set of customers
–  Development solidifies dependencies early
–  Helps external team provide the right interface

One Object That Does Everything
“Can’t Get This Stuff No More – Van Halen”

•  Datacenter uses a few big complex objects
–  Good choice for a small team and one instance
–  Problematic for large teams and many instances

•  False sharing causes tangled dependencies
–  Movie and Customer objects are foundational
–  Unproductive re-integration work

Anti-pattern impacting productivity and availability

An Interface For Each Component

•  Cloud uses faceted Video and Visitor
–  Basic types hold only the identifier
–  Facets scope the interface you actually need
–  Each component can define its own facets

•  No false-sharing and dependency chains
–  Type manager converts between facets as needed
–  video.asA(PresentationVideo) for www
–  video.asA(MerchableVideo) for middle tier
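Here is a sketch of how a faceted type with an `asA` conversion might look in Java. The facet names come from the slide, but the `Class`-token signature, the `TypeManager` registry and the facet fields are assumptions for illustration. The slide's `video.asA(PresentationVideo)` is written `video.asA(PresentationVideo.class)` here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Basic type: holds only the identifier. */
final class Video {
    final long id;

    Video(long id) {
        this.id = id;
    }

    /** Converts this basic type into the facet a component needs, via the type manager. */
    <T> T asA(Class<T> facet) {
        return TypeManager.convert(this, facet);
    }
}

/** Facets: each component defines only the view it needs (fields are made up). */
record PresentationVideo(long id, String title, String boxArtUrl) {}
record MerchableVideo(long id, double predictedRating) {}

/** Registry of Video-to-facet converters; components register their own facets. */
final class TypeManager {
    private static final Map<Class<?>, Function<Video, ?>> converters = new HashMap<>();

    static <T> void register(Class<T> facet, Function<Video, T> converter) {
        converters.put(facet, converter);
    }

    static <T> T convert(Video v, Class<T> facet) {
        Function<Video, ?> f = converters.get(facet);
        if (f == null) {
            throw new IllegalArgumentException("no facet registered: " + facet.getName());
        }
        // The conversion may do real work, such as a metadata lookup.
        return facet.cast(f.apply(v));
    }
}
```

The www tier only compiles against `PresentationVideo`, and the middle tier only against `MerchableVideo`, so neither drags in the other's dependencies. Because each converter may do a lookup, this is also where the "asA operations may be a lot of work" caveat comes from.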

Stan Lanning’s Soap Box
Listen to the bearded guru…

•  Business Level Object - Level Confusion
–  Don’t pass around IDs when you mean to refer to the BLO

•  Using Basic Types helps the compiler help you
–  Compile time problems are better than run time problems

•  More readable by people
–  But beware that asA operations may be a lot of work

•  Multiple-inheritance for Java?
–  Kinda-sorta…

Model Driven Architecture

•  Traditional Datacenter Practices
–  Lots of unique hand-tweaked systems
–  Hard to enforce patterns
–  Some use of Puppet to automate changes

•  Model Driven Cloud Architecture
–  Perforce/Ivy/Jenkins based builds for everything
–  Every production instance is a pre-baked AMI
–  Every application is managed by an Autoscaler

Every change is a new AMI

Netflix PaaS Principles

•  Maximum Functionality
–  Developer productivity and agility

•  Leverage as much of AWS as possible
–  AWS is making huge investments in features/scale

•  Interfaces that isolate Apps from AWS
–  Avoid lock-in to specific AWS API details

•  Portability is a long term goal
–  Gets easier as other vendors catch up with AWS

Netflix Global PaaS Features

•  Supports all AWS Availability Zones and Regions
•  Supports multiple AWS accounts {test, prod, etc.}
•  Cross Region/Acct Data Replication and Archiving
•  Internationalized, Localized and GeoIP routing
•  Security is fine grain, dynamic AWS keys
•  Autoscaling to thousands of instances
•  Monitoring for millions of metrics
•  Productive for 100s of developers on one product
•  25M+ users USA, Canada, Latin America, UK, Eire

Basic PaaS Entities

•  AWS Based Entities
–  Instances and Machine Images, Elastic IP Addresses
–  Security Groups, Load Balancers, Autoscale Groups
–  Availability Zones and Geographic Regions

•  Netflix PaaS Entities
–  Applications (registered services)
–  Clusters (versioned Autoscale Groups for an App)
–  Properties (dynamic hierarchical configuration)

Core PaaS Services

•  AWS Based Services
–  S3 storage, to 5TB files, parallel multipart writes
–  SQS – Simple Queue Service. Messaging layer.

•  Netflix Based Services
–  EVCache – memcached based ephemeral cache
–  Cassandra – distributed persistent data store

•  External Services
–  GeoIP Lookup interfaced to a vendor
–  Secure Keystore HSM

Instance Architecture

[Diagram: software stack on each instance]
•  Linux Base AMI (CentOS or Ubuntu)
•  Optional Apache frontend, memcached, non-java apps
•  Java (JDK 6 or 7)
•  AppDynamics appagent monitoring
•  GC and thread dump logging
•  Tomcat
•  Application war file, base servlet, platform, interface jars for dependent services
•  Healthcheck, status servlets, JMX interface, Servo autoscale
•  Log rotation to S3
•  Monitoring: AppDynamics machineagent, Epic

Security Architecture

•  Instance Level Security baked into base AMI
–  Login: ssh only allowed via portal (not between instances)
–  Each app type runs as its own userid app{test|prod}

•  AWS Security, Identity and Access Management
–  Each app has its own security group (firewall ports)
–  Fine grain user roles and resource ACLs

•  Key Management
–  AWS Keys dynamically provisioned, easy updates
–  High grade app specific key management support

Continuous Integration / Release

Lightweight process scales as the organization grows

•  No centralized two-week sprint/release “train”
•  Thousands of builds a day, tens of releases
•  Engineers release at their own pace
•  Unit of release is a web service, over 200 so far…
•  Dependencies handled as exceptions

Hello World?
Getting started for a new developer…

•  Register the “helloadrian” app name in Asgard
•  Get the example helloworld code from perforce
•  Edit some properties to update the name etc.
•  Check in the changes
•  Clone a Jenkins build job
•  Build the code
•  Bake the code into an Amazon Machine Image
•  Use Asgard to set up an AutoScaleGroup with the AMI
•  Check that the instance healthcheck is “Up” using Asgard
•  Hit the URL to get “HTTP 200, Hello” back

Register new application name

Naming rules: all lower case with underscore, no spaces or dashes
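The naming rule can be enforced with a one-line pattern check before registration. This is a sketch only. The slide does not say whether digits are allowed or whether a name may start with an underscore, so the pattern below (start with a lower-case letter, then lower-case letters, digits or underscores) is an assumption.

```java
import java.util.regex.Pattern;

/** Sketch of the application naming rule: lower case, underscores allowed, no spaces or dashes. */
final class AppNames {
    // Assumption: must start with a lower-case letter; digits are allowed after that.
    private static final Pattern VALID = Pattern.compile("[a-z][a-z0-9_]*");

    static boolean isValid(String name) {
        return name != null && VALID.matcher(name).matches();
    }
}
```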

Portals and Explorers

•  Netflix Application Console (Asgard/NAC)
–  Primary AWS provisioning/config interface

•  AWS Usage Analyzer
–  Breaks down costs by application and resource

•  Cassandra Explorer
–  Browse clusters, keyspaces, column families

•  Base Server Explorer
–  Browse service endpoints configuration, perf

AWS Usage
for test, carefully omitting any $ numbers…

Platform Services

•  Discovery – service registry for “Applications”
•  Introspection – Entrypoints
•  Cryptex – Dynamic security key management
•  Geo – Geographic IP lookup
•  Configuration Service – Dynamic properties
•  Localization – manage and lookup local translations
•  Evcache – ephemeral volatile cache
•  Cassandra – Cross zone/region distributed data store
•  Zookeeper – Distributed Coordination (Curator)
•  Various proxies – access to old datacenter stuff

Introspection - Entrypoints

•  REST API for tools, apps, explorers, monkeys…
–  E.g. GET /REST/v1/instance/$INSTANCE_ID

•  AWS Resources
–  Autoscaling Groups, EIP Groups, Instances

•  Netflix PaaS Resources
–  Discovery Applications, Clusters of ASGs, History

•  Full History of all Resources
–  Supports Janitor Monkey cleanup of unused resources

Entrypoints Queries

MongoDB used for low traffic complex queries against complex objects

Each query description is followed by its range expression:

•  Find all active instances.
–  all()

•  Find all instances associated with a group name.
–  %(cloudmonkey)

•  Find all instances associated with a discovery group.
–  /^cloudmonkey$/discovery()

•  Find all auto scale groups with no instances.
–  asg(),-has(INSTANCES;asg())

•  How many instances are not in an auto scale group?
–  count(all(),-info(eval(INSTANCES;asg())))

•  What groups include an instance?
–  *(i-4e108521)

•  What auto scale groups and elastic load balancers include an instance?
–  filter(TYPE;asg,elb;*(i-4e108521))

•  What instance has a given public ip?
–  filter(PUBLIC_IP;174.129.188.{0..255};all())

Metrics Framework

•  System and Application
–  Collection, Aggregation, Querying and Reporting
–  Non-blocking logging, avoids log4j lock contention
–  Honu-Streaming -> S3 -> EMR -> Hive

•  Performance, Robustness, Monitoring, Analysis
–  Tracers, Counters – explicit code instrumentation log
–  SLA – service level response time percentiles
–  Servo annotated JMX extract to Cloudwatch

•  Latency Testing and Inspection Infrastructure
–  Latency Monkey injects random delays and errors into service responses
–  Base Server Explorer Inspect client timeouts
–  Global property management to change client timeouts
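SLA reporting summarizes response times as a mean plus percentiles, such as the 99th percentile used in the transition goals. A sketch of the arithmetic, assuming all samples fit in memory and using the nearest-rank definition of a percentile. Production metrics libraries usually keep streaming histograms instead of sorting every sample, and `LatencySummary` is a hypothetical name.

```java
import java.util.Arrays;

/** Sketch of an SLA-style latency summary: mean and nearest-rank percentiles. */
final class LatencySummary {
    private final long[] sortedMillis;

    LatencySummary(long[] samplesMillis) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        this.sortedMillis = samplesMillis.clone();
        Arrays.sort(this.sortedMillis);
    }

    double mean() {
        return Arrays.stream(sortedMillis).average().orElseThrow();
    }

    /** Nearest-rank percentile: smallest sample with at least p% of samples at or below it. */
    long percentile(double p) {
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        int rank = (int) Math.ceil(p * sortedMillis.length / 100.0);
        return sortedMillis[rank - 1];
    }
}
```

Reporting the 99th percentile alongside the mean exposes slow outliers that a mean alone would hide.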

Interprocess Communication

•  Discovery Service registry for “applications”
–  “here I am” call every 30s, drop after 3 missed
–  “where is everyone” call
–  Redundant, distributed, moving to Zookeeper

•  NIWS – Netflix Internal Web Service client
–  Software Middle Tier Load Balancer
–  Failure retry moves to next instance
–  Many options for encoding, etc.
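The Discovery lease rule ("here I am" every 30s, drop after 3 missed) can be sketched as a heartbeat registry. The class and method names are hypothetical, and this is not the Discovery service's API. Time is passed in explicitly as epoch milliseconds so that the eviction rule is easy to test. An instance is treated as gone once more than three renewal intervals (90 s) have passed since its last heartbeat.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of a heartbeat-based service registry with lease expiry. */
final class HeartbeatRegistry {
    static final Duration RENEW_INTERVAL = Duration.ofSeconds(30);
    static final int MAX_MISSED = 3;

    // app name -> (instance id -> time of last heartbeat, epoch millis)
    private final Map<String, Map<String, Long>> apps = new ConcurrentHashMap<>();

    /** "here I am": register or renew an instance. */
    void heartbeat(String app, String instanceId, long nowMillis) {
        apps.computeIfAbsent(app, a -> new ConcurrentHashMap<>()).put(instanceId, nowMillis);
    }

    /** "where is everyone": live instances of an app, evicting any that missed 3 renewals. */
    List<String> instances(String app, long nowMillis) {
        Map<String, Long> byId = apps.get(app);
        if (byId == null) {
            return List.of();
        }
        long expiryMillis = RENEW_INTERVAL.toMillis() * MAX_MISSED;
        byId.values().removeIf(last -> nowMillis - last > expiryMillis);
        return new ArrayList<>(byId.keySet());
    }
}
```

Because instances must keep renewing, a crashed instance disappears from "where is everyone" answers without anyone having to deregister it.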

Security Key Management

•  AKMS
–  Dynamic Key Management interface
–  Update AWS keys at runtime, no restart
–  All keys stored securely, none on disk or in AMI

•  Cryptex - Flexible key store
–  Low grade keys processed in client
–  Medium grade keys processed by Cryptex service
–  High grade keys processed by hardware (Ingrian)

AWS Persistence Services

•  SimpleDB
–  Got us started, migrated to Cassandra now
–  NFSDB - Instrumented wrapper library
–  Domain and Item sharding (workarounds)

•  S3
–  Upgraded/Instrumented JetS3t based interface
–  Supports multipart upload and 5TB files
–  Global S3 endpoint management

Netflix Platform Persistence

•  Ephemeral Volatile Cache – evcache
–  Discovery-aware memcached based backend
–  Client abstractions for zone aware replication
–  Option to write to all zones, fast read from local

•  Cassandra
–  Highly available and scalable (more later…)

•  MongoDB
–  Complex object/query model for small scale use

•  MySQL
–  Hard to scale, legacy and small relational models
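The "write to all zones, fast read from local" option can be sketched with one in-memory map standing in for each zone's memcached cluster. The names are hypothetical, and this is not the evcache client API. Reads try the local zone first and fall back to the other zones on a miss.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of zone-aware cache replication: write to every zone, read local first. */
final class ZoneAwareCache {
    private final String localZone;
    private final Map<String, Map<String, String>> zones = new LinkedHashMap<>();

    ZoneAwareCache(String localZone, String... allZones) {
        this.localZone = localZone;
        for (String z : allZones) {
            zones.put(z, new ConcurrentHashMap<>()); // stands in for one zone's cache cluster
        }
        if (!zones.containsKey(localZone)) {
            throw new IllegalArgumentException("local zone not listed");
        }
    }

    /** Write to all zones so any zone can serve a later read. */
    void set(String key, String value) {
        for (Map<String, String> cache : zones.values()) {
            cache.put(key, value);
        }
    }

    /** Fast path: local zone. Slow path: the other zones. */
    Optional<String> get(String key) {
        String v = zones.get(localZone).get(key);
        if (v != null) {
            return Optional.of(v);
        }
        for (Map.Entry<String, Map<String, String>> e : zones.entrySet()) {
            if (e.getKey().equals(localZone)) {
                continue;
            }
            v = e.getValue().get(key);
            if (v != null) {
                return Optional.of(v);
            }
        }
        return Optional.empty();
    }

    /** Simulates losing one zone's cache contents (e.g. a zone outage). */
    void clearZone(String zone) {
        zones.get(zone).clear();
    }
}
```

The trade-off is extra write traffic in exchange for local-latency reads and surviving the loss of a zone's cache.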

Priam – Cassandra Automation
Available at http://github.com/netflix

•  Netflix Platform Tomcat Code
•  Zero touch auto-configuration
•  State management for Cassandra JVM
•  Token allocation and assignment
•  Broken node auto-replacement
•  Full and incremental backup to S3
•  Restore sequencing from S3
•  Grow/Shrink Cassandra “ring”
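Token allocation spreads nodes evenly around the Cassandra ring. This sketch shows the basic arithmetic for the RandomPartitioner token space [0, 2^127): node i of n gets i × 2^127 / n. Priam's real allocator has more detail, such as per-region offsets and zone-interleaved ordering, and that logic is not reproduced here. `RingTokens` is a hypothetical name.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

/** Sketch of evenly spaced token allocation over the RandomPartitioner ring. */
final class RingTokens {
    static final BigInteger RING_SIZE = BigInteger.TWO.pow(127);

    /** Node i of nodeCount gets token i * 2^127 / nodeCount, so each owns an equal share. */
    static List<BigInteger> evenTokens(int nodeCount) {
        if (nodeCount <= 0) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        List<BigInteger> tokens = new ArrayList<>(nodeCount);
        BigInteger n = BigInteger.valueOf(nodeCount);
        for (int i = 0; i < nodeCount; i++) {
            tokens.add(RING_SIZE.multiply(BigInteger.valueOf(i)).divide(n));
        }
        return tokens;
    }
}
```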

Astyanax
Available at http://github.com/netflix

•  Cassandra java client
•  API abstraction on top of Thrift protocol
•  “Fixed” Connection Pool abstraction (vs. Hector)
–  Round robin with Failover
–  Retry-able operations not tied to a connection
–  Netflix PaaS Discovery service integration
–  Host reconnect (fixed interval or exponential backoff)
–  Token aware to save a network hop – lower latency
–  Latency aware to avoid compacting/repairing nodes – lower variance
•  Batch mutation: set, put, delete, increment
•  Simplified use of serializers via method overloading (vs. Hector)
•  ConnectionPoolMonitor interface for counters and tracers
•  Composite Column Names replacing deprecated SuperColumns
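Two of the connection pool behaviours above, round robin with failover and exponential backoff before a failed host is retried, can be sketched together. Hosts are plain strings and the operation is a function of the host. `RoundRobinPool` is a hypothetical illustration, not Astyanax's implementation or API, and time is passed in explicitly to keep it testable.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Sketch of round-robin host selection with failover and exponential reconnect backoff. */
final class RoundRobinPool {
    private final List<String> hosts;
    private final long baseBackoffMillis;
    private final long maxBackoffMillis;
    private final Map<String, Integer> failures = new HashMap<>();
    private final Map<String, Long> retryAt = new HashMap<>();
    private int next = 0;

    RoundRobinPool(List<String> hosts, long baseBackoffMillis, long maxBackoffMillis) {
        this.hosts = new ArrayList<>(hosts);
        this.baseBackoffMillis = baseBackoffMillis;
        this.maxBackoffMillis = maxBackoffMillis;
    }

    /** Backoff doubles per consecutive failure: base, 2*base, 4*base, ... capped at max. */
    long backoffMillis(int consecutiveFailures) {
        int shift = Math.max(0, Math.min(consecutiveFailures - 1, 30)); // keep the shift small
        return Math.min(maxBackoffMillis, baseBackoffMillis << shift);
    }

    /** Tries each host at most once, starting at the round-robin position; skips hosts in backoff. */
    <T> T execute(Function<String, T> op, long nowMillis) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < hosts.size(); attempt++) {
            String host = hosts.get(next);
            next = (next + 1) % hosts.size();
            if (retryAt.getOrDefault(host, 0L) > nowMillis) {
                continue; // host is still backing off
            }
            try {
                T result = op.apply(host);
                failures.remove(host);
                retryAt.remove(host);
                return result;
            } catch (RuntimeException e) {
                int n = failures.merge(host, 1, Integer::sum);
                retryAt.put(host, nowMillis + backoffMillis(n));
                last = e; // fail over to the next host
            }
        }
        throw new IllegalStateException("all hosts failed or are backing off", last);
    }
}
```

A failing host costs one attempt and is then skipped until its backoff expires, so callers keep getting answers from the healthy hosts.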

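Astyanax Query Example

Paginate through all columns in a row

    import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
    import com.netflix.astyanax.model.Column;
    import com.netflix.astyanax.model.ColumnList;
    import com.netflix.astyanax.query.RowQuery;
    import com.netflix.astyanax.util.RangeBuilder;

    // keyspace (a Keyspace) and CF_STANDARD1 (a ColumnFamily<String, String>)
    // are assumed to have been set up elsewhere.
    ColumnList<String> columns;
    int pageSize = 10;

    try {
        RowQuery<String, String> query = keyspace
            .prepareQuery(CF_STANDARD1)
            .getKey("A")
            .setIsPaginating()
            .withColumnRange(new RangeBuilder().setMaxSize(pageSize).build());

        // Each execute() returns the next page of up to pageSize columns;
        // an empty page means the whole row has been read.
        while (!(columns = query.execute().getResult()).isEmpty()) {
            for (Column<String> c : columns) {
                // Process each column here, e.g. c.getName(), c.getStringValue()
            }
        }
    } catch (ConnectionException e) {
        // The connection pool could not complete the request; log or retry here.
    }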

High Availability

•  Cassandra stores 3 local copies, 1 per zone
–  Synchronous access, durable, highly available
–  Read/Write One fastest, least consistent - ~1ms
–  Read/Write Quorum 2 of 3, consistent - ~3ms

•  AWS Availability Zones
–  Separate buildings
–  Separate power etc.
–  Fairly close together
“Traditional” Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Not Token Aware

[Diagram: a non-token-aware client and six Cassandra nodes, each with its own disks, spread across Zones A, B and C]

1.  Client writes to any Cassandra node
2.  Coordinator node replicates to nodes and zones
3.  Nodes return ack to coordinator
4.  Coordinator returns ack to client
5.  Data written to internal commit log disk (no more than 10 seconds later)

If a node goes offline, hinted handoff completes the write when the node comes back up.

Requests can choose to wait for one node, a quorum, or all nodes to ack the write.

SSTable disk writes and compactions occur asynchronously.

Astyanax - Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Token Aware

[Diagram: a token-aware client and six Cassandra nodes, each with its own disks, spread across Zones A, B and C]

1.  Client writes to Cassandra nodes and zones
2.  Nodes return ack to client
3.  Data written to internal commit log disks (no more than 10 seconds later)

If a node goes offline, hinted handoff completes the write when the node comes back up.

Requests can choose to wait for one node, a quorum, or all nodes to ack the write.

SSTable disk writes and compactions occur asynchronously.

Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum

[Diagram: US and EU regions, each with clients and six Cassandra nodes spread across Zones A, B and C]

1.  Client writes to local replicas
2.  Local write acks returned to Client, which continues when 2 of 3 local nodes are committed
3.  Local coordinator writes to remote coordinator (100+ms latency)
4.  When data arrives, remote coordinator node acks and copies to other remote zones
5.  Remote nodes ack to local coordinator
6.  Data flushed to internal commit log disks (no more than 10 seconds later)

If a node or region goes offline, hinted handoff completes the write when the node comes back up.

Nightly global compare and repair jobs ensure everything stays consistent.

Part 2. Running the Show

Operator Viewpoint

Rules of the Roadie

•  Don’t lose stuff
•  Make sure it scales
•  Figure out when it breaks and what broke
•  Yell at the right guy to fix it
•  Keep everything organized

Cassandra Backup

[Diagram: a ring of Cassandra nodes backing up to S3]

•  Full Backup
–  Time based snapshot
–  SSTable compress -> S3

•  Incremental Backup
–  SSTable write triggers compressed copy to S3

•  Archive
–  Copy cross region

ETL for Cassandra

•  Data is de-normalized over many clusters!
•  Too many to restore from backups for ETL
•  Solution – read backup files using Hadoop
•  Aegisthus
–  http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
–  High throughput raw SSTable processing
–  Re-normalizes many clusters to a consistent view
–  Extract, Transform, then Load into Teradata

Cassandra Archive
Appropriate level of paranoia needed…

•  Archive could be un-readable
–  Restore S3 backups weekly from prod to test, and daily ETL

•  Archive could be stolen
–  PGP Encrypt archive

•  AWS East Region could have a problem
–  Copy data to AWS West

•  Production AWS Account could have an issue
–  Separate Archive account with no-delete S3 ACL

•  AWS S3 could have a global problem
–  Create an extra copy on a different cloud vendor…

Tools and Automation

•  Developer and Build Tools
–  Jira, Perforce, Eclipse, Jenkins, Ivy, Artifactory
–  Builds, creates .war file, .rpm, bakes AMI and launches

•  Custom Netflix Application Console
–  AWS Features at Enterprise Scale (hide the AWS security keys!)
–  Auto Scaler Group is unit of deployment to production

•  Open Source + Support
–  Apache, Tomcat, Cassandra, Hadoop
–  Datastax support for Cassandra, AWS support for Hadoop via EMR

•  Monitoring Tools
–  Alert processing gateway into Pagerduty
–  AppDynamics – Developer focus for cloud http://appdynamics.com

Scalability Testing

•  Cloud Based Testing – frictionless, elastic
–  Create/destroy any sized cluster in minutes
–  Many test scenarios run in parallel

•  Test Scenarios
–  Internal app specific tests
–  Simple “stress” tool provided with Cassandra

•  Scale test, keep making the cluster bigger
–  Check that tooling and automation works…
–  How many ten column row writes/sec can we do?

<DrEvil>ONE MILLION</DrEvil>

Scale-Up Linearity
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

[Chart: Client Writes/s by node count – Replication Factor = 3. Measured points of 174,373, 366,828, 537,172 and 1,099,837 client writes/s, plotted against a node-count axis running from 0 to 350.]

Availability and Resilience

Chaos Monkey

•  Computers (Datacenter or AWS) randomly die
–  Fact of life, but too infrequent to test resiliency

•  Test to make sure systems are resilient
–  Allow any instance to fail without customer impact

•  Chaos Monkey hours
–  Monday-Thursday 9am-3pm random instance kill

•  Application configuration option
–  Apps now have to opt-out from Chaos Monkey
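The scheduling and opt-out rules above can be sketched as a small policy check. Everything here is hypothetical: the class name, how opt-outs are stored, treating "9am-3pm" as 9:00 up to but not including 15:00, and using `java.util.Random` to pick an instance. A real tool would also call the cloud API to terminate the chosen instance, which is left out.

```java
import java.time.DayOfWeek;
import java.time.LocalDateTime;
import java.util.List;
import java.util.Optional;
import java.util.Random;
import java.util.Set;

/** Sketch of Chaos Monkey scheduling: weekday hours, opt-out list, random victim. */
final class ChaosSchedule {
    /** Monday-Thursday, 9:00 up to (not including) 15:00. */
    static boolean inChaosHours(LocalDateTime t) {
        int day = t.getDayOfWeek().getValue(); // Monday = 1 ... Sunday = 7
        boolean mondayToThursday = day >= DayOfWeek.MONDAY.getValue()
                && day <= DayOfWeek.THURSDAY.getValue();
        int hour = t.getHour();
        return mondayToThursday && hour >= 9 && hour < 15;
    }

    /** Picks one random instance to kill, unless outside hours, opted out, or empty. */
    static Optional<String> pickVictim(String app, List<String> instances,
                                       Set<String> optedOutApps, LocalDateTime now, Random rng) {
        if (!inChaosHours(now) || optedOutApps.contains(app) || instances.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(instances.get(rng.nextInt(instances.size())));
    }
}
```

Running only during staffed weekday hours means engineers are around to notice and fix whatever the monkey breaks.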

Responsibility and Experience

•  Make developers responsible for failures
–  Then they learn and write code that doesn’t fail

•  Use Incident Reviews to find gaps to fix
–  Make sure it’s not about finding “who to blame”

•  Keep timeouts short, fail fast
–  Don’t let cascading timeouts stack up

•  Make configuration options dynamic
–  You don’t want to push code to tweak an option
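Dynamic configuration means code reads a property each time it is used instead of caching it at startup, so an operator can shorten a client timeout at runtime without pushing new code. A minimal sketch with a hypothetical `DynamicConfig` store and property name. In practice a configuration service would push updates into something like this, and this is not Netflix's configuration library.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.IntConsumer;

/** Sketch of runtime-changeable integer properties with change listeners. */
final class DynamicConfig {
    private final Map<String, Integer> values = new ConcurrentHashMap<>();
    private final Map<String, CopyOnWriteArrayList<IntConsumer>> listeners = new ConcurrentHashMap<>();

    /** Current value, or the default if the property has never been set. Read on every use. */
    int getInt(String name, int defaultValue) {
        return values.getOrDefault(name, defaultValue);
    }

    /** Called when an update arrives; notifies anyone watching the property. */
    void set(String name, int value) {
        values.put(name, value);
        for (IntConsumer l : listeners.getOrDefault(name, new CopyOnWriteArrayList<>())) {
            l.accept(value);
        }
    }

    /** Registers a callback, e.g. to resize a connection pool when its size changes. */
    void onChange(String name, IntConsumer listener) {
        listeners.computeIfAbsent(name, k -> new CopyOnWriteArrayList<>()).add(listener);
    }
}
```

A client would call `getInt("myclient.timeoutMillis", 500)` on each request (the property name is illustrative), so a change takes effect on the next call.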

Resilient Design – Circuit Breakers
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
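A circuit breaker turns "keep timeouts short, fail fast" into a mechanism. After a run of consecutive failures, the circuit opens and calls go straight to a fallback instead of waiting on a sick dependency. After a cool-down, one trial call is allowed through. This minimal sketch uses assumed thresholds and hypothetical names, and it is not the API described in the linked post. Time is passed in explicitly for testability. The methods are `synchronized` only to keep the sketch simple; a real breaker would avoid holding a lock across the remote call.

```java
import java.util.function.Supplier;

/** Minimal circuit breaker sketch: closed -> open after N failures -> trial call after cool-down. */
final class CircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the circuit is closed

    CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized boolean isOpen(long nowMillis) {
        return openedAt >= 0 && nowMillis - openedAt < openMillis;
    }

    synchronized <T> T call(Supplier<T> action, Supplier<T> fallback, long nowMillis) {
        if (isOpen(nowMillis)) {
            return fallback.get(); // fail fast without touching the dependency
        }
        try {
            T result = action.get(); // normal call, or the trial call after the cool-down
            consecutiveFailures = 0;
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = nowMillis; // open (or re-open after a failed trial)
            }
            return fallback.get();
        }
    }
}
```

The fallback (a cached value, a default list, an empty result) is what keeps a dependency failure from cascading into a customer-visible outage.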

PaaS Operational Model

•  Developers
–  Provision and run their own code in production
–  Take turns to be on call if it breaks (pagerduty)
–  Configure autoscalers to handle capacity needs

•  DevOps and PaaS (aka NoOps)
–  DevOps is used to build and run the PaaS
–  PaaS constrains Dev to use automation instead
–  PaaS puts more responsibility on Dev, with tools

What’s Left for Corp IT?

•  Corporate Security and Network Management
–  Billing and remnants of streaming service back-ends in DC

•  Running Netflix’s DVD Business
–  Tens of Oracle instances
–  Hundreds of MySQL instances
–  Thousands of VMWare VMs
–  Zabbix, Cacti, Splunk, Puppet

•  Employee Productivity
–  Building networks and WiFi
–  SaaS OneLogin SSO Portal
–  Evernote Premium, Safari Online Bookshelf, Dropbox for Teams
–  Google Enterprise Apps, Workday HCM/Expense, Box.com
–  Many more SaaS migrations coming…

[Image: Corp WiFi Performance]

Implications for IT Operations

•  Cloud is run by developer organization
–  Product group’s “IT department” is the AWS API and PaaS
–  CorpIT handles billing and some security functions

•  Cloud capacity is 10x bigger than Datacenter
–  Datacenter oriented IT didn’t scale up as we grew
–  We moved a few people out of IT to do DevOps for our PaaS

•  Traditional IT Roles and Silos are going away
–  We don’t have SA, DBA, Storage, Network admins for cloud
–  Developers deploy and “run what they wrote” in production

Netflix PaaS Organization
Developer Org Reporting into Product Development, not ITops

Netflix Cloud Platform Team
•  Cloud Ops Reliability Engineering
–  Alert Routing, Incident Lifecycle, PagerDuty
•  Architecture
–  Future planning, Security Arch, Efficiency, Hyperguard
•  Build Tools and Automation
–  Perforce, Jenkins, Artifactory, JIRA, Base AMI, Bakery, Netflix App Console
•  Platform and Database Engineering
–  Platform jars, Key store, Zookeeper, Cassandra
•  Cloud Performance
–  Cassandra, Benchmarking, JVM GC Tuning, Wiresharking
•  Cloud Solutions
–  Monitoring, Monkeys, Entrypoints
•  On AWS: AWS API, AWS VPC, AWS Instances
Powerpoint ☺

Part 3. Making the Instruments

Builder Viewpoint

Components
•  Continuous build framework turns code into AMIs
•  AWS accounts for test, production, etc.
•  Cloud access gateway
•  Service registry
•  Configuration properties service
•  Persistence services
•  Monitoring, alert forwarding
•  Backups, archives

Common Build Framework
Extracted from “Building and Deploying Netflix in the Cloud” by @bmoyles and @garethbowles
On slideshare.net/netflix

Build Pipeline
[Diagram: source (Perforce, GitHub) → Jenkins running the CBF steps (sync, check, build, test; resolve, compile, publish, report) → Artifactory (libraries) and yum]

Jenkins Architecture
[Diagram: Single Master (Red Hat Linux, 2x quad core x86_64, 26G RAM); Standard slaves (buildnode01 group, x86_64, Amazon Linux, m1.xlarge); Custom slaves (~40 custom slaves maintained by product teams, misc. O/S & architectures); Ad-hoc slaves; locations: us-west-1 VPC, Netflix data center, Netflix data center and office]

Other Uses of Jenkins
•  Maintenance of test and prod Cassandra clusters
•  Automated integration tests for bake and deploy
•  Production bake and deployment
•  Housekeeping of the build / deploy infrastructure

Netflix Extensions to Jenkins
•  Job DSL plugin: allow jobs to be set up with minimal definition, using templates and a Groovy-based DSL
•  Housekeeping and maintenance processes implemented as Jenkins jobs, system Groovy scripts

The DynaSlave Plugin
What We Have
•  Exposes a new endpoint in Jenkins that EC2 instances in VPC use for registration
•  Allows a slave to name itself, label itself, tell Jenkins how many executors it can support
•  EC2 == Ephemeral. Disconnected nodes that are gone for > 30 mins are reaped
•  Sizing handled by EC2 ASGs, tweaks passed through via user data (labels, names, etc)
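The reaping rule (a node disconnected for more than 30 minutes is removed) comes down to a filter over node state. This is a hypothetical sketch. The Node type and its field names are assumptions, not the DynaSlave plugin's code.

```python
from dataclasses import dataclass
from typing import Optional

REAP_AFTER_SECONDS = 30 * 60  # slide: nodes gone for > 30 mins are reaped


@dataclass
class Node:
    name: str
    disconnected_since: Optional[float]  # epoch seconds, or None if connected


def nodes_to_reap(nodes, now: float):
    """Names of nodes that have been disconnected for more than 30 minutes."""
    return [
        n.name
        for n in nodes
        if n.disconnected_since is not None
        and now - n.disconnected_since > REAP_AFTER_SECONDS
    ]
```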

The DynaSlave Plugin
What’s Next
•  Enhanced security/registration of nodes
•  Dynamic resource management
–  have Jenkins respond to build demand
•  Slave groups
–  Allows us to create specialized pools of build nodes
•  Refresh mechanism for slave tools
–  JDKs, Ant versions, etc.
•  Give it back to the community
–  watch techblog.netflix.com!

The Bakery
•  Create base AMIs
–  We have CentOS, Ubuntu and Windows base AMIs
–  All the generic code, apache, tomcat etc.
–  Standard system and application monitoring tools
–  Update ~monthly with patches and new versions
•  Add yummy topping and bake
–  Build app specific AMI including all code etc.
–  Bakery mounts EBS snapshot, installs and bakes
–  One bakery per region, delivers into paastest
–  Tweak config and publish AMI to paasprod

Accounts Isolate Concerns
•  paastest – for development and testing
–  Fully functional deployment of all services
–  Developer tagged “stacks” for separation
•  paasprod – for production
–  Autoscale groups only, isolated instances are terminated
–  Alert routing, backups enabled by default
•  paasaudit – for sensitive services
–  To support SOX, PCI, etc.
–  Extra access controls, auditing
•  paasarchive – for disaster recovery
–  Long term archive of backups
–  Different region, perhaps different vendor

Reservations and Billing
•  Consolidated Billing
–  Combine all accounts into one bill
–  Pooled capacity for bigger volume discounts
http://docs.amazonwebservices.com/AWSConsolidatedBilling/1.0/AWSConsolidatedBillingGuide.html
•  Reservations
–  Save up to 71% on your baseline load
–  Priority when you request reserved capacity
–  Unused reservations are shared across accounts
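The reservation trade-off is simple arithmetic. Reservations cover the baseline every hour whether or not they are used, and usage above the baseline is billed on demand. The sketch uses hypothetical rates and a flat hourly discount. Real reserved instance pricing also involves upfront fees and terms, and 71% is only the slide's "up to" figure. Pooling usage across accounts models the point about sharing unused reservations.

```python
def pooled_usage(per_account_usage):
    """Sum running instances across accounts hour by hour.

    Consolidated billing lets reservations bought in one account cover
    usage in another, so the pooled total is what reservations apply to.
    """
    return [sum(hour) for hour in zip(*per_account_usage)]


def total_cost(hourly_usage, on_demand_rate, reserved_count, reserved_discount):
    """Blended cost sketch with a reserved baseline (hypothetical pricing model)."""
    reserved_rate = on_demand_rate * (1 - reserved_discount)
    cost = 0.0
    for used in hourly_usage:
        cost += reserved_count * reserved_rate                  # paid whether used or not
        cost += max(0, used - reserved_count) * on_demand_rate  # peak above baseline
    return cost


usage = pooled_usage([[4, 6, 12], [6, 4, 8]])  # two accounts, three hours
```

With 10 reservations at a hypothetical 50% discount, this pooled usage costs 25.0 units, against 40.0 units fully on demand.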

Cloud Access Gateway
•  Datacenter or office based
–  A separate VM for each AWS account
–  Two per account for high availability
–  Mount NFS shared home directories for developers
–  Instances trust the gateway via a security group
•  Manage how developers login to cloud
–  Access control via ldap group membership
–  Audit logs of every login to the cloud
–  Similar to awsfabrictasks ssh wrapper
http://readthedocs.org/docs/awsfabrictasks/en/latest/

Cloud Access Control
[Diagram: developers ssh through the Cloud Access Gateway to each production cluster, logging in as that cluster’s userid: www-prod (userid wwwprod), Dal-prod (userid dalprod), Cass-prod (userid cassprod). Security groups don’t allow ssh between instances.]
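The gateway's access rule can be sketched as follows: LDAP group membership decides which production userid a developer may use, and every login attempt is audited. The group names are hypothetical. The sketch assumes the LDAP lookup has already happened, so the function receives the developer's group list.

```python
import logging

audit_log = logging.getLogger("cloud-access-gateway")

# Hypothetical mapping: LDAP group -> production userid it may log in as.
GROUP_TO_USERID = {
    "www-prod-admins": "wwwprod",
    "dal-prod-admins": "dalprod",
    "cass-prod-admins": "cassprod",
}


def authorize_login(developer: str, ldap_groups, target_userid: str) -> bool:
    """Allow the login only if one of the developer's groups grants the
    target userid. Every attempt, allowed or not, goes to the audit log."""
    allowed = any(GROUP_TO_USERID.get(g) == target_userid for g in ldap_groups)
    audit_log.info("login developer=%s userid=%s allowed=%s",
                   developer, target_userid, allowed)
    return allowed
```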

Now Add Code
Netflix has open sourced a lot of what you need, more is on the way…

Netflix Open Source Strategy
•  Release PaaS Components git-by-git
–  Source at github.com/netflix – we build from it…
–  Intros and techniques at techblog.netflix.com
–  Blog post or new code every few weeks
•  Motivations
–  Give back to Apache licensed OSS community
–  Motivate, retain, hire top engineers
–  “Peer pressure” code cleanup, external contributions

Open Source Projects and Posts
Legend: Github / Techblog; Apache Contributions; Techblog Post; Coming Soon
•  Priam – Cassandra as a Service
•  Exhibitor – Zookeeper as a Service
•  Servo and Autoscaling Scripts
•  Astyanax – Cassandra client for Java
•  Curator – Zookeeper Patterns
•  Honu – Log4j streaming to Hadoop
•  EVCache – Memcached as a Service
•  CassJMeter – Cassandra test suite
•  Circuit Breaker – Robust service pattern
•  Cassandra – Multi-region EC2 datastore support
•  Asgard – AutoScaleGroup based AWS console
•  Discovery – Service Directory
•  Aegisthus – Hadoop ETL for Cassandra
•  Configuration Properties Service
•  Chaos Monkey – Robustness verification

Asgard
Not quite out yet…
•  Runs in a VM in our datacenter
–  So it can deploy to an empty account
–  Groovy/Grails/JVM based
–  Supports all AWS regions on a global basis
•  Hides the AWS credentials
–  Use AWS IAM to issue restricted keys for Asgard
–  Each Asgard instance manages one account
–  One install each for paastest, paasprod, paasaudit

“Discovery” – Service Directory
•  Map an instance to a service type
–  Load balance over clusters of instances
–  Private namespace, so DNS isn’t useful
–  Foundation service, first to deploy
•  Highly available distributed coordination
–  Deploy one Apache Zookeeper instance per zone
–  Netflix Curator includes simple discovery service
–  Netflix Exhibitor manages Zookeeper reliably
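Here is a minimal in-memory sketch of what the Discovery slide describes: instances register under a service name, and clients are load balanced over them. The class, the method names and the round-robin policy are assumptions for illustration. This is not the Discovery or Curator API, and it leaves out the Zookeeper-backed high availability.

```python
import itertools
from collections import defaultdict


class ServiceRegistry:
    """In-memory service directory sketch (hypothetical, not a Netflix API)."""

    def __init__(self):
        self._instances = defaultdict(list)           # service -> ["host:port", ...]
        self._counters = defaultdict(itertools.count)  # per-service round-robin counter

    def register(self, service: str, address: str) -> None:
        if address not in self._instances[service]:
            self._instances[service].append(address)

    def deregister(self, service: str, address: str) -> None:
        if address in self._instances[service]:
            self._instances[service].remove(address)

    def choose(self, service: str) -> str:
        """Round-robin over the currently registered instances of `service`."""
        instances = self._instances.get(service)
        if not instances:
            raise LookupError(f"no instances registered for {service!r}")
        return instances[next(self._counters[service]) % len(instances)]
```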

Configuration Properties Service
•  Dynamic, hierarchical & propagates in seconds
–  Client timeouts, feature set enables
–  Region specific service endpoints
–  Cassandra token assignments etc. etc.
•  Used to configure everything
–  So everything depends on it…
–  Coming soon to github
–  Pluggable backend storage interface
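Hierarchical, dynamic properties can be sketched as a lookup that walks from the most specific scope to the most general one, with listeners notified on every change. The scope naming (for example "api:us-east-1") and the property key are hypothetical. The sketch does not reproduce the Netflix properties service or its storage backend.

```python
class DynamicProperties:
    """Hypothetical sketch of hierarchical, dynamic configuration.

    Lookups try the most specific scope first and fall back to broader ones.
    Updates take effect on the next lookup and notify listeners, so tweaking
    an option needs no code push.
    """

    def __init__(self):
        self._values = {}      # (scope, key) -> value
        self._listeners = []

    def set(self, scope, key, value):
        self._values[(scope, key)] = value
        for listener in self._listeners:
            listener(scope, key, value)

    def on_change(self, listener):
        self._listeners.append(listener)

    def get(self, key, scopes, default=None):
        # `scopes` is ordered most specific first, e.g. ["api:us-east-1", "api", "global"]
        for scope in scopes:
            if (scope, key) in self._values:
                return self._values[(scope, key)]
        return default
```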

Persistence services
•  Use SimpleDB as a bootstrap
–  Good use case for DynamoDB or SimpleDB
•  Netflix Priam
–  Cassandra automation

Monitoring, alert forwarding
•  Multiple monitoring systems
–  Internally developed data collection runs on AWS
–  AppDynamics APM product runs as external SaaS
–  When one breaks the other is usually OK…
•  Alerts routed to the developer of that app
–  Alert gateway combines alerts from all sources
–  Deduplication, source quenching, routing
–  Warnings sent via email, critical via pagerduty
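The alert gateway's deduplication and routing steps can be sketched as follows. The dedupe window, the owners mapping, the fallback owner and the email/pager callables are all assumptions. Source quenching is left out.

```python
import time


class AlertGateway:
    """Hypothetical alert gateway sketch: dedupe, route to the app's owner,
    warnings by email and critical alerts by pager."""

    def __init__(self, owners, send_email, send_page, dedupe_window=300.0,
                 clock=time.monotonic):
        self.owners = owners            # app name -> developer / on-call contact
        self.send_email = send_email    # callables supplied by the caller
        self.send_page = send_page
        self.dedupe_window = dedupe_window
        self.clock = clock
        self._last_sent = {}            # (app, alert name) -> time last forwarded

    def receive(self, app, name, severity):
        """Forward an alert unless it duplicates one sent within the window."""
        key = (app, name)
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.dedupe_window:
            return False                # duplicate: drop it
        self._last_sent[key] = now
        owner = self.owners.get(app, "cloud-ops")  # hypothetical fallback owner
        if severity == "critical":
            self.send_page(owner, app, name)
        else:
            self.send_email(owner, app, name)
        return True
```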

Backups, archives
•  Cassandra Backup via Priam to S3 bucket
–  Create versioned S3 bucket with TTL option
–  Setup service to encrypt and copy to archive
•  Archive Account with Read/Write ACL to prod
–  Setup in a different AWS region from production
–  Create versioned S3 bucket with TTL option

Chaos Monkey
•  Install it on day 1 in test and production
•  Prevents people from doing local persistence
•  Kill anything not protected by an ASG
•  Supports whitelist for temporary do-not-kill
•  Open source soon, code cleanup in progress…
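One way to read "kill anything not protected by an ASG" is to look for instances that no Auto Scaling group would replace. EC2 Auto Scaling tags the instances it launches with aws:autoscaling:groupName. The sketch below uses the absence of that tag, plus the temporary whitelist, to find candidates. The function and its inputs are hypothetical, not Chaos Monkey's code.

```python
ASG_TAG = "aws:autoscaling:groupName"  # tag EC2 Auto Scaling adds to instances it launches


def unprotected_instances(instances: dict, whitelist: set) -> list:
    """Instance ids that no Auto Scaling group would replace and that are
    not on the temporary do-not-kill whitelist.

    `instances` maps instance id -> dict of tags.
    """
    return sorted(
        iid
        for iid, tags in instances.items()
        if ASG_TAG not in tags and iid not in whitelist
    )
```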

You take it from here…
•  Keep watching github for more goodies
•  Add your own code
•  Let us know what you find useful
•  Bugs, patches and additions all welcome
•  See you at AWS Re:Invent?

Roadmap for 2012
•  More resiliency and improved availability
•  More automation, orchestration
•  “Hardening” the platform, code clean-up
•  Lower latency for web services and devices
•  IPv6 support
•  More open sourced components

Wrap Up
Answer your remaining questions…
What was missing that you wanted to cover?

Takeaway
Netflix has built and deployed a scalable global Platform as a Service.
Key components of the Netflix PaaS are being released as Open Source projects so you can build your own custom PaaS.
http://github.com/Netflix
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud

End of Part 3 of 3

You want an Encore?
If there is enough time… (there wasn’t)

Something for the hard core complex adaptive systems people to digest.
A Discussion of Workloads and How They Behave

Workload Characteristics
•  A quick tour through a taxonomy of workload types
•  Start with the easy ones and work up
•  Why personalized workloads are different and hard
•  Some examples and coping strategies

5/15/12
Slide
254

Simple Random Arrivals
•  Random arrival of transactions with fixed mean service time
–  Little’s Law: QueueLength = Throughput * Response
–  Utilization Law: Utilization = Throughput * ServiceTime
•  Complex models are often reduced to this model
–  By averaging over longer time periods since the formulas only work if you have stable averages
–  By wishful thinking (i.e. how to fool yourself)
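Both laws on this slide are simple products. The example below plugs in hypothetical measurements: 200 requests/s, a 50 ms mean response time and a 4 ms mean service time. As the slide warns, numbers like these only mean something when the averages are stable over the measurement period.

```python
def queue_length(throughput: float, response_time: float) -> float:
    """Little's Law: mean number in the system = throughput * response time."""
    return throughput * response_time


def utilization(throughput: float, service_time: float) -> float:
    """Utilization Law: fraction of time the server is busy = throughput * service time."""
    return throughput * service_time


# Hypothetical measurements: 200 requests/s, 50 ms mean response, 4 ms mean service time.
in_flight = queue_length(200, 0.050)  # about 10 requests in the system on average
busy = utilization(200, 0.004)        # about 0.8, i.e. the server is 80% busy
```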


Mixed random arrivals of transactions with stable mean service times
•  Think of the grocery store checkout analogy
–  Trolleys full of shopping vs. baskets full of shopping
–  Baskets are quick to service, but get stuck behind carts
–  Relative mixture of transaction types starts to matter
•  Many transactional systems handle a mixture
–  Databases, web services
•  Consider separating fast and slow transactions
–  So that we have a “10 items or less” line just for baskets
–  Separate pools of servers for different services
–  The old rule - don’t mix OLTP with DSS queries in databases
•  Performance is often thread-limited
–  Thread limit and slow transactions constrain maximum throughput
•  Model mix using analytical solvers (e.g. PDQ perfdynamics.com)
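The grocery analogy can be checked with a tiny first-in, first-out queue. The arrival and service times below are hypothetical: one trolley that takes 10 time units and two baskets that take 1 each. In a shared line each basket waits 9 units behind the trolley. In a separate express line neither basket waits.

```python
def fifo_waits(jobs):
    """Waiting time of each job in a single first-in, first-out lane.

    jobs: list of (arrival_time, service_time) tuples, sorted by arrival time.
    """
    waits = []
    free_at = 0  # time at which the server finishes its current work
    for arrival, service in jobs:
        start = max(arrival, free_at)
        waits.append(start - arrival)
        free_at = start + service
    return waits


cart = (0, 10)              # hypothetical trolley: arrives at t=0, takes 10 units
baskets = [(1, 1), (2, 1)]  # two baskets arriving at t=1 and t=2, 1 unit each

shared = fifo_waits([cart] + baskets)  # one line for everyone
express = fifo_waits(baskets)          # "10 items or less" line just for baskets
```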


Load dependent servers – varying mean service times
•  Mean service time may increase at high throughput
–  Due to non-scalable algorithms, lock contention
–  System runs out of memory and starts paging or frequent GC
•  Mean service time may also decrease at high throughput
–  Elevator seek and write cancellation optimizations in storage
–  Load shedding and simplified fallback modes
•  Systems have “tipping points” if the service time increases
–  Hysteresis means they don’t come back when load drops
–  This is why you have to kill catatonic systems
–  Best designs shed load to be stable at the limit – circuit breaker pattern
–  Practical option is to try to avoid tipping points by reducing variance
•  Model using discrete event simulation tools
–  Behaviour is non-linear and hard to model
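A toy discrete-time model shows both the tipping point with hysteresis and the load shedding cure. All numbers are hypothetical. The server finishes 10/(1 + backlog/50) units of work per tick, so its effective service time grows with the backlog. Once a burst pushes the backlog above 12.5, capacity drops below the normal arrival rate of 8. From then on the backlog keeps growing even after the burst ends. Capping the queue, which sheds the excess work, keeps the system stable.

```python
def simulate(arrivals, max_queue=None):
    """Toy model of a load-dependent server (hypothetical numbers).

    Each tick the server can finish 10 / (1 + backlog/50) units of work, so
    a growing backlog (memory pressure, GC, lock contention) slows it down.
    If `max_queue` is set, work beyond it is shed instead of queued.
    Returns the backlog after each tick.
    """
    backlog = 0.0
    history = []
    for arriving in arrivals:
        capacity = 10.0 / (1.0 + backlog / 50.0)
        backlog += arriving
        if max_queue is not None:
            backlog = min(backlog, float(max_queue))  # shed the excess load
        backlog = max(0.0, backlog - capacity)
        history.append(backlog)
    return history


load = [8] * 5 + [20] * 10 + [8] * 20  # normal load, a burst, then normal again
no_shedding = simulate(load)
with_shedding = simulate(load, max_queue=10)
```

In this run the no_shedding backlog keeps climbing after the burst ends. The with_shedding backlog stays at zero, because the work beyond max_queue was dropped during the burst.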


Self-similar / fractal workloads
•  Bursty rather than random arrival rates
•  Self-similar
–  Looks “random” at close up, stays “random” as you zoom out
–  Work arrives in bursts, transactions aren’t independent
–  Bursts cluster together in super-bursts, etc.
•  Network packet streams tend to be fractal
•  Common in practice, too hard to model
–  Probably the most common reason why your model is wrong!


State Dependent Service Workloads
•  Personalized services that store user state/history
–  Transactions for new users are quick
–  Transactions for users with lots of state/history are slower
–  As user base builds state and ages you get into trouble…
•  Social Networks, Recommendation Services
–  Facebook, Flickr, Netflix, Twitter etc.
•  “Abandon hope all ye who enter here”
–  Not tractable to model, repeatable tests are tricky
–  Long fat tail response time distribution and timeouts
•  Try to transform workloads to more tractable forms


Example - Twitter Workload
•  @adrianco tweets – copy to 4300 or so other users
•  @zoecello tweets many times a day – to over 1M users
•  @barackobama tweets every few days – to over 12M users
•  It’s the same transaction, but the service time varies by several orders of magnitude
•  The best (most active and connected = most valuable) users trigger a “denial of service attack” on the systems when they tweet
•  Cascading effect as many others re-tweet
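If we assume the work for a tweet is proportional to the number of follower timelines it is copied to, the spread between these accounts is easy to quantify. The follower counts are the slide's approximate figures, and the 1M and 12M values are lower bounds. The proportional-cost model is a simplification.

```python
import math

# Approximate follower counts from the slide (1M and 12M are lower bounds).
followers = {"@adrianco": 4_300, "@zoecello": 1_000_000, "@barackobama": 12_000_000}


def fanout_ratio(a: str, b: str) -> float:
    """How many more copies one tweet from `a` creates than one from `b`,
    assuming work is proportional to follower count."""
    return followers[a] / followers[b]


ratio = fanout_ratio("@barackobama", "@adrianco")
orders_of_magnitude = math.log10(ratio)
```

12,000,000 / 4,300 is about 2,791. That is more than three orders of magnitude between two executions of the same transaction.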


Example - Netflix Movie Choosing
•  “Pick 24 genres/subgenres etc. of 75 movies each for me”
–  used by TV based devices like Xbox360, PS/3, iPhone app
•  New user
–  No history of what they have rented (DVD) or streamed
–  No star ratings for movies, possibly some genre ratings
–  Basic demographic info
–  Fast to calculate, easy to find many good choices to return
•  User with several years tenure
–  Thousands of movies rented or streamed, “seen it already”
–  Hundreds to thousands of star ratings, lots of genre ratings
–  Requests may time out and return fewer or worse choices


Workload Modelling Survival Methods
•  Simplify the workload algorithms
–  move from hard or impossible to simpler models
–  decouple, cache and pre-compute to get constant service times
•  Stand further away
–  averaging is your friend – gets rid of complex fluctuations
•  Minimalist Models
–  most models are far too complex – the classic beginner’s error…
–  the art of modelling is to only model what really matters
•  Don’t model details you don’t use
–  model peak hour of the week, not day to day fluctuations
–  e.g. “Will the web site survive next Sunday night?”
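"Decouple, cache and pre-compute" means moving the expensive, history-dependent work into an offline batch step, so that the online request becomes a constant-time lookup. The function names, the generic fallback rows and the dict storage below are hypothetical. A real system would store the results in a cache or database.

```python
def build_rows(user_ids, compute_rows):
    """Offline batch step: run the expensive, state-dependent computation
    ahead of time for every user and keep the results."""
    return {uid: compute_rows(uid) for uid in user_ids}


def serve_rows(precomputed, uid, fallback):
    """Online request: a constant-time lookup that does not depend on how much
    history the user has. Users not yet computed get generic fallback rows."""
    return precomputed.get(uid, fallback)
```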

