This document summarizes a talk on high performance network programming on the JVM. The talk discusses choosing between synchronous and asynchronous I/O, with examples of when each approach is best. It also covers how to optimize synchronous I/O on the JVM to maximize throughput. The document provides benchmarks comparing the performance of a simple synchronous memcache client versus an asynchronous one.
High Performance Network Programming on the JVM (OSCON 2012)
2. About Me
• Director of Architecture and Delivery at Urban Airship
• Most of my career biased towards performance and scale
• Java, Python, C++ in service oriented architectures
3. In this Talk
• WTF is an “Urban Airship”?
• Networked Systems on the JVM
• Choosing a framework
• Critical learnings
• Q&A
9. About This Talk
You probably won’t like this talk if you:
• Are willing to give up orders of magnitude in performance
for a slower runtime or language
• Enjoy spending money on virtualized servers (e.g. ec2)
• Think that a startup shouldn’t worry about COGS
• Think that writing code is the hardest part of a developer’s
job
• Think async for all the things
11. Lexicon
What makes something “High Performance”?
• Low Latency - I’m doing an operation that includes a
request/reply
• Throughput - how many operations can I drive through my
architecture at one time?
• Productivity - how quickly can I create a new operation? A
new service?
• Sustainability - when a service breaks, what’s the time to
RCA
• Fault tolerance
12. WTF is an Urban Airship?
• Fundamentally, an engagement platform
• Buzzword compliant - Cloud Service providing an API for
Mobile
• Unified API for services across platforms for messaging,
location, content entitlements, in-app purchase
• SLAs for throughput, latency
• Heavy users and contributors to HBase, ZooKeeper,
Cassandra
14. What is Push?
• Cost
• Throughput and immediacy
• The platform makes it compelling
• Push can be intelligent
• Push can be precisely targeted
• Deeper measurement of user engagement
15. How does this relate to the JVM?
• We deal with lots of heterogeneous connections from the
public network, the vast majority of them are handled by a
JVM
• We perform millions of operations per second across our
LAN
• Billions and billions of discrete system events a day
• Most of those operations are JVM-JVM
18. Distributed Systems on the JDK
• Platform has several tools baked in
• HTTP Client and Server
• RMI (Remote Method Invocation), or better, Jini
• CORBA/IIOP
• JDBC
• Lower level
• Sockets + streams, channels + buffers
• Java 1.4 brought NIO, which included non-blocking I/O
• High performance, high productivity platform when used correctly
• Missing some low-level capabilities
20. Synchronous vs. Async I/O
• Synchronous Network I/O on the JRE
• Sockets (InputStream, OutputStream)
• Channels and Buffers
• Asynchronous Network I/O on the JRE
• Selectors (async)
• Buffers fed to Channels which are asynchronous
• Almost all asynchronous APIs are for Socket I/O
• Can operate on direct, off heap buffers
• Offer decent low-level configuration options
21. Synchronous vs. Async I/O
• Synchronous I/O has many upsides on the JVM
• Clean streaming - good for moving around really large
things
• Sendfile support for MMap’d files
(FileChannel::transferTo)
• Vectored I/O support
• No need for additional SSL abstractions (except for
maybe Keystore cruft)
• No idiomatic impedance for RPC
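A minimal sketch of the FileChannel::transferTo bullet above: handing a file to a socket with sendfile-style zero copy, looping because transferTo may move fewer bytes than asked (file name, host, and port are placeholders):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class SendFileExample {
    public static void main(String[] args) throws IOException {
        try (FileChannel file = FileChannel.open(Paths.get("payload.bin"), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            long position = 0;
            long remaining = file.size();
            // transferTo may transfer fewer bytes than requested, so loop until done.
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```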
30. Synchronous vs. Async I/O
• Synchronous I/O - doing it well
• Buffers all the way down (streams, readers, channels)
• Minimize trips across the system boundary
• Minimize copies of data
• Vector I/O if possible
• MMap if possible
• Favor direct ByteBuffers and NIO Channels
• Netty does support sync. I/O but it feels tedious on that
abstraction
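A minimal sketch of two of the bullets above, direct ByteBuffers plus a vectored (gathering) write, so a length prefix and body leave in a single write call (the framing is illustrative, not a specific UA protocol):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

public class VectoredWriteExample {
    public static void main(String[] args) throws IOException {
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);

        // Direct buffers live off the Java heap, avoiding an extra copy at write time.
        ByteBuffer header = ByteBuffer.allocateDirect(4);
        ByteBuffer body = ByteBuffer.allocateDirect(payload.length);
        header.putInt(payload.length).flip();
        body.put(payload).flip();

        try (SocketChannel ch = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            ByteBuffer[] frame = {header, body};
            // Gathering write: both buffers go down in one write call.
            while (body.hasRemaining()) {
                ch.write(frame);
            }
        }
    }
}
```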
32. Synchronous vs. Async I/O
• Async I/O
• On Linux, implemented via epoll as the “Selector”
abstraction with async Channels
• Async Channels are fed buffers; you have to tend to fully reading/writing them
• Async I/O - doing it well
• Again, favor direct ByteBuffers, especially for large data
• Consider the application - what do you gain by not
waiting for a response?
• Avoid manual TLS operations
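A minimal sketch of the Selector pattern described above: one thread multiplexing many non-blocking channels, with the handler responsible for partial reads (port and buffer size are arbitrary):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class SelectorLoop {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open(); // epoll-backed on Linux
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(9000));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buf = ByteBuffer.allocateDirect(8192);
        while (true) {
            selector.select();
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    buf.clear();
                    int read = client.read(buf); // may be a partial read
                    if (read == -1) {            // peer closed the socket
                        key.cancel();
                        client.close();
                    }
                    // A real handler would accumulate bytes until a full frame arrives.
                }
            }
        }
    }
}
```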
37. Sync vs. Async - FIGHT!
Async I/O Wins:
• Large numbers of clients
• Only way to be notified if a socket is
closed without trying to read it
• Large number of open sockets
• Lightweight proxying of traffic
41. Sync vs. Async - FIGHT!
Async I/O Loses:
• Context switching, CPU cache
pipeline loss can be substantial
overhead for simple protocols
• Not always the best option for raw,
full bore throughput
• Complexity, ability to reason about
code diminished
42. Sync vs. Async - FIGHT!
Async I/O Loses:
http://www.youtube.com/watch?v=bzkRVzciAZg&feature=player_detailpage#t=133s
46. Sync vs. Async - FIGHT!
Sync I/O Wins:
• Simplicity, readability
• Better fit for dumb protocols, less
impedance for request/reply
• Squeezing every bit of throughput
out of a single host, small number of
threads
47. Sync vs. Async - Memcache
• UA uses memcached heavily
• memcached is an awesome example of why choosing
Sync vs. Async is hard
• Puts should always be completely asynchronous
• Reads are fairly useless when done asynchronously
• Protocol doesn’t lend itself well to Async I/O
• For Java clients, we experimented with Xmemcached but
didn’t like its complexity, I/O approach
• Created FSMC (freakin’ simple memcache client)
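FSMC’s source isn’t shown in the deck; as an illustration of why the protocol fits blocking I/O so naturally, here is a hedged sketch of a synchronous memcached text-protocol get over a plain socket (host, key, and the response parsing are simplified):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class SyncMemcacheGet {
    public static void main(String[] args) throws IOException {
        try (Socket s = new Socket("localhost", 11211)) {
            OutputStream out = s.getOutputStream();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(s.getInputStream(), StandardCharsets.US_ASCII));

            // Request: "get <key>\r\n"
            out.write("get somekey\r\n".getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // Response: "VALUE <key> <flags> <bytes>\r\n<data>\r\nEND\r\n" or just "END\r\n"
            String header = in.readLine();
            if (header != null && header.startsWith("VALUE")) {
                String data = in.readLine(); // simplified: assumes no CRLF inside the value
                in.readLine();               // consume the trailing END
                System.out.println("value = " + data);
            } else {
                System.out.println("miss");
            }
        }
    }
}
```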
48. FSMC vs. Xmemcached
[Chart: Sync vs. Async Memcache Client Throughput - SET/GET operations per second (0-60,000) by thread count (1, 2, 4, 8, 16, 32, 56, 128) for FSMC (no Nagle), FSMC, and Xmemcached]
58. A Word on Garbage Collection
• Any JVM service on most hardware has to live with GC
• A good citizen will create lots of ParNew garbage and
nothing more
• Allocation is near free
• Collection also near free if you don’t copy anything
• Don’t buffer large things, stream or chunk
• When you must cache:
• Cache early and don’t touch
• Better, cache off heap or use memcache
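A minimal sketch of the "cache off heap" bullet: the cached bytes sit in a direct ByteBuffer outside the Java heap, so the collector never copies them between survivor spaces (the single fixed slot is a deliberate simplification):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class OffHeapSlot {
    private final ByteBuffer slot;

    OffHeapSlot(int capacity) {
        // Direct buffer: the payload lives outside the heap the collector manages.
        this.slot = ByteBuffer.allocateDirect(capacity);
    }

    synchronized void put(byte[] value) {
        slot.clear();
        slot.putInt(value.length);
        slot.put(value);
    }

    synchronized byte[] get() {
        ByteBuffer view = slot.duplicate(); // independent position/limit, same memory
        view.flip();
        byte[] out = new byte[view.getInt()];
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        OffHeapSlot slot = new OffHeapSlot(1024);
        slot.put("cached off heap".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(slot.get(), StandardCharsets.UTF_8));
    }
}
```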
62. A Word on Garbage Collection
When you care about throughput, the virtualization tax is high
[Chart: ParNew GC Effectiveness - MB collected per collection (0-300), Bare Metal vs. EC2 XL]
63. About EC2...
When you care about throughput, the virtualization tax is high
[Chart: Mean Time ParNew GC - collection time in seconds (0-0.04), Bare Metal vs. EC2 XL]
64. How we do at UA
• Originally our codebase was mostly one giant monolithic application and, over time, several databases
• Difficult to scale, technically and operationally
• Wanted to break off large pieces of functionality into coarse
grained services encapsulating their capability and function
• Most message exchange was done using beanstalkd after
migrating off RabbitMQ
• Fundamentally, our business is message passing
72. Choosing A Framework
• All frameworks are a form of concession
• Nobody would use Spring if people called it “Concessions
to the horrors of EJB”
• Understand concessions when choosing, look for:
• Configuration options - how do I configure Nagle
behavior?
• Metrics - what does the framework tell me about its
internals?
• Intelligent logging - next level down from metrics
• How does the framework play with peers?
81. Frameworks - DO IT LIVE!
• Our requirements:
• Capable of > 100K requests per second in aggregate
across multiple threads
• Simple protocol - easy to reason about, inspect
• Efficient, flexible message format - Google Protocol
Buffers
• Composable - easily create new services
• Support both sync and async operations
• Support for multiple languages (Python, Java, C++)
• Simple configuration
86. Frameworks - DO IT LIVE!
• Desirable:
• Discovery mechanism
• Predictable fault handling
• Adaptive load balancing
87. Frameworks - Akka
• Predominantly a Scala platform for sending messages; a distributed incarnation of the Actor pattern
• Message abstraction tolerates distribution well
• If you like OTP, you’ll probably like Akka
90. Frameworks - Akka
• Cons:
• We don’t like reading other people’s Scala
• Some pretty strong assertions in the docs that aren’t
substantiated
• Bulky wire protocol, especially for primitives
• Configuration felt complicated
• Sheer surface area of the framework is daunting
• Unclear integration story with Python
91. Frameworks - Aleph
• Clojure framework based on Netty, Lamina
• Conceptually, functions are applied to channels to move messages around
• Channels are refs that you realize when you want data
• Operations with channels very easy
• Concise format for standing up clients and services using
text protocols
94. Frameworks - Aleph
• Cons:
• Very high level abstraction, knobs are buried if they exist
• Channel concept leaky for large messages
• Documentation, tests
95. Frameworks - Netty
• The preeminent framework for doing Async Network I/O
on the JVM
• Netty Channels are backed by pipelines on top of NIO Channels
• Pros:
• Abstraction doesn’t hide the important pieces
• The only sane way to do TLS with Async I/O on the JVM
• Protocols well abstracted into pipeline steps
• Clean callback model for events of interest but optional in
simple cases - no death by callback
• Many implementations of interesting protocols
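A hedged sketch of the pipeline idea described above, written against the Netty 4.x API (the talk predates Netty 4, so class names differ from the 3.x code UA would have used): framing is one pipeline step, the application handler another, and the handler only sees whole messages:

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.SimpleChannelInboundHandler;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.codec.LengthFieldBasedFrameDecoder;

public class NettyPipelineSketch {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup boss = new NioEventLoopGroup(1);
        EventLoopGroup workers = new NioEventLoopGroup();
        try {
            ServerBootstrap b = new ServerBootstrap()
                .group(boss, workers)
                .channel(NioServerSocketChannel.class)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        ch.pipeline()
                          // Protocol framing as a pipeline step: 4-byte length prefix, stripped.
                          .addLast(new LengthFieldBasedFrameDecoder(1 << 20, 0, 4, 0, 4))
                          // Application handler: one callback per complete frame.
                          .addLast(new SimpleChannelInboundHandler<ByteBuf>() {
                              @Override
                              protected void channelRead0(ChannelHandlerContext ctx, ByteBuf frame) {
                                  ctx.writeAndFlush(frame.retain()); // echo the frame back
                              }
                          });
                    }
                });
            b.bind(9000).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }
}
```

Note the echoed frame has had its 4-byte prefix stripped by the decoder; a real service would re-encode a length prefix on the way out.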
96. Frameworks - Netty
• Cons:
• Easy to make too many copies of the data
• Some old school bootstrap idioms
• Writes can occasionally be reordered
• Failure conditions can be numerous, difficult to reason
about
• Simple things can feel difficult - UDP, simple request/reply
113. Frameworks - DO IT LIVE!
• Ultimately implemented our own using combination of
Netty and Google Protocol Buffers called Reactor
• Discovery (optional) using a defined tree of services in
ZooKeeper
• Service instances periodically publish load factor to
ZooKeeper for clients to inform routing decisions
• Rich metrics using Yammer Metrics
• Core service traits are part of the framework
• Service instances quiesce gracefully
• Netty made UDP, sync, and async easy
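Reactor itself isn’t open source in this deck, so the following is only a hedged illustration of the discovery bullets: a service instance registers an ephemeral node under an assumed /services tree and writes its load factor as the node data (connect string, paths, and the load metric are invented):

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ServiceRegistration {
    public static void main(String[] args) throws Exception {
        // A no-op watcher is enough for this illustration; real code handles session events.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181", 15000, event -> { });

        double loadFactor = 0.42; // hypothetical load metric published by this instance
        byte[] data = Double.toString(loadFactor).getBytes(StandardCharsets.UTF_8);

        // Ephemeral + sequential: the node vanishes when the session dies, which is
        // how clients notice a dead instance. Assumes /services/push already exists.
        String path = zk.create("/services/push/instance-", data,
                                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("registered as " + path);

        // A real instance would periodically refresh its load factor for client routing.
        zk.setData(path, data, -1);
    }
}
```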
114. Frameworks - DO IT LIVE!
• All operations are Callables, services define a mapping b/t
a request type and a Callable
• Client API always returns a Future; sometimes it’s already materialized
• Precise tuning from config files
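A hedged sketch of the shape the first two bullets above describe (not Reactor’s actual API): request types map to Callables on the service side, and the client always gets a Future back, sometimes one that is already materialized:

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CallableDispatch {
    // Request type -> the Callable that services it (names are illustrative).
    private final Map<String, Callable<String>> handlers = new ConcurrentHashMap<>();
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    void register(String requestType, Callable<String> handler) {
        handlers.put(requestType, handler);
    }

    // Always returns a Future; an unknown request comes back already materialized.
    Future<String> dispatch(String requestType) {
        Callable<String> handler = handlers.get(requestType);
        if (handler == null) {
            return CompletableFuture.completedFuture("UNKNOWN_REQUEST");
        }
        return pool.submit(handler);
    }

    public static void main(String[] args) throws Exception {
        CallableDispatch dispatch = new CallableDispatch();
        dispatch.register("PING", () -> "PONG");
        System.out.println(dispatch.dispatch("PING").get());
        dispatch.pool.shutdown();
    }
}
```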
123. What We Learned - In General
• Straight through RPC was fairly easy, edge cases were
hard
• ZooKeeper is brutal to program with and to recover errors from
• Discovery is also difficult - clients need to defend
themselves, consider partitions
• RPC is great for latency, but upstream pushback is
important
• Save RPC for latency-sensitive operations; use Kafka otherwise
• RPC less than ideal for fan-out
131. What We Learned - TCP
• RTO (retransmission timeout) and Karn and Jacobson’s
Algorithms
• Linux defaults to 15 retry attempts, 3 seconds between
• With no ACKs, congestion control kicks in and widens that 3 second window exponentially, thinking it’s congested
• Connection timeout can take up to 30 minutes
• Devices, Carriers and EC2 at scale eat FIN/RST
• Our systems think a device is still online at the time of a
push
141. What We Learned - TCP
• Efficiency means understanding your traffic
• Size send/recv buffers appropriately (defaults way too low
for edge tier services)
• Nagle! Non-duplex protocols can benefit significantly
• Example: 19K message deliveries per second vs. 2K
• Example: our protocol has a size frame; without Nagle it went in its own packet
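A minimal sketch of those knobs on a plain JDK socket: buffers sized up from the defaults and Nagle left on (setTcpNoDelay(false)) for a non-duplex, one-direction-at-a-time exchange (the sizes and timeout are placeholders, not recommendations):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class EdgeSocketTuning {
    public static Socket open(String host, int port) throws IOException {
        Socket s = new Socket();
        // Defaults are often far too small for an edge-tier service; size to your
        // bandwidth-delay product, not to a magic number.
        s.setSendBufferSize(256 * 1024);
        s.setReceiveBufferSize(256 * 1024); // set before connect so window scaling applies
        // Leave Nagle enabled: small writes (a size frame followed by a body)
        // get coalesced into one packet instead of two.
        s.setTcpNoDelay(false);
        s.connect(new InetSocketAddress(host, port), 5000);
        return s;
    }
}
```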
156. What We Learned - TCP
• Don’t Nagle!
• Again, understand what your traffic is doing
• Buffer and make one syscall instead of multiple
• High-throughput RPC mechanisms disable it explicitly
• See also:
• http://www.evanjones.ca/software/java-bytebuffers.html
• http://blog.boundary.com/2012/05/02/know-a-delay-nagles-algorithm-and-you/
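And the opposite sketch for a request/reply path: disable Nagle (setTcpNoDelay(true)) and buffer the whole frame yourself so it still leaves in one write (the length-prefixed framing is illustrative):

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.Socket;

public class OneWritePerRequest {
    // Send a length-prefixed request as a single flush rather than two small writes.
    static void send(Socket socket, byte[] payload) throws IOException {
        socket.setTcpNoDelay(true); // no Nagle delay waiting on the peer's ACK
        DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(socket.getOutputStream(), 64 * 1024));
        out.writeInt(payload.length); // size frame
        out.write(payload);           // body
        out.flush();                  // the whole frame goes down in one write
    }
}
```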
162. About UDP...
• Generally to be avoided
• Great for small unimportant data like memcache operations
at extreme scale
• Bad for RPC when you care about knowing if your request
was handled
• Conditions where you most want your data are also the
most likely to cause your data to be dropped
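A minimal sketch of the fire-and-forget use case the slide allows: a UDP datagram sent with no handshake and no acknowledgement, acceptable only when losing it is fine (the stats-style target address and payload are placeholders):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.charset.StandardCharsets;

public class FireAndForget {
    public static void main(String[] args) throws IOException {
        try (DatagramChannel ch = DatagramChannel.open()) {
            ByteBuffer stat = ByteBuffer.wrap("cache.hit:1".getBytes(StandardCharsets.UTF_8));
            // No retry, no acknowledgement: if the datagram is dropped, it is simply gone.
            ch.send(stat, new InetSocketAddress("stats.example.com", 8125));
        }
    }
}
```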
168. About TLS
• Try to avoid it - complex, slow and expensive, especially for
internal services
• ~6.5K and 4 hops to secure the channel
• 40 bytes overhead per frame
• 38.1MB overhead for every keep-alive sent to 1M devices
TLS source: http://netsekure.org/2010/03/tls-overhead/
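For reference, the 38.1MB figure is just the per-frame overhead multiplied out: 40 bytes × 1,000,000 devices = 40,000,000 bytes ≈ 38.1 MiB for a single keep-alive round.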
177. We Learned About HTTPS
• Thought we could ignore - basic plumbing of the internet
• 100s of millions of devices, performing 100s of millions of
tiny request/reply cycles:
• TLS Handshake
• HTTP Request
• HTTP Response
• TLS End
• Server TIME_WAIT
• Higher grade crypto eats more cycles
185. We Learned About HTTPS
• Corrective measures:
• Reduce TIME_WAIT - 60 seconds too long for an HTTPS
connection
• Reduce non-critical HTTPS operations to lower-grade ciphers
• Offload TLS handshake to EC2
• Deployed Akamai for SSL/TCP offload and to pipeline
device requests into our infrastructure
• Implement adaptive backoff at the client layer
• Aggressive batching
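The deck doesn’t spell out the backoff policy, so the following is only a hedged sketch of one common shape for the "adaptive backoff at the client layer" bullet, exponential backoff with full jitter (all constants are invented):

```java
import java.util.concurrent.ThreadLocalRandom;

public class ReconnectBackoff {
    private final long baseMillis = 1_000;  // first retry delay (illustrative)
    private final long capMillis = 300_000; // never wait longer than five minutes
    private int attempt = 0;

    // Exponential backoff with full jitter: spreads reconnect storms out
    // instead of letting millions of devices retry in lockstep.
    long nextDelayMillis() {
        long exp = Math.min(capMillis, baseMillis << Math.min(attempt, 20));
        attempt++;
        return ThreadLocalRandom.current().nextLong(exp + 1);
    }

    void reset() { // call after a successful request
        attempt = 0;
    }
}
```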
192. We Learned About Carriers
• Data plans are like gym memberships
• Aggressively cull idle stream connections
• Don’t like TCP keepalives
• Don’t like UDP
• Like to batch, delay or just drop FIN/FIN ACK/RST
• Move data through aggregators
198. About Devices...
• Small compute units that do exactly what you tell them to
• Like to phone home when you push to them...
• 10M at a time...
• Causing...
211. About Devices...
• By virtue of being a mobile device, they move around a lot
• When they move, they often change IP addresses
• New cell tower
• Change connectivity - 4G -> 3G, 3G -> WiFi, etc.
• When they change IP addresses, they need to reconnect
TCP sockets
• Sometimes they are kind enough to let us know
• Those reconnections are expensive for us and the devices
218. We Learned About EC2
• EC2 is a great jumping-off point
• Scaling vertically is very expensive
• Like Carriers, EC2 networking is fond of holding on to TCP
teardown sequence packets
• vNICs obfuscate important data when you care about 1M
connections
• Great for surge capacity
• Don’t split services into the virtual domain
219. About EC2...
• When you care about throughput, the virtualization tax is
high
225. About EC2...
• Limited applicability for testing
• Egress port limitations kick in at ~63K egress
connections - 16 XLs to test 1M connections
• Can’t create vNIC in an EC2 guest
• Killing a client doesn’t disconnect immediately
• Pragmatically, smalls are useless for our purposes: not enough RAM, %steal too high
226. Lessons Learned - Failing Well
• Scale vertically and horizontally
• Scale vertically but remember...
• We can reliably take one Java process up to 990K open
connections
• What happens when that one process fails?
• What happens when you need to do maintenance?
227. Thanks!
• Urban Airship http://urbanairship.com/
• Me @eonnen on Twitter or [email protected]
• We’re hiring! http://urbanairship.com/company/jobs/