Broken Performance Tools
QCon SF 2015
Brendan Gregg
Senior Performance Architect, Netflix
CAUTION: PERFORMANCE TOOLS
• Over 60 million subscribers
• AWS EC2 Linux cloud
• FreeBSD CDN
• Awesome place to work
This Talk
• Observability, benchmarking, anti-patterns, and lessons
• Broken and misleading things that are surprising
• "load"
– Usually CPU demand (scheduler run queue length/latency)
– On Linux, task demand: CPU + uninterruptible disk I/O (?)
• "average"
– Exponentially damped moving sum
• "1, 5, and 15 minutes"
– Constants used in the equation
• Don't study these for longer than 10 seconds
Load Average really means…
[Chart: a single CPU-bound thread begins at t=0; the 1-minute load average reaches only ~0.62 after one minute, while the 5- and 15-minute averages rise more slowly still]
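A rough sketch of why the 1-minute value is only ~0.62 after one minute (assuming the standard Linux behavior of updating every 5 seconds with an exponential damping factor, which is not spelled out on the slide): each update computes
load := load · e^(−5/60) + n · (1 − e^(−5/60))
so with one runnable task (n = 1) starting from load = 0, after 60 seconds load = 1 − e^(−60/60) = 1 − e^(−1) ≈ 0.63.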
[Diagram: Intel Nehalem CPU microarchitecture (Wikimedia image; see References)]
I/O Wait
$ mpstat -P ALL 1
08:06:43 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
08:06:44 PM all 53.45 0.00 3.77 0.00 0.00 0.39 0.13 0.00 42.26
[…]
"Idle"
Free Memory
$ free -m
total used free shared buffers cached
Mem: 3750 1111 2639 0 147 527
-/+ buffers/cache: 436 3313
Swap: 0 0 0
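The catch (see linuxatemyram.com in the references): Linux uses otherwise-free memory for buffers and the page cache, so the "free" column looks alarmingly small while the "-/+ buffers/cache" line is much closer to what is really available. A minimal sketch of computing that from /proc/meminfo (field names are standard; Cached also counts tmpfs pages, so treat the result as approximate, and kernels 3.14+ report MemAvailable directly):
$ awk '/^(MemFree|Buffers|Cached):/ { kb += $2 }
       END { printf "roughly available: %d MB\n", kb / 1024 }' /proc/meminfo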
netstat -s
[…]
Tcp:
    17337 active connections openings
    395515 passive connection openings
    8953 failed connection attempts
    240214 connection resets received
    3 connections established
    7198870 segments received
    7505329 segments send out
    62697 segments retransmited
    10 bad segments received.
    1072 resets sent
    InCsumErrors: 5
[…]
• Many metrics on Linux (can be over 200)
• Still doesn't include everything: getting better, but don't
assume everything is there
• Includes typos & inconsistencies
• Might be more readable to:
cat /proc/net/snmp /proc/net/netstat
• Totals since boot can be misleading (a per-interval alternative is sketched after this list)
• On Linux, -s needs -c (interval) support
• Often no documentation outside kernel source code
• Requires expertise to comprehend
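One way around the since-boot totals mentioned above (a sketch, not from the slides): sample rates over an interval instead, e.g. with sysstat's sar, which prints per-second TCP connection, segment, and retransmit rates:
$ sar -n TCP,ETCP 1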
Disk Metrics
• All disk metrics are misleading
• Disk %utilization / %busy
– Logical devices (volume managers) can process requests in
parallel, and may accept more I/O at 100%
• Disk IOPS
– High IOPS is "bad"? That depends…
• Disk latency
– Does it matter? File systems and volume managers try hard to hide latency and make it asynchronous
– Better to measure latency at the application->file system calls (a sketch follows below)
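A crude sketch of measuring latency at the application->syscall layer instead of at the disk (strace's -T appends the time spent in each call; beware the heavy observer effect shown later in this deck, so prefer lower-overhead tracers in production; <PID> is a placeholder):
$ strace -T -ttt -e trace=read,write,fsync -p <PID>
Each traced line ends with the syscall duration in seconds, e.g. <0.000042>.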
Rules for Metrics Makers
• They must work
– As well as possible. Clearly document caveats.
• They must be useful
– Document a real use case (e.g., my example.txt files).
If you get stuck, it's not useful – ditch it.
• Aim to be intuitive
– Document it. If it's too weird to explain, redo it.
• As few as possible
– Respect end-user's time
• Good system examples:
– iostat -x: workload columns, then resulting perf columns
– Linux sar: consistency, units on columns, logical groups
Observability: Profilers
Linux perf
• Can sample stack traces and summarize output:
# perf report -n --stdio
[…]
# Overhead Samples Command Shared Object Symbol
# ........ ............ ....... ................. .............................
#
20.42% 605 bash [kernel.kallsyms] [k] xen_hypercall_xen_version
|
--- xen_hypercall_xen_version
check_events
|
|--44.13%-- syscall_trace_enter
| tracesys
| |
| |--35.58%-- __GI___libc_fcntl
| | |
| | |--65.26%-- do_redirection_internal
| | | do_redirections
| | | execute_builtin_or_function
| | | execute_simple_command
[… ~13,000 lines truncated …]
Too Much Output
… as a Flame Graph
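For reference, the usual CPU flame graph workflow with perf and the FlameGraph scripts (repository path and output name are illustrative):
# git clone https://github.com/brendangregg/FlameGraph
# perf record -F 99 -a -g -- sleep 30
# perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > cpu.svg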
PROFILER VISIBILITY
System Profilers with Java
[Diagram: a thread's timeline as seen by a system profiler, with segments labeled Java, Kernel, TCP/IP, JVM, GC, Locks, epoll, and Idle]
System Profilers with Java
• e.g., Linux perf
• Visibility
– JVM (C++)
– GC (C++)
– libraries (C)
– kernel (C)
• Typical problems (x86):
– Stacks missing for Java and other runtimes
– Symbols missing for Java methods
• Profile everything except Java and similar runtimes (a commonly cited workaround is sketched below)
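A commonly cited workaround, not shown on these slides (assumes JDK 8u60+ for the frame-pointer flag and the separate perf-map-agent project for JIT symbol maps):
$ java -XX:+PreserveFramePointer ...                  # let perf walk Java stacks
# perf record -F 99 -a -g -- sleep 30
$ ./perf-map-agent/bin/create-java-perf-map.sh <pid>  # writes /tmp/perf-<pid>.map for Java symbols
# perf script | ...                                   # then fold and visualize as before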
Java Profilers
[Diagram: segments labeled "Kernel, libraries, JVM", "Java", and "GC", showing what a Java profiler can and cannot see]
Java Profilers
• Visibility
– Java method execution
– Object usage
– GC logs
– Custom Java context
• Typical problems:
– Sampling often happens at safety/yield points (skew)
– Method tracing has massive observer effect
– Misidentifies RUNNING as on-CPU (e.g., epoll)
– Doesn't include or profile GC or JVM CPU time
– Tree views not quick (proportional) to comprehend
• Inaccurate (skewed) and incomplete profiles
COMPILER OPTIMIZATIONS
Broken System Stack Traces
• Profiling Java on x86 using perf:
# perf record -F 99 -a -g -- sleep 30
# perf script
[…]
java 4579 cpu-clock:
    ffffffff8172adff tracesys ([kernel.kallsyms])
    7f4183bad7ce pthread_cond_timedwait@@GLIBC_2…
[…]
• The stacks are broken: only one or two frames deep, so they can't show where the CPU time really goes
• Observer effect: running dd under strace (even filtering on accept(), a syscall dd never makes) slows it dramatically:
$ strace -eaccept dd if=/dev/zero of=/dev/null bs=1 count=500k
[…]
512000 bytes (512 kB) copied, 45.9599 s, 11.1 kB/s
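Rough arithmetic on that result (a sketch): bs=1 count=500k means about 512,000 read()s plus 512,000 write()s, roughly a million syscalls in ~46 s, i.e. on the order of 45 µs of tracing cost per syscall, even though the accept filter prints nothing.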
• Similarly, counting every user-level function with DTrace's pid provider makes a short wc run take over seven minutes:
# time dtrace -n 'pid$target:::entry { @[probefunc] = count(); }' -c 'wc systemlog'
dtrace: description 'pid$target:::entry ' matched 3756 probes
262600 2995200 23925200 systemlog
[…]
real 7m2.896s
user 7m2.650s
sys 0m0.572s
"Then there is the man who drowned crossing a stream with an average depth of six inches."
– W.I.E. Gates
Statistics
• Averages can be misleading
– Hide latency outliers
– Per-minute averages can hide multi-second issues
• Percentiles can be misleading
– The probability of hitting the 99.9th-percentile latency can be much more than 1/1000 once a request fans out to many dependency requests (see the worked example after this list)
• Show the distribution:
– Summarize: histogram, density plot, frequency trail
– Over-time: scatter plot, heat map
• See Gil Tene's "How Not to Measure Latency" QCon talk
from earlier today
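Worked example for the percentile point above (illustrative numbers): if one user request fans out to 100 dependency calls, each independently exceeding its 99.9th-percentile latency with probability 0.001, then P(at least one slow call) = 1 − 0.999^100 ≈ 9.5%, so roughly 1 in 10 user requests experiences a "1 in 1000" latency.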
Average Latency
• When the index of central tendency isn't…
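An illustrative, made-up example of the point: if half of all requests take 1 ms and half take 100 ms, the mean is (1 + 100) / 2 = 50.5 ms, a latency that almost no individual request actually experiences; only the distribution shows the two modes.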
Observability: Visualizations
Tachometers
[Images: a tachometer-style gauge, plus a chart with segments labeled Kernel, Java, JVM, and GC]
Visualize Everything
Benchmark Nothing
• Trust nothing, verify everything
• Do Active Benchmarking:
1. Configure the benchmark to run in steady state, 24x7
2. Do root-cause analysis of benchmark performance
3. Answer: why is it not 10x?
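A minimal sketch of what "analyze while it runs" can look like (tool choices are illustrative, not prescribed by the deck), run in another terminal while the benchmark is in steady state:
$ mpstat -P ALL 1                        # per-CPU utilization: is it really CPU-bound?
$ iostat -x 1                            # disk workload and resulting device performance
# perf record -F 99 -a -g -- sleep 30    # profile to explain the limiter, then flame graph it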
Links & References
• https://www.rfc-editor.org/rfc/rfc546.pdf
• https://upload.wikimedia.org/wikipedia/commons/6/64/Intel_Nehalem_arch.svg
• http://www.linuxatemyram.com/
• Traeger, A., E. Zadok, N. Joukov, and C. Wright. “A Nine Year Study of File System and Storage Benchmarking,” ACM Transactions on Storage, 2008.
• http://www.brendangregg.com/blog/2014-06-09/java-cpu-sampling-using-hprof.html
• http://www.brendangregg.com/blog/2014-05-03/the-benchmark-paradox.html
• http://www.brendangregg.com/ActiveBenchmarking/bonnie++.html
• https://blogs.oracle.com/roch/entry/decoding_bonnie
• http://www.brendangregg.com/blog/2014-05-02/compilers-love-messing-with-benchmarks.html
• https://code.google.com/p/byte-unixbench/
• https://qconsf.com/sf2015/presentation/how-not-measure-latency
• https://qconsf.com/sf2015/presentation/profilers-lying
• Caution signs drawn by me, inspired by real-world signs
Nov 2015
Thanks
• Questions?
• http://techblog.netflix.com
• http://slideshare.net/brendangregg
• http://www.brendangregg.com
• [email protected]
• @brendangregg