JavaOne 2015: Java Mixed-Mode Flame Graphs
Video: https://www.youtube.com/watch?v=BHA65BqlqSkSlides for JavaOne 2015 talk by Brendan Gregg, Netflix.
Description: "At Netflix we dreamed of one visualization to show all CPU consumers: Java methods, GC, JVM internals, system libraries, and the kernel. With the help of Oracle this is now possible on x86 systems using system profilers (eg, Linux perf_events) and the new JDK option -XX:+PreserveFramePointer. This lets us create Java mixed-mode CPU flame graphs, exposing all CPU consumers. We can also use system profilers to analyze memory page faults, TCP events, storage I/O, and scheduler events, also with Java method context. This talk describes the background for this work, instructions generating Java mixed-mode flame graphs, and examples from our use at Netflix where Java on x86 is the primary platform for the Netflix cloud."
PDF: JavaOne2015_MixedModeFlameGraphs.pdf
Keywords (from pdftotext):
slide 1:
Oct  2015  Java Mixed-Mode Flame Graphs Brendan Gregg Senior Performance Architectslide 2:
Understanding Java CPU usage quickly and completelyslide 3:
Quickly ⢠Via SSH and open source tools (covered in this talk) ⢠Or using Netflix Vector GUI (also open source): 1. Observe high CPU usage 2. Generate a flame graphslide 4:
Completely Java Mixed-Mode Flame Graph via Linux perf_events: Kernel Java JVMslide 5:
Messy House Fallacy Fallacy:  my  code  is  a  mess,  I  bet  yours  is  immaculate,  therefore  the  bug  must  be  mine   Reality:  everyone's  code  is  terrible  and  buggy  ⢠Don't overlook system code: kernel, libraries, etc.slide 6:
Contextslide 7:
⢠Over 60 million subscribers â Just launched in Spain! ⢠AWS EC2 Linux cloud ⢠FreeBSD CDN ⢠Awesome place to workslide 8:
Cloud ⢠Tens of thousands of AWS EC2 instances ⢠Mostly running Java applications (Oracle JVM) Linux  (usually  Ubuntu)  OpMonal  Apache,  memcached,  Node.js,  ⦠ Atlas,  S3  log  rotaMon,  sar,  Trace,  perf,  stap,  perf-Ââtools  Vector,  pcp  Java  (JDK  8)  GC  and  thread  dump  logging  Tomcat  ApplicaMon  war  ï¬les,  plaYorm,  base  servlet  hystrix,  metrics  (Servo),  health  check Âslide 9:
Why we need CPU profiling ⢠Improving performance ⢠Cost savings â ASGs often scale on load average (CPU), so CPU usage is proportional to cost Auto  Scaling  Group  Scaling  Policy  CloudWatch,  Servo  â Identify tuning targets â Incident response â Non-regression testing â Software evaluations â CPU workload characterization loadavg,  latency,  ⦠  Instance  Instance  Instance Âslide 10:
The Problem with Profilersslide 11:
Java Profilers Kernel, libraries, JVM Javaslide 12:
Java Profilers ⢠Visibility â Java method execution â Object usage â GC logs â Custom Java context ⢠Typical problems: â Sampling often happens at safety/yield points (skew) â Method tracing has massive observer effect â Misidentifies RUNNING as on-CPU (e.g., epoll) â Doesn't include or profile GC or JVM CPU time â Tree views not quick (proportional) to comprehend ⢠Inaccurate (skewed) and incomplete profilesslide 13:
System Profilers Kernel TCP/IP Java JVM Locks Time epoll Idle threadslide 14:
System Profilers ⢠Visibility â JVM (C++) â GC (C++) â libraries (C) â kernel (C) ⢠Typical problems (x86): â Stacks missing for Java â Symbols missing for Java methods ⢠Other architectures (e.g., SPARC) have fared better ⢠Profile everything except Javaslide 15:
Workaround ⢠Capture both Java and system profiles, and examine side by side Java System ⢠An improvement, but Java context is often crucial for interpreting system profilesslide 16:
Solution Java Mixed-Mode Flame Graph Kernel Java JVMslide 17:
Solution ⢠Fix system profiling â Only way to see it all Kernel ⢠Visibility is everything: â Java methods â JVM (C++) â GC (C++) â libraries (C) â kernel (C) Java JVM ⢠Minor Problems: â 0-3% CPU overhead to enable frame pointers (usuallyslide 18: Simple Production Example 1. Poor performance, and one CPU at 100% 2. perf_events flame graph shows JVM stuck compilingslide 19:Another System Example Exception handling consuming CPUslide 20:DEMO FlameGraph_tomcat01.svgslide 21:Exonerating The System ⢠From last week: - Frequent thread creation/ destruction assumed to be consuming CPU resources. Recode application? - A flame graph quantified this CPU time: near zero - Time mostly other Java methodsslide 22:Profiling GC GC internals, visualized:slide 23:CPU Profilingslide 24:CPU Profiling ⢠Record stacks at a timed interval: simple and effective â Pros: Low (deterministic) overhead â Cons: Coarse accuracy, but usually sufficient stack samples: syscall on-CPU time off-CPU block interruptslide 25:Stack Traces ⢠A code path snapshot. e.g., from jstack(1): $ jstack 1819 [â¦] "main" prio=10 tid=0x00007ff304009000 nid=0x7361 runnable [0x00007ff30d4f9000] java.lang.Thread.State: RUNNABLE running at Func_abc.func_c(Func_abc.java:6) codepath at Func_abc.func_b(Func_abc.java:16) start at Func_abc.func_a(Func_abc.java:23) at Func_abc.main(Func_abc.java:27) running parent g.parent g.g.parenslide 26:System Profilers ⢠Linux â perf_events (aka "perf") ⢠Oracle Solaris â DTrace ⢠OS X â Instruments ⢠Windows â XPerf ⢠And many othersâ¦slide 27:Linux perf_events ⢠Standard Linux profiler â Provides the perf command (multi-tool) â Usually pkg added by linux-tools-common, etc. ⢠Features: â Timer-based sampling â Hardware events â Tracepoints â Dynamic tracing ⢠Can sample stacks of (almost) everything on CPU â Can miss hard interrupt ISRs, but these should be near-zero. They can be measured if needed (I wrote my own tools)slide 28:perf record Profiling ⢠Stack profiling on all CPUs at 99 Hertz, then dump: # perf record -F 99 -ag -- sleep 30 [ perf record: Woken up 9 times to write data ] [ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ] # perf script [â¦] bash 13204 cpu-clock: 459c4c dequote_string (/root/bash-4.3/bash) 465c80 glob_expand_word_list (/root/bash-4.3/bash) 466569 expand_word_list_internal (/root/bash-4.3/bash) 465a13 expand_words (/root/bash-4.3/bash) 43bbf7 execute_simple_command (/root/bash-4.3/bash) 435f16 execute_command_internal (/root/bash-4.3/bash) one 435580 execute_command (/root/bash-4.3/bash) stack 43a771 execute_while_or_until (/root/bash-4.3/bash) sample 43a636 execute_while_command (/root/bash-4.3/bash) 436129 execute_command_internal (/root/bash-4.3/bash) 435580 execute_command (/root/bash-4.3/bash) 420cd5 reader_loop (/root/bash-4.3/bash) 41ea58 main (/root/bash-4.3/bash) 7ff2294edec5 __libc_start_main (/lib/x86_64-linux-gnu/libc-2.19.so) [⦠~47,000 lines truncated â¦]slide 29:perf report Summary ⢠Generates a call tree and combines samples: # perf report -n -stdio [â¦] # Overhead Samples Command Shared Object Symbol # ........ ............ ....... ................. ............................. 20.42% bash [kernel.kallsyms] [k] xen_hypercall_xen_version --- xen_hypercall_xen_version check_events call tree summary |--44.13%-- syscall_trace_enter tracesys |--35.58%-- __GI___libc_fcntl |--65.26%-- do_redirection_internal do_redirections execute_builtin_or_function execute_simple_command [⦠~13,000 lines truncated â¦]slide 30:Flame Graphsslide 31:perf report Verbosity ⢠Despite summarizing, output is still verbose # perf report -n -stdio [â¦] # Overhead Samples Command Shared Object Symbol # ........ ............ ....... ................. ............................. 20.42% bash [kernel.kallsyms] [k] xen_hypercall_xen_version --- xen_hypercall_xen_version check_events |--44.13%-- syscall_trace_enter tracesys |--35.58%-- __GI___libc_fcntl |--65.26%-- do_redirection_internal do_redirections execute_builtin_or_function execute_simple_command [⦠~13,000 lines truncated â¦]slide 32:Full perf report Outputslide 33:⦠as a Flame Graphslide 34:Flame Graphs git clone --depth 1 https://github.com/brendangregg/FlameGraph cd FlameGraph perf record -F 99 -a âg -- sleep 30 perf script | ./stackcollapse-perf.pl | ./flamegraph.pl >gt; perf.svg ⢠Flame Graphs: â x-axis: alphabetical stack sort, to maximize merging â y-axis: stack depth â color: random (default), or a dimension ⢠Currently made from Perl + SVG + JavaScript â Multiple d3 versions are being developed ⢠Easy to get working â http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html â Above commands are Linux; see URL for other OSesslide 35:Linux perf_events Workflow list events count events capture stacks perf list perf stat perf record Typical Workflow perf.data  text UI dump profile perf report flame graph visualization perf script stackcollapse-perf.pl flamegraph.plslide 36:Flame Graph Interpretation g() e() f() d() c() i() b() h() a()slide 37:Flame Graph Interpretation (1/3) Top edge shows who is running on-CPU, and how much (width) g() e() f() d() c() i() b() h() a()slide 38:Flame Graph Interpretation (2/3) Top-down shows ancestry e.g., from g(): g() e() f() d() c() i() b() h() a()slide 39:Flame Graph Interpretation (3/3) Widths are proportional to presence in samples e.g., comparing b() to h() (incl. children) g() e() f() d() c() i() b() h() a()slide 40:Flame Graph Colors ⢠Randomized by default ⢠Can be used as a dimension. e.g.: â Mixed-mode flame graphs â Differential flame graphs â Searchslide 41:Mixed-Mode Flame Graphs ⢠Hues: â green == Java â red == system â yellow == C++ ⢠Intensity randomized to differentiate frames â Or hashed based on function name Mixed-Mode Kernel Java JVMslide 42:Differential Flame Graphs ⢠Hues: Differential â red == more samples â blue == less samples ⢠Intensity shows the degree of difference ⢠Used for comparing two profiles ⢠Also used for showing other metrics: e.g., CPI more lessslide 43:Flame Graph Search ⢠Color: magenta to show matched frames search buttonslide 44:Flame Charts ⢠Final note: these are useful, but are not flame graphs ⢠Flame charts: x-axis is time ⢠Flame graphs: x-axis is population (maximize merging)slide 45:Stack Tracingslide 46:System Profiling Java on x86 ⢠For example, using Linux perf ⢠The stacks are 1 or 2 levels deep, and have junk values # perf record âF 99 âa âg â sleep 30 # perf script [â¦] java 4579 cpu-clock: ffffffff8172adff tracesys ([kernel.kallsyms]) 7f4183bad7ce pthread_cond_timedwait@@GLIBC_2⦠java 4579 cpu-clock: 7f417908c10b [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f4179101c97 [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f41792fc65f [unknown] (/tmp/perf-4458.map) a2d53351ff7da603 [unknown] ([unknown]) java 4579 cpu-clock: 7f4179349aec [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f4179101d0f [unknown] (/tmp/perf-4458.map) [â¦]slide 47:⦠as a Flame Graph Broken Java stacks (missing frame pointer)slide 48:Why Stacks are Broken ⢠On x86 (x86_64), hotspot uses the frame pointer register (RBP) as general purpose ⢠This "compiler optimization" breaks (simple) stack walking ⢠Once upon a time, x86 had fewer registers, and this made much more sense ⢠gcc provides -fno-omit-frame-pointer to avoid doing this, but the JVM had no such optionâ¦slide 49:Fixing Stack Walking Possibilities: A. Fix frame pointer-based stack walking (the default) â Pros: simple, supported by many tools â Cons: might cost a little extra CPU B. Use a custom walker (likely needing kernel support) â Pros: full stack walking (incl. inlining) & arguments â Cons: custom kernel code, can cost more CPU when in use C. Try libunwind and DWARF â Even feasible with JIT? Our current preference is (A)slide 50:Hacking OpenJDK (1/2) ⢠As a proof of concept, I hacked hotspot to support an x86_64 frame pointer --- openjdk8clean/hotspot/src/cpu/x86/vm/x86_64.ad 2014-03-04 ⦠+++ openjdk8/hotspot/src/cpu/x86/vm/x86_64.ad 2014-11-08 ⦠@@ -166,10 +166,9 @@ // 3) reg_class stack_slots( /* one chunk of stack-based "registers" */ ) -// Class for all pointer registers (including RSP) +// Class for all pointer registers (including RSP, excluding RBP) reg_class any_reg(RAX, RAX_H, RDX, RDX_H, RBP, RBP_H, Remove RBP from RDI, RDI_H, RSI, RSI_H, register pools RCX, RCX_H, [...]slide 51:Hacking OpenJDK (2/2) --- openjdk8clean/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-03-04⦠+++ openjdk8/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-11-07 ⦠@@ -5236,6 +5236,7 @@ // We always push rbp, so that on return to interpreter rbp, will be // restored correctly and we can correct the stack. push(rbp); mov(rbp, rsp); Fix x86_64 function // Remove word for ebp prologues framesize -= wordSize; --- openjdk8clean/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp ⦠+++ openjdk8/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp ⦠[...] ⢠We used this patched version successfully for some limited (and urgent) performance analysisslide 52:-XX:+PreserveFramePointer ⢠We shared our patch publicly â See "A hotspot patch for stack profiling (frame pointer)" on the hotspot complier dev mailing list â It became JDK-8068945 for JDK 9 and JDK-8072465 for JDK 8, and the -XX:+PreserveFramePointer option ⢠Zoltán Majó (Oracle) took this on, rewrote it, and it is now: â In JDK 9 â In JDK 8 update 60 build 19 â Thanks to Zoltán, Oracle, and the other hotspot engineers for helping get this done! ⢠It might cost 0 â 3% CPU, depending on workloadslide 53:Broken Java Stacks (before) # perf script [â¦] java 4579 cpu-clock: ffffffff8172adff tracesys ([kernel.kallsyms]) 7f4183bad7ce pthread_cond_timedwait@@GLIBC_2⦠java 4579 cpu-clock: 7f417908c10b [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f4179101c97 [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f41792fc65f [unknown] (/tmp/perf-4458.map) a2d53351ff7da603 [unknown] ([unknown]) java 4579 cpu-clock: 7f4179349aec [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f4179101d0f [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f417908c194 [unknown] (/tmp/perf-4458.map) [â¦] ⢠Check with "perf script" to see stack samples ⢠These are 1 or 2 levels deep (junk values)slide 54:Fixed Java Stacks # perf script [â¦] java 8131 cpu-clock: 7fff76f2dce1 [unknown] ([vdso]) 7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm⦠7fd301861e46 [unknown] (/tmp/perf-8131.map) 7fd30184def8 [unknown] (/tmp/perf-8131.map) 7fd30174f544 [unknown] (/tmp/perf-8131.map) 7fd30175d3a8 [unknown] (/tmp/perf-8131.map) 7fd30166d51c [unknown] (/tmp/perf-8131.map) 7fd301750f34 [unknown] (/tmp/perf-8131.map) 7fd3016c2280 [unknown] (/tmp/perf-8131.map) 7fd301b02ec0 [unknown] (/tmp/perf-8131.map) 7fd3016f9888 [unknown] (/tmp/perf-8131.map) 7fd3016ece04 [unknown] (/tmp/perf-8131.map) 7fd30177783c [unknown] (/tmp/perf-8131.map) 7fd301600aa8 [unknown] (/tmp/perf-8131.map) 7fd301a4484c [unknown] (/tmp/perf-8131.map) 7fd3010072e0 [unknown] (/tmp/perf-8131.map) 7fd301007325 [unknown] (/tmp/perf-8131.map) 7fd301007325 [unknown] (/tmp/perf-8131.map) 7fd3010004e7 [unknown] (/tmp/perf-8131.map) 7fd3171df76a JavaCalls::call_helper(JavaValue*,⦠7fd3171dce44 JavaCalls::call_virtual(JavaValue*⦠7fd3171dd43a JavaCalls::call_virtual(JavaValue*⦠7fd31721b6ce thread_entry(JavaThread*, Thread*)⦠7fd3175389e0 JavaThread::thread_main_inner() (/⦠7fd317538cb2 JavaThread::run() (/usr/lib/jvm/nf⦠7fd3173f6f52 java_start(Thread*) (/usr/lib/jvm/⦠7fd317a7e182 start_thread (/lib/x86_64-linux-gn⦠⢠With -XX: +PreserveFramePointer stacks are full, and go all the way to start_thread() ⢠This is what the CPUs are really running: inlined frames are not presentslide 55:Fixed Stacks Flame Graph Java stacks (but no symbols)slide 56:Stacks & Inlining ⢠Frames may be missing (inlined) ⢠Disabling inlining: No inlining â -XX:-Inline â Many more Java frames â Can be 80% slower! ⢠May not be necessary â Inlined flame graphs often make enough sense â Or tune -XX:MaxInlineSize and -XX:InlineSmallCode a little to reveal more frames ⢠Can even improve performance! ⢠perf-map-agent (next) has experimental un-inline supportslide 57:Symbolsslide 58:Missing Symbols ⢠Missing symbols may show up as hex; e.g., Linux perf: 71.79% sed sed |--11.65%-- 0x40a447 0x40659a 0x408dd8 0x408ed1 0x402689 0x7fa1cd08aec5 12.06% sed sed --- re_search_internal |--96.78%-- re_search_stub rpl_re_search match_regex do_subst execute_program process_files main __libc_start_main [.] 0x000000000001afc1 broken [.] re_search_internal not brokenslide 59:Fixing Symbols For JIT'd code, Linux perf already looks for an externally provided symbol file: /tmp/perf-PID.map, and warns if it doesn't exist # perf script Failed to open /tmp/perf-8131.map, continuing without symbols [â¦] java 8131 cpu-clock: 7fff76f2dce1 [unknown] ([vdso]) 7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm⦠7fd301861e46 [unknown] (/tmp/perf-8131.map) [â¦] This file can be created by a Java agentslide 60:Java Symbols for perf ⢠perf-map-agent â https://github.com/jrudolph/perf-map-agent â Agent attaches and writes the /tmp file on demand (previous versions attached on Java start, wrote continually) â Thanks Johannes Rudolph! ⢠Use of a /tmp symbol file â Pros: simple, can be low overhead (snapshot on demand) â Cons: stale symbols ⢠Using a symbol logger with perf instead â Patch by Stephane Eranian currently being discussed on lkml; see "perf: add support for profiling jitted code"slide 61:Stacks & Symbols Java Mixed-Mode Flame Graph Kernel Java JVMslide 62:Stacks & Symbols (zoom)slide 63:Instructionsslide 64:Instructions 1. Check Java version 2. Install perf-map-agent 3. Set -XX:+PreserveFramePointer 4. Profile Java 5. Dump symbols 6. Generate Mixed-Mode Flame Graph Note these are unsupported: use at your own risk Reference: http://techblog.netflix.com/2015/07/java-in-flames.htmlslide 65:1. Check Java Version ⢠Need JDK8u60 or better â for -XX:+PreserveFramePointer $ java -version java version "1.8.0_60" Java(TM) SE Runtime Environment (build 1.8.0_60-b27) Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode) ⢠Upgrade Java if necessaryslide 66:2. Install perf-map-agent ⢠Check https://github.com/jrudolph/perf-map-agent for the latest instructions. e.g.: $ sudo bash # apt-get install -y cmake # export JAVA_HOME=/usr/lib/jvm/java-8-oracle # cd /usr/lib/jvm # git clone --depth=1 https://github.com/jrudolph/perf-map-agent # cd perf-map-agent # cmake . # makeslide 67:3. Set -XX:+PreserveFramePointer ⢠Needs to be set on Java startup ⢠Check it is enabled (on Linux): $ ps wwp `pgrep ân java` | grep PreserveFramePointerslide 68:4. Profile Java ⢠Using Linux perf_events to profile all processes, at 99 Hertz, for 30 seconds (as root): # perf record -F 99 -a -g -- sleep 30 ⢠Just profile one PID (broken on some older kernels): # perf record -F 99 -p PID -g -- sleep 30 ⢠These create a perf.data fileslide 69:5. Dump Symbols ⢠See perf-map-agent docs for updated usage ⢠e.g., as the same user as java: $ cd /usr/lib/jvm/perf-map-agent/out $ java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar \ net.virtualvoid.perf.AttachOnce PID ⢠perf-map-agent contains helper scripts. I wrote my own: â https://github.com/brendangregg/Misc/blob/master/java/jmaps ⢠Dump symbols quickly after perf record to minimize stale symbols. How I do it: # perf record -F 99 -a -g -- sleep 30; jmapsslide 70:6. Generate a Mixed-Mode Flame Graph ⢠Using my FlameGraph software: # perf script >gt; out.stacks01 # git clone --depth=1 https://github.com/brendangregg/FlameGraph # cat out.stacks01 | ./FlameGraph/stackcollapse-perf.pl | \ ./FlameGraph/flamegraph.pl --color=java --hash >gt; flame01.svg â perf script reads perf.data with /tmp/*.map â out.stacks01 is an intermediate file; can be handy to keep ⢠Finally open flame01.svg in a browser ⢠Check for newer flame graph implementations (e.g., d3)slide 71:Automationslide 72:Netflix Vectorslide 73:Netflix Vector Select Instance Select Metrics Flame Graphs Near real-time, per-second metricsslide 74:Netflix Vector ⢠Open source, on-demand, instance analysis tool â https://github.com/netflix/vector ⢠Shows various real-time metrics ⢠Flame graph support currently in development â Automating previous steps â Using it internally already â Also developing a new d3 front endslide 75:DEMO d3-flame-graphslide 76:Advanced Analysisslide 77:Linux perf_events Coverage ⦠all possible with Java stacksslide 78:Advanced Flame Graphs ⢠Examples: â Page faults â Context switches â Disk I/O requests â TCP events â CPU cache misses â CPI ⢠Any event issued in synchronous Java contextslide 79:Synchronous Java Context ⢠Java thread still on-CPU, and event is directly triggered ⢠Examples: â Disk I/O requests issued directly by Java à ï yes ⢠direct reads, sync writes, page faults â Disk I/O completion interrupts à ï no* â Disk I/O requests triggered async, e.g., readahead à ï no* * can be made yes by tracing and associating contextslide 80:Page Faults ⢠Show what triggered main memory (resident) to grow: # perf record -e page-faults -p PID -g -- sleep 120 ⢠"fault" as (physical) main memory is allocated ondemand, when a virtual page is first populated ⢠Low overhead tool to solve some types of memory leak RES column in top(1) grows becauseslide 81:Page Fault Flame Graph Java code epollslide 82:Context Switches ⢠Show why Java blocked and stopped running on-CPU: # perf record -e context-switches -p PID -g -- sleep 5 ⢠Identifies locks, I/O, sleeps â If code path shouldn't block and looks random, it's an involuntary context switch. I could filter these, but you should have solved them beforehand (CPU load). ⢠e.g., was used to understand framework differences: rxNetty Tomcatslide 83:Context Switch Flame Graph (1/2) rxNetty epoll futexslide 84:Context Switch Flame Graph (2/2) Tomcat sys_poll futexslide 85:Disk I/O Requests ⢠Shows who issued disk I/O (sync reads & writes): # perf record -e block:block_rq_insert -a -g -- sleep 60 ⢠e.g.: page faults in GC? This JVM has swapped out!:slide 86:TCP Events ⢠TCP transmit, using dynamic tracing: # perf probe tcp_sendmsg # perf record -e probe:tcp_sendmsg -a -g -- sleep 1; jmaps # perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso,trace >gt; out.stacks # perf probe --del tcp_sendmsg ⢠Note: can be high overhead for high packet rates â For the current perf trace, dump, post-process cycle ⢠Can also trace TCP connect & accept (lower overhead) ⢠TCP receive is async â Could trace via socket readslide 87:TCP Send Flame Graph Only one code-path taken in this example kernel Java JVM ab (client process)slide 88:CPU Cache Misses ⢠In this example, sampling via Last Level Cache loads: # perf record -e LLC-loads -c 10000 -a -g -- sleep 5; jmaps # perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso >gt; out.stacks ⢠-c is the count (samples once per count) ⢠Use other CPU counters to sample hits, misses, stallsslide 89:One Last Example ⢠Back to a mixed-mode CPU flame graph ⢠What else can we show with color?slide 90:CPI Flame Graph ⢠Cycles Per Instruction! â red == instruction heavy â blue == cycle heavy (likely mem stall cycles) zoomed:slide 91:Links & References Flame Graphs â http://www.brendangregg.com/flamegraphs.html â http://techblog.netflix.com/2015/07/java-in-flames.html â http://techblog.netflix.com/2014/11/nodejs-in-flames.html â http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html Linux perf_events â https://perf.wiki.kernel.org/index.php/Main_Page â http://www.brendangregg.com/perf.html â http://www.brendangregg.com/blog/2015-02-27/linux-profiling-at-netflix.html Netflix Vector â https://github.com/netflix/vector â http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html JDK tickets â JDK8: https://bugs.openjdk.java.net/browse/JDK-8072465 â JDK9: https://bugs.openjdk.java.net/browse/JDK-8068945 hprof: http://www.brendangregg.com/blog/2014-06-09/java-cpu-sampling-using-hprof.htmlslide 92:Oct  2015  Thanks Questions? http://techblog.netflix.com http://slideshare.net/brendangregg http://www.brendangregg.com [email protected] @brendangregg





























































































