USENIX LISA 2010: Performance Visualizations
Video: http://www.youtube.com/watch?v=ml-t-9g6Mjc and http://www.usenix.org/events/lisa10/tech/#gregg
Talk for USENIX LISA 2010, San Jose, by Brendan Gregg.
Description: "Visualizations that include heat maps can be an effective way to present performance data: I/O latency, resource utilization, and more. Patterns can emerge that would be difficult to notice from columns of numbers or line graphs, revealing previously unknown behavior. These visualizations are used in a product as a replacement for traditional metrics such as %CPU, and are allowing end users to identify more issues much more easily (and some issues are becoming nearly impossible to identify with tools such as vmstat(1)). This talk covers what has been learned, crazy heat map discoveries, and thoughts for future applications beyond performance analysis."
PDF: LISA2010_visualizations.pdf
Keywords (from pdftotext):
slide 1:
Performance Visualizations
Brendan Gregg
Software Engineer
[email protected]
USENIX/LISA'10
November, 2010
slide 2:
• G'Day, I'm Brendan
• ... also known as "shouting guy"
slide 3:
硬碟也有鬧情緒 ("hard disks have tantrums too")
slide 4:
• I do performance analysis
• and I'm a DTrace addict
slide 5:
slide 6:
Agenda
• Performance
  • Workload Analysis and Resource Monitoring
  • Understanding available and ideal metrics before plotting
• Visualizations
  • Current examples
    • Latency
    • Utilization
  • Future opportunities
    • Cloud Computing
slide 7:
Visualizations like these
• The "rainbow pterodactyl"
  [heat map figure; axes: Latency (y) vs Time (x)]
• ... which needs quite a bit of explanation
slide 8:
Primary Objectives
• Consider performance metrics before plotting
• See the value of visualizations
• Remember key examples
slide 9:
Secondary Objectives
• Consider performance metrics before plotting
  • Why studying latency is good
  • ... and studying IOPS can be bad
• See the value of visualizations
  • Why heat maps are needed
  • ... and line graphs can be bad
• Remember key examples
  • I/O latency, as a heat map
  • CPU utilization by CPU, as a heat map
slide 10:
Content based on
• "Visualizing System Latency", Communications of the ACM, July 2010, by Brendan Gregg
• and more
slide 11:
Performance
Understanding the metrics before we visualize them
slide 12:
Performance Activities
• Workload analysis
  • Is there an issue? Is an issue real?
  • Where is the issue?
  • Will the proposed fix work? Did it work?
• Resource monitoring
  • How utilized are the environment components?
  • Important activity for capacity planning
slide 13:
Workload Analysis
• Applied during:
  • software and hardware development
  • proof of concept testing
  • regression testing
  • benchmarking
  • monitoring
slide 14:
Workload Performance Issues
• Load
• Architecture
slide 15:
Workload Performance Issues
• Load
  • Workload applied
  • Too much for the system?
  • Poorly constructed?
• Architecture
  • System configuration
  • Software and hardware bugs
slide 16:
Workload Analysis Steps
• Identify or confirm if a workload has a performance issue
  • Quantify
• Locate issue
  • Quantify
• Determine, apply and verify solution
  • Quantify
slide 17:
Quantify
• Finding a performance issue isn't the problem ... it's finding the issue that matters
slide 18:
bugs.mysql.com: "performance"
slide 19:
bugs.opensolaris.org: "performance"
slide 20:
bugs.mozilla.org: "performance"
slide 21:
"performance" bugs
• ... and those are just the known performance bugs
• ... and usually only of a certain type (architecture)
slide 22:
How to Quantify
• Observation based
  • Choose a reliable metric
  • Estimate performance gain from resolving issue
• Experimentation based
  • Apply fix
  • Quantify before vs. after using a reliable metric
slide 23:
Observation based
• For example:
  • Observed: application I/O takes 10 ms
  • Observed: 9 ms of which is disk I/O
  • Suggestion: replace disks with flash-memory based SSDs, with an expected latency of ~100 us
  • Estimated gain: 10 ms -> 1.1 ms (10 ms - 9 ms + 0.1 ms) =~ 9x gain
• Very useful - but not possible without accurate quantification
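A quick runnable version of that estimate (Python; the numbers are the slide's example, the function name is ours):

  def estimated_latency_ms(total, component, replacement):
      # new total latency if one component is swapped for a faster one
      return total - component + replacement

  old = 10.0                                  # observed application I/O (ms)
  new = estimated_latency_ms(10.0, 9.0, 0.1)  # 9 ms disk I/O -> ~100 us SSD
  print(f"{new:.1f} ms, {old / new:.1f}x gain")   # 1.1 ms, 9.1x gain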
slide 24:
Experimentation based
• For example:
  • Observed: Application transaction latency average 10 ms
  • Experiment: Added more DRAM to increase cache hits and reduce average latency
  • Observed: Application transaction latency average 2 ms
  • Gain: 10 ms -> 2 ms = 5x
• Also very useful - but risky without accurate quantification
slide 25:
Metrics to Quantify Performance
• Choose reliable metrics to quantify performance:
  • IOPS
  • transactions/second
  • throughput
  • utilization
  • latency
• Ideally
  • interpretation is straightforward
  • reliable
slide 26:
Metrics to Quantify Performance
• Choose reliable metrics to quantify performance:
  • IOPS, transactions/second, throughput, utilization: generally better suited for Capacity Planning
  • latency: generally better suited for Workload Analysis
• Ideally
  • interpretation is straightforward
  • reliable
slide 27:
Metrics Availability
• Ideally (given the luxury of time):
  • design the desired metrics
  • then see if they exist, or,
  • implement them (eg, DTrace)
• Non-ideally
  • see what already exists
  • make-do (eg, vmstat -> gnuplot; see the sketch below)
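As a make-do example of that vmstat-to-gnuplot route, a minimal Python sketch that turns "vmstat 1" output into CSV for plotting (an assumption: Solaris-style vmstat, where the last three columns are us, sy, id):

  import subprocess

  # stream `vmstat 1`, skip headers, emit CSV of the CPU columns
  proc = subprocess.Popen(["vmstat", "1"], stdout=subprocess.PIPE, text=True)
  second = 0
  print("time,us,sy,id")
  for line in proc.stdout:
      fields = line.split()
      if not fields or not fields[0].isdigit():
          continue                    # skip the "kthr memory ..." header lines
      us, sy, idle = fields[-3:]      # CPU columns are last: us sy id
      print(f"{second},{us},{sy},{idle}")
      second += 1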
slide 28:
Assumptions to avoid
• Available metrics are implemented correctly
  • all software has bugs; eg, CR: 6687884 nxge rbytes and obytes kstat are wrong
  • trust no metric without double checking from other sources
• Available metrics are designed by performance experts
  • sometimes added by the programmer only to debug their code
• Available metrics are complete
  • you won't always find what you really need
slide 29:
Getting technical
• This will be explained using two examples:
  • Workload Analysis
  • Capacity Planning
slide 30:
Example: Workload Analysis
• Quantifying performance issues with IOPS vs latency
• IOPS is commonly presented by performance analysis tools
  • eg: disk IOPS via kstat, SNMP, iostat, ...
slide 31:
IOPS
• Depends on where the I/O is measured
  • app -> library -> syscall -> VFS -> filesystem -> RAID -> device
• Depends on what the I/O is
  • synchronous or asynchronous
  • random or sequential
  • size
• Interpretation difficult
  • what value is good or bad?
  • is there a max?
slide 32:
Some disk IOPS problems
• IOPS Inflation
  • Library or Filesystem prefetch/read-ahead
  • Filesystem metadata
  • RAID stripes
• IOPS Deflation
  • Read caching
  • Write cancellation
  • Filesystem I/O aggregation
• IOPS aren't created equal
slide 33:
IOPS example: iostat -xnz 1
• Consider this disk: 86 IOPS == 99% busy
  [iostat extended device statistics for c1d0: %b 99]
• Versus this disk: 21,284 IOPS == 99% busy
  [iostat extended device statistics for c1d0: kr/s 10642.4, %b 99]
slide 34:
IOPS example: iostat -xnz 1
• Consider this disk: 86 IOPS == 99% busy
• Versus this disk: 21,284 IOPS == 99% busy
  [same iostat output as the previous slide]
• ... they are the same disk, different I/O types
  • 1) 8 Kbyte random
  • 2) 512 byte sequential (on-disk DRAM cache)
slide 35:
Using IOPS to quantify issues
• to identify
  • is 100 IOPS a problem? Per disk?
• to locate
  • 90% of IOPS are random. Is that the problem?
• to verify
  • A filesystem tunable caused IOPS to reduce. Has this fixed the issue?
slide 36:
Using IOPS to quantify issues
• to identify
  • is 100 IOPS a problem? Per disk? (depends...)
• to locate
  • 90% of IOPS are random. Is that the problem? (depends...)
• to verify
  • A filesystem tunable caused IOPS to reduce. Has this fixed the issue? (probably, assuming...)
• We can introduce more metrics to understand these, but standalone IOPS is tricky to interpret
slide 37:
Using latency to quantify issues
• to identify
  • is a 100ms I/O a problem?
• to locate
  • 90ms of the 100ms is lock contention. Is that the problem?
• to verify
  • A filesystem tunable caused the I/O latency to reduce to 1ms. Has this fixed the issue?
slide 38:
Using latency to quantify issues
• to identify
  • is a 100ms I/O a problem? (probably - if synchronous to the app.)
• to locate
  • 90ms of the 100ms is lock contention. Is that the problem? (yes)
• to verify
  • A filesystem tunable caused the I/O latency to reduce to 1ms. Has this fixed the issue? (probably - if 1ms is acceptable)
• Latency is much more reliable, easier to interpret
slide 39:
Latency
• Time from I/O or transaction request to completion
• Synchronous latency has a direct impact on performance
  • Application is waiting
  • higher latency == worse performance
• Not all latency is synchronous:
  • Asynchronous filesystem threads flushing dirty buffers to disk; eg, zfs TXG synchronous thread
  • Filesystem prefetch: no one is waiting at this point
  • TCP buffer and congestion window: individual packet latency may be high, but the pipe is kept full for good throughput performance
slide 40:
Turning other metrics into latency
• Currency converter (* -> ms):
  • random disk IOPS == I/O service latency
  • disk saturation == I/O wait queue latency
  • CPU utilization == code path execution latency
  • CPU saturation == dispatcher queue latency
  • ...
• Quantifying as latency allows different components to be compared, ratios examined, improvements estimated (see the sketch below).
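One way to sketch such a conversion (Python): the utilization law for a single service center, utilization = rate x average service time, rearranged to turn IOPS and %busy into latency. The numbers below are the iostat example from slides 33-34; the function is ours, not from the talk:

  def service_time_ms(iops, busy_pct):
      # utilization law: %busy = IOPS * avg service time; solve for service time
      return (busy_pct / 100.0) / iops * 1000

  print(f"{service_time_ms(86, 99):.1f} ms")     # ~11.5 ms: the 8 Kbyte random disk
  print(f"{service_time_ms(21284, 99):.3f} ms")  # ~0.047 ms: the 512 byte sequential disk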
slide 41:
Example: Resource Monitoring
• Different performance activity
• Focus is environment components, not specific issues
  • incl. CPUs, disks, network interfaces, memory, I/O bus, memory bus, CPU interconnect, I/O cards, network switches, etc.
• Information is used for capacity planning
  • Identifying future issues before they happen
• Quantifying resource monitoring with IOPS vs utilization
slide 42:
IOPS vs Utilization
• Another look at this disk:
  [iostat extended device statistics for c1d0, as before: 86 IOPS and 21,284 IOPS, both %b 99]
• Q. does this system need more spindles for IOPS capacity?
slide 43:
IOPS vs Utilization
• Another look at this disk:
  [iostat output as above]
• Q. does this system need more spindles for IOPS capacity?
  • IOPS (r/s + w/s): ???
  • Utilization (%b): yes (even considering NCQ)
slide 44:
IOPS vs Utilization
• Another look at this disk:
  [iostat output as above]
• Q. does this system need more spindles for IOPS capacity?
  • IOPS (r/s + w/s): ???
  • Utilization (%b): yes (even considering NCQ)
  • Latency (wsvc_t): no
• Latency will identify the issue once it is an issue; utilization will forecast the issue - capacity planning
slide 45:
Performance Summary
• Metrics matter - need to reliably quantify performance
  • to identify, locate, verify
  • try to think, design
• Workload Analysis
  • latency
• Resource Monitoring
  • utilization
• Other metrics are useful to further understand the nature of the workload and resource behavior
slide 46:
Objectives
• Consider performance metrics before plotting
  • Why latency is good
  • ... and IOPS can be bad
• See the value of visualizations
  • Why heat maps are needed
  • ... and line graphs can be bad
• Remember key examples
  • I/O latency, as a heat map
  • CPU utilization by CPU, as a heat map
slide 47:
Visualizations
Current Examples: Latency
slide 48:
Visualizations
• So far we've picked:
  • Latency
    • for workload analysis
  • Utilization
    • for resource monitoring
slide 49:
Latency
• For example, disk I/O
• Raw data looks like this:
  # iosnoop -o
  DTIME  UID  PID  D  BLOCK  SIZE  COMM  PATHNAME
  [... per-I/O rows: UID 337, D=R, COMM bash/tar, PATHNAME /usr/sbin/tar, /etc/default/lu, /etc/default/cron, /etc/default/devfsadm, ...]
  [...many pages of output...]
• iosnoop is DTrace based
• examines latency for every disk (back end) I/O
slide 50:
Latency Data
• tuples
  • I/O completion time
  • I/O latency
• can be 1,000s of these per second
slide 51:
Summarizing Latency
• iostat(1M) can show per second average:
  $ iostat -xnz 1
  [... extended device statistics for c1d0, one one-second interval after another: %b 90, 92, 94 ...]
slide 52:
Per second
• Condenses I/O completion time
• Almost always a sufficient resolution
• (So far I've only had one case where examining raw completion time data was crucial: an interrupt coalescing bug)
slide 53:
Average/second
• Average loses latency outliers
• Average loses latency distribution
• ... but not disk distribution:
  $ iostat -xnz 1
  [... extended device statistics, one row per disk ...]
  %b  device
  34  c0t5000CCA215C46459d0
  36  c0t5000CCA215C4521Dd0
  35  c0t5000CCA215C45F89d0
  32  c0t5000CCA215C42A4Cd0
  34  c0t5000CCA215C45541d0
  34  c0t5000CCA215C458F1d0
  33  c0t5000CCA215C450E3d0
  35  c0t5000CCA215C45323d0
• only because iostat(1M) prints this per-disk
• but that gets hard to read for 100s of disks, per second!
slide 54:
Latency outliers
• Occasional high-latency I/O
• Can be the sole reason for performance issues
• Can be lost in an average
  • 10,000 fast I/O @ 1ms
  • 1 slow I/O @ 500ms
  • average = 1.05 ms
• Can be seen using max instead of (or as well as) average
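That arithmetic as a runnable check (Python):

  latencies_ms = [1.0] * 10000 + [500.0]        # 10,000 fast I/O plus one outlier

  print(sum(latencies_ms) / len(latencies_ms))  # 1.0499...: the outlier is invisible
  print(max(latencies_ms))                      # 500.0: the maximum reveals it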
slide 55:
Maximum/second
• iostat(1M) doesn't show this, however DTrace can
• can be visualized along with average/second
• does identify outliers
• doesn't identify latency distribution details
slide 56:
Latency distribution
• Apart from outliers and average, it can be useful to examine the full profile of latency - all the data.
• For such a crucial metric, keep as much detail as possible
• For latency, distributions we'd expect to see include:
  • bi-modal: cache hit vs cache miss
  • tri-modal: multiple cache layers
  • flat: random disk I/O
slide 57:
Latency Distribution Example
• Using DTrace:
  # ./disklatency.d
  Tracing... Hit Ctrl-C to end.
   sd4 (28,256), us:
          value  ------------- Distribution ------------- count
             16 |
             32 |
             64 |@@@
            128 |@@@@@
            256 |@@@@
            512 |@@@
           1024 |@@@@@@@
           2048 |@@@@@@@@@
           4096 |@@@@@@@@
           8192 |
          16384 |
          32768 |
          65536 |
         131072 |
slide 58:
disklatency.d
• not why we are here, but before someone asks...

  #!/usr/sbin/dtrace -s

  #pragma D option quiet

  dtrace:::BEGIN
  {
          printf("Tracing... Hit Ctrl-C to end.\n");
  }

  io:::start
  {
          /* record the start timestamp, keyed on the buf address */
          start_time[arg0] = timestamp;
  }

  io:::done
  /this->start = start_time[arg0]/
  {
          /* elapsed time for this I/O, in microseconds */
          this->delta = (timestamp - this->start) / 1000;
          @[args[1]->dev_statname, args[1]->dev_major, args[1]->dev_minor] =
              quantize(this->delta);
          start_time[arg0] = 0;
  }

  dtrace:::END
  {
          printa("  %s (%d,%d), us:\n%@d\n", @);
  }
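For readers without DTrace handy, a rough Python equivalent of the power-of-two bucketing that quantize() performs (a sketch, not DTrace's implementation; the sample latencies are hypothetical):

  def quantize(values):
      # count each value into its largest power-of-two bucket <= value
      buckets = {}
      for v in values:
          b = 1
          while b * 2 <= v:
              b *= 2
          buckets[b] = buckets.get(b, 0) + 1
      return buckets

  lat_us = [85, 120, 130, 2400, 3100, 95000]    # I/O latencies (us)
  for bucket, count in sorted(quantize(lat_us).items()):
      print(f"{bucket:>8} |{'@' * count}")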
slide 59:
Latency Distribution Example
  # ./disklatency.d
  Tracing... Hit Ctrl-C to end.
   sd4 (28,256), us:
  [distribution as on the previous slide, with the 65536 and 131072 buckets annotated: 65 - 131 ms outliers]
• ... but can we see this distribution per second?
• ... how do we visualize a 3rd dimension?
slide 60:
Column Quantized Visualization
aka "heat map"
• For example:
slide 61:
Heat Map: Offset Distribution
• x-axis: time
• y-axis: offset
• z-axis (color scale): I/O count for that time/offset range
• Identified random vs. sequential very well
• Similar heat maps have been used before by defrag tools
slide 62:
Heat Map: Latency Distribution
• For example:
• x-axis: time
• y-axis: latency
• z-axis (color saturation): I/O count for that time/latency range
slide 63:
Heat Map: Latency Distribution
• ... in fact, this is a great example:
• reads served from: DRAM, then disk; with ZFS "L2ARC" enabled: DRAM, flash-memory based SSD, then disk
slide 64:
Heat Map: Latency Distribution
• ... in fact, this is a great example:
• reads served from: DRAM, then disk; with ZFS "L2ARC" enabled: DRAM, flash-memory based SSD, then disk
slide 65:
Latency Heat Map
• A color shaded matrix of pixels
• Each pixel is a time and latency range
• Color shade picked based on number of I/O in that range
• Adjusting saturation seems to work better than color hue. Eg:
  • darker == more I/O
  • lighter == less I/O
slide 66:
Pixel Size
• Large pixels (and corresponding time/latency ranges)
  • increases likelihood that adjacent pixels include I/O, have color, and combine to form patterns
  • allows color to be more easily seen
• Smaller pixels (and time/latency ranges)
  • can make heat map look like a scatter plot
  • of the same color - if ranges are so small only one I/O is typically included
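A minimal sketch of this column quantization in Python (the tuples are hypothetical; TIME_BUCKET and LAT_BUCKET are the pixel-size trade-off just described, and the associative array anticipates the data storage slide below):

  from collections import defaultdict

  ios = [(0.01, 120), (0.02, 130), (0.5, 8000), (1.3, 125), (1.31, 95000)]
                         # (completion time s, latency us) tuples, eg from iosnoop

  TIME_BUCKET = 1.0      # seconds per pixel column
  LAT_BUCKET = 1000      # microseconds per pixel row

  heatmap = defaultdict(int)      # sparse: only populated pixels are stored
  for t, lat in ios:
      heatmap[(int(t // TIME_BUCKET), int(lat // LAT_BUCKET))] += 1

  # color saturation would then be picked from each pixel's count
  for (col, row), count in sorted(heatmap.items()):
      lo = row * LAT_BUCKET
      print(f"t={col} lat={lo}-{lo + LAT_BUCKET - 1}us: {count} I/O")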
slide 67:
Color Palette
• Linear scale can make subtle details (outliers) difficult to see
  • observing latency outliers is usually of high importance
  • outliers are usually ...
slide 68:
Outliers
• Heat maps show these very well
• However, latency outliers can compress the bulk of the heat map data
  [figure: outlier marked]
• eg, 1 second outlier while most I/O is ...
slide 69:
Data Storage
• Since heat-maps are three dimensions, storing this data can become costly (volume)
• Most of the data points are zero
  • and you can prevent storing zeros by only storing populated elements: associative array
• You can reduce to a sufficiently high resolution, and resample lower as needed
• You can also be aggressive at reducing resolution at higher latencies
  • 10 us granularity not as interesting for I/O > 1 second
  • non-linear resolution
slide 70:
Other Interesting Latency Heat Maps
• The "Icy Lake"
• The "Rainbow Pterodactyl"
• Latency Levels
slide 71:
The "Icy Lake" Workload
• About as simple as it gets:
  • Single client, single thread, sequential synchronous 8 Kbyte writes to an NFS share
  • NFS server: 22 x 7,200 RPM disks, striped pool
• The resulting latency heat map was unexpected
slide 72:
The "Icy Lake"
slide 73:
"Icy Lake" Analysis: Observation
• Examining single disk latency:
• Pattern match with NFS latency: similar lines
  • each disk contributing some lines to the overall pattern
slide 74:
Pattern Match?
• We just associated NFS latency with disk device latency, using our eyeballs
  • see the titles on the previous heat maps
• You can programmatically do this (DTrace), but that can get difficult to associate context across software stack layers (but not impossible!)
• Heat Maps allow this part of the problem to be offloaded to your brain
  • and we are very good at pattern matching
slide 75:
"Icy Lake" Analysis: Experimentation
• Same workload, single disk pool:
• No diagonal lines
• but more questions - see the line (false color palette enhanced) at 9.29 ms? this is ...
slide 76:
"Icy Lake" Analysis: Experimentation
• Same workload, two disk striped pool:
• Ah-hah! Diagonal lines.
• ... but still more questions: why does the angle sometimes change? why do some lines slope upwards and some down?
slide 77:
"Icy Lake" Analysis: Experimentation
• ... each disk from that pool:
slide 78:
"Icy Lake" Analysis: Questions
• Remaining Questions:
  • Why does the slope sometimes change?
  • What exactly seeds the slope in the first place?
slide 79:
"Icy Lake" Analysis: Mirroring
• Trying mirroring the pool disks instead of striping:
slide 80:
Another Example: "X marks the spot"
slide 81:
The "Rainbow Pterodactyl" Workload
• 48 x 7,200 RPM disks, 2 disk enclosures
• Sequential 128 Kbyte reads to each disk (raw device), adding disks every 2 seconds
• Goal: Performance analysis of system architecture
  • identifying I/O throughput limits by driving the I/O subsystem to saturation, one disk at a time (finds knee points)
slide 82:
The "Rainbow Pterodactyl"
slide 83:
The "Rainbow Pterodactyl"
slide 84:
The "Rainbow Pterodactyl"
  [figure annotated: Beak, Head, Bulge, Neck, Shoulders, Wing, Body]
slide 85:
The "Rainbow Pterodactyl": Analysis
• Hasn't been understood in detail
• Would never be understood (or even known) without heat maps
• It is repeatable
slide 86:
The "Rainbow Pterodactyl": Theories
• "Beak": disk cache hit vs disk cache miss -> bimodal
• "Head": 9th disk, contention on the 2 x4 SAS ports
• "Bulge": ?
• "Neck": ?
• "Wing": contention?
• "Shoulders": ?
• "Body": PCI-gen1 bus contention
slide 87:
Latency Levels Workload
• Same as "Rainbow Pterodactyl", stepping disks
• Instead of sequential reads, this is repeated 128 Kbyte reads (read -> seek 0 -> read -> ...), to deliberately hit from the disk DRAM cache to improve test throughput
slide 88:
Latency Levels
slide 89:
Latency Levels Theories
• ???
slide 90:
Bonus Latency Heat Map
• This time we do know the source of the latency...
slide 91:
硬碟也有鬧情緒 ("hard disks have tantrums too")
slide 92:
Latency Heat Maps: Summary
• Shows latency distribution over time
• Shows outliers (maximums)
• Indirectly shows average
• Shows patterns
  • allows correlation with other software stack layers
slide 93:
Similar Heat Map Uses
• These all have a dynamic y-axis scale:
  • I/O size
  • I/O offset
• These aren't a primary measure of performance (like latency); they provide secondary information to understand the workload
slide 94:
Heat Map: I/O Offset
• y-axis: I/O offset (in this case, NFSv3 file location)
slide 95:
Heat Map: I/O Size
• y-axis: I/O size (bytes)
slide 96:
Heat Map Abuse
• What can we "paint" by adjusting the workload?
slide 97:
I/O Size
• How was this done?
slide 98:
I/O Offset
• How was this done?
slide 99:
I/O Latency
• How was this done?
slide 100:
Visualizations
Current Examples: Utilization
slide 101:
CPU Utilization
• Commonly used indicator of CPU performance
• eg, vmstat(1M)
  $ vmstat 1 5
   kthr      memory            page            disk          faults      cpu
   r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
   0 0 0 95125264 28022732 301 1742 1 17 17 0 0 -0 -0 -0 6 5008 21927 3886  4  1 94
   0 0 0 91512024 25075924 6 55 0 0 0 0 0 0 0 0 0 4665 18228 4299 10  1 89
   0 0 0 91511864 25075796 9 24 0 0 0 0 0 0 0 0 0 3504 12757 3158  8  0 92
   0 0 0 91511228 25075164 3 163 0 0 0 0 0 0 0 0 0 4104 15375 3611  9  5 86
   0 0 0 91510824 25074940 5 66 0 0 0 0 0 0 0 0 0 4607 19492 4394 10  1 89
slide 102:
CPU Utilization: Line Graph
• Easy to plot:
slide 103:
CPU Utilization: Line Graph
• Easy to plot:
• Average across all CPUs:
  • Identifies how utilized all CPUs are, indicating remaining headroom - provided sufficient threads to use CPUs
slide 104:
CPU Utilization by CPU
• mpstat(1M) can show utilization by-CPU:
  $ mpstat 1
  CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  [... one row per CPU ...]
• can identify a single hot CPU (thread)
• and un-balanced configurations
slide 105:
CPU Resource Monitoring
• Monitor overall utilization for capacity planning
• Also valuable to monitor individual CPUs
  • can identify un-balanced configurations
  • such as a single hot CPU (thread)
• The virtual CPUs on a single host can now reach the 100s
  • its own dimension
  • how can we display this 3rd dimension?
slide 106:
Heat Map: CPU Utilization
  [heat map figure: y-axis 0-100%, x-axis 60s]
• x-axis: time
• y-axis: percent utilization
• z-axis (color saturation): # of CPUs in that time/utilization range (see the sketch after slide 108)
slide 107:
Heat Map: CPU Utilization
  [heat map figure: idle CPUs low, single "hot" CPU pinned at 100%]
• Single "hot" CPUs are a common problem due to application scalability issues (single threaded)
• This makes identification easy, without reading pages of mpstat(1M) output
slide 108:
Heat Map: Disk Utilization
• Ditto for disks
• Disk Utilization heat map can identify:
  • overall utilization
  • unbalanced configurations
  • single hot disks (versus all disks busy)
• Ideally, the disk utilization heat map is tight (y-axis) and below 70%, indicating a well balanced config with headroom
  • which can't be visualized with line graphs
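A sketch of how the CPU utilization heat map above (slides 106-107) can be binned (Python; the per-CPU utilization samples are made up, as if parsed from mpstat(1M)):

  from collections import defaultdict

  cpu_util = {                    # hypothetical usr+sys % per CPU, per second
      0: [98, 97, 99],            # one hot CPU: a single-threaded application
      1: [12, 10, 11],
      2: [9, 8, 14],
      3: [11, 13, 10],
  }

  BUCKET = 10                     # percent utilization per heat-map row
  heatmap = defaultdict(int)      # (second, bucket) -> # of CPUs in that range
  for cpu, samples in cpu_util.items():
      for sec, util in enumerate(samples):
          heatmap[(sec, min(util // BUCKET, 9))] += 1

  for sec in range(3):            # the hot CPU is the lone count in the top bucket
      print([heatmap.get((sec, b), 0) for b in range(10)])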
slide 109:
Back to Line Graphs...
• Are typically used to visualize performance, be it IOPS or utilization
• Show patterns over time more clearly than text (higher resolution)
• But graphical environments can do much more
• As shown by the heat maps (to start with), which convey details line graphs cannot
• Ask: what "value add" does the GUI bring to the data?
slide 110:
Resource Utilization Heat Map Summary
• Can exist for any resource with multiple components:
  • CPUs
  • Disks
  • Network interfaces
  • I/O busses
  • ...
• Quickly identifies single hot component versus all components
• Best suited for physical hardware resources
  • difficult to express "utilization" for a software resource
slide 111:
Visualizations
Future Opportunities
slide 112:
Cloud Computing
So far analysis has been for a single server
What about the cloud?
slide 113:
From one to thousands of servers
• Workload Analysis: latency
  • I/O x cloud
• Resource Monitoring:
  • # of CPUs x cloud
  • # of disks x cloud
  • etc.
slide 114:
Heat Maps for the Cloud
• Heat Maps are promising for cloud computing observability:
  • additional dimension accommodates the scale of the cloud
• Find outliers regardless of node
  • cloud-wide latency heat map just has more I/O
• Examine how applications are load balanced across nodes
  • similar to CPU and disk utilization heat maps
• mpstat and iostat's output are already getting too long
  • multiply by 1000x for the number of possible hosts in a large cloud application
slide 115:
Proposed Visualizations
• Include:
  • Latency heat map across entire cloud
  • Latency heat maps for cloud application components
  • CPU utilization by cloud node
  • CPU utilization by CPU
  • Thread/process utilization across entire cloud
  • Network interface utilization by cloud node
  • Network interface utilization by port
  • lots, lots more
slide 116:
Cloud Latency Heat Map
• Latency at different layers:
  • Apache
  • PHP/Ruby/...
  • MySQL
  • DNS
  • Disk I/O
  • CPU dispatcher queue latency
• and pattern match to quickly identify and locate latency
slide 117:
Latency Example: MySQL
• Query latency (DTrace):
  query time (ns)
          value  ------------- Distribution ------------- count
           1024 |
           2048 |
           4096 |@
           8192 |
          16384 |@
          32768 |@
          65536 |@
         131072 |@@@@@@@@@@@@@
         262144 |@@@@@@@@@@@
         524288 |@@@@
        1048576 |@@
        2097152 |@
        4194304 |
        8388608 |@@@@@
       16777216 |
       33554432 |
       67108864 |
      134217728 |
      268435456 |
      536870912 |
slide 118:
Latency Example: MySQL
• Query latency (DTrace):
  [same distribution, annotated at the 8388608 bucket: What is this? (8-16 ms latency)]
slide 119:
Latency Example: MySQL
  [query time distribution as above, annotated: oh...]
  innodb srv sleep (ns)
          value  ------------- Distribution ------------- count
        4194304 |                                         0
        8388608 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 841
       16777216 |                                         0
slide 120:
Latency Example: MySQL
• Spike of MySQL query latency: 8 - 16 ms
• innodb thread concurrency back-off sleep latency: 8 - 16 ms
• Both have a similar magnitude (see "count" column)
• Add the dimension of time as a heat map, for more characteristics to compare
• ... quickly compare heat maps from different components of the cloud to pattern match and locate latency
slide 121:
Cloud Latency Heat Map
• Identify latency outliers, distributions, patterns
• Can add more functionality to identify these by:
  • cloud node
  • application, cloud-wide
  • I/O type (eg, query type)
• Targeted observability (DTrace) can be used to fetch this
• Or, we could collect it for everything
• ... do we need a 4th dimension?
slide 122:
4th Dimension!
• Bryan Cantrill @Joyent coded this 11 hours ago
  • assuming it's now about 10:30am during this talk
• ... and I added these slides about 7 hours ago
slide 123:
4th Dimension Example: Thread Runtime
  [heat map figure: y-axis 0 ms - 100 ms]
• x-axis: time
• y-axis: thread runtime
• z-axis (color saturation): count at that time/runtime range
slide 124:
4th Dimension Example: Thread Runtime
• x-axis: time
• y-axis: thread runtime
• z-axis (color saturation): count at that time/runtime range
• omega-axis (color hue): application
  • blue == "coreaudiod"
slide 125:
4th Dimension Example: Thread Runtime
• (axes as above)
• omega-axis (color hue): application
  • green == "iChat"
slide 126:
4th Dimension Example: Thread Runtime
• (axes as above)
• omega-axis (color hue): application
  • violet == "Chrome"
slide 127:
4th Dimension Example: Thread Runtime
• (axes as above)
• omega-axis (color hue): application
  • All colors
slide 128:
"Dimensionality"
• While the data supports the 4th dimension, visualizing this properly may become difficult (we are eager to find out)
  • The image itself is still only 2 dimensional
• May be best used to view a limited set, to limit the number of different hues; uses can include:
  • Highlighting different cloud application types: DB, web server, etc.
  • Highlighting one from many components: single node, CPU, disk, etc.
• Limiting the set also helps storage of data
slide 129:
More Visualizations
• We plan much more new stuff
• We are building a team of engineers to work on it; including Bryan Cantrill, Dave Pacheco, and myself
• Dave and I have only been at Joyent for 2 1/2 weeks
slide 130:
Beyond Performance Analysis
• Visualizations such as heat maps could also be applied to:
  • Security, with pattern matching for:
    • robot identification based on think-time latency analysis
    • inter-keystroke-latency analysis
    • brute force username latency attacks?
  • System Administration
    • monitoring quota usage by user, filesystem, disk
  • Other multi-dimensional datasets
slide 131:
Objectives
• Consider performance metrics before plotting
  • Why latency is good
  • ... and IOPS can be bad
• See the value of visualizations
  • Why heat maps are needed
  • ... and line graphs can be bad
• Remember key examples
  • I/O latency, as a heat map
  • CPU utilization by CPU, as a heat map
slide 132:
Heat Map: I/O Latency
• Latency matters
  • synchronous latency has a direct impact on performance
• Heat map shows
  • outliers, balance, cache layers, patterns
slide 133:
Heat Map: CPU Utilization
  [heat map figure: y-axis 0-100%, x-axis 60s]
• Identify single threaded issues
  • single CPU hitting 100%
• Heat map shows
  • fully utilized components, balance, overall headroom, patterns
slide 134:
Tools Demonstrated
• For Reference:
  • DTraceTazTool
    • 2006; based on TazTool by Richard McDougall, 1995. Open source, unsupported, and probably no longer works (sorry).
  • Analytics
    • 2008; Oracle Sun ZFS Storage Appliance
  • "new stuff" (not named yet)
    • 2010; Joyent; Bryan Cantrill, Dave Pacheco, Brendan Gregg
slide 135:
Question Time
• Thank you!
• How to find me on the web:
  • http://dtrace.org/blogs/brendan
  • http://blogs.sun.com/brendan