zfsday talk (a video is on the last slide). The performance of the file system, or disks, is often the target of blame, especially in multi-tenant cloud environments. At Joyent we deploy a public cloud on ZFS-based systems, and frequently investigate performance with a wide variety of applications in growing environments. This talk is about ZFS performance observability: the tools and approaches we use to quickly see what ZFS is doing. This includes observing ZFS I/O throttling, an enhancement added to illumos-ZFS to isolate performance between neighbouring tenants, and the use of DTrace and heat maps to examine latency distributions and locate outliers.
2. whoami
• G'Day, I'm Brendan
• These days I do systems performance analysis of the cloud
• Regarding ZFS:
  • Perf analysis of ZFS (mostly using DTrace) for 5+ years,
    both enterprise and cloud usage
  • Wrote many DTrace-based ZFS perf analysis tools,
    including those in the DTrace book
  • Developed ZFS L2ARC while at Sun
3. Who is Joyent
• Cloud computing provider (public cloud + software)
• Use ZFS as much as possible:
  • Host storage
  • Guest storage: OS virtualization (SmartOS), and KVM guests
    (Linux, Windows)
• We use ZFS because:
  • Reliability: checksums, COW
  • Features: snapshots, clones, compression, ...
  • Performance: large ARCs
  • It can boil oceans
4. Joyent, cont.
• We build tools for ZFS automation and observability.
• Performance is a key company feature.
  • Need to solve FS/disk issues fast.
5. Agenda
• My top 12 tools for ZFS performance analysis (unsorted):
  • iostat
  • vfsstat
  • zfsslower.d
  • iosnoop
  • iostacks
  • metaslab_free.d
  • spasync.d
  • arcstat
  • arcaccess.d
  • latency counters
  • scatter plots
  • heat maps (CA)
• For cloud computing from within a Zone, add:
  • mysqld_pid_fslatency.d
  • syscall with fi_fs == zfs
15. iostat, cont.
• Minor nits:
  • does not show read/write latency separately. ZFS TXG flushes
    drag up the latency, which looks alarming, but are
    asynchronous. Can use DTrace for the split (see the one-liner
    sketch after this list).
  • no higher-level context: PID, ZFS dataset, file pathname, ...
    (not its role)
• Major problem (although not iostat's fault): commonly confused
  with application-level (logical) I/O.
  • The I/O rates, sizes, and latency can differ dramatically
    between logical file system I/O and physical disk I/O.
  • Users commonly draw the wrong conclusions when only
    provided with iostat statistics to understand a system.
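A minimal sketch of that read/write latency split, using the standard DTrace io provider (this one-liner is not from the slides; it reports disk I/O latency distributions keyed by direction):

# disk I/O latency (ns) split into reads vs writes;
# io:::start and io:::done are matched on the buf pointer (arg0)
dtrace -n '
io:::start { ts[arg0] = timestamp; }
io:::done /ts[arg0]/ {
    @[args[0]->b_flags & B_READ ? "read (ns)" : "write (ns)"] =
        quantize(timestamp - ts[arg0]);
    ts[arg0] = 0;
}'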
17. iostat, cont.
[Diagram: the I/O stack, from the Process in user-land down through the Syscall Interface, VFS, ZFS, the Block Device Interface, kernel drivers as needed (see DTrace book chap 4), and the Disks. iostat and iosnoop observe only the block device and disk level: for so LITTLE visibility. This leaves many perf issues unsolved. Tools for the other layers: mysql_pid_fslatency.d and syscall tracing with fi_fs == zfs (syscall interface), vfsstat (VFS), zioslower.d, spasync.d, metaslab_free.d, arcstat.pl, arcaccess.d (ZFS), iostacks.d (kernel stacks).]
19. vfsstat, cont.
• Good high-level summary of logical I/O: the application's FS workload
• Summarizes by zone
  • Impetus was observability for cloud "noisy neighbors"
• Shows the effect of ZFS I/O throttling (performance isolation)
• Summarizes the performance applications actually experience!
  • Usually a lot better than disk-level, due to ZFS caching (ARC,
    L2ARC) and buffering
• Required kernel changes, new kstats (thanks Bill Pijewski)
20. zfsslower.d
• ZFS reads/writes slower than 10 ms:
# ./zfsslower.d 10
TIME                  PROCESS   D  KB  ms  FILE
2012 Sep 30 04:56:22  beam.smp  R   2  12  /var/db/riak/leveldb/.../205788.sst
2012 Sep 30 04:56:23  beam.smp  R   4  15  /var/db/riak/leveldb/.../152831.sst
2012 Sep 30 04:56:24  beam.smp  R   3  11  /var/db/riak/leveldb/.../220432.sst
2012 Sep 30 04:56:24  beam.smp  R   2  12  /var/db/riak/leveldb/.../208619.sst
2012 Sep 30 04:56:25  beam.smp  R   0  21  /var/db/riak/leveldb/.../210710.sst
2012 Sep 30 04:56:25  beam.smp  R   2  18  /var/db/riak/leveldb/.../217757.sst
2012 Sep 30 04:56:25  beam.smp  R   2  13  /var/db/riak/leveldb/.../191942.sst
2012 Sep 30 04:56:26  cat       R   5  10  /db/run/beam.smp.pid
2012 Sep 30 04:56:26  beam.smp  R   2  11  /var/db/riak/leveldb/.../220389.sst
2012 Sep 30 04:56:27  beam.smp  R   2  12  /var/db/riak/leveldb/.../186749.sst
[...]
• Traces at the VFS level, to show the I/O time the application actually
  suffered
  • allows immediate confirmation or denial of an FS-based (incl.
    disk-based) issue
21. zfsslower.d
• ZFS reads/writes slower than 100 ms:
# ./zfsslower.d 100
TIME                  PROCESS   D  KB   ms  FILE
2012 Sep 30 05:01:17  beam.smp  R   2  144  /var/db/riak/leveldb/.../238108.sst
2012 Sep 30 05:01:54  beam.smp  R   1  149  /var/db/riak/leveldb/.../186222.sst
2012 Sep 30 05:02:35  beam.smp  R   2  188  /var/db/riak/leveldb/.../200051.sst
2012 Sep 30 05:02:35  beam.smp  R   2  159  /var/db/riak/leveldb/.../209376.sst
2012 Sep 30 05:02:35  beam.smp  R   1  178  /var/db/riak/leveldb/.../203436.sst
2012 Sep 30 05:02:40  beam.smp  R   1  172  /var/db/riak/leveldb/.../204688.sst
2012 Sep 30 05:03:11  beam.smp  R   0  200  /var/db/riak/leveldb/.../219837.sst
2012 Sep 30 05:03:38  beam.smp  R   1  142  /var/db/riak/leveldb/.../222443.sst
[...]
(these events are less frequent)
22. zfsslower.d, cont.
• All ZFS reads/writes:
# ./zfsslower.d
TIME                  PROCESS   D  KB  ms  FILE
2012 Sep 30 04:46:09  beam.smp  R   2   0  /var/db/riak/leveldb/.../221844.sst
2012 Sep 30 04:46:09  beam.smp  R   2   0  /var/db/riak/leveldb/.../221155.sst
2012 Sep 30 04:46:09  beam.smp  R   2   0  /var/db/riak/leveldb/.../215917.sst
2012 Sep 30 04:46:09  beam.smp  R   1   0  /var/db/riak/leveldb/.../190134.sst
2012 Sep 30 04:46:09  beam.smp  R   3   0  /var/db/riak/leveldb/.../234539.sst
2012 Sep 30 04:46:09  nginx     W   0   0  /db/log/apps/prod17_nginx_access.log
2012 Sep 30 04:46:09  beam.smp  R   4   0  /var/db/riak/leveldb/.../205368.sst
2012 Sep 30 04:46:09  beam.smp  R   2   1  /var/db/riak/leveldb/.../199665.sst
2012 Sep 30 04:46:09  beam.smp  R   0   0  /var/db/riak/leveldb/.../177866.sst
2012 Sep 30 04:46:09  beam.smp  R  56   0  /var/db/riak/leveldb/.../177866.sst
2012 Sep 30 04:46:09  beam.smp  R   0   0  /var/db/riak/leveldb/.../177866.sst
2012 Sep 30 04:46:09  beam.smp  R   2   0  /var/db/riak/leveldb/.../177866.sst
2012 Sep 30 04:46:09  beam.smp  R   5   0  /var/db/riak/leveldb/.../218473.sst
2012 Sep 30 04:46:09  beam.smp  R   2   0  /var/db/riak/leveldb/.../210712.sst
2012 Sep 30 04:46:09  beam.smp  R   3   0  /var/db/riak/leveldb/.../234194.sst
2012 Sep 30 04:46:09  beam.smp  R   0   0  /var/db/riak/leveldb/.../233302.sst
2012 Sep 30 04:46:09  beam.smp  R  41   0  /var/db/riak/leveldb/.../233302.sst
2012 Sep 30 04:46:09  beam.smp  R   0   0  /var/db/riak/leveldb/.../233302.sst
[...]
23. zfsslower.d, cont.
• Written in DTrace
[...]
fbt::zfs_read:entry,
fbt::zfs_write:entry
{
	self->path = args[0]->v_path;
	self->kb = args[1]->uio_resid / 1024;
	self->start = timestamp;
}

fbt::zfs_read:return,
fbt::zfs_write:return
/self->start && (timestamp - self->start) >= min_ns/
{
	this->iotime = (timestamp - self->start) / 1000000;
	this->dir = probefunc == "zfs_read" ? "R" : "W";
	printf("%-20Y %-16s %1s %4d %6d %s\n", walltimestamp,
	    execname, this->dir, self->kb, this->iotime,
	    self->path != NULL ? stringof(self->path) : "<null>");
}
[...]
• zfsslower.d, also on github, originated from the DTrace book
27. iosnoop, cont.
• 370 lines of code, mostly to process options:
USAGE: iosnoop [-a|-A|-DeghiNostv] [-d device] [-f filename]
               [-m mount_point] [-n name] [-p PID]

	iosnoop         # default output
	-a              # print all data (mostly)
	-A              # dump all data, space delimited
	-D              # print time delta, us (elapsed)
	-e              # print device name
	-g              # print command arguments
	-i              # print device instance
	-N              # print major and minor numbers
	-o              # print disk delta time, us
	-s              # print start time, us
	-t              # print completion time, us
	-v              # print completion time, string
	-d device       # instance name to snoop
	-f filename     # snoop this file only
	-m mount_point  # this FS only
	-n name         # this process name only
	-p PID          # this PID only

eg,
	iosnoop -v      # human readable timestamps
	iosnoop -N      # print major and minor numbers
	iosnoop -m /    # snoop events on filesystem / only
28. iosnoop, cont.
• I originally wrote it in 2004 to solve disk-I/O-by-process
  • Shipped by default on Mac OS X (/usr/bin), Oracle Solaris 11
  • I wrote a companion called iotop. A similar iotop tool later
    appeared for Linux (via the blk tracing framework)
• Doesn't do ZFS pathnames yet (known issue; hackathon?)
  • Pathnames from the block-device layer were always a party
    trick: they relied on a cached vnode->v_path
  • ZFS aggregates I/O. At the block-device layer, no single vnode
    is responsible (and no vnode pointer is available)
34. iostacks.d, cont.
• Can identify the reason for disk I/O
  • including unexplained additional I/O from ZFS
• When it doesn't work (I/O issued from the ZIO pipeline task pool),
  try a similar one-liner on fbt::zio_create:entry, for the stack trace
  from the creation of all ZIOs (see the sketches after this list).
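A sketch of the two one-liners referenced here (standard io provider usage and the illumos fbt probe named on this slide):

# kernel stack traces leading to disk I/O being issued:
dtrace -n 'io:::start { @[stack()] = count(); }'

# when the stacks end in ZIO pipeline taskq threads, trace ZIO creation
# instead, to see the originating code path:
dtrace -n 'fbt::zio_create:entry { @[stack()] = count(); }'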
36. metaslab_free.d, cont.
• Written in DTrace
dtrace:::BEGIN
{
	printf("Tracing ZFS metaslab alloc.  metaslab_df_free_pct = %d %%\n",
	    `metaslab_df_free_pct);
}

fbt::metaslab_df_alloc:entry
{
	this->pct = args[0]->sm_space * 100 / args[0]->sm_size;
	@[this->pct] = count();
}

profile:::tick-1s
{
	printf("\n%Y free %%pct by allocations:", walltimestamp);
	printa(@);
	trunc(@);
}
• metaslab_free.d is also on github
37. metaslab_free.d, cont.
• Shows if allocations are currently fast (first-fit) or slow (best-fit)
• Correlate to performance changes
• Use while tuning (see the example after this list)
  • Can't just try out a new setting and watch performance: perf
    can improve if ZFS switched to a new, emptier metaslab at
    the same time. The script identifies whether that occurred.
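As a hedged example of inspecting that threshold while tuning: metaslab_df_free_pct is the illumos tunable read by the script's BEGIN clause, and the mdb invocations below are one common way to view or set it (a sketch, not from the slides; adjust live kernel values with care):

# view the current first-fit/best-fit switch threshold (percent free):
echo metaslab_df_free_pct/D | mdb -k

# set it to 4% on the running kernel (0t = decimal; -w enables writes):
echo metaslab_df_free_pct/W 0t4 | mdb -kw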
40. spasync.d, cont.
• Simple and effective: helps diagnose TXG sync issues, providing
  information to help guide tuning (see the sketch below).
• Long heritage: Matt Ahrens → Roch Bourbonnais → Ben
  Rockwood → Brendan Gregg
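spasync.d itself is not reproduced in these slides; as a rough sketch of the idea it implements, assuming the illumos fbt probes on spa_sync() and the spa_t spa_name member (struct member names vary by kernel version):

#!/usr/sbin/dtrace -s
/* sketch: distribution of spa_sync() (TXG sync) duration, per pool, in ms */

fbt::spa_sync:entry
/!self->start/
{
	self->spa = args[0];
	self->start = timestamp;
}

fbt::spa_sync:return
/self->start/
{
	@["spa_sync (ms)", stringof(self->spa->spa_name)] =
	    quantize((timestamp - self->start) / 1000000);
	self->start = 0;
	self->spa = 0;
}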
44. arcstat, cont.
• Written in Perl, uses kstats (zfs::arcstats:):
[...]
sub snap_stats {
	my %prev = %cur;
	if ($kstat->update()) {
		printf("<State Changed>\n");
	}
	my $hashref_cur = $kstat->{zfs}{0}{arcstats};
	%cur = %$hashref_cur;
	foreach my $key (keys %cur) {
		next if $key =~ /class/;
		if (defined $prev{$key}) {
			$d{$key} = $cur{$key} - $prev{$key};
[...]
sub calculate {
	%v = ();
	$v{"time"} = strftime("%H:%M:%S", localtime);
	$v{"hits"} = $d{"hits"} / $int;
	$v{"miss"} = $d{"misses"} / $int;
	$v{"read"} = $v{"hits"} + $v{"miss"};
	$v{"hit%"} = 100 * $v{"hits"} / $v{"read"} if $v{"read"} > 0;
	$v{"miss%"} = 100 - $v{"hit%"} if $v{"read"} > 0;
[...]
• github.com/mharsch/arcstat
45. arcstat, cont.
• Options to configure output, including L2ARC stats
• Originally by Neelakanth Nadgir, then Jason Hellenthal (FreeBSD
  port) and Mike Harsch
• Crucial data when analyzing cache performance
  (and easier to read than the raw form: kstat -pn arcstats)
48. arcaccess.d, cont.
• Shows the population age of the ARC based on access time
  • Helps determine working set size and turnover rate
• Previous example showed the ARC was so large that it could keep
  buffers that had not been accessed in over a day.
  • Turnover rate is low, and the working set fits entirely (it is
    likely much smaller than the ARC).
50. Latency counters
• kstat -p zone_vfs:::*_ops shows latency outlier counts since boot:
$ kstat -p zone_vfs:::*_ops
zone_vfs:0:global:100ms_ops     13
zone_vfs:0:global:10ms_ops      2220
zone_vfs:0:global:10s_ops       0
zone_vfs:0:global:1s_ops        0
zone_vfs:1:5cedb79e:100ms_ops   173367
zone_vfs:1:5cedb79e:10ms_ops    64071247
zone_vfs:1:5cedb79e:10s_ops     0
zone_vfs:1:5cedb79e:1s_ops      430        ← yikes, 430 ops > 1s!
• however, this is a small fraction of the total reads/writes:
$ kstat -p zone_vfs:::reads zone_vfs:::writes
zone_vfs:0:global:reads                           666148884
zone_vfs:1:5cedb79e-51c6-41b3-b99d-358b33:reads   1791879457
zone_vfs:0:global:writes                          46861078
zone_vfs:1:5cedb79e-51c6-41b3-b99d-358b33:writes  356075500
• > 1s = 0.00002% (430 / (1791879457 + 356075500))
51. Latency counters, cont.
• Added by Bill Pijewski of Joyent
• Proven invaluable, usually to quickly eliminate ZFS (and disk) as a
  source of issues, after the fact (when the counters are near-zero)
• Has identified real ZFS/disk-level issues as well
53. Scatter plots, cont.
• Has been an effective last resort to investigate nasty PERC
  issues: reads queueing behind writes
• Has a big data problem: many points (x-y coordinates) take time and
  space to capture and render.
• Heat maps solve this, and can be real-time
57. Heat maps
• File system latency is commonly bi-modal: cache hits vs. misses
• Average latency, stddev, and 99th percentile don't make a lot of sense
  • Given line graphs of these, we would not be able to solve
    many problems
  • Need to see the full distribution: heat maps (see the sketch
    after this list)
• Also good at identifying outliers
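As a minimal sketch of collecting the raw data behind such a heat map (per-second disk I/O latency distributions via the io provider; the real heat maps in this talk come from Cloud Analytics, not this one-liner):

# one latency distribution (ms buckets) per second; each interval is a heat-map column
dtrace -n '
io:::start { ts[arg0] = timestamp; }
io:::done /ts[arg0]/ {
    @ = lquantize((timestamp - ts[arg0]) / 1000000, 0, 1000, 10);
    ts[arg0] = 0;
}
tick-1s { printf("\n%Y", walltimestamp); printa(@); trunc(@); }'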
60. Cloud computing: within the zone
• Suspected "noisy neighbors" hurting ZFS performance
  • ZFS I/O throttling was added to address this
• Need ZFS observability to confirm/deny issues from within a zone
  (eg, Joyent SmartMachine); however, kernel access (including the
  DTrace io and fbt providers) is not available.
• vfsstat (already covered) does work, as it is zone-aware
• Cloud-safe DTrace tools include:
  • mysql_pid_latency.d: tracing FS calls from within the app
  • syscall tracing when fds[].fi_fs == "zfs" (see the one-liner
    sketch below)
  • Scripts are in:
    http://dtrace.org/blogs/brendan/2011/05/18/file-system-latency-part-3/
    http://dtrace.org/blogs/brendan/2011/05/24/file-system-latency-part-4/
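A minimal sketch of the second approach from within a zone (syscall tracing filtered on fds[].fi_fs; read() shown here, and other file I/O syscalls can be added the same way):

# ZFS read latency (ns) as seen by the application, using only the syscall provider
dtrace -n '
syscall::read:entry /fds[arg0].fi_fs == "zfs"/ { self->ts = timestamp; }
syscall::read:return /self->ts/ {
    @["ZFS read (ns)"] = quantize(timestamp - self->ts);
    self->ts = 0;
}'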
61. Recap
[Recap diagram: the same I/O stack with its tools. Syscall interface: mysql_pid_fslatency.d, syscall tracing with fi_fs == zfs. VFS: vfsstat. ZFS: zioslower.d, spasync.d, metaslab_free.d, arcstat.pl, arcaccess.d. Kernel stacks: iostacks.d. Block Device Interface and Disks: iostat, iosnoop. Kernel drivers as needed: see DTrace book chap 4.]
62. There's more
• There are more tools (example invocations follow this list):
  • zpool iostat
  • zpool status
    • degraded? resilvering?
  • zfs get all
    • recsize match workload? compression? dedup?
  • ziostat
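Example invocations of the standard commands above (the dataset name pool/fs and the property selection are illustrative):

zpool iostat -v 1       # per-pool and per-vdev I/O statistics, 1-second interval
zpool status -x         # show only pools that are not healthy (degraded, resilvering, errors)
zfs get recordsize,compression,dedup pool/fs   # spot-check key properties for the workload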