ZFSperftools2012

ZFS Performance
Analysis and Tools
Brendan Gregg
Lead Performance Engineer
brendan@joyent.com
@brendangregg

ZFS Day
October, 2012

whoami

•
•
•

G’Day, I’m Brendan
These days I do systems performance analysis of the cloud
Regarding ZFS:

•

Perf analysis of ZFS (mostly using DTrace) for 5+ years,
both enterprise and cloud usage

•
•

Wrote many DTrace-based ZFS perf analysis tools
including those in the DTrace book

Developed ZFS L2ARC while at Sun

Who is Joyent

•
•

Cloud computing provider (public cloud + software)
Use ZFS as much as possible:

•
•
•

Host storage
Guest storage: OS virtualization (SmartOS), and KVM guests
(Linux, Windows)

We use ZFS because

•

Reliability: checksums, COW

•

Features: snapshots, clones, compression, ...

•

Performance: large ARCs

•

It can boil oceans

Joyent, cont.

•
•

We build tools for ZFS automation and observability.
Performance is a key company feature.

•

Need to solve FS/disk issues fast.

Agenda

•

My top 12 tools for ZFS performance analysis (unsorted):

•
•
•
•
•
•
•
•
•
•
•
•

iostat
vfsstat
zfsslower.d
iosnoop
iostacks
metaslab_free.d
spasync.d
arcstat
arcaccess.d
latency counters
scatter plots
heat maps (CA)

For cloud computing
from within a Zone, add:

•
•

mysqld_pid_fslatency.d
syscall with ﬁ_fs == zfs

Functional diagram: full stack

•

Unix 101

Process
User-Land
Kernel

logical I/O

VFS
ZFS

...

Block Device Interface
physical I/O

Disks

Syscall
Interface

Functional diagram: full stack, cont.

•

Unix 102

User-Land
Kernel

sync.

Process
Syscall
Interface
VFS

ZFS

...

Disks

iostat(1)
often async:
write buﬀering,
read ahead

Functional diagram: full stack, cont.

•

DTrace 301

mysql_pid_fslatency.d
syscall with
fi_fs == zfs

Process
User-Land
Kernel
VFS

zioslower.d
spasync.d
metaslab_free.d
arcstat.pl
arcaccess.d

Syscall
Interface
vfsstat

ZFS

...
iostacks.d


kernel drivers as needed
see DTrace book chap 4

Disks

iostat
iosnoop

ZFS Internals

•
•

That’s just my top 12
Use more as needed

ZFS Internals

•
•

That’s just my top 12
Use more as needed

DTRACE

THINGS!

http://hub.opensolaris.org/bin/view/Community+Group+zfs/source
http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html

iostat

•

Block-device level (almost disk-level) I/O statistics:
$ iostat -xnz 1
[...]
r/s
0.0
1.0

w/s
11.0
381.0

r/s
0.0
1.0

w/s
9.0
154.9

r/s
0.0
6.0

w/s
8.0
0.0

extended device statistics
kr/s
kw/s wait actv wsvc_t asvc_t %w %b device
0.0
52.0 0.0 0.0
0.0
1.0
0
1 c0t0d0
16.0 43325.5 0.0 4.0
0.0
10.4
1 12 c0t1d0
kr/s
0.0
34.0 0.0 0.0
0.0
0.1
0
0 c0t0d0
16.0 1440.5 0.0 2.0
0.0
12.6
0 10 c0t1d0
kr/s
0.0
36.0 0.0 0.0
0.0
0.0
0
0 c0t0d0
96.0
0.0 0.0 0.0
0.0
7.9
0
4 c0t1d0

ZFS-Disk Workload

Disk Resulting Performance

iostat, cont.

•

Effective tool for a class of disk issues, especially:
$ iostat -xnz 1
[...]
r/s
0.0
r/s
0.0

w/s
0.0

r/s
0.0

w/s
0.0

r/s
0.0

w/s
0.0

r/s
0.0

•

w/s
0.0

w/s
0.0

extended
kr/s
kw/s
0.0
0.0
extended
kr/s
kw/s
0.0
0.0
extended
kr/s
kw/s
0.0
0.0
extended
kr/s
kw/s
0.0
0.0
extended
kr/s
kw/s
0.0
0.0

device statistics
wait actv wsvc_t asvc_t
0.0 4.0
0.0
0.0
device statistics
0.0 4.0
0.0
0.0
device statistics
0.0 4.0
0.0
0.0
device statistics
0.0 4.0
0.0
0.0
device statistics
0.0 4.0
0.0
0.0

Disks “out to lunch” (PERC ECC error)

%w %b device
0 100 c0t0d0
%w %b device
0 100 c0t0d0
%w %b device
0 100 c0t0d0
%w %b device
0 100 c0t0d0
%w %b device
0 100 c0t0d0

iostat, cont.

•

Minor nits:

•

•
•

does not show read/write latency separately. ZFS TXG flushes
drag up the latency, which looks alarming, but are
asynchronous. Can use DTrace for the split.
no higher level context: PID, ZFS dataset, file pathname, ...
(not its role)

Major problem (although, not iostat’s fault): commonly confused
with application-level (logical) I/O.

•

The I/O rates, sizes, and latency, can dramatically differ
between logical file system I/O and physical disk I/O.

•

Users commonly draw the wrong conclusions when only
provided with iostat statistics to understand a system.

iostat, cont.

•

iostat output (or disk kstats) graphed by various monitoring software

So MANY graphs!

iostat, cont.

For so LITTLE visibility

syscall with
fi_fs == zfs

Process

This leaves many perf issues unsolved
User-Land
Kernel
VFS

zioslower.d
spasync.d
metaslab_free.d
arcstat.pl
arcaccess.d

ZFS

Syscall
Interface
vfsstat

...
iostacks.d



Disks

iostat
iosnoop

vfsstat

•

VFS-level I/O statistics (VFS-iostat):

# vfsstat -Z 1
r/s
w/s kr/s
1.2
2.8
0.6
0.1
0.0
0.1
0.1
0.0
0.1
0.0
0.0
0.0
0.3
0.1
0.1
5.9
0.2
0.5
0.1
0.0
0.1
0.1
0.0
0.1
0.2
0.8
0.5

kw/s ractv wactv read_t writ_t
0.2
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.3
0.0
0.0
0.0
0.0
0.1
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.1
0.6
0.0
0.0
0.0
0.0

App-ZFS Workload

%r
0
0
0
0
0
0
0
0
0

%w
0
0
0
0
0
0
0
0
0

ZFS Resulting Performance

d/s
0.0
0.0
0.0
0.0
0.0
5.0
0.0
0.0
0.0

del_t
0.0
34.9
46.5
16.5
27.6
11.3
132.0
40.7
31.9

ZFS I/O
Throttling

zone
global (0)
9cc2d0d3 (2)
72188ca0 (3)
4d2a62bb (4)
8bbc4000 (5)
d305ee44 (6)
9897c8f5 (7)
5f3c7d9e (9)
22ef87fc (10)

vfsstat, cont.

•
•

Good high-level summary of logical I/O: application FS workload
Summarizes by zone

•
•
•

Impetus was observability for cloud “noisy neighbors”
Shows aﬀect of ZFS I/O throttling (performance isolation)

Summarizes performance applications actually experience!

•
•

Usually a lot better than disk-level, due to ZFS caching (ARC,
L2ARC) and buﬀering

Required kernel changes, new kstats (thanks Bill Pijewski)

zfsslower.d

•

ZFS reads/writes slower than 10 ms:

# ./zfsslower.d 10
TIME
2012 Sep 30 04:56:22
2012 Sep 30 04:56:23
2012 Sep 30 04:56:24
2012 Sep 30 04:56:24
2012 Sep 30 04:56:25
2012 Sep 30 04:56:25
2012 Sep 30 04:56:25
2012 Sep 30 04:56:26
2012 Sep 30 04:56:26
2012 Sep 30 04:56:27
[...]

•

PROCESS
beam.smp
beam.smp
beam.smp
beam.smp
beam.smp
beam.smp
beam.smp
cat
beam.smp
beam.smp

D
R
R
R
R
R
R
R
R
R
R

KB
2
4
3
2
0
2
2
5
2
2

ms
12
15
11
12
21
18
13
10
11
12

FILE
/var/db/riak/leveldb/.../205788.sst
/db/run/beam.smp.pid

Traces at VFS level to show the true application suffered I/O time

•

allows immediate conﬁrm/deny of FS (incl. disk) based issue

zfsslower.d

•

ZFS reads/writes slower than 100 ms:

# ./zfsslower.d 100
TIME
2012 Sep 30 05:01:17
2012 Sep 30 05:01:54
2012 Sep 30 05:02:35
2012 Sep 30 05:02:35
2012 Sep 30 05:02:35
2012 Sep 30 05:02:40
2012 Sep 30 05:03:11
2012 Sep 30 05:03:38
[...]

PROCESS
beam.smp
beam.smp
beam.smp
beam.smp
beam.smp
beam.smp
beam.smp
beam.smp

less frequent

D
R
R
R
R
R
R
R
R

KB
2
1
2
2
1
1
0
1

ms
144
149
188
159
178
172
200
142

FILE

zfsslower.d, cont.

•

All ZFS read/writes:

# ./zfsslower.d
TIME
2012 Sep 30 04:46:09
2012 Sep 30 04:46:09
2012 Sep 30 04:46:09
2012 Sep 30 04:46:09
2012 Sep 30 04:46:09
2012 Sep 30 04:46:09
2012 Sep 30 04:46:09
2012 Sep 30 04:46:09
2012 Sep 30 04:46:09
2012 Sep 30 04:46:09
2012 Sep 30 04:46:09
2012 Sep 30 04:46:09
2012 Sep 30 04:46:09
2012 Sep 30 04:46:09
2012 Sep 30 04:46:09
2012 Sep 30 04:46:09
2012 Sep 30 04:46:09
2012 Sep 30 04:46:09
[...]

PROCESS
beam.smp
beam.smp
beam.smp
beam.smp
beam.smp
nginx
beam.smp
beam.smp
beam.smp
beam.smp
beam.smp
beam.smp
beam.smp
beam.smp
beam.smp
beam.smp
beam.smp
beam.smp

D
R
R
R
R
R
W
R
R
R
R
R
R
R
R
R
R
R
R

KB
2
2
2
1
3
0
4
2
0
56
0
2
5
2
3
0
41
0

ms
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0

FILE
/db/log/apps/prod17_nginx_access.log

zfsslower.d, cont.

•

Written in DTrace
[...]
fbt::zfs_read:entry,
fbt::zfs_write:entry
{
self-path = args[0]-v_path;
self-kb = args[1]-uio_resid / 1024;
self-start = timestamp;
}
fbt::zfs_read:return,
fbt::zfs_write:return
/self-start (timestamp - self-start) = min_ns/
{
this-iotime = (timestamp - self-start) / 1000000;
this-dir = probefunc == zfs_read ? R : W;
printf(%-20Y %-16s %1s %4d %6d %sn, walltimestamp,
execname, this-dir, self-kb, this-iotime,
self-path != NULL ? stringof(self-path) : null);
}
[...]

•

zfsslower.d, also on github, originated from the DTrace book

zfsslower.d, cont.

•

Traces VFS/ZFS interface (kernel)
from usr/src/uts/common/fs/zfs/zfs_vnops.c:
/*
* Regular file vnode operations template
*/
vnodeops_t *zfs_fvnodeops;
const fs_operation_def_t zfs_fvnodeops_template[] = {
VOPNAME_OPEN,
{ .vop_open = zfs_open },
VOPNAME_CLOSE,
{ .vop_close = zfs_close },
VOPNAME_READ,
{ .vop_read = zfs_read },
VOPNAME_WRITE,
{ .vop_write = zfs_write },
VOPNAME_IOCTL,
{ .vop_ioctl = zfs_ioctl },
VOPNAME_GETATTR,
{ .vop_getattr = zfs_getattr },
[...]

iosnoop

•

Traces block-device-level I/O:
# ./iosnoop
UID
PID
103 5891
103 5891
103 5891
103 5891
103 5891
[...]

•

D
BLOCK
SIZE
R 238517878 131072
R 544800120 131072
R 286317766 131072
R 74515841 131072
R 98341339 131072

COMM
beam.smp
beam.smp
beam.smp
beam.smp
beam.smp

PATHNAME
none
none
none
none
none

Has many options:
# ./iosnoop -Dots
STIME(us)
TIME(us)
787809430864
787809435385
787809472876
787809478812
787809479681
787809483470
787809484451
787809490703
787809489555
787809497167
787809491122
787809498010
[...]

DELTA(us)
4520
5935
3788
6251
7611
6888

DTIME(us)
4539
5955
3809
6266
6463
842

UID
103
103
103
103
103
103

PID
5891
5891
5891
5891
5891
5891

D
R
R
R
R
R
R

BLOCK
128302372
143783200
84913822
14964144
283398320
288468148

SIZE
131072
131072
131072
131072
131072
131072

COMM
beam.smp
beam.smp
beam.smp
beam.smp
beam.smp
beam.smp

PATHNAME
none
none
none
none
none
none

iosnoop, cont.

•

Written in DTrace
[...]
io:genunix::done
/start_time[args[0]-b_edev, args[0]-b_blkno]/
{
[...]
/* fetch entry values */
this-dev = args[0]-b_edev;
this-blk = args[0]-b_blkno;
this-suid = start_uid[this-dev, this-blk];
this-spid = start_pid[this-dev, this-blk];
this-sppid = start_ppid[this-dev, this-blk];
self-sargs = (int)start_args[this-dev, this-blk] == 0 ?
: start_args[this-dev, this-blk];
self-scomm = start_comm[this-dev, this-blk];
[...]
printf(%5d %5d %1s %8d %6d ,
this-suid, this-spid, args[0]-b_flags B_READ ? R : W,
args[0]-b_blkno, args[0]-b_bcount);
[...]

•

From the DTraceToolkit

iosnoop, cont.

•

370 lines of code, mostly to process options:
USAGE: iosnoop [-a|-A|-DeghiNostv] [-d device] [-f filename]
[-m mount_point] [-n name] [-p PID]
iosnoop
# default output
-a
# print all data (mostly)
-A
# dump all data, space delimited
-D
# print time delta, us (elapsed)
-e
# print device name
-g
# print command arguments
-i
# print device instance
-N
# print major and minor numbers
-o
# print disk delta time, us
-s
# print start time, us
-t
# print completion time, us
-v
# print completion time, string
-d device
# instance name to snoop
-f filename
# snoop this file only
-m mount_point # this FS only
-n name
# this process name only
-p PID
# this PID only
eg,
iosnoop -v
# human readable timestamps
iosnoop -N
# print major and minor numbers
iosnoop -m /
# snoop events on filesystem / only

iosnoop, cont.

•

I originally wrote it in 2004 to solve disk-I/O-by-process

•
•
•

Shipped by default on Mac OS X (/usr/bin), Oracle Solaris 11
I wrote a companion called iotop. A similar iotop tool later
appeared for Linux (via the blk tracing framework)

Dosen’t do ZFS pathnames yet (known issue; hackathon?)

•

Pathnames from the block-device layer was always a party
trick: relied on a cached vnode-v_path

•

ZFS aggregates I/O. At the block-device layer, no one vnode
responsible (or vnode pointer)

iosnoop, cont.

•

Locate latency origin:

Process
Syscall
Interface
VFS
zfsslower.d
Correlate

iosnoop

ZFS

...

Disks

iostacks.d

•

Just a one-liner really (could make an “iostacks.d”):
# dtrace -n 'io:::start { @[stack()] = count(); }'
dtrace: description 'io:::start ' matched 6 probes
^C
genunix`ldi_strategy+0x53
zfs`vdev_disk_io_start+0xcc
zfs`zio_vdev_io_start+0xab
zfs`zio_execute+0x88
zfs`zio_nowait+0x21
zfs`vdev_mirror_io_start+0xcd
zfs`zio_vdev_io_start+0x250
zfs`zio_nowait+0x21
zfs`arc_read_nolock+0x4f9
zfs`arc_read+0x96
zfs`dsl_read+0x44
zfs`dbuf_read_impl+0x166
zfs`dbuf_read+0xab
zfs`dmu_buf_hold_array_by_dnode+0x189
zfs`dmu_buf_hold_array+0x78
zfs`dmu_read_uio+0x5c
zfs`zfs_read+0x1a3
genunix`fop_read+0x8b
genunix`read+0x2a7
143

iostacks.d, cont.

•

Stack recognition chart:
^C
zfs`zio_nowait+0x21
zfs`vdev_mirror_io_start+0xcd
zfs`zio_vdev_io_start+0x250
zfs`zio_nowait+0x21
zfs`arc_read_nolock+0x4f9
zfs`arc_read+0x96
zfs`dsl_read+0x44
zfs`dbuf_read_impl+0x166
zfs`dbuf_read+0xab
zfs`dmu_read_uio+0x5c
zfs`zfs_read+0x1a3
genunix`fop_read+0x8b
genunix`read+0x2a7
143

syscall
read()
arc-miss

iostacks.d, cont.

•

Stack recognition chart:
^C
zfs`vdev_queue_io_done+0x70
zfs`zio_vdev_io_done+0x80
genunix`taskq_thread+0x2d0
unix`thread_start+0x8
395

ZIO
pipeline

iostacks.d, cont.

•

From zio, when parent == NULL (ﬁrst zio):
# dtrace -n 'zio_create:entry /arg0 == NULL/ { @[stack()] = count(); }'
[...]
zfs`zio_null+0x77
zfs`zio_root+0x2d
zfs`dmu_write+0x80
zfs`space_map_sync+0x288
zfs`metaslab_sync+0x135
SPA
zfs`vdev_sync+0x7f
sync
zfs`spa_sync+0x38b
zfs`txg_sync_thread+0x204
unix`thread_start+0x8
7

iostacks.d, cont.

•

Can identify the reason for disk I/O

•
•

including unexplained additional I/O from ZFS

Times it doesn’t work (ZIO pipeline task pool), can try a similar
one-liner from fbt::zio_create:entry, for the stack trace from the
creation of all ZIO.

metaslab_free.d

•

Traces ZFS metaslab details:
# ./metaslab_free.d
Tracing ZFS metaslab alloc.

metaslab_df_free_pct = 4 %

2012 Sep 30 05:32:47 free %pct by allocations:
13
12
89
34
81
287
12
389
2012 Sep 30 05:32:49 free %pct by allocations:
89
15
81
265
12
378

•
•

ZFS pools are split into metaslabs (eg, 10 GB each)
Metaslabs have two allocation algorithms:
metaslab %free = metaslab_df_free_pct == first fit (fast)
metaslab %free metaslab_df_free_pct == best fit (slow)

metaslab_free.d, cont.

•

Written in DTrace

dtrace:::BEGIN
{
printf(Tracing ZFS metaslab alloc.
`metaslab_df_free_pct);
}

metaslab_df_free_pct = %d %%n,

fbt::metaslab_df_alloc:entry
{
this-pct = args[0]-sm_space * 100 / args[0]-sm_size;
@[this-pct] = count();
}
profile:::tick-1s
{
printf(n%Y free %%pct by allocations:, walltimestamp);
printa(@);
trunc(@);
}

•

metaslab_free.d is also on github

metaslab_free.d, cont.

•
•
•

Shows if allocations are currently fast fit or best fit
Correlate to performance changes
Use while tuning

•

Can’t just try out a new setting and watch performance: perf
can improve if ZFS switched to a new, emptier, metaslab at
the same time. Script identifies if that occurred.

spasync.d

•

Traces ZFS SPA syncs:
# ./spasync.d
Tracing ZFS spa_sync() slower than 1 ms...
2012 Sep 30 05:56:07 zones
30 ms, 25 MB 464 I/O
2012 Sep 30 05:56:12 zones
22 ms, 15 MB 366 I/O
2012 Sep 30 05:56:17 zones
39 ms, 19 MB 361 I/O
2012 Sep 30 05:56:22 zones
143 ms, 31 MB 536 I/O
2012 Sep 30 05:56:27 zones
32 ms, 27 MB 465 I/O
2012 Sep 30 05:56:32 zones
27 ms, 24 MB 468 I/O
2012 Sep 30 05:56:37 zones
31 ms, 23 MB 489 I/O
2012 Sep 30 05:56:42 zones
200 ms, 85 MB 956 I/O
2012 Sep 30 05:56:47 zones
122 ms, 46 MB 593 I/O
2012 Sep 30 05:56:52 zones
48 ms, 18 MB 394 I/O
2012 Sep 30 05:56:57 zones
43 ms, 16 MB 409 I/O
2012 Sep 30 05:57:02 zones
33 ms, 27 MB 526 I/O
[...]

•

Check for correlations with I/O latency

spasync.d, cont.

•

Written in DTrace
[...]
fbt::spa_sync:entry
/!self-start/
{
in_spa_sync = 1;
self-start = timestamp;
self-spa = args[0];
}
io:::start
/in_spa_sync/
{
@io = count();
@bytes = sum(args[0]-b_bcount);
}
fbt::spa_sync:return
/self-start (this-ms = (timestamp - self-start) / 1000000) MIN_MS/
{
normalize(@bytes, 1048576);
printf(%-20Y %-10s %6d ms, , walltimestamp,
stringof(self-spa-spa_name), this-ms);
printa(%@d MB %@d I/On, @bytes, @io);
}
[...]

•

spasync.d is also on github

spasync.d, cont.

•

Simple and effective: helps diagnose TXG sync issues, providing
information to help guide tuning.

•

Long heritage: Matt Ahrens - Roch Bourbannias - Ben
Rockwood - Brendan Gregg

%iowait

•

I/O wait percent
# mpstat 1
CPU minf mjf xcal
0 483
0 427
1 625
0 153
2 269
0 702
3 281
0 161
4 173
0 577
5 172
0
98
6 139
0 678
7 140
0
91
8 124
0 527
[...]

intr ithr csw icsw migr smtx
2216 913 4415
6
79
45
4171 1748 2887
4
42
37
1720 636 2943
5
71
39
1433 529 2255
3
39
35
3361 1465 2565
3
50
38
1108 418 1640
2
27
30
2406 802 2509
4
60
34
1343 527 2196
2
36
31
1145 419 1676
2
43
29

srw syscl
0 9275
0 5909
0 6422
0 5269
0 4645
0 3699
0 5135
0 4879
0 3335

usr sys
7 11
7
8
5
8
5
5
4
6
3
3
4
6
4
4
4
4

wt idl
0 83
0 85
0 87
0 90
0 90
0 94
0 90
0 92
0 92

%iowait

•

I/O wait percent ... still zero in illumos!
# mpstat 1
CPU minf mjf xcal
0 483
0 427
1 625
0 153
2 269
0 702
3 281
0 161
4 173
0 577
5 172
0
98
6 139
0 678
7 140
0
91
8 124
0 527
[...]

intr ithr csw icsw migr smtx
2216 913 4415
6
79
45
4171 1748 2887
4
42
37
1720 636 2943
5
71
39
1433 529 2255
3
39
35
3361 1465 2565
3
50
38
1108 418 1640
2
27
30
2406 802 2509
4
60
34
1343 527 2196
2
36
31
1145 419 1676
2
43
29

srw syscl
0 9275
0 5909
0 6422
0 5269
0 4645
0 3699
0 5135
0 4879
0 3335

usr sys
7 11
7
8
5
8
5
5
4
6
3
3
4
6
4
4
4
4

wt idl
0 83
0 85
0 87
0 90
0 90
0 94
0 90
0 92
0 92

arcstat

•

ZFS ARC and L2ARC statistics:
$ ./arcstat 1
time read
04:45:47
0
04:45:49
15K
04:45:50
23K
04:45:51
65K
04:45:52
30K
[...]

•

miss
0
10
81
25
11

miss%
0
0
0
0
0

dmis
0
10
81
25
11

dm%
0
0
0
0
0

pmis
0
0
0
0
0

pm%
0
0
0
0
0

mmis
0
1
1
4
3

mm%
0
0
0
0
0

arcsz
14G
14G
14G
14G
14G

Statistics are per-interval:

•
•

read, miss: total ARC accesses, misses

•
•

dmis, pmis, mmis: misses for demand, prefetch, metadata

miss%, dm%, pm%, mm%: ARC miss percentage total,
demand, prefetch, metadata
arcsz, c: ARC size, ARC target size

c
14G
14G
14G
14G
14G

arcstat, cont.

•

Written in Perl, uses kstats (zfs::arcstats:):
[...]
sub snap_stats {
my %prev = %cur;
if ($kstat-update()) {
printf(State Changedn);
}
my $hashref_cur = $kstat-{zfs}{0}{arcstats};
%cur = %$hashref_cur;
foreach my $key (keys %cur) {
next if $key =~ /class/;
if (defined $prev{$key}) {
$d{$key} = $cur{$key} - $prev{$key};
[...]
sub calculate {
%v = ();
$v{time} = strftime(%H:%M:%S, localtime);
$v{hits} = $d{hits}/$int;
$v{miss} = $d{misses}/$int;
$v{read} = $v{hits} + $v{miss};
$v{hit%} = 100*$v{hits}/$v{read} if $v{read} 0;
$v{miss%} = 100 - $v{hit%} if $v{read} 0;
[...]

•

github.com/mharsch/arcstat

arcstat, cont.

•
•

Options to conﬁgure output, including L2ARC stats

•

Originally by Neelakanth Nadgir, then Jason Hellenthal (FreeBSD
port) and Mike Harsh

Crucial data when analyzing cache performance
(and easier to read than the raw form: kstat -pn arcstats)

arcaccess.d

•

ARC population age:

# ./arcaccess.d -n 'tick-10s { exit(0); }'
lbolt rate is 100 Hertz.
Tracing lbolts between ARC accesses...
value
-1
0
1
2
4
8
16
32
64
128
256
512
1024
2048
4096
8192
16384
32768
65536
131072
262144
524288
1048576
2097152
4194304
8388608
16777216
33554432

------------- Distribution ------------- count
|
0
|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 729988
|
3805
|
3038
|
2028
|
1428
|
1398
|
1618
|
2883
|
738
|
681
|
338
|
569
|
166
|
607
|
632
|
808
|
373
|
110
|
142
|
39
|
5
|
97
|
10
|
44
|
617
|
1
|
0

Age:
10 ms

1 second

1 minute

1 hour
1 day

arcaccess.d, cont.

•

Written in DTrace
#!/usr/sbin/dtrace -s
#pragma D option quiet
dtrace:::BEGIN
{
printf(lbolt rate is %d Hertz.n, `hz);
printf(Tracing lbolts between ARC accesses...);
}
fbt::arc_access:entry
{
self-ab = args[0];
self-lbolt = args[0]-b_arc_access;
}
fbt::arc_access:return
/self-lbolt/
{
@ = quantize(self-ab-b_arc_access - self-lbolt);
self-ab = 0;
self-lbolt = 0;
}

•

http://dtrace.org/blogs/brendan/2012/01/09/activity-of-the-zfs-arc/

arcaccess.d, cont.

•

Shows population age of the ARC based on access time

•
•

Helps determine working set size and turn over rate

Previous example showed ARC was so large, it was able to keep
buffers that had not been accessed in over a day.

•

Turn over rate == low, working set size == entirely ﬁts, likely
much smaller.

Latency counters

•

kstat -p zone_vfs:::*_ops, show latency outlier counts since boot:
$ kstat -p zone_vfs:::*_ops
zone_vfs:0:global:100ms_ops
zone_vfs:0:global:10s_ops
zone_vfs:1:5cedb79e:100ms_ops
zone_vfs:1:5cedb79e:10s_ops

•

13
2220
0
0
173367
64071247
0
430

reads/writes:
$ kstat -p zone_vfs:::reads zone_vfs:::writes
zone_vfs:0:global:reads 666148884
zone_vfs:1:5cedb79e-51c6-41b3-b99d-358b33:reads 1791879457
zone_vfs:0:global:writes
46861078
zone_vfs:1:5cedb79e-51c6-41b3-b99d-358b33:writes
356075500

Latency counters

•

kstat -p zone_vfs:::*_ops, show latency outlier counts since boot:
$ kstat -p zone_vfs:::*_ops

•

reads/writes:

13
2220
0
0
173367
64071247
0
430

yikes, 430 ops 1s!
however, this is a small fraction

$ kstat -p zone_vfs:::reads zone_vfs:::writes
zone_vfs:0:global:reads 666148884
zone_vfs:1:5cedb79e-51c6-41b3-b99d-358b33:reads 1791879457
zone_vfs:0:global:writes
46861078
zone_vfs:1:5cedb79e-51c6-41b3-b99d-358b33:writes
356075500

1s = 0.00002%

Latency counters, cont.

•
•

Added by Bill Pijewski of Joyent
Proven invaluable, usually to quickly eliminate ZFS (and disk) as a
source of issues, after the fact (when counters are near-zero)

•

Has had identiﬁed real ZFS/disk-level issues as well

Scatter plots
iosnoop -Dots outfile
+
awk/perl
+
gnuplot/R
=

read = red, write = blue
60ms

latency

time (s)

Scatter plots, cont.

•

Has been an effective last-resort to investigate nasty PERC
issues: reads queueing behind writes

•

Has a big data problem: many points, x-y coordinates, takes time
space to capture and render.

•

Heat maps solve this, and can be real-time

Heat maps

•

WHAT DOES IT MEAN?

Heat maps

•

WHAT DOES IT MEAN?

disk I/O
DRAM I/O

workload becomes cached

Heat maps

•

WHAT DOES IT MEAN?

bi-modal
distribution

Heat maps

•
•

File system latency is commonly bi-modal: cache hits vs misses
Average latency, stddev, 99th percentile, don’t make a lot of sense

•
•

Given line graphs of these, we would not be able to solve
many problems

Need to see the full distribution: heat maps

•

Also good at identifying outliers

Heat maps, cont.

•

Latency outliers, cloud wide (entire datacenter):

Heat maps, cont.

•

Latency outliers, cloud wide (entire datacenter):

Inconceivable
Very Bad
Bad
Good

Cloud computing: within the zone

•

Suspected “noisy neighbors” hurting ZFS performance

•
•

Need ZFS observability to confirm/deny issues from within a zone
(eg, Joyent SmartMachine), however, kernel access (including
DTrace io and fbt providers) is not available.

•
•

vfsstat (already covered) does work, as it is zone-aware

Cloud safe DTrace tools include:

•
•
•

ZFS I/O throttling was added to address this

mysql_pid_latency.d: tracing FS calls from within the app
syscall tracing when fds[].fi_fs == “zfs”

Scripts are in:
http://dtrace.org/blogs/brendan/2011/05/18/file-system-latency-part-3/
http://dtrace.org/blogs/brendan/2011/05/24/file-system-latency-part-4/

Recap
syscall with
fi_fs == zfs

Process
User-Land
Kernel
VFS

zioslower.d
spasync.d
metaslab_free.d
arcstat.pl
arcaccess.d

Syscall
Interface
vfsstat

ZFS

...
iostacks.d



Disks

iostat
iosnoop

There’s more

•

There are more tools

•

zpool iostat

•

zpool status

•

•

zfs get all

•

•

degraded? resilvering?

recsize match workload? compression? dedup?

ziostat

There’s more, cont.

•

There are more tools

•

And much, much, more DTrace

DTRACE

ZFSperftools2012

More Related Content

ZFSperftools2012