How perf and ftrace Work
There is plenty of information on how to use Linux tracing tools such as perf and ftrace, but I could not find much about how they are structured internally, so this post summarizes what I found. (It says little about how to use the tools.)
This article is based on the Linux 4.15 source.
Overview
"Linux tracer" can mean perf, ftrace, kprobe, uprobe, and more, and it is hard to tell how they all fit together, so this section lays out how they relate.
In reality the pieces are intertwined in complicated ways and hard to visualize, so treat this as just one way of looking at it.
Roughly, they can be classified as follows.
- Userland tools
  - perf, perf-tools, bcc, trace-cmd
- Interfaces
  - perf_event_open(2), bpf(2), ioctl(2), debugfs (tracefs), ...
- Tracers, frameworks
  - perf, ftrace, eBPF
- event, data source
  - Performance counter
  - tracepoint
  - kprobe
  - uprobe
  - mcount
ftrace
ftrace is a framework aimed mainly at tracing kernel code. As the name suggests, function tracing is one of its main features, but it actually contains the following tracers.
- function
  - records function calls
- function_graph
  - also records function returns
- event tracer
  - tracepoint, kprobe, uprobe
- latency tracer
  - wakeup
    - traces the time until wakeup
  - irqsoff
    - traces the time interrupts are disabled
  - ...
- mmio tracer
- ...
User space interacts with ftrace through debugfs. debugfs is usually mounted at /sys/kernel/debug/, and the tracing/ directory inside it is the ftrace part. Since Linux 4.1, tracefs has been available so that ftrace can be used even when debugfs is not mounted (mainly for security reasons). tracefs is the tracing part of the old debugfs split out on its own. When debugfs is mounted, tracefs is mounted at debug/tracing for compatibility.
To use ftrace, you first select the tracer to use. The tracer writes its trace results to a ring buffer, and that buffer can be accessed through tracefs. ftrace also has features such as filtering trace results and starting a trace when a particular event occurs.
The tracer code lives mainly under kernel/trace.
The following are useful references on using ftrace through tracefs.
- ftrace - Function Tracer, https://github.com/torvalds/linux/blob/v4.15/Documentation/trace/ftrace.txt
- Steven Rostedt, Debugging the kernel using Ftrace - part 1, https://lwn.net/Articles/365835/, 2009.
- Steven Rostedt, Secrets of the Ftrace function tracer, https://lwn.net/Articles/370423/, 2010.
- mhiramat, Ftraceでカーネルの一部の処理を追いかける方法, https://qiita.com/mhiramat/items/42a6af4f3c289ad37095, 2016.
- Andrej Yemelianov, Kernel Tracing with Ftrace, https://blog.selectel.com/kernel-tracing-ftrace/, 2017.
The rest of this section briefly explains function tracing, tracepoint, and kprobe.
function trace
Function tracing relies on gcc's profiling support.
When code is compiled with gcc's -pg option, a call to a function named mcount is inserted at the entry of every function.
Example:
    # w/o -pg
    % echo "int main(){return 0;}" | gcc -x c -O0 -S -fno-asynchronous-unwind-tables -o- -
            .file   ""
            .text
            .globl  main
            .type   main, @function
    main:
            pushq   %rbp
            movq    %rsp, %rbp
            movl    $0, %eax
            popq    %rbp
            ret
            .size   main, .-main
            .ident  "GCC: (Ubuntu 7.2.0-8ubuntu3) 7.2.0"
            .section        .note.GNU-stack,"",@progbits

    # w/ -pg
    % echo "int main(){return 0;}" | gcc -pg -x c -O0 -S -fno-asynchronous-unwind-tables -o- -
            .file   ""
            .text
            .globl  main
            .type   main, @function
    main:
            pushq   %rbp
            movq    %rsp, %rbp
    1:      call    *mcount@GOTPCREL(%rip)
            movl    $0, %eax
            popq    %rbp
            ret
            .size   main, .-main
            .ident  "GCC: (Ubuntu 7.2.0-8ubuntu3) 7.2.0"
            .section        .note.GNU-stack,"",@progbits
For a userland program, the binary is linked against the mcount function in glibc. That mcount function acts as so-called trampoline code: by recording inside mcount, function calls can be traced.
ftrace's function tracer works on the same principle, but it is easy to imagine that performance would drop badly if every function call in the kernel went through mcount.
So ftrace compiles the kernel with -pg and then replaces the mcount call instructions with nops (this is the CONFIG_DYNAMIC_FTRACE=y case, which you would normally enable when using ftrace).
The locations of the mcount call instructions are collected when the kernel is built and kept between __start_mcount_loc and __stop_mcount_loc in the kernel; the calls themselves are rewritten to nops from that table during boot.
    % sudo cat /proc/kallsyms | grep mcount
    ffffffffbc83d1c0 T __start_mcount_loc
    ffffffffbc886f90 T __stop_mcount_loc
When ftrace performs function tracing, it uses this information to rewrite the code at the target locations so that mcount is called. This keeps the overhead close to zero when ftrace is not in use.
For reference, here are the execution times of dd on my machine with function tracing off and on.
Tracing off

    # current_tracer = nop
    % time dd if=/dev/zero of=/dev/null bs=1 count=500k
    512000+0 records in
    512000+0 records out
    512000 bytes (512 kB, 500 KiB) copied, 0.471769 s, 1.1 MB/s
    dd if=/dev/zero of=/dev/null bs=1 count=500k  0.13s user 0.34s system 99% cpu 0.473 total

Tracing on

    # current_tracer = function_graph
    % time dd if=/dev/zero of=/dev/null bs=1 count=500k
    512000+0 records in
    512000+0 records out
    512000 bytes (512 kB, 500 KiB) copied, 5.88682 s, 87.0 kB/s
    dd if=/dev/zero of=/dev/null bs=1 count=500k  0.17s user 5.72s system 99% cpu 5.898 total

With tracing on, the execution time is more than 10 times longer. (In practice, though, tracing every function produces output that is impossible to make sense of, so you would normally apply a filter or enable tracing only around the part you care about.)
The following describe the structure in more detail.
- function tracer guts, https://github.com/torvalds/linux/tree/master/Documentation/trace/ftrace-design.txt
- Steven Rostedt, Ftrace Kernel Hooks: More than just tracing, https://www.linuxplumbersconf.org/2014/ocw/system/presentations/1773/original/ftrace-kernel-hooks-2014.pdf, 2014.
  - covers how mcount call sites are rewritten dynamically
The x86 implementation of mcount is in arch/x86/kernel/ftrace.c and arch/x86/kernel/ftrace_64.S.
As an aside, live patching, introduced in Linux 4.0, uses this mcount hook (the patched function is called from mcount).
tracepoint (static event)
A tracepoint is a mechanism that makes it easy to define probe functions inside kernel code. Simplified, the source calls the probe function like this:

    if (event_on) {
        callback();
    }

In practice, multiple functions can be registered on a single tracepoint, and they can also be registered from kernel modules. Tracepoints are defined at more than 1000 places in the kernel.
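To make the "flag check plus a list of registered callbacks" idea concrete, here is a small user-space model in C. It is only a sketch: the names toy_tracepoint, toy_probe_register, toy_trace, and toy_sum_probe are invented for illustration, and a plain bool stands in for the kernel's static key.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PROBES 4

typedef void (*probe_fn)(void *data, int arg);

/* Toy stand-in for a tracepoint: an on/off flag and registered probes. */
struct toy_tracepoint {
    bool enabled;               /* plays the role of the static key */
    size_t nprobes;
    probe_fn funcs[MAX_PROBES];
    void *data[MAX_PROBES];
};

/* Register a probe; the first registration switches the tracepoint on.
 * Returns 0 on success, -1 if the probe table is full. */
int toy_probe_register(struct toy_tracepoint *tp, probe_fn fn, void *data)
{
    if (tp->nprobes == MAX_PROBES)
        return -1;
    tp->funcs[tp->nprobes] = fn;
    tp->data[tp->nprobes] = data;
    tp->nprobes++;
    tp->enabled = true;
    return 0;
}

/* The hook placed in the traced code path. */
static inline void toy_trace(struct toy_tracepoint *tp, int arg)
{
    if (!tp->enabled)           /* cheap check while tracing is off */
        return;
    for (size_t i = 0; i < tp->nprobes; i++)
        tp->funcs[i](tp->data[i], arg);
}

/* Example probe: add the argument to a caller-supplied counter. */
void toy_sum_probe(void *data, int arg)
{
    *(long *)data += arg;
}
```

Registering toy_sum_probe twice with two different counters and then calling toy_trace(&tp, 5) updates both counters, which mirrors several probes hanging off one tracepoint.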
The details of how a tracepoint is defined are very hard to follow because macros are used so heavily, but it seems to work roughly as follows.
Example: sched_process_exec
https://github.com/torvalds/linux/blob/v4.15/include/trace/events/sched.h#L301
    TRACE_EVENT(sched_process_exec,

        TP_PROTO(struct task_struct *p, pid_t old_pid,
                 struct linux_binprm *bprm),

        TP_ARGS(p, old_pid, bprm),

        TP_STRUCT__entry(
            __string( filename, bprm->filename )
            __field( pid_t, pid )
            __field( pid_t, old_pid )
        ),

        TP_fast_assign(
            __assign_str(filename, bprm->filename);
            __entry->pid = p->pid;
            __entry->old_pid = old_pid;
        ),

        TP_printk("filename=%s pid=%d old_pid=%d", __get_str(filename),
                  __entry->pid, __entry->old_pid)
    );
These macros are defined in include/linux/tracepoint.h. See that file for a detailed explanation of each macro.
The TRACE_EVENT macro is ultimately expanded as DECLARE_TRACE.
https://github.com/torvalds/linux/blob/v4.15/include/linux/tracepoint.h#L185
    #define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
        extern struct tracepoint __tracepoint_##name; \
        static inline void trace_##name(proto) \
        { \
            if (static_key_false(&__tracepoint_##name.key)) \
                __DO_TRACE(&__tracepoint_##name, \
                    TP_PROTO(data_proto), \
                    TP_ARGS(data_args), \
                    TP_CONDITION(cond), 0); \
            if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
                rcu_read_lock_sched_notrace(); \
                rcu_dereference_sched(__tracepoint_##name.funcs);\
                rcu_read_unlock_sched_notrace(); \
            } \
        }
    ...
This branch uses a mechanism called static-key.
The if (static_key_false(&__tracepoint_##name.key)) part is initially compiled as a nop.
When the tracepoint is enabled later, that spot is rewritten into a jmp instruction so that __DO_TRACE() runs.
(Incidentally, the latest documentation says static_key_false() is deprecated, but it is still in ordinary use.)
__DO_TRACE() calls the callback functions.
The trace_##name() defined here is called from the place you want to hook.
sched_process_exec is called from the following.
https://github.com/torvalds/linux/blob/v4.15/fs/exec.c#L1683
    ...
        if (ret >= 0) {
            audit_bprm(bprm);
            trace_sched_process_exec(current, old_pid, bprm);
            ptrace_event(PTRACE_EVENT_EXEC, old_vpid);
            proc_exec_connector(current);
        }
    ...
So far, though, this only defines the tracepoint; no callback function has been registered for it.
sched.h, which defines sched_process_exec, includes the following file at the end of the header.
https://github.com/torvalds/linux/blob/v4.15/include/trace/events/sched.h#L576
    /* This part must be outside protection */
    #include <trace/define_trace.h>
define_trace.h includes trace/trace_events.h and then includes sched.h once more (the include happens at the TRACE_INCLUDE part).
https://github.com/torvalds/linux/blob/v4.15/include/trace/define_trace.h
    ...
    #include <trace/trace_events.h>
    #include <trace/perf.h>
    ...
    #define TRACE_HEADER_MULTI_READ

    #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
    ...
Inside trace_events.h, macros such as DECLARE_EVENT_CLASS are redefined.
So when sched.h is included the second time, these definitions are the ones that apply.
https://github.com/torvalds/linux/blob/v4.15/include/trace/trace_events.h#L757
    #undef DECLARE_EVENT_CLASS
    #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
    _TRACE_PERF_PROTO(call, PARAMS(proto)); \
    static char print_fmt_##call[] = print; \
    static struct trace_event_class __used __refdata event_class_##call = { \
        .system = TRACE_SYSTEM_STRING, \
        .define_fields = trace_event_define_fields_##call, \
        .fields = LIST_HEAD_INIT(event_class_##call.fields),\
        .raw_init = trace_event_raw_init, \
        .probe = trace_event_raw_event_##call, \
        .reg = trace_event_reg, \
        _TRACE_PERF_INIT(call) \
    };

    #undef DEFINE_EVENT
    #define DEFINE_EVENT(template, call, proto, args) \
    \
    static struct trace_event_call __used event_##call = { \
        .class = &event_class_##template, \
        { \
            .tp = &__tracepoint_##call, \
        }, \
        .event.funcs = &trace_event_type_funcs_##template, \
        .print_fmt = print_fmt_##template, \
        .flags = TRACE_EVENT_FL_TRACEPOINT, \
    }; \
    static struct trace_event_call __used \
    __attribute__((section("_ftrace_events"))) *__event_##call = &event_##call
The DEFINE_EVENT macro places a pointer to each struct trace_event_call in the _ftrace_events section.
When ftrace initializes, it uses this information to add the tracepoint events to a list.
https://github.com/torvalds/linux/blob/v4.15/kernel/trace/trace_events.c#L3085
    static __init int event_trace_enable(void)
    ...
        for_each_event(iter, __start_ftrace_events, __stop_ftrace_events) {

            call = *iter;
            ret = event_init(call);
            if (!ret)
                list_add(&call->list, &ftrace_events);
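The two ideas above (expanding the same event definitions twice under different macro definitions, and walking a table of event descriptors at initialization) can be modelled in ordinary C. The sketch below uses a single list macro instead of including a header twice, and a plain array instead of a linker section; TOY_EVENTS, TOY_EVENT, toy_event_call, and toy_find_event are all invented names for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for an events header: definitions are written against a
 * macro (TOY_EVENT) whose meaning is decided by whoever expands it. */
#define TOY_EVENTS \
    TOY_EVENT(toy_open,  "path=%s") \
    TOY_EVENT(toy_close, "fd=%d")

/* First expansion: declare one hook function per event. */
#define TOY_EVENT(name, fmt) void trace_##name(void);
TOY_EVENTS
#undef TOY_EVENT

struct toy_event_call {
    const char *name;
    const char *print_fmt;
};

/* Second expansion: same list, redefined macro, now emitting the
 * descriptors that a tracer would enumerate. */
#define TOY_EVENT(name, fmt) { #name, fmt },
static const struct toy_event_call toy_event_table[] = { TOY_EVENTS };
#undef TOY_EVENT

#define TOY_EVENT_COUNT (sizeof(toy_event_table) / sizeof(toy_event_table[0]))

/* Analogue of iterating from the start to the end of the event section. */
const struct toy_event_call *toy_find_event(const char *name)
{
    for (size_t i = 0; i < TOY_EVENT_COUNT; i++)
        if (strcmp(toy_event_table[i].name, name) == 0)
            return &toy_event_table[i];
    return NULL;
}
```

The benefit of this pattern is that each event is described once, while every consumer (hook declarations, descriptor tables, and in the kernel's case the perf callbacks too) gets its own generated code.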
Events are enabled in ftrace_event_enable_disable.
https://github.com/torvalds/linux/blob/v4.15/kernel/trace/trace_events.c#L456
    static int __ftrace_event_enable_disable(struct trace_event_file *file,
                                             int enable, int soft_disable)
    {
    ...
        ret = call->class->reg(call, TRACE_REG_REGISTER, file);
    ...
This reg() is the trace_event_reg() set in DECLARE_EVENT_CLASS above.
https://github.com/torvalds/linux/blob/v4.15/kernel/trace/trace_events.c#L286
    int trace_event_reg(struct trace_event_call *call,
                        enum trace_reg type, void *data)
    {
        struct trace_event_file *file = data;

        WARN_ON(!(call->flags & TRACE_EVENT_FL_TRACEPOINT));
        switch (type) {
        case TRACE_REG_REGISTER:
            return tracepoint_probe_register(call->tp,
                                             call->class->probe,
                                             file);
        case TRACE_REG_UNREGISTER:
    ...
The tracepoint_probe_register() call here registers ftrace's callback function with the tracepoint.
When the first function is registered, the static key branch is rewritten.
Incidentally, call->class->probe is the trace_event_raw_event_##call defined by DECLARE_EVENT_CLASS, which looks like this.
https://github.com/torvalds/linux/blob/v4.15/include/trace/trace_events.h#L698
    static notrace void \
    trace_event_raw_event_##call(void *__data, proto) \
    { \
        struct trace_event_file *trace_file = __data; \
        struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
        struct trace_event_buffer fbuffer; \
        struct trace_event_raw_##call *entry; \
        int __data_size; \
        \
        if (trace_trigger_soft_disabled(trace_file)) \
            return; \
        \
        __data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
        \
        entry = trace_event_buffer_reserve(&fbuffer, trace_file, \
                                           sizeof(*entry) + __data_size); \
        \
        if (!entry) \
            return; \
        \
        tstruct \
        \
        { assign; } \
        \
        trace_event_buffer_commit(&fbuffer); \
    }
trace_event_buffer_commit() writes the output to the ring buffer.
Besides turning tracepoint events on and off, ftrace can also start or stop tracing in response to events.
The following are references on tracepoints.
- Using the Linux Kernel Tracepoints, https://github.com/torvalds/linux/blob/master/Documentation/trace/tracepoints.txt
- Event Tracing, https://github.com/torvalds/linux/blob/master/Documentation/trace/events.txt
- Steven Rostedt, Using the TRACE_EVENT() macro (Part 1), https://lwn.net/Articles/379903/, 2010.
- Steven Rostedt, Using the TRACE_EVENT() macro (Part 2), https://lwn.net/Articles/381064/, 2010.
- Steven Rostedt, Using the TRACE_EVENT() macro (Part 3), https://lwn.net/Articles/383362/, 2010.
kprobe (dynamic event)
kprobe is a mechanism for dynamically adding hook points inside kernel code.
The basic idea is to overwrite the code at the spot you want to hook with a breakpoint instruction.
This makes it possible to hook almost anywhere in the kernel at instruction granularity.
(Some code, such as kprobe's own, cannot be hooked. The NOKPROBE_SYMBOL macro registers an address in the _kprobe_blacklist section, and kprobes on that address range are prohibited.)
Comparing kprobe with tracepoint, kprobe looks like a superset of tracepoint, but because kprobe hooks by address it depends on the binary, whereas tracepoint is unaffected by binary changes. (On the other hand, kernel developers take on the responsibility of maintaining a tracepoint once it has been introduced.) I think tracepoint makes it easier to get at data structures. I also expect the breakpoint hook to have more overhead than the if check of a tracepoint, although probably not enough to matter. Finally, kprobe rewrites code dynamically, so in that sense tracepoint is the more stable of the two. That said, it has been close to 10 years since kprobe was merged into mainline, so I do not think there is any particular problem with using it.
For how to use kprobe, samples/kprobes is a good reference.
You set a pre handler, called before the instruction at the breakpoint is executed, and a post handler, called after it has executed, and then call register_kprobe().
From ftrace's point of view, when a kprobe is configured through tracefs, kprobe_events_ops leads to probes_write => trace_parse_run_command => trace_run_command, where the createfn callback is invoked, and in the end create_trace_kprobe is executed.
From there, register_trace_kprobe => register_kprobe_event => __register_trace_kprobe calls register_kprobe().
The function registered with the kprobe at this point is kprobe_dispatcher, set in alloc_trace_kprobe by tk->rp.kp.pre_handler = kprobe_dispatcher;.
__kprobe_trace_func, called from kprobe_dispatcher, writes to the ring buffer.
There is also kretprobe, a mechanism for hooking function returns. I assume it was introduced to meet the common need to hook a function's return. In terms of implementation, it does not simply hook the ret with a kprobe. Instead it hooks the function's entry with a kprobe, rewrites the return address on the stack at that point, and so arranges for the kretprobe trampoline code to be called at ret time (probably).
There is also uprobe, the userland version of kprobe. Incidentally, a uprobe is registered in association with an inode.
The following are useful references on kprobe and uprobe.
- Sudhanshu Goswami, An introduction to KProbes, https://lwn.net/Articles/132196/, 2005.
- Kprobe-based Event Tracing, https://github.com/torvalds/linux/blob/v4.15/Documentation/trace/kprobetrace.txt
- kretprobe: Linuxの備忘録とか・・・, http://wiki.bit-hive.com/north/pg/kretprobe, 2012.
- Uprobe-tracer: Uprobe-based Event Tracing, https://github.com/torvalds/linux/blob/v4.15/Documentation/trace/uprobetracer.txt
Other tracers
Let's briefly follow the processing of the mmio tracer.
struct tracer appears to define the functions used when interacting through tracefs.
Example: mmiotrace
- Initialization
- Tracing at ioremap and iounmap
  - ioremap_trace_core => __trace_mmiotrace_map() => mmio_trace_mapping()
  - filtered by call_filter_check_discard
- After that
  - trace_buffer_unlock_commit() => trace_buffer_unlock_commit_regs() => __buffer_unlock_commit() => ring_buffer_write() writes to the ring buffer
trace-cmd
When actually using ftrace, you can use trace-cmd, a front end for ftrace. It is developed by Steven Rostedt himself, the ftrace maintainer.
- man page of trace-cmd, http://man7.org/linux/man-pages/man1/trace-cmd.1.html
- Steven Rostedt, Ftrace Profiling, https://events.static.linuxfound.org/sites/events/files/slides/collab-2015-ftrace-profiling.pdf, 2015.
Note that the GitHub repository is quite out of date.
perf
perf is the performance monitoring facility in Linux, also called perf_event. A user-space tool named perf is developed as well, and "perf" on its own usually refers to that tool. Because this is a little confusing, the kernel facility is written as perf_event here.
perf (or rather its predecessor) was originally called Performance counters for Linux (PCL), so I assume its initial purpose was to provide access to the CPU's performance counters.
Today, however, it also supports events other than performance counters.
The perf list command shows the supported events.
perf_event classifies events as follows.
- PERF_TYPE_HARDWARE
- PERF_TYPE_HW_CACHE
- PERF_TYPE_RAW
- PERF_TYPE_SOFTWARE
- PERF_TYPE_TRACEPOINT
- PERF_TYPE_BREAKPOINT
hardware, hw_cache, and raw correspond to CPU performance counter events.
The perf_event_open(2) system call returns a file descriptor corresponding to a single perf event.
You access the event's counter by calling read() and so on on that fd.
perf has two kinds of event counters.
One is the counting counter, used to obtain the number of times an event occurred; read() returns the counter value.
The other is the sampling counter, which calls a configured callback function every N events.
perf stat reports counting counter values, and perf record reports sampling counter results.
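As a minimal sketch of a counting counter, the C code below opens a software task-clock event for the calling thread, runs a workload, and reads the count back with read(). It follows the pattern of the example in the perf_event_open(2) man page. The names count_task_clock and spin_a_little are made up here, and the call may fail with an error such as EACCES or EPERM depending on perf_event_paranoid or sandboxing.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc has no wrapper for perf_event_open(2), so go through syscall(2). */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Example workload: a short busy loop. */
void spin_a_little(void)
{
    volatile unsigned long x = 0;

    for (unsigned long i = 0; i < 10000000UL; i++)
        x += i;
}

/* Count task-clock (CPU time in ns) consumed while work() runs in the
 * calling thread. Returns 0 and stores the count in *out, or a negative
 * errno value on failure. */
int count_task_clock(void (*work)(void), uint64_t *out)
{
    struct perf_event_attr attr;
    uint64_t value = 0;
    ssize_t n;
    int fd;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_TASK_CLOCK;
    attr.disabled = 1;          /* start stopped, enable explicitly */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* pid = 0: this thread, cpu = -1: any CPU, no group, no flags */
    fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0)
        return -errno;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    n = read(fd, &value, sizeof(value));   /* counting counter: plain read */
    close(fd);
    if (n != (ssize_t)sizeof(value))
        return -EIO;
    *out = value;
    return 0;
}
```

Calling count_task_clock(spin_a_little, &ns) is roughly what perf stat -e task-clock does for a whole command, just scoped to one function call.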
As an aside, the perf_event_open(2) man page is probably the longest of all the system call man pages.

    % git clone https://github.com/mkerrisk/man-pages && cd man-pages/man2
    % find ./ -name "*.2" | parallel wc {} | sort -nr | head
    3331 13388 88727 ./perf_event_open.2
    2796 12584 78237 ./ptrace.2
    2281 9245 58337 ./keyctl.2
    2102 9984 58088 ./fcntl.2
    1938 9262 58178 ./futex.2
    1756 7672 45635 ./open.2
    1598 6891 43519 ./prctl.2
    1368 6316 37622 ./clone.2
    1179 5042 32922 ./bpf.2
    1100 5122 33092 ./seccomp.2

The following are useful references on using the user-space perf tool.
- https://github.com/torvalds/linux/tree/master/tools/perf/Documentation
  - documentation for the perf tool (man pages)
- Brendan Gregg, perf Examples, http://www.brendangregg.com/perf.html
  - an excellent resource
- perf wiki, https://perf.wiki.kernel.org/index.php/Main_Page
- Paul J. Drongowski, PERF tutorial: Finding execution hot spots, http://sandsoftwaresound.net/perf/perf-tutorial-hot-spots/
PERF_TYPE_HARDWARE, HW_CACHE, RAW
These events are used to access the CPU's performance counters. A performance counter is an event monitoring facility built into the CPU. For Intel CPUs it is described around chapters 18 and 19 of the SDM. Which events are available differs from CPU to CPU, but the especially common ones are classified as PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE. PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE include the following.
    % sudo perf list | grep -i hardware
      branch-instructions OR branches                    [Hardware event]
      branch-misses                                      [Hardware event]
      bus-cycles                                         [Hardware event]
      cache-misses                                       [Hardware event]
      cache-references                                   [Hardware event]
      cpu-cycles OR cycles                               [Hardware event]
      instructions                                       [Hardware event]
      ref-cycles                                         [Hardware event]
      L1-dcache-load-misses                              [Hardware cache event]
      L1-dcache-loads                                    [Hardware cache event]
      L1-dcache-stores                                   [Hardware cache event]
      L1-icache-load-misses                              [Hardware cache event]
      LLC-load-misses                                    [Hardware cache event]
      LLC-loads                                          [Hardware cache event]
      LLC-store-misses                                   [Hardware cache event]
      LLC-stores                                         [Hardware cache event]
      branch-load-misses                                 [Hardware cache event]
      branch-loads                                       [Hardware cache event]
      dTLB-load-misses                                   [Hardware cache event]
      dTLB-loads                                         [Hardware cache event]
      dTLB-store-misses                                  [Hardware cache event]
      dTLB-stores                                        [Hardware cache event]
      iTLB-load-misses                                   [Hardware cache event]
      iTLB-loads                                         [Hardware cache event]
      node-load-misses                                   [Hardware cache event]
      node-loads                                         [Hardware cache event]
      node-store-misses                                  [Hardware cache event]
      node-stores                                        [Hardware cache event]
To say a little about Intel CPUs: Intel divides performance counter events into architectural performance events and non-architectural performance events (model-specific performance events).
The values of architectural performance counters (clock cycles and the like) can be read from the IA32_FIXED_CTR[0-2] registers.
For other events, the IA32_PERFEVTSELx registers select which event to count.
The result for that event is stored in IA32_PMCx.
All of these registers are MSRs, so they are accessed with wrmsr/rdmsr.
In addition, when CR4.PCE (Performance-Monitoring Counter enable) = 1, the rdpmc instruction can read the IA32_PMCx values from userland (rdpmc is apparently faster than rdmsr).
The number of IA32_PMCx registers is limited (it depends on the CPU, but is something like 2, 4, or 6).
So when perf records more events than there are registers, it shares the registers among them round-robin at suitable time intervals.
The values finally obtained are therefore only estimates.
If exact values are needed, you have to narrow down the events being collected.
This is described on the perf wiki.
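The estimate mentioned above is usually computed by scaling the raw count by the ratio of the time the event was enabled to the time it actually ran on a counter. The C sketch below shows that arithmetic, assuming the values were read from a perf fd opened with read_format = PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING. The struct and function names are invented for this example.

```c
#include <stdint.h>

/* Layout returned by read(2) for a single (non-group) event opened with
 * PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING. */
struct counter_reading {
    uint64_t value;         /* raw count while the event was on the PMU */
    uint64_t time_enabled;  /* ns the event was enabled */
    uint64_t time_running;  /* ns the event actually occupied a counter */
};

/* Estimate the count over the whole enabled interval.
 * Returns 0 if the event never got to run. */
uint64_t scaled_count(const struct counter_reading *r)
{
    if (r->time_running == 0)
        return 0;
    /* long double keeps precision for large counts */
    return (uint64_t)((long double)r->value * r->time_enabled /
                      r->time_running);
}
```

For instance, a raw count of 1000 with time_enabled = 200 and time_running = 100 means the event was only scheduled half the time, so the estimate is 2000.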
Performance counters can raise an interrupt when they overflow, and sampling counters rely on this. In particular, by raising interrupts at suitable intervals based on an event such as clock cycles and recording the rip at the time of each interrupt, you can do profiling.
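In terms of the perf_event_open(2) interface, such a sampling counter is requested by setting sample_period and sample_type in struct perf_event_attr. The following C sketch only fills in the attribute; the function name init_cycle_sampling_attr is hypothetical, and actually collecting samples would additionally require opening the event and mmap()ing its ring buffer.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Fill attr for a sampling counter: one sample every `period` CPU cycles,
 * each sample recording the instruction pointer and the pid/tid. */
void init_cycle_sampling_attr(struct perf_event_attr *attr, uint64_t period)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
    attr->sample_period = period;       /* overflow every N events */
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
    attr->disabled = 1;                 /* enable later via ioctl */
    attr->exclude_kernel = 1;           /* user space only */
}
```

A smaller period gives finer-grained profiles at the cost of more interrupts.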
To access CPU-specific events other than those in PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE, use PERF_TYPE_RAW and specify the event number directly.
Incidentally, since around Linux 4.10, processor-specific PMU events can be referred to by name.
These are the ones shown as Kernel PMU Event in perf list.
The information appears to be defined in JSON files under tools/perf/pmu-events/arch/.
PERF_TYPE_SOFTWARE
These are the events that perf list shows as software event, such as context switches.
    % sudo perf list | grep -i software
      alignment-faults                                   [Software event]
      bpf-output                                         [Software event]
      context-switches OR cs                             [Software event]
      cpu-clock                                          [Software event]
      cpu-migrations OR migrations                       [Software event]
      dummy                                              [Software event]
      emulation-faults                                   [Software event]
      major-faults                                       [Software event]
      minor-faults                                       [Software event]
      page-faults OR faults                              [Software event]
      task-clock                                         [Software event]
As for how this works, each event site explicitly calls perf_sw_event().
- Example: arch/x86/mm/fault.c
  - perf_sw_event() => __perf_sw_event() => ___perf_sw_event => do_perf_sw_event => perf_swevent_event
PERF_TYPE_TRACEPOINT
Events such as tracepoint, kprobe, and uprobe, which ftrace also uses, are PERF_TYPE_TRACEPOINT.
For kprobe and uprobe, perf_trace_buf_submit() is called from within the callback functions registered by ftrace (kprobe_dispatcher, uprobe_dispatcher).
For tracepoints, define_trace.h pulls in perf.h, which defines the tracepoint callback functions for perf_event.
These callbacks are registered in the same place as ftrace's callbacks.
https://github.com/torvalds/linux/blob/v4.15/kernel/trace/trace_events.c#L305
    int trace_event_reg(struct trace_event_call *call,
                        enum trace_reg type, void *data)
    {
        struct trace_event_file *file = data;

        WARN_ON(!(call->flags & TRACE_EVENT_FL_TRACEPOINT));
        switch (type) {
        case TRACE_REG_REGISTER:
            return tracepoint_probe_register(call->tp,
                                             call->class->probe,
                                             file);
        case TRACE_REG_UNREGISTER:
            tracepoint_probe_unregister(call->tp,
                                        call->class->probe,
                                        file);
            return 0;

    #ifdef CONFIG_PERF_EVENTS
        case TRACE_REG_PERF_REGISTER:
            return tracepoint_probe_register(call->tp,
                                             call->class->perf_probe,
                                             call);
    ...
In a little more detail, it works as follows.
- kprobe
  - kprobe_dispatcher => kprobe_perf_func => perf_trace_buf_submit
- uprobe
  - uprobe_dispatcher => uprobe_perf_func => ... => perf_trace_buf_submit
- tracepoint
  - DECLARE_EVENT_CLASS => perf_trace_run_bpf_submit => perf_tp_event
  - shares its path with the invocation of eBPF programs, described later
- syscall
  - perf_syscall_enter => perf_trace_buf_submit
  - perf_syscall_exit => perf_trace_buf_submit
From there, processing continues as perf_trace_buf_submit() => perf_tp_event => perf_swevent_event.
PERF_TYPE_BREAKPOINT
This is the event type corresponding to hardware breakpoints.
Since kprobe or uprobe normally does the job, I could find hardly any usage examples, but it can be used as follows.
    % sudo cat /proc/kallsyms| grep sys_brk
    ffffffffbb400930 T sys_brk
    % sudo perf stat -e mem:0xffffffffbb400930:x ls
    bin  perf.data  work

     Performance counter stats for 'ls':

                     3      mem:0xffffffffbb400930:x

           0.001000937 seconds time elapsed
See the man page for the exact format.
The breakpoint setup code also appears to be used by ptrace and others.
USDT (SDT Event)
This is not directly related to perf_event, but there is a technique called USDT, or SDT, for getting tracepoint-like tracing in user-space programs. It seems to have originated in DTrace, and SystemTap supports it too. perf has also gained support recently (https://lwn.net/Articles/618956/).
When you embed a USDT probe at a suitable place in a program's source code, the probe itself is compiled as a nop.
Information about the USDT probes is stored in the ELF .note.stapsdt section, so you can later use it to hook the intended location with a uprobe.
The perf buildid-cache command can add events based on the information in .note.stapsdt.
SDT events appear in perf list as SDT event.
    % sudo perf list | grep SDT
      sdt_libc:lll_lock_wait_private                     [SDT event]
      sdt_libc:longjmp                                   [SDT event]
      sdt_libc:longjmp_target                            [SDT event]
      sdt_libc:memory_arena_new                          [SDT event]
      sdt_libc:memory_arena_retry                        [SDT event]
      sdt_libc:memory_arena_reuse                        [SDT event]
      sdt_libc:memory_arena_reuse_free_list              [SDT event]
      sdt_libc:memory_arena_reuse_wait                   [SDT event]
      ...
perf-tools
perf-tools is a set of performance analysis tools built on perf and ftrace. It works as a wrapper around perf and ftrace (tracefs). (Despite the name perf-tools, it uses ftrace as well.)
These days, though, I think the bcc tools can do everything perf-tools can.
perf ftrace
Slightly confusingly, the perf command also includes a wrapper for ftrace, available as the perf ftrace command.
perf ftrace record makes it easy to record function trace results.
Comparison with strace
There are commands such as strace, which traces system calls, and ltrace, which traces library function calls, and both of them use ptrace. With strace, a SIGTRAP is sent when a system call is issued and when it completes. With ltrace, the PLT entry of each library function call is hooked with a breakpoint, and a SIGTRAP is sent.
perf can also trace system calls, and because it records without going through signals, it runs much faster than strace. A simple example follows.
strace

    % time strace -eaccept dd if=/dev/zero of=/dev/null bs=1 count=500k
    512000+0 records in
    512000+0 records out
    512000 bytes (512 kB, 500 KiB) copied, 18.8414 s, 27.2 kB/s
    +++ exited with 0 +++
    strace -eaccept dd if=/dev/zero of=/dev/null bs=1 count=500k  2.67s user 20.21s system 121% cpu 18.847 total

perf

    % time perf record -e 'syscalls:sys_enter_accept' dd if=/dev/zero of=/dev/null bs=1 count=500k
    512000+0 records in
    512000+0 records out
    512000 bytes (512 kB, 500 KiB) copied, 0.490177 s, 1.0 MB/s
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.013 MB perf.data ]
    perf record -e 'syscalls:sys_enter_accept' dd if=/dev/zero of=/dev/null   0.20s user 0.40s system 93% cpu 0.635 total
Miscellaneous
The value of /proc/sys/kernel/perf_event_paranoid controls whether running perf requires CAP_SYS_ADMIN.

     2  allow only user-space measurements (default since Linux 4.6).
     1  allow both kernel and user measurements (default before Linux 4.6).
     0  allow access to CPU-specific data but not raw tracepoint samples.
    -1  no restrictions.
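The table above can be turned into a few small checks. The following C sketch reads the level from a file and answers what an unprivileged process may do at that level, strictly following the four documented values. All function names are invented for this example, and treating levels above 2 like level 2 is an assumption.

```c
#include <stdio.h>

/* Read the paranoid level from `path` (normally
 * /proc/sys/kernel/perf_event_paranoid).
 * Returns 0 on success, -1 if the file cannot be read or parsed. */
int read_perf_event_paranoid(const char *path, int *level)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = fscanf(f, "%d", level) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

/* Level 1 or lower: kernel measurements are allowed as well as user ones.
 * Assumption: any level of 2 or more is treated as user-space only. */
int paranoid_allows_kernel(int level)
{
    return level <= 1;
}

/* Level 0 or lower: access to CPU-specific data is allowed. */
int paranoid_allows_cpu_data(int level)
{
    return level <= 0;
}

/* Level -1: no restrictions, including raw tracepoint samples. */
int paranoid_allows_raw_tracepoints(int level)
{
    return level <= -1;
}
```

A tool could call read_perf_event_paranoid() at startup and print a hint such as "kernel samples need CAP_SYS_ADMIN here" when paranoid_allows_kernel() returns 0.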
perf and eBPF
Since Linux kernel 4.1, eBPF programs can be attached to perf events. Specifically, support was added in the following kernel versions.
- 4.1: kprobe
  - BPF_PROG_TYPE_KPROBE
  - kprobe_perf_func => trace_call_bpf
- 4.3: uprobe
  - BPF_PROG_TYPE_KPROBE
    - reuses the kprobe prog type
  - uprobe_perf_func => trace_call_bpf
- 4.7: tracepoint
  - BPF_PROG_TYPE_TRACEPOINT
  - DECLARE_EVENT_CLASS => perf_trace_run_bpf_submit => trace_call_bpf
  - sysenter and sysexit are handled differently from other tracepoints, so they need special treatment (c.f. bpf: add support for sys_enter_* and sys_exit_* tracepoints)
- 4.9: perf software / hardware event
  - BPF_PROG_TYPE_PERF_EVENT
  - __perf_event_overflow => READ_ONCE(event->overflow_handler)(event, data, regs); => bpf_overflow_handler
  - the bpf program is called when a sampling counter overflows
A BPF program is attached by calling ioctl(fd, PERF_EVENT_IOC_SET_BPF, prog_fd); on the fd obtained from perf_event_open().
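Wrapped in a small helper, the attach step might look like the C sketch below. It assumes prog_fd refers to a BPF program that has already been loaded with bpf(2) and perf_fd to an event from perf_event_open(); the function name attach_bpf_to_perf_event is made up, and enabling the event right after attaching is a choice made for this example.

```c
#include <errno.h>
#include <linux/perf_event.h>
#include <sys/ioctl.h>

/* Attach an already-loaded BPF program (prog_fd) to a perf event
 * (perf_fd) and enable the event.
 * Returns 0 on success or a negative errno value on failure. */
int attach_bpf_to_perf_event(int perf_fd, int prog_fd)
{
    if (ioctl(perf_fd, PERF_EVENT_IOC_SET_BPF, prog_fd) < 0)
        return -errno;
    if (ioctl(perf_fd, PERF_EVENT_IOC_ENABLE, 0) < 0)
        return -errno;
    return 0;
}
```

Passing an invalid descriptor makes the first ioctl fail with EBADF, so the helper returns -EBADF in that case.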
For kprobe, uprobe, and tracepoint, trace_call_bpf() is called when the event is handed over to the perf side (the bpf program is not called on counter overflow). In that sense it may be more accurate to say the program is attached directly to the kprobe or similar rather than to the perf event.
BPF programs apparently cannot be attached to breakpoint events.
From a BPF program, bpf_trace_printk() writes to the ftrace ring buffer, and bpf_perf_event_output() writes to the perf ring buffer.
samples/bpf/trace_output_kern.c and samples/bpf/trace_output_user.c form a sample that writes to the perf ring buffer from the BPF side. In outline:
- Create a bpf array of type BPF_MAP_TYPE_PERF_EVENT_ARRAY (trace_output_kern.c#6)
- On the userland side, call perf_event_open with perf_event_attr.type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_BPF_OUTPUT (trace_output_user.c#L162)
- Associate the BPF array with the perf fd using bpf_map_update_elem() (trace_output_user.c#L165)
  - bpf_map_update_elem() => bpf_fd_array_map_update_elem()
  - map_fd_get_ptr() is perf_event_fd_array_get_ptr (defined in perf_event_array_map_ops)
  - bpf_event_entry_gen stores the perf_file corresponding to the perf event fd passed as an argument to bpf_map_update_elem()
  - this information is saved in array->ptrs[index]
- mmap the perf fd (trace_output_user.c#L41)
- On the BPF program side, write output with bpf_perf_event_output (trace_output_kern.c#24)
In addition, samples/bpf/tracex6_kern.c and samples/bpf/tracex6_user.c show an example of accessing perf counters from the BPF side.
To use these features in practice, bcc is probably the way to go.
The perf command can also filter events with a BPF program.
ftrace and eBPF
As of Linux 4.15, eBPF programs cannot be attached to ftrace events.
A BPF_PROG_TYPE_FTRACE was proposed at the end of last year, but it was limited to using BPF only to switch tracing on and off, and the patch itself had various problems, so it has not been adopted.
I think there is a good chance that support will be added in the future.
Other tracing tools
perf and ftrace are developed together with the kernel, but there are also several independently developed tracing tools. (They are mostly used in the form of kernel modules.)
The best known is SystemTap. It supports kprobe, uprobe, tracepoint, and so on, and through SystemTap Script it lets you add processing at the hooked location in what is effectively C, so it is very flexible. Naturally, you need to be correspondingly careful about safety. A bpf backend also seems to have been added recently. Another representative tool is LTTng. Both SystemTap and LTTng have been under development since the 2000s, so they come with a good range of tooling. The SystemTap wiki has a comparison of systemtap, dtrace, LTTng, and perf. I have not used SystemTap much, so I cannot say for sure, but my impression is that ftrace, perf, and eBPF have only recently become able to do most of what SystemTap can do (and a bit more).
There is also ktap, a lightweight tracing tool aimed especially at embedded systems that does dynamic tracing with Lua (developed by Huawei). It came close to being merged into Linux, but that collided with the eBPF merge and in the end it was not merged; updates now seem to have stopped.
Among tracing tools still being developed, there is sysdig. In the project's own words it is "sysdig as strace + tcpdump + htop + iftop + lsof + wireshark". It cannot be used for things like kernel tracing, but it puts container support front and center, so it may be handy for that kind of use.
Summary
This post gave a brief overview of how perf and ftrace work internally.
As for which one to use in the end, I would say use whichever you like. For now, if you want performance counter values, use perf; if you want a proper trace of kernel code, use ftrace; and these days the bcc tools will often get you what you want with little effort. Depending on the situation, SystemTap and LTTng are also worth a look.