Retsnoop is a BPF-based tool for non-intrusive mass-tracing of Linux kernel internals.
Retsnoop's main goal is to provide a flexible and ergonomic way to extract the exact information from the kernel that is useful to the user. At any given moment, a running kernel is doing many different things, across many different subsystems, and on many different cores. Extracting and reviewing all of the various logs, callstacks, tracepoints, etc. across the whole kernel can be a very time-consuming and burdensome task. Similarly, iteratively adding printk() statements requires long iteration cycles of recompiling, rebooting, and rerunning testcases. Retsnoop, on the other hand, allows users to achieve a much higher signal-to-noise ratio by letting them specify both the subset of kernel functions that they would like to monitor and the types of information to be collected from those functions, all without requiring any kernel changes. Retsnoop can now also capture function arguments, furthering the promise of giving users relevant and useful information simply and flexibly.
Retsnoop achieves its goal by low-overhead non-intrusive tracing of a collection of kernel functions, intercepting their entries and exits. Retsnoop's central concept is a user-specified set of kernel functions of interest. This allows retsnoop to capture high-relevance data by letting the user flexibly control a relevant subset of kernel functions. All other kernel functions are ignored and don't pollute captured data with irrelevant information.
Retsnoop also supports a set of additional filters for further restricting the context and conditions under which tracing data is captured: you can filter by PID or process name, choose a subset of admissible errors returned from functions, or gate on function latencies.
Retsnoop supports three different and complementary modes.
The default stack trace mode succinctly points to the deepest function call stack that satisfies user conditions (e.g., an error returned from the syscall). It shows a sequence of function calls, the corresponding source code locations at each level of the stack, and emits latencies and returned results:
$ sudo ./retsnoop -e '*sys_bpf' -a ':kernel/bpf/*.c'
Receiving data...
20:19:36.372607 -> 20:19:36.372682 TID/PID 8346/8346 (simfail/simfail):
entry_SYSCALL_64_after_hwframe+0x63 (arch/x86/entry/entry_64.S:120:0)
do_syscall_64+0x35 (arch/x86/entry/common.c:80:7)
. do_syscall_x64 (arch/x86/entry/common.c:50:12)
73us [-ENOMEM] __x64_sys_bpf+0x1a (kernel/bpf/syscall.c:5067:1)
70us [-ENOMEM] __sys_bpf+0x38b (kernel/bpf/syscall.c:4947:9)
. map_create (kernel/bpf/syscall.c:1106:8)
. find_and_alloc_map (kernel/bpf/syscall.c:132:5)
! 50us [-ENOMEM] array_map_alloc
!* 2us [NULL] bpf_map_alloc_percpu
^C
Detaching... DONE in 251 ms.
The function call trace mode (-T) additionally provides a detailed trace of control flow across the given set of functions, making it possible to understand the kernel behavior more comprehensively:
FUNCTION CALL TRACE RESULT DURATION
----------------------------------------------- -------------------- ---------
→ bpf_prog_load
→ bpf_prog_alloc
↔ bpf_prog_alloc_no_stats [0xffffc9000031e000] 5.539us
← bpf_prog_alloc [0xffffc9000031e000] 10.265us
[...]
→ bpf_prog_kallsyms_add
↔ bpf_ksym_add [void] 2.046us
← bpf_prog_kallsyms_add [void] 6.104us
← bpf_prog_load [5] 374.697us
Last, but not least, LBR mode (Last Branch Records) allows the user to "look back" and peek deeper into an individual function's internals, trace "invisible" inlined functions, and pinpoint the problem all the way down to the individual C statements. This mode is especially useful when tracing unfamiliar parts of the kernel: it enables an iterative discovery process when you don't yet know where to look or which functions are relevant:
$ sudo ./retsnoop -e '*sys_bpf' -a 'array_map_alloc_check' --lbr=any
Receiving data...
20:29:17.844718 -> 20:29:17.844749 TID/PID 2385333/2385333 (simfail/simfail):
...
[#22] ftrace_trampoline+0x14c -> array_map_alloc_check+0x5 (kernel/bpf/arraymap.c:53:20)
[#21] array_map_alloc_check+0x13 (kernel/bpf/arraymap.c:54:18) -> array_map_alloc_check+0x75 (kernel/bpf/arraymap.c:54:18)
. bpf_map_attr_numa_node (include/linux/bpf.h:1735:19) . bpf_map_attr_numa_node (include/linux/bpf.h:1735:19)
[#20] array_map_alloc_check+0x7a (kernel/bpf/arraymap.c:54:18) -> array_map_alloc_check+0x18 (kernel/bpf/arraymap.c:57:5)
. bpf_map_attr_numa_node (include/linux/bpf.h:1735:19)
[#19] array_map_alloc_check+0x1d (kernel/bpf/arraymap.c:57:5) -> array_map_alloc_check+0x6f (kernel/bpf/arraymap.c:62:10)
[#18] array_map_alloc_check+0x74 (kernel/bpf/arraymap.c:79:1) -> __kretprobe_trampoline+0x0
...
Please also check out the companion blog post, "Tracing Linux kernel with retsnoop", which goes into more detail about each mode and demonstrates retsnoop usage on a few examples inspired by actual real-world problems.
NOTE: Retsnoop utilizes the power of BPF technology and so requires root permissions, or a sufficient level of capabilities (CAP_BPF and CAP_PERFMON). Also note that despite all the safety guarantees of BPF technology and careful implementation, kernel tracing is inherently tricky and potentially disruptive to production workloads, so it's always recommended to test whatever you are trying to do on a non-production system as much as possible. Typical user-context kernel code (which is a majority of Linux kernel code) won't cause any problems, but tracing very low-level internals running in special kernel context (e.g., hard IRQ, NMI, etc.) might need some care. The Linux kernel normally protects itself from tracing such dangerous and sensitive parts, but kernel bugs do slip in sometimes, so please use your best judgement (and test, if possible).
NOTE: Retsnoop relies on BPF CO-RE technology, so please make sure your Linux kernel is built with the CONFIG_DEBUG_INFO_BTF=y kernel config. Without it, retsnoop will refuse to start.
This section provides a reference-style description of various retsnoop concepts and features for those who want to use the full power of retsnoop to their advantage. If you are new to retsnoop, consider checking the "Tracing Linux kernel with retsnoop" blog post to familiarize yourself with the output format and see how the tool can be utilized in practice.
Retsnoop's operation is centered around tracing multiple functions. Functions are split into two categories: entry and non-entry (auxiliary) functions. Entry functions define a set of functions that activate retsnoop's recording logic. When any entry function is called for the first time, retsnoop starts tracking and recording all subsequent function calls, until the entry function that triggered recording returns. Once recording is activated, both entry and non-entry functions are recorded in exactly the same way and are not distinguished from each other. Function calls within each thread are traced completely independently, so there could be multiple recordings going on simultaneously on multi-CPU systems.
So, in short, entry functions are recording triggers, while non-entry functions augment recorded data with additional internal function calls, but only if those are called from an activated entry function. If some non-entry function is called before the entry function is called, such a call is ignored by retsnoop. Such a split avoids low-signal spam such as recordings of common helper functions that happen outside of an interesting context of entry functions.
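The trigger logic described above can be pictured with a small user-space simulation. This is only an illustrative sketch, not retsnoop's actual BPF implementation: the record_calls() helper, the event tuples, and the function names in the usage example are all hypothetical.

```python
def record_calls(events, entry_funcs, allowed_funcs):
    """Simulate per-thread recording for a stream of call events.

    events: iterable of ("enter" | "exit", function_name) tuples for ONE thread.
    Returns a list of recordings; each recording is the list of traced
    functions entered while an entry function was active.
    """
    recordings = []
    stack = None  # None means no recording is active on this thread
    calls = []
    for kind, func in events:
        if func not in entry_funcs and func not in allowed_funcs:
            continue  # functions outside both sets are never seen
        if kind == "enter":
            if stack is None:
                if func not in entry_funcs:
                    continue  # non-entry call outside an entry function: ignored
                stack, calls = [], []  # an entry function activates recording
            stack.append(func)
            calls.append(func)
        elif kind == "exit" and stack:
            stack.pop()
            if not stack:  # the triggering entry function returned
                recordings.append(calls)
                stack = None
    return recordings


events = [
    ("enter", "helper"), ("exit", "helper"),      # before any entry function
    ("enter", "sys_bpf"),
    ("enter", "helper"), ("exit", "helper"),      # inside the entry function
    ("enter", "untraced"), ("exit", "untraced"),  # not in either set
    ("exit", "sys_bpf"),
]
print(record_calls(events, {"sys_bpf"}, {"helper"}))
```

The first helper call is dropped because no entry function is active yet, while the second one becomes part of the recording.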
A set of functions is specified through:
- -e (--entry) argument(s), to specify a set of entry functions.
- -a (--allow) argument(s), to specify a set of non-entry functions. If any function in the non-entry set overlaps with the entry set, the entry set takes precedence.
- -d (--deny) argument(s), to exclude specified functions from both entry and non-entry sets. The denylist always takes precedence.
Each of -e, -a, and -d expects a value, which can take one of two forms:
- function name glob, similar to shell's file globs. E.g., *bpf* will match any function that has the "bpf" substring in its name. Only * and ? wildcards are supported, and they can be specified multiple times. The * wildcard matches any sequence of characters (or none), and ? matches exactly one character. So foo?? will match foo10, but not foo1. You can also optionally add a kernel module glob, in the form of <func-glob> [<module-glob>], to narrow down function search to only kernel modules matching the specified glob pattern. E.g., *_read_* [*kvm*] would match vmx_read_guest_seg_ar from the kvm_intel module, or segmented_read_std from the kvm module. So, specifying * [kvm] is probably the easiest way to discover all the traceable functions in the kvm module.
- source code path glob, prefixed with ':' (e.g., :kernel/bpf/*.c). Any function that is defined in a source file that matches the specified file path glob is added to the match. Source code location has to be relative to the kernel repo root. So, :kernel/bpf/*.c will match any function defined in *.c files under Linux's kernel/bpf directory. All this, of course, relies on the kernel image file (vmlinux) being available in one of the standard places and having DWARF debug information in it. Note that retsnoop doesn't analyze source code itself and doesn't expect it to be present anywhere. All this information is supposed to be recorded in DWARF debug information.
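To make the wildcard semantics above concrete, here is a minimal Python sketch of a matcher that supports only * and ?, plus a parser for the <func-glob> [<module-glob>] form. The helper names glob_match() and parse_spec() are made up for illustration; this is not retsnoop's own matching code, and how retsnoop treats unusual inputs (extra whitespace, nested brackets) is not covered here.

```python
import re


def glob_match(glob, name):
    """Match `name` against a glob where only '*' and '?' are special."""
    parts = []
    for ch in glob:
        if ch == "*":
            parts.append(".*")  # any sequence of characters, including none
        elif ch == "?":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), name) is not None


def parse_spec(spec):
    """Split '<func-glob> [<module-glob>]' into (func_glob, module_glob or None)."""
    m = re.fullmatch(r"(\S+)\s+\[(\S+)\]", spec.strip())
    if m:
        return m.group(1), m.group(2)
    return spec.strip(), None


print(glob_match("foo??", "foo10"), glob_match("foo??", "foo1"))
func_glob, module_glob = parse_spec("*_read_* [*kvm*]")
print(glob_match(func_glob, "vmx_read_guest_seg_ar"), glob_match(module_glob, "kvm_intel"))
```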
Any of -e, -a, and -d can be specified multiple times, and matched functions are concatenated within their category. This allows full flexibility in matching disparate subsets of functions that can't be expressed through one simple glob. Mixing function name globs and source code path globs is also supported.
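The precedence rules between the three categories can be summarized as plain set arithmetic. The sketch below assumes the globs have already been expanded into concrete function names; resolve_sets() and the sample names are hypothetical and only mirror the rules stated above (denylist first, then entry over non-entry).

```python
def resolve_sets(entry_matches, allow_matches, deny_matches):
    """Combine matched names into final (entry, non_entry) sets."""
    denied = set(deny_matches)
    entry = set(entry_matches) - denied              # denylist always wins
    non_entry = set(allow_matches) - denied - entry  # entry wins over non-entry
    return entry, non_entry


entry, non_entry = resolve_sets(
    entry_matches=["__x64_sys_bpf", "__sys_bpf"],
    allow_matches=["__sys_bpf", "map_create", "bpf_map_alloc_percpu"],
    deny_matches=["bpf_map_alloc_percpu"],
)
print(sorted(entry), sorted(non_entry))
```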
All matched functions are additionally checked against a list of traceable functions, which the kernel reports in the /sys/kernel/tracing/available_filter_functions file. If you don't see the expected function there, then most probably it's due to one of a few common reasons:
- function is inlined and as such isn't traceable directly; try to instead trace a non-inlined function that calls into the desired function or is called from it;
- function is in some way special and is not allowed to be traceable by the kernel itself (usually some low-level functions executed in restricted kernel contexts);
- function could be compiled out due to kernel config or is in a kernel module which isn't loaded at the time of tracing;
- sometimes a function gets renamed due to some compiler optimization, getting an additional suffix like .isra.0 and similar; in such a case append '*' at the end to match such suffixes.
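When checking by hand whether a function is traceable, a short script can help spot compiler-suffixed variants. The sketch below assumes each line of available_filter_functions is either a bare function name or a name followed by a bracketed module name; the helper names and the sample text are made up for illustration.

```python
def parse_traceable(text):
    """Map function name -> module name (None for built-in kernel functions)."""
    funcs = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        module = fields[1].strip("[]") if len(fields) > 1 else None
        funcs[fields[0]] = module
    return funcs


def find_variants(funcs, name):
    """Return names equal to `name` or `name` plus a '.suffix' (e.g. '.isra.0')."""
    return sorted(f for f in funcs if f == name or f.startswith(name + "."))


sample = """bpf_prog_load
map_create.isra.0
vmx_read_guest_seg_ar [kvm_intel]
"""
funcs = parse_traceable(sample)
print(find_variants(funcs, "map_create"))
print(funcs["vmx_read_guest_seg_ar"])
```

On a real system the text would come from reading /sys/kernel/tracing/available_filter_functions, which usually requires root.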
If in doubt whether you've specified the correct set of functions, use the --dry-run -v arguments to do a verbose dry-run, in which retsnoop will report all the functions it discovered and will attempt to attach, but will stop short of actually attaching them. This is the best way to validate everything without any risk of interfering with the system's workload.
As mentioned above, by default retsnoop captures a stack trace with the deepest nested function calls that satisfy conditions. E.g., if no custom error filters are specified, retsnoop will try to capture the stack trace of the deepest function call chain resulting in an error return. See the stack trace mode example in the companion blog post for more details.
Retsnoop always captures and emits a stack trace. Other modes (function call trace and LBR, described below) are complementary to the default stack trace mode and each other.
Providing the -T (--trace) flag enables function call trace mode. In this mode, retsnoop will keep track of a detailed sequence of calls between each entry and non-entry function. Just like stack trace mode, this recording is activated only upon hitting an entry function and also satisfying any of the additional filtering conditions.
As mentioned above, this mode complements the default stack trace mode and, if enabled, LBR mode. It provides a different view of a captured function call sequence. Given that it might be quite verbose and expensive to record, depending on the specific workload and the set of functions of interest, it requires explicit opt-in with the -T argument.
This mode is perfect for understanding kernel behavior in detail, especially unfamiliar parts of it. See the function call trace mode example in the companion blog post for more details.
Retsnoop also supports adding pointed probes that can be captured in addition to function calls in the trace output. Currently, four different probe kinds are supported:
- kprobes (e.g., -J 'kprobe:bprm_execve+0x5'), with an optional offset within the kernel function. An injected kprobe doesn't necessarily have to be attached at the kernel function entry, which allows tracing deeper into the function, including inside inlined functions.
- kretprobes (e.g., -J 'kretprobe:bprm_execve'), which are similar to kprobes, but don't allow offsets and are triggered as the probed function returns to the caller.
- tracepoints (e.g., -J 'tp:sched:sched_process_exec'), which attach to the specified kernel tracepoint, given as a combination of tracepoint category and name.
- raw tracepoints (e.g., -J 'rawtp:task_newtask'), which, similarly to (classic) tracepoints, attach to a kernel-defined tracepoint.
Classic tracepoints are normally "derived" from raw tracepoints. But there are important differences between them, which make supporting and using both important and worthwhile. They are far from equivalent.
Most obviously, performance-wise, raw tracepoints are faster and have smaller tracing overhead. When debugging with retsnoop this should be a pretty negligible factor, but it's there for those who care.
More importantly, arguments passed to a raw tracepoint can be significantly different compared to a corresponding classic tracepoint. Raw tracepoints tend to accept "raw" pointers to full kernel objects (like struct task_struct), while classic tracepoints have post-processed extracted values (like task's comm field, i.e., task name). Both have advantages and disadvantages and heavily come into play when the arguments capture feature is used. Some of the classic tracepoint arguments might be hard to fish out of a raw tracepoint, for instance.
Another big difference is syscall tracepoints. Syscall-specific tracepoints are special and are only available through classic tracepoints. They are triggered dynamically with special code in the kernel. There are only generic sys_enter and sys_exit raw tracepoints, which would have to be additionally filtered by syscall number, which is one of the raw tracepoint arguments. In this sense, classic tracepoints are more convenient and allow more fine-tuned targeting.
Finally, some tracepoints are only defined as raw (a.k.a., "bare") tracepoints, normally for very performance-critical parts of the kernel. You can find a bunch of them in the scheduler (see include/trace/events/sched.h), or grep for DECLARE_TRACE() uses in the kernel.
LBR (Last Branch Records) is an Intel CPU feature that allows users to instruct the CPU to record the last N calls/returns/jumps (what exactly is captured is configurable) constantly with no overhead. The number of captured records depends on the generation of CPU and is typically in the 8 to 32 range. In a recent enough Linux kernel (v5.16+) it's possible to capture such LBRs from a BPF program in an ad-hoc fashion, which is utilized by retsnoop in the LBR mode. Some non-Intel CPUs have similar capabilities, which are abstracted away by the kernel's perf subsystem, so you don't necessarily need Intel CPUs to take advantage of it with retsnoop.
So when is LBR mode useful? There are a few typical scenarios.
One of them is when you are investigating some generic error being returned from the kernel. The kernel can return errors in various different cases and it's not clear which one it is. This is often the case with the bpf() syscall, for example, where errors like -EINVAL can be returned due to dozens of various error conditions, and a bunch of such conditions could be checked within the same big function. Debugging exactly which error condition is hit can be maddening at times. LBR mode allows users to get insight into which exact if condition within the function returned an error.
Another common scenario is when you are trying to trace a completely unfamiliar part of Linux kernel code. You might know the entry function, but have no clue what other functions it calls and where in the source code they are defined. E.g., a common case would be an entry function that calls into a generic callback and it's not clear where the callback function is actually defined. In such cases it's very hard to know which function name globs or source code path globs to specify, as there could be lots of possible implementations of such callbacks. LBR mode can help here because it doesn't require tracing relevant functions to discover them.
No matter what circumstances call for LBR mode, it can be activated by using the -R (--lbr) argument. Similar to the function call trace mode, LBR mode is independent of and complementary to the default stack trace and function call trace modes. When the right conditions happen, retsnoop captures LBR data, in addition to stack trace and other information, and then outputs data in its own format.
See the LBR mode example in the companion blog post for more details.
Here we'll just note that LBR mode allows customizing what kind of records the CPU is instructed to record. It can be one of the following values: any, any_call, any_return (default), cond, call, ind_call, ind_jump, call_stack, abort_tx, in_tx, no_tx. See Linux's perf_event.h UAPI header and its enum perf_branch_sample_type_shift's comments for brief descriptions of each of those modes.
By default, retsnoop assumes the any_return LBR configuration, which records the last N returns from functions. This is very useful for the second type of cases, when we want to know which functions are being called without having to study code thoroughly. The LBR stack will point to functions called before the deepest function explicitly traced by retsnoop was called, even if they weren't part of the entry/non-entry set of functions. So you can keep iterating with retsnoop and expanding entry/non-entry function sets this way.
For the first class of scenarios (complicated functions with multiple error conditions), LBR config any might be more appropriate. It records any function calls, returns, and jumps, both conditional and unconditional. It might help pinpoint the exact statement within a single big function that is returning an error, as it's not bound to function boundaries.
Note that LBR is a tricky business and it might not capture enough relevant details, or sometimes it captures irrelevant details. If the latter is the case, you can trim it down with the --lbr-max-count argument to emit the specified number of most relevant entries.
NOTE: Function arguments capture feature is currently only supported on the x86-64 (amd64) architecture. Most probably arm64 support will be added as well; please request this using GitHub issues if you are interested.
retsnoop can capture function argument values for all traced functions. This works for both the default stack trace mode and function call trace mode. BTF information is used to determine names, exact types and sizes of all input arguments, but for functions that don't have BTF available, retsnoop will fall back to capturing raw register values for all pass-by-value registers, according to the platform's calling convention ABI.
For pointer arguments, not just the pointer value is captured, but pointee type data as well! retsnoop also understands strings and will capture and render string contents. This makes this functionality much more useful in practice for values passed by pointer and strings.
As mentioned in the injected probes section above, when arguments capture is used together with injected probes, retsnoop will automatically capture and print the probe's context data. What exactly is captured differs depending on the kind of injected probe:
- for kprobes (-J kprobe:...) and kretprobes (-J kretprobe:...), struct pt_regs, which records the state of CPU registers at the point of tracing, is captured and printed;
- for classic tracepoints (-J tp:...) all post-processed tracepoint arguments are captured and printed, including variable-sized strings;
- for raw tracepoints (-J rawtp:...) arguments are captured and printed with the same logic and behavior as for normal function arguments.
Rendering captured arguments might be tricky due to potentially large struct types, so retsnoop supports three rendering modes, which can be selected using the -C args.fmt-mode=<mode> configuration setting.
In the default compact mode, all of a function's captured arguments will be rendered in a single (possibly quite long) line next to the corresponding function entry in function call trace or stack trace data. Each argument output will be truncated to 120 characters by default, but this can be tuned using the args.fmt-max-arg-width setting with valid values between 0 and 250, zero meaning no truncation should be performed.
Here's an example, with per-argument truncation length overridden to just 40 (both stack trace and function call trace modes are shown). Note that it's quite wide, but beyond that doesn't really disturb the usual trace information rendering and visual coherence.
$ sudo ./retsnoop -e '*sys_bpf' -TAS -C args.fmt-max-arg-width=40
...
FUNCTION CALLS RESULT DURATION ARGS
--------------- ------ -------- ----
→ __x64_sys_bpf regs=&{.r15=0x959998935,.r14=0x171abb7677,.r…
↔ __sys_bpf [0] 3.366us cmd=1 uattr={{.kernel=0x7f1c73665120,.user=0x7f1c73… size=32
← __x64_sys_bpf [0] 4.506us
do_syscall_64+0x3d (arch/x86/entry/common.c:80)
. do_syscall_x64 (arch/x86/entry/common.c:50)
4us [0] __x64_sys_bpf+0x18 (kernel/bpf/syscall.c:5672) regs=&{.r15=0x959998935,.r14=0x171abb7677,.r…
. __se_sys_bpf (kernel/bpf/syscall.c:5672)
. __do_sys_bpf (kernel/bpf/syscall.c:5674)
! 3us [0] __sys_bpf cmd=1 uattr={{.kernel=0x7f1c73665120,.user=0x7f1c73… size=32
For a bit more width-contained data rendering, multiline mode is available (i.e., -C args.fmt-mode=multiline), in which each argument will be rendered on a separate line (under the corresponding function entry), but the argument value itself (even if it's a rather large struct) will be rendered compactly on a single line. The same per-argument truncation treatment is applied in multiline mode as in the default compact mode, and it can likewise be adjusted with the -C args.fmt-max-arg-width=N setting. Here's the same example as above, but this time in multiline mode and with truncation limit set to 80 characters:
$ sudo ./retsnoop -e '*sys_bpf' -TAS -C args.fmt-mode=multiline -C args.fmt-max-arg-width=80
...
FUNCTION CALLS RESULT DURATION
--------------- ------ --------
→ __x64_sys_bpf
› regs=&{.r15=0x7ffc86cfd05c,.r14=0x7ffc86cfd048,.r13=0x7f5ae9a55180,.r12=0x7fffffffff…
↔ __sys_bpf [0] 0.842us
› cmd=1
› uattr={{.kernel=0x7ffc86cfcf90,.user=0x7ffc86cfcf90}}
› size=32
← __x64_sys_bpf [0] 2.105us
entry_SYSCALL_64_after_hwframe+0x46 (entry_SYSCALL_64 @ arch/x86/entry/entry_64.S:120)
do_syscall_64+0x3d (arch/x86/entry/common.c:80)
. do_syscall_x64 (arch/x86/entry/common.c:50)
2us [0] __x64_sys_bpf+0x18 (kernel/bpf/syscall.c:5672)
› regs=&{.r15=0x7ffc86cfd05c,.r14=0x7ffc86cfd048,.r13=0x7f5ae9a55180,.r12=0x7fffffffff…
. __se_sys_bpf (kernel/bpf/syscall.c:5672)
. __do_sys_bpf (kernel/bpf/syscall.c:5674)
! 0us [0] __sys_bpf
› cmd=1
› uattr={{.kernel=0x7ffc86cfcf90,.user=0x7ffc86cfcf90}}
› size=32
It's a bit more visually distracting, but function call trace flow is still quite easy to follow, yet arguments data is more easily inspectable at the same time. Arguments are generally aligned with function return value output (both for function call trace and stack trace outputs).
Lastly, the ultimate verbose formatting mode is there to render all the captured data without any compromises. This is for the times when inspecting data is the highest priority and the visual coherence of function call rendering is of secondary concern. In verbose mode there is no truncation; each argument value can take however many lines are necessary (that mostly applies to rendering structs/unions and arrays). Note that the data can still be truncated if retsnoop wasn't able to capture enough data bytes; see the paragraph below explaining default and maximum limits. Tune them as appropriate if defaults are not satisfactory. Here's an example:
$ sudo ./retsnoop -e '*sys_bpf' -TAS -C args.fmt-mode=verbose
...
FUNCTION CALLS RESULT DURATION
--------------- ------ --------
→ __x64_sys_bpf
› regs = &{
.r15 = 0x7ffc86cfd05c,
.r14 = 0x7ffc86cfd048,
.r13 = 0x7f5ae9a55180,
.r12 = 0x7fffffffffffffff,
.bp = 0x7ffc86cfd030,
.r11 = 582,
.r10 = 70,
.r9 = 70,
.r8 = 70,
.ax = 0xffffffffffffffda,
.cx = 0x7f5aefb20e39,
.dx = 32,
.si = 0x7ffc86cfcf90,
.di = 1,
.orig_ax = 321,
.ip = 0x7f5aefb20e39,
.cs = 51,
.flags = 582,
.sp = 0x7ffc86cfcf88,
.ss = 43
}
↔ __sys_bpf [0] 0.793us
› cmd = 1
› uattr = {
{
.kernel = 0x7ffc86cfcf90,
.user = 0x7ffc86cfcf90
}
}
› size = 32
← __x64_sys_bpf [0] 1.982us
entry_SYSCALL_64_after_hwframe+0x46 (entry_SYSCALL_64 @ arch/x86/entry/entry_64.S:120)
do_syscall_64+0x3d (arch/x86/entry/common.c:80)
. do_syscall_x64 (arch/x86/entry/common.c:50)
1us [0] __x64_sys_bpf+0x18 (kernel/bpf/syscall.c:5672)
› regs = &{
.r15 = 0x7ffc86cfd05c,
.r14 = 0x7ffc86cfd048,
.r13 = 0x7f5ae9a55180,
.r12 = 0x7fffffffffffffff,
.bp = 0x7ffc86cfd030,
.r11 = 582,
.r10 = 70,
.r9 = 70,
.r8 = 70,
.ax = 0xffffffffffffffda,
.cx = 0x7f5aefb20e39,
.dx = 32,
.si = 0x7ffc86cfcf90,
.di = 1,
.orig_ax = 321,
.ip = 0x7f5aefb20e39,
.cs = 51,
.flags = 582,
.sp = 0x7ffc86cfcf88,
.ss = 43
}
. __se_sys_bpf (kernel/bpf/syscall.c:5672)
. __do_sys_bpf (kernel/bpf/syscall.c:5674)
! 0us [0] __sys_bpf
› cmd = 1
› uattr = {
{
.kernel = 0x7ffc86cfcf90,
.user = 0x7ffc86cfcf90
}
}
› size = 32
It's truly verbose, but will render all captured data for easy inspection.
Here's another example of (multi-line formatted, in this case) output for all supported injected probes. Note how the bprm_execve() kernel function is traced using normal retsnoop function tracing capabilities (i.e., with -a bprm_execve) and its input arguments (bprm, below) are captured. But we also install the kprobe:bprm_execve+5 kprobe in the middle of the function (not at the entry point), so the concept of "function arguments" might not make any sense anymore (they could be clobbered). So for injected kprobes retsnoop only captures and emits the low-level state of CPU registers.
$ sudo ./retsnoop -e '*sys_execve' -a 'bprm_execve' -T -A -C args.fmt-mode=m -J 'kprobe:bprm_execve+5' -J 'kretprobe:bprm_execve' -J 'rawtp:task_newtask' -J 'tp:sched:sched_process_exec' -J 'rawtp:sched_process_exec'
Receiving data...
11:52:13.997527 -> 11:52:13.997965 TID/PID 4066/4066 (zsh/zsh):
FUNCTION CALLS RESULT DURATION
------------------------------------- ------ ---------
→ __x64_sys_execve
› regs = &{.r15=0x7ffce71f44f0,.r13=0x559f84ae2d90,.r12=0x7ffce71f4570,.bp=0x7ffce71f4670,.bx=0x7f680bf01ac0,.r11=518,.r10=8,.r9…
→ bprm_execve
› bprm = &{.vma=0xffff888126dd1ef0,.vma_pages=1,.argmin=0x7fffffdff128,.mm=0xffff8881032a1f80,.p=0x7fffffffe379,.file=0xffff8881…
◎ kprobe:bprm_execve+0x5
› regs = {.r12=0x559f84ae2d90,.bp=0xffff888110323800,.bx=0xffff88810f433000,.r11=0x20363532204e454c,.r10=131064,.r9=0x8080808080…
◎ tp:sched:sched_process_exec
› filename = '/usr/bin/date'
› pid = 4066
› old_pid = 4066
◎ rawtp:sched_process_exec
› p = &{.thread_info={.flags=16384,.cpu=5},.stack=0xffffc90000b40000,.usage={.refs={.counter=1}},.flags=0x400000,.on_cpu=1,.w…
› old_pid = 4066
› bprm = &{.vma=0xffff888126dd1ef0,.argmin=0x7fffffdff128,.p=0x7fff898bc350,.point_of_no_return=1,.file=0xffff888101bdc000,.argc…
◎ kretprobe:bprm_execve
› regs = {.r12=0x559f84ae2d90,.bp=0xffff888110323800,.bx=0xffff88810f433000,.r11=0x20363532204e454c,.r10=0x2053534543435553,.r9=…
← bprm_execve [0] 339.341us
← __x64_sys_execve [0] 425.778us
By default, retsnoop will capture up to 3KB of data for any single function call, up to 256 bytes per argument, whether fixed-sized like integers, enums, structs passed by value, etc., or variable-length string arguments. All these limits can be adjusted with extra settings in the args.* group, see the Extra settings section. You can request capturing up to a total of 64KB of data (per function call), with a maximum of 16KB per argument.
retsnoop tries to achieve a high signal-to-noise ratio when rendering captured argument data, so it will skip printing all-zero values. I.e., if some field of a captured structure is either a zero integer value or another embedded struct with all its fields zero, such a field and its value won't be printed. If the argument itself is a zero integer, it will still be printed, of course. Similarly, struct-by-value or struct-by-pointer arguments with completely zeroed-out struct data will still be rendered as arg={} or arg=&{}.
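The zero-skipping rule can be illustrated with a toy renderer that models structs as Python dicts. This is a hypothetical sketch of the rule as described above, not retsnoop's formatter: render_arg() and its helpers are made-up names, and all integers are printed in decimal here, whereas real retsnoop output mixes decimal and hexadecimal.

```python
def is_all_zero(value):
    """True for a zero integer or a struct (dict) whose fields are all zero."""
    if isinstance(value, dict):
        return all(is_all_zero(v) for v in value.values())
    return value == 0


def render_value(value):
    if isinstance(value, dict):
        # All-zero fields are skipped, including nested all-zero structs.
        fields = [f".{k}={render_value(v)}"
                  for k, v in value.items() if not is_all_zero(v)]
        return "{" + ",".join(fields) + "}"
    return str(value)


def render_arg(name, value, by_pointer=False):
    # The argument itself is always printed, even if it is entirely zero.
    prefix = "&" if by_pointer and isinstance(value, dict) else ""
    return f"{name}={prefix}{render_value(value)}"


print(render_arg("cmd", 0))
print(render_arg("attr", {"a": 0, "b": {"c": 0}}, by_pointer=True))
print(render_arg("attr", {"a": 0, "b": {"c": 0, "d": 5}, "e": 32}))
```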
By default, retsnoop records any function call traces (based on entry and non-entry function sets) that result in a triggered entry function returning an error. The "error return" is defined heuristically and is either NULL for pointer-returning functions or a small negative -Exxx error for integer-returning ones. This can be further adjusted, as you'll see below. But such function traces are collected globally across any process, which might be inconvenient and lead to irrelevant "spam" in output.
Fortunately, retsnoop allows the user to adjust conditions under which it captures traces.
-S (--success-stacks) forces retsnoop to capture any function trace, disabling the default logic of capturing only error-returning cases. If you don't see retsnoop capturing stack traces or recording function call traces that you are sure are happening inside the kernel, check if you need to enable the successful stack traces capture with -S.
retsnoop also allows the user to fine-tune which return results are considered to be erroneous with the -x (--allow-errors) and -X (--deny-errors) arguments. They expect either Exxx/-Exxx symbolic error codes (you can find all errors recognized by retsnoop in this table) or NULL.
These arguments can be specified multiple times and all errors are concatenated into a single list. The error denylist (-X) takes precedence, so if there is an overlap between -x and -X, -X wins and the specified error will be ignored, even if it is allowed by the -x argument.
If no -x is provided, all supported errors are assumed to be allowed, except those denied by -X.
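The interaction between the allowlist and the denylist boils down to a short predicate. The sketch below is a hypothetical model of the rules just described, working on symbolic names only; is_error_captured() is not part of retsnoop, and the handling of successful returns (-S) is deliberately left out.

```python
def normalize(err):
    """Treat 'ENOMEM' and '-ENOMEM' as the same symbolic error."""
    return err.lstrip("-")


def is_error_captured(err, allow=(), deny=()):
    """Decide whether an error such as '-ENOMEM' or 'NULL' passes the filters."""
    code = normalize(err)
    if code in {normalize(e) for e in deny}:
        return False  # the denylist (-X) always wins
    if allow:
        return code in {normalize(e) for e in allow}  # -x restricts the set
    return True       # no -x given: every error is allowed


print(is_error_captured("-ENOMEM"))
print(is_error_captured("-ENOMEM", allow=["EINVAL"]))
print(is_error_captured("-EINVAL", allow=["EINVAL"], deny=["-EINVAL"]))
```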
retsnoop allows the user to narrow down the set of processes in the context of which data will be captured. These filters are invaluable on busy production hosts that have a lot of things going on at the same time, when you need to investigate something happening only within a small subset of processes.
-p (--pid) and -P (--no-pid) allow filtering based on process ID (PID). Just as with most other arguments like this, you can specify them multiple times to combine multiple PID filters.
-n (--comm) and -N (--no-comm) allow the user to filter by process/thread names, in addition to or instead of PID filtering. They can be specified multiple times as well.
-L (--longer) allows the user to specify the minimal duration of a triggering entry function execution that will be captured and reported, skipping everything that completed sooner. This argument should be a positive number in units of milliseconds.
If you need to investigate latency issues, this filter lets you ignore irrelevant fast-completing function calls and instead trace slow ones in a more focused way.
retsnoop supports various levels of "verboseness". By default it doesn't emit any extra information about what it's doing and which functions it's going to trace. The default verbose level (-v) provides high-signal verbose output that helps to understand what's going on under the covers. With -v retsnoop will report a list of discovered functions, where the kernel image is located, etc. retsnoop also has more verbose levels (-vv and -vvv), but they are most probably only useful for developers of retsnoop and for debugging.
The default verbose output is extremely useful in combination with dry-run mode, activated with the --dry-run argument. In this mode retsnoop will report all the actions it's going to perform (including listing which functions it discovered and is going to trace), but stops short of actually activating any of that. This way there is no chance of disturbing a production workload, which makes it a safe way to pre-check everything upfront.
-V (--version) will print retsnoop's version. If combined with -v, retsnoop will also output all the detected kernel features that it relies on. You'll need to run retsnoop as root or with CAP_BPF and CAP_PERFMON capabilities for feature detection to work. You should see something like below:
$ sudo ./retsnoop -Vv
retsnoop v0.9.1
Feature detection:
BPF ringbuf map supported: yes
bpf_get_func_ip() supported: yes
bpf_get_branch_snapshot() supported: yes
BPF cookie supported: yes
multi-attach kprobe supported: no
Feature calibration:
kretprobe IP offset: 8
fexit sleep fix: yes
fentry re-entry protection: yes
retsnoop tries to provide as accurate and full function and stack trace information as possible. If the Linux kernel image can be found in the system in one of the standard locations and it contains DWARF type information, this will be used to augment captured stack traces with information about source code location and inline functions. Using DWARF information adds a bit of extra CPU overhead, so this behavior and related parameters can be tuned.
If retsnoop fails to find the kernel image in the standard location, it can be pointed to a custom location through the -k (--kernel) argument.
-s (--symbolize) allows the user to tune stack symbolization behavior. Specify -sn to disable extra DWARF-based symbolization. In such a case retsnoop will stick to a basic symbolization based on /proc/kallsyms data. -s alone will try to get source code location information, but won't attempt to symbolize inline functions. -ss enables both inlined functions and source code information. This is the default mode if a kernel with DWARF information is found, due to its extreme usefulness in most cases.
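The three symbolization levels can be summarized as a lookup table. The sketch below pairs each flag with the mode name that matches its description in the --config-help output (none, linenum, inlines); that pairing is inferred from the descriptions rather than stated explicitly, and the helper name is hypothetical.

```python
# Inferred correspondence between -s flags and stacks.symb-mode values.
SYMB_MODES = {
    "-sn": "none",     # /proc/kallsyms only: no source info, no inlines
    "-s": "linenum",   # source code location (file:line), no inlines
    "-ss": "inlines",  # source code location and inline functions
}


def symb_mode(flag=None):
    """Return the symbolization mode for a flag.

    With no flag, assume a kernel image with DWARF information was found,
    in which case the documented default matches -ss.
    """
    if flag is None:
        return "inlines"
    return SYMB_MODES[flag]


print(symb_mode("-sn"))
```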
Note, DWARF type information is also necessary for source code path globs to work.
As retsnoop gains more functionality, it also gains various knobs and extra settings that advanced users might want to adjust depending on their specific use case. To avoid overloading the main CLI with them, all such less often needed settings are put behind a separate --config argument, which accepts configuration parameters of the form <group>.<key>=<value>. E.g., to request retsnoop to print out raw addresses in stack traces, one can call retsnoop with the --config stacks.emit-addrs=true argument.
Use the --config-help argument to request a list and corresponding descriptions of all supported advanced settings:
$ retsnoop --config-help
It's possible to customize various retsnoop's internal implementation details.
This can be done by specifying one or multiple extra parameters using --config KEY=VALUE CLI arguments.
Supported configuration parameters:
bpf.ringbuf-size - BPF ringbuf size in bytes.
Increase if you experience dropped data. By default is set to 8MB.
bpf.sessions-size - BPF sessions map capacity (defaults to 4096)
stacks.symb-mode - Stack symbolization mode.
Determines how much processing is done for stack symbolization
and what kind of extra information is included in stack traces:
none - no source code info, no inline functions;
linenum - source code info (file:line), no inline functions;
inlines - source code info and inline functions.
stacks.emit-addrs - Emit raw captured stack trace/LBR addresses (in addition to symbols)
stacks.dec-offs - Emit stack trace/LBR function offsets in decimal (by default, it's in hex)
stacks.unfiltered - Emit all stack stace/LBR entries (turning off relevancy filtering)
args.max-total-args-size - Maximum total amount of arguments data bytes captured per each function call
args.max-sized-arg-size - Maximum amount of data bytes captured for any fixed-sized argument
args.max-str-arg-size - Maximum amount of data bytes captured for any string argument
args.fmt-mode - Function arguments formatting mode (compact, multiline, verbose)
args.fmt-max-arg-width - Maximum amount of horizontal space taken by a single argument output.
Applies only to compact and multiline modes.
If set to zero, no truncation is performed.
fmt.lbr-max-count - Limit number of printed LBRs to N
As with most retsnoop arguments, multiple --config/-C arguments can be provided at the same time.
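To make the <group>.<key>=<value> form concrete, here is a minimal Python sketch that splits several such parameters into their parts. It is not how retsnoop parses its arguments; the function name is hypothetical, values are kept as plain strings, and letting a later duplicate override an earlier one is an assumption of the sketch.

```python
def parse_config(args):
    """Parse ['group.key=value', ...] into {(group, key): value}.

    Hypothetical helper; values are not validated against the list
    of settings printed by --config-help.
    """
    settings = {}
    for arg in args:
        name, eq, value = arg.partition("=")
        group, dot, key = name.partition(".")
        if not (eq and dot and group and key):
            raise ValueError(f"expected <group>.<key>=<value>, got {arg!r}")
        settings[(group, key)] = value
    return settings


# Models: --config stacks.emit-addrs=true -C args.fmt-mode=multiline
cfg = parse_config(["stacks.emit-addrs=true", "args.fmt-mode=multiline"])
print(cfg[("stacks", "emit-addrs")])
```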
Each release has a pre-built retsnoop binary for x86-64 (amd64) and arm64 architectures ready to be downloaded and used. Go to the "Releases" page to download the latest binary.
It's also pretty straightforward to build retsnoop from the sources. Most of retsnoop's dependencies are already included:
- libbpf is checked out as a submodule, built and statically linked automatically by retsnoop's Makefile;
- the only runtime libraries (beyond libc) are libelf and zlib, you'll also need development versions of them (for API headers) to compile libbpf;
- retsnoop pre-packages x86-64 versions of necessary tooling (bpftool) required during the build, but this can be improved if there is an interest in retsnoop on non-x86 architecture (please open an issue to request);
- the largest external dependency is Clang compiler with support for bpf target. Try to use at least Clang 11+, but the latest Clang version you can get, the better.
Once dependencies are satisfied, the rest is simple:
$ make -C src
You'll get the retsnoop binary under the src/ folder. You can copy it to a production server and run it. There are no extra files that need to be distributed besides the main retsnoop executable.
Retsnoop has started to be packaged by distros. The table below points out which distros package retsnoop and at which version.
Footnotes
1. User space tracing might be supported in the future, but for now retsnoop is specializing in tracing kernel internals only.