
Kernel development

Brief items

Kernel release status

The 2.6.32 kernel is out, released on December 2. Some of the more significant features in 2.6.32 include devtmpfs, a bunch of block layer scalability work, HWPOISON, kernel shared memory, a number of additions to the perf events subsystem, and much more. See the KernelNewbies 2.6.32 page for lots of details.

The 2.6.33 merge window is open; see the accompanying article for a summary of the most significant changes merged so far.

Stable kernel updates: 2.6.31.7, containing a very long list of fixes, was released on December 8. 2.6.27.40 was released at the same time; it was quickly followed by 2.6.27.41 to fix a build problem.

Comments (none posted)

Quotes of the week

Unfortunately, even the best human reflexes simply cannot keep up with a 5GHz CPU. Even if we imagine a hyper-caffeinated 5GHz superhero, there are many millions of computers to be kept up with. And so the modern microprocessor invalidates untold millenia of evolution, frustrating untold numbers of would-be computer professionals.
-- Paul McKenney

The fact is, any time anybody makes up a new locking mechanism, THEY ALWAYS GET IT WRONG. Don't do it.
-- Linus Torvalds

But us kernel developers stay up too late at night, drinking jolt (or beer if you are in Europe), and our brain cells have fused to only logical circuitry, thus understanding concepts that are not engraved in stone becomes a bit too straining for us, and we may finally have to give up on solving this one last bug to get some rest with our love one that's been sleeping since 9pm.

This means using DECLARE_* and DEFINE_* will push us over that brink to normalcy and must be avoided. A new name must be established to clearly describe the mystical CPP magic that comprises the TRACE_EVENT hackery. Something that can bring us back to our roots. Something where it all begins. The stone age.

Thus, this patch renames the MACROS to the most obvious definitions around. Something we should have thought of at the start.

s/DEFINE_EVENT_CLASS/FRED/g
s/DEFINE_EVENT/WILMA/g
s/TRACE_EVENT/BARNEY/g
-- Steven "Bam Bam" Rostedt

Comments (1 posted)

Vger outage on Dec 12th and 13th

David Miller has informed the world that the machine serving as vger.kernel.org will be moving from one hosting facility to another on the 12th and 13th of December. That will, of course, affect traffic to all of the kernel-oriented mailing lists managed by that server. That empty mailbox will not mean that your mail system is broken; it's just a temporary pause in the linux-kernel firehose.

Full Story (comments: 2)

Spinlock naming resolved

By Jonathan Corbet
December 9, 2009
The eventual merging of sleeping spinlocks from the realtime tree requires splitting existing locks into two types: those (the majority) which can sleep in a realtime kernel, and those which must remain "real" spinlocks. The 2009 Kernel Summit decided against renaming the vast number of ordinary locks, but did not even attempt to come up with a name for the "raw" spinlock type.

One possibility was raw_spinlock_t, but that name is already used by the lockdep code. After numerous other possible names were considered, the winning suggestion was one by Linus: arch_spinlock_t. This name carries the connotation of a low-level spinlock which is implemented by architecture-specific code; in the absence of a better idea, it has carried the day.

Thomas Gleixner has posted a patch which makes the naming changes. It's worth noting that even though the new type is not a raw_spinlock_t, one still calls raw_spin_lock() to acquire an arch_spinlock_t lock. As of this writing, this patch has not been merged for 2.6.33, but chances are that it will be.

Comments (2 posted)

Kernel development news

2.6.33 merge window part 1

By Jonathan Corbet
December 9, 2009
Linus began merging patches for the 2.6.33 development cycle on December 5. The merge window got off to a slow start, but then Linus began to make up for lost time; as of this writing, some 3900 changesets have been merged.

User-visible changes include:

  • The ftrace framework has seen a number of improvements, including the ability to trace multiple processes simultaneously, regular expression support in tracing filters, tracing of big kernel lock events, and tracing of accesses and modifications to arbitrary kernel variables.

  • Perhaps most significantly (for ftrace), the dynamic probes for ftrace patch set has been merged, allowing the placement of arbitrary tracepoints at run time. The "perf" tool has also been extended to be able to place and use dynamic tracepoints.

  • There are many other enhancements to "perf," including a new framework for benchmark suites, a command to record and analyze kernel memory allocations, and a generic scripting language hook set.

  • Eric Biederman's long quest to remove binary sysctl() support has finally made it into the mainline.

  • The recvmmsg() system call has been added.

  • The anticipatory I/O scheduler has been removed, in favor of CFQ which is seen as providing a superset of features.

  • The new, unified block I/O bandwidth controller has been merged.

  • The networking layer has gained support for TCP cookie transactions [PDF], a mechanism which allows faster, more secure, and more robust initiation of TCP connections.

  • The DRBD distributed block device has been merged.

  • New drivers:

    • Boards and processors: ST-Ericsson U8500 boards, Marvell Dove (88AP510) system-on-chip CPUs, Palm Centro 685 phones, and CompuLab CM-T35 boards.

    • Networking: TI High End CAN controllers, Intel Wireless MultiCom 3200 chips, Ralink rt2800 wireless chipsets, Microchip MCP251x SPI CAN controllers, Freescale MSCAN-based CAN controllers, and Solarflare SFC9000 10G Ethernet controllers.

    • Sound: miroSOUND PCM20 radio tuners, Texas Instruments TPA6130A2 stereo headphone amplifiers, TI tlv320dac33 codecs, Asahi Kasei AK4113 and AK4671 codecs, WM8580 based audio subsystems on SMDK64xx systems, Wolfson Micro WM8711/L sound devices, and Raumfeld audio adapters.

    • Miscellaneous: GRLIB APBUART serial ports, Oki MSM6242 realtime clock chips, and Ricoh RP5C01 RTCs.

Changes visible to kernel developers include:

  • There is a new unreachable() macro which can be used to mark code which will never be executed. Its main application is in macros like BUG().

  • New security module hooks, intended for pathname-based modules, have been added to check chmod(), chown(), and chroot().

  • There is a new RCU variant, called "tiny RCU," which is meant for non-SMP situations where memory footprint must be minimized.

  • printk_ratelimit() can, once again, be used in atomic context. (Note, though, that there are developers who would like to eliminate this function in favor of some sort of more local rate limiting).

  • The creation of nearly-identical tracepoints has been made significantly easier. TRACE_EVENT_TEMPLATE() has a syntax identical to TRACE_EVENT(), but it creates a template which can be used by the simpler DEFINE_EVENT() macro to create specific tracepoints; a sketch of the pattern appears just after this list. The code gets simpler, and, as a side benefit, the kernel gets smaller.
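
For those who have not seen these macros, the pattern looks roughly like the sketch below; the template name, event names, and fields are invented for illustration, and real tracepoint definitions carry more detail:

    TRACE_EVENT_TEMPLATE(foo_template,

        TP_PROTO(struct task_struct *p),

        TP_ARGS(p),

        TP_STRUCT__entry(
            __field(pid_t, pid)
        ),

        TP_fast_assign(
            __entry->pid = p->pid;
        ),

        TP_printk("pid=%d", __entry->pid)
    );

    /* two concrete events sharing the single template above */
    DEFINE_EVENT(foo_template, foo_start,
        TP_PROTO(struct task_struct *p),
        TP_ARGS(p));

    DEFINE_EVENT(foo_template, foo_stop,
        TP_PROTO(struct task_struct *p),
        TP_ARGS(p));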

The merge window should stay open for at least another week; expect quite a bit more code to be incorporated for 2.6.33 before the window closes.

Comments (8 posted)

Minimizing instrumentation impacts

By Jake Edge
December 9, 2009

Minimizing the overhead of various kernel debugging and tracing mechanisms is important for many reasons. For static instrumentation, like tracepoints, the impact when they are not enabled must be very low or they won't get used—or merged. In addition, for any kind of instrumentation, the impact when enabled needs to be as small as possible so that whatever behavior is under observation will not radically change due to the tracing. Two separate proposals, jump labels for tracepoints and kprobes jump optimization, are both trying to reduce the effect that instrumentation has on performance. In addition, they share some underlying code.

The kprobes jump optimization has been proposed by Masami Hiramatsu, and trades off a bit of extra memory for approximately one-fifth the overhead in making a kprobe call. According to Hiramatsu's posting, kprobes went from 0.68us (32-bit) and 0.91us (64-bit) to 0.06us (both) when they were optimized with this technique. kretprobes dropped from 0.95us (32-bit) and 1.21us (64-bit) to 0.30 and 0.35us respectively. All of his testing was done on a 2.33GHz Xeon processor.

Those numbers are pretty eye-opening, especially since the optimization only adds around 200 bytes per probe. The basic idea is to use a jump instruction, rather than a breakpoint, to implement probes whenever that is possible. The patch includes some fairly elaborate "safety checks" to see if it is possible to do the optimization. Before any of that is done, however, a regular breakpoint-based kprobe is inserted—if the optimization can't be done, that will be used instead.

The jump instruction that will be put at the address to be probed is longer than one byte, so the optimization step needs to look at the region of code it will be affecting. If that region straddles the boundary between functions (i.e. spills out of the probed function into the next), the optimization is not done. The optimization code then decodes the function, looking for jump instructions that could land inside the affected region; if none are found, the optimization proceeds.

The instructions located at the address to be probed still need to be executed once they are replaced by a jump, of course, so a "detour" buffer is created. The detour buffer holds a copy of the instructions from the probed location, followed by a jump back to the original execution path. Once the kprobe code itself has run, this detour buffer is used to finish the execution past the probe point.

Once the detour buffer has been created, the kprobe is enqueued on the kprobe-optimizer workqueue, where the actual jump is patched into the probe site. The optimizer needs to ensure that there are no interrupts executing and does so by using synchronize_sched() in the workqueue function. Once that completes, the text_poke_fixup() function, which is added as part of the patchset, is called to actually modify the code to patch the jump in.

The text_poke_fixup() patch is the piece that is shared with jump labels. It looks like:

    void *text_poke_fixup(void *addr, const void *opcode, size_t len,
                          void *fixup);
where addr points to the location to change, and opcode and len specify the new opcode (and its length) to be written there. fixup is the address where a processor should jump if it hits addr while the modification is in progress.

Essentially, text_poke_fixup() first places a breakpoint on addr which redirects any processor that hits it to the code at fixup, and synchronizes that change across all CPUs. It then modifies all the other bytes of the region (except the first), once again synchronizing with the other CPUs. The next step is to modify the first byte, again requiring synchronization, and then the breakpoint can be cleared. Any execution that reaches addr during the modification will be routed by the breakpoint to the fixup code instead.
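
To make the pieces concrete, a hypothetical caller might look like the sketch below. Only text_poke_fixup() itself comes from the patch set; the structure, field names, and helper function are invented for illustration:

    #include <linux/types.h>

    #define RELATIVEJUMP_SIZE 5         /* size of an x86 "jmp rel32" */

    struct opt_probe {                  /* invented for this sketch */
        void *addr;                     /* address of the probed instruction */
        void *detour_buf;               /* copied instructions + jump back */
    };

    static void arm_optimized_probe(struct opt_probe *op)
    {
        u8 jmp_code[RELATIVEJUMP_SIZE];

        /* Build a "jmp detour_buf", relative to the end of the jump. */
        jmp_code[0] = 0xe9;
        *(s32 *)&jmp_code[1] = (long)op->detour_buf -
                               ((long)op->addr + RELATIVEJUMP_SIZE);

        /*
         * Rewrite the probe site.  A CPU that hits op->addr while the
         * bytes are being changed is diverted, via the temporary
         * breakpoint, to the detour buffer instead.
         */
        text_poke_fixup(op->addr, jmp_code, RELATIVEJUMP_SIZE,
                        op->detour_buf);
    }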

A jump label uses the same technique, but, since it applies to static instrumentation (tracepoints), it is meant to reduce the impact of the likely case that the tracepoint is disabled. It does that by using an assembly construct that will be available in the soon-to-be-released GCC 4.5, the asm goto, which allows branching to labels.

For a tracepoint, the idea is that the disabled case will consist of a 5-byte NOP (conveniently sized to be overwritten with a jump) followed by a jump around the disabled tracepoint code. When the tracepoint gets enabled, text_poke_fixup() is used to turn the NOP into a jump to the label in the DECLARE_TRACE() macro. That code is what the original unconditional jump skips over.
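
Stripped of the kernel's macro machinery, the underlying construct looks roughly like this small user-space sketch (x86, gcc 4.5 or later; the function and label names are made up, and the real implementation also records the NOP's address in a special section so that it can be found and patched later):

    #include <stdbool.h>
    #include <stdio.h>

    static bool tracepoint_enabled(void)
    {
        /* A 5-byte NOP: exactly the size of a "jmp rel32", so it can
         * later be overwritten with a jump to the do_trace label. */
        asm goto("1: .byte 0x0f, 0x1f, 0x44, 0x00, 0x00\n\t"
                 : : : : do_trace);
        return false;           /* common case: tracepoint disabled */
    do_trace:
        return true;            /* reached only once the NOP is patched */
    }

    int main(void)
    {
        if (tracepoint_enabled())
            printf("tracepoint payload would run here\n");
        return 0;
    }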

The jump labels patch then has code to manage the state of the tracepoints, including the labels and addresses, along with the current enabled/disabled status of the tracepoint. It is somewhat of a hackish abuse of the pre-processor and assembler, but according to Jason Baron, who proposed the patch, it results in "an average improvement of 30 cycles per-tracepoint on x86_64 systems that I've tested".

Jump labels eliminate the test and jump currently done for each tracepoint, because the tracepoint code can be enabled and disabled dynamically. Adding the NOP and unconditional jump adds "2 - 4 cycles on average vs. no tracepoint", Baron said, which is a pretty low cost for this kind of instrumentation.

Both of these techniques are likely to need some more "soaking" time before they are ready for the mainline. Jump labels are a more recent proposal and rely on features in a not-yet-released compiler, which would seem to put them a bit further behind. The reaction to both has been relatively positive, though, which probably indicates general agreement with their goals. Reducing the overhead for tracing and debugging is something that few will argue against.

Comments (1 posted)

Debugging the kernel using Ftrace - part 1

December 9, 2009

This article was contributed by Steven Rostedt

Ftrace is a tracing utility built directly into the Linux kernel. Many distributions already have various configurations of Ftrace enabled in their most recent releases. One of the benefits that Ftrace brings to Linux is the ability to see what is happening inside the kernel; that makes finding problem areas, or simply tracking down that strange bug, more manageable.

Ftrace's ability to show the events that lead up to a crash gives a better chance of finding exactly what caused it and can help the developer in creating the correct solution. This article is the first of a two-part series covering various methods of using Ftrace to debug the Linux kernel. This first part will talk briefly about setting up Ftrace, using the function tracer, writing to the Ftrace buffer from within the kernel, and various ways to stop the tracer when a problem is detected.

Ftrace was derived from two tools. One was the "latency tracer" by Ingo Molnar, used in the -rt tree. The other was my own "logdev" utility, which was used primarily for debugging the Linux kernel. This article will mostly describe features that came out of logdev, but will also look at the function tracer that originated in the latency tracer.

Setting up Ftrace

Currently the API to interface with Ftrace is located in the Debugfs file system. Typically, that is mounted at /sys/kernel/debug. For easier accessibility, I usually create a /debug directory and mount it there. Feel free to choose your own location for Debugfs.
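
If Debugfs is not already mounted on your system, mounting it by hand is straightforward; assuming the /debug directory mentioned above:

    [~]# mkdir /debug
    [~]# mount -t debugfs nodev /debug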

When Ftrace is configured, it will create its own directory called tracing within the Debugfs file system. To avoid any confusion about where the Debugfs file system has been mounted, this article will reference the files in that directory as though the user has first changed into the Debugfs tracing directory:

    [~]# cd /sys/kernel/debug/tracing
    [tracing]#

This article focuses on using Ftrace as a debugging tool. Some configurations of Ftrace are used for other purposes, like finding latencies or analyzing the system. For the purpose of debugging, the kernel configuration parameters that should be enabled are:

    CONFIG_FUNCTION_TRACER
    CONFIG_FUNCTION_GRAPH_TRACER
    CONFIG_STACK_TRACER
    CONFIG_DYNAMIC_FTRACE

Function tracing - no modification necessary

One of the most powerful tracers of Ftrace is the function tracer. It uses the -pg option of gcc to have every function in the kernel call a special function "mcount()". That function must be implemented in assembly because the call does not follow the normal C ABI.

When CONFIG_DYNAMIC_FTRACE is configured, the call is converted to a NOP at boot time to keep the system running at 100% performance. During compilation, the mcount() call sites are recorded, and that list is used at boot time to convert those sites to NOPs. Since NOPs are pretty useless for tracing, the list is also saved so that the call sites can be converted back into trace calls when the function (or function graph) tracer is enabled.
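
To get a feel for what is actually being patched: on x86-64, the -pg option causes gcc to emit a call to mcount at the entry of every function, roughly like this (the exact sequence varies by architecture and compiler version); it is this 5-byte call instruction that dynamic Ftrace overwrites with a NOP at boot:

        push   %rbp
        mov    %rsp,%rbp
        call   mcount      # converted to a 5-byte NOP at boot, and back
                           # to a call when the function tracer is enabled
        ...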

It is highly recommended to enable CONFIG_DYNAMIC_FTRACE because of this performance enhancement. In addition, CONFIG_DYNAMIC_FTRACE provides the ability to filter which functions should be traced. Note that, even though the NOPs do not show any impact in benchmarks, the frame pointers that come along with the -pg option have been known to cause a slight overhead.
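
For example, limiting the function tracer to scheduler-related functions is just a matter of writing a glob pattern into the set_ftrace_filter file before enabling the tracer:

    [tracing]# echo 'sched_*' > set_ftrace_filter
    [tracing]# echo function > current_tracer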

To find out which tracers are available, simply cat the available_tracers file in the tracing directory:

    [tracing]# cat available_tracers 
    function_graph function sched_switch nop

To enable the function tracer, just echo "function" into the current_tracer file.

    [tracing]# echo function > current_tracer
    [tracing]# cat current_tracer
    function

    [tracing]# cat trace | head -10
    # tracer: function
    #
    #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
    #              | |       |          |         |
                bash-16939 [000]  6075.461561: mutex_unlock <-tracing_set_tracer
              <idle>-0     [001]  6075.461561: _spin_unlock_irqrestore <-hrtimer_get_next_event
              <idle>-0     [001]  6075.461562: rcu_needs_cpu <-tick_nohz_stop_sched_tick
                bash-16939 [000]  6075.461563: inotify_inode_queue_event <-vfs_write
              <idle>-0     [001]  6075.461563: mwait_idle <-cpu_idle
                bash-16939 [000]  6075.461563: __fsnotify_parent <-vfs_write

The header explains the format of the output pretty well. The first two items are the traced task name and PID. The CPU that the trace was executed on is within the brackets. The timestamp is the time since boot, followed by the function name. The function in this case is the function being traced with its parent following the "<-" symbol.

This information is quite powerful and shows the flow of functions nicely, but it can be a bit hard to follow. The function graph tracer, created by Frederic Weisbecker, traces both the entry and exit of a function, which gives the tracer the ability to know the depth of the functions that are called. The function graph tracer makes the flow of execution within the kernel much easier for the human eye to follow:

    [tracing]# echo function_graph > current_tracer 
    [tracing]# cat trace | head -20
    # tracer: function_graph
    #
    # CPU  DURATION                  FUNCTION CALLS
    # |     |   |                     |   |   |   |
     1)   1.015 us    |        _spin_lock_irqsave();
     1)   0.476 us    |        internal_add_timer();
     1)   0.423 us    |        wake_up_idle_cpu();
     1)   0.461 us    |        _spin_unlock_irqrestore();
     1)   4.770 us    |      }
     1)   5.725 us    |    }
     1)   0.450 us    |    mutex_unlock();
     1) + 24.243 us   |  }
     1)   0.483 us    |  _spin_lock_irq();
     1)   0.517 us    |  _spin_unlock_irq();
     1)               |  prepare_to_wait() {
     1)   0.468 us    |    _spin_lock_irqsave();
     1)   0.502 us    |    _spin_unlock_irqrestore();
     1)   2.411 us    |  }
     1)   0.449 us    |  kthread_should_stop();
     1)               |  schedule() {

The start and end of a function are shown with C-like annotations: "{" when a function is entered and "}" when it returns. Leaf functions, which do not call other functions, simply end with a ";". The DURATION column shows the time spent in the corresponding function; the function graph tracer records the time the function was entered and exited and reports the difference as the duration. These numbers only appear with the leaf functions and the "}" symbol.

Note that this time also includes the overhead of all functions called within a nested function, as well as the overhead of the function graph tracer itself. The function graph tracer hijacks the return address of the function in order to insert a trace callback for the function exit; this breaks the CPU's branch prediction and causes a bit more overhead than the function tracer. The timings closest to the truth are therefore those reported for leaf functions.

The lonely "+" in the output is a duration annotation: when the duration is greater than 10 microseconds, a "+" is shown; when it is greater than 100 microseconds, a "!" is displayed.

Using trace_printk()

printk() is the king of all debuggers, but it has a problem. If you are debugging a high-volume area such as the timer interrupt, the scheduler, or the network, printk() can bog down the system or even create a livelock. It is also quite common to see a bug "disappear" when a few printk()s are added; that is due to the sheer overhead that printk() introduces.

Ftrace introduces a new form of printk() called trace_printk(). It can be used just like printk(), and can also be used in any context (interrupt code, NMI code, and scheduler code). What is nice about trace_printk() is that it does not output to the console. Instead it writes to the Ftrace ring buffer and can be read via the trace file.

Writing into the ring buffer with trace_printk() only takes around a tenth of a microsecond or so. But using printk(), especially when writing to the serial console, may take several milliseconds per write. The performance advantage of trace_printk() lets you record the most sensitive areas of the kernel with very little impact.

For example you can add something like this to the kernel or module:

    trace_printk("read foo %d out of bar %p\n", bar->foo, bar);

Then by looking at the trace file, you can see your output.

    [tracing]# cat trace
    # tracer: nop
    #
    #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
    #              | |       |          |         |
               <...>-10690 [003] 17279.332920: : read foo 10 out of bar ffff880013a5bef8

The above example was done by adding a module that actually had a foo and bar construct.
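
For reference, a minimal module along those lines could look like the following sketch (the structure and values are invented to match the output shown above):

    #include <linux/module.h>
    #include <linux/init.h>
    #include <linux/kernel.h>

    struct bar {
        int foo;
    };

    static struct bar bar_data = { .foo = 10 };

    static int __init foo_init(void)
    {
        struct bar *bar = &bar_data;

        /* Writes to the Ftrace ring buffer, not to the console. */
        trace_printk("read foo %d out of bar %p\n", bar->foo, bar);
        return 0;
    }

    static void __exit foo_exit(void)
    {
    }

    module_init(foo_init);
    module_exit(foo_exit);
    MODULE_LICENSE("GPL");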

trace_printk() output will appear in any tracer, even the function and function graph tracers.

    [tracing]# echo function_graph > current_tracer
    [tracing]# insmod ~/modules/foo.ko
    [tracing]# cat trace
    # tracer: function_graph
    #
    # CPU  DURATION                  FUNCTION CALLS
    # |     |   |                     |   |   |   |
     3) + 16.283 us   |      }
     3) + 17.364 us   |    }
     3)               |    do_one_initcall() {
     3)               |      /* read foo 10 out of bar ffff88001191bef8 */
     3)   4.221 us    |    }
     3)               |    __wake_up() {
     3)   0.633 us    |      _spin_lock_irqsave();
     3)   0.538 us    |      __wake_up_common();
     3)   0.563 us    |      _spin_unlock_irqrestore();

Yes, the trace_printk() output looks like a comment in the function graph tracer.

Starting and stopping the trace

Obviously there are times when you only want to trace a particular code path; perhaps you only want to trace what is happening when you run a specific test. The tracing_on file is used to keep the ring buffer from recording data:

    [tracing]# echo 0 > tracing_on

This keeps the Ftrace ring buffer from recording. Everything else still happens with the tracers, though, and they still incur most of their overhead. They do notice that the ring buffer is not recording and will not attempt to write any data, but the calls that the tracers make are still performed.

To re-enable the ring buffer, simply write a '1' into that file:

    [tracing]# echo 1 > tracing_on

Note that it is very important to have a space between the number and the greater-than sign ">". Otherwise the shell will interpret "0>" or "1>" as a file-descriptor redirection, and the number will never be written to the file.

    [tracing]# echo 0> tracing_on   /* this will not work! */

A common run might be:

    [tracing]# echo 0 > tracing_on
    [tracing]# echo function_graph > current_tracer
    [tracing]# echo 1 > tracing_on; run_test; echo 0 > tracing_on

The first line disables the ring buffer from recording any data. The next enables the function graph tracer. The overhead of the function graph tracer is still present but nothing will be recorded into the trace buffer. The last line enables the ring buffer, runs the test program, then disables the ring buffer. This narrows the data stored by the function graph tracer to include mostly just the data accumulated by the run_test program.
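
When even that is too coarse, the ring buffer can also be stopped from within the kernel itself: tracing_off() is the in-kernel counterpart of writing 0 to the tracing_on file, and is one of the "stop the tracer when a problem is detected" approaches mentioned at the start. A minimal sketch, with a placeholder condition standing in for whatever problem is being hunted:

    #include <linux/kernel.h>    /* trace_printk() and tracing_off() */

    /* "suspect_value" and the test below are placeholders for this sketch */
    static void check_suspect_value(int suspect_value)
    {
        if (suspect_value < 0) {
            trace_printk("bad value %d, stopping trace\n", suspect_value);
            /*
             * Freeze the ring buffer so the events leading up to this
             * point remain available for reading from the trace file.
             */
            tracing_off();
        }
    }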

What's next?

The next article will continue the discussion of debugging the kernel with Ftrace. The method described above for disabling the trace may not be fast enough; the latency between the end of the run_test program and echoing the 0 into the tracing_on file may cause the ring buffer to overflow and lose the relevant data. I will discuss other, more efficient ways to stop tracing, how to debug a crash, and how to see which functions in the kernel are stack hogs. The best way to find out more is to enable Ftrace and just play with it. You can learn a lot about how the kernel works simply by following the function graph tracer.

Comments (11 posted)

Patches and updates

Kernel trees

Linus Torvalds: Linux 2.6.32
Greg KH: Linux 2.6.31.7
Greg KH: Linux 2.6.27.40
Greg KH: Linux 2.6.27.41

Architecture-specific

Core kernel code

Development tools

Device drivers

Documentation

Filesystems and block I/O

Memory management

KAMEZAWA Hiroyuki: percpu mm struct counter cache

Security-related

Virtualization and containers

Miscellaneous

Douglas Gilbert: lsscsi-0.23 released
Kay Sievers: udev 149 release

Page editor: Jonathan Corbet


Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds