
Improving scheduler latency

By Jonathan Corbet
September 14, 2010
The level of interactive response provided by the kernel's CPU scheduler is the subject of endless discussion and tweaking. It is one of those problems which, seemingly, can never be fully solved to everybody's satisfaction. Some recent discussions on the topic have shown, though, that low-hanging fruit can remain after all these years; it's just a matter of drawing attention to the right place.

The CFS scheduler divides time into periods, during which each process is expected to run once. The length of the period should thus determine the maximum amount of time that any given process can expect to have to wait to be able to run - the maximum latency. That length, by default, is 6ms. If there are two processes running, those 6ms will be divided up something like this:

[Figure: a 6ms period split into two 3ms slices]

This assumes that both processes are completely CPU-bound, have the same priority, and that nothing else perturbs the situation, naturally. If a third ideal CPU-bound process shows up, that same period is divided into smaller pieces:

[Figure: the 6ms period split into three 2ms slices]

This process of dividing the scheduler period cannot continue forever, though. Every context switch has its cost in terms of operating system overhead and cache behavior; switching too often will have a measurable effect on the total throughput of the system. The current scheduler, by default, draws the line at 2ms; if the average time slice threatens to go below that length, the period will be extended instead. So if one more cranker process shows up, the result will be:

[Figure: the period stretched to 8ms, split into four 2ms slices]
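
The arithmetic behind these pictures is simple enough to model directly. The following sketch reflects the rules as described above - a 6ms target period and a 2ms floor on the per-task slice - rather than the scheduler's actual code:

    /* Model of the period calculation described in the text: divide a
     * 6ms period among the runnable tasks, but never hand out slices
     * shorter than 2ms; stretch the period instead. */
    #include <stdio.h>

    #define SCHED_LATENCY_NS    6000000ULL   /* target period: 6ms */
    #define MIN_GRANULARITY_NS  2000000ULL   /* slice floor:   2ms */

    static unsigned long long period_ns(unsigned long nr_running)
    {
        if (nr_running * MIN_GRANULARITY_NS > SCHED_LATENCY_NS)
            return nr_running * MIN_GRANULARITY_NS;
        return SCHED_LATENCY_NS;
    }

    int main(void)
    {
        for (unsigned long n = 1; n <= 6; n++) {
            unsigned long long p = period_ns(n);
            printf("%lu tasks: period %llu ms, slice %llu us\n",
                   n, p / 1000000, p / n / 1000);
        }
        return 0;
    }

Run it and the pattern above falls out: two tasks get 3ms each, three tasks hit the 2ms floor, and a fourth stretches the period to 8ms rather than shrinking the slices further.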

In other words, once the load gets high enough, the kernel will start to sacrifice latency in order to keep throughput up. In situations where the load is quite high (kernel builds with a lot of parallel processes are often mentioned), latencies can reach a point where users start to get truly irritable. Mathieu Desnoyers decided he could improve the situation with this patch, which attempted to shrink the minimum possible time slice until there were more than eight running processes; in this way, he hoped to improve latencies on more heavily-loaded systems.

Mathieu's patch included some test results showing that the maximum latencies had been cut roughly in half. Even so, Peter Zijlstra dismissed the patch, saying "Not at all charmed, this look like random changes without conceptual integrity." That, in turn, earned a mild rebuke from Linus, who felt that the kernel's latency performance was not as good as it could be. After that, the discussion went on for a while, leading to the interesting conclusion that everybody was partly right.

Mathieu's patch was based on a slightly flawed understanding of how the scheduler period was calculated, so it didn't do quite what he was expecting. Rejecting the patch was, thus, the correct thing for the scheduler maintainers to do. The patch did improve latencies, though. It turns out that the change that actually mattered was reducing the length of the minimum time slice from 2ms to 750µs. That allows the scheduler to keep the same period with up to eight processes, and reduces the expansion of the period thereafter. The result is better latency measurements and, it seems, a nicer interactive feel. A patch making just the minimum time slice change was fast-tracked into the mainline and will be present in 2.6.36-rc5. Interestingly, despite the concerns that a shorter time slice would affect throughput, there has not been a whole lot of throughput benchmarking done on this patch so far.
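
On kernels built with CONFIG_SCHED_DEBUG, the knobs involved are exposed under /proc/sys/kernel/, so the effect of the change can be checked directly; note that the stock values are scaled up on multiprocessor machines, so they may read larger than 6ms and 750µs. A minimal sketch:

    /* Read the CFS tunables (exposed with CONFIG_SCHED_DEBUG) and report
     * how many runnable tasks fit into one period before it stretches. */
    #include <stdio.h>

    static unsigned long long read_ull(const char *path)
    {
        unsigned long long val = 0;
        FILE *f = fopen(path, "r");

        if (!f || fscanf(f, "%llu", &val) != 1)
            perror(path);
        if (f)
            fclose(f);
        return val;
    }

    int main(void)
    {
        unsigned long long latency =
            read_ull("/proc/sys/kernel/sched_latency_ns");
        unsigned long long min_gran =
            read_ull("/proc/sys/kernel/sched_min_granularity_ns");

        if (!latency || !min_gran)
            return 1;

        printf("period %llu us, slice floor %llu us\n",
               latency / 1000, min_gran / 1000);
        printf("the period holds %llu tasks before stretching\n",
               latency / min_gran);
        return 0;
    }

With a 6ms period, a 750µs floor gives eight tasks per period where the old 2ms floor gave three.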

Things don't stop there, though. One of Mathieu's tests uses the SIGEV_THREAD flag to timer_create(), causing the creation of a new thread for each event. That new thread, it seems, takes a long time to find its way into the CPU. The culprit here seems to be in the code which tries to balance CPU access between a newly forked process and its parent - a place which has often proved problematic in the past. Mike Galbraith pointed out that the START_DEBIT scheduler feature - which serves to defer a new task's first execution into the next period - has an unpleasant effect on latency. Turning that feature off improves things considerably, but with costs felt elsewhere in the system; in particular, it allows fork-heavy loads to unfavorably impact other processes.
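
For readers who have not run into it, the mechanism Mathieu's test exercises looks roughly like this: the sketch below arms a one-shot POSIX timer with SIGEV_THREAD and reports how long the notification thread takes to start running, measured from the moment the timer was armed (so it includes the 1ms timer delay). The measurement scheme is an illustration, not Mathieu's actual benchmark; link with -lrt on older glibc:

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    static struct timespec armed;       /* when the timer was armed */

    /* Runs in a brand-new thread each time the timer expires. */
    static void timer_thread(union sigval sv)
    {
        struct timespec now;

        (void)sv;
        clock_gettime(CLOCK_MONOTONIC, &now);
        printf("thread ran %lld us after arming\n",
               (now.tv_sec - armed.tv_sec) * 1000000LL +
               (now.tv_nsec - armed.tv_nsec) / 1000);
    }

    int main(void)
    {
        timer_t timerid;
        struct sigevent sev;
        struct itimerspec its;

        memset(&sev, 0, sizeof(sev));
        sev.sigev_notify = SIGEV_THREAD;        /* new thread per expiry */
        sev.sigev_notify_function = timer_thread;
        if (timer_create(CLOCK_MONOTONIC, &sev, &timerid)) {
            perror("timer_create");
            return 1;
        }

        memset(&its, 0, sizeof(its));
        its.it_value.tv_nsec = 1000000;         /* one shot, 1ms from now */
        clock_gettime(CLOCK_MONOTONIC, &armed);
        if (timer_settime(timerid, 0, &its, NULL)) {
            perror("timer_settime");
            return 1;
        }

        sleep(1);                               /* wait for the expiration */
        return 0;
    }

On a loaded system, that reported number is where the START_DEBIT behavior described above would be expected to show up.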

Mathieu posted a patch adding a new feature called START_NICE; if it is enabled, both processes returning from a fork() will have their priority reduced for one scheduler period. With that penalty, both processes can be allowed to run in the current period; their effect on the rest of the system will be reduced. The associated benchmark numbers show a significant improvement from this change.

Meanwhile, Peter went away for a bit and came back with a rather more complex patch demonstrating a different approach. With this patch, new tasks are still put at the end of the queue to ensure that they don't deprive existing processes of their current time slices. But, if the new DEADLINE feature is turned on, each new task also gets a deadline set to one scheduler period in the future. Should that deadline pass without that process being scheduled, it will be run immediately. That should put a cap on the maximum latency that new threads will see.
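
The patch itself is not reproduced here, but the idea as described can be put into a toy model: a new task is queued behind everybody else so it cannot steal the current slices, yet it carries a deadline one period out, and an expired deadline wins the next pick. This is a sketch of the concept only, not Peter's code:

    #include <stdio.h>

    #define PERIOD_NS 6000000ULL

    struct toy_task {
        int id;
        unsigned long long deadline;    /* must run by this time */
    };

    /* An expired deadline wins outright; otherwise fall back to the head
     * of the queue, standing in for the normal fair pick. */
    static int pick_next(const struct toy_task *q, int n,
                         unsigned long long now)
    {
        for (int i = 0; i < n; i++)
            if (q[i].deadline <= now)
                return i;
        return 0;
    }

    int main(void)
    {
        /* Task 2 was just created: queued last, deadline one period out. */
        struct toy_task q[] = {
            { 0, 2 * PERIOD_NS },
            { 1, 2 * PERIOD_NS },
            { 2, 1 * PERIOD_NS },
        };

        printf("at t=0:      task %d\n", q[pick_next(q, 3, 0)].id);
        printf("at t=period: task %d\n", q[pick_next(q, 3, PERIOD_NS)].id);
        return 0;
    }

However CPU-bound the existing tasks are, the newcomer runs no later than one period after it was queued - which is the latency cap the patch is after.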

This patch is large and complex, though, and Peter warns that his testing stopped once the code compiled. So this one is not something to expect for 2.6.36; if it survives benchmarking, though, we might see it become ready for the next development cycle.

Index entries for this article
Kernel: Latency
Kernel: Scheduler/Completely fair scheduler
Kernel: Scheduler/Latency



Improving scheduler latency

Posted Sep 16, 2010 8:22 UTC (Thu) by rvfh (guest, #31018) [Link] (2 responses)

> A patch making just the minimum time slice change was fast-tracked into the mainline and will be present in 2.6.36-rc5.

I am not an expert, but should not this be a bit more dynamic? I mean, it could impact CPU-constrained systems which usually run few tasks. Maybe the slice could start at 2 ms and reduce down to 750 us as more processes turn up?

"minimum" time slice

Posted Sep 16, 2010 10:20 UTC (Thu) by knewt (subscriber, #32124) [Link] (1 responses)

Note the "minimum". Assuming I've read the article correctly, the slice size is 6ms/process_count, until this reaches 750µs, at which point it doesn't reduce any further (all with the default 6ms maximum latency setting). So with only two processes you would have a 3ms slice, as shown in the first example.

"minimum" time slice

Posted Sep 16, 2010 11:39 UTC (Thu) by rvfh (guest, #31018) [Link]

Indeed. Sorry for missing this obvious point :-(

Improving scheduler latency

Posted Sep 16, 2010 14:17 UTC (Thu) by clugstj (subscriber, #4020) [Link] (4 responses)

Whether it's 2ms or 750us, it seems arbitrary. Shouldn't the timeslice minimum be based on the speed of the CPU? That is to say, faster processors can tolerate more context switching before it becomes significant.

Improving scheduler latency

Posted Sep 16, 2010 20:05 UTC (Thu) by bfields (subscriber, #19510) [Link] (1 responses)

Isn't the real expense of a context switch in the various kinds of cache misses that it causes? In any case it's more complicated than "speed of the CPU". Maybe there's some way it could be measured dynamically.

Improving scheduler latency

Posted Sep 17, 2010 14:52 UTC (Fri) by clugstj (subscriber, #4020) [Link]

750us is a HUGE amount of time compared to cache misses.

Improving scheduler latency

Posted Sep 17, 2010 22:06 UTC (Fri) by Julie (guest, #66693) [Link] (1 responses)


> Shouldn't the timeslice minimum be based on the speed of the CPU?

I wondered this before too, but I assumed the answer was so obvious to everyone else that I would just be exposing my naivety by bringing it up. (So I'm glad you did, because that makes me feel better :-))
After all, CFS scraps the 'arbitrary' un-CPU-centric HZ.

So, um, why _is_ it that we don't scale process scheduling resolution according to the capability of the CPU? Assuming this isn't a silly question.

Improving scheduler latency

Posted Sep 18, 2010 19:03 UTC (Sat) by oak (guest, #2786) [Link]

I remember that on x86 the scheduler tick has been increased to 1 kHz. I wonder what happens on (embedded & slower) architectures where the scheduler tick is still e.g. 128 Hz...?

Improving scheduler latency

Posted Sep 16, 2010 17:17 UTC (Thu) by marcH (subscriber, #57642) [Link] (1 responses)

> The result is better latency measurements and, it seems, a nicer interactive feel.

There is a really simple and cheap way to make subjective feelings objective: blind trials. Randomly switch between the two kernels to compare and write down your impressions. After a few rounds, look in the logs to see which kernel was tested when.

Improving scheduler latency

Posted Sep 16, 2010 21:35 UTC (Thu) by martinfick (subscriber, #4455) [Link]

This assumes that interactivity is simply either better or not. It ignores the possibility that interactivity might be better with one kernel for some use cases and better with the other kernel for other use cases. So, while your suggestion might work for an exact use case, which use cases are important is itself likely very subjective.

Improving scheduler latency

Posted Sep 22, 2010 20:50 UTC (Wed) by Corkscrew (guest, #65853) [Link]

Regarding subjective measures like "nicer interactive feel": this should be easy to test. At worst it would take about an hour plus a willing volunteer.

1) Set up both kernel versions so that they can be swapped with minimal effort (e.g. via bootloader).

2) Give the volunteer a few minutes on the old version, as a baseline.

3) Send the volunteer out of the room. Toss a coin to decide which kernel version to run. Secretly record which version you booted.

4) Leave the room and send the volunteer in. They should play around with the computer and record whether it feels more or less responsive than the baseline.

(Ideally there should be no contact between the developer and the volunteer after the choice in step #3 is made.)

5) Repeat steps #3-#4 (or #2-#4 if you're not worried about the volunteer getting bored). Each second repetition, use the kernel version you didn't use the first time round (this doesn't compromise randomisation and makes the statistics easier).

6) Perform a basic statistical analysis on the results. [Turns out this isn't as basic as I thought, and my battery's about to give out; will look it up tomorrow.]

This is a bit primitive - there are still a few opportunities for bias. A better tool would be a randomised bootloader of some kind that would log the choices of kernel and only reveal them after the tester had stated their preferences. This would have the additional advantage that the developer could do the testing without compromising blinding.

Anyone fancy developing one of those?

