Improving scheduler latency
The CFS scheduler divides time into periods, during which each process is expected to run once. The length of the period should thus determine the maximum amount of time that any given process can expect to have to wait to be able to run - the maximum latency. That length, by default, is 6ms. If there are two processes running, those 6ms will be divided up something like this: each process gets a 3ms slice.
This assumes that both processes are completely CPU-bound, have the same priority, and that nothing else perturbs the situation, naturally. If a third ideal CPU-bound process shows up, that same period is divided into smaller pieces: each of the three processes now gets a 2ms slice.
This process of dividing the scheduler period cannot continue forever, though. Every context switch has its cost in terms of operating system overhead and cache behavior; switching too often will have a measurable effect on the total throughput of the system. The current scheduler, by default, draws the line at 2ms; if the average time slice threatens to go below that length, the period will be extended instead. So if one more CPU-bound process shows up, the period is stretched to 8ms so that each of the four processes still gets a 2ms slice.
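As a rough illustration (this is not the kernel's actual code, just the arithmetic described above), the period calculation can be thought of as follows:

```c
/*
 * Sketch of the period logic described in the text: the period stays
 * at its default length until the per-process slice would fall below
 * the minimum, at which point the period is stretched instead.
 * Values mirror the defaults mentioned above (6ms period, 2ms minimum).
 */
#include <stdio.h>

#define SCHED_LATENCY_US	6000	/* default period: 6ms */
#define MIN_GRANULARITY_US	2000	/* minimum slice:  2ms */

static unsigned int period_us(unsigned int nr_running)
{
	unsigned int period = SCHED_LATENCY_US;

	if (nr_running * MIN_GRANULARITY_US > period)
		period = nr_running * MIN_GRANULARITY_US;
	return period;
}

int main(void)
{
	for (unsigned int n = 1; n <= 6; n++)
		printf("%u tasks: period %u us, slice %u us\n",
		       n, period_us(n), period_us(n) / n);
	return 0;
}
```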
In other words, once the load gets high enough, the kernel will start to sacrifice latency in order to keep throughput up. In situations where the load is quite high (kernel builds with a lot of parallel processes are often mentioned), latencies can reach a point where users start to get truly irritable. Mathieu Desnoyers decided he could improve the situation with this patch, which attempted to shrink the minimum possible time slice until there were more than eight running processes; in this way, he hoped to improve latencies on more heavily-loaded systems.
Mathieu's patch included some test results showing that the maximum latencies had been cut roughly in half. Even so, Peter Zijlstra dismissed the patch, saying "Not at all charmed, this look like random changes without conceptual integrity." That, in turn, earned a mild rebuke from Linus, who felt that the kernel's latency performance was not as good as it could be. After that, the discussion went on for a while, leading to the interesting conclusion that everybody was partly right.
Mathieu's patch was based on a slightly flawed understanding of how the scheduler period was calculated, so it didn't do quite what he was expecting. Rejecting the patch was, thus, the correct thing for the scheduler maintainers to do. The patch did improve latencies, though. It turns out that the change that actually mattered was reducing the length of the minimum time slice from 2ms to 750µs. That allows the scheduler to keep the same period with up to eight processes, and reduces the expansion of the period thereafter. The result is better latency measurements and, it seems, a nicer interactive feel. A patch making just the minimum time slice change was fast-tracked into the mainline and will be present in 2.6.36-rc5. Interestingly, despite the concerns that a shorter time slice would affect throughput, there has not been a whole lot of throughput benchmarking done on this patch so far.
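For concreteness, here is a small comparison (mine, not part of any posted patch) of the period lengths produced by the old 2ms minimum and the new 750µs minimum under the default 6ms latency target; the 6ms period now survives up to eight runnable tasks (6ms / 750µs = 8) and grows much more slowly beyond that.

```c
/* Compare scheduler period lengths with a 2000us vs. a 750us minimum
 * slice, assuming the default 6ms latency target described above. */
#include <stdio.h>

static unsigned int period_us(unsigned int n, unsigned int min_slice_us)
{
	return n * min_slice_us > 6000 ? n * min_slice_us : 6000;
}

int main(void)
{
	const unsigned int counts[] = { 2, 4, 8, 16, 32 };

	for (unsigned int i = 0; i < sizeof(counts) / sizeof(counts[0]); i++)
		printf("%2u tasks: old period %5u us, new period %5u us\n",
		       counts[i], period_us(counts[i], 2000),
		       period_us(counts[i], 750));
	return 0;
}
```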
Things don't stop there, though. One of Mathieu's tests uses the SIGEV_THREAD flag to timer_create(), causing the creation of a new thread for each event. That new thread, it seems, takes a long time to find its way into the CPU. The culprit here seems to be in the code which tries to balance CPU access between a newly forked process and its parent - a place which has often proved problematic in the past. Mike Galbraith pointed out that the START_DEBIT scheduler feature - which serves to defer a new task's first execution into the next period - has an unpleasant effect on latency. Turning that feature off improves things considerably, but with costs felt elsewhere in the system; in particular, it allows fork-heavy loads to unfavorably impact other processes.
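Mathieu's actual benchmark is more involved, but a minimal example of the pattern being tested looks something like this (a sketch of my own, using only the standard POSIX timer API): each expiration of a SIGEV_THREAD timer causes a new notification thread to be created, and the interesting number is how long that thread takes to get onto a CPU.

```c
/*
 * Minimal SIGEV_THREAD example (not Mathieu's benchmark): arm a timer
 * that notifies by creating a new thread, then measure how much later
 * than the programmed expiration time that thread actually runs.
 * Build with: cc -o sigev-thread sigev-thread.c -lrt -lpthread
 */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define DELAY_NS	100000000	/* timer programmed for 100ms */

static struct timespec armed;		/* when the timer was armed */

static void notify(union sigval sv)
{
	struct timespec now;
	long long elapsed_us;

	(void)sv;
	clock_gettime(CLOCK_MONOTONIC, &now);
	elapsed_us = (now.tv_sec - armed.tv_sec) * 1000000LL +
		     (now.tv_nsec - armed.tv_nsec) / 1000;
	/* anything beyond the programmed 100ms is wakeup + scheduling delay */
	printf("notification thread ran %lld us after expiry\n",
	       elapsed_us - DELAY_NS / 1000);
	exit(0);
}

int main(void)
{
	timer_t timerid;
	struct sigevent sev;
	struct itimerspec its;

	memset(&sev, 0, sizeof(sev));
	sev.sigev_notify = SIGEV_THREAD;
	sev.sigev_notify_function = notify;

	if (timer_create(CLOCK_MONOTONIC, &sev, &timerid)) {
		perror("timer_create");
		return 1;
	}

	memset(&its, 0, sizeof(its));
	its.it_value.tv_nsec = DELAY_NS;
	clock_gettime(CLOCK_MONOTONIC, &armed);
	if (timer_settime(timerid, 0, &its, NULL)) {
		perror("timer_settime");
		return 1;
	}

	pause();	/* the notification thread exits the program */
	return 0;
}
```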
Mathieu posted a patch adding a new feature called START_NICE; if it is enabled, both processes returning from a fork() will have their priority reduced for one scheduler period. With that penalty, both processes can be allowed to run in the current period; their effect on the rest of the system will be reduced. The associated benchmark numbers show a significant improvement from this change.
Meanwhile, Peter went away for a bit and came back with a rather more complex patch demonstrating a different approach. With this patch, new tasks are still put at the end of the queue to ensure that they don't deprive existing processes of their current time slices. But, if the new DEADLINE feature is turned on, each new task also gets a deadline set to one scheduler period in the future. Should that deadline pass without that process being scheduled, it will be run immediately. That should put a cap on the maximum latency that new threads will see.
This patch is large and complex, though, and Peter warns that his testing stopped once the code compiled. So this one is not something to expect for 2.6.36; if it survives benchmarking, though, we might see it become ready for the next development cycle.
Posted Sep 16, 2010 8:22 UTC (Thu) by rvfh (guest, #31018):
I am not an expert, but should not this be a bit more dynamic? I mean, it could impact CPU-constrained systems which usually run few tasks. Maybe the slice could start at 2 ms and reduce down to 750 us as more processes turn up?
Posted Sep 16, 2010 14:17 UTC (Thu) by clugstj (subscriber, #4020):

Shouldn't the timeslice minimum be based on the speed of the CPU? After all, CFS scraps the 'arbitrary' un-CPU-centric HZ.

Posted Sep 16, 2010 20:05 UTC (Thu) by bfields (subscriber, #19510):

Isn't the real expense of a context switch in the various kinds of cache misses that it causes? In any case it's more complicated than "speed of the CPU". Maybe there's some way it could be measured dynamically.
Posted Sep 17, 2010 22:06 UTC (Fri) by Julie (guest, #66693):
I wondered this before too, I assumed the answer was so obvious to everyone else that I would just be exposing my naivety in bringing it up. (So I'm glad you did because that makes me feel better :-))
Posted Sep 16, 2010 17:17 UTC (Thu) by marcH (subscriber, #57642):
There is a really simple and cheap way to make subjective feelings objective: blind trials. Randomly switch between the two kernels to compare and write down your impressions. After a few times, look in the logs which kernels were tested when.
Posted Sep 22, 2010 20:50 UTC (Wed) by Corkscrew (guest, #65853):
1) Set up both kernel versions so that they can be swapped with minimal effort (e.g. via bootloader).
2) Give the volunteer a few minutes on the old version, as a baseline.
3) Send the volunteer out of the room. Toss a coin to decide which kernel version to run. Secretly record which version you booted.
4) Leave the room and send the volunteer in. They should play around with the computer and record whether it feels more or less responsive than the baseline.
(Ideally there should be no contact between the developer and the volunteer after the choice in step #2 is made.)
5) Repeat steps #3-#4 (or #2-#4 if you're not worried about the volunteer getting bored). Each second repetition, use the kernel version you didn't use the first time round (this doesn't compromise randomisation and makes the statistics easier).
6) Perform a basic statistical analysis on the results. [Turns out this isn't as basic as I thought, and my battery's about to give out; will look it up tomorrow.]
This is a bit primitive - there are still a few opportunities for bias. A better tool would be a randomised bootloader of some kind that would log the choices of kernel and only reveal them after the tester had stated their preferences. This would have the additional advantage that the developer could do the testing without compromising blinding.
Anyone fancy developing one of those?
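The thread never settles on a method for step 6; one simple possibility (my suggestion, not something proposed above) is a sign test: count how often the new kernel was judged more responsive and ask how likely that count would be if the two kernels were actually indistinguishable.

```c
/*
 * Sign test sketch for step 6 (one possible analysis, not prescribed
 * by the thread): given n blind trials in which the new kernel was
 * judged more responsive k times, compute P(X >= k) for a fair coin.
 * Valid for n < 64 trials.
 */
#include <stdio.h>

static double sign_test_p(int n, int k)
{
	double p = 0.0;

	for (int i = k; i <= n; i++) {
		double c = 1.0;		/* C(n, i), built up incrementally */

		for (int j = 1; j <= i; j++)
			c = c * (n - j + 1) / j;
		p += c / (double)(1ULL << n);
	}
	return p;
}

int main(void)
{
	/* example: 10 trials, new kernel preferred in 9 of them */
	printf("p-value = %.4f\n", sign_test_p(10, 9));
	return 0;
}
```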