Solving starvation problems in the scheduler
[Posted March 22, 2006 by corbet]


The Linux CPU scheduler has come a long way since the early 2.6 days, when it was the cause of quite a bit of worry. Scheduling domains fixed many of the problems on larger systems, while a whole set of interactivity heuristics made desktops work better. The interactivity work, in particular, is based on the notion of a "sleep average." Any process which spends a significant amount of its time sleeping, relative to the time it runs, is deemed to be "interactive" and is given a higher priority.
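As a rough illustration of how the sleep average feeds into priority, here is a condensed, user-space sketch (not the kernel's actual code; the real logic lives in kernel/sched.c, and the constants and struct fields below are illustrative only):

    #define MAX_BONUS     10      /* largest dynamic-priority boost */
    #define MAX_SLEEP_AVG 1000    /* sleep_avg saturates at this value */

    struct task {
            int static_prio;      /* nice-derived base priority */
            int sleep_avg;        /* recent sleep time, 0..MAX_SLEEP_AVG */
    };

    /* A task that sleeps a lot earns a bonus; a lower value is a
     * higher priority. */
    static int effective_prio(const struct task *p)
    {
            int bonus = p->sleep_avg * MAX_BONUS / MAX_SLEEP_AVG;

            return p->static_prio - bonus + MAX_BONUS / 2;
    }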

This mechanism works well enough that few people complain about interactive response with current 2.6 kernels. Every now and then, however, somebody comes up with a workload which manages to fool the scheduler and bring the desktop to a halt. Mike Galbraith has been chasing down a few of these, producing patches in the process which should help to mitigate the problems.

The Linux scheduler maintains two "arrays" of run queues for each processor. When a process starts out, it is given a time slice and put onto the "active" array, where it can compete for the CPU. Once the time slice runs out, that process will move over to the "expired" array, where it languishes until all other runnable processes have used up their time slices. Once all processes are on the expired array, the two arrays are switched and the whole cycle begins again.
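In code terms, the arrangement looks roughly like this (a condensed sketch of 2.6's struct prio_array and struct runqueue; the per-priority task lists, bitmaps, and locking are omitted):

    struct prio_array {
            int nr_active;                  /* runnable tasks in this array */
            /* one task list per priority level, omitted here */
    };

    struct runqueue {
            struct prio_array *active;      /* tasks with time slice left */
            struct prio_array *expired;     /* tasks awaiting the switch */
            struct prio_array arrays[2];    /* storage for the two above */
    };

    /* Once the active array drains, the switch is just a pointer swap. */
    static void switch_arrays(struct runqueue *rq)
    {
            struct prio_array *tmp = rq->active;

            rq->active = rq->expired;
            rq->expired = tmp;
    }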

There is an exception, however, in the 2.6 kernel: a process which is deemed to be interactive (because it spends enough time in interruptible sleeps) will, on expiration of its time slice, be put back onto the active array. As a result, an interactive process should not have to wait while some long-running batch process cranks through its time slice. To keep this mechanism from blocking out expired processes altogether, however, the scheduler checks to see if the processes in the expired array have been waiting for too long. Once the starvation threshold has been exceeded, all processes go to the expired array at the end of their slices, allowing the scheduler to perform the array switch in the relatively near future.
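Continuing the sketch, the end-of-slice decision in the tick handler comes down to something like the fragment below; TASK_INTERACTIVE() and EXPIRED_STARVING() stand in for the kernel's macros of those names, their bodies omitted:

    /* Run from the timer tick while task p holds the CPU on runqueue rq. */
    if (!--p->time_slice) {
            p->time_slice = task_timeslice(p);      /* hand out a fresh slice */
            if (TASK_INTERACTIVE(p) && !EXPIRED_STARVING(rq))
                    enqueue_task(p, rq->active);    /* stay in the rotation */
            else
                    enqueue_task(p, rq->expired);   /* wait for the array switch */
    }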

Mike found that, on a system running a heavily-loaded Apache server, tasks could find themselves starved for long periods of time; it seems that the starvation-avoidance logic was not working right. The problem turned out to be in the wakeup code. That code was always putting freshly-awakened processes onto the active array, regardless of what was going on elsewhere in the system. With a large number of server processes being continually awakened as requests came in, the scheduler was never able to switch arrays. The fix was to put the starvation test into __activate_task(); as a result, when expired processes are starving, newly-woken processes are placed onto the expired array. That small change fixed much of the problem.
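In the terms of the sketches above, the shape of the fix is roughly this (the real change was made to __activate_task() in kernel/sched.c; expired_starving() and rt_task() are the kernel's own helpers, condensed here):

    static void __activate_task(struct task *p, struct runqueue *rq)
    {
            struct prio_array *target = rq->active;

            /* If expired tasks are already starving, a freshly-woken
             * task must not jump the queue: it goes to the expired
             * array instead (realtime tasks excepted). */
            if (expired_starving(rq) && !rt_task(p))
                    target = rq->expired;
            enqueue_task(p, target);
    }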

A fuller fix, however, involves the task throttling patch which Mike has been working on for some time. There are a number of fixes involved in this work, but the core observation is this: the "sleep average" code can be too generous to processes which sleep only part of the time. A process which manages regular, short sleeps can boost its priority significantly, to the point that it can force out other processes running on the system. And once a process obtains an interactivity bonus, it can keep it for some time. This behavior is all by design; some interactive programs can sit idle for a very long time, then perform some serious processing for a while. Think about the X server with that nice compositing window manager; it spends quite a bit of time idle, only to pin the CPU when the user starts dragging windows around. But this behavior can also give an interactive priority bonus to processes which are not truly interactive.
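Plugging numbers into the earlier sketch shows how easily a part-time sleeper passes for interactive (illustrative values only):

    /* A task that sleeps ~9ms for every 1ms it runs keeps sleep_avg
     * near its ceiling, so it holds almost the full bonus even though
     * it grabs the CPU every 10ms. */
    struct task spinner = { .static_prio = 120, .sleep_avg = 900 };
    /* bonus = 900 * 10 / 1000 = 9
     * effective_prio(&spinner) = 120 - 9 + 5 = 116: a near-maximal boost */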

The solution here involves a few changes. One of them is to simply be a bit less generous with the interactivity bonuses. But the core of the patch is a function called refresh_timeslice(). This function looks at the current sleep average and compares it to the amount of time that the process is actually spending on the CPU. Based on this comparison, a per-process throttle time is adjusted. If more CPU time is being used than the sleep average would suggest, the throttle time is moved backward (closer to the present); otherwise it moves forward. If a process runs into the throttle time, its sleep average starts to decay quickly, depriving it of its interactivity bonus.
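The patch itself is not quoted in the article, so the following is only a guess at the shape of refresh_timeslice(), reconstructed from the description above; the fields and helpers marked hypothetical are invented for this sketch:

    static void refresh_timeslice(struct task *p, unsigned long now)
    {
            /* CPU time the sleep average "entitles" p to, vs. reality */
            long expected = cpu_share_from_sleep_avg(p);    /* hypothetical */
            long used = p->recent_cpu_time;                 /* hypothetical field */

            if (used > expected)
                    p->throttle -= used - expected;  /* throttle moves closer */
            else
                    p->throttle += expected - used;  /* headroom grows */

            if (now >= p->throttle)
                    decay_sleep_avg_fast(p);  /* hypothetical: bonus fades quickly */
    }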

The throttle time provides a grace period which allows processes to use short bursts of CPU time without being penalized. The amount of grace time can be adjusted by way of a pair of knobs exported by the throttling code. "Grace 1" is the amount of time new processes get to establish their averages before being exposed to the throttling mechanism, while "grace 2" is how long a process can run above its expected CPU usage before the throttle kicks in. There have been some objections to the addition of these knobs; they look like another obscure set of kernel tunables that most administrators will not know how to set properly. So there has been a push for the knobs to be replaced with a simple on/off switch. Systems meant for interactive use would leave the throttling on, while server systems would simply turn it all off. Working this issue out may delay the acceptance of this patch, though there seems to be little disagreement with the rest of it.
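How the two grace periods might gate that logic, again guessed from the description (grace1 and grace2 correspond to the knobs named above; everything else is assumed):

    /* Skip throttling entirely while either grace period applies. */
    if (now - p->start_time < grace1)        /* new task: average still settling */
            return;
    if (p->time_over_expected < grace2)      /* brief burst: forgiven */
            return;
    decay_sleep_avg_fast(p);                 /* hypothetical, as above */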

from: http://lwn.net/Articles/176635/