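The kernel has supported NOHZ for a long time, but the feature has never been supported particularly well. The 3.10 release resolves this problem completely.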
This tree from Frederic Weisbecker adds a new, (exciting!) core kernel
feature to the timer and scheduler subsystems: 'full dynticks', or
CONFIG_NO_HZ_FULL=y.
This feature extends the nohz variable-size timer tick feature from idle
to busy CPUs (running at most one task) as well, potentially reducing the
number of timer interrupts significantly.
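One way to see the reduction in practice is to compare two snapshots of /proc/interrupts. The sketch below is an illustration rather than part of the pull request: it assumes the x86 layout, where local timer interrupts appear on the row labelled "LOC:", and the helper names are hypothetical.

```python
def local_timer_counts(interrupts_text):
    """Return per-CPU counts from the 'LOC:' row of /proc/interrupts text.

    Assumption: x86 layout, where local timer interrupts are on the LOC row.
    """
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "LOC:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # the trailing text description starts here
                counts.append(int(field))
            return counts
    raise ValueError("no LOC: row found (non-x86 system?)")


def tick_rates(before_text, after_text, seconds):
    """Per-CPU local timer interrupts per second between two snapshots."""
    before = local_timer_counts(before_text)
    after = local_timer_counts(after_text)
    return [(b2 - b1) / seconds for b1, b2 in zip(before, after)]
```

Feeding it two reads of /proc/interrupts taken a few seconds apart gives a rough per-CPU tick rate. On a CPU that runs a single busy task under full dynticks, the rate would be expected to sit near 1 per second rather than near HZ.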
This feature got motivated by real-time folks and the -rt tree, but the
general utility and motivation of full-dynticks runs wider than that:
- HPC workloads get faster: CPUs running a single task should be able to
utilize a maximum amount of CPU power. A periodic timer tick at HZ=1000
can cause a constant overhead of up to 1.0%. This feature removes that
overhead - and speeds up the system by 0.5%-1.0% on typical distro
configs even on modern systems.
- Real-time workload latency reduction: CPUs running critical tasks
should experience as little jitter as possible. The last remaining
source of kernel-related jitter was the periodic timer tick.
- A single task executing on a CPU is a pretty common situation,
especially with an increasing number of cores/CPUs, so this feature
helps desktop and mobile workloads as well.
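The overhead figures in the first bullet can be sanity-checked with simple arithmetic. In this sketch the per-tick costs of 5 and 10 microseconds are assumptions back-computed from the quoted 0.5%-1.0% range, not measurements, and indirect costs such as cache disturbance are not modelled.

```python
def tick_overhead_fraction(hz, cost_per_tick_us):
    """Fraction of one CPU's time spent servicing the periodic tick."""
    return hz * cost_per_tick_us / 1_000_000


# Hypothetical per-tick costs, chosen only to illustrate the quoted range.
for cost_us in (5.0, 10.0):
    share = tick_overhead_fraction(1000, cost_us)
    print(f"HZ=1000, {cost_us:.0f} us per tick -> {share:.1%} of one CPU")
```

Under the same assumptions, a 1 Hz tick costs a thousandth of that, which is why dropping from HZ=1000 to 1 Hz recovers nearly all of the overhead.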
The cost of the feature is mainly related to increased timer-reprogramming
overhead when a CPU switches its tick period, and thus slightly longer
to-idle and from-idle latency.
Configuration-wise a third mode of operation is added to the existing two
NOHZ kconfig modes:
- CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named as a
config option. This is the traditional Linux periodic tick design:
there's a HZ tick going on all the time, regardless of whether a CPU is
idle or not.
- CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
periodic tick when a CPU enters idle mode.
- CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the tick
when a CPU is idle, also slows the tick down to 1 Hz (one timer
interrupt per second) when only a single task is running on a CPU.
The .config behavior is compatible: existing !CONFIG_NO_HZ and
CONFIG_NO_HZ=y settings get translated to the new values, without the user
having to configure anything. CONFIG_NO_HZ_FULL is turned off by default.
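The three modes and the compatibility mapping described above can be sketched as a small classifier over .config text. This is an illustration of the described behaviour, not the kernel's Kconfig logic; the function names and the returned labels are hypothetical.

```python
def parse_config(config_text):
    """Collect the CONFIG_* symbols set to 'y' in kernel .config text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skips comments such as '# CONFIG_FOO is not set'
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    return enabled


def tick_mode(config_text):
    """Classify a .config as 'periodic', 'idle' or 'full' tick mode."""
    enabled = parse_config(config_text)
    if "CONFIG_NO_HZ_FULL" in enabled:
        return "full"
    # Legacy CONFIG_NO_HZ=y is treated as the new CONFIG_NO_HZ_IDLE.
    if "CONFIG_NO_HZ_IDLE" in enabled or "CONFIG_NO_HZ" in enabled:
        return "idle"
    # Legacy !CONFIG_NO_HZ and the new CONFIG_HZ_PERIODIC both land here.
    return "periodic"
```

For example, `tick_mode(open(".config").read())` would report which of the three modes a given build configuration selects, whether it uses the old or the new symbol names.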
This feature is based on a lot of infrastructure work that has been
steadily going upstream in the last 2-3 cycles: related RCU support and
non-periodic cputime support in particular is upstream already.
This tree adds the final pieces and activates the feature. The pull
request is marked RFC because:
- it's marked 64-bit only at the moment - the 32-bit support patch is
small but did not get ready in time.
- it has a number of fresh commits that came in after the merge window.
The overwhelming majority of commits are from before the merge window,
but still some aspects of the tree are fresh and so I marked it RFC.
- it's a pretty wide-reaching feature with lots of effects - and while
the components have been in testing for some time, the full combination
is still not very widely used. That it's default-off should limit its
potential for regressions, and there are no known regressions with
CONFIG_NO_HZ_FULL=y enabled either.
- the feature is not completely idempotent: there is no 100% equivalent
replacement for a periodic scheduler/timer tick. In particular there's
ongoing work to map out and reduce its effects on scheduler
load-balancing and statistics. This should not impact correctness
though, there are no known regressions related to this feature at this
point.
- it's a pretty ambitious feature that with time will likely be enabled
by most Linux distros, and we'd like you to give input on its
design/implementation, if you dislike some aspect we missed. Without
flaming us to a crisp!
Future plans:
- there's ongoing work to reduce 1Hz to 0Hz, to essentially shut
off the periodic tick altogether when there's a single busy task on a
CPU. We'd first like 1 Hz to be exposed more widely before we go for
the 0 Hz target though.
- once we reach 0 Hz we can remove the periodic tick assumption from
nr_running>=2 as well, by essentially interrupting busy tasks only as
frequently as the sched_latency constraints require us to do - once
every 4-40 msecs, depending on nr_running.
I am personally leaning towards biting the bullet and doing this in v3.10:
like the -rt tree, this effort has been going on for too long - but the
final word is up to you as usual.
More technical details can be found in Documentation/timers/NO_HZ.txt.
Thanks,
Ingo