I/O Schedulers

Simply sending out requests to the block devices in the order that the kernel issues them, as soon as it issues them, results in awful performance. One of the slowest operations in a modern computer is the disk seek. Each seek, positioning the hard disk's head at the location of a specific block, takes many milliseconds. Minimizing seeks is absolutely crucial to the system's performance.
Therefore,
the kernel does not issue block I/O requests to the disk in the order
they are received or as soon as they are received. Instead, it performs
operations called merging and sorting to greatly improve the performance of the system as a whole
[2]
. The subsystem of the kernel that performs these operations is called the I/O scheduler.
[2]
This point must be stressed. A system without these features, or
wherein these features are poorly implemented, would perform horribly
with even a modest number of block I/O operations.
The
I/O scheduler divides the resource of disk I/O among the pending block
I/O requests in the system. It does this through the merging and
sorting of pending requests in the request queue. The I/O scheduler is
not to be confused with the process scheduler (see
Chapter 4
,
"Process Scheduling"), which divides the resource of the processor
among the processes on the system. The two subsystems are similar in
nature but not the same. Both the process scheduler and the I/O
scheduler virtualize a resource among multiple objects. In the case of
the process scheduler, the processor is virtualized and shared among
the processes on the system. This provides the illusion of
virtualization inherent in a multitasking and timesharing operating
system, such as any Unix. On the other hand, the I/O scheduler
virtualizes block devices among multiple outstanding block requests.
This is done to minimize disk seeks and ensure optimum disk performance.
The Job of an I/O Scheduler

An I/O scheduler works by managing a block device's request queue. It decides the order of requests in the queue and at what time each request is dispatched to the block device. It manages the request queue with the goal of reducing seeks, which results in greater global throughput. The "global" modifier here is important. An I/O scheduler, very openly, is unfair to some requests in the interest of improving the overall performance of the system.
I/O
schedulers perform two primary actions to minimize seeks: merging and
sorting. Merging is the coalescing of two or more requests into one.
Consider an example request that is submitted to the queue by a
filesystem: say, to read a chunk of data from a file. (At this point, of
course, everything is occurring in terms of sectors and blocks and not
files, but presume that the requested blocks originate from a chunk of
a file.) If a request is already in the queue to read from an adjacent
sector on the disk (for example, an earlier chunk of the same file),
the two requests can be merged into a single request operating on one
or more adjacent on-disk sectors. By merging requests, the I/O
scheduler reduces the overhead of multiple requests down to a single
request. More importantly, only a single command needs to be issued to
the disk and servicing the multiple requests can be done without
seeking. Consequently, merging requests reduces overhead and minimizes
seeks.
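The adjacency check behind merging can be sketched with a toy model. The Request class and list-based queue below are illustrative stand-ins, not the kernel's actual data structures:

```python
class Request:
    """A toy block I/O request: a run of contiguous on-disk sectors."""
    def __init__(self, sector, count):
        self.sector = sector  # starting sector
        self.count = count    # number of contiguous sectors

def try_merge(queue, new):
    """Coalesce `new` with an adjacent pending request, if one exists."""
    for req in queue:
        if req.sector + req.count == new.sector:  # new follows req: back merge
            req.count += new.count
            return True
        if new.sector + new.count == req.sector:  # new precedes req: front merge
            req.sector = new.sector
            req.count += new.count
            return True
    return False  # nothing adjacent; the caller must queue it separately

queue = [Request(100, 8)]           # pending read of sectors 100..107
try_merge(queue, Request(108, 8))   # adjacent read: back merged
# queue still holds a single request, now covering sectors 100..115
```

A single merged request reaches the disk as one command, so the second chunk is serviced with no additional seek.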
Now, assume your fictional read request is
submitted to the request queue, but there is no read request to an
adjacent sector. You are therefore unable to merge this request with
any other request. Now, you could simply stick this request onto the
tail of the queue. But, what if there are other requests to a similar
location on the disk? Would it not make sense to insert this new
request into the queue at a spot near other requests operating on
physically near sectors? In fact, I/O schedulers do exactly this. The
entire request queue is kept sorted, sector-wise, so that all seeking
activity along the queue moves (as much as possible) sequentially over
the sectors of the hard disk. The goal is not just to minimize each
individual seek, but to minimize all seeking by keeping the disk head
moving in a straight line. This is similar to the algorithm employed in
elevators: elevators do not jump all over, wildly, from floor to floor.
Instead, they try to move gracefully in a single direction. When the
final floor is reached in one direction, the elevator can reverse
course and move in the other direction. Because of this similarity, I/O
schedulers (or sometimes just their sorting algorithm) are called elevators.
The Linus Elevator

Now let's look at some real-life I/O schedulers. The first I/O scheduler is called the Linus Elevator (yes, Linus has an elevator named after him!). It was the default I/O scheduler in 2.4. In 2.6, it was replaced by the I/O schedulers that we will look at next. However, because this elevator is simpler than the subsequent ones, while performing many of the same functions, it still deserves discussion.
The
Linus Elevator performs both merging and sorting. When a request is
added to the queue, it is first checked against every other pending
request to see whether it is a possible candidate for merging. The
Linus Elevator I/O scheduler performs both front and back merging.
The type of merging performed depends on the location of the existing adjacent request. If the new request immediately follows an existing request, it is back merged. Conversely, if the new request immediately precedes an existing request, it is front merged. Because of the way
files are laid out (usually by increasing sector number) and the I/O
operations performed in a typical workload (data is normally read from
start to finish and not in reverse), front merging is very rare
compared to back merging. Nonetheless, the Linus Elevator checks for
and performs both types of merge.
If
the merge attempt fails, a possible insertion point in the queue (a
location in the queue where the new request fits sector-wise between
the existing requests) is then sought. If one is found, the new request
is inserted there. If a suitable location is not found, the request is
added to the tail of the queue. Additionally, if an existing request is
found in the queue that is older than a predefined threshold, the new
request is added to the tail of the queue even if it can be insertion
sorted elsewhere. This prevents many requests to nearby on-disk
locations from indefinitely starving requests to other locations on the
disk. Unfortunately, this "age" check is not very efficient. It does
not provide any real attempt to service requests in a given timeframe; it
merely stops insertion-sorting requests after a suitable delay. This
improves latency but can still lead to request starvation, which was
the big must-fix of the 2.4 I/O scheduler.
In summary, when a request is added to the queue, four operations are possible. In order, they are:

1. First, if a request to an adjacent on-disk sector is in the queue, the existing request and the new request are merged into a single request.

2. Second, if a request in the queue is sufficiently old, the new request is inserted at the tail of the queue to prevent starvation of the other, older, requests.

3. Next, if there is a suitable location sector-wise in the queue, the new request is inserted there. This keeps the queue sorted by physical location on disk.

4. Finally, if no such suitable insertion point exists, the request is inserted at the tail of the queue.
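The four-step policy above can be sketched in Python. The Request class, the AGE_THRESHOLD value, and the time units are invented for illustration; the real implementation operates on the kernel's request structures:

```python
class Request:
    """Toy request: contiguous sectors plus the time it was queued."""
    def __init__(self, sector, count=8, queued_at=0):
        self.sector = sector
        self.count = count
        self.queued_at = queued_at

AGE_THRESHOLD = 100  # hypothetical "sufficiently old" cutoff, in ticks

def linus_add_request(queue, new, now):
    new.queued_at = now
    # 1. Merge with an adjacent pending request if possible.
    for req in queue:
        if req.sector + req.count == new.sector:        # back merge
            req.count += new.count
            return
        if new.sector + new.count == req.sector:        # front merge
            req.sector, req.count = new.sector, req.count + new.count
            return
    # 2. If any pending request is too old, go straight to the tail so
    #    further insertion sorting cannot push the old request back.
    if any(now - req.queued_at > AGE_THRESHOLD for req in queue):
        queue.append(new)
        return
    # 3. Insert at a sector-wise suitable spot, keeping the queue sorted.
    for i, req in enumerate(queue):
        if new.sector < req.sector:
            queue.insert(i, new)
            return
    # 4. No suitable spot found: add to the tail.
    queue.append(new)
```

Note how step 2 short-circuits step 3: once an old request is pending, new requests stop jumping ahead of it, which bounds (but does not eliminate) starvation.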
The Deadline I/O Scheduler

The
Deadline I/O scheduler sought to prevent the starvation caused by the
Linus Elevator. In the interest of minimizing seeks, heavy disk I/O
operations to one area of the disk can indefinitely starve request
operations to another part of the disk. Indeed, a stream of requests to
the same area of the disk can result in other far-off requests never
being serviced. This starvation is unfair.
Worse, the general issue of request starvation introduces a specific instance of the problem known as writes starving reads.
Write operations can usually be committed to disk whenever the kernel
gets around to them, entirely asynchronous with respect to the
submitting application. Read operations are quite different. Normally,
when an application submits a read request, the application blocks
until the request is fulfilled. That is, read requests occur
synchronously with respect to the submitting application. Although
system response is largely unaffected by write latency (the time
required to commit a write request), read latency (the time required to
commit a read request) is very important. Write latency has little
bearing on application performance
[3]
,
but an application must wait, twiddling its thumbs, for the completion
of each read request. Consequently, read latency is very important to
the performance of the system.
[3]
We still do not want to delay write requests indefinitely, however,
because the kernel wants to ensure that data is eventually written to
disk to prevent in-memory buffers from growing too large or too old.
Compounding
the problem, read requests tend to be dependent on each other. For
example, consider the reading of a large number of files. Each read
occurs in small buffered chunks. The application does not start reading
the next chunk (or the next file, for that matter) until the previous
chunk is read from disk and returned to the application. Worse, both
read and write operations require the reading of various metadata, such
as inodes. Reading these blocks off the disk further serializes I/O.
Consequently, if each read request is individually starved, the total
delay to such applications compounds and can grow enormous. Recognizing
that the asynchrony and interdependency of read requests results in a
much stronger bearing of read latency on the performance of the system,
the Deadline I/O scheduler implements several features to ensure that
request starvation in general, and read starvation in specific, is
minimized.
Note
that reducing request starvation comes at a cost to global throughput.
Even the Linus Elevator makes this compromise, albeit in a much milder
manner: the Linus Elevator could provide better overall throughput (via a
greater minimization of seeks) if it always
inserted requests into the queue sector-wise and never checked for old
requests and reverted to insertion at the tail of the queue. Although
minimizing seeks is very important, indefinite starvation is not good
either. The Deadline I/O scheduler, therefore, works harder to limit
starvation while still providing good global throughput. Make no
mistake: It is a tough act to provide request fairness, yet maximize
global throughput.
In
the Deadline I/O scheduler, each request is associated with an
expiration time. By default, the expiration time is 500 milliseconds in
the future for read requests and 5 seconds in the future for write
requests. The Deadline I/O scheduler operates similarly to the Linus
Elevator in that it maintains a request queue sorted by physical
location on disk. It calls this queue the sorted queue.
When a new request is submitted to the sorted queue, the Deadline I/O
scheduler performs merging and insertion like the Linus Elevator
[4]
.
The Deadline I/O scheduler also, however, inserts the request into a
second queue that depends on the type of request. Read requests are
sorted into a special read FIFO queue and write requests are inserted
into a special write FIFO queue. Although the normal queue is sorted by
on-disk sector, these queues are kept FIFO (effectively, they are
sorted by time). Consequently, new requests are always added to the
tail of the queue. Under normal operation, the Deadline I/O scheduler
pulls requests from the head of the sorted queue into the dispatch
queue. The dispatch queue is then fed to the disk drive. This results
in minimal seeks.
[4]
Performing front merging is optional in the Deadline I/O scheduler,
however. It is not always worth the trouble because many workloads have
very few requests that can be front merged.
If
the request at the head of either the write FIFO queue or the read FIFO
queue expires (that is, if the current time becomes greater than the
expiration time associated with the request), the Deadline I/O
scheduler then begins servicing requests from the FIFO queue. In this
manner, the Deadline I/O scheduler attempts to ensure that no request
is outstanding longer than its expiration time. See
Figure 13.3
.
Figure 13.3. The three queues of the Deadline I/O scheduler.
Note
that the Deadline I/O scheduler does not make any strict guarantees
over request latency. It is capable, however, of generally committing
requests on or before their expiration. This prevents request
starvation. Because read requests are given a substantially smaller
expiration value than write requests, the Deadline I/O scheduler also
works to ensure that write requests do not starve read requests. This
preference toward read requests provides minimized read latency.
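The interplay of the sorted queue and the two FIFO queues can be sketched as follows. The Request class and the millisecond bookkeeping are invented for illustration; this is a toy model of the policy, not the kernel's code:

```python
READ_EXPIRE = 500     # default read expiration, in milliseconds
WRITE_EXPIRE = 5000   # default write expiration, in milliseconds

class Request:
    def __init__(self, sector, rw):
        self.sector = sector
        self.rw = rw          # "r" or "w"
        self.deadline = None  # set when the request is queued

def deadline_add(sorted_queue, read_fifo, write_fifo, req, now):
    """Queue a request: sector-sorted insertion plus a FIFO tail insert."""
    req.deadline = now + (READ_EXPIRE if req.rw == "r" else WRITE_EXPIRE)
    i = 0
    while i < len(sorted_queue) and sorted_queue[i].sector < req.sector:
        i += 1
    sorted_queue.insert(i, req)                               # sorted queue
    (read_fifo if req.rw == "r" else write_fifo).append(req)  # FIFO queue

def deadline_dispatch(sorted_queue, read_fifo, write_fifo, now):
    """Pick the next request to feed to the dispatch queue."""
    if read_fifo and now >= read_fifo[0].deadline:      # expired read first
        req = read_fifo.pop(0)
    elif write_fifo and now >= write_fifo[0].deadline:  # then expired write
        req = write_fifo.pop(0)
    elif sorted_queue:                                  # normally: sorted head
        req = sorted_queue.pop(0)
    else:
        return None
    for q in (sorted_queue, read_fifo, write_fifo):     # leave every queue
        if req in q:
            q.remove(req)
    return req
```

Under normal load the sorted queue drives dispatch and seeks stay minimal; only when a FIFO head expires does the scheduler break sector order to bound latency.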
The Deadline I/O scheduler lives in drivers/block/deadline-iosched.c.
The Anticipatory I/O Scheduler

Although
the Deadline I/O scheduler does a great job minimizing read latency, it
does so at the expense of global throughput. Consider a system
undergoing heavy write activity. Every time a read request is
submitted, the I/O scheduler quickly rushes to handle the read request.
This results in the disk seeking over to where the read is, performing
the read operation, and then seeking back to continue the ongoing write
operation, repeating this little charade for each read request. The
preference toward read requests is a good thing, but the resulting pair
of seeks (one to the location of the read request and another back to
the ongoing write) is detrimental to global disk throughput. The
Anticipatory I/O scheduler aims to continue to provide excellent read
latency, but also provide excellent global throughput.
First,
the Anticipatory I/O scheduler starts with the Deadline I/O scheduler
as its base. Therefore, it is not entirely different. The Anticipatory
I/O scheduler implements three queues (plus the dispatch queue) and
expirations for each request, just like the Deadline I/O scheduler. The
major change is the addition of an anticipation heuristic.
The
Anticipatory I/O scheduler attempts to minimize the seek storm that
accompanies read requests issued during other disk I/O activity. When a
read request is issued, it is handled as usual, within its usual
expiration period. After the request is submitted, however, the
Anticipatory I/O scheduler does not immediately seek back and return to
handling other requests. Instead, it does absolutely nothing for a few
milliseconds (the actual value is configurable; by default it is six
milliseconds). In those few milliseconds, there is a good chance that
the application will submit another read request. Any requests issued
to an adjacent area of the disk are immediately handled. After the
waiting period elapses, the Anticipatory I/O scheduler seeks back to
where it left off and continues handling the previous requests.
It is important to note that the few milliseconds that are spent in anticipation
for more requests are well worth it if they minimize even a modest
percentage of the back-and-forth seeking that results from the
servicing of read requests during other heavy requests. If an adjacent
I/O request is issued within the waiting period, the I/O scheduler just
saved a pair of seeks. As more and more reads are issued to the same
area of disk, many more seeks are prevented.
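The gamble can be made concrete with a toy model of the anticipation window. All names here are invented for illustration, and "nearby" is crudely approximated as a fixed sector distance:

```python
ANTIC_EXPIRE = 6  # ms; the default anticipation window (configurable)

def anticipate(last_sector, incoming, near=64):
    """Toy model of the anticipation window.

    `incoming` lists (arrival_ms, sector) requests submitted after a read
    at `last_sector` completes. Returns ("hit", sector) when a nearby
    request arrives inside the window (a pair of seeks saved), otherwise
    ("timeout", None): the window was wasted and the scheduler seeks back.
    """
    for arrival, sector in sorted(incoming):
        if arrival > ANTIC_EXPIRE:
            break  # window already expired before this request arrived
        if abs(sector - last_sector) <= near:
            return ("hit", sector)
    return ("timeout", None)
```

A hit avoids two seeks at the cost of a few idle milliseconds; a timeout wastes the window outright, which is why the real scheduler gates the wait on per-process statistics.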
Of
course, if no activity occurs within the waiting period, the
Anticipatory I/O scheduler loses and a few milliseconds are wasted. The
key to reaping maximum benefit from the Anticipatory I/O scheduler is
correctly anticipating the actions of applications and filesystems.
This is done via a set of statistics and associated heuristics. The
Anticipatory I/O scheduler keeps track of per-process statistics
pertaining to block I/O habits in hopes of correctly anticipating the
actions of applications. With a sufficiently high percentage of correct
anticipations, the Anticipatory I/O scheduler can greatly reduce the
penalty of seeking to service read requests, while still providing the
attention to such requests that system response requires. This allows
the Anticipatory I/O scheduler to minimize read latency, while also
minimizing the number and duration of seeks. This results in low system
latency and high system throughput.
The Anticipatory I/O scheduler lives in the file drivers/block/as-iosched.c
in the kernel source tree. It is the default I/O scheduler in the Linux
kernel. It performs well across most workloads. It is ideal for
servers, although it performs very poorly on certain uncommon but
critical workloads involving seek-happy databases.
The Complete Fair Queuing I/O Scheduler

The
Complete Fair Queuing (CFQ) I/O scheduler is an I/O scheduler designed
for specialized workloads, but that in practice actually provides good
performance across multiple workloads. It is fundamentally different
from the previous I/O schedulers that have been covered, however.
The
CFQ I/O scheduler assigns incoming I/O requests to specific queues
based on the process originating the I/O request. For example, I/O
requests from process foo go in foo's queue, and I/O requests from process bar go in bar's queue. Within each queue, requests are coalesced with adjacent requests and insertion sorted. The queues are thus kept sorted sector-wise, as with the other I/O schedulers' queues.
The difference with the CFQ I/O scheduler is that there is one queue
for each process submitting I/O.
The CFQ I/O
scheduler then services the queues round robin, plucking a configurable
number of requests (by default, four) from each queue before continuing
on to the next. This provides fairness at a per-process level, assuring
that each process receives a fair slice of the disk's bandwidth. The
intended workload is multimedia, in which such a fair algorithm can
guarantee that, for example, an audio player is able to always refill
its audio buffers from disk in time. In practice, however, the CFQ I/O
scheduler performs well in many scenarios.
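The round-robin service order can be sketched as follows; the per-process queues are plain lists here, a toy model rather than CFQ's actual structures:

```python
QUANTUM = 4  # default number of requests pulled per queue per pass

def cfq_dispatch(process_queues):
    """Toy round-robin service of per-process request queues.

    `process_queues` maps a process name to its pending request list
    (assumed already merged and sector-sorted). Yields (process, request)
    pairs in service order.
    """
    while any(process_queues.values()):
        for proc, queue in process_queues.items():
            for _ in range(min(QUANTUM, len(queue))):
                yield proc, queue.pop(0)

queues = {"foo": [100, 108, 116, 124, 132, 140], "bar": [900, 908]}
order = [proc for proc, _ in cfq_dispatch(queues)]
# foo gets four requests, then bar gets its two, then foo's remaining two
```

Each process is guaranteed service every pass, so a heavy writer cannot monopolize the disk at another process's expense.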
The Complete Fair Queuing I/O scheduler lives in drivers/block/cfq-iosched.c.
It is recommended for desktop workloads, although it performs
reasonably well in nearly all workloads without any pathological corner
cases.
The Noop I/O Scheduler

A fourth and final I/O scheduler is the Noop I/O scheduler, so named because it is basically a noop: it does not do much. The Noop I/O
scheduler does not perform sorting or any other form of seek-prevention
whatsoever. In turn, it has no need to implement anything akin to the
slick algorithms to minimize request latency that you saw in the
previous three I/O schedulers.
The
Noop I/O scheduler does perform merging, however, as its lone chore.
When a new request is submitted to the queue, it is coalesced with any
adjacent requests. Other than this operation, the Noop I/O Scheduler
truly is a noop, merely maintaining the request queue in near-FIFO
order, from which the block device driver can pluck requests.
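The entire policy fits in a few lines; requests are modeled as (sector, count) pairs here purely for illustration:

```python
def noop_add_request(queue, new):
    """Noop-style insertion: merge with an adjacent request if possible,
    otherwise just append. No sorting, no seek prevention; a toy model,
    not the kernel's implementation."""
    for i, (sector, count) in enumerate(queue):
        if sector + count == new[0]:       # back merge
            queue[i] = (sector, count + new[1])
            return
        if new[0] + new[1] == sector:      # front merge
            queue[i] = (new[0], count + new[1])
            return
    queue.append(new)                      # near-FIFO order otherwise

q = []
noop_add_request(q, (0, 8))
noop_add_request(q, (8, 8))     # adjacent: merged into (0, 16)
noop_add_request(q, (1000, 8))  # far away: simply appended
noop_add_request(q, (500, 8))   # stays behind 1000: FIFO, not sector order
```

Merging still pays off even on seekless devices, because it reduces the number of commands issued, while sorting would be pure overhead.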
The
Noop I/O scheduler is not the lazy and worthless I/O scheduler of the
bunch; its lack of hard work is with reason. It is intended for block
devices that are truly random-access, such as flash memory cards. If a
block device has little or no overhead associated with "seeking," then
there is no need for insertion sorting of incoming requests, and the
Noop I/O scheduler is the ideal candidate.
The Noop I/O scheduler lives in drivers/block/noop-iosched.c. It is intended only for random-access devices.
I/O Scheduler Selection

You have now seen four different I/O schedulers in the 2.6 kernel. Each of these I/O schedulers can be enabled and built into the kernel. By default, block devices use the Anticipatory I/O scheduler. This can be overridden via the boot-time option elevator=foo on the kernel command line, where foo is a valid and enabled I/O scheduler. See Table 13.2.
Table 13.2. Parameters Given to the elevator Option

Parameter    I/O Scheduler
as           Anticipatory
cfq          Complete Fair Queuing
deadline     Deadline
noop         Noop
For example, the kernel command line option elevator=cfq would enable use of the Complete Fair Queuing I/O scheduler for all block devices.
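In practice the option is appended to the kernel line of the boot loader configuration. The entry below is a hypothetical GRUB menu.lst fragment; the kernel image path and root device are placeholders, not values from this text:

```shell
# Hypothetical GRUB (menu.lst) entry; paths and root device are examples.
title   Linux 2.6
kernel  /boot/vmlinuz ro root=/dev/hda1 elevator=deadline
```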