Linux: The Journaling Block Device
June 21, 2006 - 2:40am
Submitted by
Kedar Sovani
Atomicity is a property of an operation either to succeed or fail
completely. Disks assure atomicity at the sector level. This means that
a write to a sector either goes through completely or not at all. But
when an operation spans over multiple sectors of the disk, a
higher-level mechanism is needed. This mechanism should ensure that
modifications to the entire set of sectors are handled atomically.
Failure to do so leads to inconsistencies. This document talks about
the implementation of the Journaling Block Device in Linux.
Let's look at how these inconsistencies could be introduced to a
filesystem. Say we have an application that creates a file. The
filesystem internally has to decrease the number of free inodes by one,
initialize the inode on the disk and add an entry to the parent
directory for the newly created file. But what happens if the machine
crashes after only the first operation is executed? In this
circumstance, an inconsistency has been introduced in the filesystem.
The number of free inodes has decreased, but no initialization of the
inode has been performed on the disk.
The only way to detect these inconsistencies is by scanning the
entire filesystem. This task is called fsck, filesystem consistency
check. In large installations, the consistency check requires a
significant amount of time (many hours) to check and fix
inconsistencies. As you might have guessed, such downtime is not
desirable. A better approach to solve this problem is to avoid
introducing inconsistencies in the first place, and this could be
accomplished by providing atomicity to operations. Journaling is one
way to provide that atomicity.
Simply stated, using journaling is like using a scratch pad. You
perform operations on the scratch pad, and once you are satisfied that
the operations are correct, you copy them over to a fair copy.
In the case of filesystems, all the metadata and data are stored on
the block device for the filesystem. Journaling filesystems use a
journal, or log area, as the scratch pad. A journal may be a part of
the same block device or it may be a separate device in itself. A
journaling filesystem first records all the operations it has performed
in the journal. Only once the set of operations that makes up one
single atomic operation has completed and been recorded in the journal
are the changes written to the actual block device. Henceforth, the
term disk is used to indicate the actual block device, whereas the term
journal is used for the log area.
Journal Recovery Scenarios
The example operation from above requires that three blocks be
modified—the inode count block, the block containing the on-disk inode
and the block holding the directory where the entry is to be added. All
of these blocks first are written to the journal. After that, a special
block, called the commit record, is written to the journal. The commit
record is used to indicate that all the blocks belonging to a single
atomic operation are written to the journal.
Given journaling behavior, then, here is how a journaling filesystem reacts in the following three basic scenarios:
The machine crashes after only the first block is flushed to the
journal. In this case, when the machine comes back up again and checks
the journal, it finds an operation with no commit record at the end.
This indicates that it may not be a completed operation. Hence, no
modifications are done to the disk, preserving the consistency.
The machine crashes after the commit record is flushed to the
journal. In this case, when the machine comes back up again and checks
the journal, it finds an operation with the commit record at the end.
The commit record indicates that this is a completed operation and
could be written to the disk. All the blocks belonging to this
operation are written at their actual locations on the disk, replaying
the journal.
The machine crashes after all three blocks are flushed to the
journal but the commit record is not yet flushed to the journal. Even
in this case, because of the absence of the commit record, no
modifications are done to the disk. The scenario thus is reduced to the
scenario described in the first case.
Likewise, any other crash scenario can be reduced to one of the scenarios listed above.
Thus, journaling guarantees consistency for the filesystem. The time
required for looking up the journal and replaying the journal is
minimal as compared to that taken by the filesystem consistency check.
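The three scenarios above can be sketched in a few lines. This is only a toy model, not JBD code: the journal is a Python list, the disk is a dictionary, and the record shapes `("block", location, contents)` and `("commit",)` as well as the function name `replay` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical record shapes for this sketch:
#   ("block", disk_location, contents)  -- a logged copy of a modified block
#   ("commit",)                         -- the commit record
Record = Tuple[object, ...]


def replay(journal: List[Record], disk: Dict[int, str]) -> Dict[int, str]:
    """Return a copy of `disk` with every fully committed operation applied."""
    result = dict(disk)
    pending = []  # blocks seen since the last commit record
    for record in journal:
        if record[0] == "block":
            _, location, contents = record
            pending.append((location, contents))
        elif record[0] == "commit":
            # The commit record proves the whole operation reached the journal,
            # so its blocks may be written to their real locations.
            for location, contents in pending:
                result[location] = contents
            pending = []
    # Blocks still pending here have no commit record and are ignored.
    return result


if __name__ == "__main__":
    disk = {1: "free_inodes=10", 2: "inode: empty", 3: "dir: []"}
    create_file = [
        ("block", 1, "free_inodes=9"),
        ("block", 2, "inode: file"),
        ("block", 3, "dir: [file]"),
    ]
    print(replay(create_file[:1], disk))            # crash after first block
    print(replay(create_file + [("commit",)], disk))  # crash after commit record
    print(replay(create_file, disk))                # crash before commit record
```

In the first and third calls the disk comes back unchanged, and in the second all three blocks are applied together, which is the all-or-nothing behavior the scenarios describe.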
Journaling Block Device
The Linux Journaling Block Device (JBD) provides this scratch pad
for providing atomicity in operations. Thus, a filesystem controlling a
block device can make use of JBD on the same or on another block device
in order to maintain consistency. The JBD is a modular implementation
that exposes a set of APIs for use by such filesystems. The following
sections describe the concepts and implementation of the Linux JBD as
present in the Linux 2.6 kernel.
Before we move on to the implementation details of the JBD, an
understanding of some of the objects that JBD uses is required. A
journal is a log that internally manages updates for a single block
device. As mentioned above, the updates first are stored in the journal
and then are reflected to their real locations on the disk. The area
belonging to the journal is managed like a circular linked list. That
is, once the journal reaches the end of its area, it wraps around and
reuses space from the beginning.
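As a rough illustration of a log area that wraps around, here is a minimal sketch. The class name `CircularLog`, its fields and its methods are hypothetical and do not correspond to JBD structures. The sketch also assumes that space has to be released explicitly before it can be reused.

```python
class CircularLog:
    """Toy fixed-size log whose write position wraps around (illustrative only)."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.slots = [None] * size
        self.head = 0  # next slot to write
        self.tail = 0  # oldest slot still in use
        self.used = 0

    def append(self, block: str) -> int:
        """Store a block and return the slot index it landed in."""
        if self.used == self.size:
            raise RuntimeError("journal full: space must be reclaimed first")
        index = self.head
        self.slots[index] = block
        self.head = (self.head + 1) % self.size  # wrap around at the end
        self.used += 1
        return index

    def release_oldest(self, count: int) -> None:
        """Mark the `count` oldest slots as reusable."""
        if count > self.used:
            raise ValueError("cannot release more slots than are in use")
        self.tail = (self.tail + count) % self.size
        self.used -= count
```

With a three-slot log, three appends fill slots 0, 1 and 2. After the two oldest slots are released, the next append lands in slot 0 again.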
A handle represents a single atomic update. The entire set of
changes/writes that should be performed atomically is carried out with
reference to a single handle.
It may not be an efficient approach to flush each atomic update
(handle) to the journal, however. To achieve better performance, the
JBD bunches a set of handles together into a transaction and flushes
this transaction to the journal. The JBD ensures that the transaction
is atomic in nature. Hence, the handles, which are the subcomponents of
the transaction, also are guaranteed to be atomic.
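The relationship between handles and transactions can be sketched as follows. All class and method names here (`Journal`, `Transaction`, `Handle`, `start_handle`, `write`, `stop`) are hypothetical and chosen for the illustration. The real JBD API and data structures differ.

```python
from typing import Dict, Optional


class Transaction:
    """A batch of handles that is flushed to the journal as one unit."""

    def __init__(self, tid: int) -> None:
        self.tid = tid
        self.open_handles = 0
        self.blocks: Dict[int, str] = {}  # disk location -> new contents


class Handle:
    """One atomic update, carried out inside a transaction."""

    def __init__(self, transaction: Transaction) -> None:
        self.transaction = transaction
        transaction.open_handles += 1

    def write(self, location: int, contents: str) -> None:
        self.transaction.blocks[location] = contents

    def stop(self) -> None:
        self.transaction.open_handles -= 1


class Journal:
    def __init__(self) -> None:
        self.running: Optional[Transaction] = None
        self.next_tid = 1

    def start_handle(self) -> Handle:
        # New handles join the running transaction; one is created on demand.
        if self.running is None:
            self.running = Transaction(self.next_tid)
            self.next_tid += 1
        return Handle(self.running)
```

Two handles started back to back end up in the same transaction, so their block updates are written to the journal together rather than one handle at a time.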
The most important property of a transaction is its state. When a
transaction is being committed, it follows the lifecycle of states
listed below.
Running: the transaction currently is live and can accept new
handles. For a given journal, only one transaction can be in the
running state.
Locked: the transaction does not accept any new handles but existing
handles are not complete. Once all the existing handles are completed,
the transaction goes to the next state.
Flush: all the handles in a transaction are complete. The transaction is writing itself to the journal.
Commit: the entire transaction log has been written to the journal.
The transaction is writing a commit block indicating that the
transaction log in the journal is complete.
Finished: the transaction is written completely to the journal. It
has to remain there until the blocks are updated to the actual
locations on the disk.
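The lifecycle above is a strictly forward progression, which a small sketch can make explicit. The member names reuse the T_ prefixed state names that appear in the commit description. The `TState` enum, its numeric values and the helper functions are assumptions of this sketch, not kernel definitions.

```python
from enum import Enum


class TState(Enum):
    # Numeric values are arbitrary; only their order matters in this sketch.
    T_RUNNING = 0
    T_LOCKED = 1
    T_FLUSH = 2
    T_COMMIT = 3
    T_FINISHED = 4


def next_state(state: TState) -> TState:
    """Advance a transaction to the state that follows `state`."""
    if state is TState.T_FINISHED:
        raise ValueError("a finished transaction has no further state")
    return TState(state.value + 1)


def accepts_new_handles(state: TState) -> bool:
    """Only a running transaction can take on new handles."""
    return state is TState.T_RUNNING


if __name__ == "__main__":
    state = TState.T_RUNNING
    while state is not TState.T_FINISHED:
        print(state.name, "accepts new handles:", accepts_new_handles(state))
        state = next_state(state)
    print(state.name)
```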
Transaction Committing and CheckPointing
A running transaction is written to the journal area after a certain
period. Thus, a transaction can be either in-memory (running) or
on-disk. Flushing a transaction to the journal and marking that
particular transaction as finished is a process called transaction
commit.
The journal has a limited area under its control, and it needs to
reuse this area. Committed transactions whose blocks have all been
written to the disk no longer need to be kept in the journal.
Checkpointing, then, is the process of flushing the finished
transactions to the disk and reclaiming the corresponding space in the
journal. It is discussed in more detail later in this article.
Implementation Briefs
The JBD layer performs journaling of the metadata, during which the
data simply is written to the disk without being journaled. But this
does not stop applications from journaling the data, as it could be
presented to the JBD as metadata itself. This document takes the Linux
kernel version 2.6.0 as a reference.
[Figure 1: logical layout of a journal]
Commit
[journal_commit_transaction(journal object)]
A kjournald thread is associated with every journaled device. The
kjournald thread ensures that the running transaction is committed
after a specific interval. The transaction commit code is divided into
nine phases, numbered 0 through 8 and described below. Figure 1 shows
a logical layout of a journal.
Phase 0: moves the transaction from running state (T_RUNNING) to
locked state (T_LOCKED), meaning the transaction can no longer accept
new handles. The transaction waits until all the existing handles have
completed. A transaction always has a set of buffers reserved when it
is initiated. Some of these buffers may be unused and are unfiled in
this phase. The transaction now is ready to be committed with no
outstanding handles.
Phase 1: the transaction enters into the flush state (T_FLUSH). The
transaction is marked as a currently committing transaction for the
journal. This phase also marks that no running transaction exists for
the journal; therefore, new requests for handles initiate a new
transaction.
Phase 2: the actual buffers of the transaction are flushed to the
disk. Data buffers go first. There are no complications here, as data
buffers are not saved in the log area. Instead, they are flushed
directly to their actual positions on the disk. This phase ends when
the I/O completion notifications for all such buffers are received.
Phase 3: all the data buffers are written to the disk but their
metadata still is in volatile memory. Metadata flushing is not as
straightforward as data buffer flushing, because metadata needs to be
written to the log area and the actual positions on the disk need to be
remembered. This phase starts with flushing these metadata buffers, for
which a journal descriptor block is acquired. The journal descriptor
block stores the mapping of each metadata buffer in the journal to its
actual location on the disk in the form of tags. After this, metadata
buffers are flushed to the journal. Once the journal descriptor is full
of tags or all metadata buffers are flushed to the journal, the journal
descriptor also is flushed to the journal. Now we have all the metadata
buffers in the journal, and their actual positions on the disk are
remembered. This data, being persistent, can be used for recovery if
failure occurs.
Phase 4 and Phase 5: both phase 4 and phase 5 wait on I/O completion notifications
of metadata buffers and journal descriptor blocks, respectively. The
buffers are unfiled from in-memory lists once I/O completion is
received.
Phase 6: all the data and metadata are on safe storage, data at its
actual locations and metadata in the journal. Now the transaction needs
to be marked as committed so that it is known that all its updates are
safe in the journal. For this reason, a journal descriptor block again
is allocated. A tag is written stating that the transaction has
committed successfully, and the block is synchronously written to its
position in the journal. After this, the transaction is moved to the
committed state, T_COMMIT.
Phase 7: occurs when a number of transactions are present in the
journal, without yet being flushed to the disk. Some of the metadata
buffers in this transaction already may be a part of some previous
transaction. These need not be kept in the older transactions as we
have their latest copy in the current committed transaction. Such
buffers are removed from older transactions.
Phase 8: the transaction is marked as being in the finished state,
T_FINISHED. The journal structure is updated to reflect this particular
transaction as the latest committed transaction. It also is added to
the list of transactions to be checkpointed.
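What phases 3 through 6 leave behind in the journal can be approximated with a short sketch. The block tuples, the function name `log_metadata` and the `tags_per_descriptor` parameter are all hypothetical. Placing each descriptor block ahead of the metadata blocks it describes is also an assumption of this sketch about layout, not something stated above.

```python
from typing import List, Tuple

JournalBlock = Tuple[object, ...]


def log_metadata(
    tid: int,
    metadata: List[Tuple[int, str]],
    tags_per_descriptor: int,
) -> List[JournalBlock]:
    """Lay out one transaction's metadata in the journal.

    `metadata` is a list of (disk_location, contents) pairs. Each descriptor
    block carries the disk locations ("tags") of the metadata blocks that
    follow it, so recovery knows where each logged block belongs.
    """
    if tags_per_descriptor < 1:
        raise ValueError("a descriptor must hold at least one tag")
    journal: List[JournalBlock] = []
    for start in range(0, len(metadata), tags_per_descriptor):
        chunk = metadata[start:start + tags_per_descriptor]
        tags = [location for location, _ in chunk]
        journal.append(("descriptor", tid, tags))
        for _, contents in chunk:
            journal.append(("metadata", contents))
    # The commit block is written last, once everything else is in the journal.
    journal.append(("commit", tid))
    return journal


if __name__ == "__main__":
    for block in log_metadata(7, [(10, "a"), (20, "b"), (30, "c")], 2):
        print(block)
```

When a descriptor runs out of room for tags, a new one is started, which mirrors the "descriptor is full of tags" case in phase 3.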
Checkpointing
Checkpointing is initiated when the journal is being flushed to the
disk (at unmount, for example) or when a new handle is started. A new
handle can fall short of its guaranteed number of buffers, so it may be
necessary to carry out a checkpointing process in order to free some
space in the journal.
The checkpointing process flushes the metadata buffers of a
transaction that have not yet been written to their actual locations on
the disk. The transaction then is removed from the journal. The journal
can have multiple checkpointing transactions, and each checkpointing
transaction can have multiple buffers. The process considers each
checkpointing transaction, and for each transaction, it finds the
metadata buffers that need to be flushed to the disk. All these buffers
are flushed in one batch. Once all the transactions are checkpointed,
their log is removed from the journal.
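A minimal sketch of that loop follows, assuming the finished transactions are kept oldest first in a plain list and the disk is a dictionary. The function name `checkpoint` and the data shapes are hypothetical.

```python
from typing import Dict, List, Tuple

# (transaction id, {disk location: contents}) for a finished transaction
# whose metadata has not yet been written to its real location.
CheckpointEntry = Tuple[int, Dict[int, str]]


def checkpoint(checkpoint_list: List[CheckpointEntry], disk: Dict[int, str]) -> List[int]:
    """Flush every listed transaction to `disk` and drop it from the list.

    Returns the ids of the transactions whose journal space can be reclaimed.
    Processing oldest first means a newer copy of a block overwrites an
    older one.
    """
    reclaimed = []
    while checkpoint_list:
        tid, buffers = checkpoint_list.pop(0)
        disk.update(buffers)  # all buffers of one transaction in one batch
        reclaimed.append(tid)
    return reclaimed


if __name__ == "__main__":
    disk = {1: "old"}
    waiting = [(1, {1: "v1", 2: "x"}), (2, {1: "v2"})]
    print(checkpoint(waiting, disk), disk, waiting)
```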
[Figure 2: sample physical layout of a journal]
Recovery
[journal_recover(journal object)]
When the system comes up after a crash and sees that the log entries
are not null, this indicates that the last unmount was not successful
or never occurred. At this point, a recovery needs to be attempted.
Figure 2 depicts a sample physical layout of a journal. The recovery
takes place in three passes.
PASS_SCAN: the end of the log is found.
PASS_REVOKE: a list of revoked blocks is prepared from the log.
PASS_REPLAY: unrevoked blocks are rewritten (replayed) in order to guarantee the consistency of the disk.
For recovery, the available information is provided in terms of the
journal. But the exact state of the journal is unknown, as we do not
know the point at which the system crashed. Hence, the last transaction
could be in the checkpointing or committing state. A running
transaction cannot be found, as it was only in the memory.
For committing transactions, we have to forget the updates made, as
not all of the updates may be in place. So in the PASS_SCAN phase, the
last log entry in the log is found. From here, the recovery process
knows which transactions need to be replayed.
Every transaction can have a set of revoked blocks. This is
important to know in order to prevent older journal records from being
replayed on top of newer data using the same block. In PASS_REVOKE, a
hash table of all these revoked blocks is prepared. This table is used
every time we need to find out whether a particular block should get
written to the disk through a replay.
In the last phase, all the blocks that need to be replayed are
considered. Each block is tested for its presence in the revoked
blocks' hash table. If the block is not in there, it is safe to write
the block to its actual location on the disk. If the block is there,
only the newest version of the block is written to the disk. Notice
that we have not changed anything in the on-disk journal. Hence, even
if the system crashes again while the recovery is in progress, no harm
is done. The same journal is present for the next recovery attempt, and
no non-idempotent operation is performed during the process of
recovery.
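The three passes can be sketched on a toy in-memory log. Everything here is an assumption made for illustration: the dictionary layout of a logged transaction, the function name `recover`, and in particular the replay rule, which skips a logged copy of a block unless its transaction is newer than the latest revoke of that block. That rule is this sketch's reading of "only the newest version" above.

```python
from typing import Dict, List


def recover(log: List[dict], disk: Dict[int, str]) -> Dict[int, str]:
    """Return the disk contents after replaying `log`; inputs are not modified.

    Each log entry is a dict with keys:
      "tid"       -- transaction id, increasing along the log
      "blocks"    -- list of (disk_location, contents) pairs
      "revoked"   -- list of disk locations revoked by this transaction
      "committed" -- True if the commit block made it to the journal
    """
    # PASS_SCAN: the log ends at the first transaction without a commit block.
    end = 0
    for txn in log:
        if not txn["committed"]:
            break
        end += 1
    valid = log[:end]

    # PASS_REVOKE: remember, per block, the newest transaction that revoked it.
    revoked: Dict[int, int] = {}
    for txn in valid:
        for location in txn["revoked"]:
            revoked[location] = max(revoked.get(location, txn["tid"]), txn["tid"])

    # PASS_REPLAY: write back every logged block that is not covered by a revoke.
    result = dict(disk)
    for txn in valid:
        for location, contents in txn["blocks"]:
            if location in revoked and txn["tid"] <= revoked[location]:
                continue  # an older copy of a revoked block must not be replayed
            result[location] = contents
    return result
```

Because the function only reads the log and returns a new dictionary, running it a second time on its own output gives the same result, which matches the point that recovery performs no non-idempotent operation.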
Amey Inamdar (www.geocities.com/amey_inamdar) is a kernel developer working at Kernel Corporation. His interest areas include filesystems and distributed systems.
Kedar Sovani (www.geocities.com/kedarsovani) works for Kernel Corporation as a kernel developer. His areas of interest include filesystems and storage technologies.
This article comes from the ChinaUnix blog. To view the original, see: http://blog.chinaunix.net/u1/53103/showart_1083354.html