Posted on 2008-07-16 13:50
Linux: The Journaling Block Device
Submitted by Kedar Sovani on June 21, 2006 - 2:40am.

Atomicity is the property of an operation that it either succeeds
completely or fails completely. Disks assure atomicity at the sector
level: a write to a sector either goes through completely or not at
all. But when an operation spans multiple sectors of the disk, a
higher-level mechanism is needed. This mechanism should ensure that
modifications to the entire set of sectors are handled atomically.
Failure to do so leads to inconsistencies. This document talks about
the implementation of the Journaling Block Device in Linux.
Let's look at how these inconsistencies could be introduced to a
filesystem. Say we have an application that creates a file. The
filesystem internally has to decrease the number of free inodes by one,
initialize the inode on the disk and add an entry to the parent
directory for the newly created file. But what happens if the machine
crashes after only the first operation is executed? In this
circumstance, an inconsistency has been introduced in the filesystem.
The number of free inodes has decreased, but no initialization of the
inode has been performed on the disk.
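The crash window can be made concrete with a toy model. The sketch below is only an illustration: the `ToyFS` class, its fields and the `crash_after` parameter are hypothetical names invented here and do not correspond to any real filesystem structure.

```python
class Crash(Exception):
    """Simulated power failure."""


class ToyFS:
    """Toy on-disk state: a free-inode counter, an inode table, a directory."""

    def __init__(self, total_inodes=4):
        self.total_inodes = total_inodes
        self.free_inodes = total_inodes
        self.inodes = {}      # inode number -> initialized inode
        self.directory = {}   # file name -> inode number

    def create(self, name, crash_after=None):
        """Create a file in three separate, non-atomic steps."""
        ino = self.total_inodes - self.free_inodes
        steps = [
            lambda: setattr(self, "free_inodes", self.free_inodes - 1),
            lambda: self.inodes.__setitem__(ino, {"links": 1}),
            lambda: self.directory.__setitem__(name, ino),
        ]
        for done, step in enumerate(steps, start=1):
            step()
            if crash_after == done:
                raise Crash(f"crashed after step {done}")

    def is_consistent(self):
        """What a full scan would verify in this toy model."""
        used = self.total_inodes - self.free_inodes
        return used == len(self.inodes) == len(self.directory)


fs = ToyFS()
fs.create("a.txt")
assert fs.is_consistent()

try:
    fs.create("b.txt", crash_after=1)
except Crash:
    pass
print(fs.is_consistent())  # False: the counter dropped, but no inode was initialized
```

Nothing in the toy state records that an update was in flight, which is why only a full scan can notice the mismatch.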
The only way to detect these inconsistencies is by scanning the
entire filesystem. This task is performed by fsck, the filesystem
consistency check. In large installations, the consistency check can
require a significant amount of time (many hours) to find and fix
inconsistencies. As you might have guessed, such downtime is not
desirable. A better approach is to avoid introducing inconsistencies
in the first place, and this can be accomplished by making operations
atomic. Journaling is one way to provide that atomicity.
Simply stated, using journaling is like using a scratch pad. You
perform operations on the scratch pad, and once you are satisfied that
the operations are correct, you transfer them to the fair copy.
In the case of filesystems, all the metadata and data are stored on
the block device for the filesystem. Journaling filesystems use a
journal, or log area, as the scratch pad. A journal may be a part of
the same block device or it may be a separate device in itself. A
journaling filesystem first records all the operations it performs
in the journal. Only once the set of operations that makes up one
single atomic operation has been completely recorded in the journal is
it written to the actual block device. Henceforth, the term
disk is used to indicate the actual block device, whereas the term
journal is used for the log area.
Journal Recovery Scenarios
The example operation from above requires that three blocks be
modified: the inode count block, the block containing the on-disk inode
and the block holding the directory where the entry is to be added. All
of these blocks first are written to the journal. After that, a special
block, called the commit record, is written to the journal. The commit
record indicates that all the blocks belonging to a single
atomic operation have been written to the journal.
Given this behavior, here is how a journaling filesystem reacts in the
following three basic scenarios:

  • The machine crashes after only the first block is flushed to the
    journal. In this case, when the machine comes back up again and checks
    the journal, it finds an operation with no commit record at the end.
    This indicates that it may not be a completed operation. Hence, no
    modifications are done to the disk, preserving the consistency.

  • The machine crashes after the commit record is flushed to the
    journal. In this case, when the machine comes back up again and checks
    the journal, it finds an operation with the commit record at the end.
    The commit record indicates that this is a completed operation and
    could be written to the disk. All the blocks belonging to this
    operation are written at their actual locations on the disk, replaying
    the journal.

  • The machine crashes after all three blocks are flushed to the
    journal but before the commit record is flushed to the journal. Even
    in this case, because of the absence of the commit record, no
    modifications are done to the disk. The scenario thus reduces to the
    one described in the first case.

    Likewise, any other crash scenario can be reduced to one of the
    scenarios listed above.
    Thus, journaling guarantees consistency for the filesystem. The time
    required to read and replay the journal is minimal compared to that
    taken by a full filesystem consistency check.
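The three scenarios can be checked against a small model. This is a minimal sketch, assuming a journal represented as a Python list of tuples; the record format and the `recover` function are invented for illustration and are not the JBD on-disk format.

```python
COMMIT = "commit"


def recover(journal, disk):
    """Replay only operations that are followed by a commit record.

    `journal` is a list of records: ("block", block_no, data) or (COMMIT,).
    `disk` maps block numbers to contents and is updated in place.
    """
    pending = []
    for record in journal:
        if record[0] == COMMIT:
            for _, block_no, data in pending:
                disk[block_no] = data   # write the block to its real location
            pending = []
        else:
            pending.append(record)
    # Whatever is left in `pending` has no commit record and is discarded.
    return disk


updates = [
    ("block", 10, "inode count"),
    ("block", 20, "inode"),
    ("block", 30, "dir entry"),
]

# Scenario 1: crash after only the first block reached the journal.
assert recover(updates[:1], {}) == {}
# Scenario 2: crash after the commit record reached the journal.
assert recover(updates + [(COMMIT,)], {}) == {
    10: "inode count", 20: "inode", 30: "dir entry"}
# Scenario 3: all three blocks logged, commit record missing.
assert recover(updates, {}) == {}
```

In this model the disk is either left untouched or receives all three blocks; no crash point produces a partial update.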

Journaling Block Device
    The Linux Journaling Block Device (JBD) provides this scratch pad
    for providing atomicity in operations. Thus, a filesystem controlling a
    block device can make use of JBD on the same or on another block device
    in order to maintain consistency. The JBD is a modular implementation
    that exposes a set of APIs for the use of such applications. The
    following sections describe the concepts and implementation of the
    Linux JBD as present in the Linux 2.6 kernel.
    Before we move on to the implementation details of the JBD, an
    understanding of some of the objects that JBD uses is required. A
    journal is a log that internally manages updates for a single block
    device. As mentioned above, the updates first are stored in the journal
    and then are reflected to their real locations on the disk. The area
    belonging to the journal is managed like a circular list. That
    is, once the end of the area is reached, the journal wraps around
    and reuses space at the beginning that is no longer needed.
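The wrap-around behavior can be sketched as a ring of block positions. The `CircularJournal` class and its `head`, `tail` and `used` fields are hypothetical names chosen for this illustration; the real journal tracks its space differently.

```python
class CircularJournal:
    """Fixed-size log area whose block positions wrap around."""

    def __init__(self, size):
        self.size = size
        self.head = 0   # next position to write
        self.tail = 0   # oldest position still needed
        self.used = 0

    def append(self):
        """Reserve one journal block and return its position."""
        if self.used == self.size:
            raise RuntimeError("journal full: space must be reclaimed first")
        pos = self.head
        self.head = (self.head + 1) % self.size
        self.used += 1
        return pos

    def reclaim(self, count):
        """Release the `count` oldest blocks."""
        count = min(count, self.used)
        self.tail = (self.tail + count) % self.size
        self.used -= count


log = CircularJournal(size=4)
positions = [log.append() for _ in range(4)]   # fills positions 0..3
log.reclaim(2)                                 # positions 0 and 1 become reusable
positions += [log.append(), log.append()]
print(positions)  # [0, 1, 2, 3, 0, 1]
```

A full journal cannot accept new blocks until older ones are released, which is the role checkpointing plays.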
    A handle represents a single atomic update. The entire set of
    changes/writes that should be performed atomically are carried out with
    reference to a single handle.
    It may not be an efficient approach to flush each atomic update
    (handle) to the journal, however. To achieve better performance, the
    JBD bunches a set of handles together into a transaction and flushes
    this transaction to the journal. The JBD ensures that the transaction
    is atomic in nature. Hence, the handles, which are the subcomponents of
    the transaction, also are guaranteed to be atomic.
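As a rough illustration of batching, the sketch below attaches every new handle to the single running transaction and commits them as one unit. The `Journal` class and the dictionary layout are assumptions made for this example, not the kernel's data structures.

```python
class Journal:
    def __init__(self):
        self.running = None    # at most one running transaction
        self.next_tid = 1
        self.committed = []

    def start_handle(self):
        """Attach a new handle to the running transaction, creating one if needed."""
        if self.running is None:
            self.running = {"tid": self.next_tid, "handles": 0, "blocks": set()}
            self.next_tid += 1
        self.running["handles"] += 1
        return self.running

    def commit(self):
        """Flush all batched handles as one unit."""
        txn, self.running = self.running, None
        if txn is not None:
            self.committed.append(txn)
        return txn


j = Journal()
h1 = j.start_handle()
h1["blocks"].update({10, 20})      # first atomic update touches two blocks
h2 = j.start_handle()
h2["blocks"].add(30)               # second atomic update joins the same transaction
txn = j.commit()
print(txn["tid"], txn["handles"], sorted(txn["blocks"]))  # 1 2 [10, 20, 30]
print(j.start_handle()["tid"])     # 2: a new handle starts a new transaction
```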
    The most important property of a transaction is its state. When a
    transaction is being committed, it follows the lifecycle of states
    listed below.

  • Running: the transaction currently is live and can accept new
    handles. A journal can have only one transaction in the running state.

  • Locked: the transaction does not accept any new handles but existing
    handles are not complete. Once all the existing handles are completed,
    the transaction goes to the next state.

  • Flush: all the handles in a transaction are complete. The transaction is writing itself to the journal.

  • Commit: the entire transaction log has been written to the journal.
    The transaction is writing a commit block indicating that the
    transaction log in the journal is complete.

  • Finished: the transaction is written completely to the journal. It
    has to remain there until the blocks are updated to the actual
    locations on the disk.
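The lifecycle above can be modeled as a small state machine. This is a sketch under simplifying assumptions: the `State` and `Transaction` names are invented here, and the only rule enforced is that a locked transaction waits for its open handles.

```python
from enum import Enum


class State(Enum):
    RUNNING = 1
    LOCKED = 2
    FLUSH = 3
    COMMIT = 4
    FINISHED = 5


class Transaction:
    def __init__(self):
        self.state = State.RUNNING
        self.open_handles = 0

    def start_handle(self):
        if self.state is not State.RUNNING:
            raise RuntimeError("only a running transaction accepts new handles")
        self.open_handles += 1

    def stop_handle(self):
        self.open_handles -= 1

    def advance(self):
        """Move to the next state; a locked transaction waits for its handles."""
        if self.state is State.FINISHED:
            raise RuntimeError("already finished")
        if self.state is State.LOCKED and self.open_handles > 0:
            return self.state   # still waiting for open handles
        self.state = State(self.state.value + 1)
        return self.state


t = Transaction()
t.start_handle()
t.advance()                            # RUNNING -> LOCKED
assert t.advance() is State.LOCKED     # the handle is still open
t.stop_handle()
assert t.advance() is State.FLUSH      # all handles complete
```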

Transaction Committing and Checkpointing
    A running transaction is written to the journal area after a certain
    period. Thus, a transaction can be either in-memory (running) or
    on-disk. Flushing a transaction to the journal and marking that
    particular transaction as finished is a process called transaction
    commit.
    The journal has a limited area under its control, and it needs to
    reuse this area. Committed transactions that have all their
    blocks written to the disk no longer need to be kept in the
    journal. Checkpointing, then, is the process of flushing the finished
    transactions to the disk and reclaiming the corresponding space in the
    journal. It is discussed in more detail later in this article.

Implementation Briefs
    The JBD layer performs journaling of the metadata, during which the
    data simply is written to the disk without being journaled. But this
    does not stop applications from journaling the data, as it can be
    presented to the JBD as metadata itself. This document takes the Linux
    kernel version 2.6.0 as a reference.

Commit
    [journal_commit_transaction(journal object)]
    A kjournald thread is associated with every journaled device. The
    kjournald thread ensures that the running transaction is committed
    after a specific interval. The transaction commit code is divided into
    nine phases, numbered 0 through 8 and described below. Figure 1 shows
    a logical layout of a journal.
    Phase 0: moves the transaction from running state (T_RUNNING) to
    locked state (T_LOCKED), meaning the transaction no longer can accept
    new handles. The transaction waits until all the existing handles have
    completed. A transaction always has a set of buffers reserved when
    it is initiated. Some of these buffers may be unused and
    are unfiled in this phase. The transaction now is ready to be committed
    with no outstanding handles.
    Phase 1: the transaction enters into the flush state (T_FLUSH). The
    transaction is marked as the currently committing transaction
    for the journal. This phase also marks that no running transaction
    exists for the journal; therefore, new requests for handles initiate a
    new transaction.
    Phase 2: the actual buffers of the transaction are flushed to the
    disk. Data buffers go first. There are no complications here, as data
    buffers are not saved in the log area. Instead, they are flushed
    directly to their actual positions on the disk. This phase ends when
    the I/O completion notifications for all such buffers are received.
    Phase 3: all the data buffers are written to the disk, but their
    metadata still is in volatile memory. Metadata flushing is not as
    straightforward as data buffer flushing, because metadata needs to be
    written to the log area and the actual positions on the disk need to be
    remembered. This phase starts with flushing these metadata buffers, for
    which a journal descriptor block is acquired. The journal descriptor
    block stores the mapping of each metadata buffer in the journal to its
    actual location on the disk in the form of tags. After this, metadata
    buffers are flushed to the journal. Once the journal descriptor is full
    of tags or all metadata buffers are flushed to the journal, the journal
    descriptor also is flushed to the journal. Now we have all the metadata
    buffers in the journal, and their actual positions on the disk are
    remembered. This data, being persistent, can be used for recovery if
    failure occurs.
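The role of the tags can be sketched as follows. This is a simplified model, not the on-disk format: the tuple layout, the placement of each descriptor ahead of the blocks it describes, and the `tags_per_descriptor` limit are all assumptions of this example.

```python
def log_metadata(buffers, tags_per_descriptor=2):
    """Lay out metadata buffers in a journal, grouped under descriptor blocks.

    `buffers` is a list of (home_block_no, data) pairs. Each descriptor
    carries one tag per buffer that follows it; a tag records the block's
    real location on the disk.
    """
    journal = []
    for start in range(0, len(buffers), tags_per_descriptor):
        chunk = buffers[start:start + tags_per_descriptor]
        tags = [home for home, _ in chunk]
        journal.append(("descriptor", tags))
        journal.extend(("metadata", data) for _, data in chunk)
    return journal


def home_locations(journal):
    """Recover the (home_block_no, data) pairs from the journal layout."""
    result, tags = [], []
    for kind, payload in journal:
        if kind == "descriptor":
            tags = list(payload)
        else:
            result.append((tags.pop(0), payload))
    return result


buffers = [(10, "bitmap"), (20, "inode"), (30, "dir")]
journal = log_metadata(buffers)
# A second descriptor is started once the first one is full of tags.
assert journal[0] == ("descriptor", [10, 20])
assert journal[3] == ("descriptor", [30])
assert home_locations(journal) == buffers
```

Because the tags are stored in the journal alongside the blocks, the mapping back to the real disk locations survives a crash.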
    Phase 4 and Phase 5: these phases wait for the I/O completion
    notifications of the metadata buffers and the journal descriptor
    blocks, respectively. The buffers are unfiled from in-memory lists
    once I/O completion is received.
    Phase 6: all the data and metadata are on safe storage, data at its
    actual locations and metadata in the journal. Now the transaction needs
    to be marked as committed so that it can be known that all the updates
    are safe in the journal. For this reason, a journal descriptor block
    again is allocated. A tag is written stating that the transaction has
    committed successfully, and the block is synchronously written to its
    position in the journal. After this, the transaction is moved to the
    committed state, T_COMMIT.
    Phase 7: occurs when a number of transactions are present in the
    journal, without yet being flushed to the disk. Some of the metadata
    buffers in this transaction already may be a part of some previous
    transaction. These need not be kept in the older transactions as we
    have their latest copy in the current committed transaction. Such
    buffers are removed from older transactions.
    Phase 8: the transaction is marked as being in the finished state,
    T_FINISHED. The journal structure is updated to reflect this particular
    transaction as the latest committed transaction. It also is added to
    the list of transactions to be checkpointed.
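The write ordering that the phases establish can be summarized in a few lines. This is a heavily simplified sketch: `commit_transaction` and its record tuples are hypothetical, and the locking, I/O waiting and list management of phases 0, 1, 4, 5, 7 and 8 are left out.

```python
def commit_transaction(tid, data_buffers, metadata_buffers, disk, journal):
    """Commit in the order the phases describe (simplified).

    data_buffers / metadata_buffers: dicts of block_no -> contents.
    disk: dict updated in place; journal: list appended to in place.
    """
    # Phase 2: data goes straight to its real location, never to the journal.
    for block_no, data in data_buffers.items():
        disk[block_no] = data
    # Phase 3: a descriptor with the home locations, then the metadata itself.
    journal.append(("descriptor", tid, list(metadata_buffers)))
    for data in metadata_buffers.values():
        journal.append(("metadata", tid, data))
    # Phase 6: the commit record is written last.
    journal.append(("commit", tid))


disk, journal = {}, []
commit_transaction(1, {100: "file contents"}, {10: "bitmap", 20: "inode"},
                   disk, journal)
print(disk)         # {100: 'file contents'}
print(journal[-1])  # ('commit', 1)
```

After the commit, the metadata blocks 10 and 20 exist only in the journal; writing them to their real locations is left to checkpointing.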

Checkpointing
    Checkpointing is initiated when the journal is being flushed to the
    disk (think of unmount) or when a new handle is started. A new handle
    can fall short of its guaranteed number of buffers, so it may be
    necessary to carry out a checkpointing process in order to free some
    space in the journal.
    The checkpointing process flushes the metadata buffers of a
    transaction that are not yet written to their actual locations on the
    disk. The transaction then is removed from the journal. The journal can
    have multiple checkpointing transactions, and each checkpointing
    transaction can have multiple buffers. The process considers each
    checkpointing transaction, and for each transaction, it finds the
    metadata buffers that need to be flushed to the disk. All these buffers
    are flushed in one batch. Once all the transactions are checkpointed,
    their log is removed from the journal.
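A minimal sketch of this loop, assuming each finished transaction is a dictionary with hypothetical keys `tid`, `log_blocks` and `metadata`; the real implementation works on buffer lists and issues asynchronous I/O.

```python
def checkpoint(checkpoint_list, disk):
    """Write back finished transactions, oldest first, and free their log space.

    `checkpoint_list` holds dicts with keys "tid", "log_blocks" (journal
    blocks the transaction occupies) and "metadata" (block_no -> contents
    not yet written to its real location). Returns the journal blocks freed.
    """
    freed = 0
    while checkpoint_list:
        txn = checkpoint_list.pop(0)
        # Flush this transaction's pending metadata buffers as one batch.
        disk.update(txn["metadata"])
        txn["metadata"].clear()
        # Only now is the transaction's log area safe to reuse.
        freed += txn["log_blocks"]
    return freed


disk = {10: "old bitmap"}
pending = [
    {"tid": 1, "log_blocks": 4, "metadata": {20: "inode v1"}},
    {"tid": 2, "log_blocks": 3, "metadata": {10: "bitmap v2"}},
]
freed = checkpoint(pending, disk)
print(freed)  # 7
print(disk)   # {10: 'bitmap v2', 20: 'inode v1'}
```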

Recovery
    [journal_recover(journal object)]
    When the system comes up after a crash and sees that the log
    entries are not null, this indicates that the last unmount was not
    successful or never occurred. At this point, a recovery must be
    attempted. Figure 2 depicts a sample physical layout of a journal. The
    recovery takes place in three passes.

  • PASS_SCAN: the end of the log is found.

  • PASS_REVOKE: a list of revoked blocks is prepared from the log.

  • PASS_REPLAY: unrevoked blocks are rewritten (replayed) in order to guarantee the consistency of the disk.
    For recovery, the available information is provided in terms of the
    journal. But the exact state of the journal is unknown, as we do not
    know the point at which the system crashed. Hence, the last transaction
    could be in the checkpointing or committing state. A running
    transaction cannot be found, as it existed only in memory.
    For committing transactions, we have to forget the updates made, as
    all of the updates may not be in place. So in the PASS_SCAN phase, the
    last log entry in the log is found. From here, the recovery process
    knows which transactions need to be replayed.
    Every transaction can have a set of revoked blocks. This is
    important to know in order to prevent older journal records from being
    replayed on top of newer data using the same block. In PASS_REVOKE, a
    hash table of all these revoked blocks is prepared. This table is used
    every time we need to find out whether a particular block should get
    written to the disk through a replay.
    In the last phase, all the blocks that need to be replayed are
    considered. Each block is tested for its presence in the revoked
    blocks' hash table. If the block is not in there, it is safe to write
    the block to its actual location on the disk. If the block is there,
    only the newest version of the block is written to the disk. Notice
    that we have not changed anything in the on-disk journal. Hence, even
    if the system crashes again while the recovery is in progress, no harm
    is done. The same journal is present for the next recovery, and no
    non-idempotent operation is performed during the process of recovery.
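The three passes can be sketched over a toy journal. The record tuples and the `recover` function are invented for this illustration, and the revoke rule used here (skip a logged copy when the same or a later transaction revoked that block) is a simplifying assumption about how the revoke table is consulted.

```python
def recover(journal, disk):
    """Three-pass recovery over a toy journal (a list of tuples).

    Records: ("block", tid, block_no, data), ("revoke", tid, block_no),
    ("commit", tid). The journal itself is never modified.
    """
    # PASS_SCAN: find the end of the log, i.e. the last commit record.
    end = 0
    for i, record in enumerate(journal):
        if record[0] == "commit":
            end = i + 1
    valid = journal[:end]

    # PASS_REVOKE: remember, per block, the latest transaction revoking it.
    revoked = {}
    for record in valid:
        if record[0] == "revoke":
            _, tid, block_no = record
            revoked[block_no] = max(tid, revoked.get(block_no, tid))

    # PASS_REPLAY: write back every logged block that is not revoked by
    # its own or a later transaction.
    for record in valid:
        if record[0] == "block":
            _, tid, block_no, data = record
            if block_no in revoked and tid <= revoked[block_no]:
                continue
            disk[block_no] = data
    return disk


journal = [
    ("block", 1, 10, "bitmap v1"),
    ("block", 1, 20, "stale dir block"),
    ("commit", 1),
    ("revoke", 2, 20),               # block 20 was freed and later reused
    ("block", 2, 10, "bitmap v2"),
    ("commit", 2),
    ("block", 3, 10, "bitmap v3"),   # transaction 3 never committed
]
disk = recover(journal, {})
print(disk)  # {10: 'bitmap v2'}
assert recover(journal, disk) == {10: "bitmap v2"}  # a second replay is harmless
```

Since the replay only overwrites blocks with fixed contents and leaves the journal untouched, running it again after an interrupted recovery yields the same disk state.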

    Amey Inamdar (www.geocities.com/amey_inamdar) is a kernel developer working at Kernel Corporation. His interest areas include filesystems and distributed systems.
    Kedar Sovani (www.geocities.com/kedarsovani) works for Kernel Corporation as a kernel developer. His areas of interest include filesystems and storage technologies.
                   
                   
                   

    This article comes from the ChinaUnix blog. To view the original, go to: http://blog.chinaunix.net/u1/53103/showart_1083354.html