- 论坛徽章:
- 0
|
Unix下针对邮件,搜索,网络硬盘等海量存储的分布式文件系统项目
1. Introduction to the BeOS and BFS
1.1 History leading up to BFS
The Solution
Starting in September 1996, Cyril Meurillon and I set about to define a new I/O architecture and file system for BeOS. We knew that the existing split of file system and database would no longer work. We wanted a new, high-performance file system that supported the database functionality the BeOS was known for as well as a mechanism to support multiple file systems. We also took the opportunity to clean out some of the accumulated cruft that had worked its way into the system over the course of the previous five years of development.
The task we had to solve had two very clear components. First there was the higher-level file system and device interface. This half of the project involved defining an API for file system and device drivers, managing the name space, connecting program requests for files into file descriptors, and managing all the associated state. The second half of the project involved writing a file system that would provide the functionality required by the rest of the BeOS. Cyril, being the primary kernel architect at Be, took on the first portioin of the task. The most difficult portion of Cyril's project involved defining the file system API in such a way that it was as multithreaded as possible, correct, deadlock-free, and efficient. That task involved many major iterations as we battled over what a file system had to do and what the kernel layer would manage. There is some discussion of this level of the file system in Chapter 10, but it is not the primary focus of this book.
My half of the project involved defining the on-disk data structures, managing all the nity-gritty physical details of the raw disk blocks, and performing the I/O requests made by programs. Because the disk block cache is intimately intertwined with the file system (especially a journaled file system), I also took on the task of rewriting the block cache.
1.2 Design Goals
...
In addition to the above design goals, we had the long-standing goals of making the system as multithreaded and as efficient as possible, which meant fine-grained locking everywhere and paying close attention to the overhead introduced by the file system. Memory usage was also a big concern. ...
1.3 Design Constraints
There were also several design constraints that the project had to contend with. The first and foremost was the lack of engineering resources. The Be engineering staff is quite small, at the time only 13 engineers. Cyril and I had to wrok alone because everyone else was busy with other projects. We also did not have very much time to complete the project. Be, Inc., tries to have regular software releases, once every four to six months. The initial target was for the project to take six months. The short amount of time to complete the project and the lack of engineering resources meant that there was little time to explore different designs and to experiment with completely untested ideas. In the end it took nine months for the first beta release of BFS. The final version of BFS shipped the following month.
2. What is a File System ?
2.1 The Fundamentals
It is important to keep in mind the abstract goal of what a file system must achieve: to store, retrieve, locate, and manipulate information. Keeping the goal stated in general terms frees us to think of alternative implementations and possibilities that might not otherwise occur if we were to only think of a file system as a typical, strictly hierarchical, disk-based structure.
...
2.3 The Abstractions
...
Extents
Another technique to manage mapping from logical positions in a byte stream to data blocks on disk is to use extent lists. An extent list is similar to the simple block list described previously except that each block address is not just for a single block but rather for a range of blocks. That is, every block address is given as a starting block and a length (expressed as the number of successive blocks following the starting block). The size of an extent is usually larger than a simple block address but is potentially able to map a much larger region of disk space.
...
Although extent lists are a more compact way to refer to large amounts of data, they may still require use of indirect or double-indirect blocks. If a file system becomes highly fragmented and each extent can only map a few blocks of data, then the use of indirect and double-indirect blocks becomes a necessity. One disadvantage to using extent lists is that locating a specific file position may require scanning a large number of extents. Because the length of an extent is variable, when locating a specific position the file system must start at the first extent and scan through all of them until it finds the extent that covers the position of interest. ...
Storing Directory Entries
...
Another method of organizing directory entries is to use a sorted data structure suitable for on-disk storage. One such data structure is a B- tree (or its variants, B+ tree and B* tree). A B- tree keeps the keys sorted by their name and is efficient at looking up whether a key exists in the directory. B- trees also scale well and are able to deal efficiently with directories that contain many tens of thousands of files.
2.5 Extended file system Operations
...
Indexing
File attributes allow users to associate additional information with files, but there is even more that a file system can do with extended file attributes to aid users in managing and locating their information. If the file system also indexes the attributes. For example, if we added a *keywork* attribute to a set of files and the *keyword* attribute was indexed, the user could then issue queries asking which files contained various keywords regardless of their location in the hierarchy.
When coupled with a good query language, indexing offers a powerful alternative interface to the file system. With queries, users are not restricted to navigating a fixed hierarchy of files; instead they can issue queries to find the working set of files they would like to see, regardless of the location of the files.
Journaling/Logging
Avoiding corruption in a file system is a difficult task. Some file systems go to great lengths to avoid corruptioin problems. They may attempt to oder disk writes in such a way that corruption is recoverable, or they may force operations that can cause corruption to be synchronous so that the file system is always in a known state. Still other systems simply avoid the issue and depend on a very sophisticated file system check program to recover in the event of failures. All of these approaches must check the disk at boot time, a potentially lengthy operation (especially as disk size increase). Further, should a crash happen at an inopportune time, the file system may still be corrupt.
A more modern approach to avoiding corruption is *journaling*. Journaling, a technique borrowed from the database world, avoids corruption by batching groups of changes and committing them all at once to a transaction log. The batched changes guarantee the atomicity of multiple changes. That atomicity guarantee allows the file system to guarantee that operations either happen completely or not at all. Further, if a crash does happen, the system need only replay the transaction log to recover the system to a known state. Replaying the log is an operation that takes at most a few seconds, which is considerably faster than the file system check that nonjournaled file systems must make.
Guaranteed bandwidth/Bandwidth Reservationo
The desire to guarantee high-bandwidth I/O for multimedia applications drives some file system designers to provide special hooks that allow applications to guarantee that they will receive a certain amount of I/O bandwidth (within the limits of the hardware). To accomplish this the file system needs a great deal of knowledge about the capabilities of the underlying hardware it uses and must schedule I/O requests. This problem is nontrivial and still an area of research.
Access Control Lists
Access Control Lists (ACLs) provide an extended mechanism for specifying who may access a file and how they may access it. The traditional POSIX approach of three sets of permissions - for the owner of a file, the group that the owner is in, and everyone else - is not sufficient in some settings. An access control list specifies the exact level of access that any person may have to a file. This allows for fine-grained control over the access to a file in comparison to the braod divisions defined in the POSIX security model. |
|