
Chinaunix

Thread starter: yftty

浩存 - a clustered storage system for massive data such as databases and virtual machines, providing simultaneous NFS/iSCSI access

#91 | Posted 2005-06-16 00:13

Re: Distributed file system project for mass storage (mail, search, network drives) on Unix

Drawn as a Gantt chart, it should look something like this:

  |
  |  R&D ->
  |    ^ Development ->
  |       ^ QA ->
  |          ^ Operations (maintenance) ->
----------------------------

So you must be very familiar with FC, SCSI, iSCSI, NFS, Samba, caching, RAID, and the like.

#92 | Posted 2005-06-16 11:36

Re: Distributed file system project for mass storage (mail, search, network drives) on Unix

"帖子总数发表于: 2005-06-16 00:13    发表主题: To: raidcracker"

Thread starter, you were still up that late?

#93 | Posted 2005-06-16 11:41

Re: Distributed file system project for mass storage (mail, search, network drives) on Unix

By the way, is yf your real name's acronym?

#94 | Posted 2005-06-16 14:49

Re: Distributed file system project for mass storage (mail, search, network drives) on Unix

Originally posted by "BigMonkey": By the way, is yf your real name's acronym?


Hehe, the secret answer is 'yes'. And yftty is short for 'yf before a tty'.

By the way, the FS also needs to provide a PHP interface, so please share any SWIG experience if you have some or would like to.
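A rough sketch only, with entirely made-up names (dfs_open and friends are placeholders, not anything in this project): the usual approach is to expose a small, flat C API and let SWIG generate the PHP wrapper from a header of this shape.

/* Hypothetical C API for a PHP binding via SWIG; none of these names
 * exist in the project discussed in this thread.  A SWIG interface file
 * would declare a module, %include a header like this one, and SWIG
 * would then emit the PHP glue code. */
#include <stddef.h>

typedef struct dfs_file dfs_file;        /* opaque file handle */

dfs_file *dfs_open(const char *path, int flags);
long      dfs_read(dfs_file *f, void *buf, size_t len);
long      dfs_write(dfs_file *f, const void *buf, size_t len);
int       dfs_close(dfs_file *f);

Keeping the exported surface this small (opaque handles, no callbacks, no complex structs) is what tends to make the generated bindings painless.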

#95 | Posted 2005-06-16 16:49

Re: Distributed file system project for mass storage (mail, search, network drives) on Unix

原帖由 "yftty" 发表:
按照甘特图的画法, 应该是这个样子的吧

  |
  |  研发 ->;
  |    ^ 开发 ->;
  |       ^ QA ->;
  |          ^ 运营(维护) ->;
----------------------------

那你对 FC  SCSI  iSCSI NFS ..........


These are the tools I make my living with, so I have no choice but to know them well. The thread starter's breadth of expertise is enviable.

#96 | Posted 2005-06-17 10:58

Re: Distributed file system project for mass storage (mail, search, network drives) on Unix

http://lwn.net/Articles/136579/

The second version of Oracle's cluster filesystem has been in the works for some time. There has been a recent increase in cluster-related code proposed for inclusion into the mainline, so it was not entirely surprising to see an OCFS2 patch set join the crowd. These patches have found their way directly into the -mm tree for those wishing to try them out.

As a cluster filesystem, OCFS2 carries rather more baggage than a single-node filesystem like ext3. It does have, at its core, an on-disk filesystem implementation which is heavily inspired by ext3. There are some differences, though: it is an extent-based filesystem, meaning that files are represented on-disk in large, contiguous chunks. Inode numbers are 64 bits. OCFS2 does use the Linux JBD layer for journaling, however, so it does not need to bring along much of its own journaling code.

To actually function in a clustered mode, OCFS2 must have information about the cluster in which it is operating. To that end, it includes a simple node information layer which holds a description of the systems which make up the cluster. This data structure is managed from user space via configfs; the user-space tools, in turn, take the relevant information from a single configuration file (/etc/ocfs2/cluster.conf). It is not enough to know which nodes should be part of a cluster, however: these nodes can come and go, and the filesystem must be able to respond to these events. So OCFS2 also includes a simple heartbeat implementation for monitoring which nodes are actually alive. This code works by setting aside a special file; each node must write a block to that file (with an updated time stamp) every so often. If a particular block stops changing, its associated node is deemed to have left the cluster.
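The heartbeat idea is simple enough to sketch. The fragment below is illustrative only, with an invented on-disk layout rather than OCFS2's actual code: each node periodically rewrites its own block with a fresh sequence number and timestamp, and a monitor treats a node whose block has stopped changing as having left the cluster.

/* Illustrative heartbeat sketch (invented layout, not the OCFS2 code). */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define HB_BLOCK_SIZE 512

struct hb_block {
    uint64_t seq;        /* monotonically increasing sequence number */
    uint64_t timestamp;  /* wall-clock time of the last update */
    char     pad[HB_BLOCK_SIZE - 2 * sizeof(uint64_t)];
};

/* Periodically called by node 'node_id': rewrite its block in the shared
 * heartbeat file with a fresh sequence number. */
static int hb_tick(FILE *hb_file, int node_id, uint64_t *seq)
{
    struct hb_block blk;

    memset(&blk, 0, sizeof(blk));
    blk.seq = ++(*seq);
    blk.timestamp = (uint64_t)time(NULL);

    if (fseek(hb_file, (long)node_id * HB_BLOCK_SIZE, SEEK_SET) != 0)
        return -1;
    if (fwrite(&blk, sizeof(blk), 1, hb_file) != 1)
        return -1;
    return fflush(hb_file);
}

/* Monitoring side: a node is alive if its block changed since the last scan. */
static int hb_node_alive(const struct hb_block *prev, const struct hb_block *cur)
{
    return cur->seq != prev->seq;
}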

Another important component is the distributed lock manager. OCFS2 includes a lock manager which, like the implementation covered last week, is called "dlm" and implements a VMS-like interface. Oracle's implementation is simpler, however (its core locking function only has eight parameters...), and it lacks many of the fancier lock types and functions of Red Hat's implementation. There is also a virtual filesystem interface ("dlmfs") making locking functionality available to user space.
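For flavor, a VMS-style lock request has roughly the shape sketched below. This prototype is invented for illustration (Oracle's actual entry point differs and, as noted, takes eight parameters): the caller names a lock resource, asks for a mode, and is told asynchronously, via a callback, when the lock has been granted.

/* Invented sketch of a VMS-style lock request; not the real dlm API. */
enum dlm_mode {
    DLM_MODE_NL,   /* null: placeholder lock, no access rights */
    DLM_MODE_PR,   /* protected read: shared with other readers */
    DLM_MODE_EX,   /* exclusive: no other holders allowed */
};

typedef void (*dlm_ast_t)(void *astarg);   /* completion callback ("AST") */

int dlm_lock_resource(const char *name,    /* name of the lock resource */
                      enum dlm_mode mode,  /* requested lock mode */
                      int flags,           /* e.g. convert an existing lock */
                      void *lksb,          /* lock status block, filled in on grant */
                      dlm_ast_t ast,       /* invoked when the lock is granted */
                      void *astarg);       /* argument handed to the callback */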

There is a simple, TCP-based messaging system which is used by OCFS2 to talk between nodes in a cluster.

The remaining code is the filesystem implementation itself. It has all of the complications that one would expect of a high-performance filesystem implementation. OCFS2, however, is meant to operate with a disk which is, itself, shared across the cluster (perhaps via some sort of storage-area network or multipath scheme). So each node on the cluster manipulates the filesystem directly, but they must do so in a way which avoids creating chaos. The lock manager code handles much of this - nodes must take out locks on on-disk data structures before working with them.

There is more to it than that, however. There is, for example, a separate "allocation area" set aside for each node in the cluster; when a node needs to add an extent to a file, it can take it directly from its own allocation area and avoid contending with the other nodes for a global lock. There are also certain operations (deleting and renaming files, for example) which cannot be done by a node in isolation. It would not do for one node to delete a file and recycle its blocks if that file remains open on another node. So there is a voting mechanism for operations of this type; a node wanting to delete a file first requests a vote. If another node vetoes the operation, the file will remain for the time being. Either way, all nodes in the cluster can note that the file is being deleted and adjust their local data structures accordingly.
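The voting step boils down to something like the sketch below, with hypothetical helpers standing in for the real network messaging: the node that wants to unlink asks every other node, and any node that still holds the inode open vetoes the operation.

/* Illustrative "vote before unlink" sketch; names are hypothetical. */
#include <stdbool.h>

struct cluster;   /* opaque handle for the other nodes in the cluster */

/* Assumed helpers for this sketch; in reality these are network requests. */
int  cluster_node_count(const struct cluster *c);
bool node_has_inode_open(int node, unsigned long long ino);

static bool vote_on_unlink(const struct cluster *c, int self, unsigned long long ino)
{
    int n = cluster_node_count(c);

    for (int node = 0; node < n; node++) {
        if (node == self)
            continue;
        if (node_has_inode_open(node, ino))
            return false;   /* veto: the file is still open elsewhere */
    }
    return true;            /* everyone agreed; the blocks may be recycled */
}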

The code base as a whole was clearly written with an eye toward easing the path into the mainline kernel. It adheres to the kernel's coding standards and avoids the use of glue layers between the core filesystem code and the kernel. There are no changes to the kernel's VFS layer. Oracle's developers also appear to understand the current level of sensitivity about the merging of cluster support code (node and lock managers, heartbeat code) into the kernel. So they have kept their implementation of these functionalities small and separate from the filesystem itself. OCFS2 needs a lock manager now, for example, so it provides one. But, should a different implementation be chosen for merging at some future point, making the switch should not be too hard.

One assumes that OCFS2 will be merged at some point; adding a filesystem is not usually controversial if it is implemented properly and does not drag along intrusive VFS-layer changes. It is only one of many cluster filesystems, however, so it is unlikely to be alone. The competition in the cluster area, it seems, is just beginning.

#97 | Posted 2005-06-17 11:05

Re: Distributed file system project for mass storage (mail, search, network drives) on Unix

http://lwn.net/Articles/136579/

Plan 9 started as Ken Thompson and Rob Pike's attempt to address a number of perceived shortcomings in the Unix model. Among other things, Plan 9 takes the "everything is a file" approach rather further than Unix does, and tries to do so in a distributed manner. Plan 9 never took off the way Unix did, but it remains an interesting project; it has been free software since 2003.

One of the core components of Plan 9 is the 9P filesystem. 9P is a networked filesystem, somewhat equivalent to NFS or CIFS, but with its own particular approach. 9P is not as much a way of sharing files as a protocol definition aimed at the sharing of resources in a networked environment. There is a draft RFC available which describes this protocol in detail.

The protocol is intentionally simple. It works in a connection-oriented, single-user mode, much like CIFS; each user on a Plan 9 system is expected to make one or more connections to the server(s) of interest. Plan 9 operates with per-user namespaces by design, so each user ends up with a unique view of the network. There is a small set of operations supported by 9P servers; a client can create file descriptors, use them to navigate around the filesystem, read and write files, create, rename and delete files, and close things down; that's about it.
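That small operation set is essentially the list of T-messages in the 9P2000 draft, each of which has a matching R-reply. The enum below is only a paraphrase for orientation; the wire constants and exact field layouts are defined in the protocol document.

/* The 9P2000 request vocabulary, paraphrased; see the draft RFC for the
 * authoritative message numbers and encodings. */
enum p9_request {
    P9_TVERSION,   /* negotiate protocol version and maximum message size */
    P9_TAUTH,      /* optional authentication handshake */
    P9_TATTACH,    /* attach to a file tree, obtaining a root fid */
    P9_TWALK,      /* walk a fid down a sequence of path elements */
    P9_TOPEN,      /* open the file a fid refers to */
    P9_TCREATE,    /* create a file in the directory a fid refers to */
    P9_TREAD,      /* read 'count' bytes at an offset */
    P9_TWRITE,     /* write 'count' bytes at an offset */
    P9_TCLUNK,     /* release a fid (close) */
    P9_TREMOVE,    /* remove the file and clunk the fid */
    P9_TSTAT,      /* retrieve file metadata */
    P9_TWSTAT,     /* modify file metadata */
    P9_TFLUSH,     /* abort an outstanding request */
};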

The protocol is intentionally independent of the underlying transport mechanism. Typically, a TCP connection is used, but that is not required. A 9P client can, with a proper implementation, communicate with a server over named pipes, zero-copy memory transports, RDMA, RFC1149 avian links, etc. The protocol also puts most of the intelligence on the server side; clients, for example, perform no caching of data. An implication of all these choices is that there is no real reason why 9P servers have to be exporting filesystems at all. A server can just as easily offer a virtual filesystem (along the lines of /proc or sysfs), transparent remote access to devices, connections to remote processes, or just about anything else. The 9P protocol is the implementation of the "everything really is a file" concept. It could thus be used in a similar way as the filesystems in user space (FUSE) mechanism currently being considered for merging. 9P also holds potential as a way of sharing resources between virtualized systems running on the same host.

There is a 9P implementation for Linux, called "v9fs"; Eric Van Hensbergen has recently posted a v9fs patch set for review with an eye toward eventual inclusion. v9fs is a full 9P client implementation; there is also a user-space server available via the v9fs web site.

Linux and Plan 9 have different ideas of how a filesystem should work, so a fair amount of impedance matching is required. Unix-like systems prefer filesystems to be mounted in a global namespace for all users, while Plan 9 filesystems are a per-user resource. A v9fs filesystem can be used in either mode, though the most natural way is to use Linux namespaces to allow each user to set up independently authenticated connections. The lack of client-side caching does not mix well with the Linux VFS, which wants to cache heavily. The current v9fs implementation disables all of this caching. In some areas, especially write performance, this lack of caching makes itself felt. In others, however, v9fs claims better performance than NFS as a result of its simpler protocol. Plan 9 also lacks certain Unix concepts - such as symbolic links. To ease interoperability with Unix systems, a set of protocol extensions has been provided; v9fs uses those extensions where indicated.

The current release is described as "reasonably stable." The basic set of file operations has been implemented, with the exception of mmap(), which is hard to do in a way which does not pose the risk of system deadlocks. Future plans include "a more complete security model" and some thought toward implementing limited client-side caching, perhaps by using the CacheFS layer. See the patch introduction for pointers to more information, mailing lists, etc.



(Posted Jun 6, 2005 16:53 UTC (Mon) by guest stfn)

The design philosophy shares something with the recently popular "REpresentational State Transfer" style of web services. They each chose one unifying metaphor and a minimal interface: either everything is a file and accessed through file system calls, or everything is a resource and accessed through HTTP methods on a URL.

That might be a naive simplification*, but others have observed the same:

http://www.xent.com/pipermail/fork/2001-August/002801.html
http://rest.blueoxen.net/cgi-bin/wiki.pl?RestArchitectura...

* It's only one aspect of the design and, on the other hand, there's all kinds of caching in the web and URIs if not URLs are meant to form a global namespace that all users share.

#98 | Posted 2005-06-17 11:15

Re: Distributed file system project for mass storage (mail, search, network drives) on Unix

http://lwn.net/Articles/100321/

Many filesystems operate with a relatively slow backing store. Network filesystems are dependent on a network link and a remote server; obtaining a file from such a filesystem can be significantly slower than getting the file locally. Filesystems using slow local media (such as CDROMs) also tend to be slower than those using fast disks. For this reason, it can be desirable to cache data from these filesystems on a local disk.

Linux, however, has no mechanism which allows filesystems to perform local disk caching. Or, at least, it didn't have such a mechanism; David Howells's CacheFS patch changes that.

With CacheFS, the system administrator can set aside a partition on a block device for file caching. CacheFS will then present an interface which may be used by other filesystems. There is a basic registration interface, and a fairly elaborate mechanism for assigning an index to each file. Different filesystems will have different ways of creating identifiers for files, so CacheFS tries to impose as little policy as possible and let the filesystem code do what it wants. Finally, of course, there is an interface for caching a page from a file, noting changes, removing pages from the cache, etc.
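As a rough mental model only (these names and structures are invented for this note; the actual CacheFS patch defines its own, different interface), the shape of such an API is: a filesystem registers itself and says how to build index keys for its files, then reads, writes, and invalidates cached pages through per-file handles.

/* Hypothetical cache interface sketch, not the real CacheFS API. */
#include <stddef.h>

struct cachefs_cache;    /* a cache backed by a dedicated block device */
struct cachefs_object;   /* one cached file, identified by an index key */

/* Registration: the netfs supplies a way to build a stable, opaque key
 * for each file (an NFS filehandle, an AFS fid, ...). */
struct cachefs_netfs_ops {
    int (*make_key)(void *netfs_file, void *key, size_t *keylen);
};

int cachefs_register_netfs(const char *name, const struct cachefs_netfs_ops *ops);

/* Per-file handle lookup and page-level operations. */
struct cachefs_object *cachefs_lookup(struct cachefs_cache *cache,
                                      const void *key, size_t keylen);
int  cachefs_read_page(struct cachefs_object *obj, unsigned long index, void *page);
int  cachefs_write_page(struct cachefs_object *obj, unsigned long index, const void *page);
void cachefs_invalidate_page(struct cachefs_object *obj, unsigned long index);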

CacheFS does not attempt to cache entire files; it must be able to deal with the possibility that somebody will try to work with a file which is bigger than the entire cache. It also does not actually guarantee to cache anything; it must be able to perform its own space management, and things must still function even in the absence of an actual cache device. This should not be an obstacle for most filesystems which, by their nature, must be prepared to deal with the real source for their files in the first place.

CacheFS is meant to work with other filesystems, rather than being used as a standalone filesystem in its own right. Its partitions must be mounted before use, however, and CacheFS uses the mount point to provide a view into the cached filesystem(s). The administrator can even manually force files out of the cache by simply deleting them from the mounted filesystem.

Interposing a cache between the user and the real filesystem clearly adds another failure point which could result in lost data. CacheFS addresses this issue by performing journaling on the cache contents. If things come to an abrupt halt, CacheFS will be able to replay any lost operations once everything is up and functioning again.

The current CacheFS patch is used only by the AFS filesystem, but work is in progress to adapt others as well. NFS, in particular, should benefit greatly from CacheFS, especially when NFSv4 (which is designed to allow local caching) is used. Expect this patch to have a relatively easy journey into the mainstream kernel. For those wanting more information, see the documentation file included with the patch.

  CacheFS & Security
(Posted Sep 2, 2004 16:41 UTC (Thu) by subscriber scripter)

I wonder what the security implications of CacheFS are. Does each file inherit the permissions of the original? Is confidentiality a problem? What if you want to securely erase a file?

  CacheFS & Security
(Posted Sep 3, 2004 19:49 UTC (Fri) by subscriber hppnq)

Not knowing anything about CacheFS internals, I would say these are cases of "don't do it, then".

  CacheFS & Security
(Posted Sep 13, 2004 18:49 UTC (Mon) by guest AnswerGuy)

The only difference between accessing a filesystem directly and through CacheFS should be that CacheFS can store copies of the accessed data on a local block device. In other words, there is a (potentially persistent) footprint of all accesses.

Other than that CacheFS should preserve the same permissions semantics as if a given user/host were accessing the backend filesystem/service directly.

  A general caching filesystem
(Posted Sep 14, 2004 2:13 UTC (Tue) by subscriber xoddam)

This seems to me like a really complicated reimplementation of virtual memory.

All filesystems already use VM pages for caching, don't they? I'd have thought that attaching backing store to those pages would have been a much simpler task than writing a whole new cache interface.

But then I'm not really a filesystem hacker.

  A general caching filesystem
(Posted Oct 25, 2004 0:55 UTC (Mon) by subscriber jcm)

xoddam writes:

> This seems to me like a really complicated reimplementation of
> virtual memory.

No, it's really not. By "virtual memory" you are referring to an aspect of VM implementations known as paging, and that in itself only really impacts so-called "anonymous memory". There is a page cache for certain regular filesystems, but it is not possible for all filesystems to exploit the page cache to full effect, and in any case this patch adds the ability to use a local disk as additional cache storage for even slower things like network-mounted filesystems - so the page cache can always sit between this disk and the user processes which use it.

Jon.

  Improve "Laptop mode"
(Posted Oct 7, 2004 18:57 UTC (Thu) by subscriber BrucePerens)

I haven't looked at the CacheFS code yet, but this is what I would like to do with it, or something like it.

Put a cache filesystem on a FLASH disk plugged into my laptop. My laptop has a 512M MagicGate card, which looks like a USB disk. Use it to cache all recently read and written blocks from the hard disk, and allow the hard disk to remain spun down most of the time. Anytime the disk has to be spun up, flush any pending write blocks to it.

This would be an improvement over "laptop mode" in that it would not require system RAM and could thus be larger, and would not be as volatile as a RAM write cache.

Bruce

#99 | Posted 2005-06-20 14:07

Re: Distributed file system project for mass storage (mail, search, network drives) on Unix

1. Introduction to the BeOS and BFS

1.1 History leading up to BFS

The Solution

Starting in September 1996, Cyril Meurillon and I set about to define a new I/O architecture and file system for BeOS. We knew that the existing split of file system and database would no longer work. We wanted a new, high-performance file system that supported the database functionality the BeOS was known for as well as a mechanism to support multiple file systems. We also took the opportunity to clean out some of the accumulated cruft that had worked its way into the system over the course of the previous five years of development.

The task we had to solve had two very clear components. First there was the higher-level file system and device interface. This half of the project involved defining an API for file system and device drivers, managing the name space, connecting program requests for files into file descriptors, and managing all the associated state. The second half of the project involved writing a file system that would provide the functionality required by the rest of the BeOS. Cyril, being the primary kernel architect at Be, took on the first portion of the task. The most difficult portion of Cyril's project involved defining the file system API in such a way that it was as multithreaded as possible, correct, deadlock-free, and efficient. That task involved many major iterations as we battled over what a file system had to do and what the kernel layer would manage. There is some discussion of this level of the file system in Chapter 10, but it is not the primary focus of this book.

My half of the project involved defining the on-disk data structures, managing all the nitty-gritty physical details of the raw disk blocks, and performing the I/O requests made by programs. Because the disk block cache is intimately intertwined with the file system (especially a journaled file system), I also took on the task of rewriting the block cache.

1.2 Design Goals

...

In addition to the above design goals, we had the long-standing goals of making the system as multithreaded and as efficient as possible, which meant fine-grained locking everywhere and paying close attention to the overhead introduced by the file system. Memory usage was also a big concern. ...

1.3 Design Constraints

There were also several design constraints that the project had to contend with. The first and foremost was the lack of engineering resources. The Be engineering staff is quite small, at the time only 13 engineers. Cyril and I had to work alone because everyone else was busy with other projects. We also did not have very much time to complete the project. Be, Inc., tries to have regular software releases, once every four to six months. The initial target was for the project to take six months. The short amount of time to complete the project and the lack of engineering resources meant that there was little time to explore different designs and to experiment with completely untested ideas. In the end it took nine months for the first beta release of BFS. The final version of BFS shipped the following month.

2. What is a File System ?

2.1 The Fundamentals

It is important to keep in mind the abstract goal of what a file system must achieve: to store, retrieve, locate, and manipulate information. Keeping the goal stated in general terms frees us to think of alternative implementations and possibilities that might not otherwise occur if we were to only think of a file system as a typical, strictly hierarchical, disk-based structure.

...

2.3 The Abstractions

...

Extents

Another technique to manage mapping from logical positions in a byte stream to data blocks on disk is to use extent lists. An extent list is similar to the simple block list described previously except that each block address is not just for a single block but rather for a range of blocks. That is, every block address is given as a starting block and a length (expressed as the number of successive blocks following the starting block). The size of an extent is usually larger than a simple block address but is potentially able to map a much larger region of disk space.

...

Although extent lists are a more compact way to refer to large amounts of data, they may still require use of indirect or double-indirect blocks. If a file system becomes highly fragmented and each extent can only map a few blocks of data, then the use of indirect and double-indirect blocks becomes a necessity. One disadvantage to using extent lists is that locating a specific file position may require scanning a large number of extents. Because the length of an extent is variable, when locating a specific position the file system must start at the first extent and scan through all of them until it finds the extent that covers the position of interest. ...
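A minimal sketch of that lookup, assuming an in-memory array of extents (illustrative, not any particular on-disk format): because extent lengths vary, the search accumulates lengths from the front of the list until it reaches the extent covering the requested logical block.

#include <stdint.h>
#include <stddef.h>

/* One extent: a run of 'length' consecutive blocks starting at 'start'. */
struct extent {
    uint64_t start;    /* first physical block of the run */
    uint32_t length;   /* number of consecutive blocks in the run */
};

/* Map a logical block number to a physical block; returns 0 on success. */
static int extent_lookup(const struct extent *list, size_t nextents,
                         uint64_t logical, uint64_t *physical)
{
    uint64_t pos = 0;  /* logical block at which the current extent begins */

    for (size_t i = 0; i < nextents; i++) {
        if (logical < pos + list[i].length) {
            *physical = list[i].start + (logical - pos);
            return 0;
        }
        pos += list[i].length;
    }
    return -1;  /* the position lies beyond the mapped extents */
}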

Storing Directory Entries

...

Another method of organizing directory entries is to use a sorted data structure suitable for on-disk storage. One such data structure is a B-tree (or its variants, B+tree and B*tree). A B-tree keeps the keys sorted by their name and is efficient at looking up whether a key exists in the directory. B-trees also scale well and are able to deal efficiently with directories that contain many tens of thousands of files.

2.5 Extended File System Operations

...

Indexing

File attributes allow users to associate additional information with files, but there is even more that a file system can do with extended file attributes to aid users in managing and locating their information, if the file system also indexes the attributes. For example, if we added a *keyword* attribute to a set of files and the *keyword* attribute was indexed, the user could then issue queries asking which files contained various keywords, regardless of their location in the hierarchy.

When coupled with a good query language, indexing offers a powerful alternative interface to the file system. With queries, users are not restricted to navigating a fixed hierarchy of files; instead they can issue queries to find the working set of files they would like to see, regardless of the location of the files.
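Conceptually, an attribute index is a persistent map from attribute values to the files that carry them, so a query consults the index instead of walking the directory tree. The sketch below is illustrative only (a flat array for brevity, rather than the on-disk B-tree a real file system would use).

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* One posting in a hypothetical attribute index: it maps an attribute
 * value (e.g. keyword == "storage") to the inode of a file carrying it. */
struct index_entry {
    char     value[64];   /* attribute value, used as the lookup key */
    uint64_t inode;       /* file that has this attribute value */
};

/* Answer a query like  keyword == "storage"  without touching the
 * directory hierarchy at all. */
static const struct index_entry *index_lookup(const struct index_entry *entries,
                                              size_t n, const char *value)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(entries[i].value, value) == 0)
            return &entries[i];
    return NULL;
}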

Journaling/Logging

Avoiding corruption in a file system is a difficult task. Some file systems go to great lengths to avoid corruption problems. They may attempt to order disk writes in such a way that corruption is recoverable, or they may force operations that can cause corruption to be synchronous so that the file system is always in a known state. Still other systems simply avoid the issue and depend on a very sophisticated file system check program to recover in the event of failures. All of these approaches must check the disk at boot time, a potentially lengthy operation (especially as disk sizes increase). Further, should a crash happen at an inopportune time, the file system may still be corrupt.

A more modern approach to avoiding corruption is *journaling*. Journaling, a technique borrowed from the database world, avoids corruption by batching groups of changes and committing them all at once to a transaction log. The batched changes guarantee the atomicity of multiple changes. That atomicity guarantee allows the file system to guarantee that operations either happen completely or not at all. Further, if a crash does happen, the system need only replay the transaction log to recover the system to a known state. Replaying the log is an operation that takes at most a few seconds, which is considerably faster than the file system check that nonjournaled file systems must make.
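In API terms the pattern reads roughly as below. The names are invented for illustration (BFS and Linux's JBD each define their own interfaces): every block modified by one logical operation is recorded in a transaction, the whole batch is written to the log before any of it touches its final location, and replay at mount time finishes or discards whole transactions.

/* Hypothetical journaling interface sketch. */
#include <stdint.h>
#include <stddef.h>

struct journal;       /* the on-disk log area plus its in-memory state */
struct transaction;   /* one atomic batch of related changes */

/* Start a batch; everything added before commit succeeds or fails together. */
struct transaction *txn_begin(struct journal *j);

/* Record that 'len' bytes at disk offset 'off' will become 'data'. */
int txn_log_change(struct transaction *t, uint64_t off,
                   const void *data, size_t len);

/* Write the batch to the log and flush it, then apply the changes to their
 * real locations; the log entry is retired once that completes. */
int txn_commit(struct transaction *t);

/* Mount-time recovery: replay transactions that were committed to the log
 * but never fully applied, and drop partially written ones. */
int journal_replay(struct journal *j);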

Guaranteed Bandwidth/Bandwidth Reservation

The desire to guarantee high-bandwidth I/O for multimedia applications drives some file system designers to provide special hooks that allow applications to guarantee that they will receive a certain amount of I/O bandwidth (within the limits of the hardware). To accomplish this the file system needs a great deal of knowledge about the capabilities of the underlying hardware it uses and must schedule I/O requests. This problem is nontrivial and still an area of research.

Access Control Lists

Access Control Lists (ACLs) provide an extended mechanism for specifying who may access a file and how they may access it. The traditional POSIX approach of three sets of permissions - for the owner of a file, the group that the owner is in, and everyone else - is not sufficient in some settings. An access control list specifies the exact level of access that any person may have to a file. This allows for fine-grained control over access to a file, in comparison to the broad divisions defined in the POSIX security model.
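For concreteness, an ACL is typically stored as a small list of (principal, permissions) entries attached to the file. The layout below is illustrative only, not any particular file system's on-disk format.

#include <stdint.h>

enum acl_who { ACL_USER, ACL_GROUP, ACL_EVERYONE };

struct acl_entry {
    enum acl_who who;    /* which principal the entry applies to */
    uint32_t     id;     /* uid or gid when 'who' is ACL_USER / ACL_GROUP */
    uint16_t     perms;  /* permission bits: read, write, execute, ... */
};

struct acl {
    uint16_t         count;      /* number of entries that follow */
    struct acl_entry entries[];  /* much finer-grained than owner/group/other */
};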
#100 | Posted 2005-06-21 02:38 by uplooking (this user has been deleted)
Note: the author has been banned or deleted; the content was automatically hidden.