
Chinaunix

Thread starter: yftty

浩存 - a clustered storage system for massive data such as databases and virtual machines, providing simultaneous NFS/iSCSI access

#91 | Posted 2005-06-16 00:13

Re: Distributed file system project for mass storage (mail, search, network drives) on Unix

Drawn as a Gantt chart, it should look something like this:

  |
  |  R&D ->
  |    ^ Development ->
  |       ^ QA ->
  |          ^ Operations (maintenance) ->
----------------------------

So you must be very familiar with FC, SCSI, iSCSI, NFS, Samba, caching, RAID, and the like.

#92 | Posted 2005-06-16 11:36

Re: Distributed file system project for mass storage (mail, search, network drives) on Unix

"帖子总数发表于: 2005-06-16 00:13    发表主题: To: raidcracker"

Thread starter, you were still up that late?

#93 | Posted 2005-06-16 11:41

Re: Distributed file system project for mass storage (mail, search, network drives) on Unix

By the way, is yf your real name's acronym?

#94 | Posted 2005-06-16 14:49

Re: Distributed file system project for mass storage (mail, search, network drives) on Unix

Originally posted by "BigMonkey": By the way, is yf your real name's acronym?


Hehe, the secret answer is 'yes'. And yftty is short for 'yf before a tty'.

By the way, the FS also needs to provide a PHP interface, so please share any SWIG experience if you have some or would like to.
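A rough sketch only, with entirely made-up names (dfs_open and friends are placeholders, not anything in this project): the usual approach is to expose a small, flat C API and let SWIG generate the PHP wrapper from a header of this shape.

/* Hypothetical C API for a PHP binding via SWIG; none of these names
 * exist in the project discussed in this thread.  A SWIG interface file
 * would declare a module, %include a header like this one, and SWIG
 * would then emit the PHP glue code. */
#include <stddef.h>

typedef struct dfs_file dfs_file;        /* opaque file handle */

dfs_file *dfs_open(const char *path, int flags);
long      dfs_read(dfs_file *f, void *buf, size_t len);
long      dfs_write(dfs_file *f, const void *buf, size_t len);
int       dfs_close(dfs_file *f);

Keeping the exported surface this small (opaque handles, no callbacks, no complex structs) is what tends to make the generated bindings painless.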

#95 | Posted 2005-06-16 16:49

Re: Distributed file system project for mass storage (mail, search, network drives) on Unix

原帖由 "yftty" 发表:
按照甘特图的画法, 应该是这个样子的吧

  |
  |  研发 ->;
  |    ^ 开发 ->;
  |       ^ QA ->;
  |          ^ 运营(维护) ->;
----------------------------

那你对 FC  SCSI  iSCSI NFS ..........


These are the tools I make my living with, so I have no choice but to know them well. The thread starter's breadth of expertise is enviable.

#96 | Posted 2005-06-17 10:58

Re: Distributed file system project for mass storage (mail, search, network drives) on Unix

http://lwn.net/Articles/136579/

The second version of Oracle's cluster filesystem has been in the works for some time. There has been a recent increase in cluster-related code proposed for inclusion into the mainline, so it was not entirely surprising to see an OCFS2 patch set join the crowd. These patches have found their way directly into the -mm tree for those wishing to try them out.

As a cluster filesystem, OCFS2 carries rather more baggage than a single-node filesystem like ext3. It does have, at its core, an on-disk filesystem implementation which is heavily inspired by ext3. There are some differences, though: it is an extent-based filesystem, meaning that files are represented on-disk in large, contiguous chunks. Inode numbers are 64 bits. OCFS2 does use the Linux JBD layer for journaling, however, so it does not need to bring along much of its own journaling code.

To actually function in a clustered mode, OCFS2 must have information about the cluster in which it is operating. To that end, it includes a simple node information layer which holds a description of the systems which make up the cluster. This data structure is managed from user space via configfs; the user-space tools, in turn, take the relevant information from a single configuration file (/etc/ocfs2/cluster.conf). It is not enough to know which nodes should be part of a cluster, however: these nodes can come and go, and the filesystem must be able to respond to these events. So OCFS2 also includes a simple heartbeat implementation for monitoring which nodes are actually alive. This code works by setting aside a special file; each node must write a block to that file (with an updated time stamp) every so often. If a particular block stops changing, its associated node is deemed to have left the cluster.
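The heartbeat idea is simple enough to sketch. The fragment below is illustrative only, with an invented on-disk layout rather than OCFS2's actual code: each node periodically rewrites its own block with a fresh sequence number and timestamp, and a monitor treats a node whose block has stopped changing as having left the cluster.

/* Illustrative heartbeat sketch (invented layout, not the OCFS2 code). */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define HB_BLOCK_SIZE 512

struct hb_block {
    uint64_t seq;        /* monotonically increasing sequence number */
    uint64_t timestamp;  /* wall-clock time of the last update */
    char     pad[HB_BLOCK_SIZE - 2 * sizeof(uint64_t)];
};

/* Periodically called by node 'node_id': rewrite its block in the shared
 * heartbeat file with a fresh sequence number. */
static int hb_tick(FILE *hb_file, int node_id, uint64_t *seq)
{
    struct hb_block blk;

    memset(&blk, 0, sizeof(blk));
    blk.seq = ++(*seq);
    blk.timestamp = (uint64_t)time(NULL);

    if (fseek(hb_file, (long)node_id * HB_BLOCK_SIZE, SEEK_SET) != 0)
        return -1;
    if (fwrite(&blk, sizeof(blk), 1, hb_file) != 1)
        return -1;
    return fflush(hb_file);
}

/* Monitoring side: a node is alive if its block changed since the last scan. */
static int hb_node_alive(const struct hb_block *prev, const struct hb_block *cur)
{
    return cur->seq != prev->seq;
}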

Another important component is the distributed lock manager. OCFS2 includes a lock manager which, like the implementation covered last week, is called "dlm" and implements a VMS-like interface. Oracle's implementation is simpler, however (its core locking function only has eight parameters...), and it lacks many of the fancier lock types and functions of Red Hat's implementation. There is also a virtual filesystem interface ("dlmfs") making locking functionality available to user space.
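For flavor, a VMS-style lock request has roughly the shape sketched below. This prototype is invented for illustration (Oracle's actual entry point differs and, as noted, takes eight parameters): the caller names a lock resource, asks for a mode, and is told asynchronously, via a callback, when the lock has been granted.

/* Invented sketch of a VMS-style lock request; not the real dlm API. */
enum dlm_mode {
    DLM_MODE_NL,   /* null: placeholder lock, no access rights */
    DLM_MODE_PR,   /* protected read: shared with other readers */
    DLM_MODE_EX,   /* exclusive: no other holders allowed */
};

typedef void (*dlm_ast_t)(void *astarg);   /* completion callback ("AST") */

int dlm_lock_resource(const char *name,    /* name of the lock resource */
                      enum dlm_mode mode,  /* requested lock mode */
                      int flags,           /* e.g. convert an existing lock */
                      void *lksb,          /* lock status block, filled in on grant */
                      dlm_ast_t ast,       /* invoked when the lock is granted */
                      void *astarg);       /* argument handed to the callback */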

There is a simple, TCP-based messaging system which is used by OCFS2 to talk between nodes in a cluster.

The remaining code is the filesystem implementation itself. It has all of the complications that one would expect of a high-performance filesystem implementation. OCFS2, however, is meant to operate with a disk which is, itself, shared across the cluster (perhaps via some sort of storage-area network or multipath scheme). So each node on the cluster manipulates the filesystem directly, but they must do so in a way which avoids creating chaos. The lock manager code handles much of this - nodes must take out locks on on-disk data structures before working with them.

There is more to it than that, however. There is, for example, a separate "allocation area" set aside for each node in the cluster; when a node needs to add an extent to a file, it can take it directly from its own allocation area and avoid contending with the other nodes for a global lock. There are also certain operations (deleting and renaming files, for example) which cannot be done by a node in isolation. It would not do for one node to delete a file and recycle its blocks if that file remains open on another node. So there is a voting mechanism for operations of this type; a node wanting to delete a file first requests a vote. If another node vetoes the operation, the file will remain for the time being. Either way, all nodes in the cluster can note that the file is being deleted and adjust their local data structures accordingly.
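The voting step boils down to something like the sketch below, with hypothetical helpers standing in for the real network messaging: the node that wants to unlink asks every other node, and any node that still holds the inode open vetoes the operation.

/* Illustrative "vote before unlink" sketch; names are hypothetical. */
#include <stdbool.h>

struct cluster;   /* opaque handle for the other nodes in the cluster */

/* Assumed helpers for this sketch; in reality these are network requests. */
int  cluster_node_count(const struct cluster *c);
bool node_has_inode_open(int node, unsigned long long ino);

static bool vote_on_unlink(const struct cluster *c, int self, unsigned long long ino)
{
    int n = cluster_node_count(c);

    for (int node = 0; node < n; node++) {
        if (node == self)
            continue;
        if (node_has_inode_open(node, ino))
            return false;   /* veto: the file is still open elsewhere */
    }
    return true;            /* everyone agreed; the blocks may be recycled */
}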

The code base as a whole was clearly written with an eye toward easing the path into the mainline kernel. It adheres to the kernel's coding standards and avoids the use of glue layers between the core filesystem code and the kernel. There are no changes to the kernel's VFS layer. Oracle's developers also appear to understand the current level of sensitivity about the merging of cluster support code (node and lock managers, heartbeat code) into the kernel. So they have kept their implementation of these functionalities small and separate from the filesystem itself. OCFS2 needs a lock manager now, for example, so it provides one. But, should a different implementation be chosen for merging at some future point, making the switch should not be too hard.

One assumes that OCFS2 will be merged at some point; adding a filesystem is not usually controversial if it is implemented properly and does not drag along intrusive VFS-layer changes. It is only one of many cluster filesystems, however, so it is unlikely to be alone. The competition in the cluster area, it seems, is just beginning.

#97 | Posted 2005-06-17 11:05

Re: Distributed file system project for mass storage (mail, search, network drives) on Unix

http://lwn.net/Articles/136579/

Plan 9 started as Ken Thompson and Rob Pike's attempt to address a number of perceived shortcomings in the Unix model. Among other things, Plan 9 takes the "everything is a file" approach rather further than Unix does, and tries to do so in a distributed manner. Plan 9 never took off the way Unix did, but it remains an interesting project; it has been free software since 2003.

One of the core components of Plan 9 is the 9P filesystem. 9P is a networked filesystem, somewhat equivalent to NFS or CIFS, but with its own particular approach. 9P is not as much a way of sharing files as a protocol definition aimed at the sharing of resources in a networked environment. There is a draft RFC available which describes this protocol in detail.

The protocol is intentionally simple. It works in a connection-oriented, single-user mode, much like CIFS; each user on a Plan 9 system is expected to make one or more connections to the server(s) of interest. Plan 9 operates with per-user namespaces by design, so each user ends up with a unique view of the network. There is a small set of operations supported by 9P servers; a client can create file descriptors, use them to navigate around the filesystem, read and write files, create, rename and delete files, and close things down; that's about it.
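That small operation set is essentially the list of T-messages in the 9P2000 draft, each of which has a matching R-reply. The enum below is only a paraphrase for orientation; the wire constants and exact field layouts are defined in the protocol document.

/* The 9P2000 request vocabulary, paraphrased; see the draft RFC for the
 * authoritative message numbers and encodings. */
enum p9_request {
    P9_TVERSION,   /* negotiate protocol version and maximum message size */
    P9_TAUTH,      /* optional authentication handshake */
    P9_TATTACH,    /* attach to a file tree, obtaining a root fid */
    P9_TWALK,      /* walk a fid down a sequence of path elements */
    P9_TOPEN,      /* open the file a fid refers to */
    P9_TCREATE,    /* create a file in the directory a fid refers to */
    P9_TREAD,      /* read 'count' bytes at an offset */
    P9_TWRITE,     /* write 'count' bytes at an offset */
    P9_TCLUNK,     /* release a fid (close) */
    P9_TREMOVE,    /* remove the file and clunk the fid */
    P9_TSTAT,      /* retrieve file metadata */
    P9_TWSTAT,     /* modify file metadata */
    P9_TFLUSH,     /* abort an outstanding request */
};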

The protocol is intentionally independent of the underlying transport mechanism. Typically, a TCP connection is used, but that is not required. A 9P client can, with a proper implementation, communicate with a server over named pipes, zero-copy memory transports, RDMA, RFC1149 avian links, etc. The protocol also puts most of the intelligence on the server side; clients, for example, perform no caching of data. An implication of all these choices is that there is no real reason why 9P servers have to be exporting filesystems at all. A server can just as easily offer a virtual filesystem (along the lines of /proc or sysfs), transparent remote access to devices, connections to remote processes, or just about anything else. The 9P protocol is the implementation of the "everything really is a file" concept. It could thus be used in a similar way as the filesystems in user space (FUSE) mechanism currently being considered for merging. 9P also holds potential as a way of sharing resources between virtualized systems running on the same host.

There is a 9P implementation for Linux, called "v9fs"; Eric Van Hensbergen has recently posted a v9fs patch set for review with an eye toward eventual inclusion. v9fs is a full 9P client implementation; there is also a user-space server available via the v9fs web site.

Linux and Plan 9 have different ideas of how a filesystem should work, so a fair amount of impedance matching is required. Unix-like systems prefer filesystems to be mounted in a global namespace for all users, while Plan 9 filesystems are a per-user resource. A v9fs filesystem can be used in either mode, though the most natural way is to use Linux namespaces to allow each user to set up independently authenticated connections. The lack of client-side caching does not mix well with the Linux VFS, which wants to cache heavily. The current v9fs implementation disables all of this caching. In some areas, especially write performance, this lack of caching makes itself felt. In others, however, v9fs claims better performance than NFS as a result of its simpler protocol. Plan 9 also lacks certain Unix concepts - such as symbolic links. To ease interoperability with Unix systems, a set of protocol extensions has been provided; v9fs uses those extensions where indicated.

The current release is described as "reasonably stable." The basic set of file operations has been implemented, with the exception of mmap(), which is hard to do in a way which does not pose the risk of system deadlocks. Future plans include "a more complete security model" and some thought toward implementing limited client-side caching, perhaps by using the CacheFS layer. See the patch introduction for pointers to more information, mailing lists, etc.



(Posted Jun 6, 2005 16:53 UTC (Mon) by guest stfn)

The design philosophy shares something with the recently popular "REpresentational State Transfer" style of web services. They each chose one unifying metaphor and a minimal interface: either everything is a file and accessed through file system calls, or everything is a resource and accessed through HTTP methods on a URL.

That might be a naive simplification*, but others have observed the same:

http://www.xent.com/pipermail/fork/2001-August/002801.html
http://rest.blueoxen.net/cgi-bin/wiki.pl?RestArchitectura...

* It's only one aspect of the design and, on the other hand, there's all kinds of caching in the web and URIs if not URLs are meant to form a global namespace that all users share.

#98 | Posted 2005-06-17 11:15

Re: Distributed file system project for mass storage (mail, search, network drives) on Unix

http://lwn.net/Articles/100321/

Many filesystems operate with a relatively slow backing store. Network filesystems are dependent on a network link and a remote server; obtaining a file from such a filesystem can be significantly slower than getting the file locally. Filesystems using slow local media (such as CDROMs) also tend to be slower than those using fast disks. For this reason, it can be desirable to cache data from these filesystems on a local disk.

Linux, however, has no mechanism which allows filesystems to perform local disk caching. Or, at least, it didn't have such a mechanism; David Howells's CacheFS patch changes that.

With CacheFS, the system administrator can set aside a partition on a block device for file caching. CacheFS will then present an interface which may be used by other filesystems. There is a basic registration interface, and a fairly elaborate mechanism for assigning an index to each file. Different filesystems will have different ways of creating identifiers for files, so CacheFS tries to impose as little policy as possible and let the filesystem code do what it wants. Finally, of course, there is an interface for caching a page from a file, noting changes, removing pages from the cache, etc.
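As a rough mental model only (these names and structures are invented for this note; the actual CacheFS patch defines its own, different interface), the shape of such an API is: a filesystem registers itself and says how to build index keys for its files, then reads, writes, and invalidates cached pages through per-file handles.

/* Hypothetical cache interface sketch, not the real CacheFS API. */
#include <stddef.h>

struct cachefs_cache;    /* a cache backed by a dedicated block device */
struct cachefs_object;   /* one cached file, identified by an index key */

/* Registration: the netfs supplies a way to build a stable, opaque key
 * for each file (an NFS filehandle, an AFS fid, ...). */
struct cachefs_netfs_ops {
    int (*make_key)(void *netfs_file, void *key, size_t *keylen);
};

int cachefs_register_netfs(const char *name, const struct cachefs_netfs_ops *ops);

/* Per-file handle lookup and page-level operations. */
struct cachefs_object *cachefs_lookup(struct cachefs_cache *cache,
                                      const void *key, size_t keylen);
int  cachefs_read_page(struct cachefs_object *obj, unsigned long index, void *page);
int  cachefs_write_page(struct cachefs_object *obj, unsigned long index, const void *page);
void cachefs_invalidate_page(struct cachefs_object *obj, unsigned long index);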

CacheFS does not attempt to cache entire files; it must be able to deal with the possibility that somebody will try to work with a file which is bigger than the entire cache. It also does not actually guarantee to cache anything; it must be able to perform its own space management, and things must still function even in the absence of an actual cache device. This should not be an obstacle for most filesystems which, by their nature, must be prepared to deal with the real source for their files in the first place.

CacheFS is meant to work with other filesystems, rather than being used as a standalone filesystem in its own right. Its partitions must be mounted before use, however, and CacheFS uses the mount point to provide a view into the cached filesystem(s). The administrator can even manually force files out of the cache by simply deleting them from the mounted filesystem.

Interposing a cache between the user and the real filesystem clearly adds another failure point which could result in lost data. CacheFS addresses this issue by performing journaling on the cache contents. If things come to an abrupt halt, CacheFS will be able to replay any lost operations once everything is up and functioning again.

The current CacheFS patch is used only by the AFS filesystem, but work is in progress to adapt others as well. NFS, in particular, should benefit greatly from CacheFS, especially when NFSv4 (which is designed to allow local caching) is used. Expect this patch to have a relatively easy journey into the mainstream kernel. For those wanting more information, see the documentation file included with the patch.

  CacheFS & Security
(Posted Sep 2, 2004 16:41 UTC (Thu) by subscriber scripter)

I wonder what the security implications of CacheFS are. Does each file inherit the permissions of the original? Is confidentiality a problem? What if you want to securely erase a file?

  CacheFS & Security
(Posted Sep 3, 2004 19:49 UTC (Fri) by subscriber hppnq)

Not knowing anything about CacheFS internals, I would say these are cases of "don't do it, then".

  CacheFS & Security
(Posted Sep 13, 2004 18:49 UTC (Mon) by guest AnswerGuy)

The only difference between accessing a filesystem directly and through CacheFS should be that CacheFS can store copies of the accessed data on a local block device. In other words, there is a (potentially persistent) footprint of all accesses.

Other than that CacheFS should preserve the same permissions semantics as if a given user/host were accessing the backend filesystem/service directly.

  A general caching filesystem
(Posted Sep 14, 2004 2:13 UTC (Tue) by subscriber xoddam)

This seems to me like a really complicated reimplementation of virtual memory.

All filesystems already use VM pages for caching, don't they? I'd have thought that attaching backing store to those pages would have been a much simpler task than writing a whole new cache interface.

But then I'm not really a filesystem hacker.

  A general caching filesystem
(Posted Oct 25, 2004 0:55 UTC (Mon) by subscriber jcm)

xoddam writes:

> This seems to me like a really complicated reimplementation of
> virtual memory.

No, it's really not. By "virtual memory" you are referring to an aspect of VM implementations known as paging, and that in itself only really impacts so-called "anonymous memory". There is a page cache for certain regular filesystems, but it is not possible for all filesystems to exploit the page cache to full effect, and in any case this patch adds the ability to use a local disk as additional cache storage for even slower things like network-mounted filesystems - so the page cache can always sit between this disk and the user processes which use it.

Jon.

  Improve "Laptop mode"
(Posted Oct 7, 2004 18:57 UTC (Thu) by subscriber BrucePerens)

I haven't looked at the CacheFS code yet, but this is what I would like to do with it, or something like it.

Put a cache filesystem on a FLASH disk plugged into my laptop. My laptop has a 512M MagicGate card, which looks like a USB disk. Use it to cache all recently read and written blocks from the hard disk, and allow the hard disk to remain spun down most of the time. Anytime the disk has to be spun up, flush any pending write blocks to it.

This would be an improvement over "laptop mode" in that it would not require system RAM and could thus be larger, and would not be as volatile as a RAM write cache.

Bruce

#99 | Posted 2005-06-20 14:07

Re: Distributed file system project for mass storage (mail, search, network drives) on Unix

1. Introduction to the BeOS and BFS

1.1 History leading up to BFS

The Solution

Starting in September 1996, Cyril Meurillon and I set about to define a new I/O architecture and file system for BeOS. We knew that the existing split of file system and database would no longer work. We wanted a new, high-performance file system that supported the database functionality the BeOS was known for as well as a mechanism to support multiple file systems. We also took the opportunity to clean out some of the accumulated cruft that had worked its way into the system over the course of the previous five years of development.

The task we had to solve had two very clear components. First there was the higher-level file system and device interface. This half of the project involved defining an API for file system and device drivers, managing the name space, connecting program requests for files into file descriptors, and managing all the associated state. The second half of the project involved writing a file system that would provide the functionality required by the rest of the BeOS. Cyril, being the primary kernel architect at Be, took on the first portion of the task. The most difficult portion of Cyril's project involved defining the file system API in such a way that it was as multithreaded as possible, correct, deadlock-free, and efficient. That task involved many major iterations as we battled over what a file system had to do and what the kernel layer would manage. There is some discussion of this level of the file system in Chapter 10, but it is not the primary focus of this book.

My half of the project involved defining the on-disk data structures, managing all the nitty-gritty physical details of the raw disk blocks, and performing the I/O requests made by programs. Because the disk block cache is intimately intertwined with the file system (especially a journaled file system), I also took on the task of rewriting the block cache.

1.2 Design Goals

...

In addition to the above design goals, we had the long-standing goals of making the system as multithreaded and as efficient as possible, which meant fine-grained locking everywhere and paying close attention to the overhead introduced by the file system. Memory usage was also a big concern. ...

1.3 Design Constraints

There were also several design constraints that the project had to contend with. The first and foremost was the lack of engineering resources. The Be engineering staff is quite small, at the time only 13 engineers. Cyril and I had to work alone because everyone else was busy with other projects. We also did not have very much time to complete the project. Be, Inc., tries to have regular software releases, once every four to six months. The initial target was for the project to take six months. The short amount of time to complete the project and the lack of engineering resources meant that there was little time to explore different designs and to experiment with completely untested ideas. In the end it took nine months for the first beta release of BFS. The final version of BFS shipped the following month.

2. What is a File System ?

2.1 The Fundamentals

It is important to keep in mind the abstract goal of what a file system must achieve: to store, retrieve, locate, and manipulate information. Keeping the goal stated in general terms frees us to think of alternative implementations and possibilities that might not otherwise occur if we were to only think of a file system as a typical, strictly hierarchical, disk-based structure.

...

2.3 The Abstractions

...

Extents

Another technique to manage mapping from logical positions in a byte stream to data blocks on disk is to use extent lists. An extent list is similar to the simple block list described previously except that each block address is not just for a single block but rather for a range of blocks. That is, every block address is given as a starting block and a length (expressed as the number of successive blocks following the starting block). The size of an extent is usually larger than a simple block address but is potentially able to map a much larger region of disk space.

...

Although extent lists are a more compact way to refer to large amounts of data, they may still require use of indirect or double-indirect blocks. If a file system becomes highly fragmented and each extent can only map a few blocks of data, then the use of indirect and double-indirect blocks becomes a necessity. One disadvantage to using extent lists is that locating a specific file position may require scanning a large number of extents. Because the length of an extent is variable, when locating a specific position the file system must start at the first extent and scan through all of them until it finds the extent that covers the position of interest. ...
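A minimal sketch of that lookup, assuming an in-memory array of extents (illustrative, not any particular on-disk format): because extent lengths vary, the search accumulates lengths from the front of the list until it reaches the extent covering the requested logical block.

#include <stdint.h>
#include <stddef.h>

/* One extent: a run of 'length' consecutive blocks starting at 'start'. */
struct extent {
    uint64_t start;    /* first physical block of the run */
    uint32_t length;   /* number of consecutive blocks in the run */
};

/* Map a logical block number to a physical block; returns 0 on success. */
static int extent_lookup(const struct extent *list, size_t nextents,
                         uint64_t logical, uint64_t *physical)
{
    uint64_t pos = 0;  /* logical block at which the current extent begins */

    for (size_t i = 0; i < nextents; i++) {
        if (logical < pos + list[i].length) {
            *physical = list[i].start + (logical - pos);
            return 0;
        }
        pos += list[i].length;
    }
    return -1;  /* the position lies beyond the mapped extents */
}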

Storing Directory Entries

...

Another method of organizing directory entries is to use a sorted data structure suitable for on-disk storage. One such data structure is a B-tree (or its variants, B+tree and B*tree). A B-tree keeps the keys sorted by their name and is efficient at looking up whether a key exists in the directory. B-trees also scale well and are able to deal efficiently with directories that contain many tens of thousands of files.

2.5 Extended File System Operations

...

Indexing

File attributes allow users to associate additional information with files, but there is even more that a file system can do with extended file attributes to aid users in managing and locating their information, if the file system also indexes the attributes. For example, if we added a *keyword* attribute to a set of files and the *keyword* attribute was indexed, the user could then issue queries asking which files contained various keywords, regardless of their location in the hierarchy.

When coupled with a good query language, indexing offers a powerful alternative interface to the file system. With queries, users are not restricted to navigating a fixed hierarchy of files; instead they can issue queries to find the working set of files they would like to see, regardless of the location of the files.
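Conceptually, an attribute index is a persistent map from attribute values to the files that carry them, so a query consults the index instead of walking the directory tree. The sketch below is illustrative only (a flat array for brevity, rather than the on-disk B-tree a real file system would use).

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* One posting in a hypothetical attribute index: it maps an attribute
 * value (e.g. keyword == "storage") to the inode of a file carrying it. */
struct index_entry {
    char     value[64];   /* attribute value, used as the lookup key */
    uint64_t inode;       /* file that has this attribute value */
};

/* Answer a query like  keyword == "storage"  without touching the
 * directory hierarchy at all. */
static const struct index_entry *index_lookup(const struct index_entry *entries,
                                              size_t n, const char *value)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(entries[i].value, value) == 0)
            return &entries[i];
    return NULL;
}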

Journaling/Logging

Avoiding corruption in a file system is a difficult task. Some file systems go to great lengths to avoid corruption problems. They may attempt to order disk writes in such a way that corruption is recoverable, or they may force operations that can cause corruption to be synchronous so that the file system is always in a known state. Still other systems simply avoid the issue and depend on a very sophisticated file system check program to recover in the event of failures. All of these approaches must check the disk at boot time, a potentially lengthy operation (especially as disk sizes increase). Further, should a crash happen at an inopportune time, the file system may still be corrupt.

A more modern approach to avoiding corruption is *journaling*. Journaling, a technique borrowed from the database world, avoids corruption by batching groups of changes and committing them all at once to a transaction log. The batched changes guarantee the atomicity of multiple changes. That atomicity guarantee allows the file system to guarantee that operations either happen completely or not at all. Further, if a crash does happen, the system need only replay the transaction log to recover the system to a known state. Replaying the log is an operation that takes at most a few seconds, which is considerably faster than the file system check that nonjournaled file systems must make.
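In API terms the pattern reads roughly as below. The names are invented for illustration (BFS and Linux's JBD each define their own interfaces): every block modified by one logical operation is recorded in a transaction, the whole batch is written to the log before any of it touches its final location, and replay at mount time finishes or discards whole transactions.

/* Hypothetical journaling interface sketch. */
#include <stdint.h>
#include <stddef.h>

struct journal;       /* the on-disk log area plus its in-memory state */
struct transaction;   /* one atomic batch of related changes */

/* Start a batch; everything added before commit succeeds or fails together. */
struct transaction *txn_begin(struct journal *j);

/* Record that 'len' bytes at disk offset 'off' will become 'data'. */
int txn_log_change(struct transaction *t, uint64_t off,
                   const void *data, size_t len);

/* Write the batch to the log and flush it, then apply the changes to their
 * real locations; the log entry is retired once that completes. */
int txn_commit(struct transaction *t);

/* Mount-time recovery: replay transactions that were committed to the log
 * but never fully applied, and drop partially written ones. */
int journal_replay(struct journal *j);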

Guaranteed Bandwidth/Bandwidth Reservation

The desire to guarantee high-bandwidth I/O for multimedia applications drives some file system designers to provide special hooks that allow applications to guarantee that they will receive a certain amount of I/O bandwidth (within the limits of the hardware). To accomplish this the file system needs a great deal of knowledge about the capabilities of the underlying hardware it uses and must schedule I/O requests. This problem is nontrivial and still an area of research.

Access Control Lists

Access Control Lists (ACLs) provide an extended mechanism for specifying who may access a file and how they may access it. The traditional POSIX approach of three sets of permissions - for the owner of a file, the group that the owner is in, and everyone else - is not sufficient in some settings. An access control list specifies the exact level of access that any person may have to a file. This allows for fine-grained control over access to a file, in comparison to the broad divisions defined in the POSIX security model.
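For concreteness, an ACL is typically stored as a small list of (principal, permissions) entries attached to the file. The layout below is illustrative only, not any particular file system's on-disk format.

#include <stdint.h>

enum acl_who { ACL_USER, ACL_GROUP, ACL_EVERYONE };

struct acl_entry {
    enum acl_who who;    /* which principal the entry applies to */
    uint32_t     id;     /* uid or gid when 'who' is ACL_USER / ACL_GROUP */
    uint16_t     perms;  /* permission bits: read, write, execute, ... */
};

struct acl {
    uint16_t         count;      /* number of entries that follow */
    struct acl_entry entries[];  /* much finer-grained than owner/group/other */
};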
#100 | Posted 2005-06-21 02:38 by uplooking (this user has been deleted)
Note: the author has been banned or deleted; the content was automatically hidden.