浩存 (Haocun) - a clustered storage system providing simultaneous NFS/iSCSI access to massive data such as databases and virtual machines

#131 | Posted 2005-07-18 19:22

Re: A distributed file system project under Unix for massive storage (mail, search, network disks, etc.)

Originally posted by "soway":
I've also started looking at GFS recently (Red Hat's GFS), because my environment involves cluster computing.
...... Reading large numbers of small files is very slow, because every computation has to fetch its files from the NFS server; the files go over the network through the NFS service, so reads run at only about one tenth of local speed.

#132 | Posted 2005-07-18 19:56

An introduction to PVFS

PVFS description:
http://parlweb.parl.clemson.edu/pvfs/desc.html

PC clusters are steadily gaining ground as a parallel computing platform, and the demand for software on this platform is growing with them. Today's clusters already offer many effective software components for parallel computing, such as reliable operating systems, local storage systems, and message-passing systems. Parallel I/O, however, remains a limiting factor for cluster software.

The Parallel Virtual File System (PVFS) project provides a high-performance, scalable parallel file system for Linux clusters. PVFS is open source and is released under the GNU General Public License. It requires neither special hardware nor kernel modifications. PVFS provides four key features:
* a consistent name space across the whole cluster;
* support for existing system access methods;
* data distributed across the disks of different cluster nodes;
* high-performance data access for applications.

To be easy to install and use, PVFS must present a name space that is consistent across the cluster and must match the access habits users already have. A PVFS file system is mounted at the same directory on every node, so all nodes see and access the same set of PVFS files through the same configuration, and familiar tools such as ls, cp and rm work on PVFS files and directories.

To provide high performance when many clients access file system data, PVFS stripes the data across many cluster nodes, so applications can fetch it over the network along several different paths. This removes single I/O-path bottlenecks and increases the potential aggregate bandwidth available to many clients.

The traditional system-call mechanism gives applications convenient access to data files on different file systems, but it goes through the kernel. With PVFS, applications can instead link against the native PVFS library (API) and access the file system directly: the library talks to the PVFS servers itself rather than passing requests through the kernel, and it can be used alongside other libraries.
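
To make the striping idea concrete, here is a minimal C sketch of how a striped logical file offset could be mapped to an I/O server and a local offset on that server. The stripe size, server count, and function name are illustrative assumptions, not PVFS's actual defaults or API.

    #include <stdio.h>
    #include <stdint.h>

    /* Illustrative striping parameters -- not PVFS's real defaults. */
    #define STRIPE_SIZE   (64 * 1024)   /* bytes per stripe unit */
    #define SERVER_COUNT  4             /* number of I/O servers */

    /* Map a logical file offset to (server index, offset in that server's file). */
    static void map_offset(uint64_t file_off, int *server, uint64_t *local_off)
    {
        uint64_t stripe_idx = file_off / STRIPE_SIZE;  /* which stripe unit      */
        uint64_t in_stripe  = file_off % STRIPE_SIZE;  /* offset inside the unit */

        *server    = (int)(stripe_idx % SERVER_COUNT);
        /* Each server stores every SERVER_COUNT-th stripe unit back to back. */
        *local_off = (stripe_idx / SERVER_COUNT) * STRIPE_SIZE + in_stripe;
    }

    int main(void)
    {
        int srv;
        uint64_t loff;

        map_offset(300 * 1024, &srv, &loff);
        printf("logical offset 300 KiB -> server %d, local offset %llu\n",
               srv, (unsigned long long)loff);
        return 0;
    }

Because consecutive stripe units land on different servers, a large sequential read fans out across all of the I/O servers at once, which is where the aggregate-bandwidth gain described above comes from.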

#133 | Posted 2005-07-20 10:39

On Tue, 2005-07-19 at 21:16 -0500, Eric Anderson wrote:
> Bakul Shah wrote:
> [..snip..]
> >> I understand.  Any nudging in the right direction here would be
> >> appreciated.
> >
> >
> > I'd probably start with modelling a single filesystem and how
> > it maps to a sequence of disk blocks (*without* using any
> > code or worrying about details of formats but capturing the
> > essential elements).  I'd describe various operations in
> > terms of preconditions and postconditions.  Then, I'd extend
> > the model to deal with redundancy and so on.  Then I'd model
> > various failure modes. etc.  If you are interested _enough_
> > we can take this offline and try to work something out.  You
> > may even be able to use perl to create an `executable'
> > specification
>
> I've done some research, and read some books/articles/white papers since
> I started this thread.
>
> First, porting GFS might be a more universal effort, and might be
> 'easier'.  However, that doesn't get us a clustered filesystem with BSD
> license (something that sounds good to me).

It has been said it would be a seven man-month effort for an FS expert.

>
> Clustering UFS2 would be cool.  Here's what I'm looking for:

This is exactly how Lustre does its work, though it builds itself on
Ext3; Lustre's targets are described at http://www.lustre.org/docs/SGSRFP.pdf .

>
> A clustered filesystem (or layer?) that allows all machines in the
> cluster to see the same filesystem as if it were local, with read/write
> access.  The cluster will need cache coherency across all nodes, and
> there will need to be some sort of lock manager on each node to
> communicate with all the other nodes to coordinate file locking.  The
> filesystem will have to support journaling.
>
> I'm wondering if one could make a pseudo filesystem something like
> nullfs that sits on top of a UFS2 partition, and essentially monitors
> all VFS operations to the filesystem, and communicates them over TCP/IP
> to the other nodes in the cluster.  That way, each node would know which
> inodes and blocks are changing, so they can flush those buffers, and
> they would know which blocks (or partial blocks) to view as locked as
> another node locks it. This could be done via multicast, so all nodes in
> the cluster would have to be running a distributed lock manager daemon
> (dlmd) that would coordinate this.  I think also that the UFS2
> filesystem would have to have a bit set upon mount that tracked its
> mount as a 'clustered' filesystem mount.  The reason for that is so that
> we could modify mount to only mount 'clustered' filesystems (mount -o
> clustered) if the dlmd was running, since that would be a dependency for
> stable coherent file control on a mount point.
>
> Does anyone have any insight as to whether a layer would work?  Or maybe
> I'm way off here and I need to do more reading.
>
> Eric
>
>
--
yf-263
Unix-driver.org
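
To make the dlmd idea in the quoted mail a little more concrete, here is a minimal C sketch of the kind of notification such a daemon might multicast when a node locks or releases a range of blocks in an inode. The struct layout, multicast group, and port are hypothetical illustrations, not part of any existing implementation.

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    /* Hypothetical lock notification a dlmd could multicast to its peers. */
    struct dlm_msg {
        uint32_t node_id;     /* sending node                            */
        uint32_t op;          /* 1 = lock, 2 = unlock, 3 = invalidate    */
        uint64_t inode;       /* inode number on the shared UFS2 volume  */
        uint64_t blk_start;   /* first block covered by the lock         */
        uint64_t blk_count;   /* number of blocks covered                */
    };

    /* Send one notification to the (made-up) dlmd multicast group. */
    static int dlm_notify(const struct dlm_msg *m)
    {
        struct sockaddr_in grp;
        ssize_t n;
        int s = socket(AF_INET, SOCK_DGRAM, 0);

        if (s < 0)
            return -1;

        memset(&grp, 0, sizeof(grp));
        grp.sin_family      = AF_INET;
        grp.sin_addr.s_addr = inet_addr("239.0.0.42");  /* made-up group */
        grp.sin_port        = htons(4242);              /* made-up port  */

        /* Fields are left in host byte order for brevity; a real protocol
         * would convert them and add versioning, authentication, etc.   */
        n = sendto(s, m, sizeof(*m), 0, (struct sockaddr *)&grp, sizeof(grp));
        close(s);
        return n < 0 ? -1 : 0;
    }

On receipt, each peer would flush or invalidate its cached buffers for that inode's block range before honouring the lock, which is the cache-coherency step the message exists to coordinate.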

#134 | Posted 2005-07-20 11:17

I have my own installation notes for GFS and OGFS,
and my own internal, detailed test reports for PVFS2,
but as I said, there is no concrete platform for exchanging them.
Very few people open this web page every day, and even following it daily doesn't allow real-time discussion and exchange.
I think setting up a mailing list and a discussion group is necessary.

#135 | Posted 2005-07-23 19:52

http://www.lustre.org/docs/dfsprotocols.pdf

Peter J. Braam

School of Computer Science, Carnegie Mellon University

Abstract:
      The protocols used by distributed file systems vary widely. The aim of this talk is to give an overview of these protocols and discuss their applicability for a cluster environment. File systems like NFS have weak semantics, making tight sharing difficult. AFS, Coda and InterMezzo give a great deal of autonomy to cluster members, and involve a persistent file cache for each system. True cluster file systems, such as those found in VMS VAXClusters, XFS and GFS, provide a shared single image, but introduce complex dependencies on cluster membership.

#136 | Posted 2005-07-28 13:17

Originally posted by "beyondsky":
I have my own installation notes for GFS and OGFS,
and my own internal, detailed test reports for PVFS2,
but as I said, there is no concrete platform for exchange.
Very few people open this web page every day, and even following it daily doesn't allow real-time discussion and exchange.
I think a mailing list and a discussion .........


I agree on setting up a mailing list or a QQ group (the latter is probably more widely used in China).
I also dropped by a while back to have a look, and then didn't come again.

I've been very busy these last few days as well, so I've paid even less attention.
Given the current situation, though, for the stability of my own system I will probably still rely on NFS alone.
As for NFS's local write "cache", you can enable async on the NFS server to improve performance (an example export line is sketched below).

Going forward, though, I think storage in commercial computing will have to be clustered, because its bottlenecks and weaknesses are already clearly showing.
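
For reference, this is roughly what the async export looks like in /etc/exports on a Linux NFS server (the path and client network are placeholders); async lets the server acknowledge writes before they reach stable storage, trading durability on a server crash for throughput.

    # /etc/exports -- example only; path and client network are placeholders
    /export/data  192.168.1.0/24(rw,async,no_subtree_check)

After editing the file, exportfs -ra reloads the export table.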

#137 | Posted 2005-08-03 09:55

Originally posted by "soway":


Given the current situation, though, for the stability of my own system I will probably still rely on NFS alone.
As for NFS's local write "cache", you can enable async on the NFS server to improve performance.

Going forward, though, I think storage in commercial computing will have to be clustered, because its bottlenecks and weaknesses are already clearly showing. ..........


The problem with NFS:

From:    Eric Anderson
To:      freebsd-fs@freebsd.org
Subject: Re: Cluster Filesystem for FreeBSD - any interest?

...
Hmm.  I'm not sure if it can or not.  I'll try to explain what I'm
dreaming of.  I currently have about 1000 clients needing access to the
same pools of data (read/write) all the time.  The data changes
constantly.  There is a lot of this data.  We use NFS currently.
FreeBSD is *very* fast and stable at serving NFS data.  The problem is,
that even though it is very fast and stable, I still cannot pump out
enough bits fast enough with one machine, and if that one machine fails
(hardware problems, etc), then all my machines are hung waiting for me
to bring it back online.
...

#138 | Posted 2005-08-03 10:40

My abilities here are limited, so this is just a friendly bump.

#139 | Posted 2005-08-03 14:25

http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/doc/journaling.txt?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=cluster

o  Journaling & Replay

The fundamental problem with a journaled cluster filesystem is
handling journal replay with multiple journals.  A single block of
metadata can be modified sequentially by many different nodes in the
cluster.  As the block is modified by each node, it gets logged in the
journal for each node.  If care is not taken, it's possible to get
into a situation where a journal replay can actually corrupt a
filesystem.  The error scenario is:

1) Node A modifies a metadata block by putting an updated copy into its
   incore log.
2) Node B wants to read and modify the block so it requests the lock
   and a blocking callback is sent to Node A.
3) Node A flushes its incore log to disk, and then syncs out the
   metadata block to its inplace location.
4) Node A then releases the lock.
5) Node B reads in the block and puts a modified copy into its ondisk
   log and then the inplace block location.
6) Node A crashes.

At this point, Node A's journal needs to be replayed.  Since there is
a newer version of block inplace, if that block is replayed, the
filesystem will be corrupted.  There are a few different ways of
avoiding this problem.

1) Generation Numbers (GFS1)

   Each metadata block has a header in it that contains a 64-bit
   generation number.  As each block is logged into a journal, the
   generation number is incremented.  This provides a strict ordering
   of the different versions of the block as they are logged in the FS'
   different journals.  When journal replay happens, each block in the
   journal is not replayed if the generation number in the journal is less
   than the generation number in place.  This ensures that a newer
   version of a block is never replaced with an older version.  So,
   this solution basically allows multiple copies of the same block in
   different journals, but it allows you to always know which is the
   correct one.

   Pros:

   A) This method allows the fastest callbacks.  To release a lock,
      the incore log for the lock must be flushed and then the inplace
      data and metadata must be synced.  That's it.  The sync
      operations involved are: start the log body and wait for it to
      become stable on the disk, synchronously write the commit block,
      start the inplace metadata and wait for it to become stable on
      the disk.

   Cons:

   A) Maintaining the generation numbers is expensive.  All newly
      allocated metadata block must be read off the disk in order to
      figure out what the previous value of the generation number was.
      When deallocating metadata, extra work and care must be taken to
      make sure dirty data isn't thrown away in such a way that the
      generation numbers stop doing their thing.
   B) You can't continue to modify the filesystem during journal
      replay.  Basically, replay of a block is a read-modify-write
      operation: the block is read from disk, the generation number is
      compared, and (maybe) the new version is written out.  Replay
      requires that the R-M-W operation is atomic with respect to
      other R-M-W operations that might be happening (say by a normal
      I/O process).  Since journal replay doesn't (and can't) play by
      the normal metadata locking rules, you can't count on them to
      protect replay.  Hence GFS1 quiesces all writes on a filesystem
      before starting replay.  This provides the mutual exclusion
      required, but it's slow and unnecessarily interrupts service on
      the whole cluster.

2) Total Metadata Sync (OCFS2)

   This method is really simple in that it uses exactly the same
   infrastructure that a local journaled filesystem uses.  Every time
   a node receives a callback, it stops all metadata modification,
   syncs out the whole incore journal, syncs out any dirty data, marks
   the journal as being clean (unmounted), and then releases the lock.
   Because the journal is marked as clean and recovery won't look at any
   of the journaled blocks in it, a valid copy of any particular block
   only exists in one journal at a time, and that journal is always the
   journal of the node that modified it last.

   Pros:

   A) Very simple to implement.
   B) You can reuse journaling code from other places (such as JBD).
   C) No quiesce necessary for replay.
   D) No need for generation numbers sprinkled throughout the metadata.

   Cons:

   A) This method has the slowest possible callbacks.  The sync
      operations are: stop all metadata operations, start and wait for
      the log body, write the log commit block, start and wait for all
      the FS' dirty metadata, write an unmount block.  Writing the
      metadata for the whole filesystem can be particularly expensive
      because it can be scattered all over the disk and there can be a
      whole journal's worth of it.

3) Revocation of a lock's buffers (GFS2)

   This method prevents a block from appearing in more than one
   journal by canceling out the metadata blocks in the journal that
   belong to the lock being released.  Journaling works very similarly
   to a local filesystem or to #2 above.

   The biggest difference is you have to keep track of buffers in the
   active region of the ondisk journal, even after the inplace blocks
   have been written back.  This is done in GFS2 by adding a second
   part to the Active Items List.  The first part (in GFS2 called
   AIL1) contains a list of all the blocks which have been logged to
   the journal, but not written back to their inplace location.  Once
   an item in AIL1 has been written back to its inplace location, it
   is moved to AIL2.  Once the tail of the log moves past the block's
   transaction in the log, it can be removed from AIL2.

   When a callback occurs, the log is flushed to the disk and the
   metadata for the lock is synced to disk.  At this point, any
   metadata blocks for the lock that are in the current active region
   of the log will be in the AIL2 list.  We then build a transaction
   that contains revoke tags for each buffer in the AIL2 list that
   belongs to that lock.

   Pros:

   A) No quiesce necessary for replay
   B) No need for generation numbers sprinkled throughout the
      metadata.
   C) The sync operations are: stop all metadata operations, start and
      wait for the log body, write the log commit block, start and
      wait for all the FS' dirty metadata, start and wait for the log
      body of a transaction that revokes any of the lock's metadata
      buffers in the journal's active region, and write the commit
      block for that transaction.

   Cons:

   A) Recovery takes two passes, one to find all the revoke tags in
      the log and one to replay the metadata blocks using the revoke
      tags as a filter.  This is necessary for a local filesystem and
      the total sync method, too.  It's just that there will probably
      be more tags.

Comparing #2 and #3, both do extra I/O during a lock callback to make
sure that any metadata blocks in the log for that lock will be
removed.  I believe #2 will be slower because syncing out all the
dirty metadata for the entire filesystem requires lots of little,
scattered I/O across the whole disk.  The extra I/O done by #3 is a
log write to the disk.  So, not only should it be less I/O, but it
should also be better suited to get good performance out of the disk
subsystem.

KWP 07/06/05
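
A minimal C sketch of the generation-number replay check described in option 1 above; the struct layout and function names are illustrative, not GFS1's actual on-disk format or code.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative metadata block header carrying a 64-bit generation number. */
    struct meta_header {
        uint64_t generation;
        /* ... rest of the block's metadata ... */
    };

    /*
     * Decide whether a block found in a dead node's journal should be
     * written back to its in-place location during replay.  The journal
     * copy is skipped only when it is older than what is already on disk,
     * so a newer in-place version is never overwritten by an older one.
     * (Equal generations mean the same version, so rewriting is harmless.)
     */
    static bool should_replay(const struct meta_header *journal_copy,
                              const struct meta_header *inplace_copy)
    {
        return journal_copy->generation >= inplace_copy->generation;
    }

Replay then walks each logged block, reads the in-place header, compares, and (maybe) writes the journal copy back, which is exactly the read-modify-write step the text above says must be made atomic with respect to normal I/O.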

#140 | Posted 2005-08-03 17:52

I'm getting dizzy; my eyesight isn't great to begin with, and scrolling through all this has blurred my old eyes.