浩存 (Haocun) - a clustered storage system providing simultaneous NFS/iSCSI access to massive data such as databases and virtual machines

#131 | Posted 2005-07-18 19:22

Re: A distributed file system project under Unix for massive storage (mail, search, network disks, etc.)

Originally posted by "soway":
I've also started looking at GFS recently (Red Hat's GFS), because my environment involves cluster computing.
...... Reading large numbers of small files is very slow, because every computation has to fetch its files from the NFS server; the files go over the network through the NFS service, so reads run at only about one tenth of local speed.

#132 | Posted 2005-07-18 19:56

An introduction to PVFS

PVFS description:
http://parlweb.parl.clemson.edu/pvfs/desc.html

PC clusters are steadily gaining ground as a parallel computing platform, and the demand for software on this platform is growing with them. Today's clusters already offer many effective software components for parallel computing, such as reliable operating systems, local storage systems, and message-passing systems. Parallel I/O, however, remains a limiting factor for cluster software.

The Parallel Virtual File System (PVFS) project provides a high-performance, scalable parallel file system for Linux clusters. PVFS is open source and is released under the GNU General Public License. It requires neither special hardware nor kernel modifications. PVFS provides four key features:
* a consistent name space across the whole cluster;
* support for existing system access methods;
* data distributed across the disks of different cluster nodes;
* high-performance data access for applications.

To be easy to install and use, PVFS must present a name space that is consistent across the cluster and must match the access habits users already have. A PVFS file system is mounted at the same directory on every node, so all nodes see and access the same set of PVFS files through the same configuration, and familiar tools such as ls, cp and rm work on PVFS files and directories.

To provide high performance when many clients access file system data, PVFS stripes the data across many cluster nodes, so applications can fetch it over the network along several different paths. This removes single I/O-path bottlenecks and increases the potential aggregate bandwidth available to many clients.

The traditional system-call mechanism gives applications convenient access to data files on different file systems, but it goes through the kernel. With PVFS, applications can instead link against the native PVFS library (API) and access the file system directly: the library talks to the PVFS servers itself rather than passing requests through the kernel, and it can be used alongside other libraries.
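
To make the striping idea concrete, here is a minimal C sketch of how a striped logical file offset could be mapped to an I/O server and a local offset on that server. The stripe size, server count, and function name are illustrative assumptions, not PVFS's actual defaults or API.

    #include <stdio.h>
    #include <stdint.h>

    /* Illustrative striping parameters -- not PVFS's real defaults. */
    #define STRIPE_SIZE   (64 * 1024)   /* bytes per stripe unit */
    #define SERVER_COUNT  4             /* number of I/O servers */

    /* Map a logical file offset to (server index, offset in that server's file). */
    static void map_offset(uint64_t file_off, int *server, uint64_t *local_off)
    {
        uint64_t stripe_idx = file_off / STRIPE_SIZE;  /* which stripe unit      */
        uint64_t in_stripe  = file_off % STRIPE_SIZE;  /* offset inside the unit */

        *server    = (int)(stripe_idx % SERVER_COUNT);
        /* Each server stores every SERVER_COUNT-th stripe unit back to back. */
        *local_off = (stripe_idx / SERVER_COUNT) * STRIPE_SIZE + in_stripe;
    }

    int main(void)
    {
        int srv;
        uint64_t loff;

        map_offset(300 * 1024, &srv, &loff);
        printf("logical offset 300 KiB -> server %d, local offset %llu\n",
               srv, (unsigned long long)loff);
        return 0;
    }

Because consecutive stripe units land on different servers, a large sequential read fans out across all of the I/O servers at once, which is where the aggregate-bandwidth gain described above comes from.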

#133 | Posted 2005-07-20 10:39

On Tue, 2005-07-19 at 21:16 -0500, Eric Anderson wrote:
> Bakul Shah wrote:
> [..snip..]
> >> I understand.  Any nudging in the right direction here would be
> >> appreciated.
> >
> >
> > I'd probably start with modelling a single filesystem and how
> > it maps to a sequence of disk blocks (*without* using any
> > code or worrying about details of formats but capturing the
> > essential elements).  I'd describe various operations in
> > terms of preconditions and postconditions.  Then, I'd extend
> > the model to deal with redundancy and so on.  Then I'd model
> > various failure modes. etc.  If you are interested _enough_
> > we can take this offline and try to work something out.  You
> > may even be able to use perl to create an `executable'
> > specification
>
> I've done some research, and read some books/articles/white papers since
> I started this thread.
>
> First, porting GFS might be a more universal effort, and might be
> 'easier'.  However, that doesn't get us a clustered filesystem with BSD
> license (something that sounds good to me).

It has been said it would be a seven man-month effort for an FS expert.

>
> Clustering UFS2 would be cool.  Here's what I'm looking for:

This is exactly how Lustre does its work, though it builds itself on
Ext3; Lustre's targets are described at http://www.lustre.org/docs/SGSRFP.pdf .

>
> A clustered filesystem (or layer?) that allows all machines in the
> cluster to see the same filesystem as if it were local, with read/write
> access.  The cluster will need cache coherency across all nodes, and
> there will need to be some sort of lock manager on each node to
> communicate with all the other nodes to coordinate file locking.  The
> filesystem will have to support journaling.
>
> I'm wondering if one could make a pseudo filesystem something like
> nullfs that sits on top of a UFS2 partition, and essentially monitors
> all VFS operations to the filesystem, and communicates them over TCP/IP
> to the other nodes in the cluster.  That way, each node would know which
> inodes and blocks are changing, so they can flush those buffers, and
> they would know which blocks (or partial blocks) to view as locked as
> another node locks it. This could be done via multicast, so all nodes in
> the cluster would have to be running a distributed lock manager daemon
> (dlmd) that would coordinate this.  I think also that the UFS2
> filesystem would have to have a bit set upon mount that tracked its
> mount as a 'clustered' filesystem mount.  The reason for that is so that
> we could modify mount to only mount 'clustered' filesystems (mount -o
> clustered) if the dlmd was running, since that would be a dependency for
> stable coherent file control on a mount point.
>
> Does anyone have any insight as to whether a layer would work?  Or maybe
> I'm way off here and I need to do more reading.
>
> Eric
>
>
--
yf-263
Unix-driver.org
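
To make the dlmd idea in the quoted mail a little more concrete, here is a minimal C sketch of the kind of notification such a daemon might multicast when a node locks or releases a range of blocks in an inode. The struct layout, multicast group, and port are hypothetical illustrations, not part of any existing implementation.

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    /* Hypothetical lock notification a dlmd could multicast to its peers. */
    struct dlm_msg {
        uint32_t node_id;     /* sending node                            */
        uint32_t op;          /* 1 = lock, 2 = unlock, 3 = invalidate    */
        uint64_t inode;       /* inode number on the shared UFS2 volume  */
        uint64_t blk_start;   /* first block covered by the lock         */
        uint64_t blk_count;   /* number of blocks covered                */
    };

    /* Send one notification to the (made-up) dlmd multicast group. */
    static int dlm_notify(const struct dlm_msg *m)
    {
        struct sockaddr_in grp;
        ssize_t n;
        int s = socket(AF_INET, SOCK_DGRAM, 0);

        if (s < 0)
            return -1;

        memset(&grp, 0, sizeof(grp));
        grp.sin_family      = AF_INET;
        grp.sin_addr.s_addr = inet_addr("239.0.0.42");  /* made-up group */
        grp.sin_port        = htons(4242);              /* made-up port  */

        /* Fields are left in host byte order for brevity; a real protocol
         * would convert them and add versioning, authentication, etc.   */
        n = sendto(s, m, sizeof(*m), 0, (struct sockaddr *)&grp, sizeof(grp));
        close(s);
        return n < 0 ? -1 : 0;
    }

On receipt, each peer would flush or invalidate its cached buffers for that inode's block range before honouring the lock, which is the cache-coherency step the message exists to coordinate.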

#134 | Posted 2005-07-20 11:17

I have my own installation notes for GFS and OGFS,
and my own internal, detailed test reports for PVFS2,
but as I said, there is no concrete platform for exchanging them.
Very few people open this web page every day, and even following it daily doesn't allow real-time discussion and exchange.
I think setting up a mailing list and a discussion group is necessary.

#135 | Posted 2005-07-23 19:52

http://www.lustre.org/docs/dfsprotocols.pdf

Peter J. Braam

School of Computer Science, Carnegie Mellon University

Abstract:
      The protocols used by distributed file systems vary widely. The aim of this talk is to give an overview of these protocols and discuss their applicability for a cluster environment. File systems like NFS have weak semantics, making tight sharing difficult. AFS, Coda and InterMezzo give a great deal of autonomy to cluster members, and involve a persistent file cache for each system. True cluster file systems, such as those found in VMS VAXClusters, XFS and GFS, provide a shared single image, but introduce complex dependencies on cluster membership.

#136 | Posted 2005-07-28 13:17

Originally posted by "beyondsky":
I have my own installation notes for GFS and OGFS,
and my own internal, detailed test reports for PVFS2,
but as I said, there is no concrete platform for exchange.
Very few people open this web page every day, and even following it daily doesn't allow real-time discussion and exchange.
I think a mailing list and a discussion .........


I agree on setting up a mailing list or a QQ group (the latter is probably more widely used in China).
I also dropped by a while back to have a look, and then didn't come again.

I've been very busy these last few days as well, so I've paid even less attention.
Given the current situation, though, for the stability of my own system I will probably still rely on NFS alone.
As for NFS's local write "cache", you can enable async on the NFS server to improve performance (an example export line is sketched below).

Going forward, though, I think storage in commercial computing will have to be clustered, because its bottlenecks and weaknesses are already clearly showing.
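
For reference, this is roughly what the async export looks like in /etc/exports on a Linux NFS server (the path and client network are placeholders); async lets the server acknowledge writes before they reach stable storage, trading durability on a server crash for throughput.

    # /etc/exports -- example only; path and client network are placeholders
    /export/data  192.168.1.0/24(rw,async,no_subtree_check)

After editing the file, exportfs -ra reloads the export table.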

#137 | Posted 2005-08-03 09:55

Originally posted by "soway":


Given the current situation, though, for the stability of my own system I will probably still rely on NFS alone.
As for NFS's local write "cache", you can enable async on the NFS server to improve performance.

Going forward, though, I think storage in commercial computing will have to be clustered, because its bottlenecks and weaknesses are already clearly showing. ..........


The problem with NFS:

From:    Eric Anderson
To:      freebsd-fs@freebsd.org
Subject: Re: Cluster Filesystem for FreeBSD - any interest?

...
Hmm.  I'm not sure if it can or not.  I'll try to explain what I'm
dreaming of.  I currently have about 1000 clients needing access to the
same pools of data (read/write) all the time.  The data changes
constantly.  There is a lot of this data.  We use NFS currently.
FreeBSD is *very* fast and stable at serving NFS data.  The problem is,
that even though it is very fast and stable, I still cannot pump out
enough bits fast enough with one machine, and if that one machine fails
(hardware problems, etc), then all my machines are hung waiting for me
to bring it back online.
...

#138 | Posted 2005-08-03 10:40

My abilities here are limited, so this is just a friendly bump.

#139 | Posted 2005-08-03 14:25

http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/doc/journaling.txt?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=cluster

o  Journaling & Replay

The fundamental problem with a journaled cluster filesystem is
handling journal replay with multiple journals.  A single block of
metadata can be modified sequentially by many different nodes in the
cluster.  As the block is modified by each node, it gets logged in the
journal for each node.  If care is not taken, it's possible to get
into a situation where a journal replay can actually corrupt a
filesystem.  The error scenario is:

1) Node A modifies a metadata block by putting an updated copy into its
   incore log.
2) Node B wants to read and modify the block so it requests the lock
   and a blocking callback is sent to Node A.
3) Node A flushes its incore log to disk, and then syncs out the
   metadata block to its inplace location.
4) Node A then releases the lock.
5) Node B reads in the block and puts a modified copy into its ondisk
   log and then the inplace block location.
6) Node A crashes.

At this point, Node A's journal needs to be replayed.  Since there is
a newer version of block inplace, if that block is replayed, the
filesystem will be corrupted.  There are a few different ways of
avoiding this problem.

1) Generation Numbers (GFS1)

   Each metadata block has a header in it that contains a 64-bit
   generation number.  As each block is logged into a journal, the
   generation number is incremented.  This provides a strict ordering
   of the different versions of the block as they are logged in the FS'
   different journals.  When journal replay happens, each block in the
   journal is not replayed if the generation number in the journal is less
   than the generation number in place.  This ensures that a newer
   version of a block is never replaced with an older version.  So,
   this solution basically allows multiple copies of the same block in
   different journals, but it allows you to always know which is the
   correct one.

   Pros:

   A) This method allows the fastest callbacks.  To release a lock,
      the incore log for the lock must be flushed and then the inplace
      data and metadata must be synced.  That's it.  The sync
      operations involved are: start the log body and wait for it to
      become stable on the disk, synchronously write the commit block,
      start the inplace metadata and wait for it to become stable on
      the disk.

   Cons:

   A) Maintaining the generation numbers is expensive.  All newly
      allocated metadata block must be read off the disk in order to
      figure out what the previous value of the generation number was.
      When deallocating metadata, extra work and care must be taken to
      make sure dirty data isn't thrown away in such a way that the
      generation numbers stop doing their thing.
   B) You can't continue to modify the filesystem during journal
      replay.  Basically, replay of a block is a read-modify-write
      operation: the block is read from disk, the generation number is
      compared, and (maybe) the new version is written out.  Replay
      requires that the R-M-W operation is atomic with respect to
      other R-M-W operations that might be happening (say by a normal
      I/O process).  Since journal replay doesn't (and can't) play by
      the normal metadata locking rules, you can't count on them to
      protect replay.  Hence GFS1 quiesces all writes on a filesystem
      before starting replay.  This provides the mutual exclusion
      required, but it's slow and unnecessarily interrupts service on
      the whole cluster.

2) Total Metadata Sync (OCFS2)

   This method is really simple in that it uses exactly the same
   infrastructure that a local journaled filesystem uses.  Every time
   a node receives a callback, it stops all metadata modification,
   syncs out the whole incore journal, syncs out any dirty data, marks
   the journal as being clean (unmounted), and then releases the lock.
   Because the journal is marked as clean and recovery won't look at any
   of the journaled blocks in it, a valid copy of any particular block
   only exists in one journal at a time, and that journal is always the
   journal of the node that modified it last.

   Pros:

   A) Very simple to implement.
   B) You can reuse journaling code from other places (such as JBD).
   C) No quiesce necessary for replay.
   D) No need for generation numbers sprinkled throughout the metadata.

   Cons:

   A) This method has the slowest possible callbacks.  The sync
      operations are: stop all metadata operations, start and wait for
      the log body, write the log commit block, start and wait for all
      the FS' dirty metadata, write an unmount block.  Writing the
      metadata for the whole filesystem can be particularly expensive
      because it can be scattered all over the disk and there can be a
      whole journal's worth of it.

3) Revocation of a lock's buffers (GFS2)

   This method prevents a block from appearing in more than one
   journal by canceling out the metadata blocks in the journal that
   belong to the lock being released.  Journaling works very similarly
   to a local filesystem or to #2 above.

   The biggest difference is you have to keep track of buffers in the
   active region of the ondisk journal, even after the inplace blocks
   have been written back.  This is done in GFS2 by adding a second
   part to the Active Items List.  The first part (in GFS2 called
   AIL1) contains a list of all the blocks which have been logged to
   the journal, but not written back to their inplace location.  Once
   an item in AIL1 has been written back to its inplace location, it
   is moved to AIL2.  Once the tail of the log moves past the block's
   transaction in the log, it can be removed from AIL2.

   When a callback occurs, the log is flushed to the disk and the
   metadata for the lock is synced to disk.  At this point, any
   metadata blocks for the lock that are in the current active region
   of the log will be in the AIL2 list.  We then build a transaction
   that contains revoke tags for each buffer in the AIL2 list that
   belongs to that lock.

   Pros:

   A) No quiesce necessary for replay
   B) No need for generation numbers sprinkled throughout the
      metadata.
   C) The sync operations are: stop all metadata operations, start and
      wait for the log body, write the log commit block, start and
      wait for all the FS' dirty metadata, start and wait for the log
      body of a transaction that revokes any of the lock's metadata
      buffers in the journal's active region, and write the commit
      block for that transaction.

   Cons:

   A) Recovery takes two passes, one to find all the revoke tags in
      the log and one to replay the metadata blocks using the revoke
      tags as a filter.  This is necessary for a local filesystem and
      the total sync method, too.  It's just that there will probably
      be more tags.

Comparing #2 and #3, both do extra I/O during a lock callback to make
sure that any metadata blocks in the log for that lock will be
removed.  I believe #2 will be slower because syncing out all the
dirty metadata for the entire filesystem requires lots of little,
scattered I/O across the whole disk.  The extra I/O done by #3 is a
log write to the disk.  So, not only should it be less I/O, but it
should also be better suited to get good performance out of the disk
subsystem.

KWP 07/06/05
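
A minimal C sketch of the generation-number replay check described in option 1 above; the struct layout and function names are illustrative, not GFS1's actual on-disk format or code.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative metadata block header carrying a 64-bit generation number. */
    struct meta_header {
        uint64_t generation;
        /* ... rest of the block's metadata ... */
    };

    /*
     * Decide whether a block found in a dead node's journal should be
     * written back to its in-place location during replay.  The journal
     * copy is skipped only when it is older than what is already on disk,
     * so a newer in-place version is never overwritten by an older one.
     * (Equal generations mean the same version, so rewriting is harmless.)
     */
    static bool should_replay(const struct meta_header *journal_copy,
                              const struct meta_header *inplace_copy)
    {
        return journal_copy->generation >= inplace_copy->generation;
    }

Replay then walks each logged block, reads the in-place header, compares, and (maybe) writes the journal copy back, which is exactly the read-modify-write step the text above says must be made atomic with respect to normal I/O.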

#140 | Posted 2005-08-03 17:52

I'm getting dizzy; my eyesight isn't great to begin with, and scrolling through all this has blurred my old eyes.