Chinaunix

Subject: 浩存 (Haocun) - a cluster storage system for databases, virtual machines and other massive data, providing both NFS and iSCSI access

Author: yftty    Posted: 2005-05-13 14:41

Google is currently the most influential Web search engine. Out of more than ten thousand cheap PCs it has built a giant Linux cluster that is high-performance, has enormous storage capacity, and is stable and practical.
http://bbs.chinaunix.net/forum/v ... 9&show_type=old

The way its distributed filesystem is implemented, achieving a highly available, high-performance cluster at low cost, is a model of successful parallel-machine design and development, and this strict pursuit of price/performance is worth learning from.

Everyone is invited to take part in this work.

From:         Eric Anderson
To:           FreeBSD Clustering List
Subject:      FreeBSD Clustering wishlist - Was: Introduction & RE: Clustering with Freebsd
Date:         Wed, 11 May 2005 22:45:55 -0500  (Thursday, 11:45 CST)
Mailer:       Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.7) Gecko/20050504


Ok - I'm changing the subject here in an attempt to gather information.

Here's my wishlist:

FreeBSD have a 'native' clustered filesystem.  This is different than
shared media (we already can do that over fiber channel, ggated, soon
iscsi and AOE).  This would allow multiple servers to access the same
data read/write - highly important for load balancing applications like
web servers, mail servers, and NFS servers.

Online growable filesystem.  I know I can growfs a filesystem now, but
doing it online while data is being used is *insanely* useful.  Reiserfs
and Polyserve's FS (a clustered filesystem, not open-source) do this well.

FreeBSD's UFS2 made to do journaling.  There's already someone working
on this.

I believe the above mean that we need a distributed lock manager too, so
might as well add that to my wishlist.

Single filesystem limits set very high - 16TB would be a good minimum.

Vinum/geom (?) made to allow adding a couple more 'disks' - be it a real
scsi device, or another vinum device - to existing vinum's, so I can
extend my vinum stripe, raid, concat, etc to a larger volume size,
without worrying about which disk is where.  I want to stripe mirrors of
raids, and raid striped mirrors of stripes.  I know it sounds crazy, but
I really *do* have uses for all this.

We currently pay lots of money every year (enough to pay an engineer's
salary) for support and maintenance with Polyserve.  They make a good
product (we need it for the clustered filesystem and NFS distributed
lock manager stuff) - I'd much rather see that go to FreeBSD.

Eric         

Author: yftty    Posted: 2005-05-13 14:46
Subject: A distributed filesystem project on Unix for massive storage such as mail, search, and network disks
On Wed, 2005-05-11 at 22:45 -0500, Eric Anderson wrote:
> Ok - I'm changing the subject here in an attempt to gather information.
>
> Here's my wishlist:
>
> FreeBSD have a 'native' clustered filesystem.  This is different than
> shared media (we already can do that over fiber channel, ggated, soon

Yes, the clustered filesystem will not run on a SAN, since that would
drive the cost up.

> iscsi and AOE).  This would allow multiple servers to access the same
> data read/write - highly important for load balancing applications like
> web servers, mail servers, and NFS servers.

http://www.netapp.com/tech_library/3022.html <-- this article gives some
info about the small-file operations in web, mail, IM, network-disk,
blog, and similar services, and that is what our DFS targets.

>
> Online growable filesystem.  I know I can growfs a filesystem now, but
> doing online while data is being used is *insanely* useful.  Reiserfs
> and Polyserve's FS (a clustered filesystem, not open-source) do this well.

Yes, we also support that, with our own mechanism.

And as you know, current clustered filesystems such as GoogleFS, Lustre, etc.
can be built on an online-growable filesystem. That is our way of doing it too.

>
> FreeBSD's UFS2 made to do journaling.  There's already someone working
> on this.

Good news.

>
> I believe the above mean that we need a distributed lock manager too, so
> might as well add that to my wishlist.

For specific applications and services, we can easily do without the
distributed lock manager by handling it in an upper layer. You can read the
GoogleFS paper for further details.

>
> Single filesystem limits set very high - 16TB would be a good minimum.

The limits can be removed.

>
> Vinum/geom (?) made to allow added a couple more 'disks' - be it a real
> scsi device, or another vinum device - to existing vinum's, so I can
> extend my vinum stripe, raid, concat, etc to a larger volume size,
> without worrying about which disk is where.  I want to stripe mirrors of
> raids, and raid striped mirrors of stripes.  I know it sounds crazy, but
> I really *do* have uses for all this.

Yes, that's Lustre's way, and we also add a logical disk layer to
support it.

>
> We currently pay lots of money every year (enough to pay an engineers
> salary) for support and maintenance with Polyserve.  They make a good

Would you like to persuade your company to sponsor the development?

>
> product (we need it for the clustered filesystem and NFS distributed
> lock manager stuff) - I'd much rather see that go to FreeBSD.

Finally, any help, donations, and contributions in either the requirements or the technical
domains are greatly appreciated!

>
> Eric
>
>
>
--
yf-263
Unix-driver.org
Author: chifeng    Posted: 2005-05-13 15:00
So yftty is willing to contribute to BSD... haha.
And there's money in it too....
Author: riverfor    Posted: 2005-05-13 15:05
I want to write an fs too!
Author: thzjy    Posted: 2005-05-13 15:19
A huge project.
Author: dtest    Posted: 2005-05-13 16:33
though i can not understand it completely, i think it's a good idea.
Author: kofwang    Posted: 2005-05-14 11:22
Need to learn more advanced tech to understand this article.
Author: yftty    Posted: 2005-05-15 21:54
On Wed, 2005-05-11 at 22:45 -0500, Eric Anderson wrote:
> Ok - I'm changing the subject here in an attempt to gather information.
>
> Here's my wishlist:

As for your wishlist, how about MogileFS, at
http://www.danga.com/mogilefs/ ?

And what do you think of our cluster FS with GoogleFS-like and
MogileFS-like features?

Any comments are quite welcome.
Author: yftty    Posted: 2005-05-16 13:11
It looks like people are not that keen on English, so here is something in Chinese from our lead for your enjoyment.

I have been envisioning a distributed network storage system based on an open protocol similar to SMPP. As you may have seen, Google has published a white paper on the Google
File System; essentially it is an approach that turns filesystem operations into the operations of a network protocol. Recently I have been helping a friend think through the design of a related system. I wonder whether you would be interested in completing such a project together and maintaining it over the long term; perhaps in the future there will not only be a Python implementation, but also C and Java implementations. I do believe the Python implementation would be the best, just as with BitTorrent today.
Such distributed network storage has a great many uses: the large-capacity network disks everyone uses nowadays, large-capacity mail systems like Gmail, large-capacity information exchange systems like NNTP, and large-capacity information storage systems like blogs.
Its characteristics are that the stored content is diverse, the stored data cannot be centralized, and the data is stored with users/groups/systems as its center.
For the background, take a look at Google FS. If you cannot find it, I can provide the PDF white paper.
Also: the project will be open source (GPL or BSD), and the project will have real applications to prove that our ideas are correct (I will take care of the test environment).
----HD
Author: wheel    Posted: 2005-05-16 13:47
Why base it on an SMPP-like protocol rather than on BT?
Author: yftty    Posted: 2005-05-16 13:53
Originally posted by "wheel": Why base it on an SMPP-like protocol rather than on BT?


The concrete network abstraction layer (NAL) is still being selected; last quarter I built a demo with CURL.

Later we may use a network-layer architecture similar to PVFS2's.

File access supports TFTP, FTP, HTTP, NFS, etc.

P.S. For now it looks like we will end up with something like Lustre's Portals.
Author: dtest    Posted: 2005-05-16 13:53
OK, I can take part in this project. How do we start it? If Python is used for development, I think most of us will have to learn it first.
Author: yftty    Posted: 2005-05-16 23:10
some good talk on Spotlight on Tiger (Mac OS X)

http://www.kernelthread.com/software/fslogger/

This is also what our design aims for:

Presentation layer (search-based directories, user files)

Retrieval/search layer (search engine)

Storage layer (distributed filesystem)
Author: sttty    Posted: 2005-05-17 00:15
Good idea. I support it. It's a pity my abilities aren't up to it, otherwise I would definitely sign up.

Bumping this hard.
Author: ly_1979425    Posted: 2005-05-17 09:18
Using optical discs as the nearline storage medium would bring out the cost advantage even more effectively.
If the current optical disc filesystem formats, such as ISO9660, UDF and JOLIET, were presented to users as one unified network filesystem format, it would greatly increase the use of optical discs on the network, for example in disc jukebox devices.
This kind of storage holds very large amounts of data at very low cost; the cost of optical discs is far below that of hard disks.
I can cooperate with yftty in this area.
Author: xuediao    Posted: 2005-05-17 09:28
I've read through it and roughly understand what is going on. But could the original poster describe the future application scenarios of the DFS, and the thinking behind basing it on the SMPP protocol? That part I don't quite understand.

However, my pleasure to join in this!
Author: yftty    Posted: 2005-05-17 10:45
Originally posted by "xuediao":
I've read through it and roughly understand what is going on. But could the original poster describe the future application scenarios of the DFS, and the thinking behind basing it on the SMPP protocol? That part I don't quite understand.

However, my pleasure to join in this!


Sorry, please see the English part. As far as I know, none of the existing cluster filesystems is based on SMPP.

The application scenario is massive storage: web, mail, VOD/IPTV, broadcasting, libraries, and so on. Familiar examples of such systems are Google's Linux cluster and Yahoo's BSD server cluster.
Author: yftty    Posted: 2005-05-17 10:49
Originally posted by "ly_1979425":
Using optical discs as the nearline storage medium would bring out the cost advantage even more effectively.
If the current optical disc filesystem formats, such as ISO9660, UDF and JOLIET, were presented to users as one unified network filesystem format, it would greatly increase the use of optical discs on the network..........


Yes, the design takes this into account, as you said above. Storing the MetaData of each disc's filesystem centrally in the MDS and doing the namespace resolution there means the commands that actually reach a disc are reduced to seeks and read/write-stripe operations, which will greatly improve its usability.

At the same time, optical discs will greatly reduce running costs such as floor space and electricity.
Author: zhuwas    Posted: 2005-05-17 13:10
i can do it in my spare time , support , support !!!
Author: yftty    Posted: 2005-05-17 13:23
Or you can analyze it by following this progression:

Ext3/UFS/ReiserFS ;

NFS ;

GlobalFS ;

OpenAFS (Arla), Coda, Inter-mezzo, Lustre, PVFS2, GoogleFS.

Because the members of our group are growing in number, I keep thinking about how to make this as ordinary as cabbage by the roadside, rather than something that suddenly looms up in front of you like a tower whose top you cannot see.
Author: javawinter    Posted: 2005-05-17 16:20
Friendly support.
Author: zl_vim    Posted: 2005-05-17 17:02
What kind of thing is this?
How do I take part?
Author: 潇湘夜雨    Posted: 2005-05-17 18:17
Showing my support... and posting one here along my IT career as well.
Author: nemoliu    Posted: 2005-05-17 23:00
Hehe, with Google's success, filesystems look even more attractive. If I had the ability I would really like to take part.
Author: javawinter    Posted: 2005-05-18 02:55
Everyone with the skills, come and join.
Author: citybugzzzz    Posted: 2005-05-18 08:45
Up up!
Still following... my own project keeps me very busy, but I would be glad to take part!
Author: hdcola    Posted: 2005-05-18 09:04
I haven't been back here for a while. Let me explain why I originally considered an SMPP-like protocol as one protocol for a message-storage distributed filesystem.
1. SMPP is a fully asynchronous protocol. In theory the number of outstanding requests can be very large, but in typical use it processes them concurrently through sixteen to thirty-two windows, so that the next command on a connection can be handled even while the server has not yet finished the previous work. This can greatly reduce the number of concurrent connections on the server side.
2. Message-type storage is rarely modified heavily after being written, so on save you can consider a store-and-forward mechanism to deal with messages when the server is slow to respond or something goes wrong.
This is only a suggestion, just one more idea.
^_^
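
To make the window idea above concrete, here is a minimal C sketch of a fixed-size request window; the names and layout are made up for illustration and are not the real SMPP PDU format:

    /* Hypothetical sketch: one connection carries up to WINDOW_SIZE
     * outstanding requests, each tagged with a sequence number, so the
     * server can answer them out of order. */
    #include <stdint.h>

    #define WINDOW_SIZE 32          /* 16-32 concurrent requests, as described */

    struct pending_req {
        uint32_t seq;               /* sequence number sent on the wire */
        int      in_use;            /* slot still waiting for a response? */
    };

    struct conn_window {
        struct pending_req slot[WINDOW_SIZE];
        uint32_t next_seq;
    };

    /* Reserve a window slot; on success stores the sequence number to send.
     * Returns -1 when the window is full and the caller must wait. */
    static int window_acquire(struct conn_window *w, uint32_t *seq_out)
    {
        for (int i = 0; i < WINDOW_SIZE; i++) {
            if (!w->slot[i].in_use) {
                w->slot[i].in_use = 1;
                w->slot[i].seq = w->next_seq++;
                *seq_out = w->slot[i].seq;
                return 0;
            }
        }
        return -1;                  /* window exhausted */
    }

    /* Release the slot whose request was answered by sequence number `seq`. */
    static void window_release(struct conn_window *w, uint32_t seq)
    {
        for (int i = 0; i < WINDOW_SIZE; i++) {
            if (w->slot[i].in_use && w->slot[i].seq == seq) {
                w->slot[i].in_use = 0;
                return;
            }
        }
    }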
Author: yftty    Posted: 2005-05-18 10:11
Originally posted by "hdcola":
I haven't been back here for a while. Let me explain why I originally considered an SMPP-like protocol as one protocol for a message-storage distributed filesystem.
1. SMPP is a fully asynchronous protocol. In theory the number of outstanding requests can be very large, but in typical use it processes them concurrently through sixteen to thirty-two windows, so that the ser..........


Everyone is welcome to offer opinions and suggestions >_> We will evaluate and test them all during technology selection.

The concrete work will be divided into

client, data server, metadata server, namespace, datapath, log, recovery, networking (or on-wire protocol), migration/replication, utilities, etc. Everyone is welcome to join the work on whichever part interests them.

Or we can open several topics and discuss the related technical areas separately; call it our experiment in distributed collaboration.

Discussion of the open-source collaboration model is also welcome.
Author: mozilla121    Posted: 2005-05-18 15:15
Bump.
Author: nizvoo    Posted: 2005-05-18 15:58
i wanna do some part!
Author: yftty    Posted: 2005-05-18 16:13
Originally posted by "nizvoo": i wanna do some part!


If you have said the great golden words "I wanna do some part!", please describe your technical background or the domains you are interested in, so that I can give you more information and help you get into the work smoothly.

To put it another way, do you see it the way I say it: Just do it! Does that make sense?
Author: uplooking    Posted: 2005-05-18 16:47
Note: this author has been banned or deleted; the content has been automatically hidden.
Author: yftty    Posted: 2005-05-18 16:55
http://tech.sina.com.cn/it/2005-05-08/0920600573.shtml

Xinhua, Beijing, May 7 (reporter Li Bin): a recent survey of more than 4,400 people on "the living conditions of Chinese software talent", conducted by the China Youth Software Revitalization Plan working committee and other organizations, shows that Chinese software talent is not only "short of successors", but that, because of lack of training, the education model and other reasons, the successors are also "short of strength".

Knowledge in the software industry is refreshed quickly, yet the survey found that 60% of domestic software companies provide their employees with no real career planning, which shows that domestic software companies do not pay enough attention to staff training.

The survey shows that although most software practitioners hope to improve their abilities through training, the social environment rarely gives them the chance: on one hand their employers do not support it, and on the other hand there are very few institutions that can provide timely training in new technologies.

77% of software practitioners work more than 8 hours a day; the programmers in the middle tier have no time to take in new technologies and new ideas and no time to improve themselves. Most graduates with software degrees earn around 2,000 yuan a month, and software talent with an annual salary of 100,000 yuan is estimated at less than 5% of all practitioners. The survey found that the outdated education system leaves software graduates without practical programming ability, unable to meet companies' real needs, while the software companies themselves are unwilling to provide the corresponding training, so the number of programmers is effectively in "net decline".

At the same time, China lacks dedicated institutions for training software development managers; only those engineers or programmers who happen to have a gift for management are lucky enough to become development managers, producing the odd situation where "software talent cannot find jobs" while "software companies cannot recruit suitable employees".


-------------

I hope Uplooking.com can train more system-level developers for this industry.
Author: nizvoo    Posted: 2005-05-18 17:24
3 years c++/windows/opengl/dx
Author: yftty    Posted: 2005-05-18 17:38
Originally posted by "nizvoo": 3 years c++/windows/opengl/dx


This quarter is the incubation stage; at the end of the quarter I will report to the company, or discuss possible ways of running it. Please also offer opinions and suggestions on this, that is, on how a project like this can survive and grow.

Let us make it a successful piece of industry-grade software with strong vitality.

And starting from this thread, let us explore how such a thing can keep its vitality going.

Young, beautiful, forever!
Author: yftty    Posted: 2005-05-18 17:39
http://lists.danga.com/pipermail/mogilefs/2004-December/000018.html

On Dec 20, 2004, at 11:50, Brad Fitzpatrick wrote:

Excellent! I did a project implementing exactly
same idea two years ago for a project related
to storage of mail messages for GSM carrier and
can appreciate the beauty of the solution! It is
great to have such product in open source.
Author: uplooking    Posted: 2005-05-18 17:47
Note: this author has been banned or deleted; the content has been automatically hidden.
Author: yftty    Posted: 2005-05-18 18:29
Not many, but think about it: when Huawei started building telecom equipment (or even now) they did not have many people either, which is why they have to train so many every year.

People like to call commercial rules the Game Rules, and a game can also be seen as a gamble, so to some extent a company is betting on mass psychology. Those who bet right live a little more comfortably. Where do you think the industry trend and the public mood are heading?

Does putting it that way appeal to you?

http://www.blogchina.com/new/display/72595.html

The biggest defect of the "figures of regret" is their weakness at using resources and consolidating the industry, together with mediocre ability at running a company.
Author: sttty    Posted: 2005-05-18 23:55
I will support this project to the end. When the chance comes, I will study it properly.

Speaking of the uplooking courses: I went to an open class a few days ago. It felt good and the course was very practical. I noticed that the people attending were all quite skilled.
I felt rather ashamed of myself at the time.
Author: yftty    Posted: 2005-05-19 09:36
Originally posted by "sttty":
I will support this project to the end. When the chance comes, I will study it properly.

Speaking of the uplooking courses: I went to an open class a few days ago. It felt good and the course was very practical. I noticed that the people attending were all quite skilled.
I felt rather ashamed of myself at the time.


For a community, the value of its existence lies in:
first, that it helps everyone grow;
second, that it brings everyone more opportunities.

Please make that the starting point when posting promotional things like the above, heh.
Author: nizvoo    Posted: 2005-05-19 09:46
OK, got it. I need to learn more FS knowledge. Keep in touch. My mail: nizvoo"AT"gmail.com.
Author: deltali    Posted: 2005-05-19 10:11
what's the role of locks in a distributed filesystem?

thanks!
Author: yftty    Posted: 2005-05-19 11:03
Originally posted by "deltali":
what's the role of locks in a distributed filesystem?

thanks!


The locks in a distributed filesystem are managed by a Distributed Lock Manager (DLM).

A distributed filesystem needs to address the problem of delivering aggregate performance to a large number of clients.

The DLM is the basis of scalable clusters. In a DLM-based cluster all nodes can write to all shared resources and coordinate their actions using the DLM.

This sort of technology is mainly intended for CPU- and/or RAM-intensive processing, not for disk-intensive operations nor for reliability.

Digital > Compaq > HP...  HP owns the Digital DLM technology, available in Tru64 Unix (formerly Digital Unix) and OpenVMS 8.

Compaq/HP licensed the DLM technology to Oracle, who have based their cluster/grid software on the DLM.

Sun Solaris also has a DLM-based cluster technology.

Now Sun and HP are fighting blog wars...
http://blogs.zdnet.com/index.php?p=661&tag=nl.e539
http://www.chillingeffects.org/responses/notice.cgi?NoticeID=1460

Where I see the DLM being good is for rendering and scientific calculation. These processes could really benefit from having a central data store but will not put a huge load on the DLM hardware.

Some further reading:

http://kerneltrap.org/mailarchive/1/message/56956/thread

http://kerneltrap.org/mailarchive/1/message/66678/thread








http://lwn.net/Articles/135686/

Clusters and distributed lock management
The creation of tightly-connected clusters requires a great deal of supporting infrastructure. One of the necessary pieces is a lock manager - a system which can arbitrate access to resources which are shared across the cluster. The lock manager provides functions similar to those found in the locking calls on a single-user system - it can give a process read-only or write access to parts of files. The lock management task is complicated by the cluster environment, though; a lock manager must operate correctly regardless of network latencies, cope with the addition and removal of nodes, recover from the failure of nodes which hold locks, etc. It is a non-trivial problem, and Linux does not currently have a working, distributed lock manager in the mainline kernel.

David Teigland (of Red Hat) recently posted a set of distributed lock manager patches (called "dlm"), with a request for inclusion into the mainline. This code, which was originally developed at Sistina, is said to be influenced primarily by the venerable VMS lock manager. An initial look at the code confirms this statement: callbacks are called "ASTs" (asynchronous system traps, in VMS-speak), and the core locking call is an eleven-parameter monster:

    int dlm_lock(dlm_lockspace_t *lockspace,
        int mode,
        struct dlm_lksb *lksb,
        uint32_t flags,
        void *name,
        unsigned int namelen,
        uint32_t parent_lkid,
        void (*lockast) (void *astarg),
        void *astarg,
        void (*bast) (void *astarg, int mode),
        struct dlm_range *range);

Most of the discussion has not been concerned with the technical issues, however. There are some disagreements over issues like how nodes should be identified, but most of the developers who are interested in this area seem to think that this implementation is at least a reasonable starting point. The harder issue is figuring out just how a general infrastructure for cluster support can be created for the Linux kernel. At least two other projects have their own distributed lock managers and are likely to want to be a part of this discussion; an Oracle developer recently described the posting of dlm as "a preemptive strike." Lock management is a function needed by most tightly-coupled clustering and clustered filesystem projects; wouldn't it be nice if they could all use the same implementation?

The fact is that the clustering community still needs to work these issues out; Andrew Morton doesn't want to have to make these decisions for them:

Not only do I not know whether this stuff should be merged: I don't even know how to find that out. Unless I'm prepared to become a full-on cluster/dlm person, which isn't looking likely.

The usual fallback is to identify all the stakeholders and get them to say "yes Andrew, this code is cool and we can use it", but I don't think the clustering teams have sufficient act-togetherness to be able to do that.

Clustering will be discussed at the kernel summit in July. A month prior to that, there will also be a clustering workshop held in Germany. In the hopes that these two events will help bring some clarity to this issue, Andrew has said that he will hold off on any decisions for now.
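
For readers unfamiliar with VMS-style lock managers, here is a rough sketch of how a caller might drive the eleven-parameter dlm_lock() shown above. The mode constant DLM_LOCK_EX, the lksb status field mentioned in the comments, and the resource name are assumptions for illustration; the actual types and constants come from the dlm patch headers.

    /* Sketch only: dlm_lockspace_t, struct dlm_lksb and dlm_lock()
     * are declared by the dlm patch; completion is asynchronous, via
     * the "AST" callbacks. */
    static struct dlm_lksb my_lksb;

    static void my_lock_granted(void *astarg)
    {
        /* Runs when the request completes; the status would be found
         * in the lksb (field name assumed). */
    }

    static void my_blocking_notice(void *astarg, int mode)
    {
        /* Another node wants this lock at `mode`; release or demote soon. */
    }

    static int grab_resource_lock(dlm_lockspace_t *ls)
    {
        static char name[] = "my-resource";   /* cluster-wide resource name */

        return dlm_lock(ls,
                        DLM_LOCK_EX,          /* exclusive mode (assumed constant) */
                        &my_lksb,
                        0,                    /* flags */
                        name, sizeof(name) - 1,
                        0,                    /* no parent lock */
                        my_lock_granted,      /* grant AST */
                        NULL,                 /* astarg */
                        my_blocking_notice,   /* blocking AST */
                        NULL);                /* whole-resource range */
    }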
Author: wolfg    Posted: 2005-05-19 14:36
Watching.
Author: ufoor    Posted: 2005-05-19 23:38
Reading this makes my head spin a bit; I still have a lot to learn.
For material like this it is better to read the Chinese first, it is more efficient; only when there is no Chinese version read the English.
Author: Zer4tul    Posted: 2005-05-20 03:08
Apparently this was HD's idea? Not bad... it's a pity my skills aren't up to it... I'll just cheer from the sidelines... and read the Google FS paper carefully over the next couple of days.
Author: yftty    Posted: 2005-05-20 08:15
Originally posted by "ufoor":
Reading this makes my head spin a bit; I still have a lot to learn.
For material like this it is better to read the Chinese first, it is more efficient; only when there is no Chinese version read the English.


Reading Chinese helps you build up the relevant concepts quickly, but once a few concepts are in place, stop reading the Chinese material, otherwise the more you read the more confused you will get.
Author: yftty    Posted: 2005-05-20 08:19
Originally posted by "Zer4tul": Apparently this was HD's idea? Not bad... it's a pity my skills aren't up to it... I'll just cheer from the sidelines... and read the Google FS paper carefully over the next couple of days.


hehe, HD can be considered the Godfather of the Project !

Also, a great project needs great people. Would you like to tell me your brilliant ideas about what to do or how to do it, so we can merge them in?

Let's inspire each other !
Author: akadoc    Posted: 2005-05-20 13:17
Up, up, up. Still watching...
Author: yftty    Posted: 2005-05-20 17:03
Originally posted by "akadoc": Up, up, up. Still watching...


Which point or which part would you like to follow: the organization or the technology, and which part of the technology?

Please take a look at the features provided by MogileFS, which is similar to what we are doing.

http://www.danga.com/mogilefs/

MogileFS is our open source distributed filesystem. Its properties and features include:

    * Application level -- no special kernel modules required.
    * No single point of failure -- all three components of a MogileFS setup (storage nodes, trackers, and the tracker's database(s)) can be run on multiple machines, so there's no single point of failure. (you can run trackers on the same machines as storage nodes, too, so you don't need 4 machines...) A minimum of 2 machines is recommended.
    * Automatic file replication -- files, based on their "class", are automatically replicated between enough different storage nodes as to satisfy the minimum replica count as requested by their class. For instance, for a photo hosting site you can make original JPEGs have a minimum replica count of 3, but thumbnails and scaled versions only have a replica count of 1 or 2. If you lose the only copy of a thumbnail, the application can just rebuild it. In this way, MogileFS (without RAID) can save money on disks that would otherwise be storing multiple copies of data unnecessarily.
    * "Better than RAID" -- in a non-SAN RAID setup, the disks are redundant, but the host isn't. If you lose the entire machine, the files are inaccessible. MogileFS replicates the files between devices which are on different hosts, so files are always available.
    * Transport Neutral -- MogileFS clients can communicate with MogileFS storage nodes (after talking to a tracker) via either NFS or HTTP, but we strongly recommend HTTP.
    * Flat Namespace -- Files are identified by named keys in a flat, global namespace. You can create as many namespaces as you'd like, so multiple applications with potentially conflicting keys can run on the same MogileFS installation.
    * Shared-Nothing -- MogileFS doesn't depend on a pricey SAN with shared disks. Every machine maintains its own local disks.
    * No RAID required -- Local disks on MogileFS storage nodes can be in a RAID, or not. It's cheaper not to, as RAID doesn't buy you any safety that MogileFS doesn't already provide.
    * Local filesystem agnostic -- Local disks on MogileFS storage nodes can be formatted with your filesystem of choice (ext3, ReiserFS, etc..). MogileFS does its own internal directory hashing so it doesn't hit filesystem limits such as "max files per directory" or "max directories per directory". Use what you're comfortable with.

MogileFS is not:

    * POSIX Compliant -- you don't run regular Unix applications or databases against MogileFS. It's meant for archiving write-once files and doing only sequential reads. (though you can modify a file by way of overwriting it with a new version) Notes:
          o Yes, this means your application has to specifically use a MogileFS client library to store and retrieve files. The steps in general are 1) talk to a tracker about what you want to put or get, 2) read/write to the NFS path for that storage node (the tracker will tell you where) or do an HTTP GET/PUT to the storage node, if you're running with an HTTP transport instead of NFS (which is highly recommended)
          o We've briefly tinkered with using FUSE, which lets Linux filesystems be implemented in userspace, to provide a Linux filesystem interface to MogileFS, but we haven't worked on it much.
    * Completely portable ... yet -- we have some Linux-isms in our code, at least in the HTTP transport code. Our plan is to scrap that and make it portable, though.
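
As a rough illustration of the two-step access pattern described above (ask a tracker where a key lives, then talk HTTP to a storage node), the libcurl sketch below performs step 2 with an HTTP PUT. The storage-node URL in the usage comment is a made-up example; real applications normally go through the MogileFS client libraries rather than raw libcurl.

    #include <stdio.h>
    #include <curl/curl.h>

    /* Step 2 of the flow: PUT a local file to the path a tracker returned. */
    static int put_to_storage_node(const char *url, const char *path)
    {
        FILE *fp = fopen(path, "rb");
        if (!fp)
            return -1;

        CURL *curl = curl_easy_init();
        if (!curl) {
            fclose(fp);
            return -1;
        }

        curl_easy_setopt(curl, CURLOPT_URL, url);      /* e.g. returned by the tracker */
        curl_easy_setopt(curl, CURLOPT_UPLOAD, 1L);    /* HTTP PUT */
        curl_easy_setopt(curl, CURLOPT_READDATA, fp);  /* stream the local file */

        CURLcode rc = curl_easy_perform(curl);

        curl_easy_cleanup(curl);
        fclose(fp);
        return rc == CURLE_OK ? 0 : -1;
    }

    /* Usage (hypothetical device path returned by a tracker):
     *   put_to_storage_node("http://storage1:7500/dev3/0/000/000/0000000123.fid",
     *                       "photo.jpg");
     */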
Author: scrazy77    Posted: 2005-05-20 20:50
Originally posted by "yftty":


Which point or which part would you like to follow: the organization or the technology, and which part of the technology?

Please take a look at the features provided by MogileFS, which is similar to what we are doing.

http://www.danga.com/mogilefs/

MogileFS is our open source distributed filesystem..........

MogileFS can be regarded as a simplified implementation of Google GFS;
the concepts are very close.
The difference is that its smallest unit is the file, while Google GFS's smallest unit is a chunk (64MB).
But at present MogileFS has to be accessed through an application client,
so in convenience it is still not as good as distributed shared storage like Red Hat GFS,
or a NetApp Filer...
Of course MogileFS may be the cheapest solution.
It is already being tested on my internal cluster,
using the PHP client, applied to a blog & album system accessed by multiple servers.
To implement it as a POSIX filesystem, it should be fairly quick to do with FUSE;
danga seems to have such a plan as well.

Eric Chang
Author: yftty    Posted: 2005-05-21 00:30
> MogileFS can be regarded as a simplified implementation of Google GFS;
> the concepts are very close.

Yes, both belong to a subset of user-space implementations of asymmetric cluster filesystems.

At the same time they can be viewed as file-management library functions rather than filesystems.

> The difference is that its smallest unit is the file, while Google GFS's smallest unit is a chunk (64MB).

MogileFS manages at file granularity, so it only has to handle the file namespace and never touches disk block space.

GoogleFS raises the old disk-block operations to file-based chunk (64MB) operations, so that storage management has a sensible minimum granularity and the management overhead is reduced.
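
As a tiny worked example of this chunk granularity, the sketch below maps a byte offset in a file to a (chunk index, offset-within-chunk) pair for a GoogleFS-style 64 MB chunk size; it is purely illustrative.

    #include <stdint.h>
    #include <stdio.h>

    #define CHUNK_SIZE (64ULL * 1024 * 1024)   /* 64 MB, as in the GoogleFS paper */

    struct chunk_pos {
        uint64_t index;     /* which chunk of the file */
        uint64_t offset;    /* byte offset inside that chunk */
    };

    /* The metadata server only has to track chunks, not disk blocks. */
    static struct chunk_pos locate_chunk(uint64_t file_offset)
    {
        struct chunk_pos p;
        p.index  = file_offset / CHUNK_SIZE;
        p.offset = file_offset % CHUNK_SIZE;
        return p;
    }

    int main(void)
    {
        /* A 200 MB offset lands in chunk 3 (the fourth chunk), 8 MB in. */
        struct chunk_pos p = locate_chunk(200ULL * 1024 * 1024);
        printf("chunk %llu, offset %llu\n",
               (unsigned long long)p.index, (unsigned long long)p.offset);
        return 0;
    }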

> But at present MogileFS has to be accessed through an application client,
> so in convenience it is still not as good as distributed shared storage like Red Hat GFS,

GFS is a SAN-based symmetric distributed filesystem.

> or a NetApp Filer...

A NetApp Filer is an optimized NFS server.

> Of course MogileFS may be the cheapest solution.
> It is already being tested on my internal cluster,

Good job !

> using the PHP client, applied to a blog & album system accessed by multiple servers.
> To implement it as a POSIX filesystem, it should be fairly quick to do with FUSE;

That describes the development approach. We started with the same idea, but the extra workload it brought was much larger, so we no longer experiment inside FUSE.

> danga seems to have such a plan as well.

> Eric Chang
Author: 我菜我怕谁    Posted: 2005-05-21 09:09
Hey, I haven't even figured Unix itself out yet; I'd better keep lurking!!
Author: yftty    Posted: 2005-05-21 10:36
Originally posted by "我菜我怕谁": Hey, I haven't even figured Unix itself out yet; I'd better keep lurking!!


HOHO, to a large extent this has nothing to do with Unix. I don't understand Unix all that well myself, heh.

The IT industry, a "consumption economy" led by the US and launched by the Silicon Valley elite, has always made its profit from dazzling concepts as selling points, stretching the public's purchasing power far beyond its means. At the same time they have built not only technical barriers and market barriers, but this psychological barrier as well. Don't let it scare you.

Big projects are all paper tigers: despise them strategically, and then you can handle them tactically.

However big a project is, each person only works on a small part of it; but through that small part, can't I say that "I" have taken part in the progress of this field, or of this society? A journey only feels endless because the goal is unclear.

P.S. Wang Guowei's three stages of accomplishing anything:

1. Last night the west wind withered the green trees; alone I climbed the high tower and gazed down the road to the end of the sky.

2. My clothes hang ever looser, yet I have no regrets; for her I am willing to waste away.

3. I searched for her in the crowd a thousand times; then suddenly turning my head, I found her there, where the lantern light was dim. (Is that you?)
Author: kofwang    Posted: 2005-05-21 10:45
Makes sense, though you could say you have found the right direction. For ordinary people it goes more like:
1. Last night the cheap liquor only chilled my heart; I wanted to climb the high tower, but could not find the road to the end of the sky.
2. My strength gave out before the overdraft did, and the wallet is still as empty as a drained pond.
3. Three hundred years of fighting on the battlefield, then hanging up the armor to go home, only to find there is no home to return to.
Author: sttty    Posted: 2005-05-21 10:47
Well said:
1. Last night the west wind withered the green trees; alone I climbed the high tower and gazed down the road to the end of the sky.

2. My clothes hang ever looser, yet I have no regrets; for her I am willing to waste away.

3. I searched for her in the crowd a thousand times; then suddenly turning my head, I found her there, where the lantern light was dim.

One remark that wakes the dreamer.
Author: kofwang    Posted: 2005-05-21 10:53
"Making its profit from dazzling concepts as selling points":
right now is exactly when the concept economy is at its height. For Chinese people, "home theater", "self-drive tours" and the "Three Represents" have drawn plenty of eyeballs.
Author: yftty    Posted: 2005-05-21 10:58
Originally posted by "kofwang":
Makes sense, though you could say you have found the right direction. For ordinary people it goes more like:
1. Last night the cheap liquor only chilled my heart; I wanted to climb the high tower, but could not find the road to the end of the sky.
2. My strength gave out before the overdraft did, and the wallet is still as empty as a drained pond.
3. Three hundred years of fighting on the battlefield, then hanging up the armor to go home, only to find there is no home to return to.


Looking out from a prison cell, one man saw mud, the other saw stars.  :wink:

What people look for most is the smooth road after the twists; that is also why tragedies such as "Liang Zhu" (the Butterfly Lovers) spread so easily through the ages.

Behind that almost sickly persistence, have you ever had this feeling: you always wake up startled in the morning, yet you don't know what you are worried about, or what you should be worried about?
Author: akadoc    Posted: 2005-05-21 14:23
Originally posted by "yftty":


For a community, the value of its existence lies in:
first, that it helps everyone grow;
second, that it brings everyone more opportunities.

Please make that the starting point when posting promotional things like the above, heh.


Hoping to see a team as U say,in this project!
Author: chifeng    Posted: 2005-05-21 22:37
I wonder whether a rookie like me can be of any help?
Doing something concrete.....
Author: tclwp    Posted: 2005-05-22 17:25
If new, pioneering technology is integrated into it, the future is bright.
Author: yftty    Posted: 2005-05-22 20:56
Originally posted by "akadoc":


Hoping to see a team as U say,in this project!


The team has been formed. At present there are two members, and the third will arrive in July ;) All of them have experience with successful distributed filesystem products.

Of course we hope more people will join our work and explore the technology and the related management and engineering experience together with us.
Author: yftty    Posted: 2005-05-22 20:59
Originally posted by "chifeng":
I wonder whether a rookie like me can be of any help?
Doing something concrete.....


Heh, people reach a certain level because of the work they do, rather than doing the work only after reaching that level. Growth should be a lifelong pursuit, so we are always using what we know to explore what we do not ;)

We keep working at it !
Author: sttty    Posted: 2005-05-22 22:45
Successful people have all walked out this road step by step. I hope that in a few years I will still be continuing along it.
Author: yftty    Posted: 2005-05-22 23:46
Originally posted by "tclwp": If new, pioneering technology is integrated into it, the future is bright.


For a project like this, or similar ones, the risk in the research (new technology) is relatively small; the bigger risk lies in the engineering. Heh, by doing this I have gradually come to understand why Google's two founders split the work, one taking technology and the other engineering (though my understanding may well be off).

In such a system, any single part taken on its own is fairly simple, and you can see its shadow in many other places. But when everything is integrated, when it forms what we usually call a system, the technical complexity goes up, and for business-critical systems that complexity is even more obvious. For example, a large concurrent system has a great many corner cases, and so many parts get optimized that it becomes hard to pin down the specific cause of anything; and performance is often the sole goal the whole engineering effort pursues. Please keep supporting and discussing.
Author: whoto    Posted: 2005-05-23 10:29
I don't know Google FS; my understanding of yfttyFS (let me call it that for now) is:
under one virtual yfttyFS root filesystem, it can attach (mount) storage space provided by many kinds of storage devices, filesystems, operating systems and protocols, as well as yfttyFS itself, forming a single unified storage system that provides storage services.
I hope the experts will correct me.

yfttyFS/--yfttyFS/X1
        |
        +--yfttyFS/X2
        |
        +--yfttyFS/X...
        |
        +-/Xdev/--HD
        |       |--SCSI
        |       |--CD
        |       |--DVD
        |       |--etc.
        |
        +-/Xfs/--UFS
        |      |--UFS2
        |      |--Ext2
        |      |--NTFS
        |      |--ISO9660
        |      |--etc.
        |
        +-/Xsys/--BSD(s)
        |       |--Linux(s)
        |       |--Windows(s)
        |       |--UNIX(s)
        |       |--etc.
        |
        +-/Xprotocol/--TFTP
        |            |--FTP
        |            |--HTTP
        |            |--NFS
        |            |--etc.
        |
        +--/etc.
        |


WEB       --|
MAIL      --|
VOD/IPTV  --|---base on--yfttyFS
Library   --|
etc.      --|
Author: yftty    Posted: 2005-05-23 11:28
hehe, I never thought about it that way, and would not dare to name it xxxFS as you did. Most of the ideas are stolen from various sources, and there are members of our team far more intelligent than I am. I disclose it here only to gain more insight into our project, to the benefit of the project and of the people who contribute.

Yes, it seems you really know what we want to do. Yes, the storage is a pool, and it is always on demand, like the air around you.

And the trick behind my nickname:
I can see your masterpiece here because 'yf' now sits in front of a 'tty'.
Author: Solaris12    Posted: 2005-05-25 18:43
Originally posted by "yftty":


The team has been formed. At present there are two members, and the third will arrive in July ;) All of them have experience with successful distributed filesystem products.

Of course we hope more people will join our work and explore the technology and the related management and eng..........


How do I get in touch with you? I am very interested in this project;
we could exchange a lot on the technology and on engineering management.
Author: yftty    Posted: 2005-05-27 00:24
Originally posted by "Solaris12":


How do I get in touch with you? I am very interested in this project;
we could exchange a lot on the technology and on engineering management.


For engineering management we plan to use PSP/TSPi and XP; everyone is welcome to discuss this.

P.S. The books are bought, but I haven't had time to read them yet.
Author: javawinter    Posted: 2005-05-27 01:15
UP
Author: Solaris12    Posted: 2005-05-27 13:03
Originally posted by "yftty":


For engineering management we plan to use PSP/TSPi and XP; everyone is welcome to discuss this.


Forgive my ignorance, but what is PSP/TSPi?

Does XP mean extreme programming?
As I understand it, XP suits projects with few developers that are driven by customer requirements. An FS product does not need to adopt XP.

Of course there are many best practices in software development, and we can adjust them to our actual situation and find the balance point between efficiency and process:

1. On SCM:

To build a good product you must set a series of SCM policies and standards, mainly in the following areas:

version control management
change tracking management


2. On process

Standards for code integration need to be defined.

Development: concept documents --> development --> code review --> code integration
Testing: test plan --> test development --> testing --> test report


For a small development team with limited resources, SCM and process should not be made complicated; keep development documents to a minimum and strengthen configuration management and code review.
For testing, it is best to find open-source test tools, but that requires the FS programming interface not to be proprietary; it should conform to some standard as far as possible.
Author: yftty    Posted: 2005-05-27 13:48
(13:43:29) j-fox: whatever management model you use, the most important things are making good plans (all kinds of plans, especially risk-response plans) and monitoring status; start by taking a small task and using it to find the methods that work for you.

(13:45:45) j-fox: get the development documents ready first
(13:46:04) yftty -- A dream makes a team, and the team builds the dream !: OK, I'll post yours first.
Author: xuediao    Posted: 2005-05-27 14:10
Originally posted by "Solaris12":

XP suits projects with few developers that are driven by customer requirements.

As Solaris12 says, XP emphasizes speed and flexibility, while PSP and TSPi are an extension of CMMi and emphasize planning and process control.

Although this is a large engineering project, developed mainly in a distributed fashion, applying both methods at the same time will be very hard.

Strike the right balance between the two methods and we might even found a new school of software engineering, heh.
Author: xuediao    Posted: 2005-05-27 14:16
Originally posted by "yftty":
(13:43:29) j-fox: whatever management model you use, the most important things are making good plans (all kinds of plans, especially risk-response plans) and monitoring status; start by taking a small task and using it to find the methods that work for you.

(13:45:45) j-fox: get the development documents ready first
(13:46:04) yftty -- ..........

I rather agree with j-fox: monitoring development status and responding to risk matter most. If this were purely in-house development it would be much easier to apply TSP; for distributed development within China, this counts as an attempt and a learning process.
Author: mozilla121    Posted: 2005-05-27 14:29
Strictly following this kind of process will be hard in practice. Only a team that really believes in the process can keep it going.
Author: yftty    Posted: 2005-05-27 14:51
Originally posted by "xuediao":

As Solaris12 says, XP emphasizes speed and flexibility, while PSP and TSPi are an extension of CMMi and emphasize planning and process control.

Although this is a large engineering project, developed mainly in a distributed fashion, applying both methods at the same time will be very hard.

Strike the right balance between the two
Author: yftty    Posted: 2005-05-27 14:54
Originally posted by "mozilla121": Strictly following this kind of process will be hard in practice. Only a team that really believes in the process can keep it going.


"Know yourself", "conquer yourself"; "know what is enough", "keep pressing forward". -- Tao Te Ching
Author: xuediao    Posted: 2005-05-27 14:54
Heh, this is the doctrine of the mean, or perhaps a new-style Westernization Movement.

As Deng Xiaoping put it so well: black cat or white cat, as long as it catches mice it is a good cat!
Author: Solaris12    Posted: 2005-05-28 21:03
Originally posted by "xuediao":

As Solaris12 says, XP emphasizes speed and flexibility, while PSP and TSPi are an extension of CMMi and emphasize planning and process control.

Although this is a large engineering project, developed mainly in a distributed fashion, applying both methods at the same time will be very hard.

Strike the right balance between the two?.........


Actually CMM-type approaches suit outsourcing companies very well.
The development team I am in uses neither XP nor CMM, yet it is very effective.
Moreover, you will find the shadow of other software engineering methods inside it,
so no particular process matters by itself; what matters most is matching the resources you have.
As I see it, the biggest problems of many domestic software companies are the following:

1. SCM (software configuration management)

No competent release engineer.
No real version management.
No change-tracking system, so no way to capture every change to the system.
No daily build, no automatic sanity test
or system test.

More importantly, many companies never set a unified
SCM policy at project start, such as code integration criteria.

2. Development process

No democratic yet authoritative body to control changes to market requirements and to the software architecture.
No code review.
No automatic regression test against every daily build.

But any software engineering method consumes extra resources;
the key is that every software company recognizes this and invests in it.

In fact, if you look closely at the development model of many well-known open-source projects,
all of the above are satisfied rather well. For example:
you can get their daily build or snapshot at any time
and see whether that build passed testing. There is also a bugtraq system
that records every change, including bug fixes and new features.
Author: yftty    Posted: 2005-06-01 12:37
To Solaris12,

We are now implementing things step by step along the lines you describe, but they are not all in place yet.

1. SCM: at the moment there are only simple Commit Rules (modeled on Lustre's process), again to match the resources we currently have.

2. Development process: at the moment there is only design review; the rest needs people to build it up.

P.S. I suddenly feel I have lost a little of what used to be so familiar.
Author: james.liu    Posted: 2005-06-01 13:42
After reading this thread, my first impression is not the project or the technology involved, but that this yftty fellow
can really talk.

I don't understand it, but I would like to watch. How can I follow this project from the sidelines?
Author: 风暴一族    Posted: 2005-06-03 09:26
Not bad, I'd say~
Author: yftty    Posted: 2005-06-07 09:41
Here are the current sanity test and its results:

[yf@yftty xxxfs]$ tests/xxxfs_sanity -v
000010:000001:1118108292.377965:4560socket.c:63xxfs_net_connect()) Process entered
config finished, ready to do the sanity testing !
xxxFS file creation testing succeeded !
xxxFS file read testing succeeded !
xxxFS file deletion testing succeeded !
xxxFS Sanity testing pid (4560) succeeded 1 !
[yf@yftty xxxfs]$
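
The test program itself is not shown, but a create/read/delete sanity test of this shape might look roughly like the sketch below; it uses plain POSIX calls as a stand-in for the project's own client library, whose API is not published here.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *name = "sanity_test_file";
        const char  wbuf[] = "hello, xxxfs";
        char        rbuf[sizeof(wbuf)];

        /* 1. create + write */
        int fd = open(name, O_CREAT | O_RDWR | O_TRUNC, 0644);
        if (fd < 0 || write(fd, wbuf, sizeof(wbuf)) != (ssize_t)sizeof(wbuf)) {
            fprintf(stderr, "file creation testing failed!\n");
            return 1;
        }
        printf("file creation testing succeeded!\n");

        /* 2. read back and compare */
        if (lseek(fd, 0, SEEK_SET) != 0 ||
            read(fd, rbuf, sizeof(rbuf)) != (ssize_t)sizeof(rbuf) ||
            memcmp(wbuf, rbuf, sizeof(wbuf)) != 0) {
            fprintf(stderr, "file read testing failed!\n");
            return 1;
        }
        printf("file read testing succeeded!\n");

        /* 3. delete */
        close(fd);
        if (unlink(name) != 0) {
            fprintf(stderr, "file deletion testing failed!\n");
            return 1;
        }
        printf("file deletion testing succeeded!\n");

        printf("sanity testing pid (%d) succeeded!\n", (int)getpid());
        return 0;
    }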
Author: yftty    Posted: 2005-06-07 11:38
The project is now coming up on two quarters old. After this period of practice and reflection,

my humble view is that, as far as process goes, the posts above already cover it fairly well.
As for the division of labor and organization, does the following look right to everyone?

        ____________          ____________
       | theory     |  <->   | dev        |
       | guidance   |        | guidance   |
        ------------          ------------
          |        \         /       |
          |         \       /        |
        --------     --------      ---------
       | R & D  | <-> | dev   | <-> | test   |
        --------     --------      ---------

P.S.

There are still problems with this; ugh.
Author: yftty    Posted: 2005-06-08 09:30
Dan Stromberg wrote:
> The lecturer at the recent NG storage talk at Usenix in Anaheim,
> indicated that it was best to avoid "active/active" and get
> "active/passive" instead.
>
> Does anyone:
>
> 1) Know what these things mean?

In the clustering world, active/active means 2 or more servers are
active at a time, either operating on separate data (and thus acting as
passive failover partners to each other), or operating on the same data
(which requires the use of a cluster filesystem or other similar
mechanism to allow coherent simultaneous access to the data).

> 2) Know why active/passive might be preferred over active/active?

Well, if you're talking about active/passive vs. active/active with a
cluster filesystem or such, the active/passive is tons easier to
implement and get right. Plus, depending on your application, the added
complexity of a cluster filesystem might not actually buy you much more
than you could get with, say, NFS or Samba (CIFS).

--
Paul
Author: yftty    Posted: 2005-06-08 11:00
http://tech.blogchina.com/53/2005-06-07/372338.html

To understand Google's corporate culture, start with an episode from its founding: when Sergey Brin and Larry Page wanted to turn their web dream into reality, the biggest obstacle was that they did not have enough money to buy expensive equipment. So the two of them spent a few hundred dollars on personal computers to stand in for servers costing millions.

In real use those ordinary PCs naturally failed more often than professional servers. They had to make sure that the failure of any single PC would never keep a user from getting normal search results, so Google decided to develop its own software tools to solve these problems, such as the Google File System. That filesystem not only handles large volumes of data efficiently, it can also cope with storage failures at any moment. Combined with Google's triple-replication scheme, a system built from these PCs could do the work of those servers.

This attitude of tackling every problem head-on deeply shaped Google's later culture. To this day Google keeps the feel of an internet company. Of the 2,700 employees at headquarters, 900 are technical staff, and there are hardly any private offices. Downstairs from Schmidt's closet-sized office, Brin and Page share one office that looks like a college dorm room, full of hockey gear, skateboards, remote-control model planes, bean-bag chairs and the like.

...

Nobody questions that Google has magical technology and innovation, but no company becomes world-class on excellent technology alone. Great companies need great management to take them further. Who is Google's soul? Of course it is the trio of Brin, Page and Schmidt. But at the level of management, the 49-year-old Schmidt has played the crucial role.

Schmidt was once CTO of Sun and CEO of Novell, and he still clearly remembers what the board told him when he arrived: "Don't mess the company up, Eric. The starting point is very, very good; don't reform it too much." He fully understood the investors' worry: they did not want this wildly creative company to turn rigid.

When Schmidt arrived in 1999 there was hardly any management to speak of, but he did not want to copy the management methods of traditional big companies either; he hoped Google's own management style would grow out of its actual situation. Most of the time Schmidt acts together with the two founders in making decisions. Usually Schmidt chairs the management meetings and the two founders chair the staff meetings. When a major issue has to be settled, the Google trio decides by simple majority, and many decisions are reached in front of the employees. Management deliberately preserves the frank, free engineer culture, which they see as a powerful weapon against large companies like Yahoo and Microsoft.

Harvard Business School professor David Yoffie is not so positive about this model: "If many people decide at the same time, that is the same as deciding nothing. At Google thousands of plans are made simultaneously every day; someone has to make the final call."

Schmidt says the role he actually plays leans more toward COO. He points to Yahoo and eBay, where the founders set the long-term strategy even though they do not hold the CEO title. But Schmidt's supporters argue that this CEO's personal style masks his real position in the company. Page, once the CEO, is now President of Products; Brin, the former chairman, is President of Technology; and over the past four years Schmidt has built a solid structure for Google.

Brin and Page's management philosophy comes straight from the Stanford computer science lab they came out of. Google's managers rarely tell engineers which projects to finish; instead the company publishes a list of 100 priority projects, and engineers join fluid working groups according to their own interests, completing work in units of weeks or months.
Author: liuzhentaosoft    Posted: 2005-06-10 23:49
Note: this author has been banned or deleted; the content has been automatically hidden.
Author: yftty    Posted: 2005-06-14 15:49
This is a short taxonomy of the kinds of distributed filesystems you can find today (February 2004). This was assembled with some help from Garth Gibson and Larry Jones.

Distributed filesystem - the generic term for a client/server or "network" filesystem where the data isn't locally attached to a host. There are lots of different kinds of distributed filesystems, the first ones coming out of research in the 1980s. NFS and CIFS are the most common distributed filesystems today

Global filesystem - this refers to the namespace, so that all files have the same name and path name when viewed from all hosts. This obviously makes it easy to share data across machines and users in different parts of the organization. For example, the WWW is a global namespace because a URL works everywhere. But, filesystems don't always have that property because your share definitions may not match mine, we may not see the same file servers or the same portions of those file servers.

AFS was an early provider of a global namespace - all files were organized under /afs/cellname/... and you could assemble AFS cells even from different organizations (e.g., different universities) into one shared filesystem. The Panasas filesystem (PanFS) supports a similar structure, if desired.

SAN filesystem - these provide a way for hosts to share Fibre Channel storage, which is traditionally carved into private chunks bound to different hosts. To provide sharing, a block-level metadata manager controls access to different SAN devices. A SAN Filesystem mounts storage natively in only one node, but connects all nodes to that storage and distributes block addresses to other nodes. Scalability is often an issue because blocks are a low-level way to share data placing a big burden on the metadata managers and requiring large network transactions in order to access data.

Examples include SGI cXFS, IBM GPFS, Red Hat Sistina, IBM SanFS, EMC Highroad and others.

Symmetric filesystems - A symmetric filesystem is one in which the clients also run the metadata manager code; that is, all nodes understand the disk structures. A concern with these systems is the burden that metadata management places on the client node, serving both itself and other nodes, which may impact the ability of the client to perform its intended compute jobs. Examples include Sistina GFS, GPFS, Compaq CFS, Veritas CFS, Polyserve Matrix

Asymmetric filesystems - An asymmetric filesystem is one in which there are one or more dedicated metadata managers that maintain the filesystem and its associated disk structures. Examples include Panasas ActiveScale, IBM SanFS, and Lustre. Traditional client/server filesystems like NFS and CIFS are also asymmetric.

Cluster filesystem - a distributed filesystem that is not a single server with a set of clients, but instead a cluster of servers that all work together to provide high performance service to their clients. To the clients the cluster is transparent - it is just "the filesystem", but the filesystem software deals with distributing requests to elements of the storage cluster.

Examples include: HP (DEC) Tru64 cluster and Spinnaker is a clustered NAS (NFS) service. Panasas ActiveScale is a cluster filesystem

Parallel filesystem - file systems with support for parallel applications, all nodes may be accessing the same files at the same time, concurrent read and write. Examples of this include: Panasas ActiveScale, Lustre, GPFS and Sistina.

Finally, these definitions overlap. A SAN filesystem can be symmetric or asymmetric. Its servers can be clustered or single. And it can support parallel apps or not.
Author: raidcracker    Posted: 2005-06-14 18:49
raidcracker wrote:
Having one person do both development and testing is asking for trouble. :>


http://bbs.chinaunix.net/forum/viewtopic.php?t=544517&show_type=&postdays=0&postorder=asc&start=80

How about giving us some suggestions on our development division of labor and process?

------------------------------------------------------------

I do not have much experience with project management; I simply detest development being placed above testing,
and I feel your division of labor leans that way. Testing should be independent and run in parallel with development guidance; moreover testing is the test engineer's responsibility and debugging is the development engineer's responsibility. The two must not be merged into one.

Finally, please don't misunderstand: I am not a tester speaking up for testing; I do R&D on RAID and SAN.
Author: yftty    Posted: 2005-06-16 00:13
Drawn as a Gantt chart, it should look something like this:

  |
  |  research ->
  |    ^ development ->
  |       ^ QA ->
  |          ^ operations (maintenance) ->
----------------------------

So you must be very familiar with FC, SCSI, iSCSI, NFS, SAMBA, CACHE, RAID and the like.
Author: BigMonkey    Posted: 2005-06-16 11:36
"Posted at: 2005-06-16 00:13    Subject: To: raidcracker"

So the OP is still up this late.
Author: BigMonkey    Posted: 2005-06-16 11:41
By the way, is yf your real name's acronym?
Author: yftty    Posted: 2005-06-16 14:49
Originally posted by "BigMonkey": By the way, is yf your real name's acronym?


Hehe, the secret answer is 'yes'. And yftty is short for 'yf before a tty'.

By the way, the FS also needs to supply a PHP interface, so please share some SWIG experience if you have any, or want to.
Author: raidcracker    Posted: 2005-06-16 16:49
Originally posted by "yftty":
Drawn as a Gantt chart, it should look something like this:

  |
  |  research ->
  |    ^ development ->
  |       ^ QA ->
  |          ^ operations (maintenance) ->
----------------------------

So you must be very familiar with FC, SCSI, iSCSI, NFS ..........


These are the tools of my trade; not knowing them well is not an option. The breadth of your interests is enviable.
Author: yftty    Posted: 2005-06-17 10:58
http://lwn.net/Articles/136579/

The second version of Oracle's cluster filesystem has been in the works for some time. There has been a recent increase in cluster-related code proposed for inclusion into the mainline, so it was not entirely surprising to see an OCFS2 patch set join the crowd. These patches have found their way directly into the -mm tree for those wishing to try them out.

As a cluster filesystem, OCFS2 carries rather more baggage than a single-node filesystem like ext3. It does have, at its core, an on-disk filesystem implementation which is heavily inspired by ext3. There are some differences, though: it is an extent-based filesystem, meaning that files are represented on-disk in large, contiguous chunks. Inode numbers are 64 bits. OCFS2 does use the Linux JBD layer for journaling, however, so it does not need to bring along much of its own journaling code.

To actually function in a clustered mode, OCFS2 must have information about the cluster in which it is operating. To that end, it includes a simple node information layer which holds a description of the systems which make up the cluster. This data structure is managed from user space via configfs; the user-space tools, in turn, take the relevant information from a single configuration file (/etc/ocfs2/cluster.conf). It is not enough to know which nodes should be part of a cluster, however: these nodes can come and go, and the filesystem must be able to respond to these events. So OCFS2 also includes a simple heartbeat implementation for monitoring which nodes are actually alive. This code works by setting aside a special file; each node must write a block to that file (with an updated time stamp) every so often. If a particular block stops changing, its associated node is deemed to have left the cluster.
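
As a user-space illustration of this disk-heartbeat scheme (not the actual OCFS2 code), the sketch below has each node write a fresh timestamp into its own block of a shared file, and treats a node whose block stops changing as having left the cluster:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define HB_BLOCK_SIZE 512

    /* Write this node's heartbeat: a timestamp in block `node` of the shared file. */
    static int heartbeat_write(int fd, int node, uint64_t now)
    {
        char block[HB_BLOCK_SIZE] = { 0 };
        memcpy(block, &now, sizeof(now));
        return pwrite(fd, block, sizeof(block),
                      (off_t)node * HB_BLOCK_SIZE) == (ssize_t)sizeof(block) ? 0 : -1;
    }

    /* Return 1 if `node` has updated its block within `timeout` seconds. */
    static int heartbeat_alive(int fd, int node, uint64_t now, uint64_t timeout)
    {
        char block[HB_BLOCK_SIZE];
        uint64_t stamp;

        if (pread(fd, block, sizeof(block),
                  (off_t)node * HB_BLOCK_SIZE) != (ssize_t)sizeof(block))
            return 0;
        memcpy(&stamp, block, sizeof(stamp));
        return now - stamp <= timeout;
    }

    int main(void)
    {
        int fd = open("heartbeat_region", O_CREAT | O_RDWR, 0644);
        if (fd < 0)
            return 1;

        uint64_t now = (uint64_t)time(NULL);
        heartbeat_write(fd, /* node */ 0, now);
        printf("node 0 alive: %d\n", heartbeat_alive(fd, 0, now, 10));

        close(fd);
        return 0;
    }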

Another important component is the distributed lock manager. OCFS2 includes a lock manager which, like the implementation covered last week, is called "dlm" and implements a VMS-like interface. Oracle's implementation is simpler, however (its core locking function only has eight parameters...), and it lacks many of the fancier lock types and functions of Red Hat's implementation. There is also a virtual filesystem interface ("dlmfs") making locking functionality available to user space.

There is a simple, TCP-based messaging system which is used by OCFS2 to talk between nodes in a cluster.

The remaining code is the filesystem implementation itself. It has all of the complications that one would expect of a high-performance filesystem implementation. OCFS2, however, is meant to operate with a disk which is, itself, shared across the cluster (perhaps via some sort of storage-area network or multipath scheme). So each node on the cluster manipulates the filesystem directly, but they must do so in a way which avoids creating chaos. The lock manager code handles much of this - nodes must take out locks on on-disk data structures before working with them.

There is more to it than that, however. There is, for example, a separate "allocation area" set aside for each node in the cluster; when a node needs to add an extent to a file, it can take it directly from its own allocation area and avoid contending with the other nodes for a global lock. There are also certain operations (deleting and renaming files, for example) which cannot be done by a node in isolation. It would not do for one node to delete a file and recycle its blocks if that file remains open on another node. So there is a voting mechanism for operations of this type; a node wanting to delete a file first requests a vote. If another node vetoes the operation, the file will remain for the time being. Either way, all nodes in the cluster can note that the file is being deleted and adjust their local data structures accordingly.

The code base as a whole was clearly written with an eye toward easing the path into the mainline kernel. It adheres to the kernel's coding standards and avoids the use of glue layers between the core filesystem code and the kernel. There are no changes to the kernel's VFS layer. Oracle's developers also appear to understand the current level of sensitivity about the merging of cluster support code (node and lock managers, heartbeat code) into the kernel. So they have kept their implementation of these functionalities small and separate from the filesystem itself. OCFS2 needs a lock manager now, for example, so it provides one. But, should a different implementation be chosen for merging at some future point, making the switch should not be too hard.

One assumes that OCFS2 will be merged at some point; adding a filesystem is not usually controversial if it is implemented properly and does not drag along intrusive VFS-layer changes. It is only one of many cluster filesystems, however, so it is unlikely to be alone. The competition in the cluster area, it seems, is just beginning.
Author: yftty    Posted: 2005-06-17 11:05
http://lwn.net/Articles/136579/

Plan 9 started as Ken Thompson and Rob Pike's attempt to address a number of perceived shortcomings in the Unix model. Among other things, Plan 9 takes the "everything is a file" approach rather further than Unix does, and tries to do so in a distributed manner. Plan 9 never took off the way Unix did, but it remains an interesting project; it has been free software since 2003.

One of the core components of Plan 9 is the 9P filesystem. 9P is a networked filesystem, somewhat equivalent to NFS or CIFS, but with its own particular approach. 9P is not as much a way of sharing files as a protocol definition aimed at the sharing of resources in a networked environment. There is a draft RFC available which describes this protocol in detail.

The protocol is intentionally simple. It works in a connection-oriented, single-user mode, much like CIFS; each user on a Plan 9 system is expected to make one or more connections to the server(s) of interest. Plan 9 operates with per-user namespaces by design, so each user ends up with a unique view of the network. There is a small set of operations supported by 9P servers; a client can create file descriptors, use them to navigate around the filesystem, read and write files, create, rename and delete files, and close things down; that's about it.

The protocol is intentionally independent of the underlying transport mechanism. Typically, a TCP connection is used, but that is not required. A 9P client can, with a proper implementation, communicate with a server over named pipes, zero-copy memory transports, RDMA, RFC1149 avian links, etc. The protocol also puts most of the intelligence on the server side; clients, for example, perform no caching of data. An implication of all these choices is that there is no real reason why 9P servers have to be exporting filesystems at all. A server can just as easily offer a virtual filesystem (along the lines of /proc or sysfs), transparent remote access to devices, connections to remote processes, or just about anything else. The 9P protocol is the implementation of the "everything really is a file" concept. It could thus be used in a similar way as the filesystems in user space (FUSE) mechanism currently being considered for merging. 9P also holds potential as a way of sharing resources between virtualized systems running on the same host.
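
To make that "small set of operations" concrete, here is a reader's summary of the 9P2000 message set and wire header as described in the draft RFC mentioned above; the numeric values follow the published protocol description, and this is not code taken from v9fs:

    #include <stdint.h>

    /* Every request is a T-message answered by the matching R-message. */
    enum p9_msg_type {
        Tversion = 100, Rversion,   /* negotiate protocol version and msize   */
        Tauth    = 102, Rauth,      /* optional authentication                */
        Tattach  = 104, Rattach,    /* establish a connection root (fid)      */
        Rerror   = 107,             /* error reply (there is no Terror)       */
        Tflush   = 108, Rflush,     /* cancel an outstanding request          */
        Twalk    = 110, Rwalk,      /* navigate the hierarchy                 */
        Topen    = 112, Ropen,      /* prepare a fid for I/O                  */
        Tcreate  = 114, Rcreate,    /* create a file                          */
        Tread    = 116, Rread,      /* read from a fid                        */
        Twrite   = 118, Rwrite,     /* write to a fid                         */
        Tclunk   = 120, Rclunk,     /* forget a fid (close)                   */
        Tremove  = 122, Rremove,    /* delete a file                          */
        Tstat    = 124, Rstat,      /* get file attributes                    */
        Twstat   = 126, Rwstat,     /* set file attributes                    */
    };

    /* Common framing carried by every 9P message: size[4] type[1] tag[2]. */
    struct p9_hdr {
        uint32_t size;   /* total message length, including this header */
        uint8_t  type;   /* one of enum p9_msg_type                     */
        uint16_t tag;    /* matches a reply to its request              */
    };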

There is a 9P implementation for Linux, called "v9fs"; Eric Van Hensbergen has recently posted a v9fs patch set for review with an eye toward eventual inclusion. v9fs is a full 9P client implementation; there is also a user-space server available via the v9fs web site.

Linux and Plan 9 have different ideas of how a filesystem should work, so a fair amount of impedance matching is required. Unix-like systems prefer filesystems to be mounted in a global namespace for all users, while Plan 9 filesystems are a per-user resource. A v9fs filesystem can be used in either mode, though the most natural way is to use Linux namespaces to allow each user to set up independently authenticated connections. The lack of client-side caching does not mix well with the Linux VFS, which wants to cache heavily. The current v9fs implementation disables all of this caching. In some areas, especially write performance, this lack of caching makes itself felt. In others, however, v9fs claims better performance than NFS as a result of its simpler protocol. Plan 9 also lacks certain Unix concepts - such as symbolic links. To ease interoperability with Unix systems, a set of protocol extensions has been provided; v9fs uses those extensions where indicated.

The current release is described as "reasonably stable." The basic set of file operations has been implemented, with the exception of mmap(), which is hard to do in a way which does not pose the risk of system deadlocks. Future plans include "a more complete security model" and some thought toward implementing limited client-side caching, perhaps by using the CacheFS layer. See the patch introduction for pointers to more information, mailing lists, etc.



(Posted Jun 6, 2005 16:53 UTC (Mon) by guest stfn)

The design philosophy shares something with the recently popular "REpresentational State Transfer" style of web services. They each chose one unifying metaphor and a minimal interface: either everything is a file and accessed through file system calls, or everything is a resource and accessed through HTTP methods on a URL.

That might be a naive simplification* but other have observed the same:

http://www.xent.com/pipermail/fork/2001-August/002801.html
http://rest.blueoxen.net/cgi-bin/wiki.pl?RestArchitectura...

* It's only one aspect of the design and, on the other hand, there's all kinds of caching in the web and URIs if not URLs are meant to form a global namespace that all users share.
Author: yftty    Posted: 2005-06-17 11:15
http://lwn.net/Articles/100321/

Many filesystems operate with a relatively slow backing store. Network filesystems are dependent on a network link and a remote server; obtaining a file from such a filesystem can be significantly slower than getting the file locally. Filesystems using slow local media (such as CDROMs) also tend to be slower than those using fast disks. For this reason, it can be desirable to cache data from these filesystems on a local disk.

Linux, however, has no mechanism which allows filesystems to perform local disk caching. Or, at least, it didn't have such a mechanism; David Howells's CacheFS patch changes that.

With CacheFS, the system administrator can set aside a partition on a block device for file caching. CacheFS will then present an interface which may be used by other filesystems. There is a basic registration interface, and a fairly elaborate mechanism for assigning an index to each file. Different filesystems will have different ways of creating identifiers for files, so CacheFS tries to impose as little policy as possible and let the filesystem code do what it wants. Finally, of course, there is an interface for caching a page from a file, noting changes, removing pages from the cache, etc.

CacheFS does not attempt to cache entire files; it must be able to deal with the possibility that somebody will try to work with a file which is bigger than the entire cache. It also does not actually guarantee to cache anything; it must be able to perform its own space management, and things must still function even in the absence of an actual cache device. This should not be an obstacle for most filesystems which, by their nature, must be prepared to deal with the real source for their files in the first place.

CacheFS is meant to work with other filesystems, rather than being used as a standalone filesystem in its own right. Its partitions must be mounted before use, however, and CacheFS uses the mount point to provide a view into the cached filesystem(s). The administrator can even manually force files out of the cache by simply deleting them from the mounted filesystem.

Interposing a cache between the user and the real filesystem clearly adds another failure point which could result in lost data. CacheFS addresses this issue by performing journaling on the cache contents. If things come to an abrupt halt, CacheFS will be able to replay any lost operations once everything is up and functioning again.

The current CacheFS patch is used only by the AFS filesystem, but work is in progress to adapt others as well. NFS, in particular, should benefit greatly from CacheFS, especially when NFSv4 (which is designed to allow local caching) is used. Expect this patch to have a relatively easy journey into the mainstream kernel. For those wanting more information, see the documentation file included with the patch.

  CacheFS & Security
(Posted Sep 2, 2004 16:41 UTC (Thu) by subscriber scripter)

I wonder what the security implications of CacheFS are. Does each file inherit the permissions of the original? Is confidentiality a problem? What if you want to securely erase a file?

  CacheFS & Security
(Posted Sep 3, 2004 19:49 UTC (Fri) by subscriber hppnq)

Not knowing anything about CacheFS internals, I would say these are cases of "don't do it, then".

  CacheFS & Security
(Posted Sep 13, 2004 18:49 UTC (Mon) by guest AnswerGuy)

The only difference between accessing a filesystem directly and through CacheFS should be that the CacheFS can store copies of the accessed data on a local block device. In other words that there's a (potentially persistent) footprint of all accesses.

Other than that CacheFS should preserve the same permissions semantics as if a given user/host were accessing the backend filesystem/service directly.

  A general caching filesystem
(Posted Sep 14, 2004 2:13 UTC (Tue) by subscriber xoddam)

This seems to me like a really complicated reimplementation of virtual
memory.

All filesystems already use VM pages for caching, don't they?
I'd have thought that attaching backing store to those pages would have
been a much simpler task than writing a whole new cache interface.

But then I'm not really a filesystem hacker.

  A general caching filesystem
(Posted Oct 25, 2004 0:55 UTC (Mon) by subscriber jcm)

xoddam writes:

> This seems to me like a really complicated reimplementation of
> virtual memory.

No it's really not. By virtual memory your are referring to an aspect of VM implementations known as paging, and that in itself only really impacts upon so called ``anonymous memory''. There is a page cache for certain regular filesystems but it's not possible for all filesystems to exploit the page cache to full effect and in any case this patch adds the ability to use a local disk as an additional cache storage for even slower stuff like network mounted filesystems - so the page cache can always sit between this disk and user processes which use it.

Jon.

  Improve "Laptop mode"
(Posted Oct 7, 2004 18:57 UTC (Thu) by subscriber BrucePerens)

I haven't looked at the CacheFS code yet, but this is what I would like to do with it, or something like it.

Put a cache filesystem on a FLASH disk plugged into my laptop. My laptop has a 512M MagicGate card, which looks like a USB disk. Use it to cache all recently read and written blocks from the hard disk, and allow the hard disk to remain spun down most of the time. Anytime the disk has to be spun up, flush any pending write blocks to it.

This would be an improvement over "laptop mode" in that it would not require system RAM and could thus be larger, and would not be as volatile as a RAM write cache.

Bruce
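The scheme Bruce describes is essentially a write-back block cache with flush-on-spin-up. The sketch below is only an illustration under hypothetical names (it is not the CacheFS patch, and the "disk" operations are just prints): writes land in the flash cache while the hard disk sleeps, and any spin-up flushes every pending dirty block.

/* Sketch of a write-back cache with flush-on-spin-up: reads and writes go
 * to a cache on flash, dirty blocks are held there while the hard disk
 * stays spun down, and the moment the disk spins up for any reason every
 * pending write is flushed. Hypothetical names, in-memory only. */
#include <stdio.h>
#include <stdint.h>

#define CACHE_SLOTS 8

struct cached_block {
    uint64_t lba;       /* disk address of the cached block */
    int      valid;
    int      dirty;     /* written in the cache, not yet on the hard disk */
};

static struct cached_block cache[CACHE_SLOTS];

/* Writes land only in the flash cache; the hard disk can stay asleep. */
static void cache_write(uint64_t lba)
{
    struct cached_block *b = &cache[lba % CACHE_SLOTS];
    b->lba = lba;
    b->valid = 1;
    b->dirty = 1;
}

/* Whenever something forces the disk to spin up, flush all dirty blocks so
 * the cost of the spin-up is amortized over every pending write. */
static void disk_spin_up(void)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].valid && cache[i].dirty) {
            printf("flushing block %llu to hard disk\n",
                   (unsigned long long)cache[i].lba);
            cache[i].dirty = 0;
        }
}

int main(void)
{
    cache_write(10);
    cache_write(11);    /* disk still spun down */
    disk_spin_up();     /* a cache miss, say, forces a spin-up */
    return 0;
}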
作者: yftty    时间: 2005-06-20 14:07
标题: Unix下针对邮件,搜索,网络硬盘等海量存储的分布式文件系统项目
1. Introduction to the BeOS and BFS

1.1 History leading up to BFS

The Solution

Starting in September 1996, Cyril Meurillon and I set about to define a new I/O architecture and file system for BeOS. We knew that the existing split of file system and database would no longer work. We wanted a new, high-performance file system that supported the database functionality the BeOS was known for as well as a mechanism to support multiple file systems. We also took the opportunity to clean out some of the accumulated cruft that had worked its way into the system over the course of the previous five years of development.

The task we had to solve had two very clear components. First there was the higher-level file system and device interface. This half of the project involved defining an API for file system and device drivers, managing the name space, connecting program requests for files into file descriptors, and managing all the associated state. The second half of the project involved writing a file system that would provide the functionality required by the rest of the BeOS. Cyril, being the primary kernel architect at Be, took on the first portion of the task. The most difficult portion of Cyril's project involved defining the file system API in such a way that it was as multithreaded as possible, correct, deadlock-free, and efficient. That task involved many major iterations as we battled over what a file system had to do and what the kernel layer would manage. There is some discussion of this level of the file system in Chapter 10, but it is not the primary focus of this book.

My half of the project involved defining the on-disk data structures, managing all the nitty-gritty physical details of the raw disk blocks, and performing the I/O requests made by programs. Because the disk block cache is intimately intertwined with the file system (especially a journaled file system), I also took on the task of rewriting the block cache.

1.2 Design Goals

...

In addition to the above design goals, we had the long-standing goals of making the system as multithreaded and as efficient as possible, which meant fine-grained locking everywhere and paying close attention to the overhead introduced by the file system. Memory usage was also a big concern. ...

1.3 Design Constraints

There were also several design constraints that the project had to contend with. The first and foremost was the lack of engineering resources. The Be engineering staff was quite small, at the time only 13 engineers. Cyril and I had to work alone because everyone else was busy with other projects. We also did not have very much time to complete the project. Be, Inc., tries to have regular software releases, once every four to six months. The initial target was for the project to take six months. The short amount of time to complete the project and the lack of engineering resources meant that there was little time to explore different designs and to experiment with completely untested ideas. In the end it took nine months for the first beta release of BFS. The final version of BFS shipped the following month.

2. What is a File System ?

2.1 The Fundamentals

It is important to keep in mind the abstract goal of what a file system must achieve: to store, retrieve, locate, and manipulate information. Keeping the goal stated in general terms frees us to think of alternative implementations and possibilities that might not otherwise occur if we were to only think of a file system as a typical, strictly hierarchical, disk-based structure.

...

2.3 The Abstractions

...

Extents

Another technique to manage mapping from logical positions in a byte stream to data blocks on disk is to use extent lists. An extent list is similar to the simple block list described previously except that each block address is not just for a single block but rather for a range of blocks. That is, every block address is given as a starting block and a length (expressed as the number of successive blocks following the starting block). The size of an extent is usually larger than a simple block address but is potentially able to map a much larger region of disk space.

...

Although extent lists are a more compact way to refer to large amounts of data, they may still require use of indirect or double-indirect blocks. If a file system becomes highly fragmented and each extent can only map a few blocks of data, then the use of indirect and double-indirect blocks becomes a necessity. One disadvantage to using extent lists is that locating a specific file position may require scanning a large number of extents. Because the length of an extent is variable, when locating a specific position the file system must start at the first extent and scan through all of them until it finds the extent that covers the position of interest. ...
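To make the lookup cost concrete, here is a minimal sketch in C of resolving a logical block number against an in-memory extent list. The struct and function names are hypothetical, not BFS's actual on-disk layout, and a real file system would chain to indirect blocks once the list overflows.

/* Minimal sketch of extent-list lookup: hypothetical names, list held in
 * memory. Note the linear scan - because extent lengths vary, the file
 * system cannot jump straight to the right entry, which is the
 * disadvantage noted in the text above. */
#include <stdint.h>
#include <stdio.h>

struct extent {
    uint64_t start_block;   /* first disk block of this run */
    uint32_t length;        /* number of consecutive blocks in the run */
};

/* Map a logical block index within the file to a disk block number.
 * Returns 0 on success, -1 if the offset lies beyond the mapped extents. */
static int extent_lookup(const struct extent *list, size_t nextents,
                         uint64_t logical_block, uint64_t *disk_block)
{
    uint64_t covered = 0;
    for (size_t i = 0; i < nextents; i++) {
        if (logical_block < covered + list[i].length) {
            *disk_block = list[i].start_block + (logical_block - covered);
            return 0;
        }
        covered += list[i].length;
    }
    return -1;  /* past the end of the file or in an unmapped hole */
}

int main(void)
{
    /* Three extents mapping 8 + 4 + 16 = 28 logical blocks. */
    struct extent file_map[] = { {1000, 8}, {2048, 4}, {512, 16} };
    uint64_t disk;
    if (extent_lookup(file_map, 3, 10, &disk) == 0)
        printf("logical block 10 -> disk block %llu\n",
               (unsigned long long)disk);   /* 2048 + (10 - 8) = 2050 */
    return 0;
}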

Storing Directory Entries

...

Another method of organizing directory entries is to use a sorted data structure suitable for on-disk storage. One such data structure is a B-tree (or its variants, the B+tree and B*tree). A B-tree keeps the keys sorted by their name and is efficient at looking up whether a key exists in the directory. B-trees also scale well and are able to deal efficiently with directories that contain many tens of thousands of files.
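A full B-tree is too long to sketch here, but the property the text relies on - keys kept sorted by name so a lookup is logarithmic rather than a linear scan - can be illustrated with a small sorted table and C's bsearch(). All names below are hypothetical; this is only a stand-in for the on-disk tree, not an implementation of it.

/* Not a B-tree: just a sorted, in-memory table of directory entries, to
 * illustrate why sorted keys make name lookups O(log n). A real B+/B*-tree
 * keeps the same ordering but pages the keys to disk, so a directory of
 * tens of thousands of entries needs only a few block reads per lookup. */
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

struct dirent_key {
    const char *name;
    uint64_t    inode;
};

static int cmp_name(const void *a, const void *b)
{
    const struct dirent_key *ka = a, *kb = b;
    return strcmp(ka->name, kb->name);
}

int main(void)
{
    struct dirent_key dir[] = {          /* must stay sorted by name */
        {"Makefile", 12}, {"include", 7}, {"src", 42},
    };
    struct dirent_key key = { "src", 0 };
    struct dirent_key *hit = bsearch(&key, dir, 3, sizeof dir[0], cmp_name);
    printf("src -> inode %llu\n",
           hit ? (unsigned long long)hit->inode : 0ULL);
    return 0;
}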

2.5 Extended File System Operations

...

Indexing

File attributes allow users to associate additional information with files, but there is even more that a file system can do with extended file attributes to aid users in managing and locating their information if the file system also indexes the attributes. For example, if we added a *keyword* attribute to a set of files and the *keyword* attribute was indexed, the user could then issue queries asking which files contained various keywords, regardless of their location in the hierarchy.

When coupled with a good query language, indexing offers a powerful alternative interface to the file system. With queries, users are not restricted to navigating a fixed hierarchy of files; instead they can issue queries to find the working set of files they would like to see, regardless of the location of the files.
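As a rough illustration of what an attribute index buys, the sketch below (hypothetical names, toy in-memory data) answers a "which files carry this keyword" query without ever touching the directory hierarchy.

/* Sketch of an attribute index: a toy in-memory inverted map from a
 * "keyword" attribute value to the inodes of files that carry it.
 * Hypothetical names; a real file system would keep such indices on disk,
 * typically in the same tree structures it uses for directories. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

struct index_entry {
    const char *keyword;    /* attribute value */
    uint64_t    inode;      /* file carrying that value */
};

/* "Query": list every file whose keyword attribute equals `want`,
 * regardless of where the file lives in the hierarchy. */
static void query_keyword(const struct index_entry *idx, size_t n,
                          const char *want)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(idx[i].keyword, want) == 0)
            printf("match: inode %llu\n",
                   (unsigned long long)idx[i].inode);
}

int main(void)
{
    struct index_entry idx[] = {
        {"invoice", 101}, {"photo", 102}, {"invoice", 250},
    };
    query_keyword(idx, 3, "invoice");   /* prints inodes 101 and 250 */
    return 0;
}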

Journaling/Logging

Avoiding corruption in a file system is a difficult task. Some file systems go to great lengths to avoid corruption problems. They may attempt to order disk writes in such a way that corruption is recoverable, or they may force operations that can cause corruption to be synchronous so that the file system is always in a known state. Still other systems simply avoid the issue and depend on a very sophisticated file system check program to recover in the event of failures. All of these approaches must check the disk at boot time, a potentially lengthy operation (especially as disk sizes increase). Further, should a crash happen at an inopportune time, the file system may still be corrupt.

A more modern approach to avoiding corruption is *journaling*. Journaling, a technique borrowed from the database world, avoids corruption by batching groups of changes and committing them all at once to a transaction log. The batched changes guarantee the atomicity of multiple changes. That atomicity guarantee allows the file system to guarantee that operations either happen completely or not at all. Further, if a crash does happen, the system need only replay the transaction log to recover the system to a known state. Replaying the log is an operation that takes at most a few seconds, which is considerably faster than the file system check that nonjournaled file systems must make.
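The batching and replay described above can be sketched in a few lines. The names here are hypothetical and the "disk writes" are just prints, but the ordering - log the batch, write a commit record, only then touch the real blocks, and replay committed batches after a crash - is the essence of the technique.

/* Sketch of write-ahead journaling, with hypothetical names: block changes
 * are batched into a transaction, the whole batch is appended to the log
 * and marked committed, and only then are the real disk blocks updated.
 * Crash recovery replays committed transactions; uncommitted ones are
 * discarded, which is what gives the all-or-nothing guarantee. */
#include <stdio.h>
#include <stdint.h>

#define MAX_CHANGES 16

struct change { uint64_t block; char data[64]; };

struct transaction {
    struct change changes[MAX_CHANGES];
    int           nchanges;
    int           committed;    /* set only after the log write succeeds */
};

static void txn_add(struct transaction *t, uint64_t block, const char *data)
{
    struct change *c = &t->changes[t->nchanges++];
    c->block = block;
    snprintf(c->data, sizeof c->data, "%s", data);
}

/* Append the batch to the log, then mark it committed. In a real file
 * system both steps are ordered disk writes; here they are just prints. */
static void txn_commit(struct transaction *t)
{
    for (int i = 0; i < t->nchanges; i++)
        printf("log: block %llu <- \"%s\"\n",
               (unsigned long long)t->changes[i].block, t->changes[i].data);
    t->committed = 1;
    printf("log: commit record written\n");
}

/* After a crash: committed transactions are replayed onto the real blocks;
 * anything without a commit record never happened as far as the FS knows. */
static void replay(const struct transaction *t)
{
    if (!t->committed) {
        printf("replay: discarding partial transaction\n");
        return;
    }
    for (int i = 0; i < t->nchanges; i++)
        printf("replay: writing block %llu\n",
               (unsigned long long)t->changes[i].block);
}

int main(void)
{
    struct transaction t = { .nchanges = 0, .committed = 0 };
    txn_add(&t, 8, "new inode");
    txn_add(&t, 9, "directory entry");
    txn_commit(&t);
    replay(&t);
    return 0;
}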

Guaranteed Bandwidth/Bandwidth Reservation

The desire to guarantee high-bandwidth I/O for multimedia applications drives some file system designers to provide special hooks that allow applications to guarantee that they will receive a certain amount of I/O bandwidth (within the limits of the hardware). To accomplish this the file system needs a great deal of knowledge about the capabilities of the underlying hardware it uses and must schedule I/O requests. This problem is nontrivial and still an area of research.

Access Control Lists

Access Control Lists (ACLs) provide an extended mechanism for specifying who may access a file and how they may access it. The traditional POSIX approach of three sets of permissions - for the owner of a file, the group that the owner is in, and everyone else - is not sufficient in some settings. An access control list specifies the exact level of access that any person may have to a file. This allows for fine-grained control over the access to a file in comparison to the broad divisions defined in the POSIX security model.
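A minimal sketch of the idea, with hypothetical names: a per-file list of (user, permissions) entries consulted at access time. Real POSIX.1e ACLs add group entries, a mask, and default entries, all of which are omitted here.

/* Sketch of an access-control-list check, with hypothetical names. The
 * core idea: a per-file list of (principal, permitted operations) pairs
 * that is consulted instead of the three fixed owner/group/other sets. */
#include <stdio.h>
#include <stdint.h>

#define ACL_READ  0x1
#define ACL_WRITE 0x2
#define ACL_EXEC  0x4

struct acl_entry {
    uint32_t uid;       /* the user this entry applies to */
    uint32_t perms;     /* bitmask of ACL_* the user is granted */
};

/* Return nonzero if `uid` is granted every permission in `wanted`. */
static int acl_permits(const struct acl_entry *acl, size_t n,
                       uint32_t uid, uint32_t wanted)
{
    for (size_t i = 0; i < n; i++)
        if (acl[i].uid == uid)
            return (acl[i].perms & wanted) == wanted;
    return 0;   /* no entry for this user: deny by default */
}

int main(void)
{
    struct acl_entry file_acl[] = {
        { 1000, ACL_READ | ACL_WRITE },   /* owner-like access */
        { 1001, ACL_READ },               /* a specific colleague, read-only */
    };
    printf("uid 1001 write? %s\n",
           acl_permits(file_acl, 2, 1001, ACL_WRITE) ? "yes" : "no");
    return 0;
}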
作者: yftty    时间: 2005-06-22 10:50
标题: Unix下针对邮件,搜索,网络硬盘等海量存储的分布式文件系统项目
http://zgp.org/linux-elitists/20040101205016.E5998@shaitan.lightconsulting.com.html

3. Elastic Quota File System (EQFS) Proposal
23 Jun 2004 - 30 Jun 2004 (46 posts) Archive Link: "Elastic Quota File System
(EQFS)"
People: Amit Gud, Olaf Dabrunz, Mark Cooke

Amit Gud said:

    Recently I'm into developing an Elastic Quota File System (EQFS). This file
    system works on a simple concept ... give it to others if you're not using
    it, let others use it, but on the guarantee that you get it back when you
    need it!!

    Here I'm talking about disk quotas. In any typical network, e.g.
    sourceforge, each user is given a fixed amount of quota - 100 MB in the
    case of sourceforge. 100 MB is way over some projects' requirements and
    too small for others. EQFS tries to solve this problem by exploiting the
    users' usage behavior at runtime. That is, quota a user doesn't need is
    given to users who do need it, with 100% assurance that the original
    user can reclaim his/her quota at any time.

    Before getting into implementation details I want to get public opinion
    about this system. All EQFS tries to do is maximize disk space usage,
    which is otherwise wasted if a user doesn't really need the allocated
    quota; on the other hand it helps avoid starving the user who needs more
    space. It also frees the administrator from the problem of variable
    quota needs, as EQFS itself adjusts according to the users' needs.

Mark Watts asked how it would be possible to "guarantee" that the user would
get the space back when they wanted it. Amit expanded:

    Ok, this is what I propose:

    Let's say there are just 2 users with 100 megs of individual quota; user
    A is using 20 megs and user B is running out of his quota. Now what B
    could do is delete some of his files to make free space for storing
    other files. What I say is that instead of deleting the files, he
    declares those files as elastic.

    Now, the moment he makes those files elastic, that amount of space is
    added to his quota. Here Mark Cooke's equation applies with some
    modifications: N is the number of users; Qi is the allocated quota of
    the ith user; Ui is the individual disk usage of the ith user (should be
    <= the allocated quota of that user); D is the disk threshold, that is,
    the amount of disk space the admin wants to allow the users to use
    (should be >= the sum of all users' allocated quotas, i.e. the sum of Qi
    for i = 0 to N - 1).

    The total usage of all the users (here A & B) should at _any time_ stay
    within D, i.e. the sum of Ui <= D for i = 0 to N - 1.

    The point to note here is that we are not bothered about how much quota
    the admin has allocated to an individual user, but are more interested
    in the usage pattern of the users. E.g. if user B wants additional space
    of, say, 25 megs, he picks 25 megs of his files and 'marks' them
    elastic. Now his quota is increased to 125 megs and he can add 25 more
    megs of files; at the same time the allocated quota for user A is left
    unaffected. Applying the above equation, total usage is now A: 20 megs,
    B: 125 megs, total 145 <= D, say 200 megs. Thus this should be OK for
    the system, since the usage is within bounds.

    Now what happens if the sum of Ui > D? This can happen when user A tries
    to reclaim his space, i.e. if user A adds, say, 70 more megs of files,
    the total usage is now A: 90 megs, B: 125 megs; 215 is no longer <= D.
    The moment the total usage crosses that value, 'action' will be taken on
    the elastic files. Here the elastic files are user B's, so only those
    will be affected and user A's data will be untouched; in this way it is
    completely transparent to user A. What action should be taken can be
    specified by the user while making the files elastic: he can opt to
    delete the file, compress it, or move it to some place (a backup) where
    he knows he has write access. The corresponding action will be taken
    until the threshold is met.

    Will this work? We are relying on the 'free' space (i.e. D minus the sum
    of Ui) for the users to benefit. The chance of having a greater value
    for that free space increases with the number of users N. Here we are
    talking about 2 users, but think of 10000+ users, where the users will
    probably never use up _all_ of the allocated disk space. This user
    behavior can be well exploited.

    EQFS fits best on mail servers. Here, e.g., I make the whole
    linux-kernel mailing list elastic. As long as the sum of Ui <= D I get
    to keep all the messages; whenever it exceeds D, the messages with the
    latest dates will be 'acted' upon.

    For variable quota needs, the admin can allocate different quotas for
    different users, but this gets tiresome when N is large. With EQFS, he
    can allocate a fixed quota for each user (old and new), set a value for
    D, and relax. The users will automatically get the quota they need. One
    may ask whether this could be done by just setting a value for D,
    checking it against the sum of Ui, and not allocating individual quotas
    at all. But when the sum of Ui crosses D, whose files do we act on?
    Moreover, with both individual quotas and D, we give users 'controlled'
    flexibility, just like an elastic - it can be stretched, but not beyond
    a certain range.

    What happens when a user tries to eat up all the free (D minus total
    usage) space? The answer is implementation dependent, because you need
    to make a decision: should a user be allowed to make a file elastic when
    the sum of Ui == D? I think by saying 'yes' we eliminate some users'
    mischief of eating up all the free space.
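Amit posted no code, so the following is only a sketch, with hypothetical names, of the arithmetic in the quote: making files elastic raises a user's quota, and when another user reclaims space and the total usage would exceed D, the shortfall is taken out of elastic bytes.

/* Sketch of the elastic-quota invariant from the quoted proposal, with
 * hypothetical names: the sum of all users' usage must stay <= D, and any
 * shortfall when a user reclaims space is taken out of elastic bytes. */
#include <stdio.h>
#include <stdint.h>

struct user {
    uint64_t quota;     /* Qi: allocated quota */
    uint64_t used;      /* Ui: current usage, including elastic bytes */
    uint64_t elastic;   /* portion of `used` the user has marked elastic */
};

static uint64_t total_used(const struct user *u, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += u[i].used;
    return sum;
}

/* A write by `who` of `bytes`. If it would push the total past D, shrink
 * other users' elastic files (delete/compress/migrate in the proposal)
 * until the invariant sum(Ui) <= D holds again. */
static void fs_write(struct user *u, size_t n, uint64_t d,
                     size_t who, uint64_t bytes)
{
    u[who].used += bytes;
    for (size_t i = 0; i < n && total_used(u, n) > d; i++) {
        if (i == who || u[i].elastic == 0)
            continue;
        uint64_t over = total_used(u, n) - d;
        uint64_t take = over < u[i].elastic ? over : u[i].elastic;
        u[i].elastic -= take;
        u[i].used    -= take;   /* 'action' taken on elastic data */
        printf("reclaimed %llu megs of user %zu's elastic files\n",
               (unsigned long long)take, i);
    }
}

int main(void)
{
    /* The worked example from the quote: D = 200, A uses 20, B uses 125 of
     * which 25 megs are elastic. A then writes 70 more megs. */
    struct user users[2] = { { 100, 20, 0 }, { 100, 125, 25 } };
    fs_write(users, 2, 200, 0, 70);
    printf("A=%llu B=%llu total=%llu (<= 200)\n",
           (unsigned long long)users[0].used,
           (unsigned long long)users[1].used,
           (unsigned long long)total_used(users, 2));
    return 0;
}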

Olaf Dabrunz replied:

      + having files disappear at the discretion of the filesystem seems to be
        bad behaviour: either I need this file, then I do not want it to just
        disappear, or I do not need it, and then I can delete it myself.

        Since my idea of which files I need and which I do not need changes
        over time, I believe it is far better that I can control which files I
        need and which I do not need whenever other constraints (e.g. quota
        filled up) make this decision necessary. Also, then I can opt to try to
        convince someone to increase my quota.

      + moving the file to some other place (backup) does not seem to be a
        viable option:

          o If the backup media is always accessible, then why can't the user
            store the "elastic" files there immediately?

            -> advantages:

              # the user knows where his file is
              # applications that remember the path to a file will be able to
                access it

          o If the backup media will only be accessible after manually
            inserting it into some drive, this amounts to sending an e-mail
            to the backup admin and then passing a list of backup files to
            the backup software.

            But now getting the file back involves a considerable amount of
            manual and administrative work. And it involves bugging the backup
            admin, who now becomes the bottleneck of your EQFS.

    So this narrows down to the effective handling of backup procedures and the
    effective administration of fixed quotas and centralization of data.

    If you have many users it is also likely that there are more people
    interested in big data files. So you need to help these people organize
    themselves, e.g. by helping them create mailing lists or web pages, or
    by letting them install servers that make the data centrally available
    with some interface they can use to select parts of the data.

    I would rather suggest that if the file does not fit within a given quota,
    the user should apply for more quota and give reasons for that.

    I believe that flexible or "elastic" allocation of resources is a good
    idea in general, but it only works if you have cheap and easy ways to
    control both allocation and deallocation. So in the case of CBQ in
    networks this works, since bandwidth can easily and quickly be allocated
    and deallocated.

    But for filesystem space this requires something like a "slower (= less
    expensive), bigger, always accessible" third level of storage in the "RAM,
    disk, ..." hierarchy. And then you would need an easy or even transparent
    way to access files on this third level storage. And you need to make sure
    that, although you obviously *need* the data for something, you still can
    afford to increase retrieval times by several orders of magnitude at the
    discretion of the filesystem.

    But usually all this can be done by scripts as well.

    Still, there is a scenario and a combination of features for such a
    filesystem that IMHO would make it useful:

      + Provide allocation of overquota as you described it.
      + Let the filesystem move (parts of) the "elastic" files to some
        third-level backing-store on an as-needed basis. This provides you with
        a not-so-cheap (but cheaper than manual handling) resource management
        facility.

    Now you can use the third-level storage as a backing store for
    hard-drive space, analogous to what swap space provides for RAM. And you
    can "swap in" parts of files from there and cache them on the hard
    drive. So "elastic" files are actually files that are "swappable" to
    backing store.

    This assumes that the "elastic" files meet the requirements for a "working
    set" in a similar fashion as for RAM-based data. I.e. the swap operations
    need only be invoked relatively seldom.

    If this is not the case, your site/customer needs to consider buying more
    hard drive space (and maybe also RAM).

    The tradeoff for the user now is:

      + do not have the big file(s) OR
      + have them and be able to use them in a random-access fashion from any
        application, but maybe only with a (quite) slow access time, but
        without additional administrative/manual hassle

    Maybe this is a good tradeoff for a significant amount of users. Maybe
    there are sites/customers that have the required backing store (or would
    consider buying into this). I do not know. Find a sponsor, do some field
    research and give it a try.



