Chinaunix

Subject: OCFS, OCFS2, ASM, RAW merged discussion thread

Author: cwinxp    Time: 2006-03-08 09:27
Subject: OCFS, OCFS2, ASM, RAW merged discussion thread
20 billion rows of data, which is roughly 20 TB. Is ASM the better choice, or OCFS, or RAW?  Thanks

[ Last edited by nntp on 2006-9-1 00:42 ]
Author: nntp    Time: 2006-03-08 20:17
A few questions need answering first:

1. Is the 20 TB historical/archive data, or is it modified, updated and changed every day?
2. What proportion of the 20 TB is frequently modified/queried/updated?
3. Is the 20 TB pure structured data, or does it include media data?
4. How will the data of this business system grow? How much new data per day/month/year, and at what rate?

Once these questions are answered, the storage planning for your database becomes fairly clear; otherwise you are operating like the blind men feeling the elephant.


In my experience, it is unlikely that all 20 TB needs to be always online, so the logical database design should treat the data in tiers. Even with Oracle, if you keep the 20 TB in a single tier, performance will probably be a complete mess.

Also, ASM, OCFS and RAW are not equivalent or directly comparable; their features and designs differ a great deal.

ASM's performance is roughly on par with RAW, but its manageability is far, far better. The price you pay is added system complexity: one more layer of software, and therefore a noticeably higher chance of problems.

One thing I can say for certain: for storing these 20 TB of yours, OCFS2 should not be considered. Don't ask me why, because the explanation would take a very long time.
Author: cwinxp    Time: 2006-03-09 09:36
Subject: Thanks~~~~~~
Thanks.
If I need to query 10 to 20 TB of data concurrently, what is the best approach?
Author: nntp    Time: 2006-03-09 16:50
Originally posted by cwinxp on 2006-3-9 09:36:
Thanks.
If I need to query 10 to 20 TB of data concurrently, what is the best approach?



Get a professional firm to do the consulting. Querying 10-20 TB concurrently is no longer a routine application.

Normally they would do the following for such an application:

1. Analyze your data usage patterns and adjust the database structure (including optimizations targeted at query operations).
2. Build a conventional HA cluster, and factor in load balancing based on the query traffic sent to this system.
3. After a small pilot test, tune your OS and filesystem based on the sampled performance results. (If you have staff who know Linux well, you can do this part yourselves; there is plenty of tuning material on OTN.)
4. Possibly lay the data out differently at the physical level after analyzing what will be queried.
5. Steps 1-4 assume your hardware budget is limited. After doing 1-4 there is still the unpleasant possibility that the bottleneck turns out to be the hardware itself, meaning your performance and availability requirements do not match what your physical infrastructure can actually deliver, so the hardware has to be reworked.

Doing a 10-20 TB application well involves all of this; it is fairly complex and requires hands-on access to the real system and a deep understanding of the application.

good luck
Author: cwinxp    Time: 2006-03-10 10:19
Subject: thank you; with data this large, hardware is not an issue; would five cascaded CX700s be enough?
thank you; with data this large, hardware is not an issue; would five cascaded CX700s be enough?

Should I carve the five CX700s into several RAW areas as you described, and then hand a 10 TB (or larger) partition to ASM?
Author: nntp    Time: 2006-03-10 17:31
Originally posted by cwinxp on 2006-3-10 10:19:
thank you; with data this large, hardware is not an issue; would five cascaded CX700s be enough?

Should I carve the five CX700s into several RAW areas as you described, and then hand a 10 TB (or larger) partition to ASM?



Good hardware is of course good, but good hardware alone cannot guarantee the system will deliver the expected performance and availability. As for how to plan it, I really cannot say much; that conclusion only comes after a careful analysis of your application. With data this large, a mistake is a mistake.
Author: shimu    Time: 2006-03-10 23:49
Personally I think safety matters most with data this large, so RAW is the obvious choice. ocfs and asm are relatively new; their maturity and stability are not comparable.
Author: cwinxp    Time: 2006-03-13 09:29
Subject: Use RAW? Thanks
Use RAW?  Thanks
Author: brave_script    Time: 2006-05-09 16:01
Subject: How to grow an ocfs2 filesystem online on top of LVM
I am building an Oracle application cluster, with different LVs created on a VG providing the storage for ocfs2 filesystems. I would like to extend an LV online when it fills up. For ext3 there are ways to do this; how can it be done with an ocfs2 filesystem?
Any advice from the experts would be appreciated.
Author: nntp    Time: 2006-05-09 18:12
Is this a production system? Then do not use ocfs2.

raw + ASM is all you need.

In today's RAC environments I see no reason whatsoever to use ocfs2 in production.

RAC touches storage in two places: one is the OCR and voting disk (plus their redundant configuration); the other is the Oracle data and the flashback recovery area.

The two common Oracle RAC layouts today are raw (OCR + voting) + ASM (data + flashback recovery area), and ocfs2 + ASM.

OCR and voting take up very little space, and there is no need at all to put an OS LVM underneath ocfs2 to support them. Even if you did, it would be wrong: OCR and voting both require cluster-aware storage, which is exactly why raw or ocfs2 is used. With lvm + ocfs2 the underlying OS LVM is not cluster-aware, so it will corrupt your data. This is a very old topic; search the Oracle forums, or look it up on Metalink if you have an account. There is no point discussing it further.
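
As a point of reference, the raw (OCR + voting) half of that layout on a RHEL4-era node is roughly the sketch below; the partitions, raw bindings and ownership are illustrative assumptions only, so follow the Clusterware install guide for the real values:

        # cat /etc/sysconfig/rawdevices
        /dev/raw/raw1  /dev/sdb1
        /dev/raw/raw2  /dev/sdb2
        # service rawdevices restart
        # chown oracle:oinstall /dev/raw/raw1 /dev/raw/raw2
        (owner/permissions per the Clusterware install guide for your release)

The Data + Flashback Recovery Area side then goes to ASM disk groups, as described above.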

If you want OS LVM + ocfs2 for the Data + Flashback Recovery Area, I would advise against that as well. Not that it cannot be done, but ocfs2 is simply too fragile. Are you subscribed to the ocfs2 mailing list? Go have a look.
For Data + FRA, ASM or RAW are both fine, whether you judge by performance, manageability or reliability.

I suggest you study the RAC installation documentation carefully and get the fundamentals straight.
Author: shahand    Time: 2006-05-09 19:15
nntp answers with such patience, tireless in teaching.
Author: brave_script    Time: 2006-05-09 23:21
Thank you, moderator. The ASM examples on Oracle's official site generally use Oracle 10g; for particular reasons we are on Oracle 9.2.0.4. With raw there is a limit of at most 255 partitions, which is why we went with the ocfs2 filesystem; that is also what Oracle's official site suggests. I already have RAC working, it is just the expansion side that is not ideal. OCR and voting sit on single raw devices; the main question now is how to grow the data files and the flashback recovery area files. ocfs2 is indeed not always stable, but it is much better when it comes to expansion.
Author: nntp    Time: 2006-05-10 04:00
Oracle has never said in any best practice that you should use ocfs2; in fact nobody from Oracle in the community dares to stand up and say "go ahead and use ocfs2 in production with confidence."

Since the premise is RAC, my advice leans towards safety.

Since the problem you are solving is the Data part, and you are not using ASM, there is no choice left: it has to be LVM + OCFS.

But ocfs R1 is quite a pain: like R2 it does not support online resizing, and resizing it takes a specific sequence of steps.

So the annoying part is this: the array can be resized online, LUNs can be hot-added online, PVs can be added online, VGs can be extended online, LVs can be extended online; the one thing you cannot do online is resize the ocfs filesystem sitting on the LV. You have to unmount the ocfs volume from every node first.
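
Spelled out, the offline sequence being described looks roughly like this; the volume names are hypothetical and the filesystem-resize step is an assumption, so check which resize option (if any) the ocfs/ocfs2 tools of your release actually support before relying on it:

        (on every node)
        # umount /u02
        (the LV itself can be grown online)
        # lvextend -L +50G /dev/vg_ora/lv_ocfs
        (grow the filesystem to the new LV size; assumed tunefs.ocfs2 option, verify against your tools version)
        # tunefs.ocfs2 -S /dev/vg_ora/lv_ocfs
        (remount on every node)
        # mount -t ocfs2 /dev/vg_ora/lv_ocfs /u02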
Author: nntp    Time: 2006-05-10 14:39
ocfs1 cannot be upgraded to ocfs2 in place; if you upgrade later, you will have to export and re-import the database.

Yesterday, to double-check my reply to you, I did another search: ocfs1 bugs are all over the net, and it is alarming.
Frankly, an architecture choice like yours just creates trouble for the implementers and the customer; the real pain is still to come.

[ Last edited by nntp on 2006-5-10 14:40 ]
Author: brave_script    Time: 2006-05-11 10:33
Thanks, moderator. What I am doing now is in fact the approach you described: umount the ocfs volume to be extended on every node and then reformat it. Strictly speaking Oracle does not even require this, since Oracle keeps everything in files; I just wanted to understand whether an ocfs2 filesystem on LVM can be extended dynamically.
Author: brave_script    Time: 2006-05-11 10:35
By the way, we are using ocfs2.
Author: nntp    Time: 2006-05-11 12:47
Yesterday I saw an ocfs2 developer answer a similar question on the ocfs2 mailing list:
their reply is basically the same as what I wrote earlier in this thread (the second post).

Let me repeat it: ocfs2 is a cluster-aware filesystem. An instance of it runs on every RAC node, and through network communication plus locking it ensures that reads and writes by different nodes to the same storage region happen under control, and that all nodes, via their ocfs2 instances, know who wrote and who read what. That is what guarantees the baseline integrity of an ocfs2 filesystem.

When you create ocfs2 on top of LVM, the LVM control on each node is entirely its own affair, handled by that node's OS and LVM module. The LVM layers on different nodes do not communicate; they are completely independent and access the shared storage region without any mutual exclusion or locking. You can scan the PVs/VGs/LVs on the shared array from every node with the lvm tools, but once reads and writes are involved each node acts in complete isolation, so reading and writing the LVM metadata becomes a serious problem.
That is why ocfs2 + LVM is not acceptable for shared RAC data.

________________________________
The reply on the mailing list was:

That's why ocfs2 is not certified with lvm2.

Going forward, we will be looking into this issue. But currently
there is no certified solution.

If you are running Oracle db and need volume mgmt, you should look into ASM.
-----------------------------------------------------------------------------------------------
Author: joyhappy    Time: 2006-05-12 08:52
Subject: Reply to post #9 by nntp
"So ocfs2 + LVM is not acceptable for shared RAC data"

I agree with this statement.
In principle, though, if you really need LVM you could use LVM2 in its clustered form, that is, ocfs2 + CLVM. I have not tried it, but it should work.
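
For what it is worth, a hypothetical sketch of what that ocfs2 + CLVM combination would involve on each node. It is untested here, and clvmd itself needs a cluster manager (e.g. cman) running underneath it:

        (in /etc/lvm/lvm.conf, switch LVM to clustered locking)
        locking_type = 3
        # service clvmd start
        # vgcreate -c y vg_shared /dev/sdc
        # lvcreate -L 100G -n lv_shared vg_shared
        # mkfs.ocfs2 -L "shared" /dev/vg_shared/lv_shared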
Author: nntp    Time: 2006-05-12 15:17
Originally posted by joyhappy on 2006-5-12 08:52:
"So ocfs2 + LVM is not acceptable for shared RAC data"

I agree with this statement.
In principle, though, if you really need LVM you could use LVM2 in its clustered form, that is, ocfs2 + CLVM. I have not tried it, but it should work.


Using LVM inside an HA cluster is very common, but there it is active on one side and locked on the other. For RAC, where both sides need to access the storage at the same time, I am afraid it is not that simple.
Author: blue_stone    Time: 2006-05-12 16:54
LVM on Linux is not cluster-aware, so it cannot be used in a cluster environment; in a cluster you should use clvm.

What I do not understand is why ocfs2 cannot be used in production, given that ocfs2 has been merged into the Linux kernel.
Could nntp explain?
Author: nntp    Time: 2006-05-12 17:38
Originally posted by blue_stone on 2006-5-12 16:54:
LVM on Linux is not cluster-aware, so it cannot be used in a cluster environment; in a cluster you should use clvm.

What I do not understand is why ocfs2 cannot be used in production, given that ocfs2 has been merged into the Linux kernel.
Could nntp explain?


When I talk about production environments, the basis for my judgement is not whether something has been merged into the kernel, but the very practical question of whether it is stable. If I were in charge of a company's system architecture, I would not care how well something is hyped or how big a name backs it. If I see lots of people reporting failures with the root cause at the code level, and the problems being discussed end up affecting not only data safety (reliability) but also service continuity (availability), then I will not consider it, even if the vendor offers support and services, and certainly not a technology with no support at all.

Because one thing is very clear: by adopting a new technology today, for new features and performance I have not yet actually enjoyed, I am taking on a risk, and I have to assess that risk objectively. At what layer does the risk live? How wide could its impact be? What exactly would the loss be if it materializes? How many resources, how much money and how much extra manpower do I need to invest in prevention? How complex is recovery once it happens? How long until the system is back online? What is the impact of the downtime on the company and on individuals? Will the incident affect the next phase of the company's IT plan and its funding? Will it affect my credibility and my say, as the person responsible for IT architecture, at the decision-making level?

When we talk about production, the premise we plan for is the worst, most embarrassing, most damaging scenario. So how much a new feature is worth adopting is a question for the whole system, not something to be judged in isolation.

This has drifted a bit off topic. I guess most people here are engineers and rarely deal with project management, risk assessment and risk control, but if everyone on a project team understands a little of it, the whole project benefits enormously.

Back to the topic: from the standpoint of a Linux fan, seeing ocfs2 merged into the Linux kernel is a great thing; from the standpoint of a project, I do not recommend it at this time.

I suggest subscribing to the ocfs2 mailing list. That is where you get first-hand information.
Author: blue_stone    Time: 2006-05-12 18:01
I fully agree with what nntp said.
I feel I have been rather shallow.
Author: rambus    Time: 2006-06-23 15:33
Subject: Can ocfs really be used in production?
I am in training right now, and on the Oracle side the performance of OCFS and OCFS2 gets hyped to the skies, but when I asked peers in various places, it seems no production environment actually uses OCFS.
What is the real situation?
Author: pawnjazz    Time: 2006-06-23 16:19
Oracle's strength is the database. For clustering you should look to the OS platform vendor; as far as I know Red Hat GFS is what gets used for clusters.
Author: soway    Time: 2006-06-23 16:35
Notice: the author has been banned or deleted; content automatically hidden
Author: 我爱钓鱼    Time: 2006-06-23 17:35
A senior engineer at Oracle itself said: what is mostly used now is still RAC; do not use it for the time being...
Author: nntp    Time: 2006-06-23 17:37
ocfs can only be used within RAC.
ocfs2's development direction has shifted significantly: the goal is to become a general-purpose cluster filesystem.

I believe in the strength and future of Oracle and the ocfs2 development team. All software goes from immature to mature, from chaotic to clear, from fragile to stable. But if you have a production system that needs a cluster filesystem today, leave ocfs2 out of consideration.
Author: fengwy    Time: 2006-06-26 10:48
Originally posted by 我爱钓鱼 on 2006-6-23 17:35:
A senior engineer at Oracle itself said: what is mostly used now is still RAC; do not use it for the time being...

Isn't ocfs exactly what gets used in RAC?
Author: nntp    Time: 2006-06-26 11:03
Originally posted by fengwy on 2006-6-26 10:48:

Isn't ocfs exactly what gets used in RAC?


yep, ocfs can only be used in a RAC environment; ocfs2 is different, Oracle is making it a general cluster file system for normal applications.
Author: cs119    Time: 2006-06-26 17:02
You only know once you have used it! For the best performance, use raw devices.
Author: rambus    Time: 2006-06-27 14:47
What I find odd is why OCFS can only be used on AS3 and is no longer supported on AS4?
Author: nntp    Time: 2006-06-27 15:10
Originally posted by rambus on 2006-6-27 14:47:
What I find odd is why OCFS can only be used on AS3 and is no longer supported on AS4?


What is odd is that you think of it that way. Have you looked at the ocfs2 site? It is all written there. ocfs (ocfs1) has now been superseded by ocfs2.
Author: archangle    Time: 2006-06-28 08:27
I tested ocfs2 last year and the performance was poor. I do not know how it is now, but my feeling is that once this product matures it will be a good one.
Author: nntp    Time: 2006-06-28 12:25
ocfs2's performance is still not good at the moment; wait a while longer.
Author: nimysun    Time: 2006-06-29 10:02
After a new product comes out, you really have to wait a few years for it to mature before considering it for a production system.
Author: fengwy    Time: 2006-06-29 10:31
Originally posted by nntp on 2006-6-26 11:03:


yep, ocfs can only be used in a RAC environment; ocfs2 is different, Oracle is making it a general cluster file system for normal applications.

I have not seen any Oracle material on that.
Author: nntp    Time: 2006-06-29 11:23
Sigh..... it is written in the very first line of the very first page of the ocfs project site (http://oss.oracle.com/projects/ocfs2/)

Here is an excerpt for you:

WHAT IS OCFS2?

OCFS2 is the next generation of the Oracle Cluster File System for Linux. It is an extent based, POSIX compliant file system. Unlike the previous release (OCFS), OCFS2 is a general-purpose file system that can be used for shared Oracle home installations making management of Oracle Real Application Cluster (RAC) installations even easier. Among the new features and benefits are:

    * Node and architecture local files using Context Dependent Symbolic Links (CDSL)
    * Network based pluggable DLM
    * Improved journaling / node recovery using the Linux Kernel "JBD" subsystem
    * Improved performance of meta-data operations (space allocation, locking, etc).
    * Improved data caching / locking (for files such as oracle binaries, libraries, etc)
Author: fengwy    Time: 2006-06-30 11:21
Oops, I did not expect the material to be on the open source site.
Author: fengwy    Time: 2006-06-30 11:23
Originally posted by nntp on 2006-6-28 12:25:
ocfs2's performance is still not good at the moment; wait a while longer.

Does "performance is not good" refer to database use under RAC, or to use as a general cluster filesystem?
Author: nntp    Time: 2006-06-30 13:45
Originally posted by fengwy on 2006-6-30 11:23:

Does "performance is not good" refer to database use under RAC, or to use as a general cluster filesystem?



both.
Author: youngcow    Time: 2006-07-19 16:21
Notice: the author has been banned or deleted; content automatically hidden
Author: blue_stone    Time: 2006-07-19 22:03
Can gfs be used in a production environment?
Author: nntp    Time: 2006-07-20 04:06
Pinned this thread.
Author: vecentli    Time: 2006-07-27 11:05
OCFS2, developed by Oracle Corporation, is a Cluster File System which allows all nodes in a cluster to concurrently access a device via the standard file system interface. This allows for easy management of applications that need to run across a cluster.

OCFS (Release 1) was released in December 2002 to enable Oracle Real Application Cluster (RAC) users to run the clustered database without having to deal with RAW devices. The file system was designed to store database related files, such as data files, control files, redo logs, archive logs, etc. OCFS2 is the next generation of the Oracle Cluster File System. It has been designed to be a general purpose cluster file system. With it, one can store not only database related files on a shared disk, but also store Oracle binaries and configuration files (shared Oracle Home) making management of RAC even easier.

In this article, I will be using OCFS2 to store the two files that are required to be shared by the Oracle Clusterware software. (Along with these two files, I will also be using this space to store the shared SPFILE for all Oracle RAC instances.)
Author: 秋风No.1    Time: 2006-08-04 16:49
Originally posted by nntp on 2006-6-30 13:45:



both.

Now I am confused.

If it is not good for either, why does the Oracle RAC installation still recommend using ocfs2?
Author: nntp    Time: 2006-08-04 21:02
Originally posted by 秋风No.1 on 2006-8-4 16:49:

Now I am confused.

If it is not good for either, why does the Oracle RAC installation still recommend using ocfs2?


Have you ever seen software that was good from the very start of its development? Software under development has all kinds of problems at its current stage; should development simply be stopped and the project abandoned?

The RAC installation has never recommended ocfs2. Have you seen what the Oracle RAC product manager said at OracleWorld? It was stated very clearly.

Without a clusterwide filesystem, when a node in a RAC system fails, the failover of the storage part incurs a large delay; the reasoning is the same as RHCS vs. RHCS + GFS.

Installing RAC is not hard. What is hard is knowing when to deploy RAC, how to deploy it, which parts to deploy, which parts can be used with confidence today and which cannot, what risks using them may bring, and how to prevent and resolve those risks.
Author: 秋风No.1    Time: 2006-08-04 22:07
Originally posted by nntp on 2006-8-4 21:02:


Have you ever seen software that was good from the very start of its development? Software under development has all kinds of problems at its current stage; should development simply be stopped and the project abandoned?

The RAC installation has never recommended ocfs2. Have you seen what the Oracle RAC product manager said at OracleWorld? It was stated very clearly.

...

Thanks for the guidance.
Author: nntp    Time: 2006-08-05 10:30
One addition: in a RAC environment, if you are not using raw, then in production you should still choose ASM.
Author: fengwy    Time: 2006-08-07 00:55
Installing RAC is not hard. What is hard is knowing when to deploy RAC, how to deploy it, which parts to deploy, which parts can be used with confidence today and which cannot, what risks using them may bring, and how to prevent and resolve those risks.
------------------------------------------------------------------------------
Author: oncity    Time: 2006-08-18 07:44
Subject: Several days wrestling with ocfs2, and I am in despair
Installing ocfs2 is not hard at all (using the latest SUSE server 10, which ships everything needed).

But in actual use, the strange problems keep coming.

1) The machine locks up, especially when copying large directory trees.

2) The machine locks up when one node goes down unexpectedly (network cable pulled).

3) The machine locks up... for no discernible reason at all.....

And before or after the lockup there is nothing in any log!

It looks like my plan to upgrade from nfs to iscsi + a cluster filesystem is going to end in failure....
Author: 好好先生    Time: 2006-08-18 09:22
Originally posted by oncity on 2006-8-18 07:44:
Installing ocfs2 is not hard at all (using the latest SUSE server 10, which ships everything needed).

But in actual use, the strange problems keep coming.

1) The machine locks up, especially when copying large directory trees.

2) The machine locks up when one node goes down unexpectedly (network cable pulled).

3) The machine locks up... for no disc ...


How exactly does it lock up? Is there anything on the screen? Is it completely unresponsive, or does the kernel crash? Please describe your situation clearly... thanks!
Author: oncity    Time: 2006-08-18 10:07
Originally posted by 好好先生 on 2006-8-18 09:22:


How exactly does it lock up? Is there anything on the screen? Is it completely unresponsive, or does the kernel crash? Please describe your situation clearly... thanks!


There is no error output at all, neither on the console nor in syslog.

In all my years with Linux this is the first time I have seen this kind of total, instantaneous crash.

My guess is that it is a kernel problem.
Author: 我爱钓鱼    Time: 2006-08-18 10:12
Originally posted by oncity on 2006-8-18 10:07:


There is no error output at all, neither on the console nor in syslog.

In all my years with Linux this is the first time I have seen this kind of total, instantaneous crash.

My guess is that it is a kernel problem.


That can't be right.... if the kernel crashes there will be a log, shown on the console by default..
Author: nntp    Time: 2006-08-18 10:14
Have you seen my earlier comments on ocfs?

You should first rule out environment problems and version-dependency problems, because ocfs2 is still a system at the very beginning of its development. Despite the "2" in its name, it is really the first release to support general-purpose use as a cluster filesystem. Using ocfs2 for a production system is unwise (see my earlier posts) and simply wrong. If you use ocfs2 now, there is no way to lock down a stable set.

Why not use GFS instead?
Author: oncity    Time: 2006-08-18 10:22
Originally posted by nntp on 2006-8-18 10:14:
Have you seen my earlier comments on ocfs?

You should first rule out environment problems and version-dependency problems, because ocfs2 is still a system at the very beginning of its development. Despite the "2" in its name, it is really the first release to support general-purpose use ...


Because the platform is SUSE Linux Enterprise Server 10, which ships ocfs2, so of course I wanted to try that first.

Setting up ocfs2 is easy, and simple tests showed no problems, but the trouble starts when copying large amounts of real data.

If I go with gfs, I suppose I will have to switch to Red Hat. Which release is the most stable to install on? AS 4 U2?
Author: nntp    Time: 2006-08-18 12:45
The higher the better.
Author: nntp    Time: 2006-08-18 16:50
To the OP: I suggest you subscribe to the ocfs2 mailing list and, before doing anything, read about the pitfalls others have hit; then decide whether to use it at all.

As for SuSE SLES releases, as a rule do not put one into production before its first SP is out.
Author: pxwyd    Time: 2006-08-29 19:05
Subject: I have the same problem: RHEL4 update4, ocfs2 + 10g R2
I installed ocfs2 on RHEL4 update 4.
With nodes node01 and node02: if I pull the network cable on either node02 or node01, node02 locks up, while node01 is fine.
/var/log/messages shows the following entries before the lockup:
Aug 28 18:23:14 node02 kernel: o2net: connection to node node01 (num 0) at 192.168.210.201:7777 has been idle for 10 seconds, shutting it down.
Aug 28 18:23:14 node02 kernel: (0,0): o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1156760584.614463 now 1156760594.612669 dr 1156760584.614448 adv 1156760584.614468:1156760584.614471 func (8911b11d:505) 1156760549.622451:1156760549.622455)
Aug 28 18:23:14 node02 kernel: o2net: no longer connected to node node01 (num 0) at 192.168.210.201:7777
Aug 28 18:25:01 node02 crond(pam_unix)[4833]: session opened for user root by (uid=0)
Aug 28 18:25:01 node02 crond(pam_unix)[4833]: session closed for user root
Aug 28 18:30:01 node02 crond(pam_unix)[6257]: session opened for user root by (uid=0)
Aug 28 18:30:01 node02 crond(pam_unix)[6259]: session opened for user root by (uid=0)
Aug 28 18:30:01 node02 crond(pam_unix)[6259]: session closed for user root
Aug 28 18:30:02 node02 crond(pam_unix)[6257]: session closed for user root
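
For anyone hitting the same thing, the first knobs worth checking (all taken from the OCFS2 FAQ quoted later in this thread; paths are the 1.2-era defaults) are the cluster state, the heartbeat dead threshold and the kernel panic setting:

        # /etc/init.d/o2cb status
        # cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold
        # cat /proc/sys/kernel/panic

A node that fences itself panics by design; if kernel.panic is 0 the box simply sits there looking dead instead of rebooting (see the FAQ's auto-reboot and fencing answers).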
Author: oncity    Time: 2006-08-29 21:25
Subject: Reply to post #9 by pxwyd
The problems with ocfs2 are too complicated.

For an ordinary clustered web site, nfs is still the better fit.
Author: nonameboy    Time: 2006-08-29 23:00
First of all, I strongly recommend using RAW devices.
When the network cable is pulled, the second node should normally reboot, not lock up.
Try stopping the few CRS services first and then pulling the cable;
my guess is that it will not lock up then.
Why does one node go down?
My understanding: RAC uses both nodes at the same time, with two virtual IPs configured on the two hosts,
and the Oracle clients connect to both of your VIPs.
Normally, when one node has a problem, its VIP is moved to the surviving node so that clients can still reach both VIPs.
Communication between the two hosts goes over the private NICs; RAC relies on those NICs to share the memory pool, and the traffic on them is considerable ---- this is a different concept from the OFS we used to build on MSCS!!!!
If you pull the cable, they can no longer share the memory pool, and if clients keep using both hosts at the same time Oracle will run into trouble. So once the cable is pulled, one node must take over all the VIPs, while the other keeps rebooting endlessly until the cable is reconnected.

Your question is why it locks up instead of rebooting.
I would check whether your setup follows the documentation, and also the system's own settings.
My guess is that when CRS tried to reboot the machine it failed, leaving the SYSTEM hung.
Author: nonameboy    Time: 2006-08-29 23:03
Also, for a production system you absolutely should go with RAW devices. OCFS is just too temperamental, that is the only way to put it.
And if you use OCFS, upgrading the kernel later will be a hassle.


As for running RAC on SUSE, I remain skeptical,
because several very, very senior Linux/Oracle engineers at our company have been testing it,
and after half a year it still has not passed.

So our production RAC systems have stayed on RHEL 3.0.
Remember, RAC is not "done" just because the installation finishes.

[ Last edited by nonameboy on 2006-8-29 23:06 ]
Author: nntp    Time: 2006-08-30 01:48
To #12:

I completely disagree with your view on RAC on SuSE (leaving ocfs/ocfs2 aside).
I expect those very senior Linux/Oracle engineers at your company know that the Oracle Consulting team responsible for IDC business recommends running mission-critical Oracle on SLES9.

If there is ever a chance, I would like to compare notes with your senior engineers on SuSE and RAC; I wonder how their RHEL + RAC environment passed stability testing.
Author: pxwyd    Time: 2006-08-30 09:04
I am building Oracle RAC, which is why I am using ocfs2.
Author: pxwyd    Time: 2006-08-30 09:13
Installing RAC is not hard. What is hard is knowing when to deploy RAC, how to deploy it, which parts to deploy, which parts can be used with confidence today and which cannot, what risks using them may bring, and how to prevent and resolve those risks.



Is there any good way to solve these problems? My setup is ocfs2 + RAC on RHEL4 with two nodes; when node1's network cable goes down, node2 locks up. Apart from that the performance feels acceptable. I heard ocfs2 was released back in 2004 and assumed it was ready for commercial use; only after reading this discussion did I realize it has not really made it into production systems yet.

If I do need an Oracle RAC environment, which filesystem would you all suggest?
Author: ljhb    Time: 2006-08-30 10:14
Subject: Reply to post #20 by blue_stone
UT uses gfs with Hitachi storage in some of its projects; I do not know how well it works.
Author: pxwyd    Time: 2006-08-30 12:04
Thanks for the pointers.
My test environment is two Sun V65x servers, one SCSI disk array, RHEL4 update 4, Oracle 10g R2 and ocfs2.
The cable-pull problem appears with only ocfs2 running; only one server locks up and the other is fine. With ocfs2 and o2cb stopped there is no problem.
We have already given up on ocfs2 and plan to use raw and ASM. Is raw what your company uses as the filesystem for Oracle RAC?

How is the performance?
Author: nonameboy    Time: 2006-08-30 16:52
Originally posted by pxwyd on 2006-8-30 12:04:
Thanks for the pointers.
My test environment is two Sun V65x servers, one SCSI disk array, RHEL4 update 4, Oracle 10g R2 and ocfs2.
The cable-pull problem appears with only ocfs2 running; only one server locks up and the other is fine. With ocfs2 and o2cb stopped there is no problem.
...

We used OCFS before, but only for testing, because we found some problems during the trials.
That is why we moved to RAW devices.
Our RAC production environments now all run on RAW devices.
Author: nonameboy    Time: 2006-08-30 16:53
Originally posted by nntp on 2006-8-30 01:48:
To #12:

I completely disagree with your view on RAC on SuSE (leaving ocfs/ocfs2 aside).
I expect those very senior Linux/Oracle engineers at your company know that the Oracle Consulting team responsible for IDC business recommends running mission-critical Oracle on ...

One of those guys has a site; you can chat with him there when you have time.
http://www.puschitz.com/
Author: nntp    Time: 2006-08-30 17:36
Originally posted by pxwyd on 2006-8-30 09:04:
I am building Oracle RAC, which is why I am using ocfs2.


Please use ASM.

Whether you look at performance, the current maturity of each product, the vendor's R&D investment and support commitment, best practices or reference cases, ocfs2 should not be used in a production system at this stage.
Author: nntp    Time: 2006-08-30 17:38
to pxwyd.

Use ASM.
Author: pxwyd    Time: 2006-08-31 08:22
Thanks. That is exactly what I am testing now, asm + raw, to see how it goes.
Author: nntp    Time: 2006-08-31 15:23
Originally posted by nonameboy on 2006-8-30 16:53:

One of those guys has a site; you can chat with him there when you have time.
http://www.puschitz.com/



Hasn't he just published a few articles on OTN?

I have not seen this gentleman do any enterprise-grade testing of SLES versus RHEL in a RAC environment.
Author: nntp    Time: 2006-08-31 15:24
Originally posted by pxwyd on 2006-8-31 08:22:
Thanks. That is exactly what I am testing now, asm + raw, to see how it goes.



ASM's performance differs little from RAW, and in some benchmark metrics it has exceeded RAW.

Bear in mind that raw in a Linux environment is not the same as the raw environment on a commercial Unix.
Author: nntp    Time: 2006-08-31 15:27
Originally posted by nonameboy on 2006-8-30 16:53:

One of those guys has a site; you can chat with him there when you have time.
http://www.puschitz.com/



Also, by "one of those guys" do you mean this Werner Puschitz is a colleague at your company? As far as I know he is an independent consultant working for himself; since when did he become your company's employee?
When we had a RAC project last year we contacted him and asked about his rate for remote consulting; we could not agree on a price and it fell through. I had no idea he had joined your company.

:em11:

[ Last edited by nntp on 2006-8-31 15:29 ]
Author: vecentli    Time: 2006-08-31 15:53
I still put the voting disk and OCR on ocfs2.

For datafiles and the like, lvm + raw also works.
Author: nntp    Time: 2006-08-31 16:07
Originally posted by vecentli on 2006-8-31 15:53:
I still put the voting disk and OCR on ocfs2.

For datafiles and the like, lvm + raw also works.



My voting disk / OCR are on raw. 10g R2 has a redundant configuration for them, so raw is quite convenient, and since they are tiny, taking an extra backup when needed is also easy.

Whether voting/OCR are on raw or not has little effect on runtime performance, but when the cluster becomes unstable and starts changing node membership, there is still a difference in performance.

Datafiles on lvm + raw in RAC? Or clvm + raw? How could lvm + raw possibly work? lvm itself is not cluster-aware. Your raw devices may appear to be created on lvm, but one node's lvm cannot propagate changes to the other nodes. On Linux, keeping datafiles on raw shows no performance advantage; with ASM the performance is assured.
Author: vecentli    Time: 2006-08-31 16:21
It does not really matter where the OCR and voting disk live; I think personal habit is the deciding factor, and ocfs2 simply matches the habits of most people.
As for lvm + raw: lvm is only used as a way of managing the datafiles; the data lives on raw, and yes, the raw devices are built on LVs.

I have not tested it. Can lvm really not manage raw devices under RAC? Or is it that RAC cannot recognize a raw device built on an LV?

[ Last edited by vecentli on 2006-8-31 16:24 ]
Author: vecentli    Time: 2006-08-31 16:29
With no UPS, after an unexpected power loss the data in the asm cache is gone, and the DB may well fail to come up.
Without enough in-house expertise, you cannot back up the database with rman.

So what you use also depends on your own circumstances..
Author: vecentli    Time: 2006-08-31 16:33
Originally posted by nntp on 2006-8-31 16:07:



My voting disk / OCR are on raw. 10g R2 has a redundant configuration for them, so raw is quite convenient, and since they are tiny, taking an extra backup when needed is also easy.

Whether voting/OCR are on raw or not has little effect on runtime performance, but when the cluster becomes unsta ...


Fair point..

The lvm configuration cannot be propagated to other machines; an LV used on node1 cannot be recognized by the instance on node2.

Author: nntp    Time: 2006-08-31 18:01
Originally posted by vecentli on 2006-8-31 16:29:
With no UPS, after an unexpected power loss the data in the asm cache is gone, and the DB may well fail to come up.
Without enough in-house expertise, you cannot back up the database with rman.

So what you use also depends on your own circumstances..



Single instance or RAC? If it is RAC, then even with a power loss ASM can handle that situation. Do you subscribe to Oracle Magazine? There was an issue late last year covering exactly that kind of scenario.
Author: vecentli    Time: 2006-08-31 21:59
Subject: OCFS2 FAQ
OCFS2 - FREQUENTLY ASKED QUESTIONS

      CONTENTS
    * General
    * Download and Install
    * Configure
    * O2CB Cluster Service
    * Format
    * Mount
    * Oracle RAC
    * Migrate Data from OCFS (Release 1) to OCFS2
    * Coreutils
    * Troubleshooting
    * Limits
    * System Files
    * Heartbeat
    * Quorum and Fencing
    * Novell SLES9
    * Release 1.2
    * Upgrade to the Latest Release
    * Processes

      GENERAL
   1. How do I get started?
          * Download and install the module and tools rpms.
          * Create cluster.conf and propagate to all nodes.
          * Configure and start the O2CB cluster service.
          * Format the volume.
          * Mount the volume.
   2. How do I know the version number running?

              # cat /proc/fs/ocfs2/version
              OCFS2 1.2.1 Fri Apr 21 13:51:24 PDT 2006 (build bd2f25ba0af9677db3572e3ccd92f739)

   3. How do I configure my system to auto-reboot after a panic?
      To auto-reboot system 60 secs after a panic, do:

              # echo 60 > /proc/sys/kernel/panic

      To enable the above on every reboot, add the following to /etc/sysctl.conf:

              kernel.panic = 60

      DOWNLOAD AND INSTALL
   4. Where do I get the packages from?
      For Novell's SLES9, upgrade to the latest SP3 kernel to get the required modules installed. Also, install ocfs2-tools and ocfs2console packages. For Red Hat's RHEL4, download and install the appropriate module package and the two tools packages, ocfs2-tools and ocfs2console. Appropriate module refers to one matching the kernel version, flavor and architecture. Flavor refers to smp, hugemem, etc.
   5. What are the latest versions of the OCFS2 packages?
      The latest module package version is 1.2.2. The latest tools/console packages versions are 1.2.1.
   6. How do I interpret the package name ocfs2-2.6.9-22.0.1.ELsmp-1.2.1-1.i686.rpm?
      The package name is comprised of multiple parts separated by '-'.
          * ocfs2 - Package name
          * 2.6.9-22.0.1.ELsmp - Kernel version and flavor
          * 1.2.1 - Package version
          * 1 - Package subversion
          * i686 - Architecture
   7. How do I know which package to install on my box?
      After one identifies the package name and version to install, one still needs to determine the kernel version, flavor and architecture.
      To know the kernel version and flavor, do:

              # uname -r
              2.6.9-22.0.1.ELsmp

      To know the architecture, do:

              # rpm -qf /boot/vmlinuz-`uname -r` --queryformat "%{ARCH}\n"
              i686

   8. Why can't I use uname -p to determine the kernel architecture?
      uname -p does not always provide the exact kernel architecture. Case in point the RHEL3 kernels on x86_64. Even though Red Hat has two different kernel architectures available for this port, ia32e and x86_64, uname -p identifies both as the generic x86_64.
   9. How do I install the rpms?
      First install the tools and console packages:

              # rpm -Uvh ocfs2-tools-1.2.1-1.i386.rpm ocfs2console-1.2.1-1.i386.rpm

      Then install the appropriate kernel module package:

              # rpm -Uvh ocfs2-2.6.9-22.0.1.ELsmp-1.2.1-1.i686.rpm

  10. Do I need to install the console?
      No, the console is not required but recommended for ease-of-use.
  11. What are the dependencies for installing ocfs2console?
      ocfs2console requires e2fsprogs, glib2 2.2.3 or later, vte 0.11.10 or later, pygtk2 (EL4) or python-gtk (SLES9) 1.99.16 or later, python 2.3 or later and ocfs2-tools.
  12. What modules are installed with the OCFS2 1.2 package?
          * configfs.ko
          * ocfs2.ko
          * ocfs2_dlm.ko
          * ocfs2_dlmfs.ko
          * ocfs2_nodemanager.ko
          * debugfs
  13. What tools are installed with the ocfs2-tools 1.2 package?
          * mkfs.ocfs2
          * fsck.ocfs2
          * tunefs.ocfs2
          * debugfs.ocfs2
          * mount.ocfs2
          * mounted.ocfs2
          * ocfs2cdsl
          * ocfs2_hb_ctl
          * o2cb_ctl
          * o2cb - init service to start/stop the cluster
          * ocfs2 - init service to mount/umount ocfs2 volumes
          * ocfs2console - installed with the console package
  14. What is debugfs and is it related to debugfs.ocfs2?
      debugfs is an in-memory filesystem developed by Greg Kroah-Hartman. It is useful for debugging as it allows kernel space to easily export data to userspace. It is currently being used by OCFS2 to dump the list of filesystem locks and could be used for more in the future. It is bundled with OCFS2 as the various distributions are currently not bundling it. While debugfs and debugfs.ocfs2 are unrelated in general, the latter is used as the front-end for the debugging info provided by the former. For example, refer to the troubleshooting section.

      CONFIGURE
  15. How do I populate /etc/ocfs2/cluster.conf?
      If you have installed the console, use it to create this configuration file. For details, refer to the user's guide. If you do not have the console installed, check the Appendix in the User's guide for a sample cluster.conf and the details of all the components. Do not forget to copy this file to all the nodes in the cluster. If you ever edit this file on any node, ensure the other nodes are updated as well.
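      A minimal two-node example of the file's layout (values are placeholders; the authoritative sample is the one in the user's guide mentioned above):

              cluster:
                      node_count = 2
                      name = ocfs2

              node:
                      ip_port = 7777
                      ip_address = 192.168.1.101
                      number = 0
                      name = node01
                      cluster = ocfs2

              node:
                      ip_port = 7777
                      ip_address = 192.168.1.102
                      number = 1
                      name = node02
                      cluster = ocfs2
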
  16. Should the IP interconnect be public or private?
      Using a private interconnect is recommended. While OCFS2 does not take much bandwidth, it does require the nodes to be alive on the network and sends regular keepalive packets to ensure that they are. To avoid a network delay being interpreted as a node disappearing on the net which could lead to a node-self-fencing, a private interconnect is recommended. One could use the same interconnect for Oracle RAC and OCFS2.
  17. What should the node name be and should it be related to the IP address?
      The node name needs to match the hostname. The IP address need not be the one associated with that hostname. As in, any valid IP address on that node can be used. OCFS2 will not attempt to match the node name (hostname) with the specified IP address.
  18. How do I modify the IP address, port or any other information specified in cluster.conf?
      While one can use ocfs2console to add nodes dynamically to a running cluster, any other modifications require the cluster to be offlined. Stop the cluster on all nodes, edit /etc/ocfs2/cluster.conf on one and copy to the rest, and restart the cluster on all nodes. Always ensure that cluster.conf is the same on all the nodes in the cluster.
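      In practice that boils down to something like the following (the node name is an example); run the umount/stop/start steps on every node:

              # umount -at ocfs2
              # /etc/init.d/o2cb stop
              (edit cluster.conf on one node, then push the same file everywhere)
              # scp /etc/ocfs2/cluster.conf node02:/etc/ocfs2/cluster.conf
              # /etc/init.d/o2cb start
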
  19. How do I add a new node to an online cluster?
      You can use the console to add a new node. However, you will need to explicitly add the new node on all the online nodes. That is, adding on one node and propagating to the other nodes is not sufficient. If the operation fails, it will most likely be due to bug#741. In that case, you can use the o2cb_ctl utility on all online nodes as follows:

              # o2cb_ctl -C -i -n NODENAME -t node -a number=NODENUM -a ip_address=IPADDR -a ip_port=IPPORT -a cluster=CLUSTERNAME

       Ensure the node is added both in /etc/ocfs2/cluster.conf and in /config/cluster/CLUSTERNAME/node on all online nodes. You can then simply copy the cluster.conf to the new (still offline) node as well as other offline nodes. At the end, ensure that cluster.conf is consistent on all the nodes.
  20. How do I add a new node to an offline cluster?
      You can either use the console or use o2cb_ctl or simply hand edit cluster.conf. Then either use the console to propagate it to all nodes or hand copy using scp or any other tool. The o2cb_ctl command to do the same is:

              # o2cb_ctl -C -n NODENAME -t node -a number=NODENUM -a ip_address=IPADDR -a ip_port=IPPORT -a cluster=CLUSTERNAME

      Notice the "-i" argument is not required as the cluster is not online.

      O2CB CLUSTER SERVICE
  21. How do I configure the cluster service?

              # /etc/init.d/o2cb configure

      Enter 'y' if you want the service to load on boot and the name of the cluster (as listed in /etc/ocfs2/cluster.conf).
  22. How do I start the cluster service?
          * To load the modules, do:

                    # /etc/init.d/o2cb load

          * To Online it, do:

                    # /etc/init.d/o2cb online [cluster_name]

      If you have configured the cluster to load on boot, you could combine the two as follows:

              # /etc/init.d/o2cb start [cluster_name]

      The cluster name is not required if you have specified the name during configuration.
  23. How do I stop the cluster service?
          * To offline it, do:

                    # /etc/init.d/o2cb offline [cluster_name]

          * To unload the modules, do:

                    # /etc/init.d/o2cb unload

      If you have configured the cluster to load on boot, you could combine the two as follows:

              # /etc/init.d/o2cb stop [cluster_name]

      The cluster name is not required if you have specified the name during configuration.
  24. How can I learn the status of the cluster?
      To learn the status of the cluster, do:

              # /etc/init.d/o2cb status

  25. I am unable to get the cluster online. What could be wrong?
      Check whether the node name in the cluster.conf exactly matches the hostname. One of the nodes in the cluster.conf needs to be in the cluster for the cluster to be online.

      FORMAT
  26. How do I format a volume?
      You could either use the console or use mkfs.ocfs2 directly to format the volume. For console, refer to the user's guide.

              # mkfs.ocfs2 -L "oracle_home" /dev/sdX

      The above formats the volume with default block and cluster sizes, which are computed based upon the size of the volume.

              # mkfs.ocfs2 -b 4k -C 32K -L "oracle_home" -N 4 /dev/sdX

      The above formats the volume for 4 nodes with a 4K block size and a 32K cluster size.
  27. What does the number of node slots during format refer to?
      The number of node slots specifies the number of nodes that can concurrently mount the volume. This number is specified during format and can be increased using tunefs.ocfs2. This number cannot be decreased.
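      A hedged example of raising the slot count later with tunefs.ocfs2; the volume must be unmounted everywhere first, and the exact option is worth confirming against the tunefs.ocfs2 help for your tools version:

              # umount /dir          (on all nodes)
              # tunefs.ocfs2 -N 8 /dev/sdX
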
  28. What should I consider when determining the number of node slots?
      OCFS2 allocates system files, like Journal, for each node slot. So as to not to waste space, one should specify a number within the ballpark of the actual number of nodes. Also, as this number can be increased, there is no need to specify a much larger number than one plans for mounting the volume.
  29. Does the number of node slots have to be the same for all volumes?
      No. This number can be different for each volume.
  30. What block size should I use?
      A block size is the smallest unit of space addressable by the file system. OCFS2 supports block sizes of 512 bytes, 1K, 2K and 4K. The block size cannot be changed after the format. For most volume sizes, a 4K size is recommended. On the other hand, the 512 bytes block is never recommended.
  31. What cluster size should I use?
      A cluster size is the smallest unit of space allocated to a file to hold the data. OCFS2 supports cluster sizes of 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K and 1M. For database volumes, a cluster size of 128K or larger is recommended. For Oracle home, 32K to 64K.
  32. Any advantage of labelling the volumes?
      As the disk name (/dev/sdX) for a particular device can be different on different nodes in a shared disk environment, labelling becomes a must for easy identification. You could also use labels to identify volumes during mount.

              # mount -L "label" /dir

      The volume label is changeable using the tunefs.ocfs2 utility.
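      For example (the -L option here mirrors the mkfs.ocfs2 -L usage shown above; treat it as an assumption and confirm it against the tunefs.ocfs2 man page for your version):

              # tunefs.ocfs2 -L "new_label" /dev/sdX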

      MOUNT
  33. How do I mount the volume?
      You could either use the console or use mount directly. For console, refer to the user's guide.

              # mount -t ocfs2 /dev/sdX /dir

      The above command will mount device /dev/sdX on directory /dir.
  34. How do I mount by label?
      To mount by label do:

              # mount -L "label" /dir

  35. What entry do I add to /etc/fstab to mount an ocfs2 volume?
      Add the following:

              /dev/sdX        /dir        ocfs2        noauto,_netdev        0        0

      The _netdev option indicates that the device needs to be mounted after the network is up.
  36. What do I need to do to mount OCFS2 volumes on boot?
          * Enable o2cb service using:

                    # chkconfig --add o2cb

          * Enable ocfs2 service using:

                    # chkconfig --add ocfs2

          * Configure o2cb to load on boot using:

                    # /etc/init.d/o2cb configure

          * Add entries into /etc/fstab as follows:

                    /dev/sdX        /dir        ocfs2        _netdev        0        0

  37. How do I know my volume is mounted?
          * Enter mount without arguments, or,

                    # mount

          * List /etc/mtab, or,

                    # cat /etc/mtab

          * List /proc/mounts, or,

                    # cat /proc/mounts

          * Run ocfs2 service.

                    # /etc/init.d/ocfs2 status

            mount command reads the /etc/mtab to show the information.
  38. What are the /config and /dlm mountpoints for?
      OCFS2 comes bundled with two in-memory filesystems configfs and ocfs2_dlmfs. configfs is used by the ocfs2 tools to communicate to the in-kernel node manager the list of nodes in the cluster and to the in-kernel heartbeat thread the resource to heartbeat on. ocfs2_dlmfs is used by ocfs2 tools to communicate with the in-kernel dlm to take and release clusterwide locks on resources.
  39. Why does it take so much time to mount the volume?
      It takes around 5 secs for a volume to mount. It does so to let the heartbeat thread stabilize. In a later release, we plan to add support for a global heartbeat, which will make most mounts instant.

      ORACLE RAC
  40. Any special flags to run Oracle RAC?
      OCFS2 volumes containing the Voting diskfile (CRS), Cluster registry (OCR), Data files, Redo logs, Archive logs and Control files must be mounted with the datavolume and nointr mount options. The datavolume option ensures that the Oracle processes opens these files with the o_direct flag. The nointr option ensures that the ios are not interrupted by signals.

              # mount -o datavolume,nointr -t ocfs2 /dev/sda1 /u01/db

  41. What about the volume containing Oracle home?
      Oracle home volume should be mounted normally, that is, without the datavolume and nointr mount options. These mount options are only relevant for Oracle files listed above.

              # mount -t ocfs2 /dev/sdb1 /software/orahome

       Also as OCFS2 does not currently support shared writeable mmap, the health check (GIMH) file $ORACLE_HOME/dbs/hc_ORACLESID.dat and the ASM file $ASM_HOME/dbs/ab_ORACLESID.dat should be symlinked to local filesystem. We expect to support shared writeable mmap in the RHEL5 timeframe.
  42. Does that mean I cannot have my data file and Oracle home on the same volume?
      Yes. The volume containing the Oracle data files, redo-logs, etc. should never be on the same volume as the distribution (including the trace logs like, alert.log).
  43. Any other information I should be aware off?
      The 1.2.3 release of OCFS2 does not update the modification time on the inode across the cluster for non-extending writes. However, the time will be locally updated in the cached inodes. This leads to one observing different times (ls -l) for the same file on different nodes on the cluster.
      While this does not affect most uses of the filesystem, as one variably changes the file size during write, the one usage where this is most commonly experienced is with Oracle datafiles and redologs. This is because Oracle rarely resizes these files and thus almost all writes are non-extending.
      In the short term (1.2.x), we intend to provide a mount option (nocmtime) to allow users to explicitly ask the filesystem to not change the modification time during non-extending writes. While this is not the complete solution, this will ensure that the times are consistent across the cluster.
      In the long term (1.4.x), we intend to fix this by updating modification times for all writes while providing an opt-out option (nocmtime) for users who would prefer to avoid the performance overhead associated with this feature.

      MIGRATE DATA FROM OCFS (RELEASE 1) TO OCFS2
  44. Can I mount OCFS volumes as OCFS2?
      No. OCFS and OCFS2 are not on-disk compatible. We had to break the compatibility in order to add many of the new features. At the same time, we have added enough flexibility in the new disk layout so as to maintain backward compatibility in the future.
  45. Can OCFS volumes and OCFS2 volumes be mounted on the same machine simultaneously?
      No. OCFS only works on 2.4 linux kernels (Red Hat's AS2.1/EL3 and SuSE's SLES). OCFS2, on the other hand, only works on the 2.6 kernels (Red Hat's EL4 and SuSE's SLES9).
  46. Can I access my OCFS volume on 2.6 kernels (SLES9/RHEL4)?
      Yes, you can access the OCFS volume on 2.6 kernels using FSCat tools, fsls and fscp. These tools can access the OCFS volumes at the device layer, to list and copy the files to another filesystem. FSCat tools are available on oss.oracle.com.
  47. Can I in-place convert my OCFS volume to OCFS2?
      No. The on-disk layout of OCFS and OCFS2 are sufficiently different that it would require a third disk (as a temporary buffer) in order to in-place upgrade the volume. With that in mind, it was decided not to develop such a tool but instead provide tools to copy data from OCFS without one having to mount it.
  48. What is the quickest way to move data from OCFS to OCFS2?
      Quickest would mean having to perform the minimal number of copies. If you have the current backup on a non-OCFS volume accessible from the 2.6 kernel install, then all you would need to do is to restore the backup on the OCFS2 volume(s). If you do not have a backup but have a setup in which the system containing the OCFS2 volumes can access the disks containing the OCFS volume, you can use the FSCat tools to extract data from the OCFS volume and copy onto OCFS2.

      COREUTILS
  49. Like with OCFS (Release 1), do I need to use o_direct enabled tools to perform cp, mv, tar, etc.?
      No. OCFS2 does not need the o_direct enabled tools. The file system allows processes to open files in both o_direct and bufferred mode concurrently.

      TROUBLESHOOTING
Author: vecentli    Time: 2006-08-31 22:01
# How do I enable and disable filesystem tracing?
To list all the debug bits along with their statuses, do:

        # debugfs.ocfs2 -l

To enable tracing the bit SUPER, do:

        # debugfs.ocfs2 -l SUPER allow

To disable tracing the bit SUPER, do:

        # debugfs.ocfs2 -l SUPER off

To totally turn off tracing the SUPER bit, as in, turn off tracing even if some other bit is enabled for the same, do:

        # debugfs.ocfs2 -l SUPER deny

To enable heartbeat tracing, do:

        # debugfs.ocfs2 -l HEARTBEAT ENTRY EXIT allow

To disable heartbeat tracing, do:

        # debugfs.ocfs2 -l HEARTBEAT off ENTRY EXIT deny

# How do I get a list of filesystem locks and their statuses?
OCFS2 1.0.9+ has this feature. To get this list, do:

    * Mount debugfs at /debug.

              # mount -t debugfs debugfs /debug

    * Dump the locks.

              # echo "fs_locks" | debugfs.ocfs2 /dev/sdX >/tmp/fslocks

# How do I read the fs_locks output?
Let's look at a sample output:

        Lockres: M000000000000000006672078b84822  Mode: Protected Read
        Flags: Initialized Attached
        RO Holders: 0  EX Holders: 0
        Pending Action: None  Pending Unlock Action: None
        Requested Mode: Protected Read  Blocking Mode: Invalid

First thing to note is the Lockres, which is the lockname. The dlm identifies resources using locknames. A lockname is a combination of a lock type (S superblock, M metadata, D filedata, R rename, W readwrite), inode number and generation.
To get the inode number and generation from lockname, do:

        #echo "stat " | debugfs.ocfs2 -n /dev/sdX
        Inode: 419616   Mode: 0666   Generation: 2025343010 (0x78b84822)
        ....

To map the lockname to a directory entry, do:

        # echo "locate " | debugfs.ocfs2 -n /dev/sdX
        419616  /linux-2.6.15/arch/i386/kernel/semaphore.c

One could also provide the inode number instead of the lockname.

        # echo "locate <419616>" | debugfs.ocfs2 -n /dev/sdX
        419616  /linux-2.6.15/arch/i386/kernel/semaphore.c

To get a lockname from a directory entry, do:

        # echo "encode /linux-2.6.15/arch/i386/kernel/semaphore.c" | debugfs.ocfs2 -n /dev/sdX
        M000000000000000006672078b84822 D000000000000000006672078b84822 W000000000000000006672078b84822

The first is the Metadata lock, then Data lock and last ReadWrite lock for the same resource.

The DLM supports 3 lock modes: NL no lock, PR protected read and EX exclusive.

If you have a dlm hang, the resource to look for would be one with the "Busy" flag set.

The next step would be to query the dlm for the lock resource.

Note: The dlm debugging is still a work in progress.

To do dlm debugging, first one needs to know the dlm domain, which matches the volume UUID.

        # echo "stats" | debugfs.ocfs2 -n /dev/sdX | grep UUID: | while read a b ; do echo $b ; done
        82DA8137A49A47E4B187F74E09FBBB4B

Then do:

        # echo R dlm_domain lockname > /proc/fs/ocfs2_dlm/debug

For example:

        # echo R 82DA8137A49A47E4B187F74E09FBBB4B M000000000000000006672078b84822 > /proc/fs/ocfs2_dlm/debug
        # dmesg | tail
        struct dlm_ctxt: 82DA8137A49A47E4B187F74E09FBBB4B, node=79, key=965960985
        lockres: M000000000000000006672078b84822, owner=75, state=0 last used: 0, on purge list: no
          granted queue:
            type=3, conv=-1, node=79, cookie=11673330234144325711, ast=(empty=y,pend=n), bast=(empty=y,pend=n)
          converting queue:
          blocked queue:

It shows that the lock is mastered by node 75 and that node 79 has been granted a PR lock on the resource.

This is just to give a flavor of dlm debugging.

LIMITS
# Is there a limit to the number of subdirectories in a directory?
Yes. OCFS2 currently allows up to 32000 subdirectories. While this limit could be increased, we will not be doing it till we implement some kind of efficient name lookup (htree, etc.).
# Is there a limit to the size of an ocfs2 file system?
Yes, current software addresses block numbers with 32 bits. So the file system device is limited to (2 ^ 32) * blocksize (see mkfs -b). With a 4KB block size this amounts to a 16TB file system. This block addressing limit will be relaxed in future software. At that point the limit becomes addressing clusters of 1MB each with 32 bits which leads to a 4PB file system.

SYSTEM FILES
# What are system files?
System files are used to store standard filesystem metadata like bitmaps, journals, etc. Storing this information in files in a directory allows OCFS2 to be extensible. These system files can be accessed using debugfs.ocfs2. To list the system files, do:

        # echo "ls -l //" | debugfs.ocfs2 -n /dev/sdX
                18        16       1      2  .
                18        16       2      2  ..
                19        24       10     1  bad_blocks
                20        32       18     1  global_inode_alloc
                21        20       8      1  slot_map
                22        24       9      1  heartbeat
                23        28       13     1  global_bitmap
                24        28       15     2  orphan_dir:0000
                25        32       17     1  extent_alloc:0000
                26        28       16     1  inode_alloc:0000
                27        24       12     1  journal:0000
                28        28       16     1  local_alloc:0000
                29        3796     17     1  truncate_log:0000

The first column lists the block number.
# Why do some files have numbers at the end?
There are two types of files, global and local. Global files are for all the nodes, while local, like journal:0000, are node specific. The set of local files used by a node is determined by the slot mapping of that node. The numbers at the end of the system file name is the slot#. To list the slot maps, do:

        # echo "slotmap" | debugfs.ocfs2 -n /dev/sdX
               Slot#   Node#
                   0      39
                   1      40
                   2      41
                   3      42

HEARTBEAT
# How does the disk heartbeat work?
Every node writes every two secs to its block in the heartbeat system file. The block offset is equal to its global node number. So node 0 writes to the first block, node 1 to the second, etc. All the nodes also read the heartbeat sysfile every two secs. As long as the timestamp is changing, that node is deemed alive.
# When is a node deemed dead?
An active node is deemed dead if it does not update its timestamp for O2CB_HEARTBEAT_THRESHOLD (default=7) loops. Once a node is deemed dead, the surviving node which manages to cluster lock the dead node's journal, recovers it by replaying the journal.
# What about self fencing?
A node self-fences if it fails to update its timestamp for ((O2CB_HEARTBEAT_THRESHOLD - 1) * 2) secs. The [o2hb-xx] kernel thread, after every timestamp write, sets a timer to panic the system after that duration. If the next timestamp is written within that duration, as it should, it first cancels that timer before setting up a new one. This way it ensures the system will self fence if for some reason the [o2hb-x] kernel thread is unable to update the timestamp and thus be deemed dead by other nodes in the cluster.
# How can one change the parameter value of O2CB_HEARTBEAT_THRESHOLD?
This parameter value could be changed by adding it to /etc/sysconfig/o2cb and RESTARTING the O2CB cluster. This value should be the SAME on ALL the nodes in the cluster.
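A sketch of that change (61 is the 120-second example from the next answer); apply it on every node, then stop and start the cluster:

        # echo "O2CB_HEARTBEAT_THRESHOLD=61" >> /etc/sysconfig/o2cb
        # umount -at ocfs2
        # /etc/init.d/o2cb stop
        # /etc/init.d/o2cb start
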
# What should one set O2CB_HEARTBEAT_THRESHOLD to?
It should be set to the timeout value of the io layer. Most multipath solutions have a timeout ranging from 60 secs to 120 secs. For 60 secs, set it to 31. For 120 secs, set it to 61.

        O2CB_HEARTBEAT_THRESHOLD = (((timeout in secs) / 2) + 1)

# How does one check the current active O2CB_HEARTBEAT_THRESHOLD value?

        # cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold
        7

# What if a node umounts a volume?
During umount, the node will broadcast to all the nodes that have mounted that volume to drop that node from its node maps. As the journal is shutdown before this broadcast, any node crash after this point is ignored as there is no need for recovery.
# I encounter "Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing" whenever I run a heavy io load?
We have encountered a bug with the default CFQ io scheduler which causes a process doing heavy io to temporarily starve out other processes. While this is not fatal for most environments, it is for OCFS2 as we expect the hb thread to be r/w to the hb area at least once every 12 secs (default). Bug with the fix has been filed with Red Hat. Red Hat is expected to have this fixed in RHEL4 U4 release. SLES9 SP3 2.6.5-7.257 includes this fix. For the latest, refer to the tracker bug filed on bugzilla. Till this issue is resolved, one is advised to use the DEADLINE io scheduler. To use it, add "elevator=deadline" to the kernel command line as follows:

    * For SLES9, edit the command line in /boot/grub/menu.lst.

      title Linux 2.6.5-7.244-bigsmp (with deadline)
              kernel (hd0,4)/boot/vmlinuz-2.6.5-7.244-bigsmp root=/dev/sda5
                      vga=0x314 selinux=0 splash=silent resume=/dev/sda3 elevator=deadline showopts console=tty0 console=ttyS0,115200 noexec=off
              initrd (hd0,4)/boot/initrd-2.6.5-7.244-bigsmp

    * For RHEL4, edit the command line in /boot/grub/grub.conf:

      title Red Hat Enterprise Linux AS (2.6.9-22.EL) (with deadline)
              root (hd0,0)
              kernel /vmlinuz-2.6.9-22.EL ro root=LABEL=/ console=ttyS0,115200 console=tty0 elevator=deadline noexec=off
              initrd /initrd-2.6.9-22.EL.img

To see the current kernel command line, do:

        # cat /proc/cmdline

QUORUM AND FENCING
# What is a quorum?
A quorum is a designation given to a group of nodes in a cluster which are still allowed to operate on shared storage. It comes up when there is a failure in the cluster which breaks the nodes up into groups which can communicate in their groups and with the shared storage but not between groups.
# How does OCFS2's cluster services define a quorum?
The quorum decision is made by a single node based on the number of other nodes that are considered alive by heartbeating and the number of other nodes that are reachable via the network.
A node has quorum when:

    * it sees an odd number of heartbeating nodes and has network connectivity to more than half of them.
      OR,
    * it sees an even number of heartbeating nodes and has network connectivity to at least half of them *and* has connectivity to the heartbeating node with the lowest node number.

# What is fencing?
Fencing is the act of forcefully removing a node from a cluster. A node with OCFS2 mounted will fence itself when it realizes that it doesn't have quorum in a degraded cluster. It does this so that other nodes won't get stuck trying to access its resources. Currently OCFS2 will panic the machine when it realizes it has to fence itself off from the cluster. As described in Q02, it will do this when it sees more nodes heartbeating than it has connectivity to and fails the quorum test.
# How does a node decide that it has connectivity with another?
When a node sees another come to life via heartbeating it will try and establish a TCP connection to that newly live node. It considers that other node connected as long as the TCP connection persists and the connection is not idle for 10 seconds. Once that TCP connection is closed or idle it will not be reestablished until heartbeat thinks the other node has died and come back alive.
# How long does the quorum process take?
First a node will realize that it doesn't have connectivity with another node. This can happen immediately if the connection is closed but can take a maximum of 10 seconds of idle time. Then the node must wait long enough to give heartbeating a chance to declare the node dead. It does this by waiting two iterations longer than the number of iterations needed to consider a node dead (see the Heartbeat section of this FAQ). The current default of 7 iterations of 2 seconds results in waiting for 9 iterations or 18 seconds. By default, then, a maximum of 28 seconds can pass from the time a network fault occurs until a node fences itself.
# How can one avoid a node from panic-ing when one shutdowns the other node in a 2-node cluster?
This typically means that the network is shutting down before all the OCFS2 volumes are being umounted. Ensure the ocfs2 init script is enabled. This script ensures that the OCFS2 volumes are umounted before the network is shutdown. To check whether the service is enabled, do:

               # chkconfig --list ocfs2
               ocfs2     0:off   1:off   2:on    3:on    4:on    5:on    6:off

# How does one list out the startup and shutdown ordering of the OCFS2 related services?

    * To list the startup order for runlevel 3 on RHEL4, do:

              # cd /etc/rc3.d
              # ls S*ocfs2* S*o2cb* S*network*
              S10network  S24o2cb  S25ocfs2

    * To list the shutdown order on RHEL4, do:

              # cd /etc/rc6.d
              # ls K*ocfs2* K*o2cb* K*network*
              K19ocfs2  K20o2cb  K90network

    * To list the startup order for runlevel 3 on SLES9, do:

              # cd /etc/init.d/rc3.d
              # ls S*ocfs2* S*o2cb* S*network*
              S05network  S07o2cb  S08ocfs2

    * To list the shutdown order on SLES9, do:

              # cd /etc/init.d/rc3.d
              # ls K*ocfs2* K*o2cb* K*network*
              K14ocfs2  K15o2cb  K17network

Please note that the default ordering in the ocfs2 scripts only includes the network service and not any shared-device specific service, like iscsi. If one is using iscsi or any shared device requiring a service to be started and shut down, please ensure that that service starts before and shuts down after the ocfs2 init service.

NOVELL SLES9
# Why are OCFS2 packages for SLES9 not made available on oss.oracle.com?
OCFS2 packages for SLES9 are available directly from Novell as part of the kernel. Same is true for the various Asianux distributions and for ubuntu. As OCFS2 is now part of the mainline kernel, we expect more distributions to bundle the product with the kernel.
# What versions of OCFS2 are available with SLES9 and how do they match with the Red Hat versions available on oss.oracle.com?
As both Novell and Oracle ship OCFS2 on different schedules, the package versions do not match. We expect this to resolve itself over time as the number of patch fixes reduces. Novell is shipping two SLES9 releases, viz., SP2 and SP3.

    * The latest kernel with the SP2 release is 2.6.5-7.202.7. It ships with OCFS2 1.0.8.
    * The latest kernel with the SP3 release is 2.6.5-7.257. It ships with OCFS2 1.2.1.

RELEASE 1.2
# What is new in OCFS2 1.2?
OCFS2 1.2 has two new features:

    * It is endian-safe. With this release, one can mount the same volume concurrently on x86, x86-64, ia64 and big endian architectures ppc64 and s390x.
    * Supports readonly mounts. The fs uses this feature to auto remount ro when encountering on-disk corruptions (instead of panic-ing).

# Do I need to re-make the volume when upgrading?
No. OCFS2 1.2 is fully on-disk compatible with 1.0.
# Do I need to upgrade anything else?
Yes, the tools needs to be upgraded to ocfs2-tools 1.2. ocfs2-tools 1.0 will not work with OCFS2 1.2 nor will 1.2 tools work with 1.0 modules.

UPGRADE TO THE LATEST RELEASE
# How do I upgrade to the latest release?

    * Download the latest ocfs2-tools and ocfs2console for the target platform and the appropriate ocfs2 module package for the kernel version, flavor and architecture. (For more, refer to the "Download and Install" section above.)

    * Umount all OCFS2 volumes.

              # umount -at ocfs2

    * Shutdown the cluster and unload the modules.

              # /etc/init.d/o2cb offline
              # /etc/init.d/o2cb unload

    * If required, upgrade the tools and console.

              # rpm -Uvh ocfs2-tools-1.2.1-1.i386.rpm ocfs2console-1.2.1-1.i386.rpm

    * Upgrade the module.

              # rpm -Uvh ocfs2-2.6.9-22.0.1.ELsmp-1.2.2-1.i686.rpm

    * Ensure init services ocfs2 and o2cb are enabled.

              # chkconfig --add o2cb
              # chkconfig --add ocfs2

    * To check whether the services are enabled, do:

              # chkconfig --list o2cb
              o2cb      0:off   1:off   2:on    3:on    4:on    5:on    6:off
              # chkconfig --list ocfs2
              ocfs2     0:off   1:off   2:on    3:on    4:on    5:on    6:off

    * At this stage one could either reboot the node or simply, restart the cluster and mount the volume.

# Can I do a rolling upgrade from 1.0.x/1.2.x to 1.2.2?
Rolling upgrade to 1.2.2 is not recommended. Shutdown the cluster on all nodes before upgrading the nodes.
# After upgrade I am getting the following error on mount "mount.ocfs2: Invalid argument while mounting /dev/sda6 on /ocfs".
Do "dmesg | tail". If you see the error:

ocfs2_parse_options:523 ERROR: Unrecognized mount option "heartbeat=local" or missing value

it means that you are trying to use the 1.2 tools and 1.0 modules. Ensure that you have unloaded the 1.0 modules and installed and loaded the 1.2 modules. Use modinfo to determine the version of the module installed and/or loaded.
# The cluster fails to load. What do I do?
Check "demsg | tail" for any relevant errors. One common error is as follows:

SELinux: initialized (dev configfs, type configfs), not configured for labeling audit(1139964740.184:2): avc:  denied  { mount } for  ...

The above error indicates that you have SELinux activated. A bug in SELinux does not allow configfs to mount. Disable SELinux by setting "SELINUX=disabled" in /etc/selinux/config. Change is activated on reboot.

[ Last edited by nntp on 2006-9-1 00:00 ]
Author: vecentli    Time: 2006-08-31 22:02
PROCESSES
# List and describe all OCFS2 threads?

[o2net]
    One per node. Is a workqueue thread started when the cluster is brought online and stopped when offline. It handles the network communication for all threads. It gets the list of active nodes from the o2hb thread and sets up tcp/ip communication channels with each active node. It sends regular keepalive packets to detect any interruption on the channels.
[user_dlm]
    One per node. Is a workqueue thread started when dlmfs is loaded and stopped on unload. (dlmfs is an in-memory file system which allows user space processes to access the dlm in kernel to lock and unlock resources.) Handles lock downconverts when requested by other nodes.
[ocfs2_wq]
    One per node. Is a workqueue thread started when ocfs2 module is loaded and stopped on unload. Handles blockable file system tasks like truncate log flush, orphan dir recovery and local alloc recovery, which involve taking dlm locks. Various code paths queue tasks to this thread. For example, ocfs2rec queues orphan dir recovery so that while the task is kicked off as part of recovery, its completion does not affect the recovery time.
[o2hb-14C29A7392]
    One per heartbeat device. Is a kernel thread started when the heartbeat region is populated in configfs and stopped when it is removed. It writes every 2 secs to its block in the heartbeat region to indicate to other nodes that that node is alive. It also reads the region to maintain a nodemap of live nodes. It notifies o2net and dlm any changes in the nodemap.
[ocfs2vote-0]
    One per mount. Is a kernel thread started when a volume is mounted and stopped on umount. It downgrades locks when requested by other nodes in response to blocking ASTs (BASTs). It also fixes up the dentry cache in response to files unlinked or renamed on other nodes.
[dlm_thread]
    One per dlm domain. Is a kernel thread started when a dlm domain is created and stopped when destroyed. This is the core dlm which maintains the list of lock resources and handles the cluster locking infrastructure.
[dlm_reco_thread]
    One per dlm domain. Is a kernel thread which handles dlm recovery whenever a node dies. If the node is the dlm recovery master, it remasters all the locks owned by the dead node.
[dlm_wq]
    One per dlm domain. Is a workqueue thread. o2net queues dlm tasks on this thread.
[kjournald]
    One per mount. Is used as OCFS2 uses JDB for journalling.
[ocfs2cmt-0]
    One per mount. Is a kernel thread started when a volume is mounted and stopped on umount. Works in conjunction with kjournald.
[ocfs2rec-0]
    Is started whenever another node needs to be recovered. This could be either on mount when it discovers a dirty journal or during operation when hb detects a dead node. ocfs2rec handles the file system recovery and it runs after the dlm has finished its recovery.
Author: vecentli    Time: 2006-08-31 22:02
url:

http://oss.oracle.com/projects/o ... ocfs2_faq.html#O2CB
Author: nntp    Time: 2006-09-01 00:44
Everyone, I have merged the main threads on this board discussing ocfs, ocfs2, ASM and raw into this one; please continue the discussion here.
Author: nntp    Time: 2006-09-01 03:05
If you are deploying RAC, need to finish quickly and lack experience in this area, Oracle's "Oracle Validated Configurations" is the best helper there is.
When Oracle first launched OVC I found it extremely good; even for people who know linux/oracle/RAC very well it is a tool that greatly reduces the workload.

If you are unsure of where you stand and under pressure to deliver, you can follow the OVC to the letter to get the job done; and if your RAC is already built and you hit problems, you can also use the OVC as a troubleshooting reference.

Oracle Validated Configurations
http://www.oracle.com/technology ... urations/index.html
Author: nntp    Time: 2006-09-01 03:46
http://forums.oracle.com/forums/ ... 337838&#1337838
A very worthwhile Q&A discussion on the Oracle Forum; my view is basically the same as the later replies there, especially the point one poster makes about the easy conversion between ASM and RAW.
It also briefly touches on the question about the placement of voting and OCR that I answered for someone earlier in this thread without going into the reasons.
Author: vecentli    Time: 2006-09-01 10:07
Originally posted by nntp on 2006-8-31 18:01:



Single instance or RAC? If it is RAC, then even with a power loss ASM can handle that situation. Do you subscribe to Oracle Magazine? There was an issue late last year covering exactly that kind of scenario.



I am quite interested in that article. Could you provide a URL?

If you had to recover from that, I think it would be fairly difficult... after all there is not much material on ASM's internal I/O mechanics.

[ Last edited by vecentli on 2006-9-1 10:10 ]
Author: blue_stone    Time: 2006-09-01 12:01
Could Red Hat's gfs and IBM's gpfs be discussed here as well?
Could someone compare gfs, gpfs, ocfs and ocfs2:
purpose, reliability, availability, performance, stability, and so on?
Author: nntp    Time: 2006-09-01 16:13
gfs and ocfs2 are the same kind of thing; ocfs and gpfs are not that kind of thing, and ocfs is different from every one of the others.

gfs/ocfs2 make it possible for multiple nodes to access the same location on shared storage. They build a mechanism over an ordinary network to keep the filesystem caches on different nodes synchronized; they use cluster locks to eliminate the contention that would otherwise let applications on different nodes operate on the same file and corrupt it; and they exchange node heartbeat state over the ordinary network. That is where the functional similarity lies. In terms of maturity and performance, ocfs2 today is still nowhere near gfs: everywhere you could use ocfs2 you could substitute gfs, but not the other way round. In an HA cluster environment gfs plays the role of a "cheap, cut-down" PolyServe. At least in my view, gfs currently leads ocfs2 in technology, maturity, development investment and performance by roughly three years, and that gap may well keep widening.

ocfs is Oracle-only. It was also the first release in which Oracle brought a cluster filesystem into its roadmap; as I said before, that release was never positioned as a general-purpose cluster filesystem, and among Oracle users negative opinions about its quality, performance and stability were the majority.

Even today, at the ocfs2 stage, the Oracle mailing lists and forums are full of complaints about ocfs2's quality, performance and reliability.

ASM is the new-generation storage management system Oracle has adopted on Linux, HP-UX, Solaris and other high-end commercial Unix platforms. In its standing among Oracle's products, its development investment, its user base, and the tiers and domains it serves, it is beyond anything the ocfs2 project can match.
Functionally, ASM is roughly equivalent to RAW + LVM. It also scales very well as data volume and access volume grow linearly; in real-world testing ASM's performance is essentially close to RAW. There is still the volume-management overhead, so a small performance cost is easy to understand. In the same linear-growth tests, CLVM + OCFS2 performs clearly below ASM and RAW. The other day a friend sent me the slides he presented at an annual meeting of a European high-energy physics lab; their IT department counted more than 540 TB of data, across all their standalone databases and clusters, running on ASM. After heavy use and testing they are quite satisfied with how ASM performs. Most of their systems are IA64 + Linux and AMD Opteron + Linux. If I find time, I will post some of their tests and conclusions here.

[ Last edited by nntp on 2006-9-1 16:30 ]
Author: myprotein    Time: 2006-09-15 09:14
nntp, you are impressive!
One thing I do not understand: for lvm + ocfs2 you said lvm is not cluster-aware, but to my limited knowledge, AIX can create concurrent VGs, right? Is such a concurrent VG cluster-aware?
Author: blue_stone    Time: 2006-09-15 10:18
Originally posted by myprotein on 2006-9-15 09:14:
nntp, you are impressive!
One thing I do not understand: for lvm + ocfs2 you said lvm is not cluster-aware, but to my limited knowledge, AIX can create concurrent VGs, right? Is such a concurrent VG cluster-aware?


Neither lvm nor lvm2 is cluster-aware; the cluster-aware volume management software on Linux is clvm.
The concurrent VG on AIX is cluster-aware.
Author: myprotein    Time: 2006-09-15 10:47
Many thanks.
Author: king3171    Time: 2006-09-19 17:14
Originally posted by nntp on 2006-9-1 16:13:
gfs and ocfs2 are the same kind of thing; ocfs and gpfs are not that kind of thing, and ocfs is different from every one of the others.

gfs/ocfs2 make it possible for multiple nodes to access the same location on shared storage. They build a mechanism over an ordinary network to keep the filesystem caches on different nodes synchronized ...


I have read every reply in this thread and benefited a great deal. I am very interested in the comparison of these filesystems, but I still have doubts. I am fairly familiar with Sun Solaris filesystems and know a little about HP-UX and AIX, but not much about their filesystems. Solaris has something called a Global File System, also referred to as a cluster file system or proxy file system, which I assumed is the GFS you are talking about. In Solaris this Global File System can be accessed by multiple nodes of a cluster simultaneously, but only one node actually controls the reads and writes; the other nodes operate through that master node, and if the master node goes down, control moves to another node. However, Solaris's Global File System is essentially no different from an ordinary UFS filesystem; you simply add the global option when mounting the partition that is to serve as the Global File System, like this:
mount -o global,logging /dev/vx/dsk/nfs-dg/vol-01  /global/nfs
Last year, while building a Sun cluster running IBM DB2 with this Global File System, we ran into some problems, and the vendor's engineer said he did not recommend the Global File System because it was prone to trouble, so we removed it, even though it later turned out the problems were not caused by the Global File System.
    What I would like to know is this: is GFS a standard third-party technology used identically by every vendor, or does each vendor go its own way, with similar names but actually different underlying principles? Please point me in the right direction!!!

[ Last edited by king3171 on 2006-9-19 17:18 ]
Author: nntp    Time: 2006-09-19 21:58
Sorry to be blunt, but your understanding of the Solaris cluster filesystem is not correct.

Solaris can run a separate cluster filesystem product called Sun CFS - Cluster File System. It was bought in from Veritas CFS and rebadged as Sun's own product; in fact HP-UX also has a CFS, likewise rebadged from Veritas CFS. When that CFS came out, Sistina's GFS was still in its early, embryonic stage, so within the industry Veritas promoted its CFS as delivering "Global File Service".
That is one of the inaccuracies in what you have heard. Sun/HP's CFS claims to provide Global File Service, but that "GFS" is not Sistina's "GFS" (Global File System). One word of difference, and it captures both the similarity and the distinction between the two.

As for the principles and internal details of Sun's CFS, look for a white paper on Sun's site; I recall it is called something like Sun Cluster Software Cluster File System xxxxx (a PDF). Google it; it gives a detailed introduction to the components, characteristics, principles and basic features of Sun CFS, and it is written quite clearly.

The pinned post on this board has detailed links and documents about the GFS of Sistina (the company Red Hat acquired). Since your post shows you want to understand the difference between the two, and it is hard to explain in a few sentences, I suggest you read both white papers and specifications in full; the comparison will then become quite clear on its own.

These are different products, with different goals, design characteristics and uses, so there is no common functional standard between them. Standards at the lower coding and design level certainly exist; they still follow the main standards of the Unix world.
Author: king3171    Time: 2006-09-20 13:33
Thanks, I will look it up. The separate cluster filesystem product you mentioned for Solaris, Sun CFS - Cluster File System: I am not sure it is the SUN Cluster 3.1 product. I think not, because SUN Cluster 3.1 makes no mention of what you describe; the Global File System I talked about earlier is a concept from the Cluster 3.1 product. As for a standalone cluster filesystem product, the Sun engineers I have spoken with never mentioned one. I will check, and come back to share anything new I find.

[ Last edited by king3171 on 2006-9-20 13:46 ]
Author: nntp    Time: 2006-09-21 18:08
Sorry, but do go and read it. Heh.
Author: justenn    Time: 2006-12-28 10:07
I have read all 97 posts and benefited a great deal!




Welcome to Chinaunix (http://bbs.chinaunix.net/) Powered by Discuz! X3.2