OCFS, OCFS2, ASM, RAW merged discussion thread

Post #82

Posted 2006-08-31 22:01

# How do I enable and disable filesystem tracing?
To list all the debug bits along with their statuses, do:

# debugfs.ocfs2 -l

To enable tracing the bit SUPER, do:

# debugfs.ocfs2 -l SUPER allow

To disable tracing the bit SUPER, do:

# debugfs.ocfs2 -l SUPER off

To turn off tracing for the SUPER bit entirely, that is, to suppress it even when another enabled bit would otherwise trace the same message, do:

# debugfs.ocfs2 -l SUPER deny

To enable heartbeat tracing, do:

# debugfs.ocfs2 -l HEARTBEAT ENTRY EXIT allow

To disable heartbeat tracing, do:

# debugfs.ocfs2 -l HEARTBEAT off ENTRY EXIT deny
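The commands above all share one shape: debugfs.ocfs2 -l, one or more bit names, then a state. A minimal sketch of a wrapper, assuming that shape holds for every bit. The function name set_trace_bits and the DEBUGFS_OCFS2 override variable are illustrative only and are not part of ocfs2-tools.

```shell
# Hypothetical helper, not part of ocfs2-tools.
# DEBUGFS_OCFS2 lets the binary be overridden (set it to "echo" for a dry run).
DEBUGFS_OCFS2=${DEBUGFS_OCFS2:-debugfs.ocfs2}

# Usage: set_trace_bits STATE BIT [BIT...]
# STATE is one of allow, off, deny, as in the FAQ commands above.
set_trace_bits() {
    case "$1" in
        allow|off|deny) ;;
        *) echo "unknown state: $1" >&2; return 1 ;;
    esac
    _state=$1
    shift
    [ "$#" -gt 0 ] || { echo "no trace bits given" >&2; return 1; }
    # Bits first, state last, matching the FAQ examples.
    "$DEBUGFS_OCFS2" -l "$@" "$_state"
}
```

For example, set_trace_bits allow HEARTBEAT ENTRY EXIT runs the same command as the heartbeat example above.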

# How do I get a list of filesystem locks and their statuses?
OCFS2 1.0.9+ has this feature. To get this list, do:

* Mount debugfs at /debug.

      # mount -t debugfs debugfs /debug

* Dump the locks.

      # echo "fs_locks" | debugfs.ocfs2 /dev/sdX >/tmp/fslocks

# How do I read the fs_locks output?
Let's look at a sample output:

Lockres: M000000000000000006672078b84822  Mode: Protected Read
Flags: Initialized Attached
RO Holders: 0  EX Holders: 0
Pending Action: None  Pending Unlock Action: None
Requested Mode: Protected Read  Blocking Mode: Invalid

First thing to note is the Lockres, which is the lockname. The dlm identifies resources using locknames. A lockname is a combination of a lock type (S superblock, M metadata, D filedata, R rename, W readwrite), inode number and generation.
To get the inode number and generation from lockname, do:

# echo "stat <M000000000000000006672078b84822>" | debugfs.ocfs2 -n /dev/sdX
Inode: 419616 Mode: 0666 Generation: 2025343010 (0x78b84822)
....

To map the lockname to a directory entry, do:

# echo "locate <M000000000000000006672078b84822>" | debugfs.ocfs2 -n /dev/sdX
419616  /linux-2.6.15/arch/i386/kernel/semaphore.c

One could also provide the inode number instead of the lockname.

# echo "locate <419616>" | debugfs.ocfs2 -n /dev/sdX
419616  /linux-2.6.15/arch/i386/kernel/semaphore.c

To get a lockname from a directory entry, do:

# echo "encode /linux-2.6.15/arch/i386/kernel/semaphore.c" | debugfs.ocfs2 -n /dev/sdX
M000000000000000006672078b84822 D000000000000000006672078b84822 W000000000000000006672078b84822

The first is the Metadata lock, then Data lock and last ReadWrite lock for the same resource.
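The sample lockname can also be decoded by hand. The sketch below assumes a layout inferred only from the sample above: the first character is the lock type, the last 8 hex digits are the generation, and the hex digits in between are the zero-padded inode number. The function name decode_lockname is hypothetical.

```shell
# Sketch only: field layout inferred from the sample lockname in this FAQ.
# Prints: <type> <inode number> <generation>, both numbers in decimal.
decode_lockname() {
    _name=$1
    _len=${#_name}
    _type=$(printf '%s' "$_name" | cut -c1)
    # Last 8 characters: generation in hex.
    _gen=$(printf '%s' "$_name" | cut -c"$(($_len - 7))"-)
    # Everything between the type and the generation: inode in hex, zero padded.
    _ino=$(printf '%s' "$_name" | cut -c2-"$(($_len - 8))" | sed 's/^0*//')
    # An all-zero inode field is empty after stripping; treat it as 0.
    printf '%s %d %d\n' "$_type" "0x${_ino:-0}" "0x$_gen"
}

decode_lockname M000000000000000006672078b84822   # prints: M 419616 2025343010
```

The result matches the inode number and generation reported by the stat output above.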

The DLM supports 3 lock modes: NL no lock, PR protected read and EX exclusive.

If you have a dlm hang, the resource to look for would be one with the "Busy" flag set.

The next step would be to query the dlm for the lock resource.

Note: The dlm debugging is still a work in progress.

To do dlm debugging, first one needs to know the dlm domain, which matches the volume UUID.

# echo "stats" | debugfs.ocfs2 -n /dev/sdX | grep UUID: | while read a b ; do echo $b ; done
82DA8137A49A47E4B187F74E09FBBB4B

Then do:

# echo R dlm_domain lockname > /proc/fs/ocfs2_dlm/debug

For example:

# echo R 82DA8137A49A47E4B187F74E09FBBB4B M000000000000000006672078b84822 > /proc/fs/ocfs2_dlm/debug
# dmesg | tail
struct dlm_ctxt: 82DA8137A49A47E4B187F74E09FBBB4B, node=79, key=965960985
lockres: M000000000000000006672078b84822, owner=75, state=0 last used: 0, on purge list: no
   granted queue:
      type=3, conv=-1, node=79, cookie=11673330234144325711, ast=(empty=y,pend=n), bast=(empty=y,pend=n)
   converting queue:
   blocked queue:

It shows that the lock is mastered by node 75 and that node 79 has been granted a PR lock on the resource.
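The two steps above (find the domain, then write the query) can be combined. A minimal sketch, assuming the stats output contains a line of the form "UUID: <hex>" as the grep/read pipeline above implies. The helper names extract_uuid and dlm_query_line are hypothetical.

```shell
# Read "stats" output on stdin and print the value that follows "UUID:".
extract_uuid() {
    awk '$1 == "UUID:" { print $2; exit }'
}

# Build the query line for a dlm domain and lockname.
# The caller redirects the output to /proc/fs/ocfs2_dlm/debug.
dlm_query_line() {
    printf 'R %s %s\n' "$1" "$2"
}

# On a live node (not run here; /dev/sdX is a placeholder as in the FAQ):
#   uuid=$(echo "stats" | debugfs.ocfs2 -n /dev/sdX | extract_uuid)
#   dlm_query_line "$uuid" M000000000000000006672078b84822 > /proc/fs/ocfs2_dlm/debug
#   dmesg | tail
```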

This is just to give a flavor of dlm debugging.

LIMITS
# Is there a limit to the number of subdirectories in a directory?
Yes. OCFS2 currently allows up to 32000 subdirectories. While this limit could be increased, we will not be doing it till we implement some kind of efficient name lookup (htree, etc.).
# Is there a limit to the size of an ocfs2 file system?
Yes, current software addresses block numbers with 32 bits. So the file system device is limited to (2 ^ 32) * blocksize (see mkfs -b). With a 4KB block size this amounts to a 16TB file system. This block addressing limit will be relaxed in future software. At that point the limit becomes addressing clusters of 1MB each with 32 bits which leads to a 4PB file system.
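The arithmetic behind these limits can be checked in the shell. A small sketch, assuming a shell with 64-bit arithmetic; the function name max_fs_tb is illustrative.

```shell
# Maximum volume size in TB when 2^32 units of the given size (in bytes)
# can be addressed, as described above.
max_fs_tb() {
    # 4294967296 = 2^32 units; 1099511627776 = 2^40 bytes = 1 TB.
    echo $(( (4294967296 * $1) / 1099511627776 ))
}

max_fs_tb 4096      # 4KB blocks: prints 16 (TB)
max_fs_tb 1048576   # 1MB clusters: prints 4096 (TB, that is, 4PB)
```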

SYSTEM FILES
# What are system files?
System files are used to store standard filesystem metadata like bitmaps, journals, etc. Storing this information in files in a directory allows OCFS2 to be extensible. These system files can be accessed using debugfs.ocfs2. To list the system files, do:

# echo "ls -l //" | debugfs.ocfs2 -n /dev/sdX
        18       16    1    2  .
        18       16    2    2  ..
        19       24    10    1  bad_blocks
        20       32    18    1  global_inode_alloc
        21       20    8    1  slot_map
        22       24    9    1  heartbeat
        23       28    13    1  global_bitmap
        24       28    15    2  orphan_dir:0000
        25       32    17    1  extent_alloc:0000
        26       28    16    1  inode_alloc:0000
        27       24    12    1  journal:0000
        28       28    16    1  local_alloc:0000
        29       3796    17    1  truncate_log:0000

The first column lists the block number.
# Why do some files have numbers at the end?
There are two types of files, global and local. Global files are for all the nodes, while local files, like journal:0000, are node specific. The set of local files used by a node is determined by the slot mapping of that node. The number at the end of a system file name is the slot#. To list the slot maps, do:

# echo "slotmap" | debugfs.ocfs2 -n /dev/sdX
   Slot# Node#
         0    39
         1    40
         2    41
         3    42
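To tie the two listings together, the sketch below looks up a node's slot in slotmap output and prints the local system file names for that slot. It assumes the four-digit zero-padded suffix seen in the listing above; the helper names slot_for_node and local_sysfiles are hypothetical.

```shell
# Read slotmap output on stdin and print the slot number of the given node.
slot_for_node() {
    awk -v n="$1" '$1 ~ /^[0-9]+$/ && $2 == n { print $1 }'
}

# Print the node-specific system file names for a slot number.
local_sysfiles() {
    for _f in orphan_dir extent_alloc inode_alloc journal local_alloc truncate_log; do
        printf '%s:%04d\n' "$_f" "$1"
    done
}

# On a live node (not run here):
#   slot=$(echo "slotmap" | debugfs.ocfs2 -n /dev/sdX | slot_for_node 41)
#   local_sysfiles "$slot"
```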

HEARTBEAT
# How does the disk heartbeat work?
Every node writes every two secs to its block in the heartbeat system file. The block offset is equal to its global node number. So node 0 writes to the first block, node 1 to the second, etc. All the nodes also read the heartbeat sysfile every two secs. As long as the timestamp is changing, that node is deemed alive.
# When is a node deemed dead?
An active node is deemed dead if it does not update its timestamp for O2CB_HEARTBEAT_THRESHOLD (default=7) loops. Once a node is deemed dead, the surviving node that manages to take the cluster lock on the dead node's journal recovers it by replaying the journal.
# What about self fencing?
A node self-fences if it fails to update its timestamp for ((O2CB_HEARTBEAT_THRESHOLD - 1) * 2) secs. The [o2hb-x] kernel thread, after every timestamp write, sets a timer to panic the system after that duration. If the next timestamp is written within that duration, as it should be, the thread first cancels that timer before setting up a new one. This ensures the system self-fences if for some reason the [o2hb-x] kernel thread is unable to update the timestamp and the node is therefore deemed dead by the other nodes in the cluster.
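As a quick check of the self-fence formula, here is a one-line sketch; the function name self_fence_secs is illustrative.

```shell
# Seconds without a timestamp write before a node self-fences:
# (O2CB_HEARTBEAT_THRESHOLD - 1) * 2, with the threshold as the argument.
self_fence_secs() {
    echo $(( ($1 - 1) * 2 ))
}

self_fence_secs 7   # default threshold: prints 12
```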
# How can one change the parameter value of O2CB_HEARTBEAT_THRESHOLD?
This parameter value could be changed by adding it to /etc/sysconfig/o2cb and RESTARTING the O2CB cluster. This value should be the SAME on ALL the nodes in the cluster.
# What should one set O2CB_HEARTBEAT_THRESHOLD to?
It should be set to the timeout value of the io layer. Most multipath solutions have a timeout ranging from 60 secs to 120 secs. For 60 secs, set it to 31. For 120 secs, set it to 61.

O2CB_HEARTBEAT_THRESHOLD = (((timeout in secs) / 2) + 1)
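The formula can be evaluated directly in the shell. A minimal sketch; the function names are illustrative, and the KEY=value line format for /etc/sysconfig/o2cb is an assumption, since the FAQ only says to add the parameter to that file.

```shell
# Threshold for a given io-layer timeout in seconds: (timeout / 2) + 1.
# Integer division is used, so an odd timeout rounds down before adding 1.
heartbeat_threshold() {
    echo $(( $1 / 2 + 1 ))
}

# Print a line for /etc/sysconfig/o2cb (the line format is assumed).
o2cb_threshold_line() {
    printf 'O2CB_HEARTBEAT_THRESHOLD=%d\n' "$(heartbeat_threshold "$1")"
}

heartbeat_threshold 60    # prints 31
heartbeat_threshold 120   # prints 61
```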

# How does one check the current active O2CB_HEARTBEAT_THRESHOLD value?

# cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold
7

# What if a node umounts a volume?
During umount, the node will broadcast to all the nodes that have mounted that volume to drop that node from its node maps. As the journal is shutdown before this broadcast, any node crash after this point is ignored as there is no need for recovery.
# I encounter "Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing" whenever I run a heavy io load?
We have encountered a bug with the default CFQ io scheduler which causes a process doing heavy io to temporarily starve out other processes. While this is not fatal for most environments, it is for OCFS2, as we expect the hb thread to read and write the hb area at least once every 12 secs (default). A bug with the fix has been filed with Red Hat, which is expected to include the fix in the RHEL4 U4 release. SLES9 SP3 2.6.5-7.257 includes this fix. For the latest, refer to the tracker bug filed on bugzilla. Until this issue is resolved, one is advised to use the DEADLINE io scheduler. To use it, add "elevator=deadline" to the kernel command line as follows:

* For SLES9, edit the command line in /boot/grub/menu.lst.

   title Linux 2.6.5-7.244-bigsmp (with deadline)
      kernel (hd0,4)/boot/vmlinuz-2.6.5-7.244-bigsmp root=/dev/sda5
      vga=0x314 selinux=0 splash=silent resume=/dev/sda3 elevator=deadline showopts console=tty0 console=ttyS0,115200 noexec=off
      initrd (hd0,4)/boot/initrd-2.6.5-7.244-bigsmp

* For RHEL4, edit the command line in /boot/grub/grub.conf:

   title Red Hat Enterprise Linux AS (2.6.9-22.EL) (with deadline)
      root (hd0,0)
      kernel /vmlinuz-2.6.9-22.EL ro root=LABEL=/ console=ttyS0,115200 console=tty0 elevator=deadline noexec=off
      initrd /initrd-2.6.9-22.EL.img

To see the current kernel command line, do:

# cat /proc/cmdline
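Rather than reading the command line by eye, the check can be scripted. A small sketch; the function name uses_deadline is hypothetical.

```shell
# Succeeds if the given kernel command line contains elevator=deadline
# as a whole word.
uses_deadline() {
    case " $1 " in
        *" elevator=deadline "*) return 0 ;;
        *) return 1 ;;
    esac
}

# On a live system (not run here):
#   uses_deadline "$(cat /proc/cmdline)" && echo "deadline scheduler requested"
```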

QUORUM AND FENCING
# What is a quorum?
A quorum is a designation given to a group of nodes in a cluster which are still allowed to operate on shared storage. It comes up when there is a failure in the cluster which breaks the nodes up into groups which can communicate in their groups and with the shared storage but not between groups.
# How does OCFS2's cluster services define a quorum?
The quorum decision is made by a single node based on the number of other nodes that are considered alive by heartbeating and the number of other nodes that are reachable via the network.
A node has quorum when:

* it sees an odd number of heartbeating nodes and has network connectivity to more than half of them.
   OR,
* it sees an even number of heartbeating nodes and has network connectivity to at least half of them *and* has connectivity to the heartbeating node with the lowest node number.
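The two rules can be written as a small predicate. This sketch follows the wording above literally; whether the counts include the deciding node itself is left to the caller, as long as both counts use the same convention. The function name has_quorum and its argument order are illustrative.

```shell
# has_quorum HEARTBEATING CONNECTED SEES_LOWEST
#   HEARTBEATING: number of heartbeating nodes seen
#   CONNECTED:    how many of those are reachable over the network
#   SEES_LOWEST:  1 if connected to the lowest-numbered heartbeating node, else 0
has_quorum() {
    if [ $(( $1 % 2 )) -eq 1 ]; then
        # Odd count: strictly more than half must be reachable.
        [ $(( $2 * 2 )) -gt "$1" ]
    else
        # Even count: at least half, plus the lowest-node tie-breaker.
        [ $(( $2 * 2 )) -ge "$1" ] && [ "$3" -eq 1 ]
    fi
}

has_quorum 3 2 0 && echo "quorum"      # odd, 2 of 3 reachable
has_quorum 2 1 0 || echo "no quorum"   # even split without the lowest node
```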

# What is fencing?
Fencing is the act of forcefully removing a node from a cluster. A node with OCFS2 mounted will fence itself when it realizes that it doesn't have quorum in a degraded cluster. It does this so that other nodes won't get stuck trying to access its resources. Currently OCFS2 will panic the machine when it realizes it has to fence itself off from the cluster. As described in the quorum definition above, it will do this when it sees more nodes heartbeating than it has connectivity to and fails the quorum test.
# How does a node decide that it has connectivity with another?
When a node sees another come to life via heartbeating it will try and establish a TCP connection to that newly live node. It considers that other node connected as long as the TCP connection persists and the connection is not idle for 10 seconds. Once that TCP connection is closed or idle it will not be reestablished until heartbeat thinks the other node has died and come back alive.
# How long does the quorum process take?
First a node will realize that it doesn't have connectivity with another node. This can happen immediately if the connection is closed but can take a maximum of 10 seconds of idle time. Then the node must wait long enough to give heartbeating a chance to declare the node dead. It does this by waiting two iterations longer than the number of iterations needed to consider a node dead (see the Heartbeat section of this FAQ). The current default of 7 iterations of 2 seconds results in waiting for 9 iterations or 18 seconds. By default, then, a maximum of 28 seconds can pass from the time a network fault occurs until a node fences itself.
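The worst-case figure follows from the numbers in the paragraph above. A sketch of that arithmetic; the function name max_fence_delay is illustrative, and the 10-second idle timeout and 2-second iteration are taken from the text.

```shell
# max_fence_delay THRESHOLD [IDLE_SECS]
# Worst case from network fault to self-fence:
# idle timeout + (threshold + 2) iterations of 2 seconds each.
max_fence_delay() {
    _idle=${2:-10}
    echo $(( $_idle + ($1 + 2) * 2 ))
}

max_fence_delay 7   # default threshold: prints 28
```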
# How can one prevent a node from panicking when the other node in a 2-node cluster is shut down?
This typically means that the network is shutting down before all the OCFS2 volumes have been umounted. Ensure the ocfs2 init script is enabled. This script ensures that the OCFS2 volumes are umounted before the network is shut down. To check whether the service is enabled, do:

   # chkconfig --list ocfs2
   ocfs2    0:off 1:off 2:on 3:on 4:on 5:on 6:off

# How does one list out the startup and shutdown ordering of the OCFS2 related services?

* To list the startup order for runlevel 3 on RHEL4, do:

      # cd /etc/rc3.d
      # ls S*ocfs2* S*o2cb* S*network*
      S10network  S24o2cb  S25ocfs2

* To list the shutdown order on RHEL4, do:

      # cd /etc/rc6.d
      # ls K*ocfs2* K*o2cb* K*network*
      K19ocfs2  K20o2cb  K90network

* To list the startup order for runlevel 3 on SLES9, do:

      # cd /etc/init.d/rc3.d
      # ls S*ocfs2* S*o2cb* S*network*
      S05network  S07o2cb  S08ocfs2

* To list the shutdown order on SLES9, do:

      # cd /etc/init.d/rc3.d
      # ls K*ocfs2* K*o2cb* K*network*
      K14ocfs2  K15o2cb  K17network

Please note that the default ordering in the ocfs2 scripts only includes the network service and not any shared-device specific service, like iscsi. If one is using iscsi or any shared device requiring a service to be started and shut down, please ensure that that service starts before and shuts down after the ocfs2 init service.
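The orderings listed above can be verified mechanically from the script names. A minimal sketch, assuming names of the form S24o2cb or K19ocfs2 with a two-digit sequence number; the helper names seq_of, start_order_ok and stop_order_ok are hypothetical.

```shell
# seq_of SERVICE NAME...: print the sequence number of SERVICE among
# init script names such as S24o2cb or K19ocfs2.
seq_of() {
    _svc=$1
    shift
    for _n in "$@"; do
        case "$_n" in
            [SK][0-9][0-9]"$_svc")
                _s=$(printf '%s' "$_n" | cut -c2-3)
                echo "${_s#0}"   # drop one leading zero, e.g. 05 -> 5
                return 0 ;;
        esac
    done
    return 1
}

# Startup: network before o2cb, o2cb before ocfs2.
start_order_ok() {
    _net=$(seq_of network "$@") && _cb=$(seq_of o2cb "$@") && _fs=$(seq_of ocfs2 "$@") || return 1
    [ "$_net" -lt "$_cb" ] && [ "$_cb" -lt "$_fs" ]
}

# Shutdown: ocfs2 first, then o2cb, network last.
stop_order_ok() {
    _net=$(seq_of network "$@") && _cb=$(seq_of o2cb "$@") && _fs=$(seq_of ocfs2 "$@") || return 1
    [ "$_fs" -lt "$_cb" ] && [ "$_cb" -lt "$_net" ]
}
```

On RHEL4, for example, one could run "cd /etc/rc3.d && start_order_ok S*" and check the exit status.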

NOVELL SLES9
# Why are OCFS2 packages for SLES9 not made available on oss.oracle.com?
OCFS2 packages for SLES9 are available directly from Novell as part of the kernel. The same is true for the various Asianux distributions and for Ubuntu. As OCFS2 is now part of the mainline kernel, we expect more distributions to bundle the product with the kernel.
# What versions of OCFS2 are available with SLES9 and how do they match with the Red Hat versions available on oss.oracle.com?
As Novell and Oracle ship OCFS2 on different schedules, the package versions do not match. We expect this to resolve itself over time as the number of patch fixes decreases. Novell is shipping two SLES9 releases, viz., SP2 and SP3.

* The latest kernel with the SP2 release is 2.6.5-7.202.7. It ships with OCFS2 1.0.8.
* The latest kernel with the SP3 release is 2.6.5-7.257. It ships with OCFS2 1.2.1.

RELEASE 1.2
# What is new in OCFS2 1.2?
OCFS2 1.2 has two new features:

* It is endian-safe. With this release, one can mount the same volume concurrently on x86, x86-64, ia64 and big endian architectures ppc64 and s390x.
* It supports readonly mounts. The fs uses this feature to automatically remount read-only when it encounters on-disk corruption (instead of panicking).

# Do I need to re-make the volume when upgrading?
No. OCFS2 1.2 is fully on-disk compatible with 1.0.
# Do I need to upgrade anything else?
Yes, the tools need to be upgraded to ocfs2-tools 1.2. ocfs2-tools 1.0 will not work with OCFS2 1.2, nor will the 1.2 tools work with the 1.0 modules.

UPGRADE TO THE LATEST RELEASE
# How do I upgrade to the latest release?

* Download the latest ocfs2-tools and ocfs2console for the target platform and the appropriate ocfs2 module package for the kernel version, flavor and architecture. (For more, refer to the "Download and Install" section above.)

* Umount all OCFS2 volumes.

      # umount -at ocfs2

* Shutdown the cluster and unload the modules.

      # /etc/init.d/o2cb offline
      # /etc/init.d/o2cb unload

* If required, upgrade the tools and console.

      # rpm -Uvh ocfs2-tools-1.2.1-1.i386.rpm ocfs2console-1.2.1-1.i386.rpm

* Upgrade the module.

      # rpm -Uvh ocfs2-2.6.9-22.0.1.ELsmp-1.2.2-1.i686.rpm

* Ensure init services ocfs2 and o2cb are enabled.

      # chkconfig --add o2cb
      # chkconfig --add ocfs2

* To check whether the services are enabled, do:

      # chkconfig --list o2cb
      o2cb    0:off 1:off 2:on 3:on 4:on 5:on 6:off
      # chkconfig --list ocfs2
      ocfs2    0:off 1:off 2:on 3:on 4:on 5:on 6:off

* At this stage one could either reboot the node or simply restart the cluster and mount the volume.

# Can I do a rolling upgrade from 1.0.x/1.2.x to 1.2.2?
Rolling upgrade to 1.2.2 is not recommended. Shutdown the cluster on all nodes before upgrading the nodes.
# After upgrade I am getting the following error on mount "mount.ocfs2: Invalid argument while mounting /dev/sda6 on /ocfs".
Do "dmesg | tail". If you see the error:

ocfs2_parse_options:523 ERROR: Unrecognized mount option "heartbeat=local" or missing value

it means that you are trying to use the 1.2 tools and 1.0 modules. Ensure that you have unloaded the 1.0 modules and installed and loaded the 1.2 modules. Use modinfo to determine the version of the module installed and/or loaded.
# The cluster fails to load. What do I do?
Check "dmesg | tail" for any relevant errors. One common error is as follows:

SELinux: initialized (dev configfs, type configfs), not configured for labeling audit(1139964740.184:2): avc:  denied  { mount } for  ...

The above error indicates that you have SELinux activated. A bug in SELinux does not allow configfs to mount. Disable SELinux by setting "SELINUX=disabled" in /etc/selinux/config. The change takes effect on reboot.

[ Last edited by nntp on 2006-9-1 00:00 ]

Post #83

Posted 2006-08-31 22:02

PROCESSES
# List and describe all OCFS2 threads?

[o2net]
One per node. Is a workqueue thread started when the cluster is brought online and stopped when offline. It handles the network communication for all threads. It gets the list of active nodes from the o2hb thread and sets up tcp/ip communication channels with each active node. It sends regular keepalive packets to detect any interruption on the channels.
[user_dlm]
One per node. Is a workqueue thread started when dlmfs is loaded and stopped on unload. (dlmfs is an in-memory file system which allows user space processes to access the dlm in kernel to lock and unlock resources.) Handles lock downconverts when requested by other nodes.
[ocfs2_wq]
One per node. Is a workqueue thread started when ocfs2 module is loaded and stopped on unload. Handles blockable file system tasks like truncate log flush, orphan dir recovery and local alloc recovery, which involve taking dlm locks. Various code paths queue tasks to this thread. For example, ocfs2rec queues orphan dir recovery so that while the task is kicked off as part of recovery, its completion does not affect the recovery time.
[o2hb-14C29A7392]
One per heartbeat device. Is a kernel thread started when the heartbeat region is populated in configfs and stopped when it is removed. It writes every 2 secs to its block in the heartbeat region to indicate to other nodes that that node is alive. It also reads the region to maintain a nodemap of live nodes. It notifies o2net and dlm of any changes in the nodemap.
[ocfs2vote-0]
One per mount. Is a kernel thread started when a volume is mounted and stopped on umount. It downgrades locks when requested by other nodes in response to blocking ASTs (BASTs). It also fixes up the dentry cache in response to files unlinked or renamed on other nodes.
[dlm_thread]
One per dlm domain. Is a kernel thread started when a dlm domain is created and stopped when destroyed. This is the core dlm which maintains the list of lock resources and handles the cluster locking infrastructure.
[dlm_reco_thread]
One per dlm domain. Is a kernel thread which handles dlm recovery whenever a node dies. If the node is the dlm recovery master, it remasters all the locks owned by the dead node.
[dlm_wq]
One per dlm domain. Is a workqueue thread. o2net queues dlm tasks on this thread.
[kjournald]
One per mount. Is used as OCFS2 uses JBD for journalling.
[ocfs2cmt-0]
One per mount. Is a kernel thread started when a volume is mounted and stopped on umount. Works in conjunction with kjournald.
[ocfs2rec-0]
Is started whenever another node needs to be recovered. This could be either on mount when it discovers a dirty journal or during operation when hb detects a dead node. ocfs2rec handles the file system recovery and it runs after the dlm has finished its recovery.

Post #84

Posted 2006-08-31 22:02

url:

http://oss.oracle.com/projects/o ... ocfs2_faq.html#O2CB

nntp

Post #85

Posted 2006-09-01 00:44

Everyone, I have merged this board's main discussion threads on OCFS, OCFS2, ASM and raw into one. Please continue the discussion here.

nntp

Post #86

Posted 2006-09-01 03:05

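If you need to deploy RAC quickly and lack experience in this area, Oracle's "Oracle Validated Configurations" (OVC) is the best help available.
When Oracle first released OVC I thought it was extremely good. Even for people who know Linux, Oracle and RAC very well, it is a tool that greatly reduces the workload.

If you are unsure of the details and under deadline pressure, you can follow OVC step by step to get the job done. If RAC is already set up and you run into a fault, OVC also serves as a troubleshooting reference.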

Oracle Validated Configurations
http://www.oracle.com/technology ... urations/index.html

nntp

Post #87

Posted 2006-09-01 03:46

http://forums.oracle.com/forums/ ... 337838&#1337838
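A very worthwhile Q&A discussion on the Oracle Forum. My view is basically the same as that of the last few posters there, especially the one who mentioned the easy conversion between ASM and RAW.
Also, when I earlier answered someone in this thread about where to place the voting disk and the OCR, I did not go into the reasons. This discussion touches on them briefly as well.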

Post #88

Posted 2006-09-01 10:07

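Originally posted by nntp on 2006-8-31 18:01

Single instance or RAC? If it is RAC, ASM can handle this situation even after a power loss. Do you subscribe to Oracle Magazine? An issue at the end of last year covered a similar case.

I am quite interested in that article. Could you provide a URL?

If recovery is needed here, I think it would be fairly difficult. After all, there is not much material on the internal I/O mechanism of ASM.

[ Last edited by vecentli on 2006-9-1 10:10 ]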

blue_stone

Honorary moderator

Post #89

Posted 2006-09-01 12:01

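Could Red Hat's GFS and IBM's GPFS be discussed here as well?
Could someone compare GFS, GPFS, OCFS and OCFS2?
For example on use cases, reliability, availability, performance and stability.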

nntp