Thread starter: cwinxp

OCFS, OCFS2, ASM, RAW discussion threads (merged)

Posted 2006-08-31 21:59

OCFS2 FAQ

OCFS2 - FREQUENTLY ASKED QUESTIONS

      CONTENTS
    * General
    * Download and Install
    * Configure
    * O2CB Cluster Service
    * Format
    * Mount
    * Oracle RAC
    * Migrate Data from OCFS (Release 1) to OCFS2
    * Coreutils
    * Troubleshooting
    * Limits
    * System Files
    * Heartbeat
    * Quorum and Fencing
    * Novell SLES9
    * Release 1.2
    * Upgrade to the Latest Release
    * Processes

      GENERAL
   1. How do I get started?
          * Download and install the module and tools rpms.
          * Create cluster.conf and propagate to all nodes.
          * Configure and start the O2CB cluster service.
          * Format the volume.
          * Mount the volume.
   2. How do I know the version number running?

              # cat /proc/fs/ocfs2/version
              OCFS2 1.2.1 Fri Apr 21 13:51:24 PDT 2006 (build bd2f25ba0af9677db3572e3ccd92f739)

   3. How do I configure my system to auto-reboot after a panic?
      To auto-reboot the system 60 secs after a panic, do:

              # echo 60 > /proc/sys/kernel/panic

      To enable the above on every reboot, add the following to /etc/sysctl.conf:

              kernel.panic = 60

      DOWNLOAD AND INSTALL
   4. Where do I get the packages from?
      For Novell's SLES9, upgrade to the latest SP3 kernel to get the required modules installed. Also, install ocfs2-tools and ocfs2console packages. For Red Hat's RHEL4, download and install the appropriate module package and the two tools packages, ocfs2-tools and ocfs2console. Appropriate module refers to one matching the kernel version, flavor and architecture. Flavor refers to smp, hugemem, etc.
   5. What are the latest versions of the OCFS2 packages?
      The latest module package version is 1.2.2. The latest tools/console packages versions are 1.2.1.
   6. How do I interpret the package name ocfs2-2.6.9-22.0.1.ELsmp-1.2.1-1.i686.rpm?
      The package name is comprised of multiple parts separated by '-'.
          * ocfs2 - Package name
          * 2.6.9-22.0.1.ELsmp - Kernel version and flavor
          * 1.2.1 - Package version
          * 1 - Package subversion
          * i686 - Architecture
   7. How do I know which package to install on my box?
      After one identifies the package name and version to install, one still needs to determine the kernel version, flavor and architecture.
      To know the kernel version and flavor, do:

              # uname -r
              2.6.9-22.0.1.ELsmp

      To know the architecture, do:

              # rpm -qf /boot/vmlinuz-`uname -r` --queryformat "%{ARCH}\n"
              i686

   8. Why can't I use uname -p to determine the kernel architecture?
      uname -p does not always provide the exact kernel architecture. A case in point is the RHEL3 kernels on x86_64: even though Red Hat has two different kernel architectures available for this port, ia32e and x86_64, uname -p identifies both as the generic x86_64.
   9. How do I install the rpms?
      First install the tools and console packages:

              # rpm -Uvh ocfs2-tools-1.2.1-1.i386.rpm ocfs2console-1.2.1-1.i386.rpm

      Then install the appropriate kernel module package:

              # rpm -Uvh ocfs2-2.6.9-22.0.1.ELsmp-1.2.1-1.i686.rpm

  10. Do I need to install the console?
      No, the console is not required but recommended for ease-of-use.
  11. What are the dependencies for installing ocfs2console?
      ocfs2console requires e2fsprogs, glib2 2.2.3 or later, vte 0.11.10 or later, pygtk2 (EL4) or python-gtk (SLES9) 1.99.16 or later, python 2.3 or later and ocfs2-tools.
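      A quick way to check whether these are already installed is to query rpm (a sketch; package names can differ between distributions, e.g., python-gtk instead of pygtk2 on SLES9):

              # rpm -q e2fsprogs glib2 vte pygtk2 python ocfs2-tools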
  12. What modules are installed with the OCFS2 1.2 package?
          * configfs.ko
          * ocfs2.ko
          * ocfs2_dlm.ko
          * ocfs2_dlmfs.ko
          * ocfs2_nodemanager.ko
          * debugfs
  13. What tools are installed with the ocfs2-tools 1.2 package?
          * mkfs.ocfs2
          * fsck.ocfs2
          * tunefs.ocfs2
          * debugfs.ocfs2
          * mount.ocfs2
          * mounted.ocfs2
          * ocfs2cdsl
          * ocfs2_hb_ctl
          * o2cb_ctl
          * o2cb - init service to start/stop the cluster
          * ocfs2 - init service to mount/umount ocfs2 volumes
          * ocfs2console - installed with the console package
  14. What is debugfs and is it related to debugfs.ocfs2?
      debugfs is an in-memory filesystem developed by Greg Kroah-Hartman. It is useful for debugging as it allows kernel space to easily export data to userspace. It is currently being used by OCFS2 to dump the list of filesystem locks and could be used for more in the future. It is bundled with OCFS2 as the various distributions are currently not bundling it. While debugfs and debugfs.ocfs2 are unrelated in general, the latter is used as the front-end for the debugging info provided by the former. For example, refer to the troubleshooting section.

      CONFIGURE
  15. How do I populate /etc/ocfs2/cluster.conf?
      If you have installed the console, use it to create this configuration file. For details, refer to the user's guide. If you do not have the console installed, check the Appendix in the User's guide for a sample cluster.conf and the details of all the components. Do not forget to copy this file to all the nodes in the cluster. If you ever edit this file on any node, ensure the other nodes are updated as well.
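      For reference, a minimal two-node cluster.conf looks roughly like the following (the cluster name, node names, port and IP addresses are placeholders; see the user's guide for the authoritative layout):

              cluster:
                      node_count = 2
                      name = ocfs2

              node:
                      ip_port = 7777
                      ip_address = 192.168.0.1
                      number = 0
                      name = node1
                      cluster = ocfs2

              node:
                      ip_port = 7777
                      ip_address = 192.168.0.2
                      number = 1
                      name = node2
                      cluster = ocfs2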
  16. Should the IP interconnect be public or private?
      Using a private interconnect is recommended. While OCFS2 does not take much bandwidth, it does require the nodes to be alive on the network and sends regular keepalive packets to ensure that they are. To avoid a network delay being interpreted as a node disappearing on the net which could lead to a node-self-fencing, a private interconnect is recommended. One could use the same interconnect for Oracle RAC and OCFS2.
  17. What should the node name be and should it be related to the IP address?
      The node name needs to match the hostname. The IP address need not be the one associated with that hostname. As in, any valid IP address on that node can be used. OCFS2 will not attempt to match the node name (hostname) with the specified IP address.
  18. How do I modify the IP address, port or any other information specified in cluster.conf?
      While one can use ocfs2console to add nodes dynamically to a running cluster, any other modifications require the cluster to be offlined. Stop the cluster on all nodes, edit /etc/ocfs2/cluster.conf on one and copy to the rest, and restart the cluster on all nodes. Always ensure that cluster.conf is the same on all the nodes in the cluster.
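      As a rough sketch, the edit cycle on each node could look like this (node1 here is a placeholder for the node holding the edited copy):

              # umount -at ocfs2
              # /etc/init.d/o2cb stop
              # scp node1:/etc/ocfs2/cluster.conf /etc/ocfs2/cluster.conf
              # /etc/init.d/o2cb start
              # mount -at ocfs2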
  19. How do I add a new node to an online cluster?
      You can use the console to add a new node. However, you will need to explicitly add the new node on all the online nodes. That is, adding on one node and propagating to the other nodes is not sufficient. If the operation fails, it will most likely be due to bug#741. In that case, you can use the o2cb_ctl utility on all online nodes as follows:

              # o2cb_ctl -C -i -n NODENAME -t node -a number=NODENUM -a ip_address=IPADDR -a ip_port=IPPORT -a cluster=CLUSTERNAME

      Ensure the node is added both in /etc/ocfs2/cluster.conf and in /config/cluster/CLUSTERNAME/node on all online nodes. You can then copy cluster.conf to the new (still offline) node as well as to any other offline nodes. At the end, ensure that cluster.conf is consistent on all the nodes.
  20. How do I add a new node to an offline cluster?
      You can either use the console, use o2cb_ctl, or simply hand-edit cluster.conf. Then either use the console to propagate it to all nodes or hand-copy it using scp or any other tool. The o2cb_ctl command to do the same is:

              # o2cb_ctl -C -n NODENAME -t node -a number=NODENUM -a ip_address=IPADDR -a ip_port=IPPORT -a cluster=CLUSTERNAME

      Notice the "-i" argument is not required as the cluster is not online.

      O2CB CLUSTER SERVICE
  21. How do I configure the cluster service?

              # /etc/init.d/o2cb configure

      Enter 'y' if you want the service to load on boot, and enter the name of the cluster (as listed in /etc/ocfs2/cluster.conf).
  22. How do I start the cluster service?
          * To load the modules, do:

                    # /etc/init.d/o2cb load

          * To online it, do:

                    # /etc/init.d/o2cb online [cluster_name]

      If you have configured the cluster to load on boot, you could combine the two as follows:

              # /etc/init.d/o2cb start [cluster_name]

      The cluster name is not required if you have specified the name during configuration.
  23. How do I stop the cluster service?
          * To offline it, do:

                    # /etc/init.d/o2cb offline [cluster_name]

          * To unload the modules, do:

                    # /etc/init.d/o2cb unload

      If you have configured the cluster to load on boot, you could combine the two as follows:

              # /etc/init.d/o2cb stop [cluster_name]

      The cluster name is not required if you have specified the name during configuration.
  24. How can I learn the status of the cluster?
      To learn the status of the cluster, do:

              # /etc/init.d/o2cb status

  25. I am unable to get the cluster online. What could be wrong?
      Check whether the node name in cluster.conf exactly matches the hostname. At least one of the nodes listed in cluster.conf must match the hostname of the node on which you are trying to bring the cluster online.
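      A quick sanity check is to compare the hostname with the node names listed in cluster.conf (the output below is illustrative):

              # hostname
              node1
              # grep "name =" /etc/ocfs2/cluster.conf
                      name = node1
                      name = node2
                      name = ocfs2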

      FORMAT
  26. How do I format a volume?
      You could either use the console or use mkfs.ocfs2 directly to format the volume. For console, refer to the user's guide.

              # mkfs.ocfs2 -L "oracle_home" /dev/sdX

      The above formats the volume with default block and cluster sizes, which are computed based upon the size of the volume.

              # mkfs.ocfs2 -b 4k -C 32K -L "oracle_home" -N 4 /dev/sdX

      The above formats the volume for 4 nodes with a 4K block size and a 32K cluster size.
  27. What does the number of node slots during format refer to?
      The number of node slots specifies the number of nodes that can concurrently mount the volume. This number is specified during format and can be increased using tunefs.ocfs2. This number cannot be decreased.
  28. What should I consider when determining the number of node slots?
      OCFS2 allocates system files, like the journal, for each node slot. So as not to waste space, one should specify a number in the ballpark of the actual number of nodes. Also, as this number can be increased later, there is no need to specify a number much larger than the number of nodes one plans to mount the volume on.
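      For example, a volume formatted with 4 node slots could later be grown to 8 using tunefs.ocfs2 (a sketch; as noted above, the slot count can only be increased, never decreased):

              # tunefs.ocfs2 -N 8 /dev/sdX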
  29. Does the number of node slots have to be the same for all volumes?
      No. This number can be different for each volume.
  30. What block size should I use?
      The block size is the smallest unit of space addressable by the file system. OCFS2 supports block sizes of 512 bytes, 1K, 2K and 4K. The block size cannot be changed after format. For most volume sizes, a 4K block size is recommended. The 512-byte block size, on the other hand, is never recommended.
  31. What cluster size should I use?
      A cluster size is the smallest unit of space allocated to a file to hold the data. OCFS2 supports cluster sizes of 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K and 1M. For database volumes, a cluster size of 128K or larger is recommended. For Oracle home, 32K to 64K.
  32. Any advantage of labelling the volumes?
      As the device name (/dev/sdX) for a particular device can differ from node to node in a shared-disk environment, labelling becomes a must for easy identification. You can also use labels to identify volumes during mount.

              # mount -L "label" /dir

      The volume label is changeable using the tunefs.ocfs2 utility.
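      For example, to change the label and then list the labels of all OCFS2 devices (a sketch):

              # tunefs.ocfs2 -L "new_label" /dev/sdX
              # mounted.ocfs2 -d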

      MOUNT
  33. How do I mount the volume?
      You could either use the console or use mount directly. For console, refer to the user's guide.

              # mount -t ocfs2 /dev/sdX /dir

      The above command will mount device /dev/sdX on directory /dir.
  34. How do I mount by label?
      To mount by label do:

              # mount -L "label" /dir

  35. What entry do I add to /etc/fstab to mount an ocfs2 volume?
      Add the following:

              /dev/sdX        /dir        ocfs2        noauto,_netdev        0        0

      The _netdev option indicates that the device needs to be mounted only after the network is up.
  36. What do I need to do to mount OCFS2 volumes on boot?
          * Enable o2cb service using:

                    # chkconfig --add o2cb

          * Enable ocfs2 service using:

                    # chkconfig --add ocfs2

          * Configure o2cb to load on boot using:

                    # /etc/init.d/o2cb configure

          * Add entries into /etc/fstab as follows:

                    /dev/sdX        /dir        ocfs2        _netdev        0        0

  37. How do I know my volume is mounted?
          * Enter mount without arguments, or,

                    # mount

          * List /etc/mtab, or,

                    # cat /etc/mtab

          * List /proc/mounts, or,

                    # cat /proc/mounts

          * Run ocfs2 service.

                    # /etc/init.d/ocfs2 status

            The mount command reads /etc/mtab to show the information.
  38. What are the /config and /dlm mountpoints for?
      OCFS2 comes bundled with two in-memory filesystems configfs and ocfs2_dlmfs. configfs is used by the ocfs2 tools to communicate to the in-kernel node manager the list of nodes in the cluster and to the in-kernel heartbeat thread the resource to heartbeat on. ocfs2_dlmfs is used by ocfs2 tools to communicate with the in-kernel dlm to take and release clusterwide locks on resources.
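      The o2cb init service creates these mounts for you; done by hand, they would look roughly like the following (a sketch):

              # mount -t configfs configfs /config
              # mount -t ocfs2_dlmfs ocfs2_dlmfs /dlm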
  39. Why does it take so much time to mount the volume?
      It takes around 5 secs for a volume to mount; it does so to let the heartbeat thread stabilize. In a later release, we plan to add support for a global heartbeat, which will make most mounts instant.

      ORACLE RAC
  40. Any special flags to run Oracle RAC?
      OCFS2 volumes containing the Voting diskfile (CRS), Cluster registry (OCR), Data files, Redo logs, Archive logs and Control files must be mounted with the datavolume and nointr mount options. The datavolume option ensures that the Oracle processes open these files with the o_direct flag. The nointr option ensures that the I/Os are not interrupted by signals.

              # mount -o datavolume,nointr -t ocfs2 /dev/sda1 /u01/db

  41. What about the volume containing Oracle home?
      The Oracle home volume should be mounted normally, that is, without the datavolume and nointr mount options. These mount options are only relevant for the Oracle files listed above.

              # mount -t ocfs2 /dev/sdb1 /software/orahome

      Also, as OCFS2 does not currently support shared writeable mmap, the health check (GIMH) file $ORACLE_HOME/dbs/hc_ORACLESID.dat and the ASM file $ASM_HOME/dbs/ab_ORACLESID.dat should be symlinked to a local filesystem. We expect to support shared writeable mmap in the RHEL5 timeframe.
  42. Does that mean I cannot have my data files and Oracle home on the same volume?
      Yes. The volume containing the Oracle data files, redo logs, etc. should never be the same volume as the one containing the distribution (including trace logs like alert.log).
  43. Any other information I should be aware of?
      The 1.2.3 release of OCFS2 does not update the modification time on the inode across the cluster for non-extending writes. However, the time will be locally updated in the cached inodes. This leads to one observing different times (ls -l) for the same file on different nodes on the cluster.
      While this does not affect most uses of the filesystem, since writes typically change the file size, the usage where this is most commonly noticed is with Oracle datafiles and redo logs. This is because Oracle rarely resizes these files and thus almost all writes are non-extending.
      In the short term (1.2.x), we intend to provide a mount option (nocmtime) to allow users to explicitly ask the filesystem to not change the modification time during non-extending writes. While this is not the complete solution, this will ensure that the times are consistent across the cluster.
      In the long term (1.4.x), we intend to fix this by updating modification times for all writes while providing an opt-out option (nocmtime) for users who would prefer to avoid the performance overhead associated with this feature.

      MIGRATE DATA FROM OCFS (RELEASE 1) TO OCFS2
  44. Can I mount OCFS volumes as OCFS2?
      No. OCFS and OCFS2 are not on-disk compatible. We had to break the compatibility in order to add many of the new features. At the same time, we have added enough flexibility in the new disk layout so as to maintain backward compatibility in the future.
  45. Can OCFS volumes and OCFS2 volumes be mounted on the same machine simultaneously?
      No. OCFS only works on 2.4 Linux kernels (Red Hat's AS2.1/EL3 and SuSE's SLES). OCFS2, on the other hand, only works on 2.6 kernels (Red Hat's EL4 and SuSE's SLES9).
  46. Can I access my OCFS volume on 2.6 kernels (SLES9/RHEL4)?
      Yes, you can access the OCFS volume on 2.6 kernels using FSCat tools, fsls and fscp. These tools can access the OCFS volumes at the device layer, to list and copy the files to another filesystem. FSCat tools are available on oss.oracle.com.
  47. Can I in-place convert my OCFS volume to OCFS2?
      No. The on-disk layouts of OCFS and OCFS2 are sufficiently different that it would require a third disk (as a temporary buffer) in order to in-place upgrade the volume. With that in mind, it was decided not to develop such a tool but instead to provide tools to copy data from OCFS without one having to mount it.
  48. What is the quickest way to move data from OCFS to OCFS2?
      Quickest would mean having to perform the minimal number of copies. If you have a current backup on a non-OCFS volume accessible from the 2.6 kernel install, then all you would need to do is restore the backup on the OCFS2 volume(s). If you do not have a backup but have a setup in which the system containing the OCFS2 volumes can access the disks containing the OCFS volume, you can use the FSCat tools to extract data from the OCFS volume and copy it onto OCFS2.

      COREUTILS
  49. Like with OCFS (Release 1), do I need to use o_direct enabled tools to perform cp, mv, tar, etc.?
      No. OCFS2 does not need the o_direct enabled tools. The file system allows processes to open files in both o_direct and buffered mode concurrently.

      TROUBLESHOOTING

Posted 2006-08-31 22:01
# How do I enable and disable filesystem tracing?
To list all the debug bits along with their statuses, do:

        # debugfs.ocfs2 -l

To enable tracing the bit SUPER, do:

        # debugfs.ocfs2 -l SUPER allow

To disable tracing the bit SUPER, do:

        # debugfs.ocfs2 -l SUPER off

To totally turn off tracing the SUPER bit, as in, turn off tracing even if some other bit is enabled for the same, do:

        # debugfs.ocfs2 -l SUPER deny

To enable heartbeat tracing, do:

        # debugfs.ocfs2 -l HEARTBEAT ENTRY EXIT allow

To disable heartbeat tracing, do:

        # debugfs.ocfs2 -l HEARTBEAT off ENTRY EXIT deny

# How do I get a list of filesystem locks and their statuses?
OCFS2 1.0.9+ has this feature. To get this list, do:

    * Mount debugfs at /debug.

              # mount -t debugfs debugfs /debug

    * Dump the locks.

              # echo "fs_locks" | debugfs.ocfs2 /dev/sdX >/tmp/fslocks

# How do I read the fs_locks output?
Let's look at a sample output:

        Lockres: M000000000000000006672078b84822  Mode: Protected Read
        Flags: Initialized Attached
        RO Holders: 0  EX Holders: 0
        Pending Action: None  Pending Unlock Action: None
        Requested Mode: Protected Read  Blocking Mode: Invalid

First thing to note is the Lockres, which is the lockname. The dlm identifies resources using locknames. A lockname is a combination of a lock type (S superblock, M metadata, D filedata, R rename, W readwrite), inode number and generation.
To get the inode number and generation from the lockname, do:

        #echo "stat " | debugfs.ocfs2 -n /dev/sdX
        Inode: 419616   Mode: 0666   Generation: 2025343010 (0x78b84822)
        ....

To map the lockname to a directory entry, do:

        # echo "locate " | debugfs.ocfs2 -n /dev/sdX
        419616  /linux-2.6.15/arch/i386/kernel/semaphore.c

One could also provide the inode number instead of the lockname.

        # echo "locate <419616>" | debugfs.ocfs2 -n /dev/sdX
        419616  /linux-2.6.15/arch/i386/kernel/semaphore.c

To get a lockname from a directory entry, do:

        # echo "encode /linux-2.6.15/arch/i386/kernel/semaphore.c" | debugfs.ocfs2 -n /dev/sdX
        M000000000000000006672078b84822 D000000000000000006672078b84822 W000000000000000006672078b84822

The first is the Metadata lock, then Data lock and last ReadWrite lock for the same resource.

The DLM supports 3 lock modes: NL no lock, PR protected read and EX exclusive.

If you have a dlm hang, the resource to look for would be one with the "Busy" flag set.

The next step would be to query the dlm for the lock resource.

Note: The dlm debugging is still a work in progress.

To do dlm debugging, first one needs to know the dlm domain, which matches the volume UUID.

        # echo "stats" | debugfs.ocfs2 -n /dev/sdX | grep UUID: | while read a b ; do echo $b ; done
        82DA8137A49A47E4B187F74E09FBBB4B

Then do:

        # echo R dlm_domain lockname > /proc/fs/ocfs2_dlm/debug

For example:

        # echo R 82DA8137A49A47E4B187F74E09FBBB4B M000000000000000006672078b84822 > /proc/fs/ocfs2_dlm/debug
        # dmesg | tail
        struct dlm_ctxt: 82DA8137A49A47E4B187F74E09FBBB4B, node=79, key=965960985
        lockres: M000000000000000006672078b84822, owner=75, state=0 last used: 0, on purge list: no
          granted queue:
            type=3, conv=-1, node=79, cookie=11673330234144325711, ast=(empty=y,pend=n), bast=(empty=y,pend=n)
          converting queue:
          blocked queue:

It shows that the lock is mastered by node 75 and that node 79 has been granted a PR lock on the resource.

This is just to give a flavor of dlm debugging.

LIMITS
# Is there a limit to the number of subdirectories in a directory?
Yes. OCFS2 currently allows up to 32000 subdirectories. While this limit could be increased, we will not be doing it till we implement some kind of efficient name lookup (htree, etc.).
# Is there a limit to the size of an ocfs2 file system?
Yes, current software addresses block numbers with 32 bits. So the file system device is limited to (2 ^ 32) * blocksize (see mkfs -b). With a 4KB block size this amounts to a 16TB file system. This block addressing limit will be relaxed in future software. At that point the limit becomes addressing clusters of 1MB each with 32 bits which leads to a 4PB file system.

SYSTEM FILES
# What are system files?
System files are used to store standard filesystem metadata like bitmaps, journals, etc. Storing this information in files in a directory allows OCFS2 to be extensible. These system files can be accessed using debugfs.ocfs2. To list the system files, do:

        # echo "ls -l //" | debugfs.ocfs2 -n /dev/sdX
                18        16       1      2  .
                18        16       2      2  ..
                19        24       10     1  bad_blocks
                20        32       18     1  global_inode_alloc
                21        20       8      1  slot_map
                22        24       9      1  heartbeat
                23        28       13     1  global_bitmap
                24        28       15     2  orphan_dir:0000
                25        32       17     1  extent_alloc:0000
                26        28       16     1  inode_alloc:0000
                27        24       12     1  journal:0000
                28        28       16     1  local_alloc:0000
                29        3796     17     1  truncate_log:0000

The first column lists the block number.
# Why do some files have numbers at the end?
There are two types of system files, global and local. Global files are for all the nodes, while local files, like journal:0000, are node specific. The set of local files used by a node is determined by the slot mapping of that node. The number at the end of the system file name is the slot#. To list the slot maps, do:

        # echo "slotmap" | debugfs.ocfs2 -n /dev/sdX
               Slot#   Node#
            0      39
                   1      40
            2      41
                   3      42

HEARTBEAT
# How does the disk heartbeat work?
Every node writes every two secs to its block in the heartbeat system file. The block offset is equal to its global node number. So node 0 writes to the first block, node 1 to the second, etc. All the nodes also read the heartbeat sysfile every two secs. As long as the timestamp is changing, that node is deemed alive.
# When is a node deemed dead?
An active node is deemed dead if it does not update its timestamp for O2CB_HEARTBEAT_THRESHOLD (default=7) loops. Once a node is deemed dead, the surviving node which manages to cluster lock the dead node's journal, recovers it by replaying the journal.
# What about self fencing?
A node self-fences if it fails to update its timestamp for ((O2CB_HEARTBEAT_THRESHOLD - 1) * 2) secs. The [o2hb-xx] kernel thread, after every timestamp write, sets a timer to panic the system after that duration. If the next timestamp is written within that duration, as it should be, it first cancels that timer before setting up a new one. This way it ensures the system will self-fence if, for some reason, the [o2hb-xx] kernel thread is unable to update the timestamp and the node is thus deemed dead by the other nodes in the cluster.
# How can one change the parameter value of O2CB_HEARTBEAT_THRESHOLD?
This parameter value can be changed by adding it to /etc/sysconfig/o2cb and RESTARTING the O2CB cluster. This value should be the SAME on ALL the nodes in the cluster.
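As a sketch, the line added to /etc/sysconfig/o2cb would look like the following (the value 31 is just an example; see the next answer for how to choose it), after which the cluster must be stopped and started on every node:

        O2CB_HEARTBEAT_THRESHOLD=31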
# What should one set O2CB_HEARTBEAT_THRESHOLD to?
It should be set to the timeout value of the io layer. Most multipath solutions have a timeout ranging from 60 secs to 120 secs. For 60 secs, set it to 31. For 120 secs, set it to 61.

        O2CB_HEARTBEAT_THRESHOLD = (((timeout in secs) / 2) + 1)

# How does one check the current active O2CB_HEARTBEAT_THRESHOLD value?

        # cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold
        7

# What if a node umounts a volume?
During umount, the node will broadcast to all the nodes that have mounted that volume to drop that node from its node maps. As the journal is shutdown before this broadcast, any node crash after this point is ignored as there is no need for recovery.
# I encounter "Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing" whenever I run a heavy io load?
We have encountered a bug with the default CFQ io scheduler which causes a process doing heavy io to temporarily starve out other processes. While this is not fatal for most environments, it is for OCFS2, as we expect the hb thread to read/write the hb area at least once every 12 secs (default). A bug with the fix has been filed with Red Hat, which is expected to include the fix in the RHEL4 U4 release. SLES9 SP3 2.6.5-7.257 includes this fix. For the latest, refer to the tracker bug filed on bugzilla. Till this issue is resolved, one is advised to use the DEADLINE io scheduler. To use it, add "elevator=deadline" to the kernel command line as follows:

    * For SLES9, edit the command line in /boot/grub/menu.lst.

      title Linux 2.6.5-7.244-bigsmp (with deadline)
              kernel (hd0,4)/boot/vmlinuz-2.6.5-7.244-bigsmp root=/dev/sda5
                      vga=0x314 selinux=0 splash=silent resume=/dev/sda3 elevator=deadline showopts console=tty0 console=ttyS0,115200 noexec=off
              initrd (hd0,4)/boot/initrd-2.6.5-7.244-bigsmp

    * For RHEL4, edit the command line in /boot/grub/grub.conf:

      title Red Hat Enterprise Linux AS (2.6.9-22.EL) (with deadline)
              root (hd0,0)
              kernel /vmlinuz-2.6.9-22.EL ro root=LABEL=/ console=ttyS0,115200 console=tty0 elevator=deadline noexec=off
              initrd /initrd-2.6.9-22.EL.img

To see the current kernel command line, do:

        # cat /proc/cmdline

QUORUM AND FENCING
# What is a quorum?
A quorum is a designation given to a group of nodes in a cluster which are still allowed to operate on shared storage. It comes up when there is a failure in the cluster which breaks the nodes up into groups which can communicate in their groups and with the shared storage but not between groups.
# How does OCFS2's cluster service define a quorum?
The quorum decision is made by a single node based on the number of other nodes that are considered alive by heartbeating and the number of other nodes that are reachable via the network.
A node has quorum when:

    * it sees an odd number of heartbeating nodes and has network connectivity to more than half of them.
      OR,
    * it sees an even number of heartbeating nodes and has network connectivity to at least half of them *and* has connectivity to the heartbeating node with the lowest node number.

# What is fencing?
Fencing is the act of forcefully removing a node from a cluster. A node with OCFS2 mounted will fence itself when it realizes that it doesn't have quorum in a degraded cluster. It does this so that other nodes won't get stuck trying to access its resources. Currently OCFS2 will panic the machine when it realizes it has to fence itself off from the cluster. As described in the quorum question above, it will do this when it sees more nodes heartbeating than it has connectivity to and fails the quorum test.
# How does a node decide that it has connectivity with another?
When a node sees another come to life via heartbeating it will try and establish a TCP connection to that newly live node. It considers that other node connected as long as the TCP connection persists and the connection is not idle for 10 seconds. Once that TCP connection is closed or idle it will not be reestablished until heartbeat thinks the other node has died and come back alive.
# How long does the quorum process take?
First a node will realize that it doesn't have connectivity with another node. This can happen immediately if the connection is closed but can take a maximum of 10 seconds of idle time. Then the node must wait long enough to give heartbeating a chance to declare the node dead. It does this by waiting two iterations longer than the number of iterations needed to consider a node dead (see the Heartbeat section of this FAQ). The current default of 7 iterations of 2 seconds results in waiting for 9 iterations or 18 seconds. By default, then, a maximum of 28 seconds can pass from the time a network fault occurs until a node fences itself.
# How can one prevent a node from panicking when one shuts down the other node in a 2-node cluster?
This typically means that the network is shutting down before all the OCFS2 volumes are umounted. Ensure the ocfs2 init script is enabled. This script ensures that the OCFS2 volumes are umounted before the network is shut down. To check whether the service is enabled, do:

               # chkconfig --list ocfs2
               ocfs2     0:off   1:off   2:on    3:on    4:on    5:on    6:off

# How does one list out the startup and shutdown ordering of the OCFS2 related services?

    * To list the startup order for runlevel 3 on RHEL4, do:

              # cd /etc/rc3.d
              # ls S*ocfs2* S*o2cb* S*network*
              S10network  S24o2cb  S25ocfs2

    * To list the shutdown order on RHEL4, do:

              # cd /etc/rc6.d
              # ls K*ocfs2* K*o2cb* K*network*
              K19ocfs2  K20o2cb  K90network

    * To list the startup order for runlevel 3 on SLES9, do:

              # cd /etc/init.d/rc3.d
              # ls S*ocfs2* S*o2cb* S*network*
              S05network  S07o2cb  S08ocfs2

    * To list the shutdown order on SLES9, do:

              # cd /etc/init.d/rc3.d
              # ls K*ocfs2* K*o2cb* K*network*
              K14ocfs2  K15o2cb  K17network

Please note that the default ordering in the ocfs2 scripts only includes the network service and not any shared-device specific service, like iscsi. If one is using iscsi or any shared device requiring a service to be started and shut down, please ensure that that service starts before and shuts down after the ocfs2 init service.

NOVELL SLES9
# Why are OCFS2 packages for SLES9 not made available on oss.oracle.com?
OCFS2 packages for SLES9 are available directly from Novell as part of the kernel. The same is true for the various Asianux distributions and for Ubuntu. As OCFS2 is now part of the mainline kernel, we expect more distributions to bundle the product with the kernel.
# What versions of OCFS2 are available with SLES9 and how do they match with the Red Hat versions available on oss.oracle.com?
As Novell and Oracle ship OCFS2 on different schedules, the package versions do not match. We expect this to resolve itself over time as the number of patch fixes decreases. Novell is shipping two SLES9 releases, viz., SP2 and SP3.

    * The latest kernel with the SP2 release is 2.6.5-7.202.7. It ships with OCFS2 1.0.8.
    * The latest kernel with the SP3 release is 2.6.5-7.257. It ships with OCFS2 1.2.1.

RELEASE 1.2
# What is new in OCFS2 1.2?
OCFS2 1.2 has two new features:

    * It is endian-safe. With this release, one can mount the same volume concurrently on x86, x86-64, ia64 and big endian architectures ppc64 and s390x.
    * It supports readonly mounts. The fs uses this feature to auto remount ro when it encounters on-disk corruption (instead of panicking).

# Do I need to re-make the volume when upgrading?
No. OCFS2 1.2 is fully on-disk compatible with 1.0.
# Do I need to upgrade anything else?
Yes, the tools need to be upgraded to ocfs2-tools 1.2. ocfs2-tools 1.0 will not work with OCFS2 1.2, nor will the 1.2 tools work with the 1.0 modules.

UPGRADE TO THE LATEST RELEASE
# How do I upgrade to the latest release?

    * Download the latest ocfs2-tools and ocfs2console for the target platform and the appropriate ocfs2 module package for the kernel version, flavor and architecture. (For more, refer to the "Download and Install" section above.)

    * Umount all OCFS2 volumes.

              # umount -at ocfs2

    * Shutdown the cluster and unload the modules.

              # /etc/init.d/o2cb offline
              # /etc/init.d/o2cb unload

    * If required, upgrade the tools and console.

              # rpm -Uvh ocfs2-tools-1.2.1-1.i386.rpm ocfs2console-1.2.1-1.i386.rpm

    * Upgrade the module.

              # rpm -Uvh ocfs2-2.6.9-22.0.1.ELsmp-1.2.2-1.i686.rpm

    * Ensure init services ocfs2 and o2cb are enabled.

              # chkconfig --add o2cb
              # chkconfig --add ocfs2

    * To check whether the services are enabled, do:

              # chkconfig --list o2cb
              o2cb      0:off   1:off   2:on    3:on    4:on    5:on    6:off
              # chkconfig --list ocfs2
              ocfs2     0:off   1:off   2:on    3:on    4:on    5:on    6:off

    * At this stage one could either reboot the node or simply restart the cluster and mount the volume.

# Can I do a rolling upgrade from 1.0.x/1.2.x to 1.2.2?
Rolling upgrade to 1.2.2 is not recommended. Shutdown the cluster on all nodes before upgrading the nodes.
# After upgrade I am getting the following error on mount "mount.ocfs2: Invalid argument while mounting /dev/sda6 on /ocfs".
Do "dmesg | tail". If you see the error:

ocfs2_parse_options:523 ERROR: Unrecognized mount option "heartbeat=local" or missing value

it means that you are trying to use the 1.2 tools and 1.0 modules. Ensure that you have unloaded the 1.0 modules and installed and loaded the 1.2 modules. Use modinfo to determine the version of the module installed and/or loaded.
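For example, a rough way to check the version of the installed module (the output is illustrative):

        # modinfo ocfs2 | grep -i version
        version:        1.2.2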
# The cluster fails to load. What do I do?
Check "demsg | tail" for any relevant errors. One common error is as follows:

SELinux: initialized (dev configfs, type configfs), not configured for labeling audit(1139964740.184:2): avc:  denied  { mount } for  ...

The above error indicates that you have SELinux activated. A bug in SELinux does not allow configfs to mount. Disable SELinux by setting "SELINUX=disabled" in /etc/selinux/config. The change takes effect on reboot.
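To verify the setting (a sketch; the output is illustrative):

        # grep ^SELINUX= /etc/selinux/config
        SELINUX=disabled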

[ Last edited by nntp on 2006-09-01 00:00 ]

Posted 2006-08-31 22:02
PROCESSES
# List and describe all OCFS2 threads?

[o2net]
    One per node. Is a workqueue thread started when the cluster is brought online and stopped when offline. It handles the network communication for all threads. It gets the list of active nodes from the o2hb thread and sets up tcp/ip communication channels with each active node. It sends regular keepalive packets to detect any interruption on the channels.
[user_dlm]
    One per node. Is a workqueue thread started when dlmfs is loaded and stopped on unload. (dlmfs is an in-memory file system which allows user space processes to access the dlm in kernel to lock and unlock resources.) Handles lock downconverts when requested by other nodes.
[ocfs2_wq]
    One per node. Is a workqueue thread started when ocfs2 module is loaded and stopped on unload. Handles blockable file system tasks like truncate log flush, orphan dir recovery and local alloc recovery, which involve taking dlm locks. Various code paths queue tasks to this thread. For example, ocfs2rec queues orphan dir recovery so that while the task is kicked off as part of recovery, its completion does not affect the recovery time.
[o2hb-14C29A7392]
    One per heartbeat device. Is a kernel thread started when the heartbeat region is populated in configfs and stopped when it is removed. It writes every 2 secs to its block in the heartbeat region to indicate to the other nodes that this node is alive. It also reads the region to maintain a nodemap of live nodes. It notifies o2net and the dlm of any changes in the nodemap.
[ocfs2vote-0]
    One per mount. Is a kernel thread started when a volume is mounted and stopped on umount. It downgrades locks when requested by other nodes in response to blocking ASTs (BASTs). It also fixes up the dentry cache in response to files unlinked or renamed on other nodes.
[dlm_thread]
    One per dlm domain. Is a kernel thread started when a dlm domain is created and stopped when destroyed. This is the core dlm which maintains the list of lock resources and handles the cluster locking infrastructure.
[dlm_reco_thread]
    One per dlm domain. Is a kernel thread which handles dlm recovery whenever a node dies. If the node is the dlm recovery master, it remasters all the locks owned by the dead node.
[dlm_wq]
    One per dlm domain. Is a workqueue thread. o2net queues dlm tasks on this thread.
[kjournald]
    One per mount. Is used as OCFS2 uses JBD for journalling.
[ocfs2cmt-0]
    One per mount. Is a kernel thread started when a volume is mounted and stopped on umount. Works in conjunction with kjournald.
[ocfs2rec-0]
    Is started whenever another node needs to be recovered. This could be either on mount, when it discovers a dirty journal, or during operation, when hb detects a dead node. ocfs2rec handles the file system recovery and it runs after the dlm has finished its recovery.
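To see which of these threads are currently running on a node, one could do something like the following (a sketch; the exact thread names and pids will differ):

        # ps -e -o pid,comm | egrep "o2net|o2hb|user_dlm|ocfs2|dlm_|kjournald"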

Posted 2006-09-01 00:44
Everyone: I have merged the main threads in this forum discussing OCFS, OCFS2, ASM and raw devices into this one thread; please continue the discussion here.

Posted 2006-09-01 03:05
If you are deploying RAC, need to get it done quickly and lack experience in this area, Oracle's "Oracle Validated Configurations" is the best helper you can get.
When Oracle first introduced OVC I thought it was extremely good: even for someone very familiar with Linux/Oracle/RAC, it is a tool that greatly reduces the workload.

Friends who are unsure of where to start and are being pressed by deadlines can follow OVC to the letter to get the job done, and those who have already built a RAC and run into faults can also use OVC as a troubleshooting reference.

Oracle Validated Configurations
http://www.oracle.com/technology ... urations/index.html

Posted 2006-09-01 03:46
http://forums.oracle.com/forums/ ... 337838&#1337838
A very worthwhile Q&A discussion on the Oracle Forum; my view is basically in line with the later posters there, in particular the gentleman who mentioned the convenient conversion between ASM and RAW.
Also, regarding my earlier reply in this thread to a member's question about where to place the voting disk and OCR: I did not go into the reasons much at the time, and that discussion touches on them briefly as well.

Posted 2006-09-01 10:07
Originally posted by nntp on 2006-08-31 18:01:

    Single instance or RAC? If it is RAC, ASM can handle this situation even after a power loss. Do you subscribe to Oracle Magazine? An issue late last year covered a similar case.

I am quite interested in that write-up. Could you provide a URL?

If one has to recover from this, I think it is rather difficult; after all, there is not much material available on ASM's internal I/O mechanism.

[ Last edited by vecentli on 2006-09-01 10:10 ]

Posted 2006-09-01 12:01
Could Red Hat's GFS and IBM's GPFS be discussed here together as well?
Could someone compare GFS, GPFS, OCFS and OCFS2?
Use cases, reliability, availability, performance, stability, and so on.

Posted 2006-09-01 16:13
GFS and OCFS2 are the same kind of thing; OCFS and GPFS are not, and OCFS is unlike any of the others.

GFS and OCFS2 make it possible for multiple nodes to access the same location on shared storage. They synchronize the filesystem caches of the different nodes over an ordinary network, use cluster locks to rule out the possibility of applications on different nodes corrupting a file by operating on it concurrently, and exchange heartbeat state between nodes over an ordinary network. That is where they are functionally similar. In terms of maturity and performance, OCFS2 is still far from being in the same league as GFS: anywhere OCFS2 can be used, GFS can be used instead, but the reverse is not true. In an HA cluster environment GFS plays the role of a "cheap, cut-down" PolyServe. At least for now, my personal view is that GFS leads OCFS2 by roughly three years in technology, maturity, development investment and performance, and that gap may well widen further.

OCFS is Oracle-only, and it was the first release in which Oracle brought cluster filesystems into its development scope. As I said before, that release was never positioned as a general-purpose cluster filesystem, and within the Oracle user community negative opinions about its quality, performance and stability were in the majority.

Even today, at the OCFS2 stage, the Oracle mailing lists and forums are full of complaints about OCFS2's quality, performance and reliability.

ASM is the new-generation storage management system that Oracle has adopted on Linux, HP-UX, Solaris and several other commercial high-end Unix platforms. In terms of its standing among Oracle's products, development investment, user base, and the tiers and domains it serves, the OCFS2 project simply cannot compare. Functionally, ASM is roughly equivalent to RAW + LVM. It also scales very well as data and access volumes grow: in real-world tests ASM's performance comes close to RAW, with a small overhead that is easy to understand given the extra volume-management layer. In the same linear-scaling tests, CLVM + OCFS2 performs noticeably worse than ASM and RAW. The day before yesterday a friend sent me the slides he presented at an annual meeting of a European high-energy physics laboratory; their IT department tallied it up, and across all the single-instance databases and clusters in the lab they now have more than 540 TB of data running on ASM. After heavy use and testing they are quite satisfied with how ASM has performed. Most of their systems are IA64 + Linux and AMD Opteron + Linux. If I find the time, I will post some of their tests and conclusions here.

[ Last edited by nntp on 2006-09-01 16:30 ]