wjemail posted on 2010-03-30 15:22

Red Hat NFS cluster problem

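I have a cluster problem that I have been fighting for several days without getting anywhere. I'd appreciate any pointers from the experts here.

Overview:
Two nodes share one SAN storage array. The OS is RHEL 5.3, and the NFS service is configured with the cluster software that ships with Red Hat. GFS is not configured.

cluster.conf is as follows: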
<?xml version="1.0"?>
<cluster alias="NFSCluster" config_version="123" name="NFSCluster">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="elgar" nodeid="1" votes="1">
                </clusternode>
                <clusternode name="chopin" nodeid="2" votes="1">
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
        </fencedevices>
        <rm log_level="7">
                <failoverdomains>
                        <failoverdomain name="NFSCDomain" ordered="0" restricted="0">
                                <failoverdomainnode name="elgar" priority="1"/>
                                <failoverdomainnode name="chopin" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="10.217.212.238" monitor_link="1"/>
                        <fs device="/dev/mapper/nfscserver-proj" force_fsck="0" force_unmount="1" fsid="53751" fstype="ext3" mountpoint="/proj" name="fs-proj" options="usrquota,grpquota" self_fence="1"/>
                        <fs device="/dev/mapper/nfscserver-export" force_fsck="0" force_unmount="1" fsid="38296" fstype="ext3" mountpoint="/export" name="fs-export" options="usrquota,grpquota" self_fence="1"/>
                        <fs device="/dev/mapper/nfscserver-alpha" force_fsck="0" force_unmount="1" fsid="25724" fstype="ext3" mountpoint="/alpha" name="fs-alpha" options="usrquota,grpquota" self_fence="1"/>
                        <fs device="/dev/mapper/nfscserver-sim" force_fsck="0" force_unmount="1" fsid="47898" fstype="ext3" mountpoint="/sim" name="fs-sim" options="usrquota,grpquota" self_fence="1"/>
                        <fs device="/dev/mapper/nfscserver-cad" force_fsck="0" force_unmount="1" fsid="10950" fstype="ext3" mountpoint="/cad" name="fs-cad" options="usrquota,grpquota" self_fence="1"/>
                        <nfsexport name="NFS-E2K"/>
                        <nfsclient name="nfsclt-proj" options="rw" path="/proj" target="*"/>
                        <nfsclient name="nfsclt-sim" options="rw" path="/sim" target="*"/>
                        <nfsclient name="nfsclt-cad" options="rw" path="/cad" target="*"/>
                        <nfsclient name="nfsclt-home1" options="rw" path="/export/home1" target="*"/>
                        <nfsclient name="nfsclt-alpha" options="rw" path="/alpha" target="*"/>
                </resources>
                <service autostart="1" domain="NFSCDomain" name="NFCServices" recovery="relocate" nfslock="1">
                        <ip ref="10.217.212.238">
                                <fs ref="fs-export">
                                        <nfsexport ref="NFS-E2K">
                                                <nfsclient ref="nfsclt-home1"/>
                                        </nfsexport>
                                </fs>
                                <fs ref="fs-alpha">
                                        <nfsexport ref="NFS-E2K">
                                                <nfsclient ref="nfsclt-alpha"/>
                                        </nfsexport>
                                </fs>
                                <fs ref="fs-sim">
                                        <nfsexport ref="NFS-E2K">
                                                <nfsclient ref="nfsclt-sim"/>
                                        </nfsexport>
                                </fs>
                                <fs ref="fs-cad">
                                        <nfsexport ref="NFS-E2K">
                                                <nfsclient ref="nfsclt-cad"/>
                                        </nfsexport>
                                </fs>
                                <fs ref="fs-proj">
                                        <nfsexport ref="NFS-E2K">
                                                <nfsclient ref="nfsclt-proj"/>
                                        </nfsexport>
                                </fs>
                        </ip>
                </service>
        </rm>
</cluster>

With self_fence set to 0, failover does not work properly. The corresponding log (from the active node, which is being rebooted):
Mar 26 20:25:41 chopin rpc.statd: unlink (/tmp/statd-apollo.z11831/sm.bak/10.217.212.222): Permission denied
Mar 26 20:25:41 chopin rpc.statd: unlink (/tmp/statd-apollo.z11831/sm.bak/10.217.212.254): Permission denied
Mar 26 20:25:41 chopin rpc.statd: unlink (/tmp/statd-apollo.z11831/sm.bak/10.217.212.207): Permission denied
Mar 26 20:25:43 chopin rpc.statd: Caught signal 15, un-registering and exiting.
Mar 26 20:25:43 chopin clurgmgrd: : <err> 'umount /export' failed, error=0

With self_fence set to 1, failover does not work properly either. The corresponding log (from the active node, which is being rebooted):
Mar 26 20:25:41 chopin rpc.statd: unlink (/tmp/statd-apollo.z11831/sm.bak/10.217.212.254): Permission denied
Mar 26 20:25:41 chopin rpc.statd: unlink (/tmp/statd-apollo.z11831/sm.bak/10.217.212.207): Permission denied
Mar 26 20:25:43 chopin rpc.statd: Caught signal 15, un-registering and exiting.
Mar 26 20:25:43 chopin clurgmgrd: : <err> 'umount /export' failed, error=0
Mar 26 20:25:43 chopin clurgmgrd: : <alert> umount failed - REBOOTING
At this point, the log on the backup node is as follows:
Mar 29 12:26:47 elgar kernel: dlm: closing connection to node 2
Mar 29 12:26:47 elgar fenced: chopin not a cluster member after 0 sec post_fail_delay
Mar 29 12:26:47 elgar openais:         r(0) ip(10.217.212.236)
Mar 29 12:26:47 elgar clurgmgrd: <info> State change: chopin DOWN
Mar 29 12:26:47 elgar fenced: fencing node "chopin"
Mar 29 12:26:47 elgar openais: Members Left:
Mar 29 12:26:47 elgar fenced: fence "chopin" failed
Mar 29 12:26:47 elgar openais:         r(0) ip(10.217.212.237)
Mar 29 12:26:47 elgar openais: Members Joined:
Mar 29 12:26:47 elgar openais: CLM CONFIGURATION CHANGE
Mar 29 12:26:47 elgar openais: New Configuration:
Mar 29 12:26:47 elgar openais:         r(0) ip(10.217.212.236)
Mar 29 12:26:47 elgar openais: Members Left:
Mar 29 12:26:47 elgar openais: Members Joined:
Mar 29 12:26:47 elgar openais: This node is within the primary component and will provide service.
Mar 29 12:26:47 elgar openais: entering OPERATIONAL state.
Mar 29 12:26:47 elgar openais: got nodejoin message 10.217.212.236
Mar 29 12:26:47 elgar openais: got joinlist message from node 1
Mar 29 12:26:48 elgar kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
Mar 29 12:26:52 elgar fenced: fencing node "chopin"
Mar 29 12:26:52 elgar fenced: fence "chopin" failed
Mar 29 12:26:53 elgar mountd: export request from 10.217.212.230 failed.
Mar 29 12:26:53 elgar last message repeated 3 times
Mar 29 12:26:57 elgar fenced: fencing node "chopin"
Mar 29 12:26:57 elgar fenced: fence "chopin" failed
Mar 29 12:27:02 elgar fenced: fencing node "chopin"
Mar 29 12:27:02 elgar fenced: fence "chopin" failed

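Could anyone tell me:
1. Why can't some of the filesystems be unmounted? Is there any way to unmount them?
2. Why does the backup node try to fence the other node? This happens whether or not we have a fencing device configured.

Thanks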
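
On question 2, one thing that may be worth checking: in the cluster.conf posted above, `<fencedevices>` is empty and neither `<clusternode>` has a `<fence>` block, which would be consistent with the repeated `fence "chopin" failed` messages. The sketch below is only an illustrative check, not part of the Red Hat tooling: the function name `nodes_without_fencing` and the trimmed `SAMPLE` string are made up for this example, and it assumes the usual `clusternode/fence/method/device` layout.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the posted configuration (illustrative only).
SAMPLE = """<cluster name="NFSCluster" config_version="123">
  <clusternodes>
    <clusternode name="elgar" nodeid="1" votes="1"/>
    <clusternode name="chopin" nodeid="2" votes="1"/>
  </clusternodes>
  <fencedevices/>
</cluster>"""


def nodes_without_fencing(conf_xml):
    """Return the names of cluster nodes that reference no declared fence device."""
    root = ET.fromstring(conf_xml)
    # Device names declared under <fencedevices>.
    declared = {d.get("name") for d in root.findall("./fencedevices/fencedevice")}
    missing = []
    for node in root.findall("./clusternodes/clusternode"):
        # Device names this node refers to in its <fence><method> blocks.
        used = {d.get("name") for d in node.findall("./fence/method/device")}
        if not (used & declared):
            missing.append(node.get("name"))
    return missing


if __name__ == "__main__":
    # Both nodes are reported, because no fence device is declared at all.
    print(nodes_without_fencing(SAMPLE))
```

To run it against the real file, read `/etc/cluster/cluster.conf` into a string and pass that to the function instead of `SAMPLE`.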

michael1983 posted on 2010-03-30 16:43

Clustering NFS?

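swjtuzwf posted on 2010-07-09 09:42

A two-node setup with heartbeat? I once saw someone try that, but it didn't work. I'd suggest finding some documentation and following it step by step, or using third-party HA software.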