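Red Hat NFS cluster problem
I have a cluster problem that I have been stuck on for several days, and I would appreciate advice from someone more experienced. Overview:
Two nodes share one SAN storage array. The OS is RHEL 5.3, and the NFS service is configured with the cluster software that ships with Red Hat. GFS is not configured.
cluster.conf is as follows: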
<?xml version="1.0"?>
<cluster alias="NFSCluster" config_version="123" name="NFSCluster">
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="elgar" nodeid="1" votes="1">
    </clusternode>
    <clusternode name="chopin" nodeid="2" votes="1">
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
  </fencedevices>
  <rm log_level="7">
    <failoverdomains>
      <failoverdomain name="NFSCDomain" ordered="0" restricted="0">
        <failoverdomainnode name="elgar" priority="1"/>
        <failoverdomainnode name="chopin" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <ip address="10.217.212.238" monitor_link="1"/>
      <fs device="/dev/mapper/nfscserver-proj" force_fsck="0" force_unmount="1" fsid="53751" fstype="ext3" mountpoint="/proj" name="fs-proj" options="usrquota,grpquota" self_fence="1"/>
      <fs device="/dev/mapper/nfscserver-export" force_fsck="0" force_unmount="1" fsid="38296" fstype="ext3" mountpoint="/export" name="fs-export" options="usrquota,grpquota" self_fence="1"/>
      <fs device="/dev/mapper/nfscserver-alpha" force_fsck="0" force_unmount="1" fsid="25724" fstype="ext3" mountpoint="/alpha" name="fs-alpha" options="usrquota,grpquota" self_fence="1"/>
      <fs device="/dev/mapper/nfscserver-sim" force_fsck="0" force_unmount="1" fsid="47898" fstype="ext3" mountpoint="/sim" name="fs-sim" options="usrquota,grpquota" self_fence="1"/>
      <fs device="/dev/mapper/nfscserver-cad" force_fsck="0" force_unmount="1" fsid="10950" fstype="ext3" mountpoint="/cad" name="fs-cad" options="usrquota,grpquota" self_fence="1"/>
      <nfsexport name="NFS-E2K"/>
      <nfsclient name="nfsclt-proj" options="rw" path="/proj" target="*"/>
      <nfsclient name="nfsclt-sim" options="rw" path="/sim" target="*"/>
      <nfsclient name="nfsclt-cad" options="rw" path="/cad" target="*"/>
      <nfsclient name="nfsclt-home1" options="rw" path="/export/home1" target="*"/>
      <nfsclient name="nfsclt-alpha" options="rw" path="/alpha" target="*"/>
    </resources>
    <service autostart="1" domain="NFSCDomain" name="NFCServices" recovery="relocate" nfslock="1">
      <ip ref="10.217.212.238">
        <fs ref="fs-export">
          <nfsexport ref="NFS-E2K">
            <nfsclient ref="nfsclt-home1"/>
          </nfsexport>
        </fs>
        <fs ref="fs-alpha">
          <nfsexport ref="NFS-E2K">
            <nfsclient ref="nfsclt-alpha"/>
          </nfsexport>
        </fs>
        <fs ref="fs-sim">
          <nfsexport ref="NFS-E2K">
            <nfsclient ref="nfsclt-sim"/>
          </nfsexport>
        </fs>
        <fs ref="fs-cad">
          <nfsexport ref="NFS-E2K">
            <nfsclient ref="nfsclt-cad"/>
          </nfsexport>
        </fs>
        <fs ref="fs-proj">
          <nfsexport ref="NFS-E2K">
            <nfsclient ref="nfsclt-proj"/>
          </nfsexport>
        </fs>
      </ip>
    </service>
  </rm>
</cluster>
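Since the service block refers to shared resources only by ref, one quick sanity check is that every ref actually resolves to something under resources. The following is only a rough sketch: the function name dangling_refs is made up for illustration, and it assumes that an ip ref is matched against the address attribute and every other ref against name, which is how the config above is written.

```python
import xml.etree.ElementTree as ET

# Attribute that a ref="..." value is matched against, per resource type.
# Assumption taken from the posted config: <ip> is keyed by address,
# every other resource type by name.
KEY_ATTR = {"ip": "address"}


def dangling_refs(conf_xml):
    """Return (service, tag, ref) tuples for refs with no matching resource."""
    root = ET.fromstring(conf_xml)
    defined = set()
    for res in root.findall("./rm/resources/*"):
        defined.add((res.tag, res.get(KEY_ATTR.get(res.tag, "name"))))
    missing = []
    for service in root.findall("./rm/service"):
        for elem in service.iter():
            ref = elem.get("ref")
            if ref is not None and (elem.tag, ref) not in defined:
                missing.append((service.get("name"), elem.tag, ref))
    return missing


if __name__ == "__main__":
    # Usual location of the file on RHEL 5; adjust if yours differs.
    with open("/etc/cluster/cluster.conf", "rb") as fh:
        for service, tag, ref in dangling_refs(fh.read()):
            print("service %s: <%s ref=%r> is not defined" % (service, tag, ref))
```

An empty result only says the references are consistent; it says nothing about whether the resources themselves start or stop correctly.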
With self_fence set to 0, failover does not work properly. The corresponding log (from the active node, which is being rebooted):
Mar 26 20:25:41 chopin rpc.statd: unlink (/tmp/statd-apollo.z11831/sm.bak/10.217.212.222): Permission denied
Mar 26 20:25:41 chopin rpc.statd: unlink (/tmp/statd-apollo.z11831/sm.bak/10.217.212.254): Permission denied
Mar 26 20:25:41 chopin rpc.statd: unlink (/tmp/statd-apollo.z11831/sm.bak/10.217.212.207): Permission denied
Mar 26 20:25:43 chopin rpc.statd: Caught signal 15, un-registering and exiting.
Mar 26 20:25:43 chopin clurgmgrd: : <err> 'umount /export' failed, error=0
With self_fence set to 1, failover does not work properly either. The corresponding log (from the active node, which is being rebooted):
Mar 26 20:25:41 chopin rpc.statd: unlink (/tmp/statd-apollo.z11831/sm.bak/10.217.212.254): Permission denied
Mar 26 20:25:41 chopin rpc.statd: unlink (/tmp/statd-apollo.z11831/sm.bak/10.217.212.207): Permission denied
Mar 26 20:25:43 chopin rpc.statd: Caught signal 15, un-registering and exiting.
Mar 26 20:25:43 chopin clurgmgrd: : <err> 'umount /export' failed, error=0
Mar 26 20:25:43 chopin clurgmgrd: : <alert> umount failed - REBOOTING
At that point, the log on the standby node is as follows:
Mar 29 12:26:47 elgar kernel: dlm: closing connection to node 2
Mar 29 12:26:47 elgar fenced: chopin not a cluster member after 0 sec post_fail_delay
Mar 29 12:26:47 elgar openais: r(0) ip(10.217.212.236)
Mar 29 12:26:47 elgar clurgmgrd: <info> State change: chopin DOWN
Mar 29 12:26:47 elgar fenced: fencing node "chopin"
Mar 29 12:26:47 elgar openais: Members Left:
Mar 29 12:26:47 elgar fenced: fence "chopin" failed
Mar 29 12:26:47 elgar openais: r(0) ip(10.217.212.237)
Mar 29 12:26:47 elgar openais: Members Joined:
Mar 29 12:26:47 elgar openais: CLM CONFIGURATION CHANGE
Mar 29 12:26:47 elgar openais: New Configuration:
Mar 29 12:26:47 elgar openais: r(0) ip(10.217.212.236)
Mar 29 12:26:47 elgar openais: Members Left:
Mar 29 12:26:47 elgar openais: Members Joined:
Mar 29 12:26:47 elgar openais: This node is within the primary component and will provide service.
Mar 29 12:26:47 elgar openais: entering OPERATIONAL state.
Mar 29 12:26:47 elgar openais: got nodejoin message 10.217.212.236
Mar 29 12:26:47 elgar openais: got joinlist message from node 1
Mar 29 12:26:48 elgar kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
Mar 29 12:26:52 elgar fenced: fencing node "chopin"
Mar 29 12:26:52 elgar fenced: fence "chopin" failed
Mar 29 12:26:53 elgar mountd: export request from 10.217.212.230 failed.
Mar 29 12:26:53 elgar last message repeated 3 times
Mar 29 12:26:57 elgar fenced: fencing node "chopin"
Mar 29 12:26:57 elgar fenced: fence "chopin" failed
Mar 29 12:27:02 elgar fenced: fencing node "chopin"
Mar 29 12:27:02 elgar fenced: fence "chopin" failed
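The repeated fence "chopin" failed lines are at least consistent with the posted cluster.conf, which has an empty fencedevices section and no fence block under either clusternode. As far as I understand, fenced tries to fence any node that drops out of the cluster whether or not a device is defined, and with nothing defined the attempt can only fail and be retried. A minimal sketch for spotting that situation in a config; the function name fencing_gaps is invented for this example.

```python
import xml.etree.ElementTree as ET


def fencing_gaps(conf_xml):
    """Return a list of human-readable problems with the fencing setup."""
    root = ET.fromstring(conf_xml)
    problems = []

    # Names of all fence devices declared at cluster level.
    devices = set(d.get("name") for d in root.findall("./fencedevices/fencedevice"))
    if not devices:
        problems.append("no <fencedevice> defined")

    # Every node should reference at least one declared device.
    for node in root.findall("./clusternodes/clusternode"):
        used = [d.get("name") for d in node.findall("./fence/method/device")]
        if not used:
            problems.append("node %s has no fence method" % node.get("name"))
        for name in used:
            if name not in devices:
                problems.append(
                    "node %s references unknown fence device %s"
                    % (node.get("name"), name)
                )
    return problems


if __name__ == "__main__":
    # Usual location of the file on RHEL 5; adjust if yours differs.
    with open("/etc/cluster/cluster.conf", "rb") as fh:
        for line in fencing_gaps(fh.read()):
            print(line)
```

Run against the configuration posted above, this should report one cluster-level problem and one per node. It only checks that the XML is wired up, not that the fence agent can actually reach the device.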
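Could someone tell me:
1. Why can some of the filesystems not be unmounted? Is there a way to unmount them?
2. Why does the standby node try to fence? This happens whether or not we have a fencing device configured.
Thanks.

NFS in a cluster? A two-node setup with heartbeat? I have seen someone try this before, but it did not succeed. I suggest finding some documentation and following it step by step, or using a third-party HA product.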
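On question 1, a common reason for umount to fail is that some process still has a file open, or its working directory set, under the mount point. Below is a minimal sketch for listing such processes, assuming a Linux-style /proc; the function name pids_using and the proc_root parameter are made up for illustration. It cannot see in-kernel users such as the kernel NFS server or lock manager, so an empty result does not prove the filesystem is idle.

```python
import os


def pids_using(mountpoint, proc_root="/proc"):
    """Map pid -> sorted paths under mountpoint that the process holds
    via its cwd, root, or open file descriptors."""
    mountpoint = os.path.realpath(mountpoint)
    prefix = mountpoint.rstrip("/") + "/"
    holders = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        base = os.path.join(proc_root, entry)
        links = [os.path.join(base, "cwd"), os.path.join(base, "root")]
        fd_dir = os.path.join(base, "fd")
        try:
            links += [os.path.join(fd_dir, fd) for fd in os.listdir(fd_dir)]
        except OSError:
            pass  # process exited, or we lack permission to look
        for link in links:
            try:
                target = os.readlink(link)
            except OSError:
                continue
            if target == mountpoint or target.startswith(prefix):
                holders.setdefault(int(entry), set()).add(target)
    return dict((pid, sorted(paths)) for pid, paths in holders.items())


if __name__ == "__main__":
    # Run as root on the active node before relocating the service.
    for pid, paths in sorted(pids_using("/export").items()):
        print(pid, ", ".join(paths))
```

If this comes back empty while umount still fails, the holder is more likely on the kernel side (NFS exports or locks) than a user process.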