I have a cluster problem that I have been stuck on for several days, and I would appreciate some expert advice.

Overview:
Two nodes share one SAN storage array. The OS is RHEL 5.3, and the NFS service is configured with the cluster software that ships with Red Hat. GFS is not configured.

cluster.conf is as follows:
<?xml version="1.0"?>
<cluster alias="NFSCluster" config_version="123" name="NFSCluster">
    <fence_daemon post_fail_delay="0" post_join_delay="3"/>
    <clusternodes>
        <clusternode name="elgar" nodeid="1" votes="1">
        </clusternode>
        <clusternode name="chopin" nodeid="2" votes="1">
        </clusternode>
    </clusternodes>
    <cman expected_votes="1" two_node="1"/>
    <fencedevices>
    </fencedevices>
    <rm log_level="7">
        <failoverdomains>
            <failoverdomain name="NFSCDomain" ordered="0" restricted="0">
                <failoverdomainnode name="elgar" priority="1"/>
                <failoverdomainnode name="chopin" priority="1"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <ip address="10.217.212.238" monitor_link="1"/>
            <fs device="/dev/mapper/nfscserver-proj" force_fsck="0" force_unmount="1" fsid="53751" fstype="ext3" mountpoint="/proj" name="fs-proj" options="usrquota,grpquota" self_fence="1"/>
            <fs device="/dev/mapper/nfscserver-export" force_fsck="0" force_unmount="1" fsid="38296" fstype="ext3" mountpoint="/export" name="fs-export" options="usrquota,grpquota" self_fence="1"/>
            <fs device="/dev/mapper/nfscserver-alpha" force_fsck="0" force_unmount="1" fsid="25724" fstype="ext3" mountpoint="/alpha" name="fs-alpha" options="usrquota,grpquota" self_fence="1"/>
            <fs device="/dev/mapper/nfscserver-sim" force_fsck="0" force_unmount="1" fsid="47898" fstype="ext3" mountpoint="/sim" name="fs-sim" options="usrquota,grpquota" self_fence="1"/>
            <fs device="/dev/mapper/nfscserver-cad" force_fsck="0" force_unmount="1" fsid="10950" fstype="ext3" mountpoint="/cad" name="fs-cad" options="usrquota,grpquota" self_fence="1"/>
            <nfsexport name="NFS-E2K"/>
            <nfsclient name="nfsclt-proj" options="rw" path="/proj" target="*"/>
            <nfsclient name="nfsclt-sim" options="rw" path="/sim" target="*"/>
            <nfsclient name="nfsclt-cad" options="rw" path="/cad" target="*"/>
            <nfsclient name="nfsclt-home1" options="rw" path="/export/home1" target="*"/>
            <nfsclient name="nfsclt-alpha" options="rw" path="/alpha" target="*"/>
        </resources>
        <service autostart="1" domain="NFSCDomain" name="NFCServices" recovery="relocate" nfslock="1">
            <ip ref="10.217.212.238">
                <fs ref="fs-export">
                    <nfsexport ref="NFS-E2K">
                        <nfsclient ref="nfsclt-home1"/>
                    </nfsexport>
                </fs>
                <fs ref="fs-alpha">
                    <nfsexport ref="NFS-E2K">
                        <nfsclient ref="nfsclt-alpha"/>
                    </nfsexport>
                </fs>
                <fs ref="fs-sim">
                    <nfsexport ref="NFS-E2K">
                        <nfsclient ref="nfsclt-sim"/>
                    </nfsexport>
                </fs>
                <fs ref="fs-cad">
                    <nfsexport ref="NFS-E2K">
                        <nfsclient ref="nfsclt-cad"/>
                    </nfsexport>
                </fs>
                <fs ref="fs-proj">
                    <nfsexport ref="NFS-E2K">
                        <nfsclient ref="nfsclt-proj"/>
                    </nfsexport>
                </fs>
            </ip>
        </service>
    </rm>
</cluster>

With self_fence set to 0, failover does not work properly. The corresponding log (from the active node, which is being restarted):
Mar 26 20:25:41 chopin rpc.statd[11846]: unlink (/tmp/statd-apollo.z11831/sm.bak/10.217.212.222): Permission denied
Mar 26 20:25:41 chopin rpc.statd[11846]: unlink (/tmp/statd-apollo.z11831/sm.bak/10.217.212.254): Permission denied
Mar 26 20:25:41 chopin rpc.statd[11846]: unlink (/tmp/statd-apollo.z11831/sm.bak/10.217.212.207): Permission denied
Mar 26 20:25:43 chopin rpc.statd[11846]: Caught signal 15, un-registering and exiting.
Mar 26 20:25:43 chopin clurgmgrd: [5845]: <err> 'umount /export' failed, error=0

With self_fence set to 1, failover does not work properly either. The corresponding log (from the active node, which is being restarted):
Mar 26 20:25:41 chopin rpc.statd[11846]: unlink (/tmp/statd-apollo.z11831/sm.bak/10.217.212.254): Permission denied
Mar 26 20:25:41 chopin rpc.statd[11846]: unlink (/tmp/statd-apollo.z11831/sm.bak/10.217.212.207): Permission denied
Mar 26 20:25:43 chopin rpc.statd[11846]: Caught signal 15, un-registering and exiting.
Mar 26 20:25:43 chopin clurgmgrd: [5845]: <err> 'umount /export' failed, error=0
Mar 26 20:25:43 chopin clurgmgrd: [5845]: <alert> umount failed - REBOOTING
At this point, the log on the backup node is:
Mar 29 12:26:47 elgar kernel: dlm: closing connection to node 2
Mar 29 12:26:47 elgar fenced[5086]: chopin not a cluster member after 0 sec post_fail_delay
Mar 29 12:26:47 elgar openais[5066]: [CLM ] r(0) ip(10.217.212.236)
Mar 29 12:26:47 elgar clurgmgrd[6647]: <info> State change: chopin DOWN
Mar 29 12:26:47 elgar fenced[5086]: fencing node "chopin"
Mar 29 12:26:47 elgar openais[5066]: [CLM ] Members Left:
Mar 29 12:26:47 elgar fenced[5086]: fence "chopin" failed
Mar 29 12:26:47 elgar openais[5066]: [CLM ] r(0) ip(10.217.212.237)
Mar 29 12:26:47 elgar openais[5066]: [CLM ] Members Joined:
Mar 29 12:26:47 elgar openais[5066]: [CLM ] CLM CONFIGURATION CHANGE
Mar 29 12:26:47 elgar openais[5066]: [CLM ] New Configuration:
Mar 29 12:26:47 elgar openais[5066]: [CLM ] r(0) ip(10.217.212.236)
Mar 29 12:26:47 elgar openais[5066]: [CLM ] Members Left:
Mar 29 12:26:47 elgar openais[5066]: [CLM ] Members Joined:
Mar 29 12:26:47 elgar openais[5066]: [SYNC ] This node is within the primary component and will provide service.
Mar 29 12:26:47 elgar openais[5066]: [TOTEM] entering OPERATIONAL state.
Mar 29 12:26:47 elgar openais[5066]: [CLM ] got nodejoin message 10.217.212.236
Mar 29 12:26:47 elgar openais[5066]: [CPG ] got joinlist message from node 1
Mar 29 12:26:48 elgar kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
Mar 29 12:26:52 elgar fenced[5086]: fencing node "chopin"
Mar 29 12:26:52 elgar fenced[5086]: fence "chopin" failed
Mar 29 12:26:53 elgar mountd[5610]: export request from 10.217.212.230 failed.
Mar 29 12:26:53 elgar last message repeated 3 times
Mar 29 12:26:57 elgar fenced[5086]: fencing node "chopin"
Mar 29 12:26:57 elgar fenced[5086]: fence "chopin" failed
Mar 29 12:27:02 elgar fenced[5086]: fencing node "chopin"
Mar 29 12:27:02 elgar fenced[5086]: fence "chopin" failed

Could someone tell me:
1. Why can some of the filesystems not be unmounted? Is there a way to unmount them?
2. Why does the backup node keep trying to fence? This happens whether or not we have a fencing device configured.

Thanks
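
For reference on question 2, here is a rough sketch of how one might check a cluster.conf for nodes that have no usable fence device. It assumes the usual layout in which each clusternode lists its devices under fence/method/device and the devices themselves are declared under fencedevices. The function name nodes_without_fencing and the trimmed SAMPLE string are made up for illustration; SAMPLE only mirrors the clusternodes and fencedevices parts of the config posted above.

```python
import xml.etree.ElementTree as ET


def nodes_without_fencing(conf_xml):
    """Return names of cluster nodes that reference no declared fence device.

    Hypothetical helper for illustration; it only inspects the XML and
    does not talk to the cluster.
    """
    root = ET.fromstring(conf_xml)

    # Fence devices declared at cluster level.
    declared = {
        dev.get("name")
        for dev in root.findall("./fencedevices/fencedevice")
    }

    unfenced = []
    for node in root.findall("./clusternodes/clusternode"):
        # Devices this node refers to in any of its fence methods.
        used = {
            dev.get("name")
            for dev in node.findall("./fence/method/device")
        }
        # A node counts as unfenced if none of its references match
        # a declared device (or it has no references at all).
        if not (used & declared):
            unfenced.append(node.get("name"))
    return unfenced


# Trimmed copy of the relevant parts of the posted cluster.conf.
SAMPLE = """<cluster alias="NFSCluster" config_version="123" name="NFSCluster">
  <clusternodes>
    <clusternode name="elgar" nodeid="1" votes="1">
    </clusternode>
    <clusternode name="chopin" nodeid="2" votes="1">
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
  </fencedevices>
</cluster>"""

if __name__ == "__main__":
    print(nodes_without_fencing(SAMPLE))  # ['elgar', 'chopin']
```

On the posted configuration this reports both nodes, which is at least consistent with the repeated fence "chopin" failed lines in the backup node's log. It is only a static check of the file, not a diagnosis of the running cluster.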