Chinaunix

[Urgent] RHCS CMAN reports "groupd dead but pid file exists", yet the service is still available

Post 1, posted 2010-03-24 13:20
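Urgent help needed.
I deployed a two-node RHCS cluster without qdiskd, running a single NFS service. It ran without incident for a little over two weeks.
This morning, during a routine check, I found errors in /var/log/messages on both node1 and node2, although the NFS service is still accessible.
Opening system-config-cluster reports that neither node1 nor node2 is a cluster member any more. The errors in messages are as follows: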

Mar 24 12:33:01 nas-node1 ccsd[2893]: Unable to connect to cluster infrastructure after 610650 seconds.
Mar 24 12:33:32 nas-node1 ccsd[2893]: Unable to connect to cluster infrastructure after 610680 seconds.
Mar 24 12:34:02 nas-node1 ccsd[2893]: Unable to connect to cluster infrastructure after 610710 seconds.
Mar 24 12:34:32 nas-node1 ccsd[2893]: Unable to connect to cluster infrastructure after 610740 seconds.


These messages have been appearing since the 17th.
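
As a rough cross-check of that date, the elapsed-seconds counter in the ccsd messages can be converted to days and hours. This is only a minimal sketch in shell; the function name secs_to_dhms is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert the "after N seconds" counter from a ccsd
# message into days, hours, minutes and seconds.
secs_to_dhms() {
    s=$1
    d=$((s / 86400)); s=$((s % 86400))
    h=$((s / 3600));  s=$((s % 3600))
    m=$((s / 60));    s=$((s % 60))
    echo "${d}d ${h}h ${m}m ${s}s"
}

secs_to_dhms 610650   # counter from the Mar 24 12:33:01 message
```

For 610650 this comes to a little over seven days, which counted back from Mar 24 12:33 lands on the morning of Mar 17. The counter advances by 30 per message while some timestamps are 31 seconds apart, so it probably under-counts slightly.
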
The log leading up to the messages above:
Mar 17 10:39:48 nas-node1 openais[2899]: [TOTEM] The token was lost in the OPERATIONAL state.
Mar 17 10:39:48 nas-node1 openais[2899]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Mar 17 10:39:48 nas-node1 openais[2899]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Mar 17 10:39:48 nas-node1 openais[2899]: [TOTEM] entering GATHER state from 2.
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] entering GATHER state from 0.
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] Creating commit token because I am the rep.
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] Saving state aru ae high seq received ae
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] Storing new sequence id for ring 2b0
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] entering COMMIT state.
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] entering RECOVERY state.
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] position [0] member 192.168.101.205:
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] previous ring seq 684 rep 192.168.101.205
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] aru ae high delivered ae received flag 1
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] Did not need to originate any messages in recovery.
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] Sending initial ORF token
Mar 17 10:39:53 nas-node1 openais[2899]: [CLM  ] CLM CONFIGURATION CHANGE
Mar 17 10:39:53 nas-node1 openais[2899]: [CLM  ] New Configuration:
Mar 17 10:39:53 nas-node1 kernel: dlm: closing connection to node 2
Mar 17 10:39:53 nas-node1 openais[2899]: [CLM  ]        r(0) ip(192.168.101.205)
Mar 17 10:39:53 nas-node1 fenced[2933]: nas-node2.sdomain.root.corp not a cluster member after 0 sec post_fail_delay
Mar 17 10:39:53 nas-node1 openais[2899]: [CLM  ] Members Left:
Mar 17 10:39:53 nas-node1 fenced[2933]: fencing node "nas-node2.sdomain.root.corp"
Mar 17 10:39:53 nas-node1 openais[2899]: [CLM  ]        r(0) ip(192.168.101.206)
Mar 17 10:39:53 nas-node1 openais[2899]: [CLM  ] Members Joined:
Mar 17 10:39:53 nas-node1 openais[2899]: [CLM  ] CLM CONFIGURATION CHANGE
Mar 17 10:39:53 nas-node1 openais[2899]: [CLM  ] New Configuration:
Mar 17 10:39:53 nas-node1 openais[2899]: [CLM  ]        r(0) ip(192.168.101.205)
Mar 17 10:39:54 nas-node1 openais[2899]: [CLM  ] Members Left:
Mar 17 10:39:54 nas-node1 openais[2899]: [CLM  ] Members Joined:
Mar 17 10:39:54 nas-node1 openais[2899]: [SYNC ] This node is within the primary component and will provide service.
Mar 17 10:39:54 nas-node1 openais[2899]: [TOTEM] entering OPERATIONAL state.
Mar 17 10:39:54 nas-node1 openais[2899]: [CLM  ] got nodejoin message 192.168.101.205
Mar 17 10:39:54 nas-node1 openais[2899]: [CPG  ] got joinlist message from node 1
Mar 17 10:40:05 nas-node1 fenced[2933]: agent "fence_ilo" reports: Unable to connect/login to fencing device
Mar 17 10:40:05 nas-node1 fenced[2933]: fence "nas-node2.sdomain.root.corp" failed
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] entering GATHER state from 9.
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] Creating commit token because I am the rep.
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] Saving state aru e high seq received e
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] Storing new sequence id for ring 2b4
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] entering COMMIT state.
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] entering RECOVERY state.
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] position [0] member 192.168.101.205:
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] previous ring seq 688 rep 192.168.101.205
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] aru e high delivered e received flag 1
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] position [1] member 192.168.101.206:
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] previous ring seq 688 rep 192.168.101.206
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] aru f high delivered f received flag 1
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] Did not need to originate any messages in recovery.
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] Sending initial ORF token
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM  ] CLM CONFIGURATION CHANGE
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM  ] New Configuration:
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM  ]        r(0) ip(192.168.101.205)
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM  ] Members Left:
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM  ] Members Joined:
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM  ] CLM CONFIGURATION CHANGE
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM  ] New Configuration:
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM  ]        r(0) ip(192.168.101.205)
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM  ]        r(0) ip(192.168.101.206)
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM  ] Members Left:
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM  ] Members Joined:
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM  ]        r(0) ip(192.168.101.206)
Mar 17 10:40:09 nas-node1 openais[2899]: [SYNC ] This node is within the primary component and will provide service.
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] entering OPERATIONAL state.
Mar 17 10:40:09 nas-node1 openais[2899]: [MAIN ] Killing node nas-node2.sdomain.root.corp because it has rejoined the cluster with existing state
Mar 17 10:40:09 nas-node1 openais[2899]: [CMAN ] cman killed by node 2 because we rejoined the cluster without a full restart
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM  ] got nodejoin message 192.168.101.206
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM  ] got nodejoin message 192.168.101.205
Mar 17 10:40:09 nas-node1 dlm_controld[2939]: cluster is down, exiting
Mar 17 10:40:09 nas-node1 gfs_controld[2945]: cluster is down, exiting
Mar 17 10:40:09 nas-node1 kernel: dlm: closing connection to node 1
Mar 17 10:40:34 nas-node1 ccsd[2893]: Unable to connect to cluster infrastructure after 30 seconds.
Mar 17 10:41:04 nas-node1 ccsd[2893]: Unable to connect to cluster infrastructure after 60 seconds.
Mar 17 10:41:35 nas-node1 ccsd[2893]: Unable to connect to cluster infrastructure after 90 seconds.
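
To pick the main events out of an excerpt like this one (token loss, the fencing attempt, the kill messages, the daemons exiting), a simple grep filter is usually enough. The function name and the pattern list below are assumptions chosen to match the lines in this log, not a complete list of cman messages.

```shell
#!/bin/sh
# Hypothetical filter: keep only the lines that mark the main events in a
# cman/openais syslog excerpt. Reads stdin, writes matching lines to stdout.
key_events() {
    grep -E 'token was lost|fenced\[|Killing node|cman killed|cluster is down'
}

# Example (the log path is the one mentioned in the post):
# key_events < /var/log/messages
```
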

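The log shows that fence "nas-node2" failed. After checking the cluster configuration carefully, I found that the IP address of the HP iLO card had been entered incorrectly, so I corrected it and copied cluster.conf to node2.
Restarting the cman service then failed: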
# service cman restart
Stopping cluster:
   Stopping fencing... done
   Stopping cman... done
   Stopping ccsd... done
   Unmounting configfs... done
                                                           [  OK  ]
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting qdiskd... done
   Starting daemons... done
   Starting fencing... failed

                                                           [FAILED]

I rebooted node2; the boot process hangs at the "Starting fencing..." step and the system cannot finish booting.

The current situation:
1. The cluster functionality is definitely down.
2. The ricci and rgmanager services are running normally.
3. service cman status reports that groupd is dead.
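
For context on item 3: with Red Hat style init scripts, a status of "dead but pid file exists" generally means the pid file is still present while the process it names is gone. Below is a minimal sketch of that kind of check. The function name and the example path are illustrative and not taken from the cman init script.

```shell
#!/bin/sh
# Hypothetical check: classify a daemon from its pid file alone.
pidfile_state() {
    pidfile=$1
    if [ ! -f "$pidfile" ]; then
        echo "no pid file"
        return 3
    fi
    pid=$(cat "$pidfile")
    # kill -0 sends no signal; it only tests whether the process exists
    # (it can also fail for lack of permission, so run this as root).
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "running (pid $pid)"
        return 0
    fi
    echo "dead but pid file exists"
    return 1
}

# Example, assuming groupd writes its pid file here:
# pidfile_state /var/run/groupd.pid
```
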

This is a production environment, so I don't dare experiment. Urgently asking the experts here for help!

Post 2, posted 2010-03-24 14:58
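This problem looks somewhat complicated.
I have modified cluster.conf. Since the nodes are currently not members of the cluster, can I use

service rgmanager reload

to reload the configuration?
This is a production environment and I am very nervous.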

Post 3, posted 2010-03-24 17:50
I'll reboot the servers tonight after I get home. This has to be fixed, or I won't sleep soundly!

Post 4, posted 2010-03-25 16:01
Try stopping openais first, then starting cman.

Post 5, posted 2010-03-26 10:05
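At 7 a.m., before everyone got to work, I ran chkconfig off for the cman and rgmanager services on both nodes, so that a node would not hang at boot after being fenced. Then I restarted the cman and rgmanager services on both nodes: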

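1. Restarted on node 1 first; fenced still failed to start.
2. Then started cman and rgmanager on node 2. Both started successfully, node 2 fenced node 1 and took over the NFS service.
3. Once node 1 had booted, I started cman and rgmanager on it by hand; both succeeded.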
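
The per-node part of this sequence can be sketched as a dry-run script that only prints the commands unless explicitly told to run them. The function names, the RUN variable, and the use of `start` rather than `restart` are assumptions for illustration. The chkconfig and service invocations use their standard forms, and fencing or rebooting the other node is not scripted here.

```shell
#!/bin/sh
# Hypothetical helper: print (or, with RUN=1, execute) a command.
# By default nothing is changed on the system.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

recover_node() {
    # Keep the cluster services from starting at boot, so a fenced node
    # does not hang at "Starting fencing..." while it comes back up.
    run chkconfig cman off
    run chkconfig rgmanager off
    # Then bring the cluster stack up by hand: cman first, rgmanager after.
    run service cman start
    run service rgmanager start
}

recover_node
```

With RUN unset the script only prints the four commands. The order across nodes (node 2 first, then node 1 once it has rebooted) follows the account in this post.
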

The whole cluster is finally back to normal. But what was the cause?
Why would it report, out of nowhere:
Mar 17 10:39:48 nas-node1 openais[2899]: [TOTEM] The token was lost in the OPERATIONAL state.

I did nothing at all on the 17th!