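Urgent help needed.
A two-node RHCS cluster was deployed without qdiskd configured. It runs a single NFS service and had been running without problems for a little over two weeks.
During this morning's routine check I found errors in /var/log/messages on both node1 and node2, although the NFS service is still accessible.
When I open system-config-cluster, it reports that neither node1 nor node2 is a cluster member any more. The errors in messages are: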
Mar 24 12:33:01 nas-node1 ccsd[2893]: Unable to connect to cluster infrastructure after 610650 seconds.
Mar 24 12:33:32 nas-node1 ccsd[2893]: Unable to connect to cluster infrastructure after 610680 seconds.
Mar 24 12:34:02 nas-node1 ccsd[2893]: Unable to connect to cluster infrastructure after 610710 seconds.
Mar 24 12:34:32 nas-node1 ccsd[2893]: Unable to connect to cluster infrastructure after 610740 seconds.
These messages have been appearing since the 17th.
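As a rough cross-check of that start date, the ccsd "after N seconds" counter can be subtracted from the timestamp of the line that reports it. This is only a sketch: syslog lines carry no year, so 2010 is an assumed placeholder, and `outage_start` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

def outage_start(stamp: str, seconds: int, year: int = 2010) -> datetime:
    """Subtract the ccsd 'after N seconds' counter from the log timestamp.

    `year` is an assumption: syslog timestamps do not include one.
    """
    logged_at = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return logged_at - timedelta(seconds=seconds)

# Values taken from the first ccsd line quoted above.
start = outage_start("Mar 24 12:33:01", 610650)
print(start.strftime("%b %d %H:%M:%S"))  # Mar 17 10:55:31
```

The result falls on Mar 17, the same day as the first ccsd message in the log (10:40:34, "after 30 seconds"). It is about 15 minutes later than that line implies, which would be consistent with the counter advancing by 30 per message while the messages are sometimes 31 seconds apart, so treat it as an estimate only.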
The log excerpt leading up to those ccsd messages:
Mar 17 10:39:48 nas-node1 openais[2899]: [TOTEM] The token was lost in the OPERATIONAL state.
Mar 17 10:39:48 nas-node1 openais[2899]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Mar 17 10:39:48 nas-node1 openais[2899]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Mar 17 10:39:48 nas-node1 openais[2899]: [TOTEM] entering GATHER state from 2.
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] entering GATHER state from 0.
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] Creating commit token because I am the rep.
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] Saving state aru ae high seq received ae
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] Storing new sequence id for ring 2b0
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] entering COMMIT state.
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] entering RECOVERY state.
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] position [0] member 192.168.101.205:
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] previous ring seq 684 rep 192.168.101.205
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] aru ae high delivered ae received flag 1
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] Did not need to originate any messages in recovery.
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] Sending initial ORF token
Mar 17 10:39:53 nas-node1 openais[2899]: [CLM ] CLM CONFIGURATION CHANGE
Mar 17 10:39:53 nas-node1 openais[2899]: [CLM ] New Configuration:
Mar 17 10:39:53 nas-node1 kernel: dlm: closing connection to node 2
Mar 17 10:39:53 nas-node1 openais[2899]: [CLM ] r(0) ip(192.168.101.205)
Mar 17 10:39:53 nas-node1 fenced[2933]: nas-node2.sdomain.root.corp not a cluster member after 0 sec post_fail_delay
Mar 17 10:39:53 nas-node1 openais[2899]: [CLM ] Members Left:
Mar 17 10:39:53 nas-node1 fenced[2933]: fencing node "nas-node2.sdomain.root.corp"
Mar 17 10:39:53 nas-node1 openais[2899]: [CLM ] r(0) ip(192.168.101.206)
Mar 17 10:39:53 nas-node1 openais[2899]: [CLM ] Members Joined:
Mar 17 10:39:53 nas-node1 openais[2899]: [CLM ] CLM CONFIGURATION CHANGE
Mar 17 10:39:53 nas-node1 openais[2899]: [CLM ] New Configuration:
Mar 17 10:39:53 nas-node1 openais[2899]: [CLM ] r(0) ip(192.168.101.205)
Mar 17 10:39:54 nas-node1 openais[2899]: [CLM ] Members Left:
Mar 17 10:39:54 nas-node1 openais[2899]: [CLM ] Members Joined:
Mar 17 10:39:54 nas-node1 openais[2899]: [SYNC ] This node is within the primary component and will provide service.
Mar 17 10:39:54 nas-node1 openais[2899]: [TOTEM] entering OPERATIONAL state.
Mar 17 10:39:54 nas-node1 openais[2899]: [CLM ] got nodejoin message 192.168.101.205
Mar 17 10:39:54 nas-node1 openais[2899]: [CPG ] got joinlist message from node 1
Mar 17 10:40:05 nas-node1 fenced[2933]: agent "fence_ilo" reports: Unable to connect/login to fencing device
Mar 17 10:40:05 nas-node1 fenced[2933]: fence "nas-node2.sdomain.root.corp" failed
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] entering GATHER state from 9.
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] Creating commit token because I am the rep.
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] Saving state aru e high seq received e
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] Storing new sequence id for ring 2b4
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] entering COMMIT state.
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] entering RECOVERY state.
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] position [0] member 192.168.101.205:
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] previous ring seq 688 rep 192.168.101.205
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] aru e high delivered e received flag 1
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] position [1] member 192.168.101.206:
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] previous ring seq 688 rep 192.168.101.206
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] aru f high delivered f received flag 1
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] Did not need to originate any messages in recovery.
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] Sending initial ORF token
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM ] CLM CONFIGURATION CHANGE
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM ] New Configuration:
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM ] r(0) ip(192.168.101.205)
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM ] Members Left:
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM ] Members Joined:
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM ] CLM CONFIGURATION CHANGE
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM ] New Configuration:
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM ] r(0) ip(192.168.101.205)
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM ] r(0) ip(192.168.101.206)
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM ] Members Left:
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM ] Members Joined:
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM ] r(0) ip(192.168.101.206)
Mar 17 10:40:09 nas-node1 openais[2899]: [SYNC ] This node is within the primary component and will provide service.
Mar 17 10:40:09 nas-node1 openais[2899]: [TOTEM] entering OPERATIONAL state.
Mar 17 10:40:09 nas-node1 openais[2899]: [MAIN ] Killing node nas-node2.sdomain.root.corp because it has rejoined the cluster with existing state
Mar 17 10:40:09 nas-node1 openais[2899]: [CMAN ] cman killed by node 2 because we rejoined the cluster without a full restart
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM ] got nodejoin message 192.168.101.206
Mar 17 10:40:09 nas-node1 openais[2899]: [CLM ] got nodejoin message 192.168.101.205
Mar 17 10:40:09 nas-node1 dlm_controld[2939]: cluster is down, exiting
Mar 17 10:40:09 nas-node1 gfs_controld[2945]: cluster is down, exiting
Mar 17 10:40:09 nas-node1 kernel: dlm: closing connection to node 1
Mar 17 10:40:34 nas-node1 ccsd[2893]: Unable to connect to cluster infrastructure after 30 seconds.
Mar 17 10:41:04 nas-node1 ccsd[2893]: Unable to connect to cluster infrastructure after 60 seconds.
Mar 17 10:41:35 nas-node1 ccsd[2893]: Unable to connect to cluster infrastructure after 90 seconds.
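The excerpt above is long, and the events that matter (token loss, the fencing attempt and its failure, the kill messages, the daemons exiting) are easy to miss among the TOTEM state changes. A minimal sketch of a filter for them follows. The pattern list is my own selection from the lines quoted above, not an official list of openais or cman messages, and `key_events` is a hypothetical helper name.

```python
import re

# Syslog line: "Mon DD HH:MM:SS host daemon[pid]: message" (the pid is optional).
LINE_RE = re.compile(r"^(\w{3} +\d+ [\d:]{8}) (\S+) ([\w-]+)(?:\[\d+\])?: (.*)$")

# Substrings chosen from the quoted log; extend as needed.
KEY_RE = re.compile("|".join([
    r"token was lost",
    r"fencing node",
    r"fence .* failed",
    r"Unable to connect/login to fencing device",
    r"Killing node",
    r"cman killed by node",
    r"cluster is down",
]))

def key_events(lines):
    """Return (timestamp, daemon, message) for lines matching a key pattern."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and KEY_RE.search(m.group(4)):
            events.append((m.group(1), m.group(3), m.group(4)))
    return events

# A few lines copied from the excerpt above.
SAMPLE = """\
Mar 17 10:39:48 nas-node1 openais[2899]: [TOTEM] The token was lost in the OPERATIONAL state.
Mar 17 10:39:53 nas-node1 openais[2899]: [TOTEM] entering COMMIT state.
Mar 17 10:39:53 nas-node1 fenced[2933]: fencing node "nas-node2.sdomain.root.corp"
Mar 17 10:40:05 nas-node1 fenced[2933]: agent "fence_ilo" reports: Unable to connect/login to fencing device
Mar 17 10:40:05 nas-node1 fenced[2933]: fence "nas-node2.sdomain.root.corp" failed
Mar 17 10:40:09 nas-node1 openais[2899]: [CMAN ] cman killed by node 2 because we rejoined the cluster without a full restart
Mar 17 10:40:09 nas-node1 dlm_controld[2939]: cluster is down, exiting
Mar 17 10:40:09 nas-node1 kernel: dlm: closing connection to node 1
""".splitlines()

for when, daemon, message in key_events(SAMPLE):
    print(when, daemon, message)
```

To run it against the real log, pass an open file instead of `SAMPLE`, for example `key_events(open("/var/log/messages"))`.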
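I noticed the fence failure for nas-node2 (fence "nas-node2.sdomain.root.corp" failed) in this log. After checking the cluster configuration carefully, I found that the IP address of the HP iLO card was wrong, so I corrected it and copied cluster.conf to node2.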
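Since cluster.conf was edited by hand and copied across, it may be worth confirming that both nodes now hold the same version and the same fence device addresses. The sketch below reads the `config_version` attribute of the `<cluster>` element and lists each `<fencedevice>` with its password left out. The sample fragment is hypothetical: the cluster name, device names, addresses, and attribute names such as `hostname` are placeholders, not values from this cluster, and `summarize` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def summarize(conf_xml: str):
    """Return (config_version, [fence device attributes without passwd])."""
    root = ET.fromstring(conf_xml)
    devices = []
    for dev in root.iter("fencedevice"):
        # Drop the password so the summary is safe to paste into a post.
        devices.append({k: v for k, v in dev.attrib.items() if k != "passwd"})
    return root.get("config_version"), devices

# Hypothetical fragment for illustration only.
SAMPLE_CONF = """<cluster name="nas_cluster" config_version="12">
  <fencedevices>
    <fencedevice agent="fence_ilo" name="ilo-node1" hostname="192.0.2.11" login="admin" passwd="secret"/>
    <fencedevice agent="fence_ilo" name="ilo-node2" hostname="192.0.2.12" login="admin" passwd="secret"/>
  </fencedevices>
</cluster>
"""

version, devices = summarize(SAMPLE_CONF)
print("config_version:", version)
for dev in devices:
    print(dev)
```

Running `summarize` on the text of /etc/cluster/cluster.conf from each node and comparing the two results shows at a glance whether the copies agree.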
Restarting the cman service then failed:
# service cman restart
Stopping cluster:
Stopping fencing... done
Stopping cman... done
Stopping ccsd... done
Unmounting configfs... done
[  OK  ]
Starting cluster:
Loading modules... done
Mounting configfs... done
Starting ccsd... done
Starting cman... done
Starting qdiskd... done
Starting daemons... done
Starting fencing... failed
[FAILED]
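I then rebooted node2. During boot it hangs at the "Starting fencing..." step and the system cannot finish starting up.
The current situation is:
1. The cluster functionality is definitely down.
2. The ricci and rgmanager services are running normally.
3. service cman status reports that groupd is dead.
This is a production environment, so I don't dare experiment. Any urgent help would be greatly appreciated!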