
Can services fail over to the standby node after the primary node of an RHEL HA cluster loses power?

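Posted 2015-08-10 16:24
The primary and standby nodes are configured with ipmilan fencing only. My understanding is that when the primary loses power the heartbeat is lost, but the standby cannot fence the primary, so the services will not be started on the standby. Is my understanding correct?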

Posted 2015-08-13 21:39
I found a reference online that more or less answers this question:

infrastructureadventures.com/2011/07/15/adventures-with-two-node-rhel-ha-clusters-%E2%80%93-behavior/

Posted 2015-08-13 21:42
Several failure scenarios were tested to determine how the cluster would respond in those conditions. The behavior of the cluster during those tests produced the following results.

Failure of Network Connection
In the event of a failure of the network connection (i.e. switch loss, cable disconnected, etc.) that is used for an IP Resource (VIP), the cluster will detect that loss and fail over the cluster services to the standby node. The detection and failover process takes less than 30 seconds to complete.
Failure of the network connection on the passive node has no impact on the cluster, other than that any failover attempt to that node will fail.
Restoration of the network connections has no impact on the cluster, and the cluster services will stay on the current active node (i.e. there is no “Fail-back”).
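
The placement rule described above (fail over only when the active node's link is down, and never fail back) can be sketched as a small function. This is only an illustrative model, not how the cluster software is implemented; the function name `next_active` and the node names are hypothetical.

```python
from typing import Dict


def next_active(current_active: str, link_up: Dict[str, bool]) -> str:
    """Return the node that should run the services in a two-node cluster.

    link_up maps each node name to whether its VIP network link is up.
    Hypothetical model of the behaviour described in the post.
    """
    if link_up[current_active]:
        # No fail-back: services stay put while the current node is healthy,
        # even if the other node's link has just been restored.
        return current_active
    standby = next(node for node in link_up if node != current_active)
    if link_up[standby]:
        return standby
    # Both links are down: the failover attempt itself fails.
    raise RuntimeError("failover attempt failed: standby link is down")


# node1 loses its link, so services move to node2 ...
active = next_active("node1", {"node1": False, "node2": True})
# ... and stay on node2 after node1's link comes back.
active = next_active(active, {"node1": True, "node2": True})
```

After both calls `active` is still the second node, which matches the "no Fail-back" behaviour in the test results.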

Failure of iSCSI Connection
In the event of a failure of the network connections used for iSCSI (i.e. switch loss, cable disconnected, etc.) the cluster will detect that loss and fail over the cluster services to the standby node. This requires the multipathing daemon to fail all valid iSCSI paths and the SCSI stack to fail the disk. This failure detection and failover process can take between two and three minutes.
While reducing this time is technically possible, Red Hat support advised using the defaults unless a shorter failover time was required. The issue with changing the timers is the possibility that a brief network interruption or the loss of one or more (but not all) iSCSI paths could cause a premature failover of the cluster.
Failure of the iSCSI network connections on the passive node has no impact on the cluster, other than that any failover attempt to that node will fail.
Restoration of the iSCSI network connections has no impact on the cluster, and the cluster services will stay on the current active node (i.e. there is no “Fail-back”).

Failure of Heartbeat Connection
In the event that a node loses its connection to the heartbeat network, its partner node will fence the failed node. For example, if the active node loses its heartbeat connection, the passive node will fence the currently active node; likewise, if the passive node loses its heartbeat connection, the active node will fence the passive node. If the active node is fenced, the passive node will then take over the cluster services. The detection, fencing and failover process takes less than 30 seconds.
The node that was fenced will reboot. If the node’s connection to the heartbeat network is restored by the reboot (or by the time the cluster services are started on the node), the node will rejoin the cluster as the passive member. If the heartbeat connection is not restored, the node will not rejoin the cluster.
If the node with the failed heartbeat connection is up and the cluster service (cman) is running when the heartbeat connection is restored, both nodes try to rebuild the cluster. This triggers a race condition in which each node tries to fence the other. This could cause a brief outage of less than 30 seconds if the passive node manages to fence the active node and starts the cluster services.
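
As a rough illustration of the heartbeat-loss behaviour described above, the following sketch maps "which node lost its heartbeat" to "which node is fenced and where the services end up". It is a toy model under the assumption that fencing succeeds; the names `Outcome` and `heartbeat_loss_outcome` are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    fenced: str        # node that is fenced (and then reboots)
    services_on: str   # node running the cluster services afterwards
    failover: bool     # True if the services had to move


def heartbeat_loss_outcome(active: str, passive: str, lost: str) -> Outcome:
    """Model the result when `lost` drops off the heartbeat network.

    Assumes the partner node's fence attempt succeeds.
    """
    if lost not in (active, passive):
        raise ValueError(f"unknown node: {lost}")
    if lost == active:
        # The passive node fences the active node and takes over.
        return Outcome(fenced=active, services_on=passive, failover=True)
    # The active node fences the passive node; services do not move.
    return Outcome(fenced=passive, services_on=active, failover=False)


result = heartbeat_loss_outcome(active="node1", passive="node2", lost="node1")
```

The race on heartbeat restoration is deliberately left out: which node wins that race is not determined by anything in the description, so the model does not try to predict it.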

Complete Failure of a Node
A complete failure of the active node (i.e. the node loses power) will require manual intervention to fail over the cluster services. This is because the passive node cannot successfully fence the failed node (at least not via iLO).
In this case the remaining passive node will detect that its partner has failed, but will not start the cluster services. The cluster does this to avoid the data corruption that could occur in a split-brain scenario.
To fail over the cluster services, an administrator will need to connect to the passive node and run the fence_ack_manual command. This command tells the passive node that you are manually acknowledging the fence, i.e. the system should consider the fence to have completed successfully. The system will then start the cluster services.
The same condition could also be encountered if a node loses both its iLO and heartbeat connections. The remaining node would not be able to fence the failed node since it is unable to reach the iLO of the failed node. In this case the failed node could still have a working iSCSI connection and have the storage mounted. Because of this possible condition, it is imperative that the administrator validate that the failed node does not have the storage mounted before issuing fence_ack_manual.
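
This is also the answer to the original question: with only an IPMI/iLO fence device, a power loss on the primary takes out the fence device too, so the standby holds the services until someone acknowledges the fence manually. A minimal sketch of that decision, with hypothetical function names and no real cluster calls:

```python
def surviving_node_starts_services(fence_succeeded: bool,
                                   manually_acked: bool = False) -> bool:
    """The surviving node starts the services only once the failed node
    counts as fenced: either the fence agent succeeded or an administrator
    acknowledged the fence (fence_ack_manual in the post above)."""
    return fence_succeeded or manually_acked


def safe_to_acknowledge(failed_node_has_storage_mounted: bool) -> bool:
    """Administrator-side check before a manual acknowledgement: the failed
    node must not still have the shared storage mounted."""
    return not failed_node_has_storage_mounted


# Power loss on the primary: the fence device is unreachable, fencing fails.
blocked = surviving_node_starts_services(fence_succeeded=False)

# The administrator confirms the storage is not mounted, then acknowledges.
if safe_to_acknowledge(failed_node_has_storage_mounted=False):
    started = surviving_node_starts_services(fence_succeeded=False,
                                             manually_acked=True)
```

The second function is a checklist item for a human, not something the cluster verifies itself, which is why the post stresses checking the storage before acknowledging.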