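I have already passed exam 237, but my memory of these two topics (the deadman switch and split brain) has grown too hazy. I recently saw other people asking about them, so I am going back over the material myself.
The HACMP Planning Guide contains the following two passages: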
To ensure a clean takeover, HACMP provides a Deadman Switch, which is configured to halt the unresponsive node one second before the other nodes begin processing a node failure event. The Deadman Switch uses the Failure Detection Parameters of the slowest network to determine at what point to halt the node. Thus, by increasing the amount of time before a failure is detected, you give a node more time in which to give HACMP CPU cycles. This can be critical if the node experiences saturation at times.
To help eliminate node saturation, modify AIX 5L tuning parameters. For information about these tuning parameters, see the following sections in the Administration Guide:
• Configuring Cluster Performance Tuning in Chapter 18: Troubleshooting HACMP Clusters
• Changing the Failure Detection Rate of a Network Module in Chapter 12: Managing the Cluster Topology.
Change Failure Detection Parameters only after these other measures have been implemented.
Syncd Frequency
The syncd setting determines the frequency with which the I/O disk-write buffers are flushed. Frequent flushing of these buffers reduces the chance of deadman switch time-outs.
The AIX 5L default value for syncd as set in /sbin/rc.boot is 60. Change this value to 10. Note that the I/O pacing parameter setting should be changed first. You do not need to adjust this parameter again unless time-outs frequently occur.
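To make the syncd change concrete, here is a minimal Python sketch that rewrites the flush interval in the text of /sbin/rc.boot. The shape of the syncd line (`nohup /usr/sbin/syncd 60 > /dev/null 2>&1 &`) and the helper name `set_syncd_interval` are my assumptions, not something taken from the guide, so check the actual line on your own system before editing anything.

```python
import re

# Assumed shape of the syncd line in /sbin/rc.boot; verify it on your system.
SAMPLE_RC_BOOT = "nohup /usr/sbin/syncd 60 > /dev/null 2>&1 &\n"


def set_syncd_interval(rc_boot_text: str, seconds: int) -> str:
    """Return rc_boot_text with the syncd flush interval set to `seconds`.

    Hypothetical helper: it only transforms text and never touches the file.
    """
    if seconds <= 0:
        raise ValueError("interval must be a positive number of seconds")
    pattern = re.compile(r"(/usr/sbin/syncd\s+)\d+")
    # A function replacement avoids ambiguity between the group reference
    # and the digits of the new interval.
    new_text, count = pattern.subn(
        lambda m: m.group(1) + str(seconds), rc_boot_text
    )
    if count == 0:
        raise ValueError("no syncd line found in the given text")
    return new_text


updated = set_syncd_interval(SAMPLE_RC_BOOT, 10)
print(updated, end="")
```

The function works on a string on purpose, so it can be tried against a copy of the file first. A changed rc.boot only takes effect at the next boot.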
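A brief explanation:
To handle a node failure correctly, the cluster has to decide whether a node is really dead. The deadman switch makes that decision using the values set in the failure detection parameters.
Problems with I/O, memory, and so on can keep the cluster manager from handling node communication in time, so that a cluster node is halted by mistake.
So a few parameters need tuning:
1. I/O pacing
2. syncd
3. Increase the amount of memory available to the communications subsystem
4. Change the failure detection rate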
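The relationship between the failure detection parameters and the deadman switch can be sketched roughly as follows. The quoted passage only says that the switch fires one second before the other nodes start processing the node failure, based on the slowest network. The formula "heartbeat rate × failure cycle × 2", the class and function names, and the sample values are my assumptions for illustration; treat the numbers as made up.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class NetworkModule:
    name: str
    heartbeat_rate_s: float  # seconds between heartbeats
    failure_cycle: int       # missed heartbeats before a failure is declared

    def detection_time_s(self) -> float:
        # Assumed formula: heartbeat rate * failure cycle * 2.
        return self.heartbeat_rate_s * self.failure_cycle * 2


def deadman_timeout_s(networks: Iterable[NetworkModule]) -> float:
    """Seconds an unresponsive node has before the deadman switch halts it."""
    times = [n.detection_time_s() for n in networks]
    if not times:
        raise ValueError("at least one network module is required")
    # The slowest network (longest detection time) sets the limit; the node
    # is halted one second before the others begin the node failure event.
    return max(times) - 1.0


# Hypothetical values, not real defaults.
networks = [
    NetworkModule("ether", heartbeat_rate_s=1.0, failure_cycle=10),
    NetworkModule("rs232", heartbeat_rate_s=2.0, failure_cycle=12),
]
print(deadman_timeout_s(networks))  # 47.0
```

Under these assumptions, raising the failure cycle or heartbeat rate of the slowest network lengthens the window, which is why the guide treats it as the last measure after I/O pacing and syncd.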
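Split brain is something I have not fully understood. Roughly, the point is that when a failure occurs, HACMP must make sure a resource is not accessed by several nodes at the same time, because that would corrupt the data. The situation tends to arise when the TCP/IP network fails while the non-TCP/IP network either does not exist or has also failed: both nodes then believe they can legitimately access the VG and other resources. If this happens (TCP/IP broken, non-TCP/IP unreachable), the system brings down the node that tries to join the cluster later.
I think that is about right.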
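My understanding of the split-brain case can be written down as a small decision function. This only models the description above, not HACMP's actual logic; the function names and the `join_order` list are hypothetical.

```python
from typing import List, Optional


def is_partitioned(tcpip_up: bool, non_tcpip_up: bool) -> bool:
    """True when no heartbeat path is left between the nodes.

    A missing non-TCP/IP network is modeled as non_tcpip_up=False.
    """
    return not tcpip_up and not non_tcpip_up


def node_to_halt(
    join_order: List[str], tcpip_up: bool, non_tcpip_up: bool
) -> Optional[str]:
    """Return the node to bring down, or None if no partition exists.

    join_order lists node names in the order they joined the cluster, so
    the last entry is the node that tried to join most recently.
    """
    if not join_order or not is_partitioned(tcpip_up, non_tcpip_up):
        return None
    return join_order[-1]


# TCP/IP broken and the serial network unreachable: the later joiner goes down.
print(node_to_halt(["nodeA", "nodeB"], tcpip_up=False, non_tcpip_up=False))
# A working non-TCP/IP network still carries heartbeats: nobody is halted.
print(node_to_halt(["nodeA", "nodeB"], tcpip_up=False, non_tcpip_up=True))
```

The second call shows why a non-TCP/IP heartbeat network matters: as long as it is up, the nodes can still see each other and no one has to be taken down.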
This article comes from a ChinaUnix blog. To view the original, see: http://blog.chinaunix.net/u/92/showart_56120.html