ReliantHA frequently reboots for no apparent reason!

Date: 2011-11-23 16:44:17
Posted on 2007-10-09 15:32


I get an error, "GAB: Port h halting system", when using UnixWare 7 ReliantHA.
I have installed ReliantHA, and when I run "hvstart", after a few seconds one or more servers shut down, displaying the message:
"GAB: Port h halting system".
"System has halted and may be powered off (Press any key to reboot)."
This is a generic ReliantHA error message indicating that a ReliantHA node has been shut down for some reason, often due to a communications failure of some kind.
After first rebooting the servers in the cluster, use the following tools to help diagnose the problem.
1. Disconnect the public network, then ping SYSA and ping SYSB. NOTE: These are the private network names that ReliantHA uses, and they are case sensitive.
2. Make sure that when ReliantHA was configured with "mkcluster", the external uname (or public name) was used for the node names and NOT SYSA or SYSB.
3. Check the Release Notes of ReliantHA to look at the S99gab script's timeout values.
              These release notes are located at:
4. Check the output in /usr/opt/reliant/log for any errors.
              This is a directory; the most useful file in it is the switchlog file.
              NOTE: It is normal to see errors such as:
              dynamic linker: commds: warning: copy relocation size mismatch
              for symbol svc_fdset
5. If using Compaq Network Interface Cards (NIC) Netflex3 series, consider using the OU8 eeE8 (DDI driver rather than Compaq's own "n100c" driver. This is because these cards are rebadged Intel Pro100B cards.
              The latest "nd" package is available from:
              If the Compaq Insight Manager agents are installed for NIC monitoring, they would need to be removed.
              Basically, ensure that the NIC can support a programmable MAC address and that cross-over cables are used to directly connect the nodes on the Private LAN.
6. Check that the latest patches for the operating system are installed; they are available from:
7. Check the output of "mswconfig -l" and "llstat -a", and the contents of "/etc/mswtab", for any errors.
8. If no specific config files are defined then hvstart will use a simple default set of scripts for basic testing between the nodes.
9. Running "ipcs -a" should show an allocated message queue once "hvstart" has run. You can also see the status of ReliantHA with "hvdisp -a".
10. Use the "truss" command to examine the output of the "hvstart" command to get an indication of when the failure occurs:
              truss -f -o /hvstart.truss hvstart
11. If the system is swapping excessively, this could cause enough latency at the heartbeat communication layer for a heartbeat to be missed, so that a node is killed with a gab halt. Use the standard system tools "sar" and "rtpm" to monitor for swapping behaviour.
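The command-line checks from steps 1, 7 and 9 above can be gathered into a single log, which makes it easier to compare the nodes. This is only a minimal sketch: the helper names run_and_log and collect_diag and the default log path /tmp/rha_diag.log are made up for illustration, and it assumes that the commands named in the steps are on the PATH and that ping exits on its own after one probe.

```shell
#!/bin/sh
# Sketch only: run_and_log and collect_diag are hypothetical helper names.

# Run one command, appending a header, its output and its exit status to a log.
run_and_log() {
    log=$1
    shift
    echo "=== $* ===" >> "$log"
    status=0
    "$@" >> "$log" 2>&1 || status=$?
    echo "--- exit status: $status ---" >> "$log"
}

# Collect the output of the checks from steps 1, 7 and 9 into one file.
# Step 1 expects the public network to be disconnected before the pings.
# Assumption: ping exits on its own after one probe.
collect_diag() {
    out=${1:-/tmp/rha_diag.log}   # hypothetical default location
    : > "$out"
    run_and_log "$out" ping SYSA
    run_and_log "$out" ping SYSB
    run_and_log "$out" mswconfig -l
    run_and_log "$out" llstat -a
    run_and_log "$out" cat /etc/mswtab
    run_and_log "$out" ipcs -a
    run_and_log "$out" hvdisp -a
    echo "diagnostics written to $out"
}
```

Calling collect_diag on each node after "hvstart" has run leaves one file per node; a command that is missing or fails shows up as a non-zero exit status in the log rather than stopping the run.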
              In addition:
              Check /etc/conf/cf.d/stune for tuning that may conflict with the
              shared message queues that ReliantHA needs to operate such as:
              Both of these values should be set to the default operating
              system values even if database vendors such as Oracle say that
              these values need to be set.
NOTE: MSGSSZ, MSGMNB and MSGTQL should be tuned from their default values to at least 524288, 65536 and 1000 respectively (add any further application related tuning to these values).
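One way to compare the current settings against these minimums is sketched below. It assumes that stune holds one whitespace-separated "NAME value" pair per line with decimal values, and that a parameter missing from stune is running at its system default. The function names check_one and check_msg_tunables are made up for illustration.

```shell
#!/bin/sh
# Sketch only: check_one and check_msg_tunables are hypothetical helper names.
# Assumption: stune lines look like "NAME value" with decimal values.

# $1 = stune file, $2 = parameter name, $3 = minimum from the note above
check_one() {
    # Take the last value listed for the parameter, if any.
    val=`awk '$1 == name { v = $2 } END { print v }' name="$2" "$1"`
    if [ -z "$val" ]; then
        echo "$2: not set in $1, system default applies"
    elif [ "$val" -lt "$3" ]; then
        echo "$2: $val is below the suggested minimum $3"
    else
        echo "$2: $val ok"
    fi
}

check_msg_tunables() {
    stune=${1:-/etc/conf/cf.d/stune}
    check_one "$stune" MSGSSZ 524288
    check_one "$stune" MSGMNB 65536
    check_one "$stune" MSGTQL 1000
}
```

Running check_msg_tunables with no argument reads /etc/conf/cf.d/stune; passing a path checks a copy of the file instead. The script only reports values and changes nothing.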
NOTE: The minimum requirement for ReliantHA is 2 private LAN connections.
NOTE: Instead of a "real" NIC you could also use a (null modem) serial cable as the second interface.
                 For Unisys: CBL6099-10M Null Modem Cable
                 For Compaq/HP: BC29Q-02M Null Modem Cable
NOTE: In general, note that if shared memory or disk buffering is used and a node fails, the buffered data will be lost when the second node takes over. This is important for databases that use this technology. Ensure that RAID controllers are configured to WRITE-THRU and not cached.
NOTE: When you run "hvstart" manually, you will need to hit RETURN to return to the prompt.
NOTE: With ReliantHA 1.1.3a, a new "gabconfig" option called -P was added.
                 The -P option was added as a standalone "debug" option for use
                 after the gab driver is already configured which will generate
                 a PANIC should "gab" halt.  By default it is turned off.  To
                 turn it on set the value to -P 1.
                 It is not recommended to use this feature within /etc/rc2.d.
                 Create an S92gab file in /etc/init.d to execute this
                 command at the end of the reboot, after entering multiuser
                 mode in the following format:
                 /sbin/gabconfig -S 4000 -c
                 /sbin/gabconfig -P 1
                 Also add -D 63 to the first gabconfig line for more debug output, as:
                 /sbin/gabconfig -S 4000 -c -D 63
                 /sbin/gabconfig -P 1
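Put together, the S92gab file described in this note might look like the sketch below. The two gabconfig lines are the ones quoted in the note. The GABCONFIG variable, the gab_debug_start function name, the existence check and the "start" dispatch are assumptions added for illustration and should be adapted to local init-script conventions.

```shell
#!/bin/sh
# Sketch of an S92gab script. Only the two gabconfig invocations come from
# the note; everything else is an illustrative assumption.

# Overridable path, so the script can be tried against a stub first.
GABCONFIG=${GABCONFIG:-/sbin/gabconfig}

gab_debug_start() {
    if [ ! -x "$GABCONFIG" ]; then
        echo "S92gab: $GABCONFIG not found or not executable" >&2
        return 1
    fi
    # Configure gab as in the note, with -D 63 added for more debug output.
    "$GABCONFIG" -S 4000 -c -D 63 || return 1
    # Turn on -P so that a gab halt generates a PANIC (off by default).
    "$GABCONFIG" -P 1
}

# Only act when invoked with "start", as init scripts usually are.
case "${1:-}" in
start)
    gab_debug_start
    ;;
esac
```

Because -P is described as a standalone debug option, it is probably worth removing the script again once the cause of the halt has been found.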
NOTE: When replacing a private NIC, first remove the mswtab and clustertab, then recreate them after the new card is installed.
NOTE: For RHA 1.1.4, please also run "rdu", the Reliant Diags Utility.

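If it is the host that hangs, the usual cause is excessive CPU load, which makes the system respond too slowly. ReliantHA was designed abroad and is rather dogmatic and idealistic: its designers assume that if CPU idle time is below 10%, something must be wrong with the system, so it forces a switchover. Heh.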

