ReliantHA frequently restarts for no apparent reason

Posted 2007-10-09 15:32

Original thread:
http://bbs.chinaunix.net/viewthread.php?tid=546470&highlight=ReliantHA%BE%AD%B3%A3%CE%DE%B9%CA%D6%D8%D0%C2%C6%F4%B6%AF%B5%C4%CE%CA%CC%E2

I get the error "GAB: Port h halting system" when using ReliantHA on UnixWare 7.
  
Problem
I have installed ReliantHA, and when I run "hvstart", after a few seconds one or more servers shut down, displaying the message:
"GAB: Port h halting system".
and/or:
"System has halted and may be powered off (Press any key to reboot)."
This is a generic ReliantHA error message indicating that a ReliantHA node has been shut down for some reason, often because of a communications failure.
Solution
After first rebooting the servers in the cluster, use the following tools to help diagnose the problem.
1. Disconnect the public network, then ping SYSA and ping SYSB. NOTE: These are the private network names that ReliantHA uses, and they are case sensitive.
2. Make sure that when ReliantHA was configured with "mkcluster", the external uname (or public name) was used for the node names and NOT SYSA or SYSB.
3. Check the ReliantHA Release Notes for the timeout values in the S99gab script.
              These release notes are located at:
              http://www.sco.com/products/clustering/notes/harelnot.html
4. Check the files under /usr/opt/reliant/log for any errors.
              This is a directory; the most useful file in it is the switchlog file.
              NOTE: It is normal to see errors such as:
              dynamic linker: commds: warning: copy relocation size mismatch
              for symbol svc_fdset
5. If using Compaq Netflex3 series Network Interface Cards (NICs), consider using the OU8 eeE8 (DDI driver rather than Compaq's own "n100c" driver, because these cards are rebadged Intel Pro100B cards.
              The latest "nd" package is available from:
              ftp://ftp.sco.com/pub/unixware7/drivers/storage
              ftp://ftp.sco.com/pub/openunix8/drivers/storage
              ftp://ftp.sco.com/pub/unixware7/713/
              If the Compaq Insight Manager agents are installed for NIC monitoring, they would need to be removed.
              Basically, ensure that the NIC supports a programmable MAC address and that cross-over cables are used to directly connect the nodes on the private LAN.
6. Check that the latest operating system patches are installed. They are available from:
              ftp://ftp.sco.com/pub/
7. Check the output of "mswconfig -l" and "llstat -a", and the contents of /etc/mswtab, for any errors.
8. If no specific config files are defined, hvstart will use a simple default set of scripts for basic testing between the nodes.
9. Once "hvstart" has run, "ipcs -a" should show an allocated message queue. You can also see the status of ReliantHA with "hvdisp -a".
10. Run "hvstart" under the "truss" command and examine the trace to get an indication of when the failure occurs:
              truss -f -o /hvstart.truss hvstart
11. If the system is swapping excessively, this can add enough latency at the heartbeat communication layer for a heartbeat to be missed, so that a node is killed with a gab halt. Use the standard system tools "sar" and "rtpm" to monitor swapping behaviour.
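
The log check in step 4 and the truss trace in step 10 can be sketched as two small shell helpers. This is only a sketch under assumptions: the function names are made up for illustration, the switchlog is assumed to be plain text at /usr/opt/reliant/log/switchlog with error lines containing words like "error", "warning" or "fail", and truss is assumed to mark failed system calls with "Err#" as it does on other SVR4-derived systems. Adjust the patterns to whatever your logs actually contain.

```shell
#!/bin/sh
# Sketch only: helper names, the keyword list and the "Err#" marker are assumptions.

# Step 4: list error-like lines in the switchlog, minus the harmless
# "copy relocation size mismatch" warning that is normal to see.
scan_switchlog() {
    log="${1:-/usr/opt/reliant/log/switchlog}"
    [ -r "$log" ] || { echo "cannot read $log" >&2; return 2; }
    grep -i -E 'error|warning|fail' "$log" | grep -v 'copy relocation size mismatch'
}

# Step 10: show the last failed system calls recorded in the truss output.
last_truss_errors() {
    trace="${1:-/hvstart.truss}"
    count="${2:-20}"
    [ -r "$trace" ] || { echo "cannot read $trace" >&2; return 2; }
    grep -n 'Err#' "$trace" | tail -n "$count"
}
```

With no arguments, scan_switchlog reads the assumed default path, and "last_truss_errors /hvstart.truss 5" prints the five most recent failed calls with their line numbers in the trace. The calls that fail just before the halt are probably the best place to start looking.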
              In addition:
              Check /etc/conf/cf.d/stune for tuning that may conflict with the
              shared message queues that ReliantHA needs in order to operate, such as:
                 MSGSEG
                 STRTHRESH
              Both of these values should be left at the default operating
              system values even if database vendors such as Oracle say that
              these values need to be set.
NOTE: MSGSSZ, MSGMNB and MSGTQL should be raised from their default values to at least 524288, 65536 and 1000 respectively (add any further application-related tuning on top of these values).
NOTE: The minimum requirement for ReliantHA is 2 private LAN connections.
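
The MSGSSZ, MSGMNB and MSGTQL minimums from the tuning note above can be checked with a short script. This is a rough sketch, not a supported tool: the function name is hypothetical, and it assumes each line of /etc/conf/cf.d/stune has the form "NAME VALUE" separated by whitespace, with any parameter missing from the file falling back to its system default.

```shell
#!/bin/sh
# Sketch only: assumes stune lines look like "NAME VALUE" (whitespace separated).
# Reports the three message queue tunables that are unset or below the
# minimums quoted in the note (524288, 65536 and 1000).
check_msg_tunables() {
    stune="${1:-/etc/conf/cf.d/stune}"
    [ -r "$stune" ] || { echo "cannot read $stune" >&2; return 2; }
    awk '
        BEGIN {
            n = split("MSGSSZ MSGMNB MSGTQL", name, " ")
            min["MSGSSZ"] = 524288; min["MSGMNB"] = 65536; min["MSGTQL"] = 1000
        }
        # Remember the value of each tunable we care about.
        ($1 in min) { val[$1] = $2 }
        END {
            # Walk the names in a fixed order so the report is stable.
            for (i = 1; i <= n; i++) {
                p = name[i]
                if (!(p in val))
                    printf "%s: not set in stune, default applies\n", p
                else if (val[p] + 0 < min[p])
                    printf "%s: %s is below minimum %d\n", p, val[p], min[p]
            }
        }
    ' "$stune"
}
```

An empty report means all three values are present and at or above the minimums. The script does not change anything; use the system's normal tuning procedure to adjust the values.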
NOTE: Instead of a "real" NIC you could also use a (null modem) serial cable as the second interface.
                 For Unisys: CBL6099-10M Null Modem Cable
                 For Compaq/HP: BC29Q-02M Null Modem Cable
NOTE: In general, if a node fails while shared memory or disk buffering is in use, the buffered data will be lost when the second node takes over. This is important for databases that use this technology. Ensure that RAID controllers are configured to WRITE-THRU and not cached.
NOTE: When you run "hvstart" manually, you will need to hit RETURN to get back to the prompt.
NOTE: With ReliantHA 1.1.3a a new "gabconfig" option called -P was added.
                 The -P option was added as a standalone "debug" option for use
                 after the gab driver is already configured. It generates
                 a PANIC should "gab" halt.  By default it is turned off.  To
                 turn it on, set the value to -P 1.
                 It is not recommended to use this feature within /etc/rc2.d.
                 Create an S92gab file in /etc/init.d to execute this
                 command at the end of the reboot, after entering multiuser
                 mode, in the following format:
                 /sbin/gabconfig -S 4000 -c
                 /sbin/gabconfig -P 1
                 For more debug output, also add -D 63 to the first line:
                 /sbin/gabconfig -S 4000 -c -D 63
                 /sbin/gabconfig -P 1
NOTE: When replacing a private NIC, first remove the mswtab and clustertab, then recreate them after the new card is installed.
NOTE: For RHA 1.1.4, please also run "rdu", the Reliant Diags Utility.



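This kind of problem is simple; there are only two possibilities.
If the standby node goes down, it is a heartbeat link problem. You may be using an unreliable network cable, or one of the heartbeat links may be a serial cable. When the serial line blocks, the system goes down. Replacing the serial link with a NIC usually solves it.
If the primary node goes down, it is usually because the CPU load is too high and the system responds too slowly. ReliantHA was designed overseas and is rather dogmatic and idealistic: its designers assume that if CPU idle time is below 10%, something must be wrong with the system, so it forces a failover, heh.
To fix it, add CPUs or run one fewer database engine.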


This article comes from the ChinaUnix blog. To view the original, see: http://blog.chinaunix.net/u/22/showart_397281.html