Chinaunix

[Troubleshooting help] HACMP error caused a service outage, help needed!

1# Posted 2011-12-28 15:55
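My environment: AIX 5300-06, HACMP 5.4.0.0.

Recently, in the early morning hours, a fault has been occurring from time to time: oasrv2 gets shut down by HACMP.
On oasrv1 the service IP stays up, but the RG running on it has failed, and the only fix is to restart that RG.

Below is an excerpt from cluster.log for the relevant time window: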
Dec 27 02:54:51 OAsrv1 daemon:err|error topsvcs[991248]: (Recorded using libct_ffdc.a cv 2):::Error ID: 6BUfAx.v9AyC/dUQ./UAe.1...................:::Reference ID: :::Template ID: 3d32b80d:::Details File:  :::Location: rsct,nim_control.C,1.39.1.18,5919             :::TS_NIM_ERROR_STUCK_ER NIM thread blocked Thread which was blocked receive thread Interval in seconds during which process was blocked 120 Interface name en3
Dec 27 02:54:52 OAsrv1 daemon:err|error topsvcs[991248]: (Recorded using libct_ffdc.a cv 2):::Error ID: 6BUfAx.w9AyC/rZs//UAe.1...................:::Reference ID: :::Template ID: 3d32b80d:::Details File:  :::Location: rsct,nim_control.C,1.39.1.18,5919             :::TS_NIM_ERROR_STUCK_ER NIM thread blocked Thread which was blocked receive thread Interval in seconds during which process was blocked 120 Interface name en0
Dec 27 02:54:52 OAsrv1 daemon:err|error topsvcs[991248]: (Recorded using libct_ffdc.a cv 2):::Error ID: 6BUfAx.w9AyC/sls//UAe.1...................:::Reference ID: :::Template ID: 3d32b80d:::Details File:  :::Location: rsct,nim_control.C,1.39.1.18,5919             :::TS_NIM_ERROR_STUCK_ER NIM thread blocked Thread which was blocked receive thread Interval in seconds during which process was blocked 113 Interface name rhdisk2
Dec 27 02:54:52 OAsrv1 daemon:err|error topsvcs[991248]: (Recorded using libct_ffdc.a cv 2):::Error ID: 6BUfAx.w9AyC/Het//UAe.1...................:::Reference ID: :::Template ID: 3d32b80d:::Details File:  :::Location: rsct,nim_control.C,1.39.1.18,5919             :::TS_NIM_ERROR_STUCK_ER NIM thread blocked Thread which was blocked netmon thread Interval in seconds during which process was blocked 58 Interface name en3
Dec 27 02:54:52 OAsrv1 daemon:err|error topsvcs[991248]: (Recorded using libct_ffdc.a cv 2):::Error ID: 6BUfAx.w9AyC/Fxt//UAe.1...................:::Reference ID: :::Template ID: 3d32b80d:::Details File:  :::Location: rsct,nim_control.C,1.39.1.18,5919             :::TS_NIM_ERROR_STUCK_ER NIM thread blocked Thread which was blocked command receive thread Interval in seconds during which process was blocked 128 Interface name en0
Dec 27 02:55:06 OAsrv1 daemon:err|error haemd[1347816]: LPP=PSSP,Fn=emd_gsi.c,SID=1.4.1.36,L#=1836,                                     haemd: 2521-034 Not responding to Group Services - terminating.
Dec 27 02:55:08 OAsrv1 local0:crit clstrmgrES[1310746]: Tue Dec 27 02:55:08 Removing 2 from ml_idx
Dec 27 02:55:18 OAsrv1 user:notice HACMP for AIX: EVENT START: node_down oasrv2
Dec 27 02:55:21 OAsrv1 user:notice HACMP for AIX: EVENT COMPLETED: node_down oasrv2 0
Dec 27 02:55:21 OAsrv1 user:notice HACMP for AIX: EVENT START: node_down_complete oasrv2
Dec 27 02:55:22 OAsrv1 user:notice HACMP for AIX: EVENT COMPLETED: node_down_complete oasrv2 0
Dec 27 02:55:22 OAsrv1 local0:crit clstrmgrES[1310746]: Tue Dec 27 02:55:22 createAndConnectClientSocket: Setting up commpath for connection /usr/es/sbin/cluster/HacmpRgRmWakeup
Dec 27 02:55:22 OAsrv1 local0:crit clstrmgrES[1310746]: Tue Dec 27 02:55:22 createAndConnectClientSocket : connect(/usr/es/sbin/cluster/HacmpRgRmWakeup) failed,  errno=2
Dec 27 02:55:47 OAsrv1 daemon:err|error topsvcs[991248]: (Recorded using libct_ffdc.a cv 2):::Error ID: 6BUfAx.nAAyC/Ezt0/UAe.1...................:::Reference ID: :::Template ID: 3d32b80d:::Details File:  :::Location: rsct,nim_control.C,1.39.1.18,5919             :::TS_NIM_ERROR_STUCK_ER NIM thread blocked Thread which was blocked send thread Interval in seconds during which process was blocked 35 Interface name rhdisk2
Dec 27 02:56:41 OAsrv1 daemon:err|error topsvcs[991248]: (Recorded using libct_ffdc.a cv 2):::Error ID: 6BUfAx.dBAyC/ehS0/UAe.1...................:::Reference ID: :::Template ID: 3d32b80d:::Details File:  :::Location: rsct,nim_control.C,1.39.1.18,5919             :::TS_NIM_ERROR_STUCK_ER NIM thread blocked Thread which was blocked send thread Interval in seconds during which process was blocked 36 Interface name rhdisk2
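
For anyone triaging a similar cluster.log, here is a minimal Python sketch that pulls the blocked thread, blocked interval, and interface out of the TS_NIM_ERROR_STUCK_ER entries and reports the longest block seen per interface. The helper names (StuckEvent, parse_stuck_events, longest_block_per_interface) are made up for illustration, and the sample lines are shortened copies of the entries above with the middle fields replaced by "...". The regular expression is only known to match the layout quoted in this thread.

```python
import re
from typing import NamedTuple

# Matches the tail of a topsvcs TS_NIM_ERROR_STUCK_ER entry as quoted in this thread.
STUCK_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s.*"
    r"TS_NIM_ERROR_STUCK_ER NIM thread blocked\s+"
    r"Thread which was blocked\s+(?P<thread>.+?)\s+"
    r"Interval in seconds during which process was blocked\s+(?P<secs>\d+)\s+"
    r"Interface name\s+(?P<iface>\S+)"
)


class StuckEvent(NamedTuple):
    stamp: str       # syslog timestamp, e.g. "Dec 27 02:54:51"
    thread: str      # e.g. "receive thread", "netmon thread"
    seconds: int     # how long the thread was reported blocked
    interface: str   # e.g. "en3", "rhdisk2"


def parse_stuck_events(lines):
    """Return one StuckEvent per matching log line; other lines are skipped."""
    events = []
    for line in lines:
        m = STUCK_RE.search(line)
        if m:
            events.append(
                StuckEvent(m["stamp"], m["thread"], int(m["secs"]), m["iface"])
            )
    return events


def longest_block_per_interface(events):
    """Map each interface name to the longest blocked interval seen for it."""
    worst = {}
    for ev in events:
        worst[ev.interface] = max(worst.get(ev.interface, 0), ev.seconds)
    return worst


# Shortened copies of entries from the excerpt above ("..." marks elided fields).
TAIL = (":::TS_NIM_ERROR_STUCK_ER NIM thread blocked Thread which was blocked {t} "
        "Interval in seconds during which process was blocked {s} Interface name {i}")
SAMPLE = [
    "Dec 27 02:54:51 OAsrv1 daemon:err|error topsvcs[991248]: ... "
    + TAIL.format(t="receive thread", s=120, i="en3"),
    "Dec 27 02:54:52 OAsrv1 daemon:err|error topsvcs[991248]: ... "
    + TAIL.format(t="receive thread", s=113, i="rhdisk2"),
    "Dec 27 02:54:52 OAsrv1 daemon:err|error topsvcs[991248]: ... "
    + TAIL.format(t="netmon thread", s=58, i="en3"),
    "Dec 27 02:55:06 OAsrv1 daemon:err|error haemd[1347816]: ... "
    "haemd: 2521-034 Not responding to Group Services - terminating.",
]

if __name__ == "__main__":
    for ev in parse_stuck_events(SAMPLE):
        print(ev)
    print(longest_block_per_interface(parse_stuck_events(SAMPLE)))
```

In this excerpt the network interfaces and the disk heartbeat device all report blocked threads within the same second or two, which is easier to see once the entries are reduced to these four fields.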

Below is an excerpt from the nim.topsvcs.rhdisk2.oa_cluster log for the relevant time window:
12/26 19:28:22.884: Heartbeat was NOT received. Missed HBs: 1. Limit: 8
12/27 02:53:05.845: Heartbeat was NOT received. Missed HBs: 1. Limit: 8
12/27 02:53:11.851: Heartbeat was NOT received. Missed HBs: 2. Limit: 8
12/27 02:53:17.851: Heartbeat was NOT received. Missed HBs: 3. Limit: 8
12/27 02:53:23.854: Heartbeat was NOT received. Missed HBs: 4. Limit: 8
12/27 02:53:29.854: Heartbeat was NOT received. Missed HBs: 5. Limit: 8
12/27 02:53:35.861: Heartbeat was NOT received. Missed HBs: 6. Limit: 8
12/27 02:53:41.866: Heartbeat was NOT received. Missed HBs: 7. Limit: 8
12/27 02:53:47.871: Heartbeat was NOT received. Missed HBs: 8. Limit: 8
12/27 02:53:47.871: Local adapter is up: issuing notification for remote adapter
12/27 02:53:47.871: Adapter status successfully sent.
12/27 02:54:51.116: Received a SEND MSG command. Dst: .
12/27 02:54:51.117: Received a SEND MSG command. Dst: .
12/27 02:54:51.117: Received a STOP HB command.
12/27 02:54:51.117: Received a STOP MONITOR command.
12/27 02:54:51.412: Receive thread blocked for 113 seconds.
12/27 02:54:51.412: nim error successfully sent.
12/27 02:54:51.677: Received a SEND MSG command. Dst: .
12/27 02:54:53.769: Received a SEND MSG command. Dst: .
12/27 02:54:55.820: Received a SEND MSG command. Dst: .
12/27 02:54:55.871: Received a SEND MSG command. Dst: .
12/27 02:54:55.922: Received a SEND MSG command. Dst: .
12/27 02:54:58.474: Received a SEND MSG command. Dst: .
12/27 02:55:01.824: Received a SEND MSG command. Dst: .
12/27 02:55:03.077: Received a SEND MSG command. Dst: .
12/27 02:55:11.844: Received a SEND MSG command. Dst: .
12/27 02:55:11.870: Received a START HB command. Destination: .
12/27 02:55:11.870: set_dhb_polling_rate(): Default poll speed 40
12/27 02:55:11.870: Received a SEND MSG command. Dst: .
12/27 02:55:11.870: Received a START MONITOR command.
12/27 02:55:11.870: Address:  How often: 6000 msec Sensitivity: 8 Configuration Instance: 43
12/27 02:55:11.870: Received a SEND MSG command. Dst: .
12/27 02:55:11.870: Received a SEND MSG command. Dst: .
12/27 02:55:12.970: Received a SEND MSG command. Dst: .
12/27 02:55:17.751: Received a SEND MSG command. Dst: .
12/27 02:55:17.751: Received a SEND MSG command. Dst: .
12/27 02:55:17.875: Received a SEND MSG command. Dst: .
12/27 02:55:22.877: Received a SEND MSG command. Dst: .
12/27 02:55:23.871: Heartbeat was NOT received. Missed HBs: 1. Limit: 8
12/27 02:55:25.881: Received a SEND MSG command. Dst: .
12/27 02:55:25.881: Received a SEND MSG command. Dst: .
12/27 02:55:27.877: Received a SEND MSG command. Dst: .
12/27 02:55:29.881: Heartbeat was NOT received. Missed HBs: 2. Limit: 8
12/27 02:55:32.877: Received a SEND MSG command. Dst: .
12/27 02:55:35.881: Heartbeat was NOT received. Missed HBs: 3. Limit: 8
12/27 02:55:36.857: Received a SEND MSG command. Dst: .
12/27 02:55:36.857: Received a SEND MSG command. Dst: .
12/27 02:55:37.877: Received a SEND MSG command. Dst: .
12/27 02:55:41.891: Heartbeat was NOT received. Missed HBs: 4. Limit: 8
12/27 02:55:41.961: Received a SEND MSG command. Dst: .
12/27 02:55:42.877: Received a SEND MSG command. Dst: .
12/27 02:55:47.711: writePacket(): Unable to write for too long
12/27 02:55:47.761: Send thread blocked for 35 seconds.
12/27 02:55:47.761: nim error successfully sent.
12/27 02:55:47.877: Received a SEND MSG command. Dst: .
12/27 02:55:47.901: Heartbeat was NOT received. Missed HBs: 5. Limit: 8
12/27 02:55:49.131: Received a SEND MSG command. Dst: .
12/27 02:55:49.131: Received a SEND MSG command. Dst: .
12/27 02:55:49.881: writePacket(): Unable to write for too long
12/27 02:55:52.049: writePacket(): Unable to write for too long
12/27 02:55:52.877: Received a SEND MSG command. Dst: .
12/27 02:55:53.911: Heartbeat was NOT received. Missed HBs: 6. Limit: 8
12/27 02:55:54.223: writePacket(): Unable to write for too long
12/27 02:55:56.401: writePacket(): Unable to write for too long
12/27 02:55:57.877: Received a SEND MSG command. Dst: .
12/27 02:55:58.561: writePacket(): Unable to write for too long
12/27 02:55:59.911: Heartbeat was NOT received. Missed HBs: 7. Limit: 8
12/27 02:56:00.725: writePacket(): Unable to write for too long
12/27 02:56:02.878: Received a SEND MSG command. Dst: .
12/27 02:56:02.887: writePacket(): Unable to write for too long
12/27 02:56:05.056: writePacket(): Unable to write for too long
12/27 02:56:05.106: 8 failed writes in a row - clearing send queue.
12/27 02:56:05.881: Received a SEND MSG command. Dst: .
12/27 02:56:05.881: Received a SEND MSG command. Dst: .
12/27 02:56:05.921: Heartbeat was NOT received. Missed HBs: 8. Limit: 8
12/27 02:56:05.921: Local adapter is up: issuing notification for remote adapter
12/27 02:56:05.921: Adapter status successfully sent.
12/27 02:56:05.921: Received a STOP HB command.
12/27 02:56:05.921: Received a STOP MONITOR command.
12/27 02:56:06.871: Received a SEND MSG command. Dst: .
12/27 02:56:16.878: Received a SEND MSG command. Dst: .
12/27 02:56:20.930: Received a SEND MSG command. Dst: .
12/27 02:56:26.881: Received a SEND MSG command. Dst: .
12/27 02:56:35.996: Received a SEND MSG command. Dst: .
12/27 02:56:36.891: Received a SEND MSG command. Dst: .
12/27 02:56:41.599: writePacket(): Unable to write for too long
12/27 02:56:41.649: Send thread blocked for 36 seconds.
12/27 02:56:41.649: nim error successfully sent.
12/27 02:56:43.760: writePacket(): Unable to write for too long
12/27 02:56:45.923: writePacket(): Unable to write for too long
12/27 02:56:46.895: Received a SEND MSG command. Dst: .
12/27 02:56:50.997: Received a SEND MSG command. Dst: .
12/27 02:56:52.287: writePacket(): Unable to write for too long
12/27 02:56:56.560: writePacket(): Unable to write for too long
12/27 02:56:56.901: Received a SEND MSG command. Dst: .
12/27 02:57:02.928: writePacket(): Unable to write for too long
12/27 02:57:06.006: Received a SEND MSG command. Dst: .
12/27 02:57:06.907: Received a SEND MSG command. Dst: .
12/27 02:57:09.294: writePacket(): Unable to write for too long
12/27 02:57:13.561: writePacket(): Unable to write for too long
12/27 02:57:16.911: Received a SEND MSG command. Dst: .
12/27 02:57:19.928: writePacket(): Unable to write for too long
12/27 02:57:19.978: 8 failed writes in a row - clearing send queue.
12/27 02:57:21.011: Received a SEND MSG command. Dst: .
12/27 02:57:26.919: Received a SEND MSG command. Dst: .
12/27 02:57:36.027: Received a SEND MSG command. Dst: .
12/27 02:57:36.921: Received a SEND MSG command. Dst: .
12/27 02:57:46.928: Received a SEND MSG command. Dst: .
12/27 02:57:51.028: Received a SEND MSG command. Dst: .
12/27 02:57:56.731: writePacket(): Unable to write for too long
12/27 02:57:56.932: Received a SEND MSG command. Dst: .
12/27 02:58:03.097: writePacket(): Unable to write for too long
12/27 02:58:06.042: Received a SEND MSG command. Dst: .
12/27 02:58:06.941: Received a SEND MSG command. Dst: .
12/27 02:58:09.468: writePacket(): Unable to write for too long
12/27 02:58:15.834: writePacket(): Unable to write for too long
12/27 02:58:16.961: Received a SEND MSG command. Dst: .
12/27 02:58:17.301: dhb_lost_handshake_fct(): Restarting handshaking
12/27 02:58:17.303: initHS(): Wrote initial handshake
12/27 02:58:21.051: Received a SEND MSG command. Dst: .
12/27 02:58:26.964: Received a SEND MSG command. Dst: .
12/27 02:58:36.071: Received a SEND MSG command. Dst: .
12/27 02:58:36.971: Received a SEND MSG command. Dst: .
12/27 02:58:46.991: Received a SEND MSG command. Dst: .
12/27 02:58:51.071: Received a SEND MSG command. Dst: .
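
The missed-heartbeat lines above can be grouped into episodes to see how long each run lasted and whether it reached the limit. The following is a rough Python sketch under a few assumptions: the function names are hypothetical, the log carries no year so 2011 is assumed from the posting date, and a counter value of 1 is treated as the start of a new episode. The sample lines are copied from the start of the excerpt.

```python
import re
from datetime import datetime

MISS_RE = re.compile(
    r"^(\d\d/\d\d \d\d:\d\d:\d\d\.\d{3}): Heartbeat was NOT received\. "
    r"Missed HBs: (\d+)\. Limit: (\d+)"
)


def parse_stamp(text, year=2011):
    """Parse 'MM/DD HH:MM:SS.mmm'; the year is an assumption, the log omits it."""
    return datetime.strptime(f"{year}/{text}", "%Y/%m/%d %H:%M:%S.%f")


def missed_hb_episodes(lines):
    """Group 'Heartbeat was NOT received' lines into runs of consecutive misses."""
    episodes = []
    for line in lines:
        m = MISS_RE.match(line)
        if not m:
            continue  # SEND MSG, writePacket, etc. are ignored here
        stamp = parse_stamp(m.group(1))
        missed, limit = int(m.group(2)), int(m.group(3))
        if missed == 1 or not episodes:
            # A counter of 1 means the previous run ended and a new one begins.
            episodes.append(
                {"start": stamp, "end": stamp, "missed": missed, "limit": limit}
            )
        else:
            ep = episodes[-1]
            ep["end"], ep["missed"], ep["limit"] = stamp, missed, limit
    return episodes


def reached_limit(episode):
    """True when the miss counter hit the configured limit in this episode."""
    return episode["missed"] >= episode["limit"]


SAMPLE = [
    "12/26 19:28:22.884: Heartbeat was NOT received. Missed HBs: 1. Limit: 8",
    "12/27 02:53:05.845: Heartbeat was NOT received. Missed HBs: 1. Limit: 8",
    "12/27 02:53:11.851: Heartbeat was NOT received. Missed HBs: 2. Limit: 8",
    "12/27 02:53:17.851: Heartbeat was NOT received. Missed HBs: 3. Limit: 8",
    "12/27 02:53:23.854: Heartbeat was NOT received. Missed HBs: 4. Limit: 8",
    "12/27 02:53:29.854: Heartbeat was NOT received. Missed HBs: 5. Limit: 8",
    "12/27 02:53:35.861: Heartbeat was NOT received. Missed HBs: 6. Limit: 8",
    "12/27 02:53:41.866: Heartbeat was NOT received. Missed HBs: 7. Limit: 8",
    "12/27 02:53:47.871: Heartbeat was NOT received. Missed HBs: 8. Limit: 8",
    "12/27 02:53:47.871: Local adapter is up: issuing notification for remote adapter",
]

if __name__ == "__main__":
    for ep in missed_hb_episodes(SAMPLE):
        span = (ep["end"] - ep["start"]).total_seconds()
        print(ep["start"], ep["missed"], ep["limit"], span, reached_limit(ep))
```

The misses in the excerpt are spaced roughly 6 seconds apart, which matches the "How often: 6000 msec Sensitivity: 8" line in the same log. Running the same grouping over a full night's log would show whether the runs that hit the limit line up with the backup window.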

2# Posted 2011-12-29 15:38
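The host is probably too busy. The HA on the two sides got out of sync, so one of the instances was shut down.
1. Check system resources, and see whether Oracle's resource usage before the failure can be tuned.
2. Look for known Oracle bugs.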

3# Posted 2011-12-29 17:00
Reply to 2# 李仁杰

Sorry, I left out one detail: the workload running here is Domino.

4# Posted 2011-12-29 21:12
It may be the disk heartbeat: heavy disk I/O could be keeping the disk heartbeat messages from getting through.

5# Posted 2012-01-04 15:25
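Tonight I upgraded HACMP to 5.4.1.11 and adjusted several parameters related to the DMS (deadman switch),
including: I/O pacing changed to HIGH water mark = 33, LOW water mark = 24
      syncd frequency changed to 10
      failure detection rate for both ether and diskhb changed to Slow
After the changes I restarted everything and then ran a full backup, because that operation puts heavy load on the system
and all the earlier errors appeared during backups. During this backup diskhb still logged 1 Missed HBs.
I checked all the related logs again and there are no errors for now.
All I can do is keep watching. If any expert can offer further suggestions, thank you!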
---------------------------------------------------------------------------------
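Since I made the changes described above on the evening of the 31st last month, there have still been 2 heartbeat anomalies so far.
lssrc -ls topsvcs shows diskhb with "Missed HBs: Total: 2 Current group: 2".
But none of the other related logs record any errors.
The situation seems to have improved, yet a residual problem remains, which is strange.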