Sun Cluster: fixing an abnormal IPMP state when the network ports are physically fine
I. Symptoms:
1. The IPMP status reported by Sun Cluster is abnormal:
root@ABCSERVER1 # scstat -i
-- IPMP Groups --
Node Name Group Status Adapter Status
--------- ----- ------ ------- ------
IPMP Group: ABCSERVER1 ipmp1 Online qfe1 Offline
IPMP Group: ABCSERVER1 ipmp1 Online ce1 Online
IPMP Group: ABCSERVER2 ipmp1 Online qfe1 Offline
IPMP Group: ABCSERVER2 ipmp1 Online ce1 Online
2. Pinging the port that the group shows as Offline reports it as alive:
/userhome/abcapp$ ping 10.xx.x.233
10.xx.x.233 is alive
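The check in step 1 can be scripted so that Offline adapters stand out. This is only a minimal sketch, assuming the `scstat -i` column layout shown above (node in field 3, adapter in field 6, adapter status in field 7). The function name `offline_adapters` is made up for illustration.

```shell
#!/bin/sh
# Sketch: list adapters that "scstat -i" reports as Offline.
# Assumption: each data line looks like
#   IPMP Group: <node> <group> <group status> <adapter> <adapter status>
# offline_adapters is a hypothetical helper name, not a Sun Cluster command.

offline_adapters() {
    # Reads scstat -i output on stdin, prints "<node> <adapter>" per Offline adapter.
    awk '$1 == "IPMP" && $2 == "Group:" && $7 == "Offline" { print $3, $6 }'
}

# Usage on a cluster node:
#   scstat -i | offline_adapters
```

An adapter that shows up here while its address still answers ping is the situation described in this post.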
II. Diagnosis:
Check the system messages:
root@ABCSERVER1 # dmesg
Thu Nov 29 09:59:57 CST 2007
…
Nov 29 01:34:43 ABCSERVER1 in.mpathd[436]: [ID 168056 daemon.error] All Interfaces in group ipmp1 have failed
Nov 29 01:34:43 ABCSERVER1 Cluster.PNM: [ID 890413 daemon.notice] ipmp1: state transition from OK to DOWN.
…
Nov 29 01:46:58 ABCSERVER1 genunix: [ID 408789 kern.notice] NOTICE: ce1: fault cleared external to device; service available
Nov 29 01:46:58 ABCSERVER1 genunix: [ID 451854 kern.notice] NOTICE: ce1: xcvr addr:0x01 - link up 100 Mbps full duplex
Nov 29 01:46:58 ABCSERVER1 in.mpathd[436]: [ID 820239 daemon.error] The link has come up on ce1
Nov 29 01:46:59 ABCSERVER1 qfe: [ID 517869 kern.info] SUNW,qfe1: 100 Mbps full duplex link up - internal transceiver
Nov 29 01:47:25 ABCSERVER1 in.mpathd[436]: [ID 620804 daemon.error] Successfully failed back to NIC ce1
Nov 29 01:47:25 ABCSERVER1 in.mpathd[436]: [ID 299542 daemon.error] NIC repair detected on ce1 of group ipmp1
Nov 29 01:47:25 ABCSERVER1 in.mpathd[436]: [ID 237757 daemon.error] At least 1 interface (ce1) of group ipmp1 has repaired
Nov 29 01:47:25 ABCSERVER1 Cluster.PNM: [ID 890413 daemon.notice] ipmp1: state transition from DOWN to OK.
Nov 29 01:47:25 ABCSERVER1 in.mpathd[436]: [ID 832587 daemon.error] Successfully failed over from NIC qfe1 to NIC ce1
…
Nov 29 01:47:25 ABCSERVER1 Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource ora-service status msg on node ABCSERVER1 change to <LogicalHostname online.>
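Judging from the messages, after the network was disconnected Sun Cluster confirmed that both ports had failed, recorded the ipmp1 state transition to DOWN, and marked server-rs in the resource group as DEGRADED. After the sanity check failed, Sun Cluster decided not to fail over. When the network was reconnected, both ports came back with link up, ipmp1 transitioned from DOWN back to OK, and service moved from qfe1 to ce1. The server-rs status then changed to online, the node status changed to online, and Sun Cluster considered its handling complete.
So it looks as if Sun Cluster skipped a step: it seems that switching the network service from ce1 back to qfe1 would be enough to bring the IPMP state back to normal.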
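One way to cross-check the log against `scstat -i` is to list the NICs for which in.mpathd actually logged "NIC repair detected". In the lines quoted above only ce1 has such a message, which fits qfe1 staying Offline. This is a rough sketch, assuming the message wording shown in this post and a POSIX awk. `repaired_nics` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print the NICs for which in.mpathd logged a repair.
# Assumption: messages read "... NIC repair detected on <nic> of group <group>",
# as in the log excerpt in this post.
# repaired_nics is a hypothetical helper name.

repaired_nics() {
    # Reads syslog lines on stdin, prints each repaired NIC name once.
    awk '/in\.mpathd/ && /NIC repair detected on/ {
        for (i = 1; i + 2 <= NF; i++)
            if ($i == "detected" && $(i + 1) == "on")
                print $(i + 2)
    }' | sort -u
}

# Usage on Solaris, where syslog messages are kept in /var/adm/messages:
#   repaired_nics < /var/adm/messages
```

Any adapter in the group that is missing from this list, although its link-up message is in the log, is a candidate for the problem described here.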
III. Resolution:
1. Try to force qfe1 up:
root@ABCSERVER1 # ifconfig qfe1:1 up
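No effect; IPMP still shows the adapter as Offline.
2. Look up the description of the in.mpathd process in Chapter 1 of SA299, then run the following command to make in.mpathd reread its configuration: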
root@ABCSERVER1 # pkill -HUP in.mpathd
3. After waiting a moment, the following messages appear in dmesg:
Nov 29 09:20:30 ABCSERVER1 last message repeated 1 time
Nov 29 09:36:16 ABCSERVER1 in.mpathd[436]: [ID 111610 daemon.error] SIGHUP: restart and reread config file
Nov 29 09:37:10 ABCSERVER1 in.mpathd[4902]: [ID 620804 daemon.error] Successfully failed back to NIC qfe1
Nov 29 09:37:10 ABCSERVER1 in.mpathd[4902]: [ID 299542 daemon.error] NIC repair detected on qfe1 of group ipmp1
Check the IPMP status:
root@ABCSERVER1 # scstat -i
-- IPMP Groups --
Node Name Group Status Adapter Status
--------- ----- ------ ------- ------
IPMP Group: ABCSERVER1 ipmp1 Online qfe1 Online
IPMP Group: ABCSERVER1 ipmp1 Online ce1 Online
IPMP Group: ABCSERVER2 ipmp1 Online qfe1 Offline
IPMP Group: ABCSERVER2 ipmp1 Online ce1 Online
Repeat the same procedure on the standby node; after that, all statuses are normal.
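Steps 2 and 3 could be wrapped into one small function to run on each node. Treat this as a sketch, not a tested procedure. It assumes it is run as root on a cluster node, and it passes the bare process name to pkill, since pkill matches on the process name by default. The 60-second default wait is only a guess based on the roughly one-minute gap between the SIGHUP and the failback in the log above, and `reread_and_recheck` is a hypothetical name.

```shell
#!/bin/sh
# Sketch: ask in.mpathd to reread its configuration, wait, then list any
# adapters that scstat -i still reports as Offline.
# Assumptions: run as root on a cluster node; the default wait of 60 seconds
# is a guess, not a documented value; reread_and_recheck is a hypothetical name.

reread_and_recheck() {
    wait_secs=${1:-60}

    # Send SIGHUP to in.mpathd; pkill returns non-zero if nothing matched.
    pkill -HUP in.mpathd || {
        echo "no in.mpathd process was signalled" >&2
        return 1
    }

    # Give in.mpathd time to restart and re-probe the interfaces.
    sleep "$wait_secs"

    # Print "<node> <adapter>" for every adapter still marked Offline.
    scstat -i | awk '$1 == "IPMP" && $2 == "Group:" && $7 == "Offline" { print $3, $6 }'
}

# Usage:
#   reread_and_recheck        # wait 60 seconds
#   reread_and_recheck 120    # wait longer on a slow network
```

Empty output means no adapter is reported Offline any more. Adapters on the other node keep appearing until the same step has been run there.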
Comments and discussion are welcome.