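七杀书生 posted on 2013-06-12 12:45

A really miserable Dragon Boat Festival: two SUN T5440s went down at 3 a.m., please help analyse the cause

This post was last edited by 七杀书生 on 2013-06-12 12:58

A Dragon Boat Festival miserable enough to make me spit blood. Around 3 a.m. today I got a call from the user saying that two SUN T5440s had gone down, the database running under sun cluster 3.4u3 was unavailable, and the business was interrupted. I got up at once and rushed to the site.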
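1. On site, the indicator lights on both hosts looked normal, but the NICs and fibre channel cards at the back were all showing amber lights. I connected to db1 over the serial port, found I did not have the admin password, and several guesses failed. I could have logged in as root to the -> prompt and created a new user, but to save time I quickly moved the serial connection to db2, got into sc mode without trouble, and then used console to reach the operating system. The system was running normally but the network was down. ifconfig -a showed no IP address information. I tried to view the hosts file under /etc/ and it could not be opened. ls -l showed that hosts is a symbolic link, and the file was not in the ./inet directory either. Had the hosts file been lost? I do not know why. uptime showed the machine had been up for only 2:55, which means the system had gone down and rebooted about 3 hours earlier. I rebooted db2 and it still had no IP configuration. I then set about restoring the hosts configuration, but there was no backup of the original hosts file, this is a two-node cluster, and I did not know how hosts was configured on db1, so the only option was to get the hosts configuration from db1.
2. I then connected the serial port to db1, logged in as root to the -> prompt, created a new account admin1, and used that account to enter sc mode. I tried console -f to reach the operating system but got no response, so the operating system had presumably not started properly. Back in sc mode I shut db1 down with poweroff and brought it up again with poweron. The operating system started to boot normally but stopped at two warnings, both saying that the other node in the cluster was down or unreachable, and it just stayed there. Evidently it could not boot in cluster mode.
3. I went back to sc mode again, set auto-boot to false, entered ok mode and ran boot -x to boot in non-cluster mode. Once I was in db1's operating system, I used its hosts file as the reference to rebuild the hosts file on db2, then rebooted db2.
4. db2's operating system booted normally. The cluster status showed db2 online but db1 offline. The oracle processes were running normally.
5. I connected the serial port to db1 again, ran init 0 to shut the system down and enter ok mode, then ran boot to bring it up in cluster mode. After db1 booted successfully, the cluster status showed both hosts online, and I then manually switched the disk groups and resource groups on db2 back to db1.
6. At this point the fault was fully recovered, but now I would like to analyse the cause of the crash. My own skills are limited, so I am asking the experts here for help. Please help me analyse the messages files. Many, many thanks!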
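
The broken /etc/hosts symlink from step 1 (the link was there but the file under ./inet was gone) is the kind of thing a small periodic check could catch before a failover needs it. A minimal sketch, assuming Python is available on the node; the function name check_symlink is made up for illustration and is not part of Solaris or Sun Cluster:

```python
import os

def check_symlink(path):
    """Classify a path as 'ok', 'dangling', 'regular', or 'missing'."""
    if os.path.islink(path):
        # os.path.exists() follows the link, so False here means
        # the link is present but its target file is gone.
        return "ok" if os.path.exists(path) else "dangling"
    # Not a symlink: either a plain file/directory or nothing at all.
    return "regular" if os.path.exists(path) else "missing"
```

Calling check_symlink("/etc/hosts") on db2 in the state described above would be expected to report "dangling", since ls -l showed the link while the target was absent.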


messages from db2:
Jun 12 02:27:31 bimsnewdb2 ip: TCP_IOC_ABORT_CONN: local = 000.000.000.000:0, remote = 172.016.004.001:0, start = -2, end = 6
Jun 12 02:27:31 bimsnewdb2 cl_runtime: NOTICE: clcomm: Path bimsnewdb2:nxge0 - bimsnewdb1:nxge0 being drained
Jun 12 02:27:31 bimsnewdb2 cl_runtime: NOTICE: clcomm: Path bimsnewdb2:nxge1 - bimsnewdb1:nxge1 being drained
Jun 12 02:27:31 bimsnewdb2 ip: TCP_IOC_ABORT_CONN: aborted 0 connection
Jun 12 02:27:37 bimsnewdb2 cl_runtime: NOTICE: CMM: Node bimsnewdb1 (nodeid = 1) is down.
Jun 12 02:27:37 bimsnewdb2 cl_runtime: NOTICE: CMM: Cluster members: bimsnewdb2.
Jun 12 02:27:37 bimsnewdb2 Cluster.RGM.global.rgmd: remote node bimsnewdb1 died
Jun 12 02:27:37 bimsnewdb2 cl_runtime: NOTICE: CMM: node reconfiguration #6 completed.
Jun 12 02:27:37 bimsnewdb2 cl_runtime: NOTICE: CMM: Quorum device /dev/did/rdsk/d7s2: owner set to node 2.
Jun 12 02:27:37 bimsnewdb2 Cluster.Framework: stdout: fencing node bimsnewdb1 from shared devices
Jun 12 02:27:37 bimsnewdb2 Cluster.Framework: stdout: becoming primary for oracle_dg
Jun 12 02:27:37 bimsnewdb2 Cluster.Framework: stdout: becoming primary for temp_dg
Jun 12 02:27:37 bimsnewdb2 Cluster.Framework: stdout: becoming primary for dbbak_dg
Jun 12 02:27:37 bimsnewdb2 Cluster.RGM.global.rgmd: resource ghcatemp-hastp-rs state on node bimsnewdb1 change to R_OFFLINE
Jun 12 02:27:37 bimsnewdb2 Cluster.RGM.global.rgmd: resource dbbak-hastp-rs state on node bimsnewdb1 change to R_OFFLINE
Jun 12 02:27:37 bimsnewdb2 Cluster.RGM.global.rgmd: resource oracle-server-rs state on node bimsnewdb1 change to R_OFFLINE
Jun 12 02:27:37 bimsnewdb2 Cluster.RGM.global.rgmd: resource oracle-lsnr-rs state on node bimsnewdb1 change to R_OFFLINE
Jun 12 02:27:37 bimsnewdb2 Cluster.RGM.global.rgmd: resource oracle-hastp-rs state on node bimsnewdb1 change to R_OFFLINE
Jun 12 02:27:37 bimsnewdb2 Cluster.RGM.global.rgmd: resource oracle_ip-rs state on node bimsnewdb1 change to R_OFFLINE
Jun 12 02:27:37 bimsnewdb2 Cluster.RGM.global.rgmd: resource group oracle-rg state on node bimsnewdb1 change to RG_OFFLINE
Jun 12 02:27:37 bimsnewdb2 Cluster.RGM.global.rgmd: resource group oracle-rg state on node bimsnewdb2 change to RG_PENDING_ONLINE
Jun 12 02:27:37 bimsnewdb2 Cluster.CCR: reservation message(fence_node) - Fencing node 1 from disk /dev/did/rdsk/d6s2
Jun 12 02:27:37 bimsnewdb2 Cluster.CCR: reservation message(fence_node) - Fencing node 1 from disk /dev/did/rdsk/d8s2
Jun 12 02:27:37 bimsnewdb2 Cluster.CCR: reservation warning(fence_node) - Unable to open device /dev/did/rdsk/d8s2, will retry in 2 seconds
Jun 12 02:27:37 bimsnewdb2 Cluster.CCR: reservation message(fence_node) - Fencing node 1 from disk /dev/did/rdsk/d14s2
Jun 12 02:27:37 bimsnewdb2 Cluster.CCR: reservation message(fence_node) - Fencing node 1 from disk /dev/did/rdsk/d15s2
Jun 12 02:27:39 bimsnewdb2 Cluster.CCR: reservation warning(fence_node) - Unable to open device /dev/did/rdsk/d8s2, will retry in 2 seconds
Jun 12 02:27:41 bimsnewdb2 last message repeated 1 time
Jun 12 02:27:43 bimsnewdb2 Cluster.CCR: reservation error(fence_node) - Unable to open device /dev/did/rdsk/d8s2
Jun 12 02:27:44 bimsnewdb2 Cluster.RGM.global.rgmd: resource oracle_ip-rs status on node bimsnewdb2 change to R_FM_UNKNOWN
Jun 12 02:27:44 bimsnewdb2 Cluster.RGM.global.rgmd: resource oracle_ip-rs status msg on node bimsnewdb2 change to <Starting>
Jun 12 02:27:44 bimsnewdb2 Cluster.RGM.global.rgmd: launching method <hafoip_prenet_start> for resource <oracle_ip-rs>, resource group <oracle-rg>, node <bimsnewdb2>, timeout <300> seconds
Jun 12 02:27:44 bimsnewdb2 SC[,SUNW.LogicalHostname:3,oracle-rg,oracle_ip-rs,hafoip_prenet_start]: Hostname lookup failed for oracle_ip: Unknown host
Jun 12 02:27:44 bimsnewdb2 Cluster.RGM.global.rgmd: resource oracle_ip-rs status on node bimsnewdb2 change to R_FM_FAULTED
Jun 12 02:27:44 bimsnewdb2 Cluster.RGM.global.rgmd: resource oracle_ip-rs status msg on node bimsnewdb2 change to <Failed to obtain list of IP addresses for this resource.>
Jun 12 02:27:44 bimsnewdb2 Cluster.RGM.global.rgmd: Method <hafoip_prenet_start> failed on resource <oracle_ip-rs> in resource group <oracle-rg>
Jun 12 02:27:44 bimsnewdb2 Cluster.RGM.global.rgmd: resource oracle_ip-rs state on node bimsnewdb2 change to R_START_FAILED
Jun 12 02:27:44 bimsnewdb2 Cluster.RGM.global.rgmd: resource group oracle-rg state on node bimsnewdb2 change to RG_PENDING_OFF_START_FAILED
Jun 12 02:27:44 bimsnewdb2 Cluster.RGM.global.rgmd: resource oracle_ip-rs state on node bimsnewdb2 change to R_STOPPING
Jun 12 02:27:44 bimsnewdb2 Cluster.RGM.global.rgmd: launching method <hafoip_stop> for resource <oracle_ip-rs>, resource group <oracle-rg>, node <bimsnewdb2>, timeout <300> seconds
Jun 12 02:27:44 bimsnewdb2 Cluster.RGM.global.rgmd: resource oracle_ip-rs status on node bimsnewdb2 change to R_FM_UNKNOWN
Jun 12 02:27:44 bimsnewdb2 Cluster.RGM.global.rgmd: resource oracle_ip-rs status msg on node bimsnewdb2 change to <Stopping>
Jun 12 02:27:45 bimsnewdb2 SC[,SUNW.LogicalHostname:3,oracle-rg,oracle_ip-rs,hafoip_stop]: Hostname lookup failed for oracle_ip: Unknown host
Jun 12 02:27:45 bimsnewdb2 Cluster.RGM.global.rgmd: resource oracle_ip-rs status on node bimsnewdb2 change to R_FM_FAULTED
Jun 12 02:27:45 bimsnewdb2 Cluster.RGM.global.rgmd: resource oracle_ip-rs status msg on node bimsnewdb2 change to <Failed to obtain list of IP addresses for this resource.>
Jun 12 02:27:45 bimsnewdb2 Cluster.RGM.global.rgmd: Method <hafoip_stop> failed on resource <oracle_ip-rs> in resource group <oracle-rg>
Jun 12 02:27:45 bimsnewdb2 Cluster.RGM.global.rgmd: resource oracle_ip-rs state on node bimsnewdb2 change to R_STOP_FAILED
Jun 12 02:27:45 bimsnewdb2 Cluster.RGM.global.rgmd: fatal: Aborting node bimsnewdb2 because method <hafoip_stop> failed on resource <oracle_ip-rs> and Failover_mode is set to HARD
Jun 12 02:29:29 bimsnewdb2 genunix: ^MSunOS Release 5.10 Version Generic_142900-03 64-bit
Jun 12 02:29:29 bimsnewdb2 genunix: Copyright 1983-2009 Sun Microsystems, Inc.All rights reserved.
Jun 12 02:29:29 bimsnewdb2 Use is subject to license terms.
Jun 12 02:29:29 bimsnewdb2 genunix: Ethernet address = 0:21:28:6c:95:20
Jun 12 02:29:29 bimsnewdb2 unix: NOTICE: Kernel Cage is ENABLED
Jun 12 02:29:29 bimsnewdb2 unix: mem = 33259520K (0x7ee000000)
Jun 12 02:29:29 bimsnewdb2 unix: avail mem = 33068572672
Jun 12 02:29:29 bimsnewdb2 rootnex: root nexus = T5440
Jun 12 02:29:29 bimsnewdb2 rootnex: pseudo0 at root
Jun 12 02:29:29 bimsnewdb2 genunix: pseudo0 is /pseudo
Jun 12 02:29:29 bimsnewdb2 scsi: /scsi_vhci (scsi_vhci0):
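
Read as a timeline, the db2 log above shows the peer being declared down, the hostname lookup for oracle_ip failing in both hafoip_prenet_start and hafoip_stop, and rgmd then aborting bimsnewdb2 because Failover_mode is set to HARD, followed by the reboot banner. A minimal sketch for pulling those lines out of a messages file; the patterns are hand-picked from this log only, and the labels and the function name key_events are made up for illustration:

```python
import re

# Syslog prefix: "Mon DD HH:MM:SS host rest-of-message"
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) (.*)$")

# Illustrative patterns taken from the db2 log above, not an exhaustive list.
KEY_PATTERNS = [
    ("node_down", re.compile(r"CMM: Node (\S+) \(nodeid = \d+\) is down")),
    ("lookup_failed", re.compile(r"Hostname lookup failed for (\S+): Unknown host")),
    ("node_abort", re.compile(r"fatal: Aborting node (\S+) because")),
    ("method_failed", re.compile(r"Method <(\w+)> failed on resource <([\w-]+)>")),
    ("reboot", re.compile(r"genunix: .*SunOS Release")),
]

def key_events(lines):
    """Return (timestamp, host, label) for each line matching a key pattern."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue  # not a syslog-style line
        stamp, host, msg = m.groups()
        for label, pattern in KEY_PATTERNS:
            if pattern.search(msg):
                events.append((stamp, host, label))
                break  # one label per line is enough for a timeline
    return events
```

Feeding it the lines of /var/adm/messages (for example via open(path).readlines()) gives a compact list that can be compared against the same time window on the other node.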

七杀书生 posted on 2013-06-12 12:46

messages from db1:
Jun 12 02:27:30 bimsnewdb1 cl_dlpitrans: Notifying cluster that this node is panicking
Jun 12 02:27:30 bimsnewdb1 unix:
Jun 12 02:27:30 bimsnewdb1 ^Mpanic/thread=3008b8a41e0:
Jun 12 02:27:30 bimsnewdb1 unix: mod_rele_dev_by_major: Unheld driver: major number <85>
Jun 12 02:27:30 bimsnewdb1 unix:
Jun 12 02:27:30 bimsnewdb1 genunix: 000002a10a744e20 genunix:mod_rele_dev_by_major+88 (55, 127cc00, 253, 0, 195a858, 300015c42a0)
Jun 12 02:27:30 bimsnewdb1 genunix:    %l0-3: 0000000000000055 0000000001901800 00000300015c4288 00000300015c2000
Jun 12 02:27:30 bimsnewdb1   %l4-7: 0000000000002288 0000000001903c00 0000000000000451 00000000000004a6
Jun 12 02:27:30 bimsnewdb1 genunix: 000002a10a744ed0 genunix:dev_to_instance+7c (18f8c00, ffffffff, 55, ffffffffffffffff, 55, 0)
Jun 12 02:27:30 bimsnewdb1 genunix:    %l0-3: 0000000070768e08 000000000000002f 000000000000000e 0000000001fc6c8a
Jun 12 02:27:30 bimsnewdb1   %l4-7: 00000300000c8280 ffffffffffffffff 0000000001387638 0000000000000159
Jun 12 02:27:30 bimsnewdb1 genunix: 000002a10a744f90 genunix:e_ddi_hold_devi_by_dev+14 (550000212d, 0, 0, 55, 9b509682, 9b509e3b)
Jun 12 02:27:30 bimsnewdb1 genunix:    %l0-3: 000000011b50a624 000000000000008f 0000000000000035 0000000000000000
Jun 12 02:27:30 bimsnewdb1   %l4-7: 0000000000000001 fffffffffffff747 00000000000008b9 0000000000070400
Jun 12 02:27:30 bimsnewdb1 genunix: 000002a10a745050 specfs:spec_open+84 (2a10a745278, 2001, 30060406e80, 0, 3006b492de8, 6)
Jun 12 02:27:30 bimsnewdb1 genunix:    %l0-3: 0000000000000000 00000300353abb00 00000600409bf010 0000030038353e40
Jun 12 02:27:30 bimsnewdb1   %l4-7: 000000009b509e52 00000000000007d0 000003003f256b58 000000550000212d
Jun 12 02:27:30 bimsnewdb1 genunix: 000002a10a745110 genunix:fop_open+78 (2a10a745278, 3, 30060406e80, 2001, 300353abb00, 300353abb00)
Jun 12 02:27:30 bimsnewdb1 genunix:    %l0-3: 0000060031cd59c0 000000000000008f 0000000000000020 00000000013b570c
Jun 12 02:27:30 bimsnewdb1   %l4-7: 00000300000f8280 00000000018b19b8 00000000013b570b 0000000000000004
Jun 12 02:27:30 bimsnewdb1 genunix: 000002a10a7451c0 pxfs:__1cMio_repl_implEopen6MinH_A_out_4nHpxfs_v1Efobj_Cpn0C___rn0BJfobj_info_pnGsolobjEcred_rnFCORBALEnvironment__v_+d0 (2a10a745610, 2001, 2a10a745340, 7a720478, 3002ed71688, 2a10a745680)
Jun 12 02:27:30 bimsnewdb1 genunix:    %l0-3: 000000007092ccd8 000003003781c5d8 000003003781c540 000003003781c550
Jun 12 02:27:30 bimsnewdb1   %l4-7: 0000000000000228 0000030060406e80 000000007a623e50 00000300820eeab8
Jun 12 02:27:30 bimsnewdb1 genunix: 000002a10a745280 pxfs:__1cNdevice_serverXget_open_device_fobj_v26MrknHpxfs_v1Gpvnode_ipnGsolobjEcred_nH_A_out_4n0BEfobj_Cpn0F___rn0BJfobj_info_rnFCORBALEnvironment__v_+1fc (600380b28f0, 2a10a7456e8, 2001, 3002ed71688, 2a10a745410, 2a10a745610)
Jun 12 02:27:30 bimsnewdb1 genunix:    %l0-3: 00000300820eeab8 000002a10a745680 0000000000000001 000000007a7223f4
Jun 12 02:27:30 bimsnewdb1   %l4-7: 000000007092cb98 000002a10a7456e0 0000000000000004 0000000000000000
Jun 12 02:27:30 bimsnewdb1 genunix: 000002a10a745350 pxfs:__1cXdevice_server_repl_implXget_open_device_fobj_v26MrknHpxfs_v1Gpvnode_ipnGsolobjEcred_nH_A_out_4n0BEfobj_Cpn0F___rn0BJfobj_info_rnFCORBALEnvironment__v_+28 (600380b28e8, 2a10a7456e8, 2001, 2a10a745680, 2a10a7454e0, 2a10a7456e0)
Jun 12 02:27:30 bimsnewdb1 genunix:    %l0-3: 0000000000000001 0000000000070400 0000000000000000 00000000707cb000
Jun 12 02:27:30 bimsnewdb1   %l4-7: 0000000070742000 0000000000000002 00000000708cae90 000000007a6208c0
Jun 12 02:27:30 bimsnewdb1 genunix: 000002a10a745420 cl_dcs:__1cDmdcSdevice_server_stubXget_open_device_fobj_v26MrknHpxfs_v1Gpvnode_ipnGsolobjEcred_nH_A_out_4n0CEfobj_Cpn0G___rn0CJfobj_info_rnFCORBALEnvironment__v_+e0 (6004716a808, 2a10a7456e8, 2001, 3002ed71688, 2a10a745670, 2a10a745610)
Jun 12 02:27:30 bimsnewdb1 genunix:    %l0-3: 00000000000708ca 000000007a68ae88 000000007ae2e148 00000000707a1518
Jun 12 02:27:30 bimsnewdb1   %l4-7: 000002a10a745680 00000600471d2500 00000000708cadec 0000000000070800
Jun 12 02:27:30 bimsnewdb1 genunix: 000002a10a745520 pxfs:__1cJpxspecialEopen6MppnFvnode_ipnEcred__i_+368 (3, 3002ed71688, 30060406e80, 2a10a745680, 2a10a7456e8, 708ceff0)
Jun 12 02:27:30 bimsnewdb1 genunix:    %l0-3: 00000000708ef8f0 0000000000000001 000006004716a808 000003002eb9a400
Jun 12 02:27:30 bimsnewdb1   %l4-7: 00000000708ca8d0 00000000708ca8f8 00000000708ca6c0 00000000708ca780
Jun 12 02:27:30 bimsnewdb1 genunix: 000002a10a745760 genunix:fop_open+78 (2a10a745930, 3, 30060406e80, 2001, 30098f6ab40, 30098f6ab40)
Jun 12 02:27:30 bimsnewdb1 genunix:    %l0-3: 000003001d2d6940 0000000000002000 0000000000000000 00000300060c81b0
Jun 12 02:27:30 bimsnewdb1   %l4-7: 00000008039830f0 000002a10a7457f0 0000000000000000 0000000000000004
Jun 12 02:27:30 bimsnewdb1 genunix: 000002a10a745810 genunix:vn_openat+500 (0, 0, 1, 0, 2001, 7fffffff)
Jun 12 02:27:30 bimsnewdb1 genunix:    %l0-3: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Jun 12 02:27:30 bimsnewdb1   %l4-7: 0000000000000000 0000000000002000 0000000000000000 0000000000000000
Jun 12 02:27:30 bimsnewdb1 genunix: 000002a10a7459d0 genunix:copen+260 (ffffffffffd19553, 8039830f0, 0, 7ffffc00, 0, 2001)
Jun 12 02:27:30 bimsnewdb1 genunix:    %l0-3: 0000000000000000 0000000000000000 0000000063270000 0000000000000000
Jun 12 02:27:30 bimsnewdb1   %l4-7: 00000000018f5000 0000000000000c00 0000000000000012 000003005f5784c0
Jun 12 02:27:30 bimsnewdb1 unix:
Jun 12 02:27:30 bimsnewdb1 genunix: syncing file systems...
Jun 12 02:27:33 bimsnewdb1 genunix: 15
Jun 12 02:27:36 bimsnewdb1 genunix: 13
Jun 12 02:28:45 bimsnewdb1 last message repeated 20 times
Jun 12 02:28:46 bimsnewdb1 genunix: done (not all i/o completed)
Jun 12 02:28:47 bimsnewdb1 genunix: dumping to /dev/md/dsk/d110, offset 6873219072, content: kernel
Jun 12 02:34:15 bimsnewdb1 genunix: ^M100% done: 460397 pages dumped, compression ratio 2.78,
Jun 12 02:34:15 bimsnewdb1 genunix: dump succeeded
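
To see which kernel modules the db1 panic passed through, the frame lines in the stack above can be reduced to module:symbol pairs. In this trace that gives an open() call going through pxfs and cl_dcs into specfs and ending in mod_rele_dev_by_major. The first argument of the top frame, 55 in hex, is 85 in decimal, which matches the major number <85> in the panic string. A minimal sketch, with the regular expression written only against the frame format seen in this log; the function name stack_frames is made up for illustration:

```python
import re

# A frame line looks like: "<16 hex digits> module:symbol+offset (args...)"
FRAME_RE = re.compile(r"\b[0-9a-f]{16} (\w+):(\S+?)\+[0-9a-f]+ \(")

def stack_frames(lines):
    """Return (module, symbol) pairs in the order they appear in the log."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m:
            # Register dump lines (%l0-3, %l4-7) have no module:symbol
            # part, so they never reach this branch.
            frames.append((m.group(1), m.group(2)))
    return frames
```

This only summarises the messages file. The kernel dump written to /dev/md/dsk/d110 is still what support would need for a real root-cause analysis.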

znnnz posted on 2013-06-12 23:31

fmadm faulty


showfaults -v

wait空白 posted on 2013-06-13 08:50

Two-node clusters built on the T series keep doing this. It has already happened to me several times.

346279055 posted on 2013-06-14 11:00

Only after getting familiar with IBM's HA did I realise how scary the sun cluster I used to work with really is.

Aaron.Lau posted on 2013-06-14 11:36

There is no comparison. Working on sun is too hard to make a living at, and there is no market for it.

zhaopingzi posted on 2013-06-17 09:06

This post was last edited by zhaopingzi on 2013-06-17 09:06

sun cluster is rather complicated. Even Sun itself rarely uses it, and it is hard to use.

ac220v posted on 2013-07-12 14:51

remote node bimsnewdb1 died
and then a fencing operation was triggered

liusu_520 posted on 2013-07-15 16:14

Is there a 3.4 version of cluster?

skyiys posted on 2013-07-18 10:02

ac220v posted on 2013-07-12 14:51
remote node bimsnewdb1 died
and then a fencing operation was triggered

Impressive