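nerrazurri posted on 2012-10-13 06:01

Sun Cluster cannot switch over; I have a preliminary analysis and would appreciate everyone's help

Our company has two servers configured with Sun Cluster. The two nodes are hrsms01 and hrsms02, and the database initially ran on hrsms01.

A few days ago hrsms01 went down suddenly because of a faulty memory module, and the database failed over to hrsms02 automatically. After the memory module was replaced and hrsms01 rejoined the cluster, I ran: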

root@hrsms02 # scswitch -z -g oracle-rg -h hrsms01
scswitch: Resource group oracle-rg failed to start on chosen node and may fail over to other node(s)


The log is as follows:

Sep 19 14:37:52 hrsms01 Cluster.RGM.rgmd: CMM: Cluster has reached quorum.
Sep 19 14:37:52 hrsms01 Cluster.RGM.rgmd: CMM: Node hrsms01 (nodeid = 1) is up; new incarnation number = 1348036672.
Sep 19 14:37:52 hrsms01 Cluster.RGM.rgmd: CMM: Node hrsms02 (nodeid = 2) is up; new incarnation number = 1348030597.
Sep 19 14:37:52 hrsms01 Cluster.RGM.rgmd: launching method <bin/oracle_server_boot> for resource <oracle-res>, resource group <oracle-rg>, timeout <30> seconds
Sep 19 14:37:52 hrsms01 Cluster.RGM.rgmd: launching method <bin/oracle_listener_boot> for resource <oracle-lsn>, resource group <oracle-rg>, timeout <30> seconds
Sep 19 14:37:52 hrsms01 Cluster.RGM.rgmd: method <bin/oracle_listener_boot> completed successfully for resource <oracle-lsn>, resource group <oracle-rg>, time used: 0% of timeout <30 seconds>
Sep 19 14:37:52 hrsms01 Cluster.RGM.rgmd: method <bin/oracle_server_boot> completed successfully for resource <oracle-res>, resource group <oracle-rg>, time used: 0% of timeout <30 seconds>
Sep 19 14:37:54 hrsms01 snmpXdmid: Error in Adding Row for Subscription Table Entry
Sep 19 14:37:54 hrsms01 snmpXdmid: Failed to add filter to SP for Event delivery
Sep 19 14:37:54 hrsms01 Cluster.scdpmd: The status of device: /dev/did/rdsk/d1s0 is set to MONITORED
Sep 19 14:37:54 hrsms01 Cluster.scdpmd: The status of device: /dev/did/rdsk/d2s0 is set to MONITORED
Sep 19 14:37:54 hrsms01 Cluster.scdpmd: The status of device: /dev/did/rdsk/d4s0 is set to MONITORED
Sep 19 14:37:54 hrsms01 Cluster.scdpmd: The status of device: /dev/did/rdsk/d5s0 is set to MONITORED
Sep 19 14:37:54 hrsms01 Cluster.scdpmd: The state of the path to device: /dev/did/rdsk/d4s0 has changed to FAILED
Sep 19 14:37:54 hrsms01 Cluster.scdpmd: The state of the path to device: /dev/did/rdsk/d5s0 has changed to FAILED
Sep 19 14:37:54 hrsms01 Cluster.scdpmd: The status of device: /dev/did/rdsk/d6s0 is set to MONITORED
Sep 19 14:37:54 hrsms01 Cluster.scdpmd: The status of device: /dev/did/rdsk/d7s0 is set to MONITORED
Sep 19 14:37:54 hrsms01 Cluster.scdpmd: The state of the path to device: /dev/did/rdsk/d6s0 has changed to FAILED
Sep 19 14:37:54 hrsms01 Cluster.scdpmd: The state of the path to device: /dev/did/rdsk/d7s0 has changed to FAILED
Sep 19 14:37:54 hrsms01 Cluster.scdpmd: The state of the path to device: /dev/did/rdsk/d1s0 has changed to OK
Sep 19 14:37:54 hrsms01 Cluster.scdpmd: The state of the path to device: /dev/did/rdsk/d2s0 has changed to OK
Sep 19 14:38:00 hrsms01 pseudo: pseudo-device: vol0
Sep 19 14:38:00 hrsms01 genunix: vol0 is /pseudo/vol@0
Sep 19 14:38:29 hrsms01 genunix: NOTICE: ce1: no fault external to device; service available
Sep 19 14:38:29 hrsms01 genunix: NOTICE: ce1: xcvr addr:0x01 - link up 100 Mbps full duplex
Sep 19 14:40:00 hrsms01 pseudo: pseudo-device: devinfo0
Sep 19 14:40:00 hrsms01 genunix: devinfo0 is /pseudo/devinfo@0
Sep 19 14:50:25 hrsms01 Cluster.RGM.rgmd: launching method <hafoip_prenet_start> for resource <plmmcsg>, resource group <oracle-rg>, timeout <300> seconds
Sep 19 14:50:26 hrsms01 Cluster.RGM.rgmd: method <hafoip_prenet_start> completed successfully for resource <plmmcsg>, resource group <oracle-rg>, time used: 0% of timeout <300 seconds>
Sep 19 14:50:26 hrsms01 Cluster.RGM.rgmd: launching method <hastorageplus_prenet_start> for resource <oracle-ha>, resource group <oracle-rg>, timeout <1800> seconds
Sep 19 14:50:28 hrsms01 Cluster.Framework: stdout: becoming primary for plmds
Sep 19 14:50:29 hrsms01 Cluster.Framework: stderr: metaset: hrsms01: plmds: there are no existing databases
Sep 19 14:50:29 hrsms01 Cluster.Framework: stderr: metaset: hrsms01: plmds: must be owner of the set for this command
Sep 19 14:51:04 hrsms01 Cluster.Framework: stdout: becoming primary for plmds
Sep 19 14:51:05 hrsms01 Cluster.Framework: stderr: metaset: hrsms01: plmds: there are no existing databases
Sep 19 14:51:05 hrsms01 Cluster.Framework: stderr: metaset: hrsms01: plmds: must be owner of the set for this command
Sep 19 14:51:08 hrsms01 SC: Device switchover of global service plmds associated with path /u02 to this node failed: Node failed to become the primary.
Sep 19 14:51:08 hrsms01 SC: Device switchover of global service plmds associated with path /u03 to this node failed: Node failed to become the primary.
Sep 19 14:51:08 hrsms01 SC: Global service plmds associated with path /u02 is unable to become a primary on node 1.
Sep 19 14:51:08 hrsms01 Cluster.RGM.rgmd: Method <hastorageplus_prenet_start> failed on resource <oracle-ha> in resource group <oracle-rg>
Sep 19 14:51:08 hrsms01 Cluster.RGM.rgmd: launching method <hastorageplus_stop> for resource <oracle-ha>, resource group <oracle-rg>, timeout <1800> seconds
Sep 19 14:51:08 hrsms01 Cluster.RGM.rgmd: method <hastorageplus_stop> completed successfully for resource <oracle-ha>, resource group <oracle-rg>, time used: 0% of timeout <1800 seconds>
Sep 19 14:51:08 hrsms01 Cluster.RGM.rgmd: launching method <hafoip_stop> for resource <plmmcsg>, resource group <oracle-rg>, timeout <300> seconds
Sep 19 14:51:08 hrsms01 ip: TCP_IOC_ABORT_CONN: local = 192.168.099.070:0, remote = 000.000.000.000:0, start = -2, end = 6
Sep 19 14:51:08 hrsms01 ip: TCP_IOC_ABORT_CONN: aborted 0 connection
Sep 19 14:51:08 hrsms01 Cluster.RGM.rgmd: method <hafoip_stop> completed successfully for resource <plmmcsg>, resource group <oracle-rg>, time used: 0% of timeout <300 seconds>
Sep 19 14:51:08 hrsms01 Cluster.RGM.rgmd: launching method <hastorageplus_postnet_stop> for resource <oracle-ha>, resource group <oracle-rg>, timeout <1800> seconds




I first looked at these entries:


Sep 19 14:37:54 hrsms01 Cluster.scdpmd: The state of the path to device: /dev/did/rdsk/d4s0 has changed to FAILED
Sep 19 14:37:54 hrsms01 Cluster.scdpmd: The state of the path to device: /dev/did/rdsk/d5s0 has changed to FAILED
Sep 19 14:37:54 hrsms01 Cluster.scdpmd: The state of the path to device: /dev/did/rdsk/d6s0 has changed to FAILED
Sep 19 14:37:54 hrsms01 Cluster.scdpmd: The state of the path to device: /dev/did/rdsk/d7s0 has changed to FAILED


hrsms01>#scdidadm -L
1      hrsms01:/dev/rdsk/c0t0d0       /dev/did/rdsk/d1   
2      hrsms01:/dev/rdsk/c0t1d0       /dev/did/rdsk/d2   
3      hrsms01:/dev/rdsk/c0t6d0       /dev/did/rdsk/d3   
4      hrsms02:/dev/rdsk/c6t600A0B80001F71B200000AB44509B55Ed0 /dev/did/rdsk/d4   
4      hrsms01:/dev/rdsk/c2t600A0B80001F71B200000AB44509B55Ed0 /dev/did/rdsk/d4   
5      hrsms02:/dev/rdsk/c6t600A0B800018EF9D000000154509B527d0 /dev/did/rdsk/d5   
5      hrsms01:/dev/rdsk/c2t600A0B800018EF9D000000154509B527d0 /dev/did/rdsk/d5   
6      hrsms02:/dev/rdsk/c6t600A0B80001F71B200000AB34509B4DEd0 /dev/did/rdsk/d6   
6      hrsms01:/dev/rdsk/c2t600A0B80001F71B200000AB34509B4DEd0 /dev/did/rdsk/d6   
7      hrsms02:/dev/rdsk/c6t600A0B800018EF9D000000134509B4BDd0 /dev/did/rdsk/d7   
7      hrsms01:/dev/rdsk/c2t600A0B800018EF9D000000134509B4BDd0 /dev/did/rdsk/d7   
8      hrsms02:/dev/rdsk/c1t0d0       /dev/did/rdsk/d8   
11       hrsms02:/dev/rdsk/c0t0d0       /dev/did/rdsk/d11   
12       hrsms02:/dev/rdsk/c1t5d0       /dev/did/rdsk/d12   
13       hrsms02:/dev/rdsk/c1t1d0       /dev/did/rdsk/d13   
14       hrsms02:/dev/rdsk/c1t2d0       /dev/did/rdsk/d14   
16       hrsms02:/dev/rdsk/c1t4d0       /dev/did/rdsk/d16   
17       hrsms02:/dev/rdsk/c1t3d0       /dev/did/rdsk/d17   
8187   hrsms02:/dev/rmt/1             /dev/did/rmt/5      
8188   hrsms01:/dev/rmt/2             /dev/did/rmt/4      
8189   hrsms01:/dev/rmt/1             /dev/did/rmt/3      
8190   hrsms02:/dev/rmt/0             /dev/did/rmt/2      
8191   hrsms01:/dev/rmt/0             /dev/did/rmt/1


hrsms01>#scdpm -p all
hrsms01:/dev/did/rdsk/d1                                     Ok
hrsms01:/dev/did/rdsk/d2                                     Ok
hrsms01:/dev/did/rdsk/d4                                     Fail
hrsms01:/dev/did/rdsk/d5                                     Fail
hrsms01:/dev/did/rdsk/d6                                     Fail
hrsms01:/dev/did/rdsk/d7                                     Fail
hrsms02:/dev/did/rdsk/d12                                    Ok
hrsms02:/dev/did/rdsk/d13                                    Ok
hrsms02:/dev/did/rdsk/d14                                    Ok
hrsms02:/dev/did/rdsk/d16                                    Ok
hrsms02:/dev/did/rdsk/d17                                    Ok
hrsms02:/dev/did/rdsk/d4                                     Ok
hrsms02:/dev/did/rdsk/d5                                     Ok
hrsms02:/dev/did/rdsk/d6                                     Ok
hrsms02:/dev/did/rdsk/d7                                     Ok
hrsms02:/dev/did/rdsk/d8                                     Ok


The output confirms that hrsms01 cannot access these devices:

hrsms01:/dev/did/rdsk/d4                                     Fail
hrsms01:/dev/did/rdsk/d5                                     Fail
hrsms01:/dev/did/rdsk/d6                                     Fail
hrsms01:/dev/did/rdsk/d7                                     Fail
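
To pull only the failing paths out of a long `scdpm -p all` listing, a small filter along these lines may help. This is only a sketch: the function name `failed_paths` is invented for illustration, and it assumes the two-column `node:device  status` layout shown above.

```shell
# failed_paths: read `scdpm -p all` output on stdin and print
# "node device" for every path whose status column is "Fail".
# Hypothetical helper; assumes the two-column layout shown above.
failed_paths() {
    awk '$2 == "Fail" {
        split($1, parts, ":")   # "hrsms01:/dev/did/rdsk/d4" -> node, device
        print parts[1], parts[2]
    }'
}

# Usage on a cluster node:
#   scdpm -p all | failed_paths
```

Fed the listing above, it would print only the four hrsms01 paths to d4 through d7.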

I first tried to access /dev/did/rdsk/d4 on hrsms01:


hrsms01>#prtvtoc /dev/did/rdsk/d4s2
prtvtoc: /dev/did/rdsk/d4s2: No such device or address


On hrsms02, however, it works:


root@hrsms02 # prtvtoc /dev/did/rdsk/d4s2
* /dev/did/rdsk/d4s2 partition map
*
* Dimensions:
*   512 bytes/sector
*      64 sectors/track
*      64 tracks/cylinder
*    4096 sectors/cylinder
*   25600 cylinders
*   25598 accessible cylinders
*
* Flags:
*   1: unmountable
*  10: read-only


Checking the disk information with format:

hrsms01>#format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0t0d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
          /ssm@0,0/pci@18,600000/pci@2/scsi@2/sd@0,0
       1. c0t1d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
          /ssm@0,0/pci@18,600000/pci@2/scsi@2/sd@1,0

*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      4    00      12288 104837120 104849407
       7      4    01          0     12288     12287



root@hrsms02 # format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c1t0d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000014c3e031d7,0
       1. c1t1d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e011e31a01,0
       2. c1t2d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e011e7be61,0
       3. c1t3d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e011e80151,0
       4. c1t4d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e011e7fad1,0
       5. c1t5d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e011e3c451,0
       6. c6t600A0B80001F71B200000AB34509B4DEd0 <IBM-1722-600-0520 cyl 51198 alt 2 hd 256 sec 64>
          /scsi_vhci/ssd@g600a0b80001f71b200000ab34509b4de
       7. c6t600A0B80001F71B200000AB44509B55Ed0 <IBM-1722-600-0520 cyl 25598 alt 2 hd 64 sec 64>
          /scsi_vhci/ssd@g600a0b80001f71b200000ab44509b55e
       8. c6t600A0B800018EF9D000000134509B4BDd0 <IBM-1722-600-0520 cyl 51198 alt 2 hd 256 sec 64>
          /scsi_vhci/ssd@g600a0b800018ef9d000000134509b4bd
       9. c6t600A0B800018EF9D000000154509B527d0 <IBM-1722-600-0520 cyl 25996 alt 2 hd 64 sec 64>
          /scsi_vhci/ssd@g600a0b800018ef9d000000154509b527



Next, the hardware configuration:


hrsms01>#cfgadm -al
Ap_Id                        Type         Receptacle   Occupant   Condition
N0.IB6                         PCI+_I/O_Bo  connected    configured   ok
N0.IB6::pci0                   io         connected    configured   ok
N0.IB6::pci1                   io         connected    configured   ok
N0.IB6::pci2                   io         connected    configured   ok
N0.IB6::pci3                   io         connected    configured   ok
N0.IB8                         PCI+_I/O_Bo  connected    configured   ok
N0.IB8::pci0                   io         connected    configured   ok
N0.IB8::pci1                   io         connected    configured   ok
N0.IB8::pci2                   io         connected    configured   ok
N0.IB8::pci3                   io         connected    configured   ok
N0.SB0                         unknown      empty      unconfigured unknown
N0.SB2                         unknown      empty      unconfigured unknown
N0.SB4                         CPU_V3       connected    configured   ok
N0.SB4::cpu0                   cpu          connected    configured   ok
N0.SB4::cpu1                   cpu          connected    configured   ok
N0.SB4::cpu2                   cpu          connected    configured   ok
N0.SB4::cpu3                   cpu          connected    configured   ok
N0.SB4::memory               memory       connected    configured   ok
c0                           scsi-bus   connected    configured   unknown
c0::dsk/c0t0d0               disk         connected    configured   unknown
c0::dsk/c0t1d0               disk         connected    configured   unknown
c0::dsk/c0t6d0               CD-ROM       connected    configured   unknown
c0::es/ses0                  processor    connected    configured   unknown
c0::es/ses1                  processor    connected    configured   unknown
c0::rmt/0                      tape         connected    configured   unknown
c1                           scsi-bus   connected    unconfigured unknown
c6                           fc         connected    unconfigured unknown
c7                           fc         connected    unconfigured unknown
c8                           fc-fabric    connected    unconfigured unknown
c8::200600a0b81f71b4         disk         connected    unconfigured unknown
c8::210100e08ba7607a         unknown      connected    unconfigured unknown
c9                           fc-fabric    connected    configured   unknown
c9::200700a0b81f71b4         disk         connected    unconfigured unknown
c9::210000e08b87607a         unknown      connected    unconfigured unknown
c9::500308c146699004         tape         connected    configured   unknown
c9::500308c146699007         tape         connected    configured   unknown


root@hrsms02 # cfgadm -al
Ap_Id                        Type         Receptacle   Occupant   Condition
c0                           scsi-bus   connected    configured   unknown
c0::dsk/c0t0d0               CD-ROM       connected    configured   unknown
c1                           fc-private   connected    configured   unknown
c1::21000014c3e031d7         disk         connected    configured   unknown
c1::500000e011e31a01         disk         connected    configured   unknown
c1::500000e011e3c451         disk         connected    configured   unknown
c1::500000e011e7be61         disk         connected    configured   unknown
c1::500000e011e7fad1         disk         connected    configured   unknown
c1::500000e011e80151         disk         connected    configured   unknown
c1::5080020000251231         ESI          connected    configured   unknown
c4                           fc-fabric    connected    configured   unknown
c4::200700a0b81f71b4         disk         connected    configured   unknown
c4::210000e08b11eb72         unknown      connected    unconfigured unknown
c4::500308c146699004         tape         connected    unconfigured unknown
c4::500308c146699007         tape         connected    unconfigured unknown
c5                           fc-fabric    connected    configured   unknown
c5::200600a0b81f71b4         disk         connected    configured   unknown
c5::210000e08b11ea72         unknown      connected    unconfigured unknown
pcisch0:hpc1_slot0             ethernet/hp  connected    configured   ok
pcisch0:hpc1_slot1             ethernet/hp  connected    configured   ok
pcisch0:hpc1_slot2             mult/hp      connected    configured   ok
pcisch0:hpc1_slot3             ethernet/hp  connected    configured   ok
pcisch2:hpc2_slot4             unknown      empty      unconfigured unknown
pcisch2:hpc2_slot5             unknown      empty      unconfigured unknown
pcisch2:hpc2_slot6             unknown      empty      unconfigured unknown
pcisch3:hpc0_slot7             unknown      empty      unconfigured unknown
pcisch3:hpc0_slot8             vgs8514/hp   connected    configured   ok
usb0/1                         unknown      empty      unconfigured ok
usb0/2                         unknown      empty      unconfigured ok
usb0/3                         unknown      empty      unconfigured ok
usb0/4                         unknown      empty      unconfigured ok



Is the state shown above normal?

On hrsms01, one of the two HBAs is configured and the other is unconfigured, which looks wrong to me.
On hrsms02, both HBAs are configured.
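
One way to spot this kind of difference without reading the whole table is to filter the `cfgadm -al` output for disk attachment points that are still unconfigured. The sketch below is illustrative only: `unconfigured_disks` is a made-up helper name, and it assumes one well-formed five-column entry per line.

```shell
# unconfigured_disks: read `cfgadm -al` output on stdin and print the Ap_Id
# of every entry whose Type is "disk" and whose Occupant column is
# "unconfigured". Hypothetical helper; assumes one entry per line with the
# columns Ap_Id, Type, Receptacle, Occupant, Condition.
unconfigured_disks() {
    awk '$2 == "disk" && $4 == "unconfigured" { print $1 }'
}

# Usage on each node:
#   cfgadm -al | unconfigured_disks
```

On the hrsms01 listing above this would report the two array ports under c8 and c9, while the hrsms02 listing would produce no output.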

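In a healthy Sun Cluster environment, once the resources have switched to the other node, is it normal that format on this node no longer shows the shared disks?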





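byuq posted on 2012-10-13 20:30

The state above is not normal.
Node hrsms01 has not correctly recognized the shared storage, so the switchover cannot succeed.
Run
cfgadm -c configure c8
cfgadm -c configure c9
to configure the Fibre Channel controllers, after which the shared storage should be recognized.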
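
Put together with the commands used earlier in the thread, the recovery could be sketched as below. Treat this as an assumption-laden outline, not a verified procedure: only the two `cfgadm` lines, `scdpm -p all` and the `scswitch` call come from this thread; the `devfsadm` and `scgdevs` steps and the overall ordering are assumptions, and the `run` and `reconfigure_and_switch` names are invented here. By default the wrapper only prints what it would do.

```shell
# run: dry-run wrapper. Prints the command; executes it only if DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# reconfigure_and_switch: hypothetical helper. Stops at the first command
# that fails when executed for real.
reconfigure_and_switch() {
    run cfgadm -c configure c8 &&               # from the reply above
    run cfgadm -c configure c9 &&               # from the reply above
    run devfsadm &&                             # assumed: rebuild /dev links
    run scgdevs &&                              # assumed: refresh global devices
    run scdpm -p all &&                         # check that d4..d7 now show Ok
    run scswitch -z -g oracle-rg -h hrsms01     # retry the switchover
}

# Usage on hrsms01 (prints the commands only):
#   reconfigure_and_switch
# To actually execute them:
#   DRY_RUN=0; reconfigure_and_switch
```

Keeping the default in dry-run mode makes it possible to review the exact command list with the customer before touching a production node.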

nerrazurri posted on 2012-10-14 11:46

丰衣足食, my friend, thank you for your reply!
I will try again!

Thank you very much!

nerrazurri posted on 2012-10-14 11:51

:dizzy:

byuq, my friend, thank you for your reply!
I will try again!

Thank you very much!

东方蜘蛛 posted on 2012-10-15 13:55

What was the result? :time:

nerrazurri posted on 2012-10-15 14:56

We are discussing this internally to confirm whether it is the cause. I will keep this thread updated!

znnnz posted on 2012-10-16 11:22

On older machines, pulling cables or dismantling the machine improperly can change how the device paths are recognized.

wait空白 posted on 2012-10-16 13:31

Please update the thread.

hanlei19 posted on 2012-10-16 13:32

Following this. I hope the original poster can share the outcome soon.

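nerrazurri posted on 2012-10-16 13:46

Everyone, we have now basically confirmed that this is the cause: hrsms01 cannot access the shared disks, and they need to be configured.


I would also like to share something about the heartbeat network.

One thing in this case had puzzled us:

Right at the start we had already suspected a disk configuration problem. But the shared disks held two cluster file systems, /u02 and /u03, which could be mounted on both nodes, and the database could be started by hand. We reasoned that if the disks were at fault, the shared storage should not be accessible at all, so we dropped that line of inquiry and went looking for other causes, which wasted a lot of effort.

Only during today's internal discussion did we really understand what a Sun Cluster Global Device means:

hrsms01 cannot access the shared disks directly, yet it can still mount the /u02 and /u03 cluster file systems. This is the same mechanism as the /global partition that we always create on each node's local disk when building a two-node cluster: although it is a local disk, both nodes can mount and access it, and that access path goes over the heartbeat network.

We are now agreeing on a time with the customer, and I will update this thread afterwards.

Thanks to all the kind friends here!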


