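Recently our production RAC has been running into problems, and after a long time going through the logs I still haven't been able to solve it. I'd really appreciate it if anyone could take a look.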
Hardware environment
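Two Linux CentOS 5.3 x86_64 servers.
One Dell MD3000 storage array.
Each host is connected to the MD3000 with two SAS cables, one to Controller 0 and one to Controller 1. The SAS HBA driver and the multipath driver from the Dell CD are installed.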
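RAC environment
Two hosts and two db in total. I'll call the machines dbA and dbB, and the two databases db1 and db2.
Two services are configured, sr1 and sr2. sr1 runs primarily on dbA and uses dbB only when dbA is unavailable; sr2 runs primarily on dbB and uses dbA only when dbB is unavailable. Of course, the services move back and forth between the instances.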
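For reference, the preferred/available setup described above can be modelled as a tiny state machine. This is only a sketch under my own assumptions: the `Service` class and its `on_cluster_change` method are hypothetical, not Oracle code, and the model assumes (as I understand the 11g behaviour) that a service which has failed over to its available instance stays there after the preferred instance comes back, until someone relocates it manually. If that assumption holds, it might also explain why sr2 is still found on dbA after db22 has restarted.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Service:
    """Hypothetical model of a single-instance service with one preferred
    and one available instance (not Oracle's actual implementation)."""
    name: str
    preferred: str                 # instance the service normally runs on
    available: str                 # backup instance, used only on failure
    current: Optional[str] = None  # instance the service is running on now

    def on_cluster_change(self, up: set[str]) -> Optional[str]:
        # Assumption: no automatic failback, so keep running where we are
        # as long as that instance is still up.
        if self.current in up:
            return self.current
        # Otherwise (re)start on the preferred instance, else the available one.
        for inst in (self.preferred, self.available):
            if inst in up:
                self.current = inst
                return inst
        # Neither instance is up: the service is down.
        self.current = None
        return None


if __name__ == "__main__":
    sr2 = Service("sr2", preferred="dbB", available="dbA")
    print(sr2.on_cluster_change({"dbA", "dbB"}))  # both up: dbB
    print(sr2.on_cluster_change({"dbA"}))         # dbB's instance fails: dbA
    print(sr2.on_cluster_change({"dbA", "dbB"}))  # dbB back: still dbA
```

Under this model the third call returns dbA, i.e. the service only goes back to its preferred instance through a manual relocation.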
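Problem description
Recently sr2 keeps failing over to dbA on its own. From the logs, it looks like db22 was inaccessible at the moment of the switch. Here is the log: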
2009-06-21 00:27:04.992: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]: CLSR-0002: Oracle error encountered while executing clsrcdbciopn: Error selecting open stat.
2009-06-21 00:27:04.992: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]: ORA-12152: TNS:unable to send break message
2009-06-21 00:27:04.994: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]: CLSR-0002: Oracle error encountered while executing ROLLBACK
2009-06-21 00:27:04.994: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]: ORA-03114: not connected to ORACLE
2009-06-21 00:27:04.994: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]: CLSR-0002: Oracle error encountered while executing DISCONNECT
2009-06-21 00:27:04.994: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]: ORA-03114: not connected to ORACLE
2009-06-21 00:27:05.022: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]: CLSR-0002: Oracle error encountered while executing CONNECT
2009-06-21 00:27:05.022: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]: ORA-01092: ORACLE instance terminated. Disconnection forced
2009-06-21 00:27:05.022: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]: Could not connect to 'chdb2'
2009-06-21 00:27:05.022: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]: clsrcdbcconnect: connect failed
2009-06-21 00:27:05.022: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]: clsrimon_look_master: Unable to get db connection.
2009-06-21 00:27:05.022: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]: clsrcdbcconnect: database is not mounted nor opened
2009-06-21 00:27:05.022: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]: clsrcdbcconnect: connect failed
2009-06-21 00:27:05.022: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]: clsrrlbgthr: Failed to get db connection.
2009-06-21 00:27:05.293: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]: clsrrlbgthr: exiting/shutting down flag is set.
2009-06-21 00:27:05.293: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]: clsrrlbgthr: RLB gateway thread returning
2009-06-21 00:27:05.293: [ RACG][1216280896] [9700][1216280896][ora.chdb2.chdb22.inst]: CLSR-0002: Oracle error encountered while executing DISCONNECT
2009-06-21 00:27:05.293: [ RACG][1216280896] [9700][1216280896][ora.chdb2.chdb22.inst]: ORA-03135: connection lost contact
2009-06-21 00:27:05.612: [ RACG][1247750464] [9700][1247750464][ora.chdb2.tvudb2.chdb22.srv]: CLSR-0002: Oracle error encountered while executing ROLLBACK
2009-06-21 00:27:05.612: [ RACG][1247750464] [9700][1247750464][ora.chdb2.tvudb2.chdb22.srv]: ORA-03135: connection lost contact
Process ID: 10070
Session ID: 5456 Serial number: 7
2009-06-21 00:27:05.612: [ RACG][1247750464] [9700][1247750464][ora.chdb2.tvudb2.chdb22.srv]: CLSR-0002: Oracle error encountered while executing DISCONNECT
2009-06-21 00:27:05.612: [ RACG][1247750464] [9700][1247750464][ora.chdb2.tvudb2.chdb22.srv]: ORA-03114: not connected to ORACLE
2009-06-21 00:27:05.614: [ RACG][1247750464] [9700][1247750464][ora.chdb2.tvudb2.chdb22.srv]: clsrcdbcconnect: database is not mounted nor opened
2009-06-21 00:27:05.614: [ RACG][1247750464] [9700][1247750464][ora.chdb2.tvudb2.chdb22.srv]: clsrcdbcconnect: connect failed
2009-06-21 00:27:05.614: [ RACG][1247750464] [9700][1247750464][ora.chdb2.tvudb2.chdb22.srv]: clsrcsnchk: clsrcsnconnect failed sid=chdb22
2009-06-21 00:27:11.751: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]:
SQL*Plus: Release 11.1.0.6.0 - Production on Sun Jun 21 00:27:05 2009
Copyright (c) 1982, 2007, Oracle. All rights reserved.
Enter user-name: Connected.
SQL> ORACLE instance shut down.
SQL> Disconnected
2009-06-21 00:27:39.086: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]:
SQL*Plus: Release 11.1.0.6.0 - Production on Sun Jun 21 00:27:12 2009
Copyright (c) 1982, 2007, Oracle. All rights reserved.
Enter user-name: Connected to an idle instance.
SQL> ORACLE instance started.
Total System Global Area 1.1090E+10 bytes
2009-06-21 00:27:39.086: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]: Fixed Size 2147632 bytes
Variable Size 4697623248 bytes
Database Buffers 6375342080 bytes
Redo Buffers 15216640 bytes
Database mounted.
Database opened.
2009-06-21 00:27:39.086: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]: SQL> Disconnected from Oracle Database 11g Enterprise Edition Release 11.1.0.6.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
2009-06-21 00:27:39.290: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]: clsrcexecut: env _USR_ORA_PFILE=/opt/oracle/crs/racg/tmp/ora.chdb2.chdb22.inst.ora
2009-06-21 00:27:39.290: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]: clsrcexecut: cmd = /opt/oracle/database/bin/racgeut -e _USR_ORA_DEBUG=0 -e ORACLE_SID=chdb22 810 /opt/oracle/database/bin/racgmdb -q
2009-06-21 00:27:39.290: [ RACG][1205791040] [9700][1205791040][ora.chdb2.chdb22.inst]: clsrcexecut: rc = 3, time = 0.200s
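In the log you can see that the database detected its instance was unavailable and then started the instance up again.
By the time I log in to look at it, db22 is generally already available again.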
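To measure how long the instance was actually unavailable in each incident, the racg log can be scanned for the first `ORA-` error and the `Database opened.` line that follows it. This is a minimal sketch that assumes only the timestamp prefix format visible in the log above; `outage_window` is a hypothetical helper. Because `Database opened.` has no timestamp of its own, the sketch uses the most recent timestamped line before it, so the result is approximate.

```python
import re
from datetime import datetime
from typing import Iterable, List, Optional, Tuple

# Timestamp prefix of each racg log line, e.g. "2009-06-21 00:27:04.992: [ RACG]..."
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}):")
# Oracle error codes such as ORA-12152 or ORA-01092
ORA_RE = re.compile(r"\bORA-\d{5}\b")


def outage_window(
    lines: Iterable[str],
) -> Tuple[Optional[datetime], Optional[datetime], List[str]]:
    """Return (time of first ORA- error, time the database reopened, distinct ORA codes)."""
    last_ts: Optional[datetime] = None
    first_err: Optional[datetime] = None
    reopened: Optional[datetime] = None
    codes: List[str] = []
    for line in lines:
        m = TS_RE.match(line)
        if m:
            last_ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        found = ORA_RE.findall(line)
        if found and first_err is None:
            first_err = last_ts
        for code in found:
            if code not in codes:
                codes.append(code)
        # Untimestamped SQL*Plus output inherits the last timestamp seen.
        if "Database opened." in line and first_err is not None and reopened is None:
            reopened = last_ts
    return first_err, reopened, codes
```

Usage would be along the lines of `with open(path) as f: start, end, codes = outage_window(f)`, where `path` is whichever racg log file you are looking at; `(end - start).total_seconds()` then gives the approximate outage length for that incident.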
----------------------------------------
I've always suspected that the storage dropped out, but I haven't found any log that shows it directly.
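One way to look for storage-side evidence is to pull the system log lines that fall within a few minutes of each incident and mention the storage stack. The sketch below assumes the classic syslog timestamp format (`Jun 21 00:27:04 host ...`, no year) and that syslog and the racg log use the same clock and time zone. `storage_events` is a hypothetical helper, and the keyword list is only a guess at what multipath or SCSI messages might contain, so adjust it to what your kernel actually logs.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Sequence

# Guessed substrings for storage-related messages; adjust to your system.
DEFAULT_KEYWORDS = ("mpp", "SCSI", "I/O error")


def storage_events(
    lines: Iterable[str],
    center: datetime,
    window: timedelta = timedelta(minutes=5),
    keywords: Sequence[str] = DEFAULT_KEYWORDS,
) -> List[str]:
    """Return syslog lines within +/- window of center that contain any keyword."""
    hits = []
    for line in lines:
        try:
            # Classic syslog lines start with "Mon DD HH:MM:SS" (15 chars, no year),
            # so borrow the year from the incident time.
            ts = datetime.strptime(f"{center.year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped syslog line
        if abs(ts - center) <= window and any(k in line for k in keywords):
            hits.append(line.rstrip("\n"))
    return hits
```

For the incident above, `center` would be `datetime(2009, 6, 21, 0, 27, 4)` and `lines` an open handle on the system log of the host running chdb22.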
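Also, I don't know how to specify a preferred path for multipath on Linux. On the MD3000 itself I did set a preferred path for each disk, but the configuration file of the driver Dell installed doesn't seem to have a similar option: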
[root@ldb2 log]# date
Mon Jun 22 08:32:40 GMT 2009
[root@ldb2 log]# cat /etc/mpp.conf
VirtualDiskProductId=MD Virtual Disk
DebugLevel=0x0
NotReadyWaitTime=270
BusyWaitTime=270
QuiescenceWaitTime=270
InquiryWaitTime=60
MaxLunsPerArray=256
MaxPathsPerController=4
ScanInterval=60
InquiryInterval=1
MaxArrayModules=30
ErrorLevel=3
SelectionTimeoutRetryCount=0
UaRetryCount=10
RetryCount=10
SynchTimeout=170
FailOverQuiescenceTime=20
FailoverTimeout=120
FailBackToCurrentAllowed=1
ControllerIoWaitTime=300
ArrayIoWaitTime=600
DisableLUNRebalance=3
IdlePathCheckingInterval=60
RecheckFailedPathWaitTime=30
FailedPathCheckingInterval=60
ArrayFailoverWaitTime=300
PrintSenseBuffer=0
ClassicModeFailover=0
AVTModeFailover=0
LunFailoverDelay=3
LoadBalancePolicy=1
S2ToS3Key=7c618e562c4329d0
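While looking for a preferred-path option, it may help to list the timing-related settings in this file in one place, for example to compare the failover and I/O wait values against the timestamps in the racg log. This sketch assumes only the plain `Key=Value` layout shown above; `parse_mpp_conf` and `timing_settings` are hypothetical helpers, and picking keys by their `Time`/`Timeout`/`Interval` suffix is my own heuristic. It does not interpret the values (for example what `LoadBalancePolicy=1` means); the driver's documentation is the place to check that.

```python
from typing import Dict


def parse_mpp_conf(text: str) -> Dict[str, str]:
    """Parse Key=Value lines (as in /etc/mpp.conf) into a dict, skipping anything else."""
    conf: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, possible comment lines and anything without '='.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        conf[key.strip()] = value.strip()
    return conf


def timing_settings(conf: Dict[str, str]) -> Dict[str, int]:
    """Keep numeric settings whose names look like waits, timeouts or intervals."""
    suffixes = ("Time", "Timeout", "Interval")
    return {k: int(v) for k, v in conf.items() if k.endswith(suffixes) and v.isdigit()}
```

For example, `with open("/etc/mpp.conf") as f: print(timing_settings(parse_mpp_conf(f.read())))` prints keys such as `FailoverTimeout` and `ArrayIoWaitTime` with their integer values.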
This has happened several times already, and I have no idea where to start.
Thanks for your help, everyone.