- 论坛徽章:
- 0
|
本帖最后由 willhuntingshan 于 2010-12-16 13:33 编辑
1.主机为两HP PC SERVER 16CPU, 4G内存;
2.每主机有一对千兆网卡,分别用于CRS和OCFS2心跳,其中OCFS2心跳走千兆局域网络,CRS心跳走百兆私有网络;
2.共享HP MAS存储 1.2T,SCSI接口;
3.操作系统为RS5.2,内核信息
[root@server1 log]# uname -a
Linux server1 2.6.18-92.el5xen #1 SMP Tue Apr 29 13:45:57 EDT 2008 i686 athlon i386 GNU/Linux
4.双机使用的是ocfs2-2.6.18-92.el5-1.4.8,数据库为oracle10.2.0.4 RAC;
5.ocfs2相关参数:
[root@server1 ~]# /etc/init.d/o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Heartbeat dead threshold = 301
Network idle timeout: 30000
Network keepalive delay: 2000
Network reconnect delay: 2000
Checking O2CB heartbeat: Active
[root@server1 ~]#
目前两台主机频繁随机重启,以6时左右发生重启为例,摘录日志如下:
server2
messages:
Dec 16 06:07:17 server2 kernel: o2net: connection to node server1 (num 1) at 192.168.192.46:7777 has been idle for 30.0 seconds, shutting it down.
Dec 16 06:07:17 server2 kernel: (oracle,14318,0) 2net_idle_timer:1506 here are some times that might help debug the situation: (tmr 1292450807.249784 now 1292450837.236984 dr 1292450807.249784 adv 1292450807.249784:1292450807.249784 func (4ddce11c:504) 1292450807.249784:1292450807.249784)
Dec 16 06:07:17 server2 kernel: o2net: no longer connected to node server1 (num 1) at 192.168.192.46:7777
Dec 16 06:07:34 server2 kernel: (dlm_thread,6213,5):dlm_send_proxy_ast_msg:458 ERROR: Error -107 when sending message 505 (key 0x99a98efb) to node 1
Dec 16 06:07:34 server2 kernel: (dlm_thread,6213,5):dlm_flush_asts:601 ERROR: status = -107
Dec 16 06:07:46 server2 kernel: (dlm_thread,6213,5):dlm_send_proxy_ast_msg:458 ERROR: Error -107 when sending message 505 (key 0x99a98efb) to node 1
Dec 16 06:07:46 server2 kernel: (dlm_thread,6213,5):dlm_flush_asts:601 ERROR: status = -107
Dec 16 06:07:46 server2 kernel: (dlm_thread,6187,11):dlm_send_proxy_ast_msg:458 ERROR: Error -107 when sending message 505 (key 0x194f03) to node 1
Dec 16 06:07:46 server2 kernel: (dlm_thread,6187,11):dlm_flush_asts:601 ERROR: status = -107
Dec 16 06:07:47 server2 kernel: (o2net,6049,0) 2net_connect_expired:1667 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
Dec 16 06:07:48 server2 kernel: (dlm_thread,6200,3):dlm_send_proxy_ast_msg:458 ERROR: Error -107 when sending message 505 (key 0x4ddce11c) to node 1
Dec 16 06:07:48 server2 kernel: (dlm_thread,6200,3):dlm_flush_asts:601 ERROR: status = -107
Dec 16 06:08:17 server2 kernel: (o2net,6049,0) 2net_connect_expired:1667 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
Dec 16 06:08:47 server2 kernel: (o2net,6049,0) 2net_connect_expired:1667 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
Dec 16 06:08:52 server2 avahi-daemon[6735]: Registering new address record for 10.10.12.146 on eth0.
Dec 16 06:08:52 server2 avahi-daemon[6735]: Withdrawing address record for 10.10.12.146 on eth0.
Dec 16 06:08:52 server2 avahi-daemon[6735]: Registering new address record for 10.10.12.146 on eth0.
ocssd.log
[ CSSD]2010-12-16 06:07:47.224 [3021970320] >WARNING: clssnmPollingThread: node server1 (2) at 50 5.663192e-261artbeat fatal, eviction in 59.770 seconds
[ CSSD]2010-12-16 06:07:47.224 [3021970320] >TRACE: clssnmPollingThread: node server1 (2) is impending reconfig, flag 1037, misstime 60230
[ CSSD]2010-12-16 06:07:47.224 [3021970320] >TRACE: clssnmPollingThread: diskTimeout set to (117000)ms impending reconfig status(1)
[ CSSD]2010-12-16 06:07:48.223 [3021970320] >WARNING: clssnmPollingThread: node server1 (2) at 50 5.663192e-261artbeat fatal, eviction in 58.760 seconds
[ CSSD]2010-12-16 06:08:17.211 [3021970320] >WARNING: clssnmPollingThread: node server1 (2) at 75 5.663192e-261artbeat fatal, eviction in 29.770 seconds
[ CSSD]2010-12-16 06:08:18.210 [3021970320] >WARNING: clssnmPollingThread: node server1 (2) at 75 5.663192e-261artbeat fatal, eviction in 28.760 seconds
[ CSSD]2010-12-16 06:08:35.203 [3021970320] >WARNING: clssnmPollingThread: node server1 (2) at 90 5.663192e-261artbeat fatal, eviction in 11.770 seconds
[ CSSD]2010-12-16 06:08:36.203 [3021970320] >WARNING: clssnmPollingThread: node server1 (2) at 90 5.663192e-261artbeat fatal, eviction in 10.760 seconds
[ CSSD]2010-12-16 06:08:37.202 [3021970320] >WARNING: clssnmPollingThread: node server1 (2) at 90 5.663192e-261artbeat fatal, eviction in 9.770 seconds
[ CSSD]2010-12-16 06:08:38.202 [3021970320] >WARNING: clssnmPollingThread: node server1 (2) at 90 5.663192e-261artbeat fatal, eviction in 8.760 seconds
[ CSSD]2010-12-16 06:08:39.202 [3021970320] >WARNING: clssnmPollingThread: node server1 (2) at 90 5.663192e-261artbeat fatal, eviction in 7.770 seconds
[ CSSD]2010-12-16 06:08:40.201 [3021970320] >WARNING: clssnmPollingThread: node server1 (2) at 90 5.663192e-261artbeat fatal, eviction in 6.760 seconds
[ CSSD]2010-12-16 06:08:41.201 [3021970320] >WARNING: clssnmPollingThread: node server1 (2) at 90 5.663192e-261artbeat fatal, eviction in 5.760 seconds
[ CSSD]2010-12-16 06:08:42.200 [3021970320] >WARNING: clssnmPollingThread: node server1 (2) at 90 5.663192e-261artbeat fatal, eviction in 4.770 seconds
[ CSSD]2010-12-16 06:08:43.200 [3021970320] >WARNING: clssnmPollingThread: node server1 (2) at 90 5.663192e-261artbeat fatal, eviction in 3.760 seconds
[ CSSD]2010-12-16 06:08:44.705 [3021970320] >WARNING: clssnmPollingThread: node server1 (2) at 90 5.663192e-261artbeat fatal, eviction in 2.770 seconds
[ CSSD]2010-12-16 06:08:45.241 [3084909456] >WARNING: clssnmDiskPMT: voting device hang at 50 0.000000atal, termination in 57960 ms, disk (1//oracledata/data2/votdata02.dbf)
[ CSSD]2010-12-16 06:08:45.709 [3021970320] >WARNING: clssnmPollingThread: node server1 (2) at 90 5.663192e-261artbeat fatal, eviction in 1.760 seconds
[ CSSD]2010-12-16 06:08:46.245 [3084909456] >WARNING: clssnmDiskPMT: voting device hang at 50 0.000000atal, termination in 57650 ms, disk (0//oracledata/data1/votdata01.dbf)
[ CSSD]2010-12-16 06:08:46.245 [3084909456] >WARNING: clssnmDiskPMT: voting device hang at 50 0.000000atal, termination in 57970 ms, disk (2//oracledata/data3/votdata03.dbf)
[ CSSD]2010-12-16 06:08:46.706 [3021970320] >WARNING: clssnmPollingThread: node server1 (2) at 90 5.663192e-261artbeat fatal, eviction in 0.770 seconds
[ CSSD]2010-12-16 06:08:47.482 [3021970320] >TRACE: clssnmPollingThread: Eviction started for node server1 (2), flags 0x040d, state 3, wt4c 0
[ CSSD]2010-12-16 06:08:47.482 [3000990608] >TRACE: clssnmDoSyncUpdate: Initiating sync 3
[ CSSD]2010-12-16 06:08:47.482 [3000990608] >TRACE: clssnmDoSyncUpdate: diskTimeout set to (117000)ms
[ CSSD]2010-12-16 06:08:47.482 [3000990608] >TRACE: clssnmSetupAckWait: Ack message type (11)
[ CSSD]2010-12-16 06:08:47.482 [3000990608] >TRACE: clssnmSetupAckWait: node(1) is ALIVE
[ CSSD]2010-12-16 06:08:47.482 [3000990608] >TRACE: clssnmSendSync: syncSeqNo(3)
[ CSSD]2010-12-16 06:08:47.482 [3000990608] >TRACE: clssnmWaitForAcks: Ack message type(11), ackCount(1)
[ CSSD]2010-12-16 06:08:47.482 [3074419600] >TRACE: clssnmHandleSync: diskTimeout set to (117000)ms
[ CSSD]2010-12-16 06:08:47.482 [3074419600] >TRACE: clssnmHandleSync: Acknowledging sync: src[1] srcName[server2] seq[9] sync[3]
[ CSSD]2010-12-16 06:08:47.482 [1709760] >USER: NMEVENT_SUSPEND [00][00][00][06]
[ CSSD]2010-12-16 06:08:47.482 [3000990608] >TRACE: clssnmWaitForAcks: done, msg type(11)
[ CSSD]2010-12-16 06:08:47.482 [3000990608] >TRACE: clssnmDoSyncUpdate: Terminating node 2, server1, misstime(120010) state(5)
[ CSSD]2010-12-16 06:08:47.482 [3000990608] >TRACE: clssnmSetupAckWait: Ack message type (13)
[ CSSD]2010-12-16 06:08:47.482 [3000990608] >TRACE: clssnmSetupAckWait: node(1) is ACTIVE
[ CSSD]2010-12-16 06:08:47.483 [3000990608] >TRACE: clssnmWaitForAcks: Ack message type(13), ackCount(1)
[ CSSD]2010-12-16 06:08:47.483 [3074419600] >TRACE: clssnmSendVoteInfo: node(1) syncSeqNo(3)
[ CSSD]2010-12-16 06:08:47.483 [3000990608] >TRACE: clssnmWaitForAcks: done, msg type(13)
[ CSSD]2010-12-16 06:08:47.483 [3000990608] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2010-12-16 06:08:47.483 [3000990608] >TRACE: clssnmEvict: Start
[ CSSD]2010-12-16 06:08:47.483 [3000990608] >TRACE: clssnmEvict: Evicting node 2, server1, birth 2, death 3, impendingrcfg 1, stateflags 0x40d
[ CSSD]2010-12-16 06:08:47.483 [3000990608] >TRACE: clssnmSendShutdown: req to node 2, kill time 11420846
@
[ CSSD]2010-12-16 06:08:44.705 [3021970320] >WARNING: clssnmPollingThread: node server1 (2) at 90 5.663192e-261artbeat fatal, eviction in 2.770 seconds
[ CSSD]2010-12-16 06:08:45.241 [3084909456] >WARNING: clssnmDiskPMT: voting device hang at 50 0.000000atal, termination in 57960 ms, disk (1//oracledata/data2/votdata02.dbf)
[ CSSD]2010-12-16 06:08:45.709 [3021970320] >WARNING: clssnmPollingThread: node server1 (2) at 90 5.663192e-261artbeat fatal, eviction in 1.760 seconds
[ CSSD]2010-12-16 06:08:46.245 [3084909456] >WARNING: clssnmDiskPMT: voting device hang at 50 0.000000atal, termination in 57650 ms, disk (0//oracledata/data1/votdata01.dbf)
[ CSSD]2010-12-16 06:08:46.245 [3084909456] >WARNING: clssnmDiskPMT: voting device hang at 50 0.000000atal, termination in 57970 ms, disk (2//oracledata/data3/votdata03.dbf)
[ CSSD]2010-12-16 06:08:46.706 [3021970320] >WARNING: clssnmPollingThread: node server1 (2) at 90 5.663192e-261artbeat fatal, eviction in 0.770 seconds
[ CSSD]2010-12-16 06:08:47.482 [3021970320] >TRACE: clssnmPollingThread: Eviction started for node server1 (2), flags 0x040d, state 3, wt4c 0
[ CSSD]2010-12-16 06:08:47.482 [3000990608] >TRACE: clssnmDoSyncUpdate: Initiating sync 3
[ CSSD]2010-12-16 06:08:47.482 [3000990608] >TRACE: clssnmDoSyncUpdate: diskTimeout set to (117000)ms
[ CSSD]2010-12-16 06:08:47.482 [3000990608] >TRACE: clssnmSetupAckWait: Ack message type (11)
[ CSSD]2010-12-16 06:08:47.482 [3000990608] >TRACE: clssnmSetupAckWait: node(1) is ALIVE
[ CSSD]2010-12-16 06:08:47.482 [3000990608] >TRACE: clssnmSendSync: syncSeqNo(3)
[ CSSD]2010-12-16 06:08:47.482 [3000990608] >TRACE: clssnmWaitForAcks: Ack message type(11), ackCount(1)
[ CSSD]2010-12-16 06:08:47.482 [3074419600] >TRACE: clssnmHandleSync: diskTimeout set to (117000)ms
[ CSSD]2010-12-16 06:08:47.482 [3074419600] >TRACE: clssnmHandleSync: Acknowledging sync: src[1] srcName[server2] seq[9] sync[3]
[ CSSD]2010-12-16 06:08:47.482 [1709760] >USER: NMEVENT_SUSPEND [00][00][00][06]
[ CSSD]2010-12-16 06:08:47.482 [3000990608] >TRACE: clssnmWaitForAcks: done, msg type(11)
ᅤ챼서ᄈᆲᅧ며ᄆᄐ¦ᅭᄉᄑ 06:08:47.482 [3000990608] >TRACE: clssnmDoSyncUpdate: Terminating node 2, server1, misstime(120010) state(5)
[ CSSD]2010-12-16 06:08:47.482 [3000990608] >TRACE: clssnmSetupAckWait: Ack message type (13)
[ CSSD]2010-12-16 06:08:47.482 [3000990608] >TRACE: clssnmSetupAckWait: node(1) is ACTIVE
[ CSSD]2010-12-16 06:08:47.483 [3000990608] >TRACE: clssnmWaitForAcks: Ack message type(13), ackCount(1)
[ CSSD]2010-12-16 06:08:47.483 [3074419600] >TRACE: clssnmSendVoteInfo: node(1) syncSeqNo(3)
[ CSSD]2010-12-16 06:08:47.483 [3000990608] >TRACE: clssnmWaitForAcks: done, msg type(13)
[ CSSD]2010-12-16 06:08:47.483 [3000990608] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2010-12-16 06:08:47.483 [3000990608] >TRACE: clssnmEvict: Start
[ CSSD]2010-12-16 06:08:47.483 [3000990608] >TRACE: clssnmEvict: Evicting node 2, server1, birth 2, death 3, impendingrcfg 1, stateflags 0x40d
[ CSSD]2010-12-16 06:08:47.483 [3000990608] >TRACE: clssnmSendShutdown: req to node 2, kill time 11420846
[ CSSD]2010-12-16 06:08:47.483 [3000990608] >TRACE: clssnmDiscHelper: server1, node(2) connection failed, con (0xa09d9c , probe((nil))
crsd.log
2010-12-16 06:08:47.517: [ CRSCOMM][2665450384]0CLEANUP: Searching for connections to failed node server1
2010-12-16 06:08:47.517: [ CRSEVT][2665450384]0Processing member leave for server1, incarnation: 3
2010-12-16 06:08:47.519: [ CRSD][2665450384]0SM: recovery in process: 8
2010-12-16 06:08:47.519: [ CRSEVT][2665450384]0Do failover for: server1
2010-12-16 06:08:47.570: [ OCRSRV][2728389520]s_update_remote_cache_int: FAILED TO RCV ACK FROM node 2 retcode 7
2010-12-16 06:08:48.324: [ CRSRES][2665450384]0 startup = 0
2010-12-16 06:08:48.391: [ CRSRES][2665450384]0 startup = 0
2010-12-16 06:08:48.421: [ CRSRES][2665450384]0 startup = 0
2010-12-16 06:08:48.444: [ CRSRES][2665450384]0 startup = 0
2010-12-16 06:08:48.472: [ CRSRES][2665450384]0 startup = 0
2010-12-16 06:08:48.632: [ CRSRES][2728389520]0startRunnable: setting CLI values
2010-12-16 06:08:48.650: [ CRSRES][2728389520]0Attempting to start `ora.server1.vip` on member `server2`
2010-12-16 06:08:52.432: [ CRSRES][2728389520]0Start of `ora.server1.vip` on member `server2` succeeded.
2010-12-16 06:08:52.463: [ CRSEVT][2665450384]0Post recovery done evmd event for: server1
2010-12-16 06:08:52.464: [ CRSD][2665450384]0SM: recoveryDone: 0
2010-12-16 06:08:52.475: [ CRSEVT][2665450384]0Processing RecoveryDone
server1
messages
被重启前没有任何日志,只有机器重启时的日志:
Dec 16 06:09:52 server1 syslogd 1.4.1: restart.
Dec 16 06:09:52 server1 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Dec 16 06:09:52 server1 kernel: Linux version 2.6.18-92.el5xen (brewbuilder@hs20-bc2-3.build.redhat.com) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-41)) #1 SMP Tue Apr 29 13:45:57 EDT 2008
ocssd.log
没有6时的日志
[ CSSD]2010-12-16 03:30:35.938 [3032460176] >TRACE: clssgmCommonAddMember: clsomon joined (2/0x1000000/#CSS_CLSSOMON)
[ CSSD]2010-12-16 09:21:36.423 >USER: Copyright 2010, Oracle version 10.2.0.4.0
crsd.log
没有6时的日志
2010-12-16 03:31:15.526: [ OCRUTL][2969549712]u_freem: mem passed is null
2010-12-16 09:21:30.928: [ default][2845232][ENTER]0 |
|