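Incident description:
At 12:17 on the 21st, /tmp on DB01 raised a filesystem-full alarm; at the same time HA on DB01 shut down automatically and failed over to DB02.
DB02 had also raised a /tmp-full alarm at 12:03, but usage quickly dropped back to 80% on its own. /tmp is 2 GB on both machines. At 12:18 DB02 took over HA normally,
but at 13:05 HA on DB02 also shut down automatically. /tmp on DB02 was at 80% at 13:05, not full, so I cannot work out why HA suddenly shut down there as well.
The logs are as follows:
On DB01: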
#errpt -a
---
LABEL: SRC_SVKO
IDENTIFIER: BC3BE5A3
Date/Time: Wed Nov 21 12:18:49 BEIST 2012
Sequence Number: 1544
Machine Id: 00C3EEA44C00
Node Id: DB01
Class: S
Type: PERM
WPAR: Global
Resource Name: SRC
Description
SOFTWARE PROGRAM ERROR
Probable Causes
APPLICATION PROGRAM
Failure Causes
SOFTWARE PROGRAM
Recommended Actions
MANUALLY RESTART SUBSYSTEM IF NEEDED
Detail Data
SYMPTOM CODE
256
SOFTWARE ERROR CODE
-9017
ERROR CODE
0
DETECTING MODULE
'srchevn.c'@line:'376'
FAILING MODULE
clstrmgrES
---------------------------------------------------------------------------
LABEL: J2_FS_FULL
IDENTIFIER: F7FA22C9
Date/Time: Wed Nov 21 12:17:45 BEIST 2012
Sequence Number: 1543
Machine Id: 00C3EEA44C00
Node Id: DB01
Class: O
Type: INFO
WPAR: Global
Resource Name: SYSJ2
Description
UNABLE TO ALLOCATE SPACE IN FILE SYSTEM
Probable Causes
FILE SYSTEM FULL
Recommended Actions
INCREASE THE SIZE OF THE ASSOCIATED FILE SYSTEM
REMOVE UNNECESSARY DATA FROM FILE SYSTEM
USE FUSER UTILITY TO LOCATE UNLINKED FILES STILL REFERENCED
Detail Data
JFS2 MAJOR/MINOR DEVICE NUMBER
000A 0007
FILE SYSTEM DEVICE AND MOUNT POINT
/dev/hd3, /tmp
#more cluster.log
Nov 21 12:17:59 DB01 local0:crit clstrmgrES[78500]: Wed Nov 21 12:17:59 HACMP: clstrmgrES: SrcStopForce: Called
Nov 21 12:18:02 DB01 user:notice HACMP for AIX: EVENT START: node_down DB01
Nov 21 12:18:02 DB01 user:notice HACMP for AIX: EVENT START: stop_server ora_monitor
Nov 21 12:18:36 DB01 user:notice HACMP for AIX: EVENT COMPLETED: stop_server ora_monitor 0
Nov 21 12:18:38 DB01 user:notice HACMP for AIX: EVENT START: release_service_addr
Nov 21 12:18:39 DB01 user:notice HACMP for AIX: EVENT COMPLETED: release_service_addr 0
Nov 21 12:18:39 DB01 user:notice HACMP for AIX: EVENT COMPLETED: node_down DB01 0
Nov 21 12:18:46 DB01 user:notice HACMP for AIX: EVENT START: node_down_complete DB01
Nov 21 12:18:46 DB01 user:notice HACMP for AIX: EVENT COMPLETED: node_down_complete DB01 0
Nov 21 12:18:49 DB01 local0:crit clstrmgrES[78500]: Wed Nov 21 12:18:49 HACMP: clstrmgrES: approvalCb: Quit flag was set, exiting
Nov 21 12:18:56 DB01 daemon:notice topsvcs[127554]: (Recorded using libct_ffdc.a cv 2):::Error ID: 6SQG4h/kM3fE/7bW186pl8....................:::Reference ID: 6UpNEL0AXDlD/WzJ.86pl8....................:::Template ID: 6d19271e:::Details File: :::Location: rsct,comm.C,1.148,634 :::TS_STOP_ST Topology Services daemon stopped Topology Services daemon stopped by: Signal SIGTERM
On DB02:
#errpt -a
LABEL: J2_FS_FULL
IDENTIFIER: F7FA22C9
Date/Time: Wed Nov 21 12:03:52 BEIST 2012
Sequence Number: 1687
Machine Id: 00C3EED44C00
Node Id: DB02
Class: O
Type: INFO
WPAR: Global
Resource Name: SYSJ2
Description
UNABLE TO ALLOCATE SPACE IN FILE SYSTEM
Probable Causes
FILE SYSTEM FULL
Recommended Actions
INCREASE THE SIZE OF THE ASSOCIATED FILE SYSTEM
REMOVE UNNECESSARY DATA FROM FILE SYSTEM
USE FUSER UTILITY TO LOCATE UNLINKED FILES STILL REFERENCED
Detail Data
JFS2 MAJOR/MINOR DEVICE NUMBER
000A 0007
FILE SYSTEM DEVICE AND MOUNT POINT
/dev/hd3, /tmp
---------------------------------------------------------------------------
LABEL: TS_LOC_DOWN_ST
IDENTIFIER: 173C787F
Date/Time: Wed Nov 21 12:19:23 BEIST 2012
Sequence Number: 1690
Machine Id: 00C3EED44C00
Node Id: DB02
Class: S
Type: INFO
WPAR: Global
Resource Name: topsvcs
Description
Possible malfunction on local adapter
Probable Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured
Failure Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured
Recommended Actions
Verify adapter configuration
Verify network connectivity
Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.22,4329
ERROR ID
6zV5DL.9N3fE/5NN096pl8....................
REFERENCE CODE
Adapter interface name
tty0
Adapter offset
2
Adapter IP address
255.255.0.1
---------------------------------------------------------------------------
LABEL: SRC_SVKO
IDENTIFIER: BC3BE5A3
Date/Time: Wed Nov 21 13:05:21 BEIST 2012
Sequence Number: 1691
Machine Id: 00C3EED44C00
Node Id: DB02
Class: S
Type: PERM
WPAR: Global
Resource Name: SRC
Description
SOFTWARE PROGRAM ERROR
Probable Causes
APPLICATION PROGRAM
Failure Causes
SOFTWARE PROGRAM
Recommended Actions
MANUALLY RESTART SUBSYSTEM IF NEEDED
Detail Data
SYMPTOM CODE
256
SOFTWARE ERROR CODE
-9017
ERROR CODE
0
DETECTING MODULE
'srchevn.c'@line:'376'
FAILING MODULE
clstrmgrES
---------------------------------------------------------------------------
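Because the errpt reports from the two nodes are long, a small filter can reduce them to one line per entry so the order of events is easier to compare. This is only a sketch: `errpt_summary` is a hypothetical helper name, and it assumes the `LABEL:` / `Date/Time:` / `Node Id:` field layout seen in the `errpt -a` output above.

```shell
# Sketch: condense `errpt -a` style text to "HH:MM:SS NODE LABEL" lines.
# errpt_summary is a made-up name; the field positions are assumed from
# the report layout pasted in this post.
errpt_summary() {
    awk '
        $1 == "LABEL:"               { label = $2 }
        # Date/Time: Wed Nov 21 12:18:49 BEIST 2012  -> field 5 is the time
        $1 == "Date/Time:"           { when = $5 }
        # Node Id comes after LABEL and Date/Time in each entry, so print here
        $1 == "Node" && $2 == "Id:"  { print when, $3, label }
    ' "$@"
}

# Possible usage (file names are placeholders for saved reports):
#   cat db01_errpt.txt db02_errpt.txt | errpt_summary | sort
```

Sorting on the time column is only meaningful when all entries fall on the same day, as they do here.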
#more cluster.log
Nov 21 12:18:42 DB02 user:notice HACMP for AIX: EVENT START: node_down DB01
Nov 21 12:18:42 DB02 user:notice HACMP for AIX: EVENT START: acquire_takeover_addr
Nov 21 12:18:43 DB02 user:notice HACMP for AIX: EVENT COMPLETED: acquire_takeover_addr 0
Nov 21 12:18:46 DB02 user:notice HACMP for AIX: EVENT COMPLETED: node_down DB01 0
Nov 21 12:18:46 DB02 user:notice HACMP for AIX: EVENT START: node_down_complete DB01
Nov 21 12:18:46 DB02 user:notice HACMP for AIX: EVENT START: start_server ora_monitor
Nov 21 12:18:46 DB02 user:notice HACMP for AIX: EVENT COMPLETED: start_server ora_monitor 0
Nov 21 12:18:46 DB02 user:notice HACMP for AIX: EVENT COMPLETED: node_down_complete DB01 0
Nov 21 12:18:49 DB02 local0:crit clstrmgrES[74200]: Wed Nov 21 12:18:49 Removing 3 from ml_idx
Nov 21 12:19:23 DB02 daemon:notice topsvcs[164044]: (Recorded using libct_ffdc.a cv 2):::Error ID: 6zV5DL.9N3fE/5NN096pl8....................:::Reference ID: :::Template ID: 173c787f:::Details File: :::Location: rsct,nim_control.C,1.39.1.22,4329 :::TS_LOC_DOWN_ST Possible malfunction on local adapter Adapter interface name tty0 Adapter offset 2 Adapter IP address 255.255.0.1
Nov 21 12:19:25 DB02 user:notice HACMP for AIX: EVENT START: network_down minus 1 net_rs232_01
Nov 21 12:19:25 DB02 user:notice HACMP for AIX: EVENT COMPLETED: network_down minus 1 net_rs232_01 0
Nov 21 12:19:26 DB02 user:notice HACMP for AIX: EVENT START: network_down_complete minus 1 net_rs232_01
Nov 21 12:19:26 DB02 user:notice HACMP for AIX: EVENT COMPLETED: network_down_complete minus 1 net_rs232_01 0
Nov 21 13:04:59 DB02 local0:crit clstrmgrES[74200]: Wed Nov 21 13:04:59 HACMP: clstrmgrES: SrcStopForce: Called
Nov 21 13:04:59 DB02 user:notice HACMP for AIX: EVENT START: node_down DB02
Nov 21 13:04:59 DB02 user:notice HACMP for AIX: EVENT START: stop_server ora_monitor
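In both cluster.log excerpts the `SrcStopForce: Called` line from clstrmgrES appears just before `node_down` for the local node. To check whether the two shutdowns really follow the same sequence, the relevant lines can be pulled out of each log. The following is a rough sketch, not an HACMP tool: `ha_timeline` is a hypothetical name, and the patterns are simply the literal strings that appear in the excerpts above.

```shell
# Sketch: keep only cluster-manager stop markers and node/network down events
# from cluster.log style text. Fields 3 and 4 of each syslog line are the
# time and the node name.
ha_timeline() {
    awk '
        /SrcStopForce: Called/        { print $3, $4, "clstrmgrES stop requested (SrcStopForce)"; next }
        /Quit flag was set/           { print $3, $4, "clstrmgrES exiting"; next }
        # The trailing space keeps node_down_complete / network_down_complete out
        /EVENT START: node_down /     { print $3, $4, "node_down", $NF; next }
        /EVENT START: network_down /  { print $3, $4, "network_down", $NF; next }
    ' "$@"
}

# Possible usage (pass whatever path cluster.log has on the node):
#   ha_timeline cluster.log
```

Run against the log from each node, this gives a short list that can be placed side by side with the errpt timestamps.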
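I also suspected that HA on DB02 shut down because /tmp was too small while clstrmgr.debug was being written, but my clstrmgr.debug is not under /tmp; it is under /var/hacmp. The IZ05428 fix has also been applied. The URL below is a case I found, but it does not really match mine:
http://www-01.ibm.com/support/docview.wss?uid=isg1IZ05428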
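Since /tmp on DB02 filled up and then fell back to 80% without intervention, a record of /tmp usage over time might help correlate any future HA stop with a short-lived spike. The sketch below assumes the POSIX `df -P` output format (capacity in the fifth column of the second line); `df_pct` and `sample_tmp` are hypothetical helper names.

```shell
# Sketch: read `df -P` output on stdin and print the Capacity column
# as a bare number (for example 80).
df_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print one timestamped sample of /tmp usage. How often to call it
# (cron, a sleep loop) is left to the caller.
sample_tmp() {
    printf '%s /tmp %s%%\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$(df -P /tmp | df_pct)"
}

# Possible usage (the log file name is a placeholder):
#   while :; do sample_tmp >> tmp_usage.log; sleep 10; done
```

The resulting log should be written outside /tmp so that it neither adds to the problem nor gets lost when the filesystem is full.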