Chinaunix

[HACMP Cluster] /tmp filling up caused HA to shut down automatically on both nodes, help needed!

#1  Posted on 2012-11-23 11:20
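Event description:
      At 12:17 on the 21st, DB01 raised a /tmp-full alert. At the same time HA on DB01 shut down automatically and failed over to DB02.
      DB02 had also raised a /tmp-full alert at 12:03, but usage quickly dropped back to 80% on its own. /tmp is 2 GB on both nodes. At 12:18 DB02 took over HA normally,
but at 13:05 HA on DB02 also shut down automatically without warning. At 13:05 /tmp on DB02 was at 80%, not full, so I cannot work out why HA shut down there as well.
The logs are as follows: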

On DB01:

#errpt -a

---
LABEL:          SRC_SVKO
IDENTIFIER:     BC3BE5A3

Date/Time:       Wed Nov 21 12:18:49 BEIST 2012
Sequence Number: 1544
Machine Id:      00C3EEA44C00
Node Id:         DB01
Class:           S
Type:            PERM
WPAR:            Global
Resource Name:   SRC            

Description
SOFTWARE PROGRAM ERROR

Probable Causes
APPLICATION PROGRAM

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
         MANUALLY RESTART SUBSYSTEM IF NEEDED

Detail Data
SYMPTOM CODE
          256
SOFTWARE ERROR CODE
        -9017
ERROR CODE
            0
DETECTING MODULE
'srchevn.c'@line:'376'
FAILING MODULE
clstrmgrES
---------------------------------------------------------------------------
LABEL:          J2_FS_FULL
IDENTIFIER:     F7FA22C9

Date/Time:       Wed Nov 21 12:17:45 BEIST 2012
Sequence Number: 1543
Machine Id:      00C3EEA44C00
Node Id:         DB01
Class:           O
Type:            INFO
WPAR:            Global
Resource Name:   SYSJ2           

Description
UNABLE TO ALLOCATE SPACE IN FILE SYSTEM

Probable Causes
FILE SYSTEM FULL

        Recommended Actions
         INCREASE THE SIZE OF THE ASSOCIATED FILE SYSTEM
         REMOVE UNNECESSARY DATA FROM FILE SYSTEM
         USE FUSER UTILITY TO LOCATE UNLINKED FILES STILL REFERENCED

Detail Data
JFS2 MAJOR/MINOR DEVICE NUMBER
000A 0007
FILE SYSTEM DEVICE AND MOUNT POINT
/dev/hd3, /tmp

#more cluster.log

Nov 21 12:17:59 DB01 local0:crit clstrmgrES[78500]: Wed Nov 21 12:17:59 HACMP: clstrmgrES: SrcStopForce: Called
Nov 21 12:18:02 DB01 user:notice HACMP for AIX: EVENT START: node_down DB01
Nov 21 12:18:02 DB01 user:notice HACMP for AIX: EVENT START: stop_server ora_monitor
Nov 21 12:18:36 DB01 user:notice HACMP for AIX: EVENT COMPLETED: stop_server ora_monitor 0
Nov 21 12:18:38 DB01 user:notice HACMP for AIX: EVENT START: release_service_addr
Nov 21 12:18:39 DB01 user:notice HACMP for AIX: EVENT COMPLETED: release_service_addr 0
Nov 21 12:18:39 DB01 user:notice HACMP for AIX: EVENT COMPLETED: node_down DB01 0
Nov 21 12:18:46 DB01 user:notice HACMP for AIX: EVENT START: node_down_complete DB01
Nov 21 12:18:46 DB01 user:notice HACMP for AIX: EVENT COMPLETED: node_down_complete DB01 0
Nov 21 12:18:49 DB01 local0:crit clstrmgrES[78500]: Wed Nov 21 12:18:49 HACMP: clstrmgrES: approvalCb: Quit flag was set, exiting
Nov 21 12:18:56 DB01 daemon:notice topsvcs[127554]: (Recorded using libct_ffdc.a cv 2):::Error ID: 6SQG4h/kM3fE/7bW186pl8....................:::Reference ID: 6UpNEL0AXDlD/WzJ.86pl8....................:::Template ID: 6d19271e::etails File:  :cation: rsct,comm.C,1.148,634                         :::TS_STOP_ST Topology Services daemon stopped Topology Services daemon stopped by: Signal SIGTERM


On DB02:


#errpt -a

LABEL:          J2_FS_FULL
IDENTIFIER:     F7FA22C9

Date/Time:       Wed Nov 21 12:03:52 BEIST 2012
Sequence Number: 1687
Machine Id:      00C3EED44C00
Node Id:         DB02
Class:           O
Type:            INFO
WPAR:            Global
Resource Name:   SYSJ2           

Description
UNABLE TO ALLOCATE SPACE IN FILE SYSTEM

Probable Causes
FILE SYSTEM FULL

        Recommended Actions
         INCREASE THE SIZE OF THE ASSOCIATED FILE SYSTEM
         REMOVE UNNECESSARY DATA FROM FILE SYSTEM
         USE FUSER UTILITY TO LOCATE UNLINKED FILES STILL REFERENCED

Detail Data
JFS2 MAJOR/MINOR DEVICE NUMBER
000A 0007
FILE SYSTEM DEVICE AND MOUNT POINT
/dev/hd3, /tmp

---------------------------------------------------------------------------
LABEL:          TS_LOC_DOWN_ST
IDENTIFIER:     173C787F

Date/Time:       Wed Nov 21 12:19:23 BEIST 2012
Sequence Number: 1690
Machine Id:      00C3EED44C00
Node Id:         DB02
Class:           S
Type:            INFO
WPAR:            Global
Resource Name:   topsvcs         

Description
Possible malfunction on local adapter

Probable Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured

Failure Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured

        Recommended Actions
         Verify adapter configuration
         Verify network connectivity

Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.22,4329            
ERROR ID
6zV5DL.9N3fE/5NN096pl8....................
REFERENCE CODE
                                          
Adapter interface name
tty0
Adapter offset
            2
Adapter IP address
255.255.0.1

---------------------------------------------------------------------------
LABEL:          SRC_SVKO
IDENTIFIER:     BC3BE5A3

Date/Time:       Wed Nov 21 13:05:21 BEIST 2012
Sequence Number: 1691
Machine Id:      00C3EED44C00
Node Id:         DB02
Class:           S
Type:            PERM
WPAR:            Global
Resource Name:   SRC            

Description
SOFTWARE PROGRAM ERROR

Probable Causes
APPLICATION PROGRAM

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
         MANUALLY RESTART SUBSYSTEM IF NEEDED

Detail Data
SYMPTOM CODE
          256
SOFTWARE ERROR CODE
        -9017
ERROR CODE
            0
DETECTING MODULE
'srchevn.c'@line:'376'
FAILING MODULE
clstrmgrES
---------------------------------------------------------------------------

#more cluster.log

Nov 21 12:18:42 DB02 user:notice HACMP for AIX: EVENT START: node_down DB01
Nov 21 12:18:42 DB02 user:notice HACMP for AIX: EVENT START: acquire_takeover_addr
Nov 21 12:18:43 DB02 user:notice HACMP for AIX: EVENT COMPLETED: acquire_takeover_addr 0
Nov 21 12:18:46 DB02 user:notice HACMP for AIX: EVENT COMPLETED: node_down DB01 0
Nov 21 12:18:46 DB02 user:notice HACMP for AIX: EVENT START: node_down_complete DB01
Nov 21 12:18:46 DB02 user:notice HACMP for AIX: EVENT START: start_server ora_monitor
Nov 21 12:18:46 DB02 user:notice HACMP for AIX: EVENT COMPLETED: start_server ora_monitor 0
Nov 21 12:18:46 DB02 user:notice HACMP for AIX: EVENT COMPLETED: node_down_complete DB01 0
Nov 21 12:18:49 DB02 local0:crit clstrmgrES[74200]: Wed Nov 21 12:18:49 Removing 3 from ml_idx
Nov 21 12:19:23 DB02 daemon:notice topsvcs[164044]: (Recorded using libct_ffdc.a cv 2):::Error ID: 6zV5DL.9N3fE/5NN096pl8....................:::Reference ID: :::Template ID: 173c787f::etails File:  :cation: rsct,nim_control.C,1.39.1.22,4329             :::TS_LOC_DOWN_ST Possible malfunction on local adapter Adapter interface name tty0 Adapter offset 2 Adapter IP address 255.255.0.1
Nov 21 12:19:25 DB02 user:notice HACMP for AIX: EVENT START: network_down minus 1 net_rs232_01
Nov 21 12:19:25 DB02 user:notice HACMP for AIX: EVENT COMPLETED: network_down minus 1 net_rs232_01 0
Nov 21 12:19:26 DB02 user:notice HACMP for AIX: EVENT START: network_down_complete minus 1 net_rs232_01
Nov 21 12:19:26 DB02 user:notice HACMP for AIX: EVENT COMPLETED: network_down_complete minus 1 net_rs232_01 0
Nov 21 13:04:59 DB02 local0:crit clstrmgrES[74200]: Wed Nov 21 13:04:59 HACMP: clstrmgrES: SrcStopForce: Called
Nov 21 13:04:59 DB02 user:notice HACMP for AIX: EVENT START: node_down DB02
Nov 21 13:04:59 DB02 user:notice HACMP for AIX: EVENT START: stop_server ora_monitor



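I also suspected that HA on DB02 shut down because /tmp did not have enough room when clstrmgr.debug was being written, but my clstrmgr.debug is under /var/hacmp, not /tmp, and the IZ05428 fix has already been applied. The URL below is a case I found, but it does not quite match mine: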
http://www-01.ibm.com/support/docview.wss?uid=isg1IZ05428

#2  Posted on 2012-11-23 15:07
I finally found the cause in the logs:
/usr/sbin/cluster/utilities/clstop: No space left on device
HACMP/ES for AIX on DB02 shutting down.
Please exit any cluster applications...
HACMP/ES for AIX on ${HOSTNAME} shutting down.
Please exit any cluster applications...
: No space left on device
The "space" here presumably refers to /tmp?
I would like to find out which files under /tmp HA actually uses.
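
Since "No space left on device" does not name the file system, one way to narrow it down is to list every file system that is close to full, not only /tmp. A minimal sketch, assuming a POSIX shell and a df that accepts the -P (POSIX output format) flag; fs_over_threshold is a hypothetical helper name, not an AIX or HACMP command:

```shell
#!/bin/sh
# fs_over_threshold: hypothetical helper, not part of AIX or HACMP.
# Prints "<mount point> <capacity>" for each file system whose usage
# is at or above the given percentage (default 90).
fs_over_threshold() {
    threshold=${1:-90}
    # df -kP prints POSIX-format columns: capacity is field 5,
    # mount point is field 6 (mount points containing spaces are not handled).
    df -kP | awk -v t="$threshold" '
        NR > 1 {
            pct = $5
            sub(/%/, "", pct)          # strip the trailing percent sign
            if (pct + 0 >= t) print $6, $5
        }'
}
```

For example, fs_over_threshold 80 would list the file systems at 80% or more, which makes it easier to see whether /tmp, /var, or / was the one out of space at the time.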

#3  Posted on 2012-11-23 15:10
Which HA version is this?

#4  Posted on 2012-11-23 15:30
Reply to #3 wushanyink: AIX 6.1, and HA is 5.4.1.

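#5  Posted on 2012-11-23 15:40
If I remember correctly, hacmp.out for HA 5.4.1 should be under /tmp.

Check the size of this file:

/tmp/hacmp.out

If it is too large, run cat /dev/null > hacmp.out from within /tmp.

Remember to back up the old file to another directory first.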
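
The backup-then-truncate step can be sketched as a small shell function. This is only an illustration under assumptions: backup_and_truncate is a hypothetical helper name, not an HACMP utility, and the /tmp/hacmp.out location is the one assumed in this reply, so confirm where the log really lives on your system before using it.

```shell
#!/bin/sh
# backup_and_truncate: hypothetical helper, not an HACMP utility.
# Copies a log file to a backup directory, then empties the original in place.
# Emptying the file (instead of deleting it) leaves the file itself in place,
# so a process that still has it open keeps a valid handle.
backup_and_truncate() {
    log_file=$1      # for example /tmp/hacmp.out (location assumed above)
    backup_dir=$2    # a directory on a file system that has free space

    [ -f "$log_file" ] || { echo "no such file: $log_file" >&2; return 1; }
    mkdir -p "$backup_dir" || return 1

    backup_path="$backup_dir/$(basename "$log_file").$(date +%Y%m%d%H%M%S)"
    cp -p "$log_file" "$backup_path" || return 1   # do not truncate if the copy failed

    : > "$log_file"  # same effect as: cat /dev/null > "$log_file"
    echo "$backup_path"
}
```

A call such as backup_and_truncate /tmp/hacmp.out /backup/hacmp (the backup directory is made up for the example) prints the path of the saved copy.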

#6  Posted on 2012-11-23 15:43
Probable Causes
FILE SYSTEM FULL

        Recommended Actions
         INCREASE THE SIZE OF THE ASSOCIATED FILE SYSTEM
         REMOVE UNNECESSARY DATA FROM FILE SYSTEM
         USE FUSER UTILITY TO LOCATE UNLINKED FILES STILL REFERENCED

Detail Data
JFS2 MAJOR/MINOR DEVICE NUMBER
000A 0007
FILE SYSTEM DEVICE AND MOUNT POINT
/dev/hd3, /tmp


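This error is actually quite clear.

1  Check for large files under /tmp and, where possible, empty or delete them.
2  Enlarge /tmp if rootvg has free space. Consider growing it to 4 GB.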
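
Step 1 can be sketched roughly as follows. This assumes a POSIX shell with find, du, sort and head; largest_files is a hypothetical helper name. The chfs line in the comment is the usual AIX way to grow a file system, but treat it as something to check against the chfs documentation for your AIX level.

```shell
#!/bin/sh
# largest_files: hypothetical helper, not an AIX or HACMP command.
# Lists the largest regular files under a directory, sizes in KB, biggest first.
largest_files() {
    dir=${1:-/tmp}
    count=${2:-10}
    # -xdev keeps find on the file system that holds $dir;
    # du -k reports the space each file occupies in 1 KB units.
    find "$dir" -xdev -type f -exec du -k {} + 2>/dev/null |
        sort -rn |
        head -n "$count"
}

# Step 2 is not run here. On AIX, growing /tmp is normally done with chfs,
# for example: chfs -a size=4G /tmp
# (assumes rootvg has enough free space; lsvg rootvg shows the free PPs).
```

Running largest_files /tmp 20 would show the twenty biggest files, which is usually enough to spot an oversized log.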

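#7  Posted on 2012-11-23 16:07
Reply to #6 wushanyink: Thanks. That was already done right after the failure. What I want to know now is which files under /tmp are related to HA and what each of them is used for, to deepen my understanding of HA. Thanks again.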

#8  Posted on 2012-11-23 20:55
For HACMP 5.4, hacmp.out is at /var/hacmp/log/hacmp.out