Environment: two H85 servers, Oracle, AIX, HACMP, mutual-takeover cluster
Problem description:
1. Erroneous operation: a 15 GB backup file named insur_arch was copied into the /dev directory, where a block device file of the same name already existed. The copy destroyed the insur_arch device's contents. That device backs the filesystem /insur_arch, a shared filesystem belonging to the shared volume group vgdb1, so /insur_arch was destroyed.
2. The shared vgdb1 was varied off, the service IP no longer answered ping, the application on node A stopped abnormally, and the application on node B kept running.
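The erroneous operation can be re-enacted with plain files. This is a hypothetical simulation (paths and contents are stand-ins, not the real 15 GB file): because a file named insur_arch already exists under /dev, cp does not create anything new next to it; on the real system it opened the existing device node and wrote the backup's bytes straight onto the logical volume, clobbering the filesystem.

```shell
# File-based simulation (all paths are stand-ins): a regular file plays
# the role of the block device /dev/insur_arch.
demo=$(mktemp -d)
cd "$demo"
mkdir dev
printf 'JFS superblock and filesystem data' > dev/insur_arch  # the LV contents
printf 'oracle archive backup (15G)'        > insur_arch      # the backup file
cp insur_arch dev/insur_arch      # the erroneous "backup into /dev"
head -c 6 dev/insur_arch          # prints: oracle -- the old contents are gone
```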
Handling:
1. On arrival the application could no longer be started and the situation matched the description above, so HA was restarted:
#smitty clstart
#lsvg -o
rootvg
vgdb1
The shared vgdb1 had been varied on.
#netstat -i
service ip
standby ip
The service IP had successfully moved onto the boot adapter.
But about two minutes later:
#lsvg -o
rootvg
(vgdb1 had been abnormally varied off)
#netstat -i
boot ip
standby ip
The service IP was no longer aliased on the boot adapter; it had stopped working.
#lssrc -g cluster
The HA daemons were all running normally.
Check the hacmp.out file: #tail -f /tmp/hacmp.out
res1:node_down_local[22] [ 0 -ne 0 ]
res1:node_down_local[459] [ 1 -ne 0 ]
res1:node_down_local[461] set_resource_status ERROR
res1:node_down_local[3] set +u
res1:node_down_local[4] NOT_DOIT=
res1:node_down_local[5] set -u
res1:node_down_local[6] [ != TRUE ]
res1:node_down_local[8] [ REAL = EMUL ]
res1:node_down_local[13] clchdaemons -d clstrmgr_scripts -t resource_locator -n yhdsb01
-o res1 -v ERROR
res1:node_down_local[14] [ 0 -ne 0 ]
res1:node_down_local[23] [ ERROR = RELEASING ]
res1:node_down_local[38] [ NONE = RELEASE_SECONDARY ]
res1:node_down_local[42] [ NONE = SECONDARY_BECOMES_PRIMARY ]
res1:node_down_local[47] cl_RMupdate rg_error res1 node_down_local
Reference string: Thu.Feb.16.19:19:38.BEIST.2006.node_down_local.res1.ref
res1:node_down_local[48] [ 0 -ne 0 ]
res1:node_down_local[462] exit 1
Feb 16 19:19:39 EVENT FAILED:1: node_down_local
res1:rg_move[285] [ 1 -ne 0 ]
res1:rg_move[287] cl_log 650 rg_move: Failure occurred while processing Resource Group
res1. Manual intervention required. rg_move res1
res1:cl_log[50] version=1.9
res1:cl_log[92] SYSLOG_FILE=/usr/es/adm/cluster.log
***************************
Feb 16 2006 19:19:39 !!!!!!!!!! ERROR !!!!!!!!!!
***************************
Feb 16 2006 19:19:39 rg_move: Failure occurred while processing Resource Group res1.
Manual intervention required.
res1:rg_move[288] STATUS=1
res1:rg_move[291] UPDATESTATD=1
res1:rg_move[298] [ -f /tmp/.NFSSTOPPED ]
res1:rg_move[304] [ -f /tmp/.RPCLOCKDSTOPPED ]
res1:rg_move[322] exit 1
Feb 16 19:19:39 EVENT FAILED:1: rg_move yhdsb01 1 RELEASE
+ exit 1
Feb 16 19:19:39 EVENT FAILED:1: rg_move_release yhdsb01 1
HACMP Event Summary
Event: rg_move_release yhdsb01 1
Start time: Thu Feb 16 19:19:13 2006
End time: Thu Feb 16 19:19:39 2006
Action: Resource: Script Name:
----------------------------------------------------------------------------
Releasing resource group: res1 node_down_local
Search on: Thu.Feb.16.19:19:14.BEIST.2006.node_down_local.res1.ref
Releasing resource: All_servers stop_server
Search on: Thu.Feb.16.19:19:14.BEIST.2006.stop_server.All_servers.res1.ref
Error encountered with resource: db1server stop_server
Search on: Thu.Feb.16.19:19:15.BEIST.2006.stop_server.db1server.res1.ref
Resource offline: All_nonerror_servers stop_server
Search on: Thu.Feb.16.19:19:15.BEIST.2006.stop_server.All_nonerror_servers.res1.ref
Releasing resource: All_filesystems cl_deactivate_fs
Search on: Thu.Feb.16.19:19:17.BEIST.2006.cl_deactivate_fs.All_filesystems.res1.ref
Resource offline: All_non_error_filesystems cl_deactivate_fs
Search on:
Thu.Feb.16.19:19:18.BEIST.2006.cl_deactivate_fs.All_non_error_filesystems.res1.ref
Releasing resource: All_volume_groups cl_deactivate_vgs
Search on: Thu.Feb.16.19:19:18.BEIST.2006.cl_deactivate_vgs.All_volume_groups.res1.ref
Resource offline: All_volume_groups cl_deactivate_vgs
Search on: Thu.Feb.16.19:19:25.BEIST.2006.cl_deactivate_vgs.All_volume_groups.res1.ref
Releasing resource: All_service_addrs release_service_addr
Search on: Thu.Feb.16.19:19:26.BEIST.2006.release_service_addr.All_service_addrs.res1.ref
Resource offline: All_nonerror_service_addrs release_service_addr
Search on:
Thu.Feb.16.19:19:37.BEIST.2006.release_service_addr.All_nonerror_service_addrs.res1.ref
Error encountered with group: res1 node_down_local
Search on: Thu.Feb.16.19:19:38.BEIST.2006.node_down_local.res1.ref
----------------------------------------------------------------------------
Feb 16 19:19:39 EVENT START: event_error 1 1_rg_move_release yhdsb01 1 _1
:event_error[52] [[ high = high ]]
:event_error[52] version=1.10
:event_error[53] :event_error[53] cl_get_path
HA_DIR=es
:event_error[55] EXIT_STATUS=1
:event_error[56] RP_NAME=1 1_rg_move_release yhdsb01 1 _1
:event_error[59] [ 2 -ne 2 ]
:event_error[65] set -u
:event_error[67] RP_NAME=rg_move_release yhdsb01 1 _1
:event_error[68] RP_NAME=rg_move_release yhdsb01 1
:event_error[70] :event_error[70] cllsclstr -c
:event_error[70] cut -d : -f2
:event_error[70] grep -v cname
CLUSTER=yhdsbclu
:event_error[74] [ -x /usr/lpp/ssp/bin/spget_syspar ]
:event_error[81] echo WARNING: Cluster yhdsbclu Failed while running rg_move_release
yhdsb01 1 , exit status was 1
:event_error[81] 1> /dev/console
:event_error[82] echo WARNING: Cluster yhdsbclu Failed while running rg_move_release
yhdsb01 1 , exit status was 1
WARNING: Cluster yhdsbclu Failed while running rg_move_release yhdsb01 1 , exit status
was 1
:event_error[88] [[ rg_move_release yhdsb01 1 = reconfig_resource* ]]
Feb 16 19:19:39 EVENT FAILED:-1: event_error 1 1_rg_move_release yhdsb01 1 _1
WARNING: Cluster yhdsbclu has been running recovery program
'/usr/es/sbin/cluster/events/rg_move.rp' for 13620 seconds. Please check cluster status.
Feb 16 19:25:13 EVENT START: config_too_long 360 /usr/es/sbin/cluster/events/rg_move.rp
:config_too_long[64] [[ high = high ]]
:config_too_long[64] version=1.11
:config_too_long[65] :config_too_long[65] cl_get_path
HA_DIR=es
:config_too_long[67] NUM_SECS=360
:config_too_long[68] EVENT=/usr/es/sbin/cluster/events/rg_move.rp
:config_too_long[70] HOUR=3600
:config_too_long[71] THRESHOLD=5
:config_too_long[72] SLEEP_INTERVAL=1
:config_too_long[78] PERIOD=30
:config_too_long[81] set -u
:config_too_long[86] LOOPCNT=0
:config_too_long[87] MESSAGECNT=0
:config_too_long[88] :config_too_long[88] cllsclstr -c
:config_too_long[88] cut -d : -f2
:config_too_long[88] grep -v cname
CLUSTER=yhdsbclu
:config_too_long[89] TIME=360
:config_too_long[90] sleep_cntr=0
:config_too_long[95] [ -x /usr/lpp/ssp/bin/spget_syspar ]
WARNING: Cluster yhdsbclu has been running recovery program
'/usr/es/sbin/cluster/events/rg_move.rp' for 360 seconds. Please check cluster status.
2. The error messages fall into two categories:
--1. Errors about /insur_arch:
res1:cl_activate_fs[240] /usr/sbin/fsck -f -p -o nologredo /dev/insur_arch
/dev/rinsur_arch:
Not a recognized filesystem type. (TERMINATED)
res1:cl_activate_fs[85] mount /insur_arch
mount: /dev/insur_arch on /insur_arch: Invalid argument
res1:cl_activate_fs[87] [[ fsck == logredo ]]
res1:cl_activate_fs[107] cl_RMupdate resource_error /insur_arch cl_activate_fs
Reference string: Thu.Feb.16.14:14:16.BEIST.2006.cl_activate_fs..insur_arch.res1.ref
res1:cl_activate_fs[108] cl_echo 10 'cl_activate_fs: Failed mount of /insur_arch.' cl_activate_fs /insur_arch
res1:cl_echo[49] version=1.13
res1:cl_echo[98] HACMP_OUT_FILE=/tmp/hacmp.out
Feb 16 2006 14:14:17 cl_activate_fs: Failed mount of /insur_arch.
res1:cl_activate_fs[109] STATUS=1
---According to /usr/es/sbin/cluster/events/utils/cl_activate_fs, STATUS=1 means: one or more filesystems failed to fsck or mount (/insur_arch was destroyed and so could not be mounted).
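The failing sequence can be paraphrased as a small sketch. This is not the actual cl_activate_fs script: do_fsck and do_mount are stubs, and hard-coding /dev/insur_arch as the failing device is an assumption made to mirror this incident.

```shell
# Paraphrased sketch of cl_activate_fs's decision logic: any filesystem
# that fails fsck or mount flips STATUS to 1, which HACMP then treats as
# a resource error and escalates into rg_move/node_down_local.
activate_fs() {
    dev=$1; mnt=$2; status=0
    do_fsck "$dev" || status=1
    if [ "$status" -eq 0 ]; then
        do_mount "$mnt" || status=1
    fi
    echo "$status"
}
# Stubs standing in for the real fsck/mount; the assumption is that only
# the destroyed /insur_arch device fails its check.
do_fsck()  { [ "$1" != /dev/insur_arch ]; }
do_mount() { true; }
activate_fs /dev/insur_log1 /insur_log1   # prints 0: fsck and mount succeed
activate_fs /dev/insur_arch /insur_arch   # prints 1: fsck fails, STATUS=1
```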
--2. Errors about the cluster state:
WARNING: Cluster yhdsbclu has been running recovery program
'/usr/es/sbin/cluster/events/rg_move.rp' for 360 seconds. Please check cluster status.
Check the cluster state:
#/usr/sbin/cluster/clinfo
#/usr/sbin/cluster/clstat
On node A: boot IP, standby IP, service IP and tty were all down.
On node B: standby IP and service IP were up; boot IP and tty were down.
Node A was rebooted and HA restarted, but the problem persisted.
3. Since this is a production system and service had to be restored as soon as possible, the plan was to manually varyonvg vgdb1, start the database service, bind the service IP, and start the application.
varyonvg vgdb1:
#varyonvg vgdb1
#lsvg -l vgdb1
insur_log1 /insur_log1
insur_log2 /insur_log2
insur_log3 /insur_log3
insur_arch /insur_arch -- this filesystem was destroyed
insur_data /insur_data
Mount each of the filesystems above:
#mount /insur_log1
ok
#mount /insur_log2
ok
#mount /insur_log3
ok
#mount /insur_data
ok
#mount /insur_arch
mount:0506-324 cannot mount /dev/insur_arch on /insur_arch:
A system call received a parameter that is not valid.
/insur_arch could not be mounted: the erroneous operation had destroyed it. This is the root cause of the outage; set it aside for now.
Start the Oracle services:
#su - oracle
$sqlplus "/as sysdba"
sql>startup mount insur
sql>exit
$lsnrctl status
$sqlplus "/as sysdba"
> archive log list;
Database log mode Archive Mode
Automatic archival Enabled
Archive destination /insur_arch/archive -- the destroyed filesystem holds Oracle's archived logs
Oldest online log sequence 565
Next log sequence to archive 567
Current log sequence 567
> archive log stop;
Bind the service IP:
#ifconfig en1 10.81.193.8 255.255.255.0 alias
Start the application: failed.
It looked like we would need to edit the Oracle configuration, disable automatic archiving, and restart Oracle before the application could run; unfortunately I am not very familiar with Oracle. On reflection, though, even if the application ran that way, the underlying problem would remain unsolved, and the customer's requirement for automatic log archiving would not be met.
4. The fault was triggered by the destroyed /insur_arch filesystem failing to mount: when the mount failed, HA ran the rg_move, rg_move_release, node_down_local and stop_server events, which stopped the application. Since /insur_arch only stores Oracle's archived logs, the plan (with the customer's agreement) was to delete the /insur_arch filesystem, recreate it, synchronize HA, and restore the application.
-1. Record the attributes of the /insur_arch filesystem.
-2. Delete the /insur_arch filesystem:
#smitty fs
-3. Recreate the /insur_arch filesystem and restore its attributes:
#smitty mklv
#mkdir /insur_arch
#smitty fs
#mkdir /insur_arch/archive
#chown -R oracle:dba /insur_arch
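For reference, the smitty screens above correspond roughly to these AIX commands. This is a sketch only: the logical-volume size, PP count and mount options are assumptions and must match the attributes recorded in step -1.

```shell
# Hypothetical command-line equivalents of the smitty steps (sizes assumed):
mklv -y insur_arch -t jfs vgdb1 60               # recreate the LV; 60 PPs is a guess
crfs -v jfs -d insur_arch -m /insur_arch -A no   # -A no: HA mounts it, not /etc/filesystems
mount /insur_arch
mkdir /insur_arch/archive                        # restore the Oracle archive directory
chown -R oracle:dba /insur_arch
```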
5. Synchronize HA, start HA, and test takeover:
-1. Stop HA on both nodes A and B.
-2. Synchronize and start HA.
-3. The application came back up and takeover tests passed.
6. Fault resolved.
Note: /insur_arch probably could not be mounted because its superblock was destroyed. Given the chance, I would like to reproduce this scenario and try recovering /insur_arch by restoring the superblock, rather than deleting and recreating the filesystem; that would be a better outcome.
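On AIX JFS the superblock sits at 4 KB block 1 of the logical volume and a backup copy is kept at block 31, so the classic repair is dd count=1 bs=4k skip=31 seek=1 against the raw LV. Below is a file-based simulation of that trick; the image file is a stand-in for /dev/rinsur_arch and the block contents are invented for the demo.

```shell
# Simulate a logical volume as a 32-block (4 KB each) image file.
cd "$(mktemp -d)"
dd if=/dev/zero of=lv.img bs=4096 count=32 2>/dev/null
printf 'GOOD-SUPERBLOCK' | dd of=lv.img bs=4096 seek=1  conv=notrunc 2>/dev/null  # primary, block 1
printf 'GOOD-SUPERBLOCK' | dd of=lv.img bs=4096 seek=31 conv=notrunc 2>/dev/null  # backup, block 31
printf 'TRASHED........' | dd of=lv.img bs=4096 seek=1  conv=notrunc 2>/dev/null  # the accidental overwrite
# Restore the primary superblock from the backup copy -- on AIX this is:
#   dd count=1 bs=4k skip=31 seek=1 if=/dev/rLVNAME of=/dev/rLVNAME
dd if=lv.img of=lv.img bs=4096 skip=31 seek=1 count=1 conv=notrunc 2>/dev/null
dd if=lv.img bs=4096 skip=1 count=1 2>/dev/null | head -c 15   # prints: GOOD-SUPERBLOCK
```

Note that this only helps when the damage is limited to the primary superblock; here the cp wrote 15 GB over the LV, so both copies (and the data) were almost certainly gone, and recreating the filesystem was the realistic option.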
This article is from the ChinaUnix blog. Original: http://blog.chinaunix.net/u/5038/showart_75911.html