[Advanced] Handling a destroyed shared filesystem in an HA two-node cluster

Posted 2006-02-17 18:55
Environment: 2 × H85, Oracle, AIX, HACMP, two-node mutual-takeover cluster

Problem description:
1. Erroneous operation: a 15 GB backup file named insur_arch was copied into /dev, where a block-device file of the same name insur_arch already existed. The copy overwrote that device file. Since this block device backs the filesystem /insur_arch, the filesystem /insur_arch was destroyed. /insur_arch is a shared filesystem belonging to the shared volume group vgdb1.
2. The shared vgdb1 was varied off, the service IP could not be pinged, the application running on node A stopped abnormally, while the application on node B kept running normally.
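The mistake can be illustrated with ordinary files in a scratch directory (a hypothetical stand-in; no real devices are touched): cp does not care that the destination under /dev is a block-device node — it simply overwrites its contents, i.e. the raw filesystem behind it.

```shell
# Simulation with plain files: "$tmp/insur_arch" stands in for the block
# device /dev/insur_arch, "$tmp/backup" for the 15 GB backup file.
tmp=$(mktemp -d)
printf 'JFS-SUPERBLOCK...' > "$tmp/insur_arch"   # fake on-disk filesystem metadata
printf 'backup payload'    > "$tmp/backup"
cp "$tmp/backup" "$tmp/insur_arch"               # the erroneous copy into /dev
if grep -q 'JFS-SUPERBLOCK' "$tmp/insur_arch"; then
    state=intact
else
    state=destroyed                              # metadata overwritten by the copy
fi
echo "filesystem metadata is $state"
rm -rf "$tmp"
```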
Handling:
1. On arrival the application could no longer be started and the situation matched the description above, so HACMP was restarted:
#smitty clstart
#lsvg -o
rootvg
vgdb1
The shared vgdb1 had been varied on.
#netstat -i
service ip
standby ip
The service IP had successfully moved to the boot adapter.
But about two minutes later:
#lsvg -o
rootvg
(vgdb1 no longer appears — it had been abnormally varied off)
#netstat -i
boot ip
standby ip
The service IP was no longer on the boot adapter; it had gone dead.
#lssrc -g cluster
The HA daemons themselves checked out as running normally.

Check the hacmp.out log: #tail -f /tmp/hacmp.out
res1:node_down_local[22] [ 0 -ne 0 ]
res1:node_down_local[459] [ 1 -ne 0 ]
res1:node_down_local[461] set_resource_status ERROR
res1:node_down_local[3] set +u
res1:node_down_local[4] NOT_DOIT=
res1:node_down_local[5] set -u
res1:node_down_local[6] [  != TRUE ]
res1:node_down_local[8] [ REAL = EMUL ]
res1:node_down_local[13] clchdaemons -d clstrmgr_scripts -t resource_locator -n yhdsb01
-o res1 -v ERROR
res1:node_down_local[14] [ 0 -ne 0 ]
res1:node_down_local[23] [ ERROR = RELEASING ]
res1:node_down_local[38] [ NONE = RELEASE_SECONDARY ]
res1:node_down_local[42] [ NONE = SECONDARY_BECOMES_PRIMARY ]
res1:node_down_local[47] cl_RMupdate rg_error res1 node_down_local
Reference string: Thu.Feb.16.19:19:38.BEIST.2006.node_down_local.res1.ref
res1:node_down_local[48] [ 0 -ne 0 ]
res1:node_down_local[462] exit 1
Feb 16 19:19:39 EVENT FAILED:1: node_down_local

res1:rg_move[285] [ 1 -ne 0 ]
res1:rg_move[287] cl_log 650 rg_move: Failure occurred while processing Resource Group
res1. Manual intervention required. rg_move res1
res1:cl_log[50] version=1.9
res1:cl_log[92] SYSLOG_FILE=/usr/es/adm/cluster.log
***************************
Feb 16 2006 19:19:39 !!!!!!!!!! ERROR !!!!!!!!!!
***************************
Feb 16 2006 19:19:39 rg_move: Failure occurred while processing Resource Group res1.
Manual intervention required.
res1:rg_move[288] STATUS=1
res1:rg_move[291] UPDATESTATD=1
res1:rg_move[298] [ -f /tmp/.NFSSTOPPED ]
res1:rg_move[304] [ -f /tmp/.RPCLOCKDSTOPPED ]
res1:rg_move[322] exit 1
Feb 16 19:19:39 EVENT FAILED:1: rg_move yhdsb01 1 RELEASE

+ exit 1
Feb 16 19:19:39 EVENT FAILED:1: rg_move_release yhdsb01 1
   HACMP Event Summary
Event: rg_move_release yhdsb01 1
Start time: Thu Feb 16 19:19:13 2006
End time: Thu Feb 16 19:19:39 2006
Action:  Resource:   Script Name:
----------------------------------------------------------------------------
Releasing resource group: res1 node_down_local
Search on: Thu.Feb.16.19:19:14.BEIST.2006.node_down_local.res1.ref
Releasing resource: All_servers stop_server
Search on: Thu.Feb.16.19:19:14.BEIST.2006.stop_server.All_servers.res1.ref
Error encountered with resource: db1server stop_server
Search on: Thu.Feb.16.19:19:15.BEIST.2006.stop_server.db1server.res1.ref
Resource offline: All_nonerror_servers stop_server
Search on: Thu.Feb.16.19:19:15.BEIST.2006.stop_server.All_nonerror_servers.res1.ref
Releasing resource: All_filesystems cl_deactivate_fs
Search on: Thu.Feb.16.19:19:17.BEIST.2006.cl_deactivate_fs.All_filesystems.res1.ref
Resource offline: All_non_error_filesystems cl_deactivate_fs
Search on:
Thu.Feb.16.19:19:18.BEIST.2006.cl_deactivate_fs.All_non_error_filesystems.res1.ref
Releasing resource: All_volume_groups cl_deactivate_vgs
Search on: Thu.Feb.16.19:19:18.BEIST.2006.cl_deactivate_vgs.All_volume_groups.res1.ref
Resource offline: All_volume_groups cl_deactivate_vgs
Search on: Thu.Feb.16.19:19:25.BEIST.2006.cl_deactivate_vgs.All_volume_groups.res1.ref
Releasing resource: All_service_addrs release_service_addr
Search on: Thu.Feb.16.19:19:26.BEIST.2006.release_service_addr.All_service_addrs.res1.ref
Resource offline: All_nonerror_service_addrs release_service_addr
Search on:
Thu.Feb.16.19:19:37.BEIST.2006.release_service_addr.All_nonerror_service_addrs.res1.ref
Error encountered with group: res1 node_down_local
Search on: Thu.Feb.16.19:19:38.BEIST.2006.node_down_local.res1.ref
----------------------------------------------------------------------------
Feb 16 19:19:39 EVENT START: event_error 1 1_rg_move_release yhdsb01 1 _1
:event_error[52] [[ high = high ]]
:event_error[52] version=1.10
:event_error[53] :event_error[53] cl_get_path
HA_DIR=es
:event_error[55] EXIT_STATUS=1
:event_error[56] RP_NAME=1 1_rg_move_release yhdsb01 1 _1
:event_error[59] [ 2 -ne 2 ]
:event_error[65] set -u
:event_error[67] RP_NAME=rg_move_release yhdsb01 1 _1
:event_error[68] RP_NAME=rg_move_release yhdsb01 1
:event_error[70] :event_error[70] cllsclstr -c
:event_error[70] cut -d : -f2
:event_error[70] grep -v cname
CLUSTER=yhdsbclu
:event_error[74] [ -x /usr/lpp/ssp/bin/spget_syspar ]
:event_error[81] echo WARNING: Cluster yhdsbclu Failed while running rg_move_release
yhdsb01 1 , exit status was 1
:event_error[81] 1> /dev/console
:event_error[82] echo WARNING: Cluster yhdsbclu Failed while running rg_move_release
yhdsb01 1 , exit status was 1
WARNING: Cluster yhdsbclu Failed while running rg_move_release yhdsb01 1 , exit status
was 1
:event_error[88] [[ rg_move_release yhdsb01 1  = reconfig_resource* ]]
Feb 16 19:19:39 EVENT FAILED:-1: event_error 1 1_rg_move_release yhdsb01 1 _1
WARNING: Cluster yhdsbclu has been running recovery program
'/usr/es/sbin/cluster/events/rg_move.rp' for 13620 seconds. Please check cluster status.
Feb 16 19:25:13 EVENT START: config_too_long 360 /usr/es/sbin/cluster/events/rg_move.rp
:config_too_long[64] [[ high = high ]]
:config_too_long[64] version=1.11
:config_too_long[65] :config_too_long[65] cl_get_path
HA_DIR=es
:config_too_long[67] NUM_SECS=360
:config_too_long[68] EVENT=/usr/es/sbin/cluster/events/rg_move.rp
:config_too_long[70] HOUR=3600
:config_too_long[71] THRESHOLD=5
:config_too_long[72] SLEEP_INTERVAL=1
:config_too_long[78] PERIOD=30
:config_too_long[81] set -u
:config_too_long[86] LOOPCNT=0
:config_too_long[87] MESSAGECNT=0
:config_too_long[88] :config_too_long[88] cllsclstr -c
:config_too_long[88] cut -d : -f2
:config_too_long[88] grep -v cname
CLUSTER=yhdsbclu
:config_too_long[89] TIME=360
:config_too_long[90] sleep_cntr=0
:config_too_long[95] [ -x /usr/lpp/ssp/bin/spget_syspar ]
WARNING: Cluster yhdsbclu has been running recovery program
'/usr/es/sbin/cluster/events/rg_move.rp' for 360 seconds. Please check cluster status.
2. The log shows two kinds of errors:
--1. Errors concerning /insur_arch:
res1:cl_activate_fs[240] /usr/sbin/fsck -f -p -o nologredo /dev/insur_arch
/dev/rinsur_arch:
Not a recognized filesystem type. (TERMINATED
)
res1:cl_activate_fs[85] mount /insur_arch
mount: /dev/insur_arch on /insur_arch: Invalid argument

res1:cl_activate_fs[87] [[ fsck == logredo ]]
res1:cl_activate_fs[107] cl_RMupdate resource_error /insur_arch cl_activate_fs
Reference string: Thu.Feb.16.14:14:16.BEIST.2006.cl_activate_fs..insur_arch.res1.ref
res1:cl_activate_fs[108] cl_echo 10 'cl_activate_fs: Failed mount of /insur_arch.' cl_activate_fs /insur_arch

res1:cl_echo[49] version=1.13
res1:cl_echo[98] HACMP_OUT_FILE=/tmp/hacmp.out
Feb 16 2006 14:14:17 cl_activate_fs: Failed mount of /insur_arch.res1:cl_activate_fs[109] STATUS=1

---According to /usr/es/sbin/cluster/events/utils/cl_activate_fs, STATUS=1 means: one or more filesystems failed to fsck or mount (/insur_arch was damaged and could not be mounted).
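The failing step can be seen in miniature: cl_activate_fs runs fsck and then mount for every filesystem in the resource group and exits 1 if any of them fails. This is a simplified sketch of that control flow, with fsck/mount replaced by stubs so it runs off AIX; the stub makes only insur_arch fail, mirroring the incident.

```shell
# run_fsck/run_mount are stubs; on the cluster they would be
#   fsck -f -p -o nologredo /dev/$fs     and     mount /$fs
run_fsck()  { [ "$1" != insur_arch ]; }   # stub: only insur_arch fails
run_mount() { [ "$1" != insur_arch ]; }   # stub
STATUS=0
for fs in insur_log1 insur_log2 insur_log3 insur_data insur_arch; do
    if run_fsck "$fs" && run_mount "$fs"; then
        echo "mounted /$fs"
    else
        echo "ERROR: fsck/mount of /$fs failed"
        STATUS=1      # a single bad filesystem makes the whole event fail
    fi
done
echo "cl_activate_fs would exit $STATUS"
```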

--2. Errors concerning the cluster state:
WARNING: Cluster yhdsbclu has been running recovery program '/usr/es/sbin/cluster/events/rg_move.rp' for 360 seconds. Please check cluster status.

Check the cluster state:
#/usr/sbin/cluster/clinfo
#/usr/sbin/cluster/clstat
On node A: boot IP, standby IP, service IP and tty were all down.
On node B: standby IP and service IP were up; boot IP and tty were down.
Node A was rebooted and HA restarted, but the problem persisted.

3. Since this is a production system that had to be brought back as quickly as possible, the plan was to vary on vgdb1 by hand, start the database services, bind the service IP, and start the application.
varyonvg vgdb1:
#varyonvg vgdb1
#lsvg -l vgdb1
insur_log1   /insur_log1
insur_log2   /insur_log2
insur_log3   /insur_log3
insur_arch   /insur_arch   -- this is the damaged filesystem
insur_data   /insur_data
Mount each of the filesystems above:
#mount /insur_log1
ok
#mount /insur_log2
ok
#mount /insur_log3
ok
#mount /insur_data
ok
#mount /insur_arch
mount:0506-324 cannot mount /dev/insur_arch on /insur_arch:
A system call received a parameter that is not valid.
/insur_arch could not be mounted because the erroneous operation had destroyed it; this is the root cause of the outage. It was set aside for the moment.
Start the Oracle services:
#su - oracle
$sqlplus  "/as sysdba"
sql>startup mount insur
sql>exit
$lsnrctl status
$sqlplus "/as sysdba"
> archive log list;    
Database log mode    Archive Mode    
Automatic archival    Enabled    
Archive destination   /insur_arch/archive   -- the damaged filesystem holds Oracle's archived logs
Oldest online log sequence   565    
Next log sequence to archive  567    
Current log sequence       567
> archive log stop;
Bind the service IP:
#ifconfig en1 10.81.193.8 255.255.255.0 alias
Start the application: failed.
It looked like we would need to edit Oracle's configuration files, disable automatic archiving, and restart the Oracle services before the application could run normally; unfortunately I am not that familiar with Oracle. But on reflection, even if the application did run this way, it would neither fully fix the problem nor satisfy the customer's requirement for automatic log archiving.
4. The failure was caused by the damaged /insur_arch filesystem failing to mount: when the mount failed, HA ran the rg_move, rg_move_release, node_down_local and stop_server events, which stopped the application. Since /insur_arch is only used to store Oracle's archived logs, the plan (with the customer's consent) was: delete the /insur_arch filesystem, recreate it, synchronize HA, and restore the application.
-1. Record the attributes of the /insur_arch filesystem
-2. Remove the /insur_arch filesystem:
#smitty fs
-3. Recreate the /insur_arch filesystem and restore its attributes:
#smitty mklv
#mkdir /insur_arch
#smitty fs
#mkdir /insur_arch/archive
#chown -R oracle:dba /insur_arch
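For reference, the smitty steps above have non-interactive equivalents. This is a sketch only: the logical-partition count (60 LPs) and the other parameters are assumptions and must come from the attributes recorded in step -1. The guard makes the script a harmless no-op on systems without the AIX LVM tools.

```shell
# Command-line equivalents of the smitty screens (assumed parameters).
if command -v mklv >/dev/null 2>&1 && command -v crfs >/dev/null 2>&1; then
    mklv -t jfs -y insur_arch vgdb1 60                # recreate the LV (size assumed)
    crfs -v jfs -d insur_arch -m /insur_arch -A no    # -A no: HACMP mounts it, not init
    mount /insur_arch
    mkdir -p /insur_arch/archive                      # Oracle's archive destination
    chown -R oracle:dba /insur_arch
    status=recreated
else
    status=skipped                                    # not on AIX; shown for reference
fi
echo "$status"
```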
5. Synchronize HA, start HA, and test takeover
-1. Stop HA on both node A and node B
-2. Synchronize HA and start it
-3. The application came back normally and takeover tests passed
6. Fault resolved.
Note: /insur_arch could not be mounted most likely because its superblock was destroyed. Given the chance, I would like to reproduce this situation and try to recover /insur_arch by restoring the superblock instead of deleting and recreating the filesystem; that would be the better outcome.
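On AIX JFS there is in fact a documented way to attempt this: a backup copy of the superblock is kept at block 31 of the logical volume, and it can be copied over the damaged primary superblock (block 1) with dd, after which fsck and mount are retried. On the real LV that would be roughly `dd count=1 bs=4k skip=31 seek=1 if=/dev/insur_arch of=/dev/insur_arch` followed by `fsck /dev/insur_arch` and `mount /insur_arch`. The sketch below performs that exact block copy on a scratch file standing in for the device, so it is safe to run anywhere.

```shell
# Simulate the JFS backup-superblock copy (block 31 -> block 1) on a file.
dev=$(mktemp)                                         # stand-in for /dev/insur_arch
dd if=/dev/zero of="$dev" bs=4k count=32 2>/dev/null  # 32 blocks of "device"
printf 'GOOD-SB' | dd of="$dev" bs=4k seek=31 conv=notrunc 2>/dev/null  # backup superblock
# The recovery copy itself (conv=notrunc is only needed because this is a file,
# not a raw device):
dd count=1 bs=4k skip=31 seek=1 if="$dev" of="$dev" conv=notrunc 2>/dev/null
restored=$(dd if="$dev" bs=4k skip=1 count=1 2>/dev/null | head -c 7)
echo "block 1 now starts with: $restored"
rm -f "$dev"
```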


This article comes from a ChinaUnix blog; the original is at: http://blog.chinaunix.net/u/5038/showart_75911.html