jshywh 发表于 2013-12-27 12:02

hp-ux orcle10G CRS,单节点VIP与lsnr中断

先贴日志,再说明:


出故障的二号机crsd日志
2013-12-24 03:06:19.349: CheckResource error for ora.hzdb2.vip error code = 1
2013-12-24 03:06:19.374: In stateChanged, ora.hzdb2.vip target is ONLINE
2013-12-24 03:06:19.375: ora.hzdb2.vip on hzdb2 went OFFLINE unexpectedly
2013-12-24 03:06:19.375: StopResource: setting CLI values
2013-12-24 03:06:19.378: Attempting to stop `ora.hzdb2.vip` on member `hzdb2`
2013-12-24 03:06:20.127: Stop of `ora.hzdb2.vip` on member `hzdb2` succeeded.
2013-12-24 03:06:20.128: ora.hzdb2.vip RESTART_COUNT=0 RESTART_ATTEMPTS=0
2013-12-24 03:06:20.138: ora.hzdb2.vip failed on hzdb2 relocating.
2013-12-24 03:06:20.514: StopResource: setting CLI values
2013-12-24 03:06:20.519: Attempting to stop `ora.hzdb2.LISTENER_HZDB2.lsnr` on member `hzdb2`
2013-12-24 03:07:38.398: Stop of `ora.hzdb2.LISTENER_HZDB2.lsnr` on member `hzdb2` succeeded.
2013-12-24 03:07:38.419: Attempting to start `ora.hzdb2.vip` on member `hzdb1`
2013-12-24 03:07:45.304: Start of `ora.hzdb2.vip` on member `hzdb1` succeeded.
2013-12-24 10:13:28.136: StopResource: setting CLI values
2013-12-24 10:13:28.170: Attempting to stop `ora.hzdb2.vip` on member `hzdb1`
2013-12-24 10:13:28.630: Stop of `ora.hzdb2.vip` on member `hzdb1` succeeded.
2013-12-24 10:13:28.634: startRunnable: setting CLI values
2013-12-24 10:13:28.635: Attempting to start `ora.hzdb2.vip` on member `hzdb2`
2013-12-24 10:13:35.756: Start of `ora.hzdb2.vip` on member `hzdb2` succeeded.
2013-12-24 10:13:35.767: startRunnable: setting CLI values
2013-12-24 10:13:35.777: Attempting to start `ora.hzdb2.LISTENER_HZDB2.lsnr` on member `hzdb2`
2013-12-24 10:13:36.844: Start of `ora.hzdb2.LISTENER_HZDB2.lsnr` on member `hzdb2` succeeded.
2013-12-24 10:13:36.979: CRS-1002: Resource 'ora.hzdb2.LISTENER_HZDB2.lsnr' is already running on member 'hzdb2'

主机报警日志
Tue Dec 24 03:07:30 2013
ALTER SYSTEM SET service_names='' SCOPE=MEMORY SID='hzidcdb2';
Tue Dec 24 03:07:31 2013
Immediate Kill Session#: 761, Serial#: 7085
Immediate Kill Session: sess: c0000000f6e949d0OS pid: 19319
Immediate Kill Session#: 774, Serial#: 16698
Immediate Kill Session: sess: c0000000f6ea6018OS pid: 18857
Immediate Kill Session#: 783, Serial#: 10189
Immediate Kill Session: sess: c0000000f6eb20c0OS pid: 23985
Immediate Kill Session#: 784, Serial#: 7689
Immediate Kill Session: sess: c0000000f6eb3628OS pid: 18784
Immediate Kill Session#: 785, Serial#: 18763
Immediate Kill Session: sess: c0000000f6eb4b90OS pid: 18642
Immediate Kill Session#: 789, Serial#: 20
Immediate Kill Session: sess: c0000000f6eba130OS pid: 4578
Immediate Kill Session#: 796, Serial#: 1797
Immediate Kill Session: sess: c0000000f6ec3708OS pid: 20036
Immediate Kill Session#: 810, Serial#: 10142
Immediate Kill Session: sess: c0000000f6ed62b8OS pid: 18941


Ora.hzdb2.vip.log文件日志
2013-12-24 03:06:19.324: [    RACG] : clsrcexecut: env ORACLE_CONFIG_HOME=/oracle/product/10.2/crs

2013-12-24 03:06:19.343: [    RACG] : clsrcexecut: cmd = /oracle/product/10.2/crs/bin/racgeut -e _USR_ORA_DEBUG=0 54 /oracle/product/10.2/crs/bin/racgvip check hzdb2

2013-12-24 03:06:19.343: [    RACG] : clsrcexecut: rc = 1, time = 0.554s

2013-12-24 03:06:19.343: [    RACG] : end for resource = ora.hzdb2.vip, action = check, status = 1, time = 0.638s

2013-12-24 10:13:37.040: [    RACG] : clsrcstartorp: Error with malloc



全部于12月24日志凌晨出现故障,表现方式为2号小机的ora.hzdb2.LISTENER_HZDB2.lsnr offline,应用方面会报错,需重启应用,或者手动启动ora.hzdb2.LISTENER_HZDB2.lsnr资源方能解决。此故障近期内出现了2次。求告知可能是什么原因
机器为HP-ux 11i

www_xylove 发表于 2013-12-27 23:49

数据库版本是什么?

故障诊断:
2号节点的本地监听出现问题,2号节点监听断了之后,2号节点的vip地址就飘到1号节点上去了。问题是,2号节点监听为什么会断掉,从告警日志:

Tue Dec 24 03:07:30 2013
ALTER SYSTEM SET service_names='' SCOPE=MEMORY SID='hzidcdb2';
Tue Dec 24 03:07:31 2013
Immediate Kill Session#: 761, Serial#: 7085
Immediate Kill Session: sess: c0000000f6e949d0OS pid: 19319
Immediate Kill Session#: 774, Serial#: 16698
....
.....
这一段非常重要。
ALTER SYSTEM SET service_names='' SCOPE=MEMORY SID='hzidcdb2';
看到了没有,service_names被置空了!这里是导致监听断掉的根本原因,因为,监听器进程无法注册到数据库服务名。
从监听日志看,应该是监听dead了;从告警日志看,是大量的连接到2号节点的session被kill掉了,应用无法连接,最终影响到业务。

从metalink查看:service_names被置空有2个原因,一个是bug,另外一个是service_names与db_name名称相同。

解决方法:
1.如果数据库版本是10.2.0.4以下,则升级到10.2.0.5
2.将service_names与db_name名称设置不相同


再看看
2号节点的监听日志
2号节点的系统日志(/var/adm/syslog/syslog.log)




jshywh 发表于 2013-12-30 10:24

回复 2# www_xylove


    OK,看到了,版本是10.2的,小版本回头看一下,上个周六2号机的监听又停掉了。现在就让它停着,不准备启来。解决问题后再处理,因为应用是weblogic挂的php,不知道是配置问题还是什么原因。出现这种问题的时候,就会报连接错误,我估计是不是轮询负载,然后还依然往db2上的监听去连接。连不上就报错。其它日志过会去看看

www_xylove 发表于 2014-01-06 22:17

楼主,问题解决了吗

jshywh 发表于 2014-01-07 09:52

还没有呢。前段时间比较慢,2号机的监听一直停着。单机暂地在跑。今天早上过来看看测试一下!

jshywh 发表于 2014-01-08 15:00

回复 2# www_xylove


版主在么?重新抓了下监听日志与系统日志,发现系统日志中一个问题。感觉可能跟这次故障有关系,如下:

   


    2号机:listener



24-DEC-2013 03:07:37 * node_down * hzdb2 * 0
24-DEC-2013 03:07:37 * service_down * hzidcdb2 * 0
24-DEC-2013 03:07:37 * service_down * hzidcdb2 * 0
No longer listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.75.0.24)(P
ORT=1521)))
No longer listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.75.0.22)(P
ORT=1521)))
Listener completed notification to CRS on stop
24-DEC-2013 03:07:37 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=HZDB2)(USER=oracle))(C
OMMAND=stop)(ARGUMENTS=64)(SERVICE=LISTENER_HZDB2)(VERSION=169870336)) * stop *
0

TNSLSNR for HPUX: Version 10.2.0.4.0 - Production on 24-DEC-2013 10:13:36

Copyright (c) 1991, 2007, Oracle.All rights reserved.

System parameter file is /oracle/product/10.2/db/network/admin/listener.ora
Log messages written to /oracle/product/10.2/db/network/log/listener_hzdb2.log
Trace information written to /oracle/product/10.2/db/network/trace/listener_hzdb
2.trc
Trace level is currently 0
Started with pid=15337
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.75.0.24)(PORT=1521))
)
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.75.0.22)(PORT=1521))
)
Listener completed notification to CRS on start

TIMESTAMP * CONNECT DATA [* PROTOCOL INFO] * EVENT [* SID] * RETURN CODE
WARNING: Subscription for node down event still pending
24-DEC-2013 10:13:36 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=HZDB2)(USER=oracle))(C
OMMAND=status)(ARGUMENTS=64)(SERVICE=LISTENER_HZDB2)(VERSION=169870336)) * statu

s * 0



2号机主机日志:
Dec 24 02:54:58 HZDB2 cmnetd: lan1 is down at the data link layer.
Dec 24 02:54:58 HZDB2 cmnetd: lan1 failed.
Dec 24 02:54:58 HZDB2 cmnetd: Subnet 172.75.0.0 down
Dec 24 02:55:10 HZDB2 cmnetd: lan1 is up at the data link layer.
Dec 24 02:55:10 HZDB2 cmnetd: lan1 recovered.
Dec 24 02:55:10 HZDB2 cmnetd: Subnet 172.75.0.0 up
Dec 24 03:04:34 HZDB2 cmnetd: lan1 is down at the data link layer.
Dec 24 03:04:34 HZDB2 cmnetd: lan1 failed.
Dec 24 03:04:34 HZDB2 cmnetd: Subnet 172.75.0.0 down
Dec 24 03:04:56 HZDB2 cmnetd: lan1 is up at the data link layer.
Dec 24 03:04:56 HZDB2 cmnetd: lan1 recovered.
Dec 24 03:04:56 HZDB2 cmnetd: Subnet 172.75.0.0 up
Dec 24 03:05:04 HZDB2 cmnetd: lan1 is down at the data link layer.
Dec 24 03:05:04 HZDB2 cmnetd: lan1 failed.
Dec 24 03:05:04 HZDB2 cmnetd: Subnet 172.75.0.0 down
Dec 24 03:05:18 HZDB2 cmnetd: lan1 is up at the data link layer.
Dec 24 03:05:18 HZDB2 cmnetd: lan1 recovered.
Dec 24 03:05:18 HZDB2 cmnetd: Subnet 172.75.0.0 up
Dec 24 03:06:18 HZDB2 cmnetd: lan1 is down at the data link layer.
Dec 24 03:06:18 HZDB2 cmnetd: lan1 failed.
Dec 24 03:06:18 HZDB2 cmnetd: Subnet 172.75.0.0 down
Dec 24 03:06:34 HZDB2 cmnetd: lan1 is up at the data link layer.
Dec 24 03:06:34 HZDB2 cmnetd: lan1 recovered.
Dec 24 03:06:34 HZDB2 cmnetd: Subnet 172.75.0.0 up
Dec 24 03:53:16 HZDB2 cmnetd: lan1 is down at the data link layer.
Dec 24 03:53:16 HZDB2 cmnetd: lan1 failed.
Dec 24 03:53:16 HZDB2 cmnetd: Subnet 172.75.0.0 down
Dec 24 03:54:10 HZDB2 cmnetd: lan1 is up at the data link layer.
Dec 24 03:54:10 HZDB2 cmnetd: lan1 recovered.
Dec 24 03:54:10 HZDB2 cmnetd: Subnet 172.75.0.0 up
Dec 24 03:54:12 HZDB2 cmnetd: lan1 is down at the data link layer.
Dec 24 03:54:12 HZDB2 cmnetd: lan1 failed.
Dec 24 03:54:12 HZDB2 cmnetd: Subnet 172.75.0.0 down
Dec 24 03:54:56 HZDB2 cmnetd: lan1 is up at the data link layer.
Dec 24 03:54:56 HZDB2 cmnetd: lan1 recovered.
Dec 24 03:54:56 HZDB2 cmnetd: Subnet 172.75.0.0 up


另,oracle的servcename为空,没设置
10.2.0.4.0
SQL> show parameter instance_name;

NAME                                 TYPE      VALUE
------------------------------------ ----------- ------------------------------
instance_name                        string      hzidcdb2

SQL> show parameter service_names;

NAME                                 TYPE      VALUE
------------------------------------ ----------- ------------------------------
service_names                        string
SQL> show parameter db_name;

版本确实为10.2.0.4.0,

www_xylove 发表于 2014-01-08 16:21

其实这个问题,我第一次回答你的时候,已经有了答案了。

jshywh 发表于 2014-01-10 11:10

回复 7# www_xylove
是10.2.0.4 的BUG么?还是第二个原因?


   

www_xylove 发表于 2014-01-10 14:07

回复 8# jshywh

是bug
临时解决方法是将service_names与db_name名称设置不相同
永久解决方法是升级到10.2.0.5

   

jshywh 发表于 2014-01-14 11:22

回复 9# www_xylove

OK,明白了,


   
页: [1]
查看完整版本: hp-ux orcle10G CRS,单节点VIP与lsnr中断