Chinaunix

Title: PowerHA 5.5: resource group fails to fall back, advice wanted (solved)

Author: lanfeng356 Time: 2011-06-15 15:39
Last edited by lanfeng356 on 2011-06-16 11:35

1. Platform:
Host: IBM P6 550
Operating system: AIX 6100-06
Cluster: PowerHA 5.5

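2. Symptoms:
Resource group 1 (containing a single-instance database) on host A can be moved to host B.
Host B cannot move resource group 1 (containing a single-instance database) back to host A.
(The cluster configuration has been synchronized; the start and stop scripts are identical on both nodes, with identical execute permissions.)
At this point cluster services on host B cannot be stopped.
Host B:root:/hacmp>lssrc -ls clstrmgrES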
Current state: ST_RP_FAILED
sccsid = "@(#)36 1.135.5.2 src/43haes/usr/sbin/cluster/hacmprd/main.C, hacmp.pe, 53haes_r550, 0934B_hacmp550 8/8/09 14:48:23"
i_local_nodeid 1, i_local_siteid -1, my_handle 2
ml_idx[1]=0    ml_idx[2]=1
tp is 20714628
Events on event queue:
te_type 36, te_nodeid 2, te_network 1
There are 0 events on the Ibcast queue
There are 0 events on the RM Ibcast queue
CLversion: 10
local node vrmf is 5506
cluster fix level is "6"
The following timer(s) are currently active:
Event error node list: node_B
Current DNP values
DNP Values for NodeId - 1  NodeName - node_A
PgSpFree = 2222330  PvPctBusy = 0  PctTotalTimeIdle = 369.378540
DNP Values for NodeId - 2  NodeName - node_B
PgSpFree = 2224876  PvPctBusy = 0  PctTotalTimeIdle = 365.773210
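
For scripted checks, the state line can be pulled out of output like the above. This is a minimal sketch that assumes the output format shown here; the function name clstrmgr_state is made up for illustration, and on a cluster node the input would come from lssrc -ls clstrmgrES rather than from printf.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the "Current state:" line
# from lssrc -ls clstrmgrES output supplied on stdin.
clstrmgr_state() {
    sed -n 's/^Current state:[[:space:]]*//p' | head -n 1
}

# Demonstration with two lines taken from the output above; on a cluster
# node one would pipe in `lssrc -ls clstrmgrES` instead.
state=$(printf '%s\n' 'Current state: ST_RP_FAILED' 'CLversion: 10' | clstrmgr_state)
if [ "$state" = "ST_STABLE" ]; then
    echo "cluster manager is stable"
else
    echo "cluster manager needs attention: $state"
fi
```

Comparing against ST_STABLE rather than matching specific failure states keeps the check conservative: anything other than the stable state is reported.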

Error reported when moving the resource group:
Command: failed       stdout: yes          stderr: no

Before command completion, additional instructions may appear below.

Attempting to move resource group RG1 to node A.

Waiting for the cluster to process the resource group movement request....

Waiting for the cluster to stabilize...........

ERROR: Event processing has failed for the requested resource
group movement.  The cluster is unstable and requires manual intervention
to continue processing.

Checking the cluster status:
Resource Group Name: RG1
Startup Policy: Online On Home Node Only
Fallover Policy: Fallover To Next Priority Node In The List
Fallback Policy: Fallback To Higher Priority Node In The List
Site Policy: ignore
Primary instance(s):
The following node temporarily has the highest priority for this instance:
A, user-requested rg_move performed on Mon Jun 13 18:03:22 2011

Node                      Group State
---------------------------- ---------------
A                               OFFLINE
B                               ERROR

Only after running shutdown -Fr on host B does host A automatically take over resource group RG1 again.

Resource group 2 on host B (a single floating IP only) can be moved to host A.
Host A can move resource group 2 (a single floating IP only) back to host B.

3. Error logs

hacmp.rar (39.63 KB, downloads: 74)

4. A separate problem: clstat does not work
Running clstat reports:
emacdb2:root:/usr/es/sbin/cluster>./clstat
Failed retrieving cluster information.

There are a number of possible causes:
clinfoES or snmpd subsystems are not active.
snmp is unresponsive.
snmp is not configured correctly.
Cluster services are not active on any nodes.

Refer to the HACMP Administration Guide for more information.
Additional information for verifying the SNMP configuration on AIX 6
can be found in /usr/es/sbin/cluster/README5.5.0.UPDATE

Following the hint in /usr/es/sbin/cluster/README5.5.0.UPDATE:

Add the following line to /etc/snmpdv3.conf:
VACM_VIEW defaultView 1.3.6.1.4.1.2.3.1.2.1.5 - included -

Then restart the snmp service:
1) stopsrc -s snmpd
2) startsrc -s snmpd
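
The config edit above can be made repeatable so that the line is not added twice on a retry. This is a minimal sketch; ensure_line is a made-up helper, not part of AIX or PowerHA, and the demonstration writes to a scratch file instead of /etc/snmpdv3.conf.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a file only if that exact line
# is not already present.
ensure_line() {
    file=$1
    line=$2
    # -F: fixed string, -x: match the whole line, -q: no output
    if grep -Fxq -- "$line" "$file" 2>/dev/null; then
        echo "already present in $file"
    else
        printf '%s\n' "$line" >> "$file"
        echo "added to $file"
    fi
}

# Demonstration on a scratch file; on the cluster node the target would be
# /etc/snmpdv3.conf, followed by the stopsrc/startsrc steps above.
conf=${TMPDIR:-/tmp}/snmpdv3.conf.demo.$$
: > "$conf"
ensure_line "$conf" 'VACM_VIEW defaultView 1.3.6.1.4.1.2.3.1.2.1.5 - included -'
ensure_line "$conf" 'VACM_VIEW defaultView 1.3.6.1.4.1.2.3.1.2.1.5 - included -'
```

The second call finds the line and leaves the file alone, so the file ends up with exactly one copy.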

The same error is still reported.

Any pointers from the experts would be appreciated, thanks!

Author: yclhyhy Time: 2011-06-15 16:20
Please post the resource group's policies.

Author: lanfeng356 Time: 2011-06-15 16:26
Reply to 2# yclhyhy

[TOP]                                                 [Entry Fields]
  Resource Group Name                               RG1
  Participating Nodes (Default Node Priority)       node_A node_B

  Startup Policy                                     Online On Home Node Only
  Fallover Policy                                  Fallover To Next Priority Node In The List
  Fallback Policy                                  Fallback To Higher Priority Node In The List
  Fallback Timer Policy (empty is immediate)       []                                                                      +

  Service IP Labels/Addresses                      [node_A_svc]                                                          +
  Application Servers                               [app]                                                                   +

  Volume Groups                                     [datavg ]                                                             +
  Use forced varyon of volume groups, if necessary false                                                                +
  Automatically Import Volume Groups                false                                                                +
  Filesystems (empty is ALL for VGs specified)    [ ]                                                                   +
  Filesystems Consistency Check                      fsck                                                                   +
  Filesystems Recovery Method                      sequential                                                             +
  Filesystems mounted before IP configured          false                                                                +

Author: tianyue01 Time: 2011-06-15 16:38
Online On Home Node Only
So that is why it does not TKO, is that right?

Author: lanfeng356 Time: 2011-06-15 16:49
Reply to 4# tianyue01

What is TKO?

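Author: yclhyhy Time: 2011-06-15 16:55
It looks like when moving back from 2 to 1, app1 was not brought down completely and datavg was not varied off (varyoffvg) successfully, so the move hung on 2.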

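Author: lanfeng356 Time: 2011-06-15 17:07
Reply to 6# yclhyhy

That seems to be the case, but moving from node_A to node_B works fine. I also think the VG is causing the problem: node_B has a resource group with only a floating IP, and moving that one to node_A works without problems.

So what should I do? Is the problem caused by the scripts, or by snmp?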

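Author: yclhyhy Time: 2011-06-15 17:12
Did you also create an /arch1 filesystem on datavg??

Testing whether the scripts are the cause is simple: first remove the scripts from the HACMP configuration, then test the HA move and see whether the address and resources switch normally between A and B. If that succeeds, look for the cause in the scripts, for example the order in which the application is brought down.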
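
Alongside removing the scripts from the configuration, each script can also be run by hand and its exit status inspected, since a stop script that exits non-zero is a likely reason for the event to fail. This is a minimal sketch; run_and_report is a made-up helper, and the stand-in script below is not the poster's real stop script.

```shell
#!/bin/sh
# Hypothetical helper: run a script, capture its output, and report
# its exit status.
run_and_report() {
    script=$1
    out=$(sh "$script" 2>&1)
    rc=$?
    echo "$script exited with status $rc"
    [ -n "$out" ] && printf '%s\n' "$out"
    return "$rc"
}

# Stand-in stop script that fails because its log file cannot be opened.
demo=${TMPDIR:-/tmp}/stop_demo.$$.sh
printf '%s\n' 'echo "stopping" >> /nonexistent_dir/cluster.log' > "$demo"
run_and_report "$demo" || echo "stop script failed; fix this before re-adding it to the cluster"
rm -f "$demo"
```

Running the scripts as root from the command line is closer to how the cluster invokes them than running them as the application user, so that is the more useful way to reproduce a failure.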

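Author: lanfeng356 Time: 2011-06-15 22:30
Last edited by lanfeng356 on 2011-06-16 11:34

Reply to 1# lanfeng356

I ran some more tests. With the cluster start and stop scripts commented out, taking the resources offline and online on node_A works without problems.

Without the scripts, moving the resources back also works.

So the scripts are to blame.

The cause was eventually found: a file permission problem.

On node_B, start.sh and stop.sh write their log output to cluster.log. That file has mode 755 and is owned by root. When stopping the database, the script does su to the oracle user, which has no permission to write to the log. The script therefore fails when the cluster runs it during the move, and the resource group cannot be moved.
On node_A, cluster.log has mode 777, so moving in that direction works.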
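
One way to keep a logging problem like this from failing the whole stop script is to test writability first and fall back to stderr. This is a minimal sketch, not the poster's actual scripts; log_msg and the paths are made up for illustration. Correcting the ownership or mode of cluster.log on node_B remains the direct fix.

```shell
#!/bin/sh
# Hypothetical logging helper for start/stop scripts: a logging failure
# must never change the script's exit status.
log_msg() {
    logfile=$1
    shift
    if [ -w "$logfile" ] || { [ ! -e "$logfile" ] && [ -w "$(dirname "$logfile")" ]; }; then
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$logfile"
    else
        # The current user cannot write the log: report on stderr and carry on.
        printf 'cannot write %s: %s\n' "$logfile" "$*" >&2
    fi
    return 0
}

# Writable location: the message is appended to the file.
log=${TMPDIR:-/tmp}/cluster_demo.$$.log
log_msg "$log" "stopping database"
# Unwritable location (the directory does not exist): the message goes to
# stderr and the caller still sees success.
log_msg /nonexistent_dir/cluster.log "stopping database" && echo "script continues"
```

Because the check runs as whichever user calls log_msg, it also covers the case in this thread, where the write happens after su to another user.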

Author: diyxyj Time: 2011-06-16 20:24
Haha, file ownership and file permissions really do matter....

Author: 午夜幽魂 Time: 2011-06-18 10:54
It was all trouble caused by permissions,
but it is a good reminder for all of us, haha

Author: aiwsuoai Time: 2011-06-20 14:49
Learned something.

Author: bj319 Time: 2012-06-25 14:32

Learned something.

Welcome to Chinaunix (http://bbs.chinaunix.net/)