Chinaunix

Two-node HA network problem, help needed! Any expert advice is much appreciated!

Post #1 | Posted 2008-08-12 11:27
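Two Linux machines run as an HA pair, using the HA software bundled with Red Hat. Both nodes are attached to a disk array.
It had always worked fine. Suddenly, today, only one of the two machines is reachable at a time, and they take turns, which is very strange.

serverA    eth0    10.8.28.201
           eth1    192.168.0.201

serverB    eth0    10.8.28.202
           eth1    192.168.0.202

eth0 on both serverA and serverB is plugged into the same switch.
eth1 on both serverA and serverB is the heartbeat link, connected with a direct cable between the two machines.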

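Pinging these two IPs from a third server gives the following.

With both machines' network cables plugged into the switch, at first:
ping 10.8.28.201 succeeds
ping 10.8.28.202 fails

After a while:
ping 10.8.28.202 succeeds
ping 10.8.28.201 also succeeds

A little later:
ping 10.8.28.202 keeps succeeding
ping 10.8.28.201 fails

This cycle repeats over and over: essentially only one machine is reachable on the network at a time, and the other is not.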
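To pin down the alternating pattern described above, it may help to record both addresses side by side over time. The following is only a minimal sketch and not part of the original post: it assumes a Linux `ping` that accepts `-c` and `-W`, and the helper names (`ping_once`, `sample`, `count_flips`, `exclusive_fraction`) are made up for illustration.

```python
import subprocess
from typing import Callable, Dict, List

# eth0 addresses of serverA and serverB, taken from the post
HOSTS = ["10.8.28.201", "10.8.28.202"]


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Send a single ICMP echo using the system ping (Linux iputils flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def sample(hosts: List[str], probe: Callable[[str], bool] = ping_once) -> Dict[str, bool]:
    """Probe every host once and return {host: reachable}."""
    return {host: probe(host) for host in hosts}


def count_flips(samples: List[Dict[str, bool]], host: str) -> int:
    """Count how often one host changed between reachable and unreachable."""
    states = [s[host] for s in samples]
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def exclusive_fraction(samples: List[Dict[str, bool]], a: str, b: str) -> float:
    """Fraction of samples in which exactly one of the two hosts answered."""
    if not samples:
        return 0.0
    exclusive = sum(1 for s in samples if s[a] != s[b])
    return exclusive / len(samples)
```

Calling `sample(HOSTS)` every few seconds from the third server and appending each result to a list gives a timeline. A high `exclusive_fraction` together with repeated flips would match the "only one node reachable at a time" symptom, although it says nothing yet about the cause.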

OS: Red Hat Enterprise Linux AS release 3 (Taroon Update 5)
uname -a output: Linux manager 2.4.21-32.ELsmp #1 SMP Fri Apr 15 21:17:59 EDT 2005 i686 i686 i386 GNU/Linux

HA software versions:
redhat-config-cluster-1.0.8-1
clumanager-1.2.31-1

Could anyone offer some pointers?

Thanks!

Post #13 | Posted 2008-08-15 10:51
The log shows two errors:
Aug 12 00:10:42 jifei1 clusvcmgrd: [1533]: <err> service error: User script '/home/tjjifei/cluster/jifei.sh stop' returned error 1
Aug 12 00:10:42 jifei1 clusvcmgrd: [1533]: <err> service error: -bash: line 1: cd: /home/tjjifei/jifeiapp/jifee: No such file or directory
--------------
1. Does the script work when you run it by hand?
2. Check whether the directory /home/tjjifei/jifeiapp/jifee exists and what its permissions are.
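The second check suggested above can be scripted. This is a rough sketch under assumptions: `check_dir` and `report` are hypothetical helper names, and the path is the one from the quoted error. The log posted in this thread shows the cluster opening a session for user `tjjifei` before running the script, so the check is only meaningful when run as that user, because `os.access` tests the permissions of whoever runs it.

```python
import os

# Directory named in the quoted error message
APP_DIR = "/home/tjjifei/jifeiapp/jifee"


def check_dir(path: str) -> str:
    """Classify a path as seen by the current user."""
    # os.path.exists is also False when a parent directory cannot be searched
    if not os.path.exists(path):
        return "missing"
    if not os.path.isdir(path):
        return "not a directory"
    # cd needs search (execute) permission; read is needed to list contents
    if not os.access(path, os.R_OK | os.X_OK):
        return "no read/execute permission"
    return "ok"


def report(paths):
    """Return {path: status} for several directories at once."""
    return {p: check_dir(p) for p in paths}
```

If `check_dir(APP_DIR)` returns "missing" on a node, that would be consistent with the "No such file or directory" error. One possible explanation, which the log does not confirm, is that the directory lives on the shared disk array and is not mounted on that node when the stop script runs.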

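Post #12 | Posted 2008-08-15 10:30
Originally posted by jerrywjl on 2008-8-14 14:48:

Right, of course everything was fine as long as the heartbeat was never interrupted. The real question is what normally happens once the heartbeat is lost?!

jerrywjl, I can ssh and ftp directly to the heartbeat link addresses without any problem, so the heartbeat link should not be the culprit, right? If the heartbeat link were faulty, those connections would fail.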
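The ssh/ftp check described here can be repeated from a script. A minimal sketch, assuming the heartbeat addresses from the first post (192.168.0.201 and 192.168.0.202) and the standard ssh and ftp ports; `tcp_reachable` and `check_peers` are hypothetical helper names. Note that a successful TCP connection only shows that the link carries IP traffic at that moment. It does not by itself show that the cluster's own heartbeat traffic is healthy.

```python
import socket

# eth1 (heartbeat) addresses from the first post
HEARTBEAT_PEERS = ["192.168.0.201", "192.168.0.202"]
PORTS = [22, 21]  # ssh and ftp, the two services the poster tried


def tcp_reachable(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks
        return False


def check_peers(peers, ports, probe=tcp_reachable):
    """Return {(host, port): reachable} for every combination."""
    return {(h, p): probe(h, p) for h in peers for p in ports}
```

Running `check_peers(HEARTBEAT_PEERS, PORTS)` on each node at intervals, ideally while the ping problem on eth0 is occurring, would show whether the direct link stays up during the outage.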

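Post #11 | Posted 2008-08-14 15:02
Originally posted by jerrywjl on 2008-8-14 14:48:

Right, of course everything was fine as long as the heartbeat was never interrupted. The real question is what normally happens once the heartbeat is lost?!

I don't quite follow, jerrywjl. Did you see in the logs that the heartbeat was lost?

Please advise, thanks.

[Last edited by zhangsuhua on 2008-8-14 15:06]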

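Post #10 | Posted 2008-08-14 14:48
This problem appeared suddenly after everything had been working fine the whole time, so it should have nothing to do with the direct cable.

Right, of course everything was fine as long as the heartbeat was never interrupted. The real question is what normally happens once the heartbeat is lost?!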

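Post #9 | Posted 2008-08-14 14:23
Thanks, everyone!

For now I have shut down one machine and the services are running on a single node. Starting them by hand works without problems.

I still don't know the cause. Could you please take a look? Thanks!

[Last edited by zhangsuhua on 2008-8-14 14:37]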

Post #8 | Posted 2008-08-14 14:22
Log
--------------
Aug 12 00:10:28 jifei1 clumanager: [1248]: <notice> Starting Red Hat Cluster Manager...
Aug 12 00:10:28 jifei1 kernel: Software Watchdog Timer: 0.05, timer margin: 60 sec
Aug 12 00:10:28 jifei1 clumanager: Loading Watchdog Timer (softdog):  succeeded
Aug 12 00:10:28 jifei1 cludb[1265]: <crit> _clu_lock_init: unable to get local member ID
Aug 12 00:10:28 jifei1 cludb[1266]: <crit> _clu_lock_init: unable to get local member ID
Aug 12 00:10:28 jifei1 cludb[1267]: <crit> _clu_lock_init: unable to get local member ID
Aug 12 00:10:28 jifei1 cludb[1268]: <crit> _clu_lock_init: unable to get local member ID
Aug 12 00:10:28 jifei1 cludb[1269]: <crit> _clu_lock_init: unable to get local member ID
Aug 12 00:10:28 jifei1 cluquorumd[1277]: <warning> STONITH: No drivers configured for host '10.1.3.234'!
Aug 12 00:10:28 jifei1 cluquorumd[1277]: <warning> STONITH: Data integrity may be compromised!
Aug 12 00:10:28 jifei1 cluquorumd[1277]: <warning> STONITH: No drivers configured for host '10.1.3.235'!
Aug 12 00:10:28 jifei1 cluquorumd[1277]: <warning> STONITH: Data integrity may be compromised!
Aug 12 00:10:28 jifei1 clumanager: cluquorumd startup succeeded
Aug 12 00:10:30 jifei1 modprobe: modprobe: Can't locate module char-major-10-134
Aug 12 00:10:32 jifei1 kernel: mtrr: no more MTRRs available
Aug 12 00:10:33 jifei1 kernel: mtrr: no more MTRRs available
Aug 12 00:10:39 jifei1 clumembd[1306]: <notice> Member 10.1.3.234 UP
Aug 12 00:10:41 jifei1 cluquorumd[1278]: <notice> Quorum Formed; Starting Service Manager
Aug 12 00:10:41 jifei1 clusvcmgrd: [1373]: <notice> service notice: Stopping service wapftp ...
Aug 12 00:10:41 jifei1 clusvcmgrd: [1373]: <notice> service notice: Running user script '/home/tjjifei/cluster/wapftp.sh stop'
Aug 12 00:10:41 jifei1 su(pam_unix)[1401]: session opened for user tjjifei by (uid=0)
Aug 12 00:10:41 jifei1 su(pam_unix)[1401]: session closed for user tjjifei
Aug 12 00:10:41 jifei1 clusvcmgrd: [1373]: <err> service error: User script '/home/tjjifei/cluster/wapftp.sh stop' returned error 1
Aug 12 00:10:41 jifei1 clusvcmgrd: [1373]: <err> service error: -bash: line 1: cd: /home/tjjifei/wapftpapp/wapftp: No such file or directory
Aug 12 00:10:41 jifei1 clusvcmgrd: [1373]: <err> service error: Cannot stop user script for wapftp
Aug 12 00:10:42 jifei1 clusvcmgrd: [1453]: <notice> service notice: Stopping service jiesuan ...
Aug 12 00:10:42 jifei1 clusvcmgrd: [1453]: <notice> service notice: Running user script '/home/tjjifei/cluster/jiesuan.sh stop'
Aug 12 00:10:42 jifei1 su(pam_unix)[1481]: session opened for user tjjifei by (uid=0)
Aug 12 00:10:42 jifei1 su(pam_unix)[1481]: session closed for user tjjifei
Aug 12 00:10:42 jifei1 clusvcmgrd: [1453]: <err> service error: User script '/home/tjjifei/cluster/jiesuan.sh stop' returned error 1
Aug 12 00:10:42 jifei1 clusvcmgrd: [1453]: <err> service error: -bash: line 1: cd: /home/tjjifei/jiesuanapp/jiesuan/: No such file or directory
Aug 12 00:10:42 jifei1 clusvcmgrd: [1453]: <err> service error: Cannot stop user script for jiesuan
Aug 12 00:10:42 jifei1 clusvcmgrd: [1533]: <notice> service notice: Stopping service jifee ...
Aug 12 00:10:42 jifei1 clusvcmgrd: [1533]: <notice> service notice: Running user script '/home/tjjifei/cluster/jifei.sh stop'
Aug 12 00:10:42 jifei1 su(pam_unix)[1567]: session opened for user tjjifei by (uid=0)
Aug 12 00:10:42 jifei1 su(pam_unix)[1567]: session closed for user tjjifei
Aug 12 00:10:42 jifei1 clusvcmgrd: [1533]: <err> service error: User script '/home/tjjifei/cluster/jifei.sh stop' returned error 1
Aug 12 00:10:42 jifei1 clusvcmgrd: [1533]: <err> service error: -bash: line 1: cd: /home/tjjifei/jifeiapp/jifee: No such file or directory
--------------
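The failing stop scripts, and the directories they could not enter, can be pulled out of a log like the one above mechanically. A small sketch, assuming the `clusvcmgrd` line format seen in this excerpt; the function name `failed_scripts` and the regular expressions are illustrative and were only written against these lines.

```python
import re

# "User script '<path> <action>' returned error <code>"
SCRIPT_ERR = re.compile(
    r"clusvcmgrd: \[(\d+)\]: <err> service error: "
    r"User script '(\S+) (\w+)' returned error (\d+)"
)
# "-bash: line N: cd: <dir>: No such file or directory"
MISSING_DIR = re.compile(
    r"clusvcmgrd: \[(\d+)\]: <err> service error: "
    r"-bash: line \d+: cd: (.+): No such file or directory"
)


def failed_scripts(lines):
    """Group script failures by the bracketed process id in each line."""
    by_pid = {}
    for line in lines:
        m = SCRIPT_ERR.search(line)
        if m:
            pid, script, action, code = m.groups()
            by_pid.setdefault(pid, {}).update(
                script=script, action=action, exit_code=int(code)
            )
            continue
        m = MISSING_DIR.search(line)
        if m:
            pid, path = m.groups()
            by_pid.setdefault(pid, {})["missing_dir"] = path
    return by_pid
```

Fed the excerpt above, this pairs each `... stop` failure with the directory that `cd` could not find, which makes it easier to see that all three services fail in the same way.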

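Post #7 | Posted 2008-08-14 14:20
Originally posted by jerrywjl on 2008-8-13 23:14:
This is a cluster on RHEL3. If the two machines are rebooting alternately, that suggests the two nodes may be fencing each other.
I have never built an RHEL3 cluster, so I can't say for certain, but I think using a direct cable for the heartbeat is definitely not workable, because once the heartbeat is lost and not restored the nodes will fe ...

This HA system has been in use since 2006, and none of the hardware has been touched.

This problem appeared suddenly after everything had been working fine the whole time, so it should have nothing to do with the direct cable.

I now suspect the HA software. Running the startup steps by hand, one at a time, works fine. Judging from the log, it seems unable to find the application start scripts. I don't understand this, since everything works when done manually.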
----------------------------------
Aug 11 23:56:03 jifei1 syslogd 1.4.1: restart.
Aug 11 23:56:03 jifei1 syslog: syslogd startup succeeded
Aug 11 23:56:03 jifei1 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Aug 11 23:56:03 jifei1 kernel: Linux version 2.4.21-32.ELsmp (bhcompile@tweety.build.redhat.com) (gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-52)) #1 SMP Fri Apr 15 21:17:59 EDT 2005
Aug 11 23:56:03 jifei1 kernel: BIOS-provided physical RAM map:
Aug 11 23:56:03 jifei1 kernel:  BIOS-e820: 0000000000000000 - 000000000009f400 (usable)
Aug 11 23:56:03 jifei1 kernel:  BIOS-e820: 000000000009f400 - 00000000000a0000 (reserved)
Aug 11 23:56:03 jifei1 kernel:  BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
Aug 11 23:56:03 jifei1 kernel:  BIOS-e820: 0000000000100000 - 000000007fffa000 (usable)
Aug 11 23:56:03 jifei1 kernel:  BIOS-e820: 000000007fffa000 - 0000000080000000 (ACPI data)
Aug 11 23:56:03 jifei1 kernel:  BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
Aug 11 23:56:03 jifei1 kernel:  BIOS-e820: 00000000fee00000 - 00000000fee10000 (reserved)
Aug 11 23:56:03 jifei1 kernel:  BIOS-e820: 00000000ffc00000 - 0000000100000000 (reserved)
Aug 11 23:56:03 jifei1 kernel: 1151MB HIGHMEM available.
Aug 11 23:56:03 jifei1 syslog: klogd startup succeeded
Aug 11 23:56:03 jifei1 kernel: 896MB LOWMEM available.
Aug 11 23:56:03 jifei1 kernel: found SMP MP-table at 000f4fd0
Aug 11 23:56:03 jifei1 kernel: hm, page 000f4000 reserved twice.
Aug 11 23:56:03 jifei1 kernel: hm, page 000f5000 reserved twice.
Aug 11 23:56:03 jifei1 kernel: hm, page 000f2000 reserved twice.
[...]
Aug 11 23:59:38 jifei1 clusvcmgrd: [1378]: <err> service error: Cannot stop user script for wapftp
Aug 11 23:59:38 jifei1 clusvcmgrd: [1462]: <notice> service notice: Stopping service jiesuan ...
Aug 11 23:59:38 jifei1 clusvcmgrd: [1462]: <notice> service notice: Running user script '/home/tjjifei/cluster/jiesuan.sh stop'
Aug 11 23:59:38 jifei1 su(pam_unix)[1491]: session opened for user tjjifei by (uid=0)
Aug 11 23:59:38 jifei1 su(pam_unix)[1491]: session closed for user tjjifei
Aug 11 23:59:38 jifei1 clusvcmgrd: [1462]: <err> service error: User script '/home/tjjifei/cluster/jiesuan.sh stop' returned error 1
Aug 11 23:59:38 jifei1 clusvcmgrd: [1462]: <err> service error: -bash: line 1: cd: /home/tjjifei/jiesuanapp/jiesuan/: No such file or directory
Aug 11 23:59:38 jifei1 clusvcmgrd: [1462]: <err> service error: Cannot stop user script for jiesuan
Aug 11 23:59:38 jifei1 clusvcmgrd: [1544]: <notice> service notice: Stopping service jifee ...
Aug 11 23:59:38 jifei1 clusvcmgrd: [1544]: <notice> service notice: Running user script '/home/tjjifei/cluster/jifei.sh stop'
Aug 11 23:59:38 jifei1 su(pam_unix)[1572]: session opened for user tjjifei by (uid=0)
Aug 11 23:59:39 jifei1 su(pam_unix)[1572]: session closed for user tjjifei
Aug 11 23:59:39 jifei1 clusvcmgrd: [1544]: <err> service error: User script '/home/tjjifei/cluster/jifei.sh stop' returned error 1
Aug 11 23:59:39 jifei1 clusvcmgrd: [1544]: <err> service error: -bash: line 1: cd: /home/tjjifei/jifeiapp/jifee: No such file or directory
Aug 11 23:59:39 jifei1 clusvcmgrd: [1544]: <err> service error: Cannot stop user script for jifee
----------------------------------

Post #6 | Posted 2008-08-14 12:11
Could this be caused by the watchdog being enabled?

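Post #5 | Posted 2008-08-13 23:14
This is a cluster on RHEL3. If the two machines are rebooting alternately, that suggests the two nodes may be fencing each other.
I have never built an RHEL3 cluster, so I can't say for certain, but I think using a direct cable for the heartbeat is definitely not workable, because once the heartbeat is lost and not restored the nodes will fence each other. So you had better check that heartbeat link right away.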
  
