RHEL 5.5 two-node cluster: node powers off instead of rebooting during network-disconnect (fence) test

Original post, 2011-03-13 15:22:
Problem: two IBM servers are set up as a two-node cluster. During the network-disconnect test, the fenced node simply powers off and never reboots. The logs also show no mutual "fence failed" messages.
Hardware: two IBM x3850 X5 servers running RHEL 5.5, plus two Cisco 3560 switches. Each server has four NICs and two fibre (optical) NICs; the fence device is the IBM IMM.
eth0/eth1 each connect to one of the switches and are bonded as bond0. eth4/eth5 are the two directly cross-connected heartbeat links, bonded as bond1. The two fibre NICs are eth2/eth3.
Earlier we had MAC addresses drifting among the six NICs after a server reboot; that was finally fixed by adding the MAC address to each NIC's configuration file. Could the cluster problem be related to this?
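
For reference, the per-NIC pinning and bond membership would look roughly like this (a sketch only; the MAC value is illustrative, and the file names and directives follow the usual RHEL 5 ifcfg conventions):

# /etc/sysconfig/network-scripts/ifcfg-eth4  (one slave of the heartbeat bond)
DEVICE=eth4
HWADDR=00:1a:64:xx:xx:xx    # actual MAC of eth4; pins the eth4 name to this NIC
BOOTPROTO=none
ONBOOT=yes
MASTER=bond1
SLAVE=yes

# /etc/sysconfig/network-scripts/ifcfg-bond1  (the heartbeat bond itself)
DEVICE=bond1
IPADDR=192.168.142.11
NETMASK=255.255.255.0
BOOTPROTO=none
ONBOOT=yes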
Another eight x3650 servers with the same configuration as the x3850s have already passed this cluster test without any problem.
My configuration:

Hostnames and addresses:
ynrhzf-db1   bond0: 192.168.141.11   bond1: 192.168.142.11
ynrhzf-db2   bond0: 192.168.141.12   bond1: 192.168.142.12



[root@ynrhzf-db1 ~]# cat /etc/hosts

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               localhost.localdomain localhost
192.168.141.11  db1.anypay.yn   ynrhzf-db1
192.168.141.12  db2.anypay.yn   ynrhzf-db2
192.168.141.10  ynrhzf-db                  # floating (service) IP
192.168.142.11  pri-db1
192.168.142.12  pri-db2
192.168.141.103 imm-db1
192.168.141.104 imm-db2


[root@ynrhzf-db1 network-scripts]# service cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
[  OK  ]
[root@ynrhzf-db2 ~]# service cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
[  OK  ]

cluster.conf:
[root@ynrhzf-db1 network-scripts]# cat /etc/cluster/cluster.conf
<?xml version="1.0" ?>
<cluster config_version="3" name="db-cluster">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="db1.anypay.yn" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="imm-db1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="db2.anypay.yn" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="imm-db2"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1">
                <multicast addr="227.0.0.10"/>
        </cman>
        <fencedevices>
                <fencedevice agent="fence_rsa" ipaddr="192.168.141.103" login="USERID" name="imm-db1" passwd="PASSW0RD"/>
                <fencedevice agent="fence_rsa" ipaddr="192.168.141.104" login="USERID" name="imm-db2" passwd="PASSW0RD"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="db-failover" ordered="1" restricted="1">
                                <failoverdomainnode name="db1.anypay.yn" priority="1"/>
                                <failoverdomainnode name="db2.anypay.yn" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="192.168.141.10/24" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="db-failover" name="db-services">
                        <ip ref="192.168.141.10/24"/>
                </service>
        </rm>
</cluster>
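
Besides calling the agent by hand (as in the test results below), the whole fencing path exactly as defined in cluster.conf can be exercised with fence_node; note that this really fences the target node, so only run it during a test window:

[root@ynrhzf-db2 ~]# fence_node db1.anypay.yn

A successful run should show up in /var/log/messages on db2 as fenced reporting fence "db1.anypay.yn" success, the same message that appears in the test log further down.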

Test results:
[root@ynrhzf-db1 network-scripts]# fence_rsa -a 192.168.141.103 -l USERID -p PASSW0RD -o status
Status: ON
[root@ynrhzf-db1 network-scripts]# fence_rsa -a 192.168.141.104 -l USERID -p PASSW0RD -o status
Status: ON

Telnet to the remote management port also works fine: after logging in to either IMM and running the reset command, both servers do reboot.
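
To separate a cluster-side problem from an IMM-side one, it may also be worth asking the agent for an explicit reboot rather than just a status check (a sketch; this really power-cycles the target, so only do it on the standby node):

[root@ynrhzf-db1 network-scripts]# fence_rsa -a 192.168.141.104 -l USERID -p PASSW0RD -o reboot

If the target ends up powered off instead of cycling back on, that would point to the IMM/BIOS power handling rather than the cluster configuration.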






This is the log on db2 from the network-disconnect test on db1:

Mar 12 05:32:04 db2 openais[14653]: [CLM  ] Members Joined:
Mar 12 05:32:04 db2 openais[14653]: [CLM  ]     r(0) ip(192.168.141.11)  
Mar 12 05:32:04 db2 openais[14653]: [SYNC ] This node is within the primary component and will provide service.
Mar 12 05:32:04 db2 openais[14653]: [TOTEM] entering OPERATIONAL state.
Mar 12 05:32:04 db2 openais[14653]: [CLM  ] got nodejoin message 192.168.141.11
Mar 12 05:32:04 db2 openais[14653]: [CLM  ] got nodejoin message 192.168.141.12
Mar 12 05:32:04 db2 openais[14653]: [CPG  ] got joinlist message from node 1
Mar 12 05:32:21 db2 kernel: dlm: Using TCP for communications
Mar 12 05:32:21 db2 kernel: dlm: got connection from 1
Mar 12 05:32:22 db2 clurgmgrd[14710]: <notice> Resource Group Manager Starting
Mar 12 05:36:50 db2 dhclient: DHCPREQUEST on usb0 to 169.254.95.118 port 67
Mar 12 05:36:51 db2 dhclient: DHCPACK from 169.254.95.118
Mar 12 05:36:51 db2 dhclient: bound to 169.254.95.120 -- renewal in 294 seconds.
Mar 12 05:36:57 db2 openais[14653]: [TOTEM] The token was lost in the OPERATIONAL state.
Mar 12 05:36:57 db2 openais[14653]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
Mar 12 05:36:57 db2 openais[14653]: [TOTEM] Transmit multicast socket send buffer size (320000 bytes).
Mar 12 05:36:57 db2 openais[14653]: [TOTEM] entering GATHER state from 2.
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] entering GATHER state from 0.
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] Creating commit token because I am the rep.
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] Saving state aru 3b high seq received 3b
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] Storing new sequence id for ring 18
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] entering COMMIT state.
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] entering RECOVERY state.
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] position [0] member 192.168.141.12:
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] previous ring seq 20 rep 192.168.141.11
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] aru 3b high delivered 3b received flag 1
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] Did not need to originate any messages in recovery.
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] Sending initial ORF token
Mar 12 05:37:17 db2 openais[14653]: [CLM  ] CLM CONFIGURATION CHANGE
Mar 12 05:37:17 db2 openais[14653]: [CLM  ] New Configuration:
Mar 12 05:37:17 db2 kernel: dlm: closing connection to node 1
Mar 12 05:37:17 db2 fenced[14673]: db1.anypay.yn not a cluster member after 0 sec post_fail_delay
Mar 12 05:37:17 db2 openais[14653]: [CLM  ]     r(0) ip(192.168.141.12)  
Mar 12 05:37:17 db2 fenced[14673]: fencing node "db1.anypay.yn"
Mar 12 05:37:17 db2 openais[14653]: [CLM  ] Members Left:
Mar 12 05:37:17 db2 openais[14653]: [CLM  ]     r(0) ip(192.168.141.11)  
Mar 12 05:37:17 db2 openais[14653]: [CLM  ] Members Joined:
Mar 12 05:37:17 db2 openais[14653]: [CLM  ] CLM CONFIGURATION CHANGE
Mar 12 05:37:17 db2 openais[14653]: [CLM  ] New Configuration:
Mar 12 05:37:17 db2 openais[14653]: [CLM  ]     r(0) ip(192.168.141.12)  
Mar 12 05:37:17 db2 openais[14653]: [CLM  ] Members Left:
Mar 12 05:37:17 db2 openais[14653]: [CLM  ] Members Joined:
Mar 12 05:37:17 db2 openais[14653]: [SYNC ] This node is within the primary component and will provide service.
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] entering OPERATIONAL state.
Mar 12 05:37:17 db2 openais[14653]: [CLM  ] got nodejoin message 192.168.141.12
Mar 12 05:37:17 db2 openais[14653]: [CPG  ] got joinlist message from node 2
Mar 12 05:37:28 db2 kernel: igb: eth4 NIC Link is Down
Mar 12 05:37:28 db2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Mar 12 05:37:28 db2 kernel: bonding: bond1: making interface eth5 the new active one.
Mar 12 05:37:29 db2 kernel: igb: eth5 NIC Link is Down
Mar 12 05:37:29 db2 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
Mar 12 05:37:29 db2 kernel: bonding: bond1: now running without any active interface !
Mar 12 05:38:03 db2 ccsd[14647]: Attempt to close an unopened CCS descriptor (3180).
Mar 12 05:38:03 db2 ccsd[14647]: Error while processing disconnect: Invalid request descriptor
Mar 12 05:38:03 db2 fenced[14673]: fence "db1.anypay.yn" success
Mar 12 05:38:04 db2 clurgmgrd[14710]: <notice> Taking over service service:db-services from down member db1.anypay.yn
Mar 12 05:38:06 db2 avahi-daemon[7286]: Registering new address record for 192.168.141.10 on bond0.
Mar 12 05:38:07 db2 clurgmgrd[14710]: <notice> Service service:db-services started
Mar 12 05:41:11 db2 kernel: usb 8-1: new low speed USB device using uhci_hcd and address 2
Mar 12 05:41:12 db2 kernel: usb 8-1: configuration #1 chosen from 1 choice
Mar 12 05:41:12 db2 kernel: input:   USB Keyboard as /class/input/input1
Mar 12 05:41:12 db2 kernel: input: USB HID v1.10 Keyboard [  USB Keyboard] on usb-0000:00:1d.2-1
Mar 12 05:41:12 db2 kernel: input:   USB Keyboard as /class/input/input2
Mar 12 05:41:12 db2 kernel: input: USB HID v1.10 Device [  USB Keyboard] on usb-0000:00:1d.2-1
Mar 12 05:41:32 db2 kernel: usb 8-1: USB disconnect, address 2
Mar 12 05:41:44 db2 dhclient: DHCPREQUEST on usb0 to 169.254.95.118 port 67
Mar 12 05:41:45 db2 dhclient: DHCPACK from 169.254.95.118
Mar 12 05:41:45 db2 dhclient: bound to 169.254.95.120 -- renewal in 252 seconds.
Mar 12 05:42:04 db2 kernel: igb: eth4 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:04 db2 kernel: bonding: bond1: link status definitely up for interface eth4.
Mar 12 05:42:04 db2 kernel: bonding: bond1: making interface eth4 the new active one.
Mar 12 05:42:04 db2 kernel: bonding: bond1: first active interface up!
Mar 12 05:42:04 db2 kernel: igb: eth5 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:05 db2 kernel: bonding: bond1: link status definitely up for interface eth5.
Mar 12 05:42:06 db2 kernel: igb: eth5 NIC Link is Down
Mar 12 05:42:06 db2 kernel: igb: eth4 NIC Link is Down
Mar 12 05:42:06 db2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Mar 12 05:42:06 db2 kernel: bonding: bond1: now running without any active interface !
Mar 12 05:42:06 db2 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
Mar 12 05:42:08 db2 kernel: igb: eth4 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:08 db2 kernel: igb: eth5 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:08 db2 kernel: bonding: bond1: link status definitely up for interface eth4.
Mar 12 05:42:08 db2 kernel: bonding: bond1: making interface eth4 the new active one.
Mar 12 05:42:08 db2 kernel: bonding: bond1: first active interface up!
Mar 12 05:42:08 db2 kernel: bonding: bond1: link status definitely up for interface eth5.
Mar 12 05:42:10 db2 kernel: igb: eth4 NIC Link is Down
Mar 12 05:42:10 db2 kernel: igb: eth5 NIC Link is Down
Mar 12 05:42:10 db2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Mar 12 05:42:10 db2 kernel: bonding: bond1: now running without any active interface !
Mar 12 05:42:10 db2 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
Mar 12 05:42:11 db2 kernel: igb: eth5 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:11 db2 kernel: bonding: bond1: link status definitely up for interface eth5.
Mar 12 05:42:11 db2 kernel: bonding: bond1: making interface eth5 the new active one.
Mar 12 05:42:11 db2 kernel: bonding: bond1: first active interface up!
Mar 12 05:42:11 db2 kernel: igb: eth4 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:11 db2 kernel: bonding: bond1: link status definitely up for interface eth4.
Mar 12 05:42:48 db2 kernel: igb: eth5 NIC Link is Down
Mar 12 05:42:48 db2 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
Mar 12 05:42:48 db2 kernel: bonding: bond1: making interface eth4 the new active one.
Mar 12 05:42:48 db2 kernel: igb: eth4 NIC Link is Down
Mar 12 05:42:48 db2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Mar 12 05:42:48 db2 kernel: bonding: bond1: now running without any active interface !
Mar 12 05:42:49 db2 kernel: igb: eth5 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:49 db2 kernel: bonding: bond1: link status definitely up for interface eth5.
Mar 12 05:42:49 db2 kernel: bonding: bond1: making interface eth5 the new active one.
Mar 12 05:42:49 db2 kernel: bonding: bond1: first active interface up!
Mar 12 05:42:49 db2 kernel: igb: eth4 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:49 db2 kernel: bonding: bond1: link status definitely up for interface eth4.
Mar 12 05:43:27 db2 kernel: igb: eth4 NIC Link is Down
Mar 12 05:43:27 db2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Mar 12 05:43:27 db2 kernel: igb: eth5 NIC Link is Down
Mar 12 05:43:27 db2 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
Mar 12 05:43:27 db2 kernel: bonding: bond1: now running without any active interface !
Mar 12 05:43:29 db2 kernel: igb: eth4 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:43:29 db2 kernel: bonding: bond1: link status definitely up for interface eth4.
Mar 12 05:43:29 db2 kernel: bonding: bond1: making interface eth4 the new active one.
Mar 12 05:43:29 db2 kernel: bonding: bond1: first active interface up!
Mar 12 05:43:30 db2 kernel: igb: eth5 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:43:30 db2 kernel: bonding: bond1: link status definitely up for interface eth5.



I am not sure how much this configuration file matters:
[root@ynrhzf-db1 ~]# cat /etc/modprobe.conf
alias scsi_hostadapter megaraid_sas
alias scsi_hostadapter1 ata_piix
alias scsi_hostadapter2 lpfc
alias eth0 bnx2
alias eth1 bnx2
alias bond0 bonding
options bond0 miimon=100 mode=1
alias bond1 bonding
options bond1 miimon=100 mode=0

### BEGIN UltraPath Driver Comments ###
remove upUpper if [ -d /proc/mpp ] && [ `ls -a /proc/mpp | wc -l` -gt 2 ]; then echo -e "Please Unload Physical HBA Driver prior to unloading upUpper."; else /sbin/modprobe -r --ignore-remove upUpper; fi
# Additional config info can be found in /opt/mpp/modprobe.conf.mppappend.
# The Above config info is needed if you want to make mkinitrd manually.
# Edit the '/etc/modprobe.conf' file and run 'upUpdate' to create Ramdisk dynamically.
### END UltraPath Driver Comments ###
options qla2xxx qlport_down_retry=5
options lpfc lpfc_nodev_tmo=30
alias eth2 e1000e
alias eth3 e1000e
alias eth4 igb
alias eth5 igb
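
(Purely for readability: the two numeric bonding modes above map to these named policies; same settings, just written with the mode names.)

options bond0 miimon=100 mode=active-backup   # mode=1: one active slave, the other on standby
options bond1 miimon=100 mode=balance-rr      # mode=0: round-robin across both slaves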

This problem has been bothering me for many days now. Please help me analyze it!

Reply, 2011-03-13 19:57:
Neither the configuration nor the logs show any obvious problem.

However, in /etc/hosts, change these entries:
192.168.141.11  db1.anypay.yn   ynrhzf-db1
192.168.141.12  db2.anypay.yn   ynrhzf-db2
to:
192.168.141.11  db1.anypay.yn
192.168.141.12  db2.anypay.yn

so that fencing does not go wrong after the network is cut. Also, the heartbeat links must not be directly cross-connected; they have to go through a switch. As for a fenced node powering off instead of rebooting, you will probably have to look into the server's BIOS settings.
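
(After changing /etc/hosts, a quick way to confirm which node name and address the cluster stack actually binds to is cman_tool, assuming the standard RHEL 5.5 cman tools:

[root@ynrhzf-db1 ~]# cman_tool status    # shows the node name and the address cman bound to
[root@ynrhzf-db1 ~]# cman_tool nodes     # lists cluster membership as cman sees it
)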

Reply (original poster), 2011-03-13 22:05:
Thanks, I'll give that a try tomorrow.

Reply (original poster), 2011-03-14 15:57:
I looked through the BIOS but couldn't find anything there that needs changing. Frustrating.

Reply, 2011-03-14 17:02:
The cluster's default action when it calls fence_rsa is reboot, so this shouldn't be something you can configure at the OS level. Keep digging.