Problem: two IBM servers in an HA cluster. During a network-disconnect test, the disconnected node simply powers off and never comes back up, instead of being rebooted. Checking the logs, there are no mutual "fence failed" messages either.
Hardware: two IBM x3850 X5 servers running Red Hat Enterprise Linux 5.5, plus two Cisco 3560 switches. Each server has four network cards and two Fibre Channel cards; the fence device is the IBM IMM.
eth0/eth1 each connect to one of the switches and are bonded as bond0. eth4/eth5 are the heartbeat links, cabled back to back between the two servers and bonded as bond1. eth2/eth3 belong to the two Fibre Channel cards.
We previously saw MAC addresses drifting among the six network interfaces after a server reboot; that was eventually fixed by pinning the MAC address in each NIC's config file. I don't know whether the cluster problem is related to this.
Another eight x3650s with the same configuration as the x3850s have already passed the failover test without any problem.
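For reference, the MAC pinning mentioned above looks roughly like the following ifcfg fragment (illustrative values only; the HWADDR must be each card's real address, and the MASTER/SLAVE lines are the usual RHEL 5 bond-slave settings):

```
# /etc/sysconfig/network-scripts/ifcfg-eth0 (illustrative sketch)
DEVICE=eth0
HWADDR=00:1A:64:XX:XX:XX   # pin this interface name to the card's real MAC
ONBOOT=yes
BOOTPROTO=none
MASTER=bond0
SLAVE=yes
```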
My configuration:
Hostnames:
root@ynrhzf-db1 bond0:192.168.141.11 bond0:192.168.142.11
root@ynrhzf-db2 bond0:192.168.141.12 bond0:192.168.142.12
[root@ynrhzf-db1 ~]# cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
192.168.141.11 db1.anypay.yn ynrhzf-db1
192.168.141.12 db2.anypay.yn ynrhzf-db2
192.168.141.10 ynrhzf-db # floating IP
192.168.142.11 pri-db1
192.168.142.12 pri-db2
192.168.141.103 imm-db1
192.168.141.104 imm-db2
[root@ynrhzf-db1 network-scripts]# service cman start
Starting cluster:
Loading modules... done
Mounting configfs... done
Starting ccsd... done
Starting cman... done
Starting daemons... done
Starting fencing... done
[ OK ]
[root@ynrhzf-db2 ~]# service cman start
Starting cluster:
Loading modules... done
Mounting configfs... done
Starting ccsd... done
Starting cman... done
Starting daemons... done
Starting fencing... done
[ OK ]
The cluster.conf file:
[root@ynrhzf-db1 network-scripts]# cat /etc/cluster/cluster.conf
<?xml version="1.0" ?>
<cluster config_version="3" name="db-cluster">
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="db1.anypay.yn" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="imm-db1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="db2.anypay.yn" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="imm-db2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1">
    <multicast addr="227.0.0.10"/>
  </cman>
  <fencedevices>
    <fencedevice agent="fence_rsa" ipaddr="192.168.141.103" login="USERID" name="imm-db1" passwd="PASSW0RD"/>
    <fencedevice agent="fence_rsa" ipaddr="192.168.141.104" login="USERID" name="imm-db2" passwd="PASSW0RD"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="db-failover" ordered="1" restricted="1">
        <failoverdomainnode name="db1.anypay.yn" priority="1"/>
        <failoverdomainnode name="db2.anypay.yn" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <ip address="192.168.141.10/24" monitor_link="1"/>
    </resources>
    <service autostart="1" domain="db-failover" name="db-services">
      <ip ref="192.168.141.10/24"/>
    </service>
  </rm>
</cluster>
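Since each <clusternode> references its fence device only by name, a typo there silently breaks fencing for that node. As a quick sanity check (a hypothetical helper script, not part of the cluster tools), the node-to-device mapping can be verified by parsing cluster.conf:

```python
# Hypothetical sanity-check script: list which fence device (and agent)
# each cluster node references in cluster.conf, flagging undeclared names.
import xml.etree.ElementTree as ET

def fence_map(conf_text):
    root = ET.fromstring(conf_text)
    # devices declared under <fencedevices>, keyed by name
    devices = dict((d.get("name"), d.get("agent"))
                   for d in root.findall("./fencedevices/fencedevice"))
    mapping = {}
    for node in root.findall("./clusternodes/clusternode"):
        for dev in node.findall("./fence/method/device"):
            name = dev.get("name")
            # "MISSING" marks a reference to a device that was never declared
            mapping[node.get("name")] = (name, devices.get(name, "MISSING"))
    return mapping

# usage (on a cluster node):
#   for node, (dev, agent) in \
#           fence_map(open("/etc/cluster/cluster.conf").read()).items():
#       print("%s -> %s (%s)" % (node, dev, agent))
```

For the file above this reports imm-db1 and imm-db2 with agent fence_rsa for db1 and db2 respectively, so the references themselves look consistent.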
Fence device test results:
[root@ynrhzf-db1 network-scripts]# fence_rsa -a 192.168.141.103 -l USERID -p PASSW0RD -o status
Status: ON
[root@ynrhzf-db1 network-scripts]# fence_rsa -a 192.168.141.104 -l USERID -p PASSW0RD -o status
Status: ON
Telnet to the remote management ports also works fine; after logging in to each, running the reset command reboots both servers successfully.
Log captured on db2 while the network was disconnected on db1:
Mar 12 05:32:04 db2 openais[14653]: [CLM ] Members Joined:
Mar 12 05:32:04 db2 openais[14653]: [CLM ] r(0) ip(192.168.141.11)
Mar 12 05:32:04 db2 openais[14653]: [SYNC ] This node is within the primary component and will provide service.
Mar 12 05:32:04 db2 openais[14653]: [TOTEM] entering OPERATIONAL state.
Mar 12 05:32:04 db2 openais[14653]: [CLM ] got nodejoin message 192.168.141.11
Mar 12 05:32:04 db2 openais[14653]: [CLM ] got nodejoin message 192.168.141.12
Mar 12 05:32:04 db2 openais[14653]: [CPG ] got joinlist message from node 1
Mar 12 05:32:21 db2 kernel: dlm: Using TCP for communications
Mar 12 05:32:21 db2 kernel: dlm: got connection from 1
Mar 12 05:32:22 db2 clurgmgrd[14710]: <notice> Resource Group Manager Starting
Mar 12 05:36:50 db2 dhclient: DHCPREQUEST on usb0 to 169.254.95.118 port 67
Mar 12 05:36:51 db2 dhclient: DHCPACK from 169.254.95.118
Mar 12 05:36:51 db2 dhclient: bound to 169.254.95.120 -- renewal in 294 seconds.
Mar 12 05:36:57 db2 openais[14653]: [TOTEM] The token was lost in the OPERATIONAL state.
Mar 12 05:36:57 db2 openais[14653]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
Mar 12 05:36:57 db2 openais[14653]: [TOTEM] Transmit multicast socket send buffer size (320000 bytes).
Mar 12 05:36:57 db2 openais[14653]: [TOTEM] entering GATHER state from 2.
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] entering GATHER state from 0.
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] Creating commit token because I am the rep.
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] Saving state aru 3b high seq received 3b
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] Storing new sequence id for ring 18
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] entering COMMIT state.
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] entering RECOVERY state.
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] position [0] member 192.168.141.12:
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] previous ring seq 20 rep 192.168.141.11
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] aru 3b high delivered 3b received flag 1
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] Did not need to originate any messages in recovery.
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] Sending initial ORF token
Mar 12 05:37:17 db2 openais[14653]: [CLM ] CLM CONFIGURATION CHANGE
Mar 12 05:37:17 db2 openais[14653]: [CLM ] New Configuration:
Mar 12 05:37:17 db2 kernel: dlm: closing connection to node 1
Mar 12 05:37:17 db2 fenced[14673]: db1.anypay.yn not a cluster member after 0 sec post_fail_delay
Mar 12 05:37:17 db2 openais[14653]: [CLM ] r(0) ip(192.168.141.12)
Mar 12 05:37:17 db2 fenced[14673]: fencing node "db1.anypay.yn"
Mar 12 05:37:17 db2 openais[14653]: [CLM ] Members Left:
Mar 12 05:37:17 db2 openais[14653]: [CLM ] r(0) ip(192.168.141.11)
Mar 12 05:37:17 db2 openais[14653]: [CLM ] Members Joined:
Mar 12 05:37:17 db2 openais[14653]: [CLM ] CLM CONFIGURATION CHANGE
Mar 12 05:37:17 db2 openais[14653]: [CLM ] New Configuration:
Mar 12 05:37:17 db2 openais[14653]: [CLM ] r(0) ip(192.168.141.12)
Mar 12 05:37:17 db2 openais[14653]: [CLM ] Members Left:
Mar 12 05:37:17 db2 openais[14653]: [CLM ] Members Joined:
Mar 12 05:37:17 db2 openais[14653]: [SYNC ] This node is within the primary component and will provide service.
Mar 12 05:37:17 db2 openais[14653]: [TOTEM] entering OPERATIONAL state.
Mar 12 05:37:17 db2 openais[14653]: [CLM ] got nodejoin message 192.168.141.12
Mar 12 05:37:17 db2 openais[14653]: [CPG ] got joinlist message from node 2
Mar 12 05:37:28 db2 kernel: igb: eth4 NIC Link is Down
Mar 12 05:37:28 db2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Mar 12 05:37:28 db2 kernel: bonding: bond1: making interface eth5 the new active one.
Mar 12 05:37:29 db2 kernel: igb: eth5 NIC Link is Down
Mar 12 05:37:29 db2 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
Mar 12 05:37:29 db2 kernel: bonding: bond1: now running without any active interface !
Mar 12 05:38:03 db2 ccsd[14647]: Attempt to close an unopened CCS descriptor (3180).
Mar 12 05:38:03 db2 ccsd[14647]: Error while processing disconnect: Invalid request descriptor
Mar 12 05:38:03 db2 fenced[14673]: fence "db1.anypay.yn" success
Mar 12 05:38:04 db2 clurgmgrd[14710]: <notice> Taking over service service:db-services from down member db1.anypay.yn
Mar 12 05:38:06 db2 avahi-daemon[7286]: Registering new address record for 192.168.141.10 on bond0.
Mar 12 05:38:07 db2 clurgmgrd[14710]: <notice> Service service:db-services started
Mar 12 05:41:11 db2 kernel: usb 8-1: new low speed USB device using uhci_hcd and address 2
Mar 12 05:41:12 db2 kernel: usb 8-1: configuration #1 chosen from 1 choice
Mar 12 05:41:12 db2 kernel: input: USB Keyboard as /class/input/input1
Mar 12 05:41:12 db2 kernel: input: USB HID v1.10 Keyboard [ USB Keyboard] on usb-0000:00:1d.2-1
Mar 12 05:41:12 db2 kernel: input: USB Keyboard as /class/input/input2
Mar 12 05:41:12 db2 kernel: input: USB HID v1.10 Device [ USB Keyboard] on usb-0000:00:1d.2-1
Mar 12 05:41:32 db2 kernel: usb 8-1: USB disconnect, address 2
Mar 12 05:41:44 db2 dhclient: DHCPREQUEST on usb0 to 169.254.95.118 port 67
Mar 12 05:41:45 db2 dhclient: DHCPACK from 169.254.95.118
Mar 12 05:41:45 db2 dhclient: bound to 169.254.95.120 -- renewal in 252 seconds.
Mar 12 05:42:04 db2 kernel: igb: eth4 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:04 db2 kernel: bonding: bond1: link status definitely up for interface eth4.
Mar 12 05:42:04 db2 kernel: bonding: bond1: making interface eth4 the new active one.
Mar 12 05:42:04 db2 kernel: bonding: bond1: first active interface up!
Mar 12 05:42:04 db2 kernel: igb: eth5 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:05 db2 kernel: bonding: bond1: link status definitely up for interface eth5.
Mar 12 05:42:06 db2 kernel: igb: eth5 NIC Link is Down
Mar 12 05:42:06 db2 kernel: igb: eth4 NIC Link is Down
Mar 12 05:42:06 db2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Mar 12 05:42:06 db2 kernel: bonding: bond1: now running without any active interface !
Mar 12 05:42:06 db2 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
Mar 12 05:42:08 db2 kernel: igb: eth4 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:08 db2 kernel: igb: eth5 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:08 db2 kernel: bonding: bond1: link status definitely up for interface eth4.
Mar 12 05:42:08 db2 kernel: bonding: bond1: making interface eth4 the new active one.
Mar 12 05:42:08 db2 kernel: bonding: bond1: first active interface up!
Mar 12 05:42:08 db2 kernel: bonding: bond1: link status definitely up for interface eth5.
Mar 12 05:42:10 db2 kernel: igb: eth4 NIC Link is Down
Mar 12 05:42:10 db2 kernel: igb: eth5 NIC Link is Down
Mar 12 05:42:10 db2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Mar 12 05:42:10 db2 kernel: bonding: bond1: now running without any active interface !
Mar 12 05:42:10 db2 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
Mar 12 05:42:11 db2 kernel: igb: eth5 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:11 db2 kernel: bonding: bond1: link status definitely up for interface eth5.
Mar 12 05:42:11 db2 kernel: bonding: bond1: making interface eth5 the new active one.
Mar 12 05:42:11 db2 kernel: bonding: bond1: first active interface up!
Mar 12 05:42:11 db2 kernel: igb: eth4 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:11 db2 kernel: bonding: bond1: link status definitely up for interface eth4.
Mar 12 05:42:48 db2 kernel: igb: eth5 NIC Link is Down
Mar 12 05:42:48 db2 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
Mar 12 05:42:48 db2 kernel: bonding: bond1: making interface eth4 the new active one.
Mar 12 05:42:48 db2 kernel: igb: eth4 NIC Link is Down
Mar 12 05:42:48 db2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Mar 12 05:42:48 db2 kernel: bonding: bond1: now running without any active interface !
Mar 12 05:42:49 db2 kernel: igb: eth5 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:49 db2 kernel: bonding: bond1: link status definitely up for interface eth5.
Mar 12 05:42:49 db2 kernel: bonding: bond1: making interface eth5 the new active one.
Mar 12 05:42:49 db2 kernel: bonding: bond1: first active interface up!
Mar 12 05:42:49 db2 kernel: igb: eth4 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:49 db2 kernel: bonding: bond1: link status definitely up for interface eth4.
Mar 12 05:43:27 db2 kernel: igb: eth4 NIC Link is Down
Mar 12 05:43:27 db2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Mar 12 05:43:27 db2 kernel: igb: eth5 NIC Link is Down
Mar 12 05:43:27 db2 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
Mar 12 05:43:27 db2 kernel: bonding: bond1: now running without any active interface !
Mar 12 05:43:29 db2 kernel: igb: eth4 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:43:29 db2 kernel: bonding: bond1: link status definitely up for interface eth4.
Mar 12 05:43:29 db2 kernel: bonding: bond1: making interface eth4 the new active one.
Mar 12 05:43:29 db2 kernel: bonding: bond1: first active interface up!
Mar 12 05:43:30 db2 kernel: igb: eth5 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:43:30 db2 kernel: bonding: bond1: link status definitely up for interface eth5.
I'm not sure how much impact this file has:
[root@ynrhzf-db1 ~]# cat /etc/modprobe.conf
alias scsi_hostadapter megaraid_sas
alias scsi_hostadapter1 ata_piix
alias scsi_hostadapter2 lpfc
alias eth0 bnx2
alias eth1 bnx2
alias bond0 bonding
options bond0 miimon=100 mode=1
alias bond1 bonding
options bond1 miimon=100 mode=0
### BEGIN UltraPath Driver Comments ###
remove upUpper if [ -d /proc/mpp ] && [ `ls -a /proc/mpp | wc -l` -gt 2 ]; then echo -e "Please Unload Physical HBA Driver prior to unloading upUpper."; else /sbin/modprobe -r --ignore-remove upUpper; fi
# Additional config info can be found in /opt/mpp/modprobe.conf.mppappend.
# The Above config info is needed if you want to make mkinitrd manually.
# Edit the '/etc/modprobe.conf' file and run 'upUpdate' to create Ramdisk dynamically.
### END UltraPath Driver Comments ###
options qla2xxx qlport_down_retry=5
options lpfc lpfc_nodev_tmo=30
alias eth2 e1000e
alias eth3 e1000e
alias eth4 igb
alias eth5 igb
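One thing worth noting in this file: in the kernel bonding driver, mode=1 is active-backup while mode=0 is balance-rr. So bond0 (the service network) runs active-backup, but bond1 (the back-to-back heartbeat pair) runs round-robin. Whether that contributes to the link flapping in the log above is unclear; an illustrative alternative (a sketch, not a confirmed fix) would be to run the heartbeat bond in active-backup as well:

```
# /etc/modprobe.conf bonding excerpt (illustrative alternative)
alias bond0 bonding
options bond0 miimon=100 mode=1   # mode=1 = active-backup
alias bond1 bonding
options bond1 miimon=100 mode=1   # instead of mode=0 (balance-rr)
```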
This problem has been bothering me for many days now. Experts, please help me analyze it!