Help: heartbeat cluster node keeps rebooting

#1 Posted on 2011-11-29 15:56
Last edited by ty123555 on 2011-11-29 21:04

The heartbeat cluster consists of two nodes and is configured in the heartbeat v2 style with cib.xml. The configuration contains nothing related to fencing, watchdog, or anything else that could reboot the host, yet for some reason a node occasionally reboots on its own.
The heartbeat logs produced at the time of the reboot are as follows:
heartbeat[4113]: 2011/11/27_10:09:31 CRIT: Cluster node suse1 returning after partition.
heartbeat[4113]: 2011/11/27_10:09:31 info: For information on cluster partitions, See URL: http://linux-ha.org/SplitBrain
heartbeat[4113]: 2011/11/27_10:09:31 WARN: Deadtime value may be too small.
heartbeat[4113]: 2011/11/27_10:09:31 info: See FAQ for information on tuning deadtime.
heartbeat[4113]: 2011/11/27_10:09:31 info: URL: http://linux-ha.org/FAQ#heavy_load
heartbeat[4113]: 2011/11/27_10:09:31 WARN: Late heartbeat: Node suse1: interval 3500 ms
heartbeat[4113]: 2011/11/27_10:09:31 info: Status update for node suse1: status active
crmd[4412]: 2011/11/27_10:09:31 notice: crmd_ha_status_callback: Status update: Node suse1 now has status [active]
crmd[4412]: 2011/11/27_10:09:31 info: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_JOIN_REQUEST cause=C_HA_MESSAGE origin=route_message ]
crmd[4412]: 2011/11/27_10:09:31 info: update_dc: Unset DC suse2
crmd[4412]: 2011/11/27_10:09:31 info: erase_node_from_join: Removed dead node suse2 from join calculations: welcomed=0 itegrated=0 finalized=0 confirmed=0
crmd[4412]: 2011/11/27_10:09:31 info: do_dc_join_offer_all: join-8: Waiting on 1 outstanding join acks
crmd[4412]: 2011/11/27_10:09:31 info: update_dc: Set DC to suse2 (2.0)
crmd[4412]: 2011/11/27_10:09:31 info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
crmd[4412]: 2011/11/27_10:09:32 info: do_state_transition: All 1 cluster nodes responded to the join offer.
attrd[4411]: 2011/11/27_10:09:33 info: attrd_local_callback: Sending full refresh
cib[4408]: 2011/11/27_10:09:32 info: sync_our_cib: Syncing CIB to all peers
cib[4408]: 2011/11/27_10:09:32 WARN: cib_peer_callback: Discarding cib_replace message (2752) from suse1: not in our membership
cib[4408]: 2011/11/27_10:09:32 WARN: cib_peer_callback: Discarding cib_apply_diff message (2754) from suse1: not in our membership
cib[4408]: 2011/11/27_10:09:32 WARN: cib_peer_callback: Discarding cib_apply_diff message (2756) from suse1: not in our membership
cib[4408]: 2011/11/27_10:09:32 WARN: cib_peer_callback: Discarding cib_apply_diff message (2757) from suse1: not in our membership
crmd[4412]: 2011/11/27_10:09:32 WARN: crmd_ha_msg_callback: Ignoring HA message (op=join_ack_nack) from suse1: not in our membership list (size=1)
crmd[4412]: 2011/11/27_10:09:32 ERROR: do_cl_join_finalize_respond: Join join-6 with suse1 failed.  NACK'd
crmd[4412]: 2011/11/27_10:09:32 ERROR: do_log: [[FSA]] Input I_ERROR from do_cl_join_finalize_respond() received in state (S_FINALIZE_JOIN)
crmd[4412]: 2011/11/27_10:09:32 info: do_state_transition: State transition S_FINALIZE_JOIN -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=do_cl_join_finalize_respond ]
crmd[4412]: 2011/11/27_10:09:32 ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
crmd[4412]: 2011/11/27_10:09:32 WARN: do_election_vote: Not voting in election, we're in state S_RECOVERY
crmd[4412]: 2011/11/27_10:09:32 info: do_dc_release: DC role released
crmd[4412]: 2011/11/27_10:09:32 info: stop_subsystem: Sent -TERM to pengine: [4861]
crmd[4412]: 2011/11/27_10:09:32 info: stop_subsystem: Sent -TERM to tengine: [4860]
tengine[4860]: 2011/11/27_10:09:32 info: update_abort_priority: Abort priority upgraded to 1000000
pengine[4861]: 2011/11/27_10:09:32 info: pengine_shutdown: Exiting PEngine (SIGTERM)
crmd[4412]: 2011/11/27_10:09:32 ERROR: do_log: [[FSA]] Input I_TERMINATE from do_recover() received in state (S_RECOVERY)
crmd[4412]: 2011/11/27_10:09:32 info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
crmd[4412]: 2011/11/27_10:09:32 info: do_shutdown: Terminating the pengine
crmd[4412]: 2011/11/27_10:09:32 info: stop_subsystem: Sent -TERM to pengine: [4861]
crmd[4412]: 2011/11/27_10:09:32 info: do_shutdown: Terminating the tengine
crmd[4412]: 2011/11/27_10:09:32 info: stop_subsystem: Sent -TERM to tengine: [4860]
crmd[4412]: 2011/11/27_10:09:32 info: do_shutdown: Waiting for subsystems to exit
crmd[4412]: 2011/11/27_10:09:32 WARN: register_fsa_input_adv: do_shutdown stalled the FSA with pending inputs
crmd[4412]: 2011/11/27_10:09:32 info: do_shutdown: All subsystems stopped, continuing
crmd[4412]: 2011/11/27_10:09:32 WARN: do_log: [[FSA]] Input I_PENDING from do_election_vote() received in state (S_TERMINATE)
crmd[4412]: 2011/11/27_10:09:32 info: do_shutdown: Terminating the pengine
crmd[4412]: 2011/11/27_10:09:32 info: stop_subsystem: Sent -TERM to pengine: [4861]
crmd[4412]: 2011/11/27_10:09:32 info: do_shutdown: Terminating the tengine
crmd[4412]: 2011/11/27_10:09:32 info: stop_subsystem: Sent -TERM to tengine: [4860]
crmd[4412]: 2011/11/27_10:09:32 info: do_shutdown: Waiting for subsystems to exit
crmd[4412]: 2011/11/27_10:09:32 WARN: register_fsa_input_adv: do_shutdown stalled the FSA with pending inputs
crmd[4412]: 2011/11/27_10:09:32 info: do_shutdown: All subsystems stopped, continuing
crmd[4412]: 2011/11/27_10:09:32 info: crmdManagedChildDied: Process pengine:[4861] exited (signal=0, exitcode=0)
crmd[4412]: 2011/11/27_10:09:32 WARN: do_log: [[FSA]] Input I_RELEASE_SUCCESS from do_dc_release() received in state (S_TERMINATE)
crmd[4412]: 2011/11/27_10:09:32 info: do_shutdown: Terminating the tengine
crmd[4412]: 2011/11/27_10:09:32 info: stop_subsystem: Sent -TERM to tengine: [4860]
crmd[4412]: 2011/11/27_10:09:32 info: do_shutdown: Waiting for subsystems to exit
crmd[4412]: 2011/11/27_10:09:32 info: do_shutdown: All subsystems stopped, continuing
crmd[4412]: 2011/11/27_10:09:32 info: process_client_disconnect: Received HUP from pengine:[-1]
crmd[4412]: 2011/11/27_10:09:32 info: do_shutdown: Terminating the tengine
crmd[4412]: 2011/11/27_10:09:32 info: stop_subsystem: Sent -TERM to tengine: [4860]
crmd[4412]: 2011/11/27_10:09:32 info: do_shutdown: Waiting for subsystems to exit
crmd[4412]: 2011/11/27_10:09:32 info: do_shutdown: All subsystems stopped, continuing
tengine[4860]: 2011/11/27_10:09:32 info: update_abort_priority: Abort action 2 superceeded by 3
tengine[4860]: 2011/11/27_10:09:32 info: notify_crmd: Exiting after transition
tengine[4860]: 2011/11/27_10:09:32 info: te_init: Exiting tengine
crmd[4412]: 2011/11/27_10:09:32 info: crmdManagedChildDied: Process tengine:[4860] exited (signal=0, exitcode=0)
crmd[4412]: 2011/11/27_10:09:32 info: do_shutdown: All subsystems stopped, continuing
crmd[4412]: 2011/11/27_10:09:32 ERROR: verify_stopped: Resource ipservice was active at shutdown.  You may ignore this error if it is unmanaged.
crmd[4412]: 2011/11/27_10:09:32 notice: ghash_print_pending_for_rsc: Recurring action ipservice:4 (ipservice_monitor_5000) incomplete at shutdown
crmd[4412]: 2011/11/27_10:09:32 info: do_lrm_control: Disconnected from the LRM
ccm[4407]: 2011/11/27_10:09:32 info: client (pid=4412) removed from ccm
crmd[4412]: 2011/11/27_10:09:32 info: do_ha_control: Disconnected from Heartbeat
crmd[4412]: 2011/11/27_10:09:32 info: do_cib_control: Disconnecting CIB
cib[4408]: 2011/11/27_10:09:32 info: cib_process_readwrite: We are now in R/O mode
crmd[4412]: 2011/11/27_10:09:32 info: crmd_cib_connection_destroy: Connection to the CIB terminated...
crmd[4412]: 2011/11/27_10:09:32 info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
crmd[4412]: 2011/11/27_10:09:32 ERROR: do_exit: Could not recover from internal error
crmd[4412]: 2011/11/27_10:09:32 info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
crmd[4412]: 2011/11/27_10:09:32 info: do_exit: [crmd] stopped (2)
heartbeat[4113]: 2011/11/27_10:09:32 WARN: Managed /usr/lib/heartbeat/crmd process 4412 exited with return code 2.
heartbeat[4113]: 2011/11/27_10:09:32 EMERG: Rebooting system.  Reason: /usr/lib/heartbeat/crmd
According to the log, the system reboot was triggered by the crmd process exiting abnormally, but I don't know what caused crmd to exit. How should I change the configuration to keep this from happening again?

#2 Posted on 2011-11-29 19:00
The heartbeat 2.x testing wrap-up, written for newcomers to heartbeat:
http://bbs.chinaunix.net/viewthread.php?tid=2011785

Have a look at this.

#3 Posted on 2011-11-29 20:37
Reply to #2 kns1024wh

I have already read that thread; it mostly covers the basics. Failover and the various functional tests of this cluster already work fine. The problem is that a node sometimes reboots for no obvious reason, and the reboot is triggered by the cluster's crmd process, which has me completely puzzled.

#4 Posted on 2011-12-02 11:54
Reply to #3 ty123555

heartbeat[4113]: 2011/11/27_10:09:31 CRIT: Cluster node suse1 returning after partition.
This line means your heartbeat timeout (deadtime) is configured too short; try increasing it.
The service exits because crmd fell into an abnormal state. Any of heartbeat's critical processes failing to start will cause this problem.
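
For reference, the heartbeat timing knobs live in /etc/ha.d/ha.cf on both nodes. The following is only a sketch with illustrative values (not taken from the poster's actual configuration), using the classic ha.cf directives:

  # /etc/ha.d/ha.cf -- timing sketch, illustrative values only
  keepalive 2      # send a heartbeat every 2 seconds
  warntime 10      # log "Late heartbeat" warnings after 10 seconds of silence
  deadtime 30      # declare the peer dead after 30 seconds; keep this well above keepalive
  initdead 60      # extra grace period at startup, usually at least twice deadtime

The "Deadtime value may be too small" and "Late heartbeat: ... interval 3500 ms" lines in the log above are the usual sign that deadtime sits too close to the heartbeat latency actually seen on the network; restart heartbeat on both nodes after changing these values.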

#5 Posted on 2012-08-14 13:46
I have run into the same problem. My heartbeat time is set to 3 seconds; would setting a longer heartbeat time solve this?