论坛徽章: 0
10 可用积分
各位同学,
现在我使用了heartbeat 2.1.3 版本于生产系统。
发现了两次IPaddr2 monitor timeout, 不知是linux系统的原因, 还是heartbeat的bug,
希望见过类似问题的同学, 能否分享一下。
heartbeat 利用 IPaddr2 这个resource agent 监控ip, 我用heartbeat有一年半了,
用于生产系统也有一年左右的时间。 所以相信在配置上应该没有问题。
现在偶发了两次很奇怪的问题。IPaddr2 monitor 竟然 timeout,
heartbeat发现这种情况后, 会去restart一下这个resource group,
如何restart呢, 先stop service resrouce agent
(什么叫resource agent呢, 例如我们写一个支持ocf规范的shell脚本, 称为resource agent, 来管理我们的应用进程。),
再 stop IPaddr2 这个resource agent.
结果呢, stop service resrouce agent 时又超时(20秒的timer啊), 而我那个service resrouce agent 的stop操作非常简单,
没有block的地方, 它竟然超时20秒。
这会引起heartbeat对service resrouce agent脱管, 变成所谓 un-managed.
我开始以为是时间漂移, 后来发现不是的。
但是在出问题的这段时间内(38秒), 的确我们有个其他进程在做备份操作, 会占用大量的I/O,
但是我想也不至于让操作系统瘫痪的地步。 毕竟这个操作每天这个点都做, 为什么单单就这一天会影响到heartbeat呢 ?
如果这是heartbeat的bug, 我想网上早就有这样的案例, 但是没有搜索到类似的问题, 所以我想应该不是heartbeat的问题。
不知道各位同学在使用时有没有碰到类似问题。企盼分享一下。
+++++++++++++++++++++++++++++++++++++++++ /var/log/ha-debug +++++++++++++
iptv_user_agent[1443][1454]: 2009/12/19_19:02:08 INFO: user_agent running
iptv_user_agent[1483][1494]: 2009/12/19_19:02:10 INFO: user_agent running
iptv_user_agent[1495][1534]: 2009/12/19_19:02:16 INFO: user_agent running
lrmd[4456]: 2009/12/19_19:02:18 WARN: IPaddr2_1:monitor process (PID 1497) timed out (try 1). Killing with signal SIGTERM (15).
lrmd[4456]: 2009/12/19_19:02:18 WARN: operation monitor[24] on ocf::IPaddr2::IPaddr2_1 for client 4459, its parameters: CRM_meta_interval=[5000] ip=[172.17.5.15] cidr_netmask=[24] CRM_meta_id=[IPaddr2_1_mon] CRM_meta_timeout=[5000] crm_feature_set=[2.0] iflabel=[0] CRM_meta_name=[monitor] nic=[eth1] : pid [1497] timed out
crmd[4459]: 2009/12/19_19:02:18 ERROR: process_lrm_event: LRM operation IPaddr2_1_monitor_5000 (24) Timed Out (timeout=5000ms)
tengine[11725]: 2009/12/19_19:02:19 info: process_graph_event: Detected action IPaddr2_1_monitor_5000 from a different transition: 8 vs. 18
tengine[11725]: 2009/12/19_19:02:19 info: update_abort_priority: Abort priority upgraded to 1000000
tengine[11725]: 2009/12/19_19:02:19 WARN: update_failcount: Updating failcount for IPaddr2_1 on f7a40064-99bc-4647-aea2-a064cbc762be after failed monitor: rc=-2
crmd[4459]: 2009/12/19_19:02:19 info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
crmd[4459]: 2009/12/19_19:02:19 info: do_state_transition: All 2 cluster nodes are eligible to run resources.
pengine[11726]: 2009/12/19_19:02:20 info: determine_online_status: Node aaa-a is online
pengine[11726]: 2009/12/19_19:02:20 info: common_apply_stickiness: Setting failure stickiness for IPaddr2_1 on aaa-a: -100
pengine[11726]: 2009/12/19_19:02:20 WARN: unpack_rsc_op: Processing failed op IPaddr2_1_monitor_5000 on aaa-a: Timed Out
pengine[11726]: 2009/12/19_19:02:20 info: determine_online_status: Node aaa-b is online
pengine[11726]: 2009/12/19_19:02:20 notice: group_print: Resource Group: group_1
pengine[11726]: 2009/12/19_19:02:20 notice: native_print: IPaddr2_1 (heartbeat: cf:IPaddr2): Started aaa-a FAILED
pengine[11726]: 2009/12/19_19:02:20 notice: native_print: iptv_user_agent (heartbeat: cf:iptv_user_agent): Started aaa-a
pengine[11726]: 2009/12/19_19:02:20 notice: NoRoleChange: Recover resource IPaddr2_1 (aaa-a)
pengine[11726]: 2009/12/19_19:02:20 notice: StopRsc: aaa-a Stop IPaddr2_1
pengine[11726]: 2009/12/19_19:02:20 notice: StartRsc: aaa-a Start IPaddr2_1
pengine[11726]: 2009/12/19_19:02:20 notice: RecurringOp: aaa-a IPaddr2_1_monitor_5000
pengine[11726]: 2009/12/19_19:02:20 notice: NoRoleChange: Leave resource iptv_user_agent (aaa-a)
crmd[4459]: 2009/12/19_19:02:21 info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
tengine[11725]: 2009/12/19_19:02:21 info: unpack_graph: Unpacked transition 19: 11 actions in 11 synapses
tengine[11725]: 2009/12/19_19:02:21 info: te_pseudo_action: Pseudo action 13 fired and confirmed
tengine[11725]: 2009/12/19_19:02:21 info: send_rsc_command: Initiating action 9: iptv_user_agent_stop_0 on aaa-a
crmd[4459]: 2009/12/19_19:02:21 info: do_lrm_rsc_op: Performing op=iptv_user_agent_stop_0 key=9:19:cf4b6798-33c3-4fcb-880c-0b87b803ba39)
iptv_user_agent[1535][1547]: 2009/12/19_19:02:23 INFO: user_agent running
crmd[4459]: 2009/12/19_19:02:23 info: process_lrm_event: LRM operation iptv_user_agent_monitor_2000 (call=26, rc=-2) Cancelled
lrmd[4456]: 2009/12/19_19:02:23 info: rsc:iptv_user_agent: stop
pengine[11726]: 2009/12/19_19:02:25 info: process_pe_message: Transition 19: PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-79.bz2
pengine[11726]: 2009/12/19_19:02:25 ERROR: subsystem_msg_dispatch: pengine took 5570ms to complete
iptv_user_agent[1576][1586]: 2009/12/19_19:02:25 INFO: user_agent: enter iptv_stop
iptv_user_agent[1576][1587]: 2009/12/19_19:02:25 INFO: generate user_agent alarm
tengine[11725]: 2009/12/19_19:02:41 WARN: action_timer_callback: Timer popped (abort_level=0, complete=false)
tengine[11725]: 2009/12/19_19:02:41 WARN: print_elem: Action missed its timeout[Action 9]: In-flight (id: iptv_user_agent_stop_0, loc: aaa-a, priority: 0)
lrmd[4456]: 2009/12/19_19:02:43 WARN: iptv_user_agent:stop process (PID 1576) timed out (try 1). Killing with signal SIGTERM (15).
lrmd[4456]: 2009/12/19_19:02:43 info: RA output: (iptv_user_agent:stop:stderr)
Session terminated, killing shell...
crmd[4459]: 2009/12/19_19:02:43 ERROR: process_lrm_event: LRM operation iptv_user_agent_stop_0 (27) Timed Out (timeout=20000ms)
lrmd[4456]: 2009/12/19_19:02:43 WARN: operation stop[27] on ocf::iptv_user_agent::iptv_user_agent for client 4459, its parameters: CRM_meta_timeout=[20000] crm_feature_set=[2.0] : pid [1576] timed out
tengine[11725]: 2009/12/19_19:02:43 info: status_from_rc: Re-mapping op status to LRM_OP_ERROR for rc=-2
我来回答