Posted on 2012-11-05 09:37

Last edited by PinkOrient on 2012-11-05 09:52

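I noticed that when rgmanager performs a restart, it actually runs the script's stop action and then its start action, which is not quite what I expected. Why doesn't it call the script's restart action directly?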

The configuration is as follows:

<service autostart="1" domain="xxx_dm" name="xxx_server" recovery="restart" max_restarts="3" restart_expire_time="60">
<ip address="139.122.10.187" monitor_link="1">
<script ref="xxx_server"/>
</ip>
</service>

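The xxx_server script monitors n xxx processes. If any one of them is missing, the script's status action returns 1. Calling the script's restart or start action at that point leaves the other n-1 healthy xxx processes untouched and only brings the dead one back up.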
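To make the script behaviour described above concrete, here is a minimal sketch of that kind of init script. It is not the real xxx_serverd script: the pid-file directory, the `WORKERS` count, the helper names, and the `sleep` stand-in for an xxx process are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch only; paths, names and the worker command are assumed.

PIDDIR="${PIDDIR:-/tmp/xxx_sketch}"   # assumed pid-file directory
WORKERS="${WORKERS:-3}"               # n, the number of monitored processes

# Succeeds if worker $1 has a pid file that points at a live process.
is_running() {
    [ -f "$PIDDIR/worker$1.pid" ] &&
        kill -0 "$(cat "$PIDDIR/worker$1.pid")" 2>/dev/null
}

# Start only the workers that are missing; healthy ones are left alone.
start() {
    mkdir -p "$PIDDIR"
    i=1
    while [ "$i" -le "$WORKERS" ]; do
        if ! is_running "$i"; then
            sleep 300 >/dev/null 2>&1 &      # stand-in for one xxx process
            echo "$!" > "$PIDDIR/worker$i.pid"
        fi
        i=$((i + 1))
    done
}

# Return 1 as soon as any worker is missing, 0 if all are alive.
status() {
    i=1
    while [ "$i" -le "$WORKERS" ]; do
        is_running "$i" || return 1
        i=$((i + 1))
    done
    return 0
}

# Stop every worker and remove its pid file.
stop() {
    i=1
    while [ "$i" -le "$WORKERS" ]; do
        if is_running "$i"; then
            kill "$(cat "$PIDDIR/worker$i.pid")"
        fi
        rm -f "$PIDDIR/worker$i.pid"
        i=$((i + 1))
    done
}

case "${1:-}" in
    start|restart) start ;;   # restart is assumed to behave like start here
    stop)          stop ;;
    status)        status ;;
esac
```

With a script shaped like this, a stop followed by a start takes down every worker, while restart or start alone only fills the gap. That difference is what the question is about.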
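I killed one of the xxx_server processes, expecting rgmanager to run service xxx_serverd restart once on the local host, which would bring the dead process back up without affecting the ones still running.
What actually happened is shown below: once the cluster saw a non-zero status, it stopped the whole service and withdrew its resources, then re-registered the resources and started the service again. This killed the healthy xxx processes as well, and the whole cycle took about 18 s.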

Nov 2 17:03:52 ServerNode01 xxx_serverd[29499]: status ... [OK]
Nov 2 17:04:25 ServerNode01 xxx_serverd[30222]: status ... [OK]
Nov 2 17:04:58 ServerNode01 xxx_serverd[30842]: status ... [Failed] #one process found dead, status is abnormal
Nov 2 17:04:58 ServerNode01 clurgmgrd: [23683]: <err> script:xxx_server: status of /etc/init.d/xxx_serverd failed (returned 1)
Nov 2 17:04:58 ServerNode01 clurgmgrd[23683]: <notice> status on script "xxx_server" returned 1 (generic error)
Nov 2 17:04:58 ServerNode01 clurgmgrd[23683]: <notice> Stopping service service:xxx_server #the service is stopped, so the other processes exit as well
Nov 2 17:04:58 ServerNode01 xxx_serverd[30985]: stop ... [OK]
Nov 2 17:04:58 ServerNode01 avahi-daemon[6987]: Withdrawing address record for 139.122.10.187 on bond0. #the VIP is withdrawn too
Nov 2 17:05:09 ServerNode01 clurgmgrd[23683]: <notice> Service service:xxx_server is recovering
Nov 2 17:05:09 ServerNode01 clurgmgrd[23683]: <notice> Recovering failed service service:xxx_server
Nov 2 17:05:11 ServerNode01 avahi-daemon[6987]: Registering new address record for 139.122.10.187 on bond0.
Nov 2 17:05:16 ServerNode01 xxx_serverd[31550]: start ... [OK]
Nov 2 17:05:16 ServerNode01 clurgmgrd[23683]: <notice> Service service:xxx_server started #resources reallocated and startup complete
Nov 2 17:05:49 ServerNode01 xxx_serverd[32390]: status ... [OK]
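
The roughly 18 s figure is the gap between the "Stopping service" line (17:04:58) and the "started" line (17:05:16) in the log above. A small sketch for pulling that number out of a log in the same layout follows. The function name `recovery_seconds` is made up for illustration, and it assumes the time is the third whitespace-separated field and that the window does not cross midnight.

```shell
#!/bin/sh
# Hypothetical helper: prints the seconds between rgmanager's
# "Stopping service" message and the following "Service ... started" message.
# Reads the named log files, or standard input if none are given.
recovery_seconds() {
    awk '
        /Stopping service/ {
            split($3, t, ":")                      # $3 is HH:MM:SS
            stop = t[1] * 3600 + t[2] * 60 + t[3]
        }
        /Service .* started/ {
            split($3, t, ":")
            print t[1] * 3600 + t[2] * 60 + t[3] - stop
        }
    ' "$@"
}
```

It could be run as `recovery_seconds /var/log/messages`, assuming that is where the clurgmgrd messages are written on the node.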