[Midrange hardware] P630 hangs for no apparent reason

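#1  Posted 2008-01-28 15:59
On January 24 our P630 hung for no apparent reason. It stopped responding to any input, so we cut the power and restarted it. The system now reports a sysplanar0 UNDETERMINED ERROR. What can cause this error? Any advice from the experts would be appreciated.
Also, the machine's "text memory" usage stays persistently high. Could that be what caused the hang? Where should I start troubleshooting?
I know almost nothing about IBM AIX, so please make your answers as detailed as possible.
Many thanks in advance!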

#2  Posted 2008-01-28 16:29
First, post the output of sysdumpdev -L so we can look at it. That is step one.

#3  Posted 2008-01-28 16:39

In reply to post #1 by shadowyu_cz

How do I get that output, and where do I find the dump file? I have no idea how to do this.
My AIX version is 5.1.

#4  Posted 2008-01-28 17:46
Type sysdumpdev -L at the command line, then paste the command's output into the forum.

#5  Posted 2008-01-29 09:16
# sysdumpdev -L
0453-039

Device name:         /dev/hd6
Major device number: 10
Minor device number: 2
Size:                309236224 bytes
Date/Time:           Tue Jul 19 14:44:13 BEIST 2005
Dump status:         0
dump completed successfully
Dump copy filename: /var/adm/ras/vmcore.0
#


That is the output. Please advise, thanks.

#6  Posted 2008-01-29 10:56
Date/Time:           Tue Jul 19 14:44:13 BEIST 2005

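This is when the last dump on this machine was generated, and it does not appear to match the time your system went down.

Check the machine's date setting. If the clock is correct, this dump was not produced by the recent outage.

If the date setting is wrong and the dump's real creation time is close to the outage, run snap -ac and ftp the result to IBM's testcase site.

If the date setting is correct, the dump was not generated by this outage. In that case run snap -gc and send the result to my mailbox, and I will take a look.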
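The date check described above can be sketched in Python. This is only an illustration: the function and variable names are made up for this example, the hang date of January 24, 2008 comes from post #1, and the one-day tolerance is an assumption, not something AIX defines. The timezone name (BEIST) is simply ignored.

```python
from datetime import datetime, date

MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}

def parse_dump_time(stamp):
    """Parse a stamp like 'Tue Jul 19 14:44:13 BEIST 2005' (timezone ignored)."""
    _weekday, mon, day, clock, _tz, year = stamp.split()
    hour, minute, second = (int(part) for part in clock.split(":"))
    return datetime(int(year), MONTHS[mon], int(day), hour, minute, second)

# Date/Time value from the sysdumpdev -L output in post #5.
dump_time = parse_dump_time("Tue Jul 19 14:44:13 BEIST 2005")

# Hang date reported in post #1.
hang_date = date(2008, 1, 24)

# Assumption: a dump taken within one day of the hang belongs to it.
gap_days = abs((hang_date - dump_time.date()).days)
dump_matches_hang = gap_days <= 1

print("dump taken:", dump_time.isoformat())
print("dump belongs to this hang:", dump_matches_hang)
```

Here the dump is from 2005, so if the system clock is right it cannot belong to the January 2008 hang, which is the snap -gc branch.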

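P.S. How to run snap:

1. Run snap -r to delete any previous snap files;

2. Decide between snap -ac and snap -gc according to the description above;

3. From the /tmp/ibmsupt directory, ftp snap.pax.Z in bin mode to your PC, then, depending on the choice above, either ftp it to testcase or send it to my mailbox.

testcase.boulder.ibm.com or testcase.software.ibm.com: both addresses are reachable over the Internet. Upload with ftp and put the data in the /toibm/aix directory,

or send it to yan_bing@hotmail.com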

#7  Posted 2008-01-29 11:04
I have sent it to your mailbox. Thank you!

#8  Posted 2008-01-29 11:48
Received, I am looking at it now. Please wait a moment.

#9  Posted 2008-01-29 11:59
A PROBLEM WAS DETECTED ON Tue Jan 29 04:05:56 BEIST 2008                  801014
                       
The Service Request Number(s)/Probable Cause(s)
(causes are listed in descending order of probability):

  651-880: The CEC or SPCN reported an error. Report the SRN and the
           following reference and physical location codes to your service
           provider.
           Error log information:
                 Date: Fri Jan 25 09:55:42 BEIST 2008
                 Sequence number: 3612
                 Label: SCAN_ERROR_CHRP
    Ref. Code: B1004699 FRU: n/a              n/a   



————————————————————————————————————————————————————
7028-6C4,6E4: B1xx 4699 Service processor firmware:
  This is usually an indication of a problem in the communication path between the HMC and the service processor.
  It may only be an informational message.
  If the managed system is down, go to the service processor error log and find the error log entry
  containing B1xx 4699. Look at the first two bytes of word 13 of the detailed entry information.
  If the managed system is running, look at the AIX error log entry containing B1xx 4699. This is a
  SCAN_ERROR_CHRP error with an identifier of BFE4C025. In the detail data, find the string B1xx 4699.
  (If present, it will be at byte 60 of the detail data.) Go forward 8 bytes after the B1 to byte 68
  and look at bytes 68 and 69.
  If the system is running Linux, examine the Linux system log. The line(s) in the extended data that
  begin with <4>RTAS: Log Debug: 04 contain the error code in the next 8 hex characters. (This error
  code is also known as word 11).
  Each 4 bytes after the error code in the Linux extended data is another word. The 4 bytes after
  the error code are word 12 and the next four bytes are word 13. An example of the Linux extended
  data, and finding words 11, 12, and 13, is shown in MAP 1321, step 1321-28, and step 1321-29,
  in this service guide.
  Perform the following actions based on the following values of bytes 68 and 69 from the AIX error
  log entry, or on the first two bytes of word 13 from the service processor error log entry:
  2306: No processor card is detected in slot one (U0.1-P1-C1); a processor card is required in the first slot for the system to boot.
  Actions:
  1. If a processor card is not plugged into slot one (U0.1-P1-C1), plug one in.
  2. If a processor card is plugged into slot one (U0.1-P1-C1), reseat it. If reseating the processor card does not fix the problem, replace it.
  9906: Software problem during firmware update from the operating system.
  Actions:
  1. Check for a system firmware update that is later than the one that just caused this
   error. Apply the update if available.
  2. Call service support.
  A205: Machine type and model fields are not valid in the VPD module. Obtain an operator panel.
  Do not swap the old VPD module onto the new operator panel. Call service support for
  instructions on how to write the machine type and model into the new VPD module.
  A20B: Error requesting trace buffer for service processor.
  Actions:
  1. Reset the service processor, if possible.
  2. Check for system firmware updates. Apply the updates if they are available.
  A218: Unknown return code detected.
  Actions:
  Check for system firmware updates. Apply the updates if they are available.
  A21A: Error allocating an internal service processor memory space.
  Actions:
  1. Reset the service processor, if possible.
  2. Check for system firmware updates. Apply the updates if they are available.
  A800: HMC/service processor initialization failure.
  Actions:
  1. Check for system firmware updates.
  2. Replace the service processor, location: U0.1-P1.
  A801: HMC wrap failure.
  Actions:
  Replace the service processor, location: U0.1-P1.
  A806: Loss of the surveillance heartbeat between the HMC and the service processor.
  Actions:
  1. Make sure that the HMC is booted and operational.
  2. Check the serial cables that go from the HMC to the service processor, location: U0.1-P1.
  If there are no other error codes or indications of a problem, the A806 (loss of surveillance
  heartbeat) was a temporary condition and has been resolved; the B1xx 4699 code is then
  an informational message only.
  If the problem persists:
  1. Check the serial cables connecting the HMC to the CEC backplane, location: U0.1-P1.
  2. Run diagnostics on the serial port on the HMC.
  3. Run diagnostics on the serial ports on the service processor.
  Values of A009 and A719 of bytes 68 and 69 in the AIX error log entry or the first two bytes of
  word 13 in the service processor error log entry are also informational entries:
  A009: The system received a power-off request at run time from the HMC.
  A719: Primary power failed; the system switched to battery backup power.
  For all other values of bytes 68 and 69, or the first two bytes of word 13, do the following:
  1. Check for system firmware updates.
  2. Reset the service processor by activating the pinhole reset switch on the operator panel.
  3. Call service support.
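For quick reference, the subcode list in the service guide excerpt above can be kept as a small lookup table. This is just a sketch: the descriptions are copied from the excerpt, while the names B1XX_4699_SUBCODES, INFORMATIONAL and describe_subcode are invented for this example.

```python
# Meanings of bytes 68-69 (AIX error log) or the first two bytes of word 13
# (service processor error log), as listed in the service guide excerpt.
B1XX_4699_SUBCODES = {
    "2306": "No processor card is detected in slot one (U0.1-P1-C1)",
    "9906": "Software problem during firmware update from the operating system",
    "A205": "Machine type and model fields are not valid in the VPD module",
    "A20B": "Error requesting trace buffer for service processor",
    "A218": "Unknown return code detected",
    "A21A": "Error allocating an internal service processor memory space",
    "A800": "HMC/service processor initialization failure",
    "A801": "HMC wrap failure",
    "A806": "Loss of the surveillance heartbeat between the HMC and the service processor",
    "A009": "The system received a power-off request at run time from the HMC",
    "A719": "Primary power failed; the system switched to battery backup power",
}

# The excerpt describes these two values as informational entries.
INFORMATIONAL = {"A009", "A719"}

def describe_subcode(code):
    """Return (description, is_informational) for a 4-hex-digit subcode."""
    key = code.strip().upper()
    if key in B1XX_4699_SUBCODES:
        return B1XX_4699_SUBCODES[key], key in INFORMATIONAL
    # Fallback for all other values, per the end of the guide entry.
    return ("Other value: check for firmware updates, reset the service "
            "processor, call service support", False)

print(describe_subcode("A009"))
```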

————————————————————————————————————————————————————————————

0444 0003 0000 0084 C600 0008 0144 5600 2008 0125 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 4942 4D00 0000 0000 0050 3034 B100 4699
04A0 005D A009 C0F5 0000 0000 0000 7701 0000 0000 0000 0000 0000 0000 0000 0000
0100 0000 0000 0000 4231 3030 3436 3939 2020 2020 2020 2020 2020 2020 2020 2020
2020 2020 2020 2020 0002 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 5247 3034 3033 3232 5F64 3735 6530 335F 7366 7731 3336 0000

__________________________________________________________________________________________

A009: The system received a power-off request at run time from the HMC.



As the steps above show, your 630 was restarted manually through the HMC a little after 9:00 on January 25.
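The byte lookup the service guide describes (the string B1xx 4699 at byte 60 of the detail data, then bytes 68 and 69) can be reproduced with a few lines of Python. This is a sketch only: it uses the first three rows of the detail data posted above (the first 96 bytes), counts offsets from zero as the guide appears to, and the variable names are invented for this example.

```python
# First three rows (96 bytes) of the SCAN_ERROR_CHRP detail data above.
DETAIL_DATA = """
0444 0003 0000 0084 C600 0008 0144 5600 2008 0125 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 4942 4D00 0000 0000 0050 3034 B100 4699
04A0 005D A009 C0F5 0000 0000 0000 7701 0000 0000 0000 0000 0000 0000 0000 0000
"""

# Drop all whitespace and convert the hex text to raw bytes.
raw = bytes.fromhex("".join(DETAIL_DATA.split()))

# Bytes 60-63 should hold the reference code B1xx 4699.
ref_code = raw[60:64].hex().upper()

# Bytes 68 and 69 hold the subcode that selects the action in the guide.
subcode = raw[68:70].hex().upper()

print("reference code:", ref_code)
print("bytes 68-69:", subcode)
```

With this data the reference code is B1004699 and bytes 68-69 are A009, which matches the conclusion above.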

More to follow shortly...

#10  Posted 2008-01-29 12:23
Errorlog Entry Detail

LABEL:                PGSP_KILL
IDENTIFIER:        C5C09FFA

Date/Time:       Fri Jan 11 09:50:43 2008
Sequence Number: 3598
Machine Id:      000BA77F4C00
Node Id:         czscp1
Class:           S
Type:            PERM
Resource Name:   SYSVMM

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SYSTEM RUNNING OUT OF PAGING SPACE

Failure Causes
INSUFFICIENT PAGING SPACE DEFINED FOR THE SYSTEM
PROGRAM USING EXCESSIVE AMOUNT OF PAGING SPACE

        Recommended Actions
        DEFINE ADDITIONAL PAGING SPACE
        REDUCE PAGING SPACE REQUIREMENTS OF PROGRAM(S)

Detail Data
PROGRAM
clinfo
USER'S PROCESS ID:
           0
PROGRAM'S PAGING SPACE USE IN 1KB BLOCKS
           0


High paging space consumption has been showing up since the 11th.

——————————————————————————————————————————
However, you have 4 GB of RAM and 4 GB of paging space configured.

——————————————————————————————————————————

Moreover, from the restart on the 25th through the 29th, paging space utilization was only 1%, so this is probably unrelated to your application.

——————————————————————————————————————————
Errorlog Entry Detail

LABEL:                SRC_RSTRT
IDENTIFIER:        BA431EB7

Date/Time:       Fri Jan 11 09:50:43 2008
Sequence Number: 3599
Machine Id:      000BA77F4C00
Node Id:         czscp1
Class:           S
Type:            PERM
Resource Name:   SRC

Description
SOFTWARE PROGRAM ERROR

Probable Causes
APPLICATION PROGRAM

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        VERIFY SUBSYSTEM RESTARTED AUTOMATICALLY

Detail Data
SYMPTOM CODE
      589833
SOFTWARE ERROR CODE
       -9035
ERROR CODE
           0
DETECTING MODULE
'srchevn.c'@line:'201'
FAILING MODULE
clinfoES

——————————————————————————————————————

After the first episode of excessive paging space consumption, the SRC restart event shown above occurred.

——————————————————————————————————————
So my preliminary assessment is that the problem is related to your HA environment.
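The ordering of the two error log entries can be checked mechanically. The sketch below is only an illustration: it parses a trimmed copy of the two entries quoted in this post (only the LABEL, Date/Time, Sequence Number and Resource Name fields are kept), and the helper name parse_entries is invented for this example.

```python
import re

# Trimmed copy of the two error log entries quoted above.
ERRPT_TEXT = """\
LABEL:           PGSP_KILL
Date/Time:       Fri Jan 11 09:50:43 2008
Sequence Number: 3598
Resource Name:   SYSVMM

LABEL:           SRC_RSTRT
Date/Time:       Fri Jan 11 09:50:43 2008
Sequence Number: 3599
Resource Name:   SRC
"""

def parse_entries(text):
    """Split on blank lines and turn each 'Key: value' line into a dict item."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        pairs = re.findall(r"^([^:\n]+):[ \t]*(.+)$", block, flags=re.M)
        entries.append(dict(pairs))
    return entries

# Order the entries by their error log sequence number.
entries = sorted(parse_entries(ERRPT_TEXT),
                 key=lambda entry: int(entry["Sequence Number"]))

labels = [entry["LABEL"] for entry in entries]
same_second = len({entry["Date/Time"] for entry in entries}) == 1

print(labels, "logged in the same second:", same_second)
```

The paging space kill (sequence 3598) comes immediately before the SRC restart (sequence 3599), and both were logged in the same second.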