Chinaunix

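Title: P630 crashed; please help find the cause.

Author: tony8201    Time: 2005-07-13 17:41
At 11 p.m. on the 11th, machine A went down and the application automatically failed over to machine B. The administrator was not aware of this and did not restart machine A right away. At 7 p.m. on the 12th, machine B went down as well and the application stopped. After noticing, the administrator started both machine A and machine B, and everything has been normal since. I don't understand why they went down.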

Error information from machine A:
#errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
2F3E09A4   0713144505 I H ent2           REPAIR ACTION
B6048838   0713112205 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
B6048838   0713111905 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
BA431EB7   0712201705 P S SRC            SOFTWARE PROGRAM ERROR
B6048838   0712201705 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
BA431EB7   0712201505 P S SRC            SOFTWARE PROGRAM ERROR
B6048838   0712201505 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
AFA89905   0712201205 I O grpsvcs        Group Services daemon started
97419D60   0712201205 I O topsvcs        Topology Services daemon started
A6DF45AA   0712200705 I O RMCdaemon      The daemon is started.
BFE4C025   0712200505 P H sysplanar0     UNDETERMINED ERROR
2BFA76F6   0711234905 T S SYSPROC        SYSTEM SHUTDOWN BY USER
9DBCFDEE   0712200705 T O errdemon       ERROR LOGGING TURNED ON
FE2DEE00   0711234905 P S SYSXAIXIF      DUPLICATE IP ADDRESS DETECTED IN THE NET
FE2DEE00   0711234905 P S SYSXAIXIF      DUPLICATE IP ADDRESS DETECTED IN THE NET
AA8AB241   0711234905 T O OPERATOR       OPERATOR NOTIFICATION
BC3BE5A3   0711234905 P S SRC            SOFTWARE PROGRAM ERROR
B6048838   0624162405 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
B6048838   0624161705 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
B6048838   0624160905 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
D5385D18   0505224805 T H hdisk2         ARRAY OPERATION ERROR
3C81E43F   0427114005 P U topsvcs        Late in sending heartbeat
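The summary listing above is not in strict time order (the ERROR LOGGING TURNED ON entry from the 12th sits between entries from the 11th), which makes the sequence hard to read. The following is a minimal sketch, assuming the errpt TIMESTAMP column is in MMDDhhmmYY form, that parses a few of the lines above and sorts them chronologically. The function names are illustrative only and not part of any AIX tool.

```python
from datetime import datetime

# A few summary lines copied from the listing above.
ERRPT_LINES = """\
BFE4C025   0712200505 P H sysplanar0     UNDETERMINED ERROR
2BFA76F6   0711234905 T S SYSPROC        SYSTEM SHUTDOWN BY USER
9DBCFDEE   0712200705 T O errdemon       ERROR LOGGING TURNED ON
FE2DEE00   0711234905 P S SYSXAIXIF      DUPLICATE IP ADDRESS DETECTED IN THE NET
BC3BE5A3   0711234905 P S SRC            SOFTWARE PROGRAM ERROR
"""


def parse_errpt_line(line):
    """Split one summary line into (identifier, time, type, class, resource, description)."""
    ident, stamp, etype, eclass, resource, desc = line.split(None, 5)
    # Assumption: TIMESTAMP is MMDDhhmmYY with no seconds field.
    when = datetime.strptime(stamp, "%m%d%H%M%y")
    return ident, when, etype, eclass, resource, desc.strip()


def sorted_events(text):
    """Parse every non-empty line and return the entries oldest first."""
    events = [parse_errpt_line(line) for line in text.splitlines() if line.strip()]
    # sorted() is stable, so entries sharing a minute keep their listed order.
    return sorted(events, key=lambda event: event[1])


if __name__ == "__main__":
    for ident, when, etype, eclass, resource, desc in sorted_events(ERRPT_LINES):
        print(when.strftime("%Y-%m-%d %H:%M"), ident, resource, desc)
```

Because the summary has only minute resolution, entries sharing the same minute cannot be ordered from it alone; the seconds in the errpt -a output are needed for that.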
Author: tony8201    Time: 2005-07-13 17:42
#errpt -a
LABEL:                REBOOT_ID
IDENTIFIER:        2BFA76F6

Date/Time:       Mon Jul 11 23:49:50 BEIS
Sequence Number: 180
Machine Id:      00570B1E4C00
Node Id:         localhost
Class:           S
Type:            TEMP
Resource Name:   SYSPROC         

Description
SYSTEM SHUTDOWN BY USER

Probable Causes
SYSTEM SHUTDOWN

Detail Data
USER ID
           0
0=SOFT IPL 1=HALT 2=TIME REBOOT
           1
TIME TO REBOOT (FOR TIMED REBOOT ONLY)
           0
---------------------------------------------------------------------------
LABEL:                ERRLOG_ON
IDENTIFIER:        9DBCFDEE

Date/Time:       Tue Jul 12 20:07:17 BEIS
Sequence Number: 179
Machine Id:      00570B1E4C00
Node Id:         localhost
Class:           O
Type:            TEMP
Resource Name:   errdemon        

Description
ERROR LOGGING TURNED ON

Probable Causes
ERRDEMON STARTED AUTOMATICALLY

User Causes
/USR/LIB/ERRDEMON COMMAND

        Recommended Actions
        NONE

---------------------------------------------------------------------------
LABEL:                AIXIF_ARP_DUP_ADDR
IDENTIFIER:        FE2DEE00

Date/Time:       Mon Jul 11 23:49:45 BEIS
Sequence Number: 178
Machine Id:      00570B1E4C00
Node Id:         xbserver1
Class:           S
Type:            PERM
Resource Name:   SYSXAIXIF      

Description
DUPLICATE IP ADDRESS DETECTED IN THE NET

Failure Causes
ARP RESPONSE RECEIVED FOR MY IP ADDRESS

        Recommended Actions
        CONTACT NETWORK ADMINISTRATOR

Detail Data
DUPLICATE IP ADDRESS
0A67 0103
MAC ADDRESS
000D 600B 8DE2
---------------------------------------------------------------------------
LABEL:                AIXIF_ARP_DUP_ADDR
IDENTIFIER:        FE2DEE00

Date/Time:       Mon Jul 11 23:49:44 BEIS
Sequence Number: 177
Machine Id:      00570B1E4C00
Node Id:         xbserver1
Class:           S
Type:            PERM
Resource Name:   SYSXAIXIF      

Description
DUPLICATE IP ADDRESS DETECTED IN THE NET

Failure Causes
ARP RESPONSE RECEIVED FOR MY IP ADDRESS

        Recommended Actions
        CONTACT NETWORK ADMINISTRATOR

Detail Data
DUPLICATE IP ADDRESS
0A67 0103
MAC ADDRESS
000D 600B 8DE2
---------------------------------------------------------------------------
LABEL:                OPMSG
IDENTIFIER:        AA8AB241

Date/Time:       Mon Jul 11 23:49:43 BEIS
Sequence Number: 176
Machine Id:      00570B1E4C00
Node Id:         xbserver1
Class:           O
Type:            TEMP
Resource Name:   OPERATOR        

Description
OPERATOR NOTIFICATION

User Causes
ERRLOGGER COMMAND

        Recommended Actions
        REVIEW DETAILED DATA

Detail Data
MESSAGE FROM ERRLOGGER COMMAND
clexit.rc : Unexpected termination of clstrmgrES
---------------------------------------------------------------------------
LABEL:                SRC_SVKO
IDENTIFIER:        BC3BE5A3

Date/Time:       Mon Jul 11 23:49:42 BEIS
Sequence Number: 175
Machine Id:      00570B1E4C00
Node Id:         xbserver1
Class:           S
Type:            PERM
Resource Name:   SRC            

Description
SOFTWARE PROGRAM ERROR

Probable Causes
APPLICATION PROGRAM

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        MANUALLY RESTART SUBSYSTEM IF NEEDED

Detail Data
SYMPTOM CODE
         512
SOFTWARE ERROR CODE
       -9017
ERROR CODE
           0
DETECTING MODULE
'srchevn.c'@line:'350'
FAILING MODULE
clstrmgrES
---------------------------------------------------------------------------
LABEL:                CORE_DUMP
IDENTIFIER:        B6048838

Date/Time:       Fri Jun 24 16:24:34 BEIS
Sequence Number: 174
Machine Id:      00570B1E4C00
Node Id:         xbserver1
Class:           S
Type:            PERM
Resource Name:   SYSPROC         

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SOFTWARE PROGRAM

User Causes
USER GENERATED SIGNAL

        Recommended Actions
        CORRECT THEN RETRY

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        RERUN THE APPLICATION PROGRAM
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SIGNAL NUMBER
          11
USER'S PROCESS ID:
       16234
FILE SYSTEM SERIAL NUMBER
          11
INODE NUMBER
        4098
PROCESSOR ID
           0
CORE FILE NAME
/xbstation/xb/server/bin/core
PROGRAM NAME
Dept_Dispose
ADDITIONAL INFORMATION
??
??
Unable to generate symptom string.
---------------------------------------------------------------------------
LABEL:                CORE_DUMP
IDENTIFIER:        B6048838

Date/Time:       Fri Jun 24 16:17:42 BEIS
Sequence Number: 173
Machine Id:      00570B1E4C00
Node Id:         xbserver1
Class:           S
Type:            PERM
Resource Name:   SYSPROC         

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SOFTWARE PROGRAM

User Causes
USER GENERATED SIGNAL

        Recommended Actions
        CORRECT THEN RETRY

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        RERUN THE APPLICATION PROGRAM
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SIGNAL NUMBER
          11
USER'S PROCESS ID:
       11916
FILE SYSTEM SERIAL NUMBER
          11
INODE NUMBER
        4098
PROCESSOR ID
           1
CORE FILE NAME
/xbstation/xb/server/bin/core
PROGRAM NAME
Dept_Dispose
ADDITIONAL INFORMATION
_ptrgl 0
??
Unable to generate symptom string.
---------------------------------------------------------------------------
LABEL:                CORE_DUMP
IDENTIFIER:        B6048838

Date/Time:       Fri Jun 24 16:09:35 BEIS
Sequence Number: 172
Machine Id:      00570B1E4C00
Node Id:         xbserver1
Class:           S
Type:            PERM
Resource Name:   SYSPROC         

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SOFTWARE PROGRAM

User Causes
USER GENERATED SIGNAL

        Recommended Actions
        CORRECT THEN RETRY

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        RERUN THE APPLICATION PROGRAM
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SIGNAL NUMBER
          11
USER'S PROCESS ID:
       58854
FILE SYSTEM SERIAL NUMBER
          11
INODE NUMBER
        4098
PROCESSOR ID
           1
CORE FILE NAME
/xbstation/xb/server/bin/core
PROGRAM NAME
Dept_Dispose
ADDITIONAL INFORMATION
??
??
Unable to generate symptom string.
---------------------------------------------------------------------------
LABEL:                FCP_ARRAY_ERR4
IDENTIFIER:        D5385D18

Date/Time:       Thu May  5 22:48:15 BEIS
Sequence Number: 163
Machine Id:      00570B1E4C00
Node Id:         xbserver1
Class:           H
Type:            TEMP
Resource Name:   hdisk2         
Resource Class:  disk
Resource Type:   array
Location:        U0.1-P2-I3/Q1-W200400A0B818048A-L0

Description
ARRAY OPERATION ERROR

Probable Causes
ARRAY DASD MEDIA
ARRAY DASD DEVICE

Failure Causes
DASD MEDIA
DISK DRIVE

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
SENSE DATA
0A00 2E08 010E F888 0000 0804 0000 0000 0000 0000 0000 02E3 0200 0300 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 3354 8000 F205 2101 0000 0000 0000 0000 0000 0000 EF00 64B0 0000 0000
0000 0000
---------------------------------------------------------------------------
LABEL:                TS_LATEHB_PE
IDENTIFIER:        3C81E43F

Date/Time:       Wed Apr 27 11:40:58 BEIS
Sequence Number: 122
Machine Id:      00570B1E4C00
Node Id:         xbserver1
Class:           U
Type:            PERF
Resource Name:   topsvcs         
Resource Class:  NONE
Resource Type:   NONE
Location:        
VPD:            

Description
Late in sending heartbeat

Probable Causes
Heavy CPU load
Severe physical memory shortage
Heavy I/O activities

Failure Causes
Daemon can not get required system resource

        Recommended Actions
        Reduce the system load

Detail Data
DETECTING MODULE
rsct,bootstrp.C,1.184,4520                    
ERROR ID
6zESUw.8bkP0/CRG.22kN8....................
REFERENCE CODE
                                          
A heartbeat is late by the following number of seconds
        3590
Author: tony8201    Time: 2005-07-13 17:45
Before this, hacmp.out had been reporting this error continuously:

00:00000:00029:2005/07/11 11:45:58.25 kernel  Cannot read, host process disconnected:  1116 spid: 29
00:00000:00046:2005/07/11 11:47:15.19 kernel  Cannot read, host process disconnected:  1672 spid: 46
00:00000:00117:2005/07/11 13:11:36.77 kernel  Cannot read, host process disconnected:  1496 spid: 117
00:00000:00161:2005/07/11 13:11:52.10 kernel  Cannot read, host process disconnected:  1120 spid: 161
00:00000:00095:2005/07/11 13:12:32.84 kernel  Cannot read, host process disconnected:  984 spid: 95
00:00000:00063:2005/07/11 13:12:33.85 kernel  Cannot read, host process disconnected:  192 spid: 63
00:00000:00101:2005/07/11 13:13:09.28 kernel  Cannot read, host process disconnected:  1552 spid: 101
00:00000:00138:2005/07/11 13:13:10.47 kernel  Cannot read, host process disconnected:  2012 spid: 138
00:00000:00078:2005/07/11 13:35:21.32 kernel  Cannot read, host process disconnected:  3712 spid: 78
00:00000:00157:2005/07/11 13:41:52.56 kernel  Cannot read, host process disconnected:  932 spid: 157
00:00000:00100:2005/07/11 13:44:39.44 kernel  Cannot read, host process disconnected:  1064 spid: 100
00:00000:00048:2005/07/11 13:45:02.54 kernel  Cannot read, host process disconnected:  636 spid: 48
00:00000:00067:2005/07/11 13:48:46.76 kernel  Cannot read, host process disconnected:  1980 spid: 67
00:00000:00014:2005/07/11 13:50:45.35 kernel  Cannot read, host process disconnected:  1056 spid: 14
00:00000:00025:2005/07/11 13:52:13.68 kernel  Cannot read, host process disconnected:  244 spid: 25
00:00000:00050:2005/07/11 13:52:15.68 kernel  Cannot read, host process disconnected:  480 spid: 50
00:00000:00168:2005/07/11 14:27:16.55 kernel  Cannot read, host process disconnected: 506 1240 spid: 168
00:00000:00160:2005/07/11 14:27:45.20 kernel  Cannot read, host process disconnected: 506 1916 spid: 160
00:00000:00000:2005/07/11 14:38:04.42 kernel  nrpacket: recv, Connection timed out
00:00000:00000:2005/07/11 14:58:59.43 kernel  nrpacket: recv, Connection timed out
00:00000:00157:2005/07/11 15:35:29.66 kernel  Cannot read, host process disconnected: 107 1720 spid: 157
00:00000:00019:2005/07/11 15:35:36.89 kernel  Cannot read, host process disconnected:  640 spid: 19
00:00000:00057:2005/07/11 16:16:42.44 kernel  Cannot read, host process disconnected: 304 668 spid: 57
00:00000:00078:2005/07/11 16:25:01.20 kernel  Cannot read, host process disconnected: 302 344 spid: 78
00:00000:00104:2005/07/11 16:35:53.40 kernel  Cannot read, host process disconnected: 103 624 spid: 104
00:00000:00168:2005/07/11 17:00:36.45 kernel  Cannot read, host process disconnected:  1104 spid: 168
00:00000:00026:2005/07/11 17:00:57.26 kernel  Cannot read, host process disconnected:  1140 spid: 26
00:00000:00066:2005/07/11 17:01:25.53 kernel  Cannot read, host process disconnected:  480 spid: 66
00:00000:00132:2005/07/11 17:03:37.63 kernel  Cannot read, host process disconnected: 109 1764 spid: 132
00:00000:00057:2005/07/11 17:19:44.78 kernel  Cannot read, host process disconnected:  400 spid: 57
00:00000:00026:2005/07/11 17:20:20.33 kernel  Cannot read, host process disconnected:  1564 spid: 26
00:00000:00073:2005/07/11 17:21:15.06 kernel  Cannot read, host process disconnected:  1428 spid: 73
00:00000:00160:2005/07/11 17:21:30.87 kernel  Cannot read, host process disconnected:  1324 spid: 160
00:00000:00069:2005/07/11 17:27:23.17 kernel  Cannot read, host process disconnected:  552 spid: 69
00:00000:00095:2005/07/11 18:02:47.27 kernel  Cannot read, host process disconnected: 509 1144 spid: 95
00:00000:00090:2005/07/11 18:32:32.59 kernel  Cannot read, host process disconnected: 309 548 spid: 90
00:00000:00131:2005/07/11 18:51:44.69 kernel  Cannot read, host process disconnected: 302 664 spid: 131
00:00000:00063:2005/07/11 18:52:40.59 kernel  Cannot read, host process disconnected:  1912 spid: 63
00:00000:00097:2005/07/11 19:55:59.92 kernel  Cannot read, host process disconnected:  1104 spid: 97
00:00000:00011:2005/07/11 21:35:05.08 kernel  Cannot read, host process disconnected:  1496 spid: 11
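Author: tony8201    Time: 2005-07-13 17:54
After 2005/07/11 21:35:05.08 there are no records at all in hacmp.out until the administrator noticed the problem on the evening of the 12th.

Hardware diagnostics found no problems, and according to the administrator there was nothing wrong with the power or the network.

FE2DEE00   0711234905 P S SYSXAIXIF      DUPLICATE IP ADDRESS DETECTED IN THE NET

This error should mean that an IP address is duplicated; could this be the cause?

2BFA76F6   0711234905 T S SYSPROC        SYSTEM SHUTDOWN BY USER

But why does this report appear? Nobody touched the host during this period.

The above are the problems that occurred on machine A.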
Author: tony8201    Time: 2005-07-13 17:57
Errors on machine B:

#errpt
2BFA76F6   0712190805 T S SYSPROC        SYSTEM SHUTDOWN BY USER
9DBCFDEE   0712200405 T O errdemon       ERROR LOGGING TURNED ON
AA8AB241   0712190805 T O OPERATOR       OPERATOR NOTIFICATION
BC3BE5A3   0712190805 P S SRC            SOFTWARE PROGRAM ERROR
B6048838   0712190805 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
64368504   0711235205 P O grpsvcs        Connection failure between Group Service
173C787F   0711235105 I S topsvcs        Possible malfunction on local adapter
FE2DEE00   0711235105 P S SYSXAIXIF      DUPLICATE IP ADDRESS DETECTED IN THE NET
FE2DEE00   0711235105 P S SYSXAIXIF      DUPLICATE IP ADDRESS DETECTED IN THE NET
Author: tony8201    Time: 2005-07-13 17:59
#errpt -a

---------------------------------------------------------------------------
LABEL:                REBOOT_ID
IDENTIFIER:        2BFA76F6

Date/Time:       Tue Jul 12 19:08:51 BEIS
Sequence Number: 178
Machine Id:      00570A9E4C00
Node Id:         localhost
Class:           S
Type:            TEMP
Resource Name:   SYSPROC         

Description
SYSTEM SHUTDOWN BY USER

Probable Causes
SYSTEM SHUTDOWN

Detail Data
USER ID
           0
0=SOFT IPL 1=HALT 2=TIME REBOOT
           1
TIME TO REBOOT (FOR TIMED REBOOT ONLY)
           0
---------------------------------------------------------------------------
LABEL:                ERRLOG_ON
IDENTIFIER:        9DBCFDEE

Date/Time:       Tue Jul 12 20:04:11 BEIS
Sequence Number: 177
Machine Id:      00570A9E4C00
Node Id:         localhost
Class:           O
Type:            TEMP
Resource Name:   errdemon        

Description
ERROR LOGGING TURNED ON

Probable Causes
ERRDEMON STARTED AUTOMATICALLY

User Causes
/USR/LIB/ERRDEMON COMMAND

        Recommended Actions
        NONE

---------------------------------------------------------------------------
LABEL:                OPMSG
IDENTIFIER:        AA8AB241

Date/Time:       Tue Jul 12 19:08:43 BEIS
Sequence Number: 176
Machine Id:      00570A9E4C00
Node Id:         xbserver2
Class:           O
Type:            TEMP
Resource Name:   OPERATOR        

Description
OPERATOR NOTIFICATION

User Causes
ERRLOGGER COMMAND

        Recommended Actions
        REVIEW DETAILED DATA

Detail Data
MESSAGE FROM ERRLOGGER COMMAND
clexit.rc : Unexpected termination of clstrmgrES
---------------------------------------------------------------------------
LABEL:                SRC_SVKO
IDENTIFIER:        BC3BE5A3

Date/Time:       Tue Jul 12 19:08:41 BEIS
Sequence Number: 175
Machine Id:      00570A9E4C00
Node Id:         xbserver2
Class:           S
Type:            PERM
Resource Name:   SRC            

Description
SOFTWARE PROGRAM ERROR

Probable Causes
APPLICATION PROGRAM

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        MANUALLY RESTART SUBSYSTEM IF NEEDED

Detail Data
SYMPTOM CODE
      721035
SOFTWARE ERROR CODE
       -9017
ERROR CODE
           0
DETECTING MODULE
'srchevn.c'@line:'350'
FAILING MODULE
clstrmgrES
---------------------------------------------------------------------------
LABEL:                CORE_DUMP
IDENTIFIER:        B6048838

Date/Time:       Tue Jul 12 19:08:41 BEIS
Sequence Number: 174
Machine Id:      00570A9E4C00
Node Id:         xbserver2
Class:           S
Type:            PERM
Resource Name:   SYSPROC         

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SOFTWARE PROGRAM

User Causes
USER GENERATED SIGNAL

        Recommended Actions
        CORRECT THEN RETRY

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        RERUN THE APPLICATION PROGRAM
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SIGNAL NUMBER
          11
USER'S PROCESS ID:
       37274
FILE SYSTEM SERIAL NUMBER
           1
INODE NUMBER
           2
PROCESSOR ID
           0
CORE FILE NAME
/core
PROGRAM NAME
clstrmgr
ADDITIONAL INFORMATION
ha_gs_sen 1C0
ha_gs_sen 1B0
Unable to generate symptom string.
---------------------------------------------------------------------------
LABEL:                GS_TS_RETCODE_ER
IDENTIFIER:        64368504

Date/Time:       Mon Jul 11 23:52:07 BEIS
Sequence Number: 173
Machine Id:      00570A9E4C00
Node Id:         xbserver2
Class:           O
Type:            PERM
Resource Name:   grpsvcs         

Description
Connection failure between Group Services and Topology Services

Probable Causes
Topology Services daemon is not running
Topology Services daemon has died
Topology Services library has detected an error

Failure Causes
Group Services detects an error condition of Topology Services

        Recommended Actions
        Check the Topology Services daemon
Verify that Group Services daemon has been restarted
Call IBM Service if problem persists

Detail Data
DETECTING MODULE
RSCT,PMAdaptMbr.C,1.44,716                    
ERROR ID
62IcBY/bKdo0/cac132kN8....................
REFERENCE CODE
                                          
DIAGNOSTIC EXPLANATION
Received unknown adapter [97] for PMAdaptMbr name [allAdapter_13_0_ether] from hats.
---------------------------------------------------------------------------
LABEL:                TS_LOC_DOWN_ST
IDENTIFIER:        173C787F

Date/Time:       Mon Jul 11 23:51:49 BEIS
Sequence Number: 172
Machine Id:      00570A9E4C00
Node Id:         xbserver2
Class:           S
Type:            INFO
Resource Name:   topsvcs         

Description
Possible malfunction on local adapter

Probable Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured

Failure Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured

        Recommended Actions
        Verify adapter configuration
        Verify network connectivity

Detail Data
DETECTING MODULE
rsct,nim_control.C,1.38,4143                  
ERROR ID
6zV5DL.JKdo0/zzx/32kN8....................
REFERENCE CODE
                                          
Adapter interface name
tty0
Adapter offset
           2
Adapter IP address
255.255.0.1
---------------------------------------------------------------------------
LABEL:                AIXIF_ARP_DUP_ADDR
IDENTIFIER:        FE2DEE00

Date/Time:       Mon Jul 11 23:51:02 BEIS
Sequence Number: 171
Machine Id:      00570A9E4C00
Node Id:         xbserver2
Class:           S
Type:            PERM
Resource Name:   SYSXAIXIF      

Description
DUPLICATE IP ADDRESS DETECTED IN THE NET

Failure Causes
ARP RESPONSE RECEIVED FOR MY IP ADDRESS

        Recommended Actions
        CONTACT NETWORK ADMINISTRATOR

Detail Data
DUPLICATE IP ADDRESS
0A67 0103
MAC ADDRESS
000D 600B 866E
---------------------------------------------------------------------------
LABEL:                AIXIF_ARP_DUP_ADDR
IDENTIFIER:        FE2DEE00

Date/Time:       Mon Jul 11 23:51:01 BEIS
Sequence Number: 170
Machine Id:      00570A9E4C00
Node Id:         xbserver2
Class:           S
Type:            PERM
Resource Name:   SYSXAIXIF      

Description
DUPLICATE IP ADDRESS DETECTED IN THE NET

Failure Causes
ARP RESPONSE RECEIVED FOR MY IP ADDRESS

        Recommended Actions
        CONTACT NETWORK ADMINISTRATOR

Detail Data
DUPLICATE IP ADDRESS
0A67 0103
MAC ADDRESS
000D 600B 866E
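The DUPLICATE IP ADDRESS and MAC ADDRESS detail fields are printed as raw hex words. As a small worked example (the helper names are made up for illustration), they can be turned into the usual notation like this:

```python
import ipaddress


def decode_ip(hex_words):
    """Turn errpt detail data such as '0A67 0103' into dotted-quad form."""
    raw = bytes.fromhex(hex_words.replace(" ", ""))
    return str(ipaddress.IPv4Address(raw))


def decode_mac(hex_words):
    """Turn errpt detail data such as '000D 600B 8DE2' into colon-separated form."""
    raw = bytes.fromhex(hex_words.replace(" ", ""))
    return ":".join(f"{byte:02x}" for byte in raw)


if __name__ == "__main__":
    print(decode_ip("0A67 0103"))        # 10.103.1.3
    print(decode_mac("000D 600B 8DE2"))  # 00:0d:60:0b:8d:e2 (logged on machine A)
    print(decode_mac("000D 600B 866E"))  # 00:0d:60:0b:86:6e (logged on machine B)
```

Both machines report the same address, 10.103.1.3, but each names a different MAC as the other claimant. The thread does not say which adapters these MACs belong to; comparing them with the hardware addresses of the adapters on each node would show whether the conflict came from inside the cluster or from another host.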
Author: yanbing    Time: 2005-07-13 21:52
It's the duplicate IP address problem...

The shutdown was a forced halt triggered by HACMP.
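One way to check the claim about the forced halt against the logs is to line up the second-resolution timestamps from the errpt -a output. The sketch below is only that, a sketch: the event list is copied by hand from the entries earlier in the thread, the year 2005 is taken from the post dates, and the function name is hypothetical.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (node, errpt label, time) copied from the errpt -a output earlier in the thread.
EVENTS = [
    ("A", "SRC_SVKO", "2005-07-11 23:49:42"),            # clstrmgrES failed
    ("A", "OPMSG", "2005-07-11 23:49:43"),               # clexit.rc message
    ("A", "AIXIF_ARP_DUP_ADDR", "2005-07-11 23:49:44"),  # first duplicate-IP entry
    ("A", "REBOOT_ID", "2005-07-11 23:49:50"),           # halt
    ("B", "SRC_SVKO", "2005-07-12 19:08:41"),
    ("B", "OPMSG", "2005-07-12 19:08:43"),
    ("B", "REBOOT_ID", "2005-07-12 19:08:51"),
]


def seconds_between(node, first_label, second_label):
    """Seconds from first_label to second_label on the given node."""
    times = {label: datetime.strptime(stamp, FMT)
             for event_node, label, stamp in EVENTS if event_node == node}
    return (times[second_label] - times[first_label]).total_seconds()


if __name__ == "__main__":
    print(seconds_between("A", "SRC_SVKO", "REBOOT_ID"))           # 8.0
    print(seconds_between("B", "SRC_SVKO", "REBOOT_ID"))           # 10.0
    print(seconds_between("A", "SRC_SVKO", "AIXIF_ARP_DUP_ADDR"))  # 2.0
```

On both machines the halt follows the clstrmgrES failure and the clexit.rc message within about ten seconds, which is consistent with the halt being issued by the cluster software rather than by a person. On machine A the duplicate-address entries are stamped two to three seconds after the clstrmgrES failure, not before it, so the logs alone do not show the IP conflict preceding the cluster manager's exit.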
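Author: tony8201    Time: 2005-07-14 09:17
If machine A went down on the night of the 11th because of the duplicate IP address, then why did machine B go down on the night of the 12th? There are only these reports: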

2BFA76F6   0712190805 T S SYSPROC        SYSTEM SHUTDOWN BY USER
9DBCFDEE   0712200405 T O errdemon       ERROR LOGGING TURNED ON
AA8AB241   0712190805 T O OPERATOR       OPERATOR NOTIFICATION
BC3BE5A3   0712190805 P S SRC            SOFTWARE PROGRAM ERROR
B6048838   0712190805 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED

Author: tony8201    Time: 2005-07-14 09:25
Host configuration: 2 CPUs, 2 GB memory, FAStT600 disk enclosure.
System: AIX5.2 + ML04 + HACMP5.2 + SYBASE12.5 + C6.0
Author: xuwelcome    Time: 2005-07-14 15:14
I have a question.
LABEL: TS_LOC_DOWN_ST
IDENTIFIER: 173C787F

Date/Time:       Mon Jul 11 23:51:49 BEIS
Sequence Number: 172
Machine Id:      00570A9E4C00
Node Id:         xbserver2
Class:           S
Type:            INFO
Resource Name:   topsvcs         

Description
Possible malfunction on local adapter

Probable Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured

Failure Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured

Recommended Actions
Verify adapter configuration
Verify network connectivity

Detail Data
DETECTING MODULE
rsct,nim_control.C,1.38,4143                  
ERROR ID
6zV5DL.JKdo0/zzx/32kN8....................
REFERENCE CODE
                                          
Adapter interface name
tty0
Adapter offset
          2
Adapter IP address
255.255.0.1

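How can a tty have the IP address 255.255.0.1? Besides, 255.255.0.1 is not valid either as an address or as a netmask. I suspect there may be a problem with the HACMP configuration; it is worth checking.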
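The observation that 255.255.0.1 is valid neither as a host address nor as a netmask can be checked with a few lines of arithmetic. This is a minimal sketch with made-up helper names; it says nothing about why topsvcs reports this value for tty0.

```python
import ipaddress


def is_contiguous_netmask(dotted):
    """True if the value is a run of 1 bits followed only by 0 bits."""
    value = int(ipaddress.IPv4Address(dotted))
    inverted = ~value & 0xFFFFFFFF
    # For a valid mask the inverted value looks like 0b0...01...1, so adding 1
    # gives a power of two that shares no bits with it.
    return inverted & (inverted + 1) == 0


def is_class_e(dotted):
    """True if the address lies in 240.0.0.0/4, which is reserved."""
    return ipaddress.IPv4Address(dotted) in ipaddress.IPv4Network("240.0.0.0/4")


if __name__ == "__main__":
    print(is_contiguous_netmask("255.255.0.0"))  # True
    print(is_contiguous_netmask("255.255.0.1"))  # False
    print(is_class_e("255.255.0.1"))             # True
```

So the value cannot be a real interface address or mask. That fits an entry for a serial (tty) heartbeat device, which has no IP address of its own, but whether the value is a placeholder or a sign of misconfiguration cannot be decided from the log alone.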
Author: tony8201    Time: 2005-07-14 15:16
Error reports in root's mail on machine A:

From root Wed Jul 13 04:06:58 2005
Received: (from root@localhost) by xbserver1_stdby (AIX5.2/8.11.6p2/8.11.0) id j6CK6vV37530 for root; Wed, 13 Jul 2005 04:06:57 +0800
Date: Wed, 13 Jul 2005 04:06:57 +0800
From: root
Message-Id: <200507122006.j6CK6vV37530@xbserver1_stdby>;
To: root
Subject: diagela message from xbserver1
Status: RO

A PROBLEM WAS DETECTED ON Wed Jul 13 04:05:57 BEIST 2005                  801014
                       
The Service Request Number(s)/Probable Cause(s)
(causes are listed in descending order of probability):

  652-880: The CEC or SPCN reported a non-critical error. Report the SRN and
           the following reference and physical location codes to your service
           provider.
           Error log information:
                 Date: Tue Jul 12 20:05:38 BEIST 2005
                 Sequence number: 181
                 Label: SCAN_ERROR_CHRP
    Ref. Code: B0061406 FRU: n/a              n/a            

From root Tue Jul 12 20:07:24 2005
Received: (from root@localhost) by localhost (AIX5.2/8.11.6p2/8.11.0) id j6CC7Nm05542 for root; Tue, 12 Jul 2005 20:07:23 +0800
Date: Tue, 12 Jul 2005 20:07:23 +0800
From: root
Message-Id: <200507121207.j6CC7Nm05542@localhost>;
To: root
Subject: diagela message from localhost
Status: RO

A PROBLEM WAS DETECTED ON Tue Jul 12 20:07:21 BEIST 2005                  801014
                       
The Service Request Number(s)/Probable Cause(s)
(causes are listed in descending order of probability):

  652-880: The CEC or SPCN reported a non-critical error. Report the SRN and
           the following reference and physical location codes to your service
           provider.
           Error log information:
                 Date: Tue Jul 12 20:05:38 BEIST 2005
                 Sequence number: 181
                 Label: SCAN_ERROR_CHRP
    Ref. Code: B0061406 FRU: n/a              n/a            

From root Tue Apr 12 14:01:00 2005
Received: (from root@localhost) by loopback (AIX5.2/8.11.6p2/8.11.0) id j3C510v42772 for root; Tue, 12 Apr 2005 14:01:00 +0900
Date: Tue, 12 Apr 2005 14:01:00 +0900
From: root
Message-Id: <200504120501.j3C510v42772@loopback>;
To: root
Subject: diagela message from xbserver1
Status: RO

TESTING COMPLETE on Tue Apr 12 13:24:38 BEIDT 2005                        801010

No trouble was found.

The resources tested were:
                       
- proc1            U0.1-P1-C1           Processor
Author: tony8201    Time: 2005-07-14 15:18
B0061406
This looks like a system firmware (microcode) error; does the firmware need to be upgraded?
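Author: firefoxli    Time: 2005-07-15 16:17
"Hardware diagnostics found no problems, and according to the administrator there was nothing wrong with the power or the network." Don't take that entirely on faith. Why is the network reporting an IP conflict? Did anyone else touch anything?

Following this thread. I have seen an article that says:
HACMP environment?

1. Upgrade the bos.rte.libpthreads fileset to the latest level.
2. Lower the NIM failure detection rate.
smitty hacmp
cluster config
cluster topology
configure Network Modules
Change a Network Module using Predefined Values
Set the values for both rs232 and Ethernet to Slow.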
Author: dong_jh    Time: 2005-07-15 23:54
Try applying the HACMP patches.
Author: liudw    Time: 2005-07-16 00:08
This kind of tty fault is reported whenever the cluster fails over; it is nothing unusual.
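Author: 强人    Time: 2005-09-26 17:09
Was the original poster's problem ever solved?
I feel nobody has gotten to the point.
Why is everyone ignoring the errors with identifier B6048838?
Could this be caused by software?
A duplicate IP wouldn't cause a core dump, would it?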
Author: novmcgrady    Time: 2006-04-08 10:23
Was it ever solved?

I have run into the same problem.
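Author: herowangzj    Time: 2006-04-08 12:05
Following this. I don't think it was caused by the IP conflict. I ran into IP conflicts when setting up HA, and the system just logged one IP conflict error and that was all. Admittedly that was a conflict on the service IP, but it should not make much difference. It feels like the cause is software; I hope an expert can advise.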
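Author: Joker2004    Time: 2006-04-10 17:44
It should be the IP problem. With the IP conflict, the boot or standby (stb) adapters of the cluster may have been knocked off the network, so the two nodes could not synchronize their information, and that caused the crash.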
Author: herowangzj    Time: 2006-04-10 18:56
I'll try to simulate it when I get the chance.
Author: feiaix    Time: 2006-12-19 16:51
How did the original poster solve this? I have run into the same problem too. :(
BAD LUCK
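Author: Jens    Time: 2006-12-21 17:07
I ran into the same problem before, but even the people from the maintenance company could not solve it when they came; it was fine after both machines were rebooted...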
Author: feiaix    Time: 2006-12-22 09:17
Could it be that the heartbeat was not configured properly, and then the machine went down under heavy I/O?
Author: lj_cd    Time: 2006-12-22 10:25
People come and ask when they have a problem, but say nothing once it is solved. Sigh...



