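The previous post was unclear, so I am reposting it.
Last week a database server at my workplace rebooted for no apparent reason.
Afterwards I collected the dump information with snap.
The kdb analysis of the dump is below. Could the experts here take a look and see whether anything useful can be made of it:
IBM p595, 8 CPUs, 24 GB memory, oslevel 5300-04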
$kdb dump unix
The specified kernel file is a 64-bit kernel
dump mapped from @ 700000000000000 to @ 7000000d1386f84
Preserving 1317350 bytes of symbol table
First symbol __mulh
Component Names:
1) minidump [2 entries]
2) dmp_minimal [9 entries]
3) proc [2155 entries]
4) thrd [9557 entries]
5) rasct [1 entries]
6) ldr [2 entries]
7) errlg [3 entries]
8) mtrc [50 entries]
9) lfs [2 entries]
10) bos [2 entries]
11) ipc [7 entries]
12) vmm [13 entries]
13) alloc_kheap [512 entries]
14) alloc_other [228 entries]
15) rtastrc [8 entries]
16) sscsidd [2 entries]
17) aixpcm [5 entries]
18) efcdd [38 entries]
19) scdisk [11 entries]
20) lvm [2 entries]
21) jfs2 [1 entries]
22) tty [4 entries]
23) netstat [10 entries]
24) goent_dd [7 entries]
25) scsidisk [123 entries]
26) efscsi [9 entries]
27) dump_statistics [1 entries]
Component Dump Table has 12764 entries
START END <name>
0000000000001000 0000000003BBA050 start+000FD8
F00000002FF47600 F00000002FFDC920 __ublock+000000
000000002FF22FF4 000000002FF22FF8 environ+000000
000000002FF22FF8 000000002FF22FFC errno+000000
F100070F00000000 F100070F10000000 pvproc+000000
F100070F10000000 F100070F18000000 pvthread+000000
PFT:
PVT:
id....................0002
raddr.....0000000002000000 eaddr.....F200800080000000
size..............00080000 align.............00001000
valid..1 ros....0 fixlmb.1 seg....0 wimg...2
[kdb_read_mem] no real storage @ F100000010789F98
[kdb_read_mem] no real storage @ F1000000107765D8
Dump analysis on CHRP_SMP_PCI POWER_PC POWER_5 machine with 16 available CPU(s)
(64-bit registers)
Processing symbol table...
.......................done
(6)> stat
SYSTEM_CONFIGURATION:
CHRP_SMP_PCI POWER_PC POWER_5 machine with 16 available CPU(s) (64-bit registers)
SYSTEM STATUS:
sysname... AIX
nodename.. DBAML
release... 3
version... 5
build date Jan 10 2006
build time 10:56:32
label..... 0602A_53E
machine... 00C1397E4C00
nid....... C1397E4C
time of crash: Thu Feb 21 10:15:36 2008
age of system: 261 day, 18 hr., 6 min., 8 sec.
xmalloc debug: disabled
CRASH INFORMATION:
CPU 6 CSA 01941E00 at time of crash, error code for LEDs: 30000000
pvthread+000D00 STACK:
[00075FEC]v_delpft+000108 (F200800030000008 [??])
[0010AA88]v_relframe+000464 (??, ??, ??)
[001027E4]v_pageout+0006D0 (??, ??, ??)
[00141A20]v_steal+00043C (??, ??, ??, ??)
[00144EF4]v_fblru_scan+0003B8 (??)
[001403D4]v_lru+00035C (??)
[001414D0]v_memp_lru+00023C (??)
[00207FEC]v_prememp_lru+000020 (??)
[002A2474].backt+000080 ()
____ Exception (F00000003002F780) ____
iar : 00000000002A23F4 msr : 80000000000010B2 cr : 42000024
lr : 00000000001408D4 ctr : 0000000000140880 xer : 00000000
mq : 00000000 asr : 000000003AB4A001
r0 : 0000000000207FCC r1 : 0FFFFFFFF402FE90 r2 : 0000000001491C28
r3 : 0000000000000001 r4 : F100010049CA8180 r5 : 0000000003B90280
r6 : 0000000000000000 r7 : 0000000000000000 r8 : 0000000000000106
r9 : 0000000000000000 r10 : 00000000001408D4 r11 : F00000003002F780
r12 : 80000000000010B2 r13 : F100010049C89800 r14 : 00000000DEADBEEF
r15 : 000000000101A9C0 r16 : 00000000DEADBEEF r17 : 00000000DEADBEEF
r18 : 00000000DEADBEEF r19 : 00000000DEADBEEF r20 : 00000000DEADBEEF
r21 : 00000000DEADBEEF r22 : 00000000DEADBEEF r23 : 00000000DEADBEEF
r24 : 00000000DEADBEEF r25 : 00000000DEADBEEF r26 : 00000000DEADBEEF
r27 : 00000000DEADBEEF r28 : 00000000DEADBEEF r29 : 00000000DEADBEEF
r30 : 0000000003B90280 r31 : 0000000000000001
prev 0000000000000000 stackfix 0000000000000000 int_ticks 00
kjmpbuf 0000000000000000 excbranch 0000000000000000 no_pfault 00
intpri 0B backt 00 flags 00
fpscr 0000000000000000 fpscrx 00000000 fpowner 00
fpeu 00 fpinfo 00 alloc F000
o_iar 0000000000000000 o_toc 0000000000000000
o_arg1 0000000000000000 o_vaddr 0000000000000000
krlockp 0000000000000000
Except :
csr 0000000000000000 dsisr 0000000040010000 bit set: DSISR_PFT
esid 000000003C00BD10 dar 0FFFFFFFF4030580 dsirr 0000000000000106
[002A23F4].backt+000000 ()
[kdb_get_memory] no real storage @ FFFFFFFF402FEA0
Judging from the line "CPU 6 CSA 01941E00 at time of crash, error code for LEDs: 30000000",
the crash appears to have been caused by a problem on CPU 6.
(6)> status
CPU TID TSLOT PID PSLOT PROC_NAME
0 2005 2 2004 2 wait
1 12025 18 D01A 13 wait
2 13027 19 E01C 14 wait
3 1502B 21 F01E 15 wait
4 1602D 22 10020 16 wait
5 1702F 23 11022 17 wait
6 D01B 13 4008 4 lrud
7 19033 25 13026 19 wait
8 135 32768 128 16384 wait
9 4A9163 33961 D8118 16600 db2sysc
10 413D 32772 4130 16388 wait
11 513F 32773 5132 16389 wait
12 6141 32774 6134 16390 wait
13 7143 32775 7136 16391 wait
14 368145 33640 B8038 184 asiqsrv12
15 9147 32777 913A 16393 wait
16-63 Disabled
(16 CPUs are shown because the CPUs are dual-core.)
We can see that CPU 6 was running the lrud (paging) process.
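For anyone who wants to pull the non-idle CPUs out of a longer status listing, here is a rough Python sketch. It only does text processing on the output pasted above; the function name busy_cpus is made up for illustration, and it assumes each data row has exactly the six columns shown in the header.

```python
# Sample text copied from the kdb "status" output in this post.
STATUS_OUTPUT = """\
CPU TID TSLOT PID PSLOT PROC_NAME
0 2005 2 2004 2 wait
1 12025 18 D01A 13 wait
2 13027 19 E01C 14 wait
3 1502B 21 F01E 15 wait
4 1602D 22 10020 16 wait
5 1702F 23 11022 17 wait
6 D01B 13 4008 4 lrud
7 19033 25 13026 19 wait
8 135 32768 128 16384 wait
9 4A9163 33961 D8118 16600 db2sysc
10 413D 32772 4130 16388 wait
11 513F 32773 5132 16389 wait
12 6141 32774 6134 16390 wait
13 7143 32775 7136 16391 wait
14 368145 33640 B8038 184 asiqsrv12
15 9147 32777 913A 16393 wait
16-63 Disabled
"""


def busy_cpus(text):
    """Return (cpu, proc_name) pairs for CPUs not running the 'wait' process.

    Hypothetical helper: assumes six whitespace-separated columns per data row,
    with a decimal CPU number first and the process name last.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header row and the "16-63 Disabled" line.
        if len(fields) == 6 and fields[0].isdigit():
            cpu, proc = int(fields[0]), fields[5]
            if proc != "wait":
                rows.append((cpu, proc))
    return rows


for cpu, proc in busy_cpus(STATUS_OUTPUT):
    print(f"CPU {cpu}: {proc}")
```

On the listing above this leaves only the lrud, db2sysc and asiqsrv12 rows, which makes it easier to see that CPU 6 (the one named in the crash line) was the one running lrud.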
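What I would like to know now is what the 01941E00 in
"CPU 6 CSA 01941E00 at time of crash" means,
or whether anyone knows how to analyze the cause of the error further.
Any advice would be much appreciated. Thanks.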