The post above was unclear, so I am reposting it.
Last week one of our database servers rebooted unexpectedly, for no apparent reason.
Afterwards I collected the dump information with snap.
Analyzing the dump with kdb gives the output below. Could anyone take a look and see whether anything useful can be drawn from it?
IBM p595, 8 CPUs, 24 GB memory, oslevel 5300-04

$kdb dump unix
The specified kernel file is a 64-bit kernel
dump mapped from @ 700000000000000 to @ 7000000d1386f84
Preserving 1317350 bytes of symbol table
First symbol __mulh
Component Names:
 1) minidump [2 entries]
 2) dmp_minimal [9 entries]
 3) proc [2155 entries]
 4) thrd [9557 entries]
 5) rasct [1 entries]
 6) ldr [2 entries]
 7) errlg [3 entries]
 8) mtrc [50 entries]
 9) lfs [2 entries]
10) bos [2 entries]
11) ipc [7 entries]
12) vmm [13 entries]
13) alloc_kheap [512 entries]
14) alloc_other [228 entries]
15) rtastrc [8 entries]
16) sscsidd [2 entries]
17) aixpcm [5 entries]
18) efcdd [38 entries]
19) scdisk [11 entries]
20) lvm [2 entries]
21) jfs2 [1 entries]
22) tty [4 entries]
23) netstat [10 entries]
24) goent_dd [7 entries]
25) scsidisk [123 entries]
26) efscsi [9 entries]
27) dump_statistics [1 entries]
Component Dump Table has 12764 entries
 START END <name>
0000000000001000 0000000003BBA050 start+000FD8
F00000002FF47600 F00000002FFDC920 __ublock+000000
000000002FF22FF4 000000002FF22FF8 environ+000000
000000002FF22FF8 000000002FF22FFC errno+000000
F100070F00000000 F100070F10000000 pvproc+000000
F100070F10000000 F100070F18000000 pvthread+000000
PFT:
PVT:
id....................0002
raddr.....0000000002000000 eaddr.....F200800080000000
size..............00080000 align.............00001000
valid..1 ros....0 fixlmb.1 seg....0 wimg...2
[kdb_read_mem] no real storage @ F100000010789F98
[kdb_read_mem] no real storage @ F1000000107765D8
Dump analysis on CHRP_SMP_PCI POWER_PC POWER_5 machine with 16 available CPU(s)
 (64-bit registers)
Processing symbol table...
.......................done
(6)> stat
SYSTEM_CONFIGURATION:
CHRP_SMP_PCI POWER_PC POWER_5 machine with 16 available CPU(s) (64-bit registers)

SYSTEM STATUS:
sysname... AIX
nodename.. DBAML
release... 3
version... 5
build date Jan 10 2006
build time 10:56:32
label..... 0602A_53E
machine... 00C1397E4C00
nid....... C1397E4C
time of crash: Thu Feb 21 10:15:36 2008
age of system: 261 day, 18 hr., 6 min., 8 sec.
xmalloc debug: disabled

CRASH INFORMATION:
CPU 6 CSA 01941E00 at time of crash, error code for LEDs: 30000000
pvthread+000D00 STACK:
[00075FEC]v_delpft+000108 (F200800030000008 [??])
[0010AA88]v_relframe+000464 (??, ??, ??)
[001027E4]v_pageout+0006D0 (??, ??, ??)
[00141A20]v_steal+00043C (??, ??, ??, ??)
[00144EF4]v_fblru_scan+0003B8 (??)
[001403D4]v_lru+00035C (??)
[001414D0]v_memp_lru+00023C (??)
[00207FEC]v_prememp_lru+000020 (??)
[002A2474].backt+000080 ()
____ Exception (F00000003002F780) ____
iar : 00000000002A23F4 msr : 80000000000010B2 cr : 42000024
lr : 00000000001408D4 ctr : 0000000000140880 xer : 00000000
mq : 00000000 asr : 000000003AB4A001
r0 : 0000000000207FCC r1 : 0FFFFFFFF402FE90 r2 : 0000000001491C28
r3 : 0000000000000001 r4 : F100010049CA8180 r5 : 0000000003B90280
r6 : 0000000000000000 r7 : 0000000000000000 r8 : 0000000000000106
r9 : 0000000000000000 r10 : 00000000001408D4 r11 : F00000003002F780
r12 : 80000000000010B2 r13 : F100010049C89800 r14 : 00000000DEADBEEF
r15 : 000000000101A9C0 r16 : 00000000DEADBEEF r17 : 00000000DEADBEEF
r18 : 00000000DEADBEEF r19 : 00000000DEADBEEF r20 : 00000000DEADBEEF
r21 : 00000000DEADBEEF r22 : 00000000DEADBEEF r23 : 00000000DEADBEEF
r24 : 00000000DEADBEEF r25 : 00000000DEADBEEF r26 : 00000000DEADBEEF
r27 : 00000000DEADBEEF r28 : 00000000DEADBEEF r29 : 00000000DEADBEEF
r30 : 0000000003B90280 r31 : 0000000000000001
prev 0000000000000000 stackfix 0000000000000000 int_ticks 00
kjmpbuf 0000000000000000 excbranch 0000000000000000 no_pfault 00
intpri 0B backt 00 flags 00
fpscr 0000000000000000 fpscrx 00000000 fpowner 00
fpeu 00 fpinfo 00 alloc F000
o_iar 0000000000000000 o_toc 0000000000000000
o_arg1 0000000000000000 o_vaddr 0000000000000000
krlockp 0000000000000000
Except :
 csr 0000000000000000 dsisr 0000000040010000 bit set: DSISR_PFT
 esid 000000003C00BD10 dar 0FFFFFFFF4030580 dsirr 0000000000000106
[002A23F4].backt+000000 ()
[kdb_get_memory] no real storage @ FFFFFFFF402FEA0


Judging from the line
CPU 6 CSA 01941E00 at time of crash, error code for LEDs: 30000000
it looks as though a problem on CPU 6 led to the crash.
(6)>status
CPU     TID  TSLOT    PID  PSLOT  PROC_NAME
  0    2005      2   2004      2  wait
  1   12025     18   D01A     13  wait
  2   13027     19   E01C     14  wait
  3   1502B     21   F01E     15  wait
  4   1602D     22  10020     16  wait
  5   1702F     23  11022     17  wait
  6    D01B     13   4008      4  lrud
  7   19033     25  13026     19  wait
  8     135  32768    128  16384  wait
  9  4A9163  33961  D8118  16600  db2sysc
 10    413D  32772   4130  16388  wait
 11    513F  32773   5132  16389  wait
 12    6141  32774   6134  16390  wait
 13    7143  32775   7136  16391  wait
 14  368145  33640  B8038    184  asiqsrv12
 15    9147  32777   913A  16393  wait
 16-63 Disabled
(16 CPUs are shown because the CPUs are dual-core.)
We can see that CPU 6 was running the lrud (page replacement) process at the time.

What I would like to know is what 01941E00 means in "CPU 6 CSA 01941E00 at time of crash",
or whether anyone knows how to analyze the cause of the error any further.
Any advice would be much appreciated. Thanks.
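For anyone who wants to cross-check the `status` table against the crash line without reading it by eye, here is a minimal Python sketch. It is not part of kdb; the helper names `parse_status` and `crash_cpu` are hypothetical, and it assumes (from how the values look in the output above) that TID and PID are printed in hexadecimal while TSLOT and PSLOT are decimal.

```python
import re

# Rows copied from the `status` output in the post.
STATUS_TEXT = """\
CPU TID TSLOT PID PSLOT PROC_NAME
 0 2005 2 2004 2 wait
 1 12025 18 D01A 13 wait
 2 13027 19 E01C 14 wait
 3 1502B 21 F01E 15 wait
 4 1602D 22 10020 16 wait
 5 1702F 23 11022 17 wait
 6 D01B 13 4008 4 lrud
 7 19033 25 13026 19 wait
 8 135 32768 128 16384 wait
 9 4A9163 33961 D8118 16600 db2sysc
 10 413D 32772 4130 16388 wait
 11 513F 32773 5132 16389 wait
 12 6141 32774 6134 16390 wait
 13 7143 32775 7136 16391 wait
 14 368145 33640 B8038 184 asiqsrv12
 15 9147 32777 913A 16393 wait
 16-63 Disabled
"""

# Crash line copied from the `stat` output in the post.
CRASH_LINE = "CPU 6 CSA 01941E00 at time of crash, error code for LEDs: 30000000"


def parse_status(text):
    """Return one dict per CPU row; header and 'Disabled' lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6 or not fields[0].isdigit():
            continue
        cpu, tid, tslot, pid, pslot, name = fields
        rows.append({
            "cpu": int(cpu),
            "tid": int(tid, 16),    # assumption: hexadecimal
            "tslot": int(tslot),    # assumption: decimal
            "pid": int(pid, 16),    # assumption: hexadecimal
            "pslot": int(pslot),    # assumption: decimal
            "name": name,
        })
    return rows


def crash_cpu(line):
    """Extract the CPU number and the CSA value (kept as a hex string)."""
    match = re.search(r"CPU (\d+) CSA ([0-9A-Fa-f]+)", line)
    if match is None:
        raise ValueError("not a crash information line")
    return int(match.group(1)), match.group(2)


rows = parse_status(STATUS_TEXT)
busy = [row for row in rows if row["name"] != "wait"]
cpu, csa = crash_cpu(CRASH_LINE)
on_crash_cpu = next(row for row in rows if row["cpu"] == cpu)

for row in busy:
    print(row["cpu"], row["name"], "pid(dec) =", row["pid"])
print("crash CPU", cpu, "was running", on_crash_cpu["name"], "CSA", csa)
```

The decimal PID is printed only so it can be compared with tools that show decimal PIDs; the sketch says nothing about what the CSA value itself means.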