Post 1, posted 2002-08-23 15:13

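A few days ago our backup proxy server crashed repeatedly and left as many as four sets of dump files in /var/crash/ns/. I wanted to find some articles to understand why the system kept crashing. I found this one, but before I had time to read it the machine went down again, this time for good: the hard disk died :(.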
================================================================================
Crash Dump Analysis on Solaris

1、Introduction
This document attempts to provide a high-level introduction to basic crash dump handling on Sun servers. It includes a sample procedure that any organization can adapt for uniform handling of Sun server crashes. The term 'Crash Dump Analysis' may be a bit misleading in the context of this document: actual analysis of the system crash dump with a debugger is not covered (Sun has an excellent instructor-led training class on this topic). Most system administrators will never have to use a debugger on a crash dump, since this is typically a service Sun provides under a service contract. In light of this, this document covers introductory material on server crashes and on preparing the information to present to Sun when a service call is opened.

2、What Happens After a Crash?
When a panic occurs on a Solaris system, a message describing the error is usually echoed to the system console. The system will then attempt to write out the contents of the physical memory to a predetermined dump device, which is usually a dedicated disk partition, or the system swap partition. Once this is completed, the system is then rebooted.

Once the system begins rebooting, a startup script calls the savecore utility, if enabled. This command performs a few tasks on the memory dump. First it checks that the crash dump corresponds to the running operating system. If the dump passes this test, savecore copies the crash dump from the dedicated dump device to the directory /var/crash/`uname -n`, or to some other predetermined location. The dump is written out to two files, unix.n and vmcore.n, where n is a sequential integer identifying this particular crash. Finally, savecore logs a reboot using the LOG_AUTH syslog facility.

A sample memory dump of a system named testbox appears as follows:

# ls -l /var/crash/testbox
total 1544786
-rw-r--r-- 1 root    root          2 Jun 15 16:02 bounds
-rw-r--r-- 1 root    root    670367 Jun 15 16:00 unix.0
-rw-r--r-- 1 root    root    790110208 Jun 15 16:02 vmcore.0
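
As a rough illustration of the unix.n / vmcore.n naming described above, the following sketch lists the sequence numbers for which both files are present. The function name list_dump_pairs is made up for this example, and the code assumes a POSIX-compatible shell (for instance ksh) rather than the older Bourne shell.

```shell
#!/bin/sh
# List crash dump pairs (unix.N + vmcore.N) found in a savecore directory.
# Prints one sequence number N per line for each complete pair.
# list_dump_pairs is a hypothetical helper name, not a Solaris command.
list_dump_pairs() {
    dir=$1
    for unix_file in "$dir"/unix.*; do
        # Skip the literal pattern when nothing matches.
        [ -f "$unix_file" ] || continue
        # Everything after the last dot is the sequence number.
        n=${unix_file##*.}
        # Only report N when the matching memory image is also present.
        if [ -f "$dir/vmcore.$n" ]; then
            echo "$n"
        fi
    done
}

# Example: list_dump_pairs "/var/crash/`uname -n`"
```

A number that shows up for unix.n but not for vmcore.n is skipped, which may hint at an interrupted savecore run.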

Various options related to performing the actual crash dump and the savecore functions can be set using the dumpadm command. This utility allows the administrator to determine the dedicated dump device, the directory savecore will write to, and whether or not savecore runs at all. The /etc/init.d/savecore initialization script is what actually runs savecore at boot.

Typical output from dumpadm for the system testbox appears as follows:

# dumpadm
      Dump content: kernel pages
       Dump device: /dev/dsk/c0t0d0s3 (swap)
Savecore directory: /var/crash/testbox
  Savecore enabled: yes
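
When the savecore directory is needed in a script, one option is to pull it out of output shaped like the listing above. This is only a sketch: dumpadm_field is a made-up helper name, and it assumes the "label: value" layout shown here.

```shell
#!/bin/sh
# Read dumpadm-style output on stdin and print the value of one field.
# Labels are matched as they appear in the listing above,
# e.g. "Savecore directory" or "Dump device".
# dumpadm_field is a hypothetical helper name, not part of Solaris.
dumpadm_field() {
    label=$1
    # Strip leading spaces, match "label: value", and print only the value.
    sed -n "s/^ *$label: *//p"
}

# Example (on a Solaris host):
#   dumpadm | dumpadm_field "Savecore directory"
```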

3、What Causes a Crash?
Fatal operating system errors can be caused by bugs in the operating system, its associated device drivers and loadable modules, or by faulty hardware. Whatever the cause, the crash dump itself provides invaluable information to a Sun Support Engineer (if you are lucky enough to have a support contract) to aid in diagnosing the problem.

4、What To Do In Case of a Crash?
Any action taken when a Sun server crashes is obviously going to depend on the local policies and procedures in place at your organization. The presence of a Sun Service Agreement and its level will also affect your response to a crash.

What follows is an example of a typical procedure for dealing with a crash. This procedure was created based on real world experiences but does not reflect any particular real-world organization. For the purposes of illustration, assume that the organization in this example has a Platinum level contract with Sun.

The first step in analysing a crash is to determine if the necessary evidence is present in order to find a root cause. To begin, scan /var/adm/messages for any warnings or errors. Many crashes will leave evidence in the logs, such as which CPU caught the panic or which memory DIMM had errors. Often Sun engineers can diagnose the cause of a crash based on this information alone.

Next, check /var/crash/`uname -n` for a crash dump. If one is not present, confirm that savecore is enabled. Try running savecore -v if it was not previously enabled. It would also be a good idea to run prtdiag at this time to determine if there are any egregious hardware faults.
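
The two checks above (scan the log, then look for a dump) might be scripted roughly as follows. The function names and the search patterns are assumptions made for this sketch, and the patterns are certainly not a complete list of what a crash can leave in the log. A POSIX grep is assumed.

```shell
#!/bin/sh
# Print lines from a log file that mention a panic or a warning.
# The two patterns are illustrative assumptions, not an exhaustive list.
scan_messages() {
    logfile=${1:-/var/adm/messages}
    grep -i -e 'panic' -e 'warning' "$logfile"
}

# Succeed (exit status 0) if the directory holds at least one vmcore.N file.
has_crash_dump() {
    dir=$1
    for f in "$dir"/vmcore.*; do
        if [ -f "$f" ]; then
            return 0
        fi
    done
    return 1
}

# Example:
#   scan_messages /var/adm/messages
#   has_crash_dump "/var/crash/`uname -n`" || echo "no dump found"
```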

Armed with this information, open a call with Sun. Take note of the case ID number. For purposes of this example the case ID will be 123456. The Sun engineer may be able to diagnose the fault based on the panic strings or error messages from /var/adm/messages, or they may require the actual crash dump for analysis. Luckily there are two tools, CTEact (ACT) and explorer, which cull useful information from the crash dump and the system, making it unnecessary to upload the actual crash dump (which could be gigabytes in size).

Use the following steps to generate the ACT analysis of that core file to send to Sun:
Create a temporary upload directory. This directory will hold the output of these programs and will eventually be uploaded to Sun.

# mkdir /tmp/upload
# cd /var/crash/`uname -n`
# /opt/CTEact/bin/act -n unix.0 -d vmcore.0 > /tmp/upload/act_out

Install (if necessary) and run the explorer script as follows:
# ./explorer

The explorer script will prompt you for some information. Do not select email output. The script will create both a subdirectory and a uuencoded file containing the system audit. Copy the uuencoded system audit output to the /tmp/upload directory. For example:
# cp explorer.80b0c1cc.uu /tmp/upload

Tar and compress the output for upload to Sun:
# cd /tmp
# mv upload 123456
# tar -cvf 123456.tar 123456
# gzip 123456.tar

Finally, FTP the output to Sun:
# ftp sunsolve.sun.com
ftp> username: ftp
ftp> password:
ftp> bin
ftp> put 123456.tar.gz
ftp> quit

At this point you can remove the temporary upload directory:
# /bin/rm -rf /tmp/123456

Retain the original core files in /var/crash/`uname -n` until the case is closed. Once the case is closed by Sun, remove these files to free up disk space.
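
The rename, tar and gzip steps of the procedure could be wrapped in a small function so that the case ID is typed only once. This is a sketch under assumptions: bundle_case is a made-up name, the paths are passed in by the caller, and tar is run without -v to keep the output quiet.

```shell
#!/bin/sh
# Rename an upload directory to the case ID and pack it as <case_id>.tar.gz.
# Usage: bundle_case <parent_dir> <upload_dir_name> <case_id>
# bundle_case is a hypothetical helper name used only in this example.
bundle_case() {
    parent=$1
    upload=$2
    case_id=$3
    # Run in a subshell so the caller's working directory is unchanged.
    (
        cd "$parent" || exit 1
        [ -d "$upload" ] || exit 1
        mv "$upload" "$case_id" &&
        tar -cf "$case_id.tar" "$case_id" &&
        gzip "$case_id.tar"
    )
}

# Example: bundle_case /tmp upload 123456
```

The renamed directory is left in place, so it still has to be removed by hand after the upload, as in the procedure.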

5、Conclusion
Those who wish to do more than simply upload information to Sun and let them analyse the crash dump should strongly consider taking Sun's "Core Dump Analysis" course.

For more information, particularly on self-analysis of crash dumps, see Princeton University's Solaris 2.x Core Dump Analysis.

hanmin

Post 2, posted 2002-08-23 15:24

[Repost] Crash Dump Analysis on Solaris

When C Workshop is not available, adb, a tool that ships with the operating system, works quite well. I once used it to trace a crash back to a damaged CPU register.
===============================================================================
adb Core Analysis

adb can be used to analyze a core file to determine the cause of a panic. Note that Solaris versions up to 7 require that adb be run on a system with the same architecture and OS as the machine that produced the core dump. Solaris 8 allows adb to run on a different architecture, but time will tell how stable this facility is.

The following procedure is useful for diagnosing system crashes due to traps:

To invoke adb on a core file, type: adb -k unix.n vmcore.n.

$<msgbuf prints out the message buffer. Of particular interest are the error messages, g7 (the current thread address), rp (the register pointer), pc (the program counter address), and sp (the stack pointer).

(Note: We can also find this information by using strings vmcore.n | more.)

Alternatively, we can find rp by using $c to display the stack and picking up the second argument to trap or die. We can then find pc by executing rp_address$<regs.

The instructions at the program counter address can be displayed with pc_address/40ai. (A truncated version can be displayed with pc_address/ai.)

g7 can also be obtained by looking at the panic_thread variable with panic_thread/X for 32-bit or panic_thread/K for 64-bit systems.

To look at the command that caused the panic, we need to find procp by running g7_address$<thread. The command is in the psargs field of the output from procp_address$<proc2u. The remainder of that output represents the user structure of the process.
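
The first few commands of this procedure do not depend on addresses read from earlier output, so they can be collected into a batch and fed to adb on standard input. The sketch below only prints the batch; adb_first_look is a made-up name, and running adb non-interactively this way is an assumption worth trying on a non-production copy of the dump first.

```shell
#!/bin/sh
# Print a batch of adb commands for a first look at a crash dump.
# The commands come from the procedure above; $q quits adb.
# adb_first_look is a hypothetical helper name used only in this example.
adb_first_look() {
    # Single quotes keep the shell from expanding the $ characters.
    printf '%s\n' '$<msgbuf' '$c' 'panic_thread/X' '$q'
}

# Example (on a system matching the crashed machine, see the note above):
#   adb_first_look | adb -k unix.0 vmcore.0 > /tmp/adb_first_look.out
```

On a 64-bit system panic_thread/K would replace panic_thread/X, as noted above. The later steps (regs, thread, proc2u) need addresses taken from this output, so they remain interactive.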

race

Post 3, posted 2002-08-23 15:28

Does the hard disk failure really have that much to do with this?

jackieleon

Post 4, posted 2002-08-23 15:29

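The results of the CTEact analysis are worth reading closely. If the crash was caused by software, the output is very helpful; if it was caused by hardware, it is much harder to judge and needs further analysis by Sun.
The explorer output, in turn, has to be handed to Sun before it can be analyzed (it is usually uploaded to the sunsolve FTP server).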

段誉 (moderator)

Post 5, posted 2002-08-23 15:35

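Quoting race's post of 2002/08/23 03:28pm:
Does the hard disk failure really have that much to do with this?

No. At the time I simply did not know the cause, so I went looking for articles. The disk failure has nothing to do with the crashes; it was caused by repeated power loss. Power was cut four times in a single morning, so it would have been strange if the disk had survived!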

cinc

Post 6, posted 2002-08-23 15:37

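The iscda script can be used to process the core dump files:
iscda unix.0 vmcore.0 > /tmp/iscda.output
iscda simply calls adb and crash and collects their output.
Then send the result to Sun's engineers and let them handle it.

iscda is on the Solaris CD, and it can also be downloaded from Sun's website.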

段誉 (moderator)

Post 7, posted 2002-08-23 15:40

Which CD?

cinc

Post 8, posted 2002-08-23 15:42

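crash is also a good tool for analyzing a core dump; it can show which program caused the crash.
# crash vmcore.0 unix.0
dumpfile ......
> u
PER PROCESS USER AREA FOR PROCESS (the process ID appears here)
command : ... psargs: ..... (the command and arguments that caused the core dump appear here)
...
> defproc
(the ID of the process that caused the core dump appears here)

A few other useful commands:
> p
> defthread
> stat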