Post #1 | Posted 2006-04-14 09:25

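Troubleshooting Lotus Domino hangs and crashes

Level: Intermediate
Kiran Bellari, Software Engineer, IBM
27 March 2006

Quick: what exactly is the difference between a server hang and a server crash? More important, how do you fix them? In this article, we explain how to identify Lotus Domino server hangs and crashes, and how to analyze and correct them.
Lotus Domino is built to be very reliable. But even the best-built product can run into problems that cause it to hang or crash. When that happens, the faster you isolate, analyze, and fix the problem, the sooner your user community is happily up and running again, and the sooner you can get back to other things.

This article offers some ideas for fixing Notes/Domino problems. We start by defining the difference between a server hang and a server crash, with examples of how to resolve each. We close with an overview of the new troubleshooting features in Notes/Domino 7, the latest release of the product. We assume you are an experienced Domino administrator, familiar with basic Notes/Domino concepts and terminology.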

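What are server hangs and crashes?

Before getting into the technical details, let's define two commonly used terms, crash and hang, to make sure we are on the same page.

Server crash
A Domino server crash is a situation in which the server program has terminated and is no longer running. You can usually determine which task the server was performing when it terminated by looking at the crash screen or at the NSD/RIP log file (depending on which release of Domino you are running).

Common symptoms of a Domino server crash include:

* The Domino server is no longer running, but other programs on the system are still running.
* The Domino server console does not appear, even though tasks appear to be loaded.
* The Domino server loads and then abruptly goes down without doing anything.
* A panic error appears on the console or in Log.nsf, and the system goes down.
* NSD/RIP runs automatically and generates a file, and the server goes down and/or restarts by itself.

There are several different types of server crash. A one-time crash, as the name implies, may occur once and never again. It can happen when a process accesses bad memory or a corrupted document and brings Domino down. For example, suppose a document in Mail.box is corrupted. When the Domino router accesses Mail.box to route that document to its destination, the server crashes. A similar situation may or may not arise again. In general, one-time crashes are the hardest to analyze.

A reproducible crash is one that can be repeated by following a sequence of steps. One example is a form containing a badly coded button that causes a crash every time it is pressed.

Repetitive crashes occur on a regular schedule. They do not seem to be tied to any specific action; instead, they happen, for example, at the same time every day. In such cases, you need to know exactly what is running on the server during the period when the problem occurs. For instance, suppose a scheduled agent that runs once a month is enabled on the Domino server. That agent may be causing the crash. In this scenario, first disable the offending agent, then investigate why it causes the problem (and fix it).

An ABEND is a special form of server crash. The term ABEND combines the words "abnormal end." ABEND crashes do not produce RIP or NSD files.

Causes of crashes include:

* A software problem in the code (on either the server or the client).
* Corruption in a database.
* A software problem in a third-party application that accesses Domino.
* Insufficient memory.
* Restricted operations caused by customized code.
* A memory leak.
* An incomplete request.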

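Server hang

A Domino server hang is a situation in which the Domino server is still running, but one or more tasks on the server are not responding to requests. These tasks may still be active, but they are not doing what they are supposed to do. The term "hang" also describes a state that can occur when a computer program does not run as designed. Most of the time, a hang occurs because a low-level loop or the persistent unavailability of a resource causes a serious performance problem. (Server hangs are usually attributed to resource issues, so they are sometimes treated as performance problems.)

During a hang, the program appears paralyzed: no error message is displayed, and the screen freezes or the application does not respond to user actions. Keyboard input and mouse clicks have no effect, wherever the cursor is placed, yet the program is still running. Unlike an ABEND or a crash, a hang sometimes resolves itself, and the application resumes normal execution without your intervention. Such a case is better regarded as a performance problem than as a hang.

Symptoms of a Domino server hang include:

* Domino is still running but does not respond to clients. In this case, users typically report that they receive "Server not responding" messages.
* The console behaves as if it were disconnected and accepts no commands, not even a simple one such as quit.
* Clients accessing the server (for example, opening databases) experience slow response times.
* Semaphore timeouts occur. The "show stat" command reports semaphore timeout information. The following is an example of semaphore timeouts reported in Statrep.nsf: Sem.Timeouts = 430D: 58 0A13:42 030B:28 0116:26 0A12:21. In this example, 430D is the semaphore name and 58 is the number of timeouts. Note that semaphore timeouts do not necessarily indicate a performance problem; they are common on a busy server. If the server has not experienced any semaphore timeouts, the Sem.Timeouts statistic does not appear in Statrep.nsf.
* Performance-related error messages are reported, such as:
   Insufficient memory.
   Insufficient memory. NSF Folder Pool is full.
   Maximum number of memory segments that Notes can support has been exceeded.
   Network operation did not complete in a reasonable amount of time.
   Server not responding.

Note that in a server hang situation, an NSD/RIP is not generated automatically.

Causes of server hangs include resource problems (insufficient resources), third-party application conflicts, and hardware problems. In general, server hangs are harder to analyze than server crashes. One final point: crashes and hangs occur not only on the Domino server but also on the Notes client.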

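Troubleshooting

In this section, we look at some general approaches to troubleshooting server crashes and server hangs.

Troubleshooting Domino server crashes

If Domino has crashed and cannot restart, remove tasks from the Notes.ini variable ServerTasks to narrow down and identify the task causing the crash. When you suspect a particular task, open the server console and narrow down the error messages that task may be generating. For example, if the router crashed while accessing mail in Mail.box, rename Mail.box and let the server recreate it.

If you suspect the problem is caused by a corrupted database, run offline maintenance tasks on that database. If the crash occurs on a regular schedule, review the actions performed on the server at the time of the crash.

Consider the following questions:

* Is the Domino server reporting error messages to the console or the log file?
* What is the exact wording of the error message?
* Where is the error message generated: on the Domino server or on the Notes client?
* When did the problem first appear?
* Were any changes made shortly before the problem started?

Troubleshooting Notes client crashes

First, find out whether the problem is specific to one user. If so, check that user's configuration and compare it with the configurations of other users. Also determine whether the problem is tied to accessing a particular application. If so, have a developer review the application.

If you suspect the problem is caused by a corrupted database or document, run the maintenance tasks Updall, Fixup, and Compact (with appropriate switches). In addition, if you think a bad index is the cause, try to recreate the database's full-text index (if possible).

Troubleshooting Domino server hangs

If semaphore problems appear constantly on the server console, check whether task schedules conflict. If the system is responding slowly, check whether your non-Domino applications are also running slowly. As a general rule, also make sure the operating system is updated with all the latest patches.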

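NSD analysis
Determining which process crashed the server is usually the first step in resolving a server crash. In Domino 6 and later, the NSD file is a good place to start. NSD gives you all current information about the state of the server (call stacks for all threads, memory information, and so on). When a crash occurs, the Domino server automatically generates an NSD log file and stores it in the data\IBM_TECHNICAL_SUPPORT directory. The file name of an NSD log carries a time stamp showing when the NSD was generated. For example, Nsd_W32I_KIRANTP_2006_01_17@17_17_18.log indicates that this NSD was generated on January 17, 2006. When NSD runs, it attaches to each process and thread to dump the call stacks. This helps you determine the cause of a server or workstation crash.

The heart of an NSD file is the stack trace section. This section breaks down the code path that each thread in a currently existing process traversed to reach its current state. This is very helpful when examining a hang or crash on a server. In addition, by examining the NSD file you can find any core files generated in the Domino data directory and do a basic analysis to trace the final stack of calls made by the process that died and left the core file behind. In a product as complex as Domino, stack traces of the same type of action on two different servers can produce different results.

In the NSD file, you can identify the executable in the failing process by searching for the words "fatal," "panic," or "segmentation." Once the process is found, we can see what preceded it and, with luck, determine how the crash occurred. Sometimes, when neither "panic" nor "fatal" is found, the core dump contains a reference to a "segmentation fault" in a function. This indicates that the process tried to access a shared memory segment that was corrupted for some reason, and crashed without calling "fatal_error" or "panic."

The following is a sample excerpt from an NSD file in which a server process is involved in a crash: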

### FATAL THREAD 39/83 [ nSERVER:07c0: 2764]
### FP=0743f548, PC=60197cf3, SP=0743ebd0, stksize=2424
Exception code: c0000005 (ACCESS_VIOLATION)
############################################################
@[ 1] 0x60197cf3 nnotes._Panic@4+483 (7430016,496dae76,0,496dace

@[ 2] 0x600018a4 nnotes._OSBBlockAddr@8+148 (1153f38,2000000,743f608,1)
@[ 3] 0x6000bd92 nnotes._CollectionNavigate@24+610 (0,743fc74,f,0)
@[ 4] 0x600626cc nnotes._ReadEntries@68+2860 (4c5440e8,4cfb8dba,800f,1)
@[ 5] 0x600b9f6f nnotes._NIFReadEntriesExt@72+351 (0,4cfb8dba,800f,1)
@[ 6] 0x10032d40 nserverl._ServerReadEntries@8+1424 (0,8d0c0035,4b64b5bc,4ae46dd6)
@[ 7] 0x100191fc nserverl._DbServer@8+2284 (41b0383,cb740064,0,23696f

@[ 8] 0x1002b8c8 nserverl._WorkThreadTask@8+1576 (4711d68,0,3,563fb10)
@[ 9] 0x100016cb nserverl._Scheduler@4+763 (0,563fb10,0,10ec334)
@[10] 0x6011e5e4 nnotes._ThreadWrapper@4+212 (0,10ec334,563fb10,0)
[11] 0x77e887dd KERNEL32.GetModuleFileNameA+465

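Once the failing process has been identified, you can focus on troubleshooting that particular process.

ServerTasks

If a server crashes continuously (for example, every five minutes), a useful troubleshooting step is to temporarily remove the ServerTasks= line from the server's Notes.ini file. The server can then be restarted and the tasks loaded one at a time to determine which process is causing the crash.

Panic messages
When Domino detects an internal consistency error, or a condition that may lead to data corruption or some other problem, it immediately calls a subroutine named Panic. This is a special construct used to continually monitor critical parts of the code as it runs. It helps catch problems as early as possible, before they escalate and possibly destroy data. When a panic occurs, it brings the system to a stop (so it can be regarded as a controlled crash). Panics generate messages, sometimes in English and sometimes in code form (for example, PANIC: 04:3C). You can give this code to Lotus Software Technical Support for further troubleshooting.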

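Troubleshooting tools

This section reviews some of the troubleshooting tools available to you when you encounter a Domino server crash or hang. Before using any of these tools, be sure to consult the Domino administration documentation. The Domino self-help support page is also a good resource for troubleshooting information.

RIP (Domino R5)

A RIP file is generated when a server crashes. The file contains information about what the server was doing when it crashed. It reports any crash on the system, not only crashes related to Domino. RIP files are generated only in Domino 5.x. In Domino 6 and later, NSD replaces RIP and includes additional capabilities that RIP lacks.

For a RIP file to be generated, QNC.EXE must be loaded on the Domino server. The QNC.EXE program (often called "quincy") is the default debugger that ships with Domino. It is usually located in the \Domino directory. To enable QNC.EXE, type "qnc -I" at the operating system's command prompt. You can also start QNC.EXE by typing "qnc nserver" at server launch. If RIP files are not generated when the server crashes, check whether QNC.EXE is enabled. Normally, RIP files are created in the data directory.

NSD (Domino 6 and later)

As mentioned earlier, Domino 6 and later provide the NSD feature. The NSD file contains information about the state of the server at the time of a crash. For more information, see the "NSD analysis" section earlier in this article.

Memory dump (Domino 6 and later)

In Domino 6 and later, you can use the command "sh memory dump" on the server console to create a memory dump file. A memory dump file contains information about the memory Domino is currently using. This is very useful when troubleshooting performance problems and memory leaks. Normally, memory dump files are placed in the data\IBM_TECHNICAL_SUPPORT directory. The memory dump file name includes a time stamp indicating when the dump was generated. For example:

memory_KIRANTP_2005_09_14@17_50_08.dmp

Note: To record the available memory to a file instead of viewing it on the server console, enter the following server console command: sh memory dump >memory.txt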

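HTTP request logs

To troubleshoot problems related to Domino Web server crashes and hangs, Lotus Software Technical Support will often ask you to create an HTTP request log. To enable the default settings for request logs, edit the server's Notes.ini file and add the line HTTPEnableThreadDebug=1. This sets HTTP request logging to the default level. (To set the logging level to record more detail, see the Domino administration documentation.) You can also enable HTTP request logging dynamically by entering "tell http debug thread on | off" at the Domino server console. With HTTP request logging enabled, Domino creates a series of files named htthr*.log, for example htthr_a40_10_20050914@171556.log.

HTTP request logging should be used only for troubleshooting specific problems, and usually under the direction and with the help of Lotus Software Technical Support. Do not use request logging for other purposes, such as general administration. These log files keep growing over time, so do not leave the setting enabled for long periods, or it may consume all available drive space.

Automatic Data Collection

Notes/Domino 6.0.1 introduced the automatic diagnostic data collection tool, also known as Automatic Data Collection, or ADC for short. Automatic Data Collection simply means that when a Notes client or Domino server crashes, the program gathers all the data needed to debug the crash and sends it to a mail-in database when the client or server restarts. Administrators then have one location per domain where they can see every crash that has occurred on all clients and servers. This helps eliminate cases in which an administrator or user fails to capture the proper data when a client or server crashes.

Notes.ini settings

To troubleshoot performance and crash problems, you can enable the following Notes.ini debugging parameters:

* Debug_threadid=1 logs the process and thread ID for each server operation.
* Debug_show_timeout=1 turns on semaphore timeout messages to the console, and creates a semaphore text file named semdebug.txt.
* Debug_capture_timeout=10 time stamps each semaphore timeout message.
* CONSOLE_LOG_ENABLED=1 (Domino 6 and later) enables Domino console logging.

Fault recovery for server crashes

You can set up fault recovery to handle Domino server crashes automatically. When the server crashes, it shuts itself down and restarts automatically, without any administrator intervention. Domino records crash information in the data directory. When the server restarts, Domino checks whether it is restarting after a crash. If so, it automatically sends an email to the person or group in the "Mail Fault Notification to" field.

A fatal error (such as an operating system exception or an internal panic) terminates every Domino process and releases all associated resources. The startup script detects this situation and restarts the server. If you use multiple server partitions and the failure occurs in a single partition, only that partition is terminated and restarted.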

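New troubleshooting features in Domino 7

This section briefly describes some new Domino 7 features that can help you analyze and correct server hangs and crashes.

Domino Domain Monitoring

One of the most significant and useful server maintenance and troubleshooting features in Domino 7 is Domino Domain Monitoring (DDM). It provides one central location for monitoring all the servers in a domain (or in multiple domains). DDM uses programs called probes to gather server information from individual servers and report it back to a special database (DDM.nsf), where you can view the collected data. This lets you monitor, analyze, and troubleshoot a large number of servers from a single Domino Administrator console.

Activity Trends

The Activity Trends feature analyzes "historical" server data to help you spot trends that only become visible over an extended period. You can review this data to help anticipate and avoid future problems. The data is collected from the log file (Log.nsf) and the Catalog task, and stored in the Activity Trends database (Activity.nsf). The Activity Trends Collector task processes this data and produces "trended" data for charting and resource balancing.

Writing status bar history to a log file

You can set Notes client status bar messages to be logged to the local log file (Log.nsf) or to an external file that you specify. This can help you troubleshoot Notes client crashes. Use the Notes.ini setting logstatusbar=1 to log status bar messages to Log.nsf. To view the logged messages, open Log.nsf and click the Miscellaneous Events view. Status bar messages are followed by Status Msg. To write status bar messages to an external file, use the Notes.ini setting Debug_Outfile=<path to file> together with the Notes.ini setting logstatusbar=1. For example:
logstatusbar=1
Debug_Outfile=c:\temp\StatusBarLogging.txt

This logs status bar messages to the file StatusBarLogging.txt.

The Log.nsf file also provides a snapshot of the actions logged in the status bar before the Notes client crashed.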

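Fault Analyzer

Fault Analyzer is a new server feature that processes all new crashes as they are delivered to the Automatic Data Collection mail-in database. The Fault Analyzer task searches the database configured for Fault Report documents and determines whether the stack matches a crash that a user or server has already seen. It extends the Automatic Data Collection feature by analyzing the call stacks in the Fault Report mail-in database to determine whether other instances of the same problem exist.

Fault Analyzer is configured at the same time as Automatic Data Collection (see figure 1). Use the Server Configuration document to set up Automatic Data Collection on the server and to enable or disable Fault Analyzer.

Figure 1. Configuring Fault Analyzer

If Fault Analyzer finds duplicate fault reports, the new crash is reported as a response to the original crash, and the attachments are either removed from the response document to save database space or saved with the response document.

Automatic Data Collection enhancements

When you use the Automatic Data Collection tool to gather information about server crashes, the tool now first checks whether the server is running under the Domino Controller and, if so, uses the Controller logs. If not, it checks whether console logging is enabled on the server and, if so, uses the console output. Finally, if neither the Domino Controller nor console logging is set up, the data is extracted from Log.nsf.

You can now select which files (using wildcards) the Automatic Data Collection tool collects when it runs on a client or server. On Notes clients, this is configured using a Desktop Policy Settings document (see figure 2).

Figure 2. Configuring Automatic Data Collection on Notes clients

On Domino servers, it is configured using the Server Configuration document (see figure 3).

Figure 3. Configuring Automatic Data Collection on Domino servers

This lets you collect diagnostic files from other IBM products and from third-party add-ins.

The output sent by Automatic Data Collection can sometimes be very large. If this becomes a problem, you can configure Automatic Data Collection to limit the size of the attachments sent by NSD and of the console log recorded in the Fault Reports database (see figure 3).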

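Shutdown Monitor

After you issue a command to quit or restart the server, the Domino server often takes a long time to actually shut down. To avoid this delay, the Shutdown Monitor task ensures that Domino terminates promptly once shutdown is requested. If the server does not terminate within the specified time, it is forced to terminate, and an NSD log is generated before termination. The time limit is specified in the Server Shutdown Timeout field in the Automatic Server Restart section of the Server document, on the Basics tab (see figure 4).

Figure 4. Setting the Server Shutdown Timeout

The default Server Shutdown Timeout setting is 5 minutes. You can disable this feature with the Notes.ini setting shutdown_monitor_disabled=1.

Process Monitor (Windows platforms only)

The Process Monitor task monitors the processes that should be running as part of the Domino server environment. (The task runs only on Microsoft Windows platforms; on Domino for Unix platforms this functionality is already implemented without a separate server task.) If any of these processes is missing, or if a process terminates unexpectedly without completing the usual Domino termination routines, the task causes the server to panic and identifies which process terminated prematurely. The Process Monitor task works together with Nprocmon.exe, which monitors abnormal termination of the Nserver.exe process.

This feature can greatly reduce the number of abnormal termination problems, which are hard to analyze (because it is usually difficult to determine which process terminated and caused the server problem). To disable the Process Monitor task, set the variable process_monitor_disabled=1 in the server's Notes.ini file.

Conclusion

In this article, we defined the difference between Domino server hangs and crashes, discussed some of the troubleshooting procedures and tools you can use to analyze and fix Notes/Domino problems, and looked at some of the new troubleshooting features introduced in Notes/Domino 7. You can refer back to this article whenever a Notes client or Domino server hangs or crashes, though we hope that does not happen to you often.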

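Resources

Learn

You can read the English original of this article on the developerWorks worldwide site.

The developerWorks Lotus article "New features in Lotus Domino 7.0" gives an overview of all the new server features introduced in Domino 7.

Before you start using any of the troubleshooting tools mentioned in this article, consult the Domino administration documentation.
The Domino self-help support page is also a good resource for troubleshooting information.

Discuss

Participate in developerWorks blogs and join the developerWorks community.
Or visit the Lotus board of the www.chinaunix.net community.

About the author

Kiran Bellari joined IBM in 2004 and initially worked on the Domino Document Manager team before joining Domino server crash and performance technical support. Kiran is a dual CLP and holds a bachelor's degree from the National Institute of Technology.

[ Last edited by plumlee on 2006-4-14 09:44 ]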

plumlee
Honorary Moderator

Post #2 | Posted 2006-04-14 09:27

English original attached for comparison

Troubleshooting Lotus Domino hangs and crashes

Level: Intermediate

Kiran Bellari, Software Engineer, IBM

17 Jan 2006

Quick -- what's the difference between a server hang and a crash? More important, how do you go about fixing them? In this article, we explain how you can identify Lotus Domino server hangs and crashes, and what you can do to analyze and correct them.

Lotus Domino is built to be very reliable. But even the best-built products may encounter problems that cause them to hang or crash. When this happens, the quicker you can isolate, analyze, and fix the problem, the quicker your user community will be happily up and running -- and the quicker you can go back to worrying about other things.

This article offers some ideas you can use to fix Notes/Domino problems. We start by defining the differences between a server hang and a server crash, and how you can go about solving examples of each. We conclude with an overview of new troubleshooting features included in Notes/Domino 7, the latest release of the product. We assume you're an experienced Domino administrator, and are familiar with basic Notes/Domino concepts and terminology.

What are server hangs and crashes?

Before we get into the technical details, let's define two commonly used terms, crash and hang, to ensure we're all on the same page.

Server crash

A Domino server crash is a situation where the server program has terminated and it is no longer running. You can often determine the task that the server was performing when it terminated by looking at the crash screen, or from the NSD/RIP log file (depending on which release of Domino you are running).

Common symptoms of a Domino server crash include:

* The Domino server is no longer running, but other programs on the system are still running.
* The Domino server console does not appear, even when tasks appear to be loaded.
* The Domino server loaded and abruptly came down without doing anything.
* A panic error appears on the console or in Log.nsf, and the system comes down.
* NSD/RIP automatically runs and generates a file, and the server comes down and/or restarts by itself.

There are several different types of server crashes. For example, a one-time crash, as the name implies, may occur once and never appear again. A one-time crash may be caused by bad memory or a corrupted document accessed by a process that resulted in Domino crashing. For example, suppose a document deposited in Mail.box is corrupted. When the Domino router accesses Mail.box to route the document to its destination, this produces a Domino server crash. A similar situation may or may not occur in the future. In general, one-time crashes are the most difficult to analyze.

A reproducible crash is one that can be repeated by following a sequence of steps. One example is a form that includes a badly coded button that always results in a crash when pressed.

Repetitive crashes occur on a particular schedule. They don't seem to be associated with any specific actions; instead, they may happen at the same time every day. In such situations, you need to identify exactly what is getting executed on the server at that time that may be causing the problem. For instance, imagine that a Domino server has a scheduled agent enabled that runs every month. This agent may be producing the server crash. In such scenarios, you need to first disable the agent creating the problem and then review why the agent is causing the problem (and fix it).

An ABEND is a special form of server crash. The term ABEND is a combination of the words "abnormal end." ABEND crashes do not produce RIP or NSD files.

Causes of crashes include:

* A software problem in the code (either on the server or on the client).
* Corruption in a database.
* A software problem in a third-party application accessing Domino.
* Insufficient memory.
* Restricted operations caused by customized code.
* A memory leak.
* An incomplete request.

Server hang

A Domino server hang is a situation where the Domino server is still running, but one or more tasks on the server are not responding to requests. These tasks may still be active, but they are not doing what they are supposed to do. The term "hang" also defines a state that sometimes occurs when computer programs do not run as designed. Most of the time, a hang occurs due to a low-level loop or a permanent unavailability of a resource, causing serious performance issues. (Server hangs are most commonly attributed to resource issues, so they are sometimes considered performance problems.)

During a hang, the program seems to be paralyzed, no error messages are displayed, and the screen freezes or the application does not respond to users' actions. Keyboard input or mouse clicking has no effect, regardless of where the cursor is placed, but the program is still running. Unlike an ABEND or crash, sometimes a hang will resolve itself, and the application resumes its normal execution without your involvement. Such a case might be considered more of a performance issue than a hang.

Symptoms of a Domino server hang include:

* Domino is still running, but is not responsive to clients. In this case, users often report that they are receiving “Server not responding” messages.
* The console behaves as if it is disconnected and won’t accept any commands, not even a simple command such as quit.
* Clients accessing the server (for example, opening databases) are experiencing slow response times.
* Semaphore timeouts are occurring. The 'show stat' command reports semaphore timeout information. The following is an example of semaphore timeouts recorded in Statrep.nsf: Sem.Timeouts = 430D: 58 0A13:42 030B:28 0116:26 0A12:21. In this example, 430D is the semaphore name, and 58 is the number of timeouts. Note that semaphore timeouts do not always indicate a performance problem. It is common for semaphore timeouts to occur on a busy server. The statistic Sem.Timeouts will not appear in Statrep.nsf if the server has not experienced any semaphore timeouts.
* Performance-related error messages are reported, such as:
   Insufficient memory.
   Insufficient memory. NSF Folder Pool is full.
   Maximum number of memory segments that Notes can support has been exceeded.
   Network operation did not complete in a reasonable amount of time.
   Server not responding.

Note that in a server hang situation, an NSD/RIP is never generated automatically.
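
To see which semaphore is timing out most often, the Sem.Timeouts value quoted above can be split into name/count pairs. The following is a minimal Python sketch, not a Domino tool; the function name is made up for illustration, and it assumes every semaphore name is four hexadecimal digits, as in the example.

```python
import re
from typing import Dict


def parse_sem_timeouts(value: str) -> Dict[str, int]:
    """Split a Sem.Timeouts value into {semaphore name: timeout count}.

    Assumption: names are 4 hex digits followed by ':' and a decimal
    count, with optional spaces after the colon (as in the sample).
    """
    pairs = re.findall(r"([0-9A-Fa-f]{4}):\s*(\d+)", value)
    return {name.upper(): int(count) for name, count in pairs}


# Value copied from the Statrep.nsf example in the article.
sample = "430D: 58 0A13:42 030B:28 0116:26 0A12:21"
timeouts = parse_sem_timeouts(sample)

# The semaphore with the highest timeout count is the first one to look at.
worst = max(timeouts, key=timeouts.get)
print(worst, timeouts[worst])
```

As the article notes, a high count on a busy server is not necessarily a problem; the ranking only suggests where to start looking.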

Causes of server hangs include resource problems (insufficient resources), third-party application conflicts, and hardware problems. In general, server hangs are more difficult to analyze than server crashes. One final note: crashes and hangs not only occur on the Domino server, they can also happen on the Notes client.

Troubleshooting

In this section, we examine some general approaches to troubleshooting server crashes and server hangs.

Troubleshooting Domino server crashes

If Domino has crashed and is not able to restart, remove tasks from the Notes.ini variable ServerTasks and attempt to narrow down and identify the task causing the crash. When you suspect a particular task is causing the problem, open the server console and narrow down the possible error messages generated by that task. For example, if the router crashed while accessing mail in Mail.box, rename Mail.box and allow the server to recreate Mail.box.

If you suspect the problem is caused by a corrupted database, run offline maintenance tasks on this database. If the crash is occurring on a scheduled basis, review the actions performed on the server at the time of the crash.

Consider the following questions:

* Is the Domino server reporting error messages to the console or the log file?
* What is the exact syntax of the error message?
* Where is the error message being generated, on the Domino server or on the Notes client?
* When did this problem first appear?
* Did you implement any recent changes before the problem started appearing?

Troubleshooting Notes client crashes

First, find out whether or not the problem is specific to a single user. If so, check the configuration of that user and compare it to configurations for other users. Also, determine whether or not the problem happens due to a specific application being accessed. If so, review the application with a developer.

If you suspect the problem is caused by a corrupted database or document, run the maintenance tasks Updall, Fixup, and Compact (with appropriate switches). If you think the problem is due to a bad index, also try to recreate the database's full-text index, if possible.

Troubleshooting Domino server hangs

If semaphore problems appear constantly on the server console, check whether task schedules conflict. If the system is responding slowly, check your non-Domino applications to see whether they are also performing slowly. Additionally, as a general rule, make sure your operating system is updated with all the latest patches.

NSD analysis
Determining the process that crashed the server is often the first step in resolving a server crash. In Domino 6 and later, the NSD file can be a good place to start. NSD gives you all current information about the state of the server (call stacks for all threads, memory information, and so on). In the event of a crash, an NSD log file will automatically be generated by the Domino server and stored in the data\IBM_TECHNICAL_SUPPORT directory. An NSD log will have a file name with a time stamp showing the time when the NSD was generated. For example: Nsd_W32I_KIRANTP_2006_01_17@17_17_18.log indicates this NSD was created on January 17, 2006. When NSD runs, it attaches to each process and thread to dump the call stacks. This can help you determine the cause of a server or workstation crash.
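
When several NSD logs have piled up, it can help to sort them by the time stamp embedded in the file name. The sketch below is illustrative Python, not part of Domino; it assumes the name always ends in `_YYYY_MM_DD@HH_MM_SS` followed by `.log` (or `.dmp`, as in the memory dump example later in the article), which is inferred only from the two sample names shown here.

```python
import re
from datetime import datetime

# Assumption: the time stamp is the last underscore-separated part of the
# name, in the form YYYY_MM_DD@HH_MM_SS, directly before the extension.
STAMP = re.compile(r"_(\d{4}_\d{2}_\d{2}@\d{2}_\d{2}_\d{2})\.(?:log|dmp)$",
                   re.IGNORECASE)


def diagnostic_timestamp(filename):
    """Return the datetime embedded in an NSD or memory dump file name,
    or None if the name does not follow the assumed pattern."""
    match = STAMP.search(filename)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y_%m_%d@%H_%M_%S")


print(diagnostic_timestamp("Nsd_W32I_KIRANTP_2006_01_17@17_17_18.log"))
```

Sorting a directory listing with `key=diagnostic_timestamp` (after filtering out `None` results) puts the most recent crash last.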

The "heart" of an NSD file is the stack trace section. This section provides a breakdown of the code path each thread in a currently existing process traversed to put it in its current state. This is very helpful in examining hang or crash situations on a server. Also, by examining the NSD file, you can find any core files generated in a Domino data directory, and can do a base-level analysis to trace the final stack of calls that were made by the process that died and left behind the core. In a complex product such as Domino, a stack trace of the same type of action on two different servers can produce different results.

In the NSD file, you can identify the executable in the failing process by performing a word search for "fatal," "panic," or "segmentation." By finding the process, we can see what preceded it, and hopefully determine how the crash occurred. When neither "panic" nor "fatal" is found, sometimes a core dump will contain a reference to a "segmentation fault" in a function. This indicates that the process tried to access a shared memory segment that was corrupted for some reason, and will crash without calling "fatal_error" or "panic."
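
The word search described above is easy to script when an NSD log is too large to scroll through. This is a hedged Python sketch: the three keywords come from the article, while the function name and the case-insensitive matching are assumptions made for illustration.

```python
import re

# Keywords named in the article; matching is case-insensitive so that
# "FATAL THREAD" and "_Panic" are both found (an assumption of this sketch).
KEYWORDS = re.compile(r"fatal|panic|segmentation", re.IGNORECASE)


def find_crash_markers(lines):
    """Return (line number, text) for every line mentioning a keyword."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if KEYWORDS.search(line):
            hits.append((number, line.rstrip()))
    return hits


# A few lines in the style of an NSD log, used here as stand-in input.
sample_lines = [
    "### FATAL THREAD 39/83 [ nSERVER:07c0: 2764]",
    "### FP=0743f548, PC=60197cf3, SP=0743ebd0, stksize=2424",
    "Exception code: c0000005 (ACCESS_VIOLATION)",
    "@[ 1] 0x60197cf3 nnotes._Panic@4+483 (7430016,496dae76,0,496dace",
]
for number, text in find_crash_markers(sample_lines):
    print(number, text)
```

With a real log, an open file object can be passed instead of the list, for example `open(path, errors="replace")`, since the function only iterates over lines.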

The following is a sample excerpt from an NSD file where a server process is involved in a crash:

### FATAL THREAD 39/83 [ nSERVER:07c0: 2764]
### FP=0743f548, PC=60197cf3, SP=0743ebd0, stksize=2424
Exception code: c0000005 (ACCESS_VIOLATION)
############################################################
@[ 1] 0x60197cf3 nnotes._Panic@4+483 (7430016,496dae76,0,496dace

@[ 2] 0x600018a4 nnotes._OSBBlockAddr@8+148 (1153f38,2000000,743f608,1)
@[ 3] 0x6000bd92 nnotes._CollectionNavigate@24+610 (0,743fc74,f,0)
@[ 4] 0x600626cc nnotes._ReadEntries@68+2860 (4c5440e8,4cfb8dba,800f,1)
@[ 5] 0x600b9f6f nnotes._NIFReadEntriesExt@72+351 (0,4cfb8dba,800f,1)
@[ 6] 0x10032d40 nserverl._ServerReadEntries@8+1424 (0,8d0c0035,4b64b5bc,4ae46dd6)
@[ 7] 0x100191fc nserverl._DbServer@8+2284 (41b0383,cb740064,0,23696f

@[ 8] 0x1002b8c8 nserverl._WorkThreadTask@8+1576 (4711d68,0,3,563fb10)
@[ 9] 0x100016cb nserverl._Scheduler@4+763 (0,563fb10,0,10ec334)
@[10] 0x6011e5e4 nnotes._ThreadWrapper@4+212 (0,10ec334,563fb10,0)
[11] 0x77e887dd KERNEL32.GetModuleFileNameA+465

When the failing process has been determined, you can focus on troubleshooting that particular process.
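
To read a fatal thread's stack programmatically, each frame line can be broken into its index, module, symbol, and offset. The Python sketch below assumes the Win32 frame layout shown in the excerpt (`@[ n] address module.symbol+offset`); other platforms may format frames differently, and the names used here are illustrative only.

```python
import re

# Assumed frame layout, taken from the excerpt above:
#   @[ n] 0xADDRESS module.symbol+offset (arguments...)
FRAME = re.compile(r"^@?\[\s*(\d+)\]\s+(0x[0-9a-fA-F]+)\s+(\w+)\.(\S+?)\+(\d+)")

EXCERPT = """\
@[ 1] 0x60197cf3 nnotes._Panic@4+483 (7430016,496dae76,0,496dace
@[ 2] 0x600018a4 nnotes._OSBBlockAddr@8+148 (1153f38,2000000,743f608,1)
@[ 6] 0x10032d40 nserverl._ServerReadEntries@8+1424 (0,8d0c0035,4b64b5bc,4ae46dd6)
[11] 0x77e887dd KERNEL32.GetModuleFileNameA+465
"""


def parse_frames(text):
    """Return a list of (index, module, symbol, offset) tuples."""
    frames = []
    for line in text.splitlines():
        match = FRAME.match(line.strip())
        if match:
            index, _address, module, symbol, offset = match.groups()
            frames.append((int(index), module, symbol, int(offset)))
    return frames


frames = parse_frames(EXCERPT)
# The frame just below Panic is the function that detected the problem.
caller = next(f for f in frames if "Panic" not in f[2])
print(f"{caller[1]}.{caller[2]}")
```

In the excerpt, reading upward from the bottom shows a server worker thread servicing a read-entries request before the panic was raised.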

ServerTasks
If a server is crashing continuously (for example, every five minutes), a useful troubleshooting step is to temporarily remove the ServerTasks= line from the server's Notes.ini file. The server can then be restarted and tasks can be loaded individually to determine which process is causing the crash.
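
The step above can be prepared on a copy of Notes.ini before touching the live file. This Python sketch pulls the ServerTasks= line out of the text and lists the tasks so they can be loaded one at a time with the console `load` command. The sample file content and the function name are hypothetical, and the sketch assumes the task list is comma-separated.

```python
def split_server_tasks(ini_text):
    """Return (ini text without the ServerTasks line, list of task names).

    Assumption: tasks are comma-separated on a single ServerTasks= line.
    """
    kept, tasks = [], []
    for line in ini_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "servertasks":
            tasks.extend(t.strip() for t in value.split(",") if t.strip())
        else:
            kept.append(line)
    return "\n".join(kept) + "\n", tasks


# Hypothetical Notes.ini content, for illustration only.
sample_ini = "[Notes]\nKitType=2\nServerTasks=Update,Replica,Router,AMgr,AdminP,HTTP\n"

stripped, tasks = split_server_tasks(sample_ini)
for task in tasks:
    # After restarting with the stripped file, load tasks one by one
    # and watch the console for the crash.
    print(f"load {task}")
```

Keep the original file so the ServerTasks= line can be restored once the failing task has been identified.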

Panic messages
When Domino detects an internal consistency error, or a condition that may lead to corruption of data or some other problem, it immediately calls a subroutine called Panic. This is a special construct used to continually monitor critical parts of the code as it operates. This helps catch problems as early as possible, before they escalate and possibly destroy data. When a panic takes place, it brings the system to a stop (and thus can be considered a controlled crash). Panics generate messages, sometimes in English and sometimes in code (for example: PANIC: 04:3C). You can give this code to Lotus Software Technical Support for further troubleshooting.

Troubleshooting tools

This section reviews some of the troubleshooting tools available to you when you encounter a Domino server crash or hang. Before using any of these tools, be sure to consult the Domino administration documentation. Also, the Domino self-help support page is a good resource for troubleshooting information.

RIP (Domino R5)

A RIP file is generated when a server crashes. This file contains information about what the server was doing when it crashed. It reports any crash on the system, not just ones related to Domino. RIP files are generated only in Domino 5.x. In Domino 6 and later, NSD serves the purpose formerly performed by RIP, and also includes additional capabilities not included in RIP.

For a RIP file to be generated, QNC.EXE needs to be loaded on the Domino server. The QNC.EXE program (often called "quincy") is the default debugger program that ships with Domino. The QNC.EXE program is usually located in the \Domino directory. To enable QNC.EXE, type "qnc -I" at the operating system's command prompt. You can also enable QNC.EXE by typing "qnc nserver" at server launch. If RIP files are not generated when the server crashes, check whether QNC.EXE is enabled. Normally, RIP files get created in the data directory.

NSD (Domino 6 and later)

As mentioned previously, Domino 6 and later provides the NSD feature. This is a file that contains information about the state of the server at the time of a crash. For more information, see the section, "NSD analysis," earlier in this article.

Memory dump (Domino 6 and later)

In Domino 6 and later, you can use the command “sh memory dump” on the server console to create a memory dump file. A memory dump contains information on memory currently used by Domino. This is very useful when troubleshooting performance problems and memory leaks. Normally, memory dump files get collected in the data\IBM_TECHNICAL_SUPPORT directory. A memory dump file name includes a time stamp for the time when the dump was generated. For example:

memory_KIRANTP_2005_09_14@17_50_08.dmp

Note: To record the available memory to a file instead of viewing it on the server console, enter the following server console command: sh memory dump >memory.txt

HTTP request logs

To troubleshoot issues related to Domino Web server crashes and hangs, Lotus Software Technical Support will often ask you to create an HTTP request log. To enable the default settings for request logs, edit the server's Notes.ini file and add the line HTTPEnableThreadDebug=1. This sets HTTP request logging at the default level. (To set the logging level to record more details, see the Domino administration documentation.) You can also enable HTTP request logging dynamically by entering "tell http debug thread on | off" at the Domino server console. With HTTP request logging enabled, Domino creates a series of files with the name htthr*.log. For example: htthr_a40_10_20050914@171556.log.

HTTP request logging should be used only for troubleshooting specific issues, and usually at the direction of and with assistance from Lotus Software Technical Support. Do not use request logging for other purposes, such as general administration. These log files grow in size over time, so you should not leave this setting enabled for long periods or you could consume all available drive space.
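
Because these logs grow without bound, it is worth checking how much space they occupy while request logging is on. The following Python sketch totals the size of files matching htthr*.log in one directory. Which directory Domino writes them to is not stated in this article, so the path is left to the caller, and the function name is illustrative.

```python
from pathlib import Path


def request_log_usage(directory):
    """Return (total bytes, file count) for htthr*.log files in directory.

    The directory is an assumption supplied by the caller; point it at
    wherever the server actually writes its HTTP request logs.
    """
    files = [p for p in Path(directory).glob("htthr*.log") if p.is_file()]
    return sum(p.stat().st_size for p in files), len(files)


total_bytes, count = request_log_usage(".")
print(f"{count} request log file(s), {total_bytes / 1_000_000:.1f} MB")
```

A scheduled check like this can serve as a reminder to switch logging off with "tell http debug thread off" once the data has been captured.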

Automatic Data Collection

Notes/Domino 6.0.1 introduced the automatic diagnostic data collection tool, also known as Automatic Data Collection, or ADC for short. Automatic Data Collection simply means that, when a Notes client or Domino server crashes, the program gathers all the necessary data to debug the crash and sends it to a mail-in database when the client or server restarts. Administrators then have one location per domain in which they can see all the crashes that have occurred for all clients and servers. This will help eliminate the instances where an administrator or user may not be able to capture the proper data on a client or server crash.

Notes.ini settings

To troubleshoot performance and crash issues, you can enable the following Notes.ini debugging parameters:

* Debug_threadid=1 logs each process and thread ID for each server operation.
* Debug_show_timeout=1 turns on semaphore timeout messages to the console, and creates a semaphore text file called semdebug.txt.
* Debug_capture_timeout=10 time stamps each semaphore timeout message.
* CONSOLE_LOG_ENABLED=1 (Domino 6 and later) enables Domino console logging.
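
The four parameters above can be merged into a Notes.ini file without creating duplicate keys. This is a small Python sketch working on the file's text; the parameter names and values come from the list above, while the helper name and the case-insensitive key matching are assumptions of the sketch.

```python
DEBUG_SETTINGS = {
    "Debug_threadid": "1",
    "Debug_show_timeout": "1",
    "Debug_capture_timeout": "10",
    "CONSOLE_LOG_ENABLED": "1",
}


def apply_settings(ini_text, settings):
    """Set each key in ini_text, replacing an existing line or appending.

    Assumption: keys are compared case-insensitively.
    """
    remaining = {key.lower(): (key, value) for key, value in settings.items()}
    out = []
    for line in ini_text.splitlines():
        key, sep, _old = line.partition("=")
        hit = remaining.pop(key.strip().lower(), None) if sep else None
        out.append(f"{hit[0]}={hit[1]}" if hit else line)
    # Anything not already present is appended at the end.
    out.extend(f"{key}={value}" for key, value in remaining.values())
    return "\n".join(out) + "\n"


# Hypothetical starting content, for illustration only.
print(apply_settings("[Notes]\nDebug_threadid=0\n", DEBUG_SETTINGS))
```

Running the function a second time on its own output changes nothing, so it is safe to apply repeatedly. Remember to remove the debug parameters once troubleshooting is finished.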

Fault recovery for server crashes

You can set up fault recovery to automatically handle Domino server crashes. When the server crashes, it shuts itself down and then restarts automatically, without any administrator intervention. Domino records crash information in the data directory. When the server restarts, Domino checks to see if it is restarting after a crash. If it is, an email is automatically sent to the person or group in the "Mail Fault Notification to" field.

A fatal error (such as an operating system exception or an internal panic) terminates each Domino process and releases all associated resources. The startup script detects the situation and restarts the server. If you are using multiple server partitions and a failure occurs in a single partition, only that partition is terminated and restarted.

New troubleshooting features in Domino 7

This section briefly discusses some new Domino 7 features that can help you analyze and correct server hangs and crashes.

Domino Domain Monitoring

One of the most significant and useful server maintenance and troubleshooting features in Domino 7 is Domino Domain Monitoring (DDM). This provides one central location for monitoring all the servers in a domain (or multiple domains). DDM uses programs called probes to gather server information from the individual servers, and then report back to a special database (DDM.nsf) where you can view the collected data. This allows you to monitor, analyze, and troubleshoot a large number of servers from a single Domino Administrator console.

Activity Trends

The Activity Trends feature lets you analyze "historical" server data, to help spot trends that can only be identified over an extended period of time. You can review this data to help predict and avoid future issues. This data is collected from the log file (Log.nsf) and the Catalog task, and stored in the Activity Trends database (Activity.nsf). The Activity Trends Collector task processes this data, and produces "trended" data that you can use for charting and resource balancing.

Writing status bar history to a log file

You can now enable Notes client logging of status bar messages to the local log file (Log.nsf) or to an external file that you designate. This can help you troubleshoot Notes client crashes. Use the Notes.ini setting logstatusbar=1 to enable logging of status bar messages to Log.nsf. To view the logged messages, open Log.nsf and then click the Miscellaneous Events view. Status bar messages are appended with Status Msg. To write the status bar messages to an external file, use the Notes.ini setting Debug_Outfile=<path to file> with the Notes.ini setting logstatusbar=1. For example:
logstatusbar=1
Debug_Outfile=c:\temp\StatusBarLogging.txt

This logs status bar messages to the file StatusBarLogging.txt.

The Log.nsf file can also provide a snapshot of actions logged in the status bar before the Notes client crashed.
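If you need to apply these two settings to many client installations, a small script can do the edit. The sketch below is an illustration only: `set_ini_values` is a hypothetical helper, and it assumes Notes.ini is a plain list of `name=value` lines whose names can be compared case-insensitively. It works on the file's text, so reading and writing the actual file is left to the caller.

```python
def set_ini_values(ini_text, settings):
    """Return ini_text with every key in settings set to its value.

    Lines for keys that already exist are updated in place (names are
    compared case-insensitively, which is an assumption of this sketch);
    keys that are not present are appended at the end.
    """
    remaining = {key.lower(): (key, value) for key, value in settings.items()}
    out_lines = []
    for line in ini_text.splitlines():
        name, sep, _ = line.partition("=")
        key = name.strip().lower()
        if sep and key in remaining:
            _, value = remaining.pop(key)
            out_lines.append(f"{name.strip()}={value}")  # keep original spelling
        else:
            out_lines.append(line)  # section headers and other settings
    for original_key, value in remaining.values():
        out_lines.append(f"{original_key}={value}")
    return "\n".join(out_lines) + "\n"


if __name__ == "__main__":
    sample = "[Notes]\nKeyFilename=user.id\n"
    print(set_ini_values(sample, {
        "logstatusbar": "1",
        "Debug_Outfile": r"c:\temp\StatusBarLogging.txt",
    }))
```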

Fault Analyzer

Fault Analyzer is a new server feature that processes all new crashes as they are delivered to the Automatic Data Collection mail-in database. The Fault Analyzer task searches the database configured for Fault Report documents and determines whether or not the stack matches a crash that has already been seen by a user or server. It adds to the functionality of the Automatic Data Collection feature by analyzing the call stacks that are located in the Fault Report mail-in database, and evaluating them to determine whether or not there are other instances of the same problem.

Fault Analyzer is configured at the same time that you set up Automatic Data Collection (see figure 1). Use the Server Configuration document to set up Automatic Data Collection on the server and to enable or disable Fault Analyzer.

Figure 1. Configuring Fault Analyzer

If Fault Analyzer locates duplicate fault reports, the new crash is reported as a response to the original crash, and attachments are either removed from the response document to save space in the database, or they are saved with the response document.
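To make the idea of matching call stacks concrete, here is a rough Python sketch of duplicate detection. It is not how Fault Analyzer is implemented; the matching rule (compare the top few frames, ignoring offsets), the function names, and the sample frame names are all assumptions chosen for illustration.

```python
def stack_signature(call_stack, depth=5):
    """Reduce a call stack (list of frame strings) to a comparable key.

    Only the top `depth` frames are used, and any '+0x..' offset suffix
    is dropped so that offsets within a function do not matter.
    """
    top_frames = call_stack[:depth]
    return tuple(frame.split("+")[0].strip().lower() for frame in top_frames)


def file_fault_reports(reports):
    """Group reports so duplicates become responses to the first occurrence.

    reports: list of (report_id, call_stack) tuples in arrival order.
    Returns a dict mapping each original report id to its duplicate ids.
    """
    first_seen = {}  # signature -> id of the original report
    threads = {}     # original id -> ids of duplicate (response) reports
    for report_id, call_stack in reports:
        signature = stack_signature(call_stack)
        if signature in first_seen:
            threads[first_seen[signature]].append(report_id)
        else:
            first_seen[signature] = report_id
            threads[report_id] = []
    return threads


if __name__ == "__main__":
    # Made-up sample data; the frame names are hypothetical.
    sample = [
        ("A", ["OpenNote+0x1a", "RouteMessage+0x40"]),
        ("B", ["LockObject+0x08", "ServerLoop+0x10"]),
        ("C", ["OpenNote+0x2c", "RouteMessage+0x44"]),
    ]
    print(file_fault_reports(sample))
```

Keying on a normalized signature instead of the raw stack text is what lets two crashes in the same code path be treated as one problem even when the offsets differ.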

Automatic Data Collection enhancements

When you use the Automatic Data Collection tool to gather information about server crashes, it now first checks whether the server is running under the Domino Controller and, if so, uses the Controller logs. If not, it checks whether console logging is enabled and, if so, uses the console output. If neither the Domino Controller nor console logging is in use, the data is extracted from Log.nsf.
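The order of precedence described above can be written as a small decision function. This Python sketch only restates the paragraph; the function name and the returned labels are illustrative and are not part of any Domino API.

```python
def choose_console_source(runs_under_controller, console_logging_enabled):
    """Pick where crash-time console data is taken from.

    Precedence follows the text: Controller logs first, then the
    console log, and Log.nsf only when neither of those is available.
    """
    if runs_under_controller:
        return "controller logs"
    if console_logging_enabled:
        return "console log"
    return "Log.nsf"


if __name__ == "__main__":
    for controller in (True, False):
        for console in (True, False):
            print(controller, console, choose_console_source(controller, console))
```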

Now you can select which files (using wildcards) will be collected by the Automatic Data Collection tool when it runs on clients or servers. On Notes clients, it is configured using a Desktop Policy Settings document (see figure 2).

Figure 2. Configuring Automatic Data Collection on the Notes client

On Domino servers, it is configured using the Server Configuration document (see figure 3).

Figure 3. Configuring Automatic Data Collection on the Domino server

This allows you to collect diagnostic files from other IBM products, as well as third-party add-ins.
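As a rough illustration of wildcard-based selection, the Python sketch below filters a list of file names against a set of patterns using the standard `fnmatch` module. The pattern syntax that Automatic Data Collection itself accepts is not described here, so the shell-style wildcards, the case-insensitive comparison, and the sample file names are all assumptions.

```python
from fnmatch import fnmatch


def select_files(file_names, patterns):
    """Return the file names that match at least one wildcard pattern.

    Names and patterns are lower-cased first so that matching is
    case-insensitive on every platform.
    """
    return [
        name
        for name in file_names
        if any(fnmatch(name.lower(), pattern.lower()) for pattern in patterns)
    ]


if __name__ == "__main__":
    # Hypothetical directory listing and patterns.
    files = ["nsd_2006.log", "console.log", "addin_trace.txt", "notes.ini"]
    print(select_files(files, ["*.log", "addin_*"]))
```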

There is a possibility that the output sent by Automatic Data Collection could be very large. If this becomes a problem, you can configure Automatic Data Collection to restrict the size of attachments sent by NSD and the console log to the Fault Reports database (see figure 3).

Shutdown Monitor

It often takes a long time for the Domino server to actually shut down after you issue a quit or restart server command. To avoid this delay, the Shutdown Monitor task ensures that Domino terminates when requested to do so. If the server doesn't terminate in the allotted time, the server will forcefully terminate and an NSD log will be generated before termination. The time limit is specified in the Server Shutdown Timeout field of the Automatic Server Restart section of the Server document, on the Basics tab (see figure 4).

Figure 4. Setting the Server Shutdown Timeout

The default Server Shutdown Timeout setting is 5 minutes. This feature can be disabled using the Notes.ini setting shutdown_monitor_disabled=1.
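The "ask politely, then force" pattern behind this feature can be sketched in a few lines of Python. This is a generic illustration using the standard `subprocess` module, not the Shutdown Monitor implementation: `stop_with_timeout` and its `request_stop` callback are hypothetical names, and the place where a real monitor would collect an NSD log is marked only by a comment.

```python
import subprocess
import sys


def stop_with_timeout(process, request_stop, timeout_seconds):
    """Ask a process to stop and force it if it does not exit in time.

    Returns "clean" if the process exited within the timeout after
    request_stop(process) was called, and "forced" otherwise.
    """
    request_stop(process)
    try:
        process.wait(timeout=timeout_seconds)
        return "clean"
    except subprocess.TimeoutExpired:
        # A real monitor would capture diagnostics (for example an NSD
        # log) at this point, before terminating the process.
        process.kill()
        process.wait()
        return "forced"


if __name__ == "__main__":
    # Stand-in for a server that ignores the shutdown request.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    print(stop_with_timeout(child, lambda p: None, timeout_seconds=0.5))
```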

Process Monitor (Windows platforms only)

The Process Monitor task monitors the processes that should be running as part of the Domino server environment. (This task runs on Microsoft Windows platforms only; this functionality is implemented in Domino for Unix platforms without using a separate server task.) If any of these processes is missing, or if one terminates unexpectedly without completing the usual Domino termination routines, this task causes the server to panic and identify which process has prematurely terminated. The Process Monitor task works with Nprocmon.exe, which monitors the Nserver.exe process for abnormal terminations.

This feature can significantly reduce the number of abnormal termination problems, which otherwise are difficult to analyze (because it's often difficult to determine which process has terminated and caused the server problem). To disable the Process Monitor task, set the variable process_monitor_disabled=1 in the server's Notes.ini file.
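The core check that such a monitor performs, comparing the processes that should exist against those that actually do, can be sketched as follows. This Python fragment is an illustration only: how the real task obtains the process list is not covered, and the process names in the example are sample values.

```python
def find_missing_processes(expected, running):
    """Return the expected process names that are not currently running.

    Names are compared case-insensitively, as Windows executable names
    are; the order of `expected` is preserved in the result.
    """
    running_names = {name.lower() for name in running}
    return [name for name in expected if name.lower() not in running_names]


if __name__ == "__main__":
    # Sample values; a real monitor would query the operating system.
    expected = ["nserver.exe", "nrouter.exe", "nupdate.exe"]
    running = ["NSERVER.EXE", "nupdate.exe", "explorer.exe"]
    missing = find_missing_processes(expected, running)
    if missing:
        print("Prematurely terminated:", ", ".join(missing))
```

Reporting the missing name at once is what makes this kind of failure easier to analyze than discovering the problem later.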

Conclusion

In this article, we defined the difference between a Domino server hang and a crash, discussed troubleshooting procedures you can follow and tools you can use when analyzing and fixing Notes/Domino problems, and looked at the new troubleshooting features introduced in Notes/Domino 7. You can consult this article whenever you encounter a hang or crash with the Notes client or Domino server -- which hopefully won't be very often!



About the author

Kiran Bellari joined IBM in 2004 and initially worked for the Domino Document Manager team before joining Domino server crash performance technical support. Kiran is a dual CLP, and earned his Bachelor's Degree from the National Institute of Technology.


xiao1124

#3

Posted 2007-03-08 18:06

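Thanks for posting this; I read through it and learned a lot. My Domino server has recently been running into trouble roughly every 2 hours, which I suppose counts as a hang. First it reports that the TCP/IP protocol stack has run out of resources and tells you to add memory or reduce the number of client connections; then it reports that there are insufficient system resources to complete the requested service. The key point is that any other operation on the machine produces the same message, and only a reboot clears it.
On the hardware side, apart from the dial-up connection, which is hard to rule out, everything else should be fine. The problem is still unresolved, and I hope an expert can offer some advice.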


sabrinachen

#4

Posted 2007-03-12 09:38

Is any third-party software installed, or were any configuration changes made beforehand?
