About the callout queue
So, is the system's callout mechanism actually working?
First, look at the thread whose cv_timedwait call never returns on timeout:
stack pointer for thread fffffe80044e5c80: fffffe80044e5880
[ fffffe80044e5880 _resume_from_idle+0xf8() ]
  fffffe80044e58c0 swtch+0x167()
  fffffe80044e5930 cv_timedwait+0xcf(ffffffff82f76390, ffffffff82f76388, 1036d)
  fffffe80044e59c0 cv_timedwait_sig+0x2cc(ffffffff82f76390, ffffffff82f76388, 1036d)
  fffffe80044e5a70 e1000g_send+0x136(ffffffff82f76370, ffffffffac2fce40)
  fffffe80044e5ab0 e1000g_m_tx+0x6f(ffffffff82f76000, ffffffffa21f8180)
  fffffe80044e5ad0 dls_tx+0x1d(ffffffff82f2ec80, ffffffffa21f8180)
  fffffe80044e5b20 dld_wsrv+0xcc(ffffffff894acb70)
  fffffe80044e5b50 runservice+0x42(ffffffff894acb70)
  fffffe80044e5b80 queue_service+0x42(ffffffff894acb70)
  fffffe80044e5bc0 stream_service+0x73(ffffffff83905740)
  fffffe80044e5c60 taskq_d_thread+0xbb(ffffffff833af820)
  fffffe80044e5c70 thread_start+8()
Checking the system-wide callout table, we see what we found earlier: the callout entry for this thread is long overdue:
> ::callout ! grep fffffe80044e5c80
setrun fffffe80044e5c80 3ffffffffffe1a80 1036d (T-33730)
Here T-33730 means the entry is already 33730 ticks overdue; at the default clock rate of hz=100 that is roughly 337 seconds.
Dumping all callout entries with mdb shows many more overdue entries in the system; a quick wc puts the count at 2573.
> ::callout
FUNCTION        ARGUMENT         ID                TIME
sigalarm2proc   ffffffff9569aae0 7fffffffffffc010  144a1 (T-17038)
sigalarm2proc   ffffffff91bb7510 7fffffffffffe010  14484 (T-17067)
sigalarm2proc   ffffffff9569c380 7fffffffffffc020  144a1 (T-17038)
sigalarm2proc   ffffffff95428d48 7fffffffffffc030  144a1 (T-17038)
sigalarm2proc   ffffffff91bb8db0 7fffffffffffe030  14483 (T-17068)
sigalarm2proc   ffffffff9542b238 7fffffffffffc040  144a1 (T-17038)
.....................[snipped]....................

> ::callout ! grep "T-" | wc -l
2573
It looks as if the callout mechanism has broken down, and that is the direct reason cv_timedwait stopped working.
Let us now find out what exactly went wrong with callout, starting from its code.
First, on every tick the clock-interrupt handler enters the clock routine, which calls callout_schedule to process the global callout tables:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/clock.c#692
static void
clock(void)
{
    ..........................
    ..........................
    ..........................
    /*
     * Schedule timeout() requests if any are due at this time.
     */
    callout_schedule();
callout_schedule loops over all the callout tables and passes each table to callout_schedule_1:
/*
 * Schedule callouts for all callout tables. Called by clock() on each tick.
 */
void
callout_schedule(void)
{
    int f, t;

    if (cpr_stop_callout)
        return;

    for (t = 0; t < CALLOUT_NTYPES; t++)
        for (f = 0; f < callout_fanout; f++)
            callout_schedule_1(callout_table[CALLOUT_TABLE(t, f)]);
}
callout_schedule_1 then walks the callout entries of the given table and runs callout_execute in one of two different ways.
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/callout.c#295
/*
 * Schedule any callouts that are due on or before this tick.
 */
static void
callout_schedule_1(callout_table_t *ct)
{
    callout_t *cp;
    clock_t curtime, runtime;

    mutex_enter(&ct->ct_lock);
    ct->ct_curtime = curtime = lbolt;
    while (((runtime = ct->ct_runtime) - curtime) <= 0) {
        for (cp = ct->ct_lbhash[CALLOUT_LBHASH(runtime)];
            cp != NULL; cp = cp->c_lbnext) {
            if (cp->c_runtime != runtime ||
                (cp->c_xid & CALLOUT_EXECUTING))
                continue;
            mutex_exit(&ct->ct_lock);
            if (ct->ct_taskq == NULL)
                softcall((void (*)(void *))callout_execute, ct);
            else
                (void) taskq_dispatch(ct->ct_taskq,
                    (task_func_t *)callout_execute, ct,
                    KM_NOSLEEP);
            return;
        }
        ct->ct_runtime++;
    }
    mutex_exit(&ct->ct_lock);
}
To sum up, callout processing normally goes down one of two code paths:
1. softcall
clock -> callout_schedule -> callout_schedule_1 -> softcall raises a PIL-1 soft interrupt that runs callout_execute
2. taskq
clock -> callout_schedule -> callout_schedule_1 -> taskq_dispatch hands callout_execute to a dedicated taskq thread
Now let us look at the implementation of callout_execute:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/callout.c#241
/*
 * Do the actual work of executing callouts. This routine is called either
 * by a taskq_thread (normal case), or by softcall (realtime case).
 */
static void
callout_execute(callout_table_t *ct)
{
    callout_t *cp;
    callout_id_t xid;
    clock_t runtime;
    timestruc_t now;
    int64_t hresms;

    mutex_enter(&ct->ct_lock);
    ....[snipped]..............
            cp->c_executor = curthread;
            cp->c_xid = xid |= CALLOUT_EXECUTING;
            mutex_exit(&ct->ct_lock);
            DTRACE_PROBE1(callout__start, callout_t *, cp);
            (*cp->c_func)(cp->c_arg);
            DTRACE_PROBE1(callout__end, callout_t *, cp);
            mutex_enter(&ct->ct_lock);
    ....[snipped]..............
    mutex_exit(&ct->ct_lock);
}
Notice that callout_execute takes mutex_enter(&ct->ct_lock) to serialize access to the callout table entries, but it releases the lock with mutex_exit(&ct->ct_lock) before invoking the actual timer function. Multiple taskq threads or soft interrupts can therefore execute callout functions concurrently without getting in each other's way.
I had not read the callout_execute code carefully before and made a serious mistake here; thanks again to the reader who pointed it out, and I hope the earlier error did not mislead anyone.
Coming back to the call stack of thread fffffe80000ddc80: it is clearly running e1000g_LocalTimer, which the e1000g driver registered via timeout(9F), and this callout_execute is being run by a dedicated taskq thread:
stack pointer for thread fffffe80000ddc80: fffffe80000dd9d0
[ fffffe80000dd9d0 _resume_from_idle+0xf8() ]
  fffffe80000dda10 swtch+0x167()
  fffffe80000ddab0 turnstile_block+0x76b(ffffffff93338a00, 1, ffffffff82f76288, fffffffffbc05908, 0, 0)
  fffffe80000ddb10 rw_enter_sleep+0x1de(ffffffff82f76288, 1)
  fffffe80000ddb40 e1000g_tx_cleanup+0x56(ffffffff82f76000)
  fffffe80000ddb80 e1000g_LocalTimer+0x19(ffffffff82f76000)
  fffffe80000ddbd0 callout_execute+0xb1(ffffffff80a3e000)
  fffffe80000ddc60 taskq_thread+0x1a7(ffffffff826be7a0)
  fffffe80000ddc70 thread_start+8()
While running, this thread in turn went to sleep waiting on a reader/writer lock. We worked out earlier that the lock's owner is fffffe80044e5c80, which is itself asleep in cv_timedwait, waiting for the callout mechanism to wake it up:
stack pointer for thread fffffe80044e5c80: fffffe80044e5880
[ fffffe80044e5880 _resume_from_idle+0xf8() ]
  fffffe80044e58c0 swtch+0x167()
  fffffe80044e5930 cv_timedwait+0xcf(ffffffff82f76390, ffffffff82f76388, 1036d)
  fffffe80044e59c0 cv_timedwait_sig+0x2cc(ffffffff82f76390, ffffffff82f76388, 1036d)
  fffffe80044e5a70 e1000g_send+0x136(ffffffff82f76370, ffffffffac2fce40)
  fffffe80044e5ab0 e1000g_m_tx+0x6f(ffffffff82f76000, ffffffffa21f8180)
  fffffe80044e5ad0 dls_tx+0x1d(ffffffff82f2ec80, ffffffffa21f8180)
  fffffe80044e5b20 dld_wsrv+0xcc(ffffffff894acb70)
  fffffe80044e5b50 runservice+0x42(ffffffff894acb70)
  fffffe80044e5b80 queue_service+0x42(ffffffff894acb70)
  fffffe80044e5bc0 stream_service+0x73(ffffffff83905740)
  fffffe80044e5c60 taskq_d_thread+0xbb(ffffffff833af820)
  fffffe80044e5c70 thread_start+8()
cv_timedwait registers its callout entry through realtime_timeout, which means callout_execute has to be run via the softcall mechanism, and it is that callout_execute which eventually calls setrun to wake the thread up.
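For orientation only, here is a rough sketch of the pattern cv_timedwait follows; it is not the verbatim OpenSolaris source, the names are taken from the call stack and the "setrun" callout entry shown above, and the real routine in common/os/condvar.c carries more bookkeeping:

/*
 * Sketch only (kernel-internal, not compilable stand-alone): arm a realtime
 * callout that will setrun() the caller, block on the condition variable,
 * then cancel the callout on wakeup.
 */
clock_t
cv_timedwait_sketch(kcondvar_t *cvp, kmutex_t *mp, clock_t tim)
{
    kthread_t *t = curthread;
    timeout_id_t id;
    clock_t timeleft = tim - lbolt;

    if (timeleft <= 0)
        return (-1);                    /* deadline already passed */

    /*
     * realtime_timeout() puts a "setrun <thread>" entry into the realtime
     * callout table, and callout_schedule_1() dispatches that table
     * through softcall(), as seen earlier.
     */
    id = realtime_timeout((void (*)(void *))setrun, t, timeleft);

    cv_wait(cvp, mp);                   /* sleep until setrun() or cv_signal() */

    (void) untimeout(id);               /* drop the callout if it has not fired */
    return (tim - lbolt);               /* > 0: woken in time, <= 0: timed out */
}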
And we already know that fffffe80044e5c80 never returned from cv_timedwait and that its callout entry is long overdue:
> ::callout ! grep fffffe80044e5c80
setrun fffffe80044e5c80 3ffffffffffe1a80 1036d (T-33730)
So why did this happen? Naturally, we need to understand how softcall is implemented.
About softcall
http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/softint.c#103
/*
 * Call function func with argument arg
 * at some later time at software interrupt priority
 */
void
softcall(void (*func)(void *), void *arg)
{
    softcall_t *sc;

    /*
     * protect against cross-calls
     */
    mutex_enter(&softcall_lock);
    /* coalesce identical softcalls */
    for (sc = softhead; sc != 0; sc = sc->sc_next) {
        if (sc->sc_func == func && sc->sc_arg == arg) {
            mutex_exit(&softcall_lock);
            return;
        }
    }

    if ((sc = softfree) == 0)
        panic("too many softcalls");
    softfree = sc->sc_next;
    sc->sc_func = func;
    sc->sc_arg = arg;
    sc->sc_next = 0;

    if (softhead) {
        softtail->sc_next = sc;
        softtail = sc;
        mutex_exit(&softcall_lock);
    } else {
        softhead = softtail = sc;
        if (softcall_state == SOFT_DRAIN)
            /*
             * softint is already running; no need to
             * raise a siron. Due to lock protection of
             * softhead / softcall state, we know
             * that softint() will see the new addition to
             * the softhead queue.
             */
            mutex_exit(&softcall_lock);
        else {
            softcall_state = SOFT_PEND;
            mutex_exit(&softcall_lock);
            siron();
        }
    }
}
As the code shows, softcall puts the function to be executed on a kernel-global queue and leaves it to the system to process; the queue is reachable through the softhead pointer:
http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/softint.c#73
static softcall_t softcalls[NSOFTCALLS], *softhead, *softtail, *softfree;
The source also shows a global variable, softcall_state, that tracks the state of the softcall queue. When the first request is queued, the state is set to pending, SOFT_PEND, and siron() is called to raise a soft interrupt on a CPU.
The possible values of softcall_state are defined as follows:
http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/softint.c#60
/*
* Defined states for softcall processing.
*/
#define SOFT_IDLE 0x01 /* no processing is needed */
#define SOFT_PEND 0x02 /* softcall list needs processing */
#define SOFT_DRAIN 0x04 /* the list is being processed */
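To make the enqueue logic easier to follow, here is a minimal user-space model of softcall's enqueue path; it is a sketch with invented scaffolding (no locking, no drain side), but it captures the two behaviours that matter for this analysis: identical (func, arg) requests are coalesced, and siron() is raised only when the queue goes from empty to SOFT_PEND:

#include <stdio.h>

#define SOFT_IDLE  0x01                 /* no processing is needed */
#define SOFT_PEND  0x02                 /* softcall list needs processing */
#define SOFT_DRAIN 0x04                 /* the list is being processed */

typedef struct softcall {
    void (*sc_func)(void *);
    void *sc_arg;
    struct softcall *sc_next;
} softcall_t;

#define NSOFTCALLS 200
static softcall_t softcalls[NSOFTCALLS], *softhead, *softtail, *softfree;
static int softcall_state = SOFT_IDLE;

static void
siron(void)                             /* stand-in for raising the PIL-1 soft interrupt */
{
    printf("siron(): raise PIL-1 soft interrupt\n");
}

static void
softcall_init(void)
{
    int i;

    for (i = 0; i < NSOFTCALLS - 1; i++)
        softcalls[i].sc_next = &softcalls[i + 1];
    softfree = softcalls;
}

static void
softcall(void (*func)(void *), void *arg)
{
    softcall_t *sc;

    /* coalesce: an identical request that is already queued is dropped */
    for (sc = softhead; sc != NULL; sc = sc->sc_next)
        if (sc->sc_func == func && sc->sc_arg == arg)
            return;

    if ((sc = softfree) == NULL)
        return;                         /* the real kernel panics: "too many softcalls" */
    softfree = sc->sc_next;
    sc->sc_func = func;
    sc->sc_arg = arg;
    sc->sc_next = NULL;

    if (softhead != NULL) {
        softtail->sc_next = sc;         /* queue already live: no new interrupt */
        softtail = sc;
    } else {
        softhead = softtail = sc;
        if (softcall_state != SOFT_DRAIN) {
            softcall_state = SOFT_PEND; /* empty -> pending: raise the interrupt once */
            siron();
        }
    }
}

static void
cb(void *arg)
{
    (void)arg;
}

int
main(void)
{
    softcall_init();
    softcall(cb, (void *)0x1);          /* first request: SOFT_PEND + siron() */
    softcall(cb, (void *)0x1);          /* identical request: coalesced away */
    softcall(cb, (void *)0x2);          /* new request: queued, but no second siron() */
    printf("softcall_state=0x%x\n", softcall_state);
    return (0);
}

The consequence that matters here is visible in the last branch: once the state is SOFT_PEND, siron() has already been issued on some CPU, and nothing raises it again elsewhere while the requests sit on the queue.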
Walking this queue with mdb, we find two callout_execute calls waiting to be run:
>*softhead::list softcall_t sc_next|::print softcall_t
{
sc_func = callout_execute
sc_arg = 0xffffffff80219000
sc_next = softcalls+0x1290
}
{
sc_func = callout_execute
sc_arg = 0xffffffff80216000
sc_next = 0
}
>softcall_state/J
softcall_state:
softcall_state: 2
mdb shows the queue state at the time was 0x2, i.e. SOFT_PEND, waiting to be processed.
So is our thread fffffe80044e5c80, the one blocked in cv_timedwait, covered by one of these two pending callout_execute calls?
sc_arg is the argument passed to callout_execute, a callout_table_t pointer; let us examine it with mdb:
> 0xffffffff80219000::print callout_table_t
{
ct_lock = {
_opaque = [ 0 ]
}
ct_freelist = 0xffffffffad201c38
ct_curtime = 0x1872f
ct_runtime = 0x1036d
ct_taskq = 0
ct_short_id = 0x3ffffffffffe1a80
ct_long_id = 0x7ffffffffffd91a0
ct_idhash = [ 0, 0xffffffffa40a9560, 0xffffffffa40a9600, 0xffffffffa40a9740, 0xffffffffa40a9880, 0xffffffffa40a9970,
0xffffffffa9383ce0, 0xffffffffa40a9b50, 0xffffffffa40a9c40, 0xffffffffa40a9d30, 0xffffffffa40a9e70, 0xffffffffa409f068,
0xffffffffa409f108, 0xffffffffa409f248, 0xffffffffa409f338, 0xffffffffa409f478, 0xffffffffa409f568, 0xffffffffa409f6a8,
0xffffffffa409f798, 0xffffffffa409f888, 0xffffffffa9383c90, 0xffffffffa409fa68, 0xffffffffa1a2e878, 0xffffffffa409fb08,
0xffffffffacfbb4c8, 0xffffffffa409fd38, 0xffffffffa1a2eeb8, 0xffffffffa40ba008, 0xffffffffa40ba198, 0xffffffffa40ba2d8,
0xffffffffa40ba3c8, 0xffffffffa40ba4b8, ... ]
ct_lbhash = [ 0, 0, 0, 0, 0xffffffffad188ca0, 0, 0, 0, 0, 0, 0, 0, 0, 0xffffffff8ff5a658, 0, 0, 0xffffffff8276e878, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xffffffffad1d0528, ... ]
}
Very conveniently, the first entry hanging off ct_short_id is exactly the setrun entry registered by our cv_timedwait call, so we do not even need to walk the whole list :)
> ::callout ! grep fffffe80044e5c80
setrun fffffe80044e5c80 3ffffffffffe1a80 1036d (T-33730)
So we have found the root of why cv_timedwait did not return as expected: the softcall queue is severely backed up, and by the look of it, it will never be processed again.
Why is that? Naturally we want to look at the CPU state, since the softcall queue is serviced by a soft interrupt. The soft-interrupt handler runs at PIL 1, which outranks any ordinary thread, so unless the CPU is busy with other interrupts it should be handled almost immediately.
> ::cpuinfo -v
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 fffffffffbc27730 1f 1 6 169 no no t-3 fffffe80000bfc80 sched
| | |
RUNNING | +--> PIL THREAD
READY | 10 fffffe80000bfc80
QUIESCED | 6 fffffe80000b9c80
EXISTS |
ENABLE +--> PRI THREAD PROC
99 fffffe80000d1c80 sched
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
1 fffffffffbc2f260 1b 1 0 -1 no no t-17 fffffe8000401c80 (idle)
| |
RUNNING +--> PRI THREAD PROC
READY 60 fffffe80044d9c80 sched
EXISTS
ENABLE
In fact, we saw earlier that the system has two CPUs: CPU1 is idle, while CPU0 has a BSPL of 6, meaning interrupts at or below level 6 cannot preempt the thread currently running there; that BSPL was set precisely by the NIC interrupt thread to mask lower-priority interrupts.
Since our softcall queue has stayed in the PEND state for so long, the pending softcalls were presumably dispatched to CPU0.
Is that really so? We can inspect the machcpu member of each cpu_t with mdb to see which CPU the softcall was assigned to:
CPU0:
> fffffffffbc27d10::print struct machcpu mcpu_softinfo
{
mcpu_softinfo.st_pending = 0x416
}
CPU 1:
> fffffffffbc2f840::print struct machcpu mcpu_softinfo
{
mcpu_softinfo.st_pending = 0x404
}
So CPU0 has soft interrupts pending at PIL 1, 2, 4 and 10, while CPU1 has PIL 2 and 10 pending (0x404 sets bits 2 and 10). The PIL-1 soft interrupt is the one that drains the softcall queue.
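For reference, a throwaway user-space snippet that decodes these masks, assuming (based on the values above rather than on the kernel headers) that bit N of st_pending means a soft interrupt at PIL N is pending:

#include <stdio.h>

static void
decode_st_pending(const char *cpu, unsigned int pending)
{
    int pil;

    printf("%s st_pending=0x%x -> pending PILs:", cpu, pending);
    for (pil = 1; pil < 16; pil++)
        if (pending & (1u << pil))
            printf(" %d", pil);
    printf("\n");
}

int
main(void)
{
    decode_st_pending("CPU0", 0x416);   /* -> 1 2 4 10 */
    decode_st_pending("CPU1", 0x404);   /* -> 2 10 */
    return (0);
}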
mdb confirms that the handler for that PIL-1 soft interrupt is softlevel1:
> ::softint
ADDR PEND PIL ARG1 ARG2 ISR(s)
ffffffff8277a5c0 0 1 ffffffff8048da80 0 errorq_intr
fffffffffbc05ae8 0 1 0 0 softlevel1
ffffffff8277a4c0 0 2 ffffffff8048dd00 0 errorq_intr
fffffffffbc00070 0 2 0 0 cbe_low_level
ffffffff83a706c0 0 4 ffffffff90004d18 0 ghd_doneq_process
ffffffff8ed31880 0 4 ffffffff90004d18 0 ghd_timeout_softintr
ffffffff83946e00 0 4 ffffffff8f046c40 0 power_soft_intr
ffffffff83a70c00 0 4 ffffffff83b1b000 0 bge_chip_factotum
ffffffff83a70cc0 0 4 ffffffff83b1b000 0 bge_reschedule
ffffffff8277a2c0 0 4 0 0 asysoftintr
ffffffff82f4d100 0 9 ffffffff82f76370 0 e1000g_tx_softint_worker
ffffffff82f4df00 0 9 ffffffff82f86370 0 e1000g_tx_softint_worker
ffffffff833b3e80 0 9 ffffffff801af7e8 0 hcdi_soft_intr
ffffffff8277a000 0 9 ffffffff801afb68 0 hcdi_soft_intr
fffffffffbc00030 0 10 0 0 cbe_softclock
But as we saw earlier, the NIC interrupt thread fffffe80000b9c80 on CPU0 is asleep on a reader/writer lock, and the lock's owner, fffffe80044e5c80, is sitting in the callout queue waiting to be woken up so it can run again.
Executing that callout in turn requires CPU0 to process the callout_execute sitting in the softcall queue, but soft interrupts on CPU0 have been masked off by the NIC interrupt thread fffffe80000b9c80 through BSPL.
At this point the deadlock is complete: threads fffffe80000b9c80 and fffffe80044e5c80 will wait for each other forever.
Of course, this deadlock scenario only holds under one assumption:
once a softcall has been dispatched, it cannot be rescheduled onto another CPU. In our case, even though CPU1 is idle, the softcall queue cannot be handed over to it via a cross-call (an inter-processor interrupt).
On the face of it that assumption seems unreasonable; if it were true, it would amount to a kernel bug, and our analysis would be stuck again.
So there was nothing left but to mail the code list of the OpenSolaris.org community, where the assumption was finally confirmed: it holds because of these two kernel bugs:
http://www.opensolaris.org/jive/thread.jspa?threadID=38081&tstart=30
http://www.opensolaris.org/jive/thread.jspa?threadID=38118&tstart=30
6292092 callout should not be blocked by interrupts from executing realtime timeouts
6540436 kpreempt() needs a more reliable way to generate level1 intr
Would the deadlock still happen if these two kernel bugs were fixed?
The answer is not hard to work out: on an SMP system, probably not; but on a UP system with a single CPU, the system would deadlock all the same even without those two bugs.
So in the end this deadlock is really an e1000g driver problem. Fortunately it only exists in an experimental build of e1000g; introducing a reader/writer lock into the interrupt path was clearly a mistake. A safer, non-blocking pattern is sketched below.
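For illustration only, here is a hedged sketch of how a timeout(9F) handler can stay non-blocking: try the lock with rw_tryenter(9F) and skip the round if it is contended, instead of sleeping inside callout_execute. The names xx_local_timer, xx_tx_cleanup, struct xx_softc and XX_TIMER_INTERVAL_USEC are invented for the example, not the real e1000g symbols:

#include <sys/types.h>
#include <sys/ksynch.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

#define XX_TIMER_INTERVAL_USEC  1000000         /* hypothetical: fire once per second */

struct xx_softc {                               /* hypothetical per-instance state */
    krwlock_t    xx_tx_lock;
    timeout_id_t xx_timer_id;
};

static void xx_tx_cleanup(struct xx_softc *sc); /* hypothetical cleanup routine */

/*
 * Hypothetical sketch, not the real e1000g code: a periodic handler
 * registered with timeout(9F) that never blocks.  If the TX rwlock is
 * contended, skip this round instead of sleeping inside callout_execute().
 */
static void
xx_local_timer(void *arg)
{
    struct xx_softc *sc = arg;

    if (rw_tryenter(&sc->xx_tx_lock, RW_READER) != 0) {
        xx_tx_cleanup(sc);                      /* periodic TX cleanup */
        rw_exit(&sc->xx_tx_lock);
    }

    /* Re-arm the timer; never rw_enter() or cv_wait() here. */
    sc->xx_timer_id = timeout(xx_local_timer, sc,
        drv_usectohz(XX_TIMER_INTERVAL_USEC));
}

Either skipping the round, as above, or handing the real work to a driver taskq keeps the timer callback from sleeping inside callout processing.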
Summary
1. The root cause of the system hang is improper use of an rwlock in e1000g, which leads to a deadlock;
2. On callout:
e1000g_LocalTimer is a timer function registered by e1000g, yet it tries to take an rwlock and therefore sleeps inside callout_execute. As we have seen, a sleeping callout_execute stalls processing of the system-wide callout table and puts the callout mechanism out of action, so code that can block must not be called from e1000g_LocalTimer;
3. A bug in Solaris Internals, 2nd edition:
After working through this problem, I happened to notice a bug in Figure 17.6 of Chapter 17, page 836 of the English edition of Solaris Internals, 2nd edition.
The figure describes rwlocks. In the reader-lock part of the figure, the reader hold count should start at bit 3 (bits 3..31/63), whereas the book says bit 4.
The source code confirms this:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/rwlock.c#40
/*
* Big Theory Statement for readers/writer locking primitives.
*
* An rwlock provides exclusive access to a single thread ("writer") or
* concurrent access to multiple threads ("readers"). See rwlock(9F)
* for a full description of the interfaces and programming model.
* The rest of this comment describes the implementation.
*
* An rwlock is a single word with the following structure:
*
 *  -----------------------------------------------------------------
 *  | OWNER (writer) or HOLD COUNT (readers) | WRLOCK | WRWANT | WAIT |
 *  -----------------------------------------------------------------
 *    63/31 .............................. 3      2        1       0
I also verified this against the crash dump: the actual value of the rwlock is 0xb, and bit 3 is set, i.e. the hold count is 1, exactly one reader.
> ffffffff82f76288::rwlock
ADDR OWNER/COUNT FLAGS WAITERS
ffffffff82f76288 READERS=1 B011 ffffffff838470c0 (W)
|| fffffe800027bc80 (R)
WRITE_WANTED -------+| fffffe80000ddc80 (R)
HAS_WAITERS --------+ fffffe80000b9c80 (R)
>
> ffffffff82f76288/J
0xffffffff82f76288: b
> ffffffff82f76288/R
0xffffffff82f76288: 1011
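As a quick sanity check of that layout, here is a throwaway snippet (plain C, no kernel headers) that decodes the raw word 0xb the same way ::rwlock interprets it:

#include <stdio.h>
#include <stdint.h>

/*
 * Bit layout from the rwlock.c comment above:
 * bit 0 = WAIT, bit 1 = WRWANT, bit 2 = WRLOCK,
 * bits 3..63/31 = writer owner or reader hold count.
 */
static void
decode_rwlock(uint64_t w)
{
    printf("raw=0x%llx WAIT=%llu WRWANT=%llu WRLOCK=%llu HOLD_COUNT=%llu\n",
        (unsigned long long)w,
        (unsigned long long)(w & 1),
        (unsigned long long)((w >> 1) & 1),
        (unsigned long long)((w >> 2) & 1),
        (unsigned long long)(w >> 3));
}

int
main(void)
{
    decode_rwlock(0xb);    /* -> WAIT=1 WRWANT=1 WRLOCK=0 HOLD_COUNT=1 (one reader) */
    return (0);
}

The decoded fields match the ::rwlock output above: one reader holds the lock, a writer is wanted, and waiters are present.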