About the callout queue
So, is the system's callout mechanism actually working?
First, look at the thread whose cv_timedwait call never returns on timeout:
stack pointer for thread fffffe80044e5c80: fffffe80044e5880
[ fffffe80044e5880 _resume_from_idle+0xf8() ]
  fffffe80044e58c0 swtch+0x167()
  fffffe80044e5930 cv_timedwait+0xcf(ffffffff82f76390, ffffffff82f76388, 1036d)
  fffffe80044e59c0 cv_timedwait_sig+0x2cc(ffffffff82f76390, ffffffff82f76388, 1036d)
  fffffe80044e5a70 e1000g_send+0x136(ffffffff82f76370, ffffffffac2fce40)
  fffffe80044e5ab0 e1000g_m_tx+0x6f(ffffffff82f76000, ffffffffa21f8180)
  fffffe80044e5ad0 dls_tx+0x1d(ffffffff82f2ec80, ffffffffa21f8180)
  fffffe80044e5b20 dld_wsrv+0xcc(ffffffff894acb70)
  fffffe80044e5b50 runservice+0x42(ffffffff894acb70)
  fffffe80044e5b80 queue_service+0x42(ffffffff894acb70)
  fffffe80044e5bc0 stream_service+0x73(ffffffff83905740)
  fffffe80044e5c60 taskq_d_thread+0xbb(ffffffff833af820)
  fffffe80044e5c70 thread_start+8()
Checking the system-wide callout table, we see what we found earlier: the callout entry for this thread is long overdue:
> ::callout ! grep fffffe80044e5c80
setrun fffffe80044e5c80 3ffffffffffe1a80 1036d (T-33730)
Here T-33730 means the entry is already 33730 ticks overdue; at the default clock rate of hz=100 that is roughly 337 seconds.
Dumping all callout entries with mdb shows many more overdue entries in the system; a quick wc puts the count at 2573.
> ::callout
FUNCTION        ARGUMENT         ID                TIME
sigalarm2proc   ffffffff9569aae0 7fffffffffffc010  144a1 (T-17038)
sigalarm2proc   ffffffff91bb7510 7fffffffffffe010  14484 (T-17067)
sigalarm2proc   ffffffff9569c380 7fffffffffffc020  144a1 (T-17038)
sigalarm2proc   ffffffff95428d48 7fffffffffffc030  144a1 (T-17038)
sigalarm2proc   ffffffff91bb8db0 7fffffffffffe030  14483 (T-17068)
sigalarm2proc   ffffffff9542b238 7fffffffffffc040  144a1 (T-17038)
.....................[snipped]....................

> ::callout ! grep "T-" | wc -l
2573
It looks as if the callout mechanism has broken down, and that is the direct reason cv_timedwait stopped working.
Let us now find out what exactly went wrong with callout, starting from its code.
First, on every tick the clock-interrupt handler enters the clock routine, which calls callout_schedule to process the global callout tables:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/clock.c#692
static void
clock(void)
{
    ..........................
    ..........................
    ..........................
    /*
     * Schedule timeout() requests if any are due at this time.
     */
    callout_schedule();
callout_schedule loops over all the callout tables and passes each table to callout_schedule_1:
/*
 * Schedule callouts for all callout tables. Called by clock() on each tick.
 */
void
callout_schedule(void)
{
    int f, t;

    if (cpr_stop_callout)
        return;

    for (t = 0; t < CALLOUT_NTYPES; t++)
        for (f = 0; f < callout_fanout; f++)
            callout_schedule_1(callout_table[CALLOUT_TABLE(t, f)]);
}
callout_schedule_1 then walks the callout entries of the given table and runs callout_execute in one of two different ways.
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/callout.c#295
/*
 * Schedule any callouts that are due on or before this tick.
 */
static void
callout_schedule_1(callout_table_t *ct)
{
    callout_t *cp;
    clock_t curtime, runtime;

    mutex_enter(&ct->ct_lock);
    ct->ct_curtime = curtime = lbolt;
    while (((runtime = ct->ct_runtime) - curtime) <= 0) {
        for (cp = ct->ct_lbhash[CALLOUT_LBHASH(runtime)];
            cp != NULL; cp = cp->c_lbnext) {
            if (cp->c_runtime != runtime ||
                (cp->c_xid & CALLOUT_EXECUTING))
                continue;
            mutex_exit(&ct->ct_lock);
            if (ct->ct_taskq == NULL)
                softcall((void (*)(void *))callout_execute, ct);
            else
                (void) taskq_dispatch(ct->ct_taskq,
                    (task_func_t *)callout_execute, ct,
                    KM_NOSLEEP);
            return;
        }
        ct->ct_runtime++;
    }
    mutex_exit(&ct->ct_lock);
}
To sum up, callout processing normally goes down one of two code paths:
1. softcall
clock -> callout_schedule -> callout_schedule_1 -> softcall raises a PIL-1 soft interrupt that runs callout_execute
2. taskq
clock -> callout_schedule -> callout_schedule_1 -> taskq_dispatch hands callout_execute to a dedicated taskq thread
Now let us look at the implementation of callout_execute:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/callout.c#241
/*
 * Do the actual work of executing callouts. This routine is called either
 * by a taskq_thread (normal case), or by softcall (realtime case).
 */
static void
callout_execute(callout_table_t *ct)
{
    callout_t *cp;
    callout_id_t xid;
    clock_t runtime;
    timestruc_t now;
    int64_t hresms;

    mutex_enter(&ct->ct_lock);
    ....[snipped]..............
            cp->c_executor = curthread;
            cp->c_xid = xid |= CALLOUT_EXECUTING;
            mutex_exit(&ct->ct_lock);
            DTRACE_PROBE1(callout__start, callout_t *, cp);
            (*cp->c_func)(cp->c_arg);
            DTRACE_PROBE1(callout__end, callout_t *, cp);
            mutex_enter(&ct->ct_lock);
    ....[snipped]..............
    mutex_exit(&ct->ct_lock);
}
Notice that callout_execute takes mutex_enter(&ct->ct_lock) to serialize access to the callout table entries, but it releases the lock with mutex_exit(&ct->ct_lock) before invoking the actual timer function. Multiple taskq threads or soft interrupts can therefore execute callout functions concurrently without getting in each other's way.
I had not read the callout_execute code carefully before and made a serious mistake here; thanks again to the reader who pointed it out, and I hope the earlier error did not mislead anyone.
Coming back to the call stack of thread fffffe80000ddc80: it is clearly running e1000g_LocalTimer, which the e1000g driver registered via timeout(9F), and this callout_execute is being run by a dedicated taskq thread:
stack pointer for thread fffffe80000ddc80: fffffe80000dd9d0
[ fffffe80000dd9d0 _resume_from_idle+0xf8() ]
  fffffe80000dda10 swtch+0x167()
  fffffe80000ddab0 turnstile_block+0x76b(ffffffff93338a00, 1, ffffffff82f76288, fffffffffbc05908, 0, 0)
  fffffe80000ddb10 rw_enter_sleep+0x1de(ffffffff82f76288, 1)
  fffffe80000ddb40 e1000g_tx_cleanup+0x56(ffffffff82f76000)
  fffffe80000ddb80 e1000g_LocalTimer+0x19(ffffffff82f76000)
  fffffe80000ddbd0 callout_execute+0xb1(ffffffff80a3e000)
  fffffe80000ddc60 taskq_thread+0x1a7(ffffffff826be7a0)
  fffffe80000ddc70 thread_start+8()
While running, this thread in turn went to sleep waiting on a reader/writer lock. We worked out earlier that the lock's owner is fffffe80044e5c80, which is itself asleep in cv_timedwait, waiting for the callout mechanism to wake it up:
stack pointer for thread fffffe80044e5c80: fffffe80044e5880
[ fffffe80044e5880 _resume_from_idle+0xf8() ]
  fffffe80044e58c0 swtch+0x167()
  fffffe80044e5930 cv_timedwait+0xcf(ffffffff82f76390, ffffffff82f76388, 1036d)
  fffffe80044e59c0 cv_timedwait_sig+0x2cc(ffffffff82f76390, ffffffff82f76388, 1036d)
  fffffe80044e5a70 e1000g_send+0x136(ffffffff82f76370, ffffffffac2fce40)
  fffffe80044e5ab0 e1000g_m_tx+0x6f(ffffffff82f76000, ffffffffa21f8180)
  fffffe80044e5ad0 dls_tx+0x1d(ffffffff82f2ec80, ffffffffa21f8180)
  fffffe80044e5b20 dld_wsrv+0xcc(ffffffff894acb70)
  fffffe80044e5b50 runservice+0x42(ffffffff894acb70)
  fffffe80044e5b80 queue_service+0x42(ffffffff894acb70)
  fffffe80044e5bc0 stream_service+0x73(ffffffff83905740)
  fffffe80044e5c60 taskq_d_thread+0xbb(ffffffff833af820)
  fffffe80044e5c70 thread_start+8()
cv_timedwait registers its callout entry through realtime_timeout, which means callout_execute has to be run via the softcall mechanism, and it is that callout_execute which eventually calls setrun to wake the thread up.
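For orientation only, here is a rough sketch of the pattern cv_timedwait follows; it is not the verbatim OpenSolaris source, the names are taken from the call stack and the "setrun" callout entry shown above, and the real routine in common/os/condvar.c carries more bookkeeping:

/*
 * Sketch only (kernel-internal, not compilable stand-alone): arm a realtime
 * callout that will setrun() the caller, block on the condition variable,
 * then cancel the callout on wakeup.
 */
clock_t
cv_timedwait_sketch(kcondvar_t *cvp, kmutex_t *mp, clock_t tim)
{
    kthread_t *t = curthread;
    timeout_id_t id;
    clock_t timeleft = tim - lbolt;

    if (timeleft <= 0)
        return (-1);                    /* deadline already passed */

    /*
     * realtime_timeout() puts a "setrun <thread>" entry into the realtime
     * callout table, and callout_schedule_1() dispatches that table
     * through softcall(), as seen earlier.
     */
    id = realtime_timeout((void (*)(void *))setrun, t, timeleft);

    cv_wait(cvp, mp);                   /* sleep until setrun() or cv_signal() */

    (void) untimeout(id);               /* drop the callout if it has not fired */
    return (tim - lbolt);               /* > 0: woken in time, <= 0: timed out */
}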
And we already know that fffffe80044e5c80 never returned from cv_timedwait and that its callout entry is long overdue:
> ::callout ! grep fffffe80044e5c80
setrun fffffe80044e5c80 3ffffffffffe1a80 1036d (T-33730)
So why did this happen? Naturally, we need to understand how softcall is implemented.
About softcall
http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/softint.c#103
/*
 * Call function func with argument arg
 * at some later time at software interrupt priority
 */
void
softcall(void (*func)(void *), void *arg)
{
    softcall_t *sc;

    /*
     * protect against cross-calls
     */
    mutex_enter(&softcall_lock);
    /* coalesce identical softcalls */
    for (sc = softhead; sc != 0; sc = sc->sc_next) {
        if (sc->sc_func == func && sc->sc_arg == arg) {
            mutex_exit(&softcall_lock);
            return;
        }
    }

    if ((sc = softfree) == 0)
        panic("too many softcalls");
    softfree = sc->sc_next;
    sc->sc_func = func;
    sc->sc_arg = arg;
    sc->sc_next = 0;

    if (softhead) {
        softtail->sc_next = sc;
        softtail = sc;
        mutex_exit(&softcall_lock);
    } else {
        softhead = softtail = sc;
        if (softcall_state == SOFT_DRAIN)
            /*
             * softint is already running; no need to
             * raise a siron. Due to lock protection of
             * softhead / softcall state, we know
             * that softint() will see the new addition to
             * the softhead queue.
             */
            mutex_exit(&softcall_lock);
        else {
            softcall_state = SOFT_PEND;
            mutex_exit(&softcall_lock);
            siron();
        }
    }
}
As the code shows, softcall puts the function to be executed on a kernel-global queue and leaves it to the system to process; the queue is reachable through the softhead pointer:
http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/softint.c#73
static softcall_t softcalls[NSOFTCALLS], *softhead, *softtail, *softfree;
The source also shows a global variable, softcall_state, that tracks the state of the softcall queue. When the first request is queued, the state is set to pending, SOFT_PEND, and siron() is called to raise a soft interrupt on a CPU.
The possible values of softcall_state are defined as follows:
http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/softint.c#60
/*
* Defined states for softcall processing.
*/
#define SOFT_IDLE 0x01 /* no processing is needed */
#define SOFT_PEND 0x02 /* softcall list needs processing */
#define SOFT_DRAIN 0x04 /* the list is being processed */
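To make the enqueue logic easier to follow, here is a minimal user-space model of softcall's enqueue path; it is a sketch with invented scaffolding (no locking, no drain side), but it captures the two behaviours that matter for this analysis: identical (func, arg) requests are coalesced, and siron() is raised only when the queue goes from empty to SOFT_PEND:

#include <stdio.h>

#define SOFT_IDLE  0x01                 /* no processing is needed */
#define SOFT_PEND  0x02                 /* softcall list needs processing */
#define SOFT_DRAIN 0x04                 /* the list is being processed */

typedef struct softcall {
    void (*sc_func)(void *);
    void *sc_arg;
    struct softcall *sc_next;
} softcall_t;

#define NSOFTCALLS 200
static softcall_t softcalls[NSOFTCALLS], *softhead, *softtail, *softfree;
static int softcall_state = SOFT_IDLE;

static void
siron(void)                             /* stand-in for raising the PIL-1 soft interrupt */
{
    printf("siron(): raise PIL-1 soft interrupt\n");
}

static void
softcall_init(void)
{
    int i;

    for (i = 0; i < NSOFTCALLS - 1; i++)
        softcalls[i].sc_next = &softcalls[i + 1];
    softfree = softcalls;
}

static void
softcall(void (*func)(void *), void *arg)
{
    softcall_t *sc;

    /* coalesce: an identical request that is already queued is dropped */
    for (sc = softhead; sc != NULL; sc = sc->sc_next)
        if (sc->sc_func == func && sc->sc_arg == arg)
            return;

    if ((sc = softfree) == NULL)
        return;                         /* the real kernel panics: "too many softcalls" */
    softfree = sc->sc_next;
    sc->sc_func = func;
    sc->sc_arg = arg;
    sc->sc_next = NULL;

    if (softhead != NULL) {
        softtail->sc_next = sc;         /* queue already live: no new interrupt */
        softtail = sc;
    } else {
        softhead = softtail = sc;
        if (softcall_state != SOFT_DRAIN) {
            softcall_state = SOFT_PEND; /* empty -> pending: raise the interrupt once */
            siron();
        }
    }
}

static void
cb(void *arg)
{
    (void)arg;
}

int
main(void)
{
    softcall_init();
    softcall(cb, (void *)0x1);          /* first request: SOFT_PEND + siron() */
    softcall(cb, (void *)0x1);          /* identical request: coalesced away */
    softcall(cb, (void *)0x2);          /* new request: queued, but no second siron() */
    printf("softcall_state=0x%x\n", softcall_state);
    return (0);
}

The consequence that matters here is visible in the last branch: once the state is SOFT_PEND, siron() has already been issued on some CPU, and nothing raises it again elsewhere while the requests sit on the queue.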
Walking this queue with mdb, we find two callout_execute calls waiting to be run:
>*softhead::list softcall_t sc_next|::print softcall_t
{
sc_func = callout_execute
sc_arg = 0xffffffff80219000
sc_next = softcalls+0x1290
}
{
sc_func = callout_execute
sc_arg = 0xffffffff80216000
sc_next = 0
}
>softcall_state/J
softcall_state:
softcall_state: 2
mdb shows the queue state at the time was 0x2, i.e. SOFT_PEND, waiting to be processed.
So is our thread fffffe80044e5c80, the one blocked in cv_timedwait, covered by one of these two pending callout_execute calls?
sc_arg is the argument passed to callout_execute, a callout_table_t pointer; let us examine it with mdb:
> 0xffffffff80219000::print callout_table_t
{
ct_lock = {
_opaque = [ 0 ]
}
ct_freelist = 0xffffffffad201c38
ct_curtime = 0x1872f
ct_runtime = 0x1036d
ct_taskq = 0
ct_short_id = 0x3ffffffffffe1a80
ct_long_id = 0x7ffffffffffd91a0
ct_idhash = [ 0, 0xffffffffa40a9560, 0xffffffffa40a9600, 0xffffffffa40a9740, 0xffffffffa40a9880, 0xffffffffa40a9970,
0xffffffffa9383ce0, 0xffffffffa40a9b50, 0xffffffffa40a9c40, 0xffffffffa40a9d30, 0xffffffffa40a9e70, 0xffffffffa409f068,
0xffffffffa409f108, 0xffffffffa409f248, 0xffffffffa409f338, 0xffffffffa409f478, 0xffffffffa409f568, 0xffffffffa409f6a8,
0xffffffffa409f798, 0xffffffffa409f888, 0xffffffffa9383c90, 0xffffffffa409fa68, 0xffffffffa1a2e878, 0xffffffffa409fb08,
0xffffffffacfbb4c8, 0xffffffffa409fd38, 0xffffffffa1a2eeb8, 0xffffffffa40ba008, 0xffffffffa40ba198, 0xffffffffa40ba2d8,
0xffffffffa40ba3c8, 0xffffffffa40ba4b8, ... ]
ct_lbhash = [ 0, 0, 0, 0, 0xffffffffad188ca0, 0, 0, 0, 0, 0, 0, 0, 0, 0xffffffff8ff5a658, 0, 0, 0xffffffff8276e878, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xffffffffad1d0528, ... ]
}
Very conveniently, the first entry hanging off ct_short_id is exactly the setrun entry registered by our cv_timedwait call, so we do not even need to walk the whole list :)
> ::callout ! grep fffffe80044e5c80
setrun fffffe80044e5c80 3ffffffffffe1a80 1036d (T-33730)
So we have found the root of why cv_timedwait did not return as expected: the softcall queue is severely backed up, and by the look of it, it will never be processed again.
Why is that? Naturally we want to look at the CPU state, since the softcall queue is serviced by a soft interrupt. The soft-interrupt handler runs at PIL 1, which outranks any ordinary thread, so unless the CPU is busy with other interrupts it should be handled almost immediately.
> ::cpuinfo -v
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 fffffffffbc27730 1f 1 6 169 no no t-3 fffffe80000bfc80 sched
| | |
RUNNING | +--> PIL THREAD
READY | 10 fffffe80000bfc80
QUIESCED | 6 fffffe80000b9c80
EXISTS |
ENABLE +--> PRI THREAD PROC
99 fffffe80000d1c80 sched
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
1 fffffffffbc2f260 1b 1 0 -1 no no t-17 fffffe8000401c80 (idle)
| |
RUNNING +--> PRI THREAD PROC
READY 60 fffffe80044d9c80 sched
EXISTS
ENABLE
In fact, we saw earlier that the system has two CPUs: CPU1 is idle, while CPU0 has a BSPL of 6, meaning interrupts at or below level 6 cannot preempt the thread currently running there; that BSPL was set precisely by the NIC interrupt thread to mask lower-priority interrupts.
Since our softcall queue has stayed in the PEND state for so long, the pending softcalls were presumably dispatched to CPU0.
Is that really so? We can inspect the machcpu member of each cpu_t with mdb to see which CPU the softcall was assigned to:
CPU0:
> fffffffffbc27d10::print struct machcpu mcpu_softinfo
{
mcpu_softinfo.st_pending = 0x416
}
CPU 1:
> fffffffffbc2f840::print struct machcpu mcpu_softinfo
{
mcpu_softinfo.st_pending = 0x404
}
So CPU0 has soft interrupts pending at PIL 1, 2, 4 and 10, while CPU1 has PIL 2 and 10 pending (0x404 sets bits 2 and 10). The PIL-1 soft interrupt is the one that drains the softcall queue.
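For reference, a throwaway user-space snippet that decodes these masks, assuming (based on the values above rather than on the kernel headers) that bit N of st_pending means a soft interrupt at PIL N is pending:

#include <stdio.h>

static void
decode_st_pending(const char *cpu, unsigned int pending)
{
    int pil;

    printf("%s st_pending=0x%x -> pending PILs:", cpu, pending);
    for (pil = 1; pil < 16; pil++)
        if (pending & (1u << pil))
            printf(" %d", pil);
    printf("\n");
}

int
main(void)
{
    decode_st_pending("CPU0", 0x416);   /* -> 1 2 4 10 */
    decode_st_pending("CPU1", 0x404);   /* -> 2 10 */
    return (0);
}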
mdb confirms that the handler for that PIL-1 soft interrupt is softlevel1:
> ::softint
ADDR PEND PIL ARG1 ARG2 ISR(s)
ffffffff8277a5c0 0 1 ffffffff8048da80 0 errorq_intr
fffffffffbc05ae8 0 1 0 0 softlevel1
ffffffff8277a4c0 0 2 ffffffff8048dd00 0 errorq_intr
fffffffffbc00070 0 2 0 0 cbe_low_level
ffffffff83a706c0 0 4 ffffffff90004d18 0 ghd_doneq_process
ffffffff8ed31880 0 4 ffffffff90004d18 0 ghd_timeout_softintr
ffffffff83946e00 0 4 ffffffff8f046c40 0 power_soft_intr
ffffffff83a70c00 0 4 ffffffff83b1b000 0 bge_chip_factotum
ffffffff83a70cc0 0 4 ffffffff83b1b000 0 bge_reschedule
ffffffff8277a2c0 0 4 0 0 asysoftintr
ffffffff82f4d100 0 9 ffffffff82f76370 0 e1000g_tx_softint_worker
ffffffff82f4df00 0 9 ffffffff82f86370 0 e1000g_tx_softint_worker
ffffffff833b3e80 0 9 ffffffff801af7e8 0 hcdi_soft_intr
ffffffff8277a000 0 9 ffffffff801afb68 0 hcdi_soft_intr
fffffffffbc00030 0 10 0 0 cbe_softclock
But as we saw earlier, the NIC interrupt thread fffffe80000b9c80 on CPU0 is asleep on a reader/writer lock, and the lock's owner, fffffe80044e5c80, is sitting in the callout queue waiting to be woken up so it can run again.
Executing that callout in turn requires CPU0 to process the callout_execute sitting in the softcall queue, but soft interrupts on CPU0 have been masked off by the NIC interrupt thread fffffe80000b9c80 through BSPL.
At this point the deadlock is complete: threads fffffe80000b9c80 and fffffe80044e5c80 will wait for each other forever.
Of course, this deadlock scenario only holds under one assumption:
once a softcall has been dispatched, it cannot be rescheduled onto another CPU. In our case, even though CPU1 is idle, the softcall queue cannot be handed over to it via a cross-call (an inter-processor interrupt).
On the face of it that assumption seems unreasonable; if it were true, it would amount to a kernel bug, and our analysis would be stuck again.
So there was nothing left but to mail the code list of the OpenSolaris.org community, where the assumption was finally confirmed: it holds because of these two kernel bugs:
http://www.opensolaris.org/jive/thread.jspa?threadID=38081&tstart=30
http://www.opensolaris.org/jive/thread.jspa?threadID=38118&tstart=30
6292092 callout should not be blocked by interrupts from executing realtime timeouts
6540436 kpreempt() needs a more reliable way to generate level1 intr
Would the deadlock still happen if these two kernel bugs were fixed?
The answer is not hard to work out: on an SMP system, probably not; but on a UP system with a single CPU, the system would deadlock all the same even without those two bugs.
So in the end this deadlock is really an e1000g driver problem. Fortunately it only exists in an experimental build of e1000g; introducing a reader/writer lock into the interrupt path was clearly a mistake. A safer, non-blocking pattern is sketched below.
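For illustration only, here is a hedged sketch of how a timeout(9F) handler can stay non-blocking: try the lock with rw_tryenter(9F) and skip the round if it is contended, instead of sleeping inside callout_execute. The names xx_local_timer, xx_tx_cleanup, struct xx_softc and XX_TIMER_INTERVAL_USEC are invented for the example, not the real e1000g symbols:

#include <sys/types.h>
#include <sys/ksynch.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

#define XX_TIMER_INTERVAL_USEC  1000000         /* hypothetical: fire once per second */

struct xx_softc {                               /* hypothetical per-instance state */
    krwlock_t    xx_tx_lock;
    timeout_id_t xx_timer_id;
};

static void xx_tx_cleanup(struct xx_softc *sc); /* hypothetical cleanup routine */

/*
 * Hypothetical sketch, not the real e1000g code: a periodic handler
 * registered with timeout(9F) that never blocks.  If the TX rwlock is
 * contended, skip this round instead of sleeping inside callout_execute().
 */
static void
xx_local_timer(void *arg)
{
    struct xx_softc *sc = arg;

    if (rw_tryenter(&sc->xx_tx_lock, RW_READER) != 0) {
        xx_tx_cleanup(sc);                      /* periodic TX cleanup */
        rw_exit(&sc->xx_tx_lock);
    }

    /* Re-arm the timer; never rw_enter() or cv_wait() here. */
    sc->xx_timer_id = timeout(xx_local_timer, sc,
        drv_usectohz(XX_TIMER_INTERVAL_USEC));
}

Either skipping the round, as above, or handing the real work to a driver taskq keeps the timer callback from sleeping inside callout processing.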
Summary
1. The root cause of the system hang is improper use of an rwlock in e1000g, which leads to a deadlock;
2. On callout:
e1000g_LocalTimer is a timer function registered by e1000g, yet it tries to take an rwlock and therefore sleeps inside callout_execute. As we have seen, a sleeping callout_execute stalls processing of the system-wide callout table and puts the callout mechanism out of action, so code that can block must not be called from e1000g_LocalTimer;
3. A bug in Solaris Internals, 2nd edition:
After working through this problem, I happened to notice a bug in Figure 17.6 of Chapter 17, page 836 of the English edition of Solaris Internals, 2nd edition.
The figure describes rwlocks. In the reader-lock part of the figure, the reader hold count should start at bit 3 (bits 3..31/63), whereas the book says bit 4.
The source code confirms this:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/rwlock.c#40
/*
* Big Theory Statement for readers/writer locking primitives.
*
* An rwlock provides exclusive access to a single thread ("writer") or
* concurrent access to multiple threads ("readers"). See rwlock(9F)
* for a full description of the interfaces and programming model.
* The rest of this comment describes the implementation.
*
* An rwlock is a single word with the following structure:
*
 *  -----------------------------------------------------------------
 *  | OWNER (writer) or HOLD COUNT (readers) | WRLOCK | WRWANT | WAIT |
 *  -----------------------------------------------------------------
 *    63/31 .............................. 3      2        1       0
I also verified this against the crash dump: the actual value of the rwlock is 0xb, and bit 3 is set, i.e. the hold count is 1, exactly one reader.
> ffffffff82f76288::rwlock
ADDR OWNER/COUNT FLAGS WAITERS
ffffffff82f76288 READERS=1 B011 ffffffff838470c0 (W)
|| fffffe800027bc80 (R)
WRITE_WANTED -------+| fffffe80000ddc80 (R)
HAS_WAITERS --------+ fffffe80000b9c80 (R)
>
> ffffffff82f76288/J
0xffffffff82f76288: b
> ffffffff82f76288/R
0xffffffff82f76288: 1011
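As a quick sanity check of that layout, here is a throwaway snippet (plain C, no kernel headers) that decodes the raw word 0xb the same way ::rwlock interprets it:

#include <stdio.h>
#include <stdint.h>

/*
 * Bit layout from the rwlock.c comment above:
 * bit 0 = WAIT, bit 1 = WRWANT, bit 2 = WRLOCK,
 * bits 3..63/31 = writer owner or reader hold count.
 */
static void
decode_rwlock(uint64_t w)
{
    printf("raw=0x%llx WAIT=%llu WRWANT=%llu WRLOCK=%llu HOLD_COUNT=%llu\n",
        (unsigned long long)w,
        (unsigned long long)(w & 1),
        (unsigned long long)((w >> 1) & 1),
        (unsigned long long)((w >> 2) & 1),
        (unsigned long long)(w >> 3));
}

int
main(void)
{
    decode_rwlock(0xb);    /* -> WAIT=1 WRWANT=1 WRLOCK=0 HOLD_COUNT=1 (one reader) */
    return (0);
}

The decoded fields match the ::rwlock output above: one reader holds the lock, a writer is wanted, and waiters are present.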