Chinaunix

Title: Kernel warnings, then the system hung

Author: LinuxCaiB    Time: 2013-11-06 11:13
Oct 16 01:18:32 mailserver kernel:  =======================
Oct 18 17:37:27 mailserver kernel: mptscsih: ioc0: attempting task abort! (sc=f451ad40)
Oct 18 17:37:27 mailserver kernel: sd 0:0:0:0:
Oct 18 17:37:27 mailserver kernel:         command: Write(10): 2a 00 00 04 fc 7d 00 01 48 00
Oct 18 17:37:27 mailserver kernel: mptscsih: ioc0: task abort: SUCCESS (sc=f451ad40)
Oct 18 17:40:51 mailserver kernel: mptscsih: ioc0: attempting task abort! (sc=ce96b940)
Oct 18 17:40:51 mailserver kernel: sd 0:0:1:0:
Oct 18 17:40:51 mailserver kernel:         command: Write(10): 2a 00 22 e7 f8 b7 00 04 00 00
Oct 18 17:40:51 mailserver kernel: mptscsih: ioc0: task abort: FAILED (sc=ce96b940)
Oct 18 17:40:51 mailserver kernel: mptscsih: ioc0: attempting task abort! (sc=ce96bd00)
Oct 18 17:40:51 mailserver kernel: sd 0:0:1:0:
Oct 18 17:40:51 mailserver kernel:         command: Write(10): 2a 00 22 e7 fc bf 00 04 00 00
Oct 18 17:40:51 mailserver kernel: mptscsih: ioc0: task abort: FAILED (sc=ce96bd00)
Oct 18 17:40:51 mailserver kernel: mptscsih: ioc0: attempting task abort! (sc=ecbdb1c0)
Oct 18 17:40:51 mailserver kernel: sd 0:0:1:0:
Oct 18 17:40:51 mailserver kernel:         command: Write(10): 2a 00 22 e8 14 0f 00 04 00 00
Oct 18 17:40:51 mailserver kernel: mptscsih: ioc0: task abort: FAILED (sc=ecbdb1c0)
Oct 18 17:40:51 mailserver kernel: mptscsih: ioc0: attempting target reset! (sc=ce96b940)
Oct 18 17:40:52 mailserver kernel: sd 0:0:1:0:
Oct 18 17:40:52 mailserver kernel:         command: Write(10): 2a 00 22 e7 f8 b7 00 04 00 00
Oct 18 17:40:52 mailserver kernel: mptscsih: ioc0: target reset: SUCCESS (sc=ce96b940)

OS: Red Hat 5.5, 32-bit. The kernel logged a pile of warnings and then the system hung.
Could someone with more experience explain what caused this?
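
For reference, the `command:` lines in the log above are raw SCSI command blocks. The following is a minimal sketch for decoding them, assuming the standard 10-byte WRITE(10) layout (opcode 0x2a, big-endian LBA in bytes 2-5, big-endian transfer length in bytes 7-8). The function name `decode_write10` is made up for illustration.

```python
def decode_write10(cdb_hex):
    """Decode a WRITE(10) CDB given as space-separated hex bytes.

    Returns (lba, blocks). Assumes the standard 10-byte layout;
    raises ValueError for anything else.
    """
    cdb = bytes.fromhex(cdb_hex)  # fromhex ignores the spaces between bytes
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) command block")
    lba = int.from_bytes(cdb[2:6], "big")     # starting logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # number of blocks to write
    return lba, blocks


if __name__ == "__main__":
    # The two distinct command shapes seen in the posted log.
    for cdb in ("2a 00 00 04 fc 7d 00 01 48 00",
                "2a 00 22 e7 f8 b7 00 04 00 00"):
        lba, blocks = decode_write10(cdb)
        print(f"LBA {lba}, {blocks} blocks")
```

For the first aborted command this gives LBA 326781 and 328 blocks, which is a 164 KiB write if the sectors are 512 bytes. The commands on 0:0:1:0 are 1024-block writes, so these are ordinary data writes timing out rather than anything exotic.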
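Author: humjb_1983    Time: 2013-11-06 12:46
Reply to 1# LinuxCaiB

Judging from the log output, the disk probably has a problem. Please post the earliest errors.
You can also check the disk with a SMART tool or with badblocks.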
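
To see which device the failures cluster on, the abort and reset outcomes can be tallied per SCSI address. This is a rough sketch, assuming the message format is exactly as in the lines posted in this thread (other driver versions may print something different). `summarize` and `SAMPLE` are illustrative names, and the sample lines are copied from the first post.

```python
import re
from collections import Counter

# "sd 0:0:1:0:" names the device the following result line refers to.
DEVICE = re.compile(r"kernel: sd (\d+:\d+:\d+:\d+):")
# Matches only the result lines, not the "attempting ..." lines.
RESULT = re.compile(
    r"mptscsih: ioc\d+: (task abort|target reset): (SUCCESS|FAILED)")

SAMPLE = [
    "Oct 18 17:37:27 mailserver kernel: mptscsih: ioc0: attempting task abort! (sc=f451ad40)",
    "Oct 18 17:37:27 mailserver kernel: sd 0:0:0:0:",
    "Oct 18 17:37:27 mailserver kernel:         command: Write(10): 2a 00 00 04 fc 7d 00 01 48 00",
    "Oct 18 17:37:27 mailserver kernel: mptscsih: ioc0: task abort: SUCCESS (sc=f451ad40)",
    "Oct 18 17:40:51 mailserver kernel: mptscsih: ioc0: attempting task abort! (sc=ce96b940)",
    "Oct 18 17:40:51 mailserver kernel: sd 0:0:1:0:",
    "Oct 18 17:40:51 mailserver kernel:         command: Write(10): 2a 00 22 e7 f8 b7 00 04 00 00",
    "Oct 18 17:40:51 mailserver kernel: mptscsih: ioc0: task abort: FAILED (sc=ce96b940)",
    "Oct 18 17:40:51 mailserver kernel: mptscsih: ioc0: attempting target reset! (sc=ce96b940)",
    "Oct 18 17:40:52 mailserver kernel: sd 0:0:1:0:",
    "Oct 18 17:40:52 mailserver kernel:         command: Write(10): 2a 00 22 e7 f8 b7 00 04 00 00",
    "Oct 18 17:40:52 mailserver kernel: mptscsih: ioc0: target reset: SUCCESS (sc=ce96b940)",
]


def summarize(lines):
    """Count (device, action, outcome) triples in mptscsih log lines."""
    counts = Counter()
    device = None  # most recent "sd h:c:t:l:" seen before a result line
    for line in lines:
        match = DEVICE.search(line)
        if match:
            device = match.group(1)
            continue
        match = RESULT.search(line)
        if match:
            action, outcome = match.groups()
            counts[(device, action, outcome)] += 1
            device = None
    return counts


if __name__ == "__main__":
    for key, n in sorted(summarize(SAMPLE).items()):
        print(key, n)
```

In the posted excerpt every FAILED abort is on 0:0:1:0, which is worth noting before running any disk tests.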
   
Author: LinuxCaiB    Time: 2013-11-06 13:56
Reply to 2# humjb_1983

Strangely, the errors stopped after a while.
Earlier system messages:
Sep 29 12:29:34 mailserver kernel: VMware memory control driver initialized
Sep 29 12:29:34 mailserver kernel: e1000: eth0: e1000_set_tso: TSO is Enabled
Sep 29 12:29:34 mailserver kernel: e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Sep 29 12:29:34 mailserver kernel: NET: Registered protocol family 10
Sep 29 12:29:34 mailserver kernel: lo: Disabled Privacy Extensions
Sep 29 12:29:34 mailserver kernel: IPv6 over IPv4 tunneling driver
Sep 29 12:29:34 mailserver xinetd[3432]: xinetd Version 2.3.14 started with libwrap loadavg labeled-networking options compiled in.
Sep 29 12:29:34 mailserver xinetd[3432]: Started working: 0 available services
Sep 29 12:29:34 mailserver kernel: Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
Sep 29 12:29:35 mailserver kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Sep 29 12:29:35 mailserver kernel: NFSD: starting 90-second grace period
Sep 29 12:29:43 mailserver tpvmlpd[3783]: device type not supported
Sep 29 12:29:58 mailserver tpvmlpd[4410]: device type not supported
Sep 29 12:30:13 mailserver tpvmlpd[4526]: device type not supported
Sep 29 12:30:28 mailserver tpvmlpd[4543]: device type not supported
Sep 29 12:30:43 mailserver tpvmlpd[4580]: device type not supported
Sep 29 12:30:58 mailserver tpvmlpd[4588]: device type not supported
Sep 29 12:31:13 mailserver tpvmlpd[4604]: device type not supported
Sep 29 12:31:44 mailserver tpvmlpd[4627]: device type not supported
Sep 29 12:31:59 mailserver tpvmlpd[4703]: device type not supported
Sep 29 12:32:14 mailserver tpvmlpd[4730]: device type not supported
Sep 29 12:32:29 mailserver tpvmlpd[4770]: device type not supported
Sep 29 12:32:44 mailserver tpvmlpd[4780]: device type not supported
Sep 29 12:32:59 mailserver tpvmlpd[4788]: device type not supported
Sep 29 12:33:14 mailserver tpvmlpd[4824]: device type not supported
Sep 29 12:33:29 mailserver tpvmlpd[4843]: device type not supported
Sep 29 12:33:44 mailserver tpvmlpd[4850]: device type not supported
Sep 29 12:33:59 mailserver tpvmlpd[4863]: device type not supported
Sep 29 12:34:14 mailserver tpvmlpd[4870]: device type not supported
Sep 29 12:34:29 mailserver tpvmlpd[4882]: device type not supported
Sep 29 12:34:41 mailserver kernel: SCSI device sdb: 629145600 512-byte hdwr sectors (322123 MB)
Sep 29 12:34:41 mailserver kernel: sdb: Write Protect is off
Sep 29 12:34:41 mailserver kernel: sdb: cache data unavailable
Sep 29 12:34:41 mailserver kernel: sdb: assuming drive cache: write through
Sep 29 12:34:41 mailserver kernel:  sdb: sdb1
Sep 29 12:34:43 mailserver kernel: SCSI device sdb: 629145600 512-byte hdwr sectors (322123 MB)
Sep 29 12:34:43 mailserver kernel: sdb: Write Protect is off
Sep 29 12:34:43 mailserver kernel: sdb: cache data unavailable
Sep 29 12:34:43 mailserver kernel: sdb: assuming drive cache: write through
Sep 29 12:34:43 mailserver kernel:  sdb: sdb1
Sep 29 12:34:44 mailserver tpvmlpd[4902]: device type not supported
Sep 29 12:34:44 mailserver tpvmlpd[3053]: aborting
Sep 29 12:40:52 mailserver kernel: kjournald starting.  Commit interval 5 seconds
Sep 29 12:40:52 mailserver kernel: EXT3 FS on dm-2, internal journal
Sep 29 12:40:52 mailserver kernel: EXT3-fs: mounted filesystem with ordered data mode.
Oct 16 01:17:54 mailserver kernel: INFO: task mysqld:1762 blocked for more than 120 seconds.
Oct 16 01:17:54 mailserver kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 16 01:17:54 mailserver kernel: mysqld        D 0005126C  2452  1762   3593          1952  1270 (NOTLB)
Oct 16 01:17:54 mailserver kernel:        f1a56ed0 00000082 f740862d 0005126c 00051269 0000000e 00000000 00000009
Oct 16 01:17:54 mailserver kernel:        f0ffeaa0 f7409972 0005126c 00001345 00000005 f0ffebac c4835054 f70c6e40
Oct 16 01:17:54 mailserver kernel:        f781c068 00000000 00000000 f1a56ecc c041eff8 00000000 f1a56ed8 f781c050
Oct 16 01:17:54 mailserver kernel: Call Trace:
Oct 16 01:17:54 mailserver kernel:  [<c041eff8>] __wake_up+0x2a/0x3d
Oct 16 01:17:54 mailserver kernel:  [<c043654b>] prepare_to_wait+0x24/0x46
Oct 16 01:17:54 mailserver kernel:  [<f88691da>] log_wait_commit+0x80/0xc7 [jbd]
Oct 16 01:17:54 mailserver kernel:  [<c04363ff>] autoremove_wake_function+0x0/0x2d
Oct 16 01:17:54 mailserver kernel:  [<f8864661>] journal_stop+0x195/0x1ba [jbd]
Oct 16 01:17:54 mailserver kernel:  [<c049325a>] __writeback_single_inode+0x1a3/0x2af
Oct 16 01:18:32 mailserver kernel:  [<c045b9c6>] do_writepages+0x2b/0x32
Oct 16 01:18:32 mailserver kernel:  [<c0457527>] __filemap_fdatawrite_range+0x66/0x72
Oct 16 01:18:32 mailserver kernel:  [<c04938f6>] sync_inode+0x19/0x24
Oct 16 01:18:32 mailserver kernel:  [<f88e8007>] ext3_sync_file+0xaf/0xc4 [ext3]
Oct 16 01:18:32 mailserver kernel:  [<c0476d63>] do_fsync+0x41/0x83
Oct 16 01:18:32 mailserver kernel:  [<c0476dc2>] __do_fsync+0x1d/0x2b
Oct 16 01:18:32 mailserver kernel:  [<c0404ead>] sysenter_past_esp+0x56/0x79
Oct 16 01:18:32 mailserver kernel:  =======================
Oct 18 17:37:27 mailserver kernel: mptscsih: ioc0: attempting task abort! (sc=f451ad40)
Oct 18 17:37:27 mailserver kernel: sd 0:0:0:0:
Oct 18 17:37:27 mailserver kernel:         command: Write(10): 2a 00 00 04 fc 7d 00 01 48 00
Oct 18 17:37:27 mailserver kernel: mptscsih: ioc0: task abort: SUCCESS (sc=f451ad40)
Oct 18 17:40:51 mailserver kernel: mptscsih: ioc0: attempting task abort! (sc=ce96b940)

   
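Author: humjb_1983    Time: 2013-11-06 17:02
Reply to 3# LinuxCaiB
From this output it still looks like a disk problem. Is your environment a virtual machine, and is the disk a virtual disk? If so, the problem is on the virtual machine side.
If it is a physical disk, you can check it with a disk testing tool.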

   
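Author: LinuxCaiB    Time: 2015-01-05 16:09
This post was last edited by LinuxCaiB on 2015-01-05 16:12

Reply to 4# humjb_1983
Problem solved. It was indeed on the virtual machine side: the fiber link was faulty. I forgot to reply to you, haha, and it has now spanned two years. Thanks a lot!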


   
Author: gaojl0728    Time: 2015-01-05 16:43
The ext3 implementation in Linux 2.6.18 has a bug that can cause a spinlock deadlock under certain conditions. I tracked down a similar problem last year.
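Author: humjb_1983    Time: 2015-01-05 16:57
LinuxCaiB posted on 2015-01-05 16:09
Reply to 4# humjb_1983
Problem solved. It was indeed on the virtual machine side: the fiber link was faulty. I forgot to reply to you, haha, and it has now spanned two years. Thanks, ...

Hehe, glad it's solved.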
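Author: humjb_1983    Time: 2015-01-05 16:58
gaojl0728 posted on 2015-01-05 16:43
The ext3 implementation in Linux 2.6.18 has a bug that can cause a spinlock deadlock under certain conditions. I tracked down a similar problem last year.

In that case the stack trace would look different, wouldn't it? Hehe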
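Author: gaojl0728    Time: 2015-01-05 17:07
Reply to 8# humjb_1983


Mostly different. In my case the kernel deadlocked and that triggered a soft lockup warning. Eight CPUs were acquiring two ext3 locks through different kernel paths, but in opposite order, so all eight CPUs locked up. The CPU timer interrupt then noticed that the soft lockup watchdog task had not been scheduled for a long time and issued the warning.

I see he has a hung task warning, which is quite possibly a similar problem, although that stack trace is clearly a bit garbled.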
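
The opposite-order locking described above is the classic AB/BA pattern. The sketch below is only a user-space analogy in Python, not kernel or ext3 code: two threads are handed the same two locks in opposite order, and the usual remedy (always acquiring in one global order, here sorted by `id`) keeps them from deadlocking. All names (`acquire_in_order`, `run_workers`, and so on) are invented for illustration.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()


def acquire_in_order(*locks):
    """Acquire locks in one global order, whatever order the caller passed."""
    ordered = sorted(locks, key=id)
    for lock in ordered:
        lock.acquire()
    return ordered


def release_all(ordered):
    """Release in reverse acquisition order."""
    for lock in reversed(ordered):
        lock.release()


def worker(first, second, counter, rounds):
    for _ in range(rounds):
        held = acquire_in_order(first, second)
        try:
            counter[0] += 1  # protected by both locks
        finally:
            release_all(held)


def run_workers(rounds=1000):
    counter = [0]
    threads = [
        # The two threads are given the locks in opposite order on purpose.
        threading.Thread(target=worker, args=(lock_a, lock_b, counter, rounds), daemon=True),
        threading.Thread(target=worker, args=(lock_b, lock_a, counter, rounds), daemon=True),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=30)
        if t.is_alive():
            raise RuntimeError("worker did not finish; possible deadlock")
    return counter[0]


if __name__ == "__main__":
    print(run_workers())
```

If `worker` instead took `first` and then `second` directly, the two threads could each hold one lock and wait forever for the other, which is the situation described for the eight CPUs.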
Author: humjb_1983    Time: 2015-01-05 17:28
gaojl0728 posted on 2015-01-05 17:07
Reply to 8# humjb_1983


Right, hehe. A spin_lock deadlock should be caught by softlockup or nmi_watchdog.
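Author: gaojl0728    Time: 2015-01-05 17:54
Reply to 10# humjb_1983


When a spinlock deadlocks, the CPU itself is stuck, so softlockup naturally detects it. If a mutex or semaphore deadlocks, the waiting tasks sleep and the CPUs keep scheduling, so softlockup does not help and you have to rely on the hung task detector.
The ext3 implementation in Linux 2.6.18 uses many coarse-grained locks, and with locking that broad it is hard to rule out deadlocks.
Later kernel versions split the coarse locks into finer-grained ones, which is noticeably better.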
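
As a rough user-space illustration of what the hung task detector looks at, tasks in uninterruptible sleep show state `D` in `/proc/<pid>/stat` (the `mysqld        D` line in the trace earlier in the thread is the same state). The sketch below is an assumption-laden approximation: it takes a single snapshot and only lists process leaders, whereas the kernel detector checks every task and only warns when one stays blocked longer than `hung_task_timeout_secs`. `parse_stat` and `uninterruptible_tasks` are illustrative names.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split around the LAST ')' instead of splitting on whitespace.
    """
    left, _, right = stat_line.rpartition(")")
    pid_str, _, comm = left.partition("(")
    state = right.split()[0]
    return int(pid_str), comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for processes currently in state D (Linux only)."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while we were scanning, or odd content
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

A task that shows up here once is not necessarily hung, since short D-state waits on disk I/O are normal. One that stays in the list across repeated runs for minutes is the kind of task the 120-second warning in the log is about.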
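Author: LinuxCaiB    Time: 2015-01-05 20:26
Reply to 11# gaojl0728

Some of our servers are still on 2.6 kernels; the newer ones have all moved to 3.16. You are all experts, being able to read kernel stack traces... orz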


   



