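A disk failed in a running ZFS raidz2 pool, and the whole machine went down...
FreeBSD 10.2, raidz2 built from six 300 GB SAS disks
LSI 9211-8i card
In the afternoon the machine became unreachable over ssh: password authentication succeeded, but the cursor just blinked and the post-login screen never appeared.
The website still loaded and sftp still worked, but pureftp did not.
By early evening the website had stopped loading too.
I plugged a monitor in at the data center and saw the following errors: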
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): WRITE(10). CDB: 2a 00 04 de e8 c0 00 00 40 00
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): CAM status: SCSI Status Error
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): SCSI status: Check Condition
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): SCSI sense: HARDWARE FAILURE asc:3,0 (Peripheral device write fault)
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): Info: 0x4dee8f7
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): Field Replaceable Unit: 8
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): Actual Retry Count: 24
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): Retrying command (per sense data)
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): WRITE(10). CDB: 2a 00 04 de e9 00 00 00 40 00
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): CAM status: SCSI Status Error
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): SCSI status: Check Condition
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): SCSI sense: HARDWARE FAILURE asc:3,0 (Peripheral device write fault)
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): Info: 0x4dee900
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): Field Replaceable Unit: 8
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): Actual Retry Count: 24
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): Retrying command (per sense data)
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): WRITE(10). CDB: 2a 00 04 de e9 40 00 00 40 00
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): CAM status: SCSI Status Error
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): SCSI status: Check Condition
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): SCSI sense: HARDWARE FAILURE asc:3,0 (Peripheral device write fault)
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): Info: 0x4dee940
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): Field Replaceable Unit: 8
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): Actual Retry Count: 24
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): Retrying command (per sense data)
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): WRITE(10). CDB: 2a 00 04 de e2 78 00 00 40 00
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): CAM status: SCSI Status Error
Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): SCSI status: Check Condition
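For anyone reading the log above: the CDB bytes can be decoded to see which blocks the failing writes were aimed at. Below is a minimal Python sketch, assuming the standard 10-byte READ(10)/WRITE(10) layout (opcode in byte 0, big-endian LBA in bytes 2-5, transfer length in bytes 7-8). The names `decode_rw10` and `summarize` are made up for illustration and are not part of any FreeBSD tool.

```python
import re
from collections import Counter

# Matches lines such as "(da3:mps0:0:8:0): WRITE(10). CDB: 2a 00 04 de ...".
# Only 10-byte read/write commands are considered in this sketch.
CDB_LINE = re.compile(
    r"\((?P<dev>\w+):\w+:\d+:\d+:\d+\): "
    r"(?P<op>(?:READ|WRITE)\(10\))\. CDB: (?P<cdb>[0-9a-f ]+)"
)


def decode_rw10(cdb_hex):
    """Decode a 10-byte READ(10)/WRITE(10) CDB into (opcode, lba, blocks)."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return cdb[0], lba, blocks


def summarize(log_text):
    """Count failed commands per device and list the decoded block ranges."""
    per_device = Counter()
    ranges = []
    for match in CDB_LINE.finditer(log_text):
        _, lba, blocks = decode_rw10(match.group("cdb"))
        per_device[match.group("dev")] += 1
        ranges.append((match.group("dev"), match.group("op"), lba, blocks))
    return per_device, ranges


if __name__ == "__main__":
    line = ("Nov 26 13:23:58 server kernel: (da3:mps0:0:8:0): "
            "WRITE(10). CDB: 2a 00 04 de e8 c0 00 00 40 00\n")
    counts, found = summarize(line)
    for dev, op, lba, blocks in found:
        print(f"{dev} {op} lba=0x{lba:x} blocks={blocks}")
```

Fed the first WRITE(10) line above, this gives LBA 0x4dee8c0 and a length of 64 blocks, so the reported Info value 0x4dee8f7 falls inside that request (0x37 blocks in). Every failing command in the excerpt is on da3.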
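I could not log in as root on the console.
After a forced reboot, the same errors kept scrolling and the system would not boot.
In the end I hot-pulled da3, and the system booted and came up normally.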
root@server:/var/log # zpool status
pool: zroot
state: DEGRADED
status: One or more devices has been removed by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: scrub repaired 0 in 0h1m with 0 errors on Sat Oct 24 05:52:02 2015
config:
NAME                     STATE     READ WRITE CKSUM
zroot                    DEGRADED     0     0     0
  raidz2-0               DEGRADED     0     0     0
    da0p3                ONLINE       0     0     0
    da1p3                ONLINE       0     0     0
    da2p3                ONLINE       0     0     0
    6636317080889922059  REMOVED      0     0     0  was /dev/da3p3
    da4p3                ONLINE       0     0     0
    da5p3                ONLINE       0     0     0
errors: No known data errors
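A status dump like the one above can also be checked by a script instead of by eye. Here is a rough Python sketch that pulls every non-ONLINE row out of the config table. It assumes the columns are whitespace-separated, as `zpool status` normally prints them, and the helper name `unhealthy_vdevs` is hypothetical.

```python
def unhealthy_vdevs(status_text):
    """Return (name, state, note) for each config row whose state is not ONLINE."""
    rows = []
    in_config = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("config:"):
            in_config = True          # the device table starts after this line
            continue
        if stripped.startswith("errors:"):
            break                     # the table ends at the "errors:" line
        if not in_config or not stripped or stripped.startswith("NAME"):
            continue
        fields = stripped.split()
        if len(fields) < 5:
            continue                  # not a NAME STATE READ WRITE CKSUM row
        name, state = fields[0], fields[1]
        note = " ".join(fields[5:])   # trailing text such as "was /dev/da3p3"
        if state != "ONLINE":
            rows.append((name, state, note))
    return rows


if __name__ == "__main__":
    sample = """\
config:
NAME STATE READ WRITE CKSUM
zroot DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
da0p3 ONLINE 0 0 0
6636317080889922059 REMOVED 0 0 0 was /dev/da3p3
errors: No known data errors
"""
    for name, state, note in unhealthy_vdevs(sample):
        print(name, state, note)
```

On the output above it would report zroot, raidz2-0 and the removed member 6636317080889922059 (formerly /dev/da3p3), while the five ONLINE disks are skipped.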
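I don't know which part caused this endless loop that took the system down...
Could that one bad disk have dragged the whole RAID card down?
:lol Why run software RAID when you have a RAID card? Nothing better to do?
lsstarboy wrote on 2015-11-27 07:57:
Could that one bad disk have dragged the whole RAID card down?
When I plugged in the monitor, that error message was displayed continuously and scrolled by very fast; I suspect that is what hung the system.
I have been wondering whether there is another possibility.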
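This is an old machine with six SAS bays in total.
The da3 I pulled was in bay 5, that is, the last bay, and after pulling it the machine booted fine.
Later I shut the machine down, reseated the disk in bay 5 several times, and powered on again. To my surprise it booted, and zpool status showed everything normal. A scrub resynced a small amount of data with no errors at all. The machine has run without any problem since.
So here is one possibility I am considering:
Bays 4 and 5 originally had no disks in them. The connectors may have oxidized, or something else caused poor contact. So, going by the error messages, the hot-plug device was failing, the I/O never succeeded and kept being retried (the log shows WRITE(10) commands being retried), and that dragged the machine down.
After I reseated the disk a few times the contact was good again and the problem went away.
But then why did the pool not simply drop to degraded mode when a hot-plug device hit I/O errors? Why did it keep retrying until the machine went down???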
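rtm009 wrote on 2015-11-27 09:12:
Why run software RAID when you have a RAID card? Nothing better to do?
With RAID cards, not only have many friends of mine hit errors and been unable to get their data back, I have run into problems myself.
Besides, most of what ZFS offers in features, stability and flexibility is something a RAID card will never provide.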