#1, posted 2005-11-27 12:54

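My machine dual-boots XP and RHAS4. I only installed RHAS4 yesterday, but since yesterday afternoon I have been receiving system mail warning me of a SMART error on the hard disk. The disk is a Seagate, and Seagate's own diagnostic tool on XP found no problems. On Linux I also checked it with smartctl -a /dev/hda, and the self-test item shows PASSED as well. Could someone experienced please take a look?
System mail: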
Message 12:
From root@localhost.localdomain  Sun Nov 27 11:14:48 2005
Date: Sun, 27 Nov 2005 11:14:44 +0800
From: root <root@localhost.localdomain>
To: root@localhost.localdomain
Subject: SMART error (CurrentPendingSector) detected on host: localhost.localdomain

This email was generated by the smartd daemon running on:

host name: localhost.localdomain
  DNS domain: localdomain
  NIS domain: (none)

The following warning/error was logged by the smartd daemon:

Device: /dev/hda, 5 Currently unreadable (pending) sectors

For details see host's SYSLOG (default: /var/log/messages).

You can also use the smartctl utility for further investigation.
No additional email messages about this problem will be sent.

#2, epingnet, posted 2005-11-27 12:58

smartctl output:
[root@localhost ~]# smartctl -a /dev/hda
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model:    ST380021A
Serial Number: 3HV1KTZR
Firmware Version: 3.19
User Capacity: 80,026,361,856 bytes
Device is:       In smartctl database [for details use: -P show]
ATA Version is: 5
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is: Sun Nov 27 12:33:38 2005 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                    was completed without error.
                                    Auto Offline Data Collection: Enabled.
Self-test execution status:    ( 0) The previous self-test routine completed
                                    without error or no self-test has ever
                                    been run.
Total time to complete Offline
data collection:                ( 422) seconds.
Offline data collection
capabilities:                   (0x1b) SMART execute Offline immediate.
                                    Auto Offline data collection on/off support.
                                    Suspend Offline collection upon new
                                    command.
                                    Offline surface scan supported.
                                    Self-test supported.
                                    No Conveyance Self-test supported.
                                    No Selective Self-test supported.
SMART capabilities:          (0x0003) Saves SMART data before entering
                                    power-saving mode.
                                    Supports SMART auto save timer.
Error logging capability:       (0x01) Error logging supported.
                                    No General Purpose Logging support.
Short self-test routine
recommended polling time:       ( 1) minutes.
Extended self-test routine
recommended polling time:       (  57) minutes.
SMART Attributes Data Structure revision number: 10

SMART Error Log Version: 1
ATA Error Count: 679 (device log contains only the most recent five errors)
      CR = Command Register [HEX]
      FR = Features Register [HEX]
      SC = Sector Count Register [HEX]
      SN = Sector Number Register [HEX]
      CL = Cylinder Low Register [HEX]
      CH = Cylinder High Register [HEX]
      DH = Device/Head Register [HEX]
      DC = Device Command Register [HEX]
      ER = Error register [HEX]
      ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 679 occurred at disk power-on lifetime: 6961 hours (290 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 e1 61 d6 e0  Error: UNC at LBA = 0x00d661e1 = 14049761

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 40 c0 61 d6 e0 00    02:02:22.531  READ DMA
  c6 00 10 00 00 00 e8 00    02:02:22.531  SET MULTIPLE MODE
  91 00 3f 00 00 00 af 00    02:02:22.531  INITIALIZE DEVICE PARAMETERS [OBS-6]
  10 00 00 00 00 00 a0 00    02:02:22.531  RECALIBRATE [OBS-4]
  00 00 00 00 00 00 00 04    02:02:22.513  NOP [Abort queued commands]

Error 678 occurred at disk power-on lifetime: 6961 hours (290 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 e1 61 d6 e0  Error: UNC at LBA = 0x00d661e1 = 14049761
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 40 c0 61 d6 e0 00    02:02:18.950  READ DMA
  c6 00 10 00 00 00 e7 00    02:02:18.950  SET MULTIPLE MODE
  91 00 3f 00 00 00 af 00    02:02:18.950  INITIALIZE DEVICE PARAMETERS [OBS-6]
  10 00 00 00 00 00 a0 00    02:02:18.950  RECALIBRATE [OBS-4]
  00 00 00 00 00 00 00 04    02:02:18.936  NOP [Abort queued commands]
Error 677 occurred at disk power-on lifetime: 6961 hours (290 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 e1 61 d6 e0  Error: UNC at LBA = 0x00d661e1 = 14049761
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 40 c0 61 d6 e0 00    02:02:15.289  READ DMA
  c8 00 21 9f 61 d6 e0 00    02:02:15.288  READ DMA
  c8 00 40 5f 61 d6 e0 00    02:02:15.288  READ DMA
  c8 00 40 1f 61 d6 e0 00    02:02:15.287  READ DMA
  c8 00 40 df 60 d6 e0 00    02:02:15.285  READ DMA
Error 676 occurred at disk power-on lifetime: 6960 hours (290 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 63 bd 5e e1  Error: UNC at LBA = 0x015ebd63 = 22986083
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  40 00 01 63 bd 5e e1 00    01:12:37.218  READ VERIFY SECTOR(S)
  c8 00 01 a5 69 2e e2 00    01:12:37.208  READ DMA
  40 00 01 62 bd 5e e1 00    01:12:37.194  READ VERIFY SECTOR(S)
  c8 00 01 4d 69 43 e6 00    01:12:37.186  READ DMA
  40 00 02 62 bd 5e e1 00    01:12:34.009  READ VERIFY SECTOR(S)

Error 675 occurred at disk power-on lifetime: 6960 hours (290 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 63 bd 5e e1  Error: UNC at LBA = 0x015ebd63 = 22986083

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  40 00 02 62 bd 5e e1 00    01:12:34.009  READ VERIFY SECTOR(S)
  c8 00 01 9a 2c 36 e3 00    01:12:33.992  READ DMA
  40 00 02 60 bd 5e e1 00    01:12:33.968  READ VERIFY SECTOR(S)
  c8 00 01 66 69 2e e2 00    01:12:33.948  READ DMA
  ca 00 08 9f d2 ea e0 00    01:12:33.948  WRITE DMA

SMART Self-test log structure revision number 1
Num  Test_Description Status                Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline Completed without error    00%    8551       -
# 2  Short offline    Completed without error    00%    8131       -
# 3  Short offline    Completed without error    00%    7555       -
# 4  Short offline    Completed without error    00%    7401       -
# 5  Short offline    Completed without error    00%    7400       -
# 6  Short offline    Completed without error    00%    7221       -
# 7  Short offline    Completed without error    00%    7032       -
# 8  Short offline    Completed without error    00%    7019       -
# 9  Extended offline Completed without error    00%    7019       -
#10  Short offline    Completed without error    00%    7018       -
#11  Short offline    Completed: read failure    90%    6981       14049760
#12  Short offline    Completed: read failure    90%    6981       14049760
#13  Short offline    Completed: read failure    90%    6967       14049760
#14  Short offline    Completed: read failure    90%    6966       14049760
#15  Short offline    Completed: read failure    90%    6965       14049760
#16  Extended offline Completed: read failure    90%    6965       14049760
#17  Short offline    Completed: read failure    90%    6965       14049760
#18  Short offline    Completed: read failure    90%    6963       14049760
#19  Short offline    Completed: read failure    90%    6963       14049760
#20  Short offline    Completed: read failure    90%    6963       14049760
#21  Short offline    Completed: read failure    90%    6963       14049760
Device does not support Selective Self Tests/Logging

#3, epingnet, posted 2005-12-01 10:42

I received this email again today, and I found a post online describing the same situation; as it happens, I am running RHAS4 too. Is the disk really about to die?
[CentOS] SMART error (OfflineUncorrectableSector) detected on host: SMART error (CurrentPendingSector) detected on host
Peter Arremann loony at loonybin.org
Thu Sep 15 21:26:09 UTC 2005

On Thursday 15 September 2005 05:20, Abilash Praveen M wrote:
> I tried to smartctl.. but it won't let me in to the mounted partition.
> Please help! Is this a very serious issue? It was a very new HDD and I
> wonder how I got the uncorrectable sectors. Should I have to replace my
> HDD?
Yes, get your disk replaced. On a new disk you should not have issues like
this.
Especially if the grown defect list is huge your disk is gonna fail sometime
really soon.

You can try run the tool from the disk manufacturer (some of them require you
to do so before they will issue you an RMA number) - if you don't have that
floppy or don't have a floppy drive, download the ultimatebootcd.com cd image
and it should be on there. That tool might also (depending on your drive
manufacturer) have some mechanism for saving your data or remapping the
failed sectors so you can rescue your data from Linux.

Peter.

#4, epingnet, posted 2005-12-01 11:33

[root@localhost ~]# smartctl -l selftest /dev/hda
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description Status                Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline Completed without error    00%    8551       -
# 2  Short offline    Completed without error    00%    8131       -
# 3  Short offline    Completed without error    00%    7555       -
# 4  Short offline    Completed without error    00%    7401       -
# 5  Short offline    Completed without error    00%    7400       -
# 6  Short offline    Completed without error    00%    7221       -
# 7  Short offline    Completed without error    00%    7032       -
# 8  Short offline    Completed without error    00%    7019       -
# 9  Extended offline Completed without error    00%    7019       -
#10  Short offline    Completed without error    00%    7018       -
#11  Short offline    Completed: read failure    90%    6981       14049760
#12  Short offline    Completed: read failure    90%    6981       14049760
#13  Short offline    Completed: read failure    90%    6967       14049760
#14  Short offline    Completed: read failure    90%    6966       14049760
#15  Short offline    Completed: read failure    90%    6965       14049760
#16  Extended offline Completed: read failure    90%    6965       14049760
#17  Short offline    Completed: read failure    90%    6965       14049760
#18  Short offline    Completed: read failure    90%    6963       14049760
#19  Short offline    Completed: read failure    90%    6963       14049760
#20  Short offline    Completed: read failure    90%    6963       14049760
#21  Short offline    Completed: read failure    90%    6963       14049760

[root@localhost ~]# fdisk -lu /dev/hda

Disk /dev/hda: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders, total 156301488 sectors
Units = sectors of 1 * 512 = 512 bytes

Device    Boot     Start        End     Blocks   Id  System
/dev/hda1 *           63   15631244    7815591    7  HPFS/NTFS
/dev/hda2       15631245  114768359   49568557+   f  W95 Ext'd (LBA)
/dev/hda3      114768360  152103419   18667530   83  Linux
/dev/hda4      152103420  156296384    2096482+  82  Linux swap
/dev/hda5       15631308   40210694   12289693+   7  HPFS/NTFS
/dev/hda6       40210758   64790144   12289693+   7  HPFS/NTFS
/dev/hda7       64790208  114768359   24989076    7  HPFS/NTFS

Do the results of the two commands above mean that sector 14049760, in my first partition /dev/hda1, fails to read under Linux?
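For reference, a minimal sketch of how that lookup could be done by hand. It assumes the Start and End columns of the fdisk -lu output above are copied into a here-document; the function names partitions and find_partition are made up for illustration and are not part of fdisk or smartmontools. It prints every partition whose sector range contains a given LBA, so a logical partition is reported together with the extended partition that contains it.

```shell
#!/bin/sh
# Start and End sectors copied from the fdisk -lu output in this post.
partitions() {
cat <<'EOF'
/dev/hda1 63 15631244
/dev/hda2 15631245 114768359
/dev/hda3 114768360 152103419
/dev/hda4 152103420 156296384
/dev/hda5 15631308 40210694
/dev/hda6 40210758 64790144
/dev/hda7 64790208 114768359
EOF
}

# find_partition LBA: print each partition whose [Start, End] range
# contains LBA (hypothetical helper, for illustration only).
find_partition() {
    partitions | while read -r dev start end; do
        if [ "$1" -ge "$start" ] && [ "$1" -le "$end" ]; then
            echo "$dev"
        fi
    done
}

find_partition 14049760   # LBA from the self-test log
find_partition 22986083   # LBA from errors 675 and 676 in the error log
```

With these numbers, 14049760 lies between 63 and 15631244, so it does appear to fall inside /dev/hda1 (the first NTFS partition). The other LBA in the error log, 22986083, falls inside /dev/hda5, which sits within the extended partition /dev/hda2.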

[ Last edited by epingnet on 2005-12-1 11:36 ]

#5, epingnet, posted 2005-12-01 11:55

Here is an excerpt from the BadBlockHowTo. It is beyond my current skill, so I will keep studying it.
THIS DOCUMENT SHOWS HOW TO IDENTIFY THE FILE ASSOCIATED WITH AN
UNREADABLE DISK SECTOR, AND HOW TO FORCE THAT SECTOR TO REALLOCATE.

Assumptions: Linux OS, ext2 or ext3 file system.

Bruce Allen <smartmontools-support@lists.sourceforge.net>

Thanks to Sergey Vlasov, Theodore Ts'o, Michael Bendzick, and others
for explaining this to me. I would like to add text showing how to do
this for other file systems, in particular ReiserFS, XFS, and JFS:
please email me if you can provide this information.

NOTE: Starting with GNU coreutils release 5.3.0, dd on Linux includes
options 'iflag=direct' and 'oflag=direct'.  Using these with the dd commands
below should be helpful, because adding these flags should avoid any interaction
with the block buffering IO layer in Linux and permit direct reads/writes
from the raw device.  Use 'dd --help' to see if your version of dd supports
these options. If not, build the latest code from
ftp://alpha.gnu.org/gnu/coreutils.

In this example, the disk is failing self-tests at Logical Block
Address LBA = 0x016561e9 = 23421417.  The LBA counts sectors in units
of 512 bytes, and starts at zero.
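The hexadecimal and decimal forms are the same number. As a minimal sketch, the conversion can be checked with shell arithmetic (the variable names are arbitrary, and the byte-offset line assumes the shell uses 64-bit integers):

```shell
#!/bin/sh
# Convert the hexadecimal LBA reported by smartctl to decimal.
lba_hex=0x016561e9
lba_dec=$(($lba_hex))
echo "$lba_dec"            # prints 23421417

# Each LBA addresses one 512-byte sector, counted from zero, so the byte
# offset of this sector from the start of the disk is:
echo $((lba_dec * 512))
```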

-----------------------------------------------------------------------------------------------
root]# smartctl -l selftest /dev/hda

SMART Self-test log structure revision number 1
Num  Test_Description Status                Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline Completed: read failure    90%    217       0x016561e9
-----------------------------------------------------------------------------------------------

Note that other signs that there is a bad sector on the disk can be
found in the non-zero value of the Current Pending Sector count:
-----------------------------------------------------------------------------------------------
root]# smartctl -A /dev/hda
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       1
-----------------------------------------------------------------------------------------------

First Step: We need to locate the partition on which this sector of
the disk lives:
-----------------------------------------------------------------------------------------------
root]# fdisk -lu /dev/hda

Disk /dev/hda: 123.5 GB, 123522416640 bytes
255 heads, 63 sectors/track, 15017 cylinders, total 241254720 sectors
Units = sectors of 1 * 512 = 512 bytes

Device    Boot     Start        End     Blocks   Id  System
/dev/hda1 *           63    4209029    2104483+  83  Linux
/dev/hda2        4209030    5269319     530145   82  Linux swap
/dev/hda3        5269320  238227884  116479282+  83  Linux
/dev/hda4      238227885  241248104    1510110   83  Linux
-----------------------------------------------------------------------------------------------

The partition /dev/hda3 starts at LBA 5269320 and extends past the
'problem' LBA.  The 'problem' LBA is offset 23421417 - 5269320 =
18152097 sectors into the partition /dev/hda3.
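The subtraction above can be sketched as a small shell helper that also checks that the LBA really lies inside the partition. The name partition_offset is hypothetical; the START and END arguments are assumed to come from the fdisk -lu listing.

```shell
#!/bin/sh
# partition_offset LBA START END: print the offset of LBA, in sectors, from
# the start of the partition, or fail if LBA lies outside [START, END].
# (Hypothetical helper, for illustration only.)
partition_offset() {
    if [ "$1" -lt "$2" ] || [ "$1" -gt "$3" ]; then
        echo "LBA $1 is outside $2..$3" >&2
        return 1
    fi
    echo $(($1 - $2))
}

partition_offset 23421417 5269320 238227884   # /dev/hda3: prints 18152097
```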

To verify the type of the file system and the mount point, look in
/etc/fstab:
-----------------------------------------------------------------------------------------------
root]# grep hda3 /etc/fstab
/dev/hda3 /data ext2 defaults 1 2
-----------------------------------------------------------------------------------------------
You can see that this is an ext2 file system, mounted at /data.

Second Step: we need to find the blocksize of the file system
(normally 4096 bytes for ext2):
-----------------------------------------------------------------------------------------------
root]# tune2fs -l /dev/hda3 | grep Block
Block count:             29119820
Block size:             4096
-----------------------------------------------------------------------------------------------
In this case the block size is 4096 bytes.

Third Step: we need to determine which File System Block contains this
LBA.  The formula is:
  b = (int)((L-S)*512/B)
where:
b = File System block number
B = File system block size in bytes
L = LBA of bad sector
S = Starting sector of partition as shown by fdisk -lu
and (int) denotes the integer part.

In our example, L=23421417, S=5269320, and B=4096.  Hence the
'problem' LBA is in block number
b = (int)(18152097*512/4096) = (int)2269012.125
so b=2269012.

Note: the fractional part of 0.125 indicates that this problem LBA is
actually the second of the eight sectors that make up this file system
block.
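As a minimal sketch, the formula and the "second of eight sectors" remark can be reproduced with shell integer arithmetic. The function names fs_block and sector_in_block are made up for this example, and the multiplication assumes the shell uses 64-bit integers.

```shell
#!/bin/sh
# fs_block L S B: file system block number b = (int)((L-S)*512/B).
# Shell arithmetic is integer-only, so the division already truncates.
fs_block() {
    echo $(( ($1 - $2) * 512 / $3 ))
}

# sector_in_block L S B: zero-based index of the sector within its block.
sector_in_block() {
    echo $(( ($1 - $2) % ($3 / 512) ))
}

fs_block 23421417 5269320 4096          # prints 2269012
sector_in_block 23421417 5269320 4096   # prints 1, i.e. the second of eight sectors
```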

Fourth Step: we use debugfs to locate the inode that owns this block,
and the file that the inode belongs to:
-----------------------------------------------------------------------------------------------
root]# debugfs
debugfs 1.32 (09-Nov-2002)
debugfs:  open /dev/hda3
debugfs:  icheck 2269012
Block Inode number
2269012 41032
debugfs:  ncheck 41032
Inode Pathname
41032 /S1/R/H/714197568-714203359/H-R-714202192-16.gwf
-----------------------------------------------------------------------------------------------

In this example, you can see that the problematic file (with the mount
point included in the path) is:
/data/S1/R/H/714197568-714203359/H-R-714202192-16.gwf

To force the disk to reallocate this bad block we'll write zeros to
the bad block, and sync the disk:
-----------------------------------------------------------------------------------------------
root]# dd if=/dev/zero of=/dev/hda3 bs=4096 count=1 seek=2269012
root]# sync
-----------------------------------------------------------------------------------------------

NOTE: THIS LAST STEP HAS PERMANENTLY AND IRRETRIEVABLY DESTROYED SOME
OF THE DATA THAT WAS IN THIS FILE.  DON'T DO THIS UNLESS YOU DON'T
NEED THE FILE OR YOU CAN REPLACE IT WITH A FRESH OR CORRECT VERSION.

Now everything is back to normal: the sector has been reallocated.
Compare the output just below to similar output near the top of this
article:
-----------------------------------------------------------------------------------------------
root]# smartctl -A /dev/hda
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       1
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       1
-----------------------------------------------------------------------------------------------

Note: for some disks it may be necessary to update the SMART Attribute values by using
smartctl -t offline /dev/hda

The disk now passes its self-tests again:

-----------------------------------------------------------------------------------------------

#6, epingnet, posted 2005-12-01 11:56

root]# smartctl -t long /dev/hda  [wait until test completes, then]
root]# smartctl -l selftest /dev/hda

SMART Self-test log structure revision number 1
Num  Test_Description Status                Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline Completed without error    00%    239       -
# 2  Extended offline Completed: read failure    90%    217       0x016561e9
# 3  Extended offline Completed: read failure    90%    212       0x016561e9
# 4  Extended offline Completed: read failure    90%    181       0x016561e9
# 5  Extended offline Completed without error    00%       14       -
# 6  Extended offline Completed without error    00%       4       -
-----------------------------------------------------------------------------------------------

and no longer shows any offline uncorrectable sectors:

-----------------------------------------------------------------------------------------------
root]# smartctl -A /dev/hda
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       1
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
-----------------------------------------------------------------------------------------------

A SECOND EXAMPLE

On this drive, the first sign of trouble was this email from smartd:

To: ballen
Subject: SMART error (selftest) detected on host: medusa-slave166.medusa.phys.uwm.edu

This email was generated by the smartd daemon running on host:
medusa-slave166.medusa.phys.uwm.edu in the domain: master001-nis

The following warning/error was logged by the smartd daemon:
Device: /dev/hda, Self-Test Log error count increased from 0 to 1

Running smartctl -a /dev/hda confirmed the problem:

Num  Test_Description Status                Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline Completed: read failure    80%    682       0x021d9f44

Note that the failing LBA reported is 0x021d9f44 (base 16) = 35495748 (base 10)

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       3

and one can see above that there are 3 sectors on the list of pending
sectors that the disk can't read but would like to reallocate.

The device also shows errors in the SMART error log:

Error 212 occurred at disk power-on lifetime: 690 hours
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 12 46 9f 1d e2  Error: UNC 18 sectors at LBA = 0x021d9f46 = 35495750

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- -- ---------  --------------------
  25 00 12 46 9f 1d e0 00 2485545.000  READ DMA EXT
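The LBA printed next to the register dump can be reassembled from the registers themselves. The sketch below assumes the 28-bit LBA layout, in which SN, CL and CH hold bits 0-23 and the low four bits of DH hold bits 24-27; this matches the value smartctl prints for this error. The function name lba28 is made up for the example.

```shell
#!/bin/sh
# lba28 SN CL CH DH: rebuild a 28-bit LBA from the hexadecimal register
# values shown in the SMART error log (hypothetical helper).
lba28() {
    echo $(( ((0x$4 & 0x0f) << 24) | (0x$3 << 16) | (0x$2 << 8) | 0x$1 ))
}

lba28 46 9f 1d e2   # prints 35495750, i.e. 0x021d9f46
```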

Signs of trouble at this LBA may also be found in SYSLOG:

[root]# grep LBA /var/log/messages | awk '{print $12}' | sort | uniq
LBAsect=35495748
LBAsect=35495750

So I decide to do a quick check to see how many bad sectors there
really are. Using the bash shell I check 70 sectors around the trouble
area:

[root]# export i=35495730
[root]# while [ $i -lt 35495800 ]
      > do echo $i
      > dd if=/dev/hda of=/dev/null bs=512 count=1 skip=$i
      > let i+=1
      > done

<SNIP>

35495734
1+0 records in
1+0 records out
35495735
dd: reading `/dev/hda': Input/output error
0+0 records in
0+0 records out

<SNIP>

35495751
dd: reading `/dev/hda': Input/output error
0+0 records in
0+0 records out
35495752
1+0 records in
1+0 records out

<SNIP>

which shows that the seventeen sectors 35495735-35495751 (inclusive)
are not readable.

Next, we identify the files at those locations.  The partitioning
information on this disk is identical to the first example above, and
as in that case the problem sectors are on the third partition
/dev/hda3.  So we have:
   L=35495735 to 35495751
   S=5269320
   B=4096
so that b=3778301 to 3778303 are the three bad blocks in the file
system.
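A minimal sketch of that calculation, applying the block formula to the first and last unreadable sector. The helper name fs_block is made up for the example, and 64-bit shell arithmetic is assumed.

```shell
#!/bin/sh
# File system block number: b = (int)((L-S)*512/B).
fs_block() {
    echo $(( ($1 - $2) * 512 / $3 ))
}

S=5269320   # start sector of /dev/hda3
B=4096      # file system block size in bytes
first=$(fs_block 35495735 "$S" "$B")
last=$(fs_block 35495751 "$S" "$B")
echo "$first $last"            # prints: 3778301 3778303
echo $((last - first + 1))     # prints: 3
```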

[root]# debugfs
debugfs 1.32 (09-Nov-2002)
debugfs:  open /dev/hda3
debugfs:  icheck 3778301
Block Inode number
3778301 45192
debugfs:  icheck 3778302
Block Inode number
3778302 45192
debugfs:  icheck 3778303
Block Inode number
3778303 45192
debugfs:  ncheck 45192
Inode Pathname
45192 /S1/R/H/714979488-714985279/H-R-714979984-16.gwf
debugfs:  quit

And finally, just to confirm that this is really the damaged file:

[root]# md5sum /data/S1/R/H/714979488-714985279/H-R-714979984-16.gwf
md5sum: /data/S1/R/H/714979488-714985279/H-R-714979984-16.gwf: Input/output error

Finally we force the disk to reallocate the three bad blocks:
[root]# dd if=/dev/zero of=/dev/hda3 bs=4096 count=3 seek=3778301
[root]# sync

We could also probably use:
[root]# dd if=/dev/zero of=/dev/hda bs=512 count=17 seek=35495735
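The two dd commands do not overwrite exactly the same sectors. The sketch below only does the arithmetic, mapping each command's seek and count back to whole-disk LBAs; it does not touch any device, and the variable names are arbitrary.

```shell
#!/bin/sh
S=5269320          # start sector of /dev/hda3
B=4096             # file system block size in bytes
spb=$((B / 512))   # sectors per file system block

# First and last LBA written by:
#   dd if=/dev/zero of=/dev/hda3 bs=4096 count=3 seek=3778301
blk_first=$((S + 3778301 * spb))
blk_last=$((blk_first + 3 * spb - 1))
echo "$blk_first $blk_last"   # prints: 35495728 35495751

# First and last LBA written by:
#   dd if=/dev/zero of=/dev/hda bs=512 count=17 seek=35495735
sec_first=35495735
sec_last=$((sec_first + 17 - 1))
echo "$sec_first $sec_last"   # prints: 35495735 35495751
```

Under these assumptions the block-level command zeroes 24 sectors, including the seven readable sectors 35495728-35495734 at the start of the first block, while the sector-level command touches only the seventeen unreadable ones.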

At this point we now have:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0

which is encouraging, since the pending sectors count is now zero.
Note that the drive reallocation count has not yet increased: the
drive may now have confidence in these sectors and have decided not to
reallocate them.

A device self test:
  [root]# smartctl -t long /dev/hda
(then wait about an hour) shows no unreadable sectors or errors:

Num  Test_Description Status                Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline Completed without error    00%    692       -
# 2  Extended offline Completed: read failure    80%    682       0x021d9f44

[USEFUL HINTS ADDED BY OTHERS]

---------------------------------------------------------------------------

From: Kay Diederichs

I read your badblocks-howto at
http://smartmontools.sourceforge.net/BadBlockHowTo.txt and greatly
benefitted from it. One thing that's (maybe) missing is that often the
"smartctl -t long" scan finds a bad sector which is _not_ assigned to
any file. In that case it does not help to run debugfs, or rather
debugfs reports the fact that no file owns that sector. Furthermore,
it is somewhat laborious to come up with the correct numbers for
debugfs, and debugfs is slow ...

So what I suggest in the case of presence of
Current_Pending_Sector/Offline_Uncorrectable errors is to create a
huge file on that filesystem.
  dd if=/dev/zero of=/some/mount/point bs=4k
creates the file. Leave it running until the partition/filesystem is
full. This will make the disk reallocate those sectors which do not
belong to a file. Check the "smartctl -a" output after that and make
sure that the sectors are reallocated. If any remain, use the debugfs
method.  Of course the usual caveats apply - back it up first, and so
on.

---------------------------------------------------------------------------

This document is version $Id: BadBlockHowTo.txt,v 1.7 2005/04/26 16:56:19 ballen4705 Exp $
It is Copyright Bruce Allen (2004) and distributed under GPL2.

[Clustering and High Availability] Received a system message on RHAS4 reporting a SMART error!