
ChinaUnix.net


[RAID and Disk Arrays] Does anyone have access to read this EMC patch note? Please paste it here, thanks!

Posted 2018-01-30 15:22
Does anyone have access to read this EMC patch note? Please paste it here, thanks!

https://support.emc.com/kb/23000

Posted 2018-04-24 21:07
Interpreting uncorrectable data and parity errors on a CLARiiON or VNX array.

Data sector invalidated error reported by Background Verify may require unbinding and rebinding LUN.

Event log message like the following:
CLARiiON Data Sector Invalidated, eventcode 840 from abc.xy.com, Agent: nmcl11x0988a, SCSIName: K10, Array: APM00012300456

Dial-homes like the following:  

Time Stamp 01/18/03 17:22:59 Event Number 840 Severity Warning
Host Prod1-SPB-0164  Storage Array F29991234567  SPB  Device Enclosure 1 Disk 3
Description Data Sector Invalidated
Time Stamp 01/18/03 17:22:59 Event Number 956 Severity Error
Host Prod1-SPB-0164  Storage Array F29991234567  SPB  Device Enclosure 1 Disk 7
Description Parity Invalidated
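
The dial-home records above follow a regular three-line layout, so a batch of them can be pulled apart with a short script during triage. The following is a minimal Python sketch, assuming the field labels appear exactly as in the sample above. The function name `parse_dial_home` and the dictionary keys are hypothetical and are not part of any EMC tool.

```python
import re

# Pattern for the first line of a record, as it appears in the sample above.
HEADER = re.compile(
    r"Time Stamp (?P<date>\S+) (?P<time>\S+) "
    r"Event Number (?P<event>\w+) Severity (?P<severity>\w+)"
)
# Pattern for the drive location at the end of the "Host ..." line.
LOCATION = re.compile(r"Enclosure (?P<enclosure>\d+) Disk (?P<disk>\d+)")


def parse_dial_home(text):
    """Return one dict per record found in a dial-home text (hypothetical helper)."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = HEADER.match(line)
        if header:
            current = header.groupdict()
            records.append(current)
        elif current is not None and line.startswith("Host "):
            location = LOCATION.search(line)
            if location:
                current.update(location.groupdict())
        elif current is not None and line.startswith("Description "):
            current["description"] = line[len("Description "):]
    return records


SAMPLE = """\
Time Stamp 01/18/03 17:22:59 Event Number 840 Severity Warning
Host Prod1-SPB-0164  Storage Array F29991234567  SPB  Device Enclosure 1 Disk 3
Description Data Sector Invalidated
Time Stamp 01/18/03 17:22:59 Event Number 956 Severity Error
Host Prod1-SPB-0164  Storage Array F29991234567  SPB  Device Enclosure 1 Disk 7
Description Parity Invalidated
"""

for record in parse_dial_home(SAMPLE):
    print(record["event"], record["disk"], record["description"])
```

Grouping the resulting records by enclosure and disk gives a quick view of which drives in a RAID group are reporting invalidations.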

Error code: 0x695 Uncorrectable Data Sector

Error code: 0x957 Uncorrectable Data Sector
Error code: 0x953 Uncorrectable Parity Sector

Error code: 0x68A Uncorrectable Parity Sector

Error code: 0x840 Data Sector Invalidated

Numerous sector reconstructed and parity sector reconstructed messages due to coherency errors in one or more RAID groups.

Parity Sector Reconstructed [r5_vr COH] messages
Cause        The event codes described above are logged by FLARE when it is unable to read data from a drive and subsequent attempts to reconstruct the data from the other drives in the RAID group have failed. The "Uncorrectable" messages indicate which drive(s) FLARE was unable to read from successfully, and the "Invalidated" messages indicate which drive(s) FLARE then marked as being void of valid information at a specific location. FLARE does this marking to ensure that no invalid data is returned to a host system. Attempts to read from an invalidated location result in a hard error being returned to the host. Attempts to write to an invalidated location complete successfully and generally "fill" (overwrite) the void location. This is why past Uncorrectable errors sometimes disappear after a host overwrites these sectors with new, good data.
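
The read and write behavior of an invalidated location can be pictured with a toy model. The sketch below is illustrative Python only. The `Lun` class and the `HardError` exception are hypothetical names and do not reflect how FLARE is actually implemented.

```python
class HardError(IOError):
    """Stands in for the hard error returned to the host (hypothetical)."""


class Lun:
    """Toy model of a LUN whose sectors may be invalidated."""

    def __init__(self, sector_count):
        self.sectors = [b"\x00" * 512 for _ in range(sector_count)]
        self.invalidated = set()

    def invalidate(self, lba):
        # The array marks the location as void of valid information.
        self.invalidated.add(lba)

    def read(self, lba):
        # Reads of a void location fail rather than return bad data.
        if lba in self.invalidated:
            raise HardError(f"sector {lba} is invalidated")
        return self.sectors[lba]

    def write(self, lba, data):
        # A host write succeeds and fills the void location.
        self.sectors[lba] = data
        self.invalidated.discard(lba)


lun = Lun(sector_count=8)
lun.invalidate(5)
try:
    lun.read(5)
except HardError as err:
    print("read failed:", err)
lun.write(5, b"\x01" * 512)
print("read after overwrite ok:", lun.read(5) == b"\x01" * 512)
```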

Uncorrectable or sector invalidated errors mean that the Background Verify or Rebuild process has identified uncorrectable locations, n in total, on one or more LUNs. An uncorrectable location is essentially a data void. This means that the array has determined the contents of such locations are not valid.

The determination that a location is not valid is based on information that the array maintains above and beyond the raw host data going to and from disk. Primarily this involves a cyclic redundancy check (CRC) that is maintained for each sector.

If the array identifies a sector whose data contents do not match the corresponding CRC, the array attempts to take corrective action.  For a redundant LUN this will be either to re-read from the other mirror half (RAID 1, RAID 1/0) or to re-read and XOR from the other drives in the group (RAID 3, RAID 5) for the stripe in question.  

If the re-read from the other drives in the group, attempted after the initial CRC mismatch was discovered, finds that another drive also has a CRC mismatch, the error condition is uncorrectable. As a result, the array notes that it is invalidating (voiding) the locations in question to ensure that their contents will not be fed back to a host at some point.
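
The check-then-reconstruct logic described above can be sketched in a few lines. This is a minimal Python illustration for a single-parity (RAID 5 style) stripe, not the array's implementation. It assumes `zlib.crc32` as a stand-in for the array's per-sector CRC, and the helper names `xor_blocks` and `read_with_repair` are hypothetical.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_with_repair(stripe, crcs, index):
    """Return the block at `index`, rebuilding it from the other positions if needed.

    `stripe` holds one block per drive (data blocks plus one parity block) and
    `crcs` holds the checksum stored for each block. Returns None when the
    location is uncorrectable.
    """
    if zlib.crc32(stripe[index]) == crcs[index]:
        return stripe[index]
    others = [i for i in range(len(stripe)) if i != index]
    # A second CRC mismatch in the same stripe makes the error uncorrectable.
    if any(zlib.crc32(stripe[i]) != crcs[i] for i in others):
        return None
    rebuilt = xor_blocks([stripe[i] for i in others])
    return rebuilt if zlib.crc32(rebuilt) == crcs[index] else None


data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = data + [xor_blocks(data)]          # last block is parity
crcs = [zlib.crc32(block) for block in stripe]

stripe[1] = b"B?BB"                         # one corrupted block
print(read_with_repair(stripe, crcs, 1))    # rebuilt from the other three blocks

stripe[2] = b"C?CC"                         # a second corrupted block
print(read_with_repair(stripe, crcs, 1))    # None: the location is uncorrectable
```

For a mirrored LUN (RAID 1, RAID 1/0) the same idea applies, except that the repair step re-reads the other mirror half instead of XORing the remaining drives.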
Change        The array logged a 0x695 error, followed by a 0x68A error, and then a 0x840 error on multiple drives in the same RAID group.
Resolution        Run a Background Verify on the affected LUNs to determine how many uncorrectable sectors there are. If Background Verify reports any uncorrectable sectors, you should recommend that the client attempt to back up the data to determine which files were affected, and then restore any lost files. If this is not possible, or restoration of specific data files is not possible, a sequence of unbinding, rebinding, and restoring all data to the affected LUN(s) will be required.
Note: When running Background Verify, ensure that no event such as excessive trespassing is occurring on the array, because it may lead to SP log wrapping. If such an event is ongoing, take steps to resolve it before running BV (for example, stop the trespassing by taking the trespassing host down, or stop host failover temporarily). This helps ensure that all the information logged during the BV run is present in the SP logs. This information is needed to determine the exact extent of the uncorrectables.

Caution! If the Background Verify finds any uncorrectable locations, there is a possibility that a small amount of valid data within a RAID 5 or a RAID 3 group is unprotected and could be lost on a future drive failure. This may or may not be visible to the customer application, depending on whether the affected sectors are accessed and whether the uncorrectable errors happen to be in the non-data area of the LUN. There are two ways to resolve this, which is not to be mistaken for recovering lost data:

If the customer has a good backup, the customer can choose to restore all data.

OR

If the customer does not have a backup or does not want to do a full restore (due to the size of the LUN/file system), the customer can choose to identify the affected files by doing a backup (or by another means at the application level) and then restore only those files (objects) that cannot be backed up successfully.
If there are COH (Coherency) events, see 9087 for more details.

Notes        If a Celerra file server is attached, see 9527. Read this article before proceeding. You will be asked to provide Celerra TS with the LUN numbers affected by the errors, the number of such errors found by running Background Verify, and any recommendations for hardware replacements during the recovery process. These issues are considered Data Unavailable/Data Loss situations and are typically handled by Corporate NAS L2 Support engineers.
For EMC Celerra or EMC VNX solutions, including EDL and DLM, the action of unbinding and rebinding array LUNs is a very unlikely option. The solutions linked in this Note define the actions for resolving uncorrectable errors and are to be performed by senior support engineers.

If an EMC Disk Library (EDL) is attached, see  Link Error: . Read this article before proceeding. You will be asked to provide CDL TS with the LUN numbers affected by the errors, the number of such errors found by running Background Verify, and any recommendations for hardware replacements during the recovery process.  

Frequently asked questions:

QUESTION: It is known that the only way to recover from invalidated sector errors is to unbind/rebind the LUN and restore from a backup or to restore/write the specific file that cannot be read.  Does Engineering have another way to recover data if both of these options are not possible?
ANSWER: There is no way to recover the data other than by means of a restore operation. Since the uncorrectable data is missing, there is no way of knowing what that data should be in order to write it back out. This is why the sector is 'invalidated' and a hard error gets returned to the host. It is better to return a hard error than incorrect data. However, if there is a Celerra attached to the CLARiiON, the Celerra has tools available to specifically target the uncorrectable sectors to avoid total data loss through an unbind/rebind operation of the affected LUN. Please see 9527 for more details.

QUESTION: Is it possible for an invalidated sector to change locations on a disk?
ANSWER: No. An invalidated sector remains invalid at a specific location until it is repaired by means of a rebind or overwritten by a host system.

QUESTION: Is there a way of finding out the actual location of an invalidated sector?  
ANSWER: It is very difficult to locate the position of an invalidated sector, due to how LUNs are mapped within RAID groups and what information is available through event logs.   Even if the specific location of an invalidated sector is determined, there is no way of knowing what data to place into the sector.  So any type of recovery effort short of restoring from a customer backup is not provided.  

Note: Release 19 patch 30 and later may provide an alternative. Contact CLARiiON Technical Support Level 2 for assistance.

QUESTION: If the invalidated sector does not appear to impact the data area, is there a way to get rid of it without unbinding/rebinding?
ANSWER: Some success has been reported when writing temporary data to fill the LUN and then deleting the temporary data.  If the invalidated area is written to with temporary data, the voided location(s) are filled, thus restoring full redundancy to the RAID group.   

Note: Release 19 patch 30 and later may provide an alternative.  Contact CLARiiON Technical Support Level 2 for assistance.

QUESTION: Can a customer run just a CHKDSK or FSCK to check the integrity of the data in the filesystem if uncorrectable errors are reported by background verify?
ANSWER: When there is an issue of uncorrectable sectors, the customer should check their data to see if any file corruption exists. In order to do this, the customer should run some type of application or program that reads all of the used sectors in the LUN space. The most common type of method is a full backup of the data. It is not advisable to simply run an FSCK (UNIX) or CHKDSK (Windows) because these utilities only check the metadata area of the files. If the uncorrectable sectors are not in metadata space, the customer will be left with the impression that the data is OK when in fact it is not.
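
Where a full backup is not practical, the same "read every used sector" idea can be approximated at the file level. The following is a minimal Python sketch, assuming the LUN is mounted as an ordinary file system on the host. The function name `find_unreadable_files` is hypothetical, and a file that was recently read or written may be served from host cache rather than from the array, so treat a clean result with caution.

```python
import os
import tempfile


def find_unreadable_files(root, chunk_size=1024 * 1024):
    """Read every byte of every regular file under `root`.

    Returns a list of (path, error) pairs for files that could not be read in full.
    """
    failures = []
    for directory, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(directory, name)
            if not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as handle:
                    # Read to end of file; a hard error surfaces as OSError.
                    while handle.read(chunk_size):
                        pass
            except OSError as error:
                failures.append((path, error))
    return failures


# Small demonstration on a temporary directory instead of a real mount point.
with tempfile.TemporaryDirectory() as demo_root:
    with open(os.path.join(demo_root, "sample.dat"), "wb") as handle:
        handle.write(b"\0" * 4096)
    print(find_unreadable_files(demo_root))
```

Any path that appears in the returned list is a candidate for restoring from backup.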

These are the definitions for each error type:
Checksum errors - The CRC for the sector was not correct. This indicates either a data corruption of some type, or the sector was intentionally invalidated because of a previous error.

Write Stamp errors - The write stamp for a particular sector doesn't match between the data and parity drive. This is usually caused by a RAID 5 write failing between writing the data and parity. Typically these errors are correctable.

Time Stamp errors - The time stamp for a stripe doesn't match across all the drives. Time stamps are used on RAID 3 (full stripe) writes. This is usually caused by a failure while writing the stripe. Typically correctable.

Shed Stamp errors - These are most often fatal. Usually caused by a SW error where the rebuild checkpoint is not maintained properly.

Coherency errors - Indicates that although the stamps may match, the parity for a stripe does not accurately reflect the data. This would be seen on RAID 1/RAID 10 units if a write fails before getting to both drives. Potentially seen on RAID 3/RAID 5. This is most often correctable.

Notes (employees and partners)        Engineering Guidelines Regarding Uncorrectable Cases
Warning! When uncorrectable errors are logged, it is critical that the errors be addressed AS SOON AS POSSIBLE to prevent additional DATA LOSS. This applies regardless of whether a customer is reporting data loss, and it applies to R5 and R6 RAID types.

Consider the following example:

A Background Verify finds numerous meta-data errors that are uncorrectable. This causes each errored data position and  its associated parity position to be "invalidated," which overwrites any data with a recognizable data pattern identifying them as invalidated blocks.

An invalidated strip cannot recover the lost data without the data being rewritten.

While a strip is marked as invalidated, any valid data still present on the strip on other drive locations is still valid, but has lost its redundancy and is vulnerable to any further errors until the invalidated data has been fixed.
In the following table drives 1 and 2 are found to contain invalid meta-data making them uncorrectable. They are then invalidated, which causes drives 1, 2, and parity to be marked as invalid. They will remain this way until overwritten.

Parity (0)   1         2         3         4
Invalid      Invalid   Invalid   Data      Data


On the customer's system, the uncorrectable errors are logged and, under normal circumstances, cause a call-home. At this point the customer should follow the procedure of unbinding, rebinding, and restoring the units, thus removing the invalidated sectors.

Any data written to any of the strips that contain an invalidated parity risks future problems causing loss of data.

If a disk in the same RAID group is powered down by the system, the invalidations present on many of the strips found by the original Background Verify do not allow the reconstruct to recover their data.
The following table represents a strip that has valid data on drive 4, an invalidated parity sector, two invalidated data drives, 1 and 2, and a failed drive, 3.  When drive 3 is replaced and a rebuild attempts to reconstruct that data for drive 3, it is unable to accomplish this with only the data in drive 4 being available. To reconstruct a position on a RAID 5 all other positions must be present and contain valid data to allow the reconstruct to execute.

Parity (0)   1         2         3 (Failed)   4
Invalid      Invalid   Invalid   ?????        Data
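
The rule behind the two tables above, that every other position in a RAID 5 strip must hold valid data before a missing position can be rebuilt, can be written as a one-line check. This is a minimal Python sketch. The function name `can_rebuild` and the state labels "valid", "invalid", and "failed" are hypothetical and chosen for illustration.

```python
def can_rebuild(strip, target):
    """Return True if position `target` of a single-parity strip can be rebuilt.

    `strip` maps a position name to its state: "valid", "invalid", or "failed".
    Every position other than the target must be valid.
    """
    return all(
        state == "valid"
        for position, state in strip.items()
        if position != target
    )


# The strip from the second table: drive 3 failed, parity and drives 1 and 2 invalidated.
degraded = {"parity": "invalid", "1": "invalid", "2": "invalid", "3": "failed", "4": "valid"}
# The same drive failure on a strip with no invalidated positions.
healthy = {"parity": "valid", "1": "valid", "2": "valid", "3": "failed", "4": "valid"}

print(can_rebuild(degraded, "3"))
print(can_rebuild(healthy, "3"))
```

The first call reports that drive 3 cannot be rebuilt, which is why invalidated positions left in place put the remaining valid data in the strip at risk.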


An Engineering DIMS is NOT required if uncorrectable events occur due to a degraded media rebuild. It is also not required if you feel you have a good understanding of the issue and understand the reason for the uncorrectable events. The following two examples are uncorrectable events caused by degraded media.
Example #1 [Degraded Media During Rebuild - New Media Errors]:

A disk fails, and during the rebuild another drive logs new media errors, leading to uncorrectable errors as reported in the logs.

Example #2 [Degraded Media During Rebuild - Older Media Errors]:

A disk fails, and during the rebuild data cannot be read from another drive (with errors logged before the first disk went down), leading to uncorrectable errors as reported in the logs.
Note: Any Data Unavailable (DU) or Data Loss (DL) should continue to be reported through normal CAC notification for tracking purposes.  

The exceptions to the No-DIMS rule above, which require a DIMS to be opened, are:

     (a) Sudden uncorrectable errors on one or more disks with no preceding events.
     (b) Two drives reporting media errors on the same LBA (sector), causing an uncorrectable.
     (c) Two drives reporting CRC errors at the same time.
     (d) Multi-bit CRC events. For example, disk 2.0.C reports three 689 Sector Reconstructed [mirror_rd_vr CRC] events against different LBAs prior to the uncorrectable events.
     (e) Reconstruct messages due to unexplained COH (Coherency) errors.