[RAID and Disk Arrays] About EMC Hot Spare

Posted on 2006-11-02 06:32

Over the past two days I ran extensive tests on EMC hot spares. The trigger: at a customer site, a failed system disk caused the global write cache to be disabled. Keep in mind that once the write cache is disabled, I/O performance drops considerably, so I/O-intensive applications slow down and respond sluggishly. Observing with vmstat showed idle at roughly 10%-20%, the system I/O-busy, and the wait column non-zero (2-5 at the customer site).
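For reference, the observation above was plain vmstat sampling like the snippet below; the columns of interest are the idle and I/O-wait percentages (column names vary slightly between vmstat versions):

    # Sample system statistics every 5 seconds. With the write cache
    # disabled, idle hovered around 10-20% and the I/O wait column
    # stayed non-zero (2-5 at the customer site).
    vmstat 5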
The customer actually had one hot spare each of 36, 73, and 146 GB, while the system disks (the first five, disk 0 through disk 4) are 73 GB. Yet even after a hot spare had fully taken over for the failed system disk, the write cache stayed disabled. Because the replacement part had not arrived, we ended up taking the 146 GB hot spare and inserting it directly into the failed system disk's slot.
From this process we drew the following conclusions (for the customer's environment; a CLI check for the cache state is sketched after the list):
1) With the write cache disabled, system performance degrades sharply.
2) If a system disk fails, the write cache is disabled whether or not a hot spare is available.
3) Only once the system disk is back to normal (i.e., after data synchronization completes; a 73 GB disk typically takes about an hour) is the write cache automatically re-enabled.
4) If a non-system disk fails, the write cache is not disabled.
5) A hot spare of larger capacity can stand in for a smaller failed disk.
6) If one of the SPSs (standby power supplies) fails, the write cache is also disabled.
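As a minimal sketch of how the cache state can be watched during such tests (assuming the management host has classic NaviSphere CLI installed; the exact field names in the output differ between FLARE releases):

    # Query the cache configuration of one storage processor;
    # <sp_ip> is a placeholder for the SP's management address.
    navicli -h <sp_ip> getcache

    # The output contains state lines such as "SP Write Cache State";
    # grep for them while failing a disk to observe conclusions 2)-4).
    navicli -h <sp_ip> getcache | grep -i "cache state"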

We ran the same tests in our CX500 test environment. Most of the results matched, but point 3 differed. There, the conclusion was: as soon as the system disk starts synchronizing (rather than waiting for it to complete), the write cache can return to the enabled state. This greatly shortens the time the write cache spends disabled, which is exactly the behavior we want.
So is this perhaps a CX600 bug that has already been fixed on the CX500? Or can the problem be avoided simply by upgrading the microcode?
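If it does come down to the microcode level, the first step would be comparing the FLARE revisions of the two arrays. A sketch, assuming the getagent output on your NaviCLI release includes a Revision field:

    # Print agent/SP information (model, serial number, FLARE revision)
    # for comparison between the CX600 and the CX500.
    navicli -h <sp_ip> getagent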
Below are some hot-spare-related excerpts from the EMC manual:
Hot spare - A single global spare disk that serves as a temporary replacement for a failed disk in a RAID 5, 3, 1, or 1/0 LUN. Data from the failed disk is reconstructed automatically on the hot spare. It is reconstructed from the parity data or mirrored data on the working disks in the LUN; therefore, the data on the LUN is always accessible. A hot spare LUN cannot belong to a storage group.


RAID type      Number of disks you can use
RAID 5         3 - 16
RAID 3         5 or 9 (CX-series)
RAID 1/0       2, 4, 6, 8, 10, 12, 14, 16
RAID 1         2
RAID 0         3 - 16
Disk           1
Hot spare      1

Note: If you have LUNs consisting of FC drives, allocate an FC drive as a hot spare. If you have LUNs consisting of ATA drives, allocate an ATA drive as a hot spare.
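To follow that note you need to know what kind of drive sits in a slot before allocating it as a hot spare. A hedged sketch (disks are addressed as bus_enclosure_slot; which output fields identify the drive type varies by FLARE release, so check the vendor and product lines):

    # Show the properties of the disk in bus 0, enclosure 0, slot 5;
    # the vendor and product fields reveal whether it is FC or ATA.
    navicli -h <sp_ip> getdisk 0_0_5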


Rebuild priority
The rebuild priority is the relative importance of reconstructing data on either a hot spare or a new disk that replaces a failed disk in a LUN. It determines the amount of resources the SP devotes to rebuilding instead of to normal I/O activity. Table 8-3 lists and describes the rebuild time associated with each rebuild value.

Value      Target rebuild time in hours
ASAP       0 (as quickly as possible; this is the default)
HIGH       6
MEDIUM     12
LOW        18


The rebuild priorities correspond to the target times listed above. The storage system attempts to rebuild the LUN in the target time or less. The actual time to rebuild the LUN depends on the I/O workload, the LUN size, and the LUN RAID type. For a RAID group with multiple LUNs, the highest priority specified for any LUN in the group is used for all LUNs in the group.
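As an illustration, the rebuild priority of an individual LUN could be changed from the CLI like this (a sketch: the -r switch of chglun and the keyword values below are assumptions based on classic NaviCLI and may differ on your release):

    # Lower LUN 3's rebuild priority so a rebuild competes less with
    # user I/O; per the table above the target time becomes 18 hours.
    navicli -h <sp_ip> chglun -l 3 -r Low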
Rebuilding a RAID 5, 3, 1, or 1/0 LUN
You can monitor the rebuilding of a new disk from the General tab of its Disk Properties dialog box (page 14-15).
A new disk module’s state changes as follows:
1. Powering up - The disk is powering up.
2. Rebuilding - The storage system is reconstructing the data on the new disk from the information on the other disks in the LUN. If the disk is the replacement for a hot spare that is being integrated into a redundant LUN, the state is Equalizing instead of Rebuilding. In this situation, the storage system is simply copying the data from the hot spare onto the new disk.
3. Enabled - The disk is bound and assigned to the SP being used as the communication channel to the enclosure.

A hot spare’s state changes as follows:
1. Rebuilding - The SP is rebuilding the data on the hot spare.
2. Enabled - The hot spare is fully integrated into the LUN, or the failed disk has been replaced with a new disk and the SP is copying the data from the hot spare onto the new disk.
3. Ready - The copy is complete. The LUN consists of the disks in the original slots and the hot spare is on standby.

Rebuilding occurs at the same time as user I/O. The rebuild priority for the LUN determines the duration of the rebuild process and the amount of SP resources dedicated to rebuilding. A High or ASAP (as soon as possible) rebuild priority consumes many resources and may significantly degrade performance. A Low rebuild priority consumes fewer resources with less effect on performance. You can determine the rebuild priority for a LUN from the General tab of its LUN Properties dialog box (page 14-14).
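The same information is available from the CLI, which is handier for watching a long rebuild. A sketch, assuming the getdisk switches below (-state for the disk state, -rb for the percent-rebuilt figure) exist on your navicli release:

    # Report every disk's state and percent rebuilt once a minute
    # until the rebuild or equalize finishes.
    while true; do
        navicli -h <sp_ip> getdisk -state -rb
        sleep 60
    done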

Failed vault disk with storage-system write caching enabled
If you are using write caching, the storage system uses the disks listed in Table 14-3 for its cache vault. If one of these disks fails, the storage system dumps its write cache image to the remaining disks in the vault; then it writes all dirty (modified) pages to disk and disables write caching.
Storage-system write caching remains disabled until a replacement disk is inserted and the storage system rebuilds the LUN with the replacement disk in it. You can determine whether storage-system write caching is enabled or disabled from the Cache tab of its Properties dialog box (page 14-14).
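Once the vault LUN is rebuilt, the cache state can be confirmed, and, if it does not come back by itself, turned back on from the CLI. A sketch (the setcache -wc 0|1 switch is an assumption about your NaviCLI release):

    # Confirm the current cache state.
    navicli -h <sp_ip> getcache | grep -i "cache state"

    # Re-enable SP write caching if needed (1 = on, 0 = off).
    navicli -h <sp_ip> setcache -wc 1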

Storage-system type        Cache vault disks
CX3-series, CX-series      0-0 through 0-4
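Since these five disks are exactly the ones whose failure disables the write cache, they are the first thing to check when caching turns itself off. A small loop under the same navicli assumptions as above:

    # Check the state of the five cache vault disks
    # (bus 0, enclosure 0, slots 0 through 4).
    for slot in 0 1 2 3 4; do
        navicli -h <sp_ip> getdisk 0_0_${slot} -state
    done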



This post comes from the ChinaUnix blog; for the original article see: http://blog.chinaunix.net/u/2671/showart_193862.html