Building a 20TB ZFS file server

#1 · Posted 2010-09-05 04:35
  • Storage Strategy with ZFS
  • Hardware Selection – A comprehensive decision-making guide
  • Hardware Assembly
  • Operating System Install
  • Overview of ZFS Concepts and Technology
  • Command-by-command ZFS setup
  • Network storage setup – CIFS and NFS
  • Performance Benchmarking
  • Lessons Learned

#2 · Posted 2010-09-05 04:36
Part 1: The ZFS Revolution

Introduction

This is part one of a series of blog entries from StringLiterals.com. In this series, we are sharing the entire process of building a twenty terabyte ZFS file server from scratch. This is part one: The ZFS Storage Revolution.

Before we rush out and start looking at hardware to buy, we need to inventory the current state of the art of storage technology and formulate a coherent storage strategy.

The condensed version

Don't have time to read this entire post? Here's the condensed version:
  • We’re in the midst of a fault-tolerant revolution.  The approach of using expensive, highly reliable hardware in order to make software systems reliable is now outmoded.
  • The new approach is to combine inexpensive hardware with fault-tolerant software.
  • ZFS is the technology that brings this philosophy to the storage industry.
  • ZFS supplants RAID, NetApp, and EMC technologies, but it's not for every platform (yet).
Have time to read? Settle in for a grand tour of past and modern storage strategies.

The Fault-Tolerance Revolution

We are in the midst of a revolution in the way hardware and software are deployed on an enterprise scale. This revolution divides infrastructure strategies into two approaches. The traditional approach is to spend premium dollars on the highest possible quality of redundant hardware so that software may run atop it reliably. The new approach is to use more intelligent software that is itself tolerant of hardware failure, and then use more affordable commodity hardware.

This fault-tolerant approach can be seen in numerous facets of information technology, from the data center to the desktop. In the data center, you need look no further than Google to see this approach in action. Google, like other large websites, has developed its systems to survive massive failures on commodity hardware. Any one of its servers, or any complete rack of servers, can suffer catastrophic failure, and its data and services continue intact. This is in part because each piece of data is stored in multiple locations. But it's made possible by creative software that is aware of the hardware on which it runs and compensates for failures on the fly.

To see the fault-tolerant approach in the desktop setting, look no further than modern high-density hard drives. As manufacturers strove to push capacities beyond a few hundred gigabytes, they started adding platters to the drive. That approach had scaling issues – there are only so many platters you can fit in a 3.5″ hard drive package. Inevitably, they were forced to place the bits physically closer and closer together on the disk, and to use less and less of the storage surface to magnetically record each one or zero. Creative approaches such as perpendicular recording have aided greatly in the struggle to keep bits pointed in the right magnetic direction. But over time, portions of the disk become less and less reliable at holding a strongly polarized charge. To combat this, modern hard drives monitor the health of every sector on disk. Should the signal on any particular spot of the platter become weak, the drive will pack up the data, move it to a new sector, and mark the original sector as “not to be used.” This takes place on a regular basis and is not considered a system failure. In this way, modern hard drives have become tolerant of hardware failure on the small scale, and single-drive PC capacities have skyrocketed to 2 TB as of this writing.

An Insatiable Appetite for Storage Capacity

The consumer appetite for digital storage has known no bounds. At first, consumers only stored documents and spreadsheets on their computers. The popularity of digital photos and music collections then drove the demand for space up by an order of magnitude. Digital HD camcorders have now done away with the tapes of yesteryear, and digital storage demands have increased by another order of magnitude.

The same insatiable storage appetite can be seen in the enterprise setting. Companies in the know have discovered that information really is power – the power to make money. Not only do companies save the same old documents in ever-larger file formats, they're digitally recording more types of data than ever before.
Grocery stores profile every customer's buying habits by recording every transaction in a database, thanks to “discount cards” that motivate customers to identify themselves. Call centers are recording every telephone conversation digitally. Databases were once used only for inventory management and accounting, but are now commonplace in every business system from asset management to document and e-signature management. Radio producers and movie studios store all their media directly to digital. It's not as though companies can marvel at the capacity of a massive one-terabyte drive and replace their old RAID arrays and SANs with a single drive in the center of a now-empty data center.

Introducing ZFS

It's difficult to describe ZFS in a single breath, and I'm hardly the first to extol its benefits. ZFS originally stood for “Zettabyte File System” – but to refer to it as simply a file system is a gross understatement. This is probably why “ZFS” is now officially an orphaned acronym.

In a nutshell: ZFS is a technology that brings the fault-tolerant philosophy to the storage problem space.

The features in ZFS make for highly reliable, extremely scalable storage that can be run on commodity hardware. It's inherently more reliable than RAID and SAN filers from NetApp or EMC because it ensures data integrity all the way from the storage devices to system memory. It scales in both size and performance beyond the largest storage need we're likely to encounter in the next hundred years. This scalability is made possible with 128-bit addressing, with adaptive block sizing, and through the elimination of static metadata (there is no “formatting”).

Wikipedia has an entry for ZFS that summarizes its features in the following list (a short command sketch of a few of them follows the list):
  • Storage pools
  • Capacity
  • Copy-on-write transactional model
  • Snapshots and clones
  • Dynamic striping
  • Variable block sizes
  • Lightweight filesystem creation
  • Additional capabilities
  • Cache management
  • Adaptive Endianness
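To make a couple of those concrete, this is roughly what lightweight filesystem creation, snapshots, and clones look like at the command line – a minimal sketch with illustrative pool and dataset names, not taken from this build:

# create a lightweight filesystem inside an existing pool
zfs create tank/projects
# take an instant, space-efficient snapshot
zfs snapshot tank/projects@before-cleanup
# roll the filesystem back to the snapshot, or branch a writable clone from it
zfs rollback tank/projects@before-cleanup
zfs clone tank/projects@before-cleanup tank/projects-experiment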
ZFS also happens to be an open source project, sponsored by Sun and led by Jeff Bonwick. Jeff Bonwick's blog has fantastic entries explaining the chief features of ZFS, such as RAID-Z. ZFS was announced in 2004, and has since been stabilized into a truly enterprise-ready code base. It's been available in OpenSolaris for several years. Today you can also find it in FreeBSD, and a read-only implementation is available in Mac OS.

It's notable that although ZFS is truly open source, and has benefited tremendously from community support and contribution, it is not released under the GPL. As a result, it has not been adopted into the Linux kernel. Instead, ZFS may be run in user space on a Linux box via FUSE. Another upcoming alternative for Linux systems is BTRFS (pronounced “butter F.S.”), an Oracle-sponsored project to deliver ZFS-like features under the GPL license. Its code base is still experimental and is not recommended for any data you intend to keep.

ZFS Limitations

ZFS is not without limitations. The good news is that the community is well aware of them, and plans on addressing most of them in future releases. The chief limitations are as follows:
  • Inability to grow a stripe one drive at a time. It's perfectly possible to add sets of devices to an existing ZFS zpool, and thereby expand a volume a chunk at a time. You can also attach mirrors to existing pools. But if you wish to add a single drive at a time without dedicating additional space to parity, ZFS does not yet have the facility to do so. (See the command sketch after this list.)
  • A single volume (zpool) cannot include device resources that span multiple hosts. SAS technology limits this impact, as you can use SAS expanders to attach several hundred devices to a single host in a practical manner. You can, of course, export the ZFS file systems to be used by multiple other machines (this is actually a huge strong point, thanks to features like NFSv4 ACLs and native in-kernel CIFS for those catering to Windows). ZFS is not yet a complete solution for implementations spanning tens of thousands of devices – although it can be used in conjunction with iSCSI or filers at that scale.
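To make the first limitation concrete, here is roughly what expansion looks like today – a sketch with illustrative device names; adding a whole vdev or attaching a mirror works, but widening an existing raidz vdev by one disk does not:

# grow the pool by an entire new raidz vdev (five more drives at once)
zpool add tank raidz c9t0d0 c9t1d0 c9t2d0 c9t3d0 c9t4d0
# turn a single-disk vdev into a mirror by attaching a second device to it
zpool attach tank c10t0d0 c10t1d0
# there is, however, no command that adds one more data disk to an existing raidz vdev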
Conclusion

The benefits of ZFS technology far outweigh the limitations in our situation. ZFS is a perfect match for a single, larger storage server with a few dozen devices. Thanks to the availability of high-density hard drives, and the fault tolerance of ZFS, twenty terabytes can be hosted in one machine affordably and responsibly. We will be storing corporate data, so the end-to-end data integrity gives us peace of mind. The only things that come close to this are NetApp or EMC solutions, which we can't afford.

We'll be storing a wide variety of data: everything from thousands of little database and HTML files to large video files for our upcoming video posts. We need a filesystem that can handle both situations at once, so ZFS's dynamic block sizing is of particular appeal. We'll be serving data to a variety of UNIX servers, Linux workstations, and Windows clients, so the built-in NFSv4 and CIFS support is ideal.

We can't wait to replace our existing hardware-RAID file server with a comprehensive ZFS build. Even though this solution is very redundant and reliable, redundancy does not a backup make. We'll be encrypting copies of our most important data and partnering with a trusted business to back it up off-site, over the wire.

We'll take a more in-depth look at ZFS in a feature-by-feature fashion as we implement our new storage server. In the process, our storage strategy will be refined. At this point it's enough to know that ZFS is a great fit for our needs. Expensive SAN solutions are no longer the only option for reliable data storage. In case you didn't get the news: RAID is dead. Looming suspiciously over its bludgeoned body stands ZFS. We intend to make her our friend.
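As a small preview of the network storage installment, sharing a ZFS dataset over NFS and CIFS in OpenSolaris comes down to setting properties on the dataset. A minimal sketch, assuming the in-kernel SMB service is enabled and using an illustrative dataset name:

# share a dataset to UNIX/Linux clients over NFS
zfs set sharenfs=on tank/shared
# share the same dataset to Windows clients via the in-kernel CIFS server
zfs set sharesmb=on tank/shared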

#3 · Posted 2010-09-05 04:38
Part 2: Hardware Selection

This is part of a series of blog entries from StringLiterals.com. In this series, we are sharing the entire process of building a twenty terabyte ZFS file server from scratch. This is part two: Hardware Selection.

Hardware technology moves very fast, necessitating in-depth research with each new generation of hardware. In this article, we will help you understand which decisions need to be made, the order in which to make them, and the important factors to weigh for each decision. Additionally, we will share a few specific hardware examples for each choice. In the end, we'll share our choices and see how we performed against our tight budget of $3,500 USD.

Considerations

When choosing hardware for our ZFS server, we must recognize that the considerations are specific to the task at hand. We are not building a gaming PC, a business workstation, or a virtualization host. Our goal is to assemble a file server utilizing ZFS technology, and we weigh each decision against that goal.

Through trial and error, we discovered that there is a very specific order in which you should make hardware decisions when building a storage machine. The order is different than, for example, when building a desktop PC: with a gaming system, you typically decide on a graphics card and desired CPU first, and then build the system around those choices.

With a storage server, we start with the disks and work our way up through the interfaces to the memory and CPU, and then out over the network card. This is the path the data will flow. One poorly made decision can easily throw the price tag up by thousands of dollars. Some of this is because of scale: when buying twenty hard drives, a component price difference of $60 adds up quickly. Another factor that drastically influences cost is storage connectivity. Building a system around the wrong motherboard, for example, might force us into choosing among very expensive disk controller cards. It pays to be aware of all of your options.

Let's start with the hard drives.

Hard Drives

We chose the Western Digital 1.0 TB “Black” edition drive. We've previously used the WD 1TB RE3 “raid edition” with great success, so part of this decision is about brand comfort. The reason we've changed from the RE3 to the Black is a curious one: since buying the RE3s (at a $70 premium each), we've learned that the only important difference between the “RE3” and the “Black” edition is a firmware setting that can be manually changed. This firmware setting, called Time Limited Error Recovery (TLER), controls how long a single drive will spend attempting to read a sector.

While it might be fine for a standalone drive to spend twenty seconds to two minutes attempting to recover the data, this leads to trouble in a RAID-like setting. If the disk controller waiting on the drive times out before the drive itself gives up on a sector, the entire drive will be marked as bad and dropped from the pool. We much prefer the drives rapidly reporting a read failure, so that the ZFS system can quickly reassemble the missing data from parity on the fly. We wrote an earlier post about how to ready a WD Black drive for RAID use. Similar technology exists for other brands: it's called Command Completion Time Limit (CCTL) for Samsung and Error Recovery Control (ERC) for Seagate. Knowledge of how this feature works is the most critical concern when considering the use of large-capacity consumer-grade hard drives in any sort of RAID configuration.
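On drives that expose the SCT Error Recovery Control interface, the timeout can be inspected and changed from smartmontools. This is a hedged sketch only – the WD-specific write-up mentioned above covers their own method, the device path here is illustrative, and not every consumer drive accepts the command:

# query the current read/write error-recovery timeouts
smartctl -l scterc /dev/rdsk/c7t0d0s0
# cap both at 7.0 seconds (values are given in tenths of a second)
smartctl -l scterc,70,70 /dev/rdsk/c7t0d0s0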
Disk Chassis: Internal vs External

Selecting a chassis for the disks comes down to a trade-off between expandability and cost. On one end of the spectrum, we have internal hard drives mounted in the same case as the server. This is currently the most affordable way to go, but you can quickly run into a brick wall once your case is full of drives. Another option is to use an external storage chassis. External bays come in three basic varieties:

Disk Chassis: External Options
  • SATA Port Multipliers
  • SAS Multilane Enclosures
  • SAS Expander Enclosures
SATA port multipliers are by far the cheapest solution, but there's a catch. We ruled out this approach fairly quickly, in light of the fact that this architecture only allows the controller to communicate with one drive at a time. This limitation exists because the drives must be able to act as though they have sole access to the controller: the controller must ask for a piece of information and wait for the drive to provide it before moving on to the next drive to request the next piece. This would be detrimental to performance.

SAS multilane enclosures, sometimes marketed as “SAS JBOD,” have two advantages. First, the controller can communicate with multiple drives concurrently. Second, these chassis can be connected via a single MiniSAS connector per set of four drives. The trade-off here is that SAS controllers are relatively expensive, and you tend to fully consume the capacity of a controller with only a few drives. The chassis themselves are affordable. Here is an example 8-bay SAS JBOD enclosure for $469. (This blog has no affiliation with PC-Pitstop.)

SAS expander enclosures are the third and by far the preferred option when it comes to expandability. These can be daisy-chained to support up to 128 drives on a single MiniSAS channel. You spend far less money on controllers, since a 4-port or 8-port SAS controller can easily drive 128-255 devices. For very high density setups, this is your only real choice, as simply adding controller cards quickly runs you out of expansion slots on the motherboard.

A while ago, we built a hardware RAID array using an Adaptec 5805 SAS controller, with hopes of adding drives up to the sky-high limit of 256 devices via the magic of SAS expanders. So what was the catch? Stand-alone SAS expanders are simply not available on the market. The only place to find them is in the backplane of hot-swap cases, and they are very expensive compared to JBOD chassis. The best deal I've found with this technology is a 15-bay SAS expander enclosure for $1,395 from PC-Pitstop. iStarUSA has a great selection of storage chassis, including the V-Storm series, but I've been unable to find these for retail sale.

Disk Chassis: Internal Options

We're not the only ones vexed by the lack of options for storage-oriented server chassis. With enough searching, we were able to find a few viable choices, and fairly rapidly limited the field to three chassis. There are basically three no-brainer price points:

Norco RPC-4020 – 4U 20 bay SATA case – $279
  • Pro: Extremely affordable!
  • Con: Supports ATX, but not Extended ATX server motherboards
  • Pro: Backplane takes twenty individual SATA connectors, meaning you can use cheaper SATA II controllers, including those built into most motherboards
  • Con: Backplane takes twenty individual SATA connectors, making for needless cable spaghetti should we choose a disk controller with SAS multilane connectors
\r\nSuperMicro CSE-846TQ-R900B Rackmount 24 bay – $949
  • Pro: Moderately affordable; includes a redundant power supply
  • Con: Power supply does not have an eight pin motherboard adapter needed by 5500 series Xeon boards; adapter available
  • Pro: Uses SAS multilane cables – great for cable management if we use a SAS controller card with multilane or minisas connectors
  • Con: Cannot use cheaper SATA connectors, ruling out the use of drive controllers built into motherboards
\r\nYMI Rackmount Pro 9U – $3,689
  • Pro: The only manufacturer I could find of extremely large storage cases.
  • Pro: If you anticipate needing 50 hot swap bays in your server chassis, this is really your only option.
The Norco RPC-4020 case is really the gem at our price point. For $279 we get twenty hot-swap bays and three internal bays. As much as we would like the added space of 24 bays, we found that there are extremely few options. Once you factor in the high-quality redundant power supply the SuperMicro tosses in, its four additional bays don't quite come at double the price tag; but we would still be sacrificing the flexibility to use cheap SATA controllers due to the SAS backplane. Twenty bays may seem awkward if you're used to building RAID arrays in sets of 8 drives, but upon further contemplation we found that this case gives quite a few nice options for raidz structure:

If cash flow is tight, we could build our array slowly using three sets of six drives, each in RAID-Z. This leaves two hot-swap bays available for the operating system, which will be a mirrored set. The trade-off of building in three “chunks” of six is that while we are dedicating three drives to parity, we still cannot tolerate any two drives failing. Yes, we could tolerate two or even three failures, but only if we're lucky enough for the failures to take place in separate chunks of the array. We're more interested in limiting the worst-case scenario: a second drive failure within any given set of six and we would lose the entire array.

Another alternative is to build two sets of nine. With RAID-Z2 this dedicates 4 disks of parity across 18 disks, but gives us the benefit of being able to lose any two drives in the array, at the “cost” of only one additional drive of parity. There is a performance penalty in terms of I/O operations per second (IOPS) when using larger clusters of drives in a single stripe, which we will discuss in more detail when we configure the ZFS zpool.

A third option is to build the entire array at once, in which case we could consider making a 17-disk RAID-Z2 array and dedicating one of the three remaining bays to a hot spare. The hot spare ensures that we quickly recover to full redundancy after a single drive failure. This solution also has the same 15 drives of usable space as scenario #1. There is a negative performance implication to one large set of drives, but we get higher effective storage (losing less capacity to parity) while maintaining great tolerance of drive faults.

More possibilities present themselves should we decide to forgo using the hot-swap bays for the operating system root partition and move those drives to the internal HD brackets. By using all 20 bays for the array, we have more symmetrical options, such as 4 raidz virtual devices with 5 drives each – a configuration we anticipate will be ideal for high IOPS performance.

We will try many of these configurations and analyze each in a later post; a couple of command sketches follow below. If you would like to read ahead, we recommend the ZFS Best Practices Guide.
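For the record, here is roughly how two of those layouts would be expressed at pool-creation time – a sketch only, using the device names that show up later in this thread but a grouping chosen purely for illustration:

# two 9-drive raidz2 vdevs: any two drives in the pool can fail
zpool create tank \
  raidz2 c7t0d0 c7t1d0 c7t2d0 c7t3d0 c7t4d0 c7t5d0 c7t6d0 c7t7d0 c8t0d0 \
  raidz2 c8t1d0 c8t2d0 c8t3d0 c8t4d0 c8t5d0 c8t6d0 c8t7d0 c13d0 c14d0

# one 17-drive raidz2 vdev plus a hot spare
zpool create tank \
  raidz2 c7t0d0 c7t1d0 c7t2d0 c7t3d0 c7t4d0 c7t5d0 c7t6d0 c7t7d0 \
         c8t0d0 c8t1d0 c8t2d0 c8t3d0 c8t4d0 c8t5d0 c8t6d0 c8t7d0 c13d0 \
  spare c14d0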
Controller Cards

Because we are building a ZFS array, we have many choices when it comes to disk controllers. We are no longer constrained to selecting fast hardware RAID controllers with a hunk of NVRAM and a BBU. There are three basic choices when it comes to controller cards:
  • SATA controllers built into the motherboard
  • SATA controllers on expansion cards
  • SAS controllers on expansion cards
The SATA controllers included on motherboards have one big limitation: quantity. Most motherboards provide only six SATA connectors. There are a handful of enthusiast and server boards that support 8 to 10 drives. Notable among these is the Asus P5Q, which we likely would have selected had we gone with the Intel Core 2 platform, largely due to its excellent reported compatibility with OpenSolaris.

Standalone SATA controllers are an affordable option. Densities typically range from 2 to 8 devices per expansion card, and they can be had for less than $100 each. These also have the cost-saving benefit of working with cases that have older SATA backplanes. Both SATA and SAS controllers can be connected to SATA backplanes, but SATA controllers cannot be connected to SAS multilane backplanes.

SAS controllers are our third option, and are definitely the way of the future. The benefits include easy cabling with MiniSAS SFF-8088 connectors, and near-unlimited expandability both internally and externally. The downside is that SAS controllers are much more expensive than SATA. Tomshardware.com has an excellent overview of SAS technology.

When previously building our hardware RAID array (the server we are replacing with this ZFS machine), we went with the Adaptec 5805 SAS controller, which is an extremely fast controller at a street price of around $500. Such SAS controllers are the preferred solution in two situations. The first is where as much performance as possible must be squeezed out of 4 to 8 drives in a cost-effective manner. The second situation, where SAS really shines, is storage systems that must scale well above 24 drives, when the price of SAS expander technology is a non-issue. We also recommend the 24-port Areca ARC-1680IX-24-2 and the Dell PERC 5/i, which is a certified component on the OpenSolaris Hardware Compatibility List.

Device controllers can very easily be one of the most expensive components of a storage system, second only to the disk drives.

There is one more important consideration when it comes to drive controllers, and that is redundancy. It's possible to make a ZFS pool that can survive the failure of any one disk controller. This is achieved by making sure that no two drives within a single raidz virtual device are hosted on the same controller. If this level of redundancy is required, we would recommend purchasing five controllers, each controlling four of the twenty drives. This is far preferable to using one large controller. We will cover this topic again when we set up the RAIDZ structure in ZFS.

With our budget, we chose to use a few cheap SATA II controllers with reported OpenSolaris compatibility. We sacrificed performance for price by using the older 133 MHz, 64-bit PCI-X bus. To fit the bill we ordered two of the SuperMicro AOC-SAT2-MV8, available for $99 on NewEgg.com. Each of these can drive 8 SATA drives. We will initially use the motherboard to drive the remaining 4 of the 20 drives.

CPU: Intel vs AMD

This is a decision often made via personal preference, so I will not attempt to persuade the reader in one direction. I will merely state that my preference is for Intel. My decision is based largely on two factors: performance per kilowatt, and the choice of motherboards.
By these measures, Intel pulled ahead of AMD with the introduction of the Core 2 core, and has been ahead ever since.

Memory: ECC vs non-ECC

We do not want to fall victim to the handful of random bit flips that happen on a typical memory module each year. You can blame pesky cosmic rays for such random errors. If anything, our location in Denver, Colorado only makes this more important, as cosmic rays find their way to earth more frequently in the mile-high city.

The ECC feature uses extra banks of memory to store parity information. A background process continually scrubs the memory and is capable of correcting any one error per 64-bit word.

Memory: Registered vs Unbuffered

This choice is thankfully a non-decision. The only reason to choose registered memory is if the motherboard requires it in order to reach the memory densities we need. One benefit of the modern architectures is a very high density of natively accessed memory: the on-chip memory controller can drive an entire bank of RAM in capacities of several dozen gigabytes. Very rarely these days do we see the need for a register to sit between the memory controller and the memory banks to relay instructions. This is a good thing, because a registered memory module takes an extra clock cycle to do that relaying, slowing down the system in the area where we can least afford it.

Processor: Core i7 vs Xeon

What's fascinating about this particular decision is that it appears to come down to a pure question of performance vs reliability. The question of value can easily be brushed aside, because we have the novelty of a Xeon 5506 processor priced at the same point as the Core i7 920 processor. So with the dollars even on both sides of the comparison, let's look at the specifications.

At first, this masquerades as a fairly easy decision. Both processors are based on the same architecture, the Nehalem CPU core. Although the Xeon line is marketed for servers and workstations, and the i7 towards the desktop market, we must look beyond the marketing and assess exactly what we get with each product. At this price point, the i7 actually has more muscle in nearly every regard: both a higher clock speed and more on-die cache. The higher cache of the i7 is a bit of a surprise, as this is usually a benefit of the Xeon lineup.

However, the decision becomes black-and-white once we take into consideration one very important piece of information: the Core i7 does not support ECC memory. In previous architectures, ECC support was a matter of motherboard choice, because the memory controller was located on the north-bridge chipset. With the i7/5500 series architectures, the CPU contains the memory controller, and thus we have no choice but to disqualify the i7 and adopt the Xeon. We are not going to go to great lengths to set up integrity safeguards on disk only to be lax about the integrity of the data once it sits in RAM.

Memory Type: DDR2 vs DDR3

Because memory bandwidth is the limiting factor in most server operations, it's important to seek the highest performing memory architecture. Our decision of memory type is a straightforward one. Both the Core i7 and Xeon CPUs support triple-channel DDR3 memory. This is the best bus arrangement currently available on the x86 platform. It's also the chief reason we decided to go with the new i7/5500 architecture instead of something older.
The slower, dual-channel memory architecture of the Core 2 platform is a more serious hindrance than the lower number-crunching power of older CPUs. The only thing to remember is that we must install this memory in matched sets of three to take advantage of the triple-channel architecture.

Memory Speed: 800 vs 1066 vs 1333

With memory speed, faster is usually better. If we could drop 1333 MHz memory into this system, we would do it in a heartbeat. Unfortunately, our choice of processor has an unintended side effect, at least at the lower price points. The dirty truth, buried on page 11 of the Xeon 5500 series specification, is that not all 55xx processors support the highest-speed memory. The exact memory speeds supported by the Xeon differ as follows:
  • 5502 through 5506 only support 800 MHz RAM
  • 5520 through 5540 support 800 and 1066 MHz RAM
  • 5550 through 5580 support 800, 1066, and 1333 MHz RAM
Because of our price constraints, we have selected the Xeon 5506 for our new server. This means we must be content with memory running at 800 MHz. Because fast memory is so affordable, we'll go ahead and buy RAM capable of performing at 1333 MHz. This way, once the price of the 5550 through 5580 processors becomes more reasonable, we can drop in an upgraded CPU and immediately get the faster performance from the memory bus as well.

Memory Model and Voltage

Once we know the memory bus type and memory speed, we have one very important decision remaining. The choice of memory model and voltage is much more important with Nehalem-core CPUs than it was in recent history. Core i7 systems are quickly becoming notorious for instability, and it appears that the primary cause is poorly matched memory. We're going to play it safe and limit our memory to modules that are either on the motherboard manufacturer's supported memory list, or have been reported as tested and working by the community at large. For this reason, we delayed the choice of the individual memory module until after we had selected the motherboard.

Motherboard

Once we've made all the component decisions above, it should be a fairly simple matter of finding a motherboard that adequately connects all the components. In this case, we need two PCI-X slots, support for the Intel Xeon 5500 series processor, support for at least four SATA drives on the onboard controller, and an ATX form factor. One weakness of the Norco case we selected is that it does not support the larger EATX form factor motherboards. We also have a preference for all-Intel components on the motherboard, especially for the network controllers; these tend to be faster and more reliable than off-brand network controllers and are supported by OpenSolaris, but boards that use them are harder to find.

Plugging these search criteria into NewEgg yielded our prize: the SuperMicro X8SAX motherboard. As a bonus, this board provides three of the newer PCI-e slots. This gives us a clear upgrade path should we wish to attach pricier SAS disk controllers in the future.

This concludes our component-by-component tour. Let's look at the final list and bill.

Summary

Here is the list of components, along with a brief review of the deciding factors. If you skipped the wall of text above, please know that there is certainly more than one valid choice for each of these components. We highly advise against simply ordering what we have ordered. (For one thing, we haven't gotten far enough in our build to confirm that they indeed work together in OpenSolaris.) Please use this guide to help you make your own decisions to best fit your particular needs.

Disks: 1TB Western Digital Black Drives – $100 each x 20 = $2000
  • High density; but still within SATA spec
  • Trusted brand
  • Can have their firmware changed to act like more expensive “Raid Edition” RE3 drives
\r\nChassis:  Norco RPC-4020 4U Rack Case – $279
  • Hot swap cages for 20 drives
  • Significantly cheaper than external drive enclosures
  • Allows for direct SATA connectors (no immediate need for SAS multilane cards)
\r\nControllers:  SuperMicro AOC-SAT2-MV8 PCI-X 8 port SATA controller – $99 each x2 = $198
  • Cheaper than PCI-e SAS controllers
  • Still relatively fast
\r\nCPU: Intel Xeon 5506 – $269
  • Supports ECC where Core i7 does not
\r\nMemory:  12gb in two 6gb DDR3 kits:  Crucial 1.5v 1333mhz Cas 9 ECC – $108 each x2 = $216
  • ECC is a must for data integrity
  • 1.5v is important for motherboard compatibility
  • Will run at only 800 MHz with the Xeon 5506; but we can get 1333 MHz by dropping in an X5550 later
  • On the “tested memory” list for our motherboard.
\r\nMotherboard:  Supermicro MBD-X8SAX-0 – $260
  • Rare combination of 2x PCI-X and 3x PCI-e
  • Allows upgrade path to eSAS controllers in PCI-e later on
  • All-Intel chipset, including Intel gigabit LAN
  • Good reliability reports from NewEgg
  • ATX form factor important
  • 6 onboard SATA II ports, enough to drive the remaining 4 hot swap bays + 2 internal HDs
\r\nPower Supply: PC Power & Cooling 910 Watt – $170
  • Single 12 volt rail to best handle drive spin-up
  • High count of molex power adapters to sufficiently power our SATA backplane case
  • 24-pin, 8-pin, and 4-pin motherboard connectors for use with the SuperMicro X8SAX motherboard
This brings our total bill to $3,392 – safely within our $3,500 budget. Stay tuned as we discover how well these parts work together in OpenSolaris.

Hopefully this post has helped you navigate the maze of decisions required to build a medium-sized white-box ZFS server. Our next post will cover the OpenSolaris installation process. We'll then stop to take an in-depth look at the design decisions for setting up ZFS, and walk through each command required to assemble our twenty drives into a single pool of storage. We'll then compare and contrast multiple configuration options, run benchmarks, and select an implementation to keep.

Please subscribe to this blog via RSS to be notified of the next part in this series.

#4 · Posted 2010-09-05 04:40
ZFS Performance Preview on Twenty 1TB Drives

We're in the process of benchmarking the new ZFS storage server. Here's a sneak peek at the speeds we're able to achieve.

First, the configuration of the ZFS pool. We clustered the twenty drives into four virtual devices of five disks each:
# zpool create tank raidz c7t0d0 c7t1d0 c7t2d0 c7t3d0 c7t4d0 raidz c7t5d0 c7t6d0 c7t7d0 c8t0d0 c8t1d0 raidz c8t2d0 c8t3d0 c8t4d0 c8t5d0 c8t6d0 raidz c8t7d0 c13d0 c14d0 c11d1 c12d1
# zpool status -v tank
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c7t0d0  ONLINE       0     0     0
            c7t1d0  ONLINE       0     0     0
            c7t2d0  ONLINE       0     0     0
            c7t3d0  ONLINE       0     0     0
            c7t4d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c7t5d0  ONLINE       0     0     0
            c7t6d0  ONLINE       0     0     0
            c7t7d0  ONLINE       0     0     0
            c8t0d0  ONLINE       0     0     0
            c8t1d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c8t2d0  ONLINE       0     0     0
            c8t3d0  ONLINE       0     0     0
            c8t4d0  ONLINE       0     0     0
            c8t5d0  ONLINE       0     0     0
            c8t6d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c8t7d0  ONLINE       0     0     0
            c13d0   ONLINE       0     0     0
            c14d0   ONLINE       0     0     0
            c11d1   ONLINE       0     0     0
            c12d1   ONLINE       0     0     0

errors: No known data errors
Writing a trivial 100 GB file:
# time (mkfile 100g /tank/foo)
real    2m22.207s
user    0m0.338s
sys     0m43.324s

That yields a sustained write speed of 704 MB/sec! Granted, things will be slower once we test real-world loads with random reads, but this is getting our taste buds wet! We're looking into filebench as a more comprehensive benchmark tool.

In addition to different workloads, we'll also benchmark various zpool configurations. Stay tuned.
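For anyone checking the math, the headline figure works out from the elapsed time with the file counted in decimal units: 100 GB ≈ 100,000 MB, and 100,000 MB ÷ 142.2 s ≈ 703 MB/s. (mkfile's "100g" is actually 100 GiB ≈ 107,374 MB, which would put the sustained rate closer to 755 MB/s.)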

#5 · Posted 2010-09-05 04:45
ZFS Performance RAID 10 – A stripe of mirrors (2×10)

This is a preview of the upcoming performance evaluation of our ZFS server.

In this post, we'll explore the performance of configuring the 20 drives as a stripe of 2-way mirrors. This effectively halves the usable storage capacity of the system, but it should yield better random read performance in terms of operations per second. Let's see exactly how fast it goes.

Creating the pool:
# zpool create tank mirror c7t0d0 c8t0d0 mirror c7t1d0 c8t1d0 mirror c7t2d0 c8t2d0 mirror c7t3d0 c8t3d0 mirror c7t4d0 c8t4d0 mirror c7t5d0 c8t5d0 mirror c7t6d0 c8t6d0 mirror c7t7d0 c8t7d0 mirror c11d1 c12d1 mirror c13d0 c14d0

The configuration:
# zpool status -v tank

  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c7t0d0  ONLINE       0     0     0
            c8t0d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c7t1d0  ONLINE       0     0     0
            c8t1d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c7t2d0  ONLINE       0     0     0
            c8t2d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c7t3d0  ONLINE       0     0     0
            c8t3d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c7t4d0  ONLINE       0     0     0
            c8t4d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c7t5d0  ONLINE       0     0     0
            c8t5d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c7t6d0  ONLINE       0     0     0
            c8t6d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c7t7d0  ONLINE       0     0     0
            c8t7d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c11d1   ONLINE       0     0     0
            c12d1   ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c13d0   ONLINE       0     0     0
            c14d0   ONLINE       0     0     0
Writing 100 GB:

# time (mkfile 100g /tank/foo)

real    3m38.322s
user    0m0.335s
sys     0m38.988s
This yields 458 MB/sec for writes. As expected, this is lower than a raidz stripe because we cannot divide the workload across as many devices: each drive effectively has to write a tenth of the overall file because of the pairings. This works out to only 45 MB/sec per drive, which indicates that we're bus-limited here.

One benefit a hardware RAID controller would have over ZFS is in writing to mirrors: the CPU could write the data once to the controller, which in turn would duplicate it across each mirrored pair. In the case of ZFS, 916 MB/sec had to transit the PCI-X bus in order to instruct each drive individually, effectively sending each byte of data twice.

Now let's shift gears and look at performance in terms of operations per second. We set up filebench to run a series of random reads. The goal here is to measure how fast the filesystem can seek to new information. By using a small read size of 16 kilobytes, we ensure that the system will spend most of its time seeking for data and a minimal amount of time actually reading it. In this way we push the system to the max to see how many read operations can be performed per second. This is a scenario in which mirrored configurations are ideal: with a mirror, every piece of data exists in more than one location, so we expect read performance to be near double that of a lone drive, since only half the devices need be accessed to serve an individual request.
Here's the filebench configuration profile:

DEFAULTS {
        runtime = 60;
        dir = /tank/filebench;
        stats = /tmp/filebench;
        filesystem = zfs;
        description = "randomread zfs";
}

CONFIG randomread16k {
        function = generic;
        personality = randomread;
        filesize = 100000m;
        iosize = 16k;
        nthreads = 4;
}
And here are the results:

rand-read1                412ops/s   6.4mb/s      9.7ms/op       65us/op-cpu
412 operations per second, or 9.7 ms per operation. The 6.4 MB/sec figure is an indication that we did a reasonable job of setting up the test to avoid time spent reading data; that number is low by design. Should we actually have an application that randomly reads data in 16-kilobyte chunks, there's more we could do to optimize throughput, such as tuning the recordsize property and adding additional mirrors (a quick sketch follows).
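A rough sketch of those two knobs – the dataset and extra device names are illustrative, and recordsize only affects newly written blocks:

# match the dataset's record size to the application's 16 KB random I/O
zfs set recordsize=16K tank/filebench
# add another mirrored pair to the stripe to spread random reads wider
zpool add tank mirror c15d0 c16d0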

#6 · Posted 2010-09-05 04:47
ZFS Performance – RAIDZ vs RAID 10

Just yesterday we looked at RAID 10 (stripe of mirrors) performance in ZFS. We sacrificed a lot of usable capacity to cluster the twenty drives in mirrored sets. Today we'll set up the drives in a series of raidz stripes and see what type of performance trade-offs we make for the extra usable space.

For this test we set up the 20 drives in four raidz virtual devices of five drives each. This yields an overall parity-to-data ratio of 1:4, or 25%. This means that one fifth of our drive capacity is used for parity.

$ zpool status -v tank
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c7t0d0  ONLINE       0     0     0
            c7t1d0  ONLINE       0     0     0
            c7t2d0  ONLINE       0     0     0
            c7t3d0  ONLINE       0     0     0
            c11d1   ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c7t4d0  ONLINE       0     0     0
            c7t5d0  ONLINE       0     0     0
            c7t6d0  ONLINE       0     0     0
            c7t7d0  ONLINE       0     0     0
            c12d1   ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c8t0d0  ONLINE       0     0     0
            c8t1d0  ONLINE       0     0     0
            c8t2d0  ONLINE       0     0     0
            c8t3d0  ONLINE       0     0     0
            c13d0   ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c8t4d0  ONLINE       0     0     0
            c8t5d0  ONLINE       0     0     0
            c8t6d0  ONLINE       0     0     0
            c8t7d0  ONLINE       0     0     0
            c14d0   ONLINE       0     0     0

errors: No known data errors
Let's first do our standard 100 GB write test:

# time (mkfile 100g /tank/foo)

real    2m29.438s
user    0m0.336s
sys     0m40.903s

This yields a write performance of 669 MB/sec, faster than the RAID 10 result of 458 MB/sec. This can be attributed to spreading the workload over more devices, as only 6.25 GB was written to each drive; the stripe of mirrors required that 10 GB be written to each drive. The limiting factor for throughput is still the PCI-X bus, which carried 836 MB/sec of total information (data + parity) in order to support the payload of 669 MB/sec of data (669 × 1.25, due to the 4:1 data-to-parity ratio).

The following output compares the filebench results from yesterday's RAID 10 configuration with today's 5×4 raidz configuration:
raidz     5x4     rand-read1                273ops/s   4.3mb/s     14.6ms/op      114us/op-cpu
raid10   10x2     rand-read1                412ops/s   6.4mb/s      9.7ms/op       65us/op-cpu
Operations per second dropped from 412 to 273 – nearly a 34% drop in performance for small random reads. This is because more devices had to participate in each individual read operation, reducing the speedup from parallel reads that mirrors make possible.

That's all for today, but please stay tuned. Next week we'll compare raidz1 and raidz2 performance, compare various configurations of device clustering (one large stripe, vs two stripes, vs four), and look into how performance scales with the number of devices by growing a stripe one drive at a time.
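For reference, the cycle between configurations is nothing fancier than destroying the pool, recreating it in the next candidate layout, and rerunning the same tests. A rough sketch (the raidz2 grouping here is just one candidate layout, chosen for illustration):

# tear down the previous layout (destroys all data in the pool!)
zpool destroy tank
# recreate it in the next candidate layout, e.g. two 10-drive raidz2 vdevs
zpool create tank \
  raidz2 c7t0d0 c7t1d0 c7t2d0 c7t3d0 c7t4d0 c7t5d0 c7t6d0 c7t7d0 c8t0d0 c8t1d0 \
  raidz2 c8t2d0 c8t3d0 c8t4d0 c8t5d0 c8t6d0 c8t7d0 c11d1 c12d1 c13d0 c14d0
# repeat the same sequential write and filebench runs
time (mkfile 100g /tank/foo)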

#7 · Posted 2010-09-05 04:51
ZFS iostat – Monitoring live performance

ZFS includes a built-in utility to monitor the performance of a zpool. By default, typing “zpool iostat” will return the overall statistics since the time the array was created. While this may be useful for determining how stressed the system is over the long haul, we really prefer to see “live” data. Adding a “5” to the end of the command gives us data updated every five seconds.

This is an excellent opportunity to demonstrate some of the performance characteristics of mirroring:

$ zpool iostat -v tank 5
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank         163G  8.90T  1.69K  2.03K   214M   251M
  mirror    16.3G   912G    172    211  21.3M  25.1M
    c7t0d0      -      -     85    207  10.7M  25.0M
    c8t0d0      -      -     84    208  10.6M  25.1M
  mirror    16.3G   912G    173    209  21.5M  24.9M
    c7t1d0      -      -     86    206  10.8M  24.9M
    c8t1d0      -      -     86    206  10.7M  24.9M
  mirror    16.3G   912G    172    203  21.3M  24.9M
    c7t2d0      -      -     85    202  10.6M  24.9M
    c8t2d0      -      -     85    202  10.7M  24.9M
  mirror    16.3G   912G    171    203  21.4M  25.2M
    c7t3d0      -      -     85    203  10.7M  25.2M
    c8t3d0      -      -     85    203  10.7M  25.2M
  mirror    16.3G   912G    172    205  21.4M  25.3M
    c7t4d0      -      -     85    204  10.7M  25.2M
    c8t4d0      -      -     85    205  10.7M  25.3M
  mirror    16.3G   912G    173    204  21.5M  25.2M
    c7t5d0      -      -     86    204  10.8M  25.2M
    c8t5d0      -      -     86    204  10.8M  25.2M
  mirror    16.3G   912G    173    208  21.5M  25.3M
    c7t6d0      -      -     86    205  10.8M  25.3M
    c8t6d0      -      -     85    206  10.7M  25.3M
  mirror    16.3G   912G    174    210  21.6M  25.2M
    c7t7d0      -      -     86    207  10.8M  25.2M
    c8t7d0      -      -     86    206  10.8M  25.2M
  mirror    16.3G   912G    173    207  21.4M  24.8M
    c11d1       -      -     86    204  10.8M  24.8M
    c12d1       -      -     85    203  10.7M  24.7M
  mirror    16.3G   912G    174    210  21.5M  24.9M
    c13d0       -      -     86    206  10.8M  24.8M
    c14d0       -      -     85    207  10.7M  24.9M
----------  -----  -----  -----  -----  -----  -----

This was run while we were copying a file within the “tank” pool. You can see that the overall read performance for the zpool is 214 MB/sec, and the write performance was 251 MB/sec.

Each virtual device is represented by the lines that contain the text “mirror.” Here you can see that each virtual device was supporting around 21 MB/sec of read activity and 25 MB/sec of write activity. This is fairly evenly distributed among the many striped virtual devices.

Looking one level of detail further, we can see the real advantage to mirroring: read performance:
  mirror    16.3G   912G    172    211  21.3M  25.1M
    c7t0d0      -      -     85    207  10.7M  25.0M
    c8t0d0      -      -     84    208  10.6M  25.1M
Notice that although we are pulling 21.3 MB/sec off this mirrored set of drives, each drive only has to access ~10.5 MB of data each second. With more read-intensive applications, mirroring will help read performance greatly.
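One usage note: zpool iostat also accepts an optional sample count after the interval, and dropping -v prints just the pool-level totals. A small sketch (the count shown is arbitrary):

$ zpool iostat tank 5 12    # one summary line every 5 seconds, 12 samples, then exit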