It’s time for primary storage optimization


Capacity optimization is no longer limited to secondary storage. The advantages are the same when applied to primary storage.
By Eric Burgener
The amount of information generated by enterprises has greatly increased over the last five years. Because enterprises often keep multiple copies of data for recovery and other purposes, storage capacity growth is a multiple of information growth, and it is spiraling out of control, hitting rates of 50% to 100% per year or more for many companies. Growth at this level puts undue pressure on IT organizations, not only to pay for and manage all of this storage, but also to find floor space for it and to power and cool it.
In the 2004 time frame, technologies began to emerge that allowed information to be stored with much lower raw storage capacity requirements. These technologies, sometimes referred to as capacity optimized storage (COS), have now become widely available, with end-user surveys indicating strong growth for COS products over the next 6 to 12 months. Vendors in this space include Data Domain, Diligent Technologies, FalconStor Software, Hifn, NetApp, Quantum, Sepaton, and others.

COS technologies were originally designed for use against secondary storage, such as that used primarily for data-protection purposes. Secondary storage has certain characteristics that figured heavily in how COS solutions were built. First and foremost, since there was so much redundancy in the data stored for data-protection purposes, COS solutions heavily leveraged technologies such as data de-duplication and single-instancing to achieve their data-reduction results. In addition, since most secondary storage was stored in offline rather than online environments, the capacity optimization process did not have to meet the stringent performance requirements of online application environments. Using COS solutions, it is realistic over time to achieve data reduction ratios against secondary storage of 15:1 or greater.

But COS left out a huge amount of data found in all application environments: primary storage, which is different from secondary storage in two critical respects: 1) primary storage is used in online, performance-sensitive environments that have stringent response-time requirements; and 2) primary storage has little, if any, of the redundancy that makes technologies like data de-duplication and single instancing so effective against secondary storage. Recently, however, a few vendors have begun shipping capacity optimization solutions specifically for use against primary storage, and the COS market is now splitting into two separate segments: primary storage optimization (PSO) and secondary storage optimization (SSO).

This article discusses the emerging PSO market, reviews the architectures and technologies, and highlights some of the vendors' products in this space.
Defining an emerging market
In discussions with end users, Taneja Group has discovered that many companies have tried COS technologies on primary storage in test environments. There was a burning curiosity to see how effective COS technologies would be against the data sets used by online applications. What end users have discovered, and many COS vendors undoubtedly have proven in their own internal testing, is that while COS technology offers huge benefits in cost and floor space savings against secondary storage, it achieves much lower data-reduction ratios against primary storage. Because of their strong showing in the SSO market, data de-duplication vendors may have a natural advantage in going after the PSO market, but they will clearly need to develop new technologies to do so effectively.

Because it has very different customer requirements and requires different technologies, PSO is clearly a separate market from SSO. Up to this point, use of the term COS has been synonymous with secondary storage. But with the advent of solutions specifically targeted at primary storage, it is useful to redefine the COS market. One approach might define a new, higher-level market called "storage capacity optimization" (SCO), with its two related sub-markets of PSO and SSO. The overall customer requirement in the storage capacity optimization market is to reduce the amount of raw storage capacity required to store a given amount of information. The two sub-markets define the different sets of technologies required to achieve that for primary versus secondary storage.
Approaches to SCO
Vendors tend to fall into two camps with respect to SCO approaches: inline and post-processing. Each approach may be implemented using different architectures. Most solutions offload the capacity optimization processing from the application server, performing it with dedicated resources in a card, appliance, or storage subsystem. In contrast, capacity optimization algorithms embedded in operating systems (such as those offered by Microsoft in Windows) and in backup software agents (such as the Symantec NetBackup PureDisk agent) leverage host resources; this latter approach is a form of inline capacity optimization.
The inline group believes that maintaining the lowest possible storage requirements at all times is the most important metric. These vendors' products intercept the data and capacity-optimize it before it is ever written to disk. While this does keep overall storage requirements at their lowest possible levels at all times, it does present a performance challenge: whatever work must be done to capacity-optimize the data must be done so it does not impact performance in a meaningful way. For applications using secondary storage, the performance bar is low since they are not interactive. For primary storage applications, however, this is much more of an issue. Although performance impacts assert themselves differently for PSO and SSO solutions, vendors must architect their solutions so they do not impact performance, or impact it only minimally, from an end-user point of view.
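To make the inline model concrete, here is a minimal Python sketch, illustrative only and not any vendor's actual implementation, in which the optimization happens on the write path so that only optimized bytes ever reach disk; zlib stands in for the proprietary algorithms these products use, and the logical-to-physical mapping a real appliance would maintain is elided.

    import zlib

    def optimize_inline(logical_block: bytes) -> bytes:
        """Compress a block on the write path, so only optimized bytes reach disk."""
        return zlib.compress(logical_block, level=1)   # fast LZ-style pass, stand-in only

    def restore(stored_block: bytes) -> bytes:
        """Reads reverse the transform transparently to the application."""
        return zlib.decompress(stored_block)

    # An in-band appliance would intercept each write, store optimize_inline(data),
    # and keep a logical-to-physical map (omitted here) so reads return original bytes.

The cost of the approach is visible in optimize_inline(): that work sits on the critical I/O path, which is why inline products must run at wire speed.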
The post-processing group believes that the impact on application performance is the key metric. With SSO solutions, these vendors take approaches that will not impact the performance of the initial backup in any way. Post-processing approaches generally write non-optimized data directly to disk and then make a second pass at the data to perform the capacity optimization. Think of this approach as analogous to a trash compactor that can be used on demand (or on a scheduled basis) to reduce the raw storage capacity required to store any given data set. Policies can be implemented that perform the capacity optimization when the storage capacity reaches a defined threshold. The downside to this approach is that there must always be enough storage available to initially store each new data set in non-optimized form.
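The policy-driven "trash compactor" behavior can be sketched in a few lines; the volume object and its methods below are hypothetical, and zlib again stands in for whatever optimizer a real product applies on its second pass.

    import zlib

    CAPACITY_THRESHOLD = 0.80   # hypothetical policy: compact once the volume is 80% full

    def run_post_process_optimization(volume):
        """Second-pass optimization driven by a capacity policy.

        New data is always written in non-optimized form first; when utilization
        crosses the threshold (or a schedule fires), already-written objects are
        rewritten in optimized form, freeing raw capacity without touching the
        application write path.
        """
        if volume.used_bytes / volume.capacity_bytes < CAPACITY_THRESHOLD:
            return                                        # policy not triggered yet
        for obj_id, raw_bytes in volume.unoptimized_objects():
            volume.rewrite(obj_id, zlib.compress(raw_bytes))   # stand-in optimizer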
Underneath the covers of each approach, vendors offer different methods to actually perform the capacity optimization. Approaches that examine the data at a lower level of granularity (e.g., sub-file level instead of file level) tend to offer higher capacity optimization ratios, as do approaches that can apply variable-length windows (when doing data comparisons) instead of just fixed-length windows. Format-aware approaches, such as the tape-format-aware algorithms offered by FalconStor, can offer some additional data reduction relative to non-format-aware approaches. A detailed discussion of how these algorithms work is beyond the scope of this article.
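The fixed- versus variable-length distinction is easiest to see in code. The sketch below, a simplified stand-in for the Rabin-style rolling fingerprints real products use, cuts chunk boundaries wherever a content-derived hash hits a target pattern, so an insertion near the start of a file only disturbs nearby chunks and later chunks still match for de-duplication.

    def chunk_variable(data: bytes, avg_size: int = 4096, min_size: int = 64) -> list[bytes]:
        """Content-defined chunking: boundaries follow the data, not fixed offsets."""
        mask = avg_size - 1                  # avg_size is assumed to be a power of two
        chunks, start, h = [], 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) ^ byte) & 0xFFFFFFFF          # toy rolling hash over recent bytes
            if i - start >= min_size and (h & mask) == 0:
                chunks.append(data[start:i + 1])        # boundary found: emit a chunk
                start, h = i + 1, 0
        if start < len(data):
            chunks.append(data[start:])                 # trailing remainder
        return chunks

With fixed-length windows, the same insertion would shift every subsequent window and defeat most duplicate detection, which is why variable-length approaches tend to achieve higher ratios.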
One final comment on general SCO approaches concerns scalability. When looking for redundant data, capacity optimization basically breaks data down into smaller components and looks to eliminate redundancy at the building-block level. A data repository is maintained of all the building blocks the solution has "seen" before; when the solution finds another instance of one of these building blocks, it inserts a reference to the instance retained in the repository and removes the duplicate object. Solutions that retain the latest instance of a given building block, as opposed to the original instance (which tends to become fragmented over time), tend to offer better read performance. The larger the data repository, the greater the chance that any given building block already resides in it. Regardless of how large any single repository can be, solutions that can cluster repositories together to build a very large logical repository tend to offer better scalability. For many capacity optimization vendors, the ability to offer clustering forms the basis of their claim to be an enterprise solution.
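The building-block repository itself can be reduced to a map from fingerprints to retained blocks, as in this hedged sketch (SHA-256 here is merely a plausible choice, not what any particular vendor uses):

    import hashlib

    class DedupRepository:
        """Each unique building block is stored once, keyed by a strong hash;
        duplicates become references to the retained instance."""

        def __init__(self):
            self.blocks = {}                    # fingerprint -> retained block bytes

        def store(self, block: bytes) -> str:
            fp = hashlib.sha256(block).hexdigest()
            # If this block has been "seen" before, drop the duplicate; a product
            # following the retain-the-latest-instance policy described above would
            # instead overwrite the stored bytes with the newly seen copy.
            self.blocks.setdefault(fp, block)
            return fp                           # callers keep references, not data

        def fetch(self, fp: str) -> bytes:
            return self.blocks[fp]

Clustering, in these terms, means several nodes sharing or partitioning one such index, so the logical repository, and therefore the chance of a match, grows beyond what a single node could hold.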
PSO architectures
The first solutions in the PSO space started shipping in 2005 from Storwize. In 2007, NetApp released a product that is now called NetApp De-Duplication for FAS (formerly called Advanced Single Instance Storage, or A-SIS). Although NetApp initially positioned this Data ONTAP utility for use with secondary storage, in 2008 the company began positioning it for the primary storage market as well. Also this year, Hifn, an OEM supplier of hardware-accelerated security and enhanced compression solutions that appear "under the covers" in many products from enterprise storage suppliers, added PSO to its existing SSO message. And Ocarina Networks entered the PSO market with an announcement at this month's Storage Networking World conference.
Interestingly, each of these vendors uses a different architecture. Note that these different approaches characterize what happens to writes to storage; all PSO solutions handle reads of capacity-optimized data at wire speeds.
Inline approaches
Storwize uses an inline approach, with an in-band appliance (the STN Appliance) that offers real-time capacity optimization at wire speeds with no impact on application performance. Targeted for use with IP networks, Storwize's appliance uses patented methods designed specifically for primary storage, based on enhanced Lempel-Ziv algorithms, to achieve data reduction ratios averaging in the range of 3:1 to 9:1. The STN Appliances maintain caches that can in some instances provide better-than-native read performance. Multiple appliances can be clustered against a large, back-end data repository to support hundreds of terabytes of storage capacity. Deployment is transparent and does not require network reconfiguration. As the pioneer in the PSO market, Storwize offered solutions about two years before other vendors began to address this space.
Post-processing approaches
NetApp and Ocarina Networks offer post-processing approaches to avoid impacting the performance of online applications using primary storage.
NetApp believes that, over the next several years, capacity optimization technologies will migrate into the infrastructure layer and will be available as part of server, storage, and/or operating system platforms. NetApp's De-Duplication for FAS was recently bundled into its Data ONTAP software at no additional charge, adding to the overall value of NetApp's storage platforms. Because it resides in a storage server that can support either file- or block-based storage, it can be used against either or both to achieve data reduction ratios of up to 6:1 against primary storage and up to 20:1 against secondary storage.
In NetApp's case, there are two advantages to implementing capacity optimization as an operating system utility. First, it leverages close integration with NetApp's WAFL file system to incur extremely low overhead. For reliability purposes, WAFL already calculates a unique checksum, called a fingerprint, for each block of data. To de-duplicate data, NetApp simply uses these fingerprints (which are already calculated anyway) to search for and identify duplicate blocks in a separate batch process, adding no overhead to ordinary file operations. Second, it can be easily integrated with other Data ONTAP features to provide higher-level solutions. For example, integration with NetApp's Thin Provisioning supports an "autosize" feature that runs capacity optimization algorithms as needed to keep a given volume under a size defined by the administrator. Integration with SnapVault and SnapMirror allows capacity optimization to be leveraged by policy to help minimize storage requirements for snapshots or to minimize the amount of data sent across the network for disaster-recovery purposes.
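The batch nature of this design is the interesting part: the expensive work rides on checksums the file system computes anyway. The sketch below captures that idea only; the volume object and its methods are hypothetical and are not the Data ONTAP API.

    from collections import defaultdict

    def batch_dedupe(volume):
        """Group blocks by their already-computed fingerprints, then verify and remap
        duplicates in a separate batch pass, adding nothing to ordinary file operations."""
        by_fingerprint = defaultdict(list)
        for block_id, fingerprint in volume.fingerprints():     # precomputed checksums
            by_fingerprint[fingerprint].append(block_id)

        for block_ids in by_fingerprint.values():
            if len(block_ids) < 2:
                continue                                        # no candidates to merge
            keep = block_ids[0]
            for candidate in block_ids[1:]:
                # Fingerprints only nominate candidates; compare byte-for-byte before
                # remapping the duplicate to the retained block and freeing its space.
                if volume.read_block(candidate) == volume.read_block(keep):
                    volume.remap(candidate, to=keep)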
Targeted for use with IP storage, Ocarina Networks is the only vendor so far that offers format-aware optimization against primary storage. Although some SSO vendors have claimed that application-specific approaches to capacity optimization are not effective, this tends to be relevant only for inline capacity optimization. Format-aware capacity optimization takes more CPU cycles and does take slightly longer, but it does not present performance issues when used by out-of-band approaches.
At least one SSO vendor, FalconStor, is also using a post-processing, format-aware approach: it uses a tape-format-aware method in its SIR product.
Ocarina Networks' format-aware optimization recognizes file types and their contents, de-layers complex compound document types, can optimize already-compressed formats, and can perform de-duplication at the object level. Its out-of-band appliance, called the Optimizer, selects files by policy after they have been written to storage, identifies each file by type, and then routes it to the appropriate format-aware optimizer.
Ocarina offers format-aware optimizers for a number of common primary storage environments, including Microsoft Office workloads that contain home directories, PDFs, and digital photo sets (GIF, JPEG), and Internet e-mail mixes that include blogs, e-mails, and text messages. Ocarina claims to achieve data reduction ratios approximately 3x better than those achieved with enhanced Lempel-Ziv algorithms, but its products are too new to have end users in production who can support those claims.
Hybrid approaches
With its Express DR line of hardware acceleration cards, Hifn offers its OEM customers the option of deploying either inline or post-processing approaches. Hifn's customers often embed these cards into their virtual tape library (VTL) or backup appliance products, leveraging them as one of several capacity optimization methods that are serially applied against secondary storage. Hifn deploys the same set of proprietary methods, based on its own enhanced Lempel-Ziv compression algorithms, against both primary and secondary storage, but achieves realistic data reduction ratios against primary storage of 2:1 to 4:1. Because Hifn's cards can handle wire speeds of up to 1GBps, they can also be used in inline solutions.
All of these vendors offer, or will be offering, a predictive tool that can be deployed in an hour or two to estimate the data reduction ratios their PSO solutions can achieve in a particular environment. Since achievable data reduction ratios are very sensitive to the characteristics of different data types, use of such a tool prior to purchase is highly recommended. Storwize and Ocarina Networks offer such tools today, and Hifn plans to offer one later this year. NetApp does not need a separate tool for this, since its capacity optimization functionality ships at no additional charge with its operating system and can be enabled at the volume level to test against snapshots of primary production data without impacting ordinary file operations.
Benefits of PSO
Capacity optimization offers many of the same benefits for primary storage as it does for secondary storage. PSO reduces raw storage capacity growth in primary storage, lowering not only spending on new storage capacity, but also costs associated with storage management, floor space, power, and cooling. Note that for many IT shops, the cost per terabyte of primary storage is greater than that of secondary storage due to higher performance and reliability requirements. So while the data reduction ratios may not be as great with primary storage as with secondary storage, many shops will enjoy greater savings for each "reclaimed" terabyte of primary storage.
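A rough, back-of-envelope comparison (the capacities, costs, and ratios below are hypothetical placeholders, not measured figures) shows why a modest ratio against expensive primary storage can rival a much larger ratio against cheap secondary storage:

    # Hypothetical numbers for illustration only; substitute your own costs and ratios.
    primary_tb, primary_cost_per_tb, pso_ratio = 100, 5000, 3        # 3:1 on primary
    secondary_tb, secondary_cost_per_tb, sso_ratio = 300, 1000, 15   # 15:1 on secondary

    primary_saved = primary_tb * (1 - 1 / pso_ratio) * primary_cost_per_tb
    secondary_saved = secondary_tb * (1 - 1 / sso_ratio) * secondary_cost_per_tb
    print(f"PSO capacity cost avoided: ${primary_saved:,.0f}")       # ~$333,000
    print(f"SSO capacity cost avoided: ${secondary_saved:,.0f}")     # ~$280,000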
Because it lowers the overall capacity of primary storage, PSO offers enterprises of all sizes other benefits as well:
  • Shortened overall backup-and-restore times, since less data must be written to or retrieved from disk for any given data set; and
  • In cases where data sets must be shipped across networks, the smaller, capacity-optimized data sets require less bandwidth, thereby reducing network traffic.


Note that PSO can be a complementary technology to SSO. Solutions that use different capacity optimization methods for primary and secondary storage can actually offer additive data reduction advantages. Data reduction ratios with combined use will vary based on the actual solutions used and the workload types. The only way to really understand the benefit that PSO, or a combination of PSO and SSO, will provide is to test it on specific workloads.
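As a hedged back-of-envelope only, complementary stages multiply out; the ratios below are placeholders, and in practice the second stage finds less redundancy in data the first stage has already optimized:

    pso_ratio = 3        # e.g., 3:1 achieved against primary storage
    sso_ratio = 4        # further reduction the SSO stage still finds downstream
    print(f"Effective end-to-end reduction: {pso_ratio * sso_ratio}:1")   # 12:1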
Challenges with PSO
Several issues need to be evaluated as PSO technology is considered. First, does it really pose no performance impact in your environment? This is not just a concern for inline approaches. Understand how long it will take a post-processing solution to complete its capacity optimization task. What is the impact (if any) on online application performance during this process? This is less of a concern for out-of-band approaches than for in-band approaches, but keep in mind that out-of-band approaches do actually move data back and forth to primary storage during the process.
Although it is not an unfamiliar challenge, another concern with PSO is how to retrieve capacity-optimized data in the event of a problem with the PSO solution. For hardware-based solutions, simple redundancy at the appliance or card level can be sufficient to handle any single point of failure. Note that, unlike with encryption, there is nothing random about how data is capacity-optimized: the same capacity optimization methods are predictably used across all models of a certain type in a vendor's product line, so any other similar model could be used to retrieve the data.
A final concern is one shared by both PSO and SSO. Because the technologies basically refer to a single instance of an object that appears multiple times, if for some reason that object gets corrupted, the damage can potentially be much greater than just losing that object in non-capacity-optimized storage. Depending on the data reduction ratio achieved, loss of a single object could potentially affect thousands of instances of that object in each of the files, file systems, or databases where it also appears. The first line of defense against this is that most customers are already using some form of RAID, providing redundancy against single points of failure at the hardware level.
Over and above the hardware RAID approach, vendors offer two additional, optional methods to address this issue: integrated metadata and multiple spindling.
With the integrated metadata approach, a PSO solution effectively implements a virtualization layer that handles the abstraction of a single physical copy of an object to however many redundant copies exist. Each time the virtualization layer creates a reference to an object, metadata is saved along with that object that effectively enables its re-creation in the event the primary instance becomes corrupted. In a manner similar to how chkdsk can be used to rebuild an NTFS file system block by block, this metadata can be used to re-create any object (albeit at a relatively slow rate).
The multiple spindling approach ensures that no single data element exists only on one spindle in a given logical volume. Think of this as "mirroring" at the object level, ensuring any single data element is always available on at least two spindles. Both of these approaches, while offering improved data reliability, do lower the overall data reduction ratio slightly.
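Multiple spindling reduces to a placement rule, sketched here with a hypothetical list of spindle objects; the deterministic hash simply guarantees the two chosen spindles differ whenever at least two exist.

    import hashlib

    def place_with_spindling(block_fingerprint: str, spindles: list) -> list:
        """Object-level "mirroring": every unique building block lands on at least two
        different spindles, so losing one spindle never destroys the only physical
        instance of a widely referenced block."""
        h = int(hashlib.md5(block_fingerprint.encode()).hexdigest(), 16)
        primary = spindles[h % len(spindles)]
        mirror = spindles[(h + 1) % len(spindles)]      # always distinct from primary
        return [primary, mirror]

    # The trade-off noted above: each unique block is stored twice, so the overall
    # data reduction ratio drops slightly in exchange for better resilience.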
Recommendations
Primary storage is an area ripe for capacity optimization, and enterprises of all sizes have a lot to gain from deploying reliable implementations of this technology over time. The compelling economic payback is achieved not only against your most expensive storage tier; the benefits of data reduction then roll through the other tiers (nearline, offline) as well. If you are considering deploying both PSO and SSO (which is a smart idea in the long term), choose complementary technologies for each tier to maximize your overall data reduction ratios. SSO has achieved penetration rates of about 20% in the industry (slightly higher in small and medium-sized enterprises), but shows strong purchase intent over the next 6 to 12 months across all segments. As enterprises come to trust SSO, this will help PSO achieve similar penetration rates more rapidly.
As with any emerging technology, there are questions of performance and reliability. Demand solutions that impose no perceivable performance impacts. Use high-availability approaches such as clustering PSO appliances to help address reliability issues. While there are many reference customers, enterprises have been cautious in their deployment of PSO to date. Decide up front whether inline or post-processing approaches are better suited to your environment and requirements. Check references and use predictor tools if possible. If a PSO solution cannot offer data reduction ratios of at least 3:1 against your particular workloads, it may not pay to implement it over basic compression technologies that are proven and widely available. Expect to achieve realistic data reduction ratios in the 5:1 range across varied workloads with PSO technology.
Eric Burgener is a senior analyst and consultant with the Taneja Group research and consulting firm (www.tanejagroup.com).
The difference between compression and data de-duplication

Often based on Lempel-Ziv (LZ) algorithms, compression uses an encoding scheme against data to reduce the number of bits required to represent it. The result can then be decoded to retrieve the original data.

There are two types of compression: lossless and lossy. Lossless compression can exactly re-create the original string, while lossy compression can only re-create a close approximation of the original string; lossy compression tends to offer slightly higher data reduction ratios. Data reduction ratios using compression generally are not greater than 2:1.
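A few lines of Python make the lossless property and the ratio tangible; zlib implements an LZ77-plus-Huffman scheme, and the deliberately repetitive sample here compresses far better than the roughly 2:1 cited above for ordinary data.

    import zlib

    original = b"Taneja Group " * 1000                 # artificially repetitive sample
    packed = zlib.compress(original)                   # lossless LZ-based encoding
    assert zlib.decompress(packed) == original         # exact reconstruction

    print(f"{len(original)} bytes -> {len(packed)} bytes "
          f"({len(original) / len(packed):.1f}:1)")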
Data de-duplication, like compression, takes advantage of the fact that most data has some statistical redundancy to reduce the number of bits required to represent it. Using higher-level algorithms that generally operate at the sub-file level (compression operates at the file level), data de-duplication looks for patterns within files that also appear in other files, and it generally achieves much higher data reduction ratios than standard compression.

Some vendors offer global data de-duplication repositories that can be used to de-duplicate data across systems, whereas compression references are specific to one system. Data reduction ratios using data de-duplication can be 15:1 or greater for secondary data sets such as repeated backups over time.
               
               
               

This article comes from the ChinaUnix blog. To view the original post, see: http://blog.chinaunix.net/u/2671/showart_1106468.html