SNIA on STORAGE

Evaluation criteria for data de-dupe

How does it work? What are the different implementation methods? And what are the key evaluation criteria?

By Larry Freeman, Rory Bolt, and Tom Sas

DATA DE-DUPLICATION IS THE PROCESS of eliminating redundant copies of data. The term "data de-duplication" was coined by database administrators many years ago as a way of describing the process of removing duplicate database records after two databases had been merged.

Today the original definition of de-duplication has been expanded. In the context of storage, de-duplication refers to any algorithm that searches for duplicate data objects (e.g., blocks, chunks, files) and stores only a single copy of those objects. The user benefits are clear:
■ Reduces the space needed to store data; and
■ Increases the available space to retain data for longer periods of time.

Due to the growing interest in data de-duplication and space reduction solutions, the SNIA DMF Data Protection Initiative has recently been tasked with forming a Special Interest Group (SIG) focusing on this topic. This is the first in a series of publications from SNIA on the topic of de-duplication and space reduction. The mission of the DDSR SIG is to bring together a core group of companies that will work together to publicize the benefits of data de-duplication and space savings technologies. Anyone interested in participating can help form the direction and ultimate success of the group. Find out more at www.snia-dmf.org/dpi.

How it works
Regardless of operating system, application, or file-system type, all data objects are written to a storage system using a data reference pointer, without which data could not be located or retrieved. In traditional (non-de-duplicated) file systems, data objects are stored without regard to any similarity with other objects in the same file system. Identifying duplicate objects and redirecting reference pointers form the basis of the de-duplication algorithm.

As shown in the figure, referencing several identical objects with a single "master" object allows the space normally occupied by the duplicate objects to be "given back" to the storage system.

[Figure: De-duplication pointers. Reference pointers for duplicated objects versus de-duplicated reference pointers, where several identical objects share a single master copy.]

Design considerations
Given the fact that all de-duplication technologies must identify duplicate data and support some form of referencing, there is a surprising variety of implementations, including the use of hashes, indexing, fixed object length or variable object length de-duplication, local or remote de-duplication, inline or post-processing, and de-duplicated or original format data protection.

Use of hashes
Data de-duplication begins with a comparison of two data objects. It would be impractical (and very arduous) to scan an entire data volume for duplicate objects each time a new object was written to that volume. For that reason, de-duplication systems create relatively small hash values for each new object to identify potential duplicate data.

A hash value, also called a digital fingerprint or digital signature, is a small number generated from a larger string of data. Hash values are generated by a mathematical formula in such a way that it is extremely unlikely (but not impossible) for two non-identical data objects to produce the same hash value. In the event that two non-identical objects do map to the same hash value, this is termed a "hash collision."

Evaluation criteria
Understanding a system's use of hashes is an important criterion when you are evaluating de-duplication. If the technology depends solely on hashes to determine if two objects are identical, then there is the possibility, however remote, that hash collisions could occur and some of the data referencing the object that produced the collision will be corrupt. Certain government regulations may require you to perform secondary data object validation after the hash compare has completed to ensure against hash collisions. Although concern over hash collisions is often raised, depending upon the hash algorithm used and the system design, the probability of a hash collision may actually be orders of magnitude less than the probability of an undetected disk read error returning corrupt data.
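To make the hash-compare-then-validate flow concrete, here is a minimal sketch in Python (not taken from any product; SHA-256 as the fingerprint function and the in-memory store are assumptions for illustration). It fingerprints each incoming object, treats a matching fingerprint as a de-duplication candidate, and optionally byte-compares the candidate to guard against hash collisions; the final lines estimate the birthday-bound collision probability the article alludes to.

```python
import hashlib

# store maps fingerprint -> the single stored copy of that object
store: dict[str, bytes] = {}

def fingerprint(obj: bytes) -> str:
    """Digital fingerprint (hash value) of a data object; SHA-256 is assumed here."""
    return hashlib.sha256(obj).hexdigest()

def write_object(obj: bytes, validate: bool = True) -> str:
    """Store an object, de-duplicating against previously stored objects.

    With validate=True, a byte-for-byte comparison follows the hash
    compare, i.e., secondary data object validation.
    """
    fp = fingerprint(obj)
    if fp in store:
        if validate and store[fp] != obj:
            # Two different objects produced the same hash value.
            raise RuntimeError("hash collision detected")
        return fp                      # duplicate: reference the existing copy
    store[fp] = obj                    # new object: keep the single copy
    return fp

# Birthday-bound estimate: with n unique objects and a b-bit hash,
# the probability of any collision is roughly n*(n-1) / 2**(b+1).
n, b = 10**12, 256                     # one trillion objects, SHA-256
print(n * (n - 1) / 2 ** (b + 1))      # ~4e-54, far below undetected disk-error rates
```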
Indexing
Once duplicate objects have been identified (and optionally validated), removal of the duplicate object can commence. There are varying methods that systems employ when modifying their data pointer structures. However, all forms of this indexing fall into four broad categories:

Catalog-based indexing—A catalog of hash values is used only to identify candidates for de-duplication. A separate process modifies the data pointers accordingly. The advantage of catalog-based de-duplication is that the catalog is only utilized to identify duplicate objects and is not accessed during the actual reading or writing of the de-duplicated data objects; that task is handled via the normal file-system data structure.

Lookup table-based indexing—Extends the functionality of the hash catalog to also contain a hash lookup table to index the de-duplicated object's "parent" data pointer. The advantage of a lookup table is that it can be used on file systems that do not support multiple block referencing; a single data object can be stored and "referenced" many times via the lookup table. Lookup tables may also be used within systems that provide block-level services instead of file systems.

Content-addressable store, or CAS-based, indexing—The hash value, or digital signature of the data object itself, may be used by itself or in combination with additional metadata as the data pointer. In a content-addressable store (CAS), the storage location is determined by the data being stored. Advantages of CAS-based indexing include inherent single instancing/de-duplication, as well as enhanced data integrity capabilities and the ability to leverage grid-based storage architectures. Although CAS systems are inherently object-based, file-system semantics can be implemented above the CAS.

Application-aware indexing—Differs from other indexing methods in that it looks at data as objects. Unlike hashing or byte-level comparisons, application-aware indexing finds duplication in application-specific byte streams. As the name implies, this approach compares like objects (such as Excel documents to Excel documents) and has awareness of the data structure of these formats.

Evaluation criteria
De-duplication indexing is an important consideration in technology evaluation, particularly when it comes to resiliency of design. When indexing data objects, the index itself could become a single point of failure. It is important to understand what, if any, single points of failure are present in a de-duplication system. It is equally important to understand what measures are used to protect these single points of failure to minimize the risk of data loss.

Another indexing consideration is the speed of the index. An inordinate amount of time should not be required to store and retrieve data objects, even when millions of objects are stored in the file system. When evaluating de-duplication, consider both lightly loaded and fully loaded file systems and the potential performance degradation caused by indexing as more and more data is written to the file system.
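To picture lookup table-based indexing, the following sketch (illustrative Python again; the Entry layout and the toy block allocator are assumptions, not any vendor's on-disk format) keeps a hash lookup table that maps each fingerprint to the parent object's location plus a reference count, so one stored object can be referenced many times and its space is given back only when the last reference goes away.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Entry:
    location: int                      # where the single stored copy lives
    refcount: int                      # how many logical objects reference it

lookup: dict[str, Entry] = {}          # hash value -> "parent" data pointer
next_block = 0                         # toy allocator for physical locations

def store_object(obj: bytes) -> int:
    """Return the physical location for obj, reusing an existing copy if present."""
    global next_block
    fp = hashlib.sha256(obj).hexdigest()
    entry = lookup.get(fp)
    if entry is None:                  # first time this content is seen
        entry = Entry(location=next_block, refcount=0)
        next_block += 1                # a real system would write obj to disk here
        lookup[fp] = entry
    entry.refcount += 1                # one more logical reference to the parent
    return entry.location

def release_object(obj: bytes) -> None:
    """Drop one reference; reclaim the space when the last reference goes."""
    fp = hashlib.sha256(obj).hexdigest()
    entry = lookup[fp]
    entry.refcount -= 1
    if entry.refcount == 0:
        del lookup[fp]                 # space is "given back" to the storage system
```

A catalog-based design, by contrast, would consult a structure like this only while identifying duplicates and leave reads and writes to the normal file-system pointers, while a CAS goes further and derives the storage location itself from the fingerprint.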


Fixed object length or variable object length de-duplication
De-duplication may be performed on fixed-size data objects or on variable-size data objects.

With fixed object length de-duplication, data may be de-duplicated on fixed object boundaries such as 4K or 8K blocks. The advantage of fixed block de-duplication is that there is less computational overhead in computing where to delineate objects, less object overhead, and faster seeking to arbitrary offsets.

With variable object length de-duplication, data may be de-duplicated on variable object boundaries. The advantage of variable object size de-duplication is that it allows duplicate data to be recognized even if it has been logically shifted with respect to physical block boundaries. This can result in much better de-duplication ratios.

Evaluation criteria
Fixed object length de-duplication offers processing advantages and performs well in both structured data environments (e.g., databases) and in environments where data is only appended to files. In unstructured data environments such as file servers, variable object length de-duplication is able to recognize data that has shifted position as the result of edits to a file. Variable object length de-duplication typically offers greater de-duplication in unstructured data environments. Since fixed object length is a subset case of variable object size, many systems capable of variable object length de-duplication also offer fixed object length.
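The distinction can be sketched as follows (illustrative Python; the 8KB block size, the byte-sum "rolling hash," and the chunk-size parameters are arbitrary assumptions). Fixed-length chunking cuts on byte offsets, while content-defined (variable-length) chunking cuts wherever a hash of the most recent bytes hits a chosen pattern, so boundaries move with the data rather than with offsets.

```python
import random

def fixed_chunks(data: bytes, size: int = 8192):
    """Cut data on fixed object boundaries (e.g., 8K blocks)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data: bytes, window: int = 48,
                    mask: int = 0x0FFF, min_size: int = 1024):
    """Content-defined (variable object length) chunking sketch.

    The "rolling hash" here is just the sum of the last `window` bytes;
    a boundary is declared when its low bits are zero, so boundaries
    follow the content rather than byte offsets. Real products use
    stronger rolling hashes (e.g., Rabin fingerprints).
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h += byte
        if i - start >= window:
            h -= data[i - window]            # slide the window forward
        if i - start + 1 >= min_size and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])          # trailing remainder
    return chunks

# Prepending a single byte shifts every fixed block, but content-defined
# boundaries re-synchronize, so most variable-length chunks still match.
random.seed(0)
data = random.randbytes(200_000)
shifted = b"\x00" + data
print(len(set(fixed_chunks(data)) & set(fixed_chunks(shifted))))        # ~0
print(len(set(variable_chunks(data)) & set(variable_chunks(shifted))))  # most
```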
Local or remote de-duplication
Local or remote de-duplication refers to where the de-duplication is performed. In local de-duplication, de-duplication may be performed within a device. This allows transparent operation without the need for APIs or software agents. Local de-duplication is sometimes referred to as target de-duplication in the backup market.

In remote de-duplication, for LAN- or WAN-based systems, it is possible to perform de-duplication remotely through the use of agents or APIs without requiring additional hardware. Remote de-duplication extends the benefits of de-duplication from storage efficiency to network efficiency. Remote de-duplication is sometimes referred to as source de-duplication in the backup market.

Evaluation criteria
The advantage of local de-duplication is total application transparency and interoperability; however, it doesn't address remote or distributed systems or bottlenecks in networks. Although it requires a specialized agent or API, remote de-duplication offers tremendous potential for both network bandwidth savings and application performance.

Inline or post-processing
Another design distinction is when to perform de-duplication. Again, there are multiple design options.

With inline de-duplication, de-duplication is performed as the data is written to the storage system. The advantage of inline de-duplication is that it does not require any duplicate data to be written to disk. The duplicate object is hashed, compared, and referenced on-the-fly. A disadvantage is that more system resources may be required to handle the entire de-duplication operation in real time.

With post-processing de-duplication, de-duplication is performed after the data is written to the storage system. The advantage of post-processing de-duplication is that the objects can be compared and removed at a more leisurely pace, and typically without heavy utilization of system resources. The disadvantage of post-processing is that all duplicate data must first be written to the storage system, requiring additional storage capacity.

Evaluation criteria
The decision regarding inline versus post-processing de-duplication has more to do with the application being de-duplicated than with any technical advantages or disadvantages.

When performing data backups, the user's primary objective is the completion of backups within an allowed time window. For LAN- and WAN-based backups, remote inline de-duplication may provide the best performance. For direct-attached and SAN-based backup, an assessment should be made to determine which approach works best. Either may be appropriate, depending on data type and volume. If post-processing de-duplication is deployed, users should ensure there is adequate time between backup sessions to complete the de-duplication post-process.

With general applications, the cost of additional storage needed by post-processing needs to be weighed against the cost of system resources and the performance of inline de-duplication to determine the best fit for an environment.
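A minimal sketch of the two timings, reusing the toy fingerprint idea from the earlier examples (illustrative Python; the in-memory "disk" and index are assumptions): inline de-duplication consults the index before anything is written, while post-processing writes everything first and reclaims duplicates in a later pass.

```python
import hashlib

index: dict[str, int] = {}             # fingerprint -> block address
disk: dict[int, bytes] = {}            # toy "disk": block address -> contents
next_addr = 0

def write_inline(obj: bytes) -> int:
    """Inline: hash, compare, and reference on the fly; duplicates never reach disk."""
    global next_addr
    fp = hashlib.sha256(obj).hexdigest()
    if fp not in index:
        disk[next_addr] = obj
        index[fp] = next_addr
        next_addr += 1
    return index[fp]

def write_raw(obj: bytes) -> int:
    """Post-processing, step 1: write everything as it arrives, duplicates included."""
    global next_addr
    disk[next_addr] = obj
    next_addr += 1
    return next_addr - 1

def dedupe_post_process() -> int:
    """Post-processing, step 2: later (e.g., between backup windows), find
    duplicate blocks and release the extra copies. A real system would also
    redirect references to the surviving copy."""
    seen: dict[str, int] = {}
    freed = 0
    for addr in sorted(disk):          # sorted() snapshots keys, so deletion is safe
        fp = hashlib.sha256(disk[addr]).hexdigest()
        if fp in seen:
            del disk[addr]             # reclaim the duplicate's space
            freed += 1
        else:
            seen[fp] = addr
    return freed
```

The trade-off the article describes is visible here: write_inline does the hashing work on the write path, while write_raw defers it but temporarily needs capacity for every duplicate.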
De-duplicated or original format data protection
As is the case with all corporate data systems, de-duplicating storage systems need to be protected against data loss. De-duplicating systems vary with respect to their approach to data protection.

When protecting a de-duplicated system, it is possible to perform backups and replication in the de-duplicated state. The advantages of de-duplicated data protection are faster operations and less resource usage in the form of LAN/WAN bandwidth.

De-duplicated systems can also be backed up and replicated in the original data format. The advantage of protecting data in the original format is that the data can theoretically be restored to a different type of system that may not support data de-duplication. This would be particularly useful with long-term tape retention.

Evaluation criteria
If media usage and LAN/WAN bandwidth are a concern, de-duplicated data protection offers clear cost advantages as well as performance advantages. Note that while original format data protection offers the possibility of cross-platform operation, in practice many data-protection solutions do not allow cross-platform operation. Finally, some systems offer users a choice of either type of data protection.
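As a rough illustration of why replicating in the de-duplicated state saves bandwidth, the sketch below (illustrative Python; the chunk store and fingerprint "recipe" format are assumptions, not any product's protocol) compares the bytes that would cross the wire when shipping only previously unseen chunks plus a recipe of fingerprints versus shipping the original, rehydrated stream.

```python
import hashlib

def replicate_deduplicated(chunks: list[bytes], remote_fps: set[str]) -> int:
    """Ship only chunks the remote site has not seen, plus a fingerprint recipe."""
    sent = 0
    for chunk in chunks:
        fp = hashlib.sha256(chunk).hexdigest()
        sent += 32                      # the 32-byte fingerprint always travels
        if fp not in remote_fps:
            remote_fps.add(fp)
            sent += len(chunk)          # new chunk data travels only once
    return sent

def replicate_original(chunks: list[bytes]) -> int:
    """Ship the rehydrated stream: every byte travels, duplicates included."""
    return sum(len(c) for c in chunks)

# Example: ten identical 8KB chunks.
chunks = [b"A" * 8192] * 10
print(replicate_deduplicated(chunks, set()))   # 8192 + 10*32 bytes
print(replicate_original(chunks))              # 81,920 bytes
```

Restoring to a system that does not understand the recipe, on the other hand, is only possible with the original-format stream, which is the cross-platform trade-off described above.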


De-duplication space savings
So what should you expect in terms of space/capacity savings with data de-duplication?

De-duplication vendors often claim 20:1, 50:1, or even up to 500:1 data-reduction ratios. These claims refer to the "time-based" space savings effect of de-duplication on repetitive data backups. The accompanying figure illustrates this theoretical space savings over time. Since these backups contain mostly unchanged data, once the first full backup has been stored, all subsequent full backups will see a very high occurrence of de-duplication. Assuming the user retains 10 to 20 backup images, and the change rate between backups is within the norm (2% to 5%), this user should expect storage space savings in the range of 5:1 to 20:1. If you retain more backup images, or have a reduced rate of change between backups, your ratio will increase. The larger numbers, such as 300:1 or 500:1, tend to refer to data moved and stored for daily full backups of individual systems.

[Figure: De-duplication ratio and storage savings. Space savings (%) versus de-duplication ratio (n:1), from 1.25:1 (20% savings) to 200:1 (99.5% savings), with acceptable de-duplication ratio ranges marked for backup and non-backup data.]

Another area to consider for data de-duplication is non-backup data volumes, such as primary storage or archival data, where the rules of time-based data reduction ratios do not apply. In those environments, volumes do not receive a steady supply of redundant data backups, but may still contain a large amount of duplicate data objects.

The ability to reduce space in these volumes through de-duplication is measured in "spatial" terms. In other words, if a 500GB data archival volume can be reduced to 400GB through de-duplication, the spatial (volume) reduction is 100GB, or 20%. Think of it as receiving a "storage rebate" through de-duplication. In these applications, space savings of 20% to 40% may justify the cost and time you spend implementing de-duplication.

Summary
Data de-duplication is an important new technology that is quickly being embraced by users as they struggle to control data proliferation. By eliminating redundant data objects, an immediate benefit is obtained through space efficiencies. When evaluating de-duplication technologies, it is important to consider major design aspects, including use of hashes, indexing, fixed or variable object length, local or remote operation, inline or post-processing, data protection, and of course, the space savings reduction that you will receive. By considering these items and understanding how they impact your data storage environment, informed decisions can be reached that benefit your environment. Watch for more on this topic at www.snia.org. ❏

This article was written on behalf of the SNIA Forum. Larry Freeman is a senior product manager at Network Appliance; Rory Bolt is chief technology officer, Avamar Products, EMC; and Tom Sas is a product marketing manager at Hewlett-Packard.

VENDORS MENTIONED
EMC, Hewlett-Packard, Network Appliance

Cutting the costs of remote disaster recovery continued from p. 25

replicate both primary production data and backup data to the company's warm disaster-recovery site. Although currently still backing up to tape, Shapelow says the company plans to move toward using CommVault software with a virtual tape library (VTL) with data de-duplication functionality. "We'll put a de-duplication device on-site and replicate to the off-site location," he explains. Shapelow says the company is also currently in the process of switching its VMware virtual servers to use VMware Consolidated Backup (VCB) in conjunction with CommVault's Galaxy software to direct virtual server images to the Symmetrix array, where data can be replicated to the warm site via SRDF.

Don't forget virtualization
Server virtualization and its ability to help reduce the cost of remote DR has gained traction over the past several months. "The biggest expense, after personnel, on the secondary site tends to be hardware," says Lamorena. "If you use server virtualization tools, you can reduce the number of boxes you have at the second site."

The growing popularity of server virtualization is putting greater emphasis on virtual storage architectures as well when it comes to cost savings in remote DR.

When it comes to replication, companies can choose from a wide range of solutions. Software applications from vendors such as DoubleTake and Neverfail are examples of server-based replication solutions. In addition to server-based approaches, replication choices include appliance-, array-, and database-level replication. Among this range of current choices for remote replication is also a growing mix of software vendors such as DataCore and FalconStor, as well as combined hardware/software virtualization vendors such as EMC, Hitachi Data Systems, and IBM. Users also have the choice to implement local and remote replication services on top of the vendors' respective virtualization platforms.

The Taneja Group's Norall sees three levels of technology at which data is moved across the wire from a primary to secondary site: host-based replication products installed as software on the host agent, network virtualization products that support both asynchronous and synchronous replication, and array-level replication and mirroring.

Datalink's Robinson says companies have a lot more choice and cost-savings opportunities in the type of storage system they can now deploy at their secondary sites, noting that there are now a variety of heterogeneous replication solutions available to help drive down the cost of remote DR.

Although Robinson doesn't see large companies wanting to replicate between a Symmetrix and a Nexsan system, for example, he has seen the desire to replicate to lower-cost systems in the same vendor's line. "We may see customers going from a Symmetrix to a Clariion. The same thing with Hitachi and its replication technology on the USP [Universal Storage Platform]," he says. "Technically, while I still need a USP to replicate on the other end, I can use cheap disk behind it. Although it's still frame-to-frame, it's a way to reduce the overall cost."

Regardless of which technology you choose, there's no doubt that there are more choices than ever when it comes to cost savings for remote DR. ❏

Michele Hope is a freelance writer covering enterprise storage and networking. She can be reached at [email protected].

VENDORS MENTIONED
Brocade, Cisco, CommVault, Data Domain, DataCore, Datalink, DoubleTake, EMC, FalconStor, Hitachi Data Systems, IBM, Juniper Networks, Neverfail, Nexsan, Overland Storage, Riverbed, Silver Peak Systems, Sun, Symantec, VMware

Case study: Tiered data protection – Large manufacturing company
(Deferring purchases of Tier-1 storage for two years)

Before: Total cost of acquisition (3-year model)
Tier   Data    Storage type                                     Storage   $/GB     Total
1      120TB   High-end SAN RAID, local and remote replicated   360TB     $45/GB   $16,200,000
2      40TB    Midrange SAN RAID, local replicated              80TB      $29/GB   $2,320,000
3      20TB    DASD                                             20TB      $10/GB   $200,000
Total  180TB                                                    460TB              $18.7M

After: Total cost of acquisition (3-year model)
Tier   Data    Storage type                                     Storage   $/GB     Total
1      40TB    High-end SAN RAID, local and remote replicated   120TB     $45/GB   $5,400,000
2      30TB    High-end SAN RAID, local replicated              60TB      $35/GB   $2,100,000
3      20TB    Midrange SAN RAID, local replicated              40TB      $29/GB   $1,160,000
4      40TB    Midrange SAN RAID, not replicated                40TB      $20/GB   $800,000
5      40TB    Slow disk SAN RAID, not replicated               40TB      $14/GB   $560,000
6      10TB    DASD                                             40TB      $10/GB   $400,000
Total  180TB                                                    340TB              $10.5M

Source: Contoural
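As a quick check on how the case-study totals are derived (a sketch only; the figures are taken from the Contoural table above, and each tier's cost is simply the provisioned storage multiplied by its $/GB rate):

```python
# (data_tb, provisioned_tb, dollars_per_gb) for each tier in the "After" model
after_tiers = [
    (40, 120, 45),   # high-end SAN RAID, local and remote replicated
    (30, 60, 35),    # high-end SAN RAID, local replicated
    (20, 40, 29),    # midrange SAN RAID, local replicated
    (40, 40, 20),    # midrange SAN RAID, not replicated
    (40, 40, 14),    # slow disk SAN RAID, not replicated
    (10, 40, 10),    # DASD
]

total_data = sum(data for data, _, _ in after_tiers)                   # 180 (TB)
total_cost = sum(prov * 1000 * rate for _, prov, rate in after_tiers)  # 10,420,000
print(total_data, total_cost)   # close to the table's published $10.5M total
```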
