Evaluation Criteria for Data De-Dupe


SNIA on STORAGE

How does it work? What are the different implementation methods? And what are the key evaluation criteria?

By Larry Freeman, Rory Bolt, and Tom Sas

DATA DE-DUPLICATION IS THE PROCESS of eliminating redundant copies of data. The term "data de-duplication" was coined by database administrators many years ago as a way of describing the process of removing duplicate database records after two databases had been merged.

Today the original definition of de-duplication has been expanded. In the context of storage, de-duplication refers to any algorithm that searches for duplicate data objects (e.g., blocks, chunks, files) and stores only a single copy of those objects. The user benefits are clear:

■ Reduces the space needed to store data; and
■ Increases the available space to retain data for longer periods of time.

How it works

Regardless of operating system, application, or file-system type, all data objects are written to a storage system using a data reference pointer, without which data could not be located or retrieved. In traditional (non-de-duplicated) file systems, data objects are stored without regard to any similarity with other objects in the same file system. Identifying duplicate objects and redirecting reference pointers form the basis of the de-duplication algorithm.

As shown in the figure, referencing several identical objects with a single "master" object allows the space normally occupied by the duplicate objects to be "given back" to the storage system.

[Figure 1: De-duplication pointers, showing reference pointers for duplicated objects before de-duplication and de-duplicated reference pointers afterward.]

Design considerations

Given the fact that all de-duplication technologies must identify duplicate data and support some form of referencing, there is a surprising variety of implementations, including the use of hashes, indexing, fixed object length or variable object length de-duplication, local or remote de-duplication, inline or post-processing, and de-duplicated or original format data protection.

Use of hashes

Data de-duplication begins with a comparison of two data objects. It would be impractical (and very arduous) to scan an entire data volume for duplicate objects each time a new object was written to that volume. For that reason, de-duplication systems create relatively small hash values for each new object to identify potential duplicate data.

A hash value, also called a digital fingerprint or digital signature, is a small number generated from a larger string of data. Hash values are generated by a mathematical formula in such a way that it is extremely unlikely (but not impossible) for two non-identical data objects to produce the same hash value. In the event that two non-identical objects do map to the same hash value, this is termed a "hash collision."

Evaluation criteria

Understanding a system's use of hashes is an important criterion when you are evaluating de-duplication. If the technology depends solely on hashes to determine if two objects are identical, then there is the possibility, however remote, that hash collisions could occur and some of the data referencing the object that produced the collision will be corrupt. Certain government regulations may require you to perform secondary data object validation after the hash compare has completed to ensure against hash collisions. Although concern over hash collisions is often raised, depending upon the hash algorithm used and the system design, the probability of a hash collision may actually be orders of magnitude less than the probability of an undetected disk read error returning corrupt data.
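To make the hash-based comparison concrete, here is a minimal sketch in Python (not from the article) of fixed-object-length fingerprinting with an optional secondary, byte-for-byte validation of candidate duplicates. The 4 KB block size, the choice of SHA-256, and the function names are illustrative assumptions, not a description of any particular product.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed object length; real systems choose their own


def split_into_blocks(data: bytes):
    """Split a byte string into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def deduplicate(blocks, verify=True):
    """Store each unique block once; return the store and per-block references.

    If verify is True, a candidate duplicate found by hash compare is confirmed
    with a byte-for-byte comparison, guarding against the remote possibility of
    a hash collision silently corrupting a reference.
    """
    store = {}       # hash value -> single stored copy of the block
    references = []  # one hash per logical block, in write order
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest in store:
            if verify and store[digest] != block:
                raise ValueError("hash collision detected during validation")
        else:
            store[digest] = block
        references.append(digest)
    return store, references


if __name__ == "__main__":
    data = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE  # three identical blocks, one unique
    store, refs = deduplicate(split_into_blocks(data))
    print(f"{len(refs)} logical blocks stored as {len(store)} physical blocks")
```

Whether the secondary validation step is worth its I/O cost depends on the hash algorithm, the system design, and any regulatory requirements, as discussed above.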
Indexing

Once duplicate objects have been identified (and optionally validated), removal of the duplicate object can commence. There are varying methods that systems employ when modifying their data pointer structures. However, all forms of this indexing fall into four broad categories:

Catalog-based indexing—A catalog of hash values is used only to identify candidates for de-duplication. A separate process modifies the data pointers accordingly. The advantage of catalog-based de-duplication is that the catalog is only utilized to identify duplicate objects and is not accessed during the actual reading or writing of the de-duplicated data objects; that task is handled via the normal file-system data structure.

Lookup table-based indexing—Extends the functionality of the hash catalog to also contain a hash lookup table to index the de-duplicated object's "parent" data pointer. The advantage of a lookup table is that it can be used on file systems that do not support multiple block referencing; a single data object can be stored and "referenced" many times via the lookup table. Lookup tables may also be used within systems that provide block-level services instead of file systems.

Content-addressable store, or CAS-based, indexing—The hash value, or digital signature of the data object itself, may be used by itself or in combination with additional metadata as the data pointer. In a content-addressable store (CAS), the storage location is determined by the data being stored. Advantages of CAS-based indexing include inherent single instancing/de-duplication, as well as enhanced data integrity capabilities and the ability to leverage grid-based storage architectures. Although CAS systems are inherently object-based, file-system semantics can be implemented above the CAS.

Application-aware indexing—Differs from other indexing methods in that it looks at data as objects. Unlike hashing or byte-level comparisons, application-aware indexing finds duplication in application-specific byte streams. As the name implies, this approach compares like objects (such as Excel documents to Excel documents) and has awareness of the data structure of these formats.
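As an illustration of the lookup table-based indexing described above, the sketch below (my construction, not the authors') keeps a hash lookup table that maps each fingerprint to a stored "parent" object and a reference count, so one physical object can be referenced many times even on a system with no native support for multiple block referencing. All names and structures are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class LookupTableIndex:
    """Minimal lookup table-based index: fingerprint -> (parent location, refcount)."""
    store: dict = field(default_factory=dict)  # location -> object bytes (one physical copy)
    table: dict = field(default_factory=dict)  # fingerprint -> [location, reference count]
    next_location: int = 0

    def write(self, obj: bytes) -> int:
        """Store a new object, or add a reference to an existing copy; return its location."""
        fingerprint = hashlib.sha256(obj).hexdigest()
        entry = self.table.get(fingerprint)
        if entry is not None:
            entry[1] += 1          # duplicate: only the reference count grows
            return entry[0]
        location = self.next_location
        self.next_location += 1
        self.store[location] = obj
        self.table[fingerprint] = [location, 1]
        return location

    def read(self, location: int) -> bytes:
        return self.store[location]


if __name__ == "__main__":
    index = LookupTableIndex()
    a = index.write(b"quarterly report")
    b = index.write(b"quarterly report")  # identical content written again
    index.write(b"unrelated data")
    print(a == b, len(index.store))       # True 2 -> three writes, two physical objects
```

In a CAS-based design the fingerprint itself would serve as the storage location, so the separate lookup table disappears and single instancing falls out of the addressing scheme.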
Evaluation criteria

De-duplication indexing is an important consideration in technology evaluation, particularly when it comes to resiliency of design. When indexing data objects, the index itself could become a single point of failure. It is important to understand what, if any, single points of failure are present in a de-duplication system. It is equally important to understand what measures are used to protect these single points of failure to minimize the risk of data loss. Another indexing consideration is the speed of the index: an inordinate amount of time should not be required to store and retrieve data objects.

Fixed or variable object length

The advantage of variable object size de-duplication is that it allows duplicate data to be recognized even if it has been logically shifted with respect to physical block boundaries. This can result in much better de-duplication ratios.

Evaluation criteria

Fixed object length de-duplication offers processing advantages and performs well in both structured data environments (e.g., databases) and in environments where data is only appended to files. In unstructured data environments such as file servers, variable object length de-duplication is able to recognize data that has shifted position as the result of edits to a file (see the sketch at the end of this article).

Local or remote de-duplication

Remote de-duplication is sometimes referred to as source de-duplication in the backup market.

Evaluation criteria

The advantage of local de-duplication is total application transparency and interoperability; however, it doesn't address remote or distributed systems or bottlenecks in networks. Although it requires a specialized agent or API, remote de-duplication offers tremendous potential for both network bandwidth savings and application performance.

Inline or post-processing

Another design distinction is when to perform de-duplication: inline, as data is written, or as a post-process after it has been stored.

Evaluation criteria

The choice between inline and post-processing often has more to do with the application being de-duplicated than with any technical advantages or disadvantages. When performing data backups, the user's primary objective is the completion of backups within an allowed time window. For LAN- and WAN-based backups, remote inline de-duplication may provide the best performance. For direct-attached and SAN-based backup, an assessment should be made to determine which approach works best; either may be appropriate, depending on data type and volume. If post-processing de-duplication is deployed, users should ensure there is adequate time between backup sessions to complete the de-duplication post-process. With general applications, the cost of the additional storage needed by post-processing needs to be weighed against the cost of system resources and the performance of inline de-duplication to determine the best fit for an environment.

[Figure 2: De-duplication ratio and storage savings. Storage savings (%) plotted against de-duplication ratio (n:1) for 8, 16, and 20 backups, with acceptable de-duplication ratios marked for backup and for non-backup data. Savings equal 1 - 1/ratio, so 2:1 yields 50%, 20:1 yields 95%, and 200:1 yields 99.5%.]

De-duplicated or original format data protection

As is the case with all corporate data systems, de-duplicating storage systems need to be protected against data loss. De-duplicating systems vary with respect to their approach to data protection.

About the SNIA DDSR SIG

Due to the growing interest in data de-duplication and space reduction solutions, the SNIA DMF Data Protection Initiative has recently been tasked with forming a Special Interest Group (SIG) focusing on this topic. This is the first in a series of publications from SNIA on the topic of de-duplication and space reduction. The mission of the DDSR SIG is to bring together a core group of companies that will work together to publicize the benefits of data de-duplication and space savings technologies. Anyone interested in participating can help form the direction and ultimate success of the group. Find out more at www.snia-dmf.org/dpi.
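Returning to the fixed-versus-variable object length distinction above, the following sketch (again mine, under assumed parameters for window size, average chunk size, and chunk limits) shows the idea behind variable object length de-duplication: chunk boundaries are derived from the content via a rolling hash, so data that shifts because of an insertion is still recognized once the boundaries resynchronize.

```python
import hashlib
import random

WINDOW = 48                 # rolling-hash window in bytes (assumed)
DIVISOR = 1 << 11           # average chunk size of about 2 KB (assumed)
MIN_CHUNK, MAX_CHUNK = 512, 8192
PRIME, MOD = 31, (1 << 61) - 1
POW = pow(PRIME, WINDOW - 1, MOD)  # weight of the byte leaving the window


def chunks(data: bytes):
    """Yield content-defined chunks: cut where the rolling hash hits a target value."""
    start, h = 0, 0
    for i in range(len(data)):
        if i - start < WINDOW:
            h = (h * PRIME + data[i]) % MOD
        else:
            # slide the window: remove the oldest byte's contribution, add the newest byte
            h = ((h - data[i - WINDOW] * POW) * PRIME + data[i]) % MOD
        size = i - start + 1
        if size >= MAX_CHUNK or (size >= MIN_CHUNK and h % DIVISOR == 0):
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]


if __name__ == "__main__":
    random.seed(0)
    original = bytes(random.getrandbits(8) for _ in range(1 << 15))   # 32 KB of data
    edited = original[:1000] + b"INSERTED BYTES" + original[1000:]    # shifts everything after byte 1000
    fingerprint = lambda c: hashlib.sha256(c).hexdigest()
    before = {fingerprint(c) for c in chunks(original)}
    after = {fingerprint(c) for c in chunks(edited)}
    print(f"chunks still recognized after the edit: {len(before & after)} of {len(before)}")
```

With fixed-size blocks, every block after the insertion point would change; with content-defined boundaries, typically only the chunk containing the edit is new, which is why variable object length approaches tend to achieve better de-duplication ratios on edited, unstructured data.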