Avoiding the Disk Bottleneck in the Data Domain Deduplication File System

Benjamin Zhu, Data Domain, Inc.; Kai Li, Data Domain, Inc. and Princeton University; Hugo Patterson, Data Domain, Inc.

Abstract

Disk-based deduplication storage has emerged as the new-generation storage system for enterprise data protection to replace tape libraries. Deduplication removes redundant data segments to compress data into a highly compact form and makes it economical to store on disk instead of tape. A crucial requirement for enterprise data protection is high throughput, typically over 100 MB/sec, which enables backups to complete quickly. A significant challenge is to identify and eliminate duplicate data segments at this rate on a low-cost system that cannot afford enough RAM to store an index of the stored segments and may be forced to access an on-disk index for every input segment. This paper describes three techniques employed in the production Data Domain deduplication file system to relieve the disk bottleneck. These techniques include: (1) the Summary Vector, a compact in-memory data structure for identifying new segments; (2) Stream-Informed Segment Layout, a data layout method to improve on-disk locality for sequentially accessed segments; and (3) Locality Preserved Caching, which maintains the locality of the fingerprints of duplicate segments to achieve high cache hit ratios. Together, they can remove 99% of the disk accesses for deduplication of real world workloads. These techniques enable a modern two-socket dual-core system to run at 90% CPU utilization with only one shelf of 15 disks and achieve 100 MB/sec for single-stream throughput and 210 MB/sec for multi-stream throughput.

1 Introduction

The massive storage requirements for data protection have presented a serious problem for data centers. Typically, data centers perform a weekly full backup of all the data on their primary storage systems to secondary storage devices, where they keep these backups for weeks to months. In addition, they may perform daily incremental backups that copy only the data which has changed since the last backup. The frequency, type and retention of backups vary for different kinds of data, but it is common for the secondary storage to hold 10 to 20 times more data than the primary storage. For disaster recovery, additional offsite copies may double the secondary storage capacity needed. If the data is transferred offsite over a wide area network, the network bandwidth requirement can be enormous.

Given the data protection use case, there are two main requirements for a secondary storage system storing backup data. The first is low cost, so that storing backups and moving copies offsite does not end up costing significantly more than storing the primary data. The second is high performance, so that backups can complete in a timely fashion. In many cases, backups must complete overnight so the load of performing backups does not interfere with normal daytime usage.

The traditional solution has been to use tape libraries as secondary storage devices and to transfer physical tapes for disaster recovery. Tape cartridges cost a small fraction of disk storage systems and they have good sequential transfer rates, in the neighborhood of 100 MB/sec. But managing cartridges is a manual process that is expensive and error prone. It is quite common for restores to fail because a tape cartridge cannot be located or has been damaged during handling. Further, random access performance, needed for data restores, is extremely poor. Disk-based storage systems and network replication would be much preferred if they were affordable.

During the past few years, disk-based "deduplication" storage systems have been introduced for data protection [QD02, MCM01, KDLT04, Dat05, JDT05]. Such systems compress data by removing duplicate data across files and often across all the data in a storage system. Some implementations achieve a 20:1 compression ratio (total data size divided by physical space used) for 3 months of backup data using the daily-incremental and weekly-full backup policy. By substantially reducing the footprint of versioned data, deduplication can make the costs of storage on disk and tape comparable and make replicating data over a WAN to a remote site for disaster recovery practical.

The specific deduplication approach varies among system vendors. Certainly the different approaches vary in how effectively they reduce data. But the goal of this paper is not to investigate how to get the greatest data reduction, but rather how to do deduplication at high speed in order to meet the performance requirement for secondary storage used for data protection.

The most widely used deduplication method for secondary storage, which we call Identical Segment Deduplication, breaks a data file or stream into contiguous segments and eliminates duplicate copies of identical segments. Several emerging commercial systems have used this approach.

The focus of this paper is to show how to implement a high-throughput Identical Segment Deduplication storage system at low system cost. The key performance challenge is finding duplicate segments. Given a segment size of 8 KB and a performance target of 100 MB/sec, a deduplication system must process approximately 12,000 segments per second.

An in-memory index of all segment fingerprints could easily achieve this performance, but the size of the index would limit system size and increase system cost. Consider a segment size of 8 KB and a segment fingerprint size of 20 bytes. Supporting 8 TB worth of unique segments would require 20 GB just to store the fingerprints.
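The arithmetic behind these figures is simple enough to write out directly (a sketch; the constants are the ones quoted above, and the variable names are ours):

    # Back-of-the-envelope arithmetic for the figures quoted above.
    SEGMENT_SIZE = 8 * 1024            # 8 KB average segment
    TARGET_THROUGHPUT = 100 * 10**6    # 100 MB/sec
    FINGERPRINT_SIZE = 20              # SHA-1 fingerprint, in bytes
    UNIQUE_DATA = 8 * 10**12           # 8 TB of unique segments

    segments_per_sec = TARGET_THROUGHPUT / SEGMENT_SIZE
    index_bytes = (UNIQUE_DATA / SEGMENT_SIZE) * FINGERPRINT_SIZE

    print(f"{segments_per_sec:,.0f} segments/sec")          # ~12,200, i.e. about 12,000
    print(f"{index_bytes / 10**9:.0f} GB of fingerprints")  # ~20 GB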
An alternative approach is to maintain an on-disk index of segment fingerprints and use a cache to accelerate segment index accesses. Unfortunately, a traditional cache would not be effective for this workload. Since fingerprint values are random, there is no spatial locality in the segment index accesses. Moreover, because the backup workload streams large data sets through the system, there is very little temporal locality. Most segments are referenced just once every week during the full backup of one particular system. Reference-based caching algorithms such as LRU do not work well for such workloads. The Venti system, for example, implemented such a cache [QD02]. Its combination of index and block caches only improves its write throughput by about 16% (from 5.6 MB/sec to 6.5 MB/sec) even with 8 parallel disk index lookups. The primary reason is its low cache hit ratios. With low cache hit ratios, most index lookups require disk operations. If each index lookup requires a disk access which may take 10 msec and 8 disks are used for index lookups in parallel, the write throughput will be about 6.4 MB/sec, roughly corresponding to Venti's throughput of less than 6.5 MB/sec with 8 drives. While Venti's performance may be adequate for the archival usage of a small workgroup, it's a far cry from the throughput goal of deduplicating at 100 MB/sec to compete with high-end tape libraries. Achieving 100 MB/sec would require 125 disks doing index lookups in parallel! This would increase the system cost of deduplication storage to an unattainable level.

Our key idea is to use a combination of three methods to reduce the need for on-disk index lookups during the deduplication process. We present in detail each of the three techniques used in the production Data Domain deduplication file system. The first is to use a Bloom filter, which we call a Summary Vector, as the summary data structure to test if a data segment is new to the system. It avoids wasted lookups for segments that do not exist in the index. The second is to store data segments and their fingerprints in the same order that they occur in a data file or stream. Such Stream-Informed Segment Layout (SISL) creates spatial locality for segment and fingerprint accesses. The third, called Locality Preserved Caching, takes advantage of the segment layout to fetch and cache groups of segment fingerprints that are likely to be accessed together. A single disk access can result in many cache hits and thus avoid many on-disk index lookups.

Our evaluation shows that these techniques are effective in removing the disk bottleneck in an Identical Segment Deduplication storage system. For a system running on a server with two dual-core CPUs with one shelf of 15 drives, these techniques can eliminate about 99% of index lookups for variable-length segments with an average size of about 8 KB. We show that the system indeed delivers high throughput: achieving over 100 MB/sec for single-stream write and read performance, and over 210 MB/sec for multi-stream write performance. This is an order-of-magnitude throughput improvement over the parallel indexing techniques presented in the Venti system.

The rest of the paper is organized as follows. Section 2 presents challenges and observations in designing a deduplication storage system for data protection. Section 3 describes the software architecture of the production Data Domain deduplication file system. Section 4 presents our methods for avoiding the disk bottleneck. Section 5 shows our experimental results. Section 6 gives an overview of the related work, and Section 7 draws conclusions.

2 Challenges and Observations

2.1 Variable vs. Fixed Length Segments

An Identical Segment Deduplication system could choose to use either fixed length segments or variable length segments created in a content dependent manner. Fixed length segments are the same as the fixed-size blocks of many non-deduplication file systems. For the purposes of this discussion, extents that are multiples of some underlying fixed size unit such as a disk sector are the same as fixed-size blocks.

Variable-length segments can be any number of bytes in length within some range. They are the result of partitioning a file or data stream in a content dependent manner [Man93, BDH94].

The main advantage of a fixed segment size is simplicity. A conventional file system can create fixed-size blocks in the usual way and a deduplication process can then be applied to deduplicate those fixed-size blocks or segments. The approach is effective at deduplicating whole files that are identical because every block of identical files will of course be identical.

In backup applications, single files are backup images that are made up of large numbers of component files. These files are rarely entirely identical even when they are successive backups of the same file system. A single addition, deletion, or change of any component file can easily shift the remaining image content. Even if no other file has changed, the shift would cause each fixed sized segment to be different than it was last time, containing some bytes from one neighbor and giving up some bytes to its other neighbor. The approach of partitioning the data into variable length segments based on content allows a segment to grow or shrink as needed so the remaining segments can be identical to previously stored segments.

Even for storing individual files, variable length segments have an advantage. Many files are very similar to, but not identical to, other versions of the same file. Variable length segments can accommodate these differences and maximize the number of identical segments.

Because variable length segments are essential for deduplication of the shifted content of backup images, we have chosen them over fixed-length segments.
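To make the anchoring idea concrete, the following is a minimal sketch of content-defined segmentation in the spirit of the rolling-fingerprint anchoring cited above [Man93]; the window size, anchor mask, and segment size limits are illustrative choices of ours, and a real system would use Rabin fingerprints rather than this simple rolling hash:

    WINDOW = 48                     # bytes in the rolling-hash window
    MASK = (1 << 13) - 1            # expected anchor spacing of 2^13 bytes, i.e. ~8 KB segments
    MIN_SEG, MAX_SEG = 2 * 1024, 64 * 1024
    B, MOD = 257, (1 << 31) - 1
    B_TO_W = pow(B, WINDOW, MOD)

    def segments(data: bytes):
        """Yield variable-length, content-defined segments of data."""
        start, h = 0, 0
        for i, byte in enumerate(data):
            h = (h * B + byte) % MOD                   # slide the window hash forward
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * B_TO_W) % MOD
            length = i - start + 1
            if length < MIN_SEG:
                continue
            if (h & MASK) == 0 or length >= MAX_SEG:   # anchor found, or segment too long
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

A small edit early in a stream only perturbs the segments around the edit; downstream boundaries realign at the same anchors, which is exactly the property fixed-size blocks lack.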
2.2 Segment Size

Whether fixed or variable sized, the choice of average segment size is difficult because of its impact on compression and performance. The smaller the segments, the more duplicate segments there will be. Put another way, if there is a small modification to a file, the smaller the segment, the smaller the new data that must be stored and the more of the file's bytes will be in duplicate segments. Within limits, smaller segments will result in a better compression ratio.

On the other hand, with smaller segments, there are more segments to process, which reduces performance. At a minimum, more segments mean more times through the deduplication loop, but it is also likely to mean more on-disk index lookups.

With smaller segments, there are also more segments to manage. Since each segment requires the same metadata size, smaller segments will require a larger storage footprint for their metadata, and the segment fingerprints for fewer total user bytes can be cached in a given amount of memory. The segment index is larger. There are more updates to the index. To the extent that any data structures scale with the number of segments, they will limit the overall capacity of the system. Since commodity servers typically have a hard limit on the amount of physical memory in a system, the decision on the segment size can greatly affect the cost of the system. A well-designed deduplication storage system should have the smallest segment size possible given the throughput and capacity requirements for the product. After several iterative design processes, we have chosen to use 8 KB as the average segment size for the variable sized data segments in our deduplication storage system.

2.3 Performance-Capacity Balance

A secondary storage system used for data protection must support a reasonable balance between capacity and performance. Since backups must complete within a fixed backup window time, a system with a given performance can only back up so much data within the backup window. Further, given a fixed retention period for the data being backed up, the storage system needs only so much capacity to retain the backups that can complete within the backup window. Conversely, given a particular storage capacity, backup policy, and deduplication efficiency, it is possible to compute the throughput that the system must sustain to justify the capacity. This balance between performance and capacity motivates the need to achieve good system performance with only a small number of disk drives.

Assuming a backup policy of weekly fulls and daily incrementals with a retention period of 15 weeks, and a system that achieves a 20x compression ratio storing backups for such a policy, as a rough rule of thumb the system requires approximately as much capacity as the primary data to store all the backup images. That is, for 1 TB of primary data, the deduplication secondary storage would consume approximately 1 TB of physical capacity to store the 15 weeks of backups.

Weekly full backups are commonly done over the weekend with a backup window of 16 hours. The balance of the weekend is reserved for restarting failed backups or making additional copies. Using the rule of thumb above, 1 TB of capacity can protect approximately 1 TB of primary data. All of that must be backed up within the 16-hour backup window, which implies a throughput of about 18 MB/sec per terabyte of capacity.

Following this logic, a system with a shelf of 15 SATA drives, each with a capacity of 500 GB and a total usable capacity after RAID, spares, and other overhead of 6 TB, could protect 6 TB of primary storage and must therefore be able to sustain over 100 MB/sec of deduplication throughput.
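Written out, the rule of thumb and the resulting requirement are (a sketch using the 16-hour window and 6 TB usable capacity quoted above):

    # Throughput implied by the capacity figures above.
    backup_window_sec = 16 * 3600                   # 16-hour weekend backup window
    per_tb = 10**12 / backup_window_sec             # bytes/sec needed per TB protected
    print(f"{per_tb / 10**6:.1f} MB/sec per TB")    # ~17.4, i.e. about 18 MB/sec

    usable_tb = 6                                   # 15 x 500 GB SATA drives after RAID and spares
    print(f"{usable_tb * per_tb / 10**6:.0f} MB/sec required")   # ~104 MB/sec, over 100 MB/sec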

2.4 Fingerprint vs. Byte Comparisons

An Identical Segment Deduplication storage system needs a method to determine that two segments are identical. This could be done with a byte by byte comparison of the newly written segment with the previously stored segment. However, such a comparison is only possible by first reading the previously stored segment from disk. This would be much more onerous than looking up a segment in an index and would make it extremely difficult if not impossible to maintain the needed throughput.

To avoid this overhead, we rely on comparisons of segment fingerprints to determine the identity of a segment. The fingerprint is a collision-resistant hash value computed over the content of each segment. SHA-1 is such a collision-resistant function [NIST95]. At a 160-bit output value, the probability of fingerprint collision by a pair of different segments is extremely small, many orders of magnitude smaller than hardware error rates [QD02]. When an error does occur, it will almost certainly be the result of undetected errors in RAM, IO busses, network transfers, disk storage devices, other hardware components or software errors, and not from a collision.
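As a concrete, minimal illustration, any standard SHA-1 implementation produces the 160-bit fingerprint described above; the helper name below is ours and the full segment descriptor layout is not shown:

    import hashlib

    def fingerprint(segment: bytes) -> bytes:
        """Return the 20-byte (160-bit) SHA-1 fingerprint of a segment.
        Two segments are treated as identical when their fingerprints match."""
        return hashlib.sha1(segment).digest()

    # Identical content always yields identical fingerprints.
    assert fingerprint(b"same bytes") == fingerprint(b"same bytes")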

3 Deduplication Storage System Architecture

To provide the context for presenting our methods for avoiding the disk bottleneck, this section describes the architecture of the production Data Domain File System, DDFS, for which Identical Segment Deduplication is an integral feature. Note that the methods presented in the next section are general and can apply to other Identical Segment Deduplication storage systems.

At the highest level, DDFS breaks a file into variable-length segments in a content dependent manner [Man93, BDH94] and computes a fingerprint for each segment. DDFS uses the fingerprints both to identify duplicate segments and as part of a segment descriptor used to reference a segment. It represents files as sequences of segment fingerprints. During writes, DDFS identifies duplicate segments and does its best to store only one copy of any particular segment. Before storing a new segment, DDFS uses a variation of the Ziv-Lempel algorithm to compress the segment [ZL77].

Figure 1: Data Domain File System architecture.

Figure 1 is a block diagram of DDFS, which is made up of a stack of software components. At the top of the stack, DDFS supports multiple access protocols which are layered on a common File Services interface. Supported protocols include NFS, CIFS, and a virtual tape library interface (VTL).

When a data stream enters the system, it goes through one of the standard interfaces to the generic File Services layer, which manages the name space and file metadata. The File Services layer forwards write requests to Content Store, which manages the data content within a file. Content Store breaks a data stream into segments, uses Segment Store to perform deduplication, and keeps track of the references for a file. Segment Store does the actual work of deduplication. It packs deduplicated (unique) segments into relatively large units, compresses such units using a variation of the Ziv-Lempel algorithm to further compress the data, and then writes the compressed results into containers supported by Container Manager.

To read a data stream from the system, a client drives the read operation through one of the standard interfaces and the File Services layer. Content Store uses the references to deduplicated segments to deliver the desired data stream to the client. Segment Store prefetches, decompresses, reads and caches data segments from Container Manager.

The following describes the Content Store, Segment Store and the Container Manager in detail and discusses our design decisions.

3.1 Content Store

Content Store implements byte-range writes and reads for deduplicated data objects, where an object is a linear sequence of client data bytes and has intrinsic and client-settable attributes or metadata. An object may be a conventional file, a backup image of an entire volume or a tape cartridge.

To write a range of bytes into an object, Content Store performs several operations.

• Anchoring partitions the byte range into variable-length segments in a content dependent manner [Man93, BDH94].

• Segment fingerprinting computes the SHA-1 hash and generates the segment descriptor based on it. Each segment descriptor contains per-segment information including at least the fingerprint and size.

• Segment mapping builds the tree of segments that records the mapping between object byte ranges and segment descriptors. The goal is to represent a data object using references to deduplicated segments.

To read a range of bytes in an object, Content Store traverses the tree of segments created by the segment mapping operation above to obtain the segment descriptors for the relevant segments. It fetches the segments from Segment Store and returns the requested byte range to the client.

3.2 Segment Store

Segment Store is essentially a database of segments keyed by their segment descriptors. To support writes, it accepts segments with their segment descriptors and stores them. To support reads, it fetches segments designated by their segment descriptors.

To write a data segment, Segment Store performs several operations.

• Segment filtering determines if a segment is a duplicate. This is the key operation to deduplicate segments and may trigger disk I/Os, thus its overhead can significantly impact throughput performance.

• Container packing adds segments to be stored to a container, which is the unit of storage in the system. The packing operation also compresses segment data using a variation of the Ziv-Lempel algorithm. A container, when fully packed, is appended to the Container Manager.

• Segment indexing updates the segment index that maps segment descriptors to the container holding the segment, after the container has been appended to the Container Manager.

To read a data segment, Segment Store performs the following operations.

• Segment lookup finds the container storing the requested segment. This operation may trigger disk I/Os to look in the on-disk index, thus it is throughput sensitive.

• Container retrieval reads the relevant portion of the indicated container by invoking the Container Manager.

• Container unpacking decompresses the retrieved portion of the container and returns the requested data segment.

Figure 2: Containers are self-describing, immutable units of storage several megabytes in size. All segments are stored in containers.

3.3 Container Manager

The Container Manager provides a storage container log abstraction, not a block abstraction, to Segment Store. Containers, shown in Figure 2, are self-describing in that a metadata section includes the segment descriptors for the stored segments. They are immutable in that new containers can be appended and old containers deleted, but containers cannot be modified once written. When Segment Store appends a container, the Container Manager returns a container ID which is unique over the life of the system.

The Container Manager is responsible for allocating, deallocating, reading, writing and reliably storing containers. It supports reads of the metadata section or a portion of the data section, but it only supports appends of whole containers. If a container is not full but needs to be written to disk, it is padded out to its full size.

Container Manager is built on top of standard block storage. Advanced techniques such as software RAID-6, continuous data scrubbing, container verification, and end-to-end data checks are applied to ensure a high level of data integrity and reliability.

The container abstraction offers several benefits.

• The fixed container size makes container allocation and deallocation easy.

• The large granularity of a container write achieves high disk throughput utilization.

• A properly sized container allows efficient full-stripe RAID writes, which enables an efficient software RAID implementation at the storage layer.
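A small data-structure sketch may help fix the picture of a container; the class and field names are ours, and the 4 MB size is the example figure used later in Section 4.2:

    from dataclasses import dataclass, field
    from typing import List

    CONTAINER_SIZE = 4 * 1024 * 1024        # illustrative fixed container size (4 MB)

    @dataclass
    class SegmentDescriptor:
        fingerprint: bytes                  # 20-byte SHA-1 fingerprint
        size: int                           # segment size in bytes

    @dataclass
    class Container:
        """Self-describing, immutable unit of storage: the metadata section
        lists the descriptors of every segment whose compressed bytes sit in
        the data section.  Containers are appended whole and never modified."""
        container_id: int
        metadata: List[SegmentDescriptor] = field(default_factory=list)
        data: bytes = b""                   # compressed segments, padded to CONTAINER_SIZE on disk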

4 Acceleration Methods

This section presents three methods to accelerate the deduplication process in our deduplication storage system: the Summary Vector, Stream-Informed Segment Layout, and Locality Preserved Caching. The combination of these methods allows our system to avoid about 99% of the disk I/Os required by a system relying on index lookups alone. The following describes each of the three techniques in detail.

4.1 Summary Vector

The purpose of the Summary Vector is to reduce the number of times that the system goes to disk to look for a duplicate segment only to find that none exists. One can think of the Summary Vector as an in-memory, conservative summary of the segment index. If the Summary Vector indicates a segment is not in the index, then there is no point in looking further for the segment; the segment is new and should be stored. On the other hand, being only an approximation of the index, if the Summary Vector indicates the segment is in the index, there is a high probability that the segment is actually in the segment index, but there is no guarantee.

The Summary Vector implements the following operations:

• Init()
• Insert(fingerprint)
• Lookup(fingerprint)

We use a Bloom filter to implement the Summary Vector in our current design [Blo70]. A Bloom filter uses a vector of m bits to summarize the existence information about n fingerprints in the segment index. In Init(), all bits are set to 0. Insert(a) uses k independent hash functions, h_1, ..., h_k, each mapping a fingerprint to [0, m-1], and sets the bits at positions h_1(a), ..., h_k(a) to 1. For any fingerprint x, Lookup(x) checks all bits at positions h_1(x), ..., h_k(x) to see if they are all set to 1. If any of the bits is 0, then we know x is definitely not in the segment index. Otherwise, with high probability, x will be in the segment index, assuming reasonable choices of m, n, and k. Figure 3 illustrates the operations of the Summary Vector.

Figure 3: Summary Vector operations. The Summary Vector can identify most new segments without looking up the segment index. Initially all bits in the array are 0. On insertion, shown in (a), bits specified by several hashes, h_1, h_2, and h_3, of the fingerprint of the segment are set to 1. On lookup, shown in (b), the bits specified by the same hashes are checked. If any are 0, as shown in this case, the segment cannot be in the system.

As indicated in [FCAB98], the probability of a false positive for an element not in the set, or the false positive rate, can be calculated in a straightforward fashion, given our assumption that hash functions are perfectly random. After all n elements are hashed and inserted into the Bloom filter, the probability that a specific bit is still 0 is

    (1 - 1/m)^{kn} \approx e^{-kn/m}.

The probability of a false positive is then

    (1 - (1 - 1/m)^{kn})^k \approx (1 - e^{-kn/m})^k.

Using this formula, one can derive particular parameters to achieve a given false positive rate. For example, to achieve a 2% false positive rate, the smallest size of the Summary Vector is 8n bits (m/n = 8) and the number of hash functions can be 4 (k = 4).

To have a fairly small probability of false positive, such as a fraction of a percent, we choose m such that m/n is about 8 for a target goal of n, and k around 4 or 5. For example, supporting one billion base segments requires about 1 GB of memory for the Summary Vector.
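A minimal Bloom-filter Summary Vector along these lines might look as follows; this is a sketch rather than the production implementation, and the way the k probe positions are derived from a fingerprint (double hashing over one SHA-1 digest) is our choice, not something the paper specifies:

    import hashlib

    class SummaryVector:
        """Bloom filter over segment fingerprints: no false negatives, and
        roughly a 2% false positive rate with m/n = 8 bits per segment and k = 4."""

        def __init__(self, m_bits: int, k: int = 4):
            self.m, self.k = m_bits, k
            self.bits = bytearray(m_bits // 8 + 1)       # Init(): all bits start at 0

        def _positions(self, fingerprint: bytes):
            # Derive k probe positions; fingerprints are already uniformly
            # random, so double hashing over one digest suffices for a sketch.
            d = hashlib.sha1(fingerprint).digest()
            h1 = int.from_bytes(d[:8], "big")
            h2 = int.from_bytes(d[8:16], "big") | 1
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def insert(self, fingerprint: bytes):
            for p in self._positions(fingerprint):
                self.bits[p // 8] |= 1 << (p % 8)

        def lookup(self, fingerprint: bytes) -> bool:
            # False means "definitely new"; True means "probably in the index".
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(fingerprint))

    # Sizing example from the text: one billion segments at m/n = 8 needs
    # 8 * 10**9 bits, about 1 GB of memory.  (Scaled down here for illustration.)
    sv = SummaryVector(m_bits=8 * 10**6)
    sv.insert(b"\x01" * 20)
    assert sv.lookup(b"\x01" * 20)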

At system shutdown, the system writes the Summary Vector to disk. At startup, it reads in the saved copy. To handle power failures and other kinds of unclean shutdowns, the system periodically checkpoints the Summary Vector to disk. To recover, the system loads the most recent checkpoint of the Summary Vector and then processes the containers appended to the container log since the checkpoint, adding the contained segments to the Summary Vector.

Although several variations of Bloom filters have been proposed during the past few years [BM05], we have chosen the basic Bloom filter for simplicity and efficient implementation.

4.2 Stream-Informed Segment Layout

We use Stream-Informed Segment Layout (SISL) to create spatial locality for both segment data and segment descriptors and to enable Locality Preserved Caching as described in the next section. A stream here is just the sequence of bytes that make up a backup image stored in a Content Store object.

Our main observation is that in backup applications, segments tend to reappear in the same or very similar sequences with other segments. Consider a 1 MB file with a hundred or more segments. Every time that file is backed up, the same sequence of a hundred segments will appear. If the file is modified slightly, there will be some new segments, but the rest will appear in the same order. When new data contains a duplicate segment x, there is a high probability that other segments in its locale are duplicates of the neighbors of x. We call this property segment duplicate locality. SISL is designed to preserve this locality.

Content Store and Segment Store support a stream abstraction that segregates the segments created for different objects, preserves the logical ordering of segments within the Content Store object, and dedicates containers to hold segments for a single stream in their logical order. The metadata sections of these containers store segment descriptors in their logical order. Multiple streams can be written to Segment Store in parallel, but the stream abstraction prevents the segments for the different streams from being jumbled together in a container.

The design decision to make the deduplication storage system stream aware is a significant distinction from other systems such as Venti.

When an object is opened for writing, Content Store opens a corresponding stream with Segment Store, which in turn assigns a container to the stream. Content Store writes ordered batches of segments for the object to the stream. Segment Store packs the new segments into the data section of the dedicated container, performs a variation of Ziv-Lempel compression on the data section, and writes segment descriptors into the metadata section of the container. When the container fills up, it appends it to the Container Manager and starts a new container for the stream. Because multiple streams can write to Segment Store in parallel, there may be multiple open containers, one for each active stream.

The end result is Stream-Informed Segment Layout, or SISL, because for a data stream, new data segments are stored together in the data sections, and their segment descriptors are stored together in the metadata sections. SISL offers many benefits.

• When multiple segments of the same data stream are written to a container together, many fewer disk I/Os are needed to reconstruct the stream, which helps the system achieve high read throughput.

• Descriptors and compressed data of adjacent new segments in the same stream are packed linearly in the metadata and data sections respectively of the same container. This packing captures duplicate locality for future streams resembling this stream, and enables Locality Preserved Caching to work effectively.

• The metadata section is stored separately from the data section, and is generally much smaller than the data section. For example, a container size of 4 MB, an average segment size of 8 KB, and a Ziv-Lempel compression ratio of 2 yield about 1K segments in a container, and require a metadata section size of just about 64 KB at a segment descriptor size of 64 bytes. The small granularity of container metadata section reads allows Locality Preserved Caching to operate in a highly efficient manner: 1K segments can be cached using a single small disk I/O. This contrasts with the old way of one on-disk index lookup per segment.

These advantages make SISL an effective mechanism for deduplicating multiple-stream, fine-grained data segments. Packing containers in a stream-aware fashion distinguishes our system from Venti and many other systems.
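A stream-aware packing loop can be sketched as follows; the class and method names are ours, zlib stands in for the Ziv-Lempel variant, and the Container Manager is assumed only to expose an append of (metadata, data). The closing arithmetic reproduces the metadata-size example from the last bullet above:

    import zlib
    from dataclasses import dataclass, field
    from typing import Dict, List

    CONTAINER_DATA_BYTES = 4 * 1024 * 1024       # illustrative 4 MB container

    @dataclass
    class OpenContainer:
        stream_id: int
        descriptors: List[bytes] = field(default_factory=list)   # metadata section, logical order
        raw_segments: List[bytes] = field(default_factory=list)  # data section, logical order

    class SegmentStoreSketch:
        def __init__(self, container_manager):
            self.cm = container_manager                 # assumed: cm.append(descriptors, data)
            self.open: Dict[int, OpenContainer] = {}    # one open container per active stream

        def write(self, stream_id: int, fingerprint: bytes, segment: bytes):
            c = self.open.setdefault(stream_id, OpenContainer(stream_id))
            c.descriptors.append(fingerprint)
            c.raw_segments.append(segment)
            if sum(len(s) for s in c.raw_segments) >= CONTAINER_DATA_BYTES:
                data = zlib.compress(b"".join(c.raw_segments))   # stand-in for Ziv-Lempel
                self.cm.append(c.descriptors, data)              # append the full container
                self.open[stream_id] = OpenContainer(stream_id)  # start a fresh one for this stream

    # Metadata-size arithmetic from the bullet above: ~1K segments per container,
    # whose 64-byte descriptors fit in about 64 KB of metadata.
    segs = (4 * 1024 * 1024) * 2 // (8 * 1024)
    print(segs, "segments,", segs * 64 // 1024, "KB of metadata")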

4.3 Locality Preserved Caching

We use Locality Preserved Caching (LPC) to accelerate the process of identifying duplicate segments.

A traditional cache does not work well for caching fingerprints, hashes, or descriptors for duplicate detection because fingerprints are essentially random. Since it is difficult to predict the index location for the next segment without going through the actual index access again, the miss ratio of a traditional cache will be extremely high.

We apply LPC to take advantage of segment duplicate locality so that if a segment is a duplicate, the base segment is highly likely cached already. LPC is achieved by combining the container abstraction with a segment cache, as discussed next.

For segments that cannot be resolved by the Summary Vector and LPC, we resort to looking up the segment in the segment index. We have two goals for this retrieval:

• Making this retrieval a relatively rare occurrence.

• Whenever the retrieval is made, it benefits segment filtering of future segments in the locale.

LPC implements a segment cache to cache likely base segment descriptors for future duplicate segments. The segment cache maps a segment fingerprint to its corresponding container ID. Our main idea is to maintain the segment cache by groups of fingerprints. On a miss, LPC will fetch the entire metadata section of a container, insert all fingerprints in the metadata section into the cache, and remove all fingerprints of an old metadata section from the cache together. This method preserves the locality of the fingerprints of a container in the cache.

The operations for the segment cache are:

• Init(): Initialize the segment cache.

• Insert(container): Iterate through all segment descriptors in the container metadata section, and insert each descriptor and container ID into the segment cache.

• Remove(container): Iterate through all segment descriptors in the container metadata section, and remove each descriptor and container ID from the segment cache.

• Lookup(fingerprint): Find the corresponding container ID for the fingerprint specified.

Descriptors of all segments in a container are added or removed from the segment cache at once. Segment caching is typically triggered by a duplicate segment that misses in the segment cache, and requires a lookup in the segment index. As a side effect of finding the corresponding container ID in the segment index, we prefetch all segment descriptors in this container into the segment cache. We call this Locality Preserved Caching. The intuition is that base segments in this container are likely to be checked against for future duplicate segments, based on segment duplicate locality. Our results on real world data have validated this intuition overwhelmingly.

We have implemented the segment cache using a hash table. When the segment cache is full, containers that are ineffective in accelerating segment filtering are leading candidates for replacement from the segment cache. A reasonable cache replacement policy is Least-Recently-Used (LRU) on cached containers.

4.4 Accelerated Segment Filtering

We have combined all three techniques above in the segment filtering phase of our implementation.

For an incoming segment to be written, the algorithm does the following:

• Checks to see if it is in the segment cache. If it is in the cache, the incoming segment is a duplicate.

• If it is not in the segment cache, checks the Summary Vector. If it is not in the Summary Vector, the segment is new. Write the new segment into the current container.

• If it is in the Summary Vector, looks up the segment index for its container ID. If it is in the index, the incoming segment is a duplicate; insert the metadata section of that container into the segment cache. If the segment cache is full, remove the metadata section of the least recently used container first.

• If it is not in the segment index, the segment is new. Write the new segment into the current container.

We aim to keep segment index lookups to a minimum in segment filtering.
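The decision procedure just described can be sketched as follows; the Summary Vector, segment index, and Container Manager interfaces are simplified stand-ins, the container-granularity LRU is shown with an OrderedDict, and writing the new segment and updating the Summary Vector are omitted:

    from collections import OrderedDict

    class SegmentFilterSketch:
        """Combines the Summary Vector, the segment cache (LPC), and the
        on-disk segment index, along the lines of Section 4.4."""

        def __init__(self, summary_vector, segment_index, container_manager,
                     max_containers=1000):
            self.sv = summary_vector       # lookup(fingerprint) -> bool
            self.index = segment_index     # on-disk: lookup(fingerprint) -> container_id or None
            self.cm = container_manager    # read_metadata(container_id) -> list of fingerprints
            self.cache = {}                # fingerprint -> container_id
            self.lru = OrderedDict()       # container_id -> fingerprints, in LRU order
            self.max_containers = max_containers

        def _insert_container(self, cid):
            fps = self.cm.read_metadata(cid)            # one small read caches ~1K fingerprints
            if len(self.lru) >= self.max_containers:    # evict the LRU container as a whole
                _, old_fps = self.lru.popitem(last=False)
                for fp in old_fps:
                    self.cache.pop(fp, None)
            self.lru[cid] = fps
            for fp in fps:
                self.cache[fp] = cid

        def is_duplicate(self, fp) -> bool:
            if fp in self.cache:                        # segment cache hit: duplicate
                self.lru.move_to_end(self.cache[fp])
                return True
            if not self.sv.lookup(fp):                  # definitely new; no index lookup needed
                return False
            cid = self.index.lookup(fp)                 # the (rare) on-disk index lookup
            if cid is None:                             # Summary Vector false positive: new
                return False
            self._insert_container(cid)                 # prefetch that container's metadata section
            return True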
5 Experimental Results

We would like to answer the following questions:

• How well does the deduplication storage system work with real world datasets?

• How effective are the three techniques in terms of reducing disk I/O operations?

• What throughput can a deduplication storage system using these techniques achieve?

For the first question, we report our results with real world data from two customer data centers. For the next two questions, we conducted experiments with several internal datasets. Our experiments use a Data Domain DD580 deduplication storage system as an NFS v3 server [PJSS*94]. This deduplication system features two-socket dual-core CPUs running at 3 GHz, a total of 8 GB system memory, 2 gigabit NIC cards, and a 15-drive disk subsystem running software RAID-6 with one spare drive. We use 1 and 4 backup client computers running NFS v3 clients for sending data.

5.1 Results with Real World Data

The system described in this paper has been used at over 1,000 data centers. The following paragraphs report the deduplication results from two data centers, generated from the auto-support mechanism of the system.

Figure 4: Logical/Physical Capacities at Data Center A

Figure 5: Compression Ratios at Data Center A

                            Min     Max     Average   Standard deviation
Daily global compression    10.05   74.31   40.63     13.73
Daily local compression     1.58    1.97    1.78      0.09

Table 1: Statistics on Daily Global and Daily Local Compression Ratios at Data Center A

Data center A backs up structured database data over the course of 31 days during the initial deployment of a deduplication system. The backup policy is to do daily full backups, where each full backup produces over 600 GB at steady state. There are two exceptions:

• During the initial seeding phase (until the 6th day in this example), different data or different types of data are rolled into the backup set, as backup administrators figure out how they want to use the deduplication system. A low rate of duplicate segment identification and elimination is typically associated with the seeding phase.

• There are certain days (the 18th day in this example) when no backup is generated.

Figure 4 shows the logical capacity (the amount of data from the user or backup application perspective) and the physical capacity (the amount of data stored on disk media) of the system over time at data center A.

At the end of the 31st day, the data center has backed up about 16.9 TB, and the corresponding physical capacity is less than 440 GB, reaching a total compression ratio of 38.54 to 1.

Figure 5 shows daily global compression ratio (the daily rate of data reduction due to duplicate segment elimination), daily local compression ratio (the daily rate of data reduction due to Ziv-Lempel style compression on new segments), cumulative global compression ratio (the cumulative ratio of data reduction due to duplicate segment elimination), and cumulative total compression ratio (the cumulative ratio of data reduction due to duplicate segment elimination and Ziv-Lempel style compression on new segments) over time.

At the end of the 31st day, the cumulative global compression ratio reaches 22.53 to 1, and the cumulative total compression ratio reaches 38.54 to 1.

The daily global compression ratios change quite a bit over time, whereas the daily local compression ratios are quite stable. Table 1 summarizes the minimum, maximum, average, and standard deviation of both daily global and daily local compression ratios, excluding the seeding days (the first 6) and the no-backup day (the 18th).

Data center B backs up a mixture of structured database and unstructured file system data over the course of 48 days during the initial deployment of a deduplication system using both full and incremental backups. Similar to that in data center A, seeding lasts until the 6th day, and there are a few days without backups (the 8th, 12th-14th, and 35th days). Outside these days, the maximum daily logical backup size is about 2.1 TB, and the smallest size is about 50 GB.

Figure 6 shows the logical capacity and the physical capacity of the system over time at data center B. At the end of the 48th day, the logical capacity reaches about 41.4 TB, and the corresponding physical capacity is about 3.0 TB. The total compression ratio is 13.71 to 1.

Figure 7 shows daily global compression ratio, daily local compression ratio, cumulative global compression ratio, and cumulative total compression ratio over time. At the end of the 48th day, cumulative global compression reaches 6.85, while cumulative total compression reaches 13.71.

Figure 6: Logical/Physical Capacities at Data Center B.

Figure 7: Compression Ratios at Data Center B.

                            Min     Max     Average   Standard deviation
Daily global compression    5.09    45.16   13.92     9.08
Daily local compression     1.40    4.13    2.33      0.57

Table 2: Statistics on Daily Global and Daily Local Compression Ratios at Data Center B

                                                       Exchange data   Engineering data
Logical capacity (TB)                                  2.76            2.54
Physical capacity after deduplicating segments (TB)    0.49            0.50
Global compression                                     5.69            5.04
Physical capacity after local compression (TB)         0.22            0.261
Local compression                                      2.17            1.93
Total compression                                      12.36           9.75

Table 3: Capacities and Compression Ratios on Exchange and Engineering Datasets

Table 2 summarizes the minimum, maximum, average, and standard deviation of both daily global and daily local compression ratios, excluding seeding and days without backup.

The two sets of results show that the deduplication storage system works well with the real world datasets. As expected, both cumulative global and cumulative total compression ratios increase as the system holds more backup data.

During seeding, duplicate segment elimination tends to be ineffective, because most segments are new. After seeding, despite the large variation in the actual numbers, duplicate segment elimination becomes extremely effective. Independent of seeding, Ziv-Lempel style compression is relatively stable, giving a reduction of about 2 over time. The real world observations on the applicability of duplicate segment elimination during seeding and after seeding are particularly relevant in evaluating our techniques to reduce disk accesses below.

5.2 I/O Savings with Summary Vector and Locality Preserved Caching

To determine the effectiveness of the Summary Vector and Locality Preserved Caching, we examine the savings in disk reads needed to find duplicate segments using a Summary Vector and Locality Preserved Caching.

We use two internal datasets for our experiment. One is a daily full backup of a company-wide Exchange information store over a 135-day period. The other is the weekly full and daily incremental backup of an Engineering department over a 100-day period. Table 3 summarizes key attributes of these two datasets.

These internal datasets are generated from production usage (albeit internal). We also observe that the various compression ratios produced by the internal datasets are relatively similar to those of the real world examples examined in section 5.1. We believe these internal datasets are reasonable proxies of real world deployments.

Each of the backup datasets is sent to the deduplicating storage system with a single backup stream. With respect to the deduplication storage system, we measure the number of disk reads for segment index lookups and locality prefetches needed to find duplicates during write for four cases:

(1) with neither Summary Vector nor Locality Preserved Caching;
(2) with Summary Vector only;
(3) with Locality Preserved Caching only; and
(4) with both Summary Vector and Locality Preserved Caching.

The results are shown in Table 4.

Clearly, the Summary Vector and Locality Preserved Caching combined have produced an astounding reduction in disk reads. The Summary Vector alone reduces about 16.5% and 18.6% of the index lookup disk I/Os for the Exchange and Engineering data respectively. Locality Preserved Caching alone reduces about 82.4% and 81% of the index lookup disk I/Os for the Exchange and Engineering data respectively. Together they are able to reduce the index lookup disk I/Os by 98.94% and 99.6% respectively.

In general, the Summary Vector is very effective for new data, and Locality Preserved Caching is highly effective for little or moderately changed data. For backup data, the first full backup (the seeding equivalent) does not have as many duplicate data segments as subsequent full backups. As a result, the Summary Vector is effective at avoiding disk I/Os for index lookups during the first full backup, whereas Locality Preserved Caching is highly beneficial for subsequent full backups. This result also suggests that these two datasets exhibit good duplicate locality.

5.3 Throughput

To determine the throughput of the deduplication storage system, we used a synthetic dataset driven by client computers. The synthetic dataset was developed to model backup data from multiple backup cycles from multiple backup streams, where each backup stream can be generated on the same or a different client computer. The dataset is made up of synthetic data generated on the fly from one or more backup streams. Each backup stream is made up of an ordered series of synthetic data versions, where each successive version ("generation") is a somewhat modified copy of the preceding generation in the series. The generation-to-generation modifications include data reordering, deletion of existing data, and addition of new data. Single-client backup over time is simulated when synthetic data generations from a backup stream are written to the deduplication storage system in generation order, where significant amounts of data are unchanged day-to-day or week-to-week, but where small changes continually accumulate. Multi-client backup over time is simulated when synthetic data generations from multiple streams are written to the deduplication system in parallel, each stream in generation order.

There are two main advantages of using the synthetic dataset. The first is that various compression ratios can be built into the synthetic model, and usages approximating various real world deployments can be tested easily in house.

The second is that one can use relatively inexpensive client computers to generate an arbitrarily large amount of synthetic data in memory without disk I/Os and write it in one stream to the deduplication system at more than 100 MB/s. Multiple cheap client computers can combine in multiple streams to saturate the intake of the deduplication system in a switched network environment. We find it both much more costly and technically challenging to accomplish the same feat using traditional backup software, high-end client computers attached to primary storage arrays as backup clients, and high-end servers as media/backup servers.

In our experiments, we choose an average generation (daily equivalent) global compression ratio of 30, and an average generation (daily equivalent) local compression ratio of 2 to 1 for each backup stream. These compression numbers seem possible given the real world examples in section 5.1.

                                                       Exchange data              Engineering data
                                                       # disk I/Os    % of total  # disk I/Os    % of total
No Summary Vector and no Locality Preserved Caching    328,613,503    100.00%     318,236,712    100.00%
Summary Vector only                                    274,364,788     83.49%     259,135,171     81.43%
Locality Preserved Caching only                         57,725,844     17.57%      60,358,875     18.97%
Summary Vector and Locality Preserved Caching            3,477,129      1.06%       1,257,316      0.40%

Table 4: Index and locality reads. This table shows the number of disk reads performed for index lookups or fetches from the container metadata for the four combinations: with and without the Summary Vector, and with and without Locality Preserved Caching. Without either the Summary Vector or Locality Preserved Caching, there is an index read for every segment. The Summary Vector avoids these reads for most new segments. Locality Preserved Caching avoids index lookups for duplicate segments at the cost of an extra read to fetch a group of segment fingerprints from the container metadata for every cache miss for which the segment is found in the index.

Figure 8: Write Throughput of Single Backup Client and 4 Backup Clients.

Figure 9: Read Throughput of Single Backup Client and 4 Backup Clients.

backup stream using one client computer and 4 backup streams using two client computers, for both write and read, over 10 generations of the backup datasets. The results are shown in Figures 8 and 9.

The deduplication system delivers high write throughput in both cases. In the single-stream case, the system achieves write throughput of 110 MB/sec for generation 0 and over 113 MB/sec for generations 1 through 9. In the 4-stream case, the system achieves write throughput of 139 MB/sec for generation 0 and a sustained 217 MB/sec for generations 1 through 9.

Write throughput for generation 0 is lower because all segments are new and require Ziv-Lempel style compression by the CPU of the deduplication system.

The system also delivers high read throughput in the single-stream case: throughout all generations, it achieves over 100 MB/sec. In the 4-stream case, read throughput is 211 MB/sec for generation 0, 192 MB/sec for generation 1, 165 MB/sec for generation 2, and stays at around 140 MB/sec for later generations. The main reason for the decrease in read throughput in the later generations is that they contain more duplicate data segments than the first few. Even so, read throughput holds at about 140 MB/sec because of Stream-Informed Segment Layout and Locality Preserved Caching.

Note that write throughput has historically been valued more than read throughput for the backup use case, since a backup must complete within a specified backup window and backups are much more frequent than restores. Read throughput is still very important, especially in the case of whole-system restores.

5.4 Discussion

The techniques presented in this paper are general methods to improve the throughput performance of deduplication storage systems. Although our system divides a data stream into content-based segments, these methods can also apply to systems using fixed, aligned segments such as Venti.

As a side note, we have compared the compression ratios of a system segmenting data streams by content (about 8 Kbytes per segment on average) with another system using fixed, aligned 8 Kbyte segments on the engineering and exchange backup datasets. We found that the fixed-alignment approach gets basically no global compression (global compression: 1.01) for the engineering data, whereas the system with content-based segmentation gets a lot of global compression (6.39:1). The main reason for the difference is that the backup software creates the backup dataset without realigning data at file boundaries. For the exchange backup dataset, where the backup software aligns data at individual mailboxes, the global compression difference is smaller (6.61:1 vs. 10.28:1), though there is still a significant gap. A minimal sketch of content-defined segmentation appears at the end of this subsection.

Fragmentation will become more severe for long-term retention and can reduce the effectiveness of Locality Preserved Caching. We have investigated mechanisms to reduce fragmentation and sustain high write and read throughput, but these mechanisms are beyond the scope of this paper.
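To make the comparison above concrete, the following minimal Python sketch illustrates content-defined segmentation: a boundary is declared wherever the low bits of a rolling hash over a small window are zero, so boundaries move with the content rather than with byte offsets. The window size, mask, and segment-size bounds are illustrative assumptions, and the simple polynomial rolling hash stands in for the Rabin fingerprints a production system would use; this is a sketch of the general technique, not our implementation.

    import random

    # Content-defined segmentation sketch (illustrative parameters only).
    B, MOD, W = 257, (1 << 31) - 1, 48          # hash base, modulus, window size in bytes
    BW = pow(B, W, MOD)                         # B^W, used to roll the oldest byte out
    MASK = 0x1FFF                               # expect roughly 8 Kbyte average segments
    MIN_SEG, MAX_SEG = 2 * 1024, 64 * 1024      # bound segment sizes

    def segments(data: bytes):
        """Yield content-defined segments of `data`."""
        start, h = 0, 0
        for i in range(len(data)):
            h = (h * B + data[i]) % MOD           # add the newest byte to the window hash
            if i >= W:
                h = (h - data[i - W] * BW) % MOD  # drop the byte leaving the window
            length = i - start + 1
            if (length >= MIN_SEG and (h & MASK) == 0) or length >= MAX_SEG:
                yield data[start:i + 1]           # boundary depends only on local content
                start = i + 1
        if start < len(data):
            yield data[start:]                    # final partial segment

    # Inserting a few bytes at the front shifts every byte offset, yet most
    # content-defined segments (and hence their fingerprints) are unchanged;
    # fixed-size 8 Kbyte blocks would almost all change.
    random.seed(0)
    base = bytes(random.randrange(256) for _ in range(200_000))
    shifted = b"insert" + base
    common = set(segments(base)) & set(segments(shifted))
    print(len(common), "of", len(set(segments(base))), "segments survive the shift")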

6 Related Work

Much work on deduplication has focused on basic methods and compression ratios, not on high throughput. Early deduplication storage systems use file-level hashing to detect duplicate files and reclaim their storage space [ABCC*02, TKSK*03, KDLT04]. Since such systems also use file hashes to address files, some call them content addressed storage, or CAS. Because their deduplication is at the file level, such systems can achieve only limited global compression.

Venti removes duplicate fixed-size data blocks by comparing their secure hashes [QD02]. It uses a large on-disk index with a straightforward index cache to look up fingerprints. Since fingerprints have no locality, its index cache is not effective: even when using 8 disks to look up fingerprints in parallel, its throughput is limited to less than 7 MB/sec. Venti used a container abstraction to lay out data on disks, but it was stream agnostic and did not apply Stream-Informed Segment Layout.

To tolerate shifted contents, modern deduplication systems remove redundancies at variable-size data blocks divided based on their contents. Manber described a method that determines anchor points of a large file where certain bits of rolling fingerprints are zero [Man93] and showed that Rabin fingerprints [Rab81, Bro93] can be computed efficiently. Brin et al. [BDH94] described several ways to divide a file into content-based data segments and use such segments to detect duplicates in digital documents. Removing duplication at the content-based data segment level has been applied to network protocols and applications [SW00, SCPC*02, RLB03, MCK04] and has reduced network traffic for distributed file systems [MCM01, JDT05]. Kulkarni et al. evaluated the compression efficiency of an identity-based approach (fingerprint comparison of variable-length segments) against a delta-compression approach [KDLT04]. These studies have not addressed deduplication throughput issues.

The idea of using a Bloom filter [Blo70] to implement the Summary Vector is inspired by the summary data structure for the proxy cache in [FCAB98], which also provided an analysis of the false positive rate. In addition, Broder and Mitzenmacher wrote an excellent survey on network applications of Bloom filters [BM05]. The TAPER system used a Bloom filter to detect duplicates rather than to detect whether a segment is new [JDT05]; it did not investigate throughput issues.
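As a concrete illustration of this use of Bloom filters, the sketch below shows a toy summary structure that answers either "definitely new" or "possibly stored" for a segment fingerprint; when the answer is "definitely new", no on-disk index lookup is needed at all. The bit-array size, the number of hash functions, and the SHA-1-derived bit positions are illustrative assumptions, not the parameters or implementation of the Summary Vector itself.

    import hashlib

    class ToySummaryVector:
        """Bloom filter over segment fingerprints (illustrative only)."""

        def __init__(self, num_bits=1 << 20, num_hashes=4):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)   # all bits start cleared

        def _positions(self, fingerprint: bytes):
            # Derive k bit positions by hashing the fingerprint with a
            # counter; any family of independent-looking hashes would do.
            for i in range(self.num_hashes):
                digest = hashlib.sha1(fingerprint + bytes([i])).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add(self, fingerprint: bytes) -> None:
            for pos in self._positions(fingerprint):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, fingerprint: bytes) -> bool:
            # False means the segment is definitely new, so the on-disk
            # index need not be consulted. True may be a false positive,
            # so a cache or index lookup is still required before
            # declaring a duplicate.
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(fingerprint))

A deduplicating write path would call might_contain on each incoming segment's fingerprint and, on a False result, store the segment and add its fingerprint, skipping the disk index entirely for that segment.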
7 Conclusions

This paper presents a set of techniques that substantially reduce disk I/Os in high-throughput deduplication storage systems.

Our experiments show that the combination of these techniques can achieve over 210 MB/sec for 4 write data streams and over 140 MB/sec for 4 read data streams on a storage server with two dual-core processors and one shelf of 15 drives.

We have shown that the Summary Vector can reduce disk index lookups by about 17% and Locality Preserved Caching can reduce disk index lookups by over 80%, while the combined techniques can reduce disk index lookups by about 99%.

Stream-Informed Segment Layout is an effective abstraction to preserve spatial locality and enable Locality Preserved Caching; a brief sketch of this interaction follows these conclusions.

These techniques are general methods to improve the throughput performance of deduplication storage systems. Our techniques for minimizing disk I/Os to achieve good deduplication performance match well with the industry trend of building many-core processors. With quad-core CPUs already available, and eight-core CPUs just around the corner, it will be a relatively short time before a large-scale deduplication storage system appears with 400 to 800 MB/sec throughput and a modest amount of physical memory.
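To show how the two techniques fit together, the following sketch caches fingerprints at container granularity: on a miss that the on-disk index resolves, the entire fingerprint group of the matching container, written out together by Stream-Informed Segment Layout, is pulled into an LRU cache so that the duplicates that follow in the stream hit in RAM. The on_disk_index and container_store interfaces and the eviction policy are simplified assumptions for illustration, not the system's actual data structures.

    from collections import OrderedDict

    class ToyLocalityPreservedCache:
        """Caches fingerprint groups at container granularity (illustrative only)."""

        def __init__(self, max_containers=64):
            self.max_containers = max_containers
            self.groups = OrderedDict()   # container_id -> set of fingerprints, LRU order
            self.resident = {}            # fingerprint -> container_id

        def lookup(self, fingerprint, on_disk_index, container_store):
            """Return the container holding `fingerprint`, or None if it is new."""
            cid = self.resident.get(fingerprint)
            if cid is not None:
                self.groups.move_to_end(cid)          # keep hot containers cached
                return cid
            cid = on_disk_index.get(fingerprint)      # the rare on-disk index access
            if cid is None:
                return None                           # genuinely new segment
            # Locality Preserved Caching: fetch the whole fingerprint group
            # that was laid out together, not just the one fingerprint.
            self._insert_group(cid, container_store.fingerprints(cid))
            return cid

        def _insert_group(self, cid, fingerprints):
            if len(self.groups) >= self.max_containers:
                _, evicted = self.groups.popitem(last=False)   # evict least recently used
                for fp in evicted:
                    self.resident.pop(fp, None)
            self.groups[cid] = set(fingerprints)
            for fp in fingerprints:
                self.resident[fp] = cid

Because a backup stream tends to revisit segments in nearly the order they were first written, one such group fetch typically services a long run of subsequent duplicate lookups from memory.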

8 References

[ABCC*02] A. Adya, W. J. Bolosky, M. Castro, G. Cermak, R. Chaiken, J. R. Douceur, J. Howell, J. R. Lorch, M. Theimer, and R. P. Wattenhofer. FARSITE: Federated, available, and reliable storage for an incompletely trusted environment. In Proceedings of USENIX Operating Systems Design and Implementation (OSDI), December 2002.
[BDH94] S. Brin, J. Davis, and H. Garcia-Molina. Copy Detection Mechanisms for Digital Documents. 1994; also in Proceedings of ACM SIGMOD, 1995.
[Blo70] Burton H. Bloom. Space/time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM, 13(7):422-426, 1970.
[BM05] Andrei Z. Broder and Michael Mitzenmacher. Network Applications of Bloom Filters: A Survey. Internet Mathematics, 2005.
[Dat05] Data Domain. Data Domain Appliance Series: High-Speed Inline Deduplication Storage, 2005. http://www.datadomain.com/products/appliances.html
[FCAB98] Li Fan, Pei Cao, Jussara Almeida, and Andrei Z. Broder. Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol. In Proceedings of ACM SIGCOMM '98, Vancouver, Canada, 1998.
[JDT05] N. Jain, M. Dahlin, and R. Tewari. TAPER: Tiered Approach for Eliminating Redundancy in Replica Synchronization. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST), 2005.
[KDLT04] P. Kulkarni, F. Douglis, J. D. LaVoie, and J. M. Tracey. Redundancy Elimination Within Large Collections of Files. In Proceedings of the USENIX Annual Technical Conference, pages 59-72, 2004.
[Man93] Udi Manber. Finding Similar Files in a Large File System. Technical Report TR 93-33, Department of Computer Science, University of Arizona, October 1993; also in Proceedings of the USENIX Winter 1994 Technical Conference, pages 17-21, 1994.
[MCK04] J. C. Mogul, Y.-M. Chan, and T. Kelly. Design, implementation, and evaluation of duplicate transfer detection in HTTP. In Proceedings of Networked Systems Design and Implementation (NSDI), 2004.
[MCM01] Athicha Muthitacharoen, Benjie Chen, and David Mazières. A Low-bandwidth Network File System. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP), Banff, Canada, October 2001.
[NIST95] National Institute of Standards and Technology. FIPS 180-1: Secure Hash Standard. US Department of Commerce, April 1995.
[PJSS*94] B. Pawlowski, C. Juszczak, P. Staubach, C. Smith, D. Lebel, and D. Hitz. NFS Version 3 Design and Implementation. In Proceedings of the USENIX Summer 1994 Technical Conference, 1994.
[QD02] S. Quinlan and S. Dorward. Venti: A New Approach to Archival Storage. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST), January 2002.
[RLB03] S. C. Rhea, K. Liang, and E. Brewer. Value-based web caching. In Proceedings of WWW, pages 619-628, 2003.
[SCPC*02] C. P. Sapuntzakis, R. Chandra, B. Pfaff, J. Chow, M. S. Lam, and M. Rosenblum. Optimizing the migration of virtual computers. In Proceedings of USENIX Operating Systems Design and Implementation (OSDI), 2002.
[SW00] N. T. Spring and D. Wetherall. A protocol-independent technique for eliminating redundant network traffic. In Proceedings of ACM SIGCOMM, pages 87-95, August 2000.
[TKSK*03] N. Tolia, M. Kozuch, M. Satyanarayanan, B. Karp, A. Perrig, and T. Bressoud. Opportunistic use of content addressable storage for distributed file systems. In Proceedings of the 2003 USENIX Annual Technical Conference, pages 127-140, San Antonio, TX, June 2003.
[YPL05] L. L. You, K. T. Pollack, and D. D. E. Long. Deep Store: An archival storage system architecture. In Proceedings of the IEEE International Conference on Data Engineering (ICDE '05), April 2005.
[ZL77] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, IT-23:337-343, May 1977.
