Primary Data Deduplication – Large Scale Study and System Design

Ahmed El-Shimi, Ran Kalach, Ankit Kumar, Adi Oltean, Jin Li, Sudipta Sengupta
Microsoft Corporation, Redmond, WA, USA

Abstract

We present a large scale study of primary data deduplication and use the findings to drive the design of a new primary data deduplication system implemented in the Windows Server 2012 operating system. File data was analyzed from 15 globally distributed file servers hosting data for over 2000 users in a large multinational corporation.

The findings are used to arrive at a chunking and compression approach which maximizes deduplication savings while minimizing the generated metadata and producing a uniform chunk size distribution. Scaling of deduplication processing with data size is achieved using a RAM frugal chunk hash index and data partitioning – so that memory, CPU, and disk seek resources remain available to fulfill the primary workload of serving IO.

We present the architecture of a new primary data deduplication system and evaluate the deduplication performance and chunking aspects of the system.

1 Introduction

Rapid growth in data and associated costs has motivated the need to optimize storage and transfer of data. Deduplication has proven a highly effective technology in eliminating redundancy in data. Deduplication's next challenge is its application to primary data - data which is created, accessed, and changed by end-users, like user documents, file shares, or collaboration data. Deduplication is challenging in that it requires computationally costly processing and segmentation of data into small multi-kilobyte chunks. The result is large metadata which needs to be indexed for efficient lookups, adding memory and throughput constraints.

Primary Data Deduplication Challenges. When applied to primary data, deduplication has additional challenges. First, the expectation of high duplication ratios no longer holds, as data composition and growth is not driven by a regular backup cycle. Second, access to data is driven by a constant primary workload, thus heightening the impact of any degradation in data access performance due to metadata processing or on-disk fragmentation resulting from deduplication. Third, deduplication must limit its use of system resources such that it does not impact the performance or scalability of the primary workload running on the system. This is paramount when applying deduplication on a broad platform where a variety of workloads and services compete for system resources and no assumptions of dedicated hardware can be made.

Our Contributions. We present a large and diverse study of primary data duplication in file-based server data. Our findings are as follows: (i) Sub-file deduplication is significantly more effective than whole-file deduplication, (ii) High deduplication savings previously obtained using small ∼4KB variable length chunking are achievable with 16-20x larger chunks, after chunk compression is included, (iii) Chunk compressibility distribution is skewed with the majority of the benefit of compression coming from a minority of the data chunks, and (iv) Primary datasets are amenable to partitioned deduplication with comparable space savings and the partitions can be easily derived from native file metadata within the dataset. Conversely, cross-server deduplication across multiple datasets surprisingly yields minor additional gains.

Finally, we present and evaluate salient aspects of a new primary data deduplication system implemented in the Windows Server 2012 operating system. We focus on addressing the challenge of scaling deduplication processing resource usage with data size such that memory, CPU, and disk seek resources remain available to fulfill the primary workload of serving IO. The design aspects of our system related to primary data serving (beyond a brief overview) have been left out to meet the paper length requirements. The detailed presentation and evaluation of those aspects is planned as a future paper.

Paper Organization. The rest of this paper is structured as follows. Section 2 provides background on deduplication and related work. Section 3 details data collection, analysis and findings.
Section 4 provides an overview of the system architecture and covers deduplication processing aspects of our system involving data chunking, chunk indexing, and data partitioning. Section 5 highlights key performance evaluations and results involving these three aspects. We summarize in Section 6.

2 Background and Related Work

Duplication of data hosted on servers occurs for various reasons. For example, files are copied and potentially modified by one or more users resulting in multiple fully or partially identical files. Different documents may share embedded content (e.g., an image) or multiple virtual disk files may exist each hosting identical OS and application files. Deduplication works by identifying such redundancy and transparently eliminating it.

Related Work. Primary data deduplication has received recent interest in storage research and industry. We review related work in the area of data deduplication, most of which was done in the context of backup data deduplication.

Data chunking: Deduplication systems differ in the granularity at which they detect duplicate data. Microsoft Storage Server [5] and EMC's Centera [12] use file level duplication, LBFS [18] uses variable-sized data chunks obtained using Rabin fingerprinting [22], and Venti [21] uses individual fixed size disk blocks. Among content-dependent data chunking methods, Two-Threshold Two-Divisor (TTTD) [13] and the bimodal chunking algorithm [14] produce variable-sized chunks. Winnowing [23] has been used as a document fingerprinting technique for identifying copying within large sets of documents.

Backup data deduplication: Zhu et al. [26] describe an inline backup data deduplication system. They use two techniques to reduce lookups on the disk-based chunk index. First, a bloom filter [8] is used to track existing chunks in the system so that disk lookups are avoided on non-existing chunks. Second, portions of the disk-based chunk index are prefetched into RAM to exploit sequential predictability of chunk lookups across successive backup streams. Lillibridge et al. [16] use sparse indexing to reduce in-memory index size at the cost of sacrificing deduplication quality. HYDRAstor [11] is a distributed backup storage system which is content-addressable and implements global data deduplication, and serves as a back-end for NEC's primary solutions.

Primary data deduplication: DEDE [9] is a decentralized host-driven block-level deduplication system designed for SAN clustered file systems for VMware ESX Server. Sun's ZFS [6] and Linux SDFS [1] provide inline block-level deduplication. NetApp's solution [2] is also at the block level, with both inline and post-processing options. Ocarina [3] and Permabit [4] solutions use variable size data chunking and provide inline and post-processing deduplication options. iDedup [24] is a recently published primary data deduplication system that uses inline deduplication and trades off capacity savings (by as much as 30%) for performance.
3 Data Collection, Analysis, and Findings

We begin with a description of datasets collected and analysis methodology and then move on to key findings.

3.1 Methodology

We selected 15 globally distributed servers in a large multinational corporation. Servers were selected to reflect the following variety of file-based workload data seen in the enterprise:

• Home Folder servers host the file contents of user home folders (Documents, Photos, Music, etc.) of multiple individual users. Each file in this workload is typically created, modified, and accessed by a single user.

• Group File Shares host a variety of shared files used within workgroups of users. Each file in this workload is typically created and modified by a single user but accessed by many users.

• Sharepoint Servers host collaboration office document content within workgroups. Each file in this workload is typically modified and accessed by many users.

• Software Deployment shares host OS and application deployment binaries or packed container files or installer files containing such binaries. Each file in this workload is typically created once by an administrator and accessed by many users.

• Virtualization Libraries are file shares containing virtualization image files used for provisioning of virtual machines to hypervisor hosts. Each file in this workload is typically created and updated by an administrator and accessed by many users.

Workload                          Srvrs  Users  Total Data  Locations
Home Folders (HF)                 8      1867   2.4TB       US, Dublin, Amsterdam, Japan
Group File Shares (GFS)           3      *      3TB         US, Japan
Sharepoint                        1      500    288GB       US
Software Deployment Shares (SDS)  1      †      399GB       US
Virtualization Libraries (VL)     2      †      791GB       US
Total                             15            6.8TB

Table 1: Datasets used for deduplication analysis. * Number of authors (users) assumed in 100s but not quantifiable due to delegated write access. † Number of authors (users) limited to < 10 server administrators.

Table 1 outlines the number of servers, users, location, and total data size for each studied workload. The count of users reflects the number of content authors, rather than consumers, on the server.

Files on each server were chunked using a Rabin fingerprint [22] based variable sized chunker at different chunk sizes and each chunk was hashed using a SHA-1 [19] hash function. Each chunk was then compressed using gzip compression [25]. The resulting chunk size (before and after compression), chunk offset, SHA-1 hash, and file information for each chunk were logged and imported into a database for analysis.

To allow for detailed analysis of the effects of different chunking and compression techniques on specific file types, file group specific datasets were extracted from the largest dataset (GFS-US) to represent Audio-Video, Images, Office-2003, Office-2007, PDF and VHD (Virtualization) file types.
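For concreteness, the following Python sketch illustrates the kind of per-chunk record this collection step produces. It is an illustration only: the chunker argument, the field names, and the use of zlib in place of gzip are assumptions, not the study's actual tooling.

    import hashlib, zlib

    def analyze_file(path, chunker):
        """Illustrative sketch of the per-chunk records logged for analysis.
        'chunker' is assumed to yield raw chunk bytes (e.g., from a Rabin-based
        variable-size chunker as used in the study)."""
        with open(path, 'rb') as f:
            data = f.read()
        records, offset = [], 0
        for chunk in chunker(data):
            compressed = zlib.compress(chunk)      # stand-in for gzip compression
            records.append({
                'file': path,
                'offset': offset,
                'size': len(chunk),
                'compressed_size': len(compressed),
                'sha1': hashlib.sha1(chunk).hexdigest(),
            })
            offset += len(chunk)
        return records   # rows to be imported into the analysis database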

3.2 Key Findings

Whole-file vs. sub-file dedup: In Table 2, we compare whole-file and sub-file chunk-level deduplication. Whole-file deduplication has been considered earlier for primary data in the form of single instance storage [7]. The study in Meyer et al. [17], involving full desktop file systems in an enterprise setting, found that file-level deduplication achieves 75% of the savings of chunk-level deduplication. Table 2 shows a significant difference in space savings for sub-file dedup over whole-file dedup, ranging from 2.3x to 15.8x to ∞ (the latter in cases where whole file dedup gives no space savings). The difference in our results from Meyer's [17] is likely due to the inclusion of OS and application binary files, which exhibit high whole file duplication, in the earlier study [17], while this study focuses on user authored files which are more likely to exhibit sub-file duplication.

               Dedup Space Savings
Dataset        File Level  Chunk Level  Gap
VL             0.0%        92.0%        ∞
GFS-Japan-1    2.6%        41.1%        15.8x
GFS-Japan-2    13.7%       39.1%        2.9x
HF-Amsterdam   1.9%        15.2%        8x
HF-Dublin      6.7%        16.8%        2.5x
HF-Japan       4.0%        19.6%        4.9x
GFS-US         15.9%       36.7%        2.3x
Sharepoint     3.1%        43.8%        14.1x

Table 2: Improvement in space savings with chunk-level dedup vs. file-level dedup as a fraction of the original dataset size. Effect of chunk compression is not included. (The average chunk size is 64KB.)

Figure 1: CDF of original (un-deduplicated) files by bytes, as a function of the fraction of chunks in a file that are duplicate, for various datasets.

Figure 1 provides further insight into the big gap in space savings between chunk-level dedup and file-level dedup for several datasets. On the x-axis, we sort the files in increasing order of fraction of chunks that are duplicate (i.e., also occur in some other file). On the y-axis, we plot the CDF of files by bytes. We calculate a CDF of deduplication savings contributed by files sorted as in this graph, and find that the bulk of the duplicate bytes lie between files having 40% to 95% of duplicate chunks. Hence, sub-file dedup is necessary to identify duplicate data that occurs at a granularity smaller than that of whole files.

We also found reduced space savings for chunking with 64KB fixed size blocks when comparing it with ∼64KB variable size chunking (chunk compression included in both cases). This is because chunking with fixed size blocks does not support the identification of duplicate data when its duplicate occurrence is not aligned along block boundaries. Our findings agree with and confirm prior understanding in the literature [18].
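The file-level vs. chunk-level comparison above can be reproduced from such per-chunk logs with a simple calculation; the sketch below is illustrative (the inputs are (hash, size) pairs at either granularity) and is not the study's analysis code.

    def dedup_savings(items):
        """items: iterable of (hash, size) pairs at either file or chunk
        granularity. Savings = 1 - (bytes kept after dedup / original bytes)."""
        total = 0
        unique = {}
        for h, size in items:
            total += size
            unique.setdefault(h, size)   # keep one copy per distinct hash
        kept = sum(unique.values())
        return 1.0 - kept / total if total else 0.0

    # file_level  = dedup_savings(whole_file_hashes)   # one (hash, size) per file
    # chunk_level = dedup_savings(chunk_hashes)        # one (hash, size) per chunk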

Average Chunk Size: Deduplication systems for backup data, such as in [26], use typical average chunk sizes of 4 or 8KB. The use of smaller chunk sizes can give higher space savings, arising from duplicate detection at finer granularities. On the flip side, this is associated with larger index sizes and increased chunk metadata. Moreover, when the chunk is compressed, usually by an LZW-style [25] dictionary based lossless compressor, the smaller chunk size leads to reduced compression performance, as the adaptive dictionary in the lossless compressor has to be reset for each chunk and is not allowed to grow to significant size. The adverse impact of these effects is more pronounced for a primary data deduplication system that is also serving live data.

We resolve the conflict by looking for reasonable tradeoffs between larger average chunk sizes and higher deduplication space savings. In Figure 2, we plot the deduplication space savings with and without chunk compression on the GFS-US dataset when the average chunk size is varied from 4KB to 64KB. We see that the high deduplication savings achieved with 4KB chunk size are attainable with larger 64KB chunks with chunk compression, as the loss in deduplication opportunities arising from use of larger chunk sizes is canceled out by increased compressibility of the larger chunks.

Figure 2: Dedup space savings (%) for chunking with ∼4KB/∼64KB variable chunk sizes, with impact of compression, for GFS-US dataset.

Chunk compressibility: In Figure 3, we examine the distribution of chunks by compression ratio for the GFS-US dataset. We define the compression ratio c as (size of compressed chunk)/(size of original chunk). Therefore, a chunk with c = 0.7 saves 30% of its space when compressed. On the x-axis, we sort and bin unique chunks in the dataset by their compression ratio. On the y-axis, we plot the cumulative distribution of those unique chunks by both count and bytes. We also plot the cumulative distribution of the compression savings (bytes) contributed by those unique chunks across the same compression ratio bins.

Figure 3: Chunk compressibility analysis for GFS-US dataset.

We find significant skew in where the compression savings come from – 50% of the unique chunks are responsible for 86% of the compression savings and those chunks have a compression ratio of 0.5 or lower. On the other hand, we find that roughly 31% of the chunks (42% of the bytes) do not compress at all (i.e., fall at compression ratio = 1.0). This result is consistent across all datasets studied (including subsets of datasets created by grouping files of the same type) and it implies that it is feasible for a primary data deduplication system to be selective when compressing chunks. By checking compression ratios and storing compressed only those chunks whose compression ratios meet a certain threshold, the amount of decompression involved during data access can be reduced (assuming each chunk is equally likely to be accessed, which is at least true for full file read IO requests). This allows for capturing most of the savings of chunk compression while eliminating the cost of decompression on the majority of chunks.
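A minimal sketch of the selective compression policy this finding suggests: compress a chunk, and keep the compressed form only if its compression ratio clears a threshold. The threshold value shown here is an assumed knob, not a value prescribed by the paper.

    import zlib

    COMPRESSION_RATIO_THRESHOLD = 0.8   # assumed policy knob

    def maybe_compress(chunk):
        """Return (data, is_compressed) for storing the chunk in the chunk store."""
        if not chunk:
            return chunk, False
        compressed = zlib.compress(chunk)
        ratio = len(compressed) / len(chunk)      # c = compressed size / original size
        if ratio <= COMPRESSION_RATIO_THRESHOLD:
            return compressed, True               # savings justify decompression on read
        return chunk, False                       # poorly compressible: store as-is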

Chunk duplication analysis: In Figure 4, we examine the distribution of chunk duplication in primary data. On the x-axis, we bin the chunks by the number of times they are duplicated in the dataset in power of 2 intervals. The rightmost bin contains unique chunks. Thereafter, moving towards the left, each bin contains chunks duplicated in the interval [2^i, 2^(i+1)) for i ≥ 1. On the y-axis, the bars represent the total number of unique chunks in each bin and the CDFs show the distribution of unique chunks by both count and bytes across the bins. First, we see that about 50% of the chunks are unique, hence a primary data deduplication system should strive to reduce the overhead for serving unique data. Second, the majority of duplicate bytes reside in the middle portion of the distribution (between duplication bins of count 2 and 32), hence it is not sufficient to just deduplicate the top few bins of duplicated chunks. This points to the design decision of deduplicating all chunks that appear more than once. This trend was consistent across all datasets.

Figure 4: Chunk duplication analysis for GFS-Japan-1 dataset.

Data partitioning: We examine the impact of partitioning on each dataset by measuring the deduplication savings (without compression) within partitions and across partitions. We partition each dataset using two methods, namely (i) partitioning by file type (extension), and (ii) partitioning by directory hierarchy, where partitions correspond to directory subtrees with total bytes at most 10% of the overall namespace. We excluded two datasets (Virtualization and Sharepoint) because all or most files were of one type or had a flat directory structure, so no meaningful partitioning could be done for them. The remaining datasets produced partitions whose size varied from one-third to less than one-tenth of the dataset.

As seen in Table 3, we find the loss in deduplication savings when partitioning by file type is negligible for all datasets. On the other hand, we find partitioning by directory hierarchy to be less effective in terms of dedup space savings.

Since the original datasets were naturally partitioned by server, we then inspect the effect of merging datasets to find out the impact of deduplication savings across servers. In Table 4, we combine datasets both by workload type and location. We find the additional savings of cross-server deduplication to be no more than 1-2% above per-server savings.

This implies that it is feasible for a deduplication system to reduce resource consumption by performing partitioned deduplication while maintaining comparable space savings.
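The file-type partitioning measurement can be expressed over the same per-chunk records; a sketch, assuming the illustrative record fields used earlier ('file', 'sha1', 'size'):

    import os
    from collections import defaultdict

    def partitioned_vs_global_savings(records):
        """Deduplicate within each file-extension partition separately and
        compare against global deduplication over the whole dataset."""
        by_ext = defaultdict(list)
        for r in records:
            ext = os.path.splitext(r['file'])[1].lower()
            by_ext[ext].append((r['sha1'], r['size']))
        total = sum(r['size'] for r in records)
        kept_partitioned = sum(
            sum({h: s for h, s in chunks}.values()) for chunks in by_ext.values())
        kept_global = sum({r['sha1']: r['size'] for r in records}.values())
        return 1 - kept_partitioned / total, 1 - kept_global / total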

               Dedup Space Savings
Dataset        Global  Clustered by   Clustered by
                       file type      file path
GFS-US         36.7%   35.6%          24.3%
GFS-Japan-1    41.1%   38.9%          32.3%
GFS-Japan-2    39.1%   36.7%          24.8%
HF-Amsterdam   15.2%   14.7%          13.6%
HF-Dublin      16.8%   16.2%          14.6%
HF-Japan       19.6%   19.0%          12.9%

Table 3: Space savings for global dedup vs. dedup within partitioned file clusters, by file type or file path similarity. (See text on why some datasets were excluded.)

Dataset                Total    Per-Server  Cross-Server  Cross-Server
                       Size     Dedup       Dedup         Dedup
                                Savings     Savings       Benefit
All Home-Folder Srvrs  2438GB   386GB       424GB         1.56%
All File-Share Srvrs   2897GB   1075GB      1136GB        2.11%
All Japan Srvrs        1436GB   502GB       527GB         1.74%
All US Srvrs           3844GB   1292GB      1354GB        1.61%

Table 4: Dedup savings benefit of cross-server deduplication, as fraction of original dataset size.

4 System Design

We first enumerate key requirements for a primary data deduplication system and discuss some design implications arising out of these and the dataset analysis in Section 3. We then provide an overview of our system. Specific aspects of the system, involving data chunking, chunk indexing, and data partitioning, are discussed next in more detail.

4.1 Requirements and Design Implications

Data deduplication solutions have been widely used in backup and archive systems for years. Primary data deduplication systems, however, differ in some key workload constraints, which must be taken into account when designing such systems.

1. Primary data: As seen in Section 3, primary data has less duplication than backup data and more than 50% of the chunks could be unique.

2. Primary workload: The deduplication solution must be able to deduplicate data as a background workload since it cannot assume dedicated resources (CPU, memory, disk I/O). Furthermore, data access must have low latency - ideally, users and applications would access their data without noticing a performance impact.

3. Broadly used platform: The solution cannot assume a specific environment – deployment configurations may range from an entry-level server in a small business up to a multi-server cluster in an enterprise. In all cases, the server may have other software installed, including software solutions which change data format or location.

Based on these requirements and the dataset analysis in Section 3, we made some key design decisions for the system, which we outline here.

Deduplication Granularity: Our earlier data analysis has shown that, for primary datasets, whole file and sub-file fixed-size chunk deduplication were significantly inferior to sub-file variable size chunk deduplication. Furthermore, we learned that chunk compression yields significant additional savings on top of deduplication, with greater compression savings on larger chunks, hence closing the deduplication savings gap with smaller chunk sizes. Thus, an average chunk size of about 80KB (and size range 32-128KB) with compression yields savings comparable to a 4KB average chunk size with compression. Maximizing savings and minimizing metadata are both highly desirable for primary data deduplication, where tighter system resource constraints exist and there is less inherent duplication in the data. Our system uses variable size chunking with optional compression, and an average chunk size of about 80KB.

Inline vs. Post-processing Deduplication: An important design consideration is when to deduplicate the data. Inline deduplication processes the data synchronously on the write path, before the data is written to disk; hence, it introduces additional write latencies and reduces write throughput. On a primary data server, however, low write latency is usually a requirement as writes are common, with typical 1:3 write/read ratios [15].

Post-processing deduplication processes the data asynchronously, after it has been written to disk in its original format. This approach has the benefit of applying time-based policies to exploit known file access patterns. On file servers, most files are not accessed after 24 hours from arrival [15]. Therefore, a solution which deduplicates only older files may avoid additional latency for the most accessed files on the server. Furthermore, post-processing deduplication has the flexibility to choose when (e.g., idle time) to deduplicate data.

Our system is based on a post-processing approach, where the agility in deduplication and policies is a better fit for a primary data workload than an inline approach.

Resource Usage Scaling with Data Size: One major design challenge a deduplication solution must face is scale and performance - how to scale to terabytes of data attached to a single machine, where (i) CPU, memory, and disk IOPS are scarce and used by the primary workload, (ii) the deduplication throughput must keep up with the data churn, (iii) dedicated hardware cannot be assumed, and (iv) scale out to other machines is optional at best but cannot be assumed. Most, if not all, deduplication systems use a chunk hash index for identifying chunks that are already stored in the system based on their hash. The index size is proportional to the hash size and the number of unique chunks in the system.

Reducing the memory footprint of post-processing deduplication activity is a necessity on primary data servers so that enough memory is available for serving primary workloads. Common index designs in deduplication systems minimize the memory footprint of the chunk hash index by trading some disk seeks for using less memory. The index size is still relative to the number of chunks; therefore, beyond a certain data scale, the index will just not fit within the memory threshold assigned to the deduplication system.

In our system, we address that level of scale by using two techniques that can work in concert or separately to reduce the memory footprint of the deduplication process. In the first technique, we use a low RAM footprint chunk hash index that uses about 6 bytes of RAM for every unique chunk stored in the system. This is discussed in more detail in Section 4.4.

In the second technique, we partition the data into smaller buckets or partitions and deduplicate within each partition. Partitioning may be either permanent, in which case duplicate data may exist across partitions, or temporary, in which case data will be deduplicated across partitions in a subsequent phase called reconciliation. This implies that the deduplication process needs to scale only to the maximum partition size. This is discussed in more detail in Section 4.5.
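As a rough, hedged illustration of what the first technique buys: with the ~80KB average chunk size chosen above and about 6 bytes of index RAM per unique chunk (Section 4.4), the index RAM for a given amount of unique data can be estimated as follows. Actual consumption depends on the chunk size distribution and table load factors.

    AVG_CHUNK_SIZE = 80 * 1024        # ~80KB average chunk (Section 4.1)
    BYTES_PER_INDEX_ENTRY = 6         # low RAM footprint index (Section 4.4)

    def index_ram_bytes(unique_data_bytes):
        """Rough index RAM needed for a given amount of unique data."""
        return (unique_data_bytes // AVG_CHUNK_SIZE) * BYTES_PER_INDEX_ENTRY

    # 1TB of unique data -> ~13 million chunks -> roughly 75-80MB of index RAM
    print(index_ram_bytes(1 << 40) / (1 << 20), "MB")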
deduplication activity is a necessity in primary data Deduplication engine. The deduplication engine con- servers so that enough memory is available for serving sists of a file system filter driver and a set of background primary workloads. Common index designs in dedu- jobs (post-processing deduplication, garbage collection, plication systems minimize the memory footprint of the compaction, scrubbing) necessary to maintain continu- chunk hash index by trading some disk seeks with using ous deduplication process and to maintain the integrity less memory. The index size is still relative to number of of the underlying store. The deduplication filter redirects chunks, therefore, beyond a certain data scale, the index the file system interactions (read, write, etc.) to transpar- will just not fit within the memory threshold assigned to ently enable the same file semantics as for a traditional the deduplication system. server. The background jobs are designed to run in paral- In our system, we address that level of scale by using lel with the primary server IO workload. The engine ac- two techniques that can work in concert or separately to tively monitors the primary work load (file serving) and reduce the memory footprint for the deduplication pro- the load of the background jobs. It automatically allo- cess. In the first technique, we use a low RAM footprint cates resources and backs off background jobs to ensure

Background jobs. The functionalities of the background jobs in the deduplication engine are as follows. The post-processing deduplication job progressively scans the underlying volume and identifies candidate deduplication files, which are all files on the server that meet certain deduplication policy criteria (such as file age). It scans the candidate file, chunks it into variable size chunks (Section 4.3), detects duplicate chunks using a chunk hash index (Section 4.4), and inserts unique chunks in the underlying chunk store, after an optional compression phase. Subsequently, the file itself is transactionally replaced with a virtual file stub, which contains a list of references to the associated chunks in the chunk store. Optionally, the stub may also contain extra references to sub-streams representing ranges of non-deduplicated data within the file. This is particularly useful when a deduplicated file is modified, as the modified portion can be represented as sub-streams of non-deduplicated data to support high speed and low latency primary write operations. The candidate files are incrementally deduplicated – progress state is periodically saved to minimize expensive restarts on a large file in case of job interruptions such as job back-off (triggered by yielding resources to the primary workload) or cluster failover.

The garbage collection job periodically identifies orphaned chunks, i.e., chunks not referenced by any files. After garbage collection, the compaction job is periodically invoked to compact the chunk store container files, to reclaim space from unreferenced chunks, and to minimize internal fragmentation. Finally, the scrubbing job periodically scans the file stubs and the chunk store container files to identify and fix storage-level corruptions.

File reads. File reads are served differently depending on how much file data is deduplicated. For a regular, non-deduplicated file, reads are served from the underlying file system. For fully-deduplicated files, a read request is fulfilled in several steps. In the first step, a file-level read request is translated into a sequence of read requests for each of the underlying chunks through the file-level redirection table. This table essentially contains a list of chunk-IDs and their associated offsets within that particular file. In a second step, the chunk-ID is parsed to extract the index of its container and the virtual offset of the chunk body within the container. In some cases, if the container was previously compacted, the virtual chunk offset is not identical to the physical offset – in that case, a secondary address translation is done through another per-container redirection table. Finally, the read request data is re-assembled from the contents of the corresponding chunks. A similar sequence of operations applies to partially-deduplicated files, with one important difference – the system maintains an extra per-file bitmap to keep track of the non-deduplicated regions within the file, which are kept in the original file contents as a "sparse" stream.

File writes. File writes do not change the content of the chunk store. A file write simply overwrites a range of bytes in the associated sparse stream for that file. After this operation, some chunks may hold data that is old (obsolete) with respect to this file. No explicit deallocation of these chunks is done during the write, as these chunks will be garbage collected later (provided they are not referenced by any other file in the system). The deduplication filter has the metadata and logic to rebuild a file from ranges allocated in the sparse stream and ranges backed by chunks in the chunk store.
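A simplified sketch of the read-path resolution described above (file-level redirection table, container index, optional per-container translation). The data structures and the chunk-ID encoding are assumptions for illustration, not the system's on-disk formats.

    def resolve_read(file_table, container_tables, offset):
        """file_table: list of (file_offset, length, chunk_id) entries, i.e. the
        file-level redirection table. container_tables: per-container dicts
        mapping virtual offset -> physical offset for compacted containers."""
        for file_off, length, chunk_id in file_table:
            if file_off <= offset < file_off + length:
                container_id, virtual_off = parse_chunk_id(chunk_id)
                translate = container_tables.get(container_id, {})
                physical_off = translate.get(virtual_off, virtual_off)
                delta = offset - file_off
                return container_id, physical_off + delta, length - delta
        return None   # offset falls in a non-deduplicated ("sparse" stream) region

    def parse_chunk_id(chunk_id):
        """Assumed encoding: high bits = container index, low bits = virtual offset."""
        return chunk_id >> 32, chunk_id & 0xFFFFFFFF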
4.3 Data Chunking

The chunking module splits a file into a sequence of chunks in a content dependent manner. A common practice is to use a Rabin fingerprinting based sliding window hash [22] on the data stream to identify chunk boundaries, which are declared when the lower order bits of the Rabin hash match a certain pattern P. The length of the pattern P can be adjusted to vary the average chunk size. For example, we use a |P| = L = 16 bit pattern, giving an average chunk size of Savg = 64KB.

Minimum and maximum chunk sizes. It is desirable for the chunking module in a primary dedup system to generate a chunk size distribution in which both very small and very large chunks are avoided. Very small chunks lead to a larger number of chunks to be indexed, which increases the load of the indexing module. Moreover, for small chunks, the ratio of chunk metadata size to chunk size is high, leading to performance degradation and smaller dedup savings. Very large chunks may exceed the allowed unit cache/memory size, which leads to implementation difficulties in other parts of the dedup system.

A standard way to avoid very small and very large chunk sizes is to use Smin and Smax thresholds for minimum and maximum chunk sizes respectively. To enforce the former, a chunk boundary within Smin bytes of the last chunk boundary is simply suppressed. The latter is enforced by declaring a chunk boundary at Smax bytes when none has been found earlier by the hash matching rule. One consequence of this size dependent boundary enforcement is the accumulation of peaks around Smin and Smax in the chunk size distribution and the possible reduction in dedup space savings due to declaration of chunk boundaries that are not content dependent. For example, with Smax = 2 × Savg, it can be shown that for random data, 14% of chunks will not find a chunk boundary within Smax, which is a significant fraction.
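For concreteness, a minimal sketch of basic chunking with Smin/Smax enforcement. A toy rolling hash stands in for the Rabin fingerprint, so only the boundary-suppression and forced-boundary logic mirror the description above.

    S_MIN, S_MAX = 32 * 1024, 128 * 1024
    L = 16                      # |P| = L = 16 low-order bits -> ~64KB average
    PATTERN = (1 << L) - 1      # assumed all-ones pattern P

    def basic_chunk_boundaries(data):
        """Return end offsets of chunks; boundaries within S_MIN of the last
        boundary are suppressed, and a boundary is forced at S_MAX."""
        boundaries = []
        start, h = 0, 0
        for i, b in enumerate(data):
            h = ((h * 131) + b) & 0xFFFFFFFF     # toy rolling hash, not Rabin
            length = i - start + 1
            if length < S_MIN:
                continue
            if (h & PATTERN) == PATTERN or length >= S_MAX:
                boundaries.append(i + 1)
                start, h = i + 1, 0
        if start < len(data):
            boundaries.append(len(data))
        return boundaries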

We design a regression chunking algorithm that aims to reduce the peaks in the chunk size distribution around Smax, while preserving or even improving the dedup space savings.

Reducing forced boundary declaration at maximum chunk size. The basic idea involves relaxing the strict pattern matching rule – we allow matches on suffixes of the bit pattern P so as to avoid forced boundary declaration at maximum chunk size when possible. Let k denote the maximum number of prefix bits in the pattern P whose matching can be relaxed. Then, we attempt to match the last L − i bits of the pattern P with the lower order bits of the Rabin hash, with decreasing priority for i = 0, 1, ..., k. A boundary is declared at maximum chunk size only when no relaxed suffixes match after Smax bytes are scanned. For ties within the same suffix match, the later match position in the sequence is used.

In summary, this technique enables the maximum chunk size bound to be satisfied in a content dependent manner more often, through gradual regression to larger match sets with smaller numbers of matching bits. With k = 1 level of regression, the probability of forcing a chunk boundary at Smax for random data reduces to 1.8%. At k = 4 levels of regression, the probability is extremely low at 10^-14. We use this value of k in our system.

Chunking throughput performance. To maintain fast chunking throughput performance, it is crucial not to break out of the core matching loop often, or increase the complexity of the core loop through additional comparisons. By using regression chunking with nested matching bits, we need to match against only the smallest (L − k)-bit suffix of the pattern in the core loop. Only when the smallest suffix match is satisfied do we break out of the core loop and further evaluate the largest suffix match. Regression chunking can be implemented by forward processing the data only once (without ever needing to track back); the only additional state needed is the last match position for each relaxed suffix match.

For the case k = 2, regression chunking has some similarities with the TTTD chunking algorithm [13], which uses a second smaller size pattern to declare chunk boundaries when a maximum chunk size is reached. Regression chunking uses additional match patterns to reduce the probability of forcing a chunk boundary at Smax to a very small number. Moreover, by making the match patterns progressive suffixes of a base pattern, it maintains fast chunking performance as explained above.
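A sketch of the regression chunking boundary rule, under the same toy-hash assumption as the previous sketch: only the smallest (L-k)-bit suffix is tested in the core loop, the latest position matching each relaxed suffix is remembered, and the algorithm regresses through those positions only when Smax is reached.

    S_MIN, S_MAX = 32 * 1024, 128 * 1024
    L, K = 16, 4                      # pattern length in bits, levels of regression
    PATTERN = (1 << L) - 1            # assumed all-ones pattern P

    def regression_chunk_boundary(data, start):
        """Return the end offset of the chunk beginning at 'start'."""
        last_match = [None] * (K + 1)     # last position matching the (L-i)-bit suffix
        small_mask = (1 << (L - K)) - 1   # smallest suffix tested in the core loop
        h = 0
        for i in range(start, len(data)):
            h = ((h * 131) + data[i]) & 0xFFFFFFFF    # toy rolling hash, not Rabin
            length = i - start + 1
            if length >= S_MIN and ((h ^ PATTERN) & small_mask) == 0:
                # rare smallest-suffix hit: find the strongest matching suffix
                for j in range(0, K + 1):
                    if ((h ^ PATTERN) & ((1 << (L - j)) - 1)) == 0:
                        last_match[j] = i
                        break
                if last_match[0] is not None:
                    return i + 1                      # full pattern matched
            if length >= S_MAX:
                for j in range(1, K + 1):             # regress to weaker suffix matches
                    if last_match[j] is not None:
                        return last_match[j] + 1
                return i + 1                          # forced boundary at S_MAX
        return len(data)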
4.4 Chunk Indexing

In a typical deduplication system, the hash index is a design challenge for scale and resource consumption, since the index size is relative to the data size. On a primary data deduplication system, conserving memory and disk IOPS resources is crucial. This paper discusses two methods for addressing the index scale. In this section, we present a hash index designed to have a low memory footprint and use reduced read/write disk IOPS so that it can scale to larger datasets and deliver high insert/lookup performance. It incorporates ideas from ChunkStash [10], but the use of flash memory for index storage is not mandatory. In the next section, we present a design for partitioning the index. The two methods can be used independently or in conjunction with each other. The index design is summarized in Figure 6.

Figure 6: Chunk hash index.

Log-structured organization. The chunk metadata records, comprising a SHA-256 hash and location information, are organized in a log-structured manner on secondary storage. Newly inserted chunk records are held in a write buffer in RAM and appended to this log in a batched manner so that write IOPS are amortized across multiple index insert operations.

Low RAM footprint index. A specialized in-memory hash table is used to index chunk metadata records on secondary storage, with hash collisions resolved by a variant of cuckoo hashing [20]. The in-memory hash table stores 2-byte compact key signatures instead of full 32-byte SHA-256 chunk hashes so as to strike tradeoffs between RAM usage and false secondary storage reads. The compressed key signature also serves to eliminate disk accesses for lookups on non-existent chunks (with very low false positives). The index footprint is extremely low at about 6 bytes of RAM per chunk.

Prefetch cache. This technique aims to reduce disk IOPS during lookups and has been used for backup data dedup in [26]. When a chunk record R is read from the log, a total of p = 1000 sequential records are read in one single IO, starting from the location of record R, and inserted into a prefetch cache maintained in RAM. While sequential predictability of chunk hash lookups for primary data dedup is expected to be much less than that for backup data dedup, we did not use smaller values of p since disk seek times dominate for prefetching. The prefetch cache is sized to contain c = 100,000 entries, which consumes (an acceptable) 5MB of RAM. It uses an LRU replacement policy.
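A toy sketch of the index idea: RAM holds only a 2-byte signature per entry pointing into the on-disk log, and the full hash is confirmed by reading the log record. A plain dict with per-signature lists stands in for the cuckoo-hash table, the log is modeled as a Python list, and the prefetch cache is omitted; this illustrates the RAM/IO tradeoff rather than the actual structures.

    import hashlib

    class ChunkIndexSketch:
        def __init__(self):
            self.log = []        # stands in for the on-disk log of chunk records
            self.table = {}      # 2-byte signature -> list of log positions (RAM)

        @staticmethod
        def _signature(chunk_hash):
            return chunk_hash[:2]          # compact key signature kept in RAM

        def insert(self, chunk_hash, location):
            self.log.append((chunk_hash, location))    # appended/batched on disk
            sig = self._signature(chunk_hash)
            self.table.setdefault(sig, []).append(len(self.log) - 1)

        def lookup(self, chunk_hash):
            """Return the stored location, or None. Signature misses never touch
            the log; signature hits cost a log read to confirm the full hash."""
            for pos in self.table.get(self._signature(chunk_hash), []):
                stored_hash, location = self.log[pos]  # simulated disk read
                if stored_hash == chunk_hash:
                    return location
            return None

    idx = ChunkIndexSketch()
    h = hashlib.sha256(b"some chunk").digest()
    idx.insert(h, ("container-7", 123456))
    print(idx.lookup(h))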

Disk accesses for index lookups. An index lookup hits the disk only when the associated chunk is a duplicate and is not present in the prefetch cache (modulo a low false positive rate). The fraction of index lookups hitting disk during deduplication is in the 1% ballpark for all evaluated datasets.

4.5 Data Partitioning and Reconciliation

In Section 4.1, we motivated the need to reduce memory footprint for scale and presented partitioning as one of the techniques used. Partitioning scopes the deduplication task to a smaller namespace, where the namespace size is controlled by the deduplication engine. Therefore, the resource consumption (RAM, IOPS) can be controlled very precisely, possibly provided as an external threshold to the system. In this section, we discuss a design for deduplicating within partitions and then reconciling the partitions. The space savings are comparable to deduplicating the entire namespace without partitioning.

Data Partitioning Methods. In Section 3, we demonstrated that partitioning by server or file type is effective - most of the space savings are achieved by deduplicating within the partition, and additional savings from deduplicating across partitions are minimal. Other methods may partition by namespace (file path), file time, or file similarity fingerprint. Partitions need to be constructed such that the maximum partition size can be indexed within a defined memory threshold. As the system scales, the number of partitions grows, but the memory consumption remains constant. A solution may consider a hybrid hierarchical partitioning technique, where permanent partitioning is applied at one level and then temporary partitioning is applied at the next level when the index of a permanent partition is too large to fit in memory. The temporary partitions may then be reconciled while the permanent partitions are not. In our design, we use permanent partitioning based on servers, then apply temporary partitioning based on file type or file time, bound by a memory threshold. We then reconcile the temporary partitions with an efficient reconciliation algorithm, which is memory bound and utilizes large sequential reads to minimize IOPS and disk seeks.

Two-phase Deduplication. We divide the deduplication task into two phases:

1. Deduplication within a partition: The deduplication process is similar to deduplicating an entire dataset except that the hash index loads only the hashes belonging to chunks or files within the partition. We divide the hashes into partitions based on either file type or file age and then, based on the memory threshold, we load hashes into the index such that the index size is within the threshold. In this deduplication phase, new chunks ingested into the system are deduplicated only if they repeat a chunk within the partition. However, if a new chunk repeats a chunk in another partition, the duplication will not be detected and the chunk will be stored in the system as a new unique chunk.

2. Reconciliation of partitions: Reconciliation is the process of deduplication across partitions to detect and remove duplicate chunks resulting from the previous phase. The design for the reconciliation process has the same goals of minimizing RAM footprint and disk seeks. Given two partitions, the reconciliation process will load the hashes of one partition (say, partition 1) into a hash index, and then scan the chunks belonging to the second partition (say, partition 2), using sequential I/O. We assume that the chunk store is designed to allow sequential scan of chunks and hashes. For each chunk of partition 2, the reconciliation process looks up the hash in the index loaded with all the hashes of partition 1 to detect duplication. One efficient way of persisting the detected duplicates is to utilize a merge log. The merge log is a simple log where each entry consists of the two chunk ids that are duplicates of each other. The engine appends to the merge log with sequential writes and may batch the writing of merge log entries to reduce IOPS further.

Once the chunk duplication is detected, reconciliation uses a chunk-merge process whose exact implementation depends on the chunk store.
For example, remains constant. A solution may consider a hybrid hi- in a chunk store that stores chunks within container erarchical partitioning technique, where permanent par- files, chunk-merge process may read the merge log titioning is applied at one level and then temporary par- and then create new revision of the containers (us- titioning is applied at the next level when indexing a per- ing sequential read and write), omitting the dupli- manent partition is too large to fit in memory. The tem- cate chunks. At the end of the second phase, there porary partitions may then be reconciled while the per- are no duplicate chunk instances and the many par- manent partitions are not. In our design, we use perma- titions turn into a single reconciled partition. nent partitioning based on servers, then apply temporary partitioning based on file type or file time bound by a Reconciliation Strategies. While reconciliation is ef- memory threshold. We then reconcile the temporary par- ficient, it still imposes some I/O overhead on the sys- titions with an efficient reconciliation algorithm, which tem. The system may self-tune based on availability of is memory bound and utilizes large sequential reads to resources and idle time whether to reconcile all partitions minimize IOPS and disk seeks. or selectively. In the case that reconciliation across all Two-phase Deduplication. We divide the deduplication partitions is the selected strategy, the following method task into two phases: is used. The basic idea is to consider some number of unreconciled partitions at a time and grow the set of rec- 1. Deduplication within a partition: The deduplica- onciled partitions by comparing the current group of un- tion process is similar to deduplicating an entire reconciled partitions to each of those in the already rec- dataset except that the hash index loads only the onciled set. This is done by indexing in memory the

Our reconciliation strategy above uses the hash join method. An alternative method is sort-merge join, which is not only more CPU intensive (since sorting is more expensive than building a hash table) but is also disk IO expensive (as it requires reading and writing buckets of sorted partitions to disk).

Selective reconciliation is a strategy in which only some subsets of the partitions are reconciled. Selecting which partitions to reconcile may depend on several criteria, such as file types, a signature computed from the data within the partition, or some other criteria. Selective reconciliation is a tradeoff between reducing IOPS and achieving maximal deduplication savings.

Another useful strategy is delayed reconciliation – rather than reconciling immediately after the deduplication phase, the deduplication engine may defer it to the server's idle time. With this strategy, the tradeoff is with how long it takes to achieve maximal savings. As seen in Section 3, deduplication across (appropriately chosen) partitions yields incremental additional savings, therefore both selective and delayed reconciliation may be the right tradeoff for many systems.

5 Performance Evaluation

In Section 3, we analyzed our datasets around multiple aspects of dedup space savings and used those findings to design our primary data deduplication system. In this section, we evaluate some other aspects of our primary data deduplication system that are related to post-processing deduplication.

Post-processing deduplication throughput. Using the GFS-US dataset, we examined post-processing deduplication throughput, calculated as the amount of original data processed per second. An entry-level HP ProLiant SE326M1 system with one quad-core Intel Xeon 2.27 GHz L5520 and 12 GB of RAM was used, with a 3-way RAID-0 dynamic volume on top of three 1TB SATA 7200 RPM drives.

To perform an apples-to-apples comparison, we ran a post-processing deduplication session multiple times with different indexing options in a controlled environment. Each deduplication session uses a single thread. Moreover, we ensured that there were no CPU-intensive tasks running in parallel, for increased measurement accuracy. The baseline case uses a regular index where the full hash (SHA-256, 32 bytes) and location information (16 bytes) for each unique chunk is stored in RAM. The optimized index uses the RAM space efficient design described in Section 4.4. The number of data partitions for this workload was chosen as three by the system according to an implementation heuristic; however, partitioning could be done with as many partitions as needed. The resource usage and throughput numbers are summarized in Table 5. The following important observations can be drawn from Table 5:
1. RAM frugality: Our optimized index reduces the RAM usage by about 8x. Data partitioning can reduce RAM usage by another 3x (using 3 partitions as chosen for this workload). The overall RAM usage reduction with optimized index and data partitioning is as much as 24x.

2. Low CPU utilization: The median single core utilization (measured over a single run) is in the 30-40% range for all four cases. Compared to the baseline case, it is slightly higher with optimized index and/or data partitioning because of indexing work and reconciliation respectively. Note that modern file servers use multi-core CPUs (typically, quad core or higher), hence this leaves enough room for serving the primary workload.

3. Low disk usage: (not shown in Table 5) The median disk queue depth is zero in all cases. At the 75th percentile, the queue depth increases by 2 or 3 as we move from baseline to optimized index and/or data partitioning. In the optimized index case, the increase is due to extra time spent in disk-based index lookups. With data partitioning, the increase is mainly due to reconciliation (which uses mostly sequential IOs).

4. Sustained deduplication throughput: Even as RAM usage goes down significantly with optimized index and data partitioning, the overall throughput performance remains mostly sustained in the range of 26-30 MB/sec, with only about a 10% decrease for the lowest RAM usage case. This throughput is sufficient to keep up with the data ingestion rate on typical file servers, which is small when compared to total stored data. In a four month dynamic trace study on two file servers hosting 3TB and 19TB of data, Leung et al. [15] reported that 177.7GB and 364.4GB of data was written respectively. This computes to an average ingestion rate of 0.03 MB/sec and 0.04 MB/sec respectively, which is three orders of magnitude lower than the obtained single deduplication session throughput.

                          Regular Index  Optimized  Regular index  Optimized index
                          (Baseline)     index      w/ partitions  w/ partitions
Throughput (MB/s)         30.6           28.2       27.6           26.5
Partitioning factor       1              1          3              3
Index entry size (bytes)  48             6          48             6
Index memory usage        931MB          116MB      310MB          39MB
Single core utilization   31.2%          35.2%      36.8%          40.8%

Table 5: Deduplication processing metrics for regular and optimized indexes, without and with data partitioning, for GFS-US dataset. Disk queue depth (median) is zero in all cases and is not shown.

Deduplication Activity   Optimized index  Regular index w/ partitioning
Index lookup             10.7%            0.4%
Reconciliation           n/a              7.0%
Compression              15.1%            15.3%
SHA hashing              14.3%            14.6%
Chunking                 9.7%             9.7%
Storing unique data      11.3%            11.5%
Reading existing data    12.6%            12.8%

Table 6: Percentage time contribution to the overall post-processing session for each component of deduplication activity for GFS-US dataset.

The above two observations confirm that we meet our design goal of sustained post-processing deduplication performance at low overhead, so that server memory, CPU, and disk resources remain available for serving the primary workload. We also verified that deduplication processing is parallelizable across datasets and CPU cores/disks – when datasets have disk diversity and the CPU has at least as many cores as dataset deduplication sessions, the aggregate deduplication throughput scales as expected, assuming sufficient RAM is available.

It is also interesting to note that the extra lookup time spent in the optimized index configuration (for lookups going to disk) is comparable with the time spent in reconciliation in the regular index with partitioning case. To explore this further, we have compared the contribution of various sub-components of the deduplication session in Table 6. As can be seen, in the case of optimized index without partitioning, the main impact on throughput reduction comes from the index lookup time, when lookups miss in the prefetch cache and hit disk. In the case of data partitioning with regular index, the index lookup time is greatly reduced (at the cost of additional memory) but the main impact is deferred to the partition reconciliation phase. We also observe that the CPU time taken by the data chunking algorithm remains low compared with the rest of the post-processing phase.
Chunk size distribution. We compare the chunk size distribution of the regression chunking algorithm described in Section 4.3 with that of basic chunking on the GFS-US dataset. In Figure 7, we see that regression chunking achieves a more uniform chunk size distribution – it flattens the peak in the distribution around the maximum chunk size (128KB) (by relaxing the chunk boundary declaration rule) and spreads out the distribution more uniformly between the minimum and maximum chunk sizes. Regression chunking obtains an average chunk size of 80KB on this dataset, which is a 5% decrease over that obtained by basic chunking. This can also be attributed to the effect discussed above.

Figure 7: Distribution of chunk size for GFS-US dataset.

In Table 7, we show the improvement in dedup space savings obtained by regression chunking over basic chunking on this dataset. Although the overall improvement is about 3%, we see significant improvements for some file types contained in that dataset – for example, the dedup savings increase by about 27% for PDF file types. Thus, depending on the mix of file types in the dataset, regression chunking can provide marginal to significant additional dedup space savings. This effect can be attributed to regression chunking declaring more chunk boundaries in a content dependent manner instead of forcing them at the maximum chunk size.

              Dedup Space Savings
Dataset       Basic      Regression      RC Benefit
              Chunking   Chunking (RC)
Audio-Video   2.98%      2.98%           0%
PDF           9.96%      12.70%          27.5%
Office-2007   35.82%     36.65%          2.3%
VHD           48.64%     51.39%          5.65%
GFS-US        36.14%     37.2%           2.9%

Table 7: Improvement in dedup space savings with regression chunking over basic chunking for GFS-US dataset. Effect of chunk compression is not included.

6 Conclusion

We presented a large scale study of primary data deduplication and used the findings to inform the design of a new primary data deduplication system implemented in the Windows Server 2012 operating system. We found duplication within primary datasets to be far from homogenous, existing in about half of the chunk space and naturally partitioned within that subspace. We found chunk compressibility equally skewed, with the majority of compression savings coming from a minority of the chunks.

We demonstrated how deduplication of primary file-based server data can be significantly optimized for both high deduplication savings and minimal resource consumption through the use of a new chunking algorithm, chunk compression, partitioning, and a low RAM footprint chunk index.

We presented the architecture of a primary data deduplication system designed to exploit our findings to achieve high deduplication savings at low computational overhead. In this paper, we focused on aspects of the system which address scaling deduplication processing resource usage with data size such that memory, CPU, and disk resources remain available to fulfill the primary workload of serving IO. Other aspects of our system related to primary data serving (beyond a brief overview), reliability, and resiliency are left for future work.

Acknowledgements. We thank our shepherd Emmett Witchel, the anonymous reviewers, and Mathew Dickson, Cristian Teodorescu, Molly Brown, Christian Konig, and Mark Manasse for their feedback. We acknowledge Young Kwon from Microsoft IT for enabling real-life production data to be analyzed. We also thank members of the Windows Server Deduplication project team: Jim Benton, Abhishek Gupta, Ian Cheung, Kashif Hasan, Daniel Hefenbrock, Cosmin Rusu, Iuliu Rus, Huisheng Liu, Nilesh Shah, Giridhar Kasirala Ramachandraiah, Fenghua Yuan, Amit Karandikar, Sriprasad Bhat Kasargod, Sundar Srinivasan, Devashish Delal, Subhayan Sen, Girish Kalra, Harvijay Singh Saini, Sai Prakash Reddy Sandadi, Ram Garlapati, Navtez Singh, Scott Johnson, Suresh Tharamal, Faraz Qadri, and Kirk Olynyk. The support of Rutwick Bhatt, Gene Chellis, Jim Pinkerton, and Thomas Pfenning is much appreciated.

References

[1] Linux SDFS. www.opendedup.org.
[2] NetApp Deduplication and Compression. www.netapp.com/us/products/platform-os/dedupe.html.
[3] Ocarina Networks. www.ocarinanetworks.com.
[4] Permabit Data Optimization. www.permabit.com.
[5] Windows Storage Server. technet.microsoft.com/en-us/library/gg232683(WS.10).aspx.
[6] ZFS Deduplication. blogs.oracle.com/bonwick/entry/zfs_dedup.
[7] Bolosky, W. J., Corbin, S., Goebel, D., and Douceur, J. R. Single instance storage in Windows 2000. In 4th USENIX Windows Systems Symposium (2000).
[8] Broder, A., and Mitzenmacher, M. Network Applications of Bloom Filters: A Survey. In Internet Mathematics (2002).
[9] Clements, A., Ahmad, I., Vilayannur, M., and Li, J. Decentralized Deduplication in SAN Cluster File Systems. In USENIX ATC (2009).
[10] Debnath, B., Sengupta, S., and Li, J. ChunkStash: Speeding Up Inline Storage Deduplication Using Flash Memory. In USENIX ATC (2010).
[11] Dubnicki, C., Gryz, L., Heldt, L., Kaczmarczyk, M., Kilian, W., Strzelczak, P., Szczepkowski, J., Ungureanu, C., and Welnicki, M. HYDRAstor: a Scalable Secondary Storage. In FAST (2009).
[12] EMC Corporation. EMC Centera: Content Addressed Storage System, Data Sheet, April 2002.
[13] Eshghi, K. A framework for analyzing and improving content-based chunking algorithms. HP Labs Technical Report HPL-2005-30 (R.1) (2005).
[14] Kruus, E., Ungureanu, C., and Dubnicki, C. Bimodal Content Defined Chunking for Backup Streams. In FAST (2010).
[15] Leung, A. W., Pasupathy, S., Goodson, G., and Miller, E. L. Measurement and analysis of large-scale network file system workloads. In USENIX ATC (2008).
[16] Lillibridge, M., Eshghi, K., Bhagwat, D., Deolalikar, V., Trezise, G., and Camble, P. Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality. In FAST (2009).
[17] Meyer, D. T., and Bolosky, W. J. A study of practical deduplication. In FAST (2011).
[18] Muthitacharoen, A., Chen, B., and Mazières, D. A low-bandwidth network file system. In SOSP (2001).
[19] National Institute of Standards and Technology, FIPS 180-1. Secure Hash Standard. U.S. Department of Commerce, 1995.
[20] Pagh, R., and Rodler, F. F. Cuckoo hashing. Journal of Algorithms 51, 2 (May 2004), 122–144.
[21] Quinlan, S., and Dorward, S. Venti: A New Approach to Archival Data Storage. In FAST (2002).
[22] Rabin, M. O. Fingerprinting by Random Polynomials. Harvard University Technical Report TR-15-81 (1981).
[23] Schleimer, S., Wilkerson, D. S., and Aiken, A. Winnowing: Local algorithms for document fingerprinting. In ACM SIGMOD (2003).
[24] Srinivasan, K., Bisson, T., Goodson, G., and Voruganti, K. iDedup: Latency-aware, Inline Data Deduplication for Primary Storage. In USENIX FAST (2012).
[25] Welch, T. A technique for high-performance data compression. IEEE Computer (June 1984).
[26] Zhu, B., Li, K., and Patterson, H. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In FAST (2008).
