MIGRATORY COMPRESSION Coarse-Grained Data Reordering to Improve Compressibility
Total Page:16
File Type:pdf, Size:1020Kb
MIGRATORY COMPRESSION Coarse-grained Data Reordering to Improve Compressibility Xing Lin*, Guanlin Lu, Fred Douglis, Philip Shilane, Grant Wallace *University of Utah EMC Corporation Data Protection and Availability Division © Copyright 2014 EMC Corporation. All rights reserved. 1 Overview Compression: finds redundancy among strings within a certain distance (window size) – Problem: window sizes are small, and similarity across a larger distance will not be identified Migratory compression (MC): coarse-grained reorganization to group similar blocks to improve compressibility – A generic pre-processing stage for standard compressors – In many cases, improve both compressibility and throughput – Effective for improving compression for archival storage © Copyright 2014 EMC Corporation. All rights reserved. 2 Background • Compression Factor (CF) = ratio of original / compressed • Throughput = original size / (de)compression time Compressor MAX Window size Techniques gzip 64 KB LZ; Huffman coding bzip2 900 KB Run-length encoding; Burrows-Wheeler Transform; Huffman coding 7z 1 GB LZ; Markov chain-based range coder © Copyright 2014 EMC Corporation. All rights reserved. 3 Background • Compression Factor (CF) = ratio of original / compressed • Throughput = original size / (de)compression time Compressor MAX Window size Techniques gzip 64 KB LZ; Huffman coding bzip2 900 KB Run-length encoding; Burrows-Wheeler Transform; Huffman coding 7z 1 GB LZ; Markov chain-based range coder © Copyright 2014 EMC Corporation. All rights reserved. 4 Background • Compression Factor (CF) = ratio of original / compressed • Throughput = original size / (de)compression time Compressor MAX Window size Techniques Example Throughput vs. CF gzip 64 KB LZ; Huffman coding 14 gzip bzip2 900 KB Run-length encoding; Burrows- 12 Wheeler Transform; Huffman coding 10 (MB/s) 7z 1 GB LZ; Markov chain-based range 8 coder Tput 6 bzip2 4 7z 2 0 Compression 2.0 2.5 3.0 3.5 4.0 4.5 5.0 Compression Factor (X) The larger the window, the better the compression but slower © Copyright 2014 EMC Corporation. All rights reserved. 5 Background • Compression Factor (CF) = ratio of original / compressed • Throughput = original size / (de)compression time Compressor MAX Window size Techniques Example Throughput vs. CF gzip 64 KB LZ; Huffman coding 14 gzip bzip2 900 KB Run-length encoding; Burrows- 12 Wheeler Transform; Huffman coding 10 (MB/s) 7z 1 GB LZ; Markov chain-based range 8 coder Tput 6 bzip2 rzip 4 7z rzip (from rsync): 2 0 Compression intra-file deduplication; 2.0 2.5 3.0 3.5 4.0 4.5 5.0 bzip2 the remainder Compression Factor (X) The larger the window, the better the compression but slower © Copyright 2014 EMC Corporation. All rights reserved. 6 Motivation Compress a single, large file – Traditional compressors are unable to exploit redundancy across a large range of data (e.g., many GB) gz window A B A’ C … (GBs) A” B’ … 7z window transform compress A A’ A” B B’ C … … … compress … Standard compressor mzip © Copyright 2014 EMC Corporation. All rights reserved. 7 Motivation Better compression for long-term retention – Data migration from backup to archive tier – Observation: similar blocks could be scattered across “compression regions” (CRs) in the backup tier CR: unit of data that is CR1 A B C compressed together (e.g. 128 KB) CR2 A’ B’ C’ … Backup tier © Copyright 2014 EMC Corporation. All rights reserved. 8 Motivation Better compression for long-term retention – Data migration from backup to archive tier – Observation: similar blocks could be scattered across “compression regions” (CRs) in the backup tier CR1 A B C CR1 A B C Migrate CR2 A’ B’ C’ CR2 A’ B’ C’ … … Backup tier Archive tier Current practice © Copyright 2014 EMC Corporation. All rights reserved. 9 Motivation Better compression for long-term retention – Data migration from backup to archive tier – Observation: similar blocks could be scattered across “compression regions” (CRs) in the backup tier – Proposal: fill each CR with similar data CR1 A B C CR1 A A’ B B’ Migrate CR2 A’ B’ C’ CR2 C C’ … … … Backup tier Archive tier Migratory compression © Copyright 2014 EMC Corporation. All rights reserved. 10 Recap Migratory compression – Idea: improve compression by grouping similar blocks – Use cases ▪ mzip: compress a single, large file ▪ Data migration for archival storage Considerations ▪ How to identify similar blocks? ▪ How to reorganize blocks efficiently? ▪ Other tradeoffs (refer to the paper) © Copyright 2014 EMC Corporation. All rights reserved. 11 mzip Workflow File Segmentation Segmentation: partition into blocks, calculate similarity “features” Similarity detector Similarity detector: group by content and identify duplicate and Reorganizer similar blocks; output migrate and restore recipe Reorganized file Reorganizer: rearrange the input file – Reorganized file may exist only as a pipe into compressor Compressor Compressed file © Copyright 2014 EMC Corporation. All rights reserved. 12 mzip Workflow Compressed File file Segmentation Decompressor Similarity detector Reorganized file Reorganizer Extract restore Reorganized recipe file Restorer Compressor File Compressed file © Copyright 2014 EMC Corporation. All rights reserved. 13 mzip Example migrate recipe 0 3 1 5 2 4 . Block ID 0 A 1 B File 2 C Similarity 3 A’ detector 4 D 5 B’ … … 0 2 4 1 5 3 . restore recipe © Copyright 2014 EMC Corporation. All rights reserved. 14 mzip Example migrate recipe 0 3 1 5 2 4 . Block ID Reorganize 0 A A 0 1 B A’ 1 File Reorganized 2 C B 2 3 A’ B’ 3 File 4 D C 4 5 B’ D 5 … … … … Restore 0 2 4 1 5 3 . restore recipe © Copyright 2014 EMC Corporation. All rights reserved. 15 Similarity Detection Similarity feature (based on [Broder 1997]) • A strong hash for duplication detection • Features: weak hashes provide hints about similarity among blocks • Two blocks are similar when they share values for some weak hashes Mature technique • REBL: Redundancy Elimination at Block Level [Kulkarni et al. 2004] • WAN Optimized Replication [Shilane et al. 2012] © Copyright 2014 EMC Corporation. All rights reserved. 16 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level 0 A 1 B 2 C 3 A’ 4 D 5 B’ input Block-level © Copyright 2014 EMC Corporation. All rights reserved. 17 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level 0 A 1 B 2 C 3 A’ 4 D 5 B’ input Block-level © Copyright 2014 EMC Corporation. All rights reserved. 18 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level 0 A 1 B A’ 2 C 3 4 D 5 B’ input Block-level © Copyright 2014 EMC Corporation. All rights reserved. 19 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level 0 A 1 A’ 2 C B 3 4 D 5 B’ input Block-level © Copyright 2014 EMC Corporation. All rights reserved. 20 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level 0 A 1 A’ 2 C B 3 B’ 4 D 5 input Block-level © Copyright 2014 EMC Corporation. All rights reserved. 21 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level 0 A 1 A’ 2 B 3 B’ 4 D C 5 input Block-level © Copyright 2014 EMC Corporation. All rights reserved. 22 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level 0 A 1 A’ 2 B 3 B’ 4 C 5 D input Block-level © Copyright 2014 EMC Corporation. All rights reserved. 23 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level ▪ Random IOs 0 A ▪ Fine for memory and SSD 1 A’ 2 B 3 B’ 4 C 5 D input Block-level © Copyright 2014 EMC Corporation. All rights reserved. 24 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level ▪ Random IOs 0 A ▪ Fine for memory and SSD 1st scan 1st 1 B Buffer – Multipass (HDD) 2 C 3 A’ 4 D 5 B’ input Block-levelMultipass © Copyright 2014 EMC Corporation. All rights reserved. 25 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level ▪ Random IOs 0 A ▪ Fine for memory and SSD 1st scan 1st 1 A’ Buffer – Multipass (HDD) 2 C B 3 4 D 5 B’ input Block-levelMultipass © Copyright 2014 EMC Corporation. All rights reserved. 26 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level ▪ Random IOs 0 ▪ Fine for memory and SSD scan 2nd 1 Buffer – Multipass (HDD) 2 C 3 4 D 5 B’ input Block-levelMultipass © Copyright 2014 EMC Corporation. All rights reserved. 27 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level ▪ Random IOs 0 B’ ▪ Fine for memory and SSD scan 2nd 1 C Buffer – Multipass (HDD) 2 D 3 4 5 input Block-levelMultipass © Copyright 2014 EMC Corporation. All rights reserved. 28 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level ▪ Random IOs 0 B’ ▪ Fine for memory and SSD scan 2nd 1 C Buffer – Multipass (HDD) 2 D ▪ Convert random IOs into 3 sequential scans 4 5 input Block-levelMultipass © Copyright 2014 EMC Corporation.