MIGRATORY COMPRESSION Coarse-Grained Data Reordering to Improve Compressibility

MIGRATORY COMPRESSION Coarse-Grained Data Reordering to Improve Compressibility

MIGRATORY COMPRESSION Coarse-grained Data Reordering to Improve Compressibility Xing Lin*, Guanlin Lu, Fred Douglis, Philip Shilane, Grant Wallace *University of Utah EMC Corporation Data Protection and Availability Division © Copyright 2014 EMC Corporation. All rights reserved. 1 Overview Compression: finds redundancy among strings within a certain distance (window size) – Problem: window sizes are small, and similarity across a larger distance will not be identified Migratory compression (MC): coarse-grained reorganization to group similar blocks to improve compressibility – A generic pre-processing stage for standard compressors – In many cases, improve both compressibility and throughput – Effective for improving compression for archival storage © Copyright 2014 EMC Corporation. All rights reserved. 2 Background • Compression Factor (CF) = ratio of original / compressed • Throughput = original size / (de)compression time Compressor MAX Window size Techniques gzip 64 KB LZ; Huffman coding bzip2 900 KB Run-length encoding; Burrows-Wheeler Transform; Huffman coding 7z 1 GB LZ; Markov chain-based range coder © Copyright 2014 EMC Corporation. All rights reserved. 3 Background • Compression Factor (CF) = ratio of original / compressed • Throughput = original size / (de)compression time Compressor MAX Window size Techniques gzip 64 KB LZ; Huffman coding bzip2 900 KB Run-length encoding; Burrows-Wheeler Transform; Huffman coding 7z 1 GB LZ; Markov chain-based range coder © Copyright 2014 EMC Corporation. All rights reserved. 4 Background • Compression Factor (CF) = ratio of original / compressed • Throughput = original size / (de)compression time Compressor MAX Window size Techniques Example Throughput vs. CF gzip 64 KB LZ; Huffman coding 14 gzip bzip2 900 KB Run-length encoding; Burrows- 12 Wheeler Transform; Huffman coding 10 (MB/s) 7z 1 GB LZ; Markov chain-based range 8 coder Tput 6 bzip2 4 7z 2 0 Compression 2.0 2.5 3.0 3.5 4.0 4.5 5.0 Compression Factor (X) The larger the window, the better the compression but slower © Copyright 2014 EMC Corporation. All rights reserved. 5 Background • Compression Factor (CF) = ratio of original / compressed • Throughput = original size / (de)compression time Compressor MAX Window size Techniques Example Throughput vs. CF gzip 64 KB LZ; Huffman coding 14 gzip bzip2 900 KB Run-length encoding; Burrows- 12 Wheeler Transform; Huffman coding 10 (MB/s) 7z 1 GB LZ; Markov chain-based range 8 coder Tput 6 bzip2 rzip 4 7z rzip (from rsync): 2 0 Compression intra-file deduplication; 2.0 2.5 3.0 3.5 4.0 4.5 5.0 bzip2 the remainder Compression Factor (X) The larger the window, the better the compression but slower © Copyright 2014 EMC Corporation. All rights reserved. 6 Motivation Compress a single, large file – Traditional compressors are unable to exploit redundancy across a large range of data (e.g., many GB) gz window A B A’ C … (GBs) A” B’ … 7z window transform compress A A’ A” B B’ C … … … compress … Standard compressor mzip © Copyright 2014 EMC Corporation. All rights reserved. 7 Motivation Better compression for long-term retention – Data migration from backup to archive tier – Observation: similar blocks could be scattered across “compression regions” (CRs) in the backup tier CR: unit of data that is CR1 A B C compressed together (e.g. 128 KB) CR2 A’ B’ C’ … Backup tier © Copyright 2014 EMC Corporation. All rights reserved. 8 Motivation Better compression for long-term retention – Data migration from backup to archive tier – Observation: similar blocks could be scattered across “compression regions” (CRs) in the backup tier CR1 A B C CR1 A B C Migrate CR2 A’ B’ C’ CR2 A’ B’ C’ … … Backup tier Archive tier Current practice © Copyright 2014 EMC Corporation. All rights reserved. 9 Motivation Better compression for long-term retention – Data migration from backup to archive tier – Observation: similar blocks could be scattered across “compression regions” (CRs) in the backup tier – Proposal: fill each CR with similar data CR1 A B C CR1 A A’ B B’ Migrate CR2 A’ B’ C’ CR2 C C’ … … … Backup tier Archive tier Migratory compression © Copyright 2014 EMC Corporation. All rights reserved. 10 Recap Migratory compression – Idea: improve compression by grouping similar blocks – Use cases ▪ mzip: compress a single, large file ▪ Data migration for archival storage Considerations ▪ How to identify similar blocks? ▪ How to reorganize blocks efficiently? ▪ Other tradeoffs (refer to the paper) © Copyright 2014 EMC Corporation. All rights reserved. 11 mzip Workflow File Segmentation Segmentation: partition into blocks, calculate similarity “features” Similarity detector Similarity detector: group by content and identify duplicate and Reorganizer similar blocks; output migrate and restore recipe Reorganized file Reorganizer: rearrange the input file – Reorganized file may exist only as a pipe into compressor Compressor Compressed file © Copyright 2014 EMC Corporation. All rights reserved. 12 mzip Workflow Compressed File file Segmentation Decompressor Similarity detector Reorganized file Reorganizer Extract restore Reorganized recipe file Restorer Compressor File Compressed file © Copyright 2014 EMC Corporation. All rights reserved. 13 mzip Example migrate recipe 0 3 1 5 2 4 . Block ID 0 A 1 B File 2 C Similarity 3 A’ detector 4 D 5 B’ … … 0 2 4 1 5 3 . restore recipe © Copyright 2014 EMC Corporation. All rights reserved. 14 mzip Example migrate recipe 0 3 1 5 2 4 . Block ID Reorganize 0 A A 0 1 B A’ 1 File Reorganized 2 C B 2 3 A’ B’ 3 File 4 D C 4 5 B’ D 5 … … … … Restore 0 2 4 1 5 3 . restore recipe © Copyright 2014 EMC Corporation. All rights reserved. 15 Similarity Detection Similarity feature (based on [Broder 1997]) • A strong hash for duplication detection • Features: weak hashes provide hints about similarity among blocks • Two blocks are similar when they share values for some weak hashes Mature technique • REBL: Redundancy Elimination at Block Level [Kulkarni et al. 2004] • WAN Optimized Replication [Shilane et al. 2012] © Copyright 2014 EMC Corporation. All rights reserved. 16 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level 0 A 1 B 2 C 3 A’ 4 D 5 B’ input Block-level © Copyright 2014 EMC Corporation. All rights reserved. 17 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level 0 A 1 B 2 C 3 A’ 4 D 5 B’ input Block-level © Copyright 2014 EMC Corporation. All rights reserved. 18 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level 0 A 1 B A’ 2 C 3 4 D 5 B’ input Block-level © Copyright 2014 EMC Corporation. All rights reserved. 19 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level 0 A 1 A’ 2 C B 3 4 D 5 B’ input Block-level © Copyright 2014 EMC Corporation. All rights reserved. 20 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level 0 A 1 A’ 2 C B 3 B’ 4 D 5 input Block-level © Copyright 2014 EMC Corporation. All rights reserved. 21 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level 0 A 1 A’ 2 B 3 B’ 4 D C 5 input Block-level © Copyright 2014 EMC Corporation. All rights reserved. 22 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level 0 A 1 A’ 2 B 3 B’ 4 C 5 D input Block-level © Copyright 2014 EMC Corporation. All rights reserved. 23 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level ▪ Random IOs 0 A ▪ Fine for memory and SSD 1 A’ 2 B 3 B’ 4 C 5 D input Block-level © Copyright 2014 EMC Corporation. All rights reserved. 24 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level ▪ Random IOs 0 A ▪ Fine for memory and SSD 1st scan 1st 1 B Buffer – Multipass (HDD) 2 C 3 A’ 4 D 5 B’ input Block-levelMultipass © Copyright 2014 EMC Corporation. All rights reserved. 25 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level ▪ Random IOs 0 A ▪ Fine for memory and SSD 1st scan 1st 1 A’ Buffer – Multipass (HDD) 2 C B 3 4 D 5 B’ input Block-levelMultipass © Copyright 2014 EMC Corporation. All rights reserved. 26 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level ▪ Random IOs 0 ▪ Fine for memory and SSD scan 2nd 1 Buffer – Multipass (HDD) 2 C 3 4 D 5 B’ input Block-levelMultipass © Copyright 2014 EMC Corporation. All rights reserved. 27 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level ▪ Random IOs 0 B’ ▪ Fine for memory and SSD scan 2nd 1 C Buffer – Multipass (HDD) 2 D 3 4 5 input Block-levelMultipass © Copyright 2014 EMC Corporation. All rights reserved. 28 Data Reorganization Re-arrange the input file, according to a recipe – Reorganizer & Restorer Options recipe 0 3 1 5 2 4 – Block-level ▪ Random IOs 0 B’ ▪ Fine for memory and SSD scan 2nd 1 C Buffer – Multipass (HDD) 2 D ▪ Convert random IOs into 3 sequential scans 4 5 input Block-levelMultipass © Copyright 2014 EMC Corporation.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    42 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us