The Illusion of Space and the Science of Data Compression and Deduplication

EMC Proven Professional Knowledge Sharing 2010

Bruce Yellin
Advisory Technology Consultant
EMC Corporation
[email protected]

Table of Contents

What Was Old Is New Again
The Business Benefits of Saving Space
Data Compression Strategies
    Data Compression Basics
    Compression Bakeoff
Data Deduplication Strategies
    Deduplication - Theory of Operation
    File Level Deduplication - Single Instance Storage
    Fixed-Block Deduplication
    Variable-Block Deduplication
    Content-Aware Deduplication
    Delta Block Optimization
Primary and Secondary Storage Optimization
    In-Line versus Post-processing
        Primary Storage Optimization - In-line versus Post-Processing
        Secondary Storage Optimization - In-line versus Post-Processing
    Source versus Target
    Secondary Storage Optimization - Who Makes What?
    Optimization Software versus Hardware
Data Communications Optimization
The Law and Storage Optimization
Storage Optimization “Gotchas” and Misadventures
Before You Purchase an Optimization System
Conclusion
Appendix I – MD5 and SHA-1 Algorithms
Appendix II – SNIA Deduplication and VTL Features
Footnotes

Disclaimer: The views, processes or methodologies published in this compilation are those of the authors.
They do not necessarily reflect EMC Corporation’s views, processes, or methodologies.

2009 was a turbulent year for the IT world. Faced with severe macroeconomic pressures and uncertainty, we have all participated in cost containment and reduction plans, trying to do more with less. In spite of all the grim financial news, the information age continues to expand[1] at an incredible rate, forcing data storage to grow 40-60% a year. For example, Americans sent 110 billion text messages in a single month[2], with a projected annual rate of 2 trillion messages. The growth is attributed in part to the adoption of business intelligence suites, enterprise and web applications[3], and additional regulations. In our own everyday lives, we create and share multimedia documents, depend on email, and send tweets, texts, and instant messages.

Responding to the growth, companies add (virtual) servers, network devices, and more storage. Like an iceberg where the bulk of the mass lies beneath the surface, each additional terabyte of primary data might need 5-30 times that amount in additional tape or disk backup capacity. More is spent on floor space, power, and cooling. There have been substantial increases in networking costs, especially as a greater emphasis is placed on disaster recovery (DR). And of course, personnel are placed under greater stress, especially when backups do not complete on time. This is clearly not the “green” way to go!

The outlook for 2010 and beyond does not provide any relief, according to a recent study[4]. IDC says information is increasing by a factor of 5 while budgets are increasing by only 20% and IT staff is increasing by only 10%.
They also found that administrative and overhead storage costs are 4-7X the capital expense over the next four years!

[Figure: “5-Fold Growth in 4 Years.” Digital information in exabytes, 2008 to 2012, on a vertical axis from 0 to 2,500, with a 486-exabyte data point labeled. Sources shown include DVD, RFID, digital TV, MP3 players, digital cameras, camera phones, VoIP, medical imaging, laptops, data center applications, games, satellite images, GPS, ATMs, scanners, sensors, digital radio, DLP theaters, telematics, peer-to-peer, email, instant messaging, videoconferencing, CAD/CAM, toys, industrial machines, security systems, and appliances.]

Improving storage operational efficiency is imperative for organizations with flat or slightly increasing budgets while they also try to improve performance and reliability. IT leaders have begun to tier data, use thin provisioning, set hard quotas, and even archive inactive data. Most managers are also placing big bets on compression and the hottest concept, deduplication.

Compression and deduplication algorithms are powerful weapons to use against “evil” data sprawl. They both effectively reduce physical storage requirements by leveraging CPU cycles. The secret is to know how, when, and where to use them. Over the millennia, the common element has been to save space. Using less space saves money and time, and even fosters new concepts that might ordinarily be considered impractical.

[Figure: cartoon with the speech bubbles “I Love Data Sprawl!” and “Not while I’m here!”]

What Was Old Is New Again

During World War II, a blivet was a slang expression conveying the idea that it was impossible to get “ten pounds of manure in a five pound bag”. A blivet, also synonymous with a seemingly intractable problem, might also have been an apt description of trying to store 4TB of data on a 2TB disk drive, except for the science of compression. We live in a miraculous world where every so often, technology changes the rules of the game. Space compression is today’s game changer, allowing us to reduce data by 30%, 60% and even 90%!
Compression, a gem of the IT world, is one of those rare technologies with a lineage going back thousands of years. The ancient Greeks wrote their documents on papyrus, a thick paper-like material made from the pith of the papyrus plant[5]. Papyrus was scarce, and scriptio continua ("continuous script" in Latin) allowed people to squeeze more words into smaller areas[6] by removing the spaces between words. For example, here is part of a 181 C.E. “letter from Apollonius and Herminus to Herodes and other managers of the public bank, authorizing them to receive the tax on the sale of a slave.”[7] Look closely and you will not see a single space!

The Romans made extensive use of Latin abbreviations and practiced scriptio continua, leaving out spaces on coins and other mementos to save space. For example, the coin to the right dating back to the Roman Empire uses these 32 letters:

“NEROCLAVDCAESARAVGGERPMTRPIMPPP”

Inserting spaces gives us these individual words:

NERO (his name)
CLAVD (part of his name; stands for Claudius)
CAESAR (an Imperial title with its roots in the family name of Julius Caesar)
AVG (AVGVSTVS, or Augustus, the highest authority in Rome)
GER (GERMANICVS, Ruler or Conqueror of Germania)
PM (PONTIFEX MAXIMVS, or supreme priest)
TR P (TRIBVNICIA POTESTAS, Power or Potency of the Tribunate)
IMP (IMPERATOR, meaning the ruler is the Commander-In-Chief of the armed forces)
PP (PATER PATRIAE, or father of the country)

The translation has 198 letters, spaces and punctuation: “Caesar Augustus Nero Claudius, High Priest and Ruler of Rome and Germania, Supreme Commander of the armies of Rome, the father of his country, leader of the Triumvirate for as long as he shall live.”[8] This produced a compression ratio of 198/32, or roughly 6:1.

COMPRESSION OR DEDUPLICATION RATIO - The reduction in space expressed as a ratio. If we shrink 30GB down to 10GB, the result has a 3:1 ratio.
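As a rough illustration of the ratio just defined, the short Python sketch below computes a ratio and the equivalent fraction of space saved. The function names `compression_ratio` and `space_savings` are hypothetical helpers introduced here for illustration and do not come from the paper; the inputs are the figures quoted in the text (30GB shrunk to 10GB, and 198 characters written with 32).

```python
def compression_ratio(original_size: float, compressed_size: float) -> float:
    """Return original_size / compressed_size (3.0 means a 3:1 ratio)."""
    if compressed_size <= 0:
        raise ValueError("compressed_size must be positive")
    return original_size / compressed_size


def space_savings(original_size: float, compressed_size: float) -> float:
    """Return the fraction of the original space that was saved."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return 1.0 - compressed_size / original_size


if __name__ == "__main__":
    # Figures taken from the text; units cancel, so GB or characters both work.
    print(f"{compression_ratio(30, 10):.1f}:1")   # 3.0:1
    print(f"{compression_ratio(198, 32):.1f}:1")  # 6.2:1
    print(f"{space_savings(30, 10):.0%} saved")   # 67% saved
```

Seen this way, a 3:1 ratio and "about 67% saved" describe the same result, which helps when comparing figures quoted as ratios against figures quoted as percentage reductions.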
$$\text{Ratio} = \frac{\text{original\_size}}{\text{compressed\_size}} = \frac{30\,\text{GB}}{10\,\text{GB}} = 3{:}1$$

I am not sure who invented the shot glass, but next time you visit a pub, notice how they benefit from compression