Understanding Compression
Data Compression for Modern Developers

Colt McAnlis and Aleks Haecky

Beijing  Boston  Farnham  Sebastopol  Tokyo

Understanding Compression
by Colt McAnlis and Aleks Haecky

Copyright © 2016 Colton McAnlis and Aleks Haecky. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editor: Tim McGovern
Production Editor: Melanie Yarbrough
Copyeditor: Octal Publishing, Inc.
Proofreader: Jasmine Kwityn
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Melanie Yarbrough

July 2016: First Edition

Revision History for the First Edition
2016-07-11: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491961537 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Understanding Compression, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-96153-7

[LSI]

From CLM

To JAM and MLM: I swear to Zuul, that if you don’t eat your broccoli right now, I’m going to write a book. And in the dedication of that book, I’m going to call you out as being afraid of a piece of foliage that humans have been eating for thousands of generations. Then, 20 years from now, when you have kids of your own, I’m going to pull that book out, and show you what I wrote, and laugh in your face, because you’ll know how crazy you’re making me right now. #parenting

To KMKM: How about another decade, just for good measure?

From AH

To AHS and GHS: I hoped you’d learn to cook. Instead, you proved that humankind can survive on fresh apples and stale supermarket sushi.

Table of Contents

Foreword

Preface
    Chapter Synopsis

1. Let’s Not Be Boring
    The Five Buckets of Compression Algorithms
    Claude Shannon Is Infuriating!
    The Only Thing You Need to Know about Data Compression
    A World Built on Data Compression

2. Do Not Skip This Chapter
    Understanding Binary
    Base 10 System
    Binary Number System
    Information Theory
    An Excursion into Binary Search
    Entropy: The Minimum Bits Needed to Represent a Number
    Standard Number Lengths

3. Breaking Entropy
    Understanding Entropy
    What This Entropy Stuff Is Good For
    Understanding Probability
    Breaking Entropy
    Example 1: Delta Coding
    Example 2: Symbol Grouping
    Example 3: Permutations
    Information Theory Versus Data Compression

4. Variable-Length Codes
    Morse Code
    Probability, Entropy, and Codeword Size
    Variable-Length Codes
    Using VLCs
    Creating VLCs
    A Handful of Example VLCs
    Finding the Right Code for Your Data Set

5. Statistical Encoding
    Statistically Compressing to Entropy
    Huffman Coding
    Building a Huffman Tree
    Generating Codewords
    Encoding and Decoding
    Practical Implementations
    Arithmetic Coding
    Finding the Right Number
    Encoding
    Picking the Right Output Value
    Decoding
    Practical Implementations
    Asymmetric Numeral Systems
    Encoding and Decoding Using a Transform Table
    Creating the Reference Table
    Using ANS for Compression
    Decoding Example
    So Where Does the Compression Come From?
    Practical Compression: Which Statistical Algorithm Do I Choose?

6. Adaptive Statistical Encoding
    Locality Matters for Entropy
    Adaptive VLC Encoding
    Dynamically Building a VLC Table
    Literals
    Resets
    Knowing When to Reset
    Using This in Practice
    Adaptive Arithmetic Coding
    Adaptive Huffman Coding
    The Modern Choice

7. Dictionary Transforms
    A Basic Dictionary Transform
    Finding the Right “Words”
    The Lempel-Ziv Algorithm
    How LZ Works
    Encoding
    Decoding
    Compressing LZ Output
    LZ Variants
    Collect Them All!

8. Contextual Data Transforms
    Run-Length Encoding
    Dealing with Short Runs
    Compressing
    Delta Coding
    XOR Delta Coding
    Frame of Reference Delta Coding
    Patched Frame of Reference Delta Coding
    Compressing Delta-Encoded Data
    Does It Work on Text?
    Move-to-Front Coding
    Avoiding Rogue Symbols
    Compressing MTF
    Burrows–Wheeler Transform
    Ordering Is Important!
    How BWT Works
    Inverse BWT
    Practical Implementations
    Compressing BWT

9. Data Modeling
    The Chains of Markov
    Markov and Compression
    Practical Implementations
    Prediction by Partial Matching
    The Search Trie
    Compressing a Symbol
    Choosing a Sensible N Value
    Dealing with Unknown Symbols
    Context Mixing
    Types of Models
    Types of Mixing
    The Next Big Thing?

10. Switching Gears
    Media-Specific Compression
    General-Purpose Compression
    Compression in Practice

11. Evaluating Compression
    Compression Usage Scenarios
    Compressed Offline, Decompressed On-Client
    Compressed On-Client, Decompressed In-Cloud
    Compressed In-Cloud, Decompressed On-Client
    Compressed On-Client, Decompressed On-Client
    Compression Need
    Compression Ratio
    Compression Performance
    Decompression Performance
    Ability to Decode-Stream
    Comparing Compressors

12. Compressing Image Data Types
    Understanding Quality Versus File Size
    What Reduces Image Quality?
    Measuring Image Quality
    Making This Work
    Image Dimensions Are Important
    Choosing the Correct Image Format
    PNG
    JPG
    GIF
    WebP
    And Now for Choosing...
    GPU Texture Formats
    Vector Formats
    Eyes on the Prize

13. Serialized Data
    Understanding Common Use Cases
    Dynamically Server-Built Data
    Statically Built Server-Owned Data
    Dynamically Client-Built Data
    Statically Client-Owned Data
    Issues with Serialized Formats
    Human-Readable Text
    Slow Decode Times
    Smaller Serialized Data
    Use a Binary Serialization Format
    Restructure Lists for Better Compression
    Organize for Efficient Fetching
    Segment Out Data into the Proper Compression Format

14. Lossy Data Compression

15. Making the World a Little Smaller
    Data Compression and You
    Data Compression and the Bottom Line
    User Acquisition and Retention
    Running Costs
    Planning Ahead
    Making Your Users’ Lives a Little More Magical and Less Expensive
    Thinking About What’s Next in Technology
    The Next Five Billion Users
    Mobile Networks
    ...Starting Now

Glossary of Compression Words

Index

Foreword

When I first began programming, I had no idea what data compression was nor why it mattered.
Luckily, my Apple II Plus computer came with 0.000048 GB of memory (48 KB), which was quite a lot in 1979, and was enough to let me explore programming and computer graphics without realizing that my programs and data were constantly being compressed and decompressed behind the scenes in order to reduce their size in memory. Thanks, Woz!

After programming for a few years, I had discovered:

• Data compression took time and could slow down my software.
• Changing my data organization could make the compressed data smaller.
• There is a bewildering variety of complicated data compression algorithms.

This led to the realization that compression was not a rigid black box; rather, it was a flexible tool that greatly influenced the quality of my software and could be manipulated in several ways:

• Changing compression algorithms could make my software run faster.
• Pairing my data organization with the right compression algorithm could make my data smaller.
• Choosing the wrong data organization or algorithm could make my data larger (and/or run slower).

Ah! Now I knew why data compression mattered. If things weren’t fitting into memory or were decompressing too slowly, I could slightly change my data organization to better fit the compression algorithm. I’d simply put numbers together in one group, strings in another, build tables of recurring data types, or truncate fractions into integers. I didn’t need to do the hard work of evaluating and adopting new compression algorithms if I could fit my data to the algorithm.

Then, I began making video games professionally, and most of the game data was created by not-so-technical artists, designers, and musicians. It turned out that math was not their favorite topic of discussion, and they were less than excited about changing the game data so that it would take advantage of my single go-to compression algorithm.
Well, if the data organization couldn’t be improved, that left choosing the best compression algorithm to pair up with all of this great artistic data. I surveyed the various compression algorithms and found there were a couple of broad categories suitable for my video game data:

Lossless
• De-duplication (LZ)
• Entropy (Huffman, Arithmetic)

Lossy
• Reduced precision (truncation or decimation)
• Image/video
• Audio

For text strings and binary data, I used LZ to compress away repeating duplicate data patterns.
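
The de-duplication idea behind LZ can be illustrated with a toy sketch. The Python below is not the foreword author's code, nor any production LZ variant; it is a minimal LZ77-style matcher under assumed parameters (a 255-byte search window and a minimum match length of 3), and the names `lz_encode` and `lz_decode` are hypothetical. It replaces a repeated run with an (offset, length) pair that points back at an earlier copy, and leaves everything else as literal bytes.

```python
from typing import List, Tuple, Union

# A token is either a literal byte value or an (offset, length) back-reference.
Token = Union[int, Tuple[int, int]]


def lz_encode(data: bytes, window: int = 255, max_len: int = 255) -> List[Token]:
    """Toy LZ77-style encoder: brute-force search for the longest earlier match."""
    tokens: List[Token] = []
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Try every start position inside the search window behind position i.
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= 3:  # assumed threshold: shorter matches stay literal
            tokens.append((best_off, best_len))
            i += best_len
        else:
            tokens.append(data[i])
            i += 1
    return tokens


def lz_decode(tokens: List[Token]) -> bytes:
    """Rebuild the original bytes by copying from the output produced so far."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            offset, length = token
            for _ in range(length):
                # Byte-by-byte copy lets a match overlap the bytes it creates.
                out.append(out[-offset])
        else:
            out.append(token)
    return bytes(out)


tokens = lz_encode(b"abcabcabcabc")
print(tokens)  # [97, 98, 99, (3, 9)]
print(lz_decode(tokens) == b"abcabcabcabc")  # True
```

Here twelve input bytes become three literals and one back-reference. This sketch stops at the token list; a real compressor would still have to pack those tokens into bits, often with an entropy coder such as Huffman.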