
Data Compression Explained

Matt Mahoney

Copyright (C) 2010, 2011, Dell, Inc. You are permitted to copy and distribute material from this book provided (1) any material you distribute includes this license, (2) the material is not modified, and (3) you do not charge a fee or require any other considerations for copies or for any works that incorporate material from this book. These restrictions do not apply to normal "fair use", defined as cited quotations totaling less than one page. This book may be downloaded without charge from http://mattmahoney.net/dc/dce.html.

Last update: Oct. 30, 2011.

e-reader translations by Alejo Sanchez, Oct. 29, 2011: mobi (Kindle) and epub (other readers).

Romanian translation by Alexander Ovsov, Aug. 3, 2011.

About this Book

This book is for the reader who wants to understand how data compression works, or who wants to write data compression software. Prior programming ability and some math skills will be needed. Specific topics include:

1. Information theory
    No universal compression
    Coding is bounded
    Modeling is not computable
    Compression is an artificial intelligence problem
2. Benchmarks
    Calgary corpus results
3. Coding
    Huffman
    arithmetic
    asymmetric binary
    numeric codes (unary, Rice, Golomb, extra bit)
    archive formats (error detection, encryption)
4. Modeling
    Fixed order: bytewise, bitwise, indirect
    Variable order: DMC, PPM, CTW
    Context mixing: linear mixing, logistic mixing, SSE, indirect SSE, match, PAQ, ZPAQ, Crinkler
5. Transforms
    RLE
    LZ77 (LZSS, deflate, LZMA, LZX, ROLZ, LZP, snappy, deduplication)
    LZW and dictionary encoding
    Symbol ranking
    BWT (context sorting, inverse, bzip2, BBB, MSufSort v2 and v3, Itoh-Tanaka, DivSufSort, Bijective)
    Predictive filtering (delta coding, color transform, linear filtering)
    Specialized transforms (E8E9, precomp)
    Huffman pre-coding
6. Lossy compression
    Images (BMP, GIF, PNG, TIFF, JPEG)
    JPEG recompression (Stuffit, PAQ, WinZIP, PackJPG)
    Video (NTSC, MPEG)
    Audio (CD, MP3, AAC, Dolby, Vorbis)
Conclusion
Acknowledgements
References

This book is intended to be self contained. Sources are linked when appropriate, but you don't need to click on them to understand the material.

1. Information Theory

Data compression is the art of reducing the number of bits needed to store or transmit data. Compression can be either lossless or lossy. Losslessly compressed data can be decompressed to exactly its original value. An example is 1848 Morse Code. Each letter of the alphabet is coded as a sequence of dots and dashes. The most common letters in English like E and T receive the shortest codes. The least common like J, Q, X, and Z are assigned the longest codes.

All data compression algorithms consist of at least a model and a coder (with optional preprocessing transforms). A model estimates the probability distribution (E is more common than Z). The coder assigns shorter codes to the more likely symbols. There are efficient and optimal solutions to the coding problem. However, optimal modeling has been proven not computable. Modeling (or equivalently, prediction) is both an artificial intelligence (AI) problem and an art.

Lossy compression discards "unimportant" data, for example, details of an image or audio clip that are not perceptible to the eye or ear. An example is the 1953 NTSC standard for broadcast color TV, used until 2009. The human eye is less sensitive to fine detail between colors of equal brightness (like red and green) than it is to brightness (black and white). Thus, the color signal is transmitted with less resolution over a narrower frequency band than the monochrome signal.

Lossy compression consists of a transform to separate important from unimportant data, followed by lossless compression of the important part and discarding the rest. The transform is an AI problem because it requires understanding what the human brain can and cannot perceive.

Information theory places hard limits on what can and cannot be compressed losslessly, and by how much:

1. There is no such thing as a "universal" compression algorithm that is guaranteed to compress any input, or even any input above a certain size. In particular, it is not possible to compress random data or compress recursively.

2. Given a model (probability distribution) of your input data, the best you can do is code symbols with probability p using log₂ 1/p bits. Efficient and optimal codes are known.

3. Data has a universal but uncomputable probability distribution. Specifically, any string x has probability (about) 2^-|M| where M is the shortest possible description of x, and |M| is the length of M in bits, almost independent of the language in which M is written. However there is no general procedure for finding M or even estimating |M| in any language. There is no algorithm that tests for randomness or tells you whether a string can be compressed any further.


1.1. No Universal Compression

This is proved by the counting argument. Suppose there were a compression algorithm that could compress all strings of at least a certain size, say, n bits. There are exactly 2^n different binary strings of length n. A universal compressor would have to encode each input differently. Otherwise, if two inputs compressed to the same output, then the decompresser would not be able to decompress that output correctly. However there are only 2^n - 1 binary strings shorter than n bits, so it is impossible to map all 2^n inputs to distinct shorter outputs.

In fact, the vast majority of strings cannot be compressed by very much. The fraction of strings that can be compressed from n bits to m bits is at most 2^(m-n). For example, less than 0.4% of strings can be compressed by one byte.

Every compressor that can compress some inputs must also expand others. However, the expansion never needs to be more than one symbol. Any compression algorithm can be modified by adding one bit to indicate that the rest of the data is stored uncompressed.
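A minimal sketch of this trick, assuming ordinary Python with the standard zlib module and using a whole flag byte rather than a single bit for simplicity (the names pack and unpack are illustrative, not from this book):

    import os
    import zlib

    def pack(data: bytes) -> bytes:
        """Compress, but never expand by more than the one flag byte."""
        compressed = zlib.compress(data)
        if len(compressed) < len(data):
            return b'\x01' + compressed     # flag 1: payload is compressed
        return b'\x00' + data               # flag 0: payload is stored unchanged

    def unpack(blob: bytes) -> bytes:
        flag, payload = blob[0], blob[1:]
        return zlib.decompress(payload) if flag == 1 else payload

    random_data = os.urandom(1000)          # incompressible input
    assert unpack(pack(random_data)) == random_data
    assert len(pack(random_data)) <= len(random_data) + 1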

The counting argument applies to systems that would recursively compress their own output. In general, compressed data appears random to the algorithm that compressed it so that it cannot be compressed again.

1.2. Coding is Bounded

Suppose we wish to compress the digits of π, e.g. "314159265358979323846264...". Assume our model is that each digit occurs with probability 0.1, independent of any other digits. Consider 3 possible binary codes:

    Digit  BCD   Huffman  Binary
    -----  ----  -------  ------
    0      0000  000      0
    1      0001  001      1
    2      0010  010      10
    3      0011  011      11
    4      0100  100      100
    5      0101  101      101
    6      0110  1100     110
    7      0111  1101     111
    8      1000  1110     1000
    9      1001  1111     1001
    -----  ----  -------  ------
    bpc    4.0   3.4      not valid

Using a BCD (binary coded decimal) code, π would be encoded as 0011 0001 0100 0001 0101... (Spaces are shown for readability only). The compression ratio is 4 bits per character (4 bpc). If the input was ASCII text, the output would be compressed 50%. The decompresser would decode the data by dividing it into 4 bit strings.

The Huffman code would code π as 011 001 100 001 101 1111... The decoder would read bits one at a time and decode a digit as soon as it found a match in the table (after either 3 or 4 bits). The code is uniquely decodable because no code is a prefix of any other code. The compression ratio is 3.4 bpc.

The binary code is not uniquely decodable. For example, 111 could be decoded as 7 or 31 or 13 or 111.
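To make the comparison concrete, here is a small sketch in plain Python, with the Huffman table above hard-coded; it encodes a digit string, decodes it back, and checks the code length (3.4 bpc assumes all ten digits are equally likely):

    HUFFMAN = {'0': '000', '1': '001', '2': '010', '3': '011', '4': '100',
               '5': '101', '6': '1100', '7': '1101', '8': '1110', '9': '1111'}

    def encode(digits: str) -> str:
        return ''.join(HUFFMAN[d] for d in digits)

    def decode(bits: str) -> str:
        table = {code: digit for digit, code in HUFFMAN.items()}
        out, code = [], ''
        for b in bits:
            code += b
            if code in table:       # safe because no code is a prefix of another
                out.append(table[code])
                code = ''
        return ''.join(out)

    digits = '314159265358979323846264'
    bits = encode(digits)
    assert decode(bits) == digits
    print(len(bits) / len(digits))      # bits per digit on this sample
    print((6 * 3 + 4 * 4) / 10)         # 3.4 bpc averaged over equally likely digits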

There are better codes than the Huffman code given above. For example, we could assign Huffman codes to pairs of digits. There are 100 pairs each with probability 0.01. We could assign 6 bit codes (000000 through 011011) to 00 through 27, and 7 bits (0111000 through 1111111) to 28 through 99. The average code length is 6.72 bits per pair of digits, or 3.36 bpc. Similarly, coding groups of 3 digits using 9 or 10 bits would yield 3.3253 bpc.
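These block-code averages can be checked directly. For N equally likely symbols, a Huffman code uses codewords of two adjacent lengths, floor(log₂ N) and floor(log₂ N) + 1. A sketch of the calculation (illustrative Python, not code from this book):

    from math import floor, log2

    def avg_huffman_length(n_symbols: int) -> float:
        """Average codeword length of a Huffman code over n equally likely symbols."""
        short = floor(log2(n_symbols))            # length of the shorter codewords
        n_long = 2 * (n_symbols - 2 ** short)     # symbols assigned the longer codewords
        n_short = n_symbols - n_long
        return (n_short * short + n_long * (short + 1)) / n_symbols

    print(avg_huffman_length(100) / 2)    # pairs of digits:   3.36 bpc
    print(avg_huffman_length(1000) / 3)   # triples of digits: 3.3253... bpc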

Shannon and Weaver (1949) proved that the best you can do for a symbol with probability p is assign a code of length log₂ 1/p. In this example, log₂ 1/0.1 = 3.3219 bpc.

Shannon defined the expected information content or equivocation (now called entropy) of a random variable X as its expected code length. Suppose X may have values X₁, X₂,... and that each Xᵢ has probability p(i). Then the entropy of X is H(X) = E[log₂ 1/p(X)] = Σᵢ p(i) log₂ 1/p(i). For example, the entropy of the digits of π, according to our model, is 10 (0.1 log₂ 1/0.1) = 3.3219 bpc. There is no smaller code for this model that could be decoded unambiguously.
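A sketch of the entropy calculation, in plain Python and assuming nothing beyond the formula above:

    from math import log2

    def entropy(probs) -> float:
        """H(X) = sum of p * log2(1/p) over outcomes with nonzero probability."""
        return sum(p * log2(1 / p) for p in probs if p > 0)

    print(entropy([0.1] * 10))            # uniform decimal digits: 3.3219... bpc
    print(entropy([0.5, 0.25, 0.25]))     # a skewed 3-symbol source: 1.5 bits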

The information content of a set of strings is at most the sum of the information content of the individual strings. If X and Y are strings, then H(X,Y) ≤ H(X) + H(Y). If they are equal, then X and Y are independent. Knowing one string would tell you nothing about the other.

The conditional entropy H(X|Y) = H(X,Y) - H(Y) is the information content of X given Y. If X and Y are independent, then H(X|Y) = H(X).

If X is a string of symbols x₁x₂...xₙ, then by the chain rule, p(X) may be expressed as a product of the sequence of symbol predictions conditioned on previous symbols: p(X) = Πᵢ p(xᵢ|x₁..ᵢ₋₁). Likewise, the information content H(X) of random string X is the sum of the conditional entropies of each symbol given the previous symbols: H(X) = Σᵢ H(xᵢ|x₁..ᵢ₋₁).
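These identities are easy to check numerically for a small joint distribution. The sketch below is illustrative only: Y is a fair coin and X copies Y with probability 3/4.

    from math import log2

    def entropy(probs) -> float:
        return sum(p * log2(1 / p) for p in probs if p > 0)

    # Joint distribution p(x, y): Y is a fair coin and X equals Y with probability 3/4.
    joint = {(0, 0): 0.375, (1, 0): 0.125, (0, 1): 0.125, (1, 1): 0.375}

    H_XY = entropy(joint.values())
    H_X = entropy([sum(p for (x, y), p in joint.items() if x == v) for v in (0, 1)])
    H_Y = entropy([sum(p for (x, y), p in joint.items() if y == v) for v in (0, 1)])

    print(H_XY <= H_X + H_Y)     # True: H(X,Y) never exceeds H(X) + H(Y)
    print(H_XY - H_Y)            # H(X|Y) = 0.8113 bits, less than H(X) = 1 bit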

Entropy is both a measure of uncertainty and a lower bound on expected compression. The entropy of an information source is the expected limit to which you can compress it. There are efficient coding methods, such as arithmetic codes, which are for all practical purposes optimal in this sense. It should be emphasized, however, that entropy can only be calculated for a known probability distribution. But in general, the model is not known.

1.3. Modeling is Not Computable

We modeled the digits of π as uniformly distributed and independent. Given that model, Shannon's coding theorem places a hard limit on the best compression that could be achieved. However, it is possible to use a better model. The digits of π are not really random. The digits are only unknown until you compute them. An intelligent compressor might recognize the digits of π and encode it as a description meaning "the first million digits of pi", or as a program that reconstructs the data when run. With our previous model, the best we could do is (10^6 log₂ 10)/8 ≈ 415,241 bytes. Yet, there are very small programs that can output π, some as small as 52 bytes.
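For illustration, here is a compact (though nowhere near 52-byte) Python program that regenerates digits of π from Machin's formula using only integer arithmetic; it is a sketch of the idea, not a program from this book:

    def pi_digits(n):
        """Return the first n decimal digits of pi as a string, e.g. '3141592653...'."""
        prec = n + 10                              # guard digits to absorb rounding
        def arctan_inv(x):
            # arctan(1/x) * 10^prec by the alternating Taylor series, in integers
            one = 10 ** prec
            total = term = one // x
            k, sign = 3, -1
            while term:
                term = one // x ** k
                total += sign * (term // k)
                k, sign = k + 2, -sign
            return total
        # Machin's formula: pi = 16*arctan(1/5) - 4*arctan(1/239)
        pi_scaled = 16 * arctan_inv(5) - 4 * arctan_inv(239)
        return str(pi_scaled)[:n]

    print(pi_digits(25))    # 3141592653589793238462643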

The counting argument says that most strings are not compressible. So it is a rather remarkable fact that most strings that we care about, for example English text, images, software, sensor readings, and DNA, are in fact compressible. These strings generally have short descriptions, whether they are described in English or as a program in C or x86 machine code.

Solomonoff (1960, 1964), Kolmogorov (1965), and Chaitin (1966) independently proposed a universal a-priori probability distribution over strings based on their minimum description length. The algorithmic probability of a string x is defined as the fraction of random programs in some language L that output x, where each program M is weighted by 2^-|M| and |M| is the length of M in bits. This probability is dominated by the shortest such program. We call this length the Kolmogorov complexity K_L(x) of x.

Algorithmic probability and complexity of a string x depend on the choice of language L, but only by a constant that is independent of x. Suppose that M1 and M2 are encodings of x in languages L1 and L2 respectively. For example, if L1 is C++, then M1 would be a program in C++ that outputs x. If L2 is English, then M2 would be a description of x with just enough detail that it would allow you to write x exactly. Now for any pair of languages it is possible to write in one language a compiler or interpreter or rules for understanding the other language. For example, you could write a description of the C++ language in English so that you could (in theory) read any C++ program and predict its output by "running" it in your head. Conversely, you could (in theory) write a program in C++ that inputs an English language description and translates it into C++. The size of the language description or compiler does not depend on x in any way. Then for any description M1 in any language L1 of any x, it is possible to find a description M2 in any other language L2 of x by appending to M1 a fixed length description of L1 written in L2.

It is not proven that algorithmic probability is a true universal prior probability. Nevertheless it is widely accepted on empirical grounds because of its success in sequence prediction over a wide range of data types. In machine learning, the minimum description length principle of choosing the simplest hypothesis consistent with the training data applies to a wide range of algorithms. It formalizes Occam's Razor. Occam noted in the 14th century that (paraphrasing) "the simplest answer is usually the correct answer". Occam's Razor is universally applied in all of the sciences because we know from experience that the simplest or most elegant (i.e. shortest) theory that explains the data tends to be the best predictor of future experiments.

To summarize, the best compression we can achieve for any string x is to encode it as the shortest program M in some language L that outputs x. Furthermore, the choice of L becomes less important as the strings get longer. All that remains is to find a procedure that finds M for any x in some language L. However, Kolmogorov proved that there is no such procedure in any language. In fact, there is no procedure that just gives you |M|, the length of the shortest possible description of x, for all x. Suppose there were. Then it would be possible to describe "the first string that cannot be described in less than a million bits" leading to the paradox that we had just done so. (By "first", assume an ordering over strings from shortest to longest, breaking ties lexicographically).

Nor is there a general test for randomness. Li and Vitanyi (2007) give a simple proof of a stronger variation of Gödel's first incompleteness theorem, namely that in any consistent, formal system (a set of axioms and rules of inference) powerful enough to describe statements in arithmetic, there is at least one true statement that cannot be proven. Li and Vitanyi prove that if the system is sound (you can't prove any false statements) then there are an infinite number of true but unprovable statements. In particular, there are only a finite number of statements of the form "x is random" (K(x) ≥ |x|) that can be proven, out of an infinite number of possible finite strings x. Suppose otherwise. Then it would be possible to enumerate all proofs (lists of axioms and applications of the rules of inference) and describe "the string x such that it is the first to be proven to be a million bits long and random" in spite of the fact that we just gave a short description of it. If F describes any formal system, then the longest string that can be proven to be random is never much longer than F, even though there are an infinite number of longer strings and most of them are random.

Nor is there a general test to determine whether the compression of a string can be improved any further, because that would be equivalent to testing whether the compressed string is random.

Because determining the length of the shortest descriptions of strings is not computable, neither is optimal compression. It is not hard to find difficult cases. For example, consider the short description "a string of a million zero bytes compressed with AES in CBC mode with key 'foo'". To any program that does not know the key, the data looks completely random and incompressible.
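A sketch of this effect, assuming the third-party pycryptodome package for AES and an ad hoc key derivation from the passphrase 'foo' (the key and IV details are illustrative, not part of the example above):

    import hashlib
    import zlib
    from Crypto.Cipher import AES    # third-party pycryptodome package

    key = hashlib.sha256(b'foo').digest()[:16]    # illustrative key derivation from 'foo'
    iv = bytes(16)                                # fixed all-zero IV, for the demo only
    plaintext = bytes(10**6)                      # a million zero bytes

    ciphertext = AES.new(key, AES.MODE_CBC, iv).encrypt(plaintext)

    print(len(zlib.compress(plaintext)))     # about 1 KB: highly compressible
    print(len(zlib.compress(ciphertext)))    # about 1 MB: looks random, will not compress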

1.4. Compression is an Artificial Intelligence Problem

Prediction is intuitively related to understanding. If you understand a sequence, then you can predict it. If you understand a language, then you could predict what word might appear next in a paragraph in that language. If you understand an image, then you could predict what might lie under portions of the image that are covered up. Alternatively, random data has no meaning and is not predictable. This intuition suggests that prediction or compression could be used to test or measure understanding. We now formalize this idea.

Optimal compression, if it were computable, would optimally solve the artificial intelligence (AI) problem under two vastly different definitions of "intelligence": the Turing test (Turing, 1950), and universal intelligence (Legg and Hutter, 2006).

Turing first proposed a test for AI to sidestep the philosophically difficult question (which he considered irrelevant) of whether machines could think. This test, now known as the Turing test, is widely accepted. The test is a game played by two humans who have not previously met and the machine under test. One human (the judge) communicates with the other human (the confederate) and with the machine through a terminal. Both the confederate and the machine try to convince the judge that each is human. If the judge cannot guess correctly which is the machine 70% of the time after 10 minutes of interaction, then the machine is said to have AI. Turing gave the following example of a possible dialogue:

    Q: Please write me a sonnet on the subject of the Forth Bridge.
    A: Count me out on this one. I never could write poetry.
    Q: Add 34957 to 70764.
    A: (Pause about 30 seconds and then give as answer) 105621.
    Q: Do you play chess?
    A: Yes.
    Q: I have K at my K1, and no other pieces. You have only K at K6 and R at R1. It is your move. What do you play?
    A: (After a pause of 15 seconds) R-R8 mate.

It should be evident that compressing transcripts like this requires the ability to compute a model of the form p(A|Q) = p(QA)/p(Q), where Q is the context up to the current question, and A is the response. But if a model could make such predictions accurately, then it could also generate responses indistinguishable from that of a human.

Predicting transcripts is similar to the problem of predicting ordinary written language. Either case requires vast, real-world knowledge. Shannon (1950) estimated that the information content of written case-insensitive English without punctuation is 0.6 to 1.3 bits per character, based on experiments in which human subjects guessed successive characters in text with the help of letter n-gram frequency tables and dictionaries. The uncertainty is due not so much to variation in subject matter and human skill as to the fact that different probability assignments lead to the same observed guessing sequences. Nevertheless, the best text compressors are only now compressing near the upper end of this range.

Legg and Hutter proposed the second definition, universal intelligence, to be far more general than Turing's human intelligence. They consider the problem of reward-seeking agents in completely arbitrary environments described by random programs. In this model, an agent communicates with an environment by sending and receiving symbols. The environment also sends a reinforcement or reward signal to the agent. The goal of the agent is to maximize accumulated reward. Universal intelligence is defined as the expected reward over all possible environments, where the probability of each environment described by a program M is algorithmic, proportional to 2^-|M|. Hutter (2004, 2007) proved that the optimal (but not computable) strategy for the agent is to guess after each input that the distribution over M is dominated by the shortest program consistent with past observation.

Hutter calls this strategy AIXI. It is, of course, just our uncomputable compression problem applied to a transcript of past interaction. AIXI may also be considered a formal statement and proof of Occam's Razor. The best predictor of the future is the simplest or shortest theory that explains the past.

1.5. Summary

There is no such thing as universal compression, recursive compression, or compression of random data.

Most strings are random. Most meaningful strings are not.

Compression = modeling + coding. Coding is a solved problem. Modeling is provably not solvable.

Compression is both an art and an artificial intelligence problem. The key to compression is to understand the data you want to compress.

2. Benchmarks

A data compression benchmark measures compression ratio over a data set, and sometimes memory usage and speed on a particular computer. Some benchmarks evaluate size only, in order to avoid hardware dependencies. Compression ratio is often measured by the size of the compressed output file, or in bits per character (bpc) meaning compressed bits per uncompressed byte. In either case, smaller numbers are better. 8 bpc means no compression. 6 bpc means 25% compression or 75% of original size.
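As a quick check of the arithmetic, a tiny sketch in plain Python (the file sizes are made-up placeholders):

    def bpc(original_bytes: int, compressed_bytes: int) -> float:
        """Compressed bits per uncompressed byte."""
        return 8.0 * compressed_bytes / original_bytes

    print(bpc(1000, 1000))    # 8.0 bpc: no compression
    print(bpc(1000, 750))     # 6.0 bpc: 25% compression, 75% of original size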

Generally there is a three-way trade-off between size, speed, and memory usage. The top ranked compressors by size require a lot of computing resources.

2.1. Calgary Corpus

The Calgary corpus is the oldest compression benchmark still in use. It was created in 1987 and described in a survey of text compression models in 1989 (Bell, Witten and Cleary, 1989). It consists of 14 files with a total size of 3,141,622 bytes as follows:

    111,261 BIB    - ASCII text in "refer" format - 725 bibliographic references.
    768,771 BOOK1  - unformatted ASCII text - Thomas Hardy: Far from the Madding Crowd.
    610,856 BOOK2  - ASCII text in UNIX "troff" format - Witten: Principles of Computer Speech.
    102,400 GEO    - 32 bit numbers in IBM floating point format - seismic data.
    377,109 NEWS   - ASCII text - USENET batch file on a variety of topics.
     21,504 OBJ1   - VAX executable program - compilation of PROGP.
    246,814 OBJ2   - executable program - "Knowledge Support System".
     53,161 PAPER1 - UNIX "troff" format - Witten, Neal, Cleary: Arithmetic Coding for Data Compression.
     82,199 PAPER2 - UNIX "troff" format - Witten: Computer (in)security.
    513,216 PIC    - 1728 x 2376 image (MSB first): text in French and line diagrams.
     39,611 PROGC  - Source code in C - UNIX compress v4.0.
     71,646 PROGL  - Source code in Lisp - system software.
     49,379 PROGP  - Source code in Pascal - program to evaluate PPM compression.
     93,695 TRANS  - ASCII and control characters - transcript of a terminal session.


The structure of the corpus is shown in the diagram below. Each pixel represents a match between consecutive occurrences of a string. The color of the pixel represents the length of the match: black for 1 byte, red for 2, green for 4 and blue for 8. The horizontal axis represents the position of the second occurrence of the string. The vertical axis represents the distance back to the match on a logarithmic scale. (The image was generated by the fv program with labels added by hand).

String match structure of the Calgary corpus

The horizontal dark line at around 60-70 in BOOK1 is the result of linefeed characters repeating at this regular interval. Dark lines at 1280 and 5576 in GEO are due to repeated data in block headers. The dark bands at multiples of 4 are due to the 4 byte structure of the data. PIC has a dark band at 1 due to long runs of zero bytes, and lines at multiples of 216, which is the width of the image in bytes. The blue curves at the top of the image show matches between different text files, namely BIB, BOOK1, BOOK2, NEWS, PAPER1, PAPER2, PROGC, PROGL, PROGP and TRANS, all of which contain English text. Thus, compressing these files together can result in better compression because they contain mutual information. No such matches are seen between GEO, OBJ2, and PIC with other files. Thus, these files should be compressed separately.

Other structure can be seen. For example, there are dark bands in OBJ2 at 4, 6, 8, and higher even numbers due to the lengths of 68000 machine instructions. The dark band in BIB at around 150-200 represents the average length of a bibliographic entry. This knowledge is useful in designing compression algorithms.
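The kind of analysis behind the diagram can be reproduced with a few lines of code. The sketch below is not the fv program itself; it simply records, for each position in the data, the distance back to the previous occurrence of the preceding 1, 2, 4, or 8 bytes, which is what the plot visualizes:

    from math import log2

    def match_structure(data: bytes, lengths=(1, 2, 4, 8)):
        """Yield (position, match length, log2 of distance) for repeated substrings."""
        last_seen = {n: {} for n in lengths}     # maps an n-byte string to its last position
        for i in range(len(data)):
            for n in lengths:
                if i < n:
                    continue
                context = data[i - n:i]
                if context in last_seen[n]:
                    yield i, n, log2(i - last_seen[n][context])
                last_seen[n][context] = i

    sample = b"abracadabra abracadabra"
    for pos, length, logdist in match_structure(sample, lengths=(4,)):
        print(pos, length, round(logdist, 1))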

2.1.1. BOOK1

A sample of text from BOOK1:

was receiving a percentage from the farmer till such time as the advance should be cleared off Oak found+ that the value of stock, plant, and implements which were really his own would be about sufficient to pay his debts, leaving himself a free man with the clothes he stood up in, and nothing more.

THE FAIR -- THE JOURNEY -- THE FIRE


TWO months passed away. We are brought on to a day in February, on which was held the yearly statute or hiring fair in the county-town of Casterbridge. At one end of the street stood from two to three hundred blithe and hearty labourers waiting upon Chance -- all men of the stamp to whom labour suggests nothing worse than a wrestle with gravitation, and pleasure nothing better than a renunciation of the same among these, carters and waggoners were distinguished by having a piece of whip-cord twisted round their hats;

The tables below show the 10 most frequent n-grams (n byte sequences) in BOOK1 for n from 1 to 5, and the 10 most frequent words, bigrams, and trigrams (2 or 3 word sequences). Words (but not n-grams) were counted as sequences of letters with other punctuation ignored.

    Count n=1   Count n=2   Count n=3   Count n=4   Count n=5    Count Word   Count Bigram    Count Trigram
    ----------------------------------------------------------------------------------------------------
    125551       20452 e    11229 th     7991 the    5869 the     7079 the     890 of the      127
     72431 e     17470 he    9585 the    6366 the    3238 and     4071 and     648 in the       38
     50027 t     17173 t     8989 he     3771 and    1918 , and   3760 of      401 to the       36
     47836 a     15995 th    4728 ing    3647 and    1590 was     3701 a       297 and the      34
     44795 o     13649 a     4666 and    3373 ing    1407 that    3565 to      292 to be        30
     40919 n     13255 d     4564 nd     3179 of     1356 n the   2265 in      268 in a         30
     37561 h     11153 in    4539 an     2966 to     1262 that    2217 I       254 of a         30
     37007 i     10937 t     4501 ed     1942 , an   1056 d the   1966 was     231 don t        28
     36788 s     10749 s     3936 to     1853 in     1050 with    1536 that    225 at the       26
     32889 r      9974 er    3756 ng     1843 was    1030 of th   1430 her     201 from the     24
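Counts like these are easy to reproduce. A sketch in plain Python, assuming BOOK1 is in the current directory (the counting details, such as case sensitivity, may differ slightly from the table above):

    import re
    from collections import Counter

    data = open('BOOK1', 'rb').read().decode('latin-1')

    # The most frequent n-grams (n byte sequences) for n = 1..5
    for n in range(1, 6):
        grams = Counter(data[i:i + n] for i in range(len(data) - n + 1))
        print(n, grams.most_common(3))

    # The most frequent words, counted as letter sequences with punctuation ignored
    words = Counter(re.findall(r"[A-Za-z]+", data))
    print(words.most_common(10))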

2.1.2. PIC

PIC is an uncompressed FAX scan of a page of a French textbook shown below. The image is represented by scanning in rows from the top left using a 1 for black or 0 for white. The bits are packed 8 to a byte (MSB first) with 216 bytes per row. The file has no header. It can be viewed in some imaging software as a .pbm file by adding the following two lines of text to the beginning.

    P4
    1728 2376
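A sketch of doing this programmatically (the input and output file names are placeholders):

    # Prepend a PBM header so that PIC can be opened by ordinary image viewers.
    with open('PIC', 'rb') as f:
        pic = f.read()                       # 2376 rows of 216 bytes, no header

    with open('pic.pbm', 'wb') as f:
        f.write(b'P4\n1728 2376\n' + pic)    # P4 = packed 1 bit per pixel, MSB first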

