Analysis of Algorithms
Piyush Kumar (Lecture 5: Compression)
Welcome to 4531
Source: Guy E. Blelloch, Emad, Tseng …

Compression Programs
• File Compression: Gzip, Bzip
• Archivers: Arc, Pkzip, Winrar, …
• File Systems: NTFS

Outline
• Introduction: Lossy vs. Lossless
• Information Theory: Entropy, etc.
• Probability Coding: Huffman + Arithmetic coding

Multimedia Compression
• HDTV (MPEG-4)
• Sound (MP3)
• Images (JPEG)

Encoding/Decoding
We will use “message” in the generic sense to mean the data to be compressed.

  Input Message → Encoder → Compressed Message → Decoder → Output Message

CODEC: the encoder and decoder need to understand a common compressed format.

Lossless vs. Lossy
Lossless: Input message = Output message
Lossy: Input message ≠ Output message
Lossy does not necessarily mean loss of quality. In fact the output could be “better” than the input:
– Drop random noise in images (dust on lens)
– Drop background in music
– Fix spelling errors in text; put it into better form.
Writing is the art of lossy text compression.

How much can we compress?
For lossless compression, assuming all input messages are valid: if even one string is compressed, some other string must expand.

Techniques
• LZW (Lempel-Ziv-Welch) compression
  – Build a dictionary
  – Replace patterns with an index into the dictionary
• Burrows-Wheeler transform
  – Block-sort data to improve compression
• Run length encoding
  – Find & compress repetitive sequences
• Huffman coding
  – Use variable-length codes based on frequency

Quality of Compression
Runtime vs. Compression vs. Generality
Several corpora are used to compare algorithms, e.g. the Calgary Corpus:
• 2 books, 5 papers, 1 bibliography, 1 collection of news articles, 3 programs, 1 terminal session, 2 object files, 1 geophysical data set, 1 bitmap b/w image
The Archive Comparison Test maintains a comparison of just about all publicly available algorithms.

Model vs. Coder
To compress we need a bias on the probability of messages. The model determines this bias.

  Encoder:  Messages → Model → Probs. → Coder → Bits

Example models:
– Simple: character counts, repeated strings
– Complex: models of a human face

Comparison of Algorithms

  Program  Algorithm  Time    BPC   Score
  BOA      PPM Var.   94+97   1.91  407
  PPMD     PPM        11+20   2.07  265
  IMP      BW         10+3    2.14  254
  BZIP     BW         20+6    2.19  273
  GZIP     LZ77 Var.  19+5    2.59  318
  LZ77     LZ77       ?       3.94  ?

Information Theory
An interface between modeling and coding
• Entropy
  – A measure of information content
• Entropy of the English Language
  – How much information does each character in “typical” English text contain?

Entropy (Shannon 1948)
For a set of messages S with probability p(s), s ∈ S, the self information of s is:

  i(s) = log(1/p(s)) = −log p(s)

Measured in bits if the log is base 2. The lower the probability, the higher the information.
Entropy is the weighted average of self information:

  H(S) = Σ_{s∈S} p(s) log(1/p(s))

Entropy Example
  p(S) = {.25, .25, .25, .125, .125}        H(S) = 3 × .25 log 4 + 2 × .125 log 8 = 2.25
  p(S) = {.5, .125, .125, .125, .125}       H(S) = .5 log 2 + 4 × .125 log 8 = 2
  p(S) = {.75, .0625, .0625, .0625, .0625}  H(S) = .75 log(4/3) + 4 × .0625 log 16 ≈ 1.3
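As a quick check of the definition, here is a minimal Python sketch (the helper name entropy is ours, not from the lecture) that reproduces the three example values:

    import math

    def entropy(probs):
        # Weighted average of self information, in bits (log base 2).
        return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

    print(entropy([.25, .25, .25, .125, .125]))        # 2.25
    print(entropy([.5, .125, .125, .125, .125]))       # 2.0
    print(entropy([.75, .0625, .0625, .0625, .0625]))  # ~1.31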

Entropy of the English Language
How can we measure the information per character?
  ASCII code = 7
  Entropy = 4.5 (based on character probabilities)
  Huffman codes (average) = 4.7
  Unix Compress = 3.5
  Gzip = 2.5
  BOA = 1.9 (currently close to the best text compressors)
So the entropy must be less than 1.9.

Shannon’s experiment
Shannon asked humans to predict the next character given the whole previous text. He used these as conditional probabilities to estimate the entropy of the English language.
The number of guesses required for the right answer:

  # of guesses   1    2    3    4    5    > 5
  Probability   .79  .08  .03  .02  .02  .05

From the experiment he predicted H(English) = .6 – 1.3.

Data Compression Model

  Input data
    → Reduce Data Redundancy
    → Reduction of Entropy
    → Compressed Data

Coding
How do we use the probabilities to code messages?
• Prefix codes and their relationship to entropy
• Huffman codes
• Arithmetic codes
• Implicit probability codes…

Assumptions
Communication (or a file) is broken up into pieces called messages.
Adjacent messages might be of different types and come from different probability distributions.
We will consider two types of coding:
• Discrete: each message is a fixed set of bits
  – Huffman coding, Shannon-Fano coding
• Blended: bits can be “shared” among messages
  – Arithmetic coding

Uniquely Decodable Codes
A variable length code assigns a bit string (codeword) of variable length to every message value,
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence of bits 1011? Is it aba, ca, or ad?
A uniquely decodable code is a variable length code in which bit strings can always be uniquely decomposed into codewords.

Prefix Codes
A prefix code is a variable length code in which no codeword is a prefix of another codeword,
e.g. a = 0, b = 110, c = 111, d = 10
It can be viewed as a binary tree with message values at the leaves and 0s or 1s on the edges (in the tree for this code, a is at depth 1, d at depth 2, and b and c are siblings at depth 3).

Some Prefix Codes for Integers

  n   Binary   Unary     Split
  1   ..001    0         1|
  2   ..010    10        10|0
  3   ..011    110       10|1
  4   ..100    1110      110|00
  5   ..101    11110     110|01
  6   ..110    111110    110|10

Many other fixed prefix codes exist: Golomb, phased-binary, subexponential, …

Average Bit Length
For a code C with associated probabilities p(c), the average bit length is defined as

  ABL(C) = Σ_{c∈C} p(c) l(c)

We say that a prefix code C is optimal if for all prefix codes C’, ABL(C) ≤ ABL(C’).

Relationship to Entropy
Theorem (lower bound): For any probability distribution p(S) with associated uniquely decodable code C,
  H(S) ≤ ABL(C)
Theorem (upper bound): For any probability distribution p(S) with associated optimal prefix code C,
  ABL(C) ≤ H(S) + 1
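A small sketch of the ABL computation (helper names are ours), using the prefix code a = 000, b = 001, c = 01, d = 1 that appears later in the Huffman example; it also checks the two bounds numerically:

    import math

    def abl(code):
        # Average bit length: sum of p(c) * l(c) over the codewords.
        return sum(p * len(word) for word, p in code.values())

    code = {"a": ("000", 0.1), "b": ("001", 0.2), "c": ("01", 0.2), "d": ("1", 0.5)}
    H = sum(p * math.log2(1 / p) for _, p in code.values())
    print(H, abl(code))   # ~1.76 and 1.8, so H(S) <= ABL(C) <= H(S) + 1 holds here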

Kraft-McMillan Inequality
Theorem (Kraft-McMillan): For any uniquely decodable code C,
  Σ_{c∈C} 2^(−l(c)) ≤ 1
Also, for any set of lengths L such that Σ_{l∈L} 2^(−l) ≤ 1, there is a prefix code C such that l(cᵢ) = lᵢ (i = 1, …, |L|).

Proof of the Upper Bound (Part 1)
Assign to each message a length l(s) = ⌈log(1/p(s))⌉. We then have

  Σ_{s∈S} 2^(−l(s)) = Σ_{s∈S} 2^(−⌈log(1/p(s))⌉)
                    ≤ Σ_{s∈S} 2^(−log(1/p(s)))
                    = Σ_{s∈S} p(s)
                    = 1

So by the Kraft-McMillan inequality there is a prefix code with lengths l(s).
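A tiny numerical illustration of the inequality (the function name is ours). Note that satisfying the inequality does not by itself make a particular code uniquely decodable; it only guarantees that some prefix code with those lengths exists:

    def kraft_sum(lengths):
        # Kraft-McMillan sum: sum of 2^(-l) over the codeword lengths.
        return sum(2.0 ** -l for l in lengths)

    # Lengths of the prefix code a = 0, b = 110, c = 111, d = 10:
    print(kraft_sum([1, 3, 3, 2]))   # 0.5 + 0.125 + 0.125 + 0.25 = 1.0 <= 1
    # Lengths of the non-uniquely-decodable code a = 1, b = 01, c = 101, d = 011:
    print(kraft_sum([1, 2, 3, 3]))   # also 1.0: the condition is necessary, not sufficient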

Proof of the Upper Bound (Part 2)
Now we can calculate the average length given l(s):

  ABL(S) = Σ_{s∈S} p(s) l(s)
         = Σ_{s∈S} p(s) ⌈log(1/p(s))⌉
         ≤ Σ_{s∈S} p(s) (1 + log(1/p(s)))
         = 1 + Σ_{s∈S} p(s) log(1/p(s))
         = 1 + H(S)

And we are done.

Another property of optimal codes
Theorem: If C is an optimal prefix code for the probabilities {p1, …, pn}, then pᵢ > pⱼ implies l(cᵢ) ≤ l(cⱼ).
Proof (by contradiction): Assume l(cᵢ) > l(cⱼ). Consider switching codewords cᵢ and cⱼ. If lₐ is the average length of the original code, the length of the new code is

  lₐ’ = lₐ + pⱼ(l(cᵢ) − l(cⱼ)) + pᵢ(l(cⱼ) − l(cᵢ))
      = lₐ + (pⱼ − pᵢ)(l(cᵢ) − l(cⱼ))
      < lₐ

This is a contradiction, since lₐ was supposed to be optimal.

Corollary
• If pᵢ is the smallest probability in the code, then l(cᵢ) is the largest.

Huffman Coding
Binary trees for compression

Huffman Codes
Invented by Huffman as a class assignment in 1950.
Used in many, if not most, compression algorithms:
• gzip, bzip, (as option), fax compression, …
Properties:
– Generates optimal prefix codes
– Cheap to generate codes
– Cheap to encode and decode
– lₐ = H if probabilities are powers of 2

Huffman Code
• Approach
  – Variable length encoding of symbols
  – Exploit statistical frequency of symbols
  – Efficient when symbol probabilities vary widely
• Principle
  – Use fewer bits to represent frequent symbols
  – Use more bits to represent infrequent symbols

Huffman Code Example

  Symbol              Dog     Cat     Bird    Fish
  Frequency           1/8     1/4     1/2     1/8
  Original encoding   00      01      10      11
                      2 bits  2 bits  2 bits  2 bits
  Huffman encoding    110     10      0       111
                      3 bits  2 bits  1 bit   3 bits

Expected size:
– Original: 1/8·2 + 1/4·2 + 1/2·2 + 1/8·2 = 2 bits/symbol
– Huffman: 1/8·3 + 1/4·2 + 1/2·1 + 1/8·3 = 1.75 bits/symbol

Huffman Algorithm
• Start with a forest of trees, each consisting of a single vertex corresponding to a message s and with weight p(s)
• Repeat:
  – Select the two trees with minimum weight roots p1 and p2
  – Join them into a single tree by adding a root with weight p1 + p2
(a sketch of this algorithm follows below)
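A minimal Python sketch of the algorithm above, assuming the frequencies are given as a dictionary (function and variable names are ours). Exact codewords depend on how ties are broken, but the code lengths and the 1.75 bits/symbol average match the example:

    import heapq
    from itertools import count

    def huffman_code(freqs):
        # Forest of single-node trees; each heap entry is (weight, tie-breaker, codes).
        tie = count()
        heap = [(p, next(tie), {sym: ""}) for sym, p in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, code1 = heapq.heappop(heap)   # two minimum-weight roots
            p2, _, code2 = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in code1.items()}
            merged.update({s: "1" + c for s, c in code2.items()})
            heapq.heappush(heap, (p1 + p2, next(tie), merged))  # new root, weight p1 + p2
        return heap[0][2]

    freqs = {"Dog": 1/8, "Cat": 1/4, "Bird": 1/2, "Fish": 1/8}
    code = huffman_code(freqs)
    print(code)
    print(sum(p * len(code[s]) for s, p in freqs.items()))   # 1.75 bits/symbol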

Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
  Step 1: merge a(.1) and b(.2) into a tree of weight (.3)
  Step 2: merge (.3) and c(.2) into a tree of weight (.5)
  Step 3: merge (.5) and d(.5) into the final tree of weight (1.0)
Resulting code: a = 000, b = 001, c = 01, d = 1

Encoding and Decoding
Encoding: Start at the leaf of the Huffman tree and follow the path to the root. Reverse the order of the bits and send.
Decoding: Start at the root of the Huffman tree and take a branch for each bit received. When at a leaf, output the message and return to the root.
There are even faster methods that can process 8 or 32 bits at a time.
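The table-driven decoder below is equivalent to the root-to-leaf tree walk described above: accumulating bits until they spell a complete codeword is the same as reaching a leaf (helper names are ours):

    def encode(message, code):
        # Concatenate codewords; the prefix property keeps the result decodable.
        return "".join(code[s] for s in message)

    def decode(bits, code):
        inverse = {w: s for s, w in code.items()}
        out, word = [], ""
        for b in bits:
            word += b
            if word in inverse:      # a complete codeword = a leaf of the tree
                out.append(inverse[word])
                word = ""
        return out

    code = {"a": "000", "b": "001", "c": "01", "d": "1"}   # the code from the example
    bits = encode("badd", code)
    print(bits, decode(bits, code))   # 00100011 ['b', 'a', 'd', 'd']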

Huffman codes are optimal
Theorem: The Huffman algorithm generates an optimal prefix code. In other words, it achieves the minimum average number of bits per letter of any prefix code.
Proof: By induction.
  Base case: trivial (one bit is optimal).
  Inductive assumption: the method is optimal for all alphabets of size k−1.

Lemmas
• L1: Let pᵢ be the smallest probability in the code; then l(cᵢ) is the largest, and cᵢ is a leaf of the tree. (Let its parent be u.)
• L2: If pⱼ is the second smallest probability in the code, then cⱼ is the other child of u in the optimal code.
• L3: There is an optimal prefix code, with corresponding tree T*, in which the two lowest-frequency letters are siblings.

Proof (continued)
• Let y* and z* be the two lowest-frequency letters, merged into w*. Let T be the tree before merging and T’ the tree after merging.
• Then ABL(T’) = ABL(T) − p(w*).
• T’ is optimal by induction.

Proof (continued)
• Let Z be a tree better than the tree T produced by Huffman’s algorithm, i.e. ABL(Z) < ABL(T).
• By lemma L3, there is such a tree Z’ in which the leaves representing y* and z* are siblings (and it has the same ABL as Z).
• By the previous slide, merging y* and z* in Z’ gives a tree with average length ABL(Z’) − p(w*) < ABL(T) − p(w*) = ABL(T’).
• Contradiction, since T’ is optimal for the smaller alphabet!

Adaptive Huffman Codes
Huffman codes can be made adaptive without completely recalculating the tree on each step.
• Can account for changing probabilities
• Small changes in probability typically make small changes to the Huffman tree
• Used frequently in practice

Huffman Coding Disadvantages
• Integral number of bits in each code.
• If the entropy of a given character is 2.2 bits, the Huffman code for that character must be either 2 or 3 bits, not 2.2.

Towards Arithmetic Coding
An example: consider sending a message of length 1000 in which each symbol has probability .999.
• Self information of each message: −log(.999) = .00144 bits
• Sum of self information = 1.4 bits
• Huffman coding will take at least 1k bits
• Arithmetic coding ≈ 3 bits!

Arithmetic Coding: Introduction
Allows “blending” of bits in a message sequence.
Can bound the total bits required based on the sum of self information:

  l ≤ 2 + Σ_{i=1..n} sᵢ

Used in PPM, JPEG/MPEG (as option), DMM.
More expensive than Huffman coding, but integer implementation is not too bad.

Arithmetic Coding (message intervals)
Assign the probability distribution to the interval from 0 (inclusive) to 1 (exclusive): each message gets a subinterval whose size is its probability, starting at its cumulative probability

  f(i) = Σ_{j=1..i−1} p(j)

e.g. for p(a) = .2, p(b) = .5, p(c) = .3:
  f(a) = .0, f(b) = .2, f(c) = .7, so a ↦ [0, .2), b ↦ [.2, .7), c ↦ [.7, 1.0)
The interval for a particular message is called the message interval (e.g. for b the interval is [.2, .7)).

Arithmetic Coding (sequence intervals)
To code a message sequence, use the following recurrences:

  l₁ = f₁              s₁ = p₁
  lᵢ = lᵢ₋₁ + sᵢ₋₁ fᵢ   sᵢ = sᵢ₋₁ pᵢ

Each message narrows the interval by a factor of pᵢ. Final interval size:

  sₙ = Π_{i=1..n} pᵢ

The interval for a message sequence is called the sequence interval.
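A direct transcription of the recurrences into Python (the function name is ours); it reproduces the [.27, .3) interval of the encoding example on the next slide, up to floating-point rounding:

    def sequence_interval(message, p, f):
        # l_i = l_{i-1} + s_{i-1} * f_i ,  s_i = s_{i-1} * p_i
        l, s = 0.0, 1.0
        for m in message:
            l, s = l + s * f[m], s * p[m]
        return l, l + s            # the sequence interval [l, l + s)

    p = {"a": 0.2, "b": 0.5, "c": 0.3}
    f = {"a": 0.0, "b": 0.2, "c": 0.7}   # cumulative probabilities
    print(sequence_interval("bac", p, f))   # approximately (0.27, 0.30)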

Arithmetic Coding: Encoding Example
Coding the message sequence bac (with p(a) = .2, p(b) = .5, p(c) = .3):
  Start with [0, 1)
  b narrows it to [0.2, 0.7)
  a narrows it to [0.2, 0.3)
  c narrows it to [0.27, 0.3)
The final interval is [.27, .3).

Uniquely defining an interval
Important property: the sequence intervals for distinct message sequences of length n never overlap.
Therefore, specifying any number in the final interval uniquely determines the sequence.
Decoding is similar to encoding, but on each step we need to determine what the message value is and then reduce the interval.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
  .49 lies in b’s interval [0.2, 0.7)                      → first message is b
  within [0.2, 0.7), .49 lies in b’s subinterval [0.3, 0.55)   → second message is b
  within [0.3, 0.55), .49 lies in c’s subinterval [0.475, 0.55) → third message is c
The message is bbc.

RealArith Encoding and Decoding
RealArithEncode:
• Determine l and s using the original recurrences
• Code using l + s/2 truncated to 1 + ⌈−log s⌉ bits
RealArithDecode:
• Read bits as needed so the code interval falls within a message interval, then narrow the sequence interval
• Repeat until n messages have been decoded
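A sketch of the decoding idea in Python (not the exact RealArithDecode procedure: instead of narrowing the sequence interval it rescales the number back to [0, 1) after each step, which is equivalent for this example; names are ours):

    def decode(x, n, p, f):
        order = sorted(p, key=lambda m: f[m])
        out = []
        for _ in range(n):
            for m in reversed(order):        # find the message interval containing x
                if x >= f[m]:
                    out.append(m)
                    x = (x - f[m]) / p[m]    # rescale and decode the next message
                    break
        return "".join(out)

    p = {"a": 0.2, "b": 0.5, "c": 0.3}
    f = {"a": 0.0, "b": 0.2, "c": 0.7}
    print(decode(0.49, 3, p, f))   # bbc, as in the example above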

Bound on Length
Theorem: For n messages with self information {s1, …, sn}, RealArithEncode will generate at most 2 + Σ_{i=1..n} sᵢ bits.
Proof:

  1 + ⌈−log s⌉ = 1 + ⌈−log Π_{i=1..n} pᵢ⌉
               = 1 + ⌈Σ_{i=1..n} −log pᵢ⌉
               ≤ 2 + Σ_{i=1..n} −log pᵢ
               = 2 + Σ_{i=1..n} sᵢ

Applications of Probability Coding
How do we generate the probabilities? Using character frequencies directly does not work very well (e.g. 4.5 bits/char for text).
Technique 1: transforming the data
• Run length coding (ITU fax standard)
• Move-to-front coding (used in Burrows-Wheeler)
• Residual coding (JPEG-LS)
Technique 2: using conditional probabilities
• Fixed context (JBIG … almost)
• Partial matching (PPM)

Run Length Coding
Code by specifying a message value followed by the number of times it repeats,
e.g. abbbaacccca ⇒ (a,1), (b,3), (a,2), (c,4), (a,1)
The characters and counts can then be coded based on frequency.
This allows for a small number of bits of overhead for low counts such as 1. (A sketch follows below.)

Facsimile ITU T4 (Group 3)
Standard used by all home fax machines (ITU = International Telecommunication Union).
Run length encodes sequences of black + white pixels, with a fixed Huffman code for all documents, e.g.

  Run length   White    Black
  1            000111   010
  2            0111     11
  10           00111    0000100

Since black and white runs alternate, there is no need to code the values themselves.
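A one-line version of the transform in Python, using the standard library (the function name is ours):

    from itertools import groupby

    def run_length_encode(s):
        # Replace each run of equal values with (value, run length).
        return [(ch, len(list(run))) for ch, run in groupby(s)]

    print(run_length_encode("abbbaacccca"))
    # [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]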

Move to Front Coding
Transforms the message sequence into a sequence of integers that can then be probability coded.
Start with the values in a total order, e.g. [a, b, c, d, e, …].
For each message, output its position in the order and then move it to the front of the order, e.g.
  c ⇒ output 3, new order [c, a, b, d, e, …]
  a ⇒ output 2, new order [a, c, b, d, e, …]
Codes well if there are concentrations of message values in the message sequence.

Residual Coding
Used for message values with a meaningful order, e.g. integers or floats.
Basic idea: guess the next value based on the current context, output the difference between the guess and the actual value, and use a probability code on the output.
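A short sketch of the move-to-front transform (names are ours); positions are 1-based to match the example above:

    def move_to_front_encode(message, alphabet):
        order = list(alphabet)
        out = []
        for m in message:
            i = order.index(m)
            out.append(i + 1)                 # output the current position
            order.insert(0, order.pop(i))     # move the value to the front
        return out

    print(move_to_front_encode("ca", "abcde"))   # [3, 2], as in the example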

JPEG-LS
JPEG lossless (not to be confused with lossless JPEG). Just completed the standardization process.
Codes in raster order, using 4 pixels as context:

  NW  N  NE
   W  *

Tries to guess the value of * based on W, NW, N and NE. Works in two stages.

JPEG-LS: Stage 1
Uses the following prediction:

  P = min(N, W)    if NW ≥ max(N, W)
      max(N, W)    if NW ≤ min(N, W)
      N + W − NW   otherwise

This averages the neighbors and captures edges, e.g. when applied to small neighborhoods of pixel values such as 3, 20, 30, 40 (a sketch with illustrative values follows below).
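The stage-1 predictor written out in Python; the test values below are illustrative, not the slide’s own examples:

    def predict(N, W, NW):
        # Median-edge predictor from the equation above.
        if NW >= max(N, W):
            return min(N, W)
        if NW <= min(N, W):
            return max(N, W)
        return N + W - NW

    print(predict(N=3, W=20, NW=40))    # 3  (NW at least both neighbors: take the min)
    print(predict(N=40, W=40, NW=30))   # 40 (NW at most both neighbors: take the max)
    print(predict(N=6, W=10, NW=8))     # 8  (smooth region: planar prediction N + W - NW)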

JPEG-LS: Stage 2
• Uses 3 gradients: W−NW, NW−N, N−NE
• Classifies each into one of 9 categories.
• This gives 9³ = 729 contexts, of which only 365 are needed because of symmetry.
• Each context has a bias term that is used to adjust the previous prediction.
• After correction, the residual between the guessed and actual value is found and coded using a Golomb-like code.

Using Conditional Probabilities: PPM
Use the previous k characters as the context.
Base probabilities on counts: e.g. if th has been seen 12 times, followed by e 7 of those times, then the conditional probability p(e|th) = 7/12.
Need to keep k small so that the dictionary does not get too large.
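A sketch of how such conditional counts can be gathered (the names and the toy string are ours):

    from collections import defaultdict, Counter

    def context_counts(text, k=2):
        # For each k-character context, count which characters follow it.
        counts = defaultdict(Counter)
        for i in range(len(text) - k):
            counts[text[i:i + k]][text[i + k]] += 1
        return counts

    counts = context_counts("the other brother", k=2)
    c = counts["th"]
    print(c["e"] / sum(c.values()))   # estimate of p(e | th) from this toy string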

Ideas in Lossless Compression
Ideas we did not talk about specifically:
• Lempel-Ziv (gzip)
  – Tries to guess the next window from previous data
• Burrows-Wheeler (bzip)
  – Context-sensitive sorting
  – Block sorting transform

LZ77: Sliding Window Lempel-Ziv
The cursor splits the data (e.g. a a c a a c a b c a b a b a c) into a dictionary of previously coded data on the left and a lookahead buffer on the right. The dictionary and buffer “windows” are fixed length and slide with the cursor.
On each step:
• Output (p, l, c), where
  – p = relative position of the longest match in the dictionary
  – l = length of the longest match
  – c = next char in the buffer beyond the longest match
• Advance the window by l + 1
(A sketch follows below.)
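A simplified LZ77 encoder in Python along the lines described above (parameter choices and names are ours; real implementations use much larger windows and faster match search):

    def lz77_encode(data, window=8, lookahead=8):
        # Emit (p, l, c): match position back from the cursor, match length, next char.
        i, out = 0, []
        while i < len(data):
            start = max(0, i - window)
            best_p, best_l = 0, 0
            for p in range(start, i):                      # search the dictionary
                l = 0
                while (l < lookahead and i + l < len(data) - 1
                       and data[p + l] == data[i + l]):
                    l += 1
                if l > best_l:
                    best_p, best_l = i - p, l
            c = data[i + best_l]
            out.append((best_p, best_l, c))
            i += best_l + 1                                # advance by l + 1
        return out

    print(lz77_encode("aacaacabcababac"))
    # starts with (0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), ...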

Lossy Compression

Scalar Quantization
• Given a camera image with 12-bit color, make it 4-bit grey scale.
• Uniform vs. non-uniform quantization
  – The eye is more sensitive to low values of red than to high values.

Vector Quantization
• How do we compress a color image (r, g, b)?
  – Find k representative points for all colors
  – For every pixel, output the nearest representative
  – If the points are clustered around the representatives, the residuals are small and hence probability coding will work well

Transform Coding
• Transform the input into another space.
• One form of transform is to choose a set of basis functions.
• JPEG and MPEG both use this idea.

Other Transform codes
• Wavelets
• Fractal-based compression
  – Based on the idea of fixed points of functions.
