Analysis of Algorithms
Piyush Kumar (Lecture 5: Compression)
Welcome to 4531
Source: Guy E. Blelloch, Emad, Tseng …

Compression Programs
• File Compression: Gzip, Bzip
• Archivers: Arc, Pkzip, Winrar, …
• File Systems: NTFS

Outline
• Introduction: Lossy vs. Lossless
• Information Theory: Entropy, etc.
• Probability Coding: Huffman + Arithmetic coding

Multimedia Compression
• HDTV (MPEG-4)
• Sound (MP3)
• Images (JPEG)

Encoding/Decoding
We will use “message” in the generic sense to mean the data to be compressed.

  Input Message → Encoder → Compressed Message → Decoder → Output Message

CODEC: the encoder and decoder need to understand a common compressed format.

Lossless vs. Lossy
Lossless: Input message = Output message
Lossy: Input message ≠ Output message
Lossy does not necessarily mean loss of quality. In fact the output could be “better” than the input:
– Drop random noise in images (dust on lens)
– Drop background in music
– Fix spelling errors in text; put it into better form.
Writing is the art of lossy text compression.

How much can we compress?
For lossless compression, assuming all input messages are valid: if even one string is compressed, some other string must expand.

Techniques
• LZW (Lempel-Ziv-Welch) compression
  – Build a dictionary
  – Replace patterns with an index into the dictionary
• Burrows-Wheeler transform
  – Block-sort data to improve compression
• Run length encoding
  – Find & compress repetitive sequences
• Huffman coding
  – Use variable-length codes based on frequency

Quality of Compression
Runtime vs. Compression vs. Generality
Several corpora are used to compare algorithms, e.g. the Calgary Corpus:
• 2 books, 5 papers, 1 bibliography, 1 collection of news articles, 3 programs, 1 terminal session, 2 object files, 1 geophysical data set, 1 bitmap b/w image
The Archive Comparison Test maintains a comparison of just about all publicly available algorithms.

Model vs. Coder
To compress we need a bias on the probability of messages. The model determines this bias.

  Encoder:  Messages → Model → Probs. → Coder → Bits

Example models:
– Simple: character counts, repeated strings
– Complex: models of a human face

Comparison of Algorithms

  Program  Algorithm  Time    BPC   Score
  BOA      PPM Var.   94+97   1.91  407
  PPMD     PPM        11+20   2.07  265
  IMP      BW         10+3    2.14  254
  BZIP     BW         20+6    2.19  273
  GZIP     LZ77 Var.  19+5    2.59  318
  LZ77     LZ77       ?       3.94  ?

Information Theory
An interface between modeling and coding
• Entropy
  – A measure of information content
• Entropy of the English Language
  – How much information does each character in “typical” English text contain?

Entropy (Shannon 1948)
For a set of messages S with probability p(s), s ∈ S, the self information of s is:

  i(s) = log(1/p(s)) = −log p(s)

Measured in bits if the log is base 2. The lower the probability, the higher the information.
Entropy is the weighted average of self information:

  H(S) = Σ_{s∈S} p(s) log(1/p(s))

Entropy Example
  p(S) = {.25, .25, .25, .125, .125}        H(S) = 3 × .25 log 4 + 2 × .125 log 8 = 2.25
  p(S) = {.5, .125, .125, .125, .125}       H(S) = .5 log 2 + 4 × .125 log 8 = 2
  p(S) = {.75, .0625, .0625, .0625, .0625}  H(S) = .75 log(4/3) + 4 × .0625 log 16 ≈ 1.3
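As a quick check of the definition, here is a minimal Python sketch (the helper name entropy is ours, not from the lecture) that reproduces the three example values:

    import math

    def entropy(probs):
        # Weighted average of self information, in bits (log base 2).
        return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

    print(entropy([.25, .25, .25, .125, .125]))        # 2.25
    print(entropy([.5, .125, .125, .125, .125]))       # 2.0
    print(entropy([.75, .0625, .0625, .0625, .0625]))  # ~1.31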

Entropy of the English Language
How can we measure the information per character?
  ASCII code = 7
  Entropy = 4.5 (based on character probabilities)
  Huffman codes (average) = 4.7
  Unix Compress = 3.5
  Gzip = 2.5
  BOA = 1.9 (currently close to the best text compressors)
So the entropy must be less than 1.9.

Shannon’s experiment
Shannon asked humans to predict the next character given the whole previous text. He used these as conditional probabilities to estimate the entropy of the English language.
The number of guesses required for the right answer:

  # of guesses   1    2    3    4    5    > 5
  Probability   .79  .08  .03  .02  .02  .05

From the experiment he predicted H(English) = .6 – 1.3.

Data Compression Model

  Input data
    → Reduce Data Redundancy
    → Reduction of Entropy
    → Compressed Data

Coding
How do we use the probabilities to code messages?
• Prefix codes and their relationship to entropy
• Huffman codes
• Arithmetic codes
• Implicit probability codes…

Assumptions
Communication (or a file) is broken up into pieces called messages.
Adjacent messages might be of different types and come from different probability distributions.
We will consider two types of coding:
• Discrete: each message is a fixed set of bits
  – Huffman coding, Shannon-Fano coding
• Blended: bits can be “shared” among messages
  – Arithmetic coding

Uniquely Decodable Codes
A variable length code assigns a bit string (codeword) of variable length to every message value,
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence of bits 1011? Is it aba, ca, or ad?
A uniquely decodable code is a variable length code in which bit strings can always be uniquely decomposed into codewords.

Prefix Codes
A prefix code is a variable length code in which no codeword is a prefix of another codeword,
e.g. a = 0, b = 110, c = 111, d = 10
It can be viewed as a binary tree with message values at the leaves and 0s or 1s on the edges (in the tree for this code, a is at depth 1, d at depth 2, and b and c are siblings at depth 3).

Some Prefix Codes for Integers

  n   Binary   Unary     Split
  1   ..001    0         1|
  2   ..010    10        10|0
  3   ..011    110       10|1
  4   ..100    1110      110|00
  5   ..101    11110     110|01
  6   ..110    111110    110|10

Many other fixed prefix codes exist: Golomb, phased-binary, subexponential, …

Average Bit Length
For a code C with associated probabilities p(c), the average bit length is defined as

  ABL(C) = Σ_{c∈C} p(c) l(c)

We say that a prefix code C is optimal if for all prefix codes C’, ABL(C) ≤ ABL(C’).

Relationship to Entropy
Theorem (lower bound): For any probability distribution p(S) with associated uniquely decodable code C,
  H(S) ≤ ABL(C)
Theorem (upper bound): For any probability distribution p(S) with associated optimal prefix code C,
  ABL(C) ≤ H(S) + 1
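A small sketch of the ABL computation (helper names are ours), using the prefix code a = 000, b = 001, c = 01, d = 1 that appears later in the Huffman example; it also checks the two bounds numerically:

    import math

    def abl(code):
        # Average bit length: sum of p(c) * l(c) over the codewords.
        return sum(p * len(word) for word, p in code.values())

    code = {"a": ("000", 0.1), "b": ("001", 0.2), "c": ("01", 0.2), "d": ("1", 0.5)}
    H = sum(p * math.log2(1 / p) for _, p in code.values())
    print(H, abl(code))   # ~1.76 and 1.8, so H(S) <= ABL(C) <= H(S) + 1 holds here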

Kraft-McMillan Inequality
Theorem (Kraft-McMillan): For any uniquely decodable code C,
  Σ_{c∈C} 2^(−l(c)) ≤ 1
Also, for any set of lengths L such that Σ_{l∈L} 2^(−l) ≤ 1, there is a prefix code C such that l(cᵢ) = lᵢ (i = 1, …, |L|).

Proof of the Upper Bound (Part 1)
Assign to each message a length l(s) = ⌈log(1/p(s))⌉. We then have

  Σ_{s∈S} 2^(−l(s)) = Σ_{s∈S} 2^(−⌈log(1/p(s))⌉)
                    ≤ Σ_{s∈S} 2^(−log(1/p(s)))
                    = Σ_{s∈S} p(s)
                    = 1

So by the Kraft-McMillan inequality there is a prefix code with lengths l(s).
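A tiny numerical illustration of the inequality (the function name is ours). Note that satisfying the inequality does not by itself make a particular code uniquely decodable; it only guarantees that some prefix code with those lengths exists:

    def kraft_sum(lengths):
        # Kraft-McMillan sum: sum of 2^(-l) over the codeword lengths.
        return sum(2.0 ** -l for l in lengths)

    # Lengths of the prefix code a = 0, b = 110, c = 111, d = 10:
    print(kraft_sum([1, 3, 3, 2]))   # 0.5 + 0.125 + 0.125 + 0.25 = 1.0 <= 1
    # Lengths of the non-uniquely-decodable code a = 1, b = 01, c = 101, d = 011:
    print(kraft_sum([1, 2, 3, 3]))   # also 1.0: the condition is necessary, not sufficient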

Proof of the Upper Bound (Part 2)
Now we can calculate the average length given l(s):

  ABL(S) = Σ_{s∈S} p(s) l(s)
         = Σ_{s∈S} p(s) ⌈log(1/p(s))⌉
         ≤ Σ_{s∈S} p(s) (1 + log(1/p(s)))
         = 1 + Σ_{s∈S} p(s) log(1/p(s))
         = 1 + H(S)

And we are done.

Another property of optimal codes
Theorem: If C is an optimal prefix code for the probabilities {p1, …, pn}, then pᵢ > pⱼ implies l(cᵢ) ≤ l(cⱼ).
Proof (by contradiction): Assume l(cᵢ) > l(cⱼ). Consider switching codewords cᵢ and cⱼ. If lₐ is the average length of the original code, the length of the new code is

  lₐ’ = lₐ + pⱼ(l(cᵢ) − l(cⱼ)) + pᵢ(l(cⱼ) − l(cᵢ))
      = lₐ + (pⱼ − pᵢ)(l(cᵢ) − l(cⱼ))
      < lₐ

This is a contradiction, since lₐ was supposed to be optimal.

Corollary
• If pᵢ is the smallest probability in the code, then l(cᵢ) is the largest.

Huffman Coding
Binary trees for compression

Huffman Codes
Invented by Huffman as a class assignment in 1950.
Used in many, if not most, compression algorithms:
• gzip, bzip, (as option), fax compression, …
Properties:
– Generates optimal prefix codes
– Cheap to generate codes
– Cheap to encode and decode
– lₐ = H if probabilities are powers of 2

Huffman Code
• Approach
  – Variable length encoding of symbols
  – Exploit statistical frequency of symbols
  – Efficient when symbol probabilities vary widely
• Principle
  – Use fewer bits to represent frequent symbols
  – Use more bits to represent infrequent symbols

Huffman Code Example

  Symbol              Dog     Cat     Bird    Fish
  Frequency           1/8     1/4     1/2     1/8
  Original encoding   00      01      10      11
                      2 bits  2 bits  2 bits  2 bits
  Huffman encoding    110     10      0       111
                      3 bits  2 bits  1 bit   3 bits

Expected size:
– Original: 1/8·2 + 1/4·2 + 1/2·2 + 1/8·2 = 2 bits/symbol
– Huffman: 1/8·3 + 1/4·2 + 1/2·1 + 1/8·3 = 1.75 bits/symbol

Huffman Algorithm
• Start with a forest of trees, each consisting of a single vertex corresponding to a message s and with weight p(s)
• Repeat:
  – Select the two trees with minimum weight roots p1 and p2
  – Join them into a single tree by adding a root with weight p1 + p2
(a sketch of this algorithm follows below)
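A minimal Python sketch of the algorithm above, assuming the frequencies are given as a dictionary (function and variable names are ours). Exact codewords depend on how ties are broken, but the code lengths and the 1.75 bits/symbol average match the example:

    import heapq
    from itertools import count

    def huffman_code(freqs):
        # Forest of single-node trees; each heap entry is (weight, tie-breaker, codes).
        tie = count()
        heap = [(p, next(tie), {sym: ""}) for sym, p in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, code1 = heapq.heappop(heap)   # two minimum-weight roots
            p2, _, code2 = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in code1.items()}
            merged.update({s: "1" + c for s, c in code2.items()})
            heapq.heappush(heap, (p1 + p2, next(tie), merged))  # new root, weight p1 + p2
        return heap[0][2]

    freqs = {"Dog": 1/8, "Cat": 1/4, "Bird": 1/2, "Fish": 1/8}
    code = huffman_code(freqs)
    print(code)
    print(sum(p * len(code[s]) for s, p in freqs.items()))   # 1.75 bits/symbol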

Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
  Step 1: merge a(.1) and b(.2) into a tree of weight (.3)
  Step 2: merge (.3) and c(.2) into a tree of weight (.5)
  Step 3: merge (.5) and d(.5) into the final tree of weight (1.0)
Resulting code: a = 000, b = 001, c = 01, d = 1

Encoding and Decoding
Encoding: Start at the leaf of the Huffman tree and follow the path to the root. Reverse the order of the bits and send.
Decoding: Start at the root of the Huffman tree and take a branch for each bit received. When at a leaf, output the message and return to the root.
There are even faster methods that can process 8 or 32 bits at a time.
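The table-driven decoder below is equivalent to the root-to-leaf tree walk described above: accumulating bits until they spell a complete codeword is the same as reaching a leaf (helper names are ours):

    def encode(message, code):
        # Concatenate codewords; the prefix property keeps the result decodable.
        return "".join(code[s] for s in message)

    def decode(bits, code):
        inverse = {w: s for s, w in code.items()}
        out, word = [], ""
        for b in bits:
            word += b
            if word in inverse:      # a complete codeword = a leaf of the tree
                out.append(inverse[word])
                word = ""
        return out

    code = {"a": "000", "b": "001", "c": "01", "d": "1"}   # the code from the example
    bits = encode("badd", code)
    print(bits, decode(bits, code))   # 00100011 ['b', 'a', 'd', 'd']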

Huffman codes are optimal
Theorem: The Huffman algorithm generates an optimal prefix code. In other words, it achieves the minimum average number of bits per letter of any prefix code.
Proof: By induction.
  Base case: trivial (one bit is optimal).
  Inductive assumption: the method is optimal for all alphabets of size k−1.

Lemmas
• L1: Let pᵢ be the smallest probability in the code; then l(cᵢ) is the largest, and cᵢ is a leaf of the tree. (Let its parent be u.)
• L2: If pⱼ is the second smallest probability in the code, then cⱼ is the other child of u in the optimal code.
• L3: There is an optimal prefix code, with corresponding tree T*, in which the two lowest-frequency letters are siblings.

Proof (continued)
• Let y* and z* be the two lowest-frequency letters, merged into w*. Let T be the tree before merging and T’ the tree after merging.
• Then ABL(T’) = ABL(T) − p(w*).
• T’ is optimal by induction.

Proof (continued)
• Let Z be a tree better than the tree T produced by Huffman’s algorithm, i.e. ABL(Z) < ABL(T).
• By lemma L3, there is such a tree Z’ in which the leaves representing y* and z* are siblings (and it has the same ABL as Z).
• By the previous slide, merging y* and z* in Z’ gives a tree with average length ABL(Z’) − p(w*) < ABL(T) − p(w*) = ABL(T’).
• Contradiction, since T’ is optimal for the smaller alphabet!

Adaptive Huffman Codes
Huffman codes can be made adaptive without completely recalculating the tree on each step.
• Can account for changing probabilities
• Small changes in probability typically make small changes to the Huffman tree
• Used frequently in practice

Huffman Coding Disadvantages
• Integral number of bits in each code.
• If the entropy of a given character is 2.2 bits, the Huffman code for that character must be either 2 or 3 bits, not 2.2.

Towards Arithmetic Coding
An example: consider sending a message of length 1000 in which each symbol has probability .999.
• Self information of each message: −log(.999) = .00144 bits
• Sum of self information = 1.4 bits
• Huffman coding will take at least 1k bits
• Arithmetic coding ≈ 3 bits!

Arithmetic Coding: Introduction
Allows “blending” of bits in a message sequence.
Can bound the total bits required based on the sum of self information:

  l ≤ 2 + Σ_{i=1..n} sᵢ

Used in PPM, JPEG/MPEG (as option), DMM.
More expensive than Huffman coding, but integer implementation is not too bad.

Arithmetic Coding (message intervals)
Assign the probability distribution to the interval from 0 (inclusive) to 1 (exclusive): each message gets a subinterval whose size is its probability, starting at its cumulative probability

  f(i) = Σ_{j=1..i−1} p(j)

e.g. for p(a) = .2, p(b) = .5, p(c) = .3:
  f(a) = .0, f(b) = .2, f(c) = .7, so a ↦ [0, .2), b ↦ [.2, .7), c ↦ [.7, 1.0)
The interval for a particular message is called the message interval (e.g. for b the interval is [.2, .7)).

Arithmetic Coding (sequence intervals)
To code a message sequence, use the following recurrences:

  l₁ = f₁              s₁ = p₁
  lᵢ = lᵢ₋₁ + sᵢ₋₁ fᵢ   sᵢ = sᵢ₋₁ pᵢ

Each message narrows the interval by a factor of pᵢ. Final interval size:

  sₙ = Π_{i=1..n} pᵢ

The interval for a message sequence is called the sequence interval.
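A direct transcription of the recurrences into Python (the function name is ours); it reproduces the [.27, .3) interval of the encoding example on the next slide, up to floating-point rounding:

    def sequence_interval(message, p, f):
        # l_i = l_{i-1} + s_{i-1} * f_i ,  s_i = s_{i-1} * p_i
        l, s = 0.0, 1.0
        for m in message:
            l, s = l + s * f[m], s * p[m]
        return l, l + s            # the sequence interval [l, l + s)

    p = {"a": 0.2, "b": 0.5, "c": 0.3}
    f = {"a": 0.0, "b": 0.2, "c": 0.7}   # cumulative probabilities
    print(sequence_interval("bac", p, f))   # approximately (0.27, 0.30)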

Arithmetic Coding: Encoding Example
Coding the message sequence bac (with p(a) = .2, p(b) = .5, p(c) = .3):
  Start with [0, 1)
  b narrows it to [0.2, 0.7)
  a narrows it to [0.2, 0.3)
  c narrows it to [0.27, 0.3)
The final interval is [.27, .3).

Uniquely defining an interval
Important property: the sequence intervals for distinct message sequences of length n never overlap.
Therefore, specifying any number in the final interval uniquely determines the sequence.
Decoding is similar to encoding, but on each step we need to determine what the message value is and then reduce the interval.

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:
  .49 lies in b’s interval [0.2, 0.7)                      → first message is b
  within [0.2, 0.7), .49 lies in b’s subinterval [0.3, 0.55)   → second message is b
  within [0.3, 0.55), .49 lies in c’s subinterval [0.475, 0.55) → third message is c
The message is bbc.

RealArith Encoding and Decoding
RealArithEncode:
• Determine l and s using the original recurrences
• Code using l + s/2 truncated to 1 + ⌈−log s⌉ bits
RealArithDecode:
• Read bits as needed so the code interval falls within a message interval, then narrow the sequence interval
• Repeat until n messages have been decoded
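A sketch of the decoding idea in Python (not the exact RealArithDecode procedure: instead of narrowing the sequence interval it rescales the number back to [0, 1) after each step, which is equivalent for this example; names are ours):

    def decode(x, n, p, f):
        order = sorted(p, key=lambda m: f[m])
        out = []
        for _ in range(n):
            for m in reversed(order):        # find the message interval containing x
                if x >= f[m]:
                    out.append(m)
                    x = (x - f[m]) / p[m]    # rescale and decode the next message
                    break
        return "".join(out)

    p = {"a": 0.2, "b": 0.5, "c": 0.3}
    f = {"a": 0.0, "b": 0.2, "c": 0.7}
    print(decode(0.49, 3, p, f))   # bbc, as in the example above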

Bound on Length
Theorem: For n messages with self information {s1, …, sn}, RealArithEncode will generate at most 2 + Σ_{i=1..n} sᵢ bits.
Proof:

  1 + ⌈−log s⌉ = 1 + ⌈−log Π_{i=1..n} pᵢ⌉
               = 1 + ⌈Σ_{i=1..n} −log pᵢ⌉
               ≤ 2 + Σ_{i=1..n} −log pᵢ
               = 2 + Σ_{i=1..n} sᵢ

Applications of Probability Coding
How do we generate the probabilities? Using character frequencies directly does not work very well (e.g. 4.5 bits/char for text).
Technique 1: transforming the data
• Run length coding (ITU fax standard)
• Move-to-front coding (used in Burrows-Wheeler)
• Residual coding (JPEG-LS)
Technique 2: using conditional probabilities
• Fixed context (JBIG … almost)
• Partial matching (PPM)

Run Length Coding
Code by specifying a message value followed by the number of times it repeats,
e.g. abbbaacccca ⇒ (a,1), (b,3), (a,2), (c,4), (a,1)
The characters and counts can then be coded based on frequency.
This allows for a small number of bits of overhead for low counts such as 1. (A sketch follows below.)

Facsimile ITU T4 (Group 3)
Standard used by all home fax machines (ITU = International Telecommunication Union).
Run length encodes sequences of black + white pixels, with a fixed Huffman code for all documents, e.g.

  Run length   White    Black
  1            000111   010
  2            0111     11
  10           00111    0000100

Since black and white runs alternate, there is no need to code the values themselves.
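A one-line version of the transform in Python, using the standard library (the function name is ours):

    from itertools import groupby

    def run_length_encode(s):
        # Replace each run of equal values with (value, run length).
        return [(ch, len(list(run))) for ch, run in groupby(s)]

    print(run_length_encode("abbbaacccca"))
    # [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]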

Move to Front Coding
Transforms the message sequence into a sequence of integers that can then be probability coded.
Start with the values in a total order, e.g. [a, b, c, d, e, …].
For each message, output its position in the order and then move it to the front of the order, e.g.
  c ⇒ output 3, new order [c, a, b, d, e, …]
  a ⇒ output 2, new order [a, c, b, d, e, …]
Codes well if there are concentrations of message values in the message sequence.

Residual Coding
Used for message values with a meaningful order, e.g. integers or floats.
Basic idea: guess the next value based on the current context, output the difference between the guess and the actual value, and use a probability code on the output.
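A short sketch of the move-to-front transform (names are ours); positions are 1-based to match the example above:

    def move_to_front_encode(message, alphabet):
        order = list(alphabet)
        out = []
        for m in message:
            i = order.index(m)
            out.append(i + 1)                 # output the current position
            order.insert(0, order.pop(i))     # move the value to the front
        return out

    print(move_to_front_encode("ca", "abcde"))   # [3, 2], as in the example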

JPEG-LS
JPEG lossless (not to be confused with lossless JPEG). Just completed the standardization process.
Codes in raster order, using 4 pixels as context:

  NW  N  NE
   W  *

Tries to guess the value of * based on W, NW, N and NE. Works in two stages.

JPEG-LS: Stage 1
Uses the following prediction:

  P = min(N, W)    if NW ≥ max(N, W)
      max(N, W)    if NW ≤ min(N, W)
      N + W − NW   otherwise

This averages the neighbors and captures edges, e.g. when applied to small neighborhoods of pixel values such as 3, 20, 30, 40 (a sketch with illustrative values follows below).
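The stage-1 predictor written out in Python; the test values below are illustrative, not the slide’s own examples:

    def predict(N, W, NW):
        # Median-edge predictor from the equation above.
        if NW >= max(N, W):
            return min(N, W)
        if NW <= min(N, W):
            return max(N, W)
        return N + W - NW

    print(predict(N=3, W=20, NW=40))    # 3  (NW at least both neighbors: take the min)
    print(predict(N=40, W=40, NW=30))   # 40 (NW at most both neighbors: take the max)
    print(predict(N=6, W=10, NW=8))     # 8  (smooth region: planar prediction N + W - NW)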

JPEG-LS: Stage 2
• Uses 3 gradients: W−NW, NW−N, N−NE
• Classifies each into one of 9 categories.
• This gives 9³ = 729 contexts, of which only 365 are needed because of symmetry.
• Each context has a bias term that is used to adjust the previous prediction.
• After correction, the residual between the guessed and actual value is found and coded using a Golomb-like code.

Using Conditional Probabilities: PPM
Use the previous k characters as the context.
Base probabilities on counts: e.g. if th has been seen 12 times, followed by e 7 of those times, then the conditional probability p(e|th) = 7/12.
Need to keep k small so that the dictionary does not get too large.
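A sketch of how such conditional counts can be gathered (the names and the toy string are ours):

    from collections import defaultdict, Counter

    def context_counts(text, k=2):
        # For each k-character context, count which characters follow it.
        counts = defaultdict(Counter)
        for i in range(len(text) - k):
            counts[text[i:i + k]][text[i + k]] += 1
        return counts

    counts = context_counts("the other brother", k=2)
    c = counts["th"]
    print(c["e"] / sum(c.values()))   # estimate of p(e | th) from this toy string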

Ideas in Lossless Compression
Ideas we did not talk about specifically:
• Lempel-Ziv (gzip)
  – Tries to guess the next window from previous data
• Burrows-Wheeler (bzip)
  – Context-sensitive sorting
  – Block sorting transform

LZ77: Sliding Window Lempel-Ziv
The cursor splits the data (e.g. a a c a a c a b c a b a b a c) into a dictionary of previously coded data on the left and a lookahead buffer on the right. The dictionary and buffer “windows” are fixed length and slide with the cursor.
On each step:
• Output (p, l, c), where
  – p = relative position of the longest match in the dictionary
  – l = length of the longest match
  – c = next char in the buffer beyond the longest match
• Advance the window by l + 1
(A sketch follows below.)
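A simplified LZ77 encoder in Python along the lines described above (parameter choices and names are ours; real implementations use much larger windows and faster match search):

    def lz77_encode(data, window=8, lookahead=8):
        # Emit (p, l, c): match position back from the cursor, match length, next char.
        i, out = 0, []
        while i < len(data):
            start = max(0, i - window)
            best_p, best_l = 0, 0
            for p in range(start, i):                      # search the dictionary
                l = 0
                while (l < lookahead and i + l < len(data) - 1
                       and data[p + l] == data[i + l]):
                    l += 1
                if l > best_l:
                    best_p, best_l = i - p, l
            c = data[i + best_l]
            out.append((best_p, best_l, c))
            i += best_l + 1                                # advance by l + 1
        return out

    print(lz77_encode("aacaacabcababac"))
    # starts with (0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), ...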

Lossy Compression

Scalar Quantization
• Given a camera image with 12-bit color, make it 4-bit grey scale.
• Uniform vs. non-uniform quantization
  – The eye is more sensitive to low values of red than to high values.

Vector Quantization
• How do we compress a color image (r, g, b)?
  – Find k representative points for all colors
  – For every pixel, output the nearest representative
  – If the points are clustered around the representatives, the residuals are small and hence probability coding will work well

Transform Coding
• Transform the input into another space.
• One form of transform is to choose a set of basis functions.
• JPEG and MPEG both use this idea.

Other Transform codes
• Wavelets
• Fractal-based compression
  – Based on the idea of fixed points of functions.
