Data Compression: Huffman Coding
10.1 in Weiss (p.389)
Why compress files?
• For long-term storage (disk space is limited)
• For transferring files over the internet (bigger files take longer)
• A smaller file is more likely to fit in memory/cache

What is a file?
• C++ program code
• Executable program
• Email (text)
• HTML document
• Pictures (lossy); JPEG
• Video (lossy); MPEG
• Audio (lossy); MP3
Data Compression
original X → Encoder → compressed Y → Decoder → decompressed X’

• Lossless compression: X = X’
• Lossy compression: X != X’
• Compression ratio: |X|/|Y|
  – where |X| is the number of bits in X.
Lossy Compression
• Some data is lost, but not too much.
Standards:
• JPEG (Joint Photographic Experts Group) – stills
• MPEG (Moving Picture Experts Group) – audio and video
• MP3 (MPEG-1, Layer 3)

Lossless Compression
• No data is lost.
Standards:
• Gzip, Unix compress, zip, GIF, Morse code
• Examples:
  – Run-length Encoding (RLE)
  – Huffman Coding
RLE
• Idea: Compactly represent long ‘runs’ of the same character
• “aaarrrrr!” as ‘a’x3 ‘r’x5 then ‘!’
• Say…
  – Replace all ‘runs’ of the same character by 2 characters: 1) the character and 2) the length
  – ‘bee’ becomes ‘b’,1,‘e’,2
  – When is this good?
  – When is this really bad?
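One way to explore the two questions above is to run RLE on a few strings. The following is a minimal Python sketch, not a standard format: the function names `rle_encode` and `rle_decode` are hypothetical, and a run is stored as a (character, length) pair.

```python
from itertools import groupby

def rle_encode(text):
    """Replace each run of a repeated character by a (character, length) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    """Expand (character, length) pairs back into the original string."""
    return "".join(ch * n for ch, n in pairs)

print(rle_encode("aaarrrrr!"))  # [('a', 3), ('r', 5), ('!', 1)]
print(rle_encode("bee"))        # [('b', 1), ('e', 2)]

# Text with almost no runs: 10 pairs (20 items) to describe 11 characters.
print(len(rle_encode("hello there")))  # 10
```

With long runs the list of pairs is much shorter than the input. With few or no runs, each character costs two items instead of one, so the "compressed" form can be nearly twice as long.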
Another idea: Use fewer bits per character
ASCII = fixed 8 bits per character
Example: “hello there”
  – 11 characters * 8 bits = 88 bits
Can we encode this message using fewer bits?
• We could look JUST at the message
• There are only 6 possible letters + one space = 7 things; only need 3 bits per character

Huffman Coding
• Uses frequencies of symbols in a string to build a prefix code.
• Prefix Code – no code in our encoding is a prefix of another code.

Letter  code
a       0
b       100
c       101
d       11

• Encode: aabddcaa = could do as 16 bits (8 characters, 2 bits each)
• Huffman can do as 14 bits
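The bit counts on these two slides can be checked with a short Python sketch. The helper names (`fixed_width_bits`, `is_prefix_free`, `encode`) are hypothetical, and bit strings are kept as text of '0' and '1' for readability rather than packed into bytes.

```python
def fixed_width_bits(message):
    """Bits needed when every distinct symbol gets a code of the same length."""
    distinct = len(set(message))
    bits_per_symbol = max(1, (distinct - 1).bit_length())  # ceil(log2(distinct))
    return bits_per_symbol * len(message)

def is_prefix_free(codes):
    """True if no code word is a prefix of a different symbol's code word."""
    words = list(codes.values())
    return not any(
        i != j and b.startswith(a)
        for i, a in enumerate(words)
        for j, b in enumerate(words)
    )

def encode(message, codes):
    """Concatenate the code word of each symbol in the message."""
    return "".join(codes[ch] for ch in message)

CODES = {"a": "0", "b": "100", "c": "101", "d": "11"}

print(8 * len("hello there"))            # 88 bits as 8-bit ASCII
print(fixed_width_bits("hello there"))   # 33 bits: 7 symbols, 3 bits each
print(fixed_width_bits("aabddcaa"))      # 16 bits: 4 symbols, 2 bits each
print(is_prefix_free(CODES))             # True
print(len(encode("aabddcaa", CODES)))    # 14 bits with the prefix code
```

The variable-length code wins here because the most frequent letter, a, gets the shortest code word.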
Decoding a Prefix Code
Loop
  start at root of tree
  loop
    if bit read = 1 then go right
    else, go left
  until node is a leaf
  Report character found!
Until end of the message

Decode: 11100010100110

Letter  code
a       0
b       100
c       101
d       11
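The decoding loop can be sketched in Python as follows. This is one possible rendering of the pseudocode, assuming the tree is stored as nested dictionaries keyed by '0' (left) and '1' (right); `build_tree` and `decode` are hypothetical names.

```python
def build_tree(codes):
    """Turn a {symbol: bit string} table into a nested-dict binary tree."""
    root = {}
    for symbol, code in codes.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})  # create inner nodes on demand
        node[code[-1]] = symbol              # the last bit leads to the leaf
    return root

def decode(bits, codes):
    """Walk from the root one bit at a time; report a symbol at each leaf."""
    root = build_tree(codes)
    out = []
    node = root
    for bit in bits:
        node = node[bit]            # '1' goes right, '0' goes left
        if isinstance(node, str):   # reached a leaf
            out.append(node)
            node = root             # restart at the root for the next symbol
    return "".join(out)

CODES = {"a": "0", "b": "100", "c": "101", "d": "11"}
print(decode("11100010100110", CODES))  # dbacaada
```

Because the code is prefix-free, the decoder never needs to look ahead: the first leaf it reaches is the only possible symbol. This sketch silently ignores trailing bits that stop in the middle of a code word.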
Huffman Trees
Cost of a Huffman Tree containing n symbols

$$C(T) = p_1 r_1 + p_2 r_2 + p_3 r_3 + \dots + p_n r_n$$

Where:
$p_i$ = the probability that symbol i occurs
$r_i$ = the length of the path from the root to the node for symbol i
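The cost formula can be checked numerically. A minimal Python sketch, assuming the a/b/c/d code table from the earlier slides and the frequencies .50, .125, .125, .25 listed on the Example Cost slide; the function name `tree_cost` is hypothetical. The path length r_i equals the length of the symbol's code word.

```python
def tree_cost(probabilities, codes):
    """C(T) = sum of p_i * r_i, where r_i is the code length of symbol i."""
    return sum(p * len(codes[sym]) for sym, p in probabilities.items())

CODES = {"a": "0", "b": "100", "c": "101", "d": "11"}
PROBS = {"a": 0.5, "b": 0.125, "c": 0.125, "d": 0.25}

# .5*1 + .125*3 + .125*3 + .25*2
print(tree_cost(PROBS, CODES))  # 1.75
```

The cost is the average number of bits per symbol, so a message of m symbols drawn with these frequencies takes about 1.75 * m bits.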
Example Cost

Letter  Frequency  code
a       .50        0
b       .125       100
c       .125       101
d       .25        11

Cost: 1.75

Constructing a tree
• Determine frequency of each letter/symbol
• Place each as an unconnected leaf node
• Repeatedly merge two nodes with lowest frequency into one node with sum of frequencies
• Huffman Coding is optimal*
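The merge steps above can be sketched in Python with a priority queue from the standard `heapq` module. This version is a simplification that tracks only code lengths, not the tree itself: each heap entry carries the list of symbols under that node, and every merge pushes those symbols one level deeper. The name `huffman_code_lengths` is hypothetical.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {symbol: code length} by repeatedly merging the two lightest nodes."""
    tiebreak = count()  # unique counter so equal weights never compare the lists
    heap = [(w, next(tiebreak), [sym]) for sym, w in freqs.items()]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in freqs}
    while len(heap) > 1:
        w1, _, group1 = heapq.heappop(heap)  # lowest frequency
        w2, _, group2 = heapq.heappop(heap)  # second lowest
        # Every symbol under the new parent is one edge further from the root.
        for sym in group1 + group2:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), group1 + group2))
    return lengths

freqs = {"a": 0.5, "b": 0.125, "c": 0.125, "d": 0.25}
print(huffman_code_lengths(freqs))  # {'a': 1, 'b': 3, 'c': 3, 'd': 2}
```

The lengths match the code table on the Example Cost slide (0, 100, 101, 11). Copying the symbol lists on every merge is wasteful for large alphabets; a real implementation would build tree nodes instead.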
Constructing a tree example
• Encode “a java jar”
• 4 a’s, 2 spaces, 2 j’s, 1 v, 1 r; 10 total
• Leaves: a: .4  space: .2  j: .2  v: .1  r: .1
• Merge v (.1) and r (.1) into a node of weight .2
• Merge j (.2) and that node into a node of weight .4
• Merge space (.2) and that node into a node of weight .6
• Merge a (.4) and that node into the root, weight 1.0
• Label each left edge 0 and each right edge 1

Symbol  Frequency  code
a       .4         0
space   .2         10
j       .2         110
v       .1         1110
r       .1         1111

Cost = .4*1 + .2*2 + .2*3 + .1*4 + .1*4 = 2.2
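The resulting code table can be applied to the message directly. A short Python sketch, assuming the code words 0, 10, 110, 1110, 1111 from this example and 8-bit ASCII as the uncompressed size; the variable names are illustrative only.

```python
CODES = {"a": "0", " ": "10", "j": "110", "v": "1110", "r": "1111"}
message = "a java jar"

# Replace every character by its code word.
bits = "".join(CODES[ch] for ch in message)
ascii_bits = 8 * len(message)

print(bits)                      # 0101100111001011001111
print(len(bits))                 # 22 bits
print(len(bits) / len(message))  # 2.2 bits per character, the tree's cost
print(ascii_bits / len(bits))    # compression ratio |X|/|Y|, about 3.64
```

The 22 bits do not include the code table, which the decoder also needs, so the ratio for such a short message is optimistic.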
Run-time?
• To decode an encoded message of length n: O(n)
• To encode a message of length n, with c possible characters
  – Count frequencies: O(n)
  – Build tree: O(c log c) (with priority queue)
  – Encode: O(n)
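The three encoding phases and their costs can be seen in one end-to-end Python sketch. It is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, the tree is stored as nested (left, right) tuples, bits are kept as text, and decoding uses a lookup table instead of an explicit tree walk.

```python
import heapq
from collections import Counter
from itertools import count

def build_codes(freqs):
    """Build a prefix code from {symbol: weight}; O(c log c) heap operations."""
    if not freqs:
        return {}
    tiebreak = count()  # unique counter so equal weights never compare nodes
    heap = [(w, next(tiebreak), sym) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):        # inner node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                              # leaf: a single character
            codes[node] = prefix or "0"    # lone symbol still needs one bit
    walk(heap[0][2], "")
    return codes

def huffman_encode(message):
    freqs = Counter(message)                      # count frequencies: O(n)
    codes = build_codes(freqs)                    # build tree: O(c log c)
    bits = "".join(codes[ch] for ch in message)   # encode: O(n) lookups
    return bits, codes

def huffman_decode(bits, codes):
    """One left-to-right pass over the bits."""
    by_code = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in by_code:   # prefix-free, so the first match is the symbol
            out.append(by_code[current])
            current = ""
    return "".join(out)

bits, codes = huffman_encode("a java jar")
print(len(bits))                    # 22
print(huffman_decode(bits, codes))  # a java jar
```

Ties between equal weights may be broken differently from the slides, so individual code words can differ, but the total of 22 bits is the same for any Huffman tree built from these frequencies.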