Data Compression: Huffman Coding

10.1 in Weiss (p.389)


Why compress files?

• For long-term storage (disk space is limited)
• For transferring files over the internet (bigger files take longer)
• A smaller file is more likely to fit in memory/cache

What is a file?

• C++ program
• Executable program
• Email - text
• HTML document
• Pictures (lossy); JPEG
• Video (lossy); MPEG
• Audio (lossy); MP3


Data Compression

original X → Encoder → compressed Y → Decoder → decompressed X’

• Lossless: X = X’
• Lossy: X != X’
• Compression Ratio |X|/|Y|
– Where |X| is the # of bits in X.

Lossy Compression

• Some data is lost, but not too much.
Standards:
• JPEG (Joint Photographic Experts Group) – stills
• MPEG (Moving Picture Experts Group) – Audio and video
• MP3 (MPEG-1, Layer 3)

Lossless Compression

• No data is lost.
Standards:
• Unix compress, zip, GIF
• Examples:
– Run-length Encoding (RLE)
– Huffman Coding


RLE

• Idea: Compactly represent long ‘runs’ of the same character
• “aaarrrrr!” as ‘a’x3 ‘r’x5 then ‘!’
• Say…
– Replace all ‘runs’ of the same character by 2 characters: 1) the character and 2) the length
– ‘bee’ becomes ‘b’,1,‘e’,2
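The run-length idea can be sketched in a few lines of Python. This is only an illustration, not code from the course: the function names `rle_encode` and `rle_decode` are hypothetical, and each run is stored as a (character, length) pair instead of two raw characters.

```python
from itertools import groupby

def rle_encode(text):
    """Return one (character, run_length) pair per run of equal characters."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    """Rebuild the original string from (character, run_length) pairs."""
    return "".join(ch * length for ch, length in pairs)

print(rle_encode("aaarrrrr!"))  # [('a', 3), ('r', 5), ('!', 1)]
print(rle_encode("bee"))        # [('b', 1), ('e', 2)]
```

Note that ‘bee’ (3 characters) turns into 4 values, so input without long runs can grow under this scheme.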


RLE

– When is this good?
– When is this really bad?

Another idea: Use fewer bits per character

ASCII = fixed 8 bits per character
Example: “hello there”
– 11 characters * 8 bits = 88 bits
Can we encode this message using fewer bits?
• We could look JUST at the message
• There are only 6 possible letters + one space = 7 things; only need 3 bits each

Huffman Coding

• Uses frequencies of symbols in a string to build a prefix code.
• Prefix Code – no code in our encoding is a prefix of another code.

Letter  code
a       0
b       100
c       101
d       11

• Encode: aabddcaa – a fixed-length code needs 16 bits (2 bits per character)
• Huffman can do it in 14 bits
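The bit counts on these slides can be checked with a small Python sketch. The helper names `fixed_width_bits` and `encode` are made up for this illustration; the code table is the a/b/c/d table from the slide.

```python
# Code table from the slide: a prefix code (no code is a prefix of another).
CODE = {"a": "0", "b": "100", "c": "101", "d": "11"}

def fixed_width_bits(message):
    """Bits needed when every distinct symbol gets a code of the same length."""
    distinct = len(set(message))
    bits_per_symbol = max(1, (distinct - 1).bit_length())  # ceil(log2(distinct))
    return bits_per_symbol * len(message)

def encode(message, code=CODE):
    """Concatenate the variable-length code of each symbol."""
    return "".join(code[ch] for ch in message)

print(fixed_width_bits("hello there"))  # 33  (7 distinct symbols, 3 bits each)
print(fixed_width_bits("aabddcaa"))     # 16  (4 distinct symbols, 2 bits each)
print(encode("aabddcaa"))               # 00100111110100
print(len(encode("aabddcaa")))          # 14
```

The variable-length code wins here because ‘a’, the most frequent symbol, gets the shortest code.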


Decoding a Prefix Code

Loop
    start at root of tree
    loop
        if bit read = 1 then go right
        else, go left
    until node is a leaf
    Report character found!
Until end of the message

Decode: 11100010100110

Letter  code
a       0
b       100
c       101
d       11
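The decoding loop can be sketched in Python as follows. This is an illustration under assumptions: the tree is represented as nested dictionaries built from the code table (the names `build_tree` and `decode` are hypothetical), with key '0' meaning go left and key '1' meaning go right.

```python
CODE = {"a": "0", "b": "100", "c": "101", "d": "11"}

def build_tree(code):
    """Build the code tree: internal nodes are dicts keyed by '0'/'1', leaves are symbols."""
    root = {}
    for symbol, bits in code.items():
        node = root
        for bit in bits[:-1]:
            node = node.setdefault(bit, {})
        node[bits[-1]] = symbol
    return root

def decode(bits, root):
    """Walk from the root one bit at a time; report a symbol at each leaf, then restart."""
    out = []
    node = root
    for bit in bits:
        node = node[bit]            # '0' = go left, '1' = go right
        if isinstance(node, str):   # reached a leaf
            out.append(node)
            node = root
    return "".join(out)

print(decode("11100010100110", build_tree(CODE)))  # dbacaada
```

Each bit is looked at exactly once, so the work is proportional to the length of the encoded message.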

Huffman Trees

Cost of a Huffman Tree containing n symbols

$C(T) = p_1 r_1 + p_2 r_2 + p_3 r_3 + \dots + p_n r_n$

Where:
$p_i$ = the probability that symbol i occurs
$r_i$ = the length of the path from the root to the node for symbol i
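As a quick numerical check of the cost formula, here is a minimal Python sketch. It assumes the a/b/c/d code table used elsewhere in these slides, with frequencies .50, .125, .125 and .25; the function name `tree_cost` is hypothetical. The path length r_i is taken to be the length of the symbol's code.

```python
def tree_cost(freq, code):
    """C(T) = sum of p_i * r_i, where r_i is the code length (depth of the leaf)."""
    return sum(p * len(code[symbol]) for symbol, p in freq.items())

freq = {"a": 0.50, "b": 0.125, "c": 0.125, "d": 0.25}
code = {"a": "0", "b": "100", "c": "101", "d": "11"}

print(tree_cost(freq, code))  # 1.75
```

The result is the average number of bits per symbol, compared with 2 bits per symbol for a fixed-length code over four letters.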


Example Cost

Letter  Frequency  code
a       .50        0
b       .125       100
c       .125       101
d       .25        11

Cost: 1.75

Constructing a tree

• Determine frequency of each letter/symbol
• Place each as an unconnected leaf node
• Repeatedly merge two nodes with lowest frequency into one node with sum of frequencies
• Huffman Coding is optimal*
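The merge procedure can be sketched in Python with a binary heap as the priority queue. This is an illustrative sketch, not the textbook's implementation: the function name `huffman_codes`, the tuple-based tree representation, and the tie-breaking counter are all assumptions, and the demo string “a java jar” is used only as sample input.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(text):
    """Return {symbol: bitstring}, built by repeatedly merging the two lowest-frequency nodes."""
    freq = Counter(text)
    if not freq:
        return {}
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    # Heap entry: (frequency, tiebreak, tree); a tree is a symbol or a (left, right) pair.
    heap = [(f, next(tiebreak), symbol) for symbol, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # only one distinct symbol: give it a 1-bit code
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):      # internal node: 0 = left, 1 = right
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                            # leaf: record the symbol's code
            codes[tree] = prefix
    walk(heap[0][2], "")
    return codes

text = "a java jar"
codes = huffman_codes(text)
total_bits = sum(len(codes[ch]) for ch in text)
print(total_bits)  # 22
```

Ties between equal frequencies can be broken in more than one way, so the individual codes may differ from a hand-built tree, but the total of 22 bits (2.2 bits per character over 10 characters) is the same for every Huffman tree on this input.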


Constructing a tree example

• Encode “a java jar”
• 4 a’s, 2 spaces, 2 j’s, 1 v, 1 r; 10 total

Start with one leaf per symbol:

a: .4   space: .2   j: .2   v: .1   r: .1

Merge the two lowest-frequency nodes at each step:
• v (.1) + r (.1) → .2
• j (.2) + [v, r] (.2) → .4
• space (.2) + [j, v, r] (.4) → .6
• a (.4) + [space, j, v, r] (.6) → 1.0

Label each left edge 0 and each right edge 1, then read each code from the root down:

Letter  Frequency  code
a       .4         0
space   .2         10
j       .2         110
v       .1         1110
r       .1         1111

Cost = .4*1 + .2*2 + .2*3 + .1*4 + .1*4 = 2.2

Run-time?

• To decode an encoded message of length n: O(n)
• To encode a message of length n, with c possible characters:
– Count frequencies: O(n)
– Build tree: O(c log c) (with a priority queue)
– Encode: O(n)
