Today’s Topics

• Source Coding Techniques
• Huffman Code
• Two-pass Huffman Code
• Lempel-Ziv Encoding
• Lempel-Ziv Decoding

Mohamed Hamada
Software Engineering Lab, The University of Aizu
Email: [email protected]
URL: http://www.u-aizu.ac.jp/~hamada

Source Coding Techniques

1. Huffman Code.
2. Two-pass Huffman Code.
3. Lempel-Ziv Code.
4. Fano Code.
5. Shannon Code.
6. Arithmetic Code.

Where source coding fits in a communication system:

[Block diagram: Information Source → Source Encoder → Channel Encoder → Modulator → Channel → De-Modulator → Channel Decoder → Source Decoder → User of Information]

Source Coding Techniques

1. Huffman Code.

With the Huffman code, in the binary case, the two least probable source output symbols are joined together, resulting in a new message alphabet with one less symbol:

1. Take together the two smallest probabilities: P(i) + P(j)
2. Replace symbols i and j by a new symbol
3. Go to step 1, until only one symbol remains

Application examples: JPEG, MPEG, MP3
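The three steps above can be sketched as follows. This is a minimal illustration, not code from the lecture; the function name `huffman_codes` and the use of a heap to find the two smallest probabilities are my own choices.

```python
import heapq

def huffman_codes(probs):
    """Build a binary Huffman code by repeatedly joining the two
    least probable entries (steps 1-3 above)."""
    # Each heap entry: (probability, tie-breaker, {symbol: code-so-far})
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)  # smallest probability
        p2, _, c2 = heapq.heappop(heap)  # second smallest
        # Prefix '0' to one group's codewords and '1' to the other's,
        # then treat the merged pair as a single new symbol.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        counter += 1
        heapq.heappush(heap, (p1 + p2, counter, merged))
    return heap[0][2]

codes = huffman_codes({"s0": 0.1, "s1": 0.2, "s2": 0.4, "s3": 0.2, "s4": 0.1})
```

With the five-symbol source used in the worked example later in this lecture, the resulting average code length is 2.2 bits/symbol; the individual codewords depend on how probability ties are broken, which is exactly why more than one valid solution exists.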

1. Huffman Code.

For computer data, data reduction should be:
• lossless: no errors at reproduction
• universal: effective for different types of data

ADVANTAGES:
• uniquely decodable code
• smallest average codeword length

DISADVANTAGES:
• LARGE tables give complexity
• sensitive to channel errors

Huffman is not universal! It is only valid for one particular type of source: if the source has no known probability distribution, the Huffman code cannot be applied.

Huffman Coding: Example

Compute the Huffman code for the source shown:

Symbol s_k   Probability p_k
s0           0.1
s1           0.2
s2           0.4
s3           0.2
s4           0.1

Note that the entropy of S is

H(S) = 0.4 log2(1/0.4) + 2 × 0.2 log2(1/0.2) + 2 × 0.1 log2(1/0.1) = 2.12193 bits/symbol

Solution A, Stage I: list the symbols in order of decreasing probability:

Symbol   Stage I
s2       0.4
s1       0.2
s3       0.2
s0       0.1
s4       0.1
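The entropy value quoted above can be checked directly; this is an illustrative computation, not part of the original lecture material.

```python
from math import log2

# Probabilities of s2, s1, s3, s0, s4
p = [0.4, 0.2, 0.2, 0.1, 0.1]

# H(S) = sum of p_k * log2(1/p_k)
H = sum(pk * log2(1 / pk) for pk in p)
print(round(H, 5))  # 2.12193
```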

Solution A

Stage II (the two smallest probabilities, 0.1 + 0.1, are merged):

Symbol   Stage I   Stage II
s2       0.4       0.4
s1       0.2       0.2
s3       0.2       0.2
s0       0.1       0.2
s4       0.1

Stage III (0.2 + 0.2 merged):

Symbol   Stage I   Stage II   Stage III
s2       0.4       0.4        0.4
s1       0.2       0.2        0.4
s3       0.2       0.2        0.2
s0       0.1       0.2
s4       0.1

Solution A

Stage IV (0.2 + 0.4 merged; assign bits 0 and 1 back down the merge tree):

Symbol   Stage I   Stage II   Stage III   Stage IV   Code word
s2       0.4       0.4        0.4         0.6        00
s1       0.2       0.2        0.4         0.4        10
s3       0.2       0.2        0.2                    11
s0       0.1       0.2                               010
s4       0.1                                         011

Resulting code:

Symbol s_k   Probability p_k   Code word c_k
s0           0.1               010
s1           0.2               10
s2           0.4               00
s3           0.2               11
s4           0.1               011

H(S) = 2.12193 bits/symbol

Average code length:

L = 0.4 × 2 + 2 × 0.2 × 2 + 2 × 0.1 × 3 = 2.2

which satisfies H(S) ≤ L < H(S) + 1.

THIS IS NOT THE ONLY SOLUTION!

Another Solution B

Symbol   Stage I   Stage II   Stage III   Stage IV   Code word
s2       0.4       0.4        0.4         0.6        1
s1       0.2       0.2        0.4         0.4        01
s3       0.2       0.2        0.2                    000
s0       0.1       0.2                               0010
s4       0.1                                         0011

Resulting code:

Symbol s_k   Probability p_k   Code word c_k
s0           0.1               0010
s1           0.2               01
s2           0.4               1
s3           0.2               000
s4           0.1               0011

H(S) = 2.12193 bits/symbol

Average code length:

L = 0.4 × 1 + 0.2 × 2 + 0.2 × 3 + 2 × 0.1 × 4 = 2.2

which again satisfies H(S) ≤ L < H(S) + 1.

What is the difference between the two solutions?

• They have the same average code length.
• They differ in the variance of the codeword length:

  σ² = Σ_{k=0}^{K−1} p_k (l_k − L)²

• Solution A: σ² = 0.16
• Solution B: σ² = 1.36

Source Coding Techniques

1. Huffman Code.
2. Two-pass Huffman Code.
3. Lempel-Ziv Code.
4. Fano Code.
5. Shannon Code.
6. Arithmetic Code.
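The comparison above can be verified numerically. This is a small illustrative check, assuming the codeword lengths read off Solutions A and B.

```python
p = [0.1, 0.2, 0.4, 0.2, 0.1]   # probabilities of s0..s4
len_A = [3, 2, 2, 2, 3]         # Solution A: 010, 10, 00, 11, 011
len_B = [4, 2, 1, 3, 4]         # Solution B: 0010, 01, 1, 000, 0011

def avg_and_var(p, lens):
    """Average codeword length L and variance sum p_k*(l_k - L)^2."""
    L = sum(pk * lk for pk, lk in zip(p, lens))
    var = sum(pk * (lk - L) ** 2 for pk, lk in zip(p, lens))
    return L, var

L_A, var_A = avg_and_var(p, len_A)  # 2.2, 0.16
L_B, var_B = avg_and_var(p, len_B)  # 2.2, 1.36
```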

Source Coding Techniques

2. Two-pass Huffman Code.

This method is used when the probabilities of the symbols in the information source are unknown. So we first estimate these probabilities by counting the occurrences of each symbol in the given message, and then find the corresponding Huffman code. This can be summarized in the following two passes:

Pass 1: Measure the occurrence probability of each character in the message.
Pass 2: Build the Huffman code from these probabilities.

Example

Consider the message: M = ABABABABABACADABACADABACADABACAD

L(M) = 32
#(A) = 16   p(A) = 16/32 = 0.5
#(B) = 8    p(B) = 8/32 = 0.25
#(C) = 4    p(C) = 4/32 = 0.125
#(D) = 4    p(D) = 4/32 = 0.125
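Pass 1 on the message above can be sketched as follows; this is an illustrative snippet, not part of the original slides.

```python
from collections import Counter

M = "ABABABABABACADABACADABACADABACAD"

# Pass 1: estimate probabilities by counting symbol occurrences.
counts = Counter(M)
probs = {s: c / len(M) for s, c in counts.items()}
# probs == {'A': 0.5, 'B': 0.25, 'C': 0.125, 'D': 0.125}

# Pass 2 would feed these estimated probabilities into the
# Huffman procedure from the previous section.
```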

Source Coding Techniques

3. Lempel-Ziv Code.

Lempel-Ziv Coding

• Huffman coding requires knowledge of a probabilistic model of the source.
• This is not necessarily always feasible.
• The Lempel-Ziv code is an adaptive coding technique that does not require prior knowledge of the symbol probabilities.
• Lempel-Ziv coding is the basis of the well-known ZIP compression format.

Lempel-Ziv Coding: History

• Universal: effective for different types of data
• Lossless: no errors at reproduction

Applications: GIF, TIFF, V.42bis modem compression standard, PostScript Level 2

History:
- 1977: published by Abraham Lempel and Jacob Ziv
- 1984: LZ-Welch (LZW) published in IEEE Computer
- 1986: Sperry patent transferred to Unisys; the GIF file format required use of the LZW algorithm

Lempel-Ziv Coding Example

Input: 0 0 0 1 0 1 1 1 0 0 1 0 1 0 0 1 0 1…

Codebook
Index:        1  2
Subsequence:  0  1

Lempel-Ziv Coding Example (cont’d)

Input: 0 0 0 1 0 1 1 1 0 0 1 0 1 0 0 1 0 1…

Parsing the input into subsequences not seen before, the codebook grows one entry per step: 00, then 01, 011, 10, 010, 100, and 101.

Codebook
Index:        1  2  3   4   5    6   7    8    9
Subsequence:  0  1  00  01  011  10  010  100  101

Lempel-Ziv Coding Example (cont’d)

Input: 0 0 0 1 0 1 1 1 0 0 1 0 1 0 0 1 0 1…

Codebook
Index:           1  2  3   4   5    6   7    8    9
Subsequence:     0  1  00  01  011  10  010  100  101
Representation:        11  12  42   21  41   61   62

Each representation is the codebook index of the prefix subsequence followed by the index of the innovation symbol (bit 0 → index 1, bit 1 → index 2). For encoding, the prefix index is written in 3-bit binary followed by the raw innovation bit:

Decimal  Binary
1        001
2        010
4        100
6        110

Information bits:     0 0 0 1 0 1 1 1 0 0 1 0 1 0 0 1 0 1…
Source-encoded bits:  0010 0011 1001 0100 1000 1100 1101
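The parsing and encoding traced above can be sketched as follows. This is an illustrative snippet assuming the slide’s conventions (codebook seeded with 0 and 1 at indices 1 and 2, prefix index written in 3 bits); the function name `lz_encode` is my own.

```python
def lz_encode(bits):
    """Parse a bit string into never-seen subsequences and encode each
    as a 3-bit prefix index followed by the raw innovation bit."""
    book = {"0": 1, "1": 2}       # indices 1 and 2 seeded as on the slide
    blocks, w = [], ""
    for b in bits:
        if w + b in book:
            w += b                # known subsequence: keep extending
        else:
            # emit (index of prefix w in 3-bit binary) + innovation bit
            blocks.append(format(book[w], "03b") + b)
            book[w + b] = len(book) + 1
            w = ""                # start the next subsequence
    return blocks

print(lz_encode("000101110010100101"))
# ['0010', '0011', '1001', '0100', '1000', '1100', '1101']
```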

How Come this is Compression?!

• The hope is: if the bit sequence is long enough, eventually the fixed-length code words will be shorter than the subsequences they represent.
• When applied to English text:
  • Lempel-Ziv achieves approximately 55% compression
  • Huffman coding achieves approximately 43%

Encoding Idea: Lempel-Ziv-Welch (LZW)

Assume we have just read a segment w from the text, and a is the next symbol.

• If wa is not in the dictionary:
  • Write the index of w in the output file.
  • Add wa to the dictionary, and set w ← a.
• If wa is in the dictionary:
  • Process the next symbol with segment wa.

LZ (LZW) Encoding Example

Initial dictionary: a = 0, b = 1, c = 2
Input string: a a b a a c a b c a b

LZ encoding process:
1. Start with the basic symbol set.
2. Read symbols, extending the current segment w while wa is in the dictionary.
3. When wa is not in the dictionary, output the index of w, add wa to the dictionary, and continue with w = a.

Input read       Output  Dictionary update
aa               0       aa = 3   (aa not in dictionary: output index of a)
aab              0       ab = 4   (continue with b, store ab)
aaba             1       ba = 5   (continue with a, store ba)
aabaac           3       aac = 6  (aa in dictionary, aac not)
aabaaca          2       ca = 7
aabaacabc        4       abc = 8
aabaacabcab      7       cab = 9
(end of input)   1                (flush the final segment b)

LZ Encoder: aabaacabcab → 0 0 1 3 2 4 7 1
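The encoding trace above can be sketched as follows. This is a minimal illustration assuming the initial dictionary a = 0, b = 1, c = 2 from the slide; the function name `lzw_encode` is my own.

```python
def lzw_encode(text, alphabet):
    """LZW encoding as described above: extend segment w while wa is
    in the dictionary; otherwise output index of w and add wa."""
    d = {ch: i for i, ch in enumerate(alphabet)}
    out, w = [], ""
    for a in text:
        if w + a in d:
            w += a                 # wa in dictionary: keep reading
        else:
            out.append(d[w])       # write the index of w
            d[w + a] = len(d)      # add wa to the dictionary
            w = a                  # set w <- a
    if w:
        out.append(d[w])           # flush the final segment
    return out

print(lzw_encode("aabaacabcab", "abc"))  # [0, 0, 1, 3, 2, 4, 7, 1]
```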

LZ (LZW) Decoding Example

Initial dictionary: a = 0, b = 1, c = 2
Input string: 0 0 1 3 2 4 7 1

Each code determines an output segment; the previous segment plus the first letter of the current one is added to the dictionary.

Output string    Output  Dictionary update
a                a
aa               a       aa = 3   (output a determines the update)
aab              b       ab = 4
aabaa            aa      ba = 5
aabaac           c       aac = 6
aabaacab         ab      ca = 7
aabaacabca       ca      abc = 8
aabaacabcab      b       cab = 9

LZ Decoder: 0 0 1 3 2 4 7 1 → aabaacabcab

Exercise

1. Find the Huffman code for the following source:

Symbol   Probability
h        0.1
e        0.1
l        0.4
o        0.25
w        0.05
r        0.05
d        0.05

2. Find the LZ code for the following input:
0011001111010100010001001
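The LZW decoding trace above can be sketched as follows; an illustrative snippet (function name `lzw_decode` is my own), again assuming the initial dictionary a = 0, b = 1, c = 2.

```python
def lzw_decode(codes, alphabet):
    """LZW decoding: expand each index, adding (previous segment +
    first letter of current segment) to the dictionary."""
    d = {i: ch for i, ch in enumerate(alphabet)}
    out = d[codes[0]]
    w = out
    for c in codes[1:]:
        # If c names the entry being created this step, it must be
        # w + w[0] (this corner case does not occur in the example).
        entry = d[c] if c in d else w + w[0]
        out += entry
        d[len(d)] = w + entry[0]   # previous segment + first letter
        w = entry
    return out

print(lzw_decode([0, 0, 1, 3, 2, 4, 7, 1], "abc"))  # aabaacabcab
```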
