Lecture 17: Lempel-Ziv Coding and Summary
Total Page:16
File Type:pdf, Size:1020Kb
COMP2610/6261 - Information Theory Lecture 17: Lempel-Ziv Coding and Summary ANU Logo UseMark Guidelines Reid and Aditya Menon Research School of Computer Science The ANU logo is a contemporary The Australian National University refection of our heritage. It clearly presents our name, our shield and our motto: First to learn the nature of things. To preserve the authenticity of our brand identity, there are rules that govern how our logo is used. Preferred logo Black version Preferred logo - horizontal logo September 30th, 2014 The preferred logo should be used on a white background. This version includes black text with the crest in Deep Gold in either PMS or CMYK. Black Where colour printing is not available, the black logo can be used on a white background. Reverse Mark Reid and Aditya Menon (ANU) COMP2610/6261 - Information Theory September 30th, 2014 1 / 12 The logo can be used white reversed out of a black background, or occasionally a neutral dark background. Deep Gold Black C30 M50 Y70 K40 C0 M0 Y0 K100 PMS Metallic 8620 PMS Process Black PMS 463 Reverse version Any application of the ANU logo on a coloured background is subject to approval by the Marketing Offce, contact [email protected] LOGO USE GUIDELINES 1 THE AUSTRALIAN NATIONAL UNIVERSITY 1 Lempel-Ziv Coding 2 Compression Review Mark Reid and Aditya Menon (ANU) COMP2610/6261 - Information Theory September 30th, 2014 2 / 12 Run-length coding using (count; symbol) saves 12 bits: 111 0 111 1 111 0 111 1 |{z} |{z} |{z} |{z} |{z} |{z} |{z} |{z} 7 a 7 b 7 a 7 b Makes no probabilistic assumptions about source. Doesn't always yield shorter strings: aa bb a b a ! 10 0 10 1 01 0 01 1 01 0 (7 to 15 bits) Misses other structure: \2 repetitions of (7 as and 7 bs)" Eliminating Repetition What is a simple, short binary description of the following string? aaaaaaa bbbbbbb aaaaaaa bbbbbbb | {z } | {z } | {z } | {z } 7as 7bs 7as 7bs A simple symbol code for fa; bg, C = f0; 1g, uses 28 bits Mark Reid and Aditya Menon (ANU) COMP2610/6261 - Information Theory September 30th, 2014 3 / 12 Makes no probabilistic assumptions about source. Doesn't always yield shorter strings: aa bb a b a ! 10 0 10 1 01 0 01 1 01 0 (7 to 15 bits) Misses other structure: \2 repetitions of (7 as and 7 bs)" Eliminating Repetition What is a simple, short binary description of the following string? aaaaaaa bbbbbbb aaaaaaa bbbbbbb | {z } | {z } | {z } | {z } 7as 7bs 7as 7bs A simple symbol code for fa; bg, C = f0; 1g, uses 28 bits Run-length coding using (count; symbol) saves 12 bits: 111 0 111 1 111 0 111 1 |{z} |{z} |{z} |{z} |{z} |{z} |{z} |{z} 7 a 7 b 7 a 7 b Mark Reid and Aditya Menon (ANU) COMP2610/6261 - Information Theory September 30th, 2014 3 / 12 Eliminating Repetition What is a simple, short binary description of the following string? aaaaaaa bbbbbbb aaaaaaa bbbbbbb | {z } | {z } | {z } | {z } 7as 7bs 7as 7bs A simple symbol code for fa; bg, C = f0; 1g, uses 28 bits Run-length coding using (count; symbol) saves 12 bits: 111 0 111 1 111 0 111 1 |{z} |{z} |{z} |{z} |{z} |{z} |{z} |{z} 7 a 7 b 7 a 7 b Makes no probabilistic assumptions about source. Doesn't always yield shorter strings: aa bb a b a ! 10 0 10 1 01 0 01 1 01 0 (7 to 15 bits) Misses other structure: \2 repetitions of (7 as and 7 bs)" Mark Reid and Aditya Menon (ANU) COMP2610/6261 - Information Theory September 30th, 2014 3 / 12 1 A new a 2 A new b 3 The same1 symbol as1 symbol ago 4 The same2 symbols as3 symbols ago 5 The same 10 symbols as5 symbols ago 0001100100011011001011011001 ... Looking for Repetition Consider a sequence that starts abbababbababbab ::: We can describe each new part in terms of what we have seen so far: Mark Reid and Aditya Menon (ANU) COMP2610/6261 - Information Theory September 30th, 2014 4 / 12 2 A new b 3 The same1 symbol as1 symbol ago 4 The same2 symbols as3 symbols ago 5 The same 10 symbols as5 symbols ago 0001100100011011001011011001 ... Looking for Repetition Consider a sequence that starts abbababbababbab ::: We can describe each new part in terms of what we have seen so far: 1 A new a Mark Reid and Aditya Menon (ANU) COMP2610/6261 - Information Theory September 30th, 2014 4 / 12 3 The same1 symbol as1 symbol ago 4 The same2 symbols as3 symbols ago 5 The same 10 symbols as5 symbols ago 0001100100011011001011011001 ... Looking for Repetition Consider a sequence that starts abbababbababbab ::: We can describe each new part in terms of what we have seen so far: 1 A new a 2 A new b Mark Reid and Aditya Menon (ANU) COMP2610/6261 - Information Theory September 30th, 2014 4 / 12 4 The same2 symbols as3 symbols ago 5 The same 10 symbols as5 symbols ago 0001100100011011001011011001 ... Looking for Repetition Consider a sequence that starts abbababbababbab ::: We can describe each new part in terms of what we have seen so far: 1 A new a 2 A new b 3 The same1 symbol as1 symbol ago Mark Reid and Aditya Menon (ANU) COMP2610/6261 - Information Theory September 30th, 2014 4 / 12 5 The same 10 symbols as5 symbols ago 0001100100011011001011011001 ... Looking for Repetition Consider a sequence that starts abbababbababbab ::: We can describe each new part in terms of what we have seen so far: 1 A new a 2 A new b 3 The same1 symbol as1 symbol ago 4 The same2 symbols as3 symbols ago Mark Reid and Aditya Menon (ANU) COMP2610/6261 - Information Theory September 30th, 2014 4 / 12 0001100100011011001011011001 ... Looking for Repetition Consider a sequence that starts abbababbababbab ::: We can describe each new part in terms of what we have seen so far: 1 A new a 2 A new b 3 The same1 symbol as1 symbol ago 4 The same2 symbols as3 symbols ago 5 The same 10 symbols as5 symbols ago Mark Reid and Aditya Menon (ANU) COMP2610/6261 - Information Theory September 30th, 2014 4 / 12 Looking for Repetition Consider a sequence that starts abbababbababbab ::: We can describe each new part in terms of what we have seen so far: 1 A new a (0; a) 2 A new b (0; b) 3 The same1 symbol as1 symbol ago (1 ; 1; 1) 4 The same2 symbols as3 symbols ago (1 ; 3; 2) 5 The same 10 symbols as5 symbols ago (1 ; 5; 10) 0001100100011011001011011001 ... Mark Reid and Aditya Menon (ANU) COMP2610/6261 - Information Theory September 30th, 2014 4 / 12 1 If s = then output (0; xn) and continue; otherwise 2 Find smallest 0 ≤ i < W such that t = xn−i ::: xn−i+jt|−1xn 3 Output (1; i; jsj) 1 xn next symbol from sequence (n is total symbols read) 2 Let t last W symbols of s xn 3 If t does not appear in xn−W ::: xn−1 4 Else s s xn 2 While input sequence has more symbols: Notes: The output is converted to binary. i is represented with dlog2 W e bits. The size output jsj can be larger than W . Not very effective compression for short input sequences. Run-length encoding is essentially LZ77 with W = 1. Lempel-Ziv: Sliding Window (LZ77) LZ77(Sequence x1x2 :::, Window size W > 0) 1 Initialise s Mark Reid and Aditya Menon (ANU) COMP2610/6261 - Information Theory September 30th, 2014 5 / 12 1 If s = then output (0; xn) and continue; otherwise 2 Find smallest 0 ≤ i < W such that t = xn−i ::: xn−i+jt|−1xn 3 Output (1; i; jsj) 1 xn next symbol from sequence (n is total symbols read) 2 Let t last W symbols of s xn 3 If t does not appear in xn−W ::: xn−1 4 Else s s xn Notes: The output is converted to binary. i is represented with dlog2 W e bits. The size output jsj can be larger than W . Not very effective compression for short input sequences. Run-length encoding is essentially LZ77 with W = 1. Lempel-Ziv: Sliding Window (LZ77) LZ77(Sequence x1x2 :::, Window size W > 0) 1 Initialise s 2 While input sequence has more symbols: Mark Reid and Aditya Menon (ANU) COMP2610/6261 - Information Theory September 30th, 2014 5 / 12 1 If s = then output (0; xn) and continue; otherwise 2 Find smallest 0 ≤ i < W such that t = xn−i ::: xn−i+jt|−1xn 3 Output (1; i; jsj) 2 Let t last W symbols of s xn 3 If t does not appear in xn−W ::: xn−1 4 Else s s xn Notes: The output is converted to binary. i is represented with dlog2 W e bits. The size output jsj can be larger than W . Not very effective compression for short input sequences. Run-length encoding is essentially LZ77 with W = 1. Lempel-Ziv: Sliding Window (LZ77) LZ77(Sequence x1x2 :::, Window size W > 0) 1 Initialise s 2 While input sequence has more symbols: 1 xn next symbol from sequence (n is total symbols read) Mark Reid and Aditya Menon (ANU) COMP2610/6261 - Information Theory September 30th, 2014 5 / 12 1 If s = then output (0; xn) and continue; otherwise 2 Find smallest 0 ≤ i < W such that t = xn−i ::: xn−i+jt|−1xn 3 Output (1; i; jsj) 3 If t does not appear in xn−W ::: xn−1 4 Else s s xn Notes: The output is converted to binary.