Universal Lossless Compression: Context Tree Weighting (CTW)

Youngsuk Park

Dept. Electrical Engineering, Stanford University

Dec 9, 2014

Universal Coding with Model Classes

Traditional Shannon theory assumes a (probabilistic) model of the data is known and aims at compressing the data optimally w.r.t. that model, e.g. Huffman coding and arithmetic coding.

When the actual model is unknown, we can still compress the sequence with universal compression, e.g. Lempel-Ziv and CTW.

Isn't Lempel-Ziv good enough? Yes: it is universal w.r.t. the class of finite-state coders and converges to the entropy rate for stationary ergodic sources. No: the convergence is slow.

We explore CTW.

Universal Compression and Universal Estimation

A good compressor implies a good estimate of the source distribution.

For a given sequence $x^n$, let $P(x^n)$ be the true distribution and $P_L(x^n)$ the estimated (coding) distribution.

We know that $P_L(x^n) = 2^{-L(x^n)}/k_n$ achieves the minimum expected code length, where $k_n = \sum_{x^n} 2^{-L(x^n)} \le 1$ by the Kraft inequality. Then the expected redundancy is

$$\mathrm{Red}_n(P_L, P) = E\left[L(x^n) - \log \frac{1}{P(x^n)}\right] \qquad (1)$$
$$= -\log k_n + D(P \,\|\, P_L) \qquad (2)$$

A good compressor has a small redundancy. Therefore, a good compressor implies a good estimate of the true distribution.
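A minimal Python sketch (not from the slides; the toy source, the length function, and all variable names are illustrative assumptions) that checks the decomposition (1) = (2) numerically for an arbitrary Kraft-satisfying length function:

```python
import math
from itertools import product

# Toy i.i.d. Bernoulli(0.3) source over blocks of length 3 and an arbitrary
# integer length function satisfying Kraft; verify that (1) equals (2).
n, p = 3, 0.3
P = {x: math.prod(p if b else 1 - p for b in x) for x in product((0, 1), repeat=n)}
L = {x: sum(x) + 2 for x in P}                     # hypothetical code lengths

k_n = sum(2.0 ** -L[x] for x in P)                 # Kraft sum, <= 1
P_L = {x: 2.0 ** -L[x] / k_n for x in P}           # normalized coding distribution

red_direct = sum(P[x] * (L[x] + math.log2(P[x])) for x in P)                     # (1)
red_decomp = -math.log2(k_n) + sum(P[x] * math.log2(P[x] / P_L[x]) for x in P)   # (2)
print(round(red_direct, 10) == round(red_decomp, 10))                            # True
```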

Two Entropy Codes for Known Parameters

Huffman code: an optimal prefix code.

Arithmetic code: take $L(x^n) = \lceil \log \frac{1}{P(x^n)} \rceil + 1$ and emit the codeword $0.c = \lceil Q(x^n) \cdot 2^{L(x^n)} \rceil \cdot 2^{-L(x^n)}$, where $Q(x^n)$ is the cumulative probability of $x^n$.

[Figure: Huffman code tree and arithmetic-coding interval partition for a four-symbol example with probabilities 1/10, 2/10, 3/10, 4/10; the codeword $0.c_3$ lies in the interval of width $2^{-L_3}$ starting at the cumulative point $Q_3$, with interval endpoints $Q_0, \dots, Q_4$.]

What if the parameters are unknown?
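A minimal Python sketch of this kind of cumulative-probability code for a known i.i.d. Bernoulli model (a Shannon-Fano-Elias-style variant; the midpoint offset, the lexicographic ordering, and the names encode / p1 are illustrative assumptions, not the slides' exact construction):

```python
import math
from itertools import product

def encode(seq, p1=0.3):
    """Code a binary tuple under a known i.i.d. Bernoulli(p1) model."""
    prob = lambda x: math.prod(p1 if b else 1 - p1 for b in x)
    # point inside the cumulative interval of `seq` (mass before it + half its own mass)
    q = sum(prob(x) for x in product((0, 1), repeat=len(seq)) if x < seq) + prob(seq) / 2
    L = math.ceil(math.log2(1 / prob(seq))) + 1        # L(x^n) = ceil(log 1/P) + 1
    bits = "".join(str(int(q * 2 ** (k + 1)) % 2) for k in range(L))
    return bits                                         # first L bits of 0.q, a prefix-free codeword

print(encode((0, 1, 1, 0)))
```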

Bernoulli Model: Two-Part Code

Let $\mathcal{C} = \{P_\theta, \theta \in [0,1]\}$ with $X^n \stackrel{\mathrm{iid}}{\sim} \mathrm{Bern}(\theta)$, but we don't know $\theta$. Can we still design the compressor? Yes, with an acceptable pointwise redundancy for all $x^n$.

Note that $\min_{C \in \mathcal{C}} L_C(x^n) = \min_{\theta \in [0,1]} \log \frac{1}{P_\theta(x^n)} = n\hat{H}(x^n) = n h_2(\hat{\theta})$.

Two-part code: use $\lceil \log(n+1) \rceil$ bits to encode $n_1$, then use the code tuned to $\theta = \frac{n_1}{n}$. Then

$$\mathrm{Red}_n(L, x^n) = \frac{\log(n+1) + 2}{n} \to 0$$

Natural, but dependent on $n$ (not sequential).
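A minimal Python sketch of this two-part scheme (binary alphabet assumed; two_part_length and the example sequence are illustrative, and the payload uses ideal, non-integer code lengths):

```python
import math

def two_part_length(seq):
    n, n1 = len(seq), sum(seq)
    header = math.ceil(math.log2(n + 1))            # bits spent describing n1
    theta = n1 / n
    payload = 0.0 if n1 in (0, n) else -sum(
        math.log2(theta if b else 1 - theta) for b in seq)   # = n * h2(theta_hat)
    return header + payload, payload                # (total length, best-in-class length)

seq = (0, 1, 1, 0, 1, 0, 0, 0, 1, 0)
total, best = two_part_length(seq)
print((total - best) / len(seq))                    # pointwise redundancy, roughly log(n+1)/n
```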

Bernoulli Model: Mixture and Plug-In Codes

Better code: enumerate all sequences with the same $n_1$ (for $n_1$ away from $0$ and $n$), so $L(x^n) = \log \binom{n}{n_1}$ and

$$\mathrm{Red}_n(L, x^n) \le \frac{\log(n+1) + 2}{2n} + O\!\left(\frac{1}{n}\right) \to 0 \qquad (3)$$

Uniform assignment over types is equivalent to a uniform mixture: $P_L(x^n) = \int_0^1 \theta^{n_0} \bar{\theta}^{n_1} \cdot 1 \, d\theta = \left[\binom{n}{n_1}(n+1)\right]^{-1} = \prod_{t=0}^{n-1} p_L(x_{t+1}|x^t)$, where $p_L(0|x^t) = \frac{n_0(x^t)+1}{t+2}$ is the rule of succession.

Plug-in approach: use this (biased) estimator of the conditional probability directly.

Mixture code: mix over the Dirichlet(1/2, 1/2) density $dw(\theta) = \frac{d\theta}{\Gamma(\frac{1}{2})^2 \sqrt{\theta(1-\theta)}}$, giving

$$\mathrm{Red}_n(L, x^n) \le \frac{\log(n+1)}{2n} + O\!\left(\frac{1}{n}\right) \to 0 \qquad (4)$$

Note that $P_L(x^n) = \int_0^1 P_\theta(x^n) \, dw(\theta) = \prod_{t=0}^{n-1} p_L(x_{t+1}|x^t)$, where $p_L(0|x^t) = \frac{n_0(x^t)+\frac{1}{2}}{t+1}$ is the KT (Krichevsky-Trofimov) estimator, depending only on $n_0, n_1$.
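A minimal Python sketch of the KT estimator as a sequential probability assignment (binary alphabet assumed; kt_probability and the example sequence are illustrative):

```python
import math

def kt_probability(seq):
    """Sequential KT mixture probability: p(0 | x^t) = (n0 + 1/2) / (t + 1)."""
    prob, n0, n1 = 1.0, 0, 0
    for b in seq:
        p0 = (n0 + 0.5) / (n0 + n1 + 1)
        prob *= p0 if b == 0 else 1.0 - p0
        n0, n1 = n0 + (b == 0), n1 + (b == 1)
    return prob

seq = (0, 1, 1, 0, 1, 0, 0, 0, 1, 0)
code_len = -math.log2(kt_probability(seq))            # ideal code length under the mixture
theta_hat = sum(seq) / len(seq)
best = -len(seq) * (theta_hat * math.log2(theta_hat)
                    + (1 - theta_hat) * math.log2(1 - theta_hat))   # n * h2(theta_hat)
print(code_len, best, (code_len - best) / len(seq))   # per-symbol redundancy, roughly log(n)/(2n)
```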

Tree Source with Known Model

If the next symbol depends on past symbols, then a tree model is a good choice. Let a binary tree have leaves $S$; then $\theta \in \mathbb{R}^{|S|}$.

For a fixed $x^n$, define the number of 0's that follow context $s$ as $n_0(s)$, and similarly $n_1(s)$.

Example: for the sequence $x^n = 00\,|\,0100101100$ with $n = 10$, $n_0(10) = 2$ and $n_1(10) = 1$.

For each $s$, compute the KT estimate $P_{L,s}(n_0(s), n_1(s))$ and use $P_L(x^n) = \prod_{s \in S} P_{L,s}(n_0(s), n_1(s))$ for arithmetic coding. Then the redundancy is

$$\mathrm{Red}_n(L, x^n) = L(x^n) - \log\frac{1}{P(x^n)} \qquad (5)$$
$$< \log\frac{1}{P_L(x^n)} + 2 - \log\frac{1}{P(x^n)} \qquad (6)$$
$$= \log\frac{1}{\prod_{s\in S} P_{L,s}(n_0(s), n_1(s))} - \log\frac{1}{P(x^n)} + 2 \qquad (7)$$
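A minimal Python sketch of this known-model coding probability (assumptions: the known leaf set S is taken to be all depth-2 contexts, contexts are matched as the preceding symbols in chronological order, and kt / context_counts are illustrative names; this reproduces the slides' counts $n_0(10)=2, n_1(10)=1$ for the example sequence):

```python
import math
from collections import defaultdict

def context_counts(past, seq, contexts):
    counts = defaultdict(lambda: [0, 0])                 # s -> [n0(s), n1(s)]
    full = past + seq
    for i in range(len(past), len(full)):
        for s in contexts:                               # leaf context matching the recent past
            if full[i - len(s):i] == s:
                counts[s][full[i]] += 1
    return counts

def kt(n0, n1):
    """Closed-form KT estimate P_e(n0, n1)."""
    num = math.prod(i - 0.5 for i in range(1, n0 + 1)) * \
          math.prod(i - 0.5 for i in range(1, n1 + 1))
    return num / math.prod(range(1, n0 + n1 + 1))

past, seq = (0, 0), (0, 1, 0, 0, 1, 0, 1, 1, 0, 0)      # x^n = 00 | 0100101100
leaves = [(0, 0), (0, 1), (1, 0), (1, 1)]                # an assumed known leaf set S
counts = context_counts(past, seq, leaves)
P_L = math.prod(kt(n0, n1) for n0, n1 in counts.values())
print(dict(counts), -math.log2(P_L))                     # per-context counts and code length log 1/P_L
```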

Tree Source with Known Model

Continued.

$$\mathrm{Red}_n(L, x^n) = \log \frac{\prod_{s\in S} \theta_s^{n_0(s)} \bar{\theta}_s^{n_1(s)}}{\prod_{s\in S} P_{L,s}(n_0(s), n_1(s))} + 2 \qquad (8)$$
$$\le \sum_{s\in S}\left(\frac{1}{2} \log(n_0(s) + n_1(s)) + 1\right) + 2 \qquad (9)$$
$$\le |S|\left(\frac{1}{2} \log \frac{\sum_{s\in S}(n_0(s)+n_1(s))}{|S|} + 1\right) + 2 \qquad (10)$$
$$= |S|\left(\frac{1}{2} \log \frac{n}{|S|} + 1\right) + 2 \qquad (11)$$

Here (9) is the per-context KT redundancy bound, and (10) follows from concavity of the logarithm (Jensen's inequality).

The $O(\log n)$ term is the cost of the unknown parameters $\theta_s$, $s \in S$.

This upper bound via the KT estimator meets the minimax redundancy lower bound, i.e., $\frac{|S|}{2}\log n + o(1)$.
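A minimal Python spot-check (illustrative, not from the slides) of the per-context bound behind (9), namely $\log \frac{\theta^{n_0}\bar{\theta}^{n_1}}{P_e(n_0,n_1)} \le \frac{1}{2}\log(n_0+n_1) + 1$ at the maximizing $\hat{\theta} = n_0/(n_0+n_1)$:

```python
import math

def kt(n0, n1):
    """Closed-form KT probability P_e(n0, n1)."""
    num = math.prod(i - 0.5 for i in range(1, n0 + 1)) * \
          math.prod(i - 0.5 for i in range(1, n1 + 1))
    return num / math.prod(range(1, n0 + n1 + 1))

for n0, n1 in [(2, 1), (5, 5), (20, 3), (0, 7)]:
    t = n0 / (n0 + n1)
    ml = t ** n0 * (1 - t) ** n1                    # maximum-likelihood probability (0**0 == 1)
    lhs = math.log2(ml / kt(n0, n1))
    rhs = 0.5 * math.log2(n0 + n1) + 1
    print(n0, n1, round(lhs, 3), "<=", round(rhs, 3), lhs <= rhs)
```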

Tree Source with Unknown Model

What if we do not know the tree model either? Assume the tree has depth at most $D$.

Two-part code: use $n_T$ bits to describe the tree, then use the KT estimator for that tree: $L_T(x^n) + n_T$. But this must be optimized over all trees and is not sequentially implementable.

CTW (Context Tree Weighting) defines a weighted coding distribution that takes into account all the possible tree sources that could have produced the observed sequence.

Question: the leaf set $S$ of this context tree might differ from the actual tree model. Is it still good?

Tree Source with Unknown Model

Calculate $n_0(s), n_1(s)$ for all $s \in \{s \mid s \in \{0,1\}^*, |s| \le D\}$.

Example: for $x^n = 00\,|\,0100101100$, $n_0(0) = 3$, $n_1(0) = 3$, ..., $n_0(10) = 2$, $n_1(10) = 1$.

[Figure: context tree of depth $D = 2$ with root $s = \lambda$, internal nodes $s = 0, 1$, and leaves $s = 00, 01, 10, 11$.]

Calculate the corresponding KT estimate $P_e(n_0(s), n_1(s))$ for each $s$, and assign the weighted probability

$$P_w^s = \begin{cases} P_e(n_0(s), n_1(s)) & \text{if } l(s) = D \\[4pt] \dfrac{P_e(n_0(s), n_1(s)) + P_w^{0s} P_w^{1s}}{2} & \text{if } l(s) \ne D \end{cases} \qquad (12)$$

Take $P_L(x^n) = P_w^{\lambda}(x^n)$ and use it for arithmetic coding.

Corollary: suppose $x^n$ is drawn from $P_1$ or $P_2$. Then one can achieve a redundancy of 1 bit by using the weighted distribution $P^w = \frac{P_1 + P_2}{2}$.
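A minimal, self-contained Python sketch of the CTW recursion (12) on this small example (assumptions: depth $D = 2$, a past of length at least $D$, contexts written oldest-to-newest so that the children $0s$ and $1s$ prepend the older symbol; kt, counts, and ctw are illustrative names):

```python
import math
from itertools import product

def kt(n0, n1):
    """Closed-form KT probability P_e(n0, n1)."""
    num = math.prod(i - 0.5 for i in range(1, n0 + 1)) * \
          math.prod(i - 0.5 for i in range(1, n1 + 1))
    return num / math.prod(range(1, n0 + n1 + 1))

def counts(past, seq, D):
    """n0(s), n1(s) for every context s with |s| <= D (oldest symbol first)."""
    full, c = list(past) + list(seq), {}
    for d in range(D + 1):
        for s in product((0, 1), repeat=d):
            n = [0, 0]
            for i in range(len(past), len(full)):
                if tuple(full[i - d:i]) == s:
                    n[full[i]] += 1
            c[s] = n
    return c

def ctw(past, seq, D=2):
    c = counts(past, seq, D)
    def weight(s):                                   # recursion (12)
        n0, n1 = c[s]
        if len(s) == D:
            return kt(n0, n1)
        return (kt(n0, n1) + weight((0,) + s) * weight((1,) + s)) / 2
    return weight(())                                # P_w at the root context lambda

past, seq = (0, 0), (0, 1, 0, 0, 1, 0, 1, 1, 0, 0)   # x^n = 00 | 0100101100
print(-math.log2(ctw(past, seq)))                    # CTW code length log 1/P_w^lambda(x^n)
```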

Tree Source with Unknown Model

Then

$$P_w^{\lambda}(x^n) = \sum_{S' \in \text{all } \mathcal{T}_D} 2^{-\Gamma_D(S')} \prod_{s \in S'} P_e(n_0(s), n_1(s)) \qquad (13)$$
$$\ge 2^{1-2|S|} \prod_{s \in S} P_e(n_0(s), n_1(s)) \qquad (14)$$
$$\ge 2^{1-2|S|} P_L^{\mathrm{known}}(x^n) \qquad (15)$$

The redundancy is

$$\mathrm{Red}_n(x^n) \le \mathrm{Red}_n^{\mathrm{known}}(x^n) + 2|S| - 1 \qquad (16)$$
$$= \frac{|S|}{2} \log\frac{n}{|S|} + |S| + 2 + 2|S| - 1 \qquad (17)$$

Summary

Two-part codes vs. mixture codes (and their plug-in variants).

The KT estimator is optimal in the minimax sense, and the mixture is sequentially implementable via the KT estimator.

Even though we do not know the tree model (only a depth bound is given), we can design a good compressor using CTW.

References

Willems; Shtarkov; Tjalkens (1995). The Context-Tree Weighting Method: Basic Properties. IEEE Transactions on Information Theory, 41.

Begleiter; El-Yaniv; Yona (2004). On Prediction Using Variable Order Markov Models. Journal of Artificial Intelligence Research, 22, pp. 385-421.

Additional tutorials and lecture notes.
