Universal Lossless Compression: Context Tree Weighting (CTW)
Youngsuk Park
Dept. of Electrical Engineering, Stanford University
Dec 9, 2014
Universal Coding with Model Classes

Traditional Shannon theory assumes a (probabilistic) model of the data is known, and aims at compressing the data optimally w.r.t. that model, e.g. Huffman coding and arithmetic coding.
In the case where the actual model is unknown, we can still compress the sequence with universal compression, e.g. Lempel-Ziv and CTW.
Aren't the LZ schemes good enough?
Yes: they are universal w.r.t. the class of finite-state coders, and they converge to the entropy rate for any stationary ergodic source.
No: the convergence is slow.
We explore CTW.
Universal Compression and Universal Estimation

A good compressor implies a good estimate of the source distribution.
For a given sequence x^n, let P(x^n) be the true distribution and P_L(x^n) the estimated distribution.
We know that L(x^n) = log(1/P(x^n)) achieves the minimum expected code length. Given code lengths L, define the matched distribution P_L(x^n) = 2^{-L(x^n)} / k_n, where k_n = sum_{x^n} 2^{-L(x^n)} <= 1 by the Kraft inequality. Then the expected redundancy is

  Red_n(P_L, P) = E[ L(x^n) - log(1/P(x^n)) ]        (1)
                = - log k_n + D(P || P_L)            (2)

A good compressor has a small redundancy; therefore, a good compressor implies a good estimate of the true distribution.
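As a sanity check on the redundancy expression, the following Python sketch (illustrative, not part of the original slides) evaluates the expected redundancy for an i.i.d. Bernoulli source coded with ideal lengths matched to a possibly wrong parameter; such a code is complete, so k_n = 1 and equation (2) reduces to n * D(P || P_L).

```python
import math

def kl_bern(p, q):
    """Binary divergence D(Bern(p) || Bern(q)) in bits."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

def expected_redundancy(theta, theta_hat, n):
    """Expected redundancy of ideal lengths L(x^n) = log 1/P_theta_hat(x^n)
    for an i.i.d. Bern(theta) source: here k_n = 1, so by (2)
    Red_n = n * D(P || P_L)."""
    return n * kl_bern(theta, theta_hat)

# A mismatched code pays a positive per-symbol penalty; a matched one pays none.
mismatch = expected_redundancy(0.3, 0.5, n=100)
matched = expected_redundancy(0.3, 0.3, n=100)
```

A small redundancy forces D(P || P_L) to be small, which is exactly the "good compressor implies good estimate" statement.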
Two Entropy Codings for Known Parameters

Huffman code: an optimal prefix code.
Arithmetic code: for L(x^n) = ceil( log(1/P(x^n)) ) + 1, the codeword c is the cumulative probability rounded to L(x^n) bits, i.e. 0.c = ceil( Q(x^n) * 2^{L(x^n)} ) * 2^{-L(x^n)}.
[Figure: a Huffman tree over symbol probabilities 2/10, 2/10, 3/10, 3/10 with internal weights 4/10 and 6/10, and the arithmetic-coding partition of the unit interval into cumulative intervals Q_0, ..., Q_4 with codeword interval [0.c_3, 0.c_3 + 2^{-L_3}).]
What if the parameters are unknown?
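To make the arithmetic-code length formula concrete, here is a minimal Python sketch of the Shannon-Fano-Elias construction for an i.i.d. binary source (an illustrative fragment, not from the slides, assuming the standard midpoint-truncation variant with L(x^n) = ceil(log 1/P(x^n)) + 1 bits):

```python
import math

def sfe_code(p, x):
    """Shannon-Fano-Elias codeword for a binary sequence x, i.i.d. with
    P(X=1) = p: narrow the source interval [low, low + width) symbol by
    symbol, then truncate the interval midpoint to
    L = ceil(log2(1/P(x))) + 1 bits, which always yields a prefix code."""
    low, width = 0.0, 1.0
    for b in x:                      # sequences ordered with 0-branch first
        if b == 0:
            width *= (1 - p)
        else:
            low += (1 - p) * width   # skip the 0-subinterval
            width *= p
    L = math.ceil(-math.log2(width)) + 1
    c = int((low + width / 2) * 2 ** L)  # midpoint truncated to L bits
    return format(c, 'b').zfill(L), low, width

bits, low, width = sfe_code(0.3, [1, 0, 0, 1, 0])
```

Because 2^{-L} <= P(x)/2, the truncated codeword still falls inside the source interval, which is what makes the code uniquely decodable.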
Bernoulli Model: Two-Part Code

Let C = {P_θ : θ in [0, 1]} and X^n iid ~ Bern(θ), but we don't know θ.
Can we still design the compressor? Yes, with acceptable pointwise redundancy for all x^n.
Note that min_{C in C} L_C(x^n) = min_{θ in [0,1]} log(1/P_θ(x^n)) = n Ĥ(x^n) = n h_2(θ̂).
Two-part code: use ceil(log(n+1)) bits to encode n_1, and then tune the code to θ̂ = n_1 / n. Then

  Red_n(L, x^n) = ( log(n+1) + 2 ) / n -> 0.

Natural, but dependent on n (not sequential).
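The two-part scheme can be sketched numerically as follows (illustrative Python, not from the slides; the second stage is idealized to its non-integer length n * h_2(θ̂)):

```python
import math

def h2(t):
    """Binary entropy in bits."""
    if t in (0.0, 1.0):
        return 0.0
    return -t * math.log2(t) - (1 - t) * math.log2(1 - t)

def two_part_length(x):
    """Two-part code length: ceil(log2(n+1)) bits to describe n1, then
    (idealized) n * h2(n1/n) bits for the code tuned to theta_hat = n1/n."""
    n, n1 = len(x), sum(x)
    return math.ceil(math.log2(n + 1)) + n * h2(n1 / n)

def pointwise_redundancy(x):
    """Per-symbol excess over the best code in the class,
    min over theta of log 1/P_theta(x^n) = n * h2(n1/n)."""
    n, n1 = len(x), sum(x)
    return (two_part_length(x) - n * h2(n1 / n)) / n
```

For any x^n the per-symbol redundancy is at most (log(n+1) + 2)/n, matching the bound on the slide, but the scheme needs to know n up front, which is why it is not sequential.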
Bernoulli Model: Mixture Code

Use a biased estimator of the conditional probability (plug-in approach).
A uniform assignment over types is equivalent to a uniform mixture:

  P_L(x^n) = ∫_0^1 θ^{n_1} (1-θ)^{n_0} dθ = [ (n+1) (n choose n_1) ]^{-1} = ∏_{t=0}^{n-1} p_L(x_{t+1} | x^t),

where p_L(0 | x^t) = ( n_0(x^t) + 1 ) / ( t + 2 ) is the rule of succession.
Mixture code: mix over the Dirichlet(1/2, 1/2) density dw(θ) = dθ / ( Γ(1/2)^2 √(θ(1-θ)) ).
Note that P_L(x^n) = ∫_0^1 P_θ(x^n) dw(θ) = ∏_{t=0}^{n-1} p_L(x_{t+1} | x^t), where p_L(0 | x^t) = ( n_0(x^t) + 1/2 ) / ( t + 1 ) is the KT (Krichevski-Trofimov) estimator, depending only on n_0 and n_1. Then

  Red_n(L, x^n) <= log(n+1) / (2n) + O(1/n) -> 0.   (4)