Universal Lossless Compression: Context Tree Weighting (CTW)

Youngsuk Park

Dept. Electrical Engineering, Stanford University

Dec 9, 2014

Universal Coding with Model Classes

Traditional Shannon theory assumes a (probabilistic) model of the data is known and aims at compressing the data optimally w.r.t. that model, e.g. Huffman coding and arithmetic coding.

When the actual model is unknown, we can still compress the sequence with universal compression, e.g. Lempel-Ziv and CTW.

Isn't Lempel-Ziv good enough? Yes: it is universal w.r.t. the class of finite-state coders and converges to the entropy rate for stationary ergodic sources. No: the convergence is slow.

We explore CTW.

Universal Compression and Universal Estimation

A good compressor implies a good estimate of the source distribution.

For a given sequence $x^n$, let $P(x^n)$ be the true distribution and $P_L(x^n)$ the estimated (coding) distribution.

We know that $P_L(x^n) = 2^{-L(x^n)}/k_n$ achieves the minimum expected code length, where $k_n = \sum_{x^n} 2^{-L(x^n)} \le 1$ by the Kraft inequality. Then the expected redundancy is

$$\mathrm{Red}_n(P_L, P) = E\left[L(x^n) - \log \frac{1}{P(x^n)}\right] \qquad (1)$$
$$= -\log k_n + D(P \,\|\, P_L) \qquad (2)$$

A good compressor has a small redundancy. Therefore, a good compressor implies a good estimate of the true distribution.
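A minimal Python sketch (not from the slides; the toy source, the length function, and all variable names are illustrative assumptions) that checks the decomposition (1) = (2) numerically for an arbitrary Kraft-satisfying length function:

```python
import math
from itertools import product

# Toy i.i.d. Bernoulli(0.3) source over blocks of length 3 and an arbitrary
# integer length function satisfying Kraft; verify that (1) equals (2).
n, p = 3, 0.3
P = {x: math.prod(p if b else 1 - p for b in x) for x in product((0, 1), repeat=n)}
L = {x: sum(x) + 2 for x in P}                     # hypothetical code lengths

k_n = sum(2.0 ** -L[x] for x in P)                 # Kraft sum, <= 1
P_L = {x: 2.0 ** -L[x] / k_n for x in P}           # normalized coding distribution

red_direct = sum(P[x] * (L[x] + math.log2(P[x])) for x in P)                     # (1)
red_decomp = -math.log2(k_n) + sum(P[x] * math.log2(P[x] / P_L[x]) for x in P)   # (2)
print(round(red_direct, 10) == round(red_decomp, 10))                            # True
```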

Two Entropy Codes for Known Parameters

Huffman code: an optimal prefix code.

Arithmetic code: take $L(x^n) = \lceil \log \frac{1}{P(x^n)} \rceil + 1$ and emit the codeword $0.c = \lceil Q(x^n) \cdot 2^{L(x^n)} \rceil \cdot 2^{-L(x^n)}$, where $Q(x^n)$ is the cumulative probability of $x^n$.

[Figure: Huffman code tree and arithmetic-coding interval partition for a four-symbol example with probabilities 1/10, 2/10, 3/10, 4/10; the codeword $0.c_3$ lies in the interval of width $2^{-L_3}$ starting at the cumulative point $Q_3$, with interval endpoints $Q_0, \dots, Q_4$.]

What if the parameters are unknown?
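A minimal Python sketch of this kind of cumulative-probability code for a known i.i.d. Bernoulli model (a Shannon-Fano-Elias-style variant; the midpoint offset, the lexicographic ordering, and the names encode / p1 are illustrative assumptions, not the slides' exact construction):

```python
import math
from itertools import product

def encode(seq, p1=0.3):
    """Code a binary tuple under a known i.i.d. Bernoulli(p1) model."""
    prob = lambda x: math.prod(p1 if b else 1 - p1 for b in x)
    # point inside the cumulative interval of `seq` (mass before it + half its own mass)
    q = sum(prob(x) for x in product((0, 1), repeat=len(seq)) if x < seq) + prob(seq) / 2
    L = math.ceil(math.log2(1 / prob(seq))) + 1        # L(x^n) = ceil(log 1/P) + 1
    bits = "".join(str(int(q * 2 ** (k + 1)) % 2) for k in range(L))
    return bits                                         # first L bits of 0.q, a prefix-free codeword

print(encode((0, 1, 1, 0)))
```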

Bernoulli Model: Two-Part Code

Let $\mathcal{C} = \{P_\theta, \theta \in [0,1]\}$ with $X^n \stackrel{\mathrm{iid}}{\sim} \mathrm{Bern}(\theta)$, but we don't know $\theta$. Can we still design the compressor? Yes, with an acceptable pointwise redundancy for all $x^n$.

Note that $\min_{C \in \mathcal{C}} L_C(x^n) = \min_{\theta \in [0,1]} \log \frac{1}{P_\theta(x^n)} = n\hat{H}(x^n) = n h_2(\hat{\theta})$.

Two-part code: use $\lceil \log(n+1) \rceil$ bits to encode $n_1$, then use the code tuned to $\theta = \frac{n_1}{n}$. Then

$$\mathrm{Red}_n(L, x^n) = \frac{\log(n+1) + 2}{n} \to 0$$

Natural, but dependent on $n$ (not sequential).
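A minimal Python sketch of this two-part scheme (binary alphabet assumed; two_part_length and the example sequence are illustrative, and the payload uses ideal, non-integer code lengths):

```python
import math

def two_part_length(seq):
    n, n1 = len(seq), sum(seq)
    header = math.ceil(math.log2(n + 1))            # bits spent describing n1
    theta = n1 / n
    payload = 0.0 if n1 in (0, n) else -sum(
        math.log2(theta if b else 1 - theta) for b in seq)   # = n * h2(theta_hat)
    return header + payload, payload                # (total length, best-in-class length)

seq = (0, 1, 1, 0, 1, 0, 0, 0, 1, 0)
total, best = two_part_length(seq)
print((total - best) / len(seq))                    # pointwise redundancy, roughly log(n+1)/n
```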

Bernoulli Model: Mixture and Plug-In Codes

Better code: enumerate all sequences with the same $n_1$ (for $n_1$ away from $0$ and $n$), so $L(x^n) = \log \binom{n}{n_1}$ and

$$\mathrm{Red}_n(L, x^n) \le \frac{\log(n+1) + 2}{2n} + O\!\left(\frac{1}{n}\right) \to 0 \qquad (3)$$

Uniform assignment over types is equivalent to a uniform mixture: $P_L(x^n) = \int_0^1 \theta^{n_0} \bar{\theta}^{n_1} \cdot 1 \, d\theta = \left[\binom{n}{n_1}(n+1)\right]^{-1} = \prod_{t=0}^{n-1} p_L(x_{t+1}|x^t)$, where $p_L(0|x^t) = \frac{n_0(x^t)+1}{t+2}$ is the rule of succession.

Plug-in approach: use this (biased) estimator of the conditional probability directly.

Mixture code: mix over the Dirichlet(1/2, 1/2) density $dw(\theta) = \frac{d\theta}{\Gamma(\frac{1}{2})^2 \sqrt{\theta(1-\theta)}}$, giving

$$\mathrm{Red}_n(L, x^n) \le \frac{\log(n+1)}{2n} + O\!\left(\frac{1}{n}\right) \to 0 \qquad (4)$$

Note that $P_L(x^n) = \int_0^1 P_\theta(x^n) \, dw(\theta) = \prod_{t=0}^{n-1} p_L(x_{t+1}|x^t)$, where $p_L(0|x^t) = \frac{n_0(x^t)+\frac{1}{2}}{t+1}$ is the KT (Krichevsky-Trofimov) estimator, depending only on $n_0, n_1$.
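A minimal Python sketch of the KT estimator as a sequential probability assignment (binary alphabet assumed; kt_probability and the example sequence are illustrative):

```python
import math

def kt_probability(seq):
    """Sequential KT mixture probability: p(0 | x^t) = (n0 + 1/2) / (t + 1)."""
    prob, n0, n1 = 1.0, 0, 0
    for b in seq:
        p0 = (n0 + 0.5) / (n0 + n1 + 1)
        prob *= p0 if b == 0 else 1.0 - p0
        n0, n1 = n0 + (b == 0), n1 + (b == 1)
    return prob

seq = (0, 1, 1, 0, 1, 0, 0, 0, 1, 0)
code_len = -math.log2(kt_probability(seq))            # ideal code length under the mixture
theta_hat = sum(seq) / len(seq)
best = -len(seq) * (theta_hat * math.log2(theta_hat)
                    + (1 - theta_hat) * math.log2(1 - theta_hat))   # n * h2(theta_hat)
print(code_len, best, (code_len - best) / len(seq))   # per-symbol redundancy, roughly log(n)/(2n)
```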

Tree Source with Known Model

If the next symbol depends on past symbols, then a tree model is a good choice. Let a binary tree have leaves $S$; then $\theta \in \mathbb{R}^{|S|}$.

For a fixed $x^n$, define the number of 0's that follow context $s$ as $n_0(s)$, and similarly $n_1(s)$.

Example: for the sequence $x^n = 00\,|\,0100101100$ with $n = 10$, $n_0(10) = 2$ and $n_1(10) = 1$.

For each $s$, compute the KT estimate $P_{L,s}(n_0(s), n_1(s))$ and use $P_L(x^n) = \prod_{s \in S} P_{L,s}(n_0(s), n_1(s))$ for arithmetic coding. Then the redundancy is

$$\mathrm{Red}_n(L, x^n) = L(x^n) - \log\frac{1}{P(x^n)} \qquad (5)$$
$$< \log\frac{1}{P_L(x^n)} + 2 - \log\frac{1}{P(x^n)} \qquad (6)$$
$$= \log\frac{1}{\prod_{s\in S} P_{L,s}(n_0(s), n_1(s))} - \log\frac{1}{P(x^n)} + 2 \qquad (7)$$
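A minimal Python sketch of this known-model coding probability (assumptions: the known leaf set S is taken to be all depth-2 contexts, contexts are matched as the preceding symbols in chronological order, and kt / context_counts are illustrative names; this reproduces the slides' counts $n_0(10)=2, n_1(10)=1$ for the example sequence):

```python
import math
from collections import defaultdict

def context_counts(past, seq, contexts):
    counts = defaultdict(lambda: [0, 0])                 # s -> [n0(s), n1(s)]
    full = past + seq
    for i in range(len(past), len(full)):
        for s in contexts:                               # leaf context matching the recent past
            if full[i - len(s):i] == s:
                counts[s][full[i]] += 1
    return counts

def kt(n0, n1):
    """Closed-form KT estimate P_e(n0, n1)."""
    num = math.prod(i - 0.5 for i in range(1, n0 + 1)) * \
          math.prod(i - 0.5 for i in range(1, n1 + 1))
    return num / math.prod(range(1, n0 + n1 + 1))

past, seq = (0, 0), (0, 1, 0, 0, 1, 0, 1, 1, 0, 0)      # x^n = 00 | 0100101100
leaves = [(0, 0), (0, 1), (1, 0), (1, 1)]                # an assumed known leaf set S
counts = context_counts(past, seq, leaves)
P_L = math.prod(kt(n0, n1) for n0, n1 in counts.values())
print(dict(counts), -math.log2(P_L))                     # per-context counts and code length log 1/P_L
```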

Tree Source with Known Model

Continued.

$$\mathrm{Red}_n(L, x^n) = \log \frac{\prod_{s\in S} \theta_s^{n_0(s)} \bar{\theta}_s^{n_1(s)}}{\prod_{s\in S} P_{L,s}(n_0(s), n_1(s))} + 2 \qquad (8)$$
$$\le \sum_{s\in S}\left(\frac{1}{2} \log(n_0(s) + n_1(s)) + 1\right) + 2 \qquad (9)$$
$$\le |S|\left(\frac{1}{2} \log \frac{\sum_{s\in S}(n_0(s)+n_1(s))}{|S|} + 1\right) + 2 \qquad (10)$$
$$= |S|\left(\frac{1}{2} \log \frac{n}{|S|} + 1\right) + 2 \qquad (11)$$

Here (9) is the per-context KT redundancy bound, and (10) follows from concavity of the logarithm (Jensen's inequality).

The $O(\log n)$ term is the cost of the unknown parameters $\theta_s$, $s \in S$.

This upper bound via the KT estimator meets the minimax redundancy lower bound, i.e., $\frac{|S|}{2}\log n + o(1)$.
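A minimal Python spot-check (illustrative, not from the slides) of the per-context bound behind (9), namely $\log \frac{\theta^{n_0}\bar{\theta}^{n_1}}{P_e(n_0,n_1)} \le \frac{1}{2}\log(n_0+n_1) + 1$ at the maximizing $\hat{\theta} = n_0/(n_0+n_1)$:

```python
import math

def kt(n0, n1):
    """Closed-form KT probability P_e(n0, n1)."""
    num = math.prod(i - 0.5 for i in range(1, n0 + 1)) * \
          math.prod(i - 0.5 for i in range(1, n1 + 1))
    return num / math.prod(range(1, n0 + n1 + 1))

for n0, n1 in [(2, 1), (5, 5), (20, 3), (0, 7)]:
    t = n0 / (n0 + n1)
    ml = t ** n0 * (1 - t) ** n1                    # maximum-likelihood probability (0**0 == 1)
    lhs = math.log2(ml / kt(n0, n1))
    rhs = 0.5 * math.log2(n0 + n1) + 1
    print(n0, n1, round(lhs, 3), "<=", round(rhs, 3), lhs <= rhs)
```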

Tree Source with Unknown Model

What if we do not know the tree model either? Assume the tree has depth at most $D$.

Two-part code: use $n_T$ bits to describe the tree, then use the KT estimator for that tree: $L_T(x^n) + n_T$. But this must be optimized over all trees and is not sequentially implementable.

CTW (Context Tree Weighting) defines a weighted coding distribution that takes into account all the possible tree sources that could have produced the observed sequence.

Question: the leaf set $S$ of this context tree might differ from the actual tree model. Is it still good?

Tree Source with Unknown Model

Calculate $n_0(s), n_1(s)$ for all $s \in \{s \mid s \in \{0,1\}^*, |s| \le D\}$.

Example: for $x^n = 00\,|\,0100101100$, $n_0(0) = 3$, $n_1(0) = 3$, ..., $n_0(10) = 2$, $n_1(10) = 1$.

[Figure: context tree of depth $D = 2$ with root $s = \lambda$, internal nodes $s = 0, 1$, and leaves $s = 00, 01, 10, 11$.]

Calculate the corresponding KT estimate $P_e(n_0(s), n_1(s))$ for each $s$, and assign the weighted probability

$$P_w^s = \begin{cases} P_e(n_0(s), n_1(s)) & \text{if } l(s) = D \\[4pt] \dfrac{P_e(n_0(s), n_1(s)) + P_w^{0s} P_w^{1s}}{2} & \text{if } l(s) \ne D \end{cases} \qquad (12)$$

Take $P_L(x^n) = P_w^{\lambda}(x^n)$ and use it for arithmetic coding.

Corollary: suppose $x^n$ is drawn from $P_1$ or $P_2$. Then one can achieve a redundancy of 1 bit by using the weighted distribution $P^w = \frac{P_1 + P_2}{2}$.
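A minimal, self-contained Python sketch of the CTW recursion (12) on this small example (assumptions: depth $D = 2$, a past of length at least $D$, contexts written oldest-to-newest so that the children $0s$ and $1s$ prepend the older symbol; kt, counts, and ctw are illustrative names):

```python
import math
from itertools import product

def kt(n0, n1):
    """Closed-form KT probability P_e(n0, n1)."""
    num = math.prod(i - 0.5 for i in range(1, n0 + 1)) * \
          math.prod(i - 0.5 for i in range(1, n1 + 1))
    return num / math.prod(range(1, n0 + n1 + 1))

def counts(past, seq, D):
    """n0(s), n1(s) for every context s with |s| <= D (oldest symbol first)."""
    full, c = list(past) + list(seq), {}
    for d in range(D + 1):
        for s in product((0, 1), repeat=d):
            n = [0, 0]
            for i in range(len(past), len(full)):
                if tuple(full[i - d:i]) == s:
                    n[full[i]] += 1
            c[s] = n
    return c

def ctw(past, seq, D=2):
    c = counts(past, seq, D)
    def weight(s):                                   # recursion (12)
        n0, n1 = c[s]
        if len(s) == D:
            return kt(n0, n1)
        return (kt(n0, n1) + weight((0,) + s) * weight((1,) + s)) / 2
    return weight(())                                # P_w at the root context lambda

past, seq = (0, 0), (0, 1, 0, 0, 1, 0, 1, 1, 0, 0)   # x^n = 00 | 0100101100
print(-math.log2(ctw(past, seq)))                    # CTW code length log 1/P_w^lambda(x^n)
```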

Tree Source with Unknown Model

Then

$$P_w^{\lambda}(x^n) = \sum_{S' \in \text{all } \mathcal{T}_D} 2^{-\Gamma_D(S')} \prod_{s \in S'} P_e(n_0(s), n_1(s)) \qquad (13)$$
$$\ge 2^{1-2|S|} \prod_{s \in S} P_e(n_0(s), n_1(s)) \qquad (14)$$
$$\ge 2^{1-2|S|} P_L^{\mathrm{known}}(x^n) \qquad (15)$$

The redundancy is

$$\mathrm{Red}_n(x^n) \le \mathrm{Red}_n^{\mathrm{known}}(x^n) + 2|S| - 1 \qquad (16)$$
$$= \frac{|S|}{2} \log\frac{n}{|S|} + |S| + 2 + 2|S| - 1 \qquad (17)$$

Summary

Two-part codes vs. mixture codes (and their plug-in variants).

The KT estimator is optimal in the minimax sense, and the mixture is sequentially implementable via the KT estimator.

Even though we do not know the tree model (only a depth bound is given), we can design a good compressor using CTW.

References

Willems; Shtarkov; Tjalkens (1995). The Context-Tree Weighting Method: Basic Properties. IEEE Transactions on Information Theory, 41.

Begleiter; El-Yaniv; Yona (2004). On Prediction Using Variable Order Markov Models. Journal of Artificial Intelligence Research, 22, pp. 385-421.

Additional tutorials and lecture notes.
