Lossless Source Coding

Gadiel Seroussi and Marcelo Weinberger

Information Theory Research, Hewlett-Packard Laboratories, Palo Alto, California

Notation

• $A$ = discrete (usually finite) alphabet
• $\alpha = |A|$ = size of $A$ (when finite)
• $x_1^n = x^n = x_1 x_2 x_3 \cdots x_n$ = finite sequence over $A$
• $x_1^\infty = x^\infty = x_1 x_2 x_3 \cdots x_t \cdots$ = infinite sequence over $A$
• $x_i^j = x_i x_{i+1} \cdots x_j$ = sub-sequence ($i$ sometimes omitted when $i = 1$)

• $p_X(x) = \mathrm{Prob}(X = x)$ = probability mass function (PMF) on $A$ (subscript $X$ and argument $x$ dropped if clear from context)
• $X \sim p(x)$ : $X$ obeys the PMF $p(x)$

• $E_p[F]$ = expectation of $F$ w.r.t. the PMF $p$ (subscript and brackets may be dropped)
• $\hat p_{x_1^n}(x)$ = empirical distribution obtained from $x_1^n$
• $\log x$ = logarithm to base 2 of $x$, unless another base is specified
• $\ln x$ = natural logarithm of $x$
• $H(X), H(p)$ = entropy of a random variable $X$ or PMF $p$, in bits; also
• $H(p) = -p\log p - (1-p)\log(1-p)$, $0 \le p \le 1$ : the binary entropy function
• $D(p\|q)$ = relative entropy (information divergence) between PMFs $p$ and $q$

Lossless Source Coding

3. Universal Coding

Universal Modeling and Coding

• So far, we assumed that a model of the data is available, and we aimed at compressing the data optimally w.r.t. that model
• By Kraft's inequality, the codes we use can be expressed in terms of probability distributions: for every UD code with length function $L(s)$ over strings $s$ of length $n$ over a finite alphabet $A$, we have $\sum_{s\in A^n} 2^{-L(s)} \le 1$ ⇒ a code defines a probability distribution $P(s) = 2^{-L(s)}$ over $A^n$
• Conversely, given a distribution $P(\cdot)$ (a model), there exists a UD code that assigns $\lceil -\log P(s)\rceil$ bits to $s$ (Shannon code)
• Hence, $P(\cdot)$ serves as a model to encode $s$, and every code has an associated model
  - a (probabilistic) model is a tool to "understand" and predict the behavior of the data

Universal Modeling and Coding (cont.)

• Given a model $P(\cdot)$ on $n$-tuples, arithmetic coding provides an effective means to sequentially assign a code word of length close to $-\log P(s)$ to $s$
  - we don't need to see the whole string $s = x_1 x_2 \cdots x_n$: encode $x_t$ using the conditional probability
$p(x_t \mid x_1 x_2 \cdots x_{t-1}) = \frac{P(x^t)}{P(x^{t-1})} = \frac{\sum_{u\in A^{n-t}} P(x^t u)}{\sum_{v\in A^{n-t+1}} P(x^{t-1} v)}$
  - the model probabilities can vary arbitrarily and "adapt" to the data
  - in a strongly sequential model, $P(x^t)$ is independent of $n$

• CODING SYSTEM = MODEL + CODING UNIT
  - two separate problems: design a model and use it to encode

We will view coding as a problem of assigning probabilities to data

Coding with Model Classes

• Universal data compression deals with the optimal description of data in the absence of a given model
  - in most practical applications, the model is not given to us
• How do we make the concept of "optimality" meaningful?
  - there is always a code that assigns just 1 bit to the data at hand!
  - the answer: model classes
• We want a "universal" code to perform as well as the best model in a given class $\mathcal{C}$ for any string $s$, where the best competing model may change from string to string
  - universality makes sense only w.r.t. a model class
• A code with length function $L(x^n)$ is pointwise universal w.r.t. a class $\mathcal{C}$ if the pointwise redundancy
$R_{\mathcal{C}}(L, x^n) = \frac{1}{n}\big[L(x^n) - \min_{C\in\mathcal{C}} L_C(x^n)\big] \to 0$ as $n \to \infty$

How to Choose a Model Class?

Universal coding tells us how to encode optimally w.r.t. a class; it doesn't tell us how to choose the class!

• Some possible criteria:
  - complexity
  - prior knowledge about the data
  - some popular models were already presented
• We will see that the bigger the class, the slower the best possible convergence rate of the redundancy to 0
  - in this sense, prior knowledge is of paramount importance: don't learn what you already know!

Ultimately, the choice of model class is an art

Example: Bernoulli Models

$A = \{0,1\}$, $\mathcal{C} = \{P_\theta,\ \theta\in\Lambda=[0,1]\}$ : i.i.d. distributions with parameter $p(1) = \theta$

$\min_{C\in\mathcal{C}} L_C(x^n) = \min_{\theta\in\Lambda}\log\frac{1}{P_\theta(x^n)} = \log\frac{1}{P_{\hat\theta(x^n)}(x^n)} = n\hat H(x^n)$, where $\hat\theta(x^n)$ is the ML estimate of $\theta$

Therefore, our goal is to find a code such that $\frac{L(x^n)}{n} - \hat H(x^n) \to 0$

• A trivial example of a universal code: use $\lceil\log(n+1)\rceil$ bits to encode $n_1$ (the number of ones), and then "tune" your Shannon code to the parameter $\theta = n_1/n$, which is precisely the ML estimate ⇒ this is equivalent to telling the decoder what the best model is! (a sketch follows below)
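To make the two-part scheme concrete, here is a minimal Python sketch (added for illustration, not part of the original slides); the function name and the idealized Shannon-code accounting (about $n\hat H$ bits for the data part) are assumptions of this sketch.

```python
import math

def two_part_code_length(x):
    """Trivial two-part universal code for a binary list x:
    describe n1 with ceil(log2(n+1)) bits, then use a Shannon code
    tuned to the ML parameter theta = n1/n (idealized as ceil(n*H^(x^n)) bits)."""
    n, n1 = len(x), sum(x)
    header = math.ceil(math.log2(n + 1))        # tell the decoder the best model
    theta = n1 / n
    if theta in (0.0, 1.0):                     # degenerate ML model: the data part costs 0 bits
        return header
    body = -n1 * math.log2(theta) - (n - n1) * math.log2(1 - theta)   # = n * H^(x^n)
    return header + math.ceil(body)

x = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1]
print(two_part_code_length(x), "bits for n =", len(x))
```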

Bernoulli Models: Enumerative Code

• The total code length of this two-part code satisfies
$\frac{L(x^n)}{n} \le -\frac{n_0}{n}\log\frac{n_0}{n} - \frac{n_1}{n}\log\frac{n_1}{n} + \frac{\log(n+1)+2}{n} = \hat H(x^n) + \frac{\log(n+1)+2}{n}$, where the excess term $\to 0$
  - wasteful: knowledge of $n_1$ already rules out part of the sequences, and we should not reserve code words for them
• A slightly better code: enumerate all $\binom{n}{n_1}$ sequences with $n_1$ ones, and describe the sequence by its index:
$\frac{L'(x^n)}{n} = \frac{1}{n}\big\lceil\log\binom{n}{n_1}\big\rceil + \frac{\lceil\log(n+1)\rceil}{n} \le \hat H(x^n) + \frac{\log(n+1)+2}{n}$
Stirling: $\binom{n}{n_1} \le \frac{2^{n\hat H(x^n)}}{\sqrt{2\pi (n_1 n_0)/n}}$ ⇒ for sequences such that $n_1/n$ is bounded away from 0 and 1,
$\frac{L'(x^n)}{n} \le \hat H(x^n) + \frac{\log n}{2n} + O(1/n)$

Bernoulli Models: Mixture Code

• The enumerative code length is close to $-\log Q'(x^n)$, where $Q'(x^n) = \binom{n}{n_1}^{-1}\cdot\frac{1}{n+1}$
• Is $Q'(\cdot)$ a probability assignment?
$\int_0^1 P_\theta(x^n)\,d\theta = \int_0^1 \theta^{n_1}(1-\theta)^{n_0}\,d\theta = \binom{n}{n_1}^{-1}\cdot\frac{1}{n+1} = Q'(x^n)$ ⇒ $\sum_{x^n\in A^n} Q'(x^n) = 1$
• Aha!! So we get the same result by mixing all the models in the class and using the mixture as a model!
• This uniform mixture is very appealing because it can be implemented sequentially, independently of $n$:
$Q'(x^n) = \frac{n_0!\,n_1!}{(n+1)!} = \prod_{t=0}^{n-1} q'(x_{t+1}\mid x^t)$, where $q'(0\mid x^t) = \frac{n_0(x^t)+1}{t+2}$
  - Laplace's rule of succession! (see the sketch below)
  - this yields a "plug-in" approach: estimate $\theta$ and plug it in
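A minimal sketch (added; not in the slides) of the sequential implementation via Laplace's rule of succession, checking numerically that the product of the conditional probabilities equals the closed form $1/((n+1)\binom{n}{n_1})$.

```python
import math

def laplace_sequential(x):
    """Return Q'(x^n) computed as the product of q'(x_{t+1}|x^t) = (count + 1)/(t + 2)."""
    counts = [0, 0]                 # occurrences of 0 and 1 seen so far
    q = 1.0
    for t, sym in enumerate(x):
        q *= (counts[sym] + 1) / (t + 2)
        counts[sym] += 1
    return q

x = [0, 1, 1, 0, 1, 1, 1, 0]
n, n1 = len(x), sum(x)
closed_form = 1.0 / ((n + 1) * math.comb(n, n1))
print(laplace_sequential(x), closed_form)       # the two values agree
```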

Bernoulli Models: Mixture Code (cont.)

• Maybe we can make Stirling work for us with a different mixture, one that puts more weight on the problematic regions where $n_1/n$ approaches 0 and 1?
• Consider Dirichlet's density $w(\theta) = \frac{1}{\Gamma(\frac12)^2\sqrt{\theta(1-\theta)}}$ ⇒
$Q''(x^n) = \int_0^1 P_\theta(x^n)\,dw(\theta) = \frac{1}{\Gamma(\frac12)^2}\int_0^1 \theta^{\,n_1-\frac12}(1-\theta)^{\,n_0-\frac12}\,d\theta = \frac{\Gamma(n_1+\frac12)\,\Gamma(n_0+\frac12)}{n!\,\Gamma(\frac12)^2}$
⇒ by Stirling's approximation for the Gamma function, for all sequences $x^n$:
$\frac{L''(x^n)}{n} = -\frac1n\log Q''(x^n) \le \hat H(x^n) + \frac{\log n}{2n} + O(1/n)$  (can we do any better?)
• This mixture also has a plug-in interpretation (see the sketch below):
$Q''(x^n) = \prod_{t=0}^{n-1} q''(x_{t+1}\mid x^t)$, where $q''(0\mid x^t) = \frac{n_0(x^t)+\frac12}{t+1}$
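For comparison, a sketch (added) of the plug-in form of this Dirichlet(1/2, 1/2) mixture, i.e., the sequential KT estimator, checked against the closed form $\Gamma(n_1+\frac12)\Gamma(n_0+\frac12)/(n!\,\pi)$.

```python
import math

def kt_sequential(x):
    """Return Q''(x^n) as the product of q''(x_{t+1}|x^t) = (count + 1/2)/(t + 1)."""
    counts = [0, 0]
    q = 1.0
    for t, sym in enumerate(x):
        q *= (counts[sym] + 0.5) / (t + 1)
        counts[sym] += 1
    return q

x = [0, 1, 1, 0, 1, 1, 1, 0]
n0, n1, n = x.count(0), x.count(1), len(x)
closed_form = math.gamma(n1 + 0.5) * math.gamma(n0 + 0.5) / (math.factorial(n) * math.pi)
print(kt_sequential(x), closed_form)            # the two values agree
```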

Summary of Codes

• Two-part code
  - describe the best parameter and then code based on it: works whenever the number of possible optimal codes in the class is sub-exponential
  - the most natural approach, but not sequential
  - not efficient in the example: can be improved by describing the parameter more coarsely ⇒ trade-off: spend fewer bits on the parameter and use an approximation of the best parameter
  - alternatively, don't allocate code words to impossible sequences
• Mixture code
  - can be implemented sequentially
  - a suitable prior on the parameter gave the smallest redundancy
• Plug-in codes
  - use a (biased) estimate of the conditional probability in the model class, based on the data seen so far
  - can often be interpreted as mixtures
• But what's the best we can do?

Normalized Maximum-Likelihood Code

• Goal: find a code that attains the best worst-case pointwise redundancy
$R_{\mathcal{C}} = \min_L\max_{x^n\in A^n} R_{\mathcal{C}}(L, x^n) = \min_L\max_{x^n\in A^n}\frac1n\big[L(x^n) - \min_{C\in\mathcal{C}} L_C(x^n)\big]$
• Consider the code defined by the probability assignment
$Q(x^n) = \frac{2^{-\min_{C\in\mathcal{C}} L_C(x^n)}}{\sum_{y^n\in A^n} 2^{-\min_{C\in\mathcal{C}} L_C(y^n)}}$ ⇒ $R_{\mathcal{C}}(L_Q, x^n) = \frac1n\log\big[\sum_{y^n\in A^n} 2^{-\min_{C\in\mathcal{C}} L_C(y^n)}\big]$
  - this quantity depends only on the class!
• Since the redundancy of this code is the same for all $x^n$, it must be optimal in the minimax sense!
• Drawbacks:
  - the sequential probability assignment depends on the horizon $n$
  - hard to compute (a brute-force illustration for tiny $n$ follows below)
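For tiny $n$ the normalization can be computed by brute force; the sketch below (an added illustration) builds the NML assignment for the Bernoulli class and prints its per-symbol minimax redundancy.

```python
import math
from itertools import product

def p_ml(x):
    """Maximized likelihood P_{theta^(x)}(x) = (n1/n)^n1 * (n0/n)^n0 over the Bernoulli class."""
    n, n1 = len(x), sum(x)
    n0 = n - n1
    return (n1 / n) ** n1 * (n0 / n) ** n0        # Python evaluates 0**0 as 1, as needed here

n = 10
norm = sum(p_ml(x) for x in product((0, 1), repeat=n))   # the normalizing (Shtarkov) sum
# The NML code assigns Q(x) = p_ml(x) / norm; its pointwise redundancy is the same for all x:
print("R_C =", math.log2(norm) / n, "bits/symbol")
```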

Example: NML Code for the Bernoulli Class

• We evaluate the minimax redundancy $R_{\mathcal{C}}$ for the Bernoulli class:
$R_{\mathcal{C}} = \frac1n\log\big[\sum_{x^n\in A^n} 2^{-n\hat H(x^n)}\big] = \frac1n\log\big[\sum_{i=0}^n\binom{n}{i} 2^{-n h(i/n)}\big] = \frac{1}{2n}\log\frac{n\pi}{2} + o(1/n)$
  - that the summand depends on $x^n$ only through $n_1$ is typical of sufficient statistics: $P_\theta(x^n) = p(x^n\mid s(x^n))\,p_\theta(s(x^n))$
  - hint: Stirling + $\int_0^1\frac{d\theta}{\sqrt{\theta(1-\theta)}} = \Gamma(\tfrac12)^2 = \pi$
⇒ the pointwise redundancy cannot vanish faster than $\frac{\log n}{2n}$, which is attained by the Dirichlet mixture and by the NML code

Parametric Model Classes

• A useful limitation of the model class is to assume $\mathcal{C} = \{P_\theta,\ \theta\in\Theta_d\}$, a parameter space of dimension $d$
• Examples:
  - Bernoulli: $d = 1$; general i.i.d. model: $d = \alpha-1$ ($\alpha = |A|$)
  - FSM model with $k$ states: $d = k(\alpha-1)$
  - memoryless geometric distribution on the integers $i\ge 0$: $P(i) = \theta^i(1-\theta)$, $d = 1$
• The dimension of the parameter space (number of one-dimensional parameters) plays a fundamental role in modeling problems
  - with more parameters we can better fit the model to the data (e.g., a Markov model of higher order ⇒ the entropy cannot increase)
  - but, on the other hand, the class is richer and the redundancy is higher: e.g., in a two-part code it takes more bits to describe the best parameter
  We will quantify this trade-off of the model selection step

Minimax Redundancy for Parametric Classes

• For a parametric class, $\min_{C\in\mathcal{C}} L_C(x^n) = -\log P_{\hat\theta(x^n)}(x^n)$, where $\hat\theta(x^n)$ is the ML estimate of $\theta$ based on $x^n$
• A code corresponding to a distribution $Q(x^n)$ has pointwise redundancy $\frac1n\log\frac{P_{\hat\theta(x^n)}(x^n)}{Q(x^n)}$, and the minimax redundancy (attained by the NML code) is
$R_{\mathcal{C}} = \frac1n\log\big[\sum_{x^n\in A^n} P_{\hat\theta(x^n)}(x^n)\big]$
  - the more sensitive the class is to the parameter, the larger the redundancy

Assumptions on the Parametric Model Classes

• We will assume that each model in the class satisfies the marginality condition of a random process, namely
$\sum_{x_{n+1}\in A} P_\theta(x^{n+1}) = P_\theta(x^n)$
  - this means that each model can be implemented in a strongly sequential manner
• We will usually assume that the model class is "nice", including:
  - $\Theta_d$ is an open bounded subset of $\mathbb{R}^d$ which includes the ML estimates
  - the Fisher information matrix $I(\theta)$ is "nice", where $I_{ij}(\theta) = -\lim_{n\to\infty}\frac1n E\big[\frac{\partial^2\ln P_\theta(x^n)}{\partial\theta_i\,\partial\theta_j}\big]$
    · for i.i.d. models, this is the classical Fisher information
  - the maximum-likelihood estimator satisfies the CLT: the distribution of $\sqrt n\,(\hat\theta(x^n)-\theta)$ converges to $N(0, I^{-1}(\theta))$

Redundancy of NML Code

• Theorem: $R_{\mathcal{C}} = \frac{d}{2n}\log\frac{n}{2\pi} + \frac1n\log\int_{\Theta_d}\sqrt{|I(\theta)|}\,d\theta + o(1/n)$

As expected, $R_{\mathcal{C}}$ grows with the number of parameters

• Idea of the proof:
  - partition the parameter space into small hypercubes with sides $r = o(1/\sqrt n)$
  - represent each hypercube $V_r(\theta_r)$ by a parameter $\theta_r$ belonging to it
  - associate to $V_r(\theta_r)$ the mass $P_r(\theta_r) = \sum_{x^n:\,\hat\theta(x^n)\in V_r(\theta_r)} P_{\theta_r}(x^n)$
  - with a Taylor expansion of $P_\theta(x^n)$ (as a function of $\theta$) around its maximum $\hat\theta(x^n)$, we get the exact asymptotics of $\sum_{\theta_r} P_r(\theta_r)$, and hence of $R_{\mathcal{C}}$
  - for each hypercube, approximate $P_r(\theta_r)$ using the CLT ⇒ integral of a normal distribution

Interpretation of NML as a Two-part Code

• Encode $x^n$ in two parts:
  - Part I: encode the parameter $\theta_r$ of the hypercube $V_r(\theta_r)$ into which $\hat\theta(x^n)$ falls, with a code tuned to the distribution $P_r(\theta_r)\big/\sum_{\theta'_r} P_r(\theta'_r)$
  - Part II: encode $x^n$ using the model $\theta_r$, conditioned on the fact that the ML estimate belongs to $V_r(\theta_r)$ ⇒ distribution $P_{\theta_r}(x^n)/P_r(\theta_r)$
• In earlier two-part codes, the parameters were represented with a fixed precision $O(1/\sqrt n)$ and then $x^n$ was encoded based on the approximate ML parameter
  - evaluating the approximation cost with a Taylor expansion, this precision was shown to be optimal
  - intuitively, $1/\sqrt n$ is the magnitude of the estimation error in $\hat\theta(x^n)$, and therefore there is no need to encode the estimator with better precision
$\frac{d}{2}\log n$ = MODEL COST

Redundancy of the Mixture Code

$Q_w(x^n) = \int_{\theta\in\Theta_d} P_\theta(x^n)\,dw(\theta)$
• As in the Bernoulli example, for any i.i.d. exponential family, and for Markov models, the mixture code gives the same "magic" model cost when the prior density $w(\theta)$ is proportional to $\sqrt{|I(\theta)|}$ (Jeffreys prior)
  - in the Bernoulli case, this is precisely Dirichlet's distribution: $I(\theta) = \frac{1}{\theta(1-\theta)}$
  - the tool for solving the integral is Laplace's integration method
• The advantage of mixtures is that they yield sequential, horizon-free codes:
$\sum_{x_t\in A} Q_w(x^t) = \sum_{x_t\in A}\int_{\Theta_d} P_\theta(x^t)\,dw(\theta) = \int_{\Theta_d} P_\theta(x^{t-1})\,dw(\theta) = Q_w(x^{t-1})$
$q_w(x_t\mid x^{t-1}) = \frac{Q_w(x^t)}{Q_w(x^{t-1})} = \int_{\Theta_d} p_\theta(x_t\mid x^{t-1})\,\frac{P_\theta(x^{t-1})\,dw(\theta)}{\int_{\Theta_d} P_\theta(x^{t-1})\,dw(\theta)}$

Important Example: FSM model classes

• Given an FSM $S$ with $k$ states (fixed initial state); the parameters are the conditional probabilities per state, $p(x\mid s)$, $x\in A$, $s\in S$

ML estimates: $\hat p_{x^n}(x\mid s) = \frac{n_{x^n}(x\mid s)}{n_{x^n}(s)}$

target: $\min_{\theta\in\Theta_{k(\alpha-1)}}\log\frac{1}{P_\theta(x^n)} = \log\frac{1}{P_{\hat\theta(x^n)}(x^n)} = n\hat H(x^n\mid S)$

mixture distribution: $Q_w(x^n) = \int_{\Theta_{k(\alpha-1)}}\prod_{s\in S,\,x\in A} p(x\mid s)^{\,n_{x^n}(x\mid s)}\,dw(\theta)$, which, with a product prior, factors into per-state integrals $\prod_{s\in S,\,x\in A}\int_0^1\theta^{\,n_{x^n}(x\mid s)}\,dw(\theta)$
⇒ equivalent to performing an i.i.d. mixture on each state sub-sequence
⇒ optimal universal probability assignment:
$q_{x^t}(x\mid s) = \frac{n_{x^t}(x\mid s)+\frac12}{n_{x^t}(s)+\frac{\alpha}{2}}$, the Krichevsky-Trofimov (KT) estimator

FSM model: Numerical Example

• FSM = first-order Markov, initial state 0

$x^{20}$ = 0 0 1 1 1 1 0 1 1 1 1 1 1 0 0 0 0 1 0 0
sequential KT probabilities assigned to the symbols (per state):
1/2, 3/4, 1/6, 1/2, 3/4, 5/6, 1/8, 3/8, 7/10, 3/4, 11/14, 13/16, 5/6, 3/20, 1/2, 7/12, 9/14, 5/16, 5/22, 11/18

total probability: $Q(x^n) = \prod_{s\in S}\frac{\Gamma(n(0\mid s)+\frac12)\,\Gamma(n(1\mid s)+\frac12)}{n(s)!\,\Gamma(\frac12)^2} = \frac{\Gamma(6.5)\,\Gamma(3.5)\,\Gamma(3.5)\,\Gamma(8.5)}{9!\;11!\;\Gamma(\frac12)^4}$

$-\log Q(x^{20}) \approx 21.6$ ⇒ code length: 22 bits
$n\hat H(x^{20}\mid S) = 9\,h(1/3) + 11\,h(3/11) \approx 17.56$ bits
(the roughly 4-bit excess is the model cost of the two free parameters, consistent with $\frac{d}{2}\log n = \log 20 \approx 4.3$; a verification sketch follows below)
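The example can be verified mechanically. The sketch below (added) runs the per-state KT estimator on $x^{20}$ for the first-order binary Markov machine with initial state 0; the total should come out near 21.6 bits.

```python
import math

x = [0,0,1,1,1,1,0,1,1,1,1,1,1,0,0,0,0,1,0,0]      # the example sequence x^20

counts = {0: [0, 0], 1: [0, 0]}                     # counts[state][symbol]
state, bits = 0, 0.0                                # fixed initial state 0
for sym in x:
    n0, n1 = counts[state]
    p = (counts[state][sym] + 0.5) / (n0 + n1 + 1)  # per-state KT conditional probability
    bits += -math.log2(p)
    counts[state][sym] += 1
    state = sym                                     # first-order Markov: next state = last symbol

print("KT code length:", bits, "bits")              # roughly 21.6
```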

FSM model classes and the Lempel-Ziv algorithm

• The LZ algorithm is universal for ANY class of FSM models (of any size) ⇒ in particular, for all Markov models
$\limsup_{n\to\infty}\frac1n L_{LZ}(x^n) \le \limsup_{n\to\infty}\hat H(x^n\mid S_k)$   (Markov machine of any order $k$)
and, for an infinite sequence $x^\infty$,
$\limsup_{n\to\infty}\frac1n L_{LZ}(x^n) \le \lim_{k\to\infty}\limsup_{n\to\infty}\hat H(x^n\mid S_k) = \hat H(x^\infty)$
(the Markov limit exists; $\hat H(x^\infty)$ is the compressibility of $x^\infty$; an LZ78 parsing sketch follows below)
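A toy sketch (added) of LZ78 incremental parsing, the mechanism behind these bounds; the $c(\lceil\log(c+1)\rceil+1)$-bit accounting (a pointer plus one new symbol per phrase) is the usual rough estimate, used here only for illustration.

```python
import math

def lz78_parse(x):
    """LZ78 incremental parsing: split x into phrases, each equal to a previously
    seen phrase extended by one new symbol."""
    dictionary = {(): 0}                 # phrase -> index
    phrases = []
    current = ()
    for sym in x:
        candidate = current + (sym,)
        if candidate in dictionary:
            current = candidate          # keep extending the current phrase
        else:
            dictionary[candidate] = len(dictionary)
            phrases.append(candidate)
            current = ()
    if current:                          # possibly incomplete last phrase
        phrases.append(current)
    return phrases

x = [0,1,0,1,0,0,0,1,0,0,1,1]            # parses as 0 | 1 | 01 | 00 | 010 | 011
phrases = lz78_parse(x)
c = len(phrases)
approx_bits = c * (math.ceil(math.log2(c + 1)) + 1)   # pointer + one new bit per phrase
print(phrases, approx_bits)
```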

Expected Pointwise Redundancy

• So far, we have considered the code length for individual sequences, without taking any average
• We have found that for parametric classes, the best we can do is to achieve a worst-case pointwise redundancy of about $\frac d2\log n$ bits
• This means guaranteed performance, but maybe there are only a few such "unlucky" strings?
• How should we weight the redundancies in order to compute an average?

Since we are assuming that the class $\mathcal{C} = \{P_\theta,\ \theta\in\Theta_d\}$ is a good model for the data, it makes sense to assume that the data was drawn from a source with some distribution $P_\theta$ in $\mathcal{C}$

• Expected pointwise redundancy of a code:
$E_\theta[R_{\mathcal{C}}(L, x^n)] = \frac1n E_\theta\big[L(x^n) + \log P_{\hat\theta(x^n)}(x^n)\big]$

Expected Redundancy

• Maybe there exist codes that are "good" (with expected pointwise redundancy smaller than $\frac d2\log n$) for a significant fraction of the models in the class?
• We will be even more "liberal": we consider the expected redundancy
$R_{\mathcal{C}}(L,\theta) = \frac1n E_\theta[L(x^n)] - H_n(\theta) = D_n(P_\theta\,\|\,Q)$
(a normalized divergence, with $Q(x^n) = 2^{-L(x^n)}$ a distribution defined on $n$-tuples)
This is more "liberal" because
$H_n(\theta) = \frac1n E_\theta[-\log P_\theta(x^n)] \ge \frac1n E_\theta[-\log P_{\hat\theta(x^n)}(x^n)]$
and therefore $R_{\mathcal{C}}(L,\theta) \le E_\theta[R_{\mathcal{C}}(L, x^n)]$

Lower Bound on Expected Redundancy

• The pointwise universal codes we saw are, a fortiori, average-universal for all parameters $\theta$ ⇒ the corresponding distribution $Q$ is "close" to all the models in the class, in the sense that $D_n(P_\theta\|Q)\to 0$
• Still, the answer is that $\frac d2\log n$ is about the best we can do for most models in the class: it is the inevitable cost of universality
• Theorem:
Assume that either the CLT holds for the ML estimator of the parameters in $\Theta_d$, or $\Pr\{\sqrt n\,|\hat\theta_i(x^n)-\theta_i| \ge \log n\} \le \delta(n)\to 0$. Then, for every $Q$ and every $\varepsilon>0$,
$-\frac1n E_\theta[\log Q(x^n)] \ge H_n(\theta) + d\,(1-\varepsilon)\,\frac{\log n}{2n}$
for all points $\theta$ in $\Theta_d$ except in a set whose volume $\to 0$ as $n\to\infty$

Lower Bound (cont.)

• This lower bound parallels Shannon's coding theorem: when we consider a model class instead of a single distribution, a model cost is added to the entropy
• The bound cannot hold for all models in the family, but it holds for most of them
• One interpretation of the lower bound: if the parameters can be estimated well, they are "distinguishable" ($P_\theta$ is sensitive to $\theta$), so the class cannot be coded without a model cost

Conclusion: the number of parameters affects the achievable convergence rate of a universal code length to the entropy

Variations on the Lower Bound

• Assuming, in addition, that $\sum_n\delta(n) < \infty$, the cumulative volume of "bad" parameters over all sufficiently large $n$ tends to 0 ⇒ the "bad" parameters form a set of Lebesgue volume 0
• Strong Redundancy-Capacity Theorem: under very mild conditions on the model class,
$-\frac1n E_\theta[\log Q(x^n)] \ge H_n(\theta) + (1-\varepsilon)\,C_n$
where $C_n = \sup_w\big[H_n(Q_w) - \int_{\Theta_d} H_n(\theta)\,dw(\theta)\big] \ge 0$ and $Q_w$ is a mixture,
for all $\theta$ except for a set $B$ with $w^*(B) \le \varepsilon\,2^{-nC_n\varepsilon}$, where $w^*$ is the density that achieves the supremum
• For parametric classes, $C_n$ indeed behaves as $\frac{d}{2n}\log n$, and $w^*(B)\to 0$

Optimal Codes for Average Redundancy

• How do we achieve $\inf_L\sup_\theta\big\{\frac1n E_\theta[L(x^n)] - H_n(\theta)\big\}$?
• Use the mixture $Q_w$ for which the weighted expected redundancy $\int_{\Theta_d}\big[\frac1n E_\theta[-\log Q_w(x^n)] - H_n(\theta)\big]dw(\theta)$ is maximum ⇒ it equals $C_n$
⇒ an appropriate mixture is the best one can do to minimize the expected redundancy for the worst-case parameter (close to Jeffreys prior in the case of exponential families / Markov models)
• In many situations, the minimum expected redundancy (or at least its main term $\frac d2\log n$) can be achieved by a plug-in code of the form
$Q(x^n) = \prod_{t=1}^n P_{\tilde\theta_{t-1}}(x_t)$, where $\tilde\theta_{t-1}$ is an estimate of $\theta$ based on $x^{t-1}$
  - example: the Bernoulli/FSM cases (KT estimator; even Laplace's estimator works in the average sense)

Summary of Codes and Bounds

• Minimax pointwise redundancy: NML code
  - for "nice" parametric families with $d$ parameters,
$R_{\mathcal{C}} = \frac{d}{2n}\log\frac{n}{2\pi} + \frac1n\log\int_{\Theta_d}\sqrt{|I(\theta)|}\,d\theta + o(1/n)$
  - horizon-dependent; a mixture code with a suitable prior is a good horizon-free approximation
  - in fact, for i.i.d. models with $\alpha-1$ parameters, it can be shown that
$\lim_{n\to\infty} q_{NML}(x_t\mid x_1x_2\cdots x_{t-1}) = \lim_{n\to\infty}\frac{\sum_{u\in A^{n-t}} Q_{NML}(x^t u)}{\sum_{v\in A^{n-t+1}} Q_{NML}(x^{t-1} v)} = \frac{n_{x^{t-1}}(x_t)+\frac12}{t-1+\frac{\alpha}{2}}$
that is, the KT mixture!

Summary (cont.)

• Minimax average redundancy: a mixture code with the prior for which the weighted expected redundancy is maximum,
$C_n = \sup_w\big[H_n(Q_w) - \int_{\Theta_d} H_n(\theta)\,dw(\theta)\big]$
  - for i.i.d. exponential families and Markov models, the maximizing prior is close to Jeffreys prior, and
$C_n = \frac{d}{2n}\log\frac{n}{2\pi e} + \frac1n\log\int_{\Theta_d}\sqrt{|I(\theta)|}\,d\theta + o(1/n)$
  - not only can we not do better than $C_n$ for one (worst-case) parameter, but we cannot do better for most parameters
  - for "nice" parametric families, $C_n \approx \frac{d}{2n}\log n$

Twice-universal coding

• Consider a model class that is the union of nested classes of growing dimensionality: $\Theta_1\cup\Theta_2\cup\cdots\cup\Theta_d\cup\cdots$, where $\Theta_1\subset\Theta_2\subset\cdots\subset\Theta_d\subset\cdots$
  - example: Markov models of different orders, or FSMs
• We seek a code length close to the minimum over the classes of the universal code length for each class:
$\frac1n L(x^n) \approx \min_d\min_{\theta\in\Theta_d}\big\{H_n(\theta) + \frac{d}{2n}\log n\big\}$
  - here, we also try to optimize the model size
  - trade-off: the first term decreases with $d$, the second grows with $d$
  - we answer questions such as: should we model the data as i.i.d. or as Markov of order 1?
  - since the second level of universality is over a discrete set, we expect it not to affect the main redundancy term

Mixture Approach to Double-Universality

• Consider a series $\{\lambda_i\}$ with $\lambda_i > 0$ and $\sum_i\lambda_i = 1$. The probability assignment $Q(x^n) = \sum_d\lambda_d\,Q_d(x^n)$, where $Q_d$ is a universal sequential code for class $d$, is sequential: $Q(x^n) = \sum_{x_{n+1}} Q(x^{n+1})$. It is universal as desired: for all $d$,
$-\frac1n\log Q(x^n) \le -\frac1n\log Q_d(x^n) - \frac{\log\lambda_d}{n} < \min_{\theta\in\Theta_d}\frac1n\log\frac{1}{P_\theta(x^n)} + \frac{d}{2n}\log n + O(1/n)$
• However, it involves an infinite sum
  - this can be avoided by modifying $Q_d$ so that $q_d(x_t\mid x^{t-1}) = \alpha^{-1}$ for all $t\le d$: this way, all models with $d\ge n$ assign the same probability $\alpha^{-n}$ to $x^n$ and the sum becomes finite, without affecting universality or sequentiality
  - anyway, this is obviously not practical, but it shows achievability of the goal (a toy sketch follows below)
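A toy sketch (added) of this mixture idea for binary Markov models of orders 0..D, each coded with a per-state KT assignment; the weights $\lambda_d = 2^{-(d+1)}$ (with the leftover mass on the largest order) are one arbitrary choice of a summable series, and the all-zeros initial context is an assumption of the sketch.

```python
import math

def kt_markov(x, order):
    """Per-state KT probability of x under a binary Markov model of the given order
    (the initial context is taken to be all zeros)."""
    counts = {}
    ctx = (0,) * order
    q = 1.0
    for sym in x:
        n = counts.setdefault(ctx, [0, 0])
        q *= (n[sym] + 0.5) / (n[0] + n[1] + 1)
        n[sym] += 1
        ctx = (ctx + (sym,))[-order:] if order else ()
    return q

def twice_universal(x, max_order=4):
    """Mixture over model orders with weights 2^{-(d+1)}, leftover mass on d = max_order."""
    q = 0.0
    for d in range(max_order + 1):
        w = 2.0 ** -(d + 1) if d < max_order else 2.0 ** -max_order
        q += w * kt_markov(x, d)
    return q

x = [0, 1] * 16                       # strongly first-order data
print(-math.log2(twice_universal(x)) / len(x), "bits/symbol")
```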

Plug-in Approach to Double-Universality

• The idea: estimate the best model class $\Theta_d$ based on $x^{t-1}$, and plug it in to encode $x_t$ with $Q_d$:
$Q(x^n) = \prod_{t=1}^n Q_{\hat d(x^{t-1})}(x_t\mid x^{t-1})$
  - the decoder can reproduce the process without overhead
• In general, this approach does not work pointwise, but for some model classes it was shown to work on the average

Lossless Source Coding

5. Universal Algorithms for Tree Models

Double-Universality for Tree Models

• Tree sources (FSMX)
  - finite memory $\le k$ (Markov)
  - the number of past symbols needed to determine the state might be $< k$ for some states
  [figure: context tree and state diagram with states $s_0\leftrightarrow 0$, $s_1\leftrightarrow 01$, $s_2\leftrightarrow 11$]
  - by merging nodes of the full Markov tree, we get a model with a smaller number of free parameters
  [figure: the full order-2 Markov tree, in which the two contexts ending in 0 share the parameter $p(0\mid s_0)$, versus the pruned tree with leaves $s_0$, $s_1$, $s_2$]
  - the set of tree sources with unbalanced trees has measure zero in the space of Markov sources of any given order: otherwise, double-universality would contradict the lower bound!!
  - yet, tree models have proven very useful in practice, partly because there exist efficient twice-universal, sequential schemes in the class of tree models of any size

Contexts and Trees

• Any suffix of a sequence $x^t$ is called a context in which the next symbol $x_{t+1}$ occurs
  [figure: a binary context tree; reading the past string ...110010 backwards from the next input bit traces a path in the tree, and each node on that path is a context]
• For a finite-memory process $P$, the conditioning states $s(x^t)$ are contexts that satisfy $P(a\mid x^t) = P(a\mid u\,s(x^t))$ for all $u\in A^*$, $a\in A$
• Number of parameters: $\alpha-1$ per leaf of the tree
  - assume a minimal parametrization: sibling states with identical conditional distributions are "merged"

Two-pass Context Algorithm

• First pass: gather all the context statistics for $x^n$ in a tree
  - $O(n)$ complexity with suffix-tree methods
• After the first pass, associate to each node $s$ the cost $L(s) + \frac{\alpha}{\alpha-1}$, where $L(s)$ is the KT code length for the symbols occurring at $s$ (the additive term penalizes over-parametrization), and "prune" the tree to find the subtree $T$ with minimum total cost over its leaves (a dynamic-programming algorithm)
• The total cost for $T$ is
$L_T(x^n) + \frac{\alpha\,l_T}{\alpha-1} = L_T(x^n) + n_T + \frac{1}{\alpha-1}$
where $l_T$ = number of leaves of $T$ and $n_T$ = number of nodes of $T$

Two-pass Context Algorithm (cont.)

• Second pass: describe $T$ to the decoder using $n_T$ bits and encode $x^n$ conditioned on $T$ with the KT assignment ⇒ by definition, we have the best tree
$\frac1n L(x^n) = \frac1n L_T(x^n) + \frac{n_T}{n} < \hat H(x^n\mid T) + \frac{(\alpha-1)\,l_T}{2n}\log n + O(1/n)$

Pruning of Context Trees: Example

[figure: a worked example of pruning a context tree by comparing node costs (not reproduced)]

Drawbacks of Two-pass Approach

• Non-sequential
• The scheme has asymptotically minimum pointwise redundancy, but it is redundant (given the tree, not all $x^n$ are possible):
$Q(x^n) = 2^{-L(x^n)} = 2^{-n_T}\,2^{-L_T(x^n)} = \lambda_T\,2^{-L_T(x^n)} < \sum_{T'}\lambda_{T'}\,2^{-L_{T'}(x^n)}$
⇒ a mixture with weights $\lambda_T$ is better!
• Can the mixture be computed efficiently?

Mixture Approach: CTW

• CTW: an efficient implementation of the mixture for the class of binary tree models of depth bounded by $m$
  - extensions exist for non-binary alphabets and unbounded depth, but they are not as elegant
• Probability assignment associated with a node $s$ of the tree (a sketch follows below):
$Q^w_t(s) = \frac{Q^{KT}_t(s) + Q^w_t(s0)\,Q^w_t(s1)}{2}$ if depth$(s) < m$, and $Q^w_t(s) = Q^{KT}_t(s)$ if depth$(s) = m$
$q_{CTW}(x_t\mid x_1x_2\cdots x_{t-1}) = \frac{Q_{CTW}(x^t)}{Q_{CTW}(x^{t-1})} = \frac{Q^w_t(\lambda)}{Q^w_{t-1}(\lambda)}$   ($\lambda$ = root of the tree)
• Theorem: let $T$ have $l_T$ leaves (⇒ $n_T = 2l_T - 1$ nodes), $m_T$ of them at level $m$. Then
$Q_{CTW}(x^t) = \sum_T 2^{-(n_T-m_T)}\,Q^{KT}_T(x^t)$ ⇒ $L_{CTW}(x^n) \le L^{KT}_T(x^n) + n_T - m_T$ for any $T$
where $Q^{KT}_T$ is the KT probability assignment conditioned on $T$
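A compact batch sketch (added) of the CTW computation for a bounded-depth binary tree, with the past padded with zeros; a real implementation updates the node probabilities incrementally instead of recomputing them recursively.

```python
import math

def kt(n0, n1):
    """KT probability of a sequence with n0 zeros and n1 ones (in any order)."""
    num = 1.0
    for i in range(n0):
        num *= i + 0.5
    for j in range(n1):
        num *= j + 0.5
    return num / math.factorial(n0 + n1)

def ctw(x, m):
    """Context-tree weighting probability of the binary string x with tree depth m."""
    counts = {}                                   # node (context tuple) -> [#zeros, #ones]
    padded = [0] * m + list(x)                    # pad the past with zeros
    for t in range(m, len(padded)):
        ctx = tuple(padded[t - 1 - i] for i in range(m))   # most recent symbol first
        for d in range(m + 1):
            n = counts.setdefault(ctx[:d], [0, 0])
            n[padded[t]] += 1
    def qw(s):
        n0, n1 = counts.get(s, [0, 0])
        if len(s) == m:
            return kt(n0, n1)
        return 0.5 * kt(n0, n1) + 0.5 * qw(s + (0,)) * qw(s + (1,))
    return qw(())

x = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
print(-math.log2(ctw(x, m=3)), "bits")
```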

CTW: "Proof" by Example

• Let $m = 2$ and expand the recursion $Q^w(s) = \frac{Q^{KT}(s)+Q^w(s0)\,Q^w(s1)}{2}$ (with $Q^w = Q^{KT}$ at depth $m$) at the root $\lambda$:
$Q^w(\lambda) = \frac{Q^{KT}(\lambda) + Q^w(0)\,Q^w(1)}{2}$
$= \frac{Q^{KT}(\lambda)}{2} + \frac{\big[Q^{KT}(0)+Q^{KT}(00)\,Q^{KT}(01)\big]\,\big[Q^{KT}(1)+Q^{KT}(10)\,Q^{KT}(11)\big]}{8}$
$= \frac{Q^{KT}(\lambda)}{2} + \frac{Q^{KT}(0)\,Q^{KT}(1)}{8} + \frac{Q^{KT}(0)\,Q^{KT}(10)\,Q^{KT}(11)}{8} + \frac{Q^{KT}(1)\,Q^{KT}(00)\,Q^{KT}(01)}{8} + \frac{Q^{KT}(00)\,Q^{KT}(01)\,Q^{KT}(10)\,Q^{KT}(11)}{8}$
• Each term corresponds to one tree $T$ with weight $2^{-(n_T-m_T)}$: the root-only tree has $n_T - m_T = 1$ (weight 1/2), and each of the other four trees has $n_T - m_T = 3$ (weight 1/8)

CTW: Practical Issues

• Finite precision:
  - as node probabilities get smaller, they cannot be stored ⇒ they may need to be recomputed from the counts
  - the probability assignment is a ratio of vanishing numbers
• In general, $m$ needs to be assumed large, and complexity grows accordingly
• Binary alphabets only
• However, there are good implementations of CTW which provide some of the best compression ratios on text and binary file compression
• From a practical viewpoint, a plug-in approach is more appealing
  - however, its universality has only been shown in an average sense

Plug-in Approach

• In a plug-in approach, the idea is to encode $x_{t+1}$ by selecting a context from $x^t$ that strikes the "right balance" between conditional entropy and model cost
  - a long context reduces the entropy by introducing more conditioning
  - but the longer the context, the smaller its occurrence counts, and the statistics we learn from them are less reliable
  - for each time $t$, it is only necessary to select a context along the path $x_t x_{t-1} x_{t-2}\cdots$, not a whole tree
• A plug-in twice-universal scheme consists of a context selection rule, and a coding scheme based on the statistics stored at the selected context (e.g., the KT estimator)
• The context selection rule is also a tool for model selection for purposes other than data compression

Context Algorithm

• The algorithm consists of three interleaved stages:
  - growing a tree that captures, essentially, all occurrences of each symbol at each node
  - a context selection rule that selects for $x_{t+1}$ a context $s_t(x^t)$ from the tree $T_t$ grown from $x^t$
  - a KT sequential probability assignment for $x_{t+1}$ based on the counts stored at $s_t(x^t)$
• For each new symbol, we first encode, then update $T_t \to T_{t+1}$ (think of the decoder!)
  - the update consists of incrementing the occurrence counts for all the nodes along the path $x_t x_{t-1} x_{t-2}\cdots$
  - when we reach a leaf that was already visited, we extend the tree one level and initialize the count of $x_{t+1}$ to 1 (the others remain 0)
⇒ only a few initial occurrences are missing, so we can basically assume that for all nodes $s$ and all symbols $a$, the count $n_{x^t}(a\mid s)$ is available
  - nodes on other paths need not be visited

Context Selection Rule

• Basic principle: select the node that would have assigned the shortest code length to its symbol occurrences in the past string
• Most intuitive choice: find the minimum-cost tree $\hat T_t$ for every time $t$, and do KT coding of $x_{t+1}$ conditioned on the context $x_t x_{t-1} x_{t-2}\cdots$ in $\hat T_t$
  - unlike the two-pass scheme, the cost should not include the tree description: the decoder already has it!
  - very complex, and not shown to be asymptotically optimal, even on the average
• Another possibility: for each node $sb\in T_t$, $b\in A$, define
$\Delta_t(sb) = L_t(sb) - L'_t(s)$
where $L_t(sb)$ is the KT code length for the symbols occurring at $sb$, and $L'_t(s)$ is the code length for those same symbols based on the counts gathered at the parent $s$

Choose the deepest node $sb$ in $T_t$ such that $\Delta_t(sb) < 0$

Universality of Context Algorithm

• Easier to analyze: $\Delta_t(sb) = \sum_{a\in A} n_{x^t}(a\mid sb)\,\log\frac{\hat P_{x^t}(a\mid sb)}{\hat P_{x^t}(a\mid s)}$;
the selected tree $T_t$ is the smallest complete super-tree of { the deepest nodes $w$ in $T_t$ s.t. $\Delta_t(w) \ge C\log(t+1)$ }
• Theorem: for any minimal complete tree $T$ with $k$ leaves defining a tree source $P_T(x^n)$ with probabilities bounded away from 0, if $C > 2(\alpha+1)$ then the probability assignment $Q$ of the Context Algorithm satisfies
$\frac1n E_T\big[\log\frac{P_T(x^n)}{Q(x^n)}\big] \le k(\alpha-1)\frac{\log n}{2n} + O(1/n)$
and, moreover,
$\frac1n\log\frac{P_T(x^n)}{Q(x^n)} \le k(\alpha-1)\frac{\log n}{2n} + O(1/n)$
with $P_T$-probability 1

Idea of the Proof

• The proof is based on the fact that
$\sum_{t=1}^\infty P_T\{x^t : T_t \ne T\}\,\log t < \infty$
and so the contribution of the "bad" sequences is $O(1/n)$
• Two (non-disjoint) classes of errors:
  - overestimation: the selected tree contains an internal node which is a leaf of the true tree; taken care of by the penalty term
  - underestimation: a leaf of the selected tree is an internal node of the true tree; requires large-deviations techniques

PPM algorithm

• A context selection rule very popular in the data compression community: choose the shortest context that has occurred more than a certain number of times
  - rationale: such a context has gathered enough statistics to be reliable
  - the rule is totally ad hoc: if the best tree model is short, it will tend to overestimate (think of data generated by an i.i.d. source!)
• A family of algorithms based on variations of this selection rule is called PPM (Prediction by Partial Matching)
  - it is a very popular scheme for text compression and yields some of the best compression ratios
  - however, algorithms with a stronger theoretical basis tend to do better

Application to Statistics: the MDL Principle

• Model selection is probably the most important problem in statistical inference
• Minimum Description Length (MDL) principle of statistical inference: choose the model class that provides the shortest code length for the model and for the data in terms of the model ⇒ universal coding theory provides the yardstick to measure this code length
• Rationale: models serve as tools to describe regularities in the data
  - we should use the simplest explanation for the data, but not too simple
• Strong consistency has been shown in various settings; MDL solves the problem of model order selection while avoiding artificial penalty terms
• Bayesian interpretation through mixture codes
  - however, the NML code cannot be explained as a mixture
• Other interpretations: the maximum-entropy principle

Lossless Source Coding

6. Sequential Decision Problems

The Sequential Decision Problem

• The framework
  - observations: $x^n = x_1 x_2 \cdots x_n$, $x_t\in A$
  - corresponding actions: $b^n = b_1 b_2 \cdots b_n$, $b_t\in B$
  - instantaneous losses $\ell(b_t, x_t)$ accumulate over time: $L(x^n) = \sum_{t=1}^n\ell(b_t, x_t)$
• On-line (sequential) strategy
  - $\{b_t\}$: the action $b_t$ is decided before observing $x_t$
  - possibly randomized: $\{p_t(b_t\mid x^{t-1}, b^{t-1})\}$
• The goal: as $n\to\infty$, approximate the performance of the best strategy in a given reference class, for arbitrary $x^n$ (deterministic setting)
  - excess loss w.r.t. the reference: regret or redundancy
  - the reference class (or expert set) may reflect limited resources

Prediction with expert advice

[figure: experts $1,\dots,\beta$ each accumulate a score (loss) on $x_1 x_2\cdots x_n$; the best one ("winner") achieves $L_{\min}(x^n)$, while the on-line strategy achieves $L_{seq}(x^n)$]

Goal: $L_{seq}(x^n) \le L_{\min}(x^n) + e(n)$

• Most general setting: reference class = a set of generic "experts"
• Basic principle for the on-line strategy: select an expert's prediction randomly, with probability dependent on its accumulated loss

Sequential Decision Problem: Examples

• Binary prediction with Hamming loss
  - $x_1 x_2 \cdots x_n$ is a binary sequence ($|A| = 2$)
  - the action: predict either $b_t = 0$ or $1$ (deterministic), or assign a probability $p_t$ to 1 (randomized strategy) ($|B| = 2$)
  - the loss:
    · deterministic strategy: $\ell(b_t, x_t) = 0$ if $b_t = x_t$ and 1 otherwise ⇒ accumulated loss = total number of prediction errors
    · randomized strategy: $E[\ell(b_t, x_t)] = |x_t - p_t|$
• A less trivial example: lossless data compression
  - $x_1 x_2 \cdots x_n$ is the data to encode, over a finite alphabet $A$
  - the (deterministic) action $b_t$ is a probability distribution assigned to $x_t$: $b_t = \{p_t(\cdot\mid x_1 x_2\cdots x_{t-1})\}$ ($B$ is continuous and vector-valued!)
  - the loss: $\ell(b_t, x_t) = -\log p_t(x_t\mid x_1 x_2\cdots x_{t-1})$ (given the assigned distribution, an encoder can generate a code word of length $\ell(b_t, x_t)$)
  - the accumulated loss (total code length) is $-\log$ of the probability assigned to $x_1 x_2 \cdots x_n$

Exponential-Weighting Algorithm

• The most general scheme for on-line expert selection
  - here, we assume $A$ and $B$ finite and the loss bounded by $\ell_{\max}$
• Let $L_F(x^t)$ = loss of expert $F\in\mathcal F$ accumulated through time $t$. At time $t+1$, choose the action suggested by expert $F$ with probability
$P_{t+1}(F) = \frac{e^{-\eta L_F(x^t)}}{\sum_{F'\in\mathcal F} e^{-\eta L_{F'}(x^t)}}$   ($\eta$ = a given positive constant)
Then,
$L_{ew}(x^n) \le \min_{F\in\mathcal F} L_F(x^n) + \frac{\ln\beta}{\eta} + \frac{n\,\eta\,\ell_{\max}^2}{8}$   ($\beta$ = number of experts)
⇒ with $\eta = \sqrt{8\ln\beta/n}\,\big/\ell_{\max}$, the normalized excess loss over the best expert is
$\ell_{\max}\sqrt{(\ln\beta)/(2n)}$ (a horizon-dependent scheme; a sketch follows below)
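A small sketch (added) of the randomized exponential-weighting forecaster for a finite expert set, using the horizon-dependent $\eta$ above; the experts here are arbitrary functions of the past, and the example instantiates binary prediction with Hamming loss and the two constant experts.

```python
import math, random

def exp_weights(x, experts, loss, eta):
    """At each step, pick an expert with probability proportional to
    exp(-eta * accumulated loss) and follow its advice."""
    cum = [0.0] * len(experts)                 # accumulated loss of each expert
    total = 0.0
    for t, xt in enumerate(x):
        weights = [math.exp(-eta * c) for c in cum]
        f = random.choices(range(len(experts)), weights)[0]
        advice = [e(x[:t]) for e in experts]
        total += loss(advice[f], xt)           # loss of the followed expert
        for i, a in enumerate(advice):
            cum[i] += loss(a, xt)
    return total, min(cum)

# Binary prediction with Hamming loss and the two constant experts "0" and "1".
x = [random.random() < 0.3 for _ in range(1000)]
experts = [lambda past: 0, lambda past: 1]
eta = math.sqrt(8 * math.log(len(experts)) / len(x))    # l_max = 1
print(exp_weights(x, experts, lambda b, xt: int(b != xt), eta))
```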

Horizon-Free Exponential Weighting

• Horizon-free variant: divide the data into blocks of exponentially growing size, and apply the horizon-dependent algorithm to each block ⇒ it is easy to see that the overall excess loss increases only by a constant factor, for all $n$

Example: Binary Prediction with Constant Experts

• Two experts: one always says "0", the other always says "1"
  - analogous to a memoryless model in data compression
• $L_{\min}(x^n) = \min(n_0, n_1)$
• Exponential weighting gives $P_{t+1}(0) = \frac{1}{1 + e^{-\eta[n_0(x^t) - n_1(x^t)]}}$
• Horizon-free approximation: choose the current winner, except when the counts are within $e(t)$ of a tie, with $e(t) = O(\sqrt{1/n})$, in which case randomize ⇒ similar redundancy, different constant
  [figure: $2P_t(0)-1$ as a function of the count difference $\Delta_{0/1}$, ramping from $-1$ to $+1$ over a region of width $e(t)$ around a tie]
• Minimum worst-case redundancy: draw $x_{t+1} x_{t+2} \cdots x_n$ at random and predict the winner in the overall sequence of length $n$

FS reference strategies (L-Z framework)

• A given FSM is driven by the observations $\{x_t\}$: $S$ = set of states, $f$ = next-state function, $s_1$ = fixed initial state, $s_{t+1} = f(s_t, x_t)$
• The reference strategy is allowed to vary following the FSM: $b_t = g(s_t)$
  - example: a one-state machine = constant strategy
• Best $g$ for the specific $x^n$:
$g(s) = \arg\min_{b\in B}\hat E_x[\ell(b,x)\mid s]$   (expectation w.r.t. the conditional empirical distribution)
⇒ the normalized regret vanishes by applying a single-state strategy to the sub-sequence observed at each state
• Take the best FSM for $x^n$, consider the normalized loss as $n\to\infty$, and let $|S|\to\infty$
• For log loss: the FS "compressibility" of an infinite individual sequence
  - efficiently achievable: the LZ data compression scheme

FS Predictability

• Similarly, FS "predictability"
  - a sequential "LZ-like" decision scheme performs essentially as well as the best FSM (of any size!) for the sequence
• Example: $x^{12}$ = 0 1 01 00 010 011 (shown with its incremental parsing); the "context" at which the 8th decision is made is the node of the parse tree reached by the current phrase
  [figure: the LZ parse tree for this sequence, with per-node counts $n_0$, $n_1$ used to make the predictions]

Average Loss

• When the goal is to minimize the average number of prediction errors, the redundancy can be made much smaller than in the worst pointwise case
  - this is different from data compression (log loss)!
• For example, for binary prediction with the two constant experts "0" and "1", the redundancy is $\frac1n E_\theta[\#\mathrm{errors}] - \min(\theta, 1-\theta)$, and it is $O(1/n)$ for a majority predictor without randomization
Proof: assuming (without loss of generality) $p(1) = \theta < 0.5$,
$P_t(\mathrm{error}) = \theta\,P_\theta\{\hat x_t = 0\} + (1-\theta)\,P_\theta\{\hat x_t = 1\} = \theta + (1-2\theta)\,P_\theta\{\hat x_t = 1\}$ ⇒
$E_\theta[\#\mathrm{errors}] = \sum_{t=1}^n P_t(\mathrm{error}) = n\theta + (1-2\theta)\sum_{t=1}^n P_\theta\{\hat x_t = 1\}$ ⇒
$\frac1n E_\theta[\#\mathrm{errors}] - \theta < \frac{1-2\theta}{n}\sum_{t=1}^\infty P_\theta\{n_{x^t}(1)\ge n_{x^t}(0)\} = O(1/n)$
since the summands vanish exponentially fast (Chernoff)

Lossless Source Coding

7. Lossless Image Compression

Lossless Image Compression (the real thing…)

[diagram: input image → Compress → 010010... → store/transmit → Decompress → output image, bit-for-bit identical to the input]

• Some applications of lossless image compression:
  - images meant for further analysis and processing (as opposed to just human perception): medical, space
  - images where loss might have legal implications: medical
  - images obtained at great cost
  - applications with intensive editing and repeated compression/decompression cycles
  - applications where the desired quality of the rendered image is unknown at acquisition time
• A new international standard (ISO/IEC JPEG committee): JPEG-LS

Universality vs. Prior Knowledge

• Applying universal algorithms for tree models directly to real images yields poor results
  - some structural symmetries typical of images are not captured by the model
  - a universal model has an associated "learning cost": why learn something we already know?
• Modeling approach: limit the model class by use of "prior knowledge"
  - for example, images tend to be a combination of smooth regions and edges
  - predictive coding was successfully used for years: it encodes the difference between a pixel and a predicted value of it
  - prediction errors tend to follow a Laplacian distribution ⇒ AR model + Laplacian, where both the center and the decay are context dependent
  - prediction = fixed prediction + adaptive correction

Models for Images

• Continuous-tone images
  - gray scale: a 2D array of pixel intensity values (integers) in a given range $[0..\alpha-1]$ (often $\alpha = 256$)
  - color: a 2D array of vectors (e.g., triplets) whose coordinates represent intensity in a given color space (e.g., RGB, YUV); similar principles apply
• In practice, contexts are formed out of a finite subset of the past sequence: a causal template of neighbors $c$, $b$, $d$ (previous row) and $a$ (current row) around the current sample $x$
• Conditional probability model for prediction errors: the two-sided geometric distribution (TSGD), a "discrete Laplacian":
$P(e) = c_0\,\theta^{|e+s|}$, $\theta\in(0,1)$, $s\in[0,1)$
  - the shift $s$ is constrained to $[0,1)$ by an integer-valued adaptive correction (bias cancellation) of the fixed predictor

Complexity Constraints

• Are sophisticated models worth the price in complexity?
  - Algorithm Context and CTW are linear-time algorithms for tree sources of limited depth, but quite expensive in practice
  - even arithmetic coding is not something that a practitioner will easily buy in many applications!
• Is high complexity required to approach the best possible compression?
• The idea in JPEG-LS: apply judicious modeling to reduce complexity, rather than to improve compression
  - the modeling/coding separation paradigm is less neat without complex models or arithmetic coding

The LOCO-I algorithm

• JPEG-LS is based on the LOCO-I algorithm: LOw COmplexity LOssless COmpression for Images
• Basic components:
  - fixed + adaptive prediction
  - conditioning contexts based on quantized gradients
  - a two-parameter conditional probability model (TSGD)
  - low-complexity adaptive coding matched to the model (variants of Golomb codes)
  - run-length coding in flat areas, to address the drawback of symbol-by-symbol coding

JPEG-LS (LOCO-I Algorithm): Block Diagram

[block diagram: image samples and the causal template (c, b, d; a, x) feed a gradient computation and context modeler, and a fixed predictor; an adaptive correction refines the predicted value; in regular mode, prediction errors and code parameters go to a Golomb coder; a flat-region test switches to run mode, where run lengths are counted, modeled, and sent to a run coder; both modes produce the compressed bit-stream]

Fixed Predictor

• Causal template for prediction and conditioning: past samples $c$, $b$, $d$ in the previous row and $a$ to the left of the next pixel $x_{i+1}$
• Predicted value:
$\hat x_{i+1} = \min(a,b)$ if $c\ge\max(a,b)$;  $\hat x_{i+1} = \max(a,b)$ if $c\le\min(a,b)$;  $\hat x_{i+1} = a+b-c$ otherwise
i.e., the median of $a$, $b$, and $a+b-c$ (a sketch follows below)
• Nonlinear, has some "edge detection" capability:
  - predicts $b$ at a "vertical edge"
  - predicts $a$ at a "horizontal edge"
  - predicts $a+b-c$ in a "smooth region"
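A direct transcription of the predictor into Python (a sketch added here; JPEG-LS of course specifies much more around it).

```python
def med_predict(a, b, c):
    """LOCO-I/JPEG-LS fixed predictor: the median of a, b and a+b-c,
    written in the edge-detecting form used above."""
    if c >= max(a, b):
        return min(a, b)
    if c <= min(a, b):
        return max(a, b)
    return a + b - c

# A few sanity checks: vertical edge, horizontal edge, smooth ramp.
print(med_predict(10, 50, 10))   # -> 50 (follows b across a vertical edge)
print(med_predict(50, 10, 10))   # -> 50 (follows a across a horizontal edge)
print(med_predict(20, 30, 25))   # -> 25 (planar prediction a + b - c)
```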

Parameter Reduction and Adaptivity

• The goal in selecting the number of parameters: capture high-order dependencies without excessive model cost
• Adaptive coding is needed, but arithmetic coding is ruled out (if possible…) due to complexity constraints
• A solution that addresses both issues: model prediction residuals with a TSGD, $P(e) = c_0\,\theta^{|e+s|}$, $\theta\in(0,1)$, $s\in[0,1)$
  - only two parameters per context:
    · "sharpness" (rate of decay, variance, etc.)
    · shift (often non-zero in a context-dependent scheme)
  - allows a large number of contexts (365 in JPEG-LS)
  - suited to low-complexity adaptive coding

Context Determination

Causal template: $c$, $b$, $d$ in the previous row, $a$ to the left of the next pixel $x$

• Look at the gradients $g_1 = d-b$, $g_2 = b-c$, $g_3 = c-a$
  - gradients capture the level of activity (smoothness, edginess) surrounding a pixel
  - $g_1$, $g_2$, $g_3$ are quantized into 9 regions, labeled $-4,\dots,4$, determined by 3 thresholds $S_1$, $S_2$, $S_3$ (region boundaries at $-S_3, -S_2, -S_1, 0, S_1, S_2, S_3$)
  - maximizing the information on $x_{i+1}$ suggests equiprobable regions
• Symmetric contexts are merged: $P(e\mid[q_1,q_2,q_3]) \leftrightarrow P(-e\mid[-q_1,-q_2,-q_3])$
• A fixed number of contexts: $(9^3+1)/2 = 365$ (a sketch follows below)
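A sketch (added) of the context computation; the thresholds 3, 7, 21 are the JPEG-LS defaults for 8-bit samples, and the final count check confirms that only 365 merged triples can occur.

```python
from itertools import product

def quantize(g, s1=3, s2=7, s3=21):
    """Map a gradient into one of 9 regions labeled -4..4 (thresholds: 8-bit JPEG-LS defaults)."""
    sign = -1 if g < 0 else 1
    g = abs(g)
    q = 0 if g == 0 else 1 if g < s1 else 2 if g < s2 else 3 if g < s3 else 4
    return sign * q

def context(a, b, c, d):
    """Quantized-gradient context for the causal template (a, b, c, d).
    Returns the merged triple [q1, q2, q3] and the sign used to flip the prediction error."""
    q = (quantize(d - b), quantize(b - c), quantize(c - a))
    if q < (0, 0, 0):                      # first nonzero component negative ...
        return tuple(-v for v in q), -1    # ... merge with the mirrored context
    return q, 1

# Only (9**3 + 1) // 2 = 365 distinct merged triples can occur:
merged = set()
for g in product(range(-4, 5), repeat=3):
    merged.add(g if g >= (0, 0, 0) else tuple(-v for v in g))
print(context(100, 107, 98, 90), len(merged))   # ((3, -3, 1), -1) 365
```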

Coding of TSGD's: Golomb Codes

• Optimal prefix codes for TSGDs are built out of Golomb codes for the nonnegative integers
• Given a code parameter $m$ and an integer $j$: $j \to \{\,j\bmod m$ (in binary), $\lfloor j/m\rfloor$ (in unary)$\,\}$
  - example: $j = 19$, $m = 4$: $19\bmod 4 = 3$, $\lfloor 19/4\rfloor = 4$ → 1 1 0 0 0 0 1
  - Golomb codes are optimal for geometric distributions
  - JPEG-LS uses the subfamily of Golomb power-of-2 (PO2) codes, $m = 2^k$:
$G_{2^k}: n \to \{\,(n\bmod 2^k)$ (binary), $\lfloor n/2^k\rfloor$ (unary)$\,\}$, $n\ge 0$, $k$ = code parameter
  - encoding is very simple and explicit: no tables (a sketch follows below)
  - very well suited to adaptive coding: adapt $k$
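An illustrative encoder for the PO2 case (added); the bit order follows the example above: remainder bits first, then the quotient in unary as zeros closed by a one.

```python
def golomb_po2(n, k):
    """Golomb power-of-2 code of n >= 0 with parameter k (m = 2**k):
    the k low-order bits of n in binary, then n >> k in unary (zeros closed by a one).
    (General Golomb codes need a truncated binary remainder; the PO2 case does not.)"""
    q, r = n >> k, n & ((1 << k) - 1)
    return (format(r, f"0{k}b") if k else "") + "0" * q + "1"

print(golomb_po2(19, 2))   # '1100001': 19 mod 4 = 3 -> '11', floor(19/4) = 4 -> '00001'
```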

Adaptive Coding of TSGD's in JPEG-LS

• Optimal prefix codes for TSGDs are approximated in JPEG-LS by applying the Golomb-PO2 subfamily to a mapped error value, $e \to M(e)$ or $e \to M(-1-e)$:
$0, -1, +1, -2, +2, \dots \to 0, 1, 2, 3, 4, \dots$ (the Rice mapping $M(e)$), or
$-1, 0, -2, +1, -3, \dots \to 0, 1, 2, 3, 4, \dots$ (the mapping $M(-1-e)$)
• Adaptive code selection (parameter $k$ and mapping):
  - approximation of the optimal strategy, based on ML estimation of the TSGD parameters $\theta$, $s$ through the sufficient statistics
    $A$ = accumulated sum of error magnitudes, $N_-$ = number of negative samples
• The assumption $s\in[0,1)$ is satisfied through the use of an adaptive correction of the predictor, which also uses
    $B$ = accumulated sum of error values, $N$ = total number of samples

Adaptive Coding (cont.)

• For the Golomb-PO2 code with parameter $k$ applied to remapped prediction errors:
  - compute the expected code length explicitly as a function of $\theta$, $s$, and $k$
  - replace the (unknown) $\theta$ and $s$ by their ML estimates, expressed as functions of the sufficient statistics
• The resulting optimal decision regions for $k$:
  - let $\varphi = (\sqrt5+1)/2 \approx 1.618$ (the golden ratio)
  - if $A - N_- \le N\varphi$, use $N_-$ to
    · choose $k = 0$ or $k = 1$
    · choose $s\ge\frac12$ or $s<\frac12$ if $k = 0$ (irrelevant if $k > 0$)
  - if $A - N_- > N\varphi$, choose $k > 1$ such that
$\frac{1}{\varphi^{2^{-k+2}}-1} < \frac{A-N_-}{N} \le \frac{1}{\varphi^{2^{-k+1}}-1}$

Approximation of Decision Regions

• Observe that $\varphi^{2^{-k}}-1 \approx 2^{-k}\ln\varphi$ and $\ln\varphi\approx 0.48\approx 0.5$
⇒ $\frac{1}{\varphi^{2^{-k}}-1}$ is always close to a power of 2 ($\approx 2^{k+1}$)
• Since $N_-/N\approx 0.5$, $k$ can be approximated by $k \cong \lceil\log_2(A/N)\rceil$; estimate $k$ using the trivial loop
for (k=0; (N<<k)<A; k++);
  - if $k = 0$ and $s < -1/2$, encode $M(-e-1)$; otherwise encode $M(e)$
  - only 4 variables per context are needed ($A$, $B$, $N$, $N_-$)
(a combined sketch of the mapping, the $k$ estimate, and the Golomb-PO2 code word follows below)
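Putting the pieces together, a sketch (added) of regular-mode error coding: Rice-map the error, estimate $k$ from the context statistics with the loop above, and emit a Golomb-PO2 code word. The values of A, N, and e below are hypothetical, and the sketch omits the $M(-1-e)$ special case and the bias correction.

```python
def rice_map(e):
    """Map 0, -1, +1, -2, +2, ... to 0, 1, 2, 3, 4, ..."""
    return 2 * e if e >= 0 else -2 * e - 1

def estimate_k(A, N):
    """Smallest k with N * 2**k >= A (the C one-liner 'for (k=0; (N<<k)<A; k++);')."""
    k = 0
    while (N << k) < A:
        k += 1
    return k

def golomb_po2(n, k):
    """k low-order bits of n, then n >> k in unary (zeros closed by a one)."""
    return (format(n & ((1 << k) - 1), f"0{k}b") if k else "") + "0" * (n >> k) + "1"

# Encode one prediction error in a context with accumulators A (sum of |e|) and N (samples).
A, N = 200, 25            # hypothetical context statistics
e = -5                    # hypothetical prediction error
k = estimate_k(A, N)
print(k, golomb_po2(rice_map(e), k))
```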

Embedded Run-length Coding

• Aimed at overcoming the basic limitation of 1 bit/pixel inherent to pixel-wise prefix codes, which is most damaging in low-entropy regions
• Creates super-symbols representing runs of the same pixel value in the "flat region" $a = b = c = d$ ⇒ the special context $[q_1,q_2,q_3] = [0,0,0]$
• A run of the value $a$ is counted, and the count is encoded using block-MELCODE, a fast adaptation technique for Golomb-type codes

LOCO-I in One Page

loop:
• get the context pixels $a$, $b$, $c$, $d$ and the next pixel $x$
• compute the gradients $d-b$, $b-c$, $c-a$ and quantize ⇒ $[q_1, q_2, q_3]$, sign
• $[q_1, q_2, q_3] = 0$?  YES: process the run state.  NO: continue
• $x_{pred}$ = predict($a$, $b$, $c$)
• update the correction value for the context; correct $x_{pred}$
• $e = x - x_{pred}$; if sign $< 0$ then $e = -e$
• estimate $k$ for the context
• remap $e \to M(e)$ or $e \to M(-1-e)$
• encode the mapped value with Golomb-PO2($k$)
• back to loop

run state:
• count the run of $a$ until $x \ne a$ ⇒ run length $L$
• encode $L$ using block-MELCODE
• update the MELCODE state

Compression/Complexity trade-off

[chart: relative running time (approx.) vs. average bits/symbol on a benchmark image set, for JPEG-H, LOCO-I, FELICS, ALCM, CALIC-H, JPEG-A, JSLUG, and CALIC-A; LOCO-I combines low running time with compression close to the best schemes shown]
