Lossless Source Coding
Gadiel Seroussi, Marcelo Weinberger
Information Theory Research, Hewlett-Packard Laboratories, Palo Alto, California

Notation
• $A$ = discrete (usually finite) alphabet
• $\alpha = |A|$ = size of $A$ (when finite)
• $x_1^n = x^n = x_1 x_2 x_3 \cdots x_n$ = finite sequence over $A$
• $x_1^\infty = x^\infty = x_1 x_2 x_3 \cdots x_t \cdots$ = infinite sequence over $A$
• $x_i^j = x_i x_{i+1} \cdots x_j$ = sub-sequence ($i$ sometimes omitted when $i = 1$)
• $p_X(x) = \mathrm{Prob}(X = x)$ = probability mass function (PMF) on $A$ (subscript $X$ and argument $x$ dropped if clear from context)
• $X \sim p(x)$: $X$ obeys PMF $p(x)$
• $E_p[F]$ = expectation of $F$ w.r.t. PMF $p$ (subscript and brackets may be dropped)
• $\hat{p}_{x_1^n}(x)$ = empirical distribution obtained from $x_1^n$
• $\log x$ = logarithm to base 2 of $x$, unless another base is specified
• $\ln x$ = natural logarithm of $x$
• $H(X)$, $H(p)$ = entropy of a random variable $X$ or PMF $p$, in bits; also $H(p) = -p \log p - (1-p)\log(1-p)$, $0 \le p \le 1$: the binary entropy function
• $D(p\|q)$ = relative entropy (information divergence) between PMFs $p$ and $q$

Lossless Source Coding: 3. Universal Coding

Universal Modeling and Coding
• So far, it was assumed that a model of the data is available, and we aimed at compressing the data optimally w.r.t. that model.
• By Kraft's inequality, the models we use can be expressed in terms of probability distributions: for every UD code with length function $L(s)$ we have $\sum_{s \in A^n} 2^{-L(s)} \le 1$, where $s$ ranges over the strings of length $n$ over the finite alphabet $A$ $\Rightarrow$ a code defines a probability distribution $P(s) = 2^{-L(s)}$ over $A^n$.
• Conversely, given a distribution $P(\cdot)$ (a model), there exists a UD code that assigns $\lceil -\log P(s) \rceil$ bits to $s$ (Shannon code).
• Hence, $P(\cdot)$ serves as a model to encode $s$, and every code has an associated model.
  - a (probabilistic) model is a tool to "understand" and predict the behavior of the data

Universal Modeling and Coding (cont.)
• Given a model $P(\cdot)$ on $n$-tuples, arithmetic coding provides an effective means to sequentially assign a code word of length close to $-\log P(s)$ to $s$.
  - we don't need to see the whole string $s = x_1 x_2 \cdots x_n$: encode $x_t$ using the conditional probability
    $p(x_t \mid x_1 x_2 \cdots x_{t-1}) = \frac{P(x^t)}{P(x^{t-1})} = \frac{\sum_{u \in A^{n-t}} P(x^t u)}{\sum_{v \in A^{n-t+1}} P(x^{t-1} v)}$
  - the model probabilities can vary arbitrarily and "adapt" to the data
  - in a strongly sequential model, $P(x^t)$ is independent of $n$
• CODING SYSTEM = MODEL + CODING UNIT
  - two separate problems: design a model and use it to encode (a minimal sketch of this separation follows below)
• We will view data compression as a problem of assigning probabilities to data.
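The model/coding-unit split can be made concrete with a few lines of code. The sketch below is our own Python illustration, not part of the slides: the function names, the fixed parameter $\theta = 0.3$, and the block length $n = 8$ are assumptions made for the example. It computes the ideal sequential code length $\sum_t -\log_2 p(x_t \mid x^{t-1})$, which an arithmetic coder approaches to within a couple of bits, and checks that the induced probabilities $P(s) = 2^{-(-\log P(s))}$ sum to 1 over $A^n$, i.e., that a model is exactly a probability assignment on $A^n$.

    import math
    from itertools import product

    def ideal_code_length(seq, prob_of_one):
        """Ideal sequential code length in bits: sum of -log2 p(x_t | x^{t-1})."""
        total = 0.0
        for t, symbol in enumerate(seq):
            p1 = prob_of_one(seq[:t])          # model: probability that the next symbol is 1
            p = p1 if symbol == 1 else 1.0 - p1
            total += -math.log2(p)
        return total

    # A fixed (non-adaptive) Bernoulli model with parameter theta = 0.3 (chosen arbitrarily).
    theta = 0.3
    model = lambda prefix: theta

    # The code lengths -log P(s) induce probabilities 2^{-(-log P(s))} = P(s), summing to 1 over A^n.
    n = 8
    total_mass = sum(2.0 ** -ideal_code_length(s, model) for s in product((0, 1), repeat=n))
    print(round(total_mass, 6))                # prints 1.0

Replacing the fixed `model` with any sequential (adaptive) conditional-probability rule leaves the coding unit untouched; only the model changes.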
Coding with Model Classes
• Universal data compression deals with the optimal description of data in the absence of a given model.
  - in most practical applications, the model is not given to us
• How do we make the concept of "optimality" meaningful?
  - there is always a code that assigns just 1 bit to the data at hand!
• The answer: model classes.
• We want a "universal" code to perform as well as the best model in a given class $\mathcal{C}$ for any string $s$, where the best competing model changes from string to string.
  - universality makes sense only w.r.t. a model class
• A code with length function $L(x^n)$ is pointwise universal w.r.t. a class $\mathcal{C}$ if the pointwise redundancy
  $R_{\mathcal{C}}(L, x^n) = \frac{1}{n}\left[L(x^n) - \min_{C \in \mathcal{C}} L_C(x^n)\right] \to 0$ as $n \to \infty$.

How to Choose a Model Class?
• Universal coding tells us how to encode optimally w.r.t. a class; it doesn't tell us how to choose the class!
• Some possible criteria:
  - complexity
  - prior knowledge about the data
  - some popular models were already presented
• We will see that the bigger the class, the slower the best possible convergence rate of the redundancy to 0.
  - in this sense, prior knowledge is of paramount importance: don't learn what you already know!
• Ultimately, the choice of model class is an art.

Example: Bernoulli Models
• $A = \{0,1\}$, $\mathcal{C} = \{P_\theta,\ \theta \in \Lambda = [0,1]\}$: i.i.d. distributions with parameter $p(1) = \theta$.
• $\min_{C \in \mathcal{C}} L_C(x^n) = \min_{\theta \in \Lambda} \log \frac{1}{P_\theta(x^n)} = \log \frac{1}{P_{\hat{\theta}(x^n)}(x^n)} = n\hat{H}(x^n)$, where $\hat{\theta}(x^n)$ is the ML estimate of $\theta$ and $\hat{H}(x^n)$ is the empirical entropy of $x^n$.
• Therefore, our goal is to find a code such that $\frac{L(x^n)}{n} - \hat{H}(x^n) \to 0$.
• Here is a trivial example of a universal code: use $\lceil \log(n+1) \rceil$ bits to encode $n_1$, and then "tune" your Shannon code to the parameter $\theta = n_1/n$, which is precisely the ML estimate $\Rightarrow$ this is equivalent to telling the decoder what the best model is!

Bernoulli Models: Enumerative Code
• The total code length of this two-part code satisfies
  $\frac{L(x^n)}{n} \le -\frac{n_0}{n}\log\frac{n_0}{n} - \frac{n_1}{n}\log\frac{n_1}{n} + \frac{\log(n+1)+2}{n} = \hat{H}(x^n) + \frac{\log(n+1)+2}{n}$, where $\frac{\log(n+1)+2}{n} \to 0$.
  - wasteful: knowledge of $n_1$ already discards part of the sequences, and we should not reserve code words for them
• A slightly better code: enumerate all $\binom{n}{n_1}$ sequences with $n_1$ ones, and describe the sequence by its index:
  $\frac{L'(x^n)}{n} = \frac{1}{n}\left\lceil \log \binom{n}{n_1} \right\rceil + \frac{\lceil \log(n+1) \rceil}{n} \le \hat{H}(x^n) + \frac{\log(n+1)+2}{n}$
• Stirling: $\binom{n}{n_1} \le \frac{2^{n\hat{H}(x^n)}}{\sqrt{2\pi\, n_1 n_0 / n}}$ $\Rightarrow$ for sequences such that $n_1/n$ is bounded away from 0 and 1,
  $\frac{L'(x^n)}{n} \le \hat{H}(x^n) + \frac{\log n}{2n} + O(1/n)$.

Bernoulli Models: Mixture Code
• The enumerative code length is close to $-\log Q'(x^n)$, where $Q'(x^n) = \binom{n}{n_1}^{-1} \cdot \frac{1}{n+1}$.
• Is $Q'(\cdot)$ a probability assignment?
  $\int_0^1 P_\theta(x^n)\,d\theta = \int_0^1 \theta^{n_1}(1-\theta)^{n_0}\,d\theta = \binom{n}{n_1}^{-1} \cdot \frac{1}{n+1} = Q'(x^n)$ $\Rightarrow$ $\sum_{x^n \in A^n} Q'(x^n) = 1$.
• Aha!! So we get the same result by mixing all the models in the class and using the mixture as a model!
• This uniform mixture is very appealing because it can be implemented sequentially, independently of $n$:
  $Q'(x^n) = \frac{n_0!\, n_1!}{(n+1)!} = \prod_{t=0}^{n-1} q'(x_{t+1} \mid x^t)$, where $q'(0 \mid x^t) = \frac{n_0(x^t)+1}{t+2}$
  - Laplace's rule of succession!
  - this results in a "plug-in" approach: estimate $\theta$ and plug it in

Bernoulli Models: Mixture Code (cont.)
• Maybe we can make Stirling work for us with a different mixture, one that puts more weight on the problematic regions where $n_1/n$ approaches 0 or 1?
• Consider Dirichlet's density $w(\theta) = \frac{1}{\Gamma(\frac{1}{2})^2 \sqrt{\theta(1-\theta)}}$ $\Rightarrow$
  $Q''(x^n) = \int_0^1 P_\theta(x^n)\, w(\theta)\, d\theta = \frac{1}{\Gamma(\frac{1}{2})^2} \int_0^1 \theta^{n_1-\frac{1}{2}}(1-\theta)^{n_0-\frac{1}{2}}\, d\theta = \frac{\Gamma(n_1+\frac{1}{2})\,\Gamma(n_0+\frac{1}{2})}{n!\,\Gamma(\frac{1}{2})^2}$
  $\Rightarrow$ by Stirling's formula for the Gamma function, for all sequences $x^n$,
  $\frac{L''(x^n)}{n} = -\frac{1}{n}\log Q''(x^n) \le \hat{H}(x^n) + \frac{\log n}{2n} + O(1/n)$. Can we do any better?
• This mixture also has a plug-in interpretation:
  $Q''(x^n) = \prod_{t=0}^{n-1} q''(x_{t+1} \mid x^t)$, where $q''(0 \mid x^t) = \frac{n_0(x^t)+\frac{1}{2}}{t+1}$
  - both plug-in rules are exercised in the short numerical sketch below
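Here is a minimal Python sketch of the two plug-in rules above; it is our own illustration, and the synthetic input sequence and the length $n = 4096$ are assumptions, not part of the slides. It encodes a binary sequence with the Laplace rule $q'(0\mid x^t) = \frac{n_0(x^t)+1}{t+2}$ and with the Krichevsky-Trofimov rule $q''(0\mid x^t) = \frac{n_0(x^t)+\frac{1}{2}}{t+1}$ (the Dirichlet-1/2 mixture), and compares the per-symbol redundancy over $\hat{H}(x^n)$ with the $\frac{\log n}{2n}$ target.

    import math

    def plugin_code_length(seq, alpha):
        """Sequential plug-in code length in bits for a binary sequence, using
        q(next = 0 | x^t) = (n0(x^t) + alpha) / (t + 2*alpha):
        alpha = 1   -> Laplace rule (uniform mixture),
        alpha = 1/2 -> Krichevsky-Trofimov rule (Dirichlet(1/2,1/2) mixture)."""
        n0, length = 0, 0.0
        for t, symbol in enumerate(seq):
            q0 = (n0 + alpha) / (t + 2 * alpha)
            length += -math.log2(q0 if symbol == 0 else 1.0 - q0)
            n0 += (symbol == 0)
        return length

    def empirical_entropy(seq):
        """Binary empirical entropy H(n1/n), in bits per symbol."""
        p = sum(seq) / len(seq)
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    n = 4096
    x = [1 if (7 * i) % 10 < 3 else 0 for i in range(n)]   # synthetic sequence, 30% ones
    target = math.log2(n) / (2 * n)                        # (log n)/(2n)
    for name, alpha in (("Laplace", 1.0), ("KT", 0.5)):
        redundancy = plugin_code_length(x, alpha) / n - empirical_entropy(x)
        print(f"{name}: per-symbol redundancy {redundancy:.5f}  vs (log n)/(2n) = {target:.5f}")

Since both mixtures are exchangeable, the resulting code lengths depend on the input only through $n_0$ and $n_1$, so any sequence with the same composition gives the same numbers.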
Summary of Codes
• Two-part code
  - describe the best parameter and then code based on it: works whenever the number of possible optimal codes in the class is sub-exponential
  - the most natural approach (akin to Kolmogorov complexity), but not sequential
  - not efficient in the example: can be improved by describing the parameter more coarsely $\Rightarrow$ trade-off: spend fewer bits on the parameter and use an approximation of the best parameter
  - alternatively, don't allocate code words to impossible sequences
• Mixture code
  - can be implemented sequentially
  - a suitable prior on the parameter gave the smallest redundancy
• Plug-in codes
  - use a biased estimate of the conditional probability in the model class, based on the data seen so far
  - can often be interpreted as mixtures
• But what's the best we can do?

Normalized Maximum-Likelihood Code
• Goal: find a code that attains the best worst-case pointwise redundancy
  $R_{\mathcal{C}} = \min_L \max_{x^n \in A^n} R_{\mathcal{C}}(L, x^n) = \min_L \max_{x^n \in A^n} \frac{1}{n}\left[L(x^n) - \min_{C \in \mathcal{C}} L_C(x^n)\right]$
• Consider the code defined by the probability assignment
  $Q(x^n) = \frac{2^{-\min_{C \in \mathcal{C}} L_C(x^n)}}{\sum_{x^n \in A^n} 2^{-\min_{C \in \mathcal{C}} L_C(x^n)}}$ $\Rightarrow$ $R_{\mathcal{C}}(L, x^n) = \frac{1}{n}\log\left[\sum_{x^n \in A^n} 2^{-\min_{C \in \mathcal{C}} L_C(x^n)}\right]$
  This quantity depends only on the class!
• Since the redundancy of this code is the same for all $x^n$, it must be optimal in the minimax sense!
• Drawbacks:
  - the sequential probability assignment depends on the horizon $n$
  - hard to compute

Example: NML Code for the Bernoulli Class
• We evaluate the minimax redundancy $R_{\mathcal{C}}$ for the Bernoulli class (a numerical check is sketched at the end of this section):
  $R_{\mathcal{C}} = \frac{1}{n}\log\left[\sum_{x^n \in A^n} 2^{-n\hat{H}(x^n)}\right] = \frac{1}{n}\log\left[\sum_{i=0}^{n} \binom{n}{i} 2^{-n h(i/n)}\right] = \frac{1}{2n}\log\frac{n\pi}{2} + o(1/n)$, with $h(\cdot)$ the binary entropy function.
  - hint: Stirling $+ \int_0^1 \frac{d\theta}{\sqrt{\theta(1-\theta)}} = \Gamma(\tfrac{1}{2})^2 = \pi$
  - typical of sufficient statistics: $P_\theta(x^n) = p(x^n \mid s(x^n))\, p_\theta(s(x^n))$
• $\Rightarrow$ the pointwise redundancy cannot vanish faster than $\frac{\log n}{2n}$ (the rate attained by the Dirichlet mixture or the NML code)

Parametric Model Classes
• A useful limitation of the model class is to assume $\mathcal{C} = \{P_\theta,\ \theta \in \Theta_d\}$, a parameter space of dimension $d$.
• Examples:
  - Bernoulli: $d = 1$; general i.i.d.
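The NML sum for the Bernoulli class is small enough to evaluate directly. Below is a short Python check of our own (the function names and the choice of $n$ values are assumptions for the example): it computes $\frac{1}{n}\log\sum_{i=0}^{n}\binom{n}{i}2^{-n h(i/n)}$ in the log domain and compares it with the asymptotic value $\frac{1}{2n}\log\frac{n\pi}{2}$.

    from math import lgamma, log, log2, pi

    LOG2 = log(2.0)

    def log2_binom(n, i):
        """log2 of the binomial coefficient C(n, i), via lgamma to avoid huge integers."""
        return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)) / LOG2

    def binary_entropy(p):
        return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1.0 - p) * log2(1.0 - p)

    def bernoulli_nml_redundancy(n):
        """Per-symbol minimax (NML) redundancy of the Bernoulli class:
        (1/n) * log2( sum_{i=0..n} C(n, i) * 2^{-n h(i/n)} ),
        evaluated in the log domain so that large n does not underflow."""
        log_terms = [log2_binom(n, i) - n * binary_entropy(i / n) for i in range(n + 1)]
        m = max(log_terms)
        return (m + log2(sum(2.0 ** (t - m) for t in log_terms))) / n

    for n in (100, 1000, 10000):
        exact = bernoulli_nml_redundancy(n)
        asymptotic = log2(n * pi / 2) / (2 * n)
        print(n, round(exact, 6), round(asymptotic, 6))

As n grows, the exact per-symbol value approaches the $\frac{1}{2n}\log\frac{n\pi}{2}$ asymptote, consistent with the slide's $o(1/n)$ correction term.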