Lossless Source Coding

Gadiel Seroussi and Marcelo Weinberger
Information Theory Research, Hewlett-Packard Laboratories, Palo Alto, California

Notation

• $A$ = discrete (usually finite) alphabet
• $a = |A|$ = size of $A$ (when finite)
• $x_1^n = x^n = x_1 x_2 x_3 \cdots x_n$ = finite sequence over $A$
• $x_1^\infty = x^\infty = x_1 x_2 x_3 \cdots x_t \cdots$ = infinite sequence over $A$
• $x_i^j = x_i x_{i+1} \cdots x_j$ = sub-sequence ($i$ sometimes omitted when $i = 1$)
• $p_X(x) = \mathrm{Prob}(X = x)$ = probability mass function (PMF) on $A$ (subscript $X$ and argument $x$ dropped if clear from context)
• $X \sim p(x)$: $X$ obeys PMF $p(x)$
• $E_p[F]$ = expectation of $F$ w.r.t. PMF $p$ (subscript and brackets may be dropped)
• $\hat{p}_{x_1^n}(x)$ = empirical distribution obtained from $x_1^n$
• $\log x$ = logarithm to base 2 of $x$, unless another base is specified
• $\ln x$ = natural logarithm of $x$
• $H(X)$, $H(p)$ = entropy of a random variable $X$ or PMF $p$, in bits; also
• $H(p) = -p \log p - (1-p) \log(1-p)$, $0 \le p \le 1$: the binary entropy function
• $D(p\|q)$ = relative entropy (information divergence) between PMFs $p$ and $q$

3. Universal Coding

Universal Modeling and Coding

• So far we assumed that a model of the data is available, and we aimed at compressing the data optimally w.r.t. that model.
• By Kraft's inequality, the codes we use can be expressed in terms of probability distributions: for every UD code with length function $L(s)$ we have
$$\sum_{s \in A^n} 2^{-L(s)} \le 1,$$
where $s$ ranges over the strings of length $n$ over the finite alphabet $A$. Hence a code defines a probability distribution $P(s) = 2^{-L(s)}$ over $A^n$.
• Conversely, given a distribution $P(\cdot)$ (a model), there exists a UD code that assigns $\lceil -\log P(s) \rceil$ bits to $s$ (the Shannon code).
• Hence $P(\cdot)$ serves as a model to encode $s$, and every code has an associated model.
  - a (probabilistic) model is a tool to "understand" and predict the behavior of the data

Universal Modeling and Coding (cont.)

• Given a model $P(\cdot)$ on $n$-tuples, arithmetic coding provides an effective means to sequentially assign a code word of length close to $-\log P(s)$ to $s$.
  - we don't need to see the whole string $s = x_1 x_2 \cdots x_n$: encode $x_t$ using the conditional probability
$$p(x_t \mid x_1 x_2 \cdots x_{t-1}) = \frac{P(x^t)}{P(x^{t-1})} = \frac{\sum_{u \in A^{n-t}} P(x^t u)}{\sum_{v \in A^{n-t+1}} P(x^{t-1} v)}$$
  - the model probabilities can vary arbitrarily and "adapt" to the data
  - in a strongly sequential model, $P(x^t)$ is independent of $n$
• CODING SYSTEM = MODEL + CODING UNIT
  - two separate problems: design a model, and use the model to encode
• We will view data compression as a problem of assigning probabilities to data.
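The code/model correspondence is easy to check numerically. Below is a minimal Python sketch, not part of the original slides, with a made-up PMF on 2-tuples: it computes Shannon code lengths $\lceil -\log P(s) \rceil$ and verifies that they satisfy Kraft's inequality.

```python
import math

def shannon_lengths(pmf):
    """Shannon code lengths L(s) = ceil(-log2 p(s)) for each string s with p(s) > 0."""
    return {s: math.ceil(-math.log2(p)) for s, p in pmf.items() if p > 0}

# A hypothetical model P() on 2-tuples over A = {0, 1} (numbers made up).
pmf = {"00": 0.5, "01": 0.25, "10": 0.15, "11": 0.10}
lengths = shannon_lengths(pmf)
kraft_sum = sum(2.0 ** -l for l in lengths.values())

print(lengths)           # {'00': 1, '01': 2, '10': 3, '11': 4}
print(kraft_sum <= 1.0)  # True: Kraft holds, so a prefix (hence UD) code exists
```

Conversely, the assignment $P(s) = 2^{-L(s)}$ turns any UD length function into a (sub-)probability over $A^n$, which is the model the code implicitly uses.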
Coding with Model Classes

• Universal data compression deals with the optimal description of data in the absence of a given model.
  - in most practical applications, the model is not given to us
• How do we make the concept of "optimality" meaningful?
  - there is always a code that assigns just 1 bit to the data at hand!
• The answer: model classes.
• We want a "universal" code to perform as well as the best model in a given class $\mathcal{C}$ for any string $s$, where the best competing model changes from string to string.
  - universality makes sense only w.r.t. a model class
• A code with length function $L(x^n)$ is pointwise universal w.r.t. a class $\mathcal{C}$ if the pointwise redundancy
$$R_{\mathcal{C}}(L, x^n) = \frac{1}{n}\left[L(x^n) - \min_{C \in \mathcal{C}} L_C(x^n)\right]$$
tends to 0 as $n \to \infty$.

How to Choose a Model Class?

• Universal coding tells us how to encode optimally w.r.t. a class; it doesn't tell us how to choose the class!
• Some possible criteria:
  - complexity
  - prior knowledge about the data
  - some popular model classes were already presented
• We will see that the bigger the class, the slower the best possible convergence rate of the redundancy to 0.
  - in this sense, prior knowledge is of paramount importance: don't learn what you already know!
• Ultimately, the choice of model class is an art.

Example: Bernoulli Models

• $A = \{0,1\}$, $\mathcal{C} = \{P_\theta,\ \theta \in \Lambda = [0,1]\}$: the i.i.d. distributions with parameter $\theta = p(1)$.
• The best model in the class achieves the empirical entropy:
$$\min_{C \in \mathcal{C}} L_C(x^n) = \min_{\theta \in \Lambda} \log \frac{1}{P_\theta(x^n)} = \log \frac{1}{P_{\hat\theta(x^n)}(x^n)} = n \hat{H}(x^n),$$
where $\hat\theta(x^n)$ is the ML estimate of $\theta$ and $\hat{H}(x^n)$ is the empirical entropy. Therefore, our goal is to find a code such that
$$\frac{L(x^n)}{n} - \hat{H}(x^n) \to 0.$$
• A trivial example of a universal code: use $\lceil \log(n+1) \rceil$ bits to encode $n_1$ (the number of ones in $x^n$), and then "tune" your Shannon code to the parameter $\theta = n_1/n$, which is precisely the ML estimate. This is equivalent to telling the decoder what the best model is!

Bernoulli Models: Enumerative Code

• The total code length of this two-part code satisfies
$$\frac{L(x^n)}{n} \le -\frac{n_0}{n}\log\frac{n_0}{n} - \frac{n_1}{n}\log\frac{n_1}{n} + \frac{\log(n+1)+2}{n} = \hat{H}(x^n) + \frac{\log(n+1)+2}{n},$$
and the redundancy term $(\log(n+1)+2)/n \to 0$.
  - wasteful: knowledge of $n_1$ already rules out part of the sequences, and we should not reserve code words for them
• A slightly better code: enumerate all $\binom{n}{n_1}$ sequences with $n_1$ ones, and describe the sequence by its index:
$$\frac{L'(x^n)}{n} = \frac{1}{n}\left\lceil \log\binom{n}{n_1} \right\rceil + \frac{\lceil \log(n+1) \rceil}{n} \le \hat{H}(x^n) + \frac{\log(n+1)+2}{n}$$
• By Stirling's formula,
$$\binom{n}{n_1} \le \frac{2^{n\hat{H}(x^n)}}{\sqrt{2\pi (n_1 n_0)/n}},$$
so for sequences such that $n_1/n$ is bounded away from 0 and 1,
$$\frac{L'(x^n)}{n} \le \hat{H}(x^n) + \frac{\log n}{2n} + O(1/n).$$

Bernoulli Models: Mixture Code

• The enumerative code length is close to $-\log Q'(x^n)$, where
$$Q'(x^n) = \binom{n}{n_1}^{-1} \cdot \frac{1}{n+1}.$$
• Is $Q'(\cdot)$ a probability assignment? Yes:
$$\int_0^1 P_\theta(x^n)\, d\theta = \int_0^1 \theta^{n_1}(1-\theta)^{n_0}\, d\theta = \binom{n}{n_1}^{-1} \cdot \frac{1}{n+1} = Q'(x^n) \;\Rightarrow\; \sum_{x^n \in A^n} Q'(x^n) = 1.$$
• Aha! So we get the same result by mixing all the models in the class and using the mixture as a model!
• This uniform mixture is very appealing because it can be implemented sequentially, independently of $n$:
$$Q'(x^n) = \frac{n_0!\, n_1!}{(n+1)!} = \prod_{t=0}^{n-1} q'(x_{t+1} \mid x^t), \qquad q'(0 \mid x^t) = \frac{n_0(x^t)+1}{t+2}$$
  - Laplace's rule of succession!
  - this yields a "plug-in" approach: estimate $\theta$ and plug it into the model

Bernoulli Models: Mixture Code (cont.)

• Maybe we can make Stirling work for us with a different mixture, one that puts more weight on the problematic regions where $n_1/n$ approaches 0 and 1?
• Consider the Dirichlet density $w(\theta) = \dfrac{1}{\Gamma(\frac12)^2 \sqrt{\theta(1-\theta)}}$. Then
$$Q''(x^n) = \int_0^1 P_\theta(x^n)\, w(\theta)\, d\theta = \frac{1}{\Gamma(\frac12)^2} \int_0^1 \theta^{\,n_1-\frac12}(1-\theta)^{\,n_0-\frac12}\, d\theta = \frac{\Gamma(n_1+\frac12)\,\Gamma(n_0+\frac12)}{n!\,\Gamma(\frac12)^2}.$$
• By Stirling's formula for the Gamma function, for all sequences $x^n$:
$$\frac{L''(x^n)}{n} = -\frac{1}{n}\log Q''(x^n) \le \hat{H}(x^n) + \frac{\log n}{2n} + O(1/n).$$
Can we do any better?
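As a sanity check on the sequential implementations, the following sketch (not from the slides; the toy sequence x is arbitrary) computes $Q'(x^n)$ via Laplace's rule of succession and compares it with the closed form $[\binom{n}{n_1}(n+1)]^{-1}$, and likewise evaluates the Dirichlet(1/2, 1/2) mixture $Q''(x^n)$ sequentially.

```python
import math

def laplace_prob(x):
    """Uniform-mixture probability Q'(x^n), built sequentially from
    Laplace's rule of succession: q'(a | x^t) = (n_a(x^t) + 1) / (t + 2)."""
    counts, q = [0, 0], 1.0
    for t, bit in enumerate(x):
        q *= (counts[bit] + 1) / (t + 2)
        counts[bit] += 1
    return q

def kt_prob(x):
    """Dirichlet(1/2, 1/2)-mixture probability Q''(x^n), built sequentially:
    q''(a | x^t) = (n_a(x^t) + 1/2) / (t + 1)."""
    counts, q = [0, 0], 1.0
    for t, bit in enumerate(x):
        q *= (counts[bit] + 0.5) / (t + 1)
        counts[bit] += 1
    return q

x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]   # an arbitrary toy sequence
n, n1 = len(x), sum(x)
closed_form = 1.0 / (math.comb(n, n1) * (n + 1))   # Q'(x^n) = [C(n,n1)(n+1)]^{-1}

print(laplace_prob(x), closed_form)   # equal up to floating-point rounding
print(-math.log2(kt_prob(x)) / n)     # per-symbol code length -log Q''(x^n)/n
```

The two estimators differ only in the bias added to the counts (+1 versus +1/2); the smaller bias is what shifts probability mass toward the regions where $n_1/n$ is close to 0 or 1.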
• This mixture also has a plug-in interpretation:
$$Q''(x^n) = \prod_{t=0}^{n-1} q''(x_{t+1} \mid x^t), \qquad q''(0 \mid x^t) = \frac{n_0(x^t)+\frac12}{t+1}$$

Summary of Codes

• Two-part code
  - describe the best parameter, and then code based on it: works whenever the number of possible optimal codes in the class is sub-exponential
  - the most natural approach (akin to Kolmogorov complexity), but not sequential
  - not efficient in the example: it can be improved by describing the parameter more coarsely. Trade-off: spend fewer bits on the parameter and use an approximation of the best parameter
  - alternatively, don't allocate code words to impossible sequences
• Mixture code
  - can be implemented sequentially
  - a suitable prior on the parameter gave the smallest redundancy
• Plug-in codes
  - use a biased estimate of the conditional probability in the model class, based on the data seen so far
  - can often be interpreted as mixtures
• But what's the best we can do?

Normalized Maximum-Likelihood Code

• Goal: find a code that attains the best worst-case pointwise redundancy
$$R_{\mathcal{C}} = \min_L \max_{x^n \in A^n} R_{\mathcal{C}}(L, x^n) = \min_L \max_{x^n \in A^n} \frac{1}{n}\left[L(x^n) - \min_{C \in \mathcal{C}} L_C(x^n)\right].$$
• Consider the code defined by the probability assignment
$$Q(x^n) = \frac{2^{-\min_{C \in \mathcal{C}} L_C(x^n)}}{\sum_{y^n \in A^n} 2^{-\min_{C \in \mathcal{C}} L_C(y^n)}} \;\Rightarrow\; R_{\mathcal{C}}(L, x^n) = \frac{1}{n}\log\left[\sum_{y^n \in A^n} 2^{-\min_{C \in \mathcal{C}} L_C(y^n)}\right].$$
This quantity depends only on the class!
• Since the redundancy of this code is the same for all $x^n$, it must be optimal in the minimax sense!
• Drawbacks:
  - the sequential probability assignment depends on the horizon $n$
  - it is hard to compute

Example: NML Code for the Bernoulli Class

• We evaluate the minimax redundancy $R_{\mathcal{C}}$ for the Bernoulli class:
$$R_{\mathcal{C}} = \frac{1}{n}\log\left[\sum_{x^n \in A^n} 2^{-n\hat{H}(x^n)}\right] = \frac{1}{n}\log\left[\sum_{i=0}^{n} \binom{n}{i}\, 2^{-nH(i/n)}\right] = \frac{1}{2n}\log\frac{n\pi}{2} + o(1/n).$$
  - Hint: Stirling, plus $\int_0^1 \frac{d\theta}{\sqrt{\theta(1-\theta)}} = \Gamma(\tfrac12)^2 = \pi$
  - the collapse of the sum over $A^n$ into a sum over $i = n_1$ is typical of sufficient statistics: $P_\theta(x^n) = p(x^n \mid s(x^n))\, p_\theta(s(x^n))$
• Hence the worst-case pointwise redundancy cannot vanish faster than $(\log n)/2n$, a rate attained by both the Dirichlet mixture and the NML code.

Parametric Model Classes

• A useful limitation of the model class is to assume $\mathcal{C} = \{P_\theta,\ \theta \in \Theta_d\}$, a parameter space of dimension $d$.
• Examples:
  - Bernoulli: $d = 1$; general i.i.d.: $d = a - 1$
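Returning to the NML code for the Bernoulli class: the minimax redundancy can be evaluated exactly for moderate $n$. The sketch below (illustrative Python, not from the slides) accumulates the normalization sum in the log domain and compares the result with the asymptotic value $\frac{1}{2n}\log\frac{n\pi}{2}$.

```python
import math

def binary_entropy(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def log2_binomial(n, i):
    """log2 C(n, i) via lgamma, to avoid overflow for large n."""
    return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)) / math.log(2)

def nml_redundancy(n):
    """Bernoulli minimax redundancy (1/n) log2 sum_i C(n,i) 2^{-n H(i/n)},
    accumulated in the log domain (log-sum-exp, base 2) for numerical safety."""
    log_terms = [log2_binomial(n, i) - n * binary_entropy(i / n) for i in range(n + 1)]
    m = max(log_terms)
    return (m + math.log2(sum(2.0 ** (t - m) for t in log_terms))) / n

for n in (100, 1000, 10000):
    print(n, nml_redundancy(n), math.log2(n * math.pi / 2) / (2 * n))
```

The exact values sit slightly above the asymptotic formula and converge to it as $n$ grows, consistent with the $o(1/n)$ correction term.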

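Finally, a numerical companion to the comparison of mixture codes above (the sample values of $n$ and $n_1$ are arbitrary): it evaluates the pointwise redundancies of the uniform (Laplace) and Dirichlet(1/2, 1/2) mixtures against the $(\log n)/2n$ benchmark, including the extreme sequences where the uniform mixture pays roughly twice the benchmark.

```python
import math

def redundancy(log2_q, n, n1):
    """Pointwise redundancy per symbol: -log2 Q(x^n)/n - Hhat(x^n)."""
    p = n1 / n
    hhat = 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return -log2_q / n - hhat

def log2_laplace(n, n1):
    """log2 Q'(x^n) = -log2 [ C(n, n1) (n + 1) ]  (uniform mixture)."""
    return -(math.log2(math.comb(n, n1)) + math.log2(n + 1))

def log2_kt(n, n1):
    """log2 Q''(x^n) = log2 [ Gamma(n1+1/2) Gamma(n0+1/2) / (pi n!) ]  (Dirichlet mixture)."""
    ln_q = (math.lgamma(n1 + 0.5) + math.lgamma(n - n1 + 0.5)
            - math.lgamma(n + 1) - math.log(math.pi))
    return ln_q / math.log(2)

n = 1000
for n1 in (500, 10, 0):   # balanced, skewed, and all-zeros sequences
    print(n1,
          round(redundancy(log2_laplace(n, n1), n, n1), 5),
          round(redundancy(log2_kt(n, n1), n, n1), 5))
print(round(math.log2(n) / (2 * n), 5))   # the (log n)/2n benchmark
```

For the balanced sequence both mixtures are close to the benchmark; for the all-zeros sequence the Dirichlet mixture stays near it while the uniform mixture does not, which is the motivation given above for the heavier-tailed prior.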