
Lossless Source Coding

Gadiel Seroussi and Marcelo Weinberger
Information Theory Research
Hewlett-Packard Laboratories, Palo Alto, California

1. Fundamentals

Notation

• A = discrete (usually finite) alphabet
• α = |A| = size of A (when finite)
• $x_1^n = x^n = x_1 x_2 x_3 \ldots x_n$ = finite sequence over A
• $x_1^\infty = x^\infty = x_1 x_2 x_3 \ldots x_t \ldots$ = infinite sequence over A
• $x_i^j = x_i x_{i+1} \ldots x_j$ = sub-sequence (i sometimes omitted if i = 1)
• $p_X(x)$ = Prob(X = x) = probability mass function (PMF) on A (subscript X and argument x dropped if clear from context)
• X ~ p(x): X obeys PMF p(x)
• $E_p[F]$ = expectation of F w.r.t. PMF p (subscript and brackets may be dropped)
• $\hat p_{x_1^n}(x)$ = empirical distribution obtained from $x_1^n$
• log x = logarithm to base 2 of x, unless base otherwise specified
• ln x = natural logarithm of x
• H(X), H(p) = entropy of a random variable X or PMF p, in bits; also
• H(p) = −p log p − (1 − p) log(1 − p), 0 ≤ p ≤ 1: binary entropy function
• D(p||q) = relative entropy (information divergence) between PMFs p and q

Coding in a communication/storage system

[Block diagram: data source → encoder → channel (with noise) → decoder → data destination. In the expanded view, the encoder splits into a source encoder followed by a channel encoder, and the decoder into a channel decoder followed by a source decoder.]

Information Theory

• Shannon, "A mathematical theory of communication," Bell System Technical Journal, 1948
  - Theoretical foundations of source and channel coding
  - Fundamental bounds and coding theorems in a probabilistic setting
    · in a nutshell: perfect communication in the presence of noise is possible as long as the entropy rate of the source is below the channel capacity
  - Fundamental theorems essentially non-constructive: we have spent the last 52 years realizing Shannon's promised paradise in practice
    · very successful: enabled the current digital revolution (multimedia, internet, wireless communication, mass storage, …)
  - Separation theorem: source and channel coding can be done independently

Source Coding

[Block diagram: the same system, with the channel encoder, noisy channel, and channel decoder abstracted into a noiseless channel, leaving data source → source encoder → noiseless channel → source decoder → data destination.]

Source coding = Data compression ⇒ efficient use of bandwidth/space

Data Compression

[Block diagram: data D → compressor (encoder) → compressed data C → channel → C → decompressor (decoder) → decompressed data D'.]

• the goal: size(C) < size(D)
• compression ratio: ρ = size(C) / size(D) < 1, in appropriate units, e.g., bits/symbol
• Channel:
  - Communications channel ("from here to there")
  - Storage channel ("from now to then")
• Lossless compression: D = D' (the case of interest here)
• Lossy compression: D' is an approximation of D under some metric

Data Sources

• A data source emits a sequence $x_1 x_2 x_3 \ldots x_t \ldots x_n = x_1^n$
• Symbols $x_i \in A$ = a countable (usually finite) alphabet
• Probabilistic source: the $x_i$ are random variables; $x_1^n$ obeys some probability distribution P on $A^n$ (the ensemble of possible sequences emitted by the source)
  - we are often interested in n → ∞: $x_1^\infty$ is a random process
    · stationary (time-invariant): $x_i^\infty$ and $x_j^\infty$ are identically distributed as random processes, ∀ i, j ≥ 1
    · ergodic: time averages converge (to ensemble averages)
    · memoryless: the $x_i$ are statistically independent
    · independent, identically distributed (i.i.d.): memoryless, and $x_i \sim p_X$ ∀ i
• Individual sequence: the $x_i$ are just symbols, not assumed to be a realization of a random process. The "data source" will include a probability assignment, but it will be derived from the data under certain constraints, and with certain objectives.
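The empirical distribution $\hat p_{x_1^n}$ and the binary entropy function defined above are easy to compute for a concrete sequence. The following is a minimal sketch (plain Python, standard library only; the bias 0.3, the sequence length, and the function names are illustrative choices, not part of the original slides). It draws an i.i.d. binary sequence, computes its zero-order empirical distribution, and evaluates the binary entropy of the empirical bias; rerunning with a different seed changes the empirical values, illustrating that the empirical statistics of a probabilistic source are themselves random variables.

```python
import math
import random
from collections import Counter

def empirical_distribution(x):
    """Zero-order empirical distribution: p_hat(a) = |{i : x_i = a}| / n."""
    n = len(x)
    counts = Counter(x)
    return {a: counts[a] / n for a in counts}

def binary_entropy(p):
    """H2(p) = -p log2 p - (1-p) log2(1-p), with the convention 0 log 0 = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Illustrative i.i.d. binary source with Prob(x_i = 1) = 0.3 (an arbitrary choice).
random.seed(0)
x = [1 if random.random() < 0.3 else 0 for _ in range(1000)]

p_hat = empirical_distribution(x)
print("empirical distribution:", p_hat)                       # close to {0: 0.7, 1: 0.3}
print("empirical entropy:", binary_entropy(p_hat.get(1, 0)))  # close to H2(0.3) ≈ 0.881 bits
```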
Statistics on Individual Sequences

• Empirical distributions (memoryless, Bernoulli model):
  $$\hat p_{x_1^n}(a) = \frac{1}{n}\,\bigl|\{\, i : 1 \le i \le n,\ x_i = a \,\}\bigr|, \qquad a \in A$$
  - we can compute empirical statistics of any order (joint, conditional, etc.)
  - sequence probability according to its own empirical distribution:
    $$\hat P_{x_1^n}(x_1^n) = \prod_{i=1}^{n} \hat p_{x_1^n}(x_i)$$
  - this is the highest probability assigned to the sequence by any distribution from the model class (maximum-likelihood estimator)
  - Example: A = {0,1}, $n_0 = |\{\, i : x_i = 0 \,\}|$, $n_1 = n - n_0 = |\{\, i : x_i = 1 \,\}|$:
    $$\hat p(0) = \frac{n_0}{n}, \quad \hat p(1) = \frac{n_1}{n}: \qquad \hat P(x_1^n) = \hat p(0)^{n_0}\,\hat p(1)^{n_1} = \frac{n_0^{n_0}\, n_1^{n_1}}{n^n}$$
  - Notice that if $x_1^n$ is in fact the outcome of a random process, then its empirical statistics are themselves random variables
    · e.g., expressions of the type $\Pr_p\bigl(\,|\hat p(a) - p(a)| \ge \varepsilon\,\bigr)$

Statistical Models for Data Sources

• Markov of order k ≥ 0 (finite memory):
  $$p(x_{t+1} \mid x_1^t) = p(x_{t+1} \mid x_{t-k+1}^t), \quad t \ge k \quad \text{(some convention for } t < k)$$
  - i.i.d. = Markov of order 0
• Finite State Machine (FSM):
  - state space S = {s0, s1, ..., s_{K-1}}
  - initial state s0
  - transition probability q(s | s', a), s, s' ∈ S, a ∈ A
  - output probability p(a | s), a ∈ A, s ∈ S
  - unifilar ⇔ deterministic transitions: next-state function f : S × A → S
  - every Markov source is equivalent to a unifilar FSM with K ≤ |A|^k, but in general, finite state ≠ finite memory
• Example (three-state binary FSM; a verification sketch follows the next slide): p(0|s0) = 0.9, p(0|s1) = 0.5, p(0|s2) = 0.1. The steady-state state probabilities satisfy
  $$[\pi_0\ \pi_1\ \pi_2] \begin{bmatrix} .9 & .1 & 0 \\ .5 & 0 & .5 \\ .1 & 0 & .9 \end{bmatrix} = [\pi_0\ \pi_1\ \pi_2] \;\Rightarrow\; [\pi_0\ \pi_1\ \pi_2] = \Bigl[\tfrac{5}{8}\ \ \tfrac{1}{16}\ \ \tfrac{5}{16}\Bigr],$$
  and the resulting symbol probabilities are $[p_0\ p_1] = \bigl[\tfrac{5}{8}\ \ \tfrac{3}{8}\bigr]$.
  [Figure: state diagram with states s0, s1, s2 and transitions labeled by the output symbols.]

Statistical Models for Data Sources (cont.)

• Tree sources (FSMX)
  - finite memory ≤ k (Markov)
  - the number of past symbols needed to determine the state might be < k for some states; in the example above, the states correspond to the contexts s0 ↔ 0, s1 ↔ 01, s2 ↔ 11
  - by merging nodes from the full Markov tree, we get a model with a smaller number of free parameters
  - the set of tree sources with unbalanced trees has measure zero in the space of Markov sources of any given order
  - yet, tree source models have proven very useful in practice, and are associated with some of the best compression algorithms to date
  - more about this later ...
  [Figure: the context tree of the example source and the full Markov tree, with leaves labeled p(0|s0), p(0|s1), p(0|s2).]
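The steady-state probabilities quoted in the FSM example above can be checked numerically. Below is a small verification sketch (plain Python; the function name and the use of fixed-point iteration are illustrative choices, not part of the original slides). It iterates π ← πP for the example's state-transition matrix, which converges for this ergodic chain, and then combines the stationary state probabilities with the per-state output probabilities p(0|s) to obtain the symbol probabilities.

```python
# State-transition matrix P of the example FSM (rows/columns ordered s0, s1, s2).
P = [
    [0.9, 0.1, 0.0],
    [0.5, 0.0, 0.5],
    [0.1, 0.0, 0.9],
]
p0_given_state = [0.9, 0.5, 0.1]  # p(0|s0), p(0|s1), p(0|s2) from the example

def steady_state(P, iterations=1000):
    """Approximate the stationary distribution by iterating pi <- pi * P."""
    k = len(P)
    pi = [1.0 / k] * k  # start from the uniform distribution over states
    for _ in range(iterations):
        pi = [sum(pi[s] * P[s][t] for s in range(k)) for t in range(k)]
    return pi

pi = steady_state(P)
p0 = sum(pi[s] * p0_given_state[s] for s in range(len(pi)))

print("state probabilities:", pi)             # ≈ [5/8, 1/16, 5/16] = [0.625, 0.0625, 0.3125]
print("symbol probabilities:", [p0, 1 - p0])  # ≈ [5/8, 3/8] = [0.625, 0.375]
```

Solving the linear system πP = π, Σ_s π_s = 1 directly (e.g., with numpy) gives the exact fractions; the iteration is used here only to keep the sketch dependency-free.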
Entropy

• For X ~ p(x), the entropy of X (or of the PMF p(·)), measured in bits, is defined as
  $$H(X) \triangleq -\sum_{x \in A} p(x)\log p(x) \qquad [\,0\log 0 = 0\,], \qquad\quad H(X) = E_p[-\log p(X)]$$
  - H measures the uncertainty or self-information of X
  - we also write H(p): a random variable is not actually needed; p(·) could be an empirical distribution
• Example: A = {0,1}, pX(1) = p, pX(0) = 1 − p (overloaded notation!):
  $$H_2(p) = -p\log p - (1-p)\log(1-p) \qquad \text{(binary entropy function)}$$
  Main properties:
  - H2(p) ≥ 0, and H2(p) is ∩-convex (concave) on 0 ≤ p ≤ 1
  - H2(p) → 0 as p → 0 or 1, with infinite slope
  - H2(p) is maximal at p = 0.5, with H2(0.5) = 1 ⇒ the entropy of an unbiased coin is 1 bit
  [Figure: plot of the binary entropy function H2(p) versus p.]

Entropy (cont.)

• For a general finite alphabet A, H(X) is maximal when X ~ p_u, where p_u(a) = 1/|A| for all a ∈ A (uniform distribution):
  - Jensen's inequality: if f is a ∪-convex function, then E f(X) ≥ f(EX)
  - −log x is a ∪-convex function of x
  - hence $H(X) = E[\log(1/p(X))] = -E[-\log(1/p(X))] \le \log E[1/p(X)] = \log|A| = H(p_u)$
• Empirical entropy
  - entropy computed on an empirical distribution
  - Example: recall that the probability assigned to an individual binary sequence by its own zero-order empirical distribution is
    $$\hat P(x_1^n) = \hat p(0)^{n_0}\,\hat p(1)^{n_1} = \frac{n_0^{n_0}\, n_1^{n_1}}{n^n}$$
  - normalized, in bits/symbol, we have
    $$-\frac{1}{n}\log \hat P(x_1^n) = -\frac{n_0}{n}\log\frac{n_0}{n} - \frac{n_1}{n}\log\frac{n_1}{n} = H_2(\hat p(0)) \triangleq \hat H(x_1^n)$$
  - in fact, $-\frac{1}{n}\log \hat P(x_1^n) = \hat H(x_1^n)$ holds for a large class of probability models

Joint and Conditional Entropies

• The joint entropy of random variables (X, Y) ~ p(x, y) is defined as
  $$H(X,Y) = -\sum_{x,y} p(x,y)\log p(x,y)$$
  - this can be extended to any number of random variables: H(X1, X2, ..., Xn)
  - Notation: H(X1, X2, ..., Xn) = joint entropy of X1, X2, ..., Xn (0 ≤ H ≤ n log |A|); its normalized, per-symbol version is H(X1, X2, ..., Xn)/n (between 0 and log |A|)
  - if (X, Y) are statistically independent, then H(X,Y) = H(X) + H(Y)
• The conditional entropy is defined as
  $$H(Y \mid X) = \sum_{x} p(x)\,H(Y \mid X = x) = -E_{p(x,y)}\bigl[\log p(Y \mid X)\bigr]$$
• Chain rule: H(X,Y) = H(X) + H(Y|X)
• Conditioning reduces uncertainty (on the average): H(X|Y) ≤ H(X)
  - but H(X|Y = y) > H(X) is possible for particular values of y

Entropy Rates

• Entropy rate of a random process, in bits/symbol:
  $$H(X_1^\infty) = \lim_{n\to\infty} \frac{1}{n} H(X_1^n), \quad \text{if the limit exists!}$$
• A related limit based on conditional entropy, in bits/symbol:
  $$H^*(X_1^\infty) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1), \quad \text{if the limit exists!}$$
• Theorem: For a stationary random process, both limits exist, and $H^*(X_1^\infty) = H(X_1^\infty)$
• Examples:
  - X1, X2, …
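The chain rule and the "conditioning reduces uncertainty" property above are easy to verify numerically for a concrete joint PMF. The sketch below (plain Python; the particular joint distribution on {0,1} × {0,1} is an arbitrary illustrative choice, not taken from the slides) computes the joint, marginal, and conditional entropies from their definitions and checks that H(X,Y) = H(X) + H(Y|X) and H(X|Y) ≤ H(X).

```python
import math

# An arbitrary illustrative joint PMF p(x, y) on {0,1} x {0,1}.
p_xy = {(0, 0): 0.40, (0, 1): 0.10, (1, 0): 0.15, (1, 1): 0.35}

def entropy(pmf):
    """H(p) = -sum_a p(a) log2 p(a), with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

# Marginals p(x) and p(y).
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

def conditional_entropy(p_joint, p_first):
    """H(second | first) = sum_u p(u) H(second | first = u), from the definition."""
    h = 0.0
    for u, pu in p_first.items():
        cond = {v: p_joint[(u, v)] / pu for v in (0, 1) if p_joint[(u, v)] > 0}
        h += pu * entropy(cond)
    return h

H_xy, H_x, H_y = entropy(p_xy), entropy(p_x), entropy(p_y)
H_y_given_x = conditional_entropy(p_xy, p_x)
p_yx = {(y, x): p for (x, y), p in p_xy.items()}  # the same joint PMF with the roles swapped
H_x_given_y = conditional_entropy(p_yx, p_y)

print("chain rule:   H(X,Y) =", H_xy, "  H(X) + H(Y|X) =", H_x + H_y_given_x)
print("conditioning: H(X|Y) =", H_x_given_y, "  <=  H(X) =", H_x)
```

With this particular PMF, H(X,Y) ≈ 1.80 bits, H(Y|X) ≈ 0.80 bits, and H(X|Y) ≈ 0.81 bits ≤ H(X) = 1 bit, consistent with the chain rule and with H(X|Y) ≤ H(X).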