Lossless Source Coding

Gadiel Seroussi Marcelo Weinberger

Information Theory Research Hewlett-Packard Laboratories Palo Alto, California

Lossless Source Coding

1. Fundamentals

Notation

• A = discrete (usually finite) alphabet
• α = |A| = size of A (when finite)
• x_1^n = x^n = x_1 x_2 x_3 … x_n = finite sequence over A
• x_1^∞ = x^∞ = x_1 x_2 x_3 … x_t … = infinite sequence over A
• x_i^j = x_i x_{i+1} … x_j = sub-sequence (i sometimes omitted if i = 1)

• p_X(x) = Prob(X = x) = probability mass function (PMF) on A (subscript X and argument x dropped if clear from context)
• X ~ p(x) : X obeys PMF p(x)

• E_p[F] = expectation of F w.r.t. PMF p (subscript and brackets may be dropped)
• p̂_{x_1^n}(x) = empirical distribution obtained from x_1^n
• log x = logarithm to base 2 of x, unless the base is otherwise specified
• ln x = natural logarithm of x
• H(X), H(p) = entropy of a random variable X or PMF p, in bits; also
• H(p) = −p log p − (1−p) log(1−p), 0 ≤ p ≤ 1 : binary entropy function
• D(p||q) = relative entropy (information divergence) between PMFs p and q

Coding in a communication/storage system

[Block diagrams: (i) data source → encoder → channel (with noise) → decoder → destination; (ii) the same system with the encoder split into source encoder + channel encoder, and the decoder into channel decoder + source decoder]


❑ Shannon, “A mathematical theory of communication,” Bell Syst. Tech. Journal, 1948
  • Theoretical foundations of source and channel coding
  • Fundamental bounds and coding theorems in a probabilistic setting
    – in a nutshell: perfect communication in the presence of noise is possible as long as the entropy of the source is below the capacity of the channel
  • Fundamental theorems essentially non-constructive: we’ve spent the last 52 years realizing Shannon’s promised paradise in practice
    – very successful: enabled the current digital revolution (multimedia, internet, wireless communication, mass storage, …)
  • Separation theorem: source and channel coding can be done independently

Source Coding

[Block diagram: data source → source encoder → (channel encoder → noisy channel → channel decoder, together acting as a noiseless channel) → source decoder → destination]

Source coding = Data compression ⇒ efficient use of bandwidth/space

Data Compression

[Block diagram: data D → compressor (encoder) → compressed data C → channel → C → decompressor (decoder) → decompressed data D’]

the goal: size(C) < size(D)
compression ratio: ρ = size(C)/size(D) < 1, in appropriate units (e.g., bits/symbol)

❑ Channel:
  • Communications channel (“from here to there”)
  • Storage channel (“from now to then”)

❑ Lossless compression: D = D’ ← the case of interest here
❑ Lossy compression: D’ is an approximation of D under some metric

Data Sources

data source → D = x_1 x_2 x_3 … x_t … x_n = x_1^n

❑ Symbols x_i ∈ A = a countable (usually finite) alphabet
❑ Probabilistic source: the x_i are random variables; x_1^n obeys some probability distribution P on A^n (the ensemble of possible sequences emitted by the source)
  • we are often interested in n → ∞ : x_1^∞ is a random process
    – stationary (time-invariant): x_i^∞ and x_j^∞ are equal as random processes, ∀ i, j ≥ 1
    – ergodic: time averages converge (to ensemble averages)

    – memoryless: the x_i are statistically independent

    – independent, identically distributed (i.i.d.): memoryless, and x_i ~ p_X ∀ i

❑ Individual sequence: the x_i are just symbols, not assumed to be a realization of a random process. The “data source” will include a probability assignment, but it will be derived from the data under certain constraints, and with certain objectives

Statistics on Individual Sequences

❑ Empirical distributions (memoryless, Bernoulli model)
  p̂_{x_1^n}(a) = (1/n) |{x_i : 1 ≤ i ≤ n, x_i = a}|,  a ∈ A
  • we can compute empirical statistics of any order (joint, conditional, etc.)
  • sequence probability according to its own empirical distribution:
    P̂_{x_1^n}(x_1^n) = ∏_{i=1}^{n} p̂_{x_1^n}(x_i)
  • this is the highest probability assigned to the sequence by any distribution from the model class (maximum-likelihood estimator)

  • Example: A = {0,1}, n_0 = |{i : x_i = 0}|, n_1 = n − n_0 = |{i : x_i = 1}|
    p̂(0) = n_0/n, p̂(1) = n_1/n :  P̂(x_1^n) = p̂(0)^{n_0} p̂(1)^{n_1} = n_0^{n_0} n_1^{n_1} / n^n
  • Notice that if x_1^n is in fact the outcome of a random process, then its empirical statistics are themselves random variables
  • e.g., expressions of the type Pr_p(|p̂(a) − p(a)| ≥ ε)
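A quick numeric companion to these definitions (a sketch; the helper name empirical_stats is illustrative, not from the slides):

```python
from collections import Counter
from math import log2

def empirical_stats(x):
    """Zero-order (Bernoulli) empirical distribution of sequence x, the
    probability P-hat(x) that x assigns to itself (the ML probability),
    and the per-symbol empirical entropy -(1/n) log P-hat(x)."""
    n = len(x)
    counts = Counter(x)
    p_hat = {a: c / n for a, c in counts.items()}
    log_prob = sum(c * log2(c / n) for c in counts.values())  # log2 P-hat(x)
    return p_hat, 2 ** log_prob, -log_prob / n

p_hat, P_hat, H_hat = empirical_stats("1011010100010")
print(p_hat)   # {'1': 6/13, '0': 7/13} (approximately)
print(P_hat)   # no memoryless model assigns this sequence a higher probability
print(H_hat)   # empirical entropy in bits/symbol (~0.996 here)
```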

Statistical Models for Data Sources

❑ Markov source of order k ≥ 0 : finite memory
  p(x_{t+1} | x_1^t) = p(x_{t+1} | x_{t−k+1}^t),  t ≥ k   (some convention for t < k)

  • i.i.d. = Markov of order 0
❑ Finite State Machine (FSM)
  • state space S = {s_0, s_1, ..., s_{K−1}}
  • initial state s_0
  • transition probability q(s|s’,a), s,s’ ∈ S, a ∈ A
  • output probability p(a|s), a ∈ A, s ∈ S
  • unifilar ⇔ deterministic transitions: next-state function f : S × A → S
  • every Markov source is equivalent to a unifilar FSM with K ≤ |A|^k, but in general, finite state ≠ finite memory

  Example (three-state unifilar FSM): p(0|s_0) = 0.9, p(0|s_1) = 0.5, p(0|s_2) = 0.1
  Steady state: [π_0 π_1 π_2] · [[.9 .1 0], [.5 0 .5], [.1 0 .9]] = [π_0 π_1 π_2]
  ⇒ state probs. [π_0 π_1 π_2] = [5/8 1/16 5/16], symbol probs. [p_0 p_1] = [5/8 3/8]

Statistical Models for Data Sources (cont.)

❑ Tree sources (FSMX)
  • finite memory ≤ k (Markov)
  • the number of past symbols needed to determine the state might be < k for some states; in the example, s_0 ↔ 0, s_1 ↔ 01, s_2 ↔ 11
  • by merging nodes from the full Markov tree, we get a model with a smaller number of free parameters
  • the set of tree sources with unbalanced trees has measure zero in the space of Markov sources of any given order
  • yet, tree source models have proven very useful in practice, and are associated with some of the best compression algorithms to date
  • more about this later ...

Entropy

X ~ p(x) :  H(X) ≜ −∑_{x∈A} p(x) log p(x)   [0 log 0 = 0]

entropy of X (or of the PMF p(·)), measured in bits

H(X) = E_p[−log p(X)]
  • H measures the uncertainty or self-information of X
  • we also write H(p) : a random variable is not actually needed; p(·) could be an empirical distribution

Example: A = {0,1}, p_X(1) = p, p_X(0) = 1−p (overloaded notation!)

H_2(p) = −p log p − (1−p) log(1−p)   (binary entropy function)
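These quantities take one line each in code; a minimal sketch (function names are illustrative):

```python
from math import log2

def entropy(p):
    """H(p) in bits for a PMF given as a sequence of probabilities (0 log 0 = 0)."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def h2(p):
    """Binary entropy function H_2(p)."""
    return entropy([p, 1 - p])

print(h2(0.5))                              # 1.0 -> an unbiased coin is worth 1 bit
print(h2(0.1))                              # ~0.469
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits
```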

Main properties:
• H_2(p) ≥ 0, H_2(p) is ∩-convex (concave), 0 ≤ p ≤ 1
• H_2(p) → 0 as p → 0 or 1, with infinite slope
• H_2(p) is maximal at p = 0.5, H_2(0.5) = 1 ⇒ the entropy of an unbiased coin is 1 bit

Entropy (cont.)

❑ For a general finite alphabet A, H(X) is maximal when X ~ p_u, where p_u(a) = 1/|A| for all a ∈ A (uniform distribution)
  • Jensen’s inequality: if f is a ∪-convex function, then E f(X) ≥ f(EX)
  • −log x is a ∪-convex function of x

  • H(X) = E[log(1/p(X))] = −E[−log(1/p(X))] ≤ log E[1/p(X)] = log |A| = H(p_u)

❑ Empirical entropy
  • entropy computed on an empirical distribution
  • Example: recall that the probability assigned to an individual binary sequence by its own zero-order empirical distribution is
    P̂(x_1^n) = p̂(0)^{n_0} p̂(1)^{n_1} = n_0^{n_0} n_1^{n_1} / n^n
  • normalized, in bits/symbol, we have
    −(1/n) log P̂(x_1^n) = −(n_0/n) log(n_0/n) − (n_1/n) log(n_1/n) = H_2(p̂(0)) = Ĥ(x_1^n)

  in fact, −(1/n) log P̂(x_1^n) = Ĥ(x_1^n) holds for a large class of probability models

Joint and Conditional Entropies

❑ The joint entropy of random variables (X,Y) ~ p(x,y) is defined as
  H(X,Y) = −∑_{x,y} p(x,y) log p(x,y)
  • this can be extended to any number of random variables: H(X_1, X_2, ..., X_n)

Notation: H(X_1, X_2, ..., X_n) = joint entropy of X_1, X_2, ..., X_n (0 ≤ H ≤ n log |A|)

(1/n) H(X_1, X_2, ..., X_n) = normalized per-symbol entropy (0 ≤ (1/n)H ≤ log |A|)
  • if (X,Y) are statistically independent, then H(X,Y) = H(X) + H(Y)
❑ The conditional entropy is defined as
  H(Y|X) = ∑_x p(x) H(Y|X=x) = −E_{p(x,y)} log p(Y|X)
❑ Chain rule: H(X,Y) = H(X) + H(Y|X)

❑ Conditioning reduces uncertainty (on the average): H(X|Y) ≤ H(X)

  • but H(X|Y=y) ≥ H(X) is possible

Entropy Rates

❑ Entropy rate of a random process, in bits/symbol:
  H(X_1^∞) = lim_{n→∞} (1/n) H(X_1^n),  if the limit exists!

❑ A related limit, based on conditional entropy, in bits/symbol:
  H*(X_1^∞) = lim_{n→∞} H(X_n | X_{n−1}, X_{n−2}, …, X_1),  if the limit exists!

Theorem: For a stationary random process, both limits exist, and H*(X_1^∞) = H(X_1^∞)
❑ Examples:
  • X_1, X_2, ... i.i.d.:  H(X_1^∞) = lim_{n→∞} H(X_1, X_2, ..., X_n)/n = lim_{n→∞} n H(X_1)/n = H(X_1)
  • X_1^∞ stationary k-th order Markov:
    H(X_1^∞) = H*(X_1^∞) = lim_{n→∞} H(X_n | X_{n−1}, ..., X_1) = lim_{n→∞} H(X_n | X_{n−1}, ..., X_{n−k}) = H(X_{k+1} | X_k, ..., X_1)
    (first equality by the theorem, second by the Markov property, last by stationarity)
The theorem provides a very useful tool to compute entropy rates for a broad family of source models

Entropy Rates - Examples

Recall the three-state unifilar FSM: p(0|s_0) = 0.9, p(0|s_1) = 0.5, p(0|s_2) = 0.1
Steady state: state probs. [π_0 π_1 π_2] = [5/8 1/16 5/16], symbol probs. [p_0 p_1] = [5/8 3/8]

❑ Zero-order entropy: H_2(0.375) ≈ 0.954
❑ Markov process entropy: H(X|S) = ∑_{i=0}^{2} π_i H_2(p(0|s_i)) = (5/8) H_2(0.9) + (1/16) H_2(0.5) + (5/16) H_2(0.1) ≈ 0.502
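The steady-state probabilities and both entropies above can be checked with a short computation (a sketch using numpy; variable names are illustrative):

```python
import numpy as np

# State-transition matrix of the 3-state unifilar FSM (row = current state).
M = np.array([[0.9, 0.1, 0.0],
              [0.5, 0.0, 0.5],
              [0.1, 0.0, 0.9]])
p0 = np.array([0.9, 0.5, 0.1])          # p(0|s_i) for each state

def h2(p):
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# Stationary distribution = left eigenvector of M for eigenvalue 1.
w, v = np.linalg.eig(M.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()
print(pi)                                         # ~[5/8, 1/16, 5/16]
print(h2(pi @ p0))                                # ~0.954 (zero-order entropy, P(0) = 5/8)
print(sum(pi[i] * h2(p0[i]) for i in range(3)))   # ~0.502 (entropy rate H(X|S))
```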

❑ Individual sequence, fitted with the FSM model:
  x = 0000001111111111100000100000001111111110
  Empirical entropy: p̂(0|s_0) = 16/19, p̂(0|s_1) = 1/3, p̂(0|s_2) = 1/9,
  [π̂_0 π̂_1 π̂_2] = [19/40 3/40 18/40],  Ĥ(x|S) = 0.594

Relative Entropy

❑ The relative entropy (or Kullback-Leibler distance, or information divergence) between two PMFs p(x) and q(x) is defined as
  D(p||q) = ∑_x p(x) log (p(x)/q(x)) = E_p [log (p(X)/q(X))]

Theorem: D(p||q) ≥ 0, with equality iff p = q
  – Proof (using strict concavity of log, and Jensen’s inequality):
    −D(p||q) = ∑_x p(x) log (q(x)/p(x)) ≤ log ∑_x p(x) (q(x)/p(x)) = log ∑_x q(x) ≤ 0
    the summations are over values of x where p(x)q(x) ≠ 0; other terms contribute either 0 or ∞ to D. Since log is strictly concave, equality holds iff p(x)/q(x) = 1 ∀x. ∎

  • D is not symmetric, and therefore not a distance in the metric sense
  • however, it is a very useful way to express ‘proximity’ of distributions

in a sense, D(p||q) measures the inefficiency of assuming that the distribution is q when it is actually p
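A direct transcription of the definition, also showing the asymmetry (a sketch; the helper name is illustrative):

```python
from math import log2, inf

def kl_divergence(p, q):
    """D(p||q) in bits; p and q are dicts mapping symbols to probabilities.
    Terms with p(x) = 0 contribute 0; p(x) > 0 with q(x) = 0 gives +inf."""
    d = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return inf
        d += px * log2(px / qx)
    return d

p = {'a': 0.5, 'b': 0.5}
q = {'a': 0.9, 'b': 0.1}
print(kl_divergence(p, q))   # ~0.737
print(kl_divergence(q, p))   # ~0.531 -> D(p||q) != D(q||p) in general
```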

Maximal Entropy

Theorem: Let X ~ p be a PMF over the non-negative integers such that E_p X = μ. Then H(X) is maximized when p(x) = exp(λ_0 + λ_1 x), with λ_0, λ_1 chosen to satisfy the constraints (a geometric distribution)
  • a similar theorem holds for moments of any order

  • Proof: Consider a PMF q satisfying the constraint. Then show H(q) ≤ H(p) using the non-negativity of D(q||p), E_p X = E_q X, and E_p 1 = E_q 1. ∎

Corollary: For X as above, H(X) ≤ (μ+1) log(μ+1) − μ log μ

Source Codes

❑ A source code C for a random variable X is a mapping C : A → D*, where D is a finite coding alphabet of size d, and D* is the set of finite strings over D
  • Definitions: C(x) = codeword corresponding to x, l(x) = |C(x)| (length)
❑ The expected length of C(X), for X ~ p(x), is

  L(C) = E_p[l(X)] = ∑_x p(x) l(x)

Examples (D = {0,1}):
  A = {a,b,c,d}:                    A = {a,b,c}:
  p(a) = 1/2    C(a) = 0            p(a) = 1/3    C(a) = 0
  p(b) = 1/4    C(b) = 10           p(b) = 1/3    C(b) = 10
  p(c) = 1/8    C(c) = 110          p(c) = 1/3    C(c) = 11
  p(d) = 1/8    C(d) = 111
  H(X) = 1.75 bits                  H(X) = log 3 ≈ 1.58 bits
  L(C) = 1.75 bits                  L(C) = 5/3 ≈ 1.66 bits
  (in the first case, l(x) = −log p(x) for all x ∈ A)
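Both examples can be reproduced in a few lines (a sketch; helper names are illustrative):

```python
from math import log2

def entropy(p):
    return -sum(px * log2(px) for px in p.values() if px > 0)

def expected_length(p, code):
    """L(C) = sum_x p(x) |C(x)| for a PMF p and a code C, both given as dicts."""
    return sum(p[x] * len(code[x]) for x in p)

p1 = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
C1 = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(entropy(p1), expected_length(p1, C1))   # 1.75, 1.75 -> dyadic PMF, L = H

p2 = {'a': 1/3, 'b': 1/3, 'c': 1/3}
C2 = {'a': '0', 'b': '10', 'c': '11'}
print(entropy(p2), expected_length(p2, C2))   # ~1.585, ~1.667 -> L > H
```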

Source Codes (cont.)

❑ A code C : A → D* extends naturally to a code C* : A* → D* defined by
  C*(λ) = λ,  C*(x_1 x_2 ... x_n) = C(x_1) C(x_2) ... C(x_n)
❑ C is called uniquely decodable (UD) if its extension C* is injective
❑ C is called a prefix (or instantaneous) code if no codeword of C is a prefix of any other codeword
  • a prefix code is uniquely decodable (prefix codes ⊂ UD codes ⊂ all codes)
  • prefix codes are “self-punctuating”

  Code examples:
    X    not UD    UD, not prefix    prefix code
    a    0         10                0
    b    010       00                10
    c    01        11                110
    d    10        110               111
  • with the “not UD” code, the sample string 010 is ambiguous (b, ad, or ca)
  • with the prefix code, the sample string 100011000111... decodes instantaneously as b aa c aa d ...
  • a prefix code can always be described by a tree with codewords at the leaves

Prefix Codes

❑ Kraft’s inequality
  • The codeword lengths l_1, l_2, ..., l_m of any d-ary prefix code satisfy
    ∑_{i=1}^{m} d^{−l_i} ≤ 1
  Conversely, given a set of lengths that satisfies this inequality, there exists a prefix code with these word lengths
    – the theorem holds also for countably infinite codes
    – in fact, the theorem holds for any UD code (McMillan)
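A sketch of both directions in code: checking the Kraft sum for a set of lengths, and building a binary prefix code from lengths that satisfy it using the canonical-code assignment (one standard construction; helper names are illustrative):

```python
def kraft_sum(lengths, d=2):
    """Left-hand side of Kraft's inequality for d-ary codeword lengths."""
    return sum(d ** (-l) for l in lengths)

def prefix_code_from_lengths(lengths):
    """Binary prefix code with the given lengths (assumes the Kraft sum <= 1):
    process lengths in increasing order; each codeword is the previous one
    plus 1, left-shifted to the new length (canonical code)."""
    assert kraft_sum(lengths) <= 1
    code, value, prev_len = [], 0, 0
    for l in sorted(lengths):
        value <<= (l - prev_len)
        code.append(format(value, f"0{l}b"))
        value += 1
        prev_len = l
    return code

print(kraft_sum([1, 2, 3, 3]))                 # 1.0
print(prefix_code_from_lengths([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```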

Proof sketch: embed the code tree in the full d-ary tree of depth l_max. A codeword at depth l_i has d^{l_max − l_i} leaf descendants at depth l_max, these leaf sets are disjoint, and some leaves may lie outside the code, so
  ∑_i d^{l_max − l_i} ≤ d^{l_max}

The Entropy Lower Bound

❑ The expected code length L of any prefix code for a PMF p with probabilities p_1, p_2, ..., p_m satisfies
  L(C) ≥ H_d(p) = H(p)/log d   (d-ary entropy, in “dits/symbol”)
  with equality iff {p_i} = {d^{−l_i}}
  – Proof: Let c = ∑_i d^{−l_i} ≤ 1 (Kraft), q_i = c^{−1} d^{−l_i} (normalized distribution). Then
    L − H_d(p) = ∑_i p_i l_i + ∑_i p_i log_d p_i = −∑_i p_i log_d d^{−l_i} + ∑_i p_i log_d p_i
               = ∑_i p_i log_d (p_i/q_i) + log_d (1/c) = D_d(p||q) + log_d (1/c) ≥ 0 ∎

From now on, we assume d = 2 for simplicity (binary codes)

Lossless Source Coding

2. Basic coding techniques

The Shannon Code

❑ The lower bound L(C) ≥ H(p) would be attained if we could have a code with lengths l_i = −log p_i. But the l_i must be integers, and the −log p_i generally are not
❑ Simple approximation: take l_i = ⌈−log p_i⌉
  The lengths satisfy Kraft: ∑_i 2^{−l_i} ≤ ∑_i 2^{log p_i} = ∑_i p_i = 1 ⇒ there is a prefix code with these lengths (Shannon code)

❑ Optimal for dyadic distributions: all p_i’s powers of 2 ⇒ L = H(p)
  • not optimal in general
❑ In general, the Shannon code satisfies
  L = ∑_i p_i ⌈−log p_i⌉ ≤ ∑_i p_i (−log p_i + 1) = H(p) + 1
  ⇒ the optimal prefix code satisfies H(p) ≤ L ≤ H(p) + 1
❑ The upper bound cannot be improved: L ≥ l_min ≥ 1, but we can have H(p) → 0
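A sketch computing Shannon code lengths and checking the bounds above (helper names are illustrative):

```python
from math import ceil, log2

def shannon_lengths(p):
    """Codeword lengths l_i = ceil(-log2 p_i) of the Shannon code."""
    return [ceil(-log2(pi)) for pi in p]

p = [0.4, 0.3, 0.2, 0.1]
lengths = shannon_lengths(p)                    # [2, 2, 3, 4]
kraft = sum(2 ** -l for l in lengths)           # <= 1, so a prefix code exists
H = -sum(pi * log2(pi) for pi in p)
L = sum(pi * li for pi, li in zip(p, lengths))
print(lengths, kraft)
print(H, L)                                     # H ~ 1.85 <= L = 2.4 <= H + 1
```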

Huffman Codes

❑ Shannon codes are very simple but generally sub-optimal. In 1952, Huffman presented a construction of optimal prefix codes.

  Huffman algorithm. Given p_1, p_2, ..., p_m:
  1. k ← m+1
  2. find the smallest pair of unused p_i, p_j
  3. form p_k = p_i + p_j
  4. mark p_i, p_j ‘used’
  5. if the only unused probability is p_k, stop
  6. k ← k + 1, go to 2

  Construction by example (probabilities .05, .15, .15, .15, .20, .30):
    p(x)   C(x)   l(x)
    .05    000    3
    .15    001    3
    .15    100    3
    .15    101    3
    .20    01     2
    .30    11     2
    L = 2.5,  H = 2.433...
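A compact sketch of the merging procedure using a heap; it returns code lengths only (ties may be broken differently than in the example, but the expected length is the same):

```python
import heapq

def huffman_lengths(probs):
    """Huffman codeword length of each symbol, obtained by repeatedly
    merging the two smallest-probability items (binary Huffman)."""
    heap = [(p, [i]) for i, p in enumerate(probs)]  # (subtree prob, its symbols)
    lengths = [0] * len(probs)
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, syms1 = heapq.heappop(heap)
        p2, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:        # every symbol in the merged subtree
            lengths[s] += 1            # moves one level deeper in the code tree
        heapq.heappush(heap, (p1 + p2, syms1 + syms2))
    return lengths

p = [0.05, 0.15, 0.15, 0.15, 0.20, 0.30]
lengths = huffman_lengths(p)
print(lengths)                                      # [3, 3, 3, 3, 2, 2]
print(sum(pi * li for pi, li in zip(p, lengths)))   # L = 2.5
```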

Huffman Codes

Theorem: Codes constructed with the Huffman algorithm are optimal; i.e., if C* is a Huffman code for a PMF p, and C is a prefix code with the same number of words, then L_p(C*) ≤ L_p(C).

  • Let p_1 ≥ p_2 ≥ ... ≥ p_m be the probabilities in p
  Lemma: For any PMF, there is an optimal prefix code satisfying
  1. p_i > p_j ⇒ l_i ≤ l_j
  2. the two longest codewords have the same length, they differ only in the last bit, and they correspond to the least likely symbols
  (Huffman codes satisfy the Lemma by construction)

Proof of the Theorem: By induction on m. Trivial for m = 2. Let C_m be a Huffman code for p. W.l.o.g., the first step in the construction of C_m merged p_m and p_{m−1}. Clearly, the remaining steps constructed a Huffman code C_{m−1} for a PMF p’ with probabilities p_1, p_2, ..., p_{m−2}, p_{m−1}+p_m. Now,
  L(C_{m−1}) = ∑_{i=1}^{m−2} l_i p_i + (l_{m−1} − 1)(p_{m−1} + p_m) = L(C_m) − p_{m−1} − p_m

Let C’_m be an optimal code for p satisfying the Lemma. Applying the same merging to C’_m, we obtain a code C’_{m−1} for p’, with L(C’_m) = L(C’_{m−1}) + p_{m−1} + p_m. Since C_{m−1} is optimal (by induction), we must have L(C’_{m−1}) ≥ L(C_{m−1}) ⇒ L(C’_m) ≥ L(C_m) ∎

Redundancy of Huffman Codes

❑ Redundancy: excess average code length over entropy
  • the redundancy of a Huffman code for a PMF p satisfies 0 ≤ L(C) − H(p) ≤ 1
  • the redundancy can get arbitrarily close to 1 when H(p) → 0, but how large is it typically?
❑ Gallager [1978] proved
  L(C) − H(p) ≤ P_1 + c
  where P_1 is the probability of the most likely symbol, and c = 1 − log e + log log e ≈ 0.086.
  For P_1 ≥ 1/2,
  L(C) − H(p) ≤ 2 − H_2(P_1) − P_1 ≤ P_1

  (for the example of the previous slide: r = L − H = 0.067, Gallager bound = 0.386)
❑ Precise characterization of the Huffman redundancy has been a very difficult problem
  • most recent results in [Szpankowski, IEEE IT ’01]

A Coding Theorem

❑ For a sequence of symbols from a data source, the per-symbol redundancy can be reduced by using an alphabet extension
  A^n = { (a_1, a_2, ..., a_n) | a_i ∈ A }
  and an optimal code C^n for super-symbols (X_1, X_2, ..., X_n) ~ p(x_1, x_2, ..., x_n).
  Then, H(X_1, X_2, ..., X_n) ≤ L(C^n) ≤ H(X_1, X_2, ..., X_n) + 1. Dividing by n, we get:

Coding Theorem (Shannon): The minimum expected codeword length per symbol satisfies
  (1/n) H(X_1, X_2, ..., X_n) ≤ L*_n ≤ (1/n) H(X_1, X_2, ..., X_n) + 1/n.
Furthermore, if X^∞ is a random process with an entropy rate, then
  L*_n → H(X^∞) as n → ∞

Shannon tells us that there are codes that attain the fundamental compression limits asymptotically. But, how do we get there in practice?

Ideal Code Length

❑ A probability assignment is a function P : A* → [0,1] satisfying
  ∑_{a∈A} P(sa) = P(s) ∀ s ∈ A*, with P(λ) = 1
❑ P is not a PMF on A*, but it is a PMF on any complete subset of A*
  • complete subset = leaves of a complete tree rooted at λ, e.g., A^n
❑ The ideal code length for a string x_1^n relative to P is defined as
  l*(x_1^n) = −log P(x_1^n)
❑ The Shannon code attains the ideal code length for every string x_1^n, up to an integer-constraint excess o(1) which we shall ignore
  • notice that attaining the ideal code length point-wise for every string is a stronger requirement than attaining the entropy on the average
❑ The Shannon code, as defined, is infeasible in practice (as would be a Huffman code on A^n for large n)
  • while the code length for x_1^n is relatively easy to compute given P(x_1^n), it is not clear how the codeword assignment proceeds
  • as defined, it appears that one needs to look at the whole x_1^n before encoding; we would like to encode sequentially as we get the x_i
  the evolution that led to the solution of both issues ⇒ arithmetic coding

The Shannon-Fano Code

❑ A codeword assignment for the Shannon code
  • Let X ~ P(x) take values in M = {0,1,...,m−1}, with P(0) ≥ P(1) ≥ ... ≥ P(m−1) > 0
  • Define F(x) = ∑_{a<x} P(a), x ∈ M   (F is strictly increasing)
  • C(x) = the first l_x = ⌈−log P(x)⌉ bits of the binary expansion of F(x)

  Example:
    x   P      F      l_x   C(x)
    0   0.5    0      1     .0
    1   0.25   0.5    2     .10
    2   0.125  0.75   3     .110
    3   0.125  0.875  3     .111

Elias Coding - Arithmetic Coding

❑ To encode x_1^n we take M = A^n, ordered lexicographically
  • to compute F(x_1^n) directly, we would need to add an exponential number of probabilities, and compute with huge precision -- infeasible
❑ Sequential probability assignment
  P(x_1^n) = P(x_1^{n−1}) P(x_n | x_1^{n−1})   ← what the model will provide at each step
❑ Sequential encoding
  F(x_1^n) = ∑_{y_1^n < x_1^n} P(y_1^n) = ∑_{y_1^{n−1} < x_1^{n−1}} P(y_1^{n−1}) + ∑_{y < x_n} P(x_1^{n−1} y) = F(x_1^{n−1}) + ∑_{y < x_n} P(x_1^{n−1} y)
  (the string x_1^n is represented by the interval [F(x_1^n), F(x_1^n + 1)), whose width is P(x_1^n))

Arithmetic Coding - Example

P(0) = 0.25, P(1) = 0.75 (static i.i.d. model); input sequence 1 1 1 1 0
  Successive intervals (starting from [0,1), with “0” taking the lower sub-interval):
  [0,1) → [0.25, 1) → [0.4375, 1) → [0.578125, 1) → [0.68359, 1) → [0.68359, 0.762695)
  P(x_1^5) = (3/4)^4 · (1/4) = 81/1024 ≈ 0.0791
  No. of code bits = ⌈−log_2 P(x_1^5)⌉ = 4;  Code = .1100   (traced in the sketch below)
❑ Computational challenges
  • precision of floating-point operations – register length
  • the active interval shrinks, but small numerical changes can lead to changes in many bits of the binary representation – carry-over problem
  • encoding/decoding delay – how many cycles does it take from the moment a digit enters the encoder until it can be output by the decoder?
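The interval arithmetic of this example can be traced with exact rationals, sidestepping for the moment the precision issues just listed (a sketch; symbol ‘0’ takes the lower part of the interval, as in the figure):

```python
from fractions import Fraction
from math import ceil, log2

def elias_intervals(bits, p0):
    """Successive [low, low + width) intervals of the Elias/arithmetic coder
    for a binary i.i.d. model with P(0) = p0."""
    low, width = Fraction(0), Fraction(1)
    for b in bits:
        if b == "0":
            width *= p0               # keep the lower fraction p0 of the interval
        else:
            low += width * p0         # skip past the '0' sub-interval
            width *= 1 - p0
        print(b, float(low), float(low + width))
    return low, width

low, width = elias_intervals("11110", Fraction(1, 4))
print(float(width))              # P(x) = 81/1024 ~ 0.0791
print(ceil(-log2(width)))        # 4 code bits; .1100 = 0.75 lies in the final interval
```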

Arithmetic Coding

❑ Arithmetic coding [Elias ca. ‘60, Rissanen ‘75, Pasco ‘76] solves the problems of precision and carry-over in the sequential computation of F(x_1^n), making it practical with bounded delay and modest memory requirements
  • refinements and contributions by many researchers in the past 25 years
❑ When carefully designed, AC attains a code length
  −log P(x_1^n) + O(1),
  i.e., ideal up to an additive constant
❑ It reduces the lossless compression problem to one of finding the best probability assignment for the given data x_1^n, the one which will provide the shortest ideal code length

the problem is not to find the best code for a given probability distribution, it is to find the best probability assignment for the data at hand

Lossless Source Coding

4. Lempel-Ziv coding

The Lempel-Ziv Algorithms

❑ A family of data compression algorithms first presented in

[LZ77] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” IEEE Trans. Inform. Theory, vol. IT-23, pp. 337–343, May 1977.

[LZ78] J. Ziv and A. Lempel, “Compression of individual sequences via variable rate coding,” IEEE Trans. Inform. Theory, vol. IT-24, pp. 530–536, Sept. 1978.

❑ Many desirable features, the conjunction of which was unprecedented
  • simple and elegant
  • universal for individual sequences in the class of finite-state encoders
    – arguably, every real-life computer is a finite-state automaton
  • convergence to the entropy for stationary ergodic sources
  • string matching and dictionaries, no explicit probability model
  • very practical, with fast and effective implementations applicable to a wide range of data types

Two Main Variants

❑ [LZ77] and [LZ78] present different algorithms with common elements
  • The main mechanism in both schemes is pattern matching: find string patterns that have occurred in the past, and compress them by encoding a reference to the previous occurrence

... r e s t o r e o n e s t o n e ...

❑ Both schemes are in wide practical use
  • many variations exist on each of the major schemes
  • we focus on LZ78, which admits a simpler analysis with a stronger result. The proof here follows [Cover & Thomas ‘91], attributed to [Wyner & Ziv]. It differs from the original proof in [LZ78].
  • the scheme is based on the notion of incremental parsing

Incremental Parsing

❑ Parse the input sequence x_1^n into phrases, each new phrase being the shortest substring that has not appeared so far in the parsing
  x_1^n = 1, 0, 11, 01, 010, 00, 10, ...   (assume A = {0,1}; phrases numbered 1–7)
❑ Each new phrase is of the form wb, w = a previous phrase, b ∈ {0,1}
  • a new phrase can be described as (i,b), where i = index(w) (phrase #; phrase #0 = λ)
  • in the example: (0,1), (0,0), (1,1), (2,1), (4,0), (2,0), (1,0)   (reproduced in the sketch below)
  • let c(n) = number of phrases in x_1^n
  • a phrase description takes ≤ 1 + log c(n) bits
  • in the example, 28 bits to describe 13 input bits: bad deal! it gets better as n → ∞
  • decoding is straightforward
  • in practice, we do not need to know c(n) before we start encoding
    – use increasing-length codes that the decoder can keep track of
  Lemma:  c(n) ≤ n / ((1 − ε_n) log n),  where ε_n → 0 as n → ∞
  Proof idea: c(n) is maximal when we take all phrases as short as possible. Let n_k = ∑_{j=1}^{k} j 2^j = (k−1) 2^{k+1} + 2, with n_k ≤ n
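A sketch of incremental parsing that reproduces the pointer list of the example (the dictionary maps each parsed phrase to its index; phrase #0 is the empty string λ):

```python
def lz78_parse(x):
    """Incremental (LZ78) parsing: return the list of (index(w), b) pairs,
    where each new phrase is w + b and w is a previously parsed phrase."""
    dictionary = {"": 0}              # phrase #0 = the empty string
    pairs, w = [], ""
    for b in x:
        if w + b in dictionary:       # keep extending the current phrase
            w += b
        else:
            pairs.append((dictionary[w], b))
            dictionary[w + b] = len(dictionary)
            w = ""
    if w:                             # leftover (incomplete) last phrase
        pairs.append((dictionary[w[:-1]], w[-1]))
    return pairs

print(lz78_parse("1011010100010"))
# [(0,'1'), (0,'0'), (1,'1'), (2,'1'), (4,'0'), (2,'0'), (1,'0')]
```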

Universality of LZ78

❑ Let
  Q_k(x_{−(k−1)}, ..., x_{−1}, x_0, x_1, ..., x_n) = Q(x_{−(k−1)}^0) ∏_{j=1}^{n} Q(x_j | x_{j−k}^{j−1})
  be any k-th order Markov probability assignment for x_1^n, with arbitrary initial state (x_{−(k−1)}, ..., x_0)
❑ Assume x_1^n is parsed into distinct phrases y_1, y_2, ..., y_c. Define:
  • v_i = index of the start of y_i, i.e., y_i = (x_{v_i}, ..., x_{v_{i+1}−1})
  • s_i = (x_{v_i−k}, ..., x_{v_i−1}) = the k bits preceding y_i in x_1^n;  s_1 = (x_{−(k−1)}, ..., x_0)
  • c_{ls} = number of phrases y_i of length l and preceding state s ∈ {0,1}^k
  • we have ∑_{l,s} c_{ls} = c and ∑_{l,s} l·c_{ls} = n
  (where are we going with all this?)

Ziv’s inequality: For any distinct parsing of x_1^n, and any Q_k, we have

  log Q_k(x_1, ..., x_n | s_1) ≤ −∑_{l,s} c_{ls} log c_{ls}

The lemma upper-bounds the probability of any sequence under any probability assignment from the class, based on properties of any distinct parsing of the sequence (including the incremental parsing)

Universality of LZ78 (proof of Ziv’s inequality)

Proof of Ziv’s inequality:

  Q_k(x_1, x_2, ..., x_n | s_1) = Q(y_1, y_2, ..., y_c | s_1) = ∏_{i=1}^{c} Q(y_i | s_i)
  log Q_k(x_1, x_2, ..., x_n | s_1) = ∑_{i=1}^{c} log Q(y_i | s_i)
    = ∑_{l,s} ∑_{i: |y_i|=l, s_i=s} log Q(y_i | s_i)
    = ∑_{l,s} c_{ls} ∑_{i: |y_i|=l, s_i=s} (1/c_{ls}) log Q(y_i | s_i)
    ≤ ∑_{l,s} c_{ls} log ( (1/c_{ls}) ∑_{i: |y_i|=l, s_i=s} Q(y_i | s_i) )   (Jensen)

  Since the y_i are distinct, we have ∑_{i: |y_i|=l, s_i=s} Q(y_i | s_i) ≤ 1
  ⇒ log Q_k(x_1, x_2, ..., x_n | s_1) ≤ ∑_{l,s} c_{ls} log (1/c_{ls}) ∎

Universality for Individual Sequences: Theorem

Theorem: For any sequence x_1^n and any k-th order probability assignment Q_k, we have

  c(n) log c(n) / n ≤ −(1/n) log Q_k(x_1^n | s_1) + (1 + o(1)) k / log n + O(log log n / log n)

Proof: The lemma ⇒
  log Q(x_1^n | s_1) ≤ −∑_{l,s} c_{ls} log c_{ls} = −c log c − c ∑_{l,s} π_{ls} log π_{ls},   where π_{ls} ≜ c_{ls}/c

We have ∑_{l,s} π_{ls} = 1 and ∑_{l,s} l·π_{ls} = n/c. Define random variables (U,V) with P(U=l, V=s) = π_{ls}.
Then EU = n/c, and
  −(1/n) log Q(x_1^n | s_1) ≥ (c/n) log c − (c/n) H(U,V) ≥ (c/n) log c − (c/n)(H(U) + H(V))
Now, H(V) ≤ k, and by the maximum-entropy theorem for mean-constrained random variables,
  H(U) ≤ (n/c + 1) log(n/c + 1) − (n/c) log(n/c)  ⇒  (c/n) H(U,V) ≤ (c/n) k + (c/n) log(n/c) + o(1)
Recall c/n ≤ (1 + o(1))/log n  ⇒  (c/n) log(n/c) ≤ O(log log n / log n)
  ⇒ −(1/n) log Q(x_1^n | s_1) ≥ c log c / n − (1 + o(1)) k / log n − O(log log n / log n)  ∎

Universality for Individual Sequences: Discussion

❑ The theorem holds for any k-th order probability assignment Q_k, and in particular, for the k-th order empirical distribution of x_1^n, which gives an ideal code length equal to the empirical entropy

  −(1/n) log P̂(x_1^n) = Ĥ(x_1^n)

❑ The asymptotic O(log log n / log n) term in the redundancy has been improved to O(1/log n) – no better upper bound can be achieved
  • obtained with tools from renewal theory

Compressibility

❑ Finite-memory compressibility
  FM_k(x_1^n) = inf_{Q_k, s_1} ( −(1/n) log Q_k(x_1^n | s_1) )   (k-th order, finite sequence; Q_k is optimized for x_1^n, for each k)
  FM_k(x_1^∞) = limsup_{n→∞} FM_k(x_1^n)   (k-th order, infinite sequence)
  FM(x_1^∞) = lim_{k→∞} FM_k(x_1^∞)   (FM compressibility)
  (we must let n → ∞ before k → ∞, otherwise the definitions are meaningless!)
❑ Lempel-Ziv compression ratio
  LZ(x_1^n) = c(n)(log c(n) + 1) / n   (finite sequence)

  LZ(x_1^∞) = limsup_{n→∞} LZ(x_1^n)   (LZ compression ratio)

Theorem: For any sequence x_1^∞,  LZ(x_1^∞) ≤ FM(x_1^∞)

Probabilistic Setting

Theorem: Let X_{−∞}^∞ be a stationary ergodic random process. Then,

  LZ(X_1^∞) ≤ H(X_1^∞) with probability 1

Proof: via approximation of the stationary ergodic process with Markov processes of increasing order, and the previous theorems

  Q_k(x_{−(k−1)}^0, x_1^n) ≜ P_X(x_{−(k−1)}^0) ∏_{j=1}^{n} P_X(x_j | x_{j−k}^{j−1}),   X ~ P_X   (k-th order Markov approximation)

  H(X_j | X_{j−k}^{j−1}) → H(X^∞) as k → ∞

The Parsing Tree

x_1^n = 1, 0, 11, 01, 010, 00, 10, ...

  [parsing tree: root = phrase #0 (λ), one node per parsed phrase 1–7]
  phrase #   code
  0          λ
  1          (0,1)
  2          (0,0)
  3          (1,1)
  4          (2,1)
  5          (4,0)
  6          (2,0)
  7          (1,0)

  • coding could be made more efficient by “recycling” the codes of nodes that have a complete set of children in the dictionary (e.g., nodes 1, 2 above)
    – this will not affect the asymptotics
    – many (many many) tricks and hacks exist in practical implementations

The LZ Probability Assignment

x_1^n = 1, 0, 11, 01, 010, ...
❑ Slightly different tree evolution (anticipatory parsing)
❑ A weight is kept at every node
  • number of times the node was traversed through, + 1
❑ A node acts as a conditioning state, assigning to its children probabilities proportional to their weights
❑ Example: string s = 101101010
  P(0|s) = 4/7, P(1|s0) = 3/4, P(1|s01) = 1/3
  P(011|s) = (4/7)·(3/4)·(1/3) = 1/7   (notice the ‘telescoping’: P(s011) = 1/7!)
❑ In general,
  P(x_1^n) = 1/(c(n)+1)!
  −log P = c(n) log c(n) + o(c(n) log c(n))   ← the LZ code length!
  every lossless compression algorithm defines a probability assignment, even if it wasn’t meant to!

Other Properties

❑ The individual-sequences result applies also to FSM probability assignments
❑ The “worst sequence”
  • the counting sequence 0 1 00 01 10 11 000 001 010 011 100 101 110 111 ...
  • maximizes c(n) ⇒ incompressible with LZ78
❑ Generalization to larger alphabets is straightforward
❑ LZW modification: the extension symbol b is not sent; it is determined by the first symbol of the next phrase instead [Welch 1984] (see the sketch after this list)
  • the dictionary is initialized with all single-symbol strings
  • works very well in practice
  • breakthrough in the popularization of LZ, led to UNIX compress
❑ In real life we use bounded dictionaries, and need to reset them from time to time
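A minimal sketch of the LZW variant just described: the dictionary is initialized with all single-symbol strings, and only phrase indices are emitted (the extension symbol is recovered by the decoder from the first symbol of the next phrase; helper name and output format are illustrative):

```python
def lzw_encode(data, alphabet="01"):
    """LZW encoding sketch: emit the dictionary index of each longest match;
    the extended phrase (match + next symbol) is added to the dictionary
    but the extension symbol itself is never transmitted."""
    dictionary = {c: i for i, c in enumerate(alphabet)}
    w, output = "", []
    for c in data:
        if w + c in dictionary:
            w += c                               # keep extending the match
        else:
            output.append(dictionary[w])         # emit index of longest match
            dictionary[w + c] = len(dictionary)  # new phrase, symbol not sent
            w = c
    if w:
        output.append(dictionary[w])
    return output

print(lzw_encode("1011010100010"))   # a list of phrase indices
```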

Lempel-Ziv 77

❑ Exhaustive parsing, as opposed to incremental
  • a new phrase is formed by the longest match anywhere in a finite past window, plus the new symbol
  • a pointer to the location of the match, its length, and the new symbol are sent
❑ Has a weaker proof of universality, but actually works better in practice

  1, 0, 11, 010, 100, 010101, 11011, ...
        (2,1) (3,2) (4,2) (6,5) (14,5)
  pointer = (offset back, match length)
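A sketch of the exhaustive-parsing step; the (offset back, match length, next symbol) triples follow one common convention, and ties between equally long matches may be broken differently than on the slide, but the phrase boundaries for this example come out the same:

```python
def lz77_parse(x, window=1 << 12):
    """Greedy LZ77-style parsing: at each position, find the longest match of
    the upcoming text starting within the last `window` symbols, and emit
    (offset back, match length, next symbol)."""
    i, out = 0, []
    while i < len(x):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):   # candidate match starts
            l = 0
            while i + l < len(x) - 1 and x[j + l] == x[i + l]:
                l += 1                           # the match may run past i
            if l > best_len:                     # (self-referential matches ok)
                best_off, best_len = i - j, l
        out.append((best_off, best_len, x[i + best_len]))
        i += best_len + 1
    return out

# 7 phrases: 1 | 0 | 11 | 010 | 100 | 010101 | 11011
print(lz77_parse("101101010001010111011"))
```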

Lempel-Ziv in the Real World

❑ The most popular data compression algorithm in use
  • virtually every computer in the world runs some variant of LZ
  • LZ78
    – compress
    – GIF
    – TIFF
    – V.42 modems
  • LZ77
    – gzip, pkzip (LZ77 + Huffman for pointers and symbols)
    – png
  • many more implementations in software and hardware
    – MS Windows dll - software distribution
    – tape drives
    – printers
    – network routers
    – various commercially available VLSI designs
    – ...
