Information Theory, Lecture 10: Differential Entropy

Stefan Höst

Differential entropy

Definition
Let X be a real-valued continuous random variable with probability density function f(x). The differential entropy is

H(X) = \mathrm{E}[-\log f(X)] = -\int_{\mathbb{R}} f(x)\log f(x)\,dx

where the convention 0 log 0 = 0 is used.

Sometimes the notation H(f) is also used in the literature.

Interpretation
Differential entropy cannot be interpreted as uncertainty; in particular, it can be negative (see the uniform example below).


Example
For a uniform distribution, f(x) = 1/a, 0 ≤ x ≤ a, the differential entropy is

H(X) = -\int_0^a \frac{1}{a}\log\frac{1}{a}\,dx = \log a

Note that H(X) < 0 for 0 < a < 1.
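A quick numerical check of this example (a sketch, not part of the original slides; it uses natural logarithms, so the entropy is in nats, and the printed value is negative for a < 1):

```python
import numpy as np

# Differential entropy of Uniform(0, a) by numerical integration (Riemann sum):
# H(X) = -int_0^a (1/a) log(1/a) dx = log(a)
for a in [0.5, 1.0, 2.0]:
    dx = a / 100_000
    x = np.arange(dx / 2, a, dx)        # midpoints of the integration grid
    f = np.full_like(x, 1.0 / a)        # uniform density on (0, a)
    H = -np.sum(f * np.log(f)) * dx     # numerical -∫ f log f dx
    print(f"a={a}: numerical H={H:.4f}, log(a)={np.log(a):.4f}")
```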

Example
For a Gaussian (normal) distribution, N(µ, σ), the differential entropy is

H(X) = -\int_{\mathbb{R}} f(x)\log f(x)\,dx = \frac{1}{2}\log(2\pi e\sigma^2)
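The closed form can also be checked by estimating E[−log f(X)] from samples (a Monte Carlo sketch, not part of the slides; natural logarithms, so nats):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0

# Sample X ~ N(mu, sigma^2) and average -log f(X)
x = rng.normal(mu, sigma, size=1_000_000)
log_f = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

H_mc = -log_f.mean()                                 # Monte Carlo estimate of E[-log f(X)]
H_exact = 0.5 * np.log(2 * np.pi * np.e * sigma**2)  # 0.5*log(2*pi*e*sigma^2)
print(f"Monte Carlo: {H_mc:.4f}, closed form: {H_exact:.4f}")
```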

Translation and scaling

Theorem

• Let Y = X + c. Then f_Y(y) = f_X(y − c) and

  H(Y) = H(X)

• Let Y = αX with α > 0. Then f_Y(y) = (1/α) f_X(y/α) and

  H(Y) = H(X) + log α
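The scaling rule can be verified by the substitution x = y/α, α > 0 (a short derivation, not written out on the slide):

H(Y) = -\int_{\mathbb{R}} \frac{1}{\alpha} f_X\Big(\frac{y}{\alpha}\Big)\log\Big(\frac{1}{\alpha} f_X\Big(\frac{y}{\alpha}\Big)\Big)\,dy = -\int_{\mathbb{R}} f_X(x)\big(\log f_X(x) - \log\alpha\big)\,dx = H(X) + \log\alpha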

Gaussian distribution

Lemma
Let g(x) be the density function of X ∼ N(µ, σ), and let f(x) be an arbitrary density function with the same mean µ and variance σ². Then

\int_{\mathbb{R}} f(x)\log g(x)\,dx = \int_{\mathbb{R}} g(x)\log g(x)\,dx
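The lemma holds because log g(x) is a quadratic polynomial in x, so the integral depends only on the mean and variance of the weighting density, and f and g share these (a sketch of the standard argument, in nats, not written out on the slide):

\int_{\mathbb{R}} f(x)\log g(x)\,dx = \int_{\mathbb{R}} f(x)\Big(-\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}\Big)\,dx = -\tfrac{1}{2}\log(2\pi\sigma^2) - \tfrac{1}{2}

and the same value results when f is replaced by g.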

Differential entropy

Definition
The joint differential entropy of a two-dimensional random variable (X, Y) with joint density function f(x, y) is

H(X,Y) = \mathrm{E}[-\log f(X,Y)] = -\int_{\mathbb{R}^2} f(x,y)\log f(x,y)\,dx\,dy

Definition
The multi-dimensional joint differential entropy of an n-dimensional random vector (X_1, …, X_n) with joint density function f(x_1, …, x_n) is

H(X_1,\dots,X_n) = -\int_{\mathbb{R}^n} f(x_1,\dots,x_n)\log f(x_1,\dots,x_n)\,dx_1\cdots dx_n
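As a standard worked example (not on the slide): for a jointly Gaussian vector (X_1, …, X_n) with covariance matrix K, the joint differential entropy is

H(X_1,\dots,X_n) = \frac{1}{2}\log\big((2\pi e)^n \det K\big)

which reduces to the one-dimensional Gaussian formula when n = 1.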


Definition
The mutual information for a pair of continuous random variables with joint probability density function f(x, y) is

I(X;Y) = \mathrm{E}\Big[\log\frac{f(X,Y)}{f(X)f(Y)}\Big] = \int_{\mathbb{R}^2} f(x,y)\log\frac{f(x,y)}{f(x)f(y)}\,dx\,dy

The mutual information can be expressed as

I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

where

H(X|Y) = -\int_{\mathbb{R}^2} f(x,y)\log f(x|y)\,dx\,dy
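A numerical illustration of the definition (a sketch, not part of the slides): for a bivariate Gaussian with unit variances and correlation ρ, the closed form is I(X;Y) = −½ log(1 − ρ²) in nats, and a Monte Carlo average of log f(x,y)/(f(x)f(y)) reproduces it.

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.8, 1_000_000

# Sample a bivariate Gaussian: zero means, unit variances, correlation rho
z1, z2 = rng.standard_normal(n), rng.standard_normal(n)
x, y = z1, rho * z1 + np.sqrt(1 - rho**2) * z2

def log_f1(t):                    # log of the N(0,1) marginal density
    return -0.5 * np.log(2 * np.pi) - t**2 / 2

def log_f2(x, y, rho):            # log of the bivariate Gaussian density
    q = (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
    return -np.log(2 * np.pi * np.sqrt(1 - rho**2)) - q / 2

# Monte Carlo estimate of E[log f(X,Y) / (f(X) f(Y))]
I_mc = np.mean(log_f2(x, y, rho) - log_f1(x) - log_f1(y))
I_exact = -0.5 * np.log(1 - rho**2)
print(f"Monte Carlo: {I_mc:.4f}, closed form: {I_exact:.4f}")
```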

Relative entropy

Definition
The relative entropy for a pair of continuous random variables with probability density functions f(x) and g(x) is

D(f\|g) = \mathrm{E}_f\Big[\log\frac{f(X)}{g(X)}\Big] = \int_{\mathbb{R}} f(x)\log\frac{f(x)}{g(x)}\,dx

Theorem
The relative entropy is non-negative,

D(f\|g) \geq 0

with equality if and only if f(x) = g(x) for all x.
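As a standard worked example (not on the slide): for two Gaussian densities f = N(µ_1, σ_1) and g = N(µ_2, σ_2), the relative entropy in nats is

D(f\|g) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}

which is zero exactly when µ_1 = µ_2 and σ_1 = σ_2, in agreement with the theorem.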

Mutual information

Corollary
The mutual information is non-negative, i.e.

I(X;Y) = H(X) - H(X|Y) \geq 0

with equality if and only if X and Y are independent.

Corollary
The entropy does not increase by conditioning on side information, i.e.

H(X|Y) \leq H(X)

with equality if and only if X and Y are independent.

Chain rule

Theorem
The chain rule for differential entropy states that

H(X_1, X_2, \dots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \dots, X_{i-1})
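For n = 2 this is just the factorisation f(x, y) = f(x)f(y|x) (a short derivation, not written out on the slide):

H(X,Y) = \mathrm{E}[-\log f(X,Y)] = \mathrm{E}[-\log f(X)] + \mathrm{E}[-\log f(Y|X)] = H(X) + H(Y|X)

and the general case follows by repeating the argument.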

Corollary
From the chain rule it follows that

H(X_1, X_2, \dots, X_n) \leq \sum_{i=1}^{n} H(X_i)

with equality if and only if X_1, X_2, …, X_n are independent.

Gaussian distribution

Theorem
The Gaussian distribution maximises the differential entropy over all distributions with mean µ and variance σ².
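This follows by combining the lemma for the Gaussian density with the non-negativity of the relative entropy (a proof sketch, not spelled out on the slide). Let g be the N(µ, σ) density and f any density with the same mean and variance:

0 \leq D(f\|g) = \int_{\mathbb{R}} f(x)\log f(x)\,dx - \int_{\mathbb{R}} f(x)\log g(x)\,dx = -H(f) - \int_{\mathbb{R}} g(x)\log g(x)\,dx = H(g) - H(f)

so H(f) ≤ H(g) = ½ log(2πeσ²).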

Continuous vs Discrete

Theorem

Let X be a continuous r.v. with density f_X(x). Construct a discrete (quantized) version X^∆, where

p(x_k^{\Delta}) = \int_{k\Delta}^{(k+1)\Delta} f_X(x)\,dx = \Delta f_X(x_k)

for some x_k ∈ [k∆, (k+1)∆] (mean value theorem).

Then, in general, \lim_{\Delta\to 0} H(X^{\Delta}) does not exist.
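A numerical illustration (a sketch, not part of the slides): for a well-behaved density one has H(X^∆) ≈ H(X) − log ∆, so H(X^∆) itself grows without bound as ∆ → 0 while H(X^∆) + log ∆ approaches the differential entropy. The script below quantizes a standard Gaussian (natural logarithms, so nats):

```python
import numpy as np

def quantized_entropy(pdf, delta, lo=-10.0, hi=10.0):
    """Entropy (in nats) of the quantized variable X^Delta.

    Each bin [k*delta, (k+1)*delta) gets probability ~ delta * pdf(midpoint),
    approximating the integral of the density over the bin.
    """
    edges = np.arange(lo, hi, delta)
    mids = edges + delta / 2.0
    p = pdf(mids) * delta            # approximate bin probabilities
    p = p[p > 0]
    p = p / p.sum()                  # renormalise away the truncated tails
    return -np.sum(p * np.log(p))

# Standard Gaussian density; its differential entropy is 0.5*log(2*pi*e) ~ 1.4189 nats
gauss = lambda t: np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi)
h_true = 0.5 * np.log(2.0 * np.pi * np.e)

for delta in [1.0, 0.1, 0.01, 0.001]:
    H = quantized_entropy(gauss, delta)
    # H diverges as delta -> 0, but H + log(delta) approaches the differential entropy
    print(f"delta={delta:7}: H={H:8.4f}  H+log(delta)={H + np.log(delta):7.4f}  h(X)={h_true:.4f}")
```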

Theorem

Let X and Y be continuous r.v. with densities f_X(x) and f_Y(y). Construct discrete versions X^∆ and Y^δ, where

p(x_k^{\Delta}) = \Delta f(x_k), \quad f(x_k) = \sum_{\ell} \delta f(x_k, y_\ell)

p(y_\ell^{\delta}) = \delta f(y_\ell), \quad f(y_\ell) = \sum_{k} \Delta f(x_k, y_\ell)

Then

\lim_{\Delta,\delta\to 0} I(X^{\Delta}; Y^{\delta}) = I(X; Y)