Information Theory

Entropy: For a discrete random variable X, entropy is defined as

H(X) = −∑_i P(X = a_i) log P(X = a_i),

where a_i are the possible states of X.

• entropy is small when the probabilities P(X = a_i) are close to 0 or 1, and large otherwise;

• H gives the average amount of information gained from observing an outcome of the variable;

• entropy is largest for a uniformly distributed variable (see the numerical sketch below).
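A minimal numerical sketch of these properties in Python (natural logarithms, so entropy is in nats; the helper name discrete_entropy is ours, not from the notes):

    import numpy as np

    def discrete_entropy(p):
        # H(X) = -sum_i p_i log p_i, with the convention 0 log 0 = 0
        p = np.asarray(p, dtype=float)
        nz = p[p > 0]
        return -np.sum(nz * np.log(nz))

    print(discrete_entropy([0.99, 0.01]))              # nearly deterministic: small entropy
    print(discrete_entropy([0.5, 0.5]))                # uniform over 2 states: log 2 ≈ 0.693
    print(discrete_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 states: log 4, the maximum for 4 states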

Mutual information: One can measure how much information is shared between two variables. The mutual information I between two scalar r.v.'s X and Y is defined as

I(X, Y) = H(X) + H(Y) − H(X, Y),

where H(Y) = −∑_j P(Y = b_j) log P(Y = b_j) and b_j are the possible states of Y. The joint entropy H(X, Y) is given by

H(X, Y) = −∑_i ∑_j P(X = a_i, Y = b_j) log P(X = a_i, Y = b_j)

• mutual information between two r.v.'s is always nonnegative;

• mutual information is zero if and only if the variables are statistically independent;

• thus mutual information can serve as a measure of dependence between the variables (a small numerical sketch follows this list).
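A small numerical sketch of these facts, computing I(X, Y) = H(X) + H(Y) − H(X, Y) for a made-up 2×2 joint table (discrete_entropy is the helper from the sketch above, repeated here so the snippet is self-contained):

    import numpy as np

    def discrete_entropy(p):
        p = np.asarray(p, dtype=float).ravel()
        nz = p[p > 0]
        return -np.sum(nz * np.log(nz))

    # hypothetical joint distribution P(X = a_i, Y = b_j)
    pxy = np.array([[0.4, 0.1],
                    [0.1, 0.4]])
    px = pxy.sum(axis=1)  # marginal of X
    py = pxy.sum(axis=0)  # marginal of Y

    mi = discrete_entropy(px) + discrete_entropy(py) - discrete_entropy(pxy)
    print(mi)  # positive, since X and Y are dependent

    # for the independent table with the same marginals, mutual information vanishes
    p_indep = np.outer(px, py)
    print(discrete_entropy(px) + discrete_entropy(py) - discrete_entropy(p_indep))  # ≈ 0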

Differential entropy: The differential entropy H of a continuous random variable x with density p_x(·) is defined as

H(x) = −∫ p_x(ξ) log p_x(ξ) dξ = ∫ f(p_x(ξ)) dξ, with f(t) = −t log t.

Example: Consider a r.v. x with a uniform pdf on the interval [0, a],

p_x(ξ) = 1/a for 0 ≤ ξ ≤ a, and 0 otherwise.

The differential entropy in this case is

H(x) = −∫_0^a (1/a) log(1/a) dξ = log a.

Thus the less random the r.v. x is (the smaller a), the smaller (more negative) its entropy is.
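A quick numerical check of this example (SciPy assumed available; the value a = 0.5 is arbitrary):

    import numpy as np
    from scipy.integrate import quad

    a = 0.5  # interval length; the entropy should equal log 0.5 < 0
    integrand = lambda xi: -(1.0 / a) * np.log(1.0 / a)  # -p(xi) log p(xi) on [0, a]
    h, _ = quad(integrand, 0.0, a)
    print(h, np.log(a))  # both ≈ -0.693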

Generalisation to the multidimensional case: if x is a random vector with density p_x, then

H(x) = −∫ p_x(ξ) log p_x(ξ) dξ

• differential entropy measures randomness: a r.v. with a pdf concentrated on small intervals yields small differential entropy;

• differential entropy can be negative;

• if f is an invertible transformation of a r.v. x, y = f(x), then

H(y) = H(x) + E[log |det J_f(x)|], where J_f(ξ) is the Jacobian of f (the matrix of partial derivatives evaluated at ξ).

• if y = Mx, then H(y) = H(x) + log |det M| (thus entropy is not invariant under linear transformations of the random variable);

• in particular, entropy is not scale-invariant, i.e. H(αx) = H(x) + log |α| (illustrated in the sketch below).
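A minimal sketch of this scale behaviour for a Gaussian r.v. (SciPy's norm(...).entropy() returns differential entropy in nats; sigma and alpha are arbitrary choices):

    import numpy as np
    from scipy.stats import norm

    sigma, alpha = 1.0, 3.0
    h_x = norm(scale=sigma).entropy()           # H(x) for x ~ N(0, sigma^2)
    h_ax = norm(scale=alpha * sigma).entropy()  # H(alpha x), since alpha x ~ N(0, (alpha sigma)^2)
    print(h_ax - h_x, np.log(abs(alpha)))       # both ≈ log 3 ≈ 1.0986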

Kullback-Leibler divergence: The Kullback-Leibler divergence between two pdf's p_1, p_2 is defined as

KL(p_1, p_2) = ∫ p_1(ξ) log ( p_1(ξ) / p_2(ξ) ) dξ

• KL is always nonnegative, and zero iff p_1 = p_2;

• KL is not a proper distance, since it is not symmetric (a numerical sketch follows).
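A small numerical sketch of the asymmetry, using the discrete analogue of the definition above (SciPy's entropy(p, q) computes KL(p, q); the two distributions are made up):

    from scipy.stats import entropy

    p1 = [0.6, 0.3, 0.1]
    p2 = [0.2, 0.5, 0.3]
    print(entropy(p1, p2))  # KL(p1, p2)
    print(entropy(p2, p1))  # KL(p2, p1): a different value in general
    print(entropy(p1, p1))  # 0, since the two distributions coincide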

Negentropy: An entropy-based measure of nongaussianity of a r.v. x, called negentropy J, is defined as

J(x) = H(x_G) − H(x),

where x_G is a Gaussian random variable with the same covariance matrix Σ as x and n is the dimension of x. The entropy of x_G is

H(x_G) = (1/2) log |det Σ| + (n/2)(1 + log 2π)

• negentropy is invariant under (invertible) linear transforms, i.e.

J(Mx) = J(x)

• in particular, it is scale-invariant (see the numerical check below).
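A small numerical check of the scale invariance for a scalar (n = 1) uniform r.v. on [0, a], whose differential entropy is log a and whose variance is a^2/12; the Gaussian entropy is taken from the formula above, and the values of a are arbitrary:

    import numpy as np

    def negentropy_uniform(a):
        h_x = np.log(a)      # H(x) for x uniform on [0, a]
        var = a ** 2 / 12.0  # variance of x, used as Sigma (n = 1)
        h_g = 0.5 * np.log(var) + 0.5 * (1.0 + np.log(2.0 * np.pi))  # H(x_G)
        return h_g - h_x

    for a in (0.5, 1.0, 10.0):
        print(a, negentropy_uniform(a))  # same value (≈ 0.176 nats) for every scale a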
