School of Information Science

Differential Entropy

2009 Term 2-2 Course

Tetsuo Asano and Tad Matsumoto Email: {t-asano, matumoto}@jaist.ac.jp

Japan Advanced Institute of Science and Technology Asahidai 1-1, Nomi, Ishikawa 923-1292, Japan http://www.jaist.ac.jp


In the previous chapters, we learned the following:

We assumed that the channel input and output both take discrete values:

We skip the transmitter and receiver!

[Block diagram: Channel Encoder → Channel (noise = error source) → Channel Decoder, with binary/non-binary finite-alphabet input and output]

We derived the Channel Coding Theorem:

(1) There exists a $(2^{nR}, n)$ code of rate R such that the maximum error probability $\lambda^{(n)}$ can be made arbitrarily small, provided that the code rate R is lower than the channel capacity C.

In this and the next chapters, we derive the following:

The Channel Coding Theorem for channels whose input and output both take analog values.

For this, we eliminate the finite alphabet assumption.

[Block diagram: Channel Encoder → Channel (+ noise) → Channel Decoder, with analog-valued input and output]

However, we still ignore the transmitter and receiver. The impact of having to use a physical transmitter and receiver is investigated in another course, “Channel Coding”.

Outline

1. Differential Entropy
   - Definition
   - Some Examples
   - Asymptotic Property
2. Relation of Differential and Discrete Entropies
3. Joint and Conditional Differential Entropy
4. Mutual Information
   - Some Important Properties

Differential Entropy (1)

Definition 9.1.1: Differential Entropy
For a continuous random variable X with probability density function f(x), the differential entropy is defined as:
$$ h(X) = -\int_{S} f(x)\log f(x)\,dx $$
where S is the set where X is defined (the support of X).

Example: Uniform Distribution
If a random variable X is uniformly distributed with density f(x) = 1/a for 0 ≤ x ≤ a, the differential entropy of X is:
$$ h(X) = -\int_{0}^{a} \frac{1}{a}\log\frac{1}{a}\,dx = \log a $$

Property: Value Range
The differential entropy can take negative values, as seen in the above example whenever a < 1.
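As a quick numerical illustration (not part of the lecture material), the sketch below approximates the defining integral for a uniform density by a Riemann sum and compares it with log a; the value a = 4 is an arbitrary choice.

```python
# A minimal Riemann-sum check of h(X) = log a for X ~ Uniform(0, a); a = 4 is arbitrary.
import numpy as np

a = 4.0                                   # support length; h(X) should be log2(4) = 2 bits
x = np.linspace(0.0, a, 100_000)          # grid over the support [0, a]
dx = x[1] - x[0]
f = np.full_like(x, 1.0 / a)              # uniform density f(x) = 1/a
h_numeric = -np.sum(f * np.log2(f)) * dx  # approximates -∫ f(x) log2 f(x) dx
print(h_numeric, np.log2(a))              # both ≈ 2.0 bits
```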

Differential Entropy (2)

Example: Normal Distribution
If a random variable X follows the Normal distribution N(0, σ²), with density
$$ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{x^2}{2\sigma^2}\right\} $$
the differential entropy of X is
$$ h(f) = -\int_{-\infty}^{\infty} f(x)\ln f(x)\,dx $$
After minor mathematical manipulations, we have:
$$ h(f) = \frac{E(x^2)}{2\sigma^2} + \frac{1}{2}\ln 2\pi\sigma^2 = \frac{1}{2} + \frac{1}{2}\ln 2\pi\sigma^2 = \frac{1}{2}\ln e + \frac{1}{2}\ln 2\pi\sigma^2 = \frac{1}{2}\ln 2\pi e\sigma^2 \ \text{[nats]} = \frac{1}{2}\log 2\pi e\sigma^2 \ \text{[bits]} $$
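A Monte Carlo sketch of the same result (not in the slides), using the fact that h(X) = E[−ln f(X)]; the value of σ and the sample size are arbitrary, so the estimate is only approximate.

```python
# Monte Carlo sketch: h(X) = E[-ln f(X)] ≈ 0.5*log2(2*pi*e*sigma^2) for X ~ N(0, sigma^2).
import numpy as np

rng = np.random.default_rng(0)
sigma = 3.0                                                  # arbitrary standard deviation
x = rng.normal(0.0, sigma, 1_000_000)                        # i.i.d. samples of X
ln_f = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)   # ln f(x) at each sample
h_mc = -np.mean(ln_f) / np.log(2)                            # E[-ln f(X)], converted to bits
h_formula = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)
print(h_mc, h_formula)                                       # both ≈ 3.63 bits
```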

Theorem 9.1.2: Asymptotic Property
Let X_1, X_2, …, X_n be a sequence of i.i.d. random variables following a density function f(x). Then, the following holds:
$$ -\frac{1}{n}\log f(X_1, X_2, \ldots, X_n) \;\xrightarrow[n\to\infty]{}\; E\bigl[-\log f(X)\bigr] = h(X) $$
Proof: Obvious (by the law of large numbers).
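The asymptotic property can be illustrated numerically; the sketch below (not part of the slides) uses i.i.d. N(0, 1) samples, an arbitrary choice, and shows the per-sample log-density average approaching h(X) as n grows.

```python
# Sketch of the asymptotic property: -(1/n) ln f(X1,...,Xn) -> h(X) for i.i.d. Xi ~ N(0, 1).
import numpy as np

rng = np.random.default_rng(1)
h_true = 0.5 * np.log(2 * np.pi * np.e)            # h(X) in nats for N(0, 1)
for n in (10, 1_000, 100_000):
    x = rng.normal(0.0, 1.0, n)
    ln_f = -0.5 * np.log(2 * np.pi) - 0.5 * x**2   # ln f(xi) for each sample
    print(n, -ln_f.sum() / n, h_true)              # the average approaches h(X) ≈ 1.4189 nats
```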

Relation of Differential and Discrete Entropies (1)

Property:
There exists a value x_i such that the probability that the random variable X takes a value between iΔ and (i+1)Δ is given by:
$$ \Pr\bigl(i\Delta \le X \le (i+1)\Delta\bigr) = f(x_i)\Delta = \int_{i\Delta}^{(i+1)\Delta} f(x)\,dx $$
where f(x) is the density function of X.
Proof: Obvious (mean value theorem).

[Figure: density f(x); the shaded area f(x_i)Δ over the interval [iΔ, (i+1)Δ]]

Relation of Differential and Discrete Entropies (2)

Definition 9.2.1: Quantized Random Variable X^Δ
Define the discrete random variable X^Δ = x_i if iΔ ≤ X ≤ (i+1)Δ. The entropy of this random variable X^Δ is given by:

$$ H(X^\Delta) = -\sum_{i=-\infty}^{\infty} p_i \log p_i, \quad \text{with } p_i = \int_{i\Delta}^{(i+1)\Delta} f(x)\,dx = f(x_i)\Delta $$
which leads to:
$$ H(X^\Delta) = -\sum_{i=-\infty}^{\infty} f(x_i)\Delta \log\bigl(f(x_i)\Delta\bigr) = -\sum_{i=-\infty}^{\infty} \Delta f(x_i)\log f(x_i) - \underbrace{\sum_{i=-\infty}^{\infty} \Delta f(x_i)}_{=1}\log\Delta = -\sum_{i=-\infty}^{\infty} \Delta f(x_i)\log f(x_i) - \log\Delta $$

Relation of Differential and Discrete Entropies (3)

Theorem 9.2.1: Discrete → Continuous
$$ H(X^\Delta) + \log\Delta = -\sum_{i=-\infty}^{\infty} \Delta f(x_i)\log f(x_i) \;\to\; h(f) = h(X), \quad \text{as } \Delta \to 0 $$
Proof: Obvious

Example: As we saw in an earlier example, the differential entropy h(X) of a random variable uniformly distributed over [0, a] is log a, so with a = 1 we have h(X) = 0. If we “quantize” this random variable, distributed over [0, 1], with an n-bit A/D converter (Δ = 2^{-n}),
$$ h(X) = H(X^\Delta) + \log\Delta = H(X^\Delta) + \log 2^{-n} = H(X^\Delta) - n = 0 $$
Therefore H(X^Δ) = n, which means that n bits are enough to describe the continuous random variable X with n-bit accuracy.
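The relation H(X^Δ) ≈ h(X) − log Δ can also be checked empirically. The sketch below (not in the slides) quantizes N(0, 1) samples with bin width Δ = 2^{-6}; the distribution, sample size, and Δ are arbitrary illustrative choices.

```python
# Sketch: for quantized N(0,1) samples with bin width delta, H(X^Δ) ≈ h(X) - log2(delta).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 2_000_000)                  # samples of a continuous X
delta = 2.0 ** -6                                    # quantization step (6 bits per unit)
bins = np.floor(x / delta).astype(np.int64)          # index i such that i*Δ <= x < (i+1)*Δ
_, counts = np.unique(bins, return_counts=True)
p = counts / counts.sum()                            # empirical pmf of the quantized variable
H_discrete = -np.sum(p * np.log2(p))                 # H(X^Δ) in bits
h_X = 0.5 * np.log2(2 * np.pi * np.e)                # differential entropy of N(0, 1) in bits
print(H_discrete, h_X - np.log2(delta))              # both ≈ 8.05 bits
```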

Joint and Conditional Differential Entropy (1)

Definition 9.3.1: Joint Differential Entropy

The differential entropy of a set X1, X2, …, Xn of random variables following the density function f(x1, x2, …, xn) is defined as:

$$ h(X_1, X_2, \ldots, X_n) = -\int f(x_1, x_2, \ldots, x_n)\log f(x_1, x_2, \ldots, x_n)\,dx_1\,dx_2\cdots dx_n $$

Definition 9.3.2: Conditional Differential Entropy
The conditional differential entropy of random variables X and Y is defined as:
$$ h(X\mid Y) = -\int f(x,y)\log f(x\mid y)\,dx\,dy = -\int f(x,y)\log\frac{f(x,y)}{f(y)}\,dx\,dy = h(X,Y) - h(Y) $$
where f(x, y) is the joint density of the random variables X and Y.
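As an illustration (not in the slides), the sketch below checks h(X|Y) = h(X,Y) − h(Y) by Monte Carlo for a jointly Gaussian pair with correlation ρ, against the closed form ½ ln(2πe(1−ρ²)) that follows from X|Y=y ~ N(ρy, 1−ρ²) and the single-variable Gaussian entropy derived earlier.

```python
# Monte Carlo sketch of h(X|Y) = h(X,Y) - h(Y) for a jointly Gaussian (X, Y), unit variances.
import numpy as np

rng = np.random.default_rng(3)
rho = 0.8                                              # arbitrary correlation coefficient
K = np.array([[1.0, rho], [rho, 1.0]])                 # covariance matrix of (X, Y)
Kinv = np.linalg.inv(K)
xy = rng.multivariate_normal([0.0, 0.0], K, 500_000)
y = xy[:, 1]

ln_fxy = (-np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(K))
          - 0.5 * np.einsum('ij,jk,ik->i', xy, Kinv, xy))    # ln of the joint density
ln_fy = -0.5 * np.log(2 * np.pi) - 0.5 * y**2                # ln of the marginal density of Y
h_cond_mc = -np.mean(ln_fxy) + np.mean(ln_fy)                # h(X,Y) - h(Y), in nats
h_cond_exact = 0.5 * np.log(2 * np.pi * np.e * (1 - rho**2)) # since X|Y=y ~ N(rho*y, 1-rho^2)
print(h_cond_mc, h_cond_exact)
```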

Joint and Conditional Differential Entropy (2)

Theorem 9.3.1: Normal Distribution

The differential entropy of random variables X1, X2, …, Xn following the multivariate normal density function

$$ f(x_1, x_2, \ldots, x_n) = f(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}\,|K|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T} K^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\} $$
is given by
$$ h(X_1, X_2, \ldots, X_n) = h\bigl(\mathcal{N}_n(\boldsymbol{\mu}, K)\bigr) = \frac{1}{2}\log\bigl((2\pi e)^{n}\,|K|\bigr) \ \text{bits} $$
where $\boldsymbol{\mu} = E(\mathbf{x})$ and
$$ K = E\bigl[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^{T}\bigr] = E\left[\begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \\ \vdots \\ x_n-\mu_n \end{pmatrix}\bigl((x_1-\mu_1)\ (x_2-\mu_2)\ \cdots\ (x_n-\mu_n)\bigr)\right] $$
Proof:

$$ h(f) = -\int f(\mathbf{x})\ln f(\mathbf{x})\,d\mathbf{x} = -\int f(\mathbf{x})\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T} K^{-1}(\mathbf{x}-\boldsymbol{\mu}) - \ln\bigl((\sqrt{2\pi})^{n}\,|K|^{1/2}\bigr)\right]d\mathbf{x} $$

Joint and Conditional Differential Entropy (3)

Proof Continued:

$$ = E\left[\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T} K^{-1}(\mathbf{x}-\boldsymbol{\mu})\right] + \ln\bigl((\sqrt{2\pi})^{n}\,|K|^{1/2}\bigr) $$

$$ = \frac{n}{2} + \ln\bigl((\sqrt{2\pi})^{n}\,|K|^{1/2}\bigr) $$

$$ = \frac{1}{2}\ln e^{n} + \frac{1}{2}\ln\bigl((2\pi)^{n}\,|K|\bigr) = \frac{1}{2}\ln\bigl((2\pi e)^{n}\,|K|\bigr) \ \text{nats} = \frac{1}{2}\log\bigl((2\pi e)^{n}\,|K|\bigr) \ \text{bits} $$

Exercise: Provide a more detailed and more concrete proof of this mathematical manipulation.
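A quick numerical check of Theorem 9.3.1 (a sketch, not part of the slides); the covariance matrix K below is an arbitrary positive-definite example, and scipy's built-in multivariate normal entropy (in nats) is used for comparison.

```python
# Numeric check of Theorem 9.3.1 against scipy's multivariate normal entropy (nats -> bits).
import numpy as np
from scipy.stats import multivariate_normal

K = np.array([[2.0, 0.5, 0.1],
              [0.5, 1.0, 0.3],
              [0.1, 0.3, 1.5]])            # an arbitrary positive-definite covariance matrix
n = K.shape[0]
h_formula = 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(K))          # bits
h_scipy = multivariate_normal(mean=np.zeros(n), cov=K).entropy() / np.log(2)   # bits
print(h_formula, h_scipy)                  # the two values agree
```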

Mutual Information (1)

Definition 9.4.1: Kullback-Leibler Distance

The Kullback-Leibler distance D(f‖g) between two density functions f and g is defined as:
$$ D(f\|g) = \int_{S} f\log\frac{f}{g}\,dx $$
with the convention 0 log(0/0) = 0, and with S being the region where the random variables are defined.

Definition 9.4.2: Mutual Information
The mutual information I(X;Y) between two continuous random variables X and Y with joint density function f(x, y) is defined as:
$$ I(X;Y) = \int f(x,y)\log\frac{f(x,y)}{f(x)f(y)}\,dx\,dy $$

Property 9.4.1:
$$ I(X;Y) = h(X) - h(X\mid Y) = h(Y) - h(Y\mid X) = D\bigl(f(x,y)\,\|\,f(x)f(y)\bigr) $$
Proof: Obvious.
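As an illustration (not in the slides), the sketch below estimates I(X;Y) by Monte Carlo for a jointly Gaussian pair and compares it with the standard closed form −½ log(1−ρ²); the value of ρ and the sample size are arbitrary.

```python
# Monte Carlo sketch: I(X;Y) = E[ log f(X,Y) / (f(X) f(Y)) ] for a jointly Gaussian pair,
# compared with the standard closed form -0.5*log2(1 - rho^2).
import numpy as np

rng = np.random.default_rng(4)
rho = 0.6                                              # arbitrary correlation coefficient
K = np.array([[1.0, rho], [rho, 1.0]])
Kinv = np.linalg.inv(K)
xy = rng.multivariate_normal([0.0, 0.0], K, 500_000)
x, y = xy[:, 0], xy[:, 1]

ln_fxy = (-np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(K))
          - 0.5 * np.einsum('ij,jk,ik->i', xy, Kinv, xy))    # ln of the joint density
ln_fx = -0.5 * np.log(2 * np.pi) - 0.5 * x**2                # ln of the marginal of X
ln_fy = -0.5 * np.log(2 * np.pi) - 0.5 * y**2                # ln of the marginal of Y
I_mc = np.mean(ln_fxy - ln_fx - ln_fy) / np.log(2)           # mutual information in bits
I_exact = -0.5 * np.log2(1 - rho**2)
print(I_mc, I_exact)                                         # both ≈ 0.32 bits
```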

Mutual Information (2)

Theorem 9.4.1: Non-negativity of the Kullback-Leibler Distance
D(f‖g) ≥ 0, and equality holds if and only if f = g.
Proof:
$$ -D(f\|g) = \int_{S} f\log\frac{g}{f} \;\underbrace{\le}_{\text{Jensen's inequality}}\; \log\int_{S} f\,\frac{g}{f} = \log\int_{S} g \le \log 1 = 0 $$
Jensen's inequality for concave functions states that the equality holds if and only if f = g.

Theorem 9.4.2: Non-negativity of Mutual Information
I(X;Y) ≥ 0, and equality holds if and only if X and Y are independent.
Proof: Obvious
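Theorem 9.4.1 can be illustrated numerically. The sketch below (not in the slides) estimates D(f‖g) for two univariate Gaussian densities by Monte Carlo and compares it with the standard Gaussian KL closed form, which is not derived in these slides; all parameters are arbitrary.

```python
# Monte Carlo sketch of D(f||g) >= 0 for two Gaussian densities, with the standard closed form.
import numpy as np

rng = np.random.default_rng(5)
m1, s1 = 0.0, 1.0                   # f = N(m1, s1^2), the density we sample from
m2, s2 = 1.0, 2.0                   # g = N(m2, s2^2); all parameters are arbitrary
x = rng.normal(m1, s1, 1_000_000)

ln_f = -0.5 * np.log(2 * np.pi * s1**2) - (x - m1)**2 / (2 * s1**2)
ln_g = -0.5 * np.log(2 * np.pi * s2**2) - (x - m2)**2 / (2 * s2**2)
D_mc = np.mean(ln_f - ln_g)                                      # E_f[ln(f/g)] in nats
D_exact = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5
print(D_mc, D_exact)                                             # both ≈ 0.443 nats, >= 0
```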

Theorem 9.4.3: Knowledge Decreases Uncertainty
h(X|Y) ≤ h(X), and equality holds if and only if X and Y are independent.
Proof: Obvious

Mutual Information (3)

Theorem 9.4.4: Chain Rule
$$ h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} h(X_i \mid X_1, X_2, \ldots, X_{i-1}) $$
Proof: Obvious

Theorem 9.4.5:
$$ h(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^{n} h(X_i) $$
and equality holds if and only if X_1, X_2, …, X_n are independent.
Proof: Obvious
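A small illustration of Theorem 9.4.5 (not in the slides), using the Gaussian entropy formula of Theorem 9.3.1 for a correlated two-dimensional example; the value of ρ is arbitrary.

```python
# Sketch of Theorem 9.4.5 for a correlated 2-D Gaussian, using the formula of Theorem 9.3.1.
import numpy as np

rho = 0.7                                              # arbitrary correlation coefficient
K = np.array([[1.0, rho], [rho, 1.0]])
h_joint = 0.5 * np.log2((2 * np.pi * np.e) ** 2 * np.linalg.det(K))      # h(X1, X2)
h_sum = sum(0.5 * np.log2(2 * np.pi * np.e * K[i, i]) for i in range(2)) # h(X1) + h(X2)
print(h_joint, h_sum)                                  # h_joint < h_sum whenever rho != 0
```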

Theorem 9.4.6: h(X + c) = h(X )

i.e., translation does not change the differential entropy. Proof: Obvious

Mutual Information (4)

Theorem 9.4.7: The Normal Distribution Maximizes Entropy
The Normal distribution maximizes the differential entropy among all distributions having the same variance.
Proof: Let X be a zero-mean random variable with variance σ² following the density function g(x), and let φ(x) be the density of a zero-mean Normal random variable with the same variance σ². Then,
$$ 0 \le D(g\|\phi) = \int g\ln\frac{g}{\phi}\,dx = -h(g) - \int g\ln\phi\,dx = -h(g) - \int \left[-\frac{x^2}{2\sigma^2} - \ln\sqrt{2\pi}\,\sigma\right] g\,dx = -h(g) + \ln\sqrt{2\pi}\,\sigma + \frac{1}{2\sigma^2}\int x^2 g\,dx $$
However, since g and φ have the same variance,
$$ \int x^2 g\,dx = \int x^2 \phi\,dx $$
Therefore,
$$ 0 \le D(g\|\phi) = -h(g) + \ln\sqrt{2\pi}\,\sigma + \frac{1}{2\sigma^2}\int x^2 \phi\,dx = -h(g) - \int \phi\ln\phi\,dx = -h(g) + h(\phi) $$
The equality holds if and only if g = φ.
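As an illustration of Theorem 9.4.7 (not in the slides), the sketch below compares the closed-form differential entropies of three unit-variance densities; the uniform formula is from the earlier example, while the Laplace formula 1 + ln(2b) is a standard result not derived here.

```python
# Closed-form comparison: among unit-variance densities, the Gaussian has the largest h(X).
# The Laplace entropy 1 + ln(2b) is a standard result not derived in the slides.
import numpy as np

s2 = 1.0                                         # common variance
h_gauss = 0.5 * np.log(2 * np.pi * np.e * s2)    # Normal N(0, s2)
h_laplace = 1 + np.log(2 * np.sqrt(s2 / 2))      # Laplace with scale b, where 2*b^2 = s2
h_unif = np.log(np.sqrt(12 * s2))                # Uniform(0, a), where a^2/12 = s2
print(h_gauss, h_laplace, h_unif)                # ≈ 1.419 > 1.347 > 1.242 nats
```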

Mutual Information (5)

Theorem 9.4.8: The Normal Distribution Maximizes Entropy (Multivariate Case)
The n-dimensional zero-mean Normal distribution maximizes the differential entropy among all n-dimensional zero-mean distributions having the same covariance matrix K = {K_ij}.
Proof: Exercise

Summary

We have visited:
1. Differential Entropy
   - Definition
   - Some Examples
   - Asymptotic Property
2. Relation of Differential and Discrete Entropies
3. Joint and Conditional Differential Entropy
4. Mutual Information
   - Some Important Properties