University of Central Florida, School of Electrical Engineering and Computer Science. EEL-6532: Information Theory and Coding. Spring 2010 - dcm

Lecture 2 - Wednesday January 13, 2010

Shannon Entropy

In his endeavor to construct mathematically tractable models of communication, Shannon concentrated on stationary and ergodic¹ sources of classical information. A stationary source of information emits symbols with a probability that does not change over time, and an ergodic source emits information symbols with a probability equal to the frequency of their occurrence in a long sequence. Stationary ergodic sources of information have a finite, but arbitrary and potentially long, correlation time.

In the late 1940s Shannon introduced a measure of the quantity of information a source could generate [13]. Earlier, in 1927, another scientist from Bell Laboratories, Ralph Hartley, had proposed to take the logarithm of the total number of possible messages as a measure of the amount of information in a message generated by a source of information, arguing that the logarithm tells us how many digits or characters are required to convey the message. Shannon recognized the relationship between thermodynamic entropy and informational entropy and, on von Neumann's advice, he called the negative logarithm of the probability of an event entropy².

Consider an event which happens with probability p; we wish to quantify the information content of a message communicating the occurrence of this event, and we impose the condition that the measure should reflect the "surprise" brought by the occurrence of this event. An initial guess for a measure of this surprise would be 1/p: the lower the probability of the event, the larger the surprise. But this simplistic approach does not resist scrutiny; the surprise should be additive. If an event is composed of two independent events which occur with probabilities q and r, then the probability of the event should be p = qr, but we see that

\frac{1}{p} \neq \frac{1}{q} + \frac{1}{r}.

On the other hand, if the surprise is measured by the logarithm of 1/p, then the additivity property is obeyed:

\log \frac{1}{p} = \log \frac{1}{q} + \log \frac{1}{r}.

Given a probability distribution with \sum_i p_i = 1, we see that the uncertainty is in fact equal to the average surprise:

\sum_i p_i \log \frac{1}{p_i}.

¹ A stochastic process is said to be ergodic if time averages are equal to ensemble averages, in other words if its statistical properties, such as its mean and variance, can be deduced from a single, sufficiently long sample (realization) of the process.
² It is rumored that von Neumann told Shannon "It is already in use under that name and besides it will give you a great edge in debates because nobody really knows what entropy is anyway" [2].

The entropy is a measure of the uncertainty of a single random variable X before it is observed, or the average uncertainty removed by observing it. This quantity is called entropy due to its similarity to the thermodynamic entropy.

Entropy: the entropy of a random variable X with a probability density function p_X(x) is:

H(X) = -\sum_x p_X(x) \log p_X(x).

The entropy of a random variable is a non-negative number. Indeed, the probability p_X(x) is a real number between 0 and 1, therefore \log p_X(x) \le 0 and H(X) \ge 0.

Let X be a binary random variable and let p = p_X(x = 1) be the probability that X takes the value 1; then the entropy of X is:

H(p) = -p \log p - (1 - p) \log(1 - p).

If the logarithm is in base 2 then the binary entropy is measured in bits. Figure 1 shows H(p) as a function of p for a binary random variable. The entropy has a maximum of 1 bit when p = 1/2 and goes to zero when p = 0 or p = 1. Intuitively, we expect the entropy to be zero when the outcome is certain and to reach its maximum when both outcomes are equally likely. It is easy to see that:

(i) H(X) > 0 for 0 < p < 1;
(ii) H(X) is symmetric about p = 0.5;
(iii) \lim_{p \to 0} H(X) = \lim_{p \to 1} H(X) = 0;
(iv) H(X) is increasing for 0 < p < 0.5, decreasing for 0.5 < p < 1, and has a maximum at p = 0.5;
(v) the binary entropy is a concave function of p, the probability of an outcome.

Before discussing this property of the binary entropy we review a few properties of convex and concave functions. A function f(x) is convex over an interval (a, b) if:

f(\gamma x_1 + (1 - \gamma) x_2) \le \gamma f(x_1) + (1 - \gamma) f(x_2), \quad \forall x_1, x_2 \in (a, b), \; 0 \le \gamma \le 1.

The function is strictly convex if equality holds only for \gamma = 0 or \gamma = 1. A function f(x) is concave if and only if -f(x) is convex over the same interval. It is easy to prove that if the second derivative of the function f(x) is non-negative, then the function is convex. Call x_0 = \gamma x_1 + (1 - \gamma) x_2. From the Taylor series expansion

f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(\xi)}{2}(x - x_0)^2, \quad x_0 \le \xi \le x,

and from the fact that the second derivative is non-negative, it follows that f''(\xi)(x - x_0)^2 \ge 0. When x = x_1 then x_1 - x_0 = (1 - \gamma)(x_1 - x_2) and:

f(x_1) \ge f(x_0) + f'(x_0)(x_1 - x_0), \quad \text{or} \quad f(x_1) \ge f(x_0) + f'(x_0)(1 - \gamma)(x_1 - x_2).

When x = x2 then x2 − x0 = γ(x2 − x1) and:

f(x_2) \ge f(x_0) + f'(x_0)(x_2 - x_0), \quad \text{or} \quad f(x_2) \ge f(x_0) + f'(x_0)\gamma(x_2 - x_1).

Multiplying the first inequality by \gamma, the second by (1 - \gamma), and adding them, it follows that:

\gamma f(x_1) + (1 - \gamma) f(x_2) \ge f(\gamma x_1 + (1 - \gamma) x_2). □

Convex functions enjoy a number of useful properties; for example, if X is a discrete random variable with the probability density function p_X(x_i) and f(x) is a convex function, then f(x) satisfies Jensen's inequality:

\sum_i p_X(x_i) f(x_i) \ge f\left( \sum_i p_X(x_i)\, x_i \right).

It is easy to prove that f(x) is concave if and only if:

f\left( \frac{x_1 + x_2}{2} \right) \ge \frac{f(x_1) + f(x_2)}{2}.

Figure 1 illustrates the fact that the binary entropy H(p) is a concave function of p; the function lies above any chord, in particular above the chord connecting the points (p_1, H(p_1)) and (p_2, H(p_2)):

H\left( \frac{p_1 + p_2}{2} \right) \ge \frac{H(p_1) + H(p_2)}{2}.

Table 1 shows some values of H(X) for 0.0001 \le p \le 0.5.

Figure 1: The entropy of a binary random variable as a function of the probability of an outcome.

Now we consider two random variables X and Y with the probability density functions p_X(x) and p_Y(y); let p_{XY}(x, y) be the joint probability density function of X and Y.

Table 1: The entropy of a binary random variable for 0.0001 ≤ p ≤ 0.5.

p        H(X)       p        H(X)       p        H(X)       p        H(X)
0.0001   0.001      0.01     0.081      0.2      0.722      0.4      0.971
0.001    0.011      0.1      0.469      0.3      0.881      0.5      1.000
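The values in Table 1 are easy to reproduce. The short sketch below (in Python, used here only as an illustration; the function name binary_entropy is ours, not part of any library) computes H(p) and also checks the concavity property discussed above.

from math import log2

def binary_entropy(p):
    """H(p) = -p log2(p) - (1 - p) log2(1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"p = {p:6.4f}   H(p) = {binary_entropy(p):.3f}")

# concavity check: H((p1 + p2)/2) >= (H(p1) + H(p2))/2
p1, p2 = 0.1, 0.4
assert binary_entropy((p1 + p2) / 2) >= (binary_entropy(p1) + binary_entropy(p2)) / 2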

To quantify the uncertainty about the pair (x, y) we introduce the joint entropy of the two random variables.

Joint entropy: the joint entropy of two random variables X and Y is defined as:

H(X,Y) = -\sum_{x,y} p_{XY}(x, y) \log p_{XY}(x, y).

If we have acquired all the information about the random variable X, we may ask how much uncertainty is still left about the pair of the two random variables, (X, Y). To answer this question we introduce the conditional entropy.

Conditional entropy: the conditional entropy of the random variable Y given X is defined as:

H(Y|X) = -\sum_x \sum_y p_{XY}(x, y) \log p_{Y|X}(y|x).

Consider two random variables X and Y. Each of them takes values over a five-letter alphabet consisting of the symbols a, b, c, d, e. The joint distribution of the two random variables is given in Table 2.

Table 2: The joint probability distribution matrix of the random variables X and Y.

p_{X,Y}(x,y)     a       b       c       d       e
     a          1/10    1/20    1/40    1/80    1/80
     b          1/20    1/40    1/80    1/80    1/10
     c          1/40    1/80    1/80    1/10    1/20
     d          1/80    1/80    1/10    1/20    1/40
     e          1/80    1/10    1/20    1/40    1/80

The marginal distributions of X and Y can be computed from the relations

p_X(x) = \sum_y p_{XY}(x, y) \quad \text{and} \quad p_Y(y) = \sum_x p_{XY}(x, y)

as follows:

p_X(x = a) = \sum_y p_{XY}(x = a, y) = \frac{1}{10} + \frac{1}{20} + \frac{1}{40} + \frac{1}{80} + \frac{1}{80} = \frac{1}{5}.

Similarly, we obtain:

p_X(x = b) = p_X(x = c) = p_X(x = d) = p_X(x = e) = \frac{1}{5},

and

p_Y(y = a) = p_Y(y = b) = p_Y(y = c) = p_Y(y = d) = p_Y(y = e) = \frac{1}{5}.

The entropy of X is thus:

H(X) = -\sum_x p_X(x) \log p_X(x) = 5 \left[ \frac{1}{5} \log 5 \right] = \log 5 \text{ bits}.

Similarly, the entropy of Y is:

H(Y) = -\sum_y p_Y(y) \log p_Y(y) = 5 \left[ \frac{1}{5} \log 5 \right] = \log 5 \text{ bits}.

The joint entropy of X and Y is:

H(X,Y) = -\sum_x \sum_y p_{XY}(x, y) \log p_{XY}(x, y)
       = -5 \left[ \frac{1}{10} \log \frac{1}{10} + \frac{1}{20} \log \frac{1}{20} + \frac{1}{40} \log \frac{1}{40} + \frac{1}{80} \log \frac{1}{80} + \frac{1}{80} \log \frac{1}{80} \right]
       = \log 5 + \left[ \frac{1}{2} \log 2 + \frac{1}{4} \log 4 + \frac{1}{8} \log 8 + \frac{1}{16} \log 16 + \frac{1}{16} \log 16 \right]

       = \log 5 + \frac{15}{8} \text{ bits}.

It is easy to see that:

H(Y,X) = \log 5 + \frac{15}{8} \text{ bits}.

To compute the matrix of conditional probabilities shown in Table 3 we use the known relation between the joint probability density and the conditional probability, p_{X|Y}(x|y) = p_{XY}(x, y)/p_Y(y). For example:

p_{X|Y}(x = a | y = b) = \frac{p_{XY}(x = a, y = b)}{p_Y(y = b)} = \frac{1/20}{1/5} = \frac{1}{4}.

Table 3: The conditional probability distribution matrix of random variables X and Y.

p_{X|Y}(x|y)     a       b       c       d       e
     a          1/2     1/4     1/8     1/16    1/16
     b          1/4     1/8     1/16    1/16    1/2
     c          1/8     1/16    1/16    1/2     1/4
     d          1/16    1/16    1/2     1/4     1/8
     e          1/16    1/2     1/4     1/8     1/16

Since both marginal distributions are uniform, the same table also gives the conditional probabilities p_{Y|X}(y|x).
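As a quick numerical sanity check of this example, the sketch below (Python, standard library only; all names are ours) recomputes the marginals and the entropies from the joint distribution of Table 2. The values agree with those derived analytically in the text, including the value of H(X|Y) computed next.

from fractions import Fraction as F
from math import log2

row = [F(1, 10), F(1, 20), F(1, 40), F(1, 80), F(1, 80)]
# each row of Table 2 is a cyclic shift of the previous one
joint = [row[i:] + row[:i] for i in range(5)]

p_x = [sum(r) for r in joint]            # row marginals: all equal to 1/5
p_y = [sum(c) for c in zip(*joint)]      # column marginals: all equal to 1/5

def H(probs):
    return -sum(float(p) * log2(float(p)) for p in probs if p)

print(H(p_x), H(p_y))                               # both log2(5) ~ 2.32 bits
print(H([p for r in joint for p in r]))             # H(X,Y) ~ 4.19 bits = 15/8 + log2(5)
print(H([p for r in joint for p in r]) - H(p_y))    # H(X|Y) = 15/8 = 1.875 bits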

The actual value of H(X|Y ) is:

H(X|Y) = \sum_y p_Y(y)\, H(X|Y = y) = 5 \cdot \frac{1}{5}\, H\left( \frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \frac{1}{16} \right) = H\left( \frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \frac{1}{16} \right)

Table 4: Properties of joint and conditional entropy.

H(X,Y) = H(Y,X)                               symmetry of the joint entropy
H(X,Y) ≥ 0                                    non-negativity of the joint entropy
H(X|Y) ≥ 0,  H(Y|X) ≥ 0                       non-negativity of the conditional entropy
H(X|Y) = H(X,Y) − H(Y)                        relation between conditional and joint entropy
H(X,Y) ≥ H(Y)                                 joint entropy versus the entropy of a single rv
H(X,Y) ≤ H(X) + H(Y)                          subadditivity
H(X,Y,Z) + H(Y) ≤ H(X,Y) + H(Y,Z)             strong subadditivity
H(X|Y) ≤ H(X)                                 reduction of uncertainty by conditioning
H(X,Y,Z) = H(X) + H(Y|X) + H(Z|X,Y)           chain rule for the joint entropy
H(X,Y|Z) = H(Y|X,Z) + H(X|Z)                  chain rule for the conditional entropy

or

H(X|Y) = \frac{1}{2} \log 2 + \frac{1}{4} \log 4 + \frac{1}{8} \log 8 + \frac{1}{16} \log 16 + \frac{1}{16} \log 16 = \frac{15}{8} \text{ bits}.

In this particular example H(Y|X) has the same value:

H(Y|X) = \frac{15}{8} \text{ bits}.

It is easy to see that H(X), H(Y), H(X,Y), and H(X|Y) are all non-negative and satisfy the relations in Table 4. Indeed:

H(X,Y) = H(X|Y) + H(Y) \implies \frac{15}{8} + \log 5 = \frac{15}{8} + \log 5,

H(X,Y) = \frac{15}{8} + \log 5 \le H(X) + H(Y) = 2 \log 5,

H(X,Y ) = H(Y,X),

H(X|Y) = \frac{15}{8} \le H(X) = \log 5.

Properties of joint and conditional entropy. The joint and conditional entropy of random variables X, Y, and Z have several properties, summarized in Table 4. The fact that the joint entropy is symmetric and non-negative follows immediately from its definition. The conditional entropy is non-negative because 0 \le p_{Y|X}(y|x) \le 1 and 0 \le p_{X|Y}(x|y) \le 1; H(Y|X) is zero only when Y is a deterministic function of X and, respectively, H(X|Y) is zero only when X is a deterministic function of Y. From the definition of the conditional entropy and the expression of the conditional probability, p_{X|Y}(x|y) = p_{XY}(x, y)/p_Y(y), it follows that:

H(X|Y) = \sum_x \sum_y p_{XY}(x, y) \log p_Y(y) - \sum_x \sum_y p_{XY}(x, y) \log p_{XY}(x, y)

= −H(Y ) + H(X,Y ).

But H(X|Y) ≥ 0, thus H(X,Y) ≥ H(Y).

To prove subadditivity we first show that:

\ln a \le a - 1, \quad a > 0,

with equality if and only if a = 1. Indeed, if we use the substitution a = e^b the inequality becomes b \le e^b - 1. For b \ge 0 this follows from the Taylor series expansion of e^b:

e^b - 1 = b \left( 1 + \frac{b}{2!} + \frac{b^2}{3!} + \dots + \frac{b^k}{(k+1)!} + \dots \right).

The parenthesis on the right-hand side of this equation is greater than 1 when b > 0 and is equal to 1 only when b = 0; thus e^b - 1 \ge b for b \ge 0. For b < 0 the inequality also holds, because the convex function e^b lies above its tangent 1 + b at b = 0. Therefore a - 1 \ge \ln a for every a > 0; the last inequality can also be written as:

\log a \le \frac{1}{\ln 2}(a - 1), \quad a > 0.

It is easy to see that \ln a = \log a \cdot \ln 2; indeed, any a > 0 can be expressed as a = e^{\ln a} but also as a = 2^{\log a}, thus e^{\ln a} = 2^{\log a}. If we apply the natural logarithm on both sides we get the desired expression. Now we return to subadditivity and express H(X) and H(Y) in terms of the joint probability density p_{XY}(x, y):

H(X) = -\sum_x p_X(x) \log p_X(x) = -\sum_x \left[ \sum_y p_{XY}(x, y) \right] \log p_X(x),

and

H(Y) = -\sum_y p_Y(y) \log p_Y(y) = -\sum_y \left[ \sum_x p_{XY}(x, y) \right] \log p_Y(y).

We wish to show that H(X,Y) - H(X) - H(Y) \le 0 and compute:

H(X,Y) - H(X) - H(Y) = \sum_x \sum_y p_{XY}(x, y) \log \frac{p_X(x)\, p_Y(y)}{p_{XY}(x, y)}.

We apply the inequality \log a \le \frac{1}{\ln 2}(a - 1) with a = \frac{p_X(x) p_Y(y)}{p_{XY}(x, y)}; equality holds only when p_X(x) p_Y(y) = p_{XY}(x, y), that is, when X and Y are independent random variables. The expression above can therefore be bounded as:

\sum_x \sum_y p_{XY}(x, y) \log \frac{p_X(x)\, p_Y(y)}{p_{XY}(x, y)} \le \frac{1}{\ln 2} \sum_x \sum_y p_{XY}(x, y) \left[ \frac{p_X(x)\, p_Y(y)}{p_{XY}(x, y)} - 1 \right].

Thus:

H(X,Y) - H(X) - H(Y) \le \frac{1}{\ln 2} \sum_x \sum_y \left[ p_X(x)\, p_Y(y) - p_{XY}(x, y) \right]

where

\frac{1}{\ln 2} \sum_x \sum_y \left[ p_X(x)\, p_Y(y) - p_{XY}(x, y) \right] = \frac{1}{\ln 2} \left[ \sum_x \sum_y p_X(x)\, p_Y(y) - \sum_x \sum_y p_{XY}(x, y) \right] = \frac{1}{\ln 2}(1 - 1) = 0.

The strong subadditivity can be proved in a similar manner.

Finally, we show that H(X) \le H(X,Y):

H(X) - H(X,Y) = \sum_x \sum_y p_{XY}(x, y) \log \frac{p_{XY}(x, y)}{p_X(x)}.

But p_{XY}(x, y) \le p_X(x) for every pair (x, y), so each logarithm is non-positive and the sum is at most zero.

We conclude that:

H(X) − H(X,Y ) ≤ 0, or H(X) ≤ H(X,Y )

with equality if and only if Y is a deterministic function of X. The chain rule for the joint entropy can be proved using the relation between conditional and joint entropy, H(X,Y) = H(X) + H(Y|X). Then

H(X,Y,Z) = H(X) + H(Y,Z|X) = H(X) + H(Y |X) + H(Z|X,Y )

This rule can be generalized to n random variables X_1, X_2, \dots, X_n:

H(X_1, X_2, \dots, X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, X_{i-2}, \dots, X_1).

To prove the chain rule for the conditional entropy we start from the definition:

H(X,Y |Z) = H(X,Y,Z) − H(Z). We know that H(X|Y ) = H(X,Y ) − H(Y ) thus the right hand side of this equation can be written as:

[H(X,Y,Z) - H(X,Z)] + [H(X,Z) - H(Z)] = H(Y|X,Z) + H(X|Z). Thus:

H(X,Y|Z) = H(Y|X,Z) + H(X|Z).

The chain rule for the conditional entropy can be generalized to n + 1 random variables:

H(X_1, X_2, \dots, X_n | Y) = \sum_{i=1}^{n} H(X_i | Y, X_1, \dots, X_{i-1}).

Last, but not least, we mention the "grouping" property of the Shannon entropy [1].

H(p_1, p_2, \dots, p_n) = H\!\left( \sum_{k=1}^{\sigma} p_k,\; \sum_{k=\sigma+1}^{n} p_k \right)
  + \left( \sum_{k=1}^{\sigma} p_k \right) H\!\left( \frac{p_1}{\sum_{k=1}^{\sigma} p_k}, \frac{p_2}{\sum_{k=1}^{\sigma} p_k}, \dots, \frac{p_\sigma}{\sum_{k=1}^{\sigma} p_k} \right)
  + \left( \sum_{k=\sigma+1}^{n} p_k \right) H\!\left( \frac{p_{\sigma+1}}{\sum_{k=\sigma+1}^{n} p_k}, \frac{p_{\sigma+2}}{\sum_{k=\sigma+1}^{n} p_k}, \dots, \frac{p_n}{\sum_{k=\sigma+1}^{n} p_k} \right).
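The grouping property is easy to verify numerically. The sketch below (Python) checks it for one arbitrarily chosen distribution and split point σ = 3; both the distribution and the split are our choices, made only for illustration.

from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

p = [0.1, 0.2, 0.05, 0.15, 0.3, 0.2]      # an arbitrary distribution (sums to 1)
sigma = 3                                  # group the first three symbols together
q1, q2 = sum(p[:sigma]), sum(p[sigma:])

lhs = H(p)
rhs = H([q1, q2]) + q1 * H([x / q1 for x in p[:sigma]]) + q2 * H([x / q2 for x in p[sigma:]])
print(lhs, rhs)                            # the two values agree (up to rounding)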

Lecture 4 - Monday January 25, 2010

The Physical Support of Information

Information must have a physical support. Indeed, to transmit, store, and process information we must act on a property of the physical medium embodying the information; to extract information at the output of a channel we have to perform a measurement of the physical medium. It seems very unlikely that anyone would dispute the statement that information is physical; nevertheless, the profound implications of this truth, the connection between the fundamental laws of physics and information, have escaped the scientific community for a very long time [8, 9].

In 1929, Leo Szilard stipulated that information is physical while trying to explain Maxwell's demon paradox [18]. In 1961, Rolf Landauer uncovered the connection between information and the second law of thermodynamics [5]. Landauer followed a very simple argument to show that classical information can be copied reversibly, but the erasure of information is an irreversible process; to erase one bit of information we have to dissipate an amount of energy of at least k_B T \ln 2, with k_B the Boltzmann constant, k_B = 1.3807 \times 10^{-23} Joules per Kelvin, and T the temperature of the system. We all accept easily, and possibly painfully, the fact that erasure of information is irreversible; think only about the information you lost last time the computer disk on your laptop crashed!

To quantify the amount of energy required to erase one bit of information we have to remember a basic attribute of classical information, its independence of the physical support; a bit can be stored as the presence or absence of a single pebble in a square drawn on the sand, a flower pot or the absence of it on the balcony of Juliet, or a molecule of gas in a cylinder. We shall consider the latter case, when we encode one bit of information as the presence of one molecule of gas in a cylinder with two compartments. A "0" corresponds to the molecule in the left compartment and a "1" to the molecule in the right compartment, Figure 2. To erase the information we proceed as follows: (a) we remove the wall separating the two compartments and insert a piston on the right side of the cylinder; (b) we push the piston and compress the one-molecule gas. At the end of step (b) we have "erased" the information; we no longer know in which compartment the molecule was initially. If the compression is isothermal (the temperature does not change) and quasi-static (the state changes very slowly), then the laws of thermodynamics tell us that the energy required to compress m molecules of gas to half of the initial volume is equal to m k_B T \ln 2. In our experiment m = 1, thus the total amount of energy dissipated by us in erasing the information is k_B T \ln 2, equal to the amount of heat dumped into the environment.

Maxwell's demon. Maxwell imagined a gedanken experiment involving the famous demon as a challenge to the second law of thermodynamics discussed later in these notes. Maxwell's demon is a mischievous fictional character who controls a trapdoor between two containers, A and B, filled with the same gas at equal temperatures; the containers are placed next to each other, B on the left of A. When a faster-than-average molecule from container A approaches the trapdoor, the demon opens it and allows the molecule to fly from container A to B; when a slower-than-average molecule from B approaches the trapdoor, the demon opens it and allows the molecule to fly from B to A.
The average molecular speed corresponds to the temperature, therefore the temperature will decrease in A and increase in B, in violation of the second law of thermodynamics. If you are about to invent a perpetuum mobile based on this gedanken experiment, read further before trying to patent your invention; if you already know what is wrong with this argument, you may skip the remainder of this section.


Figure 2: A gedanken experiment to quantify the amount of energy required to erase a bit of information. We encode one bit of information as the presence of one molecule of gas in a cylinder with two compartments. A “0” corresponds to the molecule in the left compartment and a “1” to the molecule in the right compartment (top). To erase the information we remove the wall separating the two compartments and insert a piston on the right side of the cylinder, push the piston and compress the gas. According to the laws of thermodynamics and to Landauer’s principle the amount of energy required to erase the information is equal to kBT ln 2.
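A quick back-of-the-envelope computation of the Landauer bound k_B T \ln 2 (a Python sketch; the temperature T = 300 K is an assumed room-temperature value, not part of the text above):

from math import log

k_B = 1.3807e-23        # Boltzmann's constant, J/K
T = 300.0               # assumed room temperature, K

energy_per_bit = k_B * T * log(2)
print(f"minimum energy to erase one bit at {T:.0f} K: {energy_per_bit:.2e} J")   # ~2.9e-21 J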

Szilard’s thermodynamic engine. One of the most famous demonstrations that the second law of thermodynamics is not violated by Maxwell’s demon was suggested by Leo Szilard who considered the total entropy of the gas and demon combined. Szilard imagined an engine powered by this demon. As before, we have one molecule of gas located either in the left half, or in the right half of a cylinder with a partition in the middle. The demon measures the position of the molecule, then removes the partition and, based on the information regarding the position of the particle, the demon inserts a piston in the cylinder to the left of the molecule. On the right of the piston, an arm is connected to a weight hanging from a pulley. As the gas expands, the piston moves to the left and pulls up the weight. During the expansion the one-molecule gas is put in contact with a reservoir of heat, draws heat from the reservoir, expands isothermally (the temperature T is maintained throughout the experiment), and lifts the weight. The work extracted by the engine in the isothermal expansion is:

W_{isothermal\ expansion} = +k_B T \ln 2.

Then the cycle repeats itself and Szilard's engine keeps on converting the heat absorbed when the gas expands isothermally into work, a process forbidden by the second law of thermodynamics.

The point one could easily miss is that at the beginning of each cycle the ensemble consisting of the cylinder with its auxiliary components, as well as the demon, should be returned to the initial state. Figure 3(d) shows that the cylinder with the one-molecule gas could be returned to the initial state if and only if the demon provides the information regarding the initial position of the molecule. The demon is an essential component of the engine; the contraption could not function without the demon. The demon's memory stores one bit of information acquired during the first step of the process: "0" if the molecule is located in the left half and "1" if the molecule is in the right half.

Figure 3: A cycle of Szilard's engine. (a) One molecule of gas is located either in the left half, or in the right half of a cylinder with a partition in the middle. (b) The demon measures the position of the molecule. (c) The demon removes the partition and, based on the information regarding the position of the particle, inserts a piston in the cylinder to the left of the molecule. (d) The gas expands isothermally and lifts the weight. During the expansion the one-molecule gas is put in contact with a reservoir of heat and draws heat from the reservoir. Finally, the system returns to its initial state and a new cycle begins. The cylinder with the one-molecule gas could be returned to the initial state if and only if the demon provides the information regarding the initial position of the molecule.

This bit of information should be erased at the beginning of each cycle; according to Landauer's principle, the energy dissipated in erasing one bit of information is:

W_{erasure} = -k_B T \ln 2.

Thus, the work gained by the engine is required to erase the demon's memory; the heat generated by the erasure of the demon's memory is transferred back to the heat bath to compensate for the heat absorbed during the expansion of the one-molecule gas. No net gain is achieved during a cycle of this hypothetical engine:

W_{cycle} = W_{isothermal\ expansion} + W_{erasure} = +k_B T \ln 2 - k_B T \ln 2 = 0,

and the second law of thermodynamics is not violated.

We introduce in the next section the concept of thermodynamic entropy, a measure of the amount of chaos in a microscopic system. The information available to the demon is a

part of the state of the system and it is related to the entropy. One could consider a generalized entropy \mathcal{I} (in bits), defined as the difference between the thermodynamic entropy \Delta S and the information I about the system available to an external observer (in our case the demon) [9]:

\mathcal{I} = \Delta S - I.

The energy per cycle can be expressed in terms of \Delta S, the difference between the entropy at the beginning and at the end of one cycle, and of I, the change in the information available to an external observer at the same instances, at temperature T, as:

W_{cycle} = W_{isothermal\ expansion} + W_{erasure} = T(\Delta S - I) = T\,\mathcal{I}. It follows that:

W_{cycle} = 0 \iff \mathcal{I} = 0.

Indeed, information is physical and it contributes to defining the state of a physical system.

Boltzmann's Definition of Thermodynamic Entropy

According to Alfred Wehrl [24], "entropy relates macroscopic and microscopic aspects of nature and determines the behavior of macroscopic systems, i.e., real matter, in equilibrium (or close to equilibrium)". The entropy is a measure of the amount of chaos in a microscopic system. The concept of entropy was first introduced in thermodynamics. Thermodynamics is the study of energy, its ability to carry out work, and the conversion between various forms of energy, such as the internal energy of a system, heat, and work. The laws of thermodynamics are derived from statistical mechanics. There are several equivalent formulations of each of the three laws of thermodynamics.

The First Law of Thermodynamics states that energy can be neither created nor destroyed; it can only be transformed from one form to another. An equivalent formulation is that the heat flowing into a system equals the sum of the change of the internal energy and the work done by the system.

The Second Law of Thermodynamics states that it is impossible to create a process whose unique effect is to subtract positive heat from a reservoir and convert it into positive work. An equivalent formulation is that the entropy of a closed system never decreases, whatever the processes that occur in the system: \Delta S \ge 0, where \Delta S = 0 refers to reversible processes and \Delta S > 0 to irreversible ones. A consequence of this law is that no heat engine can have 100% efficiency.

The Third Law of Thermodynamics states that the entropy of a system at zero absolute temperature is a well-defined constant. This is due to the fact that many systems at zero temperature are in their ground states, and the entropy is determined by the degeneracy of the ground state. For example, a crystal lattice with a unique ground state has zero entropy when the temperature reaches 0 Kelvin.

Clausius defined entropy as a measure of the energy unavailable for doing useful work. He discovered that entropy can never decrease in a physical process and can only remain constant in a reversible process; this result became known as the Second Law of Thermodynamics. There is a strong belief among astrophysicists and cosmologists that our Universe started in a state of perfect order and that its entropy is steadily increasing, leading some to believe in the possibility of a "heat death." Of course, this sad perspective could be billions or trillions of years away.

The thermodynamic entropy S of a gas is also defined statistically, but it does not reflect a macroscopic property. The entropy quantifies the notion that a gas is a statistical ensemble and it measures the randomness, or the degree of disorder, of the ensemble. The entropy is larger when the vectors describing the individual movements of the molecules of gas are in a higher state of disorder than when all molecules are well organized and moving in the same direction with the same speed. Ludwig Boltzmann postulated that:

S = k_B \ln \Omega,

where k_B is the Boltzmann constant and \Omega is the number of dynamical microstates, all of equal probability, consistent with a given macrostate; the microstates are specified by the position and the momentum of each molecule of gas³. The Second Law of Thermodynamics tells us that the entropy of an isolated system never decreases. Indeed, differentiating the previous equation we get

\delta S = k_B \frac{\delta \Omega}{\Omega} \ge 0.

Let us now get some insight into Boltzmann's definition of entropy. He wanted to provide a macroscopic characterization of an ensemble of microscopic systems; he considered N particles such that each particle is at one of the m energy levels E_1 \le E_2 \le \dots \le E_m. The numbers of particles at the energy levels E_1, E_2, \dots, E_m are N_1, N_2, \dots, N_m, respectively, with N = \sum_{i=1}^{m} N_i. The probability that a given particle is at energy level E_i is p_i = N_i/N, 1 \le i \le m, Figure 4. A microstate \sigma_i is characterized by an m-dimensional vector (p_1, p_2, \dots, p_m) when N tends to infinity. The question posed by Boltzmann is how many microstates exist. To answer this question let us assume that we have m boxes and we wish to count the number of different ways the given quantitative distribution of the N particles among these boxes can be accomplished, irrespective of which box a given particle is in. The answer to this question is:

\Omega = \binom{N}{N_1, N_2, \dots, N_m} = \frac{N!}{N_1!\, N_2! \cdots N_m!}.

Call Q the average number of bits required to characterize the state of the system; Q is proportional to the logarithm of \Omega, the number of microstates, more precisely:

Q = \frac{1}{N} \log \Omega.

We will show that the number of bits Q is given by:

Q = \frac{1}{N} \log \frac{N!}{N_1!\, N_2! \cdots N_m!} = H(p_1, p_2, \dots, p_m) + O(N^{-1} \log N).

In this expression, H(p_1, p_2, \dots, p_m) = -\sum_i p_i \log p_i denotes the informational Shannon entropy, a quantity we shall have much to say about later in this chapter. To prove this equality we express Q as:

Q = \frac{1}{N} \left\{ \log(N!) - \left[ \log(N_1!) + \log(N_2!) + \dots + \log(N_m!) \right] \right\}.

We use Stirling's approximation, n! \approx \sqrt{2\pi n}\, (n^n / e^n), and Q becomes:

³ A version of the equation S = k_B \ln \Omega is engraved on Boltzmann's tombstone.

Figure 4: The relationship between the thermodynamic entropy and the Shannon entropy. The figure shows N molecules of gas distributed over m energy levels, with N_i molecules at energy level E_i, N = N_1 + N_2 + \dots + N_m, and p_i = N_i/N; a microstate is characterized by the vector p = (p_1, p_2, \dots, p_m). The thermodynamic entropy is S = k_B \ln \Omega, with \Omega the number of microstates, while Q \approx H(p_1, p_2, \dots, p_m), the Shannon entropy, is the number of bits required to label the individual microstates.

Q = \frac{1}{N} \Big[ \left( \log \sqrt{2\pi N} + N \log N - N \log e \right)
      - \left( \log \sqrt{2\pi N_1} + N_1 \log N_1 - N_1 \log e \right)
      - \left( \log \sqrt{2\pi N_2} + N_2 \log N_2 - N_2 \log e \right)
      - \dots
      - \left( \log \sqrt{2\pi N_m} + N_m \log N_m - N_m \log e \right) \Big].

Now Q can be expressed as a sum of two terms:

Q = Q_1 + Q_2,

with

Q_1 = \log N - \left[ \frac{N_1}{N} \log N_1 + \frac{N_2}{N} \log N_2 + \dots + \frac{N_m}{N} \log N_m \right].

It is easy to see that:

\log N = \frac{1}{N}\, N \log N = \frac{N_1 + N_2 + \dots + N_m}{N} \log N = p_1 \log N + p_2 \log N + \dots + p_m \log N.

Thus:

Q_1 = -\sum_i p_i \log p_i = H(p_1, p_2, \dots, p_m).

Then we observe that the terms containing \log e cancel:

-N \log e + (N_1 \log e + N_2 \log e + \dots + N_m \log e) = -(N - N_1 - N_2 - \dots - N_m) \log e = 0,

since N = N_1 + N_2 + \dots + N_m.

It follows that:

Q_2 = \frac{1}{N} \log \frac{\sqrt{2\pi N}}{\sqrt{(2\pi N_1)(2\pi N_2) \cdots (2\pi N_m)}} = \frac{1}{N} \log \frac{\sqrt{N}}{(2\pi)^{(m-1)/2} \sqrt{N_1 N_2 \cdots N_m}}.

Then:

Q_2 = O(N^{-1} \log N).

Q = H(p_1, p_2, \dots, p_m) + O(N^{-1} \log N).

Q = H(p_1, p_2, \dots, p_m) \quad \text{when } N \to \infty. □

This expression confirms our intuition; if all N particles have the same energy, then the entropy of such a perfectly ordered system is zero. If a system has an infinite number of particles, all states are equally probable and the second law of thermodynamics is inapplicable to it. To find the most probable microstate we have to maximize H(p_1, p_2, \dots, p_m) subject to some constraint. For example, if the constraint is \sum_i p_i E_i = \kappa (the average energy is constant), then:

p_i = \frac{e^{-\lambda E_i}}{\sum_j e^{-\lambda E_j}},

\sum_i E_i \frac{e^{-\lambda E_i}}{\sum_j e^{-\lambda E_j}} = \kappa.

This equation has a unique solution if E_1 \le \kappa \le E_m; this solution is known as the Maxwell-Boltzmann distribution.

Let us now leave thermodynamics aside and consider messages generated by a source. We label each message by a binary string of length N. Let p_0 = N_0/N be the probability of a 0 in a label and p_1 = N_1/N be the probability of a 1 in a label; N_0 is the typical number of zeros and N_1 the typical number of ones in a label. Then the number of bits required to label a message will be:

\nu = \log \binom{N}{N_0, N_1} \approx -N \left[ p_0 \log p_0 + p_1 \log p_1 \right].
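A short numerical check of this relation (a Python sketch; N = 1000 and N_0 = 300 are arbitrary values chosen for illustration):

from math import comb, log2

N, N0 = 1000, 300                     # arbitrary values chosen for illustration
N1 = N - N0
p0, p1 = N0 / N, N1 / N

exact = log2(comb(N, N0))
approx = -N * (p0 * log2(p0) + p1 * log2(p1))
print(exact, approx)                  # ~876 vs ~881 bits; the gap is O(log N)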

Thus, if we have an alphabet with n letters occurring with probability pi, 1 ≤ i ≤ n, the average information in bits per letter transmitted when N is large is:

\frac{\nu}{N} = H(p_1, p_2, \dots, p_n) = -\sum_{i=1}^{n} p_i \log p_i,

with H(p_1, p_2, \dots, p_n) the Shannon entropy. The connection between the thermodynamic entropy of Boltzmann and the informational entropy introduced by Shannon, discussed in depth in the next section, is inescapable. Indeed, Boltzmann's entropy can be written as:

S = -k_B \ln 2 \sum_i p_i \log p_i, \quad \text{or} \quad S = (k_B \ln 2) \times H(p).

To convert one bit of classical information into units of thermodynamic entropy we multiply Shannon's entropy by the constant k_B \ln 2; the quantity k_B \ln 2 is the amount of entropy associated with the erasure of one bit of information according to Landauer's principle.
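As an illustration of this conversion, the sketch below (Python; the choice of one gigabyte of maximally random bits is ours, made only for the example) expresses the Shannon entropy of a block of random bits in thermodynamic units:

from math import log

k_B = 1.3807e-23                 # J/K
bits = 8 * 10**9                 # one gigabyte of maximally random bits (our choice)
S = k_B * log(2) * bits          # thermodynamic entropy, J/K
print(f"{S:.2e} J/K")            # ~7.7e-14 J/K, a tiny amount on a macroscopic scale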

Lecture 6 - Wednesday January 27, 2010

Mutual Information; Conditional Mutual Information; Relative Entropy

We introduce two important concepts from Shannon's information theory: the mutual information, which measures the reduction in uncertainty of a random variable X due to another random variable Y, and the conditional mutual information of X, Y, and Z, which quantifies the reduction in the uncertainty of random variables X and Y given Z.

Mutual information of two random variables X and Y is defined as:

I(X;Y) = \sum_x \sum_y p_{XY}(x, y) \log \frac{p_{XY}(x, y)}{p_X(x)\, p_Y(y)}.

Conditional mutual information of random variables X, Y , and Z is defined as:

I(X;Y|Z) = \sum_x \sum_y \sum_z p_{XYZ}(x, y, z) \log \frac{p_{XY|Z}(x, y|z)}{p_{X|Z}(x|z)\, p_{Y|Z}(y|z)}.

The mutual information of random variables X, Y, and Z has several properties, summarized in Table 5. The symmetry of the mutual information follows immediately from its definition. To establish the relation of the mutual information with the entropy and the conditional entropy we use the fact that p_{XY}(x, y) = p_{X|Y}(x|y)\, p_Y(y):

I(X;Y) = -\sum_x \sum_y p_{XY}(x, y) \log p_X(x) + \sum_x \sum_y p_{XY}(x, y) \log p_{X|Y}(x|y)

= H(X) − H(X|Y ).

We also note that H(X|X) = 0, thus I(X;X) = H(X). We have proved earlier that H(X) ≥ H(X|Y), thus I(X;Y) ≥ 0, with equality only if X and Y are independent. The equality I(X;Y) = H(X) + H(Y) − H(X,Y) follows immediately from the definition of I(X;Y):

I(X;Y) = -\sum_x \left[ \sum_y p_{XY}(x, y) \right] \log p_X(x) - \sum_y \left[ \sum_x p_{XY}(x, y) \right] \log p_Y(y) + \sum_x \sum_y p_{XY}(x, y) \log p_{XY}(x, y)

= H(X) + H(Y ) − H(X,Y ).

The conditional mutual information can also be expressed as:

I(X; Y |Z) = H(X|Z) − H(X|Y,Z) To prove the chain rule for mutual information we start from the definition of mutual infor- mation of three random variables:

I(X,Y ; Z) = H(X,Y ) − H(X,Y |Z) = [H(X|Y ) + H(Y )] − [H(X|Y,Z) + H(Y |Z)]

Table 5: Properties of mutual information.

I(X;Y) = I(Y;X)                          symmetry of the mutual information
I(X;Y) = H(X) − H(X|Y)                   mutual information, entropy, and conditional entropy
I(X;Y) = H(Y) − H(Y|X)                   mutual information, entropy, and conditional entropy
I(X;X) = H(X)                            mutual self-information and entropy
I(X;Y) ≥ 0                               non-negativity of the mutual information
I(X;Y) = H(X) + H(Y) − H(X,Y)            mutual information, entropy, and joint entropy
I(X;Y|Z) = H(X|Z) − H(X|Y,Z)             conditional mutual information and conditional entropy
I(X,Y;Z) = I(X;Z|Y) + I(Y;Z)             chain rule for mutual information
I(X;Z) ≤ I(X;Y) if X → Y → Z             data processing inequality

But:

H(X|Y ) − H(X|Y,Z) = I(X; Z|Y ) and H(Y ) − H(Y |Z) = I(Y ; Z). Thus:

I(X,Y ; Z) = I(X; Z|Y ) + I(Y ; Z)

This chain rule can be generalized to n + 1 random variables X_1, X_2, \dots, X_n, Y:

I(X_1, X_2, \dots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y | X_1, X_2, \dots, X_{i-1}).

The Venn diagram in Figure 5 provides an intuitive justification of the relations among the mutual information, the joint entropy, and the conditional entropies of the random variables X and Y discussed earlier:

I(X; Y ) = H(X) + H(Y ) − H(X,Y ),

H(X,Y ) = H(X|Y ) + H(Y |X) + I(X; Y ),

H(X|Y ) = H(X) − I(X; Y ),H(Y |X) = H(Y ) − I(X; Y ).

In practice we rarely have complete information about an event; there are discrepancies between what we expect and the real probability distribution of a random variable. For example, to schedule its gate availability an airport assumes that a particular flight arrives on time 75% of the time, while in reality the flight is on time only 61% of the time. To measure how close two distributions are to each other we introduce the concept of relative entropy. Given a random variable and two probability distributions p(x) and q(x), we wish to develop an entropy-like measure of how close the two distributions are. If the real probability distribution of events is p(x) but we erroneously assume that it is q(x), then the surprise when an event occurs is \log(1/q(x)) and the average surprise is:

\sum_x p(x) \log \frac{1}{q(x)}.

Figure 5: Venn diagrams illustrating the relationship between H(X) and H(Y ), the entropy of random variables X and Y , the mutual information I(X; Y ), the joint entropy H(X,Y ) and the conditional entropy, H(X|Y ).

In this expression we use p(x), the correct probabilities, for averaging. Yet the correct amount of information is given by Shannon's entropy, H(X) = -\sum_x p(x) \log p(x). The relative entropy is defined as the difference between the average surprise and Shannon's entropy.

Relative entropy between two distributions p(x) and q(x) is defined as:

H(p(x) \| q(x)) = \sum_x p(x) \log \frac{p(x)}{q(x)}.

The relative entropy is not a metric in the mathematical sense; in particular, it is not symmetric:

H(p(x) \| q(x)) \neq H(q(x) \| p(x)).

The expression of the relative entropy can also be written as:

H(p(x) \| q(x)) = \sum_x p(x) \log p(x) + \sum_x p(x) \log \frac{1}{q(x)} = -H(X) - \sum_x p(x) \log q(x).

This expression justifies our assertion that the relative entropy is equal to the difference between the average surprise and Shannon's entropy. We now show that the relative entropy is non-negative; thus, as pointed out in [20], we notice an "uncertainty deficit" caused by our inaccurate assumptions: our average surprise is larger than Shannon's entropy.

The relative entropy between two distributions p(x) and q(x) is non-negative; it is zero only when p(x) = q(x). Proof: by virtue of the inequality \ln x \le x - 1 for x > 0 and of the fact that \log x = \ln x / \ln 2, it follows that:

H(p(x) \| q(x)) \ge \frac{1}{\ln 2} \sum_x p(x) \left[ 1 - \frac{q(x)}{p(x)} \right].

But

\sum_x p(x) \left[ 1 - \frac{q(x)}{p(x)} \right] = \sum_x p(x) - \sum_x q(x) = 1 - 1 = 0.

Thus

H(p(x) || q(x)) ≥ 0 with equality if and only if p(x) = q(x).
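The sketch below (Python) computes the relative entropy for the flight-arrival example mentioned earlier, with the true on-time probability 0.61 and the assumed probability 0.75; it also illustrates the lack of symmetry.

from math import log2

def relative_entropy(p, q):
    """H(p || q) = sum_x p(x) log2(p(x)/q(x))."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.61, 0.39]       # real distribution: on time, late
q = [0.75, 0.25]       # assumed distribution

print(relative_entropy(p, q))    # ~0.068 bits
print(relative_entropy(q, p))    # ~0.063 bits -- not symmetric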

Lecture 7 - Wednesday February 3, 2010

Fano Inequality; Data Processing Inequality

Fano's inequality is used to find a lower bound on the error probability of any decoder; it relates the average information lost in a noisy channel to the probability of a categorization error. We first introduce the concept of a Markov chain, then state Fano's inequality, and finally prove the data processing inequality, a result with many applications in classical Information Theory.

Markov chain: random variables X, Y, and Z form a Markov chain, denoted X → Y → Z, if the conditional probability distribution of Z depends only on Y (it is independent of X):

p_{XYZ}(x, y, z) = p_X(x)\, p_{Y|X}(y|x)\, p_{Z|Y}(z|y) \implies X \to Y \to Z.

Consider now a random variable X with the probability density function p_X(x) and let |X| denote the number of elements in the range of X. Let Y be another random variable related to X through the conditional probability p_{Y|X}(y|x).

Fano's inequality. When we estimate X based on the observation of Y as \hat{X} = f(Y), we make an error with probability p_{err} = \text{Prob}(\hat{X} \neq X), and:

H(perr) + perr log(| X |) ≥ H(X|Y ) A weaker form of Fano’s inequality is:

p_{err} \ge \frac{H(X|Y) - 1}{\log |X|}.

Data processing inequality. If the random variables X, Y, Z form a Markov chain, X → Y → Z, then:

I(X; Z) ≤ I(X; Y ). The inequality follows from the chain rule which allows us to expand the mutual information in two different ways:

I(X; Y,Z) = I(X;Z) + I(X;Y|Z) = I(X;Y) + I(X;Z|Y).

But I(X;Y|Z) is non-negative; we also note that X and Z are independent given Y in the Markov chain X → Y → Z:

I(X;Y|Z) ≥ 0 and I(X;Z|Y) = 0. Thus:

I(X;Z) + I(X;Y|Z) = I(X;Y) + I(X;Z|Y) \implies I(X;Z) \le I(X;Y).

Informally, the data processing inequality states that by processing a set of data Y into Z one cannot gather more information about the original data X than Y itself provides; e.g., no amount of signal processing could increase the information we receive from a space probe.
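The data processing inequality can be checked numerically on a small Markov chain; in the Python sketch below the transition probabilities are arbitrary choices made only for illustration.

from math import log2
from itertools import product

p_x = [0.3, 0.7]
p_y_given_x = [[0.9, 0.1], [0.2, 0.8]]     # rows: x, columns: y
p_z_given_y = [[0.8, 0.2], [0.3, 0.7]]     # rows: y, columns: z

# joint distribution p(x, y, z) = p(x) p(y|x) p(z|y)
p_xyz = {(x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
         for x, y, z in product(range(2), repeat=3)}

def mutual_information(pairs):
    """Mutual information between the two coordinates of a joint pmf {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), pr in pairs.items():
        pa[a] = pa.get(a, 0) + pr
        pb[b] = pb.get(b, 0) + pr
    return sum(pr * log2(pr / (pa[a] * pb[b])) for (a, b), pr in pairs.items() if pr > 0)

p_xy, p_xz = {}, {}
for (x, y, z), pr in p_xyz.items():
    p_xy[(x, y)] = p_xy.get((x, y), 0) + pr
    p_xz[(x, z)] = p_xz.get((x, z), 0) + pr

print(mutual_information(p_xy), mutual_information(p_xz))   # I(X;Y) >= I(X;Z)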

Lecture 8 - Monday February 8, 2010

Classical Information Transmission through Discrete Channels

When two agents communicate, each one can influence the physical state of the other through some physical process. The precise nature of the agents and of the signalling process can be very different, thus it is necessary to consider an abstract model of communication. In this model a source has the ability to select among a number of distinct physical signals, pass the selected signal to a communication channel which then affects the physical state of the destination; finally, the destination attempts to identify precisely the specific signal that caused the change of its physical state. In this section we consider an abstraction, the discrete memoryless channel, defined as the triplet (X, Y, p_{Y|X}(y|x)) with:

• The input channel alphabet X; the selection of an input symbol x ∈ X is modelled by the random variable X with the probability density function p_X(x).

• The output channel alphabet Y; the selection of an output symbol y ∈ Y is modelled by the random variable Y with the probability density function p_Y(y).

• The conditional probability density function p(y|x), x ∈ X, y ∈ Y, the probability of observing the output symbol y when the input symbol is x. The channel is memoryless if the probability distribution of the output depends only on the input at that time and not on the past history.

The capacity of a discrete memoryless channel (X, Y, p_{Y|X}(y|x)), with p_X(x) the input distribution, is defined as:

C = \max_{p_X(x)} I(X;Y).

In communication there is a mapping from an original message to a transformed one carried out at the source and an inverse mapping carried out at the destination, called encoding and decoding, respectively. Information encoding allows us to accomplish critical tasks for the processing, transmission, and storage of information, such as:

(i) Error control. An error occurs when an input symbol is distorted during transmission and interpreted by the destination as another symbol from the alphabet. The error control mechanisms transform a noisy channel into a noiseless one; they are built into communication protocols to eliminate the effect of transmission errors.

(ii) Data compression. We wish to reduce the amount of data transmitted through a channel with either no or minimal effect on the ability to recognize the information produced by the source. In this case encoding and decoding are called compression and decompression, respectively.

(iii) Support of confidentiality. We want to restrict access to information to only those who have the proper authorization. The discipline covering this facet of encoding is called cryptography and the processes of encoding/decoding are called encryption/decryption.

Source encoding is the process of transforming the information produced by the source into messages. The source may produce a continuous stream of symbols from the source alphabet.

Figure 6: Source and channel encoding and decoding. (a) The source alphabet consists of four symbols, A, B, C, and D. The source encoder maps each input symbol into a two-bit code: A is mapped to 00, B to 10, C to 01, and D to 11. The source decoder performs the inverse mapping and delivers a string consisting of the four output alphabet symbols. If a one-bit error occurs when the sender generates the symbol D, the source decoder may get 01 or 10 instead of 11 and decode it as either C or B instead of D. (b) The channel encoder maps a two-bit string into a five-bit string. If a one-bit or a two-bit error occurs, the channel decoder receives a string that does not map to any of the valid five-bit strings and detects an error. For example, when 10110 is transmitted and errors in the second and third bit positions occur, the channel decoder detects the error because 11010 is not a valid code word.

The source encoder then cuts this stream into blocks of fixed size. The source decoder performs an inverse mapping and delivers symbols from the output alphabet. Channel encoding allows transmission of the message generated by the source through the channel. The channel encoder accepts as input a set of messages of fixed length and maps the source alphabet into a channel alphabet, then adds a set of redundancy symbols, and finally sends the message through the channel. The channel decoder first determines if the message is in error and takes corrective actions; then it removes the redundancy symbols, maps the channel alphabet into the source alphabet, and hands each message to the source decoder. The source decoder processes the message and passes it to the receiver.

Example 3. Information may be subject to multiple stages of encoding. Figure 6 illustrates multiple encodings for error control on a binary channel; the source uses a four-letter alphabet and the source encoder maps these four symbols into two-bit strings, then the source decoder performs the inverse mapping, as seen in Figure 6(a). The channel encoder increases the redundancy of each symbol encoded by the source encoder by mapping each two-bit string into a five-bit string, Figure 6(b). This mapping allows the channel decoder to detect one- and two-bit errors in the transmission of a symbol from the source alphabet.

We are interested in two fundamental questions regarding communication over noisy channels: (i) Is it possible to encode the information transmitted over a noisy channel so as to minimize the probability of errors? (ii) How does the noise affect the capacity of a channel? Intuitively, we know that when we wish to send delicate porcelain by mail, we have to package it properly; the more packaging material we add, the more likely it is that the porcelain will arrive at the destination in its original condition but, at the same time, we increase the weight of the parcel and add to the cost of shipping. We also know that whenever the level of noise in a room or on a phone line is high, we have to repeat words and sentences several times before the other party understands us. Thus, the actual rate at which we are able to transmit information through a communication channel is lowered by the presence of the noise.

Rigorous answers to both questions, consistent with our intuition, are provided by two theorems: the source coding theorem, which provides an upper bound for the compression of a message, and the channel coding theorem, which states that the two agents can communicate at a rate close to the channel capacity C with a probability of error ε arbitrarily small [13, 14, 15].

The source coding theorem (Shannon). Informally, it states that a message containing n independent, identically distributed samples of a random variable X with entropy H(X) can be compressed to a length:

l_X(n) = nH(X) + o(n).

Example 4. Message length and entropy. Eight cars labelled C1, C2, ..., C8 compete in several Formula 1 races. The probabilities of winning, calculated based on the past race history of the eight cars, are:

p_1 = \frac{1}{2}, \; p_2 = \frac{1}{4}, \; p_3 = \frac{1}{8}, \; p_4 = \frac{1}{16}, \; p_5 = \frac{1}{64}, \; p_6 = \frac{1}{64}, \; p_7 = \frac{1}{64}, \; p_8 = \frac{1}{64}.

To send a binary message revealing the winner of a particular race we could encode the identities of the winning car in several ways. For example, we can use an "obvious" encoding scheme: the identities of C1, C2, C3, C4, C5, C6, C7, C8 could be encoded using three bits, the binary representations of the integers 0 to 7, namely 000, 001, 010, 011, 100, 101, 110, 111, respectively. Obviously, in this case we need three bits for every race; the average length of the string used to communicate the winner of any race is \bar{l} = 3.

Let us now consider an encoding which reduces the average number of bits transmitted. The cars have different probabilities of winning a race and it makes sense to assign a shorter string to a car which has a higher probability of winning. Thus, a better encoding of the identities of C1, C2, C3, C4, C5, C6, C7, C8 is: 0, 10, 110, 1110, 111100, 111101, 111110, 111111.

25 The corresponding lengths of the strings encoding the identity of each car are:

l1 = 1, l2 = 2, l3 = 3, l4 = 4, l5 = l6 = l7 = l8 = 6. In this case, ¯l, the average length of the string used to communicate the winner is:

\bar{l} = \sum_{i=1}^{8} l_i\, p_i = 1 \times \frac{1}{2} + 2 \times \frac{1}{4} + 3 \times \frac{1}{8} + 4 \times \frac{1}{16} + 6 \times \left( 4 \times \frac{1}{64} \right) = 2 \text{ bits}.

Using this encoding scheme, the expected length of the string designating the winner over a large number of car races is less than the one for the "obvious" encoding presented above. Now we calculate the entropy of the random variable W indicating the winner:

H(W) = -\frac{1}{2} \log_2 \frac{1}{2} - \frac{1}{4} \log_2 \frac{1}{4} - \frac{1}{8} \log_2 \frac{1}{8} - \frac{1}{16} \log_2 \frac{1}{16} - \frac{4}{64} \log_2 \frac{1}{64} = 2 \text{ bits}.

We observe that the average length of the string identifying the outcome of a race for this particular encoding scheme is equal to the entropy, which means that this is an optimal encoding scheme. Indeed, the entropy provides the average information or, equivalently, the average uncertainty removed by receiving a message.

The channel coding theorem. Given a noisy channel with capacity C and given 0 < ε < 1, there is a coding scheme that allows us to transmit information through the channel at a rate arbitrarily close to the channel capacity C with a probability of error less than ε. This result constitutes what mathematicians call an existence theorem; it only states that a solution for transforming a noisy communication channel into a noiseless one exists, without giving a hint of how to achieve this result. In real life, various sources of noise distort transmission and lower the channel capacity.

Channel capacity theorem for noisy channels. The capacity C of a noisy channel is:

C = B \times \log_2 \left( 1 + \frac{Signal}{Noise} \right),

where B is the bandwidth of the channel, Signal is the average power of the signal, and Noise is the average noise power. The signal-to-noise ratio is usually expressed in decibels (dB), given by the formula 10 \times \log_{10}(Signal/Noise). A signal-to-noise ratio of 10^3 corresponds to 30 dB, and one of 10^6 corresponds to 60 dB.
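A sketch of this capacity formula in Python; the bandwidth and signal-to-noise ratios are those of Example 5 below, so the printed values reproduce the numbers worked out there.

from math import log2

def capacity(bandwidth_hz, snr_db):
    snr = 10 ** (snr_db / 10)              # convert dB to a power ratio
    return bandwidth_hz * log2(1 + snr)    # bits per second

print(capacity(3500, 30))    # ~34,885 bps, about 35 kbps
print(capacity(3500, 60))    # ~69,760 bps, about 70 kbps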

Example 5. Maximum data rate over a phone line. A phone line allows transmission of frequencies in the range 500 Hz to 4000 Hz and has a signal-to-noise ratio of 30 dB. The maximum data rate through the phone line is:

C = (4000 - 500) \times \log_2(1 + 1000) = 3500 \times \log_2(1001) \approx 35 \text{ kbps}.

If the signal-to-noise ratio improves to 60 dB, the maximum data rate doubles to about 70 kbps. However, improving the signal-to-noise ratio of a phone line by three orders of magnitude, from 10^3 to 10^6, is extremely difficult.

Interestingly enough, we can reach the same conclusion regarding the capacity of a noisy communication channel based on simple physical arguments and the mathematical tools used

earlier for expressing Boltzmann's entropy. Assume that Alice and Bob want to exchange one bit of information but the channel is noisy and there is a probability of one bit in error when three bits are transmitted, p = 1/3. The simplest approach is for Alice to send three bits rather than one: instead of sending "0" she sends "000" and instead of "1" she sends "111." Bob will examine the three bits received and will know that when they are "000," "100," "010," or "001" the message was "0" and when the bits are "111," "011," "101," or "110" the message was "1." In the next chapter we will see that this is a repetition error correcting code.

Consider now the following gedanken experiment [9]: Alice wants to send m bits of information and the probability of error is p. Alice decides to pack the m bits into a message of n > m bits. Then pn bits in the string received by Bob could be in error and (1 - p)n bits may not be affected by errors. Alice is clairvoyant and knows ahead of time which bits will be affected by errors, and she specifies how the pn errors are distributed among the n bits of the message. To specify the number of ways the errors are distributed we have to count the number of ways pn bits in error and (1 - p)n bits that are not in error can be packed in a message of n bits. We conclude that in order to specify how the pn errors are distributed among the n bits of the encoded message, Alice needs a number of bits k equal to:

k = \log \binom{n}{pn}.

k \approx -n[p \log p + (1 - p) \log(1 - p)] = nH(p).

This is the number of bits that Alice would have to use to isolate the errors. Never mind that this approach is unfeasible, because Alice cannot possibly know the positions of the random errors when she transmits the message; there is nevertheless a lesson to be learned from this experiment. When Bob receives the message of n bits and erases it then, according to Landauer's principle, the amount of heat/energy he generates is n \times (k_B T \ln 2) at temperature T. Part of this heat/energy, namely k \times (k_B T \ln 2), is generated while Bob attempts to extract the message from the garbled string of n bits. The additional energy generated to erase the whole message is due to the fact that the communication channel is noisy and we had to specify the information about the errors. In other words, a noisy channel prevents us from utilizing its full capacity; we have to spend an additional amount of energy to erase the additional information used to encode the message. As we have shown earlier, to locate the errors Bob needs to use at least nH(p) bits; thus, only a fraction of the channel capacity is available to transmit useful information, and the number of useful bits is at most:

m = n(1 − H(p)).

Lecture 9 - Wednesday February 10, 2010

Shannon Source Encoding

Source coding addresses two questions: (i) Is there an upper bound for the data compression level and, if so, what is this upper bound? (ii) Is it always possible to reach this upper bound? The answer to both questions is "yes."

Figure 7: Source coding. The inequality \text{Prob}\left[ \left| -\frac{1}{n} \log P(a) - H(A) \right| > \delta \right] < \epsilon partitions the set of strings of length n into two classes: (i) typical strings, which occur with high probability, and (ii) atypical strings, which occur with a vanishing probability.

Informally, Shannon's source encoding theorem states that a message containing n independent, identically distributed samples of a random variable X with entropy H(X) can be compressed to a length:

l_X(n) = nH(X) + o(n).

The justification of this theorem is based on the weak law of large numbers, which states that the average \frac{1}{n} \sum_{i=1}^{n} x_i of a large number of independent, identically distributed random variables x_i approaches their mean \bar{x} = \sum_i x_i\, p_{x_i} with high probability when n is large:

\text{Prob}\left[ \left| \frac{1}{n} \sum_{i=1}^{n} x_i - \bar{x} \right| > \delta \right] < \epsilon,

with \delta and \epsilon two arbitrarily small positive real numbers. Given a source of classical information A = \{a_1, a_2, \dots, a_m\} with an alphabet of m symbols, let \{p(a_1), p(a_2), \dots, p(a_m)\} be the probabilities of the individual symbols; then the Shannon entropy of this source is:

H(A) = -\sum_{i=1}^{m} p(a_i) \log p(a_i).

A message a = (ak1 , ak2 , . . . , akn ) consisting of a sequence of n symbols independently selected from the input alphabet has a probability:

P(a) = p(a_{k_1})\, p(a_{k_2}) \cdots p(a_{k_n}).

If we define a random variable x_i = -\log p(a_i), then we can establish the following correspondence with the quantities in the expression of the weak law of large numbers:

\sum_{i=1}^{n} x_i = -\sum_{i=1}^{n} \log p(a_{k_i}) = -\log P(a), \qquad \bar{x} = \sum_{i=1}^{m} x_i\, p(x_i) = -\sum_{i=1}^{m} p(a_i) \log p(a_i) = H(A).

It follows that, given two arbitrarily small positive real numbers \delta and \epsilon, for sufficiently large n the following inequality holds [12]:

\text{Prob}\left[ \left| -\frac{1}{n} \log P(a) - H(A) \right| > \delta \right] < \epsilon.

This inequality partitions \Lambda, the set of strings of length n, into two subsets, see Figure 7:

(a) The subset of typical strings:

\Lambda_{Typical} = \left\{ a_{Typical} : \left| -\frac{1}{n} \log P(a_{Typical}) - H(A) \right| \le \delta \right\}

that occur with high probability, \text{Prob}(a_{Typical}) \ge (1 - \epsilon); and

(b) The subset of atypical strings:

\Lambda_{Atypical} = \left\{ a_{Atypical} : \left| -\frac{1}{n} \log P(a_{Atypical}) - H(A) \right| > \delta \right\}

that occur with a vanishing probability, \text{Prob}(a_{Atypical}) < \epsilon. The two subsets are disjoint and complementary:

\Lambda_{Typical} \cap \Lambda_{Atypical} = \emptyset \quad \text{and} \quad \Lambda = \Lambda_{Typical} \cup \Lambda_{Atypical}.

We concentrate on the typical strings generated by the source, as atypical strings occur with a vanishing probability and can be ignored. Now we shall determine |\Lambda_{Typical}|, the cardinality of the set of typical strings a \in \Lambda_{Typical}. The inequality defining the strings in this subset can be rewritten as:

-\delta \le -\frac{1}{n} \log P(a_{Typical}) - H(A) \le +\delta \implies 2^{-n(H(A)+\delta)} \le P(a_{Typical}) \le 2^{-n(H(A)-\delta)}.

The first inequality can be expressed in terms of the cardinality of the set; since the probabilities of the typical strings sum to at most 1,

1 \ge \sum_{a \in \Lambda_{Typical}} P(a_{Typical}) \ge \sum_{a \in \Lambda_{Typical}} 2^{-n(H(A)+\delta)} = |\Lambda_{Typical}|\, 2^{-n(H(A)+\delta)}.

It follows that:

|\Lambda_{Typical}| \le 2^{n(H(A)+\delta)}.

Similarly, the second inequality, together with \sum_{a \in \Lambda_{Typical}} P(a_{Typical}) \ge 1 - \epsilon, can be expressed as:

1 - \epsilon \le \sum_{a \in \Lambda_{Typical}} P(a_{Typical}) \le \sum_{a \in \Lambda_{Typical}} 2^{-n(H(A)-\delta)} = |\Lambda_{Typical}|\, 2^{-n(H(A)-\delta)}.

This implies that:

|\Lambda_{Typical}| \ge (1 - \epsilon)\, 2^{n(H(A)-\delta)}.

When \delta \to 0 and \epsilon \to 0, |\Lambda_{Typical}| \to 2^{nH(A)}; the cardinality of the set of typical strings converges to 2^{nH(A)}. There are about 2^{nH(A)} typical strings, therefore we need \log 2^{nH(A)} = nH(A) bits to encode all possible typical strings; this is the upper bound for the data compression provided by Shannon's classical source encoding theorem.

Simple models of classical memoryless channels. A binary channel is one where the input and the output alphabets consist of binary symbols: X = \{0, 1\} and Y = \{0, 1\}. A unidirectional binary communication channel is one where the information propagates in one direction only, from the source to the destination.


Figure 8: (a) A noiseless binary symmetric channel maps a 0 at the input into a 0 at the output and a 1 into a 1. (b) A noisy symmetric channel maps a 0 into a 1, and a 1 into a 0 with probability p. An input symbol is mapped into itself with probability 1 − p.

Noiseless binary channel. A noiseless binary channel transmits each symbol in the input alphabet without errors, as shown in Figure 8(a). The noiseless channel model is suitable in some cases for performance analysis, but it is not useful for reliability analysis, when transmission errors have to be accounted for.

Noisy binary symmetric channel. This model assumes that an input symbol is affected by an error with probability $p > 0$: a 1 at the input becomes a 0 at the output and a 0 at the input becomes a 1 at the output, as shown in Figure 8(b). Assume that the two input symbols occur with probabilities $\text{Prob}(X = 0) = q$ and $\text{Prob}(X = 1) = 1 - q$. In this case:
\[
H(Y \mid X) = \sum_{x \in X} \text{Prob}(X = x)\, H(Y \mid X = x)
\]
\[
H(Y \mid X) = -\left\{ q\left[p \log_2 p + (1-p)\log_2(1-p)\right] + (1-q)\left[p \log_2 p + (1-p)\log_2(1-p)\right]\right\}
\]
\[
H(Y \mid X) = -\left[p \log_2 p + (1-p)\log_2(1-p)\right].
\]

Then the mutual information is:
\[
I(X; Y) = H(Y) - H(Y \mid X) = H(Y) + \left[p \log_2 p + (1-p)\log_2(1-p)\right].
\]
We can maximize $I(X; Y)$ over $q$ to get the channel capacity per symbol of the input alphabet; the maximum is attained at $q = 1/2$, which makes $H(Y) = 1$:
\[
C_s = 1 + \left[p \log_2 p + (1-p)\log_2(1-p)\right].
\]
When $p = 1/2$ the channel capacity is 0 because the output is independent of the input. When $p = 0$ or $p = 1$ the capacity is 1, and we have in fact a noiseless channel.
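A brief numerical check of this maximization (our own sketch, not part of the lecture): compute $I(X;Y)$ of the binary symmetric channel as a function of the input distribution $q$ and verify that its maximum over $q$ equals $C_s = 1 + p\log_2 p + (1-p)\log_2(1-p)$, attained at $q = 1/2$.

    import math

    def h2(x):
        """Binary entropy in bits; h2(0) = h2(1) = 0."""
        return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

    def bsc_mutual_information(q, p):
        """I(X;Y) of a binary symmetric channel with crossover probability p
        and input distribution Prob(X=0) = q."""
        prob_y0 = q * (1 - p) + (1 - q) * p        # Prob(Y = 0)
        return h2(prob_y0) - h2(p)                 # H(Y) - H(Y|X)

    p = 0.1
    best = max(bsc_mutual_information(q / 1000, p) for q in range(1001))
    print(best, 1 - h2(p))                         # both about 0.531 bits per channel use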

Figure 9: A binary erasure channel. The transmitter sends a bit and the receiver receives with probability 1−pe the bit unaffected by error, or it receives with probability pe a message e that the bit was “erased.” pe is the “erasure probability.”

Binary erasure channel (BEC). This channel model captures the following behavior: the transmitter sends a bit and the receiver receives, with probability $1 - p_e$, the bit unaffected by error, or it receives, with probability $p_e$, a message that the bit was "erased," where $p_e$ is the erasure probability. A binary erasure channel with input $X$ and output $Y$ is characterized by the following conditional probabilities:
\[
\text{Prob}(Y = 0 \mid X = 0) = \text{Prob}(Y = 1 \mid X = 1) = 1 - p_e
\]
\[
\text{Prob}(Y = e \mid X = 0) = \text{Prob}(Y = e \mid X = 1) = p_e
\]
\[
\text{Prob}(Y = 0 \mid X = 1) = \text{Prob}(Y = 1 \mid X = 0) = 0.
\]
The binary erasure channel is in some sense error-free: the sender does not have to encode the information and the receiver does not have to invest any effort to decode it, Figure 9. $C_{BEC}$, the capacity of the binary erasure channel, is:

\[
C_{BEC} = 1 - p_e.
\]
Indeed:
\[
C_{BEC} = \max_{p_X(x)} I(X; Y) = \max_{p_X(x)} \left(H(Y) - H(Y \mid X)\right) = \max_{p_X(x)} H(Y) - H(p_e).
\]
If $E$ denotes the erasure event, $Y = e$, then:

\[
H(Y) = H(Y, E) = H(E) + H(Y \mid E).
\]

If the probability $\text{Prob}(X = 1) = p_1$, then:

\[
H(Y) = H\bigl((1 - p_1)(1 - p_e),\; p_e,\; p_1(1 - p_e)\bigr) = H(p_e) + (1 - p_e)\, H(p_1).
\]
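A quick numerical sanity check of this decomposition (our own sketch; the values of $p_1$ and $p_e$ are arbitrary examples): compute $H(Y)$ directly from the three output probabilities and compare it with $H(p_e) + (1 - p_e)H(p_1)$; maximizing the second term over $p_1$ anticipates the capacity result derived next.

    import math

    def entropy(probs):
        """Shannon entropy (bits) of a probability vector."""
        return -sum(q * math.log2(q) for q in probs if q > 0)

    def h2(x):
        return entropy([x, 1 - x])

    p1, pe = 0.3, 0.2                      # arbitrary input and erasure probabilities
    direct = entropy([(1 - p1) * (1 - pe), pe, p1 * (1 - pe)])
    decomposed = h2(pe) + (1 - pe) * h2(p1)
    print(direct, decomposed)              # the two values coincide

    # Maximizing (1 - pe) * h2(p1) over p1 gives (1 - pe), i.e. C_BEC = 1 - pe
    best = max((1 - pe) * h2(q / 1000) for q in range(1001))
    print(best, 1 - pe)                    # about 0.8 in both cases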

It follows that:

\[
C_{BEC} = \max_{p_X(x)} \left(H(Y) - H(p_e)\right) = \max_{p_1} (1 - p_e)\, H(p_1) = 1 - p_e.
\]
The following argument provides an intuitive justification for the expression of the binary erasure channel capacity: assume that the sender receives instantaneous feedback whenever a bit is erased and then retransmits the erased bit. This would result in an average transmission rate of $(1 - p_e)$ bits per channel use. Of course, this ideal scenario is not feasible, thus $(1 - p_e)$ is an upper bound on the channel rate. This brings us to the role of feedback; it can be proved that the rate $(1 - p_e)$ is the best that can be achieved with, or without, feedback [3]. It is surprising that feedback does not increase the capacity of discrete memoryless channels.

q-ary symmetric channels. This model assumes a channel with an alphabet of $q$ symbols. If the probability of error is $p$, then the probability that a symbol is correctly transmitted over the channel is $1 - p$. We assume that if an error occurs, the symbol is replaced by one of the other $q - 1$ symbols with equal probability. For the following analysis we make the assumption that errors are introduced by the channel at random and that the probability of an error in one coordinate is independent of errors in adjacent coordinates. The probability that the codeword $c$ sent over a q-ary symmetric channel is received as the n-tuple $r$ is:
\[
\text{Prob}(r, c) = (1 - p)^{n-d} \left(\frac{p}{q - 1}\right)^{d},
\]
where $d = d(r, c)$ is the Hamming distance between $r$ and $c$ and $n$ is the length of a codeword. Indeed, $n - d$ coordinate positions in $c$ are not altered by the channel; the probability of this event is $(1 - p)^{n-d}$. In each of the remaining $d$ coordinate positions, the symbol in $c$ is altered and transformed into the corresponding symbol in $r$; the probability of this is $(p/(q - 1))^{d}$.

Trace Distance and Fidelity

Information Theory is concerned with the statistical properties of random variables and often demands an answer to the question of how similar two probability density functions are. For example, we may have two information sources S1 and S2 which share the same alphabet but have different probability density functions, $p_X(x)$ and $p_Y(x)$, and we ask ourselves how similar the two sources are. Another question we may pose is how well a communication channel preserves information over time. Let $I$ be the random variable describing the input of a noisy channel and $O$ its output. Let $\tilde{I}$ be a copy of $I$. Now we are interested

in the relationship between the joint probability distributions $p_{I,\tilde{I}}(x_i, x_i)$ and $p_{I,O}(x_i, y_i)$. We introduce two analytical measures of the similarity/dissimilarity of two probability density functions, $p_X(x)$ and $p_Y(x)$. Whenever there is no ambiguity we omit the argument of the density function and write $p_X$ instead of $p_X(x)$.

Trace distance, $D(p_X(x), p_Y(x))$ (also called the Kolmogorov, or $L_1$, distance), of two probability density functions $p_X(x)$ and $p_Y(x)$ is defined as:
\[
D(p_X(x), p_Y(x)) = \frac{1}{2} \sum_x |\, p_X(x) - p_Y(x) \,|.
\]
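A direct transcription of this definition into code (a minimal sketch; the function name and the sample distributions are ours):

    def trace_distance(p, q):
        """Trace (L1, Kolmogorov) distance between two distributions over the same alphabet."""
        return 0.5 * sum(abs(px - qx) for px, qx in zip(p, q))

    print(trace_distance([0.5, 0.5], [0.9, 0.1]))   # 0.4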

Fidelity, $F(p_X(x), p_Y(x))$, of two probability density functions $p_X(x)$ and $p_Y(x)$ is defined as:
\[
F(p_X(x), p_Y(x)) = \sum_x \sqrt{p_X(x)\, p_Y(x)}.
\]
It is easy to see that the fidelity is the inner product of unit vectors with components equal to $\sqrt{p_X(x)}$ and $\sqrt{p_Y(x)}$. These vectors connect the center with points on a sphere of radius one; indeed, $\sum_x \bigl(\sqrt{p_X(x)}\bigr)^2 = 1$ and $\sum_x \bigl(\sqrt{p_Y(x)}\bigr)^2 = 1$. A geometric interpretation of the fidelity is thus the cosine of $\theta$, the angle between the two vectors:

\[
F(p_X(x), p_Y(x)) = \cos\theta.
\]
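The corresponding code for the fidelity and its angle interpretation (again a sketch with made-up example distributions):

    import math

    def fidelity(p, q):
        """Fidelity of two distributions: the inner product of their square-root vectors."""
        return sum(math.sqrt(px * qx) for px, qx in zip(p, q))

    F = fidelity([0.5, 0.5], [0.9, 0.1])
    print(F, math.degrees(math.acos(F)))   # fidelity and the angle theta between the vectors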

Example 6. Two noisy binary symmetric channels with input and output alphabets $\{0, 1\}$ have probabilities of error $p$ and $q$, respectively. This means that when we transmit one of the input symbols, the probability of recovering it correctly is $1 - p$ for the first channel and $1 - q$ for the second. The two measures of similarity between the two distributions are:
\[
D(p_X(x), p_Y(x)) = \frac{1}{2}\left[\,|p - q| + |(1 - p) - (1 - q)|\,\right] = |p - q|
\]
and
\[
F(p_X(x), p_Y(x)) = \sqrt{pq} + \sqrt{(1 - p)(1 - q)} = \sqrt{pq} + \sqrt{pq + 1 - (p + q)}.
\]
If $p = 0.4$ and $q = 0.9$ then

\[
D(p_X(x), p_Y(x)) = |p - q| = 0.5,
\qquad
F(p_X(x), p_Y(x)) = \sqrt{pq} + \sqrt{pq + 1 - (p + q)} \approx 0.8449.
\]
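The numbers in this example can be checked directly (a self-contained snippet; it simply re-evaluates the two formulas for $p = 0.4$ and $q = 0.9$):

    import math

    p, q = 0.4, 0.9
    D = 0.5 * (abs(p - q) + abs((1 - p) - (1 - q)))          # trace distance = |p - q|
    F = math.sqrt(p * q) + math.sqrt((1 - p) * (1 - q))      # fidelity
    print(D, F)                                              # 0.5 and about 0.8449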

The trace distance is a metric while the fidelity is not. Indeed, the trace distance satisfies the properties of a metric:

\[
\begin{array}{ll}
D(p_X, p_Y) \ge 0 & \text{non-negativity} \\
D(p_X, p_Y) = 0 \ \text{iff} \ p_X = p_Y & \text{identity of indiscernibles} \\
D(p_X, p_Y) = D(p_Y, p_X) & \text{symmetry} \\
D(p_X, p_Z) \le D(p_X, p_Y) + D(p_Y, p_Z) & \text{triangle inequality.}
\end{array}
\]
The fidelity fails to satisfy one of the required properties:
\[
F(p_X(x), p_X(x)) = \sum_x \sqrt{p_X(x)\, p_X(x)} = \sum_x p_X(x) = 1 \neq 0.
\]
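A small randomized check of these metric properties (our own sketch; three random distributions over a common alphabet are enough to exercise symmetry and the triangle inequality):

    import random

    def trace_distance(p, q):
        return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

    def random_distribution(k):
        w = [random.random() for _ in range(k)]
        s = sum(w)
        return [x / s for x in w]

    pX, pY, pZ = (random_distribution(4) for _ in range(3))
    assert trace_distance(pX, pY) >= 0
    assert abs(trace_distance(pX, pY) - trace_distance(pY, pX)) < 1e-12
    assert trace_distance(pX, pZ) <= trace_distance(pX, pY) + trace_distance(pY, pZ) + 1e-12
    print("metric properties hold on this sample")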

References

[1] R. B. Ash. Information Theory. Dover Publications, New York, NY, 1965.

[2] J. Brown. The Quest for the Quantum Computer. Simon and Schuster, New York, NY, 1999.

[3] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications. John Wiley & Sons, New York, NY, 1991.

[4] R. W. Hamming. "Error Detecting and Error Correcting Codes." Bell Systems Tech. Journal, 29:147–160, 1950.

[5] R. Landauer. "Irreversibility and Heat Generation in the Computing Process." IBM Journal of Research and Development, 5:182–192, 1961.

[6] D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.

[7] R. Penrose. The Road to Reality: A Complete Guide to the Laws of the Universe. Vintage Books, ISBN 978-0-679-77631-4, 2007.

[8] D. Petz. "Entropy, von Neumann and the von Neumann Entropy." Preprint, arxiv.org/math-ph/0102013 v1, February 2001.

[9] M. B. Plenio and V. Vitelli. "The Physics of Forgetting: Landauer's Erasure Principle and Information Theory." Preprint, arxiv.org/quant-ph/0103108 v1, March 2001.

[10] I. S. Reed. "A Class of Multiple-error-correcting Codes and the Decoding Scheme." IEEE Transactions on Information Theory, 4:38–49, 1954.

[11] I. S. Reed and G. Solomon. "Polynomial Codes over Certain Finite Fields." SIAM Journal of Applied Mathematics, 8:300–304, 1960.

[12] C. E. Shannon. "A Mathematical Theory of Communication." Bell Sys. Tech. Journal, 27:379–423 & 623–656, 1948.

[13] C. E. Shannon. "Communication in the Presence of Noise." Proceedings of the IRE, 37:10–21, 1949.

[14] C. E. Shannon. "Certain Results in Coding Theory for Noisy Channels." Information and Control, 1(1):6–25, 1957.

[15] C. E. Shannon and W. Weaver. A Mathematical Theory of Communication. University of Illinois Press, Urbana, IL, 1963.

[16] M. Sudan. "Lecture Notes for Essential Coding Theory," http://theory.lcs.mit.edu/madhu, December 2005.

[17] N. S. Szabo and R. I. Tanaka. Residue Arithmetic and Its Applications to Computer Technology. McGraw-Hill, New York, NY, 1967.

[18] L. Szilárd. "Über die Entropieverminderung in einem thermodynamischen System bei Eingriffen intelligenter Wesen" (On the Decrease of Entropy in a Thermodynamic System by the Intervention of Intelligent Beings). Zeitschrift für Physik, 53:840–856, 1929.

[19] A. M. Turing. "On Computable Numbers, with an Application to the Entscheidungsproblem." Proceedings of the London Mathematical Society 2, 42:230–265, 1936.

S. A. Vanstone and P. C. van Oorschot. An Introduction to Error Correcting Codes with Applications. Kluwer Academic Publishers, Boston, MA, 1989.

[20] V. Vedral. “The Role of Entropy in Quantum Information Theory.” Preprint, arxiv.org/quant-ph/0102094 v1, 2001.

[21] S. Verdú and T. S. Han. "A General Formula for Classical Channel Capacity." IEEE Trans. on Inform. Theory, 40:1147–1157, 1994.

[22] G. Vidal. "Efficient Classical Simulation of Slightly Entangled Quantum Computations." Phys. Rev. Lett., 91:147902, 2003.

[23] A. J. Viterbi. "Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm." IEEE Trans. on Information Theory, IT-13:260–269, 1967.

[24] A. Wehrl. "General Properties of Entropy." Reviews of Modern Physics, 50(2):221–260, 1978.

[25] C. F. von Weizsäcker. Die Einheit der Natur (The Unity of Nature). Ed. F. Zucker. Farrar, Straus, and Giroux, 1971.

[26] C. F. von Weizsäcker and E. von Weizsäcker. "Wiederaufnahme der begrifflichen Frage: Was ist Information?" (Revisiting the Fundamental Question: What is Information?). Nova Acta Leopoldina, 206, 535, 1972.

[27] J. M. Wozencraft and B. Reiffen. Sequential Decoding. MIT Press, Cambridge, MA, 1961.
