Information Theory: MATH34600

Professor Oliver Johnson [email protected] Twitter: @BristOliver

School of Mathematics, University of Bristol

Teaching Block 1, 2019-20

Why study information theory?

Information theory was introduced by Claude Shannon in 1948. His paper A Mathematical Theory of Communication is available via the course webpage. It forms the basis for our modern world: information theory is used every time you make a phone call, take a selfie, download a movie, stream a song, or save files to your hard disk.
- We represent randomly generated ‘messages’ by sequences of 0s and 1s.
- Key idea: some sources of randomness are more random than others.
- We can compress random information (remove redundancy) to save space when saving it.
- We can pad out random information (introduce redundancy) to avoid errors when transmitting it.
- A quantity called entropy quantifies how well we can do.

Relationship to other fields (from Cover and Thomas)

Now also quantum information, security, algorithms, AI/machine learning (privacy), neuroscience, bioinformatics, ecology . . .

Course outline

15 lectures, 3 exercise classes, 5 mandatory HW sets. Printed notes are minimal; lectures will provide motivation and help. IT IS YOUR RESPONSIBILITY TO ATTEND LECTURES AND TO ENSURE YOU HAVE A FULL SET OF NOTES AND SOLUTIONS. Course webpage for notes, problem sheets, links, textbooks etc: https://people.maths.bris.ac.uk/~maotj/IT.html Drop-in sessions: 11.30-12.30 on Mondays, G83 Fry Building. Just turn up in these times. (Other times, I may be out or busy - but just email [email protected] to fix an appointment.) Notes inherited from Dr Karoline Wiesner – thanks! This material is copyright of the University unless explicitly stated otherwise. It is provided exclusively for educational purposes at the University and is to be downloaded or copied for your private study only.

Course outline (cont.)

Contents

1 Introduction

2 Section 1: Review

3 Section 2: Entropy and Variants

4 Section 3: Lossless Source Coding with Variable Codeword Lengths

5 Section 4: Lossy Source Coding with Fixed Codeword Lengths

6 Section 5: Channel Coding

Section 1: Probability Review

Objectives: by the end of this section you should be able to:
- Recall the necessary basic definitions from probability.
- Understand the relationship between joint, marginal and conditional probability etc.
- Use basic information-theoretic notation and terminology such as alphabet, message, source etc.

Section 1.1: Concepts and notation

Information theory can be viewed as a branch of applied probability. We will briefly review the concepts you are expected to know. We also set the notation used throughout the course. A summary of basic probability can also be found in Chapter 2 of MacKay’s excellent book Information Theory, Inference, and Learning Algorithms, available as pdf online, http://www.cambridge.org/0521642981.

Information sources

Imagine a source of randomness, e.g. tossing a coin, signals from space, sending an SMS. It produces a sequence of symbols from an alphabet (i.e. a finite set).

Definition 1.1. We use the following notation for finite sets, sequences, etc.:
- X, Y, Z, etc.: alphabets (i.e., finite sets).
- x ∈ X etc.: a letter from an alphabet.
- X × Y = {xy : x ∈ X, y ∈ Y}: Cartesian product of sets.(a)
- x^n = x_1 x_2 ... x_n: sequence of length n.
- X^n = X × X^{n−1} = {x^n = x_1 x_2 ... x_n : x_i ∈ X}: set of all strings/words/sequences of length n.
- X^0 = {ε}: contains only the sequence of length 0 (the empty string).
- X^* = X^0 ∪ X^1 ∪ X^2 ∪ ...: set of all strings of any length.

(a) Usually write the elements as the concatenation xy, not as an ordered pair (x, y).

Coin tossing

Example 1.2. Consider tossing coins. The alphabet X is the set {H, T} with letters ‘H’ and ‘T’. The sequence (or string) HTHH ∈ X^4. Want to quantify how likely this outcome is for (biased) coins.

Section 1.2: Random variables

Definition 1.3 (Discrete random variable). A discrete random variable (r.v.) X is a function from a probability space (Ω, P) into a set X. That is, for each outcome ω ∈ Ω we have X(ω) ∈ X.

X takes on one of a possible set of values X = {x_1, x_2, x_3, ..., x_{|X|}}. Write the probability P_X(x) ≡ P(X = x) ≥ 0. We always require that Σ_{x∈X} P_X(x) = Σ_{x∈X} P(X = x) = 1.

Example: English language

Example 1.4. Take a vowel randomly drawn from an English document.

x ∈ X:    a     e     i     o     u
P_X(x):   0.19  0.29  0.19  0.23  0.10

Note that these probabilities are normalised (rescaled to add up to 1). The total probability of a vowel in the complete alphabet (including ’space’) is 0.31.

Example 1.5. For an example of English letter and word frequencies see Shannon’s original paper from 1948, A Mathematical Theory of Communication, available online.

Jointly distributed random variables

Definition 1.6 (Jointly distributed random variables). XY is a jointly distributed random variable on X × Y if X and Y are random variables taking values on X and Y.

Joint distribution PXY s.t. PXY (xy) = P(X = x, Y = y), for all x ∈ X , y ∈ Y.

Remark 1.7.

We obtain the marginal distribution P_X of random variable X from the joint distribution P_{XY} by summation:
P_X(x) = P(X = x) = Σ_{y∈Y} P(XY = xy) = Σ_{y∈Y} P_{XY}(xy),
or in short-hand notation (where context is clear): P(x) = Σ_{y∈Y} P(xy).

Conditional probability

Definition 1.8. The conditional probability of a random variable X , given knowledge of Y , is calculated as follows:

P_{X|Y}(x|y) = P(XY = xy) / P(Y = y) = P_{XY}(xy) / P_Y(y), if P_Y(y) ≠ 0.

If P_Y(y) = 0 then the conditional probability is undefined. Note that Σ_{x∈X} P_{X|Y}(x|y) = 1 for any fixed y.

Remark 1.9. We will often use the product rule (also known as chain rule) for calculating probabilities:

PXY (xy) = PX |Y (x|y)PY (y) = PY |X (y|x)PX (x)

Independence

Definition 1.10. Two random variables X and Y are independent if and only if:

PXY (xy) = PX (x)PY (y),

for all x ∈ X and y ∈ Y.

Equivalently, the conditional probability mass function P_{X|Y}(x|y) does not depend on y. We assume that e.g. successive coin tosses are independent. If X and Y are independent, then so are f(X) and g(Y) for any functions f and g. In general, random variables X and Y are not necessarily independent.

Example: English text

Example 1.11. The ordered pair XY consisting of two successive letters in an English document is a jointly distributed r.v. The possible outcomes are ordered pairs such as aa, ab, ac, and zz. Of these, we might expect ab and ac to be more probable than aa and zz. For example, the most probable value for the second letter given that the first one was q is u. Successive letters in English are not independent.

Independent and identically distributed (i.i.d.)

Definition 1.12.

Random variables X1, X2,..., Xn are independent and identically distributed (i.i.d.) if they are independent and if their individual distributions are all identical.

In other words, there is a probability distribution PX on X such that

PX1X2···Xn (x1x2 ... xn) = PX (x1) ··· PX (xn).

This definition captures the intuitive notion of n independent repeated trials of the same experiment (observations of Xi ).

Section 1.3: Expectation and variance

Definition 1.13 (Expectation). If the set of outcomes X ⊂ R, the expectation or mean EX of a discrete random variable X is defined as
EX := Σ_{x∈X} P_X(x) x.

The expectation is an average of the values x taken by X, weighted by their probabilities P_X(x) = P(X = x). Note that the above sum is finite by the assumption of discreteness. Similarly, define the expectation of functions by Ef(X) := Σ_{x∈X} P_X(x) f(x).

Variance and covariance

Definition 1.14. For random variables X and Y : The variance Var X is defined as

Var X := E(X − EX)^2 = E(X^2) − (EX)^2.

The covariance of X and Y is
Cov(X, Y) := E[(X − EX) × (Y − EY)] = E(X × Y) − E(X) × E(Y).

Variance measures how spread out the distribution of X is around EX . Covariance measures whether there is a linear trend relating X and Y .

Basic properties of expectation and variance

Theorem 1.15. We have the following properties of E and Var. For real numbers λ, µ and jointly distributed random variables X, Y:
(i) E(λX + µY) = λEX + µEY.
(ii) min_a E(X − a)^2 = Var X.
(iii) Var(λX) = λ^2 Var X.
(iv) Var(X + Y) = Var X + Var Y + 2 Cov(X, Y).
(v) If X and Y are independent, then

E(X × Y ) = (EX ) × (EY ) or equivalently Cov (X , Y ) = 0.

Probability example

Example 1.16.

Consider the joint distribution given by

P_{XY}(xy)    X = −1    X = 1
Y = −1:        1/4        0
Y = 1:         3/20       3/5

Marginal distributions: P_X = (2/5, 3/5) and P_Y = (1/4, 3/4).
Conditional distributions:
- For X: P_{X|Y=−1} = (1, 0) and P_{X|Y=1} = (1/5, 4/5).
- For Y: P_{Y|X=−1} = (5/8, 3/8) and P_{Y|X=1} = (0, 1).
We obtain:
- EX = (2/5)(−1) + (3/5)(1) = 1/5
- EY = (1/4)(−1) + (3/4)(1) = 1/2
- E(X × Y) = (1/4)(1) + (3/20)(−1) + (3/5)(1) = 14/20.
Note E(X × Y) ≠ EX × EY, which shows that the two variables are not independent (could also tell this by the joint mass function).
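As a quick numerical cross-check of these figures (a sketch of my own, not part of the original notes), the following Python snippet recomputes the marginals, expectations and E(X × Y) exactly using fractions:

```python
from fractions import Fraction as F

# Joint distribution of Example 1.16: keys are (x, y) pairs.
P = {(-1, -1): F(1, 4), (1, -1): F(0), (-1, 1): F(3, 20), (1, 1): F(3, 5)}

# Marginals obtained by summing the joint distribution over the other variable.
PX = {x: sum(p for (a, b), p in P.items() if a == x) for x in (-1, 1)}
PY = {y: sum(p for (a, b), p in P.items() if b == y) for y in (-1, 1)}

EX = sum(x * p for x, p in PX.items())           # 1/5
EY = sum(y * p for y, p in PY.items())           # 1/2
EXY = sum(x * y * p for (x, y), p in P.items())  # 14/20 = 7/10

print(PX, PY, EX, EY, EXY, EXY == EX * EY)       # final entry False: not independent
```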

Weak law of large numbers

Theorem 1.17 (Weak law of large numbers).

Consider i.i.d. real valued random variables U1, U2,..., Un,... ∼ U such that E|U| < ∞. Define the empirical average as

S_n := (1/n) Σ_{i=1}^{n} U_i.
Then for all ε > 0,
lim_{n→∞} P(|S_n − EU| ≥ ε) = 0.

Fits with intuitive understanding that ‘things average themselves out’.
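To make the ‘averaging out’ concrete, here is a small Monte Carlo sketch (an illustration of my own, with an arbitrary Bernoulli(0.3) source and ε = 0.05) estimating P(|S_n − EU| ≥ ε) as n grows:

```python
import random

# Weak law of large numbers (Theorem 1.17) in action: empirical averages of
# i.i.d. Bernoulli(0.3) variables concentrate around the mean 0.3.
random.seed(0)
p, eps = 0.3, 0.05

def empirical_average(n):
    return sum(random.random() < p for _ in range(n)) / n

for n in (10, 100, 10_000):
    trials = [empirical_average(n) for _ in range(1000)]
    prob_far = sum(abs(s - p) >= eps for s in trials) / len(trials)
    print(n, prob_far)   # estimated P(|S_n - EU| >= eps) shrinks as n grows
```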

Jensen’s inequality

Theorem 1.18. If f is a convex function, then for any random variable X:

Ef (X ) ≥ f (EX ).

If f is strictly convex, then equality holds if and only if X is deterministic (if P(X = x0) = 1 for some x0).

Sketch Proof of Jensen’s inequality.

The claimed result is that Σ_{i=1}^{n} p_i f(x_i) ≥ f(Σ_{i=1}^{n} p_i x_i).
Convexity means p_1 f(x_1) + p_2 f(x_2) ≥ f(p_1 x_1 + p_2 x_2); i.e. the case n = 2.
Do induction; if the result is true for n, then take p_i^* = p_i/(1 − p_1) for i = 2, ..., n + 1, so that Σ_{i=2}^{n+1} p_i^* = 1.

Σ_{i=1}^{n+1} p_i f(x_i) = p_1 f(x_1) + (1 − p_1) Σ_{i=2}^{n+1} p_i^* f(x_i)
  ≥ p_1 f(x_1) + (1 − p_1) f(Σ_{i=2}^{n+1} p_i^* x_i)   (induction hypothesis)
  ≥ f(p_1 x_1 + (1 − p_1) Σ_{i=2}^{n+1} p_i^* x_i)   (convexity, n = 2 case)
  = f(Σ_{i=1}^{n+1} p_i x_i).

Section 2: Entropy and Variants

Objectives: by the end of this section you should be able to:
- Define and interpret the entropy of a random variable.
- Define relative entropy and prove that it is positive.
- Define joint entropy, conditional entropy and mutual information.
- Understand and prove the relationships between these quantities.

Section 2.1: Information sources and entropy

In A Mathematical Theory of Communication, Shannon writes The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages.

Information sources

This set of possible messages is generated by the information source, which is at the beginning of any communication process. An information source emits one message at a time sampled from a given probability distribution. Hence, such a source can be represented by a discrete random variable.

Definition 2.1 (Information source). An information source X is a discrete random variable with source alphabet X and probability distribution PX .

Information

Example 2.2.

Consider an information source generating outcomes (x1,..., xn) with probabilities (p1,..., pn) respectively.

Suppose outcome xi happens (this occurs with probability pi ). How surprised are we? How much have we learnt?

If pi is small, we’ve learned a lot (we are quite surprised).

If pi is big, we’re not so surprised.

Measure surprise via some decreasing function φ(·); surprise is φ(pi ). In fact, we will see that φ(u) = − log(u) is a good choice. This choice was actually introduced by Hartley in the 1920s.

Shannon Entropy

Definition 2.3 (Entropy).

The entropy of the random variable X is the following function of P_X:
H(X) ≡ H(P_X) := − Σ_{x∈X} P_X(x) log P_X(x).
Here, logs are taken to base 2. The unit of entropy is the bit. We adopt the convention 0 log 0 := 0.

Values of x irrelevant: only the (unordered) list of values P_X(x) matters.

Remark 2.4. The entropy H(X) is the expectation of the real random variable ϕ(X), with the function ϕ(x) := φ(P_X(x)) = − log P_X(x):
H(X) = E[− log P_X(X)].
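Since H(X) is just an expectation of − log P_X(X), it is a one-line computation. The sketch below (an illustration of my own, not from the notes) evaluates it for the fair coin of Example 2.5 and the vowel source of Example 1.4:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits, H(P) = -sum p log2 p, with the convention 0 log 0 := 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Binary entropy H_2(1/2) and the vowel distribution of Example 1.4.
print(entropy([0.5, 0.5]))                      # 1.0 bit
print(entropy([0.19, 0.29, 0.19, 0.23, 0.10]))  # roughly 2.25 bits
```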

Example: Binary random variable

Example 2.5.

Let X = {0, 1}, i.e. |X | = 2, and PX (0) = p, PX (1) = 1 − p. Then, for 0 ≤ p ≤ 1:

H2(p) := H(X ) = −p log p − (1 − p) log(1 − p) .

Note H2(0) = H2(1) = 0 bits and H2(1/2) = 1 bit.

Draw the graph of − log p, −p log p and H2(p)?

Example: uniform distribution

Example 2.6.

Suppose X = {1, ..., m} and P_X(x) = 1/m for all x. Then
H(X) = − Σ_{x=1}^{m} (1/m) log(1/m) = log m.

Turning this round, a random variable with entropy H is ‘as unpredictable’ as a uniform distribution with 2^H outcomes. The bigger H is, the more unpredictable.

Motivation for entropy

Can motivate entropy axiomatically – Shannon did this. Shannon formulated his axioms in terms of the uncertainty of a random variable. In information theory a measure of uncertainty is equivalent to a measure of information. Idea: “uncertainty before outcome of random variable is observed” equals “expected amount of information obtained when outcome is observed”.

Axioms for entropy

Shannon wanted a measure of uncertainty H to have the following properties as a function of the probabilities P_X(x):

1 H should be continuous in the P_X(x) (small changes in P_X mean small changes in H).

2 If all the probabilities are equal, P_X(x) = 1/|X|, then H should be a monotonically increasing function of |X|.
3 If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H.

Illustration of third axiom

At the left we have three possibilities; on the right we first choose between two possibilities, and if the second occurs we make another choice with probabilities 1/3, 2/3. We require, in this special case, that
H(1/2, 1/3, 1/6) = H(1/2, 1/2) + (1/2) H(1/3, 2/3).
The coefficient 1/2 appears because this second choice only occurs half the time. Shannon proved that the only function observing all axioms is of the form −K Σ_{x∈X} p_X(x) log p_X(x).

Properties of entropy

Proposition 2.7.

The entropy has the following properties:

(i) H(X) is a function only of the probabilities P_X(x), and not of their ordering or the labelling of X.
(ii) H(X) ≥ 0, with equality if and only if P_X is a point mass at some x_0 ∈ X, i.e. P_X(x) = 1 if x = x_0 and P_X(x) = 0 if x ≠ x_0.

(iii) H(X) ≤ log |X|, with equality if and only if P_X is uniform.
(iv) H(P) is a strictly concave function of P, i.e. for 0 ≤ λ ≤ 1 and distributions P_X and Q_X,
λ H(P_X) + (1 − λ) H(Q_X) ≤ H(λP_X + (1 − λ)Q_X),

with equality if and only if λ = 0 or λ = 1 or PX = QX .

Proof of Proposition 2.7.

(i) A sum is invariant under permutations.
(ii) H(X) is the sum of non-negative terms:
- Because 0 ≤ P_X(x) ≤ 1, each term −P_X(x) log P_X(x) ≥ 0, with equality if and only if P_X(x) = 0 or 1.
- Hence H(P_X) ≥ 0.
- If H(P_X) = 0, all terms in the sum must be zero, and hence each P_X(x) = 0 or 1.
(iii) (left as homework)
(iv) θ(t) = −t log t is strictly concave (θ''(t) = −1/(t · ln 2) < 0). Apply Jensen (Theorem 1.18) to a Bernoulli(λ) r.v. to deduce, for any x:

λθ(PX (x)) + (1 − λ)θ(QX (x)) ≤ θ (λPX (x) + (1 − λ)QX (x))

Summing over x ∈ X we deduce the result (see Remark 2.19 for a better proof).

Section 2.2: Relative entropy

We define a variant (a sort of generalization) of entropy:

Definition 2.8. Given two probability distributions P and Q taking values on the same set X, we define the relative entropy from P to Q by

D(P‖Q) = Σ_{x∈X} P(x) log( P(x) / Q(x) ).

Notice that if P(x) = Q(x) for all x then D(P‖Q) = 0. In fact (see below) D(P‖Q) ≥ 0, with equality if and only if P ≡ Q. Hence think of it as a kind of distance (even though it is not symmetric and satisfies no triangle inequality). D occurs in lots of applications in Statistics and Machine Learning; it is also called the Kullback–Leibler (KL) distance.
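The following short Python sketch (illustrative only; the two distributions are arbitrary choices of mine) computes D(P‖Q) and shows the lack of symmetry numerically:

```python
from math import log2

def relative_entropy(P, Q):
    """D(P||Q) = sum_x P(x) log2(P(x)/Q(x)); assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(p * log2(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]
print(relative_entropy(P, Q), relative_entropy(Q, P))  # both positive, unequal: not symmetric
print(relative_entropy(P, P))                          # 0 when the distributions coincide
```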

Gibbs inequality

Theorem 2.9. For any probability distributions P and Q, the relative entropy satisfies

D(P‖Q) ≥ 0,

with equality if and only if P ≡ Q.

We often rearrange this as
− Σ_{x∈X} P_X(x) log P_X(x) ≤ − Σ_{x∈X} P_X(x) log Q_X(x)
to deduce upper bounds on entropy.

Proof of Gibbs inequality, Theorem 2.9

Proof. Notice that the function f(u) = − log u is strictly convex. Apply Jensen’s inequality (Theorem 1.18) to obtain

D(P‖Q) = Σ_{x∈X} P(x) (− log(Q(x)/P(x)))
       ≥ − log( Σ_{x∈X} P(x) Q(x)/P(x) ) = − log( Σ_{x∈X} Q(x) ) = 0.

We deduce that equality holds if and only if P(x)/Q(x) ≡ c for some c; but P and Q are probability distributions, so c = 1 and P ≡ Q.

Log-sum inequality

Corollary 2.10.

For any two collections of positive numbers (a_1, ..., a_n) and (b_1, ..., b_n), we have
Σ_{i=1}^{n} a_i log(a_i/b_i) ≥ (Σ_{i=1}^{n} a_i) log( (Σ_{i=1}^{n} a_i) / (Σ_{i=1}^{n} b_i) ),   (2.1)

with equality if and only if ai /bi is constant in i.

Proof of Corollary 2.10

Proof. Writing A = Σ_{i=1}^{n} a_i and B = Σ_{i=1}^{n} b_i, we can create probability distributions P(i) = a_i/A and Q(i) = b_i/B. Then positivity of D(P‖Q) gives

0 ≤ D(P‖Q) = Σ_{i=1}^{n} (a_i/A) log( (a_i/A) / (b_i/B) )
           = (1/A) ( Σ_{i=1}^{n} a_i log(a_i/b_i) − Σ_{i=1}^{n} a_i log(A/B) )
           = (1/A) ( Σ_{i=1}^{n} a_i log(a_i/b_i) − A log(A/B) ).
Corollary 2.10 is equivalent to the positivity of the bracketed term.

Equality holds if and only if ai /A ≡ bi /B.

Section 2.3: Joint and conditional entropy, mutual information

Given random variables X and Y , since the pair XY is itself a random variable, we can ask about its entropy. In fact, following definition isn’t new at all; it’s what you’d get if you treated XY as a random variable directly.

Definition 2.11.

Given random variables X ∈ X and Y ∈ Y with joint pmf P_{XY}, we can define the joint entropy
H(XY) = − Σ_{xy∈X×Y} P_{XY}(xy) log P_{XY}(xy).

Mutual information

Natural to ask how H(XY ) relates to H(X ) and H(Y ). Make the following definition

Definition 2.12. Given random variables X ∈ X and Y ∈ Y, we can define the mutual information I (X ; Y ) = H(X ) + H(Y ) − H(XY ).

I (X ; Y ) measures how much information there is in X about Y . Also how much information there is in Y about X . Symmetry is interesting

Symmetry of I and causality

– from A Diary on Information Theory by Alfred Rényi.

Mutual information does not capture causality. Need more advanced quantities: transfer entropy, directed information?

Noisy typewriter

Example 2.13.

Consider X uniformly distributed on {0,..., 3}. Define Z uniformly distributed on {0, 1} and Y = X + Z mod 4. Can write joint probability distribution of X and Y :

P_{XY}(x, y)   x = 0   x = 1   x = 2   x = 3
y = 0:          1/8     0       0       1/8
y = 1:          1/8     1/8     0       0
y = 2:          0       1/8     1/8     0
y = 3:          0       0       1/8     1/8

Noisy typewriter (cont.)

Example 2.13. Looking at marginals, can see X and Y are both uniform on set of size 4, so

H(X ) = H(Y ) = log 4 = 2 (see Example 2.6).

Similarly XY uniform on set of pairs of size 8 so

H(XY ) = log 8 = 3

Hence I (X ; Y ) = 2 + 2 − 3 = 1. In general taking n < m, if X uniform on {0,..., m − 1} and Y = X + Z mod m, where Z uniform on {0,..., n − 1} then

H(X ) = H(Y ) = log m, H(XY ) = log(mn), I (X ; Y ) = log(m/n) > 0.
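These values are easy to check numerically. The sketch below (my own check, not part of the notes) rebuilds the joint table of Example 2.13 and recovers H(X) = H(Y) = 2, H(XY) = 3 and I(X; Y) = 1:

```python
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Joint distribution of the noisy typewriter (Example 2.13): mass 1/8 on 8 cells.
joint = {(x, (x + z) % 4): 0.125 for x in range(4) for z in (0, 1)}

PX = [sum(p for (x, y), p in joint.items() if x == i) for i in range(4)]
PY = [sum(p for (x, y), p in joint.items() if y == j) for j in range(4)]

HX, HY, HXY = H(PX), H(PY), H(list(joint.values()))
print(HX, HY, HXY, HX + HY - HXY)   # 2.0 2.0 3.0 and I(X;Y) = 1.0
```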

Subadditivity of entropy

Lemma 2.14 (Subadditivity of entropy).

For any two jointly distributed random variables X and Y ,

H(XY ) ≤ H(X ) + H(Y ) i.e., H(PXY ) ≤ H(PX ) + H(PY ),

with equality if and only if X and Y are independent, i.e., if PXY = PX × PY .

We gain less information in total from the value of XY than the sum of the information we would gain by learning X and learning Y . In general X has clues about Y , unless they are independent.

Corollary 2.15.

Equivalently the mutual information I (X ; Y ) ≥ 0 with equality if and only if X and Y are independent.

Proof of Lemma 2.14.

Use Gibbs inequality (Theorem 2.9) with P = PXY and Q = PX × PY and definition of marginals:

0 ≤ D(P_{XY} ‖ P_X × P_Y)
  = Σ_{x∈X, y∈Y} P_{XY}(xy) log( P_{XY}(xy) / (P_X(x) P_Y(y)) )
  = Σ_{x∈X, y∈Y} P_{XY}(xy) log P_{XY}(xy) − Σ_{x∈X, y∈Y} P_{XY}(xy) log P_X(x) − Σ_{x∈X, y∈Y} P_{XY}(xy) log P_Y(y)
  = Σ_{x∈X, y∈Y} P_{XY}(xy) log P_{XY}(xy) − Σ_{x∈X} P_X(x) log P_X(x) − Σ_{y∈Y} P_Y(y) log P_Y(y)
  = −H(XY) + H(X) + H(Y).

Conditional entropy

Given two random variables X and Y we can consider the conditional probability distribution P_{Y|X}(·|x) for any value x. We can write
H(Y|X = x) := − Σ_{y∈Y} P_{Y|X}(y|x) log P_{Y|X}(y|x)
for the entropy of this conditional probability distribution.

Definition 2.16.

The average of H(Y|X = x) over P_X is called the conditional entropy,
H(Y|X) := Σ_{x∈X} P_X(x) H(Y|X = x).

H(Y |X ) represents how surprised we are to learn Y , given that we know X already.

Chain rule for entropy

Proposition 2.17.

1. For any random variables X and Y :

H(XY ) = H(X ) + H(Y |X ) (chain rule for entropy).

2. Equivalently,

I(X; Y) = H(Y) − H(Y|X)   (2.2)
        = H(Y) − Σ_{x∈X} P_X(x) H(Y|X = x)   (2.3)

3. H(Y|X) ≥ 0.
4. H(Y|X) ≤ H(Y), with equality if and only if X and Y are independent.

Noisy typewriter, Example 2.13 (cont.)

Example 2.18. For the noisy typewriter, for any x the conditional distribution Y|X = x has probability 1/2 for two values. Hence H(Y|X = x) = log 2 = 1 for all x. So H(Y|X) = Σ_{x∈X} P_X(x) H(Y|X = x) = 1. Equivalently, by the chain rule and the values previously calculated,

H(Y |X ) = H(XY ) − H(X ) = 3 − 2 = 1.

Perhaps not a surprise. Can argue that

H(Y |X ) = H(X + Z|X ) = H(Z|X ) = H(Z) = log 2 = 1.

(Don’t worry if you can’t follow this chain!)

Proof of Chain Rule, Proposition 2.17.

1. By definition P_{Y|X=x}(y) = P_{XY}(xy)/P_X(x). Hence
H(Y|X) = Σ_{x∈X} P_X(x) H(Y|X = x)
       = − Σ_{x∈X} P_X(x) Σ_{y∈Y} (P_{XY}(xy)/P_X(x)) log( P_{XY}(xy)/P_X(x) )
       = − Σ_{x∈X, y∈Y} P_{XY}(xy) (log P_{XY}(xy) − log P_X(x))
       = H(XY) + Σ_{x∈X} P_X(x) log P_X(x)
       = H(XY) − H(X).

3. Follows since H(Y |X = x) ≥ 0 for all x (since it is an entropy). 4. Combining Corollary 2.15 and (2.2) we deduce the result.

Proof of concavity of entropy

Remark 2.19. The result H(Y|X) ≤ H(Y) allows a better proof of Proposition 2.7(iv). Recall that Proposition 2.7(iv) claims that

λ H(P) + (1 − λ) H(Q) ≤ H(λP + (1 − λ)Q).

X is outcome of flipping a coin which is heads with probability λ. Consider Y with conditional distribution Y |X given as

Y |{ heads } ∼ P, Y |{ tails } ∼ Q

Then overall Y ∼ λP + (1 − λ)Q, so RHS is H(Y ) as required.

H(Y |X ) = P( heads )H(Y | heads ) + P( tails )H(Y | tails ) = λH(P) + (1 − λ)H(Q).

Intuition for the entropy zoo: Hu correspondence

Remark 2.20. Can use a 1-1 matching between entropies and the sizes µ of sets:

H(X) ←→ µ(A)
I(X; Y) ←→ µ(A ∩ B)
H(XY) ←→ µ(A ∪ B)
H(X|Y) ←→ µ(A \ B)

See MacKay P140:

Intuition: Hu correspondence (cont.)

Remark 2.20. Hu shows that for pairs of random variables any linear relationship in these H and I is true, if and only if the corresponding relationship in terms of set sizes also holds. For example Proposition 2.17

H(XY ) = H(X ) + H(Y |X ) ←→ µ(A ∪ B) = µ(A) + µ(B \ A).

For example Lemma 2.14

H(XY ) ≤ H(X ) + H(Y ) ←→ µ(A ∪ B) ≤ µ(A) + µ(B).

Recommend checking the other results of this chapter in the same spirit. Doesn’t count as a proof: gives intuition as to which results may hold.

Section 2.4: Further entropy results (not proved here)

Many further properties of entropy and related quantities are available.

Theorem 2.21.

For collections of random variables X_1, X_2, ... and Y_1, Y_2, ...:
1 (See Cover and Thomas Theorem 2.5.1.) Generalized chain rule (generalizes Proposition 2.17):

H(X_1, ..., X_n) = Σ_{i=1}^{n} H(X_i | X_1, ..., X_{i−1}).

Think of the cumulative surprise of learning the collection of random variables one by one.
2 General conditioning principle:

H(X_1, ..., X_i | Y_1, ..., Y_j) is increasing in i and decreasing in j.

Fano’s inequality

Theorem 2.22. Suppose that we have a random variable X, and a noisy version Y, and we try to estimate X by X̂ = g(Y) for some (random) function g. Then
P(X ≠ X̂) ≥ (H(X|Y) − 1) / log(|X| − 1).

Note, we sometimes write X → Y → X̂ = g(Y).

Proof (not examinable).

Write E = I(X ≠ X̂) for the error indicator random variable. Consider the conditional entropy H(X, E|X̂) in two ways.

Proof of Fano (cont.)

1 Using the conditional form of the chain rule (Proposition 2.17):

H(X, E|X̂) = H(X|X̂) + H(E|X, X̂) = H(X|X̂) ≥ H(X|X̂, Y) = H(X|Y).

Here, H(E|X, X̂) = 0 because if we know X and X̂ we know E.

2 Similarly, H(X, E|X̂) equals

H(E|X̂) + H(X|E, X̂)
  ≤ 1 + P(E = 0) H(X|E = 0, X̂) + P(E = 1) H(X|E = 1, X̂)
  = 1 + P(X ≠ X̂) H(X|E = 1, X̂)
  ≤ 1 + P(X ≠ X̂) log(|X| − 1),

because if E = 0 and we know X̂ then we know X.

Section 3: Lossless Source Coding with Variable Codeword Lengths

Objectives: by the end of this section you should be able to:
- Understand the relationships between decodability and the prefix-free property.
- State and prove the Kraft inequality.
- Understand the relationship between the entropy of a source and the expected length of optimal compression.

Section 3.1: Variable-length source coding

Want to efficiently represent output of source in binary. Want to be able to transmit it as cheaply as possible. Think of e.g. coding text or video.

Definition 3.1 (Variable-length source code). A (variable-length) binary source code for source X is an injective function f : X → {0, 1}^* from the source alphabet into the finite binary sequences. Say each f(x) is a codeword. The code being injective means no two symbols are mapped onto the same sequence. Otherwise we wouldn’t be able to decode.

How good is a code? Length function

Definition 3.2. For binary strings b^n ∈ {0, 1}^n ⊆ {0, 1}^* define the length function ℓ as ℓ(b^n) := n. Note that, considering source X and code f, the length variable L := ℓ(f(X)) is a real random variable. Write ℓ_X for brevity. Hence Eℓ_X = Σ_{x∈X} P(X = x) ℓ_x is the expected length of the code.

Unique decodability

We demanded that f : X → {0, 1}^* be injective. We would like this to be true for any sequence from the source alphabet X, not only single symbols. So, we extend the function f to all sequences composed from X and demand that it be injective as well:

Definition 3.3. To encode strings of symbols we concatenate codewords f(x):

f^* : X^* → {0, 1}^*,
x^n = x_1 ... x_n ↦ f^*(x_1 ... x_n) = f(x_1) ... f(x_n).

Definition 3.4 (Uniquely decodable code (UDC)). If the encoding function f ∗ is injective, we say that f is a uniquely decodable code (UDC).

UDC, or not UDC, that is the question

Example 3.5.

For X = {A, B, C, D} consider the encoding functions f , g:

     A    B    C    D
f    1    00   01   10
g    0    10   110  111

Are these encoding functions uniquely decodable? Explain why or why not. Hint: Try to decode 1101100111. f is not uniquely decodable because ABA and DC both map to 1001. g is uniquely decodable (though it may not be obvious why).

Prefix-free codes

Key idea: can write codewords of g on a tree – they all lie at leaves (never go through one codeword to get to another).

Definition 3.6 (Prefix-free code). A sequence a^k is a prefix of a sequence b^n if k ≤ n and there exists a sequence c^{n−k} such that b^n = a^k c^{n−k}. An encoding f is prefix-free if no codeword f(x) is the prefix of another one, f(y) (with y ≠ x).

Prefix-free implies UDC

Lemma 3.7. All prefix-free codes are uniquely decodable.

Proof. Scan the encoded sequence from left to right. Thinking of going down the tree: you are guaranteed to reach a unique codeword (and there will be no other codeword on that path). This lends prefix-free codes the name ‘instantaneously decodable’.

Remark 3.8. The converse is not necessarily true. Not all uniquely decodable codes are prefix-free.

Example 3.9. Code g from Example 3.5 is prefix-free, and hence instantaneously decodable. As one scans g^*(x^n) from the left, it begins either with 0, or with 10, or with 110, or with 111. Hence we can represent it on a tree. One can decide after reading the first ’0’ or three ’1’s what the first symbol will be. E.g.

110 111 110 0 0 0 10 10 0 110 111

gives CDCAAABBACD.

Section 3.2: Kraft inequality

We shall analyse the combinatorial properties of prefix-free codes. Of course, good codes are not just uniquely decodable but also short. Thus, our goal is to minimise the expected length of an encoding Eℓ_X over all uniquely decodable codes.

We denote this minimum, which we still need to find, by ℓ_min(X). We can’t use any old lengths: there is a constraint, called the Kraft inequality.

Remark 3.10 (Intuition for Kraft). If all codewords have the same length ℓ (say), there can be at most 2^ℓ of them, so
Σ_{x∈X} 2^{−ℓ_x} = Σ_{x∈X} 2^{−ℓ} = 2^{−ℓ} Σ_{x∈X} 1 ≤ 2^{−ℓ} 2^ℓ = 1.

In general, a word of length ℓ_x ‘blocks out’ a proportion 2^{−ℓ_x} of the tree.

Kraft inequality

Proposition 3.11 (Kraft inequality).

Let f : X → {0, 1}^* be a prefix-free code.

(a) Then the codeword lengths ℓ_x := ℓ(f(x)) satisfy the Kraft inequality,
Σ_{x∈X} 2^{−ℓ_x} ≤ 1.   (3.1)

(b) Conversely, for integers ℓ_x, x ∈ X, satisfying the Kraft inequality (3.1), there exists a prefix-free code f with ℓ(f(x)) = ℓ_x for all x ∈ X.

Proof of Kraft inequality, Proposition 3.11(a)

Proof.

To prove (a), let ℓ_0 := max_{x∈X} ℓ_x be the length of the longest codeword.

For each word x define the set of sequences of length ℓ_0 which start with the codeword f(x):

S_x := {b^{ℓ_0} : f(x) is a prefix of b^{ℓ_0}}
     = {f(x) c^{ℓ_0 − ℓ_x} : c^{ℓ_0 − ℓ_x} ∈ {0, 1}^{ℓ_0 − ℓ_x}}
     ⊆ {0, 1}^{ℓ_0}.

Observe that |S_x| = 2^{ℓ_0 − ℓ_x}.

Proof of Kraft inequality, Proposition 3.11(a) (cont.)

Proof.

We now show that the S_x are pairwise disjoint: otherwise there would be some b^{ℓ_0} ∈ S_x ∩ S_y (with x ≠ y).

Assuming w.l.o.g. that ℓ_x ≤ ℓ_y, we have:

(i)  b^{ℓ_0} = f(x) c^{ℓ_0 − ℓ_x} = f(x) c_1 ... c_{ℓ_y − ℓ_x} c_{ℓ_y − ℓ_x + 1} ... c_{ℓ_0 − ℓ_x}

(ii) b^{ℓ_0} = f(y) d^{ℓ_0 − ℓ_y} = f(y) d_1 ... d_{ℓ_0 − ℓ_y}.

(i) and (ii) must be equal, and so f(y) = f(x) c_1 ... c_{ℓ_y − ℓ_x}. This, however, means that f(x) is a prefix of f(y), which is a contradiction to the assumption that the code is prefix-free.

Proof of Kraft inequality, Proposition 3.11(a) (cont.)

Proof. Now we get by counting,

2^{ℓ_0} = |{0, 1}^{ℓ_0}|
        ≥ |∪_{x∈X} S_x|
        = Σ_{x∈X} |S_x|   (since the S_x are disjoint)
        = Σ_{x∈X} 2^{ℓ_0 − ℓ_x}.

Dividing by 2^{ℓ_0} we deduce that 1 ≥ Σ_{x∈X} 2^{−ℓ_x}. This concludes the proof of (a).

Proof of Kraft inequality, Proposition 3.11(b)

Proof.

To prove (b), let 1 ≥ Σ_{x∈X} 2^{−ℓ_x}. We construct the code recursively, i.e. by induction.

For the basis, if X has a single element x, s.t. ℓ_x = 0, then X = {x}, and we let f(x) = ε (the empty word) and are done. For the inductive step we use the following Lemma.

Lemma 3.12.

Let f : X → {0, 1}^* satisfy Kraft (i.e. Σ_{x∈X} 2^{−ℓ_x} ≤ 1). Then X can be partitioned as X = X_0 ∪ X_1, such that
Σ_{x∈X_i} 2^{−ℓ_x} ≤ 1/2, for i = 0, 1.

Proof of Kraft inequality, Proposition 3.11(b)

Proof. Before proving Lemma 3.12, let us apply it. It says that we can partition X into two subsets. Assume that you have prefix-free codes for the two subsets. That is, assume f_b : X_b → {0, 1}^*, b = 0, 1, with

ℓ(f_0(x)) = ℓ_x − 1 if x ∈ X_0,
ℓ(f_1(x)) = ℓ_x − 1 if x ∈ X_1.   (3.2)

Now define a prefix-free code for the entire set X_0 ∪ X_1 in the following way:
f(x) := 0 f_0(x) if x ∈ X_0,
f(x) := 1 f_1(x) if x ∈ X_1.

Proof of Kraft inequality, Proposition 3.11(b)

Proof. f is also prefix-free. That is, if f(x) were the prefix of some f(y), y ≠ x, then the two codewords would have to start with the same bit b.

But by definition, this would mean x, y ∈ X_b, and
f(x) = b f_b(x), f(y) = b f_b(y).

So fb(x) would be a prefix of fb(y) (or vice versa) which is a contradiction to the assumption.

Proof of Kraft inequality, Proposition 3.11(b)

Proof. Applying this construction recursively yields a prefix-free code for X, assuming Lemma 3.12 is true. We need Lemma 3.12 to partition the set X into pairwise disjoint sets. This guarantees that we can divide X into subsets repeatedly until we obtain sets containing only one symbol.

Motivation for Lemma 3.12 comes from a greedy method:

- Keep adding the x ∈ X with the smallest ℓ_x into X_0, as long as the total weight Σ_{x∈X_0} 2^{−ℓ_x} ≤ 1/2.
- This process either terminates by exhausting X, i.e. X_0 = X, with Σ_{x∈X_0} 2^{−ℓ_x} < 1/2; or in a set X_0 with Σ_{x∈X_0} 2^{−ℓ_x} = 1/2 (see below).
- Hence Σ_{x∈X_1} 2^{−ℓ_x} ≤ 1/2 (where we let X_1 := X \ X_0).

Proof of Lemma 3.12

Proof.

If Σ_{x∈X} 2^{−ℓ_x} ≤ 1/2, Lemma 3.12 is true (take X_0 = X and X_1 = ∅). Hence assume Σ_{x∈X} 2^{−ℓ_x} > 1/2. Use the greedy method outlined above and add elements to X_0. In fact, we will show that if Σ_{x∈X_0} 2^{−ℓ_x} < 1/2, then for the next largest ℓ_{x*},

Σ_{x∈X_0} 2^{−ℓ_x} + 2^{−ℓ_{x*}} ≤ 1/2.

Hence, adding the next largest ℓ_{x*} either does not fill up 1/2 or fills up 1/2 exactly.

Proof of Lemma 3.12 (cont.)

Proof.

Let ℓ_max be max{ℓ_x : x ∈ X_0}; by assumption 1 ≤ ℓ_max ≤ ℓ_{x*}. This allows us to rewrite our assumption:

Σ_{x∈X_0} 2^{−ℓ_x} < 2^{−1}
⟹ Σ_{x∈X_0} 2^{ℓ_max − ℓ_x} < 2^{ℓ_max − 1}
⟹ Σ_{x∈X_0} 2^{ℓ_max − ℓ_x} + 1 ≤ 2^{ℓ_max − 1}   (both sides are integers)
⟹ Σ_{x∈X_0} 2^{−ℓ_x} + 2^{−ℓ_max} ≤ 1/2
⟹ Σ_{x∈X_0} 2^{−ℓ_x} + 2^{−ℓ_{x*}} ≤ 1/2   (as ℓ_max ≤ ℓ_{x*}, so 2^{−ℓ_{x*}} ≤ 2^{−ℓ_max}).

Example

Example 3.13. Does a prefix-free code exist for the alphabet and word lengths:

x ∈ X:   A   B   C   D   E   F
ℓ_x:     2   2   3   4   4   4

Σ_{x∈X} 2^{−ℓ_x} = 1/4 + 1/4 + 1/8 + 1/16 + 1/16 + 1/16
                 = (4 + 4 + 2 + 1 + 1 + 1)/16 = 13/16 < 1.
Hence a code exists.
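A one-line check of the Kraft sum (a sketch of my own, not from the notes):

```python
from fractions import Fraction as F

# Kraft sum for the codeword lengths of Example 3.13.
lengths = {'A': 2, 'B': 2, 'C': 3, 'D': 4, 'E': 4, 'F': 4}
kraft_sum = sum(F(1, 2 ** l) for l in lengths.values())
print(kraft_sum, kraft_sum <= 1)   # 13/16 True, so a prefix-free code with these lengths exists
```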

Recap

Let’s summarise what we have done so far. Given a list of codeword lengths, we can study the following three properties:
(U) There exists a uniquely decodable code f : X → {0, 1}^* with lengths ℓ_x.
(P) There exists a prefix-free code f : X → {0, 1}^* with lengths ℓ_x.
(K) The Kraft inequality holds: Σ_{x∈X} 2^{−ℓ_x} ≤ 1.
We have shown so far that:

(K) ⇔ (P) ⇒ (U),
where the equivalence is Proposition 3.11 and the implication holds because all prefix-free codes are UDCs.

McMillan Theorem

The Kraft inequality can be strengthened to UDCs.

Theorem 3.14 (McMillan).

Let f : X → {0, 1}^* be a uniquely decodable code.
(a) Then it satisfies the Kraft inequality (3.1), that is
Σ_{x∈X} 2^{−ℓ_x} ≤ 1.

(b) Conversely, for integers ℓ_x, x ∈ X, satisfying the Kraft inequality (3.1), there exists a uniquely decodable code f with ℓ(f(x)) = ℓ_x for all x ∈ X.

Remark 3.15. Hence, all three properties – (U),(P), and (K) – are equivalent:

(K)⇔(P)⇔(U)

Proof of Theorem 3.14

Proof. Proving part (b) is trivial, using Proposition 3.11 and the fact that any prefix-free code is uniquely decodable. To prove part (a), let f : X → {0, 1}^* be a UDC and ℓ_max := max_x ℓ_x. Consider the kth extension of the code (i.e. code k-tuples using it). Since the code is uniquely decodable, the kth extension is non-singular, i.e. x^u ≠ x^v ⇒ f(x^u) ≠ f(x^v). If a(m) is the number of code words of length m (in the kth extension), we have a(m) ≤ 2^m (since the code is non-singular).

Proof of Theorem 3.14 (cont.)

Proof. For the extension code, the length of a code sequence is

ℓ_{x^k} := ℓ_{x_1 x_2 ... x_k} = Σ_{i=1}^{k} ℓ_{x_i}.

Consider the kth power of the sum in the Kraft inequality:

( Σ_{x∈X} 2^{−ℓ_x} )^k = Σ_{x_1∈X} ··· Σ_{x_k∈X} 2^{−ℓ_{x_1}} ··· 2^{−ℓ_{x_k}}
                       = Σ_{x^k∈X^k} 2^{−ℓ_{x_1}} ··· 2^{−ℓ_{x_k}}
                       = Σ_{x^k∈X^k} 2^{−ℓ_{x_1 x_2 ... x_k}} = Σ_{x^k∈X^k} 2^{−ℓ_{x^k}}.

Proof of Theorem 3.14 (cont.)

Proof. We now gather terms by word length, remembering that there are a(m) ≤ 2^m of length m. We find:

( Σ_{x∈X} 2^{−ℓ_x} )^k = Σ_{m=1}^{kℓ_max} a(m) 2^{−m} ≤ Σ_{m=1}^{kℓ_max} 2^m 2^{−m} = kℓ_max.

Hence, taking the kth root,

Σ_{x∈X} 2^{−ℓ_x} ≤ (kℓ_max)^{1/k}.
This inequality is true for any k. Taking the limit k → ∞ we obtain, as desired,
Σ_{x∈X} 2^{−ℓ_x} ≤ 1.

Section 3.3: Best prefix-free codes

Remember we set out to minimise the expected length over all uniquely decodable codes for a given source.

With Proposition 3.11, our task of minimising Eℓ_X over all uniquely decodable codes f simplifies as follows:
ℓ_min(X) = min_{f UDC} Σ_{x∈X} P_X(x) ℓ_x
         = min { Σ_{x∈X} P_X(x) ℓ_x : integers ℓ_x s.t. Σ_{x∈X} 2^{−ℓ_x} ≤ 1 }.
How do we solve this optimization problem?

Lower bound on ℓ_min(X)

First, we obtain a lower bound, by relaxing the constraint that the ℓ_x have to be integers to ℓ_x ∈ R. Then, writing Q_X(x) = 2^{−ℓ_x} (a sort of probability distribution) for the optimal choice of ℓ_x,

ℓ_min(X) = Σ_{x∈X} P_X(x) ℓ_x
         = − Σ_{x∈X} P_X(x) log Q_X(x)
         = Σ_{x∈X} P_X(x) log( P_X(x)/Q_X(x) ) − Σ_{x∈X} P_X(x) log P_X(x)
         ≥ 1 · log( 1 / Σ_{x∈X} Q_X(x) ) + H(X)
         ≥ H(X).

Here we use the log-sum inequality (Corollary 2.10), and the final result follows because Kraft gives us that Σ_{x∈X} Q_X(x) ≤ 1.

Entropy is the data compression limit

We have proved the following result:

Theorem 3.16 (Shannon’s noiseless coding theorem: converse part).

Given an information source X and a uniquely decodable code f , we have

H(X ) ≤ E`X , (3.3)

and equality holds iff `x := − log PX (x).

Remark 3.17. This bound is one of the key justifications for being interested in Shannon entropy: the entropy is the limit for data compression.

Upper bound on ℓ_min(X)

The lower bound suggests that we should find lengths ℓ_x for the code satisfying Kraft such that, for all x, 2^{−ℓ_x} is close to P_X(x). So we make the reasonable guess

ℓ_x := ⌈− log P_X(x)⌉. Because of
(a) − log P_X(x) ≤ ℓ_x and (b) ℓ_x < − log P_X(x) + 1,
this assignment satisfies the Kraft inequality

(by (a)): Σ_{x∈X} 2^{−ℓ_x} ≤ Σ_{x∈X} P_X(x) = 1.
Using (b) we find
Eℓ_X = Σ_{x∈X} P_X(x) ℓ_x < Σ_{x∈X} P_X(x) (− log P_X(x) + 1) = H(X) + 1.

Shannon’s noiseless coding theorem

We have just proved Shannon’s noiseless coding theorem:

Theorem 3.18 (Shannon’s noiseless coding theorem: direct part).

Given an information source X, the optimal code satisfies

H(X ) ≤ `min(X ) < H(X ) + 1 . (3.4)

In particular, we can construct a prefix-free code f , such that

ℓ_x = ⌈− log P(x)⌉.   (3.5)

Definition 3.19. Given an information source X , we say a UDC is Shannon optimal if

H(X ) ≤ E`X < H(X ) + 1. (3.6)

Constructing Shannon codes

Remark 3.20. Note, that in the above derivation of Shannon’s noiseless coding theorem we have a proof by construction.

Choose ℓ_x := ⌈− log P_X(x)⌉, then take the codeword lengths ℓ_x and sort them in increasing order. We can explicitly construct a code using a binary tree.

Take the first ℓ_x. Choose the left-most empty leaf at depth ℓ_x; this corresponds to the binary codeword f(x). All descendants of this leaf are eliminated.

Repeat this with all ℓ_x.

Kraft tells us that, since each word of length ℓ_x takes up a proportion 2^{−ℓ_x} of the tree, we can pack them in this way.

Constructing Shannon codes

Definition 3.21 (Shannon code). A code for information source X which is constructed by choosing ℓ_x := ⌈− log P_X(x)⌉ and with the above (or an equivalent) procedure is called a Shannon code.

Any Shannon code is Shannon optimal, per definition.

Shannon code example

Example 3.22.

Construct a Shannon code for the following information source. Can this code be improved in terms of its expected length?

x ∈ X:    A    B    C    D    E     F
P_X(x):   1/4  1/4  1/8  1/8  3/16  1/16

Direct calculation using ℓ_x = ⌈− log P_X(x)⌉ gives ℓ_x = {2, 2, 3, 3, 3, 4}. From the graph/tree we can read off the code:

x ∈ X:   A    B    C     D     E     F
f(x):    00   01   100   101   110   1110

The code can be improved by shortening the last code word to 111 without compromising unique decodability.
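The length calculation of Example 3.22 can be reproduced directly (an illustrative sketch of my own; the variable names are mine):

```python
from math import ceil, log2

# Shannon codeword lengths l_x = ceil(-log2 P(x)) for the source of Example 3.22.
probs = {'A': 1/4, 'B': 1/4, 'C': 1/8, 'D': 1/8, 'E': 3/16, 'F': 1/16}
lengths = {x: ceil(-log2(p)) for x, p in probs.items()}
expected_len = sum(p * lengths[x] for x, p in probs.items())
H = -sum(p * log2(p) for p in probs.values())
print(lengths)             # {'A': 2, 'B': 2, 'C': 3, 'D': 3, 'E': 3, 'F': 4}
print(H, expected_len)     # about 2.45 <= 2.5625 < H + 1, as Theorem 3.18 promises
```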

Section 3.4: Huffman codes

Example 3.22 shows that Shannon codes are Shannon optimal but do not necessarily achieve the shortest expected length of all UDCs. Thus, we define:

Definition 3.23 (Optimal code). A uniquely decodable code f for information source X is called optimal if

E`X = `min(X ).

We will now introduce a second method for constructing a UDC, the so-called Huffman construction. Theorem: Huffman codes are optimal in the above sense. A proof is not included here but can be found, for example in Ash Information Theory.

Constructing Huffman codes

Remark 3.24. The construction proceeds as follows (see the sketch after this list).
- Arrange the letters x ∈ X by decreasing probabilities.
- Add the two smallest probabilities to obtain a new probability distribution on one fewer symbol.
- Arrange the new probabilities again in decreasing order.
- Continue this process until there are only two probabilities left.
- When we have only two probabilities left, we assign codewords. Each time two probabilities were merged, we can represent that in a binary tree. Constructing the whole tree, we can assign 0 and 1 to each branching point by some rule.
- The resulting code is prefix-free by construction.
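The sketch below is a minimal Python rendering of this merging procedure, returning codeword lengths only (assigning the actual 0/1 labels is the final step described above). The function name huffman_lengths and the heap-based bookkeeping are my own choices, not part of the notes.

```python
import heapq
from itertools import count

def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code for a dict {symbol: probability}.

    Standard textbook construction: repeatedly merge the two least probable
    nodes; each merge adds one bit to the depth of every symbol merged.
    """
    tie = count()                      # tie-breaker so heapq never compares lists
    heap = [(p, next(tie), [s]) for s, p in probs.items()]
    heapq.heapify(heap)
    depth = {s: 0 for s in probs}
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            depth[s] += 1
        heapq.heappush(heap, (p1 + p2, next(tie), s1 + s2))
    return depth

print(huffman_lengths({'a': 0.2, 'b': 0.1, 'c': 0.3, 'd': 0.1, 'e': 0.3}))
# {'a': 2, 'b': 3, 'c': 2, 'd': 3, 'e': 2}, matching the lengths in Example 3.25
```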

Huffman code example

Example 3.25. Construct a Huffman code for the information source given below.

x ∈ X:    a    b    c    d    e
P_X(x):   0.2  0.1  0.3  0.1  0.3

A solution is

x ∈ X:   a    b     c    d     e
f(x):    10   111   01   110   00

Other solutions are possible, but they will have the same lengths.

Section 4: Lossy Source Coding with Fixed Codeword Lengths

Objectives: by the end of this section you should be able to:
- Understand that the optimal variable-length coding solutions of the previous section can be hard to compute.
- Describe an alternative of fixed-length coding, which is almost optimal with a fixed probability of error.
- Understand this in the context of the Asymptotic Equipartition Property and typical sets.

Section 4.1: Block codes

We have seen that for any r.v. X , there exist prefix-free codes f : X → {0, 1}∗ such that

H(X ) ≤ E`X < H(X ) + 1,

For instance, both the Shannon and the Huffman code have this property. The extra 1 bit in the upper bound might be OK when the entropy is large, but that is typically not the case. It is natural to ask if the theoretical lower bound can be reached by some coding strategy. We write P_n for a general probability distribution on X^n.

For example, we may choose to take P_2 as the product distribution, P_2 = P ⊗ P = P^{⊗2}. We introduce block coding, which we motivate as follows.

Motivating example

Example 4.1. Let X = {a, b}, and let P(a) := 1/4, P(b) := 3/4. The code f(a) := 0, f(b) := 1 obviously has minimal expected codelength, and is essentially unique. Let f_2 denote the extension of f to X^2, given by f_2(aa) = f(a)f(a) = 00, f_2(ab) = f(a)f(b) = 01, f_2(ba) = f(b)f(a) = 10, f_2(bb) = f(b)f(b) = 11. If we evaluate the expected codelength of f_2 with respect to P^{⊗2}, we get Eℓ_{X^2} = 2. Hence the expected number of bits per symbol transmitted remains at 1.

Motivating example (cont.)

Example 4.1. However, we can find a UDC for P⊗2 with shorter expected codelength.

For example, the Huffman code g_2 has g_2(aa) = 101, g_2(ab) = 11, g_2(ba) = 100, g_2(bb) = 0. The expected codeword length is Eℓ_{X^2} = 1.6875, so the number of bits per symbol is now about 0.84. What if we consider longer and longer extensions?

Block codes

Definition 4.2 (Block code). Let X be a finite set. A function f_n : X^n → {0, 1}^* is called a block code for blocklength n.

Unique decodability and the prefix-free property are defined for block codes the same way as for single-letter codes.

Expected length per letter

Remark 4.3. Given an information source X^n and a block code f_n : X^n → {0, 1}^* for blocklength n, we can write ℓ_{X^n} = ℓ(f_n(X^n)) for the length of the encoded block. With X^n ∼ P_n the expected codelength is

L_n := Eℓ_{X^n}.

To compare the performance of block codes for different blocklengths, consider the expected codelength per letter (or coding rate):

L_n / n = (1/n) Eℓ_{X^n}.
This tells us how many bits we need to use on average to encode one letter.

Key idea

Remark 4.4. By definition, if f_n is a Shannon-optimal code for (X^n, P_n) then

H(X^n) ≤ L_n < H(X^n) + 1.

Now, for independent X, Y, we have H(XY) = H(X) + H(Y), so H(X^n) = nH(X) (since P_n is a product distribution, i.e. successive symbols are independent). Dividing by n we get
H(X) ≤ L_n / n < H(X) + 1/n.
Using block codes for large enough blocklength n, the expected codelength per letter can be brought arbitrarily close to H(X).
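As an illustration (a sketch of my own, not from the notes), the snippet below applies a Shannon code to blocks from the source of Example 4.1 and shows L_n/n approaching H(X) ≈ 0.811 as n grows, always staying below H(X) + 1/n:

```python
from math import ceil, log2
from itertools import product

# Bits per letter of a Shannon code on blocks of length n, for the source of
# Example 4.1 (P(a) = 1/4, P(b) = 3/4); Remark 4.4 sandwiches this between
# H(X) and H(X) + 1/n.
P = {'a': 0.25, 'b': 0.75}
H = -sum(p * log2(p) for p in P.values())   # about 0.811 bits

for n in (1, 2, 4, 8, 16):
    block_probs = []
    for block in product(P, repeat=n):      # all blocks x^n in X^n
        q = 1.0
        for sym in block:
            q *= P[sym]                     # product (i.i.d.) distribution
        block_probs.append(q)
    # Shannon code on blocks: length ceil(-log2 q).
    bits_per_letter = sum(q * ceil(-log2(q)) for q in block_probs) / n
    print(n, round(bits_per_letter, 4), round(H + 1 / n, 4))
```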

Section 4.2: Fixed-length codes

As described, variable-length block codes have optimal performance. However, e.g. Huffman codes can be hard to calculate in practice. Also, variable-length codes are susceptible to transmission errors. Shannon combined the robustness of fixed-length coding with data compression in his lossy source coding theorem (Theorem 4.11). The price to pay for data compression is to allow errors in the decoding process, which can be made arbitrarily close to zero. The main idea is simple:
1 use block coding;
2 encode only those sequences that are “typical”, and ignore the rest;
3 show that the number of bits needed to encode the typical sequences grows with the entropy instead of log |X|; and
4 the error from ignoring the “atypical” sequences is negligible if the blocklength is large enough.

Fixed-length code definitions

Definition 4.5. Let X^n with n ∈ N be an information source with probability distribution P^{⊗n}.

A fixed-length block code for blocklength n is a pair (fn, gn), where

f_n : X^n → {0, 1}^k is the encoding function, and

g_n : {0, 1}^k → X^n is the decoding function.

The error probability of the code (f_n, g_n) is

p_e(f_n, g_n) := P{x^n ∈ X^n : g_n(f_n(x^n)) ≠ x^n}.

We agree to tolerate a certain level ε of errors; that is, we consider (f_n, g_n) such that p_e(f_n, g_n) ≤ ε. Key question: how small can k be?

Equivalent formulation

Remark 4.6. Finding a lossy source code for blocklength n and error threshold ε is equivalent to finding a subset A_n ⊂ X^n such that

P(X^n ∈ A_n) ≥ 1 − ε.   (4.1)

Code quality is measured by the codelength/blocklength ratio

(1/n) ⌈log |A_n|⌉,   (4.2)
the average number of bits to encode one letter. Again we take log to base 2 – if |A_n| ∼ 2^k then k ∼ log |A_n|.

Optimising (4.2) over all subsets A_n satisfying (4.1) is equivalent to optimising the codelength over all codes (f_n, g_n) with p_e(f_n, g_n) ≤ ε.

High-probability strings: Example

It is in principle clear how to determine the optimal set An. Order the elements of X n according to decreasing probability.

Keep adding elements to A_n until their total probability reaches 1 − ε.

Example 4.7.

Let X = {a, b} and P(a) = 1/3, P(b) = 2/3, with n = 3 and ε = 1/3. First, we list the probabilities in decreasing order. Adding the four largest probabilities, we have

P(bbb) + P(bba) + P(bab) + P(abb) = 20/27 > 2/3 = 1 − ε.

On the other hand, if we add fewer probabilities then we can’t reach the 1 − ε threshold. Hence there is a unique optimal set, given by A_3 = {bbb, bba, bab, abb}.

Typical sets

While optimal, this procedure can be computationally intensive and difficult to analyse.

Shannon realised that instead of this optimal choice of An, we can choose a suboptimal set, the so-called typical set. This set is still asymptotically optimal, and much easier both to work with and to analyse.

Definition 4.8. Let X be a finite set with probability distribution P. A sequence xn ∈ X n is entropy δ-typical (for P), if for some δ ≥ 0

| −(1/n) log P(x^n) − H(P) | = | (1/n) Σ_{i=1}^{n} (− log P(x_i)) − H(P) | ≤ δ.   (4.3)

We call the collection of sequences satisfying (4.3) the entropy δ-typical set (or just ‘typical set’), denoted Tn,δ.

Equivalent formulation of typical set

Remark 4.9. From Definition 4.8 we can immediately see that for x^n ∈ T_{n,δ}:
H(P) − δ ≤ −(1/n) log P(x^n) ≤ H(P) + δ.   (4.4)
Equivalently,

2^{−n(H(P)+δ)} ≤ P(x^n) ≤ 2^{−n(H(P)−δ)}.   (4.5)

That is, all strings in the typical set T_{n,δ} essentially have the same probability ≈ 2^{−nH(P)}. There can’t be more than 2^{n(H(P)+δ)} of them, so we can code T_{n,δ} with binary strings of length about n(H(P) + δ). See the proof of the direct part of Theorem 4.11 for a more formal version of this argument.
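A simulation sketch (illustrative parameters of my own choosing) that estimates P(T_{n,δ}) for a biased coin, previewing the AEP stated next:

```python
import random
from math import log2

# Monte Carlo estimate of P(T_{n,delta}) for a Bernoulli(0.25) source: the
# probability of the entropy delta-typical set tends to 1 as n grows.
random.seed(1)
p, delta = 0.25, 0.05
H = -(p * log2(p) + (1 - p) * log2(1 - p))   # about 0.811 bits

def is_typical(n):
    xs = [random.random() < p for _ in range(n)]
    neg_log_prob = -sum(log2(p) if x else log2(1 - p) for x in xs)
    return abs(neg_log_prob / n - H) <= delta

for n in (100, 1000, 10000):
    hits = sum(is_typical(n) for _ in range(500))
    print(n, hits / 500)   # estimated P(T_{n,delta}) increases towards 1
```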

Asymptotic Equipartition Property (AEP)

Theorem 4.10 (AEP).

Let X be a finite set with probability distribution P and let Tn,δ be the corresponding entropy δ-typical set. Then

lim_{n→∞} P(T_{n,δ}) = 1.

i.e. for large n, most of the probability is concentrated on the set Tn,δ, over which the probability distribution is more or less uniform. This is called the Asymptotic Equipartition Property (AEP).

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 108 / 157 Proof of Theorem 4.10

Proof. It is a special case of the weak law of large numbers, Theorem 1.17.

Key is to consider random variables $U_i = -\log P(X_i)$ (the log-probability of the $i$th symbol observed). The $U_i$ are i.i.d. with mean $\sum_{x \in \mathcal{X}} P(x)(-\log P(x)) = H(P)$. We write

$$S_n = \frac{1}{n}\sum_{i=1}^n U_i = \frac{\sum_{i=1}^n -\log P(X_i)}{n}.$$

Theorem 1.17 gives

$$\lim_{n\to\infty} P(T_{n,\delta}) = \lim_{n\to\infty} P\big(|S_n - H(P)| \le \delta\big) = 1.$$
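To see the AEP numerically, here is a small Monte Carlo sketch (my own illustration, not part of the notes): it estimates $P(T_{n,\delta})$ for increasing $n$ by sampling i.i.d. sequences, exactly as in the weak-law argument above; the estimates should climb towards 1.

```python
import math
import random

random.seed(0)
probs = {"a": 1/3, "b": 2/3}
symbols, weights = list(probs), list(probs.values())
H = -sum(p * math.log2(p) for p in probs.values())
delta = 0.05

def prob_typical(n, trials=5000):
    """Monte Carlo estimate of P(T_{n,delta}) for i.i.d. samples."""
    hits = 0
    for _ in range(trials):
        xs = random.choices(symbols, weights=weights, k=n)
        S_n = sum(-math.log2(probs[x]) for x in xs) / n
        hits += abs(S_n - H) <= delta
    return hits / trials

for n in (10, 100, 1000):
    print(n, prob_typical(n))   # estimates increase towards 1 as n grows
```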

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 109 / 157 Shannon’s lossy fixed-length source coding theorem

Theorem 4.11.

Direct part: For every rate $R > H(P)$, there exists a sequence of subsets $A_n \subset \mathcal{X}^n$ such that
$$\lim_{n\to\infty} \frac{1}{n}\lceil \log |A_n| \rceil \le R \quad \text{and} \quad \lim_{n\to\infty} P(A_n) = 1.$$

Converse part: For any rate $R < H(P)$ and any sequence of subsets $A_n \subset \mathcal{X}^n$ such that $\lim_{n\to\infty} \frac{1}{n}\lceil \log |A_n| \rceil \le R$, we have that the limit (if it exists)

$$\lim_{n\to\infty} P(A_n) \neq 1.$$

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 110 / 157 Proof of Theorem 4.11

Direct part. Let $R > H(P)$, and let $\delta > 0$ be such that $H(P) + \delta \le R$.

Take $A_n = T_{n,\delta}$. Then we have

$$1 = \sum_{x^n \in \mathcal{X}^n} P(x^n) \ge \sum_{x^n \in T_{n,\delta}} P(x^n) \ge \sum_{x^n \in T_{n,\delta}} 2^{-n(H(P)+\delta)} = |T_{n,\delta}|\, 2^{-n(H(P)+\delta)}.$$

Hence we deduce that $|T_{n,\delta}| \le 2^{n(H(P)+\delta)}$, or rearranging,
$$\frac{1}{n}\lceil \log |T_{n,\delta}| \rceil \le H(P) + \delta + \frac{1}{n} \le R + \frac{1}{n}.$$
Taking the limit $n \to \infty$, the first claim follows. The second claim follows from the AEP, Theorem 4.10.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 111 / 157 Proof of Theorem 4.11

Converse part. For $R < H(P)$, let $0 < \delta < 1$ be such that $R + 2\delta \le H(P)$.

Consider $n$ large enough that the typical set $T_{n,\delta}$ has probability $\ge 1 - \delta$ (this will happen by the AEP, Theorem 4.10).

Construct the optimal set $A_n$ as in Example 4.7 above. First use all the strings of probability $\ge 2^{-n(H(P)-\delta)}$ (these are not typical).

These contribute at most $\delta$ to the probability $P(A_n)$. Hence we need to use typical strings as well. However, we can use at most $2^{nR}$ typical strings by assumption. These contribute at most

$$2^{nR}\, 2^{-n(H(P)-\delta)} \le 2^{-n\delta}$$

to the probability $P(A_n)$. In total $P(A_n) \le \delta + 2^{-n\delta}$, which does not converge to 1.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 112 / 157 Section 5: Channel Coding

Objectives: by the end of this section you should be able to Describe noisy communication channels. Understand the definitions of rate and capacity of a channel. Know how to calculate and bound capacity for some practical examples.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 113 / 157 Section 5.1: Introduction to channel coding

So far, we have considered the scenario where our encoded message could be perfectly recovered (unless we choose to discard low-probability strings). Of course, in reality this is rarely the case, and all sorts of errors beyond our control might happen. For instance, bits might be flipped or erased. We show how to find a reliable coding scheme by using longer codewords, i.e., by fighting the noise by introducing redundancy. Reliability here means that we can control the error using long enough codewords, so the decoding error probability can be made arbitrarily small.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 114 / 157 Model for communication channel

Can describe a channel through a collection of conditional probabilities $P_{Y|X} = W$.

Definition 5.1. Let $\mathcal{X}$ and $\mathcal{Y}$ be finite sets. A channel $W$ with input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$ is a function $W : \mathcal{X} \times \mathcal{Y} \to [0,1]$ satisfying
$$\sum_{y \in \mathcal{Y}} W(y|x) = 1 \quad \text{for all } x \in \mathcal{X}.$$

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 115 / 157 Interpretation of channel Remark 5.2. The above condition says that for every x ∈ X , the function W (y|x) is a probability distribution on Y. The interpretation is that for a given input x, the channel produces a random output y according to the probability distribution W (y|x).

That is, $W(y|x) = P_{Y|X}(y|x)$ encodes the conditional probabilities of output given input. The numbers $W(y|x)$ are called the transition probabilities of the channel.

Given an ordering of $\mathcal{X} = \{x_1, \ldots, x_n\}$ and $\mathcal{Y} = \{y_1, \ldots, y_m\}$, $W$ can be represented by a matrix, which we also denote by $W$, with entries

$$W_{ij} := W(y_j | x_i), \quad \text{satisfying } \sum_j W(y_j|x_i) = 1 \text{ for each } i \text{ (rows sum to 1)}.$$

Such a matrix is called a stochastic matrix (cf. Markov chains?)

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 116 / 157 Binary symmetric channel

Example 5.3.

The binary symmetric channel with error parameter $p \in [0,1]$ has the same input and output alphabets $\mathcal{X} = \mathcal{Y} = \{0,1\}$. It is given by the stochastic matrix

$$W = \begin{pmatrix} 1-p & p \\ p & 1-p \end{pmatrix}.$$

We denote this channel by $BS_p$. The interpretation is that every input bit is either flipped, with probability $p$, or transmitted faithfully, with probability $1-p$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 117 / 157 Erasure channel

Example 5.4. The binary erasure channel with erasure parameter $p \in [0,1]$ has input alphabet $\mathcal{X} = \{0,1\}$ and output alphabet $\mathcal{Y} = \{0, 1, e\}$. It is given by the stochastic matrix (writing the last column for $e$)

$$W = \begin{pmatrix} 1-p & 0 & p \\ 0 & 1-p & p \end{pmatrix}.$$

We denote this channel by $BE_p$. Every input bit is either transmitted faithfully, with probability $1-p$, or erased, with probability $p$. Erasure of the input is represented by the output symbol $e$.
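For concreteness, here is a short NumPy sketch (my own illustration, not part of the notes) representing the transition matrices of Examples 5.3 and 5.4 and checking the stochastic-matrix property from Definition 5.1 (rows sum to 1).

```python
import numpy as np

def BS(p):
    """Binary symmetric channel BS_p as a stochastic matrix (rows = inputs 0, 1)."""
    return np.array([[1 - p, p],
                     [p, 1 - p]])

def BE(p):
    """Binary erasure channel BE_p (columns ordered 0, 1, e)."""
    return np.array([[1 - p, 0.0, p],
                     [0.0, 1 - p, p]])

for W in (BS(0.1), BE(0.1)):
    # Definition 5.1: each row W(.|x) must be a probability distribution
    assert np.allclose(W.sum(axis=1), 1.0)
    print(W)
```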

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 118 / 157 Joint channel distribution

The transition probabilities of channel W define, for every input signal x ∈ X , a probability distribution W (y|x) on the output alphabet Y.

Definition 5.5.

A channel $W$ induces a joint probability distribution $P_{XY}$ on $\mathcal{X} \times \mathcal{Y}$ for every probability distribution $P_X$. It is defined as
$$P_{XY}(xy) = P_X(x)\, W(y|x).$$

Check that $P_{XY}$ is a probability distribution on $\mathcal{X} \times \mathcal{Y}$ (adds up to 1).

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 119 / 157 Output distributions Can consider the marginal distribution of Y induced by resulting joint distribution PXY . Definition 5.6.

Given a probability distribution $P_X$ on the input alphabet $\mathcal{X}$, $W$ defines a probability distribution $P_Y$ on the output: for any $y \in \mathcal{Y}$,
$$P_Y(y) = \sum_{x \in \mathcal{X}} P_{XY}(xy) = \sum_{x \in \mathcal{X}} P_X(x) W(y|x).$$

Remark 5.7.

(cf. Markov chains) Can write $P_X$ as a row vector.

Check that the output distribution $P_Y$ is then obtained as the row vector $P_X W$, i.e. by multiplying this row vector on the left of $W$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 120 / 157 Output distributions example

Example 5.8.

Let $W = BE_p$ be the binary erasure channel with error parameter $p$, and let the input distribution $P_X$ be $P_X(0) = 1/3$, $P_X(1) = 2/3$. Then
$$P_Y(0) = P_X(0)W(0|0) + P_X(1)W(0|1) = \tfrac{1}{3}(1-p) + 0 = \tfrac{1-p}{3},$$
$$P_Y(1) = P_X(0)W(1|0) + P_X(1)W(1|1) = 0 + \tfrac{2}{3}(1-p) = \tfrac{2(1-p)}{3},$$
$$P_Y(e) = P_X(0)W(e|0) + P_X(1)W(e|1) = \tfrac{1}{3}p + \tfrac{2}{3}p = p.$$
Note that $\sum_{y \in \mathcal{Y}} P_Y(y) = 1$, as it should.
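As a sanity check of Example 5.8 and Remark 5.7 (my own sketch), the output distribution can be obtained by multiplying the input row vector by the transition matrix.

```python
import numpy as np

p = 0.1
# BE_p transition matrix, columns ordered (0, 1, e)
W = np.array([[1 - p, 0.0, p],
              [0.0, 1 - p, p]])
P_X = np.array([1/3, 2/3])        # input distribution as a row vector

P_Y = P_X @ W                     # Remark 5.7: P_Y = P_X W
print(P_Y)                        # [(1-p)/3, 2(1-p)/3, p] = [0.3, 0.6, 0.1]
assert np.isclose(P_Y.sum(), 1.0)
```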

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 121 / 157 Section 5.2: The Shannon capacity

Definition 5.9.

Let $W$ be a channel and $P_X$ be a probability distribution on $\mathcal{X}$. The mutual information between the input $X$ and the output $Y$ of the channel for the input distribution $P_X$ is

$$I(X;Y) = D(P_{XY} \| P_X \times P_Y) \qquad (5.1)$$

$$= \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} P_{XY}(xy) \log \frac{P_{XY}(xy)}{P_X(x)P_Y(y)} = \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} P_X(x)W(y|x) \log \frac{P_X(x)W(y|x)}{P_X(x)P_Y(y)}.$$

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 122 / 157 Shannon capacity definition Definition 5.10 (Shannon capacity). Let W be a channel. The Shannon capacity of W is

$$C(W) := \max_{P_X} I(X;Y),$$
i.e., the maximum mutual information between the input and the output.

Remark 5.11. Recall from (2.2) that

$$I(X;Y) = H(Y) - H(Y|X) \quad \text{(better for calculations?)} \qquad (5.2)$$
$$I(X;Y) = H(X) - H(X|Y) \quad \text{(better for thinking?)} \qquad (5.3)$$

(5.3) is the 'amount by which uncertainty about $X$ is reduced on learning $Y$'.

Maximising over PX gives us a free chance to optimize over the distribution we send through the channel.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 123 / 157 Remarks

The significance of C(W ) comes from Shannon’s noisy channel coding theorem, Theorem 5.25. This connects Shannon capacity to the optimal rate of communication achieved by using the channel many times. Before giving this theorem in the next section, we explore some properties of the Shannon capacity. In general there is no closed formula for the Shannon capacity. It can be explicitly computed only in some very special cases; we will see a few such examples below. For general channels, the capacity can only be evaluated numerically. Information inequalities are useful both in estimating the Shannon capacity and in evaluating it when possible.
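As the remark above notes, capacity usually has to be evaluated numerically. For a channel with a binary input alphabet the maximisation is over a single parameter, so a simple grid search suffices; the sketch below (my own illustration, not an algorithm from the notes; the function names are invented) computes $I(X;Y)$ for each candidate input distribution and takes the maximum. For larger input alphabets one would typically use a dedicated algorithm such as Blahut-Arimoto.

```python
import numpy as np

def mutual_information(P_X, W):
    """I(X;Y) in bits for input distribution P_X and channel matrix W."""
    P_XY = P_X[:, None] * W                    # joint distribution
    P_Y = P_X @ W                              # output marginal
    mask = P_XY > 0
    return np.sum(P_XY[mask] * np.log2(P_XY[mask] /
                                       (P_X[:, None] * P_Y[None, :])[mask]))

def capacity_binary_input(W, grid=10001):
    """Grid search over P_X = (q, 1-q) for a channel with two input symbols."""
    qs = np.linspace(0.0, 1.0, grid)
    return max(mutual_information(np.array([q, 1 - q]), W) for q in qs)

p = 0.1
BS_p = np.array([[1 - p, p], [p, 1 - p]])
BE_p = np.array([[1 - p, 0.0, p], [0.0, 1 - p, p]])
print(capacity_binary_input(BS_p))   # approx 0.531 = 1 - H(0.1, 0.9)
print(capacity_binary_input(BE_p))   # approx 0.9   = 1 - p
```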

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 124 / 157 Capacity bounds

Remark 5.12. A trivial upper bound on the capacity can be obtained from (5.2), which yields (since $H(Y|X)$ is non-negative)

$$I(X;Y) \le H(Y) \le \log |\mathcal{Y}|$$

for any input distribution $P_X$, and hence $C(W) \le \log |\mathcal{Y}|$. It can be shown in a similar way using (5.3) that $I(X;Y) \le H(X) \le \log |\mathcal{X}|$, and hence

$$C(W) \le \log |\mathcal{X}|.$$

Conversely, the mutual information for any fixed input distribution gives a lower bound on the capacity, by definition:

$$C(W) \ge I(X;Y) \quad \text{for a given } P_X.$$

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 125 / 157 Section 5.3: Shannon capacity examples

Example 5.13.

Let $W = BS_p$ be the binary symmetric channel, Example 5.3. Note that the rows of the transition matrix are permutations of each other, and hence

$$H(Y|X=0) = H(Y|X=1) = H(p, 1-p).$$

Using (5.2), we have, for any input distribution $P_X$,

$$I(X;Y) = H(Y) - P_X(0)H(Y|X=0) - P_X(1)H(Y|X=1) = H(Y) - H(p, 1-p).$$

Only the first term depends on $P_X$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 126 / 157 Example: Binary Symmetric Channel (cont.)

Example 5.13.

Choosing $P_X$ to be the uniform input distribution, we get $P_Y(0) = P_Y(1) = 1/2$, and hence $H(Y) = 1$, which is the maximum value possible. We deduce that
$$C(BS_p) = 1 - H(p, 1-p).$$

$C(BS_p)$ takes its maximum value 1 for $p = 0$ and $p = 1$, i.e., when there is no error ($p = 0$) and when the error is trivially correctible ($p = 1$).

In general, $C(BS_p)$ is symmetric in $p$ (same value for $p$ and $1-p$): could flip every bit if $p \ge 1/2$.

$C(BS_p) = 0$ if and only if $p = 1/2$, in which case the output distribution of the channel is independent of the input signal.
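The closed form is easy to tabulate; the short sketch below (my own illustration) evaluates $C(BS_p) = 1 - H(p, 1-p)$ for a few values of $p$, showing the symmetry about $p = 1/2$ and the zero at $p = 1/2$.

```python
import math

def binary_entropy(p):
    """H(p, 1-p) in bits, with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(p, 1 - binary_entropy(p))   # C(BS_p); symmetric about p = 1/2, zero at 1/2
```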

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 127 / 157 Weakly symmetric channels

The above argument to determine C(BSp) used only some symmetry properties of the channel. It can be generalised to a large class of channels with the same symmetry properties:

Definition 5.14 (Weakly symmetric channel).

We say that a channel $W$ is weakly symmetric if

1 the rows of the transition matrix are permutations of each other, and

2 there exists a constant $c > 0$ such that all column sums of the transition matrix are equal to $c$ or to zero.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 128 / 157 Capacity of weakly symmetric channels

Proposition 5.15.

For a weakly symmetric channel W

$$C(W) = \log(\#\{\text{columns with non-zero sum}\}) - H(Y|X=x),$$

for any $x \in \mathcal{X}$.

Proof. As before, the $H(Y|X=x)$ are all the same (the entropy of a row).

Hence $H(Y|X)$ is fixed, and does not depend on $P_X$. The uniform distribution on the input $X$ gives a uniform distribution on the outputs $Y$ corresponding to non-zero column sums: by property 2, each such output has probability $c/|\mathcal{X}|$, while outputs with zero column sum have probability 0, so $H(Y)$ attains its maximum possible value $\log(\#\{\text{columns with non-zero sum}\})$. (See this from the 'row vector product' interpretation of Remark 5.7?)

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 129 / 157 Example: Binary erasure channels

The binary erasure channel $BE_p$ satisfies the first property in Definition 5.14. However, in general it is not weakly symmetric, as the column sums are $1-p$, $1-p$, $2p$, which are not all the same unless $p = 1/3$ (the extreme cases $p = 0$ and $p = 1$, used below, are also weakly symmetric). Nevertheless, we can still give an explicit formula for its capacity using a convexity property of the Shannon capacity, which we prove below.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 130 / 157 Convex combinations of distributions

Let $P_1, \ldots, P_r$ be probability distributions on the same finite set $\mathcal{X}$. A convex combination of these probability distributions with weights $t_1, \ldots, t_r \ge 0$, $t_1 + \cdots + t_r = 1$, is given by

$$t_1 P_1 + t_2 P_2 + \cdots + t_r P_r,$$

which again is a probability distribution on $\mathcal{X}$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 131 / 157 Relative entropy of convex combinations

Proposition 5.16.

The relative entropy is a jointly convex function of its arguments, i.e.

$$D(t_1 P_1 + t_2 P_2 + \cdots + t_r P_r \,\|\, t_1 Q_1 + t_2 Q_2 + \cdots + t_r Q_r) \le t_1 D(P_1 \| Q_1) + \cdots + t_r D(P_r \| Q_r).$$

Proof. For any $x$, the log-sum inequality (Corollary 2.10) gives:

$$\left(\sum_{i=1}^r t_i P_i(x)\right) \log \frac{\sum_{i=1}^r t_i P_i(x)}{\sum_{i=1}^r t_i Q_i(x)} \;\le\; \sum_{i=1}^r t_i P_i(x) \log \frac{t_i P_i(x)}{t_i Q_i(x)},$$

and the result follows by summing over $x$.
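Proposition 5.16 is also easy to check numerically; the sketch below (my own illustration, not part of the notes) verifies the joint-convexity inequality on randomly generated pairs of distributions for $r = 2$.

```python
import numpy as np

rng = np.random.default_rng(1)

def D(P, Q):
    """Relative entropy D(P||Q) in bits (assumes Q > 0 wherever P > 0)."""
    mask = P > 0
    return np.sum(P[mask] * np.log2(P[mask] / Q[mask]))

def random_dist(k):
    v = rng.random(k)
    return v / v.sum()

# Check Proposition 5.16 numerically on random pairs of distributions (r = 2)
k, t = 4, 0.3
for _ in range(1000):
    P1, P2, Q1, Q2 = (random_dist(k) for _ in range(4))
    lhs = D(t * P1 + (1 - t) * P2, t * Q1 + (1 - t) * Q2)
    rhs = t * D(P1, Q1) + (1 - t) * D(P2, Q2)
    assert lhs <= rhs + 1e-12
print("joint convexity verified on 1000 random examples")
```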

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 132 / 157 Convex combinations of channels

Definition 5.17. We can form a convex combination of channels:

$$t_1 W_1 + \cdots + t_n W_n.$$

This convex combination is again a well-defined channel.

The interpretation of $t_1 W_1 + \cdots + t_n W_n$ is that we use the channel $W_i$ with probability $t_i$.

Proposition 5.18.

The Shannon capacity is a convex function of the channel:

$$C(t_1 W_1 + \cdots + t_n W_n) \le t_1 C(W_1) + \cdots + t_n C(W_n).$$

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 133 / 157 Convexity of Shannon capacity

Proof. The joint probability is $P_{XY}(xy) = P_X(x) \sum_i t_i W_i(y|x)$. Similarly $P_Y(y) = \sum_{i=1}^n t_i \sum_u P_X(u) W_i(y|u) = \sum_{i=1}^n t_i P_{Y,i}(y)$. Hence in (5.1), using Proposition 5.16:

$$I(X;Y) = D(P_{XY} \| P_X \times P_Y) = D\!\left(\sum_{i=1}^n t_i\, P_X W_i \,\Big\|\, \sum_{i=1}^n t_i\, P_X \times P_{Y,i}\right) \le \sum_{i=1}^n t_i\, D(P_X W_i \| P_X \times P_{Y,i}).$$

Each term $D(P_X W_i \| P_X \times P_{Y,i})$ is the mutual information for the channel $W_i$ with input distribution $P_X$, so is at most $C(W_i)$. Maximising over $P_X$ gives the result.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 134 / 157 Binary erasure channel lower bound

Example 5.19.

First, we derive a lower bound for the capacity $C(BE_p)$ of the binary erasure channel.

Choose the particular input distribution $P_X(0) = P_X(1) = 1/2$. The marginal distribution on the output is then
$$P_Y(0) = P_Y(1) = \frac{1-p}{2}, \qquad P_Y(e) = p.$$
The joint distribution on input and output is
$$P_{XY}(0,0) = P_{XY}(1,1) = \frac{1-p}{2}, \qquad P_{XY}(0,e) = P_{XY}(1,e) = \frac{p}{2}.$$
From these distributions we can compute the entropies $H(X) = 1$, $H(Y) = H(p, 1-p) + 1 - p$, and $H(XY) = 1 + H(p, 1-p)$. Hence
$$C(BE_p) \ge I(X;Y) = H(X) + H(Y) - H(XY) = 1 - p.$$

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 135 / 157 Binary erasure channel upper bound

Example 5.19. To give an upper bound, we decompose the channel as

$$BE_p = (1-p)\,BE_0 + p\,BE_1,$$

where $BE_0$ and $BE_1$ are the binary erasure channels with parameters $p = 0$ and $p = 1$, respectively.

Note that both $BE_0$ and $BE_1$ are weakly symmetric, and hence we can compute their Shannon capacities using Proposition 5.15 as

$$C(BE_0) = \log 2 - H(1, 0, 0) = 1,$$

$$C(BE_1) = \log 1 - H(0, 0, 1) = 0,$$

since the number of columns with non-zero sum is 2 for $BE_0$ and 1 for $BE_1$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 136 / 157 Binary erasure channel upper bound

Example 5.19. Hence, by the convexity of the Shannon capacity (Proposition 5.18) we have

$$C(BE_p) \le (1-p)\,C(BE_0) + p\,C(BE_1) = 1 - p.$$

Combining upper and lower bound we find

$$C(BE_p) = 1 - p.$$
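The decomposition and the resulting capacity are easy to verify numerically (my own sketch, using the fact from Example 5.19 that the uniform input achieves the maximum):

```python
import numpy as np

def BE(p):
    """Binary erasure channel BE_p, columns ordered (0, 1, e)."""
    return np.array([[1 - p, 0.0, p],
                     [0.0, 1 - p, p]])

def mutual_information(P_X, W):
    """I(X;Y) in bits for input row vector P_X and channel matrix W."""
    P_XY = P_X[:, None] * W
    P_Y = P_X @ W
    mask = P_XY > 0
    return np.sum(P_XY[mask] * np.log2(P_XY[mask] /
                                       (P_X[:, None] * P_Y[None, :])[mask]))

p = 0.3
# the convex decomposition BE_p = (1-p) BE_0 + p BE_1 of Example 5.19
assert np.allclose(BE(p), (1 - p) * BE(0.0) + p * BE(1.0))

# the uniform input achieves I(X;Y) = 1 - p, matching C(BE_p)
print(mutual_information(np.array([0.5, 0.5]), BE(p)))   # 0.7 = 1 - p
```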

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 137 / 157 Section 5.4: Memoryless channels and block coding

We will use the same channel many times, and assume the channel acts independently and in the same way on consecutive inputs. Then $n$ uses of the channel $W$ can be described as a new channel $W^{\otimes n}$ with input alphabet $\mathcal{X}^n$ and output alphabet $\mathcal{Y}^n$, given by

$$W^{\otimes n}(y_1 \ldots y_n | x_1 \ldots x_n) = W(y_1|x_1) \cdots W(y_n|x_n),$$

where each $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$. Such a channel is called memoryless or i.i.d. We call $n$ the blocklength. The idea is not to transmit all the possible messages; we use a smaller set of codewords, hoping errors will be reduced.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 138 / 157 Memoryless channel example

Example 5.20.

Let $W = BS_p$ be a binary symmetric channel with parameter $p$. Then
$$W^{\otimes 3}(000|001) = W(0|0)W(0|0)W(0|1) = p(1-p)^2,$$
$$W^{\otimes 3}(111|001) = W(1|0)W(1|0)W(1|1) = p^2(1-p),$$
$$W^{\otimes 3}(010|010) = W(0|0)W(1|1)W(0|0) = (1-p)^3, \quad \text{etc.}$$
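A product-channel probability is just a product of per-symbol transition probabilities; this little sketch (my own, not from the notes) reproduces the numbers in Example 5.20.

```python
def BS(p):
    """BS_p transition probabilities as a nested dict: W[x][y]."""
    return {0: {0: 1 - p, 1: p}, 1: {0: p, 1: 1 - p}}

def product_channel(W, ys, xs):
    """W^{⊗n}(y_1...y_n | x_1...x_n) for a memoryless channel W."""
    prob = 1.0
    for x, y in zip(xs, ys):
        prob *= W[x][y]
    return prob

p = 0.1
W = BS(p)
print(product_channel(W, [0, 0, 0], [0, 0, 1]))   # p(1-p)^2 = 0.081
print(product_channel(W, [1, 1, 1], [0, 0, 1]))   # p^2(1-p) = 0.009
print(product_channel(W, [0, 1, 0], [0, 1, 0]))   # (1-p)^3  = 0.729
```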

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 139 / 157 Block channel coding and (transmission) rate Definition 5.21.

Let $W$ be a channel. A block code $C_n$ for $n$ uses of the channel is a triple $C_n = (M_n, f_n, g_n)$, where

$M_n$ is the size of the code (number of possible messages transmitted); $f_n : \{1, \ldots, M_n\} \to \mathcal{X}^n$ is the encoding function; $g_n : \mathcal{Y}^n \to \{1, \ldots, M_n\}$ is the decoding function.

Definition 5.22 (Rate).

The rate of a block code $C_n$ with $M_n$ codewords taken from the binary alphabet $\{0,1\}^n$ is defined as

$$R(C_n) := \frac{1}{n} \log M_n.$$

In general, if we have $M_n$ codewords from $\mathcal{X}^n$, the rate is $\log M_n / (n \log |\mathcal{X}|)$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 140 / 157 Rate example

Example 5.23. Suppose the blocklength is $n = 10$ and we use the BSC, Example 5.3. If we use all $1024 = 2^{10}$ possible input words, the rate is 1. If we only use $32 = 2^5$ of them, the rate is $1/2$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 141 / 157 Error probabilities

Definition 5.24.

For each message $i \in \{1, \ldots, M_n\}$, define the success probability $p_{s,i}(M_n, f_n, g_n)$ as

$$p_{s,i}(M_n, f_n, g_n) := \sum_{y^n \in \mathcal{Y}^n : g_n(y^n) = i} W^{\otimes n}(y^n | f_n(i))$$

(the probability of decoding the message as $i$ when the codeword $f_n(i)$ was sent). The probability of erroneous decoding of $i$ is then

$$p_{e,i}(M_n, f_n, g_n) := 1 - p_{s,i}(M_n, f_n, g_n) = \sum_{y^n \in \mathcal{Y}^n : g_n(y^n) \neq i} W^{\otimes n}(y^n | f_n(i)).$$

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 142 / 157 Error probabilities (cont.)

Definition 5.24.

Error probability $p_{e,i}$ is defined for each message. Combine into an overall value either as:

1 The average error probability of the code,
$$p_{e,\mathrm{av}}(M_n, f_n, g_n) := \frac{1}{M_n} \sum_{i=1}^{M_n} p_{e,i}(M_n, f_n, g_n);$$

2 The maximum error probability,
$$p_{e,\max}(M_n, f_n, g_n) := \max_{i \in \{1,\ldots,M_n\}} p_{e,i}(M_n, f_n, g_n).$$

Of course, we have $p_{e,\mathrm{av}}(M_n, f_n, g_n) \le p_{e,\max}(M_n, f_n, g_n)$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 143 / 157 Balancing the rate and error probability

To use the channel as efficiently as possible, we should maximise transmission rate R while keeping the error probability as small as possible. Clearly we expect a tradeoff between these two quantities. The fundamental theorem in channel coding, due to Shannon, states that the best achievable rate is equal to the Shannon capacity of the channel.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 144 / 157 Shannon’s noisy channel coding theorem Theorem 5.25 (Shannon’s noisy channel coding theorem).

Let $W$ be a channel with Shannon capacity $C(W)$.

Direct part: For any rate $R < C(W)$, there exists a sequence of block codes $(M_n, f_n, g_n)$, $n \in \mathbb{N}$, such that
$$\lim_{n\to\infty} \frac{1}{n} \log M_n \ge R \quad \text{and} \quad \lim_{n\to\infty} p_{e,\max}(M_n, f_n, g_n) = 0.$$

Strong converse part: For any rate $R > C(W)$ and any sequence of block codes $(M_n, f_n, g_n)$, $n \in \mathbb{N}$, such that $\lim_{n\to\infty} \frac{1}{n} \log M_n \ge R$, we have
$$\lim_{n\to\infty} p_{e,\max}(M_n, f_n, g_n) = 1.$$

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 145 / 157 Remark 5.26. The above formulation of the channel coding theorem (Theorem 5.25) is very similar to the lossy source coding theorem (Theorem 4.11). However, there is also a significant difference between the two. The proof of the direct part of Theorem 4.11 is constructive, i.e., the proof not only gives the existence of a sequence of good codes, but also a construction of one. The proof of Theorem 5.25 is different. It shows that the average of the error probability over a large number of random codes is small, and hence there has to exist at least one code with small error probability. This argument only proves the existence and doesn't provide a construction of a good code.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 146 / 157 Motivation for noisy channel theorem Recall Example 2.13, the noisy typewriter. Example 5.27.

Consider $X$ uniformly distributed on $\mathcal{X} = \{0, \ldots, 3\}$. Define $Z$ uniformly distributed on $\{0,1\}$, independent of $X$, and $Y = X + Z \bmod 4$. Can write the joint probability distribution of $X$ and $Y$:

P_XY     X = 0   X = 1   X = 2   X = 3
Y = 0    1/8     0       0       1/8
Y = 1    1/8     1/8     0       0
Y = 2    0       1/8     1/8     0
Y = 3    0       0       1/8     1/8

Recall that $I(X;Y) = H(X) + H(Y) - H(XY) = 2 + 2 - 3 = 1$; half of the maximum value ($\log |\mathcal{X}| = 2$). Can transmit with $p_e = 0$: only use $X = 0$ and $X = 2$. If $Y \in \{0,1\}$, we know $X = 0$; if $Y \in \{2,3\}$, we know $X = 2$. Rate $R = \log 2 / \log 4 = 1/2$.
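To check the arithmetic (my own sketch), one can compute $H(X)$, $H(Y)$ and $H(XY)$ directly from the joint table and hence obtain $I(X;Y) = 1$.

```python
import numpy as np

# joint distribution P_XY of the noisy typewriter: rows Y = 0..3, columns X = 0..3
P = np.array([[1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]]) / 8.0

def H(p):
    """Entropy in bits of the distribution given by the array p."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_X, H_Y, H_XY = H(P.sum(axis=0)), H(P.sum(axis=1)), H(P)
print(H_X, H_Y, H_XY, H_X + H_Y - H_XY)   # 2.0 2.0 3.0 and I(X;Y) = 1.0
```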

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 147 / 157 Motivation for noisy channel coding theorem

Noisy typewriter motivates intuitive argument for Theorem 5.25. It relies on the existence of jointly typical sequences. Idea is that all channels are roughly like noisy typewriters.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 148 / 157 Jointly typical sequences

Definition 5.28 (Jointly entropy-δ typical sequence).

Let $\mathcal{X}$, $\mathcal{Y}$ be finite sets with joint probability distribution $P_{XY}$. A sequence pair $(x^n, y^n)$, $x^n \in \mathcal{X}^n$, $y^n \in \mathcal{Y}^n$, is jointly entropy $\delta$-typical (for $P_{XY}$) if, for some $\delta \ge 0$:

$$\left| \frac{\sum_{i=1}^n -\log P_X(x_i)}{n} - H(X) \right| \le \delta, \qquad \left| \frac{\sum_{i=1}^n -\log P_Y(y_i)}{n} - H(Y) \right| \le \delta, \quad \text{and} \quad \left| \frac{\sum_{i=1}^n -\log P_{XY}(x_i y_i)}{n} - H(XY) \right| \le \delta.$$

We denote the set of jointly entropy $\delta$-typical sequences as $A_{n,\delta}$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 149 / 157 Joint asymptotic equipartition property

We have the equivalent to the asymptotic equipartition property. Theorem 5.29.

Let $\mathcal{X}$, $\mathcal{Y}$ be finite sets with joint probability distribution $P_{XY}$.

Let $A_{n,\delta}$ be the corresponding jointly entropy $\delta$-typical set. Then for $(X_i, Y_i)$ i.i.d. $\sim P_{XY}$:

1 $\lim_{n\to\infty} P\big((X^n, Y^n) \in A_{n,\delta}\big) = 1$.

2 $|A_{n,\delta}| \le 2^{n(H(XY)+\delta)}$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 150 / 157 Heuristic argument for Theorem 5.25

Remark 5.30. From Theorem 5.29, we can assume the random codeword and output are jointly typical. There are $\simeq 2^{nH(Y)}$ typical outputs, each roughly equally likely.

Each input $x$ corresponds to a set of outputs $A_x$. Here $|A_x| \simeq 2^{nH(Y|X)}$, again each output roughly equally likely. This suggests we can use a subset of $M_n = 2^{nH(Y)}/2^{nH(Y|X)}$ inputs, so that the $A_x$ are disjoint but cover the set of possible outputs.

This scheme has rate $\log M_n / n = H(Y) - H(Y|X) = I(X;Y)$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 151 / 157 Section 5.5: Rate/error trade-off example

We illustrate the trade-off between the rate and the error probability with the following example.

Example 5.31. Assume that our set of messages is $\{A, B, C, D\}$. We want to find a fixed-length binary code for this set and the binary symmetric channel $W = BS_p$ with error probability $p = 1/10$.

In other words, we want to communicate $M_2 = 4$ different messages, i.e. two bits of information, with two uses of the channel.

So we need to choose encoding and decoding functions $f_2$ and $g_2$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 152 / 157 Example 5.31. The trivial encoding is:

$$f_2 : A \mapsto 00, \quad B \mapsto 01, \quad C \mapsto 10, \quad D \mapsto 11,$$

and define $g_2$ to be the inverse of $f_2$. The probability of correctly decoding $A$ is then

$$p_{s,A} = (1-p)^2 = 0.81,$$

and hence the error probability is

$$p_{e,A} = 1 - (1-p)^2 = 0.19.$$

Similarly, the error probability for each message is also 0.19.

And so $p_{e,\mathrm{av}}(M_2, f_2, g_2) = p_{e,\max}(M_2, f_2, g_2) = 0.19$. The rate of this code is $R(C_2) = \frac{1}{2}\log 4 = 1$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 153 / 157 Example 5.31. The error probability 0.19 is rather high; in order to decrease it, we can use longer codewords. For instance, we can construct the following code with codelength n = 5; i.e., we use the channel 5 times to send each of our M2 = 4 messages:

$$f_5 : A \mapsto 00000, \quad B \mapsto 00111, \quad C \mapsto 11010, \quad D \mapsto 11101. \qquad (5.4)$$

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 154 / 157 Maximum likelihood decoding

Example 5.31. In general we use maximum likelihood decoding: given an output sequence $y^5 \in \{0,1\}^5$, we decode it as the message for which the probability of the output being $y^5$ is maximal.

For instance, we decode 00010 as $g_5(00010) := A$, since the message $A$ yields the sequence 00010 at the output with a strictly higher probability than any other message. It is possible to have more than one maximum; e.g., for the output sequence 01100 both $A$ and $D$ maximise the likelihood of getting 01100 at the output. In such cases, we decode the output sequence as one of the maximum likelihood messages.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 155 / 157 Minimum distance decoding

Example 5.31. Note that any two codewords in (5.4) differ from each other in at least 3 bits. Hence, any output sequence $y^5$ that differs from a codeword $f_5(i)$ in at most one bit is strictly more likely to result from the message $i$ than from any other message, and hence will be decoded as $i$. For every $i \in \{A, B, C, D\}$, we define a "ball of Hamming distance $d$" around it as

$$B_d(i) := \{y^n \in \mathcal{Y}^n : |\{k : y_k \neq (f_n(i))_k\}| \le d\}.$$

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 156 / 157 Example 5.31.

Setting $d = 1$ and $n = 5$, then every sequence in $B_1(i)$ is decoded as $i$, and hence

$$p_{s,i}(M_5, f_5, g_5) \ge \sum_{y^5 \in B_1(i)} W^{\otimes 5}(y^5 | f_5(i)) = (1-p)^5 + 5p(1-p)^4 = 0.91854. \qquad (5.5)$$

Since (5.5) is true for all $i$, we have

$$p_{e,i}(M_5, f_5, g_5) = 1 - p_{s,i}(M_5, f_5, g_5) \le 1 - 0.91854 = 0.08146,$$

which is less than half of the error probability of the trivial code. The price to pay for this smaller error probability is a smaller rate; indeed, the rate of this code is
$$R(C_5) = \frac{1}{5} \log 4 = 0.4.$$
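As a final check (my own sketch, not part of the notes), one can enumerate all $2^5$ output sequences, decode each by minimum Hamming distance (equivalent to maximum likelihood here since $p < 1/2$), and compute the exact error probability of each message for the code (5.4); each comes out at or below the bound 0.08146 from (5.5).

```python
from itertools import product

p = 0.1
codewords = {"A": "00000", "B": "00111", "C": "11010", "D": "11101"}   # code (5.4)

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def channel_prob(y, x):
    """W^{⊗5}(y|x) for the binary symmetric channel BS_p."""
    d = hamming(y, x)
    return (p ** d) * ((1 - p) ** (len(x) - d))

def decode(y):
    """Minimum-distance decoding (ties broken arbitrarily)."""
    return min(codewords, key=lambda m: hamming(y, codewords[m]))

for message, x in codewords.items():
    p_success = sum(channel_prob("".join(y), x)
                    for y in product("01", repeat=5)
                    if decode("".join(y)) == message)
    print(message, 1 - p_success)   # each error probability is at most 0.08146

print("rate:", 2 / 5)   # log 4 / 5 = 0.4
```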

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 157 / 157