COMP2610/6261 - Lecture 21: Computing Capacities, Coding in Practice, & Review

Mark Reid and Aditya Menon

Research School of Computer Science
The Australian National University

October 14, 2014

1 Computing Capacities

2 Good Codes vs. Practical Codes

3 Linear Codes

4 Coding: Review

Mark Reid and Aditya Menon (ANU) COMP2610/6261 - Information Theory Oct. 14, 2014 3 / 21 1 Compute the I (X ; Y ) for a general pX = (p0, p1)

2 Determine which choice of pX maximises I (X ; Y ) 3 Use that maximising value to determine C

Binary Symmetric Channel: We first consider the binary symmetric channel with X = Y = 0, 1 A A { } and flip probability f . It has transition matrix

1 f f  Q = − f 1 f −

Computing Capacities

Recall the definition of capacity for a channel Q with inputs X and A ouputs Y A C = max I (X ; Y ) pX How do we actually calculate this quantity?

Mark Reid and Aditya Menon (ANU) COMP2610/6261 - Information Theory Oct. 14, 2014 4 / 21 Binary Symmetric Channel: We first consider the binary symmetric channel with X = Y = 0, 1 A A { } and flip probability f . It has transition matrix

1 f f  Q = − f 1 f −

Computing Capacities

Recall the definition of capacity for a channel Q with inputs X and A ouputs Y A C = max I (X ; Y ) pX How do we actually calculate this quantity?

1 Compute the mutual information I (X ; Y ) for a general pX = (p0, p1)

2 Determine which choice of pX maximises I (X ; Y ) 3 Use that maximising value to determine C

Mark Reid and Aditya Menon (ANU) COMP2610/6261 - Information Theory Oct. 14, 2014 4 / 21 Computing Capacities

Recall the definition of capacity for a channel Q with inputs X and A ouputs Y A C = max I (X ; Y ) pX How do we actually calculate this quantity?

1 Compute the mutual information I (X ; Y ) for a general pX = (p0, p1)

2 Determine which choice of pX maximises I (X ; Y ) 3 Use that maximising value to determine C

Binary Symmetric Channel: We first consider the binary symmetric channel with X = Y = 0, 1 A A { } and flip probability f . It has transition matrix

1 f f  Q = − f 1 f −
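As an illustration of this three-step recipe (not part of the original slides; the helper names below are my own), a short Python sketch can compute I(X; Y) directly from the transition matrix Q and an input distribution pX, and then approximate C by maximising over a grid of binary input distributions:

```python
import numpy as np

def mutual_information(Q, pX):
    """I(X;Y) in bits, given channel matrix Q[y, x] = P(y|x) and input distribution pX."""
    Q, pX = np.asarray(Q, dtype=float), np.asarray(pX, dtype=float)
    joint = Q * pX                          # joint[y, x] = P(y|x) P(x)
    pY = joint.sum(axis=1)                  # output distribution
    outer = np.outer(pY, pX)                # product of marginals P(y) P(x)
    ratio = np.where(joint > 0, joint / np.where(outer > 0, outer, 1.0), 1.0)
    return float(np.sum(np.where(joint > 0, joint * np.log2(ratio), 0.0)))

def capacity_binary_input(Q, grid_size=10001):
    """Approximate C = max_{pX} I(X;Y) by brute force over pX = (1-p, p)."""
    return max(mutual_information(Q, [1 - p, p])
               for p in np.linspace(0, 1, grid_size))

f = 0.15
Q_bsc = [[1 - f, f],
         [f, 1 - f]]
print(capacity_binary_input(Q_bsc))         # ~0.39 bits, attained at pX = (0.5, 0.5)
```

For the BSC with f = 0.15 this brute-force estimate agrees with the analytic answer of about 0.39 bits derived below.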

Computing Capacities: Binary Symmetric Channel - Step 1

The mutual information can be expressed as I(X; Y) = H(Y) - H(Y|X). We therefore need to compute two terms, H(Y) and H(Y|X), so we need the distributions P(y) and P(y|x).

Computing H(Y):

    P(y = 0) = (1-f)P(x = 0) + f P(x = 1) = (1-f)p0 + f p1
    P(y = 1) = (1-f)P(x = 1) + f P(x = 0) = f p0 + (1-f)p1

In general, q := pY = Q pX, so the above calculation is just

    q = pY = [ 1-f    f  ] [ p0 ]
             [  f    1-f ] [ p1 ]

Using H2(q) = -q log2 q - (1-q) log2(1-q) and letting q = q1 = P(y = 1), we see that the entropy is

    H(Y) = H2(q1) = H2(f p0 + (1-f)p1)

Computing Capacities: Binary Symmetric Channel - Step 1 (continued)

Computing H(Y|X): Since P(y|x) is described by the matrix Q, we have

    H(Y|x = 0) = H2(P(y = 1 | x = 0)) = H2(Q_{1,0}) = H2(f)

and similarly,

    H(Y|x = 1) = H2(P(y = 0 | x = 1)) = H2(Q_{0,1}) = H2(f)

So

    H(Y|X) = Σ_x H(Y|x)P(x) = Σ_x H2(f)P(x) = H2(f) Σ_x P(x) = H2(f)

Computing I(X; Y): Putting it all together gives

    I(X; Y) = H(Y) - H(Y|X) = H2(f p0 + (1-f)p1) - H2(f)

Computing Capacities: Binary Symmetric Channel - Steps 2 and 3

Binary Symmetric Channel (BSC) with flip probability f ∈ [0, 1]:

    I(X; Y) = H2(f p0 + (1-f)p1) - H2(f)

Examples:

BSC (f = 0) and pX = (0.5, 0.5): I(X; Y) = H2(0.5) - H2(0) = 1

BSC (f = 0.15) and pX = (0.5, 0.5): I(X; Y) = H2(0.5) - H2(0.15) ≈ 0.39

BSC (f = 0.15) and pX = (0.9, 0.1): I(X; Y) = H2(0.22) - H2(0.15) ≈ 0.15

Maximise I(X; Y): Since I(X; Y) is symmetric in p1, it is maximised when p0 = p1 = 0.5, in which case C = 0.39 for the BSC with f = 0.15.

[Figure 9.2 of MacKay: the mutual information I(X; Y) for a binary symmetric channel with f = 0.15, as a function of the input distribution.]
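The following short Python sketch (my own illustration; the function names are not from the slides) evaluates this expression and reproduces the three example values as well as the capacity at the uniform input:

```python
import numpy as np

def H2(q):
    """Binary entropy in bits, with the convention 0 log 0 = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

def bsc_mutual_information(f, p1):
    """I(X;Y) = H2(f*p0 + (1-f)*p1) - H2(f) for a BSC with flip probability f and P(x=1) = p1."""
    p0 = 1.0 - p1
    return H2(f * p0 + (1 - f) * p1) - H2(f)

print(bsc_mutual_information(0.00, 0.5))   # 1.0
print(bsc_mutual_information(0.15, 0.5))   # ~0.39
print(bsc_mutual_information(0.15, 0.1))   # ~0.15

# Steps 2 and 3: maximise over p1; the maximum sits at the uniform input p1 = 0.5.
capacity = max(bsc_mutual_information(0.15, p) for p in np.linspace(0, 1, 1001))
print(capacity)                            # ~0.39
```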

Symmetry

Symmetric Channel

A channel with inputs A_X, outputs A_Y, and matrix Q is symmetric if A_Y can be partitioned into subsets Y' ⊆ A_Y so that each sub-matrix Q' containing only the rows for outputs in Y' has:

Columns that are all permutations of each other

Rows that are all permutations of each other

Examples:

A_X = A_Y = {0, 1}:

    Q = [ 0.9  0.1 ]
        [ 0.1  0.9 ]

Symmetric; subsets: {0, 1}

A_X = {0, 1}, A_Y = {0, ?, 1}:

    Q = [ 0.7  0.1 ]
        [ 0.2  0.2 ]
        [ 0.1  0.7 ]

Symmetric; subsets: {0, 1}, {?}

A_X = A_Y = {0, 1}:

    Q = [ 0.9  0 ]
        [ 0.1  1 ]

Not symmetric

(Linear codes achieve rates at the capacity of symmetric channels.)
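A quick way to sanity-check this definition (my own sketch, not from the slides): given Q with rows indexed by outputs and a candidate partition of the outputs, test whether every sub-matrix has rows that are permutations of one another and columns that are permutations of one another.

```python
import numpy as np

def is_symmetric(Q, partition):
    """Check the symmetry definition: Q[y, x] = P(y|x); partition lists output indices."""
    Q = np.asarray(Q, dtype=float)
    for outputs in partition:
        sub = Q[outputs, :]                           # sub-matrix for this output subset
        row_multisets = {tuple(sorted(row)) for row in sub}
        col_multisets = {tuple(sorted(col)) for col in sub.T}
        if len(row_multisets) > 1 or len(col_multisets) > 1:
            return False                              # rows or columns are not permutations
    return True

# BSC: symmetric with the single subset {0, 1}
print(is_symmetric([[0.9, 0.1], [0.1, 0.9]], [[0, 1]]))                    # True
# Erasure-style channel: symmetric with subsets {0, 1} and {?}
print(is_symmetric([[0.7, 0.1], [0.2, 0.2], [0.1, 0.7]], [[0, 2], [1]]))   # True
# Z-like channel: not symmetric under either partition of its outputs
print(is_symmetric([[0.9, 0.0], [0.1, 1.0]], [[0, 1]]))                    # False
print(is_symmetric([[0.9, 0.0], [0.1, 1.0]], [[0], [1]]))                  # False
```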

Computing Capacities in General

Symmetric Channels: If the channel is symmetric, the maximising pX (and thus the capacity) can be obtained via the uniform distribution over inputs (Exercise 10.10).

Non-Symmetric Channels: What can we do if the channel is not symmetric?

We can still calculate I(X; Y) for a general input distribution pX

Finding the maximising pX is more challenging

Example (Z Channel with P(y = 0 | x = 1) = f):

    H(Y) = H2(P(y = 1)) = H2(0 · p0 + (1-f)p1) = H2((1-f)p1)

    H(Y|X) = p0 H2(P(y = 1 | x = 0)) + p1 H2(P(y = 0 | x = 1))
           = p0 H2(0) + p1 H2(f)
           = p1 H2(f)      (since H2(0) = 0)

    I(X; Y) = H2((1-f)p1) - p1 H2(f)

[Figure 9.3 of MacKay: the mutual information I(X; Y) for a Z channel with f = 0.15, as a function of the input distribution.]
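As a numerical illustration (my own sketch; it just re-defines a binary-entropy helper so it stands alone), evaluating this expression on a grid of input distributions locates the maximum for f = 0.15 at p1 ≈ 0.445 with I(X; Y) ≈ 0.685 bits, matching the analytic result derived below:

```python
import numpy as np

def H2(q):
    """Binary entropy in bits (0 log 0 = 0)."""
    return 0.0 if q <= 0 or q >= 1 else -q * np.log2(q) - (1 - q) * np.log2(1 - q)

def z_channel_mi(f, p1):
    """I(X;Y) = H2((1-f)*p1) - p1*H2(f) for the Z channel with P(y=0|x=1) = f."""
    return H2((1 - f) * p1) - p1 * H2(f)

f = 0.15
grid = np.linspace(0.0, 1.0, 100001)
values = [z_channel_mi(f, p) for p in grid]
best = int(np.argmax(values))
print(grid[best], values[best])   # ~0.445, ~0.685: the optimal input is not uniform
```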

Computing Capacities in General

What to do once we know I(X; Y)?

I(X; Y) is concave in pX ⇒ single maximum

For binary inputs, just look for stationary points (this does not extend to |A_X| > 2), i.e., where d/dp I(X; Y) = 0 for pX = (1-p, p)

Example (Z Channel): We showed earlier that I(X; Y) = H2((1-f)p) - p H2(f), so we solve

    d/dp I(X; Y) = 0  ⇔  (1-f) log2 [ (1 - (1-f)p) / ((1-f)p) ] - H2(f) = 0
                      ⇔  (1 - (1-f)p) / ((1-f)p) = 2^{H2(f)/(1-f)}
                      ⇔  p = (1/(1-f)) / (1 + 2^{H2(f)/(1-f)})

For f = 0.15, we get p = (1/0.85) / (1 + 2^{0.61/0.85}) ≈ 0.44, and so C = H2(0.38) - 0.44 H2(0.15) ≈ 0.685.

Homework: Show that d/dp H2(p) = log2 ((1-p)/p).
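A quick numeric check of the closed form for p and C above (again a hedged sketch of mine, re-defining the binary entropy helper so the snippet stands alone):

```python
import numpy as np

def H2(q):
    """Binary entropy in bits (0 log 0 = 0)."""
    return 0.0 if q <= 0 or q >= 1 else -q * np.log2(q) - (1 - q) * np.log2(1 - q)

f = 0.15
p_star = (1 / (1 - f)) / (1 + 2 ** (H2(f) / (1 - f)))   # stationary point of I(X;Y)
C = H2((1 - f) * p_star) - p_star * H2(f)
print(p_star, C)   # ~0.445, ~0.685
```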

Theory and Practice

    "The difference between theory and practice is that, in theory, there is no difference between theory and practice but, in practice, there is." (Jan L. A. van de Snepscheut)

Theory vs. Practice:

The noisy-channel coding theorem (NCCT) tells us that good block codes exist for any noisy channel (in fact, most random codes are good)

However, the theorem is non-constructive: it does not tell us how to create practical codes for a given noisy channel

The construction of practical codes that achieve rates up to the capacity for general channels is ongoing research

Types of Codes

When we talk about types of codes we will be referring to schemes for creating (N, K) codes for any size N. MacKay makes the following distinctions:

Very Good: Can achieve arbitrarily small error at any rate up to the channel capacity (i.e., for any ε > 0 a very good coding scheme can make a code with K/N = C and pBM < ε)

Good: Can achieve arbitrarily small error at rates up to some maximum rate strictly less than the channel capacity (i.e., for any ε > 0 a good coding scheme can make a code with K/N = Rmax < C and pBM < ε)

Bad: Cannot achieve arbitrarily small error, or can only achieve it if the rate goes to zero (i.e., either pBM → a > 0 as N → ∞, or pBM → 0 ⇒ K/N → 0)

Practical: Can be encoded and decoded in time that is polynomial in the block length N.

Random Codes

During the discussion of the Noisy-Channel Coding Theorem we saw how to construct very good random codes via expurgation and typical set decoding.

Properties:
Very Good: Rates up to C are achievable with arbitrarily small error
Construction is easy
Not Practical:
    The 2^K codewords have no structure and must be "memorised"
    Typical set decoding is expensive
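As a rough illustration of why this is not practical (my own toy sketch, using plain minimum-distance decoding rather than MacKay's expurgation and typical-set decoding): the 2^K codewords must be stored explicitly, and naive decoding scans all of them.

```python
import random

def random_codebook(N, K, seed=0):
    """2**K random N-bit codewords -- all of them must be 'memorised'."""
    rng = random.Random(seed)
    return [tuple(rng.randint(0, 1) for _ in range(N)) for _ in range(2 ** K)]

def bsc(x, f, rng):
    """Send x through a binary symmetric channel with flip probability f."""
    return tuple(bit ^ (rng.random() < f) for bit in x)

def decode(y, codebook):
    """For a BSC with f < 0.5, maximum likelihood = minimum Hamming distance;
    done naively, this is a scan over all 2**K codewords."""
    return min(range(len(codebook)),
               key=lambda s: sum(a != b for a, b in zip(codebook[s], y)))

N, K, f = 24, 8, 0.1
codebook = random_codebook(N, K)     # 2**8 = 256 codewords kept in memory
rng = random.Random(1)
s = 42
y = bsc(codebook[s], f, rng)
print("sent:", s, " decoded:", decode(y, codebook))
```

Both the storage and the decoding loop grow like 2^K, which is why structure (e.g., linearity) is needed in practice.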

1 Computing Capacities

2 Good Codes vs. Practical Codes

3 Linear Codes

4 Coding: Review

Linear Codes

(N, K) Block Code
An (N, K) block code is a list of S = 2^K codewords {x^(1), ..., x^(S)}, each of length N. A signal s ∈ {1, 2, ..., 2^K} is encoded as x^(s).

Linear (N, K) Block Code
A linear (N, K) block code is an (N, K) block code where s is first represented as a K-bit binary vector s ∈ {0, 1}^K and then encoded via multiplication by an N × K binary matrix G⊤ to form t = G⊤ s modulo 2.

Here linear means all S = 2^K messages can be obtained by "adding" different combinations of the K codewords t_i = G⊤ e_i, where e_i is the K-bit string with a single 1 in position i.

Example: Suppose (N, K) = (7, 4). To send s = 3, first create s = 0011 and send t = G⊤ s = G⊤(e_0 + e_1) = G⊤ e_0 + G⊤ e_1 = t_0 + t_1, where e_0 = 0001 and e_1 = 0010.

Linear Codes: Examples

(7,4) Hamming Code:

       [ 1 0 0 0 ]
       [ 0 1 0 0 ]
       [ 0 0 1 0 ]
G⊤ =   [ 0 0 0 1 ]
       [ 1 1 1 0 ]
       [ 0 1 1 1 ]
       [ 1 0 1 1 ]

For s = 0011,  G⊤ s (mod 2) = [0 0 1 1 1 0 0]⊤

(6,3) Repetition Code:

       [ 1 0 0 ]
       [ 0 1 0 ]
G⊤ =   [ 0 0 1 ]
       [ 1 0 0 ]
       [ 0 1 0 ]
       [ 0 0 1 ]

For s = 010,  G⊤ s (mod 2) = [0 1 0 0 1 0]⊤
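A minimal sketch that reproduces both codewords above, and the linearity claim from the previous slide, using the generator matrices as reconstructed here (only the helper names are mine):

```python
# Generator matrices G^T (N x K), as shown above.
G_HAMMING_7_4 = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
]
G_REPETITION_6_3 = [  # the 3-bit message is simply sent twice
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]

def encode(Gt, s):
    """t = G^T s (mod 2): one modulo-2 inner product per row of G^T."""
    return [sum(g * b for g, b in zip(row, s)) % 2 for row in Gt]

print(encode(G_HAMMING_7_4, [0, 0, 1, 1]))    # [0, 0, 1, 1, 1, 0, 0]
print(encode(G_REPETITION_6_3, [0, 1, 0]))    # [0, 1, 0, 0, 1, 0]

# Linearity: the s = 3 example decomposes as t = t_0 + t_1 (mod 2).
e0, e1 = [0, 0, 0, 1], [0, 0, 1, 0]
t0, t1 = encode(G_HAMMING_7_4, e0), encode(G_HAMMING_7_4, e1)
assert [(a + b) % 2 for a, b in zip(t0, t1)] == encode(G_HAMMING_7_4, [0, 0, 1, 1])
```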

Decoding

We can construct codes with a relatively simple encoding, but how do we decode them? That is, given the input distribution and channel model Q, how do we find the posterior distribution over x given that we received y?

Simple! Just compute

    P(x | y) = P(y | x) P(x) / Σ_{x′ ∈ C} P(y | x′) P(x′)

But:
the number of codewords x ∈ C is 2^K, so, naively, the sum is expensive
linear codes provide structure that the above method doesn't exploit
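A minimal sketch of this naive computation for a BSC with flip probability f, assuming a uniform prior over codewords: P(y | x) = f^d (1 − f)^(N−d), where d is the Hamming distance between x and y, and the normaliser is the sum over all 2^K codewords that the slide warns about.

```python
def bsc_likelihood(y, x, f):
    """P(y | x) for a BSC with flip probability f."""
    d = sum(a != b for a, b in zip(x, y))            # Hamming distance
    return (f ** d) * ((1 - f) ** (len(x) - d))

def posterior(y, codebook, f):
    """P(x | y) for every codeword x, by brute force: the normalising sum
    runs over all 2**K codewords, which is what makes this expensive."""
    joint = [bsc_likelihood(y, x, f) for x in codebook]   # uniform P(x) cancels
    Z = sum(joint)
    return [j / Z for j in joint]

# Toy example: the two codewords of the length-3 repetition code.
codebook = [(0, 0, 0), (1, 1, 1)]
print(posterior((1, 0, 1), codebook, f=0.1))   # ~[0.10, 0.90]: (1,1,1) most probable
```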

Types of Linear Code

Many commonly used codes are linear:
Repetition Codes: e.g., 0 → 000; 1 → 111
Convolutional Codes: Linear coding plus bit shifts
Concatenated Codes: Two or more levels of error correction
Hamming Codes: Parity checking (see the sketch below)
Low-Density Parity-Check Codes: Semi-random construction
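As a small illustration of the "parity checking" behind Hamming codes (a sketch of mine: the parity-check matrix H = [P | I_3] below is the standard companion of the (7,4) generator shown earlier, not something printed on the slide): the syndrome Hr (mod 2) is zero for codewords, and after a single bit flip it equals the column of H at the flipped position, so that error can be located and corrected.

```python
# Parity-check matrix for the (7,4) Hamming code generated earlier:
# H = [P | I_3], where the rows of P are the three parity rows of G^T.
H = [
    [1, 1, 1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]

def syndrome(H, r):
    """H r (mod 2): all zeros exactly when r is a codeword."""
    return [sum(h * b for h, b in zip(row, r)) % 2 for row in H]

def correct_single_flip(H, r):
    """If the syndrome matches a column of H, flip the bit at that position."""
    s = syndrome(H, r)
    if any(s):
        columns = [[row[j] for row in H] for j in range(len(r))]
        j = columns.index(s)
        r = r[:j] + [1 - r[j]] + r[j + 1:]
    return r

t = [0, 0, 1, 1, 1, 0, 0]              # the codeword from the earlier example
r = t.copy(); r[4] ^= 1                # one bit flipped by the channel
print(syndrome(H, t))                  # [0, 0, 0]
print(correct_single_flip(H, r) == t)  # True
```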

An NCCT can be proved for linear codes (i.e., “there exists a linear code” replacing “there exists a code”) but the proof is still non-constructive.

Practical linear codes:
Use very large block sizes N
Based on semi-random code constructions
Apply probabilistic decoding techniques
Used in wireless and satellite communication

1 Computing Capacities

2 Good Codes vs. Practical Codes

3 Linear Codes

4 Coding: Review

Coding: Review

The Big Picture

[This slide reproduces the opening page of MacKay, Chapter 9, "Communication over a Noisy Channel", including Figure 9.1 ("The big picture"): Source → Compressor (source coding) → Encoder (channel coding) → Noisy channel → Decoder → Decompressor. The page motivates measuring the information transmitted through a noisy channel by the mutual information between input and output, and previews the capacity C(Q).]

Source Coding (for Compression)                       Channel Coding (for Reliability)
Shrink sequences                                      Protect sequences
Identify and remove redundancy                        Add known form of redundancy
Size limited by entropy                               Rate limited by capacity
Source Coding Theorems (Block & Variable Length)      Noisy-Channel Coding Theorem
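To connect "rate limited by capacity" back to the capacity calculation earlier in the lecture, here is a small sketch (assuming the standard BSC result C = 1 − H_2(f)) for a channel that flips bits with probability f = 0.1:

```python
from math import log2

def H2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_capacity(f):
    """Capacity of a binary symmetric channel: C = 1 - H2(f)."""
    return 1.0 - H2(f)

print(bsc_capacity(0.1))   # ~0.531 bits per channel use
print(bsc_capacity(0.5))   # 0.0 -- the output is independent of the input
```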

Why "Entropy"?

You should call it entropy... no one really knows what entropy really is, so in a debate you will always have the advantage.

— J. von Neumann to C. Shannon

Thanks!
