Information Theory

Lecturer: Cong Ling Office: 815 2 Lectures

Introduction 11 Separation Theorem – 131 1 Introduction - 1 12 Polar codes - 143 Entropy Properties Continuous Variables 2 Entropy I - 19 13 - 170 3 Entropy II – 32 14 Gaussian Channel - 185 Lossless Source Coding 15 Parallel Channels – 198 4 Theorem - 47 Lossy Source Coding 5 Algorithms - 60 16 Rate Distortion Theory - 211 Network 6 Data Processing Theorem - 76 17 NIT I - 241 7 Typical Sets - 86 18 NIT II - 256 8 Channel Capacity - 98 Revision 9 Joint Typicality - 112 19 Revision - 274 10 Coding Theorem - 123 20 3 Claude Shannon

• C. E. Shannon, ”A mathematical theory of communication,” Bell System Technical Journal, 1948. • Two fundamental questions in communication theory: • Ultimate limit on data compression – entropy • Ultimate transmission rate of communication 1916 - 2001 – channel capacity • Almost all important topics in information theory were initiated by Shannon 4 Origin of Information Theory

• Common wisdom in 1940s: – It is impossible to send information error-free at a positive rate – Error control by using retransmission: rate → 0 if error-free • Still in use today – ARQ (automatic repeat request) in TCP/IP computer networking • Shannon showed reliable communication is possible for all rates below channel capacity • As long as source entropy is less than channel capacity, asymptotically error-free communication can be achieved • And anything can be represented in bits – Rise of digital information technology 5 Relationship to Other Fields 6 Course Objectives

• In this course we will (focus on communication theory): – Define what we mean by information. – Show how we can compress the information in a source to its theoretically minimum value and show the tradeoff between data compression and distortion. – Prove the channel coding theorem and derive the information capacity of different channels. – Generalize from point-to-point to network information theory. 7 Relevance to Practice

• Information theory suggests means of achieving ultimate limits of communication – Unfortunately, these theoretically optimum schemes are computationally impractical – So some say “little info, much theory” (wrong) • Today, information theory offers useful guidelines to design of communication systems – Polar code (achieves channel capacity) – CDMA (has a higher capacity than FDMA/TDMA) – Channel-coding approach to source coding (duality) – Network coding (goes beyond routing) 8 Books/Reading

Book of the course: • Elements of Information Theoryby T M Cover & J A Thomas, Wiley, £39 for 2 nd ed. 2006, or £14 for 1 st ed. 1991 (Amazon)

Free references • Information Theory and Network Codingby R. W. Yeung, Springer http://iest2.ie.cuhk.edu.hk/~whyeung/book2/

• Information Theory, Inference, and Learning Algorithms by D MacKay, Cambridge University Press http://www.inference.phy.cam.ac.uk/mackay/itila/

• Lecture Notes on Network Information Theory by A. E. Gamal and Y.-H. Kim, (Book is published by Cambridge University Press) http://arxiv.org/abs/1001.3404

• C. E. Shannon, ”A mathematical theory of communication,” Bell System Technical Journal, Vol. 27, pp. 379–423, 623–656, July, October, 1948. 9 Other Information

• Course webpage: http://www.commsp.ee.ic.ac.uk/~cling • Assessment: Exam only – no coursework. • Students are encouraged to do the problems in problem sheets. • Background knowledge – Mathematics – Elementary probability • Needs intellectual maturity – Doing problems is not enough; spend some time thinking 10 Notation • Vectors and matrices – v=vector, V=matrix • Scalar random variables – x = R.V, x = specific value, X = alphabet • Random column vector of length N – x = R.V, x = specific value, XN = alphabet

– xi and xi are particular vector elements • Ranges – a:b denotes the range a, a+1, …, b • Cardinality – |X| = the number of elements in set X 11 Discrete Random Variables

• A x takes a value x from

the alphabet X with probability px(x). The vector of probabilities is px . Examples: 1 1 1 1 1 1 X = [1;2;3;4;5;6] , px = [ /6; /6; /6; /6; /6; /6]

pX is a “probability mass vector” “english text” X = [a; b;…, y; z; ]

px = [0.058; 0.013; …; 0.016; 0.0007; 0.193]

Note: we normally drop the subscript from px if unambiguous 12 Expected Values

• If g(x) is a function defined on X then

often write E for E Egx ()x = ∑ pxgx ()() X x∈X Examples: 1 1 1 1 1 1 X = [1;2;3;4;5;6] , px = [ /6; /6; /6; /6; /6; /6] E x = 3.5 = µ E x 2 =15.17 = σ 2 + µ 2 E sin(0.1x ) = 0.338

This is the “entropy” of X E − log2 ( p(x )) = 2.58 13 Shannon Information Content • The Shannon Information Content of an outcome with probability p is –log 2p • Shannon’s contribution – a statistical view – Messages, noisy channels are random – Pre-Shannon era: deterministic approach (Fourier…)

• Example 1: Coin tossing – X = [Head; Tail], p = [½; ½], SIC = [1; 1] bits • Example 2: Is it my birthday ? 364 1 – X = [No; Yes], p = [ /365 ; /365 ], SIC = [0.004; 8.512] bits

Unlikely outcomes give more information 14 Minesweeper

• Where is the bomb ? • 16 possibilities – needs 4 bits to specify

Guess Prob SIC

15 1. No /16 0.093 bits 14 2. No /15 0.100 bits 13 3. No /14 0.107 bits 1 M 4. Yes /13 3.700 bits Total 4.000 bits

SIC = – log 2 p 15 Minesweeper

• Where is the bomb ? • 16 possibilities – needs 4 bits to specify

Guess Prob SIC

15 1. No /16 0.093 bits 16 Entropy

xH()x =− E log( x x 2 px ()) =− ∑ pxpx ()log2 () x∈x – H(x) = the average Shannon Information Content of x – H(x) = the average information gained by knowing its value – the average number of “yes-no” questions needed to find x is in the range [H(x), H(x)+1) – H(x) = the amount of uncertainty before we know its value

We use log( x) ≡ log 2(x) and measure H(x) in bits

– if you use log e it is measured in nats

– 1 nat = log 2(e) bits = 1.44 bits ln( x) d log x log e • log (x) = 2 = 2 2 ln( 2) dx x

H(X ) depends only on the probability vector pX not on the alphabet X, so we can write H(pX) 17 Entropy Examples

1 (1) Bernoulli Random Variable 0.8 0.6

X = [0;1] , px = [1–p; p] H(p) 0.4

H (x ) = −(1− p)log(1− p) − plog p 0.2

Very common – we write H(p) to 0 0 0.2 0.4 0.6 0.8 1 mean H([1 –p; p]). p H ( p) = − 1( − p)log(1− p) − plog p Maximum is when p=1/2 H′( p) = log(1− p) − log p H ′′( p) = − p−1 1( − p)−1 loge (2) Four Coloured Shapes 1 1 X = [ ò; ¢; ©; s], px = [½; ¼; /8; /8]

H (x ) = H (px ) = ∑− log( p(x)) p(x) 1 1 =1×½ + 2×¼ + 3× 8 + 3× 8 =1.75 bits 18 Comments on Entropy

• Entropy plays a central role in information theory • Origin in thermodynamics – S = k ln Ω, k: Boltzman’s constant, Ω: number of microstates – The second law: entropy of an isolated system is non- decreasing • Shannon entropy – Agrees with intuition: additive, monotonic, continuous – Logarithmic measure could be derived from an axiomatic approach (Shannon 1948) 19 Lecture 2

• Joint and – Chain rule • – If x and y are correlated, their mutual information is the average information that y gives about x • E.g. Communication Channel: x transmitted but y received • It is the amount of information transmitted through the channel • Jensen’s Inequality 20 Joint and Conditional Entropy

p(x,y) y=0 y=1 Joint Entropy: H(x,y) x=0 ½ ¼ x=1 0 ¼ H (x ,y ) = E − log p(x ,y ) = −½ log½ −¼log¼ − 0log 0 −¼log¼ =1.5 bits Note: 0 log 0 = 0

Conditional Entropy: H(y |x) p(y|x) y=0 y=1 2 1 x=0 /3 /3 H y x(y x | )= E − log p ( | ) x=1 0 1 = − ∑ pxy( , )log pyx ( | ) Note: rows sum to 1 x, y 2 1 =−½ log3 − ¼log 3 − 0log 0 − ¼ log1 = 0.689 bits 21 Conditional Entropy – View 1

Additional Entropy: p(x, y) y=0 y=1 p(x) p(|) xy y x = p (,) ÷ p () x=0 ½ ¼ ¾ H y x(|)y x = E − log(|) p x=1 0 ¼ ¼ =−E{}{}log(,) x px y −− E log() p

1113 1 =−=H(, xx y ) HH () (24 , ,0, 4 ) − H ( 44 , ) = 0.689 bits H(Y|X) is the average additional information in Y when you know X H(x ,y)

H(y |x)

H(x) 22 Conditional Entropy – View 2

p(x, y) y=0 y=1 H(y | x=x ) p(x) Average Row Entropy: x=0 ½ ¼ H(1/3) ¾ x=1 0 ¼ H(1) ¼

H (y | x ) = E − log p(y | x ) = ∑− p(x, y)log p(y | x) x, y = ∑− p(x) p(y | x)log p(y | x) = ∑p(x) ∑− p(y | x)log p(y | x) x, y x∈XY y ∈

1 = ∑ p(x)H ()y | x = x = ¾ × H ( 3) + ¼ × H (0) = 0.689 bits x∈X

Take a weighted average of the entropy of each row using p(x) as weight 23 Chain Rules

• Probabilities p(x ,y ,z ) = p(z | x ,y ) p(y | x ) p(x ) • Entropy H (x ,y ,z ) = H (z | x ,y ) + H (y | x ) + H (x ) n H (x :1 n ) = ∑ H (x i | x :1i−1) i=1

The log in the definition of entropy converts products of probability into sums of entropy 24 Mutual Information Mutual information is the average amount of information that you get about x from observing the value of y - Or the reduction in the uncertainty of x due to knowledge of y I(x ;y ) = H (x ) − H (x | y ) = H (x ) + H (y ) − H (x ,y )

Information in x Information in x when you already know y H(x ,y) Mutual information is symmetrical H(x |y) I(x ;y) H(y |x)

I(x ;y ) = I(y ;x ) H(x) H(y) Use “;” to avoid ambiguities between I(x;y,z) and I(x,y;z) 25 Mutual Information Example p(x,y) y=0 y=1 • If you try to guess y you have a 50% x=0 ½ ¼ chance of being correct. x=1 0 ¼ • However, what if you know x ? – Best guess: choose y = x – If x =0 (p=0.75 ) then 66% correct prob – If x =1 (p=0.25 ) then 100% correct prob

H(x , y)=1.5 – Overall 75% correct probability

I(x ;y ) = H (x ) − H (x | y ) H(x |y) I(x ; y) H(y | x) =0.5 =0.311 =0.689 = H (x ) + H (y ) − H (x ,y ) H (x ) = 0.811, H (y ) = ,1 H (x ,y ) =1.5 H(x)=0.811 H(y)=1 I(x ;y ) = 0.311 26 Conditional Mutual Information

Conditional Mutual Information I(x ;y | z ) = H (x | z ) − H (x | y ,z ) = H (x | z ) + H (y | z ) − H (x ,y | z )

Note : Z conditioning applies to both X and Y

Chain Rule for Mutual Information

I(x 1,x 2 ,x 3;y ) = I(x 1;y ) + I(x 2 ;y | x 1) + I(x 3;y | x 1,x 2 ) n I(x :1 n ;y ) = ∑ I(x i ;y | x :1i−1) i=1 H(x ,y) 27

H(x |y) I(x ;y) H(y | x)

Review/Preview H(x) H(y)

• Entropy: H (x ) = ∑-log2 ( p(x)) p(x) = E − log2 ( pX (x)) x∈X – Positive and bounded 0≤H (x ) ≤ log |X | ♦ • Chain Rule: H (x ,y ) = H (x ) + H (y | x ) ≤ H (x ) + H (y ) ♦ – Conditioning reduces entropy H (y | x ) ≤ H (y ) ♦ • Mutual Information: I(y ;x ) = H (y ) − H (y | x ) = H (x ) + H (y ) − H (x ,y ) – Positive and Symmetrical I(x ;y ) = I(y ;x ) ≥ 0 ♦ – x and y independent ⇔ H (x ,y ) = H (y ) + H (x ) ♦ ⇔ I(x ;y ) = 0 ♦ = inequalities not yet proved 28 Convex & Concave functions f(x) is strictly convex over (a,b) if f (λu + (1− λ)v) < λf (u) + (1− λ) f (v) ∀u ≠ v ∈(a,b), 0 < λ <1

– every chord of f(x) lies above f(x) Con cave is like this – f(x) is concave ⇔ –f(x) is convex • Examples – Strictly Convex: x2 , x4 , e x , x log x [ x ≥ 0] – Strictly Concave: log x, x [x ≥ 0] – Convex and Concave: x

d 2 f – Test: > 0 ∀ x ∈ ( a , b ) ⇒ f(x) is strictly convex dx2

“convex” (not strictly) uses “ ≤” in definition and “ ≥” in test 29 Jensen’s Inequality

Jensen’s Inequality : (a) f(x) convex ⇒ Ef (x) ≥ f(Ex)

(b) f(x) strictly convex ⇒ Ef (x) > f(Ex) unless x constant Proof by induction on |X| These sum to 1

– |X|=1: E f (x ) = f (E x ) = f (x1) k k −1 pi – |X|= k: E f (x ) = ∑ pi f (xi ) =pk f (xk ) + 1( − pk )∑ f (xi ) i=1 i=1 1− pk  k −1 p   i  ≥ pk f (xk ) + 1( − pk ) f xi Assume JI is true ∑1− p   i=1 k  for |X|= k–1  k −1 p  ≥ f  p x + 1( − p ) i x  = f ()E x  k k k ∑ i   i=1 1− pk 

Follows from the definition of convexity for two-mass-point distribution 30 Jensen’s Inequality Example

Mnemonic example: f(x) = x2 : strictly convex

X = [–1; +1] 4 p = [½; ½] 3 E x = 0 2 f(E x)=0 f(x)

E f(x) = 1 > f(E x) 1

0 -2 -1 0 1 2 x 31 Summary H(x ,y) • Chain Rule: H (x ,y ) = H (y | x ) + H (x ) H(x |y) H(y |x) • Conditional Entropy: H(x) H(y) H (y | x ) = H (x ,y ) − H (x ) = ∑ p(x)H (y | x) – Conditioning reduces entropy x∈X H (y | x ) ≤ H (y ) ♦ • Mutual Information I(x ;y ) = H (x ) − H (x | y ) ≤ H (x ) ♦ – In communications, mutual information is the amount of information transmitted through a noisy channel • Jensen’s Inequality f(x) convex ⇒ Ef (x) ≥ f(Ex)

♦ = inequalities not yet proved 32 Lecture 3

• Relative Entropy – A measure of how different two probability mass vectors are • Information Inequality and its consequences – Relative Entropy is always positive • Mutual information is positive • Uniform bound • Conditioning and correlation reduce entropy • Stochastic Processes – – Markov Processes 33 Relative Entropy

Relative Entropy or Kullback-Leibler Divergence between two probability mass vectors p and q p(x) p(x ) D(p || q) = ∑ p(x)log = Ep log = Ep ()− log q(x ) − H (x ) x∈X q(x) q(x )

where Ep denotes an expectation performed using probabilities p D(p|| q) measures the “distance” between the probability mass functions p and q.

We must have pi=0 whenever qi=0 else D(p|| q)= ∞ Beware: D(p|| q) is not a true distance because: – (1) it is asymmetric between p, q and – (2) it does not satisfy the triangle inequality. 34 Relative Entropy Example

X = [1 2 3 4 5 6] T p = [1 1 1 1 1 1 ] ⇒ H (p) = 2.585 6 6 6 6 6 6 q = []1 1 1 1 1 1 ⇒ H (q) = 2.161 10 10 10 10 10 2

D(p || q) = Ep (−log qx ) − H (p) = 2.935 − 2.585 = 0.35

D(q || p) = Eq (−log px ) − H (q) = 2.585 − 2.161 = 0.424 35 Information Inequality

Information (Gibbs’) Inequality: D(p || q) ≥ 0 • Define A = {x : p(x) > 0} ⊆ X p(x) q(x) • Proof − D(p ||q) = −∑ p(x)log = ∑ p(x)log x∈A q(x) x∈A p(x)  q(x)      Jensen’s ≤ log∑ p(x)  = log∑q(x) ≤ log ∑q(x) = log1 = 0  x∈A p(x)   x∈A   x∈X  inequality

If D(p|| q)=0 : Since log( ) is strictly concave we have equality in the proof only if q(x)/ p(x), the argument of log , equals a constant.

But ∑ p ( x ) = ∑ q ( x ) = 1 so the constant must be 1 and p ≡ q x∈X x∈X 36 Information Inequality Corollaries

• Uniform distribution has highest entropy – Set q = [| X|–1, …, | X|–1]T giving H(q)=log| X| bits

D(p || q) = Ep {−logq(x )}− H (p) = log | X | −H (p) ≥ 0

• Mutual Information is non-negative p(x y , ) xI y y (;) x y x = HHH () + () − (,) = E log p( yx ) p ( ) =Dp(( y x y , )|| p ( )( p )) ≥ 0 with equality only if p(x,y) ≡ p(x)p(y) ⇔ x and y are independent. 37 More Corollaries

• Conditioning reduces entropy 0 ≤ I(x ;y ) = H (y ) − H (y | x ) ⇒ H (y | x ) ≤ H (y )

with equality only if x and y are independent.

• Independence Bound n n H (x :1 n ) = ∑ H (x i | x :1i−1) ≤ ∑ H (x i ) i=1 i=1

with equality only if all xi are independent.

E.g.: If all xi are identical H(x1: n) = H(x1) 38 Conditional Independence Bound

• Conditional Independence Bound n n H (x :1 n | y :1 n ) = ∑ H (x i | x :1 i−1,y :1 n ) ≤ ∑ H (x i | y i ) i=1 i=1 • Mutual Information Independence Bound

If all xi are independent or, by symmetry, if all yi are independent:

I(x :1 n ;y :1 n ) = H (x :1 n ) − H (x :1 n | y :1 n ) n n n ≥ ∑ H (x i ) − ∑ H (x i | y i ) = ∑ I(x i ;y i ) i=1 i=1 i=1

E.g.: If n=2 with xi i.i.d. Bernoulli ( p=0.5 ) and y1=x2 and y2=x1, then I(xi;yi)=0 but I(x1:2 ; y1:2 ) = 2 bits. 39 Stochastic Process

Stochastic Process {xi} = x1, x2, … often … Entropy: H ({x i}) = H (x 1) + H (x 2 | x 1) + = ∞ 1 Entropy Rate: H (X) = lim H (x :1 n ) if limit exists n→∞ n – Entropy rate estimates the additional entropy per new sample. – Gives a lower bound on number of code bits per sample. Examples: – Typewriter with m equally likely letters each time: H(X) =log m

– xi i.i.d. random variables: H (X) = H (x i ) 40 Stationary Process

Stochastic Process {xi} is stationary iff

p(x :1 n = a :1 n ) = p(x k+ :1( n) = a :1 n ) ∀k,n,ai ∈ X

If {xi} is stationary then H(X) exists and 1 H (X) = lim H (x :1 n ) = lim H (x n | x :1 n−1) n→∞ n n→∞ )a( )b(

Proof: 0 ≤ H (x n | x :1 n−1) ≤ H (x n | x :2 n−1) = H (x n−1 | x :1 n−2 ) (a) conditioning reduces entropy, (b) stationarity

Hence H(xn|x1: n-1) is positive, decreasing ⇒ tends to a limit, say b Hence 1 1 n H (x k | x :1 k−1) → b ⇒ H (x :1 n ) = ∑ H (x k | x :1 k−1) → b = H (X) n n k=1 41 Markov Process (Chain)

Discrete-valued stochastic process {xi} is

• Independent iff p(xn|x0: n–1)= p(xn)

• Markov iff p(xn|x0: n–1)= p(xn|xn–1)

– time-invariant iff p( xn=b| xn–1=a ) = pab indep of n – States

– Transition matrix: T = { tab } • Rows sum to 1: T1 = 1 where 1 is a vector of 1’s t12 T • pn = T pn–1 1 2 T t14 t • Stationary distribution: p$ = T p$ t13 24 t 3 34 4 t43 Independent Stochastic Process is easiest to deal with, Markov is next easiest 42 Stationary Markov Process

If a Markov process is a) irreducible : you can go from any state a to any b in a finite number of steps b) aperiodic : ∀ state a, the possible times to go from a to a have highest common factor = 1 then it has exactly one stationary distribution, p$. T T – p$ is the eigenvector of T with λ = 1 : T p$ = p$ n T ⋯ T T → 1p$ where 1 = [1 1 1] n→∞ – Initial distribution becomes irrelevant ( asymptotically T n T stationary ) (T ) p0$0$= p1p = p , ∀ p 0 43 Chess Board

H(pH(pH(p )=3.07129,)=3.10287,)=3.09141,)=1.58496,)=2.99553,)=3.0827,)=3.111,H(p )=0, H(p H(p H(pH(pH(p | | | ||p| p pppp )=2.30177)=2.23038)=2.20683)=2.54795)=1.58496)=2.24987)=2.09299)=0 6372485 1 58632741 47521630 Random Walk • Move ˜¯ ˘˙˚¸ equal prob T • p1 = [1 0 … 0]

– H(p1) = 0 1 T • p$ = /40 × [3 5 3 5 8 5 3 5 3]

– H(p$) = 3.0855

• HXH()= lim (x xn | n −1 ) n→∞

=−limpxx (nn ,−1 )log( pxx nn | − 1 ) =− pt $, iij , log( t ij , ) = 2.2365 n→∞ ∑ ∑ i, j 44 Chess Board

H(p )=0, H(p | p )=0 1 1 0 Random Walk • Move ˜¯ ˘˙˚¸ equal prob T • p1 = [1 0 … 0]

– H(p1) = 0 1 T • p$ = /40 × [3 5 3 5 8 5 3 5 3]

– H(p$) = 3.0855

HXH()= lim (x xn | n −1 ) • n→∞

=−limpxx (,nn−1 )log(| pxx nn − 1 ) =− pt $, iij , log() t ij , n→∞ ∑ ∑ i, j – H( X )= 2.2365 45 Chess Board Frames

H(p )=0, H(p | p )=0 H(p )=1.58496, H(p | p )=1.58496 H(p )=3.10287, H(p | p )=2.54795 H(p )=2.99553, H(p | p )=2.09299 1 1 0 2 2 1 3 3 2 4 4 3

H(p )=3.111, H(p | p )=2.30177 H(p )=3.07129, H(p | p )=2.20683 H(p )=3.09141, H(p | p )=2.24987 H(p )=3.0827, H(p | p )=2.23038 5 5 4 6 6 5 7 7 6 8 8 7 46 Summary p(x ) • Relative Entropy: D(p || q) = Ep log ≥ 0 – D(p|| q) = 0 iff p ≡ q q(x ) • Corollaries – Uniform Bound: Uniform p maximizes H(p) – I(x ; y) ≥ 0 ⇒ Conditioning reduces entropy n n – Indep bounds: H (x :1 n ) ≤ ∑ H (x i ) H (x :1 n | y :1 n ) ≤ ∑ H (x i | y i ) i=1 i=1 n I(x :1 n ;y :1 n ) ≥ ∑ I(x i ;y i ) if x i or y i are indep i=1 • Entropy Rate of stochastic process:

–{xi} stationary : H()X = lim H (x xn |1: n − 1 ) n→∞ –{xi} stationary Markov :

HH()X = (|x xnn−1 ) =∑ − ptt $,, iij log() ij , i, j 47 Lecture 4

• Source Coding Theorem – n i.i.d. random variables each with entropy H(X) can be compressed into more than nH(X) bits as n tends to infinity • Instantaneous Codes – Symbol-by-symbol coding – Uniquely decodable • Kraft Inequality – Constraint on the code length • Optimal Symbol Code lengths – Entropy Bound 48 Source Coding

• Source Code: C is a mapping X→D+ – X a random variable of the message –D+ = set of all finite length strings from D –D is often binary – e.g. {E, F, G} →{0,1} + : C(E)=0, C(F)=10, C(G)=11 • Extension: C+ is mapping X+ →D+ formed by

concatenating C(xi) without punctuation – e.g. C+(EFEE GE) =01000110 49 Desired Properties

• Non-singular: x1≠ x2 ⇒ C(x1) ≠ C(x2) – Unambiguous description of a single letter of X • Uniquely Decodable: C+ is non-singular – The sequence C+(x+) is unambiguous – A stronger condition – Any encoded string has only one possible source string producing it – However, one may have to examine the entire encoded string to determine even the first source symbol – One could use punctuation between two codewords but inefficient 50 Instantaneous Codes • Instantaneous (or Prefix ) Code – No codeword is a prefix of another – Can be decoded instantaneously without reference to future codewords • Instantaneous ⇒ Uniquely Decodable ⇒ Non- singular Examples: – C(E,F,G, H) = (0, 1, 00, 11) IU – C(E,F) = (0, 101) IU – C(E,F) = (1, 101) IU – C(E,F,G, H) = (00, 01, 10, 11) IU – C(E,F,G, H) = (0, 01, 011, 111) IU 51 Code Tree

Instantaneous code: C(E,F,G, H) = (00, 11, 100, 101) Form a D-ary tree where D = |D|

– D branches at each node 0 E – Each codeword is a leaf 0 1 G – Each node along the path to 0 a leaf is a prefix of the leaf 0 1 1 ⇒ can’t be a leaf itself H – Some leaves may be unused 1 F

111011000000 → FHGEE 52

Kraft Inequality (instantaneous codes)

• Limit on codeword lengths of instantaneous codes – Not all codewords can be too short X|| −li • Codeword lengths l1, l2, …, l|X| ⇒ ∑ 2 ≤1 i=1 • Label each node at depth l ¼ 0 –l ½ E with 2 1/ ¼ 8 0 • Each node equals the sum of 1 0 G ¼ all its leaves 1 0 1/ ½ 8 • Equality iff all leaves are utilised 1 1 H • Total code budget = 1 1 ¼ F Code 00 uses up ¼ of the budget 1 Code 100 uses up /8 of the budget Same argument works with D-ary tree 53

McMillan Inequality (uniquely decodable codes)

If uniquely decodable C has codeword lengths |X| −l l1, l2, …, l|X| , then ∑ D i ≤1 The same |X| i=1 Proof: Let −li S = ∑ D and M = maxli then for any N, i=1 ||XN |||| XXX || N −l  −()li1 + l i ++... l i + SDD=∑i  = ∑∑... ∑ 2 N = ∑ D− length{C (x)} N i=1  iii1 === 111 2 N x∈X NM NM NM = D−l | x : l = length{C + (x)}| ≤ re-orderD−l Dl sum= by 1total= NM length ∑ Sum over all sequences∑ of length∑N l =1 l =1 l =1 If S > 1 then SN > NM for some N. Hence S ≤ 1. max number of distinct sequences of length l Implication: uniquely decodable codes doesn’t offer further reduction of codeword lengths than instantaneous codes 54

McMillan Inequality (uniquely decodable codes)

If uniquely decodable C has codeword lengths |X| −l l1, l2, …, l|X| , then ∑ D i ≤1 The same |X| i=1 Proof: Let −li S = ∑ D and M = maxli then for any N, i=1 ||XN |||| XXX || N −l  −()li1 + l i ++... l i + SDD=∑i  = ∑∑... ∑ 2 N = ∑ D− length{C (x)} N i=1  iii1 === 111 2 N x∈X NM NM NM = ∑ D−l | x : l = length{C + (x)}| ≤ ∑ D−l Dl = ∑1 = NM l =1 l =1 l =1 If S > 1 then SN > NM for some N. Hence S ≤ 1.

Implication: uniquely decodable codes doesn’t offer further reduction of codeword lengths than instantaneous codes 55 How Short are Optimal Codes?

If l(x) = length( C(x)) then C is optimal if L=E l(x) is as small as possible. We want to minimize ∑ p ( x ) l ( x ) subject to x∈X 1. ∑ D−l(x) ≤1 x∈X 2. all the l(x) are integers

Simplified version: Ignore condition 2 and assume condition 1 is satisfied with equality.

less restrictive so lengths may be shorter than actually possible ⇒ lower bound 56

Optimal Codes (non-integer li)

|X| |X| p(x )l −li • Minimize ∑ i i subject to ∑ D =1 i=1 i=1 Use Lagrange multiplier: X|| X|| −li ∂J Define J = ∑ p(xi )li + λ∑ D and set = 0 i=1 i=1 ∂li

∂J −li −li = p(xi ) − λ ln(D)D = 0 ⇒ D = p(xi )/ λ ln(D) ∂li X|| also D−li = 1 ⇒ λ = /1 ln(D) ⇒ ∑ li = –log D(p(xi)) i=1

E− log2 ( p (x ) ) H (x ) x x E l()log()x =− ED () p = == H D () log2DD log 2 no uniquely decodable code can do better than this 57 Bounds on Optimal Code Length

Round up optimal code lengths: li = − logD p(xi )

• li are bound to satisfy the Kraft Inequality (since the optimum lengths do)

• For this choice, –log D(p(xi)) ≤ li ≤ –log D(p(xi)) + 1 • Average shortest length: (since we added <1 * HLH xD()x ≤ < D ()1 + to optimum values) • We can do better by encoding blocks of n symbols −1 − 1 − 1 − 1 nH x x Dn()x 1:≤ nEl () 1: n ≤ nH Dn () 1: + n

• If entropy rate of xi exists ( ⇐ xi is stationary process ) −1 − 1 xn HDnD()x 1: → H ()X ⇒ n E l ()1: nD → H () X Also known as source coding theorem 58 Block Coding Example

1 X = [ A;B], p = [0.9; 0.1] x 0.8 El

H (xi ) = .0 469 -1 n 0.6 Huffman coding: 0.4 sym AB 2 4 6 8 10 12 • n=1 n prob 0.9 0.1 n−1E l =1 code 0 1

• n=2 sym AA AB BA BB prob 0.81 0.09 0.09 0.01 n−1E l = 0.645 code 0 11 100 101

• n=3 sym AAA AAB … BBA BBB prob 0.729 0.081 … 0.009 0.001 n−1E l = 0.583 code 0 101 … 10010 10011 The extra 1 bit inefficiency becomes insignificant for large blocks 59 Summary

• McMillan Inequality for D-ary codes: X|| • any uniquely decodable C has ∑ D−li ≤1 i=1 • Any uniquely decodable code:

E l x()x ≥ H D () • Source coding theorem • Symbol-by-symbol encoding

H D (x ) ≤ E l(x ) ≤ H D (x ) +1

−1 • Block encoding n E l(x 1: n )→ H D ()X 60 Lecture 5

• Source Coding Algorithms • Huffman Coding • Lempel-Ziv Coding 61 Huffman Code

An optimal binary instantaneous code must satisfy:

1. p(xi ) > p(x j ) ⇒ li ≤ l j (else swap codewords) 2. The two longest codewords have the same length (else chop a bit off the longer codeword) 3. ∃ two longest codewords differing only in the last bit (else chop a bit off all of them) Huffman Code construction

1. Take the two smallest p(xi) and assign each a different last bit. Then merge into a single symbol. 2. Repeat step 1 until only one symbol remains Used in JPEG, MP3… 62 Huffman Code Example

X = [a, b, c, d, e], px = [0.25 0.25 0.2 0.15 0.15]

Read diagram backwards for codewords: C(X) = [01 10 11 000 001], L = 2.3, H(x) = 2.286 For D-ary code, first add extra zero-probability symbols until |X|–1 is a multiple of D–1 and then group D symbols at a time 63 Huffman Code is Optimal Instantaneous Code

Huffman traceback gives codes for progressively larger alphabets: 0 0 a 0.25 0.25 0.25 0.55 1.0 0 b 0.25 0.25 0.45 0.45 1 p2=[0.55 0.45], c 0.2 0.2 1 c2=[0 1], L2=1 1 d 0.15 0 0.3 0.3 p3=[0.45 0.3 0.25], e 0.15 1 c3=[1 00 01], L3=1.55 p4=[0.3 0.25 0.25 0.2], c4=[00 01 10 11], L4=2 p5=[0.25 0.25 0.2 0.15 0.15], c5=[01 10 11 000 001], L5=2.3

We want to show that all these codes are optimal including C5 64 Huffman Code is Optimal Instantaneous Code

Huffman traceback gives codes for progressively larger alphabets: p2=[0.55 0.45], c2=[0 1], L2=1 p3=[0.45 0.3 0.25], c3=[1 00 01], L3=1.55 p4=[0.3 0.25 0.25 0.2], c4=[00 01 10 11], L4=2 p5=[0.25 0.25 0.2 0.15 0.15], c5=[01 10 11 000 001], L5=2.3

We want to show that all these codes are optimal including C5 65 Huffman Optimality Proof

Suppose one of these codes is sub-optimal:

– ∃ m>2 with cm the first sub-optimal code (note c 2 is definitely optimal)

– An optimal c′m must have LC'm < L Cm

– Rearrange the symbols with longest codes in c ′m so the two lowest probs pi and pj differ only in the last digit (doesen’t change optimality)

– Merge xi and xj to create a new code c′m−1 as in Huffman procedure

– L C'm–1 =L C'm– pi– pj since identical except 1 bit shorter with prob pi+ p j

– But also L Cm –1 =L Cm – pi– pj hence L C'm–1 < L Cm –1 which contradicts assumption that cm is the first sub-optimal code

Hence, Huffman coding satisfies HLH xD()x ≤ < D ()1 + Note: Huffman is just one out of many possible optimal codes 66 Shannon-Fano Code

Fano code 1. Put probabilities in decreasing order 2. Split as close to 50-50 as possible; repeat with each half

H(x) = 2.81 bits a 0.20 0 00 0 b 0.19 0 010 1 L = 2.89 bits c 0.17 1 011 SF d 0.15 0 100 0 e 0.14 1 101 1 f 0.06 0 110 Not necessarily optimal: the 1 g 0.05 0 1110 best code for this p actually 1 h 0.04 1 1111 has L = 2.85 bits 67 Shannon versus Huffman

i−1 F= x x p x (),x pp () ≥≥≥ ()⋯ p () Shannon i∑ k =1 k 1 2 m

encoding:roundthenumberFi∈[0,1] to − log p (x i )  bits

H xD(x )≤ L SF ≤ H D ( ) + 1 (excercise)

px = .0[ 36 .0 34 .0 25 .0 05] ⇒ H (x ) = .1 78 bits

− log2 px = .1[ 47 .1 56 2 .4 32]

lS = − log2 px = [2 2 2 ]5

LS = .2 15 bits Huffman

l H = [1 2 3 3]

LH = 1.94 bits Individual codewords may be longer in Huffman than Shannon but not the average 68 Issues with Huffman Coding • Requires the probability distribution of the source – Must recompute entire code if any symbol probability changes • A block of N symbols needs |X|N pre-calculated probabilities • For many practical applications, however, the underlying probability distribution is unknown – Estimate the distribution • Arithmetic coding: extension of Shannon-Fano coding; can deal with large block lengths – Without the distribution • Universal coding : Lempel-Ziv coding 69 Universal Coding • Does not depend on the distribution of the source • Compression of an individual sequence • Run length coding – Runs of data are stored (e.g., in fax machines) Example: WWWWWWWWWBBWWWWWWWBBBBBBWW 9W2B7W6B2W • Lempel-Ziv coding – Generalization that takes advantage of runs of strings of characters (such as WWWWWWWWWBB ) – Adaptive dictionary compression algorithms – Asymptotically optimum: achieves the entropy rate for any stationary ergodic source 70 Lempel-Ziv Coding (LZ78)

Memorize previously occurring substrings in the input data – parse input into the shortest possible distinct ‘phrases’, i.e., each phrase is the shortest phrase not seen earlier – number the phrases starting from 1 (0 is the empty string) ABAA BA BAB BB AB … Look up a dictionary 12 _3_4__ 5_6_7 – each phrase consists of a previously occurring phrase (head) followed by an additional A or B (tail) – encoding: give location of head followed by the additional symbol for tail 0A0B1A2A4B2B1B… – decoder uses an identical dictionary

locations are underlined 71 Lempel-Ziv Example

Input = 110110101000100100010011011011011010100010010010110101010101110110101000101101010001001101010001001000100101001001111101010001001000100101001010101000100100010010100100101000100100010010100100011010001001000100101001001000100100010010100100010101000100100010010100100001001000100101001000100100010010100100000100100010010100101001000100101001011000100010010100100100010010100100010100100100010010100100001001010010001001010010010010100100010100100010011001010010001010010010100101010010001001001010010001001100100010010100 Dictionary Send Decode 0000000000 φ 1 1 Remark: 0001001011 1 00 0 • No need to send the 001001010 0 01 11 111 001101111 11 10 10 011 dictionary (imagine zip 0100100 01 100 001 0100 and unzip!) 0101101 010 010 00 000 • Can be reconstructed 0110110 00 001 01 100 • Need to send 0’s in 01 , 0111111 10 101 0010 01000 010 and 001 to avoid 1000 0100 1000 10100 010011 ambiguity (i.e., 1001 01001 1001 0 01001 0 instantaneous code) location No need to always send 4 bits 72 Lempel-Ziv Comments

Dictionary D contains K entries D(0), …, D(K–1) . We need to send M=ceil(log K) bits to specify a dictionary entry. Initially K=1, D(0)= φ = null string and M=ceil(log K) = 0 bits.

Input Action 1 “1” ∉D so send “ 1” and set D(1)= “1”. Now K=2 ⇒ M=1 . 0 “0” ∉D so split it up as “ φ”+”0” and send location “ 0” (since D(0)= φ) followed by “ 0”. Then set D(2)= “0” making K=3 ⇒ M=2 . 1 “1” ∈ D so don’t send anything yet – just read the next input bit. 1 “11” ∉D so split it up as “1” + “1” and send location “ 01 ” (since D(1)= “1” and M=2 ) followed by “ 1”. Then set D(3)= “11” making K=4 ⇒ M=2 . 0 “0” ∈ D so don’t send anything yet – just read the next input bit. 1 “01” ∉D so split it up as “0” + “1” and send location “ 10 ” (since D(2)= “0” and M=2 ) followed by “ 1”. Then set D(4)= “01” making K=5 ⇒ M=3 . 0 “0” ∈ D so don’t send anything yet – just read the next input bit. 1 “01” ∈ D so don’t send anything yet – just read the next input bit. 0 “010” ∉D so split it up as “01” + “0” and send location “ 100 ” (since D(4)= “01” and M=3 ) followed by “ 0”. Then set D(5)= “010” making K=6 ⇒ M=3 .

So far we have sent 10001 110 1100 0 where dictionary entry numbers are in red . 73 Lempel-Ziv Properties

• Simple to implement • Widely used because of its speed and efficiency – applications: compress, gzip, GIF, TIFF, modem … – variations: LZW (considering last character of the current phrase as part of the next phrase, used in Adobe Acrobat), LZ77 (sliding window) – different dictionary handling, etc • Excellent compression in practice – many files contain repetitive sequences – worse than arithmetic coding for text files 74 Asymptotic Optimality

• Asymptotically optimum for stationary ergodic source (i.e. achieves entropy rate) • Let c(n) denote the number of phrases for a sequence of length n • Compressed sequence consists of c(n) pairs (location, last bit) • Needs c(n)[log c(n)+1] bits in total

•{Xi} stationary ergodic ⇒

−1 cn( )[log cn ( )+ 1] limsupnlX (1: n )limsup= ≤ H X () with probability 1 n→∞ n →∞ n – Proof: C&T chapter 12.10 – may only approach this for an enormous file 75 Summary

• Huffman Coding: H D (x ) ≤ E l(x ) ≤ H D (x ) +1 – Bottom-up design – Optimal ⇒ shortest average length

• Shannon-Fano Coding: H D (x ) ≤ E l(x ) ≤ H D (x ) +1 – Intuitively natural top-down design • Lempel-Ziv Coding – Does not require probability distribution – Asymptotically optimum for stationary ergodic source (i.e. achieves entropy rate) 76 Lecture 6

• Markov Chains – Have a special meaning – Not to be confused with the standard definition of Markov chains (which are sequences of discrete random variables) • Data Processing Theorem – You can’t create information from nothing • Fano’s Inequality – Lower bound for error in estimating X from Y 77 Markov Chains

If we have three random variables: x, y, z p(x, y, z) = p(z | x, y) p(y | x) p(x) they form a Markov chain x→y→z if p(z | x, y) = p(z | y) ⇔ p(x, y, z) = p(z | y) p(y | x) p(x)

A Markov chain x→y→z means that – the only way that x affects z is through the value of y – if you already know y, then observing x gives you no additional information about z, i.e. I(x ;z | y ) = 0 ⇔ H (z | y ) = H (z | x ,y ) – if you know y, then observing z gives you no additional information about x. 78 Data Processing

• Estimate z = f(y), where f is a function • A special case of a Markov chain x ‰ y ‰ f(y)

Data Y Z X Channel Estimator

• Does processing of y increase the information that y contains about x? 79 Markov Chain Symmetry

If x→y→z p(x, y, z) )a( p(x, y) p(z | y) p(x, z | y) = = = p(x | y) p(z | y) p(y) p(y)

(a) p(z | x, y) = p(z | y) Hence x and z are conditionally independent given y Also x→y→z iff z→y→x since p(z | y) p(y) (a) p(x, z | y) p(y) p(x, y, z) p(x | y) = p(x | y) = = p(y, z) p(y, z) p(y, z) = p(x | y, z) (a) p(x, z | y) = p(x | y) p(z | y) Conditionally indep. Markov chain property is symmetrical 80 Data Processing Theorem

If x→y→z then I(x ; y) ≥ I(x ; z) – processing y cannot add new information about x If x→y→z then I(x ; y) ≥ I(x ; y | z) – Knowing z does not increase the amount y tells you about x Proof: Apply chain rule in different ways I(x ;y ,z ) = I(x ;y ) + I(x ;z | y ) = I(x ;z ) + I(x ;y | z ) (a) but I(x ;z | y ) = 0 hence I(x ;y ) = I(x ;z ) + I(x ;y | z ) so I(x ;y ) ≥ I(x ;z ) and I(x ;y ) ≥ I(x ;y | z )

(a) I(x ; z)=0 iff x and z are independent; Markov ⇒ p(x,z | y)= p(x |y)p(z | y) 81 So Why Processing?

• One can not create information by manipulating the data • But no information is lost if equality holds • Sufficient statistic – z contains all the information in y about x – Preserves mutual information I(x ;y) = I(x ; z) • The estimator should be designed in a way such that it outputs sufficient statistics • Can the estimation be arbitrarily accurate? 82 Fano’s Inequality

ˆ If we estimate x from y, what is pe = p(x ≠ x ) ? x y x^ H(|)x y ≤ Hpp (e ) + e log||X x() yH(|)x y − H () p(a) () H (|)1 − ⇒p ≥e ≥ (a) the second form is e log||X log|| X weaker but easier to use 1 x xˆ ≠ Proof: Define a random variable e =  0 x xˆ = HHHHH(,|) xee e x e x x x x ˆ = (|)ˆ + (|,) ˆ = (|) ˆ + (|,)ˆ chain rule ⇒HHH x e e x(|)0x x ˆ +≤ () + (|,)ˆ H≥0; H(e |y)≤ H(e) ˆ ˆ x x=+HH x( x e ) ( | , e =−+ 0)(1 pHe ) ( | , ep = 1) e

≤Hp(e ) +×− 0 (1 pp e ) + e log |X | H(e)= H(pe) HH x yx ( xx y | )≤ ( |ˆ )since II (;)ˆ ≤ (;) Markov chain 83 Implications

• Zero probability of error pe =0 ⇒ H (|)0x y = • Low probability of error if H(x|y) is small • If H(x|y) is large then the probability of error is high • Could be slightly strengthened to

H(x y | )≤ Hpp (e ) + e log(|X | − 1) • Fano’s inequality is used whenever you need to show that errors are inevitable – E.g., Converse to channel coding theorem 84 Fano Example

T X = {1:5} , px = [0.35, 0.35, 0.1, 0.1, 0.1] Y = {1:2} if x ≤ 2 then y = x with probability 6/7 while if x > 2 then y =1 or 2 with equal prob. Our best strategy is to guess x y x ˆ=( → → ˆ ) T – px | y=1 = [0.6, 0.1, 0.1, 0.1, 0.1]

– actual error prob: pe = 0.4

H (x | y ) −1 .1 771−1 p ≥ = = .0 3855 Fano bound: e log () | X | − 1 log( 4 ) (exercise)

Main use: to show when error free transmission is impossible since pe > 0 85 Summary • Markov: x → y → z ⇔ p(z | x, y) = p(z | y) ⇔ I(x ;z | y ) = 0 • Data Processing Theorem: if x→y→z then – I(x ; y) ≥ I(x ; z), I(y ; z) ≥ I(x ; z) – I(x ; y) ≥ I(x ; y | z) can be false if not Markov

– Long Markov chains: If x1→ x2 → x3 → x4 → x5 → x6, then Mutual Information increases as you get closer together:

• e.g. I(x3, x4) ≥ I(x2, x4) ≥ I(x1, x5) ≥ I(x1, x6) • Fano’s Inequality: if x → y → xˆ then H (x | y ) − H ( p ) H (x | y ) −1 H (x | y ) −1 p ≥ e ≥ ≥ e log()| X | −1 log()| X | −1 log | X |

weaker but easier to use since independent of pe 86 Lecture 7

• Law of Large Numbers – Sample mean is close to expected value • Asymptotic Equipartition Principle (AEP)

– -logP( x1,x2,…, xn)/ n is close to entropy H • The Typical Set – Probability of each sequence close to 2 -nH – Size (~2 nH ) and total probability (~1) • The Atypical Set – Unimportant and could be ignored 87 Typicality: Example

X = {a, b, c, d}, p = [0.5 0.25 0.125 0.125] –log p = [1 2 3 3] ⇒ H(p) = 1.75 bits Sample eight i.i.d. values • typical ⇒ correct proportions

adbabaac –log p(x) = 14 = 8×1.75 = nH(x) • not typical ⇒ log p(x) ≠ nH(x)

dddddddd –log p(x) = 24 88 Convergence of Random Variables

• Convergence

x n → y ⇒ ∀ε > 0,∃m such that ∀n > m |, x n − y |< ε n→∞ –n Example: y x n = ±2 , = 0 choose m = − log ε

• Convergence in probability (weaker than convergence) prob x n → y ⇒ ∀ε > 0, P()| x n − y |> ε → 0 −1 − 1 Example: xn ∈{0;1}, p = [1 − nn ; ] −1 n→∞ for any small ε ε , p (| xn |> ) = n → 0 prob

soxn→ 0 (but x n → 0) Note: y can be a constant or another random variable 89 Law of Large Numbers n Given i.i.d. {x }, sample mean 1 i s n = ∑ x i n i=1 −1 −1 2 – E s n = E x = µ Var s n = n Var x = n σ

As n increases, Var sn gets smaller and the values become clustered around the mean prob

LLN: s n → µ

⇔ ∀ε > 0, P()| s n − µ |> ε → 0 n→∞ The expected value of a random variable is equal to the long-term average when sampling repeatedly. 90 Asymptotic Equipartition Principle

• x is the i.i.d. sequence {xi} for 1 ≤ i ≤n n – Prob of a particular sequence is p(x) = ∏ p(x i ) i=1 – Average E − log p(x) = n E − log p(xi ) = nH (x )

• AEP: 1 prob −logp (x ) → H (x ) n • Proof: 1 1 n −logp (x ) = − ∑ log p ( x i ) n n i=1 prob

law of large numbers →E −log p ( xi ) = H ()x 91 Typical Set

Typical set (for finite n)

(n) n −1 Tε = {x ∈ X : − n log p(x) − H (x ) < ε}

Example: N=128,N=32,N=64,N=16,N=8,N=4,N=1,N=2, p=0.2,p=0.2,p=0.2, e=0.1,e=0.1,e=0.1, ppp =29%=41%=85%=62%=0%=72%=45% TTT – xi Bernoulli with p(xi =1)= p – e.g. p ([0 1 1 0 0 0])= p2(1 –p)4 – For p=0.2, H(X)=0.72 bits (n) – Red bar shows T0.1

-2.5-2.5 -2-2 -1.5-1.5 -1-1 -0.5-0.5 00 NN-1 -1logloglog p(x) p(x)p(x) 92 Typical Set Frames

N=1, p=0.2, e=0.1, p =0% N=2, p=0.2, e=0.1, p =0% N=4, p=0.2, e=0.1, p =41% N=8, p=0.2, e=0.1, p =29% T T T T

-2.5 -2 -1.5 -1 -0.5 0 -2.5 -2 -1.5 -1 -0.5 0 -2.5 -2 -1.5 -1 -0.5 0 -2.5 -2 -1.5 -1 -0.5 0 N-1 log p(x) N-1 log p(x) N-1 log p(x) N-1 log p(x)

N=16, p=0.2, e=0.1, p =45% N=32, p=0.2, e=0.1, p =62% N=64, p=0.2, e=0.1, p =72% N=128, p=0.2, e=0.1, p =85% T T T T

-2.5 -2 -1.5 -1 -0.5 0 -2.5 -2 -1.5 -1 -0.5 0 -2.5 -2 -1.5 -1 -0.5 0 -2.5 -2 -1.5 -1 -0.5 0 N-1 log p(x) N-1 log p(x) N-1 log p(x) N-1 log p(x)

1111100100011000100101100110011, ,0 10, 1101, 00 , 00110010, 1010 , 0100, 0001001010010000, 00010001, 0000 , 00010000,, 0001000100010000 00000000 , 0000100001000000 , 0000000010000000 93 Typical Set: Properties

(n) 1. Individual prob: x ∈Tε ⇒ log p(x) = − nH (x ) ± nε

(n) 2. Total prob: p(x ∈Tε ) >1−ε for n > Nε

n>Nε 3. Size: n(H (x )−ε ) (n) n(H (x )+ε ) (1−ε )2 < Tε ≤ 2

n prob Proof 2: −1 −1 − n log p(x) = n ∑ − log p(xi ) → E − log p(xi ) = H (x ) i=1 −1 Hence ∀ε > 0 ∃Nε s.t. ∀n > Nε p( − n log p(x) − H (x ) > ε ) < ε

Proof 3a: (n) −n(H (x )−ε ) −n(H (x )−ε ) (n) f.l.e. n, 1−ε < p(x ∈Tε ) ≤ ∑ 2 = 2 Tε n)( x∈Tε Proof 3b: −n(H (x )+ε ) −n(H (x )+ε ) (n) 1 = ∑ p(x) ≥ ∑ p(x) ≥ ∑ 2 = 2 Tε n )( n )( x x∈Tε x∈Tε 94 Consequence

• for any ε and for n > Nε “Almost all events are almost equally surprising”

(n) log p(x) = − nH (x ) ± nε • p(x ∈Tε ) >1−ε and Coding consequence |X|n elements (n) –x ∈Tε : ‘ 0’ + at most 1+ n(H+ε) bits (n) –x ∉Tε : ‘ 1’ + at most 1+ nlog| X| bits – L = Average code length (n ) ≤p(x ∈ Tε )[2 + nH ( + ε )] (n ) +p(x ∉ Tε )[2 + n log | X |] ≤++n( H ε ε )() n logX ++ 2 2 ≤ 2n(H (x )+ε ) elements = εnH() ε ++ε ε log X ++ 2( 2) nnH−1 =+ ( ') 95 Source Coding & Data Compression

For any choice of ε > 0, we can, by choosing block size, n, large enough, do the following: • make a lossless code using only H(x)+ ε bits per symbol on average: L ≤H + ε n • The coding is one-to-one and decodable – However impractical due to exponential complexity • Typical sequences have short descriptions of length ≈ nH – Another proof of source coding theorem (Shannon’s original proof) • However, encoding/decoding complexity is exponential in n 96 Smallest high-probability Set

(n) n Tε is a small subset of X containing most of the probability mass. Can you get even smaller ? –1 For any 0 < ε < 1 , choose N0 = –ε log ε , then for any (n) (n) n(H (x )−2ε) n>max( N0,Nε) and any subset S satisfying S < 2

(n) (n) (n) (n) (n) p(x ∈ S )= p(x ∈ S ∩Tε )+ p(x ∈ S ∩Tε ) < S (n) max p(x) + p(x ∈T (n) ) n)( ε x∈Tε n(H −2ε ) −n(H −ε ) < 2 2 + ε for n > Nε −nε −nε log ε = 2 + ε < 2ε for n > N ,0 2 < 2 = ε Answer: No 97 Summary

• Typical Set x n)( x x – Individual Prob ∈Tε ⇒ log p( ) = − nH ( ) ± nε n)( – Total Prob p(x ∈Tε ) >1−ε for n > Nε n>Nε n(H x )( −ε ) n)( n(H x )( +ε ) – Size 1( −ε 2) < Tε ≤ 2 • No other high probability set can be much (n) smaller than Tε • Asymptotic Equipartition Principle – Almost all event sequences are equally surprising • Can be used to prove source coding theorem 98 Lecture 8

• Channel Coding • Channel Capacity – The highest rate in bits per channel use that can be transmitted reliably ♦ – The maximum mutual information • Discrete Memoryless Channels – Symmetric Channels – Channel capacity • Binary Symmetric Channel • Binary Erasure Channel • Asymmetric Channel

♦ = proved in channel coding theorem 99 Model of Digital Communication

Source Coding Channel Coding

In Noisy Out Compress Encode Decode Decompress Channel • Source Coding – Compresses the data to remove redundancy • Channel Coding – Adds redundancy/structure to protect against channel errors 100 Discrete Memoryless Channel

• Input: x∈X, Output y∈Y y x Noisy Channel

• Time-Invariant Transition-Probability Matrix (Q ) = p(y = y | x = x ) y |x i, j j i T – Hence py = Qy |x px – Q: each row sum = 1, average column sum = |X|| Y|–1

• Memoryless : p(yn|x1: n, y1: n–1) = p(yn|xn) • DMC = Discrete Memoryless Channel 101 Binary Channels

0 0 • Binary Symmetric Channel 1− f f    x y –X = [0 1], Y = [0 1]  f 1− f  1 1

0 0 • Binary Erasure Channel 1− f f 0    x ? y –X = [0 1], Y = [0 ? 1]  0 f 1− f  1 1

• Z Channel  1 0  0 0 –X = [0 1], Y = [0 1]   x y  f 1− f  1 1

Symmetric: rows are permutations of each other; columns are permutations of each other Weakly Symmetric: rows are permutations of each other; columns have the same sum 102 Weakly Symmetric Channels

Weakly Symmetric: 1. All columns of Q have the same sum = |X|| Y|–1 – If x is uniform (i.e. p(x) = | X|–1) then y is uniform py()=∑ pyxpx (|)() =X−1 ∑ pyx (|) =×= XXYY −−− 111 xX∈ xX ∈ 2. All rows are permutations of each other

– Each row of Q has the same entropy so

H (y | x ) = ∑ p(x)H (y | x = x) = H (Q :,1 )∑ p(x) = H (Q :,1 ) x∈X x∈X

where Q1,: is the entropy of the first (or any other) row of the Q matrix

Symmetric: 1. All rows are permutations of each other 2. All columns are permutations of each other Symmetric ⇒ weakly symmetric 103 Channel Capacity

• Capacity of a DMC channel: C = max I(x ;y ) px – Mutual information (not entropy itself) is what could be transmitted through the channel H(x , y) ♦ – Maximum is over all possible input distributions px – ∃ only one maximum since I(x ;y) is concave in px for fixed py|x – We want to find the px that maximizes I(x ;y) – Limits on C: H(x | y) I(x ; y) H(y | x) 0≤C ≤ min( H ( yx ), H ( )) ≤ min(H log(x)X ,log Y ) H(y) • Capacity for n uses of channel:

(n) 1 C = max I(x :1 n ;y :1 n ) p n x :1n ♦ = proved in two pages time 104 Mutual Information Plot 1– f Binary Symmetric Channel 1– p 0 0 I(X;Y) Bernoulli Input x f y f p 1 1 1– f 1

I y(;) x y x y = H () − H (|) 0.5 =Hf(2 − pf +− p ) Hf ()

0

-0.5 1 0 0.8 0.2 0.6 0.4 0.6 0.4 0.8 0.2 1 0 Channel Error Prob (f) Input Bernoulli Prob (p) 105

Mutual Information Concave in pX

Mutual Information I(x;y) is concave in px for fixed py|x

Proof: Let u and v have prob mass vectors pu and pv – Define z: bernoulli random variable with p(1) = λ

– Let x = u if z=1 and x=v if z=0 ⇒ px= λ pu +(1–λ) pv I(x ,z ;y ) = I(x ;y ) + I(z ;y | x ) = I(z ;y ) + I(x ;y | z ) but I(z ;y | x ) = H(y | x ) − H(y | x ,z ) = 0 so U 1 X I(x ;y ) ≥ I(x ;y | z ) V 0 p(Y|X) = λI(x ;y | z =1) + (1− λ)I(x ;y | z = 0)

Z Y = λI(u ;y ) + (1− λ)I(v ;y )

Special Case: y=x ⇒ I(x; x)= H(x) is concave in px 106

Mutual Information Convex in pY|X

Mutual Information I(x ;y) is convex in py|x for fixed px Proof: define u, v, x etc: – py |x = λpu |x + (1− λ)pv |x I(x ;y ,z ) = I(x ;y | z ) + I(x ;z ) = I(x ;y ) + I(x ;z | y ) but I(x ;z ) = 0 and I(x ;z | y ) ≥ 0 so I(x ;y ) ≤ I(x ;y | z ) = λI(x ;y | z =1) + (1− λ)I(x ;y | z = 0) = λI(x ;u ) + (1− λ)I(x ;v ) 107 n-use Channel Capacity For Discrete Memoryless Channel:

I(x :1 n ;y :1 n ) = H (y :1 n ) − H (y :1 n | x :1 n ) n n Chain; Memoryless = ∑ H (y i | y :1i−1) − ∑ H (y i | x i ) i=1 i=1 n n n Conditioning ≤ ∑ H (y i ) − ∑ H (y i | x i ) = ∑ I(x i ;y i ) Reduces i=1 i=1 i=1 Entropy

with equality if yi are independent ⇒ xi are independent We can maximize I(x;y) by maximizing each I(xi;yi) independently and taking xi to be i.i.d. – We will concentrate on maximizing I(x; y) for a single channel use

– The elements of Xi are not necessarily i.i.d. 108 Capacity of Symmetric Channel

If channel is weakly symmetric:

IHHHHH y y(;) x y x y =− () (|) =− () (QQ1,: )log|| ≤Y − ( 1,: ) with equality iff input distribution is uniform

∴ Information Capacity of a WS channel is C = log| Y|–H(Q1,: ) 1– f 0 0 x f y For a binary symmetric channel (BSC) : f – |Y| = 2 1 1 1– f – H(Q1,: ) = H(f) – I(x;y) ≤ 1 – H(f) ∴ Information Capacity of a BSC is 1–H(f) 109 Binary Erasure Channel (BEC)

1– f 1− f f 0  0 0 f    0 f 1− f  x ? y f 1 1 I(x ;y ) = H (x ) − H (x | y ) 1– f = H (x ) − p(y = 0)×0 − p(y = ?)H (x ) − p(y =1)×0 = H (x ) − H (x ) f H(x | y) = 0 when y=0 or y=1 = 1( − f )H (x )

≤1 − f since max value of H(x) = 1 C=1 − f with equality when x is uniform since a fraction f of the bits are lost, the capacity is only 1–f and this is achieved when x is uniform 110 Asymmetric Channel Capacity

1–f T T 0: a 0 Let px = [ a a 1–2a] ⇒ py=Q px = px f H (y ) = −2a log a − (1− 2a)log(1− 2a) f x 1: a 1 y H (y | x ) = 2aH ( f ) + 1( − 2a)H )1( = 2aH ( f ) 1–f 2: 1–2a 2 To find C, maximize I(x ;y) = H(y) – H(y |x)

I = −2a log a − (1− 2a)log(1− 2a)− 2aH ( f ) 1− f f 0   dI = −2loge − 2log a + 2loge + 2log(1− 2a) − 2H ( f ) = 0 Q =  f 1− f 0   da  0 0 1 1− 2a −1 log = log(a−1 − 2)= H ( f ) ⇒ a = (2 + 2H ( f ) ) a Note: –1 H ( f ) d(log x) = x log e ⇒ C = −2a log(a2 )− (1− 2a)log(1− 2a) = −log(1− 2a)

1 Examples: f = 0 ⇒ H( f) = 0 ⇒ a = /3 ⇒ C = log 3 = 1.585 bits/use f = ½ ⇒ H( f) = 1 ⇒ a = ¼ ⇒ C = log 2 = 1 bits/use 111 Summary

• Given the channel, mutual information is concave in input distribution • Channel capacity C = max I(x ;y ) px – The maximum exists and is unique • DMC capacity

– Weakly symmetric channel: log| Y|–H(Q1,: ) – BSC: 1–H(f) – BEC: 1–f – In general it very hard to obtain closed-form; numerical method using convex optimization instead 112 Lecture 9

• Jointly Typical Sets • Joint AEP • Channel Coding Theorem – Ultimate limit on information transmission is channel capacity – The central and most successful story of information theory – Random Coding – Jointly typical decoding 113 Intuition on the Ultimate Limit

nH(y|x) • Consider blocks of n symbols: 2nH(x) 2 2nH(y)

x1: n Noisy y1: n Channel

– For large n, an average input sequence x1:n corresponds to about 2nH (y|x) typical output sequences – There are a total of 2nH (y) typical output sequences – For nearly error free transmission, we select a number of input sequences whose corresponding sets of output sequences hardly overlap – The maximum number of distinct sets of output sequences is 2n(H(y)–H(y|x)) = 2 nI (y ; x) – One can send I(y;x) bits per channel use for large n can transmit at any rate < C with negligible errors 114 Jointly Typical Set x,y is the i.i.d. sequence {xi,yi} for 1 ≤ i ≤ n N – Prob of a particular sequence is p(,)x y = ∏ pxy (,)i i i=1

– E − log p(x, y) = n E − log p(xi , yi ) = nH (x ,y ) – Jointly Typical set:

(n ) n − 1 Jε =∈−{x, yXY : n log() p x −< H ()x ε , −n−1 log p (y ) − H (y ) < ε ,

−1 −nlog p (x , y ) − H (x y , ) < ε} 115 Jointly Typical Example

1– f Binary Symmetric Channel 0 0 f T x y f = 0.2, px = (0.75 0.25) f

T  0.6 0.15 1 1 py = ()0.65 0.35 , Pxy =   1– f 0.05 0.2  Jointly typical example (for any ε): x = 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 y = 1 1 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 all combinations of x and y have exactly the right frequencies 116 Jointly Typical Diagram

Each point defines both an x sequence and a y sequence 2nlog| Y| Dots represent jointly typical 2nH(y) pairs (x,y ) Typical in x or y: 2n(H(x)+ H(y))

y) (x, Inner rectangle nH 2nlog| X| nH(x) 2 nH(x|y) represents pairs 2 al: 2 pic that are typical ty nH(y|x) tly 2 in x or y but not oin J necessarily All sequences: 2nlog(| X|| Y|) jointly typical • There are about 2nH (x) typical x’s in all • Each typical y is jointly typical with about 2nH (x|y) of these typical x’s • The jointly typical pairs are a fraction 2–nI (x ;y) of the inner rectangle • Channel Code: choose x’s whose J.T. y’s don’t overlap; use J.T. for decoding • There are 2nI (x ;y) such codewords x’s 117 Joint Typical Set Properties

(n) 1. Indiv Prob: x, y ∈ Jε ⇒ log p(x, y)= −nH(x ,y ) ± nε (n) 2. Total Prob: p(x,y ∈ Jε )>1−ε for n > Nε n> N ε x y nH() ε (,)x y −ε (n ) nH() (,) + 3. Size: (1−ε )2 N , p(− n−1 log p(x) − H (x ) > ε )< 1 1 3

Similarly choose N2 , N3 for other conditions and set Nε = max()N1, N2 , N3

Proof 3: 1−ε < p(x, y) ≤ J n)( max p(x, y) = J n)( 2−n(H (x ,y )−ε ) n>N ∑ ε n)( ε ε n)( x,y∈Jε x,y∈Jε 1 ≥ p(x, y) ≥ J n)( min p(x, y) = J n)( 2−n(H (x ,y )+ε ) ∀n ∑ ε n)( ε n )( x,y∈Jε x,y∈Jε 118 Properties

4. If px'=px and py' =py with x' and y' independent: −n(I (x ,y )+3ε ) (n) −n(I (x ,y )−3ε ) (1−ε )2 ≤ p(x ,' y'∈ Jε )≤ 2 for n > Nε Proof: |J| × (Min Prob) ≤ Total Prob ≤ |J| × (Max Prob) (n) p(x ,' y'∈ Jε )= ∑ p(x ,' y )' = ∑ p(x )' p(y )' n )( n)( x y',' ∈Jε x y',' ∈Jε p(x ,' y'∈ J (n) )≤ J (n) max p(x )' p(y )' ε ε n)( x ,' y'∈Jε ≤ 2n()()()()H (x ,y )+ε 2−n H (x )−ε 2−n H (y )−ε = 2−n I ( ;y) −x3ε p(x ,' y'∈ J (n) )≥ J (n) min p(x )' p(y )' ε ε n)( x ,' y'∈Jε −n()I ( ;yx )+3ε ≥ (1−ε )2 for n > Nε 119 Channel Coding

^ w∈1:M x1:n Noisy y1:n Decoder w∈1:M Encoder Channel g(y)

• Assume Discrete Memoryless Channel with known Qy|x • An (M, n) code is – A fixed set of M codewords x(w)∈Xn for w=1: M – A deterministic decoder g(y)∈1: M • The rate of an (M,n) code: R=(log M)/ n bits/transmission

• Error probability δ λ w =pgww( (y ()) ≠) = ∑ p()y| x ( w ) g(y ) ≠ w y∈Yn n)( – Maximum Error Probability λ = max λw 1≤w≤M M – Average Error probability n)( 1 Pe = ∑λw M w=1

δC = 1 if C is true or 0 if it is false 120 Shannon’s ideas • Channel coding theorem: the basic theorem of information theory – Proved in his original 1948 paper • How do you correct all errors? • Shannon’s ideas – Allowing arbitrarily small but nonzero error probability – Using the channel many times in succession so that AEP holds – Consider a randomly chosen code and show the expected average error probability is small • Use the idea of typical sequences • Show this means ∃ at least one code with small max error prob • Sadly it doesn’t tell you how to construct the code 121 Channel Coding Theorem

• A rate R is achievable if RC – If R

0.2 A very counterintuitive result : 0.15 Achievable Despite channel errors you can get arbitrarily low bit error 0.1 rates provided that R

0 0 1 2 3 Rate R/C 122 Summary

• Jointly typical set −log(,)px y = nH (,)x y ± n ε

(n ) p()x, y ∈ J ε > 1 − ε

(n ) n() H (x y , ) +ε Jε ≤ 2

−n() I (x y , ) + 3ε (n ) −n() I (x y , )3 − ε (1−ε )2 ≤p()x ', y ' ∈≤ J ε 2 • Machinery to prove channel coding theorem 123 Lecture 10

• Channel Coding Theorem – Proof • Using joint typicality • Arguably the simplest one among many possible ways -nE (R) • Limitation: does not reveal Pe ~ e – Converse (next lecture) 124 Channel Coding Principle

nH(y|x) • Consider blocks of n symbols: 2nH(x) 2 2nH(y)

x1: n Noisy y1: n Channel

– An average input sequence x1:n corresponds to about 2nH (y|x) typical output sequences – Random Codes: Choose M = 2nR ( R≤I(x;y) ) random codewords x(w) • their typical output sequences are unlikely to overlap much. – Joint Typical Decoding: A received vector y is very likely to be in the typical output set of the transmitted x(w) and no others. Decode as this w. Channel Coding Theorem: for large n, can transmit at any rate R < C with negligible errors 125 Random (2nR,n) Code

• Choose ε ≈ error prob , joint typicality ⇒ Nε , choose n>N ε

• Choose px so that I(x ;y)= C, the information capacity

n nR • Use px to choose a code C with random x(w)∈X , w=1:2 – the receiver knows this code and also the transition matrix Q • Assume the message W∈1:2nR is uniformly distributed • If received value is y; decode the message by seeing how many x(w)’s are jointly typical with y – if x(k) is the only one then k is the decoded message – if there are 0 or ≥2 possible k’s then declare an error message 0 – we calculate error probability averaged over all C and all W 2nR 2nR (a ) pE= pC()2−nR λ () C -nR = p( C )λ ( C ) =p( E| w = 1 ) () ∑ ∑ w = 2∑∑ p ()() Cλw C ∑ 1 C w =1 w=1 C C (a) since error averaged over all possible codes is independent of w 126 Decoding Errors

• Assume we transmit x(1) and receive y (n ) nR • Define the J.T. events e w ={((),)xw y ∈ Jε } for w ∈ 1:2

x1: n Noisy y1: n Channel

• Decode using joint typicality

• We have an error if either e1 false or ew true for w ≥ 2 • The x(w) for w ≠ 1 are independent of x(1) and hence also Joint AEP − (In (x ,y )−3ε ) independent of y. So p(ew true) < 2 for any w ≠ 1 127 Error Probability for Random Code

• Upper bound p(A ∪B) ≤p(A)+p(B) 2nR pEpEWp=| == 1 e e e e e ∪∪∪∪⋯ ≤ p + p ()()()1232nR () 1 ∑ ()w w=2 nR 2 (1) Joint typicality ε ≤+ε ∑ x2 y−−nI()() ε(;)3x y ε <+ 2nR 2 −− nI (;)3 i=2 (2) Joint AEP log ε ε ε ≤+ε 2−n() C − R − 3ε ≤ 2 for R <− C 3 and n >− C− R − 3ε wehavechosen x ypx ( )suchthat I ( ;= ) C • Since average of P(E) over all codes is ≤ 2ε there must be at least one code for which this is true: this code has −nR ♦ 2 ∑λw ≤ 2ε w • Now throw away the worst half of the codewords; the remaining –1 ♦ ones must all have λw ≤ 4ε. The resultant code has rate R–n ≅ R. ♦ = proved on next page 128 Code Selection & Expurgation

• Since average of P(E) over all codes is ≤ 2ε there must be at least one code for which this is true. Proof: K K −1 (n) −1 (n) (n) 2ε ≥ K P ,ie ≥ K min()()P ,ie = min P ,ie ∑ ∑ i i i=1 i=1 K = num of codes • Expurgation: Throw away the worst half of the codewords; the

remaining ones must all have λw≤ 4ε.

Proof: Assume λw are in descending order

M ½M ½M −1 −1 −1 2ε ≥ M ∑λw ≥ M ∑λw ≥ M ∑λ½M ≥ ½λ½M w=1 w=1 w=1

⇒ λ½M ≤ 4ε ⇒ λw ≤ 4ε ∀ w > ½M

M ′ = ½×2nR messages in n channel uses ⇒ R′ = n−1 log M ′ = R − n−1 129 Summary of Procedure

−1 • For any R

• Find the optimum pX so that I(x ; y) = C nR • Choosing codewords randomly (using pX) to construct codes with 2 (a) codewords and using joint typicality as the decoder • Since average of P(E) over all codes is ≤ 2ε there must be at least (b) one code for which this is true. • Throw away the worst half of the codewords. Now the worst (c) codeword has an error prob ≤ 4ε with rate = R – n–1 > R – ε • The resultant code transmits at a rate as close to C as desired with an error probability that can be made as small as desired (but n unnecessarily large). Note: ε determines both error probability and closeness to capacity 130 Remarks • Random coding is a powerful method of proof, not a method of signaling • Picking randomly will give a good code • But n has to be large (AEP) • Without a structure, it is difficult to encode/decode – Table lookup requires exponential size • Channel coding theorem does not provide a practical coding scheme • Folk theorem (but outdated now): – Almost all codes are good, except those we can think of 131 Lecture 11

• Converse of Channel Coding Theorem – Cannot achieve R>C • Capacity with feedback – No gain for DMC but simpler encoding/ decoding • Joint Source-Channel Coding – No point for a DMC 132 Converse of Coding Theorem

(n) • Fano’s Inequality : if Pe is error prob when estimating w from y, ()n () n H(w |y )≤+ 1 Pe log W =+ 1 nRP e

• Hence nR = H (w ) = H (w | y) + I(w ;y) Definition of I ≤ H (w | y) + I(x(w );y) Markov : w → x → y →wˆ n)( ≤ 1+ nRPe + I(x;y) Fano n)( ≤ 1+ nRPe + nC n-use DMC capacity −1 (n ) RCn− − C ⇒≥Pe →−>1 0if R > C Rn→∞ R (n) • For large (hence for all) n, Pe has a lower bound of (R–C)/ R if w equiprobable – If achievable for small n, it could be achieved also for large n by concatenation. 133 Minimum Bit-Error Rate

w x y w^ 1:nR 1:n Noisy 1:n 1: nR Encoder Decoder Channel Suppose – w is i.i.d. bits with H(w )=1 1: nR i ∆ ˆ – The bit-error rate is PEpb={} e(w w ii ≠ ) = Ep{} () i i i (a) (b) ˆ ˆ Then nC ≥ I(x :1 n ;y :1 n ) ≥ I(w w1:nR ; 1: nR ) =H w w(w 1:nR )( − H 1: nR | 1: nR ) nR nR (c) ˆ =nR − H (|w w w ˆ , ) ≥nR − H (w w |ˆ ) =nR1 − E{} H (|)w wi i ∑ i1: nR 1: i − 1 ∑ i i ( i ) (d) i=1 (c) i=1 (e) ˆ = nR(1− H (P )) =nR(1 − E{} H (|)e wi i )≥ nR(1− E{}H (ei ) )≥nR(1 − HEP ( (ei )) ) b i i i 0.2 (a) n-use capacity Hence 0.15 (b) Data processing theorem −1 0.1 R ≤ C(1− H (Pb )) (c) Conditioning reduces entropy 0.05 −1 Bit error probability e w w= ⊕ ˆ Pb ≥ H 1( − C / R) (d) i i i 0 0 1 2 3 Rate R/C (e) Jensen: EH(x) ≤ H(E x) 134 Coding Theory and Practice

• Construction for good codes – Ever since Shannon founded information theory – Practical : Computation & memory ∝ nk for some k • Repetition code: rate ‰ 0 • Block codes: encode a block at a time – Hamming code: correct one error – Reed-Solomon code, BCH code: multiple errors (1950s) • Convolutional code: convolve bit stream with a filter • Concatenated code: RS + convolutional • Capacity-approaching codes: – Turbo code: combination of two interleaved convolutional codes (1993) – Low-density parity-check (LDPC) code (1960) – Dream has come true for some channels today 135 Channel with Feedback

w∈1:2 nR ^ xi Noisy yi w Encoder Decoder y1: i–1 Channel

• Assume error-free feedback: does it increase capacity ? • A (2 nR , n) feedback code is

– A sequence of mappings xi = xi(w,y1: i–1) for i=1: n ˆ – A decoding function w = g(y :1 n )

• A rate R is achievable if ∃ a sequence of (2 nR ,n) feedback n)( ˆ codes such that Pe = P(w ≠w ) n→∞ →0

• Feedback capacity, CFB ≥ C, is the sup of achievable rates 136 Feedback Doesn’t Increase Capacity

w∈1:2 nR ^ xi Noisy yi w Encoder Decoder I(w ;y) = H (y) − H (y |w ) y Channel n 1: i–1 = H (y) − ∑ H (y i | y :1i−1,w ) i=1 n since x =x (w,y ) = H (y) − ∑ H (y i | y :1i−1,w ,x i ) i i 1: i–1 i=1 n = H (y) − H (y | x ) ∑ i i since yi only directly depends on xi i=1 n n n ≤ H (y ) − H (y | x ) cond reduces ent ∑ i ∑ i i = ∑ I(x i ;y i ) ≤ nC i=1 i=1 i=1 DMC Hence n)( Fano nR = H (w ) = H (w | y) + I(w ;y) ≤1+ nRPe + nC

−1 n)( R − C − n ⇒ Pe ≥ Ï Any rate > is unachievable R C

The DMC does not benefit from feedback: CFB = C 137 Example: BEC with feedback

1– f • Capacity is 1 – f 0 0 • Encode algorithm f – If y = ? , tell the sender to retransmit bit i x ? y i f – Average number of transmissions per bit: 1 1 1– f 1 1+ f + f 2 +⋯ = 1− f • Average number of successfully recovered bits per transmission = 1 – f – Capacity is achieved! • Capacity unchanged but encoding/decoding algorithm much simpler. 138 Joint Source-Channel Coding ^ w1: n x1: n Noisy y1: n w Encoder Decoder 1: n Channel

• Assume wi satisfies AEP and |W|< ∞ – Examples: i.i.d.; Markov; stationary ergodic • Capacity of DMC channel is C – if time-varying: C = lim n−1I(x;y) n→∞ • Joint Source-Channel Coding Theorem: ♦ (n ) ˆ ∃ codes withPe= P (w w1: n ≠ 1: n ) →n→∞ 0 iff H(W) < C – errors arise from two reasons • Incorrect encoding of w • Incorrect decoding of y

♦ = proved on next page 139 Source-Channel Proof (⇐)

• Achievability is proved by using two-stage encoding – Source coding – Channel coding n(H(W)+ ε) • For n > Nε there are only 2 w’s in the typical set: encode using n(H(W)+ ε) bits – encoder error < ε • Transmit with error prob less than ε so long as H(W)+ ε < C • Total error prob < 2 ε 140 Source-Channel Proof (⇒)

^ w1: n x1: n Noisy y1: n w Encoder Decoder 1: n Channel

ˆ (n ) Fano’s Inequality : H(w | w )1≤ + Pe n log W

−1 entropy rate of stationary process H() W≤ n H ()w 1: n −1ˆ − 1 ˆ =nH w(|)w w w 1:nn 1: + nI (;) 1: nn 1: definition of I −1 ()n − 1 ≤n(1 + Pne logW ) + nI (x y1: n ; 1: n ) Fano + Data Proc Inequ −1 (n ) ≤n + Pe log W + C Memoryless channel

(n ) Let n→∞⇒ Pe →⇒0 HWC ( ) ≤ 141 Separation Theorem

• Important result: source coding and channel coding might as well be done separately since same capacity – Joint design is more difficult • Practical implication: for a DMC we can design the source encoder and the channel coder separately – Source coding: efficient compression – Channel coding: powerful error-correction codes • Not necessarily true for – Correlated channels – Multiuser channels • Joint source-channel coding: still an area of research – Redundancy in human languages helps in a noisy environment 142 Summary

• Converse to channel coding theorem – Proved using Fano’s inequality – Capacity is a clear dividing point: • If R < C, error prob. ‰ 0 • Otherwise, error prob. ‰ 1 • Feedback doesn’t increase the capacity of DMC – May increase the capacity of memory channels (e.g., ARQ in TCP/IP) • Source-channel separation theorem for DMC and stationary sources 143 Lecture 12

• Polar codes – Channel polarization – How to construct polar codes – Encoding and decoding • Polar source coding • Extension 144 About Polar Codes

• Provably capacity-achieving • Encoding complexity O( log ) • Successive decoding complexity O( log ) • Probability of error ≈ 2 • Main idea: channel polarization 145 What Is Channel Polarization?

• Normal channel • Extreme channel

Useless channel

Polarization

sometimes cute, Perfect sometimes lazy, channel hard to manage 146 Channel Polarization

• Among all channels, there are two classes which are easy to communicate optimally – The perfect channels the output Y determines the input X – The useless channels Y is independent of X • Polarization is a technique to convert noisy channels to a mixture of extreme channels – The process is information-conserving 147 Generator Matrix • Generator Matrix 1 0 ⨂ = , = 2 1 1 ⨂ denotes the n-fold Kronecker product. • Example

1 0 0 = , = and so on. 1 1 • Encoding Let u be the length-N input to the encoder, then = is the codeword. 148 Channel Combining and Splitting

• Basic operation ( = 2) X1 X1 W Y1 U1 + W Y1 Combine X2 X2 W Y2 U2 W Y2

(X1, X2)=(U1, U2) Channel splitting

U1 + W Y1 U1 + W Y1

W Y2 U2 W Y2

: U1 →(Y1, Y2) : U2 →(Y1, Y2, U1) 149 What Happens? • Suppose is a BEC( p), i.e., Y=X with probability 1 – p, and Y=? (erasure) with probability p. – has input U1 and output (Y1, Y2)=(U1+U2, U2) or (?, U2) or (U1+U2, ?) or (?, ?). – is a BEC(2 p – p2) – has input U2 and output (Y1, Y2, U1)=(U1+U2, U2, U1) or (?, U2, U1) or (U1+U2, ?, U1) or (?, ?, U1). – is a BEC( p2) • is worse than , and is better (recall capacity C(W)=1 – p). – () + = 2 – () ≤ () ≤ () 150 Example: BEC(0.5)

• W is a BEC with erasure probability p = 0.5.

• If we use two copies of W separately

X1 W Y1 2CW ( )= 2 × 0.5 = 1 X2 W Y2 151 Example: BEC(0.5)

• Channel combining and splitting

X1 U1 + W Y1 X2 U2 W Y2

U1 + W Y1 U1 + W Y1

W Y2 U2 W Y2 152 Example: BEC(0.5)

• Channel 1 U1 + W Y1 CW(− )= 4 × log 2 = 0.25 W Y2 16

(Y1,Y2) Transitional probabilities 00 0? 01 ?0 ?? ?1 10 1? 11

0 1/8 1/8 0 1/8 1/4 1/8 0 1/8 1/8 U1 1 0 1/8 1/8 1/8 1/4 1/8 1/8 1/8 0 153 Example: BEC(0.5)

1 • Channel CW(+ )= 12 × log 2 = 0.75 16 U1 + W Y1 CWCWCW(− )+ ( + )2() = U2 W Y2 CWCWCW()−< () < () + (Y1,Y2,U1) Transitional probabilities 000 0?0 010 ?00 ??0 ?10 100 1?0 110 0 1/8 1/8 0 1/8 1/8 0 0 0 0 1 0 0 0 0 1/8 1/8 0 1/8 1/8 U2 001 0?1 011 ?01 ??1 ?11 101 1?1 111 0 0 0 0 1/8 1/8 0 1/8 1/8 0 1 0 1/8 1/8 0 1/8 1/8 0 0 0 154 More Polarization

• Repeating this, we obtain N ‘bit channels’ at the n-th step. • More conveniently, this process can be described as a binary tree.

–Note how the ‘bit channels’ ⋯ are labelled in the tree.

…… …… 155 Martingale • Now pick a ‘bit channel’ uniformly at random on the n-th level of the tree, which is equivalent to a random

traverse on the tree, namely, at each step the r.v bi takes the value of 0 or 1 with equal probability.

• We claim capacity Cn at the n-th step is a martingale . • Proof: By information-preserving 1 1 , … , = + 2 … 2 …

= … =

• By the martingale convergence theorem, Cn converges to a random variable C∞ such that = = = ().

• In fact, the limit C∞ = 0 or 1 is a binary random variable (these are the fixed points of the polar transform). 156 Review of Martingales

• Let {Xn, n ≥ 0} be a random process. If , … , , = then {Xn} is referred to as a martingale.

• Martingale convergence theorem: Let {Xn, n ≥ 0} be a martingale with finite means.

Then there exists a random variable X∞ such that → almost surely as → ∞. 157 How to construct polar codes • To achieve (), we need to identify the indices of those bit channels (branches in tree) with capacity ≈ 1. • For BEC, this can be computed recursively 2 … = … 2 … = 2 … − … • For other types of channels, it is difficult to obtain closed-form formulas. So numerical computation is often used. 158 Polarization Speed

• For any positive real number < 0.5, 1 lim # ⋯ : … ≥1 − 2 → = (). lim # ⋯ : … < 1 − 2 → = 1 − (). • The above statements do not hold for > 0.5. • Thus, the polarization speed is roughly 2 . 159 Convergence

• The portion of almost prefect bit channels is C(W), meaning that the capacity is achieved. • Example: capacities for N = 2 12 for BEC(0.5) Capacity 160 Encoding

• Given = 2 , calculate … for all synthetic bit channels. • Given rate < 1 and = , sort

… in descending order and define the union of the indices of the first elements as the information set Ω. • Choose the information bits and freeze to be all-zero. Obtain the codeword = , ∙ . 161 Construction Example • W is a BEC(0.5), = 8, R=0.5. 162 Construction Example • W is a BEC(0.5), = 8, R=0.5.

Encoding complexity ( log ) 163 Successive decoding • For the decoding we need to compute the likelihood ratio for , = ⋯ …(∙ |1) = …(∙ |0)

If ∈ Ω, = 1 if > 1; otherwise, = 0.

• Similar to … , () can also be calculated recursively. • For more details of the decoding, see E. Arikan, “Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels,” IEEE Trans. on Information Theory,, vol. 55, no. 7, pp. 3051–3073, 2009. 164 Probability of Error • For a polar code with block length and rate < (), the error probability under the successive cancellation decoding is given by ≤ (2 ) < 0.5

Error Bound of the SC decoding for BEC(0.5) 165 Polar Source Coding • Let x be a random variable generated by a Bernoulli source Ber(), i.e., Pr(x =0)= and Pr(x =1)=1 – . • The entropy (in bits) of x is x = − log − (1 − )log (1 − ) • If x = 0, i.e., = 0 or 1, x is a constant , no need for compression. • If x = 1, i.e., = 0.5, x is totally random, we cannot do any compression. • In other cases, can the polarization technique be used to achieve rate x ? 166 Source Polarization • Similar idea applies to source coding: general sources polarization extreme sources • Basic source polarization

U1, U2 = (X1, X2) U1 + X1 ~ Ber(p) U1 + U2 U1 = U1, U2 = X1, X2 = 2(x) U2 X2 ~ Ber(p) U1 ≥ (x) ≥ U2 U1 The process is entropy-conserving , but we obtain two new sources with higher and lower entropy than the original one. • Example: when = 0.11 , x = 0.5, U1 = 0.713 , U2|U1 = 0.287 . 167 Source Coding • Keep polarizing by increasing , the entropy of the synthetic sources tends to 0 or 1. • Again, by the property of the martingale , the proportion of those sources with entropy close to 1 is close to x . • Source coding is realized by recording the indexes with entropy close to 1, while the rest bits can be recovered with high probability because their associated entropy is almost 0 . • For me details, see E. Arikan, “Source polarization,” IEEE ISIT 2010, pp. 899-903. 168 Performance Entropy/Rate

n 169 Extensions • Polar codes also achieve capacity of other types of channels (discrete or continuous). • Achieve entropy bound of other types of sources (lossless or lossy). • Quantum polar codes, network information theory…

Big bang in information theory 170 Lecture 13

• Continuous Random Variables • Differential Entropy – can be negative – not really a measure of the information in x – coordinate-dependent • Maximum entropy distributions – Uniform over a finite range – Gaussian if a constant variance 171 Continuous Random Variables

Changing Variables x Fx (x) = fx (t)dt • pdf: f x ( x ) CDF: ∫ ∞− • For g(x) monotonic: y = g(x ) ⇔ x = g −1(y )

−1 −1 Fy (y) = Fx (g (y)) or 1− Fx (g (y)) according to slope of g(x) dF (y) dg −1(y) dx f (y) = Y = f ()g −1(y) = f ()x where x = g −1(y) y dy x dy x dy • Examples:

Suppose fx (x) = 5.0 for x ∈( 2,0 ) ⇒ Fx (x) = 5.0 x

(a) y = 4x ⇒ x = .0 25y ⇒ fy (y) = 5.0 × .0 25 = .0 125 for y ∈( )8,0

4 ¼ −¾ −¾ (b) z = x ⇒ x = z ⇒ fz (z) = 5.0 ×¼z = .0 125z for z ∈( ,0 16) 172 Joint Distributions

Joint pdf: fx ,y (x, y) ∞ fx (x) = fx ,y (x, y)dy Marginal pdf: ∫ ∞−

Independence: ⇔ fx ,y (x, y) = fx (x) fy (y) f (x, y) Conditional pdf: x ,y fx |y (x) = fy (y) y y

Example:

f =1 for y ∈(0 1, ), x ∈(y, y +1) x Y x ,y f

fX fx |y =1 for x ∈(y, y +1) x 1 f = for y ∈()max(0, x −1),min(x 1, ) y |x min(x 1, − x) 173 Entropy of Continuous R.V.

• Given a continuous pdf f(x), we divide the range of x into bins of width ∆ (i+ )1 ∆ – For each i, ∃x with f (xi )∆ = f (x)dx mean value theorem i ∫i∆ • Define a discrete random variable Y

–Y = { xi} and py = { f(xi)∆}

– Scaled, quantised version of f(x) with slightly unevenly spaced xi

• H (y ) = −∑ f (xi )∆ log(f (xi )∆)

= −log ∆ − ∑ f (xi )log()f (xi ) ∆ ∞ → − log ∆ − f (x)log f (x)dx = − log ∆ + h(x ) ∆→0 ∫−∞ ∞ • Differential entropy: h(x ) = − fx (x)log fx (x)dx ∫ ∞− - Similar to entropy of discrete r.v. but there are differences 174 Differential Entropy

∆ ∞ Differential Entropy: h(x ) =− fx (x)log fx (x)dx = E − log fx (x) ∫ ∞− Bad News: – h(x) does not give the amount of information in x – h(x) is not necessarily positive – h(x) changes with a change of coordinate system Good News:

– h1(x) – h2(x) does compare the uncertainty of two continuous random variables provided they are quantised to the same precision – Relative Entropy and Mutual Information still work fine – If the range of x is normalized to 1 and then x is quantised to n bits, the entropy of the resultant discrete random variable is approximately h(x)+ n 175 Differential Entropy Examples

• Uniform Distribution: x ~ U (a,b) – f (x) = (b − a)−1 for x ∈(a,b) and f (x) = 0 elsewhere b – h(x ) = − (b − a)−1 log(b − a)−1dx = log(b − a) ∫a – Note that h(x) < 0 if (b–a) < 1 • Gaussian Distribution: x ~ N(µ,σ 2 ) 2−½ 1 2− 2  – f()2 x= µ σ()πσ exp − ( x − )  2 ∞   – h(x ) = −(loge) f (x)ln f (x)dx ∫ ∞− 1 ∞ =−(logefx ) ( )() − ln(2 µπσ σ 2 ) −− ( x ) 2− 2 ∫−∞ 2 log y 1 2− 2 2 log y = e =(loge )( ln(2 µπσ σ ) + E() ( x − ) ) x log x 2 e 1 2 =(loge )() ln(2πσ2 ) + 1 = ½ log(2πeσ ) ≅ log( 1.4 σ ) bits 2 176 Multivariate Gaussian

Given mean, m, and symmetric positive definite covariance matrix K,

−½ 1 T −1  x 1:n ~,N( m K )⇔ f ()2 x =π K exp()() −− x m K x − m  2  1 T −1  h() f=−() log e f ()x ×−− ( xmKxm ) ( −− )½ ln 2 π Kx  d ∫ 2  = ½ log(e)×(ln 2πK + E((x − m)T K −1(x − m))) T −1 =½ log()e ×( ln 2πK + E tr( ( x −− m )( x m ) K ))tr (AB ) = tr (BA)

T −1 T = ½ log(e)×(ln 2πK + tr(E(x − m)(x − m) K )) Efxx = K = ½ log(e)× (ln 2πK + tr(KK −1 )) = ½ log(e)×(ln 2πK + n) = ½ log(en )+½ log(2πK ) tr( I)= n=ln( en)

=½ log π( 2π eKK) = ½ log( (2 e ) n ) bits 177 Other Differential Quantities

Joint Differential Entropy h(x ,y ) = − f (x, y)log f (x, y)dxdy = E − log f (x, y) ∫∫ x ,y x ,y x ,y , yx Conditional Differential Entropy h(x | y ) = − f (x, y)log f (x | y)dxdy = h(x ,y ) − h(y ) ∫∫ x ,y x ,y , yx Mutual Information f (x, y) I(x ;y ) = f (x, y)log x ,y dxdy = h(x ) + h(y ) − h(x ,y ) ∫∫ x ,y , yx fx (x) fy (y) Relative Differential Entropy of two pdf’s: f (x) D( f || g) = f (x)log dx ∫ g(x) (a) must have f(x)=0 ⇒ g(x)=0 (b) continuity ⇒ 0 log(0/0) = 0 = −hf (x ) − E f log g(x ) 178 Differential Entropy Properties

Chain Rules h(x ,y ) = h(x ) + h(y | x ) = h(y ) + h(x | y ) I(x ,y ;z ) = I(x ;z ) + I(y ;z | x )

Information Inequality: D( f || g) ≥ 0

Proof: Define S = {x : f (x) > 0}

g()x  g ()x  −Dfg( || ) = f (x )log dE x = log ∫ f   x∈S f()x  f ()x 

 g(x)   g(x)    ≤ log E  = log ∫ f (x) dx Jensen + log() is concave  f (x)   S f (x)      = log ∫ g(x)dx ≤ log1 = 0  S  all the same as for discrete r.v. H() 179 Information Inequality Corollaries

Mutual Information ≥ 0

I(x ;y ) = D( fx ,y || fx fy ) ≥ 0

Conditioning reduces Entropy

h(x ) − h(x | y ) = I(x ;y ) ≥ 0

Independence Bound n n h(x :1 n ) = ∑ h(x i | x :1i−1) ≤ ∑ h(x i ) i=1 i=1

all the same as for H() 180 Change of Variable

Change Variable: y = g(x ) −1 −1 dg (y) from earlier fy (y) = fx ()g (y) dy dx h(y ) = −E log( f (y )) = −E log()f (g −1(y )) − E log y x dy

dx dy = −E log()f (x ) − E log =h(x ) + E log x dy dx Examples: – Translation: y = x + a ⇒ dy / dx =1 ⇒ h(y ) = h(x )

– Scaling: x y x y =⇒c dydxc/ =⇒ h () =+ h ()log c

– Vector version: y :1 n = Ax :1 n ⇒ h(y) = h(x) + log det(A)

not the same as for H() 181 Concavity & Convexity

• Differential Entropy:

– h(x) is a concave function of fx(x) ⇒ ∃ a maximum • Mutual Information:

– I(x ; y) is a concave function of fx (x) for fixed fy|x (y)

– I(x ; y) is a convex function of fy|x (y) for fixed fx (x) Proofs: T Exactly the same as for the discrete case : pz = [1–λ, λ]

1 f(u|x) 1 U U 1 U X X X Y V V V 0 0 f(y|x) f(v|x) 0

Z Z Y Z H(x ) ≥ H(x | z ) I(x ;y ) ≥ I(x ;y |z ) I(x ;y ) ≤ I(x ;y |z ) 182 Uniform Distribution Entropy

What distribution over the finite range (a,b) maximizes the entropy ? Answer: A uniform distribution u(x)=( b–a)–1 Proof: Suppose f(x) is a distribution for x∈(a,b )

0 ≤ D( f || u) = −hf (x ) − E f logu(x )

= −hf (x ) + log(b − a)

⇒ hf (x ) ≤ log(b − a) 183 Maximum Entropy Distribution

What zero-mean distribution maximizes the entropy on (–∞, ∞)n for a given covariance matrix K ? −½ Answer: A multivariate Gaussian φ(x) = 2πK exp(− ½xT K −1x)

Proof: 0 ≤ D( f ||φ) = −hf (x) − E f logφ(x) T −1 ⇒ hf (x) ≤ −(log e)E f (−½ ln(2πK )−½x K x) T −1 = ½(loge)(ln(2πK )+ tr(E f xx K ))

T = ½(loge)(ln(2πK )+ tr(I)) Efxx = K

n = ½ log(2πeK ) = hφ (x) tr( I)= n=ln( e ) Since translation doesn’t affect h(X), we can assume zero-mean w.l.o.g. 184 Summary

∞ • Differential Entropy: h(x ) = − fx (x)log fx (x)dx ∫ ∞− – Not necessarily positive – h(x+a ) = h(x), h(a x) = h(x) + log| a| • Many properties are formally the same – h(x|y) ≤ h(x) – I(x; y) = h(x) + h(y) – h(x, y) ≥ 0, D(f|| g)= E log( f/g) ≥ 0

– h(x) concave in fx(x); I(x; y) concave in fx(x) • Bounds: – Finite range: Uniform distribution has max: h(x) = log( b–a) – Fixed Covariance: Gaussian has max: h(x) = ½log((2 πe)n|K|) 185 Lecture 14

• Discrete-Time Gaussian Channel Capacity • Continuous Typical Set and AEP • Gaussian Channel Coding Theorem • Bandlimited Gaussian Channel – Shannon Capacity 186 Capacity of Gaussian Channel

Z Discrete-time channel : yi = xi + zi i

– Zero-mean Gaussian i.i.d. zi ~ N (0, N) Xi Yi n + −1 2 – Average power constraint n ∑ xi ≤ P i=1 Ey 2 = E(x + z )2 = Ex 2 + 2E(x )E(z ) + Ez 2 ≤ P + N X,Z indep and EZ =0

Information Capacity – Define information capacity : C = max I(x ;y ) Ex 2 ≤P I(x ;y ) = h(y ) − h(y | x ) = h(y ) − h(x + z | x ) a)( = h(y ) − h(z | x ) = h(y ) − h(z ) (a) Translation independence ≤ ½ log 2πe(P + N )−½ log 2πeN Gaussian Limit with 1 P  equality when x~N(0, P) =log 1 +  2 N  The optimal input is Gaussian 187 Achievability z ^ w∈1:2 nR x y w∈0:2 nR Encoder + Decoder

• An (M,n) code for a Gaussian Channel with power constraint is – A set of M codewords x(w)∈Xn for w=1: M with x(w)Tx(w) ≤ nP ∀w – A deterministic decoder g(y)∈0: M where 0 denotes failure n)( n)( – Errors: codeword: λi max : λ average : Pe i i • Rate R is achievable if ∃ seq of (2 nR ,n) codes with λ(n) → 0 n→∞

• Theorem: R achievable iff R < C = ½ log(1+ PN −1 ) ♦

♦ = proved on next pages 188 Argument by Sphere Packing

• Each transmitted xi is received as a probabilistic cloud yi – cloud ‘radius’ = Var(y | x)= nN Law of large numbers

• Energy of yi constrained to n(P+N) so clouds must fit into a hypersphere of radius n(P + N ) • Volume of hypersphere ∝ r n • Max number of non-overlapping clouds:

½n (nP + nN) −1 = 2½nlog()1+PN ()nN ½n • Max achievable rate is ½log(1+ P/N ) 189 Continuous AEP

Typical Set: Continuous distribution, discrete time i.i.d. For any ε>0 and any n, the typical set with respect to f(x) is (n ) n − 1 Tε =∈−{x Snfh: log()x − ()x ≤ ε} where S is the support of f ⇔ {x : f(x)>0 } n f()x = ∏ fx ()i since x i are independent i=1 xh(x )=− E log f ( ) =− nEf−1 log (x )

Typical Set Properties n)( Proof: LLN 1. p(x ∈Tε ) >1−ε for n > Nε

n>Nε 2. n(h(x )−ε ) (n) n(h(x )+ε ) Proof: Integrate 1( −ε )2 ≤ Vol(Tε ) ≤ 2 max/min prob where Vol(A) = ∫ dx x∈A 190 Continuous AEP Proof

Proof 1: By law of large numbers n prob −1 −1 − n log f (x :1 n ) = −n ∑log f (x i ) → E − log f (x ) = h(x ) i=1 prob

Reminder: x n → y ⇒ ∀ε > ,0 ∃Nε such that ∀n > Nε , P()| x n − y |> ε < ε

1−ε ≤fd (x ) x for nN > Proof 2a: ∫ ε Property 1 (n ) Tε ≤ 2− (hn ( X )−ε ) dx = 2− (hn ( X )−ε ) Vol(T n)( ) ∫ ε max f(x) within T n)( Tε Proof 2b: 1=∫fd ()xx ≥ ∫ fd () xx n( n ) S T ε ≥ 2− (hn ( X )+ε ) dx = 2− (hn ( X )+ε ) Vol(T n)( ) ∫ ε min f(x) within T n)( Tε 191 Jointly Typical Set

2 Jointly Typical: xi, yi i.i.d from ℜ with fx,y (xi,yi)

()n 2 n − 1 Jε ={x, y ∈ℜ− : nfh log()()X x −x < ε ,

−1 −nlog fY ()y − h ()y < ε ,

−1 −nlog fX, Y (x , y ) − h (x y , ) < ε} Properties:

(n ) 1. Indiv p.d.: xy,∈⇒Jε log fx y, ( xy ,) =− nh (x y , ) ± n ε

(n) 2. Total Prob: p(x, y ∈ Jε )>1−ε for n > Nε n>Nε n()h(x ,y )−ε (n) n()h(x ,y )+ε 3. Size: (1−ε )2 ≤ Vol()Jε ≤ 2

n>Nε − ()In (x ;y )+3ε n)( − ()In (x ;y )−3ε 4. Indep x′,y′: 1( −ε )2 ≤ p()x ,' y'∈ Jε ≤ 2 Proof of 4.: Integrate max/min f(x´, y´)= f(x´)f(y´), then use known bounds on Vol( J) 192 Gaussian Channel Coding Theorem

R is achievable iff R < C = ½ log(1+ PN −1 ) Proof ( ⇐): Choose ε > 0 n nR Random codebook : x w ∈ℜ for w = :1 2 where xw are i.i.d. ~ N ,0( P − ε ) Use Joint typicality decoding T Errors : 1. Power too big p(x x > nP)→ 0 ⇒ ≤ ε for n > M ε n)( 2. y not J.T. with x p(x, y ∉ Jε ) < ε for n > Nε 2nR n)( nR − In (( x ;y )−3ε ) 3. another x J.T. with y ∑ p(x j , yi ∈ Jε ) ≤ (2 − )1 × 2 j=2 n)( − (In ( X Y ); −R−3ε ) Total Err Pε ≤ ε + ε + 2 ≤ 3ε for large n if R < I(X ;Y ) − 3ε (n) Expurgation : Remove half of codebook *: λ < 6ε now max error We have constructed a code achieving rate R–n–1

T *:Worst codebook half includes xi: xi xi >nP ⇒ λi=1 193 Gaussian Channel Coding Theorem

(n ) - 1 T Proof ( ⇒): Assume Pe →0 and nx x < P for each x ( w )

nR = H (w ) = I(w ;y :1 n ) + H (w | y :1 n )

≤ I(x :1 n ;y :1 n ) + H (w | y :1 n ) Data Proc Inequal

= h(y :1 n ) − h(y :1 n | x :1 n ) + H (w | y :1 n ) n ≤ ∑ h(y i ) − h(z :1 n ) + H (w | y :1 n ) Indep Bound + Translation i=1 n (n ) Z i.i.d + Fano, |W|=2 nR ≤∑ I(x yi ; i ) + 1 + nRP e i=1

n −1 (n ) ≤∑½log() 1 +PN ++ 1 nRP e max Information Capacity i=1

−1 − 1 ()n −1 R≤½log( 1 + PN) ++ n RP e → ½ log(1+ PN ) 194 Bandlimited Channel • Channel bandlimited to f∈(–W,W) and signal duration T – Not exactly – Most energy in the bandwidth, most energy in the interval • Nyquist : Signal is defined by 2WT samples

– white noise with double-sided p.s.d. ½N0 becomes i.i.d gaussian N(0,½ N0) added to each coefficient – Signal power constraint = P ⇒ Signal energy ≤ PT • Energy constraint per coefficient: n–1xTx

Compare discrete time version: ½log(1+ PN –1) bits per channel use 195 Limit of Infinite Bandwidth

P  C= W log 1 +  bits/second WN 0  P C→ log e W →∞ N0 Minimum signal to noise ratio (SNR) E PT P/ C b= b = →ln 2 =− 1.6 dB W →∞ N0 N 0 N 0 Given capacity, trade-off between P and W

- Increase P, decrease W - Increase W, decrease P - spread spectrum - ultra wideband 196 Channel Code Performance

• Power Limited – High bandwidth – Spacecraft, Pagers – Use QPSK/4-QAM – Block/Convolution Codes • Bandwidth Limited – Modems, DVB, Mobile phones – 16-QAM to 256-QAM – Convolution Codes • Value of 1 dB for space – Better range, lifetime, weight, bit rate – $80 M (1999)

Diagram from “An overview of Communications” by John R Barry 197 Summary

• Gaussian channel capacity 1 P  C =log 1 +  bits/transmission 2 N  • Proved by using continuous AEP • Bandlimited channel P  CW=log 1 +  bits/second WN 0  – Minimum SNR = –1.6 dB as W →∞ 198 Lecture 15

• Parallel Gaussian Channels – Waterfilling • Gaussian Channel with Feedback – Memoryless: no gain – Memory: at most ½ bits/transmission 199 Parallel Gaussian Channels

• n independent Gaussian channels – A model for nonwhite noise wideband channel where each component represents a different frequency

– e.g. digital audio, digital TV, Broadband ADSL, WiFi z1 (multicarrier/OFDM) x1 y1 • Noise is independent zi ~ N(0, Ni) + • Average Power constraint ExTx ≤ nP

• Information Capacity: C= max I (x ; y ) z T n f(x ): Ef x x ≤ nP • R

Need to find f(x): C= max I (x ; y ) T f(x ): Ef x x ≤ nP

I(x;y) = h(y) − h(y | x) = h(y) − h(z | x) Translation invariance n y z = h(y) − h(z) = h( ) − ∑ h( i ) x,z indep; Zi indep i=1 (a) n (b) n ≤ ()h(y ) − h(z ) ≤ ½ log()1+ PN −1 (a) indep bound; ∑ i i ∑ i i (b) capacity limit i=1 i=1

Equality when: (a) yi indep ⇒ xi indep; (b) xi ~ N(0, Pi) n ½ log()1+ PN −1 We need to find the Pi that maximise ∑ i i i=1 201 Parallel Gaussian: Optimal Powers

n We need to find the P that maximise log(e) ½ ln()1+ PN −1 i n ∑ i i i=1 – subject to power constraint ∑ Pi = nP – use Lagrange multiplier i=1

n n −1 J = ∑½ ln()1+ Pi Ni − λ∑ Pi v i=1 i=1 ∂J −1 P = ½()P + N − λ = 0 ⇒Pi + N i = v 3 ∂P i i P1 i P2 n n −1 Also ∑PnPi = ⇒ vPn =+ ∑ N i i=1 i = 1 N3 N1 N2 Water Filling: put most power into least noisy channels to make equal power + noise in each channel 202 Very Noisy Channels

• What if water is not enough? • Must have P ≥ 0 ∀i i v • If v < Ni then set Pi=0 and recalculate v (i.e., Pi = max( v – Ni,0) ) P 1 P N Kuhn-Tucker Conditions: 2 3

(not examinable) N1 N2 – Max f(x) subject to Ax +b=0 and

gi (x) ≥ 0 for i ∈1: M with f , gi concave M T – set J (x) = f (x) − ∑ µi gi (x) − λ Ax i=1

– Solution x0, λ, µi iff

∇J (x0 ) = 0, Ax + b = 0, gi (x0 ) ≥ 0, µi ≥ 0, µi gi (x0 ) = 0 203 Colored Gaussian Noise

T T • Suppose y = x + z where E zz = KZ and E xx = KX

• We want to find KX to maximize capacity subject to n 2 power constraint : E∑x i ≤ nP ⇔ tr ()K X ≤ nP i=1 T T – Find noise eigenvectors: KZ = QQΛ with QQ = I – Now QTy = QTx + QTz = QTx + w T T T T where E ww = E Q zz Q = E Q KZQ = Λ is diagonal

• ⇒ Wi are now independent ( so previous result on P.G.C. applies )

T T – Power constraint is unchanged tr(Q K X Q)= tr(K X QQ )= tr(K X ) T – Use water-filling and indep. messages QKQX +Λ = v I

T −1 – Choose QKQX =−v I Λ where vPn =+ tr ( Λ )

T ⇒KX = QI(v − Λ) Q 204 Power Spectrum Water Filling

Noise Spectrum 5

• If z is from a stationary process 4.5 then diag( Λ) → power spectrum 4 N(f) n→∞ 3.5 3 – To achieve capacity use waterfilling on 2.5 v 2

noise power spectrum 1.5

1

W 0.5

P=max() v − N ( f ),0 df 0 ∫−W -0.5 0 0.5 Frequency

W Transmit Spectrum max(v − N ( f ),0 )  2 C=½log 1 +  df ∫−W 1 .5 N( f )  1

0 .5

0 – Waterfilling on spectral domain -0 .5 0 0 .5 Frequency 205 Gaussian Channel + Feedback

x i = xi (w ,y i−1:1 ) zi Does Feedback add capacity ? w xi yi – White noise (& DMC) – No Coder + – Coloured noise – Not much n I(w ;y) = h(y) − h(y |w ) = h(y) − ∑ h(y i |w ,y :1i−1) Chain rule i=1 n = h(y) − ∑ h(y i |w ,y :1i−1,x :1i ,z :1i−1) xi=xi(w,y1: i–1), z = y – x i=1 n = h(y) − ∑ h(z i |w ,y :1i−1,x :1i ,z :1i−1) z = y – x and translation invariance i=1 n = h(y) − ∑h(z i | z :1i−1) z may be colored; zi depends only on z1: i–1 i=1 h(z) = ½log (2π eK ) bits = h(y) − h(z) Chain rule, Z ⇒ maximize I(w ; y) by maximizing h(y) ⇒ y gaussian 1 K y ≤ log ⇒ we can take z and x = y – z jointly gaussian 2 K z 206 Maximum Benefit of Feedback

zi w x y Coder i + i −1 |K y | Cn, FB = max ½ n log tr(K ) ≤nP x |K z | | 2(K+ K ) | Lemmas 1 & 2: ≤ max ½n−1 log x z tr(K ) ≤nP x |K z | |2( Kx +Kz )| ≥ |Ky|

n −1 2 |K+ K | = max ½n log x z n tr(K ) ≤nP |k A|= k |( A| x |K z |

−1 |Kx+ K z | =+½ max ½n log =+ ½ C n bits / transmission tr(K ) ≤nP x |K z | Ky = Kx +Kz if no feedback

Cn: capacity without feedback Having feedback adds at most ½ bit per transmission for colored Gaussian noise channels 207 Max Benefit of Feedback: Lemmas

Lemma 1: Kx+z+Kx–z = 2( Kx+Kz) T T K x+z + K x−z = E(x + z)(x + z) + E(x − z)(x − z) = E(xxT + xzT + zxT + zzT + xxT − xzT − zxT + zzT ) T T = E()2xx + 2zz = 2()K x + K z Lemma 2: If F,G are positive definite then | F+G| ≥ |F| Consider two indep random vectors f~N(0, F), g~N(0, G)

n ½ log(() 2π e |F+ G |) = h (f + g ) Conditioning reduces h() ≥ h(f + g | g) = h(f | g) Translation invariance n =h(f ) = ½ log(() 2π e |F | ) f,g independent

Hence: |2( Kx+Kz)| = | Kx+z+Kx–z| ≥ |Kx+z| = | Ky| 208 Gaussian Feedback Coder

z x = x (w ,y ) i x and z jointly gaussian ⇒ i i i−1:1 w x y x = Bz + v(w ) Coder i + i where v is indep of z and

B is strictly lower triangular since xi indep of zj for j>i. y = x + z = (B + I)z + v T T T T T K y = Eyy = E((B + I)zz (B + I) + vv )= (B + I)K z (B + I) + K v T T T T T K x = Exx = E(Bzz B + vv ) = BK zB + K v (B + I)K (B + I)T + K −1 | K y | −1 z v Capacity: Cn,FB = max½n = max½n log K ,B K ,B v | K z | v | K z | T subject to K x = tr(BK zB + K v )≤ nP hard to solve L Optimization can be done numerically 209 Gaussian Feedback: Toy Example

2 1 0 0 25 n = ,2 P = ,2 K =  , B =   Z     det( K ) 1 2 b 0 20 y

x=B z +⇒= v x v, x = bz + v 15

1 12 1 2 ) Y det(K Goal: Maximize (w.r.t. Kv and b) 10 T 5 |||KY=+()() BIKBI z ++ K v |

0 -1.5 -1 -0.5 0 0.5 1 1.5 Subject to: b 5 K v must be positive definite 4 Power constraint : tr()BK BT + K ≤ 4 z v PbZ1 3 P Solution (via numerically search): V2 2 b=0: |Ky|=16 C=0.604 bits Power: V1 V2 bZ1 V1 V2 Power: 1 PV1 b=0.69: |Ky|=20.7 C=0.697 bits 0 -1.5 -1 -0.5 0 0.5 1 1.5 b Feedback increases C by 16% Px 1 = Pv 1 Px 2 = Pv 2 + Pbz 1 210 Summary • Water-filling for parallel Gaussian channel

n + x+ = max( x ,0) 1 (v− N i )  C =log 1 +  + ∑ (v− N ) = nP i=1 2 Ni  ∑ i • Colored Gaussian noise

n + 1 (v − λ )  λi eigenvealues of Κ z C =log 1 + i ∑   (v−λ ) + = nP i=1 2 λi  ∑ i • Continuous Gaussian channel + W ()v− N( f )  C=½log 1 +  df ∫−W   N( f )  • Feedback bound 1 C≤ C + n, FB n 2 211 Lecture 16

• Lossy Source Coding – For both discrete and continuous sources – Bernoulli Source, Gaussian Source • Rate Distortion Theory – What is the minimum distortion achievable at a particular rate? – What is the minimum rate to achieve a particular distortion? • Channel/Source Coding Duality 212 Lossy Source Coding

f(x )∈1:2 nR x^ x1: n Encoder 1: n Decoder 1: n f( ) g( )

d(x, xˆ) ≥ 0 Distortion function: 0 x = xˆ 2  – examples: (i) d (x, xˆ) = (x − xˆ) (ii) d H (x, xˆ) =  S 1 x ≠ xˆ ∆ n  – sequences: −1 d(x,xˆ) = n ∑ d(xi , xˆi ) i=1 Distortion of Code f ( ), g ( ) : ˆ n n D = Ex∈Xn d(x, x)= E d(x, g( f (x)))

Rate distortion pair (R,D) is achievable for source X if

∃ a sequence fn( ) and gn( ) such that lim E n d(x, gn ( fn (x))) ≤ D n→∞ x∈X 213 Rate Distortion Function

Rate Distortion function for {xi} with pdf p(x) is defined as RD( )= min{ R } such that ( R,D ) is achievable Theorem: R(D) = min I(x ;xˆ) over all p(x ,xˆ) such that : (a) p(x ) is correct ♦ ˆ (b) Ex ,xˆ d(x ,x ) ≤ D

– this expression is the Rate Distortion function for X

Proof is not examinable Lossless coding : If D = 0 then we have R(D) = I(x ;x) = H(x)

♦ p(,) xx x x ˆ= p ()(|) q ˆ 214 R(D) bound for Bernoulli Source

Bernoulli: X = [0,1] , pX = [1–p , p] assume p ≤ ½ – Hamming Distance: d(x, xˆ) = x ⊕ xˆ 1 0.8 – If D≥p, R(D)=0 since we can set g( ) ≡0 0.6 p=0.5

– For D < p ≤ ½, if ˆ 0.4

E d(x x , ) ≤ D then R(D), p=0.2,0.5

0.2 p=0.2 I(x ;xˆ) = H (x ) − H (x | xˆ) 0 0 0.1 0.2 0.3 0.4 0.5 ˆ ˆ D = H ( p) − H (x ⊕ x | x ) ⊕ is one-to-one ˆ ≥ H ( p) − H (x ⊕ x ) Conditioning reduces entropy ≥ H ( p) − H (D) Prob.( x ⊕=≤ˆ 1)D for D ≤ ½ H( x ⊕ˆ ) ≤ HD () as Hp () monotonic Hence R(D) ≥ H(p) – H(D) 215 R(D) for Bernoulli source

We know optimum satisfies R(D) ≥ H(p) – H(D) – We show we can find a p ( xˆ , x ) that attains this. – Peculiarly, we consider a channel with x ˆ as the input and error probability D Now choose r to give x the correct 1– r 0 1– D 0 1– p D probabilities: ^x x D r(1− D ) +− (1 rDp ) = r 1 1 p ⇒=r( pD − )(1 − 2 D ) −1, D≤ p 1– D Now I x x(;) x x x ˆ = H () − H (|)ˆ =− HpHD () () and p(x x ≠=⇒ˆ ) D distortion ≤ D Hence R(D) = H(p) – H(D) If D ≥ p or D ≥ 1- p, we can achieve R(D)= 0 trivially. 216 R(D) bound for Gaussian Source

• Assume X ~ N(0,σ 2 ) and d(x, xˆ) = (x − xˆ)2 • Want to minimize I(x ;xˆ) subject to E(x − xˆ)2 ≤ D

I(x ;xˆ) = h(x ) − h(x | xˆ) = ½ log 2πeσ 2 − h(x − xˆ | xˆ) Translation Invariance ≥ ½ log 2πeσ 2 − h(x − xˆ) Conditioning reduces entropy ≥ ½ log 2πeσ 2 −½ log(2πe Var(x − xˆ )) Gauss maximizes entropy for given covariance 2 ≥ ½ log 2πeσ −½ log 2πeD require Var (x − xˆ )≤ E(x − xˆ)2 ≤ D  2  ˆ σ I(x ;x ) ≥ max½ log ,0 I(x ; y) always positive  D  217 R(D) for Gaussian Source

p(xˆ, x) To show that we can find a z ~ N(0, D) that achieves the bound, we ^ construct a test channel that x ~ N(0, σ2–D) x ~ N(0, σ2) introduces distortion D< σ2 +

3.5 I(x ;xˆ) = h(x ) − h(x | xˆ) 3 2 = ½ log 2πeσ − h(x − xˆ | xˆ) 2.5 = ½ log 2πeσ 2 − h(z | xˆ) 2 achievable

R(D) 1.5 σ 2 = ½ log 1 D 2 0.5 σ impossible ⇒R( D ) = max{½ log ,0} 0 D 0 0.5 1 1.5 D/ σ2 2 2 σ mp =4σ 2 mp / 3 16 / 3 ⋅σ ⇒D( R ) = 2R cf. PCMD ( R ) = = 2 22R 2 2 R 218 Lloyd Algorithm

Problem: Find optimum quantization levels for Gaussian pdf a. Bin boundaries are midway between quantization levels b. Each quantization level equals the mean value of its own bin Lloyd algorithm: Pick random quantization levels then apply conditions (a) and (b) in turn until convergence.

3 0.2

2 0.15 1

0 0.1 MSE -1

Quantization levels 0.05 -2

-3 0 1 2 0 10 10 10 0 1 2 10 10 10 Iteration Iteration Solid lines are bin boundaries. Initial levels uniform in [–1,+1] . Best mean sq error for 8 levels = 0.0345 σ2. Predicted D(R) = ( σ/8)2 = 0.0156 σ2 219 Vector Quantization To get D(R), you have to quantize many values together – True even if the values are independent

D = 0.035 D = 0.028 2.5 2.5

2 2

Scalar 1.5 1.5 Vector quant. 1 1 quant.

0.5 0.5

0 0 0 0.5 1 1.5 2 2.5 0 0.5 1 1.5 2 2.5

Two gaussian variables: one quadrant only shown – Independent quantization puts dense levels in low prob areas – Vector quantization is better (even more so if correlated) 220 Multiple Gaussian Variables

• Assume x1: n are independent gaussian sources with different variances. How should we apportion the available total distortion between the sources?

2 −1 T • Assume x i ~ N(0,σ i ) and d(x,xˆ) = n (x − xˆ ) (x − xˆ )≤ D n ˆ ˆ Mut Info Independence Bound I(x :1 n ;x :1 n ) ≥ ∑ I(x i ;x i ) i=1 for independent x i n n  σ 2  ≥ R(D ) = max½ log i ,0 R(D) for individual ∑ i ∑   Gaussian i=1 i=1  Di  We must find the D that minimize 2 i D0if D 0 < σ i ⇒D = i  2 n 2 σ i otherwise  σ i   max ½log 0,  n ∑   −1 i=1 Di   such that n∑ Di = D i=1 221 Reverse Water-filling

n  σ 2  n σ 2 Minimize max ½log i ,0 subject to D ≤ nD R = ½ log i ∑   ∑ i i D i=1  Di  i=1 2 Use a Lagrange multiplier: ½log σi n 2 n σ i R J = ½ log + λ Di 2 R3 ∑ ∑ R1 i=1 D i=1 i ½log D ∂J −1 −1 = −½Di + λ = 0 ⇒ Di = ½λ = D0 ∂Di n ⇒ D = D X1 X2 X3 ∑ Di = nD 0 = nD 0 i=1 σ2 ½log i Choose Ri for equal distortion ½log D R3 2 2 0 R1 R =0 • If σi

X1 X2 X3 222 Channel/Source Coding Duality

• Channel Coding Noisy channel – Find codes separated enough to give non-overlapping output images. – Image size = channel noise Sphere Packing – The maximum number (highest rate) is when the images just don’t

overlap (some gap). Lossy encode • Source Coding – Find regions that cover the sphere – Region size = allowed distortion – The minimum number (lowest rate) is when they just fill the sphere Sphere Covering (with no gap ). 223 Gaussian Channel/Source • Capacity of Gaussian channel ( n: length) – Radius of big sphere n(P + N) – Radius of small spheres nN n n( P+ N ) P+ N n / 2 Maximum number of – Capacity nC   2 =n =   small spheres packed nN N  in the big sphere • Rate distortion for Gaussian source – Variance σ2 → radius of big sphere nσ 2 – Radius of small spheres nD for distortion D 2 n / 2 Minimum number of – Rate nR( D ) σ  2 =   small spheres to cover D  the big sphere 224 Channel Decoder as Source Encoder

z1: n ~ N(0, D) nR y ~ N(0, σ2–D) 2 ^ nR w∈2 1: n x1: n ~ N(0, σ ) w∈2 Encoder + Decoder

• For R ≅ C = ½ log ( 1 + (σ 2 − D ) D − 1 ) , we can find a channel ˆ 2 encoder/decoder so that p(w ≠w ) < ε and E(x i − y i ) = D • Now reverse the roles of encoder and decoder. Since ˆ ˆ ˆ 2 2 x y xpp() wx w y ≠=≠<( ) ε and E ()()ii −≅−= E ii D Source coding z1: n ~ N(0, D) Channel x ~ N(0, σ2) Channel w^∈2nR Channel x^ w y1: n 1: n 1: n Encoder + Decoder Encoder

Source encoder Source decoder We have encoded x at rate R=½log( σ2D–1) with distortion D! 225 Summary • Lossy source coding: tradeoff between rate and distortion • Rate distortion function R(D) = min I(x ;xˆ) ˆ p ˆ|xx ts .. Ed (x ,x )≤D • Bernoulli source: R(D) = ( H(p) – H(D)) +

+ • Gaussian source 1 σ 2  (reverse waterfilling): R( D )=  log  2 D  • Duality: channel decoding (encoding) ⇔ source encoding (decoding) 226 Nothing But Proof

• Proof of Rate Distortion Theorem – Converse: if the rate is less than R(D), then distortion of any code is higher than D – Achievability: if the rate is higher than R(D), then there exists a rate-R code which achieves distortion D

Quite technical! 227 Review

^ f(x )∈1:2 nR x x1: n Encoder 1: n Decoder 1: n

fn( ) gn( )

Rate Distortion function for x whose px(x) is known is ˆ R(D) = inf R such that ∃fn , gn with lim E n d(x, x) ≤ D n→∞ x∈X Rate Distortion Theorem: ˆ ˆ R(D) = min I(x ;x ) over all p(xˆ | x) such that Ex ,xˆ d(x ,x ) ≤ D

3.5 We will prove this theorem for discrete X 3 2.5 and bounded d(x,y) ≤ dmax 2 achievable

R(D) 1.5 R(D) curve depends on your choice of d(,) 1 0.5 Decreasing and convex impossible 00 0.5 1 1.5 D/ σ2 228 Converse: Rate Distortion Bound

Suppose we have found an encoder and decoder at rate R0 with expected distortion D for independent xi (worst case) ˆ We want to prove that R0 ≥ R(D) = R(E d(x;x)) −1 ˆ – We show first that R0 ≥ n ∑ I(x i ;x i ) i ˆ ˆ n – We know that I(x i ;x i ) ≥ R(E d(x i ;x i )) Def of R(D)

– and use convexity of R(D) to show

−1 ˆ − 1 ˆ  ˆ nREd∑()() x x (;)x x ii≥ RnEd ∑ (;) ii  = REd (;)()x x = RD i i 

We prove convexity first and then the rest 229 Convexity of R(D)

3.5 If p1(xˆ | x) and p2 (xˆ | x) are 3 associated with (D , R ) and (D , R ) 1 1 2 2 2.5 on the R(D) curve we define 2 (D ,R ) 1 1 R(D) pλ (xˆ | x) = λp1(xˆ | x) + (1− λ) p2 (xˆ | x) 1.5 1 Then R λ 0.5 E d(x, xˆ) = λD + (1− λ)D = D (D ,R ) pλ 1 2 λ 2 2 0 0 D 0.5 1 1.5 λ D/ σ2 R(D ) ≤ I (x ;xˆ) R(D) = min I(x ;xˆ) λ pλ p(xˆ|x ) ≤ λI (x ;xˆ) + (1− λ)I (x ;xˆ) I(x ;xˆ) convex w.r.t. p(xˆ | x) p1 p2 = λR(D ) + (1− λ)R(D ) 1 2 p1 and p2 lie on the R(D) curve 230 Proof that R ≥ R(D)

ˆ ˆ ˆ ˆ nR0 ≥ H (x :1 n ) ≥HH x ()x 1:n − ( 1: n | 1: n ) Uniform bound; H(x x | )≥ 0 ˆ = I(x :1 n ;x :1 n ) Definition of I(;) n ˆ x indep: Mut Inf ≥ ∑ I(x i ;x i ) i i=1 Independence Bound

n n ˆ −1 ˆ definition of R ≥ ∑ R()()E d(x i ;x i ) = n∑ n R E d(x i ;x i ) i=1 i=1  n  −1 ˆ ˆ convexity ≥ nRn ∑ E d(x i ;x i ) = nR(E d(x :1 n ;x :1 n ))  i=1  defn of vector d( )

≥ nR(D) original assumption that E(d) ≤ D and R(D) monotonically decreasing 231 Rate Distortion Achievability ^ f(x )∈1:2 nR x x1: n Encoder 1: n Decoder 1: n

fn( ) gn( ) We want to show that for any D, we can find an encoder

and decoder that compresses x1: n to nR (D) bits.

• pX is given • Assume we know the p ( xˆ | x ) that gives I(x ;xˆ) = R(D) nR ˆ • Random codebook: Choose 2 random x i ~ pxˆ – There must be at least one code that is as good as the average • Encoder: Use joint typicality to design – We show that there is almost always a suitable codeword

First define the typical set we will use, then prove two preliminary results. 232 Distortion Typical Set

ˆˆ ˆ Distortion Typical: x x (x x i, i )∈X × X drawn i.i.d. ~ p ( , )

(n) n ˆ n −1 J d ,ε = {x,xˆ ∈ X × X : − n log p(x) − H (x ) < ε, − n−1 log p(xˆ) − H (xˆ) < ε, − n−1 log p(x,xˆ) − H (x ,xˆ) < ε

d(x,xˆ) − E d(x ,xˆ) < ε} new condition Properties of Typical Set: (n) ˆ 1. Indiv p.d.: x,xˆ ∈ J d ,ε ⇒ log p(x,xˆ )= −nH (x ,x ) ± nε

(n) 2. Total Prob: p(x,xˆ ∈ J d ,ε )>1−ε for n > Nε

ˆ weak law of large numbers; d(x i ,x i ) are i.i.d. 233 Conditional Probability Bound

(n) −n(I (x ;xˆ )+3ε ) Lemma: x,xˆ ∈ J d ,ε ⇒ p(xˆ )≥ p(xˆ | x)2

p(xˆ,x) Proof: p()xˆ | x = p()x p(xˆ,x) = p()xˆ take max of top and min of bottom p()()xˆ p x

−n(H (x ,xˆ )−ε ) 2 n ≤ p()xˆ bounds from def of J 2−n()H (x )+ε 2−n()H (xˆ )+ε ˆ = p(xˆ )2n(I (x ;x )+3ε ) def n of I 234 Curious but Necessary Inequality

Lemma: u,v∈[0,1], m > 0 ⇒ (1−uv )m ≤1−u + e−vm Proof: u=0 : e−vm ≥ 0 ⇒ (1− 0)m ≤1− 0 + e−vm u=1 : Define f (v) = e−v −1+ v ⇒ f ′(v) =1− e−v f (0) = 0 and f ′(v) > 0 for v > 0 ⇒ f (v) ≥ 0 for v ∈[0 1, ] Hence for v ∈[0 1, ], 0 ≤1− v ≤ e−v ⇒ (1− v)m ≤ e−vm

m 0< u<1 : Define gv (u) = (1− uv)

2 n−2 ⇒ gv′′(x) = m(m −1)v (1− uv) ≥ 0 ⇒ gv (u) convex for u,v ∈ [0,1] m (1− uv) = gv (u) ≤ (1− u)gv (0) + ugv (1) convexity for u,v ∈ [0,1] = (1− u)1+ u(1− v)m ≤1− u + ue−vm ≤1− u + e−vm 235 Achievability of R(D): preliminaries

^ f(x )∈1:2 nR x x1: n Encoder 1: n Decoder 1: n

fn( ) gn( )

• Choose D and find a p(xˆ | x) such that I(x ;xˆ) = R(D); E d(x, xˆ) ≤ D ˆ ˆ Choose δ > 0 and define pxˆ = { p(x) = ∑ p(x) p(x | x) } x nR n • Decoder: For each w∈1: 2 choose gn (w) = xˆ w drawn i.i.d. ~ pxˆ (n) • Encoder: fn (x) = min w such that (x,xˆ w )∈ J d ,ε else 1 if no such w

• Expected Distortion: D = Ex,g d(x,xˆ) – over all input vectors x and all random decoding functions, g – for large n we show D = D + δ so there must be one good code 236 Expected Distortion

We can divide the input vectors x into two categories: (n) a) if ∃w such that (x,xˆ w )∈ J d ,ε then d(x,xˆ w ) < D + ε since E d(x,xˆ) ≤ D b) if no such w exists we must have d(x,xˆ w ) < dmax since we are assuming that d(,) is bounded. Suppose

the probability of this situation is Pe.

Hence D = Ex,g d(x,xˆ)

≤ (1− Pe )(D + ε ) + Pedmax

≤ D + ε + Pedmax

We need to show that the expected value of Pe is small 237 Error Probability

Define the set of valid inputs for (random) code g (n) V (g) = {x : ∃w with (x, g(w)) ∈ J d ,ε }

We have Pe = ∑ p(g) ∑ p(x) = ∑ p(x) ∑ p(g) Change the order gx∉V ( g ) xg: x∉V ( g)

(n) Define K(x,xˆ) =1 if (x,xˆ)∈ J d ,ε else 0 Prob that a random xˆ does not match x is 1− ∑ p(xˆ)K(x, xˆ) xˆ 2nR   Prob that an entire code does not match x is 1− ∑ p ()(,) xxxˆ K ˆ  xˆ 

2nR Codewords are i.i.d.   ˆ ˆ Hence Pe = ∑p(x)1− ∑ p(x)K(x,x) x xˆ  238 Achievability for Average Code

(n) −n(I (x ;xˆ )+3ε ) Since x,xˆ ∈ Jd ,ε ⇒ p(xˆ)≥ p(xˆ | x)2 2nR   ˆ ˆ Pe = ∑p(x)1− ∑ p(x)K(x,x) x xˆ  2nR −n() I (x x ;ˆ ) + 3 ε  ≤∑p()1x − ∑ p (|)(,)2 xxxxˆ K ˆ ⋅  x xˆ 

Using (1 − uv ) m ≤ 1 − u + e −vm with u = ∑ p(xˆ | x)K (x, xˆ ;) v = 2 − nI (x ;xˆ )−3nε ; m = 2 nR xˆ   ≤ ∑p(x)1− ∑ p(xˆ | x)K(x,xˆ) + exp()− 2−n()I (x ;xˆ )+3ε 2nR  x xˆ  Note : 0 ≤ u, v ≤ 1 as required 239 Achievability for Average Code

 ˆ  ˆ ˆ − ()In ( X ;X )+3ε nR Pe ≤ ∑p(x)1− ∑ p(x | x)K(x,x) + exp(− 2 2 ) x xˆ  Mutual information does n(R−I ( X ;Xˆ )−3ε ) =1− ∑ p(x,xˆ)K(x,xˆ) + exp(− 2 ) not involve particular x x,xˆ

(n) n(R−I ( X ;Xˆ )−3ε ) = P{(x,xˆ)∉ J d ,ε }+ exp(− 2 )

n→∞→0 since both terms → as0 n →∞ provided nRI > (,)3x x + ˆ ε

Hence ∀δ > 0, D = Ex,g d(x,xˆ) can be made ≤ D +δ 240 Achievability

Since ∀δ > 0, D = Ex,g d(x,xˆ) can be made ≤ D +δ there must be at least one g with Ex d(x,xˆ) ≤ D +δ Hence (R,D) is achievable for any R > R(D) ^ f(x )∈1:2 nR x x1: n Encoder 1: n Decoder 1: n

fn( ) gn( ) that is lim EX (x,xˆ) ≤ D n→∞ :1 n In fact a stronger result is true (proof in C&T 10.6):

∀δ > 0, D and R > R(D), ∃fn , gn with p(d(x,xˆ) ≤ D +δ ) → 1 n→∞ 241 Lecture 17

• Introduction to network information theory • Multiple access • Distributed source coding 242 Network Information Theory • System with many senders and receivers • New elements: interference, cooperation, competition, relay, feedback… • Problem: decide whether or not the sources can be transmitted over the channel – Distributed source coding – Distributed communication – The general problem has not yet been solved, so we consider various special cases • Results are presented without proof (can be done using mutual information, joint AEP) 243 Implications to Network Design

• Examples of large information networks – Computer networks – Satellite networks – Telephone networks • A complete theory of network communications would have wide implications for the design of communication and computer networks • Examples – CDMA (code-division multiple access): mobile phone network – Network coding : significant capacity gain compared to routing-based networks 244 Network Models Considered

• Multi-access channel • Broadcast channel • Distributed source coding • Relay channel • Interference channel • Two-way channel • General communication network 245 State of the Art

• Triumphs • Unknowns – Multi-access channel – The simplest relay channel

– Gaussian broadcast – The simplest channel interference channel

Reminder: Networks being built (ad hoc networks, sensor networks) are much more complicated 246 Multi-Access Channel

• Example: many users communicate with a common base station over a common channel • What rates are achievable simultaneously? • Best understood multiuser channel • Very successful: 3G CDMA mobile phone networks 247 Capacity Region • Capacity of single-user Gaussian channel 1  P   P  C = log1+  = C  2  N   N  • Gaussian multi-access channel with m users m Xi has equal power P Y = ∑ X i + Z i=1 noise Z has variance N • Capacity region Ri: rate for user i  P  Ri < C   N  Transmission: independent and simultaneous (i.i.d. Gaussian codebooks)  2P  Ri + R j < C   N  Decoding: joint decoding, look for m  3P  codewords whose sum is closest to Y Ri + R j + Rk < C   N  ⋮ The last inequality dominates when all rates are the same m  mP  ∑ Ri < C  i=1  N  The sum rate goes to ∞ with m 248 Two-User Channel • Capacity region CDMA P1  R1 < C   Onion N  FDMA, TDMA peeling P2  R2 < C   N   Naïve P1+ P 2  TDMA R1+ R 1 < C   N  – Corresponds to CDMA – Surprising fact: sum rate = rate achieved by a single sender with power P1+P2 • Achieves a higher sum rate than treating interference as noise, i.e., P  P  C1 + C 2  PN2+  PN 1 +  249 Onion Peeling

• Interpretation of corner point: onion-peeling – First stage: decoder user 2, considering user 1 as noise – Second stage: subtract out user 2, decoder user 1 • In fact, it can achieve the entire capacity region – Any rate-pairs between two corner points achievable by time- sharing • Its technical term is successive interference cancelation (SIC) – Removes the need for joint decoding – Uses a sequence of single-user decoders • SIC is implemented in the uplink of CDMA 2000 EV-DO (evolution-data optimized) – Increases throughput by about 65% 250 Comparison with TDMA and FDMA • FDMA (frequency-division multiple access)

P1  R1= W 1 log 1 +  Total bandwidth W = W1 + W2 N0 W 1  Varying W1 and W2 tracing out the P2  R2= W 2 log 1 +  curve in the figure N0 W 2  • TDMA (time-division multiple access) – Each user is allotted a time slot, transmits and other users remain silent – Naïve TDMA: dashed line – Can do better while still maintaining the same average power constraint; the same capacity region as FDMA • CDMA capacity region is larger – But needs a more complex decoder 251 Distributed Source Coding

• Associate with nodes are sources that are generally dependent • How do we take advantage of the dependence to reduce the amount of information transmitted? • Consider the special case where channels are noiseless and without interference • Finding the set of rates associate with each source such that all required sources can be decoded at destination • Data compression dual to multi-access channel 252 Two-User Distributed Source Coding

• X and Y are correlated • But the encoders cannot communicate; have to encode independently • A single source: R > H(X) • Two sources: R > H(X,Y) if encoding together • What if encoding separately? • Of course one can do R > H(X) + H(Y) • Surprisingly, R = H(X,Y) is sufficient ( Slepian-Wolf coding, 1973 ) • Sadly, the coding scheme was not practical (again) 253 Slepian-Wolf Coding • Achievable rate region

R1 ≥ H (X |Y )

R2 ≥ H (Y | X ) achievable

R1 + R2 ≥ H (X ,Y ) • Core idea: joint typicality

• Interpretation of corner point R1 = H(X), R2 = H(Y|X) – X can encode as usual – Associate with each xn is a jointly typical fan (however Y doesn’t know) xn yn – Y sends the color (thus compression) – Decoder uses the color to determine the point in jointly typical fan associated with xn • Straight line: achieved by time- sharing 254 Wyner-Ziv Coding

• Distributed source coding with side information

• Y is encoded at rate R 2 • Only X to be recovered

• How many bits R 1 are required?

• If R2 = H(Y), then R1 = H(X|Y) by Slepian-Wolf coding • In general RHXU1 ≥ ( | )

RIYU2 ≥ ( ; ) where U is an auxiliary random variable (can be thought of as approximate version of Y) 255 Rate-Distortion

• Given Y, what is the rate- W distortion to describe X?

ˆ ˆ R(D) = min I x ;( x ) over all p(xˆ x)| such that E ,xxˆ d x ,( x ) ≤ D RDIXWIYWY ( )= minmin{( ; ) − (; )} pwx( | ) f over all decoding functions fYW: × → X ˆ and all pwx(|) such that Ex, w , y dxxD (,)ˆ ≤

• The general problem of rate-distortion for correlated sources remains unsolved 256 Lecture 18

• Network information theory – II – Broadcast – Relay – Interference channel – Two-way channel – Comments on general communication networks 257 Broadcast Channel

X

Y2

Y1

• One-to-many: HDTV station sending different information simultaneously to many TV receivers over a common channel; lecturer in classroom • What are the achievable rates for all different receivers? • How does the sender encode information meant for different signals in a common signal? • Only partial answers are known. 258 Two-User Broadcast Channel

• Consider a memoryless broadcast channel with one encoder and two decoders

• Independent messages at rate R1 and R2

• Degraded broadcast channel: p(y1, y2|x) = p(y1|x) p(y2| y1)

– Meaning X → Y1 → Y2 (Markov chain)

– Y2 is a degraded version of Y1 (receiver 1 is better) • Capacity region of degraded broadcast channel

R2 ≤ I(U;Y2 ) U is an auxiliary

R1 ≤ I(X ;Y1 |U ) random variable 259 Scalar Gaussian Broadcast Channel • All scalar Gaussian broadcast channels belong to the class of degraded channels

Y1 = X + Z1 Assume variance

Y2 = X + Z2 N1 < N 2 • Capacity region Coding Strategy   αP Encoding: one codebook with power αP at R1 ≤ C   N 1  rate R1, another with power (1-α)P at rate (1−α ) P  R2, send the sum of two codewords R2 ≤ C   αP+ N 2  Decoding: Bad receiver Y2 treats Y1 as noise; good receiver Y1 first decode Y2, subtract it 0 ≤ α ≤ 1 out, then decode his own message 260 Relay Channel • One source, one destination, one or more intermediate relays • Example: one relay

– A broadcast channel ( X to Y and Y1)

– A multi-access channel ( X and X1 to Y) – Capacity is unknown! Upper bound:

C ≤ sup min{I(X , X1;Y), I(X ;Y,Y1 | X1)} p ,( xx 1 ) – Max-flow min-cut interpretation

• First term: maximum rate from X and X1 to Y

• Second term: maximum rate from X to Y and Y1 261 Degraded Relay Channel

• In general, the max-flow min-cut bound cannot be achieved • Reason – Interference – What for the relay to forward? – How to forward? • Capacity is known for degraded relay channel

(i.e, Y is a degradation of Y1, or relay is better than receiver), i.e., the upper bound is achieved

C= supmin{( IXXYIXYYX ,1 ; ),( ; , 1 | 1 )} p( x , x 1 ) 262 Gaussian Relay Channel • Channel model YXZ1= + 1 Variance( ZN1 ) = 1

YXZXZ=+++112Variance( ZN 22 ) = ⋯ • Encoding at relay: X1i = fi (Y ,11 Y12 , ,Y1i−1) • Capacity    X has power P   P + P1 + 2 1( −α)PP1 αP  C = max minC ,C  0≤α ≤1  N + N   N    1 2   1  X1 has power P1 – If P P relay − desitination SNR 1 ≥ source − relay SNR N2 N1

then C = C ( P / N 1 ) (capacity from source to relay can be achieved; exercise)

– Rate C = CP ( /( N 1 + N 2 ) ) without relay is increased by the

relay to C = C(P / N1 ) 263 Interference Channel • Two senders, two receivers, with crosstalk

– Y1 listens to X1 and doesn’t care what X2 speaks or what Y2 hears – Similarly with X2 and Y2 • Neither a broadcast channel nor a multiaccess channel • This channel has not been solved – Capacity is known to within one bit (Etkin, Tse, Wang 2008) – A promising technique  interference alignment (Camdambe, Jafar 2008) 264 Symmetric Interference Channel • Model Y1= X 1 + aX 21 + Zequal power P

YXaXZ2212=++Var( Z 1 ) = Var( ZN 2 ) = • Capacity has been derived in the strong interference case ( a ≥ 1) (Han, Kobayashi, 1981) – Very strong interference ( a2 ≥ 1 + P/N) is equivalent to no interference whatsoever

• Symmetric capacity (for each user R1 = R2)

SNR = P/N INR = a2P/N 265 Capacity 266 Very strong interference = no interference

• Each sender has power P and rate C(P/N) • Independently sends a codeword from a Gaussian codebook • Consider receiver 1 – Treats sender 1 as interference – Can decode sender 2 at rate C(a2P/( P+N)) – If C(a2P/( P+N)) > C(P/N), i.e., rate 2 ‰ 1 > rate 2 ‰ 2 (crosslink is better) he can perfectly decode sender 2 – Subtracting it from received signal, he sees a clean channel with capacity C(P/N) 267 An Example

• Two cell-edge users (bottleneck of the cellular network) • No exchange of data between the base stations or between the mobiles • Traditional approaches – Orthogonalizing the two links (reuse ½) – Universal frequency reuse and treating interference as noise • Higher capacity can be achieved by advanced interference management 268 Two-Way Channel • Similar to interference channel, but in both directions (Shannon 1961) • Feedback – Sender 1 can use previously received symbols from sender 2, and vice versa – They can cooperate with each other • Gaussian channel: – Capacity region is known (not the case in general) – Decompose into two independent channels  P   1  Coding strategy: Sender 1 sends a codeword; R1 < C   N1  so does sender 2. Receiver 2 receives a sum  P  but he can subtract out his own thus having an R < C 2  2   interference-free channel from sender 1.  N2  269 General Communication Network

• Many nodes trying to communicate with each other • Allows computation at each node using it own message and all past received symbols • All the models we have considered are special cases • A comprehensive theory of network information flow is yet to be found 270 Capacity Bound for a Network • Max-flow min-cut – Minimizing the maximum flow across cut sets yields an upper bound on the capacity of a network • Outer bound on capacity region

c c ∑ R ji ),( ≤ I(X (S ) ;Y (S ) | X (S ) ) i∈S , j∈S c – Not achievable in general 271 Questions to Answer

• Why multi-hop relay? Why decode and forward? Why treat interference as noise? • Source-channel separation? Feedback? • What is really the best way to operate wireless networks? • What are the ultimate limits to information transfer over wireless networks? 272 Scaling Law for Wireless Networks

• High signal attenuation: (transport) capacity is O(n) bit-meter/sec for a planar network with n nodes (Xie- Kumar’04) • Low attenuation: capacity can grow superlinearly • Requires cooperation between nodes • Multi-hop relay is suboptimal but order optimal 273 Network Coding • Routing: store and forward (as in Internet) • Network coding: recompute and redistribute • Given the network topology, coding can increase capacity (Ahlswede, Cai, Li, Yeung, 2000) – Doubled capacity for butterfly network B A • Active area of research Butterfly Network 274 Lecture 19

274 Lecture 19
• Revision Lecture
275 Summary (1)

• Entropy: H(x) = ∑_{x∈X} p(x)·(−log2 p(x)) = E[−log2 pX(x)]
– Bounds: 0 ≤ H(x) ≤ log|X|
– Conditioning reduces entropy: H(y|x) ≤ H(y)
– Chain Rule: H(x1:n) = ∑_{i=1}^n H(xi | x1:i−1) ≤ ∑_{i=1}^n H(xi)
  and H(x1:n | y1:n) ≤ ∑_{i=1}^n H(xi | yi)
• Relative Entropy:

  D(p || q) = Ep[ log( p(x)/q(x) ) ] ≥ 0
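As an added illustration of these definitions (not part of the original slides), a few lines of Python compute H(x) and D(p||q) for small example pmfs and check 0 ≤ H(x) ≤ log|X| and D(p||q) ≥ 0; the distributions p and q are arbitrary choices.

```python
import numpy as np

def entropy(p):
    """H(x) = sum_x p(x) * (-log2 p(x)), ignoring zero-probability symbols."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-(p[nz] * np.log2(p[nz])).sum())

def kl(p, q):
    """D(p||q) = E_p[ log2( p(x)/q(x) ) ]; assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / q[nz])).sum())

p = [0.7, 0.1, 0.1, 0.1]               # example source distribution
q = [0.25, 0.25, 0.25, 0.25]           # uniform reference distribution

print(f"H(p) = {entropy(p):.3f} bits,  log2|X| = {np.log2(len(p)):.3f}")  # entropy is at most log2|X|
print(f"D(p||q) = {kl(p, q):.3f} >= 0")
print(f"D(q||p) = {kl(q, p):.3f}  (relative entropy is not symmetric)")
```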

276 Summary (2)
• Mutual Information (Venn diagram omitted: H(x,y) comprises H(x|y), I(x;y) and H(y|x); circles H(x) and H(y)):
  I(y;x) = H(y) − H(y|x)

        = H(x) + H(y) − H(x,y) = D(px,y || px py)
– Positive and symmetrical: I(x;y) = I(y;x) ≥ 0
– x, y independent ⇔ H(x,y) = H(x) + H(y) ⇔ I(x;y) = 0
– Chain Rule: I(x1:n; y) = ∑_{i=1}^n I(xi; y | x1:i−1)
– xi independent ⇒ I(x1:n; y1:n) ≥ ∑_{i=1}^n I(xi; yi)
– p(yi | x1:n, y1:i−1) = p(yi | xi) ⇒ I(x1:n; y1:n) ≤ ∑_{i=1}^n I(xi; yi)  (n-use DMC capacity)
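A small added check (not from the slides) of the identities above on an arbitrary example joint pmf: I(x;y) computed as H(x) + H(y) − H(x,y) matches D(px,y || px py) and is non-negative.

```python
import numpy as np

def H(p):
    """Entropy in bits of a pmf given as an array (zeros are ignored)."""
    p = np.asarray(p, float).ravel()
    nz = p > 0
    return float(-(p[nz] * np.log2(p[nz])).sum())

# Arbitrary example joint pmf p(x, y) on a 2x3 alphabet
pxy = np.array([[0.30, 0.10, 0.10],
                [0.05, 0.25, 0.20]])
px = pxy.sum(axis=1)                       # marginal of x
py = pxy.sum(axis=0)                       # marginal of y

I_from_entropies = H(px) + H(py) - H(pxy)  # I(x;y) = H(x) + H(y) - H(x,y)

prod = np.outer(px, py)                    # product of the marginals
nz = pxy > 0
I_from_kl = float((pxy[nz] * np.log2(pxy[nz] / prod[nz])).sum())  # D(p_xy || p_x p_y)

print(f"I(x;y) via entropies:        {I_from_entropies:.4f} bits")
print(f"I(x;y) via relative entropy: {I_from_kl:.4f} bits")
assert I_from_entropies >= -1e-12          # mutual information is non-negative
```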

277 Summary (3)
• Convexity: f″(x) ≥ 0 ⇒ f(x) convex ⇒ Ef(x) ≥ f(Ex)
– H(p) concave in p

– I(x ; y) concave in px for fixed py|x

– I(x ; y) convex in py|x for fixed px
• Markov: x → y → z ⇔ p(z|x,y) = p(z|y) ⇔ I(x;z|y) = 0
  ⇒ I(x;y) ≥ I(x;z) and I(x;y) ≥ I(x;y|z)
• Fano: x → y → x̂ ⇒ p(x̂ ≠ x) ≥ (H(x|y) − 1) / log(|X| − 1)
• Entropy Rate: H(X) = lim_{n→∞} n⁻¹ H(x1:n)
– Stationary process: H(X) = lim_{n→∞} H(xn | x1:n−1)
– Markov process: H(X) = lim_{n→∞} H(xn | xn−1) if stationary
278 Summary (4)

• Kraft: Uniquely decodable ⇒ ∑_{i=1}^{|X|} D^(−li) ≤ 1 ⇒ ∃ instantaneous code

• Average Length: Uniquely Decodable ⇒ LC = E l (x) ≥ HD(x)

• Shannon-Fano: Top-down 50% splits. LSF ≤ HD(x)+1

• Huffman: Bottom-up design. Optimal. LH ≤ HD(x) + 1 (a construction sketch follows this list)
– Designing with wrong probabilities q ⇒ penalty of D(p||q)
– Long blocks disperse the 1-bit overhead
• Lempel-Ziv Coding:
– Does not depend on the source distribution
– Efficient algorithm, widely used
– Approaches the entropy rate for stationary ergodic sources
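The Huffman construction sketch referenced above is an added illustration, not from the slides. It uses a binary heap to repeatedly merge the two least probable nodes; the example probabilities are arbitrary, and the resulting average length LH is compared against H(x) and H(x)+1.

```python
import heapq, itertools, math

def huffman_lengths(probs):
    """Return binary Huffman codeword lengths for the given probabilities."""
    counter = itertools.count()              # tie-breaker so heapq never compares lists
    heap = [(p, next(counter), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)      # the two least probable nodes...
        p2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:                  # ...are merged; every symbol below the merge
            lengths[sym] += 1                #    gains one bit of depth
        heapq.heappush(heap, (p1 + p2, next(counter), s1 + s2))
    return lengths

p = [0.4, 0.2, 0.2, 0.1, 0.1]                # example source
L = huffman_lengths(p)
avg_len = sum(pi * li for pi, li in zip(p, L))
H = -sum(pi * math.log2(pi) for pi in p)
print("codeword lengths:", L)
print(f"L_H = {avg_len:.2f},  H(x) = {H:.3f},  H(x)+1 = {H+1:.3f}")
assert H <= avg_len <= H + 1                 # optimality bound from the slide
```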

279 Summary (5)
• Typical Set
– Individual prob: x ∈ Tε^(n) ⇒ log p(x) = −nH(x) ± nε
– Total prob: p(x ∈ Tε^(n)) > 1 − ε for n > Nε
– Size: (1 − ε) 2^(n(H(x)−ε)) < |Tε^(n)| ≤ 2^(n(H(x)+ε)) for n > Nε
– No other high-probability set can be much smaller
• Asymptotic Equipartition Principle
– Almost all event sequences are equally surprising
280 Summary (6)

• DMC Channel Capacity: C = max_{px} I(x;y) (a numerical example follows this list)
• Coding Theorem
– Can achieve capacity: random codewords, joint typical decoding
– Cannot beat capacity: Fano's inequality
• Feedback doesn't increase capacity of a DMC but could simplify coding/decoding
• Joint source-channel coding doesn't increase capacity of a DMC
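The numerical example referenced above is an added sketch, not from the slides: for a binary symmetric channel with crossover probability ε, the maximization over the input distribution is attained at the uniform input and gives C = 1 − H(ε). The brute-force search below is purely illustrative.

```python
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p*np.log2(p) - (1-p)*np.log2(1-p)

def mutual_information(px, W):
    """I(x;y) in bits for input pmf px and channel matrix W[x][y] = p(y|x)."""
    px, W = np.asarray(px, float), np.asarray(W, float)
    py = px @ W                                   # output distribution
    I = 0.0
    for x in range(W.shape[0]):
        for y in range(W.shape[1]):
            if px[x] > 0 and W[x, y] > 0:
                I += px[x] * W[x, y] * np.log2(W[x, y] / py[y])
    return I

eps = 0.11                                        # example crossover probability
W = [[1-eps, eps], [eps, 1-eps]]                  # BSC transition matrix

# Brute-force search over the input distribution p(x) = (q, 1-q)
qs = np.linspace(0, 1, 1001)
C = max(mutual_information([q, 1-q], W) for q in qs)
print(f"grid-search capacity:  {C:.4f} bits")
print(f"closed form 1 - H({eps}): {1 - h2(eps):.4f} bits")
```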

281 Summary (7)
• Polar codes are low-complexity codes directly built from information theory.
• Their constructions are aided by the polarization phenomenon.
• For channel coding, polar codes achieve channel capacity.
• For source coding, polar codes achieve the entropy bound.
• And much more.
282 Summary (8)

• Differential Entropy: h(x) = E[−log fx(x)]
– Not necessarily positive
– h(x+a) = h(x), h(ax) = h(x) + log|a|, h(x|y) ≤ h(x)
– I(x;y) = h(x) + h(y) − h(x,y) ≥ 0, D(f||g) = E log(f/g) ≥ 0
• Bounds:
– Finite range: uniform distribution has max h(x) = log(b−a)
– Fixed covariance: Gaussian has max h(x) = ½ log((2πe)^n |K|)
• Gaussian Channel
– Discrete time: C = ½ log(1 + P N⁻¹)
– Bandlimited: C = W log(1 + P N0⁻¹ W⁻¹)
• For constant C: Eb/N0 = P C⁻¹ N0⁻¹ = (W/C)(2^(C/W) − 1) → ln 2 ≈ −1.6 dB as W → ∞ (numerical check below)
– Feedback: adds at most ½ bit for coloured noise
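An added numerical check (not from the slides) of the wideband limit above: as the bandwidth grows with the rate C held fixed, the required Eb/N0 = (W/C)(2^(C/W) − 1) approaches ln 2, about −1.59 dB, the ultimate Shannon limit. The rate and bandwidth values are illustrative.

```python
import numpy as np

C = 1.0e6                                   # target rate in bit/s (illustrative)
for W in [1e6, 1e7, 1e8, 1e9]:              # increasing bandwidth in Hz
    ebn0 = (W / C) * (2 ** (C / W) - 1)     # required Eb/N0 (linear) at rate C and bandwidth W
    print(f"W/C = {W/C:8.0f}:  Eb/N0 = {10*np.log10(ebn0):6.3f} dB")
print(f"limit: 10*log10(ln 2) = {10*np.log10(np.log(2)):.3f} dB")
```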

283 Summary (9)
• Parallel Gaussian Channels: total power constraint ∑ Pi = nP

– White noise: waterfilling, Pi = max(v − Ni, 0) where v is the water level (see the sketch after this list)
– Correlated noise: waterfill on the noise eigenvectors

• Rate Distortion: R(D) = min_{p(x̂|x) s.t. E d(x,x̂) ≤ D} I(x;x̂)

– Bernoulli source with Hamming distortion: R(D) = max(H(px) − H(D), 0)
– Gaussian source with mean-square distortion: R(D) = max(½ log(σ² D⁻¹), 0)

– Can encode at rate R: random decoder, joint typical encoder

– Can't encode below rate R: independence bound
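The waterfilling sketch referenced above (an added illustration, not from the slides): bisect on the water level v so that ∑ max(v − Ni, 0) equals the total power budget; the noise levels and power are arbitrary example values.

```python
import numpy as np

def waterfill(noise, total_power, iters=100):
    """Return the allocation Pi = max(v - Ni, 0) meeting sum(Pi) = total_power."""
    noise = np.asarray(noise, float)
    lo, hi = noise.min(), noise.max() + total_power      # the water level v lies in this range
    for _ in range(iters):                               # bisection on the water level
        v = 0.5 * (lo + hi)
        if np.maximum(v - noise, 0).sum() > total_power:
            hi = v
        else:
            lo = v
    return np.maximum(0.5 * (lo + hi) - noise, 0)

N = np.array([1.0, 2.0, 4.0, 8.0])                       # sub-channel noise levels (example)
P = waterfill(N, total_power=8.0)
C = 0.5 * np.log2(1 + P / N).sum()                       # resulting total capacity
print("power allocation:", np.round(P, 3), " sum =", round(P.sum(), 3))
print(f"capacity = {C:.3f} bits per (vector) channel use")
```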

284 Summary (10)
• Gaussian multiple access channel: R1 < C(P1/N), R2 < C(P2/N)

  R1 + R2 < C((P1 + P2)/N), where C(x) = ½ log(1 + x)  (a numerical sketch follows this list)

• Distributed source coding: R1 ≥ H(X|Y), R2 ≥ H(Y|X)

– Slepian-Wolf coding: R1 + R2 ≥ H(X,Y)
• Scalar Gaussian broadcast channel: R1 ≤ C(αP/N1), R2 ≤ C((1−α)P/(αP + N2)), 0 ≤ α ≤ 1
• Gaussian relay channel:

  C = max_{0≤α≤1} min{ C( (P + P1 + 2√((1−α)P P1)) / (N1 + N2) ),  C( αP / N1 ) }
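The numerical sketch of the Gaussian multiple access region referenced above (an added illustration, not on the slide): it evaluates the two single-user bounds and the sum-rate bound, plus the corner point reached by successive decoding, where user 2 is decoded first treating user 1 as noise and then subtracted. The power and noise values are assumptions.

```python
import numpy as np

def C(x):
    """C(x) = 0.5*log2(1+x) in bits per channel use."""
    return 0.5 * np.log2(1 + x)

P1, P2, N = 10.0, 5.0, 1.0                      # example powers and noise variance

R1_max = C(P1 / N)                              # single-user bound on R1
R2_max = C(P2 / N)                              # single-user bound on R2
Rsum   = C((P1 + P2) / N)                       # sum-rate bound

# Corner point via successive decoding: decode user 2 first (user 1 treated as noise),
# subtract it, then decode user 1 over a clean channel.
R2_corner = C(P2 / (P1 + N))
R1_corner = C(P1 / N)
print(f"R1 < {R1_max:.3f}, R2 < {R2_max:.3f}, R1+R2 < {Rsum:.3f}")
print(f"corner point: (R1, R2) = ({R1_corner:.3f}, {R2_corner:.3f}), "
      f"sum = {R1_corner + R2_corner:.3f}")
assert abs((R1_corner + R2_corner) - Rsum) < 1e-9   # the corner point achieves the sum rate
```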

285 Summary (11)
• Interference channel
– Very strong interference = no interference
• Gaussian two-way channel
– Decomposes into two independent channels
• General communication network
– Max-flow min-cut theorem
– Not achievable in general
– But achievable for the multiple access channel and the Gaussian relay channel