Information Theory
Lecturer: Cong Ling (Office: 815)

Lectures
Introduction: 1 Introduction
Entropy Properties: 2 Entropy I; 3 Entropy II
Lossless Source Coding: 4 Theorem; 5 Algorithms
Channel Capacity: 6 Data Processing Theorem; 7 Typical Sets; 8 Channel Capacity; 9 Joint Typicality; 10 Coding Theorem; 11 Separation Theorem; 12 Polar Codes
Continuous Variables: 13 Differential Entropy; 14 Gaussian Channel; 15 Parallel Channels
Lossy Source Coding: 16 Rate Distortion Theory
Network Information Theory: 17 NIT I; 18 NIT II
Revision: 19 Revision

Claude Shannon
• C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, 1948.
• Two fundamental questions in communication theory:
– Ultimate limit on data compression: entropy
– Ultimate transmission rate of communication: channel capacity
• Almost all important topics in information theory were initiated by Shannon (1916-2001)

Origin of Information Theory
• Common wisdom in the 1940s:
– It is impossible to send information error-free at a positive rate
– Error control by retransmission: rate → 0 if error-free
• Still in use today: ARQ (automatic repeat request) in TCP/IP computer networking
• Shannon showed reliable communication is possible at all rates below channel capacity
• As long as the source entropy is less than the channel capacity, asymptotically error-free communication can be achieved
• And anything can be represented in bits
– Rise of digital information technology

Relationship to Other Fields

Course Objectives
• In this course we will (with a focus on communication theory):
– Define what we mean by information.
– Show how we can compress the information in a source to its theoretical minimum, and show the tradeoff between data compression and distortion.
– Prove the channel coding theorem and derive the information capacity of different channels.
– Generalize from point-to-point to network information theory.

Relevance to Practice
• Information theory suggests means of achieving the ultimate limits of communication
– Unfortunately, these theoretically optimum schemes are computationally impractical
– So some say "little info, much theory" (wrong)
• Today, information theory offers useful guidelines for the design of communication systems
– Polar codes (achieve channel capacity)
– CDMA (has a higher capacity than FDMA/TDMA)
– Channel-coding approach to source coding (duality)
– Network coding (goes beyond routing)

Books/Reading
Book of the course:
• Elements of Information Theory by T M Cover & J A Thomas, Wiley, £39 for 2nd ed. 2006, or £14 for 1st ed. 1991 (Amazon)
Free references
• Information Theory and Network Coding by R. W. Yeung, Springer http://iest2.ie.cuhk.edu.hk/~whyeung/book2/
• Information Theory, Inference, and Learning Algorithms by D MacKay, Cambridge University Press http://www.inference.phy.cam.ac.uk/mackay/itila/
• Lecture Notes on Network Information Theory by A. El Gamal and Y.-H. Kim (book published by Cambridge University Press) http://arxiv.org/abs/1001.3404
• C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656, July, October, 1948.

Other Information
• Course webpage: http://www.commsp.ee.ic.ac.uk/~cling
• Assessment: exam only; no coursework.
• Students are encouraged to do the problems in the problem sheets.
• Background knowledge:
– Mathematics
– Elementary probability
• Needs intellectual maturity: doing problems is not enough; spend some time thinking

Notation
• Vectors and matrices: v = vector, V = matrix
• Scalar random variables: x = R.V., x = specific value, X = alphabet
• Random column vector of length N: x = R.V., x = specific value, X^N = alphabet
– xi and xi are particular vector elements
• Ranges: a:b denotes the range a, a+1, …, b
• Cardinality: |X| = the number of elements in set X

Discrete Random Variables
• A random variable x takes a value x from the alphabet X with probability px(x). The vector of probabilities is px.
Examples:
X = [1; 2; 3; 4; 5; 6], px = [1/6; 1/6; 1/6; 1/6; 1/6; 1/6]
px is a "probability mass vector"
English text: X = [a; b; …; y; z; space], px = [0.058; 0.013; …; 0.016; 0.0007; 0.193]
Note: we normally drop the subscript from px if unambiguous

Expected Values
• If g(x) is a function defined on X then
E g(x) = Σ_{x∈X} p(x) g(x)   (we often write E for E_X)
Examples:
X = [1; 2; 3; 4; 5; 6], px = [1/6; 1/6; 1/6; 1/6; 1/6; 1/6]
E x = 3.5 = µ
E x² = 15.17 = σ² + µ²
E sin(0.1x) = 0.338
E [−log2(p(x))] = 2.58: this is the "entropy" of X

Shannon Information Content
• The Shannon Information Content of an outcome with probability p is −log2 p
• Shannon's contribution: a statistical view
– Messages and noisy channels are random
– Pre-Shannon era: deterministic approach (Fourier…)
• Example 1: Coin tossing
– X = [Head; Tail], p = [½; ½], SIC = [1; 1] bits
• Example 2: Is it my birthday?
– X = [No; Yes], p = [364/365; 1/365], SIC = [0.004; 8.512] bits
Unlikely outcomes give more information

Minesweeper
• Where is the bomb?
• 16 possibilities: needs 4 bits to specify

Guess    Prob    SIC
1. No    15/16   0.093 bits
2. No    14/15   0.100 bits
3. No    13/14   0.107 bits
4. Yes   1/13    3.700 bits
Total            4.000 bits

SIC = −log2 p

Entropy
H(x) = E [−log2 px(x)] = −Σ_{x∈X} px(x) log2 px(x)
– H(x) = the average Shannon Information Content of x
– H(x) = the average information gained by knowing its value
– the average number of "yes-no" questions needed to find x is in the range [H(x), H(x)+1)
– H(x) = the amount of uncertainty before we know its value
We use log(x) ≡ log2(x) and measure H(x) in bits
– if you use log_e it is measured in nats
– 1 nat = log2(e) bits = 1.44 bits
• log2(x) = ln(x)/ln(2);  d log2(x)/dx = log2(e)/x
H(x) depends only on the probability vector px, not on the alphabet X, so we can write H(px)

Entropy Examples
(1) Bernoulli Random Variable
X = [0; 1], px = [1−p; p]
H(x) = −(1−p) log(1−p) − p log p
Very common: we write H(p) to mean H([1−p; p]).
H(p) = −(1−p) log(1−p) − p log p
H′(p) = log(1−p) − log p
H″(p) = −p⁻¹(1−p)⁻¹ log e
Maximum is when p = ½.
(Figure: plot of H(p) against p.)

(2) Four Coloured Shapes
X = {four coloured shapes}, px = [½; ¼; 1/8; 1/8]
H(x) = H(px) = Σ −log(p(x)) p(x)
= 1×½ + 2×¼ + 3×1/8 + 3×1/8 = 1.75 bits
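As a quick numerical check of these examples, here is a minimal Python sketch (ours, not from the lecture):

    import math

    def entropy(p):
        """Entropy in bits of a probability mass vector p (0 log 0 = 0)."""
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    print(entropy([1/6] * 6))                  # fair die: 2.585 bits
    print(entropy([0.5, 0.25, 0.125, 0.125]))  # four shapes: 1.75 bits
    print(entropy([0.5, 0.5]))                 # Bernoulli(1/2): 1.0 bit (the maximum)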
Comments on Entropy
• Entropy plays a central role in information theory
• Origin in thermodynamics
– S = k ln Ω, k: Boltzmann's constant, Ω: number of microstates
– The second law: the entropy of an isolated system is non-decreasing
• Shannon entropy
– Agrees with intuition: additive, monotonic, continuous
– The logarithmic measure can be derived from an axiomatic approach (Shannon 1948)

Lecture 2
• Joint and Conditional Entropy
– Chain rule
• Mutual Information
– If x and y are correlated, their mutual information is the average information that y gives about x
– E.g. a communication channel: x transmitted but y received; it is the amount of information transmitted through the channel
• Jensen's Inequality

Joint and Conditional Entropy
Joint Entropy: H(x,y)

p(x,y)   y=0   y=1
x=0      ½     ¼
x=1      0     ¼

H(x,y) = E [−log p(x,y)]
= −½ log ½ − ¼ log ¼ − 0 log 0 − ¼ log ¼ = 1.5 bits
Note: 0 log 0 = 0

Conditional Entropy: H(y|x)

p(y|x)   y=0   y=1
x=0      2/3   1/3
x=1      0     1
(rows sum to 1)

H(y|x) = E [−log p(y|x)] = −Σ_{x,y} p(x,y) log p(y|x)
= −½ log 2/3 − ¼ log 1/3 − 0 log 0 − ¼ log 1 = 0.689 bits

Conditional Entropy – View 1
Additional Entropy:

p(x,y)   y=0   y=1   p(x)
x=0      ½     ¼     ¾
x=1      0     ¼     ¼

p(y|x) = p(x,y) ÷ p(x)
H(y|x) = E [−log p(y|x)] = E [−log p(x,y)] − E [−log p(x)]
= H(x,y) − H(x) = H(½, ¼, 0, ¼) − H(¾, ¼) = 1.5 − 0.811 = 0.689 bits
H(y|x) is the average additional information in y when you know x
(Venn diagram: the rectangle H(x,y) contains H(y|x) and the circle H(x).)

Conditional Entropy – View 2
Average Row Entropy:

p(x,y)   y=0   y=1   H(y|x=x)   p(x)
x=0      ½     ¼     H(1/3)     ¾
x=1      0     ¼     H(1)       ¼

H(y|x) = E [−log p(y|x)] = Σ_{x,y} −p(x,y) log p(y|x)
= Σ_{x,y} −p(x) p(y|x) log p(y|x)
= Σ_{x∈X} p(x) Σ_{y∈Y} −p(y|x) log p(y|x)
= Σ_{x∈X} p(x) H(y|x=x) = ¾ × H(1/3) + ¼ × H(1) = 0.689 bits
Take a weighted average of the entropy of each row, using p(x) as the weight

Chain Rules
• Probabilities: p(x,y,z) = p(z|x,y) p(y|x) p(x)
• Entropy: H(x,y,z) = H(z|x,y) + H(y|x) + H(x)
H(x1:n) = Σ_{i=1}^{n} H(xi | x1:i−1)
The log in the definition of entropy converts products of probabilities into sums of entropies

Mutual Information
Mutual information is the average amount of information that you get about x from observing the value of y
– or the reduction in the uncertainty of x due to knowledge of y
I(x;y) = H(x) − H(x|y) = H(x) + H(y) − H(x,y)
– H(x) = the information in x; H(x|y) = the information in x when you already know y
(Venn diagram: the rectangle H(x,y) splits into H(x|y), I(x;y), H(y|x); circles H(x), H(y).)
Mutual information is symmetrical: I(x;y) = I(y;x)
Use ";" to avoid ambiguities between I(x;y,z) and I(x,y;z)

Mutual Information Example

p(x,y)   y=0   y=1
x=0      ½     ¼
x=1      0     ¼

• If you try to guess y you have a 50% chance of being correct.
• However, what if you know x?
– Best guess: choose y = x
– If x=0 (p=0.75) then 66% correct prob
– If x=1 (p=0.25) then 100% correct prob
– Overall 75% correct probability
I(x;y) = H(x) − H(x|y) = H(x) + H(y) − H(x,y)
H(x) = 0.811, H(y) = 1, H(x,y) = 1.5
H(x|y) = 0.5, H(y|x) = 0.689
I(x;y) = 0.311
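The numbers above can be reproduced with a short Python sketch (an illustration, not part of the slides):

    import math

    def H(probs):
        """Entropy in bits; zero entries are skipped (0 log 0 = 0)."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    pxy = [[0.5, 0.25],      # joint distribution p(x,y): rows x = 0, 1
           [0.0, 0.25]]      # columns y = 0, 1
    px = [sum(row) for row in pxy]                        # [0.75, 0.25]
    py = [sum(col) for col in zip(*pxy)]                  # [0.5, 0.5]

    Hxy = H([p for row in pxy for p in row])              # 1.5 bits
    Hygx = Hxy - H(px)                                    # view 1: 0.689 bits
    Hygx2 = sum(px[i] * H([p / px[i] for p in pxy[i]])
                for i in range(2) if px[i] > 0)           # view 2: same value
    Ixy = H(px) + H(py) - Hxy                             # 0.311 bits
    print(Hxy, Hygx, Hygx2, Ixy)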
Conditional Mutual Information
I(x;y|z) = H(x|z) − H(x|y,z)
= H(x|z) + H(y|z) − H(x,y|z)
Note: the conditioning on z applies to both x and y

Chain Rule for Mutual Information
I(x1,x2,x3;y) = I(x1;y) + I(x2;y|x1) + I(x3;y|x1,x2)
I(x1:n;y) = Σ_{i=1}^{n} I(xi;y|x1:i−1)
Review/Preview
• Entropy: H(x) = Σ_{x∈X} −log2(p(x)) p(x) = E [−log2 pX(x)]
– Positive and bounded: 0 ≤ H(x) ≤ log|X| ♦
• Chain Rule: H(x,y) = H(x) + H(y|x) ≤ H(x) + H(y) ♦
– Conditioning reduces entropy: H(y|x) ≤ H(y) ♦
• Mutual Information: I(y;x) = H(y) − H(y|x) = H(x) + H(y) − H(x,y)
– Positive and symmetrical: I(x;y) = I(y;x) ≥ 0 ♦
– x and y independent ⇔ H(x,y) = H(y) + H(x) ⇔ I(x;y) = 0 ♦
♦ = inequalities not yet proved

Convex & Concave Functions
f(x) is strictly convex over (a,b) if
f(λu + (1−λ)v) < λf(u) + (1−λ)f(v)  ∀ u ≠ v ∈ (a,b), 0 < λ < 1
– every chord of f(x) lies above f(x)
– f(x) is concave ⇔ −f(x) is convex (a concave function is shaped like the roof of a cave)
• Examples
– Strictly convex: x², x⁴, eˣ, x log x [x ≥ 0]
– Strictly concave: log x, √x [x ≥ 0]
– Convex and concave: x
– Test: d²f/dx² > 0 ∀ x ∈ (a,b) ⇒ f(x) is strictly convex
• "Convex" (not strictly) uses "≤" in the definition and "≥" in the test

Jensen's Inequality
Jensen's Inequality:
(a) f(x) convex ⇒ E f(x) ≥ f(E x)
(b) f(x) strictly convex ⇒ E f(x) > f(E x) unless x is constant
Proof by induction on |X|:
– |X| = 1: E f(x) = f(E x) = f(x1)
– |X| = k: E f(x) = Σ_{i=1}^{k} pi f(xi) = pk f(xk) + (1−pk) Σ_{i=1}^{k−1} [pi/(1−pk)] f(xi)
(the weights pi/(1−pk) sum to 1)
≥ pk f(xk) + (1−pk) f( Σ_{i=1}^{k−1} [pi/(1−pk)] xi )   (assume J.I. is true for |X| = k−1)
≥ f( pk xk + (1−pk) Σ_{i=1}^{k−1} [pi/(1−pk)] xi ) = f(E x)
(the last step follows from the definition of convexity for a two-mass-point distribution)

Jensen's Inequality Example
Mnemonic example: f(x) = x², strictly convex
X = [−1; +1], p = [½; ½]
E x = 0, so f(E x) = 0
E f(x) = 1 > f(E x)
(Figure: parabola f(x) = x² with the chord from (−1,1) to (1,1) above it.)

Summary
• Chain Rule: H(x,y) = H(y|x) + H(x)
• Conditional Entropy: H(y|x) = H(x,y) − H(x) = Σ_{x∈X} p(x) H(y|x=x)
– Conditioning reduces entropy: H(y|x) ≤ H(y) ♦
• Mutual Information: I(x;y) = H(x) − H(x|y) ≤ H(x) ♦
– In communications, mutual information is the amount of information transmitted through a noisy channel
• Jensen's Inequality: f(x) convex ⇒ E f(x) ≥ f(E x)
♦ = inequalities not yet proved

Lecture 3
• Relative Entropy
– A measure of how different two probability mass vectors are
• Information Inequality and its consequences
– Relative entropy is always positive
• Mutual information is positive
• Uniform bound
• Conditioning and correlation reduce entropy
• Stochastic Processes
– Entropy Rate
– Markov Processes

Relative Entropy
Relative Entropy or Kullback-Leibler Divergence between two probability mass vectors p and q:
D(p||q) = Σ_{x∈X} p(x) log [p(x)/q(x)] = Ep log [p(x)/q(x)] = Ep [−log q(x)] − H(x)
where Ep denotes an expectation performed using the probabilities p.
D(p||q) measures the "distance" between the probability mass functions p and q.
We must have pi = 0 whenever qi = 0, else D(p||q) = ∞
Beware: D(p||q) is not a true distance because:
– (1) it is asymmetric between p, q and
– (2) it does not satisfy the triangle inequality.

Relative Entropy Example
X = [1 2 3 4 5 6]ᵀ
p = [1/6 1/6 1/6 1/6 1/6 1/6] ⇒ H(p) = 2.585
q = [1/10 1/10 1/10 1/10 1/10 1/2] ⇒ H(q) = 2.161
D(p||q) = Ep(−log qx) − H(p) = 2.935 − 2.585 = 0.35
D(q||p) = Eq(−log px) − H(q) = 2.585 − 2.161 = 0.424
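A small Python sketch (ours) reproducing this example:

    import math

    def kl(p, q):
        """D(p||q) in bits; infinite if p puts mass where q does not."""
        d = 0.0
        for pi, qi in zip(p, q):
            if pi > 0:
                d += math.inf if qi == 0 else pi * math.log2(pi / qi)
        return d

    p = [1/6] * 6
    q = [1/10] * 5 + [1/2]
    print(kl(p, q))  # 0.350 bits
    print(kl(q, p))  # 0.424 bits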
Information Inequality
Information (Gibbs') Inequality: D(p||q) ≥ 0
Proof: Define A = {x : p(x) > 0} ⊆ X
−D(p||q) = −Σ_{x∈A} p(x) log [p(x)/q(x)] = Σ_{x∈A} p(x) log [q(x)/p(x)]
≤ log Σ_{x∈A} p(x) [q(x)/p(x)]   (Jensen's inequality)
= log Σ_{x∈A} q(x) ≤ log Σ_{x∈X} q(x) = log 1 = 0
If D(p||q) = 0: since log() is strictly concave, we have equality in the proof only if q(x)/p(x), the argument of log, equals a constant.
But Σ_{x∈X} p(x) = Σ_{x∈X} q(x) = 1, so the constant must be 1 and p ≡ q

Information Inequality Corollaries
• Uniform distribution has the highest entropy
– Set q = [|X|⁻¹, …, |X|⁻¹]ᵀ giving H(q) = log|X| bits
D(p||q) = Ep{−log q(x)} − H(p) = log|X| − H(p) ≥ 0
• Mutual Information is non-negative
I(x;y) = H(x) + H(y) − H(x,y) = E log [p(x,y)/(p(x)p(y))]
= D(p(x,y) || p(x)p(y)) ≥ 0
with equality only if p(x,y) ≡ p(x)p(y) ⇔ x and y are independent.

More Corollaries
• Conditioning reduces entropy
0 ≤ I(x;y) = H(y) − H(y|x) ⇒ H(y|x) ≤ H(y)
with equality only if x and y are independent.
• Independence Bound
H(x1:n) = Σ_{i=1}^{n} H(xi | x1:i−1) ≤ Σ_{i=1}^{n} H(xi)
with equality only if all the xi are independent.
E.g. if all the xi are identical, H(x1:n) = H(x1)

Conditional Independence Bound
• Conditional Independence Bound
H(x1:n | y1:n) = Σ_{i=1}^{n} H(xi | x1:i−1, y1:n) ≤ Σ_{i=1}^{n} H(xi | yi)
• Mutual Information Independence Bound
If all the xi are independent or, by symmetry, if all the yi are independent:
I(x1:n; y1:n) = H(x1:n) − H(x1:n | y1:n)
≥ Σ_{i=1}^{n} H(xi) − Σ_{i=1}^{n} H(xi | yi) = Σ_{i=1}^{n} I(xi; yi)
E.g. if n = 2 with the xi i.i.d. Bernoulli(p=0.5) and y1 = x2, y2 = x1, then I(xi;yi) = 0 but I(x1:2; y1:2) = 2 bits.

Stochastic Process
Stochastic Process {xi} = x1, x2, …
Entropy: H({xi}) = H(x1) + H(x2|x1) + ⋯, often = ∞
Entropy Rate: H(X) = lim_{n→∞} n⁻¹ H(x1:n), if the limit exists
– The entropy rate estimates the additional entropy per new sample.
– Gives a lower bound on the number of code bits per sample.
Examples:
– Typewriter with m equally likely letters each time: H(X) = log m
– xi i.i.d. random variables: H(X) = H(xi)

Stationary Process
A stochastic process {xi} is stationary iff
p(x1:n = a1:n) = p(xk+1:k+n = a1:n)  ∀ k, n, ai ∈ X
If {xi} is stationary then H(X) exists and
H(X) = lim_{n→∞} n⁻¹ H(x1:n) = lim_{n→∞} H(xn | x1:n−1)
Proof: 0 ≤ H(xn | x1:n−1) ≤(a) H(xn | x2:n−1) =(b) H(xn−1 | x1:n−2)
(a) conditioning reduces entropy, (b) stationarity
Hence H(xn | x1:n−1) is positive and decreasing ⇒ it tends to a limit, say b
Hence H(xk | x1:k−1) → b ⇒ n⁻¹ H(x1:n) = n⁻¹ Σ_{k=1}^{n} H(xk | x1:k−1) → b = H(X)

Markov Process (Chain)
A discrete-valued stochastic process {xi} is
• Independent iff p(xn | x0:n−1) = p(xn)
• Markov iff p(xn | x0:n−1) = p(xn | xn−1)
– time-invariant iff p(xn=b | xn−1=a) = p_ab, independent of n
– States; transition matrix T = {t_ab}
• Rows sum to 1: T1 = 1, where 1 is a vector of 1's
• pn = Tᵀ pn−1
• Stationary distribution: p∞ = Tᵀ p∞
(Figure: four-state transition diagram with edges t12, t13, t14, t24, t34, t43.)
An independent stochastic process is easiest to deal with; Markov is next easiest

Stationary Markov Process
If a Markov process is
a) irreducible: you can go from any state a to any state b in a finite number of steps
b) aperiodic: ∀ state a, the possible times to go from a to a have highest common factor = 1
then it has exactly one stationary distribution, p∞.
– p∞ is the eigenvector of Tᵀ with λ = 1: Tᵀ p∞ = p∞
– Tⁿ → 1 p∞ᵀ as n → ∞, where 1 = [1 1 ⋯ 1]ᵀ
– The initial distribution becomes irrelevant (asymptotically stationary): (Tⁿ)ᵀ p0 → p∞ ∀ p0

Chess Board
Random Walk
• A piece moves to one of the adjacent squares of a 3×3 board (including diagonals) with equal probability
• p1 = [1 0 … 0]ᵀ
– H(p1) = 0
• p∞ = 1/40 × [3 5 3 5 8 5 3 5 3]ᵀ
– H(p∞) = 3.0855
• H(X) = lim_{n→∞} H(xn | xn−1)
= −lim_{n→∞} Σ p(xn, xn−1) log p(xn | xn−1) = −Σ_{i,j} p∞,i t_{i,j} log t_{i,j} = 2.2365
Chess Board Frames

n            1    2        3        4        5        6        7        8
H(pn)        0    1.58496  3.10287  2.99553  3.111    3.07129  3.09141  3.0827
H(pn|pn−1)   0    1.58496  2.54795  2.09299  2.30177  2.20683  2.24987  2.23038

(H(pn) → H(p∞) = 3.0855 and H(pn|pn−1) → H(X) = 2.2365 as n increases.)
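The stationary distribution and entropy rate can be checked numerically. A minimal sketch (ours), assuming, as the degrees 3/5/8 suggest, a move to any adjacent square of the 3×3 board including diagonals:

    import numpy as np

    n = 3
    cells = [(r, c) for r in range(n) for c in range(n)]
    T = np.zeros((n * n, n * n))
    for i, (r, c) in enumerate(cells):
        nbrs = [j for j, (r2, c2) in enumerate(cells)
                if max(abs(r - r2), abs(c - c2)) == 1]
        for j in nbrs:
            T[i, j] = 1 / len(nbrs)           # equal prob over the neighbours

    # Stationary distribution: fixed point of p <- T^T p
    p = np.ones(n * n) / (n * n)
    for _ in range(1000):
        p = T.T @ p
    print(p * 40)                             # [3 5 3 5 8 5 3 5 3]
    print(-np.sum(p * np.log2(p)))            # H(p_inf) = 3.0855 bits

    # Entropy rate: -sum_{i,j} p_i t_ij log t_ij
    logs = np.log2(T, where=T > 0, out=np.zeros_like(T))
    print(-np.sum(p[:, None] * T * logs))     # H(X) = 2.2365 bits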
Summary
• Relative Entropy: D(p||q) = Ep log [p(x)/q(x)] ≥ 0
– D(p||q) = 0 iff p ≡ q
• Corollaries
– Uniform Bound: uniform p maximizes H(p)
– I(x;y) ≥ 0 ⇒ conditioning reduces entropy
– Independence bounds:
H(x1:n) ≤ Σ_{i=1}^{n} H(xi)
H(x1:n | y1:n) ≤ Σ_{i=1}^{n} H(xi | yi)
I(x1:n; y1:n) ≥ Σ_{i=1}^{n} I(xi; yi) if the xi or the yi are independent
• Entropy Rate of a stochastic process:
– {xi} stationary: H(X) = lim_{n→∞} H(xn | x1:n−1)
– {xi} stationary Markov: H(X) = H(xn | xn−1) = −Σ_{i,j} p∞,i t_{i,j} log t_{i,j}

Lecture 4
• Source Coding Theorem
– n i.i.d. random variables, each with entropy H(X), can be compressed into little more than nH(X) bits as n tends to infinity
• Instantaneous Codes
– Symbol-by-symbol coding
– Uniquely decodable
• Kraft Inequality
– Constraint on the code lengths
• Optimal Symbol Code Lengths
– Entropy bound

Source Coding
• Source Code: C is a mapping X → D⁺
– X a random variable of the message
– D⁺ = the set of all finite-length strings from D
– D is often binary, e.g. {E, F, G} → {0,1}⁺: C(E)=0, C(F)=10, C(G)=11
• Extension: C⁺ is the mapping X⁺ → D⁺ formed by concatenating the C(xi) without punctuation
– e.g. C⁺(EFEEGE) = 01000110

Desired Properties
• Non-singular: x1 ≠ x2 ⇒ C(x1) ≠ C(x2)
– Unambiguous description of a single letter of X
• Uniquely Decodable: C⁺ is non-singular
– The sequence C⁺(x⁺) is unambiguous
– A stronger condition: any encoded string has only one possible source string producing it
– However, one may have to examine the entire encoded string to determine even the first source symbol
– One could use punctuation between codewords, but this is inefficient

Instantaneous Codes
• Instantaneous (or Prefix) Code
– No codeword is a prefix of another
– Can be decoded instantaneously without reference to future codewords
• Instantaneous ⇒ Uniquely Decodable ⇒ Non-singular
Examples:
– C(E,F,G,H) = (0, 1, 00, 11): neither instantaneous nor uniquely decodable
– C(E,F) = (0, 101): instantaneous and uniquely decodable
– C(E,F) = (1, 101): uniquely decodable but not instantaneous
– C(E,F,G,H) = (00, 01, 10, 11): instantaneous and uniquely decodable
– C(E,F,G,H) = (0, 01, 011, 111): uniquely decodable but not instantaneous

Code Tree
Instantaneous code: C(E,F,G,H) = (00, 11, 100, 101)
Form a D-ary tree where D = |D|
– D branches at each node
– Each codeword is a leaf
– Each node along the path to a leaf is a prefix of the leaf ⇒ it can't be a leaf itself
– Some leaves may be unused
(Figure: binary code tree with leaves E=00, G=100, H=101, F=11.)
111011000000 → F H G E E
Kraft Inequality (instantaneous codes)
• Limit on the codeword lengths of instantaneous codes: not all codewords can be short
• Codeword lengths l1, l2, …, l|X| ⇒ Σ_{i=1}^{|X|} 2^−li ≤ 1
• Label each node at depth l with 2^−l
• Each node equals the sum of all its leaves
• Equality iff all leaves are utilised
• Total code budget = 1
– Code 00 uses up ¼ of the budget
– Code 100 uses up 1/8 of the budget
The same argument works with a D-ary tree
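A one-line check of the Kraft inequality in Python (illustration only):

    def kraft_sum(lengths, D=2):
        """Sum of D^-l; an instantaneous D-ary code with these lengths exists iff <= 1."""
        return sum(D ** -l for l in lengths)

    print(kraft_sum([2, 2, 3, 3]))  # code (00, 11, 100, 101): 1.0, budget fully used
    print(kraft_sum([1, 2, 3]))     # 0.875 < 1: unused leaves remain
    print(kraft_sum([1, 1, 2]))     # 1.25 > 1: no instantaneous code exists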
McMillan Inequality (uniquely decodable codes)
If a uniquely decodable code C has codeword lengths l1, l2, …, l|X|, then Σ_{i=1}^{|X|} D^−li ≤ 1 (the same bound as the Kraft inequality).
Proof: Let S = Σ_{i=1}^{|X|} D^−li and M = max li. Then for any N,
S^N = Σ_{i1=1}^{|X|} ⋯ Σ_{iN=1}^{|X|} D^−(l_{i1} + l_{i2} + ⋯ + l_{iN}) = Σ_{x∈X^N} D^−length{C⁺(x)}   (sum over all sequences of length N)
= Σ_{l=1}^{NM} D^−l |{x : length{C⁺(x)} = l}|   (re-order the sum by total encoded length)
≤ Σ_{l=1}^{NM} D^−l D^l = Σ_{l=1}^{NM} 1 = NM   (unique decodability ⇒ at most D^l distinct sequences of encoded length l)
If S > 1 then S^N > NM for some N. Hence S ≤ 1.
Implication: uniquely decodable codes don't offer any further reduction of codeword lengths over instantaneous codes
How Short are Optimal Codes?
If l(x) = length(C(x)), then C is optimal if L = E l(x) is as small as possible.
We want to minimize Σ_{x∈X} p(x) l(x) subject to
1. Σ_{x∈X} D^−l(x) ≤ 1
2. all the l(x) are integers
Simplified version: ignore condition 2 and assume condition 1 is satisfied with equality.
This is less restrictive, so the lengths may be shorter than actually possible ⇒ a lower bound
Optimal Codes (non-integer li)
• Minimize Σ_{i=1}^{|X|} p(xi) li subject to Σ_{i=1}^{|X|} D^−li = 1
Use a Lagrange multiplier:
Define J = Σ_{i=1}^{|X|} p(xi) li + λ Σ_{i=1}^{|X|} D^−li and set ∂J/∂li = 0
∂J/∂li = p(xi) − λ ln(D) D^−li = 0 ⇒ D^−li = p(xi)/(λ ln D)
Also Σ_{i=1}^{|X|} D^−li = 1 ⇒ λ = 1/ln(D) ⇒ li = −log_D(p(xi))
E l(x) = E [−log_D p(x)] = E [−log2 p(x)] / log2 D = H(x)/log2 D = H_D(x)
No uniquely decodable code can do better than this.

Bounds on Optimal Code Length
Round up the optimal code lengths: li = ⌈−log_D p(xi)⌉
• These li satisfy the Kraft inequality (since the optimum non-integer lengths do)
• For this choice, −log_D p(xi) ≤ li < −log_D p(xi) + 1
• Average shortest length: H_D(x) ≤ L* < H_D(x) + 1 (since we added < 1 to the optimum values)
• We can do better by encoding blocks of n symbols:
n⁻¹ H_D(x1:n) ≤ n⁻¹ E l(x1:n) ≤ n⁻¹ H_D(x1:n) + n⁻¹
• If the entropy rate of {xi} exists (⇐ {xi} is a stationary process):
n⁻¹ H_D(x1:n) → H_D(X) ⇒ n⁻¹ E l(x1:n) → H_D(X)
Also known as the source coding theorem

Block Coding Example
X = [A; B], px = [0.9; 0.1]
H(xi) = 0.469
Huffman coding:
• n=1: symbols A, B; probs 0.9, 0.1; codes 0, 1; n⁻¹ E l = 1
• n=2: symbols AA, AB, BA, BB; probs 0.81, 0.09, 0.09, 0.01; codes 0, 11, 100, 101; n⁻¹ E l = 0.645
• n=3: symbols AAA, AAB, …, BBA, BBB; probs 0.729, 0.081, …, 0.009, 0.001; codes 0, 101, …, 10010, 10011; n⁻¹ E l = 0.583
The extra 1-bit inefficiency becomes insignificant for large blocks
(Figure: n⁻¹ E l against n, decreasing towards H(xi).)

Summary
• McMillan Inequality for D-ary codes: any uniquely decodable C has Σ_{i=1}^{|X|} D^−li ≤ 1
• Any uniquely decodable code: E l(x) ≥ H_D(x)
• Source coding theorem
• Symbol-by-symbol encoding: H_D(x) ≤ E l(x) ≤ H_D(x) + 1
• Block encoding: n⁻¹ E l(x1:n) → H_D(X)

Lecture 5
• Source Coding Algorithms
• Huffman Coding
• Lempel-Ziv Coding

Huffman Code
An optimal binary instantaneous code must satisfy:
1. p(xi) > p(xj) ⇒ li ≤ lj (else swap the codewords)
2. The two longest codewords have the same length (else chop a bit off the longer codeword)
3. ∃ two longest codewords differing only in the last bit (else chop a bit off all of them)
Huffman Code construction:
1. Take the two smallest p(xi) and assign each a different last bit. Then merge them into a single symbol.
2. Repeat step 1 until only one symbol remains.
Used in JPEG, MP3, …

Huffman Code Example
X = [a, b, c, d, e], px = [0.25 0.25 0.2 0.15 0.15]
(Figure: the successive Huffman merges.)
Read the diagram backwards for the codewords:
C(X) = [01 10 11 000 001], L = 2.3, H(x) = 2.286
For a D-ary code, first add extra zero-probability symbols until |X|−1 is a multiple of D−1, and then group D symbols at a time
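A compact Huffman construction in Python (a sketch of the merge procedure above; the codewords may differ from the slide's, but the lengths and average length agree):

    import heapq

    def huffman(probs):
        """Binary Huffman codewords for a list of probabilities."""
        heap = [(p, i, (i,)) for i, p in enumerate(probs)]  # (prob, id, symbols)
        codes = {i: '' for i in range(len(probs))}
        heapq.heapify(heap)
        while len(heap) > 1:
            p0, _, s0 = heapq.heappop(heap)   # two smallest probabilities
            p1, _, s1 = heapq.heappop(heap)
            for i in s0: codes[i] = '0' + codes[i]  # assign a different
            for i in s1: codes[i] = '1' + codes[i]  # last bit, then merge
            heapq.heappush(heap, (p0 + p1, min(s0 + s1), s0 + s1))
        return [codes[i] for i in range(len(probs))]

    p = [0.25, 0.25, 0.2, 0.15, 0.15]
    c = huffman(p)
    print(c)                                          # lengths [2, 2, 2, 3, 3]
    print(sum(pi * len(ci) for pi, ci in zip(p, c)))  # average length L = 2.3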
Huffman Code is Optimal Instantaneous Code
The Huffman traceback gives codes for progressively larger alphabets:
p2 = [0.55 0.45], c2 = [0 1], L2 = 1
p3 = [0.45 0.3 0.25], c3 = [1 00 01], L3 = 1.55
p4 = [0.3 0.25 0.25 0.2], c4 = [00 01 10 11], L4 = 2
p5 = [0.25 0.25 0.2 0.15 0.15], c5 = [01 10 11 000 001], L5 = 2.3
We want to show that all these codes are optimal, including c5
Huffman Optimality Proof
Suppose one of these codes is sub-optimal:
– ∃ m > 2 with cm the first sub-optimal code (note that c2 is definitely optimal)
– An optimal c′m must have L_c′m < L_cm
– Rearrange the symbols with the longest codes in c′m so that the two lowest probabilities pi and pj differ only in the last digit (doesn't change optimality)
– Merge xi and xj to create a new code c′m−1 as in the Huffman procedure
– L_c′m−1 = L_c′m − pi − pj, since the codes are identical except 1 bit shorter with probability pi + pj
– But also L_cm−1 = L_cm − pi − pj, hence L_c′m−1 < L_cm−1, which contradicts the assumption that cm is the first sub-optimal code
Hence Huffman coding satisfies H_D(x) ≤ L < H_D(x) + 1
Note: Huffman is just one out of many possible optimal codes

Shannon-Fano Code
Fano code:
1. Put the probabilities in decreasing order
2. Split as close to 50-50 as possible; repeat with each half

a  0.20  00
b  0.19  010
c  0.17  011
d  0.15  100
e  0.14  101
f  0.06  110
g  0.05  1110
h  0.04  1111

H(x) = 2.81 bits, L_SF = 2.89 bits
Not necessarily optimal: the best code for this p actually has L = 2.85 bits

Shannon versus Huffman
Shannon encoding: order p(x1) ≥ p(x2) ≥ ⋯ ≥ p(xm), define Fi = Σ_{k=1}^{i−1} p(xk),
and round the number Fi ∈ [0,1] to li = ⌈−log p(xi)⌉ bits
H_D(x) ≤ L_S ≤ H_D(x) + 1 (exercise)
Example:
px = [0.36 0.34 0.25 0.05] ⇒ H(x) = 1.78 bits
−log2 px = [1.47 1.56 2 4.32]
lS = ⌈−log2 px⌉ = [2 2 2 5]
LS = 2.15 bits
Huffman:
lH = [1 2 3 3] ⇒ LH = 1.94 bits
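A short sketch comparing the two length assignments (ours):

    import math

    p = [0.36, 0.34, 0.25, 0.05]
    shannon = [math.ceil(-math.log2(pi)) for pi in p]
    print(shannon)                                     # [2, 2, 2, 5]
    print(sum(pi * li for pi, li in zip(p, shannon)))  # L_S = 2.15 bits
    huffman = [1, 2, 3, 3]                             # from the Huffman procedure
    print(sum(pi * li for pi, li in zip(p, huffman)))  # L_H = 1.94 bits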
Individual codewords may be longer in Huffman than in Shannon, but not on average.

Issues with Huffman Coding
• Requires the probability distribution of the source
– Must recompute the entire code if any symbol probability changes
• A block of N symbols needs |X|^N pre-calculated probabilities
• For many practical applications, however, the underlying probability distribution is unknown
– Estimate the distribution
• Arithmetic coding: an extension of Shannon-Fano coding; can deal with large block lengths
– Without the distribution
• Universal coding: Lempel-Ziv coding

Universal Coding
• Does not depend on the distribution of the source
• Compression of an individual sequence
• Run-length coding
– Runs of data are stored (e.g., in fax machines)
Example: WWWWWWWWWBBWWWWWWWBBBBBBWW → 9W2B7W6B2W
• Lempel-Ziv coding
– A generalization that takes advantage of runs of strings of characters (such as WWWWWWWWWBB)
– Adaptive dictionary compression algorithm
– Asymptotically optimum: achieves the entropy rate for any stationary ergodic source

Lempel-Ziv Coding (LZ78)
Memorize previously occurring substrings in the input data:
– parse the input into the shortest possible distinct 'phrases', i.e., each phrase is the shortest phrase not seen earlier
– number the phrases starting from 1 (0 is the empty string)
A | B | AA | BA | BAB | BB | AB | …
1   2   3    4    5     6    7
– each phrase consists of a previously occurring phrase (the head) followed by an additional A or B (the tail)
– encoding: give the location of the head followed by the additional symbol for the tail:
0A 0B 1A 2A 4B 2B 1B …
– the decoder builds an identical dictionary (the locations are underlined)

Lempel-Ziv Example
Input = 1 0 11 01 010 … (a long binary string parsed into phrases)
(Figure: the dictionary built up by the encoder, the transmitted stream, and the decoded output.)
Remarks:
• No need to send the dictionary (imagine zip and unzip!)
– It can be reconstructed by the decoder
• The 0's in 01, 010 and 001 must be sent to avoid ambiguity (i.e., to keep the code instantaneous)
• No need to always send 4 bits per location: the number of location bits grows with the dictionary size
The dictionary D contains K entries D(0), …, D(K−1). We need to send M = ⌈log K⌉ bits to specify a dictionary entry. Initially K = 1, D(0) = φ = the null string, and M = ⌈log K⌉ = 0 bits.
Input Action 1 “1” ∉D so send “ 1” and set D(1)= “1”. Now K=2 ⇒ M=1 . 0 “0” ∉D so split it up as “ φ”+”0” and send location “ 0” (since D(0)= φ) followed by “ 0”. Then set D(2)= “0” making K=3 ⇒ M=2 . 1 “1” ∈ D so don’t send anything yet – just read the next input bit. 1 “11” ∉D so split it up as “1” + “1” and send location “ 01 ” (since D(1)= “1” and M=2 ) followed by “ 1”. Then set D(3)= “11” making K=4 ⇒ M=2 . 0 “0” ∈ D so don’t send anything yet – just read the next input bit. 1 “01” ∉D so split it up as “0” + “1” and send location “ 10 ” (since D(2)= “0” and M=2 ) followed by “ 1”. Then set D(4)= “01” making K=5 ⇒ M=3 . 0 “0” ∈ D so don’t send anything yet – just read the next input bit. 1 “01” ∈ D so don’t send anything yet – just read the next input bit. 0 “010” ∉D so split it up as “01” + “0” and send location “ 100 ” (since D(4)= “01” and M=3 ) followed by “ 0”. Then set D(5)= “010” making K=6 ⇒ M=3 .
So far we have sent 1 00 011 101 1000, where each group is a dictionary location (M bits) followed by the new symbol.
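A minimal LZ78 parser in Python (a sketch of the phrase-splitting logic only; the packing of locations into ⌈log2 K⌉ bits is omitted):

    def lz78_parse(bits):
        """LZ78 parsing: return (head location, tail symbol) pairs."""
        dictionary = {'': 0}          # D(0) = the empty string
        phrase, out = '', []
        for b in bits:
            if phrase + b in dictionary:
                phrase += b           # keep extending the current phrase
            else:
                out.append((dictionary[phrase], b))
                dictionary[phrase + b] = len(dictionary)
                phrase = ''
        return out

    print(lz78_parse('101101010'))
    # [(0,'1'), (0,'0'), (1,'1'), (2,'1'), (4,'0')]
    # i.e. phrases 1 | 0 | 11 | 01 | 010, matching the walkthrough above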
Lempel-Ziv Properties
• Simple to implement
• Widely used because of its speed and efficiency
– applications: compress, gzip, GIF, TIFF, modem, …
– variations: LZW (treats the last character of the current phrase as part of the next phrase; used in Adobe Acrobat), LZ77 (sliding window)
– different dictionary handling, etc.
• Excellent compression in practice
– many files contain repetitive sequences
– worse than arithmetic coding for text files

Asymptotic Optimality
• Asymptotically optimum for a stationary ergodic source (i.e. achieves the entropy rate)
• Let c(n) denote the number of phrases for a sequence of length n
• The compressed sequence consists of c(n) pairs (location, last bit)
• Needs c(n)[log c(n) + 1] bits in total
• {Xi} stationary ergodic ⇒
limsup_{n→∞} n⁻¹ l(X1:n) = limsup_{n→∞} c(n)[log c(n) + 1]/n ≤ H(X) with probability 1
– Proof: C&T chapter 12.10
– may only approach this for an enormous file

Summary
• Huffman Coding: H_D(x) ≤ E l(x) ≤ H_D(x) + 1
– Bottom-up design
– Optimal ⇒ shortest average length
• Shannon-Fano Coding: H_D(x) ≤ E l(x) ≤ H_D(x) + 1
– Intuitively natural top-down design
• Lempel-Ziv Coding
– Does not require the probability distribution
– Asymptotically optimum for a stationary ergodic source (i.e. achieves the entropy rate)

Lecture 6
• Markov Chains
– Here the term has a special meaning; not to be confused with the standard definition of Markov chains (which are sequences of discrete random variables)
• Data Processing Theorem
– You can't create information from nothing
• Fano's Inequality
– Lower bound on the error when estimating x from y

Markov Chains
If we have three random variables x, y, z:
p(x,y,z) = p(z|x,y) p(y|x) p(x)
They form a Markov chain x→y→z if
p(z|x,y) = p(z|y) ⇔ p(x,y,z) = p(z|y) p(y|x) p(x)
A Markov chain x→y→z means that
– the only way that x affects z is through the value of y
– if you already know y, then observing x gives you no additional information about z, i.e. I(x;z|y) = 0 ⇔ H(z|y) = H(z|x,y)
– if you know y, then observing z gives you no additional information about x.

Data Processing
• Estimate z = f(y), where f is a function
• A special case of a Markov chain x → y → f(y)
x → Channel → y → Estimator → z
• Does processing of y increase the information that y contains about x?

Markov Chain Symmetry
If x→y→z:
p(x,z|y) = p(x,y,z)/p(y) =(a) p(x,y) p(z|y)/p(y) = p(x|y) p(z|y)
(a) p(z|x,y) = p(z|y)
Hence x and z are conditionally independent given y.
Also x→y→z iff z→y→x, since
p(x|y,z) = p(x,y,z)/p(y,z) = p(x,z|y) p(y)/p(y,z) =(a) p(x|y) p(z|y) p(y)/p(y,z) = p(x|y)
The Markov chain property is symmetrical.

Data Processing Theorem
If x→y→z then I(x;y) ≥ I(x;z)
– processing y cannot add new information about x
If x→y→z then I(x;y) ≥ I(x;y|z)
– knowing z does not increase the amount y tells you about x
Proof: apply the chain rule in different ways:
I(x;y,z) = I(x;y) + I(x;z|y) = I(x;z) + I(x;y|z)
but I(x;z|y) = 0 (a), hence I(x;y) = I(x;z) + I(x;y|z)
so I(x;y) ≥ I(x;z) and I(x;y) ≥ I(x;y|z)
(a) I(x;z|y) = 0 iff x and z are conditionally independent given y; Markov ⇒ p(x,z|y) = p(x|y) p(z|y)

So Why Processing?
• One cannot create information by manipulating the data
• But no information is lost if equality holds
• Sufficient statistic
– z contains all the information in y about x
– Preserves mutual information: I(x;y) = I(x;z)
• The estimator should be designed so that it outputs a sufficient statistic
• Can the estimation be arbitrarily accurate?

Fano's Inequality
If we estimate x from y, what is pe = p(x̂ ≠ x)?   (x → y → x̂)
H(x|y) ≤ H(pe) + pe log|X|
⇒ pe ≥ [H(x|y) − H(pe)] / log|X| ≥(a) [H(x|y) − 1] / log|X|
(a) the second form is weaker but easier to use
Proof: define a random variable e = 1 if x̂ ≠ x, e = 0 if x̂ = x
H(x,e|x̂) = H(e|x̂) + H(x|e,x̂) = H(x|x̂) + H(e|x,x̂)   (chain rule)
⇒ H(x|x̂) ≤ H(e) + H(x|e,x̂)   (H(e|x,x̂) = 0; H(e|x̂) ≤ H(e))
= H(pe) + (1−pe) H(x|x̂, e=0) + pe H(x|x̂, e=1)   (H(e) = H(pe))
≤ H(pe) + (1−pe)×0 + pe log|X|   (e=0 ⇒ x = x̂)
Finally H(x|y) ≤ H(x|x̂), since I(x;x̂) ≤ I(x;y) (Markov chain)

Implications
• Zero probability of error: pe = 0 ⇒ H(x|y) = 0
• Low probability of error if H(x|y) is small
• If H(x|y) is large then the probability of error is high
• The bound can be slightly strengthened to
H(x|y) ≤ H(pe) + pe log(|X| − 1)
• Fano's inequality is used whenever you need to show that errors are inevitable
– E.g., the converse to the channel coding theorem

Fano Example
X = {1:5}, px = [0.35, 0.35, 0.1, 0.1, 0.1]ᵀ
Y = {1:2}: if x ≤ 2 then y = x with probability 6/7, while if x > 2 then y = 1 or 2 with equal probability.
Our best strategy is to guess x̂ = y (x → y → x̂)
– px|y=1 = [0.6, 0.1, 0.1, 0.1, 0.1]ᵀ
– actual error prob: pe = 0.4
Fano bound: pe ≥ [H(x|y) − 1] / log(|X| − 1) = (1.771 − 1)/log(4) = 0.3855 (exercise)
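A quick numerical check of the bound (ours):

    import math

    def h2(p):
        return 0 if p in (0, 1) else -p*math.log2(p) - (1-p)*math.log2(1-p)

    Hx_given_y = 1.771               # H(x|y) for the example above
    X = 5
    print((Hx_given_y - 1) / math.log2(X - 1))   # weak form: 0.3855

    # The tight form H(x|y) <= H(pe) + pe*log(|X|-1), solved by scanning pe:
    pe = 0.0
    while h2(pe) + pe * math.log2(X - 1) < Hx_given_y:
        pe += 1e-4
    print(pe)                        # smallest pe consistent with Fano (~0.40)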
Main use: to show when error-free transmission is impossible, since then pe > 0

Summary
• Markov: x→y→z ⇔ p(z|x,y) = p(z|y) ⇔ I(x;z|y) = 0
• Data Processing Theorem: if x→y→z then
– I(x;y) ≥ I(x;z), I(y;z) ≥ I(x;z)
– I(x;y) ≥ I(x;y|z) (can be false if not Markov)
– Long Markov chains: if x1→x2→x3→x4→x5→x6, then mutual information increases as you get closer together:
e.g. I(x3;x4) ≥ I(x2;x4) ≥ I(x1;x5) ≥ I(x1;x6)
• Fano's Inequality: if x → y → x̂ then
pe ≥ [H(x|y) − H(pe)] / log(|X|−1) ≥ [H(x|y) − 1] / log(|X|−1) ≥ [H(x|y) − 1] / log|X|
(the last form is weaker but easier to use since it is independent of pe)

Lecture 7
• Law of Large Numbers
– The sample mean is close to the expected value
• Asymptotic Equipartition Principle (AEP)
– −log p(x1, x2, …, xn)/n is close to the entropy H
• The Typical Set
– Probability of each sequence close to 2^−nH
– Size (≈ 2^nH) and total probability (≈ 1)
• The Atypical Set
– Unimportant and can be ignored

Typicality: Example
X = {a, b, c, d}, p = [0.5 0.25 0.125 0.125]
−log p = [1 2 3 3] ⇒ H(p) = 1.75 bits
Sample eight i.i.d. values:
• typical ⇒ correct proportions:
adbabaac: −log p(x) = 14 = 8 × 1.75 = nH(x)
• not typical ⇒ −log p(x) ≠ nH(x):
dddddddd: −log p(x) = 24

Convergence of Random Variables
• Convergence:
xn → y ⇔ ∀ε > 0, ∃m such that ∀n > m, |xn − y| < ε
Example: xn = ±2⁻ⁿ, y = 0: choose m = −log ε
• Convergence in probability (weaker than convergence):
xn →(prob) y ⇔ ∀ε > 0, P(|xn − y| > ε) → 0
Example: xn ∈ {0; 1}, p = [1 − n⁻¹; n⁻¹]
for any small ε, p(|xn| > ε) = n⁻¹ → 0 as n → ∞
so xn →(prob) 0 (but not xn → 0)
Note: y can be a constant or another random variable

Law of Large Numbers
Given i.i.d. {xi}, the sample mean is sn = n⁻¹ Σ_{i=1}^{n} xi
– E sn = E x = µ,  Var sn = n⁻¹ Var x = n⁻¹σ²
As n increases, Var sn gets smaller and the values become clustered around the mean.
LLN: sn →(prob) µ
⇔ ∀ε > 0, P(|sn − µ| > ε) → 0 as n → ∞
The expected value of a random variable is equal to the long-term average when sampling repeatedly.

Asymptotic Equipartition Principle
• x is the i.i.d. sequence {xi} for 1 ≤ i ≤ n
– Prob of a particular sequence is p(x) = Π_{i=1}^{n} p(xi)
– Average: E [−log p(x)] = n E [−log p(xi)] = nH(x)
• AEP: −n⁻¹ log p(x) →(prob) H(x)
• Proof: −n⁻¹ log p(x) = −n⁻¹ Σ_{i=1}^{n} log p(xi) →(prob) E [−log p(xi)] = H(x)
(law of large numbers)

Typical Set
Typical set (for finite n):
Tε(n) = {x ∈ Xⁿ : |−n⁻¹ log p(x) − H(x)| < ε}
Example:
– xi Bernoulli with p(xi=1) = p
– e.g. p([0 1 1 0 0 0]) = p²(1−p)⁴
– For p = 0.2, H(X) = 0.72 bits
(Figure: histogram of −n⁻¹ log p(x); the red bar shows T0.1(n).)

Typical Set Frames
Total probability of T0.1(n) for p = 0.2 at increasing block lengths:
n:     1    2    4    8    16   32   64   128
p_T:   0%   0%   41%  29%  45%  62%  72%  85%
(Figures: one histogram of −n⁻¹ log p(x) per value of n, with sample sequences; the probability of the typical set grows towards 1.)

Typical Set: Properties
1. Individual prob: x ∈ Tε(n) ⇒ log p(x) = −nH(x) ± nε
2. Total prob: p(x ∈ Tε(n)) > 1−ε for n > Nε
3. Size: (1−ε) 2^{n(H(x)−ε)} < |Tε(n)| ≤ 2^{n(H(x)+ε)}   (lower bound for n > Nε)
Proof 2: −n⁻¹ log p(x) = n⁻¹ Σ_{i=1}^{n} −log p(xi) →(prob) E [−log p(xi)] = H(x)
Hence ∀ε > 0 ∃Nε s.t. ∀n > Nε: p(|−n⁻¹ log p(x) − H(x)| > ε) < ε
Proof 3a: for large enough n, 1−ε < p(x ∈ Tε(n)) ≤ Σ_{x∈Tε(n)} 2^{−n(H(x)−ε)} = 2^{−n(H(x)−ε)} |Tε(n)|
Proof 3b: 1 = Σ_x p(x) ≥ Σ_{x∈Tε(n)} p(x) ≥ Σ_{x∈Tε(n)} 2^{−n(H(x)+ε)} = 2^{−n(H(x)+ε)} |Tε(n)|
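A Monte-Carlo sketch (ours) of property 2 for the Bernoulli example; boundary cases mean the exact percentages may differ slightly from the frames above:

    import math, random

    def typical_fraction(n, p=0.2, eps=0.1, trials=20000):
        """Estimate p(x in T) for i.i.d. Bernoulli(p) sequences of length n."""
        H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)   # 0.722 bits
        hits = 0
        for _ in range(trials):
            k = sum(random.random() < p for _ in range(n))   # number of ones
            logp = k * math.log2(p) + (n - k) * math.log2(1 - p)
            hits += abs(-logp / n - H) < eps
        return hits / trials

    for n in (8, 32, 128, 512):
        print(n, typical_fraction(n))   # total probability grows towards 1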
Consequence
• For any ε and for n > Nε:
p(x ∈ Tε(n)) > 1−ε and log p(x) = −nH(x) ± nε
"Almost all events are almost equally surprising"
Coding consequence (|X|ⁿ sequences in all, at most 2^{n(H(x)+ε)} of them typical):
– x ∈ Tε(n): send '0' + at most 1 + n(H+ε) bits
– x ∉ Tε(n): send '1' + at most 1 + n log|X| bits
L = average code length
≤ p(x ∈ Tε(n)) [2 + n(H+ε)] + p(x ∉ Tε(n)) [2 + n log|X|]
≤ n(H + ε) + εn log|X| + 2 + 2ε = n(H + ε′), where ε′ = ε + ε log|X| + (2+2ε)n⁻¹

Source Coding & Data Compression
For any choice of ε > 0 we can, by choosing the block size n large enough, do the following:
• make a lossless code using only H(x) + ε bits per symbol on average: n⁻¹L ≤ H + ε
• The coding is one-to-one and decodable
– However impractical due to exponential complexity
• Typical sequences have short descriptions of length ≈ nH
– Another proof of the source coding theorem (Shannon's original proof)
• However, encoding/decoding complexity is exponential in n

Smallest High-Probability Set
Tε(n) is a small subset of Xⁿ containing most of the probability mass. Can you get even smaller?
For any 0 < ε < 1, choose N0 = −ε⁻¹ log ε; then for any n > max(N0, Nε) and any subset S(n) satisfying |S(n)| < 2^{n(H(x)−2ε)}:
p(x ∈ S(n)) = p(x ∈ S(n) ∩ Tε(n)) + p(x ∈ S(n) ∩ Tε(n)ᶜ)
< |S(n)| max_{x∈Tε(n)} p(x) + p(x ∉ Tε(n))
< 2^{n(H−2ε)} 2^{−n(H−ε)} + ε   (for n > Nε)
= 2^{−nε} + ε < 2ε   (for n > N0, since then 2^{−nε} < 2^{log ε} = ε)
Answer: No

Summary
• Typical Set
– Individual prob: x ∈ Tε(n) ⇒ log p(x) = −nH(x) ± nε
– Total prob: p(x ∈ Tε(n)) > 1−ε for n > Nε
– Size: (1−ε) 2^{n(H(x)−ε)} < |Tε(n)| ≤ 2^{n(H(x)+ε)}   (lower bound for n > Nε)
• No other high-probability set can be much smaller than Tε(n)
• Asymptotic Equipartition Principle
– Almost all event sequences are equally surprising
• Can be used to prove the source coding theorem

Lecture 8
• Channel Coding
• Channel Capacity
– The highest rate in bits per channel use that can be transmitted reliably ♦
– The maximum mutual information
• Discrete Memoryless Channels
– Symmetric channels
– Channel capacity
• Binary Symmetric Channel
• Binary Erasure Channel
• Asymmetric Channel
♦ = proved in the channel coding theorem

Model of Digital Communication
In → Compress → Encode → Noisy Channel → Decode → Decompress → Out
(source coding: Compress/Decompress; channel coding: Encode/Decode)
• Source Coding
– Compresses the data to remove redundancy
• Channel Coding
– Adds redundancy/structure to protect against channel errors

Discrete Memoryless Channel
• Input x ∈ X, output y ∈ Y:  x → Noisy Channel → y
• Time-invariant transition-probability matrix:
(Qy|x)i,j = p(y = yj | x = xi)
– Hence py = Qᵀy|x px
– Q: each row sums to 1; the average column sum is |X||Y|⁻¹
• Memoryless: p(yn | x1:n, y1:n−1) = p(yn | xn)
• DMC = Discrete Memoryless Channel

Binary Channels
• Binary Symmetric Channel
– X = [0 1], Y = [0 1]
– Q = [1−f f; f 1−f]
• Binary Erasure Channel
– X = [0 1], Y = [0 ? 1]
– Q = [1−f f 0; 0 f 1−f]
• Z Channel
– X = [0 1], Y = [0 1]
– Q = [1 0; f 1−f]
Symmetric: the rows are permutations of each other; the columns are permutations of each other.
Weakly symmetric: the rows are permutations of each other; the columns all have the same sum.

Weakly Symmetric Channels
Weakly symmetric:
1. All columns of Q have the same sum = |X||Y|⁻¹
– If x is uniform (i.e. p(x) = |X|⁻¹) then y is uniform:
p(y) = Σ_{x∈X} p(y|x)p(x) = |X|⁻¹ Σ_{x∈X} p(y|x) = |X|⁻¹ × |X||Y|⁻¹ = |Y|⁻¹
2. All rows are permutations of each other
– Each row of Q has the same entropy, so
H(y|x) = Σ_{x∈X} p(x) H(y|x=x) = H(Q1,:) Σ_{x∈X} p(x) = H(Q1,:)
where H(Q1,:) is the entropy of the first (or any other) row of the Q matrix
Symmetric:
1. All rows are permutations of each other
2. All columns are permutations of each other
Symmetric ⇒ weakly symmetric

Channel Capacity
• Capacity of a DMC channel: C = max_{px} I(x;y)
– Mutual information (not entropy itself) is what can be transmitted through the channel ♦
– The maximum is over all possible input distributions px
– ∃ only one maximum, since I(x;y) is concave in px for fixed py|x
– We want to find the px that maximizes I(x;y)
– Limits on C: 0 ≤ C ≤ min(H(x), H(y)) ≤ min(log|X|, log|Y|)
• Capacity for n uses of the channel:
C(n) = max_{px1:n} n⁻¹ I(x1:n; y1:n)
♦ = proved in two pages' time

Mutual Information Plot
Binary Symmetric Channel with Bernoulli input: p(x=1) = p, error prob f
I(x;y) = H(y) − H(y|x) = H(f + p − 2pf) − H(f)
(Figure: surface plot of I(x;y) against the channel error prob f and the input Bernoulli prob p; for every f the maximum over p is at p = ½.)
Mutual Information is Concave in px
Mutual information I(x;y) is concave in px for fixed py|x.
Proof: Let u and v have prob mass vectors pu and pv
– Define z: a Bernoulli random variable with p(1) = λ
– Let x = u if z=1 and x = v if z=0 ⇒ px = λpu + (1−λ)pv
I(x,z;y) = I(x;y) + I(z;y|x) = I(z;y) + I(x;y|z)
but I(z;y|x) = H(y|x) − H(y|x,z) = 0, so
I(x;y) = I(z;y) + I(x;y|z) ≥ I(x;y|z)
= λI(x;y|z=1) + (1−λ)I(x;y|z=0)
= λI(u;y) + (1−λ)I(v;y)
Special case: y = x ⇒ I(x;x) = H(x) is concave in px

Mutual Information is Convex in py|x
Mutual information I(x;y) is convex in py|x for fixed px.
Proof: define u, v, z similarly, with py|x = λpu|x + (1−λ)pv|x:
I(x;y,z) = I(x;y|z) + I(x;z) = I(x;y) + I(x;z|y)
but I(x;z) = 0 and I(x;z|y) ≥ 0, so
I(x;y) ≤ I(x;y|z) = λI(x;y|z=1) + (1−λ)I(x;y|z=0) = λI(x;u) + (1−λ)I(x;v)

n-use Channel Capacity
For a Discrete Memoryless Channel:
I(x1:n; y1:n) = H(y1:n) − H(y1:n | x1:n)
= Σ_{i=1}^{n} H(yi | y1:i−1) − Σ_{i=1}^{n} H(yi | xi)   (chain rule; memoryless)
≤ Σ_{i=1}^{n} H(yi) − Σ_{i=1}^{n} H(yi | xi) = Σ_{i=1}^{n} I(xi; yi)   (conditioning reduces entropy)
with equality if the yi are independent ⇐ the xi are independent.
We can maximize I(x1:n; y1:n) by maximizing each I(xi; yi) independently and taking the xi to be i.i.d.
– We will concentrate on maximizing I(x;y) for a single channel use
– The elements of x1:n are not necessarily i.i.d. in general

Capacity of Symmetric Channel
If the channel is weakly symmetric:
I(x;y) = H(y) − H(y|x) = H(y) − H(Q1,:) ≤ log|Y| − H(Q1,:)
with equality iff the input distribution is uniform.
∴ The information capacity of a weakly symmetric channel is C = log|Y| − H(Q1,:)
For a binary symmetric channel (BSC):
– |Y| = 2
– H(Q1,:) = H(f)
– I(x;y) ≤ 1 − H(f)
∴ The information capacity of a BSC is 1 − H(f)
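As a sketch (ours):

    import math

    def bsc_capacity(f):
        """C = 1 - H(f) for a binary symmetric channel with error prob f."""
        if f in (0, 1):
            return 1.0
        return 1 + f * math.log2(f) + (1 - f) * math.log2(1 - f)

    print(bsc_capacity(0.0))   # 1.0  (noiseless)
    print(bsc_capacity(0.11))  # ~0.5
    print(bsc_capacity(0.5))   # 0.0  (useless channel)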
Binary Erasure Channel (BEC)
I(x;y) = H(x) − H(x|y)
= H(x) − p(y=0)×0 − p(y=?)H(x) − p(y=1)×0   (H(x|y) = 0 when y=0 or y=1)
= H(x) − f H(x) = (1−f) H(x)
≤ 1 − f   (since the max value of H(x) = 1)
C = 1 − f, with equality when x is uniform
Since a fraction f of the bits are lost, the capacity is only 1−f, and this is achieved when x is uniform.

Asymmetric Channel Capacity
Channel: inputs 0 and 1 are interchanged with probability f; input 2 is noiseless:
Q = [1−f f 0; f 1−f 0; 0 0 1]
Let px = [a a 1−2a]ᵀ ⇒ py = Qᵀpx = px
H(y) = −2a log a − (1−2a) log(1−2a)
H(y|x) = 2aH(f) + (1−2a)H(1) = 2aH(f)
To find C, maximize I(x;y) = H(y) − H(y|x):
I = −2a log a − (1−2a) log(1−2a) − 2aH(f)
dI/da = −2 log e − 2 log a + 2 log e + 2 log(1−2a) − 2H(f) = 0   (note: d(log x)/dx = x⁻¹ log e)
log [(1−2a)/a] = log(a⁻¹ − 2) = H(f) ⇒ a = (2 + 2^{H(f)})⁻¹
⇒ C = −2a log a − (1−2a) log(1−2a) − 2aH(f) = −log(1−2a)   (substituting H(f) = log[(1−2a)/a])
Examples:
f = 0 ⇒ H(f) = 0 ⇒ a = 1/3 ⇒ C = log 3 = 1.585 bits/use
f = ½ ⇒ H(f) = 1 ⇒ a = ¼ ⇒ C = log 2 = 1 bit/use
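A numerical check of the closed form (ours):

    import math

    def h2(p):
        return 0 if p in (0, 1) else -p*math.log2(p) - (1-p)*math.log2(1-p)

    def capacity(f):
        a = 1 / (2 + 2 ** h2(f))        # optimal input prob for symbols 0 and 1
        return a, -math.log2(1 - 2*a)   # C = -log(1 - 2a)

    print(capacity(0.0))   # a = 1/3,  C = log 3 = 1.585 bits/use
    print(capacity(0.5))   # a = 1/4,  C = 1 bit/use

    # Brute-force check of the maximization for f = 0.5:
    f = 0.5
    best = max((-2*a*math.log2(a) - (1-2*a)*math.log2(1-2*a) - 2*a*h2(f), a)
               for a in (i/10000 for i in range(1, 5000)))
    print(best)            # ~ (1.0, 0.25): agrees with the closed form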
Summary
• Given the channel, mutual information is concave in the input distribution
• Channel capacity: C = max_{px} I(x;y)
– The maximum exists and is unique
• DMC capacity:
– Weakly symmetric channel: log|Y| − H(Q1,:)
– BSC: 1 − H(f)
– BEC: 1 − f
– In general it is very hard to obtain a closed form; use a numerical method (convex optimization) instead
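One standard numerical method is the Blahut-Arimoto algorithm; a minimal sketch (ours, not covered in the slides):

    import numpy as np

    def blahut_arimoto(Q, iters=300):
        """Capacity (bits) of a DMC with row-stochastic Q[x, y] = p(y|x)."""
        p = np.ones(Q.shape[0]) / Q.shape[0]    # start from a uniform input

        def divergences(p):
            q = p @ Q                           # output distribution p(y)
            with np.errstate(divide='ignore', invalid='ignore'):
                r = np.where(Q > 0, Q * np.log2(np.where(Q > 0, Q / q, 1)), 0)
            return r.sum(axis=1)                # D(Q_x || q) for each input x

        for _ in range(iters):
            p = p * 2.0 ** divergences(p)       # multiplicative update
            p /= p.sum()
        return float(p @ divergences(p)), p     # capacity and optimal input

    f = 0.5
    Q = np.array([[1 - f, f, 0], [f, 1 - f, 0], [0, 0, 1.0]])
    print(blahut_arimoto(Q))  # ~ (1.0, [0.25, 0.25, 0.5]): matches the closed form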
Lecture 9
• Jointly Typical Sets
• Joint AEP
• Channel Coding Theorem
– The ultimate limit on information transmission is the channel capacity
– The central and most successful story of information theory
– Random coding
– Jointly typical decoding

Intuition on the Ultimate Limit
• Consider blocks of n symbols: x1:n → Noisy Channel → y1:n
(about 2^{nH(x)} typical inputs, 2^{nH(y)} typical outputs, 2^{nH(y|x)} typical outputs per input)
– For large n, an average input sequence x1:n corresponds to about 2^{nH(y|x)} typical output sequences
– There are a total of 2^{nH(y)} typical output sequences
– For nearly error-free transmission, we select a number of input sequences whose corresponding sets of output sequences hardly overlap
– The maximum number of distinct sets of output sequences is 2^{n(H(y)−H(y|x))} = 2^{nI(y;x)}
– One can send I(y;x) bits per channel use for large n
One can transmit at any rate < C with negligible errors

Jointly Typical Set
x, y is the i.i.d. sequence {xi, yi} for 1 ≤ i ≤ n
– Prob of a particular sequence pair is p(x,y) = Π_{i=1}^{n} p(xi, yi)
– E [−log p(x,y)] = n E [−log p(xi,yi)] = nH(x,y)
– Jointly typical set:
Jε(n) = {x,y ∈ Xⁿ×Yⁿ : |−n⁻¹ log p(x) − H(x)| < ε,
|−n⁻¹ log p(y) − H(y)| < ε,
|−n⁻¹ log p(x,y) − H(x,y)| < ε}

Jointly Typical Example
Binary Symmetric Channel: f = 0.2, px = (0.75 0.25)ᵀ
py = (0.65 0.35)ᵀ,  Pxy = [0.6 0.15; 0.05 0.2]
Jointly typical example (for any ε):
x = 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
y = 1 1 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
All combinations of x and y have exactly the right frequencies.

Jointly Typical Diagram
Each point defines both an x sequence and a y sequence.
(Figure: the 2^{n log|X|} × 2^{n log|Y|} grid of all sequence pairs; dots mark the jointly typical pairs (x,y). The inner rectangle of typical x's (≈2^{nH(x)}) against typical y's (≈2^{nH(y)}) holds ≈2^{n(H(x)+H(y))} pairs, of which ≈2^{nH(x,y)} are jointly typical.)
• There are about 2^{nH(x)} typical x's in all
• Each typical y is jointly typical with about 2^{nH(x|y)} of these typical x's
• The jointly typical pairs are a fraction 2^{−nI(x;y)} of the inner rectangle
• Channel code: choose x's whose jointly typical y's don't overlap; use joint typicality for decoding
• There are about 2^{nI(x;y)} such codewords x

Joint Typical Set Properties
1. Individual prob: x,y ∈ Jε(n) ⇒ log p(x,y) = −nH(x,y) ± nε
2. Total prob: p(x,y ∈ Jε(n)) > 1−ε for n > Nε
3. Size: (1−ε) 2^{n(H(x,y)−ε)} < |Jε(n)| ≤ 2^{n(H(x,y)+ε)}   (lower bound for n > Nε)
Proof 2: choose N1, N2, N3 so that each of the three conditions in the definition holds with probability > 1 − ε/3 (law of large numbers) and set Nε = max(N1, N2, N3)
Proof 3: 1−ε < p(x,y ∈ Jε(n)) ≤ |Jε(n)| max_{x,y∈Jε(n)} p(x,y) = |Jε(n)| 2^{−n(H(x,y)−ε)} for n > Nε
1 ≥ Σ_{x,y∈Jε(n)} p(x,y) ≥ |Jε(n)| min_{x,y∈Jε(n)} p(x,y) = |Jε(n)| 2^{−n(H(x,y)+ε)} ∀n
4. If px′ = px and py′ = py with x′ and y′ independent:
(1−ε) 2^{−n(I(x;y)+3ε)} ≤ p(x′,y′ ∈ Jε(n)) ≤ 2^{−n(I(x;y)−3ε)} for n > Nε
Proof: |J| × (min prob) ≤ total prob ≤ |J| × (max prob):
p(x′,y′ ∈ Jε(n)) = Σ_{x′,y′∈Jε(n)} p(x′) p(y′)
≤ |Jε(n)| max p(x′)p(y′) ≤ 2^{n(H(x,y)+ε)} 2^{−n(H(x)−ε)} 2^{−n(H(y)−ε)} = 2^{−n(I(x;y)−3ε)}
≥ |Jε(n)| min p(x′)p(y′) ≥ (1−ε) 2^{n(H(x,y)−ε)} 2^{−n(H(x)+ε)} 2^{−n(H(y)+ε)} = (1−ε) 2^{−n(I(x;y)+3ε)}

Channel Coding
w ∈ 1:M → Encoder → x1:n → Noisy Channel → y1:n → Decoder g(y) → ŵ ∈ 1:M
• Assume a Discrete Memoryless Channel with known Qy|x
• An (M,n) code is
– A fixed set of M codewords x(w) ∈ Xⁿ for w = 1:M
– A deterministic decoder g(y) ∈ 1:M
• The rate of an (M,n) code: R = (log M)/n bits/transmission
• Error probability:
λw = p(g(y) ≠ w | w) = Σ_{y∈Yⁿ} p(y | x(w)) δ_{g(y)≠w}
where δ_C = 1 if C is true and 0 if it is false
– Maximum error probability: λ(n) = max_{1≤w≤M} λw
– Average error probability: Pe(n) = M⁻¹ Σ_{w=1}^{M} λw

Shannon's Ideas
• Channel coding theorem: the basic theorem of information theory
– Proved in his original 1948 paper
• How do you correct all errors?
• Shannon's ideas:
– Allow an arbitrarily small but nonzero error probability
– Use the channel many times in succession, so that the AEP holds
– Consider a randomly chosen code and show that the expected average error probability is small
• Use the idea of typical sequences
• Show this means ∃ at least one code with small maximum error probability
• Sadly it doesn't tell you how to construct the code

Channel Coding Theorem
• A rate R is achievable if there is a sequence of (2^{nR}, n) codes with λ(n) → 0
• Channel Coding Theorem: R is achievable iff R < C
A very counterintuitive result: despite channel errors you can get arbitrarily low bit error rates provided that R < C.
(Figure: achievable bit error probability against rate R/C.)

Summary
• Jointly typical set:
log p(x,y) = −nH(x,y) ± nε
p(x,y ∈ Jε(n)) > 1−ε
|Jε(n)| ≤ 2^{n(H(x,y)+ε)}
(1−ε) 2^{−n(I(x;y)+3ε)} ≤ p(x′,y′ ∈ Jε(n)) ≤ 2^{−n(I(x;y)−3ε)}
• Machinery to prove the channel coding theorem

Lecture 10
• Channel Coding Theorem: proof
– Using joint typicality
– Arguably the simplest one among many possible ways
– Limitation: does not reveal Pe ~ e^{−nE(R)}
• Converse (next lecture)

Channel Coding Principle
• Consider blocks of n symbols: x1:n → Noisy Channel → y1:n
– An average input sequence x1:n corresponds to about 2^{nH(y|x)} typical output sequences
– Random codes: choose M = 2^{nR} (R ≤ I(x;y)) random codewords x(w)
• their typical output sequences are unlikely to overlap much
– Joint typical decoding: a received vector y is very likely to be in the typical output set of the transmitted x(w) and no others. Decode as this w.
Channel Coding Theorem: for large n, one can transmit at any rate R < C with negligible errors

Random (2^{nR}, n) Code
• Choose ε ≈ error prob; joint typicality ⇒ Nε; choose n > Nε
• Choose px so that I(x;y) = C, the information capacity
• Use px to choose a code C with random x(w) ∈ Xⁿ, w = 1:2^{nR}
– the receiver knows this code and also the transition matrix Q
• Assume the message w ∈ 1:2^{nR} is uniformly distributed
• If the received value is y, decode by seeing how many x(w)'s are jointly typical with y
– if x(k) is the only one, then k is the decoded message
– if there are 0 or ≥2 possible k's, then declare an error (message 0)
• We calculate the error probability averaged over all C and all w:
p(E) = Σ_C p(C) 2^{−nR} Σ_{w=1}^{2^{nR}} λw(C) = 2^{−nR} Σ_{w=1}^{2^{nR}} Σ_C p(C) λw(C) =(a) Σ_C p(C) λ1(C) = p(E | w=1)
(a) since the error averaged over all possible codes is independent of w

Decoding Errors
• Assume we transmit x(1) and receive y
• Define the joint-typicality events ew = {(x(w), y) ∈ Jε(n)} for w ∈ 1:2^{nR}
• Decode using joint typicality
• We have an error if either e1 is false or ew is true for some w ≥ 2
• The x(w) for w ≠ 1 are independent of x(1) and hence also independent of y. So, by the joint AEP, p(ew true) < 2^{−n(I(x;y)−3ε)} for any w ≠ 1

Error Probability for Random Code
• Upper bound: p(A∪B) ≤ p(A) + p(B)
p(E) = p(E | w=1) = p(ē1 ∪ e2 ∪ e3 ∪ ⋯ ∪ e_{2^{nR}}) ≤ p(ē1) + Σ_{w=2}^{2^{nR}} p(ew)
≤ ε + Σ_{w=2}^{2^{nR}} 2^{−n(I(x;y)−3ε)}   ((1) joint typicality; (2) joint AEP)
< ε + 2^{nR} 2^{−n(I(x;y)−3ε)} = ε + 2^{−n(C−R−3ε)}   (we have chosen px such that I(x;y) = C)
≤ 2ε for R < C − 3ε and n > −log ε / (C − R − 3ε)
• Since the average of p(E) over all codes is ≤ 2ε, there must be at least one code for which this is true: this code has 2^{−nR} Σw λw ≤ 2ε ♦
• Now throw away the worst half of the codewords; the remaining ones must all have λw ≤ 4ε. The resultant code has rate R − n⁻¹ ≅ R. ♦
♦ = proved on the next page

Code Selection & Expurgation
• Since the average of p(E) over all codes is ≤ 2ε, there must be at least one code for which this is true.
Proof: 2ε ≥ K⁻¹ Σ_{i=1}^{K} Pe,i(n) ≥ K⁻¹ Σ_{i=1}^{K} min_i Pe,i(n) = min_i Pe,i(n)   (K = number of codes)
• Expurgation: throw away the worst half of the codewords; the remaining ones must all have λw ≤ 4ε.
Proof: Assume the λw are in descending order:
2ε ≥ M⁻¹ Σ_{w=1}^{M} λw ≥ M⁻¹ Σ_{w=1}^{½M} λw ≥ M⁻¹ Σ_{w=1}^{½M} λ_{½M} ≥ ½ λ_{½M}
⇒ λ_{½M} ≤ 4ε ⇒ λw ≤ 4ε ∀ w > ½M
M′ = ½ × 2^{nR} messages in n channel uses ⇒ R′ = n⁻¹ log M′ = R − n⁻¹

Summary of Procedure
• For any R < C:
(a) Find the optimum px so that I(x;y) = C
(b) Choose codewords randomly (using px) to construct codes with 2^{nR} codewords, and use joint typicality as the decoder
(c) Since the average of p(E) over all codes is ≤ 2ε, there must be at least one code for which this is true
(d) Throw away the worst half of the codewords. Now the worst codeword has error prob ≤ 4ε at rate R − n⁻¹ > R − ε
• The resultant code transmits at a rate as close to C as desired with an error probability as small as desired (but n may be unnecessarily large).
Note: ε determines both the error probability and the closeness to capacity

Remarks
• Random coding is a powerful method of proof, not a method of signaling
• Picking randomly will give a good code
• But n has to be large (AEP)
• Without structure, it is difficult to encode/decode
– Table lookup requires exponential size
• The channel coding theorem does not provide a practical coding scheme
• Folk theorem (but outdated now):
– Almost all codes are good, except those we can think of

Lecture 11
• Converse of the Channel Coding Theorem
– Cannot achieve R > C
• Capacity with feedback
– No gain for a DMC, but simpler encoding/decoding
• Joint Source-Channel Coding
– No point for a DMC

Converse of Coding Theorem
• Fano's inequality: if Pe(n) is the error prob when estimating w from y:
H(w|y) ≤ 1 + Pe(n) log|W| = 1 + nR Pe(n)
• Hence
nR = H(w) = H(w|y) + I(w;y)
≤ H(w|y) + I(x(w);y)   (definition of I; Markov: w → x → y → ŵ)
≤ 1 + nR Pe(n) + I(x;y)   (Fano)
≤ 1 + nR Pe(n) + nC   (n-use DMC capacity)
⇒ Pe(n) ≥ (R − C − n⁻¹)/R → (R − C)/R > 0 if R > C
• For large (hence for all) n, Pe(n) has a lower bound of (R−C)/R if w is equiprobable
– If achievable for small n, it could be achieved also for large n by concatenation.

Minimum Bit-Error Rate
w1:nR → Encoder → x1:n → Noisy Channel → y1:n → Decoder → ŵ1:nR
Suppose
– w is i.i.d. bits with H(wi) = 1
– The bit-error rate is Pb = E_i{p(ŵi ≠ wi)} = E_i{p(ei)}, where ei = wi ⊕ ŵi
Then nC ≥(a) I(x1:n; y1:n) ≥(b) I(w1:nR; ŵ1:nR) = H(w1:nR) − H(w1:nR | ŵ1:nR)
= nR − Σ_{i=1}^{nR} H(wi | ŵ1:nR, w1:i−1) ≥(c) nR − Σ_{i=1}^{nR} H(wi | ŵi) = nR(1 − E_i{H(wi | ŵi)})
=(d) nR(1 − E_i{H(ei | ŵi)}) ≥(c) nR(1 − E_i{H(ei)}) ≥(e) nR(1 − H(E_i p(ei))) = nR(1 − H(Pb))
(a) n-use capacity; (b) data processing theorem; (c) conditioning reduces entropy; (d) ei = wi ⊕ ŵi; (e) Jensen: E H(x) ≤ H(E x)
Hence R ≤ C (1 − H(Pb))⁻¹ and Pb ≥ H⁻¹(1 − C/R)
(Figure: bit error probability against rate R/C.)

Coding Theory and Practice
• Construction of good codes
– Pursued ever since Shannon founded information theory
– Practical: computation & memory ∝ n^k for some k
• Repetition code: rate → 0
• Block codes: encode a block at a time
– Hamming code: corrects one error
– Reed-Solomon code, BCH code: multiple errors (1950s)
• Convolutional code: convolve the bit stream with a filter
• Concatenated code: RS + convolutional
• Capacity-approaching codes:
– Turbo code: combination of two interleaved convolutional codes (1993)
– Low-density parity-check (LDPC) code (1960)
– The dream has come true for some channels today

Channel with Feedback
w ∈ 1:2^{nR} → Encoder → xi → Noisy Channel → yi → Decoder → ŵ, with y1:i−1 fed back to the encoder
• Assume error-free feedback: does it increase capacity?
• A (2^{nR}, n) feedback code is
– A sequence of mappings xi = xi(w, y1:i−1) for i = 1:n
– A decoding function ŵ = g(y1:n)
• A rate R is achievable if ∃ a sequence of (2^{nR}, n) feedback codes such that Pe(n) = P(ŵ ≠ w) → 0 as n → ∞
• The feedback capacity, CFB ≥ C, is the supremum of achievable rates

Feedback Doesn't Increase Capacity
I(w;y) = H(y) − H(y|w)
= H(y) − Σ_{i=1}^{n} H(yi | y1:i−1, w)
= H(y) − Σ_{i=1}^{n} H(yi | y1:i−1, w, xi)   (since xi = xi(w, y1:i−1))
= H(y) − Σ_{i=1}^{n} H(yi | xi)   (since yi only directly depends on xi)
≤ Σ_{i=1}^{n} H(yi) − Σ_{i=1}^{n} H(yi | xi)   (conditioning reduces entropy)
= Σ_{i=1}^{n} I(xi; yi) ≤ nC   (DMC)
Hence nR = H(w) = H(w|y) + I(w;y) ≤ 1 + nR Pe(n) + nC   (Fano)
⇒ Pe(n) ≥ (R − C − n⁻¹)/R: any rate R > C is unachievable
The DMC does not benefit from feedback: CFB = C

Example: BEC with Feedback
• Capacity is 1 − f
• Encoding algorithm:
– If yi = ?, tell the sender to retransmit bit i
– Average number of transmissions per bit: 1 + f + f² + ⋯ = (1−f)⁻¹
• Average number of successfully recovered bits per transmission = 1 − f
– Capacity is achieved!
• Capacity unchanged, but the encoding/decoding algorithm is much simpler.

Joint Source-Channel Coding
w1:n → Encoder → x1:n → Noisy Channel → y1:n → Decoder → ŵ1:n
• Assume wi satisfies the AEP and |W| < ∞
– Examples: i.i.d.; Markov; stationary ergodic
• Capacity of the DMC channel is C
– if time-varying: C = lim_{n→∞} n⁻¹ I(x;y)
• Joint Source-Channel Coding Theorem: ♦
∃ codes with Pe(n) = P(ŵ1:n ≠ w1:n) → 0 as n → ∞ iff H(W) < C
– errors arise from two sources:
• incorrect encoding of w
• incorrect decoding of y
♦ = proved on the next page

Source-Channel Proof (⇐)
• Achievability is proved using two-stage encoding:
– source coding, then channel coding
• For n > Nε there are only 2^{n(H(W)+ε)} w's in the typical set: encode using n(H(W)+ε) bits
– encoder error < ε
• Transmit with error prob less than ε so long as H(W) + ε < C
• Total error prob < 2ε

Source-Channel Proof (⇒)
Fano's inequality: H(w|ŵ) ≤ 1 + Pe(n) n log|W|
H(W) ≤ n⁻¹ H(w1:n)   (entropy rate of a stationary process)
= n⁻¹ H(w1:n | ŵ1:n) + n⁻¹ I(w1:n; ŵ1:n)   (definition of I)
≤ n⁻¹ (1 + Pe(n) n log|W|) + n⁻¹ I(x1:n; y1:n)   (Fano + data processing inequality)
≤ n⁻¹ + Pe(n) log|W| + C   (memoryless channel)
Let n → ∞: Pe(n) → 0 ⇒ H(W) ≤ C

Separation Theorem
• Important result: source coding and channel coding might as well be done separately, since the capacity is the same
– Joint design is more difficult
• Practical implication: for a DMC we can design the source encoder and the channel coder separately
– Source coding: efficient compression
– Channel coding: powerful error-correction codes
• Not necessarily true for
– Correlated channels
– Multiuser channels
• Joint source-channel coding: still an area of research
– Redundancy in human languages helps in a noisy environment

Summary
• Converse to the channel coding theorem
– Proved using Fano's inequality
– Capacity is a clear dividing point:
• If R < C, error prob → 0
• Otherwise, error prob → 1
‰ 1 • Feedback doesn’t increase the capacity of DMC – May increase the capacity of memory channels (e.g., ARQ in TCP/IP) • Source-channel separation theorem for DMC and stationary sources 143 Lecture 12 • Polar codes – Channel polarization – How to construct polar codes – Encoding and decoding • Polar source coding • Extension 144 About Polar Codes • Provably capacity-achieving • Encoding complexity O( log ) • Successive decoding complexity O( log ) • Probability of error ≈ 2 • Main idea: channel polarization 145 What Is Channel Polarization? • Normal channel • Extreme channel Useless channel Polarization sometimes cute, Perfect sometimes lazy, channel hard to manage 146 Channel Polarization • Among all channels, there are two classes which are easy to communicate optimally – The perfect channels the output Y determines the input X – The useless channels Y is independent of X • Polarization is a technique to convert noisy channels to a mixture of extreme channels – The process is information-conserving 147 Generator Matrix • Generator Matrix 1 0 ⨂ = , = 2 1 1 ⨂ denotes the n-fold Kronecker product. • Example 1 0 0 = , = and so on. 1 1 • Encoding Let u be the length-N input to the encoder, then = is the codeword. 148 Channel Combining and Splitting • Basic operation ( = 2) X1 X1 W Y1 U1 + W Y1 Combine X2 X2 W Y2 U2 W Y2 (X1, X2)=(U1, U2) Channel splitting U1 + W Y1 U1 + W Y1 W Y2 U2 W Y2