Asymptotic Equipartition Property


Lecture 5: Asymptotic Equipartition Property
Dr. Yao Xie, ECE587, Information Theory, Duke University

• Law of large numbers for products of random variables
• AEP and its consequences

Stock market

• Initial investment $Y_0$, daily return ratio $r_i$; after $t$ days your money is
$$Y_t = Y_0 \, r_1 \cdots r_t.$$
• Now suppose the return ratios $r_i$ are i.i.d., with
$$r_i = \begin{cases} 4 & \text{w.p. } 1/2 \\ 0 & \text{w.p. } 1/2. \end{cases}$$
• So you think the expected return ratio is $E r_i = 2$,
• and then
$$E Y_t = E(Y_0 r_1 \cdots r_t) = Y_0 (E r_i)^t = Y_0 2^t \;?$$

Is "optimized" really optimal?

• With $Y_0 = 1$, the actual return $Y_t$ goes like $1, 4, 16, 0, 0, 0, \ldots$
• Optimizing the expected return is not optimal?
• Fundamental reason: products do not behave the same way as sums.

(Weak) law of large numbers

Theorem. For independent, identically distributed (i.i.d.) random variables $X_i$,
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \to EX, \quad \text{in probability}.$$

• Convergence in probability: for every $\epsilon > 0$, $P\{|X_n - X| > \epsilon\} \to 0$.
• Proof: by Markov's inequality.
• So this means $P\{|\bar{X}_n - EX| \le \epsilon\} \to 1$ as $n \to \infty$.

Other types of convergence

• In mean square: as $n \to \infty$, $E(X_n - X)^2 \to 0$.
• With probability 1 (almost surely): $P\{\lim_{n\to\infty} X_n = X\} = 1$.
• In distribution: $\lim_n F_n = F$, where $F_n$ and $F$ are the cumulative distribution functions of $X_n$ and $X$.

Product of random variables

• How does $\sqrt[n]{\prod_{i=1}^n X_i}$ behave?
• Geometric mean $\sqrt[n]{\prod_{i=1}^n X_i}$ $\le$ arithmetic mean $\frac{1}{n}\sum_{i=1}^n X_i$.
• Examples:
  – Volume $V$ of a random box with dimensions $X_i$: $V = X_1 \cdots X_n$
  – Stock return $Y_t = Y_0 r_1 \cdots r_t$
  – Joint distribution of i.i.d. RVs: $p(x_1, \ldots, x_n) = \prod_{i=1}^n p(x_i)$

Law of large numbers for products of random variables

• We can write $X_i = e^{\log X_i}$.
• Hence
$$\sqrt[n]{\prod_{i=1}^n X_i} = e^{\frac{1}{n}\sum_{i=1}^n \log X_i}.$$
• So from the LLN,
$$\sqrt[n]{\prod_{i=1}^n X_i} \to e^{E(\log X)} \le e^{\log EX} = EX.$$

• Stock example:
$$E \log r_i = \frac{1}{2}\log 4 + \frac{1}{2}\log 0 = -\infty,$$
so
$$Y_t = Y_0 e^{\sum_{i=1}^t \log r_i} \approx Y_0 e^{t E \log r_i} \to 0, \quad t \to \infty,$$
even though $EY_t = Y_0 2^t \to \infty$.
• Example:
$$X = \begin{cases} a & \text{w.p. } 1/2 \\ b & \text{w.p. } 1/2, \end{cases}
\qquad
E\left[\sqrt[n]{\prod_{i=1}^n X_i}\right] \to \sqrt{ab} \le \frac{a+b}{2}.$$

Asymptotic equipartition property (AEP)

• The LLN states that
$$\frac{1}{n}\sum_{i=1}^n X_i \to EX.$$
• The AEP states that
$$\frac{1}{n}\log\frac{1}{p(X_1, X_2, \ldots, X_n)} \to H(X),$$
so most sequences satisfy $p(X_1, X_2, \ldots, X_n) \approx 2^{-nH(X)}$.
• Analyze using the LLN for products of random variables.

AEP lies at the heart of information theory:
• proof of lossless source coding,
• proof of channel capacity,
• and more...

AEP

Theorem. If $X_1, X_2, \ldots$ are i.i.d. $\sim p(x)$, then
$$-\frac{1}{n}\log p(X_1, X_2, \ldots, X_n) \to H(X), \quad \text{in probability}.$$
Proof:
$$-\frac{1}{n}\log p(X_1, X_2, \ldots, X_n) = -\frac{1}{n}\sum_{i=1}^n \log p(X_i) \to -E\log p(X) = H(X).$$
There are several consequences.

Typical set

The typical set $A_\epsilon^{(n)}$ contains all sequences $(x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$ with the property
$$2^{-n(H(X)+\epsilon)} \le p(x_1, x_2, \ldots, x_n) \le 2^{-n(H(X)-\epsilon)}.$$
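As a quick numerical illustration (a sketch added to these notes, not part of the original slides), the Python code below estimates $P\{(X_1,\ldots,X_n) \in A_\epsilon^{(n)}\}$ for a Bernoulli(0.8) source by Monte Carlo. The function names and the parameter choices ($\epsilon = 0.05$, the block lengths $n$, the number of trials) are arbitrary assumptions made for the demo.

```python
# Monte Carlo sketch (added illustration, not from the slides): estimate
# P{(X_1,...,X_n) in A_eps^(n)} for an i.i.d. Bernoulli(0.8) source.
import math
import random

def binary_entropy(p):
    """H(X) in bits for a Bernoulli(p) source."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def sample_entropy(k, n, p):
    """-(1/n) log2 p(x_1,...,x_n) for a Bernoulli(p) sequence containing k ones."""
    return -(k * math.log2(p) + (n - k) * math.log2(1 - p)) / n

def fraction_typical(p=0.8, n=100, eps=0.05, trials=2000, seed=0):
    """Fraction of simulated length-n sequences whose sample entropy is within eps of H(X)."""
    rng = random.Random(seed)
    H = binary_entropy(p)
    hits = 0
    for _ in range(trials):
        k = sum(rng.random() < p for _ in range(n))  # number of ones in one simulated sequence
        if abs(sample_entropy(k, n, p) - H) <= eps:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    print(f"H(X) = {binary_entropy(0.8):.4f} bits")
    for n in (25, 100, 400, 1600):
        print(f"n = {n:4d}: estimated P(A_eps) = {fraction_typical(n=n):.3f}")
```

As $n$ grows, the estimated probability climbs toward 1, anticipating Property 2 below.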
Not all sequences are created equal

• Coin tossing example: $X \in \{0, 1\}$, $p(1) = 0.8$.
$$p(1,0,1,1,0,1) = p^{\sum x_i}(1-p)^{6 - \sum x_i} = p^4 (1-p)^2 = 0.0164$$
$$p(0,0,0,0,0,0) = (1-p)^6 = 0.000064$$
• In this example, if $(x_1, \ldots, x_n) \in A_\epsilon^{(n)}$, then
$$H(X) - \epsilon \le -\frac{1}{n}\log p(x_1, \ldots, x_n) \le H(X) + \epsilon.$$
• This means a binary sequence is in the typical set if its frequency of heads $k/n$ is approximately $p$.

Example: $p = 0.6$, $n = 25$, $k$ = number of "1"s.

 k   C(25,k)   C(25,k) p^k (1-p)^(25-k)   -(1/n) log p(x^n)
 0         1          0.000000               1.321928
 1        25          0.000000               1.298530
 2       300          0.000000               1.275131
 3      2300          0.000001               1.251733
 4     12650          0.000007               1.228334
 5     53130          0.000054               1.204936
 6    177100          0.000227               1.181537
 7    480700          0.001205               1.158139
 8   1081575          0.003121               1.134740
 9   2042975          0.013169               1.111342
10   3268760          0.021222               1.087943
11   4457400          0.077801               1.064545
12   5200300          0.075967               1.041146
13   5200300          0.267718               1.017748
14   4457400          0.146507               0.994349
15   3268760          0.575383               0.970951
16   2042975          0.151086               0.947552
17   1081575          0.846448               0.924154
18    480700          0.079986               0.900755
19    177100          0.970638               0.877357
20     53130          0.019891               0.853958
21     12650          0.997633               0.830560
22      2300          0.001937               0.807161
23       300          0.999950               0.783763
24        25          0.000047               0.760364
25         1          0.000003               0.736966

Consequences of AEP

Theorem.
1. If $(x_1, x_2, \ldots, x_n) \in A_\epsilon^{(n)}$, then for $n$ sufficiently large:
$$H(X) - \epsilon \le -\frac{1}{n}\log p(x_1, x_2, \ldots, x_n) \le H(X) + \epsilon.$$
2. $P\{A_\epsilon^{(n)}\} \ge 1 - \epsilon$.
3. $|A_\epsilon^{(n)}| \le 2^{n(H(X)+\epsilon)}$.
4. $|A_\epsilon^{(n)}| \ge (1 - \epsilon)\, 2^{n(H(X)-\epsilon)}$.

Property 1

If $(x_1, x_2, \ldots, x_n) \in A_\epsilon^{(n)}$, then
$$H(X) - \epsilon \le -\frac{1}{n}\log p(x_1, x_2, \ldots, x_n) \le H(X) + \epsilon.$$
• Proof from the definition: $(x_1, x_2, \ldots, x_n) \in A_\epsilon^{(n)}$ if
$$2^{-n(H(X)+\epsilon)} \le p(x_1, x_2, \ldots, x_n) \le 2^{-n(H(X)-\epsilon)}.$$
• The number of bits used to describe sequences in the typical set is approximately $nH(X)$.

Property 2

$P\{A_\epsilon^{(n)}\} \ge 1 - \epsilon$ for $n$ sufficiently large.

• Proof: From the AEP, because
$$-\frac{1}{n}\log p(X_1, \ldots, X_n) \to H(X)$$
in probability, for a given $\epsilon > 0$, when $n$ is sufficiently large
$$P\left\{\left|-\frac{1}{n}\log p(X_1, \ldots, X_n) - H(X)\right| \le \epsilon\right\} \ge 1 - \epsilon,$$
and the event in braces is exactly $(X_1, \ldots, X_n) \in A_\epsilon^{(n)}$.
  – High probability: sequences in the typical set are "most typical".
  – These sequences almost all have about the same probability ("equipartition").

Properties 3 and 4: size of the typical set

$$(1 - \epsilon)\, 2^{n(H(X)-\epsilon)} \le |A_\epsilon^{(n)}| \le 2^{n(H(X)+\epsilon)}$$

• Proof:
$$1 = \sum_{(x_1, \ldots, x_n)} p(x_1, \ldots, x_n) \ge \sum_{(x_1, \ldots, x_n) \in A_\epsilon^{(n)}} p(x_1, \ldots, x_n) \ge \sum_{(x_1, \ldots, x_n) \in A_\epsilon^{(n)}} 2^{-n(H(X)+\epsilon)} = |A_\epsilon^{(n)}| \, 2^{-n(H(X)+\epsilon)},$$
so $|A_\epsilon^{(n)}| \le 2^{n(H(X)+\epsilon)}$.

On the other hand, $P\{A_\epsilon^{(n)}\} \ge 1 - \epsilon$ for $n$ sufficiently large, so
$$1 - \epsilon < \sum_{(x_1, \ldots, x_n) \in A_\epsilon^{(n)}} p(x_1, \ldots, x_n) \le |A_\epsilon^{(n)}| \, 2^{-n(H(X)-\epsilon)},$$
which gives $|A_\epsilon^{(n)}| \ge (1 - \epsilon)\, 2^{n(H(X)-\epsilon)}$.
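Because every length-$n$ binary sequence with $k$ ones has the same probability, the bounds above can be checked exactly for the table's example ($p = 0.6$, $n = 25$) by summing over $k$. The sketch below is an added illustration, not from the slides, and the choice of $\epsilon$ values is arbitrary; note that Properties 2 and 4 are guaranteed only for $n$ sufficiently large, so at $n = 25$ with small $\epsilon$ the typical-set probability may still be well below $1 - \epsilon$.

```python
# Exact check (added illustration, not from the slides) of the typical-set size and
# probability for Bernoulli(p = 0.6), n = 25, the same example as the table above.
import math

def binary_entropy(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def typical_set_stats(p=0.6, n=25, eps=0.1):
    """Return (H(X), |A_eps^(n)|, P{A_eps^(n)}), computed exactly by summing over k."""
    H = binary_entropy(p)
    size, prob = 0, 0.0
    for k in range(n + 1):
        # sample entropy -(1/n) log2 p(x^n) of any sequence with k ones
        h_k = -(k * math.log2(p) + (n - k) * math.log2(1 - p)) / n
        if abs(h_k - H) <= eps:
            size += math.comb(n, k)
            prob += math.comb(n, k) * p**k * (1 - p)**(n - k)
    return H, size, prob

if __name__ == "__main__":
    n = 25
    for eps in (0.05, 0.1, 0.2):
        H, size, prob = typical_set_stats(n=n, eps=eps)
        lower = (1 - eps) * 2 ** (n * (H - eps))
        upper = 2 ** (n * (H + eps))
        print(f"eps = {eps:.2f}: |A| = {size:10d}, P(A) = {prob:.4f}, "
              f"(1-eps)2^(n(H-eps)) = {lower:.3e}, 2^(n(H+eps)) = {upper:.3e}")
```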
• The size of the typical set depends on $H(X)$.
• When $p = 1/2$ in the coin tossing example, $H(X) = 1$ and $2^{nH(X)} = 2^n$: all sequences are typical sequences.

Typical set diagram

• This enables us to divide all sequences into two sets:
  – Typical set: high probability to occur, and the sample entropy is close to the true entropy, so we will focus on analyzing sequences in the typical set.
  – Non-typical set: small probability, can be ignored in general.

[Diagram: the set $\mathcal{X}^n$ of all $|\mathcal{X}|^n$ sequences, divided into the non-typical set and the typical set $A_\epsilon^{(n)}$ with about $2^{n(H+\epsilon)}$ elements.]

Data compression scheme from AEP

• Let $X_1, X_2, \ldots, X_n$ be i.i.d. RVs drawn from $p(x)$.
• We wish to find short descriptions for such sequences of RVs.
• Divide all sequences in $\mathcal{X}^n$ into two sets.

[Diagram: non-typical set, described with $n \log|\mathcal{X}| + 2$ bits; typical set, described with $n(H+\epsilon) + 2$ bits.]

• Use one bit to indicate which set a sequence belongs to:
  – Typical set $A_\epsilon^{(n)}$: use prefix "1". Since there are no more than $2^{n(H(X)+\epsilon)}$ sequences, indexing requires no more than $n(H(X)+\epsilon) + 1$ bits (the extra bit is needed because $n(H(X)+\epsilon)$ may not be an integer); with the prefix bit, at most $n(H(X)+\epsilon) + 2$ bits.
  – Non-typical set: use prefix "0". Since there are at most $|\mathcal{X}|^n$ sequences, indexing requires no more than $n \log|\mathcal{X}| + 1$ bits; with the prefix bit, at most $n \log|\mathcal{X}| + 2$ bits.
• Notation: $x^n = (x_1, \ldots, x_n)$, and $l(x^n)$ is the length of the codeword for $x^n$.
• We can prove that, for $n$ sufficiently large,
$$E\left[\frac{1}{n} l(X^n)\right] \le H(X) + \epsilon.$$

Summary of AEP

Almost everything is almost equally probable.

• Reasons that the AEP centers on $H(X)$:
  – $-\frac{1}{n}\log p(X^n) \to H(X)$, in probability
  – about $n(H(X)+\epsilon)$ bits suffice to describe the random sequence on average
  – $2^{H(X)}$ is the effective alphabet size
  – the typical set is the smallest set with probability near 1
  – the size of the typical set is about $2^{nH(X)}$
  – the probabilities of the elements in the typical set are nearly uniform
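The coding claim can also be checked numerically. The sketch below (an added illustration, not from the slides; the parameters $p = 0.8$, $\epsilon = 0.05$ and the block lengths are my own choices) computes the exact expected length per symbol of the two-set code for a binary source: a typical sequence is described by a one-bit flag plus an index of $\lceil \log_2 |A_\epsilon^{(n)}| \rceil$ bits (at most $n(H(X)+\epsilon) + 1$ bits, as argued above), and any other sequence by a one-bit flag plus its raw $n$-bit description (here $|\mathcal{X}| = 2$, so $n \log|\mathcal{X}| = n$). Since the bound $E[\frac{1}{n} l(X^n)] \le H(X) + \epsilon$ holds for $n$ sufficiently large, the printed rate starts above $H(X) + \epsilon$ for small $n$ and settles near it once $P\{A_\epsilon^{(n)}\}$ is close to 1.

```python
# Numerical sketch (added illustration, not from the slides) of the two-set AEP code:
# typical sequences get a 1-bit flag plus an index into A_eps^(n); all other sequences
# get a 1-bit flag plus a raw n-bit description. Expected length is computed exactly.
import math

def binary_entropy(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def expected_rate(p, n, eps):
    """Exact E[l(X^n)]/n for the two-set coding scheme applied to i.i.d. Bernoulli(p)."""
    H = binary_entropy(p)
    size_typical = 0      # |A_eps^(n)|
    prob_typical = 0.0    # P{A_eps^(n)}
    for k in range(n + 1):
        c = math.comb(n, k)
        h_k = -(k * math.log2(p) + (n - k) * math.log2(1 - p)) / n
        if abs(h_k - H) <= eps:
            size_typical += c
            # accumulate the probability in log space to avoid float underflow for large n
            prob_typical += 2.0 ** (math.log2(c) + k * math.log2(p)
                                    + (n - k) * math.log2(1 - p))
    bits_typical = 1 + (size_typical - 1).bit_length()  # flag + ceil(log2 |A|) index bits
    bits_atypical = 1 + n                                # flag + raw n-bit sequence
    avg_bits = prob_typical * bits_typical + (1 - prob_typical) * bits_atypical
    return H, prob_typical, avg_bits / n

if __name__ == "__main__":
    p, eps = 0.8, 0.05
    for n in (100, 1000, 4000):
        H, prob, rate = expected_rate(p, n, eps)
        print(f"n = {n:5d}: P(A_eps) = {prob:.4f}, E[l]/n = {rate:.4f}, "
              f"H(X) + eps = {H + eps:.4f}")
```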