ECE644 – Lecture 2


ECE644 – Lecture 2: Asymptotic Equipartition Property (AEP), Data Compression
01/23/02, Information Theory, Spring 2002, Hairong Qi

Recap (slide 2)

$$H(X) = -\sum_{x} p(x) \log_2 p(x)$$
$$H(X,Y) = -\sum_{x}\sum_{y} p(x,y) \log_2 p(x,y)$$
$$H(Y|X) = \sum_{x} p(x)\, H(Y|X=x) = -\sum_{x}\sum_{y} p(x,y) \log p(y|x)$$
$$D(p\|q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$
$$I(X;Y) = \sum_{x}\sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$

Venn Diagram (slide 3)

[Figure: Venn diagram of $H(X)$ and $H(Y)$ inside $H(X,Y)$, with regions $H(X|Y)$, $I(X;Y)$ and $H(Y|X)$]

$$H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$$
$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)$$

Asymptotic Equipartition Property (slide 4)

Weak law of large numbers: for i.i.d. random variables, $\frac{1}{n}\sum_{i=1}^{n} X_i$ is close to its expected value $EX$ for large values of $n$.

AEP: $\frac{1}{n}\log\frac{1}{p(X_1, X_2, \ldots, X_n)}$ is close to the entropy $H$, so $p(X_1, X_2, \ldots, X_n)$ is close to $2^{-nH}$.

The AEP (slide 5)

If $X_1, X_2, \ldots$ are i.i.d. $\sim p(x)$, then

$$\frac{1}{n}\log\frac{1}{p(X_1, X_2, \ldots, X_n)} \to H(X)$$

The Typical Set (slide 6)

The typical set $A_\epsilon^{(n)}$ with respect to $p(x)$ is the set of sequences $(x_1, x_2, \ldots, x_n) \in \Re^n$ with the following property:

$$2^{-n(H(X)+\epsilon)} \le p(x_1, x_2, \ldots, x_n) \le 2^{-n(H(X)-\epsilon)}$$

Other properties:
1) $H(X) - \epsilon \le -\frac{1}{n}\log p(x_1, x_2, \ldots, x_n) \le H(X) + \epsilon$
2) $\Pr\{A_\epsilon^{(n)}\} > 1 - \epsilon$ for $n$ sufficiently large
3) $|A_\epsilon^{(n)}| \le 2^{n(H(X)+\epsilon)}$

Data Compression - Problem (slide 7)

Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables drawn from the probability mass function $p(x)$.
Try to find short descriptions for such sequences of random variables.

Data Compression - AEP (slide 8)

Divide all sequences into two sets: the typical set and the non-typical set.

[Figure: the set of all sequences $\Re^n$, split into the non-typical set and the typical set $A_\epsilon^{(n)}$]

What is $A_\epsilon^{(n)}$?

Expected Length of the Codeword (slide 9)

$$E(l(X^n)) = \;??$$

Calculation of Typical Set (slide 10)

Consider a sequence of i.i.d. binary random variables $X_1, X_2, \ldots, X_n$, where the probability that $X_i = 1$ is 0.6.
– (a) Calculate $H(X)$
– (b) With $n = 25$ and $\epsilon = 0.1$, which sequences fall in the typical set?
– (c) What is the probability of the typical set?
– (d) How many elements are there in the typical set?

Homework 1 (Due 1/30/02) (slide 11)

1) Prove the three equations on Lecture 1, slide 17
2) Problem 16
3) Calculation of typical set described in slide 10 of lecture 2
4) Read handout and identify research area
5) Reading: “Claude Shannon: Reluctant Father of the Digital Age” by M. Mitchell Waldrop
6) Experiment: “Shannon’s experiment to calculate the Entropy of English”
7) Reading: “A mathematical theory of communication” by Claude E. Shannon
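The typical-set exercise (binary source with Pr{X_i = 1} = 0.6, n = 25, ε = 0.1) can be explored numerically. The sketch below is one possible way to do it, not the lecture's own solution: it relies on the fact that all sequences with the same number of ones have the same probability, so it loops over the count k rather than over all 2^25 sequences. The function name `typical_set_summary` and its return layout are illustrative assumptions.

```python
from math import comb, log2


def typical_set_summary(p_one, n, eps):
    """Summarise the typical set of n i.i.d. Bernoulli(p_one) bits.

    Every sequence with k ones has the same probability, so it is enough
    to loop over k = 0..n instead of over all 2**n sequences.
    """
    p_zero = 1.0 - p_one
    entropy = -(p_one * log2(p_one) + p_zero * log2(p_zero))

    typical_k = []      # numbers of ones whose sequences are typical
    probability = 0.0   # total probability of the typical set
    size = 0            # number of sequences in the typical set
    for k in range(n + 1):
        # -(1/n) log2 p(x_1, ..., x_n) for any sequence with k ones
        sample_entropy = -(k * log2(p_one) + (n - k) * log2(p_zero)) / n
        if abs(sample_entropy - entropy) <= eps:
            count = comb(n, k)
            typical_k.append(k)
            size += count
            probability += count * p_one**k * p_zero**(n - k)
    return entropy, typical_k, probability, size


entropy, typical_k, probability, size = typical_set_summary(0.6, 25, 0.1)
print(f"H(X) = {entropy:.4f} bits")
print(f"typical numbers of ones: {typical_k[0]}..{typical_k[-1]}")
print(f"Pr(typical set) = {probability:.4f}")
print(f"|typical set| = {size}")
print(f"bound 2^(n(H+eps)) = {2 ** (25 * (entropy + 0.1)):.3e}")
```

With these parameters the typical sequences are exactly those containing 11 to 19 ones, and the computed size can be compared directly against the bound in property 3 of the typical set.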

Recommended publications

Introduction to Information Theory and Coding

Introduction to information theory and coding
Louis WEHENKEL
Set of slides No 4: Source modeling and source coding
• Stochastic processes and models for information sources
• First Shannon theorem: data compression limit
• Overview of state of the art in data compression
• Relations between automatic learning and data compression
IT 2007-4, slide 1

The objectives of this part of the course are the following:
• Understand the relationship between discrete information sources and stochastic processes
• Define the notions of entropy for stochastic processes (infinitely long sequences of not necessarily independent symbols)
• Talk a little more about an important class of stochastic processes (Markov processes/chains)
• How can we compress sequences of symbols in a reversible way
• What are the ultimate limits of data compression
• Review state of the art data compression algorithms
• Look a little bit at irreversible data compression (analog signals, sound, images...)
Do not expect to become a specialist in data compression in one day.
IT 2007-4, note 1

1. Introduction
How and why can we compress information?
Simple example: a source S which emits messages as sequences of the 3 symbols (a, b, c), with
$P(a) = \frac{1}{2}$, $P(b) = \frac{1}{4}$, $P(c) = \frac{1}{4}$ (i.i.d.)
Entropy: $H(S) = \frac{1}{2}\log 2 + \frac{1}{4}\log 4 + \frac{1}{4}\log 4 = 1.5$ Shannon/source symbol
Note that there is redundancy: $H(S) < \log 3 = 1.585$.
How to code “optimally” using a binary code?
1. Associate to each symbol of the source a binary word.
2. Make sure that the code is unambiguous (can decode correctly).
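The two coding steps above can be sketched in a few lines of Python. The codeword assignment a → 0, b → 10, c → 11 is an assumption made here for illustration (the excerpt does not give one); it is prefix-free, so decoding is unambiguous, and its average length can be compared with the source entropy. The helper names `is_prefix_free`, `encode` and `decode` are likewise hypothetical.

```python
from math import log2

# Source statistics from the slide: P(a) = 1/2, P(b) = P(c) = 1/4.
probs = {"a": 0.5, "b": 0.25, "c": 0.25}

# Hypothetical codeword assignment (an assumption, not given in the slides).
code = {"a": "0", "b": "10", "c": "11"}


def is_prefix_free(codebook):
    """Return True if codewords are distinct and none is a prefix of another."""
    words = list(codebook.values())
    if len(set(words)) != len(words):
        return False
    return not any(u != v and v.startswith(u) for u in words for v in words)


def encode(message):
    """Step 1: replace each source symbol by its binary word."""
    return "".join(code[symbol] for symbol in message)


def decode(bits):
    """Step 2: a prefix-free code can be decoded greedily, bit by bit."""
    inverse = {word: symbol for symbol, word in code.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            symbols.append(inverse[current])
            current = ""
    if current:
        raise ValueError("bit string ends in the middle of a codeword")
    return "".join(symbols)


entropy = -sum(p * log2(p) for p in probs.values())
average_length = sum(probs[s] * len(code[s]) for s in probs)

print("entropy (Shannon/symbol):", entropy)
print("average codeword length (bits/symbol):", average_length)
print("round trip:", decode(encode("abacabc")))
```

For this particular source the probabilities are powers of 1/2, which is why the assumed code's average length can match the entropy; that would not generally hold for other distributions.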
Claude Shannon: His Work and Its Legacy

History
Claude Shannon: His Work and Its Legacy
Michelle Effros (California Institute of Technology, USA) and H. Vincent Poor (Princeton University, USA)

The year 2016 marked the centennial of the birth of Claude Elwood Shannon, that singular genius whose fertile mind gave birth to the field of information theory. In addition to providing a source of elegant and intriguing mathematical problems, this field has also had a profound impact on other fields of science and engineering, notably communications and computing, among many others. While the life of this remarkable man has been recounted elsewhere, in this article we seek to provide an overview of his major scientific contributions and their legacy in today’s world. This is both an enviable and an unenviable task. It is enviable, of course, because it is a wonderful story; it is unenviable because it would take volumes to give this subject its due. Nevertheless, in the hope of providing the reader with an appreciation of the extent and impact of Shannon’s major works, we shall try.

To approach this task, we have divided Shannon’s work into 10 topical areas:
- Channel capacity
- Channel coding
- Multiuser channels
- Network coding
- Source coding
- Detection and hypothesis testing
- Learning and big data
- Complexity and combinatorics
- Secrecy
- Applications

We will describe each one briefly, both in terms of Shannon’s own contribution and in terms of how the concepts initiated by Shannon have influenced work in the intervening decades. By necessity, we will take a minimalist approach in this discussion.

[...] communication is possible even in the face of noise and the demonstration that a channel has an inherent maxi-
Introduction to AEP Consequences

Introduction to AEP

In information theory, the asymptotic equipartition property (AEP) is the analog of the law of large numbers. This law states that for independent and identically distributed (i.i.d.) random variables:

$$\frac{1}{n}\sum_{i=1}^{n} X_i \longrightarrow EX \quad \text{as } n \to \infty$$

Similarly, the AEP states that:

$$\frac{1}{n}\log\frac{1}{p(X_1, X_2, \ldots, X_n)} \longrightarrow H \quad \text{as } n \to \infty$$

where $p(X_1, X_2, \ldots, X_n)$ is the probability of observing the sequence $X_1, X_2, \ldots, X_n$. Thus, the probability assigned to an observed sequence will be close to $2^{-nH}$ (from the definition of entropy).

Consequences

We can divide the set of all sequences into two sets: the typical set, where the sample entropy is close to the true entropy, and the non-typical set, which contains the other sequences. The importance of this subdivision is that any property that is proven for the typical sequences will then be true with high probability and will determine the average behavior of a large sample (i.e. a sequence of a large number of random variables).

For example, if we consider a random variable $X \in \{0, 1\}$ having a probability mass function defined by $p(1) = p$ and $p(0) = q$, the probability of a sequence $\{x_1, x_2, \ldots, x_n\}$ is:

$$\prod_{i=1}^{n} p(x_i)$$

For example, the probability of the sequence (1,0,1,1,0,1) is $p^4 q^2$. Clearly, it is not true that all $2^n$ sequences of length $n$ have the same probability. In this example, we can say that the number of 1’s in the sequence is close to $np$.
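The convergence described by the AEP can be observed in a small simulation. The sketch below assumes p = 0.6 purely for illustration (the excerpt leaves p general) and uses hypothetical helper names `sequence_probability` and `sample_entropy`. It first checks the product formula on the sequence (1,0,1,1,0,1), then draws longer and longer i.i.d. binary sequences and compares $-\frac{1}{n}\log_2 p(x_1, \ldots, x_n)$ with the entropy $H$.

```python
import random
from math import log2


def sequence_probability(bits, p):
    """Probability of an i.i.d. binary sequence with Pr{1} = p, Pr{0} = 1 - p."""
    q = 1.0 - p
    prob = 1.0
    for bit in bits:
        prob *= p if bit == 1 else q
    return prob


def sample_entropy(bits, p):
    """-(1/n) log2 p(x_1, ..., x_n), computed with logs to avoid underflow."""
    q = 1.0 - p
    ones = sum(bits)
    zeros = len(bits) - ones
    return -(ones * log2(p) + zeros * log2(q)) / len(bits)


p = 0.6  # assumed value, chosen only for this illustration
true_entropy = -(p * log2(p) + (1 - p) * log2(1 - p))

# Product formula: the sequence (1,0,1,1,0,1) has probability p^4 q^2.
print(sequence_probability((1, 0, 1, 1, 0, 1), p), p**4 * (1 - p) ** 2)

# Sample entropy of longer and longer sequences versus the true entropy.
rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 100, 10_000, 100_000):
    bits = [1 if rng.random() < p else 0 for _ in range(n)]
    print(n, round(sample_entropy(bits, p), 4), round(true_entropy, 4))
```

The sample entropy is computed from the counts of ones and zeros rather than from the raw product, because the product itself underflows to zero in floating point for large n.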
"Entropy Rates of a Stochastic Process"

Elements of Information Theory
Thomas M. Cover, Joy A. Thomas
Copyright 1991 John Wiley & Sons, Inc. Print ISBN 0-471-06259-6 Online ISBN 0-471-20061-1

Chapter 4: Entropy Rates of a Stochastic Process

The asymptotic equipartition property in Chapter 3 establishes that $nH(X)$ bits suffice on the average to describe $n$ independent and identically distributed random variables. But what if the random variables are dependent? In particular, what if the random variables form a stationary process? We will show, just as in the i.i.d. case, that the entropy $H(X_1, X_2, \ldots, X_n)$ grows (asymptotically) linearly with $n$ at a rate $H(\mathcal{X})$, which we will call the entropy rate of the process. The interpretation of $H(\mathcal{X})$ as the best achievable data compression will await the analysis in Chapter 5.

4.1 MARKOV CHAINS

A stochastic process is an indexed sequence of random variables. In general, there can be an arbitrary dependence among the random variables. The process is characterized by the joint probability mass functions $\Pr\{(X_1, X_2, \ldots, X_n) = (x_1, x_2, \ldots, x_n)\} = p(x_1, x_2, \ldots, x_n)$, $(x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$, for $n = 1, 2, \ldots$.

Definition: A stochastic process is said to be stationary if the joint distribution of any subset of the sequence of random variables is invariant with respect to shifts in the time index, i.e.,

$$\Pr\{X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n\} = \Pr\{X_{1+l} = x_1, X_{2+l} = x_2, \ldots, X_{n+l} = x_n\} \qquad (4.1)$$

for every shift $l$ and for all $x_1, x_2, \ldots, x_n \in \mathcal{X}$.

A simple example of a stochastic process with dependence is one in which each random variable depends on the one preceding it and is conditionally independent of all the other preceding random variables.
Detecting Out-Of-Distribution Inputs to Deep Generative Models Using

DETECTING OUT-OF-DISTRIBUTION INPUTS TO DEEP GENERATIVE MODELS USING TYPICALITY
Eric Nalisnick,∗ Akihiro Matsukawa, Yee Whye Teh, Balaji Lakshminarayanan∗
DeepMind

ABSTRACT
Recent work has shown that deep generative models can assign higher likelihood to out-of-distribution data sets than to their training data (Nalisnick et al., 2019; Choi et al., 2019). We posit that this phenomenon is caused by a mismatch between the model’s typical set and its areas of high probability density. In-distribution inputs should reside in the former but not necessarily in the latter, as previous work has presumed (Bishop, 1994). To determine whether or not inputs reside in the typical set, we propose a statistically principled, easy-to-implement test using the empirical distribution of model likelihoods. The test is model agnostic and widely applicable, only requiring that the likelihood can be computed or closely approximated. We report experiments showing that our procedure can successfully detect the out-of-distribution sets in several of the challenging cases reported by Nalisnick et al. (2019).

1 INTRODUCTION
Recent work (Nalisnick et al., 2019; Choi et al., 2019; Shafaei et al., 2018) showed that a variety of deep generative models fail to distinguish training from out-of-distribution (OOD) data according to the model likelihood. This phenomenon occurs not only when the data sets are similar but also when they have dramatically different underlying semantics. For instance, Glow (Kingma & Dhariwal, 2018), a state-of-the-art normalizing flow, trained on CIFAR-10 will assign a higher likelihood to SVHN than to its CIFAR-10 training data (Nalisnick et al., 2019; Choi et al., 2019).
Lecture 3 -- Convergence and Typical Sets

ECE 587 / STA 563: Lecture 3 – Convergence and Typical Sets
Information Theory, Duke University, Fall 2020
Author: Galen Reeves
Last Modified: September 19, 2020

Outline of lecture:
3.1 Probability Review
3.1.1 Basic Inequalities
3.1.2 Convergence of Random Variables
3.2 The Typical Set and AEP
3.2.1 High Probability Sets
3.2.2 The Typical Set
3.2.3 Examples

3.1 Probability Review

3.1.1 Basic Inequalities

• Markov's Inequality: For any nonnegative random variable $X$ and $t > 0$,
$$\mathbb{P}[X \ge t] \le \frac{\mathbb{E}[X]}{t}$$
• Proof: We have
$$\mathbf{1}(x \ge t) \le \frac{x}{t} \quad \text{for all } x \ge 0.$$
Evaluating this inequality with $X$ and then taking the expectation gives the stated result.
• Chebyshev's Inequality: For any random variable $X$ with finite second moment and $t > 0$,
$$\mathbb{P}\big[|X - \mathbb{E}[X]| > t\big] \le \frac{\mathrm{Var}(X)}{t^2}$$
• Proof: Apply Markov's inequality to $Y = (X - \mathbb{E}[X])^2$:
$$\mathbb{P}\big[|X - \mathbb{E}[X]| > t\big] = \mathbb{P}\big[Y > t^2\big] \le \frac{\mathbb{E}[Y]}{t^2} = \frac{\mathrm{Var}(X)}{t^2}$$
• Chernoff Bound: For any random variable $X$, $t \in \mathbb{R}$, and $\lambda > 0$,
$$\mathbb{P}[X \ge t] \le \exp(-\lambda t)\, \mathbb{E}[\exp(\lambda X)]$$
Here $\mathbb{E}[\exp(\lambda X)]$ is the moment generating function.
• Proof:
$$\mathbb{P}[X \ge t] = \mathbb{P}[\lambda X \ge \lambda t] \qquad \text{since } \lambda > 0$$
$$= \mathbb{P}\big[e^{\lambda X} \ge e^{\lambda t}\big] \qquad \text{since } \exp(\cdot) \text{ is nondecreasing}$$
$$\le e^{-\lambda t}\, \mathbb{E}\big[e^{\lambda X}\big] \qquad \text{Markov's Inq.}$$
• Chernoff Bound for Sums: Let $X_1, X_2, \cdots$ be iid and $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. For any $\lambda > 0$,
$$\mathbb{P}\big[\bar{X}_n \ge t\big] \le \exp(-n\lambda t)\, \mathbb{E}\big[\exp(n\lambda \bar{X}_n)\big] \qquad \text{Chernoff Inq.}$$
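Markov's and Chebyshev's inequalities above can be checked numerically on simulated data. The sketch below is only an illustration under assumed choices: it draws samples from an Exponential(1) distribution (not taken from the lecture) and compares empirical tail fractions with the two bounds, using the sample mean and the population-style sample variance. Because the inequalities hold for any distribution, they also hold exactly for the empirical distribution of the samples. The helper names are hypothetical.

```python
import random
from statistics import fmean, pvariance

rng = random.Random(0)  # fixed seed so the run is repeatable
# Assumed test distribution: nonnegative Exponential(1) samples.
samples = [rng.expovariate(1.0) for _ in range(100_000)]

mean = fmean(samples)
variance = pvariance(samples, mu=mean)


def tail_fraction(values, predicate):
    """Fraction of values satisfying the predicate (an empirical probability)."""
    return sum(1 for v in values if predicate(v)) / len(values)


def markov_check(t):
    """Return (empirical P[X >= t], Markov bound E[X] / t)."""
    return tail_fraction(samples, lambda x: x >= t), mean / t


def chebyshev_check(t):
    """Return (empirical P[|X - E[X]| > t], Chebyshev bound Var(X) / t^2)."""
    return tail_fraction(samples, lambda x: abs(x - mean) > t), variance / t**2


for t in (0.5, 1.0, 2.0, 4.0):
    m_lhs, m_rhs = markov_check(t)
    c_lhs, c_rhs = chebyshev_check(t)
    print(f"t={t}: Markov {m_lhs:.4f} <= {m_rhs:.4f}, "
          f"Chebyshev {c_lhs:.4f} <= {c_rhs:.4f}")
```

Both bounds are typically loose for small t (they can exceed 1), which is one motivation for the exponentially decaying Chernoff bound.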
Universally Typical Sets for Ergodic Sources of Multidimensional Data

Universally Typical Sets for Ergodic Sources of Multidimensional Data
Tyll Krüger, Guido Montufar, Ruedi Seiler and Rainer Siegmund-Schultze
http://arxiv.org/abs/1105.0393

Universal Lossless Encoding Algorithms
• data modeled by stationary/ergodic random process
• lossless: algorithm ensures exact reconstruction
• main idea (Shannon): encode typical but small set
• universal: algorithm does not involve specific properties of random process.
• main idea: construction of universally typical sets.
• small size $e^{nh(\mu)}$ but still
• nearly full measure
• output sequences with higher or smaller probability than $e^{-nh(\mu)}$ will rarely be observed.

Entropy Typical Set
$(x_1^n)$ with $-\frac{1}{n}\log\mu(x_1^n) \sim h(\mu)$
Universally Typical Sets for Ergodic Sources of Multidimensional Data

KYBERNETIKA | MANUSCRIPT PREVIEW
UNIVERSALLY TYPICAL SETS FOR ERGODIC SOURCES OF MULTIDIMENSIONAL DATA
Tyll Krüger, Guido Montúfar, Ruedi Seiler, and Rainer Siegmund-Schultze
arXiv:1105.0393v3 [cs.IT] 12 Nov 2013

We lift important results about universally typical sets, typically sampled sets, and empirical entropy estimation in the theory of samplings of discrete ergodic information sources from the usual one-dimensional discrete-time setting to a multidimensional lattice setting. We use techniques of packings and coverings with multidimensional windows to construct sequences of multidimensional array sets which in the limit build the generated samples of any ergodic source of entropy rate below an $h_0$ with probability one and whose cardinality grows at most at exponential rate $h_0$.

Keywords: universal codes, typical sampling sets, entropy estimation, asymptotic equipartition property, ergodic theory
Classification: 94A24, 62D05, 94A08

1. INTRODUCTION
An entropy-typical set is defined as a set of nearly full measure consisting of output sequences the negative log-probability of which is close to the entropy of the source distribution. The scope of this definition is revealed by the asymptotic equipartition property (AEP), which was introduced by McMillan [8] as the convergence in probability of the sequence $-\frac{1}{n}\log\mu(x_1^n)$ to a constant $h$, namely, the Shannon entropy rate of the process $\mu$ [11]. Many processes have the AEP, as has been shown, e.g., in [1, 2, 8, 9]. In particular, for stationary discrete-time ergodic processes, this property is guaranteed by the Shannon-McMillan (SM) theorem [8] and in the stronger form of almost-sure convergence by the Shannon-McMillan-Breiman (SMB) theorem [2].
Compressing Stationary Ergodic Sources

§ 8. Compressing stationary ergodic sources

We have examined the compression of an i.i.d. sequence $\{S_i\}$, for which

$$\frac{1}{n}\, l(f^*(S^n)) \to H(S) \quad \text{in prob.} \qquad (8.1)$$

$$\lim_{n\to\infty} \epsilon^*(S^n; nR) = \begin{cases} 0 & R > H(S) \\ 1 & R < H(S) \end{cases} \qquad (8.2)$$

In this lecture, we shall examine similar results for ergodic processes and we first state the main theory as follows:

Theorem 8.1 (Shannon-McMillan). Let $\{S_1, S_2, \ldots\}$ be a stationary and ergodic discrete process, then

$$\frac{1}{n}\log\frac{1}{P_{S^n}(S^n)} \xrightarrow{\;P\;} \mathcal{H}, \quad \text{also a.s. and in } L_1 \qquad (8.3)$$

where $\mathcal{H} = \lim_{n\to\infty} \frac{1}{n} H(S^n)$ is the entropy rate.

Corollary 8.1. For any stationary and ergodic discrete process $\{S_1, S_2, \ldots\}$, (8.1)–(8.2) hold with $H(S)$ replaced by $\mathcal{H}$.

Proof. Shannon-McMillan (we only need convergence in probability) + Theorem 6.4 + Theorem 7.1, which tie together the respective CDF of the random variable $l(f^*(S^n))$ and $\log\frac{1}{P_{S^n}(S^n)}$.

In Lecture 7 we learned the asymptotic equipartition property (AEP) for iid sources. Here we generalize it to stationary ergodic sources thanks to Shannon-McMillan.

Corollary 8.2 (AEP for stationary ergodic sources). Let $\{S_1, S_2, \ldots\}$ be a stationary and ergodic discrete process. For any $\delta > 0$, define the set

$$T_n^\delta = \left\{ s^n : \left| \frac{1}{n}\log\frac{1}{P_{S^n}(s^n)} - \mathcal{H} \right| \le \delta \right\}.$$

Then
1. $\mathbb{P}\left[S^n \in T_n^\delta\right] \to 1$ as $n \to \infty$.
2. $2^{n(\mathcal{H}-\delta)}(1 + o(1)) \le |T_n^\delta| \le 2^{(\mathcal{H}+\delta)n}(1 + o(1))$.

Note:
• Convergence in probability for stationary ergodic Markov chains [Shannon 1948]
• Convergence in $L_1$ for stationary ergodic processes [McMillan 1953]
• Convergence almost surely for stationary ergodic processes [Breiman 1956] (Either of the last two results implies the convergence in probability in Theorem 8.1.)
• For a Markov chain, existence of typical sequences can be understood by thinking of the Markov process as a sequence of independent decisions regarding which transitions to take.
6.441 Information Theory, Lecture 7B

Network coding for multicast: relation to compression and generalization of Slepian-Wolf

Overview
• Review of Slepian-Wolf
• Distributed network compression
• Error exponents
• Source-channel separation issues
• Code construction for finite field multiple access networks

Distributed data compression
Consider two correlated sources $(X, Y) \sim p(x, y)$ that must be separately encoded for a user who wants to reconstruct both.
What information transmission rates from each source allow decoding with arbitrarily small probability of error?
[Figure: example with sources $X_1$ and $X_2$, labelled with rates $H(X_1)$ and $H(X_2|X_1)$]

Distributed source code
A $((2^{nR_1}, 2^{nR_2}), n)$ distributed source code for joint source $(X, Y)$ consists of encoder maps
$$f_1 : \mathcal{X}^n \to \{1, 2, \ldots, 2^{nR_1}\}$$
$$f_2 : \mathcal{Y}^n \to \{1, 2, \ldots, 2^{nR_2}\}$$
and a decoder map
$$g : \{1, 2, \ldots, 2^{nR_1}\} \times \{1, 2, \ldots, 2^{nR_2}\} \to \mathcal{X}^n \times \mathcal{Y}^n$$
- $X^n$ is mapped to $f_1(X^n)$
- $Y^n$ is mapped to $f_2(Y^n)$
- $(R_1, R_2)$ is the rate pair of the code

Probability of error
$$P_e^{(n)} = \Pr\{g(f_1(X^n), f_2(Y^n)) \ne (X^n, Y^n)\}$$

Slepian-Wolf
Definitions:
A rate pair $(R_1, R_2)$ is achievable if there exists a sequence of $((2^{nR_1}, 2^{nR_2}), n)$ distributed source codes with probability of error $P_e^{(n)} \to 0$ as $n \to \infty$.
Achievable rate region: closure of the set of achievable rates.

Slepian-Wolf Theorem:
For the distributed source coding problem for source $(X, Y)$ drawn i.i.d. $\sim p(x, y)$, the achievable rate region is
$$R_1 \ge H(X|Y)$$
$$R_2 \ge H(Y|X)$$
$$R_1 + R_2 \ge H(X, Y)$$

Proof of achievability
Main idea: show that if the rate pair is in the Slepian-Wolf region, we can use a random binning encoding scheme with typical set decoding
ECE 534 Information Theory - MIDTERM

ECE 534 Information Theory - MIDTERM
10/02/2013, LH 207.
• This exam has 4 questions, each of which is worth 25 points.
• You will be given the full 1.25 hours. Use it wisely! Many of the problems have short answers; try to find shortcuts. Do questions that you think you can answer correctly first.
• You may bring and use one 8.5x11" double-sided crib sheet.
• No other notes or books are permitted.
• No calculators are permitted.
• Talking, passing notes, copying (and all other forms of cheating) is forbidden.
• Make sure you explain your answers in a way that illustrates your understanding of the problem. Ideas are important, not just the calculation.
• Partial marks will be given.
• Write all answers directly on this exam.

Your name:
Your UIN:
Your signature:

The exam has 4 questions, for a total of 100 points.

Question:  1   2   3   4   Total
Points:    30  20  30  20  100
Score:

ECE534 Fall 2011 MIDTERM Name:

1. Entropy and the typical set
Consider a random variable X which takes on the values −1, 0 and 1 with probabilities p(−1) = 1/4, p(0) = 1/2, p(1) = 1/4. For parts (c) and (d) and (e) and (f), we consider a sequence of 8 i.i.d. throws, or instances of this random variable.

(a) (4 points) Find the entropy in base 4 of this random variable.
Solution:
$$H_4(X) = \frac{1}{4}\log_4(4) + \frac{1}{2}\log_4(2) + \frac{1}{4}\log_4(4) = \frac{1}{4} + \frac{1}{2}\cdot\frac{1}{2} + \frac{1}{4} = 0.75$$

(b) (4 points) Find the entropy in base 2 of this random variable.
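The base-4 result in part (a) can be checked numerically, and the same function gives the entropy in any other base. The sketch below is a minimal illustration; the function name `entropy` and its `base` parameter are assumptions, not part of the exam. Changing the logarithm base only rescales the entropy, so the base-2 value is the base-4 value multiplied by $\log_2 4 = 2$.

```python
from math import log


def entropy(probabilities, base=2.0):
    """Entropy of a probability mass function, in units set by the log base."""
    return -sum(p * log(p, base) for p in probabilities if p > 0)


# pmf from the exam question: p(-1) = 1/4, p(0) = 1/2, p(1) = 1/4
pmf = [0.25, 0.5, 0.25]

h4 = entropy(pmf, base=4)
h2 = entropy(pmf, base=2)
print("entropy in base 4:", round(h4, 6))
print("entropy in base 2:", round(h2, 6))
print("ratio h2 / h4:", round(h2 / h4, 6))
```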
EE 376A: Information Theory Lecture Notes

EE 376A: Information Theory Lecture Notes
Prof. Tsachy Weissman
TA: Idoia Ochoa, Kedar Tatwawadi
February 25, 2016

Contents
1 Introduction
  1.1 Lossless Compression
  1.2 Channel Coding
  1.3 Lossy Compression
2 Entropy, Relative Entropy, and Mutual Information
  2.1 Entropy
  2.2 Conditional and Joint Entropy
  2.3 Mutual Information
3 Asymptotic Equipartition Properties
  3.1 Asymptotic Equipartition Property (AEP)
  3.2 Fixed Length (Near) Lossless Compression
4 Lossless Compression
  4.1 Uniquely decodable codes and prefix codes
  4.2 Prefix code for dyadic distributions
  4.3 Shannon codes
  4.4 Average codelength bound for uniquely decodable codes
  4.5 Huffman Coding
  4.6 Optimality of Huffman codes
5 Communication and Channel Capacity
  5.1 The Communication Problem
  5.2 Channel Capacity
    5.2.1 Channel Capacity of various discrete channels
    5.2.2 Recap
  5.3 Information Measures for Continuous Random Variables
    5.3.1 Examples
    5.3.2 Gaussian Distribution
  5.4 Channel Capacity of the AWGN Channel (Additive White Gaussian Noise)
    5.4.1 Channel Coding Theorem for this Setting
    5.4.2 An Aside: Cost Constraint
    5.4.3 The Example
    5.4.4 Rough Geometric Interpretation (Picture)
  5.5 Joint Asymptotic Equipartition Property (AEP)
    5.5.1 Set of Jointly Typical Sequences
  5.6 Direct Theorem
  5.7 Fano’s Inequality
  5.8 Converse Theorem
  5.9 Some Notes on the Direct and Converse Theorems