Lecture 1: Introduction, Entropy and ML Estimation 1.1 About the Class
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Lecture 3: Entropy, Relative Entropy, and Mutual Information 1 Notation 2
EE376A/STATS376A Information Theory Lecture 3 - 01/16/2018 Lecture 3: Entropy, Relative Entropy, and Mutual Information Lecturer: Tsachy Weissman Scribe: Yicheng An, Melody Guan, Jacob Rebec, John Sholar In this lecture, we will introduce certain key measures of information, that play crucial roles in theoretical and operational characterizations throughout the course. These include the entropy, the mutual information, and the relative entropy. We will also exhibit some key properties exhibited by these information measures. 1 Notation A quick summary of the notation 1. Discrete Random Variable: U 2. Alphabet: U = fu1; u2; :::; uM g (An alphabet of size M) 3. Specific Value: u; u1; etc. For discrete random variables, we will write (interchangeably) P (U = u), PU (u) or most often just, p(u) Similarly, for a pair of random variables X; Y we write P (X = x j Y = y), PXjY (x j y) or p(x j y) 2 Entropy Definition 1. \Surprise" Function: 1 S(u) log (1) , p(u) A lower probability of u translates to a greater \surprise" that it occurs. Note here that we use log to mean log2 by default, rather than the natural log ln, as is typical in some other contexts. This is true throughout these notes: log is assumed to be log2 unless otherwise indicated. Definition 2. Entropy: Let U a discrete random variable taking values in alphabet U. The entropy of U is given by: 1 X H(U) [S(U)] = log = − log (p(U)) = − p(u) log p(u) (2) , E E p(U) E u Where U represents all u values possible to the variable. -
Package 'Infotheo'
Package ‘infotheo’ February 20, 2015 Title Information-Theoretic Measures Version 1.2.0 Date 2014-07 Publication 2009-08-14 Author Patrick E. Meyer Description This package implements various measures of information theory based on several en- tropy estimators. Maintainer Patrick E. Meyer <[email protected]> License GPL (>= 3) URL http://homepage.meyerp.com/software Repository CRAN NeedsCompilation yes Date/Publication 2014-07-26 08:08:09 R topics documented: condentropy . .2 condinformation . .3 discretize . .4 entropy . .5 infotheo . .6 interinformation . .7 multiinformation . .8 mutinformation . .9 natstobits . 10 Index 12 1 2 condentropy condentropy conditional entropy computation Description condentropy takes two random vectors, X and Y, as input and returns the conditional entropy, H(X|Y), in nats (base e), according to the entropy estimator method. If Y is not supplied the function returns the entropy of X - see entropy. Usage condentropy(X, Y=NULL, method="emp") Arguments X data.frame denoting a random variable or random vector where columns contain variables/features and rows contain outcomes/samples. Y data.frame denoting a conditioning random variable or random vector where columns contain variables/features and rows contain outcomes/samples. method The name of the entropy estimator. The package implements four estimators : "emp", "mm", "shrink", "sg" (default:"emp") - see details. These estimators require discrete data values - see discretize. Details • "emp" : This estimator computes the entropy of the empirical probability distribution. • "mm" : This is the Miller-Madow asymptotic bias corrected empirical estimator. • "shrink" : This is a shrinkage estimate of the entropy of a Dirichlet probability distribution. • "sg" : This is the Schurmann-Grassberger estimate of the entropy of a Dirichlet probability distribution. -
Arxiv:1907.00325V5 [Cs.LG] 25 Aug 2020
Random Forests for Adaptive Nearest Neighbor Estimation of Information-Theoretic Quantities Ronan Perry1, Ronak Mehta1, Richard Guo1, Jesús Arroyo1, Mike Powell1, Hayden Helm1, Cencheng Shen1, and Joshua T. Vogelstein1;2∗ Abstract. Information-theoretic quantities, such as conditional entropy and mutual information, are critical data summaries for quantifying uncertainty. Current widely used approaches for computing such quantities rely on nearest neighbor methods and exhibit both strong performance and theoretical guarantees in certain simple scenarios. However, existing approaches fail in high-dimensional settings and when different features are measured on different scales. We propose decision forest-based adaptive nearest neighbor estimators and show that they are able to effectively estimate posterior probabilities, conditional entropies, and mutual information even in the aforementioned settings. We provide an extensive study of efficacy for classification and posterior probability estimation, and prove cer- tain forest-based approaches to be consistent estimators of the true posteriors and derived information-theoretic quantities under certain assumptions. In a real-world connectome application, we quantify the uncertainty about neuron type given various cellular features in the Drosophila larva mushroom body, a key challenge for modern neuroscience. 1 Introduction Uncertainty quantification is a fundamental desiderata of statistical inference and data science. In supervised learning settings it is common to quantify uncertainty with either conditional en- tropy or mutual information (MI). Suppose we are given a pair of random variables (X; Y ), where X is d-dimensional vector-valued and Y is a categorical variable of interest. Conditional entropy H(Y jX) measures the uncertainty in Y on average given X. On the other hand, mutual information quantifies the shared information between X and Y . -
GAIT: a Geometric Approach to Information Theory
GAIT: A Geometric Approach to Information Theory Jose Gallego Ankit Vani Max Schwarzer Simon Lacoste-Julieny Mila and DIRO, Université de Montréal Abstract supports. Optimal transport distances, such as Wasser- stein, have emerged as practical alternatives with theo- retical grounding. These methods have been used to We advocate the use of a notion of entropy compute barycenters (Cuturi and Doucet, 2014) and that reflects the relative abundances of the train generative models (Genevay et al., 2018). How- symbols in an alphabet, as well as the sim- ever, optimal transport is computationally expensive ilarities between them. This concept was as it generally lacks closed-form solutions and requires originally introduced in theoretical ecology to the solution of linear programs or the execution of study the diversity of ecosystems. Based on matrix scaling algorithms, even when solved only in this notion of entropy, we introduce geometry- approximate form (Cuturi, 2013). Approaches based aware counterparts for several concepts and on kernel methods (Gretton et al., 2012; Li et al., 2017; theorems in information theory. Notably, our Salimans et al., 2018), which take a functional analytic proposed divergence exhibits performance on view on the problem, have also been widely applied. par with state-of-the-art methods based on However, further exploration on the interplay between the Wasserstein distance, but enjoys a closed- kernel methods and information theory is lacking. form expression that can be computed effi- ciently. We demonstrate the versatility of our Contributions. We i) introduce to the machine learn- method via experiments on a broad range of ing community a similarity-sensitive definition of en- domains: training generative models, comput- tropy developed by Leinster and Cobbold(2012). -
Information Content and Error Analysis
43 INFORMATION CONTENT AND ERROR ANALYSIS Clive D Rodgers Atmospheric, Oceanic and Planetary Physics University of Oxford ESA Advanced Atmospheric Training Course September 15th – 20th, 2008 44 INFORMATION CONTENT OF A MEASUREMENT Information in a general qualitative sense: Conceptually, what does y tell you about x? We need to answer this to determine if a conceptual instrument design actually works • to optimise designs • Use the linear problem for simplicity to illustrate the ideas. y = Kx + ! 45 SHANNON INFORMATION The Shannon information content of a measurement of x is the change in the entropy of the • probability density function describing our knowledge of x. Entropy is defined by: • S P = P (x) log(P (x)/M (x))dx { } − Z M(x) is a measure function. We will take it to be constant. Compare this with the statistical mechanics definition of entropy: • S = k p ln p − i i i X P (x)dx corresponds to pi. 1/M (x) is a kind of scale for dx. The Shannon information content of a measurement is the change in entropy between the p.d.f. • before, P (x), and the p.d.f. after, P (x y), the measurement: | H = S P (x) S P (x y) { }− { | } What does this all mean? 46 ENTROPY OF A BOXCAR PDF Consider a uniform p.d.f in one dimension, constant in (0,a): P (x)=1/a 0 <x<a and zero outside.The entropy is given by a 1 1 S = ln dx = ln a − a a Z0 „ « Similarly, the entropy of any constant pdf in a finite volume V of arbitrary shape is: 1 1 S = ln dv = ln V − V V ZV „ « i.e the entropy is the log of the volume of state space occupied by the p.d.f. -
On a General Definition of Conditional Rényi Entropies
proceedings Proceedings On a General Definition of Conditional Rényi Entropies † Velimir M. Ili´c 1,*, Ivan B. Djordjevi´c 2 and Miomir Stankovi´c 3 1 Mathematical Institute of the Serbian Academy of Sciences and Arts, Kneza Mihaila 36, 11000 Beograd, Serbia 2 Department of Electrical and Computer Engineering, University of Arizona, 1230 E. Speedway Blvd, Tucson, AZ 85721, USA; [email protected] 3 Faculty of Occupational Safety, University of Niš, Carnojevi´ca10a,ˇ 18000 Niš, Serbia; [email protected] * Correspondence: [email protected] † Presented at the 4th International Electronic Conference on Entropy and Its Applications, 21 November–1 December 2017; Available online: http://sciforum.net/conference/ecea-4. Published: 21 November 2017 Abstract: In recent decades, different definitions of conditional Rényi entropy (CRE) have been introduced. Thus, Arimoto proposed a definition that found an application in information theory, Jizba and Arimitsu proposed a definition that found an application in time series analysis and Renner-Wolf, Hayashi and Cachin proposed definitions that are suitable for cryptographic applications. However, there is still no a commonly accepted definition, nor a general treatment of the CRE-s, which can essentially and intuitively be represented as an average uncertainty about a random variable X if a random variable Y is given. In this paper we fill the gap and propose a three-parameter CRE, which contains all of the previous definitions as special cases that can be obtained by a proper choice of the parameters. Moreover, it satisfies all of the properties that are simultaneously satisfied by the previous definitions, so that it can successfully be used in aforementioned applications. -
Chapter 4 Information Theory
Chapter Information Theory Intro duction This lecture covers entropy joint entropy mutual information and minimum descrip tion length See the texts by Cover and Mackay for a more comprehensive treatment Measures of Information Information on a computer is represented by binary bit strings Decimal numb ers can b e represented using the following enco ding The p osition of the binary digit 3 2 1 0 Bit Bit Bit Bit Decimal Table Binary encoding indicates its decimal equivalent such that if there are N bits the ith bit represents N i the decimal numb er Bit is referred to as the most signicant bit and bit N as the least signicant bit To enco de M dierent messages requires log M bits 2 Signal Pro cessing Course WD Penny April Entropy The table b elow shows the probability of o ccurrence px to two decimal places of i selected letters x in the English alphab et These statistics were taken from Mackays i b o ok on Information Theory The table also shows the information content of a x px hx i i i a e j q t z Table Probability and Information content of letters letter hx log i px i which is a measure of surprise if we had to guess what a randomly chosen letter of the English alphab et was going to b e wed say it was an A E T or other frequently o ccuring letter If it turned out to b e a Z wed b e surprised The letter E is so common that it is unusual to nd a sentence without one An exception is the page novel Gadsby by Ernest Vincent Wright in which -
Noisy Channel Coding
Noisy Channel Coding: Correlated Random Variables & Communication over a Noisy Channel Toni Hirvonen Helsinki University of Technology Laboratory of Acoustics and Audio Signal Processing [email protected] T-61.182 Special Course in Information Science II / Spring 2004 1 Contents • More entropy definitions { joint & conditional entropy { mutual information • Communication over a noisy channel { overview { information conveyed by a channel { noisy channel coding theorem 2 Joint Entropy Joint entropy of X; Y is: 1 H(X; Y ) = P (x; y) log P (x; y) xy X Y 2AXA Entropy is additive for independent random variables: H(X; Y ) = H(X) + H(Y ) iff P (x; y) = P (x)P (y) 3 Conditional Entropy Conditional entropy of X given Y is: 1 1 H(XjY ) = P (y) P (xjy) log = P (x; y) log P (xjy) P (xjy) y2A "x2A # y2A A XY XX XX Y It measures the average uncertainty (i.e. information content) that remains about x when y is known. 4 Mutual Information Mutual information between X and Y is: I(Y ; X) = I(X; Y ) = H(X) − H(XjY ) ≥ 0 It measures the average reduction in uncertainty about x that results from learning the value of y, or vice versa. Conditional mutual information between X and Y given Z is: I(Y ; XjZ) = H(XjZ) − H(XjY; Z) 5 Breakdown of Entropy Entropy relations: Chain rule of entropy: H(X; Y ) = H(X) + H(Y jX) = H(Y ) + H(XjY ) 6 Noisy Channel: Overview • Real-life communication channels are hopelessly noisy i.e. introduce transmission errors • However, a solution can be achieved { the aim of source coding is to remove redundancy from the source data -
Shannon Entropy and Kolmogorov Complexity
Information and Computation: Shannon Entropy and Kolmogorov Complexity Satyadev Nandakumar Department of Computer Science. IIT Kanpur October 19, 2016 This measures the average uncertainty of X in terms of the number of bits. Shannon Entropy Definition Let X be a random variable taking finitely many values, and P be its probability distribution. The Shannon Entropy of X is X 1 H(X ) = p(i) log : 2 p(i) i2X Shannon Entropy Definition Let X be a random variable taking finitely many values, and P be its probability distribution. The Shannon Entropy of X is X 1 H(X ) = p(i) log : 2 p(i) i2X This measures the average uncertainty of X in terms of the number of bits. The Triad Figure: Claude Shannon Figure: A. N. Kolmogorov Figure: Alan Turing Just Electrical Engineering \Shannon's contribution to pure mathematics was denied immediate recognition. I can recall now that even at the International Mathematical Congress, Amsterdam, 1954, my American colleagues in probability seemed rather doubtful about my allegedly exaggerated interest in Shannon's work, as they believed it consisted more of techniques than of mathematics itself. However, Shannon did not provide rigorous mathematical justification of the complicated cases and left it all to his followers. Still his mathematical intuition is amazingly correct." A. N. Kolmogorov, as quoted in [Shi89]. Kolmogorov and Entropy Kolmogorov's later work was fundamentally influenced by Shannon's. 1 Foundations: Kolmogorov Complexity - using the theory of algorithms to give a combinatorial interpretation of Shannon Entropy. 2 Analogy: Kolmogorov-Sinai Entropy, the only finitely-observable isomorphism-invariant property of dynamical systems. -
Understanding Shannon's Entropy Metric for Information
Understanding Shannon's Entropy metric for Information Sriram Vajapeyam [email protected] 24 March 2014 1. Overview Shannon's metric of "Entropy" of information is a foundational concept of information theory [1, 2]. Here is an intuitive way of understanding, remembering, and/or reconstructing Shannon's Entropy metric for information. Conceptually, information can be thought of as being stored in or transmitted as variables that can take on different values. A variable can be thought of as a unit of storage that can take on, at different times, one of several different specified values, following some process for taking on those values. Informally, we get information from a variable by looking at its value, just as we get information from an email by reading its contents. In the case of the variable, the information is about the process behind the variable. The entropy of a variable is the "amount of information" contained in the variable. This amount is determined not just by the number of different values the variable can take on, just as the information in an email is quantified not just by the number of words in the email or the different possible words in the language of the email. Informally, the amount of information in an email is proportional to the amount of “surprise” its reading causes. For example, if an email is simply a repeat of an earlier email, then it is not informative at all. On the other hand, if say the email reveals the outcome of a cliff-hanger election, then it is highly informative. -
Lecture 11: Channel Coding Theorem: Converse Part 1 Recap
EE376A/STATS376A Information Theory Lecture 11 - 02/13/2018 Lecture 11: Channel Coding Theorem: Converse Part Lecturer: Tsachy Weissman Scribe: Erdem Bıyık In this lecture1, we will continue our discussion on channel coding theory. In the previous lecture, we proved the direct part of the theorem, which suggests if R < C(I), then R is achievable. Now, we are going to prove the converse statement: If R > C(I), then R is not achievable. We will also state some important notes about the direct and converse parts. 1 Recap: Communication Problem Recall the communication problem: Memoryless Channel n n J ∼ Uniff1; 2;:::;Mg −! encoder −!X P (Y jX) −!Y decoder −! J^ log M bits • Rate = R = n channel use • Probability of error = Pe = P (J^ 6= J) (I) (I) The main result is C = C = maxPX I(X; Y ). Last week, we showed R is achievable if R < C . In this lecture, we are going to prove that if R > C(I), then R is not achievable. 2 Fano's Inequality Theorem (Fano's Inequality). Let X be a discrete random variable and X^ = X^(Y ) be a guess of X based on Y . Let Pe = P (X 6= X^). Then, H(XjY ) ≤ h2(Pe) + Pe log(jX j − 1) where h2 is the binary entropy function. ^ Proof. Let V = 1fX6=X^ g, i.e. V is 1 if X 6= X and 0 otherwise. H(XjY ) ≤ H(X; V jY ) (1) = H(V jY ) + H(XjV; Y ) (2) ≤ H(V ) + H(XjV; Y ) (3) X = H(V ) + H(XjV =v; Y =y)P (V =v; Y =y) (4) v;y X X = H(V ) + H(XjV =0;Y =y)P (V =0;Y =y) + H(XjV =1;Y =y)P (V =1;Y =y) (5) y y X = H(V ) + H(XjV =1;Y =y)P (V =1;Y =y) (6) y X ≤ H(V ) + log(jX j − 1) P (V =1;Y =y) (7) y = H(V ) + log(jX j − 1)P (V =1) (8) = h2(Pe) + Pe log(jX j − 1) (9) 1Reading: Chapter 7.9 and 7.12 of Cover, Thomas M., and Joy A. -
Information Theory and Maximum Entropy 8.1 Fundamentals of Information Theory
NEU 560: Statistical Modeling and Analysis of Neural Data Spring 2018 Lecture 8: Information Theory and Maximum Entropy Lecturer: Mike Morais Scribes: 8.1 Fundamentals of Information theory Information theory started with Claude Shannon's A mathematical theory of communication. The first building block was entropy, which he sought as a functional H(·) of probability densities with two desired properties: 1. Decreasing in P (X), such that if P (X1) < P (X2), then h(P (X1)) > h(P (X2)). 2. Independent variables add, such that if X and Y are independent, then H(P (X; Y )) = H(P (X)) + H(P (Y )). These are only satisfied for − log(·). Think of it as a \surprise" function. Definition 8.1 (Entropy) The entropy of a random variable is the amount of information needed to fully describe it; alternate interpretations: average number of yes/no questions needed to identify X, how uncertain you are about X? X H(X) = − P (X) log P (X) = −EX [log P (X)] (8.1) X Average information, surprise, or uncertainty are all somewhat parsimonious plain English analogies for entropy. There are a few ways to measure entropy for multiple variables; we'll use two, X and Y . Definition 8.2 (Conditional entropy) The conditional entropy of a random variable is the entropy of one random variable conditioned on knowledge of another random variable, on average. Alternative interpretations: the average number of yes/no questions needed to identify X given knowledge of Y , on average; or How uncertain you are about X if you know Y , on average? X X h X i H(X j Y ) = P (Y )[H(P (X j Y ))] = P (Y ) − P (X j Y ) log P (X j Y ) Y Y X X = = − P (X; Y ) log P (X j Y ) X;Y = −EX;Y [log P (X j Y )] (8.2) Definition 8.3 (Joint entropy) X H(X; Y ) = − P (X; Y ) log P (X; Y ) = −EX;Y [log P (X; Y )] (8.3) X;Y 8-1 8-2 Lecture 8: Information Theory and Maximum Entropy • Bayes' rule for entropy H(X1 j X2) = H(X2 j X1) + H(X1) − H(X2) (8.4) • Chain rule of entropies n X H(Xn;Xn−1; :::X1) = H(Xn j Xn−1; :::X1) (8.5) i=1 It can be useful to think about these interrelated concepts with a so-called information diagram.