Justifications of Shannon Entropy and Mutual Information in Statistical Inference

EE378A Lecture Notes, Spring 2017, Stanford University
April 19, 2017

Let $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$ be a finite set with $|\mathcal{X}| = n$. Let $\Gamma_n$ denote the set of probability measures on $\mathcal{X}$, and let $\bar{\mathbb{R}}$ denote the extended real line. We recall the definition of the logarithmic loss as follows.

Definition 1 (Logarithmic loss). The logarithmic loss $\ell_{\log} : \mathcal{X} \times \Gamma_n \to \bar{\mathbb{R}}$ is defined by
$$\ell_{\log}(x, P) = \ln \frac{1}{P(x)}, \tag{1}$$
where $P(x)$ denotes the probability of $x$ under the measure $P$.

In this lecture, we go over some elegant results justifying the use of the logarithmic loss in statistical inference. Recall the following lemma, which was introduced in Lecture 2.

Lemma 1. The true distribution minimizes the expected logarithmic loss:
$$P = \arg\min_{Q} \, \mathbb{E}_P \ln \frac{1}{Q(X)}, \tag{2}$$
and
$$H(P) = \min_{Q} \, \mathbb{E}_P \ln \frac{1}{Q(X)}. \tag{3}$$

Proof. The result follows from the observation that for any $Q \in \Gamma_n$,
$$\begin{aligned}
\mathbb{E}_P \ln \frac{1}{Q(X)} - \mathbb{E}_P \ln \frac{1}{P(X)}
&= \mathbb{E}_P \ln \frac{P(X)}{Q(X)} && (4)\\
&= \sum_{x \in \mathcal{X}} P(x) \ln \frac{P(x)}{Q(x)} && (5)\\
&= \sum_{x \in \mathcal{X}} P(x) \left( -\ln \frac{Q(x)}{P(x)} \right) && (6)\\
&\geq -\ln \left( \sum_{x \in \mathcal{X}} P(x) \, \frac{Q(x)}{P(x)} \right) && (7)\\
&= 0, && (8)
\end{aligned}$$
where we applied Jensen's inequality to the convex function $-\ln x$ on $(0, \infty)$.

Lemma 1 is the foundational result behind the use of the cross-entropy loss in machine learning. A natural question arises: is the logarithmic loss the only loss function with the property that the true distribution minimizes the expected loss? Perhaps surprisingly, the answer is yes under a natural "locality" constraint. The following result is also known as the statement that the logarithmic scoring rule is the unique local proper scoring rule.

Theorem 1. [1] Suppose $n \geq 3$. Then the only function $f$ such that, for every $P$,
$$P \in \arg\min_{Q} \, \mathbb{E}_P f(Q(X)) \tag{9}$$
is of the form
$$f(p) = c \ln \frac{1}{p} + b \quad \text{for all } p \in (0, 1), \tag{10}$$
where $b, c \geq 0$ are constants.

Theorem 1 does not hold if $n = 2$. We refer to [2] for a detailed discussion of the dichotomy between binary and non-binary alphabets.

1 Justification of the Shannon entropy (and the logarithmic loss)

Csiszár [3] provided a survey of axiomatic characterizations of information measures up to 2008. Csiszár stated that "the intuitively most appealing axiomatic result is due to Aczél–Forte–Ng [4]", which characterized the Shannon entropy as a natural measure of uncertainty. We quote their result below.

Theorem 2. [4] Let $K(P)$ be a general functional of a discrete distribution $P$. Assume the following axioms:

1. Subadditivity: $K(P_{XY}) \leq K(P_X) + K(P_Y)$; (11)
2. Additivity: when $X$ and $Y$ are independent, $K(P_{XY}) = K(P_X) + K(P_Y)$; (12)
3. Expansibility: $K(p_1, p_2, \ldots, p_m, 0) = K(p_1, p_2, \ldots, p_m)$; (13)
4. Symmetry: for all permutations $\pi$, $K(p_1, p_2, \ldots, p_m) = K(p_{\pi(1)}, p_{\pi(2)}, \ldots, p_{\pi(m)})$; (14)
5. Normalization: $K(1/2, 1/2) = 1$; (15)
6. Small for small probabilities: $\lim_{q \to 0^+} K(1-q, q) = 0$. (16)

Then the Shannon entropy $H(P) = \sum_{x \in \mathcal{X}} -P(x) \ln P(x)$ is the only functional satisfying the axioms above.
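As a quick numerical sanity check (ours, not part of the original notes), the short Python sketch below verifies Lemma 1 and the first two axioms of Theorem 2 on randomly drawn distributions. The helper names `entropy` and `random_pmf` are hypothetical, and entropies are computed in nats to match the $\ln$ convention above, so the normalization axiom (15) is not exercised.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    """Shannon entropy in nats: H(P) = -sum_x P(x) ln P(x), with 0 ln 0 := 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def random_pmf(k):
    """A random probability vector of length k (hypothetical helper)."""
    v = rng.random(k)
    return v / v.sum()

# Lemma 1: E_P[ln 1/Q(X)] >= H(P) for every Q, with equality at Q = P.
P = random_pmf(5)
H_P = entropy(P)
for _ in range(1000):
    Q = random_pmf(5)
    assert P @ np.log(1.0 / Q) >= H_P - 1e-12
assert np.isclose(P @ np.log(1.0 / P), H_P)

# Theorem 2, axiom 1 (subadditivity): H(P_XY) <= H(P_X) + H(P_Y).
P_XY = random_pmf(12).reshape(3, 4)
P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)
assert entropy(P_XY.ravel()) <= entropy(P_X) + entropy(P_Y) + 1e-12

# Theorem 2, axiom 2 (additivity): equality when X and Y are independent.
P_indep = np.outer(random_pmf(3), random_pmf(4))
assert np.isclose(entropy(P_indep.ravel()),
                  entropy(P_indep.sum(axis=1)) + entropy(P_indep.sum(axis=0)))

print("Lemma 1 and axioms 1-2 of Theorem 2 hold on these random examples.")
```

Every assertion encodes one of the inequalities or identities stated above, so the script should run to completion without raising.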
2 Justification of mutual information (and the logarithmic loss)

For the rest of the lecture, we present a characterization of mutual information that justifies the use of the logarithmic loss from another perspective. Concretely, we ask the following question: suppose $X$ and $Y$ are dependent random variables; how relevant is $Y$ for inference on $X$?

Toward answering this question, let $\ell : \mathcal{X} \times \hat{\mathcal{X}} \to \bar{\mathbb{R}}$ be an arbitrary loss function with reconstruction alphabet $\hat{\mathcal{X}}$, where $\hat{\mathcal{X}}$ is arbitrary. Given $(X, Y) \sim P_{XY}$, it is natural to quantify the benefit of the additional side information $Y$ by the difference between the expected losses in estimating $X \in \mathcal{X}$ with and without the side information $Y$, respectively. This motivates the following definition:
$$C(\ell, P_{XY}) \triangleq \inf_{\hat{x}_1 \in \hat{\mathcal{X}}} \mathbb{E}_P[\ell(X, \hat{x}_1)] - \inf_{\hat{X}_2(Y)} \mathbb{E}_P[\ell(X, \hat{X}_2)], \tag{17}$$
where $\hat{x}_1 \in \hat{\mathcal{X}}$ is deterministic, and $\hat{X}_2 = \hat{X}_2(Y) \in \hat{\mathcal{X}}$ is any measurable function of $Y$.

In the following discussion, we require that indeterminate forms such as $\infty - \infty$ do not appear in the definition of $C(\ell, P_{XY})$. By taking $Y$ to be independent of $X$, this requirement implies that for all $P \in \Gamma_n$,
$$\inf_{\hat{x}_1 \in \hat{\mathcal{X}}} \mathbb{E}_P[\ell(X, \hat{x}_1)] < \infty. \tag{18}$$

The formulation (17) has appeared previously in the statistics literature. DeGroot [5] in 1962 defined the information contained in an experiment, which turns out to be equivalent to (17). Later, Dawid [6] defined the coherent dependence function, which is also equivalent to (17), and used it to quantify the dependence between two random variables $X$ and $Y$. Our framework of quantifying the predictive benefit of side information is closely connected to the notion of proper scoring rules and the literature on probability forecasting in statistics; the survey by Gneiting and Raftery [7] provides a good overview.

Having introduced the yardstick in (17), we now reformulate the question of interest: which loss function(s) $\ell$ can be used to define $C(\ell, P_{XY})$ in a meaningful way? Of course, "meaningful" is open to interpretation, but it is desirable that $C(\ell, P_{XY})$ be well-defined, at a minimum. This motivates the following axiom:

Axiom 1 (Data Processing Axiom). For all distributions $P_{XY}$, the quantity $C(\ell, P_{XY})$ satisfies $C(\ell, P_{TY}) \leq C(\ell, P_{XY})$ whenever $T(X) \in \mathcal{X}$ is a statistically sufficient transformation of $X$ for $Y$.

We remind the reader that the statement "$T$ is a statistically sufficient transformation of $X$ for $Y$" means that the following two Markov chains hold:
$$T - X - Y, \qquad X - T - Y. \tag{19}$$
That is, $T(X)$ preserves all of the information $X$ contains about $Y$. In words, the Data Processing Axiom stipulates that processing the data $X \to T$ cannot boost the predictive benefit of the side information. (In fact, the Data Processing Axiom is weaker than this general data-processing statement, since it only addresses statistically sufficient transformations of $X$.)

To convince the reader that the Data Processing Axiom is a natural requirement, suppose instead that it did not hold. Since $X$ and $T$ are mutually sufficient statistics for $Y$, this would imply that there is no unique value quantifying the benefit of the side information $Y$ for the random variable of interest. Thus, the Data Processing Axiom is needed for the benefit of side information to be well-defined.

Although the Data Processing Axiom may seem to be a benign requirement, it has far-reaching implications for the form $C(\ell, P_{XY})$ can take. This is captured by our first main result:

Theorem 3. [8] Let $n \geq 3$. Under the Data Processing Axiom, the function $C(\ell, P_{XY})$ is uniquely determined by the mutual information,
$$C(\ell, P_{XY}) = I(X; Y), \tag{20}$$
up to a multiplicative factor.
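Before turning to the proof, here is a minimal illustration (ours, not from the notes) of Theorem 3 for the specific choice $\ell = \ell_{\log}$: the reconstruction alphabet is then $\Gamma_n$, Lemma 1 evaluates both infima in (17) in closed form as $H(X)$ and $H(X \mid Y)$, and the resulting $C(\ell_{\log}, P_{XY})$ equals $I(X;Y)$ exactly (multiplicative factor 1). The random joint distribution and the helper `entropy` below are our own choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy(p):
    """Shannon entropy in nats, with the convention 0 ln 0 := 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

# A random joint pmf P_XY on a 4 x 3 alphabet (rows: x, columns: y).
P_XY = rng.random((4, 3))
P_XY /= P_XY.sum()
P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)

# For l_log the reconstruction alphabet is Gamma_n, and Lemma 1 gives
# both infima in (17) in closed form:
#   inf_Q        E_P[ln 1/Q(X)]     = H(X)      (best fixed guess: Q = P_X)
#   inf_{Q(.|Y)} E_P[ln 1/Q(X|Y)]   = H(X|Y)    (best guess given Y=y: P_{X|Y=y})
loss_without_side_info = entropy(P_X)
P_X_given_Y = P_XY / P_Y                     # column y holds P_{X|Y=y}
loss_with_side_info = sum(P_Y[y] * entropy(P_X_given_Y[:, y]) for y in range(P_Y.size))

C_log = loss_without_side_info - loss_with_side_info    # definition (17) for l_log

# Mutual information computed directly: I(X;Y) = H(X) + H(Y) - H(X,Y).
I_XY = entropy(P_X) + entropy(P_Y) - entropy(P_XY.ravel())

print(C_log, I_XY)          # the two numbers agree up to floating-point error
assert np.isclose(C_log, I_XY)
```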
We prove Theorem 3 below. To begin, we show that the measure of relevance defined in (17) is equivalently characterized by a bounded convex function defined on the $\mathcal{X}$-simplex. The following lemma achieves this goal.

Lemma 2. There exists a bounded convex function $V : \Gamma_n \to \mathbb{R}$, depending on $\ell$, such that
$$C(\ell, P_{XY}) = \left( \sum_{y} P_Y(y) \, V(P_{X|Y=y}) \right) - V(P_X). \tag{21}$$

The proof of Lemma 2 follows from defining $V(P)$ by
$$V(P) = -\inf_{\hat{x} \in \hat{\mathcal{X}}} \mathbb{E}_P[\ell(X, \hat{x})], \tag{22}$$
and its details are deferred to the appendix. In the statistics literature, the quantity $-V(P)$ is usually called the generalized entropy, or the Bayes envelope.

The next lemma asserts that we only need to consider symmetric (permutation-invariant) functions $V(P)$.

Lemma 3. Under the Data Processing Axiom, there exists a symmetric finite convex function $G : \Gamma_n \to \mathbb{R}$ such that
$$C(\ell, P_{XY}) = \left( \sum_{y} P_Y(y) \, G(P_{X|Y=y}) \right) - G(P_X), \tag{23}$$
and $G(\cdot)$ equals $V(\cdot)$ of Lemma 2 up to a linear translation:
$$G(P) = V(P) + \langle c, P \rangle, \tag{24}$$
where $c \in \mathbb{R}^n$ is a constant vector.

The proof of Lemma 3 follows by applying a permutation to the space $\mathcal{X}$ and invoking the Data Processing Axiom; details are deferred to the appendix.

Now we are in a position to begin the proof of Theorem 3 in earnest. It suffices to consider the case where the side information $Y$ is binary valued, i.e., $Y \in \{1, 2\}$. We will show that the Data Processing Axiom mandates the use of the logarithmic loss even when we constrain ourselves to this situation.

Define $\alpha \triangleq P\{Y = 1\}$. Take $P_{\lambda_1}(t), P_{\lambda_2}(t)$ to be two probability distributions on $\mathcal{X}$ parametrized in the following way:
$$P_{\lambda_1}(t) = (\lambda_1 t, \; \lambda_1 (1-t), \; r - \lambda_1, \; p_4, \ldots, p_n), \tag{25}$$
$$P_{\lambda_2}(t) = (\lambda_2 t, \; \lambda_2 (1-t), \; r - \lambda_2, \; p_4, \ldots, p_n), \tag{26}$$
where $r \triangleq 1 - \sum_{i \geq 4} p_i$, $t \in [0, 1]$, and $0 \leq \lambda_1 < \lambda_2 \leq r$. Taking $P_{X|1} \triangleq P_{\lambda_1}(t)$ and $P_{X|2} \triangleq P_{\lambda_2}(t)$, it follows from Lemma 2 that
$$C(\ell, P_{XY}) = \alpha V(P_{\lambda_1}(t)) + (1-\alpha) V(P_{\lambda_2}(t)) - V\big(\alpha P_{\lambda_1}(t) + (1-\alpha) P_{\lambda_2}(t)\big). \tag{27}$$

Note that the following transformation $T(X)$ is a statistically sufficient transformation of $X$ for $Y$:
$$T(X) = \begin{cases} x_1 & X \in \{x_1, x_2\}, \\ X & \text{otherwise.} \end{cases} \tag{28}$$

The Data Processing Axiom implies that for all $\alpha \in [0, 1]$, $t \in [0, 1]$, and legitimate $\lambda_2 > \lambda_1 \geq 0$,
$$\alpha V(P_{\lambda_1}(t)) + (1-\alpha) V(P_{\lambda_2}(t)) - V\big(\alpha P_{\lambda_1}(t) + (1-\alpha) P_{\lambda_2}(t)\big) = \alpha V(P_{\lambda_1}(1)) + (1-\alpha) V(P_{\lambda_2}(1)) - V\big(\alpha P_{\lambda_1}(1) + (1-\alpha) P_{\lambda_2}(1)\big). \tag{29}$$

We now define the function
$$R(\lambda, t) \triangleq V(P_{\lambda}(t)), \tag{30}$$
where we note that the bivariate function $R(\lambda, t)$ implicitly depends on the parameters $p_4, p_5, \ldots, p_n$, which we shall fix for the rest of this proof.
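To see the construction in action, the sketch below (ours; the specific values of $\alpha$, $\lambda_1$, $\lambda_2$ and $p_4, \ldots, p_n$, and the helpers `entropy`, `V`, `P_lambda`, are arbitrary illustrative choices) instantiates (25)-(27) for the logarithmic loss, for which (22) and Lemma 1 give $V(P) = -H(P)$, and checks that the right-hand side of (27) is constant in $t$, as (29) requires.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, with 0 ln 0 := 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def V(p):
    """V from (22) for l_log: V(P) = -inf_Q E_P[ln 1/Q(X)] = -H(P) by Lemma 1."""
    return -entropy(p)

# Fixed tail probabilities p_4, ..., p_n and the induced r = 1 - sum_{i>=4} p_i.
p_tail = np.array([0.10, 0.15, 0.05])        # illustrative values (ours)
r = 1.0 - p_tail.sum()
lam1, lam2, alpha = 0.2, 0.5, 0.3            # 0 <= lam1 < lam2 <= r; alpha = P{Y=1}

def P_lambda(lam, t):
    """The distribution (lam*t, lam*(1-t), r - lam, p_4, ..., p_n) of (25)-(26)."""
    return np.concatenate(([lam * t, lam * (1.0 - t), r - lam], p_tail))

values = []
for t in np.linspace(0.0, 1.0, 11):
    P1, P2 = P_lambda(lam1, t), P_lambda(lam2, t)
    P_X = alpha * P1 + (1.0 - alpha) * P2    # the marginal of X
    # Right-hand side of (27) with V = -H:
    values.append(alpha * V(P1) + (1.0 - alpha) * V(P2) - V(P_X))

print(np.round(values, 12))                  # every entry coincides: (27) is free of t,
assert np.allclose(values, values[0])        # exactly as (29) demands
```

Of course, the proof proceeds in the converse direction: (29) must hold for every loss satisfying the Data Processing Axiom, and this constraint on the admissible $V$ is what ultimately forces the logarithmic form.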