Lectures on Machine Learning (Fall 2017)
Hyeong In Choi, Seoul National University

Lecture 4: Exponential family of distributions and generalized linear model (GLM)
(Draft: version 0.9.2)

Topics to be covered:

• Exponential family of distributions
• Mean and (canonical) link functions
• Convexity of the log partition function
• Generalized linear model (GLM)
• Various GLM models

1 Exponential family of distributions

In this section, we study a family of probability distributions called the exponential family (of distributions). It is of a special form, but most, if not all, of the well-known probability distributions belong to this class.

1.1 Definition

Definition 1. A probability distribution (PDF or PMF) is said to belong to the exponential family of distributions (in natural or canonical form) if it is of the form

$$P_\theta(y) = \frac{1}{Z(\theta)}\, h(y)\, e^{\theta \cdot T(y)}, \tag{1}$$

where $y = (y_1, \cdots, y_m)$ is a point in $\mathbb{R}^m$; $\theta = (\theta_1, \cdots, \theta_k) \in \mathbb{R}^k$ is a parameter called the canonical (natural) parameter; $T : \mathbb{R}^m \to \mathbb{R}^k$ is a map $T(y) = (T_1(y), \cdots, T_k(y))$; and $Z(\theta) = \int h(y)\, e^{\theta \cdot T(y)}\, dy$ is called the partition function, while its logarithm, $A(\theta) = \log Z(\theta)$, is called the log partition (cumulant) function.

Remark. In this lecture and throughout this course, the "dot" notation as in $\theta \cdot T(y)$ always means the inner (dot) product of two vectors.

Equivalently, $P_\theta(y)$ can be written in the form

$$P_\theta(y) = \exp\big[\theta \cdot T(y) - A(\theta) + C(y)\big], \tag{2}$$

where $C(y) = \log h(y)$. More generally, one sometimes introduces an extra parameter $\phi$, called the dispersion parameter, to control the shape of $P_\theta(y)$ by

$$P_\theta(y) = \exp\left[\frac{\theta \cdot T(y) - A(\theta)}{\phi} + C(y, \phi)\right].$$

1.2 Standing assumption

In our discussion of the exponential family of distributions, we always assume the following:

• In case $P_\theta(y)$ is a probability density function (PDF), it is assumed to be continuous as a function of $y$. This means that there is no singularity in the probability measure $P_\theta(y)$.
• In case $P_\theta(y)$ is a probability mass function (PMF), the set of discrete values on which $P_\theta(y)$ is supported is the same for all $\theta$.

If $P_\theta(y)$ satisfies either condition, we say it is regular, which is always assumed throughout this course.

Remark. Sometimes people use the more general form

$$P_\theta(y) = \frac{1}{Z(\theta)}\, h(y)\, e^{\eta(\theta) \cdot T(y)}.$$

But most of the time the same results can be obtained without the use of the general form $\eta(\theta)$, so it is not much of a loss of generality to stick to our convention of using just $\theta$.

1.3 Examples

Let us now look at a few illustrative examples.

(1) Bernoulli($\mu$)

The Bernoulli distribution is perhaps the simplest member of the exponential family. Let $Y$ be a random variable taking its binary value in $\{0, 1\}$, and let $\mu = P[Y = 1]$. Its distribution (in fact, the probability mass function, PMF) is then succinctly written as

$$P(y) = \mu^y (1 - \mu)^{1-y}.$$

Then

$$P(y) = \mu^y (1 - \mu)^{1-y} = \exp\big[y \log \mu + (1 - y) \log(1 - \mu)\big] = \exp\left[y \log \frac{\mu}{1 - \mu} + \log(1 - \mu)\right].$$

Letting $T(y) = y$ and $\theta = \log \dfrac{\mu}{1 - \mu}$, and recalling the definitions of the logit and $\sigma$ functions given in Lecture 3, we have

$$\mathrm{logit}(\mu) = \log \frac{\mu}{1 - \mu}.$$

Thus

$$\theta = \mathrm{logit}(\mu). \tag{3}$$

Its inverse function is the sigmoid function:

$$\mu = \sigma(\theta), \tag{4}$$

where

$$\sigma(\theta) = \frac{1}{1 + e^{-\theta}}.$$

Therefore, we have

$$A(\theta) = -\log(1 - \mu) = \log(1 + e^{\theta}). \tag{5}$$

Thus $P_\theta(y)$, written in canonical form as in (2), becomes

$$P_\theta(y) = \exp\big[\theta \cdot y - \log(1 + e^{\theta})\big].$$
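Equations (2)–(5) are easy to verify numerically. Below is a minimal Python sketch (our illustration, not part of the original notes; the helper names are ours) checking that the canonical form agrees with the usual PMF:

```python
import numpy as np

def logit(mu):
    """Canonical parameter of Bernoulli(mu): theta = log(mu / (1 - mu)), as in (3)."""
    return np.log(mu / (1.0 - mu))

def sigmoid(theta):
    """Inverse of the logit, as in (4): mu = 1 / (1 + e^(-theta))."""
    return 1.0 / (1.0 + np.exp(-theta))

def pmf_standard(y, mu):
    """Standard form: P(y) = mu^y (1 - mu)^(1 - y)."""
    return mu**y * (1.0 - mu)**(1 - y)

def pmf_canonical(y, theta):
    """Canonical form (2): P_theta(y) = exp(theta * y - A(theta)),
    with T(y) = y and A(theta) = log(1 + e^theta) as in (5)."""
    return np.exp(theta * y - np.log(1.0 + np.exp(theta)))

mu = 0.3
theta = logit(mu)
for y in (0, 1):
    assert np.isclose(pmf_standard(y, mu), pmf_canonical(y, theta))
print(theta, sigmoid(theta))  # theta = logit(0.3) ~ -0.847; sigmoid recovers mu = 0.3
```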
(2) Exponential distribution

The exponential distribution models the waiting time between independent arrivals. Its distribution (the probability density function, PDF) is given as

$$P_\theta(y) = \theta\, e^{-\theta y}\, \mathbb{I}(y \ge 0).$$

To put it in the exponential family form, we use the same $\theta$ as the canonical parameter and we let $T(y) = -y$ and $h(y) = \mathbb{I}(y \ge 0)$. Since

$$Z(\theta) = \frac{1}{\theta} = \int e^{-\theta y}\, \mathbb{I}(y \ge 0)\, dy,$$

it is already in the canonical form given in (1).

(3) Normal distribution

The normal (Gaussian) distribution, given by

$$P(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y - \mu)^2}{2\sigma^2}\right],$$

is the single most well-known distribution. As far as its relation to the exponential family is concerned, there are two views. (Here, this $\sigma$ is a number, not the sigmoid function.)

• 1st view ($\sigma^2$ as a dispersion parameter). This is the case when the main focus is the mean. In this case, the variance $\sigma^2$ is regarded as known, or as a parameter that can be fiddled with as if known. Writing $P(y)$ in the form

$$P_\theta(y) = \exp\left[\frac{-\frac{1}{2} y^2 + y\mu - \frac{1}{2}\mu^2}{\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)\right],$$

one can see right away that it is in the form of the exponential family if we set

$$\theta = (\theta_1, \theta_2) = (\mu, 1), \qquad T(y) = \left(y,\, -\tfrac{1}{2} y^2\right), \qquad \phi = \sigma^2,$$

$$A(\theta) = \tfrac{1}{2}\mu^2 = \tfrac{1}{2}\theta_1^2, \qquad C(y, \phi) = -\tfrac{1}{2}\log(2\pi\sigma^2).$$

• 2nd view ($\phi = 1$). When both $\mu$ and $\sigma$ are parameters to be treated as unknown, we take this point of view. Here, we set the dispersion parameter $\phi = 1$. Writing out $P_\theta(y)$, we have

$$P_\theta(y) = \exp\left[-\frac{1}{2\sigma^2} y^2 + \frac{\mu}{\sigma^2} y - \frac{1}{2\sigma^2}\mu^2 - \log\sigma - \frac{1}{2}\log 2\pi\right].$$

Thus it is easy to see the following:

$$T(y) = \left(y,\, \tfrac{1}{2} y^2\right), \qquad \theta = (\theta_1, \theta_2) = \left(\frac{\mu}{\sigma^2},\, -\frac{1}{\sigma^2}\right),$$

$$A(\theta) = \frac{1}{2\sigma^2}\mu^2 + \log\sigma = -\frac{1}{2}\frac{\theta_1^2}{\theta_2} - \frac{1}{2}\log(-\theta_2), \qquad C(y) = -\frac{1}{2}\log(2\pi).$$

1.4 Properties of the exponential family

The log partition function $A(\theta)$ plays a key role, so let us now look at it more carefully. First, since

$$\int P_\theta(y)\, dy = \int \exp\left[\frac{\theta \cdot T(y) - A(\theta)}{\phi} + C(y, \phi)\right] dy = 1,$$

taking $\nabla_\theta$, we have

$$\int \exp\left[\frac{\theta \cdot T(y) - A(\theta)}{\phi} + C(y, \phi)\right] \frac{T(y) - \nabla_\theta A(\theta)}{\phi}\, dy = 0.$$

Thus, writing the integrand $\exp\left[\frac{\theta \cdot T(y) - A(\theta)}{\phi} + C(y, \phi)\right]$ as $P_\theta(y)$,

$$\int P_\theta(y)\, T(y)\, dy = \int P_\theta(y)\, \nabla_\theta A(\theta)\, dy = \nabla_\theta A(\theta) \int P_\theta(y)\, dy.$$

Since $\int P_\theta(y)\, dy = 1$, we have the following:

Proposition 1. $\displaystyle \nabla_\theta A(\theta) = \int P_\theta(y)\, T(y)\, dy = E[T(Y)]$, where $Y$ is the random variable with distribution $P_\theta(y)$.

Writing this componentwise, we have

$$\frac{\partial A}{\partial \theta_i} = \int P_\theta(y)\, T_i(y)\, dy.$$

Taking the second partial derivative, we have

$$\frac{\partial^2 A}{\partial \theta_i\, \partial \theta_j} = \frac{1}{\phi} \int P_\theta(y) \left(T_j(y) - \frac{\partial A(\theta)}{\partial \theta_j}\right) T_i(y)\, dy = \frac{1}{\phi} \left\{\int P_\theta(y)\, T_i(y) T_j(y)\, dy - \frac{\partial A(\theta)}{\partial \theta_j} \int P_\theta(y)\, T_i(y)\, dy\right\}.$$

Using Proposition 1 once more, we have

$$\frac{\partial^2 A}{\partial \theta_i\, \partial \theta_j} = \frac{1}{\phi} \Big\{E[T_i(Y) T_j(Y)] - E[T_i(Y)]\, E[T_j(Y)]\Big\} = \frac{1}{\phi}\, E\Big[\big(T_i(Y) - E[T_i(Y)]\big)\big(T_j(Y) - E[T_j(Y)]\big)\Big] = \frac{1}{\phi}\, \mathrm{Cov}\big(T_i(Y), T_j(Y)\big).$$

Therefore we have the following result on the Hessian matrix of $A$:

Proposition 2. $\displaystyle D_\theta^2 A = \left[\frac{\partial^2 A}{\partial \theta_i\, \partial \theta_j}\right] = \frac{1}{\phi}\, \mathrm{Cov}\big(T(Y), T(Y)\big) = \frac{1}{\phi}\, \mathrm{Cov}\big(T(Y)\big)$, where $Y$ is the random variable with distribution $P_\theta(y)$. Here, $\mathrm{Cov}(T(Y), T(Y))$ denotes the covariance matrix of $T(Y)$ with itself, which is sometimes called the variance matrix.

Since the covariance matrix is always positive semi-definite, we have:

Corollary 1. $A(\theta)$ is a convex function of $\theta$.

Remark. In most exponential family models, $\mathrm{Cov}(T(Y))$ is positive definite, in which case $A(\theta)$ is strictly convex. In this lecture and throughout the course, we always assume that $\mathrm{Cov}(T(Y))$ is positive definite and therefore that $A(\theta)$ is strictly convex.
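These two propositions are easy to check numerically. Here is a minimal sketch of ours for the Bernoulli case (assuming $\phi = 1$), using finite differences of $A(\theta) = \log(1 + e^\theta)$ in place of the analytic derivatives:

```python
import numpy as np

# Bernoulli in canonical form (phi = 1): A(theta) = log(1 + e^theta), T(y) = y,
# E[Y] = sigma(theta), Var(Y) = sigma(theta) * (1 - sigma(theta)).
A = lambda t: np.log(1.0 + np.exp(t))
sigma = lambda t: 1.0 / (1.0 + np.exp(-t))

theta, eps = 0.8, 1e-4
mu = sigma(theta)

# Proposition 1: dA/dtheta = E[T(Y)] = mu
dA = (A(theta + eps) - A(theta - eps)) / (2.0 * eps)   # central difference
print(dA, mu)                                          # both approximately 0.690

# Proposition 2: d^2A/dtheta^2 = Cov(T(Y)) = Var(Y) = mu * (1 - mu)
d2A = (A(theta + eps) - 2.0 * A(theta) + A(theta - eps)) / eps**2
print(d2A, mu * (1.0 - mu))                            # both approximately 0.214
```

Note that $A''(\theta) = \mu(1 - \mu) > 0$ for $0 < \mu < 1$, illustrating the strict convexity assumed in the Remark above.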
1.5 Maximum likelihood estimation

Let $D = \{y^{(i)}\}_{i=1}^{N}$ be a given data set of IID samples, where $y^{(i)} \in \mathbb{R}^m$. Then its likelihood function $L(\theta)$ and log likelihood function $l(\theta)$ are given by

$$L(\theta) = \prod_{i=1}^{N} \exp\left[\frac{\theta \cdot T(y^{(i)}) - A(\theta)}{\phi} + C(y^{(i)}, \phi)\right],$$

$$l(\theta) = \frac{1}{\phi} \left\{\sum_{i=1}^{N} \theta \cdot T(y^{(i)}) - N A(\theta)\right\} + \sum_{i=1}^{N} C(y^{(i)}, \phi).$$

Since $A(\theta)$ is strictly convex, $l(\theta)$ has a unique maximum at the $\hat\theta$ that is the unique solution of $\nabla_\theta l(\theta) = 0$. Note that

$$\nabla_\theta l(\theta) = \frac{1}{\phi} \left\{\sum_{i=1}^{N} T(y^{(i)}) - N \nabla_\theta A(\theta)\right\}.$$

So we have the following:

Proposition 3. There is a unique $\hat\theta$ that maximizes $l(\theta)$, at which

$$\nabla_\theta A(\theta)\Big|_{\theta = \hat\theta} = \frac{1}{N} \sum_{i=1}^{N} T(y^{(i)}).$$

1.5.1 Example: Bernoulli distribution

Let us apply the above argument to the Bernoulli distribution Bernoulli($\mu$) and see what it leads to.
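As a hedged preview of where this leads: since $T(y) = y$ and $\nabla_\theta A(\theta) = \sigma(\theta)$ by (5) and Proposition 1, the condition of Proposition 3 reads $\sigma(\hat\theta) = \bar{y}$, so $\hat\theta = \mathrm{logit}(\bar{y})$ and hence $\hat\mu = \bar{y}$, the sample mean. The following minimal numerical illustration (a sketch of ours, not from the notes) checks this on simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=10_000)        # IID Bernoulli(mu = 0.3) samples

# Proposition 3: grad A(theta_hat) = (1/N) sum_i T(y_i).  For Bernoulli,
# T(y) = y and grad A(theta) = sigma(theta), so sigma(theta_hat) = y_bar,
# i.e. theta_hat = logit(y_bar) and mu_hat = y_bar.
y_bar = y.mean()
theta_hat = np.log(y_bar / (1.0 - y_bar))    # MLE of the canonical parameter
mu_hat = 1.0 / (1.0 + np.exp(-theta_hat))    # sigma(theta_hat) recovers y_bar

print(theta_hat, mu_hat, y_bar)              # mu_hat equals y_bar, both near 0.3
```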