Exponential Family

School of Computer Science, Probabilistic Graphical Models
Learning generalized linear models and tabular CPTs of fully observed BNs
Eric Xing, Lecture 7, September 30, 2009
© Eric Xing @ CMU, 2005-2009

[Figure: three example directed graphical models over nodes X1-X4]

Exponential family

For a numeric random variable X,

    p(x \mid \eta) = h(x)\exp\{\eta^T T(x) - A(\eta)\} = \frac{1}{Z(\eta)}\, h(x)\exp\{\eta^T T(x)\}

is an exponential family distribution with natural (canonical) parameter η.

• The function T(x) is a sufficient statistic.
• The function A(η) = log Z(η) is the log normalizer.
• Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma, ...

Recall: linear regression

• Assume that the target variable and the inputs are related by

    y_i = \theta^T x_i + \varepsilon_i,

  where ε_i is an error term capturing unmodeled effects or random noise.
• Now assume that ε follows a Gaussian N(0, σ²); then

    p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(y_i - \theta^T x_i)^2}{2\sigma^2}\right\}.

Recall: logistic regression (sigmoid classifier)

• The conditional distribution is a Bernoulli,

    p(y \mid x) = \mu(x)^y (1 - \mu(x))^{1-y},

  where μ is the logistic function

    \mu(x) = \frac{1}{1 + e^{-\theta^T x}}.

• We could use the brute-force gradient method as in linear regression, but we can also apply generic results, by observing that p(y | x) is an exponential family distribution; more specifically, it is a generalized linear model.

Example: multivariate Gaussian distribution

• For a continuous vector random variable X ∈ R^k:

    p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\left\{-\tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}   (moment parameterization)
                          = \frac{1}{(2\pi)^{k/2}}\exp\left\{-\tfrac{1}{2}\operatorname{tr}(\Sigma^{-1}xx^T) + \mu^T\Sigma^{-1}x - \tfrac{1}{2}\mu^T\Sigma^{-1}\mu - \tfrac{1}{2}\log|\Sigma|\right\}

• Exponential family representation (natural parameterization):

    \eta = \left[\Sigma^{-1}\mu;\; -\tfrac{1}{2}\operatorname{vec}(\Sigma^{-1})\right] = [\eta_1;\operatorname{vec}(\eta_2)], \quad \eta_1 = \Sigma^{-1}\mu, \quad \eta_2 = -\tfrac{1}{2}\Sigma^{-1}
    T(x) = \left[x;\;\operatorname{vec}(xx^T)\right]
    A(\eta) = \tfrac{1}{2}\mu^T\Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma| = -\tfrac{1}{4}\operatorname{tr}(\eta_2^{-1}\eta_1\eta_1^T) - \tfrac{1}{2}\log|-2\eta_2|
    h(x) = (2\pi)^{-k/2}

• Note: a k-dimensional Gaussian is a (k + k²)-parameter distribution with a (k + k²)-element vector of sufficient statistics (but because of symmetry and positive definiteness, the parameters are constrained and have fewer degrees of freedom).

Example: multinomial distribution

• For a binary (one-hot) vector random variable x ~ multi(x | π):

    p(x \mid \pi) = \pi_1^{x_1}\pi_2^{x_2}\cdots\pi_K^{x_K} = \exp\left\{\sum_k x_k \ln \pi_k\right\}
                  = \exp\left\{\sum_{k=1}^{K-1} x_k \ln \pi_k + \left(1 - \sum_{k=1}^{K-1} x_k\right)\ln\left(1 - \sum_{k=1}^{K-1}\pi_k\right)\right\}
                  = \exp\left\{\sum_{k=1}^{K-1} x_k \ln\left(\frac{\pi_k}{1 - \sum_{k=1}^{K-1}\pi_k}\right) + \ln\left(1 - \sum_{k=1}^{K-1}\pi_k\right)\right\}

• Exponential family representation:

    \eta = \left[\ln(\pi_k/\pi_K);\; 0\right]
    T(x) = [x]
    A(\eta) = -\ln\left(1 - \sum_{k=1}^{K-1}\pi_k\right) = \ln\left(\sum_{k=1}^{K} e^{\eta_k}\right)
    h(x) = 1

Why exponential family?

• Moment generating property:

    \frac{dA}{d\eta} = \frac{d}{d\eta}\log Z(\eta) = \frac{1}{Z(\eta)}\frac{d}{d\eta}Z(\eta)
                     = \frac{1}{Z(\eta)}\frac{d}{d\eta}\int h(x)\exp\{\eta^T T(x)\}\,dx
                     = \int T(x)\,\frac{h(x)\exp\{\eta^T T(x)\}}{Z(\eta)}\,dx
                     = E[T(x)]

    \frac{d^2A}{d\eta^2} = \int T^2(x)\,\frac{h(x)\exp\{\eta^T T(x)\}}{Z(\eta)}\,dx
                           - \int T(x)\,\frac{h(x)\exp\{\eta^T T(x)\}}{Z(\eta)}\,dx \cdot \frac{1}{Z(\eta)}\frac{d}{d\eta}Z(\eta)
                         = E[T^2(x)] - E[T(x)]^2 = \operatorname{Var}[T(x)]

Moment estimation

• We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(η); the qth derivative gives the qth cumulant, so the first two give the mean and variance:

    \frac{dA(\eta)}{d\eta} = \text{mean}, \qquad \frac{d^2A(\eta)}{d\eta^2} = \text{variance}, \qquad \ldots

• When the sufficient statistic is a stacked vector, partial derivatives need to be considered.
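As a concrete check of this moment property, here is a minimal numerical sketch (an illustration, not part of the lecture) for the Bernoulli case, where A(η) = log(1 + e^η): finite-difference derivatives of A recover the mean μ = 1/(1 + e^{−η}) and the variance μ(1 − μ).

```python
import numpy as np

def A(eta):
    # Bernoulli log normalizer: A(eta) = log(1 + e^eta)
    return np.log1p(np.exp(eta))

eta, eps = 0.7, 1e-5

# Central finite differences approximate the first two derivatives of A
dA  = (A(eta + eps) - A(eta - eps)) / (2 * eps)
d2A = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps**2

mu = 1 / (1 + np.exp(-eta))      # closed-form mean E[T(x)] = E[x]
print(dA, mu)                    # both ~0.668: dA/deta equals the mean
print(d2A, mu * (1 - mu))        # both ~0.222: d2A/deta2 equals the variance
```

The same check works for any member of the family once its log normalizer A(η) is written down.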
Moment vs. canonical parameters

• The moment parameter μ can be derived from the natural (canonical) parameter:

    \frac{dA(\eta)}{d\eta} = E[T(x)] \stackrel{\text{def}}{=} \mu

• A(η) is convex, since

    \frac{d^2A(\eta)}{d\eta^2} = \operatorname{Var}[T(x)] > 0.

[Figure: a convex log normalizer A(η) plotted over η ∈ [−2, 2], with the optimum marked as η*]

• Hence we can invert the relationship and infer the canonical parameter from the moment parameter (the mapping is one-to-one):

    \eta \stackrel{\text{def}}{=} \psi(\mu)

• A distribution in the exponential family can be parameterized not only by η (the canonical parameterization) but also by μ (the moment parameterization).

MLE for the exponential family

• For iid data, the log-likelihood is

    \ell(\eta; D) = \log\prod_n h(x_n)\exp\{\eta^T T(x_n) - A(\eta)\}
                  = \sum_n \log h(x_n) + \eta^T\sum_n T(x_n) - N A(\eta)

• Take the derivative and set it to zero:

    \frac{\partial\ell}{\partial\eta} = \sum_n T(x_n) - N\,\frac{\partial A(\eta)}{\partial\eta} = 0
    \;\Rightarrow\; \frac{\partial A(\eta)}{\partial\eta} = \frac{1}{N}\sum_n T(x_n)
    \;\Rightarrow\; \hat{\mu}_{\text{MLE}} = \frac{1}{N}\sum_n T(x_n)

• This amounts to moment matching.
• We can then infer the canonical parameters using \hat{\eta}_{\text{MLE}} = \psi(\hat{\mu}_{\text{MLE}}).

Sufficiency

• For p(x | θ), T(x) is sufficient for θ if there is no information in X regarding θ beyond that in T(x): we can throw away X for the purpose of inference w.r.t. θ.
• Bayesian view: p(\theta \mid T(x), x) = p(\theta \mid T(x))
• Frequentist view: p(x \mid T(x), \theta) = p(x \mid T(x))
• The Neyman factorization theorem: T(x) is sufficient for θ if

    p(x, T(x), \theta) = \psi_1(T(x), \theta)\,\psi_2(x, T(x))
    \;\Rightarrow\; p(x \mid \theta) = g(T(x), \theta)\,h(x, T(x))

[Figure: graphical models over X, T(x), and θ illustrating each of the three views]

Examples

• Gaussian:

    \eta = \left[\Sigma^{-1}\mu;\; -\tfrac{1}{2}\operatorname{vec}(\Sigma^{-1})\right], \quad T(x) = \left[x;\;\operatorname{vec}(xx^T)\right], \quad
    A(\eta) = \tfrac{1}{2}\mu^T\Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma|, \quad h(x) = (2\pi)^{-k/2}

    \Rightarrow\; \hat{\mu}_{\text{MLE}} = \frac{1}{N}\sum_n T_1(x_n) = \frac{1}{N}\sum_n x_n

• Multinomial:

    \eta = \left[\ln(\pi_k/\pi_K);\; 0\right], \quad T(x) = [x], \quad
    A(\eta) = -\ln\left(1 - \sum_{k=1}^{K-1}\pi_k\right) = \ln\left(\sum_{k=1}^{K} e^{\eta_k}\right), \quad h(x) = 1

    \Rightarrow\; \hat{\mu}_{\text{MLE}} = \frac{1}{N}\sum_n x_n

• Poisson:

    \eta = \log\lambda, \quad T(x) = x, \quad A(\eta) = \lambda = e^{\eta}, \quad h(x) = \frac{1}{x!}

    \Rightarrow\; \hat{\mu}_{\text{MLE}} = \frac{1}{N}\sum_n x_n

Generalized Linear Models (GLIMs)

[Figure: plate model with input X_n and output Y_n, repeated N times]

• Linear regression and discriminative linear classification share a commonality: both model E_p[Y] = \mu = f(\theta^T X).
  What is p(·)? The conditional distribution of Y. What is f(·)? The response function.
• GLIM:
  - The observed input x is assumed to enter the model via a linear combination of its elements, \xi = \theta^T x.
  - The conditional mean μ is represented as a function f(ξ) of ξ, where f is known as the response function.
  - The observed output y is assumed to be characterized by an exponential family distribution with conditional mean μ.

GLIM, cont.

[Figure: the GLIM chain x → ξ = θᵀx → μ = f(ξ) → η = ψ(μ), with y drawn from p(y | η)]

    p(y \mid \eta) = h(y)\exp\{\eta(x)\,y - A(\eta)\}
    \;\Rightarrow\; p(y \mid \eta) = h(y, \phi)\exp\left\{\tfrac{1}{\phi}\big(\eta(x)\,y - A(\eta)\big)\right\}

  (the second form introduces a scale parameter φ)

• The choice of exponential family is constrained by the nature of the data Y.
  Example: if y is a continuous vector → multivariate Gaussian; if y is a class label → Bernoulli or multinomial.
• The choice of the response function is subject only to some mild constraints, e.g., a [0,1] range or positivity.
• Canonical response function: f = \psi^{-1}(\cdot). In this case θ^T x directly corresponds to the canonical parameter η, as in the sketch below.
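To make the canonical response concrete, here is a small sketch (mine, not the lecture's) of the link η = ψ(μ) and its inverse response f = ψ^{-1} for the Bernoulli and Poisson cases: composing them returns the original mean, and with the canonical response θᵀx serves directly as the natural parameter.

```python
import numpy as np

# Canonical links eta = psi(mu) and their inverse responses mu = f(eta)
psi_bern = lambda mu: np.log(mu / (1 - mu))    # Bernoulli link: logit
f_bern   = lambda eta: 1 / (1 + np.exp(-eta))  # inverse: sigmoid

psi_pois = lambda mu: np.log(mu)               # Poisson link: log
f_pois   = lambda eta: np.exp(eta)             # inverse: exp

mu = 0.3
print(f_bern(psi_bern(mu)), f_pois(psi_pois(mu)))  # 0.3 0.3: f inverts psi

# With the canonical response, xi = theta^T x is itself the natural parameter:
theta, x = np.array([0.5, -1.0]), np.array([1.0, 2.0])
print(f_bern(theta @ x))  # conditional mean mu(x) for logistic regression
```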
MLE for GLIMs with natural response

• Log-likelihood:

    \ell = \sum_n \log h(y_n) + \sum_n\left(\theta^T x_n y_n - A(\eta_n)\right)

• Derivative of the log-likelihood:

    \frac{d\ell}{d\theta} = \sum_n\left(x_n y_n - \frac{dA(\eta_n)}{d\eta_n}\frac{d\eta_n}{d\theta}\right)
                          = \sum_n (y_n - \mu_n)\,x_n
                          = X^T(y - \mu)

  This is a fixed-point equation, because μ is a function of θ.

• Online learning for canonical GLIMs: stochastic gradient ascent = the least mean squares (LMS) algorithm,

    \theta^{t+1} = \theta^t + \rho\,(y_n - \mu_n^t)\,x_n,

  where \mu_n^t = (\theta^t)^T x_n and ρ is a step size.

Batch learning for canonical GLIMs

• The Hessian matrix:

    H = \frac{d^2\ell}{d\theta\,d\theta^T} = \frac{d}{d\theta^T}\sum_n (y_n - \mu_n)\,x_n
      = -\sum_n x_n\frac{d\mu_n}{d\theta^T}
      = -\sum_n x_n\frac{d\mu_n}{d\eta_n}\frac{d\eta_n}{d\theta^T}
      = -\sum_n \frac{d\mu_n}{d\eta_n}\,x_n x_n^T \quad \text{since } \eta_n = \theta^T x_n
      = -X^T W X,

  where X = [x_n^T] is the design matrix whose nth row is x_n^T, y = (y_1, \ldots, y_N)^T, and

    W = \operatorname{diag}\left(\frac{d\mu_1}{d\eta_1}, \ldots, \frac{d\mu_N}{d\eta_N}\right),

  which can be computed by calculating the 2nd derivative of A(η_n).

Recall LMS

• Cost function in matrix form (X and y as defined above):

    J(\theta) = \frac{1}{2}\sum_{i=1}^{n}(x_i^T\theta - y_i)^2 = \frac{1}{2}(X\theta - y)^T(X\theta - y)

• To minimize J(θ), take the derivative and set it to zero:

    \nabla_\theta J = \frac{1}{2}\nabla_\theta\operatorname{tr}\left(\theta^T X^T X\theta - \theta^T X^T y - y^T X\theta + y^T y\right)
                    = \frac{1}{2}\left(\nabla_\theta\operatorname{tr}\,\theta^T X^T X\theta - 2\nabla_\theta\operatorname{tr}\,y^T X\theta + \nabla_\theta\operatorname{tr}\,y^T y\right)
                    = \frac{1}{2}\left(X^T X\theta + X^T X\theta - 2X^T y\right)
                    = X^T X\theta - X^T y = 0

    \Rightarrow\; X^T X\theta = X^T y \quad \text{(the normal equations)} \quad\Rightarrow\quad \theta^* = (X^T X)^{-1}X^T y

Iteratively Reweighted Least Squares (IRLS)

• Recall the Newton-Raphson method with cost function J:

    \theta^{t+1} = \theta^t - H^{-1}\nabla_\theta J

• We now have

    \nabla_\theta\ell = X^T(y - \mu), \qquad H = -X^T W X

• Hence:

    \theta^{t+1} = \theta^t - H^{-1}\nabla_\theta\ell
                 = (X^T W^t X)^{-1}\left[X^T W^t X\,\theta^t + X^T(y - \mu^t)\right]
                 = (X^T W^t X)^{-1} X^T W^t z^t,

  where the adjusted response is

    z^t = X\theta^t + (W^t)^{-1}(y - \mu^t).

• This can be understood as solving the following iteratively reweighted least squares problem:

    \theta^{t+1} = \arg\min_\theta\,(z - X\theta)^T W (z - X\theta)

Example 1: logistic regression (sigmoid classifier)

• The conditional distribution is a Bernoulli,

    p(y \mid x) = \mu(x)^y(1 - \mu(x))^{1-y},

  where μ is the logistic function

    \mu(x) = \frac{1}{1 + e^{-\eta(x)}}.

• p(y | x) is an exponential family distribution, with mean

    E[y \mid x] = \mu = \frac{1}{1 + e^{-\eta(x)}}

  and canonical response function

    \eta = \xi = \theta^T x.

• For IRLS:

    \frac{d\mu}{d\eta} = \mu(1 - \mu), \qquad W = \operatorname{diag}\big(\mu_1(1 - \mu_1), \ldots, \mu_N(1 - \mu_N)\big)

Logistic regression: practical issues

• It is very common to use regularized maximum likelihood.
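Tying the last few slides together, below is a minimal numpy sketch (an illustration, not the lecture's code) of IRLS for logistic regression with the canonical logit link. The function name and the small `ridge` term, a crude numerical-stability stand-in for the regularization just mentioned, are assumptions of this sketch.

```python
import numpy as np

def irls_logistic(X, y, n_iter=20, ridge=1e-8):
    """IRLS update theta <- (X^T W X)^{-1} X^T W z for the Bernoulli GLIM.
    `ridge` adds a small diagonal term for numerical stability (this
    sketch's assumption, echoing the regularization remark above)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ theta                  # natural parameter eta_n = theta^T x_n
        mu = 1 / (1 + np.exp(-eta))      # mean mu_n = f(eta_n), f = sigmoid
        w = mu * (1 - mu)                # diagonal of W = diag(dmu/deta)
        z = eta + (y - mu) / w           # adjusted response z = X theta + W^{-1}(y - mu)
        XtW = X.T * w                    # X^T W without materializing diag(w)
        theta = np.linalg.solve(XtW @ X + ridge * np.eye(X.shape[1]), XtW @ z)
    return theta

# Tiny usage example on synthetic data with true parameters [0.5, 2.0]
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = (rng.random(200) < 1 / (1 + np.exp(-(X @ np.array([0.5, 2.0]))))).astype(float)
print(irls_logistic(X, y))  # roughly recovers [0.5, 2.0]
```

Each iteration is exactly the weighted least-squares problem argmin_θ (z − Xθ)ᵀW(z − Xθ) from the IRLS slide, with W and z recomputed from the current θ.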