Machine Learning for Engineers: Chapter 9. Exponential Family of Distributions

Osvaldo Simeone

King’s College London

January 26, 2021

This Chapter

In the previous chapters, we have adopted a limited range of probabilistic models, namely Bernoulli and categorical for discrete rvs and Gaussian for continuous rvs. While these are the most common modelling choices, they clearly do not represent many important situations. Examples:

- Discrete data may a priori take arbitrarily large values, making Bernoulli and categorical models not suitable
  - ex.: waiting times for the next arrival in a queue;
- Continuous data may be non-negative, making Gaussian models not suitable
  - ex.: measurements of weights or heights.

This Chapter

Furthermore, we have seen that Bernoulli, categorical, and Gaussian distributions share several common features:

- The gradient of the log-loss with respect to the model parameters can be expressed in terms of a mean error that measures the difference between the mean under the model and the observation (see Chapters 4 and 6);
- ML learning can be solved in closed form by evaluating empirical averages (see Chapters 3, 4, and 6);
- Information-theoretic quantities such as (differential) entropy and KL divergence can be computed in closed form (see Chapter 3).

This Chapter

In this chapter, we will introduce a general family of distributions that includes Bernoulli, categorical, and Gaussian as special cases: the exponential family of distributions. The family is much larger, and it also encompasses distributions such as

- Poisson and geometric distributions, whose support is discrete and includes all non-negative integers;
- exponential and gamma distributions, whose support is continuous and includes only non-negative values.

All distributions that depend on a finite parameter vector and are supported on a set that does not depend on the model parameters are in the exponential family.

This Chapter

A rare example of a distribution that does not belong to this class is the uniform distribution on an interval that depends on the model parameters. We will discover that all distributions in the exponential family share the useful properties mentioned above in terms of log-loss, ML learning, and information-theoretic measures. As a result, the methods studied in the previous chapters can be extended to a much larger class of problems by replacing Bernoulli, categorical, and Gaussian distributions with another model in the exponential family.

Overview

Exponential family: definitions and examples
Gradient of the log-loss, or score vector
ML learning
Information-theoretic metrics
Fisher information matrix
Generalized linear models

Exponential Family: Definitions and Examples

Exponential Family

Bernoulli, categorical, and Gaussian distributions have the useful property that the log-loss −log p(x|η) is a convex function of the model parameters η. Note that in this chapter we will introduce two different definitions for model parameters, which will be denoted as η (natural parameters) and µ (mean parameters). Accordingly, we will not adopt the notation θ used thus far when we need to identify which type of model parameter is being considered. Ex.: For a Gaussian rv x ∼ N(η, 1), the log-loss is given as

−log p(x|η) = (1/2)(x − η)² + const. indep. of η,

which is quadratic, and hence convex, in the parameter η. All distributions in the exponential family share this key property, which simplifies optimization (see Chapter 5), and they are defined as follows.

Exponential Family

A probabilistic model p(x|η) in the exponential family is described by

- a K × 1 vector of sufficient statistics s(x) = [s₁(x), ..., s_K(x)]^T,

- as well as by a “log-base measure” function M(x).

This is in the sense that the distribution p(x|η) depends on x only through the vector s(x) and the function M(x). One can hence think of s(x) and M(x) as defining a vector of features that determines the distribution of x.

Exponential Family

Specifically, the exponential family contains discrete and continuous distributions whose log-loss can be written as

−log p(x|η) = −η^T s(x) − M(x) + A(η),

where η^T s(x) = Σ_{k=1}^K η_k s_k(x) and M(x) is the log-base measure, and where we have defined
- the K × 1 natural parameter vector η = [η₁, ..., η_K]^T,
- and the log-partition function A(η).

Log-Partition Function

The log-partition function A(η) is fixed once the functions s(·) and M(·) are specified, and it is convex in η. To see this, we can write the distributions in the exponential family as

p(x|η) = exp(η^T s(x) − A(η) + M(x)) = (1/exp(A(η))) exp(η^T s(x) + M(x)).

In order to guarantee the normalization ∫ p(x|η)dx = 1 for continuous variables and Σ_x p(x|η) = 1 for discrete variables, we need to set

A(η) = log ∫ exp(η^T s(x) + M(x)) dx

for continuous rvs and

A(η) = log Σ_x exp(η^T s(x) + M(x))

for discrete rvs.

Log-Partition Function

The log-partition function has a “log-sum-exp” form and is hence convex in η. Note that the function exp(A(η)) is known as the partition function – whence the name log-partition function for A(η).

The log-loss is the sum of a linear function in η, namely −η^T s(x), and of a convex function in η, namely A(η). This implies that the log-loss is convex in η.

The vector of natural parameters η can take any value that ensures that the distribution can be normalized, i.e., A(η) < ∞. This feasible set is convex (i.e., it contains all segments between any two points in the set) and is taken to be open, that is, not to include its boundary. (Technically, this defines the class of distributions in the regular exponential family, which we will focus on.)

Exponential Family

To summarize, the exponential family contains discrete and continuous distributions that are specified by sufficient statistics s(x) and a log-base measure function M(x) as

p(x|η) ∝ exp(η^T s(x) + M(x)),

where η^T s(x) is a linear function of η and M(x) is a general function of x,

and the “proportional to” sign ∝ makes the normalization factor 1/exp(A(η)) implicit. When no confusion can arise, we will write a distribution in the exponential family as

p(x|η) = ExpFam(x|η),

where the notation hides the dependence on s(x) and M(x).

Example 1: Gaussian Distribution with Fixed Variance

For the Gaussian distribution N(ν, β⁻¹) with a fixed precision β, the log-loss can be written as

−log N(x|ν, β⁻¹) = −(βν)·x − (−(β/2)x² − (1/2)log(2πβ⁻¹)) + βν²/2,

where η = βν is the natural parameter, s(x) = x is the sufficient statistic, M(x) = −(β/2)x² − (1/2)log(2πβ⁻¹) is the log-base measure, and A(η) = βν²/2 is the log-partition function.

Example 1: Gaussian Distribution with Fixed Variance

The log-partition function can be expressed in terms of the natural parameter η = βν as

A(η) = βν²/2 = η²/(2β).

- As expected, this is a (strictly) convex function of η for all η ∈ R.

In the previous chapters, we would have parametrized the distribution N(ν, β⁻¹) through the mean parameter ν = E_{x∼N(ν,β⁻¹)}[s(x)] = E_{x∼N(ν,β⁻¹)}[x].
- We now have an alternative parametrization in terms of the natural parameter η.
- The natural parameter η and the mean parameter µ are in a one-to-one relationship, as one can be recovered from the other through the equality ν = η/β (recall that β is fixed and hence should be considered as a numerical value).
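As a quick numerical sanity check of this example, the following sketch verifies that A(η) = η²/(2β) normalizes the distribution and that the mean parameter is recovered as η/β; the values β = 2 and ν = 1.5 are illustrative choices, not taken from the text.

```python
import numpy as np

beta, nu = 2.0, 1.5                 # fixed precision and mean (illustrative values)
eta = beta * nu                     # natural parameter eta = beta * nu
A = eta**2 / (2 * beta)             # log-partition function A(eta) = eta^2 / (2 beta)

# p(x|eta) = exp(eta * x + M(x) - A(eta)), with M(x) = -beta x^2/2 - 0.5 log(2 pi / beta).
x = np.linspace(nu - 10, nu + 10, 200001)
M = -0.5 * beta * x**2 - 0.5 * np.log(2 * np.pi / beta)
p = np.exp(eta * x + M - A)
dx = x[1] - x[0]

print((p * dx).sum())                   # ≈ 1.0: A(eta) indeed normalizes the density
print((x * p * dx).sum(), eta / beta)   # both ≈ 1.5: the mean parameter is eta / beta
```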

Example 2: Bernoulli Distribution

The Bernoulli distribution can be written as

Bern(x|µ) = µ^x (1 − µ)^(1−x),

where x ∈ {0, 1} and we have the mean parameter µ = E_{x∼Bern(x|µ)}[x] = Pr[x = 1]. Therefore, the log-loss is

−log Bern(x|µ) = −log(µ/(1 − µ))·x + (−log(1 − µ)),

with natural parameter η = log(µ/(1 − µ)), sufficient statistic s(x) = x, and log-partition function A(η) = −log(1 − µ),

where M(x) = 0.

Example 2: Bernoulli Distribution

It follows that the natural parameter is the logit, or log-odds (see Chapter 6),

η = log(Bern(1|µ)/Bern(0|µ)) = log(µ/(1 − µ)).

The mean parameter µ is in a one-to-one relationship with the natural parameter η: inverting the equality above, we have

µ = σ(η) = 1/(1 + e^(−η)).

Therefore, the log-partition function can be expressed in terms of the natural parameter η as

A(η) = log(1 + e^η),

which is a (strictly) convex function of η ∈ R.
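The relationships in this example can be checked numerically; the sketch below, with the illustrative value µ = 0.3, verifies that A(η) = log(1 + e^η) normalizes the distribution and that µ = σ(η) coincides with the derivative of A(η).

```python
import numpy as np

mu = 0.3                                  # mean parameter (illustrative value)
eta = np.log(mu / (1 - mu))               # natural parameter: the logit
A = np.log(1 + np.exp(eta))               # log-partition function

# p(x|eta) = exp(eta * x - A(eta)) with s(x) = x and M(x) = 0.
probs = np.exp(eta * np.array([0.0, 1.0]) - A)
print(probs.sum(), probs[1])              # 1.0 (normalization) and Pr[x = 1] = 0.3

# The sigmoid inverts the logit and equals the gradient of A(eta).
eps = 1e-6
dA = (np.log(1 + np.exp(eta + eps)) - np.log(1 + np.exp(eta - eps))) / (2 * eps)
print(1 / (1 + np.exp(-eta)), dA)         # both ≈ 0.3
```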

Example 3: General Gaussian Distribution

For a Gaussian distribution N(ν, β⁻¹) with parameters (ν, β), we have the log-loss

−log N(x|ν, β⁻¹) = −( βν·x + (−β/2)·x² ) + ( ν²β/2 + (1/2)log(2πβ⁻¹) ),

with natural parameters η₁ = βν and η₂ = −β/2, sufficient statistics s₁(x) = x and s₂(x) = x², and log-partition function A(η) = ν²β/2 + (1/2)log(2πβ⁻¹),

where M(x) = 0. Note that we now have a two-dimensional vector of sufficient statistics, i.e., K = 2, namely s(x) = [x, x²]^T.

Example 3: General Gaussian Distribution

Following the previous examples, the two-dimensional vector of mean parameters is defined as the vector of averages of the sufficient statistics under the model, i.e.,

µ = [E_{x∼N(ν,β⁻¹)}[s₁(x)], E_{x∼N(ν,β⁻¹)}[s₂(x)]]^T = [E_{x∼N(ν,β⁻¹)}[x], E_{x∼N(ν,β⁻¹)}[x²]]^T = [ν, ν² + β⁻¹]^T.

This vector is in a one-to-one correspondence with the vector of natural parameters

η = [βν, −β/2]^T,

and we can write the log-partition function as

A(η) = ν²β/2 + (1/2)log(2πβ⁻¹) = −η₁²/(4η₂) + (1/2)log(−π/η₂),

which is strictly convex in the domain (η₁, η₂) ∈ R × R⁻, an open set.

Exponential Family

This is a much larger family! In fact, any distribution that can be described by a finite-dimensional vector of parameters and whose support does not depend on the parameters is in the exponential family. Among others, apart from the mentioned distributions, it includes the following distributions:

- discrete: binomial, negative binomial, geometric, Poisson;
- continuous: lognormal, gamma, inverse gamma, chi-squared, exponential, beta, Dirichlet, Pareto, Laplace.

Example 4: Poisson Distribution

As an example of a distribution that we have not considered before, consider the Poisson distribution, which is used extensively in fields as diverse as neuroscience and communication network design. The Poisson distribution can be written as

Poiss(x|λ) = λ^x exp(−λ)/x!,

where x ∈ {0, 1, 2, ...}. The log-loss is

−log Poiss(x|λ) = −log(λ)·x + λ + log(x!),

with natural parameter η = log(λ), sufficient statistic s(x) = x, log-partition function A(η) = λ, and log-base measure M(x) = −log(x!).

Example 4: Poisson Distribution

The mean parameter µ = E_{x∼Poiss(x|λ)}[x] = λ is in a one-to-one relationship with the natural parameter η = log(λ). Therefore, the log-partition function can be expressed in terms of the natural parameter η as

A(η) = exp(η),

which is a (strictly) convex function of η ∈ R.
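A small numerical check of this example, using λ = 4 as an illustrative rate: it confirms that A(η) = exp(η) normalizes the pmf and that the mean parameter equals λ (scipy's gammaln is used for log(x!)).

```python
import numpy as np
from scipy.special import gammaln       # log(x!) = gammaln(x + 1)

lam = 4.0                               # rate (illustrative value)
eta = np.log(lam)                       # natural parameter
A = np.exp(eta)                         # log-partition function A(eta) = lambda

# p(x|eta) = exp(eta * x + M(x) - A(eta)), with s(x) = x and M(x) = -log(x!).
x = np.arange(0, 200)
p = np.exp(eta * x - gammaln(x + 1) - A)

print(p.sum())                          # ≈ 1.0 (normalization)
print((x * p).sum())                    # mean parameter mu ≈ lambda = 4.0
```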

Natural vs Mean Parameters

Generalizing the examples above, distributions in the exponential family can be specified by the vector η of natural parameters or by the vector µ of mean parameters

µ = E_{x∼p(x|η)}[s(x)],

which is the vector of averages of the sufficient statistics. Therefore, we can write a distribution in the exponential family as a function of η as p(x|η) = ExpFam(x|η) or as a function of µ as p(x|µ) = ExpFam(x|µ). We have used, and we will use, both notations, hence overloading the notation p(x|·). As we will see below, there may be multiple natural parameter vectors η yielding the same distribution and hence the same mean parameters µ.

Minimal Exponential Family

A class of distributions p(x|η) = ExpFam(x|η) in the exponential family is said to be minimal if no two natural parameter vectors yield the same distribution, or, more precisely, if there is no nonzero η in the domain for which η^T s(x) is constant.

- If there were such a value of η, then we could add it to any other value of η without changing the distribution.

For minimal classes of distributions, there is a one-to-one correspondence between natural and mean parameters:

- In this case, each natural parameter vector yields a different distribution p(x|η);
- each mean parameter vector µ is associated with a single natural parameter vector η (the converse is always true);
- and there exists a general explicit expression for the log-loss as a function of the mean parameters (see Appendix).

For minimal classes of distributions, the log-partition function is strictly convex, and hence we have ∇²A(η) ≻ 0.

Example 5: Categorical (or Multinoulli) Distribution

Not all classes of distributions in the exponential family are minimal. Consider the categorical distribution for a rv x that can take C values {0, 1, ..., C − 1}. The distribution can be written as

Cat(x|µ) = ∏_{k=0}^{C−1} µ_k^{1(x=k)} = (1/a) ∏_{k=0}^{C−1} (aµ_k)^{1(x=k)},

with any a > 0 and probabilities

µk = Pr[x = k].

Example 5: Categorical (or Multinoulli) Distribution

The log-loss is

−log(Cat(x|µ)) = −Σ_{k=0}^{C−1} log(aµ_k)·1(x = k) + log(a),

with natural parameters η_k = log(aµ_k), sufficient statistics s_k(x) = 1(x = k), and log-partition function A(η) = log(Σ_{k=0}^{C−1} e^{η_k}).

Therefore, we have
- sufficient statistics s(x) = [1(x = 0), ..., 1(x = C − 1)]^T, which is the C × 1 one-hot vector x^OH (see Chapter 6);
- natural parameters η = [η₀, ..., η_{C−1}]^T, which will be seen below to correspond to the logits discussed in Chapter 6;
- mean parameters µ = [µ₀, ..., µ_{C−1}]^T with µ_k = E_{x∼Cat(x|µ)}[s_k(x)] = Pr[x = k].

Example 5: Categorical (or Multinoulli) Distribution

This parameterization is not minimal since we can always add a constant vector to the logit vector η without changing the distribution:

- the parameter a > 0 is arbitrary in the parametrization of the logits η_k = log(aµ_k).

Example 5: Categorical (or Multinoulli) Distribution

Even for non-minimal families, there is a single mean parameter vector for each natural parameter vector η. In the case of a categorical distribution, the relationship is given by the softmax function (see Chapter 6)

µ = softmax(η) = [e^{η₀}/Σ_{k=0}^{C−1} e^{η_k}, ..., e^{η_{C−1}}/Σ_{k=0}^{C−1} e^{η_k}]^T.

But there are infinitely many logit vectors associated with each mean vector, namely

η = [log(µ₀), ..., log(µ_{C−1})]^T + b × 1_C,

where b is arbitrary.

Example 5: Categorical (or Multinoulli) Distribution

Example with C = 3:
- x ∼ Cat(x|[0.1, 0.8, 0.1]^T);
- mean parameters: µ₀ = Pr[x = 0] = 0.1, µ₁ = Pr[x = 1] = 0.8, µ₂ = Pr[x = 2] = 0.1;
- natural parameters (logits) with a = 1: η₀ = η₂ = log(a · 0.1) = −2.30 and η₁ = log(a · 0.8) = −0.22;
- we have the equality softmax([−2.30, −0.22, −2.30]^T) = [e^{η₀}, e^{η₁}, e^{η₂}]^T/Σ_{k=0}^{2} e^{η_k} = [0.1, 0.8, 0.1]^T = µ.
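The numbers above, and the non-minimality of the parametrization, can be reproduced with a few lines of numpy; the choices a ∈ {1, 10} and the shift b = 5 are arbitrary illustrations.

```python
import numpy as np

mu = np.array([0.1, 0.8, 0.1])            # mean parameters (probabilities)

def softmax(eta):
    z = np.exp(eta - eta.max())           # subtract the max for numerical stability
    return z / z.sum()

# Logits for two different values of a > 0: both map back to the same mu.
for a in (1.0, 10.0):
    print(a, softmax(np.log(a * mu)))     # [0.1, 0.8, 0.1] in both cases

# Adding a constant b to all logits also leaves the distribution unchanged.
eta = np.log(mu)
print(np.allclose(softmax(eta), softmax(eta + 5.0)))   # True
```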

Example 5: Categorical (or Multinoulli) Distribution

Categorical distributions can be conveniently represented on a simplex (or ternary) plot for C = 3 (arrows represent reading directions).


Example 6: Joint Bernoulli-Gaussian Distribution

Consider the rv x = [x₁, x₂]^T with x₁ ∼ Bern(x₁|µ) and x₂ ∼ N(ν_{x₁}, β⁻¹), where the precision β is fixed and the model parameters are (µ, ν₀, ν₁). Note that, when both rvs x₁ and x₂ are observed, this corresponds to a generative model of the type studied in Chapter 6. Following the same steps as in the examples above, one can see that this joint distribution is in the exponential family, with sufficient statistics given by s(x) = [x₁, x₂(1 − x₁), x₂x₁]^T and natural parameter vector η = [log(µ/(1 − µ)), βν₀, βν₁]^T.

The marginal distribution of x₂ under this joint distribution is a mixture of Gaussians, which is not in the exponential family. This example shows that the exponential family is not “closed” with respect to the operation of marginalization.

Alternative Formulation of Exponential Family

As we have seen, the log-loss for the exponential family is given as

−log p(x|η) = −η^T s(x) − M(x) + A(η),

where M(x) is the log-base measure and A(η) is the log-partition function. For the purpose of analytical calculations, it is often convenient to write the distribution in terms of an augmented natural parameter vector η̃ = [η^T, 1]^T and augmented sufficient statistics ũ(x) = [s(x)^T, M(x)]^T, yielding

−log p(x|η) = −η̃^T ũ(x) + A(η),

where η̃^T ũ(x) = Σ_k η̃_k ũ_k(x) and A(η) is a convex function of η, or equivalently

p(x|η) ∝ exp(η̃^T ũ(x)).

Exponential Family

This is a useful reference table. Note that, when s(x) is a square matrix, the corresponding natural parameter η is also in the form of a matrix of the same dimension, and the inner product is written as the trace tr(ηs(x)).

distribution | s(x) | η | µ
Bern(µ) | x | log(µ/(1 − µ)) (logit) | σ(η)
Cat(µ) | x^OH (one-hot vector) | η_k = log(aµ_k) for a > 0 (logits) | µ = softmax(η)
N(ν, Θ⁻¹), fixed Θ | x | Θν | ν
N(ν, Θ⁻¹) | x and xx^T | Θν and −(1/2)Θ | ν and Θ⁻¹ + νν^T

Exponential Family via Maximum-Entropy Modelling

To conclude this first section, we ask: How can we justify the use of the exponential family apart from analytical tractability? Suppose that the only information available about some data x is given by the means E_{x∼p(x)}[s_k(x)] = µ_k of given functions, or statistics, s_k(x) for k = 1, ..., K.

- How should we choose p(x)?
- Note that we cannot use density estimation, since we do not have samples from x.

Ex.: We measure the empirical average lifetime of all the nuclei of a radioactive atomic species – how should we model their distribution? One well-established principle is to choose the distribution p(x) that is least predictable, or “more random”, under the given average constraints.

Exponential Family via Maximum-Entropy Modelling

Recall that the entropy is a measure of “unpredictability” of a rv, i.e., it measures the minimum average prediction log-loss when all that is known is the distribution p(x). So the outlined problem can be formulated as the optimization

max_{p(x)} H(p(x))  s.t.  E_{x∼p(x)}[s_k(x)] = µ_k for k = 1, ..., K.

It can be proved that the distribution

p(x|η) = exp(η^T s(x) + M(x) − A(η)),

from the exponential family solves this problem, where each natural parameter η_k is the optimal Lagrange multiplier associated with the kth constraint. This provides another interesting link between mean and natural parameters. This result offers a theoretical justification for the use of distributions in the exponential family:

- the exponential family “makes the least assumptions” while being consistent with the available data.

Gradient, or Score Vector

Gradient of the Log-Loss, or Negative Score Vector

As we will prove, the partial derivative of the log-loss with respect to each natural parameter η_k is

∂(−log p(x|η))/∂η_k = µ_k − s_k(x),

i.e., the mean error for s_k(x).

Equivalently, the gradient with respect to the natural parameters is

∇_η(−log p(x|η)) = µ − s(x), i.e., the mean error for s(x).

In other words, the score vector is given as the negative mean error

∇η log p(x|η) = s(x) − µ.

As we have already seen in Chapter 6, this formula underlies many machine learning algorithms based on gradient descent.
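As a concrete check of this formula, the sketch below compares the analytical gradient µ − s(x) with a finite-difference derivative of the Bernoulli log-loss in natural-parameter form; the values η = 0.7 and x = 1 are arbitrary.

```python
import numpy as np

def neg_log_bern(x, eta):
    # Bernoulli log-loss in natural-parameter form: -eta*x + A(eta), A(eta) = log(1 + e^eta).
    return -eta * x + np.log(1 + np.exp(eta))

eta, x = 0.7, 1.0
mu = 1 / (1 + np.exp(-eta))               # mean parameter under the model

grad_analytic = mu - x                    # gradient of the log-loss: mu - s(x)

eps = 1e-6
grad_numeric = (neg_log_bern(x, eta + eps) - neg_log_bern(x, eta - eps)) / (2 * eps)
print(grad_analytic, grad_numeric)        # both ≈ -0.332
```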

Gradient, or Score Vector

The score vector ∇ log p(x|η) points in the direction in natural parameter space that locally maximizes the log-probability of x.

Example: The score vector ∇_ν log N(x|ν, I) = x − ν is illustrated in the figure for x = [0.5, 0.5]^T.


Gradient, or Score Vector

Proof of the gradient formula: By using the expression of the log-loss, we directly have

∂(−log p(x|η))/∂η_k = −s_k(x) + ∂A(η)/∂η_k.

Moreover, we have the relationship (see Appendix):

∂A(η)/∂η_k = E_{x∼p(x|η)}[s_k(x)] = µ_k,

or, in vector form,

∇_η A(η) = E_{x∼p(x|η)}[s(x)] = µ.

This concludes the proof.

From Natural Parameters to Mean Parameters

The identity ∇ηA(η) = µ is a key result:

- The gradient ∇_η A(η) of the log-partition function maps the natural parameter vector η to the corresponding mean parameter vector.

The inverse mapping from µ to η exists only if the distribution is minimal, in which case the mapping between µ and η is one-to-one. We refer to the Appendix for further discussion on this point.

From Natural Parameters to Mean Parameters

The figure illustrates the one-to-one mapping between natural and mean parameters for minimal parametrizations, in which case the log-partition function is strictly convex (i.e., it has strictly positive curvature), and the many-to-one mapping between natural and mean parameters for non-minimal parametrizations, for which the log-partition function is convex (i.e., it has zero curvature for some values of η).


ML Learning

Training Models from the Exponential Family

Let us now consider the problem of ML learning for probabilistic models in the exponential family. Given training data D = {x_n}_{n=1}^N, the training log-loss is given as

L_D(η) = (1/N) Σ_{n=1}^N (−log p(x_n|η)) = −η^T ((1/N) Σ_{n=1}^N s(x_n)) − (1/N) Σ_{n=1}^N M(x_n) + A(η),

where (1/N) Σ_{n=1}^N s(x_n) := s(D) is the empirical average of the sufficient statistics.

Training Models from the Exponential Family

Therefore, the training log-loss, seen as a function of the model parameter vector η, depends on the training set D only through the empirical average of the sufficient statistics

s(D) = (1/N) Σ_{n=1}^N s(x_n).

Note, in fact, that the term (1/N) Σ_{n=1}^N M(x_n) does not depend on the parameter vector η. It follows that training does not require the entire data set, and its complexity – in terms of computation and storage – does not increase with the size of the data set N.

Training Models from the Exponential Family

Using the formula derived above for the score function, the gradient of the training log-loss can be directly computed as

∇L_D(η) = (1/N) Σ_{n=1}^N ∇_η(−log p(x_n|η)) = (1/N) Σ_{n=1}^N (µ − s(x_n)) = µ − s(D),

i.e., the mean error for s(D), where we recall that µ = ∇_η A(η) is the mean parameter vector associated with the natural parameter vector η. The gradient is hence given by a mean error signal, obtained as the difference between the ensemble average µ under the model and the empirical average s(D) = (1/N) Σ_{n=1}^N s(x_n) of the sufficient statistics over the observations.

Training Models from the Exponential Family

Since the log-loss is convex, the stationarity condition ∇LD(η) = 0 is necessary and sufficient for global optimality:

- It follows that the ML estimate of the mean parameters is given as

µ^ML = s(D),

that is, as the empirical average of the sufficient statistics;
- This moment matching condition can be written more explicitly as

E_{x∼p(x|µ)}[s(x)] = E_{x∼p_D(x)}[s(x)],

where the left-hand side equals µ, the right-hand side equals s(D), and p_D(x) is the empirical distribution of the data;
- For minimal distributions, we can obtain the ML estimate η^ML from µ^ML using the one-to-one mapping; for non-minimal families there will be multiple equivalent solutions η^ML for the ML problem.

Example

We have already encountered the moment matching condition in Chapter 6 when discussing the training of generative models, which required the ML training of Bernoulli, categorical, and Gaussian distributions. As a reminder and a simple example, consider the problem of ML training for a Bernoulli distribution Bern(µ) using the training set D = {0, 1, 1, 0, 0, 0, 1, 0, 0, 0} with N = 10. Using the moment matching condition, we have the ML estimate of the mean parameter

µ^ML = (1/N) Σ_{n=1}^N s(x_n) = (1/N) Σ_{n=1}^N x_n = N[1]/N = 3/10,

where N[1] = |{n : x_n = 1}| = 3 is the count of observations equal to 1. Since this is a minimal distribution, we can obtain the unique ML estimate of the logit, i.e., of the natural parameter, as η^ML = log(µ^ML/(1 − µ^ML)).
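In code, the computation in this example amounts to a single empirical average; a minimal sketch using the data set from the slide:

```python
import numpy as np

D = np.array([0, 1, 1, 0, 0, 0, 1, 0, 0, 0])    # training set, N = 10

mu_ml = D.mean()                        # moment matching: empirical average of s(x) = x
eta_ml = np.log(mu_ml / (1 - mu_ml))    # unique ML logit (Bern is minimal)

print(mu_ml, eta_ml)                    # 0.3 and ≈ -0.847
```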

Example

As another example, consider a categorical distribution Cat(µ) with C = 4, and a data set D ={0, 1, 2, 0, 0, 0, 2, 0, 0, 0} with N = 10. Using moment matching, the ML estimate is

µ^ML = (1/N) [N[0], N[1], N[2], N[3]]^T,

which is the standard histogram of the observations. Note the “black-swan problem”: The value x = 3 was never observed, and the model assigns it zero probability! Since the parametrization is not minimal, there is an infinity of possible solutions for the logits, i.e., for the natural parameters, namely any vector

η^ML = [log(µ₀^ML), ..., log(µ_{C−1}^ML)]^T + b × 1_C

for some constant b.

Training Models from the Exponential Family

It is important to realize that the moment matching condition is not always tractable. This is because it may be difficult to compute the moment parameters µ as a function of the model parameters, and hence the condition cannot be solved explicitly. An example is given by the Boltzmann distribution

−log p(x|η) = −(x^T W x + a^T x) + A(η) = −Σ_{i=1}^D Σ_{j=1}^D w_ij x_i x_j − Σ_{i=1}^D a_i x_i + A(η),

where η = (a, W) are the natural parameters and the sufficient statistics are given as s(x) = [x, xx^T]. For cases such as this one, as we have seen in the special case of the RBM in Chapter 7, one can leverage SGD based on the formula for the gradient obtained above.
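The following sketch illustrates the generic gradient update based on ∇L_D(η) = µ(η) − s(D). For readability it uses the Bernoulli model, where µ(η) = σ(η) is available in closed form; for a Boltzmann distribution, the same loop would apply with µ(η) replaced by a sampling-based estimate (this substitution is an illustration, not a recipe from the slides).

```python
import numpy as np

D = np.array([0, 1, 1, 0, 0, 0, 1, 0, 0, 0])
s_D = D.mean()                        # empirical average of the sufficient statistic

eta, lr = 0.0, 1.0                    # initialization and learning rate (assumed values)
for _ in range(200):
    mu = 1 / (1 + np.exp(-eta))       # mean parameter under the current model
    eta -= lr * (mu - s_D)            # gradient step: grad L_D(eta) = mu(eta) - s(D)

print(1 / (1 + np.exp(-eta)), s_D)    # fitted mean ≈ 0.3 = s(D), as for moment matching
```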

Information-Theoretic Metrics for the Exponential Family

Information-Theoretic Metrics for the Exponential Family

Distributions in the exponential family have the useful property that information-theoretic metrics can be efficiently evaluated. This is important in many learning methods that rely on information-theoretic metrics such as KL divergence and entropy. Using the augmented formulation p(x|η) ∝ exp(η̃^T ũ(x)) seen above, the general form of the entropy for distributions in the exponential family is (see Appendix for a proof)

H(ExpFam(x|η)) = E_{x∼p(x|η)}[−log p(x|η)] = −η̃^T µ̃ + A(η),

where we have defined the augmented mean vector as µ̃ = [µ^T, E_{x∼p(x|η)}[M(x)]]^T. This relation shows that the (negative) entropy and the log-partition function are “dual” to each other, in a sense that is made precise in the Appendix.

Information-Theoretic Metrics for the Exponential Family

This general expression can be evaluated explicitly as a function of µ or η for each distribution in the exponential family. Examples are given in the table.

distribution | H(ExpFam(x|η))
Bern(µ) | −µ log µ − (1 − µ) log(1 − µ)
Cat(µ) | −Σ_{k=0}^{C−1} µ_k log(µ_k)
N(ν, Θ⁻¹) | (1/2) log det((2πe)Θ⁻¹)

Information-Theoretic Metrics for the Exponential Family

The general form of the KL divergence for distributions in the same class within the exponential family can be computed as

KL(ExpFam(x|η₁)||ExpFam(x|η₂)) = E_{x∼p(x|η₁)}[log(p(x|η₁)/p(x|η₂))] = A(η₂) − A(η₁) − (η₂ − η₁)^T µ₁,

where µ₁ is the mean parameter vector corresponding to η₁. This formula shows that the KL divergence can be computed as a distance between η₁ and η₂, in a sense that is again made precise in the Appendix.

Information-Theoretic Metrics for the Exponential Family

The formula can be computed explicitly for all distributions in the exponential family, and some examples can be found in the table.

p(x) | q(x) | KL(p||q)
Bern(µ) | Bern(µ̄) | µ log(µ/µ̄) + (1 − µ) log((1 − µ)/(1 − µ̄))
Cat(µ) | Cat(µ̄) | Σ_{k=0}^{C−1} µ_k log(µ_k/µ̄_k)
N(ν, Σ) | N(ν̄, Σ̄) | (1/2)[tr(Σ̄⁻¹Σ) + log(det(Σ̄)/det(Σ)) + (ν̄ − ν)^T Σ̄⁻¹(ν̄ − ν) − D]
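A quick numerical cross-check of the general expression against the Bernoulli entry of this table, for the arbitrary pair µ = 0.3 and µ̄ = 0.7:

```python
import numpy as np

def A(eta):                           # Bernoulli log-partition function
    return np.log(1 + np.exp(eta))

mu1, mu2 = 0.3, 0.7
eta1, eta2 = np.log(mu1 / (1 - mu1)), np.log(mu2 / (1 - mu2))

# General exponential-family formula: A(eta2) - A(eta1) - (eta2 - eta1) * mu1.
kl_general = A(eta2) - A(eta1) - (eta2 - eta1) * mu1

# Closed-form Bernoulli KL divergence from the table.
kl_table = mu1 * np.log(mu1 / mu2) + (1 - mu1) * np.log((1 - mu1) / (1 - mu2))

print(kl_general, kl_table)           # both ≈ 0.339
```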

Fisher Information Matrix

Score Vector and Fisher Information Matrix

Related to information-theoretic measures is an important metric known as the Fisher information matrix (FIM). To describe it, let us focus first on a general probabilistic model p(x|θ), not necessarily from the exponential family. The cross entropy

H(p(x|θ), p(x|θ′)) = E_{x∼p(x|θ)}[−log p(x|θ′)]

can be interpreted as the population log-loss when the population distribution is p(x|θ) for some ground-truth value θ and θ′ is the model parameter vector. We know that the minimum of H(p(x|θ), p(x|θ′)) over θ′ – and equivalently the minimum of KL(p(x|θ)||p(x|θ′)) over θ′ – is given by θ′ = θ. This has two useful consequences.

Score Vector and Fisher Information Matrix

1) The first-order optimality condition requires the equality ∇_{θ′} H(p(x|θ), p(x|θ′))|_{θ′=θ} = 0, which implies

∇_{θ′} H(p(x|θ), p(x|θ′))|_{θ′=θ} = E_{x∼p(x|θ)}[−∇_θ log p(x|θ)] = 0,

where ∇_θ log p(x|θ) is the score vector:

- The score vector ∇_θ log p(x|θ) has zero mean when averaged over p(x|θ).

Score Vector and Fisher Information Matrix

2) The second-order optimality condition ∇²_{θ′} H(p(x|θ), p(x|θ′))|_{θ′=θ} ⪰ 0 implies

∇²_{θ′} H(p(x|θ), p(x|θ′))|_{θ′=θ} = E_{x∼p(x|θ)}[−∇²_θ log p(x|θ)] ⪰ 0,

where the expectation on the right-hand side defines FIM(θ), the Fisher information matrix (FIM):

- The FIM FIM(θ) measures the (non-negative) curvature of the population log-loss H(p(x|θ), p(x|θ′)) at the optimal point θ′ = θ:
  - The “larger” the FIM FIM(θ) is, the “easier” it is to obtain the optimal value θ when minimizing H(p(x|θ), p(x|θ′)) over θ′.

Fisher Information Matrix

The FIM FIM(θ) quantifies the amount of information that data generated from the model p(x|θ) provides about the value of the model parameter θ.

Ex.: For a Bern(µ) rv, the FIM is FIM(µ) = 1/(µ(1 − µ)):

- When data is generated as x ∼ Bern(µ), it is easier to estimate the parameter values µ = 0 and µ = 1, and most difficult to estimate µ = 0.5, at which point the observations are maximally random.

Ex.: For a N(ν, β⁻¹) rv with a fixed precision β, the FIM is FIM(ν) = β:
- When data is generated as x ∼ N(ν, β⁻¹), all values ν are equally difficult to estimate, and the amount of information we have about ν increases with the precision β.

Fisher Information Matrix

The FIM can also be written as the covariance of the score vector, i.e.,

FIM(θ) = E_{x∼p(x|θ)}[−∇²_θ log p(x|θ)] = E_{x∼p(x|θ)}[(∇_θ log p(x|θ))(∇_θ log p(x|θ))^T].

Note that the above is the covariance matrix since the mean of the score vector is zero. A proof of this equality and further discussion on the FIM can be found in the Appendix.
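The covariance interpretation can be verified by Monte Carlo simulation; the sketch below does so for the Bernoulli example above, with µ = 0.3 and a sample size chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.3
x = rng.binomial(1, mu, size=200_000)          # samples x ~ Bern(mu)

# Score with respect to the mean parameter: d/dmu log Bern(x|mu) = x/mu - (1-x)/(1-mu).
score = x / mu - (1 - x) / (1 - mu)

print(score.mean())                            # ≈ 0: the score has zero mean
print((score**2).mean(), 1 / (mu * (1 - mu)))  # both ≈ 4.76 = FIM(mu)
```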

Fisher Information Matrix

The same argument can also be applied by swapping the roles of θ and θ₀. Putting together the resulting first- and second-order derivatives, we have the following useful second-order Taylor approximation of the KL divergence

KL(p(x|θ)||p(x|θ₀)) ≃ (1/2)(θ − θ₀)^T FIM(θ₀)(θ − θ₀)

around any point θ₀.

So the FIM FIM(θ₀) describes the curvature of the KL divergence around θ₀.

Fisher Information Matrix for Exponential Family

Having discussed the FIM in the context of general probability models, we now specialize the results to the exponential family. Using the general definition above, the FIM for the natural parameters can be directly computed as

FIM(η) = E_{x∼p(x|η)}[−∇²_η log p(x|η)] = ∇²_η A(η),

that is, the FIM is the Hessian of the log-partition function. This implies that the FIM is positive definite, FIM(η) ≻ 0, and hence invertible, if the class of distributions is minimal.

Fisher Information Matrix for Exponential Family

h T i FIM(η) = Ex∼p(x|η) (s(x) − µ)(s(x) − µ) . Proofs are in the Appendix. The FIM can also be used to define an alternative to gradient descent that is known as natural gradient descent. Unlike gradient descent, natural gradient descent operates in the geometry implied by the KL divergence in the space of distributions. Intuitively, and informally, natural gradient descent can be seen as a way to approximate Newton’s method. This is because the Hessian −1 PN 2 −1 PN N n=1(−∇η log p(xn|η)) of the log-loss N n=1(− log p(xn|η)) for general probabilistic models tends to the FIM as N grows large. Note that this is true if the data is assumed to be generated by the model, i.e., if we have i.i.d. samples xn ∼ p(x|η). More discussion can be found in the Appendix. Osvaldo Simeone ML4Engineers 63 / 97 Fisher Information Matrix for Exponential Family

The general formula above can be computed explicitly for all distributions in the exponential family, as per the examples in the table.

distribution | FIM(µ)
Bern(µ) | 1/(µ(1 − µ))
N(ν, Θ⁻¹), fixed Θ | Θ
N(ν, Θ⁻¹) | blockdiag(Θ, (1/2)Θ⁻²)

Generalized Linear Models

Generalized Linear Models (GLM)

The exponential family provides a flexible class of distributions to model densities. In many problems in machine learning, we need to model conditional distributions. A useful extension of exponential-family distributions to conditional models is given by GLMs. We have already encountered several GLMs in Chapters 4 and 6, namely polynomial regression in Chapter 4 and logistic and softmax regression models in Chapter 6. A GLM defines a model class of conditional distributions of the form p(t|x, W) = ExpFam(t|µ = g(Wu(x))), where

- ExpFam(t|µ) represents any distribution in the exponential family with mean parameter vector µ;
- u(x) is a vector of features, as defined in previous chapters;
- W is a matrix defining the model parameters;
- g(·) is an invertible function – the inverse g⁻¹(·) is known as the link function.

Generalized Linear Models (GLM)

Intuitively, GLMs generalize deterministic linear models of the form t = Wu(x) by “adding noise” around the mean µ = g(Wu(x)) that is drawn from a distribution in the exponential family. GLMs in the canonical form are written as

p(t|x, W ) = ExpFam(t|η = Wu(x)),

so that the natural parameter vector is the linear function η = Wu(x). Note that here the rv is t, while x is the fixed input. This corresponds to setting the inverse of the link function as g(η) = ∇ηA(η). As two examples, we will now see that logistic and softmax regression are special cases of GLMs.

Logistic Regression as a GLM

As we discussed in Chapter 6, logistic regression assumes that the label is conditionally distributed as (t|x = x, θ) ∼ Bern(σ(θ^T u(x))), so that we have the predictive distribution p(t = 1|x, θ) = σ(θ^T u(x)) for model parameter θ. Therefore, logistic regression is a GLM with exponential-family distribution given by the Bernoulli distribution and natural parameter given by the logit η = θ^T u(x). Note that the gradient with respect to θ derived in Chapter 6 can be directly obtained from the score function derived above via the chain rule as

∇_θ(−log Bern(t|σ(θ^T u(x)))) = (∂/∂η)(−log Bern(t|η))|_{η=θ^T u(x)} × ∇_θ η = (σ(θ^T u(x)) − t) × u(x),

where σ(θ^T u(x)) − t := δ(x, t) is the mean error and u(x) is the feature vector.
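A short numerical illustration of this chain-rule expression, with randomly drawn θ and u(x) (purely illustrative) and a finite-difference check of the log-loss gradient:

```python
import numpy as np

def sigma(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)
theta, u, t = rng.normal(size=3), rng.normal(size=3), 1.0

# Chain rule: gradient of the log-loss is (sigma(theta^T u) - t) * u(x).
grad = (sigma(theta @ u) - t) * u

def log_loss(th):                      # -log Bern(t | sigma(th^T u))
    p = sigma(th @ u)
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

eps = 1e-6
grad_fd = np.array([(log_loss(theta + eps * e) - log_loss(theta - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(grad, grad_fd, atol=1e-5))   # True
```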

Softmax Regression as GLM

Softmax regression assumes that the one-hot label vector is distributed as

(t|x = x, W ) ∼ Cat(t|η = Wu(x))

for model parameter matrix W. Therefore, the conditional distribution p(t|x, W) is a GLM with exponential-family distribution given by the categorical distribution and natural parameter given by the logit vector η = Wu(x). Again, it can be readily checked that the gradient with respect to W derived in Chapter 6 can be directly obtained from the score function derived above via the chain rule.
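For completeness, a sketch of the corresponding check for softmax regression, where the gradient with respect to W takes the outer-product form (softmax(Wu(x)) − t^OH) u(x)^T; dimensions and values are arbitrary.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

rng = np.random.default_rng(2)
C, d = 4, 3
W, u = rng.normal(size=(C, d)), rng.normal(size=d)
t_onehot = np.eye(C)[2]                          # one-hot label for class 2

# Score function + chain rule: grad_W(-log Cat(t|eta = W u)) = (softmax(W u) - t_onehot) u^T.
grad = np.outer(softmax(W @ u) - t_onehot, u)

def log_loss(W_):
    return -np.log(softmax(W_ @ u) @ t_onehot)

eps = 1e-6
grad_fd = np.zeros_like(W)
for i in range(C):
    for j in range(d):
        E = np.zeros_like(W); E[i, j] = eps
        grad_fd[i, j] = (log_loss(W + E) - log_loss(W - E)) / (2 * eps)
print(np.allclose(grad, grad_fd, atol=1e-5))     # True
```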

Summary

Summary

The exponential family contains a large number of discrete and continuous parametric distributions:

- it contains all distributions with a finite-dimensional parameterization and fixed support.

Distributions in the exponential family have convex log-loss, easy-to-compute score vectors (i.e., gradients of the log-loss) and information-theoretic measures, including FIMs. A distribution in the exponential family can be expressed in terms of natural parameters η or mean parameters µ; we can write

p(x|η) = ExpFam(x|η)

or p(x|µ) = ExpFam(x|µ) where ExpFam denotes any distribution in the exponential family.

Summary

For every natural parameter vector η, there is a unique mean vector µ = E_{x∼ExpFam(x|η)}[s(x)]. For a mean vector µ, there may be more than one natural parameter vector η unless the distribution class is minimal. For all distributions in the exponential family, by definition, the log-loss is a convex function

−log(ExpFam(x|η)) = −η^T s(x) + A(η) + terms indep. of η.

Furthermore, for a training set D, the training log-loss

L_D(η) = (1/N) Σ_{n=1}^N (−log(ExpFam(x_n|η))) = −η^T s(D) + A(η),

and hence also the ML estimates of the parameters, depend only on the empirical average of the sufficient statistics s(D) = (1/N) Σ_{n=1}^N s(x_n).

Summary

The gradient of the log-loss, or negative score vector, is given as

∇_η(−log(ExpFam(x|η))) = µ − s(x), i.e., the mean error for s(x),

and the gradient of the training loss is given as

∇L_D(η) = µ − s(D), i.e., the mean error for s(D).

ML learning is carried out via moment matching or, when this is not feasible, via gradient descent. Information-theoretic measures and the FIM can be easily computed.

Appendix

Gradient of the Log-Partition Function

Here, we validate the relationship between mean parameters and gradient of the log-partition function introduced in the text. This is done through the direct calculation

∂A(η)/∂η_k = [Σ_x s_k(x) exp(η^T s(x) + M(x))] / [Σ_{x′} exp(η^T s(x′) + M(x′))] = E_{x∼p(x|η)}[s_k(x)] = µ_k.

The same derivation applies for continuous rvs by replacing sums with integrals.

Duality and Exponential Family

Exponential family distributions have interesting properties in terms of duality, as we explore next. To start, let us first define the Bregman divergence:

- given a convex function f(·), the Bregman divergence is defined as

B_f(x, y) = f(x) − f(y) − (x − y)^T ∇f(y);

- using the convexity of f(·), it can be directly proved that we have B_f(x, y) ≥ 0 and B_f(x, x) = 0;
- it can also be seen that B_f(x, y) is convex in x, but not necessarily in y.
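Two standard instances, computed numerically below under the definition above: the squared Euclidean distance (from f(x) = (1/2)||x||²) and the KL divergence between probability vectors (from the negative entropy); the vectors are arbitrary points on the simplex.

```python
import numpy as np

def bregman(f, grad_f, x, y):
    # B_f(x, y) = f(x) - f(y) - (x - y)^T grad_f(y)
    return f(x) - f(y) - (x - y) @ grad_f(y)

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.4, 0.4, 0.2])

# f(v) = (1/2)||v||^2 yields half the squared Euclidean distance.
print(bregman(lambda v: 0.5 * v @ v, lambda v: v, x, y),
      0.5 * np.sum((x - y)**2))

# f(v) = sum v log v (negative entropy) yields the KL divergence on the simplex.
print(bregman(lambda v: np.sum(v * np.log(v)), lambda v: np.log(v) + 1, x, y),
      np.sum(x * np.log(x / y)))
```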

Convex Duality

We also need the definition of the convex dual, or Fenchel dual, of a convex function f(x) as

f*(y) = max_x { x^T y − f(x) }.

Geometrically, the convex dual finds the negative intercept of the tangent to the function f(x) with gradient y. In fact, since f(x) is convex, global optimal solutions can be found by solving the equation ∇f(x) = y. Accordingly, we have f*(y) = x^T ∇f(x) − f(x), where x is such that y = ∇f(x). The intercept is defined as the value at the origin of the domain of the first-order approximation, i.e.,

f(x) + ∇f(x)^T(0 − x).

A strictly convex function can be fully described by the set of intercepts for each possible gradient value y. In fact, for every possible gradient value y, there is at most one value of x for which we have ∇f (x) = y. For a convex function, there may be multiple such values of x. In either case, we also have f ∗∗(x) = f (x), that is, the convex dual of the convex dual is the function itself.

Duality and Exponential Family

Since the log-partition function is convex, we can define its convex dual as

A*(y) = max_η { η^T y − A(η) }.

Under the assumption of minimality, the log-partition function is strictly convex, and hence its global optimum can be obtained by applying the first-order optimality condition ∇_η A(η) = y:

- It follows that, for y = µ, the optimal value of η is the natural parameter vector corresponding to the mean parameter µ.

We conclude that we have the equality

A∗(µ) = ηT µ − A(η),

where (µ, η) is the pair of mean and natural parameters.

Duality and Exponential Family

A useful way to recall this relationship is

A∗(µ) + A(η) = ηT µ.

This relationship can also be used to prove that, for minimal distributions, we have the inverse mapping between natural and mean parameters ∇_µ A*(µ) = η. We now discuss some important implications of duality.

Distribution and Log-Loss as a Function of the Mean Parameters

With this background, we can first prove that, under the assumption of minimality, the distribution can be expressed as

p(x|µ) = exp(−B_{A*}(s(x), µ) + G(x)),

for a suitable function G(x) independent of the model parameters. Hence, all minimal distributions in the exponential family have an associated measure of distance in the mean parameter space:

- The probability of x depends on how far s(x) is from the mean vector µ.

Ex.: For the Gaussian distribution N(µ, 1), we have p(x|µ) ∝ exp(−||x − µ||²/2), where ||x − µ||²/2 = B_{A*}(s(x), µ).

Distribution and Log-Loss as a Function of the Mean Parameters

From the table, the Gaussian distribution is associated with the (weighted) squared Euclidean distance, while Bernoulli and categorical variables are associated with the KL divergence. We have defined KL(x||µ) = x log(x/µ) + (1 − x) log((1 − x)/(1 − µ)) for the Bernoulli distribution and KL(x||µ) = Σ_{k=0}^{C−1} x_k log(x_k/µ_k) for the categorical distribution.

distribution | A(η) | A*(µ) | B_{A*}(s(x), µ)
Bern(µ) | log(1 + e^η) | µ log µ + (1 − µ) log(1 − µ) | KL(x||µ)
Cat(µ) | log(Σ_{k=0}^{C−1} e^{η_k}) | Σ_{k=0}^{C−1} µ_k log µ_k | KL(x||µ)
N(ν, Θ⁻¹) with fixed Θ | (1/2) η^T Θ⁻¹ η | (1/2) µ^T Θ µ | (1/2) ||s(x) − µ||²_Θ

Distribution and Log-Loss as a Function of the Mean Parameters

Proof of the equality p(x|µ) = exp(−B_{A*}(s(x), µ) + G(x)): We have

p(x|η) = exp(η^T s(x) − A(η) + M(x))
= exp(η^T s(x) − (η^T µ − A*(µ)) + M(x))
= exp(A*(µ) + η^T (s(x) − µ) + M(x))
= exp(−(A*(s(x)) − A*(µ) − (s(x) − µ)^T ∇A*(µ)) + (A*(s(x)) + M(x))),

where the first term in parentheses is B_{A*}(s(x), µ) and we define G(x) := A*(s(x)) + M(x).

Duality and Entropy

As another implication of duality, we show now that negative entropy and log-partition function are dual to one another. As we have seen in the main text, the entropy of an exponential family distribution can be written as

H(p(x|η)) = E_{x∼p(x|η)}[−log p(x|η)] = −η^T µ + A(η) − E_{x∼p(x|η)}[M(x)].

So, we have the duality relationship between entropy and log-partition function

A*(µ) = −H(p(x|η)) − E_{x∼p(x|η)}[M(x)].

Duality and KL Divergence

We now relate the KL divergence between two distributions p(x|η1) = ExpFam(x|η1) and p(x|η2) = ExpFam(x|η2) from the same exponential family with log-partition function A(·), and sufficient statistics s(·), to the Bregman divergence. This will allow us to obtain an explicit expression as a function of the moment parameters µ1 and µ2. As we have seen in the text, we have the following identity

KL(ExpFam(x|η₁)||ExpFam(x|η₂)) = B_A(η₂, η₁) = A(η₂) − A(η₁) − (η₂ − η₁)^T µ₁.

Using duality for minimal families, we can also write

KL(ExpFam(x|µ₁)||ExpFam(x|µ₂)) = B_{A*}(µ₁, µ₂) = A*(µ₁) − A*(µ₂) − (µ₁ − µ₂)^T η₂.

Fisher Information Matrix

We now prove the equality

FIM(θ) = E_{x∼p(x|θ)}[−∇²_θ log p(x|θ)] = E_{x∼p(x|θ)}[(∇_θ log p(x|θ))(∇_θ log p(x|θ))^T].

To prove this result, we first note that the Hessian ∇²_θ log p(x|θ) can be written as

∇²_θ log p(x|θ) = ∇_θ(∇_θ log p(x|θ))
= ∇_θ( ∇_θ p(x|θ) / p(x|θ) )
= −(∇_θ p(x|θ)/p(x|θ))(∇_θ p(x|θ)/p(x|θ))^T + ∇²_θ p(x|θ)/p(x|θ)
= −(∇_θ log p(x|θ))(∇_θ log p(x|θ))^T + ∇²_θ p(x|θ)/p(x|θ).

Fisher Information Matrix

Now, taking the expectation, the first term recovers the desired result, while the second equals an all-zero matrix since

E_{x∼p(x|θ)}[∇²_θ p(x|θ)/p(x|θ)] = Σ_x ∇²_θ p(x|θ) = ∇²_θ Σ_x p(x|θ) = 0,

where the last step uses Σ_x p(x|θ) = 1.

FIM and Estimation

We have said in the text that FIM(θ) measures how “easy” it is to estimate θ based on data from the model p(x|θ). We now discuss a formal statement of this property. In statistics, and sometimes also in machine learning, it is assumed that the model class

H = {p(x, t|θ): θ ∈ Θ}

includes the (unknown) population distribution p(x, t) for some true (unknown) value of the parameter vector θ₀, i.e., p(x, t|θ₀) = p(x, t). This is also referred to as a realizability assumption since the true distribution can be “realized” by the model.

In this case, one can ask how well the true parameter θ₀ is estimated.

FIM and Estimation

Under the realizability assumption, it is common in statistics to provide properties of an estimator of θ₀ based on data. Most notably, as the number of data points goes to infinity, i.e., N → ∞, one can prove the following properties for the ML estimator θ^ML:
- ML provides a consistent estimate, that is, we have θ^ML → θ₀ with high probability;
- ML is asymptotically Gaussian, that is, the rv √N(θ^ML − θ₀) tends to have distribution N(0, FIM(θ₀)⁻¹).

Fisher Information Matrix for Exponential Family

Here, we prove the equalities FIM(η) = ∇²_η A(η) = E_{x∼p(x|η)}[(s(x) − µ)(s(x) − µ)^T]. Using the expression for the score vector ∇_η log p(x|η) = s(x) − µ, we have

FIM(η) = E_{x∼p(x|η)}[−∇²_η log p(x|η)]
= E_{x∼p(x|η)}[−∇_η(∇_η log p(x|η))]
= E_{x∼p(x|η)}[−∇_η s(x) + ∇_η µ]
= ∇_η µ
= ∇_η(∇_η A(η))
= ∇²_η A(η).

Fisher Information Matrix for Exponential Family

Using the expression for the score vector ∇η(log p(x|η)) = s(x) − µ, we also have

FIM(η) = E_{x∼p(x|η)}[(∇_η log p(x|η))(∇_η log p(x|η))^T] = E_{x∼p(x|µ)}[(s(x) − µ)(s(x) − µ)^T].

Natural Gradient Descent

Recalling that we have the mapping ∇ηA(η) = µ, the FIM can also be expressed in terms of the mean parameters as the Jacobian

FIM(η) = ∇²_η A(η) = ∇_η µ.

As we will see next, this is useful to relate derivatives with respect to mean and natural parameters. To see this, consider a function g(η) of the natural parameters, e.g., g(η) = E_{x∼p(x|η)}[f(x)] for some function f(·). We have the following relationship between the gradients ∇_η g(η) and ∇_µ g(µ):

∇_η g(η) = ∇_η µ · ∇_µ g(µ) = FIM(η) · ∇_µ g(µ).

Furthermore, if the class of distributions is minimal, we can also write

∇_µ g(µ) = (FIM(η))⁻¹ ∇_η g(η).

Natural Gradient Descent

In gradient descent, we minimize at each step the following strictly convex approximant of the cost function

g̃_γ(θ; θ^(i)) = g(θ^(i)) + ∇g(θ^(i))^T(θ − θ^(i)) + (1/(2γ))||θ − θ^(i)||²,

where the first two terms form the first-order Taylor approximation and the last term is a proximity penalty.

Therefore, the distance between θ and the previous iterate is measured by the Euclidean distance ||θ − θ^(i)||²:

- With probabilistic models, the Euclidean distance may not reflect actual changes in the distribution. Therefore, the choice of the learning rate must be conservative in order to avoid unstable updates.

Natural Gradient Descent

As an example, consider the optimization over the mean µ for a Gaussian distribution with fixed variance σ². For a fixed squared Euclidean distance ||µ − µ^(i)||², the two distributions N(µ, σ²) and N(µ^(i), σ²) may be more or less distinct depending on the value of σ²:
- if σ² is large compared to ||µ − µ^(i)||², the two distributions are very similar;
- while the opposite is true when σ² is sufficiently smaller.

Natural Gradient Descent

In light of this, natural gradient descent replaces the squared Euclidean distance ||θ − θ^(i)||² with a more relevant measure of the distance between the two distributions p(x|θ) and p(x|θ^(i)). As we know, a relevant measure is the KL divergence KL(p(x|θ)||p(x|θ^(i))). Using the KL divergence directly as a penalty yields the mirror descent method (to be discussed). Natural gradient descent approximates the KL divergence, assuming that ∆θ^(i) = θ − θ^(i) is sufficiently small, by using the discussed second-order Taylor approximation with respect to ∆θ^(i):

KL(p(x|θ)||p(x|θ^(i))) ≃ (1/2)(θ − θ^(i))^T FIM(θ^(i))(θ − θ^(i)) = (1/2)||θ − θ^(i)||²_{FIM(θ^(i))}.

Note that, if FIM(θ^(i)) = I, we recover the standard squared Euclidean distance.

Natural Gradient Descent

Overall, natural gradient descent minimizes at each iteration

g̃_γ(θ; θ^(i)) = g(θ^(i)) + ∇g(θ^(i))^T(θ − θ^(i)) + (1/(2γ))||θ − θ^(i)||²_{FIM(θ^(i))},

where the first two terms form the first-order Taylor approximation and the last term is a proximity penalty. This yields the update

θ^(i+1) = θ^(i) − γ(FIM(θ^(i)))⁻¹ ∇g(θ^(i)).

Note that this applies to any distribution parametrized by a vector θ, and not only to the exponential family. Natural gradient descent is hence generally more complex, because it requires inverting the FIM. However, by operating with a metric that is tailored to the space of distributions, it generally allows the use of a larger step size, potentially reducing the number of iterations needed to obtain a desirable value of the cost function.

Natural Gradient Descent for the Exponential Family

But the natural-gradient update simplifies for minimal exponential-family distributions. To see this, consider natural gradient descent with respect to the natural parameters, whose update is given as

η^(i+1) = η^(i) − γ(FIM(η^(i)))⁻¹ ∇_η g(η^(i)).

As we have seen in the text, we have the equality (FIM(η^(i)))⁻¹ ∇_η g(η^(i)) = ∇_µ g(µ^(i)), which yields the equivalent simplified update

η^(i+1) = η^(i) − γ ∇_µ g(µ^(i)).

So, natural gradient descent on the natural parameters follows the gradient with respect to the mean parameters.
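A minimal sketch of this simplification for a Bernoulli model, with the illustrative cost g(µ) = (µ − 0.8)² and an arbitrary step size: at every iteration the natural-gradient step (FIM⁻¹ times the gradient in η) coincides with the plain gradient with respect to µ.

```python
import numpy as np

sigma = lambda eta: 1 / (1 + np.exp(-eta))
grad_mu = lambda mu: 2 * (mu - 0.8)           # gradient of g(mu) = (mu - 0.8)^2

eta, gamma = 0.0, 0.5
for _ in range(100):
    mu = sigma(eta)
    fim = mu * (1 - mu)                       # FIM(eta) = mu(1 - mu) for the Bernoulli
    grad_eta = fim * grad_mu(mu)              # chain rule: grad_eta g = FIM(eta) * grad_mu g
    assert np.isclose(grad_eta / fim, grad_mu(mu))   # natural gradient = gradient in mu
    eta -= gamma * grad_mu(mu)                # natural-gradient update on eta

print(sigma(eta))                             # ≈ 0.8, the minimizer of g
```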

Mirror Descent

In contrast to natural gradient descent, mirror descent minimizes at each iteration the convex approximation

g̃_γ(θ; θ^(i)) = g(θ^(i)) + ∇g(θ^(i))^T(θ − θ^(i)) + (1/γ) KL(p(x|θ)||p(x|θ^(i))),

where the first two terms form the first-order Taylor approximation and the last term is a proximity penalty. Note that this applies to any distribution parametrized by a vector θ, and not only to the exponential family. Consider the important case of a categorical distribution:

g̃_γ(µ; µ^(i)) = g(µ^(i)) + ∇g(µ^(i))^T(µ − µ^(i)) + (1/γ) KL(Cat(x|µ)||Cat(x|µ^(i))).

This yields the exponentiated gradient update

µ_k^(i+1) = µ_k^(i) exp(−γ[∇g(µ^(i))]_k) / Σ_{c=0}^{K−1} µ_c^(i) exp(−γ[∇g(µ^(i))]_c).
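To close, a sketch of this exponentiated gradient update on the simplex, for an illustrative linear cost g(µ) = c^T µ with an arbitrary cost vector c: the iterates concentrate on the entry with the smallest cost while remaining valid probability vectors.

```python
import numpy as np

c = np.array([1.0, 0.2, 0.5])             # gradient of the linear cost g(mu) = c^T mu
mu = np.ones(3) / 3                       # initial categorical mean parameters
gamma = 1.0

for _ in range(20):
    w = mu * np.exp(-gamma * c)           # mu_k * exp(-gamma [grad g]_k)
    mu = w / w.sum()                      # renormalize onto the probability simplex

print(mu, mu.sum())                       # mass concentrates on index 1; still sums to 1
```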
