Exponential Family Distributions

CS242: Probabilistic Graphical Models Lecture 3A: Exponential Families & Graphical Models Professor Erik Sudderth Brown University Computer Science September 27, 2016 Some figures and materials courtesy David Barber, Bayesian Reasoning and Machine Learning http://www.cs.ucl.ac.uk/staff/d.barber/brml/ CS242: Lecture 3A Outline Ø Exponential Families of Probability Distributions Ø Review: Maximum Likelihood (ML) Estimation Ø ML Estimation for Exponential Families 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 −10 −5 0 5 10 Exponential Families of Distributions 1 p(x ✓)= ⌫(x)exp ✓T φ(x) Z(✓)= ⌫(x)exp ✓T φ(x) dx | Z(✓) { } { } ZX = ⌫(x)exp ✓T φ(x) Φ(✓) Φ(✓) = log Z(✓) { − } d fixed vector of sufficient statistics (features), φ(x) R 2 specifying the family of distributions d unknown vector of natural parameters, ✓ ⇥ R 2 ✓ determine particular distribution in this family normalization constant or partition function Z(✓) > 0 (for discrete variables, integral becomes sum) reference measure independent of parameters ⌫(x) > 0 (for many models, we simply have ⌫ ( x )=1 ) d ⇥= ✓ R Z(✓) < ensures we construct a valid distribution { 2 | 1} Preview: Why the Exponential Family? p(x ✓)=⌫(x)exp ✓T φ(x) Φ(✓) Φ(✓) = log ⌫(x)exp ✓T φ(x) dx | { − } { } d ZX ⇥= ✓ R Φ(✓) < { 2 | 1} d φ(x) R vector of sufficient statistics (features) defining the family 2 d ✓ R vector of natural parameters indexing particular distributions • Allows simultaneous✓ study of many popular probability distributions: Bernoulli (binary), Categorical, Poisson (counts), Exponential (positive), Gaussian (real), … • Maximum likelihood (ML) learning is simple: moment matching of sufficient statistics • Bayesian learning is simple: conjugate priors are available Beta, Dirichlet, Gamma, Gaussian, Wishart, … • The maximum entropy interpretation: Among all distributions with certain moments of interest, the exponential family is the most random (makes fewest assumptions) • Parametric and predictive sufficiency: For arbitrarily large datasets, optimal learning is possible from a finite-dimensional set of statistics (streaming, big data) Natural Logarithms and Exponentials + ln(x) = log(x):R R ! exp(log(x)) = x log(xyz) = log(x) + log(y) + log(z) log(xa)=a log(x) Reminder: Logistic Functions simple derivatives convexity 1 @(z) = σ(z)σ( z) log σ(z) = log(1 + e z) σ(z)= z − 1+e− @z − − • Symmetry: • Log-odds ratio: σ(z) σ( z)=1 σ(z) log = z − − 1 σ(z) − Bernoulli Distribution Bernoulli Distribution: Single toss of a (possibly biased) coin x 1 x Ber(x µ)=µ (1 µ) − x 0, 1 | − 2{ } E[x µ]=P[x = 1] = µ 0 µ 1 | Exponential Family Form: Derivation on board Ber(x ✓)=exp ✓x Φ(✓) ✓ ⇥=R 1 10 | { − } 2 0.9 9 Logistic Function: 0.8 8 Log-Normalizer 1 0.7 7 is Convex µ =(1+exp( ✓))− 0.6 6 − 0.5 5 ✓ = log(µ) log(1 µ) 0.4 4 − − 0.3 3 Log Normalization: 0.2 2 0.1 1 ✓ 0 0 Φ(✓) = log(1 + e ) −10 −5 0 5 10 −10 −8 −6 −4 −2 0 2 4 6 8 10 Categorical Distribution Categorical Distribution: Single roll of a (possibly biased) die K K Cat(x µ)= µxk = 0, 1 K , x =1 | k X { } k k=1 k=1 Y XK K=3 µ1 0 µ 1 µk =1 k kX=1 µk = E[xk] µ2 µ3 Categorical Distribution Categorical Distribution: Single roll of a (possibly biased) die K K Cat(x µ)= µxk = 0, 1 K , x =1 | k X { } k k=1 k=1 Exponential FamilyY Form: XK K µk =1 Cat(x ✓)=exp k=1 ✓kxk Φ(✓) | { − } k=1 K X Φ(✓) = log P `=1 exp(✓`) 0 µ 1 k Ø A linear subspace of ⇣exponentialP family parameters⌘ gives the same probabilities, because the features are linearly dependent: k xk =1 Cat(x ✓) = Cat(x ✓ + c) for any scalar constantP c | | Multinomial Logistic (softmax) Function “uniform” probabilities “max” µk(✓/T ), ✓ =(3, 0, 1) exp(✓k) µk(✓)= K ,k=1,...,K. `=1 exp(✓`) K ✓k R,µk(✓) > 0, µk(✓)=1. 2 P k=1 P Softmax versus Logistic Function 1 1 Assume K=2 categories: 0.9 σ(z)= z 0.8 z e 1 1 1+e− µ1(z)= = = σ(z1 z2) 0.7 z1 z2 z2 z1 e + e 1+e − − 0.6 z2 0.5 e 0.4 µ2(z)= =1 s1(z)=σ(z2 z1) ez1 + ez2 − − 0.3 0.2 Binary logistic function is a special case. 0.1 0 −10 −5 0 5 10 exp(✓k) µk(✓)= K ,k=1,...,K. `=1 exp(✓`) K ✓k R,µk(✓) > 0, µk(✓)=1. 2 P k=1 P Minimal Exponential Families 1 1 Assume K=2 categories: 0.9 σ(z)= z 0.8 z e 1 1 1+e− µ1(z)= = = σ(z1 z2) 0.7 z1 z2 z2 z1 e + e 1+e − − 0.6 z2 0.5 e 0.4 µ2(z)= =1 s1(z)=σ(z2 z1) ez1 + ez2 − − 0.3 0.2 0.1 0 −10 −5 0 5 10 Ø In a minimal exponential family representation, the features must be linearly independent. Example: Ber(x ✓)=exp ✓x Φ(✓) Ø In a non-minimal or redundant exponential| family representation,{ − features} linearly dependent and multiple parameters give same distribution. Example: Ber(x ✓)=exp ✓ x + ✓ (1 x) Φ(✓ , ✓ ) | { 1 2 − − 1 2 } Multivariate Gaussian Distribution 1 1 1 T 1 (x µ, ⌃) = exp (x µ) ⌃− (x µ) N | (2⇡)N/2 ⌃ 1/2 −2 − − | | ⇢ Moment Parameterization: Mean & Covariance N 1 µ = E[x] R ⇥ 2 T T T N N ⌃=E[(x µ)(x µ) ]=E[xx ] µµ R ⇥ − − − 2 For simplicity, assume covariance positive definite & invertible. 1 ⇤=⌃− 1 Information Parameterization: Canonical Parameters # =⌃− µ 1 T 1 T 1 1 T 1 (x µ, ⌃) exp x ⌃− x + µ ⌃− x µ ⌃− µ N | / −2 − 2 ⇢ N 1 1 T T 1 2 − (x #, ⇤) exp x ⇤x + # x exp #sxs ⇤ssxs ⇤stxsxt / 8 − 2 − 2 39 N | / −2 "s=1 # s=t ⇢ < X X6 = 4 5 Recall general exponential family form: p(x: ✓)=exp ✓T φ(x) Φ(✓); | { − } CS242: Lecture 3A Outline Ø Exponential Families of Probability Distributions Ø Review: Maximum Likelihood (ML) Estimation Ø ML Estimation for Exponential Families 400 300 200 100 0 3 2 3 1 2 1 0 0 −1 −1 Terminology: Inference versus Learning • Inference: Given a model with known parameters, estimate the “hidden” variables for some data instance, or find their marginal distribution • Learning: Given multiple data instances, estimate parameters for a graphical model of their structure: Maximum Likelihood, Maximum a Posteriori, … Ø Training instances may be completely or partially observed Example: Expert systems for medical diagnosis • Inference: Given observed symptoms for a particular patient, infer probabilities that they have contracted various diseases • Learning: Given a database of many patient diagnoses, learn the relationships between diseases and symptoms Example: Markov random fields for semantic image segmentation • Inference: What object category is depicted at each pixel in some image? • Learning: Across a database of images, in which some example objects have been labeled, how do those labels relate to low-level image features? Frequentist Parameter Estimation (1) (2) (L) Ø Training data has L observations: x = x ,x ,...,x ✓ Ø Assume independent, identically distributed{ (i.i.d.) samples } from an unknown member of some family of distributions: d p(x ✓), ✓ x(`) R L Example: Data is Gaussian,| ✓ is unknown2 mean Ø Think of parameter vector ✓ as fixed, non-random constant Ø An estimator guesses (learns) parameters from data ˆ (1) (L) ˆ d ✓L(x)=f(x ,...,x ), ✓L(x) R 2 Example: ✓ is unknown mean, and ˆ 1 L (`) ✓L(x)= L `=1 x Ø GOAL: Estimators with good properties, uniformly across ✓ Ø Analyze estimators via expected performance on P hypothetical data (many trials of drawing L i.i.d. samples) Maximum Likelihood (ML) Estimation Ø Suppose I have L independent observations sampled from some unknown probability distribution: x = x(1),...,x(L) { } Ø Suppose I have two candidate parameter estimates where: p(x ✓1) >p(x ✓2) Given no other information,| choose the| higher likelihood model! Ø The maximum likelihood (ML) parameter estimate is defined as: L L ✓ˆ = arg max p(x(`) ✓) = arg max log p(x(`) ✓) ✓ | ✓ | `Y=1 X`=1 p(x ✓) is a probability mass function for discrete X, probability density function for continuous X. | Consistency of ML Estimators L L (`) (`) ✓ˆL = arg max p(x ✓) = arg max log p(x ✓) ✓ | ✓ | `Y=1 X`=1 Ø An estimator is consistent if the sequence of estimates ✓ˆL converges to the true parameter in probability, for any true ✓ lim P ( ✓ˆL ✓ ✏)=0 for any ✏ > 0 L || − || ≥ !1 Ø Under “mild conditions” that are true for most distributions, including a broad range of exponential family distributions, the ML parameter estimates are always consistent Ø The ML estimator also satisfies a central limit theorem, and for large L has provably small variance (“efficiency”) Finding ML Estimates Ø For many practical models (including exponential families), log-likelihood is a smooth & continuous function of parameters: L(✓)= L log p(x(`) ✓) `=1 | Maximum will occur atP a point where derivative equals zero! Ø The maximum likelihood (ML) parameter estimate is defined as: L L ✓ˆ = arg max p(x(`) ✓) = arg max log p(x(`) ✓) ✓ | ✓ | `Y=1 X`=1 p(x ✓) is a probability mass function for discrete X, probability density function for continuous X. | Example: Bernoulli Distribution Ø A Bernoulli or indicator random variable X has one parameter: x 1 x p(x ✓)=✓ (1 ✓) − x 0, 1 | − 2 { } Ø The maximum likelihood (ML) estimate maximizes: L L(✓)= x(`) log(✓)+(1 x(`)) log(1 ✓) − − X`=1 L 1 (`) ✓ˆ = x ML: Empirical fraction of successes! L X`=1 Example: Categorical Distribution Categorical Distribution: Single roll of a (possibly biased) die K K xk K Cat(x ✓)= ✓k = 0, 1 , x =1 | X { } k kY=1 k=1 XK K=3 ✓1 0 ✓k 1 ✓ =1 k kX=1 ✓k = E[xk] ✓2 probability ✓3 simplex A Constrained Optimization Lemma Objective: K ✓ˆ = arg max ak log ✓k ak 0 ✓ ≥ kX=1 K subject to ✓k =1 0 ✓k 1 K=3 ✓1 kX=1 Solution: K ak ✓ˆk = a0 = ak a0 k=1 ✓2 X Ø Proof for K=2: Change of variables to unconstrained problem ✓3 Ø Proof for general K: Lagrange multipliers Learning Categorical Distributions Categorical Distribution: Single roll of a (possibly biased) die K K xk K Cat(x ✓)= ✓k = 0, 1 , x =1 | X { } k kY=1 k=1 Ø If we have Nk observations

Load more