
CS242: Probabilistic Graphical Models
Lecture 3A: Exponential Families & Graphical Models
Professor Erik Sudderth, Brown University Computer Science
September 27, 2016
Some figures and materials courtesy David Barber, Bayesian Reasoning and Machine Learning
http://www.cs.ucl.ac.uk/staff/d.barber/brml/

CS242: Lecture 3A Outline
Ø Exponential Families of Probability Distributions
Ø Review: Maximum Likelihood (ML) Estimation
Ø ML Estimation for Exponential Families

Exponential Families of Distributions
p(x \mid \theta) = \frac{1}{Z(\theta)} \nu(x) \exp\{\theta^T \phi(x)\} = \nu(x) \exp\{\theta^T \phi(x) - \Phi(\theta)\}
Z(\theta) = \int_{\mathcal{X}} \nu(x) \exp\{\theta^T \phi(x)\}\, dx, \qquad \Phi(\theta) = \log Z(\theta)
Ø \phi(x) \in \mathbb{R}^d: fixed vector of sufficient statistics (features), specifying the family of distributions
Ø \theta \in \Theta \subseteq \mathbb{R}^d: unknown vector of natural parameters; \theta determines the particular distribution in this family
Ø Z(\theta) > 0: normalization constant or partition function (for discrete variables, the integral becomes a sum)
Ø \nu(x) > 0: reference measure, independent of the parameters (for many models, we simply have \nu(x) = 1)
Ø \Theta = \{\theta \in \mathbb{R}^d \mid Z(\theta) < \infty\} ensures we construct a valid distribution

Preview: Why the Exponential Family?
p(x \mid \theta) = \nu(x) \exp\{\theta^T \phi(x) - \Phi(\theta)\}, \qquad \Phi(\theta) = \log \int_{\mathcal{X}} \nu(x) \exp\{\theta^T \phi(x)\}\, dx, \qquad \Theta = \{\theta \in \mathbb{R}^d \mid \Phi(\theta) < \infty\}
\phi(x) \in \mathbb{R}^d: vector of sufficient statistics (features) defining the family
\theta \in \mathbb{R}^d: vector of natural parameters indexing particular distributions
• Allows simultaneous study of many popular probability distributions: Bernoulli (binary), Categorical, Poisson (counts), Exponential (positive), Gaussian (real), …
• Maximum likelihood (ML) learning is simple: moment matching of sufficient statistics
• Bayesian learning is simple: conjugate priors are available (Beta, Dirichlet, Gamma, Gaussian, Wishart, …)
• The maximum entropy interpretation: among all distributions with certain moments of interest, the exponential family is the most random (it makes the fewest assumptions)
• Parametric and predictive sufficiency: for arbitrarily large datasets, optimal learning is possible from a finite-dimensional set of statistics (streaming, big data)
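A minimal numerical sketch of these definitions (not from the slides): for a variable with finite support, the partition function can be computed by direct summation, and the density then follows from p(x | theta) = nu(x) exp{theta^T phi(x) - Phi(theta)}. The support, feature map, helper names, and parameter value below are illustrative choices.

import numpy as np

def log_partition(support, phi, theta, nu=lambda x: 1.0):
    """Phi(theta) = log sum_x nu(x) exp{theta^T phi(x)}, for a finite support."""
    scores = np.array([np.log(nu(x)) + theta @ phi(x) for x in support])
    return np.logaddexp.reduce(scores)      # log-sum-exp, numerically stable

def density(x, support, phi, theta, nu=lambda x: 1.0):
    """p(x | theta) = nu(x) exp{theta^T phi(x) - Phi(theta)}."""
    return nu(x) * np.exp(theta @ phi(x) - log_partition(support, phi, theta, nu))

# Bernoulli written in exponential family form: phi(x) = x, nu(x) = 1.
support = [0, 1]
phi = lambda x: np.array([float(x)])
theta = np.array([0.5])                     # natural parameter (the log-odds)
probs = [density(x, support, phi, theta) for x in support]
print(probs, sum(probs))                    # the two probabilities sum to 1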
Natural Logarithms and Exponentials
\log(x) = \ln(x): \mathbb{R}^+ \to \mathbb{R}
\exp(\log(x)) = x, \qquad \log(xyz) = \log(x) + \log(y) + \log(z), \qquad \log(x^a) = a \log(x)

Reminder: Logistic Functions
\sigma(z) = \frac{1}{1 + e^{-z}}
Simple derivative: \frac{\partial \sigma(z)}{\partial z} = \sigma(z)\,\sigma(-z)
Convexity: -\log \sigma(z) = \log(1 + e^{-z}) is a convex function of z
• Symmetry: \sigma(-z) = 1 - \sigma(z)
• Log-odds ratio: \log \frac{\sigma(z)}{1 - \sigma(z)} = z

Bernoulli Distribution: single toss of a (possibly biased) coin
\mathrm{Ber}(x \mid \mu) = \mu^x (1 - \mu)^{1 - x}, \qquad x \in \{0, 1\}
\mathbb{E}[x \mid \mu] = \mathbb{P}[x = 1] = \mu, \qquad 0 \le \mu \le 1
Exponential Family Form (derivation on board):
\mathrm{Ber}(x \mid \theta) = \exp\{\theta x - \Phi(\theta)\}, \qquad \theta \in \Theta = \mathbb{R}
Logistic Function: \mu = (1 + \exp(-\theta))^{-1}, \qquad \theta = \log(\mu) - \log(1 - \mu)
Log Normalization: \Phi(\theta) = \log(1 + e^{\theta}); the log-normalizer is convex
[Figures: the logistic mean function \mu(\theta) and the convex log-normalizer \Phi(\theta)]

Categorical Distribution: single roll of a (possibly biased) die
\mathrm{Cat}(x \mid \mu) = \prod_{k=1}^K \mu_k^{x_k}, \qquad \mathcal{X} = \{x \in \{0,1\}^K : \textstyle\sum_{k=1}^K x_k = 1\}
0 \le \mu_k \le 1, \qquad \sum_{k=1}^K \mu_k = 1, \qquad \mu_k = \mathbb{E}[x_k]
[Figure: the K = 3 probability simplex with vertices \mu_1, \mu_2, \mu_3]
Exponential Family Form:
\mathrm{Cat}(x \mid \theta) = \exp\{\textstyle\sum_{k=1}^K \theta_k x_k - \Phi(\theta)\}, \qquad \Phi(\theta) = \log\left(\textstyle\sum_{\ell=1}^K \exp(\theta_\ell)\right)
Ø A linear subspace of exponential family parameters gives the same probabilities, because the features are linearly dependent: \sum_k x_k = 1, so \mathrm{Cat}(x \mid \theta) = \mathrm{Cat}(x \mid \theta + c\mathbf{1}) for any scalar constant c.

Multinomial Logistic (softmax) Function
\mu_k(\theta) = \frac{\exp(\theta_k)}{\sum_{\ell=1}^K \exp(\theta_\ell)}, \qquad k = 1, \ldots, K
\theta_k \in \mathbb{R}, \qquad \mu_k(\theta) > 0, \qquad \sum_{k=1}^K \mu_k(\theta) = 1
[Figure: \mu_k(\theta / T) for \theta = (3, 0, 1); as the temperature T varies, the probabilities range from "uniform" to "max"]

Softmax versus Logistic Function
Assume K = 2 categories:
\mu_1(z) = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{z_2 - z_1}} = \sigma(z_1 - z_2)
\mu_2(z) = \frac{e^{z_2}}{e^{z_1} + e^{z_2}} = 1 - \mu_1(z) = \sigma(z_2 - z_1)
The binary logistic function is a special case of the softmax.

Minimal Exponential Families
Ø In a minimal exponential family representation, the features must be linearly independent.
Example: \mathrm{Ber}(x \mid \theta) = \exp\{\theta x - \Phi(\theta)\}
Ø In a non-minimal or redundant exponential family representation, the features are linearly dependent, and multiple parameter vectors give the same distribution.
Example: \mathrm{Ber}(x \mid \theta) = \exp\{\theta_1 x + \theta_2 (1 - x) - \Phi(\theta_1, \theta_2)\}
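A quick numerical check of the two facts above, under illustrative parameter values (the softmax and sigmoid helpers are my own names, not from the slides): adding a constant to every natural parameter leaves the categorical probabilities unchanged, and the K = 2 softmax reduces to the logistic function of the difference of the two parameters.

import numpy as np

def softmax(theta):
    """mu_k(theta) = exp(theta_k) / sum_l exp(theta_l), computed stably."""
    z = theta - np.max(theta)       # shifting is harmless: softmax(theta) = softmax(theta + c)
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([3.0, 0.0, 1.0])
print(softmax(theta))
print(softmax(theta + 5.0))                 # identical: the parameterization is redundant

z = np.array([2.0, -1.0])                   # K = 2 categories
print(softmax(z)[0], sigmoid(z[0] - z[1]))  # mu_1(z) = sigma(z_1 - z_2)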
Multivariate Gaussian Distribution
\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right\}
Moment Parameterization: mean & covariance
\mu = \mathbb{E}[x] \in \mathbb{R}^N
\Sigma = \mathbb{E}[(x - \mu)(x - \mu)^T] = \mathbb{E}[x x^T] - \mu\mu^T \in \mathbb{R}^{N \times N}
For simplicity, assume the covariance is positive definite & invertible.
Information Parameterization: canonical parameters \Lambda = \Sigma^{-1}, \qquad \vartheta = \Sigma^{-1}\mu
\mathcal{N}(x \mid \mu, \Sigma) \propto \exp\left\{-\frac{1}{2} x^T \Sigma^{-1} x + \mu^T \Sigma^{-1} x - \frac{1}{2}\mu^T \Sigma^{-1} \mu\right\}
\mathcal{N}^{-1}(x \mid \vartheta, \Lambda) \propto \exp\left\{-\frac{1}{2} x^T \Lambda x + \vartheta^T x\right\} = \exp\left\{\sum_{s=1}^N \left(\vartheta_s x_s - \frac{1}{2}\Lambda_{ss} x_s^2\right) - \frac{1}{2}\sum_{s \ne t} \Lambda_{st} x_s x_t\right\}
Recall the general exponential family form: p(x \mid \theta) = \exp\{\theta^T \phi(x) - \Phi(\theta)\}

CS242: Lecture 3A Outline
Ø Exponential Families of Probability Distributions
Ø Review: Maximum Likelihood (ML) Estimation
Ø ML Estimation for Exponential Families

Terminology: Inference versus Learning
• Inference: Given a model with known parameters, estimate the "hidden" variables for some data instance, or find their marginal distribution.
• Learning: Given multiple data instances, estimate parameters for a graphical model of their structure: Maximum Likelihood, Maximum a Posteriori, …
Ø Training instances may be completely or partially observed.
Example: Expert systems for medical diagnosis
• Inference: Given observed symptoms for a particular patient, infer the probabilities that they have contracted various diseases.
• Learning: Given a database of many patient diagnoses, learn the relationships between diseases and symptoms.
Example: Markov random fields for semantic image segmentation
• Inference: What object category is depicted at each pixel in some image?
• Learning: Across a database of images, in which some example objects have been labeled, how do those labels relate to low-level image features?

Frequentist Parameter Estimation
Ø Training data has L observations: x = \{x^{(1)}, x^{(2)}, \ldots, x^{(L)}\}
Ø Assume independent, identically distributed (i.i.d.) samples from an unknown member of some family of distributions: p(x \mid \theta), \theta \in \mathbb{R}^d. Example: the data is Gaussian, and \theta is the unknown mean.
Ø Think of the parameter vector \theta as a fixed, non-random constant.
Ø An estimator guesses (learns) the parameters from the data: \hat{\theta}_L(x) = f(x^{(1)}, \ldots, x^{(L)}), \hat{\theta}_L(x) \in \mathbb{R}^d. Example: \theta is the unknown mean, and \hat{\theta}_L(x) = \frac{1}{L}\sum_{\ell=1}^L x^{(\ell)}.
Ø GOAL: estimators with good properties, uniformly across \theta.
Ø Analyze estimators via their expected performance on hypothetical data (many trials of drawing L i.i.d. samples).

Maximum Likelihood (ML) Estimation
Ø Suppose I have L independent observations sampled from some unknown probability distribution: x = \{x^{(1)}, \ldots, x^{(L)}\}
Ø Suppose I have two candidate parameter estimates where p(x \mid \theta_1) > p(x \mid \theta_2). Given no other information, choose the higher likelihood model!
Ø The maximum likelihood (ML) parameter estimate is defined as:
\hat{\theta} = \arg\max_\theta \prod_{\ell=1}^L p(x^{(\ell)} \mid \theta) = \arg\max_\theta \sum_{\ell=1}^L \log p(x^{(\ell)} \mid \theta)
Here p(x \mid \theta) is a probability mass function for discrete X, and a probability density function for continuous X.

Consistency of ML Estimators
\hat{\theta}_L = \arg\max_\theta \prod_{\ell=1}^L p(x^{(\ell)} \mid \theta) = \arg\max_\theta \sum_{\ell=1}^L \log p(x^{(\ell)} \mid \theta)
Ø An estimator is consistent if the sequence of estimates \hat{\theta}_L converges to the true parameter in probability, for any true \theta:
\lim_{L \to \infty} P(\|\hat{\theta}_L - \theta\| \ge \epsilon) = 0 \quad \text{for any } \epsilon > 0
Ø Under "mild conditions" that are true for most distributions, including a broad range of exponential family distributions, the ML parameter estimates are always consistent.
Ø The ML estimator also satisfies a central limit theorem, and for large L has provably small variance ("efficiency").

Finding ML Estimates
Ø For many practical models (including exponential families), the log-likelihood is a smooth & continuous function of the parameters: \mathcal{L}(\theta) = \sum_{\ell=1}^L \log p(x^{(\ell)} \mid \theta)
The maximum will occur at a point where the derivative equals zero!
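Before the Bernoulli and categorical examples, here is a small numerical illustration of the "derivative equals zero" condition, using the unknown-mean Gaussian example from the frequentist estimation slide. The simulated dataset, seed, sample size, and unit-variance assumption are illustrative choices of mine, not from the lecture.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=500)     # L i.i.d. samples, unknown mean, unit variance

def log_likelihood(theta, x):
    """L(theta) = sum_l log N(x_l | theta, 1)."""
    return np.sum(-0.5 * np.log(2.0 * np.pi) - 0.5 * (x - theta) ** 2)

def score(theta, x):
    """dL/dtheta = sum_l (x_l - theta): zero exactly at the sample mean."""
    return np.sum(x - theta)

theta_hat = x.mean()                             # ML estimate of the mean
print(theta_hat, score(theta_hat, x))            # derivative is (numerically) zero at theta_hat

grid = np.linspace(0.0, 4.0, 401)
lls = np.array([log_likelihood(t, x) for t in grid])
print(grid[lls.argmax()])                        # grid search agrees with the sample mean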
Example: Bernoulli Distribution
Ø A Bernoulli or indicator random variable X has one parameter:
p(x \mid \theta) = \theta^x (1 - \theta)^{1 - x}, \qquad x \in \{0, 1\}
Ø The maximum likelihood (ML) estimate maximizes:
\mathcal{L}(\theta) = \sum_{\ell=1}^L \left[ x^{(\ell)} \log(\theta) + (1 - x^{(\ell)}) \log(1 - \theta) \right]
\hat{\theta} = \frac{1}{L} \sum_{\ell=1}^L x^{(\ell)}
The ML estimate is the empirical fraction of successes!

Example: Categorical Distribution
Categorical Distribution: single roll of a (possibly biased) die
\mathrm{Cat}(x \mid \theta) = \prod_{k=1}^K \theta_k^{x_k}, \qquad \mathcal{X} = \{x \in \{0,1\}^K : \textstyle\sum_{k=1}^K x_k = 1\}
0 \le \theta_k \le 1, \qquad \sum_{k=1}^K \theta_k = 1, \qquad \theta_k = \mathbb{E}[x_k]
[Figure: the K = 3 probability simplex]

A Constrained Optimization Lemma
Objective: \hat{\theta} = \arg\max_\theta \sum_{k=1}^K a_k \log \theta_k, \qquad a_k \ge 0
subject to \sum_{k=1}^K \theta_k = 1, \qquad 0 \le \theta_k \le 1
Solution: \hat{\theta}_k = \frac{a_k}{a_0}, \qquad a_0 = \sum_{k=1}^K a_k
Ø Proof for K = 2: change of variables to an unconstrained problem
Ø Proof for general K: Lagrange multipliers

Learning Categorical Distributions
\mathrm{Cat}(x \mid \theta) = \prod_{k=1}^K \theta_k^{x_k}
Ø If we have N_k observations of outcome k, the log-likelihood is \sum_{k=1}^K N_k \log \theta_k, so by the lemma the ML estimate is \hat{\theta}_k = N_k / \sum_{j=1}^K N_j: the empirical fraction of each outcome (a numerical check follows below).
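A sketch of that check, with an assumed "true" die, seed, and sample size of my own choosing: draw rolls, form the counts N_k, apply the lemma with a_k = N_k, and confirm that the resulting estimate scores at least as well as random points on the probability simplex.

import numpy as np

rng = np.random.default_rng(1)
theta_true = np.array([0.5, 0.3, 0.2])              # illustrative die probabilities
rolls = rng.choice(3, size=1000, p=theta_true)      # observed outcome indices in {0, 1, 2}

counts = np.bincount(rolls, minlength=3)            # N_k: number of observations of outcome k
theta_hat = counts / counts.sum()                   # lemma with a_k = N_k: empirical fractions
print(theta_hat)

# Sanity check: theta_hat should score at least as well as random points on the simplex
# under the categorical log-likelihood sum_k N_k log(theta_k).
def objective(theta):
    return np.sum(counts * np.log(theta))

candidates = rng.dirichlet(np.ones(3), size=2000)   # random points on the probability simplex
best_random = np.max(np.sum(counts * np.log(candidates), axis=1))
print(objective(theta_hat) >= best_random)          # expected: True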