CS242: Probabilistic Graphical Models Lecture 3A: Exponential Families & Graphical Models

Professor Erik Sudderth Brown University Computer Science September 27, 2016

Some figures and materials courtesy David Barber, Bayesian Reasoning and Machine Learning, http://www.cs.ucl.ac.uk/staff/d.barber/brml/

CS242: Lecture 3A Outline
➢ Exponential Families of Probability Distributions
➢ Review: Maximum Likelihood (ML) Estimation
➢ ML Estimation for Exponential Families

Exponential Families of Distributions

$$p(x \mid \theta) = \frac{1}{Z(\theta)}\, \nu(x) \exp\{\theta^T \phi(x)\} = \nu(x) \exp\{\theta^T \phi(x) - \Phi(\theta)\}$$

$$Z(\theta) = \int_{\mathcal{X}} \nu(x) \exp\{\theta^T \phi(x)\}\, dx, \qquad \Phi(\theta) = \log Z(\theta)$$

• $\phi(x) \in \mathbb{R}^d$: fixed vector of sufficient statistics (features), specifying the family of distributions
• $\theta \in \Theta \subseteq \mathbb{R}^d$: unknown vector of natural parameters, determining a particular distribution in this family
• $Z(\theta) > 0$: normalization constant or partition function (for discrete variables, the integral becomes a sum)
• $\nu(x) > 0$: reference measure independent of parameters (for many models, we simply have $\nu(x) = 1$)
• $\Theta = \{\theta \in \mathbb{R}^d \mid Z(\theta) < \infty\}$ ensures we construct a valid distribution

Preview: Why the Exponential Family?

• Allows simultaneous study of many popular probability distributions: Bernoulli (binary), Categorical, Poisson (counts), Exponential (positive), Gaussian (real), …
• Maximum likelihood (ML) learning is simple: moment matching of sufficient statistics
• Bayesian learning is simple: conjugate priors are available (Beta, Dirichlet, Gamma, Gaussian, Wishart, …)
• The maximum entropy interpretation: Among all distributions with certain moments of interest, the exponential family is the most random (makes fewest assumptions)
• Parametric and predictive sufficiency: For arbitrarily large datasets, optimal learning is possible from a finite-dimensional set of statistics (streaming, big data)

Natural Logarithms and Exponentials

$$\ln(x) = \log(x) : \mathbb{R}^+ \to \mathbb{R}$$

$$\exp(\log(x)) = x, \qquad \log(xyz) = \log(x) + \log(y) + \log(z), \qquad \log(x^a) = a \log(x)$$

Reminder: Logistic Functions

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

• Simple derivatives: $\frac{\partial \sigma(z)}{\partial z} = \sigma(z)\,\sigma(-z)$
• Convexity: $-\log \sigma(z) = \log(1 + e^{-z})$ is a convex function of $z$
• Symmetry: $\sigma(-z) = 1 - \sigma(z)$
• Log-odds ratio: $\log \frac{\sigma(z)}{1 - \sigma(z)} = z$

Bernoulli Distribution: Single toss of a (possibly biased) coin

$$\mathrm{Ber}(x \mid \mu) = \mu^x (1 - \mu)^{1 - x}, \qquad x \in \{0, 1\}$$

$$E[x \mid \mu] = P[x = 1] = \mu, \qquad 0 \le \mu \le 1$$

Exponential Family Form (derivation on board):

$$\mathrm{Ber}(x \mid \theta) = \exp\{\theta x - \Phi(\theta)\}, \qquad \theta \in \Theta = \mathbb{R}$$

• Logistic function: $\mu = (1 + \exp(-\theta))^{-1}$, equivalently $\theta = \log(\mu) - \log(1 - \mu)$
• Log normalization: $\Phi(\theta) = \log(1 + e^{\theta})$
• The log-normalizer is convex

[Figure: the logistic function $\mu(\theta)$ and the log-normalizer $\Phi(\theta)$, plotted for $\theta \in [-10, 10]$.]
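The Bernoulli exponential family form above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names (`sigmoid`, `logit`, `log_normalizer`, `bernoulli_pmf`) and the value `mu = 0.3` are illustrative choices, not part of the lecture.

```python
import math

def sigmoid(theta):
    """Logistic function mu = 1 / (1 + exp(-theta)), written to avoid overflow."""
    if theta >= 0:
        return 1.0 / (1.0 + math.exp(-theta))
    e = math.exp(theta)
    return e / (1.0 + e)

def logit(mu):
    """Natural parameter theta = log(mu) - log(1 - mu)."""
    return math.log(mu) - math.log(1.0 - mu)

def log_normalizer(theta):
    """Phi(theta) = log(1 + exp(theta)), written to avoid overflow."""
    return max(theta, 0.0) + math.log1p(math.exp(-abs(theta)))

def bernoulli_pmf(x, theta):
    """Ber(x | theta) = exp{theta * x - Phi(theta)} for x in {0, 1}."""
    return math.exp(theta * x - log_normalizer(theta))

mu = 0.3                      # hypothetical coin bias
theta = logit(mu)             # map the mean parameter to the natural parameter
p0 = bernoulli_pmf(0, theta)  # should recover 1 - mu
p1 = bernoulli_pmf(1, theta)  # should recover mu
print(theta, p0, p1)
```

Mapping back with `sigmoid(theta)` should return the original `mu`, which is the sense in which the logistic function and the log-odds are inverses.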

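Categorical Distribution: Single roll of a (possibly biased) die

$$\mathrm{Cat}(x \mid \mu) = \prod_{k=1}^{K} \mu_k^{x_k}, \qquad \mathcal{X} = \{0, 1\}^K, \quad \sum_{k=1}^{K} x_k = 1$$

$$0 \le \mu_k \le 1, \qquad \sum_{k=1}^{K} \mu_k = 1, \qquad \mu_k = E[x_k]$$

[Figure: the probability simplex for K=3, with axes $\mu_1$, $\mu_2$, $\mu_3$.]

Exponential Family Form:

$$\mathrm{Cat}(x \mid \theta) = \exp\Big\{\sum_{k=1}^{K} \theta_k x_k - \Phi(\theta)\Big\}, \qquad \Phi(\theta) = \log\Big(\sum_{\ell=1}^{K} \exp(\theta_\ell)\Big)$$

➢ A linear subspace of exponential family parameters gives the same probabilities, because the features are linearly dependent ($\sum_k x_k = 1$): $\mathrm{Cat}(x \mid \theta) = \mathrm{Cat}(x \mid \theta + c)$ for any scalar constant $c$

Multinomial Logistic (softmax) Function

$$\mu_k(\theta) = \frac{\exp(\theta_k)}{\sum_{\ell=1}^{K} \exp(\theta_\ell)}, \quad k = 1, \ldots, K, \qquad \theta_k \in \mathbb{R}, \quad \mu_k(\theta) > 0, \quad \sum_{k=1}^{K} \mu_k(\theta) = 1$$

[Figure: $\mu_k(\theta / T)$ for $\theta = (3, 0, 1)$, with panels ranging from “uniform” probabilities to “max”.]

Softmax versus Logistic Function

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Assume K=2 categories:

$$\mu_1(z) = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{z_2 - z_1}} = \sigma(z_1 - z_2)$$

$$\mu_2(z) = \frac{e^{z_2}}{e^{z_1} + e^{z_2}} = 1 - \mu_1(z) = \sigma(z_2 - z_1)$$

Binary logistic function is a special case.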
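The softmax properties above (normalization, invariance to adding a constant, and the K=2 reduction to the logistic function) can be sketched in a few lines. This is an illustrative standard-library example, not the lecture's code; the `temperature` argument is an assumed name for the $T$ in $\mu_k(\theta / T)$, and subtracting the maximum before exponentiating is a common stability trick that relies on the same shift invariance.

```python
import math

def softmax(theta, temperature=1.0):
    """Return mu_k = exp(theta_k / T) / sum_l exp(theta_l / T)."""
    scaled = [t / temperature for t in theta]
    shift = max(scaled)                      # adding a constant leaves mu unchanged
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Binary logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

theta = [3.0, 0.0, 1.0]
mu = softmax(theta)
mu_shifted = softmax([t + 5.0 for t in theta])  # same probabilities as mu
mu_hot = softmax(theta, temperature=100.0)      # close to uniform
mu_cold = softmax(theta, temperature=0.01)      # close to an indicator of the max
two_class = softmax([0.7, -0.4])                # K = 2 case
print(mu, mu_hot, mu_cold, two_class[0], sigmoid(0.7 - (-0.4)))
```

The last two printed values should agree, since for two categories $\mu_1(z) = \sigma(z_1 - z_2)$.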

Minimal Exponential Families

➢ In a minimal exponential family representation, the features must be linearly independent. Example: $\mathrm{Ber}(x \mid \theta) = \exp\{\theta x - \Phi(\theta)\}$
➢ In a non-minimal or redundant exponential family representation, the features are linearly dependent and multiple parameters give the same distribution. Example: $\mathrm{Ber}(x \mid \theta) = \exp\{\theta_1 x + \theta_2 (1 - x) - \Phi(\theta_1, \theta_2)\}$

Multivariate Gaussian Distribution

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} \exp\Big\{-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\Big\}$$

Moment Parameterization: Mean & Covariance

$$\mu = E[x] \in \mathbb{R}^{N \times 1}, \qquad \Sigma = E[(x - \mu)(x - \mu)^T] = E[x x^T] - \mu \mu^T \in \mathbb{R}^{N \times N}$$

For simplicity, assume covariance positive definite & invertible.

Information Parameterization: Canonical Parameters

$$\Lambda = \Sigma^{-1}, \qquad \vartheta = \Sigma^{-1} \mu$$

$$\mathcal{N}(x \mid \mu, \Sigma) \propto \exp\Big\{-\frac{1}{2} x^T \Sigma^{-1} x + \mu^T \Sigma^{-1} x - \frac{1}{2} \mu^T \Sigma^{-1} \mu\Big\}$$

$$\mathcal{N}(x \mid \vartheta, \Lambda) \propto \exp\Big\{-\frac{1}{2} x^T \Lambda x + \vartheta^T x\Big\} \propto \exp\Big\{\sum_{s=1}^{N} \Big[\vartheta_s x_s - \frac{1}{2} \Lambda_{ss} x_s^2\Big] - \frac{1}{2} \sum_{s \ne t} \Lambda_{st} x_s x_t\Big\}$$

Recall general exponential family form: $p(x \mid \theta) = \exp\{\theta^T \phi(x) - \Phi(\theta)\}$
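To make the two Gaussian parameterizations concrete, here is a small sketch that converts between moment parameters $(\mu, \Sigma)$ and information parameters $(\vartheta, \Lambda)$ for N=2. It is an illustration under stated assumptions: it uses plain Python lists and a hand-written 2x2 inverse so that it needs no external library, and the function names and numeric values are hypothetical.

```python
def inverse_2x2(m):
    """Invert a 2x2 matrix given as nested lists; requires a nonzero determinant."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0.0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec_2x2(m, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def moment_to_information(mu, sigma):
    """Lambda = Sigma^{-1}, vartheta = Sigma^{-1} mu."""
    lam = inverse_2x2(sigma)
    return matvec_2x2(lam, mu), lam

def information_to_moment(vartheta, lam):
    """Sigma = Lambda^{-1}, mu = Lambda^{-1} vartheta."""
    sigma = inverse_2x2(lam)
    return matvec_2x2(sigma, vartheta), sigma

mu = [1.0, -2.0]                      # hypothetical mean
sigma = [[2.0, 0.5], [0.5, 1.0]]      # hypothetical positive definite covariance
vartheta, lam = moment_to_information(mu, sigma)
mu_back, sigma_back = information_to_moment(vartheta, lam)
print(vartheta, lam)
print(mu_back, sigma_back)
```

The round trip should reproduce the original mean and covariance up to floating-point error. For larger N, a linear algebra library with a Cholesky-based solve would be the more usual choice than an explicit inverse.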

Terminology: Inference versus Learning

• Inference: Given a model with known parameters, estimate the “hidden” variables for some data instance, or find their …
• Learning: Given multiple data instances, estimate parameters for a graphical model of their structure: Maximum Likelihood, Maximum a Posteriori, …
  ➢ Training instances may be completely or partially observed

Example: Expert systems for medical diagnosis
• Inference: Given observed symptoms for a particular patient, infer probabilities that they have contracted various diseases
• Learning: Given a database of many patient diagnoses, learn the relationships between diseases and symptoms

Example: Markov random fields for semantic image segmentation
• Inference: What object category is depicted at each pixel in some image?
• Learning: Across a database of images, in which some example objects have been labeled, how do those labels relate to low-level image features?

Frequentist Parameter Estimation

➢ Training data has L observations: $x = \{x^{(1)}, x^{(2)}, \ldots, x^{(L)}\}$
➢ Assume independent, identically distributed (i.i.d.) samples from an unknown member of some family of distributions: $p(x \mid \theta)$, $\theta \in \mathbb{R}^d$. Example: Data is Gaussian, $\theta$ is unknown mean
➢ Think of parameter vector $\theta$ as a fixed, non-random constant
➢ An estimator guesses (learns) parameters from data: $\hat{\theta}_L(x) = f(x^{(1)}, \ldots, x^{(L)})$, $\hat{\theta}_L(x) \in \mathbb{R}^d$. Example: $\theta$ is unknown mean, and $\hat{\theta}_L(x) = \frac{1}{L} \sum_{\ell=1}^{L} x^{(\ell)}$
➢ GOAL: Estimators with good properties, uniformly across $\theta$
➢ Analyze estimators via expected performance on hypothetical data (many trials of drawing L i.i.d. samples)

Maximum Likelihood (ML) Estimation

➢ Suppose I have L independent observations sampled from some unknown distribution: $x = \{x^{(1)}, \ldots, x^{(L)}\}$
➢ Suppose I have two candidate parameter estimates where $p(x \mid \theta_1) > p(x \mid \theta_2)$. Given no other information, choose the higher likelihood model!
➢ The maximum likelihood (ML) parameter estimate is defined as:

$$\hat{\theta} = \arg\max_{\theta} \prod_{\ell=1}^{L} p(x^{(\ell)} \mid \theta) = \arg\max_{\theta} \sum_{\ell=1}^{L} \log p(x^{(\ell)} \mid \theta)$$

Here $p(x \mid \theta)$ is a probability mass function for discrete $X$, and a probability density function for continuous $X$.

Consistency of ML Estimators

$$\hat{\theta}_L = \arg\max_{\theta} \prod_{\ell=1}^{L} p(x^{(\ell)} \mid \theta) = \arg\max_{\theta} \sum_{\ell=1}^{L} \log p(x^{(\ell)} \mid \theta)$$

➢ An estimator is consistent if the sequence of estimates $\hat{\theta}_L$ converges to the true parameter in probability, for any true $\theta$: $\lim_{L \to \infty} P(\|\hat{\theta}_L - \theta\| \ge \epsilon) = 0$ for any $\epsilon > 0$
➢ Under “mild conditions” that are true for most distributions, including a broad range of exponential family distributions, the ML parameter estimates are consistent
➢ The ML estimator also satisfies a central limit theorem, and for large L has provably small variance (“efficiency”)

Finding ML Estimates

➢ For many practical models (including exponential families), the log-likelihood is a smooth & continuous function of parameters: $\mathcal{L}(\theta) = \sum_{\ell=1}^{L} \log p(x^{(\ell)} \mid \theta)$. A maximum in the interior of the parameter space will occur at a point where the derivative equals zero!

Example: Bernoulli Distribution

➢ A Bernoulli or indicator random variable $X$ has one parameter: $p(x \mid \theta) = \theta^x (1 - \theta)^{1 - x}$, $x \in \{0, 1\}$
➢ The maximum likelihood (ML) estimate maximizes: $\mathcal{L}(\theta) = \sum_{\ell=1}^{L} \big[ x^{(\ell)} \log(\theta) + (1 - x^{(\ell)}) \log(1 - \theta) \big]$
➢ Setting the derivative $\sum_{\ell} \big[ x^{(\ell)} / \theta - (1 - x^{(\ell)}) / (1 - \theta) \big]$ to zero gives $\hat{\theta} = \frac{1}{L} \sum_{\ell=1}^{L} x^{(\ell)}$. ML: Empirical fraction of successes!

Example: Categorical Distribution

Categorical Distribution: Single roll of a (possibly biased) die

$$\mathrm{Cat}(x \mid \theta) = \prod_{k=1}^{K} \theta_k^{x_k}, \qquad \mathcal{X} = \{0, 1\}^K, \quad \sum_{k=1}^{K} x_k = 1$$

$$0 \le \theta_k \le 1, \qquad \sum_{k=1}^{K} \theta_k = 1, \qquad \theta_k = E[x_k]$$

[Figure: the probability simplex for K=3, with axes $\theta_1$, $\theta_2$, $\theta_3$.]

A Constrained Optimization Lemma

Objective:

$$\hat{\theta} = \arg\max_{\theta} \sum_{k=1}^{K} a_k \log \theta_k, \quad a_k \ge 0, \qquad \text{subject to } \sum_{k=1}^{K} \theta_k = 1, \quad 0 \le \theta_k \le 1$$

Solution:

$$\hat{\theta}_k = \frac{a_k}{a_0}, \qquad a_0 = \sum_{k=1}^{K} a_k$$

➢ Proof for K=2: Change of variables to unconstrained problem
➢ Proof for general K: Lagrange multipliers

Learning Categorical Distributions

➢ If we have $N_k$ observations of outcome $k$ in L trials:

$$p(x^{(1)}, \ldots, x^{(L)} \mid \theta) = \prod_{k=1}^{K} \theta_k^{N_k}, \qquad N_k = \sum_{\ell=1}^{L} x_k^{(\ell)}, \qquad \log p(x^{(1)}, \ldots, x^{(L)} \mid \theta) = \sum_{k=1}^{K} N_k \log \theta_k$$

➢ From the previous optimization lemma: $\hat{\theta} = \arg\max_{\theta} \log p(x \mid \theta)$ gives $\hat{\theta}_k = \frac{N_k}{L}$
➢ Will this produce sensible predictions when K is large or the number of observations L is small?
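The count-based estimate $\hat{\theta}_k = N_k / L$ for categorical data can be sketched as follows. This is an illustrative example with made-up die rolls, using only the standard library; the function names are hypothetical. It also shows the concern raised in the last bullet: any category with zero observed count receives probability exactly zero.

```python
import math
from collections import Counter

def categorical_mle(outcomes, num_categories):
    """ML estimate theta_k = N_k / L from a list of outcomes in {0, ..., K-1}."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return [counts.get(k, 0) / total for k in range(num_categories)]

def log_likelihood(theta, outcomes):
    """Sum of log theta_k over the observed outcomes (requires theta_k > 0 for observed k)."""
    return sum(math.log(theta[k]) for k in outcomes)

rolls = [0, 2, 2, 1, 2, 0, 2, 1, 2, 2]           # hypothetical data, L = 10
theta_hat = categorical_mle(rolls, num_categories=3)
uniform = [1.0 / 3.0] * 3                         # a competing parameter setting

ll_hat = log_likelihood(theta_hat, rolls)
ll_uniform = log_likelihood(uniform, rolls)       # lower than ll_hat

# With a fourth category that never appears, the ML estimate assigns it zero mass.
theta_hat_4 = categorical_mle(rolls, num_categories=4)
print(theta_hat, ll_hat, ll_uniform, theta_hat_4)
```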

Global and Local Optima

[Figure: surface plots marking the global maximum, a local maximum, a local minimum, and the global minimum; one example has max = 8.041 at x = −0.064, y = 1.531 and min = −6.532 at x = 0.192, y = −1.653.]

➢ A globally convergent algorithm is guaranteed to converge to a local optimum from any initialization
➢ Convergence to a global optimum is another issue…

Reminder: Gradients & Hessians

$$f : \mathbb{R}^d \to \mathbb{R}$$

Gradient Vector:

$$\nabla f : \mathbb{R}^d \to \mathbb{R}^d, \qquad (\nabla f(\theta))_k = \frac{\partial f(\theta)}{\partial \theta_k}$$

Hessian Matrix:

$$\nabla^2 f : \mathbb{R}^d \to \mathbb{R}^{d \times d}, \qquad (\nabla^2 f(\theta))_{kk} = \frac{\partial^2 f(\theta)}{\partial \theta_k^2}, \qquad (\nabla^2 f(\theta))_{k\ell} = \frac{\partial^2 f(\theta)}{\partial \theta_k \partial \theta_\ell}$$

➢ Hessian is symmetric whenever the second derivatives are continuous
➢ For linear functions, Hessian is zero
➢ For quadratic functions, Hessian is constant over the input space
➢ Function f is convex iff Hessian is positive semidefinite everywhere

Gradient (Steepest) Descent

$$\theta_{k+1} = \theta_k - \eta_k \nabla_\theta f(\theta_k)$$

[Figure: gradient descent trajectories with step size $\eta_k = 0.1$ (left) and $\eta_k = 0.6$ (right).]

Descent via Line Search

[Figure: descent trajectory with exact line searching.]

With appropriate methods for step size selection, guaranteed to find a local optimum of objective

Gradient-Based Optimization Software

ML Estimate: $\hat{\theta} = \arg\min_{\theta} -\mathcal{L}(\theta)$

You provide a function that, for any parameters, returns:
➢ Objective function value: Scalar, smaller is better.
➢ Gradient vector: Same dimension as parameter vector.
➢ Sometimes also the Hessian (matrix of second derivatives)

Optimization package does the following:
➢ Step sizes: Guarantees descent, tries to move quickly.
➢ Descent direction: Quasi-Newton (conjugate gradient, L-BFGS) methods use sequence of gradients to approximate curvature & converge faster
➢ Convergence: Stop when changes fall below some tolerance.
➢ Robustness: Care to avoid numerical underflow problems.
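The division of labor described above (you supply an objective and its gradient, the optimizer picks step sizes) can be illustrated with a tiny hand-written optimizer. This is a sketch, not the software used in the course: it minimizes the Bernoulli negative log-likelihood in its natural parameter using gradient descent with a backtracking (Armijo) line search, and every name, default value, and data point in it is an assumption made for illustration.

```python
import math

def bernoulli_nll(theta, data):
    """Return (objective, gradient) of the negative log-likelihood at scalar theta."""
    n = len(data)
    successes = sum(data)
    log_norm = max(theta, 0.0) + math.log1p(math.exp(-abs(theta)))  # log(1 + e^theta)
    if theta >= 0:
        mu = 1.0 / (1.0 + math.exp(-theta))
    else:
        mu = math.exp(theta) / (1.0 + math.exp(theta))
    value = -(theta * successes - n * log_norm)
    grad = -(successes - n * mu)
    return value, grad

def gradient_descent(objective, theta0, step0=1.0, shrink=0.5, c=1e-4,
                     tol=1e-6, max_iter=1000):
    """Gradient descent with backtracking line search on a scalar parameter."""
    theta = theta0
    for _ in range(max_iter):
        value, grad = objective(theta)
        if abs(grad) < tol:          # stop when the gradient is nearly zero
            break
        step = step0
        # Shrink the step until the objective decreases sufficiently.
        while objective(theta - step * grad)[0] > value - c * step * grad * grad:
            step *= shrink
        theta -= step * grad
    return theta

data = [1, 0, 1, 1, 0, 1, 1, 1]      # hypothetical coin tosses, 6 successes in 8
theta_hat = gradient_descent(lambda t: bernoulli_nll(t, data), theta0=0.0)
mu_hat = 1.0 / (1.0 + math.exp(-theta_hat))
print(theta_hat, mu_hat)
```

Because this objective is convex, the iterate should approach the log-odds of the empirical success fraction regardless of the starting point; a library quasi-Newton routine would typically take fewer iterations.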

Convexity

Convex Sets and Convex Functions

[Figure: convex sets and convex functions, with a non-convex example.]

➢ A function f is convex if $f(\lambda \theta + (1 - \lambda) \theta') \le \lambda f(\theta) + (1 - \lambda) f(\theta')$ for all $\theta$, $\theta'$ and $0 \le \lambda \le 1$
➢ Equivalently, the set of points above the curve, or epigraph, is a convex set
➢ The negative of a convex function is said to be concave

Convex Functions

[Figure: a convex surface with points $\theta$ and $\theta'$ marked.]

Properties of Convex Objective Functions:
➢ From any initialization, gradient descent finds global optimum
  • Ignoring possible numerical precision issues
  • Can be very slow (take huge number of iterations) when initialization is poor and/or gradient method is naïve
➢ A sum of convex functions remains convex. Linear functions are convex.
➢ Coming soon: Exponential families have convex negative log-posteriors!

Log Partition Function (Normalizer)

$$p(x \mid \theta) = \nu(x) \exp\{\theta^T \phi(x) - \Phi(\theta)\}, \qquad \Phi(\theta) = \log \int_{\mathcal{X}} \nu(x) \exp\{\theta^T \phi(x)\}\, dx, \qquad \Theta = \{\theta \in \mathbb{R}^d \mid \Phi(\theta) < \infty\}$$

• Derivatives of the log partition function have an intuitive form (excerpt from Sec. 2.1, Exponential Families):

Proposition 2.1.1. The log partition function $\Phi(\theta)$ of eq. (2.2) is convex (strictly so for minimal representations) and continuously differentiable over its domain $\Theta$. Its derivatives are the cumulants of the sufficient statistics $\{\phi_a \mid a \in \mathcal{A}\}$, so that

$$\frac{\partial \Phi(\theta)}{\partial \theta_a} = E_\theta[\phi_a(x)] \triangleq \int_{\mathcal{X}} \phi_a(x)\, p(x \mid \theta)\, dx \qquad (2.4)$$

$$\frac{\partial^2 \Phi(\theta)}{\partial \theta_a \partial \theta_b} = E_\theta[\phi_a(x)\, \phi_b(x)] - E_\theta[\phi_a(x)]\, E_\theta[\phi_b(x)] \qquad (2.5)$$

Proof. For a detailed proof of this classic result, see [15, 36, 311]. The cumulant generating properties follow from the chain rule and algebraic manipulation. From eq. (2.5), $\nabla^2 \Phi(\theta)$ is a positive semi-definite covariance matrix, implying convexity of $\Phi(\theta)$. For minimal families, $\nabla^2 \Phi(\theta)$ must be positive definite, guaranteeing strict convexity.

• Proof: Chain rule of derivatives and algebraic manipulation (board)

Due to this result, the log partition function is also known as the cumulant generating function of the exponential family. The convexity of $\Phi(\theta)$ has important implications for the geometry of exponential families [6, 15, 36, 74].
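The cumulant property in Proposition 2.1.1 can be checked numerically for the categorical family, where $\Phi(\theta) = \log \sum_\ell \exp(\theta_\ell)$ and the sufficient statistics are one-hot indicators. The sketch below compares finite-difference derivatives of $\Phi$ with the closed-form mean and covariance. It is only an illustration: the function names, the test point, and the finite-difference step sizes are assumptions, and the one-hot covariance $\mathrm{diag}(\mu) - \mu \mu^T$ is a standard identity rather than something stated in the slides.

```python
import math

def log_partition(theta):
    """Phi(theta) = log sum_l exp(theta_l), computed with a max shift for stability."""
    m = max(theta)
    return m + math.log(sum(math.exp(t - m) for t in theta))

def mean_parameters(theta):
    """E_theta[phi(x)] for one-hot features: the softmax probabilities."""
    phi = log_partition(theta)
    return [math.exp(t - phi) for t in theta]

def covariance(theta):
    """Cov_theta[phi(x)] for one-hot features: diag(mu) - mu mu^T."""
    mu = mean_parameters(theta)
    k = len(mu)
    return [[(mu[a] if a == b else 0.0) - mu[a] * mu[b] for b in range(k)]
            for a in range(k)]

def numerical_gradient(f, theta, eps=1e-5):
    """Central finite-difference approximation of the gradient of f."""
    grad = []
    for a in range(len(theta)):
        up, down = list(theta), list(theta)
        up[a] += eps
        down[a] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad

def numerical_hessian(f, theta, eps=1e-4):
    """Central finite-difference approximation of the Hessian of f."""
    k = len(theta)
    hess = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(k):
            def shifted(da, db):
                t = list(theta)
                t[a] += da
                t[b] += db
                return f(t)
            hess[a][b] = (shifted(eps, eps) - shifted(eps, -eps)
                          - shifted(-eps, eps) + shifted(-eps, -eps)) / (4.0 * eps * eps)
    return hess

theta = [3.0, 0.0, 1.0]
grad_numeric = numerical_gradient(log_partition, theta)
grad_exact = mean_parameters(theta)
hess_numeric = numerical_hessian(log_partition, theta)
hess_exact = covariance(theta)
print(grad_numeric, grad_exact)
```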

Entropy, Information, and Divergence

Concepts from information theory play a central role in the study of learning and inference in exponential families. Given a probability distribution $p(x)$ defined on a discrete space $\mathcal{X}$, Shannon’s measure of entropy (in natural units, or nats) equals

$$H(p) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) \qquad (2.6)$$

In such diverse fields as communications, signal processing, and statistical physics, entropy arises as a natural measure of the inherent uncertainty in a random variable [49]. The differential entropy extends this definition to continuous spaces:

$$H(p) = -\int_{\mathcal{X}} p(x) \log p(x)\, dx \qquad (2.7)$$

In both discrete and continuous domains, the (differential) entropy $H(p)$ is concave, continuous, and maximal for uniform densities. However, while the discrete entropy is guaranteed to be non-negative, differential entropy is sometimes less than zero.

For problems of model selection and approximation, we need a measure of the distance between probability distributions. The relative entropy or Kullback-Leibler (KL) divergence between two probability distributions $p(x)$ and $q(x)$ equals

$$D(p \,\|\, q) = \int_{\mathcal{X}} p(x) \log \frac{p(x)}{q(x)}\, dx \qquad (2.8)$$

Important properties of the KL divergence follow from Jensen’s inequality [49], which bounds the expectation of convex functions:

$$E[f(x)] \ge f(E[x]) \quad \text{for any convex } f : \mathcal{X} \to \mathbb{R} \qquad (2.9)$$

Log Partition Function (Normalizer)

• Derivatives of the log partition function have an intuitive form:

$$\nabla_\theta \Phi(\theta) = E_\theta[\phi(x)], \qquad \nabla_\theta^2 \Phi(\theta) = \mathrm{Cov}_\theta[\phi(x)] = E_\theta[\phi(x) \phi(x)^T] - E_\theta[\phi(x)]\, E_\theta[\phi(x)]^T$$

• Important consequences for learning with exponential families:
  ➢ Finding gradients is equivalent to finding expected sufficient statistics, or moments, of some current model. This is an inference problem!
  ➢ The Hessian is positive semidefinite (positive definite for minimal families), so the log partition function $\Phi(\theta)$ is convex
  ➢ This in turn implies that the parameter space $\Theta$ is convex
  ➢ Learning is a convex problem: No local optima! At least when we have complete observations…

ML Estimation for Exponential Families

$$\log p(x^{(\ell)} \mid \theta) = \log \nu(x^{(\ell)}) + \theta^T \phi(x^{(\ell)}) - \Phi(\theta)$$

• Given L observations, the log-likelihood equals:

$$\mathcal{L}(\theta) = C + \Big[\sum_{\ell=1}^{L} \theta^T \phi(x^{(\ell)})\Big] - L\, \Phi(\theta), \qquad C = \sum_{\ell=1}^{L} \log \nu(x^{(\ell)})$$

• Note that the negative log-likelihood function is convex!
• Gradients of the log-likelihood function have a simple form:

$$\nabla \mathcal{L}(\theta) = \Big[\sum_{\ell=1}^{L} \phi(x^{(\ell)})\Big] - L\, E_\theta[\phi(x)]$$

• At the unique global optimum, the gradient is 0:

$$E_{\hat{\theta}}[\phi(x)] = \frac{1}{L} \sum_{\ell=1}^{L} \phi(x^{(\ell)})$$

Example: Bernoulli Distribution

Bernoulli Distribution: Single toss of a (possibly biased) coin

$$\mathrm{Ber}(x \mid \mu) = \mu^x (1 - \mu)^{1 - x}, \quad x \in \{0, 1\}, \qquad E[x \mid \mu] = P[x = 1] = \mu, \quad 0 \le \mu \le 1$$

Exponential Family Form:

$$\mathrm{Ber}(x \mid \theta) = \exp\{\theta x - \Phi(\theta)\}, \qquad \mu = (1 + \exp(-\theta))^{-1}, \qquad \theta = \log(\mu) - \log(1 - \mu)$$

Maximum Likelihood from L data:

$$\hat{\mu} = \frac{1}{L} \sum_{\ell=1}^{L} x^{(\ell)}, \qquad \hat{\theta} = \log\Big(\frac{\hat{\mu}}{1 - \hat{\mu}}\Big)$$

[Figure: the logistic function $\mu(\theta)$ for $\theta \in [-10, 10]$.]
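Moment matching is not specific to the Bernoulli case. As a further illustration, the sketch below applies it to the Poisson distribution, which the lecture lists as an exponential family but does not work through. The form used here ($\phi(x) = x$, $\nu(x) = 1/x!$, $\Phi(\theta) = e^{\theta}$, so that $E_\theta[x] = e^{\theta}$) is the standard one and is an assumption relative to the slides, as are the function names and the count data.

```python
import math

def poisson_log_partition(theta):
    """Phi(theta) = exp(theta) for the Poisson family with phi(x) = x."""
    return math.exp(theta)

def poisson_mean(theta):
    """E_theta[phi(x)] = dPhi/dtheta = exp(theta)."""
    return math.exp(theta)

def log_likelihood(theta, data):
    """C + theta * sum(phi(x)) - L * Phi(theta), with C = sum log nu(x) = -sum log(x!)."""
    const = -sum(math.lgamma(x + 1) for x in data)
    return const + theta * sum(data) - len(data) * poisson_log_partition(theta)

def log_likelihood_gradient(theta, data):
    """Sum of sufficient statistics minus L times their expectation under theta."""
    return sum(data) - len(data) * poisson_mean(theta)

counts = [2, 0, 3, 1, 4, 2]                   # hypothetical count data
empirical_mean = sum(counts) / len(counts)    # average sufficient statistic
theta_hat = math.log(empirical_mean)          # solves E_theta[x] = empirical mean

grad_at_hat = log_likelihood_gradient(theta_hat, counts)
ll_hat = log_likelihood(theta_hat, counts)
ll_lower = log_likelihood(theta_hat - 0.1, counts)
ll_higher = log_likelihood(theta_hat + 0.1, counts)
print(theta_hat, grad_at_hat, ll_hat, ll_lower, ll_higher)
```

The gradient at `theta_hat` should vanish up to rounding, and the log-likelihood should be lower on either side of it, consistent with a concave log-likelihood. The closed form requires a positive empirical mean.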