Exponential Family
School of Computer Science
Probabilistic Graphical Models

Learning generalized linear models and tabular CPT of fully observed BN

Eric Xing
Lecture 7, September 30, 2009

[Figure: example Bayesian networks over X1, ..., X4]

Reading:

Exponential family
- For a numeric random variable X,
    p(x|η) = h(x) exp{ηᵀT(x) − A(η)}
           = (1/Z(η)) h(x) exp{ηᵀT(x)}
  is an exponential family distribution with natural (canonical) parameter η.
- The function T(x) is a sufficient statistic.
- The function A(η) = log Z(η) is the log normalizer.
- Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma, ...

Recall Linear Regression
- Assume that the target variable and the inputs are related by
    y_i = θᵀx_i + ε_i
  where ε is an error term capturing unmodeled effects or random noise.
- Now assume that ε follows a Gaussian N(0, σ); then
    p(y_i | x_i; θ) = (1 / (√(2π) σ)) exp{−(y_i − θᵀx_i)² / (2σ²)}

Recall: Logistic Regression (sigmoid classifier)
- The conditional distribution is a Bernoulli,
    p(y|x) = µ(x)^y (1 − µ(x))^(1−y)
  where µ is a logistic function,
    µ(x) = 1 / (1 + e^(−θᵀx))
- We can use the brute-force gradient method as in LR, but we can also apply generic results by observing that p(y|x) is an exponential family distribution, more specifically a generalized linear model!

Example: Multivariate Gaussian Distribution
- For a continuous vector random variable X ∈ R^k:
    p(x | µ, Σ) = (1 / ((2π)^(k/2) |Σ|^(1/2))) exp{−½ (x − µ)ᵀΣ⁻¹(x − µ)}          (moment parameters)
                = (2π)^(−k/2) exp{−½ tr(Σ⁻¹ x xᵀ) + µᵀΣ⁻¹x − ½ µᵀΣ⁻¹µ − ½ log|Σ|}
- Exponential family representation (natural parameters):
    η = [Σ⁻¹µ; −½ vec(Σ⁻¹)] = [η₁; vec(η₂)],  where η₁ = Σ⁻¹µ and η₂ = −½ Σ⁻¹
    T(x) = [x; vec(x xᵀ)]
    A(η) = ½ µᵀΣ⁻¹µ + ½ log|Σ| = −¼ tr(η₂⁻¹ η₁ η₁ᵀ) − ½ log(−2η₂)
    h(x) = (2π)^(−k/2)
- Note: a k-dimensional Gaussian is a (k + k²)-parameter distribution with a (k + k²)-element vector of sufficient statistics (but because of symmetry and positive definiteness, the parameters are constrained and have lower degrees of freedom).

Example: Multinomial distribution
- For a binary vector random variable x ~ multi(x|π):
    p(x|π) = π₁^(x₁) π₂^(x₂) ⋯ π_K^(x_K) = exp{Σ_k x_k ln π_k}
           = exp{Σ_{k=1}^{K−1} x_k ln π_k + (1 − Σ_{k=1}^{K−1} x_k) ln(1 − Σ_{k=1}^{K−1} π_k)}
           = exp{Σ_{k=1}^{K−1} x_k ln(π_k / (1 − Σ_{k=1}^{K−1} π_k)) + ln(1 − Σ_{k=1}^{K−1} π_k)}
- Exponential family representation:
    η = [ln(π_k / π_K); 0]
    T(x) = [x]
    A(η) = −ln(1 − Σ_{k=1}^{K−1} π_k) = ln(Σ_{k=1}^{K} e^(η_k))
    h(x) = 1

Why exponential family?
- Moment generating property:
    dA/dη = (d/dη) log Z(η) = (1/Z(η)) (d/dη) Z(η)
          = (1/Z(η)) (d/dη) ∫ h(x) exp{ηᵀT(x)} dx
          = ∫ T(x) [h(x) exp{ηᵀT(x)} / Z(η)] dx
          = E[T(x)]
    d²A/dη² = ∫ T²(x) [h(x) exp{ηᵀT(x)} / Z(η)] dx − ∫ T(x) [h(x) exp{ηᵀT(x)} / Z(η)] dx · (1/Z(η)) (d/dη) Z(η)
            = E[T²(x)] − E²[T(x)]
            = Var[T(x)]

Moment estimation
- We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(η).
- The q-th derivative gives the q-th centered moment:
    dA(η)/dη = mean
    d²A(η)/dη² = variance
    ...
- When the sufficient statistic is a stacked vector, partial derivatives need to be considered.
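The moment generating property is easy to check numerically for simple one-parameter families. Below is a minimal sketch, assuming only the Bernoulli and Poisson log normalizers given above; the function names and the finite-difference check are our own illustration, not part of the lecture.

```python
import numpy as np

def A_bernoulli(eta):
    # log normalizer of a Bernoulli in natural form: A(eta) = log(1 + e^eta)
    return np.log1p(np.exp(eta))

def A_poisson(eta):
    # log normalizer of a Poisson: A(eta) = e^eta, since lambda = e^eta
    return np.exp(eta)

def numeric_moments(A, eta, h=1e-4):
    # central finite differences for dA/deta and d^2A/deta^2
    dA = (A(eta + h) - A(eta - h)) / (2 * h)
    d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2
    return dA, d2A

eta = 0.3
mean_b, var_b = numeric_moments(A_bernoulli, eta)
mu = 1.0 / (1.0 + np.exp(-eta))        # Bernoulli mean for this eta
print(mean_b, mu)                      # dA/deta      ~= E[T(x)] = mu
print(var_b, mu * (1.0 - mu))          # d^2A/deta^2  ~= Var[T(x)] = mu(1 - mu)

mean_p, var_p = numeric_moments(A_poisson, eta)
lam = np.exp(eta)
print(mean_p, lam)                     # for a Poisson, both moments equal lambda
print(var_p, lam)
```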
Moment vs canonical parameters
- The moment parameter µ can be derived from the natural (canonical) parameter:
    dA(η)/dη = E[T(x)] ≜ µ
- A(η) is convex, since
    d²A(η)/dη² = Var[T(x)] > 0
  [Figure: plot of the convex log normalizer A against η, with a marked point η*]
- Hence we can invert the relationship and infer the canonical parameter from the moment parameter (the map is one-to-one):
    η ≜ ψ(µ)
- A distribution in the exponential family can be parameterized not only by η (the canonical parameterization) but also by µ (the moment parameterization).

MLE for Exponential Family
- For iid data, the log-likelihood is
    ℓ(η; D) = log ∏_n h(x_n) exp{ηᵀT(x_n) − A(η)}
            = Σ_n log h(x_n) + ηᵀ (Σ_n T(x_n)) − N A(η)
- Take derivatives and set to zero:
    ∂ℓ/∂η = Σ_n T(x_n) − N ∂A(η)/∂η = 0
    ⇒ ∂A(η)/∂η = (1/N) Σ_n T(x_n)
    ⇒ µ̂_MLE = (1/N) Σ_n T(x_n)
- This amounts to moment matching.
- We can then infer the canonical parameters using η̂_MLE = ψ(µ̂_MLE).

Sufficiency
- For p(x|θ), T(x) is sufficient for θ if there is no information in X regarding θ beyond that in T(x); we can throw away X for the purpose of inference w.r.t. θ.
- Bayesian view:
    p(θ | T(x), x) = p(θ | T(x))
- Frequentist view:
    p(x | T(x), θ) = p(x | T(x))
- The Neyman factorization theorem: T(x) is sufficient for θ if
    p(x, T(x), θ) = ψ₁(T(x), θ) ψ₂(x, T(x))
    ⇒ p(x | θ) = g(T(x), θ) h(x, T(x))

Examples
- Gaussian:
    η = [Σ⁻¹µ; −½ vec(Σ⁻¹)]
    T(x) = [x; vec(x xᵀ)]
    A(η) = ½ µᵀΣ⁻¹µ + ½ log|Σ|
    h(x) = (2π)^(−k/2)
  ⇒ µ̂_MLE = (1/N) Σ_n T₁(x_n) = (1/N) Σ_n x_n
- Multinomial:
    η = [ln(π_k / π_K); 0]
    T(x) = [x]
    A(η) = −ln(1 − Σ_{k=1}^{K−1} π_k) = ln(Σ_{k=1}^{K} e^(η_k))
    h(x) = 1
  ⇒ µ̂_MLE = (1/N) Σ_n x_n
- Poisson:
    η = log λ
    T(x) = x
    A(η) = λ = e^η
    h(x) = 1/x!
  ⇒ µ̂_MLE = (1/N) Σ_n x_n

Generalized Linear Models (GLIMs)
- The graphical model: [Figure: plate model with input X_n and output Y_n, repeated for n = 1, ..., N]
  - Linear regression
  - Discriminative linear classification
  - Commonality: both model E_p(Y) = µ = f(θᵀx).
    What is p(·)? The conditional distribution of Y.
    What is f(·)? The response function.
- GLIM:
  - The observed input x is assumed to enter the model via a linear combination of its elements, ξ = θᵀx.
  - The conditional mean µ is represented as a function f(ξ) of ξ, where f is known as the response function.
  - The observed output y is assumed to be characterized by an exponential family distribution with conditional mean µ.

GLIM, cont.
- The model pipeline: x → ξ = θᵀx → µ = f(ξ) → η = ψ(µ), with y drawn from
    p(y | η) = h(y) exp{η(x)ᵀ y − A(η)}
  or, with a scale parameter φ,
    p(y | η) = h(y, φ) exp{(1/φ)(η(x)ᵀ y − A(η))}
- The choice of exponential family is constrained by the nature of the data Y.
  Example: if y is a continuous vector, use a multivariate Gaussian; if y is a class label, use a Bernoulli or multinomial.
- The choice of the response function:
  - It must obey some mild constraints, e.g., values in [0, 1], positivity, ...
  - Canonical response function: f = ψ⁻¹(·). In this case θᵀx directly corresponds to the canonical parameter η (a small numerical illustration follows below).
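To make the canonical response function concrete, the sketch below pairs three standard exponential families with their canonical responses and evaluates µ = f(θᵀx). This is a minimal illustration of our own; the dictionary, names, and example numbers are not from the lecture.

```python
import numpy as np

# Canonical response f = psi^{-1} for three common GLIM output families.
canonical_response = {
    "gaussian":  lambda xi: xi,                          # identity -> linear regression
    "bernoulli": lambda xi: 1.0 / (1.0 + np.exp(-xi)),   # logistic -> binary classification
    "poisson":   lambda xi: np.exp(xi),                  # exponential -> count regression
}

theta = np.array([0.5, -1.0, 2.0])   # made-up parameter vector
x = np.array([1.0, 0.3, 0.2])        # made-up input
xi = theta @ x                       # linear combination xi = theta^T x

for family, f in canonical_response.items():
    mu = f(xi)                       # conditional mean E[y | x] = f(theta^T x)
    print(f"{family:9s} eta = xi = {xi:+.3f}   mu = {mu:.3f}")
```

With the canonical response, the natural parameter of the output distribution is exactly the linear predictor ξ, which is what makes the MLE derivations on the following slides so clean.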
MLE for GLIMs with natural response
- Log-likelihood:
    ℓ = Σ_n log h(y_n) + Σ_n (θᵀx_n y_n − A(η_n))
- Derivative of the log-likelihood:
    dℓ/dθ = Σ_n (x_n y_n − (dA(η_n)/dη_n)(dη_n/dθ))
          = Σ_n (y_n − µ_n) x_n
          = Xᵀ(y − µ)
  Setting this to zero gives a fixed-point equation, because µ is a function of θ.
- Online learning for canonical GLIMs: stochastic gradient ascent = the least mean squares (LMS) algorithm,
    θ^(t+1) = θ^t + ρ (y_n − µ_n^t) x_n
  where µ_n^t = (θ^t)ᵀx_n and ρ is a step size.

Batch learning for canonical GLIMs
- Stack the inputs into the design matrix X = [x₁ᵀ; x₂ᵀ; ...; x_nᵀ] (one row per example) and the outputs into the vector y = [y₁; y₂; ...; y_n].
- The Hessian matrix:
    H = d²ℓ / (dθ dθᵀ) = (d/dθᵀ) Σ_n (y_n − µ_n) x_n
      = −Σ_n x_n (dµ_n/dθᵀ)
      = −Σ_n x_n (dµ_n/dη_n)(dη_n/dθᵀ)
      = −Σ_n x_n x_nᵀ (dµ_n/dη_n)        (since η_n = θᵀx_n)
      = −XᵀWX
  where
    W = diag(dµ₁/dη₁, ..., dµ_N/dη_N)
  which can be computed by calculating the 2nd derivative of A(η_n).

Recall LMS
- Cost function in matrix form:
    J(θ) = ½ Σ_{i=1}^{n} (x_iᵀθ − y_i)²
         = ½ (Xθ − y)ᵀ(Xθ − y)
- To minimize J(θ), take the derivative and set it to zero:
    ∇_θ J = ½ ∇_θ tr(θᵀXᵀXθ − θᵀXᵀy − yᵀXθ + yᵀy)
          = ½ (∇_θ tr(θᵀXᵀXθ) − 2 ∇_θ tr(yᵀXθ) + ∇_θ tr(yᵀy))
          = ½ (XᵀXθ + XᵀXθ − 2Xᵀy)
          = XᵀXθ − Xᵀy = 0
  which yields the normal equations
    XᵀXθ = Xᵀy
  and hence
    θ* = (XᵀX)⁻¹ Xᵀy

Iteratively Reweighted Least Squares (IRLS)
- Recall the Newton-Raphson method with cost function J:
    θ^(t+1) = θ^t − H⁻¹ ∇_θ J
- For the GLIM log-likelihood we now have
    ∇_θ ℓ = Xᵀ(y − µ),   H = −XᵀWX
  (compare with the LMS solution θ* = (XᵀX)⁻¹Xᵀy above).
- The Newton update is therefore
    θ^(t+1) = θ^t − H⁻¹ ∇_θ ℓ
            = (XᵀW^t X)⁻¹ [XᵀW^t X θ^t + Xᵀ(y − µ^t)]
            = (XᵀW^t X)⁻¹ XᵀW^t z^t
  where the adjusted response is
    z^t = Xθ^t + (W^t)⁻¹ (y − µ^t)
- This can be understood as solving the following "iteratively reweighted least squares" problem:
    θ^(t+1) = argmin_θ (z − Xθ)ᵀ W (z − Xθ)

Example 1: logistic regression (sigmoid classifier)
- The conditional distribution is a Bernoulli,
    p(y|x) = µ(x)^y (1 − µ(x))^(1−y)
  where µ is a logistic function,
    µ(x) = 1 / (1 + e^(−η(x)))
- p(y|x) is an exponential family distribution, with
    mean: E[y|x] = µ = 1 / (1 + e^(−η(x)))
    canonical response function: η = ξ = θᵀx
- For IRLS:
    dµ/dη = µ(1 − µ)
    W = diag(µ₁(1 − µ₁), ..., µ_N(1 − µ_N))
  (an IRLS sketch for this model is given after the next slide).

Logistic regression: practical issues
- It is very common to use regularized maximum likelihood.
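The following is a minimal IRLS sketch for the logistic regression example above; it is our own illustrative code, not the lecture's. Each iteration solves the reweighted least-squares problem θ ← (XᵀWX + λI)⁻¹XᵀWz with z = Xθ + W⁻¹(y − µ). Setting λ = 0 gives plain MLE; λ > 0 adds an L2 penalty on all coefficients as a simple, assumed stand-in for the "regularized maximum likelihood" mentioned above. The synthetic data and parameter values are made up.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, lam=0.0):
    """IRLS for a Bernoulli GLIM with the canonical (logistic) response."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iter):
        eta = X @ theta                              # eta_n = theta^T x_n
        mu = 1.0 / (1.0 + np.exp(-eta))              # mean: logistic response
        w = np.clip(mu * (1.0 - mu), 1e-10, None)    # diag of W = dmu/deta
        z = eta + (y - mu) / w                       # adjusted response
        WX = X * w[:, None]
        # reweighted (ridge) normal equations: (X^T W X + lam I) theta = X^T W z
        theta = np.linalg.solve(X.T @ WX + lam * np.eye(d), X.T @ (w * z))
    return theta

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((500, 1)), rng.normal(size=(500, 2))])
theta_true = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=500) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)
print(irls_logistic(X, y))            # should roughly recover theta_true
print(irls_logistic(X, y, lam=1.0))   # L2-penalized estimate, shrunk toward zero
```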