MLE and MAP Examples
1 Multinomial Distribution

Given some integer $k > 1$, let $\Theta$ be the set of vectors $\theta = (\theta_1, \dots, \theta_k)$ satisfying $\theta_i \geq 0$ and $\sum_{i=1}^k \theta_i = 1$. For any $\theta \in \Theta$ we define the probability mass function

$$p_\theta(x) = \begin{cases} \prod_{i=1}^k \theta_i^{I(x=i)} & \text{for } x \in \{1, 2, \dots, k\} \\ 0 & \text{otherwise.} \end{cases}$$

Given $n$ observations $X_1, \dots, X_n \in \{1, \dots, k\}$, we would like to derive the maximum likelihood estimate for the parameter $\theta$ under this model. First, let's write down the likelihood of the data for some $\theta \in \Theta$ (recall that we have assumed $X_1, \dots, X_n \in \{1, \dots, k\}$, since otherwise there is no meaningful solution):

$$L(\theta; X_1, \dots, X_n) = \prod_{j=1}^n p_\theta(X_j) = \prod_{j=1}^n \prod_{i=1}^k \theta_i^{I(X_j = i)} = \prod_{i=1}^k \prod_{j=1}^n \theta_i^{I(X_j = i)} = \prod_{i=1}^k \theta_i^{S_i}$$

where, for brevity, we have defined $S_i = \sum_{j=1}^n I(X_j = i)$. Our goal is to maximize $L$ with respect to $\theta$, subject to the constraint that $\theta \in \Theta$. Equivalently, we can maximize the log likelihood:

$$\log L(\theta; X_1, \dots, X_n) = \sum_{i=1}^k S_i \log \theta_i.$$

Introducing a Lagrange multiplier for the constraint $\sum_{i=1}^k \theta_i = 1$, we have

$$\Lambda(\theta, \lambda) = \sum_{i=1}^k S_i \log \theta_i + \lambda \left( \sum_{i=1}^k \theta_i - 1 \right)$$

(we need not worry about the positivity constraints, since the solution will satisfy those regardless). Differentiating with respect to $\theta_i$ and $\lambda$, for each $i = 1, \dots, k$ we have

$$\frac{\partial \Lambda}{\partial \theta_i} = \frac{S_i}{\theta_i} + \lambda$$

and $\frac{\partial \Lambda}{\partial \lambda} = \sum_{i=1}^k \theta_i - 1$. Setting the latter partial derivative to 0 gives back the original constraint $\sum_{i=1}^k \theta_i = 1$, as expected. Also for $i = 1, \dots, k$,

$$\frac{\partial \Lambda}{\partial \theta_i} = 0 \;\Rightarrow\; \frac{S_i}{\hat{\theta}_i} = -\lambda \;\Rightarrow\; \hat{\theta}_i = \frac{S_i}{-\lambda}. \quad (1)$$

By definition of $S_i$ we have

$$\sum_{i=1}^k \frac{S_i}{-\lambda} = \frac{\sum_{i=1}^k S_i}{-\lambda} = \frac{n}{-\lambda},$$

which implies that in order for the summation constraint on the $\theta_i$ to be satisfied, we require $\frac{n}{-\lambda} = 1$, i.e. $\lambda = -n$. Plugging this value for $\lambda$ into (1),

$$\hat{\theta}_i = \frac{S_i}{n} = \frac{1}{n} \sum_{j=1}^n I(X_j = i),$$

i.e. the maximum likelihood estimates for the elements of $\theta$ are simply the intuitively obvious estimators: the empirical frequencies.

2 Multivariate Normal (unknown mean and variance)

The PDF of the multivariate normal in $d$ dimensions is

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$

where the parameters are the mean vector $\mu \in \mathbb{R}^d$ and the covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$, which must be symmetric positive definite.

2.1 MLE

The likelihood function given $X_1, \dots, X_n \in \mathbb{R}^d$ is

$$\begin{aligned} L(\mu, \Sigma; X_1, \dots, X_n) &= c_1 |\Sigma|^{-n/2} \prod_{i=1}^n \exp\left( -\frac{1}{2} (X_i - \mu)^T \Sigma^{-1} (X_i - \mu) \right) \\ &= c_1 |\Sigma|^{-n/2} \exp\left( -\frac{1}{2} \sum_{i=1}^n (X_i - \mu)^T \Sigma^{-1} (X_i - \mu) \right) \end{aligned}$$

where $c_1 = (2\pi)^{-nd/2}$ is a constant independent of the data and parameters and can be ignored. The log likelihood is

$$\log L(\mu, \Sigma; X_1, \dots, X_n) = -\frac{n}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^n (X_i - \mu)^T \Sigma^{-1} (X_i - \mu) + c_2$$

where $c_2$ is another inconsequential constant. It is easiest to first maximize this with respect to $\mu$. The corresponding partial derivative is

$$\frac{\partial}{\partial \mu} \log L = \sum_{i=1}^n \Sigma^{-1} (X_i - \mu) = \Sigma^{-1} \sum_{i=1}^n (X_i - \mu) = -n \Sigma^{-1} \left( \mu - \frac{1}{n} \sum_{i=1}^n X_i \right),$$

and setting this to 0 implies

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^n X_i.$$

The partial derivative of the log likelihood with respect to $\Sigma$ is

$$\begin{aligned} \frac{\partial}{\partial \Sigma} \log L &= -\frac{n}{2} \frac{1}{|\Sigma|} |\Sigma| \Sigma^{-1} - \frac{1}{2} \sum_{i=1}^n \left( -\Sigma^{-1} (X_i - \mu)(X_i - \mu)^T \Sigma^{-1} \right) \\ &= -\frac{n}{2} \Sigma^{-1} + \frac{1}{2} \Sigma^{-1} \left( \sum_{i=1}^n (X_i - \mu)(X_i - \mu)^T \right) \Sigma^{-1}. \end{aligned}$$

Equating this to 0 and plugging in the estimate $\hat{\mu}$, we see that the estimator $\hat{\Sigma}$ must solve the equation

$$n \hat{\Sigma}^{-1} = \hat{\Sigma}^{-1} \left( \sum_{i=1}^n (X_i - \hat{\mu})(X_i - \hat{\mu})^T \right) \hat{\Sigma}^{-1}.$$

It is easy to see that a solution for $\hat{\Sigma}$ is

$$\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^n (X_i - \hat{\mu})(X_i - \hat{\mu})^T,$$

which is known as the sample covariance matrix.
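As a quick numerical check of the two closed-form estimators derived above, the following sketch (assuming NumPy is available; the variable names are illustrative and not from the original notes) draws synthetic data, computes the multinomial MLE $\hat{\theta}_i = S_i / n$ and the Gaussian MLEs $\hat{\mu}$ and $\hat{\Sigma}$, and compares $\hat{\Sigma}$ with NumPy's biased sample covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Multinomial MLE: theta_hat_i = S_i / n, the empirical frequency of category i ---
k, n = 4, 100_000
theta_true = np.array([0.1, 0.2, 0.3, 0.4])
X = rng.choice(np.arange(1, k + 1), size=n, p=theta_true)   # observations in {1, ..., k}
S = np.array([(X == i).sum() for i in range(1, k + 1)])     # S_i = number of X_j equal to i
theta_hat = S / n
print(theta_hat)                                            # close to theta_true for large n

# --- Multivariate normal MLE: mu_hat = sample mean, Sigma_hat = (1/n) * centered scatter ---
d = 3
mu_true = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(d, d))
Sigma_true = A @ A.T + d * np.eye(d)                        # symmetric positive definite
Y = rng.multivariate_normal(mu_true, Sigma_true, size=n)
mu_hat = Y.mean(axis=0)
centered = Y - mu_hat
Sigma_hat = centered.T @ centered / n                       # note the 1/n, not 1/(n - 1)
print(np.allclose(Sigma_hat, np.cov(Y, rowvar=False, bias=True)))   # True
```

For large $n$ both estimates land close to the true parameters; the $1/n$ normalization (rather than $1/(n-1)$) is what distinguishes the maximum likelihood estimator $\hat{\Sigma}$ from the unbiased sample covariance.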
2.2 MAP under the conjugate prior

The conjugate prior for the mean and covariance of a multivariate normal is sometimes called the Normal-inverse-Wishart distribution and has the density

$$f(\mu, \Sigma \mid \mu_0, \beta, \Psi, \nu) = p(\mu \mid \mu_0, \beta\Sigma)\, w(\Sigma \mid \Psi, \nu)$$

where $p$ is the density of the multivariate normal distribution, and $w$ is the density of the inverse-Wishart distribution given by

$$w(\Sigma \mid \Psi, \nu) = \frac{|\Psi|^{\nu/2}}{2^{\nu d/2}\, \Gamma_d\!\left(\frac{\nu}{2}\right)} |\Sigma|^{-(\nu+d+1)/2} \exp\left( -\frac{1}{2} \operatorname{Tr}(\Psi \Sigma^{-1}) \right)$$

where $\Gamma_d$ is the multivariate Gamma function, and $\operatorname{Tr}(\cdot)$ denotes the trace of a matrix. The parameters of the Normal-inverse-Wishart are $\mu_0 \in \mathbb{R}^d$, $\beta \in \mathbb{R}_+$, $\Psi \in \mathbb{R}^{d \times d}$ positive definite, and $\nu \in \mathbb{R}$ with $\nu > d - 1$. We need to

1. calculate the posterior distribution of $\mu$ and $\Sigma$ assuming this prior and $n$ observations $X_1, \dots, X_n \in \mathbb{R}^d$;
2. convince ourselves that the posterior is indeed a Normal-inverse-Wishart distribution and find its parameters;
3. and finally find the values for $\mu$ and $\Sigma$ that maximize the posterior distribution.

2.2.1 Calculating the posterior distribution

We have

$$f(\mu, \Sigma \mid \mu_0, \beta, \Psi, \nu, X_1, \dots, X_n) \propto \left( \prod_{i=1}^n p(X_i \mid \mu, \Sigma) \right) f(\mu, \Sigma \mid \mu_0, \beta, \Psi, \nu)$$

where we omit the term in the denominator, which is a finite, non-zero constant that doesn't depend on $\mu$ or $\Sigma$: any such term does not affect the shape of the posterior distribution and only factors into the normalizing constant that ensures the posterior integrates to 1. Continuing from above,

$$\begin{aligned} f(\mu, \Sigma \mid \mu_0, \beta, \Psi, \nu, X_1, \dots, X_n) \propto{}& \left( \prod_{i=1}^n \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (X_i - \mu)^T \Sigma^{-1} (X_i - \mu) \right) \right) \times \quad (2) \\ & \frac{1}{(2\pi)^{d/2} |\beta\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (\mu - \mu_0)^T (\beta\Sigma)^{-1} (\mu - \mu_0) \right) \times \\ & \frac{|\Psi|^{\nu/2}}{2^{\nu d/2}\, \Gamma_d\!\left(\frac{\nu}{2}\right)} |\Sigma|^{-(\nu+d+1)/2} \exp\left( -\frac{1}{2} \operatorname{Tr}(\Psi \Sigma^{-1}) \right) \quad (3) \\ \propto{}& \exp\left( -\frac{1}{2} \sum_{i=1}^n (X_i - \mu)^T \Sigma^{-1} (X_i - \mu) \right) \times \\ & \exp\left( -\frac{1}{2} (\mu - \mu_0)^T (\beta\Sigma)^{-1} (\mu - \mu_0) \right) \times \\ & |\Sigma|^{-(\nu+n+d+2)/2} \exp\left( -\frac{1}{2} \operatorname{Tr}(\Psi \Sigma^{-1}) \right). \quad (4) \end{aligned}$$

While messy, this fully defines the posterior distribution.

2.2.2 The posterior is a Normal-inverse-Wishart distribution

Our goal now is to find $\mu_0', \beta', \Psi', \nu'$ such that (4) looks like a Normal-inverse-Wishart density with those parameters. First we write out explicitly the form of the Normal-inverse-Wishart density, up to constants, for these as-yet-unknown parameters:

$$f(\mu, \Sigma \mid \mu_0', \beta', \Psi', \nu') \propto \exp\left( -\frac{1}{2} (\mu - \mu_0')^T (\beta'\Sigma)^{-1} (\mu - \mu_0') \right) |\Sigma|^{-(\nu'+d+2)/2} \exp\left( -\frac{1}{2} \operatorname{Tr}(\Psi' \Sigma^{-1}) \right). \quad (5)$$

Matching the powers of $|\Sigma|$ in (4) and (5), it is immediately clear that the only value $\nu'$ can take to satisfy $f(\mu, \Sigma \mid \mu_0, \beta, \Psi, \nu, X_1, \dots, X_n) = f(\mu, \Sigma \mid \mu_0', \beta', \Psi', \nu')$ is

$$\nu' = \nu + n.$$

Furthermore, we see that for this value of $\nu'$ the term involving $|\Sigma|$ in (4) is accounted for in (5). So now we only need to find $\mu_0', \beta', \Psi'$ such that

$$\exp\left( -\frac{1}{2} \sum_{i=1}^n (X_i - \mu)^T \Sigma^{-1} (X_i - \mu) \right) \exp\left( -\frac{1}{2} (\mu - \mu_0)^T (\beta\Sigma)^{-1} (\mu - \mu_0) \right) \exp\left( -\frac{1}{2} \operatorname{Tr}(\Psi \Sigma^{-1}) \right) \quad (6)$$

is equal to

$$\exp\left( -\frac{1}{2} (\mu - \mu_0')^T (\beta'\Sigma)^{-1} (\mu - \mu_0') \right) \exp\left( -\frac{1}{2} \operatorname{Tr}(\Psi' \Sigma^{-1}) \right) \quad (7)$$

up to a multiplicative constant independent of $\mu$ and $\Sigma$ (note that the terms involving $|\Sigma|$ are already equalized and needn't be considered any more). By taking the log of each quantity and multiplying by $-2$, we see that the above problem is equivalent to finding $\mu_0', \beta', \Psi'$ such that

$$\sum_{i=1}^n (X_i - \mu)^T \Sigma^{-1} (X_i - \mu) + (\mu - \mu_0)^T (\beta\Sigma)^{-1} (\mu - \mu_0) + \operatorname{Tr}(\Psi \Sigma^{-1}) \quad (8)$$

is equal to

$$(\mu - \mu_0')^T (\beta'\Sigma)^{-1} (\mu - \mu_0') + \operatorname{Tr}(\Psi' \Sigma^{-1}) \quad (9)$$

up to an additive constant independent of $\mu$ and $\Sigma$.
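To make the unnormalized posterior in (4) concrete, here is a minimal sketch (assuming SciPy is available; the function and variable names are illustrative, not part of the notes) that evaluates its logarithm as the Gaussian log-likelihood plus the log of the Normal-inverse-Wishart prior density defined above.

```python
import numpy as np
from scipy.stats import multivariate_normal, invwishart

def log_niw_prior(mu, Sigma, mu0, beta, Psi, nu):
    """Log density of the NIW prior f(mu, Sigma | mu0, beta, Psi, nu):
    a normal on mu with covariance beta * Sigma times an inverse-Wishart on Sigma."""
    return (multivariate_normal.logpdf(mu, mean=mu0, cov=beta * Sigma)
            + invwishart.logpdf(Sigma, df=nu, scale=Psi))

def log_unnormalized_posterior(mu, Sigma, X, mu0, beta, Psi, nu):
    """Log of the right-hand side of (4), up to an additive constant:
    sum of Gaussian log-likelihoods plus the log NIW prior."""
    loglik = multivariate_normal.logpdf(X, mean=mu, cov=Sigma).sum()
    return loglik + log_niw_prior(mu, Sigma, mu0, beta, Psi, nu)

# Example evaluation with illustrative parameter values.
rng = np.random.default_rng(0)
d, n = 2, 50
X = rng.normal(size=(n, d))
mu0, beta = np.zeros(d), 1.0
Psi, nu = np.eye(d), d + 2                     # nu > d - 1, as required
mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False, bias=True)
print(log_unnormalized_posterior(mu, Sigma, X, mu0, beta, Psi, nu))
```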
Defining $S = \sum_{i=1}^n X_i$, we can rewrite (8) as

$$\begin{aligned} \sum_{i=1}^n & (X_i - \mu)^T \Sigma^{-1} (X_i - \mu) + (\mu - \mu_0)^T (\beta\Sigma)^{-1} (\mu - \mu_0) + \operatorname{Tr}(\Psi \Sigma^{-1}) \\ ={}& \frac{1}{\beta} \mu^T \Sigma^{-1} \mu - \frac{2}{\beta} \mu^T \Sigma^{-1} \mu_0 + \frac{1}{\beta} \mu_0^T \Sigma^{-1} \mu_0 \\ & + \sum_{i=1}^n X_i^T \Sigma^{-1} X_i - 2 \mu^T \Sigma^{-1} S + n \mu^T \Sigma^{-1} \mu + \operatorname{Tr}(\Psi \Sigma^{-1}) \\ ={}& \left( \frac{1}{\beta} + n \right) \mu^T \Sigma^{-1} \mu - 2 \mu^T \Sigma^{-1} \left( \frac{1}{\beta} \mu_0 + S \right) \quad (10) \\ & + \frac{1}{\beta} \mu_0^T \Sigma^{-1} \mu_0 + \sum_{i=1}^n X_i^T \Sigma^{-1} X_i + \operatorname{Tr}(\Psi \Sigma^{-1}). \quad (11) \end{aligned}$$

Notice each appearance of $\mu$ is now in the two terms on line (10), which look like the first two terms of the expansion of a quadratic form similar to the first term in (9).
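The expansion in (10)-(11) is easy to get wrong, so the following minimal numerical check (assuming NumPy; all variable names are illustrative) verifies that it agrees with the original expression (8) for randomly chosen values of the quantities involved.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 20
X = rng.normal(size=(n, d))
S = X.sum(axis=0)                                   # S = sum_i X_i
mu = rng.normal(size=d)
mu0 = rng.normal(size=d)
beta = 0.7
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)                     # symmetric positive definite
Psi = np.eye(d)
Sigma_inv = np.linalg.inv(Sigma)

# Expression (8) as written.
lhs = (sum((x - mu) @ Sigma_inv @ (x - mu) for x in X)
       + (mu - mu0) @ np.linalg.inv(beta * Sigma) @ (mu - mu0)
       + np.trace(Psi @ Sigma_inv))

# The expansion on lines (10)-(11).
rhs = ((1 / beta + n) * mu @ Sigma_inv @ mu
       - 2 * mu @ Sigma_inv @ (mu0 / beta + S)
       + (1 / beta) * mu0 @ Sigma_inv @ mu0
       + sum(x @ Sigma_inv @ x for x in X)
       + np.trace(Psi @ Sigma_inv))

print(np.isclose(lhs, rhs))                         # True: (8) and (10)-(11) agree
```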