MLE and MAP Examples

1 Categorical distribution

Given some integer $k > 1$, let $\Theta$ be the set of vectors $\theta = (\theta_1, \dots, \theta_k)$ satisfying $\theta_i \ge 0$ and $\sum_{i=1}^{k} \theta_i = 1$. For any $\theta \in \Theta$ we define the probability mass function

\[
p_\theta(x) =
\begin{cases}
\prod_{i=1}^{k} \theta_i^{I(x=i)} & \text{for } x \in \{1, 2, \dots, k\} \\
0 & \text{otherwise.}
\end{cases}
\]

Given $n$ observations $X_1, \dots, X_n \in \{1, \dots, k\}$, we would like to derive the maximum likelihood estimate for the parameter $\theta$ under this model. First, let's write down the likelihood of the data for some $\theta \in \Theta$ (recall that we have assumed $X_1, \dots, X_n \in \{1, \dots, k\}$, since otherwise there is no meaningful solution):

\begin{align*}
L(\theta; X_1, \dots, X_n) &= \prod_{j=1}^{n} p_\theta(X_j) \\
&= \prod_{j=1}^{n} \prod_{i=1}^{k} \theta_i^{I(X_j = i)} \\
&= \prod_{i=1}^{k} \prod_{j=1}^{n} \theta_i^{I(X_j = i)} \\
&= \prod_{i=1}^{k} \theta_i^{S_i}
\end{align*}

where, for brevity, we have defined $S_i = \sum_{j=1}^{n} I(X_j = i)$. Our goal is to maximize $L$ with respect to $\theta$, subject to the constraint that $\theta \in \Theta$. Equivalently, we can maximize the log likelihood:

\[
\log L(\theta; X_1, \dots, X_n) = \sum_{i=1}^{k} S_i \log \theta_i.
\]

Introducing a Lagrange multiplier for the constraint $\sum_{i=1}^{k} \theta_i = 1$, we have

\[
\Lambda(\theta, \lambda) = \sum_{i=1}^{k} S_i \log \theta_i + \lambda \left( \sum_{i=1}^{k} \theta_i - 1 \right)
\]

(we need not worry about the positivity constraints, since the solution will satisfy them regardless). Differentiating with respect to $\theta_i$ and $\lambda$, for each $i = 1, \dots, k$ we have

\[
\frac{\partial \Lambda}{\partial \theta_i} = \frac{S_i}{\theta_i} + \lambda
\]

and $\frac{\partial \Lambda}{\partial \lambda} = \sum_{i=1}^{k} \theta_i - 1$. Setting the latter partial derivative to 0 gives back the original constraint $\sum_{i=1}^{k} \theta_i = 1$, as expected. Also, for $i = 1, \dots, k$,

\[
\frac{\partial \Lambda}{\partial \theta_i} = 0 \;\Rightarrow\; \frac{S_i}{\hat\theta_i} = -\lambda \;\Rightarrow\; \hat\theta_i = \frac{S_i}{-\lambda}. \tag{1}
\]

By definition of $S_i$ we have

\[
\sum_{i=1}^{k} \frac{S_i}{-\lambda} = \frac{\sum_{i=1}^{k} S_i}{-\lambda} = \frac{n}{-\lambda},
\]

which implies that in order for the summation constraint on $\theta_i$ to be satisfied, we require $\frac{n}{-\lambda} = 1$, i.e. $\lambda = -n$. Plugging this value for $\lambda$ into (1),

\[
\hat\theta_i = \frac{S_i}{n} = \frac{1}{n} \sum_{j=1}^{n} I(X_j = i),
\]

i.e. the maximum likelihood estimates for the elements of $\theta$ are simply the intuitively obvious estimators: the empirical frequencies of the $k$ outcomes.
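As a quick numerical sanity check, here is a minimal numpy sketch that draws data from a categorical distribution and recovers $\theta$ from the counts $S_i$. The number of categories, the true parameter vector and the sample size are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth over k = 3 categories (labels 1, ..., k).
theta_true = np.array([0.2, 0.5, 0.3])
k, n = len(theta_true), 10_000

# Draw n observations X_1, ..., X_n from the categorical distribution.
X = rng.choice(np.arange(1, k + 1), size=n, p=theta_true)

# S_i = number of observations equal to i; the MLE is S_i / n.
S = np.bincount(X, minlength=k + 1)[1:]   # drop the unused 0 bin
theta_mle = S / n

print(theta_mle)   # close to theta_true for large n
```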

2 Multivariate Normal (unknown mean and covariance)

The PDF of the multivariate normal in $d$ dimensions is

\[
p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)
\]

where the parameters are the mean vector $\mu \in \mathbb{R}^d$ and the covariance $\Sigma \in \mathbb{R}^{d \times d}$, which must be symmetric positive definite.
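For concreteness, here is a direct numpy transcription of this density (as a log-density, which is what one usually computes in practice); the example parameters are made up for illustration.

```python
import numpy as np

def mvn_logpdf(x, mu, Sigma):
    """Log of the multivariate normal density above (direct transcription)."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)          # log|Sigma|, assuming Sigma is PD
    quad = diff @ np.linalg.solve(Sigma, diff)    # (x - mu)^T Sigma^{-1} (x - mu)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

# Made-up parameters, purely for illustration.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(mvn_logpdf(np.array([0.5, 0.5]), mu, Sigma))
```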

2.1 MLE

The likelihood function given $X_1, \dots, X_n \in \mathbb{R}^d$ is

\begin{align*}
L(\mu, \Sigma; X_1, \dots, X_n) &= c_1 |\Sigma|^{-n/2} \prod_{i=1}^{n} \exp\left( -\frac{1}{2} (X_i - \mu)^T \Sigma^{-1} (X_i - \mu) \right) \\
&= c_1 |\Sigma|^{-n/2} \exp\left\{ -\frac{1}{2} \sum_{i=1}^{n} (X_i - \mu)^T \Sigma^{-1} (X_i - \mu) \right\}
\end{align*}

where $c_1 = (2\pi)^{-nd/2}$ is a constant independent of the data and parameters and can be ignored. The log likelihood is

\[
\log L(\mu, \Sigma; X_1, \dots, X_n) = -\frac{n}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^{n} (X_i - \mu)^T \Sigma^{-1} (X_i - \mu) + c_2
\]

where $c_2$ is another inconsequential constant. It is easiest to first maximize this with respect to $\mu$. The corresponding partial derivative is

\begin{align*}
\frac{\partial}{\partial \mu} \log L &= \sum_{i=1}^{n} \Sigma^{-1} (X_i - \mu) \\
&= \Sigma^{-1} \sum_{i=1}^{n} (X_i - \mu) \\
&= -n \Sigma^{-1} \left( \mu - \frac{1}{n} \sum_{i=1}^{n} X_i \right),
\end{align*}

which, when set to 0, implies

\[
\hat\mu = \frac{1}{n} \sum_{i=1}^{n} X_i.
\]

The partial derivative of the log likelihood with respect to $\Sigma$ is

\begin{align*}
\frac{\partial}{\partial \Sigma} \log L &= -\frac{n}{2} \frac{1}{|\Sigma|} |\Sigma| \Sigma^{-1} - \frac{1}{2} \sum_{i=1}^{n} \left( -\Sigma^{-1} (X_i - \mu)(X_i - \mu)^T \Sigma^{-1} \right) \\
&= -\frac{n}{2} \Sigma^{-1} + \frac{1}{2} \Sigma^{-1} \left( \sum_{i=1}^{n} (X_i - \mu)(X_i - \mu)^T \right) \Sigma^{-1}.
\end{align*}

Equating this to 0 and plugging in the estimate $\hat\mu$, we see that the estimator $\hat\Sigma$ must solve the equation

\[
n \hat\Sigma^{-1} = \hat\Sigma^{-1} \left( \sum_{i=1}^{n} (X_i - \hat\mu)(X_i - \hat\mu)^T \right) \hat\Sigma^{-1}.
\]

It is easy to see that a solution for $\hat\Sigma$ is

\[
\hat\Sigma = \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat\mu)(X_i - \hat\mu)^T,
\]

which is known as the sample covariance matrix.
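A short numpy sketch illustrating these two estimators on simulated data; the dimension, sample size and true parameters below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: n i.i.d. samples from a 3-dimensional normal.
n, d = 5_000, 3
mu_true = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(d, d))
Sigma_true = A @ A.T + d * np.eye(d)      # a symmetric positive definite matrix
X = rng.multivariate_normal(mu_true, Sigma_true, size=n)

# MLE: empirical mean and the (biased, 1/n) sample covariance matrix.
mu_hat = X.mean(axis=0)
centered = X - mu_hat
Sigma_hat = centered.T @ centered / n     # sum of outer products divided by n

print(mu_hat)
print(Sigma_hat)                          # compare with np.cov(X.T, bias=True)
```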

2.2 MAP under the conjugate prior

The conjugate prior for the mean and covariance of a multivariate normal is sometimes called the Normal-inverse-Wishart distribution and has the density

\[
f(\mu, \Sigma \mid \mu_0, \beta, \Psi, \nu) = p(\mu \mid \mu_0, \beta\Sigma)\, w(\Sigma \mid \Psi, \nu)
\]

where $p$ is the density of the multivariate normal distribution, and $w$ is the density of the inverse-Wishart distribution given by

\[
w(\Sigma \mid \Psi, \nu) = \frac{|\Psi|^{\nu/2}}{2^{\nu d/2} \Gamma_d\!\left(\frac{\nu}{2}\right)} |\Sigma|^{-(\nu + d + 1)/2} \exp\left( -\frac{1}{2} \operatorname{Tr}(\Psi \Sigma^{-1}) \right)
\]

where $\Gamma_d$ is the multivariate Gamma function, and $\operatorname{Tr}(\cdot)$ denotes the trace of a matrix. The parameters of the Normal-inverse-Wishart are $\mu_0 \in \mathbb{R}^d$, $\beta \in \mathbb{R}_+$, $\Psi \in \mathbb{R}^{d \times d}$ positive definite, and $\nu \in \mathbb{R}$ with $\nu > d - 1$. We need to

1. calculate the posterior distribution of $\mu$ and $\Sigma$ assuming this prior and $n$ observations $X_1, \dots, X_n \in \mathbb{R}^d$;

2. convince ourselves that the posterior is indeed a Normal-inverse-Wishart distribution and find its parameters;

3. and finally find the values for $\mu$ and $\Sigma$ that maximize the posterior distribution.

2.2.1 Calculating the posterior distribution

We have

\[
f(\mu, \Sigma \mid \mu_0, \beta, \Psi, \nu, X_1, \dots, X_n) \propto \left( \prod_{i=1}^{n} p(X_i \mid \mu, \Sigma) \right) f(\mu, \Sigma \mid \mu_0, \beta, \Psi, \nu),
\]

where we omit the term in the denominator, which is a finite, non-zero constant that doesn't depend on $\mu$ or $\Sigma$, since any such term does not affect the shape of the posterior distribution and only factors into the normalizing constant that ensures the posterior integrates to 1. Continuing from above,

\begin{align}
f(\mu, \Sigma \mid \mu_0, \beta, \Psi, \nu, X_1, \dots, X_n)
&\propto \prod_{i=1}^{n} \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (X_i - \mu)^T \Sigma^{-1} (X_i - \mu) \right) \times \tag{2} \\
&\quad\;\; \frac{1}{(2\pi)^{d/2} |\beta\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (\mu - \mu_0)^T (\beta\Sigma)^{-1} (\mu - \mu_0) \right) \times \notag \\
&\quad\;\; \frac{|\Psi|^{\nu/2}}{2^{\nu d/2} \Gamma_d\!\left(\frac{\nu}{2}\right)} |\Sigma|^{-(\nu + d + 1)/2} \exp\left( -\frac{1}{2} \operatorname{Tr}(\Psi \Sigma^{-1}) \right) \tag{3} \\
&\propto \exp\left\{ -\frac{1}{2} \sum_{i=1}^{n} (X_i - \mu)^T \Sigma^{-1} (X_i - \mu) \right\} \times \notag \\
&\quad\;\; \exp\left( -\frac{1}{2} (\mu - \mu_0)^T (\beta\Sigma)^{-1} (\mu - \mu_0) \right) \times \notag \\
&\quad\;\; |\Sigma|^{-(\nu + n + d + 2)/2} \exp\left( -\frac{1}{2} \operatorname{Tr}(\Psi \Sigma^{-1}) \right). \tag{4}
\end{align}

While messy, this fully defines the posterior distribution.
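As a sanity check, expression (4) is straightforward to evaluate numerically. The sketch below is a direct transcription into numpy; the data and hyperparameters in the example call are made up for illustration.

```python
import numpy as np

def log_unnormalized_posterior(mu, Sigma, X, mu0, beta, Psi, nu):
    """Log of expression (4): the posterior density up to an additive constant."""
    n, d = X.shape
    Sigma_inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    diff = X - mu
    quad_data = np.einsum("ij,jk,ik->", diff, Sigma_inv, diff)   # sum_i (X_i-mu)^T Sigma^{-1} (X_i-mu)
    quad_prior = (mu - mu0) @ Sigma_inv @ (mu - mu0) / beta      # (mu-mu0)^T (beta*Sigma)^{-1} (mu-mu0)
    trace_term = np.trace(Psi @ Sigma_inv)
    return -0.5 * (quad_data + quad_prior + trace_term) - 0.5 * (nu + n + d + 2) * logdet

# Made-up inputs, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
print(log_unnormalized_posterior(np.zeros(2), np.eye(2), X,
                                 mu0=np.zeros(2), beta=1.0, Psi=np.eye(2), nu=3.0))
```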

2.2.2 The posterior is a Normal-inverse-Wishart distribution

Our goal now is to find $\mu_0'$, $\beta'$, $\Psi'$, $\nu'$ such that (4) looks like a Normal-inverse-Wishart density with those parameters. First we write out explicitly the form of the Normal-inverse-Wishart density, up to constants, for these as yet unknown parameters:

\begin{align}
f(\mu, \Sigma \mid \mu_0', \beta', \Psi', \nu')
&\propto \exp\left( -\frac{1}{2} (\mu - \mu_0')^T (\beta'\Sigma)^{-1} (\mu - \mu_0') \right) \times \notag \\
&\quad\;\; |\Sigma|^{-(\nu' + d + 2)/2} \exp\left( -\frac{1}{2} \operatorname{Tr}(\Psi' \Sigma^{-1}) \right). \tag{5}
\end{align}

It is immediately clear that the only value $\nu'$ can take to satisfy $f(\mu, \Sigma \mid \mu_0, \beta, \Psi, \nu, X_1, \dots, X_n) = f(\mu, \Sigma \mid \mu_0', \beta', \Psi', \nu')$ is

\[
\nu' = \nu + n.
\]

Furthermore, we see that for this value of $\nu'$ the term involving $|\Sigma|$ in (4) is accounted for in (5).

So now we only need to find $\mu_0'$, $\beta'$, $\Psi'$ such that

\[
\exp\left\{ -\frac{1}{2} \sum_{i=1}^{n} (X_i - \mu)^T \Sigma^{-1} (X_i - \mu) \right\} \exp\left( -\frac{1}{2} (\mu - \mu_0)^T (\beta\Sigma)^{-1} (\mu - \mu_0) \right) \exp\left( -\frac{1}{2} \operatorname{Tr}(\Psi \Sigma^{-1}) \right) \tag{6}
\]

is equal to

\[
\exp\left( -\frac{1}{2} (\mu - \mu_0')^T (\beta'\Sigma)^{-1} (\mu - \mu_0') \right) \exp\left( -\frac{1}{2} \operatorname{Tr}(\Psi' \Sigma^{-1}) \right) \tag{7}
\]

up to a multiplicative constant independent of $\mu$ and $\Sigma$ (note that the terms involving $|\Sigma|$ are already equalized and needn't be considered any more).

By taking the log of each quantity and multiplying by $-2$, we see that the above problem is equivalent to finding $\mu_0'$, $\beta'$, $\Psi'$ such that

\[
\sum_{i=1}^{n} (X_i - \mu)^T \Sigma^{-1} (X_i - \mu) + (\mu - \mu_0)^T (\beta\Sigma)^{-1} (\mu - \mu_0) + \operatorname{Tr}(\Psi \Sigma^{-1}) \tag{8}
\]

is equal to

\[
(\mu - \mu_0')^T (\beta'\Sigma)^{-1} (\mu - \mu_0') + \operatorname{Tr}(\Psi' \Sigma^{-1}) \tag{9}
\]

up to an additive constant independent of $\mu$ and $\Sigma$.

Defining $S = \sum_{i=1}^{n} X_i$, we can rewrite (8) as

\begin{align}
& \sum_{i=1}^{n} (X_i - \mu)^T \Sigma^{-1} (X_i - \mu) + (\mu - \mu_0)^T (\beta\Sigma)^{-1} (\mu - \mu_0) + \operatorname{Tr}(\Psi \Sigma^{-1}) \notag \\
&\quad = \frac{1}{\beta} \mu^T \Sigma^{-1} \mu - 2 \frac{1}{\beta} \mu^T \Sigma^{-1} \mu_0 + \frac{1}{\beta} \mu_0^T \Sigma^{-1} \mu_0 + \sum_{i=1}^{n} X_i^T \Sigma^{-1} X_i - 2 \mu^T \Sigma^{-1} S + n \mu^T \Sigma^{-1} \mu + \operatorname{Tr}(\Psi \Sigma^{-1}) \notag \\
&\quad = \left( \frac{1}{\beta} + n \right) \mu^T \Sigma^{-1} \mu - 2 \mu^T \Sigma^{-1} \left( \frac{1}{\beta} \mu_0 + S \right) \tag{10} \\
&\qquad + \frac{1}{\beta} \mu_0^T \Sigma^{-1} \mu_0 + \sum_{i=1}^{n} X_i^T \Sigma^{-1} X_i + \operatorname{Tr}(\Psi \Sigma^{-1}). \tag{11}
\end{align}

Notice that each appearance of $\mu$ is now in the two terms on line (10), which look like the first two terms of the expansion of a quadratic form similar to the first term in (9). In order to "complete the square", we must set

\[
\beta' = \frac{1}{\frac{1}{\beta} + n}
\]

so that the first term on line (10) is equal to the quadratic term (in $\mu$) from the expansion of $(\mu - \mu_0')^T (\beta'\Sigma)^{-1} (\mu - \mu_0')$. The linear term in $\mu$ on line (10) now implies that we must set

\[
\mu_0' = \beta' \left( \frac{1}{\beta} \mu_0 + S \right) = \frac{\frac{1}{\beta} \mu_0 + S}{\frac{1}{\beta} + n}.
\]

Continuing from line (11) we have

\begin{align*}
& \left( \frac{1}{\beta} + n \right) \mu^T \Sigma^{-1} \mu - 2 \mu^T \Sigma^{-1} \left( \frac{1}{\beta} \mu_0 + S \right) + \frac{1}{\beta} \mu_0^T \Sigma^{-1} \mu_0 + \sum_{i=1}^{n} X_i^T \Sigma^{-1} X_i + \operatorname{Tr}(\Psi \Sigma^{-1}) = \\
&\quad = \mu^T (\beta'\Sigma)^{-1} \mu - 2 \mu^T (\beta'\Sigma)^{-1} \mu_0' + \mu_0'^T (\beta'\Sigma)^{-1} \mu_0' - \mu_0'^T (\beta'\Sigma)^{-1} \mu_0' \\
&\qquad + \frac{1}{\beta} \mu_0^T \Sigma^{-1} \mu_0 + \sum_{i=1}^{n} X_i^T \Sigma^{-1} X_i + \operatorname{Tr}(\Psi \Sigma^{-1}) = \\
&\quad = (\mu - \mu_0')^T (\beta'\Sigma)^{-1} (\mu - \mu_0') - \mu_0'^T (\beta'\Sigma)^{-1} \mu_0' + \frac{1}{\beta} \mu_0^T \Sigma^{-1} \mu_0 + \sum_{i=1}^{n} X_i^T \Sigma^{-1} X_i + \operatorname{Tr}(\Psi \Sigma^{-1}),
\end{align*}

and since the $(\mu - \mu_0')^T (\beta'\Sigma)^{-1} (\mu - \mu_0')$ term is completed we can drop it from the last line and from (9). That is, we now only need to find $\Psi'$ such that

\[
-\mu_0'^T (\beta'\Sigma)^{-1} \mu_0' + \frac{1}{\beta} \mu_0^T \Sigma^{-1} \mu_0 + \sum_{i=1}^{n} X_i^T \Sigma^{-1} X_i + \operatorname{Tr}(\Psi \Sigma^{-1}) \tag{12}
\]

equals $\operatorname{Tr}(\Psi' \Sigma^{-1})$ up to constants. Continuing from (12):

\begin{align}
& -\operatorname{Tr}\!\left( \mu_0'^T (\beta'\Sigma)^{-1} \mu_0' \right) + \operatorname{Tr}\!\left( \frac{1}{\beta} \mu_0^T \Sigma^{-1} \mu_0 \right) + \sum_{i=1}^{n} \operatorname{Tr}\!\left( X_i^T \Sigma^{-1} X_i \right) + \operatorname{Tr}(\Psi \Sigma^{-1}) = \notag \\
&\quad = -\operatorname{Tr}\!\left( \frac{1}{\beta'} \mu_0' \mu_0'^T \Sigma^{-1} \right) + \operatorname{Tr}\!\left( \frac{1}{\beta} \mu_0 \mu_0^T \Sigma^{-1} \right) + \sum_{i=1}^{n} \operatorname{Tr}\!\left( X_i X_i^T \Sigma^{-1} \right) + \operatorname{Tr}(\Psi \Sigma^{-1}) = \notag \\
&\quad = \operatorname{Tr}\!\left( \left[ \Psi + \frac{1}{\beta} \mu_0 \mu_0^T + \sum_{i=1}^{n} X_i X_i^T - \frac{1}{\beta'} \mu_0' \mu_0'^T \right] \Sigma^{-1} \right), \tag{13}
\end{align}

so we set

\[
\Psi' = \Psi + \frac{1}{\beta} \mu_0 \mu_0^T + \sum_{i=1}^{n} X_i X_i^T - \frac{1}{\beta'} \mu_0' \mu_0'^T.
\]

After simplifying this a bit, you should be able to recognize the empirical mean and covariance matrix:

\[
\Psi' = \Psi + \sum_{i=1}^{n} \left( X_i - \tfrac{1}{n} S \right) \left( X_i - \tfrac{1}{n} S \right)^T + \frac{n/\beta}{\frac{1}{\beta} + n} \left( \tfrac{1}{n} S - \mu_0 \right) \left( \tfrac{1}{n} S - \mu_0 \right)^T.
\]
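The update rules derived in this section are easy to transcribe into code. The numpy sketch below (with made-up prior hyperparameters and data) computes $\beta'$, $\mu_0'$, $\Psi'$ and $\nu'$ and checks numerically that the two expressions for $\Psi'$ above agree.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up prior hyperparameters and data, purely for illustration.
d, n = 2, 200
mu0, beta = np.zeros(d), 4.0
Psi, nu = np.eye(d), d + 2.0
X = rng.multivariate_normal([1.0, -1.0], np.eye(d), size=n)

S = X.sum(axis=0)
xbar = S / n

# Posterior (Normal-inverse-Wishart) parameters derived in the text.
beta_post = 1.0 / (1.0 / beta + n)
mu0_post = beta_post * (mu0 / beta + S)
nu_post = nu + n

# Psi' in the "raw" form ...
Psi_post_raw = (Psi + np.outer(mu0, mu0) / beta + X.T @ X
                - np.outer(mu0_post, mu0_post) / beta_post)

# ... and in the simplified form with the empirical mean and covariance.
centered = X - xbar
shrink = (n / beta) / (1.0 / beta + n)        # equals n / (1 + n * beta)
Psi_post = Psi + centered.T @ centered + shrink * np.outer(xbar - mu0, xbar - mu0)

print(np.allclose(Psi_post_raw, Psi_post))    # True: the two forms agree
```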

2.2.3 Maximizing the posterior

In the last section we showed that the posterior distribution of $\mu$ and $\Sigma$ after $n$ observations under a Normal-inverse-Wishart prior is again a Normal-inverse-Wishart distribution with parameters

\[
\mu_0' = \frac{\frac{1}{\beta} \mu_0 + S}{\frac{1}{\beta} + n},
\]

\[
\beta' = \frac{1}{\frac{1}{\beta} + n},
\]

\[
\Psi' = \Psi + \sum_{i=1}^{n} \left( X_i - \tfrac{1}{n} S \right) \left( X_i - \tfrac{1}{n} S \right)^T + \frac{n/\beta}{\frac{1}{\beta} + n} \left( \tfrac{1}{n} S - \mu_0 \right) \left( \tfrac{1}{n} S - \mu_0 \right)^T,
\]

and

\[
\nu' = \nu + n,
\]

where $S = \sum_{i=1}^{n} X_i$. Recall that the density of this distribution is

\[
f(\mu, \Sigma \mid \mu_0', \beta', \Psi', \nu') = p(\mu \mid \mu_0', \beta'\Sigma)\, w(\Sigma \mid \Psi', \nu') \tag{14}
\]

where $p$ is the density of the multivariate normal distribution, and $w$ is the density of the inverse-Wishart distribution given by

\[
w(\Sigma \mid \Psi', \nu') = \frac{|\Psi'|^{\nu'/2}}{2^{\nu' d/2} \Gamma_d\!\left(\frac{\nu'}{2}\right)} |\Sigma|^{-(\nu' + d + 1)/2} \exp\left( -\frac{1}{2} \operatorname{Tr}(\Psi' \Sigma^{-1}) \right).
\]

Since $\mu$ only appears in the first term on the right-hand side of (14), it is obvious that the value of $\mu$ that maximizes the posterior is also the value that maximizes $p(\mu \mid \mu_0', \beta'\Sigma)$, which, of course, is $\mu_0'$:

\[
\hat\mu = \frac{\frac{1}{\beta} \mu_0 + S}{\frac{1}{\beta} + n}.
\]

We can think of this quantity as the maximum likelihood estimator for the mean of the normal distribution if, along with $X_1, \dots, X_n$, we had also observed $1/\beta$-many samples, all equal to $\mu_0$. Maximizing the posterior with respect to $\Sigma$ is equivalent to minimizing

\[
J(\Sigma) = (\nu' + d + 2) \log |\Sigma| + (\hat\mu - \mu_0')^T (\beta'\Sigma)^{-1} (\hat\mu - \mu_0') + \operatorname{Tr}(\Psi' \Sigma^{-1}), \tag{15}
\]

where we have plugged in the MAP value for $\mu$. Its derivative is

\[
\frac{\partial J}{\partial \Sigma} = (\nu' + d + 2) \Sigma^{-1} - \frac{1}{\beta'} \Sigma^{-1} (\hat\mu - \mu_0')(\hat\mu - \mu_0')^T \Sigma^{-1} - \Sigma^{-1} \Psi' \Sigma^{-1}, \tag{16}
\]

and the minimizer of $J$ (i.e. the MAP estimate of $\Sigma$) is

\[
\hat\Sigma = \frac{1}{\beta'(\nu' + d + 2)} (\hat\mu - \mu_0')(\hat\mu - \mu_0')^T + \frac{1}{\nu' + d + 2} \Psi'. \tag{17}
\]

Since $\hat\mu = \mu_0'$, the first term in (17) vanishes, so in fact $\hat\Sigma = \Psi' / (\nu' + d + 2)$.
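Finally, a small numpy sketch of the resulting MAP estimates. The hyperparameters and simulated data are made up; the first term in (17) is kept in the code only to mirror the formula, even though it vanishes because $\hat\mu = \mu_0'$.

```python
import numpy as np

def niw_map(X, mu0, beta, Psi, nu):
    """MAP estimates of (mu, Sigma) under the Normal-inverse-Wishart prior above."""
    n, d = X.shape
    S = X.sum(axis=0)

    # Posterior parameters from Section 2.2.2.
    beta_post = 1.0 / (1.0 / beta + n)
    mu0_post = beta_post * (mu0 / beta + S)
    nu_post = nu + n
    xbar = S / n
    centered = X - xbar
    Psi_post = (Psi + centered.T @ centered
                + (n / beta) / (1.0 / beta + n) * np.outer(xbar - mu0, xbar - mu0))

    # MAP values: mu_hat = mu0_post; Sigma_hat from (17) (its first term is zero
    # because mu_hat equals mu0_post).
    mu_hat = mu0_post
    Sigma_hat = (np.outer(mu_hat - mu0_post, mu_hat - mu0_post) / beta_post
                 + Psi_post) / (nu_post + d + 2)
    return mu_hat, Sigma_hat

# Made-up data and hyperparameters, for illustration only.
rng = np.random.default_rng(3)
X = rng.multivariate_normal([2.0, 0.0], [[1.0, 0.4], [0.4, 2.0]], size=500)
mu_hat, Sigma_hat = niw_map(X, mu0=np.zeros(2), beta=10.0, Psi=np.eye(2), nu=4.0)
print(mu_hat, Sigma_hat, sep="\n")
```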
