USEFUL PROPERTIES OF THE MULTIVARIATE NORMAL*

3.1. Conditionals and marginals

For Bayesian analysis it is very useful to understand how to write joint, marginal, and conditional distributions for the multivariate normal. Given a vector $x \in \mathbb{R}^p$ the multivariate normal density is
\[
f(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\left( -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right).
\]
Now split the vector into two parts,
\[
x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \text{of size} \ \begin{pmatrix} q \times 1 \\ (p-q) \times 1 \end{pmatrix},
\]
and
\[
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \quad \text{of size} \ \begin{pmatrix} q \times q & q \times (p-q) \\ (p-q) \times q & (p-q) \times (p-q) \end{pmatrix}.
\]
We now state the joint and marginal distributions,
\[
x_1 \sim N(\mu_1, \Sigma_{11}), \qquad x_2 \sim N(\mu_2, \Sigma_{22}), \qquad x \sim N(\mu, \Sigma),
\]
and the conditional density
\[
x_1 \mid x_2 \sim N\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right).
\]
The same idea holds for other sizes of partitions.

3.2. Conjugate priors

3.2.1. Univariate normals

3.2.1.1. Fixed variance, random mean. We consider the parameter $\sigma^2$ fixed, so we are interested in the conjugate prior for $\mu$:
\[
\pi(\mu \mid \mu_0, \sigma_0^2) \propto \frac{1}{\sigma_0} \exp\left( -\frac{1}{2\sigma_0^2}(\mu - \mu_0)^2 \right),
\]
where $\mu_0$ and $\sigma_0^2$ are hyper-parameters for the prior distribution (when we don't have informative prior knowledge we typically take $\mu_0 = 0$ and $\sigma_0^2$ large).

The posterior distribution for $x_1, \ldots, x_n$ with a univariate normal likelihood and the above prior will be
\[
\mu \mid x_1, \ldots, x_n \sim N\left( \frac{\sigma_0^2}{\sigma^2/n + \sigma_0^2}\,\bar{x} + \frac{\sigma^2/n}{\sigma^2/n + \sigma_0^2}\,\mu_0,\;\; \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \right)^{-1} \right).
\]

3.2.1.2. Fixed mean, random variance. We will formulate this setting with two parameterizations of the scale parameter: (1) the variance $\sigma^2$, (2) the precision $\tau = 1/\sigma^2$. The two conjugate distributions are the inverse Gamma for $\sigma^2$ and the Gamma for $\tau$ (really they are the same distribution, just reparameterized):
\[
\mathrm{IG}(\alpha,\beta): \ f(\sigma^2) = \frac{\beta^\alpha}{\Gamma(\alpha)} (\sigma^2)^{-\alpha-1} \exp\left(-\beta(\sigma^2)^{-1}\right),
\qquad
\mathrm{Ga}(\alpha,\beta): \ f(\tau) = \frac{\beta^\alpha}{\Gamma(\alpha)} \tau^{\alpha-1} \exp(-\beta\tau).
\]
The posterior distribution of $\sigma^2$ is
\[
\sigma^2 \mid x_1, \ldots, x_n \sim \mathrm{IG}\left( \alpha + \frac{n}{2},\; \beta + \frac{1}{2}\sum_i (x_i - \mu)^2 \right).
\]
The posterior distribution of $\tau$ is, not surprisingly,
\[
\tau \mid x_1, \ldots, x_n \sim \mathrm{Ga}\left( \alpha + \frac{n}{2},\; \beta + \frac{1}{2}\sum_i (x_i - \mu)^2 \right).
\]

3.2.1.3. Random mean, random variance. We now put the previous priors together in what is called a Bayesian hierarchical model:
\begin{align*}
x_i \mid \mu, \tau &\overset{iid}{\sim} N(\mu, \tau^{-1}) \\
\mu \mid \tau &\sim N(\mu_0, (\kappa_0\tau)^{-1}) \\
\tau &\sim \mathrm{Ga}(\alpha, \beta).
\end{align*}
For the above likelihood and priors the posterior distribution for the mean and precision is
\[
\mu \mid \tau, x_1, \ldots, x_n \sim N\left( \frac{\mu_0\kappa_0 + n\bar{x}}{n + \kappa_0},\; \left(\tau(n + \kappa_0)\right)^{-1} \right),
\]
\[
\tau \mid x_1, \ldots, x_n \sim \mathrm{Ga}\left( \alpha + \frac{n}{2},\; \beta + \frac{1}{2}\sum_i (x_i - \bar{x})^2 + \frac{\kappa_0\, n\,(\bar{x} - \mu_0)^2}{2(\kappa_0 + n)} \right).
\]

3.2.2. Multivariate normal

Given a vector $x \in \mathbb{R}^p$ the multivariate normal density is
\[
f(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\left( -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right).
\]
We will work with the precision matrix instead of the covariance, and we will consider the following Bayesian hierarchical model:
\begin{align*}
x_i \mid \mu, \Lambda &\overset{iid}{\sim} N(\mu, \Lambda^{-1}) \\
\mu \mid \Lambda &\sim N(\mu_0, (\kappa_0\Lambda)^{-1}) \\
\Lambda &\sim \mathrm{Wi}(\Lambda_0, n_0),
\end{align*}
where the precision matrix is modeled using the Wishart distribution, with density
\[
f(\Lambda; V, n) = \frac{|\Lambda|^{(n-d-1)/2} \exp\left( -\tfrac{1}{2}\,\mathrm{tr}(\Lambda V^{-1}) \right)}{2^{nd/2}\,|V|^{n/2}\,\Gamma_d(n/2)},
\]
where $d$ is the dimension of $\Lambda$. For the above likelihood and priors the posterior distribution for the mean and precision is
\[
\mu \mid \Lambda, x_1, \ldots, x_n \sim N\left( \frac{\mu_0\kappa_0 + n\bar{x}}{n + \kappa_0},\; \left(\Lambda(n + \kappa_0)\right)^{-1} \right),
\]
\[
\Lambda \mid x_1, \ldots, x_n \sim \mathrm{Wi}\left( \Lambda_n,\; n_0 + n \right), \qquad
\Lambda_n = \left( \Lambda_0^{-1} + \bar{\Sigma} + \frac{\kappa_0\, n}{\kappa_0 + n}(\bar{x} - \mu_0)(\bar{x} - \mu_0)^T \right)^{-1},
\]
where $\bar{\Sigma} = \sum_i (x_i - \bar{x})(x_i - \bar{x})^T$ is the centered scatter matrix of the data.
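The conditioning formula above is easy to apply directly. The following Python sketch (illustrative only, not part of the original notes; it assumes NumPy is available) computes the conditional mean and covariance of $x_1 \mid x_2$ from a partitioned mean vector and covariance matrix.

```python
import numpy as np

def mvn_conditional(mu, Sigma, x2, q):
    """Parameters of x1 | x2 for a normal vector partitioned after index q.

    mu    : (p,) mean vector
    Sigma : (p, p) covariance matrix
    x2    : (p - q,) observed value of the second block
    q     : size of the first block
    """
    mu1, mu2 = mu[:q], mu[q:]
    S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
    S21, S22 = Sigma[q:, :q], Sigma[q:, q:]
    # Apply Sigma_22^{-1} via a linear solve instead of forming the inverse explicitly.
    cond_mean = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
    cond_cov = S11 - S12 @ np.linalg.solve(S22, S21)
    return cond_mean, cond_cov

# Small example with p = 3, q = 2: condition the first two coordinates on the third.
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
m, C = mvn_conditional(mu, Sigma, x2=np.array([2.5]), q=2)
print(m, C)
```

Using `np.linalg.solve` rather than an explicit matrix inverse is the numerically preferable way to evaluate the $\Sigma_{22}^{-1}$ terms in the formula.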
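The conjugate updates of Section 3.2.1.3 likewise reduce to a few lines of arithmetic. The sketch below is a hypothetical helper, not code from the notes; it returns the posterior hyper-parameters $(\mu_n, \kappa_n, \alpha_n, \beta_n)$ of the Normal-Gamma model stated above.

```python
import numpy as np

def normal_gamma_posterior(x, mu0, kappa0, alpha, beta):
    """Posterior hyper-parameters for the hierarchical model
       x_i | mu, tau ~ N(mu, 1/tau),  mu | tau ~ N(mu0, 1/(kappa0*tau)),  tau ~ Ga(alpha, beta)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    # mu | tau, x ~ N(mu_n, 1 / (tau * kappa_n))
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    # tau | x ~ Ga(alpha_n, beta_n)
    alpha_n = alpha + n / 2.0
    beta_n = (beta
              + 0.5 * np.sum((x - xbar) ** 2)
              + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n))
    return mu_n, kappa_n, alpha_n, beta_n

# Example: a weak prior updated with a small sample.
x = np.array([1.2, 0.8, 1.5, 0.9, 1.1])
print(normal_gamma_posterior(x, mu0=0.0, kappa0=1.0, alpha=2.0, beta=2.0))
```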
LECTURE 4

A Bayesian approach to linear regression

The main motivations behind a Bayesian formalism for inference are a coherent approach to modeling uncertainty as well as an axiomatic framework for inference. We will reformulate multivariate linear regression in a Bayesian framework in this section. Bayesian inference involves thinking in terms of probability distributions and conditional distributions. One important idea is that of a conjugate prior. Another tool we will use extensively in this class is the multivariate normal distribution and its properties.

4.1. Conjugate priors

Given a likelihood function $p(x \mid \theta)$ and a prior $\pi(\theta)$ one can write the posterior as
\[
p(\theta \mid x) = \frac{p(x \mid \theta)\,\pi(\theta)}{\int_\theta p(x \mid \theta')\,\pi(\theta')\,d\theta'} = \frac{p(x, \theta)}{p(x)},
\]
where $p(x)$ is the marginal density for the data and $p(x, \theta)$ is the joint density of the data and the parameter $\theta$. The idea of a prior and likelihood being conjugate is that the prior and the posterior densities belong to the same family. We now state some examples to illustrate this idea.

Beta, Binomial: Consider the Binomial likelihood with $n$ (the number of trials) fixed,
\[
f(x \mid p, n) = \binom{n}{x} p^x (1-p)^{n-x},
\]
where the parameter of interest (the probability of a success) is $p \in [0, 1]$. A natural prior distribution for $p$ is the Beta distribution, which has density
\[
\pi(p; \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha-1}(1-p)^{\beta-1}, \qquad p \in (0, 1), \ \alpha, \beta > 0,
\]
where $\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}$ is a normalization constant. Given the prior and the likelihood densities, the posterior density modulo normalizing constants will take the form
\begin{align*}
f(p \mid x) &\propto \binom{n}{x} p^x (1-p)^{n-x} \times \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha-1}(1-p)^{\beta-1} \\
&\propto p^{x+\alpha-1}(1-p)^{n-x+\beta-1},
\end{align*}
which means that the posterior distribution of $p$ is also a Beta with
\[
p \mid x \sim \mathrm{Beta}(\alpha + x,\; \beta + n - x).
\]

Normal, Normal: Given a normal distribution with unknown mean, the density for the likelihood is
\[
f(x \mid \theta, \sigma^2) \propto \exp\left( -\frac{1}{2\sigma^2}(x - \theta)^2 \right),
\]
and one can specify a normal prior
\[
\pi(\theta; \theta_0, \tau_0^2) \propto \exp\left( -\frac{1}{2\tau_0^2}(\theta - \theta_0)^2 \right),
\]
with hyper-parameters $\theta_0$ and $\tau_0^2$. The resulting posterior distribution will have the following density function
\[
f(\theta \mid x) \propto \exp\left( -\frac{1}{2\sigma^2}(x - \theta)^2 \right) \times \exp\left( -\frac{1}{2\tau_0^2}(\theta - \theta_0)^2 \right),
\]
which after completing squares and reordering can be written as
\[
\theta \mid x \sim N(\theta_1, \tau_1^2), \qquad
\theta_1 = \frac{\dfrac{\theta_0}{\tau_0^2} + \dfrac{x}{\sigma^2}}{\dfrac{1}{\tau_0^2} + \dfrac{1}{\sigma^2}}, \qquad
\tau_1^2 = \frac{1}{\dfrac{1}{\tau_0^2} + \dfrac{1}{\sigma^2}}.
\]

4.2. Bayesian linear regression

We start with the likelihood
\[
f(Y \mid X, \beta, \sigma^2) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{\|y_i - \beta^T x_i\|^2}{2\sigma^2} \right)
\]
and the prior
\[
\pi(\beta) \propto \exp\left( -\frac{1}{2\tau_0^2}\, \beta^T\beta \right).
\]
The density of the posterior is
\[
\mathrm{Post}(\beta \mid D) \propto \left[ \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{\|y_i - \beta^T x_i\|^2}{2\sigma^2} \right) \right] \times \exp\left( -\frac{1}{2\tau_0^2}\, \beta^T\beta \right).
\]
With a good bit of manipulation the above can be rewritten as a multivariate normal distribution
\[
\beta \mid Y, X, \sigma^2 \sim N_p(\mu_1, \Sigma_1)
\]
with
\[
\Sigma_1 = \left( \tau_0^{-2} I_p + \sigma^{-2} X^T X \right)^{-1}, \qquad \mu_1 = \sigma^{-2}\, \Sigma_1 X^T Y.
\]
Note the similarities of the above distribution to the MAP estimator. Relate the mean of the above distribution to the MAP estimator.

Predictive distribution: Given data $D = \{(x_i, y_i)\}_{i=1}^n$ and a new value $x^*$, one would like to estimate $y^*$. This can be done using the posterior and is called the posterior predictive distribution,
\[
f(y^* \mid D, x^*, \sigma^2, \tau_0^2) = \int_{\mathbb{R}^p} f(y^* \mid x^*, \beta, \sigma^2)\, f(\beta \mid Y, X, \sigma^2, \tau_0^2)\, d\beta,
\]
where with some manipulation
\[
y^* \mid D, x^*, \sigma^2, \tau_0^2 \sim N(\mu^*, \sigma^{*2}),
\]
with
\[
\mu^* = \frac{1}{\sigma^2}\, x^{*T} \Sigma_1 X^T Y = x^{*T} \mu_1, \qquad \sigma^{*2} = \sigma^2 + x^{*T} \Sigma_1 x^*.
\]
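As a concrete illustration of the closed-form posterior $(\mu_1, \Sigma_1)$ and the posterior predictive distribution, here is a short NumPy sketch (again illustrative, not part of the original notes); the noise variance $\sigma^2$ and prior variance $\tau_0^2$ are treated as known, as in the derivation above, and the data are simulated for the example.

```python
import numpy as np

def blr_posterior(X, y, sigma2, tau02):
    """Posterior N(mu1, Sigma1) for beta under prior N(0, tau02 * I) and noise variance sigma2."""
    p = X.shape[1]
    A = np.eye(p) / tau02 + X.T @ X / sigma2   # Sigma1^{-1}
    Sigma1 = np.linalg.inv(A)
    mu1 = Sigma1 @ X.T @ y / sigma2
    return mu1, Sigma1

def blr_predict(x_star, mu1, Sigma1, sigma2):
    """Posterior predictive mean and variance for a new input x_star."""
    mean = x_star @ mu1
    var = sigma2 + x_star @ Sigma1 @ x_star
    return mean, var

# Simulated example: y = X beta + noise.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
sigma2 = 0.25
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

mu1, Sigma1 = blr_posterior(X, y, sigma2=sigma2, tau02=10.0)
print("posterior mean:", mu1)
print("prediction at x* = [1, 1, 1]:", blr_predict(np.ones(p), mu1, Sigma1, sigma2))
```

The predictive variance splits into the irreducible noise $\sigma^2$ plus the term $x^{*T}\Sigma_1 x^*$, which reflects remaining uncertainty about $\beta$ and shrinks as more data are observed.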