Bayesian Statistics for Genetics, Lecture 4: Regression Models

Ken Rice

Summer Institute in Statistical Genetics

September 2017

Regression models: overview

How does an outcome Y vary as a function of x = {x1, . . . , xp}?

• What are the effect sizes?
• What is the effect of x1, in observations that have the same x2, x3, . . . , xp (a.k.a. “keeping these covariates constant”)?
• Can we predict Y as a function of x?

These questions can be assessed via a regression model p(y|x).

Regression models: overview

Parameters in a regression model can be estimated from data of the form:

$$\begin{array}{cccc} y_1 & x_{1,1} & \cdots & x_{1,p} \\ \vdots & \vdots & & \vdots \\ y_n & x_{n,1} & \cdots & x_{n,p} \end{array}$$

The data are often expressed in matrix/vector form:

$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad X = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} x_{1,1} & \cdots & x_{1,p} \\ \vdots & & \vdots \\ x_{n,1} & \cdots & x_{n,p} \end{pmatrix}$$

FTO example: design

FTO gene is hypothesized to be involved in growth and obesity.

Experimental design:

• 10 fto+/− mice
• 10 fto−/− mice
• Mice are sacrificed at the end of 1–5 weeks of age
• Two mice in each group are sacrificed at each age
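To make the layout concrete, here is a small R sketch of the design; the design object and the group labels are illustrative, not from the original code (the weights themselves are only shown graphically):

## Layout of the experiment: 2 genotype groups x 5 ages x 2 mice per cell
design <- expand.grid(mouse = 1:2, age = 1:5,
                      geno  = c("fto-/-", "fto+/-"))
table(design$geno, design$age)   # two mice per genotype-by-age cell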

FTO example: data

[Figure: weight (g) against age (weeks), for the fto−/− and fto+/− mice]

FTO example: analysis

• y = weight
• xg = indicator of fto heterozygote ∈ {0, 1} = number of “+” alleles
• xa = age in weeks ∈ {1, 2, 3, 4, 5}

How can we estimate p(y|xg, xa)?

Cell model:

genotype      age 1   age 2   age 3   age 4   age 5
 −/−          θ0,1    θ0,2    θ0,3    θ0,4    θ0,5
 +/−          θ1,1    θ1,2    θ1,3    θ1,4    θ1,5

Problem: 10 parameters to estimate, but only two observations per cell
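A minimal sketch of this cell-means model in R, assuming the weight, genotype and age vectors y, xg and xa defined above (the fit.cells name is just illustrative):

## Cell-means fit: one parameter per genotype-by-age cell (10 in all)
fit.cells <- lm(y ~ 0 + interaction(xg, xa))
length(coef(fit.cells))   # 10 cell means, each based on only two mice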

Linear regression

Solution: Assume smoothness as a function of age. For each group,

y = α0 + α1 xa + ε.

This is a linear regression model. Linearity means “linear in the parameters”, i.e. several covariates, each multiplied by a corresponding α, and added together.

A more complex model might assume e.g.

y = α0 + α1 xa + α2 xa² + α3 xa³ + ε

– but this is still a linear regression model, even with the age², age³ terms.
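Such a model is still fitted with ordinary lm(), since it remains linear in the α's – a quick sketch, assuming y and xa as above:

## Cubic-in-age model: still linear regression, because it is linear in the alphas
fit.cubic <- lm(y ~ xa + I(xa^2) + I(xa^3))
coef(fit.cubic)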

Multiple linear regression

With enough variables, we can describe the regressions for both groups simultaneously:

Yi = β1 xi,1 + β2 xi,2 + β3 xi,3 + β4 xi,4 + εi ,

where

• xi,1 = 1 for each subject i
• xi,2 = 0 if subject i is homozygous, 1 if heterozygous
• xi,3 = age of subject i
• xi,4 = xi,2 × xi,3

Note that under this model,

E[ Y|x ] = β1 + β3 × age, if x2 = 0, and
E[ Y|x ] = (β1 + β2) + (β3 + β4) × age, if x2 = 1.
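In R, this design matrix is what the formula xg*xa expands to; a quick sketch, assuming xg and xa as defined earlier – the column names are consistent with the coefficient names that appear in the output later:

## The four columns: intercept, genotype, age, and genotype x age
X <- model.matrix(~ xg * xa)
head(X)
colnames(X)   # "(Intercept)" "xg" "xa" "xg:xa"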

Multiple linear regression

For graphical thinkers...

[Figure: weight versus age, with fitted lines for each genotype under various constraints on the coefficients: β3 = β4 = 0; β2 = β4 = 0; β4 = 0]

Normal linear regression

How does each Yi vary around its mean E[ Yi|β, xi ]?

$$Y_i = \beta^T x_i + \epsilon_i, \qquad \epsilon_1, \ldots, \epsilon_n \sim \text{i.i.d. normal}(0, \sigma^2).$$

This assumption of Normal errors specifies the likelihood:

$$p(y_1, \ldots, y_n \mid x_1, \ldots, x_n, \beta, \sigma^2) = \prod_{i=1}^n p(y_i \mid x_i, \beta, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \beta^T x_i)^2\right\}.$$

Note: in large(r) samples, analysis is “robust” to the Normality assumption – but here we rely on the mean being linear in the x’s, and on the variance of the εi’s being constant with respect to x.

Matrix form

• Let y be the n-dimensional column vector (y1, . . . , yn)ᵀ;
• Let X be the n × p matrix whose ith row is xi.

Then the normal regression model is that

$$y \mid X, \beta, \sigma^2 \sim \text{multivariate normal}(X\beta,\; \sigma^2 I),$$

where I is the n × n identity matrix and

$$X\beta = \begin{pmatrix} \beta_1 x_{1,1} + \cdots + \beta_p x_{1,p} \\ \vdots \\ \beta_1 x_{n,1} + \cdots + \beta_p x_{n,p} \end{pmatrix} = \begin{pmatrix} \text{E}[\,Y_1 \mid \beta, x_1\,] \\ \vdots \\ \text{E}[\,Y_n \mid \beta, x_n\,] \end{pmatrix}.$$

Ordinary least squares estimation

What values of β are consistent with our data y,X?

Recall

$$p(y_1, \ldots, y_n \mid x_1, \ldots, x_n, \beta, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \beta^T x_i)^2\right\}.$$

This is big when $\mathrm{SSR}(\beta) = \sum_i (y_i - \beta^T x_i)^2$ is small – where we define

$$\mathrm{SSR}(\beta) = \sum_{i=1}^n (y_i - \beta^T x_i)^2 = (y - X\beta)^T(y - X\beta) = y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta.$$

What value of β makes this the smallest, i.e. gives fitted values closest to the outcomes y1, y2, ...yn?
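SSR(β) is easy to evaluate directly; a small sketch, where the trial value of β is hypothetical:

## Sum of squared residuals for a candidate coefficient vector
SSR <- function(beta, X, y){ sum( (y - X %*% beta)^2 ) }
SSR(c(0, 3, 3, 2), X, y)   # a hypothetical trial value of beta; smaller is better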

OLS: with calculus

‘Recall’ that...

1. a minimum of a function g(z) occurs at a value z such that d/dz g(z) = 0;
2. the derivative of g(z) = az is a, and the derivative of g(z) = bz² is 2bz.

Doing this for SSR...

$$\frac{d}{d\beta}\mathrm{SSR}(\beta) = \frac{d}{d\beta}\left(y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta\right) = -2X^Ty + 2X^TX\beta,$$

and so

$$\frac{d}{d\beta}\mathrm{SSR}(\beta) = 0 \;\Leftrightarrow\; -2X^Ty + 2X^TX\beta = 0 \;\Leftrightarrow\; X^TX\beta = X^Ty \;\Leftrightarrow\; \beta = (X^TX)^{-1}X^Ty.$$

$\hat\beta_{ols} = (X^TX)^{-1}X^Ty$ is the Ordinary Least Squares (OLS) estimator of β.

OLS: without calculus

The calculus-free, algebra-heavy version – which relies on knowing the answer in advance!

Writing $\Pi = X(X^TX)^{-1}X^T$, and noting that $X = \Pi X$ and $X\hat\beta_{ols} = \Pi y$:

$$(y - X\beta)^T(y - X\beta) = \left((I - \Pi)y + X(\hat\beta_{ols} - \beta)\right)^T\left((I - \Pi)y + X(\hat\beta_{ols} - \beta)\right) = y^T(I - \Pi)y + (\hat\beta_{ols} - \beta)^TX^TX(\hat\beta_{ols} - \beta),$$

because all the ‘cross terms’ are zero: (I − Π)X = 0.

Hence the value of β that minimizes the SSR – for a given set of data – is $\hat\beta_{ols}$.
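A quick numerical check of this projection argument, assuming the X and y used in the FTO example are in the workspace:

## Pi projects onto the column space of X, so Pi X = X and (I - Pi) Pi = 0
Pi <- X %*% solve(t(X) %*% X) %*% t(X)
max(abs(Pi %*% X - X))                   # ~ 0
max(abs((diag(nrow(X)) - Pi) %*% Pi))    # ~ 0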

OLS: using R

### OLS estimate
> beta.ols <- solve( t(X)%*%X )%*%t(X)%*%y
> beta.ols
                   [,1]
(Intercept) -0.06821632
xg           2.94485495
xa           2.84420729
xg:xa        1.72947648

### or less painfully, using lm()
> fit.ols <- lm( y ~ xg*xa )

> coef( summary(fit.ols) )
               Estimate Std. Error     t value     Pr(>|t|)
(Intercept) -0.06821632  1.4222970 -0.04796208 9.623401e-01
xg           2.94485495  2.0114316  1.46405917 1.625482e-01
xa           2.84420729  0.4288387  6.63234803 5.760923e-06
xg:xa        1.72947648  0.6064695  2.85171239 1.154001e-02

OLS: using R

[Figure: weight versus age, with the fitted regression lines for each genotype]

> coef( summary(fit.ols) )
               Estimate Std. Error     t value     Pr(>|t|)
(Intercept) -0.06821632  1.4222970 -0.04796208 9.623401e-01
xg           2.94485495  2.0114316  1.46405917 1.625482e-01
xa           2.84420729  0.4288387  6.63234803 5.760923e-06
xg:xa        1.72947648  0.6064695  2.85171239 1.154001e-02

Bayes for regression models

yi = β1 xi,1 + ··· + βp xi,p + εi

Motivation:

• Incorporating prior information
• Posterior statements: P[ βj > 0 | y, X ]
• OLS tends to overfit when p is large; Bayes’ use of priors tends to make it more conservative – as we’ll see
• Model selection and averaging (more later)

Prior and posterior distribution

prior       β ∼ mvn(β0, Σ0)
model       y ∼ mvn(Xβ, σ²I)
posterior   β|y, X ∼ mvn(βn, Σn)

where

$$\Sigma_n = \text{Var}[\beta \mid y, X, \sigma^2] = \left(\Sigma_0^{-1} + X^TX/\sigma^2\right)^{-1}$$
$$\beta_n = \text{E}[\beta \mid y, X, \sigma^2] = \left(\Sigma_0^{-1} + X^TX/\sigma^2\right)^{-1}\left(\Sigma_0^{-1}\beta_0 + X^Ty/\sigma^2\right).$$

Notice:

• If $\Sigma_0^{-1} \ll X^TX/\sigma^2$ (a weak prior), then $\beta_n \approx \hat\beta_{ols}$
• If $\Sigma_0^{-1} \gg X^TX/\sigma^2$ (a strong prior), then $\beta_n \approx \beta_0$
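These formulas are straightforward to compute; a minimal sketch, where beta0, Sigma0 and the “known” sigma2 are illustrative values rather than choices made in these slides:

## Conjugate posterior for beta, with sigma^2 treated as known
sigma2 <- 3.678                    # illustrative value
beta0  <- rep(0, ncol(X))          # illustrative prior mean
Sigma0 <- diag(100, ncol(X))       # illustrative prior variance
Sigma.n <- solve( solve(Sigma0) + crossprod(X)/sigma2 )                  # Var[beta | y, X, sigma^2]
beta.n  <- Sigma.n %*% ( solve(Sigma0) %*% beta0 + t(X) %*% y/sigma2 )   # E[beta | y, X, sigma^2]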

The g-prior

How to pick β0, Σ0? A classical suggestion (Zellner, 1986) uses the g-prior:

β ∼ mvn(0, g σ²(XᵀX)⁻¹)

Idea: the variance of the OLS estimate $\hat\beta_{ols}$ is

$$\text{Var}[\hat\beta_{ols}] = \sigma^2(X^TX)^{-1} = \frac{\sigma^2}{n}(X^TX/n)^{-1},$$

which is (roughly) the uncertainty in β from n observations. Under the g-prior,

$$\text{Var}[\beta]_{gprior} = g\sigma^2(X^TX)^{-1} = \frac{\sigma^2}{n/g}(X^TX/n)^{-1},$$

so the g-prior can roughly be viewed as the uncertainty from n/g observations.

For example, g = n means the prior has the same amount of information as 1 observation – so (roughly!) not much, in a large study.

Posterior distributions under the g-prior

After some algebra, it turns out that

$$\{\beta \mid y, X, \sigma^2\} \sim \text{mvn}(\beta_n, \Sigma_n), \quad \text{where}$$
$$\Sigma_n = \text{Var}[\beta \mid y, X, \sigma^2] = \frac{g}{g+1}\,\sigma^2(X^TX)^{-1}$$
$$\beta_n = \text{E}[\beta \mid y, X, \sigma^2] = \frac{g}{g+1}\,(X^TX)^{-1}X^Ty$$

Notes:

• The posterior mean estimate βn is simply (g/(g+1)) $\hat\beta_{ols}$.
• The posterior variance of β is simply (g/(g+1)) Var[$\hat\beta_{ols}$].
• g shrinks the coefficients towards 0 and can prevent overfitting to the data
• If g = n, then as n increases, inference approximates that using $\hat\beta_{ols}$.

Monte Carlo simulation

What about the error variance σ²? It’s rarely known exactly, so it is more realistic to allow for uncertainty with a prior...

prior            1/σ² ∼ gamma(ν0/2, ν0σ0²/2)
sampling model   y ∼ mvn(Xβ, σ²I)
posterior        1/σ²|y, X ∼ gamma([ν0 + n]/2, [ν0σ0² + SSRg]/2)

where SSRg is somewhat complicated.

Simulating the joint posterior distribution:

joint distribution   p(σ², β|y, X) = p(σ²|y, X) × p(β|y, X, σ²)
simulation           {σ², β} ∼ p(σ², β|y, X) ⇔ σ² ∼ p(σ²|y, X), β ∼ p(β|y, X, σ²)

To simulate {σ², β} ∼ p(σ², β|y, X),

• First simulate σ² from p(σ²|y, X)
• Use this σ² to simulate β from p(β|y, X, σ²)

Repeat 1000’s of times to obtain MC samples: {σ², β}^(1), . . . , {σ², β}^(S).

FTO example: Bayes

Priors:

1/σ² ∼ gamma(1/2, 3.678/2)
β|σ² ∼ mvn(0, g × σ²(XᵀX)⁻¹)

Posteriors:

{1/σ²|y, X} ∼ gamma((1 + 20)/2, (3.678 + 251.775)/2)
{β|y, X, σ²} ∼ mvn(0.952 × $\hat\beta_{ols}$, 0.952 × σ²(XᵀX)⁻¹)

where

$$(X^TX)^{-1} = \begin{pmatrix} 0.55 & -0.55 & -0.15 & 0.15 \\ -0.55 & 1.10 & 0.15 & -0.30 \\ -0.15 & 0.15 & 0.05 & -0.05 \\ 0.15 & -0.30 & -0.05 & 0.10 \end{pmatrix}, \qquad \hat\beta_{ols} = \begin{pmatrix} -0.068 \\ 2.945 \\ 2.844 \\ 1.729 \end{pmatrix}$$

FTO example: Bayes in R

## data dimensions
n <- nrow(X)
p <- ncol(X)

## prior parameters
nu0 <- 1
s20 <- summary(lm(y~-1+X))$sigma^2
g <- n

## posterior calculations
Hg <- (g/(g+1)) * X%*%solve(t(X)%*%X)%*%t(X)
SSRg <- t(y)%*%( diag(1,nrow=n) - Hg ) %*%y

Vbeta <- g*solve(t(X)%*%X)/(g+1)
Ebeta <- Vbeta%*%t(X)%*%y

## simulate sigma^2 and beta
## may need to install the mvtnorm package, for rmvnorm()
library("mvtnorm")
set.seed(4)
s2.post <- 1/rgamma(5000, (nu0+n)/2, (nu0*s20+SSRg)/2 )
beta.post <- t( sapply(s2.post, function(s2val){
  rmvnorm(1, Ebeta, s2val*Vbeta)} ))

FTO example: Bayes in R

> # look at the first few samples
> s2.post[1:5]
[1] 11.940216 15.281855 15.821894  8.062999 10.385588
> beta.post[1:5,]
[1,] -0.4473916 4.1394091 2.918340 0.8798302
[2,]  2.1962233 0.9645979 1.444978 2.2662295
[3,]  1.6564342 0.4438559 2.907225 1.5375880
[4,]  0.3876347 1.0801567 2.357214 2.5752063
[5,] -0.4976167 3.2976243 2.004742 2.5915040
> # posterior quantiles for betas
> t(apply(beta.post,2,quantile,probs=c(.5,.025,.975)))
             50%       2.5%     97.5%
[1,] -0.06185524 -5.2449633  5.203937
[2,]  2.75532364 -4.4390185 10.172409
[3,]  2.71534342  1.1521791  4.293707
[4,]  1.63590630 -0.6553404  3.840388
> # compare with the OLS estimates
> cbind(coef(fit.ols), confint.default(fit.ols))
                             2.5 %   97.5 %
(Intercept) -0.06821632 -2.8558671 2.719434
xg           2.94485495 -0.9974786 6.887189
xa           2.84420729  2.0036989 3.684716
xg:xa        1.72947648  0.5408182 2.918135

FTO example: Bayes in R

Why are the estimates similar but the intervals so different? Here are the prior and posterior for σ:

[Figure: prior and posterior densities for σ]

The ‘best guess’ estimate is σ̂ = σ0 = 1.91 – but the prior also supports much larger values – with which the data don’t strongly disagree.

FTO example: Bayes in R

Numerical confirmation of the same thing; the posterior quantiles for σ2, and σ.

> quantile(s2.post,probs=c(.025,.5,.975))
     2.5%       50%     97.5%
 7.244054 12.613746 24.430451
> quantile(sqrt(s2.post),probs=c(.025,.5,.975))
    2.5%      50%    97.5%
2.691478 3.551584 4.942717

Can examine the posterior for any quantity of interest in the same straightforward manner, given a large sample from the posterior.
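For example, using the beta.post samples above, the posterior for the genetic effect β2 + β4 × age (defined on the next slide) at age 5 weeks is obtained directly:

## Posterior for the genetic effect beta2 + beta4*age, here at age = 5 weeks
effect.age5 <- beta.post[,2] + 5*beta.post[,4]
quantile(effect.age5, probs = c(.025, .5, .975))
mean(effect.age5 > 0)   # posterior probability of a positive effect at age 5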

Posterior distributions

[Figure: posterior distributions of β2 and β4 – marginal densities and a joint plot]

Summarizing the genetic effect

Genetic effect = E[ y|age, +/− ] − E[ y|age, −/− ]
               = [(β1 + β2) + (β3 + β4) × age] − [β1 + β3 × age]
               = β2 + β4 × age

[Figure: posterior summary of the genetic effect β2 + β4 × age, plotted against age]

Numerical approximations

The use of Monte Carlo sampling is important for Bayesian work;

Large sample (points) to estimate posterior (contours)

[Figure: a large sample of points from the posterior, overlaid on posterior density contours for (θ1, θ2)]

With a large sample from some distribution – i.e. the posterior – we can approximate any property of that distribution. It does not matter how we get the sample.

Numerical approximations

Here’s one way; (the Gibbs Sampler – in two examples)

[Figure: two examples of a Gibbs sampler’s path over (θ1, θ2), updating one coordinate at a time]

Gibbs updates parameters ‘one at a time’, using π(θ1|θ2), then π(θ2|θ1). The sequence of samples θ^(s) (a ‘chain’) is dependent, but the posterior is covered appropriately.

Numerical approximations

Algebra for the same thing; the posterior is

π(θ1, θ2|Y) ∝ f(Y|θ1, θ2) × π(θ1, θ2), and is usually intractable. But it is equivalent to

π(θ1, θ2|Y) = p(θ2|Y) × p(θ1|θ2, Y), and the conditional p(θ1|θ2, Y) may be more readily available. Gibbs uses just the conditionals, iterating between these steps:

$$\theta_1^{(s)} \sim p(\theta_1 \mid \theta_2^{(s-1)}, Y)$$
$$\theta_2^{(s)} \sim p(\theta_2 \mid \theta_1^{(s)}, Y)$$

to produce the sequence

$$(\theta_1^{(0)}, \theta_2^{(0)}),\; (\theta_1^{(1)}, \theta_2^{(1)}),\; \ldots,\; (\theta_1^{(s)}, \theta_2^{(s)}),\; \ldots$$
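As a toy illustration of these two steps (a made-up bivariate normal target, not one of the FTO posteriors), each full conditional is univariate normal:

## Toy Gibbs sampler: target is bivariate normal, mean 0, variance 1, correlation rho.
## Full conditionals: theta1 | theta2 ~ N(rho*theta2, 1-rho^2), and vice versa.
rho <- 0.8
S <- 5000
theta <- matrix(NA, S, 2)
theta[1, ] <- c(-3, 3)                      # deliberately poor starting value
for (s in 2:S) {
  theta[s, 1] <- rnorm(1, rho * theta[s-1, 2], sqrt(1 - rho^2))
  theta[s, 2] <- rnorm(1, rho * theta[s, 1],  sqrt(1 - rho^2))
}
cor(theta)   # off-diagonal should be close to rho, despite the starting value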

• If the run is long enough (s → ∞), this sequence is a sample from π(θ1, θ2|Y), no matter where you started
• For more parameters, update each θk in turn, then start again

MCMC: with Stan

Stan does all these steps for us;

• Specify a model, including priors, and tell Stan what the data are
• Stan writes code to sample from the posterior, by ‘walking around’ – actually it runs the No U-Turn Sampler, a more sophisticated method
• Stan runs this code, and reports back all the samples
• The rstan package lets you run chains from R
• Some modeling limitations – no discrete parameters – but becoming very popular; works well with some models where other software would struggle
• Requires declarations (like C++) – unlike R – so models require a bit more typing...

MCMC: with Stan

A first attempt at the FTO example, running Stan from within R;

cat(file="FTOexample.stan", "
data {
  int n;               //the number of observations
  int p;               //the number of columns in the model matrix
  real y[n];           //the response
  matrix[n,p] X;       //the model matrix
  real g;              // Zellner scale factor
  vector[p] mu;        // Zellner prior mean (all zeros)
  matrix[p,p] XtXinv;  // information matrix
  real sigma;          // std dev’n, assumed known
}
parameters {
  vector[p] beta;      //the regression parameters
}
transformed parameters {
  vector[n] linpred;
  cov_matrix[p] Sigma;
  linpred = X*beta;
  for (j in 1:p){
    for (k in 1:p){

      Sigma[j,k] = g*sigma^2*XtXinv[j,k];
    }
  }
}
model {
  beta ~ multi_normal(mu, Sigma);
  y ~ normal(linpred, sigma);
}
")

# do the MCMC, store the results
library("rstan")
stan1 <- stan(file = "FTOexample.stan",
              data = list( n=n, p=p, y=y, X=X, g=n, mu=rep(0,p),
                           XtXinv=solve(crossprod(X)), sigma=sqrt(s20) ),
              iter = 100000, chains = 1, pars="beta")

• Most of the work involves writing a model file (though our ‘model’ is only two lines!)
• We use 100,000 iterations – which takes only a few seconds

MCMC: with Stan

What it produces, and how it compares to exact calculation;

> print(stan1)
Inference for Stan model: FTOexample.
1 chains, each with iter=1e+05; warmup=50000; thin=1;
post-warmup draws per chain=50000, total post-warmup draws=50000.

          mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
beta[1]  -0.07    0.01 1.39  -2.86  -0.99  -0.06   0.86   2.62 12762    1
beta[2]   2.82    0.02 1.98  -1.02   1.47   2.79   4.13   6.76 12738    1
beta[3]   2.71    0.00 0.42   1.90   2.43   2.71   2.99   3.55 12838    1
beta[4]   1.64    0.01 0.60   0.46   1.24   1.65   2.05   2.80 12721    1
lp__    -39.55    0.01 1.43 -43.13 -40.25 -39.22 -38.49 -37.77 16421    1

Samples were drawn using NUTS(diag_e) at Tue Aug 29 16:15:18 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

> round( cbind(Ebeta, sqrt(diag(Vbeta*s20))), 2)
             [,1] [,2]
(Intercept) -0.06 1.39
xg           2.80 1.96
xa           2.71 0.42
xg:xa        1.65 0.59

MCMC: with Stan

Now the full version, with the inverse-gamma prior on σ²;

cat(file="FTOexample.stan", "
data {
  int n;               //the number of observations
  int p;               //the number of columns in the model matrix
  real y[n];           //the response
  matrix[n,p] X;       //the model matrix
  real g;              // Zellner scale factor
  vector[p] mu;        // Zellner prior mean (all zeros)
  matrix[p,p] XtXinv;  // information matrix
}
parameters {
  vector[p] beta;             //the regression parameters
  real<lower=0> invsigma2;    //the precision, 1/sigma^2
}
transformed parameters {
  vector[n] linpred;
  cov_matrix[p] Sigma;
  real sigma;
  linpred = X*beta;
  sigma = 1/sqrt(invsigma2);
  for (j in 1:p){

    for (k in 1:p){
      Sigma[j,k] = g*sigma^2*XtXinv[j,k];
    }
  }
}
model {
  beta ~ multi_normal(mu, Sigma);
  y ~ normal(linpred, sigma);
  invsigma2 ~ gamma(0.5, 1.839);
}
")

# do the MCMC, store the results
library("rstan")
stan2 <- stan(file = "FTOexample.stan",
              data = list( n=n, p=p, y=y, X=X, g=n, mu=rep(0,p),
                           XtXinv=solve(crossprod(X)) ),
              iter = 100000, chains = 1, pars=c("beta","sigma"))

MCMC: with Stan

And comparing the output again;

> print(stan2)
Inference for Stan model: FTOexample.
1 chains, each with iter=1e+05; warmup=50000; thin=1;
post-warmup draws per chain=50000, total post-warmup draws=50000.

          mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
beta[1]  -0.09    0.02 2.66  -5.32  -1.85  -0.08   1.65   5.13 15022    1
beta[2]   2.84    0.03 3.73  -4.46   0.38   2.84   5.29  10.25 14829    1
beta[3]   2.72    0.01 0.80   1.14   2.19   2.71   3.25   4.29 15065    1
beta[4]   1.64    0.01 1.12  -0.60   0.90   1.63   2.38   3.85 14780    1
sigma     3.62    0.00 0.59   2.69   3.20   3.55   3.95   4.97 18325    1
lp__    -42.49    0.01 1.65 -46.59 -43.34 -42.15 -41.27 -40.32 15016    1

> round(t(apply( cbind(beta.post, sqrt(s2.post)), 2,
+    function(x){c(m=mean(x), sd=sd(x), quantile(x, c(0.025, 0.5, 0.975)))})), 2)
         m   sd  2.5%   50% 97.5%
[1,] -0.05 2.57 -5.14 -0.04  5.07
[2,]  2.77 3.62 -4.38  2.76  9.90
[3,]  2.70 0.77  1.17  2.70  4.23
[4,]  1.66 1.09 -0.50  1.66  3.83
[5,]  3.50 0.54  2.63  3.43  4.74

MCMC: with Stan

To see where the ‘chain’ went...

> traceplot(stan2)

MCMC: summary so far

• Stan – and other similar software – may look like overkill for this ‘conjugate’ problem, but Stan can provide posteriors for almost any model
• The ‘modeling’ language is modeled on R
• Users do have to decide how long a chain to run, and how long to ‘burn in’ for at the start of the chain. These are not easy questions to answer! We’ll see some diagnostics in later sessions

Bonus: what if the model’s wrong?

Different types of violation—in decreasing order of how much they typically matter in practice

• Just have the wrong data (!) i.e. not the data you claim to have
• Observations are not independent, e.g. repeated measures on the same mouse over time
• Mean model is incorrect
• Error terms do not have constant variance
• Error terms are not Normally distributed

Wrong model: dependent outcomes

• Observations from the same mouse are more likely to be similar than those from different mice (even if they have the same age and genotype)
• SBP from subjects (even with the same age, genotype etc) in the same family are more likely to be similar than those in different families – perhaps unmeasured common diet?
• Spatial and temporal relationships also tend to induce correlation

If the pattern of relationship is known, we can allow for it – typically in “random effects models” – see later session.

If not, treat results with caution! Precision is likely over-stated.

Wrong model: mean model

Even when the scientific background is highly informative about the variables of interest (e.g. we want to know about the association of Y with x1, adjusting for x2, x3...) there is rarely strong information about the form of the model

• Does mean weight increase with age? age²? age³?
• Could the effect of genotype also change non-linearly with age?

Including quadratic terms is a common approach – but quadratics are sensitive to the tails. Instead, including “spline” representations of covariates allows the model to capture many patterns.

Including interaction terms (as we did with xi,2 × xi,3) lets one covariate’s effect vary with another.
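A minimal sketch of both ideas, using R’s splines package; the 3 degrees of freedom here are an arbitrary illustrative choice, and fit.spline is just an illustrative name:

## Flexible age effect via a natural spline basis, allowed to differ by genotype
library(splines)
fit.spline <- lm(y ~ ns(xa, df = 3) * xg)
anova(fit.ols, fit.spline)   # compare with the straight-line-by-genotype fit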

(Deciding which covariates to use is addressed in the Model Choice session.)

Wrong model: non-constant variance

This is plausible in many situations; perhaps e.g. young mice are harder to measure, i.e. more variable. Or perhaps the FTO variant affects weight regulation – again, more variance.

• Having different variances at different covariate values is known as heteroskedasticity
• Unaddressed, it can result in over- or under-statement of precision

The most obvious approach is to model the variance, i.e.

$$Y_i = \beta^T x_i + \epsilon_i, \qquad \epsilon_i \sim \text{Normal}(0, \sigma_i^2),$$

where σi depends on covariates, e.g. σhomozy and σheterozy for the two genotypes.
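For comparison, a quick non-Bayesian analogue of this variance model (not the Bayesian fit discussed here, which continues below) can be obtained by generalized least squares from the nlme package, assuming y, xg and xa as before:

## Separate residual SD per genotype group, via generalized least squares
library(nlme)
dat <- data.frame(y = y, xg = factor(xg), xa = xa)
fit.hetero <- gls(y ~ xg * xa, data = dat,
                  weights = varIdent(form = ~ 1 | xg))
summary(fit.hetero)   # reports the ratio of the two group SDs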

Of course, these parameters need priors. Constraining variances to be positive also makes choosing a model difficult in practice.

Robust standard errors (in Bayes)

In linear regression, some robustness to model-misspecification and/or non-constant variance is available – but it relies on interest in linear ‘trends’. Formally, we can define the parameter

$$\theta = \arg\min_{\theta}\; E_x\!\left[\left(E_y[\,y \mid x\,] - x^T\theta\right)^2\right],$$

i.e. the straight line that best captures E[y|x], under random sampling of x, in a least-squares sense.

• This ‘trend’ can capture important features in how the mean y varies at different x
• Fitting extremely flexible Bayesian models, we get a posterior for θ
• The posterior mean approaches $\hat\beta_{ols}$, in large samples
• The posterior variance approaches the ‘robust’ sandwich estimate, in large samples (details in Szpiro et al, 2011)

Robust standard errors (in Bayes)

The OLS estimator can be written as $\hat\beta_{ols} = (X^TX)^{-1}X^Ty = \sum_{i=1}^n c_i y_i$, for appropriate $c_i$.

True variance          $\text{Var}[\hat\beta] = \sum_{i=1}^n c_i^2\, \text{Var}[Y_i]$
Robust estimate        $\widehat{\text{Var}}_R[\hat\beta] = \sum_{i=1}^n c_i^2\, e_i^2$
Model-based estimate   $\widehat{\text{Var}}_M[\hat\beta] = \text{Mean}(e_i^2)\, \sum_{i=1}^n c_i^2$,

where $e_i = y_i - x_i^T\hat\beta_{ols}$, the residuals from fitting a linear model.

Non-Bayesian sandwich estimates are available through R’s sandwich package – much quicker than Bayes with a very-flexible model. For correlated outcomes, see the GEE package for generalizations.
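For example, a minimal sketch applied to the earlier fit.ols object, assuming the sandwich and lmtest packages are installed:

## Heteroskedasticity-robust ('sandwich') standard errors for the OLS fit
library(sandwich)
library(lmtest)
sqrt(diag(vcovHC(fit.ols, type = "HC0")))                   # robust SEs
coeftest(fit.ols, vcov. = vcovHC(fit.ols, type = "HC0"))    # robust coefficient table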

Wrong model: Non-Normality

This is not a big problem for learning about population parameters;

• The variance statements/estimates we just saw don’t rely on Normality
• The Central Limit Theorem means that β̂ ends up Normal anyway, in large samples
• In small samples, expect to have limited power to detect non-Normality
• ... except, perhaps, for extreme outliers (data errors?)

For prediction – where we assume that outcomes do follow a Normal distribution – this assumption is more important.

Summary

• Linear regressions are of great applied interest
• Corresponding models are easy to fit, particularly with judicious prior choices
• Assumptions are made – but a well-chosen linear regression usually tells us something of interest, even if the assumptions are (mildly) incorrect
