
Chapter 4: Regression and Hierarchical Models

Conchi Ausín and Mike Wiper
Universidad Carlos III de Madrid

Master in Business Administration and Quantitative Methods
Master in Mathematical Engineering

Objective

[Photos: A. F. M. Smith and Dennis Lindley]

We analyze the Bayesian approach to fitting normal and generalized linear models and introduce the Bayesian hierarchical modeling approach. Also, we study the modeling and forecasting of time series using dynamic models.

Contents

1 Normal linear models
   1.1. ANOVA model
   1.2. Simple linear regression model

2 Generalized linear models

3 Hierarchical models

4 Dynamic models

Normal linear models

A normal linear model is of the following form:

y = Xθ + ε,

where y = (y_1, …, y_n)ᵀ is the observed data vector, X is a known n × k matrix, called the design matrix, θ = (θ_1, …, θ_k)ᵀ is the parameter vector and ε follows a multivariate normal distribution. Usually, it is assumed that:

ε ∼ N( 0, (1/φ) I_n ).

A simple example of a normal linear model is the simple linear regression model, where

Xᵀ = ( 1    1   ⋯   1
       x_1  x_2 ⋯  x_n )   and   θ = (α, β)ᵀ.


Consider a normal linear model, y = Xθ + ε. A conjugate prior distribution is a normal-gamma distribution:

θ | φ ∼ N( m, (1/φ) V ),
φ ∼ G( a/2, b/2 ).

Then, the posterior distribution given y is also a normal-gamma distribution with:

m* = (XᵀX + V⁻¹)⁻¹ (Xᵀy + V⁻¹m)
V* = (XᵀX + V⁻¹)⁻¹
a* = a + n
b* = b + yᵀy + mᵀV⁻¹m − m*ᵀ V*⁻¹ m*
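These update formulas translate directly into code; the following NumPy sketch (the function name and the example below are ours, not part of the slides) computes (m*, V*, a*, b*):

```python
import numpy as np

def normal_gamma_update(X, y, m, V, a, b):
    """Posterior (m*, V*, a*, b*) for the conjugate normal-gamma prior."""
    Vinv = np.linalg.inv(V)
    V_star = np.linalg.inv(X.T @ X + Vinv)        # V* = (X'X + V^-1)^-1
    m_star = V_star @ (X.T @ y + Vinv @ m)        # m* = V* (X'y + V^-1 m)
    a_star = a + len(y)                           # a* = a + n
    # b* = b + y'y + m' V^-1 m - m*' V*^-1 m*
    b_star = b + y @ y + m @ Vinv @ m - m_star @ np.linalg.solve(V_star, m_star)
    return m_star, V_star, a_star, b_star
```

With a very diffuse prior (large V), m* essentially reproduces the maximum likelihood estimator.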


The posterior mean is given by:

E[θ | y] = (XᵀX + V⁻¹)⁻¹ (Xᵀy + V⁻¹m)
         = (XᵀX + V⁻¹)⁻¹ (XᵀX (XᵀX)⁻¹ Xᵀy + V⁻¹m)
         = (XᵀX + V⁻¹)⁻¹ (XᵀX θ̂ + V⁻¹m)

where θ̂ = (XᵀX)⁻¹ Xᵀy is the maximum likelihood estimator. Thus, this expression may be interpreted as a weighted average of the prior estimator, m, and the MLE, θ̂, with weights proportional to their precisions since, conditional on φ, the prior covariance is (1/φ)V and the distribution of the MLE from the classical viewpoint is θ̂ | φ ∼ N( θ, (1/φ)(XᵀX)⁻¹ ).


Consider a normal linear model, y = Xθ + ε, and assume the limiting prior distribution,

p(θ, φ) ∝ 1/φ.

Then, we have that:

θ | y, φ ∼ N( θ̂, (1/φ)(XᵀX)⁻¹ ),
φ | y ∼ G( (n − k)/2, (yᵀy − θ̂ᵀ(XᵀX)θ̂)/2 ).

Note that σ̂² = (yᵀy − θ̂ᵀ(XᵀX)θ̂)/(n − k) is the usual classical estimator of σ² = 1/φ. In this case, Bayesian credible intervals, estimators etc. will coincide with their classical counterparts.

ANOVA model

The ANOVA model is an example of a normal linear model where:

y_ij = θ_i + ε_ij,

where ε_ij ∼ N(0, 1/φ), for i = 1, …, k, and j = 1, …, n_i.

Thus, the parameters are θ = (θ_1, …, θ_k), the observed data are y = (y_11, …, y_{1n_1}, y_21, …, y_{2n_2}, …, y_{k1}, …, y_{kn_k})ᵀ, and the design matrix is:

 1 0 ··· 0   . .   . .     1n1 0 ··· 0     0 1 0  X =    . . .   . . .     0 1n 0   2   . . .   . . .  0 0 1


Assume conditionally independent normal priors, θ_i | φ ∼ N( m_i, 1/(α_i φ) ), for i = 1, …, k, and a gamma prior φ ∼ G(a/2, b/2). This corresponds to a normal-gamma prior distribution for (θ, φ) where m = (m_1, …, m_k) and V = diag(1/α_1, …, 1/α_k). Then, it is obtained that:

θ | y, φ ∼ N( ( (n_1 ȳ_1· + α_1 m_1)/(n_1 + α_1), …, (n_k ȳ_k· + α_k m_k)/(n_k + α_k) )ᵀ , (1/φ) diag( 1/(α_1 + n_1), …, 1/(α_k + n_k) ) )

and

φ | y ∼ G( (a + n)/2 , ( b + Σ_{i=1}^k Σ_{j=1}^{n_i} (y_ij − ȳ_i·)² + Σ_{i=1}^k (n_i α_i/(n_i + α_i)) (ȳ_i· − m_i)² ) / 2 )


If we assume alternatively the reference prior, p(θ, φ) ∝ 1/φ, we have:

θ | y, φ ∼ N( (ȳ_1·, …, ȳ_k·)ᵀ , (1/φ) diag(1/n_1, …, 1/n_k) ),
φ | y ∼ G( (n − k)/2 , (n − k)σ̂²/2 ),

where σ̂² = (1/(n − k)) Σ_{i=1}^k Σ_{j=1}^{n_i} (y_ij − ȳ_i·)² is the classical variance estimate for this problem.

A 95% posterior interval for θ1 − θ2 is given by:

ȳ_1· − ȳ_2· ± σ̂ √( 1/n_1 + 1/n_2 ) t_{n−k}(0.975),

which is equal to the usual, classical interval.
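As a quick numerical check, the interval above is easy to compute; a small sketch (ours), using scipy.stats only for the Student-t quantile:

```python
import numpy as np
from scipy import stats

def posterior_interval_diff(groups, i, j, level=0.95):
    """Posterior interval for theta_i - theta_j under the reference prior."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    # Classical pooled variance estimate, sigma^2_hat
    sigma2_hat = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - k)
    diff = groups[i].mean() - groups[j].mean()
    half = np.sqrt(sigma2_hat * (1 / len(groups[i]) + 1 / len(groups[j]))) \
           * stats.t.ppf((1 + level) / 2, n - k)
    return diff - half, diff + half
```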

Example: ANOVA model

Suppose that an ecologist is interested in analysing how the masses of starlings (a type of bird) vary between four locations.

A sample data of the weights of 10 starlings from each of the four locations can be downloaded from: http://arcue.botany.unimelb.edu.au/bayescode.html.

Assume a Bayesian one-way ANOVA model for these data where a different mean is considered for each location and the variation in mass between different birds is described by a normal distribution with a common variance.

Compare the results with those obtained with classical methods.

Simple linear regression model

Another example of a normal linear model is the simple linear regression model:

y_i = α + βx_i + ε_i,

for i = 1, …, n, where ε_i ∼ N(0, 1/φ).

Suppose that we use the limiting prior:

p(α, β, φ) ∝ 1/φ.


Then, we have that:

(α, β)ᵀ | y, φ ∼ N( (α̂, β̂)ᵀ , (1/(φ n s_x)) [ Σ_{i=1}^n x_i²  −n x̄ ; −n x̄  n ] ),
φ | y ∼ G( (n − 2)/2 , s_y (1 − r²)/2 ),

where:

α̂ = ȳ − β̂x̄,   β̂ = s_xy / s_x,
s_x = Σ_{i=1}^n (x_i − x̄)²,   s_y = Σ_{i=1}^n (y_i − ȳ)²,
s_xy = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ),   r = s_xy / √(s_x s_y),   σ̂² = s_y (1 − r²) / (n − 2).
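The quantities above map directly to code; a minimal sketch (the function name is ours):

```python
import numpy as np

def regression_summaries(x, y):
    """Sufficient statistics and estimators used in the posterior above."""
    n = len(x)
    sx = ((x - x.mean()) ** 2).sum()
    sy = ((y - y.mean()) ** 2).sum()
    sxy = ((x - x.mean()) * (y - y.mean())).sum()
    beta_hat = sxy / sx                    # slope estimate
    alpha_hat = y.mean() - beta_hat * x.mean()
    r = sxy / np.sqrt(sx * sy)             # correlation coefficient
    sigma2_hat = sy * (1 - r ** 2) / (n - 2)
    return alpha_hat, beta_hat, sigma2_hat, sx
```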


Thus, the marginal posterior distributions of α and β are Student-t distributions:

(α − α̂) / √( σ̂² Σ_{i=1}^n x_i² / (n s_x) )  | y ∼ t_{n−2},
(β − β̂) / √( σ̂² / s_x )  | y ∼ t_{n−2}.

Therefore, for example, a 95% credible interval for β is given by:

β̂ ± ( σ̂ / √s_x ) t_{n−2}(0.975),

equal to the usual classical interval.


Suppose now that we wish to predict a future observation:

y_new = α + β x_new + ε_new.

Note that,

E[y_new | φ, y] = α̂ + β̂ x_new,
V[y_new | φ, y] = (1/φ) ( (Σ_{i=1}^n x_i² + n x_new² − 2n x̄ x_new) / (n s_x) + 1 )
               = (1/φ) ( (s_x + n x̄² + n x_new² − 2n x̄ x_new) / (n s_x) + 1 ).

Therefore,

y_new | φ, y ∼ N( α̂ + β̂ x_new , (1/φ) ( (x̄ − x_new)²/s_x + 1/n + 1 ) )


And then,

( y_new − (α̂ + β̂ x_new) ) / √( σ̂² ( (x̄ − x_new)²/s_x + 1/n + 1 ) )  | y ∼ t_{n−2},

leading to the following 95% credible interval for y_new:

α̂ + β̂ x_new ± σ̂ √( (x̄ − x_new)²/s_x + 1/n + 1 ) t_{n−2}(0.975),

which coincides with the usual, classical interval.
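A sketch of this predictive interval (ours; scipy supplies the Student-t quantile, and σ̂² is computed via the equivalent form (s_y − s_xy²/s_x)/(n − 2)):

```python
import numpy as np
from scipy import stats

def predictive_interval(x, y, x_new, level=0.95):
    """Predictive interval for a new observation at x_new (reference prior)."""
    n = len(x)
    sx = ((x - x.mean()) ** 2).sum()
    sy = ((y - y.mean()) ** 2).sum()
    sxy = ((x - x.mean()) * (y - y.mean())).sum()
    beta_hat = sxy / sx
    alpha_hat = y.mean() - beta_hat * x.mean()
    sigma2_hat = (sy - sxy ** 2 / sx) / (n - 2)    # = s_y (1 - r^2) / (n - 2)
    centre = alpha_hat + beta_hat * x_new
    half = np.sqrt(sigma2_hat * ((x.mean() - x_new) ** 2 / sx + 1 / n + 1)) \
           * stats.t.ppf((1 + level) / 2, n - 2)
    return centre - half, centre + half
```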

Example: Simple linear regression model

Consider the data file prostate.data that can be downloaded from: http://statweb.stanford.edu/~tibs/ElemStatLearn/.

This includes, among other clinical measures, the level of prostate specific antigen in logs (lpsa) and the log cancer volume (lcavol) in 97 men who were about to receive a radical prostatectomy.

Use a Bayesian linear regression model to predict the lpsa in terms of the lcavol.

Compare the results with a classical linear regression fit.

Generalized linear models

The generalized linear model generalizes the normal linear model by allowing the possibility of non-normal error distributions and by allowing for a non-linear relationship between y and x. A generalized linear model is specified by two functions:

1 A conditional density function of y given x, parameterized by a mean parameter, µ = µ(x) = E[Y | x], and (possibly) a dispersion parameter, φ > 0, that is independent of x.

2 A (one-to-one) link function, g(·), which relates the mean, µ = µ(x) to the covariate vector, x, as g(µ) = xθ.


The following are generalized linear models with the canonical link function, which is the natural parameterization that leaves the exponential family distribution in canonical form.

A logistic regression model is often used for predicting the occurrence of an event given covariates:

Y_i | p_i ∼ Bin(n_i, p_i),
log( p_i / (1 − p_i) ) = x_i θ.

A Poisson regression model is used for predicting the number of events in a time period given covariates:

Y_i | λ_i ∼ P(λ_i),

log λ_i = x_i θ


The Bayesian specification of a GLM is completed by defining (typically normal or normal gamma) prior distributions p(θ, φ) over the unknown model parameters.

As with standard linear models, when improper priors are used, it is then important to check that these lead to valid posterior distributions.

Clearly, these models will not have conjugate posterior distributions, but, usually, they are easily handled by Gibbs sampling.

In particular, the posterior distributions from these models are usually log concave and are thus easily sampled via adaptive rejection sampling.

Example: A logistic regression model

The O-ring data consist of 23 observations on pre-Challenger Space Shuttle launches. On each launch, it is observed whether there is at least one O-ring failure, and the temperature at launch. The goal is to model the probability of at least one O-ring failure as a function of temperature.

Temperatures were: 53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81.

Failures occurred at: 53, 57, 58, 63, 70, 70, 75.
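Since the posterior of a Bayesian logistic regression of failure on temperature is not available in closed form, a sampler is needed. The following random-walk Metropolis sketch is ours: flat priors, centred temperature, proposal scales and iteration counts are all hand-picked choices, not part of the slides.

```python
import numpy as np

# O-ring data: launch temperatures and an indicator of at least one failure.
temp = np.array([53., 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
                 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81])
fail = np.array([1., 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
                 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
tc = temp - temp.mean()            # centring improves mixing (our choice)

def log_post(a, b):
    # Flat priors on (a, b): log-posterior equals the Bernoulli log-likelihood.
    eta = a + b * tc
    return (fail * eta - np.logaddexp(0.0, eta)).sum()

rng = np.random.default_rng(42)
a, b, draws = 0.0, 0.0, []
for it in range(20000):
    a_new, b_new = a + rng.normal(0, 0.6), b + rng.normal(0, 0.06)
    if np.log(rng.uniform()) < log_post(a_new, b_new) - log_post(a, b):
        a, b = a_new, b_new        # accept the proposed move
    if it >= 5000:                 # discard burn-in
        draws.append(b)
slope_mean = np.mean(draws)        # negative: failure is likelier when cold
```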

Example: A logistic regression model

The table shows the relationship, for 64 infants, between gestational age of the infant (in weeks) at the time of birth (x) and whether the infant was breast feeding at the time of release from hospital (y).

x          28  29  30  31  32  33
#{y = 0}    4   3   2   2   4   1
#{y = 1}    2   2   7   7  16  14

Let x_i represent the gestational age and n_i the number of infants with this age. Then we can model the probability that y_i infants were breast feeding at time of release from hospital via a standard logistic regression model.

Hierarchical models

Suppose we have data, x, and a likelihood f(x | θ), where the parameter values θ = (θ_1, …, θ_k) are judged to be exchangeable, that is, any permutation of them has the same distribution.

In this situation, it makes sense to consider a multilevel model, assuming a prior distribution, f(θ | φ), which depends upon a further, unknown hyperparameter, φ, and to use a hyperprior distribution, f(φ).

In theory, this process could continue further, using hyperhyperprior distributions to estimate the hyperprior distributions. This provides a method for eliciting prior distributions.

One alternative is to estimate the hyperparameter using classical methods, which is known as empirical Bayes. A point estimate, φ̂, is then plugged in to approximate the posterior distribution. However, this ignores the uncertainty in φ.


In most hierarchical models, the joint posterior distribution,

f(θ, φ | x) ∝ f(x | θ) f(θ | φ) f(φ),

will not be analytically tractable. However, often a Gibbs sampling approach can be implemented by sampling from the conditional posterior distributions:

f(θ | x, φ) ∝ f(x | θ) f(θ | φ),
f(φ | x, θ) ∝ f(θ | φ) f(φ).

It is important to check the propriety of the posterior distribution when improper hyperprior distributions are used. An alternative (as in, for example, WinBUGS) is to use proper but high-variance hyperprior distributions.


For example, a hierarchical normal linear model is given by:

x_ij | θ_i, φ ∼ N( θ_i, 1/φ ),   i = 1, …, n,  j = 1, …, m.

Assuming that the means, θ_i, are exchangeable, we may consider the following prior distribution:

θ_i | µ, ψ ∼ N( µ, 1/ψ ),

where the hyperparameters are µ and ψ.

Example: A hierarchical one-way ANOVA

Suppose that 5 individuals take 3 different IQ tests developed by 3 different psychologists, obtaining the following results:

Subject    1    2    3    4    5
Test 1   106  121  159   95   78
Test 2   108  113  158   91   80
Test 3    98  115  169   93   77

Then, we can assume that:

X_ij | θ_i, φ ∼ N( θ_i, 1/φ ),
θ_i | µ, ψ ∼ N( µ, 1/ψ ),

for i = 1,..., 5, and j = 1, 2, 3, where θi represents the true IQ of subject i and µ the mean true IQ in the population.
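A Gibbs sampler for this model can be sketched as follows. The flat hyperprior on µ and the vague gamma hyperpriors on φ and ψ are our additions, since the slides leave them unspecified; the conditional distributions are the standard normal/gamma updates.

```python
import numpy as np

# IQ scores: rows are tests, columns are subjects.
X = np.array([[106., 121, 159, 95, 78],
              [108., 113, 158, 91, 80],
              [ 98., 115, 169, 93, 77]])
xbar = X.mean(axis=0)              # subject means
k, m = 5, 3                        # subjects, tests per subject

rng = np.random.default_rng(1)
theta = xbar.copy()
mu, phi, psi = xbar.mean(), 1.0, 1.0
theta_draws = []
for it in range(5000):
    # theta_i | rest: precision-weighted combination of data and prior mean
    prec = m * phi + psi
    theta = rng.normal((m * phi * xbar + psi * mu) / prec, np.sqrt(1 / prec))
    # mu | rest, under a flat hyperprior on mu (our assumption)
    mu = rng.normal(theta.mean(), np.sqrt(1 / (k * psi)))
    # phi, psi | rest, under vague G(0.01/2, 0.01/2) hyperpriors (ours)
    phi = rng.gamma(0.005 + X.size / 2, 1 / (0.005 + ((X - theta) ** 2).sum() / 2))
    psi = rng.gamma(0.005 + k / 2, 1 / (0.005 + ((theta - mu) ** 2).sum() / 2))
    if it >= 1000:                 # discard burn-in
        theta_draws.append(theta.copy())
theta_mean = np.mean(theta_draws, axis=0)  # posterior mean IQ per subject
```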

Example: A hierarchical Poisson model

The number of failures, X_i, of pump i at a power plant is assumed to follow a Poisson distribution:

X_i | λ_i ∼ P(λ_i t_i),   for i = 1, …, 10,

where λ_i is the failure rate for pump i and t_i is the length of operation time of the pump (in 1000s of hours). It seems natural to assume that the failure rates are exchangeable and thus we might assume:

λi | γ ∼ E(γ),

where γ is the prior hyperparameter. The observed data are:

Pump    1     2     3     4     5     6     7     8     9    10
t_i   94.5  15.7  62.9  126   5.24  31.4  1.05  1.05  2.1  10.5
x_i    5     1     5    14    3     19    1     1     4    22
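A Gibbs sampler sketch for these data (ours). The exponential prior combines with the Poisson likelihood to give λ_i | γ, x ∼ G(x_i + 1, t_i + γ); the G(c, d) hyperprior on γ is our addition, since the slides leave γ unspecified, and it yields γ | λ ∼ G(c + 10, d + Σλ_i).

```python
import numpy as np

t = np.array([94.5, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5])
x = np.array([5., 1, 5, 14, 3, 19, 1, 1, 4, 22])

rng = np.random.default_rng(0)
c, d = 0.1, 0.1                    # vague gamma hyperprior on gamma (ours)
gamma_, lam_draws = 1.0, []
for it in range(5000):
    # lambda_i | gamma, x ~ G(x_i + 1, t_i + gamma)
    lam = rng.gamma(x + 1, 1 / (t + gamma_))
    # gamma | lambda ~ G(c + 10, d + sum(lambda))
    gamma_ = rng.gamma(c + len(lam), 1 / (d + lam.sum()))
    if it >= 500:                  # discard burn-in
        lam_draws.append(lam)
lam_mean = np.mean(lam_draws, axis=0)   # posterior mean failure rates
```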

Dynamic models

The univariate normal dynamic linear model (DLM) is:

yt = Ft θt + νt , νt ∼ N (0, Vt )

θt = Gt θt−1 + ωt , ωt ∼ N (0, Wt ).

These models are linear state space models, where xt = Ft θt represents the signal, θt is the state vector, Ft is a regression vector and Gt is a state matrix.

The usual features of a time series such as trend and seasonality can be modeled within this format.

If the matrices Ft , Gt , Vt and Wt are constants, the model is said to be time invariant.


One of the simplest DLMs is the random walk plus noise model, also called the first order polynomial model. It is used to model univariate observations and the state vector is unidimensional:

yt = θt + νt , νt ∼ N (0, Vt )

θt = θt−1 + ωt , ωt ∼ N (0, Wt ).

This is a slowly varying level model where the observations fluctuate around a mean which varies according to a random walk.

Assuming known variances, V_t and W_t, a straightforward Bayesian analysis can be carried out as follows.


Suppose that the information at time t − 1 is y^{t−1} = {y_1, y_2, …, y_{t−1}} and assume that:

θ_{t−1} | y^{t−1} ∼ N(m_{t−1}, C_{t−1}).

Then, we have that:

The prior distribution for θt is:

θ_t | y^{t−1} ∼ N(m_{t−1}, R_t)

where Rt = Ct−1 + Wt

The one step ahead predictive distribution for yt is:

y_t | y^{t−1} ∼ N(m_{t−1}, Q_t)

where Qt = Rt + Vt .


The joint distribution of θt and yt is:

(θ_t, y_t)ᵀ | y^{t−1} ∼ N( (m_{t−1}, m_{t−1})ᵀ , [ R_t  R_t ; R_t  Q_t ] )

The posterior distribution for θ_t given y^t = {y^{t−1}, y_t} is:

θ_t | y^t ∼ N(m_t, C_t), where

m_t = m_{t−1} + A_t e_t,
A_t = R_t / Q_t,
e_t = y_t − m_{t−1},
C_t = R_t − A_t² Q_t.

Note that e_t is simply a one-step-ahead forecast error. The posterior mean formula could also be written as:

mt = (1 − At ) mt−1 + At yt .
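The filtering recursions above translate directly into code; a minimal sketch (ours), with known variances:

```python
def filter_rw_plus_noise(ys, V, W, m0=0.0, C0=1e6):
    """Forward filtering recursions for the random walk plus noise model."""
    m, C, out = m0, C0, []
    for y in ys:
        R = C + W            # prior variance:      R_t = C_{t-1} + W_t
        Q = R + V            # predictive variance: Q_t = R_t + V_t
        A = R / Q            # adaptive coefficient
        e = y - m            # one-step-ahead forecast error
        m = m + A * e        # m_t = m_{t-1} + A_t e_t
        C = R - A ** 2 * Q   # C_t = R_t - A_t^2 Q_t
        out.append((m, C))
    return out
```

With constant observations the filtered mean converges to that constant and C_t converges to a steady-state value.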

Example: First order polynomial DLM

Assume a slowly varying level model for the water level in Lake Huron with known variances: Vt = 1 and Wt = 1.

1 Estimate the filtered values of the state vector, based on the observations up to time t, from f(θ_t | y^t).

2 Estimate the predicted values of the state vector, based on the observations up to time t − 1, from f(θ_t | y^{t−1}).

3 Estimate the predicted values of the signal, based on the observations up to time t − 1, from f(y_t | y^{t−1}).

4 Compare the results using e.g.:

- V_t = 10 and W_t = 1.
- V_t = 1 and W_t = 10.


When the variances are not known, the Bayesian inference for the system is more complex. One possibility is the use of MCMC algorithms which are usually based on the so-called forward filtering backward sampling algorithm.

1 The forward filtering step is the standard normal linear analysis giving f(θ_t | y^t) at each t, for t = 1, …, T.

2 The backward sampling step uses the Markov property, sampling θ*_T from f(θ_T | y^T) and then, for t = T − 1, …, 1, sampling from f(θ_t | y^t, θ*_{t+1}). Thus, a sample from the posterior of the whole parameter structure is generated.

However, MCMC may be computationally very expensive for on-line estimation. One possible alternative is the use of particle filters.
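A sketch of FFBS for the random walk plus noise model (ours). The backward step uses B_t = C_t / R_{t+1}, so that θ_t | y^t, θ*_{t+1} ∼ N( m_t + B_t(θ*_{t+1} − m_t), C_t − B_t² R_{t+1} ), which follows from the Markov property and the filtering output.

```python
import numpy as np

def ffbs(ys, V, W, m0=0.0, C0=1e6, rng=None):
    """One draw of the state trajectory via forward filtering, backward sampling."""
    rng = rng or np.random.default_rng(0)
    T = len(ys)
    ms, Cs = [], []
    m, C = m0, C0
    for y in ys:                       # forward filtering
        R = C + W
        Q = R + V
        A = R / Q
        m = m + A * (y - m)
        C = R - A ** 2 * Q
        ms.append(m); Cs.append(C)
    theta = np.empty(T)                # backward sampling
    theta[-1] = rng.normal(ms[-1], np.sqrt(Cs[-1]))
    for t in range(T - 2, -1, -1):
        R = Cs[t] + W                  # R_{t+1}
        B = Cs[t] / R
        mean = ms[t] + B * (theta[t + 1] - ms[t])
        var = Cs[t] - B ** 2 * R       # = C_t * W / R_{t+1} >= 0
        theta[t] = rng.normal(mean, np.sqrt(var))
    return theta
```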


Other examples of DLMs are the following. A dynamic linear regression model is given by:

yt = Ft θt + νt , νt ∼ N (0, Vt )

θt = θt−1 + ωt , ωt ∼ N (0, Wt ).

The AR(p) model with time-varying coefficients takes the form:

yt = θ0t + θ1t yt−1 + ... + θpt yt−p + νt , θit = θi,t−1 + ωit ,

This model can be expressed in state space form by setting θ_t = (θ_{0t}, …, θ_{pt})ᵀ and F_t = (1, y_{t−1}, …, y_{t−p}).

The additive structure of the DLMs makes it easy to think of an observed series as originating from the sum of different components, e.g.,

y_t = y_{1t} + … + y_{ht},

where y1t might represent a trend component, y2t a seasonal component, and so on. Then, each component, yit , might be described by a different DLM:

y_it = F_{it} θ_{it} + ν_{it},   ν_{it} ∼ N(0, V_{it})

θ_{it} = G_{it} θ_{i,t−1} + ω_{it},   ω_{it} ∼ N(0, W_{it}).

By the assumption of independence of the components, y_t is also a DLM, described by:

F_t = (F_{1t} | ⋯ | F_{ht}),   V_t = V_{1t} + … + V_{ht},

G_t = blockdiag(G_{1t}, …, G_{ht}),   W_t = blockdiag(W_{1t}, …, W_{ht}).
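This composition rule can be sketched in code (ours; scipy.linalg.block_diag builds G_t and W_t):

```python
import numpy as np
from scipy.linalg import block_diag

def compose_dlm(components):
    """Combine independent DLM components, given as (F, G, V, W) tuples."""
    F = np.concatenate([np.atleast_1d(c[0]) for c in components])  # stack F_it
    G = block_diag(*[c[1] for c in components])                    # block-diag G_it
    V = sum(c[2] for c in components)                              # sum V_it
    W = block_diag(*[c[3] for c in components])                    # block-diag W_it
    return F, G, V, W
```

For example, a one-dimensional level component and a two-dimensional seasonal component combine into a three-dimensional state model.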
