An Introduction to Bayesian
Multilevel Hierarchical Modelling
using MLwiN

by

William Browne
and
Harvey Goldstein

Centre for Multilevel Modelling
Institute of Education, London

May

Summary

- What is multilevel modelling?
- Variance components and random slopes regression models
- Comparisons between frequentist and Bayesian approaches
- Model comparison in multilevel modelling
- Binary responses: multilevel logistic regression models
- Comparisons between various MCMC methods
- Areas of current research

Junior School Project (JSP) Dataset

- The JSP dataset (Mortimore et al. 1988) was a longitudinal study of primary school children from schools in the Inner London Education Authority.
- The sample of interest here consists of pupils from 48 schools.
- The main outcome is a score (out of 40) on a mathematics test in year 5 (MATH5).
- The main predictor is a score in year 3, again out of 40, on a mathematics test (MATH3).
- Other predictors are the child's gender and father's occupation status (manual/non-manual).

Why Multilevel Modelling?

- We could fit a simple regression of MATH5 on MATH3 to the full dataset.
- We could fit separate regressions of MATH5 on MATH3 for each school independently.
- The first approach ignores school effects (differences between schools), whilst the second ignores similarities between the schools.
- Solution: a multilevel (or hierarchical, or random effects) model treats both the students and the schools as random variables, and can be seen as a compromise between the two single-level approaches.
- The predicted regression lines for each school produced by a multilevel model are a weighted average of the individual school lines and the full-dataset regression line.
- A model with both random school intercepts and slopes is called a random slopes regression model.

Random Slopes Regression (RSR) Model

MLwiN (Rasbash et al. 2000) allows the user to specify the model via an Equations window:

Here we see an average line: MATH5 = 30.348 + 0.612*MATH3. Note: MATH3 has here been centred, and has range –21 to 11.

The variance of the 48 school intercepts is 4.874, the variance of the 48 school slopes is 0.034, and the covariance between the intercepts and slopes is –0.374. The unexplained residual variance is 26.964.

This model has been fitted using a maximum likelihood (ML) method known as IGLS (iterative generalised least squares).

IGLS is an iterative procedure based on estimating the random and fixed parts of the multilevel model alternately, assuming the estimates for the other part are correct. This involves iterating between two GLS model-fitting steps until the estimates converge to the ML point estimates.

Predicted school lines from the random slopes regression model

Here we highlight a poor performing school in light blue and a good performing school in red. Note that the school lines in this dataset have a negative covariance between the intercepts and slopes. Hence we see all the lines appear to converge at the top end of the intake scale and this is a direct consequence of the ceiling effect caused by the MATH5 test having a maximum score of 40.

There are simpler multilevel models than an RSR that simply partition the residual variance into the possible sources of variation (in our example, pupils and schools). Such models are called variance components models.

Variance Components Model

Here we have fitted a simple mean value for MATH5, which is estimated as 30.605 and we have partitioned the variance into variance between schools and residual variance between pupils.

Variance between schools = 5.157 Residual variance = 39.281

The Intra-school correlation coefficient (ICC) for the schools in this dataset measures the percentage of variation explained by the schools and equals 5.157/(5.157+39.281)=0.116, so schools explain 11.6% of the raw variation in this dataset.
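The ICC arithmetic above can be checked in a couple of lines (Python used here purely for illustration; the variance estimates are the ones reported in the text):

```python
# Intra-school correlation for the variance components model:
# the share of total variance attributable to the school level.
sigma2_school = 5.157    # between-school variance (from the fitted model)
sigma2_pupil = 39.281    # residual between-pupil variance (from the fitted model)

icc = sigma2_school / (sigma2_school + sigma2_pupil)
print(round(icc, 3))     # 0.116, i.e. schools explain 11.6% of the variation
```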

Applying MCMC Methods to Multilevel Models

To make inferences about an unknown parameter in a multilevel model in a Bayesian framework, we first need to find the joint posterior distribution of all the unknown parameters, and then integrate over all the other unknowns. In the case of a 2-level variance components model the joint posterior is

$$p(\beta, u, \sigma^2_u, \sigma^2_e \mid y) \propto p(y \mid \beta, u, \sigma^2_e)\, p(u \mid \sigma^2_u)\, p(\beta)\, p(\sigma^2_u)\, p(\sigma^2_e).$$

In all but the simplest examples this problem is virtually impossible to solve analytically. An alternative approach is necessary, and this is where MCMC methods fit in.

Although the joint posterior distribution is rather nasty, the conditional posteriors for the various unknowns, given all the other unknowns, can be shown to have forms that can be simulated from easily.

Gibbs Sampling

To implement the Gibbs sampling approach we now subdivide our unknowns into four subsets, $\beta$, $u$, $\sigma^2_u$ and $\sigma^2_e$, and calculate their conditional posteriors. The Gibbs sampler in this case works as follows. Firstly, choose starting values for each group of unknown parameters in the model: $\beta^{(0)}$, $u^{(0)}$, $\sigma^{2(0)}_u$ and $\sigma^{2(0)}_e$. Then sample from the following distributions, firstly

$$p(\beta \mid y, u^{(0)}, \sigma^{2(0)}_u, \sigma^{2(0)}_e)$$

to get $\beta^{(1)}$, then

$$p(u \mid y, \beta^{(1)}, \sigma^{2(0)}_u, \sigma^{2(0)}_e)$$

to get $u^{(1)}$, then

$$p(\sigma^2_u \mid y, \beta^{(1)}, u^{(1)}, \sigma^{2(0)}_e)$$

to get $\sigma^{2(1)}_u$, and then finally

$$p(\sigma^2_e \mid y, \beta^{(1)}, u^{(1)}, \sigma^{2(1)}_u)$$

to get $\sigma^{2(1)}_e$. We have now updated all of the unknowns in the model. This process is simply repeated many times, each time using the previously generated parameter values to generate the next set.
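As an illustration, a minimal Gibbs sampler of this kind can be sketched as below. This is a sketch, not MLwiN's implementation: it assumes an intercept-only fixed part, balanced simulated data, a uniform prior on β and Γ⁻¹(ε, ε) priors on both variances, which give the standard conjugate conditional distributions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated balanced two-level data: pupils (level 1) within schools (level 2).
J, n = 48, 20                                  # schools, pupils per school
true_beta, true_s2u, true_s2e = 30.0, 5.0, 40.0
u_true = rng.normal(0.0, np.sqrt(true_s2u), J)
y = true_beta + u_true[:, None] + rng.normal(0.0, np.sqrt(true_s2e), (J, n))
N = J * n

eps = 0.001                                    # Gamma^-1(eps, eps) priors on both variances
beta, u, s2u, s2e = 0.0, np.zeros(J), 1.0, 1.0
burnin, m = 500, 5000
draws = np.empty((m, 3))                       # store (beta, s2u, s2e) after burn-in

for it in range(burnin + m):
    # Step 1: beta | y, u, s2e ~ N(mean of (y_ij - u_j), s2e / N)  (uniform prior)
    beta = rng.normal((y - u[:, None]).mean(), np.sqrt(s2e / N))
    # Step 2: u_j | y, beta, s2u, s2e ~ N(shrunken school residual mean, 1 / precision)
    prec = n / s2e + 1.0 / s2u
    mean_u = (y - beta).sum(axis=1) / s2e / prec
    u = rng.normal(mean_u, np.sqrt(1.0 / prec))
    # Step 3: s2u | u ~ Gamma^-1(eps + J/2, eps + sum(u_j^2)/2)
    s2u = 1.0 / rng.gamma(eps + J / 2.0, 1.0 / (eps + (u ** 2).sum() / 2.0))
    # Step 4: s2e | y, beta, u ~ Gamma^-1(eps + N/2, eps + RSS/2)
    rss = ((y - beta - u[:, None]) ** 2).sum()
    s2e = 1.0 / rng.gamma(eps + N / 2.0, 1.0 / (eps + rss / 2.0))
    if it >= burnin:
        draws[it - burnin] = beta, s2u, s2e

print(draws.mean(axis=0))   # posterior means, roughly (30, 5, 40)
```

Each pass through the loop is one complete update of all the unknowns, using the most recently generated values of the other blocks, exactly as described above.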

MCMC Methods

Burn-in

It is general practice to throw away the first n values generated, to allow the Markov chain to approach its equilibrium distribution, namely the joint posterior distribution of interest. These n values are known as a burn-in, and a burn-in is discarded in the example given here.

Finding Estimates

We continue generating values after the burn-in for another m iterates. These m values are then averaged to give point estimates of the parameters of interest. Posterior standard deviations, analogous to frequentist standard errors for the estimates, can be obtained by calculating the standard deviations of the m values.

MCMC Estimation of the VC model in MLwiN

MCMC Diagnostics for the level 2 variance parameter
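The burn-in, averaging and posterior standard deviation steps just described can be sketched as follows; the chain below is a synthetic stand-in for MCMC output, not an actual MLwiN run:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a Markov chain of draws for one parameter (e.g. the level 2 variance):
chain = rng.normal(5.157, 1.2, 5500)

burnin = 500                  # discard the first n values
kept = chain[burnin:]         # the m post-burn-in iterates

estimate = kept.mean()        # point estimate: the average of the m values
post_sd = kept.std(ddof=1)    # posterior sd, analogous to a frequentist SE
ci = np.quantile(kept, [0.025, 0.975])   # 95% credible interval from the quantiles
print(estimate, post_sd, ci)
```

The same quantile calculation is what produces the 95% 'credible' intervals used in the method comparisons later.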

Method Comparisons

In reality Bayesian statisticians know no more than frequentists!! It is just the case that if they have some prior information/knowledge about the problem then they have an elegant way of combining this information with the data to form their statistical models.

In the case when we have no prior knowledge then we would hope to come to similar conclusions and get similar estimates whether we use a frequentist or Bayesian approach.

Consequently, in my PhD thesis (Browne 1998) I performed simulation studies for both the VC and RSR models, which compared MCMC methods with two 'default' prior choices along with the IGLS and RIGLS (a restricted maximum likelihood extension to IGLS) methods. The RSR simulation results can be read in Browne and Draper (2000); here we will concentrate on the VC simulations.

These simulations were based on the JSP dataset. I considered firstly 8 study designs, in which I varied the number of schools (6, 12, 24, 48) and whether the study was balanced (18 pupils per school) or unbalanced (resembling the JSP dataset in degree of unbalance).

We also changed the simulation values of the level 1 and 2 variances to give the school effects more or less importance (see Browne and Draper 2002). We varied the value of the level 1 variance between 10, 40 and 80, and the value of the level 2 variance between 1, 10 and 40.

For each study design we simulated 1000 datasets.

We looked at two properties: point estimate bias and interval estimate coverage probability.

Bias comparison

For each of the 1,000 simulated datasets we can calculate an estimate of the level 2 variance parameter, and we know the true value because this is fixed for the study.

We find that the bias in all methods is generally reduced as the size of the study increases. Balanced studies generally have less bias than unbalanced although the differences are small.

The studies when the true level 2 variance is much smaller than the level 1 variance induce greater bias. These studies call into question whether fitting a multilevel model is sensible in such situations and maybe a single level model would be preferable.

IGLS is negatively biased with biases as high as 23% induced in the smallest studies reducing to 2% in the JSP 48 schools design. RIGLS corrects for this bias in all cases with very little bias in all designs.

The MCMC method with Γ⁻¹(ε, ε) priors for the variances shows little bias when the MEDIAN is used as a point estimate (–6% in the smallest study), although in the 1:80 variance ratio study there is large bias.

The MCMC method with U(0, 1/ε) priors for the variances shows little bias when the MODE is used as a point estimate, except in very small studies (6 schools). Here 100% positive biases occur, although in studies with at least 12 schools the largest bias is 1%. Again the 1:80 variance ratio study shows large bias.

Interval Estimates: Coverage Properties

For the MCMC methods we can work out accurate 95% 'credible' intervals from the quantiles of the chains produced. For the IGLS/RIGLS methods we can either assume Gaussian intervals, i.e. point estimate ± 2 standard errors, or perform clever approximations, e.g. Gamma, Lognormal or other variance-stabilising transformations. We can then calculate what percentage of the 95% intervals contain the true value used in the simulation.
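The coverage calculation itself can be illustrated with a stylised sketch. Note that the 'model fit' here is a crude stand-in (a scaled chi-squared draw for the variance estimate), not an actual IGLS/RIGLS refit, so the resulting coverage will not match the reported figures:

```python
import numpy as np

rng = np.random.default_rng(42)
true_s2u = 10.0
n_sims, n_schools = 1000, 48

covered = 0
for _ in range(n_sims):
    # Stand-in for refitting the model to a simulated dataset: a rough
    # sampling distribution for the level 2 variance estimate.
    est = true_s2u * rng.chisquare(n_schools - 1) / (n_schools - 1)
    se = est * np.sqrt(2.0 / (n_schools - 1))
    lower, upper = est - 2 * se, est + 2 * se   # Gaussian interval
    covered += (lower <= true_s2u <= upper)

print(covered / n_sims)   # actual coverage of the nominal 95% interval
```

The same bookkeeping, with real refits in place of the stand-in, is what produces the coverage percentages reported below.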

Results

Assuming Gaussian intervals for both IGLS/RIGLS gives intervals that cover the correct answer < 95% of the time. This varies from 72% for the smallest study up to 91% for the largest.

The various transforms improve things but still get only between 93% and 94% in the largest studies.

The MCMC Γ⁻¹(ε, ε) priors give intervals that cover the correct answer from 89% of the time for the smallest study up to 94% for the largest.

The MCMC U(0, 1/ε) priors give intervals that cover the correct answer from 91% of the time for the smallest study up to 94% for the largest, although this method yields the widest intervals.

Summary

Both frequentist and Bayesian methods can yield approximately unbiased point estimates for multilevel models.

Both approaches experience difficulty in attaining nominal coverage of interval estimates when (i) the number of level 2 units is small and (ii) the variance ratio (level 2 variance / level 1 variance) is small.

Model Comparison (Maximum Likelihood Estimation)

Using the Maximum likelihood IGLS method, model comparison is straightforward for Normal responses assuming the two models are nested.

For each model we have a deviance and in reality IGLS fits a huge multivariate normal response model with a structured variance matrix, Y~MVN(Xβ,V) that is equivalent to the required multilevel model.

Therefore assuming we use the multivariate Normal likelihood then each random variance and covariance is a parameter in the structured variance matrix and so uses up a degree of freedom.

So for example in moving from the variance components model to the random slopes regression model we add 3 new parameters: the fixed effect for MATH3, the between slopes variance and the covariance between slopes and intercept.

Therefore we can compute the change in deviance, which has a large-sample χ² distribution with 3 degrees of freedom, and so we can perform a likelihood ratio test.

Here the change in deviance = 5829.23 – 5514.16 = 315.07, which is highly significant, so unsurprisingly the random slopes regression model is significantly better than the variance components model.
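This likelihood ratio test can be reproduced numerically. To stay self-contained, a closed-form survival function for the χ² distribution with 3 degrees of freedom is used rather than a statistics library:

```python
import math

def chi2_sf_3df(x):
    """Survival function of a chi-squared distribution with 3 df (closed form)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

dev_vc, dev_rsr = 5829.23, 5514.16   # deviances from the text
change = dev_vc - dev_rsr            # 315.07
df = 3                               # MATH3 effect, slope variance, intercept-slope covariance
p_value = chi2_sf_3df(change)
print(round(change, 2), p_value)     # change = 315.07; p-value effectively zero
```

As a sanity check, `chi2_sf_3df(7.815)` returns approximately 0.05, the usual 5% critical value for 3 degrees of freedom.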

Model Comparison (MCMC)

Spiegelhalter, Best et al. (2002) introduced the Deviance Information Criterion (DIC) diagnostic in a paper to the RSS this year, which got a mixed reaction from the Bayesian community present.

The DIC is an extension of the AIC diagnostic that can be calculated directly from the chains produced by an MCMC run and is a diagnostic that combines model fit with complexity.

The diagnostic is DIC = D(θ̄) + 2pD, where pD is the 'effective number of parameters', which can be calculated from the chain as the difference between the mean deviance in the chain (D̄) and the deviance at the mean values of the parameters (D(θ̄)).

In random effect models the ‘effective number of parameters’ is less than the nominal number that an equivalent fixed effect model would have due to the additional distributional assumption for the random effects.

Model                      Nominal params   Effective params (pD)   Deviance D(θ̄)   DIC
Null                       2                2.00                    5881.6           5885.6
Variance Components        50               34.09                   5740.8           5809.0
Simple Regression          3                3.03                    5592.2           5598.2
Random Slopes Regression   99               44.07                   5397.9           5486.1

So here the models get ‘better’ as we read down the table and the DIC decreases.
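The DIC bookkeeping can be sketched as below. The deviance chain here is a constant stand-in, back-calculated from the reported pD and D(θ̄) for the random slopes model purely to check the arithmetic; a real chain would come from the MCMC run:

```python
import numpy as np

def dic(deviance_chain, dev_at_theta_bar):
    """DIC from an MCMC run: pD = mean deviance - deviance at the posterior
    mean of the parameters; DIC = D(theta_bar) + 2 * pD (equivalently Dbar + pD)."""
    p_d = np.mean(deviance_chain) - dev_at_theta_bar
    return dev_at_theta_bar + 2 * p_d, p_d

# Check against the random slopes regression row of the table above.
dev_at_theta_bar, p_d_reported = 5397.9, 44.07
chain = np.full(1000, dev_at_theta_bar + p_d_reported)   # stand-in deviance chain
dic_value, p_d = dic(chain, dev_at_theta_bar)
print(round(dic_value, 1), round(p_d, 2))                # about 5486.0 and 44.07
```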

Random Effects Logistic Regression Model

When we have binomial data and several levels to our dataset, we can extend logistic regression models to include random effects in a similar way to the Normal case.

Consider the following model:

$$y_{ij} \sim \text{Bernoulli}(p_{ij}), \quad \text{where} \quad \text{logit}(p_{ij}) = (X\beta)_{ij} + u_j, \qquad u_j \sim N(0, \sigma^2_u).$$

Then, assuming uniform priors, the conditional posterior distribution for $\beta$ is

$$p(\beta \mid y, u, \sigma^2_u) \propto \prod_{ij} \frac{\exp\{((X\beta)_{ij} + u_j)\, y_{ij}\}}{1 + \exp\{(X\beta)_{ij} + u_j\}}.$$

This distribution does not simplify to a standard distribution, so the difficulty is how to sample from it. Two approaches will now be discussed.

Metropolis Hastings Sampling

Metropolis-Hastings sampling works in a similar way to Gibbs sampling, in that each group of parameters is updated in turn and then the procedure is repeated.

Proposal Distribution

The updating procedure is different, however. For each parameter at each timestep, a new value is generated from a proposal distribution. Then this new value is compared with the old value. The new value is accepted with a probability chosen so that the draws are actually simulating from the posterior distribution. If a value is rejected, then the parameter retains its old value.

The advantage of this method is that we do not have to calculate the whole conditional posterior distribution, but simply have to evaluate it at two points per iteration. The disadvantage is that we have to construct good proposal distributions for the parameters of interest.
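A minimal sketch of one such MH update, for a single fixed effect in a toy random effects logistic model with the level 2 residuals held fixed (this is not MLwiN's adaptive implementation; the proposal sd is simply fixed here):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: y_ij ~ Bernoulli(p_ij) with logit(p_ij) = beta + u_j; the
# residuals u_j stay at their current values within this beta update.
J, n = 60, 30
u_rep = np.repeat(rng.normal(0.0, 0.5, J), n)
true_beta = 0.4
y = rng.random(J * n) < 1.0 / (1.0 + np.exp(-(true_beta + u_rep)))

def log_cond_post(beta):
    """Log conditional posterior of beta up to a constant (uniform prior)."""
    eta = beta + u_rep
    return float(np.sum(y * eta - np.log1p(np.exp(eta))))

beta, chain = 0.0, []
sd_prop = 0.1                            # Normal proposal sd (tuned adaptively in MLwiN)
cur_lp = log_cond_post(beta)
for _ in range(5000):
    prop = rng.normal(beta, sd_prop)     # generate a new value from the proposal
    prop_lp = log_cond_post(prop)        # evaluate the conditional posterior at it
    # Accept with probability min(1, posterior ratio); otherwise keep the old value.
    if np.log(rng.random()) < prop_lp - cur_lp:
        beta, cur_lp = prop, prop_lp
    chain.append(beta)

print(np.mean(chain[500:]))              # posterior mean, near the true value 0.4
```

Note that each iteration evaluates the conditional posterior at just two points, the current and proposed values, as described above.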

Metropolis Hastings Sampling in MLwiN

- Actually hybrid Gibbs/MH sampling.
- Gibbs steps for variance parameters.
- MH steps for fixed effects and residuals.
- Univariate Normal proposal distributions.
- Use an adapting method to calculate proposal distributions.
- Aim for proposal distributions such that we accept a desired proportion of proposals.

Adaptive Rejection (AR) Sampling

WinBUGS (Spiegelhalter et al. 2000) uses Adaptive Rejection (AR) Sampling (Gilks and Wild 1992) in place of the univariate MH sampling used in MLwiN.

- Density f(x), which is difficult to sample from.
- Envelope g(x), constructed from tangents to log f(x) at n points.
- f(x) must be log-concave.
- When a point is chosen from g(x), its tangent is evaluated and a new envelope function is created.
- Accept a sampled point y with probability min(1, f(y)/g(y)).
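The envelope-and-accept step can be illustrated with a deliberately simplified, non-adaptive version: a standard Normal target with a fixed envelope built from tangents to log f at x = ±1, which gives log g(x) = 1/2 − |x|, a Laplace-shaped envelope. Real AR sampling, as in WinBUGS, refines the envelope adaptively as points are rejected:

```python
import math, random

random.seed(3)

def log_f(x):
    """Log target density (unnormalised): standard Normal, which is log-concave."""
    return -0.5 * x * x

def log_g(x):
    """Log envelope from tangents to log_f at x = -1 and x = +1.
    Tangent at a: -a*a/2 - a*(x - a); the lower of the two tangents gives
    log_g(x) = 1/2 - |x|, which is >= log_f(x) everywhere."""
    return 0.5 - abs(x)

def sample_envelope():
    """Draw from the density proportional to exp(-|x|) (Laplace, by inverse CDF)."""
    u = random.random()
    return math.log(2 * u) if u < 0.5 else -math.log(2 * (1 - u))

def envelope_rejection_draw():
    # Rejection step: accept y with probability f(y) / g(y) <= 1.
    while True:
        y = sample_envelope()
        if math.log(random.random()) < log_f(y) - log_g(y):
            return y

draws = [envelope_rejection_draw() for _ in range(20000)]
mean = sum(draws) / len(draws)
var = sum(d * d for d in draws) / len(draws)
print(round(mean, 2), round(var, 2))   # close to N(0, 1): about 0.0 and 1.0
```

The adaptive step that this sketch omits is the key to AR sampling's efficiency: each rejected point tightens the envelope, so the acceptance rate rises as the sampler runs.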

Bangladeshi Fertility Example

Dataset is part of the 1988 Bangladesh Fertility Survey and consists of 1934 women grouped within 60 districts. The response of interest is whether women were using contraceptives at the time of the survey.

Here we see that the probability of using contraceptives decreases as women age, but increases as they have more children (the base category is no children). Also, women in urban areas are much more likely to use contraceptives. There is significant variation between the districts.

Comparison between MCMC methods

The above model was fitted using the MH/Gibbs hybrid method in MLwiN and the AR method in WinBUGS. For each method we ran the model for a total of 50,000 iterations following a burn-in of 500 iterations.

MH took 9 minutes 38 seconds in MLwiN. AR took 43 minutes 3 seconds in WinBUGS a 4.47 fold increase.

Effective sample size (ESS) is based on the autocorrelations in the chain:

$$\text{ESS} = \frac{50000}{1 + 2\sum_{k=1}^{\infty} \rho(k)}.$$

Parameter   MH/Gibbs Estimate   MH/Gibbs ESS   AR Estimate   AR ESS   Mult. Factor
β1          -1.697              758            -1.693        3689     4.86
β2          -0.027              2198           -0.026        9061     4.12
β3          1.120               2022           1.119         8500     4.20
β4          1.370               1621           1.369         6653     4.10
β5          1.347               991            1.345         4302     4.34
β6          0.735               4943           0.734         18964    3.84
σ²u         0.237               3836           0.233         9185     2.39

Here there is no real winner and this illustrates the balance between the speed of the MH method versus the lower autocorrelations induced by the AR method.
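The ESS formula can be estimated from a chain by summing sample autocorrelations, truncating at the first non-positive one (a common rule of thumb; other truncation rules exist). Here it is checked on an AR(1) chain with autocorrelation 0.5, for which the true ESS is n(1 − 0.5)/(1 + 0.5) = n/3:

```python
import numpy as np

def ess(chain, max_lag=200):
    """Effective sample size: n / (1 + 2 * sum of autocorrelations rho(k))."""
    x = np.asarray(chain) - np.mean(chain)
    n = x.size
    acov = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(max_lag + 1)])
    rho = acov / acov[0]
    # Truncate the sum at the first non-positive sample autocorrelation.
    nonpos = np.where(rho <= 0)[0]
    cut = nonpos[0] if nonpos.size else max_lag + 1
    return n / (1 + 2 * rho[1:cut].sum())

# AR(1) chain with lag-1 autocorrelation phi = 0.5.
rng = np.random.default_rng(4)
n, phi = 50000, 0.5
e = rng.normal(size=n)
x = np.empty(n)
x[0] = e[0]
for t in range(1, n):
    x[t] = phi * x[t - 1] + e[t]

print(round(ess(x)))   # roughly n/3, i.e. about 16700
```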

Current Research Interests

- Cross-classified and multiple-membership models
- Multivariate response models with missing data
- Measurement error modelling
- Multilevel factor analysis modelling
- Modelling complex variance functions and heteroscedasticity

Useful Web Pages

- http://multilevel.ioe.ac.uk/team/bill.html (my publications page)
- http://multilevel.ioe.ac.uk (Centre for Multilevel Modelling homepage)
- http://multilevel.ioe.ac.uk/dev/index.html (new development version of MLwiN homepage)