An Introduction to Bayesian Multilevel (Hierarchical) Modelling Using MLwiN
by William Browne and Harvey Goldstein
Centre for Multilevel Modelling, Institute of Education, London

Summary
- What is multilevel modelling?
- Variance components and random slopes regression models
- Comparisons between frequentist and Bayesian approaches
- Model comparison in multilevel modelling
- Binary responses: multilevel logistic regression models
- Comparisons between various MCMC methods
- Areas of current research

Junior School Project (JSP) Dataset
The JSP dataset (Mortimore et al. 1988) was a longitudinal study of around 2,000 primary school children from 50 schools in the Inner London Education Authority. The sample of interest here consists of 887 pupils from 48 schools. The main outcome is a score (out of 40) on a mathematics test taken in year 5 (MATH5). The main predictor is a score, again out of 40, on a mathematics test taken in year 3 (MATH3). Other predictors are child gender and father's occupation status (manual/non-manual).

Why Multilevel Modelling?
We could fit a simple linear regression of MATH5 on MATH3 to the full dataset, or we could fit separate regressions of MATH5 on MATH3 for each school independently. The first approach ignores school effects (differences) whilst the second ignores similarities between the schools. Solution: a multilevel (or hierarchical, or random effects) model treats both the students and the schools as random variables and can be seen as a compromise between the two single-level approaches. The predicted regression lines for each school produced by a multilevel model are a weighted average of the individual school lines and the full-dataset regression line. A model with both random school intercepts and random school slopes is called a random slopes regression model.

Random Slopes Regression (RSR) Model
MLwiN (Rasbash et al. 2000) allows the user to specify the model via an Equations window. Here we see an average line MATH5 = 30.348 + 0.612*MATH3. Note: MATH3 has here been centred and has range –21 to 11.
The variance of the 48 school intercepts is 4.874.
The variance of the 48 school slopes is 0.034.
The covariance between the intercepts and slopes is –0.374.
The unexplained residual variance is 26.964.
This model has been fitted using a maximum likelihood (ML) method known as IGLS (iterative generalized least squares). IGLS is an iterative procedure based on estimating the random and fixed parts of the multilevel model alternately, assuming the estimates for the other part are correct. This involves iterating between two GLS model-fitting steps until the estimates converge to the ML point estimates.
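In equation form, the fitted model can be written as follows (the algebraic display is implicit in the MLwiN Equations window rather than reproduced in the extracted text); for pupil i in school j,

$$
\text{MATH5}_{ij} = \beta_0 + u_{0j} + (\beta_1 + u_{1j})\,\text{MATH3}_{ij} + e_{ij},
$$
$$
\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2_{u0} & \sigma_{u01} \\ \sigma_{u01} & \sigma^2_{u1} \end{pmatrix} \right), \qquad e_{ij} \sim N(0, \sigma^2_e),
$$

with the estimates above corresponding to $\hat\beta_0 = 30.348$, $\hat\beta_1 = 0.612$, $\hat\sigma^2_{u0} = 4.874$, $\hat\sigma^2_{u1} = 0.034$, $\hat\sigma_{u01} = -0.374$ and $\hat\sigma^2_e = 26.964$.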
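For readers without MLwiN, a minimal Python sketch (not from the original slides) fits the same random slopes model by ML using statsmodels' MixedLM; the file name `jsp.csv` and the column names `math5`, `math3` and `school` are assumed placeholders for however the JSP data are stored.

```python
# Sketch: ML fit of the random slopes regression model in Python.
# Assumes a CSV of the JSP data with columns 'math5', 'math3', 'school';
# these names and the file are illustrative, not from the original slides.
import pandas as pd
import statsmodels.formula.api as smf

jsp = pd.read_csv("jsp.csv")
jsp["math3c"] = jsp["math3"] - jsp["math3"].mean()   # centre the predictor

# Random intercept and random MATH3 slope for each school.
model = smf.mixedlm("math5 ~ math3c", data=jsp,
                    groups=jsp["school"], re_formula="~math3c")
fit = model.fit(reml=False)   # reml=False requests ML, comparable to IGLS
print(fit.summary())          # fixed effects and (co)variance estimates
```

MixedLM maximises the same likelihood that IGLS does, so the point estimates should agree closely even though the numerical algorithm differs.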
Predicted school lines from the random slopes regression model
Here we highlight a poor-performing school in light blue and a good-performing school in red. Note that the school lines in this dataset have a negative covariance between the intercepts and slopes. Hence we see all the lines appear to converge at the top end of the intake scale, and this is a direct consequence of the ceiling effect caused by the MATH5 test having a maximum score of 40.

There are simpler multilevel models than an RSR that simply partition the residual variance into the possible sources of variation (in our example, pupils and schools). Such models are called variance components models.

Variance Components Model
Here we have fitted a simple mean value for MATH5, which is estimated as 30.605, and we have partitioned the variance into variance between schools and residual variance between pupils.
Variance between schools = 5.157
Residual variance = 39.281
The intra-school correlation coefficient (ICC) for the schools in this dataset measures the proportion of variation explained by the schools and equals 5.157/(5.157 + 39.281) = 0.116, so schools explain 11.6% of the raw variation in this dataset.

Applying MCMC Methods to Multilevel Models
To make inferences about an unknown parameter in a multilevel model in a Bayesian framework, we first need to find the joint posterior distribution of all the unknown parameters and then integrate over all the other unknowns. In the case of a 2-level variance components model the joint posterior is

$$p(\beta, u, \sigma^2_u, \sigma^2_e \mid y) \propto p(y \mid \beta, u, \sigma^2_e)\, p(u \mid \sigma^2_u)\, p(\beta)\, p(\sigma^2_u)\, p(\sigma^2_e).$$

In all but the simplest examples this problem is virtually impossible to solve analytically. An alternative approach is necessary, and this is where MCMC methods fit in. Although the joint posterior distribution is rather nasty, the conditional posteriors for the various unknowns, given all the other unknowns, can be shown to have forms that can be simulated from easily.

Gibbs Sampling
To implement the Gibbs sampling approach we now subdivide our unknowns into four subsets, $\beta$, $u$, $\sigma^2_u$ and $\sigma^2_e$, and calculate their conditional posteriors. The Gibbs sampler in this case works as follows. Firstly, choose starting values for each group of unknown parameters in the model: $\beta^{(0)}$, $u^{(0)}$, $\sigma^{2(0)}_u$ and $\sigma^{2(0)}_e$. Then sample from the following distributions:
firstly from $p(\beta \mid y, u^{(0)}, \sigma^{2(0)}_u, \sigma^{2(0)}_e)$ to get $\beta^{(1)}$,
then from $p(u \mid y, \beta^{(1)}, \sigma^{2(0)}_u, \sigma^{2(0)}_e)$ to get $u^{(1)}$,
then from $p(\sigma^2_u \mid y, \beta^{(1)}, u^{(1)}, \sigma^{2(0)}_e)$ to get $\sigma^{2(1)}_u$,
then finally from $p(\sigma^2_e \mid y, \beta^{(1)}, u^{(1)}, \sigma^{2(1)}_u)$ to get $\sigma^{2(1)}_e$.
We have now updated all of the unknowns in the model. This process is simply repeated many times, each time using the previously generated parameter values to generate the next set.
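Each of these conditional distributions has a standard form for this model: normal for $\beta$ and for each $u_j$, and inverse Gamma for the two variances under conjugate priors. The sketch below is an illustrative Python reimplementation of this sampler (not MLwiN's code), assuming a flat prior on $\beta$ and $\Gamma^{-1}(\varepsilon, \varepsilon)$ priors on both variances; `y` and `school` are assumed arrays holding each pupil's MATH5 score and school label.

```python
# Illustrative Gibbs sampler for the 2-level variance components model
#   y_ij = beta + u_j + e_ij,  u_j ~ N(0, sig2u),  e_ij ~ N(0, sig2e),
# with a flat prior on beta and Gamma^-1(eps, eps) priors on both variances.
# A sketch of the scheme described above, not MLwiN's implementation.
import numpy as np

def gibbs_vc(y, school, n_iter=5500, eps=0.001, seed=1):
    rng = np.random.default_rng(seed)
    schools = np.unique(school)              # sorted unique school labels
    idx = np.searchsorted(schools, school)   # map each pupil to a school index
    J, N = len(schools), len(y)
    n_j = np.bincount(idx, minlength=J)      # pupils per school

    beta, sig2u, sig2e = float(np.mean(y)), 1.0, 1.0   # starting values
    u = np.zeros(J)
    chain = np.empty((n_iter, 3))            # columns: beta, sig2u, sig2e

    for t in range(n_iter):
        # 1. p(beta | y, u, sig2e): normal (flat prior on beta)
        beta = rng.normal(np.mean(y - u[idx]), np.sqrt(sig2e / N))

        # 2. p(u_j | y, beta, sig2u, sig2e): normal, all schools at once
        resid_sum = np.bincount(idx, weights=y - beta, minlength=J)
        prec = n_j / sig2e + 1.0 / sig2u
        u = rng.normal(resid_sum / sig2e / prec, np.sqrt(1.0 / prec))

        # 3. p(sig2u | u): inverse Gamma(eps + J/2, eps + sum(u_j^2)/2)
        sig2u = 1.0 / rng.gamma(eps + J / 2, 1.0 / (eps + 0.5 * np.sum(u ** 2)))

        # 4. p(sig2e | y, beta, u): inverse Gamma(eps + N/2, eps + RSS/2)
        rss = np.sum((y - beta - u[idx]) ** 2)
        sig2e = 1.0 / rng.gamma(eps + N / 2, 1.0 / (eps + 0.5 * rss))

        chain[t] = beta, sig2u, sig2e
    return chain
```

Each pass through the loop is one complete Gibbs iteration; running `gibbs_vc` on the MATH5 scores grouped by school yields chains for $\beta$, $\sigma^2_u$ and $\sigma^2_e$, whose first iterates are discarded as a burn-in, as described next.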
MCMC Methods: Burn-in
It is general practice to throw away the first n values generated, to allow the Markov chain to approach its equilibrium distribution, namely the joint posterior distribution of interest. These n values are known as a burn-in, and in the example given we will use a burn-in of 500 iterates.

Finding Estimates
We continue generating values after the burn-in for another m iterates. These m values are then averaged to give an estimate of the parameter of interest, in this case the level 2 variance $\sigma^2_u$. Posterior standard deviations, which play the role of frequentist standard errors for the estimates, can be obtained by calculating the standard deviations of the m values.

MCMC Estimation of the VC Model in MLwiN
MCMC Diagnostics for the Level 2 Variance Parameter
(MLwiN screenshots in the original slides.)

Method Comparisons
In reality, Bayesian statisticians know no more than frequentists! It is just the case that if they have some prior information or knowledge about the problem, then they have an elegant way of combining this information with the data to form their statistical models. When we have no prior knowledge, we would hope to come to similar conclusions and get similar estimates whether we use a frequentist or a Bayesian approach. Consequently, in my PhD thesis (Browne 1998) I performed simulation studies for both the VC and RSR models, which compared MCMC methods with two 'default' prior choices along with the IGLS and RIGLS (a restricted maximum likelihood extension to IGLS) methods. The RSR simulation results can be read in Browne and Draper (2000); here we will concentrate on the VC simulations. These simulations were based on the JSP dataset, and I considered firstly 8 study designs in which I varied the number of schools (6, 12, 24, 48) and whether the study was balanced (18 pupils per school) or unbalanced (resembling the JSP dataset in degree of imbalance). We also changed the simulation values of the level 1 and level 2 variances to give the school effects more or less importance (see Browne and Draper 2002): we varied the value of the level 1 variance between 10, 40 and 80, and the value of the level 2 variance between 1, 10 and 40. For each study design we simulated 1000 datasets. We looked at two properties: point estimate bias and interval estimate coverage probability.

Bias Comparison
For each of the 1000 simulated datasets we can calculate an estimate of the level 2 variance parameter, and we know the true value because it is fixed for the study. We find that the bias of all methods is generally reduced as the size of the study increases. Balanced studies generally have less bias than unbalanced ones, although the differences are small. The studies where the true level 2 variance is much smaller than the level 1 variance induce greater bias; these studies call into question whether fitting a multilevel model is sensible in such situations, and maybe a single-level model would be preferable. IGLS is negatively biased, with biases as high as 23% in the smallest studies, reducing to 2% in the JSP 48-schools design. RIGLS corrects for this bias, showing very little bias in all designs. The MCMC method with $\Gamma^{-1}(\varepsilon, \varepsilon)$ priors for the variances shows little bias when the MEDIAN is used as a point estimate (−6% in the smallest study), although in the 1:80 variance-ratio study there is large bias. The MCMC method with $U(0, 1/\varepsilon)$ priors for the variances shows little bias when the MODE is used as a point estimate, except in very small studies (6 schools), where 100% positive biases occur; in studies with at least 12 schools the largest bias is 1%. Again, the 1:80 variance-ratio study shows large bias.

Interval Estimates: Coverage Properties
For the MCMC methods we can work out accurate 95% 'credible' intervals from the quantiles of the chains produced. For the IGLS/RIGLS methods we can either assume Gaussian intervals, i.e. point estimate ± 2 standard errors, or perform cleverer approximations, e.g. a Gamma, lognormal or other variance-stabilising transformation.
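Concretely (an illustration reusing the `gibbs_vc` sketch above, not the original course code), the burn-in, posterior summaries and both kinds of interval might be computed as follows:

```python
# Posterior summaries from the Gibbs chains, following the recipe above.
# gibbs_vc is the illustrative sampler sketched earlier; y and school are
# assumed data arrays.
import numpy as np

chain = gibbs_vc(y, school, n_iter=5500)
sig2u = chain[500:, 1]                    # discard a burn-in of 500 iterates

est = sig2u.mean()                        # posterior mean point estimate
psd = sig2u.std(ddof=1)                   # posterior standard deviation
med = np.median(sig2u)                    # posterior median, used in the bias study
ci  = np.percentile(sig2u, [2.5, 97.5])   # 95% credible interval from quantiles

# Gaussian-style interval of the kind used with IGLS/RIGLS estimates:
gauss = (est - 2 * psd, est + 2 * psd)
```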
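The bias and coverage calculations themselves are simple to express. The sketch below is a hypothetical rendering of one cell of such a simulation study (far slower than the purpose-built code behind the published results); it records the bias of the posterior median and the coverage of the 95% credible interval for one balanced design.

```python
# Hypothetical sketch of one simulation-study cell: a balanced design with
# 48 schools of 18 pupils, true level 2 variance 10 and level 1 variance 40.
import numpy as np

rng = np.random.default_rng(0)
J, n_per = 48, 18
sig2u_true, sig2e_true = 10.0, 40.0
school = np.repeat(np.arange(J), n_per)

medians, covered = [], []
for rep in range(1000):                     # 1000 simulated datasets per design
    u = rng.normal(0.0, np.sqrt(sig2u_true), J)
    y = 30.0 + u[school] + rng.normal(0.0, np.sqrt(sig2e_true), J * n_per)
    sig2u = gibbs_vc(y, school, n_iter=5500)[500:, 1]   # level 2 variance chain
    medians.append(np.median(sig2u))
    lo, hi = np.percentile(sig2u, [2.5, 97.5])
    covered.append(lo <= sig2u_true <= hi)

bias = 100 * (np.mean(medians) - sig2u_true) / sig2u_true  # bias in percent
coverage = np.mean(covered)                                # nominal value 0.95
print(f"bias = {bias:.1f}%, coverage = {coverage:.3f}")
```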