
The Bayesian Perspective in the Context of Large Scale Assessments

David Kaplan
Department of Educational Psychology
University of Wisconsin – Madison

17 February 2012

The research reported in this paper was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D110001 to The University of Wisconsin – Madison. The opinions expressed are those of the authors and do not represent the views of the Institute or the U.S. Department of Education.

Introduction

"Probability does not exist."

– Bruno de Finetti

● Bruno de Finetti was an Italian probabilist and one of the most important contributors to the subjectivist Bayesian movement.

● His statement can be considered the foundational view of the subjectivist branch of Bayesian statistics.

● My goal is to argue for a subjectivist Bayesian approach to conducting research using data from large scale assessments (LSAs).

OUTLINE

1. Introduction to Bayesian inference

2. An Example: Multilevel SEM applied to PISA

3. Discussion

1. Introduction to Bayesian Inference

Frequentist v. Bayesian Probability

● For frequentists, the basic idea is that probability is represented by the model of the long run.

● The physical representation is the coin toss, which relates to the idea of a very large (actually infinite) number of repeated experiments.

● The frequentist formulation rests on the idea of equally probable and stochastically independent events.

● This underlies the Fisher and Neyman-Pearson schools of statistics – the conventional methods of statistics we most often use.

● The entire structure of Neyman-Pearson hypothesis testing is based on frequentist probability.

● Our conclusions regarding the null and alternative hypotheses presuppose the idea that we could conduct the same experiment an infinite number of times.

● Our interpretation of confidence intervals also assumes a fixed parameter and CIs that vary over an infinitely large number of identical experiments.


● But there is another view of probability as subjective.

● The physical model in this case is that of the "bet".

● Consider the situation of betting on who will win the World Series.

● Here, probability is not based on some notion of an infinite number of repeatable and stochastically independent events, but rather on some notion of how much knowledge you have and how much you are willing to bet.

● Thus, subjective probability allows one to address questions such as "what is the probability that the Cubs will win the World Series?" Relative frequency supplies information, but it is not the same as probability and can be quite different.

● This notion of subjective probability underlies Bayesian statistics.

Bayes' Theorem

● Consider the joint probability of two events, X and Y – for example, observing smoking and lung cancer jointly.

● The joint probability can be written as

p(X, Y) = p(Y|X)p(X) (1)

● Similarly,

p(Y, X) = p(X|Y)p(Y) (2)

● Because these are symmetric, we can set them equal to each other to obtain the following:

p(Y|X)p(X) = p(X|Y)p(Y) (3)

● Therefore,

p(Y|X) = p(X|Y)p(Y) / p(X) (4)

and the theorem (Bayes' theorem) states

p(X|Y) = p(Y|X)p(X) / p(Y) (5)

● Why do we care? Because this is how you could go from the probability of having cancer given that the patient smokes, to the probability that the patient smokes given that he/she has cancer.

● We simply need the marginal probability of smoking and the marginal probability of cancer (what we will call prior probabilities).
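To make this inversion concrete, here is a minimal sketch of equation (5) in Python. The marginal and conditional probabilities are made-up illustrative values, not estimates from any real epidemiological data.

```python
# Bayes' theorem (equation 5): p(X|Y) = p(Y|X) p(X) / p(Y).
# All three inputs below are hypothetical values chosen for illustration.

p_cancer = 0.01               # marginal (prior) probability of cancer, p(X)
p_smoke = 0.20                # marginal probability of smoking, p(Y)
p_smoke_given_cancer = 0.60   # conditional probability p(Y|X)

p_cancer_given_smoke = p_smoke_given_cancer * p_cancer / p_smoke
print(f"p(cancer | smokes) = {p_cancer_given_smoke:.3f}")  # 0.030
```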

● A fundamental difference (among many) between Bayesians and frequentists concerns the nature of parameters.

● Frequentists view parameters as unknown fixed characteristics of a population, estimated from sample data.

● Bayesians view parameters as unknown but random quantities that can be characterized by a distribution.

● In fact, Bayesians view all unknowns as possessing probability distributions.

● This is a fundamental difference! It has implications for estimation.

● Because Bayesians view parameters probabilistically, we need to assign them probability distributions.

● Consider again Bayes' theorem in light of data and parameters. Since everything is random, possessing a probability distribution, we can write Bayes' theorem as

p(θ|Y) = p(Y|θ)p(θ) / p(Y) (6)

where p(θ|Y) is the posterior distribution of θ given the data Y, p(θ) is the prior distribution of θ, and p(Y) is the marginal distribution of the data.

● Note that the marginal distribution p(Y) does not involve model parameters. It is there to normalize the posterior so that it integrates to 1.0. As such, we can ignore it and write Bayes' theorem as

p(θ|Y) ∝ p(Y|θ)p(θ) (7)

● Let's look at Bayes' theorem again.

p(θ|Y) ∝ p(Y|θ)p(θ) (8)

● This simple expression reflects a view about the evolutionary development of knowledge. It says

Current knowledge ∝ Current evidence/data × Prior knowledge

● The problem lies in what we consider to be prior knowledge.

● Frequency is not enough. We use frequency as a piece of information to inform a subjective assessment of uncertainty, which we quantify as probability.

● Probability does not exist as a feature of the external world. It is a subjective quantification of uncertainty. This is the essence of de Finetti's quote.

The Prior Distribution

● A critically important part of Bayesian statistics is the prior distribution.

● The prior distribution incorporates our belief about the distribution of the parameters, where that distribution can be characterized by hyperparameters ω.

● Thus, a full Bayesian hierarchical model can be written as

p(θ|Y) ∝ p(Y|θ)p(θ|ω)p(ω) (9)

where θ are the model parameters, ω are hyperparameters, p(θ|ω) is the prior distribution, and p(ω) describes the distribution of the hyperparameters.

● But note that this does characterize multilevel modeling. In practice, we rarely go this deep.

● Prior distributions can be conjugate or non-conjugate.

● Conjugate priors are those that are of the same family as the likelihood, yielding a posterior distribution of the same family. This makes computation easy.

● However, with new estimation methods (MCMC), we can now use non-conjugate priors.


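As a sketch of what conjugacy buys us computationally, the fragment below pairs a binomial likelihood with a Beta prior, a standard conjugate pair (the talk does not use this particular pair; the Beta(2, 2) hyperparameters are arbitrary illustrations). Because the pair is conjugate, the posterior is available in closed form, with no sampling required.

```python
from scipy import stats

a_prior, b_prior = 2.0, 2.0   # Beta(2, 2) prior on an agreement proportion
agree, disagree = 6, 4        # the 6-agree / 4-disagree example used below

# Conjugacy: Beta prior + binomial likelihood => Beta posterior.
a_post, b_post = a_prior + agree, b_prior + disagree
posterior = stats.beta(a_post, b_post)

print(f"posterior mean = {posterior.mean():.3f}")   # 8/14, about 0.571
print(f"95% interval   = {posterior.interval(0.95)}")
```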

Exchangeability

● What warrants the existence of a prior and a prior distribution?

● De Finetti's Representation Theorem is perhaps the most important result in Bayesian statistics because it warrants the existence of a prior.

● Consider the response that student i would make to the question appearing in PISA 2009: "How much do you agree or disagree with each of the following statements about teachers at your school? Most teachers are interested in my well-being" (for simplicity, coded 1/0).

● Imagine that 6 students agree, and 4 students disagree.

● Consider 3 out of 2^10 = 1024 possible patterns:

Pattern 1 = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
Pattern 2 = [1, 0, 1, 1, 1, 0, 1, 0, 1, 0]
Pattern 3 = [0, 1, 1, 1, 0, 1, 0, 0, 1, 1]

● Assigning probabilities to all 1024 patterns is difficult.

● However, if we assume that the proportion of 1's matters and not their location, then that reduces the problem considerably.

● The fact that we can, in this case, "permute" the location of the 1's and 0's but not the proportion of 1's and 0's is the assumption of exchangeability.

● That proportion is a parameter (in fact, a parameter of the binomial distribution) that is assumed to generate the binomial data and "exists" prior to the data.
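A quick computational check on the reduction just described; the counts follow directly from the combinatorics of the 10-item example.

```python
from math import comb

# Under exchangeability, a 0/1 response pattern's probability can depend
# only on the number of 1's, so the 2^10 patterns collapse into 11 classes.
n = 10
print(2 ** n)        # 1024 distinct response patterns
print(comb(n, 6))    # 210 of them contain exactly six 1's
print(n + 1)         # 11 equivalence classes (0 through 10 agreements)
```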

● The Representation Theorem goes on to prove that a distribution around that parameter is also a result of the theorem.

● Jackman (2009) noted that the prior and its distribution are implied by the Representation Theorem and are not just a "Bayesian add-on" to an otherwise classical analysis.

● In other words, the prior and its distribution necessarily arise from an acceptance of exchangeability (Jackman, 2009).

● This then places subjective probability on a firm mathematical footing.

● “If there are ... readers who find p(θ) distasteful, remember it is only as distasteful as exchangeability; and is that unreasonable?” (Lindley and Phillips, 1976, pg. 115).

● When might we not accept exchangeability?

● What if the first 5 students actually came from one school, and the last 5 students came from a different school?

● Then, as a nesting problem familiar in education and in ILSA designs, exchangeability of students cannot hold.

● However, we can invoke conditional exchangeability. That is, given knowledge of the school, we could justify conditional exchangeability.


● This can be extended to hierarchical Bayesian models (essentially, multilevel models) in which we can invoke parameter exchangeability.

Where do priors come from?

● It is important to note that the subjectivist approach does not require that parameters "exist" in the real world. Rather, we find them convenient mathematical constructions that allow us to represent probability assignments, and thus characterize our uncertainty.

● Prior distributions can be elicited in a variety of ways (e.g. expert opinion, meta-analytic studies, informed hypotheses).

● Of critical importance for a Bayesian approach to the analysis of LSAs, priors can come from previous cycles of the LSA.

● For example, we can use previous cycles of PISA, along with ancillary information from comparable LSAs (e.g. NAEP, TIMSS, etc.), to inform the degree of precision around hyperparameters for analyses of data from the current cycle.

● Disagreements about the values of the hyperparameters can be directly assessed via model comparison measures discussed later.

The Likelihood

● A central feature of Bayesian statistics is that the prior is "moderated" or weighted by the data in hand.

● The data in hand are written into our model as p(Y|θ).

● Notice that this asks the question: what is the probability of the data in hand, given a set of model parameters?

● This can also be written as the "likelihood" L(θ|Y), and we find parameters θ that maximize the likelihood of the data.

● So, Bayesian statistics does not ignore data, but uses the data to weight our prior information. Large amounts of data will always trump the prior – as they should.

The Posterior Distribution

● The prior and the likelihood are combined to form the posterior distribution of the model parameters, p(θ|Y).

● What made Bayesian statistical practice difficult was calculating this distribution and finding a way to summarize it.

● Advances in computation now make it possible to summarize the distribution in its entirety, providing the mean, mode, and quantiles of the posterior distribution. This is a full summary of the distribution of the parameters of the model.

● The approach uses a procedure called Markov chain Monte Carlo (MCMC) sampling.

● Software programs include WinBugs, OpenBugs, and a variety of R programs.

● One important result of considering parameters probabilistically is the interpretation of "confidence intervals".

● A frequentist interpretation of a CI requires that one imagine a fixed parameter, say µ.

● Then, we imagine a very large number of repeated samples from the population characterized by µ.

● For any given sample, we obtain x̄ and form a 95% CI.

● The correct (albeit convoluted) interpretation is that 95% of the CIs formed this way capture the true parameter µ under the null hypothesis.

● Notice that from this perspective, the probability that the parameter is in the interval is either zero or one.

● The Bayesian perspective forms a "posterior probability interval" (aka a credible interval).

● Again, because we assume that a parameter has a probability distribution, when we sample from the posterior distribution, we can obtain the quantiles of the parameter distribution.

● From the quantiles, we can directly obtain the probability that the parameter lies within a particular interval.

● So, here, we would say that a 95% posterior probability interval means that the probability that the parameter lies in the interval is .95.

● Notice that this is entirely different from the frequentist interpretation.

● Interestingly, this is typically how students “mis-interpret” the frequentist confidence interval.
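A minimal sketch of how a posterior probability interval is read off posterior draws. The normal "posterior" below is a stand-in for draws that an MCMC run would produce; its location and scale are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in posterior draws (in practice these come from MCMC output).
posterior_draws = rng.normal(loc=500.0, scale=10.0, size=10_000)

# The 95% posterior probability interval is just a pair of quantiles.
lower, upper = np.percentile(posterior_draws, [2.5, 97.5])
print(f"95% posterior probability interval: ({lower:.1f}, {upper:.1f})")
# Read directly as: Pr(lower < parameter < upper | data) = 0.95.
```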

Bayesian Model Evaluation and Testing

● For Bayesians, the goal of model testing is to choose a model that has the best predictive accuracy, and to pay much more attention to the distribution of parameters. There is no real notion of null hypothesis testing.

● One form of model testing is the use of Bayes factors.

● The Bayes factor provides a way to quantify the odds that the data favor one hypothesis over another.

● The Bayes factor can be written as

p(M1|y) / p(M2|y) = [p(y|M1) / p(y|M2)] × [p(M1) / p(M2)] (10)

● Notice that when the prior odds are neutral, the Bayes factor is just the ratio of the likelihoods, again as it should be.

● A popular measure for model selection used in both frequentist and Bayesian applications is based on an approximation of the Bayes factor and is referred to as the Bayesian information criterion (BIC), also referred to as the Schwarz criterion.

● The BIC can be written as

BIC = −2 log(θ̂|y) + q log(n) (11)

where −2 log(θ̂|y) describes model fit while q log(n) is a penalty for model complexity, where q represents the number of variables in the model and n is the sample size.

● As with Bayes factors, the BIC is often used for model comparisons. Specifically, the difference between two BIC measures comparing, say, M1 to M2 can be written as

∆(BIC12) = BIC(M1) − BIC(M2) = log(θ̂1|y) − log(θ̂2|y) − ½(q1 − q2) log(n) (12)

● Rules of thumb have been developed to assess the quality of the evidence favoring one hypothesis over another using Bayes factors and the comparison of BIC values from two competing models. Following Kass and Raftery (1995) and using M1 as the reference model:

BIC Difference   Bayes Factor   Evidence against M2
0 to 2           1 to 3         Weak
2 to 6           3 to 20        Positive
6 to 10          20 to 150      Strong
> 10             > 150          Very strong

● Notice the interpretation! Bayes factors focus on evidence against one hypothesis (model) over another.
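The sketch below codifies equation (11) and one common reading of the Kass and Raftery categories, applied to BIC(M2) − BIC(M1) with M1 as the reference model. The log-likelihoods and parameter counts are hypothetical values for illustration only.

```python
import numpy as np

def bic(loglik: float, q: int, n: int) -> float:
    """BIC as in equation (11): model fit plus a complexity penalty."""
    return -2.0 * loglik + q * np.log(n)

def evidence_against_m2(delta_bic: float) -> str:
    """Kass & Raftery (1995) rule-of-thumb category for BIC(M2) - BIC(M1)."""
    if delta_bic <= 2:
        return "Weak"
    if delta_bic <= 6:
        return "Positive"
    if delta_bic <= 10:
        return "Strong"
    return "Very strong"

# Hypothetical log-likelihoods and parameter counts for two models.
bic1 = bic(loglik=-2304.0, q=8, n=4498)
bic2 = bic(loglik=-2315.0, q=6, n=4498)
print(evidence_against_m2(bic2 - bic1))   # "Positive" for these numbers
```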

Indiana University - WIM Symposium 28 / 55 ❖ Introduction ● 1. Introduction to Although the BIC is derived from a fundamentally Bayesian Bayesian Inference perspective, it is often productively used for model comparison in the ❖ Frequentist v. Bayesian Probability frequentist domain. ❖ Bayes’ Theorem ❖ The Prior Distribution ❖ Exchangeability ● ❖ Where do priors come Recently, however, an explicitly Bayesian approach to model from? comparison was developed by Spiegelhalter, et al (2002). based on ❖ The Likelihood ❖ The Posterior the notion of Bayesian . Distribution ❖ Bayesian Model Evaluation and Testing ❖ Implications for ILSAs ● The DIC can be written as 2. An Example: Multilevel SEM applied to PISA DIC = Eθ{−2log[p(y|θ)|y]+2log[h(y)}. (13)


● Similar to the BIC, the model with the smallest DIC among a set of competing models is preferred.
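In practice the DIC is typically computed from MCMC output via the decomposition DIC = Dbar + pD of Spiegelhalter et al. (2002), where D(θ) = −2 log p(y|θ). The sketch below assumes you can evaluate the model's log-likelihood at each posterior draw and at the posterior mean; both inputs are placeholders for quantities from your own model.

```python
import numpy as np

def dic(log_lik_per_draw: np.ndarray, log_lik_at_post_mean: float) -> float:
    """DIC = Dbar + pD, computed from posterior log-likelihood evaluations."""
    d_bar = np.mean(-2.0 * log_lik_per_draw)   # posterior mean deviance
    d_hat = -2.0 * log_lik_at_post_mean        # deviance at the posterior mean
    p_d = d_bar - d_hat                        # effective number of parameters
    return d_bar + p_d
```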

Indiana University - WIM Symposium 29 / 55 ❖ Introduction ● 1. Introduction to Another approach to model evaluation involves posterior predictive Bayesian Inference checking. ❖ Frequentist v. Bayesian Probability ❖ Bayes’ Theorem ❖ The Prior Distribution ● The general idea behind posterior predictive checking is that there ❖ Exchangeability ❖ Where do priors come should be little, if any, discrepancy between data generated by the from? model, and the actual data itself. ❖ The Likelihood ❖ The Posterior Distribution ❖ Bayesian Model ● Evaluation and Testing In essence, posterior predictive checking is a method for assessing ❖ Implications for ILSAs the specification quality of the model from the viewpoint of predictive 2. An Example: accuracy. Multilevel SEM applied to PISA

● Any deviation between the model-generated data and the actual data suggests possible model misspecification.

● Posterior predictive checking utilizes the posterior predictive distribution of replicated data.

Indiana University - WIM Symposium 30 / 55 ❖ Introduction ● rep 1. Introduction to Following Gelman, et al. (2004), let y be data replicated from our Bayesian Inference current model. That is, ❖ Frequentist v. Bayesian Probability ❖ Bayes’ Theorem p(yrep|y) = p(yrep|θ)p(θ|y)dθ (14) ❖ The Prior Distribution ❖ Exchangeability ❖ Where do priors come = p(yrep|θ)p(y|θ)p(θ)dθ. from? ❖ The Likelihood ❖ The Posterior Distribution ❖ Bayesian Model Evaluation and Testing ● Notice that the second term, p(θ|y), on the right-hand-side of ❖ Implications for ILSAs equation (14) is simply the posterior distribution of the model 2. An Example: Multilevel SEM applied parameters. to PISA

● In words, equation (14) states that the distribution of future observations given the present data, p(y^rep|y), is equal to the probability distribution of the future observations given the parameters, p(y^rep|θ), weighted by the posterior distribution of the model parameters.

● Thus, posterior predictive checking accounts for both the uncertainty in the model parameters and the uncertainty in the data.

● As a means of assessing the fit of the model, posterior predictive checking implies that the replicated data should match the observed data quite closely if we are to conclude that the model fits the data.

● One approach to quantifying model fit in the context of posterior predictive checking incorporates the notion of Bayesian p-values.

● Denote by T(y) a model test statistic based on the data, and let T(y^rep) be the same test statistic but defined for the replicated data.

● Then, the Bayesian p-value is defined to be

p-value = pr(T(y^rep) ≥ T(y)|y). (15)


● Equation (15) measures the proportion of replicated data sets whose test statistic exceeds that of the actual data.
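A minimal sketch of equation (15) for a toy normal model. The data, the stand-in posterior draws, and the choice of the sample variance as the test statistic T are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

y = rng.normal(loc=0.0, scale=1.5, size=200)   # stand-in observed data
T_obs = np.var(y)                              # T(y): the test statistic

# Stand-in posterior draws for the mean (sigma fixed at its sample value
# purely to keep the sketch short).
mu_draws = rng.normal(y.mean(), y.std() / np.sqrt(len(y)), size=2000)
sigma = y.std()

# For each posterior draw, generate replicated data and recompute T.
T_rep = np.array([np.var(rng.normal(mu, sigma, size=len(y))) for mu in mu_draws])

p_value = np.mean(T_rep >= T_obs)              # pr(T(y_rep) >= T(y) | y)
print(f"Bayesian p-value = {p_value:.2f}")     # values near 0 or 1 signal misfit
```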

Indiana University - WIM Symposium 32 / 55 ❖ Introduction ● 1. Introduction to How are parameters estimated within a Bayesian framework? Bayesian Inference ❖ Frequentist v. Bayesian Probability ❖ Bayes’ Theorem ● In Bayesian statistics, we use a method referred to as Markov chain ❖ The Prior Distribution Monte Carlo sampling (MCMC). ❖ Exchangeability ❖ Where do priors come from? ❖ The Likelihood ● In contrast to maximum likelihood estimation, and other estimation ❖ The Posterior Distribution methods within the frequentist paradigm, Bayesian inference focuses ❖ Bayesian Model Evaluation and Testing on estimating features of the posterior distribution, such as point ❖ Implications for ILSAs estimates (e.g. the EAP or MAP) and posterior probability intervals. 2. An Example: Multilevel SEM applied to PISA ● The difficulty arises when attempting to summarize the posterior 3. Discussion distribution.

Indiana University - WIM Symposium 33 / 55 ❖ Introduction ● 1. Introduction to Summarizing the posterior distribution requires calculating Bayesian Inference expectations. We can write the expectation of the posterior ❖ Frequentist v. Bayesian Probability distribution as ❖ Bayes’ Theorem ❖ The Prior Distribution f(θ)p(θ)p(y|θ)dθ E[f(θ|y)] = (16) ❖ Exchangeability p(θ)p(y|θ)dθ ❖ Where do priors come from? ❖ The Likelihood and it is this expectation that is impossible to solve analytically – ❖ The Posterior Distribution particularly for complex high-dimensional problems. ❖ Bayesian Model Evaluation and Testing ❖ Implications for ILSAs


Indiana University - WIM Symposium 34 / 55 ❖ Introduction ● 1. Introduction to Rather than attempting the impossible task of analytically solving Bayesian Inference equation (16), we can instead draw samples from f(θ|y) and ❖ Frequentist v. Bayesian Probability summarize the distribution formed by those samples. ❖ Bayes’ Theorem ❖ The Prior Distribution ❖ Exchangeability ● ❖ Where do priors come This is referred to as Monte Carlo integration. from? ❖ The Likelihood ❖ The Posterior Distribution ● The approach is based on first drawing samples {θt, t =1,...,T } ❖ Bayesian Model Evaluation and Testing from the posterior distribution f(θ|y) and approximating the ❖ Implications for ILSAs expectation in equation (16) by 2. An Example: Multilevel SEM applied T to PISA 1 E[f(θ|y)] ≈ f(θt|y) (17) 3. Discussion T t=1

● MCMC algorithms include the Metropolis-Hastings algorithm and the Gibbs Sampler.
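For concreteness, here is a minimal random-walk Metropolis-Hastings sampler for the mean of a normal distribution with known standard deviation. The data, prior, and proposal scale are all illustrative assumptions, not anything from PISA.

```python
import numpy as np

rng = np.random.default_rng(42)

y = rng.normal(loc=500.0, scale=100.0, size=50)   # stand-in data, sigma known

def log_posterior(mu: float) -> float:
    log_prior = -0.5 * ((mu - 500.0) / 100.0) ** 2       # N(500, 100^2) prior
    log_lik = -0.5 * np.sum(((y - mu) / 100.0) ** 2)     # normal likelihood
    return log_prior + log_lik

draws, mu = [], 400.0                       # deliberately poor starting value
for _ in range(10_000):
    proposal = mu + rng.normal(0.0, 25.0)   # random-walk proposal
    # Accept with probability min(1, posterior ratio), on the log scale.
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(mu):
        mu = proposal
    draws.append(mu)

posterior_draws = np.array(draws[2_000:])   # discard burn-in
print(posterior_draws.mean(), np.percentile(posterior_draws, [2.5, 97.5]))
```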

Implications for ILSAs

● Returning to Bayes' theorem, what we have now is a vector of model parameters derived from a complex model:

p(Θ|Y) ∝ p(Y|Θ)p(Θ) (18)

where Θ are model parameters following a structural equation model.

● For a fully Bayesian model, we would place priors on all the model parameters. They could be vague priors or very specific priors.

● In the case of not having much information, we can specify vague (also called non-informative or diffuse) priors.

● As long as we can put prior distributions on the parameters of interest, we can run virtually any model as a Bayesian model.

● It's a difference of perspective.

● The types of models that have been applied to ILSAs and can be brought into the Bayesian framework include:

1. IRT models for item analysis.

2. CFA models for attitude/opinion questions in the BQ.

3. Multilevel (HLM) models.

4. Structural equation models – including multilevel SEM.

● In each case, we can acknowledge our degree of uncertainty via specification of the prior distribution.


2. An Example: Multilevel SEM applied to PISA

● It has long been recognized that SEM is a valuable tool for estimating putative relationships among inputs, processes, and outputs of the educational system.

● The extension to multilevel SEM now allows us to account for the hierarchical structure of educational data and also allows structural relations to exist both within and between levels of an educational system.

● The purpose of this example is to demonstrate Bayesian multilevel SEM in the context of PISA and to compare results to those obtained under conventional multilevel SEM using maximum likelihood estimation. A fuller description can be found in Kaplan & Depaoli (in press).

Indiana University - WIM Symposium 39 / 55 ❖ Introduction ● 1. Introduction to The model that we will consider allows for varying intercepts and Bayesian Inference varying structural regression coefficients. 2. An Example: Multilevel SEM applied to PISA ❖ Conjugate priors for ● The within-school (level–1) full structural equation model as SEM ❖ Model Evaluation ❖ Assessing MCMC Convergence ❖ Results yig = αg + Bgyig + Γgxig + rig, g =1, 2,..., G. (19)

3. Discussion

● We can model the structural intercepts and slopes as a function of between-school endogenous variables z_g and between-school exogenous variables w_g.

● Specifically, we write the level–2 model as

α_g = α_00 + α_01 z_g + α_02 w_g + ε_g, (20)
B_g = B_00 + B_01 z_g + B_02 w_g + ζ_g, (21)
Γ_g = Γ_00 + Γ_01 z_g + Γ_02 w_g + θ_g. (22)

● Equations 20–22 allow for randomly varying intercepts and two types of randomly varying slopes.

● These randomly varying structural coefficients are modeled as functions of a set of between-school predictors z_g and w_g.

● These between-school predictors appear in Equations 20–22, but their respective regression coefficients are parameterized to reflect a priori structural relationships.

Indiana University - WIM Symposium 41 / 55 ❖ Introduction ● 1. Introduction to The full multilevel path model allows for a set of structural Bayesian Inference relationships among school endogenous and exogenous variables, 2. An Example: which we can write as (see Kaplan & Kreisman, 2000) Multilevel SEM applied to PISA ❖ Conjugate priors for zg = τ + ∆zg + Ωwg + δg, (23) SEM ❖ Model Evaluation ❖ Assessing MCMC where τ, ∆, and Ω are the fixed structural effects. Convergence ❖ Results ● Finally, ǫ, ζ, θ, and δ are disturbance terms that are assumed to be 3. Discussion normally distributed with mean zero and diagonal matrix T with elements

2 σǫ 2  0 σζ  T = (24) 0 0 σ2  θ   0 0 0 σ2  δ 

Indiana University - WIM Symposium 42 / 55 ❖ Introduction 1. Introduction to Within Bayesian Inference

2. An Example: MOMEDUC Multilevel SEM applied ENJOY to PISA ❖ Conjugate priors for SEM DADEDUC MATHSCOR ❖ Model Evaluation ❖ Assessing MCMC IMPORTNT Convergence PERTEACH ❖ Results

3. Discussion Between ENJOY

IMPORTNT MATHSCOR

NEWMETHO

CNSENSUS ENTHUSIA ENCOURAG

CNDITION

RANDOM SLOPE

Figure 2. Multilevel path model of mathematics achi evement with

Conjugate priors for SEM

● Let θ_norm = {α, Λ, B, Γ, K} be the vector of free model parameters that are assumed to follow a normal distribution. Formally, we write

θ_norm ∼ N(ν, Ω) (25)

where ν and Ω are the mean and variance hyperparameters, respectively, of the normal prior.

● For blocks of Ψ and Ξ, we assume that the prior distribution is inverse-Wishart. Let θ_IW = {Ψ, Ξ} be the vector of free model parameters that are assumed to follow the inverse-Wishart distribution, i.e.

θ_IW ∼ IW(R, δ) (26)

where R is a positive definite matrix and δ > q − 1, where q is the number of observed variables. Different choices for R and δ will yield different degrees of "informativeness" for the inverse-Wishart distribution.

● Note that in the case where there is only one element in the block, the prior distribution is assumed to be inverse-gamma, i.e. IG(a, b).
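A sketch of drawing from the conjugate priors in equations (25) and (26) for small illustrative blocks, using SciPy. All hyperparameter values (ν, Ω, R, δ, a, b) below are placeholders, not the values used in the PISA analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Normal prior on a block of three structural coefficients (equation 25).
nu = np.zeros(3)               # mean hyperparameter
Omega = np.eye(3) * 10.0       # variance hyperparameter (fairly diffuse)
theta_norm = rng.multivariate_normal(nu, Omega)

# Inverse-Wishart prior on a 3x3 covariance block (equation 26).
q = 3
R = np.eye(q)                  # positive definite scale matrix
delta = q + 2                  # degrees of freedom; must exceed q - 1
theta_iw = stats.invwishart(df=delta, scale=R).rvs(random_state=rng)

# Inverse-gamma prior for a single variance element (a 1x1 block).
sigma2 = stats.invgamma(a=0.01, scale=0.01).rvs(random_state=rng)
print(theta_norm, theta_iw.shape, sigma2)
```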

Model Evaluation

● The issue of model fit via the LR chi-square statistic and the myriad of fit indices does not enter into Bayesian SEM model evaluation.

● Instead, model evaluation is based on the ability of a model to make accurate predictions. Competing models are also evaluated by criteria of predictive accuracy.

● One approach is based on posterior predictive checking, discussed earlier.

● The general idea behind posterior predictive checking is that there should be little, if any, discrepancy between data generated by the model and the actual data itself.

● Unfortunately, there presently does not exist a form of posterior predictive checking for multilevel SEM, and we are currently working on this problem in the context of our IES grant.

Assessing MCMC Convergence

[Figure: Portion of MCMC convergence plots. Between part (left); within part (right).]

Results

● This example is based on a reanalysis of a multilevel path analysis described in Kaplan, Kim, and Kim (2009), using data from 4498 South Korean students from the PISA 2003 survey.

● In that paper, a multilevel path analysis was employed to study within- and between-school predictors of mathematics achievement.

● For this study, we compare normal conjugate priors for {α, B, Γ, ∆, Ω, τ} to non-informative priors based also on the normal distribution, but where the variance hyperparameter is set to 10^10.

● We do not vary the shape or scale of the inverse-gamma distributions for the disturbance variances, though that is possible.


2. An Example: Multilevel SEM applied ● to PISA Parameter estimates derived from maximum likelihood estimation via ❖ Conjugate priors for the EM algorithm are compared to expected a posteriori (EAP) SEM ❖ Model Evaluation estimates from the Gibbs Sampler. ❖ Assessing MCMC Convergence ❖ Results ● 3. Discussion The Gibbs sampler utilized two chains with 5,000 burn-in iterations and 5,000 post burn-in iterations.

● The Brooks & Gelman (1998) convergence diagnostic indicated that all parameters properly converged for this model. This model took approximately 1 minute to run.
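For readers who want the mechanics, here is a sketch of the potential scale reduction factor that underlies the Brooks and Gelman (1998) criterion. The two chains below are simulated stand-ins, not output from the Mplus run.

```python
import numpy as np

def gelman_rubin(chains: np.ndarray) -> float:
    """Potential scale reduction factor for a single parameter.

    `chains` has shape (n_chains, n_draws) of post burn-in samples.
    Values close to 1.0 indicate convergence.
    """
    _, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)     # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()       # within-chain variance
    var_hat = (n - 1) / n * W + B / n           # pooled variance estimate
    return float(np.sqrt(var_hat / W))

rng = np.random.default_rng(3)
chains = rng.normal(0.0, 1.0, size=(2, 5000))   # two well-mixed stand-in chains
print(gelman_rubin(chains))                     # ~1.0
```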

Table 1: Comparison of ML/EM v. Bayes with Non-informative Priors: Selected Estimates

Parameter                ML Est.   Conf. Int.       EAP      Cred. Int.
Within Level
MATHSCOR ON MOMEDUC      4.01      1.96, 6.05       5.42     2.15, 5.79
MATHSCOR ON DADEDUC      4.81      2.99, 6.63       4.43     2.91, 6.68
MATHSCOR ON PERTEACH     6.27      0.85, 11.69      2.87     1.64, 10.72
MATHSCOR ON IMPORTNT     15.87     11.29, 20.44     13.05    11.84, 19.72
Between Level
SLOPE ON NEWMETHO        -4.61     -9.81, 0.58      -6.19    -9.45, 1.02
SLOPE ON ENTHUSIA        10.10     2.57, 17.61      9.12     -0.76, 18.23
SLOPE ON CNSENSUS        -3.64     -9.95, 2.67      -3.83    -10.65, 4.29
SLOPE ON CNDITION        -8.20     -13.16, -3.24    -9.58    -13.53, -3.09
SLOPE ON ENCOURAG        -1.68     -7.28, 3.93      -1.89    -7.59, 3.58

Table 2: Comparison of ML/EM v. Bayes with Informative Priors: Within Estimates

Parameter                ML Est.   Conf. Int.       EAP      Cred. Int.
Within Level
MATHSCOR ON MOMEDUC      4.01      1.96, 6.05       3.47     2.65, 5.20
MATHSCOR ON DADEDUC      4.81      2.99, 6.63       4.23     3.61, 5.99
MATHSCOR ON PERTEACH     6.27      0.85, 11.69      8.11     2.61, 9.89
MATHSCOR ON IMPORTNT     15.87     11.29, 20.44     15.01    12.90, 19.42
Between Level
SLOPE ON NEWMETHO        -4.61     -9.81, 0.58      -4.05    -8.31, -1.64
SLOPE ON ENTHUSIA        10.10     2.57, 17.61      8.59     4.75, 14.91
SLOPE ON CNSENSUS        -3.64     -9.95, 2.67      -1.97    -7.22, 1.31
SLOPE ON CNDITION        -8.20     -13.16, -3.24    -8.12    -11.28, -4.78
SLOPE ON ENCOURAG        -1.68     -7.28, 3.93      -6.21    -7.06, 1.58


3. Discussion

● Why would one choose to use this method – particularly when it can often provide results that are very close to those of frequentist approaches such as maximum likelihood?

● The answer lies in the major distinction between the Bayesian approach and the frequentist approach; that is, in the elicitation, specification, and incorporation of prior distributions on the model parameters.

● As pointed out by Skrondal and Rabe-Hesketh (2004, pg. 206), there are four reasons why one would adopt the use of prior distributions – one of which they indicate is “truly” Bayesian, while the others represent a more “pragmatic” approach to Bayesian inference.

● The truly Bayesian approach would specify prior distributions that reflect elicited prior knowledge.

● Pragmatic approaches might specify prior distributions for the purposes of achieving model identification, handling inadmissible values, or because the application of MCMC can sometimes make difficult frequentist problems tractable.

Indiana University - WIM Symposium 52 / 55 ❖ Introduction ● 1. Introduction to Although we concur with the general point that Skrondal and Bayesian Inference Rabe-Hesketh (2004) are making, we do not believe that the 2. An Example: distinction between “true” Bayesians versus “pragmatic” Bayesians is Multilevel SEM applied to PISA necessarily the correct distinction to be made. 3. Discussion

● If there is a distinction to be made, we argue that it is between Bayesians and pseudo-Bayesians, where the latter implement MCMC as “just another estimator”.

● Rather, we adopt the pragmatic perspective that the usefulness of a model lies in whether it provides good predictions.

● The specification of priors based on subjective knowledge can be subjected to quite pragmatic procedures in order to sort out the best predictive model, such as the use of posterior predictive checking.

Indiana University - WIM Symposium 53 / 55 ❖ Introduction ● 1. Introduction to What Bayesian theory forces us to recognize is that it is possible to Bayesian Inference bring in prior information on the distribution of model parameters, but 2. An Example: that this requires a deeper understanding of the elicitation problem Multilevel SEM applied to PISA (Abbas, et al., 2008; Abbas, et al., 2010; O’Hagan, et al., 2006) 3. Discussion ● Through a careful review of prior research on a problem, and/or the careful elicitation of prior knowledge from experts and/or key stakeholders, relatively precise values for hyperparameters can be obtained and incorporated into a Bayesian specification.

● In the context of ILSAs, prior knowledge can be gleaned both from previous waves of the same ILSA and from elicitation protocols developed with expert groups.

● Alternative elicitations can be directly compared via Bayesian model selection measures such as use of the deviance information criterion or Bayes factors.

It is through

● the careful and rigorous elicitation of prior knowledge,

● the incorporation of that knowledge into our statistical models, and

● a rigorous approach to the selection among competing models,

that a pragmatic and evolutionary development of knowledge (via surveys such as PISA) about the inputs, processes, and outputs of schooling can be realized.

● This is precisely the advantage that Bayesian statistics, and Bayesian SEM in particular, has over its frequentist counterparts.

● Now that the theoretical and computational foundations have been established, the benefits of Bayesian SEM in general, and for international educational research in particular, will be realized in terms of how it provides insights into important substantive problems.
