Goodness-Of-Fit Testing

Author's personal copy Goodness-of-Fit Testing A Maydeu-Olivares and C Garcı´a-Forero, University of Barcelona, Barcelona, Spain ã 2010 Elsevier Ltd. All rights reserved. Glossary useful dichotomy to classify GOF assessment is based on the nature of the observed data, discrete versus continu- Absolute goodness of fit – The discrepancy ous. Historically, GOF assessment for multivariate discrete between a statistical model and the data at hand. data and that for multivariate continuous data have been Goodness-of-fit index – A numerical summary of presented as being completely different. However, new the discrepancy between the observed values and developments in limited information GOF assessment for the values expected under a statistical model. discrete data reveal that there are strong similarities Goodness-of-fit statistic – A goodness-of-fit index between the two, and here, we shall highlight the simila- with known sampling distribution that may be used in rities in GOF assessment for discrete and continuous data. statistical-hypothesis testing. Relative goodness of fit – The discrepancy between two statistical models. GOF testing with Discrete Observed Data Consider modeling N observations on n discrete random variables, each with K categories, such as the responses to n test items. The observed responses can then be gathered Introduction n in an n-dimensional contingency table with C ¼ K cells. The goodness of Fit (GOF) of a statistical model describes Within this setting, assessing the GOF of a model involves assessing the discrepancy between the observed propor- how well it fits into a set of observations. GOF indices sum- marize the discrepancy between the observed values and tions and the probabilities expected under the model c ¼ ... C the values expected under a statistical model. GOF statis- across all cells 1, , of the contingency table. p tics are GOF indices with known sampling distributions, More formally, let c be the probability of one such cell p pðÞu usually obtained using asymptotic methods, that are used and let c be the observed proportion. Let be the C in statistical hypothesis testing. As large sample approx- -dimensional vector of model probabilities expressed as a function of, say, q model parameters to be estimated imations may behave poorly in small samples, a great deal of research using simulation studies has been devoted to from the data. Then, the null hypothesis to be tested is H : p ¼ pðÞu investigate under which conditions the asymptotic p-values 0 , that is, the model holds, against H1 : p 6¼ pðÞu . of GOF statistics are accurate (i.e., how large the sample size must be for models of different sizes). Assessing absolute model fit (i.e., the discrepancy between a model and the data) is critical in applications, GOF Statistics for Assessing Overall Fit as inferences drawn on poorly fitting models may be badly The two standard GOF statistics for discrete data are misleading. Applied researchers must examine not only Pearson’s statistic the overall fit of their models, but they should also per- XC form a piecewise assessment. It may well be that a model 2 2 X ¼ N ðÞpc À ^c =^c; ½1 fits well overall but that it fits poorly some parts of the c¼1 data, suggesting the use of an alternative model. The piecewise GOF assessment may also reveal the source of and the likelihood ratio statistic C misfit in poorly fitting models. X 2 When more than one substantive model is under con- G ¼ 2N pc lnðÞpc =^c : ½2 c¼ sideration, researchers are also interested in a relative 1 ^ ¼ p ðu^Þ c model fit (i.e., the discrepancy between two models; see where c c denotes the probability of cell under Yuan and Bentler, 2004; Maydeu-Olivares and Cai, 2006). the model. Thus, we can classify GOF assessment using two useful Asymptotic p-values for both statistics can be obtained C q dichotomies: GOF indices versus GOF statistics, and using a chi-square distribution with – – 1 degrees of absolute fit versus relative fit. In turn, GOF indices and freedom when maximum likelihood estimation is used. statistics can be classified as overall or piecewise. A third However, these asymptotic p-values are only correct when 190 International Encyclopedia of Education (2010), vol. 7, pp. 190-196 Author's personal copy Goodness-of-Fit Testing 191 all expected frequencies N^c are large (>5 is the usual GOF Statistics for Piecewise Assessment of Fit rule of thumb). A practical way to evaluate whether the In closing this section, the standard method for assessing asymptotic p-values for X2 and G2 are valid is to compare the source of misfit is the use of z-scores for cell residuals them. If the p-values are similar, then both are likely to be p À ^ correct. If they are very different, it is most likely that both c c ; ½ 4 p SEðÞpc À ^c -values are incorrect. Unfortunately, as the number of cells in the table where SE denotes standard error. In large samples, their increases, the expected frequencies become small (as the distribution can be approximated using a standard normal sum of all C probabilities must be equal to 1). As a result, distribution. Unfortunately, the use of these residuals even in multivariate discrete data analysis, most often, the in moderately large contingency tables, is challenging. It is p -values for these statistics cannot be trusted. In fact, difficult to find trends in inspecting these residuals, and the k> when the number of categories is large (say 4), the number of residuals to be inspected is easily too large. Most p asymptotic -values almost invariably become inaccurate importantly, for large C, because the cell frequencies are n > as soon as 5. To overcome the problem of the inac- integers and the expected frequencies must be very small, curacy of the asymptotic p-values for these statistics, the resulting residuals will be either very small or very large. two general methods have been proposed: resampling methods (e.g., bootstrap), and pooling cells. Unfortunately, existing evidence suggest that resampling methods do not yield accurate p-values for the X2 and G2 statistics (Tollenaar New Developments in GOF with Discrete and Mooijart, 2003). Pooling cells may be a viable alter- Observed Data: Limited Information native to obtain accurate p-values in some instances. For Methods instance, rating items with five categories can be pooled into three categories to reduce sparseness. However, if In standard GOF methods for discrete data, contingency the number of variables is large, the resulting table may tables are characterized using cell probabilities. However, still yield some small expected frequencies. Moreover, they can be equivalently characterized using marginal Â pooling may distort the purpose of the analysis. Finally, probabilities. To see this, consider the following 2 3 pooling must be performed before the analysis is made to contingency table: obtain a statistic with the appropriate asymptotic reference distribution. Due to the difficulties posed by small expected prob- X ¼ 0X¼ 1X¼ 2 abilities on obtaining accurate p-values for GOF statistics 2 2 2 X ¼ p p p assessing absolute models, some researchers have resorted 1 0 00 01 02 to examining only the relative fit of the models under X1 ¼ 1 p11 p11 p12 consideration, without assessing the absolute model fit. Other researchers simply use GOF indices. This table can be characterized using the cell prob- 0 abilities p ¼ð00; ÁÁÁ;12Þ . Alternatively, it can be ð1Þ ð1Þ ð2Þ GOF Indices characterized using the univariate p_ 1 ¼ðp ; p ; p Þ ð Þð Þ ð Þð Þ 1 2 2 and bivariate p_ ¼ð 1 1 ; 1 2 Þ probabilities, where With L denoting the loglikelihood, two popular GOF indices 2 12 12 ðkÞ ðkÞðlÞ ¼ ðXi ¼ kÞ; ¼ ðXi ¼ k; Xj ¼ lÞ are Akaike’s information criterion (AIC), AIC ¼2L þ 2q i Pr ij Pr , and and Schwarz Bayesian information criterion (BIC), ¼ ¼ ¼ BIC ¼2L þ q ln(N ), X2 0X2 1X2 2 X1 ¼ 0 AIC ¼2L þ 2q; BIC ¼2L þ q lnðÞN ½3 X ¼ pðÞ1 ðÞ1 pðÞ1 ðÞ2 pðÞ1 1 1 12 12 1 pðÞ1 pðÞ2 The AIC and BIC are not used to test the model in the ðÞ2 ðÞ2 sense of hypothesis testing, but for model selection. Given a data set, a researcher chooses either the AIC or BIC, and computes it for all models under consideration. Then, the Both characterizations are equivalent, and the equiva- model with the lowest index is selected. Notice that both lence extends to contingency tables of any dimension. the AIC and BIC combine absolute fit with model parsi- Limited-information GOF methods disregard infor- mony. That is, they penalize by adding parameters to the mation contained in the higher-order marginals of the model, but they do so differently. Of the two, the BIC table. Thus, quadratic forms in, say, univariate and bivari- penalizes by adding parameters to the model more strongly ate residuals are used instead of using all marginal resi- than the AIC. duals up to order n. International Encyclopedia of Education (2010), vol. 7, pp. 190-196 Author's personal copy 192 Statistics GOF Statistics for Assessing Overall Fit df ¼ number of residuals used–number of estimated parameters involved. However, when used for piecewise Maydeu-Olivares and Joe (2005, 2006) proposed a family assessment of fit, X2 may yield an undue impression of of GOF statistics, Mr , that provides a unified framework poor fit (Maydeu-Olivares and Joe, 2006). The use of the for limited information and full information GOF statis- 2 Mr statistics in place of X for pairs (or triplets) of vari- tics. This family can be written as ables solves this problem.

Goodness-Of-Fit Testing

Goodness of Fit: What Do We Really Want to Know? I

Linear Regression: Goodness of Fit and Model Selection

Chapter 2 Simple Linear Regression Analysis the Simple

A Study of Some Issues of Goodness-Of-Fit Tests for Logistic Regression Wei Ma

Testing Goodness Of

Measures of Fit for Logistic Regression Paul D

Goodness of Fit of a Straight Line to Data

Using Minitab to Run a Goodness‐Of‐Fit Test 1. Enter the Values of A

Goodness-Of-Fit Tests for Parametric Regression Models

Measures of Fit for Logistic Regression Paul D

Lecture 10: F-Tests, R2, and Other Distractions

Goodness of Fit Tests for High-Dimensional Linear Models