Alternatives to Credibility Intervals in Validity Generalization Research Roger E
Total Page:16
File Type:pdf, Size:1020Kb
Tolerance Intervals: Alternatives to Credibility Intervals in Validity Generalization Research Roger E. Millsap Baruch College, City University of Now York In validity generalization research, the estimated interval,&dquo; borrowing the concept from Bayesian mean and variance of the true validity distribution are statistics (Novick & Jackson, 1974). often used to construct a credibility interval, an inter- Since this work Schmidt and Hunter, val of the true valid- early by containing a specified proportion calculation of intervals has ity distribution. The statistical interpretation of this in- the credibility become terval in the literature has varied between Bayesian a routine step in studies of validity generalization and classical (frequentist) viewpoints. Credibility in- (Calender 85 Osbum, 19~ 19 Callender, Osbum, tervals are here discussed from the frequentist perspec- Greener, & Ashworth, 1982; Linn, Hamisch, & tive. These are known as "tolerance intervals" in the Dunbar, 1981; Linn & Hastings, 1984; Pearlman, statistical literature. Two new methods for construct- Schmidt, & Hunter, 1980; Hunter, & ing a credibility interval are presented. Unlike the cur- Schmidt, rent method of constructing the credibility interval, Caplan, 1981; Schmidt, Hunter, Pearlman, & Shane, tolerance intervals have known performance character- 1979). In studies where the variance in true validi- istics across repeated applications, justifying confi- ties remains substantial after removing artifactual dence statements. The new methods may be useful in variation, the lower limits of the credibility interval validity generalization research involving a small or moderate number of validation studies. Index is used to justify conclusions about the generaliz- terms: Bayesian statistics, Credibility intervals, Meta- ability of the test’ validity across settings and or- analysis, Tolerance intervals, True validity distribu- ganizations (Osbum, Callender, Greener, & Ash- tion, Validity generalization. worth, 1983; Pearlman et al., 1980; Schmidt ~z Hunter, 1981; Schmidt et al., 1981; Schmidt, Hunter, A validity generalization study provides esti- & Pearlman, 1982; Schmidt et al., 1979; Schmidt, mates of the mean and variance of true test validi- Pearlman, Hunter, & Hirsh, 1985). ties using the results of many individual validation As described in Schmidt and Hunter (1977), the studies. Schmidt and Hunter (19~7) proposed that credibility interval is to have a Bayesian interpre- the estimated mean and variance of the true validity tation as an interval centered at the mean of the distribution be used to construct an cen- interval, posterior distribution of true validities. The pos- tered at the mean true validity, that contains a spec- terior distribution is the distribution of true valid- ified percentage of the distribution of true validi- ities that is obtained by considering both the in- ties. This interval was denoted a &dquo;credibility vestigator’s initial beliefs about the distribution of validities, and the empirical results of many separate validation studies as cumulated in the va- lidity generalization procedure. In Bayesian infer- ence, the initial beliefs of the investigator are for- 27 Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/ 28 mally in the ‘6pri~r distribution.&dquo; These validities. The prior distribution can be regarded beliefs may simply be hunches, or may reflect pre- as describing an actual ’ &dquo;population&dquo; of true valid- vious empirical work. The difference between the ties, with a single validation study being a sample prior and posterior distributions indicates how the of size one from this population. The population investigator’s initial beliefs have been modified by of true validities can be defined in various ways, empirical evidence. depending on the intent of the investigators. In the While describing credibility interval in terms employment testing context, the population might of the posterior distribution, Schmidt and Hunter be defined as &dquo;true validities resulting from all (1977) stated that validity generalization results possible validation studies within a particular job provide information about the distribution of true family using a given predictor and a given class of validities that can serve as a prior distribution in criterion measures.&dquo; More or less specificity can future validation research. Further empirical vali- be introduced in this definition. The point is simply dation is unnecessary if a sufficient portion of this that such a population is easy to conceptualize. prior distribution lies above an acceptable mini- The validity generalization procedure uses the mum validity: the lower bound of the credibility results from N validation studies to estimate the interval. Schmidt and Hunter (1977) used the es- mean true validity p and the true validity variance timated mean and variance of the true validity dis- cr 2. Let p and 6r2 denote these estimates respec- tribution to construct this interval, assuming a nor- tively. A credibility interval is constructed to give mal distribution for the prior. If the lower bound bounds that contain a specified proportion of the is acceptably large, Schmidt and Hunter (1977) true validity distribution. Viewed in frequentist suggested that further validation studies are un- terms, how should such an interval be constructed? likely, 6‘~her~ used in a Bayesian analysis, to alter Before constructing the interval, the sense in the conclusion that the test is valid&dquo; (p. 532). Sim- which the interval will &dquo;contain’’ the specified pro- ilar viewpoints are expressed in Schmidt et al. (1979) portion of the distribution must be clarified. In the and Pearlman et al. (1980). frequentist view, the true validity distribution is Recent discussions of the credibility interval have fixed, and the credibility intervals constructed from omitted explicit reference to Bayesian inference sample data will vary due to sampling error. Re- and instead view the interval in frequentist terms, peated validity generalization studies, applied to as a predictive indicator for the outcomes of future the same true validity population, will generate validation studies. Schmidt et al. (1981) stated that different credibility intervals. It is impossible to when the credibility interval shows that more than say with ce ty whether any specific interval does 90% of the estimated true validity distribution lies or does not actually contain the specified portion in the positive range, &dquo;positive validity can be of the true validity distribution. But it is possible expected in new settings and organizations more to give a statement concerning the percentage of than 9 times out of 10&dquo; (p. 268). Schmidt et al. these intervals which can be expected to cover the (1985) suggested that &dquo;the process of setting con- specified portion of the distribution. Alternatively, fidence bounds on the true validity is the same the intervals can be constructed so that they will process used in all of inferential statistics&dquo; (p. 723), cover the desired portion &dquo;on the average.&dquo; In and drew an analogy to the confidence interval set either case, the basis for confidence in the intervals by an investigator for the true validity in an indi- lies in a hypothetical set of repeated applications vidual validation study. This paper discusses the in different samples of validation studies. credibility interval from a frequentist perspective. In the statistical literature, an interval con- structed to contain a specified proportion of the distribution is known as a tolerance in- The Frequentist Viewpoint population terval (Kendall & Stuart, 1979; Mood, Graybill, In validity generalization research, there is an ~z Boes, 1974; Proschan, 1953; Wald ~ Wolfo- empirical basis for the prior distribution of true witz, 1946; ~lilks, 1941). A tolerance interval dif- Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/ 29 fers from a confidence interval in that the former where P is set equal to 2R - 1 for R > .50. Al- encloses a proportion of the entire population dis- though it can be expected that, on the average, a tribution, while the latter is constructed to contain proportion of the distribution will lie above ~°2, the value of a population parameter. In both cases, there is no explicit control over how often this will the sense in which the intervals &dquo;contain&dquo; their be true across repeated applications. Thus the target values is interpreted with reference to re- 14 confidence level,&dquo; expressing the degree of ex- applications in many samples. pectation that a pr~p®~i~ar~ ~ of the distribution will Returning to the validity generalization appli- lie above C~, cannot be altered. For this bound, cation, assume that the true validity distribution is the confidence level is fixed at 50% (Proschan, r~® ~i. If p and a2 are i~r~®~~9 ~ tolerance interval 1953). can be constructed to contain a pr~p®~i®~ ~ of the The second type of tolerance interval allows true validity distribution