Tolerance Intervals: Alternatives to Credibility Intervals in Validity Generalization Research

Roger E. Millsap
Baruch College, City University of New York

Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/

In validity generalization research, the estimated mean and variance of the true validity distribution are often used to construct a credibility interval, an interval containing a specified proportion of the true validity distribution. The statistical interpretation of this interval in the literature has varied between Bayesian and classical (frequentist) viewpoints. Credibility intervals are here discussed from the frequentist perspective. These are known as "tolerance intervals" in the statistical literature. Two new methods for constructing a credibility interval are presented. Unlike the current method of constructing the credibility interval, tolerance intervals have known performance characteristics across repeated applications, justifying confidence statements. The new methods may be useful in validity generalization research involving a small or moderate number of validation studies. Index terms: Bayesian statistics, Credibility intervals, Meta-analysis, Tolerance intervals, True validity distribution, Validity generalization.

A validity generalization study provides estimates of the mean and variance of true test validities using the results of many individual validation studies. Schmidt and Hunter (1977) proposed that the estimated mean and variance of the true validity distribution be used to construct an interval, centered at the mean true validity, that contains a specified percentage of the distribution of true validities. This interval was denoted a "credibility interval," borrowing the concept from Bayesian statistics (Novick & Jackson, 1974). Since this early work by Schmidt and Hunter, calculation of credibility intervals has become a routine step in studies of validity generalization (Callender & Osburn, 1981; Callender, Osburn, Greener, & Ashworth, 1982; Linn, Harnisch, & Dunbar, 1981; Linn & Hastings, 1984; Pearlman, Schmidt, & Hunter, 1980; Schmidt, Hunter, & Caplan, 1981; Schmidt, Hunter, Pearlman, & Shane, 1979). In studies where the variance in true validities remains substantial after removing artifactual variation, the lower limit of the credibility interval is used to justify conclusions about the generalizability of the test's validity across settings and organizations (Osburn, Callender, Greener, & Ashworth, 1983; Pearlman et al., 1980; Schmidt & Hunter, 1981; Schmidt et al., 1981; Schmidt, Hunter, & Pearlman, 1982; Schmidt et al., 1979; Schmidt, Pearlman, Hunter, & Hirsh, 1985).

As described in Schmidt and Hunter (1977), the credibility interval is to have a Bayesian interpretation as an interval centered at the mean of the posterior distribution of true validities. The posterior distribution is the distribution of true validities that is obtained by considering both the investigator's initial beliefs about the distribution of validities, and the empirical results of many separate validation studies as cumulated in the validity generalization procedure. In Bayesian inference, the initial beliefs of the investigator are formally expressed in the "prior distribution." These beliefs may simply be hunches, or may reflect previous empirical work. The difference between the prior and posterior distributions indicates how the investigator's initial beliefs have been modified by empirical evidence.

While describing the credibility interval in terms of the posterior distribution, Schmidt and Hunter (1977) stated that validity generalization results provide information about the distribution of true validities that can serve as a prior distribution in future validation research. Further empirical validation is unnecessary if a sufficient portion of this prior distribution lies above an acceptable minimum validity: the lower bound of the credibility interval. Schmidt and Hunter (1977) used the estimated mean and variance of the true validity distribution to construct this interval, assuming a normal distribution for the prior. If the lower bound is acceptably large, Schmidt and Hunter (1977) suggested that further validation studies are unlikely, "when used in a Bayesian analysis, to alter the conclusion that the test is valid" (p. 532). Similar viewpoints are expressed in Schmidt et al. (1979) and Pearlman et al. (1980).

Recent discussions of the credibility interval have omitted explicit reference to Bayesian statistics and instead view the interval in frequentist terms, as a predictive indicator for the outcomes of future validation studies. Schmidt et al. (1981) stated that when the credibility interval shows that more than 90% of the estimated true validity distribution lies in the positive range, "positive validity can be expected in new settings and organizations more than 9 times out of 10" (p. 268). Schmidt et al. (1985) suggested that "the process of setting confidence bounds on the true validity is the same process used in all of inferential statistics" (p. 723), and drew an analogy to the confidence interval set by an investigator for the true validity in an individual validation study. This paper discusses the credibility interval from a frequentist perspective.

The Frequentist Viewpoint

In validity generalization research, there is an empirical basis for the prior distribution of true validities. The prior distribution can be regarded as describing an actual "population" of true validities, with a single validation study being a sample of size one from this population. The population of true validities can be defined in various ways, depending on the intent of the investigators. In the employment testing context, the population might be defined as "true validities resulting from all possible validation studies within a particular job family using a given predictor and a given class of criterion measures." More or less specificity can be introduced in this definition. The point is simply that such a population is easy to conceptualize.

The validity generalization procedure uses the results from $N$ validation studies to estimate the mean true validity $\rho$ and the true validity variance $\sigma^2$. Let $\hat\rho$ and $\hat\sigma^2$ denote these estimates respectively. A credibility interval is constructed to give bounds that contain a specified proportion of the true validity distribution. Viewed in frequentist terms, how should such an interval be constructed?

Before constructing the interval, the sense in which the interval will "contain" the specified proportion of the distribution must be clarified. In the frequentist view, the true validity distribution is fixed, and the credibility intervals constructed from sample data will vary due to sampling error. Repeated validity generalization studies, applied to the same true validity population, will generate different credibility intervals. It is impossible to say with certainty whether any specific interval does or does not actually contain the specified portion of the true validity distribution. But it is possible to give a statement concerning the percentage of these intervals which can be expected to cover the specified portion of the distribution. Alternatively, the intervals can be constructed so that they will cover the desired portion "on the average." In either case, the basis for confidence in the intervals lies in a hypothetical set of repeated applications in different samples of validation studies.

In the statistical literature, an interval constructed to contain a specified proportion of the population distribution is known as a tolerance interval (Kendall & Stuart, 1979; Mood, Graybill, & Boes, 1974; Proschan, 1953; Wald & Wolfowitz, 1946; Wilks, 1941). A tolerance interval differs from a confidence interval in that the former encloses a proportion of the entire population distribution, while the latter is constructed to contain the value of a population parameter. In both cases, the sense in which the intervals "contain" their target values is interpreted with reference to repeated applications in many samples.

Returning to the validity generalization application, assume that the true validity distribution is normal. If $\rho$ and $\sigma^2$ are known, a tolerance interval can be constructed to contain a proportion $P$ of the true validity distribution as

$$\rho \pm Z_P\,\sigma \tag{1}$$

where $Z_P$ is the standard normal deviate such that $\Pr(Z > Z_P) = (1 - P)/2$. In constructing credibility intervals, only the lower bound is of interest. If a proportion $R$ of the true validity distribution is to lie above this bound, the appropriate bound is

$$C_1 = \rho - Z_R\,\sigma \tag{2}$$

where $\Pr(Z > Z_R) = 1 - R$. The interval in Expression 1 and the bound $C_1$ in Equation 2 are exact in the sense that $\rho$ and $\sigma^2$ are known, and the bounds are unaffected by sampling error. It is expected that a proportion $R$ of the future validation studies will have true validities which exceed $C_1$, assuming that the studies are selected from the population leading to $C_1$. In the literature, the lower bound of the credibility interval is calculated as $C_1$.

In practice, $\rho$ and $\sigma^2$ are unknown, and the tolerance interval is calculated using the sample estimates $\hat\rho$ and $\hat\sigma^2$ based upon $N$ validation studies. The constructed interval will now be affected by sampling error. Two types of tolerance intervals exist in the literature for this case, and both are discussed by Proschan (1953). The first type is constructed so that on the average, across repeated applications, the proportion of the true validity distribution covered will be $P$. This interval was presented by Wilks (1941), and is given as

$$\hat\rho \pm K_2\,\hat\sigma, \qquad K_2 = t_P\sqrt{\frac{N+1}{N}} \tag{3}$$

where $t_P$ is the value in the Student $t$ distribution such that $\Pr(t > t_P) = (1 - P)/2$. The lower tolerance bound above which a proportion $R$ of the true validity distribution will lie is

$$C_2 = \hat\rho - K_2\,\hat\sigma \tag{4}$$

where $P$ is set equal to $2R - 1$ for $R > .50$. Although it can be expected that, on the average, a proportion $R$ of the distribution will lie above $C_2$, there is no explicit control over how often this will be true across repeated applications. Thus the "confidence level," expressing the degree of expectation that a proportion $R$ of the distribution will lie above $C_2$, cannot be altered. For this bound, the confidence level is fixed at 50% (Proschan, 1953).

The second type of tolerance interval allows specification of a confidence level $\gamma$ to be associated with the interval. This interval is expected to cover at least a proportion $P$ of the true validity distribution in $\gamma$% of the applications. Wald and Wolfowitz (1946) derived the method of construction for these tolerance intervals. Tables giving the value of $K_3$ in

$$\hat\rho \pm K_3\,\hat\sigma \tag{5}$$

for specified values of $P$, $\gamma$, and $N$ can be found in Bowker (1947), Dixon and Massey (1969), Burington and May (1970), or Beyer (1968). The lower tolerance bound above which a proportion $R$ of the true validity distribution will lie with confidence $\gamma$% is found as

$$C_3 = \hat\rho - K_3\,\hat\sigma \tag{6}$$

by entering the tables with $P = 2R - 1$ for $R > .50$. The bound $C_3$ is slightly inaccurate, but adequate for practical use. A more precise bound can be found by consulting tables of one-sided tolerance intervals in Burington and May (1970) or Owen (1958).

Table 1 gives the values of $K_2$ and $K_3$ at selected values of $N$ for $P = .80$ and $P = .90$, giving values of $R = .90$ and $R = .95$ respectively. Thus for $R = .95$, 95% of the distribution will lie above the tolerance bounds $C_2$ and $C_3$, in the sense discussed earlier. Two different confidence levels ($\gamma = 90\%$, $\gamma = 95\%$) are given for $K_3$. For comparison purposes, the appropriate value of $Z_R$ in Equation 2 is 1.645 at $R = .95$ and 1.282 at $R = .90$ for all values of $N$. From the table, it is clear that both $K_2$ and $K_3$ are greater than $Z_R$ for all $N$, with the discrepancy being larger at smaller values of $N$, as would be expected. The implication is that $C_2$ and $C_3$ will always be less than or equal to $C_1$. The bound $C_3$


Table 1. Values of $K_2$ and $K_3$ at $R = .90$ and $R = .95$ for Selected Sample Sizes
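Factors of the kind tabulated in Table 1 can also be approximated numerically. The following is a minimal sketch rather than the author's procedure. It assumes that $\hat\sigma$ behaves like an ordinary sample standard deviation with $N - 1$ degrees of freedom, takes $K_2$ as the Student $t$ quantile times $\sqrt{(N+1)/N}$, and uses the textbook Wald-Wolfowitz approximation for $K_3$. The function names (`k2`, `k3`, `lower_bounds`) and the example estimates are hypothetical, and the quantile routines are simple standard-library numerics written for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def bisect(f, lo, hi, tol=1e-10):
    """Root of an increasing function f on [lo, hi] by bisection."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def t_cdf(x, df, steps=2000):
    """Student t CDF for x >= 0, by Simpson integration of the density."""
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    c /= math.sqrt(df * math.pi)
    pdf = lambda u: c * (1 + u * u / df) ** (-(df + 1) / 2)
    h = x / steps
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3


def t_quantile(p, df):
    """Quantile of the Student t distribution for p > 0.5."""
    hi = 1.0
    while t_cdf(hi, df) < p:
        hi *= 2
    return bisect(lambda x: t_cdf(x, df) - p, 0.0, hi)


def chi2_cdf(x, df):
    """Chi-square CDF from the series for the lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a, z = df / 2, x / 2
    term = total = 1.0 / a
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= z / (a + n)
        total += term
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def chi2_quantile(p, df):
    """Quantile of the chi-square distribution by bisection."""
    hi = float(df)
    while chi2_cdf(hi, df) < p:
        hi *= 2
    return bisect(lambda x: chi2_cdf(x, df) - p, 0.0, hi)


def k2(n, r):
    """Factor for C2: t quantile at R (that is, P = 2R - 1), N - 1 df assumed."""
    return t_quantile(r, n - 1) * math.sqrt((n + 1) / n)


def k3(n, r, gamma):
    """Factor for C3: Wald-Wolfowitz approximation, entered with P = 2R - 1."""
    p = 2 * r - 1
    d = 1 / math.sqrt(n)
    cover = lambda w: STD_NORMAL.cdf(d + w) - STD_NORMAL.cdf(d - w) - p
    half_width = bisect(cover, 0.0, 10.0)
    return half_width * math.sqrt((n - 1) / chi2_quantile(1 - gamma, n - 1))


def lower_bounds(rho_hat, sigma_hat, n, r, gamma):
    """Return (C1, C2, C3) in the form of Equations 2, 4, and 6."""
    factors = (STD_NORMAL.inv_cdf(r), k2(n, r), k3(n, r, gamma))
    return tuple(rho_hat - k * sigma_hat for k in factors)


# Hypothetical estimates for illustration only, not values from Table 2.
c1, c2, c3 = lower_bounds(rho_hat=0.30, sigma_hat=0.10, n=10, r=0.90, gamma=0.90)
print(f"K2 = {k2(10, 0.90):.3f}, K3 = {k3(10, 0.90, 0.90):.3f}")
print(f"C1 = {c1:.3f}, C2 = {c2:.3f}, C3 = {c3:.3f}")
```

Published tables remain the reference for exact work; a sketch like this is mainly useful for values of $N$ that fall between tabled entries.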

will generally be the lowest of the three bounds, depending upon the level of confidence selected. This is the "price" that must be paid for being able to specify a confidence level for the interval in Expression 5 and the bound in Equation 6.

To illustrate the performance of the three bounds in actual data, Table 2 gives $C_1$, $C_2$, and $C_3$ for a validity generalization study reported by Schmidt et al. (1981). This study examined the generalizability of the validities of four cognitive ability tests for performance criteria in two petroleum industry job groups. The two job groups were classified as "operator" and "maintenance," with job proficiency criteria measured in both cases. The four cognitive tests included mechanical comprehension, chemical comprehension, general intelligence, and arithmetic reasoning measures. The true validity mean and variance estimates in Table 2 are taken from Table 6 (p. 267) in Schmidt et al. (1981). All bounds in Table 2 are calculated for $R = .90$. The bounds $C_1$ are identical to the credibility interval values given in Schmidt et al. (1981).

The tolerance bounds $C_2$ and $C_3$ are uniformly lower than $C_1$, although $C_2$ is generally close to $C_1$. The bound $C_3$ differs significantly from $C_1$ in several cases. The bounds for the general intelligence (GI) and arithmetic reasoning (AR) tests are negative, and the mechanical comprehension (MC) test retains a clearly positive bound only for the operator job group. Schmidt et al. (1981) judged test validity to be generalizable if the bound $C_1$ is positive. If this rule is applied to $C_3$, only the chemical comprehension (CC) test demonstrates generalizable validity for both jobs. This example illustrates that the tolerance bounds $C_2$ and $C_3$ can significantly alter the conclusion of test generalizability when the number of validation studies is low.

Two final points deserve mention. The $N$ to be used in constructing $C_2$ and $C_3$ is the number of validity coefficients cumulated in estimating $\rho$ and $\sigma^2$, and is unrelated to the sample size within each validation study. Even if the true validity in each study were known, the sample size in the tolerance


Table 2. Estimates of the Mean ($\hat\rho$) and Standard Deviation ($\hat\sigma$) of the True Validity, Credibility Bound $C_1$, and Tolerance Bounds $C_2$ and $C_3$ for MC, CC, GI, and AR Test Data From Schmidt, Hunter, and Caplan (1981)
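The behavior of bounds like those in Table 2 across repeated applications can be explored by simulation. The sketch below is illustrative only and rests on several assumptions that are not taken from the article: a normal true validity population with hypothetical parameters (mean .30, standard deviation .10), $N = 10$ studies whose true validities are observed without error, and $R = .90$. It compares the average proportion of the population lying above a bound of the form $\hat\rho - K\hat\sigma$ when $K = Z_R$ (the form of $C_1$ applied to sample estimates) and when $K = K_2$ (the form of $C_2$), with $K_2$ taken as the tabled $t$ value for 9 degrees of freedom times $\sqrt{(N+1)/N}$. The function name `average_proportion_above` is hypothetical.

```python
import math
import random
from statistics import NormalDist


def average_proportion_above(k, n, reps=20000, mu=0.30, sigma=0.10, seed=1):
    """Average, over repeated samples of n true validities, of the proportion
    of the population lying above the bound (sample mean - k * sample SD)."""
    rng = random.Random(seed)
    population = NormalDist(mu, sigma)
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        m = sum(sample) / n
        sd = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
        bound = m - k * sd
        total += 1 - population.cdf(bound)  # true proportion above the bound
    return total / reps


N, R = 10, 0.90
Z_R = NormalDist().inv_cdf(R)          # factor used for C1
K2 = 1.3830 * math.sqrt((N + 1) / N)   # t(.90, 9 df) from standard tables

avg_c1 = average_proportion_above(Z_R, N)
avg_c2 = average_proportion_above(K2, N)
print(f"C1-type bound: average proportion above = {avg_c1:.3f}")
print(f"C2-type bound: average proportion above = {avg_c2:.3f}")
```

Under these assumptions the $C_2$-type bound should cover close to the target proportion on average, while the plug-in $C_1$-type bound should fall short of it; other population shapes or values of $N$ can be tried by changing the arguments.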

interval procedure would be the number of coefficients. However, the researcher is interested in the population of true validities, from which the studies at hand are only a sample.

Secondly, the bounds $C_2$ and $C_3$ assume normality of the true validity distribution. There is no known empirical evidence to directly support this assumption. All three bounds $C_1$, $C_2$, and $C_3$ may be inaccurate in skewed distributions. Distribution-free tolerance intervals, which make no assumptions about the form of the population distribution, can be constructed from sample order statistics (Kendall & Stuart, 1979). These order statistics are typically unavailable in validity generalization research, hence this approach will not be pursued here.

Conclusions

When validity generalization research is approached from a frequentist perspective, the true validity distribution is viewed as a potentially realizable "population" of true validities. The validation studies cumulated in the validity generalization procedure are viewed as a sample of size $N$ from this population. The estimates of the mean and variance of the true validity distribution are used to construct a lower tolerance bound, above which a specified proportion of the true validity distribution is expected to lie. Because sample estimates are used to construct the bound, the bound can be expected to vary across validity generalization studies. "Confidence" in any particular bound is grounded in the procedure used to construct the bound, and the expected performance of that procedure across many repeated applications.

From a frequentist perspective, no confidence statements can be attached to $C_1$ when calculated from sample data. On the average, a proportion $R$ of the true validity distribution will lie above $C_2$. In other words, for a randomly selected study, it can be stated with 50% confidence that a proportion $R$ of the distribution will lie above $C_2$ (Proschan, 1953). The bound $C_3$ allows more stringent confidence statements. Both $C_2$ and $C_3$ will generally give lower bounds than $C_1$, but $C_2$ and $C_3$ have known performance characteristics across repeated applications. As $N$ increases, the three bounds will converge to the same value. For small $N$, the bounds may differ considerably. Validity generalization analyses using fewer than 50 studies are not uncommon (Pearlman et al., 1980; Sackett, Harris, & Orr, 1986). In these cases, investigators may wish to use $C_2$ or $C_3$ in preference to $C_1$.

References

Beyer, W. H. (1968). Handbook of tables for probability and statistics (2nd ed.). Cleveland OH: The Chemical Rubber Company.


Bowker, A. H. (1947). Tolerance limits for normal distributions. In Columbia University, Statistical Research Group, Techniques of statistical analysis. New York: McGraw-Hill.

Burington, R. S., & May, D. C. (1970). Handbook of probability and statistics with tables (2nd ed.). New York: McGraw-Hill.

Callender, J. C., & Osburn, H. G. (1981). Testing the constancy of validity with computer-generated sampling distributions of the multiplicative model variance estimate: Results for petroleum industry validation research. Journal of Applied Psychology, 66, 274-281.

Callender, J. C., Osburn, H. G., Greener, J. M., & Ashworth, S. (1982). Multiplicative validity generalization model: Accuracy of estimates as a function of sample size and mean, variance and shape of the distribution of true validities. Journal of Applied Psychology, 67, 859-867.

Dixon, W. J., & Massey, F. J., Jr. (1969). Introduction to statistical analysis. New York: McGraw-Hill.

Kendall, M., & Stuart, A. (1979). The advanced theory of statistics, Vol. II. New York: Macmillan.

Linn, R. L., Harnisch, D. L., & Dunbar, S. B. (1981). Validity generalization and situational specificity: An analysis of the prediction of first-year grades in law school. Applied Psychological Measurement, 5, 281-289.

Linn, R. L., & Hastings, C. N. (1984). A meta-analysis of the validity of predictors of performance in law school. Journal of Educational Measurement, 21, 245-249.

Mood, A. M., Graybill, F. A., & Boes, D. C. (1974). Introduction to the theory of statistics (3rd ed.). New York: McGraw-Hill.

Novick, M. R., & Jackson, P. H. (1974). Statistical methods for educational and psychological research. New York: McGraw-Hill.

Osburn, H. G., Callender, J. C., Greener, J. M., & Ashworth, S. (1983). Statistical power of tests of the situational specificity hypothesis in validity generalization studies: A cautionary note. Journal of Applied Psychology, 68, 115-122.

Owen, D. B. (1958). Tables of factors for one sided tolerance limits for a normal distribution (Monograph SCR-13). Washington DC: Sandia Corporation.

Pearlman, K., Schmidt, F. L., & Hunter, J. E. (1980). Validity generalization results for tests used to predict job proficiency and training success in clerical occupations. Journal of Applied Psychology, 65, 373-406.

Proschan, F. (1953). Confidence and tolerance intervals for the normal distribution. Journal of the American Statistical Association, 48, 550-564.

Sackett, P. R., Harris, M. M., & Orr, J. M. (1986). On seeking moderator variables in the meta-analysis of correlational data: A Monte Carlo investigation of statistical power and resistance to Type I error. Journal of Applied Psychology, 71, 302-310.

Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 62, 529-540.

Schmidt, F. L., & Hunter, J. E. (1981). Employment testing: Old theories and new research findings. American Psychologist, 36, 1128-1137.

Schmidt, F. L., Hunter, J. E., & Caplan, J. R. (1981). Validity generalization results for two job groups in the petroleum industry. Journal of Applied Psychology, 66, 261-273.

Schmidt, F. L., Hunter, J. E., & Pearlman, K. (1982). Progress in validity generalization: Comments on Callender and Osburn and further developments. Journal of Applied Psychology, 67, 835-845.

Schmidt, F. L., Hunter, J. E., Pearlman, K., & Shane, G. S. (1979). Further tests of the Schmidt-Hunter validity generalization procedure. Personnel Psychology, 32, 257-281.

Schmidt, F. L., Pearlman, K., Hunter, J. E., & Hirsh, H. R. (1985). Forty questions about validity generalization and meta-analysis. Personnel Psychology, 38, 697-798.

Wald, A., & Wolfowitz, J. (1946). Tolerance limits for a normal distribution. Annals of Mathematical Statistics, 17, 208-215.

Wilks, S. S. (1941). Determination of sample sizes for setting tolerance limits. Annals of Mathematical Statistics, 12, 91-96.

Acknowledgments

Preparation of this article was supported in part by PSC-CUNY grant #661191. The author thanks several anonymous reviewers for their helpful comments.

Author's Address

Send requests for reprints or further information to Roger E. Millsap, Department of Psychology, Baruch College, 17 Lexington Avenue, New York NY 10010, U.S.A.
