Journal of Applied Psychology, 1994, Vol. 79, No. 4, 599-616. Copyright 1994 by the American Psychological Association, Inc. 0021-9010/94/$3.00
The Validity of Employment Interviews: A Comprehensive Review and Meta-Analysis
Michael A. McDaniel, Deborah L. Whetzel, Frank L. Schmidt, and Steven D. Maurer
This meta-analytic review presents the findings of a project investigating the validity of the employment interview. Analyses are based on 245 coefficients derived from 86,311 individuals. Results show that interview validity depends on the content of the interview (situational, job related, or psychological), how the interview is conducted (structured vs. unstructured; board vs. individual), and the nature of the criterion (job performance, training performance, and tenure; research or administrative ratings). Situational interviews had higher validity than did job-related interviews, which, in turn, had higher validity than did psychologically based interviews. Structured interviews were found to have higher validity than unstructured interviews. Interviews showed similar validity for job performance and training performance criteria, but validity for the tenure criteria was lower.
Michael A. McDaniel, Department of Psychology, University of Akron; Deborah L. Whetzel, American Institutes for Research, Washington, DC; Frank L. Schmidt, Department of Management and Organizations, University of Iowa; Steven D. Maurer, Department of Management, Old Dominion University.

John Hunter and James Russell made contributions to earlier versions of this article; their contributions are appreciated. We are also indebted to many authors of primary validity studies for releasing their results to us. We appreciate the substantial assistance of Allen Huffcutt in obtaining many obscurely published interview studies and for insightful critiques of drafts of the article. We also appreciate the data provided to us by Willi Wiesner and Steve Cronshaw that permitted the interstudy reliability analyses.

Correspondence concerning this article should be addressed to Michael A. McDaniel, Department of Psychology, University of Akron, Akron, Ohio 44325-4301.

The interview is a selection procedure designed to predict future job performance on the basis of applicants' oral responses to oral inquiries. Interviews are one of the most frequently used selection procedures, perhaps because of their intuitive appeal for hiring authorities. Ulrich and Trumbo (1965) surveyed 852 organizations and found that 99% of them used interviews as a selection tool. Managers and personnel officials, especially those who are interviewers, tend to believe that the interview is valid for predicting future job performance. In this article, we quantitatively cumulate and summarize research on the criterion-related validity of the employment interview.

Our purpose in this article is threefold. First, we summarize past narrative and quantitative reviews of criterion-related validity studies of the employment interview. Second, we report research that extends knowledge of the criterion-related validity of interviews through meta-analyses conducted on a more comprehensive database than has been available to earlier investigators. Third, we examine the criterion-related validity of different categories of interviews that vary in type and in structure.

Traditional Reviews of Interview Validity

Seven major literature reviews of interview research have been published during the past 35 years (Arvey & Campion, 1982; Harris, 1989; Mayfield, 1964; Schmitt, 1976; Ulrich & Trumbo, 1965; Wagner, 1949; O. R. Wright, 1969). Only 25 of the 106 articles on the validity or the reliability of interviews located by Wagner reported any quantitative data. Wagner's major conclusions were that (a) quantitative research on the interview is much needed; (b) the validity and reliability of the interview may be highly specific to both the situation and the interviewer; (c) the interview should be confined to evaluating factors that cannot be measured accurately by other methods; (d) the interview is most accurate when a standardized approach is used; and (e) the interviewer must be skilled in eliciting complete information from the applicant, observing significant behavior, and synthesizing developed information.

Mayfield (1964) made prescriptive statements that he considered justified by empirical findings. He concluded that typical unstructured interviews with no prior data on the interviewee are inconsistent in their coverage. He also found that interview validities are low even in studies with moderate reliabilities, although structured interviews generally show higher interrater reliabilities than do unstructured interviews. Mayfield indicated that individual interviewers, although consistent in their approach to interviewees, are inconsistent in their interpretation of data, perhaps because interviewers' attitudes bias their judgments and because there is a tendency for interviewers to make decisions early in the unstructured interview. Concerning the assessment of cognitive ability, Mayfield concluded that intelligence is the human quality that may be best estimated from an interview but that interviewer assessments offer little, if any, incremental validity over test scores alone. When the test score was known, the interview contributed nothing to the predictive validity in a multiple-assessment procedure. Mayfield also concluded that the interrater reliability of the interview was satisfactory. However, interrater reliability estimates were not based on two independent interviews but, rather, were correlations between raters in the same interview. We return to this point later.

On the other hand, Ulrich and Trumbo (1965) concluded that personal relations and motivation were two areas that contributed to most decisions made by interviewers and that these two attributes were valid predictors. They concluded that the
use of the interview should be limited to one or two content areas as part of a sequential testing procedure.

O. R. Wright (1969) summarized the research on the selection interview since 1964 and concluded that (a) interview decisions are made on the basis of behavioral and verbal cues; (b) rapport between interviewer and interviewee is an important variable that influences the effectiveness of the interview; and (c) structured or patterned interviews are more reliable than unstructured interviews.

Schmitt (1976) reviewed the literature on interviews and found that nearly all of the recent studies on employment interviews focused on factors that influenced interview decisions rather than on the outcomes resulting from those decisions. Examples of factors include research on photographs and the effects of appearance (Carlson, 1967); contrast effects (Carlson, 1970; Hakel, Ohnesorge, & Dunnette, 1970; Wexley, Yukl, Kovacs, & Sanders, 1972); situational variables, including quota position (Carlson, 1967); and race of the interviewer (Ledvinka, 1973).

Arvey and Campion (1982) reviewed the literature from 1975 to 1982 and found that there was an increase in research investigating possible bias in the interview. Attention had been focused on the interaction of group membership variables and interviewers' decision making. They also found that researchers were investigating several variables that influenced the interview, such as nonverbal behavior, interviewees' perceptions of interviewers, and interviewer training. Arvey and Campion considered most of these studies to be microanalytic in nature, studying only a narrow range of variables. They also concluded that researchers were becoming more sophisticated in their research methods and strategies but had neglected the person-perception literature, including attribution models and implicit personality theory. Finally, Arvey and Campion postulated reasons for the continued use of the interview in spite of the previous evidence of its low validity and reliability.

Most recently, Harris (1989) summarized the qualitative and quantitative reviews of interview validity and concluded that, contrary to the popular belief that interviews lack validity, recent evidence suggested that the interview had at least moderate validity. In addition, Harris addressed other issues related to the interview, such as decision making, applicant characteristics, and interviewer training.

Previous Quantitative Reviews of Interview Validity

In the past 20 years, five quantitative reviews of interview validity have been conducted (Dunnette, Arvey, & Arnold, 1971; Hunter & Hunter, 1984; Reilly & Chao, 1982; Wiesner & Cronshaw, 1988; P. M. Wright, Lichtenfels, & Pursell, 1989). The three earliest reviews may be termed quantitative because the mean observed validity was calculated for each set of studies. However, they were not typical validity generalization studies (Hunter & Schmidt, 1990), because estimates of the population true validities were not calculated, and variance statistics needed for validity generalization inferences were not reported. Furthermore, these studies did not examine the validity of various types of interviews separately. Hunter and Hunter's analysis included 27 coefficients analyzed separately for four criteria: supervisor ratings, promotion, training success, and tenure. Their results, shown in Table 1, indicated that the interview is a better predictor of supervisor ratings than it is of promotion, training success, and tenure; but even for supervisory ratings, the correlation was low (.14). Reilly and Chao obtained a higher validity (.19), but they had available only 12 coefficients. Dunnette et al. used 30 coefficients and obtained an average validity of .16, which is comparable to Reilly and Chao's results. Table 1 shows the number of coefficients and average validities found by each researcher, listed according to criterion.

Table 1
Number of Validity Coefficients and Average Observed Validities Reported for the Interview by Type of Criterion

Researcher                              Criterion                                     No. of rs   Mean validity
Dunnette, Arvey, & Arnold (1971)        Supervisor ratings                                30          .16
Reilly & Chao (1982)                    Mixture of training and performance               12          .19
Hunter & Hunter (1984)                  Supervisor ratings                                10          .14
                                        Promotion                                          5          .08
                                        Training success                                   9          .10
                                        Tenure                                             3          .03
Wiesner & Cronshaw (1988)               Performance (primarily supervisor ratings)       150          .26a
Wright, Lichtenfels, & Pursell (1989)   Supervisor ratings                                13          .26b

a Disattenuated coefficient = .47.  b Disattenuated coefficient = .39.

P. M. Wright et al. (1989) conducted a meta-analysis of 13 investigations of structured interviews. This study differed from Hunter and Hunter's (1984) and Reilly and Chao's (1982) studies in that more advanced quantitative cumulation methods were used, which permitted estimation of the distribution of population validity coefficients. Initially, P. M. Wright et al. found that after correcting for criterion unreliability, the mean validity of the interview was .37. Placing the 95% credibility interval around this mean suggested that the true validity was between -.07 and .77. Because this interval included zero, they looked for moderators and outliers. After they eliminated a study reporting a negative validity coefficient (-.22; Kennedy, 1986), the mean corrected validity was .39 with a 95% credibility interval from .27 to .49. The authors noted that the interval did not include the mean validity given by Hunter and Hunter (.14; 1984) and stated that this indicated a real difference in the predictive power of the structured interview over the traditional, unstructured interview.

In the most comprehensive meta-analytic summary to date, Wiesner and Cronshaw (1988) investigated interview validity as a function of interview format (individual vs. board) and degree of structure (structured vs. unstructured). Results of this study showed that structured interviews yielded much higher mean corrected validities than did unstructured interviews (.63 vs. .20) and that structured board interviews using consensus ratings had the highest corrected validity (.64).

Need for Additional Quantitative Reviews

Wiesner and Cronshaw's (1988) meta-analysis of employment interview validities was a major integration of the literature. However, there are at least three reasons for conducting additional analyses of employment interview validity. First, the employment interview is the second most frequently used applicant-screening device (Ash, 1981). (Reviews of resumes and employment application materials are the most common; see McDaniel, Schmidt, & Hunter, 1988, for an analysis of that literature.) Given the popularity of the employment interview, more than one comprehensive review of the approach is warranted, particularly when an expanded data set such as the one used in this analysis is available. The two additional reasons for further investigation rest on a comparison between the Wiesner and Cronshaw study and the present research.

Second, the present study goes beyond Wiesner and Cronshaw's (1988) contribution to summarize validity information by criterion type. Wiesner and Cronshaw did not differentiate their criterion by job performance, training performance, and tenure. We believe that criterion-type distinctions are important because validities usually vary by criterion type, and we believe that separate analyses are warranted because of the
differences in mean reliability for the three types of criteria (Hunter & Hunter, 1984; Lilienthal & Pearlman, 1983; Pearlman, 1979; Pearlman, Schmidt, & Hunter, 1980; Schmidt, Hunter, Pearlman, & Hirsh, 1985). We examined validity data for training performance, tenure, and job performance criteria. The validity results are reported separately, and appropriate criterion reliabilities are used.

Third, this study examines the validity of three different types of interview content: situational, job related, and psychological. Wiesner and Cronshaw (1988) did not differentiate among categories of interview content. We believe that it is likely that validities vary as a function of the content of the information collected in the interview.

In summary, although Wiesner and Cronshaw's (1988) meta-analysis was a major contribution, the present study provides a significant incremental contribution to knowledge of the interview in several areas. First, an extensive literature search was conducted that resulted in a larger number of validity coefficients than had been examined by previous researchers. Second, the validity of the interview was quantitatively assessed as a function of both interview content and structure. Third, the validity of the interview was summarized for three categories of criteria: job performance, training performance, and tenure.

We hypothesized that the validity of the employment interview depends on three factors: (a) the content of information collected during the interview, (b) how interview information is collected, and (c) criteria used to validate the interview. Each of these factors is described below.

Content of Information Collected

The first factor likely to affect interview validity is the content of the interview. In the present research, a distinction was drawn between situational, job-related (but not primarily situational), and psychological interviews. Questions in situational interviews (Latham & Saari, 1984; Latham, Saari, Pursell, & Campion, 1980) focus on the individual's ability to project what his or her behavior would be in a given situation. For example, an interviewee may be evaluated on the choice between winning a coveted award for lowering costs and helping a coworker who will make a significant profit for the company (Latham, 1989). Job-related interviews are those in which the interviewer is a personnel officer or hiring authority and the questions attempt to assess past behaviors and job-related information, but most questions are not considered situational. Psychological interviews are conducted by a psychologist, and the questions are intended to assess personal traits, such as dependability. Psychological interviews are typically included in individual assessments (Ryan & Sackett, 1989) of personnel that are conducted by consulting psychologists or management consultants.

Recently, some researchers have asserted the benefits of behavioral interviews (Janz, 1982, 1989). There were too few validity coefficients for these studies to be analyzed separately. Behavioral interviews describe a situation and ask respondents how they have behaved in the past in such a situation. One might argue that because the applicant is responding to a question about behavior in a specific situation, the behavioral interviews should be classified as situational. Others might argue that because the interviews solicit information about past behavior, the interviews should be classified as job related. In this study, we assigned the studies to the job-related category. As more validity data are accumulated, we recommend that the validity of the behavioral interview be evaluated as a distinct category of interview content.

These three categories were not completely disjoint. Situational interviews are designed to be job related, but they represent a distinct subcategory worthy of separate analysis. Nonpsychologist interviewers are interested in psychological traits such as dependability, and psychologists are interested in an applicant's past work experience. Also, job-related interviews, which are not primarily situational interviews, often will include some situational questions. Still, some interviews are primarily situational, others are primarily nonsituational and job related in content, and others are primarily psychological. We hypothesized that situational interviews would prove to be more valid than job-related interviews, which, in turn, we expected to be more valid than psychological interviews.
How Information Is Collected

Interviews can be differentiated by the extent of their standardization. For example, most oral board examinations conducted by civil service agencies in the public sector were considered to be structured because the questions and acceptable responses were specified in advance and the responses were rated for appropriateness of content. McMurry's (1947) "patterned interview" is an early example of a structured interview. It is described as a printed form containing specific items to be covered and providing a uniform method of recording information and rating the interviewees' qualifications.

Unstructured interviews gather applicant information in a less systematic manner than do structured interviews. Although the questions may be specified in advance, they usually are not, and there is seldom a formalized scoring guide. Also, all persons being interviewed are not typically asked the same questions. This difference in interview structure may affect the interview evaluations in two ways. First, the lack of standardization may cause the unstructured interviews to be less reliable than the structured interviews. Second, structured interviews may be better than unstructured interviews for obtaining a wide range of applicant information. On the basis of these differences, we hypothesized higher validities for structured interviews.

The distinction between structured and unstructured interviews is not independent from interview content. Although some job-related interviews were structured and others unstructured, most of the psychological interviews were categorized as unstructured, and all of the situational interviews were classified as structured.

Another consideration for users of the employment interview is the use of board interviews (interviews with more than one rater in the same interview) versus individual interviews (one rater per interview). Because board interviews have higher administrative costs than individual interviews, they must be more valid if they are to be cost-effective. Several researchers (Arvey & Campion, 1982; Pursell, Campion, & Gaylord, 1980) have suggested that board interviews may be more valid than individual interviews. Wiesner and Cronshaw (1988) have provided the most extensive analysis of the validity of these two types of interviews. For unstructured interviews, board interviews were more valid than individual interviews (mean observed coefficients = .21 vs. .11, respectively). However, for structured interviews, board interviews and individual interviews were similar in validity (mean observed coefficients = .33 vs. .35, respectively). We made no hypotheses concerning differences in validity between board and individual interviews.

Another issue to consider is the interviewers' access to cognitive ability test scores. If interviewers have access to test scores, the final interview evaluations should have higher validities than if the interviewer did not have this information. This would be expected because the interviewer would have access to more criterion-relevant information.

Criteria Used to Validate the Interview

In past validity generalization studies (Hirsh, Northrop, & Schmidt, 1986; Hunter & Hunter, 1984; Lilienthal & Pearlman, 1983; Pearlman, 1979; Pearlman et al., 1980), validity has been found to vary depending on whether job performance measures or training performance measures served as the criteria. Therefore, we examined the validity of the interview for both of these criteria as well as for tenure. No specific hypotheses were made concerning the differences in validities between the criteria.

When possible, we also drew distinctions between job performance criteria collected for research purposes and for administrative purposes (i.e., performance ratings collected as part of an organizationwide performance appraisal process). This distinction was based on the premise that non-performance-related variables, such as company policies or economic factors, are more likely to influence criteria collected for administrative purposes than are those collected for research purposes. Wherry and Bartlett (1982) hypothesized that ratings collected solely for research purposes would be more accurate than ratings collected for administrative purposes. Several studies have demonstrated that ratings collected for administrative purposes are significantly more lenient and exhibit more halo than do ratings collected for research purposes (Sharon & Bartlett, 1969; Taylor & Wherry, 1951; Veres, Field, & Boyles, 1983; Warmke & Billings, 1979). Zedeck and Cascio (1982) also found that the purpose of the rating was an important correlate of ratings of performance. In the present study, all validity coefficients for training and tenure criteria were judged to be administrative criteria. We hypothesized that validities would be higher for research criteria than for administrative criteria.

Method

Meta-Analysis as a Method of Determining Validity

We used Hunter and Schmidt's (1990, p. 185) psychometric meta-analytic procedure to test the proposed hypotheses. This statistical technique estimates how much of the observed variation in results across studies is due to statistical and methodological artifacts rather than to substantive differences in underlying population relationships. Some of these artifacts also reduce the correlations below their true (e.g., population) values. Through this method, the variance attributable to sampling error and to differences between studies in reliability and range restriction is determined, and that amount is subtracted from the total amount of variation, yielding estimates of the true variation across studies and of the true average correlation. We used artifact distribution meta-analysis with the interactive method (Hunter & Schmidt, 1990, chap. 4). The mean observed correlation was used in the sampling error variance formula (Hunter & Schmidt, 1990, pp. 208-210; Law, Schmidt, & Hunter, 1994; Schmidt et al., 1993). The computer program that we used is described in McDaniel (1986). Additional detail on the program is presented in Appendix B of McDaniel et al. (1988).

In our analyses, we corrected the mean observed validity for mean attenuation due to range restriction and criterion unreliability (Hunter & Schmidt, 1990, p. 165). Although employment interviews used in selection have less than perfect reliability, the mean validity was not corrected for predictor unreliability because the goal was to estimate the operational validity of interviews for selection purposes. However, the observed variance of validities was corrected for variation across studies in interview unreliabilities. In addition, the variance of the observed validities was corrected for variation across studies in criterion unreliabilities and range-restriction values. For comparison purposes, we also reported analyses in which no range-restriction corrections were made in either the mean or variance of the interview distributions. The rationale for conducting the analyses with and without range-restriction corrections is detailed later under Artifact Information.
Literature Review

We conducted a thorough search for validity studies, starting with the database of validity coefficients collected by the U.S. Office of Personnel Management (Dye, 1988). In addition, we examined references from the five literature reviews cited above to determine if the articles contained validity coefficients. Although this literature search yielded more validity coefficients than had been assembled by other reviewers (Dunnette et al., 1971; Hunter & Hunter, 1984; Reilly & Chao, 1982; Wiesner & Cronshaw, 1988; P. M. Wright et al., 1989), it is likely that we did not obtain all existing validity coefficients. However, this search, extending over a period of 8 years, is perhaps the most complete search ever conducted for interview validities. The references for the studies are listed in the Appendix.

Decision Rules

We used certain decision rules for selecting relevant studies for this meta-analysis. First, only studies using measures of overall job performance, training performance, or tenure as criteria were included. For example, Rimland's (1958) study was not included in this analysis because an attitude scale, the Career Intention Questionnaire, was the only criterion. Another study excluded was Tupes's (1950), in which the criterion was the combined judgment of three clinicians who studied each subject for 7 days and made final ratings of psychological material gathered during the assessment. Also excluded were studies in which the interview attempted to predict intelligence test scores and not job performance, training performance, or tenure. Data from Putney (1947) were excluded because the criterion contrasted the training performance of those selected by an interview with those who were not screened, and we did not consider this an interview validity coefficient.

When the criterion was on-the-job training, the distinction between job performance and training performance was not clear. For purposes of this meta-analysis, on-the-job training—when the employee was performing the job but was considered to be in training—was coded as job performance. On the other hand, results of classroom or other formal instruction were uniformly coded as training performance.

We also omitted studies in which there was no actual interview, as the interview is traditionally understood. For example, the findings of Campbell, Otis, Liske, and Prien (1962) were excluded because they reported the validity of a summary report made by a person who had access to interview scores assigned by someone else. Similarly excluded were the studies by Grant and Bray (1969) and by Hilton, Bolin, Parker, Taylor, and Walker (1955), in which the raters did not conduct the actual interviews but derived scores by reviewing narrative summaries of interview reports. We also excluded the findings of Trankell (1959), who used standardized tests to measure such traits as panic resistance and sensitivity; those of Dicken and Black (1965), who reported the validity of clinical interpretations of an objective test battery; and those of Denton (1964), who studied "interview-type data" that consisted of written responses to questions. Data from Miles, Wilkins, Lester, and Hutchens (1946) were excluded because three individuals were screened per minute. Data based on enlisted military personnel presented in Bartlett (1950) were not included because most of the interviews lasted for only a few minutes.

For our purposes, the individuals being interviewed were required to be employees or applicants for a job. We excluded data from Anderson (1954) because those interviewed were doctoral candidates. Data from Dann and Abrahams (1970), Newmann and Abrahams (1979), Walsh (1975), and Zaccaria et al. (1956) were excluded because the interviewees were college students. Psychiatric interviews were also omitted from this analysis when the interviewee was not a job applicant or employee. Examples of excluded studies include Matarazzo, Matarazzo, Saslow, and Phillips (1958) and Hunt, Herrmann, and Noble (1957), which involved psychiatric interviews conducted at mental hospitals.

The same data were often reported in more than one study. For example, we did not include data from Felix, Cameron, Bobbitt, and Newman (1945) because these data were duplicated in Bobbitt and Newman (1944). We did not code data from Johnson (1979) because an expanded data set was available in Johnson and McDaniel (1981). We did not include data from Drucker (1957) because the same data appeared to be reported in Parrish, Klieger, and Drucker (1955) and in Rundquist (1947). We did not include data from Reeb (1968) because it appeared that the same data were reported in Reeb (1969). Finally, we did not include data from Glaser, Schwartz, and Flanagan (1956) because these data were duplicated in Glaser, Schwartz, and Flanagan (1958).

Consistent with the decision rules of past validity generalization studies (Rothstein & McDaniel, 1989), we did not include incomplete data from sparse matrices. For example, studies reporting only significant correlations were omitted because capitalization on chance would cause these data to bias the mean validities upward.

If the interview was conducted as part of an assessment center or a similar multiple assessment procedure (so that the ratings could have been affected by performance in other exercises), the data were excluded from our analyses. This eliminated data from several studies (e.g., Dunnette, 1970), including a set of studies based on data obtained from the British civil service system (Anstey, 1966, 1977; Castle & Garforth, 1951; Gardner & Williams, 1973; Handyside & Duncan, 1954; Vernon, 1950; Wilson, 1948). These civil service studies incorporated the interview into assessment-center-like screening that sometimes lasted up to a week (Gardner & Williams, 1973). Note that the data from Handyside and Duncan were excluded because only corrected coefficients were reported and insufficient data were described to permit estimation of the uncorrected coefficients.
Kennedy (1986) reported several studies that contained two-part interviews. One of the parts was classified as structured by our decision rules and the other as situational. In two of the studies, validities were reported separately for each part. In these cases, we assigned one coefficient to the structured category and the other to the situational category.

We coded the retained studies by using guidelines established for a larger validity generalization project (McDaniel & Reck, 1985). Content of the interview was categorized into three groups: situational, job related, and psychological. Information on how the interview was conducted was summarized in three sets of categories: (a) structured and unstructured, (b) board and individual, and (c) test information available to the interviewer and no test information available to the interviewer. Information on the nature of the criterion was summarized into two sets of categories: (a) criterion content (job performance, training performance, or tenure) and (b) purpose for which the criterion was collected (administrative or research). Because these categories could be correlated, we used a hierarchical moderator analysis to assess the potential confounding of analyses due to correlation and interaction. In a fully hierarchical moderator analysis, the data set of correlations is broken down by one key moderator variable first, and then, within each subgroup, subsequent moderator analyses are undertaken one by one in a hierarchical manner (Hunter & Schmidt, 1990, p. 527). Although the coding scheme permitted a variety of detailed analyses, there were far too few interview validity coefficients to conduct a full hierarchical analysis. Partial hierarchical analyses were conducted when the number of available coefficients permitted a meaningful analysis; the sketch below illustrates this breakdown logic.
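The following schematic is our illustration, not the authors' program; the example rows, the field layout, and the minimum number of coefficients required before a subgroup is analyzed are all hypothetical (the article does not state a fixed threshold).

```python
from itertools import groupby

# Hypothetical coded studies: (content, structure, criterion purpose, n, r).
studies = [
    ("job related", "structured", "research", 120, 0.28),
    ("job related", "structured", "administrative", 200, 0.20),
    ("job related", "unstructured", "administrative", 150, 0.22),
    ("situational", "structured", "research", 90, 0.27),
]

MIN_K = 2  # illustrative threshold for a "meaningful" subgroup analysis

def weighted_mean_r(subset):
    """Sample-size-weighted mean observed correlation for a subgroup."""
    total_n = sum(n for *_, n, r in subset)
    return sum(n * r for *_, n, r in subset) / total_n

def hierarchical(subset, moderators, depth=0):
    """Split by the first moderator, then recurse within each subgroup on
    the remaining moderators; stop when a subgroup is too small."""
    if not moderators:
        return
    key = moderators[0]
    for value, group in groupby(sorted(subset, key=key), key=key):
        rows = list(group)
        if len(rows) < MIN_K:
            continue  # too few coefficients for a meaningful analysis
        print("  " * depth, value, len(rows), round(weighted_mean_r(rows), 2))
        hierarchical(rows, moderators[1:], depth + 1)

# Break down by content first, then structure, then criterion purpose.
hierarchical(studies, [lambda s: s[0], lambda s: s[1], lambda s: s[2]])
```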
For example, hierarchical analyses were most fully implemented in studies using job performance criteria, much less so in studies using training performance criteria, and not at all with validities using tenure criteria.

When more than one coefficient from a sample was available, we invoked decision rules to include only one of the coefficients. A common reason for multiple coefficients from a sample was that more than one job performance criterion, each using a different measurement method (e.g., supervisory ratings or production data), was available. In this circumstance, the coefficient using a supervisory rating was retained. In some studies, coefficients using more than one training performance criterion, each using a different measurement method (e.g., instructor ratings or job knowledge test), were available. In this case, we retained the coefficient based on the job knowledge test criterion. When the choice was between validities based on a single criterion measurement or on a composite measurement (e.g., a composite criterion based on written test scores and an instructor rating), we retained the coefficient based on the composite criterion. When multiple coefficients were due solely to differences in the amount of time between the collection of the predictor and criterion data, coefficients based on longer time periods were chosen over those based on shorter or unknown time periods. Interviews in which the interviewer had access to test scores were preferred. Interviews conducted by one interviewer were chosen over those conducted by multiple interviewers. The above decision rules, involving choices among criterion measurement methods, enhance the generalization of the results to the most frequently occurring criterion measurement methods and are consistent with the distributions of criterion reliabilities used in this analysis.

Reliability of Coding

When coding studies for validity generalization analyses, the reliability of interest is that between coders of studies used in the meta-analysis. Previous research (Whetzel & McDaniel, 1988) has indicated that such coding was very reliable and that independent investigators, coding the same data, record the same values and obtain the same results. In the present study, we coded data from 59 studies in common with Wiesner and Cronshaw's (1988) study. The correlation between these two data sets was .89 for the correlation coefficients and .97 for the sample sizes. A more compelling comparison of the coding reliabilities examines the meta-analytic results conducted separately on the coefficients coded from these two sets of data. When we corrected only for sampling error, the results indicated that the two sets yielded very similar mean observed coefficients (.29 vs. .26) and standard deviations (.142 vs. .164). The lack of perfect agreement in the coding of correlation coefficients can be attributed to differences in coding decision rules. For example, when separate validity coefficients were available for each dimension of an interview (e.g., interpersonal skills or job knowledge), Wiesner and Cronshaw coded the coefficient for the overall or summary dimension. We coded the coefficient between a composite of the individual dimensions and the criterion. These results support the assertion that the data in the present study were accurately coded.

Artifact Information

The studies contained little information on the reliability of the job performance and job training criteria. Therefore, the criterion reliability distributions used by Pearlman (1979) were used in this study (average criterion reliabilities of .60 and .80 for job performance and training criteria, respectively). A reviewer of this article asserted that our use of a mean criterion reliability of .60 overcorrected the observed coefficients because for some studies the criterion consisted of a composite of ratings supplied by more than one supervisor. Because we suspected that the concern of the reviewer may be shared by others, we sought to fully address this issue.

We cannot agree that the mean reliability of .60 for job performance ratings is an underestimate. Rothstein (1990) found that across 9,975 employees and across all time periods of supervisory exposure to employees, the mean interrater agreement (reliability for one rater) was .48. The mean of .60 applies to situations in which supervisors had about 20 years to observe employee performance. Even if we round the .48 figure up to .50, we find that by using a mean reliability of .60, we have allowed for the use of two raters in 62.5% of the studies. This is undoubtedly a much higher percentage of the studies than, in fact, had two raters. That is, the .60 figure is indeed almost certainly an overestimate, and therefore the reliability corrections were undercorrections. Below we present our analysis supporting the figure of 62.5%. Note the following: .50x + .66(1 - x) = .60. The .60 is the mean reliability figure we used; x is the proportion of studies in which we have (in effect) assumed there is only one rater (reliability = .50); and 1 - x is the proportion of studies in which we have (by assuming that mean reliability = .60) assumed there were two raters. The .66 is the reliability of ratings based on two raters, as determined by the Spearman-Brown formula. Solving for x finds x = .375, and 1 - x = .625. Thus, on the basis of Rothstein's findings, our mean of .60 assumes that one rater was used in 37.5% of the studies and two raters were used in 62.5% of the studies. On the basis of our experience and our reading of the literature, we believe only a minority of studies use ratings by two raters. Thus, we have actually overestimated the reliability of job performance ratings.
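As a worked check of the composite-reliability algebra in the preceding paragraph (our illustration, not part of the original analysis), the following sketch applies the Spearman-Brown formula to the single-rater reliability of .50 and solves the mixing equation for x:

```python
def spearman_brown(r1: float, k: float) -> float:
    """Reliability of a composite of k parallel raters, given the
    single-rater reliability r1 (Spearman-Brown prophecy formula)."""
    return k * r1 / (1.0 + (k - 1.0) * r1)

r1 = 0.50                     # single-rater reliability (.48 rounded up)
print(spearman_brown(r1, 2))  # ~0.667; the article rounds this to .66

# Solve r1*x + r2*(1 - x) = .60 for x, the implied proportion of
# one-rater studies, using the article's rounded two-rater value.
r2 = 0.66
x = (r2 - 0.60) / (r2 - r1)
print(round(x, 3), round(1 - x, 3))  # 0.375 0.625
```

With the unrounded two-rater reliability of about .667, x comes to .40 rather than .375; the conclusion is the same either way, namely that a mean of .60 implicitly treats roughly three fifths of the studies as having ratings from two raters.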
Range restriction is each distribution was based, and the uncorrected mean and indexed as u = s/S, where i is the restricted standard deviation and Sis standard deviation of each distribution. The next three columns the unrestricted standard deviation. Although information on the level present the estimated population mean (p), the estimated popu- of range restriction in employment interviews was not provided in most lation standard deviation (af), and the 90% credibility value for studies, we obtained 15 estimates of range-restriction information. At the distribution of true validities. The population distribution the suggestion of a reviewer, we dropped an outlier in our range-restric- estimates are for distributions in which the mean true validities tion distribution, reducing our distribution of range-restriction statis- are corrected only for unreliability in the criterion, not predic- tics to 14 observations. These data are presented in Table 3. The mean tor unreliability. We corrected the variances of the true validity value of .68 and the standard deviation of .16 for these u values are distributions for sampling error and for differences among the very similar to those from validity studies of aptitude and ability tests (Alexander, Carson, Alliger, & Cronshaw, 1989; Schmidt et al., 1985). studies in predictor and criterion reliability. Data in the last They are also very similar to the interview range-restriction data re- three columns of Tables 4-8 are the results of range-restriction ported by Huffcutt (1992). Although the range-restriction data appear corrections being added to the corrections discussed above. reasonable, the distribution is based on only 14 observations and, thus, Thus, in these last three columns, the mean true validity is cor- might not be representative of the entire literature. Therefore, we con- rected for unreliability in the criterion and range restriction, ducted all analyses twice, once with range-restriction corrections and and the variance is corrected for sampling error and for differ- once without. The validity values obtained without correction for range ences among studies in predictor reliability, criterion reliability, restriction, however, must be regarded as lower bound values (i.e., down- and range restriction. wardly biased). The results from analyses that include range-restriction cor- rections were the most accurate estimates of the population va- Results lidity distributions. The results based on analyses not including The results are organized to address three categories of char- range-restriction corrections yielded mean validity estimates acteristics that might covary with interview validity. Each cate- that were lower bound (downwardly biased) estimates of valid- gory contains one or more subcategories of characteristics: (a) ity. Therefore, in the following text, we focus our discussion on content of the interview—situational versus job-related (non- the results that included range-restriction corrections. situational) versus psychological; (b) how the interview is con- Table 4 shows results detailing how interview validity for job ducted—structured versus unstructured, board versus individ- performance criteria covaries with interview content, interview ual, test information available to the interviewer versus no test structure, and purpose for which the criteria are collected. Ta- information available; (c) the nature of the criterion—job per- ble 5 shows the same analyses for training performance criteria. 
formance versus training performance versus tenure and ad- Table 6 shows individual and board interviews for job perfor- ministrative versus research purpose for collection of the mance criteria. Table 7 allows a comparison of the validity of criterion. interviews for job performance criteria where the interviewer
1 After this article was accepted, J. Conway (personal communica- Table 3 tion, March 15, 1994) alerted us to a group of reliability coefficients Range-Restriction Information for the Interview obtained under conditions such that each applicant was interviewed first by one interviewer and then, on another occasion, was interviewed s/S Frequency by a second interviewer. The mean of the 41 reliability coefficients was .52, a much lower value than the .81 average in Table 2. This mean .452 difference has no implications for the results of this meta-analysis be- .491 .494 cause, as noted before, we do not correct for mean predictor unreliabil- .505 ity. However, it is of interest from a more general viewpoint. This finding .616 means that, on average, 29% of the variance in interview scores—(.81 — .651 .52) X 100—is due to transient error, that is, to occasion-to-occasion .686 instability in applicant responses. Only 19% of the variance—(1.00 — .689 .81) X 100—is due to disagreement between interviewers observing the .696 same interview at a single time. That is, only 19% is due to conspect .828 unreliability. The standard deviation of Conway's reliability coefficients .829 (.23) was larger than that reported in our Table 2 (. 17). This means that .830 .833 our analysis can be expected to somewhat underestimate the variance .962 in validities that is due to variability in predictor reliabilities, resulting in a slight overestimate of ap values and a corresponding underestima- Note. Mean s/S = .68. s = restricted standard deviation; S = un- tion of the generalizability of interview validities. Thus, the effect is to restricted standard deviation. make our results slightly conservative. 606 McDANIEL, WHETZEL, SCHMIDT, AND MAURER
Table 4
Analysis Results for Employment Interviews in Which the Criterion Was Job Performance: Validities of Interviews by Interview Content, Structure, and Criterion Purpose

                                                     Without range-restriction    With range-restriction
                                                           corrections                 corrections
                                    No.   Mean  Obs.
Interview distribution       N      rs     r     SD      ρ     σρ    90% CV       ρ     σρ    90% CV

All interviews             25,244  160   .20   .15     .26   .17     .04        .37   .23     .08

Interview content
  Situational                 946   16   .27   .14     .35   .08     .26        .50   .05     .43
  Job related              20,957  127   .21   .16     .28   .18     .05        .39   .24     .08
  Psychological             1,381   14   .15   .10     .20   .03     .16        .29   .00     .29

Interview structure
  Structured               12,847  106   .24   .18     .31   .20     .05        .44   .27     .09
  Unstructured              9,330   39   .18   .11     .23   .10     .10        .33   .12     .17

Criterion purpose
  Administrative           15,376   90   .19   .15     .25   .16     .04        .36   .22     .08
  Research                  5,047   65   .25   .19     .33   .20     .08        .47   .26     .13

Job-Related Interviews × Structure
  Structured               11,801   89   .24   .18     .31   .21     .04        .44   .28     .07
  Unstructured              8,985   34   .18   .11     .23   .10     .10        .33   .13     .17

Job-Related Interviews × Criterion Purpose
  Administrative           12,414   74   .21   .15     .27   .17     .06        .39   .23     .09
  Research                  3,771   49   .27   .20     .36   .21     .08        .50   .28     .14

Job-Related Structured Interviews × Criterion Purpose
  Administrative            8,155   50   .20   .16     .26   .18     .04        .37   .24     .06
  Research                  3,069   36   .28   .21     .37   .23     .07        .51   .31     .11

Job-Related Unstructured Interviews × Criterion Purpose
  Administrative            4,259   24   .22   .14     .29   .13     .12        .41   .17     .19
  Research                    531    9   .21   .13     .27   .00     .27        .38   .00     .38

Note. Obs. = observed; ρ = estimated population mean;