The Validity of Employment Interviews: a Comprehensive Review and Meta-Analysis

The Validity of Employment Interviews: A Comprehensive Review and Meta-Analysis

Michael A. McDaniel, Deborah L. Whetzel, Frank L. Schmidt, and Steven D. Maurer

This meta-analytic review presents the findings of a project investigating the validity of the employment interview. Analyses are based on 245 coefficients derived from 86,311 individuals. Results show that interview validity depends on the content of the interview (situational, job related, or psychological), how the interview is conducted (structured vs. unstructured; board vs. individual), and the nature of the criterion (job performance, training performance, and tenure; research or administrative ratings). Situational interviews had higher validity than did job-related interviews, which, in turn, had higher validity than did psychologically based interviews. Structured interviews were found to have higher validity than unstructured interviews. Interviews showed similar validity for job performance and training performance criteria, but validity for the tenure criteria was lower.

The interview is a selection procedure designed to predict 1982; Harris, 1989; Mayfield, 1964; Schmitt, 1976; Ulrich & future job performance on the basis of applicants' oral re- Trumbo, 1965; Wagner, 1949; O. R. Wright, 1969). Only 25 of sponses to oral inquiries. Interviews are one of the most fre- the 106 articles on the validity or the reliability of interviews quently used selection procedures, perhaps because of their in- located by Wagner reported any quantitative data. Wagner's tuitive appeal for hiring authorities. Ulrich and Trumbo (1965) major conclusions were that (a) quantitative research on the in- surveyed 852 organizations and found that 99% of them used terview is much needed; (b) the validity and reliability of the interviews as a selection tool. Managers and personnel officials, interview may be highly specific to both the situation and the especially those who are interviewers, tend to believe that the interviewer; (c) the interview should be confined to evaluating interview is valid for predicting future job performance. In this factors that cannot be measured accurately by other methods; article, we quantitatively cumulate and summarize research on (d) the interview is most accurate when a standardized ap- the criterion-related validity of the employment interview. proach is used; and (e) the interviewer must be skilled in elicit- Our purpose in this article is threefold. First, we summarize ing complete information from the applicant, observing sig- past narrative and quantitative reviews of criterion-related va- nificant behavior, and synthesizing developed information. lidity studies of the employment interview. Second, we report Mayfield (1964) made prescriptive statements that he consid- research that extends knowledge of the criterion-related validity ered justified by empirical findings. He concluded that typical of interviews through meta-analyses conducted on a more com- unstructured interviews with no prior data on the interviewee prehensive database than has been available to earlier investiga- are inconsistent in their coverage. He also found that interview tors. Third, we examine the criterion-related validity of differ- validities are low even in studies with moderate reliabilities, al- ent categories of interviews that vary in type and in structure. though structured interviews generally show higher interrater reliabilities than do unstructured interviews. Mayfield indi- Traditional Reviews of Interview Validity cated that individual interviewers, although consistent in their Seven major literature reviews of interview research have approach to interviewees, are inconsistent in their interpreta- been published during the past 35 years (Arvey & Campion, tion of data, perhaps because interviewers' attitudes bias their judgments and because there is a tendency for interviewers to make decisions early in the unstructured interview. Concerning Michael A. McDaniel, Department of Psychology, University of Ak- ron; Deborah L. Whetzel, American Institutes for Research, Washing- the assessment of cognitive ability, Mayfield concluded that in- ton, DC; Frank L. Schmidt, Department of Management and Organi- telligence is the human quality that may be best estimated from zations, University of Iowa; Steven D. Maurer, Department of Manage- an interview but that interviewer assessments offer little, if any, ment, Old Dominion University. incremental validity over test scores alone. When the test score John Hunter and James Russell made contributions to earlier ver- was known, the interview contributed nothing to the predictive sions of this article; their contributions are appreciated. We are also validity in a multiple-assessment procedure. Mayfield also con- indebted to many authors of primary validity studies for releasing their cluded that the interrater reliability of the interview was satis- results to us. We appreciate the substantial assistance of Allen Huffcutt factory. However, interrater reliability estimates were not based in obtaining many obscurely published interview studies and for in- sightful critiques of drafts of the article. We also appreciate the data on two independent interviews but, rather, were correlations be- provided to us by Willi Wiesner and Steve Cronshaw that permitted the tween raters in the same interview. We return to this point later. interstudy reliability analyses. On the other hand, Ulrich and Trumbo (1965) concluded Correspondence concerning this article should be addressed to Mi- that personal relations and motivation were two areas that con- chael A. McDaniel, Department of Psychology, University of Akron, tributed to most decisions made by interviewers and that these Akron, Ohio 44325-4301. two attributes were valid predictors. They concluded that the 599 600 McDANIEL, WHETZEL, SCHMIDT, AND MAURER

use of the interview should be limited to one or two content Their results, shown in Table 1, indicated that the interview is a areas as part of a sequential testing procedure. better predictor of supervisor ratings than it is of promotion, O. R. Wright (1969) summarized the research on the selec- training success, and tenure; but even for supervisory ratings, tion interview since 1964 and concluded that (a) interview de- the correlation was low (.14). Reilly and Chao obtained a higher cisions are made on the basis of behavioral and verbal cues; (b) validity (.19), but they had available only 12 coefficients. rapport between interviewer and interviewee is an important Dunnette et al. used 30 coefficients and obtained an average variable that influences the effectiveness of the interview; and validity of . 16, which is comparable to Reilly and Chao's re- (c) structured or patterned interviews are more reliable than sults. Table 1 shows the number of coefficients and average va- unstructured interviews. lidities found by each researcher, listed according to criterion. Schmitt (1976) reviewed the literature on interviews and P. M. Wright et al. (1989) conducted a meta-analysis of 13 found that nearly all of the recent studies on employment in- investigations of structured interviews. This study differed from terviews focused on factors that influenced interview decisions Hunter and Hunter's (1984) and Reilly and Chao's (1982) stud- rather than on the outcomes resulting from those decisions. Ex- ies in that more advanced quantitative cumulation methods amples of factors include research on photographs and the were used, which permitted estimation of the distribution of effects of appearance (Carlson, 1967); contrast effects (Carlson, population validity coefficients. Initially, P. M. Wright et al. 1970; Hakel, Ornesorge, & Dunnette, 1970; Wexley, Yukl, Ko- found that after correcting for criterion unreliability, the mean vacs, & Sanders, 1972); situational variables, including quota validity of the interview was .37. Placing the 95% credibility position (Carlson, 1967); and race of the interviewer (Ledvinka, interval around this mean suggested that the true validity was 1973). between —.07 and .77. Because this interval included zero, they Arvey and Campion (1982) reviewed the literature from 1975 looked for moderators and outliers. After they eliminated a to 1982 and found that there was an increase in research inves- study reporting a negative validity coefficient (—.22; Kennedy, tigating possible bias in the interview. Attention had been fo- 1986), the mean corrected validity was .39 with a 95% credibil- cused on the interaction of group membership variables and ity interval from .27 to .49. The authors noted that the interval interviewers' decision making. They also found that researchers did not include the mean validity given by Hunter and Hunter were investigating several variables that influenced the in- (.14; 1984) and stated that this indicated a real difference in the terview, such as nonverbal behavior, interviewees' perceptions predictive power of the structured interview over the tradi- of interviewers, and interviewer training. Arvey and Campion tional, unstructured interview. considered most of these studies to be microanalytic in nature, In the most comprehensive meta-analytic summary to date, studying only a narrow range of variables. They also concluded Wiesner and Cronshaw (1988) investigated interview validity as that researchers were becoming more sophisticated in their re- a function of interview format (individual vs. board) and degree search methods and strategies but had neglected the person- of structure (structured vs. unstructured). Results of this study perception literature, including attribution models and implicit showed that structured interviews yielded much higher mean personality theory. Finally, Arvey and Campion postulated rea- corrected validities than did unstructured interviews (.63 vs. sons for the continued use of the interview in spite of the previ- .20) and that structured board interviews using consensus rat- ous evidence of its low validity and reliability. ings had the highest corrected validity (.64). Most recently, Harris (1989) summarized the qualitative and quantitative reviews of interview validity and concluded that, Need for Additional Quantitative Reviews contrary to the popular belief that interviews lack validity, recent evidence suggested that the interview had at least moderate Wiesner and Cronshaw's (1988) meta-analysis of employ- validity. In addition, Harris addressed other issues related to the ment interview validities was a major integration of the litera- interview, such as decision making, applicant characteristics, ture. However, there are at least three reasons for conducting and interviewer training. additional analyses of employment interview validity. First, the employment interview is the second most frequently used ap- Previous Quantitative Reviews of Interview Validity plicant-screening device (Ash, 1981). (Reviews of resumes and employment application materials are the most common; see In the past 20 years, five quantitative reviews of interview va- McDaniel, Schmidt, & Hunter, 1988, for an analysis of that lit- lidity have been conducted (Dunnette, Arvey, & Arnold, 1971; erature.) Given the popularity of the employment interview, Hunter & Hunter, 1984; Reilly & Chao, 1982; Wiesner & Cron- more than one comprehensive review of the approach is war- shaw, 1988; P. M. Wright, Lichtenfels, & Pursell, 1989). The ranted, particularly when an expanded data set such as the one three earliest reviews may be termed quantitative because the used in this analysis is available. The two additional reasons for mean observed validity was calculated for each set of studies. further investigation rest on a comparison between the Wiesner However, they were not typical validity generalization studies and Cronshaw study and the present research. (Hunter & Schmidt, 1990), because estimates of the population Second, the present study goes beyond Wiesner and Cron- true validities were not calculated, and variance statistics shaw's (1988) contribution to summarize validity information needed for validity generalization inferences were not reported. by criterion type. Wiesner and Cronshaw did not differentiate Furthermore, these studies did not examine the validity of var- their criterion by job performance, training performance, and ious types of interviews separately. Hunter and Hunter's analy- tenure. We believe that criterion-type distinctions are impor- sis included 27 coefficients analyzed separately for four criteria: tant because validities usually vary by criterion type, and we supervisor ratings, promotion, training success, and tenure. believe that separate analyses are warranted because of the VALIDITY OF EMPLOYMENT INTERVIEWS 601

Table 1 Number of Validity Coefficients and Average Observed Validities Reported for the Interview by Type of Criterion

Researcher Criterion No. of rs Mean validity

Dunnette, Arvey, & Arnold Supervisor ratings 30 .16 (1971) Reilly&Chao(1982) Mixture of training and 12 .19 performance Hunter & Hunter (1984) Supervisor ratings 10 .14 Promotion 5 .08 Training success 9 .10 Tenure 3 .03 Wiesner & Cronshaw (1988) Performance (primarily 150 .26° supervisor ratings) Wright, Lichtenfels, & Supervisor ratings 13 .26" Pursell(1989)

' Disattenuated coefficient = .47. b Disattenuated coefficient = .39.

differences in mean reliability for the three types of criteria make a significant profit for the company (Latham, 1989). Job- (Hunter & Hunter, 1984; Lilienthal & Pearlman, 1983; Pearl- related interviews are those in which the interviewer is a person- man, 1979; Pearlman, Schmidt, & Hunter, 1980; Schmidt, nel officer or hiring authority and the questions attempt to assess Hunter, Pearlman, & Hirsh, 1985). We examined validity data past behaviors and job-related information, but most questions for training performance, tenure, and job performance criteria. are not considered situational. Psychological interviews are con- The validity results are reported separately, and appropriate cri- ducted by a psychologist, and the questions are intended to assess terion reliabilities are used. personal traits, such as dependability. Psychological interviews Third, this study examines the validity of three different types are typically included in individual assessments (Ryan & Sackett, of interview content: situational, job related, and psychological. 1989) of personnel that are conducted by consulting psycholo- Wiesner and Cronshaw (1988) did not differentiate among cat- gists or management consultants. egories of interview content. We believe that it is likely that va- Recently, some researchers have asserted the benefits of be- lidities vary as a function of the content of the information col- havioral interviews (Janz, 1982, 1989). There were too few va- lected in the interview. lidity coefficients for these studies to be analyzed separately. Be- In summary, although Wiesner and Cronshaw's (1988) meta- havioral interviews describe a situation and ask respondents analysis was a major contribution, the present study provides a how they have behaved in the past in such a situation. One significant incremental contribution to knowledge of the in- might argue that because the applicant is responding to a ques- terview in several areas. First, an extensive literature search was tion about behavior in a specific situation, the behavioral in- conducted that resulted in a larger number of validity coeffi- terviews should be classified as situational. Others might argue cients than had been examined by previous researchers. Sec- that because the interviews solicit information about past be- ond, the validity of the interview was quantitatively assessed as havior, the interviews should be classified as job related. In this a function of both interview content and structure. Third, the study, we assigned the studies to the job-related category. As validity of the interview was summarized for three categories of more validity data are accumulated, we recommend that the validity of the behavioral interview be evaluated as a distinct criteria: job performance, training performance, and tenure. category of interview content. We hypothesized that the validity of the employment in- These three categories were not completely disjointed. Situa- terview depends on three factors: (a) the content of information tional interviews are designed to be job related, but they repre- collected during the interview, (b) how interview information is sent a distinct subcategory worthy of separate analysis. Nonpsy- collected, and (c) criteria used to validate the interview. Each of chologist interviewers are interested in psychological traits such these factors is described below. as dependability, and psychologists are interested in an applicant's past work experience. Also, job-related interviews, which Content of Information Collected are not primarily situational interviews, often will include some The first factor likely to affect interview validity is the content situational questions. Still, some interviews are primarily situa- of the interview. In the present research, a distinction was drawn tional, others are primarily nonsituational and job related in between situational, job-related (but not primarily situational), content, and others are primarily psychological. We hypothe- and psychological interviews. Questions in situational interviews sized that situational interviews would prove to be more valid (Latham & Saari, 1984; Latham, Saari, Pursell, & Campion, than job-related interviews, which, in turn, we expected to be more valid than psychological interviews. 1980) focus on the individual's ability to project what his or her behavior would be in a given situation. For example, an inter- How Information Is Collected viewee may be evaluated on the choice between winning a cov- Interviews can be differentiated by the extent of their stan- eted award for lowering costs and helping a coworker who will dardization. For example, most oral board examinations con- 602 McDANIEL, WHETZEL, SCHMIDT, AND MAURER

ducted by civil service agencies in the public sector were consid- criteria as well as for tenure. No specific hypotheses were made ered to be structured because the questions and acceptable re- concerning the differences in validities between the criteria. sponses were specified in advance and the responses were rated When possible, we also drew distinctions between job perfor- for appropriateness of content. McMurry's (1947) "patterned mance criteria collected for research purposes and for admin- interview" is an early example of a structured interview. It is istrative purposes (i.e., performance ratings collected as part of described as a printed form containing specific items to be cov- an organizationwide performance appraisal process). This dis- ered and providing a uniform method of recording information tinction was based on the premise that non-performance-re- and rating the interviewees' qualifications. lated variables, such as company policies or economic factors, Unstructured interviews gather applicant information in a are more likely to influence criteria collected for administrative less systematic manner than do structured interviews. Although purposes than are those collected for research purposes. the questions may be specified in advance, they usually are not, Wherry and Bartlett (1982) hypothesized that ratings collected and there is seldom a formalized scoring guide. Also, all persons solely for research purposes would be more accurate than rat- being interviewed are not typically asked the same questions. ings collected for administrative purposes. Several studies have This difference in interview structure may affect the interview demonstrated that ratings collected for administrative purposes evaluations in two ways. First, the lack of standardization may are significantly more lenient and exhibit more halo than do cause the unstructured interviews to be less reliable than the ratings collected for research purposes (Sharon & Bartlett, structured interviews. Second, structured interviews may be 1969; Taylor & Wherry, 1951; Veres, Field, & Boyles, 1983; better than unstructured interviews for obtaining a wide range Warmke & Billings, 1979). Zedeck and Cascio (1982) also of applicant information. On the basis of these differences, we found that the purpose of the rating was an important correlate hypothesized higher validities for structured interviews. of ratings of performance. In the present study, all validity co- The distinction between structured and unstructured in- efficients for training and tenure criteria were judged to be ad- terviews is not independent from interview content. Although ministrative criteria. We hypothesized that validities would be some job-related interviews were structured and others un- higher for research criteria than for administrative criteria. structured, most of the psychological interviews were categorized as unstructured, and all of the situational interviews were Method classified as structured. Meta-Analysis as a Method of Determining Validity Another consideration for users of the employment interview is the use of board interviews (interviews with more than one We used Hunter and Schmidt's (1990, p. 185) psychometric meta- rater in the same interview) versus individual interviews (one analytic procedure to test the proposed hypotheses. This statistical technique estimates how much of the observed variation in results across rater per interview). Because board interviews have higher ad- studies is due to statistical and methodological artifacts rather than to ministrative costs than individual interviews, they must be substantive differences in underlying population relationships. Some of more valid if they are to be cost-effective. Several researchers these artifacts also reduce the correlations below their true (e.g., popu- (Arvey & Campion, 1982; Pursell, Campion, & Gaylord, 1980) lation) values. Through this method, the variance attributable to sam- have suggested that board interviews may be more valid than pling error and to differences between studies in reliability and range individual interviews. Wiesner and Cronshaw (1988) have pro- restriction is determined, and that amount is subtracted from the total vided the most extensive analysis of the validity of these two amount of variation, yielding estimates of the true variation across stud- types of interviews. For unstructured interviews, board in- ies and of the true average correlation. We used artifact distribution terviews were more valid than individual interviews (mean ob- meta-analysis with the interactive method (Hunter & Schmidt, 1990, chap. 4). The mean observed correlation was used in the sampling error served coefficients = .21 vs. .11, respectively). However, for variance formula (Hunter & Schmidt, 1990, pp. 208-210; Law, structured interviews, board interviews and individual in- Schmidt, & Hunter, 1994; Schmidt etal., 1993). The computer program terviews were similar in validity (mean observed coefficients = that we used is described in McDaniel (1986). Additional detail on the .33 vs. .35, respectively). We made no hypotheses concerning program is presented in Appendix B of McDaniel et al. (1988). differences in validity between board and individual interviews. In our analyses, we corrected the mean observed validity for mean Another issue to consider is the interviewers' access to cogni- attenuation due to range restriction and criterion unreliability (Hunter tive ability test scores. If interviewers have access to test scores, & Schmidt, 1990, p. 165). Although employment interviews used in the final interview evaluations should have higher validities than selection have less than perfect reliability, the mean validity was not if the interviewer did not have this information. This would be corrected for predictor unreliability because the goal was to estimate expected because the interviewer would have access to more cri- the operational validity of interviews for selection purposes. However, the observed variance of validities was corrected for variation across terion-relevant information. studies in interview unreliabilities. In addition, the variance of the observed validities was corrected for variation across studies in criterion unreliabilities and range-restriction values. For comparison purposes, Criteria Used to Validate the Interview we also reported analyses in which no range-restriction corrections were made in either the mean or variance of the interview distributions. The In past validity generalization studies (Hirsh, Northrop, & rationale for conducting the analyses with and without range-restriction Schmidt, 1986; Hunter & Hunter, 1984;Lilienthal&PearIman, corrections is detailed later under Artifact Information. 1983; Pearlman, 1979; Pearlman et al., 1980), validity has been found to vary depending on whether job performance measures Literature Review or training performance measures served as the criteria. There- We conducted a thorough search for validity studies, starting with the fore, we examined the validity of the interview for both of these database of validity coefficients collected by the U.S. Office of Personnel VALIDITY OF EMPLOYMENT INTERVIEWS 603

Management (Dye, 1988). In addition, we examined references from data set was available in Johnson and McDaniel (1981). We did not the five literature reviews cited above to determine if the articles con- include data from Drucker (1957) because the same data appeared to tained validity coefficients. Although this literature search yielded more be reported in Parrish, Klieger, and Drucker (1955) and in Rundquist validity coefficients than had been assembled by other reviewers (1947). We did not include data from Reeb (1968) because it appeared (Dunnette et al., 1971; Hunter & Hunter, 1984; Reilly & Chao, 1982; that the same data were reported in Reeb (1969). Finally, we did not Wiesner & Cronshaw, 1988; P. M. Wright et al., 1989), it is likely that include data from Glaser, Schwartz, and Flanagan (1956) because these we did not obtain all existing validity coefficients. However, this search, data were duplicated in Glaser, Schwartz, and Flanagan (1958). extending over a period of 8 years, is perhaps the most complete search Consistent with the decision rules of past validity generalization stud- ever conducted for interview validities. The references for the studies ies (Rothstein & McDaniel, 1989), we did not include incomplete data are listed in the Appendix. from sparse matrices. For example, studies reporting only significant correlations were omitted because capitalization on chance would cause Decision Rules these data to bias the mean validities upward. If the interview was conducted as part of an assessment center or a We used certain decision rules for selecting relevant studies for this similar multiple assessment procedure (so that the ratings could have meta-analysis. First, only studies using measures of overall job perfor- been affected by performance in other exercises), the data were excluded mance, training performance, or tenure as criteria were included. For from our analyses. This eliminated data from several studies (e.g., example, Rimland's (1958) study was not included in this analysis be- Dunnette, 1970), including a set of studies based on data obtained from cause an attitude scale, the Career Intention Questionnaire, was the only the British civil service system (Anstey, 1966, 1977; Castle & Garforth, criterion. Another study excluded was Tupes's (1950), in which the cri- 1951; Gardner* Williams, 1973; Handyside& Duncan, 1954;Vernon, terion was the combined judgment of three clinicians who studied each 1950; Wilson, 1948). These civil service studies incorporated the in- subject for 7 days and made final ratings of psychological material gath- terview into assessment-center-like screening that sometimes lasted up ered during the assessment. Also excluded were studies in which the to a week (Gardner & Williams, 1973). Note that the data from interview attempted to predict intelligence test scores and not job per- Handyside and Duncan were excluded because only corrected coeffi- formance, training performance, or tenure. Data from Putney (1947) cients were reported and insufficient data were described to permit esti- were excluded because the criterion contrasted the training perfor- mation of the unconnected coefficients. mance of those selected by an interview with those who were not Kennedy (1986) reported several studies that contained two-part in- screened, and we did not consider this an interview validity coefficient. terviews. One of the parts was classified as structured by our decision When the criterion was on-the-job training, the distinction between rules and the other as situational. In two of the studies, validities were job performance and training performance was not clear. For purposes reported separately for each part. In these cases, we assigned one co- of this meta-analysis, on-the-job training—when the employee was per- efficient to the structured category and the other to the situational cate- forming the job but was considered to be in training—was coded as job gory. performance. On the other hand, results of classroom or other formal instruction were uniformly coded as training performance. We coded the retained studies by using guidelines established for a We also omitted studies in which there was no actual interview, as larger validity generalization project (McDaniel & Reck, 1985). the interview is traditionally understood. For example, the findings of Content of the interview was categorized into three groups: situational, Campbell, Otis, Liske, and Prien (1962) were excluded because they job related, and psychological. Information on how the interview was reported the validity of a summary report made by a person who had conducted was summarized in three sets of categories: (a) structured access to interview scores assigned by someone else. Similarly excluded and unstructured, (b) board and individual, and (c) test information were the studies by Grant and Bray (1969) and by Hilton, Bolin, Parker, available to the interviewer and no test information available to the in- Taylor, and Walker (195 5), in which the raters did not conduct the actual terviewer. Information on the nature of the criterion was summarized interviews but derived scores by reviewing narrative summaries of in- into two sets of categories: (a) criterion content (job performance, train- terview reports. We also excluded the findings of Trankell (1959), who ing performance, or tenure) and (b) purpose for which the criterion was used standardized tests to measure such traits as panic resistance and collected (administrative or research). Because these categories could sensitivity; those of Dicken and Black (1965), who reported the validity be correlated, we used a hierarchical moderator analysis to assess the of clinical interpretations of an objective test battery; and those of Den- potential confounding of analyses due to correlation and interaction. In ton (1964), who studied "interview-type data" that consisted of written a fully hierarchical moderator analysis, the data set of correlations is responses to questions. Data from Miles, Wilkins, Lester, and Hutchens broken down by one key moderator variable first, and then, within each (1946) were excluded because three individuals were screened per min- subgroup, subsequent moderator analyses are undertaken one by one in ute. Data based on enlisted military personnel presented in Bartlett a hierarchical manner (Hunter & Schmidt, 1990, p. 527). Although the (1950) were not included because most of the interviews lasted for only coding scheme permitted a variety of detailed analyses, there were far a few minutes. too few interview validity coefficients to conduct a full hierarchical anal- For our purposes, the individuals being interviewed were required to ysis. Partial hierarchical analyses were conducted when the number of be employees or applicants for a job. We excluded data from Anderson available coefficients permitted a meaningful analysis. For example, hi- (1954) because those interviewed were doctoral candidates. Data from erarchical analyses were most fully implemented in studies using job Dann and Abrahams (1970), Newmann and Abrahams (1979), Walsh performance criteria, much less so in studies using training perfor- (1975), and Zaccaria et al. (1956) were excluded because the interview- mance criteria, and not at all with validities using tenure criteria. ees were college students. Psychiatric interviews were also omitted from When more than one coefficient from a sample was available, we in- this analysis when the interviewee was not a job applicant or employee. voked decision rules to include only one of the coefficients. A common Examples of excluded studies include Matarazzo, Matarazzo, Saslow, reason for multiple coefficients from a sample was that more than one and Phillips (1958) and Hunt, Herrmann, and Noble (1957), which in- job performance criterion, each using a different measurement method volved psychiatric interviews conducted at mental hospitals. (e.g., supervisory ratings or production data), was available. In this cir- The same data were often reported in more than one study. For ex- cumstance, the coefficient using a supervisory rating was retained. In ample, we did not include data from Felix, Cameron, Bobbin, and New- some studies, coefficients using more than one training performance man (1945) because these data were duplicated in Bobbin and Newman criterion, each using a different measurement method (e.g., instructor (1944). We did not code data from Johnson (1979) because an expanded ratings or job knowledge test), were available. In this case, we retained 604 McDANIEL, WHETZEL, SCHMIDT, AND MAURER the coefficient based on the job knowledge test criterion. When the following: .50x + .66( 1 - x) = .60. The .60 is the mean reliability figure choice was between validities based on a single criterion measurement we used; x is the proportion of studies in which we have (in effect) as- or on a composite measurement (e.g., a composite criterion based on sumed there is only 1 rater (reliability = .50); and 1 —x is the proportion written test scores and an instructor rating), we retained the coefficient of studies in which we have (by assuming that mean reliability = .60) based on the composite criterion. When multiple coefficients were due assumed there were two raters. The .66 is the reliability of ratings based solely to differences in the amount of time between the collection of the on two raters, as determined by the Spearman-Brown formula. Solving predictor and criterion data, coefficients based on longer time periods for x finds x = .375, and 1 - x= .625. Thus, on the basis of Rothstein's were chosen over those based on shorter or unknown time periods. In- findings, our mean of .60 assumes that one rater was used in 37.5% of terviews in which the interviewer had access to test scores were pre- the studies and two raters were used in 62.5% of the studies. On the basis ferred. Interviews conducted by one interviewer were chosen over those of our experience and our reading of the literature, we believe only a conducted by multiple interviewers. The above decision rules, involving minority of studies use ratings by two raters. Thus, we have actually choices among criterion measurement methods, enhance the general- overestimated the reliability of job performance ratings. ization of the results to the most frequently occurring criterion mea- No estimates of the reliability of tenure were available, leading, by surement methods and are consistent with the distributions of criterion default, to operational estimates of 1.00; this assumption of perfect re- reliabilities used in this analysis. liability leads to conservative (lower bound) validity estimates for the criterion of tenure. We reviewed the literature for studies containing information on the Reliability of Coding reliability of interviews. Some reliability coefficients were obtained When coding studies for validity generalization analyses, the reliabil- from studies that reported validity coefficients. Most of the reliability coefficients, however, came from studies that did not report validity ity of interest is that between coders of studies used in the meta-analysis. data. The desired interview reliability coefficient is obtained when ap- Previous research (Whetzel & McDaniel, 1988) has indicated that such coding was very reliable and that independent investigators, coding the plicants are interviewed twice, each time by a different interviewer, and the two evaluations are correlated. Such a reliability coefficient captures same data, record the same values and obtain the same results. In the two types of measurement error: error due to the applicant giving present study, we coded data from 59 studies in common with Wiesner different answers in the two interviews (temporal instability) and error and Cronshaw's (1988) study. The correlation between these two data due to the interviewers not agreeing in their ratings of the applicant sets was .89 for the correlation coefficients and .97 for the sample sizes. (scorer or conspect unreliability). However, the literature (remarkably) A more compelling comparison of the coding reliabilities examines the contained only one coefficient of this type: a value of .64 found by Mo- meta-analytic results conducted separately on the coefficients coded towidlo et al. (1992), based on 37 interviewees. All other reliability esti- from these two sets of data. When we corrected only for sampling error, mates found represented interviewer reliability: The applicants partici- the results indicated that the two sets yielded very similar mean ob- pated in one interview with two or more interview raters present, and served coefficients (.29 vs. .26) and standard deviations (.142 vs. .164). these evaluations were correlated. Thus, these reliability coefficients do The lack of perfect agreement in the coding of correlation coefficients not reflect the measurement error represented by applicant changes in can be attributed to differences in coding decision rules. For example, behavior or answers in different interviews (temporal instability). when separate validity coefficients were available for each dimension of Therefore, these interviewer reliability coefficients overestimate the re- an interview (e.g., interpersonal skills or job knowledge), Wiesner and liability of the interview. Cronshaw coded the coefficient for the overall or summary dimension. This is not a problem in the present analysis because the corrections We coded the coefficient between a composite of the individual dimen- to the validity distributions depend on the variance of the predictor re- sions and the criterion. These results support the assertion that the data liability distribution, not on its mean. That is, validities were not cor- in the present study were accurately coded. rected for attenuation due to interview unreliability; the only correction was for variance in validities across studies due to variability in in- Artifact Information terview reliabilities, in accordance with standard practice in validity generalization studies. Enough data were available to assemble reliabil- The studies contained little information on the reliability of the job ity distributions for psychological, structured job-related, and unstruc- performance and job training criteria. Therefore, the criterion reliabil- tured job-related interviews. Descriptive statistics for these distribu- ity distributions used by Pearlman (1979) were used in this study (aver- tions are shown in Table 2. The mean reliability of the unstructured age criterion reliabilities of .60 and .80 for job performance and training criteria, respectively). A reviewer of this article asserted that our use of a mean criterion reliability of .60 overcorrected the observed coefficients Table 2 because for some studies the criterion consisted of a composite of rat- Descriptive Statistics for the Interviewer ings supplied by more than one supervisor. Whereas we suspected that Reliability Distributions the concern of the reviewer may be shared by others, we sought to fully address this issue. Reliability We cannot agree that the mean reliability of .60 for job performance No. of ratings is an underestimate. Rothstein (1990) found that across 9,975 Reliability distribution coefficients M Mdn SD employees and across all time periods of supervisory exposure to employees, the mean interrater agreement (reliability for one rater) was Distributions from the literature .48. The mean of .60 applies to situations in which supervisors had Psychological 25 .73 .74 .11 about 20 years to observe employee performance. Even if we round the Job related unstructured 20 .68 .78 .25 .48 figure up to .50, we find that by using a mean reliability of .60, we Job related structured 167 .84 .89 .15 have allowed for the use of two raters in 62.5% of the studies. This is undoubtedly a much higher percentage of the studies than, in fact, had Pooled distributions two raters. That is, the .60 figure is indeed almost certainly an overesti- All job related 187 .82 .87 .17 mate, and therefore the reliability corrections were undercorrections. All reliabilities 212 .81 .87 .17 Below we present our analysis supporting the figure of 62.5%. Note the VALIDITY OF EMPLOYMENT INTERVIEWS 605 distribution was ,68, whereas the mean reliability of structured in- The administrative versus research purpose analyses were terviews was .84. The pooled distribution labeled all job related is the conducted solely for the job performance criteria because all combination of unstructured and structured job-related distributions. training and tenure criteria were classified as administrative The distribution labeled all reliabilities is the combination of the psy- 1 criteria. chological and the two job-related distributions. The first column of data in each table identifies the distribu- Most observed validity coefficients are attenuated because of range tion of validities analyzed. The next four columns present the restriction in the predictor. This restriction results from employees be- total sample size, the number of validity coefficients on which ing selected on the basis of the predictor or on the basis of another selection method that is correlated with the predictor. Range restriction is each distribution was based, and the uncorrected mean and indexed as u = s/S, where i is the restricted standard deviation and Sis standard deviation of each distribution. The next three columns the unrestricted standard deviation. Although information on the level present the estimated population mean (p), the estimated popu- of range restriction in employment interviews was not provided in most lation standard deviation (af), and the 90% credibility value for studies, we obtained 15 estimates of range-restriction information. At the distribution of true validities. The population distribution the suggestion of a reviewer, we dropped an outlier in our range-restric- estimates are for distributions in which the mean true validities tion distribution, reducing our distribution of range-restriction statis- are corrected only for unreliability in the criterion, not predic- tics to 14 observations. These data are presented in Table 3. The mean tor unreliability. We corrected the variances of the true validity value of .68 and the standard deviation of .16 for these u values are distributions for sampling error and for differences among the very similar to those from validity studies of aptitude and ability tests (Alexander, Carson, Alliger, & Cronshaw, 1989; Schmidt et al., 1985). studies in predictor and criterion reliability. Data in the last They are also very similar to the interview range-restriction data re- three columns of Tables 4-8 are the results of range-restriction ported by Huffcutt (1992). Although the range-restriction data appear corrections being added to the corrections discussed above. reasonable, the distribution is based on only 14 observations and, thus, Thus, in these last three columns, the mean true validity is cor- might not be representative of the entire literature. Therefore, we con- rected for unreliability in the criterion and range restriction, ducted all analyses twice, once with range-restriction corrections and and the variance is corrected for sampling error and for differ- once without. The validity values obtained without correction for range ences among studies in predictor reliability, criterion reliability, restriction, however, must be regarded as lower bound values (i.e., down- and range restriction. wardly biased). The results from analyses that include range-restriction corrections were the most accurate estimates of the population va- Results lidity distributions. The results based on analyses not including The results are organized to address three categories of char- range-restriction corrections yielded mean validity estimates acteristics that might covary with interview validity. Each cate- that were lower bound (downwardly biased) estimates of valid- gory contains one or more subcategories of characteristics: (a) ity. Therefore, in the following text, we focus our discussion on content of the interview—situational versus job-related (non- the results that included range-restriction corrections. situational) versus psychological; (b) how the interview is con- Table 4 shows results detailing how interview validity for job ducted—structured versus unstructured, board versus individ- performance criteria covaries with interview content, interview ual, test information available to the interviewer versus no test structure, and purpose for which the criteria are collected. Ta- information available; (c) the nature of the criterion—job per- ble 5 shows the same analyses for training performance criteria. formance versus training performance versus tenure and ad- Table 6 shows individual and board interviews for job perfor- ministrative versus research purpose for collection of the mance criteria. Table 7 allows a comparison of the validity of criterion. interviews for job performance criteria where the interviewer

1 After this article was accepted, J. Conway (personal communica- Table 3 tion, March 15, 1994) alerted us to a group of reliability coefficients Range-Restriction Information for the Interview obtained under conditions such that each applicant was interviewed first by one interviewer and then, on another occasion, was interviewed s/S Frequency by a second interviewer. The mean of the 41 reliability coefficients was .52, a much lower value than the .81 average in Table 2. This mean .452 difference has no implications for the results of this meta-analysis be- .491 .494 cause, as noted before, we do not correct for mean predictor unreliabil- .505 ity. However, it is of interest from a more general viewpoint. This finding .616 means that, on average, 29% of the variance in interview scores—(.81 — .651 .52) X 100—is due to transient error, that is, to occasion-to-occasion .686 instability in applicant responses. Only 19% of the variance—(1.00 — .689 .81) X 100—is due to disagreement between interviewers observing the .696 same interview at a single time. That is, only 19% is due to conspect .828 unreliability. The standard deviation of Conway's reliability coefficients .829 (.23) was larger than that reported in our Table 2 (. 17). This means that .830 .833 our analysis can be expected to somewhat underestimate the variance .962 in validities that is due to variability in predictor reliabilities, resulting in a slight overestimate of ap values and a corresponding underestima- Note. Mean s/S = .68. s = restricted standard deviation; S = un- tion of the generalizability of interview validities. Thus, the effect is to restricted standard deviation. make our results slightly conservative. 606 McDANIEL, WHETZEL, SCHMIDT, AND MAURER

Table 4 Analysis Results for Employment Interviews in Which the Criterion Was Job Performance: Validities of Interviews by Interview Content, Structure, and Criterion Purpose

Without range- With range- restriction corrections restriction corrections

No. Mean Obs. 90% Interview distribution N rs r cv p a. CV

All interviews 25,244 160 .20 .15 .26 .17 .04 .37 .23 .08 Interview content Situational 946 16 .27 .14 .35 .08 .26 .50 .05 .43 Job related 20,957 127 .21 .16 .28 .18 .05 .39 .24 .08 Psychological 1,381 14 .15 .10 .20 .03 .16 .29 .00 .29 Interview structure Structured 12,847 106 .24 .18 .31 .20 .05 .44 .27 .09 Unstructured 9,330 39 .18 .11 .23 .10 .10 .33 .12 .17 Criterion purpose Administrative 15,376 90 .19 .15 .25 .16 .04 .36 .22 .08 Research 5,047 65 .25 .19 .33 .20 .08 .47 .26 .13 Job-Related Interviews X Structure Structured 11,801 89 .24 .18 .31 .21 .04 .44 .28 .07 Unstructured 8,985 34 .18 .11 .23 .10 .10 .33 .13 .17 Job-Related Interviews X Criterion Purpose Administrative 12,414 74 .21 .15 .27 .17 .06 .39 .23 .09 Research 3,771 49 .27 .20 .36 .21 .08 .50 .28 .14 Job-Related Structured Interviews X Criterion Purpose Administrative 8,155 50 .20 .16 .26 .18 .04 .37 .24 .06 Research 3,069 36 .28 .21 .37 .23 .07 .51 .3! .11 Job-Related Unstructured Interviews X Criterion Purpose Administrative 4,259 24 .22 .14 .29 .13 .12 .41 .17 .19 Research 531 9 .21 .13 .27 .00 .27 .38 .00 .38

Note. Obs. = observed; p = estimated population mean;

Table 5 Analysis Results for Employment Interviews in Which the Criterion Was Training Performance: Validities of Interviews by Interview Content and Structure Without range- With range-restriction restriction corrections corrections

No. Mean Obs. 90% 90% CV Interview distribution N rs r a p °, CV p «f

All interviews 59,844 75 .23 .09 .26 .09 .14 .36 .09 .24 Interview content Job related" 51,152 56 .22 .08 .25 .08 .14 .36 .09 .24 Psychological 8,376 15 .25 .11 .28 .11 .14 .40 .14 .22 Job-Related Interviews X Structure Structured 3,576 26 .21 .12 .24 .09 .12 .34 .11 .20 Unstructured8 47,576 30 .23 .08 .25 .06 .17 .36 .05 .29 Note. Obs. = observed; p = estimated population mean; a, = estimated standard deviation; 90% C V = 90% credibility value for the distribution of true validities. • One coefficient in this distribution, obtained from Bloom & Brundage (1947), had a sample size of 37,862 and an observed validity coefficient ot .22. VALIDITY OF EMPLOYMENT INTERVIEWS 607

Table 6 Analysis Results for Board and Individual Employment Interviews in Which the Criterion Was Job Performance

Without range- With range-restriction restriction corrections corrections

No. Mean Obs. 90% 90% Interview distribution N rs r a P <*, CV P aP CV

All interviews Individual interviewer 11,393 90 .24 .18 .31 .20 .05 .43 .26 .09 Board interview 11,915 54 .17 .12 .22 .13 .06 .32 .17 .11 Individual Interviews X Structure Structured 8,944 61 .25 .19 .33 .22 .05 .46 .29 .08 Unstructured 1,667 19 .18 .11 .24 .00 .24 .34 .00 .34 Board Interviews X Structure Structured 2,785 35 .20 .02 .26 .13 .09 .38 .17 .15 Unstructured 6,961 15 .18 .11 .23 .12 .08 .33 .16 .13

Note. Obs. = observed; p = estimated population mean; af = estimated standard deviation; 90% CV = 90% credibility value for the distribution of true validities.

interview validity: content of the interview, how the interview is terviews (.36) is somewhat lower than the mean validity of psy- conducted, and the nature of the criterion. chological interviews (.40).

Content of the Interview How the Interview Is Conducted The analyses focusing on interview content examined in- The analyses of whether validity varies with how the interview validity across three types of interview content: situa- terview is conducted addressed three questions: Are structured tional, job related, and psychological. For job performance cri- interviews more valid than unstructured interviews? Are board teria (Table 4), situational interviews yield a higher mean valid- interviews more valid than individual interviews? and Are in- ity (.50) than do job-related interviews (.39), which yield a terviews more valid when the interviewer has access to cognitive higher mean validity than do psychological interviews (.29). For test scores? Each of these questions is addressed in turn. training data (Table 5), the mean validity of job-related in- Are structured interviews more valid than unstructured in-

Table 7 Analysis Results for Employment Interviews in Which the Criterion Was Job Performance: Effects of Test Information on Validities

Without range- With range- restriction restriction corrections corrections No. Mean Obs. 90% 90% Test available? N rs r a CV P °f P ", CV Yes 2,196 19 .14 .11 .18 .06 .10 .26 .07 .17 No 6,843 47 .25 .19 .32 .21 .04 .45 .29 .08 Unknown 16,205 94 .19 .14 .25 .15 .05 .35 .20 .09

Structured interviews Yes 1,031 9 .09 .11 .11 .08 .01 .16 .11 .02 No 4,865 36 .22 .20 .29 .23 -.01 .40 .32 -.01 Unknown 6,951 61 .27 .16 .35 .17 .14 .50 .22 .22

Unstructured interviews Yes 433 5 .18 .06 .24 .00 .24 .34 .00 .34 No 1,854 9 .32 .12 .41 .07 .31 .57 .00 .57 Unknown 7,043 25 .14 .07 .18 .02 .16 .26 .08 .26

Note. Obs. = observed; p = estimated population mean; ap = estimated standard deviation; 90% CV = 90% credibility value for the distribution of true validities. 608 McDANIEL, WHETZEL, SCHMIDT, AND MAURER

Table 8 Analysis Results for Employment Interviews: Comparison of Criterion Categories

Without range- With range- restriction restriction corrections corrections

Criteria No. Mean Obs. 90% 90% distribution rs r a CV N P ot CV P °f Job performance 25,244 160 .20 .15 .26 .17 .05 .37 .23 .08 Training performance 59,844 75 .23 .09 .26 .09 .14 .36 .09 .24 Tenure 1,223 10 .12 .17 .13 .17 -.08 .20 .24 -.11

Note. Obs. = observed; p = estimated population mean; af = estimated standard deviation; 90% CV = 90% credibility value for the distribution of the true validities. terviews? Structured interviews, regardless of content, are The Nature of the Criterion more valid (.44) than unstructured interviews (.33) for predict- ingjob performance criteria (Table 4). When the content of the Analyses concerning the influence of the nature of the crite- interview is job related, structured interviews are still more rion on the validity of the interview focused on two questions: valid (.44) than unstructured interviews (.33). However, when Does validity vary between job performance, training perfor- the criterion is training performance, the validity of unstruc- mance, and tenure criteria? and Does validity vary as a function tured and structured interviews is similar (.34 and .36, respec- of whether the criteria are collected for administrative or re- tively), as shown in Table 5. search purposes? Each of these questions is addressed in turn. It should be emphasized that to obtain a correlation between Does validity vary between job performance, training perfor- interviews and a criterion, interviewers must use a rating instru- mance, and tenure criteria? A comparison of Tables 4 and 5 ment, even for unstructured interviews. Therefore, it is likely indicates that interviews are similar in predictive accuracy for that the unstructured interviews included in this review were job performance (.37) and training performance (.36). This pat- more structured than those typically conducted in applied set- tern of validities is in contrast with cognitive ability research tings. This suggests that the validity of most unstructured in- showing that training performance is more highly predicted terviews used in practice may be lower than the validity found in than are supervisory ratings of job performance (e.g., Lilienthal this study. However, to the extent that unstructured interviews &Pearlman, 1983;Pearlman, 1979). However, when job perfor- resemble those studied here, their validity should approximate mance is measured with content-valid job sample tests, cogni- the present results. tive aptitude and ability tests reveal a pattern similar to that Are board interviews more valid than individual interviews? noted here for interviews, in that they have been found equally In board interviews, multiple interviewers provide ratings in valid for job performance and training performance criteria one setting (Warmke & Weston, 1992; Weston & Warmke, (Schmidt et al., 1985; Schmidt, Ones, & Hunter, 1992). Table 8 1988). Although board interviews typically are more costly shows the validity of the interview for job performance, training than individual interviews (because they are more labor-inten- performance, and tenure. This table indicates that tenure is less sive), board reviews are likely to be more reliable because more predictable by the interview (.20) than are job and training per- than one person provides ratings. Table 6 shows results of meta- formance. In the absence of a feasible method for assessing the analyses comparing the validity of individual and board in- reliability of tenure, it was assumed by default to be perfect, terviews for job performance criteria. When all interviews are leading to a lower bound validity estimate. As in other analyses considered together, individual interviews appear more valid involving few coefficients, the tenure results are best viewed as than board interviews (.43 vs. .32). When interviews are differ- tentative pending further study. entiated by structure, the results are similar. Individual in- Does validity vary as a function of whether the criteria were terviews are more valid both when they are structured (.46 vs. collected for administrative or research purposes? Job perfor- .38) and when they are unstructured (.34 vs. .33). mance criteria may be collected for administrative purposes or Are interviews more valid when the interviewer has access to for research purposes only. As shown in Table 4, the mean va- cognitive test scores? Table 7 shows results of analyses com- lidity with research criteria is .47, in comparison with .36 for paring the validity of interviews in which interviewers had ac- administrative criteria. This pattern of findings held for job- cess to cognitive ability test scores with the validity of interviews related interviews (.50 vs. .39) and job-related structured in- where no test data were available. Results showed that an interviews (.51 vs. .37), but not for job-related unstructured interviewer's access to test scores appears to decrease validity for terviews (.38 vs. .41). This contradiction to the general trend predicting job performance. Additional analyses showed that may be due to the relatively low number of coefficients in the this is true for both structured and unstructured interviews. distribution of job-related unstructured interviews with re- However, for most coefficients, it was unknown whether the in- search criteria. Validities obtained based on criterion measures terviewer had access to test scores. We suggest that conclusions collected only for research purposes are typically larger than are tentative and await future research. those based on criterion measures collected for administrative VALIDITY OF EMPLOYMENT INTERVIEWS 609 purposes. Because the latter are more likely to contain biases purposes generally yielded higher validities than those using ad- and contaminants, we conclude that the validity estimates ministrative criteria, indicating that validity estimates com- based on the research criteria are the more accurate and that puted with administrative criteria are underestimates. those based on administrative criteria are substantially down- Several findings are based on distributions containing few wardly biased. studies. Distributions with few coefficients have greater potential for second-order sampling error, which can distort true va- Summary, Conclusions, and Implications for Future lidity variance estimates and, to a lesser extent, can distort true Research mean validity estimates (Hunter & Schmidt, 1990, chap. 9; Schmidt et al., 1985, Question and Answer number 25). There- Validity may be concluded to be generalizable if the value at fore, these meta-analyses should be rerun in the future as more the lower 10th percentile of the distribution of estimated true studies become available. This caveat, however, should not validities is greater than zero (Callender & Osburn, 1981). This overly restrict conclusions drawn from our findings. Although definition of validity generalizability is directly analogous to sig- some distributions contain fewer coefficients and smaller total nificance testing: A correlation is statistically significant when sample sizes than desired, the impact of the interview and cri- the lower bound of its confidence interval is above zero. By this terion characteristics is largely consistent across the various criterion, almost all distributions (35 of 37) in Tables 4 through analyses. This consistency allows one to place more confidence 8 show validity generalization. in the findings. However, some cautionary statements appear to be appropri- This meta-analysis of employment interview validities is ate with respect to the unstructured interview. As noted earlier, different from past meta-analyses of ability constructs (e.g., ver- the data summarized in this study come from employment in- bal ability) in two major ways. The first concerns the differen- terviews that are likely to be different from those normally con- tiation of constructs. The employment interview is a measure- ducted. The typical interview in the private sector is likely to be ment method, as is a paper-and-pencil test. When one meta- substantially less structured than the unstructured interviews analytically summarizes the validity of paper-and-pencil tests, used in this study because private sector interviews typically one conducts the analyses separately for the different constructs do not use scoring guides or result in quantifiable evaluations. measured by the tests. Separate meta-analyses are performed Therefore, the validity estimates obtained in this study for un- because the construct distinctions are considered meaningful structured interviews might overestimate the operational valid- in their own right and because different constructs may have ity of many unstructured interviews used in business and in- different correlations with performance. Like paper-and-pencil dustry. Those wishing to support their present interview prac- tests, employment interviews may measure different constructs tices on the basis of this study should review the cited primary (e.g., cognitive ability, interpersonal skills, and manifest motiva- studies to determine the extent to which their interview is sim- tion). However, separate analyses for different constructs are not ilar to those analyzed in the present research. possible because of the dearth of employment interview validi- Dreher, Ash, and Hancock (1988) took the opposite position. ties reported for separate constructs. They concluded that most interview validity studies have un- The second difference between this study and other meta-an- derestimated the validity of interviews because researchers have alytic reviews is that there is more variability in how interview usually failed to consider that individual interviewers differ in data are collected than there is in how ability data are collected. their ability to provide accurate predictions of employee behav- Although paper-and-pencil measures of a given ability may vary ior and in their tendencies to make favorable or unfavorable slightly (e.g., they may use different item types), the measure- ratings. Dreher et al. argued that the procedure of collapsing ment process used for gathering data for a given ability is very data across multiple interviewers results in underestimates of similar. In contrast, employment interviews vary widely in data the validity of interviews. To the extent that this effect is operat- collection processes. Some interviewers follow procedures pre- ing, it represents a downward bias in all of our validity esti- scribed by their organization or by authors of how-to-interview mates, including those for the unstructured interview. Despite publications, whereas others have no predetermined agenda. this fact, it seems likely that the short, casual, conversational We have attempted to address these process distinctions by an- interviews one often encounters in business and industry will alyzing the data separately for structured and unstructured in- have mean validities lower than those reported in this study for terviews. Although this distinction is a meaningful one for many the unstructured interview. interview developers and researchers (Asher, 1971; Hoeschen, In this article, we have summarized the validity of various 1977; Mayfield, 1964; McMurry, 1947; Tupes, 1950; Wagner, types of interviews. For job performance criteria, situational in- 1949; O. R. Wright, 1969), the resulting interview categories terviews yield the highest mean validity, followed by job-related used in the present research are not perfectly homogeneous. For and psychological interviews. However, for training perfor- example, some structured interviews are more structured than mance, psychological interviews were similar in validity to others. (nonsituational) job-related interviews. The validity of the in- These differences between the present research and most past terview was found to vary by interview structure: Structured validity generalization studies suggest that the present research interviews yielded higher validities than did unstructured in- has less control over the constructs measured and the measure- terviews. Validity was similar for job performance criteria and ment process. These two sources of uncontrolled variance affect training performance criteria. Validity was lowest for tenure cri- meta-analytic findings in that they increase the apparent situa- teria, but inability to correct for criterion unreliability may ex- tional specificity and reduce validity generalizability. Hence, plain this finding. Studies using criteria collected for research conclusions based on such analyses tend to be conservative; they 610 McDANIEL, WHETZEL, SCHMIDT, AND MAURER

overestimate situational specificity and underestimate validity it may be possible to develop more reliable measurement meth- generalizability. ods for the underlying constructs (Hunter & Hunter, 1984). An The data and results presented in this article contrast some- important question for future researchers is the extent to which what with some traditional beliefs about the validity of the in- the interview can contribute incrementally over measures of terview. In our experience, many personnel psychologists be- cognitive ability, or vice versa. lieve that the interview generally has low validity. This was our belief also, before we undertook this study. This belief was sup- References ported by the findings of Reilly and Chao (1982), Hunter and Alexander, R. A., Carson, K. P., Alliger, G. M., & Cronshaw, S. F. (1989). Hunter (1984), and Dunnette et al. (1971). Because our conclu- Empirical distributions of range restricted SD* in validity studies. sions are contrary to much professional opinion and many pre- Journal of Applied Psychology, 74, 253-258. vious quantitative reviews (except for Wiesner & Cronshaw, Anderson, R. C. (1954). The guided interview as an evaluative instru- 1988, and P. M. Wright et al., 1989), our results should be ex- ment. Journal of Educational Research, 48, 203-209. amined carefully. The disparities between most past reviews Anstey, E. (1966). The civil service administrative class and the diplo- and the present review can be explained, at least in part, by the matic service: A follow-up. Occupational Psychology, 40, 139-151. effects of interview type and structure characteristics and by the Anstey, E. (1977). A 30-year follow-up of the CSSB procedure, with lessons for the future. Journal of Occupational Psychology, 50, 149- effects of criterion content and purpose. 159. The meta-analysis findings for the validity of the interview Arvey, R. D., & Campion, J. E. (1982). The employment interview: A reported here contrast in an interesting way with those reported summary and review of recent research. Personnel Psychology, 35, for evaluations of education, training, and experience by Mc- 281-322. Daniel et al. (1988). In the case of both the interview and train- Ash, R. A. {1981). Comparison of four approaches to the evaluation of ing and experience evaluations, the conventional belief has tra- job applicant training and work experience. Dissertation Abstracts ditionally been that both had low (or zero) validity. Validity gen- International, 42,4606B. (University Microfilms No. DE082-07909) eralization analysis showed that this belief was essentially Asher, J. J. (1971). Reliability of a novel format for the selection in- correct for the most commonly used method of training and terview. Psychological Reports, 26, 451-456. Bartlett, N. R. (1950). Review of research and development in exami- experience evaluation: the point method, which had a mean nation for aptitude for submarine training, 1942-1945. Medical Re- true validity of only .13. However, in the case of the interview, search Laboratory Report, 153(9), 11-53. the conclusion was the opposite: Even the unstructured in- Bobbin, J. M., & Newman, S. H. (1944). Psychological activities at the terview was found to have a respectable level of validity. United States Coast Guard Academy. Psychological Bulletin, 41, Our results confirm conventional wisdom regarding the su- 568-579. periority of criterion measures collected for research purposes Callender, J. C., & Osburn, H. G. (1981). Testing the constancy of valid- (Wherry & Bartlett, 1982). The finding that the administrative- ity with computer-generated sampling distributions of the multipli- research criterion dichotomy explains differences among stud- cative model variance estimate: Results for petroleum industry vali- ies in validity has implications for the interpretation of past va- dation research. Journal of Applied Psychology, 66, 274-281. lidity generalization results and the conduct of future validity Campbell, J. T, Otis, J. L., Liske, R. E., & Prien, E. P. (1962). Assess- ments of higher-level personnel: 2. Validity of the overall assessment generalization research. This study has shown that validity co- process. Personnel Psychology, 15, 63-74. efficients based on research criteria are generally larger than Carlson, R. E. (1967). Selection interview decisions: The effect of in- those based on administrative criteria. Because this distinction terviewer experience, relative quota situation, and applicant sample has not been explicitly addressed in past validity generalization on interviewer decisions. Personnel Psychology, 20, 259-280. studies of cognitive abilities, the results of those studies are Carlson, R. E. (1970). Effect of applicant sample on ratings of valid probably more conservative than was previously thought. The information in an employment setting. Journal of Applied Psychol- variance caused by this criterion characteristic can be expected ogy, 54,217-222. to have caused an overestimate of the true standard deviations Castle, P. F. C., & Garforth, F. I., de la P. (1951). Selection, training and and an underestimate of the true operational validity of cogni- status of supervisors: 1. Selection. Occupational Psychology, 25, 109- 123. tive ability tests (Schmidt et al., 1993). Dann, J. E., & Abrahams, N. M. (1970). Use of biographical and in- Although these analyses significantly advance knowledge of terview information in predicting Naval Academy disenrollmenl. (Re- the validity of the interview, there is still a need for additional search Rep. SRR 71-7). San Diego, CA: Naval Personnel and Train- research. First, future validity studies should include a detailed ing Research Laboratory. description of the interview to permit a taxonomy of interview Denton, J. C. (1964). The validation of interview-type data. Personnel content and structure. Correlations between the interview Psychology, 77,281-287. scores and measures of constructs (e.g., cognitive ability) also Dicken, C. F., & Black, J. D. (1965). Predictive validity of psychometric should be included. Furthermore, future validity studies should evaluations of super visors. Journal of Applied Psychology, 49, 34-47. report range-restriction and reliability information; at present Dreher, G. F., Ash, R. A., & Hancock, P. (1988). The role of the tradi- this is done only sporadically. Because some types of interviews tional research design in underestimating the validity of the interview. Personnel Psychology, 41, 315-328. have been shown to yield validities as high as those for cognitive Drucker, A. J. (1957). Predicting leadership ratings in the United States ability tests, we join Harris (1989) in calling for further study of Army. Educational and Psychological Measurement, 17, 240-263. the construct validity of the interview: To what extent does it Dunnette, M. D. (1970). Multiple assessment procedures in identifying measure motivation, social skills, and communication skills? At and developing managerial talent. Minneapolis: University of Minne- present, we know only that it measures multiple factors that sota, Center for the Study of Organizational Performance and Hu- predict job and training success. If these factors can be isolated, man Effectiveness. VALIDITY OF EMPLOYMENT INTERVIEWS 611

Dunnette, M. D., Arvey, R. D., & Arnold, J. A. (1971). Validity study Johnson, E. M., & McDaniel, M. A. (1981, July). Training correlates of results for jobs relevant to the petroleum refining industry. Minneap- a police selection system. Paper presented at the International Per- olis, MN: Personnel Decisions. sonnel Management Association Assessment Council, Denver, CO. Dye, D. A. (1988). What's in OPM's validity generalization data base? Kennedy, R. (1986). An investigation of criterion-related validity for the (OPRD-88-7). Washington, DC: U.S. Office of Personnel Manage- structured interview. Unpublished master's thesis, East Carolina Uni- ment, Office of Personnel Research and Development. versity, Greenville, NC. Felix, R. H., Cameron, D. C, Bobbin, J. M., & Newman, S. H. (1945). Latham, G. P. (1989). The reliability, validity, and practicality of the An integrated medico-psychological program at the United States situational interview. In R. W. Eder & G. R. Ferris (Eds.), The em- Coast Guard Academy. American Journal of Psychiatry, 101, 635- ployment interview: Theory, research, and practice. Newbury Park, 642. CA: Sage. Gardner, K. E., & Williams, A. P. O. (1973). A twenty-five year follow- Latham, G. P., & Saari, L. M. (1984). Do people do what they say? up of an extended interview selection program in the Royal Navy. Further studies on the situational interview. Journal of Applied Psy- Occupational Psychology, 47, 1-13. chology, 69, 569-573. Glaser, R., Schwartz, P. A., & Flanagan, J. C. (1956). Development of Latham, G. P., Saari, L. M., Pursell, E. D., & Campion, M. A. (1980). interview and performance tests for the selection of wage board super- The situational interview. Journal of Applied Psychology, 65, 422- visors (Personnel Research Board Technical Research Note 53). Pitts- 427. burgh, PA: American Institutes for Research for Personnel Research Law, K. S., Schmidt, F. L., & Hunter, J. E. (1994). Non-linearity of Branch of the Adjutant General's Office. range correlations in meta-analysis: A test of an improved procedure. Glaser, R., Schwartz, P. A., & Flanagan, J. C. (1958). The contribution Journal of Applied Psychology, 79, 425-438. of interview and situational performance procedures to the selection Ledvinka, J. (1973). Race of employment interviewer and reasons given of supervisory personnel. Journal of Applied Psychology, 42, 69-73. by job seekers leaving their jobs. Journal of Applied Psychology, 58, Grant, D. L., & Bray, D. W. (1969). Contributions of the interview to 362-364. assessment of management potential. Journal of Applied Psychology, Lilienthal, R. A., & Pearlman, K. (1983). The validity of federal selec- 53. 24-34. tion tests for aid/technicians in the health, science, and engineering Hakel, M. D., Ornesorge, J. P., & Dunnette, M. D. (1970). Interviewer fields. Washington, DC: U.S. Office of Personnel Management, Office evaluations of job applicants' resumes as a function of the qualifica- of Personnel Research and Development. tion of the immediately preceding applicants. Journal of Applied Psy- Matarazzo, R. G., Matarazzo, J. D., Saslow, G., & Phillips, J. S. (1958). chology, 54, 27-30. Psychological test and organismic correlates of interview interaction Handyside, J. D., & Duncan, D. C. (1954). Four years later: A follow- patterns. Journal of Abnormal and Social Psychology, 56, 329-338. up of an experiment in selecting supervisors. Occupational Psychol- Mayfield, E. C. (1964). The selection interview: A reevaluation of pub- ogy, 28, 9-23. lished research. Personnel Psychology, 17, 239-260. Harris, M. M. (1989). Reconsidering the employment interview: A re- McDaniel, M. A. (1986). Computer programs for calculating meta- view of recent literature and suggestions for future research. Person- analysis statistics. Educational and Psychological Measurement, 46, net Psychology, 42, 691 -726. 175-177. Hilton, A. C., Bolin, S. F., Parker, J. W, Jr., Taylor, E. K., & Walker, McDaniel, M. A., & Reck, M. (1985). Validity generalization coding W. B. (1955). The validity of personnel assessments by professional instructions (Draft Report). Washington, DC: U.S. Office of Person- psychologists. Journal of Applied Psychology, 39, 287-293. nel Management, Office of Staffing Policy. Hirsh, H. R., Northrop, L. C., & Schmidt, F. L. (1986). Validity gener- McDaniel, M. A., Schmidt, F. L., & Hunter, J. E. (1988). A meta-anal- alization results for law enforcement occupations. Personnel Psychol- ogy, 39, 399-420. ysis of methods for rating training and experience in personnel selection. Personnel Psychology, 41, 283-314. Hoeschen, P. L. (1977). Using the Q by Q interview to predict success as a police sergeant. Unpublished master's thesis, San Jose State Univer- McMurry, R. N. (1947). Validating the patterned interview. Personnel, sity. 23, 263-272. Huffcutt, A. I. (1992). An empirical investigation of the relationship be- Miles, D. W, Wilkins, W. L., Lester, D. W, & Hutchens, W. H. (1946). tween multidimensional degree of structure and the validity of the em- The efficiency of a high speed screening procedure in detecting the ployment interview. Unpublished doctoral dissertation, Texas A&M neuropsychiatrically unfit at a U.S. Marine Corps recruit training University, College Station. depot. Journal of Psychology, 21, 243-268. Hunt, W. A., Herrmann, R. S., & Noble, H. (1957). The specificity of Motowidlo, S. J., Carter, G. W., Dunnette, M. D., Tippins, N., Werner, the psychiatric interview. Journal of Clinical Psychology, 13, 49-53. S., Burnett, J. R., & Vaughn, M. J. (1992). Studies of the structured Hunter, J. E., & Hunter, R. F. (1984). The validity and utility of alterna- behavioral interview. Journal of Applied Psychology, 77, 571-587. tive predictors of job performance. Psychological Bulletin, 96, 72-98. Newmann, I., & Abrahams, N. M. (1979). Validation of NROTC selec- Hunter, J. E., & Schmidt, F. L. (1990). Methods ofmeta-analysis: Cor- tion procedures (NPRDC Special Rep. 79-12). San Diego, CA:Naval recting error and bias in research findings. Newbury Park, CA: Sage. Personnel Research and Development Center. Janz, T. (1982). Initial comparisons of patterned behavior description Parrish, J. A., KJieger, W. A., & Drucker, A. J. (1955). Assessment rat- interviews versus unstructured interviews. Journal of Applied Psy- ings and experimental tests as predictors of performance in officer chology, 67, 577-580, candidate school (Personnel Research Branch Technical Research Janz, T. (1989). The patterned behavior description interview: The best Note 49). Washington, DC: Department of the Navy, Adjutant Gen- prophet of the future is the past. In R. W. Eder & G. R. Ferris (Eds.), eral's Office. The employment interview: Theory, research, and practice (pp. 158- Pearlman, K. (1979). The validity of tests used to select clerical person- 168). Newbury Park, CA: Sage. nel: A comprehensive summary and evaluation (TS-79-1). Washing- Johnson, E. M. (1979). The relationship between selection variables and ton, DC: U.S. Office of Personnel Management, Personnel Research performance variables in academy training for police officers. Unpub- and Development Center. (NTIS No. PB 80.102650) lished manuscript, City of Milwaukee, Fire and Police Commission, Pearlman, K., Schmidt, F. L., & Hunter, J. E. (1980). Validity general- Wisconsin. ization results for tests used to predict training success and job pro- 612 McDANIEL, WHETZEL, SCHMIDT, AND MAURER

ficiency in clerical occupations. Journal of Applied Psychology, 65, Tupes, E. C. (1950). An evaluation of personality-trait ratings obtained 373-406. by unstructured assessment interviews. Psychological Monographs, Pursell, E. D., Campion, M. A., & Gaylord, R. S. (1980). Structured 64( 11, Whole No. 287). interviewing: Avoiding selection problems. Personnel Journal, 59, Ulrich, L., & Trumbo, D. (1965). The selection interview since 1949. 907-912. Psychological Bulletin, 63, 100-116. Putney, R. W. (1947). Validity of the placement interview. Personnel Veres, J. G., Ill, Field, H. S., & Boyles, W. R. (1983). Administrative Journal, 23, 144-145. versus research performance ratings: An empirical test of rating data Reeb, M. (1968). Construction of a questionnaire to replace a valid quality. Public Personnel Management, 12, 290-298. structured interview in Israel Defence Forces. Megamot: Behavioral Vernon, P. E. (1950). The validation of civil service selection board pro- Science Quarterly, 26, 69-74. cedures. Occupational Psychology, 24, 75-95. Reeb, M. (1969). A structured interview for predicting military adjust- Wagner, R. (1949). The employment interview: A critical summary. ment. Occupational Psychology, 43, 193-199. Personnel Psychology, 2, 17-46. Reilly, R. A., & Chao, G. T. (1982). Validity and fairness of some al- Walsh, U. R. (1975). A test of the construct and predictive validity of a ternative employee selection procedures. Personnel Psychology, 35, structure interview. Unpublished doctoral dissertation, University of 1-62. Nebraska, Lincoln. Rimland, B. (1958). The use of the interview in selecting NROTC stu- Warmke, D. L., & Billings, R. S. (1979). Comparison of training meth- dents (Project SD 1104.9.1). San Diego, CA: U.S. Naval Personnel ods for improving the psychometric quality of experimental and ad- Research Field Activity. ministrative performance ratings. Journal of Applied Psychology, 64, Rothstein, H. R. (1990). Interrater reliability of job performance rat- 124-131. ings: Growth to asymptote level with increasing opportunity to ob- Warmke, D. L., & Weston, D. J. (1992). Success dispels myths about serve. Journal of Applied Psychology, 75, 322-327. panel interviewing. Personnel Journal, 71, 120-126. Rothstein, H. R., & McDaniel, M. A. (1989). Guidelines for conducting Weston, D. J., & Warmke, D. L. (1988). Dispelling the myths about and reporting meta-analyses. Psychological Reports, 65, 759-770. panel interviews. Personnel Administrator, 33, 109-111. Wexley, K. N., Yukl, G., Kovacs, S. Z., & Sanders, R. E. (1972). Impor- Rundquist, E. A. (1947). Development of an interview for selection tance of contrast effects in employment interviews. Journal of Ap- purposes. In G. A. Kelly (Ed.), New methods in applied psychology plied Psychology, 56, 45-48. (pp. 85-95). College Park: University of Maryland. Wherry, R. J., Sr., & Bartlett, C. J. (1982). The control of bias in ratings. Ryan, A. M., & Sackett, P. R. (1989). Exploratory study of individual Personnel Psychology, 35, 521-551. assessment practices: Interrater reliability and judgments of assessor Whetzel, D. L., & McDaniel, M. A. (1988). Reliability of validity gen- effectiveness. Journal of Applied Psychology, 74, 568-579. eralization databases. Psychological Reports, 63, 131 -134. Schmidt, F. L., Hunter, J. E., Pearlman, K.., & Hirsh, H. R. (1985). Wiesner, W. H., & Cronshaw, S. F. (1988). The moderating impact of Forty questions and answers about validity generalization and meta- interview format and degree of structure on the validity of the em- analysis. Personnel Psychology, 38, 697-798. ployment interview. Journal of Occupational Psychology, 61, 275- Schmidt, F. L., Law, K., Hunter, J. E., Rothstein, J. R., Pearlman, K., 290. & McDaniel, M. A. (1993). Refinements in validity generalization Wilson, N. A. B. (1948). The work of the civil service selection board. procedures: Implications for the situational specificity hypothesis. Occupational Psychology, 22, 204-212. Journal of Applied Psychology, 78, 3-13. Wright, O. R. (1969). Summary of research on the selection interview Schmidt, F. L., Ones, D. S., & Hunter, J. E. (1992). Personnel selection. since \964.PersonnelPsychology, 22, 391-413. Annual Review of Psychology, 43, 627-670. Wright, P. M., Lichtenfels, P. A., & Pursell, E. D. (1989). The structured Schmitt, N. (1976). Social and situational determinants of interview interview: Additional studies and a meta-analysis. Journal of Occupa- decisions: Implications for the employment interview. Personnel Psy- tional Psychology, 62, 191 -199. chology, 29.19-101. Zaccaria, M. A., Dailey, J. T, Tupes, E. C., Stafford, A. R., Lawrence, Sharon, A. T., & Bartlett, C. J. (1969). Effect of instructional conditions H. G., & Ailsworth, K. A. (1956). Development of an interview proce- in producing leniency on two types of rating scales. Personnel Psy- dure for USAF officer applicants (AFPTRC-TN-56-43, Project No. chology, 23,251-263. 7701). Lackland Air Force Base, TX: Air Research and Development Taylor, E. L., & Wherry, R. J. (1951). A study of leniency in two rating Command. systems. Personnel Psychology, 4, 39-47. Zedeck, S., & Cascio, W. F. (1982). Performance appraisal decisions as a Trankell, A. (1959). The psychologist as an instrument of prediction. function of rater train ing and purpose of appraisal. Journal of Applied Journal of Applied Psychology, 43, 170-175. Psychology, 67, 752-758. VALIDITY OF EMPLOYMENT INTERVIEWS 613

Appendix

Sources of Validity Data

Albrecht, P. A., Glaser, G. M., & Marks, J. (1964). Validation of a cess as army aviators (FOA Rep. C 53020-H2). Linkoping, Sweden: multiple-assessment procedure for managerial personnel. Journal of National Defense Research Institute. Applied Psychology, 48, 351-360. Cerf, A. Z. (1947). Conference for the interpretation of test scores and Aramburo, D. J. (1981). The efficacy of the structured interview in the occupational background CE707A. In J. P. Guilford & J. I. Lacey selection of special education teachers. Unpublished doctoral disser- (Eds.), Printed classification tests: Report No. 5 (pp. 652-656). Wash- tation, University of New Orleans, LA. ington, DC: Army Air Forces Aviation Psychology Program. Ard, R. F. (1985, January 30). Personal communication to M. A. Mc- Conrad, H. S., & Satter, G. A. (1946). The use of test scores and quality- Daniel. classification ratings in predicting success in electrician's mates Arvey, R. D., Miller, H. E., Gould, R., & Burch, P. (1987). Interview school (Office of Scientific Research and Development Rep. No. validity for selecting sales clerks. Personnel Psychology, 40, 1-12. 5667, 1945: Publications Board No. 13290). Washington, DC: U.S. Banta, G. W. (1967). A comparison of the leaderless group discussion Department of Commerce. and the individual interview techniques for the selection of student Cook, R. E. (1981). Chippewa Valley local validation study of the teacher orientation assistants. Dissertation Abstracts International, 28, 937. perceiver interview process. Unpublished doctoral dissertation, Barrett, G. V, Svetlik, B., & Prien, E. P. (1966). Validity of the job- Wayne State University, Detroit, MI. concept interview in an industrial setting. Journal of Applied Psychol- Darany, T. (1971). Summary of state police trooper 07 validity study. ogy, 51, 233-235. Ann Arbor: Michigan Department of Civil Service. Bartlett, N. R. (1950). Review of research and development in exami- Davey, B. (1984, May). Are all oral panels created equal?: A study of nation for aptitude for submarine training, 1942-1945. Medical Re- differential validity across oral panels. Paper presented at the Interna- search Laboratory Report 153(9), 11-53. tional Personnel Management Association Assessment Council An- Bender, W. R. G., & Loveless, H. E. (1958). Validation studies involving nual Conference, Seattle, WA. successive classes of trainee stenographers. Personnel Psychology, 11, Davis, R. (1986). Personal communication to M. A. McDaniel. 491-508. DeHart, W. A. (n.d.). Preliminary report on research findings of a dem- onstration project in recruitment and training of case workers spon- Benz, M. P. (1974). Validation of the examination for Staff Nurse II. sored by the Utah State Department of Public Welfare, 1966—1967. Urbana: University Civil Service Testing Program of Illinois, Testing Salt Lake City: Utah State Department of Public Welfare. Research Program. Delaney, E. C. (1954). Teacher selection and evaluation with special at- Berkley, S. (1984). VII Validation Report Corrections Officer Trainee. tention to the validity of the personal interview and the National Harrisburg: Commonwealth of Pennsylvania, State Civil Service Teacher Examinations as used in one selected community (Elizabeth, Commission. New Jersey). (Doctoral dissertation, Columbia University, 1954). Bertram, F. D. (1975). The prediction of police academy performance Dissertation Abstracts International, 14, 1334-1335. and on the job performance from police recruit screening measures. Delery, J. E., Wright, P. M., & Tolzman, K. (1992, April). Employment Unpublished doctoral dissertation, Marquette University, Milwau- tests and the situational interview: A test of incremental validity. Paper kee, WI. presented at the seventh annual meeting of the Society for Industrial Bloom, R. F, & Brundage, E. G. (1947). Prediction of success in ele- and Organizational Psychology, Montreal, Quebec, Canada. mentary schools for enlisted personnel. In D. B. Stuit (Ed.), Personnel Dipboye, R. L., Gaugler, B., & Hayes, T. (1990, April). Differences research and test development in the Bureau of Naval Personnel (pp. among interviewers in the incremental validity of their judgments. Pa- 251-254). Princeton, NJ: Princeton University Press. per presented at the fifth annual meeting of the Society for Industrial Bobbitt, J. M., & Newman, S. H. (1944). Psychological activities at the and Organizational Psychology, Miami, FL. United States Coast Guard Academy. Psychological Bulletin, 41, Dougherty, T. W, Ebert, R. J., & Callender, J. C. (1986). Policy captur- 568-579. ing in the employment interview. Journal of Applied Psychology, 71, Bolanovich, D. J. (1944). Selection of female engineering trainees. Jour- 9-15. nal of Educational Psychology, 35, 545-553. Dubois, P. H., & Watson, R. K. (1950). The selection of patrolman. Bonneau, L. R. (1957). An interview for selecting teachers (Doctoral Journal of Applied Psychology, 34, 90-95. dissertation, University of Nebraska, 1956). Dissertation Abstracts Dunlap, J. W, & Wantman, M. J. (1944). An investigation of the in- International, 17, 537-538. terview as a techniquefo r selecting aircraft pilots (Rep. No. 33). Wash- Borman, W. C. (1982). Validity of behavioral assessment for predicting ington, DC: Civil Aeronautics Administration, Airman Development military recruiter performance. Journal of Applied Psychology, 67, 3- Division. 9. English, J. J. (1983). The relationship between structured selection cri- Bosshardt, M. J. (1992). Situational interviews versus behavior descrip- teria and the assessment of proficient leaching. Unpublished doctoral tion interviews: A comparative validity study. Unpublished doctoral dissertation, University of Virginia, Richmond. dissertation, University of Minnesota, Minneapolis. Finesinger, J. E., Cobb, S., Chappie, E. D., & Brazier, M. A. B. (1948). Bosshardt, M. J. (1993, April 8). Personal communication to M. A. An investigation of prediction of success in naval flight training (Rep. McDaniel. No. 81). Washington, DC: Civil Aeronautics Administration, Divi- Brown, S. H. (1986). Personal communication to M. A. McDaniel. sion of Research. Campbell, J. T., Prien, E. P., & Brailey, L. G. (1960). Predicting perfor- Fisher, J., Epstein, L. J., & Harris, M. R. (1967). Validity of the psychi- mance evaluations. Personnel Psychology, 13, 435-440. atric interview: Predicting the effectiveness of the first Peace Corps Campion, M. A., Pursell, E. D., & Brown, B. K. (1988). Structured volunteers in Ghana. Archives of General Psychiatry, 17, 744-750. interviewing: Raising the psychometric properties of the employment Flynn, J. T., & Peterson, M. (1972). The use of regression analysis in interview. Personnel Psychology, 41, 25-42. police patrolman selection. The Journal of Criminal Law, Criminol- Carlstrom, A. (1984). Correlations between selection ratings and suc- ogy, and Police Science, 63, 564-569.

(Appendix continues on next page) 614 McDANIEL, WHETZEL, SCHMIDT, AND MAURER

Fowlkes, R. D. (1984). The relationship between the teacher perceiver Husen, T. (1954). La validite des interviews par rapport a 1'age, au sexe interview and instructional behaviors of teachers of learning disabled et a la formation des interview. Le Travail Humain, 17, 60-67. students. Unpublished doctoral dissertation. University of Virginia, Inman, S. J. (1986). Personal communication to M. A. McDaniel re- Richmond. garding the study by the city of Milwaukee, WI: The selection and Freeman, G. L., Manson, G. E., Katzoff, E. T., & Pathman, J. H. (1942). evaluation of unskilled laborers. The stress interview. Journal of Abnormal and Social Psychology, 37, Janz, T. (1982). Initial comparisons of patterned behavior description 427-447. interviews versus unstructured interviews. Journal of Applied Psy- Friedland, D. (1973). Selection of police officers. Los Angeles: City of chology, 67, 577-580. Los Angeles, Personnel Department. Johnson, E. M., & McDaniel, M. A. (1981, June). Training correlates Friedland, D. (1980). A predictive study of selected tests for police officer of a police selection system. Paper presented at the Annual Confer- selection. Los Angeles: City of Los Angeles, Personnel Department. ence of the International Personnel Management Council, Denver, Friedland, D. (n.d.). Section II—City of Los Angeles Personnel Depart- CO. ment (Junior Administrative Assistant Validation Study). Los Ange- Johnson, G. (1990). The structured interview: Manipulating structuring les: City of Los Angeles, Personnel Department. criteria and the effects on validity, reliability, and practicality. Unpub- Ghiselli, E. E. (1966). The validity of a personnel interview. Personnel lished doctoral dissertation, Tulane University, New Orleans, LA. Psychology, 19, 389-394. Johnson, H. M., Boots, M. L., Wherry, R. J., Hotaling, O. C., Martin, Gillies, T. K. (1988). The relationship between selection variables and L. G., & Cassens, F. P., Jr. (1944). On the actual and potential value of subsequent performance ratings for teachers in an Oregon school dis- biographical information as a means of predicting success in aero- trict. Unpublished doctoral dissertation, University of Oregon, Eu- nautical training (Rep. No. 32). Washington, DC: Civil Aeronautics gene. Administration, Airman Development Division. Githens, W. H., & Rimland, B. (1964). The validity of NROTCselection Jones, D. E. (1978). Predicting teaching processes with the teacher per- interviews against career decisions and officer fitness reports: An eight ceiver interview. Unpublished doctoral dissertation, Virginia Poly- year followup (PRASD Rep. No. 234). San Diego, CA: U.S. Naval technic Institute and State University, Blacksburg. Personnel Research Activity. Kelley, E. L., & Ewart, E. (1942). A preliminary study of certain predic- Glaser, R., Schwartz, P. A., & Flanagan, J. C. (1958). The contribution tors of success in civilian pilot training (Rep. No. 7). Washington, DC: of interview and situational performance procedures to the selection Civil Aeronautics Administration, Division of Research. of supervisory personnel. Journal of Applied Psychology, 42, 69-73. Kelly, E. L., & Fiske, D. W. (1950). The prediction of success in the VA Green, P. C., & Alter, P. (1990). Validity of a multi-position, Behavioral training program in clinical psychology. American Psychologist, 5, Interviewing system when using an answer scoring strategy. Mem- 395-406. phis, TN: Behavioral Technology. Kennedy, R. (1986). An investigation of criterion-related validity for the Gregory, E. (1956). Evaluation of selection procedures for women naval structured interview. Unpublished master's thesis, East Carolina Uni- officers (USN Bureau of Naval Personnel Technical Bulletin, No. 56- versity, Greenville, NC. 11). [Cited in Sawyer, J. (1966). Measurement and prediction, clinical Kirkpatrick, J. J., Ewen, R. B., Barrett, R. S., & Katzell, R. A. (1968). and statistical. Psychological Bulletin, 55, 178-200.] Testing and fair employment: Fairness and validity of personnel tests Grove, D. A. (1981). A behavioral consistency approach to decision for different ethnic groups. New Vbrk: New \brk University Press. making in employment selection. Personnel Psychology, 34, 55-64. Knights, R. M. (1977). The relationship between the selection process Halliwell, S. T. (1984, November). An evaluation of the Canadian Forces and on-the-job performance of Albuquerque police officers (Doctoral combat arms officer selection board. Paper presented at the 26th an- dissertation, University of New Mexico, 1976). Dissertation Abstracts nual conference of the Military Testing Association, Munich, Ger- International, 38, 1229A-1230A. many (Available from Canadian Forces Personnel Applied Research Komives, E., Weiss, S. T, & Rosa, R. M. (1984). The applicant interview Unit, Willowdale, Ontario, Canada) as a predictor of resident performance. Journal of Medical Education, Harris, J. G. (1972). Prediction of success on a distant pacific island: 59, 425-426. Peace Corps style. Journal of Consulting and Clinical Psychology, 38, Landy, F. J. (1976). The validity of the interview in police officer selec- 181-190. tion. Journal of Applied Psychology, 61, 193-198. Hays, E. J. (1990). Relationship of a situational interview to the job performance of convenience store clerks. Unpublished master's thesis, Latham, G. P., & Saari, L. M. (1984). Do people do what they say? Further studies on the situational interview. Journal of Applied Psy- University of North Texas, Denton. chology, 69, 569-573. Hess, L. R. (1973). Police entry tests and their predictability of score in police academy and subsequent job performance (Doctoral disserta- Latham, G. P., Saari, L. M., Pursell, E. D., & Campion, M. A. (1980). tion, Marquette University, 1972). Dissertation Abstracts Interna- The situational interview. Journal of Applied Psychology, 65, 422- tional, 24, 679-686. 426. Hill, C. J., Jr., Russell, D. L., & Wendt.G. R. (1946). A partial analysis Lopez, F. M. (1966). Current problems in test performance of job ap- of a three-man interview technique for predicting airline pilot success plicants: I. Personnel Psychology, 19, 10-18. in the civilian pilot training program (Rep. No. 65). Washington, DC: Martin, M. A. (1972). Reliability and validity of Canadian Forces selec- Civil Aeronautics Administration, Division of Research. tion interview procedures (Rep. 72-4). Willowdale, Ontario: Cana- Holt, R. R. (1958). Clinical and statistical prediction: A reformulation dian Forces Personnel Applied Research Unit. and some new data. Journal of Abnormal and Social Psychology, 53, Maurer, S. (1986). Personal communication to M. A. McDaniel. 1-12. McKinney, T. S. (1975). The criterion-related validity of entry level po- Hovland, C. I., & Wonderlic, E. F. (1939). Prediction of industrial suc- lice officer selection procedures. Phoenix, AZ: City of Phoenix, Per- cess from a standardized interview. Journal of Applied Psychology, 23, sonnel Department, Employee Selection Research. 537-546. McKinney, T. S., & Meltzzer, D. M. (1979). The validity and utility of Huse, E. F. (1962). Assessments of higher-level personnel: IV. The valid- entry level police recruit selection procedures: The Phoenix Experi- ity of assessment techniques based on systematically varied informa- ence. Phoenix, AZ: City of Phoenix, Personnel Department, Em- tion. Personnel Psychology, 15. 195-205. ployee Selection Research. VALIDITY OF EMPLOYMENT INTERVIEWS 615

McMurry, R. N. (1947). Validating the patterned interview. Personnel, Ricchio, L. K. (1980). Evaluation of eligible list ranking as a predictor 23, 263-272. of teacher success. (Doctoral dissertation, Claremont Graduate Merenda, P. R, & Clarke, W. V. (1959). AVA validity for textile workers. School, 1980). Dissertation Abstracts International, 41, 50. Journal of Applied Psychology, 43, 162-165. Robertson, I. T., Gratton, L., & Rout, U. (1990). The validity of situa- Meyer, H. H. (1956). An evaluation of a supervisory selection program. tional interviews for administrative jobs. Journal of Organizational Personnel Psychology, 9, 499-513. Behavior, U, 69-76. Miner, J. B. (1970). Psychological evaluations as predictors of consult- Roth, P. L., Campion, J. E., & Francis, D. J. (1988, April). A LISREL ing success. Personnel Psychology, 23, 393-405. analysis of the predictive power ofpre-employmenl tests and the panel Mischel, W. (1965). Predicting the success of Peace Corps volunteers in interview. Paper presented at the third annual meeting of the Society Nigeria. Journal of Personality and Social Psychology, I, 510-517. for Industrial and Organizational Psychology, Dallas, Texas. (Validity Modderaar, J. (1966). Over de validiteit van de selectie voor bedrijven data not reported in the paper were obtained from P. L. Roth.) [About the validity of selection for industry]. Nederlands Tijdschrift Rundquist, E. A. (1947). Development of an interview for selection voor de Psychologie en haar Grensgebieden, 21, 573-589. purposes. In G. A. Kelly (Ed.), New methods in applied psychology Mosher, M. R. (1991, June). Development of a behaviorly consistent (pp. 85-95). College Park: University of Maryland. structured interview. Paper presented at the 27th International Ap- Shaw, J. (1952). The function of the interview for determining fitness plied Military Psychology Symposium, Stockholm, Sweden. for teacher-training. Journal of Educational Research, 45, 667-681. Motowidlo, S. J., Carter, G. W., Dunnette, M. D., Tippins, N., Werner, Simmons, J. E. (1976). A study to test teacher perceiver interview as an S., Burnett, J. R., & Vaughn, M. J. (1992). Studies of the structured instrument that would select vocational agriculture instructors that behavioral interview. Journal of Applied Psychology, 77, 571-587. develop positive rapport with their students. Unpublished doctoral dis- Orpen, C. (1985). Patterned behavior description interviews versus un- sertation, University of Nebraska, Lincoln. structured interviews: A comparative validity study. Journal of Ap- Smedberg, C. A. (1984). An examination of the validity of the empathy plied Psychology, 70, 774-776. interview as an instrument for teacher selection. Unpublished doc- Parrish, J. A., Klieger, W. A., & Drucker, A. J. (1955). Assessment rat- toral dissertation, University of South Florida, Tampa. ings and experimental tests as predictors of performance in officer Solomon, G. L. (1981). The Omaha teacher pre-employment interview candidate school (Personnel Research Branch Technical Research as a predictor of job performance of selected new teachers in the Note 49). Washington, DC: Department of the Army, Adjutant Gen- Jefferson County school system. Unpublished doctoral dissertation, eral's Office. University of Alabama, Tuscaloosa. Pasco, D. C. (1979). The comparison of two interviewing techniques, Sparks, C. P. (n.d.). [Title unknown]. Cited in Dunnette, M. D., Arvey, the leuderless group discussion and the traditional interview, as a R. D., & Arnold, J. A. (1971). Validity study results for jobs relevant method for teacher selection (Doctoral dissertation, University of to the petroleum industry. Minneapolis, MN: Personnel Decisions. Maryland, 1979). Dissertation Abstracts International, 41, 486. Stephen, M. J. (1980). The selection of computer salespersons: An eval- Pashalian, S., & Crissy, W. J. E. (1953). The interview IV: The reliability uation of a structured interview and its underlying theory. (Doctoral and validity of the assessment interview as a screening and selection dissertation, Northwestern University, 1980). Dissertation Abstracts technique in the submarine service (Medical Research Laboratory International, 41, 2372. Rep. No. 216, XII, No. 1). Washington, DC: Department of the Navy, Stohr-Gillmore, M. K., Stohr-Gillmore, M. W., & Kistler, N. (1990). Bureau of Medicine and Surgery. Improving selection outcomes with the use of situational interviews: Pearlman, K.. (1978). Personal communication to M. A. McDaniel. Empirical evidence from a study of correctional officers for new gen- Plag, J. A. (1961). Some considerations of the value of the psychiatric eration jails. Reviewof Public Personnel Administration, 10, 1-18. screening interview. Journal of Clinical Psychology, 17, 3-8. Tarico, V S., Altmaier, E. M., Smith, W. L., Franken, E. A., & Berbaum, Prien, E. P. (1962). Assessments of higher-level personnel: V. An analysis K. S. (1986). Development and validation of an accomplishment in- of interviewers' predictions of job performance. Personnel Psychol- terview for radiology residents. Journal of Medical Education, 61, ogy, 15, 319-334. 845-847. Pulos, L., Nichols, R. C., Lewinsohn, P. M., & Koldjeski, T. (1962). Trankell, A. (1959). The psychologist as an instrument of prediction. Selection of psychiatric aides and prediction of performance through Journal of Applied Psychology, 43, 170-175. psychological testing and interviews. Psychological Reports, 10, 519- Trites, D. K. (1960). Adaptability measures as predictors of perfor- 520. mance ratings. Journal of Applied Psychology, 44, 349-353. Rafferty, J. A., & Deemer, Jr., W. L. (1950). Factor analysis of psychiat- Tubiana, J. H., & Ben-Sitakhar, G. (1982). An objective group question- ric impressions. Journal of Educational Psychology, 41, 173-183. naire as a substitute for a personal interview in prediction of success Raines, G. N., & Rohrer, J. H. (1955). The operational matrix of psy- in military training in Israel. Personnel Psychology, 35, 349-357. chiatric practice: Consistency and variability in interview impres- Tziner, A., & Dolan, S. (1982a). Evaluation of a traditional selection sions of different psychiatrists. The American Journal of Psychiatry, system in predicting success of females in officer training. Journal of J, 721-733. Occupational Psychology, 55, 269-275. Reeb, M. (1969). A structured interview for predicting military adjust- Tziner, A., & Dolan, S. (1982b). Validity of an assessment center for ment. Occupational Psychology, 43, 193-199. identifying future female officers in the military. Journal of Applied Requate, F. (1971 -1972). Metropolitan Dade County firefighter valida- Psychology, 67, 728-736. tion study, Miami, Florida. Miami: Dade County Personnel Depart- U.S. Office of Personnel Management (1987). The structured interview. ment. Washington, DC: Office of Examination Development, Alternative Rhea, B. D. (1966). Validation of OCS selection instruments: The rela- Examining Procedures Division. tionship of OCS selection measures to OCS performance (Tech. Bul- Waldron, L. A. (1974). The validity of an employment interview inde- letin STB 66-18 ). San Diego, CA: U.S. Naval Personnel Research. pendent of psychometric variables. Australian Psychologist, 9, 68-77. Rhea, B. D., Rimland, B., & Githens, W. H. (1965). The development Walters, L. C., Miller, M., & Ree, M. J. (1993). Structured interviews for and evaluation of a forced-choice letter of reference form for selecting pilot selection: No incremental validity. Internal Journal of Aviation officer candidates (Tech. Bulletin STB 66-10). San Diego: U.S. Naval Psychology, 3, 25-38. Personnel Research Activity. Wayne County Civil Service Commission. (1973). Summary informa-

(Appendix continues on next page) 616 McDANIEL, WHETZEL, SCHMIDT, AND MAURER

tion regarding validity of the examination for patrolmen in the Wayne Wunder, S. (1977b). Selecting process operator and laboratory trainees. County Sheriff's Department. Wayne County Civil Service Commis- Unpublished manuscript. sion. Wunder, S. (1978a). Preemployment tests for pipefitter trainees. Unpub- Weekley, J. A., & Gier, J. A. (1987). Reliability and validity of the situa- lished manuscript. tional interview for a sales position. Journal of Applied Psychology, Wunder, S. (1978b). Selecting laboratory technicians. Unpublished 72, 484-487. manuscript. Willkins, W. L., Anderhalter, O. F, Rigby, M. K., & Stinson, P. (1955). Wunder, S. (1979). Preemployment tests for electrician trainees. Unpub- Statistical description of criterion measures for VSMC junior officers lished manuscript. (Office of Naval Research Contract N7ONR-40802; NR 151-092). Wunder, S. (1980). Validity ofpreemploymenl measures for instrument St. Louis, MO: St. Louis University, Department of Psychology. technician trainees. Unpublished manuscript. Woodworth, D. G., Barren, F, & MacKinnon, D. W. (1957). Analysis Wunder, S. (1981). Selection of manufacturing technicians. Unpub- oj life history interview's rating for 100 air force captains (Research lished manuscript. Rep. AFPTRC-TN, ASTIA Document AD 146-401). Lackland Air Yonge, K. A. (1956). The value of the interview: An orientation and a Force Base, TX: Air Research and Development Command. pilot study. Journal of Applied Psychology, 40, 25-31. Wright, P. M., Pursell, E. D., Lichtenfels, P. A., & Kennedy, R. (1986, Zaranek, R. J. (1983). ,4 correlational analysis between the teacher per- August). The structured interview: Additional studies and a meta- ceiver interview and teacher success in the Chippewa Valley school analysis. Paper presented at the Academy of Management Conven- system. Unpublished doctoral dissertation, Western Michigan Uni- tion, Chicago. versity, Kalamazoo. Wunder, S. (1973a). Validity of a selection procedure for process train- Zedeck, S., Tziner, A., & Middlestadt, S. (1983). Interviewer validity ees. Unpublished manuscript. and reliability: An individual analysis approach. Personnel Psychol- Wunder, S. (1973b). Validity of a selection program for refinery techni- ogy, 36, 355-370. cian trainees. Unpublished manuscript. Wunder, S. (1974). Validity of a selection program for commercial driv- ers. Unpublished manuscript. Received October 7, 1991 Wunder, S. (1977a). The refinery and chemical plant test battery: Valid- Revision received November 29, 1993 ity for laboratory selection. Unpublished manuscript. Accepted November 30, 1993 •