Journal of Applied Psychology, 2017, Vol. 102, No. 12, 1636–1657. © 2017 American Psychological Association. 0021-9010/17/$12.00 http://dx.doi.org/10.1037/apl0000240

Diversity Shrinkage: Cross-Validating Pareto-Optimal Weights to Enhance Diversity via Hiring Practices

Q. Chelsea Song, University of Illinois at Urbana-Champaign

Serena Wee, Singapore Management University

Daniel A. Newman, University of Illinois at Urbana-Champaign

To reduce adverse impact potential and improve diversity outcomes from personnel selection, one promising technique is De Corte, Lievens, and Sackett's (2007) Pareto-optimal weighting strategy. De Corte et al.'s strategy has been demonstrated on (a) a composite of cognitive and noncognitive (e.g., personality) tests (De Corte, Lievens, & Sackett, 2008) and (b) a composite of specific cognitive ability subtests (Wee, Newman, & Joseph, 2014). Both studies illustrated how Pareto-weighting (in contrast to unit weighting) could lead to substantial improvement in diversity outcomes (i.e., diversity improvement), sometimes more than doubling the number of job offers for minority applicants. The current work addresses a key limitation of the technique—the possibility of shrinkage, especially diversity shrinkage, in the Pareto-optimal solutions. Using Monte Carlo simulations, sample size and predictor combinations were varied and cross-validated Pareto-optimal solutions were obtained. Although diversity shrinkage was sizable for a composite of cognitive and noncognitive predictors when sample size was at or below 500, diversity shrinkage was typically negligible for a composite of specific cognitive subtest predictors when sample size was at least 100. Diversity shrinkage was larger when the Pareto-optimal solution suggested substantial diversity improvement. When sample size was at least 100, cross-validated Pareto-optimal weights typically outperformed unit weights—suggesting that diversity improvement is often possible, despite diversity shrinkage. Implications for Pareto-optimal weighting, adverse impact, sample size of validation studies, and optimizing the diversity–performance tradeoff are discussed.

Keywords: adverse impact, cognitive ability/intelligence, cross-validation, diversity, Pareto-optimal weighting

This article was published Online First July 27, 2017. Q. Chelsea Song, Department of Psychology, University of Illinois at Urbana-Champaign; Serena Wee, School of Social Sciences, Singapore Management University; Daniel A. Newman, Department of Psychology and School of Labor & Employment Relations, University of Illinois at Urbana-Champaign. We thank Wilfried De Corte for helping us implement the Pareto-optimal technique in TROFSS, and for his advice when we were designing the ParetoR package. The I/O Journal Club at the University of Illinois gave comments on a draft. A previous version of this paper was presented at the Society for I/O Psychology Conference in April 2016. Correspondence concerning this article should be addressed to Q. Chelsea Song, Department of Psychology, University of Illinois at Urbana-Champaign, IL 61820, United States. E-mail: [email protected]

When implementing personnel selection procedures to hire new employees, some managers and executives place value on the diversity of the new hires (Avery, McKay, Wilson, & Tonidandel, 2007). Diversity can be valued for a variety of reasons, including moral and ethical imperatives related to fairness and equality of the treatment, rewards, and opportunities afforded to different demographic subgroups (Bell, Connerley, & Cocchiara, 2009; Gilbert & Ivancevich, 2000; Gilliland, 1993; Newman, Hanges, & Outtz, 2007). Some scholars and professionals have also articulated a business case for enhancing organizational diversity, based on the ideas that diversity initiatives improve access to talent, help the organization understand its diverse customers better, and improve team performance (see reviews by Jayne & Dipboye, 2004, Kochan et al., 2003, and Konrad, 2003, which suggest only mixed and contingent evidence for diversity's effects on organizational performance; as well as more recent and nuanced reports on race and gender diversity's performance benefits from Avery et al., 2007; Joshi & Roh, 2009; Roberson & Park, 2007). In short, whereas the jury is still out on whether and when diversity enhances group performance, diversity remains an objective in itself that is valued by many organizations.

The attainment of organizational diversity is, in large part, a function of hiring practices. There are thus potentially far-reaching negative consequences for attaining diversity when an organization uses a selection measure demonstrating adverse impact potential. To elaborate, "adverse impact has been defined as subgroup differences in selection rates (e.g., hiring, licensure and certification, college admissions) that disadvantage subgroups protected under Title VII of the 1964 Civil Rights Act. Protected subgroups are defined on the basis of a number of demographics, including race, sex, age, religion, and national origin (Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor & Department of Justice, 1978)" (Outtz & Newman, 2010, p. 53). Adverse impact is evidenced by the adverse impact ratio (AI ratio)—which is calculated as the selection ratio for one subgroup (e.g., number of selected minority applicants divided by total number of minority applicants) divided by the selection ratio for the subgroup with the highest selection ratio (e.g., number of selected majority applicants divided by total number of majority applicants). As a rule of thumb, when the AI ratio falls below 4/5ths (i.e., when the minority selection ratio is 80% or less of the majority selection ratio), this is generally considered prima facie evidence of adverse impact discrimination (Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor & Department of Justice, 1978). To be more specific, the "4/5ths rule" is not a legal definition of adverse impact, but rather provides a guideline for practitioners and for Federal enforcement agencies when deciding whether to further investigate discrepant selection rates. When the adverse impact ratio is low, organizations can face reputational risk and legal charges of discriminatory practice (Outtz, 2010; Sackett, Schmitt, Ellingson, & Kabin, 2001).

The issue of adverse impact often arises from using one particular selection device (i.e., the cognitive ability test) that exhibits among the strongest criterion validities (Schmidt & Hunter, 1998) and also large mean differences between racial groups (Goldstein, Scherbaum, & Yusko, 2010; Hough, Oswald, & Ployhart, 2001). The classic problem is that subgroup mean differences on valid tests result in differences in selection rates between the majority and the minority group, which often disadvantage the minority group in gaining access to jobs (Bobko, Roth, & Potosky, 1999; Outtz & Newman, 2010). For example, cognitive tests have a meta-analytic estimated operational relationship with job performance of r = .52 (medium-complexity jobs; corrected for range restriction and unreliability in the criterion only; Roth, Switzer, Van Iddekinge, & Oh, 2011, p. 905), but also an average Black–White mean difference of d = .72 (within-jobs; for medium-complexity jobs; Roth, Bevier, Bobko, Switzer, & Tyler, 2001, p. 314; though we note the average Black–White difference on actual job performance is less than half as large as the Black–White difference on cognitive tests [see McKay & McDaniel, 2006]). Hence, organizations often face the seemingly incompatible goals of selecting either for high levels of expected job performance (by maximizing criterion validity) or selecting for greater racial diversity (by maximizing the AI ratio). That is, organizations often face a "performance versus diversity dilemma" (Sackett et al., 2001, p. 306; Ployhart & Holtz, 2008).

The trade-off required between fulfilling job performance and diversity objectives has driven research efforts to develop selection systems with substantial criterion validity but less adverse impact potential. In this enterprise, one important constraint is that the Civil Rights Act of 1991 prohibits treating predictor scores differently for different racial subgroups (e.g., using different cut scores for different racial groups, or including applicant minority status/race in the predictor composite). As such, we highlight that contemporary hiring strategies designed to enhance diversity—including the Pareto-weighting strategy described in the current paper—in no way include minority status or race of individual applicants in the predictor composite. Under this legal constraint, several alternative strategies have been traditionally recommended to accomplish both job performance and diversity objectives simultaneously: (a) using "low impact" predictors that exhibit smaller subgroup differences than cognitive ability tests do (e.g., Hough et al., 2001), (b) reducing irrelevant predictor variance (e.g., Arthur, Edwards, & Barrett, 2002), and (c) recruitment strategies (e.g., Newman & Lyon, 2009; for a review, see Ployhart & Holtz, 2008). Nonetheless, most of the research on adverse impact reduction has investigated different ways of combining predictors (e.g., Bobko et al., 1999; De Corte, Lievens, & Sackett, 2007; Pulakos & Schmitt, 1996; Sackett & Ellingson, 1997; Schmitt, Rogers, Chan, Sheppard, & Jennings, 1997; Wee, Newman, & Joseph, 2014; see Ryan & Ployhart, 2014, for a review). Although these studies indicate that predictor combination is one useful way to deal with the performance-diversity dilemma, many of these studies (Pulakos & Schmitt, 1996; Sackett & Ellingson, 1997; Schmitt et al., 1997) used approaches based on "trial-and-error" (as characterized by De Corte et al., 2007, p. 1381) to examine the effect of weighting strategies in generating acceptable trade-offs between job performance and diversity objectives.

More recently however, De Corte and colleagues (De Corte et al., 2007; De Corte, Sackett, & Lievens, 2011) proposed a combinatorial approach using Pareto-optimally derived predictor weights that allowed the examination of all possible trade-offs between job performance and diversity outcomes. Instead of optimizing on a single criterion (as done when using regression weights), or ignoring optimization (as done when using unit weights), Pareto-optimal weights optimize on multiple criteria simultaneously, providing an optimal solution on one objective, at a given value of another objective. Specifically, a Pareto-optimal solution provides the best possible diversity outcome (AI ratio) at a given level of job performance (criterion validity); it also provides the best possible job performance outcome at a given level of diversity. Using Pareto-optimal weights, De Corte, Lievens, and Sackett (2008) showed how improvements in both expected job performance and the AI ratio could be achieved via a Pareto-weighted predictor composite formed with a cognitive test and noncognitive predictors (e.g., structured interview), as compared with a cognitive test alone. Also building on De Corte et al.'s (2007) work, Wee et al. (2014) further illustrated the utility of Pareto-optimal weighting by showing how Pareto-optimal weighting of cognitive ability subtests could result in substantial diversity improvement (i.e., adverse impact reduction resulting from using Pareto-optimal weights rather than unit weights, with no loss in expected job performance).

Although the aforementioned studies provided evidence that Pareto-optimal weighting could be an effective strategy to deal with the trade-off between job performance and diversity, a key caveat remains: we do not know the extent to which the Pareto-optimally derived diversity improvement obtained in a particular calibration sample will cross-validate. Before we can determine the practical value of using Pareto-optimal weighting to enhance diversity, we should determine whether diversity improvements generalize beyond the sample on which the weights were obtained. That is the purpose of the current article—to examine the cross-validity of Pareto-optimal solutions.¹ Specifically, by conducting Monte Carlo simulations, we examined the extent to which Pareto-optimal solutions shrink (i.e., the extent to which expected job performance and expected diversity decrease in size when the

¹ Both De Corte et al. (2007) and Wee et al. (2014) examined sensitivity (i.e., sample-to-sample variation) in the Pareto-optimal results that were presented, but neither investigated the cross-validation/shrinkage of Pareto-optimal solutions.

predictor weights obtained in the original sample are applied to new samples), under different predictor combinations and sample sizes.

We attempt to make three contributions to the personnel selection literature. First, we extend the concept of cross-validation from a context where only a single objective is optimized (as in multiple regression), to a context where multiple objectives are optimized. We distinguish between the notions of validity shrinkage (i.e., loss of predicted job performance) and diversity shrinkage (i.e., loss of predicted diversity). This distinction is important because we hypothesize that shrinkage covaries with optimization. That is, when two objectives are simultaneously considered, we expect that the extent of shrinkage on one objective (e.g., job performance) depends on the extent to which it is maximized in relation to the second objective (e.g., diversity). Such phenomena are central to understanding the generalizability and practical value of the Pareto-optimal weighting strategy. Second, we investigate the generalizability of Pareto-optimal weights by examining the extent to which different predictor combinations and sample sizes impact validity shrinkage and diversity shrinkage. If the cross-validity of Pareto-optimal weights shrinks drastically across various predictor combinations and sample sizes, the benefit of this strategy for addressing diversity concerns is constrained; the weights generated in a given sample would be less useful in making predictions in other samples, except in rare cases. In contrast, if Pareto-optimal weights demonstrate little shrinkage under certain conditions, then the strategy would be promising under those conditions. Thus, the current article seeks to ascertain the boundary conditions for when Pareto-optimal weighting may be used in organizational settings. Third, we compare the shrunken Pareto-optimal solutions with unit-weighted solutions. Unit-weighted solutions are often recommended (Bobko, Roth, & Buster, 2007; see Einhorn & Hogarth, 1975) and widely used in selection practice, and thus serve as a natural point of comparison. In addition, because unit-weighted solutions are not based on any kind of optimization technique, they have the desirable property that they do not demonstrate shrinkage when subject to cross-validation (Bobko et al., 2007).

In what follows, we first describe the concept of cross-validation as a way of assessing the generalizability of a predictive model, and very briefly summarize extant approaches to cross-validation in the personnel selection context (e.g., Cascio & Aguinis, 2005; Cattin, 1980; Raju, Bilgic, Edwards, & Fleer, 1999). Because cross-validation research in personnel selection has thus far focused only on single-objective optimization (e.g., maximizing expected job performance), we leave our discussion of shrinkage when multiple objectives are optimized until after we describe optimization and the Pareto technique. Then, we outline how we expect validity shrinkage and diversity shrinkage to occur, and change, across the entire Pareto-optimal surface. We conducted a simulation to examine the conditions under which Pareto-optimal weights demonstrate substantial cross-validity. To compare our Pareto shrinkage results against more widely known regression-based results, we first examined the degree of validity shrinkage and diversity shrinkage when only one objective was maximized. We then extended the results to multi-criterion optimization, and we also compared cross-validated Pareto-optimal solutions to unit-weighted solutions.

Shrinkage and Cross-Validation Using Multiple Regression Weights

Multiple regression, where predictors are weighted to maximize the variance accounted for (R^2) in a single criterion (e.g., job performance), is typically used because it provides the best linear unbiased estimates of the predictor weights for predicting the criterion. However, we must consider shrinkage when these weights from one sample are used to predict the criterion in another sample, or to generalize to the population (Larson, 1931; Wherry, 1931). Because the weights provided by any model-fitting procedure capitalize on the unique characteristics of the sample, these weights do not generate an optimal solution when applied to a different sample. That is, the predictive power of the model (i.e., the sample's squared multiple correlation coefficient, R^2) is optimized in the sample from which the weights were derived; but the model is overfitted (Yin & Fan, 2001) and shrinkage will likely occur when the weights are applied to a new sample. Shrinkage describes the extent to which the variance explained in the population or in a new sample is smaller than the variance explained in the original sample. Shrinkage parameters such as the squared population multiple correlation coefficient (ρ^2) and the squared cross-validated multiple correlation coefficient (ρ_c^2) index the relationship between the set of predictors and the criterion in the population, and in a sample from the population, respectively (see Appendix A for details).

The procedure for assessing shrinkage is referred to as cross-validation. There are two approaches to estimate cross-validity: an empirical approach and a statistical approach (Cascio & Aguinis, 2005). The empirical approach consists of fitting a regression model to data from a calibration sample, and then applying the obtained predictor weights to an independent, validation sample. The obtained result, ρ_c^2, does not suffer from overfitting and thus provides a generalizable estimate of the variance accounted for in the criterion by the predictors. The statistical approach uses formulas to adjust the squared sample multiple correlation coefficient, R^2, by the sample size (N) and the number of predictors (k) used to obtain the sample estimate (R^2). Larger adjustments are made when estimates are based on smaller samples, more predictors, or smaller coefficient values (Cattin, 1980). The statistical approach is generally preferred over the empirical approach as it provides results as accurate as the empirical approach, but more cost-effectively, given that a second sample is not required (Cascio & Aguinis, 2005).

The Pareto-Optimal Weighting Strategy

De Corte et al. (2007) proposed a Pareto-optimal weighting strategy as a way to systematically estimate (locally) optimal solutions. Under Pareto-optimal weighting, predictors are weighted to optimize two criteria: job performance and diversity. Instead of a single, globally optimal outcome (as with multiple regression), Pareto-optimal weights produce a frontier (e.g., see Figure 1) of possible selection outcomes simultaneously involving two criteria. This is represented as a downward-sloping trade-off curve with job performance (e.g., criterion validity) plotted against diversity (e.g., AI ratio). The negative slope of the Pareto curve in Figure 1 indicates that additional diversity can only be obtained at some cost in terms of job performance (cf. De Corte, 1999). A

Figure 1. Left panel: Point 1 is the unit weight solution, Point 2 is the regression weight solution (maximizing validity/job performance), and Points 2 through 11 correspond to 10 Pareto weight solutions (i.e., the Pareto-optimal trade-off surface). The horizontal dashed line indicates the diversity improvement obtained by using Pareto-optimal weights rather than unit weights. The example scenario uses 3 predictors (A, B, and C) and an overall measure of job performance. d values are .9, .7, and .1, respectively; test validities are .6, .5, and .5, respectively; and intercorrelations are rAB = .8, rBC = .6, and rAC = .4. Right panel: Predictor weighting scheme that corresponds to the Pareto-optimal trade-off surface. Lines marked 'A—A,' 'B—B,' and 'C—C' correspond to weights for Predictors A, B, and C, respectively. Reprinted from "More than g: Selection quality and adverse impact implications of considering second-stratum cognitive abilities," by S. Wee et al., 2014, Journal of Applied Psychology, 99, p. 552. Copyright 2013 by American Psychological Association. Reprinted with permission.
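The caption's example values are enough to reproduce the flavor of this trade-off surface numerically. The sketch below is a minimal brute-force illustration only, not the nonlinear-programming solver De Corte et al. (2007) actually use: it applies the standard composite formulas for a weighted sum of standardized predictors (composite validity = w·v / sqrt(w'Rw); composite subgroup d = w·d / sqrt(w'Rw)), grid-searches nonnegative weights, and keeps the nondominated points, with a lower composite d standing in for a better AI ratio. All variable and function names are ours.

```python
import math
from itertools import product

v = [0.6, 0.5, 0.5]        # test validities of predictors A, B, C (caption)
dv = [0.9, 0.7, 0.1]       # subgroup d values of A, B, C (caption)
R = [[1.0, 0.8, 0.4],
     [0.8, 1.0, 0.6],
     [0.4, 0.6, 1.0]]      # rAB = .8, rBC = .6, rAC = .4 (caption)

def composite(w):
    """(composite validity, composite subgroup d) of the weighted predictor sum."""
    sd = math.sqrt(sum(w[i] * R[i][j] * w[j] for i in range(3) for j in range(3)))
    return (sum(wi * vi for wi, vi in zip(w, v)) / sd,
            sum(wi * di for wi, di in zip(w, dv)) / sd)

# Point 1 in Figure 1: the unit-weight composite
val_unit, d_unit = composite([1.0, 1.0, 1.0])
print(round(val_unit, 3), round(d_unit, 3))   # 0.623 0.662

# Brute-force "frontier": nonnegative weights summing to 1, in steps of .05
grid = [(a / 20, b / 20, (20 - a - b) / 20)
        for a, b in product(range(21), repeat=2) if a + b <= 20]
points = sorted(set(composite(w) for w in grid))
frontier = [p for p in points
            if not any(q[0] >= p[0] and q[1] <= p[1] and q != p for q in points)]
frontier.sort()   # validity ascending; composite d rises with it (the trade-off)

# As the text notes for Point 1, some Pareto solution matches or beats the
# unit-weight composite on validity while yielding a lower composite d
print(any(p[0] >= val_unit and p[1] <= d_unit for p in frontier))   # True
```

In this toy grid the nondominated set behaves like the figure's curve: sorted by composite validity, composite d (hence adverse impact potential) can only stay level or rise, which is exactly the downward-sloping validity-diversity trade-off described above.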

steeper slope of the trade-off curve would indicate that the job performance cost is high for obtaining additional diversity, whereas a flatter slope would indicate that the job performance cost is low for obtaining additional diversity.

Figure 1 shows the Pareto-optimal trade-off curve (left panel) and the corresponding weights for each of three predictors: A, B, and C (right panel). As shown in Figure 1 (left panel), the trade-off curve represents a range of possible selection outcomes, in terms of both job performance and diversity. Each point on the curve represents a unique solution given by a set of Pareto-optimal predictor weights. As an example, the set of weights corresponding to Point 2 in Figure 1 (left panel) shows a Pareto-optimal solution where job performance is maximized (that is, where only job performance was maximized, and diversity was not considered). Similarly, Point 11 in Figure 1 (left panel) shows a Pareto-optimal solution where diversity is maximized; job performance was not considered. In contrast, Point 6 in Figure 1 (left panel) shows a Pareto-optimal solution where job performance and diversity objectives were both considered—corresponding to a selection scenario where job performance and diversity are both explicitly valued. It should be noted that every point along the Pareto-optimal curve (i.e., Points 2 through 11) is locally optimal, whereas points below the curve (e.g., Point 1) are suboptimal. That is, for Point 1 (which corresponds to unit-weighting), there will be a Pareto-optimal solution with the same level of job performance, but with higher levels of diversity. Thus, for Point 1, diversity improvements could be made by using Pareto-optimal rather than unit weights.

Shrinkage and Cross-Validation Using Pareto-Optimal Weights

As highlighted earlier, the predictor weights provided by any model-fitting procedure capitalize on the unique characteristics of the sample. Thus, as with multiple regression, variance accounted for under Pareto-optimal weighting should also be adjusted for shrinkage. Although a statistical approach would be preferred over an empirical approach, the currently available shrinkage formulas are based on least-squares assumptions (Dorans & Drasgow, 1980) that do not form the basis for Pareto-optimization. Therefore, in the current study we examined the issue of shrinkage in Pareto weights by using an empirical approach (i.e., Monte Carlo simulation).

When using Pareto-optimal weighting to reduce adverse impact, the weights simultaneously optimize both job performance and diversity. Because both objectives are optimized by the given set of Pareto weights (although the two objectives are achieved to differing extents at different points on the Pareto frontier), the issue of shrinkage is more complicated than in the single objective, multiple regression case. That is, optimizing two objectives at the same time complicates shrinkage. For each set of Pareto weights, we must consider the degree of shrinkage on both of the objectives being optimized. Specifically, we refer to shrinkage on the job performance objective as validity shrinkage, and to shrinkage on the diversity objective as diversity shrinkage; and we assess each in turn. As illustrated in Figure 1, Points 2 and 11 indicate Pareto-optimal solutions where only one objective is maximized

(i.e., job performance is maximized at Point 2, and diversity is Table 1 maximized at Point 11). Shrinkage in these situations should be Bobko-Roth Predictor Intercorrelations, Black–White Subgroup akin to shrinkage in the multiple regression case. When maximiz- ds, and Criterion Validities ing job performance (e.g., Point 2 in Figure 1), we expect consid- erable validity shrinkage and negligible diversity shrinkage. Like- Variable 1 2 3 4 5 wise, when maximizing diversity (Point 11 in Figure 1), we expect Roth et al., 2011 predictors considerable diversity shrinkage and negligible validity shrinkage. 1. Biodata 1.00 This is because—at the endpoints of the Pareto curve—the pre- 2. Cognitive ability .37 1.00 3. Conscientiousness .51 .03 1.00 dictor weights should be overfitted for the maximized objective, 4. Structured interview .16 .31 .13 1.00 but not directly overfitted for the ignored objective. 5. Integrity test .25 .02 .34 Ϫ.02 1.00 By contrast, Points 3 through 10 in Figure 1 indicate solutions Criterion validities .32 .52 .22 .48 .20 Black–White subgroup d .39 .72 Ϫ.09 .39 .04 where both objectives are being valued. Whereas Point 3 indicates a solution where job performance is maximized to a greater extent Note. Meta-analytic predictor inter-correlation matrix for all five predic- than diversity is, and Point 10 indicates a solution where diversity tors was obtained from Roth, Switzer, Van Iddekinge, and Oh (2011; pp. 918–919, Table 6). Criterion validities for biodata, cognitive ability, Con- is maximized to a greater extent than job performance is, Points 6 scientiousness, and structured interview were obtained from Roth et al. and 7 indicate solutions where job performance and diversity are (2011; p. 915, Table 5). We used an updated estimate of integrity test considered similarly important and maximized to a similar extent. 
validity provided by Van Iddekinge, Roth, Raymark, and Odle-Dusseau ␳ϭ Extending the concept of shrinkage to these situations, we expect (2012; p. 511, Table 2; .18), which we then corrected for direct range restriction using u ϭ .90 (Van Iddekinge et al., 2012, p. 516). In all, the greater validity shrinkage at Point 3 than at Point 10, and greater validity and predictor intercorrelation input values were based on job diversity shrinkage at Point 10 than at Point 3. More generally, we applicant information and were corrected for range restriction and criterion expect shrinkage to be proportionally larger on the more valued unreliability only. Subgroup d values for predictors were obtained from Bobko and Roth (2013; pp. 94–96, Tables 1 and 2). Job performance objective (i.e., the objective to which more importance is attached Black–White subgroup d was .38 (McKay & McDaniel, 2006; p. 544, when selecting a point on the Pareto curve). As such, to the extent Table 2). diversity is being maximized, we expect greater diversity shrink- age. As reviewed earlier, cross-validation research using multiple Research Question 1: What is the extent of validity shrinkage regression indicates greater shrinkage should be expected when and diversity shrinkage along the entire frontier of the Pareto- estimates are based on small samples (N), a large number of optimal trade-off curve, as a function of (a) sample size (i.e., predictors (k), and predictor composites with low predictive power number of applicants in the calibration sample) and (b) type of (low R2; Cattin, 1980). We expect these same factors would also predictor combination (i.e., cognitive-noncognitive combina- influence the degree of shrinkage in Pareto-optimal solutions. tion vs. a combination of cognitive subtests)? 
What remains unknown, however, is the extent to which each of Research Question 2: Is validity shrinkage larger when job these factors independently influences validity shrinkage and di- performance is maximized, and is diversity shrinkage larger versity shrinkage, along the entire frontier of Pareto-optimal se- when diversity is maximized? lection solutions. To ensure the simulation conditions we considered were appli- Research Question 3: When are unit weights outperformed by cable to actual selection contexts, we examined Pareto-optimal cross-validated Pareto-optimal weights? That is, under what shrinkage using two distinct sets of predictor combinations: (a) a conditions (sample size and predictor combination) would set of cognitive and noncognitive predictors comprising a cogni- cross-validated Pareto-optimal weights result in greater diver- tive ability test, a biodata form, a conscientiousness measure, a sity improvement than unit weights? structured interview, and an integrity test (see Table 1; similar to that used by De Corte et al., 2008); and (b) a set of cognitive ability Method subtest predictors comprising tests of verbal ability, mathematical In the current study, validity shrinkage and diversity shrinkage This document is copyrighted by the American Psychological Association or one of its alliedability, publishers. technical knowledge, and clerical speed (see Table 2;as of Pareto-optimal weights were investigated via a Monte Carlo This article is intended solely for the personal use ofused the individual user and is not to beby disseminated broadly. Wee et al., 2014). We chose these particular predictor simulation with two factors: calibration sample size and predictor combinations because, to our knowledge, they are the primary sets combination. 
The design incorporates five calibration sample sizes of predictor combinations for which Pareto-optimization has pre- (N ϭ 40, 100, 200, 500, and 1,000), and two predictor combina- viously been demonstrated. Our study thus provides an important tions: (a) a set of cognitive and noncognitive predictors (we refer test of the extent to which these published results would generalize to these as “Bobko-Roth predictors”; see Table 1 for details), and under cross-validation. (b) a set of cognitive subtest predictors (from Wee et al., 2014; see Given that shrinkage is affected by sample size, and that sample Table 2 for details). sizes vary widely across selection contexts, we also examined the When demonstrating diversity shrinkage on the cognitive sub- extent of validity shrinkage and diversity shrinkage across a range test predictors, a helpful reviewer encouraged us to limit our of sample sizes. Finally, we investigated the conditions under illustration to only two prototypical job families from the original which cross-validated Pareto-optimal solutions reliably outper- seven job families investigated by Wee et al. (2014). We chose two formed unit-weighted solutions. In summary, we addressed the job families that theoretically should differ in how much diversity following research questions: improvement is possible. First, we chose one job family (Machin- DIVERSITY SHRINKAGE 1641

ery Repair) as a typical scenario in which the predictor validities and predictor subgroup differences (d) were positively related, across predictors (i.e., positive validity-d correspondence; for example, for machinery repair jobs, clerical speed tests show the lowest criterion validity and the lowest subgroup d). For the sake of comparison, we also chose a second job family (Control & Communication) to represent the scenario where predictor validities and predictor subgroup differences (d) were negatively related, across predictors (i.e., negative validity-d correspondence; for example, for control & communication jobs, technical knowledge tests show the lowest criterion validity and the highest subgroup d). To elaborate, positive validity-d correspondence scenarios should logically result in smaller diversity improvements (because high-impact predictors receive more weight as performance is maximized), whereas negative validity-d correspondence scenarios should result in larger diversity improvements (because high-impact predictors tend not to receive more weight as performance is maximized). Note that even under positive validity-d correspondence, the shrunken diversity improvement due to Pareto weighting is still expected to be greater than zero.

² We note that one of the job families that was removed from the manuscript, the Weapons Operation job family, exhibited a set of shrunken Pareto weights that were similar to—but no better than—unit weights. In other words, for one of the seven ASVAB job families from Wee et al. (2014), unit weights were not suboptimal. This likely occurred because the bivariate criterion validities (as well as the optimal regression weights) were very similar across predictors (i.e., approximately equally weighted) for the Weapons Operation job family.

In total, we examined 15 simulation conditions [5 sample sizes × 3 predictor conditions (i.e., Bobko-Roth predictors, plus cognitive subtest predictors for two job families)]. For each condition, we conducted 1,000 replications. For example, for the N = 40 condition using the Bobko-Roth predictors, we generated 1,000 calibration samples of size 40. In each replication, predictor weights were generated in a calibration sample, and then applied to an independent validation sample. The resulting cross-validated criterion validity and diversity were obtained. To estimate validity shrinkage, we subtracted the cross-validated criterion validity from the calibration sample criterion validity. Similarly, to estimate diversity shrinkage, we subtracted the cross-validated diversity value from the calibration sample diversity value. Finally, to estimate the diversity improvement possible when using Pareto-optimal weights, we compared the cross-validated diversity values with diversity values obtained when using unit weights.

Table 2
Cognitive Subtests Predictor Intercorrelations, Black–White Subgroup ds, and Criterion Validities

Variable                                   1      2      3      4
ASVAB cognitive subtests
  1. Verbal ability                       1.00
  2. Mathematical ability                  .79   1.00
  3. Technical knowledge                   .78    .68   1.00
  4. Clerical speed                        .46    .64    .23   1.00
Criterion validities (by job families)
  Machinery repair                         .52    .58    .54    .36
  Control & communication                  .40    .55    .37    .43
Black–White subgroup d                     .91    .69   1.26    .06

Note. Values obtained from Wee et al. (2014). Performance Black–White subgroup d was .30. The predictor intercorrelation matrix, criterion validities, and subgroup ds for the cognitive subtests condition (i.e., verbal ability, mathematical ability, technical knowledge, clerical speed) come from the ASVAB matrix in Wee et al. (2014).

Procedure

The procedure includes eight steps. The simulation was conducted using the TROFSS program (De Corte, 2006) and the R programming language (R version 3.2.0; R Core Team, 2015). [For future users, we also developed an easy-to-use R package (https://github.com/Diversity-ParetoOptimal/ParetoR), as well as a simple web application (i.e., a ShinyApp), that both implement the Pareto-optimal weighting technique described in the current paper (see Appendix C).]

1. Generate calibration samples. One thousand calibration samples were generated for each condition, based on each corresponding population correlation matrix (Tables 1 and 2). For example, samples of a given size (e.g., N = 40) were drawn from a multivariate normal population distribution, and this process was repeated 1,000 times to form 1,000 calibration sample data sets. Each calibration sample data set consisted of predictor scores, job performance scores, and a binary variable indicating race. The sample correlation matrix and subgroup ds were then calculated for each calibration sample.

We note that race is represented as a binary variable in the current simulation. Thus, to maintain multivariate normality during the simulation step where we generate samples from the population correlation matrix, we needed to first convert the population subgroup ds into population correlations that included race. Then we generated the sample data based on the population correlation matrix, and lastly converted the sample race correlations back into sample subgroup ds. Specifically, the following steps were used to generate the samples.

First, subgroup ds (i.e., for race) were converted into point-biserial correlations using the following formula (see Lipsey & Wilson, 2001):

r_pb = d / √(d² + 1/[p(1 − p)])    (1)

where p is the proportion of minority applicants (e.g., Black applicants), (1 − p) is the proportion of majority applicants (e.g., White applicants), and d is the standardized mean difference between the majority and minority applicant groups.

Second, point-biserial correlations were converted into biserial correlations using the following formula (see Bobko, 2001, p. 38):

r_bis = r_pb · √[p(1 − p)] / λ    (2)

again where p is the proportion of minority applicants, (1 − p) is the proportion of majority applicants, and λ is the height (i.e., density) of the standard normal distribution at the point where the sample is divided into majority and minority applicant groups.

Third, the race biserial correlations were included in the population correlation matrix and random samples were drawn from a multivariate normal distribution. Each sample consists of predictor scores, job performance scores, and a continuous variable for race.

Fourth, the continuous race variable was dichotomized. To achieve this, the sample was first sorted based on values of the continuous race variable. Then, the first (1 − p) × N simulated applicants were labeled "2" (indicating White applicant), with the

rest of the simulated applicants labeled "1" (indicating Black applicant). This produced a binary variable representing race. Finally, we calculated the sample subgroup d for each predictor from the raw data in each sample. As a check on the accuracy of the simulation procedures, the current simulation produced sample correlation matrices and sample subgroup d values whose average was similar—within .01 on average—to the population correlation matrices and population subgroup ds shown in Tables 1 and 2. The current simulation procedure contrasts with previous procedures that have been used to generate separate multivariate normal distributions for each race (e.g., De Corte et al., 2007; Newman & Lyon, 2009). Generating separate distributions for each race has the drawback that a 2-population mixture distribution is not necessarily normally distributed, and further might produce overall sample correlations across the two groups that are systematically different from the population correlations in Tables 1 and 2 (see Ostroff, 1993).³

³ At a reviewer's request, we also ran the simulations by generating two separate normal distributions (one for each race) that differed only in their means (i.e., same as De Corte et al., 2007). The results of the current study were nearly identical under the two procedures for simulating race: (a) the biserial method—in which data are simulated from a single continuous distribution that includes race as a variable—versus (b) the two-distribution method—in which data are simulated from two separate distributions. Results of this comparison are available from the first author. Our reviewer further noted that there is a conceptual point to be made here about one's theoretical definition of race. Under the two-distribution method, race is handled as a natural dichotomy; whereas under our current biserial method, race is treated as a reflection of a continuous underlying distribution of racial identification, racialized experience, and racial image (which gets dichotomized when race is measured, and when one considers adverse impact—consistent with our simulation).

2. Generate validation samples. Validation samples were generated in the same manner as calibration samples; 1,000 validation samples were generated. However, validation sample size was always set at N = 10,000. We chose an extremely large sample size to minimize the effect of random sample-to-sample variation due to validation sample size. A helpful reviewer also pointed out that our validation sample size of 10,000 is sufficient to make results reported to the second decimal place meaningful (Bedeian, Sturman, & Streiner, 2009). Thus, in our simulation, we were able to closely estimate the shrinkage from a calibration sample to the true local population.

3. For each replication, calculate calibration sample Pareto-optimal predictor weights, criterion validity, and diversity. The calibration sample correlation matrices and subgroup ds obtained in Step 1 were used as input for the TROFSS program (De Corte, 2006). The TROFSS program also requires as input the overall selection ratio (SR), the proportion of minority applicants (p), and the number of points along the Pareto-optimal curve to be generated (these points are spread evenly along the curve). To be as consistent as possible with previous studies (De Corte et al., 2007; Wee et al., 2014), we set SR at .15 and the number of Pareto points at 21. We used p = .167 for the current simulations (i.e., 1/6 of applicants are Black), which is between the p = 1/4 example from De Corte et al. (2007, p. 1384) and the p ≈ 1/8 used by De Corte et al. (2008, p. 187).⁴ There were 21 sets of Pareto-optimal weights estimated for each calibration sample. For a given set of weights, the calibration sample criterion validity (denoted R_perf.cal) is given by the correlation between the weighted predictor composite scores and the job performance scores.⁵ Similarly, the calibration sample diversity (denoted AI Ratio_cal; which is monotonically related to the calibration sample race point-biserial multiple correlation, denoted R_race.cal) is given by the correlation between the weighted predictor composite scores and the binary race variable. Specifically, the AI ratio can be directly obtained from R_race.cal using the following formulas from Newman, Jacobs, and Bartram (2007, p. 1404), given that SR and p are known for each condition:

d = r / √[(1 − r²)·p·(1 − p)]

x_cut = z_cut·√(1 + d²[p(1 − p)]) − d·p

AI ratio = SR_Black / SR_White
         ≈ [1.64·x_cut + √(.76·x_cut² + 4)] / {e^[(d² + 2·x_cut·d)/2] · [1.64·(x_cut + d) + √(.76·(x_cut + d)² + 4)]}

That is, for each calibration sample, the inputs to the simulation include: (a) the correlation matrix among predictors and job performance, (b) the standardized Black–White subgroup difference on each of the predictors, (c) the overall selection ratio (SR), and (d) the proportion of minority applicants (p). The weighted predictor composite score is obtained by using the Pareto-optimal weights on the individual predictors (e.g., biodata, cognitive ability, etc.). (We should be clear that the predictor composite only includes preemployment tests; it does not include minority status of the candidates.) The Pareto-optimal predictor composites were then correlated with job performance to obtain the calibration sample criterion validities (i.e., R_perf.cal), and correlated with applicant minority status (race) to obtain the calibration sample point-biserial multiple correlations (i.e., R_race.cal), which were subsequently converted to AI Ratio_cal using the formula above.

⁴ In response to a reviewer comment, we conducted supplementary simulations to evaluate whether the Pareto curves and shrunken Pareto curves change when there are changes in (a) the selection ratio, and (b) the proportion of applicants from the minority group. These supplementary simulation results are presented in Appendix B. Results suggest that: (a) as the selection ratio decreases, available solutions will provide lower AI ratios, and diversity shrinkage will increase, and (b) as the proportion of applicants from the minority group decreases, available solutions will provide slightly higher AI ratios, and diversity shrinkage will also increase. As such, results of the current simulation might overstate the magnitude of diversity shrinkage compared to conditions with higher selection ratios and higher proportions of minority applicants, and might understate diversity shrinkage compared to conditions with lower selection ratios and lower proportions of minority applicants.

⁵ Similar to Ree, Earles, and Teachout (1994), we provide estimates in the R rather than the R² metric for criterion-related validity.

4. For each replication, apply calibration sample Pareto-optimal predictor weights to the validation sample, to obtain cross-validated (shrunken) criterion validity and diversity estimates. The calibration sample Pareto-optimal predictor weights obtained in Step 3 were applied to the validation sample raw data obtained in Step 2. For each of the 21 sets of predictor weights for each sample, the cross-validated criterion validity (R_perf.val) is estimated as the correlation, in the validation sample, between the weighted predictor composite scores and the job performance scores. Similarly, the cross-validated diversity (AI Ratio_val; which is monotonically related to the cross-validated

race point-biserial multiple correlation, R_race.val) is estimated as the correlation, in the validation sample, between the weighted predictor composite scores and the binary race variable, at the given SR and p (as described in Step 3 above).

5. For each replication, calculate unit-weighted criterion validity and diversity in the validation sample. Step 4 was repeated, using unit weights for the predictors instead of using the calibration sample Pareto-optimal weights. By giving each predictor a weight of 1.0, the unit-weighted criterion validity (R_unit) and diversity (AI Ratio_unit) were obtained.

6. Aggregate data across replications. We aggregated the data to obtain the calibration and validation samples' mean Pareto-optimal curves. To do so, we averaged across the 1,000 replications for each of the 21 Pareto points. For example, the mean calibration sample criterion validity (mean R_perf.cal) at the Pareto point where criterion validity is maximized—that is, the first or leftmost point on the Pareto-optimal curve—is calculated by averaging the criterion validities obtained when criterion validity is maximized, across the 1,000 calibration samples. The mean calibration sample diversity (mean AI Ratio_cal) at the Pareto point where criterion validity is maximized—also the first point on the Pareto curve—is calculated by first averaging the race point-biserial multiple correlations (R_race.cal) obtained when criterion validity is maximized, across the 1,000 calibration samples, then converting the mean R_race.cal to the mean AI Ratio_cal.⁶ This procedure was repeated for each of the 21 Pareto points in the calibration sample, and then again for each of the 21 Pareto points in the validation sample. We also obtained the mean unit-weighted criterion validity (mean R_perf.unit) and diversity (mean AI Ratio_unit), across replications.

⁶ We note that our procedure of averaging the multiple correlation (R_race) across replications, rather than averaging the AI ratio across replications, has a distinct advantage. Our procedure minimizes the influence of extreme AI ratios on our results, which would have been obtained for individual replications where the denominator of the AI ratio was close to zero. The race multiple correlation (R_race) is bounded on the interval [−1, +1], whereas the AI ratio has no upper bound. Extremely large and implausible AI ratios sometimes occur, especially when simulations are based on small sample sizes.

7. Plot the Pareto-optimal trade-off curves and unit-weighted solutions. Each Pareto-optimal curve (defined by 21 Pareto solutions) was graphed, by plotting the mean criterion validities (i.e., the job performance objective) against the mean AI ratios (i.e., the diversity objective). For each condition, separate curves were plotted for the calibration and validation samples. The unit-weighted solution was obtained by plotting the mean unit-weighted criterion validity (mean R_perf.unit) against the mean unit-weighted diversity (mean AI Ratio_unit).

8. Calculate validity shrinkage and diversity shrinkage. For each point along the Pareto-optimal curve, the amount of validity shrinkage (ΔR) was calculated by subtracting the mean criterion validity of the validation sample from the mean criterion validity of the calibration sample. Likewise, the amount of diversity shrinkage (ΔAI ratio) was calculated by subtracting the mean AI ratio of the validation sample from the mean AI ratio of the calibration sample.

Results

The Pareto-optimal trade-off curves for different sample size conditions are presented in the "rainbow graphs" in Figures 2, 3 and 4. The Pareto curves for the Bobko-Roth predictors are presented in Figure 2, and the curves for the cognitive subtest predictors, for each of the two job families, are presented in Figures 3 and 4.⁷ To understand how to interpret these figures, look at the example graph in Figure 5. In Figure 5, the Pareto curve for the calibration sample (i.e., solid line) and the Pareto curve for the validation sample (i.e., dotted line) are shown. The Pareto curve for the validation sample (dotted line in Figure 5) is the shrunken Pareto curve. That is, each shrunken Pareto curve is the set of cross-validated Pareto-optimal solutions (i.e., the criterion validities and diversity/AI ratio values obtained by applying the calibration sample Pareto-optimal predictor weights to the validation sample).

⁷ Results for the other five ASVAB job families from Wee et al. (2014) may be obtained from the first author.

In Figure 5, the fact that the validation Pareto curve consistently lies below and to the left of the calibration Pareto curve means that there is shrinkage in both criterion validity (expected job performance) and diversity (expected AI ratio) across the entire range of the Pareto curve. We use the term Pareto shrinkage to describe this shift of the Pareto curve (i.e., the shift of each Pareto solution, or each Pareto point) to a lower expected value of both criterion validity and diversity. In Figure 5, the diagonal line from calibration solution Point 4 to validation solution Point 4 is the Pareto shrinkage for solution Point 4. It is important to realize that Pareto shrinkage can be decomposed into two component vectors: (a) validity shrinkage—which is the cross-validation shrinkage along the vertical (job performance or criterion validity) axis, and (b) diversity shrinkage—which is the cross-validation shrinkage along the horizontal (diversity or AI ratio) axis. To restate, for a given sample size (e.g., N = 40), validity shrinkage is indicated by the vertical distance (e.g., dashed vertical line in Figure 5) between the criterion validities of the calibration (e.g., solid curved line) and validation (e.g., dotted curved line) samples for the same Pareto point (e.g., solution Point 4). Similarly, diversity shrinkage is indicated by the horizontal distance (e.g., dashed horizontal line in Figure 5) between the diversity values (i.e., AI ratios) of the calibration and validation samples (e.g., horizontal distance between solid and dotted curved lines) for the same Pareto point (e.g., solution Point 4). In Figure 5's illustration, validity shrinkage and diversity shrinkage for Point 4 are two component vectors of Pareto shrinkage. In Figures 2 through 4, in addition to plotting the calibration and validation Pareto curves, we also plotted the unit-weighted solution (i.e., a black square).

Sample Size Effects on Validity Shrinkage and Diversity Shrinkage

Research Question 1 involves sample size effects on diversity shrinkage. Figures 2 through 4 showed consistent and intuitive findings regarding sample size: validity shrinkage and diversity shrinkage both decreased as sample size increased. For the Bobko-Roth (see Figure 2) and cognitive subtest (Figures 3 and 4) predictors, validity shrinkage when N = 40 (i.e., vertical distances between the solid and dotted lines) was substantially larger than validity shrinkage when N = 1,000 (i.e., vertical distances between the solid and dotted lines), at all points along the Pareto-optimal curve. The same pattern of results held for diversity shrinkage. For

the Bobko-Roth (see Figure 2) and cognitive subtest (Figures 3 and 4) predictors, diversity shrinkage when N = 40 (i.e., horizontal distances between the solid and dotted lines) was larger than when N = 1,000, for all points along the Pareto-optimal curve.

In general, the incremental benefit of sample size was also greater when sample sizes were small (i.e., when attempting to control shrinkage, there are diminishing returns to sample size). That is, increasing sample size from 40 to 100 ameliorated shrinkage more than increasing sample size from 100 to 200, because of a nonlinear relationship between shrinkage and sample size.⁸ One part of the sample size results that might not be entirely intuitive is the thresholds for calibration sample size that render shrinkage negligible, which we discuss in the sections below.
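This sample-size pattern is easy to reproduce in miniature. The sketch below is illustrative only: it is written in Python rather than the study's R/TROFSS pipeline, uses two synthetic predictors with arbitrary population values, and substitutes ordinary least-squares weights for Pareto-optimal weights (both are tuned to the calibration sample, which is what produces shrinkage). Weights estimated in a small calibration sample capitalize on sampling error, so the weighted composite's validity drops in a large validation sample, and the drop diminishes as calibration N grows.

```python
import math
import random

def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def draw_sample(n, rng):
    """Two correlated predictors and a criterion (population values arbitrary)."""
    x1s, x2s, ys = [], [], []
    for _ in range(n):
        z = rng.gauss(0, 1)                      # shared factor creates predictor correlation
        x1 = z + rng.gauss(0, 1)
        x2 = 0.5 * z + rng.gauss(0, 1)
        y = 0.4 * x1 + 0.2 * x2 + rng.gauss(0, 1)
        x1s.append(x1); x2s.append(x2); ys.append(y)
    return x1s, x2s, ys

def ols_weights(x1, x2, y):
    """Least-squares weights for two predictors (2x2 normal equations)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    a = [v - m1 for v in x1]; b = [v - m2 for v in x2]; c = [v - my for v in y]
    s11 = sum(v * v for v in a); s22 = sum(v * v for v in b)
    s12 = sum(p * q for p, q in zip(a, b))
    s1y = sum(p * q for p, q in zip(a, c)); s2y = sum(p * q for p, q in zip(b, c))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def mean_validity_shrinkage(n_cal, reps=200, seed=7):
    """Average (calibration R) - (cross-validated R) over many replications."""
    rng = random.Random(seed)
    vx1, vx2, vy = draw_sample(10_000, rng)      # one large validation sample
    total = 0.0
    for _ in range(reps):
        cx1, cx2, cy = draw_sample(n_cal, rng)
        w1, w2 = ols_weights(cx1, cx2, cy)
        r_cal = corr([w1 * a + w2 * b for a, b in zip(cx1, cx2)], cy)
        r_val = corr([w1 * a + w2 * b for a, b in zip(vx1, vx2)], vy)
        total += r_cal - r_val                   # validity shrinkage for this replication
    return total / reps
```

With this setup, mean shrinkage at a calibration N of 40 is noticeably larger than at a calibration N of 400, mirroring the diminishing-returns pattern described above.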

Validity Shrinkage and Diversity Shrinkage When One Objective Is Maximized

As shown across both the Bobko-Roth (see Figure 2) and the cognitive subtest (Figures 3 and 4) predictors, the degree of validity shrinkage and diversity shrinkage changes from one end of the Pareto curve to the other. For Research Question 2, we are interested in whether validity shrinkage is larger when job performance is being maximized, and whether diversity shrinkage is larger when diversity is being maximized. To answer Research Question 2, we refer to Tables 3 and 4. To highlight the differences in validity shrinkage and diversity shrinkage when each objective was maximized, we compared the criterion validities and diversity values obtained at the endpoints of the Pareto-optimal curve. These results are presented for the Bobko-Roth predictors in Table 3, and for the cognitive subtest predictors (Machinery Repair and Control & Communication) in Table 4. The top halves of Tables 3 and 4 show the criterion validity (R), validity shrinkage (ΔR), diversity value (i.e., AI ratio), and diversity shrinkage (ΔAI ratio) obtained when job performance (i.e., criterion validity) was maximized. The bottom halves of Tables 3 and 4 show the same parameters, but obtained under conditions when diversity was maximized. In each section (top half and bottom half) of Tables 3 and 4, the criterion validities and diversity values obtained in the calibration and validation samples are provided, for each calibration sample size (i.e., from N = 40 to 1,000).

Figure 3. Validity and diversity shrinkage (original/calibration and shrunken Pareto curves) for Machinery Repair job family using cognitive subtest predictors (Wee et al., 2014). See the online article for the color version of this figure.
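For reference, the diversity values compared in Tables 3 and 4 are AI ratios obtained from composite-race correlations via the Method-section formulas (Equations 1 and 2, and the Newman, Jacobs, and Bartram, 2007, approximation). The following is a minimal Python sketch of those conversions (the study itself used R; the function names are illustrative, and z_cut is assumed here to be the standard-normal cut score implied by the overall selection ratio, i.e., the (1 − SR) quantile):

```python
import math
from statistics import NormalDist

def d_to_point_biserial(d, p):
    """Equation 1 (Lipsey & Wilson, 2001): subgroup d -> point-biserial r."""
    return d / math.sqrt(d ** 2 + 1.0 / (p * (1.0 - p)))

def point_biserial_to_biserial(r_pb, p):
    """Equation 2 (Bobko, 2001, p. 38): point-biserial r -> biserial r.
    The divisor is the standard-normal density at the minority/majority split."""
    lam = NormalDist().pdf(NormalDist().inv_cdf(p))
    return r_pb * math.sqrt(p * (1.0 - p)) / lam

def ai_ratio(r_race, sr, p):
    """AI ratio (SR_Black / SR_White) from the composite-race point-biserial
    correlation, per Newman, Jacobs, & Bartram (2007, p. 1404)."""
    d = r_race / math.sqrt((1.0 - r_race ** 2) * p * (1.0 - p))
    z_cut = NormalDist().inv_cdf(1.0 - sr)       # overall cut score (an assumption here)
    x_cut = z_cut * math.sqrt(1.0 + d ** 2 * p * (1.0 - p)) - d * p
    num = 1.64 * x_cut + math.sqrt(0.76 * x_cut ** 2 + 4.0)
    den = math.exp((d ** 2 + 2.0 * x_cut * d) / 2.0) * (
        1.64 * (x_cut + d) + math.sqrt(0.76 * (x_cut + d) ** 2 + 4.0))
    return num / den
```

A composite uncorrelated with race (r_race = 0) yields an AI ratio of 1.0; as the composite-race correlation grows (i.e., the majority group scores higher on the composite), the AI ratio falls below 1, which is why weighting high-d predictors more heavily depresses the diversity objective.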
The amount of validity shrinkage (ΔR) was calculated by subtracting the mean criterion validity obtained in the validation sample (i.e., cross-validated validity) from the mean criterion validity obtained in the calibration sample. The amount of diversity shrinkage (ΔAI ratio) was calculated by subtracting the mean AI ratio obtained in the validation sample (i.e., cross-validated AI ratio) from the mean AI ratio obtained in the calibration sample.
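To make these definitions concrete, here is the arithmetic for one cell of the results, the diversity-maximized endpoint for the Bobko-Roth predictors at N = 40 (values as reported in Table 3); the same subtraction logic also gives the diversity improvement that survives cross-validation relative to unit weighting:

```python
# Table 3, Bobko-Roth predictors, endpoint where diversity is maximized, N = 40
unit_weighted_ai = 0.63     # unit-weighted AI ratio
calibration_ai = 2.15       # calibration sample AI ratio (optimistic)
cross_validated_ai = 0.91   # AI ratio after applying the weights to the validation sample

diversity_shrinkage = calibration_ai - cross_validated_ai    # ΔAI ratio, as tabled
diversity_improvement = cross_validated_ai - unit_weighted_ai

print(f"shrinkage = {diversity_shrinkage:.2f}")              # 1.24
print(f"improvement over unit weights = {diversity_improvement:.2f}")
```

Although the calibration sample overstates the attainable AI ratio by 1.24, the shrunken solution still exceeds the unit-weight AI ratio by .28, the pattern behind the conclusion that diversity improvement is often possible despite diversity shrinkage.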

For the Bobko-Roth (see Table 3) and cognitive subtest (see Table 4) predictors, the results indicated greater shrinkage on the maximized objective. Specifically, diversity shrinkage was larger when diversity was maximized (i.e., ΔAI ratio values in the bottom halves of Tables 3 and 4), as compared with when criterion validity was maximized (i.e., ΔAI ratio values in the top halves of Tables 3 and 4). For example, for the Bobko-Roth predictors (Table 3, when N = 40), diversity shrinkage was ΔAI ratio = 1.24 when diversity was maximized, but only ΔAI ratio = 0.00 when criterion validity was maximized. For the cognitive subtest predictors (Table 4, when N = 40), diversity shrinkage was ΔAI ratio = 0.05 (Machinery Repair) and 0.03 (Control & Communication) when diversity was maximized, but was smaller at ΔAI ratio = 0.01 (Machinery Repair) and 0.00 (Control & Communication) when criterion validity was maximized.

Likewise, validity shrinkage was larger when criterion validity was maximized (i.e., ΔR values in the top halves of Tables 3 and 4), as compared with when diversity was maximized (i.e., ΔR values in the bottom halves of Tables 3 and 4). For example, for the Bobko-Roth predictors (Table 3, when N = 40), validity shrinkage was ΔR = .06 when criterion validity was maximized, and only ΔR = −.02 when diversity was maximized. Similarly, for the cognitive subtest predictors (Table 4, when N = 40), validity shrinkage was ΔR = .04 (for both Machinery Repair and Control & Communication job families) when criterion validity was maximized, but was smaller at ΔR = −.01 (for both Machinery Repair and Control & Communication job families) when diversity was maximized.

To summarize our answer to Research Question 2, when examining Table 3 (Bobko-Roth predictors) and Table 4 (cognitive subtest predictors), we see negligible amounts of shrinkage on the objective that was not maximized. That is, for both sets of predictors, there was little to no validity shrinkage when diversity was maximized, and little to no diversity shrinkage when criterion validity/job performance was maximized. To restate, diversity shrinkage is greatest when diversity is maximized, and validity shrinkage is greatest when validity is maximized.

Figure 2. Validity and diversity shrinkage (original/calibration and shrunken Pareto curves) for Bobko-Roth predictors using meta-analytically derived values presented in Table 1. See the online article for the color version of this figure.

Figure 5. Example scenario demonstrating decomposition of Pareto shrinkage into diversity shrinkage and validity shrinkage. Pareto shrinkage describes the shift of the calibration sample Pareto curve (i.e., each of the 21 Pareto solutions on the solid line) to a lower expected value for both criterion validity and diversity (i.e., each of the 21 Pareto solutions on the dotted line; obtained in the validation sample). Pareto shrinkage can be further decomposed into independent components of validity shrinkage (along the vertical axis) and diversity shrinkage (along the horizontal axis). See the online article for the color version of this figure.

⁸ As noted by Bobko (2001, p. 244), in Wherry's (1931) classic shrinkage formula, the only factor by which validity (R²) is reduced during shrinkage is (n − 1)/(n − k − 1)—that is, the ratio of df_total to df_error, which is roughly equivalent to 1/(1 − (k/n)). In other words, increasing n has a much bigger effect on shrinkage (i.e., a big effect on 1/(1 − (k/n))) when n is small.

Validity Shrinkage and Diversity Shrinkage Along the Entire Pareto-Optimal Curve

Aside from the endpoints of the Pareto curve, we note that results in the middle of the curve can also be easily interpreted. That is, diversity shrinkage was smaller in the middle of the curve than it was at the endpoint where diversity was maximized, but diversity shrinkage was also larger in the middle of the curve than at the endpoint where criterion validity was maximized. Similarly, validity shrinkage in the middle of the curve was smaller than at the endpoint where validity was maximized, but validity shrinkage in the middle of the curve was larger than at the endpoint where diversity was maximized. This pattern of results generalized across the different predictor combinations, as shown in Figures 2 through 4. To restate, within a given selection scenario, diversity shrinkage increases to the extent diversity is being maximized. Also, validity shrinkage increases to the extent validity is being maximized.

Diversity Shrinkage When Using Low-Impact Predictors

Figure 4. Validity and diversity shrinkage (original/calibration and shrunken Pareto curves) for Control and Communication job family using cognitive subtest predictors (Wee et al., 2014). See the online article for the color version of this figure.

Research Question 1b dealt with the effects of different types of predictor combinations on diversity shrinkage. Two major differences were observed between the pattern of results obtained for the Bobko-Roth (cognitive and noncognitive) predictors versus the cognitive subtest predictors. First, the Bobko-Roth predictors were more strongly affected by diversity shrinkage than were the cognitive

Table 3
Criterion Validity and Diversity for Pareto Curve Endpoints Using Bobko-Roth Predictors

Calibration    --------- Criterion validity (R) ----------   ----------- Diversity (AI ratio) -----------
sample size    Unit-wt.   Calibration   Cross-val.   ΔR      Unit-wt.   Calibration   Cross-val.   ΔAI ratio

Endpoint where criterion validity is maximized
N = 40         .58        .69           .63          .06     .63        .48           .48          .00
N = 100        .58        .67           .65          .03     .63        .47           .47          .00
N = 200        .58        .67           .65          .02     .63        .47           .47          .00
N = 500        .58        .66           .66          .01     .63        .47           .46          .00
N = 1,000      .58        .66           .66          .00     .63        .46           .46          .00

Endpoint where diversity is maximized
N = 40         .58        .25           .27          −.02    .63        2.15          .91          1.24
N = 100        .58        .22           .23          −.01    .63        1.46          1.03         .43
N = 200        .58        .22           .21          .00     .63        1.25          1.07         .18
N = 500        .58        .21           .21          .00     .63        1.20          1.12         .08
N = 1,000      .58        .21           .21          .00     .63        1.17          1.14         .03

Note. Validity shrinkage (ΔR) refers to the mean criterion validity difference between calibration sample validity and validation sample (i.e., cross-validated) validity, and indicates job performance shrinkage. Diversity shrinkage (ΔAI ratio) refers to the mean diversity/AI ratio difference between calibration sample and validation sample values. AI = adverse impact.

subtest predictors. This can be seen by observing the larger horizontal distances between the solid and dotted lines for the Bobko-Roth predictors (Figure 2 and Table 3) than for the cognitive subtest predictors (Figures 3 and 4, and Table 4). Second, as compared with the cognitive subtest predictors, the Bobko-Roth predictors required larger sample sizes before diversity shrinkage was negligible.

Let us arbitrarily define "negligible" diversity shrinkage as ΔAI ratio < .05 (i.e., where diversity shrinkage rounds to 0.0, rather than 0.1), for the sake of interpretation. In Figure 2 (i.e., Bobko-Roth predictors), the Pareto-optimal solutions based on calibration sample sizes of N = 500 or fewer all demonstrated substantial (non-negligible) diversity shrinkage. In contrast, in Figure 3 and Figure 4 (i.e., cognitive subtest predictors in the Machinery Repair and Control & Communication job families, respectively), there was negligible diversity shrinkage whenever the calibration sample size was N = 100 or more (i.e., the horizontal distance between the solid and dotted lines is small).

In general, these results suggest that including low impact predictors tended to increase diversity shrinkage. As can be seen in Tables 1 and 2, the average subgroup d value of the Bobko-Roth
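The cross-validation logic behind these comparisons can be illustrated with a minimal Monte Carlo sketch: estimate regression weights in a small calibration sample, then score a separate validation sample with those fixed weights. All population parameters below are invented for illustration (the article's simulations instead draw on meta-analytic correlation matrices), but the sketch reproduces the qualitative pattern that calibration-sample validity overestimates cross-validated validity.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_cal, n_val, reps = 4, 60, 100_000, 200

# Invented population model: four independent predictors of performance.
beta = np.array([0.3, 0.2, 0.2, 0.1])

def sample(n):
    X = rng.normal(size=(n, k))
    y = X @ beta + rng.normal(size=n)
    return X, y

def multiple_r(X, y, w):
    """Correlation between the weighted composite X @ w and the criterion y."""
    return np.corrcoef(X @ w, y)[0, 1]

cal_rs, val_rs, unit_rs = [], [], []
X_val, y_val = sample(n_val)                 # one large holdout, reused across reps
for _ in range(reps):
    X_cal, y_cal = sample(n_cal)             # small calibration sample
    w = np.linalg.lstsq(X_cal, y_cal, rcond=None)[0]       # OLS weights
    cal_rs.append(multiple_r(X_cal, y_cal, w))             # calibration R
    val_rs.append(multiple_r(X_val, y_val, w))             # cross-validated R
    unit_rs.append(multiple_r(X_val, y_val, np.ones(k)))   # unit weights

mean_cal, mean_val, mean_unit = map(np.mean, (cal_rs, val_rs, unit_rs))
# mean_cal exceeds mean_val: the gap is the validity shrinkage analyzed above.
```

The same design, with AI ratio as a second outcome, yields diversity shrinkage rather than validity shrinkage.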

Table 4
Criterion Validity and Diversity for Pareto Curve Endpoints Using Cognitive Subtest Predictors (Machinery Repair Job Family/Control and Communication Job Family)

Endpoint where criterion validity is maximized

  Calibration   Unit-weighted  Calibration  Cross-validated  Validity        Unit-weighted  Calibration  Cross-validated  Diversity shrinkage
  sample size   validity       validity     validity         shrinkage (ΔR)  AI ratio       AI ratio     AI ratio         (ΔAI ratio)
  N = 40        .60/.52        .64/.57      .59/.53          .04/.04         .35/.35        .32/.46      .32/.46          .01/.00
  N = 100       .60/.52        .62/.57      .61/.55          .02/.02         .35/.35        .31/.48      .31/.48          .00/.00
  N = 200       .60/.52        .62/.56      .61/.55          .00/.01         .35/.35        .30/.48      .30/.48          .00/.00
  N = 500       .60/.52        .62/.56      .61/.55          .00/.00         .35/.35        .30/.48      .30/.48          .00/.00
  N = 1,000     .60/.52        .62/.56      .62/.55          .00/.00         .35/.35        .30/.48      .30/.48          .00/.00

Endpoint where diversity is maximized

  N = 40        .60/.52        .37/.42      .38/.43          −.01/−.01       .35/.35        .89/.89      .83/.86          .05/.03
  N = 100       .60/.52        .36/.43      .36/.43          .00/.00         .35/.35        .92/.92      .91/.92          .01/.00
  N = 200       .60/.52        .35/.43      .36/.43          .00/.00         .35/.35        .91/.92      .92/.92          .00/.00
  N = 500       .60/.52        .36/.43      .36/.43          .00/.00         .35/.35        .92/.92      .92/.92          .01/.00
  N = 1,000     .60/.52        .36/.43      .36/.43          .00/.00         .35/.35        .91/.92      .92/.92          .00/.00

Note. Validity shrinkage (ΔR) refers to the mean criterion validity difference between calibration sample and validation sample values, and indicates job performance shrinkage. Diversity shrinkage (ΔAI ratio) refers to the mean diversity (i.e., AI ratio) difference between calibration sample and validation sample values. In each cell, results for the Machinery Repair job family are shown on the left side of "/" whereas results for the Control & Communication job family are shown on the right side of "/". AI = adverse impact.
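The AI (adverse impact) ratios in Tables 3 and 4 compare subgroup selection rates. A minimal sketch of the computation, with invented applicant counts (not taken from the article):

```python
def ai_ratio(minority_selected, minority_total, majority_selected, majority_total):
    """AI ratio = minority selection rate divided by majority selection rate."""
    minority_rate = minority_selected / minority_total
    majority_rate = majority_selected / majority_total
    return minority_rate / majority_rate

# Invented example: 12 of 60 minority applicants and 50 of 200 majority
# applicants receive offers.
ratio = ai_ratio(12, 60, 50, 200)     # 0.20 / 0.25 = 0.80
meets_four_fifths = ratio >= 0.80     # the 4/5ths rule threshold
```

An AI ratio of 1.0 indicates equal selection rates; values below .80 violate the 4/5ths rule discussed below.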

predictors is lower than the average subgroup d value of the cognitive subtest predictors. Although the low-impact predictors (e.g., personality, structured interviews) might suggest that large diversity improvements are possible in the original calibration sample from which the Pareto-optimal weights were obtained, our current results suggest that caution is warranted in applying these weights to new samples (because of diversity shrinkage), especially if the Pareto-optimal weighting was conducted using a calibration sample size of 500 or smaller. Specifically, when applying Pareto weights to Bobko-Roth predictors, using calibration samples of N = 500 or smaller, there is risk of non-negligible diversity shrinkage; the diversity improvement estimated using Pareto weighting in the calibration sample will be an overestimate. Nonetheless, even when diversity shrinkage is large and thus the expected diversity benefits are overstated by the calibration sample (when calibration N ≤ 500), it is still plausible that Pareto weights will yield better selection diversity outcomes than unit weighting. So, even when diversity improvement (i.e., Pareto-weighted diversity minus unit-weighted diversity; Figure 1) is overestimated, it can still be positive. That is, Pareto weighting can still potentially outperform unit weighting, despite diversity shrinkage.

Diversity Shrinkage and the 4/5ths Threshold

We should note that, when shrunken diversity improvement is positive (which it typically is), this means that some diversity enhancement is possible—not that adverse impact (in terms of the 4/5ths rule) can be completely removed. When using the current set of predictors (i.e., a set of predictors that includes cognitive tests), AI ratios above .8 (meeting the 4/5ths rule) could not be achieved without considerable loss of criterion validity and expected job performance. Further, because of diversity shrinkage, the expected number of violations of the 4/5ths rule sometimes increased (i.e., the AI ratio shifted from "greater than .8" to "less than .8") when calibration sample sizes were small. For example (as seen in Figure 3), when N = 40, a full 10 of the 21 Pareto solution points exhibit AI ratios safely above .8 in the calibration sample, but with corresponding shrunken AI ratios that fall below .8. So almost half of the Pareto solution points are too optimistic in regard to the 4/5ths rule when N is very small (N = 40). In contrast, when N = 100, only 5 of the 21 Pareto solutions show a change in 4/5ths rule violation due to diversity shrinkage. When N = 200, the number of changes in 4/5ths rule violations drops to only 3 out of 21; and when N ≥ 500, only 1 of the 21 Pareto solutions exhibits a change in outcomes of the 4/5ths rule due to diversity shrinkage. Our interpretation is that the use of Pareto weighting to avoid violations of the 4/5ths rule can sometimes produce inflated expectations (under small sample size conditions), unless diversity shrinkage is taken into account.

Diversity Improvements Using Cross-Validated Pareto Weights Instead of Unit Weights

Can we reliably get diversity improvement, after shrinkage? To examine the practical utility of Pareto-optimal weighting, we compared the cross-validated Pareto-optimal solutions (i.e., dotted lines in Figures 2 through 4) with the unit-weighted solution (i.e., the black square in each figure). In terms of the cross-validated diversity improvement (i.e., the horizontal distance between the cross-validated diversity value and the unit-weighted diversity value; see Figure 1, left panel), the simulation results (Figures 2 through 4) showed that the cross-validated Pareto-optimal solutions outperformed the unit-weighted solution in all cases simulated, whenever the calibration sample size was at least 100. Thus, unit weighting was suboptimal. That is, even after accounting for diversity shrinkage, Pareto weights yielded greater diversity outcomes (as compared with unit weights), while continuing to produce shrunken criterion validities/job performance that were as good as (and sometimes better than) unit weights. In other words, reliable diversity improvement was typically possible via Pareto weighting.

To explain this in a different way, we note the sample size required before cross-validated Pareto weights reliably outperformed unit weights. This critical sample size can be contrasted depending on the types of predictors (e.g., Figure 2 vs. Figure 3) and the particular job family (e.g., Figure 3 vs. Figure 4) considered. Based on our simulation results, the Bobko-Roth predictors (see Figure 2) required sample sizes of at least 100 before Pareto-optimal weights reliably outperformed unit weights in providing diversity improvement. This was similar to the sample size required when using cognitive subtests in the Machinery Repair job family (see Figure 3). When using the cognitive subtests in the Control and Communication job family, even the minimum sample size of N = 40 produced cross-validated Pareto weights that reliably outperformed unit weights.

Across all conditions, the unit-weighted solutions had fairly high criterion validities (relatively good job performance outcomes) but fairly low AI ratio values (relatively poor diversity outcomes). Importantly, compared to the Pareto-optimal solutions, the unit-weighted solutions tended to be suboptimal, suggesting diversity improvements over the unit-weighted solution would be possible in many of these conditions.

Sample size and diversity improvement. To highlight the effect of sample size on diversity improvement, we plotted the diversity improvement (i.e., ΔAI ratio from unit weights to Pareto weights, holding job performance constant) that could be obtained when using Pareto-optimal weights, at each distinct sample size condition. We plotted both shrunken and unshrunken diversity improvement for the Bobko-Roth predictors (Figure 6, left panel) and the cognitive subtest predictors (Figure 6, right panel).⁹ For the Bobko-Roth predictors (Figure 6, left panel), the solid line represents the unshrunken diversity improvement (i.e., diversity improvement obtained in the calibration sample, by using Pareto weights rather than unit weights). The dotted line represents the cross-validated (shrunken) diversity improvement (i.e., diversity improvement obtained in the validation sample, by using cross-validated Pareto weights rather than unit weights). At small sample sizes, diversity improvement is not only negligible, but it can also be greatly overstated by the unshrunken estimates. As can be seen from the bottom two lines in Figure 6 (left panel), at smaller sample sizes (e.g., N = 40), the unshrunken diversity improvement value (ΔAI ratio = 0.28) overestimates the actually

⁹ Results in the Figure 6 right panel pertain to the Machinery Repair job family; results for other job families may be obtained from the first author.

Figure 6. Diversity improvement (ΔAI ratio) by calibration sample size, for the Bobko-Roth predictors (left panel) and the cognitive subtest predictors in the Machinery Repair job family (right panel). Red lines indicate diversity improvement calculated as the difference between the AI ratio of Pareto weights at unit-weighted R versus the AI ratio of unit weights [in both the calibration sample (solid red lines) and in the validation sample (dotted red lines)]. Blue lines indicate the diversity improvement calculated as the difference between the AI ratio of Pareto weights at R = .50 versus the AI ratio of unit weights [in the calibration sample (solid blue lines) and in the validation sample (dotted blue lines)]. As such, red lines indicate diversity improvement at no cost in terms of job performance, and blue lines indicate diversity improvement at a small cost in terms of job performance (i.e., dropping from unit-weighted R to R = .50). See the online article for the color version of this figure.

attainable shrunken/cross-validated diversity improvement value (ΔAI ratio = −0.01). In contrast, at larger sample sizes (e.g., N = 500), the unshrunken diversity improvement (ΔAI ratio = 0.08) and the cross-validated diversity improvement (ΔAI ratio = 0.06) do not differ substantially. A similar pattern of results is observed for the cognitive subtest predictors (bottom two lines in Figure 6, right panel).

In Figure 6, we also plotted the diversity improvement that could be obtained if an organization were willing to accept a small decrement in performance (i.e., criterion validity) in order to obtain diversity improvement. The top two lines in Figure 6 (both left and right panels) indicate diversity improvement calculated as the difference between the AI ratio of Pareto weights at R = .50 versus the AI ratio of unit weights [in the calibration sample (solid lines) and in the validation sample (dotted lines)]. As such, the bottom two lines in each panel indicate diversity improvement at no cost in terms of job performance (compared with unit weights), and the top two lines in each panel indicate diversity improvement at a modest cost in terms of job performance (i.e., dropping from unit-weighted R to R = .50). As expected, larger diversity improvements can be obtained if small decrements in job performance are accepted. These diversity improvements (the topmost dotted lines in Figure 6) can be obtained despite diversity shrinkage.

Altogether, these results demonstrate that many—but not all—of the claims made by De Corte et al. (2008) and Wee et al. (2014) about diversity improvements that can be gleaned from Pareto weighting are fairly robust to diversity shrinkage (i.e., cross-validation of the Pareto curve), under the crucial boundary condition that the calibration sample size is adequate, given the mix of high-impact and low-impact predictors at hand. And even greater diversity improvements are possible if modest decrements in criterion validity/job performance are accepted.

Discussion

The current study examining the cross-validity of Pareto-optimal weights supports the following conclusions. First, along the Pareto-optimal curve, diversity shrinkage increases to the extent that diversity is being maximized. That is, validity shrinkage was greater for Pareto solutions where maximizing job performance was more important than maximizing diversity (i.e., the left side of the Pareto-optimal curve), whereas diversity shrinkage was greater at points where maximizing diversity was more important than maximizing job performance (i.e., the right side of the Pareto-optimal curve). Both validity and diversity shrinkage decreased with increasing sample size, although larger diversity shrinkage was seen for the Bobko-Roth predictors than for the cognitive subtest predictors, across all sample sizes examined (Research Question 1). Second, consistent with this general finding, when only one objective was maximized (akin to multiple regression, and represented by the endpoints of the Pareto-optimal curve), shrinkage was larger on the maximized objective and negligible on the ignored objective. Specifically, validity shrinkage existed to

the extent that job performance was being maximized, and diversity shrinkage existed to the extent that diversity was being maximized. For both the Bobko-Roth and cognitive subtest predictors, validity shrinkage when job performance was maximized became fairly negligible (i.e., we found ΔR ≤ .03) when sample size was at least 100. With a sample size of at least 100, cognitive subtest predictors also typically demonstrated negligible diversity shrinkage when diversity was maximized (i.e., we found ΔAI ratio ≤ .01). However, sample sizes greater than 500 were required for the Bobko-Roth predictors before the cross-validated Pareto weights demonstrated negligible diversity shrinkage (Research Question 2). Third, cross-validated Pareto-optimal weights typically evidenced diversity improvement over unit weights when the original sample had at least 100 applicants (Research Question 3).

In sum, when using the Pareto-optimal weighting strategy, diversity improvements over unit weights are still possible even after accounting for diversity shrinkage. However, it must also be reiterated that these cross-validated diversity improvements never produced AI ratios exceeding four-fifths without some loss of job performance compared to unit weights. Thus, to further our efforts at enhancing organizational diversity—and ameliorating the performance versus diversity dilemma—the Pareto-optimal weighting strategy should also be complemented with other approaches that seek to develop predictor measures that are less race-loaded (e.g., Goldstein et al., 2010; Hough et al., 2001), foster more favorable diversity recruiting (e.g., Avery & McKay, 2006), and conceptualize and measure performance as a multifaceted rather than monolithic construct (Campbell & Wiernik, 2015; Hattrup, Rock, & Scalia, 1997).

Theoretical Implications

In general, results from the current study extended previous research on cross-validation/shrinkage to the multicriterion optimization context. To recap, previous research (i.e., using single-criterion optimization) indicated that smaller sample sizes and smaller observed multiple correlation coefficients result in greater shrinkage of the multiple correlation.¹⁰ We expected, and found, that these validity shrinkage factors, which are well-known when only the one objective—job performance/criterion validity—is being maximized (i.e., at the leftmost endpoint of the Pareto-optimal curve), would generalize to diversity shrinkage in the case where only the one objective of diversity was being maximized (i.e., at the rightmost endpoint of the Pareto-optimal curve).

One insight that emerged from applying our knowledge of single-criterion shrinkage (classic shrinkage formulas) to the diversity objective (i.e., diversity shrinkage) was this: using low-impact predictors (e.g., personality, structured interviews) yields larger diversity shrinkage. The logic for this finding can be gleaned from applying classic shrinkage formulations to diversity, as opposed to criterion validity. First, we know that, all else equal, validity shrinkage decreases as R² increases (see footnote 10). This means that validity shrinkage decreases as we include more high-validity predictors in the regression model (when predicting job performance). The insight comes when we switch from predicting job performance to predicting race (as we essentially are doing when we estimate diversity shrinkage). For a regression model that predicts race, the analog to high R² (i.e., high racial variance accounted for by the predictors) would be a regression scenario in which we are using high-impact predictors (e.g., predictors with large Black–White subgroup differences, such as cognitive subtests, for which the predictors have large correlations with race). According to classic shrinkage formulas (and as confirmed in the current simulation), such a scenario with high-impact predictors should yield small diversity shrinkage. Conversely, when the selection scenario involves low-impact predictors, we would expect larger diversity shrinkage. This is consistent with what we see in the current study when we compare Bobko-Roth predictors against cognitive subtest predictors—the Bobko-Roth predictors (which include more low-impact predictors; e.g., personality) suffer greater diversity shrinkage when cross-validating.

Another contribution made in the current paper, beyond extending single-criterion shrinkage notions to diversity shrinkage, involves shrinkage of both criterion validity and diversity across the entire Pareto-optimal curve (i.e., shrinkage when optimizing two criteria simultaneously). That is, classic work on validity shrinkage did not dictate how validity shrinkage and diversity shrinkage would both behave at all the points along the Pareto curve, for different predictor combinations (addressed in our response to Research Question 1). Our study thus provides evidence of the magnitude of shrinkage on two criteria—validity shrinkage and diversity shrinkage.

In addition, the current study also provides evidence that the extent of shrinkage is influenced by the extent of maximization: validity shrinkage was only observed to the extent that validity was being maximized, and diversity shrinkage was only observed to the extent that diversity was being maximized. Aside from the endpoints of the Pareto curve, it appears from the current simulation results that we can directly interpolate shrinkage on validity and diversity when both are being simultaneously optimized using the Pareto-optimization procedure. To restate, shrinkage follows the extent of maximization on each of the two criteria.

Practical Implications

Previous research has suggested that Pareto-optimal weighting could be an effective—and easily implemented—strategy to improve diversity hiring outcomes in organizations (De Corte et al., 2008; Wee et al., 2014). Although these studies showed that diversity improvements were theoretically possible, they did not explicitly estimate diversity shrinkage. Our study shows diversity improvements do generalize, as long as moderate sample sizes are used (i.e., based on at least 100 job applicants, in typical scenarios). More specifically, with a sample size of at least N = 100, cross-validated Pareto-optimal weights often provided notable diversity improvements over unit weights (e.g., notably increasing the number of job offers extended to minority applicants, at no cost in terms of job performance). A full set of practical implications is summarized in Table 5.

One practical caveat is that the size of the applicant pool/calibration sample can vary widely in practice, and is not always within the direct control of an organization. The current results thus suggest that organizations with large applicant pools (e.g., the Armed Services, the civil service, Google Inc., etc.) could attenuate the performance-diversity trade-off that they would typically

¹⁰ By algebraically manipulating Wherry's (1931) formula, we get shrinkage = R² − ρ² = (R² − 1)[1 − (n − 1)/(n − k − 1)]. Note that as R² gets larger, the absolute value (magnitude) of (R² − 1) gets smaller. Hence shrinkage decreases as R² increases.
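The identity in footnote 10 is easy to verify numerically. The values of n, k, and R² in the sketch below are illustrative only (not from the article); the code confirms that shrinkage shrinks as R² grows, and also as n grows:

```python
def wherry_shrinkage(R2, n, k):
    """Footnote 10: shrinkage = R^2 - rho^2 = (R^2 - 1) * (1 - (n - 1)/(n - k - 1))."""
    return (R2 - 1.0) * (1.0 - (n - 1.0) / (n - k - 1.0))

# Shrinkage decreases as R^2 increases (holding n and k fixed) ...
low_R2 = wherry_shrinkage(0.20, n=100, k=6)    # larger shrinkage
high_R2 = wherry_shrinkage(0.50, n=100, k=6)   # smaller shrinkage

# ... and decreases as n increases (holding R^2 and k fixed).
small_n = wherry_shrinkage(0.30, n=60, k=6)
large_n = wherry_shrinkage(0.30, n=500, k=6)
```

Note that both factors are negative, so their product (the shrinkage) is positive, and a larger R² drives the first factor toward zero.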

Table 5
Summary of Basic Findings

1. Sample size. Validity shrinkage and diversity shrinkage both decrease when sample size increases. The incremental benefit of sample size is greater when sample size is small.
   a. In the current simulation, diversity shrinkage was sizable when using the standard set of Bobko-Roth predictors (cognitive and noncognitive tests), whenever sample size was at or below 500. (Figure 2)
   b. In the current simulation, diversity shrinkage was typically negligible for cognitive subtest predictors, whenever sample size was at or above 100. (Figures 3 and 4)
2. Diversity shrinkage is greater when diversity is being maximized. Validity shrinkage is greater when job performance is being maximized. (Tables 3 and 4)
3. Including low impact predictors tends to increase diversity shrinkage.
   a. When maximizing a particular criterion, shrinkage decreases as R² increases (this can be clearly seen in classic shrinkage formulas).
   b. For this reason, validity shrinkage should decrease when you have high-validity predictors (because high-validity predictors increase R²).
   c. Likewise, diversity shrinkage should decrease when you have high-impact predictors (because high-impact predictors increase R² for predicting race). Conversely, low impact predictors would increase diversity shrinkage.
4. Pareto-optimal weights typically outperform unit weights.
   a. Pareto-optimal weights generally (but not always) yield equal or greater job performance outcomes than unit weights (even after accounting for shrinkage, when calibration sample size is N ≥ 100), and
   b. Pareto weights can typically yield greater diversity outcomes compared to unit weights, at no cost in terms of job performance. (Figures 2, 3, and 4)
   c. Further, if practitioners are willing to sacrifice a moderate amount of job performance (e.g., using R = .50, instead of unit-weighted R = .58 [with Bobko-Roth predictors]), usually a notably greater diversity outcome can be achieved (e.g., AI ratio = .78, instead of unit-weighted AI ratio = .63 [with Bobko-Roth predictors]) by using Pareto weights. (Figures 2 and 6)
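The performance-diversity tradeoff summarized in Table 5 can be illustrated with a toy simulation. The sketch below is not De Corte et al.'s (2007) Pareto-optimization algorithm (which uses normal-boundary intersection); it simply sweeps the mixing weight on a two-predictor composite, one high-validity/high-impact "cognitive" score and one lower-validity/low-impact "noncognitive" score, and records criterion validity and the AI ratio at a fixed selection rate. All population values are invented.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000
minority = rng.random(n) < 0.25                     # invented 25% minority pool

# Invented predictors: "cognitive" carries a d = 1.0 subgroup gap, "noncog" none.
cognitive = rng.normal(size=n) - 1.0 * minority
noncog = rng.normal(size=n)
performance = 0.5 * cognitive + 0.3 * noncog + rng.normal(size=n)

def evaluate(c, top_fraction=0.30):
    """Validity and AI ratio for the composite c*cognitive + (1 - c)*noncog."""
    composite = c * cognitive + (1.0 - c) * noncog
    validity = np.corrcoef(composite, performance)[0, 1]
    cutoff = np.quantile(composite, 1.0 - top_fraction)   # top-down selection
    selected = composite >= cutoff
    ai = selected[minority].mean() / selected[~minority].mean()
    return validity, ai

# Sweep the mixing weight to trace a crude validity-diversity tradeoff curve.
curve = {c: evaluate(c) for c in (0.0, 0.25, 0.5, 0.75, 1.0)}
validity_all_cog, ai_all_cog = curve[1.0]         # highest validity, lowest AI
validity_all_noncog, ai_all_noncog = curve[0.0]   # lower validity, AI near 1.0
```

Moving weight from the high-impact predictor toward the low-impact one trades criterion validity for a higher AI ratio; the Pareto-optimal solutions discussed above manage exactly this tradeoff, but with weights chosen to be nondominated rather than picked from an ad hoc sweep.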

face through the use of Pareto-optimal weighting strategies. Conversely, these results also suggest that organizations with much smaller applicant pools (N << 100; e.g., start-up firms) might generally benefit from using unit-weighting strategies. One promising area for future research here would be to investigate the possibility that the sample-size tipping point between the recommended use of Pareto weights versus unit weights could be influenced by enhancing calibration sample size via meta-analysis (which could be accomplished using nonlocal calibration samples/validity studies and meta-analysis with local Bayes estimation; Brannick, 2001; Newman et al., 2007). In other words, perhaps even organizations with small applicant pools could augment their calibration sample sizes by using meta-analysis, in order to make Pareto weighting and diversity improvement more feasible.

Additionally, previous research indicates it is generally desirable to combine cognitive predictors (which typically exhibit high criterion validities and high impact) with noncognitive predictors such as structured interviews or personality tests (which typically exhibit moderate criterion validities and low impact; Hough et al., 2001; Ryan & Ployhart, 2014). Our study suggests that shrinkage affects incremental diversity differently from how it affects incremental validity. As discussed above, whereas validity shrinkage decreases with increasing criterion validity (footnote 10), diversity shrinkage increases with increasing diversity (because increasing diversity here implies that the predictors do not relate to race, i.e., R² for predicting race approaches 0). Thus, caution is required when generalizing the incremental diversity gains that may be obtained from low impact predictors (i.e., low impact predictors imply more diversity shrinkage), especially if the diversity gains were estimated in small or even medium-sized samples (e.g., 500 or fewer applicants, in the case of Bobko-Roth predictors).

Across different job families, there were moderate variations in the job performance and diversity outcomes possible, even when the same set of predictors was used. This result held even after accounting for shrinkage, and thus suggests that practitioners should consider tailoring their selection systems to specific job functions or families, when diversity is a valued outcome. This is consistent with previous calls for developing more construct-oriented approaches to personnel selection (e.g., Schneider, Hough, & Dunnette, 1996), if the applicant pool is sufficiently large.

In the current study, the Control and Communication job family revealed that large diversity improvements were possible (and unit weights were particularly suboptimal), in part because the predictors exhibited negative validity-d correspondence (i.e., the higher-impact predictors had lower criterion validities) for this job family. In contrast, the Machinery Repair job family exhibited positive validity-d correspondence (i.e., the higher-impact predictors had higher criterion validities, similar to the Bobko-Roth scenario), which meant that shrunken diversity improvement was smaller—while still greater than zero.

Limitations and Directions for Future Research

In the current study, we examined shrinkage in Pareto-optimal solutions using data based on two distinct predictor combinations. This allowed us to draw tentative conclusions regarding the use of Pareto weighting strategies in real-world contexts. By doing so, however, we were unable to systematically manipulate the effects of sample size (N), the number of predictors (k), and the size of the multiple correlation coefficients (i.e., Rperf and Rrace) in a crossed experimental design, which would have allowed us to more directly assess main and interaction effects on the shrinkage of the Pareto-optimal solutions. For example, how much larger must sample sizes be to compensate for low-impact predictors? We addressed this question with regard to the Bobko-Roth predictors, but not with regard to other possible predictor combinations involving low-impact predictors. Also, what increases in sample size become necessary as we add each additional predictor?

Practically speaking, a statistical approach to cross-validation would still be preferred over an empirical approach. We did not use a statistical approach because the classic statistical shrinkage formulas were based on least squares assumptions that do not hold for Pareto-optimization. However, future research might consider the extent to which current formulas can be adapted to provide reasonable approximations, even for Pareto-optimal solutions. Alternatively, statistical formulas for Pareto-optimal shrinkage could be specifically developed.

Finally, we note that the need to examine shrinkage through cross-validation efforts extends even beyond Pareto-optimization techniques. All techniques for optimizing fit between sample data and an empirically derived prediction model should be subjected to cross-validation. A helpful reviewer asked whether this sort of shrinkage and need for cross-validation would also apply to prediction models estimated using Big Data methods and algorithms, such as LASSO regression. To elaborate, one family of Big Data methods is supervised learning methods (from machine learning or statistical learning), in which an algorithm tries to learn a set of rules in order to predict a criterion using observed data. Some such Big Data methods and algorithms do indeed maximize local fit (e.g., R²) by capitalizing on chance, and we would expect considerable shrinkage for such methods (whether maximizing job performance or enhancing diversity). In contrast, LASSO regression (least absolute shrinkage and selection operator; Tibshirani, 1996), as briefly described by Oswald and Putka (2016), attempts to address shrinkage by using a tuning parameter that tries to "strike a balance" between "local optimization of prediction and future generalization of prediction" (p. 52). In particular, James, Witten, Hastie, and Tibshirani (2013) explained that the tuning parameter itself should be selected using cross-validation, by choosing the tuning parameter that yields the smallest cross-validation error (p. 227). That is, LASSO weights are designed to maximize cross-validated R². At present, however, LASSO-style techniques that can incorporate one or more tuning parameters into a Pareto-weighting algorithm (i.e., for use when simultaneously optimizing on two objectives—job performance and diversity) have not yet, to our knowledge, been invented.

Conclusion

The current study examined the extent of shrinkage in Pareto-optimal solutions. We found evidence for the magnitude of validity shrinkage and diversity shrinkage across the entire Pareto-optimal trade-off curve. Specifically, validity shrinkage was greater when the job performance objective was more important (i.e., maximized to a greater extent), and diversity shrinkage was greater to the extent the diversity objective was being maximized. Diversity shrinkage was also larger when lower impact predictors were used (e.g., a combination of cognitive and noncognitive predictors), and was relatively small when higher impact predictors were used (e.g., a combination of only cognitive subtest predictors). As expected, both validity shrinkage and diversity shrinkage decreased as calibration sample size increased. Further, when sample sizes were at least 100, cross-validated Pareto-optimal weights typically afforded diversity improvements over unit weights, at no cost in terms of job performance. In sum, if the calibration sample is of sufficient size, practitioners should consider using Pareto-optimal weighting of predictors to achieve diversity gains.

References

Arthur, W., Jr., Edwards, B. D., & Barrett, G. V. (2002). Multiple-choice and constructed response tests of ability: Race-based subgroup performance differences on alternative paper-and-pencil test formats. Personnel Psychology, 55, 985–1008. http://dx.doi.org/10.1111/j.1744-6570.2002.tb00138.x
Avery, D. R., & McKay, P. F. (2006). Target practice: An organizational impression management approach to attracting minority and female job applicants. Personnel Psychology, 59, 157–187. http://dx.doi.org/10.1111/j.1744-6570.2006.00807.x
Avery, D. R., McKay, P. F., Wilson, D. C., & Tonidandel, S. (2007). Unequal attendance: The relationships between race, organizational diversity cues, and absenteeism. Personnel Psychology, 60, 875–902. http://dx.doi.org/10.1111/j.1744-6570.2007.00094.x
Bedeian, A. G., Sturman, M. C., & Streiner, D. L. (2009). Decimal dust, significant digits, and the search for stars. Organizational Research Methods, 12, 687–694. http://dx.doi.org/10.1177/1094428108321153
Bell, M. P., Connerley, M. L., & Cocchiara, F. K. (2009). The case for mandatory diversity education. Academy of Management Learning & Education, 8, 597–609. http://dx.doi.org/10.5465/AMLE.2009.47785478
Bobko, P. (2001). Correlation and regression: Applications for industrial organizational psychology and management. Thousand Oaks, CA: Sage. http://dx.doi.org/10.4135/9781412983815
Bobko, P., & Roth, P. L. (2013). Reviewing, categorizing, and analyzing the literature on Black–White mean differences for predictors of job performance: Verifying some perceptions and updating/correcting others. Personnel Psychology, 66, 91–126. http://dx.doi.org/10.1111/peps.12007
Bobko, P., Roth, P. L., & Buster, M. A. (2007). The usefulness of unit weights in creating composite scores: A literature review, application to content validity, and meta-analysis. Organizational Research Methods, 10, 689–709. http://dx.doi.org/10.1177/1094428106294734
Bobko, P., Roth, P. L., & Potosky, D. (1999). Derivation and implications of a meta-analytic matrix incorporating cognitive ability, alternative predictors, and job performance. Personnel Psychology, 52, 561–589. http://dx.doi.org/10.1111/j.1744-6570.1999.tb00172.x
Brannick, M. T. (2001). Implications of empirical Bayes meta-analysis for test validation. Journal of Applied Psychology, 86, 468–480. http://dx.doi.org/10.1037/0021-9010.86.3.468
Browne, M. W. (1975). A comparison of single sample and cross-validation methods for estimating the mean squared error of prediction in multiple linear regression. British Journal of Mathematical and Statistical Psychology, 28, 112–120. http://dx.doi.org/10.1111/j.2044-8317.1975.tb00553.x
Campbell, J. P., & Wiernik, B. M. (2015). The modeling and assessment of work performance. Annual Review of Organizational Psychology and Organizational Behavior, 2, 47–74. http://dx.doi.org/10.1146/annurev-orgpsych-032414-111427
Cascio, W. F., & Aguinis, H. (2005). Test development and use: New twists on old questions. Human Resource Management, 44, 219–235. http://dx.doi.org/10.1002/hrm.20068
Cattin, P. (1980). Estimation of the predictive power of a regression model. Journal of Applied Psychology, 65, 407–414. http://dx.doi.org/10.1037/0021-9010.65.4.407
Das, I., & Dennis, J. E. (1998). Normal-boundary intersection: A new method for generating the Pareto surface in nonlinear multicriteria optimization problems. SIAM Journal on Optimization, 8, 631–657. http://dx.doi.org/10.1137/S1052623496307510
De Corte, W. (1999). Weighing job performance predictors to both maximize the quality of the selected workforce and control the level of adverse impact. Journal of Applied Psychology, 84, 695–702. http://dx.doi.org/10.1037/0021-9010.84.5.695
De Corte, W. (2006). TROFSS user's guide. Retrieved from http://users.ugent.be/~wdecorte/trofss.pdf
De Corte, W., Lievens, F., & Sackett, P. R. (2007). Combining predictors to achieve optimal trade-offs between selection quality and adverse impact. Journal of Applied Psychology, 92, 1380–1393. http://dx.doi.org/10.1037/0021-9010.92.5.1380

De Corte, W., Lievens, F., & Sackett, P. R. (2008). Validity and adverse Journal of Applied Psychology, 91, 538–554. http://dx.doi.org/10.1037/ impact potential of predictor composite formation. International Journal 0021-9010.91.3.538 of Selection and Assessment, 16, 183–194. http://dx.doi.org/10.1111/j Newman, D. A., Hanges, P. J., & Outtz, J. L. (2007). Racial groups and test .1468-2389.2008.00423.x fairness, considering history and construct validity. American Psychol- De Corte, W., Sackett, P. R., & Lievens, F. (2011). Designing Pareto- ogist, 62, 1082–1083. http://dx.doi.org/10.1037/0003-066X.62.9.1082 optimal selection systems: Formalizing the decisions required for selec- Newman, D. A., Jacobs, R. R., & Bartram, D. (2007). Choosing the best tion system development. Journal of Applied Psychology, 96, 907–926. method for local validity estimation: Relative accuracy of meta-analysis http://dx.doi.org/10.1037/a0023298 versus a local study versus Bayes-analysis. Journal of Applied Psychol- Dorans, N. J., & Drasgow, F. (1980). A note on cross-validating prediction ogy, 92, 1394–1413. http://dx.doi.org/10.1037/0021-9010.92.5.1394 equations. Journal of Applied Psychology, 65, 728–730. http://dx.doi Newman, D. A., & Lyon, J. S. (2009). Recruitment efforts to reduce .org/10.1037/0021-9010.65.6.728 adverse impact: Targeted recruiting for personality, cognitive ability, Einhorn, H. J., & Hogarth, R. M. (1975). Unit weighting schemes for and diversity. Journal of Applied Psychology, 94, 298–317. http://dx.doi decision making. Organizational Behavior & Human Performance, 13, .org/10.1037/a0013472 171–192. http://dx.doi.org/10.1016/0030-5073(75)90044-6 Nicholson, G. (1960). Prediction in future samples. In I. Olkin (Ed.), Equal Employment Opportunity Commission, Civil Service Commission, Contributions to probability and statistics (pp. 322–330). Stanford, CA: Department of Labor & Department of Justice. (1978). Uniform guidelines Stanford University Press. 
on employee selection procedures. Federal Register, 43, 38290–39315. Olkin, I., & Pratt, J. W. (1958). Unbiased estimation of certain correlation Gilbert, J. A., & Ivancevich, J. M. (2000). Valuing diversity: A tale of two coefficients. Annals of Mathematical Statistics, 29, 201–211. http://dx organizations. The Academy of Management Perspectives, 14, 93–105. .doi.org/10.1214/aoms/1177706717 http://dx.doi.org/10.5465/AME.2000.2909842 Ostroff, C. (1993). Comparing correlations based on individual-level and Gilliland, S. W. (1993). The perceived fairness of selection systems: An aggregated data. Journal of Applied Psychology, 78, 569–582. http://dx perspective. The Academy of Management Review, .doi.org/10.1037/0021-9010.78.4.569 18, 694–734. Oswald, F. L., & Putka, D. J. (2016). Statistical methods for big data: A Goldstein, H. W., Scherbaum, C. A., & Yusko, K. P. (2010). Revisiting g: scenic tour. In S. Tonidandel, E. B. King, & J. M. Cortina (Eds.), Big Intelligence, adverse impact, and personnel selection. In J. L. Outtz data at work: The data science revolution and organizational psychol- (Ed.), Adverse impact: Implications for organizational staffing and high ogy (pp. 43–63). New York, NY: Routledge. stakes selection (pp. 95–134). New York, NY: Routledge/Taylor & Outtz, J. L. (Ed.), (2010). Adverse impact: Implications for organizational Francis Group. staffing and high stakes selection. New York, NY: Routledge/Taylor & Hattrup, K., Rock, J., & Scalia, C. (1997). The effects of varying concep- Francis Group. tualizations of job performance on adverse impact, minority hiring, and Outtz, J. L., & Newman, D. A. (2010). A theory of adverse impact. In J. L. predicted performance. Journal of Applied Psychology, 82, 656–664. Outtz (Ed.), Adverse impact: Implications for organizational staffing http://dx.doi.org/10.1037/0021-9010.82.5.656 and high stakes selection (pp. 53–94). New York, NY: Routledge/Taylor Hough, L. M., Oswald, F. L., & Ployhart, R. E. (2001). 
Determinants, & Francis Group. detection and amelioration of adverse impact in personnel selection Ployhart, R. E., & Holtz, B. C. (2008). The diversity-validity dilemma: procedures: Issues, evidence and lessons learned. International Journal Strategies for reducing racioethnic and sex subgroup differences and of Selection and Assessment, 9, 152–194. http://dx.doi.org/10.1111/ adverse impact in selection. Personnel Psychology, 61, 153–172. http:// 1468-2389.00171 dx.doi.org/10.1111/j.1744-6570.2008.00109.x James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction Pulakos, E. D., & Schmitt, N. (1996). An evaluation of two strategies for to statistical learning: With applications in R. New York, NY: Springer. http://dx.doi.org/10.1007/978-1-4614-7138-7 reducing adverse impact and their effects on criterion-related validity. Jayne, M. E. A., & Dipboye, R. L. (2004). Leveraging diversity to improve Human Performance, 9, 241–258. http://dx.doi.org/10.1207/ business performance: Research findings and recommendations for or- s15327043hup0903_4 ganizations. Human Resource Management, 43, 409–424. http://dx.doi Raju, N. S., Bilgic, R., Edwards, J. E., & Fleer, P. F. (1999). Accuracy of .org/10.1002/hrm.20033 population validity and cross-validity estimation: An empirical compar- Joshi, A., & Roh, H. (2009). The role of context in work team diversity ison of formula-based, traditional empirical, and equal weights proce- research: A meta-analytic review. Academy of Management Journal, 52, dures. Applied Psychological Measurement, 23, 99–115. http://dx.doi 599–627. http://dx.doi.org/10.5465/AMJ.2009.41331491 .org/10.1177/01466219922031220 Kochan, T., Bezrukova, K., Ely, R., Jackson, S., Joshi, A., Jehn, K.,... R Core Team. (2015). R: A language and environment for statistical Thomas, D. (2003). The effects of diversity on business performance: computing. R Foundation for Statistical Computing, Vienna, Austria. 
This document is copyrighted by the American Psychological Association or one of its allied publishers. Report of the diversity research network. Human Resource Manage- Retrieved from http://www.R-project.org/ This article is intended solely for the personal use of the individual user and is not to be disseminated broadly. ment, 42, 3–21. http://dx.doi.org/10.1002/hrm.10061 Ree, M. J., Earles, J. A., & Teachout, M. S. (1994). Predicting job Konrad, A. (2003). Defining the domain of workplace diversity scholar- performance: Not much more than g. Journal of Applied Psychology, 79, ship. Group & Organization Management, 28, 4–17. http://dx.doi.org/ 518–524. http://dx.doi.org/10.1037/0021-9010.79.4.518 10.1177/1059601102250013 Roberson, Q. M., & Park, H. J. (2007). Examining the link between Larson, S. C. (1931). The shrinkage of the coefficient of multiple corre- diversity and firm performance: The effect of diversity reputation and lation. Journal of , 22, 45–55. http://dx.doi.org/ leader racial diversity. Group & Organization Management, 32, 548– 10.1037/h0072400 568. http://dx.doi.org/10.1177/1059601106291124 Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Roth, P. L., Bevier, C. A., Bobko, P., Switzer, F. S., III, & Tyler, P. (2001). Oaks, CA: Sage. Ethnic group differences in cognitive ability in employment and educa- Lord, F. M. (1950). Efficiency of prediction when a regression equation tional settings: A meta-analysis. Personnel Psychology, 54, 297–330. from one sample is used in a new sample. (Research Bulletin No. http://dx.doi.org/10.1111/j.1744-6570.2001.tb00094.x 50–40). Princeton, NJ: Educational Testing Service. Roth, P. L., Switzer, F. S., III, Van Iddekinge, C. H., & Oh, I. (2011). McKay, P. F., & McDaniel, M. A. (2006). A reexamination of black-white Toward better meta-analytic matrices: How input values can affect mean differences in work performance: More data, more moderators. 
research conclusions in human resource management simulations. Per- DIVERSITY SHRINKAGE 1653

sonnel Psychology, 64, 899–935. http://dx.doi.org/10.1111/j.1744-6570 Schneider, R. J., Hough, L. M., & Dunnette, M. D. (1996). Broadsided by .2011.01231.x broad traits: How to sink science in five dimensions or less. Journal of Ryan, A. M., & Ployhart, R. E. (2014). A century of selection. Annual Organizational Behavior, 17, 639–655. http://dx.doi.org/10.1002/ Review of Psychology, 65, 693–717. http://dx.doi.org/10.1146/annurev- (SICI)1099-1379(199611)17:6Ͻ639::AID-JOB3828Ͼ3.0.CO;2-9 psych-010213-115134 Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Sackett, P. R., & Ellingson, J. E. (1997). The effects of forming multi- Journal of the Royal Statistical Society, Series B: Methodological, 58, predictor composites on group differences and adverse impact. Person- 267–288. nel Psychology, 50, 707–721. http://dx.doi.org/10.1111/j.1744-6570 Van Iddekinge, C. H., Roth, P. L., Raymark, P. H., & Odle-Dusseau, H. N. .1997.tb00711.x (2012). The criterion-related validity of integrity tests: An updated Sackett, P. R., Schmitt, N., Ellingson, J. E., & Kabin, M. B. (2001). meta-analysis. Journal of Applied Psychology, 97, 499–530. http://dx High-stakes testing in employment, credentialing, and higher education. .doi.org/10.1037/a0021196 Prospects in a post-affirmative-action world. American Psychologist, 56, Wee, S., Newman, D. A., & Joseph, D. L. (2014). More than g: Selection 302–318. http://dx.doi.org/10.1037/0003-066X.56.4.302 quality and adverse impact implications of considering second-stratum Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection cognitive abilities. Journal of Applied Psychology, 99, 547–563. http:// methods in personnel psychology: Practical and theoretical implications dx.doi.org/10.1037/a0035183 of 85 years of research findings. Psychological Bulletin, 124, 262–274. Wherry, R. J. (1931). 
A new formula for predicting the shrinkage of the http://dx.doi.org/10.1037/0033-2909.124.2.262 coefficient of multiple correlation. Annals of Mathematical Statistics, 2, Schmitt, N., Rogers, W., Chan, D., Sheppard, L., & Jennings, D. (1997). 440–457. http://dx.doi.org/10.1214/aoms/1177732951 Adverse impact and predictive efficiency of various predictor combina- Yin, P., & Fan, X. (2001). Estimating R2 shrinkage in multiple regression: tions. Journal of Applied Psychology, 82, 719–730. http://dx.doi.org/10 A comparison of different analytical methods. Journal of Experimental .1037/0021-9010.82.5.719 Education, 69, 203–224. http://dx.doi.org/10.1080/00220970109600656
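To make the Discussion's point about tuning-parameter selection by cross-validation concrete, the sketch below fits a LASSO model and chooses the penalty (lambda) by 10-fold cross-validation, as James et al. (2013) recommend. It is illustrative only: the glmnet package and the simulated data are our choices here, not part of the article's own analyses.

```r
# Illustration (assumed setup): LASSO with lambda chosen by 10-fold
# cross-validation. Requires the glmnet package.
library(glmnet)

set.seed(1)
N <- 200; k <- 10
x <- matrix(rnorm(N * k), N, k)                            # simulated predictor scores
y <- as.numeric(x %*% c(.5, .3, rep(0, k - 2)) + rnorm(N)) # simulated criterion

# alpha = 1 requests the LASSO penalty; cv.glmnet evaluates a grid of
# lambda values by k-fold cross-validation.
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)

cv_fit$lambda.min               # lambda with the smallest cross-validation error
coef(cv_fit, s = "lambda.min")  # LASSO predictor weights at that lambda
```

Because lambda is selected on held-out folds rather than on the calibration fit, the resulting weights are aimed at cross-validated (rather than local) prediction, which is exactly the property the Discussion contrasts with fit-maximizing methods.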

Appendix A

Multiple Regression Shrinkage Formulas

The squared multiple correlation coefficient observed in a calibration sample (R²) is typically an overestimate of both the population squared multiple correlation (ρ²) and the cross-validated squared multiple correlation (ρ_c²). Commonly used shrinkage formulas for the population squared multiple correlation coefficient (ρ²), which estimates what R² would be in the general population, include Wherry's (1931) shrinkage formula and Olkin and Pratt's (1958, p. 211) shrinkage formula. Cattin (1980) showed that Wherry's (1931) shrinkage formula provides a biased estimate of population shrinkage, whereas Olkin and Pratt's (1958) shrinkage formula provides an unbiased estimate of shrinkage from the calibration sample to the population (see Cattin, 1980, Table 1, for more details).

Olkin and Pratt's shrinkage formula for ρ² (approximate formula; see Cattin, 1980, p. 409):

\hat{\rho}^2 = 1 - \frac{N - 3}{N - k - 1}\,(1 - R^2)\left[1 + \frac{2(1 - R^2)}{N - k + 1} + \frac{8(1 - R^2)^2}{(N - k - 1)(N - k + 3)}\right]

where ρ̂² is the corrected population squared multiple correlation coefficient, R² is the calibration sample squared multiple correlation coefficient, N is the calibration sample size, and k is the number of predictors.

A second set of shrinkage formulas has been used to describe shrinkage for the cross-validated squared multiple correlation coefficient (ρ_c²), which is an estimate of what R² would be in a new sample. These formulas estimate ρ_c² based on the population squared multiple correlation coefficient (ρ²). Browne's (1975) shrinkage formula, which was based on the Lord-Nicholson shrinkage formula (Lord, 1950; Nicholson, 1960), is commonly used to estimate ρ_c². Yin and Fan (2001) conducted a simulation study to investigate the effectiveness of various shrinkage formulas and found that Olkin and Pratt's (1958) shrinkage formula for ρ² and Browne's (1975) shrinkage formula for ρ_c² outperformed other shrinkage formulas.

Browne's shrinkage formula for ρ_c²:

\hat{\rho}_c^2 = \frac{(N - k - 3)\hat{\rho}^4 + \hat{\rho}^2}{(N - 2k - 2)\hat{\rho}^2 + k}

where ρ̂_c² is the corrected cross-validated squared multiple correlation coefficient; ρ̂² is the corrected population squared multiple correlation coefficient obtained from a formula such as Olkin and Pratt's (1958) shrinkage formula; N is the calibration sample size; and k is the number of predictors.
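The two shrinkage formulas described in this appendix can be implemented directly. The sketch below defines them as small R functions (hypothetical helpers written for illustration; they are not part of the ParetoR package), estimating the population ρ² from the calibration R² via Olkin and Pratt's approximation, then feeding that estimate into Browne's formula for the cross-validated ρ_c².

```r
# Olkin & Pratt's approximate shrinkage formula (Cattin, 1980, p. 409):
# estimates the population squared multiple correlation (rho^2) from the
# calibration-sample R^2, calibration sample size N, and k predictors.
olkin_pratt <- function(R2, N, k) {
  u <- 1 - R2
  1 - ((N - 3) / (N - k - 1)) * u *
    (1 + (2 * u) / (N - k + 1) + (8 * u^2) / ((N - k - 1) * (N - k + 3)))
}

# Browne's (1975) formula: estimates the cross-validated squared multiple
# correlation (rho_c^2) from an estimate of the population rho^2.
# Note rho^4 = (rho^2)^2, hence rho2^2 below.
browne <- function(rho2, N, k) {
  ((N - k - 3) * rho2^2 + rho2) / ((N - 2 * k - 2) * rho2 + k)
}

# Example (illustrative inputs): calibration R^2 = .30, N = 100, k = 4.
rho2_hat   <- olkin_pratt(R2 = .30, N = 100, k = 4)
rho2_c_hat <- browne(rho2_hat, N = 100, k = 4)
```

For inputs like these, the estimates shrink in sequence (rho2_c_hat < rho2_hat < R²), mirroring the appendix's point that the calibration-sample R² overestimates both the population and the cross-validated squared multiple correlation.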


Appendix B

Supplemental Analyses

Figure B1. Comparison of results based on different selection ratios (SR) and proportions of applicants from the minority group (Prop) [cognitive subtests: Machinery Repair job family]. SR = selection ratio (overall); Prop = proportion of applicants from the minority group. (For example, Prop = 1/6 means that among all the applicants, 1/6 were minority applicants [e.g., Black applicants].) See the online article for the color version of this figure.


Figure B2. Comparison of results based on different selection ratios (SR) and proportions of applicants from the minority group (Prop) [Bobko-Roth predictors]. SR = selection ratio (overall); Prop = proportion of applicants from the minority group. (For example, Prop = 1/6 means that among all the applicants, 1/6 were minority applicants [e.g., Black applicants].) See the online article for the color version of this figure.


Appendix C

ParetoR R Package and Web Application for Implementing Pareto Weighting in R

Pareto-Optimization via Normal Boundary Intersection Method in Diversity Hiring

Developer: Q. Chelsea Song
Contact: [email protected]
Last Update: 01/11/2017
The commands below can be found at: https://github.com/Diversity-ParetoOptimal/ParetoR

Objective

The current R program provides a set of Pareto-optimal solutions that simultaneously optimize both diversity and criterion validity in a personnel selection scenario [see Song, Wee, & Newman (in press); adapted from De Corte, Lievens, and Sackett (2007); also see Wee, Newman, and Joseph (2014) for more details]. Pareto-optimal solutions are estimated using the Normal-Boundary Intersection method (Das & Dennis, 1998).

Instructions

1. Open an R console or RStudio window. (R can be downloaded for free from https://cran.r-project.org; RStudio can be downloaded for free from https://www.rstudio.com/.)

2. Install the R package "ParetoR" through GitHub, by pasting and running the following commands in the R console or RStudio:

   install.packages("devtools")
   library("devtools")
   install_github("Diversity-ParetoOptimal/ParetoR")
   library("ParetoR")

3. Specify four inputs (the example from De Corte, Lievens, and Sackett, 2007, is given below):

   # (1) Proportion of minority applicants (prop)
   #     = (# of minority applicants)/(total # of applicants)
   ## Example:
   prop <- 1/4
   # (2) Selection ratio (sr)
   #     = (# of selected applicants)/(total # of applicants)
   ## Example:
   sr <- 0.10
   # (3) Subgroup differences (d): standardized mean differences between
   #     minority and majority subgroups, on each predictor (in applicant pool)
   ## Example:
   d <- c(1.00, 0.23, 0.09, 0.33)
   # (4) Correlation matrix (R) = criterion & predictor intercorrelation
   #     matrix (in applicant pool)
   ## Example:
   # Format: Predictor_1, ..., Predictor_n, Criterion
   R <- matrix(c(1,   .24, .00, .19, .30,
                 .24, 1,   .12, .16, .30,
                 .00, .12, 1,   .51, .18,
                 .19, .16, .51, 1,   .28,
                 .30, .30, .18, .28, 1),
               (length(d) + 1), (length(d) + 1))

4. Paste and run the following command in the R console or RStudio:

   out <- ParetoR(prop, sr, d, R)

Output Description

1. Pareto-optimal solutions (i.e., 21 equally spaced solutions that characterize the criterion validity–AI ratio trade-off curve, and predictor weights at each point along the trade-off curve).

2. Plots (i.e., the criterion validity–AI ratio trade-off curve, and predictor weights across trade-off points).

Note

The program is modeled after De Corte's (2006) TROFSS Fortran program and Zhou's (2006) NBI Matlab program (version 0.1.3). The current version only supports scenarios in which the AI ratio and one other criterion are being optimized.

References

Das, I., & Dennis, J. E. (1998). Normal-boundary intersection: A new method for generating the Pareto surface in nonlinear multicriteria optimization problems. SIAM Journal on Optimization, 8, 631–657.
De Corte, W. (2006). TROFSS user's guide.
De Corte, W., Lievens, F., & Sackett, P. R. (2007). Combining predictors to achieve optimal trade-offs between selection quality and adverse impact. Journal of Applied Psychology, 92, 1380–1393. http://dx.doi.org/10.1037/0021-9010.92.5.1380
Song, C., Wee, S., & Newman, D. A. (in press). Diversity shrinkage: Cross-validating Pareto-optimal weights to enhance diversity via hiring practices. Journal of Applied Psychology.
Wee, S., Newman, D. A., & Joseph, D. L. (2014). More than g: Selection quality and adverse impact implications of considering second-stratum cognitive abilities. Journal of Applied Psychology, 99, 547–563. http://dx.doi.org/10.1037/a0035183
Zhou, A. (2006). myNBI (Version 0.1.3) [Computer program]. Retrieved from http://www.caam.rice.edu/~indra/NBIhomepage.html


Acknowledgments

Great appreciation to Dr. Serena Wee, Dr. Dan Newman, and Dr. Wilfried De Corte for guidance and feedback on development of the program.

Web Application

We also developed a user-friendly web application to implement the Pareto-optimal technique described in the current paper (https://qchelseasong.shinyapps.io/ParetoR/). The web application (like the ParetoR package) uses only a correlation matrix, selection ratio, proportion of applicants from the minority group, and subgroup d values as input. It then provides a full set of Pareto solutions and their corresponding predictor weights.

Received June 30, 2016
Revision received May 8, 2017
Accepted May 9, 2017
