
Journal of International Business Studies (2017) 48, 653–663. © 2017 Academy of International Business. All rights reserved 0047-2506/17 www.jibs.net

EDITORIAL

Science's reproducibility and replicability crisis: International business is not immune

Herman Aguinis¹, Wayne F. Cascio², and Ravi S. Ramani¹

¹Department of Management, School of Business, George Washington University, 2201 G St. NW, Washington, DC 20052, USA; ²The Business School, University of Colorado Denver, 1475 Lawrence Street, Denver, CO 80202, USA

Correspondence: H Aguinis, Department of Management, School of Business, George Washington University, 2201 G St. NW, Washington, DC 20052, USA. Tel: (+1) 202-994-6976; e-mail: [email protected]

Abstract
International business is not immune to science's reproducibility and replicability crisis. We argue that this crisis is not entirely surprising given the methodological practices that enhance systematic capitalization on chance. This occurs when researchers search for a maximally predictive statistical model based on a particular dataset and engage in several trial-and-error steps that are rarely disclosed in published articles. We describe systematic capitalization on chance, distinguish it from unsystematic capitalization on chance, address five common practices that capitalize on chance, and offer actionable strategies to minimize the capitalization on chance and improve the reproducibility and replicability of future IB research.

Journal of International Business Studies (2017) 48, 653–663. doi:10.1057/s41267-017-0081-0

Keywords: capitalization on chance; quantitative methods; reproducibility; replicability; credibility

Received: 17 March 2017; Accepted: 5 April 2017; Online publication date: 12 May 2017

INTRODUCTION
International business (IB) and many other management and organization studies' disciplines are currently immersed in an important debate regarding the credibility and usefulness of the scholarly knowledge that is produced (Cuervo-Cazurra, Andersson, Brannen, Nielsen, & Reuber, 2016; Davis, 2015; George, 2014; Meyer, van Witteloostuijn, & Beugelsdijk, 2017). A critical issue in this debate is the inability to reproduce and replicate results described in published articles (Bakker, van Dijk, & Wicherts, 2012; Bergh, Sharp, & Li, 2017; Cuervo-Cazurra et al., 2016; Ioannidis, 2005; Open Science Collaboration, 2015). Reproducibility means that someone other than a published study's authors is able to obtain the same results using the authors' own data, whereas replicability means that someone other than a published study's authors is able to obtain substantially similar results by applying the same steps in a different context and with different data. Clearly, it is difficult to make the case that research results are credible and useful if they are irreproducible and not replicable. Unfortunately, there is a proliferation of evidence indicating that lack of reproducibility and replicability is quite pervasive (e.g., Banks, Rogelberg, Woznyj, Landis, & Rupp, 2016; Cortina, Green, Keeler, & Vandenberg, 2016; Cuervo-Cazurra et al., 2016; Ioannidis, 2005; Open Science Collaboration, 2015; Schwab & Starbuck, 2017).

Accordingly, as noted by Verbeke, Von Glinow, and Luo, "… the IB discipline faces the challenges of remaining at par with the methodological standards in adjacent fields for validity, reliability, replicability and generalizability" (Verbeke et al., 2017: 6). In short, IB is not immune to science's reproducibility and replicability crisis.

We argue that concerns about lack of reproducibility and replicability are actually not entirely surprising because of current methodological practices that enhance systematic capitalization on chance. Systematic capitalization on chance occurs when a researcher searches for a maximally predictive statistical model based on a particular dataset, and it typically involves several trial-and-error steps that are rarely disclosed in published articles. Currently, there is tremendous pressure to publish in the so-called top journals because the number of such publications has an important impact on faculty performance evaluations and rewards, including promotion and tenure decisions (Aguinis, Shapiro, Antonacopoulou, & Cummings, 2014; Butler, Delaney, & Spoelstra, 2017; Nosek, Spies, & Motyl, 2012). Thus researchers are strongly motivated to produce manuscripts that are more likely to be accepted for publication. This means submitting manuscripts that report tests of hypotheses that are statistically significant and "more highly" significant, models that fit the data as well as possible, and effect sizes that are as large as possible (Meyer et al., 2017). To paraphrase Friedman and Sunder (1994: 85), many researchers "torture the data until they confess" that effects are statistically significant, large, and supportive of favored hypotheses and models. Each of these outcomes – which together are more likely to produce the desired result of a successful publication – can be reached more easily by systematically capitalizing on chance.

Researchers today have more "degrees of freedom" regarding methodological choices than ever (Freese, 2007). Many of these degrees of freedom involve practices that enhance capitalization on chance and improve the probability of successful publication. For example, researchers may include or delete outliers from a manuscript depending on which course of action results in a larger effect-size estimate (Aguinis, Gottfredson, & Joo, 2013). As a second illustration, researchers may capitalize on chance by selecting a particular configuration of control variables after analyzing the impact of several groups of control variables on results and selecting the final set based on which configuration results in better fit indices for a favored model (Bernerth & Aguinis, 2016).

We emphasize that our focus on systematic capitalization on chance is different from unsystematic capitalization on chance, which is due to random fluctuations in any given sample drawn from a population. Unsystematic capitalization on chance is a known phenomenon and part of all inferential statistical tests. Specifically, the goal of inferential statistics is to maximize the predictive power of a model based on the data available by minimizing errors in prediction using sample scores (Cascio & Aguinis, 2005). For example, ordinary least squares (OLS) regression, which is one of the most frequently used data-analytic approaches in IB and other fields (e.g., Aguinis, Pierce, Bosco, & Muslin, 2009; Boellis, Mariotti, Minichilli, & Piscitello, 2016; Fitzsimmons, Liao, & Thomas, 2017; Fung, Zhou, & Zhu, 2016), minimizes the sum of the squared differences between fitted values and observed values. Unsystematic capitalization on chance is addressed by conducting inferential tests of the parameter estimates that include their standard errors, thereby providing information about the precision in the estimation process (i.e., larger sample sizes are associated with greater precision and smaller standard errors). Most articles in IB research include information on sample size and standard errors, which allows consumers of research to independently evaluate the accuracy of the estimation process and the meaning of results for theory and practice, thereby accounting for unsystematic capitalization on chance.¹
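To make this safeguard concrete, the brief sketch below (our own illustration, using simulated data rather than data from any study cited here) fits an OLS model at several sample sizes and shows the coefficient standard errors shrinking as N grows; the function, variable names, and numbers are invented for illustration only.

```python
# Illustrative sketch (simulated data): OLS coefficient standard errors shrink
# as sample size grows, which is the unsystematic-chance safeguard described above.
import numpy as np

rng = np.random.default_rng(42)

def ols_se(n, k=3, noise_sd=1.0):
    """Fit OLS by least squares and return the coefficient standard errors."""
    beta = np.ones(k + 1)                          # arbitrary true coefficients
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
    y = X @ beta + rng.normal(scale=noise_sd, size=n)
    xtx_inv = np.linalg.inv(X.T @ X)
    b_hat = xtx_inv @ X.T @ y                      # minimizes the sum of squared residuals
    resid = y - X @ b_hat
    sigma2 = resid @ resid / (n - X.shape[1])      # residual variance estimate
    return np.sqrt(np.diag(sigma2 * xtx_inv))      # standard errors of b_hat

for n in (50, 200, 1000):
    print(n, np.round(ols_se(n), 3))               # SEs shrink roughly with 1/sqrt(n)
```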
Next, we describe several common practices that enhance systematic capitalization on chance and illustrate these practices using articles published in Journal of International Business Studies (JIBS). Because each of the issues we discuss is so pervasive, we do not "name names." We do not believe it would be helpful or constructive to point fingers at particular authors. However, we mention variable names and the overall substantive context of each study so that the methodological issues we discuss are directly and specifically relevant for an IB readership. Then, we offer best-practice recommendations on how to minimize capitalization on chance in future IB research. Similar to previously published JIBS guest editorials (e.g., Andersson, Cuervo-Cazurra, & Nielsen, 2014; Chang, van Witteloostuijn, & Eden, 2010; Meyer et al., 2017; Reeb, Sakakibara, & Mahmood, 2012), these recommendations serve as resources for researchers, including doctoral students and their training, as well as for journal editors and reviewers evaluating manuscript submissions.

COMMON METHODOLOGICAL PRACTICES THAT ENHANCE SYSTEMATIC CAPITALIZATION ON CHANCE
In this section, we discuss five common methodological practices that enhance systematic capitalization on chance: (1) selection of variables to include in a model, (2) use of control variables, (3) handling of outliers, (4) reporting of p values, and (5) hypothesizing after results are known (HARKing). We describe each of these issues and elaborate on how they lead to lack of reproducibility and replicability.

Selection of Variables to Include in a Model
The selection of variables to include in a model encompasses both the choice of variables to include, as well as the specification of the nature of the relations among these variables. Rapid advances in computational methodologies have allowed researchers to analyze increasingly larger amounts of data without much additional effort or cost (Simmons, Nelson, & Simonsohn, 2011). Within the field of IB in particular, researchers routinely deal with "Big Data," that is, large amounts of information stored in archival datasets (Harlow & Oswald, 2016). Because these datasets were not collected directly in response to a particular research question, they contain many variables that can be restructured to produce "favorable" results (i.e., better fit estimates, larger effect-size estimates) (Chen & Wojcik, 2016). For example, consider the case of firm performance, which is one of the most frequently measured constructs in IB. As Richard, Devinney, Yip, and Johnson (2009) noted, firm performance can be defined and assessed in terms of objective measures (e.g., shareholder returns, Tobin's q) and subjective measures (e.g., reputation, comparative ranking of firms). The choice of which firm-performance measure is examined should be driven by theory, and there should be a clear justification for why a particular measure was used, given the aims of the study (Richard et al., 2009).

Three recent articles published in JIBS have used the following measures of firm performance: (Study 1) increased reputation, overall performance, increased number of new products and customers, and enhanced product quality; (Study 2) return on assets; and (Study 3) return on equity, market-to-book ratio of assets, sales efficiency, and corporate risk-taking. Of these three studies, two did not provide any explanation or rationale for why they used those specific measures of firm performance, and the third cited "prior research" without providing any references or arguments in support of this particular choice. A healthily skeptical readership cannot judge if the firm-performance measures used in these studies were chosen because they aligned with the theories the researchers were testing, or because these measures produced outcomes that supported the favored hypotheses. Moreover, it is not possible to ascertain if, initially, several measures of firm performance were considered, but only those that produced the most favorable results were retained in the published article.

Systematic capitalization on chance in terms of which variables are included in a predictive model, and how this final set of variables is chosen, has a direct detrimental impact on future efforts to reproduce and replicate substantive results. Almost 25 years ago, MacCallum, Roznowski, and Necowitz (1992) reported that researchers were making post-hoc modifications to improve the fit of models by utilizing results provided by the data-analytical software. MacCallum et al. (1992: 491) noted that this process of re-specifying models based on the data was "inherently susceptible to capitalization on chance" because the modifications were driven not by substantive reasons, but by the peculiarities of the dataset itself. Despite calls for a more thoughtful approach to the use and reporting of these modifications (e.g., Bentler, 2007; Hurley et al., 1997), recent reviews show that they are still widely used, but rarely reported (Banks, O'Boyle et al., 2016; Cortina et al., 2016; Sijtsma, 2016). For example, a recently published article in JIBS reported "relaxing" 35 of 486 constraints, including those associated with measurement error terms, until the model reached an acceptable fit. The article does not include any information on which specific paths were changed or any theory or measurement rationale for each of these "improvements" other than the goal of achieving a superior model fit. Given the popularity of data-analytical approaches such as structural equation modeling in research reported in JIBS (e.g., Funk, Arthurs, Treviño, & Joireman, 2010; Lisak, Erez, Sui, & Lee, 2016), we suspect that there are many other instances where researchers systematically capitalize on chance by making such modifications until an optimally fitting model is found – without necessarily reporting which paths were added or deleted from the original model, and why.

Use of Control Variables
Statistical controls are variables considered to be extraneous (i.e., non-central) to the hypotheses being tested but that could provide alternative explanations for results. Control variables are used very frequently in management and organization studies (Becker, 2005; Carlson & Wu, 2012; Spector & Brannick, 2011). For example, control variables are used by entering them in a hierarchical manner when conducting multiple regression analyses, under the presumption that they eliminate contamination between the predictor and outcome variables (Bernerth & Aguinis, 2016). However, the assumptions and theoretical rationale underlying the use of control variables, namely, that including them provides a "truer" test of relations and that the controls used are measured reliably, are seldom tested (Bernerth & Aguinis, 2016). Researchers rarely make explicit the reasons why certain variables (and not others) were chosen as controls (Becker, 2005; Spector & Brannick, 2011). Finally, control variables reduce the statistical power of the test and the variance associated with the criterion that can potentially be explained by substantive variables (Breaugh, 2008), thereby increasing the chance that the results obtained are an artifact of the choice of control variables used (Bernerth & Aguinis, 2016). Control variables therefore increase systematic capitalization on chance as researchers test several models, including and excluding controls piecemeal until they obtain a desired result (Banks, O'Boyle et al., 2016; Bernerth & Aguinis, 2016; Simmons et al., 2011). As noted by Cuervo-Cazurra et al. (2016: 894), "without specific knowledge about which controls were included, how they were measured and where they come from, replication is impossible".

Systematic capitalization on chance regarding the use of control variables seems pervasive in IB research. For example, four recent studies published in JIBS included the following sets of control variables: (Study 1) retained earnings scaled by the book value of assets, the ratio of shareholders' equity to the book value of assets, the natural logarithm of the ratio of current year sales revenue to prior year sales, and an indicator variable denoting the presence of share repurchases; (Study 2) the natural log of a firm's book value of tangible assets per employee and the log number of employees; (Study 3) gender, age, job rank, exposure to female managers, and organizational sector; and (Study 4) organizational tenure, tenure with supervisor, group size, and country affiliation. Of these four studies, two did not provide any explanation or rationale for the authors' choice to include those specific control variables or information on any control variables that were initially included but later excluded. The authors of the other two studies justified their choices by citing "past research" examining the impact of the same control variables. But readers have no way of knowing whether the control variables had a conceptual justification, or whether they were added in a post-hoc manner after much trial and error involving several potential controls, and the final set was chosen because it improved model fit or provided better results in support of the favored hypotheses.
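The sketch below illustrates, with invented variable names and simulated data, the kind of transparent reporting recommended later in this editorial: estimating the focal relationship both with and without candidate controls and disclosing both sets of estimates. It is a hypothetical example, not a re-analysis of any JIBS study.

```python
# Hypothetical sketch: report the focal estimate with and without controls.
# All variable names and data are invented for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "family_involvement": rng.normal(size=n),   # focal predictor (hypothetical)
    "firm_size": rng.normal(size=n),            # candidate control (hypothetical)
    "firm_age": rng.normal(size=n),             # candidate control (hypothetical)
})
df["performance"] = 0.3 * df["family_involvement"] + 0.2 * df["firm_size"] + rng.normal(size=n)

base = smf.ols("performance ~ family_involvement", data=df).fit()
full = smf.ols("performance ~ family_involvement + firm_size + firm_age", data=df).fit()

# Disclose both models so readers can see how the controls change the focal estimate.
for label, m in [("without controls", base), ("with controls   ", full)]:
    print(label,
          round(m.params["family_involvement"], 3),
          round(m.bse["family_involvement"], 3),
          round(m.rsquared, 3))
```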
Handling of Outliers
Outliers are "data points that deviate markedly from others" (Aguinis et al., 2013: 270), and are commonly found in management and organization studies (Hunter & Schmidt, 2015; Rousseeuw & Leroy, 2003). Outliers are a challenge because they can substantially affect results obtained when testing hypotheses (Bobko, 2001; Orr, Sackett, & DuBois, 1991). Because of their outsized influence, the management of outliers presents an opportunity for researchers to systematically capitalize on chance when analyzing data, often in the direction of supporting their hypotheses (Cortina, 2002). However, many researchers routinely fail to disclose whether they tested for outliers within their datasets, whether any outliers were identified, the type of outliers found, and the rationale behind choosing to include or exclude outliers from analyses (Aguinis et al., 2013).

Recently published articles in JIBS suggest the presence of systematic capitalization on chance regarding the management of outliers. For example, reported practices include winsorizing firm-level variables at the 5% level to account for outliers, trimming the sample by excluding observations at the top and bottom one percentile of variables, and removing an outlier based on studentized residuals and Cook's D.² In none of these cases did the authors define the type of outlier they were addressing: error outliers (i.e., data points that lie at a distance from other data points), interesting outliers (i.e., non-error data points that lie at a distance from other data points and may contain valuable or unexpected knowledge), or influential outliers (i.e., non-error data points that lie at a distance from other data points, are not error or interesting outliers, and also affect substantive conclusions).
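As a hypothetical illustration of the disclosure steps discussed above (and defined in note 2), the sketch below flags candidate influential cases with studentized residuals and Cook's D, winsorizes a variable at the 5th and 95th percentiles, and reports estimates with and without the flagged cases. The data, the cutoffs (e.g., 4/N), and the setup are our own illustrative assumptions, not practices drawn from the articles we reviewed.

```python
# Hedged sketch (simulated data): identify candidate outliers, winsorize,
# and report estimates with and without the flagged cases.
import numpy as np
import statsmodels.api as sm
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)
y[:3] += 6                                    # inject a few extreme points

X = sm.add_constant(x)
fit_all = sm.OLS(y, X).fit()
infl = fit_all.get_influence()
cooks_d = infl.cooks_distance[0]              # first element holds the distances
stud_resid = infl.resid_studentized_external

# Common rules of thumb (illustrative, not prescriptive): Cook's D > 4/N or |t| > 3.
flagged = (cooks_d > 4 / n) | (np.abs(stud_resid) > 3)
fit_trimmed = sm.OLS(y[~flagged], X[~flagged]).fit()

y_winsorized = winsorize(y, limits=[0.05, 0.05])   # cap at the 5th/95th percentiles

print("slope, all cases:     ", round(fit_all.params[1], 3))
print("slope, flagged removed:", round(fit_trimmed.params[1], 3))
print("cases flagged:         ", int(flagged.sum()))
print("max y before/after winsorizing:", round(y.max(), 2), round(float(y_winsorized.max()), 2))
```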

In addition, in none of these published articles did the authors take appropriate steps such as correcting the data for error outliers and reporting the results with and without outliers (Aguinis et al., 2013). Therefore, by not providing clear and detailed reporting of the manner in which they addressed the issue of outliers, it is virtually impossible to reproduce and replicate substantive conclusions.

Reporting of p values
Another issue that involves systematic capitalization on chance refers to the reporting of p values associated with tests of significance. Despite its many flaws, null hypothesis significance testing (NHST) continues to be the choice of researchers in management and organization studies (Bettis, Ethiraj, Gambardella, Helfat, & Mitchell, 2016; Meyer et al., 2017). In NHST, the tenability of a null hypothesis (i.e., no effect or relation) is primarily judged based on the observed p value associated with the test of the hypothesis, and values smaller than 0.05 are often judged as providing sufficient evidence to reject it (Bettis et al., 2016; Goldfarb & King, 2016). Of the many problems associated with this interpretation of p values, the most pernicious is that it motivates researchers to engage in a practice called "p-hacking" and to report "crippled" p values (see below) (Aguinis, Werner, Abbott, Angert, Park, & Kohlhausen, 2010; Banks, Rogelberg et al., 2016). For example, consider a researcher who interprets p = 0.0499 as sufficient evidence for rejecting the null hypothesis and p = 0.0510 as evidence that the null hypothesis should be retained, and believes that journals are more likely to look favorably on rejected null hypotheses. This researcher will be highly motivated to "p-hack," that is, find some way, such as using control variables or eliminating outliers, to reduce the p value below the 0.05 threshold (Aguinis et al., 2010; Goldfarb & King, 2016; Starbuck, 2016; Waldman & Lilienfeld, 2016). Similarly, this researcher will be motivated to report p values using cutoffs (e.g., p < 0.05), rather than report the actual p value (0.0510). Using this cutoff not only "cripples" the amount of information conveyed by the statistic (Aguinis, Pierce, & Culpepper, 2009), but also allows the researcher to claim that his or her hypothesis was supported (Aguinis et al., 2010).

Many of the aforementioned practices regarding the reporting of p values are commonly found in articles published in JIBS. For example, recent studies in JIBS reported p values by using cutoffs instead of reporting actual p values, using multiple p value cutoffs within the same article, and using the term "marginally significant" to indicate p < 0.10. In classical hypothesis testing, conventional Type I error probabilities are p < 0.05 or 0.01. There are situations where a higher Type I error probability, such as p < 0.10, might be justified (Cascio & Zedeck, 1983), but it is the responsibility of the researcher to provide such justification explicitly (Aguinis et al., 2010). In classical hypothesis testing, results either are or are not significant; there is no such thing as "marginally significant" results. The examples regarding the use of control variables and outliers provided above, along with evidence from other fields, such as strategic management (Bettis et al., 2016; Goldfarb & King, 2016) and psychology (Bakker & Wicherts, 2011; Nuijten, Hartgerink, Assen, Epskamp, & Wicherts, 2015), suggest the existence of published articles in which researchers exercised their "degrees of freedom" to systematically manipulate the data to obtain a significant (i.e., p < 0.05) result. Engaging in these practices increases systematic capitalization on chance and diminishes the probability that results will be reproducible and replicable.
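To make the reporting point concrete, the short sketch below (our own example, with simulated data) contrasts reporting an exact p value with collapsing it into a cutoff; the groups and numbers are invented for illustration.

```python
# Small sketch: an exact p value carries more information than a cutoff,
# and p = 0.0499 vs. p = 0.0510 should not flip a substantive conclusion.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.normal(loc=0.00, scale=1.0, size=80)
group_b = rng.normal(loc=0.35, scale=1.0, size=80)

t_stat, p_value = stats.ttest_ind(group_a, group_b)

print(f"t = {t_stat:.2f}, exact p = {p_value:.4f}")    # preferred: the actual value
print("p < 0.05" if p_value < 0.05 else "p >= 0.05")   # the "crippled" cutoff version
```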
Hypothesizing After Results are Known (HARKing)
Hypothesizing after results are known (HARKing) occurs when researchers retroactively include or exclude hypotheses from their study after analyzing the data, that is, post-hoc hypotheses presented as a-priori hypotheses, without acknowledging having done so (Kerr, 1998). A key issue regarding HARKing is lack of transparency. Specifically, epistemological approaches other than the pervasive positivistic model, which has become dominant in management and related fields since before World War II (Cortina, Aguinis, & DeShon, 2017), are indeed useful and even necessary. For example, inductive and abductive approaches can lead to important theory advancements and discoveries (Bamberger & Ang, 2016; Fisher & Aguinis, 2017; Hollenbeck & Wright, 2016). So, we are not advocating a rigid adherence to a positivistic approach, but rather, methodological plurality that is fully transparent so that results can be reproduced and replicated.

While primary-level and meta-analysis estimates based on self-reports indicate that 30–40% of researchers engage in HARKing, the number is likely higher because only a minority of researchers are likely to admit openly that they engaged in this practice (Banks, O'Boyle et al., 2016; Bedeian, Taylor, & Miller, 2010; Fanelli, 2009).

Consider the study by John, Loewenstein, and Prelec (2012), who surveyed 2,155 academic psychologists regarding nine questionable research practices, including "reporting an unexpected finding as having been predicted from the start." John et al. (2012) asked these researchers (a) whether they had engaged in those practices (self-admission rate), (b) the percentage of other psychologists who had engaged in those practices (prevalence estimate), and (c) among those psychologists who had, the percentage that would admit to having done so (admission estimate). For this particular question addressing HARKing, the self-admission rate was about 30%, but the prevalence rate was about 50%, and the admission estimate was about 90%. More recently, O'Boyle, Banks, and Gonzalez-Mulé (2017) examined doctoral dissertations and the subsequent academic journal articles that they spawned. Their results revealed that the ratio of supported versus non-supported hypotheses was roughly 2 to 1. That is, somewhere between dissertation defense and published journal article, authors chose, altered, or introduced hypotheses after examining their data, likely to enhance the probability of publication (Bedeian et al., 2010; Edwards & Berry, 2010; Starbuck, 2016).

Even more worrisome is that many instances of HARKing are driven and even encouraged by reviewers and editors as part of the peer-review process (Banks, O'Boyle et al., 2016; Bedeian et al., 2010). In fact, Bosco, Aguinis, Field, Pierce, and Dalton (2016) conducted a survey of authors who had published in Journal of Applied Psychology and Personnel Psychology and found that 21% reported that at least one hypothesis change had occurred as a result of the review process. Because HARKing involves researchers fabricating or altering hypotheses based on the specific peculiarities of their datasets and not openly and honestly reporting so, it represents a particularly blatant instance of systematic capitalization on chance.

To illustrate the aforementioned discussion, we reviewed all articles published in JIBS in 2016 that proposed and quantitatively tested hypotheses. Let us be clear: our intentions are not to disparage any of the researchers or studies we examined, but simply to highlight trends. Across 30 studies published in JIBS in 2016 that met our criteria, researchers proposed 137 hypotheses, of which 115 (84%) received complete or partial support, and only 22 (16%) were not supported. Based on these results, it seems that researchers are almost five times more likely to find support for their favored hypotheses than they are to reject them. While not definitive, these results, combined with known self-reports of researchers admitting to HARKing, are a "smoking gun" (Bosco et al., 2016) that hints at the existence of HARKing in IB research.

STRATEGIES TO MINIMIZE CAPITALIZATION ON CHANCE
Meta-analysis seems to be a possible solution to understand whether a particular body of work has been subjected to capitalization on chance because it allows researchers to account for variables that create fluctuations in the observed estimates of effect sizes (Hunter & Schmidt, 2015). Because meta-analysis can correct for the effects of methodological and statistical artifacts, such as sampling error and measurement error, it has become a popular methodological approach in IB research (e.g., Fischer & Mansell, 2009; Stahl, Maznevski, Voigt, & Jonsen, 2010; van Essen, Heugens, Otten, & van Oosterhout, 2012). However, meta-analysis only corrects for unsystematic capitalization on chance and not for systematic capitalization on chance. As noted by Eysenck almost 40 years ago: "garbage in, garbage out is a well-known axiom of computer specialists; it applies here [for meta-analysis] with equal force" (Eysenck, 1978: 517). In other words, if effect-size estimates in primary-level studies are upwardly biased due to systematic capitalization on chance, accumulating all of those estimates will lead to a meta-analytic summary effect that will be similarly biased. Thus even if the estimated parameters are used to create distributions (i.e., funnel plots) (Dalton, Aguinis, Dalton, Bosco, & Pierce, 2012; Macaskill, Walter, & Irwig, 2001), systematic capitalization on chance biases the entire distribution. In short, meta-analysis is not a solution to address systematic capitalization on chance and its biasing effects on results and substantive conclusions.

Cross-validation is another strategy that could potentially be used to minimize the effects of capitalization on chance, but it also addresses its unsystematic and not its systematic variety. Specifically, ρc, an estimate of cross-validity in the population, refers to whether parameter estimates (usually regression coefficients) derived from one sample can predict outcomes to the same degree in the population as a whole or in other samples drawn from the same population. If cross-validity is low, the use of assessment tools and prediction systems derived from one sample may not be appropriate in other samples from the same population.

Cascio and Aguinis (2005) provided a detailed discussion of various approaches to cross-validation and recommended estimating the cross-validity in the population (i.e., ρc) by adjusting the sample-based multiple correlation coefficient (R) by a function of sample size (N) and the number of predictors (k). It is important to note that what most computer outputs label "adjusted R²" is only an intermediate step in computing the cross-validity in the population. Adjusted R² does not address the issue of prediction optimization based on the capitalization on chance factors in the original sample and, therefore, underestimates the shrinkage (i.e., the amount by which observed values were overestimated). Based on a careful review of the relevant literature, Cascio and Aguinis (2005) suggested appropriate formulas for estimating cross-validity in the population.
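For readers who want a concrete sense of the difference, the sketch below computes the familiar Wherry-type adjusted R² and then a cross-validity estimate. We use Browne's (1975) formula as one common illustration; it is our choice for this example and not necessarily the specific formula recommended by Cascio and Aguinis (2005).

```python
# Illustrative sketch: adjusted R-squared corrects only for the number of
# predictors; a cross-validity estimate shrinks R-squared further.
def adjusted_r2(r2, n, k):
    """Wherry adjustment: what most software labels 'adjusted R2'."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def browne_cross_validity(r2, n, k):
    """Estimated squared cross-validity in the population (Browne, 1975; our illustrative choice)."""
    rho2 = adjusted_r2(r2, n, k)              # step 1: estimate the population R2
    num = (n - k - 3) * rho2 ** 2 + rho2
    den = (n - 2 * k - 2) * rho2 + k
    return num / den

r2, n, k = 0.30, 120, 8
print(round(adjusted_r2(r2, n, k), 3))            # about 0.25: modest shrinkage
print(round(browne_cross_validity(r2, n, k), 3))  # about 0.21: larger shrinkage
```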
Next, we offer suggestions for how to minimize systematic capitalization on chance specifically regarding each of the five issues we mentioned earlier.

Issue #1 is the selection of variables in models. To improve reproducibility and replicability, researchers must clearly report the rationale behind the decision rules used in determining the sample-size and data-collection procedures, and report all the variables that they have considered (Simmons et al., 2011). If a construct can be assessed using several measures available (e.g., firm performance), researchers should justify their choice in light of theoretical considerations and the aims of their study (Richard et al., 2009). When making modifications to models, researchers should consider sample size, as modifications made to models drawing on small samples are likely to yield larger and more idiosyncratic results (MacCallum et al., 1992). Because each modification made to a model increases the fit of the model to the data in hand and decreases replicability (MacCallum et al., 1992), researchers should explicitly report all modifications made to their models, the theoretical rationale for the modifications, and the fit statistics for each model tested (Credé & Harms, 2015; MacCallum et al., 1992).

Issues #2 and #3 relate to the use of control variables and the handling of outliers. Choosing which variables to use as controls or which data points to include or exclude from the analyses offers researchers an opportunity to systematically capitalize on chance. To minimize this, researchers should provide a theoretical justification for the choice of each control variable, along with evidence of prior empirical work showing a relationship between the proposed control and the focal variable. They should explain why the control variable is integral to the model they propose to test, and offer evidence regarding the reliability of the control variable (Bernerth & Aguinis, 2016). When reporting results, researchers should provide descriptive statistics for all control variables, as well as reporting results with and without control variables (Aguinis & Vandenberg, 2014; Becker, 2005; Bernerth & Aguinis, 2016). Regarding outliers, researchers should provide evidence showing that they tested for outliers in their datasets. They should specify the rules used to identify and classify outliers as error, interesting, or influential, and disclose whether influential outliers affect model fit or prediction. Finally, they should test their models using robust approaches (e.g., absolute deviation) and report results with and without outliers (Aguinis et al., 2013).

Issue #4 is the reporting of p values. Relying on arbitrary p values (such as 0.05) to guide decisions motivates researchers to engage in "p-hacking," report "crippled" results, and conflate statistical and practical significance (Aguinis et al., 2010). To counter these deleterious effects, researchers should formally state the α level used to evaluate their hypotheses given the relative seriousness of making a Type I (probability of wrongly rejecting the null hypothesis) versus Type II (probability of mistakenly retaining the null hypothesis) error; justify the use of multiple cutoffs within the same paper; report complete p values to the second decimal place; not use terms such as "marginally significant" or "very significant" when referring to p values; and discuss the practical significance of their results in terms of the context of their study (Aguinis et al., 2010).

Lastly, Issue #5 is HARKing. We examined how researchers might systematically capitalize on chance through HARKing by creating and reporting hypotheses after analyzing their data, either of their own volition, or as directed to by reviewers, and not describing hypotheses as being post hoc in an open and honest manner. To counter this practice, researchers should conduct more studies using strong inference testing and report results of post-hoc hypotheses in a separate section from a-priori hypotheses (Banks, O'Boyle et al., 2016; Bosco et al., 2016; Hollenbeck & Wright, 2016). In addition, influential and highly visible journals like JIBS can play a prominent role in countering this practice by encouraging more replication studies, promoting inductive and abductive research (Fisher & Aguinis, 2017), and using study registries where authors post the details of their proposed research before collecting and analyzing the data (Aguinis & Vandenberg, 2014; Bosco et al., 2016; Kepes & McDaniel, 2013).

CONCLUSIONS
A manuscript is more likely to be accepted for publication if results are statistically significant, effect-size estimates are large, and hypotheses and models are supported. So, consciously or not, it is in the best interests of researchers to achieve these outcomes, and this is facilitated by engaging in methodological practices that systematically capitalize on chance, which, in turn, lead to lack of reproducibility and replicability. Irreproducible and non-replicable research results threaten the credibility, usefulness, and very foundation of all scientific fields; IB is certainly not immune.

Our intention in this editorial is not to point fingers at authors, journal editors, or reviewers. Rather, we believe that there are systemic issues that we must tackle collectively because they are the result of multiple causes operating at different levels of analysis. They include, among other factors, author motivation, methodological training (or lack thereof) of authors and reviewers, the rapid progress of methodological advancements, the availability of large, archival datasets, the low cost of computing tools, increased competition for journal space, pressures on universities to produce increasingly high levels of research output, and university promotion and tenure systems that encourage publishing as many articles as possible in the so-called top journals.

We addressed five admittedly selective issues that are particularly prone to being affected by systematic capitalization on chance. Some of the issues we discussed are not new and have already been noted in the methodological literature and also in the substantive literature in IB (e.g., Cascio, 2012). We also offered suggestions on how to minimize the detrimental effects of capitalization on chance. But, realistically, even if researchers are aware of how to do things right, the issue of context (i.e., reward systems, manuscript-review processes) will remain a powerful hurdle. Thus we believe that a critical and necessary step is to enforce good methodological practices through the review process and also journal policies – because these are actions within the purview of journals. For example, Verbeke et al. offered guidelines for reviewers, including being "promoters of good methods" (Verbeke et al., 2017: 6), and the Journal of Management has recently included the following item on its reviewer-evaluation form: "To ensure that all papers have at least one reviewer with deep knowledge of the methods used, given your expertise in the statistical methods used in this paper, please indicate your comfort/confidence in your ability to rigorously evaluate the results reported: (Very uncomfortable, some discomfort, comfortable, confident, very confident, not applicable)" (Wright, 2016).

As an actionable implication of our discussion, we offer the following modest proposal. Our recommendation is to include additional items on the manuscript-submission form such that authors acknowledge, for example, whether hypotheses were created retroactively after examining the results. Similar items can be included on the manuscript-submission form regarding the selection of variables in a model, handling of control variables and outliers, and other methodological choices and judgment calls that capitalize on chance systematically. Clearly, not all methodological details can be included in a manuscript itself due to page limitations, and this is why some journals have chosen to reduce the font size of the Method section (Cortina et al., 2017). However, given that many journals allow authors to submit a supplemental file to be posted online together with any published article, page limitations are no longer a valid reason for not including sufficient detail about methodological procedures.

In closing, we believe that the motivation not to engage in systematic capitalization on chance needs to be greater than the motivation to engage in such practices. Hopefully, our article will provide a small step in this direction. One thing is clear, however: Lack of reproducibility and replicability, retractions, and negative effects on the credibility and usefulness of our research are unlikely to improve if we do not take proactive and tangible actions to implement a change in course.

ACKNOWLEDGEMENTS
We thank Alain Verbeke and two Journal of International Business Studies anonymous reviewers for their highly constructive feedback that allowed us to improve our manuscript.


NOTES
¹As noted by an anonymous reviewer, multilevel modeling is as susceptible to capitalization on chance as other methods, including OLS regression. Although the existence of a dependent data structure allows multilevel modeling to produce more accurate standard errors compared to OLS regression (Aguinis & Culpepper, 2015), this is an improvement regarding unsystematic but not systematic capitalization on chance.
²These are different ways to "manage outliers." Winsorization involves transforming extreme values to a specified percentile of the data (e.g., a 90th percentile Winsorization would transform all the data below the 5th percentile to the 5th percentile, and all the data above the 95th percentile would be set at the 95th percentile). Studentized residuals are computed by dividing a residual by an estimate of its standard deviation, and Cook's D measures the effect of deleting a given observation.

REFERENCES
Aguinis, H., & Culpepper, S. A. 2015. An expanded decision making procedure for examining cross-level interaction effects with multilevel modeling. Organizational Research Methods, 18(2): 155–176.
Aguinis, H., Gottfredson, R. K., & Joo, H. 2013. Best-practice recommendations for defining, identifying, and handling outliers. Organizational Research Methods, 16(2): 270–301.
Aguinis, H., Pierce, C. A., Bosco, F. A., & Muslin, I. S. 2009. First decade of Organizational Research Methods: Trends in design, measurement, and data-analysis topics. Organizational Research Methods, 12(1): 69–112.
Aguinis, H., Pierce, C. A., & Culpepper, S. A. 2009. Scale coarseness as a methodological artifact: Correcting correlation coefficients attenuated from using coarse scales. Organizational Research Methods, 12(4): 623–652.
Aguinis, H., Shapiro, D. L., Antonacopoulou, E., & Cummings, T. G. 2014. Scholarly impact: A pluralist conceptualization. Academy of Management Learning and Education, 13(4): 623–639.
Aguinis, H., & Vandenberg, R. J. 2014. An ounce of prevention is worth a pound of cure: Improving research quality before data collection. Annual Review of Organizational Psychology and Organizational Behavior, 1(1): 569–595.
Aguinis, H., Werner, S., Abbott, J. L., Angert, C., Park, J. H., & Kohlhausen, D. 2010. Customer-centric science: Reporting significant research results with rigor, relevance, and practical impact in mind. Organizational Research Methods, 13(3): 515–539.
Andersson, U., Cuervo-Cazurra, A., & Nielsen, B. B. 2014. From the editors: Explaining interaction effects within and across levels of analysis. Journal of International Business Studies, 45(9): 1063–1071.
Bakker, M., van Dijk, A., & Wicherts, J. M. 2012. The rules of the game called psychological science. Perspectives on Psychological Science, 7(6): 543–554.
Bakker, M., & Wicherts, J. M. 2011. The (mis)reporting of statistical results in psychology journals. Behavior Research Methods, 43(3): 666–678.
Bamberger, P., & Ang, S. 2016. The quantitative discovery: What is it and how to get it published. Academy of Management Discoveries, 2(1): 1–6.
Banks, G. C., O'Boyle, Jr., E. H. et al. 2016. Questions about questionable research practices in the field of management: A guest commentary. Journal of Management, 42(1): 5–20.
Banks, G. C., Rogelberg, S. G., Woznyj, H. M., Landis, R. S., & Rupp, D. E. 2016. Evidence on questionable research practices: The good, the bad, and the ugly. Journal of Business and Psychology, 31(3): 323–338.
Becker, T. E. 2005. Potential problems in the statistical control of variables in organizational research: A qualitative analysis with recommendations. Organizational Research Methods, 8(3): 274–289.
Bedeian, A. G., Taylor, S. G., & Miller, A. N. 2010. Management science on the credibility bubble: Cardinal sins and various misdemeanors. Academy of Management Learning and Education, 9(4): 715–725.
Bentler, P. M. 2007. On tests and indices for evaluating structural models. Personality and Individual Differences, 42(5): 825–829.
Bergh, D. D., Sharp, B., & Li, M. 2017. Tests for identifying "red flags" in empirical findings: Demonstration and recommendations for authors, reviewers and editors. Academy of Management Learning & Education, 16(1): 110–124.
Bernerth, J., & Aguinis, H. 2016. A critical review and best-practice recommendations for control variable usage. Personnel Psychology, 69(1): 229–283.
Bettis, R. A., Ethiraj, S., Gambardella, A., Helfat, C., & Mitchell, W. 2016. Creating repeatable cumulative knowledge in strategic management. Strategic Management Journal, 37(2): 257–261.
Bobko, P. 2001. Correlation and regression (2nd edn). Thousand Oaks, CA: Sage.
Boellis, A., Mariotti, S., Minichilli, A., & Piscitello, L. 2016. Family involvement and firms' establishment mode choice in foreign markets. Journal of International Business Studies, 47(8): 929–950.
Bosco, F. A., Aguinis, H., Field, J. G., Pierce, C. A., & Dalton, D. R. 2016. HARKing's threat to organizational research: Evidence from primary and meta-analytic sources. Personnel Psychology, 69(3): 709–750.
Breaugh, J. A. 2008. Important considerations in using statistical procedures to control for nuisance variables in non-experimental studies. Human Resource Management Review, 18(4): 282–293.
Butler, N., Delaney, H., & Spoelstra, S. 2017. The grey zone: Questionable research practices in the business school. Academy of Management Learning & Education, 16(1): 94–109.
Carlson, K. D., & Wu, J. 2012. The illusion of statistical control: Control variable practice in management research. Organizational Research Methods, 15(3): 413–435.
Cascio, W. F. 2012. Methodological issues in international HR management research. International Journal of Human Resource Management, 23(12): 2532–2545.
Cascio, W. F., & Aguinis, H. 2005. Test development and use: New twists on old questions. Human Resource Management, 44(3): 219–235.
Cascio, W. F., & Zedeck, S. 1983. Open a new window in rational research planning: Adjust alpha to maximize statistical power. Personnel Psychology, 36(3): 517–526.
Chang, S. J., van Witteloostuijn, A., & Eden, L. 2010. From the editors: Common method variance in international business research. Journal of International Business Studies, 41(2): 178–184.


Chen, E. E., & Wojcik, S. P. 2016. A practical guide to big data research in psychology. Psychological Methods, 21(4): 458–474.
Cortina, J. M. 2002. Big things have small beginnings: An assortment of "minor" methodological misunderstandings. Journal of Management, 28(3): 339–362.
Cortina, J. M., Aguinis, H., & DeShon, R. P. 2017. Twilight of dawn or of evening? A century of research methods in the Journal of Applied Psychology. Journal of Applied Psychology, 102(3): 274–290.
Cortina, J. M., Green, J. P., Keeler, K. R., & Vandenberg, R. J. 2016. Degrees of freedom in SEM: Are we testing the models that we claim to test? Organizational Research Methods. doi: 10.1177/1094428116676345.
Credé, M., & Harms, P. D. 2015. 25 years of higher-order confirmatory factor analysis in the organizational sciences: A critical review and development of reporting recommendations. Journal of Organizational Behavior, 36(6): 845–872.
Cuervo-Cazurra, A., Andersson, U., Brannen, M. Y., Nielsen, B., & Reuber, A. R. 2016. From the editors: Can I trust your findings? Ruling out alternative explanations in international business research. Journal of International Business Studies, 47(8): 881–897.
Dalton, D. R., Aguinis, H., Dalton, C. A., Bosco, F. A., & Pierce, C. A. 2012. Revisiting the file drawer problem in meta-analysis: An empirical assessment of published and non-published correlation matrices. Personnel Psychology, 65(2): 221–249.
Davis, G. F. 2015. What is organizational research for? Administrative Science Quarterly, 60(2): 179–188.
Edwards, J. R., & Berry, J. W. 2010. The presence of something or the absence of nothing: Increasing theoretical precision in management research. Organizational Research Methods, 13(4): 668–689.
Eysenck, H. J. 1978. An exercise in mega-silliness. American Psychologist, 33(5): 517.
Fanelli, D. 2009. How many scientists fabricate and falsify research? A systematic review and meta-analysis of survey data. PLoS ONE, 4: e5738. doi: 10.1371/journal.pone.0005738.
Fischer, R., & Mansell, A. 2009. Commitment across cultures: A meta-analytical approach. Journal of International Business Studies, 40(8): 1339–1358.
Fisher, G., & Aguinis, H. 2017. Using theory elaboration to make theoretical advancements. Organizational Research Methods. doi: 10.1177/1094428116689707.
Fitzsimmons, S., Liao, Y., & Thomas, D. 2017. From crossing cultures to straddling them: An empirical examination of outcomes for multicultural employees. Journal of International Business Studies, 48(1): 63–89.
Friedman, D., & Sunder, S. 1994. Experimental methods: A primer for economists. New York, NY: Cambridge University Press.
Freese, J. 2007. Replication standards for quantitative social science: Why not sociology. Sociological Methods & Research, 36(2): 153–172.
Fung, S. K., Zhou, G., & Zhu, X. J. 2016. Monitor objectivity with important clients: Evidence from auditor opinions around the world. Journal of International Business Studies, 47(3): 263–294.
Funk, C. A., Arthurs, J. D., Treviño, L. J., & Joireman, J. 2010. Consumer animosity in the global value chain: The effect of international production shifts on willingness to purchase hybrid products. Journal of International Business Studies, 41(4): 639–651.
George, G. 2014. Rethinking management scholarship. Academy of Management Journal, 57(1): 1–6.
Goldfarb, B., & King, A. A. 2016. Scientific apophenia in strategic management research: Significance tests & mistaken inference. Strategic Management Journal, 37(1): 167–176.
Harlow, L. L., & Oswald, F. L. 2016. Big data in psychology: Introduction to the special issue. Psychological Methods, 21(4): 447–457.
Hollenbeck, J. H., & Wright, P. M. 2016. Harking, sharking, and tharking: Making the case for post hoc analysis of scientific data. Journal of Management, 43(1): 5–18.
Hunter, J. E., & Schmidt, F. L. 2015. Methods of meta-analysis: Correcting error and bias in research findings (3rd edn). Thousand Oaks, CA: Sage.
Hurley, A. E. et al. 1997. Exploratory and confirmatory factor analysis: Guidelines, issues, and alternatives. Journal of Organizational Behavior, 18(6): 667–683.
Ioannidis, J. P. A. 2005. Why most published research findings are false. PLoS Med, 2(8): e124. doi: 10.1371/journal.pmed.0020124.
John, L. K., Loewenstein, G., & Prelec, D. 2012. Measuring the prevalence of questionable research practices with incentives for truth-telling. Psychological Science, 23(5): 524–532.
Kepes, S., & McDaniel, M. A. 2013. How trustworthy is the scientific literature in industrial and organizational psychology? Industrial and Organizational Psychology, 6(3): 252–268.
Kerr, N. L. 1998. HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2(3): 196–217.
Lisak, A., Erez, M., Sui, Y., & Lee, C. 2016. The positive role of global leaders in enhancing multicultural team innovation. Journal of International Business Studies, 47(6): 655–673.
Macaskill, P., Walter, S., & Irwig, L. 2001. A comparison of methods to detect publication bias in meta-analysis. Statistics in Medicine, 20(4): 641–654.
MacCallum, R. C., Roznowski, M., & Necowitz, L. B. 1992. Model modification in covariance structure analysis: The problem of capitalization on chance. Psychological Bulletin, 111(3): 490–504.
Meyer, K. E., van Witteloostuijn, A., & Beugelsdijk, S. 2017. What's in a p? Reassessing best practices for conducting and reporting hypothesis-testing research. Journal of International Business Studies. doi: 10.1057/s41267-017-0078-8.
Nosek, B. A., Spies, J. R., & Motyl, M. 2012. Scientific utopia II: Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7(6): 615–631.
Nuijten, M. B., Hartgerink, C. H., Assen, M. A., Epskamp, S., & Wicherts, J. M. 2015. The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 48(4): 1–22.
O'Boyle, E. H., Banks, G. C., & Gonzalez-Mulé, E. 2017. The chrysalis effect: How ugly initial results metamorphosize into beautiful articles. Journal of Management, 43(2): 376–399.
Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science, 349(6251): aac4716. doi: 10.1126/science.aac4716.
Orr, J. M., Sackett, P. R., & DuBois, C. L. Z. 1991. Outlier detection and treatment in I/O psychology: A survey of researcher beliefs and an empirical illustration. Personnel Psychology, 44(3): 473–486.
Reeb, D., Sakakibara, M., & Mahmood, I. P. 2012. Endogeneity in international business research. Journal of International Business Studies, 43(3): 211–218.
Richard, P. J., Devinney, T. M., Yip, G. S., & Johnson, G. 2009. Measuring organizational performance: Towards methodological best practice. Journal of Management, 35(3): 718–804.
Rousseeuw, P. J., & Leroy, A. M. 2003. Robust regression and outlier detection. Hoboken, NJ: Wiley.
Schwab, A., & Starbuck, W. H. 2017. A call for openness in research reporting: How to turn covert practices into helpful tools. Academy of Management Learning & Education, 16(1): 125–141.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. 2011. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11): 1359–1366.
Sijtsma, K. 2016. Playing with data – Or how to discourage questionable research practices and stimulate researchers to do things right. Psychometrika, 81(1): 1–15.


Spector, P. E., & Brannick, M. T. 2011. Methodological urban legends: The misuse of statistical control variables. Organizational Research Methods, 14(2): 287–305.
Stahl, G. K., Maznevski, M. L., Voigt, A., & Jonsen, K. 2010. Unraveling the effects of cultural diversity in teams: A meta-analysis of research on multicultural work groups. Journal of International Business Studies, 41(4): 690–709.
Starbuck, W. H. 2016. 60th anniversary essay: How journals could improve research practices in social science. Administrative Science Quarterly, 61(2): 165–183.
van Essen, M., Heugens, P. P., Otten, J., & van Oosterhout, J. 2012. An institution-based view of executive compensation: A multilevel meta-analytic test. Journal of International Business Studies, 43(4): 396–423.
Verbeke, A., Von Glinow, M. Y., & Luo, Y. 2017. Becoming a great reviewer: Four actionable guidelines. Journal of International Business Studies, 48(1): 1–9.
Waldman, I. D., & Lilienfeld, S. O. 2016. Thinking about data, research methods, and statistical analyses: Commentary on Sijtsma's (2014) "Playing with Data". Psychometrika, 81(1): 16–26.
Wright, P. M. 2016. Ensuring research integrity: An editor's perspective. Journal of Management, 42(5): 1037–1043.

ABOUT THE AUTHORS
Herman Aguinis (PhD, University at Albany, State University of New York) is the Avram Tucker Distinguished Scholar and Professor of Management at the George Washington University School of Business. His research focuses on global talent management and organizational research methods. He has published more than 140 journal articles, is a Fellow of the Academy of Management (AOM), and has received the AOM Research Methods Division Distinguished Career Award for lifetime contributions.

Wayne F. Cascio (PhD, University of Rochester) holds the Robert H. Reynolds Distinguished Chair in Global Leadership at the University of Colorado Denver. He has published 28 books and more than 185 articles and book chapters. An editor of the Journal of International Business Studies (JIBS), his research focuses on global talent management. In 2016, he received the World Federation of People Management Associations' Lifetime Achievement Award.

Ravi S. Ramani (MBA, Johnson & Wales University) is a PhD Candidate in the Department of Management at the George Washington University School of Business. His research examines employees' emotional experiences at work; the changing paradigms of leadership and talent development; research methods; and practically relevant scholarship. His work received the Best Paper award from the Academy of Management Entrepreneurship Division and has been presented at several conferences.

Accepted by Alain Verbeke, Editor-in-Chief, 5 April 2017. This paper was single-blind reviewed.
