Running head: RESEARCH QUALITY IN BEHAVIORAL SCIENCE

Citation counts and journal impact factors do not capture key aspects of research quality in the behavioral and brain sciences

Michael R. Dougherty
Department of Psychology, University of Maryland

Zachary Horne
Department of Psychology, University of Edinburgh

Abstract

Citation data and journal impact factors are important components of faculty dossiers and figure prominently in both promotion decisions and assessments of a researcher's broader societal impact. Although these metrics play a large role in high-stakes decisions, the evidence is mixed about whether they are strongly correlated with key aspects of research quality. We use data from three large-scale studies to assess whether citation counts and impact factors predict three indicators of research quality: (1) the number of statistical reporting errors in a paper, (2) the evidential value of the reported data, and (3) the replicability of a given experimental result. Both citation counts and impact factors were weak and inconsistent predictors of research quality, so defined, and sometimes negatively related to quality. Our findings impugn the validity of citation data and impact factors as indices of research quality and call into question their usefulness in evaluating scientists and their research. In light of these results, we argue that research evaluation should instead focus on the process of how research is conducted and incentivize behaviors that support open, transparent, and reproducible research.

Citation counts and journal impact factors do not capture key aspects of research quality in the behavioral and brain sciences

Introduction

Researchers and administrators often assume that journal impact factors and citation counts are indicators of research quality (e.g., McKiernan et al., 2019; Sternberg, 2016). This assumption seems plausible: High-impact journals may seem to have a more selective and rigorous review process, thereby weeding out lower-quality research as a consequence. One might also view citation counts as reflecting something akin to the wisdom of the crowd, whereby high-quality research garners more citations than low-quality research. One need not look far to see these assumptions on display: University libraries often promote bibliometric indices such as citation counts and journal impact factors as indices of "impact" or "quality," academic departments use these metrics for important decisions such as hiring, tenure, and promotion, and science journalists promote research from high-impact journals. It is also common for authors to equate impact factors and citation counts with quality (Ruscio, 2016; Sternberg, 2016) – an assumption that appears in university promotion and tenure policies (McKiernan et al., 2019). The inclusion of these metrics in high-stakes decisions starts with the assumption that there is a positive and meaningful relation between the quality of one's work on the one hand, and impact factors and citations on the other. This raises the question: Are we justified in thinking that papers published in high-impact journals, or papers that are highly cited, are of higher quality?

Before proceeding, it is important to note that citation counts and journal impact factors are often treated as variables to be maximized, under the assumption that high citation counts and publishing in high-impact journals demonstrate that one's work is of high quality. This view implicitly places these variables on the left-hand side of the prediction equation, as if the goal of research evaluation is to predict (and promote) individuals who are more likely to garner high citation counts and publish in high-impact-factor journals. This view, whether implicitly or explicitly endorsed, is problematic for a variety of reasons.

First, it neglects the fact that citation counts themselves are determined by a host of factors unrelated to quality, or for that matter even unrelated to the science being evaluated (Aksnes, Langfeldt, & Wouters, 2019; Bornmann & Daniel, 2008; Larivière & Gingras, 2010). For instance, citation counts covary with factors such as the length of the title and the presence of colons or hyphens in the title (Paiva, Lima, & Paiva, 2012; Zhou, Tse, & Witheridge, 2019) and can be affected by other non-scientific variables, such as the size of one's social network (Mählck & Persson, 2000), the use of social media, and the posting of papers on searchable archives (Gargouri et al., 2010; Gentil-Beccot, Mele, & Brooks, 2009; S. Lawrence, 2001; Luarn & Chiu, 2016; Peoples, Midway, Sackett, Lynch, & Cooney, 2016). Citation counts also tend to be higher for papers on so-called "hot" topics and for researchers associated with big-name faculty (Petersen et al., 2014). Not only are researchers incentivized to maximize these metrics, but it is also easy to find advice on how to game them (e.g., https://www.natureindex.com/news-blog/studies-research-five-ways-increase-citation-counts, https://www.nature.com/news/2010/100813/full/news.2010.406.html, https://uk.sagepub.com/en-gb/eur/increasing-citations-and-improving-your-impact-factor).

Second, treating these variables in this way can lead to problematic inferences, such as inferring that mentorship quality varies by gender simply because students of male mentors tend to enjoy a citation advantage (see the now-retracted paper by AlShebli, Makovi, & Rahwan, 2020), and can even perpetuate systemic inequalities in career advancement and mobility due to biases in citation patterns that disfavor women and persons from underrepresented groups (Greenwald & Schuh, 1994; X. Wang et al., 2020b). Finally, treating citations and impact factors as the to-be-maximized variables may alter researchers' behaviors in ways that can undermine science (Chapman et al., 2019). For example, incentivizing researchers to maximize citations may lead them to focus on topics that are in vogue, regardless of whether doing so addresses key questions that will advance their field.

If we instead think of citation counts and impact factors as predictors of success, we can ask whether they are valid proxies for assessing key aspects of the quality of a researcher's work. To be clear, these metrics are not and cannot be considered direct measures of research quality, and addressing this question of course requires a way of measuring quality that is independent of citation counts. Past work on this topic has primarily addressed the issue by relying on subjective assessments of quality provided by experts or peer reviewers. On balance, these studies have shown either weak or inconsistent relationships between quality and citation counts (e.g., Nieminen, Carpenter, Rucker, & Schumacher, 2006; Patterson & Harris, 2009; West & McIlwaine, 2002). One challenge in relying on subjective assessments is that their use assumes that the judges can reliably assess quality—an assumption that has been challenged by Bornmann, Mutz, and Daniel (2010), who showed that the inter-rater reliability of peer-review ratings is extremely poor. Indeed, controlled studies of consistency across reviewers also indicate a surprisingly high level of arbitrariness in the review process (see Francois, 2015; Langford & Guzdial, 2015; Peters & Ceci, 1982). Peters and Ceci (1982), for instance, resubmitted 12 articles (after changing the author names) that had previously been accepted for publication in psychology journals and found that the majority of the articles that had been accepted initially were rejected the second time around based on serious methodological errors. Similarly, N. Lawrence and Cortes (2014) reported a high degree of arbitrariness in accepted papers submitted to the Neural Information Processing Systems annual meeting, widely viewed as one of the premier peer-reviewed outlets in machine learning. In this study, a random sample of submissions went through two independent review panels; the authors estimated that 60% of decisions appeared to arise from an arbitrary process. By comparison, a purely random process would have yielded an arbitrariness coefficient of 78%, whereas a process without an arbitrary component would have yielded a coefficient of 0%. If reviewers who are specifically tasked with judging quality cannot agree on the acceptability of research for publication, then it is unlikely that impact factors or citation counts, which themselves are dependent on the peer review process, are reliable indicators of quality.

Recent discussions have raised several issues with the use of bibliometrics for faculty evaluation and the incentive structure for appointments, tenure, and promotion. First, there is growing concern about the improper use of bibliometrics when evaluating faculty. This concern has been expressed by numerous scholars (Aksnes et al., 2019; Hicks, Wouters, Waltman, de Rijcke, & Rafols, 2015; Moher et al., 2018; Seglen, 1997) and has been codified in the San Francisco Declaration on Research Assessment (DORA) – a statement that has been endorsed by hundreds of organizations, including the Association for Psychological Science. Second, several recent analyses challenge the traditional orthodoxy that bibliometrics are valid indicators of research quality. For instance, Fraley and Vazire (2014) and Szucs and Ioannidis (2017) report a moderate negative correlation (approximately −0.42) between impact factor and statistical power in the domains of social psychology and cognitive neuroscience, respectively (see also Brembs, 2018; Brembs, Button, & Munafò, 2013). Despite these results and repeated calls for scaling back the use of bibliometrics in research evaluation, these metrics are still widely used in evaluating faculty job candidates and in promoting faculty on the basis of their research trajectory (McKiernan et al., 2019).

The use of impact factors and citation counts as indices of a scholar's future performance also highlights that these metrics are not only used retrospectively; they are also intended to be diagnostic of how scholars will perform in the future. One of the principal functions of the hiring and tenure processes is prediction: The goal is to predict (and select) scientists with the most promise for advancing science. Realizing this goal requires both that we have access to valid predictors of future scientific accomplishments and that we use those indicators appropriately; there is evidence for neither (Chapman et al., 2019). As discussed, there are several reasons to think that these metrics do not predict quality (Aksnes et al., 2019; Hicks et al., 2015; Moher et al., 2018; Seglen, 1997). For hiring and tenure committees to establish their usefulness would require formally modeling how these factors account for a scholar's future performance – to our knowledge, this is never done. Moreover, even if such a model were actually built, search, hiring, and tenure committees would need to calibrate their decision-making in accordance with the model. We suspect, again, that this is not and would not be done.

According to the view that impact factors and citations are useful for assessing research quality, one would expect reasonable, or even strong, positive statistical relations with objective indices of research quality. Evidence to the contrary, or even a lack of evidence, would undermine their use. Existing research already suggests that journal impact factors and citation counts are given weight far beyond how well they could plausibly predict future performance, at the expense of other diagnostic measures. Although quality is a complex construct, there are a handful of factors that we believe are indicative of critical aspects of research quality (though, without question, not all aspects). These include methodological features of an article, such as the use of random assignment (when feasible) and the use of appropriate control groups; the appropriate reporting and transparency of statistical analyses; the accuracy with which results are reported; the statistical power of the study; the evidentiary value of the data (i.e., the degree to which the data provide compelling support for a particular hypothesis, including the null hypothesis); and the replicability of research findings. Some aspects of research quality can be assessed only with close inspection by domain experts (e.g., Nieminen et al., 2006; Patterson & Harris, 2009; West & McIlwaine, 2002), but many others can be ascertained using automated data mining tools.

In this paper, we take the latter approach by examining how citation counts and journal impact factors relate to indices of key aspects of research quality: (1) the number of errors in the reporting of statistics in a paper, (2) the strength of statistical evidence provided by the data, as defined by the Bayes factor (Edwards, Lindman, & Savage, 1963), and (3) the replicability of empirical findings. Inasmuch as impact factors and citation counts are indices of critical aspects of research quality, papers in higher-impact journals and highly cited papers should have fewer reporting errors, provide more credible evidence to support their claims, and be independently replicable at a higher rate than research appearing in lower-impact journals or articles with fewer citations. For these indices to be useful for research evaluation, the associations should be reasonably large. The advantages of our approach are that our chosen indices are quantifiable, verifiable, transparent, and not reliant on the subjective evaluations of reviewers who may not agree about the quality of a paper.

The analyses reported herein bring together three data sources to address the validity of citation counts and journal impact factors as indices of research quality. These data sources include a data set used in a large-scale assessment of statistical reporting inconsistencies across 50,845 articles in behavioral science and neuroscience journals published between 1985 and 2016 (Hartgerink, 2016) and 190 replication studies from a collection of repositories (see Reinero et al., 2020). Each data set was merged with the 2017 journal impact factors and/or article-level citation counts. The raw data and code used to generate our analyses and figures are available on the Open Science Framework: https://osf.io/hngwx/. Collectively, our analyses represent one of the most comprehensive assessments of the validity of bibliometrics as indicators of research quality within the behavioral and brain sciences.

Analytic strategy

Throughout, we perform a combination of Bayesian estimation and frequentist analyses. For all Bayesian models, we set weakly regularizing priors in brms (Bürkner, 2017), which are detailed in our analysis scripts on the Open Science Framework. Crudely put, weakly regularizing priors guide the MCMC estimation process but, particularly when a data set is large, allow the posterior to primarily reflect the data. Posterior predictive checks were conducted for each model to verify model fit. Posterior predictive checking involves simulating data from the fitted model and graphically comparing the simulations to the observed distribution. This step is useful for diagnosing mis-specifications of the statistical model and ensuring that the model adequately reproduces the observed distribution. Code for conducting the posterior predictive checks is included in the analysis scripts provided on OSF. Frequentist tests are provided for ordinal correlations and proportion tests only.

We had specific questions we sought to answer about the relationship between citation counts, impact factors, and measures of research quality, but our analyses are nonetheless exploratory. Consequently, we checked the robustness of our inferences by changing model parameterizations and prior choices, including or excluding different covariates and group-level effects, and making alternative distributional assumptions (e.g., comparing model fits with Gaussian vs. Beta distributions). Throughout the paper, we report 95% credible intervals, but our focus is on characterizing the magnitude of the effects rather than on binary decisions regarding statistical significance. To assess the usefulness of the various covariates for prediction, we also computed the Bayesian analogue of the R² statistic based on leave-one-out (LOO) cross-validation (Gelman, Goodrich, Gabry, & Vehtari, 2019). The LOO-adjusted R² provides a better estimate of out-of-sample prediction accuracy compared to the standard Bayesian R².
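As an illustrative sketch of this workflow (with placeholder variable and data names rather than the exact specifications, which are detailed in the analysis scripts on OSF), a model of this general form can be fit and checked in brms as follows:

```r
library(brms)

# Sketch only: variable and data names are placeholders; the priors and
# formulas actually used are given in the analysis scripts on OSF.
weak_priors <- prior(normal(0, 1), class = "b")   # weakly regularizing slopes

fit <- brm(
  quality_index ~ log(impact_factor) + log(citations + 1) + (1 | content_area),
  data   = article_data,
  family = gaussian(),
  prior  = weak_priors,
  chains = 4, cores = 4
)

pp_check(fit)    # posterior predictive check: simulated vs. observed distributions
loo_R2(fit)      # LOO-adjusted Bayesian R^2 (Gelman, Goodrich, Gabry, & Vehtari, 2019)
summary(fit)     # posterior means and 95% credible intervals
```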

Question 1: Do article citation counts and journal impact factors predict more accurate statistical reporting?

Hartgerink (2016) used statcheck (Nuijten, Hartgerink, van Assen, Epskamp, & Wicherts, 2016), an automated data mining tool, to identify statistical reporting inconsistencies among 688,112 statistical tests appearing in 50,845 journal articles in content areas related to the behavioral and brain sciences. Articles included in the data set were published between 1985 and 2016. A full description of the data and initial analyses is provided by Nuijten and colleagues (2016), but we detail the critical aspects of the statcheck data set that make it well suited to our research question. statcheck is an R package that uses regular expressions to extract American Psychological Association (APA) formatted results reported within the null hypothesis significance testing framework (Epskamp & Nuijten, 2014). Among other things, statcheck recalculates the p-values within a paper based on the test statistics and degrees of freedom reported with each test. statcheck has been shown to be a reliable indicator of inconsistencies in APA-formatted papers. For instance, Nuijten and colleagues (2016) calculated the interrater reliability between statcheck and hand coding and found that statcheck and hand coders had an interrater reliability of .76 for inconsistencies between p-values and reported test statistics and of .89 for "gross inconsistencies." We should note that, although quite reliable, naturalistic datasets—including the data set used here—will often have minor limitations that can affect data quality. These data quality issues can arise for several reasons, including errors in converting PDF files to text files or more systematic errors due to some subfields reporting their analyses in ways that deviate from the American Psychological Association style guide. To consider one example, in reviewing the statcheck data, we observed that many χ² statistics appeared to be miscoded. In particular, we observed that statcheck sometimes mis-identified the sample size as the degrees of freedom for the χ² statistic, which leads to an incorrect computation of the p-value. To address this issue, we ran analyses both including and excluding all χ² statistics. The reported conclusions are based on analyses that excluded the χ² statistics, though we note that the analyses with the χ² statistics converged with those without them. The only analyses in which the mis-coded χ² statistics are included are those reporting decision error rates.
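To illustrate concretely what statcheck extracts and recomputes, the snippet below is a minimal sketch applied to a single APA-formatted result (mirroring the example values discussed in the next paragraph); the exact processing pipeline used to build the full data set is described by Hartgerink (2016):

```r
library(statcheck)

# statcheck parses APA-formatted results from raw text, recomputes the p-value
# from the reported test statistic and degrees of freedom, and flags results
# whose reported p-value is inconsistent with the recomputed one.
res <- statcheck("The effect was significant, t(178) = 1.90, p < .05.")
res   # recomputed two-tailed p for t(178) = 1.90 is roughly .059, so this result
      # is flagged as a decision error (reported significant, recomputed p > .05)
```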

These issues notwithstanding, because of the established reliability of the statcheck data set, we merged this open-source data set with article-level citation counts and the 2018 journal impact factors to understand the relationship between reporting errors and both citation counts and journal impact factors. Citation counts were obtained using the rcrossref package (Chamberlain, Zhu, Jahn, Boettiger, & Ram, 2019) in R, and the impact factors were obtained using the scholar package (Keirstead, 2016) in R. Data from each article were then summarized to reflect the number of statistical reporting errors per article and the total number of statistical tests. Articles with fewer than two statistical tests do not allow for computing variability across tests, and thus we excluded these papers from our analyses. The final data set used for our analysis included 45,144 articles and 667,208 statistical tests.
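A minimal sketch of this lookup step is shown below; the DOI is a placeholder, and the exact merging code (including journal-title matching) is in the OSF scripts:

```r
library(rcrossref)
library(scholar)

# Article-level citation counts from Crossref; the DOI below is a placeholder.
cites <- cr_citation_count(doi = "10.xxxx/xxxxx")

# Journal impact factors matched by journal title, as provided by the
# scholar package at the time of our analyses.
jif <- get_impactfactor(journals = "Psychological Science")
```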

To address the above question, we examined the degree to which citation counts and impact factors predict quality in three ways. First, we examined whether articles containing at least one error were cited more or less than articles with no errors. Of the 45,144 articles included in the analysis, roughly 12.6% (N = 5,710) included at least one statistical reporting inconsistency that affected the decision of whether an effect was statistically significant or not. The majority of the decision errors (N = 8,515) included in these articles involved authors reporting a result as significant (p ≤ .05) when the recalculated p-value based on the reported test statistic (e.g., t = 1.90) and df (e.g., df = 178) was greater than .05. Only N = 1,589 consisted of the reverse error, in which a p-value was reported as non-significant (p > .05) when the test statistic indicated otherwise. This asymmetry suggests that these reporting errors are unlikely to be random. Indeed, of the reported tests in which the recomputed p-value was greater than .05 (N = 178,978), 4.76% (N = 8,515) were incorrectly reported as significant, whereas only 0.32% (N = 1,589) of the recomputed p-values that were significant (N = 488,154) were incorrectly reported as non-significant – a difference that is statistically reliable (Z = 131.31, p < .0001 with a proportion test). More interesting is the fact that articles containing at least one reporting error (M = 51.1 citations) garnered more citations than articles that did not contain any errors (M = 45.6 citations), Z = 6.88, p < .0001 (Mann-Whitney U), although the difference is small. Thus, at least by this criterion, citation counts do not reflect quality.[1]

[1] One question is whether researchers conceive of citation counts as an index of research quality in the first place, rather than as an index of research impact. Replacing 'quality' with 'impact' raises the question: What is an appropriate index of impact independent of citations? If what we mean by 'impact' is societal impact, then it is clear that citation counts are not the right metric prima facie, because citation counts primarily reflect academic, not societal, consideration. We suggest that regardless of how else we might define impact, it is bound up with an initial assumption that the research in question is high quality (i.e., replicable, robust, and so on).
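The two frequentist tests reported above can be reproduced from the reported counts; a minimal sketch follows, where the citation-count vectors are placeholders for the article-level data available on OSF:

```r
# Asymmetry of decision errors: 8,515 of 178,978 recomputed-nonsignificant tests
# were reported as significant, versus 1,589 of 488,154 recomputed-significant
# tests reported as non-significant.
prop.test(x = c(8515, 1589), n = c(178978, 488154))

# Citation counts for articles with vs. without at least one decision error
# (cites_with_error and cites_without_error are placeholder vectors).
wilcox.test(cites_with_error, cites_without_error)   # Mann-Whitney U test
```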

Although the above analysis of citation counts neglects the multilevel structure and complexity of the data, we share it because it is likely that people use citation counts in a similarly simplistic way when inferring research quality, following the general principle of "more is better" without adjusting appropriately for differences across subdisciplines or accounting for other relevant factors. We also examined the relationship between the frequency of errors and impact factors and citation counts using a Bayesian multilevel zero-inflated Poisson model. This model predicted the frequency of errors per paper on the basis of the log of a journal's impact factor and the log of an article's citations (+1), controlling for the year of publication and number of authors, with the total number of reported statistical tests per paper included as the offset in the model. Year of publication was included because there have been large changes in APA formatting of statistical analyses since 1985, as well as changes in statistical software over this period of time. We also considered it plausible that the number of authors on a paper could affect error rates (e.g., more authors may check over the work being submitted and so could reduce the errors in a paper), and so we conducted our analyses with and without the number of authors as a covariate in this model. The inclusion of this covariate did not materially change the conclusions of this set of analyses. We also classified each journal into one of 14 content areas within psychology (clinical, cognitive, counseling, developmental, education, experimental, general, health, industrial/organizational, law, methodology, neuroscience, social, and "other"). The classification of journals into content areas allowed us to control for potential variation in publication practices across areas by including content area as a grouping factor. However, journal was not treated as a grouping factor in the model because it is perfectly collinear with journal impact factor (i.e., the population- and grouping-level effects are identical).
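A minimal sketch of this model specification (with hypothetical column names; the fitted models are in the OSF scripts) is:

```r
library(brms)

# Zero-inflated Poisson model of decision errors per article. The log of the
# number of reported tests enters as an offset, so predictors act on the error
# rate rather than the raw count. Column names are placeholders.
fit_errors <- brm(
  n_errors ~ log(impact_factor) + log(citations + 1) + n_authors + year +
    offset(log(n_tests)) + (1 | content_area),
  data   = statcheck_articles,
  family = zero_inflated_poisson(),
  prior  = prior(normal(0, 1), class = "b"),
  chains = 4, cores = 4
)
```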

The Poisson regression revealed that only journal impact factor, b = −0.210 (−0.280, −0.139), and year of publication, b = −0.108 (−0.134, −0.082), predicted decision errors.[2] The effect of citation counts was not credibly different from zero, b = −0.008 (−0.030, 0.015), nor was number of authors, b = 0.023 (−0.001, 0.046). Table 1 provides leave-one-out R²'s, along with the parameter estimates for each predictor when modeled separately. For these models, each predictor was entered into the model separately, including both content area and the offset (number of statistical tests) in the model as control variables. These analyses provide converging evidence that the conclusions based on the full model hold when each predictor is modeled separately; however, they also illustrate that none of the predictors is of "practical" significance, as evidenced by the R²'s: Articles in higher-impact journals were associated with fewer statistical decision errors, but this relationship is weak. To put this in perspective, the Kendall's tau rank-order correlation between impact factor and percent decision errors per paper, an analysis that does not control for the other factors, is non-significant and inconsequentially small (τ = −.002, z = −0.506, p > .05), while the rank-order correlation between decision errors and citation counts is also small but significantly positive (τ = .024, z = 6.403, p < 0.01). Adjusting for the necessary grouping and population effects, however, decreases this small bivariate correlation. Thus, even though citation counts and impact factors often figure prominently in, for instance, tenure decisions, these metrics provide very little information relevant to judging research quality, as measured by decision errors. Analyses of non-decision errors are provided in the supplemental material. Conclusions based on these analyses are generally consistent with those based on the decision errors: The relationships are weak and not of practical significance. In both cases, the negative relation between the number of errors and the journal impact factor can be interpreted as reflecting the quality of the review process (maybe reviewers or editors of higher-impact journals are more likely to catch errors), the strength of the journal's editorial team, or (perhaps) the possibility that authors are more careful when submitting to higher-impact-factor journals. To address the question of quality more directly, we next evaluate whether the strength of evidence presented in papers varies with impact factors or citation counts.

[2] However, when considering only journals with impact factors of two or greater, we observed no relationship between journal impact factor and decision errors. This suggests that the negative relationship between impact and quality should be interpreted with a considerable amount of caution.

Figure 1. Number of statistical decision errors per 100 statistical tests as a function of (a) the number of times the article was cited, (b) journal impact factor, (c) the number of authors on the paper, and (d) year of publication. Shaded regions indicate 95% credible intervals. Analyses are based on N = 45,144 published articles.

Figure 2. Estimated Bayes factors (plotted as the probability of H1) for all t-tests and one-degree-of-freedom F-tests that were statistically significant at the p < .05 level (N = 299,316) as a function of (a) the number of times the article was cited, (b) journal impact factor, (c) the number of authors on the paper, (d) year of publication, and (e) reported degrees of freedom. Shaded regions indicate 95% credible intervals.

Question 2: Do highly cited papers and papers in high-impact-factor journals report stronger statistical evidence to support their claims?

Carl Sagan noted that extraordinary claims require extraordinary evidence. From a Bayesian perspective, this can be interpreted to mean that surprising or novel results, or any result deemed a priori unlikely, require greater evidence. Many findings published in high-impact journals are considered a priori unlikely according to both online betting markets and survey data (Camerer et al., 2018). Moreover, editorial policies of some higher-impact journals emphasize novelty as a criterion for publication—authors understand that research reports should be "new" and of "broad significance." Assuming high-impact journals are more likely to publish novel and surprising research, one might also expect these journals to require a higher standard of evidence. One way to quantify the strength of the evidence is with the Bayes factor (Morey, Romeijn, & Rouder, 2016). The Bayes factor provides a quantitative estimate of the degree to which the data support the alternative hypothesis versus the null. For instance, a BF10 of seven indicates that the data are seven times more likely under the alternative hypothesis than under the null, as those hypotheses are specified. All else being equal, studies that provide stronger statistical evidence are more likely to produce replicable effects (Etz & Vandekerckhove, 2016) and hold promise for greater scientific impact because they are more likely to be observable across a wide variety of contexts and settings (Ebersole et al., 2016; Klein et al., 2018).

To conduct these analyses, we computed Bayes factors for all t-tests, one-degree-of-freedom F-tests, and Pearson r's (N = 299,504 statistical tests from 34,252 journal articles) contained within the Hartgerink (2016) dataset for which the recomputed p-value based on the test statistic and degrees of freedom was p ≤ .05. This procedure allows us to evaluate the general strength of evidence presented by authors in support of their claims, but does not allow us to evaluate cases in which authors' theoretical conclusions are based on a single statistic amongst many that might be presented.

The BF was computed using the meta.ttestBF function in the BayesFactor package in R, which computes the Bayes factor based on a t statistic and sample size. Thus, we first converted the F and r statistics to t statistics before computing the Bayes factor for a given test. Because our automated approach cannot differentiate between within- and between-subjects designs, we computed two separate BFs for each statistic, one assuming a within-subjects design and one assuming a between-subjects design. Sample size for the meta.ttestBF function was estimated from the degrees of freedom, with N = df + 1 for the one-sample case and N1 = N2 = (df + 2)/2 for the two-sample case, rounded to the nearest integer. Because the two estimates are highly correlated (r = 0.99) and yield similar findings, we report only the results of the two-sample case. It is also important to note that while this procedure makes the intercepts of our models difficult to interpret, our primary interest was in the slope parameters – we sought to understand the relationship between evidential strength and both journal impact factors and citation counts, rather than the overall or average magnitude of the Bayes factors in the papers in our dataset.
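The sketch below illustrates these conversions and the Bayes factor computation for a single reported test; it relies on the BayesFactor package's default prior scale, which may differ from the settings used in our scripts:

```r
library(BayesFactor)

# Convert one-df F and Pearson r statistics to t before computing the BF.
f_to_t <- function(f) sqrt(f)                         # valid for one-df F tests only
r_to_t <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

# Two-sample case: equal group sizes estimated from the reported df.
bf_from_t <- function(t, df) {
  n <- round((df + 2) / 2)                            # N1 = N2 = (df + 2)/2
  extractBF(meta.ttestBF(t = t, n1 = n, n2 = n))$bf   # numeric BF10
}

bf_from_t(t = 2.5, df = 178)                          # e.g., a reported t(178) = 2.5
```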

Because the distribution of the Bayes factor was extremely skewed, we converted the Bayes factors to probabilities and used beta regression to analyze the data. The conversion to the probability scale was not possible for a small number (N = 188) of the Bayes factors due to their extremity, resulting in a total of 299,316 observations being used in the final analysis. As a robustness test, we also repeated all of the analyses after applying the rank normal transformation to the Bayes factor. These analyses are provided on OSF and are consistent with the results of the beta regression models reported here.[3]

[3] The log transformation of the Bayes factor did not sufficiently correct for the skewness; models other than the beta model or a Gaussian model applied to the rank normal transformation generally failed posterior predictive checks.
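One natural mapping from the Bayes factor to the unit interval is BF/(1 + BF), the posterior probability of H1 under equal prior odds; the sketch below (with placeholder column names) shows this conversion and the corresponding beta regression, though the exact transformation and model specification we used are given in the OSF scripts:

```r
library(brms)

# Map BF10 onto (0, 1); extremely large BFs round to exactly 1 in floating
# point and must be dropped before fitting a beta likelihood.
bf_data$p_h1 <- bf_data$bf10 / (1 + bf_data$bf10)
bf_data <- subset(bf_data, p_h1 > 0 & p_h1 < 1)

fit_bf <- brm(
  p_h1 ~ log(impact_factor) + log(citations + 1) + n_authors + year + df +
    (1 | content_area),
  data   = bf_data,
  family = Beta(),
  prior  = prior(normal(0, 1), class = "b"),
  chains = 4, cores = 4
)
```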

Under the common assumption that citation counts and impact factors reflect scientific impact or quality, one would expect these indices to be positively correlated with the evidentiary value of the data represented in the publications. However, this does not seem to be the case. In fact, as we demonstrate next, there is a tendency for the evidentiary value of the data to be lower in high-impact journals.

The "centrality" of a hypothesis can only be assessed indirectly in these analyses, but we work from the assumption that tests reported in a paper, particularly those that are statistically significant, will tend to be the tests most central to an author's argument. Focusing on significant tests, we used citation counts and impact factors as predictors of a test's Bayes factor, controlling for content area, year of publication, number of authors of the paper, and the reported degrees of freedom (as a surrogate for sample size).[4] We observed that the magnitude of the Bayes factor was negatively related to the journal impact factor, b = −0.033 (−0.046, −0.020), citation counts, b = −0.004 (−0.007, −0.0002), and number of authors, b = −0.014 (−0.018, −0.010). Both year of publication, b = 0.019 (0.015, 0.023), and degrees of freedom, b = 0.184 (0.124, 0.248), were positively related. Table 1 provides the regression coefficients and LOO R²'s for each predictor modeled separately. Although the magnitude of the relationship is too small to be of practical relevance, the fact that higher-impact journals and papers with more citations are associated with less evidentiary value underscores the central thesis that neither the impact factor nor citation counts are valid indicators of key aspects of research quality. That is, it is unlikely that we would have observed a small negative relationship if the relationship between evidentiary value and journal impact factor were, in fact, strong and positive.

[4] The inclusion of degrees of freedom also serves as a sanity check, as the Bayes factor should scale with sample size, ceteris paribus.

The credible yet small negative relationship between evidentiary value and number of authors indicates that papers with greater numbers of authors were also associated with a given test providing less evidentiary value. To speculate, in the absence of preregistered analysis plans, we conjecture that the number of alternative ways the data are analyzed grows as a function of the number of authors, a hypothesis consistent with the findings of Silberzahn et al. (2018). In turn, it is possible that weak evidence discovered during extended exploratory data analysis is over-interpreted. However, further research would be needed to confirm this hypothesis. As with the analyses of the decision errors, the Bayesian R² for these analyses is small, indicating that despite the fact that the results are reliable, none of the predictors accounts for much variance in the magnitude of the Bayes factor (see Table 1).

Another way to look at the data is to model the distribution of p-values directly. We examined the degree to which the magnitude of the computed p-value co-varied with impact factors, citation counts, and number of authors, controlling for year of publication, number of statistical tests reported for each article, and content area within psychology. We modeled only those p-values ≤ .05 under the assumption that these represent the most meaningful of the authors' findings: Of the 667,208 total statistical tests included in the dataset, 488,151 of the re-computed p-values were less than .05. A Bayesian beta-regression model again revealed that both the impact factor, b = .041 (.032, .050), and the number of authors, b = .008 (.006, .011), were weak but reliable predictors of the magnitude of the reported p-values: Journals with higher impact factors and papers with more authors were associated with the reporting of higher p-values. Citation counts were also credibly related to the magnitude of the p-value, but in this case negatively, b = −.004 (−.007, −.002). Both year of publication, b = −.017 (−.020, −.014), and the log of the total number of tests included in each paper, b = −.007 (−.01, −.003), were also statistically reliable predictors. As a robustness check, we fit a separate model without controlling for the total number of reported statistical tests. This analysis did not alter our conclusions.

Question 3: Do citation counts or journal impact factors predict the replicability of results?

To address this third question, we used an openly available data set that included data from eight different replication projects: the Reproducibility Project (Open Science Collaboration, 2015), Many Labs 1 (Klein et al., 2014), Many Labs 2 (Klein et al., 2018), Many Labs 3 (Ebersole et al., 2016), a special issue of Social Psychology (Epstude, 2017), the Association for Psychological Science Registered Reports repository (Simons, Holcombe, & Spellman, 2014), the pre-publication independent replication repository (Schweinsberg et al., 2016), and Curate Science (Aarts & LeBel, 2016). Reinero et al. (2020) collated these data, which also include the number of citations to each original article, the number of authors on the original article, year of publication, and the Altmetric score for the original article. The full dataset curated by Reinero et al. (2020) is available on the Open Science Framework (https://osf.io/pc9xd/). We merged these data with the 2017 journal impact factors obtained using the scholar (Keirstead, 2016) package in R and coded each journal by subdiscipline (clinical, cognitive, social, judgment and decision making, general, marketing, and 'other').[5] Of the 196 studies included in Reinero et al. (2020), we were unable to obtain journal impact factors for four of the publications; an additional two papers were not coded for replication success. Thus, the total number of studies included in our analysis was N = 190.

[5] We also manually looked up each impact factor on the journal website to obtain the most recently available impact factor, as the scholar package returned the 2017 impact factor. The manually obtained impact factors correlated r = 0.99 with the impact factors obtained using the scholar package. We therefore report analyses based on the scholar data. However, we note here that the conclusions based on the manual impact factors are essentially the same.

Figure 3. Violin plots of (a) the number of times each paper was cited, (b) the number of times the first author was cited, (c) rated institutional prestige, and (d) rated surprisingness of the experimental findings, split by studies that were and were not successfully replicated. Medians are given by boxed numbers; means are unboxed. Plots are based on the N = 100 replication attempts included in Open Science Collaboration (2015).

A subset of the replications included in this dataset are from the Open Science Collaboration (2015) replication project and include a variety of other variables, including ratings of the "surprisingness" of each original finding (the extent to which the original finding was surprising, as judged by the authors who participated in the replication project), as well as citations to the first author of the original paper and the prestige of the author's institution. Our original analyses provided in early drafts of this manuscript (https://psyarxiv.com/9g5wk/) focused only on the 100 studies included in the Open Science Collaboration (2015) repository. We expanded our analysis in March 2021 to include all of the studies reported in Reinero et al. (2020). The Open Science Collaboration (2015) dataset includes only three journals (Journal of Experimental Psychology: Learning, Memory and Cognition; Journal of Personality and Social Psychology; and Psychological Science), and the impact factors for these three journals are completely confounded with content area within psychology. The expanded dataset includes 190 total publications in twenty-seven different journals, thereby enabling us to include impact factor as a predictor in our models.

To examine whether impact factors and citation counts predict replication success, we conducted a Bayesian multilevel logistic regression predicting replication success from the log(impact factor), log(citation count + 1), and log(number of authors), with year of publication and area within psychology included as random (grouping) factors, using weakly informative priors. This analysis revealed that only impact factor, b = −0.822, 95% CI (−1.433, −0.255), predicted replication success, albeit in the opposite direction than would be desired by proponents of their use. Neither citation counts, b = 0.237, 95% CI (−0.010, 0.490), nor number of authors, b = 0.179, 95% CI (−0.178, 0.536), was credibly related to replication success. As a robustness analysis, we fit alternative models excluding 'area' as a grouping factor, modeling the predictors individually, and specifying alternative prior distributions. These models all yielded consistent findings, with the slope of the impact factor consistently credibly different from zero, ranging from −0.53 to −1.03, and with neither the slope for citation counts nor that for number of authors credibly different from zero.
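A minimal sketch of this model, with placeholder column names (the proponent-prior variant described in the next paragraph simply swaps in normal(0.5, 1) slope priors), is:

```r
library(brms)

# Multilevel logistic regression for replication success (0/1). Year and
# content area are grouping factors; column names are placeholders, and the
# exact priors used are given in the OSF scripts.
fit_rep <- brm(
  replicated ~ log(impact_factor) + log(citations + 1) + log(n_authors) +
    (1 | year) + (1 | area),
  data   = replication_data,
  family = bernoulli(),
  prior  = prior(normal(0, 1), class = "b"),   # weakly informative slopes
  # proponent-prior variant: prior(normal(0.5, 1), class = "b")
  chains = 4, cores = 4
)
```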

The finding that the impact factor is credibly negative is inconsistent with the belief that it is a valid index of quality. Still, proponents of its use may question our results on the grounds that our priors do not adequately capture their prior beliefs. This critique can be easily addressed within the Bayesian framework by incorporating a prior that reflects the assumption that papers in higher-impact-factor journals are of higher quality. We therefore defined a model with priors ∼ N(0.50, 1.00) for the slope parameters for impact factor, citation counts, and number of authors to reflect the proponent's belief that these all reflect key aspects of quality. This model once again finds a credible negative slope for impact factor, b = −0.716, 95% CI (−1.276, −0.178), and no credible relation with citation counts, b = 0.225, 95% CI (−0.019, 0.475), or number of authors, b = 0.193, 95% CI (−0.155, 0.534). Table 1 provides the parameter estimates for each predictor modeled separately. For these models, both year of publication and 'area' within psychology are included as random factors. Although the journal impact factor is negatively related to replication success, here too the Bayesian R² is small and not of practical significance. Nevertheless, the fact that the relation is negative should serve to caution against its use as a positive indicator of research quality.

As noted above, a subset of the data were drawn from the Open Science Collaboration (2015) repository. This dataset included a number of additional variables that might be viewed as reflective of quality, including institutional prestige, the number of citations garnered by the author, and the number of citations to the original paper. As an index of 'novelty', the dataset also includes ratings of surprisingness. Figure 3 displays violin and box plots for the subset of studies in Open Science Collaboration (2015), split by whether they were successfully replicated. We conducted a Bayesian logistic regression using the number of citations to the original paper, the number of citations to the first author, institutional prestige, and surprisingness as predictors of whether the original finding was replicated, with replication success defined based on whether the replication study was statistically significant. Of these indices, only surprisingness, b = −0.663, 95% CI (−1.225, −0.121), predicted whether a finding was successfully replicated. Consistent with results from prediction markets (Camerer et al., 2018), studies with low prior probability—those that would surprise us if true—were less likely to replicate (e.g., extrasensory perception is both highly surprising and its findings are not replicable). Neither the number of citations to the original paper, b = 0.005, 95% CI (−0.003, 0.014), the number of citations garnered by the first author of the original paper, b = −0.00005, 95% CI (−0.0002, 0.00005), nor the institutional prestige of the first author, b = 0.014 (−0.317, 0.274), was predictive of replication success amongst this subset of replications.[6]

[6] We also conducted analyses using the total number of citations to the senior author of each paper and the institutional prestige of the senior author's institution. The conclusions are unchanged.

The above analyses are consistent with the results presented for Questions 1 and 2, in that neither journal impact factor nor citation counts reflect key aspects of research quality. If anything, the evidence seems to suggest that higher-impact-factor journals publish work that is less replicable. Thus, in regard to the question of whether citation counts or journal impact factors are reflective of quality, as defined by replicability, the answer appears to be "no", at least given the data used to address this question.

Table 1
Parameter estimates and leave-one-out (LOO) adjusted R² for each predictor modeled separately. Each model includes the predictor, the grouping factor (area of psychology), and, when appropriate, the offset. Question 3 includes year as a random factor as well.

            DV               Predictor            Effect (95% CI)         Bayesian LOO R²
Question 1  Decision errors  JIF                  −.146 (−.214, −.079)    .021
                             Citations             .004 (−.016, .024)     .021
                             Authors              −.036 (−.079, .008)     .021
                             Year                 −.079 (−.11, −.06)      .022
Question 2  Bayes factor     JIF                  −.056 (−.067, −.044)    .0049
                             Citations            −.014 (−.017, −.011)    .0048
                             Authors              −.010 (−.014, −.006)    .0050
                             Year                  .021 (.018, .025)      .0045
                             Degrees of freedom    .190 (.131, .254)      .0044
Question 3  Replication      JIF                  −.640 (−1.202, −.115)   .022
                             Citations             .068 (−.141, .281)     −.007
                             Authors               .158 (−.187, .503)     −.005


Discussion

Goodhart's Law states that a measure ceases to be a good measure when it becomes a target (Goodhart, 1975). Even if citation counts and journal impact factors were once reasonable proxies for quality (though there is no evidence to suggest this), it is now clear that they have become targets and, as a result, lack validity as measures of faculty performance (Fire & Guestrin, 2019). Our findings are consistent with this conclusion: Neither citation counts nor impact factors were meaningfully related to research quality, as defined by errors in statistical reporting (Question 1), strength of evidence as determined by the Bayes factor (Question 2), or replicability (Question 3). Though there is some evidence in Question 1 that impact factor is inversely related to statistical reporting errors (fewer errors in higher-impact journals), this finding comes with numerous caveats. For example, the magnitude of this relationship was trivially small and completely absent for the subset of articles in journals with impact factors greater than 2.0 and for articles with fewer than 10% decision errors. Nevertheless, it is possible that some aspect of quality could underlie this relation, whether it is due to the quality of the copy-editing team, better reviewers catching obvious statistical reporting errors, or diligence on the part of authors.

More problematic, however, is that some of our analyses indicate that impact factors and citation counts are associated with poorer quality. For instance, articles in journals with higher impact factors are associated with lower evidentiary value (Question 2) and are less likely to replicate (Question 3). Unlike the presence of statistical reporting errors, these latter findings cannot be easily dismissed as typographical errors, as they speak directly to the strength of the statistical evidence and are generally consistent with prior results showing a negative relationship between impact factors and estimates of statistical power in social psychology (Fraley & Vazire, 2014) and cognitive neuroscience (Szucs & Ioannidis, 2017) journals. Regardless of the causal mechanisms underlying our findings, they converge on the emerging narrative that bibliometric data are weakly, if at all, related to fundamental aspects of research quality (e.g., Aksnes et al., 2019; Brembs, 2018; Brembs et al., 2013; Fraley & Vazire, 2014; Nieminen et al., 2006; Patterson & Harris, 2009; Ruano et al., 2018; Szucs & Ioannidis, 2017; J. Wang, Veugelers, & Stephan, 2017; West & McIlwaine, 2002).

Practical relevance

Promotion and tenure is, fundamentally, a process for engaging in personnel selection, and it is important to recognize it as such. In personnel psychology, researchers often look for metrics that predict desired outcomes or behaviors without creating adverse impact. Adverse impact occurs when a performance metric inherently favors some groups over others. In this context, a good metric that does not create adverse impact might have a modest correlation with work performance yet not produce differential preferences. Conscientiousness, for example, has a validity coefficient of only about r = .2 (Barrick & Mount, 1991), but is widely used because it produces minimal adverse impact (Hough & Oswald, 2008). Using predictive validity and the potential for adverse impact to benchmark the usefulness of citation counts and impact factors, it is clear that they fail on both counts. As suggested above, citation counts and journal impact factors likely play an outsized role in hiring and promotion and in predicting future research productivity (see also McKiernan et al., 2019). Yet it is clear from the research presented above and elsewhere that these metrics do not capture key aspects of research quality. Moreover, there is plenty of evidence to suggest that they hold the potential to produce adverse impact. Citation counts are known to be lower for women and underrepresented minorities (Dion, Sumner, & Mitchell, 2018; Dworkin et al., 2020; X. Wang et al., 2020a), and there is some evidence for a negative relationship between impact factor and women's authorship (Bendels, Müller, Brueggmann, & Groneberg, 2018), so hiring on their basis is likely to perpetuate structural bias. Second, the present research highlights that the use of these indicators in this context is predicated on the assumption that they reflect latent aspects of either research quality or impact. But, inasmuch as the accuracy of statistical reporting (decision errors), evidentiary value (Bayes factors), and replicability are key components of research quality, our results clearly undermine this assumption. More problematic, however, is the potential for the misuse of these indicators to ultimately select for bad science. Using an evolutionary computational model, Smaldino and McElreath (2016) showed that the incentive and reward structure in academia can propagate poor scientific methods. Insofar as impact factors and citation counts are used as the basis for hiring and promotion, our analyses, as well as several other recent findings (e.g., Fraley & Vazire, 2014; Szucs & Ioannidis, 2017; J. Wang et al., 2017), are consistent with this model.

The observed inverse relationships between the key indicators of quality on the one hand, and citations and impact factors on the other, have been reported across numerous publications. That is to say, decisions that reward researchers for publishing in high-impact journals and for having high citation counts may actually promote the evolution of poor science (Smaldino & McElreath, 2016) as well as select for only certain types of science or scientists (Chapman et al., 2019). Though our analyses indicate that the consequences might be rather small for any given decision, these selection effects could theoretically accumulate over time.

567 Still, a natural rebuttal to our observations that impact factors are either unrelated or

568 negatively related to quality is that our indices of quality (e.g., evidential strength) do not

569 capture other dimensions such as “theoretical importance” or novelty. One can interpret

570 “surprisingness” (see Question 3) as an indicator of novelty, and we are willing to concede that

571 higher impact journals may well publish more novel research on balance (but see J. Wang

572 et al., 2017, for data suggesting otherwise). Indeed, the ratings of “surprisingness” of

573 studies included in the OSF replication project were positively, though weakly, associated

574 with impact factors, b = 0.151, 95% CI [.011, .293]. However, acceptance of this

575 assumption only strengthens our argument. If papers in higher impact journals are a priori

576 less likely to be true (i.e., because they are novel), then more evidence, not less, would be

577 required to establish their validity. This follows from principles of Bayesian belief updating.
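
To make this reasoning concrete, recall the odds form of Bayes’ rule, in which the posterior odds for a hypothesis equal the Bayes factor multiplied by the prior odds (the numbers in the illustration below are purely hypothetical and are not drawn from our data):

\[
\frac{P(H_1 \mid D)}{P(H_0 \mid D)} \;=\; \underbrace{\frac{P(D \mid H_1)}{P(D \mid H_0)}}_{BF_{10}} \times \frac{P(H_1)}{P(H_0)}
\]

To reach posterior odds of, say, 3:1 in favor of a finding, a claim with even prior odds (1:1) requires a Bayes factor of 3, whereas a surprising claim with prior odds of 1:10 requires a Bayes factor of 30, an order of magnitude more evidence.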

578 Our analyses suggest that the opposite is actually the case. This finding implies the

579 presence of a perverse trade-off in research evaluation: that reviewers and editors for

580 higher impact journals are more likely to trade strength of evidence for perceived novelty.

581 This sort of trade-off runs counter to rational belief updating and thus should be avoided. Regardless

582 of the underlying reasons for this pattern, findings published in higher impact journals are more likely

583 to be read, cited, and propagated in the literature, despite the fact that they may be based on less

584 evidence, may be less likely to be true, and therefore may be less likely to stand the test of time.
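
For reference, an association of the kind reported above can be estimated with a simple Bayesian regression. The sketch below is a minimal R illustration using the brms package (Bürkner, 2017); the data frame osf and the columns surprising and jif are hypothetical placeholders, not our actual analysis code.

library(brms)

# Hypothetical data frame 'osf': one row per original study, with the mean
# rated surprisingness of the finding and the impact factor of the journal.
fit <- brm(surprising ~ jif, data = osf, family = gaussian())

fixef(fit)  # posterior estimate of the slope with a 95% credible interval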

585 Nevertheless, it is important to acknowledge that each of the data sets used in our

586 analyses has certain limitations. For example, the program statcheck was used to curate

587 the data for Questions 1 and 2. This program only extracts statistics reported in written

588 text and in APA format (e.g., statistics reported in tables are excluded) and does not

589 differentiate between statistics reported for “central” hypotheses versus more peripheral

590 hypotheses. In addition, the studies included in the replication projects used for Question 3

591 were not a representative sample of the broader psychology literature. Keeping these limitations in mind, all

592 of our analyses converge on the same point: Impact factors and citation counts may not be

593 useful measures of key aspects of research quality.
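
To illustrate the type of consistency check that statcheck automates, the following minimal R sketch (our illustration with made-up numbers, not the package’s implementation, which parses APA-formatted results from article text and handles several test types) recomputes the p value implied by a reported t statistic and its degrees of freedom and flags decision errors.

# Recompute the two-tailed p value implied by a reported t statistic and df.
recompute_p <- function(t_value, df) {
  2 * pt(abs(t_value), df = df, lower.tail = FALSE)
}

# Made-up reported results: t value, degrees of freedom, and reported p value.
reported <- data.frame(t = c(2.20, 1.40), df = c(28, 55), p = c(.03, .04))
reported$computed_p <- recompute_p(reported$t, reported$df)

# A decision error occurs when the reported and recomputed p values fall on
# opposite sides of the .05 threshold (the second row is such a case).
reported$decision_error <- (reported$p < .05) != (reported$computed_p < .05)
reported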

594 Future directions

595 As a field, we need to reconsider how research is evaluated. We are not alone in this

596 call. Indeed, several recent working groups have begun drafting recommendations for

597 change. For instance, Moher et al. (2018) provide six principles to guide research

598 evaluation. Chief among these principles are eliminating misleading bibliometric

599 indicators such as the journal impact factor and citation counts, reducing emphasis on

600 the quantity of publications, and developing new responsible indicators for assessing

601 scientists (RIASs) that place greater emphasis on good research practices such as

602 transparency and openness. Moher and colleagues’ recommendations also include incentivizing

603 intellectual risk-taking, using open science practices such as sharing data, research

604 materials, and analysis code, and rewarding honest and transparent publication of all

605 research regardless of statistical significance. Some of

606 these recommendations may require fundamental change in how researchers and

607 administrators view publication.

608 Altogether, our analyses support the growing call to reduce or eliminate the role that

609 bibliometrics play in the evaluation system, and line up with recommendations made in the

610 San Francisco Declaration on Research Assessment (DORA, https://sfdora.org/), the

611 Leiden Manifesto (Hicks et al., 2015), and by numerous scholars (Aksnes et al., 2019; Moher

612 et al., 2018). More concretely, we argue that evaluation should focus more on research

613 process and less on outcome, to incentivize behaviors that support open, transparent, and

614 reproducible science (Dougherty, Slevc, & Grand, 2019; Frankenhuis & Nettle, 2018; Moher

615 et al., 2018). Changing the focus of evaluation from what is produced to how faculty

616 conduct their work shifts the incentives toward aspects of the research enterprise that are

617 both under the researcher’s control and conducive to good scientific practices.

618 References

619 Aarts, A., & LeBel, E. (2016). Curate science: A platform to gauge the replicability of

620 psychological science.

621 Aksnes, D. W., Langfeldt, L., & Wouters, P. (2019). Citations, citation indicators, and

622 research quality: An overview of basic concepts and theories. SAGE Open, 9 (1),

623 1–17. Retrieved from https://doi.org/10.1177/2158244019829575

624 AlShebli, B., Makovi, K., & Rahwan, T. (2020). The association between early career

625 informal mentorship in academic collaborations and junior author performance.

626 Nature Communications, 11 (1), 1–8.

627 Barrick, M. R., & Mount, M. K. (1991). The big five personality dimensions and job

628 performance: A meta-analysis. Personnel Psychology, 44 (1), 1-26. Retrieved from

629 https://doi.org/10.1111/j.1744-6570.1991.tb00688.x doi:

630 10.1111/j.1744-6570.1991.tb00688.x

631 Bendels, M. H., Müller, R., Brueggmann, D., & Groneberg, D. A. (2018). Gender

632 disparities in high-quality research revealed by Nature Index journals. PLOS ONE,

633 13 (1), e0189136.

634 Bornmann, L., & Daniel, H.-D. (2008). What do citation counts measure? A review of

635 studies on citing behavior. Journal of Documentation, 64 (1), 45–80. Retrieved from

636 https://doi.org/10.1108/00220410810844150

637 Bornmann, L., Mutz, R., & Daniel, H.-D. (2010). A reliability-generalization study of

638 journal peer reviews: A multilevel meta-analysis of inter-rater reliability and its

639 determinants. PLOS ONE, 5 (12), Article e14331. Retrieved from

640 https://doi.org/10.1371/journal.pone.0014331

641 Brembs, B. (2018). Prestigious science journals struggle to reach even average reliability.

642 Frontiers in Human Neuroscience, 12 , Article 37. Retrieved from

643 https://www.frontiersin.org/article/10.3389/fnhum.2018.00037

644 Brembs, B., Button, K., & Munafò, M. (2013). Deep impact: Unintended consequences of

645 journal rank. Frontiers in Human Neuroscience, 7 , Article 291. Retrieved from

646 https://www.frontiersin.org/article/10.3389/fnhum.2013.00291

647 Bürkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan.

648 Journal of Statistical Software, 80 (1), 1–28. doi: 10.18637/jss.v080.i01

649 Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T.-H., Huber, J., Johannesson, M., . . .

650 Wu, H. (2018). Evaluating the replicability of social science experiments in Nature

651 and Science between 2010 and 2015. Nature Human Behaviour, 2 (9), 637–644.

652 Retrieved from https://doi.org/10.1038/s41562-018-0399-z

653 Chamberlain, S., Zhu, H., Jahn, N., Boettiger, C., & Ram, K. (2019). rcrossref: Client for

654 various ’crossref’ ’apis’ [Computer software manual]. Retrieved from

655 https://CRAN.R-project.org/package=rcrossref (R package version 0.9.2)

656 Chapman, C. A., Bicca-Marques, J. C., Calvignac-Spencer, S., Fan, P., Fashing, P. J.,

657 Gogarten, J., . . . Chr. Stenseth, N. (2019). Games academics play and their

658 consequences: how authorship, h-index and journal impact factors are

659 shaping the future of academia. Proceedings of the Royal Society B: Biological

660 Sciences, 286 (1916), 20192047. Retrieved from

661 https://royalsocietypublishing.org/doi/abs/10.1098/rspb.2019.2047 doi:

662 10.1098/rspb.2019.2047

663 Dion, M. L., Sumner, J. L., & Mitchell, S. M. (2018). Gendered citation patterns across

664 political science and social science methodology fields. Political Analysis, 26 (3),

665 312–327. doi: 10.1017/pan.2018.12

666 Dougherty, M. R., Slevc, L. R., & Grand, J. A. (2019). Making research evaluation more

667 transparent: Aligning research philosophy, institutional values, and reporting.

668 Perspectives on Psychological Science, 14 (3), 361–375. Retrieved from

669 https://doi.org/10.1177/1745691618810693

670 Dworkin, J. D., Linn, K. A., Teich, E. G., Zurn, P., Shinohara, R. T., & Bassett, D. S.

671 (2020, Aug 01). The extent and drivers of gender imbalance in neuroscience reference

672 lists. Nature Neuroscience, 23 (8), 918-926. Retrieved from

673 https://doi.org/10.1038/s41593-020-0658-y doi: 10.1038/s41593-020-0658-y

674 Ebersole, C. R., Atherton, O. E., Belanger, A. L., Skulborstad, H. M., Allen, J. M., Banks,

675 J. B., . . . others (2016). Many labs 3: Evaluating participant pool quality across the

676 academic semester via replication. Journal of Experimental Social Psychology, 67 ,

677 68–82.

678 Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for

679 psychological research. Psychological Review, 70 (3), 193–242. Retrieved from

680 https://doi.org/10.1037/h0044139

681 Epskamp, S., & Nuijten, M. (2014). statcheck: Extract statistics from articles and

682 recompute p values (R package version 1.0.0).

683 Epstude, K. (2017). Towards a replicable and relevant social psychology. Hogrefe

684 Publishing.

685 Etz, A., & Vandekerckhove, J. (2016). A Bayesian perspective on the reproducibility

686 project: Psychology. PLOS ONE, 11 (2), e0149794.

687 Fire, M., & Guestrin, C. (2019). Over-optimization of academic publishing metrics:

688 observing Goodhart’s Law in action. GigaScience, 8 (6). Retrieved from

689 https://doi.org/10.1093/gigascience/giz053

690 Fraley, R. C., & Vazire, S. (2014). The N-pact factor: Evaluating the quality of

691 empirical journals with respect to sample size and statistical power. PLOS ONE,

692 9 (10), 1–12. Retrieved from https://doi.org/10.1371/journal.pone.0109019

693 Francois, O. (2015). Arbitrariness of peer review: A Bayesian analysis of the NIPS

694 experiment.

695 Frankenhuis, W. E., & Nettle, D. (2018). Open science is liberating and can foster

696 creativity. Perspectives on Psychological Science, 13 (4), 439–447.

697 Gargouri, Y., Hajjem, C., Lariviere, V., Gingras, Y., Carr, L., & Brody, T. (2010).

698 Self-selected or mandated, open access increases citation impact for higher quality

699 research. PLOS ONE, 5 (10), e13636.

700 Gelman, A., Goodrich, B., Gabry, J., & Vehtari, A. (2019). R-squared for bayesian

701 regression models. The American Statistician, 73 (3), 307-309. Retrieved from

702 https://doi.org/10.1080/00031305.2018.1549100 doi:

703 10.1080/00031305.2018.1549100

704 Gentil-Beccot, A., Mele, S., & Brooks, T. C. (2009). Citing and reading behaviours in

705 high-energy physics. Scientometrics, 84 (2), 345–355.

706 Goodhart, C. (1975). Problems of monetary management: The U.K. experience. In Papers

707 in monetary economics, 1975. Reserve Bank of Australia.

708 Greenwald, A. G., & Schuh, E. S. (1994). An ethnic bias in scientific citations. European

709 Journal of Social Psychology, 24 (6), 623-639. doi: 10.1002/ejsp.2420240602

710 Hartgerink, C. (2016). 688,112 statistical results: Content mining psychology articles for

711 statistical test results. Data, 1 (3), 14–20.

712 Hicks, D., Wouters, P., Waltman, L., de Rijcke, S., & Rafols, I. (2015). Bibliometrics: The

713 Leiden Manifesto for research metrics. Nature, 520 , 429–431.

714 Hough, L. M., & Oswald, F. L. (2008). Personality testing and industrial–organizational

715 psychology: Reflections, progress, and prospects. Industrial and Organizational

716 Psychology: Perspectives on Science and Practice, 1 (3), 272-290. Retrieved from

717 https://doi.org/10.1111/j.1754-9434.2008.00048.x doi:

718 10.1111/j.1754-9434.2008.00048.x

719 Keirstead, J. (2016). scholar: analyse citation data from google scholar [Computer software

720 manual]. Retrieved from http://github.com/jkeirstead/scholar (R package

721 version 0.1.5)

722 Klein, R. A., Ratliff, K. A., Vianello, M., Adams Jr, R. B., Bahník, Š., Bernstein, M. J., . . .

723 others (2014). Investigating variation in replicability. Social Psychology.

724 Klein, R. A., Vianello, M., Hasselman, F., Adams, B. G., Adams Jr, R. B., Alper, S., . . .

725 others (2018). Many labs 2: Investigating variation in replicability across samples and

726 settings. Advances in Methods and Practices in Psychological Science, 1 (4), 443–490.

727 Langford, J., & Guzdial, M. (2015, March). The arbitrariness of reviews, and advice for

728 school administrators. Commun. ACM, 58 (4), 12–13. Retrieved from

729 https://doi.org/10.1145/2732417 doi: 10.1145/2732417

730 Larivière, V., & Gingras, Y. (2010). The impact factor’s Matthew Effect: A natural

731 experiment in bibliometrics. Journal of the American Society for Information Science

732 and Technology, 61 (2), 424–427. Retrieved from

733 https://onlinelibrary.wiley.com/doi/abs/10.1002/asi.21232

734 Lawrence, N., & Cortes, C. (2014). The NIPS experiment.

735 http://inverseprobability.com/2014/12/16/the-nips-experiment. (Accessed:

736 2021-03-03)

737 Lawrence, S. (2001). Free online availability substantially increases a paper’s impact.

738 Nature, 411 (6837), 521–521. Retrieved from https://doi.org/10.1038/35079151

739 Luarn, P., & Chiu, Y.-P. (2016). Influence of network density on information diffusion on

740 social network sites: The mediating effects of transmitter activity. Information

741 Development, 32 (3), 389-397. Retrieved from

742 https://doi.org/10.1177/0266666914551072

743 Mählck, P., & Persson, O. (2000). Socio-bibliometric mapping of intra-departmental

744 networks. Scientometrics, 49 (1), 81–91. Retrieved from

745 https://doi.org/10.1023/A:1005661208810

746 McKiernan, E. C., Schimanski, L. A., Muñoz Nieves, C., Matthias, L., Niles, M. T., &

747 Alperin, J. P. (2019). Meta-research: Use of the journal impact factor in academic

748 review, promotion, and tenure evaluations. eLife, 8 , e47338. Retrieved from

749 https://doi.org/10.7554/eLife.47338

750 Moher, D., Naudet, F., Cristea, I. A., Miedema, F., Ioannidis, J. P. A., & Goodman, S. N.

751 (2018). Assessing scientists for hiring, promotion, and tenure. PLOS Biology, 16 (3),

752 1–20. Retrieved from https://doi.org/10.1371/journal.pbio.2004089

753 Morey, R. D., Romeijn, J.-W., & Rouder, J. N. (2016). The philosophy of Bayes factors

754 and the quantification of statistical evidence. Journal of Mathematical Psychology,

755 72 , 6–18. Retrieved from https://doi.org/10.1016/j.jmp.2015.11.001 (Bayes

756 Factors for Testing Hypotheses in Psychological Research: Practical Relevance and

757 New Developments)

758 Nieminen, P., Carpenter, J., Rucker, G., & Schumacher, M. (2006). The relationship

759 between quality of research and citation frequency. BMC Medical Research

760 Methodology, 6 (1), 42. Retrieved from https://doi.org/10.1186/1471-2288-6-42


762 Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts,

763 J. M. (2016). The prevalence of statistical reporting errors in psychology

764 (1985–2013). Behavior Research Methods, 48 (4), 1205–1226. Retrieved from

765 https://doi.org/10.3758/s13428-015-0664-2

766 Open Science Collaboration. (2015). Estimating the reproducibility of psychological

767 science. Science, 349 (6251). Retrieved from

768 https://doi.org/10.1126/science.aac4716

769 Paiva, C. E., Lima, J. A. P. d. S. N., & Paiva, B. S. R. (2012). Articles with short titles

770 describing the results are cited more often. Clinics, 67 , 509–513.

771 Patterson, M., & Harris, S. (2009). The relationship between reviewers’ quality-scores and

772 number of citations for papers published in the journal Physics in Medicine and

773 Biology from 2003–2005. Scientometrics, 80 , 343–349. Retrieved from

774 https://doi.org/10.1007/s11192-008-2064-1

775 Peoples, B. K., Midway, S. R., Sackett, D., Lynch, A., & Cooney, P. B. (2016). Twitter

776 predicts citation rates of ecological research. PLOS ONE, 11 (11), 1–11. Retrieved

777 from https://doi.org/10.1371/journal.pone.0166570

778 Peters, D. P., & Ceci, S. J. (1982). Peer-review practices of psychological journals: The

779 fate of published articles, submitted again. Behavioral and Brain Sciences, 5 (2),

780 187–255. doi: 10.1017/S0140525X00011183

781 Petersen, A. M., Fortunato, S., Pan, R. K., Kaski, K., Penner, O., Rungi, A., . . .

782 Pammolli, F. (2014). Reputation and impact in academic careers. Proceedings of the

783 National Academy of Sciences, 111 (43), 15316–15321.

784 Reinero, D. A., Wills, J. A., Brady, W. J., Mende-Siedlecki, P., Crawford, J. T., & Bavel,

785 J. J. V. (2020). Is the political slant of psychology research related to scientific

786 replicability? Perspectives on Psychological Science, 15 (6), 1310-1328. Retrieved

787 from https://doi.org/10.1177/1745691620924463 (PMID: 32812848) doi:

788 10.1177/1745691620924463

789 Ruano, J., Aguilar-Luque, M., Isla-Tejera, B., Alcalde-Mellado, P., Gay-Mimbrera, J.,

790 Hernández-Romero, J. L., . . . García-Nieto, A. V. (2018). Relationships between

791 abstract features and methodological quality explained variations of social media

792 activity derived from systematic reviews about psoriasis interventions. Journal of

793 Clinical Epidemiology, 101 , 35–43. Retrieved from

794 https://doi.org/10.1016/j.jclinepi.2018.05.015

795 Ruscio, J. (2016). Taking advantage of citation measures of scholarly impact: Hip hip h

796 index! Perspectives on Psychological Science, 11 (6), 905–908.

797 Schweinsberg, M., Madan, N., Vianello, M., Sommer, S. A., Jordan, J., Tierney, W., . . .

798 others (2016). The pipeline project: Pre-publication independent replications of a

799 single laboratory’s research pipeline. Journal of Experimental Social Psychology, 66 ,

800 55–67.

801 Seglen, P. (1997). Why the impact factor of journals should not be used for evaluating

802 research. BMJ (Clinical research ed.), 314 (7079), 498–502. Retrieved from

803 https://europepmc.org/articles/PMC2126010

804 Silberzahn, R., Uhlmann, E. L., Martin, D. P., Anselmi, P., Aust, F., Awtrey, E., . . . others

805 (2018). Many analysts, one data set: Making transparent how variations in analytic

806 choices affect results. Advances in Methods and Practices in Psychological Science,

807 1 (3), 337–356.

808 Simons, D. J., Holcombe, A. O., & Spellman, B. A. (2014). An introduction to registered

809 replication reports at perspectives on psychological science. Perspectives on

810 Psychological Science, 9 (5), 552-555. Retrieved from

811 https://doi.org/10.1177/1745691614543974 (PMID: 26186757) doi:

812 10.1177/1745691614543974

813 Smaldino, P. E., & McElreath, R. (2016). The natural selection of bad science. Royal

814 Society Open Science, 3 (9), 160384. Retrieved from

815 https://royalsocietypublishing.org/doi/abs/10.1098/rsos.160384

816 Sternberg, R. J. (2016). “Am I Famous Yet?” Judging scholarly merit in psychological

817 science: An introduction. Perspectives on Psychological Science, 11 (6), 877–881.

818 Szucs, D., & Ioannidis, J. P. A. (2017). Empirical assessment of published effect sizes and

819 power in the recent cognitive neuroscience and psychology literature. PLOS Biology,

820 15 (3), 1–18. Retrieved from https://doi.org/10.1371/journal.pbio.2000797

821 Wang, J., Veugelers, R., & Stephan, P. (2017). Bias against novelty in science: A

822 cautionary tale for users of bibliometric indicators. Research Policy, 46 (8),

823 1416–1436. Retrieved from

824 http://www.sciencedirect.com/science/article/pii/S0048733317301038

825 Wang, X., Dworkin, J., Zhou, D., Stiso, J., Falk, E., Zurn, P., . . . Lydon-Staley, D. M.

826 (2020a). Gendered citation practices in the field of communication.

827 Wang, X., Dworkin, J., Zhou, D., Stiso, J., Falk, E. B., Zurn, P., . . . Lydon-Staley, D. M.

828 (2020b, Sep). Gendered citation practices in the field of communication. PsyArXiv.

829 Retrieved from psyarxiv.com/ywrcq doi: 10.31234/osf.io/ywrcq

830 West, R., & McIlwaine, A. (2002). What do citation counts count for in the field of

831 addiction? An empirical evaluation of citation counts and their link with peer ratings

832 of quality. Addiction, 97 (5), 501–504. Retrieved from

833 https://onlinelibrary.wiley.com/doi/abs/10.1046/j.1360-0443.2002.00104.x

834 Zhou, Z. Q., Tse, T. H., & Witheridge, M. (2019). Metamorphic robustness testing:

835 Exposing hidden defects in citation statistics and journal impact factors. IEEE

836 Transactions on Software Engineering, 1–1. Retrieved from

837 https://doi.org/10.1109/TSE.2019.2915065