Psychological Science replicates just fine, thanks

Abstract
This manuscript aims to make two things clear. First, perhaps (to improve matters) Psychological Science should provide pupils (and others) with proper tools, and a basis. Perhaps Psychological Science should be more aware of, acknowledge, focus on, train, and try and optimize what possibly underlies, permeates, and/or influences nearly everything mentioned in this manuscript: reasoning, logic, and argumentation. Second, it is hoped that this manuscript makes clear that there might be a distinct possibility that, seen from several different perspectives, in several different ways, and on several different levels, Psychological Science replicates just fine.

Keywords: reasoning, logic, argumentation, replication, open science, crisis of confidence, replication crisis, education, academia, university

A brief scientific introduction
In 2012, Doyen et al. (2012) published a paper concerning a “failed” replication of a study by Bargh et al. (1996). Later that year, Bargh (2012) wrote a blog post titled “Priming effects replicate just fine, thanks”, responding to an article (and more general concerns) regarding the replicability of so-called “priming” effects. In the blog post, Bargh (2012) mentions several “successful” replication attempts. These included “direct” replications, like the Doyen et al. (2012) study, which can be viewed as a repetition of an experimental procedure (cf. Schmidt, 2009, p. 91). And they included “conceptual” replications, which can be viewed as a repetition of a test of a hypothesis or a result of earlier research work with different methods (cf. Schmidt, 2009, p. 91). The publication of Doyen et al.’s (2012) “failed” replication, and Bargh’s (2012) reaction in the form of his blog post, can be placed in a larger series of events that all happened around 2011-2012 and caused quite a stir in (parts of) the world of Psychological Science. Around that time, a paper was published that presented evidence, based on studies that involved commonly used research practices, for what can be considered to be “impossible” (Bem, 2011). As LeBel & Peters (2011) state: “By using accepted standards for experimental, analytic, and data reporting practices, yet arriving at a fantastic conclusion, Bem has put empirical psychologists in a difficult position (…)” (p. 371). Additionally, a paper was published that made clear that certain research practices can lead to finding “statistically significant” results for just about anything (Simmons et al., 2011), and another paper presented findings indicating that (some of) these “questionable research practices” may be widespread and commonly used (John et al., 2012). On top of all this, it became known in 2011 that a certain psychological scientist had engaged in fraudulent activities for years on end, leading to an investigation and report concerning the matter (Levelt et al., 2012). This report included several findings and/or statements that could further lead one to question the validity of certain findings, the capability and/or intention of certain scientists, and the general research culture of certain parts of Psychological Science. Finally, at the end of 2012, a special replication-issue (also mentioned in the Levelt et al. report) was published that included papers regarding replication, the possible “replication crisis”, more general problematic issues in Psychological Science, and the possibility of the existence of “a crisis of confidence” regarding the reliability and validity of research findings in the field (cf. Pashler & Wagenmakers, 2012).

“Psychological Science replicates just fine, thanks”: Version and/or level I

A brief personal introduction
In 2011-2012 I graduated from university (“Research Master in Behavioural Science”) and subsequently encountered information (e.g. from papers in the special replication-issue mentioned above) that I thought should have been part of my education but wasn’t. Mostly from outside academia, and not employed in science, I started trying to help improve matters by engaging in online discussions, posting ideas and manuscripts regarding possible improvements, joining efforts that could possibly help better things, etc. It is now about 10 years later, and at this point in time, from a certain perspective, I would say that my initial enthusiastic and hopeful outlook regarding the attempts to try and improve matters has been replaced by an increasingly frustrated and cynical one. Concerning the latter, perhaps these current gloomy thoughts replicate the thoughts I had when first encountering information about the possibly problematic issues in Psychological Science some 10 years ago. During these 10 years or so, I increasingly began to notice and/or worry that some of the possibly problematic issues and processes might still be ignored and/or neglected. I also increasingly began to notice and/or worry that the possibly problematic issues and processes might be more or less “conceptually” replicated in the form of “new” proposals and changes. At a certain point in time, the idea emerged that it might be useful and/or amusing in some way, shape, or form to write a manuscript about the things I noticed and worried about. I also had the title of this possible manuscript in mind (“Psychological Science replicates just fine, thanks”), remembering the title of Bargh’s 2012 blog post (“Priming effects replicate just fine, thanks”). I reasoned that this title would hopefully, and perhaps even likely, fit the approach, content, and style of writing that comes most naturally to me, is most enjoyable for me, and might therefore be most useful. On with the list then.

Replication of the presence of “conceptual” replications?
There may have been a sub-optimal emphasis on “conceptual” replications compared to “direct” replications in Psychological Science, although it may sometimes be hard to determine whether something is a conceptual or a direct replication (cf. Schmidt, 2009). In what follows, I will list some of the things, issues, and processes that I have noticed in the last 10 years or so that are possibly replications of other things, issues, and processes. Some of these things, issues, and processes seem to me to be problematic. Please note that the term “replication” is used very loosely in this context, and will mostly refer to a more or less “conceptual” replication where (viewed from a certain perspective) some (possibly crucial) part of something is possibly being replicated in some way, shape, or form. The following (“conceptual”) replications can perhaps be seen as being in line with the possibility that “Psychological Science replicates just fine”, at least seen from a certain perspective.

Replication of the “crisis of confidence”?
The accumulation and combination of the events of 2011-2012 mentioned in the introduction above may have contributed to the declaration of a “replication crisis” or a “crisis of confidence” (cf. Pashler & Wagenmakers, 2012) regarding the reliability, replicability, and/or validity of research findings in (parts of) Psychological Science. It might, however, be noteworthy that this may not be the first and/or only (type of) crisis Psychological Science has faced. The following two introductory sentences from two specific papers might be seen as especially illustrative in this regard. A paper by Lewin (1977) states the following: “This paper examines the crisis in social psychology in the light of attempts to solve the same problems fifty years ago. The fundamental question is, are the phenomena of social psychology lawful?” (p. 159). Sturm & Mülberger (2012) start their introduction with: “This special issue is devoted to the analysis of discussions of the crisis in psychology that took place from the 1890s through to the mid-1970s.” (p. 425).

Replication of “mindless statistics”?
Possibly encouraged and/or facilitated by the Simmons et al. (2011) paper and the declaration of a “replication crisis” or a “crisis of confidence” mentioned in the introduction, renewed attention to the possible problems of null-hypothesis significance testing (NHST) has since emerged. For instance, Gigerenzer (2018) writes in the abstract of his paper: “Here, I want to draw attention to a complementary internal factor, namely, researchers’ widespread faith in a statistical ritual and associated delusions (the statistical-ritual hypothesis).” (p. 198). The statistical ritual Gigerenzer (2018) refers to is “(…) an incoherent mishmash of ideas from Fisher on the one hand and Neyman and Pearson on the other, spiked with a characteristically novel contribution: the elimination of the researchers’ judgement.” (p. 202). Gigerenzer (2004) makes clear that null hypothesis testing received criticism decades ago: “The clinical psychologist Paul Meehl (1978, p. 817) called routine null hypothesis testing “one of the worst things that ever happened in the history of psychology” (…)” (p. 591). And as Cumming (2014) states in his paper: “(…) for more than half a century, scholars have been publishing cogent critiques of NHST, documenting the damage it does, and urging change. There have been very few replies, but also little reduction in reliance on NHST.” (p. 26).

The recent papers that have outlined problematic issues regarding NHST have also come with several proposals to attempt to improve matters. For instance, Cumming (2014) mentions the following in the abstract of his paper: “Second, in response to renewed recognition of the severe flaws of null-hypothesis significance testing (NHST), we need to shift from reliance on NHST to estimation and other preferred techniques.” (p. 7). Furthermore, he concludes that “(…) best research practice is not to use NHST at all; we should strive to adopt best practice, and therefore should simply avoid NHST and use better techniques.” (Cumming, 2014, p. 26). Gigerenzer (2018) recommends that editors should no longer accept manuscripts that report results as “significant” or “not significant”, and that psychology departments should teach the statistical toolbox and not a statistical ritual (p. 213). In addition to these recommendations, however, it has also been proposed that statistical significance should be redefined by changing the default P-value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries (Benjamin et al., 2018). Perhaps the proposal by Benjamin et al. (2018) can be seen as a (“conceptual”) replication of the possibly sub-optimal reaction to (decades-old) criticism regarding NHST (cf. Cumming, 2014; Gigerenzer, 2004, 2018). And perhaps the proposal by Benjamin et al. (2018), and its possible consequences, can be seen as a (possible) replication of several problematic issues in Psychological Science. For instance, regarding the proposal, Crane (2017) makes clear that “There are plausible scenarios under which the lower cutoff will make the replication crisis worse.” (p. 1), and “By accounting for the effects of P-hacking, we see that the claimed benefits to false-positive rate and replication rate are much less certain than suggested (…)” (p. 14). In addition, Amrhein & Greenland (2018) write the following: “In sum, lowering significance thresholds will aggravate several biases caused by significance testing.” (p. 4). Furthermore, regarding the statement by Benjamin et al.
(2018) that “The new significance threshold will help researchers and readers to understand and communicate evidence more accurately.” (p. 8), Gigerenzer (2018) notes that “(…) I do not see how it would improve understanding and eradicate the delusions documented in Tables 1 and 2.” (p. 213). And finally, one could wonder how changing the default P-value threshold from 0.05 to 0.005 relates to statements like: “Yet one can find experienced researchers who proudly report that they have studied several hundreds or even thousands of subjects and found a highly significant mean difference in the predicted direction, say p<0.0001.” (Gigerenzer, 2004, p. 601), and “How big this effect is, however, is not reported in some of these articles. The combination of large sample size and low p-values is of little value in itself.” (Gigerenzer, 2004, p. 601).
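To make this last point concrete, the following is a minimal, purely illustrative simulation sketch (my own hypothetical example in Python, not taken from any of the cited papers): with a (very) large sample, even a negligible true effect comfortably clears both the conventional 0.05 threshold and the proposed 0.005 threshold, while the effect size itself remains trivial.

```python
# Illustrative sketch (hypothetical example, not from the cited papers):
# with a very large sample, a negligible true effect easily passes p < .005,
# yet the effect size stays trivial -- echoing Gigerenzer's (2004) point that
# "large n plus low p" is of little value in itself.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n = 1_000_000                        # participants per group (hypothetical)
true_effect = 0.02                   # true mean difference of 0.02 SD units

group_a = rng.normal(loc=0.0, scale=1.0, size=n)
group_b = rng.normal(loc=true_effect, scale=1.0, size=n)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_sd

print(f"p = {p_value:.2e}")           # far below 0.005 for a sample this large
print(f"Cohen's d = {cohens_d:.3f}")  # ~0.02, a practically negligible effect
```

Whether such a result would be worth a claim of a “new discovery” under either threshold is, of course, exactly the question the quoted statements raise.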

Replication of possibly sub-optimal (depiction of) reasoning and argumentation?
As mentioned in the introduction, so-called “priming” effects have been the topic of discussion regarding the reliability and validity of (some) research findings. They are also an important part of a specific paper (Katzko, 2006) that examined “(…) the structure of empirical arguments of some representative research in a field of psychology in sufficient detail so as to isolate precisely those steps in the arguments where flaws appear.” (p. 211), and found that “(…) an overly simplified interpretation of the data is propagated through the literature (…)” (p. 211). When first hearing of the possible problems in Psychological Science (or should I say academia) around 2011-2012, I would often read about “the incentives”, or the “publish or perish” system. I have never fully understood the connection between “the incentives” and all the problematic issues I had encountered, and increasingly feared that the argumentation concerning this all might often be (depicted) sub-optimal (-ly) at best (also see Yarkoni, 2018). For instance, regarding the reasoning and/or explanation concerning the possible problems, I wondered, and still wonder, why universities would even care how many papers (future) members of their faculties publish, and where they publish them. And I wondered, and still wonder, how the “publish or perish” perspective and/or explanation concerning the possible problems relates to the behaviour of tenured psychology professors, who I reason cannot “perish” anymore. Perhaps the following sentences from Smaldino & McElreath (2016), who among other things developed and analyzed a “dynamical population model”, can be seen as a useful example of what I am trying to make clear. When trying to follow and comprehend (parts of) the reasoning in this particular paper concerning “the incentives”, I particularly noticed the following three excerpts:

(1) “Poor research design and data analysis encourage false-positive findings. Such poor methods persist despite perennial calls for improvement, suggesting that they result from something more than just misunderstanding” (p. 1)

(2) “This paper argues that some of the most powerful incentives in contemporary science actively encourage, reward, and propagate poor research methods and abuse of statistical procedures. We term this process the natural selection of bad science to indicate that it requires no conscious strategizing nor cheating on the part of researchers.” (p. 2)

(3) “Campbell’s Law, stated in this paper’s epigraph, implies that if researchers are incentivized to increase the number of papers published, they will modify their methods to produce the largest possible number of publishable results rather than the most rigorous investigations.” (p. 4)

Excerpt 1 suggests to me that (at least some) researchers are probably aware of what “poor methods” are. This interpretation is subsequently strengthened by the following sentence in Smaldino & McElreath (2016) on page 5: “Why does low power, a conspicuous and widely appreciated case of poor research design, persist?”. If I am interpreting this correctly, and researchers may very well be aware of poor methods (like low power) but are (according to excerpt 3) possibly modifying their methods to produce the largest number of publishable results, would this not suggest that they could very well (in a way) engage in conscious strategizing and/or cheating (contrary to what excerpt 2 seems to suggest)? Thankfully (for my sense of sanity at least), it appeared to me that Smaldino & McElreath (2016) seem to be aware of, and indeed acknowledge, this option: “First, researchers may respond directly to incentives and strategically reason that these poor methods help them maximize career success.” (p. 5). But why then write the following only a few sentences later: “Our working assumption is that most researchers have internalized scientific norms of honest conduct and are trying their best to reveal true explanations of important phenomena. However, the evidence available is really insufficient.” (p. 6)? And why then call your paper “The natural selection of bad science”, and write “We term this process the natural selection of bad science to indicate that it requires no conscious strategizing nor cheating on the part of researchers.” (p. 2)? Although sentences like these might be technically correct when viewed individually and independently, their use, placement, and combination in the text lead me to conclude that, when it comes to “the incentives” and/or the problems in Psychological Science, the reasoning and argumentation remain (too) vague and unclear. This all also reminded me of Katzko (2006), who writes:

“More important, the logical error at the level of a single experiment is being propagated in the literature through a conventional citation of substantive hypotheses and without any check on the logical integrity of the argument which links the data to those hypotheses. (…). The conclusion is, that the way in which research is presented may be the way in which its significance is perceived and propagated through the literature.” (p. 222)

Maybe the above-mentioned argumentation in Smaldino & McElreath (2016) is (depicted) sub-optimal (-ly), and/or maybe my comprehension and judgment of it is (depicted) sub-optimal (-ly). All I know is that after reading these, and other similar, sentences for the Nth time, I’m beginning to wonder whether (parts of) Smaldino & McElreath (2016) might be intended as sarcasm, or something along those lines. Regardless, perhaps all of the above makes clear that (the depiction of) reasoning, logic, and argumentation (and/or the use of sarcasm in scientific writing) should receive much more attention. As Katzko (2006) states: “(…) sound reasoning is as much a part of a scientist’s methodological toolkit as are procedures for data collection and analysis.” (p. 210), and “A more general remedy, aimed at prevention, would be to educate researchers both in the proper use of logical argument and in the detection of such logical improprieties in the published literature.” (p. 220).

Replication of possible improvements and/or solutions?
Sound reasoning and argumentation also seem crucial to me regarding both the proper diagnosis of a problem and the subsequent attempt at coming up with possible improvements and solutions. Perhaps similar sub-optimal (comprehension and judgment of) reasoning and argumentation, as expressed in the section above, can also be found in some of the papers with regard to (the connection between) problematic issues and proposed solutions in Psychological Science. For instance, one of the papers (Nosek et al., 2012) in the special replication-issue mentioned in the introduction above contains a heading that reads “The ultimate solution: opening data, materials, and workflow” (p. 623). This heading puzzles me to this day, because it is not clear to me how, why, and for what exactly opening data, materials, and workflow are “the ultimate solution” (e.g. see also Gelman’s 2017 paper titled “Ethics and statistics: Honesty and transparency are not enough” in this regard). Regardless of the above, perhaps some of the proposed improvements and solutions can be seen as more or less “conceptual” replications of earlier proposals and efforts. For instance, open practices badges (cf. Kidwell et al., 2016) can be offered by journals to acknowledge open practices (e.g. making data openly available). As Kidwell et al. (2016) state: “Those who apply for a badge and meet open data or open materials specifications receive the corresponding badge symbol at the top of their paper (…)” (p. 3). As Rowhani-Farid & Barnett (2018) make clear, the open practices badges resemble earlier efforts by the journal Biostatistics. The reproducibility policy of the journal Biostatistics (cf. Peng, 2009) “(…) rewarded articles with data available with the letter D on the front page of the published article PDF, articles with code available with a C, and articles with data and code available and which were tested for reproducibility by the AER an R for reproducibility.” (Rowhani-Farid & Barnett, 2018, p. 3). Another possible example of a more or less “conceptual” replication of earlier proposals and efforts is “registered reports” (cf. Chambers et al., 2014). Registered reports are thought of and/or promoted as a way to address several possibly poor methods and problematic issues, including issues related to a lack of transparency (e.g. concerning research design and statistical analysis). This is done by (among other things) having the journal review a study protocol, and possibly grant an in-principle acceptance (IPA), before the study has been conducted. As Hardwicke & Ioannidis (2018) make clear, registered reports resemble earlier efforts by the journal The Lancet: “Interestingly, in 1997, The Lancet introduced a publication pathway for clinical trials that was similar to registered reports.” (p. 793).

Replication of possibly poor methods, problematic issues, and/or sub-optimal processes?

Lack of transparency: The implementation of the above-mentioned registered reports has been investigated by Hardwicke & Ioannidis (2018), and they found (a perhaps ironic) lack of transparency. As Hardwicke & Ioannidis (2018) write: “It was not straightforward to establish the existence of individual registered reports because many IPA protocols are not publicly available, protocols are not formally registered and final reports are not clearly identified as having been prior registered reports.” (p. 794). These findings detailing a lack of transparency may in themselves also be a replication of findings concerning the above-mentioned similar efforts by the journal The Lancet. As Hardwicke & Ioannidis (2018) make clear concerning these earlier efforts by The Lancet: “During its application, it was noted that some trials had deviated substantially from their pre-specified outcomes and analyses, and full protocols were not always publicly available.” (p. 793).

“Sloppy science”: The above can perhaps also be seen as an example of what some have termed “sloppy science”. In the introduction of this manuscript, it is mentioned that it became known in 2011 that a certain psychological scientist had engaged in fraudulent activities. This led to a subsequent investigation and report concerning the matter (Levelt et al., 2012), in which the term “sloppy science” was used to describe the findings of the investigation Committees. The following excerpts might be illustrative in this regard. Levelt et al. (2012) mention that “The Committees were forced increasingly to the conclusion that, even in the absence of fraud in the strict sense, there was a general culture of careless, selective and uncritical handling of research and data.” (p. 47). And: “It involved a more general failure of scientific criticism in the peer community (…)” (Levelt et al., 2012, p. 47). And finally: “Virtually nothing of all the impossibilities, peculiarities and sloppiness mentioned in this report was observed by all these local, national and international members of the field, and no suspicion of fraud whatsoever arose.” (Levelt et al., 2012, p. 53). Another possible example of the replication of “sloppy science” is the following. One of the journals that adopted the open practices badges (mentioned elsewhere in this manuscript) has been investigated with regard to the analytic reproducibility of articles receiving an open data badge. Some of the findings of this investigation by Hardwicke et al. (2021) read as follows: “Less frequently, we encountered typographical errors and some issues related to data-files, including erroneous or missing data.” (p. 4). Hardwicke et al. (2021) note that “Importantly, none of the reproducibility issues we encountered appeared seriously consequential for the conclusions stated in the original articles (…)” (p. 6), but also note that “Nevertheless, non-reproducibility highlighted fundamental quality control and documentation failures during data management, data analysis and reporting.” (p. 6). The authors further note that these findings are consistent with the findings of a previous study (Hardwicke et al., 2018).

Ignoring and/or neglecting possibly important issues: Many of the recent papers regarding the possible “crisis of confidence” feature sentences that indicate that problematic issues have been pointed out before, but were ignored and/or neglected.
For instance, Smaldino & McElreath (2016) write the following: “We show that despite over 50 years of reviews of low statistical power and its consequences, there has been no detectable increase.” (p. 3). That other possibly problematic issues might currently also still be ignored and/or neglected can perhaps be concluded from the following examples. Gelman (2017) writes: “Honesty and transparency are not enough, though; I worry that the push toward various desirable procedural goals can make people neglect the fundamental scientific and statistical problems that, ultimately, have driven the replication crisis.” (p. 38). And as Eronen & Bringmann (2021) state in the abstract of their paper: “Meehl argued in 1978 that theories in psychology come and go, with little cumulative progress. We believe that this assessment still holds, as also evidenced by increasingly common claims that psychology is facing a “theory crisis” and that psychologists should invest more in theory building.” (p. 1).

Some pondering: Version and/or level I
There are more possibly problematic issues or processes that I have noticed which might currently be ignored, neglected, and/or (“conceptually”) replicated. And there are proposed improvements, changes, or solutions that I have noticed which I fear might not solve anything and/or perhaps even make things worse. With regard to the latter, I would like to point to Table 1 of a paper by Edwards & Roy (2017), which makes clear that the intended (“positive”) effects of “incentives”, rewards, and changes may differ greatly from their actual (“negative”) effects. I will formulate some of these worries in the form of questions to indicate that they are (more or less) rudimentary thoughts that have not been meticulously checked and/or worked out further. I did not want to do that, and I think it might not be necessary. On with the pondering then. If the problems (concerning the possibly negative incentive structure) in academia/science are not solved at this moment in time, does this imply that any proposed improvement coming from and/or involving people and/or entities that are part of academia/science might suffer from the same problematic issues that it is supposed to solve? Is coming up with some new proposal to “improve matters” a (new) way for (certain types of) scientists and universities to receive attention, and as many research funds as possible (cf. Binswanger, 2014, p. 53)? Is it fair and/or a good idea to let people, institutions, and/or entities try and come up with possible solutions for possible problems they themselves may have (co-) created? Is it possible that letting people, institutions, and/or entities try and come up with possible solutions for possible problems they themselves may have (co-) created might resemble digging ditches and then filling them up again (cf. Binswanger, 2014, p. 70)? Do these possible solutions for possible problems increase or decrease research bureaucracy (cf. Binswanger, 2014)? Is the university lost (cf. McFarlane, 2013)? Is it fair and/or “cooperative” when a smaller group of people works together with a larger group of people, but the smaller group benefits disproportionately from this “cooperation” (e.g. in the form of receiving grants, status, influence, power, etc.)? Is it wise, from a scientific perspective, to have a smaller group of people work together with a larger group of people, but let the smaller group have disproportionately more influence and power (e.g.
in the form of picking a few specific options that the larger group can only choose from)? Is there a point at which it can be considered unethical to involve (naive?) young and/or junior scientists in academic systems and projects (cf. Afonso, 2014; Kun, 2018)? Can the process of many different (kinds of) psychological scientists designing, performing, writing, and publishing different (kinds of) studies and papers so others can read, use, and cite them be considered “cooperative” and/or good for Psychological Science? If a project possibly interferes with this process by, for instance, asking and receiving a disproportionate amount of attention and/or resources, is this project “cooperative” and/or good for Psychological Science? Is there a risk that (unnecessarily?) large projects by centers, networks, and clusters result in the crowding out of intrinsic motivation by stick and carrot, the crowding out of unconventional people and approaches by the mainstream, the crowding out of quality by quantity, the crowding out of content by form, and the crowding out of research by bureaucracy (cf. Binswanger, 2014, p. 66)? Is Psychological Science, and/or should Psychological Science be, a meritocracy, and if so, how could Psychological Science make sure the right people get to where they deserve to and/or should be? Do psychological scientists have a moral obligation (cf. Popper, 1971) to society at large to make sure the right people get to where they deserve to and/or should be? How do you make sure the best thoughts, ideas, reasoning, hypotheses, theories, and research are produced? Do psychological scientists have a moral obligation (cf. Popper, 1971) to society at large to make sure the best thoughts, ideas, reasoning, hypotheses, theories, and research are produced? Is an appeal to the majority just as unscientific as an appeal to authority? How do things like open data, and its possible use, relate to a sentence like: “(…) one always needs a fresh data set to test one’s hypothesis.” (Wagenmakers et al., 2012, p. 633)? How do things like open data, and its possible use, relate to a sentence like: “(…) the interpretation of common statistical tests in terms of Type I and Type II error rates is valid only if the data were used only once and if the statistical test was not chosen on the basis of suggestive patterns in the data.” (Wagenmakers et al., 2012, p. 633)? In light of the above two quotes, and the (statistical) reasoning behind them: if you split a data-set into an “exploratory” (data-set 1) and a “confirmatory” (data-set 2) subset (cf. Wagenmakers et al., 2012, p. 635), is data-set 2 (truly) a “fresh” data-set and/or are you (truly) using the data-set only once? When splitting a data-set into an “exploratory” and a “confirmatory” subset, is it likely that these subsets resemble each other and/or produce similar statistical findings when the data-set is (very) large (e.g. concerning the number of participants)? (A brief simulation sketch concerning this appears at the end of this pondering section.) If it can be considered statistically valid and sound to split a data-set in two in order to get an “exploratory” and a “confirmatory” subset, can it also be considered statistically valid and sound to split a data-set in three in order to get an “exploratory”, “confirmatory”, and “replication” subset?
If it can be considered statistically valid and sound to split a data-set in two in order to get an “exploratory” and a “confirmatory” subset, can it also be considered statistically valid and sound to split a data-set with, for example, 20 measured variables into 20 separate “confirmatory” single-variable subsets? If it can be considered useful and/or important to pre-register studies, and to hereby be specific and clear concerning (the exact number and specifics of) things like participants, variables, measures, experiments, statistical analyses, etc., does this also hold for (the exact number and specifics of) the involved labs in multiple-lab efforts? Is there an important difference between open data (cf. Simmons et al., 2014) with or without available pre-registration information concerning the design and analysis of the corresponding study (cf. Simmons et al., 2011)? If writing and/or adhering to pre-registration information can be considered hard, does this suggest that writing and/or adhering to a grocery list can be considered hard as well? How exactly do you “(…) discard publishing as a meaningful incentive.” (Nosek et al., 2012, p. 623) when using things like pre-prints but at the same time still upholding (and perhaps even reinforcing) the “typical” journal system? Might things like “moderators” for pre-prints, “peer-review” for pre-prints, “applaud-buttons” next to pre-prints, connecting pre-prints with a submission-option to “typical” journals, and/or sharing-options on (a-)social media next to pre-prints facilitate and/or encourage possibly scientifically damaging processes concerning peer-review and/or usage of these pre-prints (cf. Binswanger, 2014; Crane & Martin, 2018a, 2018b; Smith, 2006)? If reviewers and editors of scientific journals may have encouraged irregular and/or unscientific practices in the past (cf. Levelt et al., 2012, p. 53), if the peer-review process as a quality assurance instrument can be questioned (cf. Binswanger, 2014, p. 56), and if the competition for publication may result in potential authors pleasing the reviewer (cf. Binswanger, 2014, p. 56), is it a good idea to involve reviewers and editors in earlier phases of research design and review (cf. Chambers et al., 2014)? If reviewers have been helping out with the design of research (cf. Chambers et al., 2014), are they co-authors and/or should they be made co-authors? If reviewers have been helping out with the design of research (cf. Chambers et al., 2014), how does this relate to the often touted “independent” and “objective” role of peer-review? Is there a risk that “meta-science” can be used (in the short term) to “nudge” or “steer” (parts of) Psychological Science in a certain direction without much and/or sufficient thought concerning the soundness, desirability, and/or validity (in the long term) of the object or process under investigation? Can “meta-scientific” research have similar problems as “normal” research concerning its validity, conflicts of interest, soundness of design and conclusions, etc.? Are there psychological scientists who think they (in their role as psychological scientists) have the knowledge, wisdom, and right to (attempt to) “nudge” or “steer” people and/or societies in a certain direction, and if so, why do they think that? Can Psychological Science and/or psychological scientists make things “worse” when trying to make things “better”?
Is it (at this point in time) indeed (still) the case that “(…) most scientists, at least most creative scientists, value independent and critical thinking very highly.” (Popper, 1971, p. 283)? If honesty and transparency are perhaps not enough (cf. Gelman, 2017), could it be that having (or thinking and/or saying you have) good intentions is perhaps not enough either? Should “first, do no harm” (primum non nocere) be a principle and/or perspective that receives much more attention in Psychological Science? Is “wisdom” a thing in Psychological Science, and/or should it be? I hope this will suffice.
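As an aside to the data-splitting questions raised above (and as announced there), the following is a minimal, purely illustrative simulation sketch (my own hypothetical example in Python, not taken from Wagenmakers et al., 2012, or any other cited paper): when a data-set is very large, a randomly selected “confirmatory” half will usually tell nearly the same statistical story as the “exploratory” half it was separated from.

```python
# Illustrative sketch (hypothetical example): randomly splitting one large
# data set into an "exploratory" and a "confirmatory" half tends to yield two
# halves that show almost the same association and an equally tiny p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
n = 100_000                                   # hypothetical large sample
x = rng.normal(size=n)
y = 0.1 * x + rng.normal(size=n)              # weak true association

shuffled = rng.permutation(n)                 # random split into two halves
exploratory = shuffled[: n // 2]
confirmatory = shuffled[n // 2:]

r_explore, p_explore = stats.pearsonr(x[exploratory], y[exploratory])
r_confirm, p_confirm = stats.pearsonr(x[confirmatory], y[confirmatory])

print(f"exploratory half:  r = {r_explore:.3f}, p = {p_explore:.1e}")
print(f"confirmatory half: r = {r_confirm:.3f}, p = {p_confirm:.1e}")
```

Whether this makes the “confirmatory” subset a truly “fresh” data-set, and the second analysis a truly independent test, is of course exactly the kind of question posed above.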

“Psychological Science replicates just fine, thanks”: Version and/or level II

Some pondering: Version and/or level II
While writing the above list of “conceptual” replications (and already years before), I noticed I had less and less patience to (re-) read certain papers and/or things, did not look forward to having to (re-) read certain papers and/or things, and also decided not to (re-) read certain papers and/or things. I also noticed that I thought certain other papers were very good, interesting, insightful, and/or inspiring. I also wondered why certain papers I thought should have been part of my education weren’t. Additionally, I wondered why (some) papers I thought were mediocre at best received many citations. And I wondered why (some) papers I thought were very good, interesting, insightful, and/or inspiring received (relatively) few citations. The combination of all this, together with some lingering thoughts, led me to ponder some more.

I wonder if, and how, it is possible to determine which psychological scientists are “good”/“better”, and which psychological scientists are “bad”/“worse” (and I wonder if, and why, it is desirable to determine this). I wonder if, and how, it is possible to determine which Psychological Science papers or books are “good”/“better”, and which Psychological Science papers or books are “bad”/“worse” (and I wonder if, and why, it is desirable to determine this). I wonder if, and how, it is possible to determine which ideas, thoughts, and reasonings are “good”/“better”, and which ideas, thoughts, and reasonings are “bad”/“worse” (and I wonder if, and why, it is desirable to determine this). I wonder if, and how, it is possible to get “good”/“better” psychological scientists to receive grants, to design and perform research, to write and publish papers, and to get hired and promoted, and not to get “bad”/“worse” psychological scientists to receive grants, to design and perform research, to write and publish papers, and to get hired and promoted (and I wonder if, and why, this is desirable). I wonder if, and how, it is possible to get “good”/“better” papers to be read, used, and cited, and not to get “bad”/“worse” papers to be read, used, and cited (and I wonder if, and why, this is desirable). I wonder if, and how, it is possible to get “good”/“better” ideas, thoughts, and reasonings to be shared, listened to, used, and further developed, and not to get “bad”/“worse” ideas, thoughts, and reasonings to be shared, listened to, used, and further developed (and I wonder if, and why, this is desirable). I wonder if, and how, it is possible to determine which papers, books, topics, etc. should be part of the curriculum (and I wonder if, and why, it is desirable to determine this). I wonder if certain characteristics of Psychological Science can result in not only a natural selection of bad science, but also a natural selection of bad scientists. I wonder if psychological scientists have been self-assessing and self-regulating sub-optimally at best, and very poorly at worst. I wonder if Psychological Science has been mopping the floor while leaving the tap open.

I have concluded that things probably have many levels and/or dimensions, and that it may be hard, if not impossible, to be (-come) aware of all of these levels and/or dimensions. I have concluded that certain levels and/or dimensions may not resonate with each other. I have concluded that I think there are too many psychological scientists who think and act like managers, politicians, and/or lawyers, and not like scientists.
I have concluded that I think unjustified popularity and attention-grabbing of scientists, papers, topics, and perspectives (continue to) play an excessive role in current-day Psychological Science. I have concluded that I like a lot of papers that are decades old. I have concluded that I like a lot of papers that received (relatively) few citations. I have concluded that a lot of the things that are concluded, or written about, now may have been concluded, or written about, decades ago (if not even earlier). I have concluded that if you don’t look back from time to time, you may not notice you have been going around in circles. I have concluded that one can create one thing when trying to get rid of another thing. I have concluded that one can possibly make things worse when trying to make things better. I have concluded that small is, could, and/or should possibly be big, and that big is, could, and/or should possibly be small. I have concluded that I don’t know whether what I have done (or might do) regarding Psychological Science is (or will be) “positive” or “negative”. I have concluded that I don’t know whether Psychological Science (in its recent and current form) is (or will be) “positive” or “negative”. I have concluded that I think it is hard to know and/or determine whether I filled my bowl to the brim, or left (just) enough room. I have concluded that I am not sure whether I am comfortable (anymore) writing and/or publishing manuscripts.

Perhaps these thoughts and conclusions can be seen as resembling and/or replicating reactions to things like the “crisis of confidence” in Psychological Science. For instance, Goertzen (2008) writes the following concerning the topic of the crisis in Psychology: “(…) I argue that there is still the need for an analysis of the crisis that is in fact a meta-analysis, based on an examination of the majority of the existing literature.” (p. 836). And: “In sum, the fragmentation of the discipline is evidenced in the contrasting positions regarding whether or not there is a crisis (…)” (Goertzen, 2008, p. 836). Additionally, Goertzen (2008) writes: “There are many additional factors (e.g., the professional reward structure of academia) that may appear to be causal for the crisis, but in fact they primarily only reinforce the crisis.” (p. 842). He also states: “For example, behaviourism tried to make a primary commitment to one side of many of the tensions, but this manoeuvre only gave birth to its antithesis, humanism.” (Goertzen, 2008, p. 844). And finally: “Koch (1961) pessimistically argues that the kinds of people required for this work leave psychology on account of the very issues at stake (…)” (Goertzen, 2008, p. 846).

Conclusions
I worry that I make, or have made, mistakes in my reasoning and argumentation, also in this manuscript. I think it is very hard to reason correctly and/or optimally, which could very well be evident to an acute reader of this manuscript (should I have made mistakes in this regard). I would like to replicate here what I expressed elsewhere in this manuscript: perhaps all of the above makes clear that (the depiction of) reasoning, logic, and argumentation should receive much more attention in Psychological Science. If I remember correctly, a logic, reasoning, or argumentation class or course was not part of the curriculum of my education. I find this especially strange concerning my “Research Master in Behavioural Science”, as I think reasoning, logic, and argumentation are extremely important with regard to comprehending scientific writing, integrating different scientific findings, critical thinking, hypothesizing, formulating theories, designing experiments, writing papers, etc. Perhaps reasoning, logic, and argumentation can also be seen as having a crucial function with regard to rules of conduct concerning the behaviour of scientists, both in real life and in scientific writing. I think science simply falls apart without the acknowledgement of, and strict adherence to, these (kinds of) rules and principles. I recently came across a paper titled “The lost tools of learning” in which Sayers (1947) discusses the Trivium in classical education, where the three topics of Grammar, Logic, and Rhetoric are taught (in that order) and subsequently form the basis for further study. As Sayers (1947) writes: “The whole of the Trivium was in fact intended to teach the pupil the proper use of the tools of learning, before he began to apply them to “subjects” at all.” (p. 7). Perhaps (to improve matters) Psychological Science should provide pupils (and others) with proper tools, and a basis, as well. Perhaps Psychological Science should be more aware of, acknowledge, focus on, train, and try and optimize what possibly underlies, permeates, and/or influences nearly everything mentioned in this manuscript: reasoning, logic, and argumentation. Additionally, I hope this manuscript makes clear that there might be a distinct possibility that, seen from several different perspectives, in several different ways, and on several different levels, Psychological Science replicates just fine. Thanks.

References

Afonso, A. (2014). How Academia Resembles a Drug Gang. SSRN Electronic Journal. 10.2139/ssrn.2407748
Amrhein, V. & Greenland, S. (2018). Remove, rather than redefine, statistical significance. Nature Human Behaviour, 2, 4.
Bargh, J. A. (2012, May 11). Priming effects replicate just fine, thanks. Psychology Today. https://www.psychologytoday.com/us/blog/the-natural-unconscious/201205/priming-effects-replicate-just-fine-thanks
Bargh, J. A., Chen, M. & Burrows, L. (1996). Automaticity of social behavior: direct effects of trait construct and stereotype activation on action. Journal of Personality and Social Psychology, 71, 230-244.
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407-425.
Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R. (…) & Johnson, V. E. (2018). Redefine statistical significance. Nature Human Behaviour, 2, 6-10.
Binswanger, M. (2014). Excellence by nonsense: The competition for publications in modern science. In: S. Bartling & S. Friesike (Eds.), Opening Science (pp. 49-72). Springer.
Chambers, C. D., Feredoes, E., Muthukumaraswamy, S. D. & Etchells, P. J. (2014). Instead of “playing the game” it is time to change the rules: Registered Reports at AIMS and beyond. AIMS Neuroscience, 1, 4-17.
Crane, H. (2017). Why “redefining statistical significance” will not improve reproducibility and could make the replication crisis worse. arXiv:1711.07801v1
Crane, H. & Martin, R. (2018a). In peer-review we (don’t) trust: How peer-review’s filtering poses a systemic risk to science. RESEARCHERS.ONE. https://www.researchers.one/article/2018-09-17
Crane, H. & Martin, R. (2018b). The RESEARCHERS.ONE mission. RESEARCHERS.ONE. https://www.researchers.one/article/2018-07-1
Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25, 7-29.
Doyen, S., Klein, O., Pichon, C-L. & Cleeremans, A. (2012). Behavioral Priming: It’s all in the mind, but whose mind? PLoS ONE, 7: e29081.
Edwards, M. A. & Roy, S. (2017). Academic research in the 21st century: Maintaining scientific integrity in a climate of perverse incentives and hypercompetition. Environmental Engineering Science, 34, 51-61.
Eronen, M. I. & Bringmann, L. F. (2021). The theory crisis in psychology: How to move forward. Perspectives on Psychological Science, January 2021.
Gelman, A. (2017). Ethics and statistics: Honesty and transparency are not enough. CHANCE, 30, 37-39.
Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33, 587-606.
Gigerenzer, G. (2018). Statistical rituals: The replication delusion and how we got there. Advances in Methods and Practices in Psychological Science, 1, 198-218.
Goertzen, J. R. (2008). On the possibility of unification: The reality and nature of the crisis in psychology. Theory and Psychology, 18, 829-852.
Hardwicke, T. E., Bohn, M., MacDonald, K., Hembacher, E., Nuijten, M. B., Peloquin, B. N., deMayo, B. E., Long, B., Yoon, E. J. & Frank, M. C. (2021). Analytic reproducibility in articles receiving open data badges at the journal Psychological Science: An observational study. Royal Society Open Science, 8, 201494.
Hardwicke, T. E. & Ioannidis, J. P. A. (2018). Mapping the universe of registered reports. Nature Human Behaviour, 2, 793-796.
Hardwicke, T. E., Mathur, M. B., MacDonald, K., Nilsonne, G., Banks, G. C., Kidwell, M. C., Hofelich Mohr, A., Clayton, E., Yoon, E. J., Henry Tessler, M., Lenne, R. L., Altman, S., Long, B. & Frank, M. C. (2018). Data availability, reusability, and analytic reproducibility: Evaluating the impact of a mandatory open data policy at the journal Cognition. Royal Society Open Science, 5, 180448.
John, L. K., Loewenstein, G. & Prelec, D. (2012). Measuring the prevalence of Questionable Research Practices with incentives for truth telling. Psychological Science, 23, 524-532.
Katzko, M. W. (2006). A study of the logic of empirical arguments in psychological research: “The automaticity of social behavior” as a case study. Review of General Psychology, 10, 210-228.
Kidwell, M. C., Lazarević, L. B., Baranski, E., Hardwicke, T. E., Piechowski, S., Falkenberg, L-S., Kennett, C., Slowik, A., Sonnleitner, C., Hess-Holden, C., Errington, T. M., Fiedler, S. & Nosek, B. A. (2016). Badges to acknowledge open practices: A simple, low-cost, effective method for increasing transparency. PLoS Biology, 14: e1002456.
Koch, S. (1961). Psychological Science versus the science-humanism antinomy: Intimations of a significant science of man. American Psychologist, 16, 629-639.
Kun, A. (2018). Publish and who should perish: You or Science? Publications, 6. doi:10.3390/publications6020018
LeBel, E. P. & Peters, K. R. (2011). Fearing the future of empirical Psychology: Bem’s (2011) evidence of Psi as a case study of deficiencies in modal research practices. Review of General Psychology, 15, 371-379.
Levelt, W., Drenth, P. & Noort, E. (2012). Flawed science: The fraudulent research practices of social psychologist Diederik Stapel. Retrieved from: https://pure.mpg.de/rest/items/item_1569964_8/component/file_1569966/content
Lewin, M. A. (1977). Kurt Lewin’s view of social psychology: The crisis of 1977 and the crisis of 1927. Personality and Social Psychology Bulletin, 3, 159-172.
McFarlane, D. A. (2013). The University lost: The meaning of the University. Interchange, 44, 153-168.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft Psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.
Nosek, B. A., Spies, J. R. & Motyl, M. (2012). Scientific Utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7, 615-631.
Pashler, H. & Wagenmakers, E.-J. (2012). Editors’ introduction to the special section on replicability in Psychological Science: A crisis of confidence? Perspectives on Psychological Science, 7, 528-530.
Peng, R. D. (2009). Reproducible research and Biostatistics. Biostatistics, 10, 405-408.
Popper, K. R. (1971). The moral responsibility of the scientist. Bulletin of Peace Proposals, 2, 279-283.
Rowhani-Farid, A. & Barnett, A. G. (2018). Badges for sharing data and code at Biostatistics: An observational study [version 2; peer review: 2 approved]. F1000Research, 7:90.
Sayers, D. L. (1947). The lost tools of learning. Paper read at a Vacation Course in Education, Oxford, 1947.
Schmidt, S. (2009). Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Review of General Psychology, 13, 90-100.
Simmons, J. P., Nelson, L. D. & Simonsohn, U. (2011). False-positive Psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366.
Simmons, J. P., Nelson, L. D. & Simonsohn, U. (2014). Data from paper “False-positive Psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant”. Journal of Open Psychology Data, 2, e1.
Smaldino, P. E. & McElreath, R. (2016). The natural selection of bad science. Royal Society Open Science, 3, 160384.
Smith, R. (2006). Peer-review: A flawed process at the heart of science and journals. Journal of the Royal Society of Medicine, 99, 178-182.
Sturm, T. & Mülberger, A. (2012). Crisis discussions in psychology - New historical and philosophical perspectives. Studies in History and Philosophy of Biological and Biomedical Sciences, 43, 425-433.
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., van der Maas, H. L. J. & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632-638.
Yarkoni, T. (2018, October 2). No, it’s not The Incentives - It’s you. https://www.talyarkoni.org/blog/2018/10/02/no-its-not-the-incentives-its-you/