The reliability paradox: Why robust cognitive tasks do not produce reliable individual differences

Behav Res. DOI 10.3758/s13428-017-0935-1
Craig Hedge, Georgina Powell, & Petroc Sumner
School of Psychology, Cardiff University, Park Place, Cardiff CF10 3AT, UK
Correspondence: Craig Hedge, [email protected]
© The Author(s) 2017. This article is an open access publication.
Electronic supplementary material: The online version of this article (doi:10.3758/s13428-017-0935-1) contains supplementary material, which is available to authorized users.

Abstract

Individual differences in cognitive paradigms are increasingly employed to relate cognition to brain structure, chemistry, and function. However, such efforts are often unfruitful, even with the most well-established tasks. Here we offer an explanation for failures in the application of robust cognitive paradigms to the study of individual differences. Experimental effects become well established – and thus those tasks become popular – when between-subject variability is low. However, low between-subject variability causes low reliability for individual differences, destroying replicable correlations with other factors and potentially undermining published conclusions drawn from correlational relationships. Though these statistical issues have a long history in psychology, they are widely overlooked in cognitive psychology and neuroscience today. In three studies, we assessed the test-retest reliability of seven classic tasks: Eriksen flanker, Stroop, stop-signal, go/no-go, Posner cueing, Navon, and Spatial-Numerical Association of Response Code (SNARC). Reliabilities ranged from 0 to .82, being surprisingly low for most tasks given their common use. As we predicted, this emerged from low variance between individuals rather than high measurement variance. In other words, the very reason such tasks produce robust and easily replicable experimental effects – low between-participant variability – makes their use as correlational tools problematic. We demonstrate that taking such reliability estimates into account has the potential to qualitatively change theoretical conclusions. The implications of our findings are that well-established approaches in experimental psychology and neuropsychology may not directly translate to the study of individual differences in brain structure, chemistry, and function, and alternative metrics may be required.

Keywords: Reliability; Individual differences; Reaction time; Difference scores; Response control

  Individual differences have been an annoyance rather than a challenge to the experimenter. His goal is to control behavior, and variation within treatments is proof that he has not succeeded… For reasons both statistical and philosophical, error variance is to be reduced by any possible device. (Cronbach, 1957, p. 674)

The discipline of psychology consists of two historically distinct approaches to the understanding of human behavior: the correlational approach and the experimental approach (Cronbach, 1957). The division between experimental and correlational approaches was highlighted as a failing by some theorists (Cronbach, 1957; Hull, 1945), whilst others suggest that it may be the inevitable consequence of fundamentally different levels of explanation (Borsboom, Kievit, Cervone, & Hood, 2009). The correlational, or individual differences, approach examines factors that distinguish between individuals within a population (i.e., between-subject variance). Alternatively, the experimental approach aims to precisely characterize a cognitive mechanism based on the typical or average response to a manipulation of environmental variables (i.e., within-subject variance).

Cronbach (1957) called for an integration between the disciplines, with the view that a mature science of human behavior and brain function would consist of frameworks accounting for both inter- and intra-individual variation. Whilst a full integration is far from being realized, it is becoming increasingly common to see examinations of the neural, genetic, and behavioral correlates of performance on tasks with their origins in experimental research (e.g., Chen et al., 2015; Crosbie et al., 2013; Forstmann et al., 2012; Marhe, Luijten, van de Wetering, Smits, & Franken, 2013; L. Sharma, Markon, & Clark, 2014; Sumner, Edden, Bompas, Evans, & Singh, 2010). Such integration is not without obstacles (e.g., Boy & Sumner, 2014). Here, we highlight a general methodological consequence of the historical divide between experimental and correlational research. Specifically, we ask whether tasks with a proven pedigree as "reliable" workhorses in the tradition of experimental research are inevitably unsuitable for correlational research, where "reliable" means something different.

This issue is likely to be ubiquitous across all domains where robust experimental tasks have been drawn into correlational studies, under the implicit assumption that a robust experimental effect will serve well as an objective measure of individual variation. This has occurred, for example, to examine individual differences in cognitive function, brain structure, and genetic risk factors in neuropsychological conditions (e.g., Barch, Carter, & Comm, 2008), or where individual-difference analyses are performed as supplementary analyses in within-subject studies (cf. Yarkoni & Braver, 2010). Many of the issues we discuss reflect long-recognized tensions in psychological measurement (Cronbach & Furby, 1970; Lord, 1956), though they are rarely discussed in the contemporary literature. The consequence is that researchers often encounter difficulty when trying to translate state-of-the-art experimental methods to studying individual differences (e.g., Ross, Richler, & Gauthier, 2015). By elucidating these issues in tasks used prominently in both experimental and correlational contexts, we hope to aid researchers looking to examine behavior from both perspectives.

The reliability of experimental effects

Different meanings of reliability. For experiments, a "reliable" effect is one that nearly always replicates: one that is shown by most participants in any study and produces consistent effect sizes. For example, in the recent "Many Labs 3" project (Ebersole et al., 2016), which examined whether effects could be reproduced when the same procedure was run in multiple labs, the Stroop effect was replicated in 100% of attempts, compared to much lower rates for most effects tested. In the context of correlational research, reliability refers to the extent to which a measure consistently ranks individuals. This meaning of reliability is a fundamental consideration for individual differences research, because the reliability of two measures limits the correlation that can be observed between them (Nunnally, 1970; Spearman, 1904). Classical test theory assumes that individuals have some "true" value on the dimension of interest, and that the measurements we observe reflect their true score plus measurement error (Novick, 1966). In practice, we do not know an individual's true score; thus, reliability depends on the ability to consistently rank individuals at two or more time points. Reliability is typically assessed with statistics like the intraclass correlation (ICC),[1] which takes the form:

  ICC = Variance between individuals / (Variance between individuals + Error variance + Variance between sessions)

Here, variance between sessions corresponds to systematic changes between sessions across the sample. Error variance corresponds to non-systematic changes in individuals' scores between sessions; i.e., the score for some individuals increases, while it decreases for others. Clearly, reliability decreases with higher measurement error, whilst holding variance between participants constant. Critically, reliability also decreases for smaller between-participant variance, whilst holding error variance constant. In other words, for two measures with identical "measurement error," there will be lower reliability for the measure with more homogeneity. Measures with poor reliability are ill-suited to correlational research, as the ability to detect relationships with other constructs will be compromised by the inability to effectively distinguish between individuals on that dimension (Spearman, 1910).

In contrast to the requirements for individual differences research, homogeneity is the ideal for experimental research. Whereas variance between individuals is the numerator in the ICC formula above, it appears in the denominator of the t-test (i.e., in the standard error of the mean). For an experimental task to produce robust and replicable results, it is disadvantageous for there to be large variation in the within-subject effect. Interestingly, it is possible to be perfectly aware of this for statistical calculations without realising (as we previously didn't) that the meanings of a "reliable" task for experimental and correlational research are not only different, but can be opposite in this critical sense.

[1] The two-way ICC can be calculated for absolute agreement or for consistency of agreement. The latter omits the between-session variance term. Note also that the error variance term does not distinguish between measurement error and non-systematic changes in individuals' true scores (Heise, 1969). Some may therefore prefer to think of the coefficient as an indicator of stability.

Method

Participants

Participants in Study 1 were 50 (three male) undergraduate students aged 18–21
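The tension described in the preceding section (a task can yield a rock-solid group-level effect while barely distinguishing individuals) can be illustrated with a short simulation. This sketch is illustrative and not taken from the article: the sample size, variance values, and helper names (`simulate`, `icc_consistency`) are assumptions, and the ICC is computed here in its two-way consistency form, which omits the between-session variance term.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200  # participants, each measured in two sessions

def simulate(between_sd, error_sd=20.0, mean_effect=50.0):
    """Simulate a within-subject effect (e.g., a congruency cost in ms)
    measured in two sessions. Only the spread of true scores varies."""
    true = rng.normal(mean_effect, between_sd, N)   # stable individual effects
    session1 = true + rng.normal(0.0, error_sd, N)  # observed score, session 1
    session2 = true + rng.normal(0.0, error_sd, N)  # observed score, session 2
    return session1, session2

def icc_consistency(s1, s2):
    """Two-way consistency ICC, often written ICC(3,1): between-subject
    variance relative to between-subject plus error variance."""
    x = np.stack([s1, s2], axis=1)                  # N participants x 2 sessions
    n, k = x.shape
    msb = k * x.mean(axis=1).var(ddof=1)            # between-subjects mean square
    resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0) + x.mean()
    mse = (resid ** 2).sum() / ((n - 1) * (k - 1))  # error mean square
    return (msb - mse) / (msb + (k - 1) * mse)

# Two hypothetical tasks with identical measurement error; only the
# between-subject spread of true effects differs.
het1, het2 = simulate(between_sd=25.0)  # heterogeneous sample
hom1, hom2 = simulate(between_sd=5.0)   # homogeneous sample ("robust" task)

icc_het = icc_consistency(het1, het2)
icc_hom = icc_consistency(hom1, hom2)

# The group-level effect is easily detected in BOTH tasks (one-sample t)...
t_het = het1.mean() / (het1.std(ddof=1) / np.sqrt(N))
t_hom = hom1.mean() / (hom1.std(ddof=1) / np.sqrt(N))

# ...but test-retest reliability collapses when true scores are homogeneous.
print(f"heterogeneous task: ICC = {icc_het:.2f}, t = {t_het:.1f}")
print(f"homogeneous task:   ICC = {icc_hom:.2f}, t = {t_hom:.1f}")
```

With identical error variance, the homogeneous task still yields a very large t statistic but an ICC near zero, while the heterogeneous task yields both a large t and a respectable ICC, mirroring the opposition between the two meanings of "reliable" described above.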