Artefact Corrections in Meta-Analysis
Postprint. Advances in Methods and Practices in Psychological Science, XX–XX. © The Author(s) 2019. Not the version of record. The version of record is available at DOI: XXX. www.psychologicalscience.org/AMPPS

Obtaining Unbiased Results in Meta-Analysis: The Importance of Correcting for Statistical Artefacts

Brenton M. Wiernik¹ and Jeffrey A. Dahlke²
¹Department of Psychology, University of South Florida, Tampa, FL, USA
²Human Resources Research Organization, Alexandria, VA, USA

Abstract
Most published meta-analyses address only artefactual variance due to sampling error and ignore the role of other statistical and psychometric artefacts, such as measurement error (due to factors including unreliability of measurements, group misclassification, and variable treatment strength) and selection effects (including range restriction/enhancement and collider biases). These artefacts can have severe biasing effects on the results of individual studies and meta-analyses. Failing to account for these artefacts can lead to inaccurate conclusions about the mean effect size and between-studies effect-size heterogeneity, and can influence the results of meta-regression, publication bias, and sensitivity analyses. In this paper, we provide a brief introduction to the biasing effects of measurement error and selection effects and their relevance to a variety of research designs. We describe how to estimate the effects of these artefacts in different research designs and correct for their impacts in primary studies and meta-analyses. We consider meta-analyses of correlations, observational group differences, and experimental effects. We provide R code to implement the corrections described.
Keywords
psychometric meta-analysis; measurement error; reliability; range restriction; range variation; selection bias; collider bias

Received 11/06/18; Revision accepted 10/02/19

Corresponding Author:
Brenton M. Wiernik, Department of Psychology, University of South Florida, 4202 East Fowler Avenue, PCD4118G, Tampa, FL 33620, USA.
Email: [email protected]

Meta-analysis is a critical tool for increasing the rigor of research syntheses by increasing confidence that apparent differences in findings across samples are not merely attributable to statistical artefacts (Schmidt, 2010). However, most published meta-analyses are concerned only with artefactual variance due to sampling error and ignore the role of other statistical and psychometric artefacts, such as measurement error (stemming from factors including unreliability of measurements, group misclassification, and variable treatment strength) and selection effects (including range restriction/enhancement and collider bias). In this paper, we provide a brief introduction to the biasing effects of measurement error and selection effects and their relevance to a variety of research designs. We describe how to estimate the effects of these artefacts in different research designs and how to correct for their impacts in meta-analyses. We consider meta-analyses of correlations, observational group differences, and experimental effects.

As noted above, most published meta-analyses are concerned only with sampling error and do not correct for other statistical artefacts. For example, of the 71 meta-analyses published in Psychological Bulletin during 2016–2018, only 6 made corrections for measurement error and only 1 corrected for selection biases (a similar review by Schmidt, 2010, found similarly low rates of corrections for statistical artefacts). Corrections for measurement error are commonly applied in industrial–organizational (I–O) psychology meta-analyses, but rarely in other psychology subfields or other disciplines (Schmidt & Hunter, 2015; Schmidt, Le, & Oh, 2009). Corrections for selection effects are even rarer. They are typically only performed in meta-analyses of personnel selection and educational admissions research (Dahlke & Wiernik, 2019a; Sackett & Yang, 2000). Measurement error and selection artefacts have severe biasing effects on the results of individual studies and meta-analyses. Failing to account for these artefacts can lead to inaccurate conclusions about the mean effect size and between-studies effect-size heterogeneity, and can influence the results of meta-regression, publication bias, and sensitivity analyses. When measurement error and selection effects are considered in meta-analyses outside of I–O psychology, they are often treated simply as indicators of general “study quality” and used as exclusion criteria or as moderators (Schmidt & Hunter, 2015, p. 485); both approaches are suboptimal, as measurement error and selection effects have predictable effects on study results. The best way to handle measurement error and selection effects is to apply statistical corrections that account for the known impacts of these artefacts. Below, we describe the impacts of measurement error and selection effects on primary studies and meta-analyses, as well as methods to correct for these artefacts.

Measurement Error

Measurement error is an artefact that causes observed (i.e., measured) values to deviate from the “true” values¹ of underlying latent variables (Schmidt & Hunter, 1996). For example, consider a political psychologist assessing political orientation using a 10-item measure with items rated on a 7-point scale. A respondent might obtain a mean score of “5” (somewhat conservative) or “3” (somewhat liberal) across the 10 items, when their true score is in fact “4” (moderate/centrist). Measurement error is also called unreliability (Rogers et al., 2002), observational error (Walter, 1983), and information bias (Alexander et al., 2014a). For continuous variables, measurement error is also called (low) precision (Stallings & Gillmore, 1971); for dichotomous or grouping variables, measurement error is also called misclassification (Alexander et al., 2014a).

Measurement error can come from a variety of sources, including truly random response errors (e.g., momentary distractions, arbitrary choices between adjacent scale points), transcription errors, transient/temporal effects (e.g., performing poorly on a cognitive assessment due to fatigue on a particular day), environmental effects, content sampling effects (i.e., the specific items or content used in the measure do not function the same way as possible alternative items or content), rater effects (e.g., differential knowledge/motivation or idiosyncratic beliefs across raters), and low sensitivity or specificity of measurement instruments, among others. These diverse sources of error can be grouped into four “classical” categories (Cronbach, 1947; Schmidt et al., 2003); see Table 1 for descriptions. Importantly, measurement error is not only about whether participants’ responses were correctly recorded, but more about the process that generated the responses and whether the responses would be the same if the process were repeated at different times, by different raters, or with a different instrument or item set (Schmidt & Hunter, 1996). For psychological measures, random response error and transient error are typically the largest sources of measurement error (Ones et al., 2016).

The amount of measurement error in a sample of scores is quantified using a reliability coefficient, defined as the proportion of the observed-score variance that is consistent (i.e., believed to be “true”), or one minus the proportion of the observed-score variance attributable to measurement error:

$$r_{xx'} = \frac{\sigma^2_T}{\sigma^2_X} = 1 - \frac{\sigma^2_E}{\sigma^2_X}$$

where $\sigma^2_T$, $\sigma^2_E$, and $\sigma^2_X$ are the true-score, error, and observed-score variances, respectively. Conceptually, the reliability coefficient is the correlation between two parallel measures of a construct. The

¹ In classical test theory, an individual’s “true score” is the expected value of the individual’s response to a measurement process that has been repeated an infinite number of times, such that all errors of measurement average out to zero across observations. That is, the “true score” is the score on the measure without any measurement error. The “true score” does not necessarily correspond to the individual’s standing on the intended latent construct; that is a question of measure validity, not reliability. If a measure has poor validity as an indicator of the intended construct, even correlations corrected for measurement error will poorly reflect correlations with the latent construct variable (for discussions, see Borsboom, 2006; Schmidt et al., 2013).

Table 1. Four classical sources of measurement error and reliability estimators sensitive to each.
Source of error: Random response error
Description and examples: Truly random error specific to each item/response (e.g., random fluctuations in response time, momentary lapses in attention); unique to each measurement
Reliability estimators:
– All reliability estimators

Source of error: Transient error
Description and examples: Error due to the specific time or environment in which data are gathered (e.g., participant mood, environmental distractions); shared by measures completed within a meaningfully short time span
Reliability estimators:
– Test-retest reliability (coefficient of stability)
– Delayed parallel forms reliability (coefficient of equivalence and stability)
– Delayed internal consistency (e.g., test-retest alpha; Green, 2003)

Source of error: Content sampling error (item/test specific factor error)
Description and examples: Error due to specific content used on a measure (e.g., interpretations of specific test items); shared by measures with the same or highly similar content
Reliability estimators:
– Parallel forms reliability