
Running head: INTRODUCING THE NUMBERS NEEDED FOR CHANGE

Introducing the Numbers Needed for Change (NNC): A practical measure of effect size for intervention research

Stefan L.K. Gruijters1

Maastricht University

Gjalt-Jorn Y. Peters2

Open University of the Netherlands

*Author note*

* Draft version 1, 4/3/17. This paper has not been peer reviewed. Please do not copy or cite without the authors' permission.

1Department of Work and Social Psychology, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, The Netherlands.

2Department of Methodology & Statistics, Faculty of Psychology & Education Science, Open University of the Netherlands, Heerlen, the Netherlands.

Acknowledgements: The first author is thankful to Niek Zonnebeld for his insights. We also thank Philippe Verduyn for motivating comments on an earlier draft of this manuscript. Correspondence concerning this article should be addressed to Stefan Gruijters, Department of Work and Social Psychology, P.O. Box 616, 6200 MD Maastricht, The Netherlands. E-mail: [email protected]


Abstract

Effect size indices are valuable to research in applied psychology, but generic measures (e.g. Cohen’s d or the point-biserial correlation) are limited in their ability to convey practical information about intervention effectiveness. Researchers rely on concepts such as ‘standardized mean difference’ or ‘explained variance’ to express information about effect size. Practitioners, policymakers, and lay-people prefer concepts such as frequencies. Partial solutions to this discrepancy are offered by rules-of-thumb (e.g. Cohen’s categories of ‘small’, ‘moderate’ and ‘large’ effects), but such categories are somewhat arbitrary and offer little nuance. We introduce the Numbers Needed for Change (NNC), an effect size that fills this communicative gap between research and practice, and is particularly suited to provide information about behavior change intervention effectiveness. The NNC has three informational advantages: 1) it communicates effect magnitude in a common-sense frequency format (number of people), 2) it considers the population behavior base rate to estimate this metric, and 3) it provides a convenient intermediate measure between statistical estimates and cost-effectiveness estimates. The measure is an analogue to the Numbers Needed to Treat (NNT) index, which is popular in the medical literature. We adapt and extend the index into the NNC to suit applied psychology research purposes, and argue that the measure can strengthen the translation of intervention research to practice. The statistical procedure to estimate the NNC is explained, illustrated with concrete examples, and supplemented by a script and functions to calculate the index in the open source software environment R.

Introduction

Despite the many advances made in applied psychology, its impact on policy and practice leaves a lot to be desired. This incongruity has become sufficiently salient and urgent that the 2016 joint conference of the European Health Psychology Federation and the Division of Health Psychology of the British Psychological Association was themed “Behavior Change: Making an Impact on Health and Health Services”. In the dedicated roundtable “Enhancing the Impact of Health Psychology on Policy and Practice”, one of the conclusions was that the abstract level of scientific discourse does not lend itself well to communication with policymakers and practitioners. For example, the outcome that a behavioral intervention, compared to care as usual, has an effect size of Cohen’s d = .62 means little to people for whom statistics is sometimes not much more than a distant memory. The potential of applied psychology to impact policy and practice would be enhanced if the implications could be communicated less in the form of abstract statistical quantifications, and more using a measure that is tangible, practical, and intuitive. In this paper, we introduce such a measure. In current practice, evaluating behavior change interventions in randomized controlled trials is a methodological gold standard. Results of intervention research in the literature have foremost been evaluated using null-hypothesis significance testing (NHST). Currently ongoing is a long overdue shift from NHST towards estimation of effect sizes and confidence intervals; a shift that gives ear to the pleas of many who have heralded such procedures as the better alternative to NHST (e.g., Cohen, 1994; Cumming, 2014; Gardner & Altman, 1986; Greenland et al., 2016; Gruijters, 2016; Kirk, 1996; Peters & Crutzen, 2017; Wilkinson et al., 1999). Effect size estimation provides research and practice with several important and clear advantages over NHST.
First, these procedures allow straightforward comparison of different interventions on their relative effectiveness. Second, they provide a means to interpret and communicate the relevance of an effect. To this latter end, helpful but somewhat arbitrary cut-off values (or guidelines) have been proposed in the literature that tentatively categorize the standardized indices into regions of ‘small’, ‘moderate’ and ‘large’ (e.g. Cohen, 1988). Third, effect size estimates can be combined and averaged over replication studies to obtain a best estimate of the true intervention effect, and they allow estimation of the sample sizes required to detect effects of a given size. Unlike NHST, effect size estimation may bring researchers closer to what they want to know about their research questions (cf. Cohen, 1994; Kirk, 1996). Despite the many advantages offered by common measures of effect size, we think that for practical purposes these indices do not convey the information required by those with an eye on the direct applicability of research. Correctly and meaningfully interpreting effect size measures requires considerable statistical expertise, which policymakers and practitioners usually do not require for their day-to-day jobs and therefore often lack. One of the problems is that the metric of such indices cannot readily be converted to metrics familiar to policymakers and practitioners. Practitioners and clinicians find them non-intuitive, because standardized measures of effect size communicate information relying on concepts not typically used in people’s thinking about magnitude (cf. May, 2004). The unit in which Cohen’s d expresses mean differences is ‘standard deviations’. Does an estimated intervention effect of Cohen’s d = .90 (falling in Cohen’s ‘large’ range) imply that it is effective and worth implementing? Alternatively, how large must the coefficient of determination (r2) be to qualify an intervention as ‘implementable’?
We are not the first to note the interpretational and communicative limitations of ‘traditional’ effect size indices. Most attempts to put forward suitable alternatives, however, have taken a probabilistic turn to improve interpretability. These approaches work from the shared premise that non-scientists, students, or simply ‘the average Josephine’ deal better with probabilistic information than with traditional metrics. Rosenthal and Rubin (1982) introduced the ‘Binomial effect size display’ (BESD), which creates an easily interpretable two-by-two cross-table, by default using r coefficients (see also Rosenthal, 1990, 1991, 2005). The BESD dichotomizes a scaled outcome variable into categories of ‘success’ and ‘failure’ by assuming a 50% base-rate occurrence. That is, if no intervention effect is present, the cells of the 2 (condition: intervention/control) by 2 (outcome: failure/success) table are assigned conditional probabilities of p = .50. The conditional probabilities vary as a function of the intervention effect: for example, given r = .20, a change from 40% ‘success’ in the control group to 60% in the experimental group can be expected. McGraw and Wong (1992) proposed the use of a probability-based index of effect termed the common language effect size (CLES; see also Ruscio, 2008), which expresses the probability that a randomly selected case in the experimental condition scores higher than a randomly selected case from the control distribution. If there is no effect of an intervention (d = 0), this value equals 50%, implying that a randomly picked individual from the experimental group has the same probability of scoring above the control group mean (i.e. z > 0) as one from the control group. The probability asymptotically approaches 100% as Cohen’s d deviates further from zero. These measures have one clear advantage compared to standardized dimensional indices: their metric (p, or p × 100) is relatively intuitive.
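For concreteness, these two probabilistic translations can be sketched in a few lines of R; the function names cles and besd below are ours, for illustration only, and the CLES sketch assumes normally distributed outcomes with equal variances:

```r
### CLES (McGraw & Wong, 1992): probability that a random case from the
### experimental group scores higher than a random case from the control group
cles <- function(d) pnorm(d / sqrt(2));
cles(0);    # 0.5: no effect means a 50% probability
cles(.62);  # roughly .67 for the d = .62 example above

### BESD (Rosenthal & Rubin, 1982): success rates assuming a 50% base rate
besd <- function(r) c(control = .5 - r / 2, experimental = .5 + r / 2);
besd(.20);  # 40% versus 60%, as in the text
```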
The CLES and BESD score high on May’s (2004) categories of practical utility of effect sizes (i.e., interpretability, understandability and comparability). Recent research supports the intuition that traditional effect sizes may not, on May’s terms, compete with indices such as the CLES and BESD (Brooks, Dalal, & Nolan, 2014). These findings strengthen the notion central to this paper: applied research may benefit from alternative, practically meaningful, indices of effect size. The value of applied psychological research is strongly connected to its translation to practice and ultimately its ability to inform how to improve the well-being of people, groups, and societies. Thus, a measure that aligns with the aims of applied research ideally conveys information that can readily be translated to the real world. In our view, this implies that a practical measure of effect size requires (1) a metric which is optimally understandable and interpretable for a non-scientific audience, and (2) information that directly relates the implications of an intervention to the population under study. While methods such as the BESD and CLES clearly go a long way towards meeting criterion (1) relative to traditional metrics, they do in our view leave room for improvement. One important reason is that both indices rely on estimates of single-event probabilities. A vast literature on judgment and decision-making, sprung from the seminal work by Kahneman and Tversky (Kahneman, Slovic, & Tversky, 1974; Kahneman & Tversky, 1972; Tversky & Kahneman, 1983), illustrates that people experience difficulties when dealing with probability information. Moreover, studies have established (Cosmides & Tooby, 1996; Gigerenzer, 1996; Gigerenzer, Hoffrage, Mellers, & McGraw, 1995; Sloman, Over, Slovak, & Stibel, 2003) that numerical information-processing substantially improves when people deal with frequency formats (e.g., 1 out of 10) rather than single-event probabilities (e.g., .10). Therefore, we suggest that intervention developers with a clear applied focus can benefit from reporting an effect size that provides information about intervention effectiveness in a frequency format.

Here, we introduce the use of an analogue to the Numbers Needed to Treat statistic (Cook & Sackett, 1995; Laupacis, Sackett, & Roberts, 1988). For use in the context of behavior change, we have amended the terminology to a more appropriate one: numbers needed for change (NNC). We define NNC, briefly, as the number of people that need to be exposed to an intervention to achieve the desired behavior change in one more individual relative to a control condition (formal definitions will be discussed in a later section). We suggest that to strengthen the link between research and practice, practical significance needs to be estimated and reported. The NNC introduced in this paper is one means towards this end, and its estimation takes intervention analysis beyond the relative conclusions about effect magnitude provided by traditional indices. Unlike traditional measures of effect size, the NNC index is practical and intuitive: it expresses the size of an effect in a metric that conveys information about the ‘real-world’ implications of an intervention (Citrome & Ketter, 2013; see also Hilton, Reid, & Paratz, 2006). The metric does not require a cognitive grasp of statistical concepts such as ‘standard deviations’ or ‘variance’, but communicates in an intuitive frequency format, which bolsters interpretability and understandability (see Brooks et al., 2014).

The remainder of this paper is organized as follows: first, we discuss the definition of the NNC in terms of its original use as an effect size for binary outcomes. Second, we extend the NNC to derive estimates based on various types of scaled outcome measures. Finally, we describe the function we implemented in the package userfriendlyscience (Peters, 2017) for the free and open source R software environment (R Development Core Team, 2017) that can be used to calculate the NNC based on Cohen’s d estimates. We illustrate the procedures by working through concrete examples. The examples are supplemented with an open source R Markdown file available on the Open Science Framework (https://osf.io/x4vqn).


Numbers Needed for Change: A basic definition

The Numbers Needed to Treat (NNT) index was developed in the context of medical statistics (Altman, 1998; Altman & Andersen, 1999; Cook & Sackett, 1995; Laupacis et al., 1988). Medical professionals had a growing need for intuitive effect measures, able to easily inform clinical practitioners about optimal treatment choices. The NNT provides a more intuitive index than standardized effect size measures because its primary metric is number of persons. This metric is less abstract, and to most clinicians and patients more familiar, than the magnitude of a standardized change on an outcome scale. In medical statistics, for treatments that aim to increase an event rate (ER), NNT is defined as:

$$\text{NNT} = \frac{1}{\text{EER} - \text{CER}} \tag{1}$$

Alternatively, treatments aiming to decrease a certain undesirable event rate can use the reverse operation to estimate NNT:

$$\text{NNT} = \frac{1}{\text{CER} - \text{EER}} \tag{2}$$

In Equations 1 and 2, EER is the experimental group event rate, and CER is the control group event rate. In both groups, an ‘event’ is a discrete outcome that a treatment aims to either promote or prevent, depending on which outcome is desirable. In medical research, examples of ERs are the proportion of patients that experience a cardiac event or that recover from a traumatic injury. The NNT, then, compares the ER in the control group with the ER in the experimental group, expressing how many individuals would need to be given the treatment to obtain one more favorable event (e.g. recovery) relative to the CER. The NNT measure is computationally related to effect size measures (see Figure 1 for basic computations) such as the Area Under the (receiver operating characteristic) Curve (AUC), the odds ratio (OR), the relative risk ratio (RR), and the success rate difference (SRD, also known as the absolute risk reduction, ARR). These indices express how an ER in the experimental group differs from that in the control group, though with somewhat different ways of probabilistically expressing the difference between CER and EER (Furukawa & Leucht, 2011a; Kraemer & Kupfer, 2006).


Figure 1. Basic binary effect size computations and relations. SRD, AUC and NNT conversions are based on Kraemer & Kupfer (2006). Abbreviations: OR = Odds Ratio; RR = Relative Risk; SRD = Success Rate Difference; ARR= Absolute Risk Reduction; EER= Experimental Event Rate; CER= Control Event Rate; AUC = Area Under the (ROC) Curve; NNT = Numbers Needed to Treat.
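As a concrete illustration of the relations in Figure 1, the following R sketch derives the binary effect sizes from a pair of event rates; the helper name binaryEffectSizes is ours, not from an existing package:

```r
### Illustrative helper: binary effect sizes from the control and
### experimental event rates
binaryEffectSizes <- function(cer, eer) {
  srd <- eer - cer;                               # success rate difference (ARR)
  c(SRD = srd,
    RR  = eer / cer,                              # relative risk
    OR  = (eer / (1 - eer)) / (cer / (1 - cer)),  # odds ratio
    NNT = 1 / srd);                               # Equation 1
}

### Example: moving an event rate from 10% to 20% implies NNT = 10
binaryEffectSizes(cer = .10, eer = .20);
```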

In intervention research, events can be defined in terms of the specific behavior change objectives that programs aim to address, by attempting to either increase or decrease a specific behavior. Adequate intervention development and evaluation requires that goals are formulated with sufficient specificity (Bartholomew Eldredge et al., 2016). For example, instead of aiming to “promote exercise”, interventions will often aim to “promote engaging in physical activity for at least 45 minutes, at least three times a week”. Such an intervention could define an ER as ‘the proportion of people that engage in physical activity for at least 45 minutes, three times a week’. In the context of sexual health, a typical ER could be described as ‘the proportion of people that consistently use condoms when having sex outside a committed relationship’, or ‘the proportion of people that get screened for sexually-transmitted infections after having unsafe sex’. If researchers had estimates of the frequency of these behaviors in both the control and experimental populations, it would be straightforward to calculate the NNC with the above NNT formulas. The formula would then return a discrete number indicating how many people need to be exposed to the intervention to achieve favorable change in one more individual relative to the control condition. Unfortunately, however, event rates are not always available for behavior change interventions. Binary behavior measures are not always feasible (and indeed not always desirable) outcome measures; researchers often use scaled outcomes, where the magnitude of an effect is commonly expressed in terms of standard deviations or variance proportions. In the following sections, we extend the NNC procedure to provide estimates for continuous data.

Estimating NNC for scaled outcomes: the relevance of the CER

Several authors (Furukawa, 1999; Furukawa & Leucht, 2011b; Hasselblad & Hedges, 1995; Kraemer & Kupfer, 2006; Suissa, 1991) have proposed different methods to estimate the NNT based on scaled data (for a comparison see Da Costa et al., 2012; Thorlund et al., 2011). Equation 3 expresses Furukawa and Leucht’s (FL) method of converting a Cohen’s d value into the NNT (see Figure 2). The formula uses the cumulative standard normal distribution function to estimate the effect an intervention has (Cohen’s d) above and beyond the ER observed in a control group. To do so, an estimate of the prevalence of events in the control group (CER) must be provided.

$$\text{NNT}_{FL} = \frac{1}{\Phi\left(d + \Psi(\text{CER})\right) - \text{CER}} \tag{3}$$

In Equation 3, Ψ represents the quantile function of the normal distribution, and Φ represents the distribution function of the normal distribution. To estimate the NNT, Equation 3 thus uses Cohen’s d to modify the CER, estimating how many more desirable events we can expect if we expose individuals to our intervention rather than to a control condition. Commonly, such a control condition is tantamount to doing nothing towards the promotion or prevention of specific behaviors (cf. ‘the baseline’). But the CER could also describe the prevalence of behavior under a currently used behavior change method (cf. ‘care as usual’), to which a novel intervention is then compared. Before addressing the details of how Equation 3 uses CER information to estimate the NNC, we discuss how to estimate the CER directly or indirectly by creating threshold definitions on scaled outcomes.

Figure 2. NNC as a function of d given various control group event rates.
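Equation 3 can be sketched directly in R, where qnorm is the quantile function Ψ and pnorm the distribution function Φ; the function name dToNNT is ours, for illustration:

```r
### Furukawa & Leucht's conversion of Cohen's d to the NNT, given a CER
dToNNT <- function(d, cer) {
  eer <- pnorm(d + qnorm(cer));  # expected experimental event rate
  1 / (eer - cer);
}

dToNNT(d = .5, cer = .0668);  # roughly 10.9, i.e. 11 people
```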

Establishing the CER for scaled outcome variables

When continuous outcomes are used to evaluate interventions, researchers nonetheless need to establish the CER based on discrete events, and thus first need to define the threshold distinguishing undesirable from desirable events. Consider a physical activity intervention using ‘time (in minutes) spent on exercise a week’ as the outcome variable. An intervention goal could specify the threshold for desired behavior as “exercising for at least 135 minutes a week”. To estimate the CER, a translation of scaled outcomes in minutes to ‘desired’ and ‘undesired’ events (or ‘success’ and ‘failure’) is needed. This can be achieved by setting a threshold that defines these categories. Once the threshold is determined, the CER can be computed in different ways depending on the available data. If continuous raw data are available, the variable can be dichotomized using the threshold value, and the proportion of cases scoring above the threshold estimates the CER. If only a mean and standard deviation are available, making an assumption about the distribution shape of the relevant variable enables computing the CER using the appropriate cumulative distribution function. For instance, assuming a normal distribution, the control group distribution (mean and standard deviation) can approximate the proportion of the population that reaches a threshold defining success using the z-distribution. This is done by computing the z-score for the threshold value and estimating the corresponding probability of the event. That is,

$$z(\text{CER}) = \frac{\text{Threshold} - \text{Mean}(Y)}{SD(Y)} \tag{4}$$

In our current example, assuming a mean score of 90 minutes and a standard deviation of 30 minutes, z = 1.5 and the probability that a randomly selected individual will show a desirable event, i.e. p(Y = 1), equals P = .067. Given sufficient n, the CER thus corresponds to 6.7% and provides an estimate of the prevalence of the desired behavior without (or prior to) intervention. In short, this CER estimate is based on a normal probability function that can be restated as a binomial statement: there is a 6.7% probability that a randomly selected individual from the population will show the desired behavior without intervention. Finally, a binomial method to estimate this parameter is to include a dichotomous baseline measure of actual behavior (as defined by the threshold) in the pre-measurement stage of intervention research. Based on the dichotomous responses, the CER can be estimated directly for a given behavior, and used in Equation 3 to estimate the NNC after effect analysis. For example, the specific desired behavior to be achieved might be that individuals exercise 135 minutes per week. Although this behavior may not be a preferred or feasible outcome measure (e.g. because dichotomous measures have considerably less power than continuous ones; Altman & Royston, 2006), researchers can include a dichotomous pre-measure for the purpose of establishing a CER. Doing so allows computing an estimate of the behavior outcome prevalence (i.e. CER) in the population that has not been exposed to the intervention. The dichotomous pre-measure, accordingly, can be assessed through self-report as ‘do you currently engage in 135 minutes of physical activity per week?’ (0 = no, 1 = yes), which directly estimates p(Y = 1).
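Under the normality assumption, this computation reduces to a single call to R's pnorm function:

```r
### CER for the running example: mean 90, SD 30, threshold 135 minutes
cer <- pnorm(135, mean = 90, sd = 30, lower.tail = FALSE);
round(cer, 3);  # 0.067, i.e. a CER of 6.7%
```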
However, regardless of the method used to establish the CER, we recommend that researchers establish several threshold definitions, and compute a range of NNC estimates based on multiple criteria grounded in theory and evidence. Thus, for the above example, researchers may define 125 through 135 to 145 minutes as thresholds for success, and estimate separate NNCs. Additionally, the chosen cut-off values are ideally substantiated by theoretical or practical considerations provided in the research report. For example, the definition of 135 minutes may be based on a practical trade-off between the time investment that people could realistically spend on exercise and knowledge of the relationship between physical activity and health outcomes. A realistic scenario, however, is that there are no reasonable criteria allowing definitions of favorable and unfavorable events on the scaled outcome. In that case, researchers may be interested in any positive change compared to the control group mean value. In this scenario, it is possible to estimate the NNC by reverting to the default inherent in Cohen’s d estimation of effect size. That is, by comparing two population mean values, it estimates deviation from the control population mean, which corresponds to CER = .50. Therefore, if any deviation from the control group mean is taken as desirable, then 50% of the control group can be expected to show undesirable behavior and the remaining half of the population will fall in the desirable outcome range. Notably, though, setting the threshold at the population mean value (corresponding to CER = .50) provides the most optimistic point for NNC estimation, and the estimated NNC will be minimal. In this instance, the NNC will approximate a frequency format of McGraw’s CLES (i.e. the AUC) and is equal to Cohen’s U3 index.

From information about prevalence towards effect estimation

The CER denotes the proportion of the population that is already expected to exhibit the desired behavior, regardless of the intervention. Behavior change in the population is then a function of this CER and Cohen’s d, that is, of the current situation and an estimate of the expected change due to the intervention. As can be seen in Equation 3, the size of the CER is important for determining the NNC of an intervention. For a given Cohen’s d value, the NNC exhibits a U-shaped relationship with the CER: behaviors that are very low in prevalence in the population (CER close to 0) and behaviors that are already very prevalent (CER close to 1) both require a larger number of people to partake in an intervention to change the behavior of one more individual. This is illustrated in Figure 3.

Figure 3. NNC as a function of the control event rate, given Cohen’s d values.

Consider the physical activity example. In a population where hardly anyone exhibits the desired behavior, we require many people to partake in the intervention if we want to achieve substantial change in the behavior on a population level. The same situation applies to a population in which the majority is already conforming to researchers’ intervention goals. In this sense, improving an already prevalent behavior is just as difficult as trying to improve an almost non-existent behavior. Behavior change thus requires the smallest number of participants when the CER is close to 50 %. This binomial thinking can be re-expressed in terms of a normal probability distribution. When a threshold corresponds to a CER of .50, this implies a large probability that a randomly picked individual exposed to the intervention will be close to the threshold defining success (in the physical activity example ‘145 minutes’). Therefore, when an intervention is implemented in a population where the threshold corresponds with a CER of 50%, the maximal number of people can be ‘pushed’ over the threshold given d (see Figure 4). In Figure 4, the logic of the NNC is depicted for three different CER values. As can be seen, CER corresponds to the area under the curve prior to intervention. Cohen’s d modifies this area; the changes in event probability directly correspond to the success rate difference. Indirectly, effect sizes such as RR and OR also convey information about the shift in ER from the CER towards the EER in Figure 4. The NNC can then be based on the difference in ER, and can also be computed as a function of the SRD and the total density under a normal curve.

Figure 4. Estimation of NNC given a constant d of 0.5. The panels illustrate why constant d values can have different practical implications. The left panel approximates a CER of 12.17%; d modifies the CER towards an estimate of the EER (25.25%). The middle panel depicts a CER of 6.68% modified towards an EER of 15.87%. The right panel has a CER of 3.34% modified towards 9.12%. The NNC can be estimated by dividing the total curve density (1) by the depicted change in event probability (SRD). In the left panel, SRD = 13.08% and NNC = 8; in the middle panel, SRD = 9.19% and NNC = 11; in the right panel, SRD = 5.78% and NNC = 18.
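As a sketch, the panel values above can be reproduced in base R, using ceiling to round the NNC up to a whole number of people:

```r
### Reproduce the three Figure 4 panels from their CERs and d = 0.5
cers <- c(.1217, .0668, .0334);
eers <- pnorm(.5 + qnorm(cers));  # Equation 3: Phi(d + Psi(CER))
srds <- eers - cers;              # success rate differences
ceiling(1 / srds);                # 8, 11, 18
```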


Examples of the NNC procedure: An illustrated guide to its use

To illustrate the procedures to estimate the NNC, we show examples for three hypothetical situations and supplement the examples with R code that can easily be adjusted. The three situations are similar to those used in previous sections: one where a researcher has a continuous behavior measure for the control condition, one where, instead, a dichotomous variable is available, and one where only a mean and standard deviation are available. As in the primary example in the paper, the behavior data concern exercise, specifically the number of minutes participants exercise per week. In all situations, 135 minutes is considered the threshold value (i.e. participants are considered to show an ‘event’ if they exercise at least 135 minutes). The mean number of minutes participants exercise is 90, with a standard deviation of 30. The hypothetical intervention has an effect of Cohen’s d = 0.5. First, a dataset is simulated, consisting of a behavior measure collected for 1000 participants.

### Set a seed to enable replication of the data generation
set.seed(20170402);

### Simulate data for a continuous behavior measure for
### the control condition (i.e. the control event rate)
exampleData <- data.frame(behavior = round(rnorm(1000, mean=90, sd=30)));

Example 1: Continuous estimate

In this situation, we have a continuous behavior estimate, and we need to establish the CER. This means we first have to dichotomize this continuous measure using the threshold (135 minutes). We can then establish the event rate and use this to compute the NNC.

### Install the userfriendlyscience package if necessary (this is
### commented out because it's usually not necessary)
# install.packages('userfriendlyscience');

### Load the userfriendlyscience package
require('userfriendlyscience');

### Dichotomize the behavior measure
exampleData$behavior_dichotomous <- as.numeric(exampleData$behavior >= 135);

### Convert to a factor with meaningful level labels
exampleData$behavior_dichotomous <- factor(exampleData$behavior_dichotomous,
                                           levels=0:1,
                                           labels=c("No Event", "Event"));


### Show frequencies using the 'freq' function from the
### userfriendlyscience package
freq(exampleData$behavior_dichotomous);

### Compute numbers needed for change
nnc(cer=.071, mean = 90, sd = 30, d = .5);

In this simulated dataset, 7.1 percent of the participants in the control condition exhibited the desired behavior (CER=.071). The result of the computation of the Numbers Needed for Change is that 11 people must receive the intervention for one person to change their behavior (NNC = 11, and EER = .17).

Example 2: Binary estimate

The variable we created in the previous example is an example of the variable that would be available to researchers who measured behavior using a dichotomous variable. Therefore, we can simply repeat the last two steps: obtain the frequencies and compute the NNC.

### Show the frequencies
freq(exampleData$behavior_dichotomous);

### Compute numbers needed for change
nnc(cer=.071, mean = 90, sd = 30, d = .5);

The result is, of course, the same as it was for the first example.

Example 3: Parametric estimate

This example is different: here, a researcher has no primary data about the control event rate. Instead, only information about the distribution is available: a mean and a standard deviation. By assuming that the distribution is normal, it becomes possible to estimate the percentage of the distribution that exceeds the threshold. We again use the same data, with a mean of 90 minutes, a standard deviation of 30 minutes, a threshold of 135 minutes, and an effect size estimate of Cohen’s d = 0.5. In this situation, however, these numbers can be provided directly to the nnc function.

### Compute numbers needed for change
nnc(threshold=135, mean = 90, sd = 30, d = .5);

In this situation, the control event rate is computed at 6.68%, and the experimental event rate at 15.87%, resulting in an estimate for the Numbers Needed for Change of 11.
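For readers who prefer to see what such a computation involves, the same estimate can be sketched in base R without the userfriendlyscience package; this is our illustration of the underlying steps, not the package's actual implementation:

```r
### Manual NNC computation for Example 3 using base R only
cer <- pnorm(135, mean = 90, sd = 30, lower.tail = FALSE);  # control event rate
eer <- pnorm(.5 + qnorm(cer));                              # Equation 3, d = 0.5
ceiling(1 / (eer - cer));                                   # NNC = 11
```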

Discussion

We have described the NNC as an additional measure of effect size for behavioral interventionists, one that conveys practical and intuitive information about intervention effects. The NNC is a step beyond the relative effect size measures, such as Cohen's d, by which interventions are commonly evaluated. The present paper has argued for two benefits of NNC estimation for applied research: 1) it has the potential to inform policy and practice about the benefits of implementing an intervention because it expresses effect size in a common-sense frequency format (number of people), and 2) the measure considers the prevalence of behavior prior to intervention, thereby relating it more directly to changes in the population. In short, the NNC brings intervention effectiveness analysis closer to the real-world implications of interventions.

Reshaping common indices of effect into measures that use the NNC metric may also lead to different interpretations of effect size. In psychological science, the coefficient of determination in regression models has a realistic range of 0 to .40. Relatedly, in the analysis of variance context, variance-based indices such as eta-squared and omega-squared seldom reach this level of explained variance. In practice, such measures therefore make it almost impossible to meaningfully differentiate between the effectiveness of interventions. Rosenthal and Rubin (1982) went as far as to argue that r- or r2-based categories of 'small effect' (e.g. r = .30) can lead to "absurd" conclusions about effect magnitude. In the BESD, a correlation of .30 corresponds to an estimated success rate difference of 30%. Reducing the probability of death by 30%, these authors argued, can hardly be seen as a small feat. Given that the BESD frames effect size in optimal base-rate circumstances, however, this conclusion may be too strong.
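For readers unfamiliar with the display, the success rates behind this 30% figure follow from the standard BESD formula, in which a correlation r is shown as success rates of .50 − r/2 and .50 + r/2:

```r
### Binomial Effect Size Display for r = .30: success rates of
### .50 - r/2 (control) versus .50 + r/2 (treatment).
r <- .30;
controlSuccessRate <- .50 - r / 2;            ### .35
treatmentSuccessRate <- .50 + r / 2;          ### .65
treatmentSuccessRate - controlSuccessRate;    ### .30, i.e. a 30% difference
```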
Nonetheless, the BESD illustrates that the effectiveness of behavioral interventions may consistently – and incorrectly – be interpreted as 'small' given traditional indices. Furukawa also notes that Cohen's categorization of effect size may lead to inadequate conclusions when authors argue at the population level. Using the prevalence of depression treatment (about 2 million) and given NNT = 17, the author points out: 'If we can find a new treatment that is better than the current treatment as usual by Cohen's d of 0.2, it can bring about remission in additional 100 thousand or more people that would not have done so on the current treatment. This of course is no trivial number' (p. 5).

The arguments we have provided for supplementing statistical analysis with the NNC are focused on research with an applied orientation: clearly applicable to health intervention research, but generalizable to other disciplines interested in the real-world implications of interventions. Notably, the arguments to go beyond NHST and generic effect sizes are framed as psychological rather than statistical; the NNC is intuitive, and it puts practical considerations such as cost-effectiveness within an intuitive grasp, because 'number of people' can be readily transformed into an estimation of costs (cf. standardized change in means).

Our description of the NNC estimate based on the NNT index has several implications for research practice. First, Equation 3 emphasizes and illustrates the importance of the CER in determining the NNC. Informally, such base rates of behavior are often implicitly considered in defining target groups, in which a problematic behavior has a higher incidence (and thus a higher CER). The NNC formalizes the intuition that prevalence is a modifier of effectiveness, which enables a better prediction of the effect size that is needed for behavior change with practical implications.
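Furukawa's population-level point reduces to simple arithmetic: dividing the treated population by the NNT yields the number of additional remissions. A sketch in R, taking the 2 million figure and NNT of 17 from the passage above:

```r
### Population-level implication of an NNT: the treated population
### divided by the NNT gives the number of additional people helped.
population <- 2000000;
nnt <- 17;
additionalRemissions <- floor(population / nnt);
additionalRemissions;   ### 117647, i.e. '100 thousand or more people'
```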
To take intervention analysis out of abstract quantifications into information with practical value, researchers need to incorporate measures of the CER in intervention research: either a direct measure of how many people are already meeting the intervention goals, or an estimate of the CER obtained by analyzing the frequency distribution of the control group and determining the probability of the respective events (as illustrated in the examples above).
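As a hypothetical illustration of the first route, the CER can be estimated directly as the proportion of control-group participants who already meet the behavioral goal (the data and threshold below are invented for illustration):

```r
### Hypothetical control-group data: weekly exercise in minutes.
controlMinutes <- c(60, 80, 95, 140, 70, 100, 150, 85, 90, 110);

### Intervention goal: at least 135 minutes per week.
threshold <- 135;

### The CER is the proportion of control participants already
### meeting the goal (here, 2 out of 10).
cerEstimate <- mean(controlMinutes >= threshold);
cerEstimate;   ### .2
```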

Limitations of the NNC procedure

The NNC works under the assumptions of normality and homogeneity of variances. The procedure further involves a categorization of quantitative data. From a statistical optimization perspective, categorization of quantitative data is always a suboptimal method of data processing, since information is lost in the process (e.g. Cohen, 1983; MacCallum, Zhang, Preacher, & Rucker, 2002). As Kraemer and Kupfer (2006) put the dilemma: 'The choice is often depicted as a struggle between statistical considerations favoring dimensional outcomes and clinical considerations favoring categorical outcomes' (p. 994). However, we propose that a focus on practical significance, as expressed by the NNC, does not in fact generate any such decision-making dilemma, for several reasons.

First, optimality in statistical methods faces a trade-off with the communicative and practical utility of their metrics – particularly so for applied research. In our view, the benefits of the NNC in intuitively conveying information about intervention effects outweigh the loss of information inherent to the procedure. Second, standardized mean change measures (e.g. Cohen's d) may, from a practical perspective, be regarded as similarly categorical indices of effect size. Researchers do not draw and report continuous implications and interpretations using dimensional generic indices, but rather categorize them into (communicable) informational packages of 'small', 'medium' and 'large'. The merit of the NNC, then, is best compared to generic indices at the level of rules of thumb for size interpretation. Evaluated as such, the NNC is a clear improvement over current practice.

Since the value of the NNC lies on a different dimension than statistical considerations, we suggest that the measure should not be contrasted with, or serve as an alternative to, common measures of effect size. Instead, we recommend that researchers use the NNC as an additional (practical) dimension of intervention analysis, to be used in conjunction with indices such as Cohen's d (and a corresponding confidence interval). By extension of this notion, we do not promote replacing scaled variables with binary outcome measures with the aim of easing NNC calculation. Instead, we suggest that moving beyond NHST implies more than a shift towards effect size indices that express the relevance of an effect in terms of abstract quantities. Reporting the NNC may be one tool to narrow the gap between applied psychological research and practice.

References

Altman, D. G. (1998). Confidence intervals for the number needed to treat. BMJ, 317(7168), 1309–1312. http://doi.org/10.1136/bmj.317.7168.1309

Altman, D. G., & Andersen, P. K. (1999). Calculating the number needed to treat for trials where the outcome is time to an event. BMJ, 319(7223), 1492–1495. http://doi.org/10.1136/bmj.319.7223.1492

Bartholomew Eldridge, L. K., Markham, C. M., Ruiter, R. A. C., Fernández, M. E., Kok, G., & Parcel, G. S. (2016). Planning health promotion programs: An Intervention Mapping approach (4th ed.). San Francisco, CA: Jossey-Bass.

Brooks, M. E., Dalal, D. K., & Nolan, K. P. (2014). Are common language effect sizes easier to understand than traditional effect sizes? Journal of Applied Psychology, 99(2), 332–340. http://doi.org/10.1037/a0034745

Citrome, L., & Ketter, T. A. (2013). When does a difference make a difference? Interpretation of number needed to treat, number needed to harm, and likelihood to be helped or harmed. International Journal of Clinical Practice. http://doi.org/10.1111/ijcp.12142

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003. http://doi.org/10.1037/0003-066X.49.12.997

Cook, R. J., & Sackett, D. L. (1995). The number needed to treat: a clinically useful measure of treatment effect. British Medical Journal, 310(6977), 452–454.

Cosmides, L., & Tooby, J. (1996). Are humans good intuitive statisticians after all? Rethinking some conclusions from the literature on judgment under uncertainty. Cognition, 58(1), 1–73. http://doi.org/10.1016/0010-0277(95)00664-8

Cumming, G. (2014). The New Statistics: Why and How. Psychological Science, 25(1), 7–29. http://doi.org/10.1177/0956797613504966

Da Costa, B. R., Rutjes, A. W. S., Johnston, B. C., Reichenbach, S., Nüesch, E., Tonia, T., … Jüni, P. (2012). Methods to convert continuous outcomes into odds ratios of treatment response and numbers needed to treat: Meta-epidemiological study. International Journal of Epidemiology, 41(5), 1445–1459. http://doi.org/10.1093/ije/dys124

Furukawa, T. A. (1999). From effect size into number needed to treat. Lancet, 353(9165), 1680. http://doi.org/10.1016/s0140-6736(99)01163-0

Furukawa, T. A., & Leucht, S. (2011). How to obtain NNT from Cohen's d: Comparison of two methods. PLoS ONE, 6(4), e19070. http://doi.org/10.1371/journal.pone.0019070

Gardner, M. J., & Altman, D. G. (1986). Confidence intervals rather than P values: estimation rather than hypothesis testing. British Medical Journal (Clinical Research Ed.), 292(6522), 746–750. http://doi.org/10.1136/bmj.292.6522.746

Gigerenzer, G. (1996). Why do frequency formats improve Bayesian reasoning? Cognitive algorithms work on information, which needs representation. Behavioral and Brain Sciences, 19(1), 23. http://doi.org/10.1017/S0140525X00041248

Gigerenzer, G., & Hoffrage, U. (1995). How to improve Bayesian reasoning without instruction: Frequency formats. Psychological Review, 102(4), 684–704. http://doi.org/10.1037/0033-295X.102.4.684

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. http://doi.org/10.1007/s10654-016-0149-3

Gruijters, S. L. K. (2016). Baseline comparisons and covariate fishing: Bad statistical habits we should have broken yesterday. European Health Psychologist, 18(5), 205–209.

Hasselblad, V., & Hedges, L. V. (1995). Meta-analysis of screening and diagnostic tests. Psychological Bulletin, 117(1), 167–178. http://doi.org/10.1037/0033-2909.117.1.167

Hilton, D. J., Reid, C. M., & Paratz, J. (2006). An under-used yet easily understood statistic: the number needed to treat (NNT). Physiotherapy, 92(4), 240–246. http://doi.org/10.1016/j.physio.2006.06.003

Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185(4157), 1124–1131. http://doi.org/10.1126/science.185.4157.1124

Kahneman, D., & Tversky, A. (1972). Subjective Probability: A Judgement of Representativeness. Cognitive Psychology, 3, 430–454. http://doi.org/10.1016/0010-0285(72)90016-3

Kirk, R. E. (1996). Practical Significance: A Concept Whose Time Has Come. Educational and Psychological Measurement, 56(5), 746–759. http://doi.org/10.1177/0013164496056005002

Kraemer, H. C., & Kupfer, D. J. (2006). Size of treatment effects and their importance to clinical research and practice. Biological Psychiatry, 59(11), 990–996.

Laupacis, A., Sackett, D. L., & Roberts, R. S. (1988). An assessment of clinically useful measures of the consequences of treatment. New England Journal of Medicine, 318(26), 1728–1733. http://doi.org/10.1056/NEJM198806303182605

Lehmann, B. A., Chapman, G. B., Franssen, F. M. E., Kok, G., & Ruiter, R. A. C. (2016). Changing the default to promote influenza vaccination among health care workers. Vaccine, 34(11), 1389–1392. http://doi.org/10.1016/j.vaccine.2016.01.046

May, H. (2004). Making statistics more meaningful for policy research and program evaluation. American Journal of Evaluation, 25(4), 525–540. http://doi.org/10.1016/j.ameval.2004.09.004

McGraw, K. O., & Wong, S. P. (1992). A common language effect size statistic. Psychological Bulletin, 111(2), 361–365. http://doi.org/10.1037/0033-2909.111.2.361

Peters, G.-J. Y. (2017). userfriendlyscience: Quantitative analysis made accessible [R package].

Peters, G.-J. Y., & Crutzen, R. (2017). Knowing exactly how effective an intervention, treatment, or manipulation is and ensuring that a study replicates: accuracy in parameter estimation as a partial solution to the replication crisis.

R Development Core Team. (2017). R: A Language and Environment for Statistical Computing. Vienna, Austria.

Rosenthal, R. (1990). How are we doing in soft psychology? American Psychologist, 45, 775–777. http://doi.org/10.1037/0003-066X.45.6.775

Rosenthal, R. (1991). Effect sizes: Pearson's correlation, its display via the BESD, and alternative indices. American Psychologist, 46(10), 1086–1087. http://doi.org/10.1037/0003-066X.46.10.1086

Rosenthal, R. (2005). Binomial Effect Size Display. In Encyclopedia of Statistics in Behavioral Science. Chichester, UK: John Wiley & Sons, Ltd. http://doi.org/10.1002/0470013192.bsa050

Rosenthal, R., & Rubin, D. B. (1982). A simple, general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74(2), 166–169.

Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13(1), 19–30. http://doi.org/10.1037/1082-989X.13.1.19

Sloman, S. A., Over, D., Slovak, L., & Stibel, J. M. (2003). Frequency illusions and other fallacies. Organizational Behavior and Human Decision Processes, 91(2), 296–309. http://doi.org/10.1016/S0749-5978(03)00021-9

Suissa, S. (1991). Binary methods for continuous outcomes: A parametric alternative. Journal of Clinical Epidemiology, 44(3), 241–248. http://doi.org/10.1016/0895-4356(91)90035-8

Thorlund, K., Walter, S. D., Johnston, B. C., Furukawa, T. A., & Guyatt, G. H. (2011). Pooling health-related quality of life outcomes in meta-analysis: A tutorial and review of methods for enhancing interpretability. Research Synthesis Methods, 2(3), 188–203. http://doi.org/10.1002/jrsm.46

Tversky, A., & Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review, 90, 293–315.

Wilkinson, L., Aiken, L., Appelbaum, M., Boodoo, G., Kenny, D. A., Kraemer, H., … Cronbach, L. (1999). Statistical Methods in Psychology Journals: Guidelines and Explanations. American Psychologist, 54(8), 594–604.