Clarifying Questions About “Risk Factors”: Predictors Versus Explanation C
Total Page:16
File Type:pdf, Size:1020Kb
Schooling and Jones Emerg Themes Epidemiol (2018) 15:10 Emerging Themes in https://doi.org/10.1186/s12982-018-0080-z Epidemiology ANALYTIC PERSPECTIVE Open Access Clarifying questions about “risk factors”: predictors versus explanation C. Mary Schooling1,2* and Heidi E. Jones1 Abstract Background: In biomedical research much efort is thought to be wasted. Recommendations for improvement have largely focused on processes and procedures. Here, we additionally suggest less ambiguity concerning the questions addressed. Methods: We clarify the distinction between two confated concepts, prediction and explanation, both encom- passed by the term “risk factor”, and give methods and presentation appropriate for each. Results: Risk prediction studies use statistical techniques to generate contextually specifc data-driven models requiring a representative sample that identify people at risk of health conditions efciently (target populations for interventions). Risk prediction studies do not necessarily include causes (targets of intervention), but may include cheap and easy to measure surrogates or biomarkers of causes. Explanatory studies, ideally embedded within an informative model of reality, assess the role of causal factors which if targeted for interventions, are likely to improve outcomes. Predictive models allow identifcation of people or populations at elevated disease risk enabling targeting of proven interventions acting on causal factors. Explanatory models allow identifcation of causal factors to target across populations to prevent disease. Conclusion: Ensuring a clear match of question to methods and interpretation will reduce research waste due to misinterpretation. Keywords: Risk factor, Predictor, Cause, Statistical inference, Scientifc inference, Confounding, Selection bias Introduction (2) assessing causality. Tese are two fundamentally dif- Biomedical research has reached a crisis where much ferent questions, concerning two diferent concepts, i.e., research efort is thought to be wasted [1]. Recommen- prediction versus explanation, which require diferent dations to improve the situation have largely focused approaches and have substantively diferent interpreta- on processes and procedures, such as addressing high tions. However, the use of the term “risk factor” as some- impact questions, starting from what is already known, thing which may predict and/or explain means that these registration of protocols, and making data available [1]. two concepts may be confated so that a study may not Here we additionally suggest that ensuring the concep- fulfll either objective, i.e., neither predicts nor explains. tual approach matches the question would avoid confat- For example, major predictors of cardiovascular disease ing diferent questions and mistaking the answer to one have long been assumed to be targets of intervention [2]. question for the answer to a diferent question. Much After extensive and expensive research investment over observational biomedical research concerns the role more than 35 years, including the development, test- of “risk factors” in disease, which is conducted for two ing and failure of an entire new class of drugs (CETP main reasons, (1) risk stratifcation or prediction and inhibitors) [3], high density lipoprotein cholesterol has recently been identifed to be a non-causal risk factor *Correspondence: [email protected]; [email protected] (i.e. predictor) for cardiovascular disease [4, 5]. Equally, 1 Graduate School of Public Health and Health Policy, City University factors that do not predict risk are very rarely identifed of New York, 55 West 125th St, New York, NY 10027, USA as causal factors (such as factors that are ubiquitous in a Full list of author information is available at the end of the article © The Author(s) 2018. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/ publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Schooling and Jones Emerg Themes Epidemiol (2018) 15:10 Page 2 of 6 given community and thus are not caught as increasing inaccurate, possibly because the model needed dynamic risk) [6], which suggests interventions are being missed. recalibration of the relation between search terms and Given, the importance of avoiding ‘low priority questions’ infuenza to stay on track [9]. Tried and tested predic- [1], here, we clarify the diference between these two con- tion models are immensely valuable for identifying target cepts and their use in observational studies. populations, i.e. people or groups in need of prevention or treatment, but the “risk factors” that predict a health Prediction models condition are not necessarily targets of intervention. For Risk stratifcation, prediction models or ‘weather fore- example fu symptoms do not cause infuenza, and are casting’ models identify people or groups at high or ele- not targets of intervention to prevent the occurrence of vated risk of a particular health condition, ideally so that fu. Whether the “risk factors” that predict health con- they can be ofered proven interventions, or other miti- ditions in risk stratifcation models are also targets of gation can be implemented. A very successful example of intervention has to be established from diferent stud- a risk stratifcation model is the Framingham score which ies designed to assess efects of interventions. As such, it predicts 10-year risk of heart disease in healthy people is not appropriate to calculate a population attributable [7], to inform prevention, such as use of lipid modulators. risk or proportion for “risk factors” from a risk prediction Other examples include prognostic models for identify- model, because removal of these “risk factors” might or ing the best cancer treatment [8], or models for predict- might not afect population health. Similarly, the purpose ing disease trends, such as Google Flu Trends [9]. Tese of predictive models is to explain the greatest amount of predictive models typically rely on statistical projections the variance in the outcome, so only factors that contrib- of previous patterns, and to be feasible usually rely on ute to explaining the variance need to be included. Te easily captured information. For example, the Framing- concepts of confounding, mediation and efect measure ham score can be applied in daily clinical practice, even modifcation are not applicable to predictive models. in resource poor settings, because it only requires assess- Interaction terms can be added to predictive models to ment of age, sex, smoking, blood pressure, lipids and dia- improve model ft, but these interactions should not be betes, which are relatively cheap and quick to measure. interpreted as indicating diferent efects by subgroup. Google fu trends was based on internet search terms for particular symptoms [9]. Explanatory models Prediction models are usually developed based on statistical criteria to ft the distribution of the data well, Explanatory models can be thought of as simplifed, using techniques such as stepwise selection, or more abstract, propositional models for some particular aspect recently machine learning techniques. Prediction models of how the world works, which provide a guide as to how often include several “risk factors” to obtain a model that to manipulate items of interest. As such, explanatory fts the data well and can explain the greatest amount of models are based on potentially causal factors i.e., factors variance in the outcome health condition. Te contribu- whose manipulation changes the outcome [11]. Studies tion of each “risk factor” is presented so that the reader assessing causality are explanatory rather than predic- can see the independent contribution of each one to tive. Explanatory models are designed to assess whether the overall prediction, as well as measures of model ft. a particular “risk factor” explains the occurrence, or Prediction models are usually validated in similar popu- course, of disease and as such is a valid target of inter- lations. As with all statistical models they cannot be vention. “Risk factors” selected as potential causal factors expected to predict well in novel circumstances [9], and might be based on “risk factors” from predictive models, are best developed using a representative sample of the might be theoretically based or might be hypothesized population in which they will be applied. Consistently from other sources. For example, the Framingham score poor measurement will impair precision, because it adds includes factors, such as smoking and blood pressure, noise. Inconsistently poor measurement will impair pre- which undoubtedly cause heart disease, but also other dictive power, because it may change the relation between “risk factors”, such as age, sex and high-density lipopro- risk factor and outcome. Prediction models may not be tein, whose causal role in heart disease is less clear [12]. generalizable to populations that difer from the one in In contrast, factors that do not predict disease might be which they were developed because in a new population identifed as possible causal