Exploration, Inference and Prediction in Neuroscience and Biomedicine

Exploration, inference and prediction in neuroscience and biomedicine Danilo Bzdok 1, 2, 3, * & John P. A. Ioannidis 4, 5, * 1 Department of Psychiatry, Psychotherapy and Psychosomatics, RWTH Aachen University, 52072 Aachen, Germany 2 JARA, Translational Brain Medicine, Aachen, Germany 3 Parietal Team, INRIA, Neurospin, bat 145, CEA Saclay, 91191 Gif-sur-Yvette, France 4 Meta-Research Innovation Center at Stanford, Stanford University, Stanford, California, USA 5 Departments of Medicine, of Health Research and Policy, of Biomedical Data Science, and of Statistics, Stanford University, Stanford, California, USA *Correspondence: [email protected] (Twitter: @danilobzdok) and [email protected] Abstract The last decades saw dramatic progress in brain research. These advances were often buttressed by probing single variables to make circumscribed discoveries, typically through null hypothesis significance testing. New ways for generating massive data fueled tension between the traditional methodology, used to infer statistically relevant effects in carefully-chosen variables, and pattern-learning algorithms, used to identify predictive signatures by searching through abundant information. In this article, we detail the antagonistic philosophies behind two quantitative approaches: certifying robust effects in understandable variables, and evaluating how accurately a built model can forecast future outcomes. We discourage choosing analysis tools via categories like ‘statistics’ or ‘machine learning’. Rather, to establish reproducible knowledge about the brain, we advocate prioritizing tools in view of the core motivation of each quantitative analysis: aiming towards mechanistic insight, or optimizing predictive accuracy. Keywords: Reproducibility; big-data analytics; data science; black-box models; deep learning; precision medicine Page 1 ‘[Deep] neural networks are elaborate regression methods aimed solely at prediction, not estimation or explanation.’ Efron & Hastie [1, p. 371] The emergence of richer datasets alters everyday data-analysis practices There is a burgeoning controversy in neuroscience on what data analysis should be about. Similar to many other biomedical disciplines, there are differing perspectives among researchers, clinicians, and regulators about the best approaches to make sense of these unprecedented data resources. Traditional statistical approaches, such as null hypothesis significance testing, were introduced in a time of data scarcity and have been revisited, revised, or even urged to be abandoned. Currently, a growing literature advertises predictive pattern-learning algorithms hailed to provide some traction on the data deluge [2, 3]. Such modeling tools for prediction are increasingly discussed in particular fields of neuroscience [for some excellent sources see: 4, 5-9]. Ensuing friction is aggravated by the incongruent historical trajectories of mainstream statistics and emerging pattern-learning algorithms – the former long centered on significance testing procedures to obtain p-values, the latter with a stronger heritage in computer science [10-12]. We argue here that the endeavor of sorting each analysis tool into categories like ‘statistics’ or ‘machine-learning’ is futile [13, 14]. Take for instance ordinary linear regression, as it is routinely applied by many neuroscientists. The same tool and its underlying mathematical prosthetics can be used to achieve three diverging goals [15, pp. 82-83, 16, ch. 4.12]: a) exploration, to get a first broad impression of the dependencies between a set of measured variables in the data at hand, b) inference, to discern which particular input variables contribute to the target variable beyond chance level, and c) prediction, to enable statements about how well target variables can be guessed based on data measured in the future. Confusion can arise because it is the motivation for using linear regression that differs between these scenarios. The mathematical mechanics underlying model parameter fitting are indistinguishable. Taken more broadly, instead of attaching labels of opposing camps to each analysis tool, it would be more productive, we would argue, to focus on the desired goal of a specific quantitative analysis. The goal, rather than the choice of a particular tool, is the major factor that ultimately determines what statements can confidently be made about brains, behavior, or genes, or for that matter – any other question of interest. Page 2 Exploration, inference, prediction: A typology of different modeling goals The initial description of a correlative relation in brain data is a common first step in many research projects. An important distinction arises when deciding on how to venture into identifying reproducible findings in quantitative analysis. How a particular analysis tool is used in a certain application domain may often be more important than which class of analysis tool is chosen. a) Exploration of correlative associations: In various studies, a straightforward approach to charting candidate associations in brain data is Pearson’s correlation (without computing p- values). A simple statistic is thus computed between two series of measurements for descriptive purposes. As one concrete example, this analysis can quantify the relationship between amygdala activity measured in an fMRI experiment and some behavioral response. Such tentative data exploration can also be done in situations involving one input and one output variable by fitting a linear regression to the data. In these informal settings, the modeling goal is limited to a descriptive, correlational summary of the raw data that one happened to observe. Estimating linear-regression parameters alone does not license the importance of certain variable relationships (i.e., inference). Neither does a fitted linear regression itself declare whether these variable relationships hold up for other individuals or future data points (i.e., prediction). b) Inference of statistically significant (and possibly causal) associations: Another goal is to try to isolate the specific contributions of single variables, to uncover how the observed response depends on each particular measurement. This is a common agenda in many well- controlled experimental designs. For instance, those looking into the effects of gene- knockout in mice or clinical trials examining the impact of a specific treatment in patients. Historically, this type of deductive reasoning has often drawn on null hypothesis significance testing (NHST). The framework however is sometimes ill-suited and frequently misunderstood [17-19]. As an alternative to NHST, one may draw formal inference by means of false discovery rate (FDR), Bayesian posterior inference, or other tools [1, ch. 3 and 15]. Inferences also need to take into account various biases [20] to avoid making claims that represent false positives (in the NHST framework), underestimated FDR (in the FDR framework), or exaggerated posterior parameter distributions (in the Bayesian framework) [1, ch. 3, 21, ch. 18.7]. Much debate has emerged about what inferential statements about Page 3 relevant variable contributions mean [10, 13, 22], and how significant associations tends towards the holy grail of uncovering causal influences [23]. c) Generalization of predictive associations: One way to substantiate the explored correlations or inferred significance statements is by verifying whether these quantitative relationships still hold up in other data points or new individuals. This goal common to many observational, naturalistic, and prospective epidemiological studies. For instance, increasingly, predictive pattern-learning algorithms are used to derive the behavioral response of individuals from whole-brain neural activity or derive health risk from genomic profiling [cf. 24, 25, 26]. Predictive modeling can also be carried out by standard linear regression. Several fields of clinical medicine have already accumulated a large literature of predictive scores and tools [27, 28]. Currently, usage of predictive approaches lacks standardization and few are rigorously validated [29]. Even fewer are evaluated for replication in different settings and groups of individuals [30]. Increasingly complex predictive models use hundreds and thousands of parameters and/or try to benefit from non-linear interactions in extensive data, like electronic health records [31]. Notably, it has so far rarely been shown that accounting for complex non-linearity in “big” medical data has considerably improved the predictive performance. The low success rate is perhaps partly due to the still insufficient sample sizes or to limited quality of the measurements [7, 20, 32]. To be clear, exploration, inference, and prediction are not strictly mutually exclusive. Rather, quantitative investigations often involve a combination of the three approaches, prioritized to different degrees. In many neuroscience domains that are starting to amass “big data” predictive pattern-learning algorithms are becoming popular alternatives to classic linear-regression applications [2, 3]. Such algorithmic tools include support vector machines, random forests, or artificial neural network. Regardless of whether linear-regression approaches or pattern-learning algorithms are used, the main goal of the prediction enterprise is to put the built model, with already estimated model parameters, to the test in some independent data [21, ch. 7]. In this analysis regime, the investigator wishes to achieve the highest-possible

Exploration, Inference and Prediction in Neuroscience and Biomedicine

Downloading of a Human Consciousness Into a Digital Computer Would Involve ‘A Certain Loss of Our Finer Feelings and Qualities’89

Caveat Emptor: the Risks of Using Big Data for Human Development

Reproducibility: Promoting Scientific Rigor and Transparency

Rejecting Statistical Significance Tests: Defanging the Arguments

The Statistical Crisis in Science

Replicability Problems in Science: It's Not the P-Values' Fault

Just How Easy Is It to Cheat a Linear Regression? Philip Pham A

Distributional Null Hypothesis Testing with the T Distribution Arxiv

Responsible Use of Statistical Methods Larry Nelson North Carolina State University at Raleigh

P-Hacking' Now an Insiders' Term for Scientific Malpractice Has Worked Its Way Into Pop Culture

Investigating the Effect of the Multiple Comparisons Problem in Visual Analysis

Reproducibility and Privacy: Research Data Use, Reuse, and Abuse