Molecular Psychiatry (2012) 17, 956–959
© 2012 Macmillan Publishers Limited. All rights reserved 1359-4184/12
www.nature.com/mp

PERSPECTIVE

Machine learning and data mining: strategies for hypothesis generation

MA Oquendo1, E Baca-Garcia1,2, A Artés-Rodríguez3, F Perez-Cruz3,4, HC Galfalvy1, H Blasco-Fontecilla2, D Madigan5 and N Duan1,6

1Department of Psychiatry, New York State Psychiatric Institute and Columbia University, New York, NY, USA; 2Fundacion Jimenez Diaz and Universidad Autonoma, CIBERSAM, Madrid, Spain; 3Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Madrid, Spain; 4Princeton University, Princeton, NJ, USA; 5Department of Statistics, Columbia University, New York, NY, USA and 6Department of Biostatistics, Columbia University, New York, NY, USA

Strategies for generating knowledge in medicine have included observation of associations in clinical or research settings and, more recently, development of pathophysiological models based on molecular biology. Although critically important, they limit hypothesis generation to an incremental pace. Machine learning and data mining are alternative approaches to identifying new vistas to pursue, as is already evident in the literature. In concert with these analytic strategies, novel approaches to data collection can enhance the hypothesis pipeline as well. In data farming, data are obtained in an ‘organic’ way, in the sense that they are entered by patients themselves and are available for harvesting. In contrast, in evidence farming (EF), it is the provider who enters medical data about individual patients. EF differs from regular electronic medical record systems because frontline providers can use it to learn from their own past experience. In addition to the possibility of generating large databases with farming approaches, it is likely that we can further harness the power of large data sets collected using either farming or more standard techniques through implementation of data-mining and machine-learning strategies. Exploiting large databases to develop new hypotheses regarding neurobiological and genetic underpinnings of psychiatric illness is useful in itself, but also affords the opportunity to identify novel mechanisms to be targeted in drug discovery and development.
Molecular Psychiatry (2012) 17, 956–959; doi:10.1038/mp.2011.173; published online 10 January 2012
Keywords: data farming; discovery; empiricism; inductive reasoning

Correspondence: Dr M Oquendo, Department of Psychiatry, New York State Psychiatric Institute and Columbia University, 1051 Riverside Drive, Unit 42, New York, NY 10032, USA.
Received 15 July 2011; revised 20 October 2011; accepted 21 November 2011; published online 10 January 2012

Some of the major discoveries in medicine have been the product of serendipity combined with astute observation. In 1877, Louis Pasteur observed that the growth of the anthrax bacilli in culture was inhibited when these were contaminated with moulds; this observation led to the discovery of penicillin. In 1952, chlorpromazine, originally developed for surgical interventions as a […] that did not induce unconsciousness, was found not only to have serenic effects, but also to improve behavior and thinking in psychotic patients. These discoveries, cornerstones of major advances in […], were largely the product of observation on the part of scientists who were open to discovery and not prejudiced by their own hypotheses, such that they were free to observe and record previously unexpected effects.

Despite several centuries of empirical scientific approach to medical research, medicine remains beholden to a handful of strategies for generating new knowledge. One is the aforementioned time-honored observation of associations in clinical practice or research settings. More recently, advances in molecular biology have made possible developments that are based on hypotheses about pathophysiology that lead to the generation of treatment approaches that address the underlying mechanism. For example, the serotonin transporter blocker zimelidine was developed in the 1960s as an antidepressant,1 based on the observation that tricyclics blocked both norepinephrine and serotonin transporters. The antidepressant action was thought to be mediated through norepinephrine reuptake blockade, leading investigators to develop a selective serotonin transporter blocker to assess its effects.

Unfortunately, the former approach to discovery is both unpredictable and plodding, and relies on ‘creativity’ and ‘imagination’ in the context of unbiased observation. The latter is limited by the incremental development of knowledge about underlying molecular biology implicated in disease, a process that is necessarily constrained by the time required to perform the painstaking experiments to develop the basis for a pathophysiological model.

One possible approach to breaking the barriers to rapid growth of the knowledge base in medicine is to make use of unbiased observation, used to great advantage by Pasteur, employing modern approaches. With the aid of recently developed tools and the availability of large databases, we may be able to accelerate discovery and generate new leads that can then be evaluated in subsequent hypothesis-testing studies.

One such tool is machine learning (ML), a field that seeks to answer the question, ‘How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?’ (http://aaai.org/AITopics/MachineLearning).2 ML has the advantage of being comprised of systems that ‘learn’ from experience, observation, and/or other means. This results in a system that improves its efficiency and/or effectiveness over time. The usefulness of ML is bolstered by the versatility of its techniques (support vector machines or kernel methods, Gaussian processes, graphical models, deep belief networks or Dirichlet processes and so on) and its utility for artificial intelligence (classification, prediction, planning, recognition, regression, clustering, association rules and so on).3 Importantly, the use of ML approaches to data does not require a priori hypotheses. Much like a scientist might observe his/her subject to glean understanding, ML ‘observes’ the data and ‘learns’ from it to build understanding and uncover previously unexpected associations. In this way, this computational approach allows exploration of data to identify patterns and structures not suspected a priori,4,5 and thus can lead to the generation of new hypotheses. This is critical, especially in areas with huge data sets where hypothesis testing and/or traditional analytic strategies have led to disappointing results, such as in genetic and brain-imaging studies. Further, ML techniques have been developed for dealing with disparate data types, such as text and images, allowing analysis of heterogeneous data sets that contain a mixture of clinical, genetic and imaging data. As well, like more standard statistical approaches, ML permits addressing some of the common questions about data: What is the relationship between the variables? Are there associations between a given outcome variable and the predictor variables of interest? Might they be causal? Given all these properties, ML offers great opportunities for exploratory analysis of emerging large-scale data repositories, and the availability of large data sets containing psychiatric information both in the public domain and in the private sector provides opportunities to generate new knowledge. Several federal funding agencies require investigators to make certain types of data, mostly genetic and epidemiologic, available to the public after a specified period of time.6 It is likely that we can further harness the power of large data sets by using these types of analytic strategies.

Akin to ML is data mining, and indeed distinctions between these two have always been blurred. Data mining strategies have been in use for decades.5 Development of these new tools was made possible by the rapid evolution in computing power, allowing sophisticated computations not possible until very recently. Indeed, data mining methods have been used routinely to screen the WHO and other databases for evidence of adverse drug reactions for some years.7,8 However, apart from this type of application, medicine in general, and psychiatry in particular, have been relatively slow to adopt such techniques (see Figure 1).9 One significant barrier to the adoption of these techniques is that physicians are not trained in these methods and therefore are wary of them. Better appreciation of the potential value of these tools is essential for the advancement of psychiatry as well as other medical specialties.

Figure 1 Percent of publications using ML or data mining cited in Institute for Scientific Information (ISI) across five disciplines. Search was performed as follows: Topic = (‘data-mining’ or ‘data mining’ or ‘machine learning’ or ‘machine-learning’ or ‘support vector machine’ or ‘SVM’). [Plot not reproduced: y-axis, percent of publications (0 to 4); x-axis, year (1986 to 2009); series: Genetics, Pharmacy, Neuroscience, Psychiatry, Cardiology.]

Of interest, exciting new leads have been generated in other fields of medicine using hypothesis-generating and hypothesis-testing strategies. For instance, a ground-breaking study in neuroscience by Ray et al.10 used ML to classify patients as having Alzheimer’s or not. Archived plasma from 259 patients with presymptomatic to late-stage Alzheimer’s disease and controls was used to examine 120 known signaling proteins, quantified with ELISA. The split sample method was used to divide the Alzheimer’s and non-demented control groups into: (a) a training sample to be used for predictor discovery and supervised classification to generate a plausible hypothesis and (b) a test sample or validation sample to test or validate the hypothesis generated in the training sample. The training sample was subjected to a shrunken centroid algorithm called predictive analysis of microarray. Predictive analysis of microarray identified 18 proteins that were predictive of Alzheimer’s. Using this algorithm on the training sample, 95% of all Alzheimer’s cases were correctly
identified (positive agreement). On the other hand, 83% of non-demented control cases were classified in accordance with the clinical diagnosis (negative agreement). The 18 predictors were then used to classify subjects in the test sample as having Alzheimer’s or not. Relative to the clinical diagnosis, predictive analysis of microarray classified subjects in the test sample with 90% sensitivity (for the Alzheimer’s samples) and 88% specificity (for the non-Alzheimer’s samples). The authors then confirmed some of these results using postmortem diagnosis. In all, 8 out of 9 postmortem-confirmed subjects with Alzheimer’s disease were classified correctly by the predictive analysis of microarray algorithm, as were 10 out of the 11 non-Alzheimer’s subjects. This led the authors to suggest that this 18-protein array may constitute a biosignature for Alzheimer’s disease. This is an example of how a peripheral measure identified with machine-learning strategies can comport not only with a clinical diagnosis but also with the gold standard: a postmortem neuropathological diagnosis.

A handful of examples of the application of machine-learning tools to psychiatric data sets highlight their utility. ML has been put to use in large samples for the study of the natural evolution of psychiatric illness11 and in the identification of single-nucleotide polymorphisms associated with mental conditions.12 Smaller samples of psychotic patients (n = 36) and matched controls (n = 36) have had MRI data subjected to Sparse Multinomial Logistic Regression Classifier, an ML approach that develops a classification function based on a weighted combination of basis functions, tuning the weights during the learning phase to optimize classification of training data.13 Using this technique, cortical gray matter density maps discriminated between controls and patients with 86% accuracy. In another small sample of patients with schizophrenia and normal controls, Feature Selection generated an algorithm that showed 92% accuracy in identifying affected individuals based on functional connectivity as assessed by resting state functional magnetic resonance.14 Although we still do not know whether results from these initial forays will be confirmed in the long run, it is evident that unanticipated results can be reliably generated.

Although most recent papers using ML-based approaches have applied these analytical techniques to extant data sets from large epidemiological samples, brain-imaging data sets or genetic repositories, development of new methods to enhance our ability to collect data is essential, as well. A recent reconceptualization of data collection is illustrative. Instead of viewing data as something that scientists actively search for and collect, like fossil fuel deposited underground waiting to be mined, data can also be ‘organic,’ grown and harvested if suitable environmental conditions are provided. This notion is illustrated in the concept of ‘data farming’ or ‘evidence farming.’

One example of the data-farming paradigm is the online enterprise PatientsLikeMe.com. Founded in 2004, this website provides a platform for patients to self-identify and enter their own data into a database. Data include illness and laboratory variables, rating scales, […] and other treatments, outcomes and so on. The patient may use tools provided by the website to track his or her own experience over time. One stated goal, which also serves as an incentive for patients to provide data, is to help patients learn about each other’s course of illness, treatment and outcome such that individuals may compare their own experience with that farmed from others with similar conditions. The website also makes data available to partners of the organization for analysis and research, which is stated explicitly on the site. Of note, this website supports a variety of psychiatric conditions, including anxiety, bipolar disorder, depression, obsessive compulsive disorder and post-traumatic stress disorder, to name a few, with a total of 117 184 registered patients.

Although, to our knowledge, these data are not yet being analyzed by psychiatric investigators, they may provide a powerful repository of information. In particular, because the data set is unbiased by scientific trends or funding agencies, it could potentially yield information opening new vistas on psychiatric disease. In essence, successful data farms such as this one can facilitate the collection of data available for analysis, using both traditional statistical methods and machine-learning methods, to expand our knowledge base. There are, of course, the obvious issues around selection bias, because only those who are willing to make their personal health information public will participate. Also, only those who are computer literate and/or technology savvy or who have the economic resources to access technology are likely to submit their data to the ‘farm,’ both for their own information and for sharing with others. In addition, there are no methods for ascertaining the validity of the stated diagnoses, nor are there assurances that laboratory data are comparable across patients. Some of these issues are certainly not unique to this type of data set, but nonetheless, the breadth of the sample may render it of critical importance.

A related variant of the data-farming paradigm is the concept of evidence farming discussed in Hay et al.15 In evidence farming, it is the provider who enters medical data about individual patients. Evidence farming differs from regular electronic medical record systems because it permits frontline providers to learn from their own past experience, examining it with user-friendly analytic tools and employing the results in their current clinical decision making.15 Unutzer et al.16 have described a research tracking tool that is more comprehensive than a general electronic medical record and that permits clinicians to compare outcomes of their own patients with outcomes of similar patients being seen by other clinicians, providing opportunities for use of the data
by individual clinicians who are entering data from their own practice.

An important potential advantage for the farming (vs mining) paradigm is the incentive to the end-users who submit data to the ‘farm,’ such as patients participating in PatientsLikeMe.com or clinicians participating in Unutzer’s data registry. In contrast to the experience of subjects who enroll in traditional research studies, data-farm participants benefit directly from the ‘harvest’ of the data from the ‘farm’, including the use of their own data to monitor their own progress, and the opportunity to learn from data provided by other participants in the ‘farm.’ Those incentives provide opportunities to broaden the scope of the investigation, and include participants who might not participate in traditional research studies due to the lack of direct incentives.

At the same time, a successful ‘farm’ might also provide opportunities for investigators to analyze the data harvested from the ‘farm’ for the benefit of future patients who did not participate in the ‘farm.’ These research opportunities could be viewed as a by-product of the ‘farm’, in addition to the direct benefit to the participants themselves.

Given the paucity of paradigm-shifting breakthroughs in psychiatric research in recent decades, it behooves the field to explore all promising strategies to generate new leads. Being able to exploit large databases for new hypotheses regarding neurobiological and genetic underpinnings of psychiatric illness is useful in itself, but also affords the opportunity to identify novel mechanisms to be targeted in drug discovery and development. With several large data sets available to qualified investigators, there is a wealth of data that could be subjected to these methods. Moreover, data-farming approaches to amassing large data sets at low cost from a diverse pool of end-users can also enhance our ability to develop new leads in understanding psychiatric disorders. Psychiatry needs novel ideas to pursue. ML and computational models based on it can provide such a path, and data-derived ideas emerging from these repositories have special appeal for the empirically minded investigator. Our patients are waiting for improved therapies and quality of life. We are duty bound to pursue these goals vigorously. Perhaps our current notions of disease, diagnosis, causality and cure can be advanced by generating cutting-edge, previously unsuspected hypotheses using these tools, to be tested/validated in subsequent confirmatory studies. Are we ready to enhance the pipeline of innovative hypotheses?

Conflict of interest

Dr Oquendo has received unrestricted educational grants and/or lecture fees from Astra-Zeneca, Bristol Myers Squibb, Eli Lilly, Janssen, Otsuka, Pfizer, Sanofi-Aventis and Shire. Her family owns stock in Bristol Myers Squibb. The remaining authors declare no conflict of interest.

Acknowledgments

Dr Blasco-Fontecilla acknowledges the Spanish Ministry of Health (Rio Hortega CM08/00170), Alicia Koplowitz Foundation, and Conchita Rabago Foundation for funding his post-doctoral rotation at CHRU, Montpellier, France. SAF2010-21849.

References

1 Carlsson A. A paradigm shift in brain research. Science 2001; 294: 1021–1024.
2 Mitchell TM. The Discipline of Machine Learning. School of Computer Science: Pittsburgh, PA, 2006. Available from: http://aaai.org/AITopics/MachineLearning.
3 Nilsson NJ. Introduction to Machine Learning. An early draft of a proposed textbook. Robotics Laboratory, Department of Computer Science, Stanford University: Stanford, 1996. Available from: http://robotics.stanford.edu/people/nilsson/mlbook.html.
4 Hand DJ. Mining medical data. Stat Methods Med Res 2000; 9: 305–307.
5 Smyth P. Data mining: data analysis on a grand scale? Stat Methods Med Res 2000; 9: 309–327.
6 Burgun A, Bodenreider O. Accessing and integrating data and knowledge for biomedical research. Yearb Med Inform 2008; 47 (Suppl 1): 91–101.
7 Hochberg AM, Hauben M, Pearson RK, O’Hara DJ, Reisinger SJ, Goldsmith DI et al. An evaluation of three signal-detection algorithms using a highly inclusive reference event database. Drug Saf 2009; 32: 509–525.
8 Sanz EJ, De-las-Cuevas C, Kiuru A, Bate A, Edwards R. Selective serotonin reuptake inhibitors in pregnant women and neonatal withdrawal syndrome: a database analysis. Lancet 2005; 365: 482–487.
9 Baca-Garcia E, Perez-Rodriguez MM, Basurte-Villamor I, Saiz-Ruiz J, Leiva-Murillo JM, de Prado-Cumplido M et al. Using data mining to explore complex clinical decisions: A study of hospitalization after a suicide attempt. J Clin Psychiatry 2006; 67: 1124–1132.
10 Ray S, Britschgi M, Herbert C, Takeda-Uchimura Y, Boxer A, Blennow K et al. Classification and prediction of clinical Alzheimer’s diagnosis based on plasma signaling proteins. Nat Med 2007; 13: 1359–1362.
11 Baca-Garcia E, Perez-Rodriguez MM, Basurte-Villamor I, Lopez-Castroman J, Fernandez del Moral AL, Jimenez-Arriero MA et al. Diagnostic stability and evolution of bipolar disorder in clinical practice: a prospective cohort study. Acta Psychiatr Scand 2007; 115: 473–480.
12 Baca-Garcia E, Vaquero-Lorenzo C, Perez-Rodriguez MM, Gratacos M, Bayes M, Santiago-Mozos R et al. Nucleotide variation in central nervous system genes among male suicide attempters. Am J Med Genet B Neuropsychiatr Genet 2010; 153B: 208–213.
13 Sun D, van Erp TG, Thompson PM, Bearden CE, Daley M, Kushan L et al. Elucidating a magnetic resonance imaging-based neuroanatomic biomarker for psychosis: classification analysis using probabilistic brain atlas and machine learning algorithms. Biol Psychiatry 2009; 66: 1055–1060.
14 Shen H, Wang L, Liu Y, Hu D. Discriminative analysis of resting-state functional connectivity patterns of schizophrenia using low dimensional embedding of fMRI. Neuroimage 2010; 49: 3110–3121.
15 Hay MC, Weisner TS, Subramanian S, Duan N, Niedzinski EJ, Kravitz RL. Harnessing experience: exploring the gap between evidence-based medicine and clinical practice. J Eval Clin Pract 2008; 14: 707–713.
16 Unutzer J, Choi Y, Cook IA, Oishi S. A web-based data management system to improve care for depression in a multicenter clinical trial. Psychiatr Serv 2002; 53: 671–673.
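
The split-sample evaluation described in the text for the Ray et al. study (divide each diagnostic group into a training and a test sample, fit a classifier on the training sample, then report sensitivity and specificity against a reference diagnosis) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: a plain nearest-centroid rule stands in for the shrunken centroid algorithm (predictive analysis of microarray), the two-feature ‘protein levels’ are synthetic and are not data from the study, and the helper names (`split_sample`, `centroid`, `nearest_centroid`, `agreement`) are hypothetical. The final lines simply recompute the postmortem agreement from the counts quoted in the text (8 of 9 and 10 of 11).

```python
import random
from statistics import mean


def split_sample(records, train_fraction=0.5, seed=0):
    """Shuffle one diagnostic group and cut it into (training, test) samples."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def centroid(rows):
    """Per-feature mean of a list of equal-length feature vectors."""
    return [mean(column) for column in zip(*rows)]


def nearest_centroid(row, centroids):
    """Label of the class centroid closest to row (squared Euclidean distance)."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[label]))
    return min(centroids, key=dist)


def agreement(truth, predicted, positive="AD"):
    """Return (sensitivity, specificity) against the reference diagnosis."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Synthetic, well-separated feature vectors (assumption; NOT study data).
rng = random.Random(1)
groups = {
    "AD": [[rng.gauss(5.0, 0.5), rng.gauss(5.0, 0.5)] for _ in range(20)],
    "non-AD": [[rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(20)],
}

# Split each diagnostic group separately, as in the split-sample method.
train, test = {}, {}
for label, rows in groups.items():
    train[label], test[label] = split_sample(rows)

# "Learn" one centroid per group from the training sample only.
centroids = {label: centroid(rows) for label, rows in train.items()}

# Classify the held-out test sample and compare with the reference labels.
truth = [label for label, rows in test.items() for _ in rows]
predicted = [nearest_centroid(row, centroids) for rows in test.values() for row in rows]
sensitivity, specificity = agreement(truth, predicted)

# Postmortem counts quoted in the text: 8 of 9 and 10 of 11 classified correctly.
pm_truth = ["AD"] * 9 + ["non-AD"] * 11
pm_pred = ["AD"] * 8 + ["non-AD"] * 1 + ["non-AD"] * 10 + ["AD"] * 1
pm_sensitivity, pm_specificity = agreement(pm_truth, pm_pred)
print(f"postmortem sensitivity={pm_sensitivity:.3f}, specificity={pm_specificity:.3f}")
```

Keeping the test sample out of the centroid computation is the point of the design: the agreement figures then describe how the hypothesis generated in the training sample holds up on subjects it has not seen. A shrunken centroid classifier would additionally shrink each class centroid toward the overall centroid, which is how uninformative features drop out; that step is omitted here for brevity.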
