AI AND HEALTH Editor: Daniel B. Neill, H.J. Heinz III College, Carnegie Mellon University, [email protected]

Subtyping: What It Is and Its Role in Precision Medicine

Suchi Saria, Johns Hopkins University
Anna Goldenberg, SickKids Research Institute

IEEE Intelligent Systems, July/August 2015. 1541-1672/15/$31.00 © 2015 IEEE. Published by the IEEE Computer Society.

Many diseases (for example, neuropsychiatric, cardiovascular, and autoimmune disorders) are difficult to treat because of the remarkable degree of variation among affected individuals. Precision medicine, also known as personalized medicine or P4 medicine [1], is an emerging approach for individualizing the practice of medicine [2]. It takes into account individual variability in genes, environment, and lifestyle with the goals of better defining health or wellness for each person, predicting disease progression and transitions between disease stages, and targeting the most appropriate medical interventions.

In the autoimmune disease scleroderma, for example, as many as six different organ systems may be involved. Organ involvement trajectories can vary greatly across individuals, from no involvement to rapid and aggressive decline [3]. This uncertainty associated with an individual's disease progression makes treatment planning challenging.

Furthermore, the current evidence base for guiding an individual's treatment is insufficient in several ways. First, clinical practice guidelines overemphasize simplicity so that healthcare providers can easily implement them without computerized decision support. Thus, it is rare to see decision criteria combining many different types of data about the individual (such as molecular, genetic, and clinical) to make a therapeutic recommendation. Second, most of these guidelines are derived from randomized controlled trials for single disease treatments, which can exclude patients with significant complications; the resulting evidence base is not tailored to the granular characteristics of each individual, but rather to the "average" patient in the recruited cohort. Consequently, the knowledge needed to provide appropriate therapy for patients with complex cases, who also consume the lion's share of healthcare spending, is largely lacking [4].

These challenges motivated the idea of disease subtyping as a central tenet of precision medicine [1]. Broadly construed, disease subtyping is the task of identifying subpopulations of similar patients that can guide treatment decisions for a given individual. (When the subtypes have been established to be causally associated with the underlying mechanism, these are also called endotypes [5].) Boland and colleagues describe the concept of a "verotype" (the Latin word vero means "true") to represent the true population of similar patients for treatment purposes [6]. What constitutes these verotypes and how they should be discovered remains an open question. An active and growing body of work has explored different approaches for identifying homogeneous patient subgroups, ranging from qualitative approaches based on clinical observations alone to quantitative models that integrate measurements from diverse high-throughput biotechnologies. Cancer, autism, autoimmune diseases, cardiovascular diseases, and Parkinson's are examples of diseases that have been studied through the lens of subtyping [3,7,8].

The discovery and refinement of disease subtypes can benefit both the practice and science of medicine. Clinically, by refining prognoses based on similar individuals, disease subtypes help reduce uncertainty in an individual's expected outcome. Accurate prognoses can thereby improve treatment decisions. For example, administration of a therapy with strong side effects could be well justified for an individual predicted to decline rapidly without this treatment. Beyond prognoses, subtypes can also inform forecasts about the expected costs of care. In complex diseases, where there is tremendous heterogeneity in disease presentation, subtyping can help improve the effectiveness of clinical trials by enabling targeted recruitment. Scientifically, subtypes can drive the design of new genome-wide association studies [9,10]. For example, by finding subgroups whose clinical manifestations differ, researchers can conduct targeted studies to identify the molecular determinants of these differences. Such analyses can allow clinical scientists to understand the causes of related diseases.

In this article, we provide an overview of the diverse approaches to subtyping, from early accounts based on clinical practice to more recent approaches that focus on computationally derived subtypes based on molecular and electronic health record (EHR) data. This field is expansive and growing rapidly; thus, a comprehensive review is not our focus here. Instead, we juxtapose approaches taken by different communities and emphasize the significant open computational problems that remain.

Disease Subtyping: Overview

Traditionally, disease subtyping research has been conducted as a by-product of clinical experience, wherein a clinician noticed the presence of patterns or groups of outlier patients and performed a more thorough (retrospective or prospective) study to confirm their existence. An early case of such an analysis is the work of James Ewing, a pathologist, who published his observation of a clearly distinct subset of osteogenic sarcoma (OS), a type of bone tumor, nearly a century ago. He observed that a substantial number of his patients with OS experienced spontaneous fractures and swellings [11]. All these patients had very characteristic radiographic features that were substantially different from his typical patients with OS. Subsequent microscopic analysis of the tumor tissue revealed that these tumors were indeed of a different (endothelial) origin, were rather common among younger subjects, and were clearly distinct from the mainstream OS. This subtype later came to be known as Ewing's sarcoma.

Early examples of subtyping were limited by the power of individual doctors to detect patterns among the patients they had observed. In the last decade, the advent of high-throughput biotechnologies has provided the means for measuring differences between individuals at the cellular and molecular levels. The cost of measuring various "-omic" data (such as genomic, proteomic, and metabolomic data) has dropped significantly, letting scientists collect such data on a large number of patients. The number of measured variables in these data ranges from tens of thousands (for example, expression levels of messenger RNA, or mRNA) to millions (genetic data in the form of single nucleotide variants); thus, research has shifted toward computationally driven approaches to identify subtypes.

Molecular Subtyping

One of the main goals driving the analyses of high-throughput molecular data is the unbiased biomedical discovery of disease subtypes via unsupervised clustering of either individual or multiple sources of molecular data. Using statistical and machine learning approaches such as nonnegative matrix factorization, hierarchical clustering, and probabilistic latent factor analysis [12,13], researchers have identified subgroups of individuals based on similar gene expression levels. More recent approaches have targeted data integration. Toward this, researchers have tried a broad range of techniques [14], spanning from ad hoc combinations of individual datastream analyses to latent variable models [15,16] to more recent network-based fusion approaches [17].

One of the biggest drawbacks of this line of work is that the resulting conclusions about disease subtypes have differed depending on the type of data used. Glioblastoma multiforme (GBM), a very aggressive form of brain cancer, is a good example of different analyses producing a range of conclusions. An earlier analysis of GBM identified two subtypes based on the loss of a chromosome [18]. An integrative analysis of GBM driven primarily by mRNA expression data identified four different subtypes, which were not strict subsets of those previously identified [7]. A recent DNA methylation-based approach [19] identified a subtype, characterized by a mutation in a particular gene (IDH1), with a significantly better survival prognosis. Although methylation data were available in the earlier analysis, its conclusions were different: the IDH1 subtype was not identified because the subtypes were largely based on clustering mRNA expression data [7]. From a technical standpoint, this is not surprising, because the recovered subtypes are a function of the data, the clustering approach, and the associated notion of similarity used. When the integrated data are high dimensional and heterogeneous, defining a coherent metric for clustering becomes increasingly challenging.

To ensure identification of clinically relevant subtypes, others have started to model distinct subgroups based on clinical hypotheses and perform follow-up analysis to identify molecular determinants of differences between these subgroups [20]. Naturally, as the molecular-level characterization of human diversity becomes continually more detailed (not only in terms of genetic information but also molecular measurements over time), the phenotypic manifestations of all these details are often insufficiently represented by available broad disease categories, which capture only the end points rather than the exhaustive trajectories of disease development [21]. The advent of EHRs significantly advances our ability to describe subtypes that are based on more precise and detailed descriptions of the disease.

Electronic Health Records and Phenotyping

The Health Information Technology for Economic and Clinical Health (HITECH) Act, which was part of the American Recovery and Reinvestment Act of 2009, incentivized the adoption of EHRs. As a result, today much of an individual's health data (such as demographics, personal and family medical history, diagnosis codes, current and past treatments, history of allergic reactions, vaccination records, laboratory test results, and imaging results) is stored in an EHR. Academic medical centers such as Vanderbilt University have succeeded in linking EHR data to biobanked blood samples that were accrued during routine clinical care. From a research perspective, such integrated patient data, accumulated over repeat clinic visits, constitute a computable collection of fine-grained longitudinal clinical profiles.

However, data contained within EHRs present numerous computational challenges. For example, even the seemingly simple task of extracting the list of conditions that an individual has been diagnosed with is nontrivial. Although ICD-9 codes (which indicate the presence or absence of a condition) are routinely recorded for billing, they are incomplete and noisy. Thus, much research has focused on developing new algorithms for annotating patient records with this information; in the literature, this task of extracting various clinical attributes for each individual is referred to as phenotyping. Phenotyping algorithms implement probabilistic rules [22] that combine information from the structured data (for example, demographics, diagnoses, medications, and laboratory measurements) as well as unstructured clinical text (such as radiology reports, encounter notes, and discharge summaries) to annotate attributes such as the presence or absence of specific diagnoses or medical adverse events (see the bibliography of a recent JAMIA paper by Jyotishman Pathak and colleagues [23]). The Electronic Medical Records and Genomics (eMERGE) consortium, a network of nine academic medical centers, has demonstrated the successful use of these EHR-derived phenotypes for cohort identification to conduct genome- and phenome-wide association studies [9,24]. These efforts have replicated genetic risk factors for many diseases, including Alzheimer's, type 2 diabetes, and arrhythmia [24].

It is important to remember that phenotyping facilitates cleaner, but still broad, categories of disease. The power of EHRs lies in their detailed and longitudinal data, which makes it possible to obtain more refined subtypes. Thus, recent efforts have built on phenotyping work to move away from analysis of broad disease categories and instead provide more descriptive definitions of the disease over time.

Clinically Enriched Subtypes

Existing approaches have used different hypotheses to cluster individuals into subtypes using EHR data. In autism, for example, using hierarchical agglomerative clustering over individuals based on their set of comorbid conditions (secondary complications) as defined by ICD-9 codes, Finale Doshi-Velez and colleagues show different clinical subtypes: patients in some of the subtypes are more likely to experience seizures, whereas others are more likely to experience auditory complications [25]. Others have clustered vectors summarizing the procedure and diagnosis counts for each individual using tensor factorization to identify clusters of diagnosis codes and medications that co-occur [26]. Joyce Ho and colleagues show that this approach can automatically discover mild and severe forms of a disease based on a combination of commonly observed comorbidity and treatment patterns [26]. These approaches leverage treatment and diagnosis codes. The main advantage of using such data is that they are readily available in administrative billing records, which have standardized formats. However, subgroups based on ICD-9 codes and treatment data are highly sensitive to practice patterns.

Others have thus sought to leverage the rich array of clinical and laboratory data (such as glucose levels, blood counts, and functional test results) for measuring disease activity. For example, David Chen and colleagues show that by clustering the clinarray (a vector containing summary statistics such as the mean value of a marker over time) in diseases like Crohn's and cystic fibrosis, they can distinguish mild from severe forms of disease [27]. Alternatively, in disease progression modeling [28], generative models of the disease progression, which characterize disease as a continuum or as a sequence of discrete stages and observed clinical data as a function of the stage, are learned from data. Past works have developed dynamic models that characterize progression as a function of comorbidities or individual markers. These are typically developed with the motivation of tracking and predicting the progression (or severity) of an individual's disease over time. However, these models can also help identify patients with similar disease-progression patterns and thus enable discovery of molecular variations that might be driving individuals to express different forms of the disease.

[Figure 1 plots not reproduced. Panel (a): lung severity (PFVC) versus years since first symptom, one plot per subtype. Panel (b): autoantibody prevalence (ACA, Scl-70, RNA Poly., other); lung severity (PFVC) and skin severity (TSS) versus years since first symptom, with marker severity labeled healthy, moderate, or critical; and onset cumulative distributions of clinical events (death, heart complications, hypertension, gastrointestinal complications).]

For example, Peter Schulam and colleagues devised an approach that uses continuous laboratory and clinical markers measured over time to identify subtypes with similar disease trajectories [3]. Figure 1a shows four subtypes based on their lung progression patterns, ranging from healthy and stable lung function (top left) to poor and consistent decline (bottom right). In their work, rather than clustering the raw measurements themselves, Schulam and colleagues incorporate knowledge of the measurement process and known disease process to account for sources of variation that affect the measured markers but are unrelated to the underlying subtype. For example, two individuals might show similar progression, but a chronic smoker will tend to have slightly worse lung function (modeled using random effects by a global shift). Thus, Schulam and colleagues removed these nuisance sources of variability when inferring subtypes [3].

Figure 1b shows three different subtypes that Schulam and colleagues uncovered using joint analysis of their lung and skin markers (the two key complications of scleroderma). These recovered subtypes are associated with different autoantibody markers (autoantibodies are a type of protein produced by the immune system and known to be a cause of autoimmune diseases; see the top of Figure 1b) and comorbidity patterns (see the bottom of Figure 1b).

Figure 1. Subtyping driven by disease activity trajectories. (a) Four subtypes of scleroderma based on lung disease activity trajectories tracked over 15 years. The top left subtype shows stable lung function, whereas the bottom right shows active decline. These are inferred using the probabilistic subtyping model by Peter Schulam and colleagues [3] based on continuous, sparse, and irregularly sampled measurements (shown as black dots) of the forced vital capacity, a marker of lung disease. These measurements were taken during clinical visits as part of routine care. (b) Joint analysis of the lung and skin trajectories yields subpopulations that show distinct autoantibody profiles and comorbidity patterns (three distinct subtypes are shown). The top panel shows autoantibody prevalence (using size).

Next Steps: Integrative Analysis

In the next decade, affordable access to molecular data collection linked with rich clinical data has the potential to significantly advance our understanding of how diseases are defined and treated. In the near term, subtype definitions that enable accurate prognoses of disease trajectories can improve treatment planning. In the longer term, as molecular mechanisms governing different disease subtypes are discovered, novel biomarkers that are predictive of the disease course, and novel treatments inspired by the mechanistic pathways, will become possible.

To achieve these goals, we will need careful integration of the diverse data surrounding an individual's health: molecular, clinical, and environmental data. To summarize, the goals of integrative analysis are twofold. The first goal is to identify naturally occurring subpopulations whose presentation in the clinic differs so that one can tailor treatments to these subgroups. For example, by being able to detect individuals at high risk for a given complication early, one can tailor the use of more aggressive therapies, when available. The second goal is to identify causal pathways associated with the phenotypically differentiated subpopulations. Causal pathways facilitate development of treatment programs appropriate to each subtype.

Although we have made tremendous progress toward leveraging the diverse molecular and clinical datasets, much remains to be done. For example, most of the existing methods for joint analysis of more than one data type [16,17] tend to treat the different measurement types as independent sources of information. However, in practice, we know that many of these measurements are interdependent. For example, DNA methylation affects levels of mRNA expression, and miRNA regulates gene expression post-transcriptionally. Thus, inferences made using an assumption of independence are likely to be biased. Moreover, exploiting the relationship between these measurements can help reduce the effective dimensionality of the problem. This is especially useful when integration is being done in a supervised setting (for example, predicting markers of drug response) and the number of samples corresponding to the individual subpopulations is small. In works that have tackled joint analysis and model dependencies, typically only a small range of data types is modeled within any given study (see references within a paper by Marylyn Ritchie and colleagues [20]). As consortia-based efforts are gearing up to collect whole-systems level data for many patients, the bottleneck may no longer be the availability of data, but rather the availability of analytic approaches that can integrate data at multiple resolutions, from the cell to the organ level.

Another challenge for integrative analysis arises from the heterogeneity of the different data types: some markers are continuous while others are categorical; some are measured only once (for example, gender or DNA sequence) while others are measured repeatedly (such as blood cell counts and functional lung tests). Different sources of measurement noise and bias can also affect the measurements made. For example, functional measurements (such as those shown in Figure 1a) can vary depending on the individual making the measurement, the altitude at which the measurement was made, and whether the patient is experiencing temporary inflammation. Similarly, the mRNA expression levels can vary depending on the number and composition of cell types in the measured sample. Finally, the healthcare process itself governs which measurements are taken and recorded, and when [29]. These are nuisance sources of variability that are unrelated to subtype and should be accounted for in inferring meaningful subtypes [3]. Multiresolution models that integrate diverse markers, incorporating knowledge of the measurement process and the biology, are likely to be most fruitful.

On a pragmatic level, researchers who wish to analyze large amounts of patient data still face the technical challenges of integrating scattered, heterogeneous data, in addition to ethical and legal obstacles that limit access to the data. Infrastructure investments at the national, regional, and institutional levels are needed to make integrated data sources readily available for research.

As molecular data linked with rich clinical data are becoming easily accessible, new integrative methods for subtyping have the potential to significantly advance our understanding of how diseases are defined and treated. We call on the computational community to participate in this exciting computational task that can ultimately improve the quality of healthcare for us all.

Acknowledgments

Saria thanks Google Research, the Gordon and Betty Moore Foundation, and the National Science Foundation for their support and Peter Schulam for creating Figure 1. Goldenberg thanks the SickKids Foundation for support. Both authors thank Daniel Neill for valuable feedback on the article.

References

1. L. Hood and S.H. Friend, "Predictive, Personalized, Preventive, Participatory (P4) Cancer Medicine," Nature Rev. Clinical Oncology, vol. 8, no. 3, 2011, pp. 184–187.
2. R. Mirnezami, J. Nicholson, and A. Darzi, "Preparing for Precision Medicine," New England J. Medicine, vol. 366, no. 6, 2012, pp. 489–491.
3. P. Schulam, F. Wigley, and S. Saria, "Clustering Longitudinal Clinical Marker Trajectories from Electronic Health Data: Applications to Phenotyping and Endotype Discovery," Am. Assoc. for Artificial Intelligence, 2015; http://pschulam.com/papers/schulam+wigley+saria_aaai_2015.pdf.

4. S. Saria, "A $3 Trillion Challenge to Computational Scientists: Transforming Healthcare Delivery," IEEE Intelligent Systems, vol. 29, no. 4, 2014, pp. 82–87.
5. J. Lotvall et al., "Asthma Endotypes: A New Approach to Classification of Disease Entities within the Asthma Syndrome," J. Allergy and Clinical Immunology, vol. 127, no. 2, 2011, pp. 355–360.
6. M.R. Boland et al., "Defining a Comprehensive Verotype using Electronic Health Records for Personalized Medicine," J. Am. Medical Informatics Assoc., vol. 20, 2013, pp. e232–e238.
7. R.G. Verhaak et al., "Integrated Genomic Analysis Identifies Clinically Relevant Subtypes of Glioblastoma Characterized by Abnormalities in PDGFRA, IDH1, EGFR, and NF1," Cancer Cell, vol. 17, no. 1, 2010, pp. 98–110.
8. S.J. Lewis et al., "Heterogeneity of Parkinson's Disease in the Early Clinical Stages using a Data Driven Approach," J. Neurology, Neurosurgery, and Psychiatry, vol. 76, no. 3, 2005, pp. 343–348.
9. A.N. Kho et al., "Electronic Medical Records for Genetic Research: Results of the Emerge Consortium," Science Translational Medicine, vol. 3, no. 79, 2011, doi:10.1126/scitranslmed.3001807.
10. I.S. Kohane, "Using Electronic Health Records to Drive Discovery in Disease Genomics," Nature Rev. Genetics, vol. 12, no. 6, 2011, pp. 417–428.
11. J. Ewing, "Diffuse Endothelioma of Bone," Proc. New York Pathological Soc., vol. 21, 1921, pp. 17–24.
12. J.P. Brunet et al., "Metagenes and Molecular Pattern Discovery using Matrix Factorization," Proc. Nat'l Academy Sciences, vol. 101, no. 12, 2004, pp. 4164–4169.
13. C.M. Perou et al., "Molecular Portraits of Human Breast Tumours," Nature, vol. 406, no. 6797, 2000, pp. 747–752.
14. V.N. Kristensen et al., "Principles and Methods of Integrative Genomic Analyses in Cancer," Nature Rev. Cancer, vol. 14, no. 5, 2014, pp. 299–313.
15. R. Shen, A.B. Olshen, and M. Ladanyi, "Integrative Clustering of Multiple Genomic Data Types using a Joint Latent Variable Model with Application to Breast and Lung Cancer Subtype Analysis," Bioinformatics, vol. 25, no. 22, 2009, pp. 2906–2912.
16. P. Kirk et al., "Bayesian Correlated Clustering to Integrate Multiple Datasets," Bioinformatics, vol. 28, no. 24, 2012, pp. 3290–3297.
17. B. Wang et al., "Similarity Network Fusion for Aggregating Data Types on a Genomic Scale," Nature Methods, vol. 11, no. 3, 2014, pp. 333–337.
18. J.M. Nigro et al., "Integrated Array-Comparative Genomic Hybridization and Expression Array Profiles Identify Clinically Relevant Molecular Subtypes of Glioblastoma," Cancer Research, vol. 65, no. 5, 2005, pp. 1678–1686.
19. D. Sturm et al., "Hotspot Mutations in H3F3A and IDH1 Define Distinct Epigenetic and Biological Subgroups of Glioblastoma," Cancer Cell, vol. 22, no. 4, 2012, pp. 425–437.
20. M.D. Ritchie et al., "Methods of Integrating Data to Uncover Genotype-Phenotype Interactions," Nature Rev. Genetics, vol. 16, no. 2, 2015, pp. 85–97.
21. P.B. Jensen et al., "Mining Electronic Health Records: Towards Better Research Applications and Clinical Care," Nature Rev. Genetics, vol. 13, no. 6, 2012, pp. 395–405.
22. S. Saria et al., "Combining Structured and Free-Text Data for Automatic Coding of Patient Outcomes," Proc. AMIA Ann. Symp., 2010, p. 712.
23. J. Pathak, A.N. Kho, and J.C. Denny, "Electronic Health Records-Driven Phenotyping: Challenges, Recent Advances, and Perspectives," J. Am. Medical Informatics Assoc., vol. 20, no. e2, 2013, pp. e206–e211.
24. J.C. Denny et al., "Systematic Comparison of Phenome-Wide Association Study of Electronic Medical Record Data and Genome-Wide Association Study Data," Nature Biotechnology, vol. 31, no. 12, 2013, pp. 1102–1111.
25. F. Doshi-Velez, Y. Ge, and I. Kohane, "Comorbidity Clusters in Autism Spectrum Disorders: An Electronic Health Record Time-Series Analysis," Pediatrics, vol. 133, no. 1, 2014, pp. e54–63.
26. J.C. Ho, J. Ghosh, and J. Sun, "Marble: High-Throughput Phenotyping from Electronic Health Records via Sparse Nonnegative Tensor Factorization," Proc. 20th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2014, pp. 115–124.
27. D.P. Chen et al., "Clinical Arrays of Laboratory Measures, or 'Clinarrays,' Built from an Electronic Health Record Enable Disease Subtyping by Severity," Proc. AMIA Ann. Symp., 2007, pp. 115–119.
28. D. Mould, "Models for Disease Progression: New Approaches and Uses," Clinical Pharmacology & Therapeutics, vol. 92, no. 1, 2012, pp. 125–131.
29. G. Hripcsak and D.J. Albers, "Next-Generation Phenotyping of Electronic Health Records," J. Am. Medical Informatics Assoc., vol. 20, no. 1, 2012, pp. 117–121.

Suchi Saria is an assistant professor in the Departments of Computer Science and Health Policy and Management at Johns Hopkins University. Contact her at ssaria@cs.jhu.edu.

Anna Goldenberg is a scientist in genetics and genome biology at the SickKids Research Institute and an assistant professor in the Department of Computer Science at the University of Toronto. Contact her at anna.[email protected].