Subtyping: What It Is and Its Role in Precision Medicine
AI AND HEALTH
Editor: Daniel B. Neill, H.J. Heinz III College, Carnegie Mellon University

Suchi Saria, Johns Hopkins University
Anna Goldenberg, SickKids Research Institute

IEEE Intelligent Systems, July/August 2015. Published by the IEEE Computer Society. 1541-1672/15/$31.00 © 2015 IEEE. www.computer.org/intelligent

Many diseases (for example, neuropsychiatric, cardiovascular, and autoimmune disorders) are difficult to treat because of the remarkable degree of variation among affected individuals. Precision medicine, also known as personalized medicine or P4 medicine,1 is an emerging approach for individualizing the practice of medicine.2 It takes into account individual variability in genes, environment, and lifestyle with the goals of better defining health or wellness for each person, predicting disease progression and transitions between disease stages, and targeting the most appropriate medical interventions.

In the autoimmune disease scleroderma, for example, as many as six different organ systems may be involved. Organ involvement trajectories can vary greatly across individuals from no involvement to rapid and aggressive decline.3 This uncertainty associated with an individual’s disease progression makes treatment planning challenging.

Furthermore, the current evidence base for guiding an individual’s treatment is insufficient in several ways. First, clinical practice guidelines overemphasize simplicity so that healthcare providers can easily implement them without computerized decision support. Thus, it is rare to see decision criteria combining many different types of data about the individual (such as molecular, genetic, and clinical) to make a therapeutic recommendation. Second, most of these guidelines are derived from randomized controlled trials for single disease treatments, which can exclude patients with significant complications; the evidence base derived is not tailored to the granular characteristics of each individual, but rather the “average” patient in the recruited cohort. Consequently, the knowledge needed to provide appropriate therapy for patients with complex cases, who also consume the lion’s share of healthcare spending, is largely lacking.4

These challenges motivated the idea of disease subtyping as a central tenet of precision medicine.1 Broadly construed, disease subtyping is the task of identifying subpopulations of similar patients that can guide treatment decisions for a given individual. (When the subtypes have been established to be causally associated with the underlying mechanism, these are also called endotypes.5) Boland and colleagues describe the concept of a “verotype” (the Latin word vero means “true”) to represent the true population of similar patients for treatment purposes.6 What constitutes these verotypes and how they should be discovered remains an open question.

An active and growing body of work has explored different approaches for identifying homogeneous patient subgroups ranging from qualitative (based on clinical observations alone) to quantitative models that integrate measurements from diverse high-throughput biotechnologies. Cancer, autism, autoimmune diseases, cardiovascular diseases, and Parkinson’s are examples of diseases that have been studied through the lens of subtyping.3,7,8

The discovery and refinement of disease subtypes can benefit both the practice and science of medicine. Clinically, by refining prognoses based on similar individuals, disease subtypes help reduce uncertainty in an individual’s expected outcome. Accurate prognoses can thereby improve treatment decisions. For example, administration of a therapy with strong side effects could be well justified on an individual prognosticated to decline rapidly without this treatment. Beyond prognoses, subtypes can also inform forecasts about the expected costs of care. In complex diseases, where there is tremendous heterogeneity in disease presentation, subtyping can help improve the effectiveness of clinical trials by enabling targeted recruitment. Scientifically, subtypes can drive the design of new genome-wide association studies.9,10 For example, by finding subgroups whose clinical manifestations differ, researchers can conduct targeted studies to identify the molecular determinants of these differences. Such analyses can allow clinical scientists to understand the causes of related diseases.

In this article, we provide an overview of the diverse approaches to subtyping, from early accounts based on clinical practice to more recent approaches that focus on computationally derived subtypes based on molecular and electronic health record (EHR) data. This field is expansive and growing rapidly; thus, a comprehensive review is not our focus here. Instead, we juxtapose approaches taken by different communities and emphasize the significant open computational problems that remain.

Disease Subtyping: Overview

Traditionally, disease subtyping research has been conducted as a by-product of clinical experience, wherein a clinician noticed the presence of patterns or groups of outlier patients and performed a more thorough (retrospective or prospective) study to confirm their existence. An early case of such an analysis is the work of James Ewing, a pathologist, who published his observation of a clearly distinct subset of osteogenic sarcoma (OS), a type of bone tumor, nearly a century ago. He observed that a substantial number of his patients with OS experienced spontaneous fractures and swellings.11 All these patients had very characteristic radiographic features that were substantially different from his typical patients with OS. Subsequent microscopic analysis of the tumor tissue revealed that these tumors were indeed of a different (endothelial) origin, were rather common among younger subjects, and were clearly distinct from the mainstream OS. This subtype later came to be known as Ewing’s sarcoma.

Early examples of subtyping were limited by the power of individual doctors to detect patterns among the patients they had observed. In the last decade, the advent of high-throughput biotechnologies has provided the means for measuring differences between individuals at the cellular and molecular levels. The cost of measuring various “-omic” data (such as genomic, proteomic, and metabolomics data) has dropped significantly, letting scientists collect such data on a large number of patients. The number of measured variables in these data ranges from tens of thousands (for example, expression levels of messenger RNA, or mRNA) to millions (genetic data in the form of single nucleotide variants); thus, research has shifted toward computationally driven approaches to identify subtypes.

Molecular Subtyping

One of the main goals driving the analyses of high-throughput molecular data is the unbiased biomedical discovery of disease subtypes via unsupervised clustering of either individual or multiple sources of molecular data. Using statistical and machine learning approaches such as non-negative matrix factorization, hierarchical clustering, and probabilistic latent factor analysis,12,13 researchers have identified subgroups of individuals based on similar gene expression levels. More recent approaches have targeted data integration. Toward this, researchers have tried a broad range of techniques,14 spanning from ad hoc combinations of individual datastream analyses to latent variable models15,16 to more recent network-based fusion approaches.17

One of the biggest drawbacks of this line of work is that depending on the type of data used, the resulting conclusions about disease subtypes differed. Glioblastoma multiforme (GBM), a very aggressive form of brain cancer, is a good example of different analyses producing a range of conclusions. An earlier analysis of GBM identified two subtypes based on the loss of a chromosome.18 An integrative analysis of GBM driven primarily by mRNA expression data identified four different subtypes, which were not strict subsets of those previously identified.7 A recent DNA methylation-based approach19 identified a subtype, characterized by a mutation in a particular gene (IDH1), with a significantly better survival prognosis. Although methylation data was available in the earlier analysis, their conclusions were different: the IDH1-subtype was not identified because the subtypes were largely based on clustering mRNA expression data.7 From a technical standpoint, this is not surprising, because the recovered subtypes are a function of the data, the clustering approach, and the associated notion of similarity used. When the integrated data are high dimensional and heterogeneous, defining a coherent metric for clustering becomes increasingly challenging.

To ensure identification of clinically relevant subtypes, others have started to model distinct subgroups based on clinical hypotheses and perform follow-up analysis to identify molecular determinants of differences between these subgroups.20 Naturally, as the molecular-level characterization of human diversity becomes continually more detailed, not only in terms of genetic information but also molecular measurements over time, the phenotypic manifestations of all these details are often insufficiently represented by is referred to as phenotyping. Pheno- ence seizures, whereas others are more