Clustering Longitudinal Clinical Marker Trajectories from Electronic Health Data: Applications to Phenotyping and Endotype Discovery

Peter Schulam
Department of Computer Science, Johns Hopkins University
3400 N. Charles St., Baltimore, MD 21218

Fredrick Wigley
Division of Rheumatology, Johns Hopkins School of Medicine
733 N. Broadway, Baltimore, MD 21205

Suchi Saria
Department of Computer Science and Department of Health Policy & Mgmt., Johns Hopkins University
3400 N. Charles St., Baltimore, MD 21218

Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Diseases such as autism, cardiovascular disease, and the autoimmune disorders are difficult to treat because of the remarkable degree of variation among affected individuals. Subtyping research seeks to refine the definition of such complex, multi-organ diseases by identifying homogeneous patient subgroups. In this paper, we propose the Probabilistic Subtyping Model (PSM) to identify subgroups based on clustering individual clinical severity markers. This task is challenging due to the presence of nuisance variability (variations in measurements that are not due to disease subtype) which, if not accounted for, generate biased estimates for the group-level trajectories. Measurement sparsity and irregular sampling patterns pose additional challenges in clustering such data. PSM uses a hierarchical model to account for these different sources of variability. Our experiments demonstrate that by accounting for nuisance variability, PSM is able to more accurately model the marker data. We also discuss novel subtypes discovered using PSM and the resulting clinical hypotheses that are now the subject of follow up clinical experiments.

Introduction and Background

Disease subtyping is the process of developing criteria for stratifying a population of individuals with a shared disease into subgroups that exhibit similar traits; a task that is analogous to clustering in machine learning. Under the assumption that individuals with similar traits share an underlying disease mechanism, disease subtyping can help to propose candidate subgroups of individuals that should be investigated for biological differences. Uncovering such differences can shed light on the mechanisms specific to each group. Observable traits useful for identifying subpopulations of similar patients are called phenotypes. When such traits have been linked to a distinct underlying pathobiological mechanism, these are then referred to as endotypes (Anderson 2008).

Traditionally, disease subtyping research has been conducted as a by-product of clinical experience. A clinician may notice the presence of subgroups, and may perform a more thorough retrospective or prospective study to confirm their existence (e.g. Barr et al. 1999). Recently, however, literature in the medical community has noted the need for more objective methods for discovering subtypes (De Keulenaer and Brutsaert 2009). Growing repositories of health data stored in electronic health record (EHR) databases and patient registries (Blumenthal 2009; Shea and Hripcsak 2010) present an exciting opportunity to identify disease subtypes in an objective, data-driven manner using tools from machine learning that can help to tackle the problem of combing through these massive databases. In this work, we propose such a tool, the Probabilistic Subtyping Model (PSM), that is designed to discover subtypes of complex, systemic diseases using longitudinal clinical markers collected in EHR databases and patient registries.

Discovering and refining disease subtypes can benefit both the practice and the science of medicine. Clinically, disease subtypes can help to reduce uncertainty in expected outcome of an individual's case, thereby improving treatment. Subtypes can inform therapies and aid in making prognoses and forecasts about expected costs of care (Chang, Clark, and Weiner 2011). Scientifically, disease subtypes can help to improve the effectiveness of clinical trials (Gundlapalli et al. 2008), drive the design of new genome-wide association studies (Kho et al. 2011; Kohane 2011), and allow medical scientists to view related diseases through a more fine-grained lens that can lead to insights that connect their causes and developmental pathways (Hoshida et al. 2007). Disease subtyping is considered to be especially useful for complex, systemic diseases where mechanism is often poorly understood. Examples of disease subtyping research include work in autism (State and Sestan 2012), cardiovascular disease (De Keulenaer and Brutsaert 2009), and Parkinson's disease (Lewis et al. 2005).

Complex, systemic diseases are characterized using the level of disease activity present in an array of organ systems. Clinicians typically measure the influence of a disease on an organ using clinical tests that quantify the extent to which that organ's function has been affected by the disease. The results of these tests, which we refer to as illness severity markers (s-markers for short), are being routinely collected over the course of care for large numbers of patients within EHR databases and patient registries. For a single individual, the time series formed by the sequence of these s-markers can be interpreted as a disease activity trajectory. Operating under the hypothesis that individuals with similar disease activity trajectories are more likely to share mechanism, our goal in this work is to cluster individuals according to their longitudinal clinical marker data and learn the associated prototypical disease activity trajectories (i.e. a continuous-time curve characterizing the expected s-marker values over time) for each subtype.

The s-marker trajectories recorded in EHR databases are influenced by factors such as age and co-existing conditions that are unrelated to the underlying disease mechanism (Lötvall et al. 2011). We call the effects of these additional factors nuisance variability. In order to correctly cluster individuals and uncover disease subtypes that are likely candidates for endotyping (Lötvall et al. 2011), it is important to model and explain away nuisance variability. In this work, we account for nuisance variability in the following ways. First, we use a population-level regression on to observed covariates (such as demographic characteristics or co-existing conditions) to account for variability in s-marker values across individuals. For example, lung function as measured by the forced expiratory volume (FEV) test is well-known to be worse in smokers than in non-smokers (Camilli et al. 1987). Second, we use individual-specific parameters to account for variability across individuals that is not predicted using the observed covariates. This form of variability may last throughout the course of an individual's disease (e.g. the individual may have an unusually weak respiratory system) or may be episodic (e.g. periods during which an individual is recovering from a cold). Finally, the subtypes' prototypical disease activity trajectories must be inferred from measurement sequences that vary widely between individuals in the time-stamp of the first measurement, the duration between consecutive measurements, and the time-stamp of the last measurement. After accounting for nuisance variability, our goal is to cluster the time series formed by the residual activity. We hypothesize that differences across such clusters are more likely candidates for

(Saria, Koller, and Penn 2010). Lasko et al. use deep learning to induce an over-complete dictionary in order to define traits observed in shorter segments of clinical markers (Lasko, Denny, and Levy 2013).

Beyond s-markers, others have used ICD-9 codes (codes indicating the presence or absence of a condition) to study comorbidity patterns over time among patients with a shared disease (e.g. Doshi-Velez, Ge, and Kohane 2014). ICD-9 codes are further removed from the biological processes measured by quantitative tests. Moreover, the notion of disease severity is more difficult to infer from codes.

Latent class mixed models (LCMMs) are a family of methods designed to discover subgroup structure in longitudinal datasets using fixed and random effects (e.g., Muthén and Shedden 1999, McCulloch et al. 2002, and Nagin and Odgers 2010). Random effects are typically used in linear models where an individual's coefficients may be probabilistically perturbed from the group's, which alters the model's fit to the individual over the entire observation period. Modeling s-marker data for chronic diseases where data are collected over tens of years requires accounting for additional influences such as those due to transient disease activity. The task of modeling variability between related time series has been explored in other contexts (e.g. Listgarten et al. 2006 and Fox et al. 2011). These typically assume regularly sampled time series and model properties that are different from those in our application.

Work in the machine learning literature has also looked at relaxing the assumption of regularly sampled data. Marlin et al. cluster irregular clinical time series from in-hospital patients to improve mortality prediction using Gaussian process priors that allow unobserved measurements to be marginalized (Marlin et al. 2012). Lasko et al. also address irregular measurements by using MAP estimates of Gaussian processes to impute sparse time series