Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence

Clustering Longitudinal Clinical Marker Trajectories from Electronic Health Data: Applications to Phenotyping and Endotype Discovery

Peter Schulam
Department of Computer Science
3400 N. Charles St.
Baltimore, MD 21218
[email protected]

Fredrick Wigley
Division of Rheumatology
Johns Hopkins School of Medicine
733 N. Broadway
Baltimore, MD 21205
[email protected]

Suchi Saria
Department of Computer Science
Department of Health Policy & Mgmt.
Johns Hopkins University
3400 N. Charles St.
Baltimore, MD 21218
[email protected]

Abstract

Diseases such as autism, cardiovascular disease, and the autoimmune disorders are difficult to treat because of the remarkable degree of variation among affected individuals. Subtyping research seeks to refine the definition of such complex, multi-organ diseases by identifying homogeneous patient subgroups. In this paper, we propose the Probabilistic Subtyping Model (PSM) to identify subgroups based on clustering individual clinical severity markers. This task is challenging due to the presence of nuisance variability (variations in measurements that are not due to disease subtype), which, if not accounted for, generates biased estimates for the group-level trajectories. Measurement sparsity and irregular sampling patterns pose additional challenges in clustering such data. PSM uses a hierarchical model to account for these different sources of variability. Our experiments demonstrate that by accounting for nuisance variability, PSM is able to more accurately model the marker data. We also discuss novel subtypes discovered using PSM and the resulting clinical hypotheses that are now the subject of follow-up clinical experiments.

Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction and Background

Disease subtyping is the process of developing criteria for stratifying a population of individuals with a shared disease into subgroups that exhibit similar traits; a task that is analogous to clustering in machine learning. Under the assumption that individuals with similar traits share an underlying disease mechanism, disease subtyping can help to propose candidate subgroups of individuals that should be investigated for biological differences. Uncovering such differences can shed light on the mechanisms specific to each group. Observable traits useful for identifying subpopulations of similar patients are called phenotypes. When such traits have been linked to a distinct underlying pathobiological mechanism, they are referred to as endotypes (Anderson 2008).

Traditionally, disease subtyping research has been conducted as a by-product of clinical experience. A clinician may notice the presence of subgroups, and may perform a more thorough retrospective or prospective study to confirm their existence (e.g. Barr et al. 1999). Recently, however, literature in the medical community has noted the need for more objective methods for discovering subtypes (De Keulenaer and Brutsaert 2009). Growing repositories of health data stored in electronic health record (EHR) databases and patient registries (Blumenthal 2009; Shea and Hripcsak 2010) present an exciting opportunity to identify disease subtypes in an objective, data-driven manner using tools from machine learning that can help to tackle the problem of combing through these massive databases. In this work, we propose such a tool, the Probabilistic Subtyping Model (PSM), that is designed to discover subtypes of complex, systemic diseases using longitudinal clinical markers collected in EHR databases and patient registries.

Discovering and refining disease subtypes can benefit both the practice and the science of medicine. Clinically, disease subtypes can help to reduce uncertainty in the expected outcome of an individual's case, thereby improving treatment. Subtypes can inform therapies and aid in making prognoses and forecasts about expected costs of care (Chang, Clark, and Weiner 2011). Scientifically, disease subtypes can help to improve the effectiveness of clinical trials (Gundlapalli et al. 2008), drive the design of new genome-wide association studies (Kho et al. 2011; Kohane 2011), and allow medical scientists to view related diseases through a more fine-grained lens that can lead to insights that connect their causes and developmental pathways (Hoshida et al. 2007). Disease subtyping is considered to be especially useful for complex, systemic diseases where mechanism is often poorly understood. Examples of disease subtyping research include work in autism (State and Sestan 2012), cardiovascular disease (De Keulenaer and Brutsaert 2009), and Parkinson's disease (Lewis et al. 2005).

Complex, systemic diseases are characterized using the level of disease activity present in an array of organ systems. Clinicians typically measure the influence of a disease on an organ using clinical tests that quantify the extent to which that organ's function has been affected by the disease. The results of these tests, which we refer to as illness severity markers (s-markers for short), are being routinely collected over the course of care for large numbers of patients within EHR databases and patient registries. For a single individual, the time series formed by the sequence of these s-markers can be interpreted as a disease activity trajectory. Operating under the hypothesis that individuals with similar disease activity trajectories are more likely to share mechanism, our goal in this work is to cluster individuals according to their longitudinal clinical marker data and learn the associated prototypical disease activity trajectories (i.e. a continuous-time curve characterizing the expected s-marker values over time) for each subtype.

The s-marker trajectories recorded in EHR databases are influenced by factors such as age and co-existing conditions that are unrelated to the underlying disease mechanism (Lötvall et al. 2011). We call the effects of these additional factors nuisance variability. In order to correctly cluster individuals and uncover disease subtypes that are likely candidates for endotyping (Lötvall et al. 2011), it is important to model and explain away nuisance variability. In this work, we account for nuisance variability in the following ways. First, we use a population-level regression on observed covariates, such as demographic characteristics or co-existing conditions, to account for variability in s-marker values across individuals. For example, lung function as measured by the forced expiratory volume (FEV) test is well-known to be worse in smokers than in non-smokers (Camilli et al. 1987). Second, we use individual-specific parameters to account for variability across individuals that is not predicted using the observed covariates. This form of variability may last throughout the course of an individual's disease (e.g. the individual may have an unusually weak respiratory system) or may be episodic (e.g. periods during which an individual is recovering from a cold). Finally, the subtypes' prototypical disease activity trajectories must be inferred from measurement sequences that vary widely between individuals in the time-stamp of the first measurement, the duration between consecutive measurements, and the time-stamp of the last measurement. After accounting for nuisance variability, our goal is to cluster the time series formed by the residual activity. We hypothesize that differences across such clusters are more likely candidates for endotype investigations.

A large body of work has focused on identifying subtypes using genetic data (e.g. Chen et al. 2011). Enabled by the increasing availability of EHRs, researchers have also recently started to leverage clinical markers to conduct subtype investigations. Chen et al. (2007) use the clinarray, a vector containing summary statistics of all available clinical markers for an individual, to discover distinct phenotypes among patients with similar diseases. The clinarray summarizes longitudinal clinical markers using a single statistic, which ignores the pattern of progression over time. More recently, Ho et al. have used tensor factorization as an alternative approach to summarizing high-dimensional vectors created from EHRs (Ho, Ghosh, and Sun 2014). Others have used cross-sectional data to piece together disease progressions (e.g. Ross and Dy 2013), but do not model longitudinal data. Another approach to phenotyping using time series data, often applied in the acute care setting, is to segment an individual's time series into windows in order to discover transient traits that are expressed over shorter durations: minutes, hours, or days. For example, Saria et al. propose a probabilistic framework for discovering traits from physiologic time series that have similar shape characteristics (Saria, Duchi, and Koller 2011) or dynamics characteristics (Saria, Koller, and Penn 2010). Lasko et al. use deep learning to induce an over-complete dictionary in order to define traits observed in shorter segments of clinical markers (Lasko, Denny, and Levy 2013).

Beyond s-markers, others have used ICD-9 codes (codes indicating the presence or absence of a condition) to study comorbidity patterns over time among patients with a shared disease (e.g. Doshi-Velez, Ge, and Kohane 2014). ICD-9 codes are further removed from the biological processes measured by quantitative tests. Moreover, the notion of disease severity is more difficult to infer from codes.

Latent class mixed models (LCMMs) are a family of methods designed to discover subgroup structure in longitudinal datasets using fixed and random effects (e.g., Muthén and Shedden 1999, McCulloch et al. 2002, and Nagin and Odgers 2010). Random effects are typically used in linear models where an individual's coefficients may be probabilistically perturbed from the group's, which alters the model's fit to the individual over the entire observation period. Modeling s-marker data for chronic diseases where data are collected over tens of years requires accounting for additional influences such as those due to transient disease activity. The task of modeling variability between related time series has been explored in other contexts (e.g. Listgarten et al. 2006 and Fox et al. 2011). These typically assume regularly sampled time series and model properties that are different from those in our application.

Work in the machine learning literature has also looked at relaxing the assumption of regularly sampled data. Marlin et al. cluster irregular clinical time series from in-hospital patients to improve mortality prediction using Gaussian process priors that allow unobserved measurements to be marginalized (Marlin et al. 2012). Lasko et al. also address irregular measurements by using MAP estimates of Gaussian processes to impute sparse time series (Lasko, Denny, and Levy 2013).

In this paper, we propose the Probabilistic Subtyping Model (PSM), a novel model for discovering disease subtypes and associated prototypical disease activity trajectories using observational data that is routinely collected in electronic health records (EHRs). In particular, our model is geared towards identifying homogeneous patient subgroups from s-marker data for chronic diseases, and is particularly useful for characterizing complex, systemic diseases that are often poorly understood. Towards this end, PSM uncovers subtypes and their prototypical disease activity trajectories, while explaining away variability unrelated to disease subtyping; namely (1) covariate-dependent nuisance variability, (2) individual-specific long-term nuisance variability, and (3) individual-specific short-term nuisance variability. In addition, PSM is able to infer these prototypical trajectories using clinical time series that vary with respect to their endpoints and sampling patterns, which is common in observational EHR data. To evaluate PSM, we use real and simulated data to demonstrate that, by accounting for these levels of nuisance variability, PSM is able to both accurately predict s-markers and accurately recover prototypical disease activity trajectories. Finally, we discuss novel subtypes discovered using PSM.

Probabilistic Subtyping Model

We define a generative model for a collection of $M$ individuals with associated s-marker sequences. For each individual $i$, the s-marker sequence has $N_i$ measurement times and values, which are denoted as $\vec{t}_i \in \mathbb{R}^{N_i}$ and $\vec{y}_i \in \mathbb{R}^{N_i}$ respectively. In addition to the measurement times and values, each individual is assumed to have a vector of $d$ covariates $\vec{x}_i \in \mathbb{R}^{d}$. The s-marker values $\vec{y}_{1:M}$ are random variables, and we assume that $\vec{t}_{1:M}$ and $\vec{x}_{1:M}$ are fixed and known.

The major conceptual pieces of the model are the subtype mixture model, covariate-dependent nuisance variability, individual-specific long-term nuisance variability, and individual-specific short-term nuisance variability. We describe each of these pieces in turn. The graphical model in Figure 1 shows the relevant hyperparameters, random variables, and dependencies.

[Figure 1: Graphical model for PSM.]

Subtype mixture model. Each individual is assumed to belong to one of $G$ latent groups (representing a disease subtype). The random variable $z_i \in \{1, \ldots, G\}$ encodes subtype membership, and is drawn from a multinomial distribution with probability vector $\pi_{1:G}$. The probabilities $\pi_{1:G}$ are modeled as a Dirichlet random variable with symmetric concentration parameter $\alpha$. Formally, we have:

$$ p(\pi_{1:G}) = \mathrm{Dir}(\pi_{1:G}; \alpha) \tag{1} $$

$$ p(z_i \mid \pi_{1:G}) = \mathrm{Mult}(z_i; \pi_{1:G}). \tag{2} $$

We model each of the subtype prototypical disease activity trajectories using B-splines. The $P$ B-spline basis function values at times $\vec{t}$ are arranged into columns of a feature matrix $\phi(\vec{t}) = [\phi_1(\vec{t}), \ldots, \phi_P(\vec{t})]$, and the prototypical disease activity trajectory for each subtype $g$ is parameterized by a coefficient vector $\vec{\beta}_g$. The coefficients $\vec{\beta}_g$ are modeled as vectors drawn from a multivariate normal distribution:

$$ p(\vec{\beta}_g) = \mathcal{N}(\vec{\beta}_g; \mu_\beta, \Sigma_\beta). \tag{3} $$

Covariate-dependent nuisance variability. When one or more covariates are available that are known to influence clinical test results, but are posited to be unrelated to the task of discovering underlying disease mechanisms, the influences can be accounted for using covariate-dependent effects. For example, smoking status can partially explain why an individual's forced expiratory volume declines more rapidly, or African American race may be associated with especially severe scleroderma-related skin fibrosis. By fitting a standard mixture of B-splines to s-marker data directly, the subtype random variable $z_i$ may incorrectly capture correlations among groups of individuals with similar covariates. By including covariate-dependent effects, these correlations are removed, thereby freeing the subtype mixture model to capture residual correlations that are more likely to be due to shared underlying disease mechanism. Covariate-dependent effects are modeled using a polynomial function $\gamma(\vec{t})\,\rho(\vec{x})$ with feature matrix $\gamma(\vec{t})$ and coefficients $\rho(\vec{x})$. We use first order polynomials (lines) with feature matrix for times $\vec{t}$ defined to be $\gamma(\vec{t}) = [1, \vec{t} - \bar{t}]$, where $\bar{t}$ is the midpoint of the period over which the subtype trajectories are being modeled. The coefficients are modeled as a linear function of the individual's covariates:

$$ \rho(\vec{x}) = \mathbf{B}\vec{x}, \tag{4} $$

where $\mathbf{B} \in \mathbb{R}^{2 \times d}$ is a loading matrix linearly linking the covariates to the intercept and slope values. Therefore, conditioned on the loading matrix $\mathbf{B}$, individuals with similar covariates will have similar coefficients $\rho(\vec{x})$. We model the rows of the loading matrix $\mathbf{B}_{c,\cdot} : c \in \{1, 2\}$ as multivariate normal random variables:

$$ p(\mathbf{B}_{c,\cdot}) = \mathcal{N}(\mathbf{B}_{c,\cdot}; \mu_B, \Sigma_B). \tag{5} $$

Individual-specific long-term nuisance variability. An individual may express additional variability over the entire observation period beyond what is explained away using covariates. For example, an individual may have an unusually weak respiratory system and so may have a lower baseline value (intercept) than other individuals with similar covariates. This form of variability is also modeled using first order polynomials with feature matrix $\gamma(\vec{t})$. The coefficient vector is a multivariate normal random variable:

$$ p(\vec{b}_i \mid \mathbf{B}, \vec{x}_i) = \mathcal{N}(\vec{b}_i; 0, \Sigma_b). \tag{6} $$

Individual-specific short-term nuisance variability. Finally, an individual may experience episodic disease activity that only affects a handful of measurements within a small time window. For example, there may be periods during which an individual is recovering from a cold, which temporarily weakens his or her respiratory system and causes FEV to become depressed over a short period of time. We model these transient deviations nonparametrically using a Gaussian process with mean 0 and a squared exponential covariance function with hyperparameters $a$ and $\ell$:

$$ p(f_i) = \mathcal{GP}(f_i; 0, k(\cdot, \cdot)) \tag{7} $$

$$ k(t_1, t_2) = a^2 \exp\left\{ -\frac{(t_1 - t_2)^2}{2\ell^2} \right\}. \tag{8} $$

When treatments exist that can alter long-term course, and the points at which they are administered vary widely across individuals, treatments become additional sources of nuisance variability. In scleroderma, our disease of interest, and many other systemic diseases, no drugs known to modify the long-term course of the disease exist, and, therefore, we do not tackle this issue. Moreover, treatments are challenging to fully account for, and subtyping studies that do so often focus on data from practices that follow a similar treatment protocol so that variability due to treatment is controlled.

Collapsed likelihood. Conditioned on these four components, an individual's s-marker sequence is modeled as a spherical multivariate normal with standard deviation $\sigma$:

$$ p(\vec{y}_i \mid z_i, \vec{b}_i, f_i, \vec{\beta}_{1:G}, \vec{t}_i) = \mathcal{N}\left(\vec{y}_i;\ \phi(\vec{t}_i)\vec{\beta}_{z_i} + \gamma(\vec{t}_i)(\mathbf{B}\vec{x}_i + \vec{b}_i) + f_i(\vec{t}_i),\ \sigma^2 \mathbf{I}\right). \tag{9} $$

Assuming independence between all individuals conditioned on the model parameters $\vec{\beta}_{1:G}$, $\pi_{1:G}$, and $\mathbf{B}$, the complete likelihood is:

$$ \prod_{i=1}^{M} p(z_i \mid \pi_{1:G})\, p(\vec{b}_i)\, p(f_i)\, p(\vec{y}_i \mid z_i, \vec{b}_i, f_i, \vec{\beta}_{1:G}, \vec{t}_i). $$

By marginalizing over the latent variables $z_i$, $\vec{b}_i$, and $f_i$, the likelihood for each individual can be written as a mixture of multivariate normals:

$$ p(\vec{y}_i \mid \vec{\beta}_{1:G}, \pi_{1:G}, \mathbf{B}, \vec{t}_i, \vec{x}_i) = \mathbb{E}_{z_i}\left[ \mathcal{N}\left(\vec{y}_i;\ \mu_i^{(z_i)}, \Sigma_i\right) \right], $$

where

$$ \mu_i^{(z_i)} = \phi(\vec{t}_i)\vec{\beta}_{z_i} + \gamma(\vec{t}_i)\mathbf{B}\vec{x}_i $$

$$ \Sigma_i = \sigma^2 \mathbf{I} + k(\vec{t}_i, \vec{t}_i) + \gamma(\vec{t}_i)\,\Sigma_b\,\gamma(\vec{t}_i)^\top. $$

The final joint distribution becomes:

$$ \underbrace{\prod_{i=1}^{M} p(\vec{y}_i \mid \vec{\beta}_{1:G}, \pi_{1:G}, \mathbf{B}, \vec{t}_i, \vec{x}_i)}_{\text{likelihood}}\ \underbrace{p(\pi_{1:G}) \prod_{g=1}^{G} p(\vec{\beta}_g)\, p(\mathbf{B})}_{\text{prior}}. \tag{10} $$

Learning

We want to learn the parameters $\Theta = \{\vec{\beta}_{1:G}, \pi_{1:G}, \mathbf{B}\}$ and the hyperparameters $\{a, \ell, \sigma\}$. Conditioned on choices for the hyperparameters, we will compute MAP estimates of $\Theta$ by optimizing the log of the joint distribution in Equation 10 with respect to $\Theta$. The likelihood is a product of sums, which makes direct optimization difficult. We will instead use the EM algorithm to maximize the sum of the log prior and the expected complete-data log likelihood. To choose $\{a, \ell, \sigma\}$, we use a grid search, often shown to work well in low dimensions (e.g. Bardenet, Kégl, and others 2010).

Expectation Step. Conditioned on estimates of $\vec{\beta}_{1:G}$, $\pi_{1:G}$, and $\mathbf{B}$ at time $\tau$, the posterior distribution over $z_i$ for each individual is:

$$ q_i(z_i) = p(z_i \mid \vec{y}_i, \vec{\beta}_{1:G}, \pi_{1:G}, \mathbf{B}, \vec{t}_i, \vec{x}_i) = \frac{1}{Z_i}\, p(z_i)\, \mathcal{N}\left(\vec{y}_i;\ \mu_i^{(z_i)}, \Sigma_i\right), \tag{11} $$

where $Z_i$ is the normalizing constant of the multinomial.

Maximization Step. Using the posterior distribution computed with parameters $\Theta^{\tau}$, the expected complete-data log likelihood for a single individual as a function of the parameters at step $\tau + 1$ is:

$$ \mathcal{L}_i(\Theta^{\tau+1} \mid \Theta^{\tau}) = \mathbb{E}_{q_i}\left[\log p(z_i \mid \pi^{\tau+1}_{1:G})\right] + \mathbb{E}_{q_i}\left[\log p(\vec{y}_i \mid z_i, \vec{\beta}^{\tau+1}_{1:G}, \mathbf{B}^{\tau+1}, \vec{t}_i, \vec{x}_i)\right]. \tag{12} $$

The expected complete-data log likelihood and log priors on the parameters then form the objective of the maximization step:

$$ Q(\Theta^{\tau+1} \mid \Theta^{\tau}) = \sum_{i=1}^{M} \mathcal{L}_i(\Theta^{\tau+1} \mid \Theta^{\tau}) + \log p(\pi^{\tau+1}_{1:G}) + \sum_{g=1}^{G} \log p(\vec{\beta}^{\tau+1}_g) + \log p(\mathbf{B}^{\tau+1}), \tag{13} $$

which we optimize with respect to $\Theta^{\tau+1}$ to obtain the next set of parameter estimates. We begin by maximizing with respect to $\pi_{1:G}$. Dropping terms that do not include $\pi_{1:G}$ from Equation 13, we find that the objective is maximized using the standard Dirichlet-multinomial posterior estimates. We use a block coordinate ascent algorithm to maximize the objective with respect to $\vec{\beta}_{1:G}$ and $\mathbf{B}$. We break the parameters into $G + 2$ blocks; one block for each of the subtype-specific coefficients $\vec{\beta}_g$ and one block for each row of $\mathbf{B}$. Each block appears in a quadratic term in the likelihood of each patient, and so, holding all other blocks fixed, the coordinate ascent step can be maximized exactly. We cycle through the block updates until the value of Equation 13 converges, and iterate over the EM updates until the joint shown in Equation 10 converges. The updates for each of the parameters $\pi_{1:G}$, $\vec{\beta}_{1:G}$, and $\mathbf{B}$ are listed in Section A1 of the supplementary material.¹

¹ www.cs.jhu.edu/~ssaria/psm supp.pdf

Scalability. PSM is designed to aid in the subtype discovery process when large electronic health databases are available for analysis. Thus, the scalability of the learning algorithm is a natural concern. The primary computational bottleneck of the PSM learning procedure is the E-step, which may be expensive due to (1) the number of individuals in the analysis, or (2) the inversion of the individual covariance matrices $\Sigma_i$ (e.g., in Equation 11). The computational complexity due to large $M$ can be offset by parallelizing the E-step because the individual-specific latent variables are conditionally independent given $\Theta$. Inversion of $\Sigma_i$ has computational complexity $O(N_i^3)$. Because we study disease activity over the course of 10-20 years and because visit rates typically do not exceed 12 per year, the number of measurements $N_i$ is typically on the order of 100-200 measurements. Inversion of $\Sigma_i$ is therefore inexpensive.

Inference

To obtain a posterior prediction of an individual's disease activity trajectory, a subtype is first chosen using the most likely value of $z_i$ under the distribution in Equation 11. Conditioned on subtype membership, we estimate $\vec{b}_i$ and $f_i$ by first conditioning on $f_i = 0$, and computing the MAP estimate of $\vec{b}_i$. From Figure 1, it is clear that after conditioning on $z_i$, $f_i$ and $\Theta$, the joint distribution over $\vec{b}_i$ and $\vec{y}_i$ is a linear Gaussian system. We can therefore use standard equations to obtain the MAP estimate of $\vec{b}_i$:

$$ \vec{b}_i = \Sigma_{b_i}\left[ \sigma^{-2}\, \gamma(\vec{t}_i)^\top \vec{y}_i^{(b_i \mid z_i)} + \Sigma_b^{-1} \mu_b \right], \text{ where} $$

$$ \Sigma_{b_i} = \left[ \sigma^{-2}\, \gamma(\vec{t}_i)^\top \gamma(\vec{t}_i) + \Sigma_b^{-1} \right]^{-1} $$

$$ \vec{y}_i^{(b_i \mid z_i)} = \vec{y}_i - \phi(\vec{t}_i)\vec{\beta}_{z_i} - \gamma(\vec{t}_i)\mathbf{B}\vec{x}_i. $$

We choose to temporarily condition on $f_i = 0$ in order to use the long-term variability to explain as much of the individual variation as possible. This encodes our preference for a parsimonious posterior prediction that uses the simpler long-term variability before relying on the more complex Gaussian process.

Finally, conditioned on $z_i$ and $\vec{b}_i$, the individual's short-term variation is defined at time $t^*$ to be

$$ f_i(t^*) = k(t^*, \vec{t}_i)\left[ k(\vec{t}_i, \vec{t}_i) + \sigma^2 \mathbf{I} \right]^{-1} \vec{y}_i^{(f_i \mid b_i, z_i)}, \text{ where} $$

$$ \vec{y}_i^{(f_i \mid b_i, z_i)} = \vec{y}_i - \phi(\vec{t}_i)\vec{\beta}_{z_i} - \gamma(\vec{t}_i)\mathbf{B}\vec{x}_i - \gamma(\vec{t}_i)\vec{b}_i. $$

This is simply the MAP prediction of a Gaussian process. A new measurement $y^*$ taken at time $t^*$ is then predicted to be:

$$ y^* = \underbrace{\phi(t^*)\vec{\beta}_{z_i}}_{\text{subtype}} + \underbrace{\gamma(t^*)\mathbf{B}\vec{x}_i}_{\text{covariate}} + \underbrace{\gamma(t^*)\vec{b}_i}_{\text{long-term}} + \underbrace{f_i(t^*)}_{\text{short-term}}. \tag{14} $$

It is also useful to characterize posterior uncertainty in prototypical trajectory estimates when drawing qualitative inferences from the model. Estimates of posterior uncertainty are described in Section A2 of the supplement.

Experiments

The purpose of PSM is to discover subtypes, useful for tasks such as developing tailored treatment plans and advancing understanding of underlying disease mechanisms. Exploratory clustering method evaluations are two-fold. From a quantitative perspective, we want to measure the model's fit to data. From a qualitative standpoint, we want to judge the insights that the model conveys. Consequently, we assess the merit of PSM using two experiments. First, we investigate PSM's ability to predict unobserved s-marker measurements. In the absence of ground-truth subtypes that we can use to compute a cluster-based metric, we instead measure the generalizability of PSM by evaluating posterior predictions of held-out s-marker observations.² Our second experiment involves qualitative analyses of the subtypes discovered by PSM. We evaluate the clinical merit of the prototypical disease activity trajectories discovered, and discuss follow-up clinical investigations that have stemmed from our results.

² Alternatively, one may use held-out data log-likelihood, but we choose predictive accuracy because the task of prediction is natural in the clinical setting and it is therefore easier to interpret the significance of the results.

Scleroderma S-markers

Scleroderma is a multi-system autoimmune disease resulting in insults that include damage to the skin, pulmonary system, and circulatory system. For our experiments we choose four datasets, each containing the s-marker corresponding to one of the major organ systems implicated in scleroderma. Total skin score (TSS) scores the thickness of the skin, which is used as a surrogate measure for the degree of fibrosis. An increased TSS indicates more extensive fibrosis across the individual's body. Percent of predicted forced vital capacity (pFVC) measures the volume of air expelled from the lung after maximum inhalation. pFVC is a measure of restricted ventilatory defect, which may reflect the severity of interstitial lung disease (ILD). Percent of predicted diffusing capacity (pDLCO) measures the efficiency of oxygen diffusion from the lungs to the bloodstream. Decreases in pDLCO indicate a defect in gas exchange, which is associated with development of pulmonary arterial hypertension (PAH). Finally, right ventricular systolic pressure (RVSP) measures systolic pressure in the chamber of the heart that directly pumps blood through the pulmonary vasculature. Significantly increased systolic pressure suggests an increased risk of heart failure due to pulmonary arterial hypertension. We focus here on the analysis of subtypes for each of the complications individually. If subtypes exist, then a natural follow-up is to identify whether a common mechanism might jointly influence trajectories for two or more organ systems, but this relies on first developing an understanding of the individual s-markers; the focus of our analyses below.

Unobserved S-marker Prediction

Our first experiment evaluates the accuracy of s-marker value predictions at unobserved times for models that incorporate different types of nuisance variability. The first model includes covariate effects and group/subtype effects (C+G). The covariates provided to us were gender, African American race, and age at disease onset, which are well-known risk factors for severity in scleroderma (Varga, Denton, and Wigley 2012). The second model uses individual-specific long-term effects in addition to covariate and group effects (C+G+L). Finally, PSM uses covariate, group, long-term, and short-term effects.

To choose the number of groups $G$ for each model, we use BIC as follows: we randomly generate five folds of the data by subsampling 75% of the individuals without replacement. For each model and for each choice of $G$, we compute the average BIC across the five folds and choose the number of clusters that results in the largest sequential drop in BIC (i.e. we search for the "elbow").

For each model, we use $P = 5$ bases and non-informative priors for the parameters; $\alpha = 2$, $\mu_\beta = \mu_B = 0$, and $\Sigma_\beta = \Sigma_B = \mathrm{diag}(1 \times 10^5)$. The prior variance on $\vec{b}_i$ was kept relatively small to limit the amount of variability that could be explained using long-term individual effects: $\Sigma_b = \mathrm{diag}(v_b)$ where $v_b$ was swept along the log-10 scale from $\{-1, \ldots, -5\}$. Similarly, to limit the amount of variability explained by the Gaussian process, the GP hyperparameters $\{a, \ell, \sigma\}$ were each swept from $\{1, \ldots, 5\}$. Years since first scleroderma-related symptom was used to align each individual on the time axis. We truncate time at 15 years following the first scleroderma-related symptom. An individual was included in the experiment if there were at least 4 measurements available. We used 1,011, 1,177, 1,114, and 504 individuals for TSS, pFVC, pDLCO, and RVSP respectively. Within this subset of individuals, the average number of measurements for each s-marker over the 15 year period is: 9.1 for TSS, 8.6 for pFVC, 8.3 for pDLCO, and 6.0 for RVSP. The average time in years between observed measurements for each s-marker is: 0.9 for TSS, 0.9 for pFVC, 1.0 for pDLCO, and 1.3 for RVSP.

To estimate prediction error, we split the individual trajectories into 10 groups and use 10-fold cross validation. PSM is trained on nine of the ten folds, and predictions are made for the tenth fold as follows. We select four s-marker observations from each held-out trajectory and assign each to a point-level fold. For each of the point-level folds, we condition on the remaining s-marker observations and use the MAP estimates of the held-out observations as our predictions. We compute root mean squared error (RMSE) for each point-level fold across the trajectory-level folds, and finally compute the mean RMSE and standard errors across point-level folds.

The prediction results and standard errors are displayed in Table 1. We see that PSM significantly outperforms the alternative models for TSS, pFVC, and RVSP. Although an improvement is demonstrated for pDLCO, the difference is not statistically significant. Figure 2 displays model fits to individual trajectories sampled from four of the discovered candidate subtypes (one subtype per 4 by 2 block of individuals) for the pFVC s-marker. Note that blocks display overall similar behavior, but that long-term and short-term variability tailor predictions for each individual using the observed markers.

[Figure 2: Example model fits to individual pFVC trajectories, panels (A) to (D); horizontal axis: Years Since First Symptom. Samples from four subtype candidates are displayed; one subtype per 4 by 2 block of individuals. Solid lines show full model fit (computed using Equation 14). Dots show observed pFVC values. Solid lines at the top of each block show the prototypical s-marker trajectory for that subtype.]

S-marker   C+G             C+G+L           PSM
TSS        5.32 ± 0.18     5.41 ± 0.07     *4.43 ± 0.14
pFVC       9.27 ± 0.49     9.34 ± 0.46     *7.69 ± 0.39
pDLCO      15.03 ± 1.82    15.13 ± 1.93    14.08 ± 1.77
RVSP       12.21 ± 0.50    12.11 ± 0.44    *10.89 ± 0.27

Table 1: RMSE with standard errors for s-marker prediction. Bold shows best performance on s-marker; * shows statistical significance (p ≤ 0.05).

Simulated Data Trajectory Estimate Accuracy

Another natural question is whether modeling these sources of variability reduces bias in our estimates of the prototypical trajectories. In other words, do the individual deviations cancel out so that PSM offers no benefit over less expressive models like C+G? For this, we turn to simulated data and investigate whether PSM recovers prototypical trajectories more accurately than C+G and C+G+L.

The simulation model samples observation time-stamps by sampling the $N_i$ from a Poisson distribution, and, conditioned on $N_i$, samples $\vec{t}_i$ from a Gaussian mixture model. The $\vec{y}_i$ are then sampled from a subtype mixture model with a hierarchy of individual-specific long-term, short-term, and iid noise as employed within PSM. The parameters used for the simulation are specified in Section A3 of the supplementary material. We compare the same three models used above, but do not simulate covariates and so they are not included. To measure bias, we find the alignment between estimated and true trajectories that minimizes the RMSE av-

of Figure 2. Each of these shows distinct patterns of decline: individuals in (A) have a steady, linear progression, those in (B) decline quickly within the first five years and then stabilize, and those in (C) are stable for the first five to ten years and then decline rapidly. Many of these individuals have at least one measurement that drops by more than 7% from the previous observation, which is clinically considered to suggest interstitial lung disease (see, for example, Beretta et al. 2007). It is clear, however, that they display unique patterns of decline, which has raised the question of whether individuals with these different subtypes differ with respect to their antibody profiles (since scleroderma is an autoimmune disease).

We now turn to panel (D). Fibrosis in the lungs due to end-stage interstitial lung disease is thought of as non-reversible damage. The individuals shown in panel (D), however, begin with inhibited pulmonary function, but slowly recover.
eraged across each estimated-true pair; to compute RMSE This pattern of recovery warrants additional investigation; it between two curves, we use a discrete approximation. is possible that in these patients, the initial insult is not due The iid noise σ =0.1 for all simulations, and the left col- to end-stage lung disease, but rather due to other causes of umn of Table 2 shows the amplitude a of short-term individ- restricted ventilatory defect such as inflammation. Recog- ual variability and standard deviation of individual-specific nizing this pattern may alter clinical management of these intercept terms σb. Individual specific slope variance was set patients. Moreover, if it is the case that these patients all to 1 × 10−4 for all simulations. share a common comorbidity at the onset of disease, then We see that as more nuisance variability is added, PSM it may suggest the presence of another subtype whose un- is able to recover less biased trajectory estimates. In Sec- derlying mechanism triggers the comorbid condition. tion A4 of the supplementary material, we provide plots Figure 3B displays the clusters learned by PSM for To- that show example individual trajectories from these exper- tal Skin Score (TSS) using M =1, 011 individuals with iments and how they contribute to the bias of prototypical G =5(selected using BIC). TSS is a well-studied s-marker trajectory estimates. in scleroderma, and is used for one of the primary clinical classification criteria. Individuals with more extensive skin (a, σb) G G+L PSM (0.00, 0.10) 0.27 ± 0.06 ∗0.04 ± 0.01 0.07 ± 0.06 disease have higher TSS scores. Traditionally, skin disease (0.00, 0.15) 0.34 ± 0.05 ∗0.05 ± 0.01 0.09 ± 0.06 in scleroderma is defined as limited (minimal involvement (0.10, 0.15) 0.34 ± 0.04 0.11 ± 0.05 0.14 ± 0.08 at the system level) or diffuse (systemic level involvement) (0.15, 0.15) 0.34 ± 0.05 0.18 ± 0.04 ∗0.13 ± 0.06 (Varga, Denton, and Wigley 2012). 
Here we see five clusters: (0.20, 0.15) 0.36 ± 0.05 0.25 ± 0.07 ∗0.14 ± 0.07 clusters 4 and 5 exhibit limited involvement, and clusters 1, (0.25, 0.15) 0.36 ± 0.06 0.32 ± 0.07 ∗0.18 ± 0.04 2, and 3 indicate different patterns of diffuse skin disease. Table 2: Estimated trajectory RMSE and standard errors Figure 3C displays the clusters learned using PSM for (computed over 20 replications) for simulations. Bold indi- pDLCO using M =1, 114 individuals with G =11(chosen cates best performance, statistical significance is indicated using BIC). Clinicians monitor pDLCO to detect the onset using ∗ (p ≤ 0.05). of pulmonary arterial hypertension (PAH), one of the most prominent sources of mortality among patients with sclero- Discovered Subtypes derma. Once pDLCO is low enough, an individual will typ- ically be screened using additional diagnostic tests (Steen We now present a qualitative discussion of the discovered and Medsger 2003). It is clear from Figure 3D, however, subtypes. For all results below, we use non-informative pri- that there are several patterns of decline (seen in clusters 2, ors for the parameters; α =2, μβ = μB =0, and 8, 9, 10, and 11). Understanding whether particular patterns 5  Σβ = ΣB = diag(1 × 10 ). The prior variance on bi of pDLCO decline are more predictive of PAH may help to was kept relatively small to limit the amount of variability develop more effective clinical heuristics. that could be explained using long-term individual effects: Finally, Figure 3D displays the clusters learned using Σb = diag(vb) where vb was swept along the log-10 scale PSM for RVSP using M = 504 individuals with G =5 from {−1,...,−5}. Similarly, to limit the amount of vari- (chosen using BIC). The RVSP results are noisier than the ability explained by the Gaussian process, the GP hyperpa- others, which may be due to the inherent noise in the mea- rameters {a, , σ} were each swept from {1,...,5}. surement process. 
RVSP is measured using an echocardio- We begin with pFVC. Figure 3A displays the clusters gram, and is inaccurate when the true underlying systolic learned using PSM on pFVC trajectories of 1,177 individ- pressure is between 30 and 45 mmHg (millimeters mercury). uals with G =9(chosen using BIC), and Figure 2 displays We note that there are two groups (1 and 4) with stable, individual trajectory fits from 4 of the 9 clusters. We first healthy pressures, which are presumably individuals with no focus on the subtypes displayed in panels (A), (B), and (C) serious circulatory complications. The remaining three clus-

2962 (A) 1 2 3 4 5 6 7 8 9

G G G GG G G G G G G G G G GG G G G G GG G G G G G G G GGGG G G G G G G G G G G G G G G G G GG G GG G GG G G G G G G G G G GGG G G G G G G G GG G G G 90 GG G G GG G G G GG G G G G G G GGGG G G GG G G G G G G G G G G GG GG G G G G G G G G G G G G G GG G G G GG G G G G G G G G G G GG GGGG G GGG G G GG GGGGG G G G GG G G G GGG GG G G G G G G G G G G GG G G GG G G G G G G GG G G G G G G GGG G G G GGG G G G G G G G G G G G G GG GGGG GGG G G G GG GG G GG G G G G G G GG G GG G G G G G GG G G G GG GG G G G G G G G G G G G G G G GG G GG G G G G G G G G G G G GG G GG G G G G G G G G G GG G G G G G GG G G G G G G G G G G G G G G G G G G GGGG G G G G G G GG G G G G G G GG G GG G G G G G G G G G G GG G G GG G G G G G G G G G G G G G G G G GG G G G GGG GG G G G G G G G G G GG G G GGG G G G G G G G GG G G G GG G GG GGG GG G GGG G GGG G G G G G G G GGG G G G G 60 G GG GGGG G GG G G GG GG pFVC G G G G G G GGG G G G G GG G GG GGG G G GG G GG GGGG GGGG G G G G G G G G G G G G G G GG G G G G GGGG G G GG G G G G G G GG G GG G G GG G GGGG GGG G G G G G G G G G GG 30 G G G G

0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 (B) (D) 1 2 3 4 5 1 2 3 4 5 60 G

GG G G G G G G G 120 G G GG G G G G G GG G G 40 G G G G G G GG G G G G G G G G G G G G G G G G GG G G G GG G G GG G GGG G G G G G G G G G G G G G 80 G G G GG G G GG G G G G G G GG G G G

TSS G G G G G G GG GG G G G G G G G G G GG G G G G RVSP G GG G G G G G G G GG G G G G G G G G G G G G G GG G G G G G G G G G G G G GG G G GGG G G G G G G G G G G G G G G GG 20 G G G GG G G G GGG G G G G G G G G G G G G GG GG GGG G G G G G GGG G G G G G G G GG G G G G G G G G G G G GGGG G G G G G GG G G G G G G G G G G G G G G G G G G G GGG G G G GG GGGG GG G G G GG G G G G G GG G GG G G G G G G G G GG G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G GGG GG G G G G G G GG GG G 40 GG GG G G G G G G G G G G GG G G G G G G GG G G G GG GGGG GGG G GG G G G G G GG GG GG G G G G G G G G G GG G G G GG G GG GGG G G G G G G G G G G G G G G G G G G G G G G G GGGGG GGGGG GGG G GG G G G GG GG G G G G GG G GGG GG GGGGG GGG G G G G GGGG G G G GGGGGGG GGGGG G GGGGGGGGGGGGG G G G GG G GGG G GGGGG GGGGGGGGGG GGGGGG G G GG GGGGGGG GGGGG GGGGG GG G G 0 G G GGG G GG GGGGGG GG G 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 (C) 1 2 3 4 5 6 7 8 9 10 11

G G G G G G G G G G G G G G G G GGG G G G GG G GG G G G G G G G G G GG G G G G G G G G G G G G G G G G G G GG G G G G G GGG G G G G G G G GG G G G G G G GG G 100 G G G G GG G G G G G G G G G G G G G G G G G G G G G GG G G G G G G G G G G G G G G G G G G G G G GG GG G G G G G G G G G G G G G G G G G G G G G GG GG GG G G G G G G G G GG G G G G G G GG G G G G GGG G G G G G G GGG G G G G G GGG G G G G G G G G G G G G G G G G GG G G G G G G G GG G G G G G G G GG G G G G G G G G G G G G G GG G G G G G GG G G G G G G G G G G G G G G GG G G G G GG G G GG G G G G G G G GG G G G GG G GG G G GG G G G G G G G G G G G G G GG G G GG GG G G G GG G G G G G GGG G G G G G G G G G G G G G GG G G G G G G G GG G G G G G G G G G GG G G G G G G G G GG GG G GG G G G G G G G G G G GG G GG G G G G G G G G GGGGGG G G GG G G G GG G G G G G G G GG GGG G GGG G G G G G G G G G G G GG G GG G G G G GG G G G G G G G G G G G G GG G G G G G G G G G G G G G G G G G G GGG G GG G G G G G G G G G GG G G G GG G G G G G G G GG G G GGG G GG G G G G G G G G G G GG G G G G G G G G G G G G G GG G G G G G G G G GGG G GG GGG G G GGG G G GG G G G pDLCO G G G G G GG G G G G G G G G GG G G G G G G G G G G G GG G G G G G G G G G GGG G G G G GG G G GGG G G G G GG GG G 50 G G G G G G G GG G G GG G G G G GG G G G GG G G GGG G G G GG G G G G G GG GG G G G G G G G G G G G G G GGG G G GG G G G G G G G GG G G G G G G GG G G G G G G G G G G G G G G G G G G GG G G GG G G G G G G G G G G G

0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 Figure 3: Discovered subtypes for all four s-markers. Panel (A) shows pFVC, panel (B) shows TSS, panel (C) shows pDLCO, and panel (D) shows RVSP. Prototypical s-marker trajectories are shown in black, and individuals sampled from the subtype are shown in color. Colored lines show the individualized s-marker trajectory, and colored points show the observed s-markers. Best viewed in color. ters (2, 3, and 5) are more difficult to interpret because they PSM identified novel subtypes for each of the four key or- are not associated with any known patterns of heart involve- gan systems affected in patients with scleroderma. To iden- ment; this may be attributable to the noisiness of the obser- tify whether these subtypes are indeed distinct disease mech- vations, or other phenomena that are as yet unknown. These anisms, our ongoing clinical experiments are investigating remain the subject of future clinical follow up. whether there exist molecular differences consistent with these subtypes. Joint Analysis of S-Markers We have focused our anal- yses using PSM on single s-marker subtypes. A natural PSM provides a new tool to clinicians for studying can- follow-up question is whether we can infer clusters across didate subtypes, especially useful in complex, chronic dis- multiple s-markers. If we are analyzing K s-marker types, eases such as neuropsychiatric and autoimmune diseases then, as presented, PSM assumes the following joint distri- (e.g., scleroderma is one of 80 autoimmune diseases) known bution over s-marker sequences y k and memberships z k: to be heterogeneous and poorly understood. As electronic i i medical records continue to amass data from routine visits, K     there will be growing need for such tools to refine our char- k | k k p yi zi , Θ p zi . (15) acterization of what constitutes a disease. 
k=1 A natural direction for follow up is to identify ways to We can replace the fully factored distribution over quantify effects of variability in treatment protocols across 1 ,K 1 K individuals on the estimated subtypes. zi ,...,zi with a complete joint p zi ,...,zi to induce correlations across s-marker types. The posterior distribu- Acknowledgements. This work was supported by a 1 K Faculty Research Award, NSF award #1418590, and tion over zi ,...,zi can be inspected to discover clusters defined over multiple s-markers. a Whiting School of Engineering Centennial Fellowship. A simpler alternative when full posterior inference over References group memberships is not desired, is to represent each in- dividual using a vector of categorical variables indicating Anderson, G. P. 2008. Endotyping asthma: new insights into cluster membership for each s-marker as inferred by PSM key pathogenic mechanisms in a complex, heterogeneous (e.g. [tss-type-1, pfvc-type-3, dlco-type-2, ...]). Then, us- disease. The Lancet 372(9643):1107–1119. ing a distance-based clustering method, such as hierarchical Bardenet, R.; Kegl,´ B.; et al. 2010. Surrogating the surro- agglomerative clustering (HAC), clusters across multiple s- gate: accelerating gaussian-process-based global optimiza- marker types can be obtained. Due to space constraints, we tion with a mixture cross-entropy algorithm. In Proceedings omit analyses of joint clusters. of the 27th International Conference on Machine Learning, Discussion 55–62. This paper presents the Probabilistic Subtyping Model, a Barr, S. G.; ZonanaNacach, A.; Magder, L. S.; and Petri, M. model for clustering time series of clinical markers ob- 1999. Patterns of disease activity in systemic lupus erythe- tained from routine visits to identify homogeneous pa- matosus. Arthritis and Rheumatism 42(12):2682–2688. tient subgroups. 
We introduce the concept of nuisance variability—effects due to factors such as age and co- Beretta, L.; Caronni, M.; Raimondi, M.; Ponti, A.; Viscuso, existing conditions—that affect the observed clinical test re- T.; Origgi, L.; and Scorza, R. 2007. Oral cyclophos- sults, but are unrelated to the underlying disease mechanism. phamide improves pulmonary function in scleroderma pa- PSM introduces a framework for modeling these sources of tients with fibrosing alveolitis: experience in one centre. nuisance variability to uncover disease activity subtypes that Clinical rheumatology 26(2):168–172. are likely candidates for endotyping. Blumenthal, D. 2009. Stimulating the adoption of health

information technology. New England Journal of Medicine 360(15):1477–1479.
Camilli, A. E.; Burrows, B.; Knudson, R. J.; Lyle, S. K.; and Lebowitz, M. D. 1987. Longitudinal changes in forced expiratory volume in one second in adults. Effects of smoking and smoking cessation. The American review of respiratory disease 135(4):794–799.
Chang, H. Y.; Clark, J. M.; and Weiner, J. P. 2011. Morbidity trajectories as predictors of utilization: multi-year disease patterns in Taiwan's national health insurance program. Medical care 49(10):918–923.
Chen, D. P.; Weber, S. C.; Constantinou, P. S.; Ferris, T. A.; Lowe, H. J.; and Butte, A. J. 2007. Clinical arrays of laboratory measures, or "clinarrays", built from an electronic health record enable disease subtyping by severity. AMIA Annual Symposium Proceedings Archive 115–119.
Chen, M.; Zaas, A.; Woods, C.; Ginsburg, G. S.; Lucas, J.; Dunson, D.; and Carin, L. 2011. Predicting viral infection from high-dimensional biomarker trajectories. Journal of the American Statistical Association 106(496).
De Keulenaer, G. W., and Brutsaert, D. L. 2009. The heart failure spectrum: time for a phenotype-oriented approach. Circulation 119(24):3044–3046.
Doshi-Velez, F.; Ge, Y.; and Kohane, I. 2014. Comorbidity clusters in autism spectrum disorders: an electronic health record time-series analysis. Pediatrics 133(1):e54–63.
Fox, E. B.; Sudderth, E. B.; Jordan, M. I.; and Willsky, A. S. 2011. Joint modeling of multiple related time series via the beta process. arXiv preprint arXiv:1111.4226.
Gundlapalli, A. V.; South, B. R.; Phansalkar, S.; Kinney, A. Y.; Shen, S.; Delisle, S.; Perl, T.; and Samore, M. H. 2008. Application of natural language processing to VA electronic health records to identify phenotypic characteristics for clinical and research purposes. Summit on translational bioinformatics 2008:36.
Ho, J. C.; Ghosh, J.; and Sun, J. 2014. Marble: high-throughput phenotyping from electronic health records via sparse nonnegative tensor factorization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 115–124. ACM.
Hoshida, Y.; Brunet, J.-P.; Tamayo, P.; Golub, T. R.; and Mesirov, J. P. 2007. Subclass mapping: identifying common subtypes in independent disease data sets. PloS one 2(11):e1195.
Kho, A. N.; Pacheco, J. A.; Peissig, P. L.; Rasmussen, L.; Newton, K. M.; Weston, N.; Crane, P. K.; Pathak, J.; Chute, C. G.; Bielinski, S. J.; Kullo, I. J.; Li, R.; Manolio, T. A.; Chisholm, R. L.; and Denny, J. C. 2011. Electronic medical records for genetic research: results of the eMERGE consortium. Science translational medicine 3(79):79re1.
Kohane, I. S. 2011. Using electronic health records to drive discovery in disease genomics. Nature Reviews Genetics 12(6):417–428.
Lasko, T. A.; Denny, J. C.; and Levy, M. A. 2013. Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data. PloS one 8(6):e66341.
Lewis, S. J.; Foltynie, T.; Blackwell, A. D.; Robbins, T. W.; Owen, A. M.; and Barker, R. A. 2005. Heterogeneity of Parkinson's disease in the early clinical stages using a data driven approach. Journal of neurology, neurosurgery, and psychiatry 76(3):343–348.
Listgarten, J.; Neal, R. M.; Roweis, S. T.; Puckrin, R.; and Cutler, S. 2006. Bayesian detection of infrequent differences in sets of time series with shared structure. In Advances in neural information processing systems, 905–912.
Lötvall, J.; Akdis, C. A.; Bacharier, L. B.; Bjermer, L.; Casale, T. B.; Custovic, A.; Lemanske Jr, R. F.; Wardlaw, A. J.; Wenzel, S. E.; and Greenberger, P. A. 2011. Asthma endotypes: a new approach to classification of disease entities within the asthma syndrome. Journal of Allergy and Clinical Immunology 127(2):355–360.
Marlin, B. M.; Kale, D. C.; Khemani, R. G.; and Wetzel, R. C. 2012. Unsupervised pattern discovery in electronic health care data using probabilistic clustering models. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, 389–398. ACM.
McCulloch, C.; Lin, H.; Slate, E.; and Turnbull, B. 2002. Discovering subpopulation structure with latent class mixed models. Statistics in medicine 21(3):417–429.
Muthén, B., and Shedden, K. 1999. Finite mixture modeling with mixture outcomes using the EM algorithm. Biometrics 55(2):463–469.
Nagin, D. S., and Odgers, C. L. 2010. Group-based trajectory modeling in clinical research. Annual Review of Clinical Psychology 6:109–138.
Ross, J., and Dy, J. 2013. Nonparametric mixture of Gaussian processes with constraints. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 1346–1354.
Saria, S.; Duchi, A.; and Koller, D. 2011. Discovering deformable motifs in continuous time series data. In International Joint Conference on Artificial Intelligence, volume 22, 1465.
Saria, S.; Koller, D.; and Penn, A. 2010. Learning individual and population level traits from clinical temporal data. In Proc. Neural Information Processing Systems (NIPS), Predictive Models in Personalized Medicine workshop.
Shea, S., and Hripcsak, G. 2010. Accelerating the use of electronic health records in physician practices. New England Journal of Medicine 362(3):192–195.
State, M. W., and Sestan, N. 2012. Neuroscience. The emerging biology of autism spectrum disorders. Science (New York, N.Y.) 337(6100):1301–1303.
Steen, V., and Medsger, T. A. 2003. Predictors of isolated pulmonary hypertension in patients with systemic sclerosis and limited cutaneous involvement. Arthritis & Rheumatism 48(2):516–522.
Varga, J.; Denton, C. P.; and Wigley, F. M. 2012. Scleroderma: From pathogenesis to comprehensive management. Springer.

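
The bias measure used in the simulation study (find the alignment between estimated and true prototypical trajectories that minimizes the RMSE averaged over matched pairs, with each curve-to-curve RMSE computed on a discrete grid) could be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `curve_rmse` and `best_alignment` are hypothetical, both sets are assumed to contain the same number of curves evaluated on a shared time grid, and the search is brute force over permutations, which is only practical for a handful of subtypes.

```python
from itertools import permutations
from math import sqrt


def curve_rmse(curve_a, curve_b):
    """RMSE between two curves sampled on the same discrete time grid."""
    if len(curve_a) != len(curve_b):
        raise ValueError("curves must share the same time grid")
    squared = [(a - b) ** 2 for a, b in zip(curve_a, curve_b)]
    return sqrt(sum(squared) / len(squared))


def best_alignment(estimated, true):
    """Search all one-to-one matchings of estimated curves to true curves.

    Returns (perm, score) where perm[i] is the index of the estimated
    curve matched to true curve i, and score is the mean pairwise RMSE.
    """
    n = len(true)
    best_perm, best_score = None, float("inf")
    for perm in permutations(range(n)):
        score = sum(curve_rmse(estimated[j], true[i])
                    for i, j in enumerate(perm)) / n
        if score < best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score


# Toy curves (hypothetical values): the estimates are listed in swapped order.
true_curves = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
estimated_curves = [[1.1, 1.1, 1.1], [0.0, 0.1, 0.0]]
alignment, score = best_alignment(estimated_curves, true_curves)
```

For larger numbers of subtypes, an assignment solver such as `scipy.optimize.linear_sum_assignment` applied to the matrix of pairwise RMSE values would likely be a more practical choice than enumerating permutations.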
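
The simpler joint-analysis alternative (represent each individual by a vector of per-s-marker cluster labels, then apply a distance-based method such as hierarchical agglomerative clustering) might look like the sketch below. The paper does not name a distance or a linkage, so the Hamming distance and average linkage used here are assumptions, as are the function names and the four example membership vectors. The implementation is a naive pure-Python loop intended only to make the idea concrete.

```python
def hamming(u, v):
    """Fraction of s-markers on which two membership vectors disagree."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


def average_linkage(vectors, num_clusters):
    """Naive agglomerative clustering with average linkage.

    Starts with one cluster per individual and repeatedly merges the
    pair of clusters with the smallest mean pairwise Hamming distance
    until num_clusters remain. Returns lists of individual indices.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dists = [hamming(vectors[i], vectors[j])
                         for i in clusters[a] for j in clusters[b]]
                d = sum(dists) / len(dists)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical per-individual membership vectors (one label per s-marker).
memberships = [
    ("tss-type-1", "pfvc-type-3", "dlco-type-2"),
    ("tss-type-1", "pfvc-type-3", "dlco-type-5"),
    ("tss-type-4", "pfvc-type-7", "dlco-type-9"),
    ("tss-type-4", "pfvc-type-7", "dlco-type-2"),
]
joint_clusters = average_linkage(memberships, num_clusters=2)
```

In this toy example the first two individuals agree on two of three s-markers, as do the last two, so they end up paired in the two joint clusters.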