<<

Downloaded by guest on September 28, 2021 stochasticity associations. detection disease–pathway and by disease progression, illustrated and characterization, as tissue variation, involving of and Finally, analysis “personalized” population. more interpretable a some biologically facilitates in complexity divergent probability in is reduction the pathway the estimate or gene can a one that analyzed example, be can for dysregulation level, probabilistically; sample analy- individual the divergence at the is since sis subsets Notably, distinguished pathways). cases: and regulatory two genes) (e.g., as on individual focus interpreted (e.g., We support features consequently baseline. estimated single is that the to and outside relative lies distribution genes) “dysregulated” it baseline of if subset the “divergent” a of be for to vector said expression is (e.g., state subprofile from The samples tissues). any normal of of (e.g., set population a baseline from or profiles reference a omics repre- the digital a on to based profile sentation omics an omics simplify single a we Specifically, to profile. applicable is quan- that to heterogeneity theory sample-level unified tify a CpG introduce we or model Here diversity. effectively genes intersample not do expressed therefore and bioinfor- differentially population-based are in detecting sites, methods as current such However, even matics, subtype. another, cancer to het- tumor a individual such within one and of from microRNA, profiles example mRNA, methylation genome-wide prominent within between A occurs perva- states erogeneity phenotypes. molecular revealed of between have Yu) stochasticity and Bin technologies and and Tatonetti P. omics heterogeneity Nicholas by sive from reviewed 2015. 2017; in 13, elected collected December Sciences review of for Academy (sent National 2018 the 19, of March members Geman, by Donald Articles by Inaugural Contributed of series special the of part is contribution This and 21218; MD Baltimore, University, Hopkins a Price D. Nathan seilytcnqe odtrieweea mc rfiefalls profile omics an where determine are variation, to genome-wide population techniques quantify the to especially in Methods data (4). omics underway Notably, now large-scale modalities. monitor data to omics Similar for efforts selection. needed low- therapy now assessment, and are for approaches risk prognosis, clinically of and biomarkers, used use diagnosis and widely disease The already tests disease. is laboratory complex dimensional normal a a dysregulation of from molecular indicative reflects is profile that omics part, determining single In toward a progress introduced. whether impeding variation, when normal of expected characteriza- tion genome-wide was insufficient from arises as limitation this care, of standard clinical impact to (1–3). beginning diseases complex organism also for are practice and mechanisms data organization, substan- molecular These tissue have development. of functioning, we understanding cell conditions, underlying our healthy organisms, enhanced and model and tially diseased different individuals from from across data com- tissues, and these and complex of types Through analyses cell data. statistical omics and as putational are to and samples referred of thousands high-dimensional collectively on These made been domains. have molecular measurements distinct sys- over biological complex tems of molecular characterization unknown the enabling previously features, other and metabolites, proteins, I baseline a from Dinalankara divergence Wikum by profiles omics Digitizing www.pnas.org/cgi/doi/10.1073/pnas.1721628115 eateto nooy on okn nvriySho fMdcn,Blioe D21205; MD Baltimore, Medicine, of School University Hopkins Johns Oncology, of Department rfiigo eei ains N pce,eieei marks, epigenetic species, RNA variants, global genetic enabled have of advances profiling technological decades, recent n eetees mc aahv e osgicnl nomthe inform significantly to yet have data omics Nevertheless, | digitization c arn Younes Laurent , a,1 | dysregulation inKe Qian , b,1 b | ia Xu Yiran , ug Marchionni Luigi , cancer c nttt o ytm ilg,Sate A98109 WA Seattle, Biology, Systems for Institute | rcso medicine precision b alnJi Lanlan , a,2 b n oadGeman Donald and , ioePagane Nicole , 1073/pnas.1721628115/-/DCSupplemental etrso eerglto npriua nra)tsu,the tissue, (normal) particular in unique the regulation determine gene to samples is of normal goal the is features if population However, tissue. “baseline” particular that a natural from of a cancers lies are then it For interest tissue, if distribution. of phenotypes “divergent” probability the baseline be if of the to instance, subset of said this support is sam- the of profile outside that realization population sample the any a also in Then from Suppose features baseline. available set. the are get as subset declared functional this of a are observations or cases ple special the gene important example genes; single for of a features, set any omics of of we levels subset suppose expression particular Briefly, a neces- interest. given of information are phenotypes biological characterize important to the intended divergence sary is of on representation most based data capture simplified states to This discrete baseline. a characterize few high-resolution from to a quantizing framework into by unified variation measurements nonnormal a and present normal we tests, clinical diseases. and pro- complex diagnosis of for they prevention approaches clinical principle, inform can in which disease-specific variation, and assays; normal poten- of quantification the omics unprecedented realize vide high-dimensional fully to of essential be tial may baseline, a to relative ulse nieArl1,2018. 16, April online Published at online information supporting contains article This 2 under distributed 1 4528. page is on QnAs See article BY-NC-ND). (CC 4.0 access License NonCommercial-NoDerivatives open This interest. of conflict no declare Berkeley. authors California, The of University B.Y., and data; University; curated paper. Columbia N.P.T., Y.X., the and Reviewers: wrote collected Q.K., D.G. L.M. W.D., and and results; L.M., T.M., L.Y., E.J.F., the A.L., W.D., N.P., interpreted and L.Y., W.D., N.D.P., and research; E.J.F., performed experiments Q.K., L.J. the W.D., and designed project; the D.G. initiated and D.G. L.M., and L.M. contributions: Author owo orsodnemyb drse.Eal [email protected] rmarchion@jhu. or [email protected] Email: edu. addressed. be may correspondence whom work. To this to equally contributed Q.K. and W.D. ue oaysbe ffaue rae sasnl entity, cancer single various and a Experi- tissues pathway. subtypes. as human fea- a healthy in treated both single levels involve features expression ments from data gene of granularity, the any example subset of for to any level applied to any be tures at can and method modality our Moreover, normal from variables levels. selected of departures diag- landscapes for clinical of testing basic nostic range mimicking thereby the population, to baseline individual a in an of com- profile by the representation paring data develop personalized simplified, We heterogeneity. highly the a person-to-person and data of these degree diseases. of complexity high complex the of are obstacles data treatment major these Two personalized and the cells, inform of landscapes can comprehensive molecular the increasingly of profiling enable advances Technological Significance odvlpti ihdmninlaaou olow-dimensional to analogue high-dimensional this develop To a nhn Lien Anching , b eateto ple ahmtc n ttsis Johns Statistics, and Mathematics Applied of Department b,2 PNAS | a ,2018 1, May a eav Matam Tejasvi , . | o.115 vol. www.pnas.org/lookup/suppl/doi:10. raieCmosAttribution- Commons Creative a | ln .Fertig J. Elana , o 18 no. | 4545–4552 a ,

BIOPHYSICS AND STATISTICS INAUGURAL ARTICLE COMPUTATIONAL BIOLOGY baseline could be all other tissues types combined. The novelty Divergence. Suppose we are given a reference joint of our framework is that it applies to any subset of features and P0 of (Xj, j ∈ J ) associated with some baseline phenotype. Of course, in prac- to any new profile by itself. tice, we only observe samples from this distribution where each sample is Divergence is a sample property. In contrast, putative discrim- an omics profile from a single data modality, and the choice of reference inating sets of features, such as “differentially expressed genes,” phenotype depends on the particular study and data modality. As will be seen in Results, the baseline phenotype need not be a single tissue type or a are defined in terms of population-level distributions. For exam- disease-free phenotype and can be any population that is meaningful under ple, given two classes or phenotypes, a gene is differentially the analysis carried out. expressed if its distribution differs in the two classes. However, Given an omics profile X and a subset S ⊂ J , we will apply the defi- S S knowing the status of a molecular feature in the context of pop- nition of divergence to the random vector U = Q (X) = (Qj(X), j ∈ S). We ulations may say very little about the status of that feature for will focus on single features S = {j} (e.g., single genes) and on sets S a particular individual. Divergence then enables the prospec- with 1  |S|  |J | (e.g., gene pathways). Let m = |S|, so that US takes m S m S tive analysis of individual samples, as required for applications values in [0, 1] . We describe below how the support U0 ⊂ [0, 1] of U can be estimated from the data, resulting in a set UˆS that is explic- to precision and personalized medicine. It also enables a proba- 0 bilistic analysis at the phenotype level, where the probability that itly defined. When presented with a new sample X, we define a binary variable by a specific feature (say gene), or set of features (say pathway), is ( S S 1 if Q (X) ∈/ Uˆ , divergent is a well-defined concept. Phenotypes can then be char- Z = 0 acterized based on the corresponding divergence probabilities 0 otherwise . and predicted for unlabeled samples by examining the diver- We will then say that the set S is divergent (for the considered sample) if gent set. This approach can be applied to any high-dimensional and only if Z = 1. omics dataset, yielding simplified matrices of digitized values that can be analyzed statistically at the feature, sample, or Support Estimation. The support of the random vector US = QS(X) under phenotype level. P0 is estimated by a “covering” of the observed baseline samples. Cover- Deviation from a “normal” range is a form of “outlier analy- ing methods for the estimation of the support of a multivariate density sis” and as such is related to other work in statistics and genomics were introduced in ref. 13, with a goal similar to ours of detecting abnor- ˆS (5–12). All of these previous studies involve single features and mal behavior. The estimator U0 that we propose is a variation of the m require standardization of a new sample with respect to multi- one proposed in that paper. Let d be a metric on [0, 1] (we will use ple other samples from the same class. In contrast, divergence the Euclidean ). Assume that n0 independent and identically dis- tributed (i.i.d.) samples of X are observed under P0, resulting in i.i.d. samples applies to any subprofile, and its determination does not require S S m U (1), ... , U (n0); these are now n0 points in [0, 1] . We will define an the availability of other samples from the same class (e.g., disease increasing sequence of empirical supports indexed by a “smoothing” param- type); in fact, a new sample profile need not even be labeled. The eter γ ∈ [0, 1]; the determination of γ is described later. Let l = l(γ) = [γn0], only normalization is within-sample , and all comparisons the greatest integer less than or equal to γn0. For each k ∈ {1, ... , n0}, let rk with baseline samples occur in the rank space. denote the distance between US(k) and its lth nearest neighbor. We define

Methods n0 ˆS m [ S The notion of “divergence” is general: Let U be a assuming U0 = [0, 1] ∩ B(U (k), rk), [2] k=1 values in a space U and let P0 be a reference or baseline distribution on U with support supp(P0) ⊂ U. Then a value u ∈ U is divergent if u ∈/ supp(P0). By definition, if U has distribution P0, then the probability of observing a S S where B(U (k), rk) denotes the closed ball of center U (k) and radius rk. In divergent value is zero; the interest is in quantizing U when it follows alter- m ˆS other terms, a point u ∈ [0, 1] belongs to U0 if and only if there exists native distributions and to compare quantized values among alternatives. S k ∈ {1, ... , n0} such that U (k) is closer to u than to its lth nearest neighbor. We are going to apply this to functions U constructed from rank-normalized The smoothing parameter γ corresponds to the “bandwidth” in multivariate individual omics profiles. The supports will be estimated from data. (14), and our estimator, based on nearest neighbor dis- An omics profile is a vector X = (Xj, j ∈ J ) where the index set J and tances, implements what is commonly referred as an “adaptive bandwidth” the values assumed by individual features Xj depend on the particular data method in this literature. modality determined by the measurement technology. Typically, the fea- The estimate of the support in Eq. 2 is very conservative: Every sub- tures are states, counts, or concentrations of biomolecular species. In this profile S is nondivergent for every sample from the baseline population. paper, we consider gene expression (microarray and RNA-sequencing) and This is unrealistic, not only in view of possible outliers among the base- DNA methylation (microarray). This framework is directly applicable to other line samples, but also because these baseline samples may in fact contain types of measurements, including single-cell data and other omics modali- a small proportion of nonbaseline cell types. Another drawback is that γ ties (e.g., microRNA and protein expression). Here, for gene expression, J is yet to be specified. We address these two issues simultaneously. Again, is the set of genes and Xj is either a microarray value or RNA-seq count for ... let r1, , rn0 be the radii of the balls centered at baseline samples. Now gene j (“bulk” mRNA); in the case of methylation, J is a set of CpG sites and let ¯r be the 95th of these radii. Instead of covering all of Xj is the “β value” (see SI Appendix). the baseline samples, we remove the 5% for which rk >¯r before com- puting the support. That is, the support is constructed as in Eq. 2 but Quantiles. In our experiments, each individual profile is transformed into with the union over all k = 1, ..., n0 replaced by the union over {k : rk ≤ quantile space: Each element of the profile is replaced by its normalized ¯r}. Notice that some “left-out” samples may still lie in the support, but rank with respect to the other elements in the same profile. Define in general some will not, and therefore, some features will be declared divergent for some baseline samples. The smaller is γ, the more baseline divergence. |{i ∈ J : 0 < Xi ≤ Xj}| Q = Q (X) = . [1] For m = 1, say S = {j}, the estimated support is simply a union of inter- j j |{ ∈ J : < }| i 0 Xi vals centered at the quantile values of a subset of baseline samples with interval lengths determined by γ. In nearly all cases of interest, this union is itself an interval; that is, there are no “gaps.” Making this assumption (or This definition returns a value 0 ≤ Qj ≤ 1 and implies a separate treatment replacing the support by its convex hull), it suffices to compute its upper and of ties at 0 (e.g., “nonexpressed genes”). If Xj = 0, then Qj = 0, and positive quantiles are accrued only from positive X s. This is important for some data lower bounds. This can be done by first computing the smallest and largest j j ¯ modalities, especially gene expression from RNA-seq, for which the many values among the (retained) baseline samples U (k) for rk ≤ r, denoted ties that typically occur at 0 may offset standard definitions of quantiles, respectively by mj and Mj, and then estimating the baseline support of ˆj 0 0 0 assigning large quantile values to possibly small, nonzero, expression num- feature j by U0 = [lj , uj ] ⊂ [0, 1], where lj = max(0, mj − d1) [respectively, 0 bers. Importantly, this definition of quantiles is sample-based (i.e., they are uj = min(1, Mj + d2)] and d1 (respectively, d2) is the distance from the small- computed across variables measured for a single subject), not population- est sample (respectively, the largest) to its lth nearest neighbor among the or cohort-based. retained baseline samples.

4546 | www.pnas.org/cgi/doi/10.1073/pnas.1721628115 Dinalankara et al. Downloaded by guest on September 28, 2021 Downloaded by guest on September 28, 2021 hc ntr eemn h siae uprs noreprmnswith experiments either our take In select supports. we estimated divergence, the single-feature determine turn in which because is This specify. once Therefore, estimator. port γ namely population, baseline the in tures cen- disk the fifth of its radius to distance the Euclidean the where is on disks, values ([ quantile 47 neighbor of nearest of samples pair “BIRC5”—are each 50 union at and for tered the genes—“ERP1” values is expression two are gene the support there are Since and correlated. data tissue, clearly The breast 1. normal Fig. of in shown is omlsmlsi hw ytega hd;tesmlsfligotiethe outside falling the samples using the divergent. computed shade; declared gray support are the of support by Genome area shown Cancer is The The samples from data. normal random cancer at breast chosen (TCGA) were Atlas stars) (red samples A nal 1. Fig. and features fraction average the on limit to is divergence baseline of Parameters. features. divergent upper and divergent lower of iaakr tal. et Dinalankara Z that say then will We intervals refined: are be supports can the definition where features, single For dependent. The by (15). D denoted (MSigDB) is Database Signature set retrieved Molecular overlap- divergent FGSs Institute use be we Broad may experiments, the our sets from In the dimensions. different (FGSs)—where of sets and ping gene functional or for pathways determined is divergence j so considered, are features J ∈ j hc civsti fraction this achieves which , {1, ⊂ = ewl pl h ento fdvrec ntoseais (i scenarios: two in divergence of definition the apply will We (|S genes two for support estimated the of example An γ ,adlet and 1, . (ii ; · · · Dbsln upr.Ffynra ape bu ons n 0lumi- 50 and points) (blue samples normal Fifty support. baseline 2D iegnei ple ocletoso sets of collections to applied is divergence ) α , N eunn otecoc of choice the to Returning = M } D 1or .01 = ncase in l (X N = .1(50)] ) o aiyof family a for = j α α {j slwrdvretif divergent lower is nalcases, all In ii. = Z determines : j Z 0 (see .001 = j = ) h xeto oeaeovosydepends obviously coverage of extent The 5). D(X      −1} ,if −1, hr h same the where α, α ,if 1, ,otherwise 0, ) = sfie,teeaen te aaeesto parameters other no are there fixed, is n Results). and {j N γ 0 α D = : and ust.W hnslc h smallest the select then We subsets. ≡ Q Q γ Z sarno e—hti,i sample- is is, set—that random a is 0nra ape,teestimated the samples, normal 50 E j D aua a ocnrltelevel the control to way natural a , j j 1 ihfntoa eest,we sets, gene functional with .01; (X (X 6 0 where =0}, u γ  (X ) ) Z < > eemnsteradii the determines |D j M ) = = l u j 0  j 0 n pe iegn if divergent upper and −1 where , {j . : γ S Z 1 sue o vr sup- every for used is j , = ..., D α 1} J ⊂ M S S fdvretfea- divergent of | N = = = eoeteset the denote frexample, —for J| |J )and 2) {j ncase in } [l l single All ) r j 0 o single for o every for 1 , , u ..., j 0 γ i ] the , = and r n .1 0 , ie apepol,asto features of in set described a profile, As Atlas sample tissue. Genome given normal again, Cancer collection; is FGS The baseline Hallmark from the MSigDB data the measured and using also (16) level (TCGA) We set baseline. gene corresponding the the the as using tissues, counterpart various profiles normal from methylation samples DNA expres- tumor and gene of in RNA-seq) features and single (microarray for sion divergence, of degree the Baseline. Normal |D, a from Profiles Tumor of Divergence Results rbblte fDvrec cosPopulations. probability Across Divergence of Probabilities ER+, HER2-enriched, basal-like, tumors. B, ER– luminal and A, luminal vs. mal D (see RNA-seq platforms microarray by S1 other Fig. measured using as or types, TCGA in tumor in additional features for the considera- obtained of into identity taking the even tion without in separation differences near-perfect for the the fact, platform In (Wilcoxon and and 2). significant tissue divergence data highly each of is in level the samples the cohort generate tumor in and difference to normal the between cases, used all In platform irre- type. methylation, tissue the DNA and of expression gene spective both for tumors in all and support. normal baseline left-out (see the samples the tumor for estimate determined was to divergence selected Then sam- normal randomly available the were of half ples type, data omics and tissue each base- estimated the outside fall support. value(s) line quantile assumed the if features phenotype from accuracies test yields half other the on 99. testing and group of each size classifier the (SVM) thresholding machine be samples. just vector to support (effectively tumor linear appears and a training normal fact, sample In separating this in features, discriminating cancer single highly breast with the As For 2B. pathways. sizes of the set subtypes, sample-specific a now is iseo neet say normal given interest, a of end, between this tissue probabilities To divergence patterns. compared expression we tissue-specific highly showing lymph to spreading and size, progno- larger (18–20). poorer nodes A, grade, a higher luminal portending including with factors sis, compared more multiple subtype the by this with characterized consistent of is behavior a B aggressive of luminal finding with the our from associated divergence note, baseline) greater Of (i.e., 3A). dysregulation of Fig. degree (see greater B luminal in divergence corrected genes diver- ferroni (χ the be divergent” of “differentially all to significantly almost likely as particular, identified In more tumors. are cancer. A samples luminal breast than B B gent luminal luminal genes, and most A For luminal between profiles gence GTEx the normal from in of frequencies obtained relative variety data. are training types estimates large probability All tissue a (17). project distinct and from TCGA samples in subtypes molecular a elcdb h pe n oe iegn sets divergent lower and upper the by replaced was vrl,frsnl etrs yial huad r divergent are thousands typically features, single for Overall, for samples, tumor in divergence of extent the measuring In unn otedvrec fgn es h iegn set divergent the sets, gene of divergence the to Turning eas nlzdRASqdt rmGE oietf genes identify to GTEx from data RNA-Seq analyzed also We diver- feature-level ternary compared we features, single For % 0% 0% 0% 98 100%, 100%, 100%, 2%, .Tesm atr esse hntedvretset divergent the when persisted pattern same The A–C). j n o eesets gene for and P (Z j P 1 = y |D value IAppendix, SI sdvret eddti ohfrsingle for both this did We divergent. is PNAS | Y o u CAsmlsaesoni Fig. in shown are samples TCGA our for T = ≤ n h eann ns is,we First, ones. remaining the and , | y n a ihrpoaiiisof probabilities higher had and 0.05) = ) a ,2018 1, May . | % 99. 3%, D| S P j Methods n o ohbes cancer breast both for and , (j r ufiinl ag oallow to large sufficiently are iia eut eealso were results Similar D. ∈ epciey o nor- for respectively, 5%, D| P S nhl h aafrom data the half on D) | Y value sdfie sdivergent as defined is o details). for o.115 vol. = y eetmtdthe estimated We ) < o any for Methods, htasample a that | 10 emeasured We IAppendix , SI o 18 no. −6 2 D et Bon- test, e Fig. see ; u and |D | 4547 D at D l .

BIOPHYSICS AND STATISTICS INAUGURAL ARTICLE COMPUTATIONAL BIOLOGY Fig. 2. Degree of divergence in tumor and normal sample populations. (A) Single features for RNA-seq gene expression, CpG methylation, and microarray expression. (B) The 50 FGSs in MSigDB “Hallmark” collection. Note that, by design, the level of divergence in the normal populations is extremely small. In both cases, divergence is learned from a separate normal population.

computed a multitissue baseline profile from 50% of available ties estimated as described above. This reverse approach also non-T samples, and then, estimated divergence probabilities identified phenotype-specific genes (see SI Appendix, Figs. S2 separately from all of the T samples and the remaining half and S3). of non-T samples. For example, to identify prostate-specific For gene sets, we considered the breast cancer subtypes and genes, we used half of nonprostate samples pooled together the 50 Hallmark gene sets. Since we are estimating only a limited to estimate a nonprostate baseline. This analysis yielded 433 number of supports, we imposed virtually zero normal diver- prostate-specific significant genes (Bonferroni-adjusted P value gence by selecting α = .001, with the risk of underestimating ≤ 0.05), encompassing many known prostate markers including cancer subtype divergence probabilities if in fact the normal KLK3 (a.k.a. the prostate specific antigen; see Fig. 3B). Then, breast samples are contaminated. Fig. 4 shows a heat map of the we reversed the roles: We took tissue T as the baseline, with probabilities of divergence for each of the 50 pathways. The full support estimated from 50% of the T samples and probabili- table of values is in SI Appendix, Table S1.

A B 1.00 1.00 PCAT4T KLK3 PRAC1 LOCO 10005080808046 CDC454 KLKL P1 KLK2 NKX3−1−1 MCM100 RAD51AP1P KLK4 TRPPM8 POLLQQ NCAPG HOXB13 PCGCGEM1 EPR1R1 DIAPH3H3H PAGE4 LOC1019279 4828 0.75 GSG2 TCFC 19 0.75 TGM4 FAM7M72A ZNF761 SOCS2−2 AS1 SLCL 45A3 PRCAT474477 TPMM3P99 LSAS MP−−AS1 0.50 0.50 C20orff103 Prostate Luminal B FGD3

0.25 0.25

NEFE L 0.00 0.00

0.00 0.25 0.50 0.75 1.00 0.00 0.02 0.04 0.06 Luminal A Other

Fig. 3. Differentially divergent genes between tumor phenotypes. (A) of divergence probabilities of genes for 231 luminal A and 127 luminal B breast cancer samples in TCGA. After Bonferroni correction, 941 genes (blue triangles) have an adjusted P value ≤ 0.05 for a χ2 test comparing the two subtypes using ternary digitized data. Among these, the genes with the smallest P values are labeled. (B) Scatter plot of divergence probabilities of genes for 106 prostate and 4216 nonprostate samples in Genotype-Tissue Expression (GTEx) with respect to a nonprostate baseline. After Bonferroni correction, 433 genes (blue triangles) have an adjusted P value ≤ 0.05 in a χ2 test comparing the two groups. The 20 genes with the smallest P values and higher divergence probability in prostate are labeled.

4548 | www.pnas.org/cgi/doi/10.1073/pnas.1721628115 Dinalankara et al. Downloaded by guest on September 28, 2021 Downloaded by guest on September 28, 2021 eeto fsvntsu utps sepce,i l ae,the cases, a all on in for skin) expected, obtained and As results (breast subtypes. tissue baseline the seven sam- the shows of the as the selection 6 of used all subtypes for Fig. as tissue tissues. sets well two as other as divergence samples tissue from the same the ples the computed of from then half samples left-out selected and types. randomly baseline tissue we across the samples tissue, normal each compare For to data Types. GTEx Tissue the Across Profiles Divergence Comparing test the pathways, three only is using relative accuracy Even divergence predictors: multidimensional box black with to achieved be interpretability in can gain that considerable the Notice could regarding cross-validation. trees can by accurate that more induced analysis framework; of be our types within different is performed the classification be of in illustration an This pathways. merely hallmark 50 the of so h om“sptwyXdvret” where divergent?”, X pathway tree “Is the the in and form question tree the Each the of accuracy. learning is the for estimating for samples half the other a half luminal using built and subtypes A we luminal B pathways, between distinguish certain to tree for decision single subtypes among observed ities the for above, scheme phenotypes. color cancer breast the across in collection coded FGS Hallmark as probabilities, divergence the 4. Fig. iaakr tal. et Dinalankara nadto,gvntelredfeecsi iegn probabil- divergent in differences large the given addition, In iegn rbblte o alakFS.Teha a shows map heat The FGSs. Hallmark for probabilities Divergent 81% seFg 5). Fig. (see α saprmtrt eotmzdin optimized be to parameter a as X eas used also We a eany be can inascae ihcne rgeso rrs atrexposure. factor risk or progression dysregula- cancer global that with the suggesting associated average capture tion strongly in efficiently can baseline, trend analysis normal increasing divergence the an from revealed can- divergence pathogenic, This more phenotypes. progressively cer and relevant, clinically ple profiles. information divergence discriminating the the in criterion, retained (see this is samples by non–basal-like Clearly, cases, by 7). both by other Fig. dominated the In one and aggre- ternary. clusters, samples well-separated basal-like the and major, to quantile two obtain clustering both we spectral data, applied training We the gated and data. data ternary quantile the the dif- used for subtypes we test most basal-like); cancer Kruskal–Wallis and 100 the breast HER2-enriched, the B, four luminal test. determined the A, (luminal for and we genes training data, expressed into training ferentially divided the evenly on and Based was randomly divergence were normal ples of taken fraction were data. at samples controlled the ternary tumor and the and TCGA, and (baseline) from normal data all quantile above, the As on of for results transform the clustering divergence compared spectral we the particular, of Phenotypes. In effect learning. Disease unsupervised the Between considered we Profiles Next, Divergence of Comparison and origin embryological differ- same (see the Similarly, types share cellular cells. brain glands the adipose of mammary regions of the ent since amounts expected substantial be contain might instance, tissue labeled were For adipose samples breast origin. as misclassified all tissue virtually and that fact always composition the were type type num- 90% tissue cell smallest above of reflected predictions accuracies the incorrect general, provides Moreover, In that observed. genes. type sample divergent A tissue of then samples. the ber and remaining as type the classified tissue of is each tissue-type from the using samples baselines classifying analyzed tissue-specific selected we deriving randomly hypothesis, GTEx, 50 of this in type types test tissue tissue To the 48 type. predict unknown to gene of used tissue-specific samples be that suggests (e.g., could This not baselines 6B). do expression that Fig. (e.g., those in than brain features 6A) and divergent Fig. skin in fewer tissue common have adipose share baseline and that breast the tissues with other types from cell Samples are types tissue tissues. GTEx divergence; S4 of Fig. degrees varying different the displayed from baseline samples subtypes the the while as divergence, declared least the subtype showed tissue the of samples remaining ye.Frec niiulsml,tease oteqeto s“E”if “NO” “YES” and is divergent question is the CHECKPOINT”) “G2M to otherwise. (e.g., answer pathway the indicated sample, the individual each For types. 5. Fig. utemr,w oprddvrec rfie cosmulti- across profiles divergence compared we Furthermore, the related how on depend divergence of degrees relative The eiinte o eaaiglmnlAadBbes acrsub- cancer breast B and A luminal separating for tree Decision hw diinlrslswt h ulls favailable of list full the with results additional shows α = .01 i.S5). Fig. Appendix, SI PNAS eaiet h nietasrpoe Sam- transcriptome. entire the to relative | a ,2018 1, May | o.115 vol. | IAppendix, SI o 18 no. χ 2 etfor test | 4549

BIOPHYSICS AND STATISTICS INAUGURAL ARTICLE COMPUTATIONAL BIOLOGY A B and batch effects. This is also the first step in “quantile normal- 10000 ization” (24) and in the work on “relative expression analysis.” The latter includes the top scoring pair (TSP) and k-TSP algo- 6000 7500 rithms for distinguishing between two disease phenotypes based on the expression ordering between two genes (25) and between 4000 5000 k pairs of genes (26, 27), as well as the “RankComp” approach (28), which aims at identifying dysregulated gene pairs starting 2000 2500 from a pool of stable pairs precompiled using normal samples. Within-profile rank normalization was extended to “differen- size of divergent set of divergent size size of divergent set of divergent size tial variation” between pathways using Hamming distance to a 0 0 phenotype-specific rank template (29) and using the Kendall-tau or “swap” distance between rank vectors (30, 31). Concerning “outliers” in omics data, an early example is “Can- cer Outlier Profile Analysis” (COPA) (6, 7) for gene expression: Brain−Cortex Brain−Cortex First, each gene is standardized across samples from a pheno- type of interest, and then those genes that display an outlier Adipose−Visceral Adipose−Visceral Brain−Cerebellum Brain−Cerebellum Skin−Sun Exposed Skin−Sun Exposed profile over a subset of samples are selected. This COPA frame- Skin−Not Sun Exposed Skin−Not Sun Exposed work can also be extended to model heterogeneity in integrated Adipose−Subcutaneous Adipose−Subcutaneous Breast−Mammary Tissue Breast−Mammary Tissue tissue tissue data modalities, and separate gene set statistics can be applied to determine whether outliers are enriched in multidimensional Fig. 6. Divergence profile comparison between normal tissue types. Using gene sets (11). (A) breast and (B) sun-exposed skin tissue as baselines, divergence sets All of the current methods to quantify dysregulation require are computed for multiple normal tissue subtypes available in GTEx with prior knowledge of the sample phenotypes and a population RNA-seq gene expression profiles. Half of the available samples were used of samples from each phenotype. Divergence is a sample-level to estimate a normal subtype-specific profile, and the divergence of all property that can be applied to a single sample without knowl- remaining samples was computed with respect to this baseline. edge of phenotype and thus is not directly comparable to these methods. The approach perhaps most closely related to diver- For instance, gene expression divergence increased in prostate gence is the “Anti-Profiles” method introduced in refs. 10, 32 cancer with increasing Gleason grade: the average size of the and applied to cancer diagnosis; genes that deviate from a nor- divergent set was 883 for tumors with a primary Gleason grade of mal profile are identified after standardization across samples 3, 1241 for a primary Gleason grade of 4, and 1490 for a primary and after prefiltering for hypervariability with respect to normal Gleason grade of 5. Similar results were also observed in breast samples. In contrast, our method requires no across-sample stan- cancer with increasing grade, in the progression from colon ade- dardization or prefiltering, making it amenable to the analysis of noma to carcinoma, and for DNA CpG methylation divergence individual samples prospectively collected. Moreover, similar to in lung cancer, which increased with tobacco-smoke exposure outlier analysis, the antiprofile approach requires subsequent set (see SI Appendix, Fig. S6 A–D). Notably, the level of devia- statistics for pathway analysis. Divergence applies to any subset tion from the normal baseline is strongly correlated between of features treated as a single entity using the same statistical the- distinct omics modalities derived from the same biological sam- ory. Both of these differences further limit direct comparison of ples, suggesting global genome-wide dysregulation in both DNA divergence to the antiprofile approach. and RNA in cancer; see, for example, the case of prostate Enormous biological variation is another barrier. It has been cancer gene expression and methylation in SI Appendix, Fig. consistently observed that in many disease conditions there is a S7A (Spearman correlation 0.59). Repeating this analysis with high degree of variation in the omics profiles from sample to sam- breast cancer data yielded a Spearman correlation of 0.51 (SI ple. Observations of individual samples and genomic features at Appendix, Fig. S7B). Overall, these results suggest that the diver- high-resolution may obscure patterns of dysregulation. Further- gence framework is capturing a sparse yet biologically meaning- more, omics studies usually focus on population-level statistics ful snapshot of an omics sample and its dysregulation from a healthy state, which can be observed across distinct omics data modalities. AB Discussion 2 Combing through large datasets to make meaningful biologi- 1 cal inferences has become the raison d’ˆetre for much of the current work in . Often the goal is the discovery 0 of biomarkers or molecular signatures (e.g., gene sets) of clin- PC2 −1 ical utility, and a great many of these based primarily on gene expression have been proposed over the last 15 years. However, −2 many high-throughput omics classifiers are learned from rela- −3 tively small cohorts of samples using population-level estimates −3 −2 −1 0 1 2 −4 0 4 8 that are highly sensitive to variability, and therefore, the resulting PC1 biomarkers are often difficult to reproduce and validate (21– 23). As a result, leveraging such “gene signatures” has proven subtype cluster difficult; in fact, the majority of such biomarkers are unable Basal−like 1 Non−basal−like 2 to meet the rigorous criteria required to be admitted to clini- cal use. The work here is aimed at ameliorating some of these Fig. 7. Clustering digitized data. Results of spectral clustering of Breast barriers. TCGA RNA-seq data in (A) quantile form and (B) digitized form. The Sensitivity to platform and preprocessing is one important bar- resulting clusters discriminate basal-like samples from all other subgroups. rier. Divergence coding begins with the initial conversion of raw Digitized data preserve biological information separating the different feature values to within-sample ranks to minimize preprocessing subtypes similarly to quantile data.

4550 | www.pnas.org/cgi/doi/10.1073/pnas.1721628115 Dinalankara et al. Downloaded by guest on September 28, 2021 Downloaded by guest on September 28, 2021 2 faiB ea ,Fri J(04 erigdseuae ahasi acr from cancers in pathways dysregulated Learning (2014) EJ Fertig pair D, scoring Geman B, top Afsari with 12. combined analysis set gene Outlier (2013) al. et can MF, variability on Ochs based signatures 11. expression Gene (2015) HC Bravo W, Dinalankara 10. iaakr tal. et Dinalankara a to given interest profile. Then, of ranks. profile to within-sample raw normalizing making profile every by space converting requires from quantile by baseline begin comparisons we a meaningful Here, from and divergence robust of concept The Conclusion assessment risk clinical selection. therapy including and medicine, pre- personal to and applications disease for cision well-suited a meaningful, and directly of is interpretable, state inherently deviation physiologic normal, the the from quantifying sample In and mechanism. biological capturing underlying contrast, any with findings rules statistical decision the and reconciling prevents often research. methods to future matics of divergence subject the linking be also as will such issue This refinements covariates. require to popu- how may Such especially lations populations. distribution, baseline baseline heterogeneous issue the accommodate related of from complexity A result the features. of also is of might number subsets power large high-dimensional of a analyzing loss for cancer A in allows samples. strong overconservative out-of-support so being is signal even divergence that the However, that being power. show of of results side loss our the corresponding of on a because err with and, probably overconservative, precision will , lack interpoint definitely large will divergence of tion study. further requires topic the this indeed resemble but is case, this to that the likely suggest more experiments space, Preliminary much truth. large-dimensional be ground neighborhood will a a estimator of support by of manifold our supported number small-dimensional particular, are “effective” a in data the of problem; the on complex When heavily a depend dimensions. is will balls accuracy of the unions with here, sions work the in property. of contrast, sample a set In is another. a expression” altered “differential to identify significantly phenotype is to distribution one used marginal from is the which testing for are inter- statistical genes of distributions instance, classes between their For compared est. of and data properties the features from and treated estimated The are variables, classes. methylation) random CpG more as levels, or transcript two (e.g., from measured samples from derived .MCl N pa ,Jfe A ilo J rzryR 21)Tegn xrsinbar- expression gene The (2011) data. RA Irizarry microarray MJ, for Zilliox HA, code Jaffee K, bar Uppal MN, expression McCall gene 9. A analysis. (2007) expression RA gene Irizarry differential MJ, for Zilliox sums 8. Outlier factor (2006) transcription T ETS Hastie and R, TMPRSS2 Tibshirani of fusion 7. Recurrent (2005) al. statistical et A SA, (2002) Tomlins E 6. dense, Gabrielson personal, R, using Anbazhagan individuals ES, 108 Garrett-Mayer of study G, wellness Parmigiani preterm A 5. (2017) spontaneous al. a et of ND, validation Price and 4. Development (2016) al. et GR Saade 3. strat- risk post-prostatectomy augments genomics Tissue-based (2016) characteriza- al. molecular et the AE, for Ross classifier proteomic 2. blood-based A (2013) al. et Xj, Li 1. ial,tecmlxt n lc o aueo aybioinfor- many of nature box black and complexity the Finally, defini- our samples, baseline of number small a using When dimen- high in distribution a of support the of Estimating oe eeaigpbi aarpstre obgnctlgn h ua n murine and human the cataloging begin transcriptomes. to repositories data public Leveraging code: Methods cancer. prostate in genes cancer. in Methodol classification Stat B molecular expression-based for framework clouds. data dynamic women. asymptomatic 633.e24. in predictor men. high-risk delivery and intermediate- of cohort history natural 69:157–165. a in ification nodules. pulmonary of tion d gmA omniE a K hoX,vnLahvnT(pigrBerlin (Springer T Laarhoven analysis. van variability differential XM, Zhao JK, Hao 47–58. E, pp Formenti Berlin), Heidelberg, A, Ngom eds activity. ics, pathway of biomarkers robust provides prognosis. and progression tumor predict robustly 4:911–913. 8:2–8. uli cd Res Acids Nucleic 64:717–736. a Biotechnol Nat Science c rnlMed Transl Sci acrInform Cancer 310:644–648. 39:D1011–D1015. 35:747–756. 5:207ra142. 13:61–67. mJOse Gynecol Obstet J Am atr eonto nBioinformat- in Recognition Pattern acrInform Cancer 14:71–81. ttScSer Soc Stat R J 214:633.e1– u Urol Eur Nat 4 osa M rzryR,AtadM pe P(03 oprsno normalization of Recurrent comparison A work: (2003) TP never Speed M, may Astrand RA, biomarker Irizarry BM, cancer Bolstad 24. new your Why (2012) SE DNA of Kern use the 23. in Pitfalls (2003) LM McShane KK, Dobbin MD, Radmacher R, in Simon Outcomes regional 22. Patient and Predicting local for Tests of Omics-Based risk of the Review and the on subtypes Committee cancer Breast 21. (2010) al. et KD, Voduc 20. sur- breast to on according outcome treatment and recurrence and of Patterns (2013) subtypes al. et cancer O, Metzger-Filho breast 19. of Impact (2012) al. project. et (gtex) R, expression genotype-tissue Haque The 18. (2013) al. set et project. analysis J, gene pan-cancer Lonsdale atlas hallmark genome 17. database cancer The signatures (2013) al. molecular et JN, The Weinstein (2015) 16. al. et A, Liberzon 15. (2015) DW Scott 14. 3 ery ,Ws L(90 eeto fanra eairvanonparametric via behavior abnormal of Detection (1980) GL Wise L, Devroye 13. yNHNtoa acrIsiueGatR01CA200859. supported Grant was Institute research Cancer National This NIH discussions. by helpful for Pereira-Lobo Francisco ACKNOWLEDGMENTS. from from obtained available are here divergence.html analyzed data All Availability Code and Data individualized to profile challenge with divergence major care. associated patient-level patient the often a on is center adapting will divergence of work of perva- Future extent are gravity. the relative divergence that epigenomic cancer, and and with sive genomic here that shown expect As lev- high we heterogeneity. display interpersonal that diseases of human els complex other of study the and meaningful directly presence is risk. sample the assessing disease for quantify appropriate a the can in that of dysregulation method of many a with cell Therefore, normal along of can- behavior. dysregulation of collectively, considered be hallmarks can hallmarks, be remaining and to progression considered are cer pathways, instability metabolic genome in and abnormality inducing proliferation, Stimulating heterogeneity. cell is molecular Cancer such pathogenesis. to of subject paradigms particularly accepted many with is line dysregulation molecular in patient-level on emphasis The cancer. important of deal preserved. great is diver- information a biological the nonetheless, We that, complexity. construction, in demonstrated reduction by have massive a metric-based cases, provides a both transform gence using a In support in argument. the genes all values covering estimated sup- of assuming have expression baseline vectors we the the (e.g., random pathway), space since for m-dimensional interval; gene) an features an in single omics simply a individual is of as port expression the such the variables of no (e.g., random straightforward samples is requires scalar support other baseline sample for the to of given Estimation respect class. any same with of standardization Impor- labeling further distribution. probability divergence baseline sup- tantly, the underlying estimate the to population, of ability baseline the port to suitable down boils a divergence from defining samples many sufficiently h nltclfaeokpeetdhr ol eapidto applied be could here presented framework analytical The primarily disease, human to applications on focused have We ehd o ihdniyoiouloiearydt ae nvrac n bias. and on based data array oligonucleotide density Bioinformatics high for failures. methods biomarker in diversity remarkable 6101. and patterns classification. prognostic and 14–18. diagnostic for data microarray Forward Path (2012) al. et Trials, Clinical relapse. breast international IX. from and Results VIII trials disease: group node-negative study cancer lymph in subtypes cancer decades. 1855. two spanning analysis An vival: 45:580–585. Genet collection. NJ). Hoboken, Sons, & Wiley (John siaino h support. the of estimation 45:1113–1120. lnOncol Clin J elSyst Cell Ntoa cdme rs,Wsigo,DC). Washington, Press, Academies (National 19:185–193. github.com/wikum/DivergenceAnalysis utvraeDniyEtmto:Ter,Patc,adVisualization and Practice, Theory, Estimation: Density Multivariate n h eesr akgsadcd a be can code and packages necessary the and , 1:417–425. 28:1684–1691. etakKatnBaov di ud-md,and Luidy-Imada, Eddie Blagoev, Krastan thank We PNAS vlto fTasainlOis esn ere n the and Learned Lessons Omics: Translational of Evolution IMJAp Math Appl J SIAM | a ,2018 1, May lnOncol Clin J acrEieilBoakr Prev Biomarkers Epidemiol Cancer 38:480–488. 31:3083–3090. | o.115 vol. luigimarchionni.org/ alCne Inst Cancer Natl J acrRes Cancer | o 18 no. . a Genet Nat 21:1848– 72:6097– | 4551 Nat 95:

BIOPHYSICS AND STATISTICS INAUGURAL ARTICLE COMPUTATIONAL BIOLOGY 25. Geman D, d’Avignon C, Naiman DQ, Winslow RL (2004) Classifying gene expres- 29. Eddy JA, Hood L, Price ND, Geman D (2010) Identifying tightly regulated and vari- sion profiles from pairwise mRNA comparisons. Stat Appl Genet Mol Biol 3: ably expressed networks by Differential Rank Conservation (DIRAC). PLoS Comput Article19. Biol 6:e1000792. 26. Tan AC, Naiman DQ, Xu L, Winslow RL, Geman D (2005) Simple decision rules 30. Afsari B, Geman D, Fertig EJ (2014) Learning dysregulated pathways in cancers from for classifying human cancers from gene expression profiles. Bioinformatics 21: differential variability analysis. Cancer Inform 13(Suppl 5):61–67. 3896–3904. 31. Kelley DZ, et al. (2017) Integrated analysis of whole-genome ChIP-Seq and RNA-Seq 27. Marchionni L, Afsari B, Geman D, Leek JT (2013) A simple and reproducible breast data of primary head and neck tumor samples associates HPV integration sites with cancer prognostic test. BMC Genomics 14:336. open chromatin marks. Cancer Res 77:6538–6550. 28. Wang H, et al. (2015) Individual-level analysis of differential expression of genes and 32. Corrada Bravo H, Pihur V, McCall M, Irizarry RA, Leek JT (2012) Gene expression anti- pathways for personalized medicine. Bioinformatics 31:62–68. profiles as a basis for accurate universal cancer signatures. BMC Bioinformatics 13:272.

4552 | www.pnas.org/cgi/doi/10.1073/pnas.1721628115 Dinalankara et al. Downloaded by guest on September 28, 2021