Digitizing Omics Profiles by Divergence from a Baseline
Total Page:16
File Type:pdf, Size:1020Kb
Digitizing omics profiles by divergence from a baseline INAUGURAL ARTICLE Wikum Dinalankaraa,1, Qian Keb,1, Yiran Xub, Lanlan Jib, Nicole Paganea, Anching Liena, Tejasvi Matama, Elana J. Fertiga, Nathan D. Pricec, Laurent Younesb, Luigi Marchionnia,2, and Donald Gemanb,2 aDepartment of Oncology, Johns Hopkins University School of Medicine, Baltimore, MD 21205; bDepartment of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218; and cInstitute for Systems Biology, Seattle, WA 98109 This contribution is part of the special series of Inaugural Articles by members of the National Academy of Sciences elected in 2015. Contributed by Donald Geman, March 19, 2018 (sent for review December 13, 2017; reviewed by Nicholas P. Tatonetti and Bin Yu) Data collected from omics technologies have revealed perva- relative to a baseline, may be essential to fully realize the poten- sive heterogeneity and stochasticity of molecular states within tial of high-dimensional omics assays; in principle, they pro- and between phenotypes. A prominent example of such het- vide unprecedented quantification of normal and disease-specific erogeneity occurs between genome-wide mRNA, microRNA, and variation, which can inform clinical approaches for diagnosis and methylation profiles from one individual tumor to another, even prevention of complex diseases. within a cancer subtype. However, current methods in bioinfor- To develop this high-dimensional analogue to low-dimensional matics, such as detecting differentially expressed genes or CpG clinical tests, we present a unified framework to characterize sites, are population-based and therefore do not effectively model normal and nonnormal variation by quantizing high-resolution intersample diversity. Here we introduce a unified theory to quan- measurements into a few discrete states based on divergence tify sample-level heterogeneity that is applicable to a single omics from a baseline. This simplified data representation is intended profile. Specifically, we simplify an omics profile to a digital repre- to capture most of the important biological information neces- sentation based on the omics profiles from a set of samples from sary to characterize phenotypes of interest. Briefly, suppose we a reference or baseline population (e.g., normal tissues). The state are given a particular subset of omics features, for example the of any subprofile (e.g., expression vector for a subset of genes) expression levels of any set of genes; important special cases are is said to be “divergent” if it lies outside the estimated support a single gene or a functional get set. Suppose also that sam- of the baseline distribution and is consequently interpreted as ple observations of this subset are available from a population STATISTICS “dysregulated” relative to that baseline. We focus on two cases: declared as the baseline. Then the realization of this subset of single features (e.g., individual genes) and distinguished subsets features in any sample profile is said to be “divergent” if it lies (e.g., regulatory pathways). Notably, since the divergence analy- outside the support of the baseline probability distribution. For sis is at the individual sample level, dysregulation can be analyzed instance, if the phenotypes of interest are cancers of a particular probabilistically; for example, one can estimate the probability tissue, then a natural “baseline” population is normal samples that a gene or pathway is divergent in some population. Finally, from that tissue. However, if the goal is to determine the unique the reduction in complexity facilitates a more “personalized” and features of gene regulation in particular (normal) tissue, the biologically interpretable analysis of variation, as illustrated by BIOPHYSICS AND experiments involving tissue characterization, disease detection Significance and progression, and disease–pathway associations. COMPUTATIONAL BIOLOGY Technological advances enable increasingly comprehensive stochasticity j digitization j dysregulation j cancer j precision medicine profiling of the molecular landscapes of cells, and these data can inform the personalized treatment of complex diseases. n recent decades, technological advances have enabled global Two major obstacles are the complexity of these data and the Iprofiling of genetic variants, RNA species, epigenetic marks, high degree of person-to-person heterogeneity. We develop proteins, metabolites, and other previously unknown molecular a highly simplified, personalized data representation by com- features, enabling the characterization of complex biological sys- paring the profile of an individual to the range of landscapes tems over distinct molecular domains. These high-dimensional in a baseline population, thereby mimicking basic clinical diag- measurements have been made on thousands of samples and are nostic testing for departures of selected variables from normal collectively referred to as omics data. Through complex com- levels. Moreover, our method can be applied to any data putational and statistical analyses of these data from different modality and at any level of granularity, from single fea- cell types and tissues, across individuals and model organisms, tures to any subset of features treated as a single entity, and from diseased and healthy conditions, we have substan- for example the gene expression levels in a pathway. Experi- tially enhanced our understanding of molecular mechanisms ments involve both healthy human tissues and various cancer underlying cell functioning, tissue organization, and organism subtypes. development. These data are also beginning to impact clinical Author contributions: L.M. and D.G. initiated the project; W.D., Q.K., E.J.F., N.D.P., L.Y., practice for complex diseases (1–3). L.M., and D.G. designed the experiments and interpreted the results; W.D., Q.K., Y.X., Nevertheless, omics data have yet to significantly inform the and L.J. performed research; W.D., N.P., A.L., T.M., and L.M. collected and curated data; standard of care, as was expected when introduced. In part, and W.D., E.J.F., L.Y., L.M., and D.G. wrote the paper. this limitation arises from insufficient genome-wide characteriza- Reviewers: N.P.T., Columbia University; and B.Y., University of California, Berkeley. tion of normal variation, impeding progress toward determining The authors declare no conflict of interest. whether a single omics profile reflects molecular dysregulation This open access article is distributed under Creative Commons Attribution- that is indicative of a complex disease. The use of deviation NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND). from a normal range is already widely used clinically for low- See QnAs on page 4528. dimensional laboratory tests and biomarkers, risk assessment, 1 W.D. and Q.K. contributed equally to this work. disease diagnosis and prognosis, and therapy selection. Similar 2 To whom correspondence may be addressed. Email: [email protected] or marchion@jhu. approaches are now needed for omics data modalities. Notably, edu. efforts to monitor large-scale omics data in the population are This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. now underway (4). Methods to quantify genome-wide variation, 1073/pnas.1721628115/-/DCSupplemental. especially techniques to determine where an omics profile falls Published online April 16, 2018. www.pnas.org/cgi/doi/10.1073/pnas.1721628115 PNAS j May 1, 2018 j vol. 115 j no. 18 j 4545–4552 Downloaded by guest on September 28, 2021 baseline could be all other tissues types combined. The novelty Divergence. Suppose we are given a reference joint probability distribution of our framework is that it applies to any subset of features and P0 of (Xj, j 2 J ) associated with some baseline phenotype. Of course, in prac- to any new profile by itself. tice, we only observe samples from this distribution where each sample is Divergence is a sample property. In contrast, putative discrim- an omics profile from a single data modality, and the choice of reference inating sets of features, such as “differentially expressed genes,” phenotype depends on the particular study and data modality. As will be seen in Results, the baseline phenotype need not be a single tissue type or a are defined in terms of population-level distributions. For exam- disease-free phenotype and can be any population that is meaningful under ple, given two classes or phenotypes, a gene is differentially the analysis carried out. expressed if its distribution differs in the two classes. However, Given an omics profile X and a subset S ⊂ J , we will apply the defi- S S knowing the status of a molecular feature in the context of pop- nition of divergence to the random vector U = Q (X) = (Qj(X), j 2 S). We ulations may say very little about the status of that feature for will focus on single features S = fjg (e.g., single genes) and on sets S a particular individual. Divergence then enables the prospec- with 1 jSj jJ j (e.g., gene pathways). Let m = jSj, so that US takes m S m S tive analysis of individual samples, as required for applications values in [0, 1] . We describe below how the support U0 ⊂ [0, 1] of U can be estimated from the data, resulting in a set U^S that is explic- to precision and personalized medicine. It also enables a proba- 0 bilistic analysis at the phenotype level, where the probability that itly defined. When presented with a new sample X, we define a binary variable by a specific feature (say gene), or set of features (say pathway), is ( S S 1 if Q (X) 2= U^ , divergent is a well-defined concept. Phenotypes can then be char- Z = 0 acterized based on the corresponding divergence probabilities 0 otherwise : and predicted for unlabeled samples by examining the diver- We will then say that the set S is divergent (for the considered sample) if gent set. This approach can be applied to any high-dimensional and only if Z = 1. omics dataset, yielding simplified matrices of digitized values that can be analyzed statistically at the feature, sample, or Support Estimation.