Ranking Genes by Their Co-Expression to Subsets of Pathway Members

THE CHALLENGES OF SYSTEMS BIOLOGY Ranking Genes by Their Co-expression to Subsets of Pathway Members a, a,c, b Priit Adler, § Hedi Peterson, § Phaedra Agius, Juri¨ Reimand,b,d and Jaak Vilob,c aInstitute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia bInstitute of Computer Science, University of Tartu, Tartu, Estonia cQuretec Ltd., Tartu, Estonia dEMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, United Kingdom Cellular processes are often carried out by intricate systems of interacting genes and proteins. Some of these systems are rather well studied and described in pathway databases, while the roles and functions of the majority of genes are poorly understood. A large compendium of public microarray data is available that covers a variety of conditions, samples, and tissues and provides a rich source for genome-scale information. We focus our study on the analysis of 35 curated biological pathways in the context of gene co-expression over a large variety of biological conditions. By defining a global co-expression similarity rank for each gene and pathway, we perform exhaustive leave- one-out computations to describe existing pathway memberships using other members of the corresponding pathway as reference. We demonstrate that while successful in recovering biological base processes such as metabolism and translation, the global correlation measure fails to detect gene memberships in signaling pathways where co- expression is less evident. Our results also show that pathway membership detection is more effective when using only a subset of corresponding pathway members as reference, supporting the existence of more tightly co-expressed subsets of genes within pathways. Our study assesses the predictive power of global gene expression correlation measures in reconstructing biological systems of various functions and specificity. The developed computational network has immediate applications in detecting dubious pathway members and predicting novel member candidates. Key words: pathway reconstruction; gene expression; systems biology Introduction to represent genes and their interactions within a biological network is a common approach A gene carries out its function through its adopted by all main pathway databases.1–3 Al- products in one or more biological pathways, though early collections of biological pathway forming protein complexes or playing an inter- databases predate the human genome sequenc- active role in metabolism and signaling. One ing project, only a small fraction of genes are of the most pertinent tasks in systems biol- mapped to at least one pathway. The major ogy today is the identification of genetic inter- pathway databases KEGG1 and Reactome2 actions across different biological conditions. account for only 4,220 and 1,804 interact- The graphical notation of vertices and edges ing genes out of the approximately 23,000 protein-coding genes that comprise the human genome.4 Address for correspondence: Jaak Vilo, Institute of Computer Science, High-throughput gene expression data have University of Tartu, Liivi 2, 50409 Tartu, Estonia. Voice: 3725049365. + proved a rich source of genome-scale informa- [email protected] §These authors contributed equally and appear in alphabetical order. tion ever since the first studies appeared in The Challenges of Systems Biology: Ann. N.Y. Acad. Sci. 1158: 1–13 (2009). doi: 10.1111/j.1749-6632.2008.03747.x C 2009 New York Academy of Sciences. ! 1 2 Annals of the New York Academy of Sciences 1997.5,6 Some interacting proteins exhibit a entially expressed compared to cell cycle reg- global correlation in gene expression,7 allow- ulators (such as CDK6). Indeed, arguing this ing researchers to reveal potential members of from a numerical perspective, if one can iden- large protein complexes such as proteasome or tify the most correlated subset of a pathway for ribosome.8 Most interacting proteins, however, a gene, the resulting correlation score would vary in expression over different biological con- be higher than the same score for all pathway ditions and display more evident correlation members. in only a selection of conditions. For example, The secondary objective of our work is the Brand and colleagues9 show that during the identification of members and candidates of erythroid differentiation switch, MafK changes known pathways using a selection of path- its dimerization partner from Bach1 to p45 to way members as reference. We perform an ex- form the NF-E2 complex. This result highlights ploratory analysis that detects the optimal cor- the importance of exploring possible gene in- relation threshold for every gene, so that the teractions across many meaningful biological subset of genes in a pathway (that is, a subpath- conditions. The number of publicly available way) corresponding to the threshold results in gene expression datasets has grown rapidly in the best gene rank for the pathway in question. recent years, and numerous biological condi- In the process of computing subpathway-based tions can be included in a gene network study. co-expression ranks for members of different The first objective of this work is to study pathways, we produce lists of candidate genes the predictive power of gene co-expression to that exhibit strong similarity to the subpath- infer gene memberships of known human pathways. In some cases, we provide supporting ways, using a large number of gene expres- evidence for the membership of these candi- sion datasets. By conducting an exhaustive se- dates in the form of shared protein–protein in- ries of leave-one-out computation experiments teractions with the pathway and enrichments for every gene and related pathway, we re- in related Gene Ontology (GO) terms.11 Our trieve a correlation-based gene rank that re- method therefore proves to be a promising new flects the co-expression of the gene and the direction in detecting pathway member candi- pathway in many biological conditions. Our dates and assessing dubious annotations. results show that correlation ranks vary sig- Previous studies have successfully applied nificantly across different types of pathways. gene expression data to infer certain pathways Expression correlation clearly distinguishes and thus supported the biological assumption members of transcriptionally regulated basal of coregulation, similar functionality, and cor- biological processes such as metabolism and related gene expression profiles.12–16 Neverthe- translation, while failing to recover signaling less it is well known that many gene networks networks where post-translational modifica- are not recoverable by exclusively using gene tions and ligand–receptor interactions have a expression data. Other approaches have used dominant regulatory role. topological information such as gene connec- It has been argued that gene networks con- tivity.15 The method proposed in this paper tain active subpathways that are more rele- chooses a subset of the most correlated path- vant to the expression of a gene than all the way genes to compare a gene ranking without pathway members taken together.10 Genes that taking into account the existence of links be- code components for the same protein complex tween genes in the pathway subset. We show need to have similar expression patterns to keep that despite computing ranks from optimal sub- the complex intact. For example, genes in cell pathways, there are several examples of known cycle minichromosome maintenance (MCM) pathway members with uninformative expres- and origin recognition complex (ORC) form sion patterns that prohibit the recovery of such co-expressed pathway subsets that are differ- genes from expression data alone. Our method Adler et al.: Ranking Genes by Their Co-expression to Subsets of Pathway Members 3 therefore clearly highlights the shortcomings co-expression scores between P " and the rest of of exclusively using expression data for path- the genes in G. way recovery, and encourages the application As we compute the rank for every gene in of methods such as that used by Gevaert and aselectedpathway,weassessthepredictive colleagues in Reference 17 that infer networks power of gene co-expression in the context of from heterogeneous sources of evidence. this pathway. As a side result, we obtain a list of candidate genes that are not annotated to the Materials and Methods pathway but show strong similarity to its expression patterns, frequently outweighing the Gene Expression Data and Pathways co-expression values of annotated members. In this study, we use the human gene expression atlas compiled by Lukk and colleagues.18 Correlation Scores and Gene Ranks The compendium covers 6,108 manually cu- Our goal is to compute a rank r for a se- rated and quality controlled samples from dif- ∗ lected query gene p and a pathway P.The ferent biological conditions that have been ∗ rank measures the query gene’s co-expression jointly normalized using the RMA algorithm.19 with the genes within the pathway, in contrast The entire sample collection originates from to the pathway members’ co-expression with the popular AffyMetrix HG_U133A microchip all other genes in the genome. We first define a platform that includes about 12,000 genes that correlation-based similarity score to measure the av- we consider the genome. erage co-expression between a gene g and the Note that a significant fraction of the genes members of a pathway P p , ..., p .Given on the HG_U133A platform have multiple mi- 1 m that g is not part of P and= {m is the number} of croarray probe sets corresponding to a single members of P, we compute the score as follows: gene. In our analysis, we select the most favor- 1 able probe set for each gene in order to get score (g, P) corr(g, p). p P the highest correlation score between a pair of = m ∈ genes. ! This study covers all 35 human pathways In order to compute rank r∗,wesplitthe from the pathway database Reactome2 version pathway of interest into the query gene p∗ and a set of training genes P " P p∗.

Ranking Genes by Their Co-Expression to Subsets of Pathway Members

Cooperativity of Imprinted Genes Inactivated by Acquired Chromosome 20Q Deletions

Genome-Wide Parent-Of-Origin DNA Methylation Analysis Reveals The

Investigation of Candidate Genes and Mechanisms Underlying Obesity

Imprinting of the Human L3MBTL Gene, a Polycomb Family Member Located in a Region of Chromosome 20 Deleted in Human Myeloid Malignancies

1 Inventory of Supplemental Information Presented in Relation To

Proteomic Expression Profile in Human Temporomandibular Joint

Detailed Characterization of Human Induced Pluripotent Stem Cells Manufactured for Therapeutic Applications

Cellular Functions of Genetically Imprinted Genes in Human and Mouse As Annotated in the Gene Ontology

The DNA Sequence and Comparative Analysis of Human Chromosome 20

L3MBTL1 Polycomb Protein, a Candidate Tumor Suppressor in Del(20Q12) Myeloid Disorders, Is Essential for Genome Stability

BMC Proceedings Biomed Central

Molecular Targeting and Enhancing Anticancer Efficacy of Oncolytic HSV-1 to Midkine Expressing Tumors