Genome-Wide Quantification of New Biological and Pathological Functions of Genes and Pathways ∗ ∗ ∗ Xu Chi , Maureen A
Total Page:16
File Type:pdf, Size:1020Kb
Briefings in Bioinformatics, 00(00), 2019, 1–16 doi: 10.1093/bib/bbz093 Advance Access Publication Date: Problem solving protocol Downloaded from https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbz093/5593802 by University of Pittsburgh user on 31 October 2019 Universal concept signature analysis: genome-wide quantification of new biological and pathological functions of genes and pathways ∗ ∗ ∗ Xu Chi , Maureen A. Sartor , Sanghoon Lee , Meenakshi Anurag, Snehal Patil, Pelle Hall, Matthew Wexler and Xiao-Song Wang Corresponding author: Xiao-Song Wang, Departments of Pathology and Biomedical Informatics, UPMC Hillman Cancer Center, 5117 Centre Avenue, Pittsburgh, PA 15232. Tel.: +1-412-623-1587; Fax: 412-623-1010; E-mail: [email protected] ∗ These authors contributed equally to this work. Abstract Identifying new gene functions and pathways underlying diseases and biological processes are major challenges in genomics research. Particularly, most methods for interpreting the pathways characteristic of an experimental gene list defined by genomic data are limited by their dependence on assessing the overlapping genes or their interactome topology, which cannot account for the variety of functional relations. This is particularly problematic for pathway discovery from single-cell genomics with low gene coverage or interpreting complex pathway changes such as during change of cell states. Here, we exploited the comprehensive sets of molecular concepts that combine ontologies, pathways, interactions and domains to help inform the functional relations. We first developed a universal concept signature (uniConSig) analysis for genome-wide quantification of new gene functions underlying biological or pathological processes based on the signature molecular concepts computed from known functional gene lists. We then further developed a novel concept signature enrichment analysis (CSEA) for deep functional assessment of the pathways enriched in an experimental gene list. This method is grounded on the framework of shared concept signatures between gene sets at multiple functional levels, thus overcoming the limitations of the current methods. Through meta-analysis of transcriptomic data sets of cancer cell line models and single hematopoietic stem cells, we demonstrate the broad applications of CSEA on pathway discovery from Xu Chi is a post-doctoral researcher in the Department of Pathology, School of Medicine, at the University of Pittsburgh. He has been working in the field of bioinformatics and genomics. Current address: CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China. Maureen A. Sartor is an associate professor in the Department of Computational Medicine & Bioinformatics, Medical School, at the University of Michigan. She has been working in the field of bioinformatics and biostatistics. Sanghoon Lee is a graduate student in the Department of Biomedical Informatics at the University of Pittsburgh. His research area is bioinformatics and machine learning. Meenakshi Anurag is an assistant professor in the Department of Medicine at the Baylor College of Medicine. Her research specialties are cancer multiomics analysis and next-gen sequencing (NGS) data analysis. Snehal Patil is an application programmer in the Department of Computational Medicine & Bioinformatics at the University of Michigan. She is an expert in information science. Pelle Hall is a researcher in the Department in Computational Medicine & Bioinformatics at the University of Michigan. His research area is bioinformatics and statistics. Matthew Wexler is a graduate student in the Department of Pathology, School of Medicine at the University of Pittsburgh. He has been researching in the field of cancer genomics. Xiaosong Wang is an associate professor in the Department of Pathology and Biomedical Informatics, School of Medicine, at the University of Pittsburgh. He has been researching in the field of cancer genomics, bioinformatics, and Molecular Cancer Biology. Submitted: 11 March 2019; Received (in revised form): 23 May 2019 © The Author(s) 2019. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] 1 2 Chi et al. gene expression and single-cell transcriptomic data sets for genetic perturbations and change of cell states, which complements the current modalities. The R modules for uniConSig analysis and CSEA are available through https://github. com/wangxlab/uniConSig. Key words: quantification of genome function; disease gene discovery; causal pathway discovery Downloaded from https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbz093/5593802 by University of Pittsburgh user on 31 October 2019 Background cancer by assessing their associations with the cancer gene The identification of causal genes and pathways underlying dis- signature concepts (i.e. oncogenic pathways, interactions with eases (such as cancer) based on proteomic, genomic, transcrip- key oncogenic proteins, characteristic protein domains, specific tomic, or single-cell profiling/sequencing data sets is a daunting gene ontologies). This algorithm provides a unique quantitative task, yet critical to translational research. The first challenge estimation of new gene functions underlying cancer, which have is that genomics data can provide molecular evidence for the been successfully applied by our group to identify new cancer pathological factors contributing to certain diseases; however, it genes from genomics data sets [6–10]. Other algorithms may is often challenging for biologists to interpret the possible func- utilize network analysis, text mining, or similarity profiling to tions of an overwhelming number of aberrant genes cataloged by rank candidate genes [11–24], which generally fail to account for genomics, and prioritize those genes based on existing human the variety of ways in which genes can be related to each other. knowledge of their biological functions. This often happens in It bears notable that all gene set-based algorithms including cancer genomics studies where hundreds of candidate patho- ConSig require merging of a wide array of molecular concept logical genetic alterations are identified, making it exceedingly databases thanks to their outgrowth in recent years [25–33]. difficult to pinpoint the key causal genes and prioritize them Such data fusion, however, carries the inherent challenge of data for experimental validation. This calls for innovative algorithms redundancy because different data sources follow different rules to compute novel gene–disease or gene–function associations for the nomenclature of concepts and different categories of con- based on genome knowledge databases. Here, we define ‘gene cepts possess levels of inherent redundancy. Hence, the ability set’ as a group of genes that are functionally related, such as a of the gene set-based algorithms to identify the redundancy and set of genes that function in certain diseases (disease gene set) sieve only the effective or unique signature concepts for further or pathways (pathway gene set). Thus, this type of associations calculation is critical. Here, we developed a powerful algorithm can be generalized as ‘gene to gene set associations’. called universal ConSig (uniConSig) analysis, which has sub- Another major challenge is to interpret the pathways stantially improved performance for genome-wide quantifica- characteristic of a list of aberrant genes, such as differentially tion of gene functions based on redundant molecular concept expressed or mutated genes. Here, we define a nominal gene list databases (Figure 1A). Based on this algorithm, we further devel- of interest cataloged by genomics data as an experimental gene oped a ConSig enrichment analysis (CSEA) for the quantification list. Pathway gene set analysis, originally derived from analyzing of precise functional associations between molecular concepts gene expression data, has progressively been applied to genetic (Figure 1B) by deep interpreting their shared signature concepts data. The current gene set analysis methods have been outlined rather than assessing shared gene numbers, which cannot be in a recent review [1]. While pathway tools are vastly available carried out by any of the previous methods. The uniConSig and for gene expression data of continuous variables, only handful of CSEA algorithms directly measure the functional interconnect- tools are available for genetic data of nominal variables (Table 1). edness of genes and gene sets, which will have wide applications A majority of these tools rely on overrepresentation analysis in genomics studies, such as discovering new gene functions (ORA) that performs statistical test to assess the statistical underlying certain disease or identifying the pathways under- overrepresentation of experimental gene list compared to the lying experimentally defined gene lists. pathway gene sets and are thus severely limited by their dependence on the genes included in the experimental gene list and the pathways. To illustrate, suppose two gene sets A Materials and methods and B do not share any common genes. Even if gene sets A and B each consist entirely of genes involved in DNA repair, they Compiling the molecular concept database and training will not be found to have functional relationship by current gene lists approaches. While recent studies has attempted to resolve this To generate a comprehensive and