Downloaded from genome.cshlp.org on September 27, 2021 - Published by Cold Spring Harbor Laboratory Press

Exploring the human genome with functional maps

Curtis Huttenhower1,2,†, Erin M. Haley3,†, Matthew A. Hibbs4, Vanessa Dumeaux5, Daniel R. Barrett1, Hilary A. Coller3,‡, Olga G. Troyanskaya1,2,‡,*

1 Department of Computer Science, Princeton University, Princeton, NJ, 08540
2 Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, 08544
3 Department of Molecular Biology, Princeton University, Princeton, NJ, 08544
4 Jackson Laboratory, Bar Harbor, ME, 04609
5 Institute of Community Medicine, Tromsø University, Tromsø, Norway

* To whom correspondence should be addressed: [email protected], phone 609-258-7014, fax 609-258-7599
† These authors contributed equally to this work.
‡ Co-principal investigators.

Running title: Exploring the human genome with functional maps
Manuscript type: Resource
Keywords: human data integration, functional interaction network, computational predictions, disease and process associations

Abstract

Human genomic data of many types are readily available, but the complexity and scale of human molecular biology make it difficult to integrate this body of data, understand it at a systems level, and apply it to the study of specific pathways or genetic disorders. An investigator could best explore a particular protein, pathway, or disease if given a functional map summarizing the data and interactions most relevant to his or her area of interest. Using a regularized Bayesian integration system, we provide maps of functional activity and interaction networks in over 200 areas of human cellular biology, each including information from ~30,000 genome-scale experiments pertaining to ~25,000 human genes. Key to these analyses is the ability to efficiently summarize this large data collection from a variety of biologically informative perspectives: prediction of protein function and functional modules, cross-talk among biological processes, and association of novel genes and pathways with known genetic disorders. In addition to providing maps of each of these areas, we also identify biological processes active in each dataset. Experimental investigation of five specific genes, AP3B1, ATP6AP1, BLOC1S1, LAMP2 and RAB11A, has confirmed novel roles for these proteins in the proper initiation of macroautophagy in amino acid-starved human fibroblasts. Our functional maps can be explored using HEFalMp, a web interface allowing interactive visualization and investigation of this large body of information.

Supplemental material is available online at www.genome.org; results from this study and the interactive HEFalMp tool are available at http://function.princeton.edu/hefalmp.

Introduction

The completion of the Human Genome Project and the subsequent flood of genomic data and analyses have provided a wealth of information regarding the entire catalog of human genes. Comprehensive assays of gene expression, protein binding, genetic interactions, and regulatory relationships all provide snapshots of molecular activity in specific cell types and environments, but turning these biomolecular parts lists into an understanding of pathways, processes, and systems biology has proven to be a challenging task. This abundance of data can sometimes obscure biological truths: the size of the human genome, the complexity of human tissue types and regulatory mechanisms, and the sheer amount of available data all contribute to the analytical complexity of understanding human functional genomics.

To take advantage of large collections of genomic data, the data must be integrated, summarized, and presented in a biologically informative manner. We provide a means of mining tens of thousands of whole-genome experiments by way of functional maps. Each map represents a body of data, probabilistically weighted and integrated, focused on a particular biological question. These questions can include, for example, the function of a gene, the relationship between two pathways, or the processes disrupted in a genetic disorder. Functional integrations investigating individual genes' relationships have been successful with smaller data collections in less complex organisms (Date et al. 2006; Lee et al. 2004; Myers et al. 2007), although (as discussed below) it is particularly challenging to scale these techniques up to the size and complexity of the human genome. Each functional map, based on an underlying predicted interaction network, summarizes an entire collection of genomic experimental results in a biologically meaningful way.

While functional maps can readily predict functions for uncharacterized genes (Murali et al. 2006), it is important to take advantage of the scale of available data to understand entire pathways and processes. Cross-talk and co-regulation among pathways, processes, and genetic disorders can be mapped by analyzing the structure of underlying functional relationship networks. This includes the association of disease genes with (potentially causative) pathways; for example, many known breast cancer genes are involved in aspects of the cell cycle and DNA repair, and novel associations of this type can be mined from high-throughput data. Similarly, associations between distinct but interacting biological processes (e.g. mitosis and DNA replication) can be quantified by examining functional relationships between groups of genes, allowing the identification of proteins key to interprocess regulation.

The functional maps we provide for the human genome include information on protein function, associations between diseases, genes, and pathways, and cross-talk between biological processes. These are all based on probabilistic data integration using regularized naive Bayesian classifiers. Naive Bayesian systems have been used successfully to analyze protein-protein interaction (PPI) data (Rhodes et al. 2005; von Mering et al. 2007), whereas our focus is on functional relationships and the biological roles of gene products. Prior work performing functional integration in simpler organisms with smaller data collections (Date et al. 2006; Myers et al. 2007) has been similarly successful; see Supplemental Text 1 for a complete discussion. Such integrations have not previously been scaled biologically (i.e. to complex metazoans) or computationally (over very large genomic data collections) to provide a functional view of the human genome driven purely by experimental results. In addition to challenges of computational efficiency in the presence of hundreds of genome-scale datasets, naive classifiers assume that all input datasets are independent; this becomes increasingly untrue and problematic as more datasets are analyzed, resulting in a paradox of decreasing performance with increasing training data. To address this, we use Bayesian regularization (Steck et al. 2002), a process by which an observed distribution of data can be combined with a prior belief in a principled manner. Intuitively, this causes groups of datasets containing similar information to make a more modest contribution to the integration process, upweights unique datasets, and prevents overconfident predictions. Our regularization of the naive classifier parameters using a score based on mutual information up- and down-weighted appropriate subsets of data, maintaining both efficiency and accuracy.

We applied our functional maps to a specific biological question in the area of autophagy, the process by which a cell can recycle its own biomass under conditions of starvation or stress (Klionsky 2007). Among many proteins predicted to participate in this [...]

Using the system outlined in Figure 1A, we generate functional maps of predicted gene functions, pathway and process associations, and genetic disorders focused on 229 biological processes, incorporating information from ~30,000 genome-scale experiments. Within each biological area, maps are derived from a functional relationship network predicted using regularized Bayesian integration of the genomic data. The features and contents of the resulting interaction networks are analyzed to produce gene-, process-, and disease-centric functional maps specific to each biological area. We have experimentally confirmed five genes newly predicted to be active in the area of macroautophagy: AP3B1, ATP6AP1, BLOC1S1, LAMP2, and RAB11A.

Data integration for functional mapping

A functional map is a view of genomic data focused on a particular area of interest: genes, processes, diseases, and their associations and interrelationships. To derive these maps, we analyze functional relationship networks predicted based on Bayesian integration of ~30,000 genome-scale experiments (Supplemental Table 1). These are organized into 656 datasets (grouped by related microarray experiments, individual interaction databases, and so forth) and probabilistically weighted based on their functional activity in 229 biological processes of interest (e.g. autophagy, mitotic cell cycle, protein processing, etc.). As summarized in Table 1 and Supplemental Table 2, one product of this integration process is an estimate of the biological processes active in each dataset. Further, as highlighted in Table 2, over 25% of our predicted functional relationships are supported by at least 100 datasets,