Computational Approaches for Protein Function Prediction: A Survey

Gaurav Pandey, Vipin Kumar and Michael Steinbach
Department of Computer Science and Engineering, University of Minnesota
{gaurav,kumar,steinbac}@cs.umn.edu

Proteins are the most essential and versatile macromolecules of life, and the knowledge of their functions is a crucial link in the development of new drugs, better crops, and even the development of synthetic biochemicals such as biofuels. Experimental procedures for protein function prediction are inherently low throughput and are thus unable to annotate a non-trivial fraction of the proteins that are becoming available due to rapid advances in genome sequencing technology. This has motivated the development of computational techniques that utilize a variety of high-throughput experimental data for protein function prediction, such as protein and genome sequences, gene expression data, protein interaction networks and phylogenetic profiles. Indeed, within a single decade, several hundred articles have been published on this topic. This survey aims to discuss this wide spectrum of approaches by categorizing them in terms of the data type they use for predicting function, and thus to identify the trends and needs of this very important field. The survey is expected to be useful for computational biologists and bioinformaticians aiming to get an overview of the field of computational function prediction and to identify areas that can benefit from further research.

Key Words and Phrases: Protein function prediction, bioinformatics, Gene Ontology, multiple biological data types, high-throughput experimental data, data mining, non-homology based methods

Authors’ Address: Department of Computer Science and Engineering, University of Minnesota, 4-192 EE/CS Building, 200 Union Street SE, Minneapolis, MN 55414, USA

Support: This material is based upon work supported by the National Science Foundation under Grant Nos. IIS-0308264 and ITR-0325949. Access to computing facilities was provided by the Minnesota Supercomputing Institute.

Contents

1 Introduction
2 What is Protein Function?
  2.1 Functional Classification Schemes
  2.2 GO is the Way to Go!
  2.3 Discussion
3 Protein Sequences
  3.1 Introduction
  3.2 Annotation transfer from homologues: How good is it for function prediction?
  3.3 Existing Approaches Beyond Simple Homology-based Annotation Transfer
    3.3.1 Homology-based approaches
    3.3.2 Subsequence-based approaches
    3.3.3 Feature-based approaches
  3.4 Discussion
4 Protein Structure
  4.1 Introduction
  4.2 Is Structure Tied to Function?
  4.3 Existing Approaches
    4.3.1 Structural Similarity-based Approaches
    4.3.2 Three-dimensional Motif-based Approaches
    4.3.3 Surface-based Approaches
    4.3.4 Learning-based Approaches
  4.4 Discussion
5 Genomic Sequences
  5.1 Introduction
  5.2 Existing Approaches
    5.2.1 Genome-wide homology-based annotation transfer
    5.2.2 Approaches exploiting gene neighborhood
    5.2.3 Approaches exploiting gene fusion
  5.3 Comparison and Assimilation of the Approaches
6 Phylogenetic Data
  6.1 Introduction
  6.2 Existing Approaches
    6.2.1 Approaches Using Phylogenetic Profiles
    6.2.2 Approaches Using Phylogenetic Trees
    6.2.3 Hybrid Approaches
  6.3 Discussion
7 Gene Expression Data
  7.1 Introduction
  7.2 Existing Approaches
    7.2.1 Clustering-based approaches
    7.2.2 Classification-based approaches
    7.2.3 Temporal analysis-based approaches
  7.3 Discussion
8 Protein Interaction Networks
  8.1 Introduction
  8.2 The Promise of Protein Interaction Networks
  8.3 Existing approaches
    8.3.1 Neighborhood-based approaches
    8.3.2 Global optimization-based approaches
    8.3.3 Clustering-based approaches
    8.3.4 Association Analysis-based Approaches
  8.4 Discussion
9 Literature and Text
  9.1 Introduction
  9.2 Existing Approaches
    9.2.1 IR-based approaches
    9.2.2 Text mining-based approaches
    9.2.3 NLP-based approaches
    9.2.4 Keyword search
  9.3 Standardization Initiatives
    9.3.1 BioCreAtIvE
    9.3.2 TREC 2003 Genomics Track
  9.4 Discussion
10 Multiple Data Types
  10.1 Introduction
  10.2 Existing Approaches
    10.2.1 Approaches Using a Common Data Format
    10.2.2 Approaches Using Independent Data Formats
  10.3 Discussion
11 Conclusions

1. INTRODUCTION

Proteins are macromolecules that serve as building blocks and functional components of a cell, and account for the second largest fraction of the cellular weight after water. Proteins are responsible for some of the most important functions in an organism, such as the constitution of the organs (structural proteins), the catalysis of biochemical reactions necessary for metabolism (enzymes), and the maintenance of the cellular environment (transmembrane proteins).
Thus, proteins are the most essential and versatile macromolecules of life, and the knowledge of their functions is a crucial link in the development of new drugs, better crops, and even the development of synthetic biochemicals such as biofuels.

The early approaches to predicting protein function were experimental and usually focused on a specific target gene or protein, or a small set of proteins forming natural groups such as protein complexes. These approaches included gene knockout, targeted mutations and the inhibition of gene expression [Weaver 2002]. However, irrespective of the details, these approaches are low-throughput because of the huge experimental and human effort required to analyze a single gene or protein. As a result, even large-scale experimental annotation initiatives, such as the EUROFAN project [Oliver 1996], are inadequate for annotating a non-trivial fraction of the proteins that are becoming available due to rapid advances in genome sequencing technology. This has resulted in a continually expanding sequence-function gap for the discovered proteins [Roberts 2004].

In an attempt to close this gap, numerous high-throughput experimental procedures have been invented to investigate the mechanisms by which a protein accomplishes its function. These procedures have generated a wide variety of useful data that ranges from simple protein sequences to complex high-throughput data, such as gene expression data sets and protein interaction networks. These data offer different types of insights into a protein’s function and related concepts. For instance, protein interaction data show which proteins come together to perform a particular function, while the three-dimensional structure of a protein determines the precise sites to which an interacting protein binds.
Furthermore, recent years have seen these data recorded in standardized and professionally maintained databases such as SWISS-PROT [Boeckmann et al. 2003], MIPS [Mewes et al. 2002], DIP [Xenarios et al. 2002] and PDB [Berman et al. 2000].

The huge amount of data that has accumulated over the years has made biological discovery via manual analysis tedious and cumbersome. This has, in turn, necessitated the use of techniques from the field of bioinformatics, an approach that is crucial in today’s age of rapid generation and warehousing of biological data. Bioinformatics focuses on the utilization of techniques from computer science and the development of novel computational approaches for addressing problems in molecular biology and associated disciplines. Indeed, a more recently advocated path for biological research is to generate hypotheses from the results of an appropriate bioinformatics algorithm in order to narrow the search space, and then to validate these hypotheses to reach the final conclusion [Rastan and Beeley 1997; Roberts 2004]. Standard sequence comparison tools such as BLAST [Altschul et al. 1990; Altschul et al. 1997], and databases such as PROSITE [Hulo et al. 2006], Pfam [Sonnhammer et al. 1997] and PRINTS [Attwood et al. 2003] serve as testimonials to the benefits that bioinformatics can provide to molecular biology.

Following the success of computational approaches in solving important problems such as sequence alignment and comparison [Altschul et al. 1997] and genome fragment assembly [Shendure et al. 2004], and given the importance of protein function, numerous computational techniques have also been proposed for predicting protein function. Early approaches used sequence similarity tools such as BLAST [Altschul et al. 1990] to transfer functional annotation from the most similar proteins. Subsequently,
