Computational Analysis of Protein Function Within Complete Genomes
Total Page:16
File Type:pdf, Size:1020Kb
Computational Analysis of Protein Function within Complete Genomes Anton James Enright Wolfson College A dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom. Email: [email protected] March 7, 2002 To My Parents and Kerstin This thesis is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text. This thesis does not exceed the specified length limit of 300 pages as de- fined by the Biology Degree Committee. This thesis has been typeset in 12pt font using LATEX2ε accordingtothe specifications defined by the Board of Graduate Studies and the Biology Degree Committee. ii Computational Analysis of Protein Function within Complete Genomes Summary Anton James Enright March 7, 2002 Wolfson College Since the advent of complete genome sequencing, vast amounts of nucleotide and amino acid sequence data have been produced. These data need to be effectively analysed and verified so that they may be used for biologi- cal discovery. A significant proportion of predicted protein sequences from these complete genomes have poorly characterised or unknown functional annotations. This thesis describes a number of approaches which detail the computational analysis of amino acid sequences for the prediction and analy- sis of protein function within complete genomes. The first chapter is a short introduction to computational genome analysis while the second and third chapters describe how groups of related protein sequences (termed protein families) may be characterised using sequence clustering algorithms. Two novel and complementary sequence clustering algorithms will be presented together with details of how protein family information can be used to detect and describe the functions of proteins in complete genomes. Further research is described which uses this protein family information to analyse the molec- ular evolution of proteins within complete genomes. Recent developments in genome analysis have shown that the computational prediction of protein function in complete genomes is not limited to pure sequence homology meth- ods. So called genome context methods use other information from complete genome sequences to predict protein function. Examples of this include gene location or neighbourhood analysis and the analysis of the phyletic distri- bution of protein sequences. In chapter four a novel method for predicting whether two proteins are functionally associated or physically interact is de- scribed. This method is based on the detection of gene-fusion events within complete genomes. During this research many novel tools and methods for genome analysis, data-visualisation, data-mining and high-performance bio- logical computing were developed. Many of these tools represent interesting research projects in their own right, and formed the basis from which the research carried out in this thesis was conducted. Within the final chapter a selection of these novel algorithms developed during this research is described. Preface This thesis describes work carried out at the European Bioinformatics In- stitute (EBI) in Cambridge, UK, between October 1998 and February 2002. The EBI is an outstation of the European Molecular Biology Laboratory (EMBL). I was the recipient of an EMBL predoctoral fellowship to work with Dr. Christos Ouzounis in the Computational Genomics Group. There are many individuals who have provided assistance and encourage- ment throughout this work. I would first like to thank Professor Kenneth Wolfe and Professor David McConnell at the Smurfit Institute of Genetics in Trinity College Dublin, where I had previously been an undergraduate. While at Trinity I was given the best scientific education one could hope for, spanning classical genetics, mathematics and bioinformatics. I was encour- aged at Trinity to accept an internship with Digital Equipment Corporation in my final year. It was at Digital that I discovered how bad my programming skills actually were, and through the patient efforts of Dr. Eamonn O’Toole managed to rectify the situation somewhat. There are many people at the European Bioinformatics Institute who through their friendship and support have made this work an enjoyable and rewarding experience. Firstly, my supervisor Dr. Christos Ouzounis, whose patient support and scientific knowledge have greatly aided this research. The other members of the computational genomics group have helped me over the years in countless ways and have also had the dubious pleasure of discovering all of the bugs in software I have written. James Cuff and Michele Clamp helped me get on my feet as a bioinformatician by patiently providing advice and assistance on an almost daily basis. I’d also like to thank Steven Searle, who has helped me debug countless programs, and whose sage-like wisdom of all things computational still terrifies me. Stijn Van Dongen, the inventor of the MCL algorithm, deserves special mention for writing such a beautiful algorithm, and for helping me understand how it works. Finally, and not least, the members of the EBI ’A-Team’, especially Kerstin Dose, who keep the EBI running and who have sorted out countless problems for me. iv Contents Summary iii Preface iv 1 Introduction 1 1.1 Protein Sequence Alignment andSimilarityDetection..................... 3 1.1.1 The Needleman-Wunsch Algorithm . 4 1.1.2 TheSmith-WatermanAlgorithm............ 8 1.1.3 TheFASTAAlgorithm.................. 9 1.1.4 TheBLASTAlgorithm.................. 12 1.1.5 Statistics for Sequence Alignment Searching . 14 1.1.6 Profile Searching: PSI-BLASTandHMMER................ 22 1.2 Computational Biology Data Sources . 26 1.2.1 Nucleotide Sequence Databases . 26 1.2.2 ProteinSequenceDatabases............... 29 1.2.3 Protein Domain and Family Databases (Pfam&InterPro).................... 31 2 Algorithms for Automatic Protein Sequence Clustering 33 2.1GenomicProteinFamilyAnalysis................ 35 2.2ProteinSequenceClusteringTechniques............ 37 2.2.1 Multi-DomainProteins.................. 38 2.2.2 Orthology, Paralogy and Evolution . 40 2.2.3 AutomationandScaling................. 42 2.2.4 PreviousMethods..................... 44 2.3GeneRAGE............................ 49 2.3.1 InitialSteps........................ 49 2.3.2 SymmetrificationoftheMatrix............. 51 2.3.3 Detection of Multi-Domain Proteins . 52 v 2.3.4 ClusteringtheSimilarityMatrix............. 53 2.3.5 ValidationandTesting.................. 55 2.3.6 AlgorithmImplementation................ 58 2.3.7 Conclusions........................ 58 2.4Tribe-MCL............................ 60 2.4.1 Introduction........................ 60 2.4.2 Markov Clustering of Sequence Similarities . 62 2.4.3 Application of the MCL Algorithm to Biological Graphs 67 2.4.4 ValidationoftheAlgorithm............... 70 2.4.5 Large-Scale Family Detection . 72 2.4.6 Algorithm Implementation and Availability . 75 2.4.7 Conclusions........................ 75 3 Analysis of Protein Families 77 3.1 Exploration of Protein Families using GeneRAGE............................ 78 3.1.1 Detection of Novel Archaeal Domains . 78 3.1.2 Transcription Associated Protein Family Analysis.......................... 84 3.1.3 DetectionofSR-DomainProteins............ 89 3.2 Large-Scale Detection of Protein Families using Tribe-MCL . 93 3.2.1 Family Annotation of the Draft Human Genome . 93 3.3 Tribes: A Database of Protein Families . 99 3.3.1 A Complete Genomes Database . 100 3.3.2 Construction of the Tribes Database . 103 3.3.3 Protein Family Analysis using the Tribes Database . 106 4 Genomic Analysis of Protein Interaction 125 4.1 Computational Detection of Protein-ProteinInteractions...................126 4.1.1 Structural Biology Approaches . 126 4.1.2 Sequence and Genomic Context Approaches . 127 4.2 Prediction of Protein Function and Protein-Protein interaction using GeneFusionevents........................131 4.2.1 An Algorithm for the Detection of Gene Fusion Events 134 4.2.2 Application and Validation of the Algorithm using Four CompleteGenomes....................137 4.3 Exhaustive Detection of Protein-ProteinInteractions...................143 vi 4.3.1 Computational Protocol for Exhaustive Detection of GeneFusionEvents....................145 4.3.2 Results...........................148 4.3.3 Data Storage & Availability . 159 4.3.4 Discussion.........................159 5 Methods for Genome Analysis and Annotation 164 5.1 CAST: Detection of Compositionally BiasedRegionsinProteinSequences..............165 5.1.1 Introduction........................165 5.1.2 AlgorithmConcept....................166 5.1.3 Detection of Low-Complexity Residues . 168 5.1.4 FilteringProcedure....................169 5.1.5 Implementation......................169 5.2 BioLayout: A Visualisation Algorithm for Protein Sequence Similarities............................173 5.2.1 GraphLayoutTheory..................173 5.2.2 GraphLayoutAlgorithms................174 5.2.3 The BioLayout Algorithm . 175 5.2.4 ImplementationandUsage................177 5.3 TextQuest: Automatic Classification of Medline Abstracts . 183 5.3.1 Introduction........................183 5.3.2 Text-MiningApproaches.................184 5.3.3 TheTextQuestAlgorithm................185 5.3.4 DocumentClusteringResults..............188 5.3.5 Implementation......................191 5.4 Consensus Annotation of Protein Families . 192 5.4.1