Prediction of Protein Function Using Statistically Significant Sub
Total Page:16
File Type:pdf, Size:1020Kb
Prediction of Protein Function Using Statistically Significant Sub-Structure Discovery by Craig Lucas Submitted in accordance with the requirements for the degree of Doctor of Philosophy. The University of Leeds School of Computing February 2007 The candidate confirms that the work submitted is his own and that the appropriate credit has been given where reference has been made to the work of others. Abstract Proteins perform a vast number of functional roles. The number of protein structures available for analysis continues to grow and, with the development of methods to predict protein structure directly from genetic sequence without imaging technology, the number of structures with unknown function is likely to increase. Computational methods for predicting the function of protein structures are therefore desirable. There are several existing systems for attempting to assign function but their use is in- advisable without human intervention. Methods for searching proteins with shared func- tion for a shared structural feature are often limited in ways that are counterproductive to a general discovery solution. Assigning accurate scores to significant sub-structures also remains an area of development. A method is presented that can find common sub-structures between multiple proteins, without the size or structural limitations of existing discovery methods. A novel measure of assigning statistical significance is also presented. These methods are tested on artifi- cially generated and real protein data to demonstrate their ability to successfully discover statistically significant sub-structures. With a database of such sub-structures, it is then shown that prediction of function for a new protein is possible based on the presence of the discovered significant patterns. i Acknowledgements ii Declarations Some parts of the work presented in this thesis have been published in the following articles: Lucas, C. and Bulpitt, A., “Automatic Identification of Key Sub-Structures for Protein An- notation”, Proceedings of the 2004 International Conference on Mathematics and Engi- neering Techniques in Medicine and Biological Sciences, CSREA Press, pp. 463–466 iii Contents 1 Introduction 1 1.1 Proteins ................................... 3 1.2 X-RayCrystallography ........................... 5 1.3 StructurePrediction. 7 1.4 ThesisOverview .............................. 8 2 Background 11 2.1 ProteinFunction............................... 11 2.2 ClassificationSystems ........................... 12 2.3 EvolutionandAncestry. 14 2.4 HomologyTransfer ............................. 15 2.5 Structural Alignment Methods . 17 2.6 Multiple Alignment Methods . 23 2.7 ChapterReview............................... 26 3 Sub-Structure Discovery 27 3.1 FunctionGrouping ............................. 28 3.2 InputData.................................. 29 3.3 Disorder, Error and Flexibility . 30 3.4 UnderlyingMatch.............................. 32 3.5 ProgressiveMatch.............................. 34 3.6 ProgressiveMatchSteps . 38 3.7 ChapterReview............................... 41 4 Algorithm Testing 42 4.1 ArtificialDataGeneration. 43 4.2 ProteinData................................. 54 4.3 ChapterReview............................... 57 iv 5 Statistical Significance 58 5.1 ScoringMethods .............................. 58 5.2 Analysing Multiple Protein Groups . 63 5.3 Results.................................... 67 5.4 ChapterReview............................... 83 6 Function Prediction 84 6.1 CombiningEvidence ............................ 84 6.2 InheritingEvidence ............................. 86 6.3 PredictionExamples ............................ 87 6.4 Evaluation.................................. 88 6.5 ChapterReview............................... 90 7 Conclusions 91 7.1 Review ................................... 91 7.2 ProjectDecisions .............................. 92 7.3 FutureWork................................. 96 7.4 ChapterReview............................... 97 Bibliography 99 A Parallel Methods 109 A.1 Method1 ..................................110 A.2 Method2 ..................................111 A.3 Conclusions................................. 113 v List of Figures 1.1 PDBGrowth ................................ 2 1.2 VanderWaalsradii ............................. 3 1.3 α-helix ................................... 5 1.4 β-sheet ................................... 6 1.5 Proteinchains................................ 7 1.6 Foldedchain................................. 8 1.7 HydrophobicCore ............................. 9 1.8 X-rayApparatus............................... 10 2.1 GeneOntology ............................... 13 2.2 SurfaceMatching .............................. 20 2.3 ConnollySurface .............................. 21 3.1 PDBFile .................................. 29 3.2 Temperature factor image of 1BE3 . 31 3.3 NMRStructureof1T50........................... 32 3.4 ExampleMatchingPatterns. 33 3.5 DiscoveryAlgorithm ............................ 35 3.6 MaximumDiameter............................. 37 3.7 Coherence.................................. 37 3.8 ConsecutiveSegments ........................... 37 3.9 Coherencecachetable ........................... 39 4.1 Distribution of inter-point distances . ...... 44 4.2 Example of an artificially generated structure . ....... 45 4.3 Artificially generated structures with a common subpattern ........ 46 4.4 CliqueSizevsTime............................. 48 4.5 ParallelMatchTiming ........................... 50 4.6 VariableCoherence ............................. 51 vi 4.7 LabelPossibilities.............................. 52 4.8 TolerancevsTime.............................. 53 4.9 TolerancevsTime.............................. 54 4.10 Multiple Structure Pattern Discovery . ...... 56 5.1 Distribution of GO examples . 65 5.2 SignificantPatterns ............................. 67 5.3 Common pattern within GO:0004181 . 69 5.4 Common pattern within GO:0045012 . 70 5.5 Common pattern within GO:0009975 . 71 5.6 Common pattern within GO:0003702 . 72 5.7 Common pattern within GO:0004759 . 73 5.8 Common pattern within GO:0004623 . 74 5.9 Common pattern within GO:0016731 . 75 5.10 Common pattern within GO:0005246 . 75 5.11 Common pattern within GO:0019239 . 76 5.12 Common pattern within GO:0015666 . 76 5.13 Common pattern within GO:0005267 . 77 5.14 Common pattern within GO:0016247 . 77 6.1 InheritingEvidence ............................. 86 6.2 1HDXPredictionResults. 87 6.3 153LPredictionResults. 88 6.4 PredictiveAccuracy............................. 89 A.1 ParallelMethod1 .............................. 110 A.2 ParallelMethod2 .............................. 111 A.3 ParallelPerformance . 112 vii List of Tables 1.1 Tableofaminoacids ............................ 4 2.1 Limitations of annotation transfer . 16 4.1 LabelsvsTime ............................... 49 4.2 LabelsvsTime ............................... 49 5.1 Example Contingency Table . 62 5.2 Example Chi-squared expected values, calculated from the information giveninTable5.1............................... 62 5.3 Summary of the match parameters used and resulting scores obtained for each GO class presented. ‘Chi-Sq’ indicates the Chi-squared score. 68 5.4 Common pattern within GO:0004181 . 78 5.5 Common pattern within GO:0045012 . 78 5.6 Common pattern within GO:0009975 . 78 5.7 Common pattern within GO:0003702 . 79 5.8 Common pattern within GO:0004759 . 79 5.9 Common pattern within GO:0004623 . 79 5.10 Common pattern within GO:0016731 . 79 5.11 Common pattern within GO:0005246 . 80 5.12 Common pattern within GO:0019239 . 80 5.13 Common pattern within GO:0015666 . 81 5.14 Common pattern within GO:0005267 . 82 5.15 Common pattern within GO:0016247 . 82 6.1 OccurrenceTables.............................. 85 viii Chapter 1 Introduction Proteins are large, biological molecules that perform a vast number of diverse roles in the cellular and biological functions of every organism. Proteins may be antibodies in the im- mune system, enzymes for catalysing chemical reactions, hormones, transport molecules, structural building blocks or may perform contractile functions in muscle tissue. They vary from simple, almost spherical forms up to more complex structures that may exhibit basic mechanical behaviour. Each protein within an organism is created from the code in a single gene. The struc- ture and functional role of the protein is determined by the code within its gene sequence. The Swiss-Prot database [7] contains protein sequences which have been manually anno- tated with their function. This database contains almost 200,000 entries as of 2005, 10,000 more than in 2004. The Protein Data Bank [25] is a database containing protein atomic structures. The PDB contains nearly 30,000 entries as of 2005, an increase of 5,000 over the previous year and the rate at which new structures are added is also increasing, as illustrated in Figure 1.1. As well as new examples of structure, there is much optimism that the number of unique structures dissimilar to other existing examples will continue to be discovered [95] for some time. The rate at which new sequences are discovered has re- sulted in the development of computational methods to assist the biochemist in assigning a functional annotation to a new protein. Currently, the rate at which new structures are determined is sufficiently low for the function to have often been determined before the molecular structure. As methods for determining structure improve in speed, the number of submitted protein structures will also increase further.