Human Protein Function Prediction: Application of Machine Learning for Integration of Heterogeneous Data Sources

Human Protein Function Prediction: application of machine learning for integration of heterogeneous data sources Anna Lobley A dissertation submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy of University College London. Department of Biochemistry University College London July 2009 1 I, Anna Lobley, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis. 2 Abstract Experimental characterisation of protein cellular function can be prohibitively expensive and take years to complete. To address this problem, this thesis focuses on the development of com- putational approaches to predict function from sequence. For sequences with well characterised close relatives, annotation is trivial, orphans or distant homologues present a greater challenge. The use of a feature based method employing ensemble support vector machines to predict indi- vidual Gene Ontology classes is investigated. It is found that different combinations of feature inputs are required to recognise different functions. Although the approach is applicable to any human protein sequence, it is restricted to broadly descriptive functions. The method is well suited to prioritisation of candidate functions for novel proteins rather than to make highly accurate class assignments. Signatures of common function can be derived from different biological characteristics; interactions and binding events as well as expression behaviour. To investigate the hypothesis that common function can be derived from expression information, public domain human microarray datasets are assembled. The questions of how best to integrate these datasets and derive features that are useful in function prediction are addressed. Both co-expression and abundance information is represented between and within experiments and investigated for correlation with function. It is found that features derived from expression data serve as a weak but significant signal for recognising functions. This signal is stronger for biological processes than molecular function categories and independent of homology information. The protein domain has historically been coined as a modular evolutionary unit of protein function. The occurrence of domains that can be linked by ancestral fusion events serves as a signal for domain-domain interactions. To exploit this information for function prediction, novel domain architecture and fused architecture scores are developed. Architecture scores rather than single domain scores correlate more strongly with function, and both architecture and fusion scores correlate more strongly with molecular functions than biological processes. 3 The final study details the development of a novel heterogeneous function prediction approach designed to target the annotation of both homologous and non-homologous proteins. Support vector regression is used to combine pair-wise sequence features with expression scores and domain architecture scores to rank protein pairs in terms of their functional similarities. The target of the regression models represents the continuum of protein function space empirically derived from the Gene Ontology molecular function and biological process graphs. The merit and performance of the approach is demonstrated using homologous and non-homologous test datasets and significantly improves upon classical nearest neighbour annotation transfer by sequence methods. The final model represents a method that achieves a compromise between high specificity and sensitivity for all human proteins regardless of their homology status. It is expected that this strategy will allow for more comprehensive and accurate annotations of the human proteome. 4 Acknowledgements First and foremost I would like to acknowledge my supervisors Prof. D T Jones and Prof. C A Orengo for their encouragement and excellent guidance throughout my project. I would also like to thank members of the Jones and Orengo groups especially Drs S. Lise and O. Redfern for helpful discussions. I acknowledge both UCL and CS department computing services for allowing me to gobble many CPU hours, abuse high memory machines and regularly destroy their exquisite computing infrastructure in order to complete my work. Lastly and most importantly, I acknowledge close friends, family (Dad especially for proof read- ing) and Dr. Richard Myers for his support and patience. Contents 1 Introduction 16 1.1 The importance of protein function ........................ 16 1.2 The need for automated methods ......................... 17 1.3 Function annotation schemes .......................... 18 1.3.1 Multifun, FunCat and KEGG hierarchical schemes . 19 1.3.2 The Gene Ontology ........................... 19 1.3.3 Modelling function similarity ....................... 27 1.4 Automated function prediction methods ...................... 30 1.4.1 Homology based approaches ....................... 31 1.4.2 Non-homology based approaches ..................... 41 1.4.3 Integrated function prediction approaches . 55 1.5 Thesis aims .................................. 61 2 Characterising the system: the Gene Ontology and Human proteome annotations 63 2.1 Chapter aims ................................. 63 2.2 Shape and structure of the Ontology Graphs .................... 64 2.3 Human proteome annotations .......................... 65 2.4 Annotation specificity and completeness ..................... 69 2.5 Growth rates for human sequence annotations ................... 70 2.6 Datasets for benchmarking ........................... 73 2.7 Chapter Summary ............................... 75 3 Quantifying homology based annotation transfer 79 3.1 Chapter introduction and aims .......................... 79 3.2 Conducting homology searches ......................... 82 3.2.1 Datasets ................................ 82 3.2.2 Scoring sequence similarity ....................... 82 3.2.3 Scoring annotation transfers ....................... 85 3.3 Global scoring for annotation transfer ...................... 87 3.3.1 Comparing Molecular Function and Biological Process annotation transfer . 89 3.3.2 Performance of BLAST and PSI-BLAST algorithms . 92 3.3.3 Within and between species transfer .................... 93 3.4 Annotation specificities ............................. 93 3.5 Sources of errors ................................ 96 3.6 Local scoring and functional heterogeneity . 101 3.7 Discussion .................................. 106 4 Feature based function prediction 111 4.1 Chapter aims ................................. 111 4.2 Designing features encoding disorder . 113 4.2.1 Functional analysis of disordered human sequences . 113 4.2.2 Encoding strategy for disorder features . 115 4.2.3 Disorder features in context with other features . 118 4.3 Support Vector Classification of Function . 127 4.3.1 Training and testing datasets . 127 4.3.2 Kernel choice and parameter optimisation . 128 4.3.3 Function Category Classification Results . 129 4.3.4 Assessing the importance of different features in function prediction . 133 4.4 Benchmarking against ProtFun method . 137 4.4.1 FFPred Server ............................. 140 4.5 Chapter Summary ............................... 148 5 Designing pairwise features for function prediction 150 5.1 Introduction and aims .............................. 150 5.2 Defining a Function Similarity Measure . 151 5.2.1 Selection of a semantic similarity measure . 152 5.2.2 Function similarity measures . 155 5.3 Feature design for heterogeneous data . 155 5.3.1 Sequence Similarity . 157 5.3.2 Protein-protein Interactions . 159 5.3.3 Topology ............................... 160 5.3.4 Cellular Localisation . 165 5.3.5 Domain content and domain fusions . 167 5.3.6 Microarray expression information . 184 5.3.7 Characterising feature relationships . 200 5.4 Chapter summary ............................... 202 6 Combining features for function prediction: Performance evaluation and benchmark 205 6.1 Introduction and aims .............................. 205 6.2 Methods ................................... 207 6.2.1 Vector space integration . 207 6.2.2 Data source integration method . 210 6.2.3 Support Vector Regression training . 213 6.2.4 Integrating complex features . 216 6.3 Results and model application . 222 6.3.1 Technical performance of the different model integration approaches . 223 6.3.2 Practical assessment of model quality in annotation transfer . 224 6.3.3 Establishing the value of different data sources . 227 6.3.4 Annotation of uncharacterised human sequences . 232 6.4 Chapter Discussion ............................... 236 7 Discussion 240 Appendix I 248 Appendix II 263 Glossary 268 Bibliography 272 List of Figures 1.1 Example of Molecular Function graph taken from (Ashburner et al. 2000) . 22 1.2 Example of Biological Process graph taken from (Ashburner et al. 2000) . 23 1.3 Example of Cellular Component graph taken from (Ashburner et al. 2000) . 24 1.4 Semantic similarity example ........................... 28 1.5 Schematic of phylogenomics approach. ...................... 36 1.6 Phylogenetic profiling analysis .......................... 38 1.7 Diagrammatic representation of domain fusion and fissions . 40 1.8 Domain content vector representation ....................... 42 1.9 Schematic of the ProtFun method ........................ 44 1.10 Unsupervised clustering for microarray data .................... 47 1.11 Similarity measures for co-expression .....................

Load more