Functional Classification of Protein Domain Superfamilies for Protein

Functional classification of protein domain superfamilies for protein function annotation Sayoni Das A thesis submitted for the degree of Doctor of Philosophy October 2016 Institute of Structural and Molecular Biology University College London Declaration I, Sayoni Das, confirm that the work presented in this thesis is my own except where otherwise stated. Where the thesis is based on work done by myself jointly with others or where information has been derived from other sources, this has been indicated in the thesis. Sayoni Das October 2016 Abstract Proteins are made up of domains that are generally considered to be independent evolutionary and structural units having distinct functional properties. It is now well established that analysis of domains in proteins provides an effective approach to understand protein function using a ‘domain grammar’. Towards this end, evolutionarily-related protein domains have been classified into homol- ogous superfamilies in CATH and SCOP databases. An ideal functional sub- classification of the domain superfamilies into ‘functional families’ can not only help in function annotation of uncharacterised sequences but also provide a use- ful framework for understanding the diversity and evolution of function at the domain level. This work describes the development of a new protocol (FunFHMMer) for identifying functional families in CATH superfamilies that makes use of sequence patterns only and hence, is unaffected by the incompleteness of function annotations, annotation biases or misannotations existing in the databases. The result- ing family classification was validated using known functional information and was found to generate more functionally coherent families than other domain-based protein resources. A protein function prediction pipeline was developed exploiting the functional annotations provided by the domain families which was validated by a database rollback benchmark set of proteins and an independent assessment by CAFA 2. The functional classification was found to capture the functional diversity of superfamilies well in terms of sequence, structure and the protein-context. This aided studies on evolution of protein domain function both at the superfamily level and in specific proteins of interest. The conserved positions in the functional family alignments were found to be enriched in catalytic site residues and ligand- binding site residues which led to the development of a functional site prediction tool. Lastly, the function prediction tools were assessed for annotation of moonlighting functions of proteins and a classification of moonlighting proteins was proposed based on their structure-function relationships. Acknowledgements The last four years of my life as a PhD student has been a remarkable experience. It would not have been possible for me to finish my doctoral research without the support and guidance that I received from many people. First and foremost, I would like to express my sincere gratitude to my super- visor, Prof. Christine Orengo for her invaluable guidance, patience and encour- agement throughout the period of my doctoral research and for giving me the opportunity to work in such a friendly and prestigious group. I am very grateful to her for being an excellent teacher and mentor. I would like to acknowledge UCL for the UCL Overseas Research Scholarship. I would like to thank my Thesis committee members, Dr. Andrew Martin and Prof. Snezana Djordjevic for their for their valuable insights and support during my thesis committee meetings. I would also like to thank Ishita Khan and Prof. Daisuke Kihara for helpful discussions on moonlighting proteins. I would like to thank all the past and present members of the Orengo group for always being kind and helpful and for making my stay in the group such a pleasant experience. In particular, thanks to Dr. Ian Sillitoe and Dr. Natalie Dawson for their help and advice throughout my research. Special thanks to Dr. David Lee and Dr. Jon Lewis for their help and support during CAFA 2. Thanks to Dr. Tony Lewis, Dr. Romain Studer and Dr. Paul Ashford for their help and advice with any computational or programming questions. Thanks are also due to Dr. Rob Rentzsch for his help when I first started with my research work. Special thanks to Francesco Carbone, Ivana Pilizota, Su Datt Lam, Tom Northy, Sonja Lehtinen, Saba Ferdous, Avneet Saini, Millie Pang for many amazing times. I would like also like to thank Jesse from the Biosciences IT and David Gregory and Tristan Clark from the CS Cluster Support for their timely support and help throughout my research work. Finally, a big thank you to everyone in the Room 627 and the rest of Institute of Structural and Molecular Biology for always being friendly and helpful. My fellow PhD student and best friend Priscilla has also played a big role in ACKNOWLEDGEMENTS 5 making my stay in London an enjoyable experience. I am ever so grateful to her for being there for me and for always being a source of hope and positivity. Special thanks to Arghya for being patient, understanding, encouraging and for his efforts for teaching me to be more pragmatic. Thanks to Ambalika and Kalpana for their love, understanding and support throughout my research work. I would like to thank Prateek, Pavi, Challenger, Abhishek, Anjul and Tanvi for many a wonderful times in the last four years. Also, thanks to all my other friends for their best wishes. I would like to thank my cousin, aunt and uncle for their love and for making me feel at home in London. Last but not the least, heartfelt thanks to my parents, my brother and my sister-in-law for their unwavering love and support at the most difficult times. Special thanks to the little bundle of joy in our family, Tara, who is growing up at an alarming rate and makes me smile whenever I see her latest picture or video. I am grateful to them for having faith in me. Finally, I would like to dedicate this thesis to my family. Contents Contents 6 List of Figures 12 List of Tables 18 List of Abbreviations 19 List of Publications 21 1 Introduction 22 1.1 Protein function . 23 1.1.1 Multi-faceted function of proteins . 25 1.1.2 Function annotation resources . 25 1.1.2.1 Enzyme Commission number . 26 1.1.2.2 Gene Ontology . 26 1.1.3 Widening function annotation gap . 30 1.2 Bioinformatics methods and protocols . 31 1.2.1 Protein sequence analysis . 31 1.2.1.1 Sequence alignments . 32 1.2.1.2 Identification of conserved residues . 36 1.2.1.3 Identification of specificity or functional determinants 39 1.2.1.4 Database searching methods . 41 1.2.1.5 Sequence alignment profiles . 42 1.2.2 Protein structure analysis . 44 1.2.2.1 Structural alignments . 44 1.2.3 Protein classification resources . 45 1.2.3.1 Sequence-based protein classifications . 46 1.2.3.2 Structure-based protein classifications . 47 1.3 Functional diversity in domain superfamilies . 52 1.3.1 Mechanisms of functional divergence . 52 CONTENTS 7 1.3.1.1 Structural mechanisms . 53 1.3.1.2 Molecular tinkering . 55 1.3.1.3 Different multi-domain contexts . 56 1.3.1.4 Promiscuity and moonlighting . 59 1.3.1.5 Combination of mechanisms . 59 1.3.2 Capturing diversity in superfamilies . 60 1.4 Overview of thesis . 60 2 FunFHMMer: functional classification of domain superfamilies 62 2.1 Background . 62 2.1.1 Clustering methods for protein sequences . 62 2.1.1.1 Hierarchical clustering . 63 2.1.1.2 Partitioning clustering . 63 2.1.1.3 Graph-based clustering . 64 2.1.1.4 Greedy incremental clustering . 65 2.1.2 Automated classification of protein families . 65 2.1.3 Quality assessment of automated classification methods . 67 2.1.3.1 Structure-Function Linkage Database (SFLD) . 68 2.1.4 Functional classification of CATH superfamilies . 70 2.1.4.1 GeMMA algorithm for clustering domain sequences 71 2.1.4.2 DFX protocol for functional classification . 74 2.2 Aims and Objectives . 75 2.3 Implementation . 75 2.3.1 Development of a protocol for functional classification using sequence patterns . 75 2.3.1.1 TPP-dependent enzyme superfamily as a prelimi- nary test case . 76 2.3.1.2 Exploiting specificity-determining positions in MSAs 76 2.3.1.3 Prediction of SDPs in TPP-dependent enzyme families . 78 CONTENTS 8 2.3.2 FunFHMMer algorithm . 79 2.3.2.1 Parameters affecting analysis of functional coherence of alignments . 81 2.3.2.2 Functional Coherence Index (FC)......... 86 2.3.3 Modification of the GeMMA tree . 88 2.3.4 Generation of CATH FunFams using FunFHMMer . 90 2.3.5 FunFam model generation and mapping of FunFam sequence and structural relatives . 90 2.3.6 Assessment of Functional Purity of FunFams . 92 2.3.6.1 TPP-dependent enzyme superfamily . 92 2.3.6.2 Structure-Function Linkage Database (SFLD) superfamilies . 93 2.3.6.3 Quality of functional classification based on EC annotations. 95 2.3.7 Functionally important residues highly conserved in FunFams 97 2.4 Conclusion . 100 3 Protein function annotation using FunFHMMer 102 3.1 Background . 102 3.1.1 Current approaches for protein function prediction . 102 3.1.1.1 Sequence homology . 102 3.1.1.2 Protein family resources . 104 3.1.1.3 Gene Ontology-based prediction methods . 106 3.1.1.4 Phylocogenomics . 107 3.1.1.5 Structural homology . 108 3.1.1.6 Combination of heterogenous data . 109 3.1.2 Assessment of function prediction methods . 110 3.1.3 Critical Assessment of Function Annotation (CAFA) . 113 3.1.3.1 CAFA evaluation metrics . 113 3.1.3.2 CAFA 1, 2010-2012 .

Functional Classification of Protein Domain Superfamilies for Protein

Evolution of Biomolecular Structure Class II Trna-Synthetases and Trna

The Architecture of the Protein Domain Universe

Molecular Markers of Serine Protease Evolution

Structural Basis for the Sheddase Function of Human Meprin Β Metalloproteinase at the Plasma Membrane

Documentation and Localization of Force-Mediated Filamin a Domain

Lipid-Targeting Pleckstrin Homology Domain Turns Its Autoinhibitory Face Toward the TEC Kinases

Protein Family Classification with Neural Networks

Pyruvate-Phosphate Dikinase of Oxymonads and Parabasalia and the Evolution of Pyrophosphate-Dependent Glycolysis in Anaerobic Eukaryotes† Claudio H

Gent Forms of Metalloproteinases in Hydra

Annotation of Glycolysis and Gluconeogenesis Pathways

Pleckstrin Homology Domains and the Cytoskeleton

Biased Signaling of G Protein Coupled Receptors (Gpcrs): Molecular Determinants of GPCR/Transducer Selectivity and Therapeutic Potential