A Domain Based Protein Structural Modelling Platform Applied in the Analysis of Alternative

A domain based protein structural modelling platform applied in the analysis of alternative splicing Su Datt Lam A thesis submitted for the degree of Doctor of Philosophy January 2018 Institute of Structural and Molecular Biology University College London Declaration I, Su Datt Lam, confirm that the work presented in this thesis is my own except where otherwise stated. Where the thesis is based on work done by myself jointly with others or where information has been derived from other sources, this has been indicated in the thesis. Su Datt Lam January 2018 Abstract Functional families (FunFams) are a sub-classification of CATH protein domain superfamilies that cluster relatives likely to have very similar structures and functions. The functional purity of FunFams has been demonstrated by comparing against ex- perimentally determined Enzyme Commission annotations and by checking whether known functional sites coincide with highly conserved residues in the multiple sequence alignments of FunFams. We hypothesised that clustering relatives into Fun- Fams may help in protein structure modelling. In the first work chapter, we demonstrate the structural coherence of domains in FunFams. We then explore the usage of FunFams in protein monomer modelling. The FunFam based protocol produced higher percentages of good models compared to an HHsearch (the state-of-the-art HMM based sequence search tool) based protocol for both close and remote homologs. We developed a modelling pipeline that, utilises the FunFam protocol, and is able to model up to 70% of domain sequences from human and fly genomes. In the second work chapter, we explore the usage of FunFams in protein complex modelling. Our analysis demonstrated that domain-domain interfaces in FunFams tend to be conserved. The FunFam based complex modelling protocol produced significantly more good quality models when compared to a BLAST based protocol and slightly better than a HHsearch based protocol. In the final work chapter, we employ the FunFam based structural modelling tool to understand the implications of alternative splicing. We focused on isoforms derived from mutually exclusively exons (MXEs) for which there is more enriched in pro- teomics data. MXEs which could be mapped to structure show a significant tendency to be exposed to the solvent, are likely to exhibit a significant change in their physio- chemical property and to lie close to a known/predicted functional sites. Our results suggest that MXE events may have a number of important roles in cells generally. Acknowledgements Undertaking this PhD has been a truly life-changing experience for me and it would not have been possible to do without the support and guidance that I received from many people. It has been an amazing experience been able to pursue my studies at the Uni- versity College London. I have gained a lot of scientific knowledge from my research degree and professional development. I would like to thank those that have been instrumental in helping me achieve and reach this stage. First and for most, I would like to express my deepest gratitude to Prof. Christine Orengo, my main supervisor. Thank you for your willingness to accept me to study and work with such a great team of people in your group. All your encouragement, guidance and support has been the powerhouse to keep me motivated throughout these four years. I will definitely miss all those weekly tutorial sessions we had. Thank you for being such a good mentor, Christine. I would like to thank the Malaysia’s Ministry of Education and the National Univer- sity of Malaysia for their financial support. I would like to thank my Thesis committee members, Prof. David Jones, Dr. Konstantinos Thalassinos and Dr. Joanne Santini for their invaluable insight and support during my thesis committee meetings. Special thanks to Dr. Sayoni Das for her constant help and motivation throughout my PhD. Also to Dr. Jonathan Lees for the suggestions and ideas that had driven me to venture into alternative splicing. Appreciation to Dr. Ian Sillitote, Dr. Tony Lewis, Dr. Natalie Dawson and Dr. David Lee for curating the CATH resource and providing computational support when I needed it. Also would like to thank Dr. Paul Ashford for his insight on structural clustering and Dr. Aurelio Moya Garcia for direction and advice on developing my protein complex modelling pipeline and providing the computational scripts to predict allosteric sites. In addition, thanks to all the other members of the Orengo group especially Camilla Pang and Tolulope Adeyelu for always being kind and helpful when I needed it. I also like to thank all my amazing friends when I am off college. I am deeply grateful to Susanna, Jing Wei, Zora for being such an amazing and warm friends. I 5 am also in debt to my fellow housemates/friends, Filsan, Jonas, Alvin, Bruna, Peter, Wagner, Jaimie for all the wonderful times spent and being there for me all the time. My friends back in Malaysia, Yi Ling, Ai Cheng, Beng Si and Lynice, for their best wishes throughout. Last but not the least, heartfelt thanks to my parents, my brothers, my sisters-in- law for their unwavering love and support at the most difficult times. Su Datt September 2017 Contents Contents 6 List of Figures 13 List of Tables 19 List of Abbreviations 20 List of Publications 23 1 Introduction 24 1.1 Experimental techniques to determine the protein structure . 25 1.2 The structures of proteins . 27 1.3 Aligning protein sequences . 30 1.3.1 Multiple sequence alignment . 34 1.4 Database searching . 36 1.4.1 Sequence based . 36 1.4.2 Profile based . 36 1.4.3 Assessing the significance of the result obtained from these pro- grams ............................... 38 1.5 Structure comparison . 39 1.5.1 Structure comparison methods . 39 1.5.2 Performance of structure comparison methods . 43 1.5.3 Structure comparison scores . 44 1.6 Protein structure classification . 46 1.6.1 CATH . 47 1.6.1.1 Gene3D . 48 1.6.1.2 Sub-classification of homologous superfamilies into functional families (FunFams) . 49 1.6.2 SCOP . 50 1.7 Overview of Thesis . 51 CONTENTS 7 2 Modelling protein monomers 52 2.1 Introduction . 52 2.1.1 Predicting protein structures . 52 2.1.2 Recent developments in structural modelling . 56 2.1.3 Assessing protein models . 57 2.1.3.1 Protein monomer model quality assessment . 58 2.1.3.2 Recent developments in model quality assessment . 61 2.1.4 Resources dedicated to large-scale comparative modelling of genome sequences . 61 2.1.4.1 ModBase . 62 2.1.4.2 SWISS-MODEL repository . 62 2.1.4.3 Protein Model Portal . 63 2.1.4.4 Genome3D initiative . 64 2.1.5 Objectives . 65 2.2 Materials and Methods . 67 2.2.1 Calculating the structural conservation of structural domains within CATH-Gene3D FunFams and CATH superfamilies . 67 2.2.2 Assessing the model quality assessment methods, query-template alignment and template selection strategy . 67 2.2.2.1 Template selection methods . 68 2.2.2.2 Comparing different model quality assessment methods 69 2.2.2.3 Comparing different query-template alignments methods 69 2.2.2.4 Comparative modelling . 70 2.2.2.5 Assessing the performance of the prediction protocols and ranking the models . 70 2.2.3 Modelling structurally uncharacterised sequences in human and fly.................................. 71 2.3 Results and Discussion . 71 CONTENTS 8 2.3.1 Structural conservation of structural domains within FunFams and across superfamilies . 72 2.3.2 Choosing the best model assessment method . 73 2.3.3 Choosing the best query-template alignment method . 74 2.3.4 Assessing the performance of FunFam and HHsearch template selection methods . 76 2.3.4.1 Close homologues with sequence identity ≥50% . 77 2.3.4.2 Close homologues with sequence identity 30%-50% . 78 2.3.4.3 Remote homologues with sequence identity <30% sequence identity . 78 2.3.4.4 Which protocol selects a higher proportion of good templates than the other protocol . 80 2.3.5 Modelling human and fly genomes . 81 2.3.5.1 Analysis of model quality for human and fly genes us- ing FunFams and HHsearch (CATH) . 82 2.3.5.2 Analysis involved FunFams and all the HHsearch li- braries . 84 2.4 Conclusions . 87 2.4.1 Future directions . 88 2.4.2 Uses of structural modelling in experimental studies . 89 2.4.2.1 Facilitation of cryo-EM density map fitting with homol- ogy models . 89 2.4.2.2 Integrative structural biology . 90 3 Modelling protein-protein interactions 91 3.1 Introduction . 91 3.1.1 Structural modelling of protein complexes . 99 3.1.1.1 Template-free modelling/docking . 99 3.1.1.2 Template based modelling . 101 CONTENTS 9 3.1.2 Community-wide protein complex modelling assessment exper- iment . 105 3.1.2.1 Assessing the performance of modelling methods . 106 3.2 Objectives . 107 3.3 Materials and Methods . 107 3.3.1 Investigating the conservation of protein-protein interfaces in CATH-Gene3D FunFams . 107 3.3.1.1 Generating a library of CATH-Gene3D FunFam domain- domain interfaces . 107 3.3.1.2 Calculating the conservation of protein-protein interfaces within FunFams and between FunFams across a superfamily . 108 3.3.2 Modelling protein complexes - Template based modelling (MOD- ELLER) . 111 3.3.2.1 Complex modelling pipelines . 111 3.3.2.2 Comparing different ways of selecting the best templates112 3.3.2.3 Benchmark dataset used in the comparison of the Fun- Fam protocol with the Interactome3D protocol (Chain- chain interactions) . 113 3.3.2.4 Benchmark dataset used in the HHsearch comparison (DDIs) .

A Domain Based Protein Structural Modelling Platform Applied in the Analysis of Alternative

Atnea1-Identification and Characterization of a Novel Plant Nuclear Envelope Associated Protein

Identification of Plant SUN Domain-Interacting Tail Proteins and Analysis of Their Function in Nuclear Positioning

Nesprins: from the Nuclear Envelope and Beyond

New Methods for the Prediction and Classification of Protein Domains

The Nuclear Envelope: Target and Mediator of the Apoptotic Process Liora Lindenboim1, Hila Zohar1,Howardj.Worman2 and Reuven Stein1

Table S4. Trophoblast Differentiation-Associated Genes

A Stochastic Point Cloud Sampling Method for Multi-Template Protein Comparative Modeling

Human Induced Pluripotent Stem Cell–Derived Podocytes Mature Into Vascularized Glomeruli Upon Experimental Transplantation

The Emerin-Binding Transcription Factor Lmo7 Is Regulated by Association with P130cas at Focal Adhesions

Sharmin Supple Legend 150706

Connecting the Nucleus to the Cytoskeleton by SUN–KASH

Emodel-BDB: a Database of Comparative Structure Models Of