An Analysis of Intron Positions in Relation to Protein Structure, Function, and Evolution
Gordon Stuart Whamond

Biomolecular Structure and Modelling Unit
Department of Biochemistry and Molecular Biology
University College London

and

European Bioinformatics Institute
Hinxton
Cambridge

A thesis submitted to the University of London in the Faculty of Science for the degree of Doctor of Philosophy

September 2004

ProQuest Number: U642089. Published by ProQuest LLC (2015). Copyright of the Dissertation is held by the Author.

Abstract

The genes of most eukaryotic organisms are organised into coding (exon) and non-coding (intron) sequences. Since their discovery, introns have been the subject of a considerable amount of research focused on trying to locate correlations between intron positions and protein structure. However, in many cases these studies have suffered from using small datasets. In recent years, the amount of available nucleotide sequence data, protein sequence data, and protein structure data has rapidly increased, allowing more reliable and accurate conclusions to be drawn from contemporaneous analyses. The work presented in this thesis is an investigation of intron positions in relation to protein sequence, structure, and function.
To facilitate this analysis, a number of bespoke databases have been created, which include: an intron database, containing intron positions in protein coding sequences; a protein database, containing protein sequences and annotation such as secondary structure and domain boundaries; and an enzyme database, containing the position of catalytic residues in enzyme chains.

This analysis shows that intron positions are non-randomly distributed in relation to amino acids, secondary structure, and solvent accessibility. However, this can be explained principally on the basis of nucleotide bias associated with the location of introns. Most protein domains are joined by regions not associated with introns, although where an association does occur, it is unlikely to be the result of random intron insertion. There is evidence that a small set of domains may have utilised introns during evolution. In enzymes, catalytic residues have no association with introns, and most introns tend not to split pairs of catalytic residues. This is explained by a preference for catalytic residues to be close together in sequence.

These analyses suggest that biased intron distributions in relation to proteins are the result of non-random intron insertion. Although introns may have subsequently been utilised in the evolution of proteins, this involvement appears to have been limited.

Acknowledgements

The members of the Thornton group, along with numerous people at UCL and the EBI, have been encouraging throughout the course of my PhD. Although I have singled out a few people below for specific reasons, I would first like to thank everyone not mentioned here for their support and kindness over the last few years. I apologise for not having the space, time, or memory to thank everyone individually.

First and foremost I would like to thank my supervisor, Janet Thornton, for making all of this possible.
She has been a constant source of encouragement and enthusiasm throughout this work, both during the bad years and the good one. Further thanks go to my secondary supervisor, Christine Orengo; although I have not seen Christine often since the move to the EBI, when we have spoken she has always been supportive.

The bulk of the funding for this work was kindly provided by the BBSRC, with additional funds from Inpharmatica, who also provided me with access to their considerable technical expertise and numerous training courses and away days. In particular at Inpharmatica I would like to thank Mark Swindells for his constant enthusiasm and input, despite the scarcity of our meetings. While I was writing this thesis further financial support was provided by EMBL, despite my not being one of their students; for this I am very grateful.

The not inconsiderable amount of administration involved in doing a PhD has been made much easier by Alison Goodwin, Charlotte Phillips, Christine Mason, and Gillian Adams, who have all helped me with forms, contracts, and numerous other queries and questions.

During my time at UCL, many people were instrumental in cheering me up and helping with life’s little hurdles. I would like to thank: James Bray and Stuart Rison for welcoming a fresh-faced, naive student and demanding to know why I wasn’t going to circuit training; Adrian Shephard and William Valdar for encouraging my somewhat on/off (no pun intended) rock climbing; John Bouquiere and Duncan Mckenzie for sorting out a number of minor computing disasters before they became major catastrophes; and the lunchtime goldfish bowl dwellers - Alistair Grant, Andreas Brakoulias, Andrew Harrison, Annabel Todd, Brian Ferguson, Gabby Reeves, Jennifer Dawe, Mike Plevin, Oliver Redfern, Robert Steward, Shusuke Ono, Stephen Campbell, Tina Clarke, and Thomas Kabir - for helpful ideas, stupid but amusing topics of conversation, and the time to have a quick pint or two on a Friday.
Special mention goes to Daniel Buchan and Ian Sillitoe for introducing me to Jitsu, which continues to be a fantastic source of stress relief. I would also like to thank all the members of the UCL and Cambridge Jitsu clubs for allowing me to hit them with big sticks on a regular basis.

My time at the EBI has been influenced by several members of the Thornton group. I wish to thank: Richard Morris for taking the time to have several “brief” chats about statistics; Roman Laskowski for his optimism and bright outlook on life, as well as being a comprehensive source of knowledge for anything that’s not Perl; Mike Stevens (who has answered many indulgent LaTeX and chi-squared questions) and Sue Jones for being great office mates, also Marialuisa Pellegrini-Calace, who replaced Sue and ensured we didn’t end up as the quiet office, apparently; Jonathan Barker (our resident code monkey), for helping with several aspects of the ENTRAP development, and still having time to tell me about some of the finer points of back-porch rocketry; Gail Bartlett for sharing the pain and introducing me to Film Wise; Hugh Shanahan, for statistical chats, numerous “quick questions”, and everything else he has done over the last few years, including the funniest moment of rage in a cinema I have ever seen.

On a broader note, I have had a good social relationship with many members of the Thornton group. The following have been involved in some of the more memorable beer-related shenanigans: Craig Porter, Eric Blanc, Eugene Schuster, Fabian Glaser, Hannes Ponstingl, Irilenia Nobeli, James Watson, James Torrance, Malcom MacArthur, Matthew Bashton, Rafael Najmanovich, Shiri Freilich, and Tim Massingham.

For most of my PhD I have lived with work colleagues in a kind of Thornton group outstation.
Of all the people I have lived with I would especially like to thank the Haslingfield three for a fantastic year: Alex Gutteridge, who consistently beats me at squash, except that once; Gareth Stockwell, a friend and housemate for almost three years, who helped build the snowman; and Kevin Murray, for his advice, work on repeats, and reminding me that “It’s cold in Moscow this time of year”. More recently, Ella Hinton and Andrew Pick have been great housemates, understanding the stress and being willing participants of numerous “Big Mondays”.

Although the support of my colleagues has been invaluable, this work would never have been completed without the unfaltering understanding of my friends and family. I have devoted a great deal of time over the last few years to missing birthdays, reunions, stag parties, etc. It is a testament to my friends that they still talk to me and are still patient. While I can’t thank them all, I would especially like to mention Alison Morrow, Daniel Valente, Daren Fox, Dayan Trott, Katie Prentice, Richard Grange, and Stephanie Polgreen. Also Rachel Howarth, who had to put up with a lot but has always been there for me, when she is in the country, that is.

Finally, I would like to thank Charlotte Ellis for two years of patience at my constant inability to spend more than a day with her before worrying about the lack of work being done.

Funding

Funding for this research was provided by the Biotechnology and Biological Sciences Research Council (BBSRC) along with additional support from Inpharmatica through a CASE award.

For my parents, thanks for everything.

In memory of Clifford Loveday, who would have been interested.

Contents

Abstract
Acknowledgements
Funding
Dedication
Contents
List of Figures
List of Tables
List of Abbreviations
1 Introduction
1.1 Overview
1.2 Eukaryotic Genes