Protocols to Capture the Functional Plasticity of Protein Domain Superfamilies
Research Department of Structural and Molecular Biology

Protocols to Capture the Functional Plasticity of Protein Domain Superfamilies

Robert Rentzsch

October 2011

SUBMITTED IN SUPPORT OF THE DEGREE OF Doctor of Philosophy

Declaration

I, Robert Rentzsch, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis.

Robert Rentzsch
October 2011

Abstract

Most proteins comprise several domains, segments that are clearly discernible in protein structure and sequence. Over the last two decades, it has become increasingly clear that domains are often also functional modules that can be duplicated and recombined in the course of evolution. This gives rise to novel protein functions. Traditionally, protein domains are grouped into homologous domain superfamilies in resources such as SCOP and CATH. This is done primarily on the basis of similarities in their three-dimensional structures. A biologically sound subdivision of the domain superfamilies into families of sequences with conserved function has so far been missing. Such families form the ideal framework to study the evolutionary and functional plasticity of individual superfamilies. In the few existing resources that aim to classify domain families, a considerable amount of manual curation is involved. Whilst immensely valuable, the latter is inherently slow and expensive. It can thus impede large-scale application.

This work describes the development and application of a fully automatic pipeline for identifying functional families within superfamilies of protein domains. This pipeline is built around a method for clustering large-scale sequence datasets in distributed computing environments.
In addition, it implements two different protocols for identifying families on the basis of the clustering results: a supervised and an unsupervised protocol. These are used depending on whether or not high-quality protein function annotation data are associated with a given superfamily. The results attained for more than 1,500 domain superfamilies are discussed in both a qualitative and quantitative manner. The use of domain sequence data in conjunction with Gene Ontology protein function annotations and a set of rules and concepts to derive families is a novel approach to large-scale domain sequence classification. Importantly, the focus lies on domain, not whole-protein function.

Acknowledgements

Thanks are due to several people who helped make my time in the Orengo lab, and in London in general, a valuable experience. From very early on until the now nearing end, my fellow PhD student and friend Jim Perkins played a big role in this – from quite practical issues, such as moving houses, sleeping in other people’s houses and playing pool in semi-accordance with the rules, to long and inspiring conversations about basically everything imaginable, including science. I apologise for having destroyed his belief in German punctuality, though. Benoit Dessailly was a great postdoctoral colleague and will hopefully remain a friend; his special sense of humour is certainly beaten by no other Belgian. Adam Reid and Ollie Redfern, who both left after my first year, were inspiring people to talk to, each in their own very special way. Thanks to Ian Sillitoe, Corin Yeats and Jon Lees for helping me to squeeze the relevant bits and bytes out of ‘their’ databases. Further, thanks to everyone else in the Orengo and Martin groups for the many random acts of kindness, most of which involved food.
I would like to particularly thank my supervisor, Christine Orengo, for her help along the way, reading manuscripts on the weekend or whilst (technically) being on holiday, and for giving me the opportunity to work in such a friendly and prestigious group. Thanks are also due to Kate Bowers, chair of my thesis committee, and David Jones, member of my thesis committee, for guidance. Tom Knight, Jesse Oldershaw and Jahid Ahmed, known in the Darwin building as ‘the IT guys’, were of great help on several occasions. If I were Fox Mulder, these would be my personal Lone Gunmen. And I may well hire Tristan Clark from UCL Computer Science too, as he was incredibly helpful on short notice when it came to cluster troubles. Back to the lower ranks, it should be noted here that I never met a fellow student in the Darwin building who would not have been friendly, helpful and easy to get along with. Actually, I think this holds for everyone I have ever met there – apart from maybe this one person, but to be honest, who would not get a bit grumpy maintaining a building that poses unprecedented challenges to climate and pest control. Concerning the failure of the latter, maybe I am even glad that the mice have never really left. Their jumping around like mad in the bins was a comforting noise in the long, otherwise solitary nights that were spent at the office during ‘peak’ times.

It would be difficult to mention all the people and things outside work that made my time in London worthwhile. Yet, I will try to name a few important ones. I enjoyed every single fruit I ever bought on my way to and from work from the guy in front of the ‘Euston Supermarket’ – ten bananas for a pound! More importantly, he was always friendly and…well, there. Rain or shine. The same goes for the Hare Krishna people who give out free food to students, homeless people and to everyone else who wants it. Again, value for (no) money paired with exceptional friendliness.
Finally, there is the Poetry Library on the Southbank. This is probably the most serene and friendly place in London. Last but not least, I want to thank the European Union and all its taxpayers for funding my research.

As acknowledgements tend to be written at a point where the word ‘deadline’ becomes perceptible in all its nasty metaphorical power, and memory weakens, there will inevitably be people who should have been mentioned here but are not. To those I apologise. Please do not think you were less important or helpful than the banana man.

One person I will certainly not forget to thank is Anna. It is never easy to live with someone who is constantly working towards a self-imposed goal, but, at the same time, it is great to know there will be someone to share the comfort with when the goal is achieved. To doing this with her, my close friends and my family, I look forward.

Abbreviations

aaRS    Aminoacyl-tRNA-synthetase
AMP     Adenosine monophosphate
ATP     Adenosine triphosphate
BP      Biological Process (in GO)
CC      Cellular Component (in GO)
DAG     Directed acyclic graph
DNA     Deoxyribonucleic acid
EM      Electron microscopy
GMP     Guanosine monophosphate
GO      Gene Ontology
GTP     Guanosine triphosphate
HIV     Human Immunodeficiency Virus
HMM     Hidden Markov model
HPC     High-performance computing
I/O     Input/Output
LUCA    Last Universal Common Ancestor (organism)
MF      Molecular Function (in GO)
NAD     Nicotinamide adenine dinucleotide
NADP    Nicotinamide adenine dinucleotide phosphate
NCBI    National Center for Biotechnology Information
NMR     Nuclear magnetic resonance (spectroscopy)
ORF     Open Reading Frame
RAM     Random Access Memory
RMSD    Root Mean Square Deviation
RNA     Ribonucleic acid
UNIX    UNiplexed Information and Computing System

Contents

Chapter 1. Introduction
1.1 Protein domains and superfamilies
1.1.1 Origin and definition of concepts
1.1.1.1 Superfamilies
1.1.1.2 Protein domains
1.1.2 The evolution of multi-domain proteins
1.1.3 Existing superfamily studies
1.2 Relationships between protein sequences
1.2.1 Pairwise relationships
1.2.1.1 Homology
1.2.1.2 Orthology and Paralogy
1.2.2 Group concepts
1.2.2.1 Superfamily
1.2.2.2 Orthologue cluster
1.2.2.3 Family
1.2.3 Application to proteins and domains
1.3 Protein function annotation
1.3.1 The Gene Ontology
1.3.2 Other systems
1.4 Bioinformatics methods
1.4.1 Sequence alignment
1.4.1.1 Pairwise sequence alignment
1.4.1.2 Multiple sequence alignment
1.4.2 Alignment