James Wasmuth Thesis

Declaration I declare that this thesis has been composed by myself and, except where otherwise stated, is entirely my own work. James D. Wasmuth December 2005 1 Abstract Funding allocation for complete genome sequencing has a severe taxonomic bias; nearly half of the completed or draft stage genomes are vertebrates. The luxury of a full genome sequence is unlikely to be available for the vast majority of organisms, regardless of their importance in terms of evolution, health or ecology. This has lead to the investigation of genomes of an increasing number of species, especially parasites, through generating expressed sequence tags (ESTs). Over 300,000 ESTs are available for 36 species of parasitic nematodes. These species include whipworm and filarial worms, which currently infect 3 billion people, as well as a large number of plant parasites. The need for a database and analysis suite, which integrates these transcriptomes with associated annotation and metadata, led to the creation of NEMBASE (http://www.nematodes.org). The system enables comparative transcriptomic analyses across the phylum Nematoda. The focus of this thesis is the comparison of nematode proteomes. I describe my work to identify sequences and sequence features that have patterns of interest to nematode biology. Such patterns, include proteins that are unique to parasitic feeding strategy, and protein domains that have been lost in certain nematode lineages. This involved not only global comparisons of the proteomes, but also investigating the proteins domain complement from each species. One vital step in the analysis is identifying credible coding regions within the error-prone EST sequences. Robust identification of the coding regions presents an opportunity to perform comparative analysis previously confined to those working with complete genomes. 2 To achieve this, I built the translation pipeline prot4EST, a hierarchical collection of freely- available algorithms. The benchmarking showed that prot4EST produced coding region predictions that were better than its constituent algorithms. Exploring the effect of sequence composition of both the studied species and the program's training sets, improved the accuracy of prediction. A database of high quality protein translations for all nematodes studied was generated, called NemPep. This was accompanied by a collection of predicted domains (NemDom). The decoration of protein sequences with domain annotation is not trivial, especially given the incomplete nature of ESTs. It was necessary to explore domain model assignment to ensure the most accurate results. The rigorous analysis of NemPep and NemDom has revealed: 1) proteins specific to certain nematode-lineages, 2) the level and potential effects of contamination in the original cDNA libraries, 3) the extent of protein loss and domain modification in the caenorhabditid lineage. These findings are of particular importance to parasitic nematodes, as they highlight possible candidates for new anthelmintic drug targets. The methods used also offer new approaches for other complex cross-species comparisons. 3 Acknowledgements I must thank Mum, Dad and Nick for all their love through the trials and celebrations of the last 28 years. It's all worth it. Dad you were right, I should 'stop moping about and buck [my] bloody ideas up!' I'm also glad I heeded Mum telling me to 'do whatever makes [me] happy.' Nick, thank you for showing me a whole world that I want to stay well away from. The support from you three has been amazing and I love you all. The rest of my family deserves my love and thanks. Thanks for supporting me and repeatedly asking me what it was I did. By letting me describe my work to you, I reminded myself how much I enjoyed it. Big man-hugs go to Chris 'the Boy with the dancing feet' and Gaz. Thanks for always being there (well in Bristol at least) when I needed you. And for keeping my feet firmly on the floor. Thanks for the long, long nights (and occasionally days) dancing to the best of music. Some of those will live very long in my memory. My love goes to Whitney, Dale, Mel, George and Chris for being superb friends for what seems like forever. I've had some amazing times with you lot, and looking forward to more. None of this work, or what follows (what follows?), would have been possible without the teachers who have graced my presence for 23 years. You are too numerous to mention individually, but special tributes should go to Martin Watson, Terry Brooks, John Evans, James MacPherson, David Perkins, Nick Edwards, Brian 'Dr. K' Kempley, Peter Little, Ann Dell, Gary Warren, Peter Young and Emma Rand. Two people deserve huge thanks yous - Sidney Reid-Smith, you more than anyone is responsible for starting my trip into science and particularly Biology. Mark Blaxter has been my mentor and friend for the past three years, and without his guidance, support and “permission” to go to (overseas) conferences my Ph.D. would have suffered. That said if he didn't feature creep so much, I'd have finished sooner :o) To the people who I work with, thank you for putting up with me, I hope I've given a fraction back to repay the massive amount I've taken. To Ralf, Ann, John (see you soon), Al, and Martin (no more drafting - I promise) I'm pleased I shared an office with you. Thank you for furthering my programming knowledge by posing interesting problems and then in turn answering my questions. I must thank Claire, Jen, Jenna, Robin, Fran, Habib, Chris, Asher and Katelyn for reminding me where my data came from, and that there is a world of biology away from my computer. Thank you to Karen, Duff, Anjie, Katie and Henry for not groaning too often every time I popped into their office - well if you will bake cakes... The biggest of hugs goes to Constance, without whom the last 18 months would have not been so much fun. Thank you for looking after me, and reminding me to eat while I have been writing up. I'm glad that I braved that injured knee to go on that date with you. What love I have left is all yours... Finally I wish to thank the BBSRC (UK) for funding me for the last four years and Tom Hutchinson and Becky Brown at Astra Zeneca, Brixham for providing me with a CASE award for three years. My work would have been impossible without all those wonderful people who make their programs freely available and their open source philosophy. A big thank you and I hope you keep coding... 4 This thesis is dedicated to: Sidney Reid-Smith, for starting me on this crazy journey and the memory of Dr. Gary 'it's been emotional' Warren and Auntie Fiona "You have made your way from worm to man, and much in you is still worm." Friedrich Nietzsche, Thus Spoke Zarathustra 5 Table of Contents Abstract 2 Acknowledgements.......................................................................................................4 List of Figures...............................................................................................................9 List of Tables..............................................................................................................10 Abbreviations used......................................................................................................11 Nematode species and their codes..............................................................................12 Chapter One - Introduction 13 1.1 The need for more sequence...........................................................................13 1.2 Expressed sequence tags.................................................................................14 1.2.1 Processing and databasing ESTs.............................................................21 1.3 The phylum Nematoda....................................................................................24 1.3.1 Nematode systematics.............................................................................25 1.3.2 Comparative studies of the Nematoda.....................................................28 1.4 Summary of thesis: ESTs to proteinspace.......................................................30 1.4.1 Identification of coding regions...............................................................31 1.4.2 Deriving nematode proteomes.................................................................32 1.4.3 Phylogenomics of nematodes..................................................................32 1.4.4 Nematode protein domains......................................................................33 Chapter Two - Predicting coding regions from ESTs 35 2.1 Abstract...........................................................................................................35 2.2 Introduction.....................................................................................................35 2.3 Translating Expressed Sequence Tags............................................................36 2.3.1 Similarity-based methods........................................................................37 2.3.2 de novo predictions..................................................................................40 2.3.3 ab initio method.......................................................................................51 2.4 New solution – prot4EST................................................................................52 2.4.1 HSP tiling................................................................................................52 2.4.2 Polypeptide extension..............................................................................53

James Wasmuth Thesis

Downloading Them Directly from the GO

Evolution of the D. Melanogaster Chromatin Landscape and Its Associated Proteins

IISER Pune Annual Report 2015-16 Chairperson Pune, India Prof

The Drosophila Speciation Factor HMR Localizes to Genomic Insulator Sites

Structure, Function, and Evolution of a Signal-Regulated Enhancer

Promoter-Proximal Chromatin Domain Insulator Protein Beaf Mediates

Promoter-Proximal Chromatin Domain Insulator Protein BEAF Mediates Local and Long-Range Communication with a Transcription Facto

Gene Duplication, Lineage-Specific Expansion, And

Insect Transcription Factors: a Landscape of Their Structures and Biological Functions in Drosophila and Beyond

Coordinate Enhancers Share Common Organizational Features in the Drosophila Genome

Transcriptomic Data from Panarthropods Shed New Light on the Evolution of Insulator Binding Proteins in Insects

49 Annual Drosophila Research Conference • Program and Abstracts