Improved Genomic Assembly and Genomic Analyses of Entamoeba Histolytica
Total Page:16
File Type:pdf, Size:1020Kb
Improved genomic assembly and genomic analyses of Entamoeba histolytica Thesis submitted in accordance with the requirements of the University of Liverpool for the degree of Doctor in Philosophy by Amber Leckenby September 2018 Acknowledgements There are many people without whom this thesis would not have been possible. The list is long and I am truly grateful to each and every one. Firstly I have to thank my supervisors Gareth, Christiane, Neil and Steve for the continuous support throughout my PhD. Particularly, I am grateful to Gareth and Christiane, for their patience, motivation and immense knowledge that helped me through the entirety of the proJect from the initial research to the writing of this thesis. I cannot have imagined having better mentors and role models. I also have to thank the staff at the CGR for their role in the sequencing aspects of this thesis. My further thanks extend to the CGR bioinformatics team, most notably Richard, Matthew, Sam and Luca, for not only tolerating the number of bioinformatics questions I have asked them, but also providing great friendship and warmth in the office. I must also give a special mention to Graham Clark at the London School of Hygiene and Tropical Medicine for sending cultures of Entamoeba and providing general advice, especially around the tRNA arrays. I would also like to thank David Starns, for his efforts troubleshooting the Companion pipeline and to Laura Gardiner for providing advice around all things methylation. My gratitude goes to the members of the many offices I have moved around during my PhD, many of which have become close friends who have got me through many bioinformatics conundrums, lab meltdowns and (some equally challenging) gym sessions. Stef, Jen, Charl and Dave – you guys kept me sane. Further, I must also include Sophie and Grace who have also made it through the past seven years at the University of Liverpool. From the first day of being an undergraduate to the final day of being a PhD student. I’m so grateful I got to navigate the adventures of student life with you both – looks like we all made it out okay in the end, right? On a personal note, I must give my warmest thanks to my family, particularly my parents for encouraging me in everything and making this endeavour possible. To my many sisters for being my personal cheerleaders as I made it to the end of this thesis. To my housemate, Adele, who was an excellent sounding board to probably a few too many rants. To John Newsham, the teacher who got me so interested in biology in the first place. And finally, to Troy for his unwavering support and belief that I could achieve such great things. i Abstract Amoebiasis is the third most common cause of mortality worldwide from a parasitic infection. It affects up to 50 million people annually, of whom 100,000 will die from the disease each year. Amoebiasis is caused by the amoeba Entamoeba histolytica, an obligate parasite of humans. Our understanding of the biology of this pathogen has been greatly advanced by the sequencing of its genome. However, the unusual nature of the genome (an extreme nucleotide composition bias, abundant repetitive elements and unknown chromosome structures/ploidy) made it particularly challenging to sequence and the resulting reference genome assembly is highly fragmented and possibly incomplete, limiting its usefulness for some analyses. New sequencing technologies can overcome some of the problems of the previous genome assembly. Here, single molecule real time (SMRT) sequencing was applied to sequence long fragments of DNA and build an improved reference genome for E. histolytica. This thesis describes the generation of sequence data and a comprehensive comparative analysis of genome assembly tools available for long-read SMRT sequencing data. This analysis showed that assembly using PacBio data only produced better quality genome assemblies than hybrid assembly approaches utilising both long- and short-read data together. The PacBio genome assembly is significantly better than the published reference genome assembly based on a range of quality metrics. The new genome assembly was annotated, revealing an increase in gene number. The spatial organisation of key virulence gene families (AIG1, Ariel-1, BspA, cysteine proteases, Gal/GalNAc lectins and STIRP families) was analysed, revealing an association of virulence gene families with transposable elements. The new assembly allowed analyses of two key, unusual features of the E. histolytica genome: the long arrays of multiple tRNA genes and the multi-copy, extra-chromosomal molecules containing the ribosomal DNA. Several lines of evidence were consistent with tRNA arrays capping chromosomes and acting as telomeres in Entamoeba. Variation among array units exists (relevant as they are used as population genetic markers), but the maJority sequence was consistently retrieved when genotyping, suggesting they may be relatively robust markers. Analysis of the rDNA episomes present in the E. histolytica strain sequenced (the HM-1:IMSS strain used for previous whole genome sequencing) revealed that one of the two rDNA episome types described in this strain has apparently been lost during in vitro culture. Genome-wide 5-methylcytosine methylation profiles for trophozoite stage parasites in culture were determined using bisulphite sequencing for the new E. histolytica genome assembly and two additional species (Entamoeba moshkovskii and Entamoeba invadens). The analyses confirmed previous reports of sparse methylation of the genome as a whole but highlighted interesting patterns of methylation. While there was virtually no methylation of genes, there was extensive methylation of transposable elements and tRNA arrays. These patterns suggest methylation functions to suppress active transposition and may play a role in the structural control of tRNA arrays, again consistent with telomeric role. The work presented here improves our understanding of the structure, content and regulation of the E. histolytica genome and provides a platform for improved future analyses for the Entamoeba research community. ii Table of Contents Acknowledgements ....................................................................................................................... i Abstract ............................................................................................................................................ ii Table of Contents ......................................................................................................................... iii Table of figures ............................................................................................................................. ix Table of tables ............................................................................................................................. xii Abbreviations ............................................................................................................................. xiv Chapter 1 – Introduction ................................................................................................... 1 1.1. Entamoeba phylogeny .......................................................................................................... 1 1.2. Entamoeba histolytica and other important Entamoeba species .......................... 3 1.2.1. The Entamoeba histolytica life cycle ....................................................................................... 4 1.2.2. Pathogenicity and treatment of amoebiasis ....................................................................... 6 1.2.3. Epidemiology of amoebiasis ................................................................................................... 10 1.2.4. Other Entamoeba species relevant to E. histolytica research ................................... 12 1.3. The power of genomics and comparative analyses amongst Entamoeba species ............................................................................................................................................ 13 1.3.1. The Entamoeba histolytica genome ..................................................................................... 14 1.3.2. Genome structure and gene content of the Entamoeba histolytica genome ...... 19 1.3.3. Gene regulation mechanisms in Entamoeba histolytica ............................................. 25 1.3.4. Comparative genomics among Entamoeba species ...................................................... 27 1.4. Improvements to the assembly of repetitive, complex genomes ...................... 28 1.4.1. Single Molecule Real Time (SMRT) sequencing ............................................................. 29 1.4.2. BioNano optical mapping ......................................................................................................... 30 1.5. Gaps still remaining in Entamoeba knowledge ........................................................ 31 1.5.1. Organisation of genes and gene families ........................................................................... 31 1.5.2. Structural features of the E. histolytica genome and associated genes ................ 32 1.5.3. Distribution of DNA methylation across the E. histolytica genome ....................... 33 1.6. Aims of thesis ....................................................................................................................... 34 Chapter 2 – SMRT sequencing and assembly of the Entamoeba histolytica HM-1:IMSS genome ..........................................................................................................