Dennis, Tristan Philip Wesley (2018) Mining Genome Data for Endogenous Viral Elements and Interferon Stimulated Genes: Insights Into Host Virus Co- Evolution
Total Page:16
File Type:pdf, Size:1020Kb
Dennis, Tristan Philip Wesley (2018) Mining genome data for endogenous viral elements and interferon stimulated genes: insights into host virus co- evolution. PhD thesis. https://theses.gla.ac.uk/30887/ Copyright and moral rights for this work are retained by the author A copy can be downloaded for personal non-commercial research or study, without prior permission or charge This work cannot be reproduced or quoted extensively from without first obtaining permission in writing from the author The content must not be changed in any way or sold commercially in any format or medium without the formal permission of the author When referring to this work, full bibliographic details including the author, title, awarding institution and date of the thesis must be given Enlighten: Theses https://theses.gla.ac.uk/ [email protected] Mining genome data for endogenous viral elements and interferon stimulated genes: insights into host virus co-evolution by Tristan Philip Wesley Dennis BSc (Hons) Submitted in fulfilment of the requirements of the degree of Doctor of Philosophy (PhD). MRC-University of Glasgow Centre for Virus Research College of Medical, Veterinary and Life Sciences July 2018 Abstract Paleovirology is the study of viruses over evolutionary timescales. Contemporary paleovirological analyses often rely on sequence data, derived from organism genome assemblies. These sequences are the germline inherited remnants of past viral infection, in the form of endogenous viral elements and the host immune genes that are evolving to combat viruses. Their study has found that viruses have exerted profound influences on host evolution, and highlighted the conflicts between viruses and host immunity. As genome sequencing technology cheapens, the accumulation of genome data increases, furthering the potential for paleovirological insights. However, data on ERVs, EVEs and antiviral gene evolution, are often not captured by automated annotation pipelines. As such, there is scope for investigations and tools that investigate the burgeoning bulk of genome data for virus and antiviral gene sequence data in the search of paleovirological insight. I used DIGS, a similarity-searched based framework, in heuristic investigations of host:virus co-evolution in three contexts: ERVs, EVEs and interferon stimulated genes (ISGs). These three projects were unified by the approach and theme: the heuristic, similarity search based mining of genome sequences for insight into host:virus co-evolution. In the first project, I identified and characterised five novel murine ERV lineages, and used murine ERV sequences to provide insight into the recent acquisition and functionalisation of murine ERVs. In the second project, I performed a comparative analysis of animal endogenous circoviruses (CVe), increasing the known past host range of family Circoviridae and finding that genus Cyclovirus is more likely to be a clade of arthropod viruses than vertebrate viruses, in conflict with recent metagenomics analyses. In the third project, I screened mammalian genomes in an attempt to identify lineage specific patterns of ISG expansion and loss, finding three novel instances of ISG loss in cetaceans and felid carnivores. Together, these datasets will be of great utility in future studies of host:virus co-evolution, and underscore the utility of searching WGS data for paleovirological insight. 1 Table of contents Abstract 1 Table of contents 2 List of figures 6 List of tables 8 Acknowledgements 9 Author’s Declaration 10 Publications included in this thesis 11 Abbreviations 12 1 – Introduction 15 1.1 – Direct paleovirology – Endogenous retroviruses 16 1.1.1 - Retroviral genome structure 17 1.1.2 - Retroviral life cycle 18 1.1.3 - Retroviral gene products 19 1.1.4 – Accessory genes 22 1.2 – ERV structure, replication and classification 24 1.2.1 - ERV structure 24 1.2.2 - ERV amplification 24 1.2.3 - Retroviral taxonomy 26 1.2.4 - Class I retroviruses 28 1.2.5 - Class II retroviruses 29 1.2.6 - Class III retroviruses 30 1.3 – ERVs and the host 31 1.3.1 – Retroviruses and disease 31 1.3.2 – Restriction factors 32 1.3.3 - Exapted sequences 32 1.3.4 – Epigenetics 33 1.3.5 - LTR based gene regulation 34 1.4 – Murine ERVs 36 1.4.1 - Class I murine ERVs 37 1.4.2 - Class II murine ERVs 38 1.4.3 - Class III murine ERVs 39 1.4.4 - ERV-like transposons 39 1.5 – Direct paleovirology: EVEs 40 2 1.5.1 – Circoviruses 41 1.5.2 - Circovirus classification 42 1.5.3 - Circovirus diversity, host range, and role in disease 43 1.6 - Indirect paleovirology – antiviral genes 45 1.6.1 - Interferon stimulated genes (ISGs). 45 1.6.2 – Virus:ISG coevolution 45 1.6.3 – Antiviral gene dynamism 46 1.7 - Using paleovirology to study host:virus co-evolution 47 1.7.1 - Phylogenetic analysis 48 1.7.2 - Sequence analysis 50 1.7.3 – The timeline of host:virus co-evolution 51 1.7.4 – Functional genomics analyses 52 1.7.5 – Study of gene evolution 53 1.8 – Mining genome data for paleovirological insight 53 1.8.1 – Similarity searches 53 1.8.2 - Introduction to DIGS 56 1.8.3 - Why use DIGS? 57 1.9 - Research scope and aims 58 2 – Materials and Methods 59 2.1 - Materials 60 2.1.1 – Sequence data 59 2.1.2 – Software and tools 62 2.2 – Methods 65 2.2.1 – DIGS 65 2.2.2 - GLUE 65 2.2.3 - Host evolutionary timelines 70 2.2.4 – Recovery of proviruses 70 2.2.5 – Provirus screening and consolidation 70 2.2.6 – Sequence analysis 71 2.2.7 – Pairwise LTR dating 71 2.2.8 – Enrichment of genomic features at ERV loci 72 2.2.9 – Chromosomal distribution of ERVs 73 2.2.10 – Orthologous ERV and EVE integrations 73 2.2.11 – CVe sequence analyses 73 2.2.12 – PCR of CVe-Pseudomyrmex –Peter Flynn 73 2.2.13 – ISG screen design 74 3 2.2.14 – ISG data processing and display 74 2.2.15 – Shared genomic organisation 75 3 – Discovery and characterisation of murine endogenous retroviruses 76 3.1 – Introduction 76 3.2 – Results 79 3.2.1 - Assessment of murine ERV diversity using phylogenetic screening 79 3.2.2 - Identification of additional murine ER`V lineages 82 3.2.3 - Evolutionary history of Gag and Env 85 3.2.4 - A survey of murine ERV counts 87 3.2.5 - Analysis of novel ERVs 89 3.2.6 - Evolutionary analysis of murine ERVs 93 3.2.7 - Functional analysis of murine ERVs 102 3.2.9 – Dating ERV integrations 106 3.3 – Discussion 109 3.3.1 – Generation of a murine ERV dataset 109 3.3.2 – A transition in murine:ERV co-evolutionary history 109 3.3.3 – Identification of ERVs potentially involved in physiological processes 111 3.3.4 – Conclusions 113 4 - Paleovirological analyses of endogenous circoviruses 114 4.1 – Introduction 115 4.2 - Results 118 4.2.1 - Identification and characterisation of animal CVe 118 4.3.2 – Identification of orthologs and construction of a timeline of CVe evolution 124 4.3.3 - Phylogenetic analysis of Circoviridae Rep 127 4.3.4 - Phylogenetic analysis of Circoviridae Cap 131 4.3.5 - Species breakdown 133 4.3.5.1 - CVe in jawless fish 133 4.3.5.2 - CVe in Actinoperygii 134 4.3.5.3 - CVe in amphibians 135 4.3.5.4 - CVe in reptiles 136 4.3.5.5 - CVe in birds 136 4.3.5.6 - CVe in mammals 137 4.3.5.6 - CVe in mites 140 4 4.3.5.7 - CVe in ants 140 4.3.5.8 - CVe in other invertebrates 141 4.4 – Discussion 143 4.4.1 – Assessment of the methods 143 4.4.2 - CVe provide information about circovirus evolution 143 4.4.3 - Mapping the host range of circoviruses 144 4.4.4 - Impact of CVe on host genome evolution 147 4.4.5 – Conclusions 148 5 – Searching for ISG expansion and loss in mammals 150 5.1 – Introduction 150 5.2 – Results 152 5.2.1 - Initial screening and results 152 5.2.2 – IFIT 157 5.2.3 - XAF1 161 5.2.4 - SERTAD1 163 5.2.5 - Other potentially deleted genes 163 5.3 – Discussion 165 5.3.1 – Assessment of the methods 165 5.3.2 – Deletions in mammalian xaf1 and ifit 166 5.3.3 – Conclusions 168 6 - General discussion 169 6.1 – Assessment of the methods 169 6.2 - Future directions 171 6.3 – Conclusions 172 Bibliography 174 Appendices 200 5 List of figures Figure 1.1 - Canonical retroviral genome structure 17 Figure 1.2 – Reverse transcription 21 Figure 1.3 - ERV structures 25 Figure 1.4 - ERV amplification strategies 26 Figure 1.5 - Unrooted pol NJ phylogeny of Retroviridae 27 Figure 1.6 - Distinguishing structural features of retrovirus groups 28 Figure 1.7 - KAP1 (TRIM28) repression of ERV expression 35 Figure 1.8 - Phylogeny of known murine ERV lineages 36 Figure 1.9 - Canonical circovirus genome structure. 43 Figure 1.10 - ERV amplification dynamics reflected in their phylogenies. 50 Figure 1.11 - ERV invasion leads to orthologous integrations 52 Figure 1.12 - Strategies for similarity searches of genomes 55 Figure 1.12 - Basic DIGS searching strategy 56 Figure 2.1 – Genome screening in DIGS. 68 Figure 2.2 - Phylogenetic screening with DIGS 68 Figure 2.3 - Entity-Relationship diagram of database generated by DIGS 69 Figure 3.1 – The mouse genome contains 13 distinct ERV lineages. 81 Figure 3.2 - Phylogenetic relationships between murine and nonmurine gammaretrovirus Pol sequences 84 Figure 3.3 - Phylogenetic relationships between murine and nonmurine Betaretroviruses 85 Figure 3.4 - Phylogenetic relationships between murine and nonmurine gammaretrovirus Gag sequences 86 Figure 3.5: Phylogenetic relationships between murine and nonmurine gammaretrovirus TM sequences 87 Figure 3.6 - Genome structures of novel murine ERV lineages 92 Figure 3.7 – ML phylogenies of murine ERVs 98 Figure 3.8 - Enrichment of ERVs relative to transcription start sites 103 Figure 3.9 - Enrichment of TF binding and H methylation at ERV loci 103 Figure 3.10 – Paired LTR estimation of murine ERV integration dates 107 Figure 3.12 – Orthologs of four murine ERV lineages 108 6 Figure 4.1 - Genome structures of 24 endogenous circovirus (CVe) elements identified in animal genomes.