Evolutionary Impacts of Secondary Structures within the Genomes of Eukaryote-Infecting Single-Stranded DNA Viruses

Thesis presented for the degree of DOCTOR OF PHILOSOPHY in the Department of Integrative BioMedical Sciences at the UNIVERSITY OF CAPE TOWN

August 2015

Author: Brejnev Muhizi Muhire
Supervisor: A/Prof. Darren Martin

The copyright of this thesis vests in the author. No quotation from it or information derived from it is to be published without full acknowledgement of the source. The thesis is to be used for private study or non-commercial research purposes only.

Published by the University of Cape Town (UCT) in terms of the non-exclusive license granted to UCT by the author.

Abstract

Secondary structures forming through base-pairing in virus genomes have been shown to regulate several processes during viral replication cycles, including genome replication, transcription, post-transcriptional activities, protein synthesis, genome packaging, generation of viral sub-genomes and evasion of host-cell immune responses. Although computational DNA/RNA folding methods based on free energy minimisation are capable of predicting structures that form within virus genomes, these methods are not entirely accurate. Notably, many of the structures that are accurately predicted will likely have no biological importance within the genomes in which they reside, because even randomly generated single-stranded RNA/DNA sequences will form stable secondary structures. Nevertheless, with additional genome evolution analyses involving the detection of natural selection, sequence coevolution, and genetic recombination, it is possible both to validate the existence of computationally predicted structures and to infer their biological importance.
Here I implement and deploy free bioinformatics tools to (1) automate the classification of nucleotide and protein sequences into datasets useful for downstream molecular evolution analyses; (2) improve the accuracy of computational virus-genome-scale secondary structure prediction; (3) enable the identification of biologically relevant secondary structures using signals of purifying selection, coevolution and recombination within aligned sequence datasets; and (4) enable efficient visualisation of structural and selection data for better characterisation of individual secondary structural elements. Using these tools I carried out large-scale studies that predicted and characterised novel functional secondary structures that potentially regulate transcription, translation, gene splicing, and replication within the genomes of eukaryote-infecting ssDNA viruses (Circoviridae, Anelloviridae, Parvoviridae, Nanoviridae, and Geminiviridae). I show that purifying selection tends to be stronger at base-paired sites than at unpaired sites and, wherever mutations are tolerable within paired regions, I demonstrate strong associations between base-pairing and complementary coevolution. Finally, I show that the recombinant genomes of some, but not all, eukaryote-infecting ssDNA virus groups display weak evidence of both homologous and non-homologous recombination breakpoints preferentially occurring at genome sites that minimally disrupt secondary structures. Altogether, these results suggest that natural selection acting to maintain important biologically functional secondary structural elements has been a major process during the evolution of eukaryote-infecting ssDNA viruses.

Acknowledgements

I wish to thank Associate Professor Darren Martin, who supervised this research and ensured my intellectual progress and financial stability over the course of my PhD degree.
Under Darren’s supervision, I quickly learned a large amount of biology (a challenge to any mathematician wanting to transform into a bioinformatician) and I was fortunate to spend a lot of time working closely with him, which significantly enhanced my skills in software development and scientific writing: skills crucial to my future bioinformatics career. I thank the University of Cape Town (UCT) for the International & Refugee Scholarship awarded to me. This was my first award, without which I would not have managed to join the Bioinformatics programme. I acknowledge the Carnegie Corporation and the Poliomyelitis Research Foundation for funding this research. Special thanks to the Head of our Research Group, Associate Professor Nicky Mulder. Her guidance, courage and kindness have been an inspiration throughout my postgraduate studies and will continue to be in my future research endeavours. Special thanks to Michael Golden for making tremendous contributions toward this work; his close collaboration played an influential role in improving my programming skills and aptitude. A special thank you to Arvind Varsani, Emil Tanov, Fredrick Nindo, Ben Murrell, Gordon Harkins, Philippe Roumagnac, Pierre Lefeuvre, Simona Kraberger, Daisy Stainton, Adérito Monjane, Rebone Meraba, Penelope Hartnady and UCT’s Computational Biology research group for your intellectual and moral contributions, which always kept me moving forward.

This thesis is dedicated to my daughter Joanne.
Acronyms

BLAST : Basic Local Alignment Search Tool
BMV : Brome mosaic virus
cp : Capsid protein gene
CP : Capsid protein
DEmARC : DivErsity pArtitioning by hieRarchical Clustering
DENV : Dengue virus
DOOSS : Data Overlaid On Secondary Structure
dN : Non-synonymous substitution rates
DNA : Deoxyribonucleic acid
dS : Synonymous substitution rates
dsDNA : Double-stranded DNA
dsRNA : Double-stranded RNA
FUBAR : Fast Unconstrained Bayesian Approximation
EMF : Enhanced Metafile Format
HCSS : High Confidence Structure Set
HCV : Hepatitis C virus
HIV : Human immunodeficiency virus
HyPhy : Hypothesis testing using Phylogenies
GARD : Genetic Algorithm for Recombination Detection
GUI : Graphical User Interface
GDI : Graphics Device Interface
kb : Kilobase
MFE : Minimum Free Energy
ML : Maximum Likelihood
mp : Movement protein gene
MP : Movement protein
MSV : Maize streak virus
NASP : Nucleic Acid Secondary-Structure Predictor
NAVA : Nucleic Acid Visualisation and Analysis
nt : Nucleotide
NW : Needleman-Wunsch
OTUs : Operational Taxonomic Units
PARRIS : Partitioning Approach for Inference of Selection
PASC : PAirwise Sequence Comparison
PCV : Porcine circovirus
PNG : Portable Network Graphics
RNA : Ribonucleic acid
RDP : Recombination Detection Program
rep : Replication-associated protein gene
Rep : Replication-associated protein
SDT : Sequence Demarcation Tool
SHAPE : Selective 2′-Hydroxyl Acylation by Primer Extension
SIV : Simian immunodeficiency virus
ssDNA : Single-stranded DNA
ssRNA : Single-stranded RNA

Table of contents

Abstract
Acknowledgements
Acronyms
Table of contents
List of tables
List of figures
Chapter 1 : Introduction
  1.1 Nucleic acid secondary structures within virus genomes
    1.1.1 Impact of secondary structure on virus evolution
    1.1.2 Computational prediction of secondary structure
    1.1.3 Experimental prediction of secondary structure
  1.2 Eukaryote-infecting single-stranded DNA viruses
    1.2.1 Diversity of eukaryote-infecting ssDNA viruses
    1.2.2 Characterised genomic secondary structures
    1.2.3 Evolution of eukaryote-infecting ssDNA viruses
  1.3 Thesis structure
Chapter 2 : Sequence Demarcation Tool (SDT): a tool for objective classification of virus genomes
  2.1 Abstract
  2.2 Introduction
  2.3 Materials and methods
    2.3.1 Implementation of SDT
    2.3.2 Sequence identity calculation
    2.3.3 Pairwise identity matrix and pairwise identity distribution plots
    2.3.4 Usage of pre-computed identity scores
    2.3.5 Creation of datasets based on sequence identities
    2.3.6 The SDT_Linux,