Utilization of Complete Chloroplast Genomes for Phylogenetic Studies
Total Page:16
File Type:pdf, Size:1020Kb
Utilization of complete chloroplast genomes for phylogenetic studies Shairul Izan Binti Ramlee Thesis committee Promotor Prof. Dr R.G.F. Visser Professor of Plant Breeding Wageningen University Co-promotors Dr M.J.M. Smulders Senior researcher, Wageningen UR Plant Breeding Wageningen University & Research Dr T.J.A. Borm Researcher, Wageningen UR Plant Breeding Wageningen University & Research Other members Prof. Dr M.E. Schranz, Wageningen University Dr G.F. Sanchez Perez, Wageningen University Dr R. Vos, Naturalis Biodiversity Center, Leiden Dr R. van Velzen, Wageningen University This research was conducted under the auspices of the Graduate School of Production Ecology and Resource Conservation Utilization of complete chloroplast genomes for phylogenetic studies Shairul Izan Binti Ramlee Thesis submitted in fulfilment of the requirements for the degree of doctor at Wageningen University by the authority of the Rector Magnificus Prof. Dr A.P.J. Mol, in the presence of the Thesis Committee appointed by the Academic Board to be defended in public on Friday 28 October 2016 at 11 a.m. in the Aula. Shairul Izan Binti Ramlee Utilization of complete chloroplast genomes for phylogenetic studies 186 pages. PhD thesis, Wageningen University, Wageningen, NL (2016) With references, with summary in English ISBN: 978-94-6257-935-4 DOI: 10.18174/390196 Table of Contents Chapter 1: General Introduction………………………………………………………………...….1 Chapter 2: De novo assembly of complete chloroplast genomes from non-model species based on a k-mer frequency-based selection of chloroplast reads from total DNA sequences…………………………………………………………………………….........................17 Chapter 3: Visual comparison of the quality of chloroplast assemblies…….……........47 Chapter 4: Phylogenetic analysis of tomato (Solanum section Lycopersicon) based on various complete chloroplast genomes and subsets thereof…….………………….....…….64 Chapter 5: Gene loss and inversions in the chloroplast of subgenus Paphiopedilum (Orchidaceae) based on 32 de novo assembled complete organellar genomes…………………………………………………………………………………………..…………...97 Chapter 6: General Discussion…………………….……………………..…………………………131 Summary……………………………………………………………………………...………….…………149 References………………………………………………………………………………..……….……..…152 Acknowledgements…………………………………………………………………..……………….…180 Curriculum Vitae…………………………………………………………………………..……….…...183 Education statement of the graduate school………..…………………………....…………….184 Chapter 1 General Introduction 1 Phylogenomics Modern evolutionary theory hypothesizes that all organisms have descended from a common ancestor, which means that all extant and extinct species are related. Phylogenetic relationships can be inferred using morphological, physiological and molecular characteristics. Molecular sequences such as DNA sequences play a key role in recent day molecular phylogenetic analysis. The structure and function of the DNA sequences and how they change over time are used to infer evolutionary relationships. As new DNA sequencing methods became available since 2000, the costs have been driven down (http://www.genome.gov/sequencingcosts/). As a result, large amounts of sequence data can now be generated cheaply for researchers to infer species relationships from. Rather than using one or a few genes to study species evolutionary relationship in the conventional approach, one can now study evolutionary relationships based on comparative analysis of genome-scale data called phylogenomics (Eisen and Fraser 2003; Lemmon and Lemmon 2013). Phylogenetic relationships can be reconstructed based on comparisons of DNA sequences and genome organisation features. In addition, rare genomic changes such as insertions and deletions in introns, retrotransposon integration, changes in gene order in the organellar genome, gene duplications and genetic code changes of the entire genome can be used as molecular markers for a wide range of taxonomical levels (species, genus, family or higher) (Rokas and Holland 2000). Phylogenetic trees reconstructed based on the conventional approach of using just one or a few genes may show conflicts (Teichmann and Mitchison 1999) due to the fact that individual genes may have gone through different evolutionary lineages. In addition, lack of sufficient phylogenetic informative variation leads to the risk of stochastic errors and poorly resolved phylogenetic trees. In contrast, phylogenomics should be able to resolve difficult phylogenies and be able to verify or overturn proposed relationships (Delsuc et al. 2005). Several empirical studies have shown the robustness of phylogenomics to resolve difficult phylogenies. For example, phylogenomic analysis using plastid sequences was able to produce strongly supported phylogenies of Araceae as discussed in Henriquez et al. (2014). Likewise, 2 in the study of Ma et al. (2014) and Pyron et al. (2014) a series of phylogenomic analyses was conducted to infer difficult phylogenies at low taxonomic levels in the temperate woody bamboo and snakes respectively. Chloroplast phylogenomics Beside the nuclear genome, plant cells contain up to two more genomes: The organellar genomes of the mitochondrion and the chloroplast (the plastome). Unlike the mitochondrial genome, chloroplast genomes or plastid genomes as referred to by some authors rarely show evidence of intra- or inter molecular recombination (Dong et al. 2012) and are therefore highly conserved in terms of gene order and content. These characteristics make the chloroplast genome an attractive tool for phylogenetic studies. Phylogenetic studies in plants mostly employ chloroplast genome sequences along with a few sequences on the nuclear genome, such as internal transcribed spacer DNA (ITS). Chloroplast DNA has been shown to provide a wealth of information on molecular variation for molecular phylogenetic studies. Early molecular phylogenetic studies using chloroplast DNA sequences were based on the comparison of restriction site polymorphism and gene order changes at a wide range of taxonomic levels (Olmstead and Palmer 1994; Jansen et al. 1998). 3 700 600 500 400 300 200 100 0 1980-1989 1990-1999 2000-2009 2010-2015 Eudicots Monocot Magnoliids Basal Chloranthaceae Figure 1: Numbers and taxonomic distribution of complete chloroplast genomes submitted to GenBank up to June 2016 Publication of the first complete chloroplast genome, that of Nicotiana tabacum (Shinozaki et al. 1986) was a defining moment in the study of chloroplast genome evolution as this enabled detailed nucleotide-level genome-wide comparisons to be made. Since then the number of complete chloroplast genomes sequenced for angiosperms deposited in the NCBI Organelle Genome Resources database has increased every year, reaching a total of 817 complete genomes in June 2016 (http://ncbi.nlm.nih.gov/genome/organelle/) (see also Figure 1). However, given the total number of angiosperm species of about 300,000 (Cowan et al. 2006), the fraction of published chloroplast genomes is concentrated on the eudicots and monocot class which is too low to fully understand chloroplast evolution (Figure 1). 4 Several groups of scientists have been focusing their efforts to develop and sequence complete chloroplast genomes to fill the most important gaps (Naito et al. 2013; Nikiforova et al. 2013; Ruhfel et al. 2014; Carbonell-Caballero et al. 2015) and the number of chloroplast genomes is expected to increase dramatically in the next few years. Brief overview of chloroplast structure and evolution The genes encoded in the chloroplast genome are generally conserved in content and in order among land plants. Genes can be categorised into functional groups (Kim and Lee 2004; Yi and Kim 2012; Li et al. 2013; Huang et al. 2014; Zhang et al. 2014). The first category includes genes that are involved in photosynthesis such as genes for photosystem I and II. Genes from the second category are involved in transcription, translation or self-replication such as transfer RNA, ribosomal RNA, ribosomal subunit genes and RNA polymerase genes (Mullet 1988). The third category comprises conserved open reading frames (ORFs) such as protein-coding genes like maturaseK (matK), chloroplast envelope membrane protein (cemA) and hypothetical chloroplast open reading frame (ycfs). The chloroplast genome of angiosperms is circular and the reported size varies from 120 to 220 kilobases (kb) (Wu et al. 2010; Wicke et al. 2011). The genome generally has a quadripartite structure with two copies of an inverted repeat (IR) region separated by small (SSC) and large single copy (LSC) regions (Saski et al. 2005; Yang et al. 2013; Yang et al. 2014). The IR size averages around 20-30 kb with some exceptions (such as Pelargonium hortorum (~76 kb) (Palmer et al. 1987; Chumley et al. 2006). The IRs are thought to act as stabilising regions and evolve ~2.3 times slower compared to the single copy region (Perry and Wolfe 2002). While the positions of boundaries between IR and single copy regions show some variability between species (this thesis), sometimes including some genes in the IR in one species that are present in a single copy region in another species, the IRs are exact reverse complemented duplicates, and hence both IR´s in a single chloroplast have the same gene content. The chloroplast genome is able to retain signatures of evolutionary history much longer than its nuclear counterparts due to the low level of mutation, which appears 5 at least in part due to the presence of these IRs. Mutations in the IR have been observed in species with chloroplasts