Solanum Section Lycopersicon: Solanaceae)
Total Page:16
File Type:pdf, Size:1020Kb
Biological Journal of the Linnean Society, 2016, 117, 96–105. With 4 figures. Using genomic repeats for phylogenomics: a case study in wild tomatoes (Solanum section Lycopersicon: Solanaceae) 1,2 2,3 € 4 5 STEVEN DODSWORTH *, MARK W. CHASE , TIINA SARKINEN , SANDRA KNAPP and ANDREW R. LEITCH1 1School of Biological and Chemical Sciences, Queen Mary University of London, Mile End Road, London, E1 4NS, UK 2Royal Botanic Gardens, Kew, Richmond, Surrey, TW9 3DS, UK 3School of Plant Biology, The University of Western Australia, Crawley, WA, 6009, Australia 4Royal Botanic Garden, Edinburgh, 20A Inverleith Row, Edinburgh, EH3 5LR, UK 5Department of Life Sciences, Natural History Museum, Cromwell Road, London, SW7 5BD, UK Received 17 February 2015; revised 7 May 2015; accepted for publication 21 May 2015 High-throughput sequencing data have transformed molecular phylogenetics and a plethora of phylogenomic approaches are now readily available. Shotgun sequencing at low genome coverage is a common approach for isolating high-copy DNA, such as the plastid or mitochondrial genomes, and ribosomal DNA. These sequence data, however, are also rich in repetitive elements that are often discarded. Such data include a variety of repeats present throughout the nuclear genome in high copy number. It has recently been shown that the abundance of repetitive elements has phylogenetic signal and can be used as a continuous character to infer tree topologies. In the present study, we evaluate repetitive DNA data in tomatoes (Solanum section Lycopersicon)to explore how they perform at the inter- and intraspecific levels, utilizing the available data from the 100 Tomato Genome Sequencing Consortium. The results add to previous examples from angiosperms where genomic repeats have been used to resolve phylogenetic relationships at varying taxonomic levels. Future prospects now include the use of genomic repeats for population-level analyses and phylogeography, as well as potentially for DNA barcoding. © 2015 The Authors. Biological Journal of the Linnean Society published by John Wiley & Sons Ltd on behalf of Linnean Society of London, Biological Journal of the Linnean Society, 2016, 117,96–105. ADDITIONAL KEYWORDS: genome skimming – high-throughput sequencing – molecular systematics – next-generation sequencing – phylogenetics – phylogenetic signal – repetitive elements. INTRODUCTION (Dodsworth et al., 2015). Molecular systematics relies on the alignment of homologous DNA sequences, One of the simplest approaches to using high- whether coding or noncoding, and subsequent phylo- throughput sequencing for phylogenetics is to ran- genetic trees are inferred based on patterns of differ- domly sequence a small proportion of total genomic ences in these alignments. Repetitive elements are DNA. The sequences of reads present in these data- not suitable for such analyses in exactly the same sets are biased towards sequences with the greatest way. For example, although retrotransposons have numbers of copies in the genome (Straub et al., homologous protein domains involved in element 2012); this includes not only high-copy organellar mobility, the sequence divergence of these domains DNA, such as the plastid and mitochondrial ge- between taxa is not sufficient to resolve phylogenetic nomes, but also ribosomal DNA and the many kinds relationships. What does vary, and in many cases of repeats, particularly retrotransposon sequences drastically, is the abundance of particular retrotrans- posons and other repeat types. This abundance of *Corresponding author. E-mail: steven.dodsworth@ homologous repeats can then be used as a quantita- qmul.ac.uk tive character for phylogenetic reconstruction. 96 © 2015 The Authors. Biological Journal of the Linnean Society published by John Wiley & Sons Ltd on behalf of Linnean Society of London, Biological Journal of the Linnean Society, 2016, 117, 96–105 This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. GENOMIC REPEATS FOR PHYLOGENOMICS 97 Recent tools have been developed that allow us to Macas, 2010; Novak et al., 2013; Dodsworth et al., analyze, quickly and efficiently, the repetitive por- 2015). This methodology has been shown to be effec- tion of the genome from low-coverage genome tive for inferring phylogenetic relationships in well- sequencing data, and then to use these data for phy- studied groups of angiosperms in several different logenetic inference (Fig. 1) (Novak, Neumann & families (Apocynaceae, Fabaceae, Liliaceae, Orob- anchaceae, and Solanaceae). Typically, the method does not work well above the level of genus because A gDNA sequencing there are often no repeats in common (and therefore no shared characters on which to infer phylogenetic relationships). Understanding how repetitive ele- gDNA ments could be used in phylogeographical and popu- lation genetic studies, as well as in resolving difficult phylogenetic problems at the species-level, is now a Sequence focus for future research. reads In the present study, we test the usefulness and power of nuclear repeat regions at inter- and intra- (0.1–5% of specific levels. We test this using wild and cultivated genome) tomato species, including multiple cultivars as a case study to explore intraspecific variation in genomic repeats and the subsequent performance of these datasets in phylogenetic inference. The wild toma- B Repeat clustering toes present an excellent case study as a result of the availability of genomic and genetic data, and R1 - tandem repeat extensive previous analyses of phylogenetic relation- Species Species ships using plastid markers, low-copy nuclear mark- n A R2 - retroelement ers, nuclear ribosomal internal transcribed spacers, Species and amplified fragment length polymorphisms (Per- B Species C Species alta, Spooner & Knapp, 2008; Grandillo et al., 2011). B Species Four informal groups are recognized within the sec- C Species tion: (1) ‘Lycopersicon group’ with Solanum lycopersi- Abundance A estimation of repeat cum, Solanum cheesmaniae, Solanum galapagense, Species types n and Solanum pimpinellifolium (the ‘red/orange fruit’ clade); (2) ‘Arcanum group’ with Solanum arcanum, Solanum chmielewskii, and Solanum neorickii (the ‘green fruit’ clade); (3) ‘Eriopersicon group’ with Sola- C Phylogenetics num huaylasense, Solanum chilense, Solanum corne- liomulleri, Solanum peruvianum, and Solanum R1 R2 R3...Rn habrochaites; and (4) ‘Neolycopersicon group’ contain- Sp. A 12.4 4.5 11.2 ing only Solanum pennellii, which was considered to Sp. B 10.2 0.5 12 Sp. C 62.5 32.5 3.7 Sp. n be sister to the rest of the section by (Peralta et al., ... 2008) based on its lack of the sterile anther append- ... ... age that occurs as a morphological synapomorphy in Sp. n etc. S. habrochaites and the rest of the core tomatoes. Sp. A More recent studies using conserved orthologous Sp. B sequence markers (COSII; Rodriguez et al., 2009) and genome-wide single nucleotide polymorphisms Sp. C (SNPs) (Aflitos et al., 2014; Lin et al., 2014) have largely supported previous hypotheses with respect Figure 1. Schematic illustrating the workflow for build- to major clades within the tomatoes, although indi- ing trees from repetitive DNA abundances. A, low-cover- vidual species relationships are less clear cut for age genomic DNA sequencing using next-generation some taxa. The extent to which multiple evolutionary sequencing methods (NGS; e.g. Illumina). B, clustering of histories can be recovered through the analysis of NGS reads using RepeatExplorer pipeline, resulting in different genomic fractions has been explored in abundance estimates of different repeat families. C, phy- tomatoes, and concordance analysis revealed signifi- logenetic analysis in TNT using cluster abundances as cant discordance possibly as a result of biological continuous phylogenetic characters. processes such as hybridization or incomplete lineage © 2015 The Authors. Biological Journal of the Linnean Society published by John Wiley & Sons Ltd on behalf of Linnean Society of London, Biological Journal of the Linnean Society, 2016, 117, 96–105 98 S. DODSWORTH ET AL. sorting (Rodriguez et al., 2009). There are no different cultivars of S. lycopersicum were also reported polyploids in this clade, and there is not included. much variation in genome size, although macro- and microgenome rearrangements are reported (Tang et al., 2008; Szinay et al., 2010, 2012; Verlaan et al., HIGH-THROUGHPUT SEQUENCE DATA ACQUISITION 2011). Illumina sequence data from the 100 Tomato Gen- ome Sequencing Consortium (Aflitos et al., 2014) were downloaded from the NCBI Short Read Archive (SRA), with the accession numbers: ERR418040 – MATERIAL AND METHODS S. lycopersicum ‘Alisa Craig’ LA2838A; ERR418039 – TAXA SAMPLED S. lycopersicum ‘Moneymaker’ LA2706; ERR418048 – We sampled material from 20 accessions, including S. lycopersicum ‘Sonata’ LYC1969; ERR418055 – all currently recognized species of the core tomato S. lycopersicum ‘Large Pink’ EA01049; ERR418056 – clade (section Lycopersicon) and Solanum tuberosum S. lycopersicum LYC3153; ERR418058 – S. lycopersi- L. (potato) as the outgroup. Representatives of Sola- cum PI129097; ERR418078 – S. lycopersicum num sect. Lycopersicon (Table 1) (Peralta, Knapp & LYC2962; ERR418093 – S. arcanum LA2172; Spooner, 2005; Peralta et al., 2008) included in the ERR418061 – S. corneliomulleri LA0118; ERR418087 analyses were: S. lycopersicum L., S. arcanum