Genome-Wide Interrogation Advances Resolution of Recalcitrant Groups In
Total Page:16
File Type:pdf, Size:1020Kb
SUPPLEMENTARY INFORMATION VOLUME: 1 | ARTICLE NUMBER: 0020 In the format provided by the authors and unedited. 1 Supplementary Information Genome-wide2 interrogation advances resolution 3 Genome-wide interrogation advances resolution of recalcitrant of 4recalcitrantgroups in the Tree groups of Life in the tree of life 5 6 Dahiana Arcila1,2, Guillermo Ortí1, Richard Vari2§, Jonathan W. Armbruster3, Melanie L. 7 J. Stiassny4, Kyung D. Ko1, Mark H. Sabaj5, John Lundberg5, Liam J. Revell6 & Ricardo 8 Betancur-R.7,2* 9 10 11 1Department of Biological Sciences, The George Washington University, 2023 G St. 12 NW, Washington, DC, 20052, United States 13 2Department of Vertebrate Zoology, National Museum of Natural History Smithsonian 14 Institution, PO Box 37012, MRC 159, Washington, DC, 20013, United States 15 3Department of Biological Sciences, Auburn University, Auburn, AL 36849, United 16 States 17 4Department of Ichthyology, Division of Vertebrate Zoology, American Museum of 18 Natural History, New York, New York, United States 19 5Department of Ichthyology, The Academy of Natural Sciences, 1900 Benjamin Franklin 20 Parkway, Philadelphia, Pennsylvania 19103, United States 21 6Department of Biology, University of Massachusetts Boston, Boston, Massachusetts 22 02125, United States 23 7Department of Biology, University of Puerto Rico – Río Piedras, PO Box 23360, San 24 Juan, Puerto Rico 25 26 27 This PDF file includes: 28 29 Supplementary Methods 30 Supplementary Notes 31 Figs. S1 to S7 32 Tables S1 to S8 33 34 Other Supplementary Materials for this manuscript: 35 36 A compressed folder containing all data files is available from Zenodo.org 37 (http://dx.doi.org/10.5281/zenodo.51603). These include: 38 39 Database S1. List of target loci and probes designed for pilot study 40 Database S2. Scripts for quasi de novo assembler 41 Database S3. DNA sequence alignments, gene and multi-locus trees for pilot 42 study 43 Database S4. List of loci selected for Otophysi 44 Database S5. List of target loci and probe designed for Otophysi 45 Database S6. DNA and protein sequence alignments for Otophysi NATURE ECOLOGY & EVOLUTION | DOI: 10.1038/s41559-016-0020 | www.nature.com/natecolevol 1 © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. SUPPLEMENTARY INFORMATION 46 Database S7. List of presence and absence of species and genes examined for 47 Otophysi 48 Database S8. Gene sequence annotation using Blast2Go. GO = Gene ontology 49 Database S9. Summary of gene properties and selection criteria for subset 50 assembly (selected loci for each subset given in bold) 51 Database S10. Gene and multi-locus trees for Otophysi 52 Database S11. GGI tutorial dataset 53 Database S12. GGI trees (Otophysi) and species trees analyses using STAR and 54 ASTRAL-2 55 Database S13. GGI trees metazoans, birds, mammals, and yeast 56 57 Supplementary Methods 58 59 Pilot study (ray-finned fishes). We ran an initial experiment to test the efficacy of target 60 capture techniques to obtain genome-level data for a large number of exons across the 61 diversity of fishes. Taxonomic sampling for this experiment consisted of 4 otophysans 62 (one representative per order) and 10 distantly-related ray-finned fish species (Table S5; 63 Database S1). Exon markers were selected using the EvolMarker pipeline1 to screen 64 genomes of zebrafish (Danio rerio) and medaka (Oryzias latipes) via pairwise BLAST 65 searches to identify single-copy, slowly-evolving exon regions according to a 50% 66 dissimilarity criterion. A set of 3,957 putatively orthologous exons longer than 200 bp 67 was defined by this method. Custom biotinylated RNA bait libraries for the pilot study 68 were designed based on zebrafish sequences following specifications of the MYBait 69 target enrichment system (http://www/mycroarray.com). A total of 15,376 70 oligonucleotide baits (120 bp long) tiling over the 3,957 exon sequences with 2x density 71 were designed using the py_tiler.py script https://github.com/faircloth-lab/uce-probe- 72 design; 2]. 73 DNA extractions were performed using DNeasy extraction kits (Qiagen, Inc.) 74 and resulting genomic DNAs were sheared to a target size of 500 bp using a Covaris 75 sonicator. Library preparation and indexing followed standard Illumina protocols. Target 76 enrichment (TE) was conducted using a modification of the Mamanova et al. protocol3, 77 as optimized by Faircloth et al.2 Indexed libraries enriched for the 3,957 exons were 78 pooled and sequenced in two lanes of Illumina HiScan. 79 Reads were assembled into contigs with a quasi de novo approach4 using custom 80 scripts to perform the following steps (Dataset S2). First, reads from each of the 14 81 species were mapped against the reference zebrafish genome using GNUMAP v3.0.25, 82 setting similarity (a), genome mer-size (m), and matching seed hashes (k) to 0.55, 7, and 83 5, respectively. Matching reads and their pair-mates (FASTQ files) were collected 84 separately for each species and locus. Second, each set of collected reads was assembled 85 separately using the de novo algorithm with pair-mate mode implemented in Trinity6. The 86 number of contigs per locus obtained for each species varied from zero to four. Third, 87 pairwise alignments were conducted against the reference for contigs from each locus 88 separately using BLAST searches. If multiple contigs were retrieved for a single locus 89 with a similarity value < 97%, they were flagged as potential paralogs and discarded for 90 downstream analyses; if two contigs assembled had a similarity value >97%, presumably 91 representing allelic variants, they were subsequently collapsed using IUPAC ambiguity NATURE ECOLOGY & EVOLUTION | DOI: 10.1038/s41559-016-0020 | www.nature.com/natecolevol 2 © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. SUPPLEMENTARY INFORMATION 92 codes. Finally, the coding region for each orthologous exon was extracted from the 93 assembled contigs for all species and subsequently aligned using MAFFT v7.023 7 (Table 94 S5). 95 For 279 loci, assemblies produced more than one contig for at least one species 96 with <97% similarity; therefore, these were removed from further analyses. In addition, 97 2637 exons also were discarded because they were captured with low efficiency and had 98 either extremely low sequencing coverage or were successfully assembled in less than 6 99 species. The final dataset assembled consisted of 1041 single-copy genes (288,581 sites). 100 To test the efficacy of these markers in resolving ray-finned fish phylogeny, two different 101 analyses were conducted. First, a Maximum Likelihood (ML) analysis of the 102 concatenated dataset in RAxML v8.2.68 based on the GTRGAMMA model; second, a 103 summary multi-species coalescent analysis using ASTRAL-29, using as input the 1041 104 gene trees obtained with RAxML (Database S3). Both methods converged on a single 105 well-supported topology that is highly congruent with previous phylogenetic hypotheses 106 for fishes10,11; fig. S3). This preferred set of 1041 single-copy exons was used to create a 107 new capture library customized for otophysan taxa. 108 109 Supplementary Notes 110 111 Testing the assumption of subclade monophyly of individual gene genealogies. The 112 main goal of the GGI approach is to address a systematic question when a number of 113 well-supported clades of undisputed monophyly have unresolved relationships (in our 114 case, the relationships among otophysan orders/suborders). The assumption of 115 monophyly for these “undisputed” clades is based on organismal phylogenies supported 116 by several lines of evidence and is more likely to be met for ancient divergences spanning 117 tenths or hundreds of millions of years of evolution (e.g., Otophysi, and the other groups 118 analyzed here) than in shallow species divergences (e.g., humans, gorillas and 119 chimpanzees)12. The question is whether in these cases forcing the monophyly of 120 organismal clades onto individual gene trees is acceptable or reasonable. Rosenberg13 121 calculated that under a neutral coalescent model the vast majority of genes in a genome 122 (99.99–99.999%) require 5.3–8.3 coalescence time units to achieve monophyly (for either 123 one species or two sister species). While it is theoretically possible to directly estimate 124 branch lengths in coalescent time units using multi-species coalescent approaches (e.g., 125 ASTRAL-2), in reality these can be severely underestimated due to high levels of gene 126 tree error14, which is evidently the case with our dataset (Figs. S1, S2). Thus, to test 127 whether this assumption is met for all major otophysan groups (i.e., cypriniforms, 128 gymnotiforms, siluriforms characoids and citharinoids; Fig. 1), we first estimated the 129 length of the branches subtending these clades in millions of years using fossil-calibrated 130 phylogenies. A time-calibrated version of our tree and trees published by others15,16 show 131 that the subtending stem lineages leading to these clades are 20–65 million years (MY) 132 (Table S7). With these branch length values in MY (T) it is thus possible to estimate 133 coalescent time units, given by T/(2Ne*generation time) for diploid species, where Ne is 134 effective population size12. Assuming a conservative estimate of Ne= 100000 and a 135 generation time of 5 years, the minimum estimated length in coalescent units for all 136 otophysan stem lineages is 20–65 (see also range of estimates in Table S8). Note that 137 these parameters are conservatively estimated. For instance, freshwater fishes are NATURE ECOLOGY & EVOLUTION | DOI: 10.1038/s41559-016-0020 | www.nature.com/natecolevol 3 © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. SUPPLEMENTARY INFORMATION 138 typically confined to small geographic regions with restricted breeding and their 139 population sizes are likely smaller17. Furthermore, the mean age at maturation for several 140 freshwater fishes is estimated at 2.5 years (North American species)18.