Supplementary Materials:

Supplementary File 1: Figures S1 - S5.
Supplementary File 2: Table S1.
Supplementary File 3: A user-friendly pipeline to retrieve target sequences of the 4,434 loci
from new genome sequences or transcriptomes.
Supplementary File 4: Detailed materials and methods.
Supplementary File 1: Figures S1 - S5.

FIG. S1. Mining single-copy protein coding markers through genome comparison using
EvolMarkers (Li et al., 2012). The eight species compared are Anguilla anguilla, Tetraodon
nigroviridis, Gadus morhua, Danio rerio, Oryzias latipes, Lepisosteus oculatus,
Gasterosteus aculeatus and Oreochromis niloticus.

FIG. S2. Number of loci captured for each sample tested. Colors indicate the different
projects carried out in the authors' laboratory.

FIG. S3. Average length of coding regions, GC content, pairwise distance, retention index
and consistency index of the 4,434 loci. All statistics were summarized from 10 captured
samples plus 7 species with genome sequences available.

FIG. S4. Species tree of four freshwater sleepers distributed in China reconstructed using
ASTRAL v4.11.1 based on 3,817 loci that have complete data in all species. Perccottus
glenii was used to root the ingroup.

FIG. S5. Principal component analysis (PCA) of four freshwater sleepers (Odontobutis)
derived from SNP data of 3,914 loci. The SNP site with the best score was chosen for each
locus.

Supplementary File 2:
Table S1. List of samples used in testing the candidate markers.
| # | Species | Sample ID | Family | Order | Project |
|---|---|---|---|---|---|
| 1 | Scaphirhynchus platorynchus | CL182 | Acipenseridae | Acipenseriformes | Acipen |
| 2 | Acipenser sinensis | CL356 | Acipenseridae | Acipenseriformes | Acipen |
| 3 | Acipenser schrenckii | CL357 | Acipenseridae | Acipenseriformes | Acipen |
| 4 | Acipenser gueldenstaedti | CL358 | Acipenseridae | Acipenseriformes | Acipen |
| 5 | Huso huso | CL359 | Acipenseridae | Acipenseriformes | Acipen |
| 6 | Huso dauricus | CL360 | Acipenseridae | Acipenseriformes | Acipen |
| 7 | Acipenser ruthenus | CL504 | Acipenseridae | Acipenseriformes | Acipen |
| 8 | Acipenser dabryanus | CL505 | Acipenseridae | Acipenseriformes | Acipen |
| 9 | Polyodon spathula | CL129 | Polyodontidae | Acipenseriformes | Acipen |
| 10 | Hiodon tergisus | CL124 | Hiodontidae | Osteoglossomorpha | Basal |
| 11 | Lepisosteus osseus | CL130 | Lepisosteidae | Lepisosteiformes | Basal |
| 12 | Hiodon alosoides | CL132 | Hiodontidae | Osteoglossomorpha | Basal |
| 13 | Osteoglossum bicirrhosum | CL133 | Osteoglossidae | Osteoglossomorpha | Basal |
| 14 | Elops saurus | CL134 | Elopidae | Elopomorpha | Basal |
| 15 | Albula vulpes | CL135 | Albulidae | Elopomorpha | Basal |
| 16 | Anguilla rostrata | CL136 | Anguillidae | Elopomorpha | Basal |
| 17 | Lepisosteus platostomus | CL181 | Lepisosteidae | Lepisosteiformes | Basal |
| 18 | Saccopharynx ampullaceus | CL184 | Saccopharyngidae | Elopomorpha | Basal |
| 19 | Elops saurus | CL330 | Elopidae | Elopomorpha | Basal |
| 20 | Conger japonicus | CL544 | Congridae | Elopomorpha | Basal |
| 21 | Erpetoichthys calabaricus | CL628 | Polypteridae | Polypteriformes | Basal |
| 22 | Polypterus endlicher | CL629 | Polypteridae | Polypteriformes | Basal |
| 23 | Osteoglossum bicirrhosum | CL630 | Osteoglossidae | Osteoglossomorpha | Basal |
| 24 | Ctenogobius boleosoma | 23CTBO | Gobionellidae | Gobiomorpharia | Goby |
| 25 | Sicydium crenilabrum | 38SICR | Gobionellidae | Gobiomorpharia | Goby |
| 26 | Stonogobiops nematodes | 19STNE | Gobiidae | Gobiomorpharia | Goby |
| 27 | Gobiosoma robustum | 16GORO | Gobiidae | Gobiomorpharia | Goby |
| 28 | Mugilogobius cavifrons | 24MUCA | Gobionellidae | Gobiomorpharia | Goby |
| 29 | Taenioides sp. | 35TASP | Gobionellidae | Gobiomorpharia | Goby |
| 30 | Tridentiger bifasciatus | 27TRBI | Gobionellidae | Gobiomorpharia | Goby |
| 31 | Periophthalmus kalolo | 32PEKA | Gobionellidae | Gobiomorpharia | Goby |
| 32 | Rhinogobius giurinus | 28RHGI | Gobionellidae | Gobiomorpharia | Goby |
| 33 | Odontamblyopus lacepedii | 37ODLA | Gobionellidae | Gobiomorpharia | Goby |
| 34 | Acanthogobius ommaturus | 22ACOM | Gobionellidae | Gobiomorpharia | Goby |
| 35 | Boleophthalmus pectinirostris | 34BOPE | Gobionellidae | Gobiomorpharia | Goby |
| 36 | Valenciennea strigata | 18VAST | Gobiidae | Gobiomorpharia | Goby |
| 37 | Periophthalmus modestus | 33PEMO | Gobionellidae | Gobiomorpharia | Goby |
| 38 | Rhinogobius davidi | 29RHDA | Gobionellidae | Gobiomorpharia | Goby |
| 39 | Tridentiger barbatus | 26TRBA | Gobionellidae | Gobiomorpharia | Goby |
| 40 | Acanthogobius flavimanus | 21ACFL | Gobionellidae | Gobiomorpharia | Goby |
| 41 | Mugilogobius abei | 25MUAB | Gobionellidae | Gobiomorpharia | Goby |
| 42 | Kribia nana | 10KRNA | Butidae | Gobiomorpharia | Goby |
| 43 | Trypauchen vagina | 36TRVA | Gobionellidae | Gobiomorpharia | Goby |
| 44 | Valenciennea puellaris | 17VAPU | Gobiidae | Gobiomorpharia | Goby |
| 45 | Amblyeleotris yanoi | 12AMYA | Gobiidae | Gobiomorpharia | Goby |
| 46 | Lophiogobius ocellicauda | 30LOOC | Gobionellidae | Gobiomorpharia | Goby |
| 47 | Amblyeleotris wheeleri | 13AMWH | Gobiidae | Gobiomorpharia | Goby |
| 48 | Pomatoschistus microps | 14POMI | Gobiidae | Gobiomorpharia | Goby |
| 49 | Oxyeleotris marmoratus | 11OXMA | Butidae | Gobiomorpharia | Goby |
| 50 | Microdesmus dorsipunctatus | 20MIDO | Gobiidae | Gobiomorpharia | Goby |
| 51 | Dormitator maculatus | 08DOMA | Eleotridae | Gobiomorpharia | Goby |
| 52 | Ecsenius bicolor | 43ECBI | Blenniidae | Ovalentariae | Goby |
| 53 | Typhleotris pauliani | 04TYPA | Milyeringidae | Gobiomorpharia | Goby |
| 54 | Odontobutis potamophila | 02ODPO | Odontobutidae | Gobiomorpharia | Goby |
| 55 | Pterapogon kauderni | 41PTKA | Apogonidae | Gobiomorpharia | Goby |
| 56 | Sphaeramia orbicularis | 40SPOR | Apogonidae | Gobiomorpharia | Goby |
| 57 | Oreochromis niloticus | 172ORNI | Cichlidae | Ovalentariae | Goby |
| 58 | Salarias fasciatus | 42SAFA | Blenniidae | Ovalentariae | Goby |
| 59 | Gobiomorus dormitor | 06GODO | Eleotridae | Gobiomorpharia | Goby |
| 60 | Rhyacichthys aspro | 01RHAS | Rhyacichthyidae | Gobiomorpharia | Goby |
| 61 | Eleotris acanthopoma | 07ELAC | Eleotridae | Gobiomorpharia | Goby |
| 62 | Butis koilomatondon | 09BUKO | Butidae | Gobiomorpharia | Goby |
| 63 | Milyeringa veritas | 05MIVE | Milyeringidae | Gobiomorpharia | Goby |
| 64 | Micropercops swinhonis | 03MISW | Odontobutidae | Gobiomorpharia | Goby |
| 65 | Siniperca chuatsi | 44SICH | Sinipercidae | Percomorpharia | Goby |
| 66 | Kurtus gulliveri | 39KUGU | Kurtidae | Gobiomorpharia | Goby |
| 67 | Denticeps clupeoides | Denclu | Denticeptidae | Clupeiformes | Ostario |
| 68 | Ilisha elongata | Le_LQ | Pristigasteridae | Clupeiformes | Ostario |
| 69 | Rutilus rutilus | Rutilusrutilus | Cyprinidae | Cypriniformes | Ostario |
| 70 | Coreoperca whiteheadi | CL935_1 | Sinipercidae | Percomorpharia | Sini |
| 71 | Coreoperca whiteheadi | CL940_1 | Sinipercidae | Percomorpharia | Sini |
| 72 | Coreoperca whiteheadi | CL940_2 | Sinipercidae | Percomorpharia | Sini |
| 73 | Coreoperca whiteheadi | CL945_3 | Sinipercidae | Percomorpharia | Sini |
| 74 | Coreoperca whiteheadi | CL958_1 | Sinipercidae | Percomorpharia | Sini |
| 75 | Siniperca obscura | CL934_1 | Sinipercidae | Percomorpharia | Sini |
| 76 | Siniperca obscura | CL937_1 | Sinipercidae | Percomorpharia | Sini |
| 77 | Siniperca scherzeri | CL944_4 | Sinipercidae | Percomorpharia | Sini |
| 78 | Siniperca undulata | CL946 | Sinipercidae | Percomorpharia | Sini |
| 79 | Siniperca undulata | CL946_2 | Sinipercidae | Percomorpharia | Sini |
| 80 | Siniperca kneri | CL949 | Sinipercidae | Percomorpharia | Sini |
| 81 | Siniperca kneri | CL954 | Sinipercidae | Percomorpharia | Sini |
| 82 | Siniperca kneri | CL957 | Sinipercidae | Percomorpharia | Sini |
| 83 | Siniperca roulei | CL947 | Sinipercidae | Percomorpharia | Sini |
| 84 | Siniperca roulei | CL961_2 | Sinipercidae | Percomorpharia | Sini |
| 85 | Siniperca roulei | CL961_3 | Sinipercidae | Percomorpharia | Sini |
| 86 | Siniperca chuatsi | CL938_1 | Sinipercidae | Percomorpharia | Sini |
| 87 | Siniperca chuatsi | CL942_1 | Sinipercidae | Percomorpharia | Sini |
| 88 | Siniperca chuatsi | CL943_1 | Sinipercidae | Percomorpharia | Sini |
| 89 | Siniperca chuatsi | CL951_1 | Sinipercidae | Percomorpharia | Sini |
| 90 | Siniperca chuatsi | CL955_1 | Sinipercidae | Percomorpharia | Sini |
| 91 | Pomoxis nigromaculatus | CL173 | Centrarchidae | Percomorpharia | Sini |
| 92 | Psammoperca waigiensis | CL202 | Latidae | Carangimorphariae | Sini |
| 93 | Niphon spinosus | CL212 | Serranidae | Percomorpharia | Sini |
| 94 | Dicentrarchus labrax | CL301 | Moronidae | Percomorpharia | Sini |
| 95 | Percichthys trucha | CL302 | Percichthyidae | Percomorpharia | Sini |
| 96 | Lateolabrax japonicus | CL322 | Moronidae | Percomorpharia | Sini |
| 97 | Micropterus salmoides | CL626 | Centrarchidae | Percomorpharia | Sini |
| 98 | Trichiurus lepturus | CL627_1 | Trichiuridae | Percomorpharia | Sini |
| 99 | Epinephelus awoara | CL647_2 | Serranidae | Percomorpharia | Sini |

Supplementary File 3: Get4434Seq, a pipeline for retrieving sequences of the 4,434
targeted loci from a species of the user’s interest.

Get4434Seq is a Perl package that retrieves target sequences of the 4,434 loci for a species
of the user’s interest by comparing the fish target database with user-provided genome
sequences or transcriptomes.

1. Pick the species among the eight model fish that is most closely related to the
species of interest.
2. Use the 4,434 target sequences of the chosen model fish as BLAST queries against the
user-provided genome sequences or transcriptomes, and retrieve the best-hit sequence
for each target that meets the preset coverage and identity requirements.
3. Check for paralogy through reciprocal BLAST.
4. Keep the best-hit sequences that map back to the target regions of the query genome.
If no qualified target is found for a locus, use the target sequence of the query for bait
design.
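
As an illustration of steps 2 to 4, the following Python sketch filters BLAST tabular output (-outfmt 6) by identity and coverage, keeps the best hit per target, and applies a reciprocal check. It is not the Get4434Seq code. The function names, the thresholds (80% identity, 0.8 coverage), the approximation of coverage as alignment length divided by query length, and the toy hit lines are all assumptions made for this example.

```python
from collections import namedtuple

# Fields of BLAST tabular output (-outfmt 6) used in this sketch.
Hit = namedtuple("Hit", "query subject identity aln_len s_start s_end bitscore")


def parse_blast_tab(lines):
    """Parse 12-column BLAST tabular lines into Hit records."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        # Subject coordinates are reversed for minus-strand hits, so sort them.
        s_start, s_end = sorted((int(f[8]), int(f[9])))
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]), s_start, s_end, float(f[11])))
    return hits


def best_hits(hits, query_len, min_identity, min_coverage):
    """Return {query: highest-bitscore Hit} among hits passing both cutoffs."""
    best = {}
    for h in hits:
        coverage = h.aln_len / query_len[h.query]  # approximate query coverage
        if h.identity < min_identity or coverage < min_coverage:
            continue
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return best


def passes_reciprocal(locus, reverse_best, target_region):
    """True if the retrieved sequence's best hit overlaps the original target region."""
    hit = reverse_best.get(locus)
    if hit is None:
        return False
    chrom, start, end = target_region[locus]
    return hit.subject == chrom and hit.s_start <= end and hit.s_end >= start


# Toy forward search: model-fish targets against a user assembly (hypothetical data).
forward = parse_blast_tab([
    "locus1\tscaffold7\t92.5\t300\t20\t0\t1\t300\t5000\t5299\t1e-100\t400",
    "locus1\tscaffold9\t85.0\t300\t45\t0\t1\t300\t100\t399\t1e-60\t250",
    "locus2\tscaffold3\t70.0\t90\t27\t0\t1\t90\t10\t99\t1e-5\t60",
])
kept = best_hits(forward, {"locus1": 330, "locus2": 300}, min_identity=80.0, min_coverage=0.8)

# Toy reverse search: retrieved sequences against the model-fish genome.
reverse = parse_blast_tab([
    "locus1\tchr4\t93.0\t300\t21\t0\t1\t300\t12300\t12001\t1e-100\t410",
])
reverse_best = best_hits(reverse, {"locus1": 300}, min_identity=80.0, min_coverage=0.8)
target_region = {"locus1": ("chr4", 12000, 12330)}
confirmed = [q for q in kept if passes_reciprocal(q, reverse_best, target_region)]
```

Loci absent from `confirmed` would fall back to the model-fish target sequence for bait design, as described in step 4.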

The fish target database (sequences of the 4,434 loci of the eight model species) and Perl scripts
are available at http://www.lmse.org/markersandtools.html.

Supplementary File 4: Detailed materials and methods.
DNA extraction
Total genomic DNA was extracted from fin or muscle tissue of samples using a Tissue
DNA kit (Omega Bio-tek, Norcross, GA, USA) and quantified using a NanoDrop 3300
Fluorospectrometer (Thermo Fisher Scientific, Wilmington, DE, USA).
Bait design and synthesis
For the a priori bait design and synthesis, biotinylated RNA baits (MYcroarray, Ann Arbor,
Michigan, USA) of 120 bp were synthesized with 2× tiling based on the sequences of
Oreochromis niloticus (used in the “Goby” and “Sini” projects) and Lepisosteus oculatus (used
in the “Basal”, “Acipen” and “Ostario” projects). For the baits designed on O. niloticus, loci
longer than 100 bp were targeted, and the 3’ end of any bait shorter than 120 bp was padded
with “Ts”. For the baits designed on L. oculatus, loci longer than 120 bp
were targeted. The bait design was subsequently refined to 4,434 loci (see the main text)
based on the gene capture results, and the refined baits were synthesized at MYcroarray.
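
The tiling and padding scheme can be illustrated with a short Python sketch. It assumes that 2× tiling means a step of half the bait length (60 bp for 120 bp baits) and that a final partial window is padded with Ts in the same way as a short locus. The handling of the last window is not stated in the text, and the function name `tile_baits` is hypothetical.

```python
def tile_baits(seq, bait_len=120, tiling=2, pad_base="T"):
    """Cut a target sequence into fixed-length baits with overlapping tiling.

    A window shorter than bait_len is padded at its 3' end with pad_base.
    """
    if not seq:
        return []
    step = bait_len // tiling  # 2x tiling: each new bait starts half a bait later
    baits = []
    start = 0
    while True:
        window = seq[start:start + bait_len]
        baits.append(window.ljust(bait_len, pad_base))
        if start + bait_len >= len(seq):
            break  # this window reached the end of the target
        start += step
    return baits


short_locus = "ACGT" * 25   # 100 bp: a single bait padded with 20 Ts
long_locus = "ACGT" * 60    # 240 bp: baits start at positions 0, 60 and 120
short_baits = tile_baits(short_locus)
long_baits = tile_baits(long_locus)
```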
Library preparation, gene capture and sequencing
Genomic DNA was sheared to approximately 250 bp using a Covaris E220
Focused-ultrasonicator (Covaris, Woburn, MA, USA). Subsequently, 350 to 500 ng of the
sheared DNA from each sample was used to construct a library. Blunt-end repair, adapter
ligation, fill-in, pre-hybridization PCR and double target gene enrichment steps mainly
followed the protocol of cross-species gene capture (Li, et al. 2013). The enriched libraries
were amplified with 8 bp indexed primers, and the concentration of the products was
measured using a NanoDrop 3300 Fluorospectrometer. The products were pooled
in equimolar amounts and sequenced on an Illumina HiSeq 2500 platform (Illumina, Inc, San Diego,
CA, USA) together with samples from the same or other projects.
Data assembly
The raw reads were sorted into separate files for each species according to the 8 bp indices
on the adapter. Read assembly mainly followed the pipeline of Yuan et al. (2016).
The final output includes three FASTA files: coding regions with flanks, coding regions
without flanks, and intron sequences. The sequences without flanking regions were used for
subsequent analyses.
Sequence alignment
Each individual locus was aligned using MAFFT v7 (Katoh and Standley 2013) with default
parameter settings (mafft_AA.pl). The aligned amino acid (AA) sequences were then converted
back to DNA sequences via aa2dna_align.pl.
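
Converting an amino acid alignment back to DNA is commonly done by threading each sequence's codons onto its aligned protein. The Python sketch below shows that idea only. The internals of aa2dna_align.pl are not described here, so the function name `thread_codons` and its error handling are assumptions.

```python
def thread_codons(aligned_protein, dna, gap="-"):
    """Map an unaligned coding DNA sequence onto its aligned protein sequence.

    Each residue is replaced by its codon and each alignment gap by three gaps.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(dna) % 3 != 0 or n_residues != len(codons):
        raise ValueError("DNA length does not match the number of aligned residues")
    codon_iter = iter(codons)
    return "".join(gap * 3 if aa == gap else next(codon_iter) for aa in aligned_protein)


# Toy example: one gap in the protein alignment becomes a three-base gap.
codon_alignment = thread_codons("M-K", "ATGAAA")
```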
Summary statistics
Statistics were summarized from the alignments of the 4,434 loci for 10 captured samples and 7 species
with genome sequences available. The average length of coding regions,
average GC content and average pairwise distance were calculated using a custom Perl
script (statistics.pl, http://www.lmse.org/markersandtools.html) and the R package ape (Paradis,
et al. 2004). Consistency index (CI) and retention index (RI) were calculated using PAUP*
v4.0a.
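
For orientation, per-locus GC content and average pairwise distance can be computed along the following lines. This Python sketch is not statistics.pl. It assumes an uncorrected p-distance, because the text does not state the distance model, and it ignores gaps and ambiguous bases. All function names are hypothetical.

```python
from itertools import combinations

NUCLEOTIDES = set("ACGT")


def gc_content(seq):
    """Fraction of G and C among unambiguous bases (gaps and Ns are ignored)."""
    bases = [b for b in seq.upper() if b in NUCLEOTIDES]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


def p_distance(a, b):
    """Proportion of differing sites among sites unambiguous in both sequences."""
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in NUCLEOTIDES and y in NUCLEOTIDES:
            compared += 1
            diffs += x != y
    return diffs / compared if compared else float("nan")


def mean_pairwise_distance(alignment):
    """Average p-distance over all sequence pairs of one aligned locus."""
    dists = [p_distance(a, b) for a, b in combinations(alignment.values(), 2)]
    dists = [d for d in dists if d == d]  # drop NaN from non-overlapping pairs
    return sum(dists) / len(dists) if dists else float("nan")


# Toy aligned locus with three samples (hypothetical data).
locus = {"sample_a": "ACGT", "sample_b": "ACGA", "sample_c": "ACGT"}
mean_dist = mean_pairwise_distance(locus)
```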
Phylogenetic analysis
Because the flanking regions are highly variable, only coding regions without flanks were used for
phylogenetic inference. The aligned loci were concatenated with a custom Perl script
(concatnexus.pl, http://www.lmse.org/markersandtools.html). Phylogenetic trees were then
constructed using the maximum likelihood method implemented in ExaML v3 (Kozlov, et
al. 2015) under the GTRGAMMA model, with 100 bootstrap replicates to assess nodal
support.
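
Concatenation of per-locus alignments into a supermatrix can be sketched in Python as follows. This is an illustration rather than concatnexus.pl. It assumes that taxa missing from a locus are filled with gap characters, and the function name `concatenate` and the toy data are hypothetical.

```python
def concatenate(loci, gap="-"):
    """Join aligned loci into one supermatrix.

    loci: list of {taxon: aligned_sequence} dictionaries, one per locus.
    Returns ({taxon: concatenated_sequence}, [(start, end), ...]) where the
    partition coordinates are 1-based and inclusive.
    """
    taxa = sorted({taxon for locus in loci for taxon in locus})
    parts = {taxon: [] for taxon in taxa}
    partitions = []
    position = 0
    for locus in loci:
        lengths = {len(seq) for seq in locus.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a locus must have equal aligned length")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa absent from this locus are represented by gaps only.
            parts[taxon].append(locus.get(taxon, gap * length))
        partitions.append((position + 1, position + length))
        position += length
    return {taxon: "".join(p) for taxon, p in parts.items()}, partitions


supermatrix, partitions = concatenate([
    {"sp1": "ACG", "sp2": "ACT"},
    {"sp1": "GG"},
])
```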
ASTRAL v4.11.1 (Mirarab and Warnow 2015) was used to reconstruct the species tree.
Because of the lower limit on the number of input species for tree reconstruction in ASTRAL, loci with fewer than
four species were filtered out. Individual gene trees of each locus were reconstructed using
RAxML HPC-PTHREAD under the GTRGAMMA model (Stamatakis 2006). The gene
trees were then summarized into a species tree with ASTRAL v4.11.1 using default parameter
settings.
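
The taxon-count filter applied before gene tree estimation can be sketched in Python as follows. The function name, and the choice to count a species as present only when its sequence contains at least one character other than a gap, "?" or "N", are assumptions made for this example.

```python
def loci_with_enough_taxa(loci, min_taxa=4, missing="-?Nn"):
    """Keep loci in which at least min_taxa species have real sequence data.

    loci: {locus_name: {taxon: aligned_sequence}}
    """
    kept = {}
    for name, alignment in loci.items():
        present = [
            taxon for taxon, seq in alignment.items()
            if any(char not in missing for char in seq)
        ]
        if len(present) >= min_taxa:
            kept[name] = alignment
    return kept


# Toy input: locus_2 has only three species with data, so it is dropped.
toy_loci = {
    "locus_1": {"a": "AC", "b": "AC", "c": "AT", "d": "AC"},
    "locus_2": {"a": "AC", "b": "AC", "c": "--", "d": "AC"},
}
filtered = loci_with_enough_taxa(toy_loci)
```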
SNP calling
Reference sequences were generated from the aligned sequences by majority-rule
consensus using a custom Perl script (consensus.pl,
http://www.lmse.org/markersandtools.html). Trimmed reads were mapped to the reference with
BWA v0.7.15-r1140 (Li and Durbin 2009). Duplicates were marked with Picard MarkDuplicates
(http://broadinstitute.github.io/picard/). The GATK
Best Practices recommendations (McKenna, et al. 2010; DePristo, et al. 2011; Van der Auwera, et al.
2013) were then followed for local realignment, base quality score recalibration, and SNP
discovery and genotyping across all samples jointly, with standard hard filtering
parameters, using GATK-3.2.2 (McKenna, et al. 2010). Only the best-scoring SNP of each
locus was selected for downstream analyses, so that the retained SNPs could be treated as
unlinked (avoiding linkage disequilibrium among SNPs within a locus). The output VCF file was
converted to a genotype data file for principal
component analysis (PCA) by a custom script (vcftosnps.pl,
http://www.lmse.org/markersandtools.html).
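
Selecting one SNP per locus from a VCF file can be sketched in Python as follows. The sketch assumes that "best" means the highest QUAL value, that each reference sequence (the CHROM column) corresponds to one locus, and that only biallelic single-nucleotide variants are eligible. None of these criteria is specified in the text, and the function name is hypothetical.

```python
NUCLEOTIDES = set("ACGT")


def best_snp_per_locus(vcf_lines):
    """Return {locus: (position, QUAL)} for the highest-QUAL biallelic SNP per locus."""
    best = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        locus, pos, ref, alt, qual = fields[0], int(fields[1]), fields[3], fields[4], fields[5]
        # Indels, multi-allelic sites and missing alleles are not single bases.
        if ref not in NUCLEOTIDES or alt not in NUCLEOTIDES:
            continue
        if qual == ".":
            continue  # no quality score recorded
        qual = float(qual)
        if locus not in best or qual > best[locus][1]:
            best[locus] = (pos, qual)
    return best


# Toy VCF records (hypothetical data).
toy_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "locus1\t10\t.\tA\tG\t50.0\tPASS\t.",
    "locus1\t55\t.\tC\tT\t99.5\tPASS\t.",
    "locus1\t70\t.\tC\tCT\t500\tPASS\t.",
    "locus2\t5\t.\tG\tA,T\t80\tPASS\t.",
    "locus2\t8\t.\tG\tA\t30\tPASS\t.",
]
selected = best_snp_per_locus(toy_vcf)
```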
Principal component analysis (PCA)
PCA was performed with the R package ade4 (Dray and Dufour 2007), using the dudi.pca() command, to explore variability
among 16 Odontobutis individuals (four species).

DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, et al. 2011. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics:491-498.
Dray S, Dufour A-B. 2007. The ade4 Package: Implementing the Duality Diagram for Ecologists. Journal of Statistical Software.
Katoh K, Standley DM. 2013. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Molecular Biology and Evolution 30:772-780.
Kozlov AM, Aberer AJ, Stamatakis A. 2015. ExaML version 3: a tool for phylogenomic analyses on supercomputers. Bioinformatics 31:2577-2579.
Li C, Hofreiter M, Straube N, Corrigan S, Naylor GJ. 2013. Capturing protein-coding genes across highly divergent species. BioTechniques 54:321-326.
Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics:1754-1760.
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, et al. 2010. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research:1297-1303.
Paradis E, Claude J, Strimmer K. 2004. APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics:289-290.
Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics:2688-2690.
Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, et al. 2013. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics:11.10.1-11.10.33.
Yuan H, Jiang J, Jimenez FA, Hoberg EP, Cook JA, Galbreath KE, Li C. 2016. Target gene enrichment in the cyclophyllidean cestodes, the most diverse group of tapeworms. Molecular Ecology Resources 16:1095-1106.