
Supplementary Materials:

Supplementary File 1: Figures S1 - S5.

Supplementary File 2: Table S1.

Supplementary File 3: A user-friendly pipeline to retrieve target sequences of the 4,434 loci from new genome sequences or transcriptomes.

Supplementary File 4: Detailed materials and methods.

Supplementary File 1: Figures S1 - S5.

FIG. S1. Mining single-copy protein-coding markers through genome comparison using EvolMarkers (Li et al., 2012). The eight species compared are Anguilla anguilla, Tetraodon nigroviridis, Gadus morhua, Danio rerio, Oryzias latipes, Lepisosteus oculatus, Gasterosteus aculeatus and Oreochromis niloticus.

FIG. S2. Number of loci captured for each sample tested. The colors indicate the different projects carried out in the authors' laboratory.

FIG. S3. Average length of coding regions, GC content, pairwise distance, retention index and consistency index of the 4,434 loci. All statistics were summarized from 10 captured samples plus 7 species with genome sequences available.

FIG. S4. Species tree of four freshwater sleepers distributed in China, reconstructed using ASTRAL v4.11.1 based on 3,817 loci that have complete data in all species. Perccottus glenii was used to root the ingroup.

FIG. S5. Principal component analysis (PCA) of four freshwater sleepers (Odontobutis) derived from SNP data of 3,914 loci. One SNP site with the best score was chosen for each locus.

Supplementary File 2:

Table S1. List of samples used in testing the candidate markers.

# | Species | Sample ID | Family | Higher taxon | Project
1 | platorynchus | CL182 | Acipenseridae |  | Acipen
2 | sinensis | CL356 | Acipenseridae | Acipenseriformes | Acipen
3 | Acipenser schrenckii | CL357 | Acipenseridae | Acipenseriformes | Acipen
4 | Acipenser gueldenstaedti | CL358 | Acipenseridae | Acipenseriformes | Acipen
5 | huso | CL359 | Acipenseridae | Acipenseriformes | Acipen
6 | Huso dauricus | CL360 | Acipenseridae | Acipenseriformes | Acipen
7 | Acipenser ruthenus | CL504 | Acipenseridae | Acipenseriformes | Acipen
8 | Acipenser dabryanus | CL505 | Acipenseridae | Acipenseriformes | Acipen
9 | Polyodon spathula | CL129 | Polyodontidae | Acipenseriformes | Acipen
10 | Hiodon tergisus | CL124 | Hiodontidae |  | Basal
11 | Lepisosteus osseus | CL130 | Lepisosteidae | Lepisosteiformes | Basal
12 | Hiodon alosoides | CL132 | Hiodontidae | Osteoglossomorpha | Basal
13 | bicirrhosum | CL133 | Osteoglossidae | Osteoglossomorpha | Basal
14 | Elops saurus | CL134 |  |  | Basal
15 | vulpes | CL135 | Albulidae | Elopomorpha | Basal
16 | Anguilla rostrata | CL136 |  | Elopomorpha | Basal
17 | Lepisosteus platostomus | CL181 | Lepisosteidae | Lepisosteiformes | Basal
18 | ampullaceus | CL184 | Saccopharyngidae | Elopomorpha | Basal
19 | Elops saurus | CL330 | Elopidae | Elopomorpha | Basal
20 | Conger japonicus | CL544 |  | Elopomorpha | Basal
21 | Erpetoichthys calabaricus | CL628 | Polypteridae | Polypteriformes | Basal
22 | Polypterus endlicher | CL629 | Polypteridae | Polypteriformes | Basal
23 | Osteoglossum bicirrhosum | CL630 | Osteoglossidae | Osteoglossomorpha | Basal
24 | Ctenogobius boleosoma | 23CTBO | Gobionellidae | Gobiomorpharia | 
25 | Sicydium crenilabrum | 38SICR | Gobionellidae | Gobiomorpharia | Goby
26 | Stonogobiops nematodes | 19STNE |  | Gobiomorpharia | Goby
27 | Gobiosoma robustum | 16GORO | Gobiidae | Gobiomorpharia | Goby
28 | Mugilogobius cavifrons | 24MUCA | Gobionellidae | Gobiomorpharia | Goby
29 | Taenioides sp. | 35TASP | Gobionellidae | Gobiomorpharia | Goby
30 | Tridentiger bifasciatus | 27TRBI | Gobionellidae | Gobiomorpharia | Goby
31 | Periophthalmus kalolo | 32PEKA | Gobionellidae | Gobiomorpharia | Goby
32 | Rhinogobius giurinus | 28RHGI | Gobionellidae | Gobiomorpharia | Goby
33 | Odontamblyopus lacepedii | 37ODLA | Gobionellidae | Gobiomorpharia | Goby
34 | Acanthogobius ommaturus | 22ACOM | Gobionellidae | Gobiomorpharia | Goby
35 | Boleophthalmus pectinirostris | 34BOPE | Gobionellidae | Gobiomorpharia | Goby
36 | Valenciennea strigata | 18VAST | Gobiidae | Gobiomorpharia | Goby
37 | Periophthalmus modestus | 33PEMO | Gobionellidae | Gobiomorpharia | Goby
38 | Rhinogobius davidi | 29RHDA | Gobionellidae | Gobiomorpharia | Goby
39 | Tridentiger barbatus | 26TRBA | Gobionellidae | Gobiomorpharia | Goby
40 | Acanthogobius flavimanus | 21ACFL | Gobionellidae | Gobiomorpharia | Goby
41 | Mugilogobius abei | 25MUAB | Gobionellidae | Gobiomorpharia | Goby
42 | nana | 10KRNA | Butidae | Gobiomorpharia | Goby
43 | Trypauchen vagina | 36TRVA | Gobionellidae | Gobiomorpharia | Goby
44 | Valenciennea puellaris | 17VAPU | Gobiidae | Gobiomorpharia | Goby
45 | Amblyeleotris yanoi | 12AMYA | Gobiidae | Gobiomorpharia | Goby
46 | Lophiogobius ocellicauda | 30LOOC | Gobionellidae | Gobiomorpharia | Goby
47 | Amblyeleotris wheeleri | 13AMWH | Gobiidae | Gobiomorpharia | Goby
48 | Pomatoschistus microps | 14POMI | Gobiidae | Gobiomorpharia | Goby
49 | marmoratus | 11OXMA | Butidae | Gobiomorpharia | Goby
50 | Microdesmus dorsipunctatus | 20MIDO | Gobiidae | Gobiomorpharia | Goby
51 | Dormitator maculatus | 08DOMA |  | Gobiomorpharia | Goby
52 | Ecsenius bicolor | 43ECBI | Blenniidae | Ovalentariae | Goby
53 | pauliani | 04TYPA | Milyeringidae | Gobiomorpharia | Goby
54 | Odontobutis potamophila | 02ODPO | Odontobutidae | Gobiomorpharia | Goby
55 | Pterapogon kauderni | 41PTKA | Apogonoidae | Gobiomorpharia | Goby
56 | Sphaeramia orbicularis | 40SPOR | Apogonidae | Gobiomorpharia | Goby
57 | Oreochromis niloticus | 172ORNI | Cichlidae | Ovalentariae | Goby
58 | Salarias fasciatus | 42SAFA | Blenniidae | Ovalentariae | Goby
59 | Gobiomorus dormitor | 06GODO | Eleotridae | Gobiomorpharia | Goby
60 | Rhyacichthys aspro | 01RHAS | Rhyacichthyidae | Gobiomorpharia | Goby
61 | Eleotris acanthopoma | 07ELAC | Eleotridae | Gobiomorpharia | Goby
62 | koilomatondon | 09BUKO | Butidae | Gobiomorpharia | Goby
63 | veritas | 05MIVE | Milyeringidae | Gobiomorpharia | Goby
64 | Micropercops swinhonis | 03MISW | Odontobutidae | Gobiomorpharia | Goby
65 | chuatsi | 44SICH | Sinipercidae | Percomorpharia | Goby
66 | Kurtus gulliveri | 39KUGU | Kurtidae | Gobiomorpharia | Goby
67 | Denticeps | Denclu | Denticeptidae |  | Ostario
68 | elongata | Le_LQ |  | Clupeiformes | Ostario
69 | Rutilus rutilus | Rutilusrutilus | Cyprinidae |  | Ostario
70 | whiteheadi | CL935_1 | Sinipercidae | Percomorpharia | Sini
71 | Coreoperca whiteheadi | CL940_1 | Sinipercidae | Percomorpharia | Sini
72 | Coreoperca whiteheadi | CL940_2 | Sinipercidae | Percomorpharia | Sini
73 | Coreoperca whiteheadi | CL945_3 | Sinipercidae | Percomorpharia | Sini
74 | Coreoperca whiteheadi | CL958_1 | Sinipercidae | Percomorpharia | Sini
75 | Siniperca obscura | CL934_1 | Sinipercidae | Percomorpharia | Sini
76 | Siniperca obscura | CL937_1 | Sinipercidae | Percomorpharia | Sini
77 | Siniperca scherzeri | CL944_4 | Sinipercidae | Percomorpharia | Sini
78 | Siniperca undulata | CL946 | Sinipercidae | Percomorpharia | Sini
79 | Siniperca undulata | CL946_2 | Sinipercidae | Percomorpharia | Sini
80 | Siniperca kneri | CL949 | Sinipercidae | Percomorpharia | Sini
81 | Siniperca kneri | CL954 | Sinipercidae | Percomorpharia | Sini
82 | Siniperca kneri | CL957 | Sinipercidae | Percomorpharia | Sini
83 | Siniperca roulei | CL947 | Sinipercidae | Percomorpharia | Sini
84 | Siniperca roulei | CL961_2 | Sinipercidae | Percomorpharia | Sini
85 | Siniperca roulei | CL961_3 | Sinipercidae | Percomorpharia | Sini
86 |  | CL938_1 | Sinipercidae | Percomorpharia | Sini
87 | Siniperca chuatsi | CL942_1 | Sinipercidae | Percomorpharia | Sini
88 | Siniperca chuatsi | CL943_1 | Sinipercidae | Percomorpharia | Sini
89 | Siniperca chuatsi | CL951_1 | Sinipercidae | Percomorpharia | Sini
90 | Siniperca chuatsi | CL955_1 | Sinipercidae | Percomorpharia | Sini
91 | Pomoxis nigromaculatus | CL173 |  | Percomorpharia | Sini
92 | Psammoperca waigiensis | CL202 | Latidae | Carangimorphariae | Sini
93 | Niphon spinosus | CL212 | Serranidae | Percomorpharia | Sini
94 | Dicentrarchus labrax | CL301 | Moronidae | Percomorpharia | Sini
95 | Percichthys trucha | CL302 | Percichthyidae | Percomorpharia | Sini
96 | Lateolabrax japonicus | CL322 | Moronidae | Percomorpharia | Sini
97 | Micropterus salmoides | CL626 | Centrarchidae | Percomorpharia | Sini
98 | Trichiurus lepturus | CL627_1 | Trichiuridae | Percomorpharia | Sini
99 | Epinephelus awoara | CL647_2 | Serranidae | Percomorpharia | Sini

Supplementary File 3: Get4434Seq, a pipeline for retrieving sequences of the 4,434 targeted loci from species of the user's interest.

Get4434Seq is a Perl package for retrieving target sequences of the 4,434 loci from a species of interest by comparing the target database with user-provided genome sequences or transcriptomes.

1. The user picks, from the eight model fish, the species most closely related to the species of interest.

2. The 4,434 target sequences of the chosen model fish are used as queries in a BLAST search against the user-provided genome sequences or transcriptomes, and the best hit for each target is retrieved if it meets the preset coverage and identity requirements.

3. Paralogy is checked by reciprocal BLAST.

4. Best-hit sequences that map back to the target regions of the query genome are kept. If no qualified target is found for a locus, the target sequence of the query is used for bait design.
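The best-hit selection and reciprocal check described in the steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Get4434Seq code: it assumes BLAST was run with the default 12-column tabular output (-outfmt 6), the thresholds min_identity and min_coverage are hypothetical placeholders for the "preset" values, and the reciprocal step is simplified to a check that the best hit of each retrieved sequence points back to the same locus ID, whereas the actual pipeline checks the mapping position in the query genome.

```python
from collections import namedtuple

Hit = namedtuple("Hit", "query subject identity aln_len qstart qend bitscore")


def parse_blast_tab(lines):
    """Parse BLAST tabular output (-outfmt 6, default 12 columns)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        # Columns: qseqid sseqid pident length mismatch gapopen
        #          qstart qend sstart send evalue bitscore
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]),
                        int(f[6]), int(f[7]), float(f[11])))
    return hits


def best_hits(hits, query_lengths, min_identity=70.0, min_coverage=0.5):
    """Keep, per query, the highest-bitscore hit passing both thresholds.

    min_identity (percent) and min_coverage (fraction of the query length)
    are hypothetical defaults, not values taken from the pipeline.
    """
    best = {}
    for h in hits:
        coverage = (abs(h.qend - h.qstart) + 1) / query_lengths[h.query]
        if h.identity < min_identity or coverage < min_coverage:
            continue
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return best


def reciprocal_ok(forward_best, reverse_best):
    """Simplified paralogy check: keep a locus only if the best reverse
    hit of its retrieved sequence is that same locus."""
    return {q: h for q, h in forward_best.items()
            if h.subject in reverse_best
            and reverse_best[h.subject].subject == q}
```

Loci that drop out of the returned dictionary correspond to the fallback case in step 4, where the query's own target sequence would be used for bait design.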

The fish target database (sequences of the 4,434 loci of the eight model species) and Perl scripts are available at http://www.lmse.org/markersandtools.html.

Supplementary File 4: Detailed materials and methods.

DNA extraction

Total genomic DNA was extracted from fin or muscle tissue of samples using a Tissue DNA kit (Omega Bio-tek, Norcross, GA, USA) and quantified using a NanoDrop 3300 Fluorospectrometer (Thermo Fisher Scientific, Wilmington, DE, USA).

Bait design and synthesis

For the a priori bait design and synthesis, biotinylated RNA baits of 120 bp (MYcroarray, Ann Arbor, Michigan, USA) were synthesized with 2× tiling based on the sequences of Oreochromis niloticus (used in the “Goby” and “Sini” projects) and Lepisosteus oculatus (used in the “Basal”, “Acipen” and “Ostario” projects). For the baits designed on O. niloticus, loci longer than 100 bp were targeted, and baits shorter than 120 bp were padded with “Ts” at the 3’ end. For the baits designed on L. oculatus, loci longer than 120 bp were targeted. The bait design was afterwards refined to 4,434 loci (see the main text) based on the gene capture results, and the refined baits were synthesized at MYcroarray.
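As a rough illustration of 120 bp baits with 2× tiling and T-padding, the sketch below cuts a target sequence into overlapping windows. It rests on assumptions that the text does not state: 2× tiling is taken to mean a step of half the bait length (60 bp), and a final bait is anchored at the 3' end when the regular windows do not reach it. The function name tile_baits is hypothetical.

```python
def tile_baits(seq, bait_len=120, tiling=2, pad="T"):
    """Return overlapping baits covering seq.

    Assumes a step of bait_len // tiling between bait starts, so each base
    in the interior of a long locus is covered by about `tiling` baits.
    """
    seq = seq.upper()
    step = bait_len // tiling
    if len(seq) <= bait_len:
        # Short locus: a single bait, padded with Ts at the 3' end.
        return [seq + pad * (bait_len - len(seq))]
    starts = list(range(0, len(seq) - bait_len + 1, step))
    if starts[-1] + bait_len < len(seq):
        # Assumption: anchor one extra bait at the 3' end of the locus.
        starts.append(len(seq) - bait_len)
    return [seq[s:s + bait_len] for s in starts]
```

For example, a 100 bp locus yields one bait whose last 20 positions are padding, while a 300 bp locus yields four baits starting at positions 0, 60, 120 and 180.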

Library preparation, gene capture and sequencing

The genomic DNA was sheared to approximately 250 bp using a Covaris E220 Focused-ultrasonicator (Covaris, Woburn, USA). Subsequently, 350-500 ng of the sheared DNA from each sample was used to construct a library. Blunt-end repair, adapter ligation, fill-in, pre-hybridization PCR and double target gene enrichment steps mainly followed the protocol of cross-species gene capture (Li, et al. 2013). The enriched libraries were amplified with 8 bp indexed primers, and the concentration of the products was measured using a NanoDrop 3300 Fluorospectrometer. The products were pooled equimolarly and sequenced on an Illumina HiSeq 2500 platform (Illumina, Inc, San Diego, CA, USA) with other samples from the same or other projects.

Data assembly

The raw reads were parsed into separate files for each species according to the 8 bp indices on the adapter. Read assembly mainly followed the pipeline of Yuan et al. (2016). The final output comprises three FASTA files: coding regions with flanks, coding regions without flanks, and intron sequences. The sequences without flanking regions were used for subsequent analyses.
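Parsing reads by their 8 bp index can be sketched as below. This is a minimal illustration, not the script used in the study. It assumes that each FASTQ header ends with the index sequence after the last colon (as in Illumina CASAVA 1.8+ style headers), that records are well-formed four-line blocks, and that only exact index matches are accepted. The function names are hypothetical.

```python
from collections import defaultdict


def read_fastq(lines):
    """Yield (header, sequence, quality) from well-formed 4-line records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it)
        next(it)  # the '+' separator line
        qual = next(it)
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


def demultiplex(fastq_lines, index_to_sample, index_len=8):
    """Group reads by sample using the index at the end of the header.

    Reads whose index is not in index_to_sample go to 'undetermined'.
    """
    bins = defaultdict(list)
    for header, seq, qual in read_fastq(fastq_lines):
        index = header.rsplit(":", 1)[-1][:index_len]
        sample = index_to_sample.get(index, "undetermined")
        bins[sample].append((header, seq, qual))
    return dict(bins)
```

Exact matching is the simplest choice; tolerating one mismatch per index would recover more reads at the cost of a small risk of misassignment.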

Sequence alignment

Each locus was aligned at the amino acid (AA) level using MAFFT v7 (Katoh and Standley 2013) with default parameter settings (mafft_AA.pl). The aligned AA sequences were then translated back to DNA sequences using aa2dna_align.pl.
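Converting an amino acid alignment back to a DNA alignment amounts to threading each unaligned coding sequence through its aligned protein, placing three gap characters wherever the protein has a gap. The sketch below shows this idea; it is an assumption about the general technique rather than a description of aa2dna_align.pl, and the function name thread_codons is hypothetical.

```python
def thread_codons(aligned_protein, cds):
    """Map an unaligned coding sequence onto its aligned protein.

    Each residue consumes one codon from cds; each '-' in the protein
    becomes '---' in the output. Extra trailing bases in cds (for example
    a stop codon) are ignored.
    """
    cds = cds.upper()
    n_residues = sum(1 for c in aligned_protein if c != "-")
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is too short for the protein")
    out = []
    pos = 0
    for c in aligned_protein:
        if c == "-":
            out.append("---")
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)
```

Aligning at the protein level and threading codons afterwards keeps gaps in multiples of three, so the reading frame is preserved in the DNA alignment.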

Summary statistics

Statistics were summarized from the 4,434 aligned loci of 10 captured samples and 7 species with genome sequences available. The average length of coding regions, average GC content and average pairwise distance were calculated using a custom Perl script (statistics.pl, http://www.lmse.org/markersandtools.html) and the R package ape (Paradis, et al. 2004). The consistency index (CI) and retention index (RI) were calculated using PAUP* v4.0a.
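For readers who want to reproduce per-locus GC content and pairwise distance without the Perl script, a minimal Python sketch follows. It assumes an uncorrected p-distance (proportion of differing sites), which may differ from the distance model used in the study, and it ignores alignment columns where either sequence has a gap or an ambiguous base.

```python
from itertools import combinations

BASES = set("ACGT")


def gc_content(seq):
    """Fraction of G and C among unambiguous bases (gaps excluded)."""
    bases = [b for b in seq.upper() if b in BASES]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


def p_distance(a, b):
    """Uncorrected distance over sites where both sequences have A/C/G/T."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in BASES and y in BASES]
    if not pairs:
        return None  # no comparable sites
    return sum(x != y for x, y in pairs) / len(pairs)


def mean_pairwise_distance(alignment):
    """Average p-distance over all sequence pairs of one aligned locus.

    alignment maps taxon name to aligned sequence.
    """
    dists = [p_distance(a, b)
             for a, b in combinations(alignment.values(), 2)]
    dists = [d for d in dists if d is not None]
    return sum(dists) / len(dists) if dists else None
```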

Phylogenetic analysis

Because the flanking regions are highly variable, only coding regions without flanks were used for phylogenetic inference. The aligned loci were concatenated using a custom Perl script (concatnexus.pl, http://www.lmse.org/markersandtools.html). Phylogenetic trees were then constructed using the maximum likelihood method implemented in ExaML v3 (Kozlov, et al. 2015) under the GTRGAMMA model, with 100 bootstrap replicates to assess nodal support.
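Concatenating aligned loci into a supermatrix can be sketched as follows. This is an illustration of the general approach rather than concatnexus.pl itself: it assumes that a taxon missing from a locus is filled with gap characters, and it returns 1-based partition boundaries so that each locus can still be identified in the concatenated matrix.

```python
def concatenate(loci):
    """Concatenate aligned loci into one supermatrix.

    loci is a list of dicts mapping taxon name to aligned sequence.
    Returns (matrix, partitions), where matrix maps taxon to the
    concatenated sequence and partitions lists (start, end) per locus,
    1-based and inclusive.
    """
    taxa = sorted({t for locus in loci for t in locus})
    pieces = {t: [] for t in taxa}
    partitions = []
    start = 1
    for locus in loci:
        lengths = {len(s) for s in locus.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a locus differ in length")
        n = lengths.pop()
        for t in taxa:
            # Assumption: absent taxa are represented by gaps.
            pieces[t].append(locus.get(t, "-" * n))
        partitions.append((start, start + n - 1))
        start += n
    return {t: "".join(p) for t, p in pieces.items()}, partitions
```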

ASTRAL v4.11.1 (Mirarab and Warnow 2015) was used to reconstruct the species tree. Because ASTRAL requires a minimum number of input species for tree reconstruction, loci with fewer than four species were filtered out. Individual gene trees for each locus were reconstructed using RAxML HPC-PTHREAD under the GTRGAMMA model (Stamatakis 2006). The gene trees were then summarized into a species tree with ASTRAL v4.11.1 using default parameter settings.
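The locus filter can be expressed in a few lines of Python. The sketch assumes each locus is held as a mapping from taxon name to aligned sequence and counts a taxon as present only if its sequence contains at least one character other than a gap or '?'. The names used here are hypothetical.

```python
def n_taxa_with_data(alignment):
    """Count taxa whose sequence is not entirely gaps or missing data."""
    return sum(1 for seq in alignment.values()
               if any(c not in "-?" for c in seq))


def filter_loci(loci, min_taxa=4):
    """Keep loci represented by at least min_taxa taxa.

    loci maps locus name to an alignment (taxon name to sequence).
    """
    return {name: aln for name, aln in loci.items()
            if n_taxa_with_data(aln) >= min_taxa}
```

Four is the natural threshold for a quartet-based summary method, since an unrooted gene tree with fewer than four tips carries no topological information.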

SNP calling

Reference sequences were generated from the aligned sequences by majority-rule consensus using a custom Perl script (consensus.pl, http://www.lmse.org/markersandtools.html). Trimmed reads were mapped to the reference using BWA v0.7.15-r1140 (Li and Durbin 2009), and duplicates were marked with Picard MarkDuplicates (http://broadinstitute.github.io/picard/). The GATK Best Practices recommendations (McKenna, et al. 2010; DePristo, et al. 2011; Van der Auwera, et al. 2013) were then followed for local realignment, base quality score recalibration, and SNP discovery and genotyping across all samples jointly, applying standard hard-filtering parameters in GATK-3.2.2 (McKenna, et al. 2010). Only the best-scoring SNP of each locus was selected for downstream analyses, so that the retained SNPs could be treated as unlinked and linkage disequilibrium among sites within a locus was avoided. The output VCF file was converted to a genotype data file format for principal component analysis (PCA) using a custom script (vcftosnps.pl, http://www.lmse.org/markersandtools.html).
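Selecting one SNP per locus from a VCF file can be sketched as below. Several points are assumptions rather than details given in the text: each reference contig (the CHROM column) is treated as one locus, "best" is interpreted as the highest QUAL value, only biallelic single-base substitutions that passed filtering are considered, and ties are resolved in favor of the first site encountered.

```python
BASES = {"A", "C", "G", "T"}


def best_snp_per_locus(vcf_lines):
    """Return {locus: position} of the highest-QUAL SNP for each locus."""
    best = {}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
        f = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt, qual, flt = (f[0], int(f[1]), f[3],
                                           f[4], f[5], f[6])
        if ref not in BASES or alt not in BASES:
            continue  # not a biallelic single-base substitution
        if qual == "." or flt not in ("PASS", "."):
            continue  # no score, or failed a filter
        q = float(qual)
        if chrom not in best or q > best[chrom][1]:
            best[chrom] = (pos, q)
    return {locus: pos for locus, (pos, _) in best.items()}
```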

Principal component analysis (PCA)

PCA was performed with the R package ade4 (Dray and Dufour 2007), using the dudi.pca() command, to explore variability among 16 Odontobutis individuals (four species).

References

DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, et al. 2011. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics:491-498.

Dray S, Dufour A-B. 2007. The ade4 Package: Implementing the Duality Diagram for Ecologists. Journal of Statistical Software.

Katoh K, Standley DM. 2013. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Molecular Biology and Evolution 30:772-780.

Kozlov AM, Aberer AJ, Stamatakis A. 2015. ExaML version 3: a tool for phylogenomic analyses on supercomputers. Bioinformatics 31:2577-2579.

Li C, Hofreiter M, Straube N, Corrigan S, Naylor GJ. 2013. Capturing protein-coding genes across highly divergent species. BioTechniques 54:321-326.

Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics:1754-1760.

McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, et al. 2010. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research:1297-1303.

Paradis E, Claude J, Strimmer K. 2004. APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics:289-290.

Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics:2688-2690.

Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, et al. 2013. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics:11.10(11-33).

Yuan H, Jiang J, Jimenez FA, Hoberg EP, Cook JA, Galbreath KE, Li C. 2016. Target gene enrichment in the cyclophyllidean cestodes, the most diverse group of tapeworms. Molecular Ecology Resources 16:1095-1106.