A Modular Sequence Capture Probe Set for Phylogenomics And
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/825307; this version posted October 31, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 1 FrogCap: A modular sequence capture probe set for phylogenomics and population genetics for all frogs, assessed across multiple phylogenetic scales Carl R. Hutter1,2*, Kerry A. Cobb3, Daniel M. Portik4, Scott L. Travers1, Perry L. Wood, Jr.3 and Rafe M. Brown1 1 Biodiversity Institute and Department of Ecology and Evolutionary Biology, University of Kansas, Lawrence, KS 66045, USA. 2 Museum of Natural Sciences and Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA. 3Department of Biological Sciences & Museum of Natural History, Auburn University, Auburn, Alabama 36849, USA. 4 California Academy of Sciences, San Francisco, CA 94118, USA *Corresponding author and current address: Carl R. Hutter Museum of Natural Sciences and Department of Biological Sciences Louisiana State University Baton Rouge, LA 70803, USA E-mail: [email protected] bioRxiv preprint doi: https://doi.org/10.1101/825307; this version posted October 31, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 2 ABSTRACT Despite the increasing use of high-throughput sequencing in phylogenetics, many phylogenetic relationships remain difficult to resolve because of conflict between gene trees and species trees. Selection of different types of markers (i.e. protein-coding exons, non-coding introns, ultra- conserved elements) is becoming important to alleviate these phylogenomic challenges. For evolutionary studies in frogs, we introduce the new publicly available FrogCap suite of genomic resources, which is a large and flexible collection of probes corresponding to ~15,000 markers that unifies previous frog sequencing work. FrogCap is designed to be modular, such that subsets of markers can be selected based on the phylogenetic scale of the intended study. FrogCap uses a variety of molecular marker types that include newly obtained exons and introns, previously sequenceD UCEs, and Sanger-sequencing markers, which span a range of alignment lengths (100–12,000 base pairs). We tested three probe sets from FrogCap using 105 samples across five phylogenetic scales, comparing probes designed using a consensus- or genome-based approach. We also tested the effects of using different bait kit sizes on depth of coverage and missing data. We found that larger bait kits did not result in lowered depth of coverage or increased missing data. We also Found that sensitivity, specificity, and missing data are not related to genetic distance in the consensus-based probe design, suggesting that this approach has greater success and overcomes a major hurdle in probe design. We observed sequence capture success (in terms of missing data, quantity of sequence data, recovered marker length, and number of informative sites) and compared them at all phylogenetic scales. The incorporation of different molecular marker types allowed recovery of the variation required for resolving difficult phylogenetic relationships and for performing population genetic studies. Altogether, FrogCap is a valuable and adaptable resource for performing high-throughput sequencing projects across variable timescales. Key words: amphibians, Anura, exon capture, genomics, target capture, UCEs bioRxiv preprint doi: https://doi.org/10.1101/825307; this version posted October 31, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 3 INTRODUCTION The widespread adoption of high-throughput DNA sequencing technologies to address questions in evolutionary biology has revealed new challenges for analyzing these expansive genomic datasets (Shendure & Ji 2008; Kircher & Kelso 2010; Jones & Good 2016). Despite the increased ability to collect large phylogenomic datasets, many phylogenetic relationships remain ambiguous because of both biological and non-biological processes that can lead to gene trees that are discordant with the species tree (Knowles 2009; Liu et al. 2009; Hobolth et al. 2011; Blom et al. 2017; Hahn & Nakhleh 2016; Reddy et al. 2017; Knowles et al. 2018; Richards et al. 2018). Further exacerbating this challenge, data from different genomic regions can demonstrate conflicting phylogenomic results (Liu et al. 2009; Lemmon & Lemmon 2013; Chen et al. 2017; Reddy et al. 2017). Despite the decreasing costs of obtaining sequence data, selecting the best sequencing methods and molecular markers for particular projects remains challenging. Although genome sequencing for non-model organisms remains costly, high-throughput sequencing of targeted genomic regions has resulted in numerous approaches for creating expansive phylogenomic and population genetic datasets at reduceD financial costs (Kircher & Kelso 2010; Glenn 2011; Rohland & Reich 2012). These approaches aim to affordably sequence reduced portions of the genome and also allow more individuals and/or species to be sampled simultaneously (Rohland & Reich 2012). An important consideration for systematics and population genetics is balancing the trade-off between obtaining orthologous markers across moderate to deep timescales, while obtaining markers that are variable enough to resolve ambiguous phylogenetic relationships and provide high-quality variants for population genetics (Sulonen et al. 2011; Jones & Good 2016). bioRxiv preprint doi: https://doi.org/10.1101/825307; this version posted October 31, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 4 There are now many approaches for sequencing subsets of an organism’s genome, and the choice of method largely depends on the phylogenetic breadth required and the number of samples to include. The most common methods for obtaining subsets of genome-wide sequence data include: (1) restriction-associated digestion methods (RADseq) that target markers adjacent to restriction enzyme sites (Miller et al. 2007); (2) targeted sequence capture (HybSeq) methods that target genomic regions through hybridization-based sequence capture (Hancock-Hanser et al. 2013; Jones & Good 2016), including exon capture; and (3) transcriptome sequencing (RNASeq) that target the expressed exome of a sampleD tissue type (Wang et al. 2009). Exome capture is another approach similar to HybSeq except it is highly specialized for a reduceD taxonomic scale, because probes are designed from transcriptomes/genomes within a particular focal group (Choi et al. 2009; Sulonen et al. 2011). Each of these three methods have been used to address a variety of phylogenetic and population genetic questions and have method-specific advantages and disadvantages (reviews in Lemmon & Lemmon 2013 and McCormack et al. 2013), including tradeoffs regarding costs, phylogenetic scale, and availability of initial genomic resources necessary for study design and development. Phylogenomic data obtained from each of these methods has the potential to resolve difficult phylogenetic relationships, yet many relationships remain ambiguous. Numerous studies suggest that natural processes such as incomplete lineage sorting, horizontal gene transfer, gene loss or duplication, and hybridization can lead to gene trees that are discordant with the species tree (Maddison 1997; Liu et al. 2009; Knowles 2009; Knowles et al. 2018). Additionally, natural selection of protein coding and other selected regions can be problematic for phylogenetic inference if phylogenetic signal is reduceD as a result of selection (Castoe et al. 2009; Edwards 2009; Liu et al. 2009; Hobolth et al. 2011). Discordant relationships can also be caused by non- bioRxiv preprint doi: https://doi.org/10.1101/825307; this version posted October 31, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 5 biological mechanisms, such as gene tree estimation error resulting from model inadequacy, short alignment lengths, low levels of phylogenetic informativeness, anomaly zones, or errors in sequence assembly or alignment (Xi et al. 2015; Hahn & Nakhleh 2016; Blom et al. 2017; Reddy et al. 2017; Richards et al. 2018). As a result, selecting the best method and types of molecular markers to address these issues remains challenging. Selection of different types of molecular markers (e.g. protein-coding exons, non-coding regions, ultra-conserved elements [UCEs], or single nucleotide polymorphisms [SNPS]) is an increasingly important issue when dealing with the challenges of analyzing large phylogenomic datasets. Each phylogenomic marker type may be useful at various timescales and may present different biases for analyses (Lemmon & Lemmon 2013; McCormack et al. 2013; Hosner et al. 2016).