bioRxiv preprint doi: https://doi.org/10.1101/765610; this version posted September 11, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

PHYLOGENOMIC CONFLICT IN HYLARANA

Exons, Introns, and UCEs Reveal Conflicting Phylogenomic Signals in a Rapid Radiation of Frogs (Ranidae: Hylarana)

Kin Onn Chan1,2,*, Carl R. Hutter2, Perry L. Wood, Jr.3, L. Lee Grismer4, Rafe M. Brown2

1 Lee Kong Chian Natural History Museum, Faculty of Science, National University of Singapore, 2 Conservatory Drive, Singapore 117377. Email: [email protected]

2 Biodiversity Institute and Department of Ecology and Evolutionary Biology, University of Kansas, Lawrence, KS 66045, USA. Email: [email protected]; [email protected]

3 Department of Biological Sciences & Museum of Natural History, Auburn University, Auburn, Alabama 36849, USA. Email: [email protected]

4 Herpetology Laboratory, Department of Biology, La Sierra University, 4500 Riverwalk Parkway, Riverside, California 92505, USA. Email: [email protected]

*Corresponding author

Abstract. Numerous types of genomic markers have been used to resolve recalcitrant nodes, yet their relative performance and congruence have rarely been compared directly.
Using target-capture sequencing, we obtained more than 12,000 highly informative exons and introns, including ~600 UCEs, to address long-standing systematic problems in Southeast Asian Golden-backed frogs of the genus complex Hylarana. To reduce gene tree estimation errors, we filtered the data using different thresholds of taxon completeness and parsimony informative sites (PIS), in addition to using the best-fit models of DNA evolution to estimate individual single-locus gene trees. We then estimated species trees using concatenation (IQ-TREE), summary coalescent (ASTRAL), and distance-based methods (ASTRID). Topological incongruence among these methods and variation in nodal support were examined in detail using a suite of different measures including quartet frequencies, bootstrap, local posterior probabilities, gene concordance factors, and quartet scores. Results showed that high levels of incongruence were present along the backbone of the phylogeny, specifically surrounding short internodes. We also demonstrated that filtering data by PIS was more effective at improving congruence than filtering by missing data, and that exons were more sensitive to data filtering than introns and UCEs. Despite utilizing more than 6.9 million characters and 2.7 million PIS, analyses failed to converge on a single concordant topology. Instead, exons, introns, and UCEs produced strongly supported yet conflicting phylogenetic signals, which affected our phylogeny estimates at different scales/levels, indicating a general, potentially alarming challenge for phylogenomic inference employing many of today's massive datasets. Additionally, bootstrap values were consistently high despite low levels of congruence and high proportions of gene trees that support conflicting topologies, indicating that traditional bootstraps are likely poor measures of congruence or branch support in large phylogenomic datasets, especially during instances of rapid diversification. Although low bootstrap values do ostensibly reflect low heuristic support, we recommend that high bootstrap support obtained from large genomic datasets be interpreted with caution. Additional complementary measures such as quartet frequencies, gene concordance factors, quartet scores, and posterior probabilities can be useful to provide a more robust and accurate representation of bipartition certainty and, ultimately, of the evolutionary history of incompletely resolved or poorly understood clades.

Keywords: FrogCap, bootstrap, branch support, incongruence, quartet frequency, gene concordance factor

Generating large amounts of data is no longer an issue in the era of phylogenomics. Instead, limitations are imposed by model complexities (parameter space) and computational tractability. Furthermore, analyzing genome-scale data has revealed a different suite of challenges including high levels of incongruence, conflicting evolutionary histories, and systematic bias (Gee 2003; Phillips et al. 2004; Philippe et al. 2011; Delsuc et al. 2005; Philippe et al. 2005; Jeffroy et al. 2006; Galtier and Daubin 2008; Dell’Ampio et al. 2014; Smith et al. 2015; Zhang et al. 2015; Leaché et al. 2015; Kendall and Colijn 2016; Crowl et al. 2017; Reddy et al. 2017; Platt et al. 2018; Pease et al. 2018; Roycroft et al. 2019).
It is therefore important to find a “sweet spot” that optimizes the shifting trade-off between amount of data and analytical resources without compromising the accuracy of inferences. As such, understanding the impacts of data filtering/subsampling strategies and performing robust assessments of analytical methods and the accuracy of species tree inferences are integral components of the rapidly expanding field.

Incongruence can arise not only from biological processes such as hybridization, horizontal gene transfer, and incomplete lineage sorting that violate the assumption of orthology (Whitfield and Lockhart 2007; Whitfield and Kjer 2008; Eaton et al. 2015; Meiklejohn et al. 2016; Tarver et al. 2016; Ottenburghs et al. 2017; Léveillé-Bourret et al. 2018), but also through systematic biases associated with the analysis of large datasets. Gene tree estimation errors (GTEE) resulting from (but not limited to) model misspecification or insufficient phylogenetic signal can increase noise and affect phylogenetic inference (Roure et al. 2013; Doyle et al. 2015; Roch and Warnow 2015; Vachaspati and Warnow 2015; Blom et al. 2017; Molloy and Warnow 2017; Nute et al. 2018). Due to different underlying models and assumptions, different analytical methods such as concatenation, distance-based, and coalescent-based summary methods can also produce variable results.
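To make the filtering strategies mentioned above concrete, the following minimal sketch counts parsimony informative sites (PIS) per locus and keeps only loci that pass a PIS threshold and a taxon-completeness threshold. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the toy alignments, and the threshold values are all hypothetical, and a site is treated as informative when at least two nucleotide states each occur in at least two taxa (gaps and ambiguity codes are ignored).

```python
from collections import Counter

VALID = set("ACGT")  # gaps and ambiguity codes are ignored (an assumption of this sketch)


def count_pis(alignment):
    """Count parsimony informative sites in a dict of taxon -> aligned sequence."""
    seqs = [s.upper() for s in alignment.values()]
    n_pis = 0
    for column in zip(*seqs):
        counts = Counter(base for base in column if base in VALID)
        # Informative: at least two states, each observed in at least two taxa.
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            n_pis += 1
    return n_pis


def taxon_completeness(alignment, n_taxa_total):
    """Fraction of all sampled taxa with at least one unambiguous base at this locus."""
    present = sum(
        1 for s in alignment.values() if any(b in VALID for b in s.upper())
    )
    return present / n_taxa_total


def filter_loci(loci, n_taxa_total, min_pis, min_completeness):
    """Return names of loci passing both (illustrative) thresholds."""
    return [
        name
        for name, aln in loci.items()
        if count_pis(aln) >= min_pis
        and taxon_completeness(aln, n_taxa_total) >= min_completeness
    ]


# Hypothetical toy data: two loci, four taxa.
loci = {
    "locus_A": {"sp1": "ACGTAC", "sp2": "ACGTAC", "sp3": "ATGTTC", "sp4": "ATGTTC"},
    "locus_B": {"sp1": "ACGTAC", "sp2": "ACGTAC", "sp3": "ACGTAT", "sp4": "------"},
}
kept = filter_loci(loci, n_taxa_total=4, min_pis=2, min_completeness=0.75)
print(kept)
```

In this toy case locus_A has two informative columns and full taxon sampling, whereas locus_B has none (its only variable site is a singleton), so only locus_A is retained. Real thresholds would need to be chosen and evaluated empirically, as the study does by comparing several filtering levels.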
Several studies have argued that concatenation can perform as well as or better than summary methods, which may be adversely affected by high GTEE (Gatesy and Springer 2014; Simmons and Gatesy 2015; Tonini et al. 2015). Conversely, concatenation analyses have also been shown to fail or produce spuriously high support for the wrong tree (Weisrock et al. 2012; Wielstra et al. 2014; Roch and Steel 2015; Warnow 2015; Molloy and Warnow 2017; Mendes and Hahn 2018). Although it is widely acknowledged that GTEE is an important analytical challenge, potentially affecting species tree estimation, recent studies suggest that distance- and summary-based methods that are statistically consistent under the multispecies coalescent (MSC) model may perform better under a wide range of model conditions, and that they have the potential to produce low error rates when many genes are available and GTEE is low (Bayzid and Warnow 2013; Patel 2013; Lanier and Knowles 2015; Roch and Warnow 2015; Mirarab et al. 2016; Baca et al. 2017; Molloy and Warnow 2017; Nute et al. 2018; Vachaspati and Warnow 2018). Therefore, if large numbers of gene trees can be estimated with low GTEE, the power of coalescent-based methods can be harnessed to estimate species trees with high accuracy.

Analyses of massive gene sequence datasets have also demonstrated how traditional measures of support such as the non-parametric bootstrap and posterior probabilities can be positively misleading (Phillips et al. 2004; Seo 2008; Wiens and Morrill 2011; Kumar et al. 2012; Weisrock et al. 2012; Yang and Zhu 2018; Roycroft et al.
2019). Resampling methods such as non-parametric bootstrapping essentially measure site-sampling variance as opposed to observed variance in the data. Because site-sampling variance is an inverse function of sample size (amount of data), bootstrap values will inevitably inflate as the amount of data increases (Felsenstein 1985; Kumar et al. 2012); this tendency does not necessarily reflect variation in the data themselves. Bayesian posterior probabilities, in turn, are computationally expensive to calculate and can also produce spuriously high support in big datasets (Susko 2008; Yang and Zhu 2018). As genome-scale datasets become more common, more robust characterizations of uncertainty are needed to tease apart conflict from true signal strength (Gadagkar et al.