Inadvertent Paralog Inclusion Drives Artifactual Topologies and Timetree Estimates in Phylogenomics Article Open Access
Total Page:16
File Type:pdf, Size:1020Kb
Inadvertent Paralog Inclusion Drives Artifactual Topologies and Timetree Estimates in Phylogenomics Karen Siu-Ting,*,1,2,3 Marıa Torres-Sanchez,‡,4 Diego San Mauro,4 David Wilcockson,5 Mark Wilkinson,6 Davide Pisani,7 Mary J. O’Connell,8,9 and Christopher J. Creevey*,1 1Institute for Global Food Security, School of Biological Sciences, Queen’s University Belfast, Belfast, United Kingdom 2School of Biotechnology, Dublin City University, Glasnevin, Dublin, Ireland 3 Dpto. de Herpetologıa, Museo de Historia Natural, Universidad Nacional Mayor de San Marcos, Lima, Peru Downloaded from https://academic.oup.com/mbe/article-abstract/36/6/1344/5418531 by University of Leeds Medical & user on 10 June 2019 4Department of Biodiversity, Ecology, and Evolution, Complutense University of Madrid, Madrid, Spain 5Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, United Kingdom 6Department of Life Sciences, Natural History Museum, London, United Kingdom 7Life Sciences Building, University of Bristol, Bristol, United Kingdom 8School of Biology, Faculty of Biological Sciences, University of Leeds, Leeds, United Kingdom 9School of Life Sciences, University of Nottingham, University Park, United Kingdom ‡Present address: Department of Neuroscience, Spinal Cord and Brain Injury Research Center and Ambystoma Genetic Stock Center, University of Kentucky, Lexington, KY *Corresponding authors: E-mails: [email protected]; [email protected]. Associate editor: Fabia Ursula Battistuzzi Abstract Increasingly, large phylogenomic data sets include transcriptomic data from nonmodel organisms. This not only has allowed controversial and unexplored evolutionary relationships in the tree of life to be addressed but also increases the risk of inadvertent inclusion of paralogs in the analysis. Although this may be expected to result in decreased phyloge- netic support, it is not clear if it could also drive highly supported artifactual relationships. Many groups, including the hyperdiverse Lissamphibia, are especially susceptible to these issues due to ancient gene duplication events and small numbers of sequenced genomes and because transcriptomes are increasingly applied to resolve historically conflicting taxonomic hypotheses. We tested the potential impact of paralog inclusion on the topologies and timetree estimates of the Lissamphibia using published and de novo sequencing data including 18 amphibian species, from which 2,656 single- copy gene families were identified. A novel paralog filtering approach resulted in four differently curated data sets, which were used for phylogenetic reconstructions using Bayesian inference, maximum likelihood, and quartet-based supertrees. We found that paralogs drive strongly supported conflicting hypotheses within the Lissamphibia (Batrachia and Procera) Article and older divergence time estimates even within groups where no variation in topology was observed. All investigated methods, except Bayesian inference with the CAT-GTR model, were found to be sensitive to paralogs, but with filtering convergence to the same answer (Batrachia) was observed. This is the first large-scale study to address the impact of orthology selection using transcriptomic data and emphasizes the importance of quality over quantity particularly for understanding relationships of poorly sampled taxa. Key words: phylogenomics, orthology, paralogy, Lissamphibia, timetree. Introduction transcriptomic data for previously unsampled organisms. It has often been argued that including ever larger numbers of Although the technologies and types of data have moved genes contributes to more robust phylogenetic analyses forward, little has changed in the practical approach to ortho- (Graybeal 1998; Kim 1998; Rokas and Carroll 2005; Philippe log detection since the days when phylogenetic reconstruc- et al. 2011; San Mauro et al. 2012; Chen et al. 2015; Irisarri et al. tion was dominated by Sanger sequencing (Koonin 2005; 2017) motivating the use of phylogenomic-scale data to ad- Gabaldon 2008). This has consequences for several stages of dresslong-standingcontroversiesacrossalllineagesofthe the phylogenomic analytical pipeline because many assump- tree of life (Delsuc et al. 2005; Thomson and Shaffer 2010; tions that could previously have been made about the Giribet 2016). This has been driven in practice primarily by sequence data may no longer hold (da Fonseca et al. 2016). access to cheaper and higher-throughput sequencing For instance, transcriptomic data represent a snapshot in technologies generating both genomic and, increasingly, time of the genes expressed by an organism in the tissue ß The Author(s) 2019. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/ licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Open Access 1344 Mol. Biol. Evol. 36(6):1344–1356 doi:10.1093/molbev/msz067 Advance Access publication March 23, 2019 Paralog Inclusion in Phylogenomics . doi:10.1093/molbev/msz067 MBE sampled, this means that highly expressed genes may have extant species. For genes where duplicate copies were disad- more reads allowing better reconstruction of their transcripts vantageous (for instance because of gene dosage effects compared with more lowly expressed genes. Furthermore, the [Canestro~ et al. 2013; Gout and Lynch 2015]), loss is likely absence of a gene from a transcriptomic data set may either to have occurred early—defined here as “early loss” events. It mean it was not expressed at the time or tissue of sampling, is possible that early loss events occurred prior to any subse- or that it is not found in the organism. Compounding this, in quent speciation events and they are most likely to represent many eukaryotic species, temporal or tissue-specific alterna- “true deep orthologs” in extant species (fig. 1C). However, for tive splicing of exons may result in alternative isoforms of genes where duplicate copies were neither positively nor neg- Downloaded from https://academic.oup.com/mbe/article-abstract/36/6/1344/5418531 by University of Leeds Medical & user on 10 June 2019 genes being transcribed into mRNA, resulting in single-copy atively selected, loss of duplicates could have occurred much genes giving the appearance of being multicopy or vice versa. later—defined here as “late loss” events. These late loss events Finally, even with ideal genomic data, it can be extremely can occur independently in multiple subsequent lineages and challenging to recognize genes that have been in single are most likely to result in misclassified paralogs in extant copy since the Last Common Ancestor (LCA) of the species species (fig. 1D). Overall, single-copy genes in extant jawed under study from those genes that were in multiple copies in vertebrates likely consist of a mixture of both orthologs and the LCA but are in single copy in extant organisms due to paralogs, a product of both early and late-gene loss events subsequent lineage-specific loss (i.e., paralogs retained in sin- following ancient duplication events. The application of sin- gle copy). gle-copy genes derived from a mixture of early and late loss These and other issues can result in paralogs (whose evo- events to phylogenomic studies could be fueling strongly lutionary history reflects a mix of speciation and gene dupli- supported conflicting topologies within the jawed vertebrate cation events) being falsely identified as orthologs or vice group. versa in phylogenomic studies, introducing confounding phy- The Lissamphibia, a group of jawed vertebrates that com- logenetic signals from genes with different histories. However, prise the three orders of extant Amphibia: Anura (frogs and not every phylogenetic question requires the analysis of toads), Caudata (salamanders and newts) and Gymnophiona genome-scale data sets to be resolved, indeed simulations (caecilians), is a case in point. For many years, the evolutionary (Zwickl and Hillis 2002; Salamin et al. 2005)andempirical relationships among these three orders have remained a con- studies (Erdos} et al. 1997) indicate that moderate sized data troversial question in vertebrate evolution (Feller and Hedges sets can be sufficient to reconstruct trees even with large 1998; Zardoya and Meyer 2001; Ruta and Coates 2007; Chen numbers of taxa (Steel 2007), suggesting that approaches et al. 2015; Shen et al. 2017) and citations therein) with two that prioritize quality over quantity should be considered. main conflicting hypotheses. The Procera hypothesis pro- Although it is likely that inadvertent inclusion of paralogs poses a sister-group relationship between Gymnophiona in genome-scale studies could result in decreased support and Caudata, to the exclusion of Anura (fig. 1A)(Feller and for the correct relationship, it is not clear if they could cause Hedges 1998; Vallin and Laurin 2004), whereas the Batrachia highly supported alternative relationships to be retrieved in- hypothesis posits a sister-group relationship between stead, even when they form a minority of all the genes used in Caudata and Anura, to the exclusion of Gymnophiona the analysis. If true, this may explain recent findings where, (fig. 1A)(Milner 1988; Zardoya and Meyer 2001; Ruta and regardless of the application of increasingly large genomic Coates 2007).