An Automated Method for Evolutionary Analysis of Nonculturable Ciliated Microeukaryotes
Received: 11 October 2017 | Revised: 25 December 2017 | Accepted: 26 December 2017
DOI: 10.1111/1755-0998.12750
Mol Ecol Resour. 2018;18:700–713. © 2018 John Wiley & Sons Ltd. wileyonlinelibrary.com/journal/men

RESOURCE ARTICLE

GPSit: An automated method for evolutionary analysis of nonculturable ciliated microeukaryotes

Xiao Chen1,2 | Yurui Wang1,2 | Yalan Sheng1,2 | Alan Warren3 | Shan Gao1,2,4

1 Institute of Evolution & Marine Biodiversity, Ocean University of China, Qingdao, China
2 Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China
3 Department of Life Sciences, Natural History Museum, London, UK
4 College of Marine Life Sciences, Ocean University of China, Qingdao, China

Correspondence
Shan Gao, Institute of Evolution & Marine Biodiversity, Ocean University of China, Qingdao, China.
Email: [email protected]

Funding information
National Natural Science Foundation of China, Grant/Award Number: 31522051, 31430077; Natural Science Foundation of Shandong Province, Grant/Award Number: JQ201706; the BBSRC China Partnering Award scheme, Grant/Award Number: BB/L026465/1.

Abstract
Microeukaryotes are among the most important components of the microbial food web in almost all aquatic and terrestrial ecosystems worldwide. In order to gain a better understanding of their roles and functions in ecosystems, sequencing coupled with phylogenomic analyses of entire genomes or transcriptomes is increasingly used to reconstruct the evolutionary history and classification of these microeukaryotes and thus provide a more robust framework for determining their systematics and diversity. However, phylogenomic research usually requires high levels of hands-on bioinformatics experience. Here, we propose an efficient automated method, “Guided Phylogenomic Search in trees” (GPSit), which starts from predicted protein sequences of newly sequenced species and a well-defined customized orthologous database. Compared with previous protocols, our method streamlines the entire workflow by integrating all essential and other optional operations. In so doing, the manual operation time for reconstructing phylogenetic relationships is reduced from days to several hours, compared to other methods. Furthermore, GPSit supports user-defined parameters in most steps and thus allows users to adapt it to their studies. The effectiveness of GPSit is demonstrated by incorporating available online data and new single-cell data of three nonculturable marine ciliates (Anteholosticha monilata, Deviata sp. and Diophrys scutum) under moderate sequencing coverage (~5×). Our results indicate that the former could reconstruct robust “deep” phylogenetic relationships while the latter reveals the presence of intermediate taxa in shallow relationships. Based on empirical phylogenomic data, we also used GPSit to evaluate the impact of different levels of missing data on two commonly used methods of phylogenetic analysis, maximum likelihood (ML) and Bayesian inference (BI). We found that BI is less sensitive to missing data when fast-evolving sites are removed.

KEYWORDS
GPSit, missing data, nonculturable ciliates, phylogenomics, single-cell sequencing

1 | INTRODUCTION

Microeukaryotes are an important component of the microbial food web in almost all aquatic and terrestrial ecosystems worldwide, enhancing nutrient cycles and promoting energy flow to organisms at higher trophic levels. To gain a better understanding of their roles and functions in ecosystems, investigations of the evolutionary history and classification of microeukaryotes based on high-throughput sequencing are becoming increasingly important. This is especially true for those forms that cannot be cultured in the laboratory but nonetheless compose a significant part of microbial communities, as revealed by molecular ecological investigations. In the age of integrative biology, analysis of phylogenetic diversity based on large-scale genomic data is one of the most important ways to describe evolutionary relationships among organisms (Clamp & Lynn, 2017). Phylogenetic reconstruction is the most powerful tool for achieving this, including the recovery of “deep” (resulting from complete lineage sorting) and “shallow” (resulting from incomplete lineage sorting) branches (Arnold, 1981; Avise, Shapira, Daniel, Aquadro, & Lansman, 1983; Doyle, 1992; Farris, 1978; Felsenstein, 1979; Maddison, 1997; Maddison & Knowles, 2006; Pamilo & Nei, 1988; Rosenberg, 2002, 2003; Takahata, 1989; Throckmorton, 1965). “Deep” or “global” phylogenetic reconstruction focuses on genes and groups that diverged early in evolutionary history, whereas “shallow” or “local” phylogenetic reconstruction corresponds to more recent evolutionary divergences (Kumar & Gadagkar, 2000).

In the past three decades, rDNA sequence information (18S, ITS, 28S) has been widely used in phylogenetic studies in almost all major eukaryote lineages (Baldwin et al., 1995; Berbee, Yoshimura, Sugiyama, & Taylor, 1995; Hedges, Moberg, & Maxson, 1990; Lynn, 2008; Mindell & Honeycutt, 1990; Zhang et al., 2017). Maddison and Knowles (2006) examined how the number of loci affects the reconstructability of species phylogenies and showed that reconstruction is more accurate when more loci are investigated. Subsequently, using sequence information from multiple gene markers (18S-ITS-28S rDNA, alpha-tubulin gene, COI, etc.) has become a widely accepted principle in phylogenetic analysis, especially for ciliates, an ecologically important protist group with a high degree of morphological complexity (Chen et al., 2016; Fernandes, da Silva Paiva, da Silva-Neto, Schlegel, & Schrago, 2016; Gao, Gao, Wang, Katz, & Song, 2014; Gao & Katz, 2014; Gao, Song, & Katz, 2014; Gao, Li et al., 2016; Gao, Warren et al., 2016; Gao et al., 2017; Huang, Luo, Bourland, Gao, & Gao, 2016; Wang, Zhang et al., 2017; Wang, Wang et al., 2017; Wang, Wang, Sheng et al., 2017; Wang, Sheng et al., 2017; Yan et al., 2016; Yi, Huang, Yang, Lin, & Song, 2016; Zhang et al., 2017; Zhao et al., 2016).

With the development of high-throughput sequencing techniques, hundreds of loci scattered across the genome are available for phylogenetic study using phylogenomics (Chen et al., 2015; Gentekaki, Kolisko, Gong, & Lynn, 2017; Gentekaki et al., 2014; Grant & Katz, 2014). However, for microeukaryotes, including most marine ciliates, cultivation and identification have remained a significant challenge (Finlay, Esteban, & Fenchel, 2004). Fortunately, the recent progress achieved in single-cell sequencing has helped to mitigate these constraints, greatly facilitating phylogenetic research in ciliates (…, 2017). With the rapid increase in the volume of genomic data in public databases, it is now possible to study “shallow” phylogenetic relationships using “supertree” approaches, such as Bayesian concordance analysis (BCA), by combining hundreds of phylogenetic trees, each based on a single locus.

The utility of phylogenomic analyses is, however, currently constrained by a number of technical difficulties. These include:

(i) Heterogeneous data types. Phylogenomic analyses usually involve data from different sources, such as genomic and transcriptomic sequence data both from single cells and from multiple cells, generated on different types of sequencer, in various formats and of varying quality.

(ii) Missing data. In the past two decades, many studies have demonstrated that missing data have very limited influence on supermatrix- and supertree-based phylogenetic analyses, and that adding characters with missing data and incomplete taxa to data sets can improve the accuracy of the results (see, for example, Fulton & Strobeck, 2006; Wiens, 1998, 2003, 2005, 2006; Wiens & Morrill, 2011). However, the data sets used in these studies usually contained no more than 3,000 characters from a few genes or were based on simulated data. By contrast, ciliate phylogenomic data sets are at least ten times larger than those generated by previous methods and usually include more than 100 genes (Chen et al., 2015; Gentekaki et al., 2014, 2017). Nevertheless, there are controversies regarding the effect of missing data on phylogenetic reconstructions (Roure, Baurain, & Philippe, 2013).

(iii) Choice of tree construction methods. ML and BI are the most widely accepted methods in phylogenetic reconstruction, but their resistance to missing data has not been compared using empirical phylogenomic data. Previous studies either used several genes as data sets or did not directly compare the two methods with each other (Lemmon, Brown, Stanger-Hall, & Lemmon, 2009; Roure et al., 2013).

(iv) Supermatrix vs. supertree approach. Recent studies in phylogenomics of ciliates have combined data sets from hundreds of loci into concatenated data sets and employed ML and BI methods to reconstruct the phylogeny in a supermatrix approach (Gentekaki et al., 2014, 2017). By grafting topologies from phylogenetic analyses of each single locus, the supertree approach has proved to be as valid as the supermatrix approach in situations where all loci are shared by all taxa (Chen et al., 2015). These studies have not, however, compared the supermatrix and supertree approaches in a scenario where there are missing data.

(v) Utility and efficiency of methodology. Compared to phylogenetic analyses based on a few genes only, phylogenomic studies require researchers to be equipped with