Received: 11 October 2017 | Revised: 25 December 2017 | Accepted: 26 December 2017 DOI: 10.1111/1755-0998.12750

RESOURCE ARTICLE

GPSit: An automated method for evolutionary analysis of nonculturable ciliated microeukaryotes

Xiao Chen1,2 | Yurui Wang1,2 | Yalan Sheng1,2 | Alan Warren3 | Shan Gao1,2,4

1 Institute of Evolution & Marine Biodiversity, Ocean University of China, Qingdao, China
2 Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China
3 Department of Life Sciences, Natural History Museum, London, UK
4 College of Marine Life Sciences, Ocean University of China, Qingdao, China

Correspondence
Shan Gao, Institute of Evolution & Marine Biodiversity, Ocean University of China, Qingdao, China.
Email: [email protected]

Funding information
National Natural Science Foundation of China, Grant/Award Number: 31522051, 31430077; Natural Science Foundation of Shandong Province, Grant/Award Number: JQ201706; the BBSRC China Partnering Award scheme, Grant/Award Number: BB/L026465/1.

Abstract
Microeukaryotes are among the most important components of the microbial food web in almost all aquatic and terrestrial ecosystems worldwide. In order to gain a better understanding of their roles and functions in ecosystems, sequencing coupled with phylogenomic analyses of entire genomes or transcriptomes is increasingly used to reconstruct the evolutionary history and classification of these microeukaryotes and thus provide a more robust framework for determining their systematics and diversity. However, phylogenomic research usually requires high levels of hands-on bioinformatics experience. Here, we propose an efficient automated method, “Guided Phylogenomic Search in trees” (GPSit), which starts from predicted protein sequences of newly sequenced species and a well-defined customized orthologous database. Compared with previous protocols, our method streamlines the entire workflow by integrating all essential and other optional operations. In so doing, the manual operation time for reconstructing phylogenetic relationships is reduced from days to several hours. Furthermore, GPSit supports user-defined parameters in most steps and thus allows users to adapt it to their studies. The effectiveness of GPSit is demonstrated by incorporating available online data and new single-cell data of three nonculturable marine ciliates (Anteholosticha monilata, Deviata sp. and Diophrys scutum) under moderate sequencing coverage (~5×). Our results indicate that the supermatrix approach can reconstruct robust “deep” phylogenetic relationships, while the supertree approach reveals the presence of intermediate taxa in shallow relationships. Based on empirical phylogenomic data, we also used GPSit to evaluate the impact of different levels of missing data on two commonly used methods of phylogenetic analysis, maximum likelihood (ML) and Bayesian inference (BI). We found that BI is less sensitive to missing data when fast-evolving sites are removed.

KEYWORDS GPSit, missing data, nonculturable ciliates, phylogenomics, single-cell sequencing

© 2018 John Wiley & Sons Ltd. wileyonlinelibrary.com/journal/men. Mol Ecol Resour. 2018;18:700–713.

1 | INTRODUCTION

Microeukaryotes are an important component of the microbial food web in almost all aquatic and terrestrial ecosystems worldwide, enhancing nutrient cycles and promoting energy flow to organisms at higher trophic levels. To gain a better understanding of their roles and functions in ecosystems, investigations of the evolutionary history and classification of microeukaryotes based on high-throughput sequencing are becoming increasingly important. This is especially true for those forms that cannot be cultured in the laboratory but nonetheless compose a significant part of microbial communities as revealed by molecular ecological investigations. In the age of integrative biology, analysis of phylogenetic diversity based on large-scale genomic data is one of the most important ways to describe evolutionary relationships among organisms (Clamp & Lynn, 2017). Phylogenetic reconstruction is the most powerful tool for achieving this, including the recovery of “deep” (resulting from complete lineage sorting) and “shallow” (resulting from incomplete lineage sorting) branches (Arnold, 1981; Avise, Shapira, Daniel, Aquadro, & Lansman, 1983; Doyle, 1992; Farris, 1978; Felsenstein, 1979; Maddison, 1997; Maddison & Knowles, 2006; Pamilo & Nei, 1988; Rosenberg, 2002, 2003; Takahata, 1989; Throckmorton, 1965). “Deep” or “global” phylogenetic reconstruction focuses on genes and groups that diverged early in evolutionary history, whereas “shallow” or “local” phylogenetic reconstruction corresponds to more recent evolutionary divergences (Kumar & Gadagkar, 2000).

In the past three decades, rDNA sequence information (18S, ITS, 28S) has been widely used in phylogenetic studies in almost all major lineages (Baldwin et al., 1995; Berbee, Yoshimura, Sugiyama, & Taylor, 1995; Hedges, Moberg, & Maxson, 1990; Lynn, 2008; Mindell & Honeycutt, 1990; Zhang et al., 2017). Maddison and Knowles (2006) examined how the number of loci affects the reconstructability of species phylogenies, and their results showed that the reconstruction is more accurate when more loci are investigated. Subsequently, using sequence information from multiple gene markers (18S-ITS-28S rDNA, alpha-tubulin gene, COI, etc.) has become the widely accepted principle in phylogenetic analysis, especially for ciliates, an ecologically important group with a high degree of morphological complexity (Chen et al., 2016; Fernandes, da Silva Paiva, da Silva-Neto, Schlegel, & Schrago, 2016; Gao, Gao, Wang, Katz, & Song, 2014; Gao & Katz, 2014; Gao, Song, & Katz, 2014; Gao, Li et al., 2016; Gao, Warren et al., 2016; Gao et al., 2017; Huang, Luo, Bourland, Gao, & Gao, 2016; Wang, Zhang et al., 2017; Wang, Wang et al., 2017; Wang, Wang, Sheng et al., 2017; Wang, Sheng et al., 2017; Yan et al., 2016; Yi, Huang, Yang, Lin, & Song, 2016; Zhang et al., 2017; Zhao et al., 2016).

With the development of high-throughput sequencing techniques, hundreds of loci scattered across the genome are available for phylogenetic study using phylogenomics (Chen et al., 2015; Gentekaki, Kolisko, Gong, & Lynn, 2017; Gentekaki et al., 2014; Grant & Katz, 2014). However, for microeukaryotes, including most marine ciliates, cultivation and identification have remained a significant challenge (Finlay, Esteban, & Fenchel, 2004). Fortunately, the recent progress achieved in single-cell sequencing has helped to mitigate these constraints, greatly facilitating phylogenetic research in ciliates (Zong, Lu, Chapman, & Xie, 2012). Newly developed methods (e.g., MALBAC for genome amplification and SPADES for genome assembly) make single-cell sequencing possible, thereby facilitating the study of protist genomes, including those of ciliates, despite the problems of sequencing bias and achieving adequate genome coverage (Bankevich et al., 2012; Lasken, 2013; Zong et al., 2012). Previous studies indicate that the “supermatrix” approach, which uses maximum likelihood (ML) and/or Bayesian inference (BI) methods to analyse concatenated data sets, is effective in the reconstruction of “deep” phylogenetic relationships (Chen et al., 2015; Gentekaki et al., 2014, 2017). With the rapid increase in volume of genomic data in public databases, it is now possible to study “shallow” phylogenetic relationships using “supertree” approaches, such as Bayesian concordance analysis (BCA), by combining hundreds of phylogenetic trees, each based on a single locus.

The utility of phylogenomic analyses is, however, currently constrained by a number of technical difficulties. These include: (i) Heterogeneous data types. Phylogenomic analyses usually involve data from different sources, such as genomic and transcriptomic sequence data both from single cells and from multiple cells, generated from different types of sequencer or in various sequencing formats or quality. (ii) Missing data. In the past two decades, many studies have demonstrated that missing data had very limited influence on supermatrix- and supertree-based phylogenetic analyses, and that adding characters with missing data and incomplete taxa to data sets can improve the accuracy of the results (e.g., Fulton & Strobeck, 2006; Wiens, 1998, 2003, 2005, 2006; Wiens & Morrill, 2011). However, the data sets used in these studies usually contained no more than 3,000 characters from a few genes or were based on simulated data sets. By contrast, phylogenomic data sets are at least ten times larger than those generated by previous methods and usually include more than 100 genes (Chen et al., 2015; Gentekaki et al., 2014, 2017). Nevertheless, there are controversies regarding the effect of missing data on phylogenetic reconstructions (Roure, Baurain, & Philippe, 2013). (iii) Choice of tree construction methods. ML and BI are the most widely accepted methods in phylogenetic reconstructions, but their resistance to missing data has not been compared using empirical phylogenomic data. Previous studies either used several genes as data sets or did not directly compare the two methods with each other (Lemmon, Brown, Stanger-Hall, & Lemmon, 2009; Roure et al., 2013). (iv) Supermatrix vs. supertree approach. Recent studies in phylogenomics of ciliates have combined data sets from hundreds of loci into concatenated data sets and employed ML and BI methods to reconstruct the phylogeny in a supermatrix approach (Gentekaki et al., 2014, 2017). By grafting topologies from phylogenetic analyses of each single locus, the supertree approach has proved to be as valid as the supermatrix approach in situations where all loci are shared by all taxa (Chen et al., 2015). These studies have not, however, compared the supermatrix and supertree approaches in a scenario where there are missing data. (v) Utility and efficiency of methodology. Compared to phylogenetic analyses based on a few genes only, phylogenomic studies require researchers to be equipped with more bioinformatic analysis skills and experience in handling large amounts of genome sequencing data, making the procedure technically challenging and preventing beginners from accessing this field.

To improve our ability to distinguish among species, extended DNA barcodes (hundreds of orthologous genes or single-copy nuclear regions) from genome skimming have been actively developed and applied to augment the standard barcoding approach (for instance, the V4 region of the 18S ribosomal DNA for ) (Coissac, Hollingsworth, Lavergne, & Taberlet, 2016; Dodsworth, 2015; Li et al., 2015; Lynn & Kolisko, 2017; Pawlowski et al., 2012;

Straub et al., 2012). However, an automated workflow is required to process and manage the large amount of extended DNA barcoding data before it is truly scalable (Coissac et al., 2016).

In this study, we propose an efficient, user-friendly, automated method, namely “Guided Phylogenomic Search in trees” (GPSit), for phylogenomic analysis. This method is compatible with data from genome sequencing and transcriptome sequencing, including that from single cells. Compared to previous works (Gentekaki et al., 2014, 2017), GPSit streamlines the entire workflow by integrating all essential operations and greatly reduces the manual operation time. GPSit also supports user-defined parameters in most steps, thus allowing users to adapt it to their studies as necessary, and enables researchers to perform analyses using supermatrix and supertree approaches simultaneously. By incorporating available online data and new single-cell sequencing data of three nonculturable marine ciliates (Anteholosticha monilata, Deviata sp. and Diophrys scutum), we evaluated the impact of different levels of missing data on phylogenomic analysis using both ML and BI, under relaxed and stringent masking, based on empirical genomic data sets.

2 | METHODS

2.1 | Single-cell sample preparation

Anteholosticha monilata, Deviata sp. and Diophrys scutum were collected from seawater along the coast of the Yellow Sea at Qingdao (36°06′N, 120°32′E), China in 2016. Single cells of these species were washed in PBS buffer (without Mg2+ or Ca2+), and genomic DNA was amplified using the Single Cell WGA Kit (Yikon, YK001A) based on the MALBAC technology (Lu et al., 2012).

2.2 | Illumina sequencing and genome assembly

Illumina libraries were prepared from amplified single-cell genomic DNA according to the manufacturer’s instructions, and paired-end sequencing (150 bp read length) was performed using an Illumina HiSeq4000 sequencer. The sequencing adapter was trimmed and low-quality reads (reads containing more than 10% Ns or 50% bases with Q value ≤ 5) were filtered out. Single-cell genomes of the three species were assembled using SPADES version 3.7.1 (-k 21,33,55,77) (Bankevich et al., 2012; Nurk et al., 2013). Oxytricha trifallax mitochondrial genomic peptides and bacterial genomes were downloaded from GenBank as BLAST databases to remove contamination caused by mitochondria or bacteria (BLAST E-value cut-off = 1.0e-5). CD-HIT version 4.6.1 (CD-HIT-EST, -c 0.98 -n 8 -r 1) was employed to eliminate the redundancy of contigs (with sequence identity threshold = 98%) (Fu, Niu, Zhu, Wu, & Li, 2012). Poorly supported contigs (coverage < 1 and length < 400 bp) were discarded, considering the inherent characteristics of the single-cell genomic sequencing technique (Zong et al., 2012). Genome-wide gene predictions were performed using AUGUSTUS version 3.2.2 (species = ) (Stanke, Tzvetkova, & Morgenstern, 2006).

2.3 | Ortholog detection and data set preparation

All the following steps were supported by multithread operation (eight threads and 16 Gb memory were used in this case). In the first part, 157 multiple sequence alignments (MSAs) of orthologous genes used as an initial database, excluding the “nsf1-H” which was not present in ciliates, were derived from Brown, Kolisko, Silberman, and Roger (2012) and Gentekaki et al. (2017), representing 37 species, including 24 ciliates, four dinoflagellates and nine apicomplexans (see Table S1 for details). Predicted genes from (i) single-cell genomic data of Anteholosticha monilata, Deviata sp. and Diophrys scutum, (ii) genomic data of Strombidium sulcatum sequenced by the authors’ group (unpublished) and (iii) predicted protein data of nine ciliates from transcriptome sequencing by the Marine Microbial Eukaryote Transcriptome Sequencing Project (data available on iMicrobe: http://imicrobe.us/; for accession numbers and gene IDs see Table S1) (Keeling et al., 2014) were used to update the initial MSA database. Contamination from bacterial genes was removed during genome assembly. It should be noted that it is still unknown whether the widespread stop codon reassignments exist in these ciliates (Swart, Serra, Petroni, & Nowacki, 2016). BLAST+ version 2.3.0 was employed to search for homologous sequences in the predicted protein data set of newly sequenced species. With customized E-value and identity cut-offs (E-value = 1e-10 and identity = 50% in this case), a single ortholog reciprocal best hit (RBH) per taxon was retained. After the orthologs were automatically added to the corresponding gene data sets, sequences in each data set were aligned by MUSCLE version 3.8.31 (Edgar, 2004). Ambiguously aligned positions were automatically masked with BMGE version 1.12 (Criscuolo & Gribaldo, 2010), in either a relaxed (-m BLOSUM30 -g 0.5 -b 5) or a stringent (-m BLOSUM62 -g 0.3 -b 5) masking mode. Finally, 157 updated MSAs covering 50 taxa constituted the full data set. The shared orthologous genes (41 in this case) of 16 initial ciliates whose gene recovery rates were above 80%, and single-cell genomic data from the three newly added ciliates, constituted the core data set (for a list of species included in the data sets see Table S1).

2.4 | Phylogenomic analyses

The phylogenomic analyses utilized two approaches: “supertree” and “supermatrix.” For the “supertree” approach, which was embedded in the GPSit pipeline as an optional analysis, a BI tree was constructed based on each MSA from the full data set or core data set using MRBAYES version 3.2.2 (ngen = 100000, samplefreq = 100, nchains = 4, use first 10% as burn-in) (Ronquist et al., 2012). BUCKY version 1.4.2 (default parameters) was then employed to build a concordant tree by exploring the different topologies of subtrees of all BI trees generated in the previous step (Larget, Kotha, Dewey, & Ane, 2010).

For the “supermatrix” approach, the 157 MSAs from the full data set were concatenated to form a supermatrix. The two degrees of masking differed markedly: relaxed masking generated a large supermatrix containing 37,047 amino acid sites, while stringent masking produced a small supermatrix containing 11,784 sites. The supermatrix, generated either from relaxed masking or from the stringent mode, was uploaded to the CIPRES Science Gateway and analysed with RAXML-HPC2 version 8.2.9 (LG model of amino acid substitution + Γ distribution + F, four rate categories, 500 bootstrap replicates) and PHYLOBAYES MPI 1.5a (CAT-GTR model + Γ distribution, four independent chains, 10,000 generations for the matrix of relaxed masking or 20,000 generations for the matrix of stringent mode, with 10% burn-in, convergence Maxdiff < 0.3) to generate the ML and BI trees, respectively (Gentekaki et al., 2017; Lartillot, Lepage, & Blanquart, 2009; Le & Gascuel, 2008; Miller, Pfeiffer, & Schwartz, 2010; Roure et al., 2013; Stamatakis, 2014).

Trees were visualized by MEGA version 7.0.20 (Kumar, Stecher, & Tamura, 2016). To visualize all available phylogenetic signals in the supermatrix, phylogenetic network analyses were calculated with SPLITSTREE version 4.14.4 (Network = NeighborNet; 1000 bootstrap replicates) (Huson & Bryant, 2006).

2.5 | Missing data analysis

To evaluate the effect of missing data on phylogenomic analyses, four minor data sets with different levels of missing data were split from the full data set (Wiens, 2006). In total, 94, 31, 16 and 8 MSAs in the full data set were randomly picked with no overlaps and distributed into minor data sets with 40%, 80%, 90% and 95% missing data, respectively (see Table S4). At least two loci for each taxon were guaranteed.

The accuracy of phylogenomic reconstruction was quantified based on phylogenetic divergence (PD) and the number of nodes (NN). The PD of each tree was defined as the total number of occurrences of topological alterations (TA) and poorly supported nodes (PSN). PSNs were only counted when the previous “local bootstrap” value was above the cut-off (ML: 90/BI: 98) in ML/BI trees under relaxed masking, while the current “local bootstrap” value dropped below that cut-off in the trees with missing data. As NN was always two less than the number of taxa (NT), the equation was

$$\mathrm{Accuracy} = \frac{NN - PD}{NN} = \frac{(NT - 2) - (TA + PSN)}{NT - 2}$$

In this case, NT = 50.

3 | RESULTS

3.1 | The workflow of “Guided Phylogenomic Search in trees” (GPSit)

By making maximum use of procedural programming for the phylogenomic analysis, here we propose a time-saving and automated method, “Guided Phylogenomic Search in trees” (GPSit), for phylogenomic analysis. The GPSit pipeline requires an initial data set containing multiple sequence alignments (MSAs) of orthologous proteins, and several independent predicted protein data sets from newly sequenced taxa (Figure 1). Orthologous genes are identified strictly from the data sets of new taxa according to the initial MSAs based on sequence identity, adopting RBH (reciprocal best hit) records that pass the customized criteria of E-value and sequence identity from BLAST all-against-all search results, and are realigned with sequences from the initial data set to generate the updated MSAs. After masking ambiguously aligned sites at the customized masking level, the data sets are ready for both supertree and supermatrix approaches. For the supermatrix analysis, concatenated data sets are automatically generated and ready for regular ML and/or BI analyses. For the supertree analysis, BCA is integrated in GPSit, and the final results of phylogenomic reconstruction are produced by the pipeline.

The GPSit pipeline was written in PERL, to detect orthologous genes and prepare data sets for the purpose of phylogenomic analysis (https://github.com/seanchen607/GPSit). GPSit was parallelized by submitting multiple tasks concurrently and automatically based on the customized thread number and then combining and delivering the results. The most outstanding characteristics of GPSit are its utility, efficiency and user-friendliness. The whole pipeline can be completed on a personal computer within 1 day when data volume and computation capacity are on the same level as in this case (50 taxa/157 loci, eight threads/16 Gb memory). Rather than requiring the user to participate at each step of the workflow, one initial command line is sufficient for the whole automated pipeline. Thus, by employing this streamlined and automated pipeline, users are not required to have high levels of hands-on bioinformatics experience in order to carry out phylogenomic analyses. GPSit is highly flexible to meet various potential research demands: (i) the initial orthologous database can be replaced by a single parameter (-i); (ii) the criteria of BLAST E-value and sequence identity can be changed (-e and -d, respectively); (iii) the ambiguously aligned sites can be masked either in a relaxed mode (i.e., -m BLOSUM30 -g 0.5 -b 5 in BMGE, the default option) or in a stringent mode (i.e., -m BLOSUM62 -g 0.3 -b 5 in BMGE, -S); (iv) the generations and sampling frequency for BI analysis can be set by users (-g and -f, respectively); (v) the thread number can be adjusted (-t, and if the thread number is 1, parallelization is disabled); and (vi) the supertree approach fulfilled by BUCKY (Bayesian concordance analysis, BCA) can be enabled by a parameter (-U) and a user-defined taxon list can be provided by another parameter (-l) to assign the taxa to be analysed.

The ortholog detection of the GPSit pipeline starts from predicted protein sequences, while the de novo method (Tree-of-life) proposed by Grant and Katz (2014) extracts orthologs directly from sequencing data. These two methods employ different strategies: GPSit uses a customized orthologous database as the reciprocal BLAST source while Tree-of-life uses OrthoMCL. The initial database in GPSit can be customized by users according to their research interests, whereas the OrthoMCL database cannot be manipulated freely. As a result, using a user-defined, taxon-specific orthologous database helps both to improve efficiency and accuracy and to reduce the disturbance from bacterial contamination during ortholog detection. Furthermore, the updated orthologous database in GPSit can be output as the initial database for subsequent analyses, while Tree-of-life does not support iteration of the initial
database. Lastly, GPSit enables phylogenomic analyses to be carried out using both supermatrix and supertree approaches, while Tree-of-life currently only supports the supertree approach.

FIGURE 1 Workflow of “Guided Phylogenomic Search in trees” (GPSit). It is divided into five automatically interlocking stages: input data processing, orthologous gene detection, sequence alignment, ambiguous sites masking and phylogenetic analysis. At the first stage, quality control is performed on protein sequence data files in FASTA format from both the initial multiple sequence alignments (MSAs) and the input data of newly sequenced species, and clean data are generated for each (stage 1 takes <2 min). At the second stage, BLASTP is called by GPSit between sequences of species in the initial MSAs and those of newly sequenced species, orthologous genes are detected by reciprocal best hit from BLAST results under custom thresholds (E-value cut-off = 1e-10 and identity cut-off = 50% in this case) and data sets are updated by integrating the initial MSAs with the identified orthologous protein sequences from newly sequenced species (stage 2 may take several hours, depending on the volume of initial MSAs and input data, as well as practical computing capacity). At the third stage, MUSCLE is employed to align the sequences in each data set, generating the updated MSAs; these MSAs are also concatenated into a data set without masking of ambiguous sites for potential later use (stage 3 takes about 10 min). At the fourth stage, the ambiguous sites in the updated MSAs are masked by BMGE, and a concatenated data set is produced in either a relaxed or a stringent masking mode, both of which are preset in GPSit (stage 4 also takes about 10 min). At the final stage, phylogenetic analysis is carried out according to user needs (stage 5 may take several hours, depending on the volume of the updated MSAs or concatenated data set, as well as practical computing capacity). For the supermatrix approach, the concatenated data set can be used to perform maximum likelihood (ML) and Bayesian inference (BI) analysis using RAXML and PhyloBayes, respectively. For the supertree approach, BUCKY is called by GPSit to perform the Bayesian concordance analysis (BCA) using the updated MSAs generated by the previous stage. All five stages are parallelized. Red asterisks mark the steps allowing user-defined parameters: *BLAST thresholds (E-value and identity cut-off); **Ambiguous sites masking criterion of BMGE (relaxed “-m BLOSUM30 -g 0.5 -b 5” or stringent masking “-m BLOSUM62 -g 0.3 -b 5”); ***Phylogenetic analysis parameters (generation number and sample frequency)

3.2 | Single-cell sequencing of nonculturable ciliates

To test the effectiveness of GPSit and single-cell sequencing in phylogenomic analyses, we collected single-cell genomic data from three nonculturable ciliates, Anteholosticha monilata, Deviata sp. and Diophrys scutum. A total of 97.29 M, 68.91 M and 71.21 M paired-end reads were produced for Anteholosticha monilata, Deviata sp. and Diophrys scutum, respectively. Their GC contents were 41.11%, 49.30% and 43.75%, respectively. Following redundancy removal and control of contamination originating both from bacteria and from ciliate mitochondria, the clean single-cell genomic sequencing data of the three ciliates were assembled into contigs (Table 1). The large numbers of contigs (82,044–322,632) and small N50 values (683–846 bp) indicated that the genome assemblies were incomplete, probably because of the bias of single-cell sequencing. This could also reflect the fact that spirotrich ciliates contain highly fragmented chromosomes that encode single coding regions in their macronuclear genome, known as “nano-chromosomes” or “gene-sized chromosomes” (Baird & Klobutcher, 1989; Klobutcher et al., 1998; Metenier & Hufschmid, 1988; Riley & Katz, 2001; Swart et al., 2013). For instance, the genome of Oxytricha trifallax, a typical spirotrich ciliate, has 19,152 contigs with an N50 of 3,597 bp (Chen et al., 2014).

The gene recovery rate (GRR), which is equivalent to the coverage of the 157 orthologous genes in each taxon, was 77.1%, 80.9% and 65.0% for Anteholosticha monilata, Deviata sp. and Diophrys scutum, respectively, which is comparable to the average of ciliate species in our initial data set (78.9%) (Figure 2a, Table 1, Tables S1 and S2). No statistically significant difference was detected between the GRR of the single-cell genome data and that of the bulk data of the initial species (p = .777) or of the other newly added species (p = .694). Although the contigs of these three species were highly fragmented, the coverage of gene sequences in these three single-cell genomic data sets demonstrated the potential for the wide application of single-cell sequencing for phylogenomic studies of nonculturable protists, especially for those whose genomes are compact (gene-dense).

3.3 | Validation of GPSit for ciliate phylogenomic analysis

The heatmap of orthologs shared among taxa demonstrated that the strongest correspondence of newly added species came from their most closely related taxa (Figure 2b), based on BLASTP integrated in GPSit (E-value < 1e-10 and identity > 50% in this case). To compare these findings with previous results (Gentekaki et al., 2014, 2017), we first performed deep phylogenetic analyses with the concatenated data set by the supermatrix approach. The results indicated that the overall topologies of all trees were maintained as previously reported after integrating 13 new species by the GPSit pipeline (Figure 3, Figure S1). The 13 newly added species, including the three species with single-cell genomic data, robustly grouped with their closest relatives in both ML and BI trees. In the subclass Stichotrichia, Anteholosticha clustered with Pseudokeronopsis, representing the order Urostylida, and then grouped with Deviata (order Stichotrichida) and (order ). Diophrys had an intermediate position between and other spirotrichs with high support (ML/BI: 96/1.00).

The monophyly of each ciliate class was fully supported by both ML and BI methods, with three main discrepancies. The first is the relationship among classes Spirotrichea, Armophorea and Litostomatea (SAL). In the BI tree, the class Armophorea clustered with the class Spirotrichea to form a group that was sister to the class Litostomatea, while in the ML tree, Armophorea clustered with Litostomatea with low support. Second, in the order (class Oligohymenophorea, subclass Scuticociliatia), the ML tree showed an ambiguous clustering of the genera Anophryoides, Uronema and Miamiensis with low support, while the relationship among these three genera was fully supported in the BI tree. Third, the subclass Peniculia, represented by Paramecium tetraurelia, clustered with the subclasses Scuticociliatia and Hymenostomatia in the ML tree vs. with the subclass Peritrichia in the BI tree.

TABLE 1 Information on the genome assembly of single-cell sequencing data

                          Anteholosticha monilata   Deviata sp.        Diophrys scutum
Genome size (Mb)          67.4                      60.3               255.6
# Contigs                 100,059                   82,044             322,632
Contig N50 (bp)           683                       758                846
# Contaminant contigs     3,425 (3.42%)             9,024 (11.00%)     13,055 (4.05%)
Coverage                  4.837                     5.594              3.994
Gene (complete)           14,127                    12,310             75,495
Gene (partial)            69,183                    54,496             273,885
CDS                       86,058                    102,456            464,049
Protein                   68,911                    57,079             305,146
GRR(a)                    121/157 (77.1%)           127/157 (80.9%)    102/157 (65.0%)

(a) Gene recovery rate (GRR) of a taxon was equivalent to its coverage of the total 157 MSAs.
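For readers unfamiliar with the contig N50 statistic reported in Table 1, the following Python sketch shows the standard definition: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The function name and the toy contig lengths are illustrative assumptions and are not taken from the study.

```python
from typing import Sequence


def n50(contig_lengths: Sequence[int]) -> int:
    """Return the contig N50 for a list of contig lengths (in bp)."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy assembly of six contigs with a total length of 30 bp.
toy_n50 = n50([2, 3, 4, 5, 6, 10])
```

A small N50 relative to typical gene length, as in the three single-cell assemblies, indicates that most of the assembly sits in short contigs.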

FIGURE 2 Information on updated multiple sequence alignments (MSAs). (a) The matrix of updated MSAs showing the presence of 157 genes in each species, including the 37 species in the initial data set (left side of the red line) and the 13 newly added species (right side of the red line). The bar plot beneath shows the gene recovery rates (GRRs) of each species. The bars of three ciliates with single-cell sequencing are in red. Species numbers are in accordance with Table S1. (b) The heatmap for orthologous correspondence between newly added ciliates and initial species. Numbers of identified orthologous genes are labelled in the matrix. For each newly added species, strong orthologous correspondences are highlighted in green. Abbreviations of the 13 newly added species: A2, Aristerostoma sp. 2; Ft, Favella taraikaensis; Ss, Strombidium sulcatum; D, Deviata sp.; P1, Pseudokeronopsis sp. 1; P2, Pseudokeronopsis sp. 2; Am, Anteholosticha monilata; Ds, Diophrys scutum; Ec, Euplotes crassus; Ef, Euplotes focardii; Cv, virens; Bj, japonicum; Fs, Fabrea salina. Three ciliates with single-cell sequencing are in red. See Tables S1 and S2 for more information
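The gene recovery rate (GRR) plotted in Figure 2a and listed in Table 1 is the fraction of the 157 reference genes for which an ortholog was recovered in a taxon. A minimal Python sketch of that calculation follows. The function name, the gene labels and the set-based representation of the presence matrix are assumptions for illustration; the counts 121, 127 and 102 out of 157 are the values reported in Table 1.

```python
from typing import Iterable, Set


def gene_recovery_rate(recovered: Set[str], reference_genes: Iterable[str]) -> float:
    """Return the percentage of reference genes that were recovered in one taxon."""
    reference = set(reference_genes)
    if not reference:
        raise ValueError("reference gene list must not be empty")
    # Only genes that belong to the reference set count towards the rate.
    return 100.0 * len(recovered & reference) / len(reference)


# Hypothetical gene labels standing in for the 157 MSAs of the full data set.
genes = [f"gene{i:03d}" for i in range(157)]

# Recovered-gene counts reported in Table 1 for the three single-cell genomes.
grr_anteholosticha = round(gene_recovery_rate(set(genes[:121]), genes), 1)
grr_deviata = round(gene_recovery_rate(set(genes[:127]), genes), 1)
grr_diophrys = round(gene_recovery_rate(set(genes[:102]), genes), 1)
```

Rounded to one decimal place, these reproduce the 77.1%, 80.9% and 65.0% given in the table.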

NeighborNet analysis of phylogenetic networks was carried out to display all possible relationships among the ciliate lineages (Figure S2). The placement of most classes in the NeighborNet graph was generally consistent with ML and BI analyses, although it is noteworthy that (i) the monophyly of subclass Scuticociliatia was supported and a polytomy was formed by Anophryoides, Uronema and Miamiensis; (ii) subclass Peniculia, represented by Paramecium, was located at an intermediate position between the subclass Peritrichia and the subclasses Scuticociliatia and Hymenostomatia; (iii) based on single-cell genomic data, the systematic positions of the three nonculturable marine ciliate species were not well-resolved by this method.

FIGURE 3 Phylogenomic trees estimated from a 157-gene concatenated data set under relaxed masking by maximum likelihood (ML) (a) and Bayesian inference (BI) (b) methods (the supermatrix approach). Black dots indicate a bootstrap value of 100% or a posterior probability of 1.00. The 13 newly added ciliates are in bold. The three ciliates with single-cell sequencing data are marked by red triangles. The ciliate with unpublished genome sequencing data is marked with a black square. SAL: classes Spirotrichea, Armophorea and Litostomatea. O: class Oligohymenophorea. C: class Colpodea. P: class Protocruziea. H: class Heterotrichea. The outgroup taxa, that is nine apicomplexans and four dinoflagellates, are in blue and green, respectively [Colour figure can be viewed at wileyonlinelibrary.com]

3.4 | Supertree approach reveals the presence of intermediate taxa in shallow relationships

The supermatrix approach allows for well-reconstructed deep branches; however, it sometimes caused inflated support values on shallow relationships, as reported previously (Chen et al., 2015). For example, Paramecium clustered with subclasses Scuticociliatia and Hymenostomatia in the ML tree but with the subclass Peritrichia in the BI tree, both with strong support (ML/BI: 97/1.00) (Figure 3). Results like this could be misleading in the absence of one of the analyses.

In order to prevent confusion caused by these apparently conflicting results, we carried out supertree analysis (BCA) to construct concordant trees. Initially, the main data set, including all 157 MSAs, was used for tree construction. However, deep branches were not well supported by the concordance factors (Figure S3). Considering that the supertree approach performs better in the situation where shared taxa occur (Larget et al., 2010), we built a core data set that removed those species with low gene recovery rate (<80%). In the concordant tree built from the core data set (Figure 4), concordance factors for deep branches were significantly improved. More importantly, it presented a more logical clustering for shallow branches. For instance, Paramecium grouped with subclasses Scuticociliatia and Hymenostomatia with low support values (31/23, relaxed/stringent masking). As no other members of the subclass Peniculia, to which Paramecium belongs, have available genomic or transcriptomic data, Paramecium should be considered as an intermediate taxon between these two subclasses, rather than being assigned a definitive affiliation with either of them. These results indicate that the supertree approach could alert researchers to the presence of intermediate taxa in shallow relationships through its low discordance tolerance. Thus, we suggest combining supermatrix and supertree methods to reconstruct phylogenomic relationships.

FIGURE 4 Concordant tree estimated from the core data set (41 genes shared by 19 ciliates) under both relaxed and stringent masking by the Bayesian concordance analysis (BCA) method (the supertree approach). Concordance factor (CF) values from relaxed and stringent masking are both labelled on each node. The three ciliates with single-cell sequencing data are in bold and marked by red triangles. The protargol staining pictures of the three ciliates were provided by the authors' group. The red arrows in Deviata sp. indicate its macronuclei [Colour figure can be viewed at wileyonlinelibrary.com]

3.5 | Effect of missing data on phylogenomic relationships was limited

Analyses of missing data effects were conducted on the data sets with 40%, 80%, 90% and 95% of missing data, under relaxed and stringent masking modes, using ML and BI methods (Figure S4). To evaluate the effect of missing data, we considered the ML/BI trees based on the full data set under relaxed masking as the "accurate" trees. Compared with these "accurate" trees, the occurrence of topological alteration (TA) and poorly supported nodes (PSN) both increased with the increase in missing data levels. In other words, the "accuracy" of trees dropped when more data were missing. Neither of the ML trees from data sets with 40% or 80% data missing differed significantly from that derived from the full data set. It was notable that Diophrys, which had the lowest gene recovery rate (GRR) among the three ciliates with single-cell genome data (65.0%), clustered with Nyctotherus with poor support in both ML and BI trees when 80% of data was missing. In the phylogenomic tree reconstructed by ML under relaxed masking, the accuracy gradually dropped to 0.71 when the missing data level increased to 95% (Figure 5a, blue line). Similarly, using the BI method, the accuracy decreased to 0.71 when only 5% of the data remained (Figure 5b, blue line). The situation in reconstructions under stringent masking was basically in line with that under relaxed masking (Figure 5a,b, red lines). However, with more fast-evolving sites removed, stringent masking caused an 11.3% decrease in accuracy in shallow reconstructions by BI relative to relaxed masking, compared with a 14.6% decrease when using ML (Table S3). Taken together, these results suggest that both lower GRR and stringent masking may weaken the resistance to the discordance caused by missing data. It is noteworthy that BI was less sensitive than ML to missing data. Furthermore, in all situations, there was a notable decrease in accuracy when the missing data level exceeded 80%, both for ML and BI and under both relaxed and stringent masking (Figure 5). Therefore, it is important to ensure the necessary quantity of initial MSAs.
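The core data set used for the supertree analysis in Section 3.4 (species with GRR below 80% removed, genes shared by the remaining taxa retained) can be sketched as a two-step filter. This is a minimal illustration under stated assumptions, not the GPSit implementation: the function `build_core_dataset`, the toy presence data and the rule that a retained gene must be present in every retained taxon are all hypothetical, and the text does not specify whether focal taxa were retained regardless of their GRR.

```python
from typing import Dict, List, Set, Tuple


def build_core_dataset(
    presence: Dict[str, Set[str]], n_total: int, min_grr: float = 0.80
) -> Tuple[List[str], List[str]]:
    """Keep taxa with GRR >= min_grr, then keep genes present in all of them.

    presence maps each taxon to the set of genes (MSAs) in which it occurs;
    n_total is the total number of MSAs in the main data set.
    """
    kept_taxa = sorted(
        taxon for taxon, genes in presence.items() if len(genes) / n_total >= min_grr
    )
    if not kept_taxa:
        return [], []
    # Genes shared by every retained taxon (assumed selection rule).
    shared = set.intersection(*(presence[taxon] for taxon in kept_taxa))
    return kept_taxa, sorted(shared)


# Hypothetical toy data: five genes in total, three taxa.
presence = {
    "taxon_A": {"g1", "g2", "g3", "g4", "g5"},  # GRR = 100%
    "taxon_B": {"g1", "g2", "g3", "g4"},        # GRR = 80%
    "taxon_C": {"g1", "g5"},                    # GRR = 40%, removed
}
taxa, genes = build_core_dataset(presence, n_total=5)
print(taxa)
print(genes)
```

In this toy case the low-GRR taxon is dropped and only genes common to the remaining taxa survive, which mirrors the trade-off described in the text: fewer loci, but every retained locus is shared by all retained taxa.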

FIGURE 5 Diagrams showing that different levels of missing data had limited effect on phylogenomic relationships under both relaxed and stringent masking, using either ML (a) or BI (b) methods [Colour figure can be viewed at wileyonlinelibrary.com]

4 | DISCUSSION

4.1 | Advice for parameter setting of GPSit

Guided Phylogenomic Search in trees (GPSit) is a wrapper, implemented as PERL scripts, which integrates the manual procedures of widely accepted workflows in published papers, and thus enhances our ability to distinguish among species by extended DNA barcodes. It is designed for a wide variety of phylogenomic study purposes and is compatible with both the supermatrix and supertree approaches. To meet this purpose, parameters such as the criterion for ortholog detection, the masking level and the number of generations in BCA can all be customized by users. If any of the phylogenetic programs used by GPSit is updated, users just need to update the paths of the new versions in the environment variables of their operating systems, and the new version of the program will be used by GPSit. For reference, advice for initial MSA selection and parameter settings is provided here.

4.1.1 | Quality and quantity of initial MSAs

The accuracy of results is highly dependent on the quality and quantity of the initial MSAs. For quality, initial MSAs should consist of well-identified orthologous protein sequences. For quantity, at least 32 MSAs (about 20% of 157 MSAs in this case) are recommended for robust reconstruction of deep phylogenetic relationships, based on the results of missing data analyses in the present study.

4.1.2 | Threshold for ortholog detection

Although reciprocal best hits were adopted during orthologous gene detection, proper cut-off values for E-value and identity during BLAST should be optimized (for instance, 1E-10 and 50%, respectively). To reliably identify the orthologs in data sets of newly sequenced species, these values should be determined on each occasion.

4.1.3 | Masking level for ambiguous sites

As stringent masking may remove conserved sites when gaps are introduced by missing data or by adding species with distant relationships, the masking level should be carefully chosen to remove fast-evolving sites. The stringent masking (-m BLOSUM62 -g 0.3 -b 5) was adopted in previous studies (Gentekaki et al., 2014, 2017). However, our results indicate that, with the inclusion of additional species in the analysis, stringent masking behaved too aggressively, resulting in only 11,784 amino acid sites being retained compared to 37,047 sites with relaxed masking (-m BLOSUM30 -g 0.5 -b 5). Furthermore, relaxed masking performed better when using the supermatrix approach.

4.1.4 | Supermatrix or supertree?

The application of multiple methods of analysis is recommended to reconstruct phylogenomic relationships. The supertree approach was more sensitive to missing data and is preferable for resolving "shallow" relationships when loci are shared by all taxa, whereas the supermatrix approach was more resistant to missing data.

4.2 | Interpretation of results produced by supermatrix and supertree approaches

In the concatenated trees derived from ML and BI analyses, the subclass Scuticociliatia (represented by Anophryoides haemophila, Uronema sp. and Miamiensis avidus) was monophyletic, and the BI tree strongly supported Uronema being more closely related to Anophryoides than to Miamiensis. The ML tree, however, suggested that the relationships among these taxa are ambiguous, which is consistent with previous studies based on rDNA sequences (Jung, Bae, Oh, & Lee, 2011; Pan et al., 2015). This finding was also supported by the phylogenetic network, which depicted their relationships as a polytomy. Unlike Uronema (GRR = 89.2%), the GRRs of both Anophryoides and Miamiensis were rather low (43.3% and 41.4%, respectively). Anophryoides, Miamiensis and Uronema are well-known scuticociliates which are causative organisms of disease in fishes and lobsters in aquacultural holding facilities (Acorn et al., 2011; Cheung, Nigrelli, & Ruggieri, 1980; Dragesco et al., 1995; Iglesias et al., 2001; Munday, O'Donoghue, Watts, Rough, & Hawkesford, 1997). A previous study indicated that at least 14 marine scuticociliates have been misidentified or are junior synonyms (Song & Wilbert, 2000). An investigation of selected pathogenic scuticociliates suggested that Miamiensis, rather than Uronema, is the main causative agent of scuticociliatosis in olive flounder, although they share similar morphometric characteristics (Song et al., 2009). The present work supports the assertion that morphologically similar species can differ significantly from one another in terms of their genome content and pathogenicity.

The phylogenomic reconstructions by both ML and BI methods, using either the full data set (Figure 3 and Figure S1) or the data set with 40% data missing (Figure S4A,E,I,M), supported the SAL (Spirotrichea + Armophorea + Litostomatea) hypothesis in agreement with previous studies (Gao, Li et al., 2016; Gentekaki et al., 2014, 2017; Riley & Katz, 2001; Vd'acny, Orsi, & Foissner, 2010). Among these reconstructions, the results obtained using the ML method (Figure 3a, Figures S1A and S4A,E) recognized the infraphylum Lamellicorticata (the classes Armophorea + Litostomatea) proposed by Vd'acny et al. (2010). However, the results obtained using the BI method (Figure 3b, Figures S1B and S4I,M) did not support the monophyletic relationship of Nyctotherus ovalis (Armophorea) and Litonotus sp. (Litostomatea). For the data sets with more than 80% data missing, the assignments of Nyctotherus ovalis and Litonotus sp. were ambiguous and unconvincing, likely due to their low GRRs (29.3% and 66.9%, respectively). Although the ultrastructural and ontogenetic evidence supported the infraphylum Lamellicorticata (Vd'acny & Foissner, 2017; Vd'acny et al., 2010), the phylogenomic reconstruction results obtained using ML and BI methods are incompatible with each other with respect to the monophyly of this infraphylum, and thus further studies are needed.

Although NeighborNet analysis effectively reconstructed deep relationships, it lacked the precision to accurately place taxa such as the three nonculturable marine ciliates, based on data obtained by single-cell sequencing. This may have been due to the high level of missing data in single-cell sequences, as the consensus method utilized by NeighborNet analysis of phylogenetic networks considers only the parsimony-informative sites in MSAs. It does this by identifying and eliminating characters that are potentially problematic or uninformative, thus causing a dramatic decrease in site numbers. The performance of NeighborNet in the present study suggests that eliminating uninformative characters might be beneficial for reducing perturbations caused by missing data on deep branches. This is inconsistent, however, with the findings of previous studies which indicated that NeighborNet is no more sensitive to the effects of missing data than ML and BI methods (Lemmon et al., 2009; Roure et al., 2013). Furthermore, a previous study suggested that the increase in the number of characters could facilitate the resolution of relationships among all taxa of interest (Wiens, 2003). To this end, we propose that ML and BI methods are more effective than NeighborNet for deep-branch reconstruction in phylogenomic analyses.

A previous study showed that BCA was more sensitive to discordances caused by biological processes, including incomplete lineage sorting and horizontal gene transfer (Larget et al., 2010). Therefore, a supertree approach can give more authentic support values than a supermatrix approach for shallow relationships. However, the former has a higher demand for the completeness of data sets, so missing data would negatively impact the effectiveness of the supertree approach far more profoundly than the supermatrix approach for phylogenomic reconstruction. The CF values of deep nodes could not be resolved when many species with low GRRs were included (i.e., when there were few shared taxa), for example, Nyctotherus ovalis (29.3%), Litonotus sp. (66.9%), Protocruzia adherens (75.2%), Anteholosticha monilata (77.1%) and Diophrys scutum (65.0%) (Figure S3).

5 | CONCLUSION

In the current work, we present an efficient automated method, "Guided Phylogenomic Search in trees" (GPSit), to facilitate phylogenomic analyses of microeukaryotes. It is anticipated that GPSit will contribute to the automation and scalability of the collection of extended DNA barcodes and of specimen identification after genome skimming or single-cell sequencing. GPSit requires minimal hands-on operations and bioinformatics experience and thus is a useful tool for molecular systematics and molecular ecological investigations.

ACKNOWLEDGEMENTS

This work was supported by National Natural Science Foundation of China (31522051, 31430077), Natural Science Foundation of Shandong Province (JQ201706) and the BBSRC China Partnering Award scheme (BB/L026465/1). The authors would like to thank the following people for assistance with this study: Dr. Weibo Song (Ocean University of China, China), Dr. Hongan Long (OUC, China) and Dr. Peter Foster (Natural History Museum, UK) for their advice on the preparation of the manuscript; Mr. Song Li (OUC, China) for generously providing photographs of ciliates (Anteholosticha monilata, Deviata sp. and Diophrys scutum); and Dr. Yifan Liu (University of Michigan, USA) for institutional support.

AUTHOR CONTRIBUTION

X.C. conceived the project, designed research, developed the method, analyzed data and wrote the manuscript. Y.W. performed the single-cell genomic DNA extraction. Y.S. provided advice and edited the manuscript. A.W. provided English editing. S.G. conceived the project and edited the manuscript.

DATA ACCESSIBILITY

GPSit is a PERL script and is deposited at GitHub (https://github.com/seanchen607/GPSit). All Illumina DNA sequencing data were submitted to the NCBI Short Read Archive (SRA), with the accession nos. SRR6339723-SRR6339725 for Diophrys scutum, Deviata sp. and Anteholosticha monilata.

ORCID

Shan Gao https://orcid.org/0000-0003-2083-2616

REFERENCES

Acorn, A. R., Clark, K. F., Jones, S., Despres, B. M., Munro, S., Cawthorn, R. J., & Greenwood, S. J. (2011). Analysis of expressed sequence tags (ESTs) and gene expression changes under different growth conditions for the ciliate Anophryoides haemophila, the causative agent of bumper car disease in the American lobster (Homarus americanus). Journal of Invertebrate Pathology, 107, 146–154. https://doi.org/10.1016/j.jip.2011.04.006
Arnold, E. N. (1981). Estimating phylogenies at low taxonomic levels. Journal of Zoological Systematics and Evolutionary Research, 19, 1–35.
Avise, J. C., Shapira, J., Daniel, S. W., Aquadro, C. F., & Lansman, R. A. (1983). Mitochondrial DNA differentiation during the speciation process in Peromyscus. Molecular Biology and Evolution, 1, 38–56.
Baird, S. E., & Klobutcher, L. A. (1989). Characterization of chromosome fragmentation in two protozoans and identification of a candidate fragmentation sequence in Euplotes crassus. Genes & Development, 3, 585–597. https://doi.org/10.1101/gad.3.5.585
Baldwin, B. G., Sanderson, M. J., Porter, J. M., Wojciechowski, M. F., Campbell, C. S., & Donoghue, M. J. (1995). The ITS region of nuclear ribosomal DNA: A valuable source of evidence on angiosperm phylogeny. Annals of the Missouri Botanical Garden, 82, 247–277. https://doi.org/10.2307/2399880
Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., ... Pevzner, P. A. (2012). SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology, 19, 455–477. https://doi.org/10.1089/cmb.2012.0021
Berbee, M. L., Yoshimura, A., Sugiyama, J., & Taylor, J. W. (1995). Is Penicillium monophyletic? An evaluation of phylogeny in the family Trichocomaceae from 18S, 5.8 S and ITS ribosomal DNA sequence data. Mycologia, 87, 210–222. https://doi.org/10.2307/3760907
Brown, M. W., Kolisko, M., Silberman, J. D., & Roger, A. J. (2012). Aggregative multicellularity evolved independently in the eukaryotic supergroup Rhizaria. Current Biology, 22, 1123–1127. https://doi.org/10.1016/j.cub.2012.04.021

Chen, X., Bracht, J. R., Goldman, A. D., Dolzhenko, E., Clay, D. M., Swart, E. C., ... Landweber, L. F. (2014). The architecture of a scrambled genome reveals massive levels of genomic rearrangement during development. Cell, 158, 1187–1198. https://doi.org/10.1016/j.cell.2014.07.034
Chen, X., Gao, S., Liu, Y., Wang, Y., Wang, Y., & Song, W. (2016). Enzymatic and chemical mapping of nucleosome distribution in purified micro- and macronuclei of the ciliated model organism, Tetrahymena thermophila. Science China Life Sciences, 59, 909–919. https://doi.org/10.1007/s11427-016-5102-x
Chen, X., Zhao, X., Liu, X., Warren, A., Zhao, F., & Miao, M. (2015). Phylogenomics of non-model ciliates based on transcriptomic analyses. Protein & Cell, 6, 373–385. https://doi.org/10.1007/s13238-015-0147-3
Cheung, P., Nigrelli, R., & Ruggieri, G. (1980). Studies on the morphology of Uronema marinum Dujardin (Ciliatea: Uronematidae) with a description of the histopathology of the infection in marine fishes. Journal of Fish Diseases, 3, 295–303. https://doi.org/10.1111/j.1365-2761.1980.tb00400.x
Clamp, J. C., & Lynn, D. H. (2017). Investigating the biodiversity of ciliates in the 'Age of Integration'. European Journal of Protistology, 61, 314–322. https://doi.org/10.1016/j.ejop.2017.01.004
Coissac, E., Hollingsworth, P. M., Lavergne, S., & Taberlet, P. (2016). From barcodes to genomes: Extending the concept of DNA barcoding. Molecular Ecology, 25, 1423–1428. https://doi.org/10.1111/mec.13549
Criscuolo, A., & Gribaldo, S. (2010). BMGE (Block Mapping and Gathering with Entropy): A new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evolutionary Biology, 10, 210. https://doi.org/10.1186/1471-2148-10-210
Dodsworth, S. (2015). Genome skimming for next-generation biodiversity analysis. Trends in Plant Science, 20, 525–527. https://doi.org/10.1016/j.tplants.2015.06.012
Doyle, J. J. (1992). Gene trees and species trees: Molecular systematics as one-character taxonomy. Systematic Botany, 17, 144–163. https://doi.org/10.2307/2419070
Dragesco, A., Dragesco, J., Coste, F., Gasc, C., Romestand, B., Raymond, J. C., & Bouix, G. (1995). Philasterides dicentrarchi, n. sp., (Ciliophora, Scuticociliatida), a histophagous opportunistic parasite of Dicentrarchus labrax (Linnaeus, 1758), a reared marine fish. European Journal of Protistology, 31, 327–340. https://doi.org/10.1016/S0932-4739(11)80097-0
Edgar, R. C. (2004). MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32, 1792–1797. https://doi.org/10.1093/nar/gkh340
Farris, J. (1978). Inferring phylogenetic trees from chromosome inversion data. Systematic Biology, 27, 275–284.
Felsenstein, J. (1979). Alternative methods of phylogenetic inference and their interrelationship. Systematic Biology, 28, 49–62. https://doi.org/10.1093/sysbio/28.1.49
Fernandes, N. M., da Silva Paiva, T., da Silva-Neto, I. D., Schlegel, M., & Schrago, C. G. (2016). Expanded phylogenetic analyses of the class Heterotrichea (Ciliophora, Postciliodesmatophora) using five molecular markers and morphological data. Molecular Phylogenetics and Evolution, 95, 229–246. https://doi.org/10.1016/j.ympev.2015.10.030
Finlay, B. J., Esteban, G. F., & Fenchel, T. (2004). Protist diversity is different? Protist, 155, 15. https://doi.org/10.1078/1434461000160
Fu, L., Niu, B., Zhu, Z., Wu, S., & Li, W. (2012). CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics, 28, 3150–3152. https://doi.org/10.1093/bioinformatics/bts565
Fulton, T. L., & Strobeck, C. (2006). Molecular phylogeny of the Arctoidea (Carnivora): Effect of missing data on supertree and supermatrix analyses of multiple gene data sets. Molecular Phylogenetics and Evolution, 41, 165–181. https://doi.org/10.1016/j.ympev.2006.05.025
Gao, F., Gao, S., Wang, P., Katz, L. A., & Song, W. (2014). Phylogenetic analyses of cyclidiids (Protista, Ciliophora, Scuticociliatia) based on multiple genes suggest their close relationship with thigmotrichids. Molecular Phylogenetics and Evolution, 75, 219–226. https://doi.org/10.1016/j.ympev.2014.01.032
Gao, F., Huang, J., Zhao, Y., Li, L., Liu, W., Miao, M., ... Song, W. (2017). Systematic studies on ciliates (Alveolata, Ciliophora) in China: Progress and achievements based on molecular information. European Journal of Protistology, 61, 409–423. https://doi.org/10.1016/j.ejop.2017.04.009
Gao, F., & Katz, L. A. (2014). Phylogenomic analyses support the bifurcation of ciliates into two major clades that differ in properties of nuclear division. Molecular Phylogenetics and Evolution, 70, 240–243. https://doi.org/10.1016/j.ympev.2013.10.001
Gao, F., Li, J., Song, W., Xu, D., Warren, A., Yi, Z., & Gao, S. (2016). Multi-gene-based phylogenetic analysis of oligotrich ciliates with emphasis on two dominant groups: Cyrtostrombidiids and strombidiids (Protozoa, Ciliophora). Molecular Phylogenetics and Evolution, 105, 241–250. https://doi.org/10.1016/j.ympev.2016.08.019
Gao, F., Song, W., & Katz, L. A. (2014). Genome structure drives patterns of gene family evolution in ciliates, a case study using Chilodonella uncinata (Protista, Ciliophora, Phyllopharyngea). Evolution, 68, 2287–2295.
Gao, F., Warren, A., Zhang, Q., Gong, J., Miao, M., Sun, P., ... Song, W. (2016). The all-data-based evolutionary hypothesis of ciliated protists with a revised classification of the phylum Ciliophora (Eukaryota, Alveolata). Scientific Reports, 6, 24874. https://doi.org/10.1038/srep24874
Gentekaki, E., Kolisko, M., Boscaro, V., Bright, K. J., Dini, F., Di Giuseppe, G., ... Lynn, D. H. (2014). Large-scale phylogenomic analysis reveals the phylogenetic position of the problematic taxon Protocruzia and unravels the deep phylogenetic affinities of the ciliate lineages. Molecular Phylogenetics and Evolution, 78, 36–42. https://doi.org/10.1016/j.ympev.2014.04.020
Gentekaki, E., Kolisko, M., Gong, Y., & Lynn, D. (2017). Phylogenomics solves a long-standing evolutionary puzzle in the ciliate world: The subclass Peritrichia is monophyletic. Molecular Phylogenetics and Evolution, 106, 1–5. https://doi.org/10.1016/j.ympev.2016.09.016
Grant, J. R., & Katz, L. A. (2014). Building a phylogenomic pipeline for the eukaryotic tree of life-addressing deep phylogenies with genome-scale data. PLoS Currents, 6. ecurrents.tol.c24b6054aebf3602748ac3602 042ccc3602748f3602742e3602749.
Hedges, S. B., Moberg, K. D., & Maxson, L. R. (1990). Tetrapod phylogeny inferred from 18S and 28S ribosomal RNA sequences and a review of the evidence for amniote relationships. Molecular Biology and Evolution, 7, 607–633.
Huang, J., Luo, X., Bourland, W. A., Gao, F., & Gao, S. (2016). Multigene-based phylogeny of the ciliate families Amphisiellidae and Trachelostylidae (Protozoa: Ciliophora: Hypotrichia). Molecular Phylogenetics and Evolution, 101, 101–110. https://doi.org/10.1016/j.ympev.2016.05.007
Huson, D. H., & Bryant, D. (2006). Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution, 23, 254–267. https://doi.org/10.1093/molbev/msj030
Iglesias, R., Parama, A., Alvarez, M., Leiro, J., Fernandez, J., & Sanmartín, M. L. (2001). Philasterides dicentrarchi (Ciliophora, Scuticociliatida) as the causative agent of scuticociliatosis in farmed turbot Scophthalmus maximus in Galicia (NW Spain). Diseases of Aquatic Organisms, 46, 47–55. https://doi.org/10.3354/dao046047
Jung, S.-J., Bae, M.-J., Oh, M.-J., & Lee, J. (2011). Sequence conservation in the internal transcribed spacers and 5.8 S ribosomal RNA of parasitic scuticociliates Miamiensis avidus (Ciliophora, Scuticociliatia). Parasitology International, 60, 216–219. https://doi.org/10.1016/j.parint.2011.02.004
Keeling, P. J., Burki, F., Wilcox, H. M., Allam, B., Allen, E. E., Amaral-Zettler, L. A., ... Worden, A. Z. (2014). The marine microbial eukaryote

transcriptome sequencing project (MMETSP): Illuminating the functional diversity of eukaryotic life in the oceans through transcriptome sequencing. PLOS Biology, 12, e1001889. https://doi.org/10.1371/journal.pbio.1001889
Klobutcher, L. A., Gygax, S. E., Podoloff, J. D., Vermeesch, J. R., Price, C. M., Tebeau, C. M., & Jahn, C. L. (1998). Conserved DNA sequences adjacent to chromosome fragmentation and telomere addition sites in Euplotes crassus. Nucleic Acids Research, 26, 4230–4240. https://doi.org/10.1093/nar/26.18.4230
Kumar, S., & Gadagkar, S. R. (2000). Efficiency of the neighbor-joining method in reconstructing deep and shallow evolutionary relationships in large phylogenies. Journal of Molecular Evolution, 51, 544–553. https://doi.org/10.1007/s002390010118
Kumar, S., Stecher, G., & Tamura, K. (2016). MEGA7: Molecular evolutionary genetics analysis version 7.0 for bigger datasets. Molecular Biology and Evolution, 33, 1870–1874. https://doi.org/10.1093/molbev/msw054
Larget, B. R., Kotha, S. K., Dewey, C. N., & Ane, C. (2010). BUCKy: Gene tree/species tree reconciliation with Bayesian concordance analysis. Bioinformatics, 26, 2910–2911. https://doi.org/10.1093/bioinformatics/btq539
Lartillot, N., Lepage, T., & Blanquart, S. (2009). PhyloBayes 3: A Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics, 25, 2286–2288. https://doi.org/10.1093/bioinformatics/btp368
Lasken, R. S. (2013). Single-cell sequencing in its prime. Nature Biotechnology, 31, 211. https://doi.org/10.1038/nbt.2523
Le, S. Q., & Gascuel, O. (2008). An improved general amino acid replacement matrix. Molecular Biology and Evolution, 25, 1307–1320. https://doi.org/10.1093/molbev/msn067
Lemmon, A. R., Brown, J. M., Stanger-Hall, K., & Lemmon, E. M. (2009). The effect of ambiguous data on phylogenetic estimates obtained by maximum likelihood and Bayesian inference. Systematic Biology, 58, 130–145. https://doi.org/10.1093/sysbio/syp017
Li, X., Yang, Y., Henry, R. J., Rossetto, M., Wang, Y., & Chen, S. (2015). Plant DNA barcoding: From gene to genome. Biological Reviews of the Cambridge Philosophical Society, 90, 157–166. https://doi.org/10.1111/brv.12104
Lu, S., Zong, C., Fan, W., Yang, M., Li, J., Chapman, A. R., ... Xie, X. S. (2012). Probing meiotic recombination and aneuploidy of single sperm cells by whole-genome sequencing. Science, 338, 1627–1630. https://doi.org/10.1126/science.1229112
Lynn, D. (2008). The ciliated protozoa: Characterization, classification, and guide to the literature. New York, NY: Springer Science & Business Media.
Lynn, D. H., & Kolisko, M. (2017). Molecules illuminate morphology: Phylogenomics confirms convergent evolution among 'oligotrichous' ciliates. International Journal of Systematic and Evolutionary Microbiology, 67, 3676–3682.
Maddison, W. P. (1997). Gene trees in species trees. Systematic Biology, 46, 523–536. https://doi.org/10.1093/sysbio/46.3.523
Maddison, W. P., & Knowles, L. L. (2006). Inferring phylogeny despite incomplete lineage sorting. Systematic Biology, 55, 21–30. https://doi.org/10.1080/10635150500354928
Metenier, G., & Hufschmid, J.-D. (1988). Evidence of extensive fragmentation of macronuclear DNA in two non-hypotrichous ciliates. The Journal of Protozoology, 35, 71–73. https://doi.org/10.1111/j.1550-7408.1988.tb04079.x
Miller, M. A., Pfeiffer, W., & Schwartz, T. (2010). Creating the CIPRES Science Gateway for inference of large phylogenetic trees. Gateway Computing Environments Workshop (GCE), 1, 1–8.
Mindell, D. P., & Honeycutt, R. L. (1990). Ribosomal RNA in vertebrates: Evolution and phylogenetic applications. Annual Review of Ecology and Systematics, 21, 541–566. https://doi.org/10.1146/annurev.es.21.110190.002545
Munday, B., O'Donoghue, P., Watts, M., Rough, K., & Hawkesford, T. (1997). Fatal encephalitis due to the scuticociliate Uronema nigricans in sea-caged, southern bluefin tuna Thunnus maccoyii. Diseases of Aquatic Organisms, 30, 17–25. https://doi.org/10.3354/dao030017
Nurk, S., Bankevich, A., Antipov, D., Gurevich, A. A., Korobeynikov, A., Lapidus, A., ... Pevzner, P. A. (2013). Assembling single-cell genomes and mini-metagenomes from chimeric MDA products. Journal of Computational Biology, 20, 714–737. https://doi.org/10.1089/cmb.2013.0084
Pamilo, P., & Nei, M. (1988). Relationships between gene trees and species trees. Molecular Biology and Evolution, 5, 568–583.
Pan, H., Huang, J., Hu, X., Fan, X., Al-Rasheid, K. A., & Song, W. (2015). Morphology and SSU rRNA gene sequences of three marine ciliates from Yellow Sea, China, including one new species, Uronema heteromarinum nov. spec. (Ciliophora, Scuticociliatida). Acta Protozoologica, 49, 45–59.
Pawlowski, J., Audic, S., Adl, S., Bass, D., Belbahri, L., Berney, C., ... de Vargas, C. (2012). CBOL protist working group: Barcoding eukaryotic richness beyond the animal, plant, and fungal kingdoms. PLOS Biology, 10, e1001419. https://doi.org/10.1371/journal.pbio.1001419
Riley, J. L., & Katz, L. A. (2001). Widespread distribution of extensive chromosomal fragmentation in ciliates. Molecular Biology and Evolution, 18, 1372–1377. https://doi.org/10.1093/oxfordjournals.molbev.a003921
Ronquist, F., Teslenko, M., van der Mark, P., Ayres, D. L., Darling, A., Höhna, S., ... Huelsenbeck, J. P. (2012). MrBayes 3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. Systematic Biology, 61, 539–542. https://doi.org/10.1093/sysbio/sys029
Rosenberg, N. A. (2002). The probability of topological concordance of gene trees and species trees. Theoretical Population Biology, 61, 225–247. https://doi.org/10.1006/tpbi.2001.1568
Rosenberg, N. A. (2003). The shapes of neutral gene genealogies in two species: Probabilities of monophyly, paraphyly, and polyphyly in a coalescent model. Evolution, 57, 1465–1477. https://doi.org/10.1111/j.0014-3820.2003.tb00355.x
Roure, B., Baurain, D., & Philippe, H. (2013). Impact of missing data on phylogenies inferred from empirical phylogenomic data sets. Molecular Biology and Evolution, 30, 197–214. https://doi.org/10.1093/molbev/mss208
Song, J.-Y., Kitamura, S.-I., Oh, M.-J., Kang, H. S., Lee, J. H., Tanaka, S. J., & Jung, S. J. (2009). Pathogenicity of Miamiensis avidus (syn. Philasterides dicentrarchi), Pseudocohnilembus persalinus, Pseudocohnilembus hargisi and Uronema marinum (Ciliophora, Scuticociliatida). Diseases of Aquatic Organisms, 83, 133–143. https://doi.org/10.3354/dao02017
Song, W., & Wilbert, M. (2000). Redefinition and redescription of some marine scuticociliates from China, with report of a new species, Metanophrys sinensis nov. spec. (Ciliophora, Scuticociliatida). Zoologischer Anzeiger, 239, 45–74.
Stamatakis, A. (2014). RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30, 1312–1313. https://doi.org/10.1093/bioinformatics/btu033
Stanke, M., Tzvetkova, A., & Morgenstern, B. (2006). AUGUSTUS at EGASP: Using EST, protein and genomic alignments for improved gene prediction in the human genome. Genome Biology, 7, 1.
Straub, S. C., Parks, M., Weitemier, K., Fishbein, M., Cronn, R. C., & Liston, A. (2012). Navigating the tip of the genomic iceberg: next-generation sequencing for plant systematics. American Journal of Botany, 99, 349–364. https://doi.org/10.3732/ajb.1100335
Swart, E. C., Bracht, J. R., Magrini, V., Minx, P., Chen, X., Zhou, Y., ... Landweber, L. F. (2013). The Oxytricha trifallax macronuclear genome: A complex eukaryotic genome with 16,000 tiny chromosomes. PLOS Biology, 11, e1001473. https://doi.org/10.1371/journal.pbio.1001473
Swart, E. C., Serra, V., Petroni, G., & Nowacki, M. (2016). Genetic codes with no dedicated stop codon: Context-dependent translation

termination. Cell, 166, 691–702. https://doi.org/10.1016/j.cell.2016. Wiens, J. J. (2006). Missing data and the design of phylogenetic analyses. 06.020 Journal of Biomedical Informatics, 39,34–42. https://doi.org/10.1016/ Takahata, N. (1989). Gene genealogy in three related populations: Con- j.jbi.2005.04.001 sistency probability between gene and population trees. Genetics, Wiens, J. J., & Morrill, M. C. (2011). Missing data in phylogenetic analy- 122, 957–966. sis: Reconciling results from simulations and empirical data. System- Throckmorton, L. H. (1965). Similarity versus relationship in Drosophila. atic Biology, 60, 719–731. https://doi.org/10.1093/sysbio/syr025 Systematic Zoology, 14, 221–236. https://doi.org/10.2307/2411551 Yan, Y., Fan, Y., Chen, X., Li, L., Warren, A., Al-Farraj, S. A., & Song, W. Vd’acny, P., & Foissner, W. (2017). A huge diversity of metopids (Cilio- (2016). Taxonomy and phylogeny of three ciliates (Proto- phora, Armophorea) in soil from the Murray River Floodplain, Aus- zoa, Ciliophora), with description of a new Blepharisma species. Zoo- tralia. II. Morphology and morphogenesis of Lepidometopus logical Journal of the Linnean Society, 177, 320–334. https://doi.org/ platycephalus nov. gen., nov. spec. Acta Protozoologica, 56,39–57. 10.1111/zoj.12369 Vd’acny, P., Orsi, W., & Foissner, W. (2010). Molecular and morphological Yi, Z., Huang, L., Yang, R., Lin, X., & Song, W. (2016). Actin evolution in evidence for a sister group relationship of the classes Armophorea ciliates (Protist, Alveolata) is characterized by high diversity and three and Litostomatea (Ciliophora, , Lamellicorticata inf- duplication events. Molecular Phylogenetics and Evolution, 96,45–54. raphyl. nov.), with an account on basal litostomateans. European Jour- https://doi.org/10.1016/j.ympev.2015.11.024 nal of Protistology, 46, 298–309. Zhang, Q., Agatha, S., Zhang, W., Dong, J., Yu, Y., Jiao, N., & Gong, J. Wang, Y. 
SUPPORTING INFORMATION

Additional Supporting Information may be found online in the supporting information tab for this article.

How to cite this article: Chen X, Wang Y, Sheng Y, Warren A, Gao S. GPSit: An automated method for evolutionary analysis of nonculturable ciliated microeukaryotes. Mol Ecol Resour. 2018;18:700–713. https://doi.org/10.1111/1755-0998.12750