PDF Proof: Mol. Biol. Evol. 14 15 16 17 18 19 20 21 22 23 24 25 26 27 FIG

Page 1 of 61 Molecular Biology and Evolution 1 2 3 Submission intended as an Article for the section Resources of MBE 4 5 6 PhyloToL: A taxon/gene rich phylogenomic pipeline to explore genome evolution of 7 8 diverse eukaryotes 9 10 11 Cerón-Romero M. A a,b, Maurer-Alcalá, X. X. a,b,d, Grattepanche, J-D. a, e, Yan, Y. a, Fonseca, M. 12 c a,b. 13 M. , Katz, L. PDFA Proof: Mol. Biol. Evol. 14 15 16 a Department of Biological Sciences, Smith College, Northampton, Massachusetts, USA. 17 b Program in Organismic and Evolutionary Biology, University of Massachusetts Amherst, 18 19 Amherst, Massachusetts, USA. 20 c 21 CIIMAR - Interdisciplinary Centre of Marine and Environmental Research, University of 22 Porto, Porto, Portugal. 23 24 d Current address: Institute of Cell Biology, University of Bern, Bern, Switzerland. 25 e Current address: Biology Department, Temple University, Philadelphia, Pennsylvania, USA. 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ScholarOne, 375 Greenbrier Drive, Charlottesville, VA, 22901 Support: (434) 964-4100 Molecular Biology and Evolution Page 2 of 61 1 2 3 ABSTRACT 4 5 Estimating multiple sequence alignments (MSAs) and inferring phylogenies are essential for 6 many aspects of comparative biology. Yet, many bioinformatics tools for such analyses have 7 8 focused on specific clades, with greatest attention paid to plants, animals and fungi. The rapid 9 10 increase of high-throughput sequencing (HTS) data from diverse lineages now provides 11 opportunities to estimate evolutionary relationships and gene family evolution across the 12 13 eukaryotic treePDF of life. At Proof:the same time, theseMol. types ofBiol. data are known Evol. to be error-prone (e.g. 14 substitutions, contamination). To address these opportunities and challenges, we have refined a 15 16 phylogenomic pipeline, now named PhyloToL, to allow easy incorporation of data from HTS 17 studies, to automate production of both MSAs and gene trees, and to identify and remove 18 19 contaminants. PhyloToL is designed for phylogenomic analyses of diverse lineages across the 20 21 tree of life (i.e. at scales of >100 million years). We demonstrate the power of PhyloToL by 22 assessing stop codon usage in Ciliophora, identifying contamination in a taxon- and gene-rich 23 24 database and exploring the evolutionary history of chromosomes in the kinetoplastid parasite 25 Trypanosoma brucei, the causative agent of African sleeping sickness. Benchmarking 26 27 PhyloToL’s homology assessment against that of OrthoMCL and a published paper on 28 superfamilies of bacterial and eukaryotic organelle outer membrane pore-forming proteins 29 30 demonstrates the power of our approach for determining gene family membership and inferring 31 32 gene trees. PhyloToL is highly flexible and allows users to easily explore HTS data, test 33 hypotheses about phylogeny and gene family evolution and combine outputs with third-party 34 35 tools (e.g. PhyloChromoMap, iGTP). 36 37 38 Keywords: Phylogenomic pipeline, high-throughput sequencing data, contamination removal, 39 40 genome evolution, chromosome mapping. 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ScholarOne, 375 Greenbrier Drive, Charlottesville, VA, 22901 Support: (434) 964-4100 Page 3 of 61 Molecular Biology and Evolution 1 2 3 INTRODUCTION 4 5 An important way to study biodiversity is through phylogenomics, which uses the 6 generation of multiple sequence alignments (MSAs), gene trees and species trees (e.g. Katz 7 8 and Grant 2015; Hug, et al. 2016). During the last two decades, advances in DNA sequencing 9 10 technology (e.g. 454, Illumina, Nanopore and PacBio) have led to the rapid accumulation of 11 data (transcriptomes and genomes) from diverse lineages across the tree of life, greatly 12 13 expanding thePDF opportunities Proof: for phylogenomic Mol. studies Biol.(Katz and Grant Evol. 2015; Burki, et al. 2016; 14 Brown, et al. 2018; Heiss, et al. 2018). Such approaches are powerful by using increasingly 15 16 large molecular datasets to reduce the discordance between gene and species trees. Indeed, 17 studies relying on a small number of genes are often impacted by lateral gene transfer, gene 18 19 duplication and loss, and incomplete lineage sorting (e.g. Maddison 1997; Tremblay-Savard and 20 21 Swenson 2012; Mallo and Posada 2016). Large-scale phylogenomic analyses allow for the 22 exploration of deep evolutionary relationships (dos Reis, et al. 2012; Wickett, et al. 2014; Katz 23 24 and Grant 2015; Hug, et al. 2016), but such analyses require data-intensive computing 25 methods. As a result, numerous laboratories have developed custom phylogenomic pipelines 26 27 proposing different methods to efficiently process and analyze massive gene and taxon 28 databases (e.g. Sanderson, et al. 2008; Wu and Eisen 2008; Smith, et al. 2009; Kumar, et al. 29 30 2015). 31 32 In general, phylogenomic pipelines are composed of three steps: 1) construction of a 33 collection of homologous gene datasets from various input sources (e.g. whole genome 34 35 sequencing, transcriptome analyses, PCR based studies), 2) production of MSAs, and 3) 36 generation of gene trees and sometimes a species tree. Phylogenomic pipelines typically put 37 38 more effort in the first two steps (collecting homologous genes and MSA curation) to ensure a 39 40 more accurate tree inference. For instance, pipelines such as PhyLoTA (Sanderson, et al. 2008) 41 and BIR (Kumar, et al. 2015) focus on the identification and collection of homologous genes by 42 43 exploring public databases such as GenBank (Benson, et al. 2017). On the other hand, 44 pipelines such as AMPHORA (Wu and Eisen 2008) and Mega-phylogeny (Smith, et al. 2009) 45 46 focus on the construction and refinement of robust alignments rather than the collection of 47 homologs. A recently published tool, SUPERSMART (Antonelli, et al. 2017), incorporates more 48 49 efficient methods for data mining than PhyLoTA (Sanderson, et al. 2008). SUPERSMART 50 51 includes sophisticated methods for tree inference using a multilocus coalescent model, which 52 benefits biogeographical analyses. Although these pipelines incorporate sophisticated methods 53 54 for data mining, alignment and tree inference, a major issue is that they are optimized for either 55 56 57 58 59 60 ScholarOne, 375 Greenbrier Drive, Charlottesville, VA, 22901 Support: (434) 964-4100 Molecular Biology and Evolution Page 4 of 61 1 2 3 a relatively narrow taxonomic sampling (e.g. plants) or for relatively narrow sets of conserved 4 5 genes/gene markers. 6 A major problem for phylogenomic analyses using public sequence data, including 7 8 GenBank and EMBL (Baker, et al. 2000), is the inherent difficulty in identifying and removing 9 10 annotation errors and contamination (e.g. data from food sources, symbionts or organelles). 11 Additional errors are introduced when non-protein coding regions (e.g. pseudogenes, promoters 12 13 and repeats) PDFare inferred Proof: as open reading Mol. frames (ORFs) Biol. by gene-prediction Evol. tools such as 14 GENESCAN (Burge and Karlin 1997), SNAP (Korf 2004), AUGUSTUS (Stanke and 15 16 Morgenstern 2005) and MAKER (Cantarel, et al. 2008). Similarly, some public databases are 17 more prone to contain annotation errors than others depending on how much effort they invest 18 19 in manual curation of public submissions. For instance, data from GenBank NR, TrEMBL 20 21 (Bairoch and Apweiler 2000) and KEGG (Kanehisa and Goto 2000) may have very high rates of 22 these errors, whereas curated resources like Gene Ontology (GO; Ashburner, et al. 2000) and 23 24 SwissProt (Bairoch and Apweiler 2000) are more likely to have low to moderate rates of such 25 errors (Schnoes, et al. 2009). The misidentification errors in these databases often stem from 26 27 problems surrounding accurate taxonomic identification of sequences from HTS data sets, as 28 contamination by other taxa can be frequent, particularly of organisms that cannot be cultured 29 30 axenically (Shrestha, et al. 2013; Lusk 2014; Parks, et al. 2015). Hence, a crucial element of 31 32 any phylogenomic pipeline that relies on public databases is the ability to identify and exclude 33 annotation errors and contaminants from its analyses. 34 35 At the same time, the availability of curated databases and third-party tools provide 36 considerable power and efficiency for phylogenomic analyses. We rely on OrthoMCL, a 37 38 database generated initially to support analyses of the genome of Plasmodium falciparum and 39 40 other apicomplexan parasites (Li, et al. 2003; Chen, et al. 2006), for the initial identification of 41 homologous gene families (i.e. GFs). We also incorporate GUIDANCE V2.02 (Penn, et al. 2010; 42 43 Sela, et al. 2015) for assigning statistical confidence MSA scores based on the robustness of 44 the MSA to guide-tree uncertainty. GUIDANCE allows an efficient identification and removal of 45 46 potentially non-homologous sequences (i.e. sequences having very low scoring values) and 47 unreliably aligned columns and residues under various parameters (Privman, et al. 2012; Hall 48 49 2013; Vasilakis, et al. 2013). This flexibility is critical – while concepts such as homology and 50 51 paralogy have clear definitions in textbooks, when it comes to deploy phylogenomic tools on 52 inferences at the scale of >100 million years, they become working definitions that depend of 53 54 parameters and sampling of both genes and taxa. Finally, we have chosen RAxML V8 55 (Stamatakis, et al. 2005; Stamatakis 2014) for tree inference as its efficient algorithms allow for 56 57 58 59 60 ScholarOne, 375 Greenbrier Drive, Charlottesville, VA, 22901 Support: (434) 964-4100 Page 5 of 61 Molecular Biology and Evolution 1 2 3 robust estimation of maximum likelihood trees [though users can access the MSAs from our 4 5 pipeline for analyses with other software].

Load more