TREE2FASTA: a Flexible Perl Script for Batch Extraction of FASTA

Sauvage et al. BMC Res Notes (2018) 11:164 https://doi.org/10.1186/s13104-018-3268-y BMC Research Notes RESEARCH NOTE Open Access TREE2FASTA: a fexible Perl script for batch extraction of FASTA sequences from exploratory phylogenetic trees Thomas Sauvage1,2* , Sophie Plouviez1, William E. Schmidt1 and Suzanne Fredericq1 Abstract Objective: The body of DNA sequence data lacking taxonomically informative sequence headers is rapidly growing in user and public databases (e.g. sequences lacking identifcation and contaminants). In the context of systematics studies, sorting such sequence data for taxonomic curation and/or molecular diversity characterization (e.g. crypticism) often requires the building of exploratory phylogenetic trees with reference taxa. The subsequent step of segregating DNA sequences of interest based on observed topological relationships can represent a challenging task, especially for large datasets. Results: We have written TREE2FASTA, a Perl script that enables and expedites the sorting of FASTA-formatted sequence data from exploratory phylogenetic trees. TREE2FASTA takes advantage of the interactive, rapid point-and- click color selection and/or annotations of tree leaves in the popular Java tree-viewer FigTree to segregate groups of FASTA sequences of interest to separate fles. TREE2FASTA allows for both simple and nested segregation designs to facilitate the simultaneous preparation of multiple data sets that may overlap in sequence content. Keywords: Barcoding, Biodiversity, Clone, Contaminant, Cryptic, Environmental, FigTree, Forensic, Metabarcoding, OTU, Phylogeny, Systematics Introduction from tree topologies can represent a difficult task since A classic workflow in DNA-based systematics stud- tree-viewing relies on a Newick string [4] that does not ies [1] consists in building exploratory trees to visu- contain DNA information, the latter being enclosed in alize topological relationships of novel sequences the original FASTA file used for tree-building. Thus, to within a larger framework of reference taxa. This relate DNA strings to tip labels (i.e. sequence names), allows for the molecular identification of uncurated one usually needs to script in programming language sequences, the discovery of molecular crypticism [2], such as R, e.g. relying on the package Ape [5] with as well as choosing relevant ingroup/outgroup taxa [3] function ‘drop.tip’ or ‘extract.clade’ to create object (i.e. those to be segregated among the pool of avail- lists of sequence names to match to DNA sequences. able FASTA sequences for focused systematics stud- While this may facilitate part of the process, rap- ies). Systematists may also need to segregate groups idly selecting numerous clades or tips interactively in of FASTA sequences to examine sequence attributes the R interface may not be as fluid as in a dedicated across different clades, such as comparing GC con- tree-viewer such as the popular Java program FigTree tent, examine sequence motifs or divergence. Cur- [6]. For researchers with limited scripting skills, the rently, efficiently mining FASTA sequences of interest process requires to manually edit FASTA files via copy/paste (or delete) in a text editor for wanted (or unwanted) sequences. Others may type extensive lists *Correspondence: [email protected] 1 Department of Biology, University of Louisiana at Lafayette, 410 E. Saint of observed tip labels (i.e. sequence names) that can be Mary Boulevard, Lafayette, LA 70503, USA used to parse FASTA files with dedicated scripts avail- Full list of author information is available at the end of the article able from the community, or with matching functions © The Author(s) 2018. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/ publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Sauvage et al. BMC Res Notes (2018) 11:164 Page 2 of 6 (See fgure on next page.) Fig. 1 Simulated phylogeny displaying taxa named ‘A’ to ‘T’. a Basic workfow for FASTA sequence extraction with TREE2FASTA. An exploratory tree is built following multiple-alignment of FASTA data. The Newick tree string (NWK) is visualized and edited in the tree-viewer FigTree and saved as a NEXUS fle (NEX). TREE2FASTA uses the FASTA alignment and the NEXUS fle (NEX) to produce subsetted FASTA fles according to user selection scheme (here color). b Example of possible color and/or annotation selection schemes in FigTree for TREE2FASTA sequence extraction. The FASTA icon marked with an asterisk ‘*’ contains FASTA sequences for taxa H and I lacking color selection (i.e. achromatic) or lacking annotation. For fgure clarity annotation ‘Group1’ to ‘Group4’ are reported G1 to G4 within FASTA fle icons. FASTA fles output to diferent folders are delimited by dashed boxes of the Galaxy tool shed [7], as well as with command Input and output line tools such as samtools [8] or blastdbcmd from the TREE2FASTA formats input fles’ end of lines to line feed NCBI Blast+ package [9]. Overall, although some of (LF) character and thus can be run on Linux, Mac and the above practices may be feasible for small datasets Windows systems with fles generated on any of (and (e.g. typing lists), they may rapidly become unpracti- across) these platforms. Te user provides an interleaved cal for researchers who are faced with large data sets or sequential FASTA fle (i.e. DNA in multi-line or sin- (100 to 1000+ sequences to be sorted). Here, to offer gle-line, respectively) and a NEXUS tree fle exclusively a rapid and interactive solution to sequence selection edited from FigTree. While running, TREE2FASTA will from exploratory phylogenies, we devised a Perl script check for the concordance of FASTA sequence identi- named TREE2FASTA that allows the batch extrac- fers with tree labels, as well as for duplicates within tion of FASTA sequences via color and/or annotation each of these fles. TREE2FASTA will issue warnings of tips/clades of interest in FigTree. We illustrate an and print duplicate or missing sequence labels to help example use of this Perl script to rapidly sort unidenti- with any troubleshooting. However, in such instance, fied chloroplast 16S rDNA sequences belonging to the TREE2FASTA will still run forward and extract existing red seaweeds from the public NCBI Genbank® reposi- matches for fexibility. Subsetted FASTA fles are placed tory. We also document TREE2FASTA’s execution in dedicated folders (‘ANNOT’, ‘COLOR’, ‘COMBO’) speed on two 1000+ sequence trees built from refer- and each FASTA fles is named according to the annota- ence databases used in phototroph metabarcoding. tion or color edited on the tree (as hexadecimal (HEX) codes according to the encoding of individual colors in FigTree’s NEXUS fle, Additional fle 1) or as an anno- Main text tation_color combination. TREE2FASTA fully decon- General implementation structs the edited tree design (each color or annotation A preliminary requirement to using TREE2FASTA is component and their combination) so users can pick the to produce an exploratory phylogeny in a tree-building FASTA fles most relevant to their splitting goals. program relying on a FASTA sequence alignment (e.g. in RAxML [10]). Such program will output a Newick Selection schemes string (‘.nwk’ or ‘.tre’ file) directly readable by most For the purpose of illustration, we use a small simulated tree-viewers for visualization and editing (Newick phylogeny with taxa denoted A to T that we color and/ strings encode taxa relationships in a simple nested or annotate as Group1 to Group4. Five main exam- parentheses format embracing sequence labels [4]). ples are displayed, single and multiple colors, single and Here, we specifically chose the tree-viewer FigTree multiple annotations, and the nested combination of for its intuitive and interactive interface allowing colors and annotations (Fig. 1b). For simplicity here, we rapid coloring/annotation in a few mouse clicks, and depicted TREE2FASTA outputs as FASTA icons with for conveniently exporting edited information within colors matching the tips in the tree (rather than with NEXUS files [11]. TREE2FASTA exclusively parses HEX codes, see section above) and with shorter annota- the taxa block of FigTree’s NEXUS file, which contains tions (in top left corner of FASTA icon, e.g. ‘G1’ instead sequence labels and edited information (see Additional of ‘Group1’). Note that when using annotation in FigTree, file 1), in order to match sequence labels in the original this program masks the original tip label on the tree but FASTA file (containing the DNA sequences) for batch keeps track of it internally. extraction of the sequences of interest in subsetted Using a single color or annotation (i.e. monochro- FASTA files (Fig. 1a). Sequence extraction follows the matic or mono-annotated) will result in the output of selection scheme edited on the tree by the user. two fles, one containing all edited sequences (FASTA icon red or FASTA icon ‘G1’ for Group1, respectively), Sauvage et al. BMC Res Notes (2018) 11:164 Page 3 of 6 Sauvage et al. BMC Res Notes (2018) 11:164 Page 4 of 6 and the second, the remaining unselected sequences (i.e. seaweed. Following multiple-alignment of the sequences taxa H and I in the FASTA icon denoted with an aster- in MUSCLE (2 iterations) [12], we built an exploratory isk; such fles are named ‘NOCOLOR.fas’ or ‘NON- maximum likelihood (ML) tree with RAxML (keeping AME.fas’ in TREE2FASTA computer output). Likewise, the best tree out of 10 restart searches with the rapid using multiple colors or annotations (i.e.

TREE2FASTA: a Flexible Perl Script for Batch Extraction of FASTA

Treebase: an R Package for Discovery, Access and Manipulation of Online Phylogenies

NASP: an Accurate, Rapid Method for the Identification of Snps in WGS Datasets That Supports Flexible Input and Output Formats Jason Sahl

Hardware Pattern Matching for Network Traffic Analysis in Gigabit

Introduction to Bioinformatics (Elective) – SBB1609

Can Hybridization Be Detected Between African Wolf and Sympatric Canids?

Comprehensive Examinations in Computer Science 1872 - 1878

A Comprehensive SARS-Cov-2 Genomic Analysis Identifies Potential Targets for Drug Repurposing

Trusted Research Environments (TRE) a Strategy to Build Public Trust and Meet Changing Health Data Science Needs

Computational Biology and Bioinformatics

An Efficient Pipeline for Assaying Whole-Genome Plastid Variation for Population Genetics and Phylogeography

Deep Sampling of Hawaiian Caenorhabditis Elegans Reveals

A Personalized Cloud-Based Traffic Redundancy Elimination for Smartphones Vivekgautham Soundararaj Clemson University, [email protected]