Reference Database for the 18S rRNA Gene
Received: 3 November 2017 | Revised: 15 February 2018 | Accepted: 24 February 2018
DOI: 10.1111/1755-0998.12781

RESOURCE ARTICLE

DINOREF: A curated dinoflagellate (Dinophyceae) reference database for the 18S rRNA gene

Solenn Mordret1 | Roberta Piredda1 | Daniel Vaulot2 | Marina Montresor1 | Wiebe H. C. F. Kooistra1 | Diana Sarno1

1Integrative Marine Ecology, Stazione Zoologica Anton Dohrn, Naples, Italy
2Sorbonne Université, CNRS, UMR Adaptation et Diversité en Milieu Marin, Station Biologique, Roscoff, France

Correspondence
Diana Sarno, Integrative Marine Ecology, Stazione Zoologica Anton Dohrn, Naples, Italy.
Email: [email protected]

Funding information
Italian Ministry of Education, University and Research

Mol Ecol Resour. 2018;1–14. wileyonlinelibrary.com/journal/men © 2018 John Wiley & Sons Ltd

Abstract

Dinoflagellates are a heterogeneous group of protists present in all aquatic ecosystems where they occupy various ecological niches. They play a major role as primary producers, but many species are mixotrophic or heterotrophic. Environmental metabarcoding based on high-throughput sequencing is increasingly applied to assess diversity and abundance of planktonic organisms, and reference databases are definitely needed to taxonomically assign the huge number of sequences. We provide an updated 18S rRNA reference database of dinoflagellates: DINOREF. Sequences were downloaded from GENBANK and filtered based on stringent quality criteria. All sequences were taxonomically curated, classified taking into account classical morphotaxonomic studies and molecular phylogenies, and linked to a series of metadata. DINOREF includes 1,671 sequences representing 149 genera and 422 species. The taxonomic assignation of 468 sequences was revised. The largest number of sequences belongs to Gonyaulacales and Suessiales that include toxic and symbiotic species. DINOREF provides an opportunity to test the level of taxonomic resolution of different 18S barcode markers based on a large number of sequences and species. As an example, when only the V4 region is considered, 374 of the 422 species included in DINOREF can still be unambiguously identified. Clustering the V4 sequences at 98% similarity, a threshold that is commonly applied in metabarcoding studies, resulted in a considerable underestimation of species diversity.

KEYWORDS
18S rRNA gene, dinoflagellates, diversity, phylogeny, sequence database, V4 region

1 | INTRODUCTION

Assessing global biodiversity constitutes an important and urgent task in the face of the currently unprecedented rate of climate change, but this task is fraught with major challenges. A large part of this biodiversity is composed of protists, and it is especially in these unicellular eukaryotes that taxonomists are confronted by the fact that cell morphology does not always allow discrimination of species, especially in small and relatively featureless taxa. When compared with morphological traits, sequence data usually provide more precise and apparently more objective ways to delineate species and therefore more precise ways to enumerate them. DNA-based detection and enumeration methodologies, such as high-throughput sequencing (HTS) metabarcoding of environmental samples, now offer opportunities for assessing protistan diversity rapidly and precisely (Amaral-Zettler, McCliment, Ducklow, & Huse, 2009; Massana et al., 2015; Piredda et al., 2017; Stoeck et al., 2009; de Vargas et al., 2015). Yet, to translate these HTS data into species occurrences requires a comprehensive reference database. Curated databases of reference sequences linked to taxonomically identified specimens constitute important research infrastructures for the advancement of our knowledge of the protistan diversity (Decelle et al., 2015; Morard et al., 2015).
Dinoflagellates form a large phylum of protists distributed in various aquatic ecosystems (Hackett, Anderson, Erdner, & Bhattacharya, 2004; Not et al., 2012). The lineage includes autotrophs, heterotrophs and mixotrophs; most species are free-living, but parasites and symbionts are abundant as well, adding complexity to aquatic ecological networks (Gómez, 2012b; Jephcott et al., 2016; Stoecker, 1999; Weisse et al., 2016). Dinoflagellates are of socio-economic relevance because several species produce secondary metabolites, some of which are highly toxic to humans and marine organisms (Anderson, Cembella, & Hallegraeff, 2012; Berdalet et al., 2016). Therefore, dinoflagellates are monitored on a regular basis in many coastal regions. The assessment of dinoflagellate diversity is still largely based on light microscopy observations (LM). Yet, this approach has its limitations: it requires a high level of taxonomic expertise and is time-consuming, and many species, including minute, naked and parasitic dinoflagellates, are virtually impossible to identify. In some cases, toxic species are difficult to distinguish from nontoxic relatives in LM (de Salas, Laza-Martínez, & Hallegraeff, 2008; Montresor, John, Beran, & Medlin, 2004).

Despite its rather low variability (Murray, Flø Jørgensen, Ho, Patterson, & Jermiin, 2005), the 18S ribosomal subunit (18S) is potentially a good DNA barcode region for dinoflagellates because of the large number of sequences deposited in public repositories as a result of its popular use in taxonomic and phylogenetic studies (Gómez, 2014; John et al., 2014), as well as in single-cell PCR amplification studies (e.g., Gómez, Moreira, & López-García, 2010; Hoppenrath, Murray, Sparmann, & Leander, 2012; Ki, Jang, & Han, 2005; Ruiz Sebastian & O'Ryan, 2001). The variable V4 or V9 regions in the 18S rRNA-encoding region are the most commonly used nucleotide markers in metabarcoding studies on environmental samples (e.g., Le Bescot et al., 2016; Onda et al., 2017).

To improve taxonomic annotation of environmental sequences generated by HTS approaches, curated reference barcode databases are needed. PR2 was the first database that became available for protists (Guillou et al., 2013), which included 136,866 nuclear-encoded sequences that were taxonomically assigned to eight taxonomic fields. Few other specialized databases have been created to provide taxonomically validated sequences with updated nomenclature and contextual metadata for different groups of organisms (e.g., Decelle et al., 2015 for 16S of photosynthetic eukaryotes; Morard et al., 2015 for foraminifera).

The aim of this study was to provide a taxonomically curated database, called DINOREF, composed of the "core dinoflagellates," that is, the species with a dinokaryon, a modified nucleus containing permanently condensed fibrillar chromosomes, as defined by Orr, Murray, Stüken, Rhodes, and Jakobsen (2012). To populate the database, we gathered all dinoflagellate 18S rRNA sequences available in GENBANK, screened them against a set of quality criteria and verified their taxonomic assignations by means of phylogenetic positioning. For each sequence, taxonomy and nomenclature from GENBANK were [...] dinoflagellate classification at the suprageneric level, we also present an alternative organization of the sequence data, namely one based on results of 18S phylogenetic inferences integrated with the latest literature sources. Finally, we illustrated the degree to which the 18S V4 region, which is broadly used in metabarcoding surveys and has been proposed as the universal eukaryotic prebarcode (Hu et al., 2015; Pawlowski et al., 2012), is able to discriminate species.

2 | MATERIALS AND METHODS

2.1 | Sequence retrieval

Dinoflagellate 18S rRNA entries available on the 29th of August 2016 were downloaded from NCBI GENBANK (https://www.ncbi.nlm.nih.gov/) using the following text query: Dinophyceae[Organism] AND (small subunit ribosomal[titl] OR 18S[titl] OR SSU[titl]). The features associated with the GENBANK entries were extracted. Sequences of early branching dinoflagellates, not classified as Dinophyceae (sensu Orr et al., 2012), were recovered genus by genus and were excluded from DINOREF, but provided as supplementary material (see below).

2.2 | Sequence verification

Sequences labelled as "environmental samples" in GENBANK (i.e., from metabarcoding, clone libraries or uncultured organisms; labelled as ENV in the GENBANK locus field) and those classified above the genus level were removed. Retained sequences were aligned with MAFFT version 7 (default parameters, Katoh & Standley, 2013) and manually checked. Sequences not meeting the following quality criteria were removed (Figure 1, step 1): (i) sequences deposited as 18S but not representing 18S at all or only the very 3′-end of it; and (ii) 18S sequences not including the full V4 region. Of the remaining sequences, the regions outside the 18S rRNA gene were removed. After a second filtration step (Figure 1, step 2), only those sequences that fulfilled all of the following criteria were retained: (i) sequence length ≥450 bp; (ii) <50 ambiguous nucleotides; (iii) <20 consecutive ambiguous bp; (iv) insertions <20 consecutive bp; and (v) deletions <100 consecutive bp. Sequences that aligned poorly or not at all with any of the other sequences were blasted against GENBANK and were removed if BLAST results suggested placement outside dinoflagellates (i.e., BLAST assignation <90%). The retained set of sequences constituted the "quality-checked database" (Figure 1).

2.3 | Taxonomic validation

The taxonomic validation was an iterative process including the following steps:

1. control on the validity of the nomenclature,