DNA Barcode Data Accurately Assign Higher Spider Taxa
Jonathan A. Coddington1, Ingi Agnarsson1,2, Ren-Chung Cheng3, Klemen Čandek3, Amy Driskell1, Holger Frick4, Matjaž Gregorič3, Rok Kostanjšek5, Christian Kropf4, Matthew Kweskin1, Tjaša Lokovšek3, Miha Pipan3,6, Nina Vidergar3 and Matjaž Kuntner1,3

1 National Museum of Natural History, Smithsonian Institution, Washington, D.C., United States
2 Department of Biology, University of Vermont, Burlington, Vermont, United States
3 EZ Lab, Institute of Biology, Research Centre of the Slovenian Academy of Sciences and Arts, Ljubljana, Slovenia
4 Department of Invertebrates, Natural History Museum Bern, Bern, Switzerland
5 Department of Biology, Biotechnical Faculty, University of Ljubljana, Ljubljana, Slovenia
6 Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom

ABSTRACT
The use of unique DNA sequences as a method for taxonomic identification is no longer fundamentally controversial, even though debate continues on the best markers, methods, and technology to use. Although existing databanks such as GenBank and BOLD, as well as reference taxonomies, are imperfect, in best-case scenarios "barcodes" (whether single or multiple, organelle or nuclear, loci) clearly are an increasingly fast and inexpensive method of identification, especially as compared to manual identification of unknowns by increasingly rare expert taxonomists. Because most species on Earth are undescribed, a complete reference database at the species level is impractical in the near term. The question therefore arises whether unidentified species can, using DNA barcodes, be accurately assigned to more inclusive groups such as genera and families, taxonomic ranks of putatively monophyletic groups for which the global inventory is more complete and stable.
We used a carefully chosen test library of CO1 sequences from 49 families, 313 genera, and 816 species of spiders to assess the accuracy of genus- and family-level assignment. We queried each sequence against the entire library with BLAST and retained the top ten hits. The percent sequence identity was reported from these hits (PIdent, range 75–100%). Accurate assignment of higher taxa (PIdent above which errors totaled less than 5%) occurred for genera at PIdent values > 95 and families at PIdent values ≥ 91, suggesting these as heuristic thresholds for accurate generic and familial identifications in spiders. Accuracy of identification increases with numbers of species/genus and genera/family in the library; above five genera per family and fifteen species per genus all higher taxon assignments were correct. We propose that using percent sequence identity between conventional barcode sequences may be a feasible and reasonably accurate method to identify animals to family/genus. However, the quality of the underlying database impacts accuracy of results; many outliers in our dataset could be attributed to taxonomic and/or sequencing errors in BOLD and GenBank. It seems that an accurate and complete reference library of families and genera of life could provide accurate higher level taxonomic identifications cheaply and accessibly, within years rather than decades.

Submitted 11 January 2016
Accepted 10 June 2016
Published 20 July 2016
Corresponding author: Matjaž Kuntner
Academic editor: Sven Rahmann
Additional Information and Declarations can be found on page 21
DOI 10.7717/peerj.2201
Copyright 2016 Coddington et al.
Distributed under Creative Commons CC-BY 4.0 (open access)
How to cite this article: Coddington et al. (2016), DNA barcode data accurately assign higher spider taxa. PeerJ 4:e2201; DOI 10.7717/peerj.2201
Subjects: Biodiversity, Bioinformatics, Ecology, Genetics, Taxonomy
Keywords: Taxonomic impediment, Family, Genus, Global Genome Initiative, Genome, DNA barcoding

INTRODUCTION
Accurate identification of biological specimens has always limited the application of biological data to important societal problems. The obstacles are well known and difficult: the vast majority of species are undescribed scientifically (Erwin, 1982; May, 1992; Mora et al., 2011); some unknown but large fraction of higher taxa are not monophyletic (Goloboff et al., 2009; Pyron & Wiens, 2011); many species can only be identified if certain life stages are available, e.g., adults (Coddington & Levi, 1991); classical data sources such as morphology imperfectly track species identity; the discipline of taxonomy continues to dwindle (Agnarsson & Kuntner, 2007); and the classical process of taxonomic identification is mostly manual and cannot scale to provide the amounts of data required for real-time decisions in areas such as environmental monitoring, invasive species, and climate change.

DNA sequence data can potentially eliminate most of these obstacles. DNA barcoding uses a fragment of the mitochondrial gene cytochrome c oxidase subunit I (CO1) as a unique species diagnosis/identification tool in the animal kingdom (Hebert et al., 2003), with analogous single to several locus protocols applied for vascular plants, ferns, mosses, algae and fungi (Saunders, 2005; Kress & Erickson, 2007; Nitta, 2008; Chase & Fay, 2009; Liu et al., 2010), protists (Scicluna, Tawari & Clark, 2006), and prokaryotes (Barraclough et al., 2009).
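To make identification by sequence similarity concrete, the following minimal sketch applies the heuristic thresholds reported in the abstract (genus at PIdent > 95, family at PIdent ≥ 91) to the best-matching reference. It is an illustration under stated assumptions, not the study's pipeline: percent identity is computed position by position on pre-aligned, equal-length sequences rather than by BLAST, only the single best hit is used rather than the top ten, and the names `Reference`, `percent_identity`, `assign_higher_taxa`, and the toy sequences and taxon labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Heuristic thresholds quoted from the abstract (spiders, CO1).
GENUS_THRESHOLD = 95.0   # genus accepted when PIdent > 95
FAMILY_THRESHOLD = 91.0  # family accepted when PIdent >= 91


@dataclass
class Reference:
    """One library entry; all field values used below are made up."""
    family: str
    genus: str
    sequence: str


def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions in two pre-aligned, equal-length sequences.

    Assumption: a stand-in for BLAST's percent identity, which is computed
    over a local alignment instead.
    """
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


def assign_higher_taxa(
    query: str, library: List[Reference]
) -> Tuple[Optional[str], Optional[str], float]:
    """Return (family, genus, pident) for the best hit.

    A rank is reported only if the best hit clears that rank's threshold;
    otherwise None is returned for it.
    """
    best = max(library, key=lambda ref: percent_identity(query, ref.sequence))
    pident = percent_identity(query, best.sequence)
    family = best.family if pident >= FAMILY_THRESHOLD else None
    genus = best.genus if pident > GENUS_THRESHOLD else None
    return family, genus, pident


# Toy library with hypothetical taxa and 20-base sequences.
library = [
    Reference("FamilyA", "GenusA1", "ACGTACGTACGTACGTACGT"),
    Reference("FamilyB", "GenusB1", "TTTTACGTACGTACGTAAAA"),
]

# One mismatch against the first reference: 19 of 20 positions agree.
print(assign_higher_taxa("ACGTACGTACGTACGTACGA", library))
# -> ('FamilyA', None, 95.0)
```

In this toy case the query is assigned to a family but not to a genus, because 95.0 meets the family threshold while failing the strict genus threshold. Real barcode data would require an alignment step and a far larger reference library.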
Because sequencing is relatively easy and inexpensive, DNA barcoding is a popular tool in species identification and taxonomic applications (e.g., Doña et al., 2015; Xu et al., 2015; see also Collins & Cruickshank, 2013), and the method is no longer fundamentally controversial at the species level (Pentinsaari, Hebert & Mutanen, 2014; Lopardo & Uhl, 2014; Čandek & Kuntner, 2015; Anslan & Tedersoo, 2015; Wang et al., 2015).

While most species remain undescribed, the situation is not so dire for larger monophyletic groups such as clades accorded the Linnaean ranks of genus or family. In assessing the state of knowledge about biodiversity, it is important to distinguish between the first scientific discovery of an exemplar of a lineage, and phylogenetic understanding of that lineage. Phylogenetic understanding, both tree topology and consequent taxonomic changes, is a research program with no clear end in sight. Linnaean rank is partially arbitrary, and one expects that the number of higher taxa will probably increase over time as understanding improves. Discovery, however, can have an objective definition: the year of the earliest formal taxonomic description of a member of the lineage or taxonomic group in which it is currently included. By this definition the earliest possible discovery of an animal lineage is 1758 (Linnaeus, 1758), or in the case of spiders, 1757 (Clerck, 1757).

More illuminating are the latest discoveries of lineages with the rank of family within larger clades, because the data tell us something about progress towards broad scale knowledge of biodiversity. The species representing the most recent discovery of a family of birds, for example, is the Broad-billed Sapayoa, Sapayoa aenigma Hunt, 1903 (Sapayoaidae). The species representing the most recently discovered mammal family is Kitti's hog-nosed bat, Craseonycteris thonglongyai Hill, 1974 (Craseonycteridae). For flowering plants, it is Gomortega keule (Molina) Baill, 1972 (Gomortegaceae). For bees, it is Stenotritus elegans Smith, 1853 (Stenotritidae). For spiders, a megadiverse and poorly known group, it is Trogloraptor marchingtoni Griswold, Audisio & Ledford, 2012 (Trogloraptoridae), but the second most recent discovery of an unambiguously new spider family was in 1955, Gradungulidae (Forster, 1955).

Figure 1 illustrates the tempo of first discovery of families for these five well-known clades. At the family level, these curves are essentially asymptotic, implying that science is close to completing the inventory of clades ranked as families for these large lineages. On the other hand, for Bacteria and Archaea (Fig. 1), as one would expect, the curve is not asymptotic at all but sharply increasing; prokaryote discovery and understanding is obviously just beginning. In fact, although many new eukaryote families are named every year, the vast majority of these new names result from advances in phylogenetic understanding, not biological discovery of major new forms of life. The last ten years of Zoological Record suggest that roughly 5–10 truly new families are discovered per year.

Figure 1 First discovery of major clades of life. Accumulation curve of dates of first discovery (year of first description of a contained species) of families for six major clades of life, 1758–2010.

In the context of the above question (approximate taxonomic assignment of organisms using DNA sequences), these data suggest that our knowledge of major clades of life is approaching completion. The Global Genome Initiative (GGI; http://ggi.si.edu/) of the Smithsonian Institution via the GGI Knowledge Portal (http://ggi.eol.org/) has tabulated a complete list of families of life, which total 9,650, on the whole a surprisingly small number. Ten thousand barcodes, more or less, seems like a feasible goal.
If we were able to assemble a complete database of DNA sequences at the family level, would it suffice to identify any eukaryote on Earth to the family level? While the literature on species identification success of DNA barcodes comprises thousands of studies, only a few have tested their effectiveness at the level of higher