Mini-Barcodes Are Equally Useful for Species Identification and More Suitable

bioRxiv preprint doi: https://doi.org/10.1101/594952; this version posted November 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 1 Mini-barcodes are equally useful for species identification and more suitable 2 for large-scale species discovery in Metazoa than full-length barcodes 3 4 Running Title: mini-barcodes for species discovery 5 6 7 8 9 10 11 12 13 14 Darren Yeo1, Amrita Srivathsan1, Rudolf Meier1* 15 16 1Department of Biological Sciences, National University of Singapore, 14 Science 8 17 Drive 4, Singapore 117543 18 *Corresponding author: [email protected] 19 1 bioRxiv preprint doi: https://doi.org/10.1101/594952; this version posted November 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 20 Abstract 21 New techniques for the species-level sorting of millions of specimens are needed in 22 order to accelerate species discovery, determine how many species live on earth, 23 and develop efficient biomonitoring techniques. These sorting methods should be 24 reliable, scalable and cost-effective, as well as being largely insensitive to low-quality 25 genomic DNA, given that this is usually all that can be obtained from museum 26 specimens. Mini-barcodes seem to satisfy these criteria, but it is unclear how well 27 they perform for species-level sorting when compared to full-length barcodes. This is 28 here tested based on 20 empirical datasets covering ca. 30,000 specimens and 29 5,500 species, as well as six clade-specific datasets from GenBank covering ca. 30 98,000 specimens for over 20,000 species. All specimens in these datasets had full- 31 length barcodes and had been sorted to species-level based on morphology. Mini- 32 barcodes of different lengths and positions were obtained in silico from full-length 33 barcodes using a sliding window approach (3 windows: 100-bp, 200-bp, 300-bp) and 34 by excising nine mini-barcodes with established primers (length: 94 – 407-bp). We 35 then tested whether barcode length and/or position reduces species-level 36 congruence between morphospecies and molecular Operational Taxonomic Units 37 (mOTUs) that were obtained using three different species delimitation techniques 38 (PTP, ABGD, objective clustering). Surprisingly, we find no significant differences in 39 performance for both species- or specimen-level identification between full-length 40 and mini-barcodes as long as they are of moderate length (>200-bp). Only very short 41 mini-barcodes (<200-bp) perform poorly, especially when they are located near the 42 5’ end of the Folmer region. The mean congruence between morphospecies and 43 mOTUs is ca. 75% for barcodes >200-bp and the congruent mOTUs contain ca. 75% 44 of all specimens. Most conflict is caused by ca. 10% of the specimens that can be 2 bioRxiv preprint doi: https://doi.org/10.1101/594952; this version posted November 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 45 identified and should be targeted for re-examination in order to efficiently resolve 46 conflict. Our study suggests that large-scale species discovery, identification, and 47 metabarcoding can utilize mini-barcodes without any demonstrable loss of 48 information compared to full-length barcodes. 49 50 Keywords: DNA barcoding, mini-barcodes, species discovery, metabarcoding 3 bioRxiv preprint doi: https://doi.org/10.1101/594952; this version posted November 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 51 Introduction 52 The question of how many species live on earth has intrigued biologists for 53 centuries, but we are nowhere close to having a robust answer. We do know that 54 fewer than 2 million have been described and that there are an estimated 10-100 55 million multicellular species on the planet (Roskov et al. 2018). We also know that 56 many are currently being extirpated by the “sixth mass extinction” (Ceballos et al. 57 2015; Sánchez-Bayo and Wyckhuys 2019), with potentially catastrophic 58 consequences for the environment (Cafaro 2015). Monitoring, halting, and perhaps 59 even reversing this process is frustrated by the “taxonomic impediment”. This 60 impediment is particularly severe for “invertebrates” that collectively contribute much 61 of the animal biomass (e.g., Stork et al. 2015; Bar-On et al. 2018). Most biologists 62 thus agree that there is a pressing need for accelerating species discovery and 63 description. This very likely requires the development of new molecular sorting 64 methods because the traditional approach involving parataxonomists or highly- 65 trained taxonomic experts is either too imprecise (Krell 2004) or too slow and 66 expensive. However, any replacement method based on DNA sequences should be 67 accurate, but also (1) rapid, (2) cost-effective and (3) and largely insensitive to DNA 68 quality. These criteria are important because tackling the earth’s biodiversity will 69 likely require the processing of >500 million specimens, even under the very 70 conservative assumption that there are only 10 million species (Stork 2018) and a 71 new species is discovered with every 50 specimens processed. Cost-effectiveness is 72 similarly important because millions of species are found in countries with limited 73 funding and only basic research facilities. On the positive side, many species are 74 already represented as unsorted material in museum holdings, but such specimens 75 often yield degraded DNA (Cooper 1994). Therefore, methods that require DNA of 4 bioRxiv preprint doi: https://doi.org/10.1101/594952; this version posted November 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 76 high-quality and quantity are not likely to be suitable for large-scale species 77 discovery in invertebrates. 78 High-throughput species discovery with barcodes 79 Conceptually, species discovery and description can be broken up into three 80 steps. The first is obtaining specimens, the second, species-level sorting, and the 81 third, species identification or description. Fortunately, centuries of collecting have 82 gathered many of the specimens that are needed for large-scale species discovery. 83 Indeed, for many invertebrate groups it is likely that the museum collections contain 84 more specimens of undescribed than described species; i.e., this unsorted collection 85 material represents vast and still underutilized source for species discovery (Lister et 86 al. 2011; Kemp 2015; Yeates et al. 2016). The second step in species 87 discovery/description is species-level sorting, which is in dire need of acceleration. 88 Traditionally, it starts with sorting unsorted material into major taxa (e.g., order-level 89 in insects). This task can be accomplished by parataxonomists but may in the future 90 be guided by robots utilizing neural networks (Valan et al. 2019). In contrast, the 91 subsequent species-level sorting is usually time-limiting because the specimens for 92 many invertebrate taxa have to be dissected and slide-mounted before they can be 93 sorted to species-level by highly-skilled specialists; i.e., the traditional techniques are 94 neither rapid nor cost-effective. This impediment is likely to be responsible for why 95 certain taxa that are known to be abundant and species-rich are particularly poorly 96 studied (Bickel 1999). 97 An alternative way to sort specimens to species-level would be with DNA 98 sequences. This approach is particularly promising for metazoan species because 99 most multicellular animal species can be distinguished based on cytochrome c 5 bioRxiv preprint doi: https://doi.org/10.1101/594952; this version posted November 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 100 oxidase subunit I (cox1) barcode sequences (Hebert et al. 2003). However, such 101 sorting requires that every specimen is barcoded. This creates cost- and scalability 102 problems when the barcodes are obtained with Sanger sequencing (see Taylor and 103 Harris 2012). Such sequencing is currently still the standard in many barcoding 104 studies because the animal barcode was defined as a 658-bp long fragment of cox1 105 ("Folmer region”: Folmer et al. 1994), although sequences >500-bp with <1% 106 ambiguous bases are also considered BOLD-compliant (Barcode Of Life Data 107 System: BOLDsystems.org). The 658-bp barcode was optimized for ABI capillary 108 sequencers but it has become a burden because it is not suitable for cost-effective 109 sequencing with Illumina platforms. 110 Due to these constraints, very few studies have utilized DNA barcodes to sort 111 entire samples into putative species (but see Fagan-Jeffries et al. 2018). Instead, 112 most studies use a mixed approach where species-level sorting is carried out based 113 on morphology before a select few specimens per morphospecies are barcoded (e.g.

Mini-Barcodes Are Equally Useful for Species Identification and More Suitable

Descriptions of Six New Caribbean Fish Species in the Genus Starksia (Labrisomidae)

Biol. Eduardo Palacio Prez Estudiante De La Maestría En Ecología Y Pesquerías Universidad Veracruzana P R E S E N T E

Hotspots, Extinction Risk and Conservation Priorities of Greater Caribbean and Gulf of Mexico Marine Bony Shorefishes

Concentración Y Tiempo Máximo De Exposición De Juveniles De Pargo

Validation of the Synonymy of the Teleost Blenniid Fish Species

Fishes Collected During the 2017 Marinegeo Assessment of Kāne

Identifying Marine Key Biodiversity Areas in the Greater Caribbean Region

Fisheries Centre Research Reports 2011 Volume 19 Number 6

Scientific Articles

Sequence Data on Four Genes Suggest Nominal Gerres Filamentosus Specimens from Nayband National Park in the Persian Gulf Represent Two Distinct Species

Labrisomidae

Phylogeny and Phylogeography of a Shallow Water Fish