Identifying Insects with Incomplete DNA Barcode Libraries, A
Identifying Insects with Incomplete DNA Barcode Libraries
A pragmatic approach towards workable solutions

Massimiliano Virgilio
Royal Museum for Central Africa, Tervuren, BE

the original definition of DNA barcoding:
“molecular identification of a species based on the reference sequence with the lowest genetic distance” (Ratnasingham and Hebert 2007)

the reference libraries of the BOLD System

The Tephritid Barcode Initiative (TBI)
CBOL obtained funding from the Sloan Foundation to support a “Demonstrator System”
Steering Committee formed in April 2006, in Belgium

TBI Chair:
Bruce McPheron, Penn State

TBI Coordinators:
Allen Norrbom, USDA, USA
Marc De Meyer, RMCA, Belgium

Steering Committee Members:
Karen Armstrong, New Zealand
Norman Barr, USA
Amnon Freidberg, Israel
Ho-Yeon Han, South Korea
George Roderick, USA
Ian White, UK

What needs to be provided in BOLD for TBI
1. identification of specimen by an expert taxonomist
2. voucher specimen
3. collection information (collection date and location)
4. other information (GPS, elevation, photodocumentation): not mandatory but strongly encouraged
5. barcode: at least 500 bp with less than 1% missing data
6. trace files stored in BOLD
Other COI records (e.g., GenBank submissions) are integrated into the BOLD database but kept separate.
[Image: Euleia fratria (Trypetinae), TEPH101 (from BOLD)]

Institutions generating fruit fly barcodes
1. Penn State University, USA: Bruce McPheron, Md. Sajedul Islam
2. Lincoln University, New Zealand: Karen Armstrong
3. Royal Museum for Central Africa, BE: Marc De Meyer, Massi Virgilio
4. Yonsei University, Korea: Ho-Yeon Han
5. California Department of Agriculture, USA: Peter Kerr
6. Smithsonian National Museum of Natural History, USA: Allen Norrbom
7. APHIS-PPQ Mission lab, USA: Norman Barr
8. University of Guelph
9.
Biodiversity Institute of Ontario

methodological problems in the barcoding of museum specimens
age of specimens vs barcoding success
[Figure: % specimens amplified and % specimens sequenced (n=394), by collection period from 2007 back to < 1980; y-axis 0 to 100]

methodological problems in the barcoding of museum specimens
pinned vs EtOH preserved specimens
[Figure: % of successfully sequenced specimens (n=394), from EtOH specimens and from pinned specimens, by collection period from 2007 back to < 1980; y-axis 0 to 100]

a new set of internal primers for the barcoding of tephritids
LCO 1490 / HCO 2198: full barcode, c. 670 bp
frag. 1: 343 bp
frag. 2: 269 bp
frag. 3: 227 bp
Van Houdt JKJ, Breman FC, Virgilio M, De Meyer M (2010) Recovering full DNA barcodes from natural history collections of Tephritid fruitflies (Tephritidae, Diptera) using mini barcodes. Molecular Ecology Resources 10, 459-465.

a new set of internal primers for the barcoding of tephritids
higher performances compared to the standard primers
[Figure: % of barcodes obtained (>500 bp) with standard primers and with internal primers (n=229), for the periods >2000, 90s, 80s, <1980; gains printed on the plot: +7%, +32%, +6%, +7%]

Royal Museum for Central Africa – Royal Belgian Institute of Natural Sciences

distance- and tree-based identifications

distance-based identification: the Best Match ID criterion (BM)
The query is assigned the species name of its best-matching barcode, regardless of how similar the query and barcode sequences are.
[Diagram: unknown query -> reference database -> genetic distances]

closest matches   species   genetic distance
barcode a         sp.1      0.1%
barcode b         sp.1      2.5%
barcode c         sp.2      3.0%
...               ...       ...
ID = sp.1

distance-based identification: COMPLETE
an ideal scenario: 100% taxon coverage
inter-specific > intra-specific divergence
[Diagram: library of reference DNA barcodes listing sp. 1, sp. 2, ..., with every species represented; the unknown query (sp. 8) finds its conspecific reference: correct match!]
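The Best Match criterion described above can be sketched in a few lines. This is a minimal illustration, not the procedure used by BOLD or by the author: the function names `p_distance` and `best_match`, the toy sequences, and the use of a simple p-distance (proportion of differing sites) in place of a model-based distance are all assumptions made for the example.

```python
from typing import List, Tuple


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences (assumed stand-in distance)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def best_match(query: str, library: List[Tuple[str, str]]) -> Tuple[str, float]:
    """Return (species, distance) of the closest reference barcode.

    `library` holds (species, aligned sequence) pairs. The name of the
    closest reference is returned no matter how large the distance is.
    """
    hits = ((species, p_distance(query, seq)) for species, seq in library)
    return min(hits, key=lambda hit: hit[1])


# Hypothetical toy library: two barcodes of sp.1 and one of sp.2.
library = [
    ("sp.1", "ACGTACGTAC"),
    ("sp.1", "ACGTACGTAT"),
    ("sp.2", "ACGAACTTAT"),
]

close_id = best_match("ACGTACGTAC", library)  # identical to a sp.1 reference
far_id = best_match("GGGGGGGGGG", library)    # very distant from every reference
print(close_id, far_id)
```

Note that the distant query still receives a species name, which is exactly the weakness of Best Match when the query's species is absent from the library.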
distance-based identification: INCOMPLETE
a more realistic scenario: incomplete reference library
[Diagram: library of reference DNA barcodes in which sp. 2, sp. 5, sp. 6, sp. 8 and sp. 9 are missing; the unknown query (sp. 8) has no conspecific reference: wrong match! (misidentification)]

A RULE of THUMB
A query with a SUSPICIOUSLY LARGE GENETIC DISTANCE to its best match might be misidentified: the large distance suggests that the library might contain NO CONSPECIFIC reference sequences for that query.
[Figure: library of 602 tephritid DNA barcodes]
a DISTANCE THRESHOLD might help reduce the misidentification of unrepresented queries

the Best Close Match criterion (BCM)
1. establish a distance threshold
2. accept ID if distance query-best match < threshold
3. reject ID if distance query-best match > threshold

several distance thresholds have been proposed:
“fixed” thresholds
• BOLD is now using a 3% distance threshold (?)
• earlier barcoding studies: 2% or 3%
“relative” thresholds
• 10× threshold (Hebert et al. 2004)
• marine gastropods: 3.2× to 6.8× (Meyer and Paulay 2005)
• crustaceans (Lefébure et al. 2006)
but no universal threshold is applicable to all taxonomic groups

tree-based identification
• distance-based (NJ) trees
• Bayesian trees
• ML trees
• MP trees
• ….
[Legend: reference sequences; query]

tree-based identification: an ideal situation
[Tree diagram: reference sequences of Sp.1, Sp.2, Sp.3, Sp.4 and the query]
my query is Sp. 3 !

tree-based identification: a mislabeled / contaminated sequence
[Tree diagram: reference sequences of Sp.1, Sp.2, Sp.4 and the query]
my query is Sp. ??

tree-based identification: a species complex
[Tree diagram: reference sequences of Sp.1, Sp.2, Sp.4 and the query]
my query is Sp. ??

tree-based identification: what about node support?
[Tree diagram: reference sequences of Sp.1, Sp.2, Sp.3, Sp.4 and the query, with question marks on the nodes]
my query is Sp. ??

a practical example: ID of a medfly
my “unknown” query is a medfly from Zambia (ref.
JEMU AB31509525)

a practical example: distance-based identification
[Screenshot annotations: 2 species, one barcode; contamination / mislabeling]

a practical example: tree-based identification
my query is Sp. ??

a practical example: tree-based identification
[Screenshot annotation: contamination / mislabeling]

the state of insect reference libraries
barcodes for ≈10-15% of described species (www.boldsystems.org, April 2014)

insect DNA barcodes in BOLD (www.boldsystems.org, April 2014)
31 orders
1. Lepidoptera
2. Diptera
3. Hymenoptera
4. Coleoptera
5. Hemiptera
6. Trichoptera
7. Ephemeroptera
8. Orthoptera
9. Odonata
10. Thysanoptera
11. Psocoptera
12. Plecoptera
13. Neuroptera
14. Isoptera
15. Blattodea
16. Phthiraptera
17. Megaloptera
18. Mantodea
19. Phasmatodea
20. Siphonaptera
21. Dermaptera
22. Archaeognatha
23. Strepsiptera
24. Mecoptera
25. Embioptera
26. Raphidioptera
27. Thysanura
28. Diplura
29. Psocodea
30. Grylloblattodea
31. Mantophasmatodea
[Pie chart: Lepidoptera 36%, Diptera 28%, Hymenoptera 17%, Coleoptera 8%]
4 insect orders (≈50% of described species): 89% of barcodes

questions in (insect) DNA barcoding:
1. which ID criterion?
2. which distance threshold?
3. should I trust incomplete reference libraries?

DNA barcoding simulations
1. Coleoptera
2. Diptera
3. Hemiptera
4. Hymenoptera
5. Lepidoptera
6. Orthoptera
15,948 “true” DNA barcodes (>550 bp, complete species information)
1,995 species
each species represented by at least 2 DNA barcodes
[Figure: n. species and n. DNA barcodes per order]

DNA barcoding simulations
each reference DNA barcode in the library used as a query against all the others
[Diagram: query (from the library) -> library]
30 arbitrary distance thresholds: TP, TN, FP, FN
ID accuracy
precision
overall ID error
relative ID error

DNA barcoding possible outcomes
below the threshold (ID accepted):
(1) true positive: query = sp. A
(2) false positive: query ≠ sp. A
above the threshold (ID rejected):
(3) true negative: query ≠ sp. A
(4) false negative: query = sp.
A

DNA barcoding performances
accuracy = (TP+TN) / total number of queries
precision = TP / number of not discarded queries
overall ID error = (FP+FN) / total number of queries
relative ID error = FP / number of not discarded queries

DNA barcoding simulations: the barcoding gap
[Figure: distributions of intraspecific and interspecific K2P pairwise distances; x-axis: K2P pairwise distance (0 to 22); y-axis: % of pairwise comparisons (0 to 35)]
no barcoding gap!
27.3% of values between the 95% percentiles of intra- and interspecific distributions (2.47% < K2P < 7.64%)

DNA barcoding simulations: performances of different ID criteria
• Best Match (no threshold)
• Best Close Match (threshold)
• Neighbor Joining Tree

performance of different ID criteria in different insect orders
[Figure: proportion of correct ID for BM, BCM and NJT; y-axis 0.0 to 1.0]
BM = 0.95 ± 0.03
BCM = 0.95 ± 0.03
NJT = 0.66 ± 0.12

performance of different ID criteria in different insect orders
BM and BCM have comparable performances across insect orders, NJT doesn’t
SNK test: ID Criterion x Insect Order
BM: Coleopt. = Dipt. = Hemipt. = Hymenopt. = Lepidopt. = Orthopt.
BCM: Coleopt. = Dipt. = Hemipt. = Hymenopt. = Lepidopt. = Orthopt.
NJT: Hymenopt. = Orthopt. > Lepidopt. > Coleopt. > Hemipt. > Dipt.

performance of different ID criteria in different insect orders
“to tree or not to tree, that is the question”
“There are several drawbacks to the use of tree-building approaches to species identification.
The first relates to the use of distances to construct trees”

relationships between taxon coverage of the reference library and ID success
about the effects of taxon coverage of libraries and distance thresholds

methods
• 3 large reference libraries (Lepidoptera, Hymenoptera, Diptera), 13,914 DNA barcodes
• simulation of 100%, 75%, 50%, 25% taxon coverage in each library
• 30 arbitrary distance thresholds, from no threshold to THR K2P = 0.00
• ID performances: overall ID error = (FP + FN) / total number of queries

[Figure: overall ID error (y-axis, 0.00 to 0.40) for Lepidoptera, Hymenoptera and Diptera (columns) at 100%, 75%, 50% and 25% taxon coverage (rows); x-axis ticks from 0.00 to 0.15. Legend: TP and FP (below the threshold, ID accepted); TN and FN (above the threshold, ID rejected)]
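The simulation design described in these slides (each barcode used as a query against the others, an optional distance threshold, reduced taxon coverage, and the four performance measures) can be sketched as follows. This is a hedged illustration under stated assumptions, not the author's pipeline: all function names are hypothetical, a simple p-distance replaces the K2P distance, the records are toy data, and reduced coverage is assumed to mean that randomly chosen species are removed from the library while their barcodes are still used as queries.

```python
import random
from collections import Counter
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, str]  # (species, aligned sequence)


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites (assumed stand-in for the K2P distance)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def classify(query: Record, library: List[Record], threshold: Optional[float]) -> str:
    """Return TP, FP, TN or FN for one query; threshold=None means Best Match."""
    q_species, q_seq = query
    species, dist = min(((s, p_distance(q_seq, seq)) for s, seq in library),
                        key=lambda hit: hit[1])
    accepted = threshold is None or dist < threshold
    conspecific = species == q_species
    if accepted:
        return "TP" if conspecific else "FP"
    return "FN" if conspecific else "TN"


def performance(outcomes: Counter) -> Dict[str, float]:
    """The four measures from the slides, computed from outcome counts."""
    total = sum(outcomes.values())
    kept = outcomes["TP"] + outcomes["FP"]  # queries not discarded
    nan = float("nan")
    return {
        "accuracy": (outcomes["TP"] + outcomes["TN"]) / total,
        "precision": outcomes["TP"] / kept if kept else nan,
        "overall_id_error": (outcomes["FP"] + outcomes["FN"]) / total,
        "relative_id_error": outcomes["FP"] / kept if kept else nan,
    }


def simulate(records: List[Record], coverage: float,
             threshold: Optional[float], seed: int = 0) -> Dict[str, float]:
    """Leave-one-out identification with a library reduced to `coverage` of the species."""
    rng = random.Random(seed)
    species = sorted({s for s, _ in records})
    n_kept = max(1, round(coverage * len(species)))
    kept_species = set(rng.sample(species, n_kept))
    outcomes: Counter = Counter()
    for i, query in enumerate(records):
        # The library excludes the query itself and every species dropped by the coverage step.
        library = [r for j, r in enumerate(records) if j != i and r[0] in kept_species]
        outcomes[classify(query, library, threshold)] += 1
    return performance(outcomes)


# Hypothetical toy data: 4 species, 2 barcodes each, conspecifics differ at one site.
records = [
    ("sp.1", "AAAAAAAAAA"), ("sp.1", "AAAAAAAAAC"),
    ("sp.2", "CCCCCCCCCC"), ("sp.2", "CCCCCCCCCA"),
    ("sp.3", "GGGGGGGGGG"), ("sp.3", "GGGGGGGGGT"),
    ("sp.4", "TTTTTTTTTT"), ("sp.4", "TTTTTTTTTG"),
]

for coverage in (1.0, 0.5):
    for threshold in (None, 0.2):
        print(coverage, threshold, simulate(records, coverage, threshold))
```

In this toy setting, halving the coverage without a threshold turns every query from a removed species into a false positive, while a threshold of 0.2 rejects those queries as true negatives. Real libraries lack such a clean barcoding gap, so the threshold trades false positives against false negatives.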