Biological Identifications Through DNA Barcodes
Proc. R. Soc. Lond. B (2003) 270, 313–321. © 2003 The Royal Society. DOI 10.1098/rspb.2002.2218

Received 29 July 2002. Accepted 30 September 2002. Published online 8 January 2003.

Paul D. N. Hebert*, Alina Cywinska, Shelley L. Ball and Jeremy R. deWaard

Department of Zoology, University of Guelph, Guelph, Ontario N1G 2W1, Canada

*Author for correspondence ([email protected]).

Although much biological research depends upon species diagnoses, taxonomic expertise is collapsing. We are convinced that the sole prospect for a sustainable identification capability lies in the construction of systems that employ DNA sequences as taxon ‘barcodes’. We establish that the mitochondrial gene cytochrome c oxidase I (COI) can serve as the core of a global bioidentification system for animals. First, we demonstrate that COI profiles, derived from the low-density sampling of higher taxonomic categories, ordinarily assign newly analysed taxa to the appropriate phylum or order. Second, we demonstrate that species-level assignments can be obtained by creating comprehensive COI profiles. A model COI profile, based upon the analysis of a single individual from each of 200 closely allied species of lepidopterans, was 100% successful in correctly identifying subsequent specimens. When fully developed, a COI identification system will provide a reliable, cost-effective and accessible solution to the current problem of species identification. Its assembly will also generate important new insights into the diversification of life and the rules of molecular evolution.

Keywords: molecular taxonomy; mitochondrial DNA; animals; insects; sequence diversity; evolution

1. INTRODUCTION

The diversity of life underpins all biological studies, but it is also a harsh burden. Whereas physicists deal with a cosmos assembled from 12 fundamental particles, biologists confront a planet populated by millions of species. Their discrimination is no easy task. In fact, since few taxonomists can critically identify more than 0.01% of the estimated 10–15 million species (Hammond 1992; Hawksworth & Kalin-Arroyo 1995), a community of 15 000 taxonomists will be required, in perpetuity, to identify life if our reliance on morphological diagnosis is to be sustained. Moreover, this approach to the task of routine species identification has four significant limitations. First, both phenotypic plasticity and genetic variability in the characters employed for species recognition can lead to incorrect identifications. Second, this approach overlooks morphologically cryptic taxa, which are common in many groups (Knowlton 1993; Jarman & Elliott 2000). Third, since morphological keys are often effective only for a particular life stage or gender, many individuals cannot be identified. Finally, although modern interactive versions represent a major advance, the use of keys often demands such a high level of expertise that misdiagnoses are common.

The limitations inherent in morphology-based identification systems and the dwindling pool of taxonomists signal the need for a new approach to taxon recognition. Microgenomic identification systems, which permit life's discrimination through the analysis of a small segment of the genome, represent one extremely promising approach to the diagnosis of biological diversity. This concept has already gained broad acceptance among those working with the least morphologically tractable groups, such as viruses, bacteria and protists (Nanney 1982; Pace 1997; Allander et al. 2001; Hamels et al. 2001). However, the problems inherent in morphological taxonomy are general enough to merit the extension of this approach to all life. In fact, there are a growing number of cases in which DNA-based identification systems have been applied to higher organisms (Brown et al. 1999; Bucklin et al. 1999; Trewick 2000; Vincent et al. 2000).

Genomic approaches to taxon diagnosis exploit diversity among DNA sequences to identify organisms (Kurtzman 1994; Wilson 1995). In a very real sense, these sequences can be viewed as genetic ‘barcodes’ that are embedded in every cell. When one considers the discrimination of life's diversity from a combinatorial perspective, it is a modest problem. The Universal Product Codes, used to identify retail products, employ 10 alternate numerals at 11 positions to generate 100 billion unique identifiers. Genomic barcodes have only four alternate nucleotides at each position, but the string of sites available for inspection is huge. The survey of just 15 of these nucleotide positions creates the possibility of $4^{15}$ (1 billion) codes, 100 times the number that would be required to discriminate life if each taxon was uniquely branded. However, the survey of nucleotide diversity needs to be more comprehensive because functional constraints hold some nucleotide positions constant and intraspecific diversity exists at others. The impact of functional constraints can be reduced by focusing on a protein-coding gene, given that most shifts at the third nucleotide position of codons are weakly constrained by selection because of their four-fold degeneracy. Hence, by examining any stretch of 45 nucleotides, one gains access to 15 sites weakly affected by selection and, therefore, 1 billion possible identification labels. In practice, there is no need to constrain analysis to such short stretches of DNA because sequence information is easily obtained for DNA fragments hundreds of base pairs (bp) long. This ability to inspect longer sequences is significant, given two other biological considerations. First, nucleotide composition at third-position sites is often strongly biased (A–T in arthropods, C–G in chordates), reducing information content. However, even if the A–T or C–G proportion reached 1, the inspection of just 90 bp would recover the prospect of 1 billion alternatives ($2^{30} = 4^{15}$). The second constraint derives from the limited use of this potential information capacity, since most nucleotide positions are constant in comparisons of closely related species. However, given a modest rate (e.g. 2% per Myr) of sequence change, one expects 12 diagnostic nucleotide differences in a 600 bp comparison of species with just a million year history of reproductive isolation. As both the fossil record and prior molecular analyses suggest that most species persist for millions of years, the likelihood of taxon diagnosis is high. However, there is no simple formula that can predict the length of sequence that must be analysed to ensure species diagnosis, because rates of molecular evolution vary between different segments of the genome and across taxa. Obviously, the analysis of rapidly evolving gene regions or taxa will aid the diagnosis of lineages with brief histories of reproductive isolation, while the reverse will be true for rate-decelerated genes or species.

Although there has never been an effort to implement a microgenomic identification system on a large scale, enough work has been done to indicate key design elements. It is clear that the mitochondrial genome of animals is a better target for analysis than the nuclear genome because of its lack of introns, its limited exposure to recombination and its haploid mode of inheritance (Saccone et al. 1999). Robust primers also enable the routine recovery of specific segments of the mitochondrial genome (Folmer et al. 1994; Simmons & Weller 2001). Past phylogenetic work has often focused on mitochondrial genes encoding ribosomal (12S, 16S) DNA, but their use in broad taxonomic analyses is constrained by the prevalence of insertions and deletions (indels) that greatly complicate sequence alignments (Doyle & Gaut 2000). The 13 protein-coding genes in the animal mitochondrial genome are better targets because indels are rare since most lead …

… acid substitutions, it may be possible to assign any unidentified organism to a higher taxonomic group (e.g. phylum, order), before examining nucleotide substitutions to determine its species identity.

This study evaluates the potential of COI as a taxonomic tool. We first created a COI ‘profile’ for the seven most diverse animal phyla, based on the analysis of 100 representative species, and subsequently showed that this baseline information assigned 96% of newly analysed taxa to their proper phylum. We then examined the class Hexapoda, selected because it represents the greatest concentration of biodiversity on the planet (Novotny et al. 2002). We created a COI ‘profile’ for eight of the most diverse orders of insects, based on a single representative from each of 100 different families, and showed that this ‘profile’ assigned each of 50 newly analysed taxa to its correct order. Finally, we tested the ability of COI sequences to correctly identify species of lepidopterans, a group targeted for analysis because sequence divergences are low among families in this order. As such, the lepidopterans provide a challenging case for species diagnosis, especially since this is one of the most speciose orders of insects. This test, which involved creating a COI ‘profile’ for 200 closely allied species and subsequently using it to assign 150 newly analysed individuals to species, was 100% successful in identification.

2. MATERIAL AND METHODS

(a) Sequences

Approximately one-quarter of the COI sequences (172 out of 655) used in this study were obtained from GenBank. The rest were obtained by preparing a 30 µl total DNA extract from small tissue samples using the Isoquick (Orca Research Inc, Bothell, WA, 1997) protocol. The primer pair LCO1490 (5′-GGTCAACAAATCATAAAGATATTGG-3′) and HCO2198 (5′-TAAACTTCAGGGTGACCAAAAAATCA-3′) was subsequently used to amplify a 658 bp fragment of the COI gene (Folmer et al. 1994). Each PCR contained 5 µl of 10× PCR buffer, pH 8.3 (10 mM of Tris–HCl, pH 8.3; 1.5 mM of MgCl2;