Received 29 July 2002. Accepted 30 September 2002. Published online 8 January 2003.

Biological identifications through DNA barcodes

Paul D. N. Hebert*, Alina Cywinska, Shelley L. Ball and Jeremy R. deWaard

Department of Zoology, University of Guelph, Guelph, Ontario N1G 2W1, Canada

Although much biological research depends upon species diagnoses, taxonomic expertise is collapsing. We are convinced that the sole prospect for a sustainable identification capability lies in the construction of systems that employ DNA sequences as taxon 'barcodes'. We establish that the mitochondrial gene cytochrome c oxidase I (COI) can serve as the core of a global bioidentification system for animals. First, we demonstrate that COI profiles, derived from the low-density sampling of higher taxonomic categories, ordinarily assign newly analysed taxa to the appropriate phylum or order. Second, we demonstrate that species-level assignments can be obtained by creating comprehensive COI profiles. A model COI profile, based upon the analysis of a single individual from each of 200 closely allied species of lepidopterans, was 100% successful in correctly identifying subsequent specimens. When fully developed, a COI identification system will provide a reliable, cost-effective and accessible solution to the current problem of species identification. Its assembly will also generate important new insights into the diversification of life and the rules of molecular evolution.

Keywords: molecular taxonomy; mitochondrial DNA; animals; insects; sequence diversity; evolution

*Author for correspondence ([email protected]).

1. INTRODUCTION

The diversity of life underpins all biological studies, but it is also a harsh burden. Whereas physicists deal with a cosmos assembled from 12 fundamental particles, biologists confront a planet populated by millions of species. Their discrimination is no easy task. In fact, since few taxonomists can critically identify more than 0.01% of the estimated 10–15 million species (Hammond 1992; Hawksworth & Kalin-Arroyo 1995), a community of 15 000 taxonomists will be required, in perpetuity, to identify life if our reliance on morphological diagnosis is to be sustained. Moreover, this approach to the task of routine species identification has four significant limitations. First, both phenotypic plasticity and genetic variability in the characters employed for species recognition can lead to incorrect identifications. Second, this approach overlooks morphologically cryptic taxa, which are common in many groups (Knowlton 1993; Jarman & Elliott 2000). Third, since morphological keys are often effective only for a particular life stage or gender, many individuals cannot be identified. Finally, although modern interactive versions represent a major advance, the use of keys often demands such a high level of expertise that misdiagnoses are common.

The limitations inherent in morphology-based identification systems and the dwindling pool of taxonomists signal the need for a new approach to taxon recognition. Microgenomic identification systems, which permit life's discrimination through the analysis of a small segment of the genome, represent one extremely promising approach to the diagnosis of biological diversity. This concept has already gained broad acceptance among those working with the least morphologically tractable groups, such as viruses, bacteria and protists (Nanney 1982; Pace 1997; Allander et al. 2001; Hamels et al. 2001). However, the problems inherent in morphological taxonomy are general enough to merit the extension of this approach to all life. In fact, there are a growing number of cases in which DNA-based identification systems have been applied to higher organisms (Brown et al. 1999; Bucklin et al. 1999; Trewick 2000; Vincent et al. 2000).

Genomic approaches to taxon diagnosis exploit diversity among DNA sequences to identify organisms (Kurtzman 1994; Wilson 1995). In a very real sense, these sequences can be viewed as genetic 'barcodes' that are embedded in every cell. When one considers the discrimination of life's diversity from a combinatorial perspective, it is a modest problem. The Universal Product Codes, used to identify retail products, employ 10 alternate numerals at 11 positions to generate 100 billion unique identifiers. Genomic barcodes have only four alternate nucleotides at each position, but the string of sites available for inspection is huge. The survey of just 15 of these nucleotide positions creates the possibility of $4^{15}$ (1 billion) codes, 100 times the number that would be required to discriminate life if each taxon was uniquely branded. However, the survey of nucleotide diversity needs to be more comprehensive because functional constraints hold some nucleotide positions constant and intraspecific diversity exists at others. The impact of functional constraints can be reduced by focusing on a protein-coding gene, given that most shifts at the third nucleotide position of codons are weakly constrained by selection because of their four-fold degeneracy. Hence, by examining any stretch of 45 nucleotides, one gains access to 15 sites weakly affected by selection and, therefore, 1 billion possible identification labels. In practice, there is no need to constrain analysis to such short stretches of DNA because sequence information is easily obtained for DNA fragments hundreds of base pairs (bp) long. This ability to inspect longer sequences is significant, given two other biological considerations. First, nucleotide composition at third-position sites is often

Proc. R. Soc. Lond. B (2003) 270, 313–321. © 2003 The Royal Society. DOI 10.1098/rspb.2002.2218

strongly biased (A–T in arthropods, C–G in chordates), reducing information content. However, even if the A–T or C–G proportion reached 1, the inspection of just 90 bp would recover the prospect of 1 billion alternatives ($2^{30} = 4^{15}$). The second constraint derives from the limited use of this potential information capacity, since most nucleotide positions are constant in comparisons of closely related species. However, given a modest rate (e.g. 2% per Myr) of sequence change, one expects 12 diagnostic nucleotide differences in a 600 bp comparison of species with just a million year history of reproductive isolation. As both the fossil record and prior molecular analyses suggest that most species persist for millions of years, the likelihood of taxon diagnosis is high. However, there is no simple formula that can predict the length of sequence that must be analysed to ensure species diagnosis, because rates of molecular evolution vary between different segments of the genome and across taxa. Obviously, the analysis of rapidly evolving gene regions or taxa will aid the diagnosis of lineages with brief histories of reproductive isolation, while the reverse will be true for rate-decelerated genes or species.

Although there has never been an effort to implement a microgenomic identification system on a large scale, enough work has been done to indicate key design elements. It is clear that the mitochondrial genome of animals is a better target for analysis than the nuclear genome because of its lack of introns, its limited exposure to recombination and its haploid mode of inheritance (Saccone et al. 1999). Robust primers also enable the routine recovery of specific segments of the mitochondrial genome (Folmer et al. 1994; Simmons & Weller 2001). Past phylogenetic work has often focused on mitochondrial genes encoding ribosomal (12S, 16S) DNA, but their use in broad taxonomic analyses is constrained by the prevalence of insertions and deletions (indels) that greatly complicate sequence alignments (Doyle & Gaut 2000). The 13 protein-coding genes in the animal mitochondrial genome are better targets because indels are rare since most lead to a shift in the reading frame. There is no compelling a priori reason to focus analysis on a specific gene, but the cytochrome c oxidase I gene (COI) does have two important advantages. First, the universal primers for this gene are very robust, enabling recovery of its 5′ end from representatives of most, if not all, animal phyla (Folmer et al. 1994; Zhang & Hewitt 1997). Second, COI appears to possess a greater range of phylogenetic signal than any other mitochondrial gene. In common with other protein-coding genes, its third-position nucleotides show a high incidence of base substitutions, leading to a rate of molecular evolution that is about three times greater than that of 12S or 16S rDNA (Knowlton & Weigt 1998). In fact, the evolution of this gene is rapid enough to allow the discrimination of not only closely allied species, but also phylogeographic groups within a single species (Cox & Hebert 2001; Wares & Cunningham 2001). Although COI may be matched by other mitochondrial genes in resolving such cases of recent divergence, this gene is more likely to provide deeper phylogenetic insights than alternatives such as cytochrome b (Simmons & Weller 2001) because changes in its amino-acid sequence occur more slowly than those in this, or any other, mitochondrial gene (Lynch & Jarrell 1993). As a result, by examining amino-acid substitutions, it may be possible to assign any unidentified organism to a higher taxonomic group (e.g. phylum, order), before examining nucleotide substitutions to determine its species identity.

This study evaluates the potential of COI as a taxonomic tool. We first created a COI 'profile' for the seven most diverse phyla, based on the analysis of 100 representative species, and subsequently showed that this baseline information assigned 96% of newly analysed taxa to their proper phylum. We then examined the class Hexapoda, selected because it represents the greatest concentration of biodiversity on the planet (Novotny et al. 2002). We created a COI 'profile' for eight of the most diverse orders of insects, based on a single representative from each of 100 different families, and showed that this 'profile' assigned each of 50 newly analysed taxa to its correct order. Finally, we tested the ability of COI sequences to correctly identify species of lepidopterans, a group targeted for analysis because sequence divergences are low among families in this order. As such, the lepidopterans provide a challenging case for species diagnosis, especially since this is one of the most speciose orders of insects. This test, which involved creating a COI 'profile' for 200 closely allied species and subsequently using it to assign 150 newly analysed individuals to species, was 100% successful in identification.

2. MATERIAL AND METHODS

(a) Sequences

Approximately one-quarter of the COI sequences (172 out of 655) used in this study were obtained from GenBank. The rest were obtained by preparing a 30 µl total DNA extract from small tissue samples using the Isoquick (Orca Research Inc, Bothell, WA, 1997) protocol. The primer pair LCO1490 (5′-GGTCAACAAATCATAAAGATATTGG-3′) and HCO2198 (5′-TAAACTTCAGGGTGACCAAAAAATCA-3′) was subsequently used to amplify a 658 bp fragment of the COI gene (Folmer et al. 1994). Each PCR contained 5 µl of 10× PCR buffer, pH 8.3 (10 mM of Tris–HCl, pH 8.3; 1.5 mM of MgCl2; and 50 mM of KCl; 0.01% NP-40), 35 µl of distilled water, 200 µM of each dNTP, 1 unit of Taq polymerase, 0.3 µM of each primer and 1–4 µl of DNA template. The PCR thermal regime consisted of one cycle of 1 min at 94 °C; five cycles of 1 min at 94 °C, 1.5 min at 45 °C and 1.5 min at 72 °C; 35 cycles of 1 min at 94 °C, 1.5 min at 50 °C and 1 min at 72 °C and a final cycle of 5 min at 72 °C. Each PCR product was subsequently gel purified using the Qiaex II kit (Qiagen) and sequenced in one direction on an ABI 377 automated sequencer (Applied Biosystems) using the Big Dye v. 3 sequencing kit. All sequences obtained in this study have been submitted to GenBank; their accession numbers are provided in Electronic Appendices A–C, available on The Royal Society's Publications Web site.

(b) COI profiles

We created three COI profiles: one for the seven dominant phyla of animals, another for eight of the largest orders of insects and the last for 200 closely allied species of lepidopterans. These profiles were designed to provide an overview of COI diversity within each taxonomic assemblage and were subsequently used as the basis for identifications to the phylum, ordinal or species level by determining the sequence congruence between each 'unknown' taxon and the species included in a particular profile.
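The idea of matching an 'unknown' taxon to a profile can be illustrated with a deliberately simplified sketch. It assigns a query to the group of the profile sequence with the smallest uncorrected p-distance. This direct nearest-match lookup is a stand-in assumed here for illustration, not the tree-based procedure used in the study, and the sequences, group labels and names (`PROFILE`, `assign`, `p_distance`) are invented.

```python
"""Toy nearest-match lookup against a COI 'profile' (illustrative only)."""

VALID = set("ACGT")


def p_distance(seq_a, seq_b):
    """Proportion of differing sites among positions where both bases are A/C/G/T."""
    compared = 0
    differences = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x in VALID and y in VALID:  # ambiguous bases and gaps are skipped
            compared += 1
            if x != y:
                differences += 1
    if compared == 0:
        raise ValueError("sequences share no comparable sites")
    return differences / compared


def assign(query, profile):
    """Return (group, reference name, distance) for the closest profile entry."""
    scored = [(p_distance(query, seq), group, name) for group, name, seq in profile]
    distance, group, name = min(scored)
    return group, name, distance


# Hypothetical aligned fragments; a real profile holds hundreds of base pairs
# per taxon and many taxa per group.
PROFILE = [
    ("Lepidoptera", "ref_1", "ATTGGAACTTTATATTTTAT"),
    ("Diptera", "ref_2", "ATTGGTACATCCTTAAGAAT"),
    ("Coleoptera", "ref_3", "GTTGGCACCTCACTTAGACT"),
]

if __name__ == "__main__":
    print(assign("ATTGGAACTTTATACTTTAT", PROFILE))
```

The query differs from `ref_1` at a single site, so it is assigned to that entry's group. The size of the returned distance is worth inspecting as well, since a large value signals a weak match even when an assignment is made.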

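The information-capacity figures quoted in the Introduction can be checked with a few lines of arithmetic. This is only a worked check of numbers stated in the text. The function names are illustrative, the species count of 10 million is the lower bound of the cited estimate, and the 2% per Myr rate is read here as the divergence accumulated between two sequences.

```python
"""Worked check of the combinatorial figures quoted in the Introduction."""


def n_codes(states, positions):
    """Number of distinct labels given `states` alternatives at each position."""
    return states ** positions


def third_positions(length_bp):
    """Third-codon-position sites in a protein-coding stretch of length_bp."""
    return length_bp // 3


def expected_differences(rate_per_myr, length_bp, myr):
    """Expected number of differing sites after `myr` million years."""
    return rate_per_myr * length_bp * myr


upc = n_codes(10, 11)                      # 10 numerals at 11 positions
barcode = n_codes(4, third_positions(45))  # 4 nucleotides at 15 third positions
biased = n_codes(2, third_positions(90))   # 2 effective states at 30 positions

if __name__ == "__main__":
    print(upc)                  # 100 billion product codes
    print(barcode)              # 1 073 741 824, about 1 billion
    print(biased == barcode)    # 2**30 equals 4**15
    print(barcode / 10_000_000) # roughly 100 codes per species
    print(expected_differences(0.02, 600, 1))
```

The last line reproduces the expectation of about 12 diagnostic differences in a 600 bp comparison after one million years of isolation.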

The phylum profile included 100 COI sequences, all obtained from GenBank (see electronic Appendix A available on The Royal Society's Publications Web site). To ensure broad taxonomic coverage, each sequence was derived from a different family and representatives were included from all available classes. Ten sequences were obtained for each of the five phyla (Annelida, Chordata, Echinodermata, Nematoda, Platyhelminthes) that include 5000–50 000 species, while 25 sequences were collected for each of the phyla (Arthropoda, Mollusca) with more than 100 000 species.

The ordinal profile was created by obtaining a COI sequence from a single representative of each of 100 insect families (see electronic Appendix B available on The Royal Society's Publications Web site). The four most diverse orders (more than 100 000 described species) of insects were selected for analysis, together with four additional orders chosen randomly from among the 15 insect orders (Gaston & Hudson 1994) with medium diversity (1000–15 000 described species). Between ten and 25 families were examined for each of the four most diverse orders (Coleoptera, Diptera, Hymenoptera, Lepidoptera), while four to ten families were examined for the other orders (Blattaria, Ephemeroptera, Orthoptera, Plecoptera).

The species profile was based upon COI data for a single individual from each of the 200 commonest lepidopteran species from a site near Guelph, Ontario, Canada (electronic Appendix C available on The Royal Society's Publications Web site). This profile examined members of just three allied superfamilies (Geometroidea, Noctuoidea, Sphingoidea) to determine the lower limits of COI divergence in an assemblage of closely related species. The Noctuoidea included members of three families (Arctiidae, Noctuidae, Notodontidae) while the others included representatives from just a single family each (Geometridae and Sphingidae, respectively).

(c) Test taxa

Additional sequences were collected to test the ability of each profile to assign newly analysed species to a taxonomic category (electronic Appendices A–C). COI sequences were obtained from 55 'test' taxa to assess the success of the phylum profile in assigning newly analysed species to a phylum. These 'test' taxa included five representatives from each of the five 'small' phyla, and 15 representatives from both the Mollusca and the Arthropoda. When possible, the 'test' taxa belonged to families that were not included in the phylum profile. A similar approach was employed to test the ability of the ordinal profile to classify newly analysed insects to an order. Fifty new taxa were examined, including between one and five representatives from each small order and five to ten representatives from each of the four large orders. When possible, the 'test' taxa belonged to families or genera that were not included in the ordinal profile. A test of the species profile required a slightly different approach, as identifications were only possible for species represented in it. As a result, sequences were obtained from another 150 individuals belonging to the species included in this profile.

(d) Data analysis

Sequences were aligned in the SeqApp 1.9 sequence editor. They were subsequently reduced to 669 bp for the phylum analysis, 624 bp for the ordinal analysis and 617 bp for the species-level analysis. Analyses at the ordinal and phylum levels examined amino-acid divergences, using Poisson corrected p-distances to reduce the impacts of homoplasy. For the species-level analysis, nucleotide-sequence divergences were calculated using the Kimura-two-parameter (K2P) model, the best metric when distances are low (Nei & Kumar 2000) as in this study. Neighbour-joining (NJ) analysis, implemented in MEGA2.1 (Kumar et al. 2001), was employed to both examine relationships among taxa in the profiles and for the subsequent classification of 'test' taxa because of its strong track record in the analysis of large species assemblages (Kumar & Gadagkar 2000). This approach has the additional advantage of generating results much more quickly than alternatives. The NJ profiles for both the orders and the phyla possessed 100 terminal nodes, each representing a species from a different family, while the species NJ profile had 200 nodes, each representing a different lepidopteran species. A member of the primitive insect order Thysanura (family Lepismatidae) was used as the outgroup for the ordinal profile, while single members of three primitive lepidopteran families were employed as the outgroup for the species profile.

Each of the three (phylum, order, species) NJ profiles was subsequently used as a classification engine, by re-running the analysis with the repeated addition of a single 'test' taxon to the dataset. Following each analysis, the 'test' species was assigned membership of the same taxonomic group as its nearest neighbouring node. For example, in the ordinal analysis, a 'test' taxon was identified as a member of the order Lepidoptera if it grouped most closely with any of the 24 lepidopteran families included in the profile. The success of classification was quantified for both the phylum and the ordinal analyses by determining the proportion of 'test' taxa assigned to the proper phylum/order. In the case of species, a stricter criterion was employed. A 'test' taxon was recognized as being correctly identified only if its sequence grouped most closely with the single representative of its species in the profile.

Multidimensional scaling (MDS), implemented in Systat 8.0, was employed to provide a graphical summary of the species-level results because of the very large number of taxa. MDS explores similarity relationships in Euclidean space and has the advantage of permitting genetically intermediate taxa to remain spatially intermediate, rather than forcing them to cluster into a pseudogroup as in hierarchical methods (Lessa 1990). In the present case, a similarity matrix was constructed by treating every position in the alignment as a separate character and ambiguous nucleotides as missing characters. The sequence information was coded using dummy variables (A = 1, G = 2, C = 3, T = 4). However, as noted earlier, a NJ profile was also constructed for the lepidopteran sequences using K2P distances and this is provided in electronic Appendix D available on The Royal Society's Publications Web site.

3. RESULTS

(a) Taxon profiles

Each of the 100 species included in the phylum and ordinal profiles possessed a different amino-acid sequence at COI. The phylum profile showed good resolution of the major taxonomic groups (figure 1). Monophyletic assemblages were recovered for three phyla (Annelida, Echinodermata, Platyhelminthes) and the chordate lineages formed a cohesive group. Members of the Nematoda were separated into three groups, but each corresponded to one of the three subclasses that comprise this phylum. Twenty-three out of the 25 arthropods formed a monophyletic group, but the sole representatives of two crustacean classes (Cephalocarida, Maxillopoda) fell outside this group. Twelve out of the 25 molluscan lineages formed a monophyletic assemblage allied to the annelids, but the others were separated into groups that showed marked genetic divergence. One group consisted solely of cephalopods, a second was largely pulmonates and the rest were bivalves. It is worth emphasizing that these outlying COI sequences always showed considerable amino-acid divergence from sequences possessed by other taxonomic groups. As such, the rate acceleration in these lineages generated novel COI amino-acid sequences rather than secondary convergence on the amino-acid arrays of other groups.

[Figure 1: neighbour-joining tree with terminal labels AR 1–25, CH 1–10, EC 1–10, ML 1–25, AN 1–10, NE 1–10 and PL 1–10, bracketed as Arthropoda, Chordata, Echinodermata, Mollusca (Cephalopoda), Annelida, Mollusca, Mollusca (Pulmonata), Nematoda (Rhabditoidea), Mollusca (Bivalvia), Nematoda (Spirurida) and Platyhelminthes.]

Figure 1. NJ analysis of Poisson corrected p-distances based on the analysis of 223 amino acids of the COI gene in 100 taxa belonging to seven animal phyla. The taxa in the grey boxes represent outliers. AR24 and AR25 are the sole representatives of the arthropod classes Cephalocarida and Maxillopoda, respectively. ML21 is a member of the molluscan class Bivalvia, while NE8 is the sole member of the nematode subclass Enoplia. Scale bar, 0.1.

The ordinal profile showed high cohesion of taxonomic groups with seven out of the eight orders forming monophyletic assemblages (figure 2). The sole exception was the Coleoptera whose members were partitioned into three groups. Two of these groups included 21 families belonging to the very diverse suborder Polyphaga, while the other group included four families belonging to the


suborder Adephaga. Among the four major orders, species of Diptera and Lepidoptera showed much less variation in their amino-acid sequences than did the Hymenoptera, while the Coleoptera showed an intermediate level of divergence (table 1).

Table 1. Mean Poisson corrected p-distances (d) for 208 amino acids of COI in 100 insect families belonging to eight orders. (n indicates the number of families analysed in each order. G/C content is also reported.)

order            n    d      s.e.   G/C (%)
Hymenoptera      11   0.320  0.028  27.7
Coleoptera       25   0.125  0.015  35.6
Orthoptera       5    0.119  0.019  35.4
Blattaria        4    0.076  0.014  35.7
Diptera          15   0.055  0.011  34.1
Lepidoptera      24   0.054  0.009  31.0
Ephemeroptera    10   0.036  0.008  40.5
Plecoptera       6    0.031  0.008  39.5

[Figure 2: neighbour-joining tree of 100 insect families, bracketed as Lepidoptera, Coleoptera (Polyphaga), Hymenoptera, Coleoptera (Polyphaga), Coleoptera (Adephaga), Diptera, Orthoptera, Plecoptera, Blattaria and Ephemeroptera, with Thysanura as the outgroup.]

Figure 2. NJ analysis of Poisson corrected p-distances based on the analysis of 208 amino acids in 100 taxa belonging to eight insect orders. Scale bar, 0.05.

[Figure 3: scatter plot; axes: dimension 1 and dimension 2, each running from −2 to 2.]

Figure 3. Multidimensional scaling of Euclidean distances among the COI genes from 200 lepidopteran species belonging to three superfamilies: Geometroidea (stars); Sphingoidea (triangles) and Noctuoidea (circles).

Each of the 200 lepidopterans included in the species profile possessed a distinct COI sequence. Moreover, a MDS plot showed that species belonging to each of the three superfamilies fell into distinct clusters (figure 3), signalling their genetic divergence. A detailed inspection of the NJ tree (electronic Appendix D) revealed further evidence for the clustering of taxonomically allied species. For example, 23 genera were represented by two species, and these formed monophyletic pairs in 18 cases. Similarly, five out of the six genera represented by three species formed monophyletic assemblages.
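The species-level comparisons rest on Kimura two-parameter (K2P) distances. A minimal sketch of the standard K2P formula, $d = -\tfrac{1}{2}\ln\left[(1 - 2P - Q)\sqrt{1 - 2Q}\right]$, where $P$ and $Q$ are the proportions of sites showing transitions and transversions, is given below. The function name and the choice to skip ambiguous sites are assumptions made for illustration; the study itself computed its distances in MEGA2.1.

```python
"""Kimura two-parameter (K2P) distance between two aligned DNA sequences."""
from math import log, sqrt

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
VALID = PURINES | PYRIMIDINES


def k2p_distance(seq_a, seq_b):
    """Return the K2P distance, ignoring sites with gaps or ambiguity codes."""
    sites = 0
    transitions = 0
    transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in VALID or y not in VALID:
            continue
        sites += 1
        if x == y:
            continue
        # A<->G and C<->T are transitions; every other change is a transversion.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("sequences share no comparable sites")
    p = transitions / sites
    q = transversions / sites
    w1 = 1 - 2 * p - q
    w2 = 1 - 2 * q
    if w1 <= 0 or w2 <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * log(w1 * sqrt(w2))


if __name__ == "__main__":
    reference = "A" * 100
    one_transition = "G" + "A" * 99
    # Multiply by 100 to express the distance as a percentage divergence.
    print(100 * k2p_distance(reference, one_transition))
```

For closely related sequences the corrected value stays very close to the raw proportion of differing sites, which is why the correction matters little at the low divergences seen within species.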

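The amino-acid divergences in table 1 and figures 1 and 2 are Poisson corrected p-distances. Assuming the usual form of that correction, $d = -\ln(1 - p)$ with $p$ the proportion of differing amino-acid sites, a short sketch follows. The function names, the toy peptide strings and the handling of missing residues are illustrative assumptions rather than details taken from the study.

```python
"""Poisson-corrected p-distance between two aligned amino-acid sequences."""
from math import exp, log

MISSING = "-X?"  # gap, unknown residue, missing data


def poisson_distance(prot_a, prot_b):
    """Return d = -ln(1 - p), where p is the proportion of differing sites."""
    compared = 0
    differences = 0
    for x, y in zip(prot_a.upper(), prot_b.upper()):
        if x in MISSING or y in MISSING:
            continue
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("sequences share no comparable sites")
    p = differences / compared
    if p >= 1:
        raise ValueError("all sites differ; the correction is undefined")
    return -log(1 - p)


def p_from_distance(d):
    """Invert the correction: proportion of differing sites implied by d."""
    return 1 - exp(-d)


if __name__ == "__main__":
    # Two invented 10-residue peptides differing at two sites (p = 0.2).
    print(poisson_distance("MKTAYIAKQR", "MRTAYIGKQR"))
    # The largest mean distance in table 1 (0.320) implies p of about 0.27.
    print(p_from_distance(0.320))
```

Read this way, the mean distance of 0.320 reported for Hymenoptera corresponds to roughly 27% of amino-acid sites differing between families, against about 5% for Diptera and Lepidoptera.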

(b) Testing taxonomic assignments

Fifty-three out of the 55 'test' species (96.4%) were assigned to the correct phylum in the analyses at this level (table 2). The exceptions were a polychaete annelid that grouped most closely with a mollusc and a bivalve that grouped with one of the outliers. However, in both cases, there was substantial sequence divergence (13% and 25%, respectively) between the test taxon and the lineage in the profile that was most similar to it. Identification success at the ordinal level was 100% as all 50 insect species were assigned to the correct order. Moreover, when a 'test' species belonged to a family represented in the ordinal profile, it typically grouped most closely with it. Identification success at the species level was also 100%, as each of the 150 'test' individuals clustered most closely with its conspecific in the profile. The sequences in the species profile were subsequently merged with those from the 'test' taxa to allow a more detailed examination of the factors enabling successful classification (electronic Appendix E available on The Royal Society's Publications Web site). MDS analysis (figure 4) showed that 'test' taxa were always either genetically identical to or most closely associated with their conspecific in the profile. Examination of the genetic distance matrix quantified this fact, showing that divergences between conspecific individuals were always small, the family values averaging 0.25% (table 3). By contrast, sequence divergences between species were much greater, averaging 6.8% for congeneric taxa and higher for more distantly related taxa (table 3). A few species pairs showed lower values, but only four out of the 19 900 pairwise comparisons showed divergences that were less than 3%. Figure 4a shows one of these cases, involving two species of Hypoprepia, but, even in these situations, there were no shared sequences between taxa.

Table 2. Percentage success in classifying species to membership of a particular taxonomic group based upon sequence variation at COI. (n indicates the number of taxa that were classified using each taxon 'profile'.)

taxon               target group   n     % success
kingdom Animalia    7 phyla        55    96.4
class Hexapoda      8 orders       50    100
order Lepidoptera   200 species    150   100

4. DISCUSSION

This study establishes the feasibility of developing a COI-based identification system for animals-at-large. PCR products were recovered from all species and there was no evidence of the nuclear pseudogenes that have complicated some studies employing degenerate COI primers (Williams & Knowlton 2001). Moreover, the alignment of COI sequences was straightforward, as indels were uncommon, reinforcing the results of earlier work showing the rarity of indels in this gene (Mardulyn & Whitfield 1999). Aside from their ease of acquisition and alignment, the COI sequences possessed, as expected, a high level of diversity.

We demonstrated that differences in COI amino-acid sequences were sufficient to enable the reliable assignment of organisms to higher taxonomic categories. It is worth emphasizing that most newly analysed taxa were placed in the correct order or phylum despite the fact that our profiles were based on a tiny fraction of the member species. For example, our ordinal profile, which was based on just 0.002% of the total species in these orders, led to 100% identification success. The two misidentifications at the phylum level were undoubtedly a consequence of the limited size and diversity of our phylum profile. The misplaced polychaete belonged to an order that was not in the profile, while the misidentified mollusc belonged to a subclass that was represented in the profile by just a single species. Such misidentifications would not occur in profiles that more thoroughly surveyed COI diversity among members of the target assemblage. The general success of COI in recognizing relationships among taxa in these cases is important because it signals that character convergence and horizontal gene transfer (i.e. via retroviruses) have not disrupted the recovery of expected taxon affinities. Moreover, it establishes that the information content of COI is sufficient to enable the placement of organisms in the deepest taxonomic ranks.

The gold standard for any taxonomic system is its ability to deliver accurate species identifications. Our COI species profile was 100% successful in identifying lepidopteran species, and we expect similar results in other groups, since the Lepidoptera are one of the most taxonomically diverse orders of animals and they show low sequence divergences. There is also reason to expect successful diagnosis at other locales, as the species richness of lepidopterans at the study site exceeds that which will be encountered in regional surveys of most animal orders. Higher diversities will be encountered for some orders in tropical settings (Godfray et al. 1999), but COI diagnoses should not fail at these sites unless species are unusually young.

COI-based identification systems can also aid the initial delineation of species. For example, inspection of the genetic distance matrix for lepidopterans indicated that divergence values between species are ordinarily greater than 3%. In fact, when this value was employed as a threshold for species diagnosis, it led to the recognition of 196 out of the 200 (98%) species recognized through prior morphological study. The exceptions were four congeneric species pairs that were genetically distinct but showed low (0.6–2.0%) divergences, suggesting their recent origin. The general ease of species diagnosis reveals one of the great values of a DNA-based approach to identification. Newly encountered species will ordinarily signal their presence by their genetic divergence from known members of the assemblage.

The prospect of using a standard COI threshold to guide species diagnosis in situations where prior taxonomic work has been limited is appealing. It is, however, important to validate this approach by determining the thresholds that distinguish species in other geographical regions and taxonomic groups. Thresholds will particularly need to be established for groups with differences in traits, such as generation length or dispersal regime, that are likely to alter rates of molecular evolution or the extent of population subdivision. However, differences in


2 a) 2 b)

Furcula cinerea Grammia virguncula borealis 1 Hypoprepia fucosa Spilosoma congrua 1 Grammia virgo Furcula modesta Hypoprepia miniata Spilosoma virginica basitriens Grammia arge Clostera albosigma 2

2 Datana ministra Notodonta simplaria

n n o o i

i Gluphisia lintneri

Haploa confusa s

s Nadata gibbosa n n 0 0 Clostera apicalis Symmerista leucitys e e Halysidota tessellaris Hyphantria cunea m m i i d d Euchaetes egle Pyrrharctia isabella Ellida caniplaga guttivitta Lophocampa maculata Heterocampa umbrata Phragmatobia fuliginosa _ _ Schizura badia 1 Cycnia tenera 1 Heterocampa biundata Schizura unicornis Lochmaeus manteo fulvicollis Cycnia oregonensis

_2 _1 0 1 2 _2 _1 0 1 2 dimension 1 dimension 1

2 c)

1 Lapara bombycoides Smerinthus cerisyi Sphinx canadensis Sphinx gordius 2

n Ceratomia undulosa o i s

n 0 Deidamia inscripta e Smerinthus jamaicensis m i d

_ 1 Paonias myops Paonias excaecatus Pachysphinx modesta

_2 _1 0 1 2 dimension 1

Figure 4. Multidimensional scaling of Euclidean distances among the COI genes from (a) 18 species of Arctiidae, (b) 20 species of Notodontidae and (c) 11 species of Sphingidae. Circles identify the single representatives of each species included in the profile, while the crosses mark the position of 'test' individuals. Profile and test individuals from the same species always grouped together.

thresholds may be smaller than might be expected. For example, different species of vertebrates ordinarily show more than 2% sequence divergence at cytochrome b (Avise & Walker 1999), a value close to the 3% COI threshold adopted for lepidopterans in this study.

The likely applicability of a COI identification system to new animal groups and geographical settings suggests the feasibility of creating an identification system for animals-at-large. Certainly, existing primers enable recovery of this gene from most, if not all, animal species and its sequences are divergent enough to enable recognition of all but the youngest species. It is, of course, impossible for any mitochondrially based identification system to resolve fully the complexity of life. Where species boundaries are blurred by hybridization or introgression, supplemental analyses of one or more nuclear genes will be required. Similarly, when species have arisen through polyploidization, determinations of genome size may be needed. While protocols will be required to deal with such complications, a COI-based identification system will undoubtedly provide taxonomic resolution that exceeds that which can be achieved through morphological studies. Moreover, the generation of COI profiles will provide a partial solution to the problem of the thinning ranks of morphological taxonomists by enabling a crystallization of their knowledge before they leave the field. Also, since COI sequences can be obtained from museum specimens without their destruction, it will be possible to regain taxonomic capability, albeit in a novel format, for groups that currently lack an authority.

Table 3. Percentage nucleotide sequence divergence (K2P distances) at COI between members of five lepidopteran families at three levels of taxonomic affinity. (At the species level, n indicates the number of species for which two or more individuals were analysed. At a generic level, n represents the number of genera with two or more species, while at the family level it indicates the total number of species that were analysed.)

family          n    within species    n    within genus    n    within family
Arctiidae       13   0.33              4    7.0             18   10.0
Geometridae     30   0.23              10   9.1             61   12.5
Noctuidae       42   0.17              12   5.8             90   10.4
Notodontidae    14   0.36              4    5.9             20   12.4
Sphingidae      8    0.17              3    6.4             11   10.5

We believe that a COI database can be developed within 20 years for the 5–10 million animal species on the planet (Hammond 1992; Novotny et al. 2002) for approximately $1 billion, far less than that directed to other major science initiatives such as the Human Genome project or the International Space Station. Moreover, initial efforts could focus on species of economic, medical or academic importance. Data acquisition is now simple enough for individual laboratories to gather, in a single year, COI profiles for 1000 species, a number greater than that in many major taxonomic groups on a continental scale. Once completed, these profiles will be immediately cost-effective in many taxonomic contexts, and innovations in sequencing technology promise future reductions in the cost of DNA-based identifications.

If advanced comprehensively, a COI database could serve as the basis for a global bioidentification system (GBS) for animals. Implementation on this scale will require the establishment of a new genomics database. While GenBank aims for comprehensive coverage of genomic diversity, the GBS database would aim for comprehensive taxonomic coverage of just a single gene. Through web-based delivery, this system could provide easy access to taxonomic information, a particular benefit to developing nations. Its adoption by an organization such as the Global Biodiversity Information Facility or the All Species Foundation would be an important step towards ensuring the longevity lacking in many web-based resources. Once established, this microgenomic identification system will overcome the deficits of morphological approaches to species discrimination: the bounds of intraspecific diversity will be quantifiable, sibling species will be recognizable, taxonomic decisions will be objective and all life stages will be identifiable. Moreover, once complete, the GBS will allow single laboratories to execute taxon diagnoses across the full spectrum of animal life.

The creation of the GBS will be a substantial undertaking and will require close alliances between molecular biologists and taxonomists. However, its assembly promises both a revolution in access to basic biological information and a newly detailed view of the origins of biological diversity.

This work was supported by grants from NSERC and the Canada Research Chairs Program to P.D.N.H. Teri Crease, Melania Cristescu, Derek Taylor, Jonathan Witt and two reviewers provided helpful comments on earlier drafts of this manuscript. The authors thank Win Bailey, Klaus Bolte, Steve Burian, Don Klemm, Don Lafontaine, Steve Marshall, Christine Nalepa, Jeff Webb and Jack Zloty for either providing specimens or verifying taxonomic assignments. They also thank Lisa Schieman, Tyler Zemlak, Heather Cole and Angela Holliss for their assistance with the DNA analyses.

REFERENCES

Allander, T., Emerson, S. U., Engle, R. E., Purcell, R. H. & Bukh, J. 2001 A virus discovery method incorporating DNase treatment and its application to the identification of two bovine parvovirus species. Proc. Natl Acad. Sci. USA 98, 11 609–11 614.

Avise, J. C. & Walker, D. 1999 Species realities and numbers in sexual vertebrates: perspectives from an asexually transmitted genome. Proc. Natl Acad. Sci. USA 96, 992–995.

Brown, B., Emberson, R. M. & Paterson, A. M. 1999 Mitochondrial COI and II provide useful markers for Weiseana (Lepidoptera, Hepialidae) species identification. Bull. Entomol. Res. 89, 287–294.

Bucklin, A., Guarnieri, M., Hill, R. S., Bentley, A. M. & Kaartvedt, S. 1999 Taxonomic and systematic assessment of planktonic copepods using mitochondrial COI sequence variation and competitive, species-specific PCR. Hydrobiology 401, 239–254.

Cox, A. J. & Hebert, P. D. N. 2001 Colonization, extinction and phylogeographic patterning in a freshwater crustacean. Mol. Ecol. 10, 371–386.

Doyle, J. J. & Gaut, B. S. 2000 Evolution of genes and taxa: a primer. Plant Mol. Biol. 42, 1–6.

Folmer, O., Black, M., Hoeh, W., Lutz, R. & Vrijenhoek, R. 1994 DNA primers for amplification of mitochondrial cytochrome c oxidase subunit I from diverse metazoan invertebrates. Mol. Mar. Biol. Biotechnol. 3, 294–299.

Gaston, K. J. & Hudson, E. 1994 Regional patterns of diversity and estimates of global insect species richness. Biodivers. Conserv. 3, 493–500.

Godfray, H. C. J., Lewis, O. T. & Memmott, J. 1999 Studying insect diversity in the tropics. Phil. Trans. R. Soc. Lond. B 354, 1811–1824. (DOI 10.1098/rstb.1999.0523.)

Hamels, J., Gala, L., Dufour, S., Vannuffel, P., Zammatteo, N. & Remacle, J. 2001 Consensus PCR and microarray for diagnosis of the genus Staphylococcus, species, and methicillin resistance. BioTechniques 31, 1364–1372.

Hammond, P. 1992 Species inventory. In Global biodiversity: status of the earth's living resources (ed. B. Groombridge), pp. 17–39. London: Chapman & Hall.

Hawksworth, D. L. & Kalin-Arroyo, M. T. 1995 Magnitude and distribution of biodiversity. In Global biodiversity assessment (ed. V. H. Heywood), pp. 107–191. Cambridge University Press.

Jarman, S. N. & Elliott, N. G. 2000 DNA evidence for morphological and cryptic Cenozoic speciations in the Anaspididae, 'living fossils' from the Triassic. J. Evol. Biol. 13, 624–633.

Knowlton, N. 1993 Sibling species in the sea. A. Rev. Ecol. Syst. 24, 189–216.

Knowlton, N. & Weigt, L. A. 1998 New dates and new rates for divergence across the Isthmus of Panama. Proc. R. Soc. Lond. B 265, 2257–2263. (DOI 10.1098/rspb.1998.0568.)

Kumar, S. & Gadagkar, S. R. 2000 Efficiency of the neighbour-joining method in reconstructing deep and shallow evolutionary relationships in large phylogenies. J. Mol. Evol. 51, 544–553.

Kumar, S., Tamura, K., Jacobsen, I. B. & Nei, M. 2001 MEGA2: molecular evolutionary genetics analysis software. Tempe, AZ: Arizona State University.

Kurtzman, C. P. 1994 Molecular taxonomy of the yeasts. Yeast 10, 1727–1740.

Lessa, P. 1990 Multidimensional scaling of geographic genetic structure. Syst. Zool. 39, 242–252.

Lynch, M. & Jarrell, P. E. 1993 A method for calibrating molecular clocks and its application to animal mitochondrial DNA. Genetics 135, 1197–1208.

Mardulyn, P. & Whitfield, J. B. 1999 Phylogenetic signal in the COI, 16S, and 28S genes for inferring relationships among genera of Microgastrinae (Hymenoptera: Braconidae): evidence of a high diversification rate in this group of parasitoids. Mol. Phylogenet. Evol. 12, 282–294.

Nanney, D. L. 1982 Genes and phenes in Tetrahymena. Bioscience 32, 783–788.

Nei, M. & Kumar, S. 2000 Molecular evolution and phylogenetics. Oxford University Press.

Novotny, V., Baset, Y., Miller, S. E., Weiblen, G. D., Bremer, B., Cizek, L. & Drezel, P. 2002 Low host specificity of herbivorous insects in a tropical forest. Nature 416, 841–845.

Pace, N. R. 1997 A molecular view of microbial diversity and the biosphere. Science 276, 734–740.

Saccone, C., DeCarla, G., Gissi, C., Pesole, G. & Reynes, A. 1999 Evolutionary genomics in the Metazoa: the mitochondrial DNA as a model system. Gene 238, 195–210.

Simmons, R. B. & Weller, S. J. 2001 Utility and evolution of cytochrome b in insects. Mol. Phylogenet. Evol. 20, 196–210.

Trewick, S. A. 2000 Mitochondrial DNA sequences support allozyme evidence for cryptic radiation of New Zealand Peripatoides (Onychophora). Mol. Ecol. 9, 269–282.

Vincent, S., Vian, J. M. & Carlotti, M. P. 2000 Partial sequencing of the cytochrome oxidase-b subunit gene. I. A tool for the identification of European species of blow flies for post mortem interval estimation. J. Forensic Sci. 45, 820–823.

Wares, J. P. & Cunningham, C. W. 2001 Phylogeography and historical ecology of the North Atlantic intertidal. Evolution 12, 2455–2469.

Williams, S. T. & Knowlton, N. 2001 Mitochondrial pseudogenes are pervasive and often insidious in the snapping shrimp Alpheus. Mol. Biol. Evol. 18, 1484–1493.

Wilson, K. H. 1995 Molecular biology as a tool for taxonomy. Clin. Infect. Dis. 20 (Suppl.), 192–208.

Zhang, D.-X. & Hewitt, G. M. 1997 Assessment of the universality and utility of a set of conserved mitochondrial primers in insects. Insect Mol. Biol. 6, 143–150.

As this paper exceeds the maximum length normally permitted, the authors have agreed to contribute to production costs.

Visit http://www.pubs.royalsoc.ac.uk to see electronic appendices to this paper.
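Illustrative note on the K2P distances of Table 3. The divergences in Table 3 are Kimura two-parameter (K2P) distances, which correct the observed proportion of differing sites separately for transitions (P) and transversions (Q): d = −(1/2) ln[(1 − 2P − Q) √(1 − 2Q)]. The Python sketch below shows one way such a distance could be computed for two aligned sequences. It is an assumption-laden illustration and not the authors' pipeline: the function names (k2p_distance, percent_divergence, within_threshold) are hypothetical, sites with gaps or ambiguous bases are simply skipped, and the comparison against the 3% figure only mirrors the threshold mentioned in the text.

```python
import math

PURINES = frozenset("AG")  # A and G; C and T are pyrimidines


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura two-parameter distance (substitutions per site) between two
    aligned nucleotide sequences. Hypothetical helper for illustration."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: skip gaps and ambiguous bases
        sites += 1
        if a == b:
            continue
        # A<->G or C<->T is a transition; purine<->pyrimidine is a transversion
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    term1 = 1.0 - 2.0 * p - q
    term2 = 1.0 - 2.0 * q
    if term1 <= 0.0 or term2 <= 0.0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(term1 * math.sqrt(term2))


def percent_divergence(seq_a: str, seq_b: str) -> float:
    """K2P distance expressed as a percentage, the unit used in Table 3."""
    return 100.0 * k2p_distance(seq_a, seq_b)


def within_threshold(seq_a: str, seq_b: str, threshold_percent: float = 3.0) -> bool:
    """True if divergence falls below the (assumed) species-level threshold."""
    return percent_divergence(seq_a, seq_b) < threshold_percent
```

For example, two toy 100-site sequences that differ by a single transition give a divergence of roughly 1%, which within_threshold would treat as falling below the 3% cut-off; real COI comparisons would use the full aligned gene region.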
