Original Papers

Development of a Reference Standard Library of Chloroplast Genome Sequences, GenomeTrakrCP

Authors
Ning Zhang 1, Padmini Ramachandran 1, Jun Wen 2, James A. Duke 3, Helen Metzman 3, William McLaughlin 4, Andrea R. Ottesen 1, Ruth E. Timme 1, Sara M. Handy 1

Affiliations
1 Center for Food Safety and Applied Nutrition, Office of Regulatory Science, U. S. Food and Drug Administration, College Park, Maryland, United States
2 Department of Botany, National Museum of Natural History, Smithsonian Institution, Washington D. C., United States
3 Green Farmacy Garden, Fulton, Maryland, United States
4 United States Botanic Garden Conservatory, Washington D. C., United States

Key words
chloroplast genome, GenomeTrakrCP, dietary supplements, botanicals, DNA barcoding

received March 7, 2017
revised May 21, 2017
accepted June 5, 2017

Bibliography
DOI https://doi.org/10.1055/s-0043-113449
Published online June 26, 2017 | Planta Med 2017; 83: 1420–1430 © Georg Thieme Verlag KG Stuttgart · New York | ISSN 0032‑0943

Correspondence
Dr. Sara M. Handy
Center for Food Safety and Applied Nutrition, Office of Regulatory Science, U. S. Food and Drug Administration
HFS 707, 5001 Campus Dr., 20740 College Park, Maryland, United States
Phone: + 12404023063, Fax: + 13014362624
[email protected]

ABSTRACT
Precise, species-level identification of plants in foods and dietary supplements is difficult. While the use of DNA barcoding regions (short regions of DNA with diagnostic utility) has been effective for many inquiries, it is not always a robust approach for closely related species, especially in highly processed products. The use of fully sequenced chloroplast genomes, as an alternative to short diagnostic barcoding regions, has demonstrated utility for closely related species. The U.S. Food and Drug Administration (FDA) has also developed species-specific DNA-based assays targeting plant species of interest by utilizing chloroplast genome sequences. Here, we introduce a repository of complete chloroplast genome sequences called GenomeTrakrCP, which will be publicly available at the National Center for Biotechnology Information (NCBI). Target species for inclusion are plants found in foods and dietary supplements, toxin producers, common contaminants and adulterants, and their close relatives. Publicly available data will include annotated assemblies, raw sequencing data, and voucher information with each NCBI accession associated with an authenticated reference herbarium specimen. To date, 40 complete chloroplast genomes have been deposited in GenomeTrakrCP (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA325670/), and this will be expanded in the future.

ABBREVIATIONS
CAERS CFSAN Adverse Event Reporting System
CFSAN Center for Food Safety and Applied Nutrition
DNA deoxyribonucleic acid

Introduction
Plants are used in innumerable ways in food, spices, and dietary supplements. One major use is in botanical or herbal dietary supplements. People have embraced herbal dietary supplements for millennia to prevent disease and augment health, and as an alternative to pharmaceuticals for acute and chronic illnesses. In 2015, the total sales of herbal dietary supplements in the United States reached $6.92 billion, a 7.5% increase from 2014, and demand for botanicals has continued to increase for the past 12 years [1].

Zhang N et al. Development of a … Planta Med 2017; 83: 1420–1430

Adulteration of botanicals has had the same long and complex history as botanicals themselves. According to Roy Upton, executive director of the American Herbal Pharmacopoeia, in a given year, as many as 60–70% of ginkgo products, 40% of St. Johnʼs wort products, and 60% of ginseng products may be adulterated [2].

The U. S. Food and Drug Administration (FDA) regulates dietary supplements under the Federal Food, Drug and Cosmetic Act, which was amended in 1994 via the Dietary Supplement Health and Education Act (as reviewed in Pawar and Grundel [3]). Currently, complaints or adverse events about dietary supplements, as well as foods and cosmetics, can be reported by consumers and medical professionals to the FDA CAERS, while dietary supplement firms are required to report serious adverse events. CAERS is a post-market surveillance system, so its data only reflect information as reported and do not represent any conclusions by the FDA about whether the product actually caused the adverse events. As of December 2016, CAERS data are publicly available (http://www.fda.gov/Food/ComplianceEnforcement/ucm494015.htm#files, accessed December 15, 2016).

Of the 40 species presented here (▶ Table 1, which includes plants used in foods/spices, dietary supplements, and others), a total of 2095 adverse events/complaints have been reported to CAERS in the past two years (January 1, 2014 – October 17, 2016). These events were found by searching ingredients and product names, first by common name and then, if nothing was returned, by Latin binomial. The highest number of complaints was for the food product Citrus limon (L.) Osbeck, at 431 (CAERS 2016); however, it must be noted that in this case “lemon” appeared somewhere in the complaint or adverse event report, which does not necessarily mean that it caused the event directly. Reports may be duplicated if the product had several of the ingredients that were reviewed. Additionally, especially in the case of foods, the FDA does not generally have ingredients listed for complaints/adverse events, whereas they are usually listed for dietary supplements.

In the U. S. mainstream multi-outlet channel for botanical dietary supplements, the top six botanical sellers in 2015 were horehound (Marrubium vulgare L., Lamiaceae), cranberry (Vaccinium macrocarpon Aiton, Ericaceae), echinacea (Echinacea spp. Moench, Asteraceae), garcinia cambogia (Garcinia gummi-gutta (L.) Roxb., Clusiaceae), green tea (Camellia sinensis (L.) Kuntze, Theaceae), and black cohosh (Actaea racemosa L., Ranunculaceae) [1]. Adverse event reports and complaints over the past two years for these species in foods and dietary supplements were 5, 320, 395, 110, 1068, and 73, respectively (CAERS 2016) (▶ Table 1). In some cases, these products may have included substitutions or adulterations. Economically motivated or accidental substitutions, which can result in illnesses, highlight the need for updated methods for species identification of important and widely used plants. The FDAʼs DNA-based species identification tools for plants complement established chemical methods and provide the agency with an improved approach to protect this aspect of public health, in much the same way it has approached the safeguarding of seafood products [4, 5] and the source tracking of bacterial pathogens [6].

For example, between July 2008 and June 2012, the FDA received 501 complaints from consumers reporting a bitter or metallic taste within hours or days of consumption of pine nuts [7]. This phenomenon was dubbed “pine mouth syndrome” [8, 9]. After much chemical testing of consumer complaints, the FDA CFSAN could not determine the cause. A DNA-based approach was applied to determine whether a substituted species, Pinus armandii Franch. (Pinaceae), was causing this effect [10, 11]. Traditional DNA barcoding regions (i.e., matK, rbcL, and ITS2) were not useful in identifying these pines [10]; instead, a diagnostic region, ycf1, was identified from a comparison of 33 Pinus chloroplast genomes and was used to differentiate P. armandii from closely related species using a specific real-time polymerase chain reaction assay [11].

The pine nut project highlighted one particular difficulty that can arise while using DNA to identify plants: distinguishing among closely related species. This problem is common. Ran et al. [12] noted that the seven gene regions (matK, rbcL, rpoB, rpoC1, atpF-atpH, psbA-trnH, and psbK-psbI) typically used for identification are not effective for closely related spruce species. Techen et al. [13] found that Illicium L. (Schisandraceae) species, some of which are highly toxic, could also be difficult to differentiate. Chen et al. [14] reported that there were no barcode gaps for Curcuma L. (Zingiberaceae) species with four chloroplast barcoding regions (matK, rbcL, trnH-psbA, and trnL‑F). Recently this issue was also encountered with closely related Echinacea species in our own study [15].

More studies have shown that traditional or core DNA barcoding markers may not work effectively for distinguishing between closely related plant species, which might be the source of adulteration or substitution in food products and dietary supplements. To identify botanical specimens accurately using DNA-based approaches, a large high-resolution genomic database is needed. The database captures the observed genetic diversity of targeted species and their close relatives at the genomic scale, with each genome entry properly vouchered with a herbarium specimen. Once this database is set up, there are multiple options for its utilization. Some users may develop small targeted assays [16–18], while other applications may need whole chloroplast genomes (i.e., super barcoding) to identify plant species [19–22]. Clearly there was a strong need to develop the FDAʼs own curated collection of sequences from authenticated specimens.

This study describes the steps from specimen collection to submission of chloroplast genome sequences for a publicly available database called GenomeTrakrCP. This effort is modeled after two other successful FDA programs through which genetic databases were made available to the public: the GenomeTrakr database for foodborne pathogen surveillance hosted at the NCBI [6] and the FDAʼs Reference Standard Sequence Library for Seafood Identification [23], which contains sequence data from authenticated specimens primarily housed at the Smithsonian Institution.

▶ Table 1 The 40 species sampled in this study. DS: dietary supplement; F: food/spice. All voucher specimens are deposited at the U. S. National Herbarium.

Species | Common name | Family | Uses | Reports received by CAERS (past two years)^a | Location | Voucher number
Actaea racemosa | black cohosh | Ranunculaceae | Roots and rhizomes; treat gynecological and other disorders; DS | 73 | Green Farmacy Garden | N/A^j
Allium sativum | garlic | Amaryllidaceae | Bulb; condiment; F and DS | 346 | Green Farmacy Garden | N/A^j
Aloysia citrodora | lemon verbena | Verbenaceae | Leaves; spice and herbal tea; F and DS | 13 | Washington D. C.: U. S. National Arboretum | Wen12894
Althaea officinalis | marsh mallow | Malvaceae | Leaves, flowers, and the root; mouth and throat ulcers and gastric ulcers; F | 23 | Green Farmacy Garden | N/A^j
Artemisia annua | sweet wormwood | Asteraceae | Whole plant; malaria treatment, treat fever; DS | 2^b | Green Farmacy Garden | N/A^j
Boswellia sacra | frankincense or olibanum-tree | Burseraceae | Gum resin; used for anti-inflammatory and anti-neoplastic; DS | 11 | Washington D. C.: U. S. National Arboretum | Wen12916
Citrus limon | lemon | Rutaceae | Fruit; drinks; F | 431 | Washington D. C.: U. S. National Arboretum | Wen12901
Coffea arabica | coffee | Rubiaceae | Fruit; drinks, weight loss; F and DS | 297 | Washington D. C.: U. S. National Arboretum | Wen12914
Digitalis lanata | Grecian foxglove | Plantaginaceae | Whole plant; medicine for heart conditions; DS | 1^c | Green Farmacy Garden | N/A^j
Dioscorea villosa | wild yam | Dioscoreaceae | Leaf and root; cancer prevention, treatment of Crohnʼs disease and whooping cough; F and DS | 16 | Green Farmacy Garden | N/A^j
Echinacea angustifolia^d | narrow-leaf coneflower | Asteraceae | Root; pain relief, relief of colds and toothaches; DS | 395^e | Washington D. C.: U. S. National Arboretum | US2802433
Echinacea atrorubens^d | Topeka purple coneflower | Asteraceae | Unknown; closely related species | 395^e | Washington D. C.: U. S. National Arboretum | US2235164
Echinacea laevigata^d | smooth coneflower | Asteraceae | Unknown; closely related species | 395^e | Washington D. C.: U. S. National Arboretum | US3360860
Echinacea pallida^d | pale purple coneflower | Asteraceae | Root; treatment of flu-like infections; DS | 395^e | Washington D. C.: U. S. National Arboretum | US2233063
Echinacea paradoxa^d | yellow coneflower | Asteraceae | Unknown; closely related species | 395^e | Washington D. C.: U. S. National Arboretum | US1653013
Echinacea purpurea^d | purple coneflower | Asteraceae | Root; treat colds, upper respiratory tract infections, urinary tract infections, and slow-healing wounds; DS | 395^e | Washington D. C.: U. S. National Arboretum | US2349097
Echinacea sanguinea^d | sanguine purple coneflower | Asteraceae | Unknown; closely related species | 395^e | Washington D. C.: U. S. National Arboretum | US1468035
Echinacea speciosa^d | narrow-leaved purple coneflower | Asteraceae | Unknown; closely related species | 395^e | Washington D. C.: U. S. National Arboretum | US2349080
Echinacea tennesseensis^d | Tennessee coneflower | Asteraceae | Unknown; closely related species | 395^e | Washington D. C.: U. S. National Arboretum | US980416
Eleutherococcus senticosus | Eleuthero, Ciwujia | Araliaceae | Root and other parts; adaptogen, Chinese medicine for high or blood pressure, hardening of the arteries, and rheumatic heart disease; DS | 15 | Green Farmacy Garden | N/A^j
Eriobotrya japonica | Loquat, Pipa | Rosaceae | Fruit, leaves; Chinese medicine, for soothing the throat and making cough drops, beverages; F and DS | 0 | Washington D. C.: U. S. National Arboretum | Wen12898
Fragaria virginiana | strawberry | Rosaceae | Fruit; F | N/A | Washington D. C.: U. S. National Arboretum | Wen12936
Hydrastis canadensis | goldenseal | Ranunculaceae | Root; alterative, anti-catarrhal, anti-inflammatory, antiseptic, astringent, bitter tonic, laxative, anti-diabetic, and muscle stimulant; DS | 24 | Green Farmacy Garden | N/A^j
Illicium verum | Chinese star anise | Schisandraceae | Fruit; spice; F | 0 | U. S. Botanic Garden | 13-0668-A
Illicium anisatum | Japanese star anise | Schisandraceae | Highly toxic; closely related species | N/A | Washington D. C.: U. S. National Arboretum | 52084L
Illicium floridanum | purple anise or Florida anise | Schisandraceae | Ornamental; closely related species | N/A | Washington D. C.: U. S. National Arboretum | Wen12934
Illicium henryi | Henry anise tree | Schisandraceae | Unknown; closely related species | N/A | Washington D. C.: U. S. National Arboretum | 60238H
Jasminum sambac | Arabian jasmine | Oleaceae | Flower; make tea; F | 15^f | Washington D. C.: U. S. National Arboretum | Wen12922
Jasminum tortuosum | twisted jasmine | Oleaceae | Flower; ornamental, perfume; closely related species | N/A | Washington D. C.: U. S. National Arboretum | Wen12904
Laurus nobilis | bay laurel | Lauraceae | Leaves; astringents, alleviate arthritis and rheumatism, treat earaches and high blood pressure; F | 0 | Washington D. C.: U. S. National Arboretum | Wen12923
Magnolia officinalis var. biloba | Houpu magnolia | Magnoliaceae | Bark; traditional Chinese medicine, known as hou po, to eliminate damp and phlegm, and relieve distension; DS | 27^g | Washington D. C.: U. S. National Arboretum | Wen12928
Magnolia biondii | Biondʼs magnolia | Magnoliaceae | Traditional Chinese medicine as xinyi; DS | 27^g | Washington D. C.: U. S. National Arboretum | Wen12930
Magnolia denudata | lilytree | Magnoliaceae | Unknown; closely related species | N/A | Washington D. C.: U. S. National Arboretum | Wen12879
Mitragyna speciosa | kratom | Rubiaceae | Leaves; mood enhancer and/or painkiller; DS | 39 | Green Farmacy Garden | N/A^j
Pimenta dioica | allspice | Myrtaceae | Fruit; spice; F | 1 | Washington D. C.: U. S. National Arboretum | Wen12912
Piper auritum | Vera Cruz pepper | Piperaceae | Fruit; spice; F | 32^h | Washington D. C.: U. S. National Arboretum | Wen12891
Piper nigrum | black pepper | Piperaceae | Fruit; spice; F | 32^h | Washington D. C.: U. S. National Arboretum | Wen12915
Prunus dulcis | almond | Rosaceae | Fruit; nuts; F | 216 | Washington D. C.: U. S. National Arboretum | Wen12895
Scutellaria lateriflora | skullcap | Lamiaceae | Whole plant; herb medicine for sedative and sleep promoter; DS | 26^i | Green Farmacy Garden | N/A^j
Theobroma cacao | cacao | Malvaceae | Fruit; source of chocolate, drinks; F | 92 | Washington D. C.: U. S. National Arboretum | Wen12890

^a Adverse event reports or complaints over the past two years with these species listed in the product name or ingredients in foods and dietary supplements, searched by common name and Latin binomial;
^b Found using “wormwood,” which can mean two different species;
^c Found using “foxglove,” which can mean several species;
^d Originally described in Zhang et al. [15];
^e For all Echinacea spp.;
^f Found using “Jasmine,” which can mean several species;
^g Found using “Magnolia,” which can mean several species. Number included all Magnolia spp.;
^h For all Piper sp.;
^i Found using “skullcap,” which can mean several species;
^j Samples were collected from the Green Farmacy Garden and the corresponding voucher specimen will be collected in the summer of 2017.

Results and Discussion
Among the initial 40 species with whole chloroplast genome data, 16 species are commonly used as foods or spices: Allium sativum L., Aloysia citrodora Loes. & Moldenke, Althaea officinalis Cav., C. limon, Coffea arabica L., Dioscorea villosa L., Eriobotrya japonica (Thunb.) Lindl., Fragaria virginiana Mill., Illicium verum Hook. f., Jasminum sambac (L.) Aiton, Laurus nobilis L., Pimenta dioica (L.) Merr., Piper auritum Kunth, Piper nigrum L., Prunus dulcis (Mill.) D. A. Webb, and Theobroma cacao L. (▶ Table 1). Twelve species are used as dietary supplements (excluding seven species that are also used as foods/spices mentioned above): Actaea racemosa L., Artemisia annua L., Boswellia sacra Flueck., Digitalis lanata Ehrh., Echinacea angustifolia (DC.) A. Heller [15], Echinacea pallida (Nutt.) Nutt. [15], Echinacea purpurea (L.) Moench [15], Eleutherococcus senticosus (Rupr. & Maxim.) Maxim., Hydrastis canadensis L., Magnolia officinalis Rehder & E. H. Wilson, Mitragyna speciosa (Korth.) Havil., and Scutellaria lateriflora L. In addition, we sequenced 12 closely related species, including six Echinacea species (Echinacea atrorubens (Nutt.) Nutt., Echinacea laevigata (F. E. Boynton & Beadle ex C. L. Boynton & Beadle) S. F. Blake, Echinacea paradoxa (Norton) Britton, Echinacea sanguinea Nutt., Echinacea speciosa (Wender.) Paxton, and Echinacea tennesseensis (Beadle) Small [15, 24]), three Illicium species (Illicium anisatum Gaertn., Illicium floridanum J. Ellis, and Illicium henryi Diels, Schisandraceae [13]), two Magnolia species (Magnolia biondii Pamp. and Magnolia denudata Desr.), and one Jasminum species (Jasminum tortuosum Willd.).

The number of next-generation sequencing reads for each species ranges from 941,530 (T. cacao) to 10,966,208 (E. sanguinea). The data size per species ranges from 277 MB (T. cacao) to 2,661 MB (J. sambac). In total, the size of newly generated next-generation sequencing data in this study reached 39,873 MB. Because genomic DNA rather than chloroplast DNA was used for sequencing, only a small fraction of reads was from the chloroplast genome. The percentage of reads mapping to the reference chloroplast genome ranges from 0.25% (Echinacea purpurea) to 7.18% (Fragaria virginiana), and the mapping coverage of the chloroplast genome ranges from 19x (Illicium henryi) to 354x (Jasminum sambac) (▶ Table 2). A large portion of the sequencing data captured through the shotgun genome skimming approach in the study was from mitochondrial and nuclear genomes. Because of this, some nuclear genes like ribosomal DNA and even some single-copy DNA regions can be retrieved in this process and used in the future as DNA sequencing markers [18].

The number of chloroplast genomes deposited in GenBank per year increased significantly after 2014, from 125 (2014) to 218 (2015) and to 326 by 2016, owing to the reduced cost of next-generation sequencing (▶ Fig. 1). As of October 19, 2016, there were 1003 chloroplast genomes of land plants available. In this study, the 40 new chloroplast genomes derived from authenticated specimens are from 24 genera and 19 families. Chloroplast genomes of six species (Boswellia sacra, Coffea arabica, Eleutherococcus senticosus, Fragaria virginiana, Magnolia denudata, and Theobroma cacao) have been reported by other groups (▶ Table 3). For another 16 species, members of the same genus are already available in GenBank (▶ Table 3). Another 16 species have members of the same family available, but none within the same genus (▶ Table 3). The remaining two species, Aloysia citrodora and Digitalis lanata, represent the first chloroplast genome reports from the plant families Verbenaceae and Plantaginaceae, respectively.

The size of chloroplast genomes ranges from 142,723 bp (Illicium anisatum) to 163,186 bp (Jasminum sambac) (▶ Table 4), and the GC-content of each species ranges from 36.7% (Allium sativum) to 39.2% (Illicium anisatum) (▶ Table 4).
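As an illustration of the GC-content figures reported in Table 4, the following minimal Python sketch computes the percentage of G and C bases in a sequence. It is only an assumption about how such a value can be derived; the text does not state which software produced the reported values, and the function name `gc_content` and the toy sequence are hypothetical.

```python
def gc_content(sequence: str) -> float:
    """Return the percentage of G and C among unambiguous bases (A, C, G, T).

    Ambiguity codes such as N are ignored (an assumption of this sketch).
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return 100.0 * gc / total


# Toy fragment (not taken from any deposited genome):
# 4 of the 10 unambiguous bases are G or C.
toy = "ATGCNATTCGA"
print(round(gc_content(toy), 1))  # 40.0
```

Applied to a complete assembled chloroplast sequence, the same calculation would yield a single percentage per genome, comparable in form to the GC column of Table 4.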

▶ Table 2 Next-generation sequencing data generated for 40 species.

Species | Raw data size (MB) | Number of reads | Size of reads (bp) | Reads mapping to reference genome | Percentage of mapping reads (%) | Coverage (x) | GenBank accession number (chloroplast genome) | NCBIʼs SRA accession number
Actaea racemosa | 1262 | 5087818 | 250 | 86888 | 1.71 | 148 | KY085920 | SRR5602599
Allium sativum | 748 | 2540120 | 300 | 19194 | 0.74 | 38 | KY085913 | SRR5602598
Aloysia citrodora | 844 | 2847446 | 300 | 67990 | 2.43 | 132 | KY085903 | SRR5602597
Althaea officinalis | 833 | 2819422 | 300 | 87513 | 3.1 | 164 | KY085914 | SRR5602596
Artemisia annua | 1048 | 4218222 | 250 | 55743 | 1.33 | 92 | KY085890 | SRR5602595
Boswellia sacra | 1199 | 4025894 | 300 | 89368 | 2.22 | 168 | KY085915 | SRR5602594
Citrus limon | 666 | 2725584 | 250 | 75471 | 2.82 | 118 | KY085897 | SRR5602593
Coffea arabica | 1494 | 5065216 | 300 | 129184 | 2.6 | 250 | KY085909 | SRR5602572
Digitalis lanata | 730 | 2515048 | 300 | 134546 | 5.35 | 264 | KY085895 | SRR5602573
Dioscorea villosa | 858 | 2894046 | 300 | 132880 | 4.59 | 259 | KY085893 | SRR5602590
Echinacea angustifolia^a | 878 | 3338742 | 300 | 46795 | 1.49 | 70 | KX548221 | SRR5602579
Echinacea atrorubens^a | 472 | 1923846 | 250 | 19698 | 1.06 | 32 | KX548220 | SRR5602578
Echinacea laevigata^a | 545 | 2198622 | 250 | 17608 | 0.81 | 29 | KX548219 | SRR5602581
Echinacea pallida^a | 832 | 4078614 | 250 | 20048 | 0.49 | 33 | KX548218 | SRR5602580
Echinacea paradoxa^a | 1692 | 6202480 | 300 | 30803 | 0.51 | 51 | KX548217 | SRR5602575
Echinacea purpurea^a | 2531 | 10394828 | 300 | 25388 | 0.25 | 40 | KX548224 | SRR5602574
Echinacea sanguinea^a | 2437 | 10966208 | 250 | 33158 | 0.31 | 51 | KX548225 | SRR5602577
Echinacea speciosa^a | 483 | 1941430 | 250 | 13009 | 0.67 | 22 | KX548222 | SRR5602576
Echinacea tennesseensis^a | 434 | 1814356 | 250 | 8757 | 0.53 | 20 | KX548223 | SRR5602587
Eleutherococcus senticosus | 787 | 2679182 | 300 | 25266 | 0.94 | 46 | KY085901 | SRR5602586
Eriobotrya japonica | 918 | 3705664 | 250 | 50203 | 1.36 | 79 | KY085905 | SRR5602604
Fragaria virginiana | 708 | 2398914 | 300 | 172232 | 7.18 | 328 | KY085911 | SRR5602605
Hydrastis canadensis | 671 | 2713622 | 250 | 24608 | 0.92 | 38 | KY085918 | SRR5602606
Illicium verum | 1121 | 4479368 | 250 | 27242 | 0.61 | 47 | KY085896 | SRR5602607
Illicium anisatum | 1454 | 5816572 | 250 | 50306 | 0.87 | 88 | KY085919 | SRR5602608
Illicium floridanum | 1141 | 3858232 | 300 | 40793 | 1.06 | 85 | KY085892 | SRR5602609
Illicium henryi | 611 | 2480392 | 250 | 9095 | 0.37 | 19 | KY085910 | SRR5602610
Jasminum sambac | 2661 | 8975500 | 300 | 204276 | 2.31 | 354 | KY085902 | SRR5602611
Jasminum tortuosum | 726 | 2937490 | 250 | 122672 | 4.2 | 189 | KY085898 | SRR5602601
Laurus nobilis | 880 | 3549864 | 250 | 138593 | 3.92 | 227 | KY085912 | SRR5602602
Magnolia officinalis | 1040 | 3488006 | 300 | 83560 | 2.43 | 147 | KY085916 | SRR5602589
Magnolia biondii | 953 | 3200248 | 300 | 75629 | 2.4 | 133 | KY085894 | SRR5602588
Magnolia denudata | 978 | 3281958 | 300 | 83986 | 2.41 | 157 | KY085917 | SRR5602603
Mitragyna speciosa | 658 | 2655068 | 250 | 94629 | 3.56 | 150 | KY085908 | SRR5602600
Pimenta dioica | 1068 | 3642300 | 300 | 75011 | 2.11 | 133 | KY085891 | SRR5602585
Piper auritum | 964 | 3903784 | 250 | 126403 | 3.26 | 196 | KY085906 | SRR5602592
Piper nigrum | 797 | 2685872 | 300 | 83697 | 3.12 | 153 | KY085899 | SRR5602591
Prunus dulcis | 631 | 2571744 | 250 | 145183 | 5.73 | 228 | KY085904 | SRR5602582
Scutellaria lateriflora | 843 | 3398096 | 250 | 101824 | 3.02 | 167 | KY085900 | SRR5602584
Theobroma cacao | 277 | 941530 | 300 | 15765 | 1.67 | 29 | KY085907 | SRR5602583

^a Reported previously in Zhang et al. [15].
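The relationship between the columns of Table 2 can be illustrated with a short Python sketch. It assumes, as a rough approximation only, that coverage is close to (mapped reads × read length) / genome size; the text does not state how the reported coverage was calculated, and for several rows this estimate departs from the reported value, presumably because of read trimming or the mapping software used. The function names are hypothetical.

```python
def percent_mapped(mapped_reads: int, total_reads: int) -> float:
    """Percentage of all sequenced reads that mapped to the reference genome."""
    return 100.0 * mapped_reads / total_reads


def estimated_coverage(mapped_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Back-of-the-envelope depth: total mapped bases divided by genome length.

    Assumes every mapped read contributes its full nominal length,
    which overestimates depth when reads are trimmed.
    """
    return mapped_reads * read_length_bp / genome_size_bp


# Actaea racemosa: read counts from Table 2, genome size (146906 bp) from Table 4.
pct = percent_mapped(86888, 5087818)
cov = estimated_coverage(86888, 250, 146906)
print(f"{pct:.2f}% mapped, ~{cov:.0f}x coverage")  # 1.71% mapped, ~148x coverage
```

For this row the sketch agrees with the reported 1.71% and 148x; it should be read as a plausibility check on the table rather than a reconstruction of the authorsʼ calculation.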


the three Illicium species, whereas there are 24 and 23 genes in B. sacra and C. limon, respectively. Economically motivated or accidental substitutions can result in illness, at which point the FDA needs to take action quickly. To facilitate this response, the FDA must be able to quickly identify a wide range of plant species contained within products being sold to the public. With the recent intense scrutiny on authentication, dietary supplements, and proper use of DNA-based methods, it is necessary to release reference sequence data publicly so that users may be able to develop their own assays or preventative controls. We herein describe the new FDA CFSAN GenomeTrakrCP database (▶ Fig. 2). The current data reported here represent ▶ Fig. 1 Number of chloroplast genomes deposited per year in the work in progress, and the database will continue to grow and be GenBank increased dramatically from 138 (24 years, 1986–2010) improved. Through our ongoing collaborations with the Smith- to 326 (2016). Due to the small number of chloroplast genomes deposited between 1986 and 2010, we summed these numbers. sonian Institution and with other colleagues in the community, we will continue to target more plant species used as food/spices (e.g., lemon and cumin), botanical dietary supplements (e.g., Echinacea and Ginkgo), known toxin producers (e.g., Japanese star All chloroplast genomes have one large single-copy region, one anise), allergens (e.g., peanut, tree nuts, and mango), and known small single-copy region, and two inverted-repeat regions (IR). contaminants and species closely related to any of the above. All The number of proteins encoded by each chloroplast genome the annotated chloroplast genome sequences will be publicly varies from 79 (Illicium verum, I. floridanum,andI. henryi)to93 available (NCBI BioProject PRJNA325670). To date, 40 chloroplast (Citrus limon and Boswellia sacra). 
The major difference lies in the genomes have been sequenced and deposited (▶ Table 1). With number of genes within the IR. There are 10 genes in each IR of more chloroplast genomes available in the near future, the

▶ Fig. 2 Illustration of a chloroplast genome along with a representative species, coffee.

1426 Zhang N et al. Development of a … Planta Med 2017; 83: 1420–1430 ▶ Table 3 Reference genomes used for genome-guided assembly and annotation.

Species Family Reference genome Accession number of Family the reference genome Actaea racemosaa Ranunculaceae Clematis terniflora NC_028000.1 Ranunculaceae Allium sativumb Amaryllidaceae Allium cepa NC_024813.1 Amaryllidaceae Aloysia citrodorac Verbenaceae Sesamum indicum NC_016433.2 Pedaliaceae Althea officinalisa Malvaceae Hibiscus syriacus NC_026909.1 Malavaceae Artemisia annuab Asteraceae Artemisia frigida NC_020607.1 Asteraceae Boswellia sacrad Burseraceae B. sacra NC_029420.1 Burseraceae Citrus limonb Rutaceae Citrus aurantiifolia NC_024929.1 Rutaceae Coffea arabicad Rubiaceae C. arabica NC_008535.1 Rubiaceae Digitalis lanatac Plantaginaceae Scutellaria insignis NC_028533.1 Lamiaceae Dioscorea villosab Dioscoreaceae Dioscorea elephantipes NC_009601.1 Dioscoreaceae Echinacea angustifoliaa Asteraceae Parthenium argentatum NC_013553.1 Asteraceae Echinacea atrorubensa Asteraceae P. argentatum NC_013553.1 Asteraceae Echinacea laevigataa Asteraceae P. argentatum NC_013553.1 Asteraceae Echinacea pallidaa Asteraceae P. argentatum NC_013553.1 Asteraceae Echinacea paradoxaa Asteraceae P. argentatum NC_013553.1 Asteraceae Echinacea purpureaa Asteraceae P. argentatum NC_013553.1 Asteraceae Echinacea sanguineaa Asteraceae P. argentatum NC_013553.1 Asteraceae Echinacea speciosaa Asteraceae P. argentatum NC_013553.1 Asteraceae Echinacea tennesseensisa Asteraceae P. argentatum NC_013553.1 Asteraceae Eleutherococcus Araliaceae E. senticosus NC_016430.1 Araliaceae senticosusd Eriobotrya japonicaa Rosaceae Prunus kansuensis NC_023956.1 Rosaceae Fragaria virginianad Rosaceae F. virginiana NC_019602.1 Rosaceae Hydrastis canadensisa Ranunculaceae Ranunculus macranthus NC_008796.1 Ranunculaceae Illicium verumb Schisandraceae Illicium oligandrum NC_009600.1 Schisandraceae Illicium anisatumb Schisandraceae I. oligandrum NC_009600.1 Schisandraceae Illicium floridanumb Schisandraceae I. oligandrum NC_009600.1 Schisandraceae Illicium henryib Schisandraceae I. 
oligandrum NC_009600.1 Schisandraceae Jasminum sambacb Oleaceae Jasminum nudiflorum NC_008407.1 Oleaceae Jasminum tortuosumb Oleaceae J. nudiflorum NC_008407.1 Oleaceae Laurus nobilisa Lauraceae Machilus balansae NC_028074.1 Lauraceae Magnolia officinalisb Magnoliaceae M. denudata NC_018357.1 Magnoliaceae Magnolia biondiib Magnoliaceae M. denudata NC_018357.1 Magnoliaceae Magnolia denudatad Magnoliaceae M. denudata NC_018357.1 Magnoliaceae Mitragyna speciosaa Rubiaceae C. arabica NC_008535.1 Rubiaceae Pimenta dioicaa Myrtaceae Eucalyptus regnans NC_022386.1 Myrtaceae Piper auritumb Piperaceae Piper cenocladum NC_008457.1 Piperaceae Piper nigrumb Piperaceae Piper cenocladum NC_008457.1 Piperaceae Prunus dulcisb Rosaceae P. kansuensis NC_023956.1 Rosaceae Scutellaria lateriflorab Lamiaceae Scutellaria baicalensis NC_027262.1 Lamiaceae Theobroma cacaod Malvaceae T. cacao NC_014676.1 Malvaceae

a Chloroplast genome from the same family but not from the same genus is available in GenBank; b Chloroplast genome from the same genus is available in GenBank; c Chloroplast genome from the same order but not from the same family is available in GenBank; d Chloroplast genome of the same species is available in GenBank

Zhang N et al. Development of a … Planta Med 2017; 83: 1420–1430 1427 Original Papers

▶ Table 4 Forty new chloroplast genomes obtained in this study.

Species Size (bp) GC Protein rRNA tRNA Gene Actaea racemosa 146906 37.5 82 8 35 126 Allium sativum 153118 36.7 83 8 38 135 Aloysia citrodora 154699 39.2 87 8 37 134 Althaea officinalis 159987 37.0 83 8 37 128 Artemisia annua 150952 37.5 87 8 37 134 Boswellia sacra 159228 37.8 93 8 37 140 Citrus limon 160101 38.5 93 8 37 138 Coffea arabica 155188 37.4 85 8 41 139 Digitalis lanata 153108 38.6 85 8 37 132 Dioscorea villosa 153974 37.2 84 8 38 129 Echinacea angustifolia 151935 37.6 85 8 36 138 Echinacea atrorubens 151912 37.6 85 8 36 138 Echinacea laevigata 151886 37.6 85 8 36 138 Echinacea pallida 151883 37.6 85 8 36 138 Echinacea paradoxa 151837 37.6 85 8 36 138 Echinacea purpurea 151913 37.6 85 8 36 138 Echinacea sanguinea 151926 37.6 85 8 36 138 Echinacea speciosa 151860 37.6 85 8 36 138 Echinacea tennesseensis 151877 37.6 85 8 36 138 Eleutherococcus senticosus 156863 37.9 87 8 37 134 Eriobotrya japonica 159156 36.7 86 8 39 133 Fragaria virginiana 155577 37.2 85 8 37 130 Hydrastis canadensis 160000 38.6 83 8 36 129 Illicium verum 143187 39.1 79 8 35 122 Illicium anisatum 142723 39.2 80 8 34 122 Illicium floridanum 143571 39.0 79 8 35 123 Illicium henryi 143240 39.1 79 8 35 122 Jasminum sambac 163186 37.6 85 8 37 130 Jasminum tortuosum 162080 37.6 84 8 37 130 Laurus nobilis 152750 39.1 80 8 36 126 Magnolia officinalis 160136 39.2 84 8 37 129 Magnolia biondii 160002 39.2 84 8 37 129 Magnolia denudata 160089 39.2 84 8 37 129 Mitragyna speciosa 155600 37.5 85 8 37 138 Pimenta dioica 158984 37.0 85 8 37 134 Piper auritum 159909 38.3 85 8 36 129 Piper nigrum 161523 38.3 85 8 37 131 Prunus dulcis 157723 36.8 86 8 38 132 Scutellaria lateriflora 152283 38.3 86 8 36 132 Theobroma cacao 160619 36.9 81 8 37 126

ultimate goal is to be able to discriminate closely related species and cultivars efficiently and precisely. These data can be used by a much wider array of researchers, including those in industry and government, systematists, botanists, etc. This database is powerful because it can be used to design species- or group-specific assays via PCR/real-time or quantitative PCR or develop barcodes, mini-barcodes, or even super barcodes using more universal primers. They can provide a database of authenticated specimens to compare Sanger or next-generation sequencing runs to any portion of the targeted chloroplast genome.
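One way to picture the assay-design use mentioned above is to look for short subsequences (k-mers) that occur in a target species but in none of its close relatives. The sketch below is a toy illustration under stated assumptions: the sequences are invented, the helper names kmers and diagnostic_kmers are hypothetical, and a real design would also need to consider reverse complements, sequencing error, and primer constraints. It is not the method used by the authors.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def diagnostic_kmers(target, others, k):
    """Return k-mers found in the target sequence but in none of the others."""
    background = set()
    for seq in others:
        background |= kmers(seq, k)
    return kmers(target, k) - background


# Invented toy sequences that differ at a single position (not real data).
target_seq = "ACGTTGCA"
relative_seqs = ["ACGTAGCA", "ACGTCGCA"]

candidates = sorted(diagnostic_kmers(target_seq, relative_seqs, 4))
print(candidates)
```

Every k-mer that overlaps the variable position is reported, while the shared prefix is excluded. With whole chloroplast genomes the same idea would be applied to much longer k-mers across all reference sequences in the library.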

Materials and Methods

Sample collection and DNA extraction

Fresh leaves were collected and dried in silica gel (Cat. #920010, AGM Container Controls Inc.) for DNA extraction. Detailed taxon sampling information is presented in ▶ Table 1. Some species were collected from the Green Farmacy Garden in Fulton, Maryland, which is a collection of around 300 plant species that have been used or researched for medicinal purposes (https://thegreenfarmacygarden.com/, accessed December 14, 2016). These specimens were later verified by Jun Wen from the National Museum of Natural History at the Smithsonian Institution and will be collected in the summer of 2017. Specimens collected by Jun Wen were deposited into the U. S. National Herbarium, National Museum of Natural History, Smithsonian Institution.

The DNeasy Plant Mini Kit (part #69106, Qiagen) was used to extract total DNA from the dried leaf samples. For next-generation sequencing library preparation, 150 ng of DNA was sheared into ~550 base pair fragments using the Covaris M220 Focused-ultrasonicator. The library was constructed with the TruSeq Nano DNA NeoPrep Kit (Illumina, NP-101-1001). Paired-end reads (2 × 250 or 2 × 300) were generated using MiSeq Reagent Kit v2 (MS-102-2001) or MiSeq Reagent Kit v3 (MS-102-3001), respectively, with an Illumina MiSeq sequencer.

Genome assembly and annotation

Low-quality reads were filtered using the Qiagen CLC Genomics Workbench v.9.0.1 (hereafter called CLC) with the quality score limit set to 0.05 and the other settings left as default. Contigs were obtained using de novo assembly implemented in CLC with the automatic word size and the automatic bubble size being 20 and 50, respectively. Additionally, for each species, a reference-guided assembly was conducted using CLC with the published chloroplast genome of the most closely related species as the reference genome (▶ Table 3). After the reference-guided assembly, a consensus sequence was obtained. Both the consensus sequence from the reference-guided assembly and the contigs from the de novo assembly were imported into Geneious Pro 9.1.4 [25], and then those contigs were mapped onto the consensus sequence. The mapped contigs were manually checked to align them with the consensus sequence obtained using reference-guided assembly [15,26]. The final sequence of the chloroplast genome of each species is the ordered sequence of those mapped contigs. The chloroplast genomes were annotated using Geneious with the chloroplast genome of the most closely related species as the reference (▶ Table 3). All sequence data were submitted to the NCBI under the BioProject PRJNA305670.

Acknowledgements

The authors thank Jonathan Deeds for helpful discussions, Eric Brown for coining the term “GenomeTrakrCP”, and Rahul Pawar, Erich Grundel, Steve Casper, Sue Lutz, and Stefan Lura for assistance with collecting and/or processing the specimens. We deeply appreciate the permission granted to us for collection by Carol Bordelon and Kevin Tunison of the U. S. National Arboretum. We also thank the Office of Dietary Supplement Programs at the FDA CFSAN for direction with plants to sample and both Ella Smith and Manuel Kavekos of the FDAʼs Adverse Reporting system for help with CAERS data. Finally, we thank Lili Fox Vélez for scientific writing support for portions of this manuscript. This study was supported by an ORISE fellowship to Ning Zhang from FDA CFSAN.

Conflict of Interest

The authors declare no conflict of interest.

References

[1] Smith T, Kawa K, Eckl V, Johnson J. Sales of herbal dietary supplements in US increased 7.5% in 2015. HerbalGram 2016; 111: 67–73
[2] Runestad T. Botanical industry faces the music for its adulteration problem. New Hope. Available at http://www.newhope.com/ingredients-general/botanical-industry-faces-music-its-adulteration-problem. Accessed February 03, 2015
[3] Pawar RS, Grundel E. Overview of regulation of dietary supplements in the USA and issues of adulteration with phenethylamines (PEAs). Drug Test Analysis 2017; 9: 500–517
[4] Eischeid AC, Stadig SR, Handy SM, Fry FS, Deeds J. Optimization and evaluation of a method for the generation of DNA barcodes for the identification of crustaceans. LWT-Food Sci Technol 2016; 73: 357–367
[5] Handy SM, Deeds JR, Ivanova NV, Hebert PD, Hanner RH, Ormos A, Weigt LA, Moore MM, Yancy HF. A single-laboratory validated method for the generation of DNA barcodes for the identification of fish for regulatory compliance. J AOAC Int 2011; 94: 201–210
[6] Allard MW, Strain E, Melka D, Bunning K, Musser SM, Brown EW, Timme R. The practical value of food pathogen traceability through building a whole-genome sequencing network and database. J Clin Microbiol 2016; 54: 1975–1983
[7] Kwegyir-Afful EE, DeJager LS, Handy SM, Wong J, Begley TH, Luccioli S. An investigational report into the causes of pine mouth events in US consumers. Food Chem Toxicol 2013; 60: 181–187
[8] Mostin M. Taste disturbances after pine nut ingestion. Eur J Emerg Med 2001; 8: 76
[9] Munk MD. “Pine mouth” syndrome: cacogeusia following ingestion of pine nuts (genus: Pinus). An emerging problem? J Med Toxicol 2010; 6: 158–159
[10] Handy SM, Parks MB, Deeds JR, Liston A, De Jager LS, Luccioli S, Kwegyir-Afful E, Fardin-Kia AR, Begley TH, Rader JI. Use of the chloroplast gene ycf1 for the genetic differentiation of pine nuts obtained from consumers experiencing dysgeusia. J Agric Food Chem 2011; 59: 10995–11002
[11] Handy SM, Timme RE, Jacob SM, Deeds JR. Development of a locked nucleic acid real-time polymerase chain reaction assay for the detection of Pinus armandii in mixed species pine nut samples associated with dysgeusia. J Agric Food Chem 2013; 61: 1060–1066
[12] Ran JH, Wang PP, Zhao HJ, Wang XQ. A test of seven candidate barcode regions from the plastome in Picea (Pinaceae). J Integr Plant Biol 2010; 52: 1109–1126
[13] Techen N, Pan Z, Scheffler BE, Khan IA. Detection of Illicium anisatum as adulterant of Illicium verum. Planta Med 2009; 75: 392–395
[14] Chen J, Zhao J, Erickson DL, Xia N, Kress WJ. Testing DNA barcodes in closely related species of Curcuma (Zingiberaceae) from Myanmar and China. Mol Ecol Resour 2015; 15: 337–348
[15] Zhang N, Erickson DL, Ramachandran P, Ottesen AR, Timme RE, Funk VA, Yan L, Handy SM. An analysis of Echinacea chloroplast genomes: implications for future botanical identification. Sci Rep 2017; 7: 216


[16] Hollingsworth PM, Li DZ, van der Bank M, Twyford AD. Telling plant species apart with DNA: from barcodes to genomes. Philos Trans R Soc Lond B Biol Sci 2016; 371: 20150338
[17] Li X, Yang Y, Henry RJ, Rossetto M, Wang Y, Chen S. Plant DNA barcoding: from gene to genome. Biol Rev Camb Philos Soc 2015; 90: 157–166
[18] Coissac E, Hollingsworth PM, Lavergne S, Taberlet P. From barcodes to genomes: extending the concept of DNA barcoding. Mol Ecol 2016; 25: 1423–1428
[19] Parks M, Cronn R, Liston A. Increasing phylogenetic resolution at low taxonomic levels using massively parallel sequencing of chloroplast genomes. BMC Biol 2009; 7: 84
[20] Yang JB, Tang M, Li HT, Zhang ZR, Li DZ. Complete chloroplast genome of the genus Cymbidium: lights into the species identification, phylogenetic implications and population genetic analyses. BMC Evol Biol 2013; 13: 84
[21] Nock CJ, Waters DL, Edwards MA, Bowen SG, Rice N, Cordeiro GM, Henry RJ. Chloroplast genome sequences from total DNA for plant identification. Plant Biotechnol J 2011; 9: 328–333
[22] Coutinho Moraes D, Still DW, Lum MR, Hirsch AM. DNA-based authentication of botanicals and plant-derived dietary supplements: where have we been and where are we going? Planta Med 2015; 81: 687–695
[23] Deeds JR, Handy SM, Fry F jr., Granade H, Williams JT, Powers M, Shipp R, Weigt LA. Protocol for building a reference standard sequence library for DNA-based seafood identification. J AOAC Int 2014; 97: 1626–1633
[24] Flagel LE, Rapp RA, Grover CE, Widrlechner MP, Hawkins J, Grafenberg JL, Álvarez I, Chung GY, Wendel JF. Phylogenetic, morphological, and chemotaxonomic incongruence in the North American endemic genus Echinacea. Am J Bot 2008; 95: 756–765
[25] Kearse M, Moir R, Wilson A, Stones-Havas S, Cheung M, Sturrock S, Buxton S, Cooper A, Markowitz S, Duran C. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 2012; 28: 1647–1649
[26] Zhang N, Wen J, Zimmer EA. Congruent deep relationships in the grape family (Vitaceae) based on sequences of chloroplast genomes and mitochondrial genes via genome skimming. PLoS One 2015; 10: e0144701
