Xu et al. BMC Biology (2020) 18:63 https://doi.org/10.1186/s12915-020-00795-3

RESEARCH ARTICLE Open Access

Tandem gene duplications drive divergent evolution of caffeine and crocin biosynthetic pathways in plants

Zhichao Xu1,2†, Xiangdong Pu1†, Ranran Gao1, Olivia Costantina Demurtas3, Steven J. Fleck4, Michaela Richter4, Chunnian He1,2, Aijia Ji1, Wei Sun5, Jianqiang Kong6, Kaizhi Hu7, Fengming Ren1,7, Jiejie Song8, Zhe Wang6, Ting Gao8, Chao Xiong5, Haoying Yu1, Tianyi Xin1, Victor A. Albert4,9, Giovanni Giuliano3*, Shilin Chen2,5* and Jingyuan Song1,2,10*

Abstract

Background: Plants have evolved a panoply of specialized metabolites that increase their environmental fitness. Two examples are caffeine, a purine psychotropic alkaloid, and crocins, a group of glycosylated apocarotenoid pigments. Both classes of compounds are found in a handful of distantly related genera (Coffea, Camellia, Paullinia, and Ilex for caffeine; Crocus, Buddleja, and Gardenia for crocins) wherein they presumably evolved through convergent evolution. The closely related Coffea and Gardenia genera belong to the Rubiaceae family and synthesize, respectively, caffeine and crocins in their fruits.

Results: Here, we report a chromosomal-level genome assembly of Gardenia jasminoides, a crocin-producing species, obtained using Oxford Nanopore sequencing and Hi-C technology. Through genomic and functional assays, we completely deciphered, for the first time in any plant, the dedicated pathway of crocin biosynthesis. Through comparative analyses with Coffea canephora and other eudicot genomes, we show that Coffea caffeine synthases and the first dedicated gene in the Gardenia crocin pathway, GjCCD4a, evolved through recent tandem gene duplications in the two different genera, respectively. In contrast, genes encoding later steps of the Gardenia crocin pathway, ALDH and UGT, evolved through more ancient gene duplications and were presumably recruited into the crocin biosynthetic pathway only after the evolution of the GjCCD4a gene.

* Correspondence: [email protected]; [email protected]; [email protected]
†Zhichao Xu and Xiangdong Pu contributed equally to this work.
3Italian National Agency for New Technologies, Energy and Sustainable Economic Development (ENEA), Casaccia Res. Ctr, 00123 Rome, Italy
2Engineering Research Center of Chinese Medicine Resource, Ministry of Education, Beijing 100193, China
1Key Lab of Chinese Medicine Resources Conservation, State Administration of Traditional Chinese Medicine of the People’s Republic of China, Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100193, China
Full list of author information is available at the end of the article

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Conclusions: This study shows duplication-based divergent evolution within the coffee family (Rubiaceae) of two characteristic secondary metabolic pathways, caffeine and crocin biosynthesis, from a common ancestor that possessed neither complete pathway. These findings provide significant insights on the role of tandem duplications in the evolution of plant specialized metabolism.

Keywords: Crocin biosynthesis, Caffeine biosynthesis, Gardenia jasminoides, Coffea canephora, Genomics, Carotenoid cleavage dioxygenases, Aldehyde dehydrogenases, UDP-glucosyltransferases, N-methyltransferases

Background

Flowering plants have evolved a diverse array of secondary metabolites to repel pathogens and predators, attract pollinators, and drive ecosystem functions. In many cases, the genomic context for the evolution of specialized plant compounds involves tightly linked clusters of genes, usually containing nonhomologous gene families, that together control novel biosynthetic pathways [1–3]. A few important metabolic clusters involve only tandem duplicates within single gene families, such as the N-methyltransferase (NMT) genes that control caffeine biosynthesis in the coffee plant [4], and the cytochrome P450 genes encoding the 2,4-dihydroxy-7-methoxy-1,4-benzoxazin-3-one (DIMBOA) metabolic cluster of maize, which produces an important defense compound [5, 6]. Tandem gene duplicate clusters originally arise as copy number variants (CNVs) in populations that later become fixed within species by evolving split or novel functions [7]. Given the ongoing nature of CNV production during evolution, genome sequencing of closely related plants harboring distinct secondary metabolite profiles holds great promise for understanding the stepwise evolution of important tandem duplicate clusters.

The Gardenia genus, which is among the most commonly grown horticultural plants worldwide and is valued for the strong, sweet fragrance of its flowers, belongs to the family Rubiaceae. In this large family of angiosperms, only the Coffea canephora (robusta coffee) genome has been sequenced to date [4]. The Chinese species Gardenia jasminoides (gardenia) has been cultivated for at least 1000 years and was introduced to Europe and America in the mid-eighteenth century. The fruits of G. jasminoides, whose major bioactive constituents are genipin and crocins, were used as an imperial dye for royal costumes during the Qin and Han dynasties in China and are recorded in the Chinese Pharmacopoeia [8, 9]. Unlike coffee, gardenia does not accumulate caffeine. However, in a pattern similar to the scattered instances of convergent caffeine biosynthesis among several angiosperm families [4, 10], crocins are found in the flowers of the distantly related plants Buddleja davidii (Buddlejaceae) and in the stigmas of Crocus sativus (saffron) (Iridaceae) (Fig. 1a).

Apocarotenoids, derived from carotenoids by oxidative cleavage [11–13], play crucial roles in plants as signaling molecules, flower and fruit pigments, and regulators of membrane fluidity [14]. They have been reported to have anticancer, anti-inflammatory, and anti-diabetic activities and to be beneficial in the treatment of central nervous system and cardiovascular diseases [15, 16]. Crocin biosynthesis in saffron stigmas is initiated by carotenoid cleavage dioxygenase 2 (CsCCD2), which cleaves zeaxanthin to produce crocetin dialdehyde [17]. The aldehyde dehydrogenase CsALDH3I1 and the UDP-glucosyltransferase CsUGT74AD1 perform, respectively, the dehydrogenation of crocetin dialdehyde to crocetin and its glycosylation to crocins 1 and 2′ [18]. The UGT mediating the formation of more highly glycosylated crocins is still uncharacterized [18]. In Buddleja flowers, only the zeaxanthin cleavage step has been characterized, and it is mediated by BdCCD4.1 and BdCCD4.3 [19]. Thus, it appears that crocin biosynthesis in eudicots (Buddleja) and monocots (Crocus) has evolved through the convergent evolution of different CCD subclasses (CCD4 and CCD2, respectively) that have acquired the capacity to cleave zeaxanthin at the 7/8,7′/8′ positions to produce crocetin dialdehyde.

In G. jasminoides, crocins are accumulated in green and red fruits (Fig. 1b). The Gardenia crocin biosynthesis pathway has not yet been elucidated, in spite of the availability of transcriptome data [20]. Two G. jasminoides UGTs, GjUGT94E5 and GjUGT75L6, are able to catalyze the two-step conversion of crocetin into crocins in vitro [21]. However, the expression profiles of the corresponding genes are not consistent with their proposed role in crocin biosynthesis in G. jasminoides fruits [20]. Additionally, the lack of genomic information for crocin-producing species hampers the elucidation of the mechanisms underlying the molecular evolution of crocin biosynthesis. Recently, sequencing of the C. canephora (Rubiaceae) and Camellia sinensis (Theaceae) genomes has shown that the synthesis of caffeine, a purine alkaloid, has evolved independently in the two genera through tandem duplication and neofunctionalization of different N-methyltransferase (NMT) ancestral genes [4].

Here, we report a chromosome-level assembly of the highly heterozygous G. jasminoides genome, using a combination of Illumina short reads, Oxford Nanopore (ONT) long reads, and Hi-C scaffolding. The genes involved in the crocin biosynthesis pathway were identified through functional assays, and the molecular evolution of crocin and caffeine biosynthesis in the Rubiaceae family was clarified through comparative genomic studies.

Results

Chromosome-level assembly of the G. jasminoides genome (Additional file 1)

The size of the G. jasminoides genome was predicted to be 550.6 ± 9 Mb (± SD) based on flow cytometry and 547.5 Mb based on 17 k-mer distribution analysis and to show a very high level (about 2.2%) of heterozygosity (Additional file 2: Fig. S1). A highly collapsed and fragmented genome was assembled by ALLPATHS-LG using Illumina shotgun reads (293× coverage) (Additional file 2: Table S1). This assembly was 635.6 Mb in size (28% of N bases) and was composed of 58,859 scaffolds (N50, 60.6 kb) (Additional file 2: Table S3).

To improve the contiguity of the assembly, we generated 2,675,530 Oxford Nanopore (ONT) long reads with an N50 of 21.6 kb. The longest read was 366.8 kb, and the genome coverage was about 60× (Additional file 2: Table S2). After testing different de novo assembly pipelines, we identified a package (Canu-SMARTdenovo-3×Pilon) that yielded satisfactory results (Additional file 2: Table S3). A contiguous assembly of 677.9 Mb with a contig N50 of 703.1 kb was produced, and the longest contig in the assembly was 11.7 Mb. Since the assembly size was larger than the predicted genome size, we ran Purge Haplotigs to collapse the highly heterozygous regions. The purged assembly was 534.1 Mb long with a contig N50 of 1.0 Mb (Additional file 2: Table S3). The completeness of the genome assembly was assessed with the BUSCO pipeline, which found 95.0% complete BUSCOs, of which 92.7% were single-copy and 2.3% duplicated (Additional file 2: Table S3).

The contiguity of the assembly was further improved using Hi-C scaffolding. A total of 56,933,122 paired-end reads representing valid interactions were used to scaffold 99.5% of the assembly into 11 chromosomes using the Lachesis package [22] (Fig. 2a). The final corrected chromosome-level genome was 535 Mb in size, 531 Mb of which were assembled in the 11 chromosomes (Table 1). The G. jasminoides chromosomes showed a significant level of synteny with their C. canephora counterparts, with a limited number of translocations between the two genomes (Fig. 2b, Additional file 2: Fig. S2). In addition, a 154,919-bp chloroplast genome and a 640,334-bp mitochondrial genome of G. jasminoides were assembled and identified.

Genome annotation and phylogenetic analysis

The G. jasminoides genome comprises 35,967 protein-coding genes (Table 1). Consistent with the assessment of genome assembly quality, orthologs of 96.5% of eukaryotic BUSCOs were identified in the G. jasminoides gene sets (Table 1). Transposable elements (TEs) account for approximately 54.0% (288,723,343 nt) of the G. jasminoides genome (Additional file 2: Table S4, 5, 6), and 62.2% of these TEs are long terminal repeat (LTR) elements. We identified 1798 full-length LTR elements

Fig. 1 Crocin biosynthesis. a Simplified angiosperm phylogenetic tree. Crocin-synthesizing genera are shown in red (Gardenia), green (Buddleja), and purple (Crocus), and caffeine-synthesizing genera (Coffea) in brown. The red dot marks the divergence of Coffea and Gardenia. The divergence time between G. jasminoides and C. canephora was estimated at approximately 20.69 MYA. b Crocin accumulation in different G. jasminoides organs. R, root; St, stem; L, leaf; Fl, flower; Frl, fruitlet; GF, green fruit; RF, red fruit; Sa, sarcocarp

Fig. 2 The G. jasminoides genome. a Chromosome-level assembly of the G. jasminoides genome using Hi-C technology. b Synteny between C. canephora and G. jasminoides chromosomes. The positions of NMT genes, catalyzing caffeine biosynthesis in Coffea, and of CCD, ALDH, and UGT genes, catalyzing crocin biosynthesis in Gardenia, are marked
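As a reading aid for the contiguity figures quoted in this section, the N50 statistic is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The following is a minimal Python sketch of that standard definition; the function name and the toy lengths are illustrative assumptions and are not taken from the assembly pipeline used in this study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    summed length of all sequences.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (hypothetical contig lengths, arbitrary units).
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_contigs)
```

In the toy example the total length is 32, so the two longest contigs (8 + 8) already reach half of the assembly and the N50 is 8. The same calculation applies to reads, contigs, or scaffolds; only the input list changes.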

including 709 Gypsy and 403 Copia elements. These elements have an average insertion time of 1.4 million years ago (MYA) assuming a mutation rate of μ = 1.3 × 10⁻⁸ (per bp per year) [23, 24] (Additional file 2: Fig. S3), and Gypsy elements are much younger than Copia elements (1.2 MYA vs 1.7 MYA, P < 0.05). Most 5s rRNAs are tandemly arrayed in Chr 11 of the G. jasminoides genome, and the tandem repeats of 18s rRNAs (average size of 1.8 kb), 5.8s rRNA (average size of 154 bp), and 28s rRNAs (average size of 6.3 kb) were clustered in Gardenia 10, spanning 65.9 kb (Additional file 2: Table S7, 8).

A total of 34,662 orthologous gene groups were found for G. jasminoides and 10 additional angiosperms, covering 335,254 genes in total. Among these, 155,220 genes clustering into 6671 groups were conserved in all plants examined, and 1248 gene families containing 5123 genes appeared to be unique to G. jasminoides (Additional file 2: Fig. S4A). Eight hundred ninety-one gene families were expanded, and 2666 gene families were contracted in the

Table 1 Metrics of the final assembly and annotation of the G. jasminoides genome

Metric                                Value
Assembly size                         535 Mb
N50                                   44 Mb
No. of chromosomes                    11
Assembly in chromosomes               531 Mb
Assembly in unanchored contigs        4 Mb
BUSCOs in assembly                    95.8% (S, 92.7%; D, 2.3%; F, 0.8%)
No. of genes                          35,967
No. of genes in chromosomes           35,779
No. of genes in unanchored contigs    188
BUSCOs in annotation                  96.5% (S, 87.9%; D, 2.1%; F, 6.5%)
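The LTR insertion times given in the genome annotation text depend directly on the assumed mutation rate. A commonly used estimate, assumed here for illustration because the exact procedure is not restated in this section, is T = K / (2μ), where K is the nucleotide divergence between the two LTRs of a single element and μ is the substitution rate per bp per year. The divergence value in the sketch below is hypothetical and was chosen only to reproduce an age of 1.4 MYA at μ = 1.3 × 10⁻⁸.

```python
def ltr_insertion_time(divergence, mu=1.3e-8):
    """Estimate LTR insertion time in years as T = K / (2 * mu).

    divergence: substitutions per site between the 5' and 3' LTR of one
                element (assumed to be already corrected for multiple hits).
    mu:         substitution rate per bp per year (1.3e-8, as quoted in the text).
    """
    if divergence < 0 or mu <= 0:
        raise ValueError("divergence must be >= 0 and mu must be > 0")
    return divergence / (2 * mu)


# Hypothetical divergence between the two LTRs of one element.
example_k = 0.0364
example_age_mya = ltr_insertion_time(example_k) / 1e6
```

The factor of 2 reflects that both LTRs accumulate substitutions independently after insertion. With the hypothetical divergence above, the estimated age is about 1.4 MYA; halving μ would double every estimated age.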

G. jasminoides lineage (Additional file 2: Fig. S4A). G. jasminoides-specific and expanded gene families comprised glycoside hydrolases (PF00232) and UDP-glucosyl/glucuronosyl transferases (PF00201), which might be related to the metabolism of bioactive compounds such as crocins and geniposides (Additional file 2: Table S9). Based on phylogenomic analysis of 121 single-copy genes, the divergence time between G. jasminoides and C. canephora was estimated at approximately 20.7 MYA (Fig. 1a, Additional file 2: Fig. S4B).

Identification of G. jasminoides candidate crocin biosynthetic genes (Additional file 3)

The mature fruit of G. jasminoides exhibits a visible reddish or brown color, due to the presence of crocins, crocetin, and geniposides [21, 25, 26], which are found in the mature pericarp and sarcocarp of the fruit (Fig. 1b). We collected samples from three different fruit developmental stages: fruitlet with a green pericarp and immature sarcocarp, green fruit with a green pericarp and red sarcocarp, and red fruit with red pericarp and red sarcocarp. Analysis of different organs and tissues, including the root, stem, leaf, flower, fruitlet, green fruit, and red fruit, indicated that crocins are specifically localized in green and red fruits, whose sarcocarps have matured (Fig. 1b, Additional file 2: Fig. S5-S6). Transcriptome reads from seven G. jasminoides organs were mapped to the assembled genome and annotated gene loci to calculate gene expression (fragments per kilobase of exon per million reads mapped, FPKM). Genome-wide analysis of the G. jasminoides assembly identified fourteen CCD genes that might be involved in the tailoring of carotenoids for apocarotenoid formation (Additional file 2: Fig. S7, Additional file 2: Table S10). The GjCCD4 and GjCCD8 gene subfamilies were expanded in the G. jasminoides lineage (Additional file 2: Table S10). GjCCD4a was highly expressed in green and red fruits (FPKM > 1000), in accordance with the distribution of crocins, while GjCCD4a, GjCCD4c, and GjCCD4d were highly expressed in flowers (Additional file 2: Fig. S10, Additional file 2: Table S11). GjCCD4a (ARU08109.1) showed high amino acid sequence similarity to GjCCD4b (81%), GjCCD4c (81%), and GjCCD4d (77%), and the four genes were located on a single gene cluster on chr 9 (see below).

Eighteen G. jasminoides genes with similarity to aldehyde dehydrogenases (ALDHs), which catalyze the oxidation of aldehydes [27], were identified and classified into 10 distinct subfamilies (Additional file 2: Fig. S8, Additional file 2: Table S12). In C. sativus, CsALDH3I1 is known to mediate with high efficiency the dehydrogenation of crocetin dialdehyde to crocetin [18]; however, expression of GjALDH3 genes was low in the fruit and thus inconsistent with the crocin distribution in G. jasminoides organs. Instead, GjALDH2C3 (KY631926.1) was highly expressed in green and red fruits and flowers (Additional file 2: Fig. S10, Additional file 2: Table S12).

Two hundred thirty-seven UGT genes were identified in the G. jasminoides genome and were classified into 19 subfamilies (Additional file 2: Fig. S9, Additional file 2: Table S13). More than 50% are members of the UGT79, UGT73, UGT85, and UGT94 groups. Eleven UGT genes were significantly expressed in green and red fruits (Fig. S10, Additional file 2: Table S13).

Elucidation of the G. jasminoides crocin biosynthetic pathway

We used expression in E. coli to test the activity of the candidate crocin biosynthetic genes identified by expression analysis. A strain co-transformed with pET32a-CCD4a and the zeaxanthin accumulation plasmid pACCAR25ΔcrtX showed discoloration, and a new product with a retention time and characteristic fragment ions ([M + H]+: m/z 297.1939) identical to that of the crocetin dialdehyde standard was observed (Fig. 3a, Additional file 2: Fig. S11). In both Crocus and Buddleja, the substrate for crocetin dialdehyde formation is zeaxanthin [17, 19, 28] (Fig. 1b). Since lycopene and β-carotene are also possible substrates, we expressed GjCCD4a in E. coli strains accumulating these carotenoids. In both cases, we detected the formation of crocetin dialdehyde (Fig. 3a) and, in the case of β-carotene, also of the intermediate product 8′-apo-β-carotenal (Fig. 3a, Additional file 2: Fig. S12, Additional file 2: Table S14). In contrast, none of the other Gardenia or Coffea CCD4 proteins was able to cleave zeaxanthin or β-carotene (Additional file 2: Fig. S13). Protein structure modeling and docking analysis of GjCCD4a suggested the capacity to bind all three substrates, in accordance with its catalytic activity (Additional file 2: Fig. S14).

Next, we tested the ability of the purified GjALDH2C3 protein to catalyze in vitro the oxidation of crocetin dialdehyde. Incubation of crocetin dialdehyde with GjALDH2C3 yielded two new peaks, and the retention time (7.84 min) and characteristic fragment ions ([M + H]+: m/z 327.1605) of the most abundant peak were consistent with those of the crocetin standard. In addition, the intermediate product crocetin semialdehyde (8.68 min and [M + H]+: m/z 311.1686) was also detected (Fig. 3b, Additional file 2: Fig. S15).

As noted above, a large number of UGT genes are expressed in Gardenia fruits, and these are likely involved in the synthesis of several classes of glycosylated compounds in these organs. The enzymatic activities of twelve candidate UGTs were characterized in vitro using crocetin as a substrate (marked with an asterisk in Additional file 2: Table S13). Three UGTs, namely GjUGT74F8, GjUGT75L6, and GjUGT94E13, were able to glycosylate crocetin and/or crocins (Fig. 3c, d, Additional file 2: Fig. S16-S21). Of these, only two (GjUGT74F8 and

Fig. 3 Elucidation of the crocin biosynthesis pathway in G. jasminoides. a UPLC-DAD chromatograms (abs at 440 nm) of E. coli extracts expressing GjCCD4a. b–d UPLC-DAD chromatograms (abs at 440 nm) of in vitro reactions catalyzed by GjALDH2C3 (b), GjUGT74F8 (c), and GjUGT94E13 (d). St, standards; EV, empty vector; lyc, lycopene; β-car, β-carotene; zea, zeaxanthin; cro, crocetin; CD, crocetin dialdehyde; CS, crocetin semialdehyde; CrI–V, crocins I–V
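The [M + Na]+ values reported for the crocins in this section can be sanity-checked from elemental composition. The sketch below assumes that each glucosylation adds one C6H10O5 unit to crocetin (C20H24O4) and uses standard monoisotopic masses. The glucose counts per crocin (one for crocin V, two for crocins III and IV, four for crocin I) are inferred here from the reported masses and should be read as assumptions, not as a restatement of the analytical method used in this study.

```python
# Monoisotopic masses (unified atomic mass units).
MASS = {"C": 12.0, "H": 1.00782503, "O": 15.99491462}
SODIUM = 22.98976928
ELECTRON = 0.00054858

CROCETIN = {"C": 20, "H": 24, "O": 4}      # C20H24O4
GLUCOSYL_UNIT = {"C": 6, "H": 10, "O": 5}  # net gain per glucose added


def sodium_adduct_mz(n_glucose):
    """Theoretical m/z of [M + Na]+ for crocetin carrying n_glucose glucoses."""
    neutral = sum(
        (CROCETIN[el] + n_glucose * GLUCOSYL_UNIT[el]) * MASS[el]
        for el in MASS
    )
    # A singly charged sodium adduct: add Na, remove one electron.
    return neutral + SODIUM - ELECTRON


# (assumed glucose count, m/z reported in the text)
REPORTED = {
    "crocin V": (1, 513.2090),
    "crocin III": (2, 675.2620),
    "crocin IV": (2, 675.2610),
    "crocin I": (4, 999.3699),
}

# Difference between reported and theoretical m/z, in parts per million.
ppm_error = {
    name: (observed - sodium_adduct_mz(n)) / sodium_adduct_mz(n) * 1e6
    for name, (n, observed) in REPORTED.items()
}
```

Under these assumptions, all four reported values fall within a few ppm of the theoretical sodium adducts, which is consistent with the compound assignments given in the text.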

GjUGT94E13) were expressed at high levels in fruits, suggesting that they are involved in crocin production, while the third, described previously [21], was not significantly expressed in these organs (Additional file 2: Fig. S10).

GjUGT74F8, which is the most highly expressed among all UGTs (Additional file 2: Fig. S10, Additional file 2: Table S13), is similar to CsUGT74AD1, which is responsible for crocin primary glycosylation in C. sativus [18, 29] (Additional file 2: Fig. S22). Upon in vitro incubation of GjUGT74F8 with crocetin, two new peaks with retention times of 4.96 and 6.30 min were detected. The newly formed products had the same fragmentation patterns as crocin III ([M + Na]+, 675.2620) and crocin V ([M + Na]+, 513.2090), respectively, demonstrating that GjUGT74F8 has the ability to add one or two β-D-glucoses to the carboxyl groups of crocetin (primary glycosylation) (Fig. 3c, Additional file 2: Fig. S20). In addition, reversible conversion of crocin II to crocin IV and vice versa was observed, suggesting that GjUGT74F8 also possesses a hydrolytic activity, able to remove a single β-D-glucose (Fig. 3c, Additional file 2: Fig. S20).

In vitro incubation of crocetin with purified GjUGT94E13 resulted in the generation of two new products with retention times of 4.22 and 5.75 min, respectively. These two compounds had mass spectra with [M + Na]+ peaks at m/z 999.3699 and 675.2610, respectively, consistent with those of crocin I and crocin IV (Fig. 3d, Additional file 2: Fig. S16). We also found that crocin IV was gradually converted to crocin I by the extension of the incubation time with GjUGT94E13 (Fig. 3d, Additional file 2: Fig. S16) and that GjUGT94E13 could efficiently catalyze the glycosylation of crocins II and III to crocin I (Fig. 3d, Additional file 2: Fig. S17, S18). In no case was generation of a crocin with a single β-D-glucosyl ester detected under different reaction conditions, suggesting that GjUGT94E13 could catalyze either the addition of a second glucosyl group to a pre-existing one (secondary glycosylation) or the sequential addition of two glucosyl groups to a carboxyl group (primary and secondary glycosylation).

Based on these results, the complete crocin biosynthesis pathway in G. jasminoides is depicted in Fig. 4. Lycopene, β-carotene, and zeaxanthin are cleaved at the 7/8,7′/8′ positions by GjCCD4a, yielding crocetin dialdehyde, which is then converted to crocetin by GjALDH2C3. Crocetin is the substrate of two UGTs, GjUGT74F8 and GjUGT94E13, adding respectively one or two glucose esters to the two carboxylic groups.

Evolution of crocin and caffeine biosynthesis genes in the Rubiaceae

Synteny analysis demonstrated clearly that Coffea caffeine synthases and Gardenia CCD4a are localized, respectively, in tandemly repeated arrays that are unique to each of the two species, suggesting the two gene duplication series occurred after the Coffea-Gardenia divergence (Fig. 5a, Additional file 2: Fig. S23-24). The Coffea caffeine synthase gene cluster is syntenic to non-caffeine-producing NMTs in the Gardenia genome (Fig. 5a, Additional file 2: Fig. S23), while the close Gentianales relative Gelsemium (Gelsemiaceae) does not even contain NMTs in the homologous chromosomal regions. The latter observation suggests that the

Fig. 4 The crocin biosynthesis pathway in G. jasminoides

common ancestral genomic block in all gentianalean species lacked an NMT gene and that the ancestor of the future caffeine synthase cluster translocated there during Rubiaceae evolution, where it duplicated in Coffea only after the divergence of Coffea and Gardenia (Fig. 5b, Additional file 2: Fig. S24). Conversely, the Gardenia gene cluster comprising GjCCD4a-b-c-d is syntenic to a unique CCD4 gene in the coffee genome (Fig. 5a, Additional file 2: Table S15), yielding an inverted evolutionary scenario wherein local duplications in Gardenia that occurred after Coffea-Gardenia common ancestry led to the production of distinct bioactive compounds. In contrast, the downstream crocin biosynthesis genes GjALDH2C3, GjUGT74F8, and GjUGT94E5 are localized in gene clusters that appear conserved between Coffea and Gardenia (Additional file 2: Fig. S24), suggesting that they may have been generated before the split of the two genera.

Phylogenetic analyses of NMTs related to caffeine synthases and CCDs related to crocin synthase confirmed this evidence based on synteny. Caffeine biosynthetic enzymes belong to a lineage of NMTs that is specific to Rubiaceae, but whose members duplicated and specifically evolved caffeine biosynthetic function only in Coffea (Fig. 5b, Additional file 2: Fig. S26, Additional file 4: Fig. S27). Similarly, the CCD4 subfamily wherein Gardenia’s crocin synthase resides is Rubiaceae-specific, but not tandemly duplicated in Coffea as it is in Gardenia (Fig. 5b, Additional file 5: Fig. S28). The UGT74 and UGT94 gene subfamilies in both Rubiaceae species are contained within distinct subclades in a large UGT family tree, each grouping with related genes from other Gentianales in sublineages likely to be of ancient origin (Additional file 6: Fig. S29). Similarly, the ALDH2 genes of Coffea and Gardenia are close relatives in a subclade also comprising genes from other Gentianales species (Additional file 7: Fig. S30).

Fig. 5 Coffea caffeine synthases and Gardenia crocin biosynthesis dioxygenase arose through genus-specific tandem duplications. a Microsynteny between Coffea and Gardenia around the caffeine synthases of the former and the first dedicated gene in crocin biosynthesis in Gardenia, GjCCD4a. b Simplified phylogenies (pruned from the trees provided in Figs. S27-S28) depicting the evolution of caffeine and crocin biosynthetic genes in Gentianales. Purple branches indicate clades that are Gentianales-specific; green, Rubiaceae-specific; red, Coffea-specific. The gene IDs used for the phylogenetic trees of NMTs and CCD are the following: CcNMT2 (Cc02_g09350), CcNMT4b (Cc09_g07000), CcNMT4a (Cc09_g06990), CcMTL (Cc09_g06950), CcNMT3 (Cc09_g06960), CcXMT (Cc09_g06970), CcDXMT (Cc01_g00720), CcMXMT (Cc00_g24720), GjNMT2 (Gj9A1032T108), GjNMT4a (Gj1A458T26), GjNMT4b (Gj1X458T25), GjNMT4c (Gj1A458T28), SlCCD4a (Solyc08g075480.2.1), SlCCDb (Solyc08g075490.2.1), CcCCD4 (Cc08_g05610), GjCCD4a (Gj9A597T69), GjCCD4b (Gj9A597T68), GjCCD4c (Gj9A597T67), GjCCD4d (Gj9P597T6), GsCCD4 (Gs_scaff312_1.38), CrCCD4a (CRO_T140281), CrCCD4b (CRO_T140282), CrCCD4c (CRO_T140277), and CgCCD4 (cal_g006885.t1). Gene IDs for those with the biochemical activities in question are shown in red. Gardenia gene IDs are shown in orange. All other gene IDs are shown in black. For the collapsed tree branches in the pruned trees, species abbreviations are as follows: At, Arabidopsis thaliana; Vv, Vitis vinifera; Sl, Solanum lycopersicum; Gs, Gelsemium sempervirens; Cr, Catharanthus roseus; Cg, Calotropis gigantea; Cc, Coffea canephora; Gj, Gardenia jasminoides; Bd, Buddleja davidii
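The notion of a tandemly duplicated array used in Fig. 5 can be illustrated with a small sketch that groups same-family genes lying next to each other along a chromosome. The gene order indices, the family labels, the extra example genes, and the `max_intervening` threshold below are hypothetical placeholders; they are not values from the G. jasminoides annotation, and the function is not the synteny method used in this study.

```python
from collections import defaultdict


def tandem_arrays(genes, max_intervening=1):
    """Group genes of one family that sit close together on one chromosome.

    genes: iterable of (gene_name, chromosome, order_index, family), where
           order_index is the rank of the gene along the chromosome.
    max_intervening: largest number of unrelated genes tolerated between
           two consecutive members of an array (an assumed threshold).
    Returns a list of (chromosome, family, [gene_name, ...]) with >= 2 members.
    """
    grouped = defaultdict(list)
    for name, chrom, index, family in genes:
        grouped[(chrom, family)].append((index, name))

    arrays = []
    for (chrom, family), members in grouped.items():
        members.sort()
        run = [members[0]]
        for previous, current in zip(members, members[1:]):
            if current[0] - previous[0] - 1 <= max_intervening:
                run.append(current)
            else:
                if len(run) > 1:
                    arrays.append((chrom, family, [n for _, n in run]))
                run = [current]
        if len(run) > 1:
            arrays.append((chrom, family, [n for _, n in run]))
    return arrays


# Hypothetical gene order: four adjacent CCD4 copies in Gardenia, one in Coffea.
example_genes = [
    ("GjCCD4d", "Gj_chr9", 100, "CCD4"),
    ("GjCCD4c", "Gj_chr9", 101, "CCD4"),
    ("GjCCD4b", "Gj_chr9", 102, "CCD4"),
    ("GjCCD4a", "Gj_chr9", 103, "CCD4"),
    ("GjCCD1", "Gj_chr2", 5, "CCD1"),
    ("CcCCD4", "Cc_chr8", 50, "CCD4"),
]
example_arrays = tandem_arrays(example_genes)
```

With these placeholder coordinates, the only array reported is the four-member CCD4 block on the Gardenia chromosome, while the single Coffea CCD4 copy is left ungrouped, mirroring the one-to-four relationship described in the text.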

Discussion

Chromosome-level assembly of a highly heterozygous genome

Many plants including medicinal ones such as Salvia miltiorrhiza [30], Panax ginseng [31], Panax notoginseng [32, 33], and Glycyrrhiza uralensis [34] are highly heterozygous, resulting in fragmented genome assemblies when using traditional short read methods. G. jasminoides, the second plant of Rubiaceae with a sequenced genome, is self-incompatible with high (2.2%) heterozygosity. We developed an efficient method to produce chromosome-level assemblies for heterozygous plant genomes using a combination of short Illumina and long ONT reads and Hi-C scaffolding [22, 35, 36]. The assemblies constructed using short reads were extremely fragmented, while the combination of short and ONT reads with a Canu-SMARTdenovo assembly pipeline produced a G. jasminoides assembly with the highest contiguity. The significant difference in the number of annotated repeat sequences between the ONT (53.97%) and Illumina (36.51%)-based assemblies indicates the importance of genome contiguity for annotation of repetitive elements (Additional file 2: Table S3). However, the ONT-based assembly size was much larger than the genome size predicted by k-mer distribution and flow cytometry, suggesting that highly divergent haplotypes were assembled separately. This was confirmed and corrected by haplotig purging, whereafter the assembly was scaffolded into chromosomal pseudomolecules using Hi-C. Given the low cost of ONT long reads, the pipeline described here provides an extremely cost-effective method for producing chromosome-level assemblies of highly heterozygous plant genomes.

Characterization of the Gardenia crocin biosynthetic pathway

In a previous transcriptomic study [20], we identified several G. jasminoides unigenes bearing high sequence similarity to candidate genes for crocin biosynthesis (Additional file 2: Table S16). However, the low number of tissues analyzed by RNA-Seq (3 versus 7 in the present study) and the high level of fragmentation of the unigenes prevented the unambiguous identification of bona fide candidates for all the steps through co-expression analysis, as well as the expression in E. coli of the corresponding full-length proteins for functional assays. Based on genome-wide candidate gene identification, expression studies, and in vitro functional studies, three of the 7 previously identified unigenes (GjCCD4a, GjALDH2C3, and UGT94E13), but also a novel UGT74F8 gene, which went undetected in the previous transcriptome analysis, were shown to be involved in crocin biosynthesis in G. jasminoides.

In C. sativus, zeaxanthin is cleaved symmetrically at the 7/8,7′/8′ positions by CsCCD2 to produce crocetin dialdehyde [17, 28], while, in B. davidii, the same reaction is carried out by BdCCD4.1 and BdCCD4.3 [19]; no activity of CsCCD2 and/or BdCCD4.1/BdCCD4.3 on other carotenoids, including β-carotene and lycopene, has been reported. GjCCD4a from G. jasminoides shares only 31% identity with CsCCD2, but 56 and 59% identity with BdCCD4.1 and BdCCD4.3, respectively, and can catalyze the symmetric 7/8,7′/8′ cleavage of zeaxanthin, β-carotene, and lycopene to produce crocetin dialdehyde. Thus, crocetin biosynthesis evolved in different plant taxa through convergent evolution, since the carotenoid cleavage step is catalyzed by a CCD2 in Crocus [17], but by CCD4 enzymes in Buddleja [19] and Gardenia (this paper). Convergent evolution also underlies the second step of crocin biosynthesis, which is mediated by an ALDH3 in C. sativus [18] and by an ALDH2 in G. jasminoides (this paper).

Two Gardenia UGTs, GjUGT75L6 and GjUGT94E5, were previously shown to catalyze the conversion of crocetin to crocins in vitro [21]. However, their low expression in fruits suggests that they are not the primary enzymes involved in crocin biosynthesis in vivo. We identified two novel UGT genes, GjUGT74F8 and GjUGT94E13, closely related to GjUGT75L6 and GjUGT94E5, that are highly expressed in fruits and show co-expression with GjCCD4a. GjUGT74F8 was the most highly expressed UGT gene in mature fruit, and the protein product catalyzed the primary glycosylation of crocetin, the same reaction catalyzed by GjUGT75L6 [21] and by Crocus UGT74AD1 [18]. Similar to the latter, GjUGT74F8 did not exhibit a secondary glycosylation activity. GjUGT74F8 was also found to catalyze de-glycosylation reactions. Reversible glycosylation has been previously described in other plant UGTs [37]. GjUGT94E13 catalyzed primary and secondary glycosylation of crocetin to form crocins IV and I, presumably via the sequential addition of two β-D-glucosyl esters. Crocins V and II were undetectable when crocetin was incubated with GjUGT94E13, suggesting that the secondary glycosylation step occurred much faster than the primary glycosylation one. Incubation of crocin II or crocin III with GjUGT94E13 resulted in the complete conversion to crocin I, confirming that GjUGT94E13 is able to add a second glucosyl moiety to a mono-glycosyl group (secondary glycosylation).

Molecular evolution of caffeine and crocin biosynthesis in the Rubiaceae

The chromosome-level Gardenia genome assembly reported here and its comparative analysis with the closely related Coffea genome [4] allowed a detailed reconstruction of the molecular events that led to the evolution of the caffeine and crocin pathways in the two genera. The NMT gene cluster involved in caffeine biosynthesis in C. canephora shows synteny to a region in G. jasminoides that also contains NMTs, but ones that predate the evolution of the caffeine synthase cluster characteristic of coffee. Conversely, the first dedicated gene in crocin biosynthesis, GjCCD4a, is part of a 4-gene cluster for which, in coffee, there is only one ortholog (Cc08_g05610), showing the highest identity (80.8%) to GjCCD4c. The two genes are highly expressed in flowers in the two species, suggesting that they may be responsible for carotenoid cleavage in the white flowers of G. jasminoides and C. canephora, followed by volatile apocarotenoid formation [38]. Additionally, neither the CcCCD4 nor the GjCCD4b-c proteins exhibit 7/8,7′/8′ cleavage activity against any carotenoid tested. These findings suggest that caffeine and crocetin biosynthesis in coffee and Gardenia evolved, respectively, through tandem duplications and functional specialization of NMT and CCD4 genes, and that these tandem duplications occurred after the separation of the two genera. Besides giving rise to novel

metabolic functions, tandem gene duplications play several additional roles in genome evolution; for instance, in the birch and avocado genomes, tandem duplicates are enriched for pathogen responses [7, 39], while in the Australian carnivorous pitcher plant, tandem duplicates are enriched in enzymatic functions related to the acquisition of carnivory [40]. The production of tandem duplicates as copy number variants in populations, and their subsequent species-level fixation, provides a general evolutionary substrate for novel secondary metabolic activities in the recent adaptive landscapes of plant genomes.

Conclusion
This study sequenced the genome of G. jasminoides, a crocin-producing species, and dissected the complete crocin biosynthetic pathway through genomic and functional assays. Comparative analyses with C. canephora revealed that the caffeine biosynthetic genes (NMTs) in Coffea and the first dedicated crocin biosynthetic gene (GjCCD4a) in Gardenia evolved through recent tandem gene duplications in the two different genera, respectively. This study highlighted the divergent evolution of caffeine and crocin biosynthesis within the coffee family, providing significant insights on the role of tandem duplications in the evolution of plant specialized metabolism.

Methods
Plant materials
An individual G. jasminoides (line 1–9) plant, which was asexually propagated by cutting, was obtained from Nanchuan District (29° N and 107° E), Chongqing City, China. Seven independent organs from G. jasminoides, including the root, stem, leaf, flower, fruitlet, green fruit, and red fruit, were collected. The fruitlet, green fruit, and red fruit represented different stages of maturity. In total, 21 samples including three biological replicates for each organ were gathered. All samples were divided into two portions, which were used for the measurement of crocin content and RNA sequencing. The pooled young leaves were used for DNA extraction for Illumina and ONT sequencing.

ONT sequencing and assembly
Following the methods for megabase-size DNA preparation, we extracted the high molecular weight (HMW) genomic DNA of G. jasminoides, which was used to construct paired-end, mate-pair, and ONT libraries. The HMW gDNA was randomly fragmented using a Megaruptor; then, the large fragments were selected and purified using BluePippin and AMPure beads. After end-prep, ligation of sequencing adapters, and tether attachment, the fragments were sequenced on the ONT GridION X5 platform with 6 nanopore flow cells (v9.4.1). Base calling was performed using the Oxford Nanopore base caller Guppy (v1.8.5). Canu (v1.7) was used to correct, trim, and assemble the ONT raw reads with the default parameters [41]. The correction-free approach, named Minimap/miniasm [42], was also independently performed with the recommended parameters. The assembler SMARTdenovo was also used for assembly with the corrected and trimmed ONT reads as input [43]. The Canu-SMARTdenovo contigs were polished three times with Pilon (v1.22) using Illumina short reads [44]. The final scaffolds were constructed with the polished contigs and corrected ONT reads using SSPACE-LongRead (version 1.1) [45], and heterozygous sequences were removed using Purge Haplotigs [46]. The quality of the genome assembly was estimated by searching for Benchmarking Universal Single-Copy Orthologs (BUSCO v4.0, embryophyte profile) with the Embryophyta odb10 dataset [47]. Illumina sequences from the G. jasminoides DNA and RNA libraries were mapped to evaluate the quality of the assembled genome using BWA (Burrows-Wheeler Aligner) [48].

Chromosome construction using Hi-C
Fresh tissue of G. jasminoides was used to construct a Hi-C sequencing library. Steps included chromatin crosslinking, chromatin digestion with Hind III, biotin labeling and end repair, DNA purification, streptavidin pull-down of labeled Hi-C ligation products, and construction of an Illumina sequencing library. The clean sequences were mapped to the draft genome, and valid Hi-C reads were used to correct the draft assembly. Then, the draft genome of G. jasminoides was assembled into chromosomes (2n = 22) using Lachesis [22].

Genome annotation and RNA-Seq analysis
Annotation of structural repeats in the G. jasminoides genome was performed using the RepeatModeler (http://www.repeatmasker.org/RepeatModeler/; v1.0.9) package, which combines RECON and RepeatScout to identify and classify the repeat elements. The long terminal repeat retrotransposons (LTR-RTs) in G. jasminoides were identified using LTR_Finder (v1.0.6) and LTR_retriever with the default parameters [49]. The repeat sequences were masked by RepeatMasker (v4.0.6) (http://www.repeatmasker.org/).
RNA-Seq on the HiSeq 4000 platform was performed for 21 samples. The short reads were assembled de novo using Trinity (v 2.2.0) [50], and peptide sequences were predicted with TransDecoder (v2.1.0) (https://github.com/TransDecoder). The masked G. jasminoides genome annotation was ab initio predicted using the MAKER (v2.31.9) [51] annotation pipeline, integrating the assembled transcripts of G. jasminoides and protein sequences from G. jasminoides, C. canephora, and A.

thaliana. Noncoding RNAs were annotated by aligning to the Rfam database using INFERNAL (v1.1.2) [52], and miRNAs were further analyzed by performing BLASTN searches against the miRNA database. The RNA-Seq reads from different G. jasminoides organs were aligned to the masked genome using HiSAT2 (v2.0.5), and the FPKM values of annotated genes in the reference genome were calculated using Cufflinks (v2.2.1) [53].
The amino acid sequences of proteins from G. jasminoides and nine other angiosperms were clustered into orthologous groups using OrthoMCL (version 2.0.9) [54]. Phylogenetic analyses of single-copy orthologous genes were performed using the RAxML package (version 8.1.13) using the JTT+G+I substitution model for amino acid sequences with 1000 bootstrap replicates [55]. Divergence times were directly estimated based on the divergence times of P. trichocarpa-G. max (94–127 MYA) and B. distachyon-Z. mays (40–53 MYA) obtained from TimeTree (http://www.timetree.org). The Markov Cluster Algorithm (MCL) was used to identify species-specific gene groups [56]. CAFÉ (version 3.1) was used to predict gene family expansion and contraction [57]. Genome synteny analyses were performed using the CoGe web suite, www.genomevolution.org, according to methods described elsewhere [58–60].

Identification of gene families related to crocin biosynthesis
Protein sequences of the CCD, ALDH, and UGT family members in A. thaliana were downloaded from the TAIR database, then were used as queries in BLASTP searches against the G. jasminoides protein sequences to identify homologous sequences. Full-length protein sequences were corrected and aligned with ClustalW2 [61]. Phylogenetic trees were constructed using the maximum likelihood method with the Jones-Taylor-Thornton (JTT) model and 1000 Bootstrap replicates [62]. Further analyses incorporated blast searches (using Gardenia proteins as queries) of a number of other genomes to identify more CCD, ALDH, and UGT genes. For NMTs, the Coffea canephora XMT protein was used as a query (NCBI accession ABD90685.1). Species considered were Gardenia jasminoides (CoGe genome ID 53980), Coffea canephora (CoGe genome ID 19443), Arabidopsis thaliana (CoGe genome ID 16911), Calotropis gigantea (CoGe genome ID 36623), Catharanthus roseus (CoGe genome ID 36703), Vitis vinifera (CoGe genome ID 19990), Gelsemium sempervirens (CoGe genome ID 53941), and Solanum lycopersicum (CoGe genome ID 12289). Gene model IDs from the respective CoGe-uploaded genomes were retained as leaf IDs for phylogenetic analysis, with the exception that “:” when it appeared in a gene model ID was replaced by “_”. Several additional anchoring protein sequences from NCBI were incorporated in the NMT tree (MTL,_AFV60456.1; DXMT,_ABD90686.1; MXMT,_AFV60445.1; XMT,_ABD90685.1). Searches were run on the CoGe platform using default parameters and saving 100 Blast HSPs per species. Unique translated sequences were then downloaded, duplicates were excluded using BBedit, sequences with internal stop codons were excluded, and then trees were run using PASTA [63] with MAFFT [64] to align the protein sequences and FastTree [65] to create an approximately maximum likelihood tree. Trees were visualized and edited using FigTree (http://tree.bio.ed.ac.uk/software/figtree/) (Additional file 4: Fig. S27, Additional file 5: Fig. S28, Additional file 6: Fig. S29, Additional file 7: Fig. S30). To interpret the supplemental figures, pink branches represent gentianalean clades, green branches represent Rubiaceae clades, and orange gene model IDs represent Gardenia genes. Coffee-specific clades are shown in red. In the NMT supplemental tree (Additional file 4: Fig. S27), the anchoring protein sequences are shown in red.

Enzymatic activity assays and LC/LC-MS analyses
The cDNAs of candidate genes from the CCD, ALDH, and UGT families were de novo synthesized and cloned into expression vectors via digestion and ligation (Additional file 2: Fig. S25). The GjALDH2C3, GjUGT94E13, and GjUGT74F8 proteins were purified from E. coli using affinity chromatography to a purity > 95% (Additional file 2: Fig. S15, S16, S19). The in bacterio and in vitro activity assays and detailed reaction mixtures are described in the Supplemental information. Crocetin dialdehyde (trans-crocetin dialdehyde) and crocetin (trans-crocetin) were purchased from Sigma-Aldrich (USA) and CFW Laboratories (Germany), respectively. Crocin I and crocin II were purchased from Meilunbio (China). All the chemical reagents used here were of analytical grade.
Samples were analyzed using a Thermo Ultimate 3000 system equipped with an Acquity UPLC® BEH C18 column (1.7 μm, 100 × 2.1 mm). A gradient elution procedure was applied, using the mobile phases acetonitrile containing 0.1% formic acid (A) and water containing 0.1% formic acid (B). The following gradient elution program was used at a flow rate of 0.3 mL/min: 0–5 min, 10% A linearly increased to 50% A; 5–8 min, 50% A linearly increased to 90% A; 8–10 min, 90% A linearly increased to 100% A and sustained for 20 min; and 30–31 min, back to 10% A.
Qualitative analysis of each compound was carried out using liquid chromatography-mass spectrometry (Agilent Technologies 1290 Infinity II LC System and 6545 Q-TOF, with Dual Agilent Jet Stream Electrospray Ionization sources). The drying gas was set at 325 °C with a flow rate of 6 L/min, and the sheath gas was set at 350 °C, with a flow rate of 12.0 L/min. The nebulizer

was set at 45 psig, and the VCap was set at 4000 V. The data were analyzed using MassHunter (version B.07.00). Detailed information on the compounds mentioned is listed in Additional file 2: Tables S17 and S18.
Nuclear magnetic resonance (NMR) experiments were performed on a Bruker AV III 600 NMR spectrometer (600 MHz for ¹H NMR and 150 MHz for ¹³C NMR) in CDCl3 (Sigma-Aldrich, USA), and the chemical shifts (δ) were given in ppm with TMS as the internal standard.

Supplementary information
Supplementary information accompanies this paper at https://doi.org/10.1186/s12915-020-00795-3.

Additional file 1. Detailed methods and results for genome sequencing.
Additional file 2: Figures S1-S26; Tables S1-S18.
Additional file 3. Detailed methods and results for crocin biosynthesis.
Additional file 4: Figure S27. Pasta phylogenetic tree of NMTs from different species.
Additional file 5: Figure S28. Pasta phylogenetic tree of CCDs from different species.
Additional file 6: Figure S29. Pasta phylogenetic tree of UGTs from different species.
Additional file 7: Figure S30. Pasta phylogenetic tree of ALDHs from different species.

Acknowledgments
We would like to thank Dr. Norihiko Misawa for providing plasmids of zeaxanthin, lycopene, and β-carotene production.

Authors’ contributions
Z.X. and J.S. designed and coordinated the study. Z.X., X.P., R.G., A.J., S.C., W.S., Z.W., J.S., T.G., C.X., H.Y., and T.X. generated the data. K.H. and F.R. supplied the plant materials. Z.X., X.P., R.G., C.H., J.K., O.D., S.F., M.R., V.A.A., and G.G. analyzed the data. Z.X., X.P., G.G., and J.S. wrote the manuscript. C.H., J.K., O.D., V.A., G.G., and S.C. revised the manuscript. All authors edited the manuscript and approved the final version.

Funding
This work was supported by the National Natural Science Foundation of China (81973424) to Z.X., the CAMS Innovation Fund for Medical Sciences (CIFMS) (2016-I2M-3-016) to J.S., the EU grant DISCO (grant no. 613513) and Lazio Region project ProBioZaff (85-2017-15296) to G.G., and the US National Science Foundation (1442190) to V.A.A.

Availability of data and materials
G. jasminoides line 1–9 and the constructs used in this work can be obtained by writing to K.H. (email: [email protected]). Genome sequence and assembly data were submitted to the Sequence Read Archive from the NCBI under the accession number PRJNA477438 [66]. The assembled genomes including nuclear genome, chloroplast genome, and mitochondrial genome for G. jasminoides were submitted to the NCBI genome resource with accession no. VZDL00000000 [67], and the CoGe with id53980 [68], id55476 [69], and id55477 [70], respectively.

Ethics approval and consent to participate
Not applicable.

Consent for publication
All authors consent to the publication of the manuscript.

Competing interests
The authors declare no competing interests.

Author details
1Key Lab of Chinese Medicine Resources Conservation, State Administration of Traditional Chinese Medicine of the People’s Republic of China, Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100193, China.
2Engineering Research Center of Chinese Medicine Resource, Ministry of Education, Beijing 100193, China.
3Italian National Agency for New Technologies, Energy and Sustainable Economic Development (ENEA), Casaccia Res. Ctr, 00123 Rome, Italy.
4Department of Biological Sciences, University at Buffalo, Buffalo, NY 14260, USA.
5Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing 100700, China.
6Institute of Materia Medica, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100050, China.
7Chongqing Institute of Medicinal Plant Cultivation, Chongqing 408435, China.
8College of Life Sciences, Qingdao Agricultural University, Qingdao 266109, China.
9School of Biological Sciences, Nanyang Technological University, Singapore 637551, Singapore.
10Yunnan Branch, Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Jinghong 666100, China.

Received: 14 January 2020 Accepted: 18 May 2020

References
1. Nutzmann HW, Huang A, Osbourn A. Plant metabolic clusters - from genetics to genomics. New Phytol. 2016;211:771–89.
2. Chae L, Kim T, Nilo-Poyanco R, Rhee SY. Genomic signatures of specialized metabolism in plants. Science. 2014;344:510–3.
3. Guo L, Winzer T, Yang X, Li Y, Ning Z, He Z, Teodor R, Lu Y, Bowser TA, Graham IA, Ye K. The opium poppy genome and morphinan production. Science. 2018;362:343–7.
4. Denoeud F, Carretero-Paulet L, Dereeper A, Droc G, Guyot R, Pietrella M, Zheng C, Alberti A, Anthony F, Aprea G, et al. The coffee genome provides insight into the convergent evolution of caffeine biosynthesis. Science. 2014;345:1181–4.
5. Frey M, Chomet P, Glawischnig E, Stettner C, Grun S, Winklmair A, Eisenreich W, Bacher A, Meeley RB, Briggs SP, et al. Analysis of a chemical plant defense mechanism in grasses. Science. 1997;277:696–9.
6. Dutartre L, Hilliou F, Feyereisen R. Phylogenomics of the benzoxazinoid biosynthetic pathway of Poaceae: gene duplications and origin of the Bx cluster. BMC Evol Biol. 2012;12:64.
7. Salojarvi J, Smolander OP, Nieminen K, Rajaraman S, Safronov O, Safdari P, Lamminmaki A, Immanen J, Lan T, Tanskanen J, et al. Genome sequencing and population genomic analyses provide insights into the adaptive landscape of silver birch. Nat Genet. 2017;49:904–12.
8. Xiao W, Li S, Wang S, Ho CT. Chemistry and bioactivity of Gardenia jasminoides. J Food Drug Anal. 2017;25:43–61.
9. Chinese Pharmacopoeia Commission. Pharmacopoeia of the People’s Republic of China. Beijing: China Medical Science and Technology Press; 2015.
10. Huang R, O'Donnell AJ, Barboline JJ, Barkman TJ. Convergent evolution of caffeine in plants by co-option of exapted ancestral enzymes. Proc Natl Acad Sci U S A. 2016;113:10613–8.
11. Ma G, Zhang L, Matsuta A, Matsutani K, Yamawaki K, Yahata M, Wahyudi A, Motohashi R, Kato M. Enzymatic formation of beta-citraurin from beta-cryptoxanthin and Zeaxanthin by carotenoid cleavage dioxygenase4 in the flavedo of citrus fruit. Plant Physiol. 2013;163:682–95.
12. Zhang B, Liu C, Wang Y, Yao X, Wang F, Wu J, King GJ, Liu K. Disruption of a CAROTENOID CLEAVAGE DIOXYGENASE 4 gene converts flower colour from white to yellow in Brassica species. New Phytol. 2015;206:1513–26.
13. Ohmiya A, Kishimoto S, Aida R, Yoshioka S, Sumitomo K. Carotenoid cleavage dioxygenase (CmCCD4a) contributes to white color formation in chrysanthemum petals. Plant Physiol. 2006;142:1193–201.
14. Hou X, Rivers J, Leon P, McQuinn RP, Pogson BJ. Synthesis and function of apocarotenoid signals in plants. Trends Plant Sci. 2016;21:792–803.
15. Khorasanchi Z, Shafiee M, Kermanshahi F, Khazaei M, Ryzhikov M, Parizadeh MR, Kermanshahi B, Ferns GA, Avan A, Hassanian SM. Crocus sativus a natural food coloring and flavoring has potent anti-tumor properties. Phytomedicine. 2018;43:21–7.

16. Nair SC, Pannikar B, Panikkar KR. Antitumour activity of saffron (Crocus sativus). Cancer Lett. 1991;57:109–14.
17. Frusciante S, Diretto G, Bruno M, Ferrante P, Pietrella M, Prado-Cabrero A, Rubio-Moraga A, Beyer P, Gomez-Gomez L, Al-Babili S, Giuliano G. Novel carotenoid cleavage dioxygenase catalyzes the first dedicated step in saffron crocin biosynthesis. Proc Natl Acad Sci U S A. 2014;111:12246–51.
18. Demurtas OC, Frusciante S, Ferrante P, Diretto G, Azad NH, Pietrella M, Aprea G, Taddei AR, Romano E, Mi J, et al. Candidate enzymes for saffron crocin biosynthesis are localized in multiple cellular compartments. Plant Physiol. 2018;177:990–1006.
19. Ahrazem O, Diretto G, Argandona J, Rubio-Moraga A, Julve JM, Orzaez D, Granell A, Gomez-Gomez L. Evolutionarily distinct carotenoid cleavage dioxygenases are responsible for crocetin production in Buddleja davidii. J Exp Bot. 2017;68:4663–77.
20. Ji A, Jia J, Xu Z, Li Y, Bi W, Ren F, He C, Liu J, Hu K, Song J. Transcriptome-guided mining of genes involved in crocin biosynthesis. Front Plant Sci. 2017;8:518.
21. Nagatoshi M, Terasaka K, Owaki M, Sota M, Inukai T, Nagatsu A, Mizukami H. UGT75L6 and UGT94E5 mediate sequential glucosylation of crocetin to crocin in Gardenia jasminoides. FEBS Lett. 2012;586:1055–61.
22. Belton JM, McCord RP, Gibcus JH, Naumova N, Zhan Y, Dekker J. Hi-C: a comprehensive technique to capture the conformation of genomes. Methods. 2012;58:268–76.
23. Xu Z, Xin T, Bartels D, Li Y, Gu W, Yao H, Liu S, Yu H, Pu X, Zhou J, et al. Genome analysis of the ancient tracheophyte Selaginella tamariscina reveals evolutionary features relevant to the acquisition of desiccation tolerance. Mol Plant. 2018;11:983–94.
24. VanBuren R, Wai CM, Ou S, Pardo J, Bryant D, Jiang N, Mockler TC, Edger P, Michael TP. Extreme haplotype variation in the desiccation-tolerant clubmoss Selaginella lepidophylla. Nat Commun. 2018;9:13.
25. Gao L, Zhu BY. The accumulation of crocin and geniposide and transcripts of phytoene synthase during maturation of Gardenia jasminoides fruit. Evid Based Complement Alternat Med. 2013;2013:686351.
26. Chen Y, Zhang H, Li YX, Cai L, Huang J, Zhao C, Jia L, Buchanan R, Yang T, Jiang LJ. Crocin and geniposide profiles and radical scavenging activity of gardenia fruits (Gardenia jasminoides Ellis) from different cultivars and at the various stages of maturation. Fitoterapia. 2010;81:269–73.
27. Brocker C, Vasiliou M, Carpenter S, Carpenter C, Zhang Y, Wang X, Kotchoni SO, Wood AJ, Kirch HH, Kopecny D, et al. Aldehyde dehydrogenase (ALDH) superfamily in plants: gene nomenclature and comparative genomics. Planta. 2013;237:189–210.
28. Ahrazem O, Rubio-Moraga A, Berman J, Capell T, Christou P, Zhu C, Gomez-Gomez L. The carotenoid cleavage dioxygenase CCD2 catalysing the synthesis of crocetin in spring crocuses and saffron is a plastidial enzyme. New Phytol. 2016;209:650–63.
29. Moraga AR, Nohales PF, Perez JA, Gomez-Gomez L. Glucosylation of the saffron apocarotenoid crocetin by a glucosyltransferase isolated from Crocus sativus stigmas. Planta. 2004;219:955–66.
30. Xu H, Song J, Luo H, Zhang Y, Li Q, Zhu Y, Xu J, Li Y, Song C, Wang B, et al. Analysis of the genome sequence of the medicinal plant Salvia miltiorrhiza. Mol Plant. 2016;9:949–52.
31. Xu J, Chu Y, Liao B, Xiao S, Yin Q, Bai R, Su H, Dong L, Li X, Qian J, et al. Panax ginseng genome examination for ginsenoside biosynthesis. Gigascience. 2017;6:1–15.
32. Chen W, Kui L, Zhang G, Zhu S, Zhang J, Wang X, Yang M, Huang H, Liu Y, Wang Y, et al. Whole-genome sequencing and analysis of the Chinese herbal plant Panax notoginseng. Mol Plant. 2017;10:899–902.
33. Zhang D, Li W, Xia EH, Zhang QJ, Liu Y, Zhang Y, Tong Y, Zhao Y, Niu YC, Xu JH, Gao LZ. The medicinal herb Panax notoginseng genome provides insights into ginsenoside biosynthesis and genome evolution. Mol Plant. 2017;10:903–7.
34. Mochida K, Sakurai T, Seki H, Yoshida T, Takahagi K, Sawai S, Uchiyama H, Muranaka T, Saito K. Draft genome assembly and annotation of Glycyrrhiza uralensis, a medicinal legume. Plant J. 2017;89:181–94.
35. Schmidt MH, Vogel A, Denton AK, Istace B, Wormit A, van de Geest H, Bolger ME, Alseekh S, Mass J, Pfaff C, et al. De novo assembly of a new Solanum pennellii accession using nanopore sequencing. Plant Cell. 2017;29:2336–48.
36. Burton JN, Adey A, Patwardhan RP, Qiu R, Kitzman JO, Shendure J. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat Biotechnol. 2013;31:1119–25.
37. Wang X. Structure, mechanism and engineering of plant natural product glycosyltransferases. FEBS Lett. 2009;583:3303–9.
38. Brandi F, Bar E, Mourgues F, Horvath G, Turcsi E, Giuliano G, Liverani A, Tartarini S, Lewinsohn E, Rosati C. Study of ‘Redhaven’ peach and its white-fleshed mutant suggests a key role of CCD4 carotenoid dioxygenase in carotenoid and norisoprenoid volatile metabolism. BMC Plant Biol. 2011;11:24.
39. Rendon-Anaya M, Ibarra-Laclette E, Mendez-Bravo A, Lan T, Zheng C, Carretero-Paulet L, Perez-Torres CA, Chacon-Lopez A, Hernandez-Guzman G, Chang TH, et al. The avocado genome informs deep angiosperm phylogeny, highlights introgressive hybridization, and reveals pathogen-influenced gene space adaptation. Proc Natl Acad Sci U S A. 2019;116:17081–9.
40. Fukushima K, Fang X, Alvarez-Ponce D, Cai H, Carretero-Paulet L, Chen C, Chang TH, Farr KM, Fujita T, Hiwatashi Y, et al. Genome of the pitcher plant Cephalotus reveals genetic changes associated with carnivory. Nat Ecol Evol. 2017;1:59.
41. Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017;27:722–36.
42. Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics. 2016;32:2103–10.
43. Istace B, Friedrich A, d'Agata L, Faye S, Payen E, Beluche O, Caradec C, Davidas S, Cruaud C, Liti G, et al. De novo assembly and population genomic survey of natural yeast isolates with the Oxford Nanopore MinION sequencer. Gigascience. 2017;6:1–13.
44. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, Cuomo CA, Zeng Q, Wortman J, Young SK, Earl AM. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One. 2014;9:e112963.
45. Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics. 2011;27:578–9.
46. Roach MJ, Schmidt SA, Borneman AR. Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC Bioinformatics. 2018;19:460.
47. Simao FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210–2.
48. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–60.
49. Ou S, Jiang N. LTR_retriever: a highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiol. 2018;176:1410–22.
50. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29:644–52.
51. Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Sanchez Alvarado A, Yandell M. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 2008;18:188–96.
52. Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 2013;29:2933–5.
53. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012;7:562–78.
54. Li L, Stoeckert CJ Jr, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13:2178–89.
55. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30:1312–3.
56. Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002;30:1575–84.
57. De Bie T, Cristianini N, Demuth JP, Hahn MW. CAFE: a computational tool for the study of gene family evolution. Bioinformatics. 2006;22:1269–71.
58. Lyons E, Pedersen B, Kane J, Alam M, Ming R, Tang H, Wang X, Bowers J, Paterson A, Lisch D, Freeling M. Finding and comparing syntenic regions among Arabidopsis and the outgroups papaya, poplar, and grape: CoGe with rosids. Plant Physiol. 2008;148:1772–81.
59. Lyons E, Pedersen B, Kane J, Freeling M. The value of nonmodel genomes and an example using SynMap within CoGe to dissect the hexaploidy that predates the rosids. Trop Plant Biol. 2008;1:181–90.

60. Ibarra-Laclette E, Lyons E, Hernandez-Guzman G, Perez-Torres CA, Carretero-Paulet L, Chang TH, Lan T, Welch AJ, Juarez MJ, Simpson J, et al. Architecture and evolution of a minute plant genome. Nature. 2013;498:94–8.
61. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, et al. Clustal W and Clustal X version 2.0. Bioinformatics. 2007;23:2947–8.
62. Kumar S, Stecher G, Li M, Knyaz C, Tamura K. MEGA X: molecular evolutionary genetics analysis across computing platforms. Mol Biol Evol. 2018;35:1547–9.
63. Mirarab S, Nguyen N, Guo S, Wang LS, Kim J, Warnow T. PASTA: ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. J Comput Biol. 2015;22:377–86.
64. Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002;30:3059–66.
65. Price MN, Dehal PS, Arkin AP. FastTree 2--approximately maximum-likelihood trees for large alignments. PLoS One. 2010;5:e9490.
66. Xu Z, Pu X, Gao R, Demurtas OC, Fleck SJ, Richter M, He C, Ji A, Sun W, Kong J, et al. Supplementary Datasets. 2020. NCBI BioProject accession: PRJNA477438. [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA477438].
67. Xu Z, Pu X, Gao R, Demurtas OC, Fleck SJ, Richter M, He C, Ji A, Sun W, Kong J, et al. Supplementary Datasets. 2020. Whole Genome Shotgun project: VZDL00000000. [https://www.ncbi.nlm.nih.gov/nuccore/VZDL00000000].
68. Xu Z, Pu X, Gao R, Demurtas OC, Fleck SJ, Richter M, He C, Ji A, Sun W, Kong J, et al. Supplementary Datasets. 2019. CoGe genome ID: id53980. [https://genomevolution.org/coge/SearchResults.pl?s=Gardenia&p=genome].
69. Xu Z, Pu X, Gao R, Demurtas OC, Fleck SJ, Richter M, He C, Ji A, Sun W, Kong J, et al. Supplementary Datasets. 2019. CoGe genome ID: id55476. [https://genomevolution.org/coge/SearchResults.pl?s=Gardenia&p=genome].
70. Xu Z, Pu X, Gao R, Demurtas OC, Fleck SJ, Richter M, He C, Ji A, Sun W, Kong J, et al. Supplementary Datasets. 2019. CoGe genome ID: id55477. [https://genomevolution.org/coge/SearchResults.pl?s=Gardenia&p=genome].
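
Illustrative sketch: sequence filtering and leaf IDs
The Methods describe three clean-up steps applied before tree building: duplicates were excluded, sequences with internal stop codons were excluded, and “:” in a gene model ID was replaced by “_”. The Python sketch below shows one way these steps might be expressed. It is not the authors' procedure (they report using BBedit for duplicate removal). It assumes that stop codons appear as “*” in the translated sequences, that a single trailing “*” is acceptable, and that duplicates are identical sequences; the function names and the example IDs are hypothetical.

```python
from typing import Iterable, List, Tuple


def clean_leaf_id(gene_model_id: str) -> str:
    """Replace ':' with '_' so the ID can be used as a tree leaf label."""
    return gene_model_id.replace(":", "_")


def filter_proteins(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep unique protein sequences without internal stop codons.

    records: (gene_model_id, protein_sequence) pairs.
    Assumption: '*' marks a stop codon; trailing stops are stripped.
    """
    seen = set()
    kept = []
    for gene_id, seq in records:
        body = seq.rstrip("*")      # tolerate a terminal stop
        if "*" in body:             # internal stop codon: exclude
            continue
        if body in seen:            # identical sequence already kept
            continue
        seen.add(body)
        kept.append((clean_leaf_id(gene_id), body))
    return kept
```

For example, filter_proteins([("chr1:g1", "MKT*"), ("chr2:g3", "MK*T")]) keeps only the first record and relabels it "chr1_g1".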

Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
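
Illustrative sketch: LC gradient program
The gradient elution program given in the Methods (0–5 min, 10% to 50% A; 5–8 min, 50% to 90% A; 8–10 min, 90% to 100% A, sustained for 20 min; 30–31 min, back to 10% A) can be written as a piecewise-linear function of time. The Python sketch below is only an illustration of that description. It assumes that the return to 10% A between 30 and 31 min is a linear ramp, which the text does not state, and that %B is simply 100 minus %A; the names GRADIENT and percent_a are hypothetical.

```python
from bisect import bisect_right

# (time in min, % mobile phase A) breakpoints taken from the Methods text.
# Assumption: the 30-31 min return to 10% A is a linear ramp.
GRADIENT = [
    (0.0, 10.0),
    (5.0, 50.0),
    (8.0, 90.0),
    (10.0, 100.0),
    (30.0, 100.0),
    (31.0, 10.0),
]


def percent_a(t_min: float) -> float:
    """Return % A at time t_min by linear interpolation between breakpoints."""
    if not GRADIENT[0][0] <= t_min <= GRADIENT[-1][0]:
        raise ValueError("time outside the 0-31 min program")
    times = [t for t, _ in GRADIENT]
    i = bisect_right(times, t_min) - 1
    if i == len(GRADIENT) - 1:      # exactly at the final breakpoint
        return GRADIENT[-1][1]
    (t0, a0), (t1, a1) = GRADIENT[i], GRADIENT[i + 1]
    return a0 + (a1 - a0) * (t_min - t0) / (t1 - t0)


def percent_b(t_min: float) -> float:
    """% B is the remainder of the binary mobile phase."""
    return 100.0 - percent_a(t_min)
```

Under these assumptions, percent_a(2.5) gives 30.0 and percent_a(20) gives 100.0, matching the stated ramp and hold.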