Genome

Full-length transcriptome analysis and identification of genes involved in asarinin and aristolochic acid biosynthesis in medicinal plant Asarum sieboldii

Journal: Genome

Manuscript ID gen-2020-0095.R1

Manuscript Type: Article

Date Submitted by the Author: 01-Dec-2020

Complete List of Authors:
Chen, Chen; Xi’an Botanical Garden of Shaanxi province, Institute of Botany of Shaanxi Province
Shi, Xinwei; Shaanxi Engineering Research Centre for Conservation and Utilization of Botanical Resources, Xi’an Botanical Garden of Shaanxi province, Institute of Botany of Shaanxi Province
Zhou, Tao; Xi’an Jiaotong University
Li, Weimin; Xi’an Botanical Garden of Shaanxi province, Institute of Botany of Shaanxi Province
Li, Sifeng; Xi’an Botanical Garden of Shaanxi province, Institute of Botany of Shaanxi Province
Bai, Guoqing; Xi’an Botanical Garden of Shaanxi province, Institute of Botany of Shaanxi Province

Keyword: PacBio sequencing, Asarum sieboldii, asarinin, aristolochic acid, CYP

Is the invited manuscript for consideration in a Special Issue? : Not applicable (regular submission)

© The Author(s) or their Institution(s) Page 1 of 42 Genome

Full-length transcriptome analysis and identification of genes involved in asarinin and aristolochic acid biosynthesis in medicinal plant Asarum sieboldii

Chen Chen1, Xinwei Shi1, Tao Zhou2, Weimin Li1, Sifeng Li1, and Guoqing Bai1,*

1 Shaanxi Engineering Research Centre for Conservation and Utilization of Botanical Resources, Xi’an Botanical Garden of Shaanxi province, Institute of Botany of Shaanxi Province, No.17 Cuihua South Road, 710061, Xi’an City, Shaanxi province, China;

2 School of Pharmacy, Health Science Center, Xi’an Jiaotong University, No.76 Yanta West Road, 710061, Xi’an City, Shaanxi province, China;

* Corresponding author: Guoqing Bai Tel: +86-29-85251750 (Office) Fax: +86-29-85251800 E-mail: [email protected]

Abstract

Asarum sieboldii, a well-known traditional Chinese medicinal herb, is used to treat inflammation and pain. It contains both the bioactive ingredient asarinin and the toxic compound aristolochic acid. To meet future breeding demands, the genes involved in the biosynthetic pathways of asarinin and aristolochic acid need to be explored. We therefore sequenced the full-length transcriptome of A. sieboldii using PacBio Iso-Seq to identify candidate transcripts that encode the biosynthetic enzymes of asarinin and aristolochic acid. A total of 63,023 full-length transcripts with an average length of 1,371 bp were generated from root, stem, and leaf tissues, of which 49,593 transcripts (78.69%) were annotated against public databases. Furthermore, 555 alternative splicing (AS) events, 10,869 lncRNAs with their 11,291 target genes, and 17,909 SSRs were identified. Our data also revealed 97 candidate transcripts related to asarinin metabolism, including 6 novel genes encoding enzymes involved in asarinin biosynthesis that are reported here for the first time. In addition, 56 transcripts related to aristolochic acid biosynthesis were identified, especially CYP81B. In summary, our transcriptome data provide a useful resource for studying gene function and genetic engineering in A. sieboldii.

Keywords: PacBio sequencing; Asarum sieboldii; asarinin; aristolochic acid; CYP

Introduction

Asarum sieboldii (named “Hua Xixin” in Chinese) belongs to the Aristolochiaceae family and is a renowned traditional Chinese medicinal (TCM) herb that grows primarily in the Qinling-Bashan mountain region in central and eastern China (Lawrence 1998). The roots of A. sieboldii are commonly used for the treatment of cough, toothache, headache, inflammation, and cancer in Asian nations (Jeong et al. 2018a; Kim et al. 2019a). A. sieboldii has numerous pharmacological properties, including anti-inflammatory, anti-allergic, antitussive, antifungal, and antihyperlipidemic activities (Lee et al. 2005; Quang et al. 2012), which are attributed to its many active ingredients, including essential oils (e.g., methyleugenol, kakuol, safrole, asaricin, limonene, eucarvone), acid amides, and lignans (especially asarinin and sesamin). Among them, asarinin is the main active compound accumulated in the root and rhizome of A. sieboldii; it is listed in the Chinese Pharmacopoeia and serves as a quality control standard (Commission 2015). Asarinin has been reported to be active against human ovarian cancer, to ameliorate collagen-induced arthritis (CIA), and to treat rat adrenal pheochromocytoma by inducing biosynthesis (Jeong et al. 2018b; Dai et al. 2019). Asarinin has also been reported to inhibit the expression of the TLR4 and CXCR3 genes in vitro and to decrease the peripheral blood concentration of IL-12, thereby prolonging allograft heart survival (Gu et al. 2015). In addition, methyleugenol (1-allyl-3,4-dimethoxybenzene), one of the most significant volatile oil components, is a natural compound with antiallergic, antianaphylactic, antinociceptive, antibacterial, acaricidal, and anti-inflammatory effects (Kim et al. 2016;

Wang et al. 2018). It is derived from the phenolic compound eugenol and is used as a flavoring substance in dietary products and cosmetics in Europe and the USA. Methyleugenol has also been applied to the treatment of allergic rhinitis, where it inhibits PGE2 production and decreases the activation of cyclooxygenase-2 (Tang et al. 2015), and it has been shown to suppress induced systemic and local anaphylaxis in mice (Shin et al. 1997).

Studies of the biosynthetic pathway of asarinin have shown that asarinin and methyleugenol share the common precursor coniferyl alcohol and are derived from phenylpropanoid biosynthesis (Figure 6). In the upstream part of the phenylpropanoid pathway, phenylalanine is transformed into caffeic acid by the widely distributed enzyme phenylalanine ammonia-lyase (PAL), followed by a series of enzymes including cinnamic acid 4-hydroxylase (C4H), 4-coumarate CoA ligase (4CL), and p-coumarate 3-hydroxylase (C3H). These first committed steps of the phenylpropanoid biosynthetic pathway have been well elucidated, and the enzymes involved have been identified in many species, e.g., Arabidopsis thaliana, Solanum lycopersicum, Anoectochilus roxburghii, and Camellia sinensis (Singh et al. 2009; Guo and Wang 2010; Vanholme et al. 2010; Ma and Constabel 2019; Ye et al. 2019; Chen et al. 2020). The downstream side branches, however, differ among plants. In basil, ferulic acid and coniferyl alcohol are formed first, probably from caffeic acid through the action of catechol-O-methyltransferase (COMT), cinnamoyl-CoA reductase (CCR), and cinnamyl alcohol dehydrogenase (CAD); eugenol is then produced and converted to methyleugenol by eugenol O-methyltransferase (EMOT)

adding the methyl group to the 4-OH (Gang et al. 2001). In A. sieboldii, coniferyl alcohol is believed to be converted to methyleugenol by coniferyl alcohol acyltransferase (CAAT), eugenol synthase (EGS), and EMOT (Liu et al. 2018). Coniferyl alcohol is also the precursor of asarinin. Two molecules of coniferyl alcohol form one molecule of pinoresinol, which is then converted by a proposed cytochrome P450 to sesamin, and sesamin is finally isomerized to asarinin by an epimerase. SiCYP81Q1 has been reported to form methylenedioxy bridges in Sesamum. However, the specific cytochrome P450 and the epimerase that form asarinin in A. sieboldii are still unclear.

As a member of the Aristolochiaceae, A. sieboldii contains aristolochic acids in its aerial parts (i.e., leaves and fruits). Aristolochic acids can cause a progressive interstitial nephritis leading to end-stage renal disease and urothelial malignancy, known as aristolochic acid nephropathy (AAN) (Vanherweghem et al. 1993). Medicines and formulas containing aristolochic acid have recently been restricted or prohibited because of this nephrotoxicity. Multiple methods to reduce aristolochic acid toxicity have been tried, but it remains difficult to eliminate it completely (Zamudio et al. 2010; Wu et al. 2018). Genetic engineering could be an effective approach to overcome these limitations, which makes it necessary to elucidate the aristolochic acid synthesis pathway in medicinal plants. The biosynthesis of aristolochic acid is poorly understood, but (S)-reticuline, derived from the benzylisoquinoline alkaloids (BIAs) pathway, is known to be the crucial intermediate. BIA biosynthesis begins with the conversion of L-tyrosine to dopamine by tyrosine decarboxylase (TYDC); the resulting intermediates are then condensed to

(S)-norcoclaurine by (S)-norcoclaurine synthase (NCS) (Samanani and Facchini 2002; Minami et al. 2008). (S)-Norcoclaurine is then converted to (S)-reticuline by a series of methyltransferases and a particular cytochrome P450. (S)-Reticuline is the central intermediate in the production of various BIAs, including coptisine, palmatine, and other alkaloids (Samanani et al. 2006; Ziegler and Facchini 2008; Beaudoin and Facchini 2014; Yamada et al. 2015). Numerous enzymes of the BIA biosynthesis pathway have been characterized in Papaver and Coptis species, such as TYDC, NCS, and the methyltransferases (S)-norcoclaurine/norlaudanosoline 6-O-methyltransferase (6OMT), norcoclaurine 7-O-methyltransferase (7OMT), (S)-scoulerine 9-O-methyltransferase (SOMT), and (S)-coclaurine-N-methyltransferase (CNMT), which helps in understanding BIA biosynthesis in other plants. Although the TyrDC1, TyrDC2, and TyrDC3 genes have been implicated in the accumulation of alkaloids in A. heterotropoides, the other genes involved in the aristolochic acid biosynthesis pathway are unknown, and the specific downstream enzymes, i.e., methyltransferases and cytochrome P450s, still need to be explored in Asarum.

Deep sequencing technologies (i.e., genomic and transcriptome sequencing) have been increasingly used to decipher the metabolic pathways and regulatory networks of medicinal plants. Good examples of their application to metabolic biosynthesis pathways in medicinal plants include the gentiopicroside biosynthetic pathway in Gentiana rigescens (Zhang et al. 2015b), ginsenoside backbone synthesis in American ginseng, Panax quinquefolius L. (Sun et al. 2010), and transcriptome-wide sequencing in Dendrobium officinale (Meng et al.

2016). Currently, the single-molecule real-time (SMRT) sequencer from PacBio is widely used to produce long reads and generate full-length transcripts, providing genetic resources for non-model plant species that lack a reference genome (Eid et al. 2009; Chen et al. 2018; Ren et al. 2018; Zhang et al. 2018; Kim et al. 2019b).

In this study, we sequenced the full-length transcriptome of A. sieboldii to uncover the genes involved in the asarinin and aristolochic acid biosynthesis pathways. Our study provides essential genetic resources for further investigations into the metabolic regulation of asarinin and aristolochic acid production in A. sieboldii. Our work also provides a basis for the genetic improvement and molecular breeding of A. sieboldii.

Materials and methods

Plant materials

At least three individuals of A. sieboldii were collected from Laoshan Mountain (Qingdao, Shandong province) in April 2019. Leaf, stem, and root tissues were collected, immediately frozen in liquid nitrogen, and stored at −80 °C for RNA extraction.

RNA extraction

The RNeasy Plus Mini Kit (Qiagen, Valencia, CA, USA) was used to extract total RNA from frozen tissue samples. After extraction, electrophoresis and spectrophotometry (NanoDrop Technologies, Rockland, DE, USA) were used to determine the integrity, purity, and concentration of the RNA samples. Furthermore,

the integrity of the RNA samples was evaluated by the Agilent Bioanalyzer 2100 system (Agilent Technologies, CA, USA) with an RNA Nano 6000 Assay Kit.

Iso-Seq library construction and sequencing

RNA samples from leaves, stems, and roots were mixed in equal amounts for Iso-Seq library construction. cDNA synthesis and size selection were performed according to the protocols of the Clontech SMARTer PCR cDNA Synthesis Kit and BluePippin (Sage Science, Beverly, MA, USA). A full-length cDNA library of 1-6 kb was built, its quality was assessed with the Agilent Bioanalyzer 2100 system (Agilent Technologies, CA, USA), and it was sequenced on the Sequel II platform (Pacific Biosciences, Menlo Park, CA, USA). The standard SMRT Analysis 2.3 protocol was used to process the raw PacBio Iso-Seq data (Hu et al. 2020). For Illumina RNA-Seq, twelve libraries were prepared from root, stem, and leaf RNA samples and sequenced on an Illumina HiSeq 2500 platform (Illumina Inc., San Diego, CA, USA) following the manufacturer’s recommendations. Clean reads, obtained by removing adapters and low-quality reads from the raw Illumina data, were used to correct low-quality full-length transcript isoforms with Proovread (Hackl et al. 2014). CD-HIT (Li and Godzik 2006) was used to remove redundant and similar sequences. The sequencing was performed by Biomarker Technologies Co. (Beijing, China). The raw sequence data have been submitted to the National Genomics Data Center (http://bigd.big.ac.cn/) under accession number CRR127990.

Alternative splicing (AS) analysis, CDS detection and simple sequence repeat (SSR) analysis

AStalavista v3.2 was used with default parameters to detect alternative splicing events in the obtained transcripts (Foissac and Sammeth 2007). The coding sequences (CDS) of the transcripts were identified with TransDecoder (Haas et al. 2013). Simple sequence repeats (SSRs) in transcript sequences longer than 500 bp were identified using MISA v1.0 (Beier et al. 2017).
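To illustrate the kind of search an SSR scanner performs, the following minimal sketch finds perfect repeats with a regular expression. It is not the MISA implementation: the function names `find_ssrs` and `is_primitive` are hypothetical, and the minimum repeat counts in `MIN_REPEATS` are assumed values chosen for the example, not the settings used in this study. Compound SSRs are not handled.

```python
import re

# Assumed minimum repeat counts per motif length (illustrative values only).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'ATAT')."""
    k = len(motif)
    return not any(
        k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k)
    )


def find_ssrs(seq, min_transcript_len=500):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    if len(seq) <= min_transcript_len:  # only transcripts longer than 500 bp
        return []
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One captured unit followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start(), motif, len(m.group(0)) // unit_len))
    return sorted(hits)


# Toy transcript: an 8-bp filler unit (too long to count as an SSR motif here)
# with one mononucleotide and one dinucleotide repeat embedded.
filler = "ACGTTGCA"
demo = filler * 40 + "T" * 12 + filler * 20 + "AG" * 7 + filler * 10
print(find_ssrs(demo))
```

Filtering out non-primitive motifs keeps a run such as "TTTTTTTTTTTT" from being reported a second time as a "TT" dinucleotide repeat.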

Long non-coding RNA (LncRNA) prediction

LncRNAs were identified using the Coding Potential Calculator (CPC), the Coding-Non-Coding Index (CNCI), the Coding Potential Assessment Tool (CPAT), and the Pfam database, screening transcripts that were longer than 200 bp or had more than two exons. Additionally, LncTar v1.0 was used to predict the target genes of the lncRNAs (Li et al. 2015).
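The screening logic can be sketched as a set intersection. This is a minimal illustration under an assumption, namely that a transcript is retained only when all four tools agree that it is non-coding; the function `select_lncrnas` and the toy transcript IDs are hypothetical, and in practice the ID sets would be parsed from each tool's output.

```python
def select_lncrnas(lengths, noncoding_calls, min_len=200):
    """Keep transcripts longer than min_len that every tool calls non-coding.

    lengths: dict of transcript ID -> length in nucleotides
    noncoding_calls: dict of tool name -> set of IDs that tool calls non-coding
    """
    candidates = {tid for tid, n in lengths.items() if n > min_len}
    for ids in noncoding_calls.values():
        candidates &= set(ids)  # drop anything a tool did not call non-coding
    return sorted(candidates)


# Hypothetical toy input for illustration only.
lengths = {"t1": 1500, "t2": 180, "t3": 950, "t4": 2100}
calls = {
    "CPC": {"t1", "t2", "t3"},
    "CNCI": {"t1", "t3", "t4"},
    "CPAT": {"t1", "t2", "t3"},
    "Pfam": {"t1", "t3"},  # transcripts without a protein domain hit
}
print(select_lncrnas(lengths, calls))
```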

Functional annotation and transcription factor identification

All transcripts were functionally annotated using BLASTX (v 2.2.26) (Yang et al. 2014) with an E-value threshold of 1e-5 against the following databases: NR (Deng et al. 2006), NT, Swissprot (Rolf A. et al. 2004), GO (Ashburner et al. 2000), COG (Tatusov et al. 2000), KOG (Koonin et al. 2004), Pfam (Finn et al. 2013), and KEGG (Kanehisa et al. 2004). Transcription factors and their gene family assignments were identified with the iTAK software (Zheng et al. 2016).
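As an illustration of how the E-value threshold is applied, the sketch below keeps the best-scoring hit per transcript from tab-delimited BLAST results. It assumes the standard 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score); the function `best_hits` and the example rows are hypothetical and do not come from the study's data.

```python
def best_hits(lines, max_evalue=1e-5):
    """Map each query ID to its best hit (subject, evalue, bitscore).

    Hits with an E-value above max_evalue are discarded; among the rest,
    the hit with the highest bit score is kept for each query.
    """
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical rows for illustration only.
rows = [
    "tr1\tprotA\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-50\t250",
    "tr1\tprotB\t60.0\t180\t72\t2\t10\t550\t5\t185\t1e-20\t120",
    "tr2\tprotC\t35.0\t90\t58\t3\t1\t270\t1\t90\t0.01\t40",
]
print(best_hits(rows))
```

Here `tr2` is left unannotated because its only hit does not pass the 1e-5 threshold.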

Gene expression profiles by quantitative real-time PCR

Unigenes related to asarinin and aristolochic acid biosynthesis were randomly selected for verification by qRT-PCR. RNA was extracted from roots, stems, and leaves and used to synthesize cDNA as the qRT-PCR template with PrimeScript™ RT Master Mix (TaKaRa Bio, Beijing). Specific primers were designed with Primer Premier 5.0 (Table S1), and SYBR Green PCR Master Mix (TaKaRa Bio, Beijing) was used for PCR. Reactions were carried out in a 20 μL volume containing 10 μL of SYBR Green PCR Master Mix, 0.4 μL of each primer (0.1 mM), 2 μL of 20× diluted cDNA, and 7.2 μL of ddH2O under the following conditions: 30 s at 95 °C, followed by 40 cycles of 5 s at 95 °C, 30 s at 55 °C, and 30 s at 72 °C. Three biological replicates and technical replicates were performed in all qRT-PCR experiments, and the expression levels of the candidate genes were calculated using the 2^−ΔΔCt method.
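The 2^−ΔΔCt calculation can be written out as a short worked example. The sketch assumes that Ct values are averaged over replicates, that a reference gene is used for normalization, and that one tissue serves as the calibrator; the text does not name the reference gene or calibrator, so the function `relative_expression` and all Ct values below are hypothetical.

```python
from statistics import mean


def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_calib, ct_ref_calib):
    """Relative expression by the 2^-ddCt method.

    dCt  = mean Ct(target) - mean Ct(reference), computed per tissue
    ddCt = dCt(sample tissue) - dCt(calibrator tissue)
    """
    d_ct_sample = mean(ct_target_sample) - mean(ct_ref_sample)
    d_ct_calib = mean(ct_target_calib) - mean(ct_ref_calib)
    dd_ct = d_ct_sample - d_ct_calib
    return 2 ** (-dd_ct)


# Hypothetical Ct values: target gene in roots (sample) vs. leaves (calibrator).
fold = relative_expression(
    ct_target_sample=[22.0, 22.0, 22.0],
    ct_ref_sample=[18.0, 18.0, 18.0],
    ct_target_calib=[24.0, 24.0, 24.0],
    ct_ref_calib=[18.0, 18.0, 18.0],
)
print(fold)
```

With these numbers ΔCt is 4 in the sample and 6 in the calibrator, so ΔΔCt is −2 and the target is expressed four-fold higher in the sample tissue.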

Ethics statement

This study has not directly involved humans, animals or plants.

Results

SMRT sequencing and transcript clustering analysis

Pooled samples of leaves, roots, and stems were sequenced on the PacBio Sequel II platform using Iso-Seq to acquire the full-length transcriptome of A. sieboldii. A 1-6 kb Iso-Seq cDNA library was constructed and sequenced on 2 cells, yielding 22.1 G of data. In total, the PacBio platform generated 348,913 circular consensus sequences (CCSs) with an average length of 1,258 bp (Table 1). Standard processing of the CCSs yielded 348,913 full-length sequences, including 268,055 (76.83%) full-length nonchimeric (FLNC) reads. After polishing, a total of 118,486

consensus isoforms were obtained, of which 114,026 were high-quality consensus

transcript sequences (96.24%) with a mean length of 1,160 bp and 3,276 were low-

quality isoforms (3.76%) (Figure 1A; Table 1).

In parallel, the mRNA samples of leaves, roots, and stems were sequenced separately on the Illumina platform. A total of 223,260,305 clean reads (Q30 > 94.22%) were obtained after filtering the raw data: 83,310,646 reads, 70,120,207 reads, and 69,829,452 reads for the leaves, roots, and stems, respectively (Data S2). To reduce the high error rate of third-generation sequencing, the Illumina data were used for correction. After redundant transcripts were removed from the high-quality consensus sequences, a total of 63,023 transcript sequences were obtained, with an average length of 1,371 bp and an N50 length of 1,813 bp, which were suitable for further structural and functional analysis.
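For readers unfamiliar with the N50 statistic reported above, the following sketch shows how it and the mean length are computed from a list of transcript lengths. The N50 is the length of the shortest transcript in the smallest set of longest transcripts that together cover at least half of the total bases. The function names and the toy lengths are illustrative and are not taken from the study's data.

```python
def n50(lengths):
    """Length L such that transcripts of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no transcript lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def mean_length(lengths):
    """Arithmetic mean of the transcript lengths."""
    return sum(lengths) / len(lengths)


# Toy lengths for illustration: total 20 bases, so the half-way point is 10.
toy = [2, 3, 4, 5, 6]
print(n50(toy), mean_length(toy))
```

Because long transcripts contribute more bases, the N50 is normally larger than the mean length, as it is for the 1,813 bp and 1,371 bp values reported here.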

Alternative spliced isoforms

Alternative splicing is an important mechanism for the fine regulation of gene expression in plants. A single pre-mRNA can be spliced in multiple ways, giving rise to different transcripts and protein isoforms with potentially distinct biological functions. In our data, 555 AS events were identified for A. sieboldii (Data S3). Unfortunately, the types of these AS events could not be defined because no A. sieboldii genome data are available.

Long non-coding RNA (LncRNA) prediction

Long noncoding RNAs (lncRNAs) are longer than 200 nucleotides and are

emerging as regulatory molecules in many vital biological processes (Li et al. 2014;

Chekanova 2015). We used CPC, CNCI, CPAT, and

Pfam to categorize lncRNAs from our transcripts, and ultimately identified 10,869 lncRNAs in the A. sieboldii transcriptome sequencing data (Figure 1B; Data S4).

LncRNAs generally act on adjacent target genes to regulate gene expression, which is called the cis-acting effect of lncRNAs (Marques and Ponting 2014). To predict the target genes of the lncRNAs, we looked for coding genes transcribed within 100 kb upstream and downstream of each lncRNA. As a result, we identified 11,291 mRNA sequences as lncRNA target genes by complementary base pairing between lncRNAs and mRNAs.

Simple sequence repeat (SSR) identification

SSRs are highly polymorphic and efficient markers that have been widely applied in genetic analysis (Sharopova et al. 2002). Here, 15,029 SSRs and 12,357 SSR-containing sequences were obtained from 52,238 sequences (isoforms >500 bp in size), representing 24.13% of the 63,023 transcripts (Table 2; Data S5). Based on the repeating unit, all SSRs were classified as mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, hexanucleotide, or compound SSRs (when the distance between SSRs was smaller than 100 bp). Mononucleotides (6,009, 39.98%) were the most frequent type of SSR, followed by dinucleotides (4,476, 29.78%) and trinucleotides (2,457, 16.35%).

Analysis of the number of repeat-unit iterations in the SSRs showed that the iteration counts differed among the classes. Among mononucleotides, 48.79% of motifs had 10 repeat units, while the rest had more than 10 repeat units. In the di-, tri-, tetra-, penta-, and hexanucleotide classes, most

(79.74%) of the motifs had 5-10 repeat units, while fewer (21.26%) had more than 10 repeat units. These SSRs will be helpful tools for further research on genetic diversity, genetic linkage map construction, and the localization of important genes in A. sieboldii.

Functional annotation of transcripts

A total of 49,593 transcripts were functionally annotated by searching the 63,023 transcripts against multiple databases. Of these, 49,139 were annotated in NR, 33,267 in Swissprot, 37,045 in GO, 21,066 in KEGG, 19,864 in COG, 31,103 in KOG, 47,401 in eggNOG, and 35,650 in Pfam. The species distribution of the NR annotations showed that the species with the most matches was Nelumbo nucifera (27.07%), followed by Vitis vinifera (8.75%) (Figure 1C).

To classify orthologous genes, 19,864 transcripts were assigned to COG clusters, in which the largest group was “general function prediction” (2,279, 10.32%), while the smallest group was “RNA processing and modification” (17, 0.08%) (Figure 2A). In the eggNOG database, apart from function S (“function unknown”), the transcripts of A. sieboldii were most enriched for function O (posttranslational modification, protein turnover, chaperones), followed by function T (signal transduction mechanisms) (Figure 2B).

In addition, 37,045 transcripts were assigned GO terms in the categories of biological process (BP), cellular component (CC), and molecular function (MF), which received 81,580, 98,779, and 45,063 transcript assignments, respectively (Figure 3). In the BP category, the most enriched term was “metabolic process”, followed by

“cellular process”, and “single-organism process”. In the MF category, “catalytic activity” and “binding” dominated. In the CC category, a high percentage of the genes fell under “cell part”, “cell”, and “organelle”.

To further understand the transcripts in their biological pathways, a KEGG pathway analysis was performed. A total of 19,910 transcripts were mapped to 129

KEGG pathways, where “biosynthesis of amino acids (ko01230)”, “plant hormone signal transduction (ko04075)”, and “cysteine and methionine metabolism (ko00270)” occupied the top three slots (Table 3).

Transcription factors (TFs) determination

Both physiological processes and cell differentiation are known to be regulated to a large extent by transcription factors (TFs). We identified 3,540 transcription factors belonging to 184 families, such as WRKY, bHLH, bZIP, MYB-related, NAC, and AP2/ERF (Figure 4). The largest family in our transcriptome was the C2H2 zinc finger proteins (179, 5.06%), which participate in development, defense, and the regulation of secondary metabolism in plants (Kwon et al. 2010; Tsutsui et al. 2011).

Candidate genes involved in the asarinin biosynthesis pathway

Asarinin is a tetrahydrofurofurano lignan and the main active compound of A. sieboldii. Building on previous studies, we searched for enzyme-encoding transcripts involved in the phenylpropanoid pathway from which asarinin and methyleugenol are derived. As a result, a total of 97 candidate transcripts encoding 14 enzymes related to asarinin and methyleugenol biosynthesis

were found (Figure 5; Data S6). Among these transcripts, 41 candidate genes encoding 3 enzymes were discovered for the first time, including 2 transcripts that encode caffeic acid-3-O-methyltransferase (AsCOMT), 6 transcripts that encode a bifunctional epimerase (AsEPI), and 33 transcripts that encode members of the CYP81Q subfamily, such as AsCYP81Q2, AsCYP81Q4, AsCYP81Q7, and AsCYP81Q29, which are speculated to catalyze the conversion of (+)-pinoresinol to (+)-sesamin.

Candidate genes involved in the biosynthesis of aristolochic acid

Genes for enzymes involved in the aristolochic acid biosynthesis pathway have rarely been identified, although its biosynthetic route was proposed in 1967 (Schutte et al. 1967). To better understand the molecular basis of aristolochic acid biosynthesis, we analyzed our transcript library and found 56 candidate genes encoding 6 enzymes related to aristolochic acid metabolism according to the KEGG database (ko00950). These were AsTYR (1), AsTYDC (5), AsNCS (27), AsNOMT (9), AsCNMT (1), and AsCYP80B1 (13), which together catalyze the conversion of tyrosine to reticuline, an important intermediate in aristolochic acid biosynthesis (Figure 6, Data S7).

Expression profiles of candidate enzyme genes in different tissues

To verify the expression profiles of the candidate enzyme genes in different tissues, 12 candidate transcripts were randomly selected from the candidate transcripts associated with asarinin and aristolochic acid biosynthesis for qRT-PCR analysis (Figure 7). According to this analysis, the expression levels of AsPAL, AsCYP81Q, AsCOMT, and AsCYP81B1 were high, while those of AsNCS, AsTYR, and

AsTYDC were low in roots. In leaves, the expression level of AsNOMT was the highest, while that of AsTYR was the lowest. Of the candidate genes involved in asarinin biosynthesis, AsCCR, AsPAL, AsCOMT, and AsCYP81Q had significantly higher expression in roots than in leaves, in accordance with asarinin accumulation. In contrast, among the candidate genes involved in the aristolochic acid biosynthesis pathway, AsNOMT and AsNCS were more highly expressed in leaves than in roots, whereas AsCYP81B1 expression was remarkably high in both roots and leaves. These candidate transcripts represent the central enzyme genes likely to participate in the biosynthesis of asarinin and aristolochic acid.

Discussion

Generating a large quantity of functional transcripts by transcriptome sequencing has proved to be a powerful and effective approach for a wide range of research applications. It is particularly suitable for transcription profiling in non-model species without a reference genome (Kawaharamiki et al. 2011; Shi et al. 2011). Using this technology, massive amounts of genetic information about secondary metabolism in plants have been obtained, for example for ascorbic acid and carotenoids in tomato and for anthocyanins in Lilium and peach flowers (Chen et al. 2014; Zhang et al. 2015a).

This sequencing technology is of interest to researchers that are searching for novel genes, especially related to the biosynthesis of active ingredients in medicinal plants.

For example, novel genes involved in secoiridoid biosynthesis were found by sequencing in Swertia mussotii (Liu et al. 2017). The UDP-glycosyltransferase unigenes were considered as candidates related to glycosides biosynthesis in

Polygonum cuspidatum (Hao et al. 2012). Likewise, unigenes related to tanshinone and phenolic acid biosynthesis were globally identified in Salvia miltiorrhiza (Xu et al. 2015; Xu et al. 2016). In this study, third-generation PacBio SMRT sequencing was used to elucidate the full-length transcriptome profile and find novel transcripts involved in the secondary metabolism of active or toxic compounds in A. sieboldii. We ultimately obtained 63,023 transcripts from 114,026 high-quality consensus transcript sequences, with an average length of 1,371 bp and an N50 length of 1,813 bp. These results were similar to those of previous studies in Pisum sativum L. (Alves-Carvalho 2015), Zanthoxylum planispinum (Kim et al. 2019b), and Camellia sinensis (Xu et al. 2017). The annotation rate (78.69%, 49,593 out of 63,023) was higher than that of the earlier transcriptome data obtained by Illumina sequencing (Liu et al. 2018). This result reflects the benefits of the PacBio platform, such as longer reads, shorter assembly time, and higher accuracy (Roberts et al. 2013). Furthermore, predicted AS transcripts and lncRNAs were also analyzed from the PacBio transcripts (Figure 1; Data S1, S2). To the best of our knowledge, this study is the first to provide not only the full-length transcripts of encoded enzymes and TFs but also the potential transcriptional regulation loci, the structure of transcripts, and genetic markers. Hence, the PacBio sequencing data generated in our study provide a genetic basis for further studies.

Asarinin is the main active compound listed in the Chinese Pharmacopoeia, with antitumor and antibacterial properties. Asarinin is reported to be derived from phenylpropanoid biosynthesis (ko00940) (Liu et al. 2018). The function of enzyme

genes such as PAL, C4H, 4CL, and CCR in phenylpropanoid biosynthesis has been thoroughly studied in many plants (Maher et al. 1994; Liu et al. 2006; Tuan et al. 2010). Nevertheless, the key enzyme genes of phenylpropanoid biosynthesis in A. sieboldii have rarely been reported; only PAL and CAD have been cloned and further studied (Lin et al. 2018; Liu et al. 2018). Additionally, CYP and DIR genes related to asarinin biosynthesis were found using Illumina sequencing (Liu et al. 2017). However, because of short read lengths and the sequencing bias of the system, Illumina sequencing is of limited use for generating full-length transcriptomes. In our study, 134 candidates encoding 14 enzymes of the phenylpropanoid pathway related to asarinin and methyleugenol biosynthesis were identified (Figure 7). Among them, most genes, such as PALs, C4Hs, 4CLs, CCRs, CCoAOMTs, CADs, and DIRs, were also found in the previous report, while the transcripts of COMT, CFAT, CYP81Q, and EPI are reported for the first time in A. sieboldii. Specifically, the COMT gene encodes caffeic acid-3-O-methyltransferase, which catalyzes O-methylation at the C5 position of the phenolic ring and converts caffeoyl alcohol to coniferyl alcohol in the lignin biosynthesis pathway (Trabucco et al. 2013; Chen et al. 2017). The COMT gene was first cloned in maize by screening a cDNA library, and downregulation of COMT has been linked directly to lignin reduction, as shown in other plant species (Guo et al. 2001; Guillaumie et al. 2008; Lu et al. 2010). In our transcriptome library, we found 3 transcripts encoding AsCOMT, which forms coniferyl alcohol, the branch-point intermediate for methyleugenol and asarinin production; this adds a branch of coniferyl alcohol synthesis and provides the flexibility to regulate this

pathway in A. sieboldii. Coniferyl alcohol acyltransferase (CFAT) catalyzes coniferyl

alcohol and acetyl-CoA biosynthesize coniferyl acetate. And CFAT gene silencing

resulted in no more isoeugenol and benzenoid formation in petunia (Dexter et al. 2007;

Koeduka et al. 2009). Similar, we screened our transcripts database and found 2

AsCFAT genes speculated to play the role in biosynthesize coniferyl acetate for further

methyleugenol formation. For asarinin biosynthesis, the last step is changing the

epimerization of (+)-sesamin and asarinin by an epimease. We found 6 transcripts to

encode AsEPI in our data in which could be the specific epimease to epimerize (+)-

sesamin. Although we only obtained those genes mRNA sequences, our findings also provide a basis for the further determinationDraft of enzyme catalytic activity and gene function analysis. Furthermore, we determined the expression levels of the candidate

genes involved in asarinin biosynthesis by qRT-PCR (Figure 7). Several genes such as

AsPAL, AsCCR, AsCOMT, AsCAD, and AsCYP81Q were highly expressed in roots that

is the asarinin accumulation tissue in A. sieboldii, which indicated that those genes were

involved in asarinin biosynthesis (Song et al. 2014; Cao et al. 2018). However, the

functions of these enzyme genes should be further confirmed by catalytic experiments

both in vivo and in vitro.
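The tissue comparison above rests on relative expression values from qRT-PCR. This section does not state how those values were calculated; a common choice is the 2^-ΔΔCt method, and the minimal sketch below assumes it. The function name and all Ct values are hypothetical and are not taken from the study.

```python
def relative_expression(ct_target, ct_reference, ct_target_cal, ct_reference_cal):
    """Fold change of a target gene relative to a calibrator sample (2^-ddCt).

    Assumes roughly 100% amplification efficiency for both primer pairs.
    """
    delta_sample = ct_target - ct_reference              # normalize to the reference gene
    delta_calibrator = ct_target_cal - ct_reference_cal  # same normalization for the calibrator
    return 2.0 ** -(delta_sample - delta_calibrator)


# Hypothetical Ct values for one candidate gene, NOT measured data.
hypothetical_ct = {
    "root": {"target": 22.0, "reference": 18.0},
    "stem": {"target": 24.0, "reference": 18.0},
    "leaf": {"target": 25.0, "reference": 18.0},
}

# Use leaf as the calibrator tissue (an arbitrary choice for illustration).
calibrator = hypothetical_ct["leaf"]
fold_changes = {
    tissue: relative_expression(
        ct["target"], ct["reference"],
        calibrator["target"], calibrator["reference"],
    )
    for tissue, ct in hypothetical_ct.items()
}

for tissue, fold in fold_changes.items():
    print(f"{tissue}: {fold:.1f}-fold relative to leaf")
```

With these made-up inputs, a lower normalized Ct in roots translates into a higher fold change, which is the kind of pattern that would be read as root-enriched expression.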

Plants from the Aristolochia genus have been used to treat various diseases for hundreds of years, but many species contain toxic aristolochic acids, which can cause nephropathy or cancer. The main aristolochic acids often accumulate in the aerial parts, including the fruits and seeds, of Aristolochia plants (Schaneberg and Khan 2004; Xue et al. 2008). An important and effective way to remove or reduce aristolochic acid content in Asarum is through genetic technology such as molecular breeding and genetic modification. Hence, the multiple enzyme genes in the aristolochic acid biosynthetic pathway need to be clarified. To date, only AhTYDCs have been identified as potentially crucial enzymes in aristolochic acid biosynthesis in A. heterotropoides, and there are few reports on other genes involved in aristolochic acid biosynthesis (Liu et al. 2017). In our study, by screening our sequencing data of A. sieboldii, we found 1 TYR, 6 TYDC, 26 NCS, 19 NOMT, 1 CNMT, and 12 CYP80B transcripts (Figure 6), which outline the flow of the aristolochic acid biosynthesis pathway from L-tyrosine to aristolochic acids. qRT-PCR (Figure 7) showed that AsNCS and AsNOMT were expressed at significantly higher levels in leaves, which are believed to be a site of aristolochic acid accumulation (Schaneberg et al. 2002; Xue et al. 2008). The catalytic functions of these genes, especially CYP80B, need further verification. Our findings provide candidate targets for gene silencing, which could reduce or block the expression of these enzyme genes and thereby reduce the toxicity of aristolochic acid-containing plants. This would enable their safe use and give new insight into the effective breeding of varieties with low toxicity.

Cytochrome P450s (CYPs) are monooxygenases that are ubiquitous and functionally versatile. As one of the largest gene families in plants, CYPs take part in reactions of both primary and secondary metabolism, such as the biosynthesis of lignin intermediates, sterols, terpenes, flavonoids, and other phytochemicals. For example, the P450 gene geraniol 10-hydroxylase (G10H) is involved in terpenoid indole alkaloid biosynthesis in Catharanthus roseus (Collu et al. 2001). CYP88D6 was shown to catalyze the sequential two-step oxidation of β-amyrin, a possible biosynthetic intermediate in the production of glycyrrhizin in Glycyrrhiza (Seki et al. 2008). CYP716A genes were linked to protopanaxadiol 6-hydroxylase, which catalyzes the formation of protopanaxatriol from protopanaxadiol in Panax ginseng (Han et al. 2012). In the pathways of interest here, CYP81Q1 was found in Sesamum to catalyze the formation of two methylenedioxy bridges to yield (+)-sesamin, the epimer of asarinin. CYP80B1 encodes the enzyme that catalyzes the 3′-hydroxylation of (S)-N-methylcoclaurine, a branch point in the biosynthesis of the central alkaloid intermediate (S)-reticuline in aristolochic acid biosynthesis (Pauli and Kutchan 1998; Morishige et al. 2000). Notably, genes of the cytochrome P450 subfamilies CYP81Q and CYP80B had not previously been found in A. sieboldii. With the aid of transcriptome sequencing, we identified 473 CYP genes in total (data not shown). Among these, we found 29 transcripts of CYP81Q and 12 transcripts of CYP80B, which provide candidate enzyme genes for the biosynthesis of asarinin and aristolochic acids and a basis for a deeper understanding of secondary metabolism in A. sieboldii.

Transcription factors are important regulators of the phenylpropanoid, alkaloid, and terpene pathways. For example, AtMYB75/PAP1 specifically regulates anthocyanin accumulation, whereas AtMYB123 regulates proanthocyanidin synthesis (Borevitz et al. 2000; Nesi et al. 2001). PpNAC1 is the main regulator of phenylalanine biosynthesis in maritime pine (Pascual et al. 2018). Likewise, ODORANT1 (ODO1) and EMISSION OF BENZENOIDS II (EOBII) have been identified as regulators of the volatile benzenoid/phenylpropanoid pathway in petunia (Van Moerkercke et al. 2011). By contrast, poplar MYB165 and MYB194 interact with bHLH131 and repress the activation of flavonoid promoters (Ma et al. 2018). MYBs often form transcriptional regulator complexes with bHLHs and WDRs that bind responsive elements in the promoter regions of genes involved in anthocyanin production and seed coat pigmentation (Baudry et al. 2004). In addition, WRKY TFs are well known for regulating abiotic and biotic stress responses and plant specialized metabolism, including phenylpropanoids, alkaloids, and terpenes. Two transcription factors, CjWRKY1 and CjbHLH1, which act as transcriptional activators that positively regulate berberine biosynthesis, have been isolated from Coptis japonica (Yamada et al. 2015). WRKYs from opium poppy were found to regulate thebaine and biosynthesis by comparative analysis of transcriptome datasets (Agarwal et al. 2016). In our data, bHLH, MYB, bZIP, and WRKY transcription factors were found; these could be candidate regulators of asarinin and aristolochic acid biosynthesis and need further experimental verification.

In this study, the whole transcriptome of A. sieboldii was sequenced on a PacBio single-molecule long-read sequencing platform. A total of 63,023 transcripts were generated, of which 48,945 were annotated against public databases. In addition, 10,869 lncRNAs, 11,291 target mRNAs and 15,029 SSRs were identified in A. sieboldii. In functional annotation, 19,864 isoforms were assigned to COG categories, and 37,045 transcripts were categorized into 51 functional groups within the GO classification. Additionally, 19,910 transcripts were mapped to 129 KEGG pathways. A total of 187 transcripts were assigned to phenylpropanoid biosynthesis, which is related to asarinin biosynthesis, and 71 transcripts were significantly enriched in alkaloid biosynthesis, which is related to the bio-origin of aristolochic acid. In our data, 97 candidate transcripts encoding 14 enzymes related to asarinin and methyleugenol biosynthesis were identified, 6 of which were reported for the first time. Our sequencing data and the KEGG database revealed 56 candidate genes encoding 6 enzymes related to aristolochic acid metabolism. Our transcriptome data provide a good basis for genetic engineering research aimed at producing a valuable medicinal ingredient by reducing the toxicity of, and effectively breeding, new A. sieboldii varieties.

Acknowledgement

This research was funded by the National Natural Science Foundation of China (grant numbers 31800554 and 31800259), the Natural Science Basic Research Plan in Shaanxi Province of China (grant number 2018JQ3033), the Outstanding Youth Scientific Research Project of Shaanxi Academy of Sciences (grant number 2017k-12), the Scientific Research Project of Shaanxi Province Bureau of Traditional Chinese Medicine (grant number 610902), and the Chinese Medicine Public Health Service Subsidy Special "National Traditional Chinese Medicine Resources Survey Project" in 2017 (grant number Financial society 2017-66).

Abbreviations: 4CL, 4-coumarate-CoA ligase; C3H, p-coumarate 3-hydroxylase; C4H, trans-cinnamate 4-monooxygenase; CAD, cinnamyl alcohol dehydrogenase; CCoAOMT, caffeoyl-CoA O-methyltransferase; CCR, cinnamoyl-CoA reductase; CFAT, coniferyl alcohol acyl transferase; CNCI, Coding-Non-Coding Index; CNMT, coclaurine N-methyltransferase; COG, Clusters of Orthologous groups; CPAT, Coding Potential Assessment Tool; CPC, Coding Potential Calculator; DIR, dirigent proteins; EGS, eugenol synthase; GO, Gene Ontology; IMOT/EOMT, (iso)/eugenol O-methyltransferase; KEGG, Kyoto Encyclopedia of Genes and Genomes; NCS, (S)-norcoclaurine synthase; NOMT, (S)-norcoclaurine-6-O-methyltransferase; PAL, phenylalanine ammonia-lyase; TYR, tyrosinase; TYDC, tyrosine dopa decarboxylase.

References

Agarwal P, Pathak S, Lakhwani D, Gupta P, Asif MH, Trivedi PK. 2016. Comparative analysis of transcription factor gene families from Papaver somniferum: identification of regulatory factors involved in benzylisoquinoline alkaloid biosynthesis. Protoplasma 253: 857-871. Ashburner M, Ball C A, Blake J A, Botstein D, H B, J M C, A P D, K D, S S D, J T E et al. 2000. Gene ontology: tool for the unification of biology. Nature Genetics 25: 25-29. Baudry A, Heim MA, Dubreucq B, Caboche M, Weisshaar B, Lepiniec L. 2004. TT2, TT8, and TTG1 synergistically specify the expression of BANYULS and proanthocyanidin biosynthesis in Arabidopsis thaliana. The Plant journal : for cell and molecular biology 39: 366-380. Beaudoin GA, Facchini PJ. 2014. Benzylisoquinoline alkaloid biosynthesis in opium poppy. Planta 240: 19-32. Beier S, Thiel T, Munch T, Scholz U, Mascher M. 2017. MISA-web: a web server for microsatellite prediction. Bioinformatics 33: 2583-2585. Borevitz JO, Xia Y, Blount J, Dixon RA, Lamb C. 2000. Activation tagging identifies a conserved MYB regulator of phenylpropanoid biosynthesis. Plant Cell 12: 2383-2394. Cao S, Li YM, Dai WQ, Han LT, Huang F, J.J. L, Zhou ZX, Wang Q. 2018. Content Determination of Asarinin in Different Parts of Asarum by HPLC. Journal of Hubei University of Chinese Medicine 20: 42-44. Chekanova JA. 2015. Long non-coding RNAs and their functions in plants. Current opinion in plant biology 27: 207-216. Chen J, Tang X, Ren C, Wei B, Wu Y, Wu Q, Pei J. 2018. Full-length transcriptome sequences and the identification of putative genes for flavonoid biosynthesis in safflower. BMC genomics 19: 548. Chen Y, Mao Y, Liu H, Yu F, Li S, Yin T. 2014. Transcriptome analysis of differentially expressed genes relevant to variegation in peach flowers. PloS one 9: e90842. Chen Y, Pan W, Jin S, Lin S. 2020. 
Combined metabolomic and transcriptomic analysis reveals key candidate genes involved in the regulation of flavonoid accumulation in Anoectochilus roxburghii. Process Biochemistry 91: 339-351. Chen Z, Sun X, Li Y, Yan Y, Yuan Q. 2017. Metabolic engineering of Escherichia coli for microbial synthesis of monolignols. Metab Eng 39: 102-109.


Collu G, Unver N, Peltenburg-Looman AM, van der Heijden R, Verpoorte R, Memelink J. 2001. Geraniol 10-hydroxylase, a cytochrome P450 enzyme involved in terpenoid indole alkaloid biosynthesis. FEBS letters 508: 215-220. Commission CP. 2015. Chinese Pharmacopoeia Part I, 2015 ed. Chemical Industry Press, Beijing. Dai Q, Wang M, Li Y, Li J. 2019. Amelioration of CIA by Asarinin is associated to a downregulation of TLR9/NF-κB and regulation of Th1/Th2/Treg expression. Biological and Pharmaceutical Bulletin 42: 1172-1178. Deng YY, Li JQ, Wu SF, Zhu YP, Chen YW, He FC, Chen YW, Deng Y, Li J, Wu S. 2006. Integrated NR Database in Protein Annotation System and Its Localization. Computer Engineering 32: 71-74. Dexter R, Qualley A, Kish CM, Ma CJ, Koeduka T, Nagegowda DA, Dudareva N, Pichersky E, Clark D. 2007. Characterization of a petunia acetyltransferase involved in the biosynthesis of the floral volatile isoeugenol. The Plant journal : for cell and molecular biology 49: 265-275. Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, Peluso P, Rank D, Baybayan P, Bettman B et al. 2009. Real-time DNA sequencing from single polymerase molecules. Science 323: 133-138. Foissac S, Sammeth M. 2007. ASTALAVISTA: dynamic and flexible analysis of alternative splicing events in custom gene datasets. Nucleic acids research 35: W297-299. Gang DR, Wang J, Dudareva N, Nam KH, Simon JE, Lewinsohn E, Pichersky E. 2001. An investigation of the storage and biosynthesis of phenylpropenes in sweet basil. Plant physiology 125: 539-555. Gu J, Zhang L, Wang Z, Chen Y, Zhang G, Zhang D, Wang X, Bai X, Li X, Lili Z. 2015. The effect of Asarinin on Toll-like pathway in rats after cardiac allograft implantation. Transplant Proc 47: 545-548. Guillaumie S, Goffner D, Barbier O, Martinant JP, Pichon M, Barriere Y. 2008. Expression of cell wall related genes in basal and ear internodes of silking brown-midrib-3, caffeic acid O-methyltransferase (COMT) down-regulated, and normal maize plants. 
BMC plant biology 8: 71. Guo D, Chen F, Inoue K, Blount JW, Dixon RA. 2001. Downregulation of caffeic acid 3-O- methyltransferase and caffeoyl CoA 3-O-methyltransferase in transgenic alfalfa. impacts on lignin structure and implications for the biosynthesis of G and S lignin. Plant Cell 13: 73-88. Guo J, Wang MH. 2010. Ultraviolet A-specific induction of anthocyanin biosynthesis and PAL expression in tomato (Solanum lycopersicum L.). Plant growth regulation 62: 1-8. Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, Couger MB, Eccles D, Li B, Lieber M et al. 2013. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature protocols 8: 1494-1512. Hackl T, Hedrich R, Schultz J, Forster F. 2014. proovread: large-scale high-accuracy PacBio correction through iterative short read consensus. Bioinformatics 30: 3004-3011. Han JY, Hwang HS, Choi SW, Kim HJ, Choi YE. 2012. Cytochrome P450 CYP716A53v2 catalyzes the formation of protopanaxatriol from protopanaxadiol during ginsenoside biosynthesis in Panax ginseng. Plant & cell physiology 53: 1535-1545. Hao D, Ma P, Mu J, Chen S, Xiao P, Peng Y, Huo L, Xu L, Sun C. 2012. De novo characterization of the root transcriptome of a traditional Chinese medicinal plant Polygonum cuspidatum. Science China Life sciences 55: 452-466.


Hu Z, Zhang Y, He Y, Cao Q, Zhang T, Lou L, Cai Q. 2020. Full-Length Transcriptome Assembly of Italian Ryegrass Root Integrated with RNA-Seq to Identify Genes in Response to Plant Cadmium Stress. International journal of molecular sciences 21. Jeong M, Kim HM, Lee JS, Choi JH, Jang DS. 2018a. (-)-Asarinin from the Roots of Asarum sieboldii Induces Apoptotic Cell Death via Caspase Activation in Human Ovarian Cancer Cells. Molecules 23. -. 2018b. (−)-Asarinin from the Roots of Asarum sieboldii Induces Apoptotic Cell Death via Caspase Activation in Human Ovarian Cancer Cells. Molecules 23: 1849. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M. 2004. The KEGG resource for deciphering the genome. Nucleic acids research 32: D277-D280. Kawaharamiki R, Wada K, Azuma N, Chiba S. 2011. Expression profiling without genome sequence information in a non-model species, pandalid shrimp (Pandalus latirostris), by next-generation sequencing. PloS one 6: e26043. Kim E, Kim HJ, Oh HN, Kwak AW, Kim SN, Kang BY, Yoon G. 2019a. Cytotoxic Constituents from the Roots of Asarum sieboldii in Human Breast Cancer Cells. Sciences 25: 72-75. Kim JA, Roy NS, Lee IH, Choi AY, Choi BS, Yu YS, Park NI, Park KC, Kim S, Yang HS et al. 2019b. Genome-wide transcriptome profiling of the medicinal plant Zanthoxylum planispinum using a single-molecule direct RNA sequencing approach. Genomics 111: 973-979. Kim JR, Perumalsamy H, Lee JH, Ahn YJ, Lee YS, Lee SG. 2016. Acaricidal activity of Asarum heterotropoides root-derived compounds and hydrodistillate constitutes toward Dermanyssus gallinae (Mesostigmata: Dermanyssidae). Experimental & applied acarology 68: 485-495. Koeduka T, Orlova I, Baiga TJ, Noel JP, Dudareva N, Pichersky E. 2009. The lack of floral synthesis and emission of isoeugenol in Petunia axillaris subsp. parodii is due to a mutation in the isoeugenol synthase gene. The Plant journal : for cell and molecular biology 58: 961-969. 
Koonin EV, Fedorova ND, Jackson JD, Jacobs AR, Krylov DM, Makarova KS, Rogozin IB. 2004. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome biology 5: R7. Kwon NJ, Garzia A, Espeso EA, Ugalde U, Yu JH. 2010. FlbC is a putative nuclear C2H2 transcription factor regulating development in Aspergillus nidulans. Molecular microbiology 77: 1203- 1219. Lawrence MK. 1998. Phylogenetic relationships in Asarum (Aristolochiaceae) based on morphology and ITS sequences. Am J Bot 85: 1454-1467. Lee JY, Moon SS, Hwang BK. 2005. Isolation and antifungal activity of kakuol, a propiophenone derivative from Asarum sieboldii rhizome. Pest Management Science: formerly Pesticide Science 61: 821-825. Li J, Ma W, Zeng P, Wang J, Geng B, Yang J, Cui Q. 2015. LncTar: a tool for predicting the RNA targets of long noncoding RNAs. Briefings in bioinformatics 16: 806-812. Li L, Eichten SR, Shimizu R, Petsch K, Yeh CT, Wu W, Chettoor AM, Givan SA, Cole RA, Fowler JE et al. 2014. Genome-wide discovery and characterization of maize long non-coding RNAs. Genome Biol 15: R40. Li W, Godzik A. 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22: 1658-1659.


Lin MY, Zheng L, Liu JJ, Ji PP, Liu Z. 2018. Cloning and Bioinformatics Analysis of Phenylalanine Ammonia-lyase Gene in Asarum sieboldii. Chinese Journal of Experimental Traditional Medical Formulae 1: 7. Liu L, Chong X, Zhang H, Liu F, Ma D, Z.. L. 2018. Comparative Transcriptomics Analysis for Gene Mining and Identification of a Cinnamyl Alcohol Dehydrogenase Involved in Methyleugenol Biosynthesis from Asarum sieboldii Miq. Molecules 23: 3184. Liu R, Xu S, Li J., Lin Z. 2006. Expression profile of a PAL gene from Astragalus membranaceus var. Mongholicus and its crucial role in flux into flavonoid biosynthesis. Plant cell reports 25: 705-710. Liu Y, Wang Y, Guo F, Zhan L, Mohr T, Cheng P, Huo N, Gu R, Pei D, Sun J et al. 2017. Deep sequencing and transcriptome analyses to identify genes involved in secoiridoid biosynthesis in the Tibetan medicinal plant Swertia mussotii. Scientific reports 7: 43108. Lu F, Marita JM, Lapierre C, Jouanin L, Morreel K, Boerjan W, Ralph J. 2010. Sequencing around 5-hydroxyconiferyl alcohol-derived units in caffeic acid O-methyltransferase-deficient poplar lignins. Plant physiology 153: 569-579. Ma D, Constabel CP. 2019. MYB repressors as regulators of phenylpropanoid metabolism in plants. Trends in plant science 24: 275-289. Ma D, Reichelt M, Yoshida K, Gershenzon J, Constabel CP. 2018. Two R2R3-MYB proteins are broad repressors of flavonoid and phenylpropanoid metabolism in poplar. The Plant journal : for cell and molecular biology 96: 949-965. Maher E. A., Bate N. J., Ni W., Elkind Y, Dixon RA, J. LC. 1994. Increased disease susceptibility of transgenic plants with suppressed levels of preformed phenylpropanoid products. Proceedings of the National Academy of Sciences 91: 7802-7806. Marques AC, Ponting CP. 2014. Intergenic lncRNAs and the evolution of gene expression. Current opinion in genetics & development, 27: 48-53. Meng Y, Yu D, Xue J, Lu J, Feng S, Shen C, Wang H. 2016. 
A transcriptome-wide, organ-specific regulatory map of Dendrobium officinale, an important traditional Chinese orchid herb. Scientific reports 6: 18864. Minami H, Kim JS, Ikezawa N, Takemura T, Katayama T, Kumagai H, Sato F. 2008. Microbial production of plant benzylisoquinoline alkaloids. Proceedings of the National Academy of Sciences of the United States of America 105: 7393-7398. Morishige T, Tsujita T, Yamada Y, Sato F. 2000. Molecular characterization of the S-adenosyl-L- methionine:3'-hydroxy-N-methylcoclaurine 4'-O-methyltransferase involved in isoquinoline alkaloid biosynthesis in Coptis japonica. The Journal of biological chemistry 275: 23398- 23405. Nesi N, Jond C, Debeaujon I, Caboche M, Lepiniec L. 2001. The Arabidopsis TT2 gene encodes an R2R3 MYB domain protein that acts as a key determinant for proanthocyanidin accumulation in developing seed. Plant Cell 13: 2099-2114. Pascual MB, Llebres MT, Craven-Bartle B, Canas RA, Canovas FM, Avila C. 2018. PpNAC1, a main regulator of phenylalanine biosynthesis and utilization in maritime pine. Plant Biotechnol J 16: 1094-1104. Pauli HH, Kutchan TM. 1998. Molecular cloning and functional heterologous expression of two alleles encoding (S)-N-methylcoclaurine 3'-hydroxylase (CYP80B1), a new methyl jasmonate-


inducible cytochrome P-450-dependent mono-oxygenase of benzylisoquinoline alkaloid biosynthesis. The Plant journal : for cell and molecular biology 13: 793-801. Quang TH, Ngan NT, Minh CV, Kiem PV, Tai BH, Thao NP, Song SB, Kim YH. 2012. Anti-inflammatory and PPAR transactivational effects of secondary metabolites from the roots of Asarum sieboldii. Bioorganic & medicinal chemistry letters 22: 2527-2533. Ren P, Meng Y, Li B, Ma X, Si E, Lai Y, Wang J, Yao L, Yang K, Shang X et al. 2018. Molecular Mechanisms of Acclimatization to Phosphorus Starvation and Recovery Underlying Full-Length Transcriptome Profiling in Barley (Hordeum vulgare L.). Frontiers in plant science 9: 500. Robert D. Finn, Alex Bateman, Jody Clements, Penelope Coggill, Ruth Y. Eberhardt, Sean R. Eddy, Andreas Heger, Kirstie Hetherington, Liisa Holm, Mistry J. 2013. Pfam: the protein families database. Nucleic acids research 42: D222-D230. Roberts RJ, Carneiro MO, Schatz MC. 2013. The advantages of SMRT sequencing. Genome biology 14: 405. Rolf A., Amos B., Cathy H.W., Winona C. B., Brigitte BS, Ferro ; Elisabeth, Gasteiger ; Hongzhan, Huang ; Rodrigo, Lopez ; Michele, Magrane ; Maria J, Martin ; Darren A, Natale ; Claire, O'Donovan ; Nicole, Redaschi ; Lai-Su L, Yeh. 2004. UniProt: the Universal Protein knowledgebase. Nucleic acids research 1: 115-119. Samanani N, Alcantara J, Bourgault R, Zulak KG, Facchini PJ. 2006. The role of phloem sieve elements and laticifers in the biosynthesis and accumulation of alkaloids in opium poppy. The Plant journal : for cell and molecular biology 47: 547-563. Samanani N, Facchini PJ. 2002. Purification and characterization of norcoclaurine synthase. The first committed enzyme in benzylisoquinoline alkaloid biosynthesis in plants. The Journal of biological chemistry 277: 33878-33883. Schaneberg BT, Applequist WL, Khan IA. 2002. Determination of aristolochic acid I and II in North American species of Asarum and Aristolochia. Pharmazie 57: 686-689. 
Schaneberg BT, Khan IA. 2004. Analysis of products suspected of containing Aristolochia or Asarum species. J Ethnopharmacol 94: 245-249. Schutte HR, Orban U., K. M. 1967. Biosynthesis of Aristolochic Acid. European J Biochem 1: 70-72. Seki H, Ohyama K, Sawai S, Mizutani M, Ohnishi T, Sudo H, Akashi T, Aoki T, Saito K, Muranaka T. 2008. Licorice beta-amyrin 11-oxidase, a cytochrome P450 with a key role in the biosynthesis of the triterpene sweetener glycyrrhizin. Proceedings of the National Academy of Sciences of the United States of America 105: 14204-14209. Sharopova N, McMullen MD, Schultz L, Schroeder S, Sanchez-Villeda H, Gardiner J, Bergstrom D, Houchins K, Melia-Hancock S, Musket T et al. 2002. Development and mapping of SSR markers for maize. Plant molecular biology 48: 463-481. Shi CY, Yang H, Wei CL, Yu O, Zhang ZZ, Jiang CJ, Sun J, Li YY, Chen Q, Xia T et al. 2011. Deep sequencing of the Camellia sinensis transcriptome revealed candidate genes for major metabolic pathways of tea-specific compounds. BMC genomics 12: 131. Shin BK, Lee EH, Kim HM. 1997. Suppression of L-histidine decarboxylase mRNA expression by methyleugenol. Biochem Biophys Res Commun 232: 188-191. Singh K, Kumar S, Rani A, Gulati A, Ahuja PS. 2009. Phenylalanine ammonia-lyase (PAL) and cinnamate 4-hydroxylase (C4H) and catechins (flavan-3-ols) accumulation in tea. Functional & integrative genomics 9: 125-134.


Song SH, Se LG, Chen B, L.J. C, P. S. 2014. Determination of aristolochic acid I and asarinin in four species of Asari Radix et Rhizoma by HPLC. Chinese Traditional Patent Medicine 36: 1711-1715. Sun C, Li Y, Wu Q, Luo H, Sun Y, Song J, Lui EM, Chen S. 2010. De novo sequencing and analysis of the American ginseng root transcriptome using a GS FLX Titanium platform to discover putative genes involved in ginsenoside biosynthesis. BMC genomics 11: 262. Susete Alves-Carvalho GA, Sebastien Carrere, Corinne Cruaud, Anne-Lise Brochot, Francoise Jacquin, Anthony Klein, Chantal Martin, Karen Boucherot, Jonathan Kreplak, Corinneda Silva, Sandra Moreau, Pascal Gamas, Patrick Wincker, Jerome Gouzy and Judith Burstin. 2015. Full-length de novo assembly of RNA-seq data in pea (Pisum sativum L.) provides a gene expression atlas and gives insights in root nodulation in this species. the Plant Journal 84: 1-19. Tang F, Tang Q, Tian Y, Fan Q, Huang Y, Tan X. 2015. Network pharmacology-based prediction of the active ingredients and potential targets of Mahuang Fuzi Xixin decoction for application to allergic rhinitis. J Ethnopharmacol 176: 402-412. Tatusov RL, Galperin MY, Natale DA, Koonin EV. 2000. The COG database: a tool for genome scale analysis of protein functions and evolution. Nucleic acids research 28: 33-36. Trabucco GM, Matos DA, Lee SJ, Saathoff AJ, Priest HD, Mockler TC, Sarath G, Hazen SP. 2013. Functional characterization of cinnamyl alcohol dehydrogenase and caffeic acid O-methyltransferase in Brachypodium distachyon. BMC Biotechnol 13: 61. Tsutsui T, Yamaji N, Feng Ma J. 2011. Identification of a cis-acting element of ART1, a C2H2-type zinc-finger transcription factor for aluminum tolerance in rice. Plant physiology 156: 925-931. Tuan PA, Park NI, Li X, Xu H, Kim HH, Park SU. 2010. Molecular cloning and characterization of phenylalanine ammonia-lyase and cinnamate 4-hydroxylase in the phenylpropanoid biosynthesis pathway in garlic (Allium sativum). 
Journal of agricultural and food chemistry 58: 10911-10917. Van Moerkercke A, Haring MA, Schuurink RC. 2011. The transcription factor EMISSION OF BENZENOIDS II activates the MYB ODORANT1 promoter at a MYB binding site specific for fragrant petunias. The Plant journal : for cell and molecular biology 67: 917-928. Vanherweghem JL, Depierreux M, Tielemans C, Abramowicz D, Dratwa M, Jadoul M, Richard C, Vandervelde D, Verbeelen D, Vanhaelen-Fastre R et al. 1993. Rapidly progressive interstitial renal fibrosis in young women: association with slimming regimen including Chinese herbs. Lancet 341: 387-391. Vanholme R, Ralph J, Akiyama T, Lu F, Pazo JR, Kim H, Christensen JH, Van Reusel B, Storme V, De Rycke R et al. 2010. Engineering traditional monolignols out of lignin by concomitant up- regulation of F5H1 and down-regulation of COMT in Arabidopsis. The Plant journal : for cell and molecular biology 64: 885-897. Wang X, Xu F, Zhang H, Peng L, Zhen Y, Wang L, Xu Y, He D, Li X. 2018. Orthogonal test design for optimization of the extraction of essential oil from Asarum heterotropoides var. Mandshuricum and evaluation of its antibacterial activity against periodontal pathogens. 3 Biotech 8: 473. Wu X, Wang S, Lu J, Jing Y, Li M, Cao J, Bian B, Hu C. 2018. Seeing the unseen of Chinese herbal medicine processing (Paozhi): advances in new perspectives. Chinese medicine 13: 4.


Xu Q, Zhu J, Zhao S, Hou Y, Li F, Tai Y, Wan X, Wei C. 2017. Transcriptome Profiling Using Single-Molecule Direct RNA Sequencing Approach for In-depth Understanding of Genes in Secondary Metabolism Pathways of Camellia sinensis. Frontiers in plant science 8: 1205. Xu Z, Luo H, Ji A, Zhang X, Song J, Chen S. 2016. Global Identification of the Full-Length Transcripts and Alternative Splicing Related to Phenolic Acid Biosynthetic Genes in Salvia miltiorrhiza. Frontiers in plant science 7: 100. Xu Z, Peters RJ, Weirather J, Luo H, Liao B, Zhang X, Zhu Y, Ji A, Zhang B, Hu S et al. 2015. Full-length transcriptome sequences and splice variants obtained by a combination of sequencing platforms applied to different root tissues of Salvia miltiorrhiza and tanshinone biosynthesis. The Plant journal : for cell and molecular biology 82: 951-961. Xue Y, Tong XH, Wang F, Zhao WG. 2008. [Analysis of aristolochic acid A from the aerial and underground parts of Asarum by UPLC-UV]. Yao Xue Xue Bao 43: 221-223. Yamada Y, Motomura Y, Sato F. 2015. CjbHLH1 homologs regulate sanguinarine biosynthesis in Eschscholzia californica cells. Plant & cell physiology 56: 1019-1030. Yang Y, Jiang XT, Zhang T. 2014. Evaluation of a hybrid approach using UBLAST and BLASTX for metagenomic sequences annotation of specific functional genes. PloS one 9: e110947. Ye J, Wang G, Tan J, Zheng J, Zhang X, Xu F, Liao Y. 2019. Identification of candidate genes involved in anthocyanin accumulation using Illmuina-based RNA-seq in peach skin. Scientia Horticulturae 250: 184-198. Zamudio F, Kujawska M, I Hilgert N. 2010. Honey as medicinal and food resource. Comparison between Polish and multiethnic settlements of the Atlantic Forest, Misiones, Argentina. The Open Complementary Medicine Journal 2: 58-73. Zhang B, Liu J, Wang X, Wei Z. 2018. Full-length RNA sequencing reveals unique transcriptome composition in bermudagrass. Plant physiology and biochemistry : PPB 132: 95-103. 
Zhang MF, Jiang LM, Zhang DM, Jia GX. 2015a. De novo transcriptome characterization of Lilium 'Sorbonne' and key enzymes related to the flavonoid biosynthesis. Molecular genetics and genomics : MGG 290: 399-412. Zhang X, Allan AC, Li C, Wang Y, Yao Q. 2015b. De Novo Assembly and Characterization of the Transcriptome of the Chinese Medicinal Herb, Gentiana rigescens. International journal of molecular sciences 16: 11550-11573. Zheng Y, Jiao C, Sun H, Rosli HG, Pombo MA, Zhang P, Banf M, Dai X, Martin GB, Giovannoni JJ et al. 2016. iTAK: A Program for Genome-wide Prediction and Classification of Plant Transcription Factors, Transcriptional Regulators, and Protein Kinases. Molecular plant 9: 1667-1670. Ziegler J, Facchini PJ. 2008. Alkaloid biosynthesis: metabolism and trafficking. Annu Rev Plant Biol 59: 735-769.

Figure Captions

Table 1 Summary of the SMRT sequencing data

Table 2 Distribution and frequency of simple sequence repeats (SSRs) based on the numbers of repeat motifs analyzed in the A. sieboldii transcriptome

Table 3 The KEGG annotated pathways and number of unigenes in A. sieboldii

Figure 1 (A) Read length distribution and cumulative length frequency of consensus isoforms in A. sieboldii. (B) The number of lncRNAs predicted by the four methods: Coding-Non-Coding Index (CNCI), Coding Potential Calculator (CPC), Pfam and Coding Potential Assessment Tool (CPAT). (C) Species distribution of transcript annotations based on the top BLASTX hits.

Figure 2 (A) COG and (B) eggNOG annotation and classification of transcripts.

Figure 3 Gene Ontology (GO) classification analysis of transcripts. The x-axis shows the GO function classes. The right y-axis shows the number of genes with the GO function, and the left y-axis shows the percentage.

Figure 4 Prediction of transcriptional factors and transcript statistics.

Figure 5 A proposed pathway for methyleugenol and asarinin biosynthesis in A. sieboldii. Abbreviations: 4CL, 4-coumarate-CoA ligase; C3H, p-coumarate 3-hydroxylase; C4H, trans-cinnamate 4-monooxygenase; CFAT, coniferyl alcohol acyl transferase; CAD, cinnamyl alcohol dehydrogenase; CCoAOMT, caffeoyl-CoA O-methyltransferase; CCR, cinnamoyl-CoA reductase; CYP81Q, a member of the cytochrome P450 family; DIR, dirigent proteins; EGS, eugenol synthase; IMOT/EOMT, (iso)eugenol O-methyltransferase; PAL, phenylalanine ammonia-lyase. Bracketed numbers represent the number of transcripts.

Figure 6 A proposed biosynthetic pathway for aristolochic acids in A. sieboldii. Abbreviations: TYR, tyrosinase; TYDC, tyrosine/dopa decarboxylase; NCS, (S)-norcoclaurine synthase; NOMT, (S)-norcoclaurine-6-O-methyltransferase; CNMT, coclaurine N-methyltransferase; CYP80B, a member of the cytochrome P450 family. Bracketed numbers represent the number of transcripts.

Figure 7 The expression patterns of candidate genes associated with asarinin and aristolochic acid biosynthesis. Note: different letters (a, b, and c) indicate significant differences between samples.


Table 1 Summary of the SMRT sequencing data

| Statistical data | A. sieboldii |
|---|---|
| Library | 1-6 kb |
| Number of circular consensus (CCS) | 348,913 |
| Read bases of CCS | 439,086,195 |
| Mean read length of CCS | 1,258 |
| Mean number of passes | 62 |
| Number of undesired primer reads | 60,045 |
| Number of filtered short reads | 92 |
| Number of full-length nonchimeric reads | 268,055 |
| Full-length nonchimeric percentage (FLNC%) | 76.83% |
| Number of consensus isoforms | 118,486 |
| Average consensus isoforms read length | 1,160 |
| Number of polished high-quality isoforms | 114,026 |
| Number of polished low-quality isoforms | 3,276 |
| Percent of polished high-quality isoforms | 96.24% |
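The derived values in Table 1 can be recomputed from the raw counts. The sketch below is only an illustration: the variable names are hypothetical, and it assumes that FLNC% is taken relative to the number of CCS reads and that the high-quality percentage is taken relative to the number of consensus isoforms, which the table does not state explicitly.

```python
# Raw counts copied from Table 1 (variable names are illustrative only).
ccs_reads = 348_913            # number of circular consensus (CCS) reads
ccs_bases = 439_086_195        # read bases of CCS
flnc_reads = 268_055           # full-length nonchimeric reads
consensus_isoforms = 118_486   # number of consensus isoforms
hq_isoforms = 114_026          # polished high-quality isoforms

# Assumed denominators: CCS reads for FLNC%, consensus isoforms for HQ%.
mean_ccs_length = round(ccs_bases / ccs_reads)
flnc_pct = round(100 * flnc_reads / ccs_reads, 2)
hq_pct = round(100 * hq_isoforms / consensus_isoforms, 2)

print(f"Mean CCS read length: {mean_ccs_length} bp")
print(f"FLNC%: {flnc_pct:.2f}%")
print(f"High-quality isoforms: {hq_pct:.2f}%")
```

Under these assumptions the recomputed values agree with the mean read length and the two percentages reported in the table.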


Table 2 Distribution and frequency of simple sequence repeats (SSRs) based on the number of repeat motifs analyzed in the A. sieboldii transcriptome

| Statistical data | Value |
|---|---|
| Total number of sequences examined | 52,238 |
| Total size of examined sequences (bp) | 83,342,523 |
| Total number of identified SSRs | 15,029 |

| SSR motif | 5 repeats | 6 | 7 | 8 | 9 | 10 | >10 | Total | Percentage (%) |
|---|---|---|---|---|---|---|---|---|---|
| Mononucleotide | 0 | 0 | 0 | 0 | 0 | 2,932 | 3,077 | 6,009 | 39.98 |
| Dinucleotide | 0 | 1,132 | 749 | 549 | 408 | 283 | 1,355 | 4,476 | 29.78 |
| Trinucleotide | 1,369 | 496 | 221 | 128 | 87 | 61 | 95 | 2,457 | 16.35 |
| Tetranucleotide | 47 | 39 | 18 | 5 | 0 | 1 | 0 | 110 | 0.73 |
| Pentanucleotide | 25 | 2 | 3 | 1 | 0 | 3 | 1 | 35 | 0.23 |
| Hexanucleotide | 45 | 32 | 10 | 5 | 1 | 0 | 5 | 98 | 0.65 |
| SSR in compound formation | - | - | - | - | - | - | - | 1,844 | 12.27 |
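The percentage column of Table 2 follows from dividing each motif class total by the total number of identified SSRs. A minimal sketch of that arithmetic is given below; the dictionary and variable names are hypothetical and are not part of the original analysis pipeline.

```python
# Per-class SSR totals copied from Table 2 (names are illustrative only).
ssr_counts = {
    "Mononucleotide": 6009,
    "Dinucleotide": 4476,
    "Trinucleotide": 2457,
    "Tetranucleotide": 110,
    "Pentanucleotide": 35,
    "Hexanucleotide": 98,
    "SSR in compound formation": 1844,
}

# The grand total should match the reported number of identified SSRs.
total_ssrs = sum(ssr_counts.values())

# Share of each motif class, rounded to two decimals as in the table.
percentages = {
    motif: round(100 * count / total_ssrs, 2)
    for motif, count in ssr_counts.items()
}

print(f"Total SSRs: {total_ssrs}")
for motif, pct in percentages.items():
    print(f"{motif}: {pct:.2f}%")
```

The class totals sum to the 15,029 SSRs reported in the table, and the recomputed shares match the listed percentages.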


Table 3 The KEGG annotated pathways and number of unigenes in A. sieboldii

| Pathway | KO ID | Number of transcripts | P value |
|---|---|---|---|
| Plant hormone signal transduction | ko04075 | 366 | 3.69E-10 |
| Cyanoamino acid metabolism | ko00460 | 140 | 7.92E-07 |
| Circadian rhythm-plant | ko04712 | 106 | 2.36E-06 |
| Biosynthesis of amino acids | ko01230 | 839 | 1.12E-05 |
| Plant-pathogen interaction | ko04626 | 359 | 4.57E-05 |
| Carbon fixation in photosynthetic organisms | ko00710 | 316 | 1.69E-04 |
| Phenylalanine, tyrosine and tryptophan biosynthesis | ko00400 | 159 | 1.71E-04 |
| Phenylpropanoid biosynthesis | ko00940 | 187 | 1.26E-03 |
| Cysteine and methionine metabolism | ko00270 | 325 | 2.52E-03 |
| Folate biosynthesis | ko00790 | 47 | 4.22E-03 |
| Isoquinoline alkaloid biosynthesis | ko00950 | 71 | 5.47E-03 |
| Tyrosine metabolism | ko00350 | 144 | 5.78E-03 |
| Porphyrin and chlorophyll metabolism | ko00860 | 114 | 6.36E-03 |
| Regulation of autophagy | ko04140 | 98 | 6.39E-03 |
| Diterpenoid biosynthesis | ko00904 | 32 | 1.29E-02 |
| Glutathione metabolism | ko00480 | 250 | 1.47E-02 |
| Carotenoid biosynthesis | ko00906 | 73 | 2.42E-02 |
| Zeatin biosynthesis | ko00908 | 22 | 2.86E-02 |
| Glucosinolate biosynthesis | ko00966 | 27 | 3.14E-02 |
| Tropane, piperidine and pyridine alkaloid biosynthesis | ko00960 | 53 | 3.75E-02 |


Figure 1

Figure 1 (A) Read length distribution and cumulative length frequency of consensus isoforms in A. sieboldii. (B) The number of lncRNAs predicted by four methods: Coding-Non-Coding Index (CNCI), Coding Potential Calculator (CPC), Pfam and Coding Potential Assessment Tool (CPAT). (C) Species distribution of transcript annotations based on the top BLASTX hits.

Figure 2

Figure 2 (A) COG and (B) eggNOG annotation and classification of transcripts.

Figure 3

Figure 3 Gene Ontology (GO) classification analysis of transcripts. The x-axis shows the GO function classes. The right y-axis shows the number of genes with the GO function, and the left y-axis shows the percentage.

Figure 4

Figure 4 Prediction of transcription factors and transcript statistics.

Figure 5

Figure 5 A proposed pathway for methyleugenol and asarinin biosynthesis in A. sieboldii. Abbreviations: 4CL, 4-coumarate-CoA ligase; C3H, p-coumarate 3-hydroxylase; C4H, trans-cinnamate 4-monooxygenase; CFAT, coniferyl alcohol acyl transferase; CAD, cinnamyl alcohol dehydrogenase; CCoAOMT, caffeoyl-CoA O-methyltransferase; CCR, cinnamoyl-CoA reductase; CYP81Q, a member of the cytochrome P450 family; DIR, dirigent proteins; EGS, eugenol synthase; IMOT/EOMT, (iso)eugenol O-methyltransferase; PAL, phenylalanine ammonia-lyase. Bracketed numbers represent the number of transcripts.

Figure 6

Figure 6 A proposed biosynthetic pathway for aristolochic acids in A. sieboldii. Abbreviations: TYR, tyrosinase; TYDC, tyrosine/dopa decarboxylase; NCS, (S)-norcoclaurine synthase; NOMT, (S)-norcoclaurine-6-O-methyltransferase; CNMT, coclaurine N-methyltransferase; CYP80B, a member of the cytochrome P450 family. Bracketed numbers represent the number of transcripts.

Figure 7

[Bar chart: expression levels (y-axis) of candidate genes (x-axis) in roots, stems and leaves.]

Figure 7 The expression patterns of candidate genes associated with asarinin and aristolochic acid biosynthesis. Note: different letters (a, b, and c) indicate significant differences between samples.