Gene 528 (2013) 120–131

Contents lists available at ScienceDirect

Gene

journal homepage: www.elsevier.com/locate/gene

The complete chloroplast genome sequence of Mahonia bealei (Berberidaceae) reveals a significant expansion of the inverted repeat and phylogenetic relationship with other angiosperms

Ji Ma a, Bingxian Yang a, Wei Zhu a, Lianli Sun a, Jingkui Tian a,⁎, Xumin Wang b,⁎ a College of Biomedical Engineering & Instrument Science, Zhejiang University, 38 Zheda Road, Hangzhou, Zhejiang, China b Beijing Institute of Genomics, Chinese Academy of Science, Beijing, China article info abstract

Article history: Mahonia bealei (Berberidaceae) is a frequently-used traditional Chinese medicinal with efficient anti- Accepted 18 July 2013 inflammatory ability. This plant is one of the sources of berberine, a new cholesterol-lowering drug with anti- Available online 27 July 2013 diabetic activity. We have sequenced the complete nucleotide sequence of the chloroplast (cp) genome of M. bealei. The complete cp genome of M. bealei is 164,792 bp in length, and has a typical structure with large Keywords: (LSC 73,052 bp) and small (SSC 18,591 bp) single-copy regions separated by a pair of inverted repeats (IRs Mahonia bealei 36,501 bp) of large size. The Mahonia cp genome contains 111 unique genes and 39 genes are duplicated in Chloroplast IR expansion the IR regions. The gene order and content of M. bealei are almost unarranged which is consistent with the SSRs hypothesis that large IRs stabilize cp genome and reduce gene loss-and-gain probabilities during evolutionary Molecular dating process. A large IR expansion of over 12 kb has occurred in M. bealei,15genes(rps19, rpl22, rps3, rpl16, rpl14, RNA editing rps8, infA, rpl36, rps11, petD, petB, psbH, psbN, psbT and psbB) have expanded to have an additional copy in the IRs. The IR expansion rearrangement occurred via a double-strand DNA break and subsequence repair, which is dif- ferent from the ordinary gene conversion mechanism. Repeat analysis identified 39 direct/inverted repeats 30 bp or longer with a sequence identity ≥90%. Analysis also revealed 75 simple sequence repeat (SSR) loci and almost all are composed of A or T, contributing to a distinct bias in base composition. Comparison of protein-coding sequences with ESTs reveals 9 putative RNA edits and 5 of them resulted in non-synonymous modifications in rpoC1, rps2, rps19 and ycf1. Phylogenetic analysis using maximum parsimony (MP) and maximum likelihood (ML) was performed on a dataset composed of 65 protein-coding genes from 25 taxa, which yields an identical tree topology as previous plastid-based trees, and provides strong support for the sister relationship between Ranunculaceae and Berberidaceae. Molecular dating analyses suggest that Ranunculaceae and Berberidaceae diverged between 90 and 84 mya, which is congruent with the fossil records and with recent estimates of the divergence time of these two taxa. © 2013 Elsevier B.V. All rights reserved.

1. Introduction large-scale genomic rearrangement and gene loss-and-gain events are also identified in some (Bausher et al., 2006). The probability Chloroplast (cp) is a photosynthetic organelle that provides essen- of gene loss-and-gain events and genomic rearrangements is closely re- tial energy for plant cells (Dyall et al., 2004). Except for photosynthesis, lated to the size of IRs (Wu et al., 2007), and large IRs can help stabilize lots of biosynthetic activities take place in chloroplast, such as the pro- the rest of the part of the cp genome and prevent gene loss-and-gain duction of certain amino acids and lipids, as well as certain pigments rearrangements to take place (Xiao et al., 2008; Yang et al., 2010). Be- in flowers and several key pathways of sulphur and nitrogen metabo- cause of the existence of large IRs (~23–26 kb), the relative sizes of lism. It is also considered to have association with the plant's immune LSC, SSC and IRs of most angiosperms remain constant, and gene response (Pyke, 1999). In angiosperms, most cp genomes are circular order and content are well conserved (Shinozaki et al., 1986). However, DNA molecules ranging from 120 to 160 kb in length and highly con- cpDNAs which have lost some of its inverted repeat segments tend to served in gene content and orders (Wicke et al., 2011). However, have more rearrangements and gene loss conditions, the cp genome of gymnosperms such as pines and cypresses has lost one of the rele- vant large inverted repeats (Kolodner and Tewari, 1979), and gene Abbreviations: Cp, chloroplast; cpDNA, chloroplast DNA; IGS, intergenic spacer; Indel, loss-and-gain events and structural rearrangements were more fre- insertion/deletion; IR, inverted repeat; LSC, large single-copy region; SSC, single-copy quently found in these species (Yi et al., 2013). region. ⁎ Corresponding authors. Tel.: +86 571 87951301; fax: +86 571 8795 1091. Mahonia bealei (Berberidaceae) is a traditional Chinese herb that E-mail addresses: [email protected] (J. Tian), [email protected] (X. Wang). deals with all sorts of inflammation, ranging from teeth inflammation,

0378-1119/$ – see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.gene.2013.07.037 J. Ma et al. / Gene 528 (2013) 120–131 121 pneumonia to mastitis (Ferris and Zheng, 1999). This plant is a great functional classification of the annotated genes was referred to DOGMA. medicinal value because it is one of the sources of berberine, which is The circular map of the M. bealei cp genome was drawn by the OGdraw considered to be a new cholesterol-lowering drug (Kong et al., 2004). online tool (Lohse et al., 2007). Recently, there has been extensive research focused on the biochemical activities and mechanisms of actions of this anti-diabetic compound 2.3. Repeat structure as berberine has shown to have anti-diabetic properties (Lee et al., 2006). As the population with hyperglycemia and hyperlipidemia is The REPuter program (Kurtz et al., 2001) was used to assess both increasing rapidly, the unique medicinal value of this plant is gaining direct (forward) and inverted (palindrome) repeats within the cp more and more attention. It is sure that access to genetic information genome. The identity and the size of the repeats were limited to no of M. bealei will not only improve the genetic large-scale breeding and less than 90% (hamming distance equal to 3) and 30 bp in unit length, selection of germplasm process of this species, but also facilitate further respectively (−f −p −l30−h3−best 10000). We ran the same usage of sequence data, such as phylogenetic and transplastomic analy- REPuter analyses against the other three cp genomes of Ranunculales sis for the family of Berberidaceae. species which were also used for mVISTA to assess the relative number Currently, there are only 46 sequences available for M. bealei, of the repeat sequences in M. bealei cp genome. MISA (Thiel et al., 2003) including 34 mitochondrial DNA sequences and 12 chloroplast DNA were used to detect simple sequence repeats (SSRs), with thresholds of sequences, listed in GenBank (http://www.ncbi.nlm.nih.gov/genbank/ mononucleotide repeats ≥10 bases, dinucleotide repeats ≥12 bases, nuccore/?term=mahonia+bealei). Due to the lack of chloroplast tri- and tetranucleotide repeats ≥15 bases, hexanucleotide or greater genome information of this medicinal plant, there is an urgent need to repeats ≥24 bases. develop further genetic resources for it. Here, we report the complete cp genome sequence of M. bealei, using one of the next-generation 2.4. RNA editing sequencing (NGS) technologies — pyrosequencing (Illumina Genome Analyzer II), to mainly address two particular questions. First, we char- Positional determination of potential RNA edits was accomplished acterized the unique structure of this genome, and analyzed the differ- by aligning publicly available M. bealei EST sequences obtained from ences comparatively with its relative species. Second, we performed GenBank to our assembly. BioEdit (Hall, 1999) was used to identify phylogenetic analyses based on 65 protein-coding genes from 25 taxa, genome loci that have more than one nucleotide type as candidates and molecular dating analyses to determine the divergent time of for base changes. We only considered those loci with nucleotide Ranunculaceae and Berberidaceae, which may contribute to a better sequencing quality value greater than 20 and the limit of the number understanding of the evolution progress of the Mahonia species. Addi- of aligned EST reads for each selected locus is set to be above 4. tionally, we also described details of the genome assembly, annotation, and nucleotide polymorphisms of M. bealei. 2.5. Phylogenetic and divergence time analysis

2. Materials and methods The data set consisted of concatenated nucleotide sequences of 65 protein coding genes including atpA, B, E, F, H, I, ccsA, cemA , ndhA, B, C, 2.1. DNA sequencing and genome assembly D, E, F, G, H, I, J, K, petA, B, D, G, L, N, psaA, B, C, I, J, B, C, psbD, E, F, H, I, J, K, L, M, N, T, Z, rbcL, rpl2, 16, 20, 23, 32, 33, 36, rps2, 3, 4, 7, 8, 11, 12, 14, Fresh green leaves from adult were collected for the prepara- 15, 16, 18, 19 and matK extracted from 25 chloroplast genome tion of genomic DNA extraction. 5 μg purified DNA was used for the sequences representing all lineages of angiosperms. These genes were construction of cpDNA libraries. extracted from complete cp genome sequences in the GenBank data- Since the original sequence reads are a mixture of DNA from nucleus base with local perl scripts. Sequences were aligned using MAFFT and organelles, BLAT (Kent, 2002) software was used to isolate chloro- (Katoh and Standley, 2013) and manually adjusted. plast reads from the raw reads based on known reference cp genomes. Maximum likelihood (ML) trees were constructed using RAxML 7.0.4 SolexaQA (Cox et al., 2010)wasusedtofilter low quality reads with (Stamatakis et al., 2008) with a general time reversible (GTR+I+G) the options −h 25 (quality cutoff p = 0.01) and −l40(length model. Maximum parsimony (MP) trees were performed using PAUP* cutoff = 40 bp). SOAPdenovo (Luo et al., 2012) was carried out for 4.10 (Swofford, 2003). Phylogenetic analyses excluded gap regions. All the assembly with an option −K 57 (Kmer size = 57 bp). Then, all MP searches included 100 random addition replicates and TBR branch contigs were mapped to the reference chloroplast genome of Nadina swapping with the Multrees option. Non-parametric bootstrap analyses domestica (NC_008336) using BLATZ (Schwartz et al., 2003) to identify were performed for MP analyses with 1000 replicates. the position and direction of the contigs. Gaps between contigs were Divergence times were estimated by the program BEAST version filled up with a method like this: first, we used BLAT to map chloroplast 1.7.5 (Drummond et al., 2012), using the same dating strategies adopted reads onto both ends of the assembled contigs and then elongated the by Moore et al. (2010). Three time constraints were applied for molec- contigs by joining the reads, which were partly overlapped (≥95% iden- ular dating, in which the minimum ages of angiosperms were set to tical) with the contigs. The two steps were taken repeatedly until the be 131.8 mya, the minimum ages of were set to be 125 mya, gaps were filled; we thus acquired the complete cp genome of M. bealei. and 85 mya for the most recent common ancestor of Quercus and To avoid assembly errors, we tested the length distribution of all assem- Cucumis. Additionally, we set the age of the root node to be 308 mya bled paired-end reads (Supplementary Fig. 1), and 3 pairs of primers for as reports of conifer leaves and shoots are known from 309.2 to PCR amplifications were designed based on the variation of the genome 307.1 mya. to validate the sequence of the assembly (Supplemetary Table 1). 3. Results 2.2. Genome annotation and analysis 3.1. Genome assembly and validation The M. bealei cp genome was annotated using DOGMA (Dual Organellar GenoMe Annotator, Wyman et al., 2004). Predicted annota- Using the Illumina Hiseq 2000 system, 16,139,329 paired-end reads tions of genes and open reading frames (ORFs) were identified with were generated for this project. After filtering low-quality reads (≤Q20 BLAST tools and ORF finder at NCBI website (http://www.ncbi.nlm.nih. bases) and aligning with reference cp genomes, we collected 562,270 gov/). The transfer RNA genes were identified by using DOGMA reads (3.48% of total) reaching 266X coverage over the cp genome paralleled with using tRNAscan-SE1.23 (Lowe and Eddy, 1997). The (Table 1). The unassembled reads (~96.52%) were mostly from the 122 J. Ma et al. / Gene 528 (2013) 120–131

Table 1 genome. The final complete chloroplast genome sequence of M. bealei Sequencing and assembly results for cp genome of Mahonia bealei. has been deposited into GenBank under the accession number KF76554. Total number of paired-end reads 16,139,329 Average read length (after filtering low-quality) 69.5 Reads used in complete cp genome assembly 562,270(3.48%) 3.2. Genome features Number of bases used in complete cp genome assembly 39,077,765 Fold coverage of chloroplast genome 266X (Fold coverage = number of bases sequenced/estimated length The M. bealei cp genome is 164,792 bp in length and shares the of the cpDNA) typical quadripartite structure with a pair of inverted repeats (IRs 36,501 bp), separated by a small single copy region (SSC 18,591 bp) and a large single copy region (LSC 73,052 bp), respectively (Fig. 1 and Table 2). The IR size of M. bealei (36,501 bp) is the largest among nuclear genome due to the raw reads were a collection of DNA from Ranunculales. Megaleranthis saniculifolia ranks second, with an IR size nucleus and organelles. of 26,608 bp, and Nandina domestica ranks third, with an IR size of We have manually corrected ambiguous nucleotide sites in the cp 26,062 bp. The large size of IRs is the main contribution to the large genome which were produced in the scaffold extension progress during size of the cp genome of M. bealei. the assembly, as well as errors associated with heterogeneous insertion/ The cp genome of M. bealei encodes 150 predicted functional genes, deletions (Indels) which were arisen from homopolymeric repeats in 111 are unique and 39 are duplicated in the IR regions. Among the 111 the genome. This kind of error is intrinsic to pyrosequencing and unique genes, there are 73 protein-coding genes, 34 tRNA genes and 4 could be corrected by PCR runs. However, there is a total of 293 homo- rRNA genes, respectively. In IRs, the number of protein-coding genes, polymers (n ≥ 7) in the cp genome and 118 equal to 7 bps, and it is tRNA,andrRNAis28,7,and4forM. bealei, while the number is 14, 7, irrational to retest all of them; we designed 3 pairs of primers to validate and 4 for M. saniculifolia,and13,7,and4forN. domestica. The equiva- the sequences which have the maximum likelihood to make errors lent numbers of tRNA and rRNA in the IRs indicate that the gene order (Table S1). After contrasting with returned results from PCR experi- and content remain unarranged, and the number of protein-coding ments, a total of 22 bases were corrected within the assembled cp genes in M. bealei is much higher than the other two, indicating large

Fig. 1. Gene map of the chloroplast genome of the Mahonia bealei (Berberidaceae). Asterisks (*) after the gene indicate the presence of introns; the introns are annotated with white boxes within genes in the map. Genes on the outside of the map are transcribed in the clockwise direction, while other genes on the inside of the map are transcribed in the counterclockwise direction (drawn with gray arrows). The LSC, SSC, and IR regions are annotated at the inner circle. The dark gray plot in the inner circle corresponds to GC content. J. Ma et al. / Gene 528 (2013) 120–131 123

Table 2 The gene order and content of M. bealei are almost identical to that of Summary of the complete cp genome of Mahonia bealei. tobacco, only the SSC region is in the reverse orientation. However, the Total cp genome size 164792 bp IR in M. bealei is 36,501 bp in length and contains 39 duplicated genes, Inverted repeat (IR) region size 36501 bp while that in the cp genome of tobacco is 25,343 bp in length, and con- Large single copy (LSC) region size 73052 bp tains 24 duplicated genes. The 15 additional genes are rps19, rpl22, rps3, Small single copy (SSC) region size 18591 bp rpl16, rpl14, rps8, infA, rpl36, rps11, petD, petB, psbH, psbN, psbT and psbB, Number of unique genes 111 Number of protein coding genes 73 as a result of IR expansion around the junction between LSC and IR tRNA genes 34 (Fig. 1). rRNA genes 4 Pseudogenes 2 GC content 38.14% Coding regions (proportion of whole genome) 50.09% 3.3. Gene organization and content

The M. bealei cp genome has 18 intron-containing genes, 16 of which IR expansion around the junction of IRb and LSC regions in M. bealei are single-intron except for two genes, ycf3 and clpP, whose exons are (Fig. 1). separated by 2 introns. The gene rps12 is trans-spliced; one of its The overall GC content of M. bealei is 38.14%, almost the same with exons is in the LSC region (5′) and the two reside in the IR regions sep- other plants in Ranunculales: M. saniculifolia (38.03%), Ranunculus arated by an intron (Fig. 1). The introns of all protein-coding genes share macranthus (37.86%) and N. domestica (38.30%). This value is within the same splicing mechanism as Group II introns. Some exceptional the normal coverage for vascular plants (38–39%) (Palmer and Stein, cases were identified in start codons, such as ACG for ndhD and rpl2, 1986). The GC content of the IRs in M. bealei is 41.25%, which is ATC for rpl16,andGTGforrps19. This event of non-canonical start lower than the other plants in Ranunculales: M. saniculifolia (42.96%), codons was also found in other angiosperms and ferns (Lidholm and R. macranthus (43.34%) and N. domestica (43.06%). The high GC content Gustafsson, 1991). in the IRs is mainly caused by the high GC content of the four ribosomal The gene ycf68 of the IR regions became pseudogene due to an inter- RNA (rRNA) genes (55.2%), so the greater the proportion of the rRNAs is, nal stop codon identified in its coding sequence (CDS), positioned 42 bp the higher GC content will be. The number of rRNA is the same (4 for upstream away from the start codon. In the junction region of IRb and each) in the IRs of the 4 cp genomes, while the number of protein- SSC, the ycf1 gene became pseudogene because of incomplete duplica- coding genes of M. bealei (28 protein coding genes) is much higher tion of the normal copy of ycf1 at the junction of IRa and SSC region than that of the other 3 (13–14 protein-coding genes), which results (Fig. 1). Similar mutations were also found in the cp genomes of other in a little lower GC content of the IRs in M. bealei. angiosperms (Johansson, 1999). Another gene rpoA, normally present A total of 26,326 codons represent the coding capacity of all protein- between rps11 and petD in most land plants, however, is completely coding genes of the M. bealei chloroplast genome (Table 3). Among absent from the cp genome of M. bealei. The loss of the rpoA gene is these codons, 2233 (8.48%) encode for isoleucine and 306 (1.16%) for considered to be a general occurrence in mosses (Chaw et al., 1997), cysteine, which were the most and the least amino acids, respectively. but this event is not found in other plants of Ranunculales.

Table 3 Codon usage and codon–anticodon recognition pattern for tRNA in Mahonia bealei chloroplast genome.

Amino acid Codon No. RSCU tRNA Amino acid Codon No. RSCU tRNA

Phe UUU 924 1.24 Ser UCU 556 1.65 UUC 563 0.76 trnF-GAA UCC 361 1.07 trnS-GGA Leu UUA 819 1.8 trnL-UAA UCA 407 1.21 trnS-UGA UUG 542 1.19 trnL-CAA UCG 169 0.5 CUU 562 1.24 Pro CCU 426 1.43 CUC 187 0.41 CCC 278 0.93 CUA 415 0.91 trnL-UAG CCA 325 1.09 trnP-UGG CUG 199 0.44 CCG 161 0.54 Ile AUU 1066 1.43 Thr ACU 549 1.6 AUC 487 0.65 trnI-GAU ACC 295 0.86 trnT-GGU AUA 680 0.91 ACA 375 1.09 trnT-UGU Met AUG 628 1 trnf (M)-CAU ACG 154 0.45 Val GUU 537 1.44 Ala GCU 647 1.71 GUC 184 0.49 trnV-GAC GCC 257 0.68 GUA 567 1.52 trnV-UAC GCA 428 1.13 trnA-UGC GUG 207 0.55 GCG 182 0.48 Tyr UAU 742 1.59 Cys UGU 221 1.44 trnC-GCA UAC 190 0.41 trnY-GUA UGC 85 0.56 TER UAA 1 0.5 TER UGA 2 1 UAG 3 1.5 Trp UGG 458 1 trnW-CCA His CAU 481 1.51 Arg CGU 379 1.37 trnR-ACG CAC 157 0.49 trnH-GUG CGC 96 0.35 Gln CAA 682 1.48 trnQ-UUG CGA 400 1.45 CAG 239 0.52 CGG 139 0.5 Asn AAU 885 1.52 Ser AGU 414 1.23 AAC 276 0.48 trnN-GUU AGC 113 0.34 Lys AAA 936 1.46 trnK-UUU Arg AGA 476 1.73 trnS_GCU AAG 350 0.54 AGG 164 0.59 Asp GAU 873 1.61 Gly GGU 625 1.31 GAC 211 0.39 trnD-GUC GGC 197 0.41 trnG-GCC Glu GAA 937 1.43 trnE-UUC GGA 737 1.54 trnG-UCC GAG 370 0.57 GGG 350 0.73 124 J. Ma et al. / Gene 528 (2013) 120–131

Table 4 regions. The direct repeats were mostly within intergenic spacer re- Repeated sequences in the Mahonia bealei chloroplast genome. gions, intron sequences, but nested repeats were also found in ycf1 Repeat number Size (bp) Type Repeat 1 Location Repeat 2 Location and accD (Table 4). There are 75 SSRs with length ≥10 in the M. bealei cp genome, 1 51 D IGS (atpB–rbcL) IGS (atpB–rbcL) 2 48 I IGS (rbcL–accD) IGS (rbcL–accD) including 71 mononucleotides, one di-, two trinucleotide and one com- 3 46 D IGS (rbcL–accD) IGS (rbcL–accD) plex repeats (Table 6). No tetranucleotides or hexanucleotides were 4 66 D accD accD found. All (71/71) mononucleotides consist of A or T, which is consistent 5 63 D accD accD with the contention that cp SSRs are generally composed of short 6 37 D accD accD 7 51 D accD accD polyadenine (polyA) or polythymine (polyT) repeats and rarely contain 8 33 D accD accD tandem guanine (G) or cytosine (C) repeats (Kuang et al., 2011). We 9 33 D accD accD also analyzed the distribution of the SSRs in the M. bealei cp genome. 10 87 D accD accD 9.3% SSRs are located in CDS, which indicates that SSRs are much less 11 42 D accD accD abundant in coding regions than in non-coding regions. Two genes, 12 70 D accD accD 13 36 I IGS (petA–psbJ) IGS (petA–psbJ) ycf1 and rps2, were found to harbor at least two SSRs. 14 57 I IGS (petD–rps11) IGS (petD–rps11) 15 51 D IGS (trnI-CAU–ycf2) IGS (trnI-CAU–ycf2) 3.5. RNA editing 16 43 D IGS (trnI-CAU–ycf2) IGS (trnI-CAU–ycf2) 17 56 I ycf2 ycf2 18 33 D ycf2 ycf2 DNA and EST sequences were compared by aligning expressed se- 19 39 I IGS (ycf2–ycf15) IGS (ycf2–ycf15) quence tag (EST) sequences retrieved from GenBank with the complete 20 49 I IGS (trnL-CAA–ndhB) IGS (trnL-CAA–ndhB) M. bealei chloroplast genome sequence. 5 non-synonymous nucleotide 21 46 D ndhB IGS (ndhF–trnN-GUU) substitutions were identified in the protein-coding transcripts of 22 37 D ycf1 ycf1 rpoC1, rps2, rps19,andycf1 (Table 5). In ycf1 four amino acid substitu- 23 37 D ycf1 ycf1 24 34 D ycf1 ycf1 tions were found, which resulted in 1 non-synonymous and 3 synony- 25 50 D ycf1 ycf1 mous nucleotide substitutions. Only 1 non-synonymous nucleotide 26 39 D ycf1 ycf1 substitution was found in transcripts coding for rpoC1. In non-protein- 27 35 D ycf1 ycf1 coding regions, 7 additional differences were detected, including 2 in 28 51 D ycf1 ycf1 29 54 D ycf1 ycf1 the intron of clpP and 5 in the ribosomal RNA gene rrn23 (Table 5). 30 38 D ycf1 ycf1 31 44 D ycf1 ycf1 3.6. Phylogenetic and molecular dating analysis 32 36 D ycf1 ycf1 33 35 D ycf1 ycf1 We prepared a data set of 65 protein-coding genes sequences 34 36 D ycf1 ycf1 35 44 I ycf1 ycf1 extracted from 25 taxa, including 23 angiosperms and 2 gymnosperms 36 41 I IGS (ndhF–trnN-GUU) IGS (ndhF–trnN-GUU) as outgroups (Pinus and Cycas). Both maximum parsimony (MP) and 37 34 D IGS (ndhF–trnN-GUU) IGS (ndhF–trnN-GUU) maximum likelihood (ML) analyses were performed on this data set. 38 30 I trnS-GCU trnS-GGA The data set comprises 112,450 characters and has a percentage of 39 36 D intron (ycf1) IGS (rps12–trnV-GAC) potential parsimony-informative characters of 14.24%. Table includes repeats at least 30 bp in size, with a sequence identity ≥90%. D = direct, ML analysis resulted in a single tree with −lnL = 468191.42405 I = palindrome, IGS = intergenic spacer. (Fig. 5). ML bootstrap support values were also high, with values of ≥95% for 21 nodes, and 16 nodes with 100% bootstrap support 3.4. Repeat structure analyses (Fig. 5). MP analyses resulted in a single, fully resolved tree with a length of 64,594, a consistency index of 0.5966, and a retention index Repeat analysis identified 30 direct and 9 inverted repeats 30 bp or of 0.5665 (Fig. 6). Bootstrap analyses indicated that 20 of the 22 nodes longer with a sequence identity greater than 90% (Table 4). 15 direct were supported by values ≥95%. repeats and 3 inverted repeats are 30 to 40 bp long, and the longest The ML and MP trees had very similar topologies, except for 3 major direct repeat is 78 bp (Fig. 2). Three direct repeats were found in the differences. First, the MP tree placed Amborella with Nymphaea as the coding region, while 7 were identified either in intergenic or intronic earliest diverging angiosperm lineage, whereas the ML tree placed

Fig. 2. Repeat sequence analysis in 4 sequenced Ranunculales chloroplast genomes. REPuter was used to identify repeat sequences with length ≥30 bp long and sequence identity ≥90% in the cp genomes. F, P, R, and C indicate the repeat type of F (forward), P (palindrome), R (reverse), and C (complement), respectively. Repeat with different lengths are indicated in different groups. J. Ma et al. / Gene 528 (2013) 120–131 125

Table 5 Differences observed by comparison of Mahonia bealei chloroplast genome sequences with EST sequences obtained by BLAST search in GenBank.

Gene Gene size (bp) EST sequencea Number of variable sites Variation typeb Position(s)c Amino acid change

rpoC1 2795 1516–1558 1 C-T 1520 S-N rps2 705 1–525 2 G-A 392 S-L G-A 458 S-L rps19 270 1–270 2 A-T 118 I-I C-T 143 A-V ycf1 5542 1554–1993 4 A-T 1611 K-N C-T 1656 D-D T-C 1864 L-L T-A 1962 I-I ⁎ clpP 730 375–624 2 T-C 450 C-T 511 rrn23 2804 422–1112 5 G-C 437 – A-G 442 – A-T 462 – A-C 512 – A-C 1106 –

a Gene sequence which considers the first base of the initiating codon as 1. b Variation type: nucleotide in genomic DNA-nucleotide in mRNA. c Variable position is referenced to the first base of the initiating codon of the gene sequence. ⁎ indicates the length of the second intron, not the full length of the gene clpP.

Amborella followed by the Nymphaeales as the most basal angiosperm to eudicots by the MP tree, and both of these two clades with basal lineage, parental branch of both monocots and eudicots. Second, the angiosperms formed a sister relationship with monocots. However, it only representative of the magnoliid clade, Calycanthus, is placed sister is placed as a sister group of monocots by the ML tree, and both of

Fig. 3. Sequence alignment of four chloroplast genome in the Ranunculales order using mVISTA program. Gray arrows and thick blue lines above the alignment indicate genes with their orientation and the positions of the IRs, respectively. A cut-off of 70% identity was used for the plots, and the Y-scale represents the percent identity (50%–100%). Genome regions are color-coded as the heat map on the left corner indicated, blue represents exons, lime represents tRNA and rRNA genes, and red represents conserved non-coding sequences (CNS). 126 J. Ma et al. / Gene 528 (2013) 120–131

Fig. 4. Comparison of boundaries between inverted repeats (IR) and single-copy (SC) regions among the five cp genomes of early-divergent eudicots. Gene sequences near the IR–SC (including SSC and LSC) boundaries were annotated along the vertical gray bar (represent part of genome). Genes above the gray bar indicate that their transcriptions are in forward direction and genes below the gray bar indicate that their transcriptions are in reverse direction. *indicates a pseudo gene. This figure is not to scale. them formed a sister relationship with eudicots. The third topological the Ranunculales order (M. bealei, N. domestica, M. saniculifolia and difference is the placement of Myrtales (represented by Eucalyptus). R. macranthus). For the first two, which are in the Berberidaceae family, The MP tree placed Myrtales as a sister branch to eurosids I clade, in- the length of IR in M. bealei cp genome is 36 kb and encoding 39 genes, cluding Cucurbitales and Fabales, whereas the ML tree placed Myrtales and that in N. domestica is 26 kb and encoding 26 genes. For the two sister to eurosids II clade, including , Brassicales and Sapindales. species in the Ranunculaceae family, the length of IR in M. saniculifolia The remaining angiosperms formed two major clades, monocots and is 26.6 kb and encoding 27 genes, and that in R. macranthus is 25.8 kb eudicots. and encoding 26 genes. Both MP and ML trees strongly support the monophyly of the mono- The second, the gene content of M. bealei is very similar to that of cots with a bootstrap value of 100%. Ranunculales were the earliest tobacco, however, rpoA, a gene coding for a translation initiation factor diverging lineage, placed sister to a large clade including and in other plant species, is absent in the cp genome of M. bealei.The asteroids with a bootstrap value of 100%. Among the major clades of gene ndhK degenerated into pseudogene due to an internal stop eudicots, 100% bootstrap value was assigned to rosids against asteroids codon in it. A loss of all 11 functional genes (ndh genes) for subunits for both MP and ML trees. The monophyly of the eurosid II clade was of a putative NADH dehydrogenase was found in the chloroplast strongly supported in the MP tree (100% bootstrap value). Both MP genome of angiosperms and a bryophyte (Wakasugi et al., 2001), 4 and ML trees provide strong support value for the placement of M. bealei ndh genes were completely lost and the other 7 genes remain as obvious in the basal eudicot clade, sister to N. domestica (Berberidaceae) with a pseudogenes. This unexpected finding raises the possibility that ndh bootstrap value of 100%. genes may transfer to the nucleus or that an NADH dehydrogenase is Molecular dating analyses suggest that the two families of the not essential in certain land plant chloroplasts. The lack of a functional Ranunculales order, Ranunculaceae and Berberidaceae were deter- copy of ndhK in M. bealei should be investigated further, including an ex- mined to diverge between 90 and 84 mya. The crown group 95% panded sampling of members of the Berberidaceae and Ranunculaceae. highest posterior density (HPD) age estimates for other major lineages of Pentapetalae were as follows: Superasterids (112–98 mya), Dilleniaceae + Superrosids (108–102 mya), Superrosids (115–112 mya), 4.2. Repeat sequences analysis Asteridae (98–87 mya), and Rosidae (102–95 mya). Repeat analysis identified 30 direct (forward) repeats, 9 inverted 4. Discussion (palindrome) repeats, and 11 reverse repeats of at least 30 bp long per repeat unit with a sequence identity ≥90% in the cp genome of 4.1. Genome organization M. bealei, with the longest repeat of 87 bp in length (Table 4). The repeat structures of the other three species within Ranunculales were also an- Gene order of the M. bealei genome is identical to the published ge- alyzed (Fig. 2). The forward and inverted repeats are the most common nome sequence of tobacco (Nicotiana tabacum, NC_001879), which has repeats in these species. The cp genome of M. bealei owns the longest the inferred ancestral angiosperm genome organization (Shinozaki forward repeat (87 bp in length) and the longest inverted repeats et al., 1986). However, rps19 gene and other 14 genes, which generally (57 bp in length), while the number of inverted repeats is less than are single copies in the LSC regions, now have duplicated in the IRs of the other three. The reason for this may be inferred that the large IR in M. bealei. Such a long IR expansion is unusual in chloroplast genomes. the cp genome of M. bealei could stabilize the genome and reduce the We have compared the IRs of the four sequenced cp genomes of occurrence probability of small inversions. J. Ma et al. / Gene 528 (2013) 120–131 127

Fig. 5. Maximum likelihood tree based on 65 chloroplast protein-coding genes. The single maximum likelihood phylogram has a ML value of −lnL = 468191.42405. Numbers at each node indicate bootstrap support values and branch length scales are shown at base of the tree. Thicker lines in trees indicate members of eudicots. Genelosseventsarelabeledinthe most parsimonious way. Accession numbers for taxa are: Vigna unguiculata, NC_018051.1; thurberi, NC_015204.1; Eucalyptus grandis, NC_014570.1; Glycine max, NC_007942.1; Nymphaea alba, NC_006050.1; Arabidopsis thaliana, NC_000932.1; Panax ginseng, NC_006290.1; Amborella trichopoda, NC_005086.1; Calycanthus floridus var. glaucus, NC_004993.1; Atropa belladonna, NC_004561.1; Cycas taitungensis, NC_009618.1; Lotus japonicas, NC_002694.1; Zea mays, NC_001666.2; Triticum aestivum, NC_002762.1; Oryza sativa indica group, JN861110.1; Gossypium sturtianum, JF317356.1; Cucumis melo subsp. melo,JF412791.1;Citrus sinensis, DQ864733.1; Megaleranthis saniculifolia, FJ597983.1; Nandina domestica, DQ923117.1; Coffea arabica,EF044213.1;Nuphar advena, DQ354691.1; Ranunculus macranthus, DQ359689.1; Nicotiana tabacum, NC_001879.

Fig. 6. Maximum parsimony tree based on 65 chloroplast protein-coding genes. The single most parsimonious phylogram has a length of 64,594, a consistency index of 0.5966, and a re- tention index of 0.5665. Numbers at nodes indicate bootstrap support values and branch length scales are shown at the base of the tree. Thicker lines in trees indicate members of eudicots. 128 J. Ma et al. / Gene 528 (2013) 120–131

The gene ycf1 was found to harbor four SSRs (three mononucleotides transcript counter clockwise in Platunus.InTrochodendron,theIRbex- and one trinucleotides), and two SSRs are closely-positioned to form a panded into the IGS of infA and rps8, ~4 kbp apart from the upstream 138 bp GA repeating segment (Table 6). Compared with other published of rps19 gene; while in Mahonia, the IRb further extended to the IGS of genomes, we found that this 138 bp GA repeating segment is a polymor- clpP and psbB,~11kbpapartfromrps19. The ~4 kbp expansion in IRb phic microsatellite locus, which is also found in Littorina saxatilis, Linum is also found in Tetracentron (Sun et al., 2013), an early-diverging eudicot usitatissimum, Hordeum vulgare etc. These polymorphic new microsatel- inthesameorderwithTrochodendron. Angiosperm with the largest lite loci will be useful for genetic linkage map construction, germplasm known IR is possessed by Pelargonium × hortorum, ~76 kb in length classification and identification, gene identification and quantitative (Palmer and Stein, 1986). The ~12 kb IR expansion in Mahonia was trait loci mapping, and marker-assisted selection in breeding in certain also found in Berberis (Kim and Jansen, 1994)andNicotiana acuminata plants. (Shen et al., 1982). In the cp genome, the IR/LSC boundaries are not static but rather are subject to a dynamic and random process that allows con- 4.3 . Comparison with other cp genomes in the Ranunculales order servative expansions and contractions. The nature of IR movements is conservative, so that disruptive genomic rearrangements are avoided. Four complete cp genome sequences of the Ranunculales order are Goulding et al. have proposed that short IR expansions may be the result currently available (including our assembly of M. bealei), which can be of gene conversion, whereas large IR expansions need to involve a divided into two families: Berberidaceae (M. bealei, N. domestica)and double-strand DNA break (Goulding et al., 1996). The formation of the Ranunculaceae (M. saniculifolia, R. macranthus). ~12 kb IR expansion in Mahonia cp genome is proposed to occur via The genome size of M. bealei is the largest of the four, about 8.2 kb, the second mechanism mentioned above, involving a double-strand 9.7 kb and 4.9 kb larger than that of N. domestica, R. macranthus and DNA break and subsequent repair, which is also found in N. acuminata. M. saniculifolia, respectively. The significant increase in genome size of The IR/LSC junction lies within intron 1 of the clpP gene, a recombination M. bealei is mainly attributed to a pair of large-sized IRs in M. bealei, between poly(A) tracts in clpP intron 1 and upstream of rps19 followed which is the result of IR expansion at the junction of LSC and IRs, as the double-strand break to complete the rearrangement. The poly(A) mentioned above. tract at the IR/LSC boundary in N. acuminata is also found in Mahonia, We then aligned the four Ranunculales cp genomes and found a high except that the truncated clpP gene (clpP′) found in the IR region down- degree of synteny among them, but with minor exceptions. The diver- stream of trnH in N. acuminata is complete silent in Mahonia. gence observed at the junctions between IR and LSC, SSC, as well as in IR regions, is more than that in LSC and SSC regions. In the cp genome 4.5. Divergence time between Ranunculaceae and Berberidaceae of R. macranthus, the gene ycf15, commonly positioned between ycf2 and trnL is transferred to rps12 and trnV, approximately 12.1 kbp down- The best available pollen and leaf fossil evidence of angiosperms is stream of its origin position in the IR region; while two genes rpl22 about 130 mya–135 mya in the origin of the early cretaceous. It is sug- and ycf1 were completely lost at the junction of LSC–IRb and IRb–SSC, gested that the diversity of angiosperms had significantly increased in respectively. In the IR region of the cp genome of M. bealei, ycf68 became the cretaceous period (Crane, 1989). Towards the end of the cretaceous pseudogene due to internal stop codons. The trnI (GAT) gene of period (the Maestrichtian period), pollen and leaf fossils of a number of Berberidaceae contains highly conserved type II introns, while that modern families and genera have emerged, including Magnoliales, was not found in Ranunculaceae due to deletions within the intron of Hamamelidales, Theales and several members of the monocotyledon. trnI, which has also been found in a wide range of plant families The fossil remains of Ranunculales are also found in this period. (Umesono et al., 1988). Menispermaceae, Lardizabalaceae, and Berberidaceae are closely related The overall sequence identity of the four cp genomes was plotted families of the Ranunculales order, and the earliest fossil remains with mVISTA using the new annotation of M. bealei as reference of Menispermaceae are discovered in the late cretaceous (~65 mya). (Fig. 3). The result shows that the IR regions are less divergent than Our estimation of the divergence time between Ranunculaceae and the LSC and SSC regions, and the coding regions are more conserved Berberidaceae (90–84 mya) is generally consistent with the recent than the non-coding region. The highly divergent regions observed report of ~91 mya (Anderson et al., 2005), whose result was based on are ndhC-trnV, ndhI-ndhG, ndhD-ccsA and psbA-trnH, which are mostly the rbcL gene phylogenetic analysis and employed numerous fossil similar to the results of observation in the non-coding region of six constraints from the early-diverging eudicots. Both report agree that a Asteraceae cp genomes (Doyle et al., 1995). The most divergent coding reasonable divergence time between the two lineages Ranunculaceae regions of the four cp genomes are clpP, accD, rps11, ccsA and ycf1. and Berberidaceae may be in the late cretaceous (the Turonian period).

4.4. IR contraction and expansion 4.6. Phylogenetic implications

The IR regions of cp genome are highly conserved in the cp ge- Phylogenies based on 65 protein-coding genes (Fig. 5)generally nome of land plants, but IR contraction and expansion at the junc- agree with several recent studies based on multiple genes or complete tion of SSC, LSC with IRs are easily found during evolutionary chloroplast genomes (Gao et al., 2009; Wakasugi et al., 2001). Lineages progress, while it is this event attributed to the main cause for the that are strongly supported in both MP and ML trees include the mono- size variation of cp genomes (Tsudzuki et al., 1992; White, 1990; phyly of monocots and their sister relationship to eudicots (98%–100% Lin et al., 2012). We compared the boundary part of LSC, SSC and bootstrap values), monophyly of rosids and asteroids (100% bootstrap IRs of our assembly M. bealei with four early-diverging eudicots: values), and the sister relationship between basal eudicots (represented R. macranthus (Ranunculaceae, Ranunculales), N. domestica by Ranunculales) and core eudicots (100% bootstrap values). (Berberidaceae, Ranunculales), Platanus occidentalis (Platanaceae, However, there are still several incongruences between the MP and Proteales), and Trochodendron aralioides (Trochodendraceae, ML trees. The first concerns with the position of the basal angiosperm Trochodendrales) (Fig. 4). clade, including Amborella and Nympaeleas. In our analyses, Amborella Great expansion at the junction of IRb–LSC of Trochodendron and and Nympaeleas formed the most basal angiosperm lineage in MP anal- Mahonia was observed among the comparison. The IRb–LSC junction yses and they were placed parental to both monocots and eudicots in was located in the intergenic space (IGS) region between rps19 and ML tree. This incongruence was observed in previous phylogenetic rpl2 in Ranunculus, while IRb expanded into the 3′ portion of rpl19 studies (Bausher et al., 2006; Caterino and Sperling, 1999; Corriveau gene with a length of 61 and 23 bp in Nandina and Platanus,respectively, and Coleman, 1988; Leebens et al., 2005; Zanis et al., 2002). The second but the genes near the boundary of LSC–IRb (rpl22, rps19 and rpl2)were incongruence concerns with the placement of Calycanthus,theonly J. Ma et al. / Gene 528 (2013) 120–131 129

Table 6 Simple sequence repeat in Mahonia bealei chloroplast genome.

No. Location SSR type SSR start SSR end

1 trnH-GUG–psbA (A) 12 159 170 2 matK–trnK-UUU (A) 14 3717 3730 3 trnK-UUU–rps16 (A) 12 4477 4488 4 trnK-UUU–rps17 (A) 14 4619 4632 5 trnK-UUU–rps18 (T) 10 4904 4913 6 trnS-GCU–trnR-UCU (A) 11 7995 8005 7 trnS-GCU–trnR-UCU (A) 10 8169 8178 8 trnS-GCU–trnR-UCU (T) 13 8961 8973 9 trnS-GCU–trnR-UCU (T) 11 9615 9625 10 atpF–atpH (A) 10 13172 13181 11 atpH–atpI (T) 12 14040 14051 12 rps2 (T) 10 15625 15634 13 rpoC2 (T) 14 18081 18094 14 rpoB–trnC-GCA (A) 16 26667 26682 15 rpoB–trnC-GCA (T) 11 27608 27618 16 trnC-GCA–petN (T) 11 28090 28100 17 psbM–trnD-GUC (T) 12 29422 29433 18 psbM–trnD-GUC (A) 11 29809 29819 19 trnD-GUC–trnY-GUA (T) 10 30886 30895 20 trnE-UUC–trnT-GGU (A) 13 31506 31518 21 trnT-GGU–psbD (A) 15 32403 32417 22 psbC–trnS-UGA (A) 13 35956 35968 23 lhbA–trnG-UCC (A) 11 36853 36863 24 ycf3 intron (T) 10 43766 43775 25 ycf3–trnS-GGA (A) 10 45020 45029 26 rps4–trnT-UGU (T) 10 47021 47030 27 rps5–trnT-UGU (A) 12 47144 47155 28 trnT-UGU–trnL-UAA (A) 11 47490 47500 29 trnL-UAA intron (A) 10 48137 48146 30 trnL-UAA–trnF-GAA (A) 13 48649 48661 31 trnL-UAA–trnF-GAA (A) 10 48813 48822 32 trnF-GAA–ndhJ (T) 12 49444 49455 33 trnF-GAA–ndhJ (A) 12 49572 49583 34 ndhC–trnV-UAC (A) 12 51998 52009 35 ndhC–trnV-UAC (A) 12 52341 52352 36 ndhC–trnV-UAC (T) 11 52771 52781 37 trnT-GGU–atpE (T) 10 53937 53946 38 petG–trnW-CCA (T) 12 66706 66717 39 trnP-UGG–psaJ (T) 12 67222 67233 40 psaJ–rpl33 (A) 11 67750 67760 41 psaJ–rpl34 (T) 12 68000 68011 42 rps12–clpP (A) 14 70230 70243 43 clpP intron (T) 14 71274 71287 44 clpP–psbB (T) 14 73120 73133 45 rps11–rpl36 (T) 11 80243 80253 46 rpl36–infA (A) 12 80419 80430 47 rps8–rpl14 (T) 16 81327 81342 48 rpl14–rpl16 (T) 13 81792 81804 49 rpl16–rps3 (T) 17 82718 82734 50 rpl22–rps19 (T) 10 84784 84793 51 trnR-ACG–trnN-GUU (T) 10 107607 107616 52 trnN-GUU–ycf1 (A) 11 108596 108606 53 ycf1 (A) 10 109411 109420 54 ycf1 (A) 10 111446 111455 55 ndhA intron (T) 11 116932 116942 56 ndhA intron (A) 14 117257 117270 57 ndhI–ndhG (A) 10 119017 119026 58 ndhI–ndhG (A) 10 119173 119182 59 trnL-UAG–rpl32 (T) 12 124136 124147 60 rpl32–ndhF (A) 22 125521 125542 61 ndhF–ycf1 (T) 10 127856 127865 62 ycf1 (T) 10 128452 128461 63 ycf1–trnN-GUU (T) 11 129266 129276 64 trnN-GUU–trnR-ACG (A) 10 130256 130265 65 rps19 (A) 10 153079 153088 66 rps3–rpl16 (A) 17 155138 155154 67 rpl16–rpl14 (A) 13 156068 156080 68 rpl14–rps8 (A) 16 156530 156545 69 infA–rpl36 (T) 12 157442 157453 70 rpl36–rps11 (A) 11 157619 157629 71 psbB–trnH-GUG (A) 15 164739 164753 72 trnT-GGU–psbD (AT) 6 33038 33049 73 ndhC–trnV-UAC (TAT) 5 51862 51876 74 ycf1 (GAA) 5 110196 110210 75 clpP intron (A)12tggcaaagcaaagaaagaatcaaagtaaagtataggtcctc(T)11 71999 72062 130 J. Ma et al. / Gene 528 (2013) 120–131 representative of magnoliids. Our MP tree places Calycanthus sister to References eudicots with strong supported bootstrap values (90% bootstrap value), whereas this taxon is positioned sister to monocots against Anderson, C.L., Bremer, K., Friis, E.M., 2005. Dating phylogenetically basal eudicots using rbcL sequences and multiple fossil reference points. Am. J. Bot. 92, 1737–1748. eudicots in the ML tree. Several molecular phylogenies have suggested Bausher, M., Singh, N., Lee, S.B., Jansen, R., Daniell, H., 2006. The complete chloroplast ge- different sets of relationships among magnoliids, monocots, and nome sequence of Citrus sinensis (L.) Osbeck var ‘Ridge Pineapple’: organization and eudicots. A single gene matK based phylogenies (Sugiura et al., 2003) phylogenetic relationships to other angiosperms. BMC Plant Biol. 6, 21. Cai, Z., et al., 2006. Complete plastid genome sequences of Drimys, Liriodendron,andPiper: suggested that eudicots are sister to magnoliids and monocots, while implications for the phylogenetic relationships of magnoliids. BMC Evol. Biol. 6, 77. phylogenetic analyses based on whole chloroplast genome including Caterino, M.S., Sperling, F.A., 1999. Phylogeny based on mitochondrial cytochrome oxidase 28 taxa suggested that magnoliid is placed sister to eudicots with strong I and II genes. Mol. Phylogenet. Evol. 11, 122–137. support (100% bootstrap values) (Chaw et al., 2000). Recent articles Chaw, S.M., Zharkikh, A., Sung, H.M., Lau, T.C., Li, W.H., 1997. Molecular phylogeny of extant gymnosperms and seed plant evolution: analysis of nuclear 18s rRNA sequence. recovered the relationships among magnoliids, monocots and eudicots Mol. Biol. Evol. 14, 56–68. depending on the phylogenetic methods (MP or ML) and the genes Chaw, S.M., Parkinson, C.L., Cheng, Y., Vincent, T., Palmer, J.D., 2000. Seed plant phylogeny used (Jansen et al., 2007), so the position of magnoliids continues to inferred from all three plant genomes: monophyly of extant gymnosperms and origin of Gnetales from conifers. Proc. Natl. Acad. Sci. U. S. A. 97, 4086–4091. be controversial. The different resolutions of relationships of magnoliids Corriveau, J.L., Coleman, A.W., 1988. Rapid screening method to detect potential biparental are greatly affected by taxon sampling and phylogenetic methodology inheritance of plastid DNA and results for over 200 angiosperm species. Am. J. Bot. (Cai et al., 2006; Moore et al., 2007). In order to clarify the effects of 1443–1458. Cox, M.P., Peterson, D.A., Biggs, P.J., 2010. SolexaQA: at-a-glance quality assessment of both of these on the resolution of the relationships among magnoliids, Illumina second-generation sequencing data. BMC Bioinforma. 11, 485. monocots, and eudicots in the phylogenetic reconstruction of angio- Crane, P.R., 1989. Paleobotanical evidence on the early radiation of nonmagnoliid dicoty- sperms, additional complete chloroplast genome sequences are still ledons. Plant Syst. Evol. 162, 165–191. Doyle, J.J., Doyle, J.L., Palmer, J.D., 1995. Multiple independent losses of two genes and one needed. intron from legume chloroplast genomes. Syst. Bot. 20, 272–294. M. bealei belongs to the Berberidaceae family, Ranunculales order. Drummond, A.J., Suchard, M.A., Xie, D., Rambaut, A., 2012. Bayesian phylogenetics with The Ranunculales are one of the most early-diverging eudicots that BEAUti and the BEAST 1.7. Mol. Biol. Evol. 29, 1969–1973. Dyall, S.D., Brown, M.T., Johnson, P.J., 2004. Ancient invasions: from endosymbionts to strongly supported to be a clade and have maintained a stable sister organelles. Science 304, 253–257. relationship with other eudicots (Kim et al., 2004). In our analyses, Ferris, H., Zheng, L., 1999. Plant sources of Chinese herbal remedies: effects on Pratylenchus both of the MP and ML trees have a consistent result with previous stud- vulnus and Meloidogyne javanica.J.Nematol.31,241–263. ies, and the monophyly of Ranunculales is strongly supported with 100% Gao, L., Yi, X., Yang, Y.X., Su, Y.J., Wang, T., 2009. Complete chloroplast genome sequence of a tree fern Alsophila spinulosa: insights into evolutionary changes in fern chloro- bootstrap values. Areas of congruence that strongly supported include plast genomes. BMC Evol. Biol. 9, 130. the monophyly of monocots and their sister relationship to eudicots, Goulding, S.E., Olmstead, R.G., Morden, C.W., Wolfe, K.H., 1996. Ebb and flow of the – monophyly of rosids and asteroids, and the sister relationship between chloroplast inverted repeat. Mol. Gen. Genet. 252, 195 206. Hall, T.A., 1999. BioEdit: a user-friendly biological sequence alignment editor and analysis Ranunculales and the remaining eudicots. program for Windows 95/98/NT. Nucleic Acids Symp. Ser. 41, 95–98. Jansen, R.K., et al., 2007. Analysis of 81 genes from 64 plastid genomes resolves relation- ships in angiosperms and identifi es genome-scale evolutionary patterns. Proc. Natl. 5. Conclusions Acad. Sci. U. S. A. 104, 19369–19374. Johansson, J.T., 1999. There large inversions in the chloroplast genomes and one loss of The complete chloroplast genome sequence of the M. bealei has the chloroplast gene rps16 suggest an early evolutionary split in the genus Adonis (Ranunculaceae). Plant Syst. Evol. 218, 133–143. revealed that the Mahonia species has a distinct cp genome compared Katoh, K., Standley, D.M., 2013. MAFFT multiple sequence alignment software version 7: to other angiosperm genomes, contains a pair of especially large IRs, improvements in performance and usability. Mol. Biol. Evol. 30, 772–780. making its gene order and content stay at a very stable level and has Kent, W. James, 2002. BLAT—the BLAST-like alignment tool. Genome Res. 12, 656–664. Kim, K.J., Jansen, R.K., 1994. Comparisons of phylogenetic hypotheses among different data almost no structural rearrangements during evolutionary process. The sets in dwarf dandelions (Krigia, Asteraceae): additional information from internal identification of repetitive sequences and potential RNA editing sites transcribed spacer sequences of nuclear ribosomal DNA. Plant Syst. Evol. 190, 157–185. provides new insights into the structural characteristics of the chloro- Kim, S., Soltis, D.E., Soltis, P.S., Zanis, M.J., Suh, Y., 2004. Phylogenetic relationships among plast genome of Mahonia species, and the incongruence revealed in early-diverging eudicots based on four genes: were the eudicots ancestrally woody? Mol. Phylogenet. Evol. 31, 16–30. phylogenetic analyses raising new questions on the completeness of Kolodner, R., Tewari, K.K., 1979. Inverted repeats in chloroplast DNA from higher plants. data set based on concatenated gene sequences and limited taxa Proc. Natl. Acad. Sci. U. S. A. 76, 41–45. sampling. The dramatic IR junction rearrangements are proposed to Kong, W., et al., 2004. Berberine is a novel cholesterol-lowering drug working through a unique mechanism distinct from statins. Nat. Med. 10, 1344–1351. occur via a double-strand DNA break and subsequent recombination. Kuang, D.Y., Wu, H., Wang, Y.L., Gao, L.M., Zhang, S.Z., et al., 2011. Complete chloroplast The recombination between the two copies of the IR is assumed to genome sequence of Magnolia kwangsiensis (Magnoliaceae): implication for DNA – occur continually based on the fact that two full copies of the S10 oper- barcoding and population genetics. Genome 54, 663 673. Kurtz, S., Choudhuri, J.V., Ohlebusch, E., Schleiermacher, C., Stoye, J., Giegerich, R., 2001. on and duplicated genes that include psbB existed in the IR of M. bealei, REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic which means that the inference of contractions and expansions of the Acids Res. 29, 4633–4642. two IRs that happened separately and independently is wrong. Lee,Y.S.,etal.,2006.Berberine, a natural plant product, activates AMP-activated protein kinase with beneficial metabolic effects in diabetic and insulin-resistant states. Diabe- tes 55, 2256–2264. Conflict of interest statement Leebens, M.J., et al., 2005. Identifying the basal angiosperm node in chloroplast genome phylogenies: sampling one's way out of the Felsenstein zone. Mol. Biol. Evol. 22, 1948–1963. There is no conflict of interest. Lidholm, J., Gustafsson, P., 1991. The chloroplast genome of the gymnosperm Pinus contorta: a physical map and a complete collection of overlapping clones. Curr. Genet. 20, 161–166. Acknowledgments Lin, C.P., Wu, C.S., Huang, Y.Y., Chaw, S.M., 2012. Complete chloroplast genome of Ginkgo biloba reveals the mechanism of inverted repeat contraction. Genome Biol. Evol. 4, – This work was supported by the National Science Foundation of China 374 381. Lohse, M., Drechsel, O., Bock, R., 2007. Organellar Genome DRAW (OGDRAW) — atoolfor (Grant nos. 81202424 and 81073155) and the Ph.D. Programs Foundation the easy generation of high-quality custom graphical maps of plastid and mitochon- of Ministry of Education of China (Grant no. 20110101110070). drial genomes. Curr. Genet. 52, 267–274. Lowe, T.M., Eddy, S.R., 1997. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955– 964. Appendix A. Supplementary data Luo, et al., 2012. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1, 18. Moore, M.J., Bell, C.D., Soltis, P.S., Soltis, D.E., 2007. Using plastid genome-scale data to Supplementary data to this article can be found online at http://dx. resolve enigmatic relationships among basal angiosperms. Proc. Natl. Acad. Sci. U. S. A. doi.org/10.1016/j.gene.2013.07.037. 104, 19363–19368. J. Ma et al. / Gene 528 (2013) 120–131 131

Moore, M.J., Soltis, P.S., Bell, C.D., Burleigh, J.G., Soltis, D.E., 2010. Phylogenetic analysis of Umesono, K., et al., 1988. Structure and organization of Marchantia polymorpha chloro- 83 plastid genes further resolves the early diversification of eudicots. Proc. Natl. Acad. plast genome: II. Gene organization of the large single copy region from rps12 to Sci. U. S. A. 107, 4623–4628. atpB. J. Mol. Biol. 203, 299–331. Palmer, J.D., Stein, D.B., 1986. Conservation of chloroplast genome structure among Wakasugi, T., Tsudzuki, T., Sugiura, M., 2001. The genomics of land plant chloroplasts: gene vascular plants. Curr. Genet. 10, 823–833. content and alteration of genomic information by RNA editing. Photosynth. Res. 70, Pyke, K.A., 1999. Plastid division and development. Plant Cell Online 11, 549–556. 107–118. Schwartz, S., et al., 2003. Human-mouse alignments with BLASTZ. Genome Res. 103–107. White, E.E., 1990. Chloroplast DNA in Pinus monticola. I. Physical map. Theor. Appl. Genet. Shen, G.F., Chen, K., Wu, M., Kung, S.D., 1982. Nicotiana chloroplast genome. Mol. Gen. 79, 119– 124. Genet. 187, 12–18. Wicke, S., Schneeweiss, G.M., Müller, K.F., Quandt, D., 2011. Theevolutionoftheplastidchro- Shinozaki, K., et al., 1986. The complete nucleotide sequence of the tobacco chloroplast mosome in land plants: gene content, gene order, gene function. Plant Mol. Biol. 76, genome: its gene organization and expression. EMBO J. 5, 2043–2049. 273–297. Stamatakis, A., Hoover, P., Rougemont, J., 2008. A rapid bootstrap algorithm for the RAxML Wu, C.S., Wang, Y.N., Liu, S.M., Chaw, S.M., 2007. Chloroplast genome (cpDNA) of Cycas Web-Servers. Syst. Biol. 75, 758–771. taitungensis and 56 cp protein-coding genes of Gnetum parvifolium: insights into Swofford, D.L., 2003. {PAUP*. Phylogenetic Analysis Using Parsimony (* and Other cpDNA evolution and phylogeny of extant seed plants. Mol. Biol. Evol. 24, 1366–1379. Methods). Version 4.}. Wyman, S.K., Jansen, R.K., Boore, J.L., 2004. Automatic annotation of organellar genomes Sugiura, Chika, Kobayashi, Yuki, Aoki, Setsuyuki, et al., 2003. Complete chloroplast DNA with DOGMA. Bioinformatics. 20, 3252–3255. sequence of the moss Physcomitrella patens: evidence for the loss and relocation of Xiao, P.G., Huang, B., Ge, G.B., Yang, L., 2008. Interspecific relationships and origins of rpoA from the chloroplast to the nucleus. Nucleic Acid Res. 31, 5324–5331. Taxaceae and Cephalotaxaceae revealed by partitioned Bayesian analyses of chloro- Sun, Y., et al., 2013. Complete plastid genome sequencing of Trochodendraceae reveals a plast and nuclear DNA sequences. Plant Syst. Evol. 276, 89–104. significant expansion of the inverted repeat and suggests a Paleogene divergence be- Yang, M., et al., 2010. The complete chloroplast genome sequence of date palm (Phoenix tween the two extant species. PLoS One 8, e60429. dactylifera L.). PLoS One 5, e12762. Thiel, T., Michalek, W., Varshney, R.K., Graner, A., 2003. Exploiting EST databases for the Yi, X., Gao, L., Wang, B., Su, Y.J., Wang, T., 2013. The complete chloroplast genome sequence development and characterization of gene-derived SSR-markers in barley (Hordeum of Cephalotaxus oliveri (Cephalotaxaceae): evolutionary comparison of Cephalotaxus vulgare L.). Theor. Appl. Genet. 106, 411–422. chloroplast DNAs and insights into the loss of inverted repeat copies in gymnosperms. Tsudzuki, J., et al., 1992. Chloroplast DNA of black pine retains a residual inverted repeat Genome Biol. Evol. 5, 688–698. lacking rRNA genes: nucleotide sequences of trnQ, trnK, psbA, trnI and trnH and the Zanis, M.J., Soltis, D.E., Soltis, P.S., Mathews, S., Donoghue, M.J., 2002. The root of the angio- absence of rps16.Mol.Gen.Genet.232,206–214. sperms revisited. Proc. Natl. Acad. Sci. U. S. A. 99, 6848–6853.