Copyright
by
Zhengqiu Cai
2010
The Dissertation Committee for Zhengqiu Cai
certifies that this is the approved version of the following dissertation:
Comparative Analyses of Land Plant Plastid Genomes
Committee:
______
Robert K. Jansen, Supervisor
______
Edward Theriot
______
John W. La Claire II
______
David L. Herrin
______
Claus O. Wilke
Comparative Analyses of Land Plant Plastid Genomes
by
Zhengqiu Cai, B.S.; M.S.
Dissertation
Presented to the Faculty of the Graduate School of
The University of Texas at Austin
in Partial Fulfillment
of the Requirements
for the Degree of
Doctor of Philosophy
The University of Texas at Austin
December 2010
Acknowledgements
I would like to thank my major professor Bob Jansen and professor Edward Theriot for their support and encouragement to pursue my study. I would also like to thank my dissertation committee member John Claire, David L. Herrin, Claus O. Wilke, and David Graham for serving on my committee. I would like to thank all the people in the Jansen and Theriot lab for being supportive over the past years.
iv
Comparative Analyses of Land Plant Plastid Genomes
Zhengqiu Cai, Ph.D.
The University of Texas at Austin, 2010
Supervisor: Robert K. Jansen
The availability of complete plastid genomes has been playing an important role in resolving phylogenetic relationships among the major clades of land plants and in improving our understanding of the evolution of genomic organization. The increased availability of complete genome sequences has enabled researchers to build large multi-gene datasets for phylogenetic and molecular evolutionary studies.
In chapter 2 of this thesis a web-based multiple sequence web viewer and alignment tool
(MSWAT) is developed to handle large amount of data generated from complete genome sequences for phylogenetic and evolutionary analyses. We expect that MSWAT will be of general interest to biologists who are building large data matrices for evolutionary analyses.
The third chapter presents the sequenced plastid genomes of three magnoliids, Drimys
(Canellales), Liriodendron (Magnoliales), and Piper (Piperales). Data from these genomes, in combination with 32 other angiosperm plastid genomes, were used to assess phylogenetic relationships of magnoliids to other angiosperms and to examine patterns of variation of GC content. Evolutionary comparisons of three new magnoliid plastid genome sequences, combined with other published angiosperm genomes, confirm that GC content is unevenly distributed across the genome by location, codon position, and functional group. Furthermore, phylogenetic analyses provide the strongest support so far for the hypothesis that the magnoliids are sister to a large clade that includes both monocots and eudicots.
v
The fourth chapter presents the Trifolium subterraneum plastid genome sequence, which is unusual in genome size and organization relative to other angiosperm plastid genomes.
The Trifolium plastid genome is an excellent model system to examine mechanisms of rearrangements and the evolution of repeats and unique DNA.
vi
TABLE OF of CONTENTS
Acknowledgments iv
Abstract v
List of Tables ix
List of Figures x
Chapter 1. Introduction 1
Chapter 2. A Multiple sequence alignment tool for phylogenetic and molecular
evolutionary analyses 3
2.1 Abstract…………………………………………………………………….....3
2.2 Background…………………………………………………………………...5
2.3 Implementation of MSWAT…………………………………………………..5
2.4 Results and Discussion………………………………………………………..6
2.5 Availability and requirements…………………………………………………7
Chapter 3. Complete plastid genome sequences of Drimys, Liriodendron, and
Piper: implications for the phylogenetic relationships of magnoliids…………….8
3.1 Abstract………………………………………………………………………..8
3.2 Background…………………………………………………………………....9
3.3 Results………………………………………………………………………...11
3.4 Discussion…………………………………………………………………….15
3.5 Conclusion…………………………………………………………………….21
3.6 Methods……………………………………………………………………….21
Chapter 4. Organization and evolution of the Trifolium subterraneum plastid
genome…………………………………………………………………………….26
vii
4.1 Abstract………………………………………………………………………..26
4.2 Introduction……………………………………………………………………27
4.3 Materials and Methods………………………………………………………...29
4.4 Results………………………………………………………………………....32
4.5 Discussion……………………………………………………………………..36
Tables and Figures………………………………………………………………………..40
Glossary…………………………………………………………………………………..68
References………………………………………………………………………………...69
Vita………………………………………………………………………………………..92
viii
List of Tables
Table 3.1.
Comparison of major features of magnoliid plastid genomes
Table 3.2.
Taxa included in phylogenetic analyses with GenBank accession numbers and references.
Table 4.1.
Comparison of major features of legume plastid genomes
Table 4.2.
The rpl23 family of dispersed repeats in the Trifolium genome
Table 4.3.
Results of BLASTX comparisons for 160 unique DNA sequence fragments from the Trifolium plastid genome.
ix
List of Figures
Figure 2.1.
Main MSWAT window with instructions and login information.
Figure 2.2.
MSWAT amino acid alignment window showing buttons for various tasks including conversion to nucleotide sequences and adding new genomes.
Figure 2.3.
MSWAT nucleotide sequence alignment window showing buttons for exporting alignments in various formats.
Figure 2.4.
MSWAT nucleotide sequence alignment exported in NEXUS format.
Figure 3.1.
Gene map of the Drimys granatensis plastid genome. The thick lines indicate the extent of the inverted repeats (IRa and IRb), which separate the genome into small (SSC) and large (LSC) single copy regions. Genes on the outside of the map are transcribed in the clockwise direction and genes on the inside of the map are transcribed in the counterclockwise direction.
Figure 3.2.
Gene map of the Piper coenoclatum plastid genome. The thick lines indicate the extent of the inverted repeats (IRa and IRb), which separate the genome into small (SSC) and large (LSC) single copy regions. Genes on the outside of the map are transcribed in the clockwise direction and genes on the inside of the map are transcribed in the counterclockwise direction.
Figure 3.3.
Histogram of GC content for 34 seed plant plastid genomes, including the gymnosperm Pinus
x and 33 angiosperms (see Table 3.2 for list of genomes). Taxa are arranged phylogenetically following tree in Figure 3.8. A. Overall GC content of complete genomes. B. GC content for 66 protein-coding genes, including average value for all codon positions, followed by values for the
1st, 2nd, and 3rd codon positions, respectively.
Figure 3.4.
Graphs of GC content plotted over the entire plastid genomes of Drimys and Piper. X axis represents the proportion of GC content between 0 and 1 and the Y axis gives the coordinates in kb for the genomes. Coding and non-coding regions are indicated in blue and red, respectively.
The green dashed line indicates that average GC content for the entire genome.
Figure 3.5.
Histogram of GC content for photosynthetic and genetic system genes for 34 seed plant plastid genomes (see Table 3.2 for list of genomes). Taxa are arranged phylogenetically following tree in
Figure 3.8. GC content includes average value for all codon positions, followed by values for the 1st, 2nd, and 3rd codon positions, respectively. A. GC content for 33 photosynthetic genes. B.
GC content for 22 genetic system genes.
Figure 3.6.
Histogram of GC content for NADH and rRNA genes for 34 seed plant plastid genomes (see
Table 3.2 for list of genomes). Taxa are arranged phylogenetically following tree in Figure 3.8. A.
GC content for 11 NADH genes, which includes average value for all codon positions, followed by values for the 1st, 2nd, and 3rd codon positions, respectively. B. GC content for four rRNA genes.
Figure 3.7.
Graphs of GC content plotted over the entire plastid genomes of 34 seed plants. The graphs are organized by genomes with the same gene order and by clade in the phylogenetic trees in Figures
xi
3.8 and 3.9. X axis represents the proportion of GC content between 0 and 1 and the Y axis gives the coordinates in kb for the genomes.
Figure 3.8.
Phylogenetic tree of 35-taxon data set based on 61 plastid protein-coding genes using maximum parsimony. The tree has a length of 61,095, a consistency index of 0.41 (excluding uninformative characters) and a retention index of 0.57. Numbers at each node are bootstrap support values.
Numbers above node indicate number of changes along each branch and numbers below nodes are bootstrap support values. Ordinal and higher level group names follow APG II [93]. Taxa in red are the three new genomes reported in this paper.
Figure 3.9.
Phylogenetic tree of 35-taxon data set based on 61 plastid protein-coding genes using maximum likelihood. The single ML tree has an ML value of – lnL = 342478.92. Numbers at nodes are bootstrap support values ≥ 50%. Scale at base of tree indicates the number of base substitutions.
Ordinal and higher level group names follow APG II [93]. Taxa in red are the three new genomes reported in this paper.
Figure. 4.1.
Gene map of the Trifolium subterraneum plastid genome. Genes on the outside of the map are transcribed in the clockwise direction and genes on the inside of the map are transcribed in the counterclockwise direction. Boxed areas indicate the 18 clusters of genes (see Figure 4.2).
Figure 4.2.
Comparison of the gene order and orientation of plastid genomes between Medicago (upper) and
Trifolium (lower). The arrows refer the clusters of genes as numbered in Figure 4.1, with the red arrows indicating an inverted orientation. The scale bar at the bottom shows the positions of the clusters of genes on the genomes.
xii
Figure 4.3.
Distribution of genes and repeats on the Trifolium plastid genome. Coordinates indicate positions in the genome. The green line represents genes and the red line represents repeats.
The vertical dashed lines indicate the boundaries of clusters of genes as defined in Figure 4.1, with the genes at the boundaries labeled.
Figure 4.4.
Comparison of the percentage of repeats in the plastid genomes of 15 angiosperms.
Figure 4.5.
Comparison of the number and sizes of repeats in the plastid genomes of 14 angiosperms.
Figure 4.6.
The rpl23 family of dispersed repeats with their position in the Trifolium genome. The top panel is the full-length copy of rpl23, and the other six panels are repeats 1, 2, 3, 4, 5, 6, respectively.
The same colors represent identical sequences. Coordinates indicate location of repeats in the
Trifolium genome.
Figure 4.7.
Comparison of unique DNA among six legume plastid genomes.
xiii
Chapter 1
Introduction
The availability of complete plastid genomes has been playing an important role in resolving phylogenetic relationships and in improving our understanding of genomic organization and rearrangement. The increased availability of complete genome sequences has enabled researchers to build large multi-gene datasets for phylogenetic and molecular evolutionary studies. One of the limiting steps in such analyses is the alignment and editing of these enormous data matrices. Existing alignment and editing tools are limited by the length of sequences and the options they provide of exporting data into the appropriate format for widely used phylogenetic and molecular evolutionary software. These limitations have forced researchers to perform much of this work manually, which is both time consuming and error prone. In chapter 2 a new bioinformatic tool, MSWAT (Multiple Sequence Web Viewer and
Alignment) for phylogenetic and molecular evolutionary analyses is developed. We expect that
MSWAT will be of general interest to biologists who are building large data matrices for molecular evolutionary analyses. The ability to assemble large matrices is increasingly important in view of the rapid expansion in high-throughput genomic sequencing.
The third chapter compared the complete plastid genome sequences of Drimys,
Liriodendron, and Piper with other genomes and resolved the phylogenetic relationships of magnoliids using the 61 protein coding genes shared among most angiosperm plastid genomes.
The Drimys, Liriodendron, and Piper plastid genomes are very similar in size at 160,604,
159,886 bp, and 160,624 bp, respectively. Gene content and order are nearly identical to many other unrearranged angiosperm plastid genomes. Evolutionary comparisons of three new
1 magnoliid plastid genome sequences, combined with other published angiosperm genomes, confirm that GC content is unevenly distributed across the genome by location, codon position, and functional group. Furthermore, phylogenetic analyses provide the strongest support so far for the hypothesis that the magnoliids are sister to a large clade that includes both monocots and eudicots.
The fourth chapter examined the complete plastid genome sequnce of Trifolium subterraneum. The plastid genome of this species is very unusual, and is 144,763 bp, about 20 kb longer than those of closely related legumes, which also lost one copy of the large inverted repeat
(IR). The genome has undergone extensive genomic reconfiguration. One unusual feature of the
Trifolium subterraneum genome is the large number of dispersed repeats, which comprise 19.5%
(ca. 28 kb) of the genome (versus about 4% for other angiosperms) and account for part of the increase in genome size. Comparisons of the Trifolium plastid genome with the Plant Repeat
Database and searches for flanking inverted repeats suggest that the high incidence of dispersed repeats and rearrangements is not likely the result of transposition. Trifolium has 19.5 kb of unique DNA distributed among 160 fragments ranging in size from 30 to 494 bp, greatly surpassing the other five sequenced legume plastid genomes in novel DNA content. At least some of this unique DNA may represent horizontal transfer from bacterial genomes. These unusual features provide direction for the development of more complex models of plastid genome evolution.
2
Chapter 2
MSWAT: A Multiple Sequence Web Viewer and Alignment Tool for Phylogenetic and
Molecular Evolutionary Analyses
2.1 Abstract
Background: The increased availability of complete genome sequences has enabled researchers to build large multi-gene datasets for phylogenetic and molecular evolutionary studies. One of the limiting steps in such analyses is the alignment and editing of these enormous data matrices. Existing alignment and editing tools are limited by the length of sequences and the options they provide of exporting data into the appropriate format for widely used phylogenetic and molecular evolutionary software. These limitations have forced researchers to perform much of this work manually, which is both time consuming and error prone.
Results: MSWAT is a multiple sequence web viewer and alignment tool for phylogenetic and molecular evolutionary analyses. MSWAT provides a user-friendly platform for aligning and editing such large data sets. For protein-coding genes, MSWAT first aligns amino acid sequences, includes an editing function to make manual adjustments, and then aligns the nucleotide sequences by constraining them to the amino acid alignment. The tool has been used to build a database of 81 aligned genes from 64 completely sequenced plastid genomes. In addition to providing a user-friendly alignment tool, MSWAT enables the export of aligned data into formats utilized by several commonly used programs for phylogenetic and molecular evolutionary analyses. All alignments are password protected on a server, which maintains the privacy of the users' data.
3
Conclusion: We expect that MSWAT will be of general interest to evolutionary biologists who are building large data matrices of phylogenetic and molecular evolutionary analyses. The ability to assemble large matrices is increasingly important in view of the rapid expansion in high-throughput genomic sequencing.
4
2.2 Background
The rapid increase in the availability of complete genome sequences, in particular of small organellar genomes, has prompted the use of these data for addressing phylogenetic and evolutionary questions based on numerous shared gene sequences [1-4]. Currently there are
1,416 mitochondrial and 134 plastid genome sequences on GenBank
(http://www.ncbi.nlm.nih.gov/genomes/ genlist.cgi?taxid=2759& type=4&name=Eukaryotae%20Organelles), and the rate of completion of these genomes will continue to increase rapidly with the development of cheaper and faster methods for sequencing
[3, 5]. As more complete plastid genomes have been sequenced in recent years, multi-gene data sets have been applied extensively to resolve phylogenetic relationships among plants, especially angiosperms [2 - 4, 6-11]. One of the primary limitations of this approach has been the time- consuming and error-fraught process of generating aligned data sets. We have developed a new web-based tool, Multiple Sequence Web Viewer and Alignment Tool (MSWAT), to enable the handling of large amounts of data generated from complete genome sequences. This tool provides a user-friendly platform for aligning and editing sequences and exports data in formats commonly used in phylogenetic and molecular evolutionary analyses. In addition, a database of aligned sequences for 81 genes for 64 completely sequenced plastid genomes [3] is included in
MSWAT, which facilitates the addition of new genomes as they become available.
2.3 Implementation of MSWAT
MSWAT (Fig. 2.1) is a multiple sequence web viewer and alignment tool using FASTA formatted files as input. A user name and password are required to access MSWAT, which maintains the privacy of the users' data. MSWAT was initially designed and implemented to align shared genes from completely sequenced plastid genomes [3, 8], and it has been used to generate an alignment
5 for 81 chloroplast genes for 64 taxa ([3], Figs. 2.2-2.4). The tool can be used to align new sequences against an existing data matrix, thereby expanding the dataset, or a new dataset composed of sequences from other genomes can be generated. All protein-coding nucleotide sequences in the input files are translated into amino acid sequences, which are aligned using
MUSCLE [12] and/or ClustalW [13]. Manual adjustments to the amino acid alignment can be made, followed by alignment of nucleotide sequences by constraining them to the amino acid alignment (Fig. 2.2). Manual adjustments can also be made to the nucleotide alignments (Fig.
2.3). In cases in which the codon position is unknown for partial sequences of protein-coding genes, MSWAT will translate the nucleotide sequences in all six possible reading frames and select the translation with the fewest number of stop codons for subsequent alignments.
The user is able to export files in different formats that can be used for phylogenetic and molecular evolutionary analyses. Both interleaved and non-interleaved NEXUS files (Fig. 2.4) can be generated that include data blocks and assumptions blocks with character sets defined for use by PAUP [14], MacClade [15], and Mesquite [16]. In addition, alignments can be exported in PAML [17] format for molecular evolutionary analyses.
2.4 Results and Discussion
MSWAT has a number of advantages over previously published multiple sequence alignment and editing tools developed for phylogenetic and molecular evolutionary analyses. The commonly used data editor MacClade [15] is limited to 32,000 characters and thus cannot be used for data sets generated from most complete genome sequences. Although the recently designed program SQUINT [18] has no limit to the length of sequences and it allows export into different file formats, MSWAT has several useful features that are not included in SQUINT: accessibility via web browsers; manual adjustment of sequence data and alignment; the ability to easily add
6 new genomes to an existing dataset; the addition of character and assumption sets in files exported into the NEXUS format; translation of nucleotide sequences of protein-coding genes in all six reading frames, and the alignment of multiple genes in a concatenated dataset.
2.5 Availability and requirements
Project name: MSWAT is readily available at http://mswat.ccbb.utexas.edu. The website includes a set of guidelines for implementing and using MSWAT.
Contact: [email protected]
Operating systems: MSWAT works with all popular operating systems and most web browsers, including Internet Explorer or Mozilla on a PC and Mozilla or Firefox on a MAC.
7
Chapter 3
Complete plastid genome sequences of Drimys, Liriodendron, and Piper: implications for
the phylogenetic relationships of magnoliids
3.1 Abstract
Background: The magnoliids with four orders, 19 families, and 8,500 species represent one of the largest clades of early diverging angiosperms. Although several recent angiosperm phylogenetic analyses supported the monophyly of magnoliids and suggested relationships among the orders, the limited number of genes examined resulted in only weak support, and these issues remain controversial. Furthermore, considerable incongruence resulted in phylogenetic reconstructions supporting three different sets of relationships among magnoliids and the two large angiosperm clades, monocots and eudicots. We sequenced the plastid genomes of three magnoliids, Drimys (Canellales), Liriodendron (Magnoliales), and Piper (Piperales), and used these data in combination with 32 other angiosperm plastid genomes to assess phylogenetic relationships among magnoliids and to examine patterns of variation of GC content.
Results: The Drimys, Liriodendron, and Piper plastid genomes are very similar in size at
160,604, 159,886 bp, and 160,624 bp, respectively. Gene content and order are nearly identical to many other unrearranged angiosperm plastid genomes, including Calycanthus, the other published magnoliid genome. Overall GC content ranges from 34–39%, and coding regions have a substantially higher GC content than non-coding regions. Among protein-coding genes, GC content varies by codon position with 1st codon > 2nd codon > 3rd codon, and it varies by functional group with photosynthetic genes having the highest percentage and NADH genes the lowest. Phylogenetic analyses using parsimony and likelihood methods and sequences of 61 protein-coding genes provided strong support for the monophyly of magnoliids and two strongly
8 supported groups were identified, the Canellales/Piperales and the Laurales/Magnoliales. Strong support is reported for monocots and eudicots as sister clades with magnoliids diverging before the monocot-eudicot split. The trees also provided moderate or strong support for the position of
Amborella as sister to a clade including all other angiosperms.
Conclusion: Evolutionary comparisons of three new magnoliid plastid genome sequences, combined with other published angiosperm genomes, confirm that GC content is unevenly distributed across the genome by location, codon position, and functional group. Furthermore, phylogenetic analyses provide the strongest support so far for the hypothesis that the magnoliids are sister to a large clade that includes both monocots and eudicots.
3.2 Background
Phylogenetic relationships among basal angiosperms have been debated for over 100 years since
Darwin [19] identified the origin and rapid diversification of flowering plants as an "abominable mystery". The difficulty of resolving these relationships is likely due to the rapid radiation of angiosperms. During the past decade there has been considerable interest in estimating phylogenetic relationships based on single or multiple genes to resolve the basal radiation of flowering plants [20-37]. Recently, completely sequenced plastid genomes have been used to estimate relationships among basal angiosperms [27-29, 36, 37] and comprehensive analyses of these data concur with earlier studies resolving either Amborella or Amborella plus the
Nymphaelales as the basal-most clade, sister to all other angiosperms [20-26], [30-38]. Although the whole plastid genome approach has enhanced our understanding of basal angiosperm relationships, issues of taxon sampling and methods of phylogenetic analysis have generated considerable controversy regarding the efficacy of this approach [31, 36-41]. One of the major limitations of this approach is the paucity of plastid genome sequences from early diverging
9 lineages, especially the magnoliids, which are currently represented only by Calycanthus [42].
With four orders, 19 families, and approximately 8,500 species, the magnoliids comprise the largest clade of early diverging angiosperms [43]. Recent single and multigene trees have provided only weak to moderate support for the monophyly of this clade [25, 32, 33, 44-46]. One notable exception is the phylogenetic estimate of Qiu et al. [35] based on eight plastid, mitochondrial, and nuclear genes and 162 taxa representing all of the major lineages of gymnosperms and angiosperms. This eight-gene tree provides the first strong support for the monophyly of the four orders of magnoliids, and for sister group relationships between the
Canellales/Piperales and Laurales/Magnoliales.
One of the most important remaining phylogenetic issues regarding angiosperms is the relationship of magnoliids to the other major clades, including the monocots and eudicots. All three possible relationships among these major angiosperm clades have been generated based on single and multiple gene trees, but none receives strong support in any of these studies [reviewed in [43]. The three-gene phylogenetic tree of Soltis et al. [45] placed the magnoliids and monocots in a clade that was sister to the eudicots. Support for the monophyly of this group, which was referred to as eumagnoliids, was weak with a jackknife value of only 56%. This same topology was generated in Bayesian analyses using the plastid gene matK, with a parsimony bootstrap value of 78% and a Bayesian posterior probability of 0.73 [47]. Two multi-gene molecular phylogenetic studies identified the magnoliids as sister to the eudicots. The 11-gene MP trees in
Zanis et al. [25] provided only weak support (56% bootstrap value) for the sister group relationship between magnoliids and eudicots, and in the 9-gene analyses of Qiu et al. [44] support for this same relationship increased to 78% in ML trees. Finally, a third possible resolution of relationships among magnoliids, monocots, and eudicots was recovered in two other studies based on phytochrome genes [21] and 17 plastid genes [24]. These studies
10 suggested that magnoliids were sister to a clade that included monocots and eudicots, however, support for this relationship had only weak or moderate bootstrap support (< 50% in Mathews and Donoghue [21] and 76 or 83% in Graham and Olmstead [24]). Thus, despite intensive efforts during the past 10 years relationships among magnoliids, monocots, and eudicots remain problematic.
In this paper, we report on the complete sequences of three magnoliid plastid genomes
(Drimys, Liriodendron, and Piper). We characterize the organization of two of these genomes, including the most comprehensive comparisons of GC content among completely sequenced plastid genomes. Furthermore, the results of phylogenetic analyses of DNA sequences of 61 genes for 35 taxa, including 33 angiosperms and two gymnosperm outgroups provide new evidence for resolving relationships among angiosperms with an emphasis on relative positions of magnoliids, monocots and eudicots. These results have implications for our understanding of floral evolution in the angiosperms [e.g., [48]].
3.3 Results
Size, gene content, order and organization of the Drimys and Piper plastid genomes
We have sequenced the plastid genomes of three genera of magnoliids, Drimys, Liriodendron, and Piper. In this paper, we only characterize the genome of two of these, Drimys and Piper, and we use 61 protein-coding genes from all three for the phylogenetic analyses. The genome sequence for Liriodendron will be described in a future paper on the utility of 454 sequencing
[49] for plastid genomes (J. E. Carlson et al. in progress).
The sizes of the Drimys and Piper plastid genomes are nearly identical at 160,604 and
160,624 bp, respectively (Figs. 3.1, 3.2), and within a kb of the Liriodendron plastid genome size
(159,886 bp; J. E. Carlson et al. in progress). The genomes include a pair of inverted repeats of
11
26,649 bp (Drimys) and 27,039 bp (Piper), separated by a small single copy region of 18,621 bp
(Drimys) and 18,878 bp (Piper) and a large single copy region 88,685 bp (Drimys) and 87,668 bp (Piper). The Drimys IR has expanded on the IRa side to duplicate trnH-gug. This expansion has not increased the overall size of the IR in Drimys because two of the genes in the IR of
Drimys are shorter than they are in Piper (ycf2 is 6909 and 6945 and ndhB is 1533 and 1686 in
Drimys and Piper, respectively).
The Drimys and Piper plastid genomes contain 113 different genes, and 18 (Drimys) and
17 (Piper) of these are duplicated in the IR, giving a total of 130–131 genes (Figs. 3.1, 3.2, Table
3.1). There are only two differences in gene content between these two magnoliid genomes; one is due to the duplication of trnH-gug in Drimys and the second is that ycf1 appears to be a pseudogene in Piper, since it has an internal stop codon that results in a truncated gene that is only 927 bp long (versus 5,574 bp in Drimys). Eighteen genes contain introns, 15 of which contain one intron and three (clpP, rps12, and ycf3) with two introns (Table 3.1). There are 30 distinct tRNAs, and 8 and 7 of these are duplicated in the IR of Drimys and Piper, respectively.
The genomes consist of 50.12% (Drimys) and 48.36% (Piper) protein-coding genes, 7.38%
(Drimys) and 7.34% (Piper) RNA genes, and 42.5% (Drimys) and 44.3% (Piper) non-coding regions (intergenic spacers and introns). Gene order is identical in the magnoliid plastid genomes sequenced from Drimys, Piper, Liriodendron (J. E. Carlson et al. in progress), and Calycanthus
[42], and those of most angiosperms, including tobacco.
GC content
The overall GC content of the Piper and Drimys plastid genomes is similar, 38.31% and 38.79% respectively. These values are within the range of 34–39% GC content but slightly higher than that of the average for 35 plastid genomes representing all currently available angiosperms and one gymnosperm Pinus (Fig. 3.3A). In some cases, overall GC content correlates with
12 phylogenetic position: early diverging lineages (i.e., Amborella, Nymphaeales, magnoliids) tend to have a higher GC content and legumes and non-grass monocots have lower values (Fig. 3.3A).
GC content is not uniformly distributed across the plastid genome (Figs. 3.3, 3.4, 3.5, 3.6, 3.7).
In general, GC content is higher in coding regions than in non-coding regions (i.e., intergenic spacers and introns) (Fig. 3.4). This pattern is also supported by the observation that GC content of protein-coding genes is higher than the overall GC content for the complete genomes
(compare Figs. 3.3A, B). A t test indicated that this pattern is statistically significant at p < 0.01.
GC content also varies by codon position with the 1st codon > 2nd codon > 3rd codon (Figs.
3.5A, B, 3.6A). GC content was also compared by partitioning protein-coding genes into three functional groups (Figs. 3.3A, 3.5A, B, 3.6A). This comparison demonstrates that the percent of
GC for all three codon positions is highest in photosynthesis genes, followed by genetic system genes, and lowest in NADH genes. Statistical tests using ANOVA demonstrated that differences in GC content by codon positions and functional groups are significant at p < 0.01. GC content is similar among all genomes (Fig. 3.7) even in taxa that have different gene orders (i.e., grasses and legumes). The IR regions have higher GC content and the SSC has the lowest. The much higher GC content in the IR is due to the presence of rRNA genes (Fig. 3.6B). The lower GC content in the SSC is due to the presence of eight of the 11 NADH genes, which have a lower
GC content than photosynthetic and genetic system genes (Figs. 3.5A, B, 3.6A). This genome wide pattern of GC content is maintained even when one copy of the IR is lost (bottom right panel in Fig. 3.7).
Phylogenetic analyses
Our phylogenetic data set included 61 protein-coding genes for 35 taxa (Table 3.1), including 33 angiosperms and two gymnosperm outgroups (Pinus and Ginkgo). The data set comprised
45,879 nucleotide positions but when the gaps were excluded there were 39,378 characters.
13
Maximum Parsimony (MP) analyses resulted in a single, fully resolved tree with a length of
61,095, a consistency index of 0.41 (excluding uninformative characters), and a retention index of 0.57 (Fig. 3.8). Bootstrap analyses indicated that 22 of the 32 nodes were supported by values
≥ 95% and 18 of these had a bootstrap value of 100%. Of the remaining 10 nodes, five had bootstrap values between 80–95%. Maximum likelihood (ML) analysis resulted in a single tree with – lnL = 342478.92 (Fig. 3.9). ML bootstrap values also were also high, with values of ≥
95% for 28 of the 32 nodes and 100% for 23 these nodes. The ML and MP trees had similar topologies. Both trees indicate that Amborella alone forms the earliest diverging angiosperm, however, support for this placement is much higher in the MP tree (100%) than the ML tree
(63%). The next most basal clade includes the Nymphaeales (Nuphar and Nymphaea) and support for this relationship is 100% in both MP and ML trees. Recent phylogenetic trees based on complete plastid genome sequences [29,36,37] have highlighted the difficulty of resolving the relative position of Amborella and the Nymphaeales. Two alternative hypotheses that have received the most support are: Amborella as the earliest diverging lineage of angiosperms, or
Amborella and the Nymphaeales forming a clade sister to all remaining extant angiosperms. In a previous study using 61 plastid genes (see Fig. 3.4 in [36]) the first hypothesis was strongly supported (100%) in parsimony trees and the second hypothesis received only moderate support
(63%) in ML trees. Both MP and ML trees support the basal position of Amborella alone in our expanded taxon sampling. We performed a SH test [50] to determine if the
Amborella/Nymphaeales basal hypothesis is a reasonable alternative to the ML and MP trees that support the Amborella basal topology. The ML score for the alternative topology was -ln L =
339758.53918 versus 339753.32204 for the best ML tree. The difference in the -ln L was
5.21714 with a p = 0.303. Thus, the Amborella/Nymphaeales basal hypothesis could not be rejected by the SH test, indicating that identification of the most basal angiosperm lineage
14 remains unresolved even with the addition of three magnoliid genomes.
Relationships among most other major angiosperm clades are congruent in the MP and
ML trees. Monophyly of the magnoliids is strongly supported with 92 (MP) or 100% (ML) bootstrap values. Within magnoliids there are two well-supported clades, one including the
Canellales/Piperales with 74 (MP) or 99 (ML)% bootstrap support, and a second including the
Laurales/Magnoliales with 82 (MP) or 99 (ML)% bootstrap support. The magnoliid clade forms a sister group to a large clade that includes the monocots and eudicots. Support for the sister relationship of magnoliids to the remaining angiosperms is moderate (72% in MP tree, Fig. 3.8) or strong (99% in ML tree, Fig. 3.9).
Support for the monophyly of monocots and eudicots is strong with 100% bootstrap values, and relationships among the monocots is identical in both analyses. The Ranunculales occupy the earliest diverging lineage among eudicots, and they are sister to two major, strongly supported clades, the rosids and Caryophyllales/asterids. There is strong support for the sister group relationship between the Caryophyllales and asterids (92% in MP and 100% in ML). Both
MP and ML trees provide strong support for the placement of Vitis as the earliest diverging lineage within the rosids. The only remaining incongruence between MP and ML trees is found within the rosids. In both analyses, eurosids I are not monophyletic, although support for relationships among the five representatives of this clade to the eurosid II and Myrtales clades is not strong.
3.4 Discussion
Genome organization and evolution of GC content
The organization of the Drimys and Piper genomes with two copies of an IR separating the SSC and LSC regions is identical to most sequenced angiosperm plastid genomes [reviewed in [33]].
The sizes of the genome at 160,604 and 160,624 bp, respectively are similar to the each other
15 and Liriodendron, but larger than the only other sequenced magnoliid genome (Calycanthus,
153,337, 23, Table 3.1). Most of this size difference is due to the larger size of the IR in Drimys
(26,649 bp) and Piper (27,039 bp) relative to Calycanthus (23,295 bp), although some is also due to the larger LSC region (Table 3.1). Expansion and contraction of the IR is a common phenomenon in land plant plastid genomes [52] with the IR ranging in size from 9,589 bp in the moss Physcomitrella [53] to 75,741 bp in the highly rearranged angiosperm genome of
Pelargonium [54,55]. Among angiosperms the IR generally ranges in size between 20–27 kb, and the magnoliid genomes except for Calycanthus are at the high end of that range.
Gene order of the magnoliid plastid genomes is identical to tobacco and many other unrearranged angiosperm plastid genomes. There are a few differences in gene content and these can be explained by two phenomena. The first concerns differences in the annotation of two genes in these genomes. Two putative genes (ACRS and ycf15) in Calycanthus were not annotated in Drimys and Piper because several recent studies indicated that they are not functional plastid genes. The sequence of ycf15 has been shown to be highly variable among angiosperms, with conserved motifs at the 5' and 3' ends and an intervening sequence that makes it a pseudogene [56, 57]. An examination of ycf15 transcripts in spinach indicated that is not a functional protein-coding gene [56]. More recent sequence comparisons of ycf15 in other plastid genomes also supported the conclusion that this putative gene is not functional [27,57]. The
ACRS gene was identified by Goremykin et al. [42] in Calycanthus based on its very high sequence identity with the mitochondrial ACR-toxin sensitivity (ACRS) gene of Citrus jambhiri
[58]. This conserved sequence has been identified (as ycf68) in a number of plastid genomes, however, in cases where it has been critically examined the presence of internal stop codons indicates that this is a pseudogene. The second explanation for gene content differences among the three magnoliid genomes is caused by the expansion of the IR in Piper, which results in the
16 duplication of trnH. Small expansions of the IR boundary are common in plastid genomes [42] resulting in duplications of genes at the IR/SC boundaries. The duplication of trnH in Piper is shared with Nuphar, a member of the Nymphaeales (L. Raubeson et al. unpublished). This expansion of the IR to duplicate trnH has clearly happened independently in Piper and Nuphar since none of the other basal angiosperms or magnoliids have this duplication.
Examination of GC content in 34 seed plant plastid genomes reveals several clear patterns. GC content for the complete genomes ranges between 34–39% (Fig. 3.3A), confirming previous observations that plastid genomes are in general AT rich [42,57,59-65]. The uneven distribution of GC content over the plastid genome is also very evident, and there are several explanations for this pattern. First, there is a clear bias for the coding regions to have a significantly higher GC content than non-coding regions (Figs.3.3, 3.4), which again confirms previous observations based on comparisons of many fewer genomes [42]. Second, there is an uneven distribution of GC content by regions of the genome with the highest GC content in the
IR and the lowest in the SSC (Figs. 3.4, 3.7). The higher GC content in the IR can be attributed to the presence of the four rRNA genes in this region, which have the highest GC content of any coding regions (Fig. 3.6B). This higher GC content in the IR region is maintained even when one copy of the IR is lost as in Medicago and Pinus (Fig. 3.7, bottom right panel). The lower GC content in the SSC region is due to the presence of 8 of the 11 NADH genes, which have the lowest GC content of any of the classes of genes compared (Figs. 3.5, 3.6). Third, GC content varies by functional groups of genes. Among protein genes, GC content is highest for photosynthetic genes, lowest for NADH genes, with genetic system genes having intermediate values. This same pattern was observed by Shimada and Sugiura [59] in comparisons of the first three sequenced land plant plastid genomes.
Differences in GC content were also observed by codon position in protein-coding genes
17
(Figs. 3.3B, 3.5A, B, 3.6A). For each of the three classes of genes (photosynthetic, genetic system, and NADH) the third position in the codon has a significant AT bias. This pattern has been observed previously [59,63,64,66], and it has been attributed to codon bias. Previous studies have demonstrated that there is a strong A+T bias in the third codon position for plastid genes [63,64,66]. This is in contrast to a GC bias in codon usage for nuclear genes in plants [66].
Several studies have examined codon usage of plastid genes to attempt to determine if these biases can be attributed to nucleotide compositional bias, selection for translational efficiency, or a balance among mutational biases, natural selection, and genetic drift [67-71]. All of these studies have been limited to examining a single or few genes, and they have been constrained by the limited sampling of complete genome sequences for taking variation in GC content into account. Our comparisons of GC content variation for a wide diversity of angiosperm lineages provide a rich source of information for future investigations of the relationship between GC content and codon usage bias.
Phylogenetic implications
The debate concerning the identity of the most basal angiosperm lineage continues even though numerous molecular phylogenetic studies of angiosperms have been conducted over the past 15 years [21,22,24,26-29,32-38,44-47,72-84]. Several issues have confounded the resolution of relationships among basal angiosperms, including long branch attraction associated with sparse taxon density and poor taxon sampling, and conflict among trees obtained using different phylogenetic methodologies [22,24,25,29,31,36-38,85]. Most recent studies agree that Amborella and the Nymphaeales represent the earliest diverging angiosperm lineages [21,22,24-26,32-
38,44-47,75-84]. The most recent multi-gene phylogenetic reconstructions based on nine gene sequences from the plastid, mitochondrial, and nuclear genomes [35] generate trees supporting each of these two hypotheses depending on the method of phylogenetic analysis and the genes
18 included. Trees generated from plastid genes supported the Amborella basal hypothesis, whereas mitochondrial genes supported the Amborella + Nymphaeales hypothesis. Furthermore, MP analyses tended to support the Amborella basal hypothesis and ML analyses supported Amborella
+ Nymphaeales. A similar set of relationships was also observed in recent phylogenetic studies using sequences of 61 genes from completely sequenced plastid genomes [36,37,85]. In these studies, MP trees placed Amborella alone as the basal most angiosperm with strong support and
ML trees placed Amborella + Nymphaeales at the base with moderate support. These differences were attributed to rapid diversification and the lack of extant lineages that could be used to cut the length of branches leading to Amborella and the most recent common ancestor of lineages within the Nymphaeales.
Our phylogenetic analyses include three additional magnoliids from three different orders. Both MP and ML trees (Figs. 3.8, 3.9) support Amborella alone as the earliest diverging lineage of angiosperms. Support for this relationship is very strong in MP trees (100% bootstrap) and weak (63%) in ML trees. However, a SH test that constrained Amborella + Nymphaeales in a basal position indicated that the two hypotheses of basal angiosperm relationships are not significantly different. Thus, although both MP and ML analyses including the three additional magnoliid taxa support Amborella as the basal-most branch in the angiosperm phylogeny, sampling of more taxa and genes, and further investigations of model specification in phylogenetic analyses are needed before this issue is fully resolved [see discussion in 36,41].
Several earlier molecular phylogenetic studies based on one or a few genes [45,74,81] did not support the monophyly of magnoliids. Furthermore, morphological studies of angiosperms failed to detect any synapomorphies for this group. The circumscription, monophyly, and relationships of magnoliids has only recently been established based on phylogenetic analyses of multiple genes [42,43]. These earlier multigene trees provided only weak to moderate support for the
19 monophyly of magnoliids and the sister group relationships of the Canellales/Piperales and
Laurales/Magnoliales. A recent study using eight plastid, mitochondrial, and nuclear genes [35] provided the first strong support for both the monophyly and relationships among the four orders of magnoliids. Our phylogenetic trees based on 61 plastid protein-coding genes also provide strong support for the monophyly of magnoliids and the sister relationship between the
Canellales/Piperales and Laurales/Magnoliales.
One of the most controversial remaining issues regarding relationships among angiosperms concerns the resolution of relationships among the magnoliids, monocots and eudicots. Previous phylogenetic studies have supported three different hypotheses of relationships among these lineages: (1) (magnoliids (monocots, eudicots)), (2) (monocots
(magnoliids, eudicots)), and (3) (eudicots (magnoliids, monocots)). The first hypothesis was supported in phylogenetic analyses based on phytochrome genes [21] and 17 plastid genes [24] but bootstrap support for a sister relationship of monocots and eudicots was only 67%. Several studies supported the second hypothesis [25,34,86], however, bootstrap support was again weak ranging from 55 – 78%. The three-gene phylogenetic tree of Soltis et al. [45] supported the third hypothesis with only 56% jackknife support. This relationship was also recovered in a matK gene tree with a parsimony bootstrap value of 78% and a posterior probability of 0.73 [47]. Both
MP and ML trees based on 61 plastid-encoded protein genes support hypothesis 1 (Figs. 3.8, 3.9).
Branch support for this hypothesis is moderate (MP, Fig. 3.8) or strong (ML, Fig. 3.9).
Congruence of the results from both MP and ML analyses is notable because our previous phylogenetic analyses using whole plastid genomes that included only one member of the magnoliid clade (Calycanthus, 36,85) were incongruent. In these earlier studies, MP trees supported hypothesis 2 (monocots sister to a clade that included magnoliids and eudicots), whereas ML trees supported hypothesis 1 (magnoliids sister to a clade that included monocots
20 and dicots). These differences provide yet another example of the importance of expanded taxon sampling in phylogenetic studies using sequences from whole plastid genomes [36,85]. The addition of other angiosperm lineages, especially members of the Chloranthales,
Certatophyllaceae, and Illiciales may be critical for providing additional resolution of relationships among the major clades.
3.5 Conclusion
The genome sequences of three additional magnoliids have a very similar size and organization to the ancestral angiosperm plastid genome. Comparisons of 34 seed plant plastid genomes confirm that GC content is unevenly distributed across the genome by location, codon position, and functional group. Phylogenetic analyses for 61 protein-coding genes using both maximum parsimony and maximum likelihood methods for 35 seed plants provide moderate or strong support for the placement of Amborella sister to all other angiosperms. Furthermore, there is strong support for the monophyly of magnoliids and for the recognition of two major clades, the
Canellales/Piperales and the Laurales/Magnoliales. Finally, phylogenetic analyses provide the strongest support so far that magnoliids are sister to a large clade that includes both monocots and eudicots.
3.6 Methods
Plastid isolation, amplification, and sequencing
10–20 g of fresh leaf material of Drimys granatenis and Piper coenoclatum was used for the plastid isolation. Leaf material was obtained from the University of Connecticut Greenhouses
(accession numbers 200100052 for Drimys and 199600027 for Piper). Plastids were isolated from fresh leaves using the sucrose-gradient method [87]. They were then lysed and the entire plastid genome was amplified using Rolling Circular Amplification (RCA, using the REPLI-g™ whole genome amplification kit, Molecular Staging) following the methods outlined in Jansen et
21 al. [88]. The RCA product was then digested with the restriction enzymes EcoRI and BstBI and the resulting fragments were separated by agarose gel electrophoresis to determine the quality of plastid DNA. The RCA product was sheared by serial passage through a narrow aperture using a
Hydroshear device (Gene Machines), and the resulting fragments were enzymatically repaired to blunt ends and gel purified, then ligated into pUC18 plasmids. The clones were introduced into E. coli by electroporation, plated onto nutrient agar with antibiotic selection, and grown overnight.
Colonies were randomly selected and robotically processed through RCA of plasmid clones, sequencing reactions using BigDye chemistry (Applied Biosystems), reaction cleanup using solid-phase reversible immobilization, and sequencing using an ABI 3730 XL automated DNA sequencer. Detailed protocols are available at [89].
Genome assembly and annotation
Drimys and Piper. Sequences from randomly chosen clones were processed using PHRED and assembled based on overlapping sequence into a draft genome sequence using PHRAP [90].
Quality of the sequence and assembly was verified using Consed [91]. In most regions of the genomes we had 6–12-fold coverage but there were a few areas with gaps or low depth of coverage. PCR and sequencing at the University of Texas at Austin were used to bridge gaps and fill in areas of low coverage in the genome. Additional sequences were added until a completely contiguous consensus was created representing the entire plastid genome with a minimum of 2X coverage and a consensus quality score of Q40 or greater.
Liriodendron. The Liriodendron plastid genome sequence was obtained using 454 sequencing technology [49]. This work will be described more fully in a separate paper focused in the utility of 454 technology for plastid genome sequencing (J. E. Carlson et al. in progress).
Briefly, plastid genome containing clones were identified from a genomic bacterial artificial chromosome (BAC) library that had been constructed for Liriodendron [92]. The entire plastid
22 genome was sequenced from a pooled sample of DNAs isolated from three overlapping large- insert BAC clones. The BAC DNAs were processed for 454 sequencing according to protocols provided by 454 Life Sciences (Branford, CT, also see [49]). One-half plate run on the Genome
Sequencer 20™ System (Life Sciences, Branford, CT; Roche Applied Science, Indianapolis, IN) generated over 15,000 sequences with an average read length of 103 bp. Reads including vector sequence were removed and the remaining reads were assembled using the 454-de novo assembly software. Each base in the assembly was inferred based on an average of 90 independent reads.
Genome annotation
The coordinate of each genome was standardized for gene annotation to be the first bp after IRa on the psbA side. The genomes were annotated using the program DOGMA (Dual Organellar
GenoMe Annotator [93]). All genes, rRNAs, and tRNAs were identified using the plastid/bacterial genetic code.
Examination of GC content
GC content was calculated for 34 seed plant plastid genomes, including the gymnosperm Pinus and 33 angiosperms. GC content was also determined for 66 protein-coding genes. These genes were partitioned into three functional groups (photosynthesis (33), genetic system genes (22), and NADH (11) genes), and GC content was calculated for the entire gene and the first, second, and third codon positions. The genes included in each of these three groups are: (1) photosynthetic genes (atpA, atpB, atpE, atpF, atpH, atpI, psbZ, petA, petB, petD, petG, petL, petN, psaA, psaB, psaC, psaI, psaJ, psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbJ, psbK, psbL, psbM, psbN, psbT, rbcL); (2) genetic system genes (rpl14, rpl16, rpl2, rpl20, rpl32, rpl33, rpl36, rpoA, rpoB, rpoC1, rpoC2, rps11, rps12, rps14, rps15, rps18, rps19, rps2, rps3, rps4, rps7, rps8); and (3) NADH genes (ndhA, ndhB, ndhC, ndhD, ndhE, ndhF, ndhG, ndhH, ndhI, ndhJ,
23 ndhK). GC content was also plotted over the entire genome for all 34 taxa, which were classified into 10 groups based on gene order and phylogenetic placement (Figs. 3.8, 3.9).
Statistical analyses of GC content between rRNA genes and complete genomes and between 72 protein-coding genes and complete genomes were performed using a t test. Analyses of GC content among the three codon positions and among the three functional classes were performed using one way ANOVA. All statistical analyses used the program SPSS, version 9.0
[94].
Phylogenetic analysis
Alignment
The 61 protein-coding genes included in the analyses of Goremykin et al. [27,28] and Leebens-
Mack et al. [36] were extracted from Drimys, Liriodendron and Piper using the organellar genome annotation program DOGMA [93]. The same set of 61 genes was extracted from plastid genome sequences of 32 other sequenced genomes (see Table 3.1 for complete list). All 61 protein-coding genes of the 35 taxa were translated into amino acid sequences, which were aligned using MUSCLE [95] followed by manual adjustments, and then nucleotide sequences of these genes were aligned by constraining them to the aligned amino acid sequences. A Nexus file with character sets for phylogenetic analyses was generated after nucleotide sequence alignment was completed. The complete nucleotide alignment is deposited online at [93].
Tree reconstruction
Phylogenetic analyses using maximum parsimony (MP) and maximum likelihood (ML) were performed using PAUP* version 4.0b10 [97] on a data including 35 taxa (Table 3.1).
Phylogenetic analyses excluded gap regions. All MP searches included 100 random addition replicates and TBR branch swapping with the Multrees option. Modeltest 3.7 [98] was used to determine the most appropriate model of DNA sequence evolution for the combined 61-gene
24 dataset. Hierarchical likelihood ratio tests and the Akakle information criterion were used to assess which of the 56 models best fit the data, which was determined to be GTR + I + Γ by both criteria. For ML analyses we performed an initial parsimony search with 100 random addition sequence replicates and TBR branch swapping, which resulted in a single tree. Model parameters were optimized onto the parsimony tree. We fixed these parameters and performed a ML analysis with three random addition sequence replicates and TBR branch swapping. The resulting ML tree was used to re-optimize model parameters, which then were fixed for another ML search with three random addition sequence replicates and TBR branch swapping. This successive approximation procedure was repeated until the same tree topology and model parameters were recovered in multiple, consecutive iterations. This tree was accepted as the final ML tree (Fig.
3.9). Successive approximation has been shown to perform as well as full-optimization analyses for a number of empirical and simulated datasets [99]. Non-parametric bootstrap analyses [100] were performed for MP analyses with 1000 replicates with TBR branch swapping, 1 random addition replicate, and the Multrees option and for ML analyses with 100 replicates with NNI branch swapping, 1 random addition replicate, and the Multrees option.
Test of alternate topology
A Shimodaira-Hasegawa (SH) test [50] was performed to determine if the alternative topology with Amborella + Nymphaeales basal was significantly worse than the ML tree that places
Amborella alone as the basal angiosperm lineage. A constraint topology with this alternative tree topology was used and the SH test was conducted using RELL optimization [101] as implemented in PAUP* version 4.0b10 [97].
25
Chapter 4
Extensive reorganization of the plastid genome of Trifolium subterraneum (Fabaceae) is
associated with numerous repeated sequences and novel DNA insertions
4.1 Abstract
The plastid genome of Trifolium subterraneum is 144,763 bp, about 20 kb longer than those of closely related legumes, which also lost one copy of the large inverted repeat (IR). The genome has undergone extensive genomic reconfiguration, including the loss of six genes (accD, infA, rpl22, rps16, rps18, and ycf1) and two introns (clpP and rps12) and numerous gene order changes, attributable to 14–18 inversions. All endpoints of rearranged gene clusters are flanked by repeated sequences, tRNAs, or pseudogenes. One unusual feature of the Trifolium subterraneum genome is the large number of dispersed repeats, which comprise 19.5% (ca.
28 kb) of the genome (versus about 4% for other angiosperms) and account for part of the increase in genome size. Nine genes (psbT, rbcL, clpP, rps3, rpl23, atpB, psbN, trnI-cau, and ycf3) have also been duplicated either partially or completely. rpl23 is the most highly duplicated gene, with portions of this gene duplicated six times. Comparisons of the Trifolium plastid genome with the Plant Repeat Database and searches for flanking inverted repeats suggest that the high incidence of dispersed repeats and rearrangements is not likely the result of transposition. Trifolium has 19.5 kb of unique DNA distributed among 160 fragments ranging in size from 30 to 494 bp, greatly surpassing the other five sequenced legume plastid genomes in novel DNA content. At least some of this unique DNA may represent horizontal transfer from bacterial genomes. These unusual features provide direction for the development of more complex models of plastid genome evolution.
26
4.2 Introduction
During the past 5 years there has been a rapid increase in the availability of complete plastid genome sequences of angiosperms, due largely to the development of faster and cheaper methods for whole-genome sequencing [112-113]. The 90 angiosperm plastid genome sequences already available in GenBank (http://www.ncbi.nlm.nih.gov/genomes/ORGANELLES/plastids_tax.html) have provided valuable insight into plastid genome evolution [114-119]. These genome sequences support the view that the plastid genome is highly conserved in both gene order and content, with the majority having two copies of a large (usually 25-kb) inverted repeat (IR) separating the small and large single-copy regions (SSC and LSC, respectively). The ancestral angiosperm genome [117, 121-125] has 115 different genes, 18 of which contain introns. This ancestral organization is conserved across angiosperms [124] ranging from the earliest-diverging lineages, Amborella [120] and Nymphaeales [117, 126], to more derived lineages, including the asterid Nicotiana [127] and the monocot Acorus [128].
Recent studies have revealed exceptions to the prevailing view that angiosperm plastid genome organization is highly conserved. The most highly rearranged genomes from photosynthetic angiosperms published to date are Pelargonium (Geraniaceae [115]) and
Trachelium (Campanulaceae [118]). The extent of rearrangement in these two lineages is different, but the underlying mechanisms, inversions and expansion of the IR, are generally similar. Another common feature of these extensively rearranged plastid genomes is the prevalence of repeated sequences, many of which are associated with rearrangement endpoints.
Previous studies have provided evidence that recombination between inverted repeats or tRNA genes has caused inversions [129-133]. Furthermore, although transposition via transposable elements (TEs) has been suggested as a possible mechanism for gene order changes for a number
27 of highly rearranged angiosperm plastid genomes [134-135], no direct evidence for this process has been uncovered. In fact, the inactive ―Wendy‖ element in Chlamydomonas [136] remains the only known example of a TE in any plastid genome.
Early gene mapping studies [134] suggested that the plastid genome of Trifolium subterraneum is unusual among land plants in several respects. Milligan et al. [134] reported three unusual features of this genome. First, the gene order is highly rearranged, with 10 clusters of genes rearranged in both order and orientation relative to another legume, Medicago. Eight large inversions were proposed to explain the reorganization of these gene clusters. Second, a family of dispersed repeats unique to Trifolium and two closely related species was identified and proposed to originate from TEs. Third, unique repeated elements and unique single-copy sequences may represent rare instances of transfer of sequences into a plastid genome.
In this study, we present the complete plastid genome sequence of Trifolium subterraneum and provide detailed comparisons of repeats and unique DNA with other sequenced plastid genomes, including those of five other legumes. These comparisons confirm size estimates and overall genome architecture proposed previously [134] but also show that the
Trifolium genome is more highly rearranged than previously suggested. The genome includes
19.5% (ca. 28 kb) of repetitive DNA, five times more than any other angiosperm plastid genome sequenced to date, and these repeats are associated with rearrangements. However, we find no evidence of TEs, suggesting that inversion and gene duplication/loss are the primary mechanisms for gene order changes. The Trifolium plastid genome also includes 19.5 kb of unique DNA, some of which may represent lateral DNA transfer from other organisms.
28
4.3 Materials and Methods
Plastid DNA Isolation
Trifolium subterraneum seeds were obtained from the U.S. Department of Agriculture Plant
Introduction Service. Plastid DNA was extracted from freshly harvested leaves using a modification of the procedure outlined by Palmer [137], involving the use of a high-salt extraction buffer recommended by Bookjans et al. [138] and elimination of the sucrose-gradient step.
Genome Assembly and Annotation
Purified plastid DNA was sheared into ≈3-kb fragments using a Hydroshear device (Gene
Machines, San Carlos, CA, USA). These fragments were then end-repaired, gel-purified, ligated into pUC18 plasmid vectors to create a DNA library, introduced into competent E. coli by electroporation, and plated onto nutrient media with antibiotic selection. The resulting colonies were randomly selected and processed robotically for end sequencing using Big Dye (Applied
Biosystems, Foster City, CA, USA) chemistry and an ABI 3730 XL sequencer at the DOE Joint
Genome Institute (Walnut Creek, CA, USA). Sequences from randomly chosen clones were processed using PHRED and assembled based on overlapping sequences into a draft genome using PHRAP [139]. Quality of the sequence and assembly was verified using CONSED [140].
Most regions of the genome had 6- to 12-fold coverage, and areas with gaps or low depth of coverage were PCR amplified and sequenced at The University of Texas at Austin. Additional sequences were added until a completely contiguous consensus was created, representing the entire plastid genome, with a minimum of 2 × coverage and a consensus quality score of ≥Q40.
29
The genome was annotated using DOGMA (Dual Organellar GenoMe Annotator [141]; http://dogma.ccbb.utexas.edu/).
Identification of Repeats and Unique DNA
BLAST (default parameters) comparisons of 15 representative angiosperm plastid genomes were performed against themselves to identify repeated sequences in each genome. Repeats ≥30 bp were plotted. For the plastid genomes with two copies of the inverted repeat, one copy was removed. The genomes examined included the six legumes Cicer arietinum (NC_011163),
Glycine max (NC_007942), Lotus japonicus (NC_002694), Medicago truncatula (NC_003119),
Phaseolus vulgaris (NC_009259), and Trifolium subterraneum (EU849487) and nine other eudicots, Arabidopsis thaliana (NC_000932), Cucumis sativus (NC_007144), Eucalyptus globules (NC_008115), Morus indica (NC_008359), Nicotiana tabacum (NC_001879),
Pelargonium x hortorum (NC_008454), Populus alba (NC_008235), Trachelium caeruleum
(NC_010442), and Vitis vinifera (NC_007957).
BLAST searches of Trifolium were also performed against all plastid genomes on
GenBank (using release number 164) to identify unique sequences within the six legume plastid genomes. Unique DNA is defined as DNA 30 bp or longer not shared by legumes or any of the other available 131 plastid genomes. BLASTN and BLASTX analyses of the Trifolium unique
DNA were performed against GenBank.
Identification of Transposable Elements
To identify putative TEs, the Plant Repeat Database was downloaded from http://www.tigr.org/tdb/e2k1/plant.repeats/ and BLAST searches of Trifolium against this database were performed. BLAST searches were performed with an e value of 10, and repetitive
30 or low-complexity sequences were not filtered. In addition, the program LTR STRUC [142] was used to search for long terminal repeat (LTR) retrotransposons in all 131 plastid genomes available at GenBank to locate potential TEs.
Estimation of Number of Inversions
The genome comparison tool GRIMM [143] was used to estimate the minimum number of inversions required to derive the Trifolium gene order from that of Medicago. The GRIMM algorithm is limited in that it cannot accommodate gene duplications or other differences in gene content between the input genomes. Gene content, order, and orientation were determined for each genome using DOGMA, and an ordered matrix of each gene and its relative strand orientation was created. To equalize the content of the genomes, two genes (accD and rps18) absent from Trifolium were removed from Medicago; likewise, it was necessary to exclude the many pseudogenes (three fragments of psbN, six of rpl23) present in Trifolium from the analysis.
Novel DNA, representing ~20% of Trifolium, was also excluded from the analysis. Therefore, missing and duplicated genes in Trifolium were excluded from the input file for Medicago, resulting in a comparison of 107 shared genes.
Both pairwise BLAST and the Mauve genome alignment algorithm [144] revealed many clusters of genes, or local colinear blocks, that are in the same order in both Medicago and
Trifolium. Each cluster included two or more genes, for a total of 18 clusters or colinear, unrearranged blocks of genes (Fig. 4.1). In addition, three genes (trnL-caa, trnI-cau, and clpP) occur singly (i.e., alone and not part of a cluster), separated from other genes by novel DNA.
These 18 gene clusters were also coded for analysis with GRIMM.
31
Results
Organization of the Trifolium Plastid Genome
The complete plastid genome of Trifolium (GenBank accession number EU849847) has lost one copy of the IR, but this loss has not greatly reduced its overall size relative to those genomes containing two copies of the IR because the novel DNA in Trifolium is cumulatively similar in length to the typical angiosperm IR. Trifolium, like two relatives whose plastid genomes are being compared here, Cicer and Medicago, has only one copy of the IR. These three genomes are members of a large clade (IRLC [145]) of papilionoid legumes that is marked by the loss of one copy of the IR. The Trifolium subterraneum genome is 144,763 bp (Fig. 4.1), ~20 kb longer than other legumes with which it shares only a single copy of the IR (Table 4.1). The overall GC content of the six sequenced legume plastid genomes is similar, ranging from 34% to 36%
(Table 4.1). The Trifolium plastid genome contains 111 different genes. Six genes (accD, infA, rpl22, rps16, rps18, and ycf1) are missing, and two genes (clpP, rps12) have lost an intron relative to the ancestral angiosperm plastid genome [6]. Two of these genes, infA and rpl22, are missing from all legumes [146-147], and a third, rps16, has been lost from many papilionoid legumes [146].
The Trifolium genome is highly rearranged in both gene order and orientation.
Comparison of the Trifolium plastid genome with Medicago reveals that it is composed of 18 clusters of genes (Figs. 4.1 and 4.2); genes within each cluster are in the same order as in
Medicago, whereas the clusters themselves have been extensively shuffled across the genome
(Fig. 4.2). Considering only these 16 clusters (Fig. 4.1), GRIMM proposes 14 inversions to derive the Trifolium gene order from that of its close relative Medicago, some of which break up clusters of genes that are adjacent to each other in the remaining copy of the IR (Fig. 4.2). Four additional inversions (18 total) are required when including the three genes appearing alone,
32 apart from the blocks of genes. The latter scenario, with 18 inversions separating Trifolium from the gene order of Medicago, is supported by GRIMM analysis of the position and orientation of
107 genes common to both genomes.
It is not surprising that using both colinear blocks and individual genes in the GRIMM analyses converged on the same number of inversions, because they both exclude duplicated genes. There is still no comprehensive method for comparing genomes with unequal gene contents. Modeling the evolution of these rearranged genomes with confidence will require an improved algorithm capable of incorporating gene/intron losses, gene duplications, and, for the majority of plastid genomes, variation in the IR boundaries. Current models of plastid genome evolution based solely on inversions between a reduced, common set of genes cannot explain the proliferation of repeats such as pseudogenes in this or other rearranged genomes (e.g.,
Pelargonium [115] and Trachelium [118]). In Trifolium, the endpoints of most rearranged gene clusters are flanked by repeated sequences, tRNAs or pseudogenes (Fig. 4.3). More sophisticated models of plastid genome evolution may elucidate the roles of multiple pseudogenes and other types of repeats in the evolution of this genome and reveal the purely inversion-based model to be unsuitable for Trifolium.
Repeated Sequences
The great proliferation of repeated sequences is one of the most remarkable features of the
Trifolium subterraneum plastid genome and these repeats contribute to the unusually large size of the genome in comparison to other legumes lacking one copy of the IR. Among 15 representative eudicot plastid genomes examined, Trifolium exhibits the highest proportion of repeats (19.5%; ca. 28 kb) (Fig. 4.4). Our genome sampling included six legumes, eight nonlegume rosids, and an asterid. Aside from the highly rearranged plastid genome of the rosid
Pelargonium (Geraniaceae), nonlegume plastid genomes contain <5% repeated DNA. However,
33 legumes generally contain more repeated DNA; >5% of the genomes comprise repeated sequence. We divided repeats into four groups based on length: 30–50, 51–100, 101–150, and ≥151 bp (Fig. 4.5). Most repeats are 30–50 bp in size and occur across all 15 genomes, whereas long repeats (≥151 bp) are more prevalent in the highly rearranged genomes of
Trifolium and Pelargonium. Repeats in Trifolium are not evenly distributed across the genome, and most repeats are located within A + T-rich intergenic regions, as well as at rearrangement endpoints (Fig. 4.3).
Nine genes (psbT, rbcL, clpP, rps3, rpl23, atpB, psbN, ycf3, and trnI-cau) have also been duplicated partially or completely in Trifolium, two of which are not merely duplicated but are present with multiple nonfunctional copies. First, three degenerate copies of the photosynthetic gene psbN are found in Trifolium. Two of these pseudogenes are complete, but with single-base- pair insertions resulting in a frame shift. The two nonfunctional copies are located within repeats differing by a 7-bp deletion and are flanked by two unique DNA fragments. One of the nonfunctional copies of psbN is inserted into part of the highly conserved S10 operon (cluster 6 in Fig. 4.1).
Repeats of the ribosomal protein gene rpl23 result in some of the most unusual structural features of the Trifolium plastid genome. This family of dispersed repeats was previously identified by Milligan et al. [134]. Analyses of the complete sequence of Trifolium show that portions of rpl23 are repeated six times across the genome, another genomic anomaly not readily explicable through inversions. There is one intact, full-length copy of rpl23 and six nonfunctional, partial copies, with these six partial copies located within dispersed repeats ranging in size from 88 to 2,399 bp (Table 4.2). Repeats 2–5 are flanked by DNA that is not found in any other sequenced plastid genome.
34
Transposable Elements
BLAST results showed no significant sequence similarity between Trifolium subterraneum and
TEs in the Plant Repeat Database. Furthermore, repeated sequences of Trifolium do not have characteristic inverted repeats that are known to flank some TEs, suggesting that TEs are not found in Trifolium. Thus, the high incidence of dispersed repeats and the extensive rearrangements of the genome are not likely the result of transposition. In addition, we found no evidence of TEs in the other 131 publicly available plastid genomes based on either BLAST similarity or the presence of flanking inverted repeats.
Unique DNA
BLAST comparisons of each of the six legume plastid genomes against 131 available plastid genomes identified unique DNA (30 bp or longer) in each of the legume genomes (Fig. 4.7).
These comparisons identified 19,551 bp of unique DNA in 160 fragments of the Trifolium plastid genome, ranging in size from 30 to 494 bp. The proportion of unique DNA in Trifolium is 2.5–10 times higher than in any other legume.
BLASTN analyses of the 160 unique fragments of Trifolium resulted in no matches to fragments >50 bp, however, there were numerous hits to small fragments, ranging from 14 to
38 bp (results not shown). In view of the very short length of the fragments in the BLASTN results, we performed BLASTX analyses of these same 160 fragments, which resulted in matches for 24 of the sequences (Table 4.3). In most cases, the BLASTX hits had a very low amino acid sequence identity (25–40% amino acid identity for < 50 amino acids) but 18 hits showed either higher sequence identity (>50%) for a short polypeptide (25–50 amino acids) or
35 lower sequence identity (30–50%) for a longer polypeptide (50–137 amino acids). Several of the most significant BLAST hits are notable in terms of their putative identification. Fragment 98, a
427-bp DNA sequence located between Trifolium gene clusters 3 and 11 (Fig. 4.1), had 147 hits, many of which matched unidentified hypothetical proteins. However, several of these hits had sequence identities ranging from 50% to 65%, and six of the top matches were to batB proteins, which are involved in the amino acid autotransporter system in bacteria [148]. Eight other significant hits (62% sequence identity for 27 amino acids) for the same fragment matched a bacterial chitinase gene. Fragment 104, which is 164 bp long and again occurs between gene clusters 3 and 11, has a 46% to 48% sequence identity to a 45-amino acid segment of the plastid gene ycf1 from Medicago. No intact copy of this gene is present in the Trifolium plastid genome.
4.5 Discussion
The Trifolium subterraneum plastid genome is unusual in several respects compared to other legumes and angiosperms. It is among the most highly rearranged angiosperm plastid genome sequenced to date. Trifolium, like many other papilionoid legume plastid genomes, such as Cicer and Medicago, has lost one copy of the IR. The Trifolium genome is larger than those of other papilionoid legumes having one copy of the IR. The gene order of Trifolium is also rearranged relative to that of the closely related genus Medicago, which is similar to the ancestral angiosperm genome organization [117], except for the presence of a large, 50-kb inversion that occurred early in the divergence of papilionoid legumes [149-151]. The Medicago plastid genome is 124 kb, whereas the Trifolium plastid genome is significantly longer, at 144 kb.
However, despite this increase in genome size, the gene content of Trifolium has decreased. Six genes and two introns are missing relative to the complement of genes in the ancestral angiosperm plastid genome [117]. In terms of gene loss, Passiflora [124] and Trifolium have the
36 highest number of losses among photosynthetic angiosperms. Passiflora is missing a total of eight genes and four of the six gene losses in Trifolium are shared with Passiflora (accD, infA, rps18, and ycf1). Notably, accD is also partially or completely missing in several lineages, including grasses [152], Acorus [128], Lobeliaceae [153], Campanulaceae [118,135,154],
Oleaceae [116], and Pelargonium [115].
Repetitive DNA composed of active or inactive TEs is a major component of plant nuclear genomes, comprising up to 50–60% of maize [155-156] and 70% of barley [157], and
TEs contribute significantly to nuclear genome size variation [158]. Additionally, TEs can facilitate genome rearrangements, including inversions, duplications, or deletions of DNA. Aside from an inactive TE, the ―Wendy‖ element in the plastid genome of the alga Chlamydomonas
[136], TEs have not been identified in plastid genomes. Milligan et al. [134] proposed that TEs may contribute to the extent of repetitive DNA and to rearrangement in Trifolium. Our results suggest that repeats in Trifolium are not the product of TEs; neither BLAST searches of the Plant
Repeat Database nor searches for flanking inverted repeats, characteristic of some elements, indicated the presence of TEs in the plastid genome of Trifolium.
Milligan et al. [134] also noted the unprecedented extent of unique DNA in the Trifolium plastid genome. All of the sequenced legume plastid genomes contain some unique DNA
(Fig. 4.7) but the origin of this DNA is unclear. There are four possible explanations for the origin of this unique DNA in Trifolium. First, much of this DNA could simply represent noncoding plastid DNA that does not have any matches among publicly available plastid genome sequences, including five other completely sequenced legume plastid genomes. Although it is difficult to disprove that the unique DNA is merely noncoding plastid DNA, one would expect noncoding plastid DNA from Trifolium to yield some BLAST matches with the closely related legumes Cicer and Medicago [145]. Furthermore, this explanation immediately raises further
37 questions as to how Trifolium came to contain such a great abundance of noncoding plastid DNA.
Second, some of this unique DNA could represent pseudogenes from the six genes that have been lost from Trifolium. The BLASTX results for unique fragment 104 provides some support for this explanation because this fragment matches a 45-amino acid portion of ycf1 from the
Medicago plastid genome (Table 3). In Trifolium, this sequence is located adjacent to trnN-GUU, the location of ycf1 in unrearranged angiosperm plastid genomes. Third, the unique DNA could represent intracellular transfer of DNA into the plastid from the Trifolium mitochondrial and nuclear genomes. BLASTX comparisons (Table 3) did not identify any matches to sequences from either of these genomes. However, the absence of BLAST hits could be due to the paucity of mitochondrial and nuclear genome sequences for legumes, although there are considerable nuclear data available for Medicago truncatula. Fourth, the unique DNA could have originated via horizontal transfer from other organisms. Horizontal gene transfers into the plastid genome are extremely rare, with only two instances documented, both of which are ancient [159]. Our
BLASTX comparisons did identify several strong matches to bacterial genes, including the batB amino acid transport genes from proteobacteria. These genes are involved in the autotransporter secretion pathway in gram-negative bacteria including animal and plant pathogens. Although it is highly speculative at this time, it is possible that some portion of the unique DNA in the
Trifolium plastid genome is derived from horizontal transfer from bacterial plant pathogens.
Additional investigations into the origin of the unique DNA in the plastid genome of Trifolium and other legumes are needed to clarify the origin of these sequences.
In summary, the organization of the plastid genome of Trifolium is unusual relative to that of other angiosperms. Compared with closely related Medicago, Trifolium shows a highly accelerated rate of genomic rearrangements, with 14–18 inversions, six gene losses, and two intron losses. In addition, the genome contains a large number of repetitive sequences and unique
38
DNA of uncertain origin. Some repeats are associated with the endpoints of rearrangements, and the unique DNA likely represents recently derived segments of the plastid genome, highly divergent remnants of former genes, intracellular transfers from the mitochondrion or nucleus, or horizontal transfers from other genomes, possibly pathogenic bacteria. The Trifolium plastid genome is an excellent model system for examining mechanisms of rearrangements and the evolution of repeats and unique DNA.
39
Tables and Figures
Table 3.1 Comparison of major features of magnoliid plastid genomes Calycanthus Drimys Piper Size (bp) 153,337 160,604 160,624 LSC length (bp) 86,948 88,685 87,668 SSC length (bp) 19,799 18,621 18,878 IR length (bp) 23,295 26,649 27,039 Number of genes 133 (115) 131 (113) 130 (113) Number of gene duplicated in IR 18 18 17 Number of genes with introns (with 2 introns) 18 (3) 18 (3) 18 (3)
Table 3.2 Taxa included in phylogenetic analyses with GenBank accession numbers and references. GenBank Taxon Reference Accession Numbers Gymnosperms – Outgroups Pinus thunbergii NC_001631 Wakasugi et al. 1994 [102] Ginkgo biloba DQ069337-DQ069702 Leebens-Mack et al 2005 [36]
Basal Angiosperms Amborella trichopoda NC_005086 Goremykin et al. 2003 [27] Nuphar advena DQ069337-DQ069702 Leebens-Mack et al 2005 [36] Nymphaea alba NC_006050 Goremykin et al. 2004 [28]
Magnoliids
40
GenBank Taxon Reference Accession Numbers Calycanthus floridus NC_004993 Goremykin et al. 2003 [42] Drimys granatensis DQ887676 Current study Liriodendron tulipifera NC_008326 Current study Piper coenoclatum DQ887677 Current study
Monocots Acorus americanus DQ069337-DQ069702 Leebens-Mack et al 2005 [36] Oryza sativa NC_001320 Hiratsuka et al. 1989 [103] Phalaenopsis aphrodite NC_007499 Chang et al. 2006 [37] Saccharum officinarum NC_006084 Asano et al. 2004 [104] Triticum aestivum NC_002762 Ikeo and Ogihara, unpublished Typha latifolia DQ069337-DQ069702 Leebens-Mack et al 2005 [36] Yucca schidigera DQ069337-DQ069702 Leebens-Mack et al 2005 [36] Zea mays NC_001666 Maier et al. 1995 [60]
Eudicots Arabidopsis thaliana NC_000932 Sato et al. 1999 [61] Atropa belladonna NC_004561 Schmitz-Linneweber et al. 2002 [105] Citrus sinensis NC_008334 Bausher et al. 2006 [106] Cucumis sativus NC_007144 Plader et al. unpublished Eucalyptus globulus NC_008115 Steane 2005 [57] Glycine max NC_007942 Saski et al. 2005 [107] Gossypium hirsutum NC_007944 Lee et al. 2006 [108] Lotus corniculatus NC_002694 Kato et al. 2000 [62] Medicago truncatula NC_003119 Lin et al., unpublished Nicotiana tabacum NC_001879 Shinozaki et al. 1986 [109]
41
GenBank Taxon Reference Accession Numbers Oenothera elata NC_002693 Hupfer et al. 2000 [110] Panax schinseng NC_006290 Kim and Lee 2004 [63] Populus trichocarpa NC_008235 unpublished Ranunculus macranthus DQ069337-DQ069702 Leebens-Mack et al 2005 [36] Solanum lycopersicum DQ347959 Daniell et al. 2006 [65] Solanum bulbocastanum NC_007943 Daniell et al. 2006 [65] Spinacia oleracea NC_002202 Schmitz-Linneweber et al. 2001 [56] Vitis vinifera NC_007957 Jansen et al. 2006 [85]
Table 4.1 Comparison of major features of legume plastid genomes ______Trifolium Medicago Cicer Lotus Glycine Phaseolus ______Size with one IR (bp) 144,763 124,033 125,319 125,363 126,644 123,863 IR length (bp) N/A N/A N/A 25,156 25,574 26,422 Number of protein- 77 75 77 82 83 83 coding genes GC content 34% 34% 34% 36% 35% 35% Number of tRNA 30 31 30 30 30 30 genes Number of rRNA 4 4 4 4 4 4 genes
42
Table 4.2
The rpl23 family of dispersed repeats in the Trifolium genome
Coordinates rpl23 length/repeat length (bp) rpl23 128945–129229 285/N/A
Repeat 1 128978–129060 88/88
Repeat 2 37063–37849 88/786
Repeat 3 82539–82794 88/256
Repeat 4 68295–70693 88/2398
Repeat 5 74054–74309 88/256
Repeat 6 126756–126856 101/101
Note: N/A not applicable
43
Table 4.3. Results of BLASTX comparisons for 160 unique DNA sequence fragments from the Trifolium plastid genome. Fragment Length Trifolium Number Amino Number Top match, organism number (bp) genome of acid of coordinates BLAST sequence amino hits identity acids (%) 1 453 1971-2423 2 34, 37 45, 66 hypothetical protein, Plasmodium 15 482 13932- 7 26-32 40-113 atpase, 14413 Sarrcharophagus 21 428 15429- 3 29-48 35-58 glycosyl hydrolase, 15856 Tyrpanosome 33 487 36444- 9 27-44 28-109 hypothetical protein, 36930 Tetraodon 65 220 68861- 2 37, 40 35, 49 hypothetical protein, 69080 Drosophila 66 265 69100- 5 26-36 44-84 flagellar inner arm 69364 dynein 1 heavy chain beta, Chlamydomonas 67 246 69389- 1 39 41 hypothetical protein, 69634 Paramecium 69 426 69846- 3 35-36 58-64 hypothetical protein, 70271 Physcomitrella 77 427 71406- 3 23-52 25-65 rmlD substrate binding 71832 domain, Beggiatoa 82 473 73439- 6 34-42 37-79 cytochrome P450, 73911 Fusarum 87 410 74659- 3 33-45 35-53 iron-containing alcohol 75068 dehydrogenase, Acidobacteria 91 455 81942- 5 27-44 38-92 hypothetical protein, 82396 Tetraodon 98 427 83483- 147 50-60; 30-36; batB protein, 83909 62; 27; Lentisphaera;
44
65; 26; chitinase glycosyl 52; 25; hydrolase, Francisella; 36; 38; von Willebrand factor 36; 72; type A, Akkermansia 37; 43; rm1D substrate binding 36-43; 32-33; domain, Beggiatoa; 40; 45; chloroplast ycf2, 29; 57; Nephroselmis; 41; 46; Adenylate & Guanylate 32; 97; cyclase, Tetrahymena; 38; 42; Serine-threonine 41; 34; phosphatase, 30; 72; Cryptosporidium; 26; 83; 5' nucleotidase, 40; 35; Glossina; 24-46 25-137 protein kinase, Entamoeba; viral A-type inclusion protein, Trichomonas; benzoate dioigenase, uncultured bacterium; putative sigma K, Clostridium; ABC transporter G protein, Dictyostelium; glycine - tRNA ligase, Plasmodium; outer membrane porin, Borrelia; calcium channel T type alpha 1G, Bos; dynein, Plasmodium; hypothetical proteins, Dictyostelium, Danio, Legionella,
45
Cryptosporidium, Plasmodium, Trichomonas, Tribolium, Tetrahymena, Hydrogenobaculum, Nanoarchaeum, Mycoplasma, Franciesella, Debaryomyces, Neurospora, Vanderwaltozyma 103 244 85152- 2 34, 35 49, 60 hypothetical proteins, 85395 Frankia, Paramecium 104 164 87373- 2 46, 48 45 chloroplast ycf1, 87536 Medicago 120 204 104304- 1 37 45 hypothetical protein, 104507 Ostreococcus 121 315 104531- 4 25-48 50-92 hypothetical protein, 104845 Plasmodium, Clostridium, Paramecium, Microcystis 127 494 114977- 2 29 102 hypothetical protein, 115470 Pichia 131 469 124377 - 1 27 80 histidine kinase, 124845 Bacteroides 140 374 126190 - 1 31 44 hypothetical protein, 126563 Psychrobacter 150 426 139170 - 3 35-36 58-64 hypothetical protein, 139595 Physcimitrella, Plasmodium 152 246 139806 - 1 39 41 hypothetical protein, 140051 Paramecium 153 265 140076 - 5 26-36 44, 84 dynein,
46
140340 Chlamydomonas, Ostreococcus, Tetrahymena 154 220 140360 2 37-40 35, 49 hypothetical protein, 140579 Drosophila, Tetrahymena
47
Figure 2.1. Main MSWAT window with instructions and login information.
48
Figure 2.2. MSWAT amino acid alignment window showing buttons for various tasks including conversion to nucleotide sequences and adding new genomes.
49
Figure 2.3. MSWAT nucleotide sequence alignment window showing buttons for exporting alignments in various formats.
50
Figure 2.4. MSWAT nucleotide sequence alignment exported in NEXUS format.
Figure 3.1. Gene map of the Drimys granatensis plastid genome. The thick lines indicate the extent of the inverted repeats (IRa and IRb), which separate the genome into small (SSC) and large (LSC) single copy regions. Genes on the outside of the map are transcribed in the
51 clockwise direction and genes on the inside of the map are transcribed in the counterclockwise direction.
52
Figure 3.2. Gene map of the Piper coenoclatum plastid genome. The thick lines indicate the extent of the inverted repeats (IRa and IRb), which separate the genome into small (SSC) and large (LSC) single copy regions. Genes on the outside of the map are transcribed in the clockwise direction and genes on the inside of the map are transcribed in the counterclockwise direction.
53
Figure 3.3. Histogram of GC content for 34 seed plant plastid genomes, including the gymnosperm Pinus and 33 angiosperms (see Table 2 for list of genomes). Taxa are arranged phylogenetically following tree in Fig. 8. A. Overall GC content of complete genomes. B. GC content for 66 protein-coding genes, including average value for all codon positions, followed by values for the 1st, 2nd, and 3rd codon positions, respectively.
54
Figure 3.4. Graphs of GC content plotted over the entire plastid genomes of Drimys and Piper.
X axis represents the proportion of GC content between 0 and 1 and the Y axis gives the coordinates in kb for the genomes. Coding and non-coding regions are indicated in blue and red, respectively. The green dashed line indicates that average GC content for the entire genome.
55
Figure 3.5. Histogram of GC content for photosynthetic and genetic system genes for 34 seed plant plastid genomes (see Table 2 for list of genomes). Taxa are arranged phylogenetically following tree in Fig. 8. GC content includes average value for all codon positions, followed by values for the 1st, 2nd, and 3rd codon positions, respectively. A. GC content for 33 photosynthetic genes. B. GC content for 22 genetic system genes.
56
Figure 3.6. Histogram of GC content for NADH and rRNA genes for 34 seed plant plastid genomes (see Table 2 for list of genomes). Taxa are arranged phylogenetically following tree in
Fig. 8. A. GC content for 11 NADH genes, which includes average value for all codon positions, followed by values for the 1st, 2nd, and 3rd codon positions, respectively. B. GC content for four rRNA genes.
57
Figure 3.7. Graphs of GC content plotted over the entire plastid genomes of 34 seed plants. The graphs are organized by genomes with the same gene order and by clade in the phylogenetic trees in Figures 8 and 9. X axis represents the proportion of GC content between 0 and 1 and the
Y axis gives the coordinates in kb for the genomes.
58
Figure 3.8. Phylogenetic tree of 35-taxon data set based on 61 plastid protein-coding genes using maximum parsimony. The tree has a length of 61,095, a consistency index of 0.41 (excluding uninformative characters) and a retention index of 0.57. Numbers at each node are bootstrap support values. Numbers above node indicate number of changes along each branch and numbers below nodes are bootstrap support values. Ordinal and higher level group names follow APG II
[93]. Taxa in red are the three new genomes reported in this paper.
59
Figure3.9. Phylogenetic tree of 35-taxon data set based on 61 plastid protein-coding genes using maximum likelihood. The single ML tree has an ML value of – lnL = 342478.92. Numbers at nodes are bootstrap support values ≥ 50%. Scale at base of tree indicates the number of base substitutions. Ordinal and higher level group names follow APG II [93]. Taxa in red are the three new genomes reported in this paper.
60
Figure. 4.1. Gene map of the Trifolium subterraneum plastid genome. Genes on the outside of the map are transcribed in the clockwise direction and genes on the inside of the map are transcribed in the counterclockwise direction. Boxed areas indicate the 18 clusters of genes (see
Figure. 4.2).
61
Figure. 4.2. Comparison of the gene order and orientation of plastid genomes between
Medicago (upper) and Trifolium (lower). The arrows refer the clusters of genes as numbered in
Fig. 1, with the red arrows indicating an inverted orientation. The scale bar at the bottom shows the positions of the clusters of genes on the genomes.
62
Figure. 4.3. Distribution of genes and repeats on the Trifolium plastid genome. Coordinates indicate positions in the genome. The green line represents genes and the red line represents repeats. The vertical dashed lines indicate the boundaries of clusters of genes as defined in
Figure. 1, with the genes at the boundaries labeled.
63
Figure. 4.4. Comparison of the percentage of repeats in the plastid genomes of 15 angiosperms.
64
Figure. 4.5. Comparison of the number and sizes of repeats in the plastid genomes of 15 angiosperms.
65
Figure. 4.6. The rpl23 family of dispersed repeats with their position in the Trifolium genome.
The top panel is the full-length copy of rpl23, and the other six panels are repeats 1, 2, 3, 4, 5, 6, respectively. The same colors represent identical sequences. Coordinates indicate location of repeats in the Trifolium genome.
66
Figure. 4.7. Comparison of unique DNA among six legume plastid genomes.
67
Glossary
ClustalW: A general purpose multiple sequence alignment program for DNA or protein sequences.
MUSCLE: Multiple sequence alignment program for DNA or protein sequences using log-expectation.
PAUP: Phylogenetic Analysis Using Parsimony (and other methods).
Mesquite: A modular system for evolutionary analysis.
MacClade: a tool for phylogenetic analysis.
PAML: a package of programs for phylogenetic analyses of DNA or protein sequences using maximum likelihood.
NEXUS: a file format used for phylogenetic analyses.
SQUINT: an advanced DNA alignment tool.
IR: inverted repeat
SSC: small single copy
LSC: large single copy bp: base pair ycf: hypothetical chloroplast reading frame rrn: ribosomal RNA
MP: maximum parsimony
ML: maximum likelihood
68
References
1. Rokas A, Williams BL, King N., Carroll SB: Genome-scale approaches to
resolving incongruence in molecular phylogenies. Nature 2003 425:798-804.
2. Leebens-Mack J. Raubeson LA, Cui L-Y, Kuehl JV, Fourcade MH, Chumley
TW, Boore JL, Jansen RK, dePamphilis CW: Identifying the basal angiosperm
node in chloroplast genome phylogenies: Sampling one's way out of the
Felsenstein zone. Mol Biol Evol 2005, 22:1948-1963.
3. Jansen RK, Cai Z, Raubeson LA, Daniell H, dePamphilis CW, Leebens-Mack J,
Müller KF, Guisinger-Bellian M, Haberle RC., Hansen AK, Chumley TW, Lee,
S-B, Peery R, McNeal J, Kuehl JV, Boore JL: Analysis of 81 Genes from 64
Plastid Genomes Resolves Relationships in Angiosperms and Identifies
Genome-Scale Evolutionary Patterns. Proc Natl Acad Sci USA, 2007,
104:19369-19374.
4. Moore MJ, Bell CD, Soltis PS, Soltis DE: Using plastid genome-scale data to
resolve enigmatic relationships among basal angiosperms. Proc Natl Acad Sci
USA, 2007, 104:19363-19368.
5. Moore MJ, Dhingra A, Soltis P, Shaw R, Farmerie WG, Folta KM, Soltis DE:
Rapid and accurate pyrosequencing of angiosperm plastid genomes. BMC Plant
Biol 2006, 6:17.
6. Goremykin VV, Hirsch-Ernst KI, Wolfl S, Hellwig FH: Analysis of the
Amborella trichopoda chloroplast genome sequence suggests that Amborella is
not a basal angiosperm. Mol Biol Evol 2003, 20:1499-1505.
7. Goremykin VV, Holland B, Hirsch-Ernst KI, Hellwig FH: Analysis of Acorus
69
calamus chloroplast genome and its phylogenetic implications. Mol Biol Evol
2005, 22:1813-1822.
8. Cai Z. Penaflor C, Kuehl JV, Leebens-Mack J, Carlson JE, dePamphilis CW,
Boore JL, Jansen RK: Complete plastid genome sequences of Drimys,
Liriodendron, and Piper: Implications for the phylogenetic relationships of
magnoliids. BMC Evol Biol 2006, 6:77.
9. Jansen RK, Kaittanis C, Saski C, Lee SB, Tomkins J, Alverson AJ, Daniell H:
Phylogenetic analyses of Vitis (Vitaceae) based on complete chloroplast genome
sequences: Effects of taxon sampling and phylogenetic methods on resolving
relationships among rosids. BMC Evol Biol 2006, 6:32.
10. Chang C-C, Lin H-C, Lin I-P, Chow T-Y, Chen H-H, Chen W-H, Cheng C-H,
Lin C-Y, Liu S-M, Chang C-C, Chaw S.-M: The chloroplast genome of
Phalaenopsis aphrodite (Orchidaceae): Comparative analysis of evolutionary
rate with that of grasses and its phylogenetic implications. Mol Biol Evol 2006,
23:279-91.
11. Hansen DR, Dastidar SG, Cai Z, Penaflor C, Kuehl JV, Boore, JL, Jansen RK:
Phylogenetic and evolutionary implications of complete chloroplast genome
sequences of four early diverging angiosperms: Buxus (Buxaceae), Chloranthus
(Chloranthaceae), Dioscorea (Dioscoreaceae), and Illicium (Schisandraceae).
Mol Phylogen Evol 2007, 45:547-563.
12. Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time
and space complexity. BMC Bioinformatics 2004, 5:113.
13. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: Improving the sensitivity
of progressive multiple sequence alignment through sequence weighting,
70
position specific gap penalties and weight matrix choice. Nucleic Acids Res 1994,
22:4673-80.
14. Swofford DL: PAUP*: Phylogenetic analysis using parsimony (*and other
methods), ver. 4.0 Sunderland MA: Sinauer Associates; 2003.
15. Maddison WP, Maddison DR: MacClade 4.08. Sunderland MA: Sinauer
Associates; 2005.
16. Maddison WP, Maddison DR: Mesquite Version 1.12.
http://mesquiteproject.org/mesquite/mesquite.html, 2006.
17. Yang Z: PAML: a program package for phylogenetic analysis by maximum
likelihood. Comp Appl BioSci 1997, 13:555-556.
18. Goode MG Rodrigo AG: SQUINT: A multiple alignment program and editor.
Bioinformatics 2007, 23:1553-1555.
19. Darwin C. Letter to J. D. Hooker. In: Darwin F, Seward AC, editor. More letters
of Charles Darwin. Vol. 2. London: John Murray; 1903.
20. Donoghue DJ, Mathews S. Duplicate genes and the root of angiosperms, with an
example using phytochrome sequences. Mol Phylogen Evol. 1998;9:489–500.
doi: 10.1006/mpev.1998.0511.
21. Mathews S, Donoghue MJ. The root of angiosperm phylogeny inferred from
duplicate phytochrome genes. Science. 1999;286:947–950. doi:
10.1126/science.286.5441.947.
22. Barkman TJ, Chenery G, McNeal JR, Lyons-Weiler J, Ellisens WJ, Moore G,
Wolfe AD, dePamphilis CW. Independent and combined analyses of sequences
from all three genomic compartments converge on the root of flowering plant
71
phylogeny. Proc Natl Acad Sci USA. 2000;97:13166–13171. doi:
10.1073/pnas.220427497.
23. Doyle JA, Endress PK. Morphological phylogenetic analysis of basal
angiosperms: Comparison and combination with molecular data. Int J Plt Sci.
2000;161:S121–S153. doi: 10.1086/317578.
24. Graham SW, Olmstead RG. Utility of 17 chloroplast genes for inferring the
phylogeny of the basal angiosperms. Am J Bot. 2000;87:1712–1730. doi:
10.2307/2656749.
25. Zanis MJ, Soltis DE, Soltis PS, Mathews S, Donoghue MJ. The root of the
angiosperms revisited. Proc Natl Acad Sci USA. 2002;99:6848–6853. doi:
10.1073/pnas.092136399.
26. Borsch T, Hilu KW, Quandt DV, Wilde Neinhuis C, Barthlott W. Noncoding
plastid trnT-trnF sequences reveal a well-resolved phylogeny of basal
angiosperms. J Evol Biol. 2003;16:558–576. doi: 10.1046/j.1420-
9101.2003.00577.x.
27. Goremykin VV, Hirsch-Ernst KI, Wolfl S, Hellwig FH. Analysis of the
Amborella trichopoda chloroplast genome sequence suggests that Amborella is
not a basal angiosperm. Mol Biol Evol. 2003;20:1499–1505. doi:
10.1093/molbev/msg159.
28. Goremykin VV, Hirsch-Ernst KI, Wolfl S, Hellwig FH. The chloroplast genome
of Nymphaea alba: whole-genome analyses and the problem of identifying the
most basal angiosperm. Mol Biol Evol. 2004;21:1445–1454. doi:
10.1093/molbev/msh147.
72
29. Goremykin VV, Holland B, Hirsch-Ernst KI, Hellwig FH. Analysis of Acorus
calamus chloroplast genome and its phylogenetic implications. Mol Biol Evol.
2005;22:1813–1822. doi: 10.1093/molbev/msi173.
30. Davies TJ, Barraclough TG, Chase MW, Soltis PS, Soltis DE, Savolainen V.
Darwin's abominable mystery: insights from a supertree of the angiosperms.
Proc Natl Acad Sci USA. 2004;101:1904–1909. doi: 10.1073/pnas.0308127100.
31. Soltis DE, Soltis PS. Amborella not a "basal angiosperm"? Not so fast. Amer J
Bot. 2004;91:997–1001.
32. Qiu Y-L, Lee J, Bernasconi-Quadroni F, Soltis DE, Soltis PS, Zanis M, Zimmer
EA, Chen Z, Savolainen V, Chase MW. The earliest angiosperms: evidence from
mitochondrial, plastid and nuclear genomes. Nature. 1999;402:404–407. doi:
10.1038/46536.
33. Qiu Y-L, Lee J, Bernasconi-Quadroni F, Soltis DE, Soltis PS, Zanis M, Zimmer
EA, Chen Z, Savolainen V, Chase MW. Phylogeny of basal angiosperms:
analyses of five genes from three genomes. Int J Plt Sci. 2000; 161:S3–S27. doi:
10.1086/317584.
34. Qiu Y-L, Dombrovska O, Lee J, Li L, Whitlock BA, Bernasconi-Quadroni F,
Rest JS, Davis CC, Borsch T, Hilu KW, Renner SS, Soltis DE, Soltis PS, Zanis
MJ, Cannone JJ, Gutell RR, Powell M, Savolainen V, Chatrou LW, Chase MW.
Phylogenetic analysis of basal angiosperms based on nine plastid, mitochondrial,
and nuclear genes. Int J Plt Sci. 2005; 166:815–842. doi: 10.1086/431800.
35. Qiu Y-L, Li L, Hendry T, Li R, Taylor DW, Issa MJ, Ronen AJ, Vekaria ML,
White AM. Reconstructing the basal angiosperm phylogeny: evaluating
information content of the mitochondrial genes. Taxon. 2006.
73
36. Leebens-Mack J, Raubeson LA, Cui L, Kuehl J, Fourcade M, Chumley T, Boore
JL, Jansen RK, dePamphilis CW. Identifying the basal angiosperms in
chloroplast genome phylogenies: sampling one's way out of the Felsenstein zone.
Mol Biol Evol. 2005; 22:1948–1963. doi: 10.1093/molbev/msi191.
37. Chang C-C, Lin H-C, Lin I-P, Chow T-Y, Chen H-H, Chen W-H, Cheng C-H,
Lin C-Y, Liu S-M, Chang C-C, Chaw S-M. The chloroplast genome of
Phalaenopsis aphrodite (Orchidaceae): comparative analysis of evolutionary
rate with that of grasses and its phylogenetic implications. Mol Biol Evol. 2006;
23:279–291. doi: 10.1093/molbev/msj029.
38. Stefanovic S, Rice DW, Palmer JD. Long branch attraction, taxon sampling, and
the earliest angiosperms: Amborella or monocots? BMC Evol Biol. 2004; 4:35.
doi: 10.1186/1471-2148-4-35.
39. Martin W, Deusch O, Stawski N, Grunheit N, Goremykin V. Chloroplast genome
phylogenetics: why we need independent approaches to plant molecular
evolution. Trends Plant Sci. 2005; 10:203–209. doi:
10.1016/j.tplants.2005.03.007.
40. Soltis DE, Albert VA, Savolainen V, Hilu K, Qiu Y-Q, Chase MW, Farris JS,
Stefanović S, Rice DW, Palmer JD, Soltis PS. Genome-scale data, angiosperm
relationships, and 'ending incongruence': a cautionary tale in phylogenetics.
Trends Plant Sci. 2004; 9:477–483. doi: 10.1016/j.tplants.2004.08.008.
41. Lockhart PJ, Penny D. The place of Amborella within the radiation of
angiosperms. Trends Plant Sci. 2005; 10:201–202. doi:
10.1016/j.tplants.2005.03.006.
74
42. Goremykin VV, Hirsch-Ernst KI, Wolfl S, Hellwig FH. The chloroplast genome
of the "basal" angiosperm Calycanthus fertilis – structural and phylogenetic
analyses. Plt Syst Evol. 2003; 242:119–135. doi: 10.1007/s00606-003-0056-4.
43. Soltis DE, Soltis PS, Endress PK, Chase MW. Phylogeny and evolution of
Angiosperms. Sunderland Massachusetts: Sinauer Associates Inc; 2005.
44. Soltis PS, Soltis DE, Chase MW. Angiosperm phylogeny inferred from multiple
genes as a tool for comparative biology. Nature. 1999; 402:402–404. doi:
10.1038/46528.
45. Soltis DE, Soltis PS, Chase MW, Mort ME, Albach DC, Zanis M, Savolainen V,
Hahn WJ, Hoot SB, Fay MF, Axtell M, Swensen SM, Prince LM, Kress WJ,
Nixon KC, Farris JS. Angiosperm phylogeny inferred from 18S rDNA, rbcL,
and atpB sequences. Bot J Linn Soc. 2000; 133:381–461. doi:
10.1006/bojl.2000.0380.
46. Zanis MJ, Soltis PS, Qiu Y-L, Zimmer EA, Soltis DE. Phylogenetic analyses and
perianth evolution in basal angiosperms. Ann Missouri Bot Gard. 2003; 90:129–
150.
47. Hilu KW, Borsch T, Muller K, Soltis DE, Soltis PS, Savolainen V, Chase M,
Powell M, Alice L, Evans R, Sauquet H, Neinhuis C, Slotta T, Rohwer J,
Chatrou L. Inference of angiosperm phylogeny based on matK sequence
information. Amer J Bot. 2003; 90:1758–1776.
48. Kim S, Koh J, Yoo MJ, Kong H, Hu Y, Ma H, Soltis PS, Soltis DE. Expression
of floral MADS-box genes in basal angiosperms: implications for the evolution
of floral regulators. Plant J. 2005; 43:724–744. doi: 10.1111/j.1365-
313X.2005.02487.x.
75
49. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J,
Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV,
Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer ML,
Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM,
Lei M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE, McKenna MP,
Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT,
Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt
KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg
JM. Genome sequencing in microfabricated high-density picolitre reactors.
Nature. 2005; 437:376–380.
50. Shimodaira H, Hasegawa M. Multiple comparisons of log-likelihoods with
applications to phylogenetic inference. Mol Biol Evol. 1999; 16:1114–1116.
51. Raubeson LA, Jansen RK. Chloroplast genomes of plants. In: Henry H, editor.
Diversity and Evolution of Plants-Genotypic and Phenotypic Variation in Higher
Plants. Wallingford: CABI Publishing; 2005. pp. 45–68.
52. Goulding SE, Olmstead RG, Morden CW, Wolfe KH. Ebb and flow of the
chloroplast inverted repeat. Mol Gen Gen. 1996; 252:195–206. doi:
10.1007/BF02173220.
53. Sugiura C, Kobayashi Y, Aoki S, Sugita C, Sugita M. Complete chloroplast
DNA sequence of the moss Physcomitrella patens: evidence for the loss and
relocation of rpoA from the chloroplast to the nucleus. Nucl Acids Res. 2003;
31:5324–5331. doi: 10.1093/nar/gkg726.
54. Palmer JD, Nugent JM, Herbon LA. Unusual structure of geranium chloroplast
DNA – A triple-sized inverted repeat, extensive gene duplications, multiple
76
inversions, and 2 repeat families. Proc Natl Acad Sci USA. 1987; 84:769–773.
doi: 10.1073/pnas.84.3.769.
55. Chumley TW, Palmer JD, Mower JP, Fourcade HM, Caile PJ, Boore JL, Jansen
RK. The complete chloroplast genome sequence of Pelargonium × hortorum:
Organization and evolution of the largest and most highly rearranged chloroplast
genome of land plants. Mol Biol Evol. 23:2175–2190. doi:
10.1093/molbev/msl089.
56. Schmitz-Linneweber C, Maier RM, Alcaraz JP, Cottet A, Herrmann RG, Mache
R. The plastid chromosome of spinach (Spinacia oleracea): complete nucleotide
sequence and gene organization. Plt Mol Biol. 2001; 45:307–315. doi:
10.1023/A:1006478403810.
57. Steane DA. Complete nucleotide sequence of the chloroplast genome from the
Tasmanian blue gum, Eucalyptus globulus (Myrtaceae) DNA Res. 2005; 12:215–
220.
58. Ohtani K, Yamamoto H, Akimitsu K. Sensitivity to Alternaria alternata toxin in
citrus because of altered mitochondrial RNA processing. Proc Natl Acad Sci
USA. 2002; 99:2439–2444. doi: 10.1073/pnas.042448499.
59. Shimada H, Sugiura M. Fine structural features of the chloroplast genome:
comparison of the sequenced chloroplast genomes. Nucl Acids Res. 1991;
19:983–995.
60. Maier RM, Neckermann K, Igloi GL, Kossel H. Complete sequence of the
maize chloroplast genome: gene content, hotspots of divergence and fine tuning
of genetic information by transcript editing. J Mol Biol. 1995; 251:614–628. doi:
10.1006/jmbi.1995.0460.
77
61. Sato S, Nakamura Y, Kaneko T, Asamizu E, Tabata S. Complete structure of the
chloroplast genome of Arabidopsis thaliana. DNA Res. 1999; 6:283–290. doi:
10.1093/dnares/6.5.283.
62. Kato T, Kaneko T, Sato S, Nakamura Y, Tabata S. Complete structure of the
chloroplast genome of a legume, Lotus japonicus. DNA Res. 2000; 7:323–330.
doi: 10.1093/dnares/7.6.323.
63. Kim K-J, Lee H-L. Complete chloroplast genome sequence from Korean
Ginseng (Panax schiseng Nees) and comparative analysis of sequence evolution
among 17 vascular plants. DNA Res. 2004; 11:247–261. doi:
10.1093/dnares/11.4.247.
64. Chaw S-M, Chang C-C, Chen H-L, Li W-H. Dating the monocot-dicot
divergence and the origin of core eudicots using whole chloroplast genomes. J
Mol Evol. 2004; 58:424–441. doi: 10.1007/s00239-003-2564-9.
65. Daniell H, Lee S-B, Grevich J, Saksi C, Quesada-Vargas T, Guda C, Tomkins J,
Jansen RK. Complete chloroplast genome sequences of Solanum bulbocastanum,
Solanum lycopersicum and comparative analyses with other Solanaceae
genomes. Theor Appl Genet. 2006; 112:1503–1518. doi: 10.1007/s00122-006-
0254-x.
66. Liu Q, Xue Q. Comparative studies on codon usage pattern of chloroplasts and
their host nuclear genes in four plant species. J Genet. 2005; 84:55–62.
67. Morton BR. Chloroplast DNA codon use: Evidence for selection at the psbA
locus based on tRNA availability. J Mol Evol. 1993; 37:273–280. doi:
10.1007/BF00175504.
78
68. Morton BR. Codon use and the rate of divergence of land plant chloroplast
genes. Mol Biol Evol. 1994; 11:231–238.
69. Morton BR. Selection on the codon bias of chloroplast and cyanelle genes in
different plant and algal lineages. J Mol Evol. 1998; 46:449–459. doi:
10.1007/PL00006325.
70. Morton BR, Levin JA. The atypical codon usage of the plant psbA gene maybe
the remnant of an ancestral bias. Proc Natl Acad Sci USA. 1997; 94:11434–
11438. doi: 10.1073/pnas.94.21.11434.
71. Wall DP, Herbeck JT. Evolutionary patterns of codon usage in the chloroplast
gene rbcL. J Mol Evol. 2003; 56:673–688. doi: 10.1007/s00239-002-2436-8.
72. Martin PG, Dowd JM. Studies of angiosperm phylogeny using protein sequences.
Ann Misssouri Bot Gard. 1991; 78:296–337. doi: 10.2307/2399564.
73. Hamby RK, Zimmer EA. Ribosomal RNA as a phylogenetic tool in plant
systematics. In: Soltis PS, Soltis DE, Doyle JJ, editor. Molecular Systematics of
Plants. Chapman and Hall; 1992. pp. 50–91.
74. Chase M, Soltis D, Olmstead R, Morgan D, Les D, Mishler B, Duvall M, Price
R, Hills H, Qui Y-L, Kron K, Rettig J, Conti E, Palmer J, Manhart J, Sytsma K,
Michaels H, Kress J, Karol K, Clark D, Hedren M, Gaut B, Jansen R, Kim K-J,
Wimpee C, Smith J, Furnier G, Straus S, Xiang Q-Y, Plunkett G, Soltis P,
Swensen S, Williams S, Gadek P, Quinn C, Equiarte L, Golenberg E, Learn G,
Graham S, Barrett S, Dayanandan S, Albert V. Phylogenetics of seed plants: an
analysis of nucleotide sequences from the plastid gene rbcL. Ann Missouri Bot
Gard. 1993; 80:528–580. doi: 10.2307/2399846.
79
75. Qiu Y-L, Chase MW, Les DH, Parks CR. Molecular phylogenetics of the
Magnoliidae: cladistic analyses of nucleotide sequences of the plastid gene rbcL.
Ann Missouri Bot Gard. 1993; 80:587–606. doi: 10.2307/2399848.
76. Qiu Y-L, Lee J, Whitlock BA, Bernasconi-Quadroni F, Dombrovska O. Was the
ANITA rooting of the angiosperm phylogeny affected by long branch attraction?
Mol Biol Evol. 2001; 18:1745–1753.
77. Soltis DE, Soltis PS, Nickrent DL, Johnson LA, Hahn WJ, Hoot SB, Sweere JA,
Kuzoff RK, Kron KA, Chase M, Swensen SM, Zimmer E, Shaw SM, Gillespie
LJ, Kress WJ, Sytsma K. Angiosperm phylogeny inferred from 18S ribosomal
DNA sequences. Ann Missouri Bot Gard. 1997;84:1–49. doi: 10.2307/2399952.
78. Hoot SB, Magallon S, Crane PR. Phylogeny of basal tricolpates based on three
molecular data sets: atpB, rbcL, and 18S nuclear ribosomal DNA sequences.
Ann Missouri Bot Gard. 1999; 86:1–32. doi: 10.2307/2666215.
79. Mathews S, Donoghue MJ. Basal angiosperm phylogeny inferred from
duplicated phytochromes A and C. Int J Plt Sci. 2000; 161:S41–S55. doi:
10.1086/317582.
80. Parkinson CL, Adams KL, Palmer JD. Multigene analyses identify the three
earliest lineages of extant flowering plants. Curr Biol. 1999; 9:1485–1488. doi:
10.1016/S0960-9822(00)80119-0.
81. Savolainen V, Chase MW, Morton CM, Soltis DE, Bayer C, Fay MF, De Bruijn
A, Sullivan S, Qiu Y-L. Phylogenetics of flowering plants based upon a
combined analysis of plastid atpB and rbcL gene sequences. Syst Biol. 2000;
49:306–362. doi: 10.1080/10635159950173861.
80
82. Aoki S, Uehara K, Imafuku M, Hasebe M, Ito M. Phylogeny and divergence of
basal angiosperms inferred from APETALA3- and PISTILLATA-like MADS-box
genes. J Plt Res. 2004; 117:229–244.
83. Kim ST, Yoo MJ, Albert VA, Farris JS, Soltis PS, Soltis DE. Phylogeny and
diversification of B-function MADS-box genes in angiosperms: Evolutionary
and functional implications of a 260-million-year-old duplication. Amer J Bot.
2004; 91:2102–2118.
84. Löhne C, Borsch T. Molecular evolution and phylogenetic utility of the petD
group II intron: a case study in basal angiosperms. Mol Biol Evol. 2005; 22:317–
332. doi: 10.1093/molbev/msi019.
85. Jansen RK, Kaittanis C, Saski C, Lee SB, Tomkins J, Alverson AJ, Daniell H.
Phylogenetic analyses of Vitis (Vitaceae) based on complete chloroplast genome
sequences: effects of taxon sampling and phylogenetic methods on resolving
relationships among rosids. BMC Evol Biol. 2006; 6:32. doi: 10.1186/1471-
2148-6-32.
86. Nickrent DL, Blarer A, Qiu Y-L, Soltis DE, Soltis PS, Zanis M. Molecular data
place Hydnoraceae with Aristolochiaceae. Amer J Bot. 2002; 89:1809–1817.
87. Palmer JD. Isolation and structural analysis of chloroplast DNA. In: Weissbach
A, Weissbach H, editor. Methods Enzymol. Vol. 118. New York: Academic Press;
1986. pp. 167–186.
88. Jansen RK, Raubeson LA, Boore JL, dePamphilis CW, Chumley TW, Haberle
RC, Wyman SK, Alverson AJ, Peery R, Herman SJ, Fourcade HM, Kuehl JV,
McNeal JR, Leebens-Mack J, Cui L. Methods for obtaining and analyzing
81
chloroplast genome sequences. Meth Enzym. 2005; 395:348–384. doi:
10.1016/S0076-6879(05)95020-9.
89. DOE Joint Genome Institute sequencing protocols
http://www.jgi.doe.gov/sequencing/index.html
90. Ewing B, Green P. Base-calling of automated sequencer traces using phred. II.
Error probabilities. Genome Res. 1998; 8:186–194.
91. Gordon D, Abajian C, Green P. Consed: a graphical tool for sequence finishing.
Genome Res. 1998; 8:195–202.
92. Liang H, Fang EG, Tomkins JP, Luo M, Kudrna D, Arumuganathan K, Zhao S,
Schlarbaum SE, Banks JA, Leebens-Mack JH, dePamphilis CW, Mandoli DF,
Wing RA, Carlson JE. Development of a BAC library for yellow-poplar
(Liriodendron tulipifera) and the identification of genes associated with flower
development and lignin biosynthesis. Tree Genet Genomes. 2006.
93. Wyman SK, Boore JL, Jansen RK. Automatic annotation of organellar genomes
with DOGMA. Bioinformatics. 2004; 20:3252–3255. doi:
10.1093/bioinformatics/bth352. http://bugmaster.jgi-psf.org/dogma/
94. SPSS for Windows, Rel. 9.0. Chicago: SPSS Inc; 2001.
95. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high
throughput. Nucl Acids Res. 2004; 32:1792–1797. doi: 10.1093/nar/gkh340.
96. Cui L, Veeraraghavan N, Richer A, Wall K, Jansen RK, Leebens-Mack J,
Makalowska I, dePamphillis CW. ChloroplastDB: the chloroplast genome
database. Nucl Acids Res. 2006; 34:D692–D696. doi: 10.1093/nar/gkj055.
http://chloroplast.cbio.psu.edu/
82
97. Swofford DL. PAUP*: Phylogenetic analysis using parsimony (* and other
methods), ver 40. Sunderland MA: Sinauer Associates; 2003.
98. Posada D, Crandall KA. MODELTEST: testing the model of DNA substitution.
Bioinformatics. 1998; 14:817–818. doi: 10.1093/bioinformatics/14.9.817.
99. Sullivan J, Abdo Z, Joyce P, Swofford DL. Evaluating the performance of a
successive-approximations approach to parameter optimization in maximum-
likelihood phylogeny estimation. Mol Biol Evol. 2005; 22:1386–1392. doi:
10.1093/molbev/msi129.
100. Felsenstein J. Confidence limits on phylogenies: an approach using the
bootstrap. Evolution. 1985; 39:783–791. doi: 10.2307/2408678.
101. Goldman N, Anderson JP, Rodrigo AG. Likelihood-based tests of topologies in
phylogenetics. Syst Biol. 2000; 49:652–670. doi: 10.1080/106351500750049752.
102. Wakasugi T, Tsudzuki J, Ito S, Nakashima K, Tsudzuki T, Sugiura M. Loss of
all ndh genes as determined by sequencing the entire chloroplast genome of the
black pine Pinus thunbergii. Proc Natl Acad Sci USA. 1994; 91:9794–9798. doi:
10.1073/pnas.91.21.9794.
103. Hiratsuka J, Shimada H, Whittier R, Ishibashi T, Sakamoto M, Mori M, Kondo
C, Honji Y, Sun CR, Meng BY, Li YQ, Kanno A, Nishizawa Y, Hirai A,
Shinozaki K, Sugiura M. The complete sequence of the rice (Oryza sativa)
chloroplast genome: intermolecular recombination between distinct tRNA genes
accounts for a major plastid DNA inversion during the evolution of the cereals.
Mol Gen Genet. 1989; 217:185–194. doi: 10.1007/BF02464880.
104. Asano T, Tsudzuki T, Takahashi S, Shimada H, Kadowaki K. Complete
nucleotide sequence of the sugarcane (Saccharum officinarum) chloroplast
83
genome: a comparative analysis of four monocot chloroplast genomes. DNA Res.
2004; 11:93–99. doi: 10.1093/dnares/11.2.93.
105. Schmitz-Linneweber C, Regel R, Du TG, Hupfer H, Herrmann RG, Maier RM.
The plastid chromosome of Atropa belladonna and its comparison with that of
Nicotiana tabacum: the role of RNA editing in generating divergence in the
process of plant speciation. Mol Biol Evol. 2002; 19:1602–1612.
106. Bausher MG, Singh ND, Lee S, Jansen RK, Daniell H. The complete
chloroplast genome sequence of Citrus sinensis (L.) Osbeck var 'Ridge
Pineapple': organization and phylogenetic relationships to other angiosperms.
BMC Plt Biol. 6:21. doi: 10.1186/1471-2229-6-21.
107. Saski C, Lee S, Daniell H, Wood T, Tomkins J, Kim H-G, Jansen RK. Complete
chloroplast genome sequence of Glycine max and comparative analyses with
other legume genomes. Plt Mol Biol. 2005; 59:309–322. doi: 10.1007/s11103-
005-8882-0.
108. Lee S-B, Kaittanis C, Jansen RK, Hostetler JB, Tallon LJ, Town CD, Daniell H.
The complete chloroplast genome sequence of Gossypium hirsutum:
organization and phylogenetic relationships to other angiosperms. BMC
Genomics. 2006; 7:61. doi: 10.1186/1471-2164-7-61.
109. Shinozaki K, Ohme M, Tanaka M, Wakasugi T, Hayashida N, Matsubayashi T,
Zaita N, Chunwongse J, Obokata J, Yamaguchi-Shinozaki K, Ohto C, Torazawa
K, Meng BY, Sugita M, Deno H, Kamogashira T, Yamada K, Kusuda J, Takaiwa
F, Kato A, Tohdoh N, Shimada H, Sugiura M. The complete nucleotide sequence
of the tobacco chloroplast genome: its gene organization and expression. EMBO
J. 1986; 5:2043–2049.
84
110. Hupfer H, Swaitek M, Hornung S, Herrmann RG, Maier RM, Chiu WL, Sears
B. Complete nucleotide sequence of the Oenothera elata plastid chromosome,
representing plastome 1 of the five distinguishable Euoenthera plastomes. Mol
Gen Genet. 2000; 263:581–585.
111. APG II 2002 An update of the Angiosperm Phylogeny Group classification for
the orders and families of flowering plants: APG II. Bot J Linn Soc. 2003;
141:399–436. doi: 10.1046/j.1095-8339.2003.t01-1-00158.x.
112. Jansen RK, Raubeson LA, Boore JL, dePamphilis CW, Chumley TW, Haberle
RC, Wyman SK, Alverson A, Peery R, Herman SJ, Fourcade HM, Kuehl JV,
McNeal JR, Leebens-Mack J, Cui L (2005) Methods for obtaining and analyzing
whole chloroplast genome sequences Molecular evolution: producing the
biochemical data part B. Methods Enzymol 395:348–384
113. Moore MJ, Dhingra A, Soltis P, Shaw R, Farmerie WG, Folta KM, Soltis DE
(2006) Rapid and accurate pyrosequencing of angiosperm plastid genomes.
BMC Plt Biol 6:17
114. Saski C, Lee S-B, Daniell H, Wood TC, Tomkins J, Kim H-G, Jansen RK
(2005) Complete chloroplast genome sequence of Glycine max and comparative
analyses with other legume genomes. Plant Mol Biol 59:309–322
115. Chumley TW, Palmer JD, Mower JP, Fourcade HM, Calie PJ, Boore JL, Jansen
RK (2006) The complete chloroplast genome sequence of Pelargonium ×
hortorum: organization and evolution of the largest and most highly rearranged
chloroplast genome of land plants. Mol Biol Evol 23:2175–2190
116. Lee H-L, Jansen RK, Chumley TW, Kim K-J (2007) Gene relocations within
chloroplast genomes of Jasminum and Menodora (Oleaceae) are due to multiple,
85
overlapping inversions. Mol Biol Evol 24:1161–1180
117. Raubeson LA, Peery R, Chumley T, Dziubek C, Fourcade HM, Boore JL,
Jansen RK (2007) Comparative chloroplast genomics: Analyses including new
sequences from the angiosperms Nuphar advena and Ranunculus macranthus.
BMC Genomics 8:174
118. Haberle RC, Fourcade HM, Boore JL, Jansen RK (2008) Extensive
rearrangements in the chloroplast genome of Trachelium caeruleum are
associated with repeats and tRNA genes. J Mol Evol 66:350–361
119. Wang R-J, Cheng C-L, Chang C-C, Wu C-L, Su T-M, Chaw SM (2008)
Dynamics and evolution of the inverted repeat-large single copy junctions in the
chloroplast genomes of monocots. BMC Evol Biol 8:36
120. Goremykin VV, Hirsch-Ernst KI, Wolfl S, Hellwig FH (2003) Analysis of the
Amborella trichopoda chloroplast genome sequence suggests that Amborella is
not a basal angiosperm. Mol Biol Evol 20:1499–1505
121. Leebens-Mack J, Raubeson LA, Cui LY, Kuehl JV, Fourcade MH, Chumley
TW, Boore JL, Jansen RK, dePamphilis CW (2005) Identifying the basal
angiosperm node in chloroplast genome phylogenies: sampling one’s way out of
the Felsenstein zone. Mol Biol Evol 22:1948–1963
122. Chang C-C, Lin H-C, Lin I-P, Chow T-Y, Chen H-H, Chen W-H, Cheng C-H,
Lin C-Y, Liu S-M, Chang C-C, Chaw S-M (2006) The chloroplast genome of
Phalaenopsis aphrodite (Orchidaceae): comparative analysis of evolutionary rate
with that of grasses and its phylogenetic implications. Mol Biol Evol 23:279–
291
123. Hansen DR, Dastidar SG, Cai Z, Penaflor C, Kuehl JV, Boore JL, Jansen RK
86
(2007) Phylogenetic and evolutionary implications of complete chloroplast
genome sequences of four early diverging angiosperms: Buxus (Buxaceae),
Chloranthus (Chloranthaceae), Dioscorea (Dioscoreaceae), and Illicium
(Schisandraceae). Mol Phylogenet Evol 45:547–563
124. Jansen RK, Cai Z, Raubeson LA, Daniell H, dePamphilis CW, Leebens-Mack
J, Müller KF, Guisinger-Bellian M, Haberle RC, Hansen AK, Chumley TW, Lee
S-B, Peery R, McNeal J, Kuehl JV, Boore JL (2007) Analysis of 81 genes from
64 plastid genomes resolves relationships in angiosperms and identifies genome-
scale evolutionary patterns. Proc Natl Acad Sci USA 104:19369–19374
125. Moore MJ, Bell CD, Soltis PS, Soltis DE (2007) Using plastid genome-scale
data to resolve enigmatic relationships among basal angiosperms. Proc Natl
Acad Sci USA 104:19363–19368
126. Goremykin VV, Hirsch-Ernst KI, Wolfl S, Hellwig FH (2004) The chloroplast
genome of Nymphaea alba: whole-genome analyses and the problem of
identifying the most basal angiosperm. Mol Biol Evol 21:1445–1454
127. Shinozaki K, Ohme M, Tanaka M, Wakasugi T, Hayashida N, Matsubayashi T,
Zaita N, Chunwongse J, Obokata J, Yamaguchishinozaki K, Ohto C, Torazawa
K, Y. Meng B, Sugita M, Deno H, Kamogashira T, Yamada K, Kusuda J,
Takaiwa F, Kato A, Tohdoh N, Shimada H, Sugiura M (1986) The complete
nucleotide sequence of the tobacco chloroplast genome—its gene organization
and expression. EMBO J 5:2043–2049
128. Goremykin VV, Holland B, Hirsch-Ernst KI, Hellwig FH (2005) Analysis of
Acorus calamus chloroplast genome and its phylogenetic implications. Mol Biol
Evol 22:1813–1822
87
129. Bowman CM, Dyer TA (1986) The location and possible evolutionary
significance of small dispersed repeats in wheat ctDNA. Curr Genet 10:931–941
130. Hiratsuka J, Shimada H, Whittier R, Ishibashi T, Sakamoto M, Mori M,
Kondo C, Honji Y, Sun CR, Meng BY, Li YQ, Kanno A, Nishizawa Y, Hirai A,
Shinozaki K, Sugiura M (1989) The complete sequence of the rice Oryza sativa
chloroplast genome - intermolecular recombination between distinct transfer-
RNA genes accounts for a major plastid DNA inversion during the evolution of
the cereals. Mol Gen Genet 217:185–194
131. Hupfer H, Swiatek M, Hornung S, Herrmann RG, Maier RM, Chiu WL, Sears
B (2000) Complete nucleotide sequence of the Oenothera elata plastid
chromosome, representing plastome I of the five distinguishable Euoenothera
plastomes. Mol Gen Genet 263:581–585
132. Pombert J-F, Otis C, Lemieux C, Turmel M (2005) The chloroplast genome
sequence of the green alga Pseudendoclonium `akinetum (Ulvophyceae) reveals
unusual structural features and new insights into the branching order of
chlorophyte lineages. Mol Biol Evol 22:1903–1918
133. Pombert J-F, Lemieux C, Turmel M (2006) The complete chloroplast DNA
sequence of the green alga Oltmannsiellopsis viridis reveals a distinctive
quadripartite architecture in the chloroplast genome of early diverging
ulvophytes. BMC Biol 4:3
134. Milligan BG, Hampton JN, Palmer JD (1989) Dispersed repeats and structural
reorganization in subclover chloroplast DNA. Mol Biol Evol 6:355–368
135. Cosner ME, Jansen RK, Palmer JD, Downie SR (1997) The highly rearranged
chloroplast genome of Trachelium caeruleum (Campanulaceae): multiple
88
inversions, inverted repeat expansion and contraction, transposition,
insertions/deletions, and several repeat families. Curr Genet 31:419–429
136. Fan WH, Woelfle MA, Mosig G (1995) Two copies of a DNA element, Wendy,
in the chloroplast chromosome of Chlamydomonas reinhardtii between
rearranged gene clusters. Plant Mol Biol 29:63–80
137. Palmer JD (1986) Isolation and structural analysis of chloroplast DNA.
Methods Enzymol 118:167–186
138. Bookjans G, Stummann BM, Henningsen KW (1984) Preparation of
chloroplast DNA from pea plastids isolated in a medium of high ionic strength.
Anal Biochem 141:244–247
139. Ewing B, Green P (1998) Base-calling of automated sequencer traces using
phred II. error probabilities. Genome Res 8:186–194
140. Gordon D, Abajian C, Green P (1998) Consed: a graphical tool for sequence
finishing. Genome Res 8:195–202
141. Wyman SK, Jansen RK, Boore JL (2004) Automatic annotation of organellar
genomes with DOGMA. Bioinform 20:3252–3255
142. McCarthy EM, McDonald JF (2003) LTR STRUC: a novel search and
identification program for LTR retrotransposons. Bioinformatics 19:362–367
143. Tesler G (2002) GRIMM: genome rearrangements web server. Bioinformatics
18:492–493
144. Darling ACE, Mau B, Blattner FR, Perna NT (2004) Mauve: multiple
alignment of conserved genomic sequence with rearrangements. Genet Res
14:1394–1403
145. Wojciechowski MF, Lavin M, Sanderson MJ (2004) A phylogeny of legumes
89
(Leguminosae) based on analysis of the plastid matK gene resolves many well-
supported subclades within the family. Am J Bot 91:1846–1862
146. Doyle JJ, Doyle JL, Palmer JD (1995) Multiple independent losses of two
genes and one intron from legume chloroplast genomes. Syst Bot 20:272–294
147. Millen RS, Olmstead RG, Adams KL, Palmer JD, Lao NT, Heggie L,
Kavanagh TA, Hibberd JM, Gray JC, Morden CW, Caile PJ, Jermiin LS, Wolfe
KH (2001) Many parallel losses of infA from chloroplast DNA during
angiosperm evolution with multiple independent transfers to the nucleus. Plt
Cell 13:645–658
148. Henderson IR, Navarro-Garcia F, Desvaux M, Fernandez RC, Ala’Aldeen D
(2004) Type V protein secretion pathway: the autotransporter story. Micrbiol
Mol Biol Rev 68:692–744
149. Palmer JD, Thompson WF (1981) Rearrangements in the chloroplast genomes
of mung bean and pea. Proc Natl Acad Sci USA 78:5533–5537
150. Doyle JJ, Doyle JL, Palmer JD (1996) The distribution and phylogenetic
significance of a 50-kb chloroplast DNA inversion in the flowering plant family
Leguminosae. Mol Phylogenet Evol 5:429–438
151. Jansen RK, Wojciechowski MF, Sanniyasi E, Lee S-B, Daniell H (2008)
Complete plastid genome sequence of the chickpea (Cicer arietinum) and the
phylogenetic distribution of rps12 and clpP intron losses among legumes
(Fabaceae). Mol Phylogenet Evol 48:1204–1217
152. Katayama H, Ogihara Y (1996) Phylogenetic affinities of the grasses to other
monocots as revealed by molecular analysis of chloroplast DNA. Curr Genet
29:572–581
90
153. Knox EB, Palmer JD (1999) The chloroplast genome arrangement Lobelia
thuliniana Lobeliaceae: expansion of the inverted repeat in an ancestor of the
Campanulales. Plant Sys Evol 214:49–64
154. Cosner ME, Raubeson LA, Jansen RK (2004) Chloroplast DNA
rearrangements in Campanulaceae: phylogenetic utility of highly rearranged
genomes. BMC Evol Biol 4:1–17
155. San Miguel P, Bennetzen JL (1998) Evidence that a recent increase in maize
genome size was caused by the massive amplification of intergene
retrotransposons. Ann Bot 82:37–44
156. Meyers BC, Tingey SV, Morgante M (2001) Abundance, distribution, and
transcriptional activity of repetitive elements in the maize genome. Genome Res
11:1660–1676
157. Vicient CM, Suoniemi A, Anamthawat-Jonsson K, Tanskanen J, Beharav A,
Nevo E, Schulman AH (1999) Retrotransposon BARE–1 and its role in genome
evolution in the genus Hordeum. Plt Cell 11:1769–1784
158. Zhang X, Wessler SR (2004) Genome-wide comparative analysis of the
transposable elements in the related species Arabidopsis thaliana and Brassica
oleracea. Proc Natl Acad Sci USA 101:5589–5594
159. Rice DW, Palmer JD (2006) An exceptional horizontal gene transfer in
plastids: gene replacement by a distant bacterial paralog and evidence that
haptophyte and cryptophyte plastids are sisters. BMC Evol Biol 4:31
91
Vita
Zhengqiu Cai entered Capital Normal University in Beijing in 1997. He received the degrees of Bachelor of Science and Master of Science from Capital Normal University in 2001 and 2004 respectively. From 2004 to 2005, he was employed by Beijing
Genomics Institute as a Bioinformatics developer. In August, 2005, he attended the
Graduate School at The University of Texas at Austin. He received the degree of Master of Science from the department of Computer Science at The University of Texas at
Austin in May, 2010. During the summer of 2010, he worked for Monsanto Company as a Bioinformatics Intern. He has been working for Washington University School of
Medicine in St. Louis as a Bioinformaticist since August, 2010.
Permanent Address:
This dissertation was typed by the author.
92