The Pennsylvania State University

The Graduate School

Department of Biology

THE IMPACT OF GENE DUPLICATION ON MOLECULAR EVOLUTION IN

FLOWERING

A Dissertation in

Biology

by

Jill Marie Duarte

 2008 Jill Marie Duarte

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

August 2008

The dissertation of Jill Marie Duarte was reviewed and approved* by the following:

Claude dePamphilis Professor of Biology Dissertation Advisor Chair of Committee

Naomi Altman Associate Professor of Statistics

Hong Ma Professor of Biology

Kateryna Makova Assistant Professor of Biology

Webb Miller Professor of Biology and Computer Science and Engineering

Douglas Cavener Professor of Biology Head of the Department of Biology

*Signatures are on file in the Graduate School

ii

ABSTRACT

Gene duplication has played an important role in the evolution of life on Earth, and it is particularly important for the evolution of flowering plants, given the prevalence of segmental and whole genome duplication in this lineage. This dissertation explores the link between gene duplication and the evolution of flowering plants by: 1) examining the role of expression divergence in the evolution of regulatory genes; 2) estimating the makeup of the ancestral angiosperm genome; 3) identifying and characterizing shared single copy genes in sequence genomes; and 4) exploring the phylogenetic utility of single copy nuclear orthologs and their relative performance compared to organellar and rRNA DNA. We show that expression divergence is a common feature of duplicated regulatory genes in , but that the divergence patterns are not clearly classified by current theories concerning the evolution of duplicated genes. Results indicate that the ancestral angiosperm genome included a rich diversity of developmental genes that have contributed to angiosperm diversity through subsequent duplication and diversification.

We have also identified a large number of shared single copy genes and have shown their phylogenetic utility in a number of different systems and using different methodologies.

Results from these studies illustrate the added complexity in the evolution of flowering plants that is the result of gene duplication and the need to revise current theoretical frameworks and methodologies in order to account for this complexity.

iii

TABLE OF CONTENTS

LIST OF FIGURES…………………………………………………………………… vi

LIST OF TABLES……………………………………………………………………. vii

ACKNOWLEDGEMENTS……………………………………………………………viii

Chapter 1 The Impact of Gene Duplication on Molecular Evolution……………………1

References……………………………………………………………………….17

Chapter 2 Expression Pattern Shifts Following Duplication Indicative of Subfunctionalization and Neofunctionalization in Regulatory Genes of Arabidopsis…..25

Preface……………………………………………………………………………25 Abstract…………………………………………………………………………..27 Introduction……………………………………………………………………....28 Methods…………………………………………………………………………..33 Results……………………………………………………………………….…...37 Discussion………………………………………………………………….…….43 Acknowledgments………………………………………………………….…….52 Additional Files…………………………………………………………….…….52 References…………………………………………………………………….….53

Chapter 3 Utility of Amborella trichopoda and Nuphar advena ESTs for comparative sequence analysis…………………………………………………………………..…… 57

Preface……………………………………………………………………..……..57 Abstract……………………………………………………………………..……59 Introduction…………………………………………………………………..…..60 Methods……………………………………………………………………….….62 Results…………………………………………………………………………....65 Discussion………………………………………………………………………..70 Acknowledgments………………………………………………………………..74 Additional Files…………………………………………………………………..74 References………………………………………………………………………..75

Chapter 4 Identification of shared single copy nuclear genes in Arabidopsis, Populus, Vitis and Oryza and their phylogenetic utility across various taxonomic levels.………..85

Preface……………………………………………………………………………85 Abstract…………………………………………………………………………..87 Background………………………………………………………………………88 Results……………………………………………………………………………93 Discussion………………………………………………………………………104 Methods…………………………………………………………………………112

iv

Acknowledgments………………………………………………………………117 Additional Files…………………………………………………………………118 References………………………………………………………………………119

Chapter 5 A Marriage of High-throughput Identification of Phylogenetic Markers with High-throughput Sequencing: Next-generation Phylogenomics using 454 Transcriptomes …………………………………………………………………………………………..125

Preface…………………………………………………………………………..125 Abstract…………………………………………………………………………127 Introduction……………………………………………………………………..128 Methods…………………………………………………………………………132 Results…………………………………………………………………………..135 Discussion………………………………………………………………………143 Acknowledgments………………………………………………………………144 Additional Files…………………………………………………………………145 References………………………………………………………………………146

Concluding Remarks……………………………………………………………………150

References………………………………………………………………………154

v

LIST OF FIGURES

Figure 2-1. Fate of duplicated genes and the effect of each fate on expression patterns ………………………………………………………………………………………...…29

Figure 2-2. Range of ANOVA results…………………………………………………...38

Figure 2-3. Inferring ancestral expression patterns in MIKCC gene family in Arabidopsis………………………………………………………………………………41

Figure 2-4. Complete linkage clustering and mapping of complementary expression patterns supports independence of regulatory modules…………………………………50

Figure 3-1. Characterization of the Amborella and Nuphar ESTs and insights into the ancestral angiosperm……………………………………………………………………..67

Figure 3-2. Using single copy nuclear genes for organismal phylogeny…………………69

Figure 4-1. Shared single copy genes in various combinations of angiosperm genomes…94

Figure 4-2. Overrepresentation and underrepresentation of shared single copy genes in select GO categories……………………………………………………………………...97

Figure 4-3. Single copy nuclear genes improve phylogenetic resolution in the ………………………………………………………………………………99

Figure 4-4. Angiosperm phylogeny using ESTs for 13 shared single copy genes………102

Figure 5-1. Slim GO Annotation of the Brassica oleracea leaf transcriptome………….136

Figure 5-2. Slim GO Annotation of the Medicago truncatula leaf transcriptome……….137

Figure 5-3. Sequence coverage of orthologous single copy genes in Brassica and Medicago…………………………………………………………………………….….139

Figure 5-4. 6-taxon phylogenetic trees from different datasets and phylogenetic methodologies…………………………………………………………………….…..…142

vi

LIST OF TABLES

Table 3-1. Reproductive development genes present in the Amborella and Nuphar Unigenes………………………………………………………………………………….81

Table 4-1. Shared single copy nuclear genes are present throughout plant lineages………95

Table 4-2. Amplification of shared single copy nuclear genes in Brassicaceae…………...98

Table 4-3. Shared single copy nuclear genes are a rich source of phylogenetic information…………………………………………………………………………...…100

Table 5-1. Coverage of orthologous single copy genes in the Brassica and Medicago 454 transcriptome sequencing………………………………………………………………..140

Table 5-2. Summary of alignments used for phylogenetic analysis………………….….141

vii

ACKNOWLEDGMENTS

I have been privilieged over the past five years to work with an outstanding group of colleagues who have taught me more about botany, phylogenetics, genomics, statistics, molecular evolution, etc than they will ever know. I would like to first acknowledge my indebtedness to my thesis advisor, Claude dePamphilis, and the rest of my thesis committee, Naomi Altman, Hong Ma, Kateryna Makova and Webb Miller, for their continued support, insight and advice. I would also like to acknowledge all the members of the Biology Department at Penn State who have provided a welcoming academic community to call home for the past 5 years. An invaluable resource and sounding board for my research has been the Institute for Molecular Evolutionary Genetics and the IMEG seminar series each semester. In making me a better educator and guiding me through the trials and tribulations of teaching undergraduate students, I would like to thank all the individuals involved in the BIOL220W for the past four years, especially Dianne Burpee,

Jim Minesky and Carla Hass. I would also like to thank the members of the dePamphilis lab, extant and extinct, especially Lena Landherr, Kerr Wall, Paula Ralph, Nadira Naznin,

Yan Zhang, Barbara Bliss, Yuannian Jiao, Joel McNeal, Kevin Beckmann, Ali Barakat,

Laura Zahn, and Jim Leebens-Mack. I would also like to thank our collaborators from

FGP, including, but not limited to Pam and Doug Soltis, Matyas Buzgo, Vic Albert, and

Peter Endress. J. Chris Pires and Patrick Edger from the University of Missouri-

Columbia are to be thanked for their collaboration and intense discussions on the relative importance of selection in the evolution of duplicated genes. I would also like to acknowledge Susan “Doc” Behel for her role in first introducing me to the wonders of biology and the research community and Charles Delwiche for his role as my

viii

undergraduate research advisor and providing me a solid foundation in phylogenetics, genomics, and molecular evolution. There are many more people that deserve my thanks and appreciation for their involvement in my education.

Without the financial contributions from the following sources, my research and graduate school experience would have been severely lacking in monies: a Braddock fellowship, the Mohnkern scholarship and teaching assistantships from the Department of

Biology; a University fellowship from The Pennsylvania State University; travel grants from the Department of Biology, the MORPH network and IMEG; research support from the Botanical Society of America Genetics Section and a NSF DDIG, as well as research assistantships and research support from the Floral Genome Project and the Ancestral

Angiosperm Genome Project.

Although their expertise in biology is questionable, I must acknowledge the support of my family, both the Rickers and the Duartes, and my friends, throughout the past five years, even if they rarely had a notion of what I was really up to. Last, but far from least, I would like to thank and acknowledge my husband, Nicolas Duarte, for his unwavering support and encouragement throughout the past five years. We’ve both learned a lot from each other – I figured out reversible temperature-dependent resistance regimes in nickel oxides and he reinvented shotgun sequencing.

ix Chapter 1

The Impact of Gene Duplication on Molecular Evolution

…duplications enable genes to make evolutionary experiments which have previously been forbidden, liberating them from incessant natural selection whose overwhelming activity is eliminating variants. Kimura [1]

Gene Duplication and Evolution

The contents of a genome are not static over time. Through a variety of different processes, genes and other elements of genomes evolve. While mutations have held the interest of both geneticists and evolutionary biologists for over a hundred years, the role of gene duplication has been gradually been acknowledged, defined, and studied with significant attention first brought to the phenomenon by Susumu Ohno’s book “Evolution by Gene Duplication”, first published in 1970 [2]. In the absence of duplication, there would not be the diversity of genes present in extant organisms and the complexity of molecular evolution would be simplified. Duplication provides novel material for evolution by changing the ‘rules’ by which genes evolve. It can separate functions, be permissive of the evolution of novel functions, or even permit redundancy in the genetic toolbox.

In order to clearly discuss genes in reference to their evolutionary history regarding duplication, it is important to classify homologous (those derived from a common ancestor) genes in terms of their relationship to each other via speciation and/or duplication. Orthologs are defined as homologous genes that are differentiated by their inclusion in two or more separate and originate through a speciation event [3].

Paralogs are defined as homologous genes that originate through a duplication event (i.e.,

1 they reside at two distinct loci within a single genome) [3]. Co-orthologs are defined as homologous genes in two or more species for which paralogs are present in at least one of the species [3]. In other words, orthologous genes exist in a strict 1:1 relationship when multiple organisms are considered whereas co-orthologous genes may exist in a 1:many or many:many relationship when multiple organisms are considered, depending on the prevalence of paralogs within each individual genome. For further clarification, the terms inparalog and outparalog can be used to specify different types of paralogs.

Inparalogs are those genes in a species that have expanded through a series of lineage specific duplications that happened after the most recent common ancestor whereas outparalogs are those paralogs that originated via duplications prior to the most recent common ancestor [3].

There have a variety of opinions concerning the relevance of gene duplication to the evolutionary process. It has been estimated by analysis of sequenced genomes that the rate of individual gene duplication occurs at approximately the same rate as the nucleotide mutation rate – indicating that gene duplication produces as many genetic opportunities for evolution as point mutations [4]. In the case of a comparison between the human and chimpanzee genomes, it was estimated that gene duplication contributed

4.4 megabases of DNA to each of those genomes per million years and that large segmental duplications are responsible for more differences between humans and chimpanzees than point substitutions [5]. In plants, it is estimated that polyploidy, which duplicates all genes in the genome, is responsible for 2-4% of speciation events in flowering plants and 7% in ferns, with allotetraploid plants formed at a rate of approximately 10-5 per individual per generation [6].

2 Mechanisms for gene duplication

In order to understand the evolution of genes after duplication, and how evolutionary forces and pathways may vary according to mechanism, it is important to understand the multiple manners in which a gene may become duplicated and the molecular characteristics of the different modes of duplication. There are multiple ways for a gene to become duplicated within a nuclear genome. Duplications are typically classified by the size of the region duplicated, as different mechanisms act on different size regimes in the genome and the mechanism by which a pair of duplicate genes was generated can be inferred by the genomic context [7-11]. As such, duplications are classified as tandem, retrotransposition, segmental, or whole genome duplication.

Tandem duplications typically result from unequal crossing-over events during meiosis.

This results in segments of DNA being duplicated in a consecutive array, but maintenance of the entire gene and/or its promoter regions is not guaranteed.

Retrotransposition results from the integration of cDNA that has resulted from reverse- transcription of a processed mRNA back into a chromosomal segment. As a result of this duplication process, retrotransposed duplications are typically devoid of promoter regions and introns and typically result in pseudogenes. Segmental duplications typically result from translocations, in which large chromosomal fragments break from their original location and reattach elsewhere in the genome. The complete mechanism for segmental duplication is not entirely understood although it plays a significant role in animal lineages, accounting for almost half of the difference in the genomes of human and chimpanzee [5]. Whole genome duplication results from either allopolyploidy or autopolyploidy. Allopolyploidy refers to an event in which the polyploidy is of hybrid

3 origin [12]. Autopolyploidy refers to an event in which the genome duplication consists of genomes originating from within a single species. There are multiple molecular mechanisms that can result in polyploidy including genomic doubling, gametic nonreduction, and polyspermy. Failure of cell division in mitotis results in genome doubling, whereas a failure of cell division in meiosis results in gametic nonreduction.

Polyspermy results from fertilization of an egg by two or more sperm. There are immediate changes to a genome after polyploidization including CpG methylation and interchromosomal rearrangements that rapidly restructure the genome and silence thousands of genes [13-15]. Genes duplicated by whole genome duplication tend to remain in gradually degrading syntenic blocks with other duplicated genes of the same age, as approximated by the number of synonymous substitutions per site (dS) [7-11, 16].

Evidence for paleopolyploidy in flowering plants

Since the set of organisms used to study the impact of gene duplication on molecular evolution in this thesis are flowering plants, it is important to note the frequency of gene duplication and the primary mechanisms that have resulted in gene duplications in this lineage. Although polyploidy has played a role in molecular evolution throughout eukaryotes [6, 17-19], it is extremely relevant in the evolution of plant genomes, which 30-70% of species are thought to have experienced polyploidization events [6, 20] and examination of sequenced genomes has hypothesized the existence of several paleopolyploidy events which were incident with major radiations [9, 11, 21-23]. In addition, studies examining paleopolyploidy events and the diversification of floral development genes have inferred a whole genome duplication

4 event during the diversification of the and another event during the early diversification of all angiosperms, in the group commonly referred to as basal angiosperms [8, 9] and have hypothesized that these whole genome duplication events have enabled the diversification of angiosperms and the evolution of floral form [24].

Analysis of the synonymous substitution rate in paralogs detected in expressed sequence tags (ESTs) has also provided evidence for paleopolyploidy throughout the evolution of angiosperms [25-27]. As to how polyploidy may have shaped the evolution of flowering plants, polyploidy in plants is associated with increased cell volume [28], decreased developmental rates [29], increased seed size [29], differences in organ structure and function [30, 31], and increased geographic range [32].

The Classical Model of Gene Duplication

Described by Susumu Ohno in the book “Evolution by Gene Duplication“ [2], the classical model of gene duplication assumes that after duplication one copy maintains the original function, whereas the other copy evolves in one of two ways. The first option is neofunctionalization - the acquisition of novel function through a beneficial mutation that is positively selected for. The second option is nonfunctionalization - loss of function through a null mutation. In the case of neofunctionalization, the second copy is maintained and spreads through the population by natural selection. In the case of nonfunctionalization, the second copy is eventually lost through pseudogenization and eventual deletion. According to the classical model, most cases of gene duplication end with one copy accumulating a null mutation and being eventually lost from the genome.

Maintenance of both copies after duplication is rare, but is an important source of novelty

5 in the evolutionary process. Examples of neofunctionalization after duplication include the RNAse1 gene in the leaf-eating colobine monkey enabling bacterial fermentation in the foregut [33], opsin genes in Old World primates enabling trichromatic color vision

[34], and a trypsinogen-like protease in Antarctic notothenioid fishes enabling a natural antifreeze glycoprotein to protect the fish from below-freezing temperatures [35, 36].

There are several potential criticisms of the classical model of gene duplication.

Under the classical model, most gene duplicates should become nonfunctional in a relatively short amount of time – approximately a few million years given a mutation rate of 10-6 mutations per base per year [4, 37]. However, there are high rates of retention of duplicated genes, ranging from 10-75%, after many paleopolyploidization and recent polyploidization events [6]. In addition, according to the classical model, the majority of maintained gene duplicates should acquire novel functions through neofunctionalization.

However, cases of neofunctionalization are rare and the majority of maintained duplicated genes do not exhibit signatures of neofunctionalization.

The Duplication-Degeneration-Complementation Model of Gene Duplication

In order to account for a greater than expected number of genes that are maintained after duplication, an alternative model for the evolution of genes after duplication was proposed, the Duplication-Degeneration-Complementation (DDC) model

[38]. Invoking population genetics and the role of genetic drift in the evolution of gene duplicates, the DDC model also introduced the concept of subfunctions into the theoretical framework for the evolution of gene duplicates [4, 37-41]. Subfunctions are independent modules of gene expression or function [42]. Under the DDC model there

6 are three evolutionary courses for a pair of duplicated genes. The first two are the fates described by the classical model - neofunctionalization and nonfunctionalization. The third, subfunctionalization, involves complementary, degenerate mutations in independent subfunctions of the original genes. Through this process, both copies are now required in order to maintain the ancestral function of the gene prior to duplication.

So although degenerative mutations are necessary for subfunctionalization, in the end selection maintains the retention of both duplicates. Most cases of subfunctionalization involve the complementation of expression patterns rather than complementation of protein function [38, 43]. However, it has been suggested that complementation of protein function may have played a role in the evolution of Type II MADS-box genes in flowering plants [44-46] as well as other an assortment of other genes in animal lineages

[47, 48]. It is anticipated by the DDC model that subfunctionalization would result in symmetric patterns of divergence between paralogs. Recent studies have argued that divergence between paralogs is commonly asymmetric in terms of expression divergence, sequence divergence and protein interactions [49-55]. Selection plays a role in the DDC model through: 1) release of a gene from pleiotropic constraints through subfunctionalization [56]; and 2) evolutionary innovation through neofunctionalization

[56, 57].

Alternative hypotheses for the evolution of duplicated genes

Whereas the DDC model focuses on the role of genetic drift in the maintenance of duplicated genes through subfunctionalization, an alternative hypothesis, known as the

“gene balance hypothesis” invokes protein interactions and stoichiometry to explain the

7 disparate rates of maintenance of duplicates according to protein function and mechanism of gene duplication [58-61]. According to the ‘gene balance hypothesis’, genes that are dosage-sensitive such as transcription factors and other regulatory proteins are retained in duplicate only when complete sets of the regulatory network that determines the expression level of the target genes are duplicated at the same time (e.g., after a whole genome duplication event) [58-60, 62]. When relative dosage of each protein is maintained, duplicate copies have no overall effect on the fitness of the individual, but when dosage is thrown out of balance through a tandem or segmental duplication, there is selection against the disruption in dosage. Over time, modifications may occur to the dosage network that result in increasing complexity through processes such as neofunctionalization and subfunctionalization. This model can also be applied to other highly-connected protein interaction networks given that disruption in stoichiometry, or balance, in number of interaction partners may affect the overall fitness of the system, but typically it has been used as an explanation for differences in phenotype resulting from aneuploidy versus polyploidy, maintenance of duplicated regulatory genes after whole genome duplications but not small-scale duplications, and hybrid vigor [46, 59, 61, 62].

Another alternative hypothesis for the evolution of duplicated genes is the

‘buffering’ hypothesis proposed by Chapman et al [63]. Under the ‘buffering’ hypothesis, duplicates are maintained under selection for redundancy to protect an organism from deleterious mutations. Previous studies have shown evidence that duplicated genes can result in improved fitness by providing the organism with robustness to deleterious mutations [63-66]. The ‘buffering’ hypothesis rests on an additional hypothesis that genes that maintain duplicates are of greater importance to the organism’s fitness than

8 other genes and therefore, protection from deleterious mutations by the maintenance of more than one locus increases fitness. To support this hypothesis, they rely on evidence indicating that duplicated genes evolve more slowly than singletons [63, 67]. However, it is likely that the evolutionary rate is tied to expression level rather than a particular crucial function performed by the protein [62].

Expression divergence versus functional divergence

Duplicated genes tend to acquire expression differences more rapidly than differences in protein function. While the mutations that result in nonfunctionalization, neofunctionalization or subfunctionalization can occur in either the regulatory modules or the protein-coding sequence of a gene, it is expected that most of the mutations will affect the regulatory modules [41, 42, 68]. There are multiple hypotheses that favor expression divergence over functional divergence. Regulatory sequences evolve rapidly on microevolutionary time-scales, generating considerable genetic diversity that is capable of altering gene expression [69]. Alteration of protein function is complicated by the possibility of disrupting protein-protein interactions [70]. Selective constraints such as dosage imbalance favor rapid expression divergence rather than extensive modification of the protein [71]. There is evidence for rapid shifts in expression patterns immediately after polyploidization in resynthesized polyploids in cotton, Arabidopsis, Brassica, and other angiosperm species via epigenetic changes such as RNAi silencing and CpG methylation [13, 14] which can reduce the amount of time the duplicates are released from selective constraint and can accumulate null mutations [72]. A large number of studies have shown evidence for extensive levels of expression divergence between

9 duplicated genes and that significant changes in protein constraint are limited (as detected by analysis of the synonymous and nonsynonymous substitution rates) [53, 54, 73-78].

Overall, it is hypothesized that expression divergence plays a more significant role in the evolution of duplicated genes, especially immediately after a duplication event, and that functional divergence in relation to changes to protein function happen over a longer period of time [51]. One potential pathway for this process is the subneofunctionalization model proposed by He and Zhang [51] in which they propose that rapid subfunctionalization through expression divergence occurs after duplication, to be followed by neofunctionalization at the protein level [51, 66, 79, 80].

Another important facet of understanding the role of expression divergence is the evolution of development in both animal and plant lineages. Studies in the evolution of development have shown an important role for transcriptional regulation in the evolution of developmental genes, with a common conclusion that the majority of morphological diversity is the result of changes to expression patterns of developmental regulatory genes rather than changes to protein sequences [45, 68, 80-84]. Therefore, understanding the role of expression divergence in the evolution of duplicated genes is integral to understanding the evolution of development in animal and plant lineages, as well as understanding the role of duplication in the evolution of development.

Differential retention of duplicated genes

It has been shown that the evolution of genes after duplication is nonrandom. The

DDC model predicts that the probability that a given gene will undergo subfunctionalization after duplication is dependent on the number of independent

10 functional or regulatory modules [37, 38, 40, 51]. Studies in Arabidopsis and yeast have shown that genes with paralogs tend to be more complex, containing more protein domains and longer in coding sequence length than singletons [51, 85]. The gene balance hypothesis states that the maintenance of duplicated genes is dependent on dosage sensitivity and the number of interaction partners [59]. The maintenance of duplicated genes is also influenced by cellular location, molecular function and biological process.

In apparent defiance of duplication, there are specific Gene Ontology (GO) categories that are underrepresented among maintained duplicated genes, including genes involved in defense, apoptosis, DNA repair, chloroplast, mitochondrion, electron transfer, amino acid activation, oxygen binding, transmembrane receptors, and tRNA ligases among the set of duplicated genes in Arabidopsis [53]. In contrast, maintained duplicated genes are overrepresented among a large number of GO categories, with a tendency towards transcription factors, signal transduction pathways and other regulatory genes [53, 86].

Retention is also dependent on the mechanism by which a gene is duplicated [86]. For example, genes involved in transcription, kinase activity, protein binding and modification, and signal transduction pathways factors are preferentially maintained after whole genome duplication events but are preferentially lost after small-scale duplication events such as tandem duplications in Arabidopsis [86]. In contrast, maintenance of duplicated genes associated with secondary metabolism and response to biotic stimulus is independent of the mechanism of gene duplication [86]. Within recently duplicated portions of the human genome, duplicates of genes involved in immunity and defense, membrane surface interactions, drug detoxification, and growth and development tend to be retained [87].

11 Explaining expression divergence in regulatory genes

As previously discussed, expression divergence plays a major role in the evolution of duplicated genes. Given the overrepresentation of transcription factors, protein kinases, signal transduction pathway genes and other regulatory genes among duplicated genes, what role does expression divergence play in the maintenance of regulatory genes? The second chapter of this dissertation discusses expression divergence in maintained regulatory genes in Arabidopsis. By using a 2-way ANOVA analysis of microarray data from six organs in wild-type Arabidopsis thaliana, it was possible to examine patterns of expression divergence in duplicated regulatory genes in more specific terms than previous distance-based methods. Distance-based methods for examining expression divergence between paralogs by analysis of microarray expression data has several technical limitations that prevent the methodology from accurately assessing the mode of expression divergence – simply put, distance-based methods fail to distinguish between patterns of expression divergence that contribute to maintenance of both paralogs, such as subfunctionalization and neofunctionalization, and expression divergence that may be indicative of future loss of a paralog through nonfunctionalization. This is especially relevant as researchers have suggested asymmetric divergence of expression patterns between paralogs using distance-based methods – which would suggest a role for silencing or hypofunctionalization in the evolution of duplicated genes [49, 51, 52, 73, 88]. Hypofunctionalization is introduced in the second chapter as a pattern of expression divergence in which one member of a duplicated pair remains expressed and shows signs of tissue specificity, but is consistently expressed 2-3 fold lower than its paralog. This pattern is relatively common

12 in Type II MADS-box genes [73] and may be related to the maintenance of redundancy in genetic pathways, such as the floral developmental pathway [64, 65]. The results from this study indicate a high rate (~85%) of expression pattern shifts in regulatory genes in

Arabidopsis that are indicative of subfunctionalization and/or neofunctionalization.

Accompanying this result is evidence from Ka/Ks analyses that proteins remain under selective constraint. The combination of the 2-way ANOVA methodology with other analyses, such as one-way ANOVA to detect silencing, analysis of complementation, and mapping of complementary patterns onto a tree of tissue similarity, provides a wealth of information about the expression patterns of duplicate genes that has not been shown in other studies [73].

Exploring the link between duplication and the origin of the angiosperms

The origin and rapid diversification of floral forms in the fossil record is commonly known as “Darwin’s abominable mystery” [89] and continues to be the interest of many plant evolutionary biologists. The group of flowering plants that is most likely to help solve Darwin’s abominable mystery is the basal angiosperms – a paraphyletic group of flowering plants that are neither monocots nor eudicots that lay along a grade at the base of the angiosperm tree of life. Among the basal angiosperms,

Amborella trichopoda (the sole member of the Amborellaceae) and the Nymphaeales occupy the lowest rung(s) of the ladder. Molecular phylogenetic studies have placed

Amborella and the Nymphaeales as the basalmost lineages of extant angiosperms with either Amborella alone, or Amborella + Nymphaeales as sister to all other angiosperms

13 [90-100]. Nuphar is sister to the rest of Nymphaeaceae within the Nymphaeales [101,

102].

The third chapter of this dissertation explores the utility of EST sequencing in

Amborella trichopoda and Nuphar advena (Nymphaceae), for reconstructing the genetic toolbox of the ancestral angiosperm and understanding the role of gene duplication in the evolution of development in flowering plants. By identifying 4572 gene clusters for which there exists a putative homolog in Amborella and/or Nuphar, the EST sequencing of transcripts from pre-meiotic buds in these two species provides insight into the genetic make-up of the most recent common ancestor of all extant angiosperms. Included in this set of gene clusters are homologs of many developmental genes, which helps to elucidate and determine the genetic toolbox of early angiosperms and how these genes have evolved throughout the history of angiosperms. Amborella and Nuphar serve an important role in determining the role of gene duplication and diversification in the evolution of flowering plants, since genes from these basal angiosperms can be used to root analyses and help determine ancestral expression and function of developmental genes. This study also identified shared single copy genes in Arabidopsis and Oryza that are present in Amborella and/or Nuphar and discussed the phylogenetic utility of these genes in resolving organismal phylogeny.

Identification and characterization of shared single copy genes

Although the overwhelming majority of studies on the evolution of genes after duplication is concerned with evolution of genes which maintain duplicates, another approach to understanding the molecular evolution of genes after duplication is to study

14 genes for which duplicates are not maintained. Few previous studies have examined singleton genes, and rarely is maintenance of single copy status discussed [103, 104]. The identification of single copy nuclear genes in plants has been primarily the target of molecular systematists looking for new phylogenetic markers. Nuclear genes are a rich source of alternative phylogenetic information, with many distinct qualities that make them desirable compared to the typically used chloroplast and rRNA DNA sequences

[105-108]. Results from previous studies have shown that intronic sequences from nuclear genes such as LFY, ACCase, PGK, petD, GBSSI, GPAT, ncpGS and others are at least as useful as ITS or chloroplast intronic sequences and in many cases are more informative than ITS or chloroplast intronic sequences [109-113]. The fourth chapter of this dissertation identifies a set of shared single copy genes in the Arabidopsis thaliana,

Populus trichocarpa, Vitis vinifera and Oryza sativa nuclear genomes and characterizes these genes in an attempt to understand the evolution of these genes. The application of these genes for phylogenetics is also discussed, both in terms of broad “deep” phylogeny and family-level phylogenies. Characterization of these genes provides some insight into their evolution and the evolutionary forces that may play a role in their preferential loss of paralogs, but overall many questions are raised about the function and evolution of these genes. Single copy genes targeted to the chloroplast and mitochondria are overrepresented compared to the rest of the genome, as well as genes associated with

DNA and/or RNA metabolism.

15 Applications for single copy genes

Single copy nuclear genes in flowering plants are in the unique position of being the closest semblance of “true” orthologs in their genomes. Given the amount of duplication present in flowering plant genomes and their evolutionary history, orthologous sequences that are only separated by speciation events and have not been duplicated since the most recent common ancestor can be considered to be rare, and the number of genes that can be considered orthologous decreases dramatically as we compare increasingly distant lineages. The use of orthologous sequences is especially important for phylogenetics, as research has shown that asymmetric divergences between paralogs (and co-orthologs) can cause artifacts in the reconstruction of organismal phylogeny [49]. In the fifth chapter of this dissertation, the phylogenetic utility of the single copy genes identified in the previous chapter, has been tested by combining the high-throughput identification of single copy genes in the Arabidopsis, Populus, Vitis and

Oryza genomes with high-throughput acquisition of sequences through 454 leaf transcriptome sequencing. 454 sequencing is based on an adaptation of pyrosequencing on a massively parallel scale for which the current generation of the technology can sequence 400,000 reads with an average read length of 200-300 bp in a single run of 4 hours (Roche Life Sciences). A pilot study using de novo leaf 454 transcriptome sequencing in Brassica oleracea and Medicago truncatula for the purpose of generating sequences for phylogenetics is discussed in the fifth chapter of this dissertation. The preliminary results from this study indicate that in comparison to datasets generated by the sequencing of whole chloroplast genomes, datasets containing single copy nuclear genes generated by transcriptome sequencing are richer resources for phylogenetic

16 information – containing greater proportions of variable and parsimony-informative characters. In addition, the single copy nuclear gene dataset is more consistent across different phylogenetic methods, whereas the chloroplast dataset has topological variation according to the phylogenetic method employed.

References

1. Kimura M: The Neutral Theory of Molecular Evolution. Cambridge: Cambridge Univ. Press; 1983. 2. Ohno S: Evolution by gene duplication. New York, New York: Springer; 1970. 3. Sonnhammer ELL, Koonin EV: Orthology, paralogy and proposed classification for paralog subtypes. Trends in Genetics 2002, 18(12):619-620. 4. Lynch M, Conery JS: The evolutionary fate and consequences of duplicate genes. Science 2000, 290(5494):1151-1155. 5. Cheng Z, Ventura M, She X, Khaitovich P, Graves T, Osoegawa K, Church D, DeJong P, Wilson RK, Paabo S et al: A genome-wide comparison of recent chimpanzee and human segmental duplications. Nature 2005, 437(7055):88- 93. 6. Otto SP, Whitton J: Polyploid incidence and evolution. Annual Review of Genetics 2000, 34:401-437. 7. Cannon SB, Mitra A, Baumgarten A, Young ND, May G: The roles of segmental and tandem gene duplication in the evolution of large gene families in Arabidopsis thaliana. BMC Plant Biology 2004, 4:10. 8. Bowers JE, Chapman BA, Rong JK, Paterson AH: Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 2003, 422(6930):433-438. 9. Blanc G, Hokamp K, Wolfe KH: A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res 2003, 13(2):137-144. 10. Simillion C, Vandepoele K, Van Montagu MCE, Zabeau M, Van de Peer Y: The hidden duplication past of Arabidopsis thaliana. Proceedings of the National Academy of Sciences USA 2002, 99(21):13627-13632. 11. Vision TJ, Brown DG, Tanksley SD: The origins of genomic duplications in Arabidopsis. Science 2000, 290(5499):2114-2117. 12. Kihara H, Ono T: Chromosomenzahlen und systematische Gruppierung der Rumex-Arten. Z Zellforsch Mikrosk Anat 1926, 4:475-481. 13. Adams KL, Cronn, R., Percifield, R., and J. F. Wendel: Genes duplicated by polyploidy show unequal contributions to the transcriptome and organ- specific reciprocal silencing. Proceedings of the National Academy of Sciences USA 2003, 100:4649-4654.

17 14. Osborn TC, Pires JC, Birchler JA, Auger DL, Chen ZJ, Lee H, Comai L, Madlung A, Doerge RW, Colot V et al: Understanding mechanisms of novel gene expression in polyploids. Trends in Genetics 2003, 19(3):141-147. 15. Song KM, Lu P, Tang KL, Osborn TC: Rapid genome change in synthetic polyploids of Brassica and its implications for polyploid evolution. Proceedings of the National Academy of Sciences USA 1995, 92:7719-7723. 16. Soltis DE, Albert VA, Leebens-Mack J, Palmer JD, Wing RA, dePamphilis CW, Ma H, Carlson JE, Altman N, Kim S et al: The Amborella Genome: An evolutionary reference for plant biology. Genome Biology 2008, 9:402. 17. Aury J-M, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Segurens B, Daubin V, Anthouard V, Aiach N et al: Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 2006, 444(7116):171- 178. 18. Brunet FG, Crollius HR, Paris M, Aury JM, Gibert P, Jaillon O, Laudet V, Robinson-Rechavi M: Gene loss and evolutionary rates following whole- genome duplication in teleost fishes. Molecular Biology and Evolution 2006, 23:1808-1816. 19. Scannell DR, Frank AC, Conant GC, Byrne KP, Woolfit M, Wolfe KH: Independent sorting-out of thousands of duplicated gene pairs in two yeast species descended from a whole-genome duplication. Proceedings of the National Academy of Sciences USA 2007, 104:8397-8402. 20. Wendel JF, Cronn RC, Johnston JS, Price HJ: Feast and famine in plant genomes. Genetica 2002, 115(1):37-47. 21. Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, Choisne N, Aubourg S, Vitulo N, Jubin C et al: The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 2007, 449(7161):463-467. 22. Paterson AH, Bowers JE, Chapman BA: Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proceedings of the National Academy of Sciences USA 2004, 101:9903-9908. 23. Tuskan GA, DiFazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam N, Ralph S, Rombauts S, Salamov A et al: The Genome of Black Cottonwood, Populus trichocarpa (Torr. & Gray). Science 2006, 313(5793):1596-1604. 24. De Bodt S, Maere S, Van de Peer Y: Genome duplication and the origin of angiosperms. Trends in Ecology and Evolution 2005, 20:591-597. 25. Blanc G, Wolfe KH: Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. The Plant Cell 2004, 16(7):1667-1678. 26. Cui L, Wall PK, Leebens-Mack JH, Lindsay BG, Soltis DE, Doyle JJ, Soltis PS, Carlson JE, Arumuganathan K, Barakat A et al: Widespread genome duplications throughout the history of flowering plants. Genome Res 2006, 16:738-749. 27. Schlueter JA, Dixon P, Granger C, Grant D, Clark L, Doyle JJ, Shoemaker RC: Mining EST databases to resolve evolutionary events in major crop species. Genome 2004, 47:868-876.

18 28. Cavalier-Smith T: Nuclear volume control by nucleoskeletal DNA, selection for cell volume and cell growth rate, and the soution of the DNA C-value paradox. Journal of Cell Science 1978, 34:247-278. 29. van Well E, Fossey A: A comparative investigation of seed germination, metabolism and seedling growth between two polyploid Triticum species. Euphytica 1998, 101:83-89. 30. Bretagnolle F, Lumaret R: Bilateral polyploidization in Dactylis glomerata L. subsp. lusitanica: occurence, morphological and genetic characteristics of first polyploids. Euphytica 1995, 84:197-207. 31. Brochmann C: Reproductive strategies of diploid and polyploid populations of arctic (Brassicaceae). Plant Systematics and Evolution 1993, 185:35-70. 32. Levin DA: Polyploidy and novelty in flowering plants. American Naturalist 1983, 122:1-25. 33. Zhang J, Zhang Y-P, Rosenberg HF: Adaptive evolution of a duplicated pancreatic ribonuclease gene in a leaf-eating monkey. Nature Genetics 2002, 30:411-415. 34. Dulai KS, von Dornum M, Mollon JD, Hunt DM: The evolution of trichromatic colour vision by opsin gene duplication in New World and Old World primates. Genome Res 1999, 9(629-38). 35. Chen L, DeVries AL, Cheng CH: Evolution of antifreeze glycoprotein gene from a trypsinogen gene in Antarctic notothenioid fish. Proceedings of the National Academy of Sciences USA 1997, 94(3811-16). 36. Cheng CH, Chen L: Evolution of an antifreeze glycoprotein. Nature 1999, 401:443-444. 37. Lynch M, O'Hely M, Walsh B, Force A: The probability of preservation of a newly arisen gene duplicate. Genetics 2001, 159(4):1789-1804. 38. Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J: Preservation of duplicate genes by complementary, degenerative mutations. Genetics 1999, 151(4):1531-1545. 39. Lynch M: Genomics - Gene duplication and evolution. Science 2002, 297(5583):945-947. 40. Lynch M, Force A: The probability of duplicate gene preservation by subfunctionalization. Genetics 2000, 154(1):459-473. 41. Force A, Lynch M, Postlethwait J: Preservation of duplicate genes by subfunctionalization. American Zoologist 1999, 39(5):78A-78A. 42. Force A, Cresko WA, Pickett FB, Proulx SR, Amemiya C, Lynch M: The origin of subfunctions and modular gene regulation. Genetics 2005, 170(1):433-446. 43. Averof M: Athropod Hox genes: insights on the evolutionary forces that shape gene functions. Current Opinion in Genetics & Development 2002, 12:386-392. 44. Becker A, Theissen G: The major clades of MADS-box genes and their role in the development and evolution of flowering plants. Molecular Phylogenetics and Evolution 2003. 45. Soltis PS, Soltis DE, Kim S, Chanderbali A, Buzgo M: Modifications of the ABC model based on analyses of basal angiosperms. In: Developmental

19 Genetics of the Flower. Edited by Soltis DE, Soltis PS, Leebens-Mack J. London: Elsevier Limited; 2006. 46. Veron A, Kaufmann K, Bornberg-Bauer E: Evidence of interaction network evolution by whole genome duplications: a case-study in MADS-box proteins. Molecular Biology and Evolution 2007, 24(3):670-678. 47. Dermitzakis ET, Clark AG: Differential selection after duplication in mammalian developmental genes. Molecular Biology and Evolution 2001, 18(4):557-562. 48. Edelmann L, Stankiewicz P, Spiteri E, Pandita RK, Shaffer L, Lupski J, Morrow BE: Two functional copies of the DGCR6 gene are present on human chromosome 22q11 due to a duplication of an ancestral locus. Genome Res 2001, 11(2):208-217. 49. Fares MA, Byrne KP, Wolfe KH: Rate asymmetry after genome duplication causes substantial long-branch attraction artifacts in the phylogeny of Saccharomyces species. Molecular Biology and Evolution 2006, 23(2):245-253. 50. Kim S-H, Yi SV: Correlated asymmetry of sequence and functional divergence between duplicate proteins of Saccharomyces cerevisiae. Molecular Biology and Evolution 2006, 23(5):1068-1075. 51. He X, Zhang J: Rapid subfunctionalization accompanied by prolonged and substantial neofunctionalization in duplicate gene evolution. Genetics 2005, 169(2):1157-1164. 52. Wagner A: Asymmetric functional divergence of duplicate genes in yeast. Molecular Biology and Evolution 2002, 19:1760-1768. 53. Blanc G, Wolfe KH: Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. The Plant Cell 2004, 16(7):1679- 1691. 54. Ganko EW, Meyers BC, Vision TJ: Divergence in expression between duplicated genes in Arabidopsis. Molecular Biology and Evolution 2007, 24(10):2298-2309. 55. Wagner A: Decoupled evolution of coding region and mRNA expression patterns after gene duplication: Implications for the neutralist-selectionist debate. Proceedings of the National Academy of Sciences USA 2000, 97(12):6579-6584. 56. Lynch M, Conery JS: The origins of genome complexity. Science 2003, 302(5649):1401-1404. 57. Hughes AL: Adaptive evolution after gene duplication. Trends in Genetics 2002, 18(9):433-434. 58. Birchler JA, Bhadra U, Bhadra MP, Auger DL: Dosage-dependent gene regulation in multicellular eukaryotes: Implications for dosage compensation, aneuploid syndromes, and quantitative traits. Developmental Biology 2001, 234:275-288. 59. Birchler JA, Veitia RA: The gene balance hypothesis: From classical genetics to modern genomics. The Plant Cell 2007, 19:395-403. 60. Birchler JA, Yao H, Chudalayandi S: Biological consequences of dosage dependent gene regulatory systems. Biochimica et Biophysica Acta 2007, 1769:422-428.

20 61. Veitia RA: Gene dosage balance: deletions, duplications and dominance. Trends in Genetics 2005, 21(1):33-35. 62. Semon M, Wolfe KH: Consequences of genome duplication. Current Opinion in Genetics & Development 2007, 17:505-512. 63. Chapman BA, Bowers JE, Feltus FA, Paterson AH: Buffering of crucial functions by paleologous duplicated genes may contribute cyclicality to angiosperm genome duplication. Proceedings of the National Academy of Sciences USA 2006, 103(8):2730-2735. 64. Gu X: Evolution of duplicate genes versus genetic robustness against null mutations. Trends in Genetics 2003, 19(7):354-356. 65. Moore RC, Grant SR, Purugganan MD: Molecular population genetics of redundant floral-regulatory genes in Arabidopsis thaliana. Molecular Biology and Evolution 2005, 22:91-103. 66. Zahn LM, Kong H, Leebens-Mack J, Kim S, Soltis PS, Landherr LL, Soltis DE, dePamphilis CW, Ma H: The evolution of the SEPALLATA subfamily of MADS-box genes: A pre-angiosperm origin with multiple duplication throughout angiosperm history. Genetics 2005, 169:2209-2223. 67. Jordan IK, Wolf YI, Koonin EV: Duplicated genes evolve slower than singletons despite the initial rate increase. BMC Evolutionary Biology 2004, 4(22). 68. Wray GA, Hahn MW, Abouheif E, Balhoff JP, Pizer M, Rockman MV, Romano LA: The evolution of transcriptional regulation in eukaryotes. Molecular Biology and Evolution 2003, 20(9):1377-1419. 69. Stone JR, Wray GA: Rapid evolution of cis-regulatory sequences via local point mutations. Molecular Biology and Evolution 2001, 18(9):1764-1770. 70. Riechmann JL: Transcriptional Regulation: a Genomic Overview. The Arabidopsis Book 2002:1-46. 71. Doebley J, Lukens L: Transcriptional regulators and the evolution of plant form. The Plant Cell 1998, 10(7):1075-1082. 72. Rodin SN, Riggs AD: Epigenetic silencing may aid evolution by gene duplication. Journal of Molecular Evolution 2003, 56:718-729. 73. Duarte JM, Cui L, Wall PK, Zhang Q, Zhang X, Leebens-Mack J, Ma H, Altman N, dePamphilis CW: Expression pattern shifts following duplication indicative of subfunctionalization and neofunctionalization in regulatory genes of Arabidopsis. Molecular Biology and Evolution 2006, 23(2):469-478. 74. Gu X, Zhang Z, Huang W: Rapid evolution of expression and regulatory divergences after yeast gene duplication. Proceedings of the National Academy of Sciences USA 2005, 102(3):707-712. 75. Gu ZL, Nicolae D, Lu HHS, Li WH: Rapid divergence in expression between duplicate genes inferred from microarray data. Trends in Genetics 2002, 18(12):609-613. 76. Gu ZL, Rifkin SA, White KP, Li WH: Duplicate genes increase gene expression diversity within and between species. Nature Genetics 2004, 36(6):577-579. 77. Makova KD, Li WH: Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res 2003, 13(7):1638-1645.

21 78. Zhang ZQ, Gu JY, Gu X: How much expression divergence after yeast gene duplication could be explained by regulatory motif evolution? Trends in Genetics 2004, 20(9):403-407. 79. Kramer EM, Jaramillo MA, Di Stilio VS: Patterns of gene duplication and functional evolution during the diversification of the AGAMOUS subfamily of MADS box genes in angiosperms. Genetics 2004, 166(2):1011-1023. 80. Zahn LM, Leebens-Mack J, dePamphilis CW, Ma H, Theissen G: To B or not to B a flower: the role of DEFICIENS and GLOBOSA orthologs in the evolution of the angiosperms. Journal of Heredity 2005, 96:225-240. 81. Carroll SB, Grenier JK, Weatherbee SD: From DNA to diversity: molecular genetics and the evolution of animal design. Malden, MA: Blackwell Science; 2001. 82. Wray GA: From DNA to diversity - Molecular genetics and the evolution of animal design. Science 2001, 292(5525):2256-2257. 83. Wray GA: Genomic regulatory systems - Development and evolution. Science 2001, 292(5525):2256-2257. 84. Wray GA: Transcriptional regulation and the evolution of development. International Journal of Developmental Biology 2003, 47(7-8):675-684. 85. Yang J, Lusk R, Li WH: Organismal complexity, protein complexity, and gene duplicability. Proceedings of the National Academy of Sciences USA 2003, 100(26):15661-15665. 86. Maere S, De Bodt S, Raes J, Casneuf T, Van Montagu M, Kuiper M, Van de Peer Y: Modeling gene and genome duplications in eukaryotes. Proceedings of the National Academy of Sciences USA 2005, 102(15):5454-5459. 87. Bailey JA, Gu Z, Clark RA, Reinert K, Samonte RV, Schwartz S, Adams MD, Myers EW, Li PW, Eichler EE: Recent segmental duplications in the Human genome. Science 2002, 297(5583):1003-1007. 88. Ferris S, Whitt GS: Evolution of the differential regulation of duplicate genes after polyploidization. Journal of Molecular Evolution 1979, 12:267-317. 89. Darwin C: Letter to J. D. Hooker. In: More Letters of Charles Darwin. Edited by Darwin F, Seward AC, vol. 2. London: John Murray; 1903. 90. Barkman TJ, Chenery G, McNeal JR, Lyons-Weiler J, Ellisens WJ, Moore G, Wolfe AD, dePamphilis CW: Independent and combined analyses of sequences from all three genomic compartments converge on the root of flowering plant phylogeny. Proceedings of the National Academy of Sciences USA 2000, 97(24):13166-13171. 91. Borsch T, Hilu KW, Quandt D, Wilde V, Neinhuis C, Barthlott W: Noncoding plastid trnT-trnF sequences reveal a well resolved phylogeny of basal angiosperms. Journal of Evolutionary Biology 2003, 16:558-576. 92. Graham S-W, Olmstead R-G: Utility of 17 chloroplast genes for inferring the phylogeny of the basal angiosperms. American Journal of Botany 2000, 87(11):1712-1730. 93. Hilu KW, Borsch T, Muller K, Soltis DE, Soltis PS, Savolainen V, Chase MW, Powell M, Alice LA, Evans R et al: Angiosperm phylogeny based on matK sequence information. American Journal of Botany 2003, 90(1758-1776).

22 94. Leebens-Mack J, Raubeson LA, Cui L, Kuehl JV, Fourcade MH, Chumley T, Boore JL, Jansen RK, dePamphilis CW: Identifying the basal angiosperm node in chloroplast genome phylogenies: Sampling one's way out of the Felsenstein zone. Molecular Biology and Evolution 2005, 22(10):1948-1963. 95. Mathews S, Donoghue M-J: The root of angiosperm phylogeny inferred from duplicate phytochrome genes. Science 1999, 286(5441):947-950. 96. Parkinson C-L, Adams K-L, Palmer J-D: Multigene analyses identify the three earliest lineages of extant flowering plants. Current Biology 1999, 9(24):1485- 1488. 97. Qiu Y-L, Lee J, Bernasconi-Quadroni F, Soltis D-E, Soltis P-S, Zanis M, Zimmer E-A, Chen Z, Savolainen V, Chase M-W: The earliest angiosperms: Evidence from mitochondrial, plastid and nuclear genomes. Nature 1999, 402(6760):404-407. 98. Soltis PS, Soltis DE, Chase MW: Angiosperm phylogeny inferred from multiple genes as a tool for comparative biology. Nature 1999, 402:402-404. 99. Stefanovic S, Rice DW, Palmer JD: Long branch attraction, taxon sampling, and the earliest angiosperms: Amborella or monocots? BMC Evolutionary Biology 2004, 4:35. 100. Zanis M-J, Soltis D-E, Soltis P-S, Mathews S, Donoghue M-J: The root of the angiosperms revisited. Proceedings of the National Academy of Sciences USA 2002, 99(10):6848-6853. 101. Les DH, Schneider EL, Padgett DJ, Soltis PS, Soltis DE, Zanis M: Phylogeny, Classification and Floral Evolution of Water Lilies (Nymphaeaceae; Nymphaeales): A Synthesis of Non-molecular, rbcL, matK, and 18S rDNA Data. Systematic Botany 1999, 24(1):28-46. 102. Yoo M-J, Bell CD, Soltis PS, Soltis DE: Divergence Times and Historical Biogeography of Nymphaeales. Systematic Botany 2005, 30(4):693-704. 103. Carlquist S: Observations on the vegetative anatomy of Austrobaileya: Habital, organographic and phylogenetic conclusions. Botanical Journal of the Linnean Society 2001, 135(1):1-11. 104. Wu F, Mueller LA, Crouzillat D, Petiard V, Tanksley SD: Combining bioinformatics and phylogenetics to identify large sets of single-copy orthologous genes (COSII) for comparative, evolutionary and systematic studies: A test case in the Euasterid plant clade. Genetics 2006, 174(3):1407- 1420. 105. Mort M-E, Crawford DJ: The continuing search: low-copy nuclear sequences for lower-level plant molecular phylogenetic studies. Taxon 2004, 53(2):257- 261. 106. Sang T: Utility of low-copy nuclear gene sequences in plant phylogenetics. Critical Reviews in Biochemistry and Molecular Biology 2002, 37:121-147. 107. Small RL, Cronn RC, Wendel JF: Use of nuclear genes for phylogeny reconstruction in plants. Australian Systematic Botany 2004, 17:145-170. 108. Soltis DE, Soltis PS: Choosing an approach and an appropriate gene for phylogenetic analysis. In: Molecular systematics of Plants II DNA Sequencing. Edited by Soltis DE, Soltis PS, Doyle JJ. Boston: Kluwer Academic Publishers; 1998: 1-42.

23 109. Emshwiller E, Doyle JJ: Chloroplast-expressed glutamine synthetase (ncpGS): Potential utility for phylogenetic studies with an example from Oxalis (Oxalidaceae). Molecular Phylogenetics and Evolution 1999, 12(3):310. 110. Grob GBJ, Gravendeel B, Eurlings MCM: Potential phylogenetic utility of the nuclear FLORICAULA/LEAFY second intron: comparison with three chloroplast DNA regions in Amorphophallus (Araceae). Molecular Phylogenetics and Evolution 2004, 30(1):13. 111. Oh S-H, Potter D: Phylogenetic utility of the second intron of LEAFY in Neillia and Stephanandra (Rosaceae) and implications for the origin of Stephanandra. Molecular Phylogenetics and Evolution 2003, 29(2):203. 112. Tank DC, Sang T: Phylogenetic utility of the glycerol-3-phosphate acyltransferase gene: Evolution and implications in Paeonia (Paeoniaceae). Molecular Phylogenetics and Evolution 2001, 19(3):421. 113. Lohne C, Borsch T: Molecular evolution and phylogenetic utility of the petD Group II intron: A case study in basal angiosperms. Molecular Biology and Evolution 2005, 22(2):317-332.

24

Chapter 2

Expression Pattern Shifts Following Duplication Indicative of Subfunctionalization and Neofunctionalization in Regulatory Genes of Arabidopsis

Preface

The manuscript has been published in Molecular Biology and Evolution, and was formatted for that journal. Authors for this original manuscript are Jill M. Duarte, Liying

Cui, P. Kerr Wall, Qing Zhang, Xiaohong Zhang, Jim Leebens-Mack, Hong Ma, Naomi

Altman, and Claude W. dePamphilis. JMD helped to conceive the study, identified paralogous pairs, analyzed the results of the ANOVA analysis, performed the analysis concerning expression patterns in the MIKC gene family, and drafted the manuscript. LC and PKW helped with analyses. QZ performed the ANOVA analysis. XZ provided the

Arabidopsis microarray data. JLM contributed to the Ka/Ks analysis and the text. HM contributed to the text. NA helped conceive the ANOVA analysis and performed the linkage clustering of expression patterns. CWD helped to conceive the study and contributed to the text.

25 Research Article

Title: Expression Pattern Shifts Following Duplication Indicative of Subfunctionalization and Neofunctionalization in Regulatory Genes of Arabidopsis

Authors: Jill M. Duarte*, Liying Cui*, P. Kerr Wall*, Qing Zhang✝, Xiaohong Zhang*,

Jim Leebens-Mack*, Hong Ma*, Naomi Altman‡, and Claude W. dePamphilis*

* Department of Biology, Institute of Molecular Evolutionary Genetics, and Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802

✝ Bioinformatics Consulting Center, The Pennsylvania State University, University Park, PA 16802

‡Department of Statistics, The Pennsylvania State University, University Park, PA 16802

Corresponding author: Claude W. dePamphilis, 405B Life Sciences Building, The Pennsylvania State University, University Park, PA 16802; Phone: (814) 863-6412; Fax: (814) 863-1357; e-mail: [email protected]

Keywords: expression, gene duplication, ANOVA, microarray, regulatory genes

Running Head: Expression Pattern Shifts in Duplicated Regulatory Genes

Abbreviations: maximum parsimony (MP); gene (G); organ (O); gene by organ interaction (GxO); analysis of variance (ANOVA)

26 ABSTRACT

Gene duplication plays an important role in the evolution of diversity and novel function and is especially prevalent in the nuclear genomes of flowering plants. Duplicate genes may be maintained through sub- and neofunctionalization at the level of expression or coding sequence. In order to test the hypothesis that duplicated regulatory genes will be differentially expressed in a specific manner indicative of regulatory subfunctionalization and/or neofunctionalization, we examined expression pattern shifts in duplicated regulatory genes in Arabidopsis. A two-way analysis of variance (ANOVA) was performed on expression data for 280 phylogenetically-identified paralogous pairs.

Expression data were extracted from global expression profiles for wild-type root, stem, leaf, developing inflorescence, nearly mature flowerbuds, and seed pod. Gene [G], organ

[O], and gene by organ interaction [GxO] effects were examined. Results indicate that

85% of the paralogous pairs exhibited a significant GxO effect indicative of regulatory subfunctionalization and/or neofunctionalization. A significant GxO effect was associated with complementary expression patterns in 45% of pair-wise comparisons. No association was detected between a GxO effect and relaxed evolutionary constraint as detected by the ratio of nonsynonymous to synonymous substitutions. Ancestral gene expression patterns inferred across a MIKCC gene phylogeny suggest several cases of regulatory neofunctionalization and organ-specific nonfunctionalization. Complete linkage clustering of gene expression levels across organs suggests that regulatory modules for each organ are independent or ancestral genes had limited expression. We propose a new classification, regulatory hypofunctionalization, for an overall decrease in expression level in one member of a paralogous pair while still having a significant GxO

27 effect. We conclude that expression divergence contributes to the maintenance of duplicated regulatory genes in Arabidopsis and hypothesize that this results in increasing expression diversity or specificity of regulatory genes after each round of duplication

INTRODUCTION

Whole genome duplication is especially common in flowering plant lineages relative to animal lineages, with between 50% (and perhaps 70% or more) of all angiosperms having at least one detectable genome duplication in their history [1, 2].

Recent work indicates that whole genome duplications are responsible for more than 90% of the expansion of regulatory genes in the angiosperm lineage over the last 350 million years [3]. Under the classical model of gene duplication [4], one duplicate maintains the original function, while the other either evolves a new function (rare) or is lost or silenced

(common). However, recent work has proposed new models for the evolution of duplicated genes [5-8]. Under these revised models, duplicated genes (paralogs) may experience the following: 1) nonfunctionalization through silencing or null mutation; 2) neofunctionalization through gain of novel function; and 3) subfunctionalization through the partitioning of functional modules such that the complement of both copies represents the functional capability of the ancestral gene [9]. The modification of regulatory modules through mutation or epigenetic effects can result in specific expression pattern shifts between paralogs, resulting in regulatory subfunctionalization, neofunctionalization, or nonfunctionalization (Figure 2-1).

28

Figure 2-1. Fate of duplicated genes and the effect of each fate on expression patterns. Following the Duplication-Degeneration- Complementation model [7], before duplication, a given hypothetical gene has a set expression pattern, depicted here as the quantitative expression level in six organs (A). Immediately after duplication, both copies of the gene have the same expression pattern (B). As time passes, the fate of the two copies can be described as (C) nonfunctionalization, where one copy loses expression in all organs; (D) subfunctionalization, where the expression patterns of both copies are complementary and combined are equal to the expression pattern before duplication; and (E) neofunctionalization, where one copy has a novel increase in expression in one

or more organs. Values below log250 =5.6 are assumed to be undetectable expression [10].

29 Subfunctionalization and neofunctionalization contribute to the likelihood of maintenance of a paralogous set of genes; nonfunctionalization contributes to the likelihood of loss of one member of a paralogous set of genes. A prediction of the DDC model is that subfunctionalization will most often result in a symmetric division of regulatory modules; whereas neofunctionalization will typically reflect the gain of a single regulatory module (Force et al. 1999). However, it is unclear how closely the biology of duplicate genes follows the DDC model. Evidence from studies in yeast supports asymmetric divergence between duplicate genes, both in terms of the protein- coding sequence and expression [11, 12]. Recent studies have also suggested that expression divergence tends towards rapid subfunctionalization followed by neofunctionalization [13-16].

Arabidopsis is hypothesized to have undergone two, or possibly three, whole genome duplications in its past [17-20]. In addition, studies based on a limited set of genes in polyploids show evidence for rapid expression pattern shifts after duplication

[21, 22]. Previous studies on individual sets of regulatory paralogs in flowering plants provide evidence for regulatory subfunctionalization in important genes for reproductive developmental pathways [14-16, 23-25]. Selective constraints, such as dosage imbalance and protein-protein interactions, on regulatory proteins after duplication suggests that expression divergence may frequently play a role in the maintenance of both duplicates.

Static expression patterns after duplication can cause a variety of problems such as dosage imbalance [26]. In addition, changes to the protein coding sequence may cause disruption of protein-protein interactions [27] and other possible activities. Therefore, in the case of duplicated regulatory genes, where there are constraints on protein functional

30 divergence and where either the expression or protein function must diverge in order for both copies to survive, regulatory subfunctionalization and/or neofunctionalization is expected to be common in maintained duplicated regulatory genes.

Given the overrepresentation of regulatory genes in the set of maintained duplicate genes in Arabidopsis [28] and their important role in developmental pathways, we have chosen to focus on shifts in expression patterns of regulatory genes after duplication, as inferred through global expression profiling. We hypothesize that regulatory genes retained after duplication will show evidence of expression pattern shifts, subfunctionalization and/or neofunctionalization, that would contribute to maintenance as described by the DDC model (Force et al. 1999). In order to test this hypothesis, we have developed a 2-way ANOVA analysis of microarray data.

Microarrays provide a rich resource for investigating the evolution of gene expression. Previous studies of expression divergence between paralogs using microarray data applied a correlation-based approach [28-31]. These studies showed that expression

divergence does occur between paralogs, and that this is correlated with Ks (a measure of silent substitutions) but not with d (a measure of protein divergence). A recent analysis of paralogs in Arabidopsis using correlation indicated that 57% of recent duplicates and

73% of older duplicates had divergent expression patterns [28]. Whereas correlation is a good approximation for a relative level of expression divergence that has an easily calculated distance measure, there are limitations in using correlation to analyze the role of specific effects and interaction between effects in expression divergence. In particular, correlation studies only address the frequency of overall divergence, and therefore only provide general information about the evolution of expression after duplication.

31 Correlation can result in false negatives if expression pattern shifts are limited to a small number of the data points (e.g., a spike in expression in one of the paralogs due to neofunctionalization in a single organ would be obscured) and false positives if hybridization strength is not uniform over all probes. Furthermore, differences in overall expression levels among paralogs as detected by correlation can be the result of technical differences in probe design and hybridization [32].

We use a split-plot two-way analysis of variance analysis (ANOVA) to more closely examine the relationship of expression patterns between genes in a paralogous pair. ANOVA is widely used in analysis of microarray experiments [33] because of its power, flexibility and robustness. These qualities also make ANOVA well suited to isolate the factors that contribute to gene expression divergence in duplicate genes. With

ANOVA, one can examine the effect of differential and spatial expression and the interaction between these two effects on the overall divergence of expression patterns between paralogs. With a focus on the consequence of gene duplication, this analysis can distinguish between expression shifts contributing to maintenance of both paralogs

(regulatory subfunctionalization or neofunctionalization) and shifts that would lead to the silencing of one paralog (regulatory nonfunctionalization). Whereas correlation analyses only provide general trends in expression divergence since duplication, the ANOVA approach described here tests more directly the predictions of the DDC model [7].

In order to test the hypothesis that retained duplicated regulatory genes will have evidence of expression pattern shifts that would contribute to maintenance, we applied a two-way ANOVA to expression data for paralogous pairs of Arabidopsis regulatory genes using gene expression profiles for six organs (root, leaf, stem, young

32 inflorescences, later stage flower buds, and silique). Our findings indicate that 85% of the regulatory genes analyzed in this study have undergone significant expression shifts, likely contributing to regulatory subfunctionalization and/or neofunctionalization that serves to maintain duplicated genes. In addition, we provide evidence that some paralogous pairs exhibit expression profiles that are not clearly identifiable as regulatory nonfunctionalization, neofunctionalization or subfunctionalization. These findings have important implications for the evolution of regulatory networks in plants, suggesting that the expression of homologous developmental regulators is likely to vary across plant lineages with distinct histories of ancient polyploidy. We propose that the majority of expression shifts we have detected here contribute to maintenance of regulatory paralogs after duplication via regulatory subfunctionalization and/ or neofunctionalization.

METHODS

Microarray experiments

Microarray experiments were performed on the Affymetrix ATH1 array, which is based on the Arabidopsis thaliana var. Columbia whole genome sequence. Two biological replicates were used for six structures (root, leaf, stem, young inflorescences,

Stage 12 flower, and siliques) from Arabidopsis thaliana var. Landsberg, which we refer to as organs [10]. The microarray data were normalized using the RMA method [34] in

Bioconductor [35]. The normalized data are expressed in logarithmic units (base 2).

Based on Zhang et al. (2005), we considered genes with expression level below 5.6 to be below reliable detection.

33 Phylogenetic identification of Arabidopsis paralogous pairs

In order to identify duplicate pairs of regulatory genes, we first identified putative regulatory gene families in the PlantTribes database, an objectively defined database of putative protein families developed through Markov clustering [36, 37] of the rice and

Arabidopsis proteomes (http://floralgenome.org/cgi-bin/tribedb/tribe.cgi). The

PlantTribes gene families were inferred using predicted protein sequences from the

Arabidopsis thaliana var. Columbia and Oryza sativa japonica genomes downloaded from TIGR (http://www.tigr.org/tdb/euk/)[36]. An all-against-all BLASTP [38] analysis was performed. TribeMCL [36, 37] was used to cluster proteins into putative gene families (http://floralgenome.org/cgi-bin/tribedb/tribe.cgi). A multiple amino acid sequence alignment for each regulatory gene family was produced with POA [39] followed by RASCAL [40] to improve poor regions of the alignment. Maximum parsimony (MP) phylogenetic analysis was performed for each regulatory tribe amino acid alignment using PAUP 4.0b* [41]. Tree searching was performed using ten random sequence additions and tree bisection reconnection branch swapping for alignments with fewer than 50 taxa, subtree pruning reconnection branch swapping for alignments with greater than 50 taxa, and fast jackknife for alignments with greater than 100 taxa. MP bootstrapping was performed (1000 replicates) with heuristic searches using random sequence additions as above. Unambiguous paralogous pairs were identified as distinct monophyletic clades of two Arabidopsis genes with greater than 50% bootstrap support.

Pairs with missing expression data for one or both genes were eliminated from further analyses.

34 Statistical Analyses

In order to identify which components contribute to expression pattern divergence within each duplicate pair, a split plot two-way ANOVA was used to partition the gene effect (G, the subplot treatment), organ effect (O, the whole plot treatment) and gene by organ interaction effect (GxO). The blocks in this analysis are the microarrays, which allows for correlation among paralogs measured on the same array, but independence between arrays. Analysis was done using SAS PROC MIXED [42]. The resulting analysis allows us to assess whether there are differences in the mean expression for the two genes averaged over all organs (G effect), differences in the mean expression for each of the 6 organs averaged over the two genes (O effect), or whether the within-organ mean expression varies by gene

(GxO). Multiple comparisons adjustments were not done, as this unduly inflates the false negative rate when the true differential expression rate is high (Delongchamp et al 2004).

A duplicate pair with expression levels A and B was said to have a complementary expression pattern in organ X (organ X and Y) if the GxO effect was statistically significant, and A>B in some organ (both X and Y) while A

For genes with expression lower than the threshold (25.6) in all organs, a one-way

ANOVA to test for significant O effects was used to distinguish putative silencing events from lowly expressed genes. Genes lower than threshold in all organs and without a significant O effect were identified as putative “silent” loci. Putative silent loci were examined for evidence of expression in other conditions using all Affymetrix whole genome microarray data stored at TAIR (http://www.Arabidopsis.org).

Complete linkage cluster analysis using correlation distance [43, 44] was used to find relatively homogeneous clusters of organs using transcript abundance. 1186 probe sets representing regulatory genes were identified from the 22810 probe sets on the array.

35 Using jack-knife samples of 1000 genes, 200 cluster trees were built using complete linkage clustering with correlation distance. PHYLIP [45] was used to construct a consensus tree for organ similarity using the extended majority rule. Complementary expression patterns in paralogous pairs as previously described were mapped onto the resulting tree of organs to assess independence of regulatory modules and extent of expression pattern shifts.

Tests for adaptive protein evolution

In order to analyze changes in selective constraint for each paralogous pair, codon-based analyses of nonsynonymous to synonymous nucleotide substitution ratios

(ω= Ka/Ks) were performed for each pair using the codeml program in PAML version

3.13 [46]. To investigate whether paralogous pairs with expression pattern shifts are more or less likely to exhibit altered constraint at the protein level, 2-tailed t-tests were used to compare the Ka/Ks distributions between: 1) paralogous pairs with a significant GxO effect and paralogous pairs without a significant GxO effect; 2) paralogous pairs containing at least one putative nonfunctionalization event and all other paralogous pairs; and 3) paralogous pairs with a significant GxO effect and paralogous pairs without a significant GxO effect, with cases of putative nonfunctionalization events removed from the data set.

Inferring ancestral expression patterns in Type II MADS-box genes

The mean expression value for each Type II MADS-box gene in Arabidopsis was obtained and sorted into seven discrete categories based on two-fold expression difference starting with the threshold for detectable expression of 25.6. These unordered character states were mapped on a modified phylogeny of MIKCC Arabidopsis genes [47]

36 and ancestral expression patterns inferred under Fitch optimization in MacClade 4.06

[48]. Equivocal ancestral expression levels were resolved using ACCTRAN.

RESULTS

By using phylogenies to identify paralogous pairs of genes, we are able to focus our study on paralogs resulting from duplications since the separation between the rice and Arabidopsis lineages. A total of 91 gene clusters (Tribes) containing 1464 genes in the PlantTribes database (www.floralgenome.org/cgi-bin/tribedb/tribe.cgi) were identified as containing at least one gene functionally characterized as a regulator of floral development. Using maximum parsimony phylogenies for 91 regulatory tribes in

Arabidopsis and rice, we identified 354 pairs of paralogs (monophyletic Arabidopsis sister genes) maintained in the Arabidopsis genome. 280 pairs remained after paralogous pairs with missing microarray data were excluded (Supplementary Table 2-1).

Accordingly, at least 63% of the paralogous pairs in this study appear to have resulted from polyploidy events in Arabidopsis, with at least 53% associated with the most recent whole genome duplication [17]. The two-way ANOVA results for representative paralogous pairs are reported in Figure 2-2, with graphs of expression levels in all 6 organs for one example from each possible result from the two-way ANOVA. ANOVA results for all paralogous pairs are reported in Supplementary Table 2-1. Two pairs showed no significant effects (e.g. Fig. 2-2A). 13 pairs showed only a G effect (e.g. Fig.

2-2B). Three pairs showed only an O effect (e.g. Fig. 2-2C). 24 pairs showed significant

G and O effects but no significant GxO effect (e.g. Fig. 2-2D). A significant GxO effect without significant G or O effects was observed in two paralogous pairs (e.g. Fig. 2-2E).

37 12 pairs showed a G effect coupled with a GxO effect (e.g. Fig. 2-2F), and a significant

O effect coupled with a GxO effect was observed in 22 paralogous pairs (e.g. Fig. 2-2G).

Finally, 202 pairs have a significant G, O, and GxO effect (e.g. Fig. 2-2H). A total of

85% of paralogous pairs have a significant GxO effect, which is representative of regulatory subfunctionalization and/or neofunctionalization. This result is not significantly altered by reducing the data set to include only the most strongly supported duplicate gene pairs (bootstrap values above 80) or considering paralogous pairs with a

GxO effect significant at a more stringent level (α = 0.01) (Supplementary Table 2-1).

Most of the pairs have a p-value << 0.05. At a more conservative cutoff of α = 0.01, 75% of paralogous pairs have a significant GxO effect.

Figure 2-2. Range of ANOVA results. The depicted paralogous pairs are examples of all possible results, with the number of pairs exhibiting that result following in brackets: A) no significant effects [2 gene pairs]; B) gene effect only [13 pairs]); C) organ effect only [3 gene pairs]; D) gene and a organ effect only [24 pairs]; E) gene by organ effect only [2 pairs]; F) gene and gene by organ interaction effect [12 pairs]; G) organ and gene by organ interaction effect [22 pairs]; (H) gene, organ, and gene by organ interaction effect [202 pairs]. ______

38 A significant G effect in our study corresponds to the type of expression divergence previously detected in correlation-based studies. The number of pairs with a significant G effect (251 pairs) is highly similar to the number with a GxO effect (238 pairs). However, it should be noted that not all pairs with a significant G effect have a significant GxO effect (37 pairs have a significant G effect without a significant GxO effect). This indicates that the type of expression divergence detected by correlation- based analyses is not always indicative of expression pattern shifts that would contribute to maintenance. In addition, there are 24 pairs with a significant GxO effect that do not have a significant G effect and therefore would not be identified as having significant expression divergence in a correlation-based analysis. This illustrates the improvement in sensitivity by using a 2-way ANOVA analysis versus a correlation-based analysis.

In addition, our statistical analyses identified genes that did not have a significant

GxO effect and were likely candidates for regulatory nonfunctionalization. We identified

38 paralogous pairs where at least one paralog was silenced in our data set according to the following criteria: (1) with low to no detectable expression and (2) no significant change in expression as measured by one-way ANOVA over all six organs. Analysis of expression data provided at the TAIR Microarray Expression database

(http://www.Arabidopsis.org) indicated that of these 38 pairs, 21 pairs contain at least one paralog that is lowly expressed or not expressed (below 5.6). All other putative silent genes had medium to strong expression in at least one organ or treatment

(http://www.Arabidopsis.org). The mean Ka/Ks ratio for the 258 actively expressed paralogous pairs (0.177) versus the 21 paralogous pairs containing at least one silent

39 paralog (0.335) differs significantly according to a 2-tailed t-test with unequal variances

(p=0.0333).

We also identified a set of paralogous pairs that could not be easily classified according to the DDC model. There were 22 pairs in which one member of the pair was consistently expressed two to three fold lower than the other member of the paralogous pair. Examining additional expression data from TAIR (http://www.Arabidopsis.org), we find this pattern to be relatively consistent in these pairs. Additionally, all of these pairs do have a significant GxO effect.

Identifying an expression pattern shift as subfunctionalization or neofunctionalization requires an approximation of the ancestral expression pattern before duplication [7, 49, 50]. We were able to make a detailed character reconstruction analysis for the MIKCC gene family using maximum parsimony. Two thirds of the paralogous pairs were identified as having complementary expression patterns and all pairs except one had a significant GxO effect, which is consistent with regulatory subfunctionalization and/or neofunctionalization. Eight of the nine paralogous pairs in this gene family can be mapped to either the most recent paleopolyploidy event or are in a tandem repeat. Pairs in tandem repeats have higher p-values for the GxO effect.

Inference of the ancestral gene expression pattern indicates that none of the paralogous pairs have a clear example of regulatory subfunctionalization (Figure 2-3).

40

Figure 2-3. Inferring ancestral expression patterns in MIKCC gene family in Arabidopsis. All red pairs in the MIKCC phylogeny [47] were detected in our MP phylogeny. All paralogous pairs in the MIKCC gene family have a significant GxO effect, except MAF4/MAF5. This approach has allowed us to infer putative instances of regulatory neofunctionalization and nonfunctionalization by examining fold differences in expression between members of a paralogous pairs and other closely related genes in a phylogenetic framework.

41 Our analysis uses successively more distant paralogs as outgroups to estimate ancestral expression patterns. An alternative and sometimes more sensitive approach would be to use corresponding expression data for one or more related species that diverged prior to the duplication event to serve as an outgroup to the paralogous pair of interest [51]. Unfortunately, corresponding genome-scale expression data for one or more related species are not yet available. Loss of leaf-specific expression can be inferred for

SEP2, because expression patterns for SEP1 and SEP3 suggest that expression in the leaf is the ancestral state. Reduction of expression is more common than complete loss of expression. Subtle neofunctionalization through increase of expression level is observed in multiple paralogs, such as AP3 and PI. In one-third of paralogous pairs, the sister paralog is generally expressed 2-3 fold less than the other member of the paralogous pair, such as AP1/CAL, AGL6/AGL13, and SHP1/SHP2.

Another important factor in the maintenance of duplicated genes is the evolution of the protein-coding region and possible correlation between expression divergence and relaxation of constraint on the protein-coding region of the duplicated genes. According to our hypothesis, we expect to see no correlation between a significant GxO effect and

relaxed selective constraint as measured by the Ka/Ks ratio for duplicated regulatory genes. Divergence in expression pattern was not associated with positive selection on the

protein-coding portion of the genes, as detected by Ka/Ks > 1.0, except in a single instance that included a putative silencing event. The mean Ka/Ks ratio for the 238 paralogous pairs with a significant GxO effect is 0.1741 and the 41 paralogous pairs

without a significant GxO effect have a mean Ka/Ks of 0.2742. A 2-tailed t-test with unequal variances indicates that this difference was statistically significant (p=0.0195).

42 However, this result may have been due to relaxed constraint following loss of function for one of the paralogs in some pairs. When the 19 paralogous pairs in the data set exhibiting silencing across all organs and treatments in one of the duplicates were

removed from the analysis, the mean difference Ka/Ks between paralogous pairs with and without a GxO effect was not significant (p=0.1409).

Given the mechanism for regulatory subfunctionalization in the DDC model, understanding the number and nature of the regulatory modules in the duplicate genes is necessary for distinguishing subfunctionalization and neofunctionalization. By mapping complementary expression patterns onto a tree of how organs are related according to similar expression profiles, we were able to ascertain the relative independence of expression between organs and common complementary expression patterns. Complete linkage clustering of organs by gene expression reveal expected patterns of organ similarity (Figure 2-4). However, bootstrap support values for most nodes are relatively weak, except root versus all other organs, which has 100% bootstrap support.

Reproductive organs are clustered together, with inflorescence and stage 12 flower more similar. Almost all complementary expression patterns are mapped to the tips of the tree or have unusual patterns unrelated to organ similarity.

DISCUSSION

Understanding expression pattern shifts using a two-way ANOVA

An expectation of the classical model and the DDC model for the fate of duplicated genes is that maintenance of duplicated genes will be accompanied by divergence in expression or protein structure [4-8]. An ANOVA approach allowed us to

43 identify and classify instances of expression divergence as they relate to the DDC model in regulatory genes following gene duplication using expression profiles for different organs. The two-way ANOVA results are interpreted as: 1) a gene effect (G) represents divergent quantitative expression levels between genes across all organs (e.g. Fig 2-2B); this is equivalent to the expression divergence seen in previous correlation-based analyses and may be the result of differences in quantitative expression levels, or may be the result of technical bias in microarray probe design; 2) an organ effect (O) represents different expression levels in each organ for both genes in the same paralogous pair (e.g.

Fig. 2-2C), and 3) a gene by organ (GxO) effect represents the case that the expression levels for each locus in the paralogous pair are significantly different in spatial and quantitative terms (e.g. Figs, 2-2E - 2-2H); this is interpreted as representative of regulatory neofunctionalization and/or subfunctionalization.

Our hypothesis that duplicated regulatory genes in Arabidopsis will have evidence of expression pattern shifts that contribute to maintenance (regulatory subfunctionalization and/or neofunctionalization) is supported by the overwhelming majority of paralogous pairs of regulatory genes which have a significant GxO effect.

The predominance of significant GxO effects throughout our data set suggests that regulatory subfunctionalization and/or neofunctionalization are dominant processes contributing to the maintenance of duplicated regulatory genes. Furthermore, we conclude that our global analysis supports the conclusion that the molecular evolution of regulatory proteins in Arabidopsis is significantly impacted by regulatory subfunctionalization and neofunctionalization after duplication as a result of selective constraints on the evolution of protein sequence for regulatory genes [26].

44 It is likely that our estimate of expression pattern shifts indicative of regulatory subfunctionalization and/or neofunctionalization between regulatory paralogs is an underestimate, because we only considered six organ types. Increasing the number of organs sampled or examining additional conditions or developmental stages should allow the detection of an even greater number of regulatory paralogous pairs that exhibit a significant GxO effect. An overall conclusion is that even with a relatively coarse assay, expression pattern shifts that would contribute to maintenance of duplicated regulatory genes via subfunctionalization and/or neofunctionalization are seen in the vast majority of gene pairs analyzed. Another advantage for this approach is that ANOVA accounts for data quality since large variances will reduce the ability to identify significant effects.

Surprisingly, although most of the gene pairs in this study have evidently evolved differential gene expression patterns, few of the paralogous pairs are diverged in a way that is fully consistent with either a classic sub- or neofunctionalization hypothesis of complete division of ancestral expression or acquisition of novel expression followed by loss of ancestral expression. In our results, complete loss of expression in a given organ or treatment for one paralog while the sister paralog is expressed is uncommon.

Therefore, a simplistic binary model, like the DDC model [50], for expression shifts in regulatory duplicates is inadequate. For example, regulatory neofunctionalization as defined by gain of novel expression and loss of all ancestral expression seems to be extremely rare in duplicated Arabidopsis regulatory genes. It is more common that duplicated Arabidopsis regulatory genes in our study have altered ancestral expression levels rather than strict division of ancestral expression. Though this type of expression shift is described as a quantitative shift in the DDC model, it is not considered as a

45 common pathway for divergence [7]. However, our results suggest that a quantitative pathway for regulatory subfunctionalization is frequently taken by duplicated regulatory genes. In other cases, altered expression levels include duplicate pairs where a single paralog is expressed at a higher level in a specific organ than the other paralogs or the ancestral gene (see AP1/CAL in Figure 2-3). This increase in expression may change the role of the gene, similar to regulatory neofunctionalization. However, this expression change is not entirely novel, since the ancestral gene was expressed in the same organ. In addition, a paralog that does gain novel expression in a specific organ or condition may also retain part of the ancestral expression pattern. Such a mixture of subfunctionalization and neofunctionalization has been previously noted in studies using yeast and human duplicates [13] and computational models [52].

Another circumstance deviating from the DDC model concerns genes that are lowly expressed compared to their sister paralogs, but which still exhibit expression shifts that would contribute to maintenance (indicated by a GxO effect) or selective constraint

on the coding sequence (Ka/Ks<1.0). We have assigned these genes to a classification of regulatory hypofunctionalization, a special case of subfunctionalization that does not follow the typical pattern of subfunctionalization outlined by the DDC model.

Examination of our data set, combined with additional expression data from TAIR, indicates that these are uncommon (22 pairs out of 280). This includes the AP1/CAL,

AGL6/AGL13, SEP1/SEP2 and SHP1/SHP2 paralogous pairs from the MIKCC gene family, which are critical regulators of floral development [53, 54]. Regulatory hypofunctionalization is a specialized case of subfunctionalization, where instead of splitting the expression pattern equally as predicted by the DDC model, expression in one

46 of the paralogs is greatly diminished (by at least two to three fold) compared to the other paralog in almost all organs and treatment conditions. In this case, it seems that the minor paralog is maintained through selection for genetic robustness [55] or loss of selective maintenance of the minor paralog has been so recent that the gene is not silenced across all organs. The maintenance of “redundant” duplicate genes can contribute to the robustness of the genetic network by reducing the fitness effect of deleterious mutations

[56]. Given the importance of the floral developmental pathway, we may expect that regulatory hypofunctionalization as protection against deleterious mutations is common among floral regulators. Another explanation is that expression differences between paralogs are among cells within an organ and, therefore, not detectable in typical microarray experiments. In any case, the minor paralog has persisted, but the evolutionary mechanism responsible for persistent low-level expression is not easily ascribed to regulatory subfunctionalization or neofunctionalization.

Complete regulatory nonfunctionalization is difficult to ascertain. Because of limited organ sampling, it is also possible that these putative silencing events are regulatory genes that have highly restricted expression patterns that could not be detected in the microarray results. However, given the breadth of the microarray experiments accessed through TAIR, the proportion of putative silent loci that are not truly silent is

expected to be small. There is a significant difference in mean Ka/Ks between paralogous pairs of actively expressed regulatory genes and paralogous pairs with a putatively silent regulatory gene, suggesting that actively expressed regulatory genes are evolving under purifying selection, whereas regulatory nonfunctionalization (silencing) is associated

47 with a relaxation of evolutionary constraint on protein coding portions of a regulatory gene.

What is the functional significance of these results? While some pairs show 3-fold difference or greater, others have more subtle differences. The level of divergence required to affect function is dependent on the particular gene, the cellular context, and the environment. Therefore, functional significance would need to be assessed in each case. Where functional studies are available, major expression difference is clearly correlated with functional differences. For example in the MIKCC gene family, AP1 has higher levels of expression than does CAL, consistent with AP1 playing a more prominent role in specifying floral meristem and organ identities than CAL [57, 58]. On the other hand, subtle expression differences may not be related to dramatic functional divergence detectable in laboratory conditions, as in the case of SEP1 and SEP2, and SHP1 and

SHP2, which seem to have completely redundant roles in controlling flower and fruit development, respectively [59, 60]. Whereas the expression pattern shifts found in this study may not result in an altered phenotype in a single knockout mutant, it is important to recognize that laboratory conditions are only a small subset of conditions that the organism has experienced in its evolutionary history. Under natural conditions, these subtle expression pattern shifts could affect fitness over evolutionary time, and therefore are evolutionarily important expression shifts [61]. Alternatively, some of the subtle changes might represent evolutionarily “transient” states of truly redundant paralogs that will diverge functionally in the future.

48 Expression divergence is not coupled with global protein constraint

The classical and DDC models for the fate of duplicated genes assert that duplicated genes will only be maintained when changes occur in the regulation or protein activity of the paralogs which result in neofunctionalization or subfunctionalization in the case of the DDC model (Ohno 1970; Hughes 2002; Kondrashov 2002; Wagner 2002;

Force et al. 1999). Therefore, we may expect paralogs without evidence of regulatory subfunctionalization and/or neofunctionalization to have significant changes in the

protein sequence. However, the lack of significant difference in mean Ka/Ks between those paralogous pairs with a GxO effect and paralogous pairs without a GxO effect

(with putative silencing events removed), suggest independent roles for gene regulation and protein activity in the maintenance of duplicated regulatory genes. This agrees with previous studies that show no relation between expression divergence and protein

divergence [29-31]. It should be noted that this methodology for determining Ks and Ka

[46] has a low statistical power and only provides a general overview of protein constraint within the paralogous pair. It is still plausible for local adaptive protein evolution to occur within these paralogous pairs [62, 63]. Local adaptive evolution could complement changes at the regulatory level or eliminate selective pressure for expression pattern shifts.

Complementary expression patterns in duplicated regulatory genes

By identifying the organs in which complementation occurs and comparing it to complete linkage clustering of organs based on expression levels, we can determine whether related organs may share the same regulatory module for expression or if regulatory modules are independent (Figure 2-4). In this case, the mapping of most

49 complementary patterns to the tips of the tree supports independence of regulatory modules that coordinate a specific spatial or temporal expression pattern [64] and also suggests that many duplicated regulatory genes may have been expressed in a limited number of organs prior to duplication. These conclusions apply to approximately half of paralogous pairs with a significant GxO effect.

Figure 2-4. Complete linkage clustering and mapping of complementary expression patterns supports independence of regulatory modules. The tree above summarizes the results of a complete linkage cluster analysis of the 6 plant organs by expression level of regulatory genes and the analysis of paralogous gene pairs for complementary expression. Bootstrap support for the cluster analysis is gray and italicized. 128 out of 280 paralogous pairs were found to have complementary expression. Black numbers indicate the number of paralogous pairs whose expression patterns differ only on organs below this branch. The network below the tree indicates the number of paralogous pairs that differ on the pairs or triples of organs connected at the nodes. For example, 35 sets exhibit complementation only in root compared with other organs; 12 sets exhibit complementation in leaf and root and 1 set exhibits complementation in root, leaf and silique. 11 sets of paralogs are not shown on the graph because they have uncommon complementary expression pattern.

50 The 122 paralogous pairs with a significant GxO effect but without complementation suggest that these non-complementary expression patterns evolve after duplication through a quantitative pathway; in this case both copies are required at the same time in order to meet ancestral functional thresholds for proper regulation of genes downstream in a regulatory network [7]. Complementation may also not be detectable if subfunctionalization or neofunctionalization are occurring within organs included in this study. However, there is also a possibility that the quantitative measures of expression are inconsistent and a constant difference in expression across all organs may be a result of poor probe design or resolution for one of the paralogs. In this case, the ANOVA would detect significant G, O, and GxO effects, but a complementary expression pattern would not be present. Overall, the ANOVA approach, by identifying different components of variance in expression data, can identify unique relationships between expression patterns in a paralogous pair and is not confounded by technical bias in microarray data that can affect the quantitative expression level.

Role of gene duplication in evolution of plant regulatory networks

Given the prevalence of gene and genome duplication in the evolutionary history of plants, extensive expression pattern shifts after duplication would have a profound impact on the evolution of developmental and regulatory networks. Arabidopsis has a larger repertoire of transcription factors than some sequenced metazoan genomes and approximately the same number of transcription factors as human [65]. The abundance of transcription factors is coupled with evidence that expression pattern shifts in regulatory proteins are predominant in the evolution of novel phenotypes [26]. This suggests that evolution of development in angiosperms may differ from organisms where genome

51 duplication is rare. Overall, this rich diversity of regulatory proteins in terms of number and expression patterns in plants provides for complex developmental pathways arising from simpler pathways through rounds of gene and genome duplication by simply tinkering with the regulatory controls of developmental pathways. This evolutionary scenario comes with testable predictions. Lineages with fewer detectable gene or genome-wide duplication events may be expected to have less specialized regulatory networks and regulatory genes with broader function.

ACKNOWLEDGEMENTS

We thank Jiong Wang and Anthony Omeis for plant care. We are grateful for helpful comments from Kateryna Makova and Victor Albert. The support of the NSF

Plant Genome Research Program for the Floral Genome Project (DBI-0115684) and a

University Fellowship from The Pennsylvania State University (J.M.D) are gratefully acknowledged.

ADDITIONAL FILES

A full listing of the paralogous pairs used in this study, along with their respective

ANOVA p-values, status of complementary expression, Ka, Ks, Ka/Ks ratio, amino acid identity, DNA sequence identity, and gene family annotation, are available as supplementary material as Supplementary_Table_2-1.xls. This file is available at http://fgp.huck.psu.edu/jduarte/Supplementary_Table_2-1.xls

52 REFERENCES

1. Blanc G, Wolfe KH: Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. The Plant Cell 2004, 16(7):1667-1678. 2. Wendel JF: Genome evolution in polyploids. Plant Molecular Biology 2000, 42(1):225-249. 3. Maere S, De Bodt S, Raes J, Casneuf T, Van Montagu M, Kuiper M, Van de Peer Y: Modeling gene and genome duplications in eukaryotes. Proceedings of the National Academy of Sciences USA 2005, 102(15):5454-5459. 4. Ohno S: Evolution by gene duplication. New York, New York: Springer; 1970. 5. Hughes AL: Adaptive evolution after gene duplication. Trends in Genetics 2002, 18(9):433-434. 6. Kondrashov FA, Rogozin, I. B., Wolf, Y. I., and E. V. Koonin: Selection in the evolution of gene duplications. Genome Biology 2002, 2:8.1-8.9. 7. Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J: Preservation of duplicate genes by complementary, degenerative mutations. Genetics 1999, 151(4):1531-1545. 8. Wagner A: Selection and gene duplication: a view from the genome. Genome Biology 2002, 3:1012.1011-1012.1013. 9. Lynch M, Conery JS: The evolutionary fate and consequences of duplicate genes. Science 2000, 290(5494):1151-1155. 10. Zhang X, Feng B, Zhang Q, Zhang D, Altman N, Ma H: Genome-wide expression profiling and identification of gene activities during early flower development in Arabidopsis. Plant Molecular Biology 2005, 58(3):401-419. 11. Gu ZL, Rifkin SA, White KP, Li WH: Duplicate genes increase gene expression diversity within and between species. Nature Genetics 2004, 36(6):577-579. 12. Wagner A: Asymmetric functional divergence of duplicate genes in yeast. Molecular Biology and Evolution 2002, 19:1760-1768. 13. He X, Zhang J: Rapid subfunctionalization accompanied by prolonged and substantial neofunctionalization in duplicate gene evolution. Genetics 2005, 169(2):1157-1164. 14. Kramer EM, Jaramillo MA, Di Stilio VS: Patterns of gene duplication and functional evolution during the diversification of the AGAMOUS subfamily of MADS box genes in angiosperms. Genetics 2004, 166(2):1011-1023. 15. Zahn LM, Kong H, Leebens-Mack J, Kim S, Soltis PS, Landherr LL, Soltis DE, dePamphilis CW, Ma H: The evolution of the SEPALLATA subfamily of MADS-box genes: A pre-angiosperm origin with multiple duplication throughout angiosperm history. Genetics 2005, 169:2209-2223. 16. Zahn LM, Leebens-Mack J, dePamphilis CW, Ma H, Theissen G: To B or not to B a flower: the role of DEFICIENS and GLOBOSA orthologs in the evolution of the angiosperms. Journal of Heredity 2005, 96:225-240. 17. Blanc G, Hokamp K, Wolfe KH: A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res 2003, 13(2):137-144.

53 18. Simillion C, Vandepoele K, Van Montagu MCE, Zabeau M, Van de Peer Y: The hidden duplication past of Arabidopsis thaliana. Proceedings of the National Academy of Sciences USA 2002, 99(21):13627-13632. 19. Vandepoele K, Simillion C, Van de Peer Y: Detecting the undetectable: uncovering duplicated segments in Arabidopsis by comparison with rice. Trends in Genetics 2002, 18(12):606-608. 20. Vision TJ, Brown DG, Tanksley SD: The origins of genomic duplications in Arabidopsis. Science 2000, 290(5499):2114-2117. 21. Osborn TC, Pires JC, Birchler JA, Auger DL, Chen ZJ, Lee H, Comai L, Madlung A, Doerge RW, Colot V et al: Understanding mechanisms of novel gene expression in polyploids. Trends in Genetics 2003, 19(3):141-147. 22. Adams KL, Cronn, R., Percifield, R., and J. F. Wendel: Genes duplicated by polyploidy show unequal contributions to the transcriptome and organ- specific reciprocal silencing. Proceedings of the National Academy of Sciences USA 2003, 100:4649-4654. 23. Bomblies K, Wang RL, Ambrose BA, Schmidt RJ, Meeley RB, Doebley J: Duplicate FLORICAULA/LEAFY homologs zfl1 and zfl2 control inflorescence architecture and flower patterning in maize. Development 2003, 130(11):2385-2395. 24. Hileman LC, Baum DA: Why do paralogs persist? Molecular evolution of CYCLOIDEA and related floral symmetry genes in Antirrhineae (Veronicaceae). Molecular Biology and Evolution 2003, 20(4):591-600. 25. Matsunaga S, Isono E, Kejnovsky E, Vyskot B, Dolezel J, Kawano S, Charlesworth D: Duplicative transfer of a MADS box gene to a plant Y chromosome. Molecular Biology and Evolution 2003, 20(7):1062-1069. 26. Doebley J, Lukens L: Transcriptional regulators and the evolution of plant form. The Plant Cell 1998, 10(7):1075-1082. 27. Riechmann JL: Transcriptional Regulation: a Genomic Overview. The Arabidopsis Book 2002:1-46. 28. Blanc G, Wolfe KH: Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. The Plant Cell 2004, 16(7):1679- 1691. 29. Gu ZL, Nicolae D, Lu HHS, Li WH: Rapid divergence in expression between duplicate genes inferred from microarray data. Trends in Genetics 2002, 18(12):609-613. 30. Makova KD, Li WH: Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res 2003, 13(7):1638-1645. 31. Wagner A: Decoupled evolution of coding region and mRNA expression patterns after gene duplication: Implications for the neutralist-selectionist debate. Proceedings of the National Academy of Sciences USA 2000, 97(12):6579-6584. 32. Hekstra D: Absolute mRNA concentrations from sequence-specific calibration of oligonucleotide arrays. Nucleic Acids Research 2003, 31(7):1962- 1968.

54 33. Jin W, Riley RM, Wolfinger RD, White KP, Passador-Gurgel G, Gibson G: The contributions of sex, genotype and age to transcriptional variance in Drosophila melanogaster. Nature Genetics 2001, 29:389-395. 34. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003, 4(2):249-264. 35. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J: Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology 2004, 5:R80. 36. Enright AJ, Kunin V, Ouzounis CA: Protein families and TRIBES in genome sequence space. Nucleic Acids Research 2003, 31:4632-4638. 37. Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large- scale detection of protein families. Nucleic Acids Research 2002, 30(7):1575- 1584. 38. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of Molecular Biology 1990, 215:403-410. 39. Lee C, Grasso C, Sharlow MF: Multiple sequence alignment using partial order graphs. Bioinformatics 2002, 18(3):452-464. 40. Thompson JD, Thierry JC, Poch O: RASCAL: rapid scanning and correction of multiple sequence alignments. Bioinformatics 2003, 19(9):1155-1161. 41. Swofford DL: PAUP*: Phylogenetic Analysis using Parsimony (* and Other methods). In., 4 edn. Sunderland, MA: Sinauer Associates; 2001. 42. Littell RC, Milliken GA, Stroup WW, Wolfinger RD: SAS System for Mixed Models. In. Cary, NC: SAS Institute Inc.; 1996. 43. Anderbert MR: Cluster analysis for applications. New York: Academic Press; 1973. 44. Spath H: Cluster Analysis Algorithms. Chichester, England: Ellis Horwood; 1980. 45. Felsenstein J: PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 1989, 5:164-166. 46. Yang Z: PAML: a program package for phylogenetic analysis by maximum likelihood. Computational Applied Bioscience 1997, 13:555-556. 47. Martinez-Castilla LP, Alvarez-Buylla ER: Adaptive evolution in the Arabidopsis MADS-box gene family inferred from its complete resolved phylogeny. Proceedings of the National Academy of Sciences USA 2003, 100(23):13407-13412. 48. Maddison DR, and W. P. Maddison: MacClade 4.0: Analysis of phylogeny and character evolution. In., 4.06 edn. Sunderland, MA: Sinauer Associates; 2001. 49. Lynch M, Force A: The probability of duplicate gene preservation by subfunctionalization. Genetics 2000, 154(1):459-473. 50. Force A, Lynch M, Postlethwait J: Preservation of duplicate genes by subfunctionalization. American Zoologist 1999, 39(5):78A-78A. 51. Gu X, Zhang Z, Huang W: Rapid evolution of expression and regulatory divergences after yeast gene duplication. Proceedings of the National Academy of Sciences USA 2005, 102(3):707-712.

55 52. Rastogi S, Liberles DA: Subfunctionalization of duplicated genes as a transition state to neofunctionalization. BMC Evolutionary Biology 2005, 5:28. 53. Becker A, Theissen G: The major clades of MADS-box genes and their role in the development and evolution of flowering plants. Molecular Phylogenetics and Evolution 2003. 54. Irish VF: The evolution of floral homeotic gene function. Bioessays 2003, 25:637-646. 55. Moore RC, Grant SR, Purugganan MD: Molecular population genetics of redundant floral-regulatory genes in Arabidopsis thaliana. Molecular Biology and Evolution 2005, 22:91-103. 56. Gu X: Evolution of duplicate genes versus genetic robustness against null mutations. Trends in Genetics 2003, 19(7):354-356. 57. Mandel MA: Molecular characterization of the Arabidopsis floral homeotic gene APETALA1. Nature 1992, 360(6401):273-277. 58. Kempin SA, Savidge B, Yanofsky MF: Molecular basis of the cauliflower phenotype in Arabidopsis. Science 1995, 267(5197):522-525. 59. Liljegren SJ, Ditta GS, Eshed Y, Savidge B, Bowman JL, Yanofsky MF: SHATTERPROOF MADS-box genes control seed dispersal in Arabidopsis. Nature 2000, 404(6779):766-770. 60. Pelaz S: B and C floral organ identity functions require SEPALLATA MADS-box genes. Nature 2000, 405(6783):200-203. 61. Weinig C: Heterogeneous selection at specific loci in natural environments in Arabidopsis thaliana. Genetics 2003, 165(1):321-329. 62. Nam J, Kaufmann K, Theissen G, Nei M: A simple method for predicting the functional differentiation of duplicate genes and its application to MIKC- type MADS-box genes. Nucleic Acids Res 2005, 33:e12. 63. Yang Z, Nielsen, N., Goldman, N., and A.M. Pedersen: Codon substitution models for heterogenous selection pressure at amino acid sites. Genetics 2000, 155:431-449. 64. Wray GA, Hahn MW, Abouheif E, Balhoff JP, Pizer M, Rockman MV, Romano LA: The evolution of transcriptional regulation in eukaryotes. Molecular Biology and Evolution 2003, 20(9):1377-1419. 65. Riechmann JL, Heard J, Martin G, Reuber L, Jiang CZ, Keddie J, Adam L, Pineda O, Ratcliffe OJ, Samaha RR et al: Arabidopsis transcription factors: Genome-wide comparative analysis among eukaryotes. Science 2000, 290(5499):2105-2110.

56

Chapter 3

Utility of Amborella trichopoda and Nuphar advena ESTs for comparative

sequence analysis

Preface

This manuscript has been accepted for publication in Taxon and is formatted for that journal. Authors for the original manuscript are Jill M. Duarte, P. Kerr Wall, Laura

M. Zahn, John Carlson, Pamela S. Soltis, Douglas E. Soltis, Jim Leebens-Mack, J., Hong

Ma, and Claude W. dePamphilis. JMD conducted data analysis and wrote the draft manuscript. PKW conducted data analysis. LMZ. PSS, DES, JLM, HM and CWD contributed to the text. JLM and CWD conceived and supervised the study.

Supplementary Table 3-1 is included as an additional file.

57

Title: Utility of Amborella trichopoda and Nuphar advena ESTs for comparative sequence analysis

Running Title: Amborella trichopoda and Nuphar advena ESTs

Authors: Duarte, J.M1., Wall, P. K.1, Zahn, L.M.1,2, Carlson, J., Soltis, P. S.3, Soltis, D.

E.4, Leebens-Mack, J.1,5, Ma, H.1, dePamphilis, C. W.1

Affliations:

1Department of Biology, Institute of Molecular Evolutionary Genetics, and The Huck

Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA

16802

2 current address: Science, 1200 New York Ave NW, Washington, D.C. 20005

3 Florida Museum of Natural History, University of Florida, Gainesville, FL 32611

4 Department of Botany, University of Florida, Gainesville, FL 32611

5 current address: Department of Botany, University of Georgia, Athens, GA 30602

58

ABSTRACT

Expressed sequence tags (ESTs) for the basal angiosperms Amborella trichopoda

(Amborellaceae) and the water lily Nuphar advena (Nymphaeaceae) have proven valuable in identification of gene pairs to study the timing of duplication events relative to the most recent common ancestor (MRCA) of all extant angiosperms. Here we discuss how ESTs for these taxa are also useful for deducing gene families that were present in the MRCA of all flowering plants. For example, 4572 gene clusters identified in an analysis of the rice and Arabidopsis proteomes contained putative orthologs of Amborella or Nuphar. Homologs of many developmentally important genes were identified from

Amborella or Nuphar in these gene clusters. This number of ancestral genes is expected to increase as the number of Amborella and water lily ESTs increases. Genes found unduplicated in the rice and Arabidopsis genomes may be especially useful for phylogenetic analyses including diverse angiosperm lineages. We identify 595 of these single copy genes with putative orthologs in Amborella or Nuphar. Phylogenetic analysis of one of these nuclear single-copy genes encoding the enzyme carboxymethylenebutenolidase yields a topology that places Amborella and Nuphar as the sisters to all other extant lineages of angiosperms and is generally consistent with current angiosperm phylogenies based mainly on chloroplast, mitochondrial, and nuclear ribosomal sequences.

59

INTRODUCTION

Since their introduction, expressed sequence tags (ESTs) have proven to be the most efficient way to characterize the expressed gene set in plants and other organisms.

The initial utility of ESTs, and still their most common use, is to leverage research within a species, such as genome mapping, gene and genome annotation, estimation of the number of expressed genes [1], building microarrays for expression analysis, and finding genes associated with particular traits. Although most ESTs are derived from model and crop species in the Poaceae and core eudicots, the availability of large EST sets from a diverse array of angiosperms (215 have over 1,000 ESTs; see http://plantta.tigr.org/) has enabled a wide range of analyses that focus on the utility of ESTs for comparative studies in these plants. These studies include analyses of gene families in multiple organisms [2-

8]; comparative studies of genome history and duplication [9-11] and the identification of nuclear genes for phylogenetic analysis of organismal relationships [12, 13].

Amborella trichopoda Baill. is a small, evergreen, dioecious shrub endemic to the island of New Caledonia and the sole living representative of the monotypic

Amborellaceae [14]. Nuphar advena (Aiton) W. T. Aiton is a member of the water lily family (Nymphaeaceae) and is commonly known as spatterdock or yellow pond lily, with a geographic range extending throughout eastern North America [15]. Nuphar advena has a genome size of 2772 MB/1Ce, while that of Amborella trichopoda is only 870 MB/1C

[10]. Nearly all molecular phylogenetic studies have placed Amborella and Nymphaeales as basal lineages of angiosperms with either Amborella alone, or Amborella +

Nymphaeaceae as sister to all other angiosperms [16-29]. Nuphar is sister to the rest of

Nymphaeaceae [30, 31].

60

A striking observation stemming from phylogenetic analyses has been the identification of numerous mitochondrial sequences from Amborella that have apparently been acquired through horizontal transfer from plants as diverse as eudicots, monocots, and mosses [32, 33]. Analyses of mitochondrial sequences of water lilies (and other basal angiosperms) have shown no similar evidence for horizontal transfer [33]. There has not yet been an evaluation of the possibility that horizontal gene transfer has impacted expressed nuclear genes in these species.

By studying the genetic diversity present in the basalmost angiosperm lineages, we can gain insights into the genetic characteristics of the most recent common ancestor of all extant angiosperms. Evidence from floral transcriptome sequencing in Amborella and Nuphar provides important information on potential genome duplications that occurred early in angiosperm history, identification of the common set of angiosperm genes, the potential for functional horizontal gene transfer of nuclear-encoded genes, and single-copy phylogenetic markers from the nuclear genome. In addition, by studying the basalmost angiosperm lineages, we will be able to polarize the direction of changes observed in genes and gene families in monocot and eudicot lineages to infer whether characters are ancestral or derived.

In this paper we demonstrate the utility of the Amborella and Nuphar ESTs in comparative genomics and phylogenetic analysis, as well as review previous research using these ESTs in comparative sequence analysis to detect genome duplication and improve gene family analysis by providing a root to angiosperm clades.

61

METHODS

Tissue Sampling and Collection

EST libraries were built from RNAs isolated from premeiotic flower buds [34].

Nuphar flower buds were collected from a single population of Nuphar advena plants growing at Black Moshannon Lake, Black Moshannon State Park, Centre County,

Pennsylvania. Buds were sampled from the smallest sizes obtainable (ca. 0.3 mm) through initiation of meiosis (13 mm diameter). Premeiotic (< 2 mm in diameter) flower buds of Amborella were collected from plants maintained at the National Tropical

Botanical Garden, Kauai, HI, USA. Buds of male flowers were collected from plants

980644-18, 980631-9, 980631-13, 980647-010, and 980647-008, which correspond to

NTBG vouchers DL8346, TF6481, and DHL8350, for 980644, 980631, and 980647, respectively. Buds of female flowers were collected from plants 980631-019, 980631-

015, 980647-002, 980647-005, and 980647-006, which correspond to NTBG vouchers

TF6481 and DHL8350 for 980631 and 980647, respectively.

EST Sequences and Unigene Assembly

Complementary DNA (cDNA) libraries were constructed from RNAs isolated from premeiotic flower buds of Nuphar advena and male and female premeiotic flower buds of Amborella trichopoda, and EST sequencing was performed as described in

Albert et al. [34]. ESTs and unigenes (assembled contigs of ESTs) for Amborella trichopoda and Nuphar advena are available through the Plant Genome Network (PGN) website (http://www.pgn.cornell.edu).

62

Classification of Amborella and Nuphar Unigenes using PlantTribes and INPARANOID

Groups

In this study, we used two classification systems to provide putative homology groups at two levels of resolution to aid in the tentative identification of the sequences captured in the study. Tribes [34-37] approximate gene families, while INPARANOID groups [38] represent a finer-grained analysis intended to recover putative sets of orthologs. Amborella and Nuphar unigenes were sorted into putative homology groups using tBLASTx. "Best hits" with e-values stronger than e-5 [39] were used to assign unigenes into PlantTribes (http://fgp.huck.psu.edu/tribe.php) and INPARANOID groups composed of Arabidopsis and rice proteins using a similar BLASTx approach.

PlantTribes is a database of putative gene families based on TribeMCL clustering of the Arabidopsis and rice proteomes [34-37]. INPARANOID groups are putative orthologous groups of genes, based on pairwise comparison of two genomes, in this case

Arabidopsis thaliana and Orya sativa [38]. TribeMCL and INPARANOID clustering both use BLAST e-values as a distance metric. Whereas TribeMCL clusters genes irrespective of the taxon from which genes are sampled, the INPARANOID algorithm identifies gene pairs sampled from two focal taxa with minimal distances and then grows clusters by including all genes within each taxon that have smaller distances than the minimal cross-species distance. As a consequence, INPARANOID clusters only include genes for which cross-species gene pairs are found with an e-value below some threshold

(10-5 in this study). In our experience, INPARANOID clusters are typically, but not always, nested within TribeMCL clusters (e.g. see Table 3-1 and Supplementary Table 3-

1).

63

Phylogenetic Analysis using a Single Copy Nuclear Gene

To illustrate the utility of the Amborella and Nuphar ESTs in phylogenetic analysis, we chose a single-copy nuclear gene for which sequences are available from both Amborella and Nuphar, and also for a diversity of eudicot and monocot species.

Clones for ESTs from this gene were sequenced on both strands to obtain a high-quality finished sequence for phylogenetic analysis. Internal primers were designed using

MacVector (Accelrys, San Diego, CA) when necessary. Additional sequences for phylogenetic analysis were obtained from the TIGR gene indices [40] and additional FGP unigenes from PGN (http://pgn.cornell.edu/).

Amino acid sequences were aligned using MUSCLE [41], and nucleotide alignments were created by forcing the DNA sequence onto the protein alignment using

CLUSTALW [42]. Alignments were manually inspected and adjusted. Nucleotide alignments were used to produce phylogenies using maximum parsimony (MP), neighbor-joining (NJ), and maximum likelihood (ML). Appropriate models of sequence evolution for NJ and ML analyses were determined using MODELTEST [43].

Phylogenetic analyses were performed using PAUP 4.0b10 [44]. All MP and ML analyses were based on a heuristic search with 10 random sequence additions and tree bisection-reconnection branch swapping. Bootstrapping was also performed with 1000 replicates for MP and NJ and 250 replicates for ML analyses using a heuristic search with random sequence addition.

64

RESULTS

Description of Amborella and Nuphar EST Sequences

A general description of the Amborella and Nuphar EST sets is provided in Albert et al. [34]. From these species, floral EST sequencing yielded a total of 8489 and 8442

ESTs from Amborella and Nuphar, respectively [34]. Clustering these ESTs to combine sequences putatively derived from the same gene reduced these to 6099 and 6205 unigene (putative gene) sequences, respectively [34]. Of the 6099 Amborella unigenes,

3970 were sorted into a PlantTribe using BLAST, and 3478 have a BLAST best hit (with e-value greater than e-5) into an Arabidopsis and rice INPARANOID group. Of the 6205

Nuphar Unigenes, 4445 belong to a PlantTribe, and 3898 were classified using BLAST into an Arabidopsis and rice INPARANOID group (Supplemental Table 3-1). By comparing the non-redundant set of INPARANOID groups identified in Amborella and/or Nuphar, we estimated that there was a minimal set of 4572 genes in the ancestral angiosperm that have homologs in Arabidopsis and rice (Figure 3-1D; Supplementary

Table 3-1). In addition, by comparing the non-redundant set of PlantTribes identified in

Amborella and/or Nuphar, we estimated that of those gene families that existed in the common ancestor of all angiosperms, a minimal set of 2857 gene families has been retained in both monocots and eudicots (Figure 3-1D). The putative functional categories of the non-redundant set of INPARANOID groups from Amborella and Nuphar

(ancestral orthologs of genes present in Arabidopsis and rice) were deduced using the

Gene Ontology (GO) information for Arabidopsis best BLAST hits (Figure 3-1ABC).

MADS-box genes are critical floral regulators [45-47], and their evolution has been analyzed, including sequences from Amborella and Nuphar [2, 4, 6, 8, 45, 46]. In

65

addition, several other families with important plant and floral developmental function were identified in the EST sets (Table 3-1). Members of the AP2 [3, 48] and NAC families [49] are involved in floral development, among other functions. Homologs were also found for several genes involved in RNA processing and silencing such as members of the ARGONAUTE, HEN1, HUA1, HUA2, and other families; many of these regulate genes involved in floral and developmental function [47, 50]. Many genes that are homologous to those involved in hormone signaling such as PINOID, PIN1, BRI1, ARFs,

SCARECROW, IAAs, ERFs and ETR1 [51] were also detected among the Amborella and

Nuphar unigenes. Several genes encoding proteins that may function in seed and embryo development were also identified, including LEA, PcG (polycomb-group), endo-1,4-beta- glucanase and members of the YABBY family [52-56]. The identification of homologs of known regulators of development and signal responses suggests that many of these processes are conserved in basal angiosperms.

Table 3-1. (p. 81-84) Reproductive development genes present in the Amborella and Nuphar Unigenes. A subset of the PlantTribe and INPARANOID gene clusters listed in Supplementary Table 1 shows unique Nuphar and Amborella genes assembled from Floral Genome Project EST sequences with PGN unigene IDs (see http://pgn.cornell.edu/). Deduced gene sequences are organized by PlantTribe and INPARANOID cluster identity based on the placement of the best BLAST hit in Arabidopsis.

66

Figure 3-1. Characterization of the Amborella and Nuphar ESTs and insights into the ancestral angiosperm. The classification of the nonredundant set of Amborella and/or Nuphar INPARANOID groups into slim Gene Ontology (GO) categories is shown in Figure 3-1ABC. The best BLAST hit into Arabidopsis was used to classify the Amborella and Nuphar Unigenes into slim GO categories. The number of Amborella and Nuphar Unigenes sorted into INPARANOID groups and PlantTribes, along with overlap is shown in 1D.

In addition to the identification of putative homologs of genes with known regulatory functions, other Amborella and Nuphar unigenes were found to encode proteins that contain domains known to be important for development, including the DNA-binding homeodomain, MYB, bZIP, bHLH, and WRKY domains, as well as protein-protein and other interaction motifs WD-40, F-box, zinc fingers and zinc knuckles. Moreover,

Amborella and Nuphar genes might also be involved in cell-cell signaling such as the

67

putative receptors containing an LRR domain, which might function as receptors for plant peptides, including members of the CLAVATA3 family [57].

There are 353 Amborella and 342 Nuphar unigenes with a best BLAST hit (threshold of

10-5) to one of the 1574 Arabidopsis and rice 1:1 PlantTribes that contain a single member from Arabidopsis and rice (Supplementary Table 3-1). These 1:1 PlantTribes represent gene families that have not expanded despite the genome duplications that likely occurred during angiosperm history, including those on the specific lineages leading to Arabidopsis [58-60] and rice [61, 62]. There are 100 pairs of Amborella and

Nuphar unigenes that were sorted by BLAST to these 1:1 PlantTribes. These nuclear protein-coding genes have the potential to be used for phylogenetic analysis across all angiosperms, as illustrated in Figure 3-2. However, current sequence data from a single nuclear gene are not sufficient for resolving the branching order of Amborella and

Nuphar or several internal nodes (Figure 3-2; data not shown for other genes).

68

Figure 3-2. Using single copy nuclear genes for organismal phylogeny. The gene used to construct the tree shown, carboxymethylenebutenolidase, is an example of a single copy nuclear gene identified using a comparative genomic approach to the proteomes of Arabidopsis and rice. Unigenes sampled from throughout the angiosperms were used to illustrate their potential in elucidating organismal phylogeny as well as the conservation of single or low copy throughout the angiosperms, in spite of repeated rounds of duplication in the angiosperm tree of life, as inferred by other studies [9-11]. While the information contained in this one gene is not sufficient to provide an entirely resolved tree, especially at the base of the angiosperms, it provides resolution similar to other single-gene analyses from other markers. ______

69

DISCUSSION

Utility of Amborella and Nuphar ESTs in Comparative Sequence Analyses

Amborella and the water lilies (Nymphaeales) either form a clade, sister to all other extant flowering plant lineages, or Amborella and then the water lilies are successive sisters to all other angiosperms [16-29]. The EST sequences described in this paper are being mined for genes that may allow us to resolve the positions of the

Amborella and water lily lineages at the base of the angiosperm phylogeny (see discussion of 1:1 PlantTribes, below). Regardless of their relationship to each other relative to the rest of the angiosperms, however, the placement of Amborella and Nuphar genes in gene family phylogenies can elucidate duplication events early in angiosperm history. The placement of Amborella or Nuphar sequences in gene-family phylogenies can reveal whether duplication events predate or postdate the most recent common ancestor of Amborella, the water lilies, and all other angiosperms by the identification of clades and whether Amborella or Nuphar sequences are located within the clade

(duplication prior to the diversification of the basal angiosperms) or are placed sister to a set of clades (duplication after the diversification of the basal angiosperms). Further, meta-analyses of gene families can be used to test hypotheses of when gene and whole- genome duplication events occurred. More refined molecular evolutionary analyses and expanded sequence datasets from basal angiosperms may be able to test the hypothesis that gene and genome duplications spawned the evolutionary innovations associated with the early radiation of flowering plants [63].

The MADS-box transcription factor gene family has been the focus of much research on the genetic basis of floral evolution [47]. Except for AP2, Arabidopsis genes

70

specifying the genetically defined functions in the classical ABC model of floral development [64] are MIKC–type MADS-box genes: APETALA1 (A class),

PISTILLATA and APETALA3 (B class, GLOBOSA and DEFICIENS in Antirrhinum), and

AGAMOUS (C class; PLENA in Antirrhinum). According to the quartet model [65], these ABC-function MADS-domain proteins interact with each other and with the

MADS-domain SEPALLATA proteins to form quaternary complexes [66] that regulate specification of floral organ identity. These and other MIKC–type MADS-box genes have been classified based on clade membership [45]. The placement of Amborella and water lily homologs in phylogenetic analyses of PISTILLATA+APETALA3 [4, 67],

AGAMOUS [7, 68], and SEPALLATA [6] clades reveal gene duplications predating the diversification of the angiosperm crown group. Given the role of these genes in floral development, these duplication events, particularly of the SEPALLATA subfamily, are hypothesized to have played a role in the origin of the flower [6]. Whether duplication events in the PISTILLATA+APETALA3, AGAMOUS, and SEPALLATA clades occurred simultaneously as part of a single whole-genome duplication [69] remains an open question.

Phylogenies of nuclear genes also provide a means of assessing the possible acquisition of functional nuclear-encoded genes through horizontal transfer, as has been proposed for mitochondrial sequences in Amborella, but not water lilies [33]. In published gene family phylogenies to date, including those for phytochromes [21, 70], numerous MADS-box genes [4, 6, 67], the AP2 family [3], RNA polymerase II [71],

YABBY genes [72], Amborella and water lilies are found in a phylogenetic position consistent with a basal position among extant angiosperm lineages. This is true as well

71

in the tree presented in this paper. This observation is consistent with vertical (standard) inheritance of each of the genes examined above and highlights the very different dynamic observed for mitochondrial sequences in Amborella. The mechanism by which apparently xenologous sequences are taken up and then incorporated into the mitochondrial genome of this species remains a fundamental problem in plant genetics and genomics [33]. However, the evidence at this point suggests that these dynamics may not extend to the expressed nuclear genes that have been studied thus far, as well as additional single-copy nuclear genes we have examined (data not shown).

Detecting Ancient Genome Duplications

Studies in Arabidopsis indicate that there were at least two and possibly three or more whole genome duplications in the evolutionary history of Arabidopsis [58, 60, 73,

74]. Additional studies utilizing EST sequences for the identification of genome duplication have inferred whole genome duplications throughout the evolutionary history of diverse angiosperms [9-11]. Recent work using ESTs from Amborella and Nuphar indicated that there was a concentration of duplicate gene pairs consistent with ancient genome duplication in the Nuphar lineage, but no detectable signal of a similar ancient genome duplication event in Amborella [10]. This provides evidence for an ancient paleopolyploidy event shared by all angiosperms except Amborella.

Single-Copy Nuclear Genes for Organismal Phylogeny

As a result of pervasive gene duplication, most nuclear genes in angiosperms are part of gene families. This has often deterred researchers from using nuclear genes (other

72

than RNA genes) in phylogenetic reconstruction in angiosperms because of the difficulty of isolating all members of a gene family in each taxon for accurate phylogenetic reconstruction [75-78]. In addition, there is evidence from yeast that phylogenies based on paralogous genes with asymmetric divergence are misrepresentations of the organismal phylogeny [79]. These difficulties in using nuclear genes in phylogeny reconstruction are relieved by adopting a high-throughput approach for the identification of single- or low-copy nuclear genes. Using PlantTribes, we have identified 1574 putative single- or low-copy nuclear genes that contain only one member from Arabidopsis and only one member from rice. By adding unigenes from a wide variety of flowering plants to these 1:1 PlantTribes, we can identify putative single-copy gene families and construct multiple sequence alignments for phylogenetic analysis. Including unigenes from basal lineages aids in the identification of single- or very low-copy nuclear genes that would be widely distributed among all flowering plants and permits using sequences from

Amborella and Nuphar as outgroups for angiosperm-wide analysis if corresponding sequence data from gymnosperms are missing. Although individual gene trees do not provide sufficient resolution across the entire tree, we hypothesize that these single-copy genes will be valuable phylogenetic markers, especially given that we already have sequences from the basal angiosperms for rooting the tree (Figure 3-2). The addition of basal taxa improves the ability to properly root the angiosperm phylogeny [20].

The applications featured in this paper reflect the many uses of Amborella trichopoda and Nuphar advena ESTs and the importance of these organisms in the study of the evolution of flowering plants. The genes identified by the ESTs suggest possible molecular bases for the evolution of floral structure and development. Future work will

73

develop comprehensive search strategies for finding genes in EST sets that may be most useful for organismal phylogeny reconstruction and comparative sequence analysis, using the basal position of these two lineages to elucidate the mysteries of the ancestral angiosperm and the origin of the flower.

ACKNOWLEDGEMENTS

We thank Thomas Borsch and Pam Soltis for organizing the Nymphaeales symposium at the International Botanical Congress in Vienna. We also thank Lena

Landherr for finished cDNA sequencing, and all of our collaborators in the Floral

Genome Project for their hard work and stimulating discussions on gene family evolution and the origin of flowering plants. This work was supported through NSF Plant Genome

Research Project grant DBI-0115684.

ADDITIONAL FILES

Supplementary Table 3-1. Characterization of Amborella and Nuphar Unigenes by homology groups. All Amborella and Nuphar Unigenes are classified according to

PlantTribe, INPARANOID group and best BLAST hit in Arabidopsis. Information about number of ESTs in each Unigene assembly and number of Arabidopsis and rice members in each PlantTribe and INPARANOID group is also included. This file is available at http://fgp.huck.psu.edu/jduarte/Supplementary_Table_3-1.xls

74

REFERENCES

1. Wang J, Lindsay BG, Cui L, Wall PK, Marion J, Zhang J, dePamphilis CW: Gene capture prediction and overlap estimation in EST sequencing from one or multiple libraries. BMC Bioinformatics 2005, 6:300. 2. Kim S, Koh J, Yoo M-J, Kong H, Hu Y, Ma H, Soltis P-S, Soltis D-E: Expression of floral MADS-box genes in basal angiosperms: implications for the evolution of floral regulators. Plant Journal 2005, 43(5):724-744. 3. Kim S, Soltis PS, Wall PK, Soltis DE: Phylogeny and domain evolution in the APETALA2-like gene family. Molecular Biology and Evolution 2006, 23(107- 120). 4. Kim S, Yoo M-J, Albert V-A, Farris J-S, Soltis P-S, Soltis D-E: Phylogeny and diversification of B-function MADS-box genes in angiosperms: Evolutionary and functional implications of a 260-million-year-old duplication. American Journal of Botany 2004, 91(12):2102-2118. 5. Yoo M-J, Albert VA, Soltis PS, Soltis DE: Phylogenetic diversification of glycogen synthase kinase 3/SHAGGY-like kinase genes in plants. BMC Plant Biology 2006, 6:3. 6. Zahn LM, Kong H, Leebens-Mack J, Kim S, Soltis PS, Landherr LL, Soltis DE, dePamphilis CW, Ma H: The evolution of the SEPALLATA subfamily of MADS-box genes: A pre-angiosperm origin with multiple duplication throughout angiosperm history. Genetics 2005, 169:2209-2223. 7. Zahn LM, Leebens-Mack J, Arrington JM, Hu Y, Landherr LL, dePamphilis CW, Becker A, Theissen G, Ma H: Conservation and divergence in the AGAMOUS Subfamily of MADS-box Genes: evidence of independent sub-and neofunctionalization events. Evolution of Development 2006, 8:30-45. 8. Zahn LM, Leebens-Mack J, dePamphilis CW, Ma H, Theissen G: To B or not to B a flower: the role of DEFICIENS and GLOBOSA orthologs in the evolution of the angiosperms. Journal of Heredity 2005, 96:225-240. 9. Blanc G, Wolfe KH: Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. The Plant Cell 2004, 16(7):1667-1678. 10. Cui L, Wall PK, Leebens-Mack JH, Lindsay BG, Soltis DE, Doyle JJ, Soltis PS, Carlson JE, Arumuganathan K, Barakat A et al: Widespread genome duplications throughout the history of flowering plants. Genome Res 2006, 16:738-749. 11. Schlueter JA, Dixon P, Granger C, Grant D, Clark L, Doyle JJ, Shoemaker RC: Mining EST databases to resolve evolutionary events in major crop species. Genome 2004, 47:868-876. 12. de la Torre JE, Egan MG, Katari M, Brenner ED, Stevenson DW, Coruzzi GM, Desalle R: ESTimating plant phylogeny: lessons from partitioning. BMC Evolutionary Biology 2006, 6:48.

75

13. Whittall JB, Medina-Marino A, Zimmer EA, Hodges SA: Generating single- copy nuclear gene data for a recent adaptive radiation. Molecular Phylogenetics and Evolution 2006, 39:124-134. 14. Stevens PF: Angiosperm Phylogeny Website. Version 7 2001, May 2006(http://www.mobot.org/MOBOT/research/APweb). 15. Wiersema JH, Hellquist CB: Nymphaeaceaee. In: Flora of North America. New York: Oxford University Press; 1997: 66-71. 16. Barkman TJ, Chenery G, McNeal JR, Lyons-Weiler J, Ellisens WJ, Moore G, Wolfe AD, dePamphilis CW: Independent and combined analyses of sequences from all three genomic compartments converge on the root of flowering plant phylogeny. Proceedings of the National Academy of Sciences USA 2000, 97(24):13166-13171. 17. Borsch T, Hilu KW, Quandt D, Wilde V, Neinhuis C, Barthlott W: Noncoding plastid trnT-trnF sequences reveal a well resolved phylogeny of basal angiosperms. Journal of Evolutionary Biology 2003, 16:558-576. 18. Graham S-W, Olmstead R-G: Utility of 17 chloroplast genes for inferring the phylogeny of the basal angiosperms. American Journal of Botany 2000, 87(11):1712-1730. 19. Hilu KW, Borsch T, Muller K, Soltis DE, Soltis PS, Savolainen V, Chase MW, Powell M, Alice LA, Evans R et al: Angiosperm phylogeny based on matK sequence information. American Journal of Botany 2003, 90(1758-1776). 20. Leebens-Mack J, Raubeson LA, Cui L, Kuehl JV, Fourcade MH, Chumley T, Boore JL, Jansen RK, dePamphilis CW: Identifying the basal angiosperm node in chloroplast genome phylogenies: Sampling one's way out of the Felsenstein zone. Molecular Biology and Evolution 2005, 22(10):1948-1963. 21. Mathews S, Donoghue M-J: Basal angiosperm phylogeny inferred from duplicate phytochromes A and C. International Journal of Plant Sciences 2000, 161(6 Supplement):S41-S55. 22. Parkinson C-L, Adams K-L, Palmer J-D: Multigene analyses identify the three earliest lineages of extant flowering plants. Current Biology 1999, 9(24):1485- 1488. 23. Qiu Y-L, Dombrovska O, Lee J, Li L, Whitlock B-A, Bernasconi-Quadroni F, Rest J-S, Davis C-C, Borsch T, Hilu K-W et al: Phylogenetic analyses of basal angiosperms based on nine plastid, mitochondrial, and nuclear genes. International Journal of Plant Sciences 2005, 166(5):815-842. 24. Qiu Y-L, Lee J, Bernasconi-Quadroni F, Soltis D-E, Soltis P-S, Zanis M, Zimmer E-A, Chen Z, Savolainen V, Chase M-W: The earliest angiosperms: Evidence from mitochondrial, plastid and nuclear genomes. Nature 1999, 402(6760):404-407. 25. Qiu Y-L, Lee J, Bernasconi-Quadroni F, Soltis D-E, Soltis P-S, Zanis M, Zimmer E-A, Chen Z, Savolainen V, Chase M-W: Phylogeny of basal angiosperms: Analyses of five genes from three genomes. International Journal of Plant Sciences 2000, 161(6 Supplement):S3-S27. 26. Soltis D-E, Soltis P-S, Chase M-W, Mort M-E, Albach D-C, Zanis M, Savolainen V, Hahn W-H, Hoot S-B, Fay M-F et al: Angiosperm phylogeny inferred from

76

18S rDNA, rbcL, and atpB sequences. Botanical Journal of the Linnean Society 2000, 133(4):381-461. 27. Soltis PS, Soltis DE, Chase MW: Angiosperm phylogeny inferred from multiple genes as a tool for comparative biology. Nature 1999, 402:402-404. 28. Stefanovic S, Rice DW, Palmer JD: Long branch attraction, taxon sampling, and the earliest angiosperms: Amborella or monocots? BMC Evolutionary Biology 2004, 4:35. 29. Zanis M-J, Soltis D-E, Soltis P-S, Mathews S, Donoghue M-J: The root of the angiosperms revisited. Proceedings of the National Academy of Sciences USA 2002, 99(10):6848-6853. 30. Les DH, Schneider EL, Padgett DJ, Soltis PS, Soltis DE, Zanis M: Phylogeny, Classification and Floral Evolution of Water Lilies (Nymphaeaceae; Nymphaeales): A Synthesis of Non-molecular, rbcL, matK, and 18S rDNA Data. Systematic Botany 1999, 24(1):28-46. 31. Yoo M-J, Bell CD, Soltis PS, Soltis DE: Divergence Times and Historical Biogeography of Nymphaeales. Systematic Botany 2005, 30(4):693-704. 32. Bergthorsson U, Adams KL, Thomason B, Palmer JD: Widespread horizontal transfer of mitochondrial genes in flowering plants. Nature 2003, 424(6945):197. 33. Bergthorsson U, Richardson AO, Young GJ, Goertzen LR, Palmer JD: Massive horizontal transfer of mitochondrial genes from diverse land plant donors to the basal angiosperm Amborella. Proceedings of the National Academy of Sciences USA 2004, 101(51):17747-17752. 34. Albert VA, Soltis DE, Carlson JE, Farmerie WG, Wall PK, Ilut DC, Solow TM, Mueller LA, Landherr LL, Hu Y et al: Floral gene resources from basal angiosperms for comparative genomics research. BMC Plant Biology 2005, 5(1):5. 35. Enright AJ, Kunin V, Ouzounis CA: Protein families and TRIBES in genome sequence space. Nucleic Acids Research 2003, 31:4632-4638. 36. Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large- scale detection of protein families. Nucleic Acids Research 2002, 30(7):1575- 1584. 37. Wall PK, Leebens-Mack J, Muller K, Field D, Altman N, dePamphilis CW: PlantTribes: A gene and gene family resource for comparative genomics in plants. Nucleic Acids Research 2007. 38. Remm M, Storm CEV, Sonnhammer ELL: Automatic Clustering of Orthologs and In-paralogs from Pairwise Species Comparisons. Journal of Molecular Biology 2001, 314:1041-1052. 39. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of Molecular Biology 1990, 215:403-410. 40. Lee Y, Tsai J, Sunkara S, Karamycheva S, Pertea G, Sultana R, Antonescu V, Chan A, Cheung F, Quackenbush J: The TIGR Gene Indices: clustering and assembling EST and known genes and integration with eukaryotic genomes. Nucleic Acids Research 2005, 33:D71-74. 41. Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004, 5(1):113.

77

42. Higgins DG, Thompson JD, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 1994, 22:4673-4680. 43. Posada D, Crandall KA: Modeltest: testing the model of DNA substitution. Bioinformatics 1998, 14(9):817-818. 44. Swofford DL: PAUP*: Phylogenetic Analysis using Parsimony (* and Other methods). In., 4 edn. Sunderland, MA: Sinauer Associates; 2001. 45. Becker A, Theissen G: The major clades of MADS-box genes and their role in the development and evolution of flowering plants. Molecular Phylogenetics and Evolution 2003. 46. Irish VF: The evolution of floral homeotic gene function. Bioessays 2003, 25:637-646. 47. Zahn LM, Feng B, Ma H: Beyond the ABC model: Regulation of floral homeotic genes. In: Advances in Botanical Research. Edited by Soltis DE, Leebens-Mack JH, Soltis PS. London: Elsevier Limited; 2006. 48. Riechmann JL, Meyerowitz EM: The AP2/EREBP family of plant transcription factors. Biological Chemistry 1998, 379:633-646. 49. Olsen AN, Ernst HA, Lo-Leggio L, Skriver K: NAC transcription factors: structurally distinct, functionally diverse. Trends in Plant Science 2005, 10:79- 87. 50. Ma H: Molecular genetic analysis of microsporogenesis and microgametogenesis. Annual Review of Plant Biology 2005, 56:393-434. 51. Gazzarrini S, McCourt P: Cross-talk in plant hormone signalling: What Arabidopsis mutants are telling us. Ann Bot 2003, 91(6):605-612. 52. Chanvivattana Y, Bishopp A, Schubert D, Stock C, Moon Y-H, Sung Z-R, Goodrich J: Interaction of polycomb-group proteins controlling flowering in Arabidopsis. Development 2004, 131(21):5263-5276. 53. Eshed Y, Baum SF, Perea JV, Bowman JL: Establishment of polarity in lateral organs of Arabidopsis. Current Biology 2001, 11(16):1251-1260. 54. Ingouff M, Haselhof J, Berger F: Polycomb group genes control developmental timing of endosperm. Plant Journal 2005, 42(5):663-674. 55. Yamada T, Ito M, Kato M: YABBY2-homologue expression in lateral organs of Amborella trichopoda (Amborellaceae). International Journal of Plant Sciences 2004, 165(6):917-924. 56. Yamada T, Tobe H, Imaichi R, Kato M: Developmental morphology of the ovules of Amborella trichopoda (Amborellaceae) and Chloranthus serratus (Chloranthaceae). Botanical Journal of the Linnean Society 2001, 137(3):277- 290. 57. Ito Y, Nakanomyo I, Motose H, Iwamoto K, Sawa S, Dohmae N, Fukuda H: Dodeca-CLE peptides as suppressors of plant stem cell differentiation. Science 2006, 313:842-845. 58. Simillion C, Vandepoele K, Van Montagu MCE, Zabeau M, Van de Peer Y: The hidden duplication past of Arabidopsis thaliana. Proceedings of the National Academy of Sciences USA 2002, 99(21):13627-13632.

78

59. Vandepoele K, Simillion C, Van de Peer Y: Detecting the undetectable: uncovering duplicated segments in Arabidopsis by comparison with rice. Trends in Genetics 2002, 18(12):606-608. 60. Vision TJ, Brown DG, Tanksley SD: The origins of genomic duplications in Arabidopsis. Science 2000, 290(5499):2114-2117. 61. Paterson AH, Bowers JE, Chapman BA: Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proceedings of the National Academy of Sciences USA 2004, 101:9903-9908. 62. Yu J, Wang J, Lin W, Li S, Li H, Zhou J, Ni P, Dong W, Hu S, Zeng C et al: The genomes of Oryza sativa: A history of duplications. PLoS Biology 2005, 3(2):e38. 63. Darwin C: Letter to J. D. Hooker. In: More Letters of Charles Darwin. Edited by Darwin F, Seward AC, vol. 2. London: John Murray; 1903. 64. Coen ES, Meyerowitz EM: The war of the whorls - Genetic interactions controlling flower development. Nature 1991, 353:31-37. 65. Theissen G, Saedler H: Plant biology: Floral quartets. Nature 2001, 409:469- 471. 66. Honma T, Goto K: Complexes of MADS-box proteins are sufficient to convert leaves into floral organs. Nature 2001, 409:525-529. 67. Aoki S, Uehara K, Imafuku M, Hasebe M, Ito M: Phylogeny and divergence of basal angiosperms inferred from APETALA3- and PISTILLATA-like MADS-box genes. Journal of Plant Research 2004, 117(3):229-244. 68. Kramer EM, Jaramillo MA, Di Stilio VS: Patterns of gene duplication and functional evolution during the diversification of the AGAMOUS subfamily of MADS box genes in angiosperms. Genetics 2004, 166(2):1011-1023. 69. Buzgo M, Soltis PS, Kim S, Soltis DE: The making of a flower. The Biologist 2005, 52:149-154. 70. Mathews S, Donoghue M-J: The root of angiosperm phylogeny inferred from duplicate phytochrome genes. Science 1999, 286(5441):947-950. 71. Nickerson J, Drouin G: The sequence of the largest subunit of RNA polymerase II is a useful marker for inferring seed plant phylogeny. Molecular Phylogenetics and Evolution 2004, 31(2):403-415. 72. Fourquin C, Vinauger-Douard M, Fogliani B, Dumas C, Scutt C-P: Evidence that CRABS CLAW and TOUSLED have conserved their roles in carpel development since the ancestor of the extant angiosperms. Proceedings of the National Academy of Sciences USA 2005, 102(12):4649-4654. 73. Blanc G, Hokamp K, Wolfe KH: A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res 2003, 13(2):137-144. 74. Bowers JE, Chapman BA, Rong JK, Paterson AH: Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 2003, 422(6930):433-438. 75. Crawford DJ, Mort M-E: Single-locus molecular markers for inferring relationships at lower taxanomic levels: observations and comments. Taxon 2004, 53:631-635.

79

76. Hughes J, Longhorn SJ, Papadopoulou A, Theorides K, de Riva A, Meija-Chang M, Foster PG, Vogler AP: Dense Taxanomic EST Sampling and Its Applications for Molecular Systematics of the Coleoptera (Beetles). Molecular Biology and Evolution 2006, 23(2):268-278. 77. Sang T: Utility of low-copy nuclear gene sequences in plant phylogenetics. Critical Reviews in Biochemistry and Molecular Biology 2002, 37:121-147. 78. Small RL, Cronn RC, Wendel JF: Use of nuclear genes for phylogeny reconstruction in plants. Australian Systematic Botany 2004, 17:145-170. 79. Fares MA, Byrne KP, Wolfe KH: Rate asymmetry after genome duplication causes substantial long-branch attraction artifacts in the phylogeny of Saccharomyces species. Molecular Biology and Evolution 2006, 23(2):245-253.

80 Chapter 4

Identification of shared single copy nuclear genes in Arabidopsis, Populus,

Vitis and Oryza and their phylogenetic utility across various taxonomic

levels

Preface

This manuscript is in preparation for submission to BMC Evolutionary Biology and is formatted for that journal. JMD performed characterization of the shared single copy genes, manual curation of EST-based alignments, phylogenetic analysis of EST- based alignment and wrote the manuscript. PKW developed the PlantTribes database and developed software for gathering data for characterization and automated alignment for

EST-based analyses. PPE designed primers, performed RT-PCR, sequencing, alignment and phylogenetic analysis for shared single copy genes in the Brassicaceae, and contributed text to the manuscript. LLL contributed to finished cDNA sequencing of single copy genes in select taxa. JCP conceived of the study using single copy genes for

Brassicaceae phylogeny and contributed text to the manuscript. JLM and CWD conceived of the study.

- 85- Identification of shared single copy nuclear genes in Arabidopsis, Populus, Vitis and Oryza and their phylogenetic utility across various taxonomic levels

Jill M. Duarte1, P. Kerr Wall1, Patrick P. Edger2, Lena L. Landherr1, J. Chris Pires2, Jim Leebens-Mack3, Claude W. dePamphilis 1§

1Department of Biology and the Huck Institutes of the Life Sciences, The Pennsylvania

State University, University Park, PA 16802

2Division of Biological Sciences, University of Missouri-Columbia, Columbia, MO

65211-7310

3Department of Plant Biology, University of Georgia, Athens, GA 30602-7271

§Corresponding author

Email addresses:

JMD: [email protected]

PKW: [email protected]

PPE: [email protected]

LLL: [email protected]

JCP: [email protected]

JLM: [email protected]

CWD: [email protected]

- 86- Abstract

Background Although the overwhelming majority of genes found in angiosperms are members of gene families, and both gene- and genome-duplication are pervasive forces in plant genomes, we have identified a series of genes that are shared as single copy genes in the

Arabidopsis thaliana, Populus trichocarpa, Vitis vinifera and Oryza sativa genomes. To characterize these genes, we have performed a number of analyses examining GO annotations, coding sequence length, number of exons, number of domains, presence in distant lineages, such as Selaginella and Physcomitrella, and phylogenetic analysis to estimate copy number throughout seed plants and to demonstrate their phylogenetic utility. Because these are a special set of orthologous sequences, we then provide examples of how these genes may be used in phylogenetic analysis to reconstruct organismal history, both by using extant coverage in EST databases for seed plants and de novo amplification via RT-PCR in the plant family Brassicaceae.

Results There are 959 single copy nuclear genes shared in Arabidopsis, Populus, Vitis and Oryza and the majority of these genes are also present in the Selaginella and Physcomitrella genomes. Public EST sets for 197 species suggest that most of these genes are present throughout the seed plants as single or very low copy genes, though exceptions are seen in recently polyploid taxa and in lineages where there is significant evidence for a shared large-scale duplication event. The evolutionary forces leading to continuous loss of duplicated genes for some kinds of genes and not others are still unknown, but the collection of single copy genes is biased toward organellar genes and genes of unknown

- 87- biological function. A series of single copy genes were amplified in the Brassicaceae using RT-PCR and directly sequenced without cloning. Sequences obtained using this method provide improved resolution of Brassicaceae phylogeny compared to recent studies using chloroplast and ITS sequences. Although single gene phylogenies that rely on EST sequences have limited bootstrap support as the result of limited sequence information, concatenated alignments result in phylogenetic trees with strong bootstrap support for already established relationships. Overall, these single copy nuclear genes contain a greater proportion of phylogenetically-informative sites than protein-coding sequences from the chloroplast or mitochondrial genomes.

Conclusions Shared single copy nuclear genes provide a vast source of new evidence for plant phylogeny, genome mapping, and other applications, as well as a substantial class of genes for which functional characterization is needed.

Background

Gene duplication as a pervasive force in the molecular evolution of angiosperms As a result of pervasive gene duplication and the high likelihood of repeated rounds of genome duplication [1-3], most nuclear genes in angiosperms are part of gene families. This presents a serious complication in the identification of shared single copy nuclear genes that could be used for applications such as molecular systematics and mapping markers. However, there is a small subset of genes that have very low copy number, ranging from 1-4 copies per taxa [4-6]. In the extreme case are genes that rapidly resolve to single copy very soon after duplication in many independent lineages. There is evidence from Arabidopsis that genes that become single copy after genome duplication

- 88- are more likely to return to single copy status after subsequent genome duplications [7].

This suggests that there is a small subset of single copy nuclear genes that are single copy throughout angiosperm diversity.

An ancient history of genome duplication in plant genomes Whole genome duplications have been inferred in all angiosperm genomes sequenced to date. Analysis of the Arabidopsis thaliana genome provides evidence based on synteny for at least two whole genome duplications, and possibly a third [8-11].

Analysis of the Oryza sativa genome suggests two genome duplications in the evolutionary history of the genome, one close to the divergence of the Poaceae and another older duplication [12, 13]. Analysis of the Populus trichocarpa genome is indicative of three whole genome duplications [14]. The Vitis vinifera genome is currently interpreted to show evidence of an ancient hexaploidy event with no recent whole genome duplications[15]. This evidence, in addition to evidence from analysis of

ESTs in a number of species throughout the angiosperm tree of life, suggests that there is a record of paleopolyploidy in almost all extant angiosperms [1-3]. Evidence for frequent gene duplication is also seen in the evolutionary history of numerous gene families that have expanded during the diversification of the angiosperms [16-19]. The prevalence of duplication in flowering plants means that the identification of orthologous sequences that have not undergone duplication in any lineages of interest is highly unlikely, and quite possibly impossible. Especially for phylogenetics, it is important to identify orthologous sequences given that evidence from yeast indicates that phylogenies based on paralogous genes with asymmetric divergence are misrepresentations of the organismal phylogeny [20].

- 89- Previously identified single copy nuclear genes Relatively few (in the context of the entire genome) single copy nuclear genes have been well-studied in flowering plants. A number of low or single copy nuclear genes have been previously identified, including ADH, TPI, GAP3DH, LEAFY,

ACCase, PGK, petD, GBSSI, GPAT, ncpGS, GIGANTEA, PHYA, PHYB, PHYC,

RBP2 and PISTILLATA – primarily for their use as phylogenetic markers [5, 21-32].

Evidence from wheat indicates that duplicated low-copy genes are rapidly eliminated after polyploidization [33]. This is supported by the trend seen in rice and Arabidopsis that singletons tend to return to singleton status after duplication events [7]. In the rare instances in which duplicated copies are maintained, persistent paralogs tend to arise through polyploidy events followed by functional and/or expression divergence. For example, overexpression of LEAFY generally results in early flowering [34, 35] and cases in which LEAFY is present in duplicate, expression patterns are typically complementary and suggest that subfunctionalization is necessary for the maintenance of both loci [36, 37].

Current tools for phylogenetic analysis in angiosperms Molecular systematics in flowering plants has been dominated by the use of phylogenetic markers that originate from the chloroplast genome (e.g., rbcL, matK. ndhF, trnL-F) or ribosomal DNA (18S, 26S, ITS, ETS). The predominant use of chloroplast and ribosomal DNA markers limits the number of genes available for understanding the placement of the angiosperms in the eukaryotic tree of life. Typically only angiosperms with sequenced genomes are included in taxa sampling for large tree of life datasets [38,

39]. Although the majority of phylogenetic markers used in angiosperms are from the chloroplast or mitochondrial genomes, low copy nuclear genes have been sought after as

- 90- phylogenetic and mapping markers, and nuclear genes are necessary to detect hybridization and introgression events [40]. However, both recent and ancient gene duplications have complicated the identification of low-copy genes that can be considered orthologous and can be easily amplified. Nonetheless, the nuclear genome is an important source of genetic diversity that can be used to establish phylogenetic relationships between species, genera, families, and deeper lineages such as the origin of angiosperms [41, 42] and origin of life [38, 43, 44]. For instance, previous studies indicate that intronic sequences from nuclear genes such as LFY, ACCase, PGK, petD,

GBSSI, GPAT, ncpGS and others are at least as useful as ITS or chloroplast intronic sequences in resolving family-level phylogenies and in many cases are more informative than ITS or chloroplast intronic sequences [22, 30, 31]. However, because of lineage- specific duplications, datasets using protein-coding nuclear genes across all angiosperms are limited (e.g., PHYC [41, 42]; Pires, unpublished data) and other methodologies, such as whole chloroplast genome sequencing, has been used to decipher the angiosperm tree of life [45, 46].

In this paper, we present two applications of using shared single copy genes for phylogeny. The first application utilizes EST contigs and sequences from the TIGR Plant

Transcript Assemblies to investigate the utility of 18 shared single copy nuclear genes for deep phylogenetic analysis, as well as to identify the occurrence and timing of lineage- specific duplications. This application of the single copy genes will illustrate the overall evolution and provide information about whether these genes are suitable for use as phylogenetic markers and whether these genes have the ability to provide phylogenetically-informative sequences. The second application is a family-level

- 91- phylogeny in the Brassicaceae based on sequences for a set of shared single copy genes that have been amplified by RT-PCR and sequenced. The mustard family (Brassicaceae,

338 genera, 3,700 species) is an ideal system to test the utility of these 'single-copy' nuclear genes for phylogenetic studies and to test if these genes have repeatedly returned to a single copy state following multiple whole genome duplication and diploidization events. Arabidopsis thaliana is thought to have undergone at least three rounds of whole genome duplication [47], and the "diploid" Brassica species have undergone additional duplication events [48, 49]. A paleopolyploid event occurred approximately 40 mya near the origin of the family (the alpha event in Arabidopsis, [50]) with an additional putative triplication event within the tribe Brassiceae that occurred 7.9-14.6 mya [48-50].

Hypotheses We hypothesize that there are shared single copy nuclear genes because not all genes are equal with respect to whether genes remain in duplicate or return to single copy after duplication events. Even across various duplication mechanisms (e.g., tandem, segmental, and whole genome duplications), gene retention and gene loss may be nonrandom for some genes throughout the majority of angiosperms [7, 51]. It remains to be determined what evolutionary forces play a role in this nonrandom retention and loss after duplication and as a result whether shared single copy or shared duplicated genes have unique biological characteristics that distinguish them from genes that are randomly lost or retained across angiosperms with respect to gene function, cellular location, and gene architecture.

In regards to phylogenetic application, we hypothesize that broad angiosperm phylogenies constructed using EST sequences of these genes will indicate whether there

- 92- have been lineage-specific duplications during the course of angiosperm evolution, as well as insight into the ability of nuclear genes to resolve angiosperm phylogeny. We predict that RT-PCR amplification of multiple single copy nuclear genes will result in sufficient phylogenetic information to resolve family-level phylogenies.

Results

Shared single copy genes are numerous in angiosperm genomes, as well as in other plant lineages 959 nuclear genes were shared as single copy genes in Arabidopsis, Populus, Vitis and Oryza, (APVO) as defined by the number of PlantTribes at stringency 3.0 that contain a single member from each respective genome (Figure 4-1). The number of shared single copy genes varies according to the genomes included in the analysis, however significant decreases in number of single copy genes is limited to the difference between considering a single genome versus two genomes combined (Figure 4-1).

Additionally, it is primarily the inclusion of the Populus trichocarpa proteome that decreases the number of shared single copy genes (Figure 4-1). Examination of the Plant

Transcript assemblies indicates that these genes are also present throughout angiosperms, although exact copy number cannot be ascertained since EST assemblies are not an accurate estimate of copy number. For example, the average number of plant transcript assemblies per single copy PlantTribe for Arabidopsis and Oryza are 1.22 and 1.51, respectively (Supplementary Table 4-3). These genomes have had extensive EST sequencing, but actual copy number is known based on the genome sequence.

Phylogenies based on eighteen genes that are single copy in both Arabidopsis and Oryza suggest that retained paralogs are limited to recent polyploids and in specific lineages such as the Asteraceae, Solanaceae, Fabaceae, and Gossypium (Additional file 4-4).

- 93- Comparison of the set of single copy genes identified in this study to a set of single copy and low-copy nuclear genes that have been commonly used for angiosperm phylogenetics

[5, 25-27, 29-32, 41, 42, 52] shows no overlap between the two sets – the set of previously used phylogenetic markers belong to PlantTribes in which copy number varies in the four genomes examined. This indicates that the single copy genes presented in this study adhere more strictly to a definition of being single copy genes suitable for phylogenetic analysis.

Figure 4-1. Shared single copy genes in various combinations of angiosperm genomes. Angiosperm genomes are abbreviated as follows: Ath – Arabidopsis thaliana; Ptr – Populus trichocarpa; Vvi – Vitis vinifera; Osa – Orysa sativa. The number of tribes represents the number of PlantTribes found at medium stringency (3.0) that contain a single member from each of the genomes sampled.

When genomes are considered (Arabidopsis, Populus, Vitis, Oryza,

Selaginella, Physcomitrella), there are 395 PlantTribes that consist of a single protein

- 94- from each of the 6 genomes considered. When conservation of copy number is not considered, there are 699 APVO PlantTribes that are present in Selaginella and 558

APVO PlantTribes that are present in Physcomitrella. Even when considering the distant algal species Chlamydomonas reinhardtii, 362 APVO PlantTribes are present with 1 or 2

Chlamydomonas genes per single copy PlantTribe. When the TIGR Plant Transcript

Assemblies are considered, which include a wide range of green algae and terrestrial plants, 438 APVO PlantTribes have at least one hit in a non-seed land plant species and

190 APVO PlantTribes have at least one hit in a green algal (chlorophyte or charophyte) species (Table 4-1).

Table 4-1. Shared single copy nuclear genes are present throughout plant lineages Summary of supplementary table 2 with total number of shared single copy nuclear genes present and distribution of count numbers for various taxonomic groups. The eurosids include members from both I and II; core eudicots are comprised of families considered basal to either the Eurosids or the Asterids such as the Caryophyllales and the Vitales; basal eudicots are represented by the ranunculids Eschscholzia californica and Papaver somniferum; Monocots include both members of the Poaceae and non-grass monocots; basal angiosperms include members of the magnoliids as well as Amborella trichopoda and Nuphar advena; gymnosperms includes conifers, cycads, Ginkgo and Gnetales; Vascular plants include representatives from ferns, mosses and hornwort; and green algae include members of both chlorophyte and charophyte green algal lineages.

- 95- Shared single copy genes have distinguishing characteristics compared to the rest of the genome The APVO shared single copy genes are overrepresented in the following GO slim categories: chloroplast, mitochondria, plastid, other intracellular components, other cytoplasmic components, other enzyme activity, DNA or RNA metabolism, unknown molecular function and unknown biological processes. Shared single copy genes are significantly underrepresented in several GO categories: other membranes, unknown cytoplasmic components, nucleus, transporter activity, nucleic acid binding, other molecular functions, transcription factor activity, kinase activity, other biological processes, protein metabolism, transport, transcription, response to abiotic or biotic stimulus and signal transduction. This correlates with findings that duplicates tend to encode for proteins with known functions such as signaling and related cellular processes while singletons typically encode for unknown proteins [53]. These results also mirror the result obtained by analysis of conserved ortholog sets (COSII) identified in the euasterid clade that indicate an association of single copy genes with genes targeted to the chloroplast and mitochondria, as well as genes associated with DNA and/or RNA metabolism [6]. Shared single copy genes from APVO PlantTribes have a higher number of exons than all other genes (p<0.00001, 2-tailed Student’s t-test with unequal variances). This result holds when intronless genes are excluded from the analysis.

Another notable difference is the number of domains present in the shared single copy genes – fewer domains are present in the shared single copy genes compared to the rest of the genome (p<0.00001, 2-tailed Student’s t-test with unequal variances). However, there is no significant difference in cDNA length (p=0.998, 2-tailed Student’s t-test with unequal variances). In summary, the shared single copy genes are structurally complex

- 96- genes, with a greater number of exons present in the coding sequence. However, they do not appear to be functionally complex, as they have fewer recognized domains per gene than the rest of the genome.

Figure 4-2. Overrepresentation and underrepresentation of shared single copy genes in select GO categories. Bar chart showing GO slim categories that are overrepresented in the 1:1:1:1 tribes using the TAIR8 annotation of the Arabidopsis thaliana genome using an initial alpha value of 0.05 with a subsequent Bonferroni correction for multiple tests. Overrepresentation and underrepresentation was detected using a chi-square test comparing the slim GO annotation of the single copy tribes versus all else in the Arabidopsis genome.

Shared single copy genes can be valuable phylogenetic markers for plant families For successful use of these shared single copy genes in molecular systematics, it must be demonstrated that they can be easily amplified and sequenced in a range of taxa.

In order to test this, we designed primers to amplify the 18 genes in the Brassicaceae.

When cDNA templates were used with degenerate primers, single bands of identical size were recovered across all taxa (Table 4-2). Direct sequencing of these loci yielded sequences with minor polymorphisms. These polymorphisms could be allelic

- 97- or from multiple loci. In contrast, when genomic DNA was used as a template across taxa, multiple amplicons were recovered (Table 4-2). These amplicons varied in size both across and within each of the taxa when using degenerative primers and genomic

DNA as the template. The total number of amplicons also varied between taxa. RT-

PCR amplification products were directly sequenced and used for phylogenetic analysis.

Table 4-2: Amplification of shared single copy nuclear genes in Brassicaceae PCR and RT-PCR amplification results from Brassicaceae showing number and size of bands. The number of bands amplified is indicated (0-5).

A combined and complete data-matrix containing genes At2g32520, At2g13360, and At5g23290 for each of the twelve taxa yielded a single most-parsimonious tree with the exact predicted and previously supported topology (Figure 4-3) [54-56] but with increased bootstrap support at deeper nodes than previously published phylogenies. The tribe Brassiceae had a 100% bootstrap statistical support and a 100% bootstrap support that the tribe is sister to the tribe Sisymbrieae. The Brassiceae + Sisymbrieae clade

- 98- located within lineage II, had a 99% bootstrap support to the relative position to the

Camelinae tribe located within lineage I, and the lineage I and lineage II clade had a

100% bootstrap support for a sister relationship to Chorisporeae (lineage III).

Figure 4-3: Single copy nuclear genes improve phylogenetic resolution in the Brassicaceae. A single most-parsimonious tree was found from combined analysis of complete data- matrix containing genes At2g32520, At2g13360, and At5g23290 (L = 961, consistency index = 0.774, retention index = 0.529). Bootstrap values are shown above branches. Note that individual gene trees gave similar topologies. The phylogeny is consistent with published phylogenies using more taxa and other molecular markers. Bootstrap values from Beilstein et al., 2006 are shown below branches.

Shared single copy genes are valuable phylogenetic markers throughout angiosperms Shared single copy homologs are well represented in the Plant Transcript

Assemblies, representing over 13 million plant ESTs from 233 species. Given this rich resource of sequence information, we examined the presence and distribution of shared single copy genes in this dataset. A significant observation from this analysis is that EST data and number of Unigenes are not reliable indicators of copy number at the genomic level. Even in the case of Arabidopsis, Populus, Vitis and Oryza, where copy number is

- 99- known from the genomic sequence, numbers of Unigenes vary substantially – most likely resulting from sequencing error, assembly issues, alternative splicing, allelism and other such factors. Using a set of criteria designed to maximize the phylogenetic signal for each taxon and minimize noise from the dataset, we used phylogenetics to help distinguish between species-specific duplicates (which could result from either technical error, alleles or an actual recent duplication event) and shared duplication events that could reveal if there are patterns of maintenance of duplicates in lineages. Because this process requires extensive manual inspection of the alignments, a limited number

(eighteen) of genes were selected for the phylogenetic analysis.

Table 4-3. Shared single copy nuclear genes are a rich source of phylogenetic information. This table provides the following information about the 18 shared single copy genes that were used for phylogenetic analysis based on EST and finished cDNA sequences: the Arabidopsis thaliana locus ID (ATH); the TAIR annotation in Arabidopsis (Annotation); the number of sequences in the final alignment (# SEQ); the number of nucleotides in the final alignment (# NT); the number of variable characters in the final alignment (# VAR); the number of parsimony-informative characters in the final alignment (# PI); the percentage of characters that are parsimony-informative (% PI); the number of nodes in the MP (% >50 MP) and ML (% >50 ML) bootstrap consensus that are supported in greater than 50% of the bootstrap replicates.

- 100 - Not surprisingly, single gene phylogenies provide limited resolution of seed plant phylogeny beyond plant families given the limited number of characters for each gene

(Table 4-3). The plant families Fabaceae, Poaceae, Asteraceae, and Solanaceae, all of which are well represented in the Plant Transcript Assemblies, are strongly supported in the single gene phylogenies (Additional File 4-4). Shared duplications in a given lineage are also present in five of the individual gene phylogenies, which precluded the use of these genes in the concatenated alignment. Shared duplications within a lineage are associated with lineages for which a shared polyploidy event has been proposed [1, 3]. It should be noted that even though many members of the Poaceae are well represented in the Plant Transcript Assemblies and multiple sequences from a single species were typically included in the single gene alignment, there were no cases of shared duplications within this plant family – multiple sequences from the same species were more closely related to each other than to sequences from other species (Additional File

4-4). Notably, there are no duplications that involves more than a single plant family – deeper duplications are not present in these datasets. This supports the idea that paralogous copies of these genes do not persist over evolutionary time and the presence of paralogous sequences is most likely the result of relatively recent duplication events.

Contradictions to the general consensus concerning angiosperm phylogeny, such as

Acorus falling within magnoliids instead of a basal position within monocots, have poor bootstrap support (<70%).

- 101 -

Figure 4-4. Angiosperm phylogeny using ESTs for 13 shared single copy genes The tree depicted to the left is the MP tree determined from the concatenated data matrix for 13 single copy genes using 69 seed plant taxa. The tree depicted on the right is the ML tree determined from the concatenated data matrix for 13 single copy genes using 69 seed plant taxa. Bootstrap values are indicated by the colored bars placed on branches with greater than 50% bootstrap support. Picea sitchensis was used as the outgroup taxa for all analyses. Taxa are color-coded as follows: monocots (green); euasterid I (light blue); euasterid II (dark blue); eurosid I (pink); eurosid II (red); core eudicot (purple); basal eudicot (brown); magnoliid (orange); basal angiosperm (dark gray); gymnosperms (black).

In contrast to the individual gene phylogenies, the concatenated alignment of 13 of the shared single copy genes (Figure 4-4) shows improved resolution and is similar to recently reported phylogenies [45, 57, 58] though there are significant differences with the placement of individual species between the MP and ML trees. Overall, the ML tree

- 102 - shows improved resolution and increased bootstrap support compared the to MP tree

(Table 4-3). The eurosids, asterids and monocots are all monophyletic (Figure 4-4). All plant families with multiple species present in the datasets are found to be monophyletic with strong bootstrap support. In all cases in which multiple members of a genus are included in the analysis, genera are found to be monophyletic with strong bootstrap support. Eurosids I and II are paraphyletic with the Sapindales (Eurosid II) sister to a clade containing the Malpighiales (Eurosid I) with poor bootstrap support. (below 50% in

MP and 53% in ML). Excepting the placement of the Malpighiales, the rest of the eurosid II lineage is monophyletic. Vitales are basal to the eurosids with poor bootstrap support (<50%). Euasterids I and II are both monophyletic with strong bootstrap support.

The caryophillid Beta is not stable – nesting within the eurosid II clade in the MP phylogeny and sister to the Asterids in the ML phylogeny, with below 50% bootstrap support for both placements. Ranunculales are sister to all core eudicots with strong support in the ML analysis and moderate support in the MP analysis. Within the monocots, Acorus is basal, with Zingiber sister to the Poaceae. Limited resolution at basal nodes, such as the affiliation of the core eudicots, magnoliids and basal angiosperms, is expected to be the result of the limited amount of sequence information available for these lineages. Magnoliids are sister to the eudicots, with poor bootstrap support, contrary to their placement sister to the monocots + eudicots in whole chloroplast genome datasets

[59]. The ordering of Amborella and Nuphar switches between the MP and ML tree, with below 50% support for Amborella as the basalmost angiosperm in the MP phylogeny and with 52% bootstrap support for Nuphar as the basalmost angiosperm in the ML phylogeny. Gymnosperm relationships are un-informative given the rooting of

- 103 - the tree with Picea sitchensis and the limited taxon sampling since there weren’t a sufficient number of genes present for members of the Zamiales, Ginkgoales and

Gnetales.

Discussion

There are a large number of shared single copy nuclear genes in plant genomes Because of pervasive gene duplication in the angiosperms, the presence of a significant number of shared single copy nuclear genes is an interesting biological question. However, it is unlikely that any one gene identified in this study is single copy in every extant angiosperm, given the frequency of gene duplication and the time required to resolve gene duplication events. Overall, this study provides a set of genes for which further characterization and research could greatly improve our understanding of the molecular evolution of genes in flowering plants and provide new tools for plant biologists, especially in the realm of molecular systematics.

Shared single copy genes have different characteristics than singletons A previous study on singletons present in the Oryza and Arabidopsis genomes [7], but not necessarily conserved as singletons between the two genomes yielded results that do not correspond to the findings presented in this paper. The singletons presented in

Chapman et al. [7] are significantly shorter than genes with a duplicate. However, our analysis in Arabidopsis indicates that there is no significant difference in cDNA length between shared single copy genes and all other genes. To some degree, the difference in approach for identifying single copy genes may provide a clue as to the difference in results. Chapman et al. [7] identified single copy genes (singletons) in individual genomes, whereas the single copy genes in this study were identified by requiring the gene to be single copy in four genomes as an added criterion. This suggests that there

- 104 - may be significantly different molecular characteristics for shared and non-shared singletons within an individual genome as a result of differential molecular evolution of these two subsets of single copy genes. It is possible that some of the singletons identified by Chapman et al. [7] are unique to their genome or lineage, or that the homologous sequence in another genome has paralogs. The only common characteristic between the shared singletons and the non-shared singletons is a decreased number of protein domains – however this could be the result of poor characterization and annotation of genes that are not members of a gene family. It is of note that the functional overrepresentation of shared single copy genes in slim GO categories is highly similar to the results found by Wu et al [6] for clusters of singletons in the Asterids that were verified as single copy in Arabidopsis, which they named conserved ortholog sets

(COSII). Therefore, there could be significant similarities between genes that experience nonrandom gene loss after duplication and genes that experience random genes loss after duplication.

An increased average number of exons in shared single copy genes supports a hypothesis that intron loss is facilitated by retrotransposition and gene conversion in gene families[60]. In the absence of paralogs, we hypothesize that introns are more stable in shared single copy genes than introns in genes that are members of gene families. This aspect of intron evolution in shared single copy genes could be very important for their utility in phylogenetic studies, since intronic sequences can be a useful tool for species identification and classification.

- 105 - Evolution of shared single copy nuclear genes against a background of polyploidy The reduced number of single copy genes found when Populus was included to the analysis can be attributed to the relatively recent genome duplication in Populus and the relatively slow resolution of this event. This indicates that the resolution of duplication events can vary in how long it takes for the gene to return to single copy through a nonrandom process after duplication or reflects the number of shared single copy genes that present in single copy as the result of random processes. Some degree of this can be attributed to the random processes associated with the degeneration of one of the duplicates according to neutral evolution and population genetics parameters [61].

However, the conservation of many of these genes as single copy throughout the angiosperms, as evidenced by their representation and distribution in the broad EST- based phylogenies from this study, suggests that nonrandom processes can be involved in maintaining their single copy status. For example, variance in copy number may relate to variance in the strength of negative selection against more than one copy after duplication. It is also plausible that a number of these shared single copy genes are only coincidently single copy in more than genome, having returned to single copy after duplication through genetic drift. Unfortunately, the small number of sequenced genomes and incomplete functional characterization for a number of these genes does not allow us to make a firm conclusion on which evolutionary forces predominate the evolution of these genes or whether any given gene is single copy as the result of drift or selection. Ideally, functional characterization and population genetic studies of this class of genes as well as more sequenced genomes will in the future provide more evidence about the respective roles of selection and drift on the evolution of these genes after duplication.

- 106 - The strong association between organellar (chloroplast and mitochondria) targeted genes and the conserved single copy genes identified in this study supports a hypothesis that coordination of protein complexes such as those involved in electron transport in both the chloroplast and mitochondria requires an evolutionarily-stable copy number, since these nuclear-encoded proteins interact with organellar-encoded proteins. However it is unclear how dosage in the nuclear genome (one nucleus per cell) and dosage in the chloroplast genome (many chloroplasts per cell) is coordinated. This hypothesis extends the gene balance hypothesis by invoking the specific complexity of protein complexes that include members from both the nuclear genome and the organellar genomes [62-65].

Although some number of the genes identified in this study are present as single copy as the result of random processes, additional data obtained by phylogenetic analysis of EST sequences of a subset of these genes from a wide range of angiosperms suggests that the single copy status of some of these genes is non-random. Cases which are indicative of ancient duplication are limited to clades for which ancient polyploidization events have been hypothesized that coincide with the duplication of single copy genes, such as the Asterids, Fabaceae, and Gossypium (see Additional File 4-4). By focusing on sequences obtained through transcriptome sequencing or RT-PCR, we have purposefully only focused on expressed copies of these single copy genes. However, we hypothesize that in genomes with recent duplication events there will be multiple loci present for any gene, including the shared single copy genes. It may take millions of years for duplicated loci to resolve to a single locus – however, silencing of duplicated loci such that only a single locus is actively expressed is expected to occur relatively quickly after duplication

[33, 66-68].

- 107 - The expressed copy of At2g32520 was aligned to non-expressed genomic DNA copies for several taxa in the Brassicaceae that were cloned and sequenced. This comparison revealed not only which genomic DNA copy was expressed but also that the non-expressed copies displayed evidence of pseudogenization by acquiring numerous point mutations, insertions, and deletions likely resulting in gene loss through sequence degeneration. Many of the taxa are relatively recent polyploids, and it is difficult to conclude if the polymorphisms observed in the sequences from the directly sequenced bands from RT-PCR are allelic or from duplicate copies. However, this analysis did display possible evidence for the loss of duplicate copies from past duplication events from the presence of pseudogenization for multiple genomic loci.

Following whole genome duplications, the loss or degeneration of duplicate copies is the most likely conclusion especially for genes that encode conserved biological functions [69]. This suggests that these 'single copy' genes have repeatedly returned to a single copy state following whole genome duplication events. The individual gene trees show this pattern, as well as illustrating differences in the resolution of whole genome duplication events. All of the inferred gene duplications present in the individual gene trees are coincident with whole genome duplications that have been demonstrated by either Ks distributions of ESTs or syntenic chromosomal blocks [1, 3, 12, 13, 49]. In lineages where tandem or segmental duplication is inferred to be more prevalent than whole genome duplications, such as the monocots (specifically the Poaceae [13]), multiple sequences from a single species are more closely related to each other than sequences from a different species. This indicates that if duplicates are actually present in a genome (not detectable using ESTs), they are recent in origin and did not originate

- 1 -08 during a shared duplication event. Overall, these results suggests that is typically genome-scale duplications that result in the long term maintenance of more than one locus for this class of genes – however more research is required to determine if this trend extends to all shared single copy genes, since the phylogenetic analysis was limited to a small number of shared single copy genes. This supports that the gene balance hypothesis may be well suited to explaining the evolution of these genes after duplication. However, it must be determined if genes within this set are in fact dosage- sensitive or dosage-insensitive in order to understand whether the gene balance hypothesis is applicable.

Shared single copy genes are a valuable new source of phylogenetic information from the nuclear genome This study emphasizes the utility and ease of using single-copy nuclear genes for phylogenetic studies. The Brassicaceae has been recently divided into 25 tribes that are mostly grouped into four monophyletic lineages [54, 56]. The tribe

Aethionemeae containing saxatile is the tribe sister to the remainder of the family [54-56]. Lineage I contains eight tribes, including the tribe Camelineae with Arabidopsis thaliana and Olimarabidopsis pumila. Lineage II contains four tribes, including the agronomically important tribe Brassiceae with Brassica rapa,

Brassica oleracea, Brassica repanda, Moricandria arvensis, Schouwia thebaica, and

Erucastrum canariense. The tribe Brassiceae is thought to have an ancient hexaploid event, the sister tribe Sisymbrieae represented by Sisymbrium irio does not have this ancient polyploidy event [49]. Lineage III includes four primarily Asian tribes including the tribe Chorisporeae and there are also tribes that have not been assigned

- 109 - to any of these three lineages [54]. The combined and complete data matrix of three single copy nuclear genes (At2g32520, At2g13360, and At5g23290) yielded a single most-parsimonious tree with strong bootstrap support that is consistent with published phylogenies using more taxa and other molecular markers in the Brassicaceae [54-

56]. Individual gene trees gave similar topologies.

Although individual gene trees do not provide sufficient resolution for deep nodes across the seed plant phylogeny, the combined data matrix of thirteen single copy nuclear genes across 69 seed plants yields a phylogenetic tree with good resolution, given the overall size of the data matrix and the amount of overlap in sequence between different taxa. In a general sense, the resulting phylogeny from the combination of these thirteen single copy genes agrees with recent results from large datasets based on whole chloroplast genome sequences [45, 46]. It is expected that improved sampling at basal nodes would improve the resolution of the resulting phylogeny, since current EST resources for these taxa are considerably less than EST resources available for crop species in the Poaceae and eurosids, for which coverage is excellent. An important result from the EST-based phylogenies is that these nuclear genes are a rich source of phylogenetic information – they include a high percentage of variable and parsimony-informative characters, and extended branch lengths at deeper nodes within the trees (Table 4-3). Although the dataset presented in this study is limited to small number of genes the methodology presented of using shared single copy genes for phylogenetics using sequences from public databases could be expanded to utilize all of the shared single copy nuclear genes with automation of the alignment and sequence selection methods. It is anticipated that in contrast to other

- 110 - studies which have used alternative clustering methods and have incorporated sequences for which paralogs are present, high-throughput phylogenetics using shared single copy genes will exhibit a stronger signal to noise ratio and minimize conflict within the analysis.

Conclusions The identification of shared single copy nuclear genes is important for many reasons, given their unique status in genomes full of paralogs. Even though their origin and molecular evolution are of continuing interest, shared single copy nuclear genes have a series of immediate, valuable applications as mapping markers, quality control for genomic libraries, and phylogenetic markers. As single copy loci, these genes can provide important genomic landmarks for mapping of other genes or features in the genome, and can be used to identify the amount of coverage provided by a given genomic library. However, their potentially highest impact application is for molecular phylogenetics within the angiosperms, and perhaps deeper since most are found in a diverse set of angiosperm EST studies (Supplementary Table 4-3) and approximately half can be detected in the non-seed plant lineages (Supplementary Table 4-1).

Since most current molecular systematic studies in angiosperms are based on cpDNA and non-coding nuclear sequences, it is important to know whether protein- coding nuclear genes provide complementary support to these studies, or conflict. The prevalence of paralogs in angiosperm genomes has deterred researchers from using nuclear genes in phylogenetic reconstruction in angiosperms on a regular basis because of the difficulty of isolating all members of a gene family in each taxon for accurate phylogenetic reconstruction [4]. Given the dynamic evolutionary history of plant nuclear

- 111 - genes as a result of recombination, duplication and endosymbiotic gene transfer, it is unclear whether phylogenies based on nuclear-encoded proteins will be congruent with phylogenies based on chloroplast-encoded proteins. However, the results from this study show no compelling evidence of a conflict between phylogenies derived using chloroplast data and the phylogenies presented. By comparing and combining data from single copy nuclear genes with chloroplast genes, we will have a better understanding of the diversification of flowering plants.

Shared single copy nuclear genes are an important tool for understanding angiosperm phylogeny and are applicable at a wide range of taxonomic levels. Both coding and intronic sequences from these genes are phylogenetically informative and genes can be easily amplified and sequenced using RT-PCR or PCR. It is now necessary to expand our knowledge of these genes by sampling them in diverse lineages in order to fully appreciate their phylogenetic utility.

Methods

Identification of shared single copy nuclear genes in sequenced plant genomes We identified shared single copy nuclear genes using a high throughput comparative genomic approach, using available data from the Arabidopsis, Populus, Vitis and Oryza genomes. The Arabidopsis, Populus, Vitis and Oryza proteomes were compared against each other using BLASTP [70]. Genes were clustered into ‘tribes’, an approximation of gene families, using TRIBE MCL [71-73] at medium stringency.

Shared single copy genes were identified as a unique cluster from TRIBE MCL clustering of two or more proteomes that included a single protein from each genome included in the analysis.

- 112 - Identification of shared single copy nuclear genes in non-seed plants To identify the presence of shared single copy genes in non-seed plants, we used the PlantTribes database (insert website) to identify shared single copy genes in

Selaginella, Physcomitrella, Arabidopsis, Populus, Vitis and Oryza. In addition, as the presence of genes in the Physcomitrella, Selaginella and Chlamydomonas genomes that are shared single copy genes in Arabidopsis, Populus, Vitis, and Oryza was also identified. The PlantTribes database identifies clusters of genes from sequenced genomes through TRIBEMCL clustering of BLASTP searches against various combinations of plant genomes [71-73]. The presence of shared single copy genes in other non-seed plant lineages were detected by using TBLASTX of all shared single copy Arabidopsis,

Populus, Vitis and Oryza protein sequences against the Plant Transcript Assemblies

(http://plantta.tigr.org/).

Characterization of shared single copy nuclear genes in Arabidopsis To broadly characterize the shared single copy genes shared between Arabidopsis,

Populus, Vitis and Oryza, we analyzed the genes based on their attributes in Arabidopsis for the following characteristics: 1) slim Gene Ontology (GO) categories for biological processes, molecular function, and cellular location; 2) cDNA length; 3) number of exons; and 4) number of domains as compared to all other genes in the Arabidopsis genome. For GO classifications we compared the APVO single copy genes against all other genes using χ2 tests to detect overrepresentation and underrepresentation of single copy genes in the slim GO categories as compared to all other genes, using a p-value cutoff value of 0.05, adjusted using a Bonferroni correction for multiple tests for each separate category within the GO classification (cellular location, molecular function, biological process). Differences in coding sequence length, number of exons, and

- 113 - number of domains between shared single copy genes and all other genes were tested using a Student’s t-test and a p-value cutoff of 0.01.

EST-based angiosperm phylogeny using shared single copy nuclear genes In order to explore the phylogenetic utility of conserved single copy nuclear genes, eighteen genes from the Arabidopsis and Oryza shared single copy PlantTribes with good taxonomic sampling of full-length sequences across the angiosperms were selected for further study (Table 3). When available, cDNA clones from select taxa

(Acorus americanus, Amborella trichopoda, Cucumis sativa, Eschscholzia californica,

Liriodendron tulipifera, Nuphar advena, Persea americana, Ribes americanus, Saruma henryii, Vaccinium corombosa, Welwitschia mirabilis, Yucca filamentosa, Zamia fisherii) were sequenced on both strands to obtain a high-quality finished sequence for phylogenetic analysis. Internal primers were designed using MacVector (Accelrys, San

Diego, CA) when necessary. Additional sequences for phylogenetic analysis were obtained from the Plant Transcript Assemblies [74] by using TBLASTX [75] to identify putative homologs using both the Arabidopsis and Oryza genes as query sequences.

Amino acid sequences were aligned using MUSCLE [76] and nucleotide alignments were created by forcing the DNA sequence onto the protein alignment using

CLUSTALW [77]. Alignments were manually inspected and adjusted. Sequences were eliminated from the alignments according to the following criteria: 1) the sequence was from a non-seed plant such as a fern, moss or an alga; 2) the sequence contained at least five ambiguous bases (N); 3) the sequence had low sequence similarity compared to the rest of the alignment; 4) 50 or more bp were missing from either the 5’ or 3’ end of the alignment, anchored by the start codon at the 5’ end and the stop codon at the 3’ end; 5)

- 114 - the sequence was identical to another sequence (preference given to transcript assemblies over EST singletons); 6) the sequence was from a hybrid for which close relatives were also included in the alignment. Gaps shared between species were included in the final alignments, whereas gaps that occur only in a single sequence were removed. A concatenated alignment of 13 conserved single copy nuclear genes was constructed using

69 taxa using the following criteria: 1) genes were included if the single gene phylogenies presented no evidence of shared duplication events; 2) taxa were included if at least 6 of the gene sequences were available for that taxon; 3) a single sequence for each species was selected with preference for plant transcript assemblies with the greatest amount of coverage and the shortest branch length. Shared duplication events and identical sequences were identified through the examination of a neighbor-joining phylogenetic tree and accompanying distance matrix that was generated using Jukes-Cantor distance in

PAUP 4.0b10 [78] applied to the nucleotide alignment for each individual after non-seed plant sequences, truncated sequences, low similarity sequences, and sequences with more than 5 ambiguous bases were eliminated from the alignment. Nucleotide alignments after all criteria were applied were used to produce phylogenies using maximum parsimony

(MP) and maximum likelihood (ML). Appropriate models of sequence evolution for ML analyses were determined using MODELTEST using Aikake’s Information Criteria [79].

Maximum parsimony searches and bootstrapping were completed using a heuristic search in PAUP 4.0b10 [78] with 500 ratchet iterations as implemented by PRAP [80] with ten random sequence addition replications and tree-bisection-reconnection (TBR) branch swapping. Maximum likelihood searches and bootstrapping were completed in GARLI

0.951 using default settings for the genetic algorithm [81]. 1000 MP and 100 ML

- 115 - bootstrapping replicates using the same heuristic search parameters and model of sequence evolution were performed for each of the eighteen single copy PlantTribes examined, as well as the concatenated alignment of thirteen genes.

Using shared single copy nuclear genes for phylogeny in the Brassicaceae Ten taxa were selected in Brassicaceae to span the major lineages and to sample more intensively in the tribe Brassiceae that is thought to have an additional ancient polyploidy event. Five tribes were selected because of their known relative relationships to one another because the relationships between most of the other tribes remain relatively unresolved. The outgroup taxa selected was Medicago truncatula within the family Fabaceae. The family Cleomaceae has been firmly established as the family sister to Brassicaceae [50, 82] and Medicago truncatula was selected as an additional outgroup which is positioned within the rosids clade.

Total RNA was extracted from approximately 250 mg fresh young leaf tissue using the Invitrogen Homogenizer and the Micro- to Midi- Total RNA Purification

System and then converted into cDNA using the Invitrogen SuperScript III First-Strand

Synthesis System for RT-PCR kit (Invitrogen, Carlsbad, California) following all manufacturer's instructions. Polymerase chain reactions (PCR) amplifications were performed using degenerate primers designed by eye based on the EST alignments,

Eppendorf Triplemaster PCR system, and the Eppendorf Mastercycler epgradient S thermocycler (Eppendorf, Hamburg, Germany) and following Eppendorf's high fidelity

PCR protocols using a 60-62ºC primer annealing temperature. Amplicons were analyzed by gel electrophoresis, purified using the Invitrogen Purelink PCR Purification kit, and than direct sequenced at the Core facility at the University of Missouri-Columbia.

- 116 - The sequences (1594 characters with 320 Parsimony informative) were automatically assembled using the software SeqMan (DNA Star, Madison, Wisconsin), aligned using Megalign (DNA Star), and further aligned by eye using Se-Al v2.0a11 [83].

Maximum parsimony analyses of the alignments were conducted using PAUP 4.0b10

[78]. The most parsimonious trees were generated using the heuristic search algorithm, equal weighted characters, gaps treated as missing data, with tree bisection-reconnection

(TBR) branch swapping, 1000 random additions of the sampled taxa, and 10 trees saved per replicate and a strict consensus tree was computed. Bootstrap analyses were performed with a 1000 replicates, full heuristic search, TBR branch swapping and simple sequence addition.

Acknowledgements The authors would like to thank Kevin Beckmann for preliminary analysis of shared single copy genes in Arabidopsis and Oryza. This research was supported by: a graduate research award from the Botanical Society of America Genetics section and a

NSF DDIG (DEB-0710250) to J.M.D.; the Polyploid Genome Project (NSF Plant

Genome Award #0501712) and Brassica Genome Project (NSF Plant Genome

Comparative Sequencing award #0638536) to JCP; the Floral Genome Project (NSF Plant

Genome Award DBI-0115684) and Ancestral Angiosperm Genome Project (NSF Plant

Genome Comparative Sequencing DEB-0638595) to CWD.

- 117 - Additional files Additional file 4-1 – Supplementary_Table_4-1.xls Identification of all shared single copy nuclear genes among the Arabidopsis, Populus, Vitis and Oryza genomes, as well as the Physcomitrella and Selaginella genomes. Counts for each gene in each genome is provided. Locus ID in Arabidopsis, as well as TAIR8 annotation for symbol, description and slim GO annotation in terms of molecular function, cellular location and biological process are also provided. All tribes can be located in the PlantTribes database at fgp.bio.psu.edu/tribe.php by searching with the appropriate Arabidopsis locus ID. This file is available at http://fgp.huck.psu.edu/jduarte/Supplementary_Table_4-1.xls.

Additional file 4-2 – Supplementary_Table_4-2.xls Coding sequence length, number of exons and number of domains for all Arabidopsis proteins according to the TAIR8 annotation, with genes identified as to whether they are a member of the APVO PlantTribes. This file is available at http://fgp.huck.psu.edu/jduarte/Supplementary_Table_4-2.xls.

Additional file 4-3 – Supplementary_Table_4-3.xls BLASTx hits to the Arabidopsis, Populus. Vitis and Oryza conserved single copy nuclear genes for all plants in TIGR Plant Transcript Assemblies that have more than 5 hits to the APVO single copy PlantTribes, showing taxonomic grouping as well as gene counts. This file is available at http://fgp.huck.psu.edu/jduarte/Supplementary_Table_4-3.xls

Additional file 4-4 – EST_Trees.zip ML and MP trees for the 18 single copy genes based on alignments of EST and finished cDNA sequences are provided as well as the ML and MP bootstrap consensus trees. MP bootstrap consensus trees and all ML trees are available as PDF files and the collection of MP trees for each alignment is available as a .tre file. This file is available at http://fgp.huck.psu.edu/jduarte/EST_Trees.zip

Additional file 4-5 – EST_Alignments.nex Alignments for the 18 single copy genes using seed plant ESTs and finished cDNAs are provided in NEXUS format, in addition to the concatenated alignment of 13 single copy genes. This file is available at http://fgp.huck.psu.edu/jduarte/EST_Alignments.nex

Additional file 4-6 – Brassica_alignments.nex Alignments used for Brassicaceae phylogeny. This file is available at http://fgp.huck.psu.edu/jduarte/Brassica_alignments.nex

- 118 - References 1. Blanc G, Wolfe KH: Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. The Plant Cell 2004, 16(7):1667-1678. 2. Cui L, Wall PK, Leebens-Mack JH, Lindsay BG, Soltis DE, Doyle JJ, Soltis PS, Carlson JE, Arumuganathan K, Barakat A et al: Widespread genome duplications throughout the history of flowering plants. Genome Res 2006, 16:738-749. 3. Schlueter JA, Dixon P, Granger C, Grant D, Clark L, Doyle JJ, Shoemaker RC: Mining EST databases to resolve evolutionary events in major crop species. Genome 2004, 47:868-876. 4. Small RL, Cronn RC, Wendel JF: Use of nuclear genes for phylogeny reconstruction in plants. Australian Systematic Botany 2004, 17:145-170. 5. Strand AE, Leebens-Mack J, Milligan BG: Nuclear DNA-based markers for plant evolutionary biology. Molecular Ecology 1997, 6(2 %R doi:10.1046/j.1365-294X.1997.00153.x):113-118. 6. Wu F, Mueller LA, Crouzillat D, Petiard V, Tanksley SD: Combining bioinformatics and phylogenetics to identify large sets of single-copy orthologous genes (COSII) for comparative, evolutionary and systematic studies: A test case in the Euasterid plant clade. Genetics 2006, 174(3):1407- 1420. 7. Chapman BA, Bowers JE, Feltus FA, Paterson AH: Buffering of crucial functions by paleologous duplicated genes may contribute cyclicality to angiosperm genome duplication. Proceedings of the National Academy of Sciences USA 2006, 103(8):2730-2735. 8. Blanc G, Hokamp K, Wolfe KH: A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res 2003, 13(2):137-144. 9. Simillion C, Vandepoele K, Van Montagu MCE, Zabeau M, Van de Peer Y: The hidden duplication past of Arabidopsis thaliana. Proceedings of the National Academy of Sciences USA 2002, 99(21):13627-13632. 10. Vandepoele K, Simillion C, Van de Peer Y: Detecting the undetectable: uncovering duplicated segments in Arabidopsis by comparison with rice. Trends in Genetics 2002, 18(12):606-608. 11. Vision TJ, Brown DG, Tanksley SD: The origins of genomic duplications in Arabidopsis. Science 2000, 290(5499):2114-2117. 12. Paterson AH, Bowers JE, Chapman BA: Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proceedings of the National Academy of Sciences USA 2004, 101:9903-9908. 13. Yu W, J., Lin, W., Li, S., Li, H., Zhou, J., al., et.,: The Genomes of Oryza sativa: a history of duplications. PLoS Biology 2005, 3:e38. 14. Tuskan GA, DiFazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam N, Ralph S, Rombauts S, Salamov A et al: The Genome of Black Cottonwood, Populus trichocarpa (Torr. & Gray). Science 2006, 313(5793):1596-1604. 15. Characterization TF-IPCfGG: The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 2007, in press.

- 119 - 16. De Bodt S, Maere S, Van de Peer Y: Genome duplication and the origin of angiosperms. Trends in Ecology and Evolution 2005, 20:591-597. 17. Kim S, Soltis PS, Wall PK, Soltis DE: Phylogeny and domain evolution in the APETALA2-like gene family. Molecular Biology and Evolution 2006, 23(107- 120). 18. Kramer EM, Jaramillo MA, Di Stilio VS: Patterns of gene duplication and functional evolution during the diversification of the AGAMOUS subfamily of MADS box genes in angiosperms. Genetics 2004, 166(2):1011-1023. 19. Zahn LM, Leebens-Mack J, dePamphilis CW, Ma H, Theissen G: To B or not to B a flower: the role of DEFICIENS and GLOBOSA orthologs in the evolution of the angiosperms. Journal of Heredity 2005, 96:225-240. 20. Fares MA, Byrne KP, Wolfe KH: Rate asymmetry after genome duplication causes substantial long-branch attraction artifacts in the phylogeny of Saccharomyces species. Molecular Biology and Evolution 2006, 23(2):245-253. 21. Aoki S, Uehara K, Imafuku M, Hasebe M, Ito M: Phylogeny and divergence of basal angiosperms inferred from APETALA3- and PISTILLATA-like MADS-box genes. Journal of Plant Research 2004, 117(3):229-244. 22. Bailey CD, Doyle JJ: Potential phylogenetic utility of the low-copy nuclear gene pistallata in dicotyledonous Pplants: Comparison to nrDNA ITS and trnL intron in Sphaerocardamum and other Brassicaceae. Molecular Phylogenetics and Evolution 1999, 13(1):20-30. 23. Emshwiller E, Doyle JJ: Chloroplast-expressed glutamine synthetase (ncpGS): Potential utility for phylogenetic studies with an example from Oxalis (Oxalidaceae). Molecular Phylogenetics and Evolution 1999, 12(3):310. 24. Fortune PM, Schierenbeck KA, Ainouche AK, Jacquemin J, Wendel JF, Ainouche ML: Evolutionary dynamics of Waxy and the origin of hexaploid Spartina species (Poaceae). Molecular Phylogenetics and Evolution 2007, 43(3):1040-1055. 25. Grob GBJ, Gravendeel B, Eurlings MCM: Potential phylogenetic utility of the nuclear FLORICAULA/LEAFY second intron: comparison with three chloroplast DNA regions in Amorphophallus (Araceae). Molecular Phylogenetics and Evolution 2004, 30(1):13. 26. Huang S: Phylogenetic analysis of te acetyl-CoA carboxylase and 3- phosphoglycerate kinase loci in wheat and other grasses. Plant Molecular Biology 2002, 48:805-820. 27. Lewis CE, Doyle JJ: Phylogenetic utility of the nuclear gene malate synthase in the palm family (Arecaceae). Molecular Phylogenetics and Evolution 2001, 19(3):409-420. 28. Lohne C, Borsch T: Molecular evolution and phylogenetic utility of the petD Group II intron: A case study in basal angiosperms. Molecular Biology and Evolution 2005, 22(2):317-332. 29. Mason-Gamer RJ, Weil CF, Kellogg EA: Granule-bound starch synthase: structure, function, and phylogenetic utility. Molecular Biology and Evolution 1998, 15(12):1658-1673. 30. Mathews S, Mason-Garner RJ, Spangler RE, Kellogg EA: Phylogeny of the Andropogoneae inferred from phytochrome B, GBSSI, and ndhF.

- 120 - International Journal of Plant Sciences 2002, 163:441-450. 31. Oh S-H, Potter D: Phylogenetic utility of the second intron of LEAFY in Neillia and Stephanandra (Rosaceae) and implications for the origin of Stephanandra. Molecular Phylogenetics and Evolution 2003, 29(2):203. 32. Shaoxing H, Anchalee S, Justin DF, Xiujuan S, Bikram SG, Robert H, Piotr G: Phylogenetic analysis of the acetyl-CoA carboxylase and 3-phosphoglycerate kinase loci in wheat and other grasses. Plant Molecular Biology 2002, V48(5):805. 33. Feldman M, Liu B, Segal G, Abbot S, Levy AA, Vega JM: Rapid elimination of low-copy DNA sequences in polyploid wheat: A possible mechanism for differentiation of homoeologous chromosomes. Genetics 1997, 147:1381-1387. 34. Rottmann WH, Meilan R, Sheppard LA, Brunner AM, Skinner JS, Ma C, Cheng S, Jounain L, Pilate G, Strauss SH: Diverse effects of overexpression of LEAFY and PTLF, a poplar (Populus) homolog of LEAFY/FLORICAULA, in transgenic poplar and Arabidopsis. The Plant Journal 2000, 22(3):235-245. 35. Nilsson O, Lee I, Blazquez MA, Weigel D: Flowering-time genes modulate the response to LEAFY activity. Genetics 1998, 150(1):403-410. 36. Bomblies K, Wang RL, Ambrose BA, Schmidt RJ, Meeley RB, Doebley J: Duplicate FLORICAULA/LEAFY homologs zfl1 and zfl2 control inflorescence architecture and flower patterning in maize. Development 2003, 130(11):2385-2395. 37. Wada M, Cao Q-f, Kotoda N, Soejima J-i, Masuda T: Apple has two orthologues of FLORICAULA/LEAFY involved in flowering. Plant Molecular Biology 2002, 49(6):567-577. 38. Baldauf SL, Roger AJ, Wenk-Siefert I, Doolittle WF: A kingdom-level phylogeny of eukaryotes based on combined protein data. Science 2000, 290:972-977. 39. Bhattacharya D, Yoon HS, Hackett JD: Photosynthetic eukaryotes unite: endosymbiosis connects the dots. Bioessays 2004, 26:50-60. 40. Wendel JF, Doyle JJ: Phylogenetic incongruence: Window into genome history and molecular evolution. In: Molecular Systematics of Plants II. Edited by Soltis PS, Soltis DE; 1998: 265-296. 41. Mathews S, Donoghue M-J: The root of angiosperm phylogeny inferred from duplicate phytochrome genes. Science 1999, 286(5441):947-950. 42. Mathews S, Donoghue M-J: Basal angiosperm phylogeny inferred from duplicate phytochromes A and C. International Journal of Plant Sciences 2000, 161(6 Supplement):S41-S55. 43. Iwabe N, Kuma K-I, Hasegawa M, Osawa S, Miyata T: Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes from phylogenetic trees of duplicated genes. Proceedings of the National Academy of Sciences USA 1989, 86:9355-9359. 44. Baldauf SL: The deep roots of eukaryotes. Science 2003, 300:1703-1706. 45. Jansen RK, Cai Z, Raubeson LA, Daniell H, dePamphilis CW, Leebens-Mack J, Muller KF, Guisinger-Bellian M, Hansen AK, Chumley TW et al: Analysis of 81 genes from 64 plastid genomes resolves relationships in angiosperms and identifies genome-scale evolutionary patterns. Proceedings of the National

- 121 - Academy of Sciences USA 2007, 104(49):19369-19374. 46. Moore MJ, Bell CD, Soltis PS, Soltis DE: Using plastid genome-scale data to resolve enigmatic relationships among basal angiosperms. Proceedings of the National Academy of Sciences USA 2007, 104(49):19363-19368. 47. Bowers JE, Chapman BA, Rong JK, Paterson AH: Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 2003, 422(6930):433-438. 48. Lukens LN, Quijada PA, Udall J, Pires JC, Schranz ME, Osborn TC: Genome redundancy and plasticity within ancient and recent Brassica crop species. Biological Journal of the Linnean Society 2004, 82(4):665-674. 49. Lysak MA, Koch MA, Pecinka A, Schubert I: Chromosome triplication found across the tribe Brassiceae. Genome Res 2005, 15(4):516-525. 50. Schranz ME, Mitchell-Olds T: Independent ancient polyploidy events in the sister families Brassicaceae and Cleomaceae. The Plant Cell 2006, 18(5):1152- 1165. 51. Maere S, De Bodt S, Raes J, Casneuf T, Van Montagu M, Kuiper M, Van de Peer Y: Modeling gene and genome duplications in eukaryotes. Proceedings of the National Academy of Sciences USA 2005, 102(15):5454-5459. 52. Loehne C, Borsch T: Molecular evolution and phylogenetic utility of the petD group II intron: A case study in basal angiosperms. Molecular Biology and Evolution 2005, 22(2):317-332. 53. Jordan IK, Wolf YI, Koonin EV: Duplicated genes evolve slower than singletons despite the initial rate increase. BMC Evolutionary Biology 2004, 4(22). 54. Al-Shehbaz IA, Beilstein MA, Kellogg EA: Systematics and phylogeny of the Brassicaceae (Cruciferae): an overview. Plant Systematics and Evolution 2006, 259(2-4):89-120. 55. Bailey CD, Koch MA, Mayer M, Mummenhoff K, O'Kane SL, Warwick SI, Windham MD, Al-Shehbaz IA: Toward a global phylogeny of the Brassicaceae. Molecular Biology and Evolution 2006, 23(11):2142-2160. 56. Beilstein MA, Al-Shehbaz IA, Kellogg EA: Brassicaceae phylogeny and trichome evolution. American Journal of Botany 2006, 93(4):607-619. 57. Hilu KW, Borsch T, Muller K, Soltis DE, Soltis PS, Savolainen V, Chase MW, Powell M, Alice LA, Evans R et al: Angiosperm phylogeny based on matK sequence information. American Journal of Botany 2003, 90(1758-1776). 58. Soltis D-E, Soltis P-S, Chase M-W, Mort M-E, Albach D-C, Zanis M, Savolainen V, Hahn W-H, Hoot S-B, Fay M-F et al: Angiosperm phylogeny inferred from 18S rDNA, rbcL, and atpB sequences. Botanical Journal of the Linnean Society 2000, 133(4):381-461. 59. Jansen RK, Kaittanis C, Saski C, Lee SB, Tomkins J, Alverson AJ, Daniell H: Phylogenetic analyses of Vitis (Vitaceae) based on complete chloroplast genome sequences: effects of taxon sampling and phylogenetic methods on resolving relationships among rosids. BMC Evolutionary Biology 2006, 6:32. 60. William Roy S, Gilbert W: The evolution of spliceosomal introns: patterns, puzzles and progress. Nature Reviews Genetics 2006, 7(3):211-221. 61. Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J: Preservation

- 122 - of duplicate genes by complementary, degenerative mutations. Genetics 1999, 151(4):1531-1545. 62. Birchler JA, Bhadra U, Bhadra MP, Auger DL: Dosage-dependent gene regulation in multicellular eukaryotes: Implications for dosage compensation, aneuploid syndromes, and quantitative traits. Developmental Biology 2001, 234:275-288. 63. Birchler JA, Veitia RA: The gene balance hypothesis: From classical genetics to modern genomics. The Plant Cell 2007, 19:395-403. 64. Birchler JA, Yao H, Chudalayandi S: Biological consequences of dosage dependent gene regulatory systems. Biochimica et Biophysica Acta 2007, 1769:422-428. 65. Veitia RA: Gene dosage balance: deletions, duplications and dominance. Trends in Genetics 2005, 21(1):33-35. 66. Adams KL, Cronn, R., Percifield, R., and J. F. Wendel: Genes duplicated by polyploidy show unequal contributions to the transcriptome and organ- specific reciprocal silencing. Proceedings of the National Academy of Sciences USA 2003, 100:4649-4654. 67. Osborn TC, Pires JC, Birchler JA, Auger DL, Chen ZJ, Lee H, Comai L, Madlung A, Doerge RW, Colot V et al: Understanding mechanisms of novel gene expression in polyploids. Trends in Genetics 2003, 19(3):141-147. 68. Song KM, Lu P, Tang KL, Osborn TC: Rapid genome change in synthetic polyploids of Brassica and its implications for polyploid evolution. Proceedings of the National Academy of Sciences USA 1995, 92:7719-7723. 69. Ohno S: Evolution by gene duplication. New York, New York: Springer; 1970. 70. Altschul S, Madden T, Schaffer A, Zhang JH, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research 1998, 12(8):3389-3402. 71. Enright AJ, Kunin V, Ouzounis CA: Protein families and TRIBES in genome sequence space. Nucleic Acids Research 2003, 31:4632-4638. 72. Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large- scale detection of protein families. Nucleic Acids Research 2002, 30(7):1575- 1584. 73. Wall PK, Leebens-Mack J, Muller K, Field D, Altman N, dePamphilis CW: PlantTribes: A gene and gene family resource for comparative genomics in plants. Nucleic Acids Research 2007. 74. Childs KL, Hamilton JP, Zhu W, Ly E, Cheung F, Wu H, Rabinowicz PD, Town CD, Buell CR, Chan AP: The TIGR Plant Transcript Assemblies database. Nucleic Acids Research 2006, 35:846-851. 75. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of Molecular Biology 1990, 215:403-410. 76. Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004, 5(1):113. 77. Higgins DG, Thompson JD, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 1994, 22:4673-4680.

- 123 - 78. Swofford DL: PAUP*: Phylogenetic Analysis using Parsimony (* and Other methods). In., 4 edn. Sunderland, MA: Sinauer Associates; 2001. 79. Posada D, Crandall KA: Modeltest: testing the model of DNA substitution. Bioinformatics 1998, 14(9):817-818. 80. Muller K: PRAP-computation of Bremer support for large data sets. Molecular Phylogenetics and Evolution 2004, 31:780-782. 81. Zwickl DJ: Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. Austin, TX: University of Texas; 2006. 82. Hall JC, Sytsma KJ, Iltis HH: Phylogeny of Capparaceae and Brassicaceae based on chloroplast sequence data. American Journal of Botany 2002, 89(11):1826-1842. 83. Rambaut A: SE-AL Sequence Alignment Editor. In., v2.0a11 edn. Oxford, UK: University of Oxford; 2002.

- 124 - Chapter 5

A Marriage of High-throughput Identification of Phylogenetic Markers with High-throughput Sequencing: Next-generation Phylogenomics using 454 Transcriptomes

Preface This is a manuscript in preparation for submission and is currently formatted for

BMC Evolutionary Biology. Major contributors are Jill M. Duarte, P. Kerr Wall, Jim

Leebens-Mack, J. Chris Pires, and Claude W. dePamphilis. JMD constructed the cDNA transcriptome libraries in Medicago truncatula and Brassica oleracea, performed the phylogenetic analysis and characterization of the single copy genes, and wrote the manuscript. PKW provided bioinformatics services and developed software, primarily for the assembly of the alignments for the six genomes. JLM, JCP, and CWD conceived of the study and provided financial support.

125 A Marriage of High-throughput Identification of Phylogenetic Markers with High-throughput Sequencing: Next-generation Phylogenomics using 454 Transcriptomes

Duarte, J.M. 1, Wall, P. K. 1, Leebens-Mack, J. H.2, Pires, C. P.3, dePamphilis, C. W.1*

1 Department of Biology and Huck Institutes of the Life Sciences, The Pennsylvania State

University, University Park, PA 16802

2 Department of Plant Biology, University of Georgia, Athens, GA 30602-7271

3 Division of Biological Sciences, University of Missouri-Columbia, Columbia, MO

65211-7310

*Corresponding author

E-mail addresses:

JMD: [email protected]

PKW: [email protected]

JLM: [email protected]

JCP: [email protected]

CWD: [email protected]

126 Abstract

Background Traditional molecular systematics has relied on the sequencing of a limited number of phylogenetic markers for the resolution of organismal phylogenies. However, large datasets are increasingly necessary as researchers attempt to resolve complex phylogenetic questions. A potential source for generating large datasets in an efficient manner is through the application of next-generation sequencing technology such as pyrosequencing and other solid-phase sequencing technologies. We describe a method to generate a large amount of phylogenetically-informative single copy nuclear protein- coding sequence using next-generation sequencing technology. To demonstrate the feasibility of this methodology, we describe a pilot project that investigates the utility of next-generation transcriptome sequencing in two species to provide a large dataset of orthologous sequences for phylogenetic analysis, laid on a foundation of sequenced genomes.

Results Young leaf transcriptomes sequenced using a half-plate of 454 GS FLX sequencing tagged ~15,000 and ~11,000 Arabidopsis genes in Brassica oleracea and

Medicago truncatula, respectively. Out of 948 single copy genes shared between

Arabidopsis thaliana, Populus trichocarpa, Vitis vinifera and Oryza sativa, 606 were tagged in both Brassica and Medicago. The resulting alignment including Arabidopsis,

Brassica, Medicago, Populus, Vitis and Oryza sequences contained 1,298,586 nucleotides when missing data was included and 58,968 nucleotides when missing data was excluded.

127 Conclusions This study illustrates how 454 transcriptome sequencing can be used as an effective tool for the generation of large quantities of phylogenetically-informative sequence, in the form of putative single-copy, protein-coding nuclear genes.

Introduction Systematics was transformed by the application of molecular sequences to decipher organismal relationships by using orthologous sequences to assess the relatedness of different species. Traditional molecular systematics has relied on the individual amplification and sequencing of one or more genes to generate sufficient data for the resolution of organismal relationships. However, in recent years researchers have applied ‘genomic’ solutions to phylogenetic questions, such as the use of whole chloroplast genome sequences to resolve angiosperm phylogeny [1-8] or using the vast amounts of publicly-available sequences through the application of sophisticated bioinformatic analysis [9-12]. Recent advances in sequencing technology have enabled researchers to generate large amounts of sequence data with a significant decrease in cost through the use of technologies that we collectively describe as ‘next-generation’ sequencing technology [13]. In this study, we describe and provide preliminary data to support a methodology that combines the high-throughput identification of orthologous genes with the sampling of those genes via next-generation transcriptome sequencing using 454 sequencing.

Benefits of using single copy nuclear genes Molecular systematics traditionally uses orthologous genes from a variety of species in order to infer organismal relationships. Orthologous genes in organisms are

128 differentiated through the process of speciation, whereas paralogous genes in an organism are differentiated through the process of duplication. Traditionally, organellar or rRNA genes have been used for molecular systematics, given their ease of amplification using

PCR from DNA and the ability to assume orthologous relationships between genes sequenced from different species, given either the presence of a single locus in the organellar genome or a homogenous array of genes, as in the case of rRNA. Previous research supports the use of orthologous sequences for systematics, as the presence of paralogous sequences in datasets can complicate phylogenetic reconstruction and can introduce artifacts into the analysis [12, 14]. Orthology is more difficult to assess in nuclear genomes given the presence of gene families that have proliferated through duplicative processes, which are more prevalent in nuclear genomes compared to organellar genomes. However, it has been suggested in the literature that there is a strong need to test phylogenetic hypotheses by using independent data sets that are not necessarily subject to the same evolutionary forces as organellar or rRNA genes [15-19].

Other potential benefits to using protein-coding nuclear genes involve: ability to use multiple independently assorted loci; recombination, biparental inheritance; variation in substitution rates [18, 19]. Additionally, there is much more potential phylogenetic information present in the nuclear genome compared to organellar genomes, given size and content of nuclear genomes compared to organellar genomes.

Utility of Expressed Sequence Tags (ESTs) for phylogenomics The gene discovery process has been greatly aided by the development of

Expressed Sequence Tags (ESTs), which enable high-throughput sequencing and characterization of the expressed genes within a designated biological sample, by

129 sequencing cloned fragments of cDNA from isolated mRNA. This approach to sampling genes is in sharp contrast to traditional molecular systematics in which a small number of genes, such as 18S rRNA, coxI, and rbcL, are sequenced in a wide range of taxa using

PCR, molecular cloning, and double-stranded Sanger sequencing. It should also be noted that molecular systematics has primarily used rRNA and organellar genes as phylogenetic markers in almost all eukaryotic lineages, foregoing the potential utility of nuclear protein-coding genes because of uncertainties concerning gene family evolution and the establishment of strict orthologous relationships between genes sequenced from different species [18, 19]. In contrast to traditional phylogenetic markers, EST sequences have been successfully used to infer phylogeny and for the identification of potential phylogenetic markers in both animal and plant lineages [9, 10, 12, 15, 20-22]. .

However, phylogenetic markers identified in this manner have been limited in number and have not been generally applied beyond the set of species included in the studies.

Phylogenetic markers identified from EST studies result from the clustering of pre- existing or de novo EST sequences from traditional Sanger sequencing [10, 15, 20, 21].

However, there is no control for actual copy number other than the number of copies retrieved during the EST sequencing effort, which is typically shallow unless pre-existing

EST databases are used for the analysis. Therefore, the overall orthology of the sequences used in these analyses is undetermined unless comparison is made to a genomic sequence. However, even the presence of a single homolog in a related sequenced genome is not deterministic of copy number in another lineage, since independent duplication events may have occurred since the divergence of the organism for which a sequenced genome is available and the lineage of interest. Therefore,

130 methodologies that utilize ESTs as a source of sequence information should independently identify putatively orthologous sequences based on sequenced genomes to reduce phylogenetic artifacts that could be introduced by using genes that are typically members of gene families.

Benefits of 454 transcriptome sequencing compared to Sanger

EST sequencing

In 2005, 454 Life Sciences made available a massively parallel sequencing by synthesis technology that utilizes a pyrosequencing protocol optimized for picolitre-scale volumes capable of sequencing 25 million bases with 99% accuracy in a single 4 hr run

[13]. The current generation of this technology, the GS FLX, is capable of over 400,000 reads per sequencing plate with an average read length of 200-300 bp (Roche Life

Sciences). 454 sequencing has been successfully used to sequence bacterial and organellar genomes, single nucleotide polymorphisms (SNPs), small RNAs and transcriptomes [23-30]. A significant issue with the use of EST sequence for phylogenetics is accounting for the increased rate of sequencing error in ESTs (2-3%) compared to the double-stranded finished sequencing that is typically required for traditional molecular systematics. Sequencing error can be controlled in EST sequencing through increased coverage. The estimated sequencing error associated with pyrosequencing is 0.04% using the previous generation GS 20 sequencer and typical sequencing errors are limited to the length of mononucleotide repeats [29]. Estimated single-read accuracy is greater than 99.5% over 200 bases and accuracy is increased to greater than 99.99% in assembled contigs using the GS 20 sequencer [31]. Previous research has also shown that 454 transcriptome sequencing is capable of sequencing rare

131 transcripts that were not previously sequenced through extensive Sanger EST sequencing

[28]. Overall, 454 transcriptome sequencing is a rich resource for the accurate sequencing and cataloging of thousands of genes for a nominal price compared to traditional Sanger EST sequencing.

Methods

Library construction Brassica oleracea inbred line DH1000 and Medicago truncatula ecotype

Jemalong A17 plants used in this study were grown from seed under greenhouse conditions at The Pennsylvania State University with 16 h light/ 8 h dark and approximately 23°C. Young leaves with a surface area of no more 25% of the fully mature leaf surface area were pooled from independent plants and snap-frozen in liquid nitrogen immediately after harvest. Young leaf tissue was selected as an ideal tissue for this study given the ease of sampling young leaf tissue, the typical RNA yield, and the expected representation of shared single copy nuclear genes in the young leaf transcriptome. Total RNA was isolated using RNAqueous-Midi kit (Ambion, Inc., catalog: #1911) using approximately 120 mg of young leaf tissue according to the manufacturer’s recommendations with modifications as previously reported [32]. Total

RNA was pooled from multiple independent technical replicates of sufficient concentration and quality and concentrated using a sodium acetate precipitation. Total

RNA quantity and quality after initial isolation and after sodium acetate precipitation was determined by micro-capillary electrophoresis using an Agilent Bioanalyzer, according to the manufacturer’s suggested protocol. Two rounds of polyA purification to reduce ribosomal RNA contamination were performed using polyA Purist (Ambion). After the

132 first round of polyA isolation rRNA contamination was 12.8% in Brassica and 14.2% in

Medicago. After the second round of polyA isolation, rRNA contamination was 1.2% and 1.3% in Brassica and Medicago, respectively. 5.6 ug of mRNA was used to build the

Brassica cDNA library. 2.9 ug of mRNA was used to build the Medicago cDNA library.

The cDNA library for each species was generated using the Just cDNA synthesis kit excluding radioactivity and blunting (Stratagene). A half plate of 454 GS FLX sequencing was performed at The Pennsylvania State University for each species.

Unigenes for Brassica and Medicago were assembled from the 454 reads using the

Newbler Assembler software [13].

Mining leaf transcriptome data for phylogenetic data Shared single copy genes were identified as clusters of genes in Arabidopsis,

Populus, Vitis, and Oryza in the PlantTribes database that contained a single gene from each genome (APVO PlantTribes; http://fgpdev.huck.psu.edu/tribedb/index.pl). The

PlantTribes database identifies putative gene families or clusters of genes based on

TribeMCL clustering of the predicted proteins from plant genomes [33-35]. Putative single copy genes were selected as an ideal source of phylogenetic information as these genes can be considered to be putatively orthologous and may reduce the difficulties associated with handling paralogous genes in a phylogenetic analysis [14]. Unigenes generated from the Brassica oleracea and Medicago truncatula leaf transcriptome were sorted into the APVO PlantTribes based on the best BLASTX hit against the 4 plant genomes. Only genes for which there was only a single homologous unigene from both

Brassica and Medicago were used for the phylogenetic analysis. For each PlantTribe, all

133 unigene sequences were analyzed using ESTscan to predict coding sequence and orientation prior to placement in the nucleotide alignment [36].

The amino acid sequences from the four plant genomes from each single copy

PlantTribe were assembled using MUSCLE [37]. Nucleotide sequences from each plant genome as well as the unigenes from Brassica and Medicago were then aligned by forcing the nucleotide alignment to the amino acid alignment for each PlantTribe.

Nucleotide alignments were then concatenated into a single data matrix.

Phylogenetic Analysis Three distinct alignments were used in the phylogenetic analysis: 1) a concatenated alignment of 83 genes from the whole chloroplast genomes of Arabidopsis,

Brassica, Populus, Medicago, Vitis and Oryza that was obtained from ChloroplastDB

[38]; 2) a concatenated alignment of all 918 orthologous single copy genes from the

APVO PlantTribes using sequence information from the Arabidopsis, Populus, Vitis and

Oryza genomes and the Brassica and Medicago 454 transcriptome sequencing performed for this study; and 3) a reduced version of the single copy nuclear alignment previously mentioned with all gaps and missing data excluded from the alignments. Phylogenetic analysis was performed using PAUP 4.0b10 for all maximum parsimony (MP),

Neighbor-Joining (NJ), and maximum likelihood (ML) searches [39]. Maximum parsimony searches were performed using an exhaustive search and with 1000 bootstrap replicates using a heuristic search with ten random sequence additions and TBR branch swapping. A GTR+I+G model of sequence evolution was used for NJ and ML searches, using parameters selected using the Aikake Information Criteria in MODELTEST [40].

NJ and ML searches were performed using a heuristic search with ten random sequence

134 additions and TBR branch swapping, with 1000 bootstrap replicates for each dataset, except for the complete single copy gene dataset, for which only 100 ML bootstrap replicates were performed based on time limitations.

Results

Library statistics 145,565 and 168,899 reads were obtained from the Brassica oleracea and

Medicago truncatula young leaf transcriptomes, respectively, with an average length of

240.9 and 233.7 bp, respectively. A total of ~35 MB and ~39 MB were obtained for

Brassica and Medicago, respectively. Reads were assembled into 34,280 unigenes for

Brassica and 38,777 unigenes for Medicago. 15,291 Arabidopsis genes were tagged in the Brassica transcriptome. 11,071 Arabidopsis genes were tagged in the Medicago transcriptome. Gene Ontology characterization of the transcriptomes according to the

TAIR8 Arabidopsis annotation is shown in Figures 5-1 and 5-2 and is highly similar between the two species.

135

Figure 5-1. Slim GO Annotation of the Brassica oleracea leaf transcriptome. Characterization was performed by identifying the best BLAST hit in Arabidopsis for each of 15,291 Unigenes from the Brassica young leaf transcriptome that have a hit to the Arabidopsis genome.

136

Figure 5-2 Slim GO Annotation of the Medicago truncatula leaf transcriptome. Characterization was performed by identifying the best BLAST hit in Arabidopsis for each of the 11,071 Unigenes from the Medicago young leaf transcriptome that have a hit to the Arabidopsis genome.

137 Presence of shared single copy genes from Arabidopsis, Populus, Vitis and Oryza in de novo sequenced transcriptomes In order to maximize the amount of phylogenetically-informative sequence data produced using a transcriptome approach, one would ideally sample a series of transcriptomes from different species that have a high degree of overlap in the genes sampled and similar expression values. Studies using large phylogenetic datasets derived from publicly available databases and/or EST sequencing have shown that well-resolved phylogenies are not dependent on “complete” alignments, but that alignments can contain a significant amount of missing data and still provide sufficient resolution [9-11]. This indicates that even though using transcriptomes to sample phylogenetic markers is not guaranteed to provide ‘complete’ data sets, the amount of overlap and the amount of phylogenetically-informative sequence generated by the method should be sufficient to provide sufficient resolution of organismal relationships. An estimated 9,125 Arabidopsis genes were tagged in both Brassica and Medicago.

There are 918 APVO PlantTribes that contain a single gene from each genome included in the clustering. The Brassica 454 leaf transcriptome tagged 791 (86%) of the single copy APVO PlantTribes with an average contig length of 136.5 bp whereas the

Medicago 454 leaf transcriptome tagged 698 (76%) of the single copy APVO PlantTribes with an average contig length of 130.1 bp. Sequence coverage of the single copy genes averages 41% in Brassica and 38% in Medicago (Figure 5-3) and is overall summarized in Table 5-1. 606 of the orthologous single copy genes are tagged in both transcriptomes with a total overlap of 58,968 nucleotides. The concatenated nucleotide alignment based on these 948 PlantTribes contains 1,298,586 characters. When the Brassica and

Medicago Unigenes are incorporated into the nucleotide alignment, the resulting

138 alignment contains 1,194,978 characters. When all characters containing a gap or missing data are eliminated from the alignment, the resulting alignment contains 58,968 characters. This illustrates that the contents of the Brassica and Medicago leaf transcriptome overlap to a significant degree such that the reliance on sequencing expressed genes does not have a strong impact on the amount of phylogenetically- informative sequence obtained, even though the overall amount of overlap meets the random expectation of the number of shared sequences between the two transcriptomes.

Figure 5-3. Sequence coverage in orthologous single copy genes in Brassica and Medicago. The distribution of sequence coverage is shown in the above graph. Although the majority of orthologous single copy genes are tagged using a half-plate of 454 GS FLX sequencing, sequence coverage averages ~40% of the total CDS in Arabidopsis. However, a small number of genes (32 in Brassica and 12 in Medicago) are sequenced for more than 95% of the corresponding CDS in Arabidopsis. Sequence coverage is dependent on both CDS length and expression level [41].

139 Coverage Brassica oleracea Medicago truncatula >95% CDS 32 12 >80% CDS 76 45 >60% CDS 165 116 1X or more 791 698 2X or more 606 524 3X or more 459 403 4X or more 342 321

Table 5-1. Coverage of orthologous single copy genes in the Brassica and Medicago 454 leaf transcriptome sequencing. Percent coverage of the CDS (coding sequence) is in terms of overlap with the Arabidopsis CDS. The last 4 rows indicates the cumulative number of 454 reads obtained per Unigene.

Comparison to phylogenetics using whole chloroplast genomes Since the limited taxon sampling used in this study can cause significant artifacts in the phylogenetic outcome, we compared the results of our six taxon analysis with an analogous dataset using a whole chloroplast genome alignment, limited to the six taxa included in our study. A brief comparison of the three alignments analyzed illustrates the difference in the amount of phylogenetic information present (Table 5-2). An important difference to note between the chloroplast and nuclear datasets is the proportion of variable and parsimony-informative characters. Approximately 21% of the chloroplast data is variable, and only 6% of the characters are parsimony-informative. Using the smaller nuclear orthologs alignment in which all missing data and gaps have been removed, the overall number of characters is almost half of the chloroplast alignment, but there are over twice as many variable characters and three times as many parsimony- informative characters in the nuclear-derived alignment compared to the chloroplast- derived alignment (Table 5-2).

140

Alignment # Characters # Var Characters # Parsimony-Informative Characters Chloroplast 91,499 19,115 (20.9%) 5,687 Nuclear Orthologs 1,194,978 605,076 (50.6%) 123,754 Nuclear Orthologs 58,968 33,300 (56.5%) 17,682 (gaps deleted)

Table 5-2. Summary of alignments used for phylogenetic analysis. The chloroplast alignment is a concatenated aiignment of 83 genes from the chloroplast genomes of the 6 taxa studied. The nuclear orthologs alignment is comprised of the 948 single copy APVO PlantTribes with the relevant sequences generated from the de novo transcriptome sequencing from Brassica and Medicago added. #Var indicates the number of variable characters in the alignment, which are those characters that are not uniform across the dataset and provide information for phylogenetic reconstruction using NJ and/or ML methods. #Parsimony-informative characters represents the number of characters for which at least two taxa share one character state and at least two other taxa share a different character state.

A summary of the phylogenetic trees obtained using each dataset and phylogenetic methods can be found in Figure 5-3. None of the trees recover the current hypothesis of organismal phylogeny of (Oryza (Vitis ((Arabidopsis, Brassica), (Populus,

Medicago)))) that is proposed using whole chloroplast genome datasets that utilized a large number of genes and taxa [5, 6, 8]. However, previous studies using different data sets and increased taxon sampling in the core eudicots have not resulted in a firm phylogenetic placement for Vitales, other than it is most likely a sister group to rosids [6,

42-44]. This incongruence with past results is most likely the result of the limited taxon sampling, which is further complicated by the short internal branch lengths. However, unlike the chloroplast-derived dataset, the alignments consisting of orthologous genes from the nuclear genome consistently exhibit the same topology, with very high levels of bootstrap support, irrespectively of the phylogenetic method utilized. The pairing of

Populus and Vitis is also reflected in the overall genome sequences, as proteins from the

141 nuclear genome of Vitis show higher overall levels of amino acid similarity to the

Populus proteome than the other sequenced plant genomes [45]. However, it must be noted that both Arabidopsis and Oryza are both on long branches and have higher substitution rates than Vitis and Populus [6, 45, 46].

Figure 5-4. 6-taxon phylogenetic trees from different datasets and phylogenetic methodologies. Trees associated with the top row are the result from the alignment of 83 genes from the six chloroplast genomes. Trees associated with the middle row are the result from the complete alignment from 4 genomes and the available data from the Brassica and Medicago 454 transcriptomes. Trees associated with the bottom row are the result from the alignment of the 4 genomes and the Brassica and Medicago 454 transcriptomes with all columns containing gaps and missing data removed from the alignment. The first column consists of the maximum parsimony trees, the second column consists of the neighbor-joining trees and the last column consists of the maximum likelihood trees. Bootstrap support is indicated for all trees and Oryza was used as an outgroup to root the trees.

142 Discusssion

Effectiveness of 454 transcriptome sequencing compared to traditional phylogenetic methods Using traditional molecular systematics of sampling individual genes, datasets are severely limited by the cost and labor required to generate the necessary data and therefore contain typically under 10,000 nucleotides, even when concatenated datasets are considered that include genes from both nuclear and organellar genomes. With the advent of sequencing whole chloroplast genomes in plants, datasets have expanded to include between 69 and 81 genes and alignments that contain tens of thousands of nucleotides. The potential dataset from the nuclear genome may contain hundreds of orthologous sequences and millions of nucleotides. Additionally, the preliminary results from this study indicate that proportionally, the nuclear genome is a much richer source of phylogenetic information than the chloroplast genome. Also, in the case of flowering plant phylogenetics, this research strengthens the need to independently test phylogenetic hypotheses using different sources of molecular data, especially the nuclear genome.

With the limited taxon selection in this study, the alignment consisting of single copy orthologs from the nuclear genome resulted in more consistent topologies using different phylogenetic methods than the analogous alignment consisting of genes from the chloroplast genome.

It is expected that other next-generation sequencing technologies such as Solexa sequencing could be successfully used in this methodological framework. Expansion of the sequencing effort as well as read length can also be expected to increase the amount of phylogenetically-informative sequence produced, as the majority of the single copy orthologs were tagged, and therefore present in the leaf transcriptome, but not completely

143 sequenced as a result of the relatively shallow sequencing performed and the read length of the current generation of the 454 sequencing technology. As more taxa are to be included in this type of analysis, it is expected that sequencing efforts must be proportionally increased in order to ensure a reasonable amount of overlap between all transcriptomes sampled.

Application of methodology to other systems Although this study focuses on the application of using next generation sequencing technology to address phylogeny in flowering plants, this methodology of using next generation transcriptome sequencing to simultaneously sequence a large number of orthologous sequences could be applied to any set of organisms for which two or more whole genome sequences are available. It is expected that this methodology would produce more sequence data that could be used for phylogenetic analysis in other systems, given the greater percentage of sequences that can be considered orthologous and the reduced number of genes with paralogs as compared to flowering plants. This approach to phylogenomics is similar to the recent study by Dunn et al, [10] however the generation of data through pyrosequencing instead of traditional Sanger sequencing greatly enhances the amount of phylogenetically-informative data produced for approximately the same amount of labor and money. This methodology should become increasingly more affordable as next-generation sequencing technology decreases in cost and increases in read length occur.

Acknowledgements We would like to thank Zeng-Yu Wang from the Noble Foundation for providing seeds for Medicago truncatula, Tony Omeis for plant care and Lena Landherr for

144 technical assistance in the construction of 454 cDNA libraries. This research was supported in part by a NSF DDIG (DEB-0710250) to JMD and CWD; the Ancestral

Angiosperm Genome Project (NSF Plant Genome Comparative Sequencing DEB-

0638595) to CWD; and by University of Georgia start-up funds to JLM and University of

Missouri-Columbia start-up funds to JCP.

Authors' contributions JMD constructed the cDNA transcriptome libraries in Medicago truncatula and

Brassica oleracea, performed the phylogenetic analysis and characterization of single copy genes identified in the 454 transcriptomes, and wrote the manuscript. PKW provided bioinformatics services and developed software, primarily for the assembly of the alignments for the six genomes and the identification of single copy genes in the 454 transcriptomes. JLM, JCP, and CWD conceived of the study and provided financial support.

Additional Files

Additional File 1 – Supplementary_Table_5-1.xls This table provides information on number of Unigenes, reads, %coverage and length of Unigene for each of the single copy genes based on the results from the de novo 454 leaf transcriptome sequencing performed for this study in Brassica oleracea and Medicago truncatula. This file is available at http://fgp.huck.psu.edu/jduarte/Supplementary_Table_5-1.xls

Additional File 2 - 6cp.nex Alignment of the Arabidopsis, Brassica, Medicago, Populus, Vitis and Oryza whole chloroplast genomes using nucleotide sequences from 81 genes. This file is available at http://fgp.huck.psu.edu/jduarte/6cp.nex

Additional File 3 – 6genomes.nex Alignment of the 959 single copy genes using sequences from the genome of Arabidopsis, Populus, Vitis, and Oryza with sequences from Brassica and Medicago

145 obtained from the 454 leaf transcriptome sequencing. This file is available at http://fgp.huck.psu.edu/jduarte/6genomes.nex

Additional File 4 – 6genomes_nogaps.nex Alignment of the 959 single copy genes using sequences from the genome of Arabidopsis, Populus, Vitis, and Oryza with sequences from Brassica and Medicago obtained from the 454 leaf transcriptome sequencing with all columns containing gaps and missing data removed from the alignment. This file is available at http://fgp.huck.psu.edu/jduarte/6genomes_nogaps.nex

References

1. Cai Z, Penaflor C, Kuehl JV, Leebens-Mack J, Carlson JE, dePamphilis CW, Boore JL, Jansen RK: Complete plastid genome sequences of Drimys, Liriodendron, and Piper: implications for the phylogenetic relationships of magnoliids. BMC Evolutionary Biology 2006, 6:77. 2. Goremykin V-V, Hirsch-Ernst K-I, Woelfl S, Hellwig F-H: Analysis of the Amborella trichopoda chloroplast genome sequence suggests that Amborella is not a basal angiosperm. Molecular Biology and Evolution 2003, 20(9):1499-1505. 3. Goremykin V-V, Hirsch-Ernst K-I, Woelfl S, Hellwig F-H: The chloroplast genome of Nymphaea alba: Whole-genome analyses and the problem of identifying the most basal angiosperm. Molecular Biology and Evolution 2004, 21(7):1445-1454. 4. Goremykin V-V, Holland B, Hirsch-Ernst K-I, Hellwig F-H: Analysis of Acorus calamus chloroplast genome and its phylogenetic implications. Molecular Biology and Evolution 2005, 22(9):1813-1822. 5. Jansen RK, Cai Z, Raubeson LA, Daniell H, dePamphilis CW, Leebens- Mack J, Muller KF, Guisinger-Bellian M, Hansen AK, Chumley TW et al: Analysis of 81 genes from 64 plastid genomes resolves relationships in angiosperms and identifies genome-scale evolutionary patterns. Proceedings of the National Academy of Sciences USA 2007, 104(49):19369-19374. 6. Jansen RK, Kaittanis C, Saski C, Lee SB, Tomkins J, Alverson AJ, Daniell H: Phylogenetic analyses of Vitis (Vitaceae) based on complete chloroplast genome sequences: effects of taxon sampling and phylogenetic methods on resolving relationships among rosids. BMC Evolutionary Biology 2006, 6:32. 7. Leebens-Mack J, Raubeson LA, Cui L, Kuehl JV, Fourcade MH, Chumley T, Boore JL, Jansen RK, dePamphilis CW: Identifying the basal angiosperm node in chloroplast genome phylogenies: Sampling one's way out of the Felsenstein zone. Molecular Biology and Evolution 2005, 22(10):1948-1963.

146 8. Moore MJ, Bell CD, Soltis PS, Soltis DE: Using plastid genome-scale data to resolve enigmatic relationships among basal angiosperms. Proceedings of the National Academy of Sciences USA 2007, 104(49):19363-19368. 9. de la Torre JE, Egan MG, Katari M, Brenner ED, Stevenson DW, Coruzzi GM, Desalle R: ESTimating plant phylogeny: lessons from partitioning. BMC Evolutionary Biology 2006, 6:48. 10. Dunn CW, Hejnol A, Matus DQ, Pang K, Browne WE, Smith SA, Seaver E, Rouse GW, Obst M, Edgecombe GD et al: Broad phylogenomic sampling improves resolution of the animal tree of life. Nature 2008. 11. Sanderson MJ, Driskell AC, Ree RH, Eulenstein O, Langley S: Obtaining maximal concatenated phylogenetic data sets from large sequence databases. Molecular Biology and Evolution 2003, 20(7):1036-1042. 12. Sanderson MJ, McMahon MM: Inferring angiosperm phylogeny from EST data with widespread gene duplication. BMC Evolutionary Biology 2007, 7:S3. 13. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen Y-J, Chen Z et al: Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005, 437(7057):376-380. 14. Fares MA, Byrne KP, Wolfe KH: Rate asymmetry after genome duplication causes substantial long-branch attraction artifacts in the phylogeny of Saccharomyces species. Molecular Biology and Evolution 2006, 23(2):245-253. 15. Alvarez I, Costa A, Feliner GN: Selecting single-copy nuclear genes for plant phylogenetics: A preliminary analysis for the Senecioneae (Asteraceae). Journal of Molecular Evolution 2008, 66:276-291. 16. Martin W, Deusch O, Stawski N, Grunheit N, Goremykin V: Chloroplast genome phylogenetics: why we need independent approaches to plant molecular evolution. Trends In Plant Science 2005, 10(5):203-209. 17. Mort M-E, Crawford DJ: The continuing search: low-copy nuclear sequences for lower-level plant molecular phylogenetic studies. Taxon 2004, 53(2):257-261. 18. Sang T: Utility of low-copy nuclear gene sequences in plant phylogenetics. Critical Reviews in Biochemistry and Molecular Biology 2002, 37:121-147. 19. Small RL, Cronn RC, Wendel JF: Use of nuclear genes for phylogeny reconstruction in plants. Australian Systematic Botany 2004, 17:145- 170. 20. Hughes J, Longhorn SJ, Papadopoulou A, Theorides K, de Riva A, Meija- Chang M, Foster PG, Vogler AP: Dense Taxanomic EST Sampling and Its Applications for Molecular Systematics of the Coleoptera (Beetles). Molecular Biology and Evolution 2006, 23(2):268-278.

147 21. Whittall JB, Medina-Marino A, Zimmer EA, Hodges SA: Generating single-copy nuclear gene data for a recent adaptive radiation. Molecular Phylogenetics and Evolution 2006, 39:124-134. 22. Steinke D, Salzburger W, Meyer A: Novel relationships among ten fish model species revealed based on a phylogenomic analysis using ESTs. Journal of Molecular Evolution 2006, 62:772-784. 23. Axtell MJ, Snyder JA, Bartel DP: Common functions for diverse small RNAs of land plants. The Plant Cell 2007, 19(6):1750-1769. 24. Bainbridge MN, Warren RL, Hirst M, Romanuik T, Zeng T, Go A, Delaney A, Griffith M, Hickenbotham M, Magrini V et al: Analysis of the prostate cancer cell line LNCaP transcriptome using a sequencing-by- synthesis approach. BMC Genomics 2006, 7:246. 25. Barakat A, Wall K, Leebens-Mack J, Wang YJ, Carlson JE, dePamphilis CW: Large-scale identification of microRNAs from a basal eudicot (Eschscholzia californica) and conservation in flowering plants. The Plant Journal 2007, 51(6):991-1003. 26. Barbazuk WB, Emrich SJ, Chen HD, Li L, Schnable PS: SNP discovery via 454 transcriptome sequencing. The Plant Journal 2007, 51(5):910- 918. 27. Cheung F, Haas BJ, Goldberg SMD, May GD, Xiao Y, Town CD: Sequencing Medicago truncatula expressed sequence tags using 454 Life Sciences technology. BMC Genomics 2006, 7:272. 28. Emrich SJ, Barbazuk WB, Schnable PS: Gene discovery and annotation using LCM-454 transcriptome sequencing. Genome Res 2007, 17:69-73. 29. Moore MJ, Dhingra A, Soltis PS, Shaw R, Farmerie WG, Folta KM, Soltis DE: Rapid and accurate pyrosequencing of angiosperm plastid genomes. BMC Plant Biology 2006, 6(17). 30. Pearson BM, Gaskin DJH, Segers RPAM, Wells JM, Nuijten PJM, van Vliet AHM: The Complete Genome Sequence of Campylobacter jejuni Strain 81116 (NCTC11828). Journal of Bacteriology 2007, 189(22):8402- 8403. 31. Huse SM, Huber JA, Morrison HG, Sogin ML, Welch DM: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology 2007, 8. 32. Carlson J, Leebens-Mack J, Wall P, Zahn L, Mueller L, Landherr L, Hu Y, Ilut D, Arrington J, Choirean S et al: EST database for early flower development in California poppy (Eschscholzia californica Cham., Papaveraceae) tags over 6000 genes from a basal eudicot. Plant Molecular Biology 2006, 62(3):351-369. 33. Enright AJ, Kunin V, Ouzounis CA: Protein families and TRIBES in genome sequence space. Nucleic Acids Research 2003, 31:4632-4638.

148 34. Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research 2002, 30(7):1575-1584. 35. Wall PK, Leebens-Mack J, Muller K, Field D, Altman N, dePamphilis CW: PlantTribes: A gene and gene family resource for comparative genomics in plants. Nucleic Acids Research 2007. 36. Iseli C, Jongeneel CV, Bucher P: ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences. Proc Int Conf Intell Syst Mol Biol 1999:138-148. 37. Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004, 5(1):113. 38. Cui L, Veeraraghavan N, Richter A, Wall PK, Jansen RK, Leebens-Mack J, Makalowska I, dePamphilis CW: ChloroplastDB: the chloroplast genome database. Nucleic Acids Research 2006, 34:D692-696. 39. Swofford DL: PAUP*: Phylogenetic Analysis using Parsimony (* and Other methods). In., 4 edn. Sunderland, MA: Sinauer Associates; 2001. 40. Posada D, Crandall KA: Modeltest: testing the model of DNA substitution. Bioinformatics 1998, 14(9):817-818. 41. Wall PK, Leebens-Mack J, Chanderbali A, Barakat A, Liang H, Landherr LL, Tomsho LP, Hu Y, Carlson JE, Ma H et al: Comparison of next generation sequencing technologies for de novo transcriptome characterization. BMC Genomics 2008, submitted. 42. Hilu KW, Borsch T, Muller K, Soltis DE, Soltis PS, Savolainen V, Chase MW, Powell M, Alice LA, Evans R et al: Angiosperm phylogeny based on matK sequence information. American Journal of Botany 2003, 90(1758-1776). 43. Soltis D-E, Soltis P-S, Chase M-W, Mort M-E, Albach D-C, Zanis M, Savolainen V, Hahn W-H, Hoot S-B, Fay M-F et al: Angiosperm phylogeny inferred from 18S rDNA, rbcL, and atpB sequences. Botanical Journal of the Linnean Society 2000, 133(4):381-461. 44. Soltis DE, Senters AE, Zanis MJ, Kim S, Thompson JD, Soltis PS, Ronse De Craene LP, Endress PK, Farris JS: Gunnerales are sister to other core eudicots: implications for the evolution of pentamery. American Journal of Botany 2003, 90(3):461-470. 45. Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, Choisne N, Aubourg S, Vitulo N, Jubin C et al: The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 2007, 449(7161):463-467. 46. Tuskan GA, DiFazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam N, Ralph S, Rombauts S, Salamov A et al: The Genome of Black Cottonwood, Populus trichocarpa (Torr. & Gray). Science 2006, 313(5793):1596-1604.

149 Concluding Remarks

The Importance of Understanding the Impact of Gene Duplication in Flowering Plants

The impact of gene duplication on the evolution of flowering plants should not be underestimated in plant biology. Common assumptions in genetics, development, and molecular biology concerning the orthology of genes and maintenance of function in different species do not necessarily hold for plants as the result of frequent gene and genome duplication. This is especially relevant for the field of evo-devo – ‘evolution of development’. With the presence of each new gene duplicate potentially comes an increase in complexity of the ‘toolbox’ used for development, since the majority of developmental regulators are members of gene families that preferentially maintain duplicates after whole genome duplications [1, 2]. In order to understand the evolution of development in flowering plants, we must necessarily understand the molecular evolution of genes after duplication.

Additionally, it is clear that we do not yet have a good grasp of the evolutionary forces that influence the evolution of genes after duplication events and that the theoretical framework surrounding the evolution of duplicated genes is in need of reconciliation with current empirical studies concerning the evolution of genes after duplication. In future work, I hope to synthesize the theoretical framework of gene duplication with empirical studies in order to properly identify the variables that influence the fate of duplicated genes and apply that knowledge to understanding the evolution of development in plants as well as the proper selection of nuclear genes to use as phylogenetic markers.

150

Application of new methods to understand expression divergence in orthologs and paralogs

As the transcriptomes of more and more organisms are characterized using microarrays, including non-model organisms, a wealth of new information will be made available that can be used to study the evolution of expression. Whereas studies of expression divergence between paralogs in most of the organisms for which extensive microarray data is available have been published [2-11], studies comparing expression across species have been limited [12, 13], primarily by the lack of comparative microarray data from homologous biological samples. However, when comparative microarray data becomes available, the ANOVA-based methodology introduced in

Chapter 2 can be adopted to study expression divergence in terms of difference in organism, gene (s), and biological sample while eliminating many of the technical biases found in microarray analysis that can become obstacles for distance-based methods of examining expression divergence. It may also become possible to use the results from next-generation transcriptome sequencing (such as 454 or Solexa sequencing) to track expression [14] and study the evolution of expression in a comparative framework.

A new direction in phylogenetics: the use of single copy nuclear genes

The research presented in this dissertation introduces a new potential set of phylogenetic markers from the nuclear genome in flowering plants. Hopefully, other researchers in plant systematics will adopt these genes for use in their own phylogenetic research. An additional project I’ve undertaken, which is not included in this

151 dissertation, is testing a selection of approximately twenty of the shared single copy genes for their use in traditional molecular systematics in the basal angiosperms in collaboration with Dr. Stefan Wanke. As noted in Chapter 3, the basal angiosperms are an important collection of taxa for understanding the evolution of flowering plants.

There are still questions concerning the organismal relationships among the basal angiosperms. While the preliminary utility of the shared single copy genes is illustrated in the Brassicaceae family in Chapter 4, this study will test the ease of amplification and sequencing, as well as the phylogenetic utility, of these genes over a much more diverse and broad set of taxa. We have preliminary data from 13 species in 11 basal angiosperm families that indicates that 8-10 of the shared single copy genes can be reliably amplified using RT-PCR and RNA obtained from young leaves with direct sequencing of the amplification product after gel isolation, with tissue already collected that covers 13 species in 11 additional basal angiosperm families. The resulting dataset will provide valuable new insight into basal angiosperm phylogeny and the early evolution of flowering plants. Our collaborator, Stefan Wanke, has demonstrated that intron sequences from one of the shared single copy genes identified in this work can be successfully used for population-level studies and is superior to current markers that are typically employed for such studies.

The balance of high-throughput methodologies and low-throughput “traditional” biology

As sequencing technologies improve in terms of the amount of data generated proportionally to cost and labor, researchers must adapt their methodology to take advantage of these advances. However, researchers must balance high-throughput,

152 typically computational methods, with proper analytical techniques that take into the biology and theoretical framework of the system they are studying. The last chapter of this dissertation proposes a new methodology for gathering phylogenetic data in an efficient manner using 454 transcriptome sequencing. Unlike previous studies that have used computational methods to gather as much phylogenetic data as possible from public databases or EST sequencing [13, 15-17], the methodology proposed in this dissertation uses a filter of previously determined ‘single-copy’ orthologous genes to increase the signal to noise ratio that can be an issue when all possible data is used in an analysis.

This is complicated in phylogenetics, and especially in flowering plants, by the presence of paralogous sequences that can result in phylogenetic artifacts as the result of rate heterogeneity [17, 18]. By limiting the dataset to a significant set of orthologous sequences, it expected that the reliability and consistency of the dataset would be greatly enhanced. Hopefully researchers studying the full range of eukaryotic diversity will employ this methodology to improve our understanding of the tree of life and coincidently, greatly enhance our understanding of the evolution of nuclear genome content and the evolution of orthologous sequences.

Synthesis

Overall, the research included in this dissertations illustrates how in order to apply models of molecular evolution to deciphering the evolution of development and organismal phylogeny in flowering plants we must expand our understanding of the evolution of genes after duplication. By taking into account a more complex model of molecular evolution after duplication, we will better understand the evolution of all genes

153 in flowering plants, both regulatory genes and genes used as phylogenetic markers, and how changes at the molecular level result in the broad diversity and evolutionary success of flowering plants.

References

1. Maere S, De Bodt S, Raes J, Casneuf T, Van Montagu M, Kuiper M, Van de Peer Y: Modeling gene and genome duplications in eukaryotes. Proceedings of the National Academy of Sciences USA 2005, 102(15):5454-5459. 2. Blanc G, Wolfe KH: Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. The Plant Cell 2004, 16(7):1679- 1691. 3. Duarte JM, Cui L, Wall PK, Zhang Q, Zhang X, Leebens-Mack J, Ma H, Altman N, dePamphilis CW: Expression Pattern Shifts Following Duplication Indicative of Subfunctionalization and Neofunctionalization in Regulatory Genes of Arabidopsis. Molecular Biology and Evolution 2006, 23(2):469-478. 4. Wagner A: Decoupled evolution of coding region and mRNA expression patterns after gene duplication: Implications for the neutralist-selectionist debate. Proceedings of the National Academy of Sciences USA 2000, 97(12):6579-6584. 5. Huminiecki L, Wolfe KH: Divergence of Spatial Gene Expression Profiles Following Species-Specific Gene Duplications in Human and Mouse. Genome Res 2004, 14(10a):1870-1879. 6. Gu ZL, Nicolae D, Lu HHS, Li WH: Rapid divergence in expression between duplicate genes inferred from microarray data. Trends in Genetics 2002, 18(12):609-613. 7. Gu X: Statistical Framework for Phylogenomic Anlysis of Gene Family Expression Profiles. Genetics 2004, 167:531-542. 8. Gu X, Zhang Z, Huang W: Rapid evolution of expression and regulatory divergences after yeast gene duplication. Proceedings of the National Academy of Sciences USA 2005, 102(3):707-712. 9. He X, Zhang J: Rapid Subfunctionalization Accompanied by Prolonged and Substantial Neofunctionalization in Duplicate Gene Evolution. Genetics 2005, 169(2):1157-1164. 10. Makova KD, Li WH: Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res 2003, 13(7):1638-1645. 11. Ganko EW, Meyers BC, Vision TJ: Divergence in Expression between Duplicated Genes in Arabidopsis. Molecular Biology and Evolution 2007, 24(10):2298-2309. 12. Liao B, Zhang J: Evolutionary conservation of expression profiles between human and mouse orthologous genes. Molecular Biology and Evolution 2005, in press.

154 13. Gu ZL, Rifkin SA, White KP, Li WH: Duplicate genes increase gene expression diversity within and between species. Nature Genetics 2004, 36(6):577-579. 14. Wall PK, Leebens-Mack J, Chanderbali A, Barakat A, Liang H, Landherr LL, Tomsho LP, Hu Y, Carlson JE, Ma H et al: Comparison of next generation sequencing technologies for de novo transcriptome characterization. BMC Genomics 2008, submitted. 15. de la Torre JE, Egan MG, Katari M, Brenner ED, Stevenson DW, Coruzzi GM, Desalle R: ESTimating plant phylogeny: lessons from partitioning. BMC Evolutionary Biology 2006, 6:48. 16. Sanderson MJ, Driskell AC, Ree RH, Eulenstein O, Langley S: Obtaining Maximal Concatenated Phylogenetic Data Sets from Large Sequence Databases. Molecular Biology and Evolution 2003, 20(7):1036-1042. 17. Sanderson MJ, McMahon MM: Inferring angiosperm phylogeny from EST data with widespread gene duplication. BMC Evolutionary Biology 2007, 7:S3. 18. Fares MA, Byrne KP, Wolfe KH: Rate Asymmetry After Genome Duplication Causes Substantial Long-Branch Attraction Artifacts in the Phylogeny of Saccharomyces Species. Molecular Biology and Evolution 2006, 23(2):245-253.

155 VITA

Jill Marie Duarte

Education: Ph.D., Biology with option in Molecular Evolutionary Biology Anticipated 2008 The Pennsylvania State University, University Park, PA

B.S., Cell and Molecular Biology and Genetics 2003 University of Maryland, College Park, MD

Honors and Awards: NESCENT Postodoctoral Fellowship – declined The Pennsylvania State University, Dept. of Biology - Jeanette Ritter Mohnkern Scholarship MORPH Mini-course travel grant National Science Foundation Doctoral Dissertation Improvement Grant Botanical Society of America Genetics Section Graduate Research Award Biology Department, IMEG, and FGP travel awards National Science Foundation – Graduate Research Fellowship Honorable Mention The Pennsylvania State University – University Fellowship University of Maryland -Appleman-Norton Award for Most Outstanding Student in Plant Biology University of Maryland – High Honors in Cell Biology and Molecular Genetics

Publications: Duarte, J. M., Wall, P. K., Leebens-Mack, J. H., Pires, C. P., dePamphilis, C. W. A marriage of high-throughput identification of phylogenetic markers with high-throughput sequencing: Next-generation phylogenomics using 454 transcriptomes. In prep Duarte, J. M., Landherr, L. L., Wall, P. K., Edgar, P. C., Pires, J. C., Leebens-Mack, J. H., dePamphilis, C. W. Identification of shared single copy nuclear genes in Arabidopsis, Populus, Vitis and Oryza and their phylogenetic utility at various taxonomic levels. In prep Duarte, J.M., P.K. Wall, J.H. Leebens-Mack and C.W. dePamphilis. Utility of Amborella trichopoda and Nuphar advena ESTs for phylogeny and comparative sequence analysis. Taxon in press Leebens- Mack, J. H., Wall, P. K., Duarte, J. M., Zheng, Z., Oppenheimer, D., C. W. dePamphilis (2006) A genomics approach to the study of ancient polyploidy and floral developmental genetics.. Soltis, D. E., Soltis, P. S., and Leebens-Mack, J. H. [eds] Advances in Botanical Research series. Elsevier Limited, London. Duarte, J. M., Cui, L., Wall, P. K., Zhang, Q., Zhang, X., Leebens-Mack, J., Ma, H., Altman, N., C. W. dePamphilis. 2006. Expression pattern shifts following duplication indicative of subfunctionalization and neofunctionalization in regulatory genes of Arabidopsis. Molecular Biology and Evolution. 23:469-478