uOttawa · L'Université canadienne / Canada's university
Faculté des études supérieures et postdoctorales / Faculty of Graduate and Postdoctoral Studies

Alain Toutloff AUTEUR DE LA THESE / AUTHOR OF THESIS

M.Sc. (Biology) GRADE/"DEGREE"

Department of Biology FACULTE, ECOLE, DEPARTEMENT / FACULTY, SCHOOL, DEPARTMENT

Phylogenetic Position of Ceratophyllum Using the Largest Subunit of RNA Polymerase Genes

TITRE DE LA THESE / TITLE OF THESIS

G. Drouin DIRECTEUR (DIRECTRICE) DE LA THESE/ THESIS SUPERVISOR

CO-DIRECTEUR (CO-DIRECTRICE) DE LA THESE / THESIS CO-SUPERVISOR

EXAMINATEURS (EXAMINATRICES) DE LA THESE / THESIS EXAMINERS

S. Aris-Brosou

M. Ekker

K. Seifert

Gary W. Slater Le Doyen de la Faculte des etudes superieures et postdoctorales / Dean of the Faculty of Graduate and Postdoctoral Studies

PHYLOGENETIC POSITION OF CERATOPHYLLUM USING THE LARGEST SUBUNIT OF RNA POLYMERASE GENES

By:

Alain H. Toutloff

Thesis submitted to the Faculty of Graduate and Postdoctoral Studies of the University of Ottawa in partial fulfillment of the requirements for the M. Sc. degree in the Ottawa-Carleton Institute of Biology.


© Alain H. Toutloff, Ottawa, Canada, 2009

Library and Archives Canada, Published Heritage Branch
395 Wellington Street, Ottawa ON K1A 0N4, Canada

ISBN: 978-0-494-63017-4

NOTICE:

The author has granted a non-exclusive license allowing Library and Archives Canada to reproduce, publish, archive, preserve, conserve, communicate to the public by telecommunication or on the Internet, loan, distribute and sell theses worldwide, for commercial or non-commercial purposes, in microform, paper, electronic and/or any other formats.

The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

In compliance with the Canadian Privacy Act some supporting forms may have been removed from this thesis.

While these forms may be included in the document page count, their removal does not represent any loss of content from the thesis.

EXAMINERS

The following members of the

Ottawa-Carleton Institute of Biology:

Dr. Guy Drouin

Supervisor

Dr. Stephane Aris-Brosou

Examining Member (University of Ottawa)

Dr. Marc Ekker

Examining Member (University of Ottawa)

Dr. Keith A. Seifert

Examining Member (Carleton University)

ABSTRACT

Classification of several groups of angiosperms remains to be clarified. Among these, the position of the family Ceratophyllaceae, with Ceratophyllum as its only genus, is still unclear. Previous studies placed this genus at the base of all angiosperms, within the eumagnoliids, as sister group to all eudicots, or as sister group to the monocots. We performed maximum parsimony and maximum likelihood phylogenetic analyses to clarify the phylogenetic position of Ceratophyllum. These analyses were performed on partitions, based on substitution rates, to account for the different substitution rates of different sites. Our results show that Ceratophyllum forms a clade with Acorus in the majority of inferred trees. Moreover, this clade is most often found as a sister group to the monocots.

RESUME

The classification of several groups of angiosperms needs to be clarified. Among these, the position of the family Ceratophyllaceae, with Ceratophyllum as its only genus, is still uncertain. Previous studies placed this genus at the base of the angiosperms, within the eumagnoliids, as sister group to all eudicots, or as sister group to the monocots. We carried out maximum parsimony and maximum likelihood phylogenetic analyses in order to clarify the phylogenetic position of Ceratophyllum. These analyses were performed on partitions to account for the different substitution rates at different sites. Our results show that Ceratophyllum forms a clade with Acorus in the majority of inferred trees. Moreover, this clade is very often found as sister group to the monocots.

REMERCIEMENTS/ACKNOWLEDGMENTS

It seems essential to me to thank all those who, directly or indirectly, made it possible for me to carry out this work in molecular evolution toward a Master's degree in biology. First of all, I am very grateful to my supervisor, Dr. Guy Drouin, for agreeing to let me work in his laboratory. His experience, his expertise, his advice, his encouragement and his patient reading of my thesis, from the very beginning, allowed me to complete this project. I would also like to thank my committee members. Thanks to Dr. George Carmody for his valuable suggestions. Thanks also to Dr. Stéphane Aris-Brosou for his recommendations concerning my data and my bioinformatic and phylogenetic analyses.

I would also like to thank Cindy Lesage-Pelletier who, as an intern, worked with me in Guy Drouin's laboratory on DNA isolation, PCR amplification, electrophoresis and the preparation of samples for sequencing. Thanks as well to my employer, the Cégep de l'Outaouais, and to the Ministère de l'Éducation, for partially releasing me from my teaching duties for a few terms so that I could pursue this project.

Finally, I would also like to thank my daughters, Corinne and Laurence Courchesne-Toutloff, for their patience when their dad was busy with his master's project. Last but by no means least, I must mention one more person, who encouraged me greatly throughout this journey. Thank you to Ma Belle, Caroline Mimeault, for your expertise with Word and EndNote, but above all for your constant support and your confidence. It was appreciated to the highest degree!

TABLE OF CONTENTS

ABSTRACT
RESUME
REMERCIEMENTS/ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS

Chapter 1: INTRODUCTION
    Phylogenies
    Usefulness of an adequate angiosperm classification
    Morphological versus molecular data
    Phylogenetic methods and substitution models
    Saturated sites
    Long branch attraction
    Genes and concatenated data
    RNA polymerase genes as phylogenetic markers

Chapter 2: CONTROVERSIES IN THE ANGIOSPERM CLASSIFICATION
    Root of the angiosperms
    Primitive dicots
    Eudicots (tricolpates)
    Monocots
    Ceratophyllaceae

Chapter 3: PURPOSE OF THE STUDY

Chapter 4: MATERIALS AND METHODS
    Species
    Sequencing of rpa1 and rpc1
    Contig assemblies
    Alignment and supermatrices
    Phylogenetic analysis
    Consistency in topology

Chapter 5: RESULTS
    Analysis of rpa1 and rpc1 sequences
    Analysis of the concatenated datasets
    Rate heterogeneities among the concatenated datasets
    Phylogenetic analyses
    Testing of the tree topologies
    Bootstrap confidence

Chapter 6: DISCUSSION
    Genes used
    Concatenated datasets and alignment
    Choice of outgroups
    Among-site rate variation and removal of noise
    Choice of methods
    MP analysis
    ML analysis
    Data partitions
    Bootstrap confidence
    Relationships between Acorus and Ceratophyllum

Chapter 7: CONCLUSION

REFERENCES
APPENDIX
    APPENDIX A: Nucleotide alignment (refer to CD)
    APPENDIX B: Amino acid alignment (refer to CD)
    APPENDIX C: Permissions granted for the 4 figures

LIST OF FIGURES

Figure 1 The updated angiosperm classification
Figure 2 Taxon sampling among the gymnosperm and angiosperm clades using 22 species, including 7 eudicots, 4 monocots, 9 and 2 gymnosperms (as the outgroups)
Figure 3 Primitive dicots, as well as monocots and , are still not resolved and are thus often shown as a large polytomy in phylogenetic trees
Figure 4 Relationships among eudicots are still not completely resolved, especially for ancient eudicot species, as well as eurosids I, lamiids and campanulas
Figure 5 Evolutionary relationships of 20 taxa for the nucleotide dataset with Maximum parsimony
Figure 6 Evolutionary relationships of 20 taxa for the amino acid dataset with Maximum parsimony
Figure 7 Evolutionary relationships of 22 taxa for the nucleotide dataset with Maximum likelihood
Figure 8 Evolutionary relationships of 22 taxa for the amino acid dataset with Maximum likelihood
Figure 9 Evolutionary relationships of 20 taxa for the nucleotide dataset with Maximum likelihood
Figure 10 Evolutionary relationships of 20 taxa for the amino acid dataset with Maximum likelihood
Figure 11 Evolutionary relationships of 20 taxa for the nucleotide dataset, using only the seven nuclear sequences, with Maximum likelihood
Figure 12 Evolutionary relationships of 20 taxa for the amino acid dataset, using only the six nuclear protein-coding sequences, with Maximum likelihood
Figure 13 Evolutionary relationships of 20 taxa for the nucleotide dataset, using only the five chloroplastic sequences, with Maximum likelihood
Figure 14 Evolutionary relationships of 20 taxa for the amino acid dataset, using only the five chloroplastic sequences, with Maximum likelihood
Figure 15 Evolutionary relationships of 20 taxa for the nucleotide dataset, using only the three mitochondrial sequences, with Maximum likelihood
Figure 16 Evolutionary relationships of 20 taxa for the amino acid dataset, using only the three mitochondrial sequences, with Maximum likelihood
Figure 17 All positions inferred for Ceratophyllum

LIST OF TABLES

Table 1 List of the 22 species analyzed, with their orders and families, according to the classification adopted by the international symposium of the Angiosperm Phylogeny Group II in 2003
Table 2 List of nuclear, mitochondrial and chloroplastic genes coding for proteins and nuclear 18S ribosomal DNA gene
Table 3 Species and accession numbers of the 11 sequences
Table 4 Species and accession numbers of the nuclear genes included
Table 5 Species and accession numbers of the mitochondrial genes included
Table 6 Species and accession numbers of chloroplastic genes included
Table 7 Rate heterogeneity of the DNA dataset
Table 8 Rate heterogeneity of the amino acid dataset
Table 9 Characteristics of the different datasets (all sites and without sites of categories 8, 7 and 6)
Table 10 SH-tests for seven alternative topologies with different datasets
Table 11 Comparison of bootstrap values for all figures and all different matrices analyzed
Table 12 Relationships between Ceratophyllum and Acorus

LIST OF ABBREVIATIONS

ANITA   Amborella, Nymphaea and ITA
atpA    ATPase alpha subunit
atpB    ATP synthase beta subunit
APG     Angiosperm Phylogeny Group (1998)
APG II  Angiosperm Phylogeny Group II (2003)
BI      Bayesian inference
CI      Consistency index
cox1    Cytochrome oxidase subunit I
DNA     Deoxyribonucleic acid
GC      Guanine-cytosine
GPWG    Grass Phylogeny Working Group
GTR     General time-reversible substitution model (or REV)
Indel   Insertion or deletion event
ITA     Illiciaceae, Schisandraceae, Trimeniaceae and Austrobaileyaceae
JTT     Jones, Taylor and Thornton substitution model
kb      Kilo base pairs
kDa     Kilo Dalton
LBA     Long branch attraction
matK    Maturase K
matR    Maturase R
MP      Maximum parsimony
ML      Maximum likelihood
MtREV   Mitochondrial reversible substitution model
MYA     Million years ago
mtSSU   Mitochondrial small subunit
ORF     Open reading frame
phyA    Phytochrome A
phyC    Phytochrome C
PI      Parsimony informative
psaA    Photosystem I subunit A
psbB    Photosystem II CP47 protein
rbcL    Large subunit of ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO)
rDNA    Ribosomal deoxyribonucleic acid
RNA     Ribonucleic acid
rpa1    RNA polymerase I largest subunit
rpb1    RNA polymerase II largest subunit
rpb2    RNA polymerase II second largest subunit
rpc1    RNA polymerase III largest subunit
TBR     Tree-bisection-reconnection
tRNA    Transfer RNA
trnK    Lysine tRNA

CHAPTER 1

INTRODUCTION

Phylogenies

How can we explain the origin of species and the interactions of living organisms within ecosystems? Despite their interest in both of these fascinating questions, humans mostly had to wait for Darwin and his famous book On the Origin of Species by Means of Natural Selection (Darwin 1859) before beginning to reconsider the tree of life. This was also the starting point for phylogenetics, in which the evolutionary relationships of species were first established using morphological features. Nowadays, phylogenetics commonly uses molecular characters to clarify relationships among organisms. Indeed, with phylogenetic tree inference, ancestral traits (plesiomorphies) can be compared to those that are derived, or more recent (synapomorphies; Mishler 1994). The former are considered uninformative (noise), whereas the latter are useful characters, so one needs to distinguish between the two (Hillis 1987). It is therefore possible, using molecular data, to discriminate between a group of species that includes all the descendants of a common ancestor (a monophyletic clade, also referred to as a natural clade) and one that does not and is paraphyletic (a group whose common ancestor is shared by other taxa). The acceptance of monophyletic clades is facilitated when all species clearly share a certain number of morphological synapomorphies. The former paraphyletic groupings can then be abandoned.
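The distinction between monophyletic and non-monophyletic groups can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the rooted tree is a toy topology written as nested tuples, the taxon names are only labels, and the function names are hypothetical. A group is treated as monophyletic when its members are exactly the leaves of one subtree.

```python
# Toy rooted tree as nested tuples; leaves are strings.
# The topology is illustrative only and is not a result of this study.
TREE = (("Pinus", "Ginkgo"),
        ("Amborella", (("Oryza", "Acorus"), ("Arabidopsis", "Vitis"))))


def leaves(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def clades(node):
    """Yield the leaf set of every subtree, including single leaves."""
    yield leaves(node)
    if not isinstance(node, str):
        for child in node:
            yield from clades(child)


def is_monophyletic(tree, group):
    """True if the group is exactly the leaf set of some subtree."""
    return frozenset(group) in set(clades(tree))


print(is_monophyletic(TREE, {"Oryza", "Acorus"}))                 # True
print(is_monophyletic(TREE, {"Amborella", "Arabidopsis", "Vitis"}))  # False
```

In this toy tree the second group fails the test because the smallest subtree containing its members also contains the two monocot labels, which is the situation described above for groupings that are later abandoned.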

In comparative biology, the use of phylogenetic trees is essential if one intends to infer relationships (Holland et al. 2004). This is true for any organism, and particularly for flowering plants where, for reasons such as their rapid diversification, common ancestors are sometimes difficult to assess. In fact, compared to animals, plant phylogenies are generally more difficult to reconstruct, partly because of the presence of more analogous features, usually referred to as homoplasies (similar characteristics acquired through independent evolution; Lankester 1870; Sanderson and Donoghue 1989). After only two decades of phylogenetics using intensive molecular investigations, our knowledge of these plants has greatly improved (Judd et al. 2008; Moore et al. 2007; Savolainen and Chase 2003; Soltis et al. 2004a). The current angiosperm classification thus differs greatly from the one based solely on morphological characters (APG II 2003; Cronquist 1988; Thorne 1992). However, controversies remain about a certain number of groupings and further clarification is needed.

Usefulness of an adequate angiosperm classification

Extant seed plants include only gymnosperms and angiosperms, the latter also being called flowering plants. While seed plants seem to have arisen about 375 MYA (Gerrienne et al. 2004), the gymnosperms and angiosperms probably appeared 310 and 120 MYA, respectively (Crepet et al. 2004; Pryer et al. 2004). Gymnosperms are divided into four clades: cycads at the base, then Ginkgo, whereas Gnetales could be found within the fourth group, conifers (Hajibabaei et al. 2006; Nickerson and Drouin 2003). Despite their putative old age, gymnosperms comprise no more than 600 species, a very limited number compared to flowering plants.

Angiosperms are the most diverse group of plants. At least 260,000 species, divided into 457 families within 45 orders according to the Angiosperm Phylogeny Group II, have so far been recognized, although the number of species will probably double in a few years (APG II 2003)1. To improve the resolution of the angiosperm classification, given the huge number of species and their rapid early diversification, it seems necessary to obtain more information from fossils (Zanis et al. 2002).

Because of the overwhelming importance of flowering plants as our source of food, drugs and fibre, a good understanding of their relationships is essential. Angiosperm classification has thus been revised through molecular phylogeny (evolutionary history). The Angiosperm Phylogeny Group (APG 1998; APG II 2003), relying on recent phylogenies based on molecular data, had to consider diverse criteria, such as minimizing taxonomic redundancy, the morphological features of all species of a clade and, of course, the support for monophyletic clades. However, a number of questions are still unresolved, either because different trees gave rise to conflicting positions of species, or because support for particular nodes is weak. The updated classification of the orders of angiosperms, as well as families that are still unassigned, is shown below (Figure 1).

1 Unless otherwise stated, the classification discussed here is the one adopted by the last international symposium of the Angiosperm Phylogeny Group II in 2003.


Figure 1: The updated angiosperm classification. Source: APG II (2003); Reproduced with permission from Wiley-Blackwell (Appendix C).

Morphological versus molecular data

The tree of life has been constructed in part with morphological features, but such criteria are not as well defined and reliable as molecular characters (Givnish and Sytsma 1997; Judd et al. 2008). In fact, morphology is rather tricky to interpret (Kellogg 2001), and this kind of data can thus lead to incorrect conclusions about taxonomic relationships (Graur and Li 2000; Judd et al. 2008). Molecular data can therefore infer phylogenies more congruently than morphological data (Hillis 1987; Sanderson and Donoghue 1989). On the basis of morphological interpretations, similarities among organisms were taken as evidence of a putative common ancestor. For example, older botany textbooks used cotyledon (embryonic leaf; Judd et al. 2008) number to split the angiosperms, so that species belonged either to the monocots or to the dicots (Cronquist 1981; Dahlgren et al. 1985; Jussieu 1789). Other interpretations proposed that Gnetales were the sister group to flowering plants (Chase et al. 1993; Doyle 1996; Doyle and Donoghue 1986). These two classifications have now been refuted by many molecular studies (Burleigh and Mathews 2004; Chaw et al. 1997; Donoghue and Doyle 2000; Hajibabaei et al. 2006; Rydin et al. 2002; Soltis et al. 1999). Since it is now clear that certain similarities might have arisen independently, angiosperm phylogeny has been reconstructed using molecular sequences (APG 1998; APG II 2003), although the use of these types of traits can still sometimes confuse historical inferences, mainly due to systematic errors (Gee 2003).

The use of these molecular characters to infer phylogenies is fairly recent. Before 1999, only a small number of single-gene datasets were available. The first was a dataset of 499 sequences of the plastid gene rbcL, which codes for the large subunit of the most abundant enzyme on earth, ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO). This study proposed that the paleoherb Ceratophyllum was sister to all other angiosperms (Chase et al. 1993). Another dataset was composed of 223 sequences of the nuclear 18S ribosomal RNA gene, in which were inferred as the most basal angiosperms (Soltis et al. 1997). Although these early molecular studies produced some congruent phylogenetic trees, and several clades were then recognized or rearranged (for example eudicots, and ), scientists did not consider the results conclusive. The main reasons for this sceptical response were the low support for many clades, topological incongruence in many tree comparisons, scarce informative sites (positions containing a phylogenetic signal), as well as the use of only one phylogenetic reconstruction method, parsimony (Kuzoff and Gasser 2000).

Fortunately, since 1999, thanks to extensive collaborations, automated sequencing, as well as more powerful computers and algorithms, more rigorous phylogenetic studies have been published. They mostly relied on multiple unlinked genes instead of single genes. Nowadays, such matrices tend to be the standard for phylogenetic studies (Rokas and Carroll 2005). In plants, studies with datasets from the three genomic compartments (the nucleus, the mitochondrion and the plastid) tend to be more convincing than those using sequences from only a single genomic compartment, because of their different modes of inheritance and evolutionary histories. Genes such as plastid rbcL, matK and atpB, nuclear 18S rDNA, phytochrome A and phytochrome C, as well as mitochondrial small subunit rRNA, cox1 and matR have been used to infer both gymnosperm (Chaw et al. 2000) and angiosperm trees (Savolainen and Chase 2003). But even though molecular data permitted the angiosperm classification to be updated, based on monophyletic groupings, these newly emerging clades still have to be confirmed by more exhaustive data. A lot of work is thus required before one can be decisive about the reliability of these results, using adequate methods for inferring phylogenies and the best-suited substitution models.

Phylogenetic methods and substitution models

A few phylogenetic methods exist to reconstruct species relationships, such as Maximum parsimony (MP), Maximum likelihood (ML) and Bayesian inference (BI). With MP (Farris 1970), the most adequate tree is assumed to be the one requiring the smallest number of evolutionary changes. To achieve this, all sites of a dataset are categorized as either invariant or variable, and variable sites are in turn either informative or uninformative. Informative sites are the only ones evaluated by MP: they are sites at which at least two different characters occur, each of which is present in at least two taxa. All informative sites are used to construct trees, and the best tree is the most parsimonious one, i.e. the one requiring the lowest number of changes. MP is probably the simplest method, but because it minimizes the number of substitutions it can underestimate the true amount of change, which can be a real problem, particularly when dealing with distantly related species.
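The site categories used by MP can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular phylogenetics program: the four short sequences are invented toy data, the function names are hypothetical, and gaps or ambiguity codes are assumed to be absent.

```python
from collections import Counter


def classify_site(column):
    """Classify one alignment column as invariant, uninformative or informative."""
    counts = Counter(column)
    if len(counts) == 1:
        return "invariant"
    # A site is parsimony informative when at least two different
    # characters are each present in at least two taxa.
    shared_states = sum(1 for n in counts.values() if n >= 2)
    return "informative" if shared_states >= 2 else "uninformative"


def classify_alignment(sequences):
    """Classify every column of a list of equal-length aligned sequences."""
    return [classify_site(column) for column in zip(*sequences)]


# Toy alignment (four taxa, five sites), invented for illustration.
ALIGNMENT = ["AAGCA",
             "AAGTC",
             "AGACG",
             "AGACT"]

print(classify_alignment(ALIGNMENT))
# ['invariant', 'informative', 'informative', 'uninformative', 'uninformative']
```

The fourth column varies in a single taxon only and the fifth differs in every taxon, so neither can favour one tree over another under parsimony.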

Other methods can be quite useful provided the chosen substitution model fits the data. The traditional ML (Edwards 1972; Felsenstein 1981) and the more recent approach referred to as BI are very similar, although BI might offer more flexibility, apart from being faster (Aris-Brosou and Xia 2008). ML, using a model of character changes (a substitution model), calculates the probability (or likelihood) of observing the data given a phylogenetic tree. Since each tree corresponds to a likelihood of observing the data, the tree with the maximum likelihood is the one chosen. Swofford et al. (1996) offer a detailed explanation, focusing in particular on the mathematical aspects of this method. BI is a statistical inference method that calculates the probability of a hypothesis given the data. It uses posterior probabilities, which are probabilities estimated after learning something about the data, under a chosen substitution model.
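The principle of choosing the hypothesis that maximizes the likelihood of the data can be shown on the smallest possible case: two aligned sequences joined by one branch. The sketch below is an assumption-laden illustration, not the analysis performed in this thesis. It swaps in the Jukes-Cantor (JC69) model because it is far simpler than GTR or JTT, it uses a plain grid search instead of a real optimizer, and the function names and the counts (1000 sites, 150 differences) are hypothetical.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of two aligned sequences under JC69.

    d is the branch length in expected substitutions per site,
    n_sites the alignment length and n_diff the number of differing sites.
    """
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)  # probability of one specific identical pair
    p_diff = 0.25 * (0.25 - 0.25 * e)  # probability of one specific differing pair
    return (n_sites - n_diff) * math.log(p_same) + n_diff * math.log(p_diff)


def ml_distance(n_sites, n_diff, step=0.0005, d_max=3.0):
    """Grid search for the branch length with the highest likelihood."""
    grid = [step * i for i in range(1, int(d_max / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))


# Hypothetical counts: 150 differences over 1000 aligned sites.
best = ml_distance(1000, 150)
print(round(best, 3))
```

A full ML tree search applies the same idea to every candidate topology and set of branch lengths, which is why the fit of the substitution model matters so much.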

Divergence among aligned sequences has to be quantified in order to distinguish closely related species from the others. A number of substitution models exist to do so. Although the p distance, the proportion of differences between two aligned sequences, is the simplest measure of divergence, it does not take into account multiple changes at the same site. Corrections therefore need to be applied to compensate for these changes. This is why substitution models have been proposed, such as the general time-reversible model (GTR; Lanave et al. 1984; Rodriguez et al. 1990) for nucleotide datasets and the Jones, Taylor and Thornton model (JTT; Jones et al. 1992) for amino acid datasets. As much as possible, the chosen model should fit the data realistically, even though no perfect fit can be reached. Furthermore, in order to account for substitution rate variation among sites, two other parameters can be added to the nucleotide model: a proportion of invariable sites (+I) and among-site rate variation (+Γ). Apart from the Γ distribution, which usually increases the fit of amino acid datasets (Goldman and Whelan 2002), an additional parameter that takes into account the frequencies of the replaced and resulting residues (+F) can be used. It is worth noting that selecting an appropriate model does not necessarily mean choosing the one that best fits the data. Indeed, whereas a parameter-rich model could obtain a better fit, its accuracy will be decreased (Aris-Brosou and Xia 2008), and therefore tree reconstruction might not always guarantee an adequate phylogeny (Gaut and Lewis 1995; Ren et al. 2005).
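Why the p distance needs a correction for multiple changes can be seen with a short sketch. As an explicit simplification, it uses the one-parameter Jukes-Cantor (JC69) correction rather than the GTR or JTT models named above; the two sequences are invented and the function names are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of sites that differ between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differences / len(seq_a)


def jc69_distance(p):
    """JC69-corrected distance (expected substitutions per site)."""
    if p >= 0.75:
        raise ValueError("p distance too large: sequences are saturated")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Invented toy sequences differing at 2 of 10 sites.
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
print(p)                 # 0.2
print(jc69_distance(p))  # larger than p, since hidden changes are added back
```

The corrected value always exceeds the observed proportion, and the gap widens as divergence grows, which is the behaviour the richer models refine with additional parameters.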

Since it was claimed that ML often outperforms MP (Felsenstein 1978), a long debate has taken place. MP was then considered to be inconsistent, even when more data were added to the datasets. This is partly why, when MP and ML yield different topologies, long branch attraction (see below) is suspected, mostly with MP (Felsenstein 1978; Kim 1996), even though it can occasionally happen with ML (Sanderson and Shaffer 2002). An essential condition must therefore be fulfilled for ML to provide a good topology: the substitution model has to be consistent with the data (Anderson and Swofford 2004). This is a major limitation, since Posada and Crandall (2001), the authors of Modeltest, acknowledged that Modeltest might choose the wrong model. Modeltest also usually proposes a model with many parameters, for example GTR+I+Γ (Chang 1996; Steel 2005).

Saturated sites

When using MP, the rate of evolution (the speed at which substitutions occur) of the investigated sequences has to be high enough to give rise to a significant number of informative sites. If not, the resolution of branches will be fairly poor. For example, Nickrent and Soltis (1995) compared 18S and rbcL sequences, and the trees from the former gene suffered from many polytomies, which are nodes with multifurcations, meaning that these nodes were far from being resolved. In contrast, trees constructed from rbcL were more conclusive, with more resolved nodes, mainly because rbcL provided about 1.4 times more informative sites (Nickrent and Soltis 1995). However, it is sometimes advisable to look more closely at the heterogeneity of the sites in a matrix because, since some positions evolve at a faster rate than others, it is not clear that they add valuable information. Sites with intermediate substitution rates are probably the most suitable for inferring accurate trees (Yang 1998), even though fast rates are preferred to the slowest ones (Sanderson and Shaffer 2002). On the other hand, it is quite difficult to evaluate the level at which saturation presents a real problem, but a threshold of about 30-40% sequence divergence seems to be adequate (Yang 1998). For example, third positions of codons, because they mostly give rise to synonymous substitutions, can add noise (i.e. non-phylogenetic information or random data, sometimes equated with homoplasies) instead of providing useful information (Barkman et al. 2000). Some sites evolve so rapidly that they have undergone multiple substitutions. They are sometimes referred to as saturated sites (Jeffroy et al. 2006; Maynard Smith and Smith 1996) and, being a frequent source of systematic errors, it might be wise to remove them in order to increase the signal-to-noise ratio (Rodriguez-Ezpeleta et al. 2007). When doing so, datasets tend to include less bias due to non-phylogenetic information, thus potentially increasing the accuracy of the inferred topologies (Brinkmann and Philippe 1999). To this end, Γ models, as implemented in programs such as TREE-PUZZLE (Rodriguez-Ezpeleta et al. 2007; Strimmer and von Haeseler 1996), produce different categories of positional rate heterogeneity (Stefanovic et al. 2004). Noise can thus be discriminated from useful information, and the saturated sites can then be removed from the data. On the other hand, some studies claim that these sites need to be retained in order to obtain better resolution (Kallersjo et al. 1999; Saarela et al. 2007; Soltis et al. 2004a; Swofford et al. 1996; Wenzel and Siddall 1999; Yang 1998; Yoder et al. 1996). In fact, a recent data analysis demonstrated that even highly saturated sites, such as third codon positions, are still quite informative and need to be included (Seo and Kishino 2008). Therefore, in this study, analyses have been made with and without fast evolving sites, in order to compare the resulting phylogenetic trees.
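Removing fast evolving sites once each site has been assigned a rate category can be sketched as follows. The sketch assumes that a per-site list of categories (1 for the slowest, 8 for the fastest) has already been obtained from a rate-heterogeneity analysis; the category values, the short sequences and the function name are all hypothetical, and the default exclusion of categories 6 to 8 is only one possible choice.

```python
def drop_fast_sites(alignment, site_categories, excluded=(6, 7, 8)):
    """Return a copy of the alignment without sites in the excluded categories.

    alignment maps taxon names to aligned sequences of equal length;
    site_categories holds one rate category per alignment column.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if lengths != {len(site_categories)}:
        raise ValueError("every sequence needs one category per site")
    kept = [i for i, cat in enumerate(site_categories) if cat not in excluded]
    return {taxon: "".join(seq[i] for i in kept)
            for taxon, seq in alignment.items()}


# Invented toy data: six sites with made-up rate categories.
TOY_ALIGNMENT = {"Acorus": "ATGCCA", "Ceratophyllum": "ATGTCG"}
TOY_CATEGORIES = [1, 2, 8, 3, 6, 7]

print(drop_fast_sites(TOY_ALIGNMENT, TOY_CATEGORIES))
# {'Acorus': 'ATC', 'Ceratophyllum': 'ATT'}
```

Running the tree inference on both the full and the filtered matrix then allows the resulting topologies to be compared, as described above.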

Long branch attraction

Branch lengths are proportional to the expected number of substitutions or differences between a taxonomic unit and the inferred ancestor (Mishler 1994). Since many angiosperm phylogenies show unequal branch lengths, rates of evolution among species are also unequal, and interpreting relationships is thus often tricky, especially when a slowdown in molecular evolution gives rise to very short branches (Soltis et al. 2005a).

Long branch attraction (LBA; Bergsten 2005; Hendy and Penny 1989) is a phenomenon by which, at homologous sequence positions, identical nucleotides are obtained by chance. Fast evolving species, even if they are distantly related, can then occasionally be attracted to each other, because they share the same nucleotides by chance more often than closely related ones do (Felsenstein 1978). Moreover, when an outgroup is too distantly related to the ingroup taxa, chances are that a diverging species might be attracted to it. LBA is one of the best documented systematic errors (statistical inconsistencies) that can give rise to wrong trees (Bergsten 2005), although not the only one. Even if LBA seems to be quite limited, especially with only a small number of taxa (Anderson and Swofford 2004), how can it be detected? According to Goremykin et al. (2006), there is only one method, the extraction test (Siddall and Whiting 1999), based on the fact that LBA exists only if both species are present. Excluding either of these two species will then inevitably change the topology.

Within monocots, grasses are fast evolving species. When three studies (Goremykin et al. 2003; Goremykin et al. 2004; Goremykin et al. 2005), using only three grasses as representatives of monocots, placed monocots at the root of angiosperms, this unusual position was said to be partly a result of LBA between these monocots and the outgroup Pinus (Soltis and Soltis 2004). Increasing the number of well-chosen species can break these long branches by dividing them (Graybeal 1998; Hendy and Penny 1989; Hillis 1996; Kuzoff and Gasser 2000). However, the tree topology can be altered by the addition of a specific taxon, particularly if it introduces conflicting traits, such as primitive characters along with derived ones (Novacek 1992). Choice of taxa is thus critical: adding slower evolving monocots such as Acorus or an orchid, or any other fast evolving lineage, to grasses can break this LBA (Stefanovic et al. 2004) because of a diminished average branch length, even though increasing the number of species does not always give good results (Kim 1996; Poe and Swofford 1999). The position where the addition of a taxon is most effective and accurate is near the common ancestor of the investigated clade (Graybeal 1998; Kim 1996). Nevertheless, Goremykin and Hellwig (2006) showed that the unusual placement of grasses at the root was not due to LBA, but to another phenomenon that they called model mis-specification, in which ML failed to provide an adequate tree topology.

On the other hand, it is also important to be aware that some long branches, obtained by convergent changes, can sometimes help improve topologies by preventing other long branches from attracting species (Poe and Swofford 1999). Moreover, even though adding species usually improves phylogenies, Kim (1996) argued that it can also decrease accuracy by dividing long branches, causing other long branches to attract the wrong species. The usual long-branch division thus has to be applied with great caution (Anderson and Swofford 2004).

Some authors favoured adding more taxa (Aguinaldo et al. 1997; Graybeal 1998; Hillis 1996; Hillis and Bull 1993; Rannala et al. 1998; Soltis and Soltis 2004; Zwickl and Hillis 2002), because it can greatly enhance the detection of multiple substitutions, resulting in part from reversions or convergences (Jeffroy et al. 2006). However, it was also found that increasing the number of characters can give better results, even with fast evolving species (Poe and Swofford 1999), and that it can enhance accuracy more than increasing the number of taxa, particularly when using ML methods (Rokas and Carroll 2005; Rosenberg and Kumar 2003; Sanderson and Driskell 2003), in part because sampling errors decrease as more sites become available (Nishihara et al. 2007). Although this issue is still highly controversial, in general not only more characters but also adequate taxon sampling and a better choice of models that fit the evolution of the datasets are essential to obtain topologies that best describe sequence evolution (Rannala et al. 1998; Soltis et al. 2004a; Stefanovic et al. 2004).

In this study, we used sequences from the three genomes of 22 species. As seen in Figure 2, two gymnosperms, Pinus and Ginkgo, are the outgroup; nine basal angiosperms are represented by three representatives of the ANITA clades (Amborella, Nymphaea and Illicium) as well as Ceratophyllum, 2 Magnoliales (Magnolia and Liriodendron), Persea (a Laurales), Drimys (a Winterales) and Saruma (a Piperales); 4 monocots (Acorus, Asparagus, Oryza and Zea) are included; and seven eudicots are also included in our dataset with Papaver (a Ranunculales), Beta (a Caryophyllales), Nicotiana (an asterid) and four rosids (Arabidopsis, Pisum, Populus and Vitis). Even though our dataset is limited to 22 species, these angiosperms have been chosen to cover a fairly large spectrum of angiosperm diversity.

[Figure 2 image not reproduced. Sampled clades, with number of species in parentheses: Pinaceae (1), Ginkgo (1), Amborella (1), Nymphaeales (1), ITA (1), Ceratophyllum (1), monocots (4), Magnoliales (2), Laurales (1), Winterales (1), Piperales (1), Ranunculales (1), Caryophyllales (1), asterids (1), rosids (4). Unsampled clades shown: other conifers, Gnetales, Chloranthales, Proteales, Sabiaceae, Trochodendrales, Buxaceae and Didymelaceae.]

Figure 2: Taxon sampling among the gymnosperm and angiosperm clades using 22 species, including 7 eudicots, 4 monocots, 9 basal angiosperms and 2 gymnosperms (as the outgroups). Numbers in parentheses refer to the number of species being investigated. Adapted from Kuzoff and Gasser (2000) and reproduced with permission from Elsevier (Appendix C).

Genes and concatenated data

DNA sequences can often lead to inaccurate phylogenies due to rate heterogeneity, inadequate taxon sampling (Rodriguez-Ezpeleta et al. 2007), protein-coding gene constraints, extinction of common ancestors or their too rapid diversification. However, when different gene sequences are used, these putative unequal rates of evolution do not seem to affect phylogenies (Chase et al. 1993); multigene analyses can therefore give reconstructed trees with higher bootstrap values (Qiu et al. 1999; Rodriguez-Ezpeleta et al. 2007; Soltis et al. 1999a). Sometimes, the only evidence that a phylogenetic reconstruction reflects the true history is the congruence of analyses based on different independent data, such as the three plant genomes, or on different substitution models. Also, because most botanists consider that slower evolving genes contain less homoplasy, they usually prefer using them (Graham and Olmstead 2000; Olmstead et al. 1998), even though they bear less information. Other, fast evolving genes, such as matK, can potentially provide sufficient signal to obtain more resolved trees (Hilu et al. 2003), if saturation is not too high.

Phylogenetic trees can have different topologies, depending on how they are constructed and because they may contain systematic errors. Three strategies are commonly used for phylogenetic reconstruction: (1) the use of supertrees, which basically compares trees from different genes (Bryant 2003; Sanderson et al. 1998), (2) the identification of the best substitution models for the genes used before obtaining the best tree (Ronquist and Huelsenbeck 2003; Yang 1996) and finally (3) the use of supermatrices, in which the best model is found after concatenation of all genes into a single "supergene" (Huelsenbeck et al. 1996; Huelsenbeck et al. 1996a). Even though some researchers have questioned this last method, combining many genes into supermatrices can provide stronger support for certain nodes in phylogenetic trees (Baker and DeSalle 1997; Chippindale and Wiens 1994; Huelsenbeck et al. 1996; Wenzel and Siddall 1999), because weak phylogenetic signals can be amplified and noise dispersed (Baldauf et al. 2000; Sanderson and Shaffer 2002), to the point that the true history finally emerges (Baker and DeSalle 1997). This is particularly true when among-site rate variation is present in datasets (Sullivan 1996; Sullivan et al. 1996a). Many systematists recommend including all data in a single matrix, even though conflicts may exist with a particular partition (Baker and DeSalle 1997; Kluge 1989; Sanderson and Shaffer 2002; Wenzel and Siddall 1999). A few comparisons with the supertree approach have even demonstrated the superiority of supermatrices over supertrees (Delsuc et al. 2005; Gatesy et al. 2004; Salamin et al. 2002; Soltis et al. 2004a), although the accuracy of the inferred relationships among taxa can never be guaranteed (Rokas et al. 2003). On the other hand, sampling every gene for all species, even though preferable, is not essential, because a large proportion of missing data does not necessarily change the tree topology. Even though missing data can sometimes obliterate phylogenetic signals (Wilkinson 1995), these characters, usually coded as question marks, should generally be included in the datasets. In fact, as little as one third of the data can give rise to trees with robustly inferred nodes (Bapteste et al. 2002; Philippe et al. 2004; Wiens 2003). A few surveys have even shown that datasets with as much as 92% missing information can still recover good historical relationships among species (Driskell et al. 2004; Philippe et al. 2004). Even very sparse supermatrices can be phylogenetically useful, and these fragmentary datasets can successfully produce trees (Kearney 2002). Therefore, because even incomplete data can be informative (Novacek 1992; Wiens 1995; Wiens 1998), they should generally be included in analyses. The inclusion of all information through concatenation of sequences into a single supermatrix is the method chosen for our study.
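The concatenation step described above can be sketched as follows. This is only an illustration under stated assumptions: each gene alignment is represented as a dictionary mapping a taxon name to its aligned sequence, the function name `concatenate` and the toy sequences are hypothetical, and a taxon lacking a gene is padded with question marks, as mentioned in the text.

```python
def concatenate(alignments, missing="?"):
    """Join per-gene alignments into one supermatrix.

    alignments: list of dicts, each mapping taxon name -> aligned sequence.
    Taxa absent from a gene are filled with `missing` over that gene's length.
    """
    taxa = sorted(set().union(*(aln.keys() for aln in alignments)))
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences of one gene must have equal aligned length")
        gene_length = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, missing * gene_length))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Hypothetical toy alignments (not real sequence data).
gene1 = {"Pinus": "ATGC", "Oryza": "ATGA"}
gene2 = {"Pinus": "GG", "Zea": "GA"}

supermatrix = concatenate([gene1, gene2])
for taxon, sequence in supermatrix.items():
    print(taxon, sequence)
```

In this toy case Oryza lacks the second gene and Zea lacks the first, so both rows are padded to the same total length as the Pinus row.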

Because sequences from all plant genomes encode genes that have different functions, chances of homoplasy are decreased (Qiu et al. 1999). A dataset with only a few homoplasies has better chances of containing the ancestral states, because the sites are far from being saturated. When noise (e.g. homoplasies) is present in a single nonrecombining genome, such as the plant chloroplast genome, it can hide the true signal. Combining sequences can be quite useful to obtain stronger resolving power (Brown et al. 2001; Sullivan 1996), especially if they originate from the three plant genomes. Thus, the use of data from different genomes or loci is more likely to recover the true tree (Holland et al. 2004). Furthermore, analyses of concatenated genes have shown that computer run times are shorter and that the trees are obtained with better resolution and stronger support (Bapteste et al. 2002; Hoot et al. 1999; Savolainen et al. 2000; Soltis et al. 1998).

Our dataset contains fifteen genes from the three plant compartments: three mitochondrial, five chloroplastic and seven nuclear genes. Most of these sequences came from Drouin et al. (2008), as outlined in the Material and Methods.

Plant chloroplasts have been widely used for inferring phylogenies. One reason is the presence of single copy genes (Sanderson and Shaffer 2002), which facilitates the amplification of orthologous regions for the species studied. Another is their fairly stable genome, in which rearrangements are rare events compared to plant mitochondrial and nuclear genomes (Judd et al. 2008). The gene coding for the large subunit of ribulose-1,5-bisphosphate carboxylase/oxygenase (rbcL) was suggested to be a promising locus for molecular analyses, because in plants this rather long sequence is present in almost every taxon (Judd et al. 2008). It has been sequenced in a huge number of species (Chase et al. 1993; Kallersjo et al. 1998; Lewis et al. 1997; Savolainen et al. 2000). In fact, Chase et al. (1993) published the first survey in which this locus was sequenced for hundreds of species from most taxonomic groups. RbcL sequences are now available for more than 5000 taxa, partly because they are capable of providing excellent resolution, despite their slow rate of evolution (Judd et al. 2008), compared to other datasets (Nickrent and Soltis 1995; Sanderson and Driskell 2003). Another plastid gene used more recently is the maturase gene matK. Located within the trnK intron and believed to be involved in the splicing of the lysine tRNA, this gene is fast evolving (Hilu et al. 2003) and is therefore quite dissimilar from other genes used for inferring phylogenies. It has about the same rate of nucleotide substitution at all three codon positions, and its nonsynonymous substitution rate is quite high. Robust trees have been produced with this gene (Hilu et al. 2003). Other plastid genes have also been used successfully to infer phylogenies, such as the ATP synthase beta subunit (atpB; Cuenoud et al. 2002; Hoot and Crane 1995; Hoot et al. 1997; Hoot et al. 1999; Savolainen et al. 2000), the photosystem I subunit A (psaA) and the photosystem II CP47 protein (psbB; Drouin et al. 2008; Graham and Olmstead 2000). However, these chloroplastic genes are markers of the same history, and thus other sequences are needed in order to trace all the phylogenetic events that occurred in plants (Judd et al. 2008).

Because of the need to compare chloroplastic genes with those of the other plant compartments, mitochondrial and nuclear genes have also been used. In plant mitochondria, compared to those of mammals, the rate of nucleotide substitution is quite slow (Drouin et al. 2008; Wolfe et al. 1987). Although mitochondrial genes are problematic, because of their rate of change or simply their substitution pattern, and because some of them have been transferred to the nucleus or have more frequently experienced horizontal gene transfers, they can still give reliable information (reviewed by Chase et al. 2004). A few of these single copy genes (Sanderson and Shaffer 2002) have thus been used for phylogenies, such as the cytochrome oxidase subunit I (cox1), the maturase R (matR) and the ATPase alpha subunit (atpA; Savolainen et al. 2000), as well as the mtSSU rDNA (Parkinson et al. 1999).

Finally, nuclear genes, such as the nuclear small (18S) and large (26S) subunit ribosomal DNA, were investigated (Chaw et al. 1997; Kim et al. 2004; Kuzoff and Gasser 2000; Kuzoff et al. 1998; Nickrent and Soltis 1995; Soltis et al. 1997). Even though these genes are abundant in plant nuclear genomes, their sequences are believed to be homogeneous because of what is called concerted evolution (Zimmer et al. 1980). The 18S gene has been surveyed thoroughly and is the most utilized nuclear gene for inferring phylogenies, but because its rate of evolution is three times slower than that of rbcL (Nickrent and Soltis 1995), these sequences have been used mostly for deep relationships among organisms (Woese 1987; Wolthers and Erdmann 1986). The potential use of 26S to infer phylogenies has been examined by Kim et al. (2004). It was demonstrated that since 26S evolves more rapidly than both rbcL and 18S, it might contain more information. However, low-copy protein-coding genes are also needed, even though some genera (e.g. Ceratophyllum) only have highly modified ones (Mathews and Donoghue 1999). Orthologous nuclear phytochrome genes (phyA and phyC), as well as rpb2, are available from numerous species and will therefore be added to our dataset. Phytochrome genes encode proteins that control plant growth and respond to light (Judd et al. 2008), whereas rpb2 encodes the second largest subunit of RNA polymerase II. Furthermore, relatively new sequences, the genes coding for the largest subunit of RNA polymerases I, II and III (rpa1, rpb1 and rpc1, respectively), will also be used. Whereas studies showed that nuclear sequences of the largest subunit of RNA polymerase II were adequate markers to study plant phylogenies (Hajibabaei et al. 2006; Nickerson and Drouin 2004), the three genes rpa1, rpb1 and rpc1 have been used together only once.

RNA polymerase genes as phylogenetic markers

In eukaryotic cells, three RNA polymerases use DNA templates to catalyze the formation of different RNA molecules by adding ribonucleoside monophosphates to the 3'-terminus of an RNA chain (Cramer 2002). These paralogous enzymes perform precise catalytic functions: RNA polymerase I transcribes most ribosomal RNAs (except 5S), RNA polymerase II transcribes pre-messenger RNAs as well as some small nuclear RNAs, and RNA polymerase III catalyzes the formation of 5S ribosomal RNA and some other small RNAs, including transfer RNAs (Cramer 2002). These three types of RNA polymerases are in fact holoenzymes in eukaryotic organisms, because they are composed of at least 12 subunits. Only two of these are considered large subunits, with molecular weights of about 160 and 150 kDa: the largest and second largest subunits of RNA polymerase II, encoded by rpb1 and rpb2, respectively (Denton et al. 1998; Ebright 2000).

Genes involved in replication and transcription might be a good choice for inferring phylogenies, especially if their products are part of a multiprotein machinery. Because these products interact with many proteins at the same time, they experience many constraints, so that the substitution rates of these genes are probably suitable for reconstructing phylogenies (Denton et al. 1998; Nickerson and Drouin 2004). The paralogous rpb1 and rpb2 RNA polymerase genes have already been used as phylogenetic markers and their usefulness is well documented (Denton et al. 1998; Nickerson and Drouin 2004; Oxelman and Bremer 2000; Oxelman et al. 2004; Stiller and Hall 1997). Since Denton et al. (1998) proposed rpb2 as a valuable phylogenetic marker, it has been particularly helpful in resolving trees for asterids (Denton et al. 1998; Oxelman and Bremer 2000; Oxelman et al. 2004). Stiller and Hall (1997) have even successfully used rpb1 to compare distantly related species such as red algae, yeasts and angiosperms.

While phylogenies have been inferred mostly from either organellar or repetitive nuclear genes, the potential of low-copy genes has not been examined very carefully. These sequences can provide valuable insights about chromosomal events such as gene duplication or polyploidy (Oxelman et al. 2004; Pfeil et al. 2004; Popp and Oxelman 2001). Another study discovered the presence of two rpb2 copies in a number of asterid species (Oxelman and Bremer 2000). This duplication might have occurred before the origin of asterids and probably near the origin of core eudicots (Oxelman et al. 2004). While the gene for this second largest subunit is present as a single copy in most species, the presence of a second copy in others can thus be used as a reliable marker.

In plants, many genes are members of families, and these paralogous relationships complicate phylogenetic studies. This paralogy, which arose from either gene duplication (within the same organism) or gene transfer (between two species), makes comparisons between orthologous genes rather difficult. To be confident that sequences of the same gene are compared, single copy genes are clearly needed. Rpb1 can be used for phylogenetic studies because it does not have any paralogous copy in most species investigated (Nickerson and Drouin 2004), it can yield a fairly large amount of data due to its large size, and it has a stable GC content (Nickerson and Drouin 2004; Stiller and Hall 1997).

CHAPTER 2

CONTROVERSIES IN THE ANGIOSPERM CLASSIFICATION

Angiosperm phylogeny is still regarded, as stated by Darwin in 1879, as "an abominable mystery", because of the large number of species and their diverse morphologies. An obvious progression can be seen in vertebrate animal groups, from fish, to amphibians, to reptiles, to birds and to mammals. However, no such progression is apparent in angiosperms (Frohlich 2003). Flowering plants were formerly divided into two major groups, monocotyledons (or monocots) and dicotyledons (or dicots). However, the latter group is far from being monophyletic and must not be retained. In fact, molecular surveys have determined that dicots are paraphyletic and their features are thought to be mainly plesiomorphic. The presence of embryos with at least two cotyledons should not be used as a synapomorphy, because it is also characteristic of other groups such as certain gymnosperms (Judd and Olmstead 2004; Soltis and Soltis 2004). Monocots, for their part, have been recognized as monophyletic by a vast majority of studies.

Even though congruent trees of angiosperms have been published (Davies et al. 2004; Savolainen and Chase 2003; Soltis et al. 1999a), controversies remain about the position of a few taxa. Polytomies are still unresolved, in particular for the earliest clades of angiosperms, primitive dicots (magnoliids) and monocots, as well as for the position of certain eudicots such as rosids (Figure 1). The difficulties in resolving these clades might be explained by the existence of very short internal and long external branches (Moore et al. 2007). Furthermore, the position of Ceratophyllum has not been clarified yet. In fact, with the use of the plastid gene rbcL, Ceratophyllum was once considered a sister group to all other angiosperms (Chase et al. 1993; Les 1988; Les et al. 1991).

Root of the angiosperms

The origin of flowering plants, or the question of which angiosperms are the most basal, has been a subject of debate for a long time. Several factors may explain the difficulties encountered in reconstructing the angiosperm tree from molecular data. One of these is polyploidy. It has been shown that important genomic rearrangements occurred following genome increases, and that they happened quite rapidly (Soltis 2005). Even though most groups of angiosperms were suggested to be the ancestors at one time or another, only two earlier theories are still discussed. Whereas the euanthial theory (proposed in 1907) claimed that flowers came from bisexual strobili (conelike reproductive organs found in many plants), similar to the large bisexual flowers of Magnolia and Nymphaea, the pseudanthial theory (proposed in 1924) stated that flowers came from unisexual strobili, like the small unisexual flowers of Amborella and Ceratophyllum (Goremykin et al. 2003). However, the common ancestor might have had flowers of intermediate size (Parkinson et al. 1999). More recently, several molecular phylogenies strongly supported Amborella as the most basal angiosperm and Nymphaea as the next most basal (Zanis et al. 2002). Other studies could not reject the possibility that a clade of Amborella + Nymphaea is the sister group of all other angiosperms (Parkinson et al. 1999; Qiu et al. 2000). But other authors still questioned these findings. With the availability of the complete plastid sequences of Amborella trichopoda and Nymphaea alba, three studies argued that Amborella is not conclusively the most basal angiosperm (Goremykin et al. 2003; Goremykin et al. 2004; Goremykin et al. 2005). Moreover, these analyses found no support for the Amborella + Nymphaea clade. Instead, they proposed the monocots as the basal angiosperms (e.g. Nickrent and Soltis 1995). Nymphaea was even found as sister to other dicots (Duvall et al. 1993; Nickrent and Soltis 1995). Also, for morphological reasons, it did not seem appropriate to place Amborella as the sister taxon to Nymphaea, because the small flowers of Amborella share few similarities with the large flowers of Nymphaea. Most studies claim that Amborella is the only survivor of the most basal angiosperm lineage, followed by Nymphaea. Goremykin and Hellwig (2006), using complete plastid sequences but implementing a new test of how well the model fits the evolutionary pattern of the data, claimed that the clade Amborella + Nymphaea was basal.

The clade named Austrobaileyales, with Austrobaileyaceae, Trimeniaceae and Schisandraceae, usually referred to as the ITA clades and represented by Illicium in our dataset, is the next most basal group of angiosperms. This clade is strongly supported by molecular data, but the species do not seem to share any morphological resemblance (Jansen et al. 2007; Moore et al. 2007; Soltis and Soltis 2004b; Soltis et al. 1997). Shared morphological characters among these groups still have to be found. Some of the members of the ITA clades, such as Illiciales (Illicium), used to be classified with the magnoliids (Hoot et al. 1999).

Since all the clades at the very base of angiosperms are species-poor, with Amborella (one species), Nymphaeae (70 species) and Austrobaileya (one species), a rapid angiosperm diversification might have happened afterwards, maybe at about the same time as the monocot speciation occurred. The alternative would be a massive extinction, although this is not very plausible (Sanderson and Donoghue 1994).

Even though the early-diverging angiosperms have been studied intensively, the order in which species diverged is still not completely clear and thus more investigations are needed.

Primitive dicots

Primitive dicots, also called magnoliids or eumagnoliids, comprise four orders: Magnoliales, Laurales, Piperales and Canellales (Zanis et al. 2002). Relationships among eumagnoliids are far from being completely resolved, and some studies even tend to include more groups, such as the monocots and the controversial Chloranthaceae, as seen in Figure 3 (Savolainen et al. 2000; Soltis et al. 2000).

[Figure 3 image not reproduced. Terminal taxa shown: monocots, Winterales, Laurales (9), Magnoliales (10), Chloranthales (3), Piperales (8), Illiciaceae (1), Schisandraceae (1), Austrobaileyaceae (1), Nymphaeaceae (1), Amborellaceae (1).]

Figure 3: Primitive dicots, as well as monocots and Chloranthaceae, are still not resolved and are thus often shown as a large polytomy in phylogenetic trees. Source: Soltis et al. (1999a); Reproduced with permission from Nature Publishing Group (Appendix C).

The order Magnoliales includes six families, including the Magnoliaceae (Liriodendron and Magnolia), but the relationships within these families are still unclear (Sauquet et al. 2003). Laurales now comprise seven families, one of which is the Lauraceae (Persea). Amborella used to be placed within this order (Goremykin et al. 2003; Goremykin et al. 2004; Goremykin et al. 2005), but this grouping is no longer recognized (Goremykin et al. 2003; Lockhart and Penny 2005; Soltis and Soltis 2004; Stefanovic et al. 2004; Zanis et al. 2002). The taxonomy of the order Piperales has varied a lot, but with the use of molecular data, it now comprises only four families, one of them being the Aristolochiaceae (Saruma). The position of Piperales is still subject to debate. Canellales has only two families, one of which is the Winteraceae (Drimys), and it is an order usually strongly supported by both molecular and non-molecular data (Soltis and Soltis 2004). However, the families of this order were not always considered closely related either to each other or to Piperales. In fact, Winteraceae were once suspected of being the most primitive angiosperms (Endress 1986) and have also been included within the eudicots (Nickrent and Soltis 1995). Furthermore, clades showing Magnoliales + Laurales and Piperales + Canellales are often proposed (Graham and Olmstead 2000a; Hilu et al. 2003; Jansen et al. 2007; Sauquet et al. 2003; Zanis et al. 2002). Next to these clades are found monocots, Chloranthaceae and Ceratophyllaceae.

Eudicots (tricolpates)

Plants were once divided into monocots and dicots, but even though monocots are monophyletic, dicots are paraphyletic. This traditional monocot/dicot dichotomy has to be abandoned, and the term eudicots should be used instead for the dicots that do not belong to the basal lineages (Chase et al. 1993). The name tricolpates (Donoghue and Doyle 1989) was first used to designate these dicots, but the term eudicots now seems to be preferred (Doyle and Hotton 1991). Eudicots are an important group, since they encompass 75% of all angiosperms, with 175,000 species, and this clade has been recovered as monophyletic with high support by most recent molecular analyses. Partly because of this huge diversity and phenomenal variation (in number of species, but also in morphology, anatomy and biochemical features), the relationships among eudicots still need more clarification, especially for the most ancient eudicot species (Hilu et al. 2003; Judd and Olmstead 2004; Soltis and Soltis 2004; see Figure 4). By looking at the data, it is not difficult to imagine that the evolution of eudicots proceeded mostly by a series of very rapid diversifications followed by a long period of lineage persistence and extinction (Soltis and Soltis 2004). Indeed, this cycle might have been repeated subsequently.

[Figure 4 image not reproduced. Terminal taxa shown: Ranunculales, Sabiaceae, Proteales, Buxales, Trochodendrales, Gunnerales, Berberidopsidales, Dilleniaceae, Caryophyllales, Polygonales, Vitales, Crossosomatales, Geraniales, Zygophyllales, Celastrales, Malpighiales, Brassicales, Malvales, Cornales, Ericales, Gentianales, Aquifoliales, Asterales, Dipsacales. Labelled groups: caryophyllids, fabids, malvids, lamiids, campanulids.]

Figure 4: Relationships among eudicots are still not completely resolved, especially for ancient eudicot species, as well as eurosids I, lamiids and campanulids. Source: Judd and Olmstead (2004); Reproduced with permission from The Botanical Society of America (Appendix C).

At the base of eudicots is the monophyletic order Ranunculales, strongly supported as sister to all other eudicots (Chase 2004; Chase et al. 1993; Hoot and Crane 1995; Hoot et al. 1999; Jansen et al. 2007; Judd and Olmstead 2004). This order was formerly classified with woody magnoliids because of the shape of their numerous flowers. Within this order is found the family Papaveraceae (Papaver), which is believed to be the sister family of all remaining Ranunculales. Nonetheless, other surveys support another family (Eupteleaceae) as the most basal Ranunculales (Hilu et al. 2003; Kim et al. 2004). Ranunculales are the only basal eudicots whose placement is mainly supported by molecular studies, and all remaining basal groups still need additional studies.

The majority of eudicots are found in a clade usually called core eudicots (or core tricolpates, as seen in Figure 4; Chase et al. 1993; Soltis et al. 1997; Soltis and Soltis 2004), which is probably among the best-supported major clades of angiosperms (Hoot et al. 1999; Hilu et al. 2003). These eudicots are usually further divided into seven major clades, although the relationships among them are far from being well resolved, partly because rapid radiation has occurred. At the base of core eudicots, the order of branches is poorly supported (Savolainen et al. 2000; Soltis et al. 2000, 2003; Kim et al. 2004). Caryophyllales (Beta), asterids (Nicotiana) and rosids (Arabidopsis, Populus, Pisum and Vitis) are the species-rich clades. Asterids are probably closer to Caryophyllales (Hilu et al. 2003; Jansen et al. 2007; Soltis and Soltis 1997a).

Caryophyllales are divided into two large groups: the core and the noncore Caryophyllales (Cuenoud et al. 2002). Most of them, the so-called core Caryophyllales, were recognized as a clade a long time ago, possibly as early as 1864, because they are closely related morphologically. More recently, molecular data helped find other related species, which are now referred to as the noncore Caryophyllales or Polygonales. Core Caryophyllales are divided into 19 families, one of which is the amaranth family (Beta), but some of these families are either para- or polyphyletic. Although the monophyly of the core Caryophyllales is moderately supported, further investigations are still needed for more rigorous classifications. The Polygonales clades are not strongly supported. Caryophyllales are suspected of being sister to the asterids (Soltis et al. 2000; Jansen et al. 2007).

Rosids and Asterids are the two largest clades of eudicots. In fact, each of these clades represents about one-third of all angiosperms. Because they encompass so many different lineages, a major reorganization of their groupings is yet to come (Hoot et al. 1999).

With the use of molecular data, the former clades rosids I-IV (Chase et al. 1993) were rearranged, with new groupings of families that used to be found within polyphyletic groups, so that a new group called Rosids has emerged and is now generally well accepted. Vitales (Vitis) is suspected of being the sister to the rest of Rosids, although this claim is not strongly supported and this grouping might thus have to be positioned elsewhere, even with basal eudicots (Chase et al. 1993; Soltis et al. 2000; Savolainen et al. 2000; Kim et al. 2004).

Within this heterogeneous clade of orders supported by molecular data, Rosids comprise 140 families, most of these belonging to either eurosids I or eurosids II [2], aside from at least eight smaller poorly resolved clades. Within eurosids I are found seven orders and two unassigned families (Judd and Olmstead 2004), two of these orders being the Fabales (Pisum) and the Malpighiales (Populus). The order Fabales comprises three subgroups, one being the Papilionoideae (Pisum). There is still no strong support within the eurosids I, even though small monophyletic clades have been resolved, for example the families Fabaceae (Pisum) and Salicaceae (Populus). Fabales and Malpighiales have only recently been recognized with the aid of molecular data, but there are no obvious synapomorphies (Judd et al. 2008). Thus, phylogenetic links have to be clarified. For example, even though Pisum usually appears with other Rosids, Glycine, a close relative belonging to the same subfamily Papilionoideae, was once found with a magnoliid, namely Drimys (Nickrent and Soltis 1995). Eurosids II have a smaller number of orders, with only three, as well as one family still unplaced. One of these orders is the Brassicales, which encompasses the first completely sequenced plant, Arabidopsis. Brassicales are a monophyletic clade strongly supported by DNA studies (Hilu et al. 2003; Kallersjo et al. 1998; Soltis et al. 2003; Soltis et al. 1997; Soltis et al. 2000), despite being morphologically heterogeneous. Within this order, the family Brassicaceae (Arabidopsis), the largest of all this order's families, is strongly supported by both molecular and morphological data (Judd and Olmstead 2004).

Rosids are by far the largest controversial group of angiosperms, because relationships within the two subclades eurosids I and II, as well as between them, still have to be clarified. This lack of resolution is partly due to the age of the clade, at least 90 MYA (Crepet and Nixon 1998; Crepet et al. 2004; Magallon et al. 1999), which probably permitted substantial diversification, but also to missing morphological features. Clear synapomorphies have thus not been confirmed.

Asterids encompass no fewer than 114 families, for a total of about 80,000 species (Albach et al. 2001). Many of them were identified more than 200 years ago, but other taxa were only recognized with molecular characters. However, the identification of which traits should be considered as adequate synapomorphies that unite asterids has not yet been finalized. This strongly supported clade now includes species of older polyphyletic clades (Chase et al. 1993; Judd and Olmstead 2004; Olmstead et al. 1993; Olmstead et al. 2000; Olmstead et al. 1992; Savolainen et al. 2000; Soltis et al. 1997; Soltis and Soltis 2004; Soltis et al. 2000). However, certain clarifications are required about the relationships of this grouping with other eudicots, as well as within the clade, particularly at the most basal position of asterids (Judd and Olmstead 2004). As previously mentioned, this major clade is often found as sister group to Caryophyllales, but it cannot be constrained to be closer to Rosids. Even though most of the asterid species were recognized a long time ago, this is not the case with the basalmost positions: Cornales and Ericales (or the opposite) are considered successive sisters to all other asterids. These two new clades have been recognized by DNA studies. The remaining species are considered core asterids, divided into two major subclades: lamiids (euasterids I) and campanulids (euasterids II)³ (Albach et al. 2001; Bremer 2002). Apart from the four unassigned families of lamiids, four orders are included, one of which is the Solanales (Nicotiana). Relationships among lamiids are weaker than among campanulids, but as a general rule of thumb, the groupings within all asterid clades need further investigation, especially for the more ancient ones, because of the lack of morphological synapomorphies (reviewed by Soltis and Soltis 2004; Judd and Olmstead 2004).

² Alternative terms have been proposed, after the last international symposium of the Angiosperm Phylogeny Group II (APG II, 2003), for eurosids I and eurosids II: fabids and malvids, respectively. Since they are still not included in the official APG II classification, they will not be used here.

Monocots

With about 52,000 species, representing 22% of all angiosperms (Mabberley 1993), monocots are a large group of plants that comprises, for example, orchids, lilies, palms, gingers, as well as grasses.

Because of their great importance in providing food for humankind, monocots, particularly grasses, are probably the best known species within all angiosperms. Even though monocots belong to an old classification system that dates back to 1703 (Chase 2004), they seem to be a well supported monophyletic group even with the inclusion of the genus Acorus (see below). It is quite clear, considering their anatomy and physiology, that monocots show a major dissimilarity compared to other angiosperms, especially the eudicots, but their exact position relative to the remainder of angiosperms also needs to be clarified. Most botanists still agree that they should be placed within angiosperms and not apart from them (Chase 2004). Nevertheless, it was stated at least once that monocots should form a separate clade (Tomlinson 1995), and Goremykin et al. (2005) obtained trees with monocots being paraphyletic. Monocots share characters with the family Aristolochiaceae within the order Piperales (Saruma). These resemblances were shown to be due to convergent evolution (Dahlgren et al. 1985), which was confirmed by nucleotide sequences (Qiu et al. 1999; Zanis et al. 2003; Zanis et al. 2002). Many synapomorphies, apart from the single cotyledon, have also been found among monocots (Chase 2004; Doyle and Donoghue 1992; Loconte and Stevenson 1991), despite the great diversity that exists among them. According to a few studies (Nickrent and Soltis 1995; Goremykin et al. 2003; Goremykin et al. 2004; Goremykin et al. 2005), monocots must be placed at the very base of angiosperms. This is congruent with molecular clock studies that estimated the appearance of that lineage to 134 and 170 MYA (Bremer 2000; Bremer 2002; Wikstrom et al. 2001). Because these studies suggest that monocots are even more ancient than what fossil records indicate for the common ancestor of all angiosperms, monocots are most commonly suggested as sister to Ceratophyllum or magnoliids (e.g., Chase et al. 1993), but these conclusions are not strongly supported. Nonetheless, two recent analyses placed the monocots as sister to the eudicots with strong support (Jansen et al. 2007; Saarela et al. 2007). Of course, different topologies and other methods to calibrate trees still need to be investigated (reviewed by Chase 2004).

³ At the last international symposium of the Angiosperm Phylogeny Group II (APG II, 2003), euasterids I and II were renamed lamiids and campanulids, respectively.

The root of the monocots is fairly consistent, with the order Acorales comprising only one family, Acoraceae, referred to as the sweet flag family (Judd et al. 2008). This latter grouping, in turn, has only one genus, Acorus, which is believed to be the sister of all other monocots by most analyses and with strong support (Chase et al. 1993; Davis et al. 1998; Duvall et al. 1993; Jansen et al. 2007), despite a recent study that instead placed this order within the order Alismatales (Davies et al. 2004; Qiu et al. 2000). In fact, the Acoranan hypothesis (Duvall et al. 1993) claimed that ancestral monocots were similar to Acorus and grew in freshwater, in temperate climates. Although the two basalmost monocot clades (Acorales and Alismatales) are predominantly aquatic, this hypothesis was partially refuted because of too many divergent traits (Chase 2004). Acorus, a monocotyledonous genus of two to six species and subspecies, has long been considered part of the family Araceae, but was finally removed because of too many differences (Bogner and Nicolson 1991; Grayum 1987). Nowadays, most botanists place this genus with monocots, despite the fact that it is sometimes found with eumagnoliids or Ceratophyllum, or even as sister taxon to the Nymphaeales, according to a few morphological and molecular analyses (Bharathan and Zimmer 1995; Davies et al. 2004; Nickrent and Soltis 1995; Savolainen et al. 2000; Soltis et al. 1997). Even though a certain number of morphological characters are shared among the Acorales and primitive dicots, such as the paleoherb Ceratophyllum, this resemblance might be the result of either homoplasy or symplesiomorphies. Nevertheless, Acorus is sometimes found away from the other monocots. Moreover, it was shown that the supposed paraphyly of monocots was no longer strongly supported when the genus Acorus was removed from the analysis (Bharathan and Zimmer 1995; Soltis et al. 1997). We therefore need more evidence to know whether Acorus is a sister group to monocots, to Ceratophyllum or within Alismatales.

It seems that the positions of other monocots are strongly supported, although their subclades are not always (Soltis and Soltis 2004). The order Alismatales is the sister group to all monocots other than the Acorales (Friis et al. 2004).

Of all nodes within monocots, only three are still not clearly determined, one of these being the placement of most orders of the larger clade lilioids. Most of the species belonging to Asparagales (Asparagus) were placed earlier in the lilioids, which is an old polyphyletic grouping that is no longer accepted (Chase et al. 2000; Hilu et al. 2003). and Asparagales are sometimes placed next to each other. In terms of species, Asparagales is the largest order of the monocots, and includes Orchidaceae, the most numerous family of monocots. Although the positions of most families within Asparagales are congruent, the placement of orchids within this order is still subject to controversy, mostly because of molecular data (reviewed by Chase et al. 2004). Indeed, Asparagales were considered a broader clade, but certain analyses only recovered the core Asparagales as a monophyletic group (Chase et al. 2000; Chase et al. 1995; Soltis et al. 2000). Nevertheless, APG II (2003) recognized 14 families within Asparagales, one of these being Asparagaceae (Asparagus).

The order Poales, commonly referred to as grasses, includes the family Poaceae (Oryza and Zea), as well as 16 other families. The grass family is species-rich, with approximately 10,000 species. Because most people and many animals depend on grasses for their daily diet, this group of plants has been extensively investigated. Grasses cover about 20% of the earth's land surface. DNA analyses strongly support the prevailing classification of grasses (Chase et al. 2004). In order to closely investigate grasses, researchers of the Grass Phylogeny Working Group (GPWG) chose 59 representative species (Kellogg 2001). Because of uncertainties associated with the calibration of the molecular clock and the lack of fossils, the age of extant grasses has been dated to between 55 and 70 MYA (Wikstrom et al. 2001; Bremer 2002).

Ceratophyllaceae

Ceratophyllum has attracted much interest for a long time. In fact, as early as 1837, it was described as a "vegetable vagabond" (Schleiden 1837). Moreover, the Ceratophyllaceae was first mentioned as a family even earlier, in 1821, and was then proposed to be a close relative of Nymphaeaceae (Les 1988). This family, sometimes referred to as the hornwort family, has only one genus, Ceratophyllum. Six different species and three subspecies have so far been identified, although in addition extinct taxa have so far been recognized from fossil records (Herendeen et al. 1990; Les 1986; Les 1988a; Les 1988b). All lineages are submersed aquatic plants without any roots. They grow in freshwater and are perennials. This genus is the only known water-pollinated (hydrophile) dicot (Les 1988a; Les 1988b; Les et al. 1991). Morphologically, all species of the genus Ceratophyllum are very similar. Leaves are whorled, and flowers are unisexual, having seven or more bracts. They also have only one carpel. Even if this group is obviously monophyletic, many adaptations to life as an aquatic herb are noticeable. Therefore this taxon is considered highly modified, despite fossil records suggesting very little morphological evolution for almost 50 MYA, due mostly to its aquatic environment (Herendeen et al. 1990). Relationships among all species rely mostly on leaf and fruit variations (Judd et al. 2008). Hence, with Ceratophyllaceae's morphology and anatomy being quite different from those of any other clade, its place is difficult to assess. Morphological studies have shown similarities with Chloranthaceae, Aristolochiaceae, and especially the Cabombaceae within Nymphaeaceae (Cronquist 1981; Cronquist 1988; Czaja 1978; Doyle et al. 1994; Doyle and Endress 2000), although a closer look at the data established no convincing links between Ceratophyllaceae and Nymphaeaceae (Les 1988). For instance, Ceratophyllum is often considered the only remaining genus of very ancient diverse lineages (Endress 1994; Herendeen et al. 1990) that suffered from extensive extinctions over time, and this absence of closely related species is probably why relationships are difficult to establish (Les 1988). Ceratophyllum might have persisted because of its aquatic environment being quite stable and its rapid initial adaptation to this habitat (Dilcher 1989). In fact, the amino acid composition of Ceratophyllum, when compared to analogous proteins from surface higher plants and green algae, is quite dissimilar (Oganezova and Nalbandian 1976). As for the extant taxa of basal angiosperms, Ceratophyllum may have been modified a lot since the first speciation event, despite its long period of non-evolution, so-called "stasis" (Endress 1993; Herendeen et al. 1990). Indeed, because of the unusual morphological characters of Ceratophyllum, incongruent conclusions have been obtained so far (Endress 1993; Les 1988; Les 1989).

Molecular studies have only been able to resolve Ceratophyllum as part of a polytomy (Soltis et al. 2000). This is probably why this genus is sometimes absent from cladistic studies. In earlier molecular analyses, it was placed at the basal position of all angiosperms, and the fossil record tends to agree with this location, partly because fossils of Ceratophyllum are amongst the oldest known plants (Dilcher 1989). However, some surveys argued that this striking position is probably due to the long branch attraction (LBA) phenomenon, as a consequence of many homoplasies (Chase et al. 1993; Dilcher 1989; Les 1988; Les et al. 1991; Qiu et al. 1993). However, fossils with fruits analogous to those of Ceratophyllum are considered early angiosperms (Dilcher 1989). In fact, Ceratophyllaceae has a fossil record that tends to show that this taxon is quite ancient. Other studies, using either single gene datasets or combined gene matrices, placed this plant genus at different positions. Hence, in a few surveys, Ceratophyllum is a weakly or moderately supported sister of all eudicots, next to the Ranunculales (Hilu et al. 2003; Moore et al. 2007; Saarela et al. 2007; Soltis and Soltis 1997a; Soltis et al. 1999a). It has also been inferred to be a basal group with the eumagnoliids (Goremykin et al. 2009; Savolainen and Chase 2003; Zanis et al. 2002). Ceratophyllum can sometimes be found as sister to the monocots, although bootstrap support is rather low (Chaw et al. 2000; Graham and Olmstead 2000; Graham and Olmstead 2000a; Qiu et al. 1999; Qiu et al. 2000; Soltis et al. 2003; Soltis et al. 1997). When this position is obtained, it has also been observed that Ceratophyllum forms a clade with Acorus (Savolainen et al. 2000). Xia (2003) used an rpb1 marker, as well as atpB and rbcL sequences, to build phylogenies. Ceratophyllum was then placed as sister group to either Beta, a core eudicot from the order Caryophyllales, or Saruma, a eumagnoliid from the order Piperales, although these placements were not strongly supported. These results were partly confirmed by a recent study, where Ceratophyllum was placed as a sister genus either to the eudicots or to a Piperales species, when using ML and MP, respectively (Moore et al. 2007). But the authors claimed that this last position next to Piperales was due to LBA. It is worth noting that trees are sometimes altered depending on whether fossils are included or excluded. Exclusion of fossils gave rise to a large polytomy, whereas their use rather produced a tree with Chloranthus, then Ceratophyllum, as successive sisters to angiosperms (Nixon et al. 1994).

Because Ceratophyllum, when it can be positioned, is either the basal-most angiosperm or found among the eumagnoliids, monocots or eudicots, supplementary data are required to determine its phylogenetic position more conclusively. In fact, resolving this issue, along with the position of monocots, is considered among the most important questions remaining for the phylogeny of basal angiosperms (Soltis et al. 2005a). But until we find a set of genes that can place Ceratophyllum at the same location consistently, this issue will remain controversial. Here, we hypothesized that adding rpa1 and rpc1 sequences to existing datasets can be of help in this matter.

CHAPTER 3

PURPOSE OF THE STUDY

The objective of this study is to determine the phylogenetic position of Ceratophyllum using RNA polymerase genes, along with other available genes, such as the nuclear phyA, phyC and 18S genes, the mitochondrial atpA, cox1 and matR genes, as well as the chloroplastic atpB, matK, psaA, psbB and rbcL genes.

CHAPTER 4

MATERIALS AND METHODS

Species

Twenty-two plant species, drawn from most major groups of angiosperms, were included in this study. Two gymnosperms, Ginkgo and Pinus, were used as outgroups. The other species are angiosperms that can be divided into basal angiosperms, monocots and eudicots, according to accepted phylogenies. There are therefore nine basal angiosperms (Amborella, Ceratophyllum, Drimys, Illicium, Liriodendron, Magnolia, Nymphaea, Persea and Saruma), four monocots (Acorus, Asparagus, Oryza and Zea), as well as seven eudicots (Arabidopsis, Beta, Nicotiana, Papaver, Pisum, Populus, and Vitis). Apart from the outgroup clade, this angiosperm sampling thus includes 20 families from 18 orders. Refer to Table 1 for the list of all 22 species.

Sequencing of rpa1 and rpc1

Clones containing amplified products of regions D through G of the rpa1 and rpc1 genes were obtained from Mehrdad Hajibabaei and Junnan Xia. These DNA fragments were cloned using the TOPO TA Cloning kit (vector pCR 2.1) from Invitrogen, as described in their respective theses (Hajibabaei 2003; Xia 2003). Multiple clones of both genes were sequenced using universal primers (M13 forward and reverse primers) as first primers, in order to obtain both forward and reverse strands. See Table 3.

For completion of full-length sequences, internal primers were designed with the Primer3 web site (http://fokker.wi.mit.edu/primer3/input.htm).

Table 1: List of the 22 species analyzed, with their orders and families, according to the classification adopted by the international symposium of the Angiosperm Phylogeny Group II in 2003. Adapted from Drouin et al. (2008).

Order             Family            Genus and species

Angiosperms
Eudicots
Brassicales       Brassicaceae      Arabidopsis thaliana
Caryophyllales    Amaranthaceae     Beta vulgaris
Solanales         Solanaceae        Nicotiana tabacum
Ranunculales      Papaveraceae      Papaver orientalis
Fabales           Fabaceae          Pisum sativum
Malpighiales      Salicaceae        Populus trichocarpa
Vitales           Vitaceae          Vitis vinifera
Monocots
Acorales          Acoraceae         Acorus calamus
Asparagales       Asparagaceae      Asparagus officinalis
Poales            Poaceae           Oryza sativa
Poales            Poaceae           Zea mays
Basal
Amborellales      Amborellaceae     Amborella trichopoda
Ceratophyllales   Ceratophyllaceae  Ceratophyllum demersum
Canellales        Winteraceae       Drimys winteri
Austrobaileyales  Schisandraceae    Illicium parviflorum
Magnoliales       Magnoliaceae      Liriodendron tulipifera
Magnoliales       Magnoliaceae      Magnolia soulangeana
Nymphaeales       Nymphaeaceae      Nymphaea odorata
Laurales          Lauraceae         Persea americanum
Piperales         Aristolochiaceae  Saruma henryi

Gymnosperms (outgroups)
Ginkgoales        Ginkgoaceae       Ginkgo biloba
Coniferales       Pinaceae          Pinus nigra

Contig assemblies

After receipt from the sequencing facilities, all the putative rpc1 and rpa1 sequences were confirmed with BLASTx searches on the National Center for Biotechnology Information website (http://blast.ncbi.nlm.nih.gov/Blast.cgi) (Altschul et al. 1990). Sequencher software (version 4.0.5; Gene Codes Corporation) was used to assemble and edit reads into contigs. Consensus sequences, without the primers at both ends, were then exported as FASTA files before alignment.

Alignment and supermatrices

Because mitochondrial, chloroplastic and nuclear sequences were available for hundreds of plant species, more genes were included in the datasets in order to add more characters to the supermatrices. Therefore, fifteen genes from the three plant genomic compartments were included for the twenty-two species listed in Table 1: three mitochondrial, five chloroplastic and seven nuclear genes. All of these sequences came from Drouin et al. (2008), except for the 11 new nuclear rpc1 and rpa1 sequences obtained here. Please refer to Table 2 and Table 3 below for, respectively, the list of all 15 genes included and the new rpc1 and rpa1 sequences deposited in GenBank under the listed accession numbers.

Table 2: List of nuclear, mitochondrial and chloroplastic genes coding for proteins and nuclear 18S ribosomal DNA gene. Their respective alignment lengths are in parentheses. Percentages in brackets correspond to the proportion of sequences from the three genomes, compared to the entire matrix. Adapted from Drouin et al. (2008).

Gene   Product                                                          Alignment length

Nuclear (17256) [52 %]
rpb2   RNA polymerase II second largest subunit                         (3678)
phyA   Phytochrome A                                                    (3414)
phyC   Phytochrome C                                                    (3402)
rpb1   RNA polymerase II largest subunit                                (3048)
rpc1   RNA polymerase III largest subunit                               (1725)
rpa1   RNA polymerase I largest subunit                                 (1989)

Mitochondrial (5253) [18 %]
atpA   ATPase alpha subunit                                             (1515)
cox1   Cytochrome oxidase subunit 1                                     (1629)
matR   Maturase R                                                       (2109)

Chloroplastic (8478) [25 %]
atpB   ATP synthase beta subunit                                        (1500)
matK   Maturase K                                                       (1734)
psaA   Photosystem I subunit A                                          (2199)
psbB   Photosystem II CP47 protein                                      (1584)
rbcL   Ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit    (1461)

Nuclear ribosomal RNA gene (1819) [5 %]
18S    Ribosomal RNA gene                                               (1819)
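The alignment lengths in Table 2 can be cross-checked against the subtotals given in the table and against the supermatrix sizes reported later in this chapter (32,806 nucleotides and 10,329 amino acids). The following Python sketch is only an arithmetic check written for illustration; the dictionary and function names are hypothetical and are not part of the original analysis.

```python
# Alignment lengths in nucleotides, copied from Table 2.
# The grouping names below are illustrative labels, not names used in the thesis.
NUCLEAR = {"rpb2": 3678, "phyA": 3414, "phyC": 3402,
           "rpb1": 3048, "rpc1": 1725, "rpa1": 1989}
MITOCHONDRIAL = {"atpA": 1515, "cox1": 1629, "matR": 2109}
CHLOROPLASTIC = {"atpB": 1500, "matK": 1734, "psaA": 2199,
                 "psbB": 1584, "rbcL": 1461}
RIBOSOMAL = {"18S": 1819}


def total(lengths):
    """Sum the alignment lengths of one group of genes."""
    return sum(lengths.values())


# The 14 protein-coding genes; the 18S rDNA gene is not translated.
coding_nt = total(NUCLEAR) + total(MITOCHONDRIAL) + total(CHLOROPLASTIC)
all_nt = coding_nt + total(RIBOSOMAL)

# Three nucleotides per codon give the length of the amino acid matrix.
amino_acids = coding_nt // 3

print(total(NUCLEAR), total(MITOCHONDRIAL), total(CHLOROPLASTIC))
print(all_nt, amino_acids)
```

The same division by three links the per-gene figures: 1989 nucleotides correspond to 663 codons for rpa1, and 1725 nucleotides to 575 codons for rpc1.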

Table 3: Species and accession numbers of the 11 sequences. Adapted from Drouin et al. (2008).

Genus and species          rpc1        rpa1
Acorus calamus             FJ416892    -
Asparagus officinalis      FJ416893    -
Ceratophyllum demersum     -           FJ416898
Drimys winteri             FJ416894    FJ416899
Illicium parviflorum       FJ416895    FJ416900
Liriodendron tulipifera    FJ416896    FJ416901
Nymphaea odorata           -           FJ416902
Papaver orientalis         FJ416897    -

For rpa1, the nucleotide and amino acid datasets contain 1989 and 663 aligned characters, respectively; for rpc1, the aligned nucleotide and amino acid matrices contain 1725 nucleotides and 575 amino acids, respectively (Appendix A and B).

When possible, the same species were used for all genes, but when the sequences were not available, sequences of the same genera or family were chosen. All of these sequences were retrieved and aligned by Dr. Drouin for an on-going independent study (Drouin et al. 2008). A complete list of these species, as well as their GenBank accession numbers and taxon substitutions when relevant, are shown in Tables 4, 5 and 6 for the nuclear, mitochondrial and chloroplastic genes, respectively.

Since all of these fifteen sequences, except one (the nuclear 18S ribosomal DNA gene), came from protein-coding genes, they were also translated into amino acid sequences.
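The translation step can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: it uses the standard genetic code for every gene, ignores RNA editing, and assumes that aligned sequences are in frame with gaps and missing data occurring as whole codons. The thesis does not say which tool performed the translation, and the function name `translate_aligned` is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...). Assumption: the standard code is used.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_aligned(nucleotides):
    """Translate an in-frame aligned sequence one codon at a time.

    '---' (indel) becomes '-', '???' (missing data) becomes '?', and any
    other codon that is not in the table (ambiguous or partly gapped)
    becomes 'X'.
    """
    seq = nucleotides.upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    residues = []
    for start in range(0, len(seq), 3):
        codon = seq[start:start + 3]
        if codon == "---":
            residues.append("-")
        elif codon == "???":
            residues.append("?")
        else:
            residues.append(CODON_TABLE.get(codon, "X"))
    return "".join(residues)


print(translate_aligned("ATGGCC---???"))
```

Keeping gap and missing-data codons as single symbols preserves the column correspondence between the nucleotide and amino acid matrices.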

47 Table 4: Species and accession numbers of the nuclear genes included. Except for the sequences listed in Table 3 and those of Populus and Vitls, all other sequences are from Drouin et al. (2008). Note: the rpbl and rpcl sequences of Populus trlchocarpa were obtained from the Doe Joint Genomics institute (JDI) at http://genome.jgi-psf.org/Poptrl_l/Poptrl_l.home.html and the rpb2 gene of rice and maize and the rpcl gene of Arabidopsis were found at http://tigrblast.tigr.org/ (TIGR). Species rpb2 phyA phyC rpbl rpcl rpal 18S AF190061 Acorus - Acorus - This study - L24078 calamus - gramineus Amborella AY699216 AF190062 AF190063 AF519541 - AY490544 U42497 trichopoda Arabidopsis AT5G60040 Z19121 L21154 X17343 AL031986 CAB41189 X16077 thaliana (TIGR) AY563268 AF276714 AF276715 AF069205 Asparagus Dioscorea Asparagus Asparagus EU543182 This study - Asparagus offlcinialis sansibarensis falcatus falcatus falcatus DQ017099 AY190015 AY146715 Beta vulgaris Dianthus Stellaria Stellaria EU543183 - - AF161096 caryophyllus longipes longipes Ceratophyllum AF276716 AF276717 EU543184 This study U42517 demersum - - Drimys winter/' AY699218 AF190080 AF190081 EU543185 This study This study U42823

Ginkgo biloba AF020843 - - AY490553 AY490563 AY490549 D16448 AF190068 AF276729 Illicium AY699220 Austrobaileya Illicium EU543186 This study This study L75832 parviiflorum Illicium anisatum scandens oliqandrum Liriodendron AF190064 DQ058631 AY396711 EU543187 This study This study AF206954 tulipifera Annona sp. AF020841 Magnolia Magnolia AF190094 AF190095 AF519539 - - AF206956 soulangeana virqiniana DQ020640 Nicotiana Nicotiana X66784 - EU543188 - - AJ236016 tabacum sylvestris AY490565 AF096696 Nymphaea AF043427 AF190098 AF190099 AF519540 Nymphaea This study Nymphaea odorata lotus sp. LOC Os03g44484 Oryza sativa AB109891 AB018442 AC084218 NM 001059 AP003626 X00755 (TIGR) 710 DQ017122 AF276710 AF276711 L75836 Papaver Papaver Akebia Akebia EU543189 This study - Hypercoum orientalis somniferum quinata quinata imberbe AF190072 AF190073 U52031 Persea DQ017116 Calycanthus Calycanthus EU543190 - Sassafras americanum Lindera glauca - floridus floridus albitum AY699215 D38245 Pinus nigra Pseudotsuga - - AF519536 AY490560 AY490546 Pinus menziesii elliotti AJ556782 AF500550 Pisum sativum Tephrosia M37217 Bunchosia EU543191 - - U43011 purpurea armeniaca AJ001318 GW GW AY652861 Populus Populus - 1.57.350.1 1.1.2689.1 AC216543 Populus trichocarpa - tremuloides (JDI) (JDI) balsamifera AY699219' Saruma henryi AF190104 AF190105 EU543192 Asarum caudatum AF207013 CAAP Vitis vinifera AM478907 AM451152 CU459264 CU459220 AM454997 AF207053 02000798 AZM5 91801 Zea mays AY260865 U61220 AF519538 - U42796 (TIGR) -

Table 5: Species and accession numbers of the mitochondrial genes included. Note: Except for the sequences listed in Table 3 and those of Populus and Vitis, all other sequences are from Drouin et al. (2008); the atpA sequence of Populus trichocarpa was obtained from the DOE Joint Genomics Institute (JDG) at http://genome.jgi-psf.org/Poptr1_1/Poptr1_1.home.html.

Species atpA cox1 matR

Acorus calamus - AF193944 -

Amborella trichopoda AY009407 AF193953 AF197813

Arabidopsis thaliana Y08501 NM_001284 NM_001284

Asparagus officinalis AF197713 DQ508957 AF197736

Beta vulgaris NC_002511 NC_002511 NC_002511

Ceratophyllum demersum AF197627 AF193954 AF197730

Drimys winteri AF197673 AY009443 AF197781

Ginkgo biloba AF209110 AF020565 AF197722

AF197663 AY009445 AF197740 Illicium parviflorum Illicium floridanum Illicium lanceo/atum Illicium floridanum AF197690 AF197774 Liriodendron tulipifera DQ980415 Liriodendron chinense Liriodendron chinense AF209100 AF020568 AF197770 Magnolia soulangeana Magnolia grand/flora Magnolia grandiflora Magnolia tripetala

Nicotiana tabacum NC_006581 NC_006581 NC_006581

AF197727 Nymphaea odorata AF209102 DQ508951 Nymphaea sp.

Oryza sativa NC_007886 NC_007886 NC_007886

AF197714 AY009429 AF197810 Papaver orientalis Ranunculus sp. Akebia quinata Akebia quinata AF197682 AY009440 AF197798 Persea americanum Laurus nobilis Cinnamomum zeylanicum Laurus nobilis

Pinus nigra AF209108 EF114116 AF197723 Pinus strobus Pinus kwanqtunqensiss Pinus sp.

Pisum sativum X05366 X14409 AY453078

Scaffold 20770 U77623 Populus trichocarpa (JDG) Populus tremuloides -

DQ508955 Saruma henryi AF197672 AF197752 Houttuynia cordata

Vitis vinifera CU459293 EU281071 AM470652

Zea mays NC_007982 NC_007982 NC_007982

Table 6: Species and accession numbers of chloroplastic genes included. Except for the sequences listed in Table 3 and those of Populus and Vitis, all other sequences are from Drouin et al. (2008).

Species atpB matK psaA psbB rbcL

Acorus calamus AJ879453 AJ879453 NC_007407 AJ879453 AJ879453

Amborella trichopoda AF235041 AJ506156 AJ344262 AF235042 L12628

Arabidopsis thaliana NC_000932 NC_000932 NC_000932 NC_000932 NC_000932

AB029804 AJ344261 AY007465 Asparagus officinalis AJ235400 Asparagus Acorus Lilium L05028 cochinchinensis calamus superbum AJ400848 AJ400848 Beta vulgaris DQ067451 DQ116790 Spinacia Spinacia DQ067450 oleracea oleracea

Ceratophyllum demersum NC_009962 NC_009962 NC_009962 NC_009962 NC_009962

DQ887676 Drimys winter/' AF093425 Drimys AF180016 AF222708 AF093734 qranadensis

Ginkgo biloba AJ235481 AF456370 AF223226 AF222705 AJ235804

Illicium oligandrum NC_009600 NC_009600 NC_009600 NC_009600 NC_009600

Liriodendron tulipifera NC_008326 NC_008326 NC_008326 NC_008326 NC_008326

AF548640 AY007466 AF239784 Magnolia soulangeana Magnolia AJ344277 Magnolia M58393 Magnolia stellata qrandiflora stellata

Nicotiana tabacum NC_001879 NC_001879 NC_001879 NC_001879 NC_001879

NC_006050 AF092988 22091652 Nymphaea odorata AF188851 M77034 Nymphaea alba Clematis fusca Nymphaea sp.

Oryza sativa NC_008155 NC_008155 X15901 NC_008155 NC_008155

AJ344269 AJ347853 Papaver orientale U86394 AB110535 Coptis Coptis L08764 laciniata laciniata

Persea americana AJ621920 AJ247179 AJ344282 AJ347866 X54347

Pinus thunbergii NC_001631 NC_001631 NC_001631 NC_001631 NC_001631

Pisum sativum X03852 AY386961 AF223227 AY007467 X03853

NC_008235 IMC_008235 NC_008235 NC_008235 NC_008235 Populus trichocarpa Populus alba Populus alba Populus alba Populus alba Populus alba AJ344263 Saruma henryi AJ235595 AF543748 Asarum AF528911 L12664 caudatum

Vitis vinifera NC_007957 NC_007957 NC_007957 NC_007957 NC_007957

Zea mays X86563 X86563 X86563 X86563 X86563

All sequences from each gene were aligned with BioEdit, version 7.0.9.0 (Hall 1999), and adjusted by manual inspection. Since this sequence editor performs alignments using ClustalW (Thompson et al. 1997), it was used for the alignment of amino acid sequences, with the default gap-creation and gap-extension penalties of 10.00 and 0.10, respectively. Each ORF of the resulting amino acid sequences was verified before concatenation into nucleotide and amino acid supermatrices. Two datasets were then available: a nucleotide matrix that contains 32,806 nucleotides (15 genes) and a protein matrix with 10,329 amino acids from the 14 protein-coding genes. In order to discriminate between insertion/deletion events (indels) and missing data, they were coded as - and ?, respectively. These supermatrices were exported in PHYLIP and NEXUS formats.
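The concatenation step, with its distinction between indels (-) and missing data (?), can be sketched as follows. This is a minimal Python illustration, not the procedure actually used (the supermatrices were assembled from BioEdit alignments); the function name `concatenate` and the toy taxa are hypothetical. A taxon that lacks a gene receives a block of ? as wide as that gene's alignment, while gaps already present in an alignment are left untouched.

```python
def concatenate(gene_alignments, taxa):
    """Join per-gene alignments into one supermatrix.

    gene_alignments: list of dicts mapping taxon name -> aligned sequence.
    taxa: ordered list of all taxon names expected in the supermatrix.
    Taxa absent from a gene are filled with '?' (missing data); existing
    '-' characters (indels) are kept as they are.
    """
    rows = {taxon: [] for taxon in taxa}
    for alignment in gene_alignments:
        widths = {len(seq) for seq in alignment.values()}
        if len(widths) != 1:
            raise ValueError("sequences within a gene differ in length")
        width = widths.pop()
        for taxon in taxa:
            rows[taxon].append(alignment.get(taxon, "?" * width))
    return {taxon: "".join(parts) for taxon, parts in rows.items()}


# Toy example: taxon B has no sequence for the second gene.
genes = [{"A": "AC-T", "B": "ACGT"}, {"A": "GG"}]
print(concatenate(genes, ["A", "B"]))
```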

Phylogenetic analysis

In order to reconstruct the phylogenetic trees, different methods were used with the nucleotide and amino acid supermatrices. A clade of the two gymnosperms, Ginkgo and Pinus, was used as the outgroup. Then, phylogenetic trees of both nucleotide and protein datasets were built using Maximum parsimony (MP; Kluge and Farris 1969) and Maximum likelihood (ML; Felsenstein 1981) methods.

Congruence was tested by tree comparisons and bootstrap analyses (Felsenstein 1985). PAUP* version 4.0b10 (Swofford 2002) and PHYML 2.4.4 (Guindon and Gascuel 2003) were used to reconstruct phylogenetic trees by MP and ML, respectively.

MP trees were obtained using the tree-bisection-reconnection (TBR) branch-swapping algorithm, in which the initial trees were produced by random addition of sequences (10 replicates). All alignment gaps were treated as missing data. The most parsimonious trees were obtained using heuristic searches, and nonparametric bootstrap (Efron and Gong 1983; Efron et al. 1996; Felsenstein 1985) values (%) were calculated using 500 replicates of random additions and equal weights.
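The resampling behind nonparametric bootstrap values can be sketched in Python as follows. The sketch covers only the column resampling: each pseudo-replicate draws alignment columns with replacement until it has as many columns as the original matrix. The tree search on every replicate was carried out by PAUP* and is not reproduced here. The function names and the fixed random seed are illustrative assumptions.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one pseudo-replicate by sampling columns with replacement."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same sampled columns are used for every taxon, so each
    # resampled site remains a genuine column of the original alignment.
    return {taxon: "".join(seq[c] for c in columns)
            for taxon, seq in alignment.items()}


def bootstrap_replicates(alignment, n_replicates=500, seed=0):
    """Return a list of pseudo-replicate alignments (500 by default)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


toy = {"Ginkgo": "ACGTAC", "Pinus": "ACGAAC", "Oryza": "ATGAAT"}
replicates = bootstrap_replicates(toy, n_replicates=3, seed=1)
print(len(replicates), len(replicates[0]["Ginkgo"]))
```

The bootstrap value of a clade is then the percentage of replicate trees in which that clade is recovered.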

Before proceeding with ML phylogenetic analyses, it was necessary to find which substitution model best fits the nucleotide and amino acid datasets. ModelTest 3.7 (Posada and Crandall 1998) and ProtTest 1.4 (Abascal et al. 2005) were thus used with these respective datasets. ModelTest, used with PAUP* 4.0b10 (Swofford 2002) and with the Akaike Information Criterion (AIC; Akaike 1974), revealed that the nucleotide supermatrix was best described by the general time reversible (GTR or REV; Lanave et al. 1984; Rodriguez et al. 1990; Yang 1994b; Yang et al. 1994a) substitution model, along with invariable sites (+I; Hasegawa et al. 1985) and a discrete gamma distribution (+Γ; Yang 1994) as positional rate heterogeneity (Gu et al. 1995). ProtTest proposed the JTT (+Γ +F) substitution model (Jones et al. 1992) for the amino acid dataset. For ML analysis of the nucleotide dataset, the GTR substitution model and a mixed rate heterogeneity model with an invariable rate and 8 Γ-distributed rates were first used to obtain a phylogeny for the complete nucleotide dataset. For protein sequences, the JTT substitution model and a mixed rate heterogeneity model with an invariable rate and 8 Γ-distributed rates were also used to obtain a phylogeny for the complete amino acid dataset. Then TREE-PUZZLE 5.2 (Strimmer and von Haeseler 1996) was run in order to obtain 8 categories of Γ-distributed rates for both the nucleotide and amino acid datasets. These matrices were then used with Seq_mask, a Perl script written by Will Fischer that removes columns in an alignment according to the positions in the first sequence, which is in fact the matrix obtained from TREE-PUZZLE. As seen below, the first sequence, called mask, contains only 1 and 0, which correspond to sites that are kept (1) and sites that are deleted (0):

Infile (Before removal)    Outfile (After removal)
>mask                      >mask
111110000011111            1111111111
>seq1                      >seq1
ACGTAcgtacGTACG            ACGTAGTACG
>seq2                      >seq2
ACCGAcccccGTTCG            ACCGAGTTCG
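The masking step illustrated above can be sketched in a few lines of Python. This is only an illustration of the column-removal idea, not the Seq_mask Perl script itself; the function name `apply_mask` and the dictionary-based alignment are assumptions made for this example.

```python
def apply_mask(mask, sequences):
    """Keep only the alignment columns where the mask holds '1'.

    Hypothetical helper: it mimics the behaviour described for Seq_mask,
    but is not the original Perl script.
    """
    keep = [i for i, flag in enumerate(mask) if flag == "1"]
    masked = {}
    for name, seq in sequences.items():
        if len(seq) != len(mask):
            raise ValueError(f"{name} has length {len(seq)}, mask has {len(mask)}")
        masked[name] = "".join(seq[i] for i in keep)
    return masked


# Same toy data as in the example above.
mask = "111110000011111"
alignment = {"seq1": "ACGTAcgtacGTACG", "seq2": "ACCGAcccccGTTCG"}
result = apply_mask(mask, alignment)
for name, seq in result.items():
    print(f">{name}\n{seq}")
```

With the toy input, seq1 becomes ACGTAGTACG and seq2 becomes ACCGAGTTCG, matching the outfile shown above.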

These masked datasets were constructed by Dr. Drouin. For both the nucleotide and amino acid matrices, three more phylogenies were obtained after the last categories (the ones with the fastest relative rates) had been sequentially removed.

The complete dataset was also partitioned into the seven nuclear, the five chloroplastic, as well as the three mitochondrial genes. All these sequences, except for the 18S ribosomal gene, were translated into three amino acid datasets. Phylogenetic analyses were then conducted, using the same parameters described above, in order to compare trees with those inferred with the complete datasets.

MEGA4 (Tamura et al. 2007) was used to draw trees. It was also used to estimate the p distances, i.e., the number of base differences per site averaged over all sequence pairs, for the genes rpa1 and rpc1.

Consistency in tree topology

In order to investigate whether various tree topologies were statistically different, the SH-test, namely the non-parametric Shimodaira-Hasegawa test (Goldman et al. 2000; Shimodaira and Hasegawa 1999), was performed, using the RELL (resampling of estimated log-likelihoods) approximation (Kishino et al. 1990), as implemented in PAUP* version 4.0b10 (Swofford 2002).

CHAPTER 5

RESULTS

Analysis of rpa1 and rpc1 sequences

As mentioned in Materials and Methods, eleven genes were sequenced in this study. Since these sequences were amplified using primers designed for the conserved regions D and G (see Introduction), each of these five rpa1 and six rpc1 genes has a length of approximately 1400 base pairs.

For rpa1, the nucleotide and amino acid datasets contain 1989 and 663 aligned characters, respectively (see Appendix A and B). Of the nucleotide dataset, 1207 sites (61%) are variable, and of these, 893 (45% of all sites) are parsimony-informative. For rpc1, the nucleotide and amino acid datasets contain 1725 and 575 aligned characters, respectively (see Appendix A and B). Of the nucleotide dataset, 1075 sites (62%) are variable, and of these, 784 (45% of all sites) are parsimony-informative.

MEGA4 (Tamura et al. 2007) was used to estimate the number of base differences per site, averaged over all sequence pairs, for these two genes. The overall p distances for the rpa1 and rpc1 nucleotide datasets are 0.333 and 0.288, respectively, indicating a higher rate of evolution for rpa1. As well, the ratio of transitional over transversional differences per site is 0.860 for rpa1 and 0.907 for rpc1. Given that a ratio below 0.4 indicates highly saturated sequences (Holmquist 1983), these ratios are well above this level, and thus the two genes rpa1 and rpc1 are far from being saturated.
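As an illustration of the two statistics discussed above, the following Python sketch computes an overall p distance (proportion of differing sites, averaged over all sequence pairs) and a transition/transversion ratio. The function names and the toy sequences are assumptions made for this example, and the ratio is pooled over all pairs, which may differ from the exact averaging performed by MEGA4.

```python
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def pairwise_counts(seq_a, seq_b):
    """Return (compared sites, transitions, transversions) for two aligned sequences."""
    sites = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps (-), missing data (?) and ambiguity codes
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1
    return sites, transitions, transversions


def overall_p_distance(sequences):
    """Average the per-pair proportion of differing sites over all pairs."""
    distances = []
    for seq_a, seq_b in combinations(sequences, 2):
        sites, ts, tv = pairwise_counts(seq_a, seq_b)
        if sites:
            distances.append((ts + tv) / sites)
    return sum(distances) / len(distances)


def ts_tv_ratio(sequences):
    """Pooled transitions over pooled transversions (a simplifying assumption)."""
    total_ts = total_tv = 0
    for seq_a, seq_b in combinations(sequences, 2):
        _, ts, tv = pairwise_counts(seq_a, seq_b)
        total_ts += ts
        total_tv += tv
    return total_ts / total_tv


# Hypothetical toy alignment, not data from this study.
toy = ["ACGTACGT", "ACGTACGA", "GCGTACGA"]
print(overall_p_distance(toy), ts_tv_ratio(toy))
```

In the toy alignment the three pairs differ at 1, 2 and 1 of 8 sites, so the overall p distance is 1/6, and the pooled ratio is 1.0 (two transitions, two transversions).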

Analysis of the concatenated datasets

Analyses were performed on two datasets: a nucleotide matrix containing 22 taxa and 15 concatenated genes, and an amino acid matrix of the same taxa but only 14 genes, i.e., without 18S. For analyses using MP, the two outgroups Pinus and Ginkgo were selected from the data.

The amount of missing data varies among the species, Acorus, Persea and Populus being the ones having the smallest amount of data, with respectively only 45%, 63% and 64% of data compared to Oryza. All the other species have at least 65% of the data. However, because a few studies concluded that datasets with two thirds of missing data (67%) can still result in conclusive phylogenetic trees (Wiens 2003; Philippe et al. 2004), our level of missing data is acceptable and we expect to obtain robust trees.

The nucleotide concatenated dataset resulted in 32,806 aligned positions including indels. The alignment was based on the translated dataset, with the DNA sequences adjusted to match the protein alignment. Of the total number of sites, 13,576 (41%) are variable, and of these, 8,382 (26% of all sites) are parsimony-informative. The overall p distance was calculated to estimate the number of base differences per site and is equal to 0.095. The amino acid concatenated dataset has 10,329 aligned positions. Of the total number of sites, 3,776 (37%) are variable, and of these, 1,908 (18% of all sites) are potentially parsimony-informative. The overall p distance was also calculated to estimate the number of differences per site and is equal to 0.092. Thus, compared to the fastest evolving genes rpc1 and rpa1, the complete dataset is far less divergent.
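The site counts reported above follow the usual definitions: a site is variable when at least two character states are observed, and parsimony-informative when at least two states are each observed in at least two taxa. A minimal Python sketch of this classification is given below; the function name and the toy alignment are assumptions for illustration, and gaps and missing data are simply ignored.

```python
from collections import Counter


def classify_sites(sequences, ignore="-?"):
    """Count variable and parsimony-informative columns in an alignment."""
    variable = informative = 0
    for column in zip(*sequences):
        counts = Counter(c.upper() for c in column if c not in ignore)
        if len(counts) >= 2:
            variable += 1
            # informative: at least two states, each seen in at least two taxa
            if sum(1 for n in counts.values() if n >= 2) >= 2:
                informative += 1
    return variable, informative


# Hypothetical four-taxon alignment, not data from this study.
toy = ["AAGT", "AAGC", "ATCC", "ATC?"]
variable, informative = classify_sites(toy)
print(variable, informative)
```

Here the first column is constant, the second and third are parsimony-informative, and the fourth is variable but uninformative, giving 3 variable and 2 informative sites.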

Rate heterogeneities among the concatenated datasets

In order to investigate the influence of the different rates of evolution among sites of both the nucleotide and amino acid datasets on tree topologies, TREE-PUZZLE 5.2 (Strimmer and von Haeseler 1996) was used to obtain 8 categories of Γ-distributed rates (Tables 7 and 8). As shown in these tables, the rate heterogeneity is pronounced, especially for the last category (8). Therefore, saturation at these fast evolving sites might be suspected. Successive removals of the three fastest evolving site categories (categories 8, 7, and then 6) gave rise, for the nucleotide matrix, to three datasets to be used for phylogenetic analyses, in addition to the total-evidence nucleotide dataset. Similar datasets were created for the amino acid matrix, after removal of the three fastest evolving site categories (categories 8, 7, and then 6).

Table 7: Rate heterogeneity of the DNA dataset. Model of rate heterogeneity is Γ-distributed rates, Γ distribution parameter alpha (estimated from dataset): 0.24 (S.E. 0.00). Categories 1-8 approximate a continuous Γ-distribution with expectation 1 and variance 4.24. TREE-PUZZLE 5.2 (Strimmer and von Haeseler 1996).

Category   Relative rate   Probability   Number of sites   Abundance compared to the entire matrix (%)
1          0.0000          0.1250        17982             54
2          0.0027          0.1250        14                << 1
3          0.0240          0.1250        0                 << 1
4          0.1012          0.1250        2                 << 1
5          0.3033          0.1250        2734              8
6          0.7630          0.1250        2881              9
7          1.8103          0.1250        4111              13
8          4.9955          0.1250        5082              16

Table 8: Rate heterogeneity of the amino acid dataset. Model of rate heterogeneity is Γ-distributed rates, Γ distribution parameter alpha (estimated from dataset): 0.20 (S.E. 0.00). Categories 1-8 approximate a continuous Γ-distribution with expectation 1 and variance 5.01. TREE-PUZZLE 5.2 (Strimmer and von Haeseler 1996).

Category   Relative rate   Probability   Number of sites   Abundance compared to the entire matrix (%)
1          0.0000          0.1250        6103              59
2          0.0009          0.1250        15                << 1
3          0.0116          0.1250        3                 << 1
4          0.0628          0.1250        20                << 1
5          0.2261          0.1250        305               3
6          0.6547          0.1250        1132              11
7          1.7380          0.1250        1415              14
8          5.3060          0.1250        1336              13
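The sizes of the reduced datasets follow directly from the site counts in Tables 7 and 8. The short Python sketch below, in which the variable and function names are chosen only for illustration, reproduces this arithmetic by dropping the fastest categories one at a time.

```python
# Number of sites per rate category (1 to 8), copied from Tables 7 and 8.
DNA_SITES = [17982, 14, 0, 2, 2734, 2881, 4111, 5082]
AA_SITES = [6103, 15, 3, 20, 305, 1132, 1415, 1336]


def remaining_after_removal(sites_per_category, n_fastest):
    """Sites left once the n_fastest highest-rate categories are dropped."""
    if n_fastest == 0:
        return sum(sites_per_category)
    return sum(sites_per_category[:-n_fastest])


dna_sizes = [remaining_after_removal(DNA_SITES, k) for k in range(4)]
aa_sizes = [remaining_after_removal(AA_SITES, k) for k in range(4)]
print("DNA:", dna_sizes)
print("Amino acid:", aa_sizes)
```

This gives 32806, 27724, 23613 and 20732 positions for the nucleotide matrix, and 10329, 8993, 7578 and 6446 positions for the amino acid matrix, in agreement with the character counts listed in Table 9.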

Phylogenetic analyses

The figures below represent the phylogenetic trees inferred with bootstrap values of 50% or more, i.e. bootstrap values of less than 50% are not shown. It is generally considered that bootstrap support is low between 50 and 74%, moderate between 75 and 84% and high at 85% and above (Chase et al. 2000; Felsenstein 1985).
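The support classes used throughout the Results can be written as a simple threshold rule. The sketch below is only a restatement of the cut-offs given above; the function name is hypothetical.

```python
def support_level(bootstrap_percent):
    """Map a bootstrap percentage to the support class used in the text."""
    if not 0 <= bootstrap_percent <= 100:
        raise ValueError("bootstrap support must lie between 0 and 100")
    if bootstrap_percent >= 85:
        return "high"
    if bootstrap_percent >= 75:
        return "moderate"
    if bootstrap_percent >= 50:
        return "low"
    return "not shown"  # values below 50% are omitted from the figures


for value in (99, 80, 65, 45):
    print(value, support_level(value))
```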

Maximum parsimony (MP) trees appear in Figure 5 for the nucleotide matrix and in Figure 6 for the amino acid dataset. While figures 5a and 6a are MP trees with all sites, the other trees were obtained after the fast evolving sites were removed: 5b and 6b (category 8 removed), 5c and 6c (categories 7 and 8 removed), and 5d and 6d (categories 6, 7 and 8 removed). Table 9 below details the characteristics inherent to these MP trees. The consistency index (CI; Kluge and Farris 1969), a widely used measure of homoplasy calculated for a particular dataset, is the minimum number of substitutions divided by the total number of changes observed on the tree (Mishler 1994; Sanderson and Donoghue 1989). It therefore varies from 1 (no homoplasy) towards 0. According to some systematists, the level of homoplasy expressed through the CI can be viewed as being inversely proportional to the confidence conferred to phylogenetic trees (Klassen et al. 1991). As seen for both the nucleotide and amino acid matrices (Table 9), as the fastest evolving sites were removed, the CI increased, to the point of reaching 1 when characters of categories 8, 7 and 6 were excluded. Although this can be interpreted as yielding more reliable trees, the number of informative sites is probably too low to be used. CI values are strongly affected by the number of taxa, and only slightly by the number of characters (Archie 1989; Klassen et al. 1991; Sanderson and Donoghue 1989). The retention index and rescaled consistency index (RI and RCI, respectively; Farris 1989) can also be used as measures of homoplasy, but these can suffer from the same problem as the CI, namely the correlation with the number of taxa (Archie 1989; Archie 1989a; Meier et al. 1991). Contrary to the CI, the RI does not change with the level of autapomorphies included in a dataset (Farris 1989; Farris 1989a), because it instead measures the amount of synapomorphy in a dataset.
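For reference, the three indices discussed above can be written as CI = m/s, RI = (g - s)/(g - m) and RCI = CI x RI, where m is the minimum possible number of steps, s the number of steps observed on the tree and g the maximum possible number of steps. The Python sketch below applies these standard definitions; the function names and the step counts in the first example are hypothetical, and only the CI and RI values in the last lines are taken from Table 9.

```python
def consistency_index(min_steps, observed_steps):
    """CI = m / s; equals 1 when the tree requires no homoplasy."""
    return min_steps / observed_steps


def retention_index(min_steps, observed_steps, max_steps):
    """RI = (g - s) / (g - m); unaffected by autapomorphies."""
    return (max_steps - observed_steps) / (max_steps - min_steps)


def rescaled_consistency_index(ci, ri):
    """RCI = CI * RI."""
    return ci * ri


# Hypothetical step counts: m = 10, s = 16, g = 30.
ci = consistency_index(10, 16)
ri = retention_index(10, 16, 30)
print(ci, ri, rescaled_consistency_index(ci, ri))

# CI and RI values reported in Table 9a for the dataset without category 8.
print(round(rescaled_consistency_index(0.630, 0.415), 3))
```

The last line reproduces the RCI of 0.261 listed in Table 9a for that dataset, which illustrates that the RCI carries no information beyond the product of the other two indices.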

Table 9: Characteristics of the different datasets (all sites and without sites of categories 8, 7 and 6). CI = consistency index; RI = retention index; RCI = rescaled consistency index; iCI = informative consistency index; iRI = informative retention index; iRCI = informative rescaled consistency index.

Table 9a: Nucleotide matrices

                          All          Without categories
                          categories   8        8 and 7   8, 7 and 6
Characters                32806        27724    23613     20732
Variable characters       13576        8750     4649      1859
% Variable                41           32       20        9
Parsimony informative     8382         4633     1515      230
% Parsimony informative   26           17       6
CI                        0.538        0.630    0.816
RI                        0.380        0.415    0.565
RCI                       0.205        0.261    0.461
iCI                       0.452        0.508    0.636
iRI                       0.380        0.415    0.565
iRCI                      0.172        0.210    0.359
Steps                     40946        19741    7566      2591

Table 9b: Amino acid matrices

                          All          Without categories
                          categories   8        8 and 7   8, 7 and 6
Characters                10329        8993     7578      6446
Variable characters       3776         2571     1170      229
% Variable                37           29       15        4
Parsimony informative     1908         1103     271       25
% Parsimony informative   18           12       4         0.4
CI                        0.719        0.749    0.903
RI                        0.472        0.508    0.695
RCI                       0.339        0.380    0.627
iCI                       0.625        0.608    0.727
iRI                       0.472        0.508    0.695
iRCI                      0.295        0.309    0.505
Steps                     11130        5552     1779      305

The MP phylogenetic analyses based on the nucleotide datasets (Figure 5) show trees with less resolved clades as the fastest sites are removed, especially without sites of categories 7 and 6 (figures 5c and 5d, respectively), because branch lengths then become very short. The MP tree with all sites (Figure 5a) shows Ceratophyllum as the sister species of eudicots (but with bootstrap support below 50%), next to the Ranunculales (Papaver). However, the magnoliid species are moderately to highly supported, with the Magnoliales (Liriodendron and Magnolia) next to Laurales (Persea) and then next to a clade of Winterales (Drimys) and Piperales (Saruma), as most studies have usually observed. Also, all eudicots have high bootstrap support, except for one node: the Caryophyllales-Asterids (Beta and Nicotiana) with respect to the Rosids (Arabidopsis, Pisum and Populus), with only a low bootstrap support of 65%. Vitis and Papaver are successive sisters to all other eudicots, with strong bootstrap support (100 and 99%, respectively), although in the majority of studies the former species is usually found sister to the Rosids or within them. A clade of all monocots is recovered with Acorus as its basal taxon, with low bootstrap support (71%), and the placement of monocots next to the remainder of angiosperms (magnoliids and eudicots) receives the strongest support with 100%.

If we compare the total-evidence DNA dataset with the one that does not take into account the fastest evolving sites (of category 8), aside from slight differences in a few bootstrap values, the topology is exactly the same, except for one species. Whereas Ceratophyllum does not find a supported position in the tree with all sites (Figure 5a), it forms a strongly supported clade with Acorus (99% for Figure 5b) when the fast evolving sites are removed, but the position of this clade, next to the other monocots, receives only low bootstrap support (57%). It is also noteworthy that this Acorus-Ceratophyllum clade remains with very strong support (99 and 100%), even when the sites of categories 7 and 6 are removed and most of the other clades become unresolved (figures 5b, 5c and 5d). Even though consecutive removals of sites of categories 7 and 6 give rise to trees with low resolution, some clades remain. These are the monocots, the Magnoliales, and finally the grouping Ceratophyllum-Acorus. They all receive very strong bootstrap support.

[Figure 5, panels a) to d): tree graphics not reproduced.]

Figure 5. Evolutionary relationships of 20 taxa for the nucleotide dataset. The evolutionary history was inferred using the Maximum Parsimony (MP) method. The bootstrap values (%) were calculated using 500 replicates and these values are not shown when less than 50%. a) MP tree based on all 32,806 positions, b) MP tree based on all positions except the fastest evolving site category (27,724 positions), c) MP tree based on all positions except the two fastest evolving site categories (23,613 positions), d) MP tree based on all positions except the three fastest evolving site categories (20,732 positions).

The MP analyses using amino acids (Figure 6) show trees that have approximately the same topologies. The basal species (Illicium, Amborella and Nymphaea) are found at exactly the same positions. Monocots also form a robustly supported monophyletic grouping (95%), but Saruma, a species from the order Piperales, usually found next to the Canellales (Drimys) within the magnoliid species, is instead sister to the monocots (which are sister to the eudicots), although support is quite low, with a bootstrap value of only 53% (Figure 6a). However, when the fastest sites are removed (i.e., category 8), Ceratophyllum forms a clade with Acorus with strong bootstrap support (94%), as it did with the DNA dataset. This clade remains with support of at least 94%, even with the deletion of the other fast evolving sites (categories 7 and 6; figures 6b, 6c and 6d). For the magnoliids, except for Saruma, the topology is similar to that of the DNA dataset (Figure 5), although Drimys forms a clade with Persea in Figure 6b. Finally, for eudicots, there are a few differences compared to the nucleotide dataset. Looking at the total-evidence dataset (Figure 6a), it can be seen that the clade Caryophyllales (Beta) + Asterids (Nicotiana) remains with low support of 65%, and two of the Rosids (Arabidopsis and Populus) are strongly supported together (89%), but the positions of this last clade, as well as those of both Pisum and Vitis, are not well supported. Papaver is sister to these eudicots with strong bootstrap support (97%) and Ceratophyllum is next, although its support is quite low at 53%. In Figure 6b, where the fastest sites (of category 8) are removed, it can be observed that the eudicot topology is, with only one exception, very similar to that of the DNA dataset, but with slight differences in bootstrap support. The positions of Pisum and Populus are permuted, compared to Arabidopsis. Figures 6c and 6d also show mostly the same groupings as the DNA dataset, which remain despite the exclusion of sites of categories 8, 7 and 6. These are the monocots, Beta and Nicotiana, as well as the clade of Acorus and Ceratophyllum.

Further analyses were performed (data not shown), using either both the gymnosperms Pinus and Ginkgo as outgroups, or only one of these species, in order to examine whether the topology of the trees remained the same. The inclusion of these taxa only slightly decreased the bootstrap support values, and no significant changes occurred in the topologies.

[Figure 6, panels a) to d): tree graphics not reproduced.]

Figure 6. Evolutionary relationships of 20 taxa for the amino acid dataset. The evolutionary history was inferred using the Maximum Parsimony (MP) method. The bootstrap values (%) were calculated using 500 replicates and these values are not shown when less than 50%. a) MP tree based on all 10,329 positions, b) MP tree based on all positions except the fastest evolving site category (8,993 positions), c) MP tree based on all positions except the two fastest evolving site categories (7,578 positions), d) MP tree based on all positions except the three fastest evolving site categories (6,446 positions).

ML trees are presented in Figure 7 for the nucleotide matrix and in Figure 8 for the amino acid dataset. Similarly to the MP trees, while figures 7a and 8a are the results of ML analyses using all sites, the other ones are the trees constructed when the fast evolving sites have been removed: 7b and 8b (category 8 deleted), 7c and 8c (categories 7 and 8 removed), as well as 7d and 8d (categories 6, 7 and 8 deleted).

Before the ML analyses, it was essential to find which substitution model best fit the nucleotide and amino acid datasets. Thus, ModelTest 3.7 (Posada and Crandall 1998) and ProtTest 1.4 (Abascal et al. 2005) were used with these respective datasets. For the nucleotide matrix, the following models were chosen for each of the 4 datasets: total-evidence dataset, with all sites (GTR +I +Γ; Rodriguez et al. 1990); dataset without sites of category 8 (GTR +Γ); dataset without sites of categories 8 and 7 (GTR +I); dataset without sites of categories 8, 7 and 6 (GTR, without I and Γ). For the amino acid matrix, ProtTest selected the following models for each of the 4 datasets: total-evidence dataset, with all sites (JTT +Γ +F); dataset without sites of category 8 (JTT +Γ +F); dataset without sites of categories 8 and 7 (JTT +Γ +F); dataset without sites of categories 8, 7 and 6 (mtREV +F; Adachi and Hasegawa 1996).
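Model selection of this kind rests on the Akaike Information Criterion, AIC = 2k - 2 ln L, where k is the number of free parameters and ln L the maximized log-likelihood; the model with the lowest AIC is preferred. The Python sketch below illustrates the comparison only. The log-likelihoods and parameter counts are invented for this example and are not ModelTest or ProtTest output for the datasets of this study.

```python
def aic(log_likelihood, n_free_parameters):
    """Akaike Information Criterion: AIC = 2k - 2 ln L (lower is better)."""
    return 2 * n_free_parameters - 2 * log_likelihood


# HYPOTHETICAL values: (maximized log-likelihood, free parameters).
candidates = {
    "GTR": (-1000.0, 8),
    "GTR+G": (-990.0, 9),
    "GTR+I+G": (-989.5, 10),
}

scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
best = min(scores, key=scores.get)
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: AIC = {score:.1f}")
print("Preferred model:", best)
```

In this invented case the extra invariable-sites parameter improves the likelihood too little to offset its penalty, so the simpler GTR+G model is preferred; the same trade-off explains why different models can be selected for the reduced datasets.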

The ML phylogenetic trees constructed from the nucleotide datasets (Figure 7) become less resolved as the fastest sites are removed. Out of these four trees, the outgroup clade (Pinus and Ginkgo) is strongly supported for the total-evidence dataset (Figure 7a), and Illicium is still, with strong bootstrap support of over 92%, at the odd position of basalmost angiosperm, followed by a clade of Amborella and Nymphaea. All nucleotide ML trees (Figure 7), as seen for many of the MP trees, recovered, with strong support of over 95%, a clade of Acorus and Ceratophyllum, although the position of this grouping within the other angiosperms varies. It is found as sister to the eudicots + monocots in Figure 7a (all sites), next to the other monocots in Figure 7b when the sites of category 8 are removed, within the eudicots next to Magnoliales (Liriodendron and Magnolia) and then Vitis when sites of categories 8 and 7 are deleted (Figure 7c), and finally next to Papaver then Drimys when all the fast evolving sites are removed (categories 6, 7 and 8). These changing positions illustrate a significant inconsistency for the Acorus-Ceratophyllum clade, as opposed to the MP trees where only two alternatives prevailed. For magnoliids, the five species are mostly found clustered

[Figure 7, panels a) to d): tree graphics not reproduced.]

Figure 7. Evolutionary relationships of 22 taxa for the nucleotide dataset. The evolutionary history was inferred using the Maximum Likelihood (ML) method. The bootstrap values (%) were calculated using 100 replicates and these values are not shown when less than 50%. a) ML tree based on all 32,806 positions and assumed GTR +I +Γ. b) ML tree based on all positions except the fastest evolving site category (27,724 positions) and assumed GTR +Γ. c) ML tree based on all positions except the two fastest evolving site categories (23,613 positions) and assumed GTR +I. d) ML tree based on all positions except the three fastest evolving site categories (20,732 positions) and assumed GTR.

together and strongly supported, especially for Magnolia and Liriodendron with Persea mainly next to them, and Drimys and Saruma are often close together or even forming a clade (figures 7a, 7b, and 7c). The monocots, apart from Acorus, are always closely related with a bootstrap support of 100%, with the grasses together (Oryza and Zea), then the Asparagales (Asparagus), and their placement is always very comparable, as sisters to all or the majority of eudicots, although support is not very strong. Even if the bootstrap support varies, the eudicot topology is almost always the same for at least the first two trees (figures 7a and 7b), with three Rosids clustered together (Arabidopsis, Pisum and Populus), then the Asterids (Nicotiana) and finally the Caryophyllales (Beta). Vitis is sister to all eudicots except Papaver, as seen in the MP trees (figures 7a and 7b).

Conversely, ML analyses using the amino acid dataset (Figure 8) recover trees that have similar topologies to the ones found with MP trees, especially those from the amino acid dataset. The outgroup clade, as well as the basal species (Illicium, Amborella and Nymphaea), are exactly the same, despite slight changes in the bootstrap values. Looking more closely at the first three trees (figures 8a, 8b and 8c), the eudicot topology is also very similar for the ML analyses using the amino acids, except for the position of Beta, which is found either with Nicotiana (figures 8a, 8c and 8d), or forming a clade with Saruma. Then Vitis, followed by Papaver, are successive sisters to these species for at least the first figures (8a and 8b), but Ceratophyllum is never recovered as close to these as in the MP trees. Monocots form a monophyletic grouping, although Ceratophyllum is sometimes observed in a clade with Acorus as it was with MP analysis (figures 8b, 8c and 8d). Saruma, usually next to Canellales (Drimys), is sometimes found with monocots instead, although the support is not always strong (figures 8a and 8b). For the last trees (figures 8b, 8c and 8d), Ceratophyllum forms a clade with Acorus, as it did with the MP trees, but it is instead found next to the magnoliids + monocots + eudicots in Figure 8a. The primitive dicots (magnoliids) are also, except for Saruma, closely related, despite a bootstrap value that is sometimes low (figures 8c and 8d).

As with the MP trees, more analyses were done after deletion of the outgroup species (Pinus and Ginkgo), in order to examine what happens to the topology. Again, by comparing figures 7 with 9 (nucleotide datasets, with and without gymnosperm outgroups, respectively), as well as figures 8 with 10 (amino acid datasets, with and without gymnosperm outgroups, respectively), it

[Figure 8, panels a) to d): tree graphics not reproduced.]

Figure 8. Evolutionary relationships of 22 taxa for the amino acid dataset. The evolutionary history was inferred using the Maximum Likelihood (ML) method. The bootstrap values (%) were calculated using 100 replicates and these values are not shown when less than 50%. a) ML tree based on all 10,329 positions and assumed JTT +F +Γ. b) ML tree based on all positions except the fastest evolving site category (8,993 positions) and assumed JTT +F +Γ. c) ML tree based on all positions except the two fastest evolving site categories (7,578 positions) and assumed JTT +F +Γ. d) ML tree based on all positions except the three fastest evolving site categories (6,446 positions) and assumed MtREV +F.

is possible to determine if the inclusion of the gymnosperm outgroups makes a difference. The total-evidence nucleotide trees are identical, except for the bootstrap percentages, which increased without the gymnosperms, and the position of Illicium, because of the rooting with Amborella. For the figures without the fast evolving sites (of category 8), the only difference is the position of Beta, which either forms a clade with Nicotiana (Figure 7b) without strong support, or is a sister group to a clade which includes Nicotiana with moderate support at 80% (Figure 9b). The topologies of the figures without sites of categories 8, 7 and 6 are the same (figures 9c and 9d), except for Illicium, and it is the same for figures 7d and 9d, despite the fact that many sites were excluded. If one considers the amino acid datasets, there are dissimilarities between figures 8 and 10, even though the bootstrap percentages are also always increased without outgroups. For the total-evidence matrices, the positions of Nicotiana and Pisum are permuted, whereas Beta is next to the Rosids in Figure 10a, but these changes are not resolved because the bootstrap percentages are below 50%. For the exclusion of the fastest sites (figures 8b and 10b), Beta is the only species with changing positions, being found either with Saruma or Nicotiana. Surprisingly, figures 8c and 10c are quite different, despite the fact that the ones with less information (figures 8d and 10d) are almost identical.

In order to investigate whether conflicts may exist among genes from the three genomic compartments, partitions were created for the nuclear, chloroplastic and mitochondrial sequences, respectively. ML analyses were then performed, using the same substitution models. The results obtained suggest that these datasets have a strong resolving power, but that, as more and more genes are added to them, the ability to produce congruent phylogenetic trees increases, as predicted by previous studies (Baker and DeSalle 1997; Dolphin et al. 2000; Kluge 1989; Sanderson et al. 1998).

The topologies obtained from nuclear nucleotide sequences (Figure 11) are similar to those we obtained with all sequences, except for two groupings. Acorus is found forming a strongly supported clade (over 97%) with Nicotiana (a eudicot), and Ceratophyllum is clustered with Beta. This last grouping is present whether or not fast evolving sites are removed (Figures 11a through 11d). Although this strange topology was not observed with all sequences (Figures 5 to 10), it was inferred by a former M.Sc. student (Xia 2003). Too many conflicts, caused by numerous homoplasies (nuclear genes are the fastest genes of the concatenation, with only 1327 constant sites throughout this partition), as well as lineage sorting, are probably the causes of these odd positions. This can be corroborated when looking at the amino acid dataset (Figure 12), because the clade Acorus-Ceratophyllum appears, although it is not well supported (Figures 12a and 12b). This same clade is still present in most of the figures inferred from the chloroplastic partition, for both the nucleotide and amino acid datasets (Figures 13 and 14, respectively). In fact, Ceratophyllum is only found as a sister taxon to the eudicots in Figure 14a, whereas the others infer instead the clade Acorus-Ceratophyllum. Considering that more than half of the characters are invariable (4572 constant sites for the chloroplastic partition), these clear results are worth mentioning. Due to a small dataset and, above all, because of the slow evolution of plant mitochondria, the mitochondrial partitions for both the nucleotide (Figure 15) and amino acid (Figure 16) datasets are certainly not congruent. For example, Ceratophyllum is found as sister to Drimys, next to the eudicots + monocots in Figure 15a, whereas it is located either next to Papaver within the eudicots (Figure 16a) or forming a clade with Persea (Figure 16b).

[Figure 9, panels a) to d): tree graphics not reproduced.]

Figure 9. Evolutionary relationships of 20 taxa for the nucleotide dataset. Outgroup clade (Pinus and Ginkgo) was removed. The evolutionary history was inferred using the Maximum Likelihood (ML) method. The bootstrap values (%) were calculated using 100 replicates and these values are not shown when less than 50%. a) ML tree based on all 32,806 positions and assumed GTR +1 +r. b) ML tree based on all positions except the fast evolving site categories (27,724 positions) and assumed GTR +r. c) ML tree based on all positions except the two fast evolving site categories (23,413 positions) and assumed GTR +1. d) ML tree based on all positions except the three fast evolving site categories (20,732 positions) and assumed GTR.

Figure 10. Evolutionary relationships of 20 taxa for the amino acid dataset. Outgroup clade (Pinus and Ginkgo) was removed. The evolutionary history was inferred using the Maximum Likelihood (ML) method. The bootstrap values (%) were calculated using 100 replicates and these values are not shown when less than 50%. a) ML tree based on all 10,329 positions and assumed JTT + F + Γ. b) ML tree based on all positions except the fast evolving site categories (8,993 positions) and assumed JTT + F + Γ. c) ML tree based on all positions except the two fast evolving site categories (7,378 positions) and assumed JTT + F + Γ. d) ML tree based on all positions except the three fast evolving site categories (4,444 positions) and assumed MtREV + F.

Figure 11. Evolutionary relationships of 20 taxa for the nucleotide dataset, using only the seven nuclear sequences. The evolutionary history was inferred using the Maximum Likelihood (ML) method. The bootstrap values (%) were calculated using 100 replicates and these values are not shown when less than 50%. a) ML tree based on all 19,075 positions and assumed GTR + I + Γ. b) ML tree based on all positions except the fast evolving site categories (16,304 positions) and assumed GTR + Γ. c) ML tree based on all positions except the two fast evolving site categories (13,349 positions) and assumed GTR + I. d) ML tree based on all positions except the three fast evolving site categories (11,715 positions) and assumed GTR.

Figure 12. Evolutionary relationships of 20 taxa for the amino acid dataset, using only the six nuclear protein-coding sequences. The evolutionary history was inferred using the Maximum Likelihood (ML) method. The bootstrap values (%) were calculated using 100 replicates and these values are not shown when less than 50%. a) ML tree based on all 5,752 positions and assumed JTT + F + Γ. b) ML tree based on all positions except the fast evolving site categories (5,049 positions) and assumed JTT + F + Γ. c) ML tree based on all positions except the two fast evolving site categories (4,357 positions) and assumed JTT + F + Γ. d) ML tree based on all positions except the three fast evolving site categories (3,767 positions) and assumed MtREV + F.

Figure 13. Evolutionary relationships of 20 taxa for the nucleotide dataset, using only the five chloroplastic sequences. The evolutionary history was inferred using the Maximum Likelihood (ML) method. The bootstrap values (%) were calculated using 100 replicates and these values are not shown when less than 50%. a) ML tree based on all 8,478 positions and assumed GTR + I + Γ. b) ML tree based on all positions except the fast evolving site categories (7,396 positions) and assumed GTR + Γ. c) ML tree based on all positions except the two fast evolving site categories (6,438 positions) and assumed GTR + I. d) ML tree based on all positions except the three fast evolving site categories (5,358 positions) and assumed GTR.

Figure 14. Evolutionary relationships of 20 taxa for the amino acid dataset, using only the five chloroplastic sequences. The evolutionary history was inferred using the Maximum Likelihood (ML) method. The bootstrap values (%) were calculated using 100 replicates and these values are not shown when less than 50%. a) ML tree based on all 2,826 positions and assumed JTT + F + Γ. b) ML tree based on all positions except the fast evolving site categories (2,455 positions) and assumed JTT + F + Γ. c) ML tree based on all positions except the two fast evolving site categories (2,074 positions) and assumed JTT + F + Γ. d) ML tree based on all positions except the three fast evolving site categories (1,999 positions) and assumed MtREV + F.

Figure 15. Evolutionary relationships of 20 taxa for the nucleotide dataset, using only the three mitochondrial sequences. The evolutionary history was inferred using the Maximum Likelihood (ML) method. The bootstrap values (%) were calculated using 100 replicates and these values are not shown when less than 50%. a) ML tree based on all 5,253 positions and assumed GTR + I + Γ. b) ML tree based on all positions except the fast evolving site categories (4,543 positions) and assumed GTR + Γ. c) ML tree based on all positions except the two fast evolving site categories (3,922 positions) and assumed GTR + I. d) ML tree based on all positions except the three fast evolving site categories (3,920 positions) and assumed GTR.

Figure 16. Evolutionary relationships of 20 taxa for the amino acid dataset, using only the three mitochondrial sequences. The evolutionary history was inferred using the Maximum Likelihood (ML) method. The bootstrap values (%) were calculated using 100 replicates and these values are not shown when less than 50%. a) ML tree based on all 1,751 positions and assumed JTT + F + Γ. b) ML tree based on all positions except the fast evolving site categories (1,549 positions) and assumed JTT + F + Γ. c) ML tree based on all positions except the two fast evolving site categories (1,360 positions) and assumed JTT + F + Γ. d) ML tree based on all positions except the three fast evolving site categories (1,269 positions) and assumed MtREV + F.

Testing of the tree topologies

We used the conservative non-parametric SH-test to compare the likelihood of alternative topologies (Goldman et al. 2000; Shimodaira and Hasegawa 1999; Buckley 2002; Strimmer and Rambaut 2001). This test requires that all realistic tree topologies be made available for comparison (Buckley et al. 2001; Goldman et al. 2000), even though the number of trees influences confidence in the SH-test (Aris-Brosou 2003; Strimmer and Rambaut 2001). This is why it was conducted with seven alternative tree topologies, which are (1) Ceratophyllum + Acorus forming a clade as a sister group to the primitive dicots and eudicots, (2) Ceratophyllum + Acorus forming a clade as a sister group to the monocots, (3) Ceratophyllum as a sister taxon to Acorus + monocots, (4) Ceratophyllum as a sister taxon to Papaver within all other eudicots, (5) Ceratophyllum + Beta forming a clade as a sister group to Magnoliaceae, (6) Ceratophyllum as a sister taxon to the primitive dicots, and finally (7) Ceratophyllum as a sister taxon to Amborella (Table 10; Figure 17).

Two hypotheses were tested: the null hypothesis H0 (no tree topology is significantly different) against the alternative hypothesis HA (at least one topology is significantly different). The last three topologies (5 through 7), which are respectively the clade Ceratophyllum + Beta (Xia 2003) and the placements of Ceratophyllum as a sister taxon to the primitive dicots (Aris-Brosou 2003; Savolainen and Chase 2003; Strimmer and Rambaut 2001; Zanis et al. 2002) and as a sister taxon to Amborella (Chase et al. 1993), are the ones least supported by the SH-tests. They are indeed significantly different from the other ones in most datasets, even though the first two nuclear sets support the clade Ceratophyllum + Beta as being significantly better. Tree 3 (Ceratophyllum as a sister taxon to Acorus; Chaw et al. 2000; Graham and Olmstead 2000; Graham and Olmstead 2000a; Qiu et al. 1999; Qiu et al. 2000; Soltis et al. 2003; Soltis et al. 1997) is never recovered as the best topology, and is also discarded in six datasets. Tree 4 (Ceratophyllum as a sister taxon to Papaver) is proposed as being the best topology twice, but it is otherwise rejected as significantly different in six other SH-tests. In contrast, trees 1 (clade Ceratophyllum + Acorus as a sister group to the primitive dicots and eudicots) and 2 (clade Ceratophyllum + Acorus as a sister group to monocots; Savolainen et al. 2000) are both discarded in only two datasets, and they are statistically the best supported trees in, respectively, five and three datasets.

Table 10: SH-tests for seven alternative topologies with different datasets.

Tree topology               1        2        3        4        5        6        7

All sequences
  -ln L                203807   203801   203822   203792   204406   203813   204210
  Difference -ln L         15        8       29     Best      613       21      418
  P                     0.698    0.787    0.555     Best   0.000*    0.701   0.000*

All sequences, but without sites of category 8
  -ln L                141009   141006   141080   141089   141340   141087   141674
  Difference -ln L          4     Best       74       84      334       81      668
  P                     0.812     Best    0.121    0.089   0.000*    0.098   0.000*

All sequences, but without sites of categories 8 and 7
  -ln L                 80787    80788    80904    80908    80906    80908    81341
  Difference -ln L       Best        1      116      121      119      121      554
  P                      Best    0.820   0.010*   0.009*   0.014*   0.009*   0.000*

All sequences, but without sites of categories 8, 7 and 6
  -ln L                 47553    47553    47613    47613    47613    47613    47710
  Difference -ln L          0     Best       60       60       60       60      158
  P                     0.865     Best   0.005*   0.005*   0.005*   0.005*   0.000*

Only the seven nuclear genes
  -ln L                132618   132628   132622   132591   132475   132595   132761
  Difference -ln L        142      152      146      115     Best      119      284
  P                    0.001*   0.001*   0.001*   0.007*     Best   0.004*   0.000*

Only the seven nuclear genes, but without sites of category 8
  -ln L                 89947    89949    89947    89937    89835    89937    90091
  Difference -ln L        113      114      113      102     Best      102      257
  P                    0.004*   0.002*   0.004*   0.008*     Best   0.010*   0.000*

Only the seven nuclear genes, but without sites of categories 8 and 7
  -ln L                 51740    51745    51753    51751    51704    51751    51915
  Difference -ln L         36       41       49       48     Best       47      212
  P                     0.209    0.177    0.099    0.090     Best    0.096   0.000*

Only the five chloroplast genes
  -ln L                 50428    50421    50487    50484    50977    50499    50702
  Difference -ln L          7     Best       67       63      556       78      281
  P                     0.727     Best    0.063    0.076   0.000*   0.034*   0.000*

Only the five chloroplast genes, but without sites of category 8
  -ln L                 29191    29191    29311    29308    29649    29313    29477
  Difference -ln L       Best        0      120      117      458      122      286
  P                      Best    0.862   0.000*   0.000*   0.000*   0.000*   0.000*

Only the five chloroplast genes, but without sites of categories 8 and 7
  -ln L                 16330    16330    16414    16414    16594    16414    16547
  Difference -ln L       Best        0       84       84      264       84      216
  P                      Best    0.738   0.002*   0.002*   0.000*   0.002*   0.000*

Only the three mitochondrial genes
  -ln L                 18516    18509    18482    18481    18685    18485    18532
  Difference -ln L         35       27        0     Best      204        3       50
  P                     0.107    0.201    0.934     Best   0.000*    0.878   0.042*

Only the three mitochondrial genes, but without sites of category 8
  -ln L                 10494    10494    10494    10494    10494    10494    10564
  Difference -ln L       Best        0        0        0        0        0       70
  P                      Best    0.507    0.527    0.556    0.508    0.525   0.000*

Only the three mitochondrial genes, but without sites of categories 8 and 7
  -ln L                  5421     5421     5421     5421     5421     5421     5421
  Difference -ln L       Best        0        0        0        0        0        0
  P                      Best    1.000    1.000    1.000    1.000    1.000    1.000

* P < 0.05
1- Clade Ceratophyllum + Acorus is sister group to the primitive dicots and eudicots
2- Clade Ceratophyllum + Acorus is sister group to monocots
3- Ceratophyllum is sister taxon to Acorus + monocots
4- Ceratophyllum is sister taxon to Papaver + eudicots
5- Clade Ceratophyllum + Beta is sister group to Magnoliaceae
6- Ceratophyllum is sister taxon to the primitive dicots
7- Ceratophyllum is sister taxon to Amborella
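As a reading aid for Table 10, the following minimal Python sketch picks the best-scoring topology for one dataset and recomputes the differences in -ln L for the other trees. It uses only the -ln L row of the "All sequences" dataset and does not perform the SH-test itself. The recomputed differences can deviate by one unit from the printed ones, presumably because the tabulated likelihoods are rounded. The function and variable names are illustrative and are not taken from the thesis.

```python
# -ln L values for the "All sequences" dataset, trees 1 to 7 (Table 10).
neg_ln_l = {1: 203807, 2: 203801, 3: 203822, 4: 203792,
            5: 204406, 6: 203813, 7: 204210}


def best_and_differences(scores):
    """Return the tree with the smallest -ln L and each tree's difference to it."""
    best_tree = min(scores, key=scores.get)
    differences = {tree: value - scores[best_tree] for tree, value in scores.items()}
    return best_tree, differences


best_tree, differences = best_and_differences(neg_ln_l)
print("Best topology:", best_tree)
for tree, delta in sorted(differences.items()):
    print(f"tree {tree}: difference in -ln L = {delta}")
```

The tree with the smallest -ln L is the one labelled "Best" in the table; the P values in the third row of each block come from the SH-test and cannot be derived from these differences alone.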

Bootstrap confidence

Bootstrap resampling is often used as a measure of reliability of phylogenetic trees (Donoghue and Doyle 1989; Doyle and Donoghue 1992; Efron et al. 1996; Felsenstein 1985; Loconte and Stevenson 1991). Indeed, by looking at the specific position of Ceratophyllum in all trees, it is remarkable that, most of the time, this genus is found with Acorus, with very high bootstrap support (i.e. at least 94%). As shown in Table 11, the grouping Ceratophyllum-Acorus appeared in a vast majority of trees for the complete datasets (not partitioned). Indeed, for the 24 trees inferred, regardless of the analysis (MP or ML) or the kind of matrix employed, this clade appeared very frequently (a total of 20 times). Moreover, almost half of the time (9 out of 20), this grouping was found next to the monocots, and not once was an alternative placement (away from the monocots) supported by bootstrap values over 50%. Interestingly, this clade is always found (20 times out of 20) when the fastest evolving sites (of category 8) are removed from the analyses. When the datasets were partitioned, the clade Ceratophyllum-Acorus was seen 8 times, and half of the time, it was found next to the monocots.

Since congruence can be defined as the agreement of results obtained from different sets of characters (Page and Holmes 2007), in phylogenetics it can take the form of topology comparisons. As seen in Table 11, the numerous Ceratophyllum-Acorus groupings obtained are a reasonable indication that this clade is congruent across the datasets and phylogenetic methods used.
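The bootstrap procedure referred to in this section resamples the columns of an alignment with replacement, so that each pseudo-replicate has the same number of sites as the original matrix. The following Python sketch illustrates that resampling step only. The toy alignment is invented for illustration, the function name is hypothetical, and inferring a tree from each replicate (as done by the MP and ML programs used in this study) is not shown.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one pseudo-replicate).

    `alignment` maps a taxon name to its aligned sequence; all sequences
    must have the same length.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    n_sites = lengths.pop()
    # Draw n_sites column indices with replacement, shared by all taxa.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in columns)
            for taxon, seq in alignment.items()}


# Toy data, invented for illustration only.
toy = {"Acorus": "ATGCCGTA", "Ceratophyllum": "ATGCTGTA", "Amborella": "ATGACGTT"}
rng = random.Random(1)
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
```

The bootstrap value of a clade is then the percentage of replicate trees in which that clade is recovered, which is how the percentages in Table 11 should be read.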

Table 11: Comparison of bootstrap values for all figures and all different matrices analyzed [all data (a), without sites of category 8 (b), of categories 8 and 7 (c) and of categories 8 to 6 (d)] for the presence of the clade Ceratophyllum-Acorus. An asterisk (*) indicates that the clade is a sister group to the monocots. A hyphen (-) indicates the absence of this clade.

Bootstrap values (%)
                                                                    All         Without sites of categories ...
                                                                    data        8           8 and 7     8, 7 and 6
                                                                    (a)         (b)         (c)         (d)
Figure 5: MP with nucleotides (20 taxa)                             -           99 %*       100 %       100 %
Figure 6: MP with amino acids (20 taxa)                             -           94 %*       97 %*       100 %
Figure 7: ML with nucleotides (22 taxa)                             95 %*       100 %*      100 %       100 %
Figure 8: ML with amino acids (22 taxa)                             -           94 %*       99 %        98 %
Figure 9: ML with nucleotides (20 taxa)                             96 %        100 %*      100 %*      100 %
Figure 10: ML with amino acids (20 taxa)                            -           97 %*       100 %       97 %
Figure 11: ML with nucleotides (20 taxa), nuclear partition         -           -           -           -
Figure 12: ML with amino acids (20 taxa), nuclear partition         <50 %       <50 %
Figure 13: ML with nucleotides (20 taxa), chloroplastic partition   96 %*       100 %*      100 %*
Figure 14: ML with amino acids (20 taxa), chloroplastic partition   -           100 %*      99 %
Figure 15: ML with nucleotides (20 taxa), mitochondrial partition   -           -           <50 %
Figure 16: ML with amino acids (20 taxa), mitochondrial partition   -
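The counts quoted in the text for the complete (unpartitioned) datasets can be checked directly against the rows of Table 11 for Figures 5 to 10. The short Python sketch below encodes those six rows and tallies how often the clade Ceratophyllum-Acorus is present and how often it is sister to the monocots. The data structure and names are illustrative only; the values are those printed in the table.

```python
# Cells of Table 11 for the complete datasets (Figures 5 to 10), columns (a) to (d).
# Each cell is None (clade absent) or (bootstrap %, sister_to_monocots).
table11 = {
    5:  [None, (99, True), (100, False), (100, False)],
    6:  [None, (94, True), (97, True), (100, False)],
    7:  [(95, True), (100, True), (100, False), (100, False)],
    8:  [None, (94, True), (99, False), (98, False)],
    9:  [(96, False), (100, True), (100, True), (100, False)],
    10: [None, (97, True), (100, False), (97, False)],
}

cells = [cell for row in table11.values() for cell in row]
present = [cell for cell in cells if cell is not None]
next_to_monocots = [cell for cell in present if cell[1]]

print("trees inferred:", len(cells))
print("clade present:", len(present))
print("clade sister to monocots:", len(next_to_monocots))
print("lowest bootstrap value:", min(value for value, _ in present))
```

This reproduces the 24 trees, the 20 occurrences of the clade and the 9 occurrences next to the monocots mentioned in the text.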

CHAPTER 6

DISCUSSION

Our study aimed to infer relationships among the earliest extant angiosperm species, looking primarily at the position of Ceratophyllum, which can be found, in many published analyses, at a basal position within angiosperms, as a sister lineage of all eudicots, or next to the monocots (Figure 17).

Figure 17: All positions inferred for Ceratophyllum, as seen in many investigations as well as in this study. Numbers in parentheses correspond to the tested tree topologies listed in Table 10.

Genes used

The location of Ceratophyllum was thus to be determined, using primarily the genes for the largest subunits of the three nuclear RNA polymerases (namely rpa1, rpb1 and rpc1), along with other sequences, not only from the nucleus, but also from the two plant organellar compartments. Thus, RNA polymerase I and III genes (rpa1 and rpc1) of eleven more species were sequenced (Table 3). Since many of the largest subunit of RNA polymerase II genes (rpb1), as well as other rpa1 and rpc1 sequences, have been made available by former Ph.D. and M.Sc. students in our lab, over 70% of these three RNA polymerase sequences were available for the 22 taxa investigated. These new phylogenetic markers have been used successfully for assessing phylogenies, especially for gymnosperms (Drouin et al. 2008; Hajibabaei et al. 2006; Nickerson and Drouin 2004). In order to include more RNA polymerase sequences, the second largest subunit of RNA polymerase II gene (rpb2) was added to the dataset. This phylogenetic marker has been successfully used in evolutionary studies, particularly those looking at deep relationships within eudicots (Denton et al. 1998; Oxelman and Bremer 2000; Oxelman et al. 2004; Pfeil et al. 2004; Popp and Oxelman 2004). Three other nuclear genes were also included in the matrix, two of which are protein-coding sequences, i.e. the orthologous phytochrome phyA and phyC genes, because they are available for numerous species and have been used at least once for studies of the root of angiosperms (Mathews and Donoghue 1999). The last nuclear gene, the small subunit of ribosomal DNA (18S rDNA), although not a protein-coding gene, has often been used for phylogenetics (Chaw et al. 1997; Kim et al. 2004; Nickrent and Soltis 1995; Soltis et al. 1997). Even though it contains only a small amount of information because of its slow rate of evolution, it has been shown to make a valuable contribution to phylogenetic studies when combined with other sequences (Hoot et al. 1999; Soltis et al. 1997; Soltis et al. 1999). Altogether, the protein-coding nuclear genes represent more than 52% of all positions in the combined dataset, and another 5% comes from the 18S rDNA gene (Table 2).

Including sequences from all three plant compartments is useful to study the evolution of angiosperms (Les et al. 1999). Hence, five plastid and three mitochondrial gene sequences completed the matrix, which included a total of fifteen genes. Apart from the ribulose-1,5-bisphosphate carboxylase/oxygenase (rbcL) gene that codes for an enzyme found in the stroma, which is by far the most often used plastid gene in phylogenetic studies (Chase et al. 1993; Nickrent and Soltis 1995), four other plastid sequences were added. They are atpB, which encodes the β subunit of ATP synthase and is associated with the membranes of thylakoids (Hoot et al. 1995a; Zurawski et al. 1982), the two photosystem psaA and psbB genes, as well as the fast evolving maturase K gene (matK; Mohr et al. 1993; Neuhaus and Link 1987). These four sequences have been used successfully as adequate phylogenetic markers (Graham and Olmstead 2000a; Hilu et al. 2003; Hoot et al. 1999; Les et al. 1999; Qiu et al. 1999; Selvaraj et al. 2008). Furthermore, three mitochondrial genes, namely the ATPase alpha subunit (atpA), the cytochrome oxidase subunit 1 (cox1) and the maturase R (matR), were also added to the dataset. The five plastid sequences and the three genes from the mitochondrion account for respectively 25% and 18% of all sites (Table 2).

Genomes from all three compartments of plants existed in plant cells prior to the first angiosperm speciation event and thus they should show a similar evolutionary history. Since our dataset used sequences from all three genomes, our phylogenetic trees are most likely to be reliable if topologies are similar and supported by high bootstrap supports (Barkman et al. 2000). But divergent topologies might be caused by different mechanisms, such as paralogy, convergent substitutions, horizontal gene transfer, as well as LBA. When these are observed, inferred trees can sometimes be adequately corrected by addition of well chosen species. The accuracy of tree topology might then increase (Hillis 1998; Swofford et al. 1996). For example, inclusion of another taxon, from the ITA clades, next to Illicium would have most likely corrected the position of this species.

Concatenated datasets and alignment

A controversy remains as to whether combining potentially conflicting characters into a single matrix is a wise decision. Even when distinct datasets yield different trees, these conflicting datasets, when combined into a large concatenated matrix, can still be appropriate to obtain adequate topologies (Chippindale and Wiens 1994; Hoot et al. 1997; Kluge 1989; Rydin et al. 2002), although some other systematists argue that this approach should be avoided (e.g., Bull et al. 1993). Of the three existing strategies mentioned earlier for phylogenetic reconstruction, the one that concatenates all sequences before searching for the best substitution model is the one favoured here. There are advantages and problems to this method, but some authors agree that it often provides good tree topologies and supports (Delsuc et al. 2005; Gatesy et al. 2004; Huelsenbeck et al. 1996; Salamin et al. 2002; Soltis et al. 2004a).

Our resulting concatenated matrix is substantial, because it consists of fifteen genes spanning more than 32 kb for the aligned nucleotide dataset and over 10 thousand aligned amino acids. In comparison, in previous studies inferring the phylogeny of the earliest angiosperms, the number of sites varied from about 2 to 14 thousand nucleotides, and yet congruent results were produced (Barkman et al. 2000; Graham and Olmstead 2000; Hilu et al. 2003; Mathews and Donoghue 1999; Parkinson et al. 1999; Qiu et al. 1999; Soltis et al. 1999a). A broad study has shown that, depending on the level of bootstrap support required, the number of genes required in the concatenated matrix, in order to recover an adequate tree, ranged from 8 to 20 (Rokas et al. 2003). Our supermatrix thus contains a sufficient number of unlinked (or independently evolving) genes to potentially infer accurate trees. For all fourteen protein-coding genes (all sequences but the 18S rDNA), the nucleotide alignment was adjusted to be concordant with the aligned amino acids, to avoid any shift of the open reading frame.
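Making a nucleotide alignment concordant with the aligned amino acids can be pictured as threading each coding sequence onto its aligned protein, with three nucleotides per residue and three gap characters per protein gap. The Python sketch below illustrates this idea only; it is not the procedure actually followed for this matrix. The function name is hypothetical, the example sequences are invented, and the sketch assumes that each ungapped coding sequence is exactly three times as long as its ungapped protein.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Build one row of a codon alignment from an aligned protein and its CDS."""
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) != 3 * n_residues:
        raise ValueError("CDS length must be three times the number of residues")
    row, pos = [], 0
    for aa in aligned_protein:
        if aa == gap:
            row.append(gap * 3)           # a protein gap becomes a whole-codon gap
        else:
            row.append(cds[pos:pos + 3])  # next codon, reading frame preserved
            pos += 3
    return "".join(row)


# Toy example, invented for illustration.
print(thread_codons("M-KL", "ATGAAACTG"))  # ATG---AAACTG
```

Because gaps are always inserted in multiples of three, the reading frame of every sequence is preserved along the whole alignment.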

To be able to estimate the evolutionary distance of putative homologous sequences, their alignment has to be done properly. A close look at the matrices might reveal some residual ambiguous positions in the two different matrices, but, in fact, only the two fastest genes, rpa1 and rpc1, bear regions with potential alignment problems (Appendix A and B for the nucleotide and the amino acid sequences, respectively). For rpc1, the positions 13900-13925, 14195-14225 and 14870-15115 are far from being conserved and for rpa1, the problematic positions are 15680-15820 and 16945-17160. These were the only sections where the ClustalW alignment needed to be corrected manually. Even if the alignment was not completely optimized in these regions, they are small compared to the entire matrix, and the effect should not be significant. These sites were therefore not excluded from the analyses. Empirical studies have in fact established that regions with alignment ambiguity should not be discarded from the data (Lee 2001; Swofford et al. 1996). Even though such sites might sometimes contain little or misleading information, we chose not to remove them, but rather to try to get as much information as possible.

On the other hand, it has been shown that the alignment's accuracy, under some circumstances, might be more problematic than the phylogenetic methods used in order to obtain a reliable phylogeny (Ogden and Rosenberg 2006; Xia et al. 2003). Indeed, all inferred trees are limited by the quality of the alignment (Hall 2005; Yang 1998). Alignment is mostly straightforward when sequence identity is 80% and over, and then almost all sites are aligned properly. However, when identity decreases to 65%, only about 90% of the alignment is correct (Rosenberg 2005). When sequences are more divergent, special care therefore needs to be taken in order to avoid misalignments.

Sequence identity, calculated with MEGA4 (Tamura et al. 2007), is quite similar for the nucleotide dataset including all fifteen genes and for the amino acid dataset of the fourteen protein-coding genes, at 90.5% and 90.8%, respectively. Thus, the alignment of the entire matrix can be considered reliable.
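To make the notion of sequence identity concrete, the following Python sketch computes the percent identity between two aligned sequences and the mean over all pairs of an alignment. It is a simplified illustration and not a reimplementation of MEGA4: the function names are hypothetical, the toy sequences are invented, and columns containing a gap in either sequence are simply skipped (a pairwise-deletion assumption that may differ from the settings used in this study).

```python
from itertools import combinations


def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity over the columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not compared:
        raise ValueError("no ungapped columns to compare")
    matches = sum(1 for a, b in compared if a == b)
    return 100.0 * matches / len(compared)


def mean_pairwise_identity(alignment):
    """Average percent identity over all pairs of sequences in the alignment."""
    pairs = list(combinations(alignment.values(), 2))
    if not pairs:
        raise ValueError("at least two sequences are required")
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)


# Toy alignment, invented for illustration.
toy = {"taxon1": "ATGCATGCAT", "taxon2": "ATGCATGCAA", "taxon3": "ATGC-TGCAT"}
print(round(mean_pairwise_identity(toy), 1))
```

Applied gene by gene, the same calculation gives the kind of per-gene identity values that are compared with the 80% and 65% guidelines cited above.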

However, a closer look at the identity of the rpc1 and rpa1 sequences, which are the most divergent gene sequences, might be useful. The nucleotide identities of these two genes are 71.2% and 66.7%, respectively, indicating that rpa1 is the fastest evolving gene. Considering, at the most, that 10% of the positions of both of these genes are misaligned, then only about 400 sites were not aligned adequately, and this corresponds approximately to the total of all positions detailed above. Since these putative misaligned sites do not represent 1% of the entire matrix, it can be argued that those ambiguous positions were not abundant enough to compromise phylogenies. Thus, despite the fairly high divergence of these two fast evolving genes, the alignment can still be considered reliable. However, since a sequence divergence between 20% and 30% is believed to be the maximum acceptable to be able to infer adequate phylogenies (Friedlander et al. 1994; Graybeal 1994; Hillis and Dixon 1991), these two fast evolving genes, but especially rpa1, are probably not borderline for inferring phylogenies.

Choice of outgroups

In order to root trees, an adequate outgroup needs to be selected. At least one study has demonstrated that choosing both Pinus and Ginkgo as outgroups gave rise to increased signal (Barkman et al. 2000). These genera were thus used in our datasets. However, our results have shown that the position of Illicium is quite unusual, being located next to Amborella (or the clade Amborella + Nymphaea) at the basal position. This problem is probably due to long branch attraction between the ingroup species Illicium and the outgroups, because very long branches connect the ingroup (angiosperms) to the two outgroups (gymnosperms), as seen, for example, in Figures 7b, 7c, 8b and 8c (Parkinson et al. 1999; Stefanovic et al. 2004). Other studies have obtained similarly misleading results: e.g. Graham and Olmstead (2000) found Cabomba, instead of Amborella, as sister to all angiosperms, with almost the same outgroup species (apart from Pinus and Ginkgo, Gnetum was also added). They had to add another species (Nymphaea) in order to break this attraction. Therefore, the inclusion of a well chosen species closely related to Illicium, such as those of the clade Austrobaileyales, would probably have been appropriate. In order to examine more closely the consequences of excluding the outgroups Pinus and Ginkgo from the dataset, ML analyses were performed with and without the gymnosperm outgroups (Figures 7, 8, 9 and 10). No notable differences in topologies were observed, but a few bootstrap percentages increased without these two gymnosperm outgroup species, which most likely indicates that the inclusion of these two gymnosperms might add more homoplasies (Hilu et al. 2003). However, Pinus and Ginkgo did not significantly alter the tree topologies, which justifies why MP analyses were made without these gymnosperm species as outgroups.

Among-site rate variation and removal of noise

One of the first issues that must be addressed when looking at a dataset is its capability of adequately resolving phylogenetic trees. The rate of evolution of a sequence must be high enough to provide a significant number of informative sites. Yet when this rate is too fast, some of these sites become useless, having undergone too many multiple substitutions. They are usually referred to as being saturated, and they contribute to the presence of noise. Some investigators argue that saturated sites, such as third codon positions, should be avoided for phylogenetic studies (Chaw et al. 2000). In order to investigate whether the datasets were saturated, the ratio of transitional over transversional differences per site was calculated for the two fastest evolving sequences, namely rpcl and rpal. This ratio is expected to be 0.4 or less for highly saturated sequences (Hilu et al. 2003; Holmquist 1983). At 0.907 and 0.860 for the respective rpcl and rpal genes, the observed ratios are clearly above this threshold. Therefore, because these divergent sequences cannot be considered saturated, neither can the entire matrix. Furthermore, because fast evolving sites carry most of the information related to the evolutionary history, excluding them would imply discarding most of the useful data. We chose not to reject them, because even saturated data can potentially provide valuable information (Hillis 1998; Kallersjo et al. 1999). Instead we classified all positions into eight categories, based on their substitution rates (Tables 7 and 8 for the nucleotide and amino acid matrices, respectively).
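The saturation check described above can be illustrated with a minimal sketch. It assumes that the ratio is obtained from uncorrected pairwise differences over an alignment; the thesis does not state which software or correction was used, and the function names and toy sequences below are hypothetical.

```python
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify(a, b):
    """Return 'ts' (transition), 'tv' (transversion) or None for two aligned bases."""
    if a == b or a not in BASES or b not in BASES:
        return None  # identical bases, gaps and ambiguity codes are skipped
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "ts" if same_class else "tv"


def ts_tv_ratio(seqs):
    """Ratio of transitional to transversional differences over all sequence pairs."""
    ts = tv = 0
    for s1, s2 in combinations(seqs, 2):
        for a, b in zip(s1, s2):
            kind = classify(a, b)
            if kind == "ts":
                ts += 1
            elif kind == "tv":
                tv += 1
    return ts / tv if tv else float("inf")


# Hypothetical toy alignment (not data from this study).
toy = ["ACGT", "GCGT", "ACTT"]
ratio = ts_tv_ratio(toy)
# Assumption: 0.4 is the saturation threshold quoted in the text.
print(ratio, "saturated" if ratio <= 0.4 else "not highly saturated")
```

Because both counts are divided by the same number of compared sites, the ratio of raw counts equals the ratio of per-site differences.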

In almost all sequences, substitution rate variation is significant among sites. In fact, a

continuum of evolutionary rates might be possible (Kelly and Rice 1996) and thus our classification of

all positions into eight categories was basically only an approximation of the reality concerning these

nucleotide substitutions. For protein-coding sequences, most of this disparity among sites is explained by structural and, predominantly, functional constraints. Furthermore, because the genetic code is degenerate, third codon positions accumulate more substitutions than the two others. However, despite a consensus about the ubiquity of rate variation in most sequences, no real agreement has so far arisen on how to treat this "continuum"

of rates within datasets (Huelsenbeck et al. 1996). The situation is further complicated, for a given

site, by potential changes of the evolutionary rate over time, due to fixed mutations that could then

alter the constraints of such positions. This phenomenon has been frequently observed and is now

referred to as heterotachy (Philippe and Lopez 2001).

Rate heterogeneity of both our nucleotide and amino acid matrices is substantial. For instance, apart from the invariant sites, the sites with the lowest relative rate are almost two and six thousand times slower than the fastest ones, for the respective nucleotide and amino acid datasets (Tables 7 and 8). Over half of all positions were invariant, for both the nucleotides (54%) and the amino acids (59%), whereas the number of sites in the three slowest variable rate categories (2, 3 and 4) is negligible in both matrices (far below 1% of sites). In contrast, the three fastest categories (6 to 8) have approximately the same abundance in both datasets, from 9 to 16%, whereas their relative rates change drastically, sites of category 8 being more than six and eight times faster in the respective DNA and protein datasets. For the nucleotide matrix, which results in great part from protein-coding genes (about 95%), the category with the fastest evolving sites (relative rate of 4.9955) contains only 16% of sites (instead of a third of all sites), meaning that the rate among third codon positions was clearly not homogeneous. A simple exclusion of all these positions would therefore have eliminated an important number of sites with high informative content whose saturation level was probably not excessive, such as those from mitochondrial genes, which usually evolve much more slowly in angiosperms, even for synonymous substitutions (Drouin et al. 2008).

In both the nucleotide and the amino acid datasets, the relative rates of evolution of the fastest evolving sites (category 8) are about three to four times faster than those of the next category (category 7; Tables 7 and 8). These sites are therefore the ones that contribute the most to the noise of the datasets, which is why many resulting phylogenies are more congruent without these characters.

However, as seen in Table 9, sites of categories 8 and 7 are also those bearing the most informative signal. Indeed, the percentage of parsimony informative sites is only 6 and 4% (for the respective nucleotide and amino acid datasets) when these characters are excluded, whereas the complete datasets have 26 and 18% of parsimony informative sites, respectively.
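As an illustration of how a percentage of parsimony informative sites can be obtained, the following sketch applies the usual definition: a site is informative when at least two character states each occur in at least two taxa. The helper names and the toy alignment are hypothetical, and the treatment of gaps as missing data is an assumption rather than the setting used for Table 9.

```python
from collections import Counter


def is_parsimony_informative(column, missing="-?"):
    """True if at least two states are each present in at least two taxa."""
    counts = Counter(c for c in column if c not in missing)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def count_pi_sites(seqs):
    """Number of parsimony informative columns in a list of aligned sequences."""
    return sum(is_parsimony_informative(col) for col in zip(*seqs))


# Hypothetical toy alignment of four taxa and four sites.
toy = ["AAGA", "AAGC", "ATCA", "ATCG"]
n_pi = count_pi_sites(toy)
print(f"{n_pi} of {len(toy[0])} sites are parsimony informative "
      f"({100 * n_pi / len(toy[0]):.0f}%)")
```

The same function works for amino acid alignments, since it only counts how often each character occurs in a column.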

It was also prudent to examine whether incongruence exists among the sequences from the three genomic compartments. Therefore, they were partitioned into nuclear, mitochondrial and chloroplastic datasets, and ML analyses were performed. Only the nuclear nucleotide dataset (Figure 11) yielded a notably different topology, with a clade of Ceratophyllum + Beta as sister group to the Magnoliales. The amino acid dataset derived from the same nuclear genes (Figure 12) still gives rise to a clade of Ceratophyllum + Acorus, as in most figures. Therefore the incongruence of the nuclear nucleotide dataset should probably be attributed to a high level of noise (Bapteste et al. 2002; Dolphin et al. 2000; Sanderson and Shaffer 2002).

Choice of methods

Analyses were made with two methods, namely MP and ML, which can explicitly describe the distribution of characters among the investigated species (Siddall 1998). While MP only considers the minimum number of changes necessary, ML seeks the most probable pattern. When evolutionary rates are small and mostly equal, MP can obtain adequate phylogenies (Felsenstein 1978; Hendy and Penny 1989). However, ML is well suited to handling multiple substitutions, whereas MP does not consider these changes. Because our dataset contains almost 30% of sites whose relative rate is more than 1 (categories 7 and 8 in Tables 7 and 8), these multiple substitutions probably needed to be accounted for. Another problem with MP is that this method is subject to LBA. However, it has been shown that ML can also suffer from this problem (Lockhart et al. 1996; Swofford et al. 2001), especially if the number of taxa is small (Anderson and Swofford 2004). Yet there are some circumstances where MP can clearly deduce more accurate topologies, such as long branch repulsion, which occurs when a model-based method such as ML uses the wrong model (Siddall 1998; Swofford et al. 2001). The use of both methods was therefore considered wise for comparing topologies. Congruent trees can be indicative of the most satisfactory phylogenies, especially when accompanied by high bootstrap percentages.

With equal evolutionary rates, ML performs slightly better than MP, especially in situations where long branches might arise. But when rates fluctuate, the former is significantly superior, because it is generally the least biased (Gadagkar and Kumar 2005; Gu et al. 1995; Kuhner and Felsenstein 1994), even with heterotachous datasets (Kuhner and Felsenstein 1994; Swofford et al. 2001). The first ML methods used to infer phylogenies assumed the same rate of substitution among sites. But such an assumption is unrealistic, especially for protein-coding genes (Yang 1993). In fact, ML analyses that only use equal-rate models could even create long branches when heterogeneity exists among positions, especially with small numbers of characters (Lockhart et al. 1996; Stefanovic et al. 2004). Thus, in order to obtain reliable results and therefore probably more accurate topologies, phylogenetic analyses must account for rate heterogeneity (Kelly and Rice 1996; Stefanovic et al. 2004). For instance, ML models that include these variations clearly obtained better results (Yang 1993; Yang et al. 1994a).

Even though a number of distributions exist to model rates among sites, the Γ distribution, with a single parameter α governing rate variation, seems to enhance fit in most matrices (Yang 1993; Yang 1996a). It has therefore been widely used to take into account the diverse evolutionary rates of different DNA sequences, especially when nucleotide datasets for protein-coding genes were not analyzed using codon substitution models (Ren et al. 2005; Takezaki and Gojobori 1999). Rates are quite dissimilar when α is small (less than 1), whereas a large α (over 2) means weak rate variation (Guindon and Gascuel 2002; Page and Holmes 2007). This Γ distribution is the one used for our datasets with ML, along with the proportion of invariable sites.
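The effect of the shape parameter on among-site rate variation can be sketched numerically. The example below assumes the common parameterization in which the gamma distribution has shape α and scale 1/α, so that the mean relative rate is 1 and the variance is 1/α. It ignores the proportion of invariable sites, and the function names, the sample size and the 0.1 cut-off for "slow" sites are illustrative choices only.

```python
import random
import statistics


def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative rates from a gamma distribution with mean 1 (shape alpha, scale 1/alpha)."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


def summarize(alpha, n_sites=20000):
    """Mean, variance and fraction of nearly static sites for a given alpha."""
    rates = sample_site_rates(alpha, n_sites)
    return {
        "mean": statistics.fmean(rates),
        "variance": statistics.pvariance(rates),
        # Illustrative cut-off: sites evolving at less than a tenth of the mean rate.
        "frac_slow": sum(r < 0.1 for r in rates) / n_sites,
    }


for alpha in (0.25, 0.65, 2.0):
    s = summarize(alpha)
    print(f"alpha={alpha}: mean={s['mean']:.2f} "
          f"variance={s['variance']:.2f} slow sites={s['frac_slow']:.2%}")
```

With a small α, most sites are nearly static while a few evolve very quickly, whereas with α above 2 the rates cluster around the mean.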

These last sites have to be carefully considered, because when invariant positions are included to infer trees, they might contribute to the selection of wrong phylogenies, mostly due to the violation of certain model assumptions when performing ML methods (Lockhart et al. 1996). The accuracy or adequate fit of a substitution model for ML analyses does not automatically imply obtaining the true tree topology if such conditions are not fulfilled (Farris 1969; Gaut and Lewis 1995; Ren et al. 2005). For both our nucleotide and amino acid datasets, and despite removal of the fastest evolving sites, the resulting α, calculated with ModelTest 3.7 (Posada and Crandall 1998) and ProtTest 1.4 (Abascal et al. 2005), was always rather small, ranging between 0.25 and 0.65, thus suggesting a significant variation of substitution rates. Concerning the selection of a model for our dataset, it is known that some mathematical models, using several parameters to enhance fit to the dataset, do not necessarily recover more acceptable tree topologies than the simplest ones (Rzhetsky and Sitnikova 2002; Yang 1996b). Thus, simpler models should occasionally be favoured, especially if sequences are not too divergent, i.e. those with a p distance of less than 0.1. However, these models should be avoided when the sequences compared have a p distance exceeding this 0.1 threshold (Rzhetsky and Sitnikova 2002). Because the p distances of our nucleotide and amino acid datasets were between 0.25 and 0.35, these simpler models were not appropriate, and more parameter-rich models needed to be selected. ModelTest 3.7 (Posada and Crandall 1998) and ProtTest 1.4 (Abascal et al. 2005) tested all available substitution models and proposed GTR (+I+Γ) and JTT (+Γ+F) for the respective nucleotide and amino acid datasets, with the exception of MtREV (a model usually fitted to mitochondrial sequences), which was instead chosen for the protein dataset without sites of categories 8, 7 and 6. Since the residual positions were slowly evolving and mitochondria mostly have low rates of substitution (Drouin et al. 2008), this particular choice of model was not surprising.
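The p distance criterion mentioned above can be sketched as follows. The p distance is the proportion of compared sites at which two aligned sequences differ. Averaging over all pairs, skipping gapped sites, and the helper names are assumptions made for this illustration; the thesis does not specify how its p distances were summarized.

```python
from itertools import combinations


def p_distance(seq1, seq2, missing="-?"):
    """Proportion of differing sites among the sites compared in two aligned sequences."""
    compared = diffs = 0
    for a, b in zip(seq1, seq2):
        if a in missing or b in missing:
            continue  # assumption: sites with a gap or missing data are skipped
        compared += 1
        diffs += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def mean_p_distance(seqs):
    """Average p distance over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


def simple_model_may_suffice(seqs, threshold=0.1):
    """Apply the 0.1 threshold quoted in the text to the mean p distance."""
    return mean_p_distance(seqs) < threshold


# Hypothetical toy sequences (not data from this study).
toy = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAT"]
print(mean_p_distance(toy), simple_model_may_suffice(toy))
```

For distances such as those reported above (0.25 to 0.35), this check would return False, in line with the choice of parameter-rich models.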

When noise is removed from both the nucleotide and the amino acid datasets, a change in tree shape might occur, because the residual signal usually increases (Barkman et al. 2000). For datasets with among-site rate variation, neglecting this variation will underestimate distances among certain species (Yang 1996a) and thus phylogenetic inferences might be altered (Gu et al. 1995; Kuhner and Felsenstein 1994; Sullivan et al. 1996a). But some systematists prefer retaining all data, because this so-called "total-evidence approach" can most probably maximize the information available (Baker and DeSalle 1997; Chippindale and Wiens 1994; Huelsenbeck et al. 1996). Our analyses were therefore initially performed with the whole matrix, before successive removals of the three fastest categories of sites (categories 8, 7 and then 6), in order to compare topologies.

MP analysis

The number of parsimony informative (PI) characters decreased substantially when the fastest evolving sites were sequentially removed (Table 9). There were 8382 PI characters in the total evidence nucleotide matrix, whereas only 230 remained in the dataset containing no sites of categories 8, 7 and 6. For the amino acids, there were 1908 PI sites in the total evidence dataset, but only 25 were still available in the shortest matrix, without positions of categories 8, 7 and 6. Because of the low number of PI sites in the smallest datasets of both nucleotide and amino acid alignments, the last trees of Figures 5 and 6 certainly lack information for a reliable resolution. They indeed contain many multifurcations and only five well supported clades remained (Figures 5d and 6d). However, most nodes of the remaining trees were well supported (bootstrap percentages of 50% and over). In both figures, when the fastest evolving sites (category 8) were removed, some clades appeared with even stronger bootstrap support. In fact, the Arabidopsis-Pisum and Beta-Nicotiana clades, as well as the magnoliids, were present with higher bootstrap percentages. Furthermore, Ceratophyllum, initially located next to the eudicots, although without strong support, in the complete nucleotide and amino acid alignments (Figures 5a and 6a), instead formed a strongly supported clade (99 and 94%) with Acorus in Figures 5b and 6b. These results suggest that the fastest evolving positions, representing 16 and 13% of the respective nucleotide and amino acid alignments, were probably too noisy for inferring adequate phylogenies (Barkman et al. 2000). Indeed, with a relative rate of about 5 (Tables 7 and 8), they underwent multiple substitutions and were thus most certainly too saturated and affected by many homoplasies (independent convergence, parallelism or reversal). A good measure of the fit of data to trees is the consistency index (CI), which quantifies the amount of homoplasy (Brown et al. 2001). CI varies from 0 (a lot of homoplasy) to 1 (no homoplasy; Farris 1989; Kluge and Farris 1969). When the sites of category 8 were removed, this index increased and the number of homoplasies was therefore lowered accordingly (Table 9). The only datasets without homoplasy (CI equal to 1) were those obtained after removal of sites of categories 8 to 6, but their number of parsimony informative sites, as seen previously, was probably too low for inferring adequate phylogenies. The relative rate of substitution of the next grouping (category 7), being approximately three times slower, might be more suitable for such phylogenetic investigations.

Therefore, TREE-PUZZLE (Strimmer and von Haeseler 1996), which created eight classes of positional rate heterogeneities, was quite helpful in discerning between noise (sites not correlated to a phylogeny) and valuable information, as well as consequently increasing tree resolution.
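The consistency index discussed in this section can be illustrated with a small sketch. It assumes the ensemble definition, CI = (minimum possible number of changes) / (number of changes observed on the tree), and counts the observed changes with the Fitch algorithm on a rooted binary tree written as nested pairs of taxon names. The tree, taxon labels and sequences are hypothetical, and the sketch includes all variable characters, whereas parsimony programs may report CI with or without uninformative characters.

```python
def fitch_steps(tree, states):
    """Return (possible ancestral states, number of changes) for one character.

    `tree` is either a taxon name (leaf) or a pair (left, right) of subtrees.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    lset, lsteps = fitch_steps(left, states)
    rset, rsteps = fitch_steps(right, states)
    common = lset & rset
    if common:
        return common, lsteps + rsteps
    # No shared state: one change is required at this node.
    return lset | rset, lsteps + rsteps + 1


def consistency_index(tree, alignment):
    """Ensemble CI over all sites of a dict {taxon: aligned sequence}."""
    n_sites = len(next(iter(alignment.values())))
    min_steps = obs_steps = 0
    for i in range(n_sites):
        states = {taxon: seq[i] for taxon, seq in alignment.items()}
        min_steps += len(set(states.values())) - 1
        obs_steps += fitch_steps(tree, states)[1]
    return min_steps / obs_steps if obs_steps else 1.0


# Hypothetical four-taxon example: site 1 fits the tree, site 2 is homoplasious.
tree = (("t1", "t2"), ("t3", "t4"))
alignment = {"t1": "AA", "t2": "AC", "t3": "CA", "t4": "CC"}
print(consistency_index(tree, alignment))
```

Removing the homoplasious second site from this toy alignment raises the index to 1, which mirrors the increase in CI reported in Table 9 when the fastest evolving sites are excluded.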

Unfortunately, removal of fast evolving sites appreciably decreased the proportion of parsimony informative characters. For example, when positions of category 8 were deleted, this proportion diminished from 26% to 17% and from 18% to 12%, for the respective nucleotide and amino acid datasets (Table 9). Furthermore, some internal branches became very short, especially after the deletion of sites of both categories 8 and 7 (Figures 5c and 6c), but the majority of the remainder were still long enough to be indicative of robust clades.

Comparing the figures resulting from the nucleotide and amino acid datasets (Figures 5 and 6), it can be seen that many clades were consistent and well supported. Early angiosperms, such as magnoliids, often formed a mostly well supported monophyletic grouping, with Magnoliales (Liriodendron and Magnolia) next to Laurales (Persea) and then next to a clade of Winterales (Drimys) and Piperales (Saruma), which is also inferred by most angiosperm molecular studies. Monocots (Oryza, Zea, Asparagus and Acorus) were also clearly grouped together and robustly supported, with Ceratophyllum often located next to the basal Acorus. For their part, eudicots usually had high bootstrap supports. Rosids (Arabidopsis, Pisum and Populus) were very regularly found together, as were Beta (a Caryophyllales) and Nicotiana (an Asterid). However, the relationship between Rosids, Asterids and Caryophyllales was not consistently well established, as bootstrap supports varied from 97% to below 50%. Vitis, then Papaver, were also mainly well supported as sisters of all remaining eudicots. Finally, it is noteworthy that the Acorus-Ceratophyllum clade stood as one of the most strongly supported and consistent groups in all MP trees, even with successive deletions of the fast evolving sites.

On the whole, MP analyses succeeded in inferring consistent phylogenetic trees. Apart from the last dataset (no sites of categories 8, 7 and 6; Figures 5d and 6d) and, to a lesser level, the one without sites of categories 8 and 7 (Figures 5c and 6c) of both nucleotide and amino acid alignments, most nodes were congruent and strongly supported. Indeed, for the two largest datasets of both nucleotides and amino acids (all sites and those without category 8), at least 80% of all internal nodes were supported with bootstrap percentages of 70% and over (Figures 5a, 5b, 6a and 6b).

When the most saturated positions (category 8) were absent, the level of homoplasy decreased and thus some nodes were even more strongly supported. Furthermore, the very strongly supported Ceratophyllum-Acorus clade emerged. No LBA phenomenon seemed to be noticeable, even with the presence of fast evolving species such as grasses (Oryza and Zea). To test the consistency of these phylogenies, ML analyses were then performed. As mentioned previously, we chose not to include Pinus and Ginkgo as outgroups for the MP analyses, but they were incorporated in the ML analyses to demonstrate the usefulness of these two species for anchoring the topology of trees.

ML analysis

For both the nucleotide and amino acid datasets, comparisons of trees with and without the two outgroups Pinus and Ginkgo showed only small differences (Figure 7 compared to 9 and Figure 8 compared to 10). For the nucleotide matrix, apart from the position of Illicium, due to the rooting with Amborella, the only discrepancy (Figure 7 compared to 9) was the position of Beta, which either formed a clade with Nicotiana or was next to it. For the amino acids (Figure 8 compared to 10), the position of Beta also changed slightly, as did that of Pisum. But because these slight alterations affected only a very few nodes, which were rarely well supported, one can conclude that the addition of these outgroups did not substantially modify the tree topology.

With the exception of Beta, as seen above, eudicots were found clustered together and for the most part with good bootstrap supports. Similarly to the MP analyses, Rosids (Arabidopsis, Pisum and Populus) were positioned together, but Beta (a Caryophyllales) was not always forming a group with Nicotiana (an Asterid). Besides, Vitis, then Papaver, were predominantly well supported as successive sisters of all the other eudicots. The early angiosperms, as seen in MP analyses and in most molecular investigations, formed a monophyletic grouping, the only exception being Saruma that surprisingly was sometimes located with the monocots, but without adequate resolution.

Moreover, monocots (Oryza, Zea, Asparagus and Acorus) were strongly supported together, with

Ceratophyllum mostly forming a clade with Acorus. More importantly, this last group again received one of the strongest bootstrap percentages and, for the amino acids, only emerged when the fastest evolving sites (of category 8) were deleted.

Data partitions

Partitioning the nucleotide and amino acid datasets into the three genomic compartments in order to perform ML analyses led to the conclusion that weaker signals do not always infer adequate phylogenetic trees. In fact, many studies have shown that simultaneous analysis of all the data available is preferable (Baker and DeSalle 1997; Olmstead and Sweere 1994), because relationships not revealed with partitions can be exposed only with concatenation of sequences (Olmstead and Sweere 1994). This is indeed our situation with at least the nuclear partition, in which the clade Ceratophyllum-Beta appeared. Combination of all three partitions never produced such a conspicuous grouping.

Bootstrap confidence

Felsenstein (1985) proposed the use of the bootstrap as a method to estimate confidence levels of phylogenetic trees. This practice has since been criticized, because bootstrap values are not statistically correlated with the probability of obtaining the true clade, but rather reflect the resampling of characters from a population (Efron et al. 1996; Felsenstein and Kishino 1993; Hillis and Bull 1993; Soltis and Soltis 2003a). In other words, bootstrap values are not linked directly to accuracy, but rather correspond to repeatability. Nevertheless, simulations have shown that a relationship does exist between bootstrap values and accurate phylogenies. Moreover, confidence values are mostly too conservative, because 95% of clades seem to be correctly inferred with bootstrap values of over 70%, whereas almost every clade is true when these bootstrap values are 80% and more (Hillis and Bull 1993; Zharkikh and Li 1992). The threshold of 70% or more can thus be interpreted as an indication of a supported clade (Hillis and Bull 1993; Soltis and Soltis 2003a). However, most of the trees in the above figures had some very short branches, while others were particularly long. A certain number of node disagreements among trees should thus probably be attributed to these sampling errors (Graham and Olmstead 2000; Moore et al. 2007).
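The resampling step behind bootstrap percentages can be sketched as follows: alignment columns are drawn with replacement to build pseudo-replicate datasets of the original length, and the support for a clade is the percentage of replicates in which it is recovered. In this sketch `has_clade` is a hypothetical callback standing in for tree inference plus a clade check, which is not implemented here, and all names and data are illustrative.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample the columns of {taxon: sequence} with replacement."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in picks)
            for taxon, seq in alignment.items()}


def bootstrap_support(alignment, has_clade, n_reps=100, seed=0):
    """Percentage of pseudo-replicates for which `has_clade(replicate)` is True.

    `has_clade` is a placeholder: in practice it would infer a tree (MP or ML)
    from the replicate and test whether the clade of interest is present.
    """
    rng = random.Random(seed)
    hits = sum(bool(has_clade(bootstrap_replicate(alignment, rng)))
               for _ in range(n_reps))
    return 100.0 * hits / n_reps


# Hypothetical toy alignment; the dummy callback only checks one site pattern.
toy = {"t1": "ACGT", "t2": "AGGT", "t3": "TCGA"}
print(bootstrap_support(toy, lambda rep: rep["t1"][0] == rep["t2"][0]))
```

Because each replicate only reshuffles the characters already sampled, the resulting percentage measures how repeatable a clade is for the data at hand, which is the distinction between repeatability and accuracy drawn above.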

As shown in Table 11, the grouping Ceratophyllum-Acorus appeared in most phylogenetic trees, with very high bootstrap supports (i.e. more than 94%). Since Felsenstein (1985) claimed that bootstrap percentages of 95% or more are indicative of well supported clades, one can be quite confident in that inferred topology. Of course, it cannot be concluded unquestionably that this clade should be located as sister to the monocots. However, since most molecular studies have proposed, with strong support, Acorus as a basal monocot (Chase et al. 1993; Duvall et al. 1993; Davis et al. 1998; Parkinson et al. 1999; Jansen et al. 2007), this possibility should not be considered far-fetched. In order to investigate further the relative position of the clade Acorus-Ceratophyllum, additional species would be advisable, particularly those representing the other orders within monocots (e.g., the species-rich Liliales and Asparagales), as well as lineages from magnoliids and basal eudicots. Apparently, our rather sparse sampling did not succeed in sufficiently resolving some internal nodes, even though representatives of most of the major basal taxa were incorporated into the dataset.

Relationships between Acorus and Ceratophyllum

The two genera Ceratophyllum (hornwort) and Acorus (sweet flag) both belong to orders (Ceratophyllales and Acorales, respectively) that only encompass a single family (Ceratophyllaceae and Acoraceae, respectively), according to the last update of the angiosperm classification (APG II 2003). The phylogenetic positions of these isolated families are difficult to assess, because of the small number of synapomorphies shared with any other group of angiosperms. Whereas the last APG classification proposed Ceratophyllum as a sister taxon to the Ranunculales (e.g., Papaver), the most basal eudicots, Acorus was placed next to the remainder of the monocots (APG II 2003). Previously, both were tentatively placed within alternative groupings, but, because of molecular or morphological discrepancies, had to be removed from these locations.

Even though their position within angiosperms still needs more clarification, monocots are adequately supported as a monophyletic group, with at least 13 identified morphological synapomorphies (Donoghue and Doyle 1989; Doyle and Donoghue 1992; Loconte and Stevenson 1991). Acorus has usually been placed next to the remainder of monocots (Chase et al. 1993; Davis et al. 1998; Duvall et al. 1993; Jansen et al. 2007), whereas the position of Ceratophyllum is still subject to debate. In fact, this genus has been reported as being either at the very base of all angiosperms (Chase et al. 1993; Dilcher 1989; Les 1988; Les et al. 1991; Qiu 1993), as sister to eudicots (Hilu et al. 2003; Saarela et al. 2007; Soltis and Soltis 1997a; Soltis et al. 1999a), within the magnoliids (Savolainen and Chase 2003; Zanis et al. 2002), or next to the monocots (Chaw et al. 2000; Davies et al. 2004; Graham and Olmstead 2000; Graham and Olmstead 2000a; Hamby and Zimmer 1992; Qiu et al. 1999; Soltis et al. 2003; Soltis et al. 1997). However, most of the values supporting these assertions were low. In contrast, our trees strongly supported Ceratophyllum as forming a clade with Acorus, with bootstrap supports of at least 94% (Table 11). This location was, for the first time, very strongly supported.

Taxa at the base of angiosperms, such as Ceratophyllum and Acorus, do not share many morphological synapomorphies, apparently due to rapid evolution subsequent to the origin of angiosperms (Soltis et al. 1997). Many authors have argued that the numerous atypical characters of Ceratophyllum, a highly adapted genus that probably represents an ancient lineage, largely explain why it is difficult to classify (e.g., Qiu et al. 1993). This same explanation is most likely also appropriate for the monocot Acorus. Both genera are therefore quite dissimilar from the other angiosperms, to some extent due to many adaptations as aquatic plants. For instance, Ceratophyllum is seldom considered in cladistic morphological investigations, because of its small number of character states (Doyle and Endress 2000). But, apart from the magnoliids, monocots are more closely related to Ceratophyllum than any other primitive dicots (Chase 2004). Indeed, a certain number of morphological characters are shared between these two genera. A close examination of the anatomy and physiology of both Ceratophyllum and Acorus can thus indicate a few synapomorphies (Table 12).

Acorus is a predominantly aquatic herb, with very narrow leaves. These so-called "equitant" leaves overlap at the base to produce flattened leaves distributed in two ranks. Ceratophyllum is a submersed, aquatic plant, without any roots, that grows in freshwater. At least three leaves protrude from the same node (whorled). Thus, the phyllotaxy (i.e. the positions of leaves on a stem) of both genera might be considered similar, because leaves originate from the same location (Doyle and Endress 2000). Neither peltate leaves, nor arils (colored structures at the surface of seeds) are present in either genus. Concerning the xylem and phloem arrangements, Ceratophyllum only has a single vascular bundle, yet this is absent from Acorus. No cells containing latex (laticifers) are present in these species (Doyle and Endress 2000; Les 1988).

Table 12: Relationships between Ceratophyllum and Acorus.

Characteristics                Ceratophyllum     Acorus
Number of families in order    Single family     Single family
Habitat                        Aquatic           Aquatic
Leaf arrangement               Whorled           Equitant
Aril                           Absent            Absent
Laticifer                      Absent            Absent
Small flower                   Unisexual         Bisexual
Stamen                         Free              Free
Carpel number                  1 (2 fused?)      2-3
Style                          Yes               Yes
Curvature of ovule             Orthotropous      Orthotropous
Tapetum                        Amoeboid          Amoeboid
Pollen grains                  Monads            Monads
Pollen size                    17-45 µm          < 20 µm

Ceratophyllum bears flowers that share few resemblances with those of other species (Iwamoto et al. 2003). Whereas Acorus has small bisexual flowers, with two or three fused carpels (Judd et al. 2008), Ceratophyllum has unisexual flowers, with only one carpel. However, this single carpel might be the consequence of the fusion of two carpels (Endress 1994). If this is the case, Ceratophyllum and Acorus would then exhibit the same carpel number. Moreover, both genera have a style, which is an elongated portion of the carpel (Doyle and Endress 2000; Iwamoto 2003). Pollen grains are found in both genera as monads: they are not clustered into tetrads, but are instead separated from the other grains. Pollen size is quite comparable in both genera, being small for Acorus (< 20 µm) and small to medium for Ceratophyllum (between 17 and 45 µm; Les 1988; Doyle and Endress 2000). These genera also have orthotropous ovules. Stamens are not united with other parts of the flower (adnate or connate), but are instead free in both genera. As well, the tapetum, which is a component of the anther wall producing nutrients, is amoeboid, i.e. of variable shape (Les 1988; Doyle and Endress 2000; Judd et al. 2008). However, Ceratophyllum is the only known water-pollinated (hydrophilous) dicot (Les 1988a; Les 1988b; Les et al. 1991). Because of this major adaptation to its aquatic environment, Ceratophyllum flowers do not share many characters with those of Acorus. But this aquatic environment is also probably the foremost reason for many peculiar morphological traits, such as the unisexual flowers, the decreased number of carpels and ovules (Dahlgren et al. 1985), and the absence of roots, stomata and perianth (Les 1988).

Overall, the number of shared characters between Acorus and Ceratophyllum is small. Molecular characters are certainly more informative for classification of these aquatic plants than morphological traits. But, even with only a small number of shared morphological states, these synapomorphies can most likely be used to corroborate the reliability of the inferred Acorus-Ceratophyllum clade.

CHAPTER 7

CONCLUSION

Although flowering plants were reclassified early on with the aid of molecular data (Soltis et al. 2003), their origin has still not been completely elucidated. To do so, robust and consistent trees, particularly concerning basal species, would be of great significance (Stefanovic et al. 2004). The genes encoding the largest subunits of the three nuclear RNA polymerases, namely rpal, rpbl and rpcl, but also other nuclear genes, as well as mitochondrial and chloroplastic sequences, were used to infer relationships among angiosperms, and particularly the location of Ceratophyllum. Eleven of the novel rpal and rpcl phylogenetic markers were sequenced to add more information to the 22-taxon dataset (including the two outgroup species), consisting of a total of fifteen concatenated sequences. The resulting matrix contained a total of 32,806 aligned nucleotides, and 10,329 amino acids when the fourteen protein-coding genes were translated.

Although the rpal and rpcl genes showed a high degree of variability, because of a high proportion of fast evolving sites, and although some positions might be considered ambiguously aligned, they can still be considered reliable phylogenetic markers. In fact, at most 10% of the aligned positions of these two genes might be considered questionably aligned, which represents only a small proportion of the entire matrix, so that it did not compromise the phylogenies.

Concerning substitution rate disparity among sites, we chose not to exclude fast-evolving positions, but rather to classify all nucleotides and amino acids into eight categories based on their substitution rates. More than half of all positions were invariant, and most of the others belonged to the last three classes, which have the highest relative rates of substitution. Analyses were then performed using MP and ML to compare topologies, since congruent trees with high bootstrap percentages are indicative of reliable phylogenies.
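The classification of sites into rate classes, and the successive deletion of the fastest classes mentioned in this chapter, can be sketched as follows. This is a minimal illustration, not the procedure actually used: it assumes that a rate class (1 to 8, with 1 taken to be the slowest or invariant class) has already been assigned to every alignment column by some external program, and the sequences and class assignments are made-up toy data.

```python
from collections import Counter

N_CATEGORIES = 8  # number of rate classes used in the text


def category_counts(site_categories, n_categories=N_CATEGORIES):
    """Count how many alignment columns fall into each rate class (1 = slowest)."""
    counts = Counter(site_categories)
    return [counts.get(c, 0) for c in range(1, n_categories + 1)]


def remove_fastest(alignment, site_categories, n_removed, n_categories=N_CATEGORIES):
    """Return a copy of the alignment without the n_removed fastest rate classes."""
    if not 0 <= n_removed < n_categories:
        raise ValueError("n_removed must be between 0 and n_categories - 1")
    for taxon, seq in alignment.items():
        if len(seq) != len(site_categories):
            raise ValueError(f"{taxon}: sequence and category lengths differ")
    cutoff = n_categories - n_removed  # highest class that is kept
    keep = [i for i, c in enumerate(site_categories) if c <= cutoff]
    return {taxon: "".join(seq[i] for i in keep) for taxon, seq in alignment.items()}


# Toy data: hypothetical sequences and class assignments, for illustration only.
toy_alignment = {
    "Acorus": "ATGCCGTA",
    "Ceratophyllum": "ATGCAGTT",
    "Amborella": "ATGTCGAC",
}
toy_categories = [1, 1, 1, 7, 8, 1, 6, 8]

print(category_counts(toy_categories))
for n in range(4):
    trimmed = remove_fastest(toy_alignment, toy_categories, n)
    print(n, "classes removed:", len(trimmed["Acorus"]), "columns kept")
```

Each trimmed alignment would then be analysed separately, so that the support for a clade can be compared as the fastest classes are removed one at a time.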

For ML analyses, the Γ distribution was used along with a proportion of invariable sites, and the models chosen for most datasets were GTR (+I+Γ) and JTT (+Γ+F) for the nucleotide and amino acid datasets, respectively. All analyses yielded consistent phylogenetic trees, regardless of the dataset used. Most nodes of the inferred trees were well supported.

Moreover, when the fastest-evolving sites were removed, some clades received even stronger bootstrap support, and many of them were consistent throughout all figures. Early angiosperms, such as magnoliids and monocots, often formed monophyletic groupings that were mostly well supported. Most eudicot clades received high bootstrap support. Finally, the Acorus-Ceratophyllum clade stood out as one of the most strongly supported and consistent groups, even with successive deletions of the fast-evolving sites. In fact, Ceratophyllum grouped with Acorus in 20 of the 24 trees inferred. Almost half of the time, this grouping was found next to the monocots, although bootstrap support was rather low. Because Acorus and Ceratophyllum bear morphological traits that are difficult to interpret, probably owing to convergent evolution, character losses and other processes that accompanied adaptation to an aquatic environment (Les et al. 1991), their placement in phylogenetic trees is not obvious.

Even though a giant leap has been accomplished over the last decade, it is noteworthy that, of the enormous number of extant angiosperm species, only 13 account for more than 81% of all plant sequences present in sequence databanks (Savolainen and Chase 2003). In fact, no more than 6% of all known species are represented in public databases (Driskell et al. 2004). We are therefore far from having fully investigated the diversity of flowering plants. Nevertheless, our eleven new rpa1 and rpc1 sequences, added to the other gene sequences, have helped to clarify relationships among angiosperms, particularly the position of Ceratophyllum.

REFERENCES

Abascal F, Zardoya R, Posada D (2005) ProtTest: selection of best-fit models of protein evolution. Bioinformatics 21:2104-5

Adachi J, Hasegawa M (1996) Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol 42:459-68

Aguinaldo AM, Turbeville JM, Linford LS, Rivera MC, Garey JR, Raff RA, Lake JA (1997) Evidence for a clade of nematodes, arthropods and other moulting animals. Nature 387:489-93

Akaike H (1974) A New Look at the Statistical Model Identification. IEEE Trans Autom Contr AC 19:716-723

Albach DC, Soltis PS, Soltis DE, Olmstead RG (2001) Phylogenetic analysis of the Asteridae based on sequences of four genes. Ann Mo Bot Gard 88:163-212

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403-10

Anderson FE, Swofford DL (2004) Should we be worried about long-branch attraction in real data sets? Investigations using metazoan 18S rDNA. Mol Phylogenet Evol 33:440-51

APG (1998) An ordinal classification for the families of flowering plants. Ann Mo Bot Gard 85:531-553

APG II (2003) An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: APG II. Bot J Linn Soc 141:399-436

Archie JW (1989) Homoplasy Excess Ratios: New Indices for Measuring Levels of Homoplasy in Phylogenetic. Syst Zool 38:253-269

Archie JW (1989a) A Randomization Test for Phylogenetic Information in Systematic Data. Syst Zool 38:239-252

Aris-Brosou S (2003) Least and most powerful phylogenetic tests to elucidate the origin of the seed plants in the presence of conflicting signals under misspecified models. Syst Biol 52:781-93

Aris-Brosou S, Xia X (2008) Phylogenetic Analyses: A Toolbox Expanding towards Bayesian Methods. Int J Plant Genomics 2008:683509

Baker RH, DeSalle R (1997) Multiple sources of character information and the phylogeny of Hawaiian drosophilids. Syst Biol 46:654-73

Baldauf SL, Roger AJ, Wenk-Siefert I, Doolittle WF (2000) A kingdom-level phylogeny of eukaryotes based on combined protein data. Science 290:972-7

Bapteste E, Brinkmann H, Lee JA, Moore DV, Sensen CW, Gordon P, Durufle L, Gaasterland T, Lopez P, Muller M, Philippe H (2002) The analysis of 100 genes supports the grouping of three highly divergent amoebae: Dictyostelium, Entamoeba, and Mastigamoeba. Proc Natl Acad Sci U S A 99:1414-9

Barkman TJ, Chenery G, McNeal JR, Lyons-Weiler J, Ellisens WJ, Moore G, Wolfe AD, dePamphilis CW (2000) Independent and combined analyses of sequences from all three genomic compartments converge on the root of flowering plant phylogeny. Proc Natl Acad Sci U S A 97:13166-71

Bergsten J (2005) A review of long-branch attraction. Cladistics 21:163-193

Bharathan G, Zimmer EA (1995) Early branching events in monocotyledons: partial 18S ribosomal DNA sequence analysis. In: Rudall PJ, Cribb PJ, Cutler PJ, Humphries CJ (eds) Monocotyledons: systematics and evolution. Royal Botanic Gardens, London, p 81-107

Bogner J, Nicolson DH (1991) A revised classification of Araceae with dichotomous keys. Willdenowia:35-50

Bremer K (2000) Phylogenetic nomenclature and the new ordinal system of the angiosperms. In: Nordenstam B, El-Ghazaly G, Kassas M, Laurent TC (eds) Plant systematics for the 21st century. Portland Press, London, p 125-133

Bremer K (2002) Gondwanan evolution of the grass alliance of families (Poales). Evolution 56:1374

Brinkmann H, Philippe H (1999) Archaea sister group of Bacteria? Indications from tree reconstruction artifacts in ancient phylogenies. Mol Biol Evol 16:817-25

Brown JR, Douady CJ, Italia MJ, Marshall WE, Stanhope MJ (2001) Universal trees based on large combined protein sequence data sets. Nat Genet 28:281-5

Bryant D (2003) A classification of consensus methods for phylogenetics. In: Janowitz M, Lapointe FJ, McMorris F, Mirkin B, Roberts F (eds) Bioconsensus. American Mathematical Society Publications, Piscataway, NJ, p 1-21

Buckley TR (2002) Model misspecification and probabilistic tests of topology: evidence from empirical data sets. Syst Biol 51:509-23

Buckley TR, Simon C, Shimodaira H, Chambers GK (2001) Evaluating hypotheses on the origin and evolution of the New Zealand alpine cicadas (Maoricicada) using multiple-comparison tests of tree topology. Mol Biol Evol 18:223-34

Bull JJ, Huelsenbeck JP, Cunningham CW, Swofford DL, Waddell PJ (1993) Partitioning and combining data in phylogenetic analysis. Syst Biol 42:384-397

Burleigh JG, Mathews S (2004) Phylogenetic Signal in Nucleotide Data from Seed Plants: Implications for Resolving the Seed Plant Tree of Life. Am J Bot 91:1599-1613

Chang JT (1996) Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Math Biosci 137:51-73

Chase MW (2004) Monocot relationships: an overview. Am J Bot 91:1645-1655

Chase MW, Soltis DE, Soltis PS, Olmstead RG, Morgan DR, Les DH, Mishler BD, Duvall MR, Price RA, Hills HG, Qiu YL, Kron KA, Rettig JH, Conti E, Palmer JD, Manhart JR, Sytsma KJ, Michaels HJ, Kress WJ, Karol KG, Clark WD, Hedren M, Gaut BS, Jansen RK, Kim K, Wimpee CF, Smith JF, Furnier GR, Strauss SH, Xiang Q, Plunkett GM, Soltis PS, Swensen SM, Williams SE, Gadek PA, Quinn CJ, Eguiarte LE, Golenberg E, Learn Jr GH, Graham SW, Barrett SCH, Dayanandan S, Albert VA (1993) Phylogenetics of Seed Plants: An analysis of nucleotide sequences from the plastid gene rbcL. Ann Mo Bot Gard 80:528-580

Chase MW, Soltis DE, Soltis PS, Rudall PJ, Fay MF, Hahn WJ, Sullivan S, Joseph J, Molvray M, Kores PJ, Givnish TJ, Sytsma KJ, Pires JC (2000) Higher-level systematics of the monocotyledons: An assessment of current knowledge and a new classification. In: Wilson KL, Morrison DA (eds) Proceedings of the 2nd International Monocot Symposium. CSIRO, Melbourne, p 3-16

Chase MW, Stevenson WDM, Wilkin P, Rudall PJ (1995) Monocot systematics: a combined analysis. In: Rudall PJ, Cribb PJ, Cutler PJ, Humphries CJ (eds) Monocotyledons: systematics and evolution. Royal Botanic Gardens, Kew, p 685-730

Chaw SM, Parkinson CL, Cheng Y, Vincent TM, Palmer JD (2000) Seed plant phylogeny inferred from all three plant genomes: monophyly of extant gymnosperms and origin of Gnetales from conifers. Proc Natl Acad Sci U S A 97:4086-91

Chaw SM, Zharkikh A, Sung HM, Lau TC, Li WH (1997) Molecular phylogeny of extant gymnosperms and seed plant evolution: analysis of nuclear 18S rRNA sequences. Mol Biol Evol 14:56-68

Chippindale PT, Wiens JJ (1994) Weighting, Partitioning, and Combining Characters in Phylogenetic Analysis. Syst Biol 43:278-287

Cramer P (2002) Multisubunit RNA polymerases. Curr Opin Struct Biol 12:89-97

Crepet WL, Nixon KC (1998) Two new fossil flowers of magnoliid affinity from the Late Cretaceous of New Jersey. Am J Bot 85:1273-1288

Crepet WL, Nixon KC, Gandolfo MA (2004) Fossil evidence and phylogeny: the age of major angiosperm clades based on mesofossil and macrofossil evidence from Cretaceous deposits. Am J Bot 91:1666-1682

Cronquist A (1981) An Integrated system of classification of flowering plants. Columbia University Press, New York

Cronquist A (1988) The evolution and classification of flowering plants. New York Botanical Garden, New York

Cuenoud P, Savolainen V, Powell MP, Grayer PJ, Chase MW (2002) Molecular phylogenies of the Caryophyllales based on combined analyses of 18S rDNA and rbcL, atpB and matK sequences. Am J Bot 89:132-144

Czaja AT (1978) Structure of starch grains and the classification of families. Taxon 27:463-470

Dahlgren RMT, Clifford HT, Yeo PF (1985) The families of the monocotyledons: structure, evolution, and taxonomy. Springer-Verlag, Berlin

Darwin C (1859) The Origin of Species by Means of Natural Selection. Murray, London

Davies TJ, Barraclough TG, Chase MW, Soltis PS, Soltis DE, Savolainen V (2004) Darwin's abominable mystery: Insights from a supertree of the angiosperms. Proc Natl Acad Sci U S A 101:1904-9

Davis JI, Simmons MP, Stevenson DW, Wendel JF (1998) Data decisiveness, data quality, and incongruence in phylogenetic analysis: an example from the monocotyledons using mitochondrial atpA sequences. Syst Biol 47:282-310

Delsuc F, Brinkmann H, Philippe H (2005) Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet 6:361-75

Denton AL, McConaughy BL, Hall BD (1998) Usefulness of RNA polymerase II coding sequences for estimation of green plant phylogeny. Mol Biol Evol 15:1082-5

Dilcher DL (1989) The occurrence of fruits with affinities to Ceratophyllaceae. Am J Bot 76:162-162

Dolphin K, Belshaw R, Orme CD, Quicke DL (2000) Noise and incongruence: interpreting results of the incongruence length difference test. Mol Phylogenet Evol 17:401-6

Donoghue MJ, Doyle JA (1989) Phylogenetic analysis of angiosperms and the relationships of Hamamelidae. In: Crane PR, Blackmore S (eds) Evolution, systematics, and fossil history of the Hamamelidae. Clarendon Press, Oxford, p 17-45

Donoghue MJ, Doyle JA (2000) Seed plant phylogeny: Demise of the anthophyte hypothesis? Curr Biol 10:R106-9

Doyle JA (1996) Seed Plant Phylogeny and the Relationships of Gnetales. Int J Plant Sci 157:S3-S39

Doyle JA, Donoghue MJ (1986) Seed Plant Phylogeny and the Origin of Angiosperms: An Experimental Cladistic Approach. Bot Rev 52:321-431

Doyle JA, Donoghue MJ (1992) Fossils and seed plant phylogeny reanalyzed. Brittonia 44:89-106

Doyle JA, Donoghue MJ, Zimmer EA (1994) Integration of morphological and ribosomal RNA data on the origin of angiosperms. Ann Mo Bot Gard:419-450

Doyle JA, Endress PK (2000) Morphological phylogenetic analysis of basal angiosperms: comparison and combination with molecular data. Int J Plant Sci:S121-S153

Doyle JA, Hotton L (1991) Diversification of early angiosperm pollen in a cladistic context. In: Blackmore S, Barnes SH (eds) Pollen and spores: patterns of diversification. Clarendon Press, Oxford, p 165-195

Driskell AC, Ane C, Burleigh JG, McMahon MM, O'Meara BC, Sanderson MJ (2004) Prospects for building the tree of life from large sequence databases. Science 306:1172-4

Drouin G, Daoud H, Xia J (2008) Relative rates of synonymous substitutions in the mitochondrial, chloroplast and nuclear genomes of seed plants. Mol Phylogenet Evol 49:827-831

Duvall MR, Clegg MT, Chase MW, Clark WD, Kress WJ, Hills HG, Eguiarte LE, Smith JF, Gaut BS, Zimmer EA, Learn Jr GH (1993) Phylogenetic hypotheses for the monocotyledons constructed from rbcL sequences. Ann Mo Bot Gard 80:607-619

Ebright RH (2000) RNA polymerase: structural similarities between bacterial RNA polymerase and eukaryotic RNA polymerase II. J Mol Biol 304:687-98

Edwards AWF (1972) Likelihood. Cambridge University Press, Cambridge

Efron B, Gong G (1983) A Leisurely Look at the Bootstrap, the Jackknife, and Cross-validation. Amer Statist 37:36-48

Efron B, Halloran E, Holmes S (1996) Bootstrap confidence levels for phylogenetic trees. Proc Natl Acad Sci USA 93:7085-90

Endress PK (1986) Reproductive structures and phylogenetic significance of extant primitive angiosperms. Plant Syst Evol 152:1-28

Endress PK (1993) Evolutionary aspects of the floral structure in Ceratophyllum. Plant Syst Evol 8:175-183

Endress PK (1994) Evolutionary aspects of the floral structure in Ceratophyllum. Plant Syst Evol 8:175-183

Farris JS (1969) A successive approximations approach to character weighting. Syst Zool 18:374-385

Farris JS (1970) Methods for Computing Wagner Trees. Syst Zool 19:83-92

Farris JS (1989) The retention index and the rescaled consistency index. Cladistics 5:417-419

Farris JS (1989a) The Retention Index and Homoplasy Excess. Syst Zool 38:406-407

Felsenstein J (1978) Cases in which parsimony or compatibility methods will be positively misleading. Syst Zool 27:401-410

Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368-76

Felsenstein J (1985) Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783-791

Felsenstein J, Kishino H (1993) Is There Something Wrong with the Bootstrap on Phylogenies? A Reply to Hillis and Bull. Syst Biol 42:193-200

Friedlander TP, Regier JC, Mitter C (1994) Phylogenetic Information Content of Five Nuclear Gene Sequences in Animals: Initial Assessment of Character Sets from Concordance and Divergence Studies. Syst Biol 43:511-525

Friis EM, Pedersen KR, Crane PR (2004) Araceae from the Early Cretaceous of Portugal: evidence on the emergence of monocotyledons. Proc Natl Acad Sci U S A 101:16565-70

Frohlich MW (2003) An evolutionary scenario for the origin of flowers. Nat Rev Genet 4:559-66

Gadagkar SR, Kumar S (2005) Maximum likelihood outperforms maximum parsimony even when evolutionary rates are heterotachous. Mol Biol Evol 22:2139-41

Gatesy J, Baker RH, Hayashi C (2004) Inconsistencies in arguments for the supertree approach: supermatrices versus supertrees of Crocodylia. Syst Biol 53:342-55

Gaut BS, Lewis PO (1995) Success of maximum likelihood phylogeny inference in the four-taxon case. Mol Biol Evol 12:152-62

Gee H (2003) Evolution: ending incongruence. Nature 425:782

Gerrienne P, Meyer-Berthaud B, Fairon-Demaret M, Streel M, Steemans P (2004) Runcaria, a Middle Devonian seed plant precursor. Science 306:856-8

Givnish TJ, Sytsma KJ (1997) Consistency, characters, and the likelihood of correct phylogenetic inference. Mol Phylogenet Evol 7:320-30

Goldman N, Anderson JP, Rodrigo AG (2000) Likelihood-based tests of topologies in phylogenetics. Syst Biol 49:652-70

Goldman N, Whelan S (2002) A novel use of equilibrium frequencies in models of sequence evolution. Mol Biol Evol 19:1821-31

Goremykin VV, Hellwig FH (2006) A new test of phylogenetic model fitness addresses the issue of the basal angiosperm phylogeny. Gene 381:81-91

Goremykin VV, Hirsch-Ernst KI, Wolfl S, Hellwig FH (2003) Analysis of the Amborella trichopoda chloroplast genome sequence suggests that Amborella is not a basal angiosperm. Mol Biol Evol 20:1499-505

Goremykin VV, Hirsch-Ernst KI, Wolfl S, Hellwig FH (2004) The chloroplast genome of Nymphaea alba: whole-genome analyses and the problem of identifying the most basal angiosperm. Mol Biol Evol 21:1445-54

Goremykin VV, Holland B, Hirsch-Ernst KI, Hellwig FH (2005) Analysis of Acorus calamus chloroplast genome and its phylogenetic implications. Mol Biol Evol 22:1813-22

Goremykin VV, Viola R, Hellwig FH (2009) Removal of noisy characters from chloroplast genome-scale data suggests revision of phylogenetic placements of Amborella and Ceratophyllum. J Mol Evol 68:197-204

Graham SW, Olmstead RG (2000) Utility of 17 chloroplast genes for inferring the phylogeny of the basal angiosperms. Am J Bot 87:1712-1730

Graham SW, Olmstead RG (2000a) Evolutionary significance of an unusual chloroplast DNA inversion found in two basal angiosperm lineages. Curr Genet 37:183-8

Graur D, Li WH (2000) Fundamentals of molecular evolution. Sinauer Associates, Sunderland, MA

Graybeal A (1994) Evaluating the Phylogenetic Utility of Genes: A Search for Genes Informative About Deep Divergences Among Vertebrates. Syst Biol 43:174-193

Graybeal A (1998) Is it better to add taxa or characters to a difficult phylogenetic problem? Syst Biol 47:9-17

Grayum MH (1987) A Summary of Evidence and Arguments Supporting the Removal of Acorus from the Araceae. Taxon 36:723-729

Gu X, Fu YX, Li WH (1995) Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites. Mol Biol Evol 12:546-57

Guindon S, Gascuel O (2002) Efficient biased estimation of evolutionary distances when substitution rates vary across sites. Mol Biol Evol 19:534-43

Guindon S, Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52:696-704

Hajibabaei M (2003) Molecular evolution of the RNA polymerase genes and the phylogeny of seed plants. Biology, University of Ottawa, Ottawa

Hajibabaei M, Xia J, Drouin G (2006) Seed plant phylogeny: gnetophytes are derived conifers and a sister group to Pinaceae. Mol Phylogenet Evol 40:208-17

Hall BG (2005) Comparison of the accuracies of several phylogenetic methods using protein and DNA sequences. Mol Biol Evol 22:792-802

Hall TA (1999) BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucl Acids Symp

Hamby KR, Zimmer EA (1992) Ribosomal RNA as a phylogenetic tool in plant systematics. In: Soltis PS, Soltis DE, Doyle JJ (eds) Molecular Systematics of Plants. Chapman and Hall, New York

Hasegawa M, Kishino H, Yano T (1985) Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 22:160-74

Hendy MD, Penny D (1989) A framework for the quantitative study of evolutionary trees. Syst Zool 38:297-309

Herendeen PS, Les DH, Dilcher DL (1990) Fossil Ceratophyllum (Ceratophyllaceae) from the Tertiary of North America. Am J Bot 77:7-16

Hillis DM (1987) Molecular versus morphological approaches to systematics. Ann Rev Ecol Syst 18:23-42

Hillis DM (1996) Inferring complex phylogenies. Nature 383:130-1

Hillis DM (1998) Taxonomic sampling, phylogenetic accuracy, and investigator bias. Syst Biol 47:3-8

Hillis DM, Bull JJ (1993) An Empirical Test of Bootstrapping as a Method for Assessing Confidence in Phylogenetic Analysis. Syst Biol 42:182-192

Hillis DM, Dixon MT (1991) Ribosomal DNA: molecular evolution and phylogenetic inference. Q Rev Biol 66:411-53

Hilu KW, Borsch T, Muller KF, Soltis DE, Soltis PS, Savolainen V, Chase MW, Powell MP, Alice LA, Evans R, Sauquet H, Neinhuis C, Slotta TAB, Rohwer JG, Campbell CS, Chatrou LW (2003) Angiosperm phylogeny based on matK sequence information. Am J Bot 90:1758-1776

Holland BR, Huber KT, Moulton V, Lockhart PJ (2004) Using consensus networks to visualize contradictory evidence for species phylogeny. Mol Biol Evol 21:1459-61

Holmquist R (1983) Transitions and transversions in evolutionary descent: an approach to understanding. J Mol Evol 19:134-44

Hoot SB, Crane PR (1995) Interfamilial relationships in the Ranunculidae based on molecular systematics. Plant Syst Evol 9:119-131

Hoot SB, Culham A, Crane PR (1995a) The Utility of atpB Gene Sequences in Resolving Phylogenetic Relationships: Comparison with rbcL and 18S Ribosomal DNA Sequences in the Lardizabalaceae. Ann Mo Bot Gard 82:194-207

Hoot SB, Kadereit JW, Blattner FR, Jork KB, Schwarzbach AE, Crane PR (1997) Data congruence and phylogeny of the Papaveraceae s.l. based on four data sets: atpB, rbcL sequences, trnK restriction sites, and morphological characters. Syst Bot 22:575-590

Hoot SB, Magallon S, Crane PR (1999) Phylogeny of basal eudicots based on three molecular data sets: atpB, rbcL, and 18S nuclear ribosomal DNA sequences. Ann Mo Bot Gard 86:1-32

Huelsenbeck JP, Bull JJ, Cunningham CW (1996) Combining data in phylogenetic analysis. Trends Ecol Evol:152-158

Huelsenbeck JP, Hillis DM, Nielsen R (1996a) A likelihood-ratio test of monophyly. Syst Biol 45:546-558

Iwamoto A, Shimizu A, Ohba H (2003) Floral development and phyllotactic variation in Ceratophyllum demersum (Ceratophyllaceae). Am J Bot 90:1124-1130

Jansen RK, Cai Z, Raubeson LA, Daniell H, Depamphilis CW, Leebens-Mack J, Muller KF, Guisinger-Bellian M, Haberle RC, Hansen AK, Chumley TW, Lee SB, Peery R, McNeal JR, Kuehl JV, Boore JL (2007) Analysis of 81 genes from 64 plastid genomes resolves relationships in angiosperms and identifies genome-scale evolutionary patterns. Proc Natl Acad Sci U S A 104:19369-74

Jeffroy O, Brinkmann H, Delsuc F, Philippe H (2006) Phylogenomics: the beginning of incongruence? Trends Genet 22:225-31

Jones DT, Taylor WR, Thornton JM (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8:275-82

Judd WS, Campbell CS, Kellogg EA, Stevens PF (2008) Plant Systematics: a Phylogenetic approach. Sinauer Associates, Inc., Sunderland, MA

Judd WS, Olmstead RG (2004) A survey of tricolpate (Eudicot) phylogenetic relationships. Am J Bot 91:1627-1644

Jussieu A-L (1789) Genera plantarum. Herissant and Barrois, Paris

Kallersjo M, Albert VA, Farris JS (1999) Homoplasy increases phylogenetic structure. Cladistics 15:91-93

Kallersjo M, Farris JS, Chase MW, Bremer B, Fay MF, Humphries CJ, Petersen G, Seberg O, Bremer K (1998) Simultaneous parsimony jackknife analysis of 2538 rbcL DNA sequences reveals support for major clades of green plants, land plants, seed plants and flowering plants. Plant Syst Evol 213:259-287

Kearney M (2002) Fragmentary taxa, missing data, and ambiguity: mistaken assumptions and conclusions. Syst Biol 51:369-81

Kellogg EA (2001) Evolutionary history of the grasses. Plant Physiol 125:1198-205

Kelly C, Rice J (1996) Modeling nucleotide evolution: a heterogeneous rate analysis. Math Biosci 133:85-109

Kim J (1996) General Inconsistency Conditions for Maximum Parsimony: Effects of Branch Lengths and Increasing Numbers of Taxa. Syst Biol 45:363-374

Kim S, Soltis DE, Soltis PS, Zanis MJ, Suh Y (2004) Phylogenetic relationships among early-diverging eudicots based on four genes: were the eudicots ancestrally woody? Mol Phylogenet Evol 31:16-30

Kishino H, Miyata T, Hasegawa M (1990) Maximum likelihood inference of protein phylogeny and the origin of chloroplasts. J Mol Evol 31:151-160

Klassen GJ, Mooi RD, Locke A (1991) Consistency Indices and Random Data. Syst Zool 40:446-457

Kluge AG (1989) A concern for evidence and a phylogenetic hypothesis of relationships among Epicrates (Boidae, Serpentes). Syst Zool 38:7-25

Kluge AG, Farris JS (1969) Quantitative phyletics and the evolution of anurans. Syst Zool 18:1-32

Kuhner MK, Felsenstein J (1994) A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol Biol Evol 11:459-68

Kuzoff RK, Gasser CS (2000) Recent progress in reconstructing angiosperm phylogeny. Trends Plant Sci 5:330-6

Kuzoff RK, Sweere JA, Soltis DE, Soltis PS, Zimmer EA (1998) The phylogenetic potential of entire 26S rDNA sequences in plants. Mol Biol Evol 15:251-63

Lanave C, Preparata G, Saccone C, Serio G (1984) A new method for calculating evolutionary substitution rates. J Mol Evol 20:86-93

Lankester E (1870) On the use of the term homology in modern zoology. Ann Mag Nat Hist 6:34-43

Lee MSY (2001) Unalignable sequences and molecular evolution. Trends Ecol Evol 16:681

Les DH (1986) The Evolution of Achene Morphology in Ceratophyllum (Ceratophyllaceae), I. Fruit-Spine Variation and Relationships of C. demersum, C. submersum, and C. apiculatum. Syst Bot 11:549-558

Les DH (1988) The Origin and Affinities of the Ceratophyllaceae. Taxon 37:326-345

Les DH (1988a) The Evolution of Achene Morphology in Ceratophyllum (Ceratophyllaceae), II. Fruit Variation and Systematics of the "Spiny-Margined" Group. Syst Bot 13:73-86

Les DH (1988b) The Evolution of Achene Morphology in Ceratophyllum (Ceratophyllaceae), III. Relationships of the "Facially-spined" Group. Syst Bot 13:509-518

Les DH (1989) The Evolution of Achene Morphology in Ceratophyllum (Ceratophyllaceae), IV. Summary of Proposed Relationships and Evolutionary Trends. Syst Bot 14:254-262

Les DH, Garvin DK, Wimpee CF (1991) Molecular evolutionary history of ancient aquatic angiosperms. Proc Natl Acad Sci U S A 88:10119-23

Les DH, Schneider EL, Padgett DJ, Soltis DE, Soltis PS, Zanis M (1999) Phylogeny, Classification and Floral Evolution of Water Lilies (Nymphaeaceae; Nymphaeales): A Synthesis of Non-molecular, rbcL, matK, and 18S rDNA Data. Syst Bot 24:28-46

Lewis LA, Mishler BD, Vilgalys R (1997) Phylogenetic relationships of the liverworts (Hepaticae), a basal lineage, inferred from nucleotide sequence data of the chloroplast gene rbcL. Mol Phylogenet Evol 7:377-93

Lockhart PJ, Larkum AW, Steel M, Waddell PJ, Penny D (1996) Evolution of chlorophyll and bacteriochlorophyll: the problem of invariant sites in sequence analysis. Proc Natl Acad Sci U S A 93:1930-4

Lockhart PJ, Penny D (2005) The place of Amborella within the radiation of angiosperms. Trends Plant Sci 10:201-2

Loconte H, Stevenson DM (1991) Cladistics of the Magnoliidae. Cladistics 7:267-296

Mabberley DJ (1993) The plant book: A portable dictionary of the vascular plants. Cambridge University Press, Cambridge

Magallon S, Crane PR, Herendeen P (1999) Phylogenetic pattern, diversity, and diversification of eudicots. Ann Mo Bot Gard 86:297-372

Mathews S, Donoghue MJ (1999) The root of angiosperm phylogeny inferred from duplicate phytochrome genes. Science 286:947-50

Maynard Smith J, Smith NH (1996) Synonymous Nucleotide Divergence: What Is "Saturation"? Genetics 142:1033-1036

Meier RP, Kores PJ, Darwin S (1991) Homoplasy slope ratio: A better measurement of observed homoplasy in cladistic analyses. Syst Zool 40:74-88

Mishler BD (1994) Cladistic analysis of molecular and morphological data. Am J Phys Anthropol 94:143-56

Mohr G, Perlman PS, Lambowitz AM (1993) Evolutionary relationships among group II intron-encoded proteins and identification of a conserved domain that may be related to maturase function. Nucleic Acids Res 21:4991-7

Moore MJ, Bell CD, Soltis PS, Soltis DE (2007) Using plastid genome-scale data to resolve enigmatic relationships among basal angiosperms. Proc Natl Acad Sci U S A 104:19363-8

Neuhaus H, Link G (1987) The chloroplast tRNALys(UUU) gene from mustard (Sinapis alba) contains a class II intron potentially coding for a maturase-related polypeptide. Curr Genet 11:251-7

Nickerson J, Drouin G (2003) The sequence of the largest subunit of RNA polymerase II is a useful marker for inferring seed plant phylogeny. Mol Phylogenet Evol 31:403-15

Nickerson J, Drouin G (2004) The sequence of the largest subunit of RNA polymerase II is a useful marker for inferring seed plant phylogeny. Mol Phylogenet Evol 31:403-15

Nickrent DL, Soltis DE (1995) A comparison of angiosperm phylogenies from nuclear 18S rDNA and rbcL sequences. Ann Mo Bot Gard 82:208-234

Nishihara H, Okada N, Hasegawa M (2007) Rooting the eutherian tree: the power and pitfalls of phylogenomics. Genome Biol 8:R199

Nixon KC, Crepet WL, Stevenson DW, Friis EM (1994) A reevaluation of seed plant phylogeny. Ann Mo Bot Gard 81:484-533

Novacek MJ (1992) Fossils, Topologies, Missing Data, and the Higher Level Phylogeny of Eutherian Mammals. Syst Biol 41:58-73

Oganezova EP, Nalbandian RM (1976) Purification and properties of plastocyanin and ferredoxin from Ceratophyllum demersum L. Biokhimiia 41:794-800

Ogden TH, Rosenberg MS (2006) Multiple Sequence Alignment Accuracy and Phylogenetic Inference. Syst Biol 55:314-328

Olmstead RG, Bremer B, Scott KM, Palmer JD (1993) A parsimony analysis of the Asteridae sensu lato based on rbcL sequences. Ann Mo Bot Gard 80:700-722

Olmstead RG, Kim KJ, Jansen RK, Wagstaff SJ (2000) The phylogeny of the Asteridae sensu lato based on chloroplast ndhF gene sequences. Mol Phylogenet Evol 16:96-112

Olmstead RG, Michaels HJ, Scott KM, Palmer JD (1992) Monophyly of the Asteridae and identification of their major lineages inferred from DNA sequences of rbcL. Ann Mo Bot Gard 79:249-265

Olmstead RG, Reeves PA, Yen AC (1998) Patterns of sequence evolution and implications for parsimony analysis of chloroplast DNA. In: Soltis PS, Soltis DE, Doyle JJ (eds) Molecular systematics of plants: DNA sequencing. Chapman and Hall, New York

Olmstead RG, Sweere JA (1994) Combining Data in Phylogenetic Systematics: An Empirical Approach Using Three Molecular Data Sets in the Solanaceae. Syst Biol 43:467-481

Oxelman B, Bremer B (2000) Discovery of paralogous nuclear gene sequences coding for the second-largest subunit of RNA polymerase II (RPB2) and their phylogenetic utility in Gentianales of the asterids. Mol Biol Evol 17:1131-45

Oxelman B, Yoshikawa N, McConaughy BL, Luo J, Denton AL, Hall BD (2004) RPB2 gene phylogeny in flowering plants, with particular emphasis on asterids. Mol Phylogenet Evol 32:462-79

Page RDM, Holmes EC (2007) Molecular Evolution: A phylogenetic approach. Blackwell Sciences Ltd, Malden, MA

Parkinson CL, Adams KL, Palmer JD (1999) Multigene analyses identify the three earliest lineages of extant flowering plants. Curr Biol 9:1485-8

Pfeil BE, Brubaker CL, Craven LA, Crisp MD (2004) Paralogy and orthology in the MALVACEAE rpb2 gene family: investigation of gene duplication in hibiscus. Mol Biol Evol 21:1428-37

Philippe H, Lopez P (2001) On the conservation of protein sequences in evolution. Trends Biochem Sci 26:414-6

Philippe H, Snell EA, Bapteste E, Lopez P, Holland PW, Casane D (2004) Phylogenomics of eukaryotes: impact of missing data on large alignments. Mol Biol Evol 21:1740-52

Poe S, Swofford DL (1999) Taxon sampling revisited. Nature 398:299-300

Popp M, Oxelman B (2001) Inferring the history of the polyploid Silene aegaea (Caryophyllaceae) using plastid and homoeologous nuclear DNA sequences. Mol Phylogenet Evol 20:474-81

Popp M, Oxelman B (2004) Evolution of a RNA polymerase gene family in Silene (Caryophyllaceae): incomplete concerted evolution and topological congruence among paralogues. Syst Biol 53:914-32

Posada D, Crandall KA (1998) MODELTEST: testing the model of DNA substitution. Bioinformatics 14:817-8

Pryer KM, Schuettpelz E, Wolf PG, Schneider H, Smith AR, Cranfill R (2004) Phylogeny and evolution of ferns (Monilophytes) with a focus on the early Leptosporangiate divergences. Am J Bot 91:1582-1598

Qiu YL, Chase MW, Les DH, Parks CR (1993) Molecular phylogenetics of the Magnoliidae: cladistic analyses of nucleotide sequences of the plastid gene rbcL. Ann Mo Bot Gard 80:587-606

Qiu YL, Lee J, Bernasconi-Quadroni F, Soltis DE, Soltis PS, Zanis M, Zimmer EA, Chen Z, Savolainen V, Chase MW (1999) The earliest angiosperms: evidence from mitochondrial, plastid and nuclear genomes. Nature 402:404-7

Qiu YL, Lee J, Bernasconi-Quadroni F, Soltis PS, Soltis DE, Zanis M, Zimmer EA, Chen Z, Savolainen V, Chase MW (2000) Phylogeny of basal angiosperms: analyses of five genes from three genomes. Int J Plant Sci 161:S3-S27

Rannala B, Huelsenbeck JP, Yang Z, Nielsen R (1998) Taxon sampling and the accuracy of large phylogenies. Syst Biol 47:702-10

Ren F, Tanaka H, Yang Z (2005) An empirical examination of the utility of codon-substitution models in phylogeny reconstruction. Syst Biol 54:808-18

Rodriguez F, Oliver JL, Marin A, Medina JR (1990) The general stochastic model of nucleotide substitution. J Theor Biol 142:485-501

Rodriguez-Ezpeleta N, Brinkmann H, Roure B, Lartillot N, Lang BF, Philippe H (2007) Detecting and overcoming systematic errors in genome-scale phylogenies. Syst Biol 56:389-99

Rokas A, Carroll SB (2005) More genes or more taxa? The relative contribution of gene number and taxon number to phylogenetic accuracy. Mol Biol Evol 22:1337-44

Rokas A, Williams BL, King N, Carroll SB (2003) Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425:798-804

Ronquist F, Huelsenbeck JP (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572

Rosenberg MS (2005) Evolutionary distance estimation and fidelity of pair wise sequence alignment. BMC Bioinformatics 6:102

Rosenberg MS, Kumar S (2003) Taxon sampling, bioinformatics, and phylogenomics. Syst Biol 52:119-24

Rydin C, Kallersjo M, Friis EM (2002) Seed plant relationships and the systematic position of Gnetales based on nuclear and chloroplast DNA: conflicting data, rooting problems, and the monophyly of conifers. Int J Plant Sci 163:197-214

Rzhetsky A, Sitnikova T (2002) When is it Safe to Use an Oversimplified Substitution Model in Tree-Making? Mol Biol Evol 13:1255-1265

Saarela JM, Rai HS, Doyle JA, Endress PK, Mathews S, Marchant AD, Briggs BG, Graham SW (2007) Hydatellaceae identified as a new branch near the base of the angiosperm phylogenetic tree. Nature 446:312-5

Salamin N, Hodkinson TR, Savolainen V (2002) Building supertrees: an empirical assessment using the grass family (Poaceae). Syst Biol 51:136-50

Sanderson MJ, Donoghue MJ (1989) Patterns of Variation in Levels of Homoplasy. Evolution 43:1781-1795

Sanderson MJ, Donoghue MJ (1994) Shifts in Diversification Rate with the Origin of Angiosperms. Science 264:1590-1593

Sanderson MJ, Driskell AC (2003) The challenge of constructing large phylogenetic trees. Trends Plant Sci 8:374-9

Sanderson MJ, Purvis A, Henze C (1998) Phylogenetic supertrees: assembling the trees of life. Trends Ecol Evol 13:105-109

Sanderson MJ, Shaffer HB (2002) Troubleshooting Molecular Phylogenetic Analyses. Ann Rev Ecol Syst 33:49-72

Sauquet H, Doyle JA, Scharaschkin T, Borsch T, Hilu K, Chatrou LW, Le Thomas A (2003) Phylogenetic analyses of Magnoliales and Myristicaceae based on multiple data sets: implications for character evolution. Bot J Linn Soc 142:125-186

Savolainen V, Chase MW (2003) A decade of progress in plant molecular phylogenetics. Trends Genet 19:717-24

Savolainen V, Chase MW, Hoot SB, Morton CM, Soltis DE, Bayer C, Fay MF, de Bruijn AY, Sullivan S, Qiu YL (2000) Phylogenetics of flowering plants based on combined analysis of plastid atpB and rbcL gene sequences. Syst Biol 49:306-62

Schleiden MJ (1837) Beitrage zur Kenntniss der Ceratophylleen. Linnaea:513-542

Selvaraj D, Sarma RK, Sathishkumar R (2008) Phylogenetic analysis of chloroplast matK gene from Zingiberaceae for plant DNA barcoding. Bioinformation 3:24-7

Seo TK, Kishino H (2008) Synonymous Substitutions Substantially Improve Evolutionary Inference from Highly Diverged Proteins. Syst Biol 57:367-377

Shimodaira H, Hasegawa M (1999) Multiple Comparisons of Log-Likelihoods with Applications to Phylogenetic Inference. Mol Biol Evol 16:1114-1116

Siddall ME (1998) Success of parsimony in the four-taxon case: Long-branch repulsion by likelihood in the Farris zone. Cladistics:209-210

Siddall ME, Whiting MF (1999) Long-branch abstractions. Cladistics 15:9-24

Soltis DE, Albert VA, Savolainen V, Hilu K, Qiu YL, Chase MW, Farris JS, Stefanovic S, Rice DW, Palmer JD, Soltis PS (2004a) Genome-scale data, angiosperm relationships, and "ending incongruence": a cautionary tale in phylogenetics. Trends Plant Sci 9:477-83

Soltis DE, Senters AE, Kim S, Thompson JD, Soltis PS, Zanis MJ, de Craene LS, Endress PK, Farris JS (2003) Gunnerales are sister to other core eudicots and exhibit floral features of early-diverging eudicots. Am J Bot 90:461-470

Soltis DE, Soltis PS (1997a) Phylogenetic Relationships in Saxifragaceae Sensu Lato: A Comparison of Topologies Based on 18S rDNA and rbcL Sequences. Am J Bot 84:504-522

Soltis DE, Soltis PS (2004b) Amborella not a basal angiosperm? Not so fast. Am J Bot 91:997-1001

Soltis DE, Soltis PS, Endress PK, Chase MW (2005a) Phylogeny and Evolution of Angiosperms. Sinauer Associates, Inc., Sunderland, Massachusetts

Soltis DE, Soltis PS, Mort ME, Chase MW, Savolainen V, Hoot SB, Morton CM (1998) Inferring complex phylogenies using parsimony: an empirical approach using three large DNA data sets for angiosperms. Syst Biol 47:32-42

Soltis DE, Soltis PS, Nickrent DL, Johnson LA, Hahn WJ, Hoot SB, Sweere JA, Kuzoff RK, Kron KA, Chase MW, Swensen SM, Zimmer EA, Chaw SM, Gillespie LJ, Kress WJ, Sytsma KJ (1997) Angiosperm phylogeny inferred from 18S ribosomal DNA sequences. Ann Mo Bot Gard 84:1-49

Soltis PS (2005) Ancient and recent polyploidy in angiosperms. New Phytol 166:5-8

Soltis PS, Soltis DE (2003a) Applying the Bootstrap in Phylogeny Reconstruction. Stat Sci 18:256-267

Soltis PS, Soltis DE (2004) The origin and diversification of angiosperms. Am J Bot 91:1614-1626

Soltis PS, Soltis DE, Chase MW (1999a) Angiosperm phylogeny inferred from multiple genes as a tool for comparative biology. Nature 402:402-4

Soltis PS, Soltis DE, Wolf PG, Nickrent DL, Chaw SM, Chapman RL (1999) The phylogeny of land plants inferred from 18S rDNA sequences: pushing the limits of rDNA signal? Mol Biol Evol 16:1774-84

Soltis PS, Soltis DE, Zanis MJ, Kim S (2000) Basal lineages of angiosperms: Relationships and implications for floral evolution. Int J Plant Sci 161:S97-S107

Steel M (2005) Should phylogenetic models be trying to "fit an elephant"? Trends Genet 21:307-9

Stefanovic S, Rice DW, Palmer JD (2004) Long branch attraction, taxon sampling, and the earliest angiosperms: Amborella or monocots? BMC Evol Biol 4:35

Stiller JW, Hall BD (1997) The origin of red algae: implications for plastid evolution. Proc Natl Acad Sci U S A 94:4520-5

Strimmer K, Rambaut A (2001) Inferring confidence sets of possibly misspecified gene trees. Proc Biol Sci 269:137-42

Strimmer K, von Haeseler A (1996) Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol Biol Evol 13:964-969

Sullivan J (1996) Combining Data with Different Distributions of Among-Site Rate Variation. Syst Biol 45:375-380

Sullivan J, Holsinger KE, Simon C (1996a) The effect of topology on estimates of among-site rate variation. J Mol Evol 42:308-12

Swofford DL (2002) PAUP*: Phylogenetic Analysis Using Parsimony (*and other Methods). Sinauer Associates, Sunderland

Swofford DL, Olsen GJ, Waddell PJ, Hillis DM (1996) Phylogenetic inference. In: Molecular systematics. Sinauer, Sunderland, Massachusetts, p 407-514

Swofford DL, Waddell PJ, Huelsenbeck JP, Foster PG, Lewis PO, Rogers JS (2001) Bias in phylogenetic estimation and its relevance to the choice between parsimony and likelihood methods. Syst Biol 50:525-39

Takezaki N, Gojobori T (1999) Correct and incorrect vertebrate phylogenies obtained by the entire mitochondrial DNA sequences. Mol Biol Evol 16:590-601

Tamura K, Dudley J, Nei M, Kumar S (2007) MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol 24:1596-9

Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG (1997) The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res 25:4876-82

Thorne RF (1992) Classification and geography of the flowering plants. Bot Rev 58:225-348

Tomlinson PB (1995) Non-homology of vascular organisation in monocotyledons and dicotyledons. In: Rudall PJ, Cribb PJ, Cutler PJ, Humphries CJ (eds) Monocotyledons: systematics and evolution. Royal Botanic Gardens, London, p 589-622

Wenzel JW, Siddall ME (1999) Noise. Cladistics 15:51-64

Wiens JJ (1995) Combining Data Sets with Different Numbers of Taxa for Phylogenetic Analysis. Syst Biol 44:548-558

Wiens JJ (1998) Does adding characters with missing data increase or decrease phylogenetic accuracy? Syst Biol 47:625-40

Wiens JJ (2003) Missing data, incomplete taxa, and phylogenetic accuracy. Syst Biol 52:528-38

Wikstrom N, Savolainen V, Chase MW (2001) Evolution of the angiosperms: calibrating the family tree. Proc Biol Sci 268:2211-20

Wilkinson M (1995) Coping with Abundant Missing Entries in Phylogenetic Inference Using Parsimony. Syst Biol 44:501-514

Woese CR (1987) Bacterial evolution. Microbiol Rev 51:221-71

Wolfe KH, Li WH, Sharp PM (1987) Rates of nucleotide substitution vary greatly among plant mitochondrial, chloroplast, and nuclear DNAs. Proc Natl Acad Sci U S A 84:9054-8

Wolthers J, Erdmann VA (1986) Cladistic analysis of 5S rRNA and 16S rRNA secondary and primary structure - The evolution of eukaryotes and their relation to Archaebacteria. J Mol Evol 24:152-166

Xia J (2003) The largest subunit of RNA polymerase II (rpb1) as a phylogenetic marker of seed plant species. Biology. Ottawa U, Ottawa, p 99

Xia X, Xie Z, Kjer KM (2003) 18S ribosomal RNA and tetrapod phylogeny. Syst Biol 52:283-95

Yang Z (1993) Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol Biol Evol 10:1396-401

Yang Z (1994) Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol 39:306-14

Yang Z (1994b) Estimating the pattern of nucleotide substitution. J Mol Evol 39:105-11

Yang Z (1996) Maximum-Likelihood Models for Combined Analyses of Multiple Sequence Data. J Mol Evol 42:587-596

Yang Z (1996a) Among-site variation and its impact on phylogenetic analyses. Trends Ecol Evol 11:367

Yang Z (1996b) Phylogenetic analysis using parsimony and likelihood methods. J Mol Evol 42:294-307

Yang Z (1998) On the Best Evolutionary Rate for Phylogenetic Analysis. Syst Biol 47:125-133

Yang Z, Goldman N, Friday A (1994a) Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol Biol Evol 11:316-24

Yoder AD, Vilgalys R, Ruvolo M (1996) Molecular evolutionary dynamics of cytochrome b in strepsirrhine primates: the phylogenetic significance of third-position transversions. Mol Biol Evol 13:1339-50

Zanis M, Soltis PS, Qiu YL, Zimmer EA, Soltis DE (2003) Phylogenetic analyses and perianth evolution in basal angiosperms. Ann Mo Bot Gard 90:129-129

Zanis MJ, Soltis DE, Soltis PS, Mathews S, Donoghue MJ (2002) The root of the angiosperms revisited. Proc Natl Acad Sci U S A 99:6848-53

Zharkikh A, Li WH (1992) Statistical properties of bootstrap estimation of phylogenetic variability from nucleotide sequences: II. Four taxa without a molecular clock. J Mol Evol 35:356-66

Zimmer EA, Martin SL, Beverley SM, Kan YW, Wilson AC (1980) Rapid duplication and loss of genes coding for the alpha chains of hemoglobin. Proc Natl Acad Sci U S A 77:2158-62

Zurawski G, Bottomley W, Whitfeld PR (1982) Structures of the genes for the beta and epsilon subunits of spinach chloroplast ATPase indicate a dicistronic mRNA and an overlapping translation stop/start signal. Proc Natl Acad Sci U S A 79:6260-6264

Zwickl DJ, Hillis DM (2002) Increased taxon sampling greatly reduces phylogenetic error. Syst Biol 51:588-98

APPENDIX

Appendix A. Nucleotide alignment of all fifteen concatenated sequences used in this study. Nucleotides identical to those of Oryza are represented by a dot (.), whereas a dash (-) and a question mark (?) stand for a gap and missing data, respectively. PLEASE REFER TO CD

Appendix B. Amino acid alignment of all fourteen concatenated protein-coding sequences used in this study. Amino acids identical to those of Oryza are represented by a dot (.), whereas a dash (-) and a question mark (?) stand for a gap and missing data, respectively. PLEASE REFER TO CD

Appendix C. Permissions granted for Figures 1 to 4.
