The Pennsylvania State University

The Graduate School

Eberly College of Science

A GENOMIC ANALYSIS OF THE EVOLUTION OF PROKARYOTES:

ASTROBIOLOGICAL PERSPECTIVES ON THE COLONIZATION OF

ENVIRONMENTS ON EARTH

A Thesis in

Biology and Astrobiology

by

Fabia Ursula Battistuzzi

© 2007 Fabia U. Battistuzzi

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

December 2007

The thesis of Fabia Ursula Battistuzzi was reviewed and approved* by the following:

S. Blair Hedges, Professor of Biology, Thesis Advisor, Chair of Committee

James G. Ferry, Stanley Person Professor and Director, Center for Microbial Structural Biology

Jennifer Macalady, Assistant Professor of Geosciences

Stephen Schaeffer, Associate Professor of Biology

Douglas Cavener, Professor of Biology, Head of the Department of Biology

*Signatures are on file in the Graduate School

ABSTRACT

The relationships and timescale of prokaryote evolution are unresolved. The poor fossil record of these organisms does not provide sufficient information to outline their evolutionary history. Hence, phylogenetic and molecular clock methods are fundamental to clarifying their evolution and relating it to events in early Earth history. This thesis uses genomic sequence data to reconstruct the phylogeny of prokaryotes and estimate the divergence times of major groups. Multiple methods and data sets are applied and compared to obtain a more general and robust result than previous studies, which were narrower in scope. The resulting timeline of prokaryotes was used to infer evolutionary patterns in physiology, such as the origin of methanogenesis and phototrophy, and the colonization of land. In general, a more robust phylogeny of prokaryotes is established, including previously unrecognized high-level groups. Archaebacteria and Eubacteria are found to have evolved rapidly in the mid- to late Archean (3.3−2.6 billion years ago) in relation to the colonization of new environments such as mesophilic photic zones and terrestrial habitats. A good correlation exists between molecular clock estimates and the geologic record.

TABLE OF CONTENTS

LIST OF FIGURES ...... vii

LIST OF TABLES...... viii

ACKNOWLEDGEMENTS...... ix

Chapter 1. Introduction ...... 1

1.1 Overview...... 1

1.2 Prokaryote diversity...... 2

1.3 Genome sequencing and features ...... 3

1.4 The geologic record of prokaryotes...... 4

1.5 Molecular clocks...... 5

Chapter 2. A genomic timescale of prokaryote evolution: insights into the origin of methanogenesis, phototrophy, and the colonization of land ...... 7

2.1 Abstract...... 7

2.2 Introduction...... 8

2.3 Methods...... 9

2.3.1 Data Assembly...... 9

2.3.2 Time estimation ...... 10

2.4 Results...... 12

2.4.1 Data set ...... 12

2.4.2 Phylogeny ...... 12

2.4.3 Time estimation ...... 13

2.5 Discussion...... 14

2.5.1 Origin of life ...... 14

2.5.2 Methanogenesis ...... 15

2.5.3 Anaerobic Methanotrophy...... 15

2.5.4 Aerobic Methanotrophy...... 16

2.5.5 Phototrophy...... 16

2.5.6 The colonization of land ...... 17

2.5.7 Oxygenic photosynthesis ...... 17

2.6 Conclusions...... 18

2.7 Acknowledgements...... 18

Chapter 3. Progressive colonization of environments on the early Earth by prokaryotes ...... 24

3.1 Abstract...... 24

3.2 Introduction...... 24

3.3 Methods...... 25

3.3.1 Data assembly and phylogenetic analyses...... 25

3.3.1.1 Protein data set...... 25

3.3.1.2 Ribosomal RNA data set ...... 27

3.3.2 Time estimation ...... 27

3.3.2.1 Protein data set...... 27

3.3.2.2 Ribosomal RNA data set ...... 28

3.4 Results...... 28

3.4.1 Eubacteria ...... 29

3.4.1.1 Protein data set...... 29

3.4.1.2 Ribosomal RNA data set ...... 30

3.4.3 Archaebacteria ...... 32

3.4.3.1 Protein data set...... 32

3.4.3.2 Ribosomal RNA data set ...... 33

3.5 Discussion...... 34

3.5.1 Protein vs. ribosomal RNA, maximum likelihood vs. distance...... 34

3.5.2 of prokaryotes...... 35

3.5.2.1 Eubacteria ...... 35

3.5.2.2 Archaebacteria ...... 36

3.5.3 Evolution and adaptation of prokaryotes on early Earth ...... 37

3.5.3.1 Stage 1: Early presence of prokaryotes in submarine high temperature habitats (4.5−3.5 Ga) ...... 37

3.5.3.2 Stage 2: Colonization of mesophilic photic zones (3.5−3.3 Ga) ...... 38

3.5.3.3 Stage 3: Colonization of land and specialized niches (3.3−2.7 Ga) ...39

3.6 Conclusions...... 41

3.7 Acknowledgements...... 42

Chapter 4. Concluding remarks ...... 50

References...... 51

Appendix A Supplementary information for Chapter 2 ...... 65

Appendix B Supplementary information for Chapter 3...... 89

LIST OF FIGURES

2-1 Phylogenetic tree of Eubacteria rooted with Archaebacteria ...... 20

2-2 Phylogenetic tree of Archaebacteria rooted with Eubacteria ...... 21

2-3 A timescale of prokaryote evolution ...... 22

2-4 A time line of metabolic innovations and events on Earth...... 23

3-1 Maximum likelihood phylogeny of Eubacteria, protein data set...... 43

3-2 Timetrees of eubacterial classes for the protein and rRNA data sets...... 44

3-3 Maximum likelihood phylogeny of Eubacteria, rRNA data set...... 45

3-4 Maximum likelihood phylogeny of Archaebacteria, protein data set...... 46

3-5 Maximum likelihood phylogeny of Archaebacteria, rRNA data set...... 47

3-6 Timetrees of archaebacterial classes for the protein and rRNA data set...... 48

3-7 Timescale of major events in prokaryote history...... 49

LIST OF TABLES

2-1 Time estimates for selected nodes in the tree of Eubacteria and Archaebacteria...... 19

ACKNOWLEDGEMENTS

There are many people I would like to thank for sharing these past years with me. First of all, my advisor, Dr. Hedges, who gave me the opportunity to step over the borders of my country to follow my interest in astrobiology. His guidance through the years led me to become the scientist I am today. Warm thanks also go to all the past and present members of the Hedges lab, who have made everyday life in the lab much more enjoyable. I would also like to thank my committee members, Dr. Ferry, Dr. Macalady, and Dr. Schaeffer, and also Dr. Makalowski who, although not officially on my committee, has followed my progress over the years. I had the good fortune of interacting with professors and students from various departments through the Penn State Astrobiology Research Group (PSARC), and I am grateful to each one of them for the many interesting discussions we had over the years. I also had the opportunity to attend many conferences, which would not have been possible without the financial support of PSARC and the NASA Astrobiology Institute. Special thanks go to my ‘Italian legacy’, my friends and family: I am truly amazed by their constant presence in my life regardless of the geographical distance. I am especially grateful to my family, who have been an endless source of strength and encouragement: thanks for always supporting me even when it meant uprooting your lives. Last but certainly not least, to all my friends who shared everyday life in State College, I would like to say thanks for becoming my extended family and for always reminding me that there is also a life to live outside the lab! I am looking forward to our future reunions, for I am sure that our friendships are for life.

CHAPTER 1

Introduction

1.1 Overview

A milestone paper published by Woese and Fox (1977) classified all life forms into three domains, Archaebacteria (also called Archaea), Eubacteria (also called Bacteria), and Eucarya. The names Archaebacteria and Eubacteria are preferred herein because they avoid confusion between ‘Bacteria’, used to define the taxonomic domain, and ‘bacteria’, commonly used to refer to all prokaryotes, yet they include the stem name “bacteria”, which unites them as prokaryotes. The term “prokaryote” itself has been criticized as referring to a non-monophyletic group (Pace 2006), but others have disagreed with this argument (Martin and Koonin 2006), and it remains a widely used term to describe both Archaebacteria and Eubacteria. Studies on the origin of the three domains have shown that eukaryotes are the result of a symbiotic event between an archaebacterium and one or more eubacteria (Margulis 1970; Gupta et al. 1994; Gupta and Singh 1994), although alternative hypotheses have been proposed (Philippe et al. 2000; Hartman and Fedorov 2002). The symbiosis hypothesis implies that prokaryotes preceded eukaryotes, a theory also supported by the geologic record, and that they were the only inhabitants of early Earth; thus, the early evolution of our planet is necessarily intertwined with the evolution of prokaryotes.

Because the fossil record of prokaryotes is sparse, phylogenetic studies are an invaluable resource for understanding their evolutionary history. Unfortunately, the phylogeny of these organisms remains unresolved, and no consensus has been reached on relationships at high (e.g., above class) taxonomic levels (e.g., Brochier and Philippe 2002; Daubin, Gouy, and Perrière 2002; Hedges 2002; Wolf et al. 2002; Battistuzzi, Feijão, and Hedges 2004; Ciccarelli et al. 2006; Pisani, Cotton, and McInerney 2007). Because of their metabolic adaptability, prokaryotes have acted as bridges between the abiotic and biotic worlds, recycling chemical compounds and helping to render the early Earth habitable. The evolutionary history of prokaryotes can therefore not only show their topological relationships but also help us understand the evolution of basic metabolisms and how these have interacted with the early Earth environment.

The aim of this thesis is (i) to evaluate the evolutionary history of prokaryotes through phylogenetic analyses of their complete genomes, and (ii) to estimate a timeframe for the evolution of major lineages and place them in the context of the environmental conditions of early Earth. The remaining sections of this chapter give a general overview of the status of our knowledge regarding prokaryotes: their diversity (§1.2), genome features (§1.3), geologic record (§1.4), and its application to evolutionary studies (§1.5).

The second chapter focuses on a phylogenetic and timing study carried out on protein sequences of the major prokaryote classes available at the time. This was a first attempt to establish a comprehensive timeframe for prokaryote phylogeny, as only a few groups had been timed before (Feng, Cho, and Doolittle 1997; Hedges et al. 2001). We also evaluated the implications of the estimated timeline for the evolution of metabolisms and adaptations to new environments (Battistuzzi, Feijão, and Hedges 2004). This research provided the raw material for the following study, presented in chapter three. In this case, we acknowledge the uncertain nature of prokaryote backbone phylogeny (relationships of classes) in light of recent studies (Gophna, Doolittle, and Charlebois 2005; Ciccarelli et al. 2006) and apply a “consensus approach” between ribosomal RNA (rRNA) and protein trees to identify common trends. We use the same reasoning for timetrees in order to better constrain the time of evolution of all classes of prokaryotes represented by at least one completely sequenced genome and the ribosomal genes for the small and large subunit (SSU and LSU). These timetrees are used to correlate the origin of major groups and metabolisms with the geologic record and create a possible scenario of prokaryote-Earth interactions in the Archean. Conclusions and astrobiological perspectives are then discussed in chapter four.

1.2 Prokaryote diversity

Prokaryotes are unicellular organisms that bear cytological and genetic distinctions from the third domain of life, the eukaryotes. Among their cytological differences, the most evident is the absence, in the vast majority of them, of membrane-bound internal compartments and a nucleus (Madigan, Martinko, and Parker 2003). There is, however, evidence for exceptions to this rule, such as the nucleoid in Gemmata obscuriglobus, or the anammoxosome in members of the Class Planctomycetacia (Fuerst and Webb 1991; van Niftrik et al. 2004). Genetic distinctions include higher order structures of genes (e.g., 16S+23S+5S operon present in most Eubacteria and Archaebacteria), generally circular chromosomes, and Shine-Dalgarno ribosome binding sites (Brown and Koretke 2000; Madigan, Martinko, and Parker 2003). Each prokaryote domain also shows distinctive characteristics in its cellular structures, such as the presence of ether lipids in Archaebacteria versus ester lipids in Eubacteria (Madigan, Martinko, and Parker 2003). Most importantly, prokaryotes have co-transcriptional translation on their main chromosomes (Martin and Koonin 2006).

Debates on the amount of prokaryote diversity on Earth are ongoing, and estimates vary from conservative values of 10,000 species to estimates several orders of magnitude higher (up to 10¹⁰ species) (Schloss and Handelsman 2004; Curtis et al. 2006), with a general consensus ranging between 10⁷ and 10⁹ species globally. These figures are strikingly higher than the number of currently recognized species (8,337 validly published species, http://www.bacterio.cict.fr/number.html) and the number of whole genome projects completed (570 published genomes, 1390 ongoing projects, http://www.genomesonline.org as of Sept. 12th, 2007). Nonetheless, even with the partial knowledge of their diversity that we possess, prokaryotes are recognized as a fundamental force in maintaining Earth habitability and in sustaining other life forms (e.g., CO2 and N2 fixation) (Madsen 2005; Curtis et al. 2006). Their metabolic versatility and genome plasticity have allowed them to colonize virtually every environment on Earth, utilizing a variety of metabolisms. These range from those fully dependent on organic substrates (chemoorganotrophy, i.e., use of organic compounds as source of energy and carbon) to metabolisms completely decoupled from preexistent organic inputs and thus independent of other life forms (photoautotrophy, i.e., use of light as energy source and carbon dioxide as carbon source), in either mesophilic or extremophilic environments (Madigan, Martinko, and Parker 2003).

While laboratory cultivation techniques were at first invaluable in understanding the basics of cell functioning and adaptability, it is now clear that they also limit the range of our knowledge. By posing constraints on the type of organisms and metabolisms that can be cultured, these techniques favor mostly aerobic organisms growing on standard media, and provide a potentially biased view of the communities present in the habitats under study. Recently, the development of methodologies to study uncultured prokaryotes and environmental samples (e.g., FISH: fluorescent in situ hybridization, SIP: stable isotope probing, metagenomics) has enhanced our understanding of prokaryote communities. For example, new interactions have been discovered among species of different phyla and domains, such as the anaerobic oxidation of methane carried out by reverse-methanogenic Archaebacteria and sulfate-reducing Eubacteria (Boetius et al. 2000; Orphan et al. 2001a). The connections that intertwine physical and biological cycles are, in many cases, regulated by prokaryotic redox reactions and define the crucial role of prokaryotes in Earth’s chemical balance (Madsen 2005). Given their current basic function, it is reasonable to assume that prokaryotes played a similar role on early Earth.

1.3 Genome sequencing and features

Effective sequencing methods have produced hundreds of whole genome sequences stored in open access databases (e.g., GenBank, JGI, TIGR), and the increasing use of pyrosequencing (Ronaghi, Uhlen, and Nyren 1998), which is better suited for environmental samples, is likely to further increase the pace of genome sequencing (Field, Wilson, and van der Gast 2006). These genomes are allowing, for the first time, comprehensive studies of prokaryote phylogeny with multiple “core” genes (i.e., genes shared by all or most species under study) and hundreds of species, providing the opportunity to glimpse the general picture of prokaryote evolution and adaptability. The potential of whole genomes is not limited to the identification of core genes; it also allows a variety of phylogenetic studies based on other genomic features, such as gene content, gene arrangements, and indels. All of these methods have been used in recent studies to estimate a phylogeny of prokaryotes, with different results (reviewed in Philippe et al. 2005).

However, when drawing conclusions based on information from whole genomes, two possible sources of error should be considered: biases in databases and phylogenetic artifacts. On the one hand, from a phylogenetic perspective, there is moderately good representation of prokaryotes with at least one fully sequenced genome in databases, with 15 out of the 24 phyla currently recognized present in GenBank (and the remaining 9 phyla with ongoing genome projects). On the other hand, from a metabolic point of view, three major biases are inherent to the genomes sequenced: (i) a preference for organisms of human interest, such as human or other eukaryotic pathogens, (ii) a higher representation of culturable species (and, thus, with specific metabolic characteristics), and (iii) a general preference for smaller genomes (Martiny and Field 2005).

Our current understanding of prokaryote genome evolution has also highlighted particular features that can produce biases in phylogenetic reconstructions. The transfer of genes horizontally (i.e., between species) instead of vertically (i.e., from parent to offspring), variable GC contents within and among genomes, and variable rates of evolution are among the factors that can cause the artificial close clustering of distantly related species (Delsuc, Brinkmann, and Philippe 2005). The high taxon sampling now available and a better understanding of the effects of these biases on phylogenetic reconstruction models allow us to compensate, at least in part, for such biases (Delsuc, Brinkmann, and Philippe 2005; Philippe et al. 2005). However, there is still debate on their overall influence on the reconstruction of prokaryote phylogeny (e.g., Doolittle 1999; Dagan and Martin 2006; Huang and Gogarten 2006; Choi and Kim 2007). Because each method differs in its susceptibility to these biases, it is appropriate to compare the results of various methodologies instead of relying exclusively on one method. This allows the identification of common trends while identifying the strengths and weaknesses of each method.

1.4 The geologic record of prokaryotes

Phylogenetic studies of prokaryotes are only one, albeit fundamental, aspect of our search for understanding the evolution of early life. Another aspect of this evolution is the chronological framework during which it has occurred. As previously mentioned, information on prokaryote evolution from the geologic record is rare and often controversial. The oldest taxonomically recognizable fossil of a prokaryote is a Palaeoproterozoic (2.5–1.6 billion years ago, Ga) member of the Phylum Cyanobacteria (Brocks et al. 2003), although claims for even older fossils have been made (Schopf 2006). In contrast, uncharacterized communities of prokaryotes in the form of stromatolites (i.e., accretionary laminated sedimentary structures often formed by bacterial mats) and ichnofossils can be traced back to rocks of the Warrawoona Group (Western Australia) and Barberton Mountain Land (South Africa) at 3.5 Ga (Hofmann et al. 1999; Knoll 2003b; Banerjee 2007). The scarcity of well preserved sedimentary rocks from the Archean eon and the overlapping features of biotic and abiotic formations further complicate the interpretation of the geologic record of prokaryotes (Grotzinger and Rothman 1996; Knoll 2003b; Brasier et al. 2006; Schopf 2006).

More detailed evidence of the presence of prokaryotes on Earth can be gained from alternative forms of the geologic record, such as isotope fractionation and biomarkers. Isotope fractionation can result from the preferential use of lighter isotopes by living organisms. This process results in lighter signatures in sediments of biological origin that can be conserved through time (Madigan, Martinko, and Parker 2003). Different metabolisms can fractionate elements differently and with different efficiencies, allowing, to a certain extent, not only the identification of a general presence of life, but also of the specific metabolic properties of the life forms involved (House, Schopf, and Stetter 2003). Nevertheless, sediments enriched with light isotopes can also occur abiotically, for example in hydrothermal settings (Brasier et al. 2002; Horita 2005; Brasier et al. 2006).

Biomarkers, instead, are the end product of the alteration of macromolecules produced exclusively by specific lineages of organisms (Brocks and Pearson 2005). With time these compounds are chemically altered in the sediments (e.g., only the hydrocarbon skeleton is maintained while the functional groups are lost) but retain the “phylogenetic imprint” that was characteristic of the organism that formed them. A potential problem in biomarker interpretation stems from our limited knowledge of prokaryote phylogeny and diversity, in terms of species that have been cultured. The uncertainty of the identity of the closest relative of a particular lineage and the few lineages tested for production of putative biomarkers leave open the possibility that a compound might not be specific to a single lineage but, instead, to a higher taxonomic level. This would render the biomarker virtually useless for identifying the presence of a lineage within that higher-level cluster at a specific time. Moreover, the same compounds are often discovered to be produced by metabolic pathways employing alternative enzymes under different conditions and in different organisms, adding another layer of complexity to the interpretation of putative biomarker distributions (Kopp et al. 2005). These considerations notwithstanding, biomarkers and isotopes have provided insights into prokaryote evolution and have extended the record of their presence on Earth to the early Archean (Brocks and Pearson 2005; Ueno et al. 2006).
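As a reminder of how the isotopic signatures discussed in this section are usually quantified, the standard delta notation for carbon is shown below. This is the conventional geochemical definition, given here for orientation only; it is not a formula taken from the analyses in this thesis, and the symbols R_sample and R_standard are introduced solely for this illustration.

```latex
\delta^{13}\mathrm{C}
  = \left( \frac{R_{\text{sample}}}{R_{\text{standard}}} - 1 \right) \times 1000,
\qquad
R = \frac{{}^{13}\mathrm{C}}{{}^{12}\mathrm{C}}
```

The value is expressed in per mil. Because biological carbon fixation preferentially incorporates the lighter isotope, R_sample falls below R_standard and the resulting value is negative, which is the "lighter signature" referred to above.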

1.5 Molecular clocks

An important application of the geologic record of prokaryotes is its use to calibrate molecular clocks. The idea of molecular time estimation methods originated in the early 1960s, when the presence of a relatively constant rate of evolution of macromolecules was recognized (Zuckerkandl and Pauling 1962) and related to the time elapsed since the divergence of two lineages (Margoliash 1963). Following this discovery, the enunciation of the neutral theory of evolution by Kimura (1983) gave a theoretical basis for molecular clock methods. The realization that the majority of mutations are neutral, in fact, provides the raw material on which molecular clocks can operate to estimate divergence times, as these mutations will occur stochastically but at a relatively constant rate over long periods of time (Kumar 2006).

Since these early years, many different methods have been implemented to estimate divergence times, with the most recent developments focusing on local clock methods, which allow variable rates among branches of a phylogenetic tree (Hedges and Kumar 2003). The discovery of different evolutionary rates among lineages posed a fundamental problem for time estimation with global clock methods, as the speed of the “ticking” of the clock would vary among branches of a phylogeny and could not be accurately approximated by a single constant rate of substitution. This applies to prokaryotes, as rate changes have been found among and within the two domains (Moran 1996; Kollman and Doolittle 2000; Hedges et al. 2001). Among the local clock methods, two major types exist: (i) those that estimate divergence times based on a fixed phylogeny (Sanderson 1997; Thorne, Kishino, and Painter 1998; Kishino, Thorne, and Bruno 2001; Sanderson 2002), and (ii) those that estimate divergence times with a “relaxed phylogenetics approach” (Drummond et al. 2006).

The latter is a recent development that allows the simultaneous estimation of phylogeny and divergence times. This is seen as an improvement over fixed phylogeny molecular clocks for two main reasons. First, it relaxes the assumption made in previous methods that substitution rates on adjacent branches are similar. Second, through a Markov Chain Monte Carlo (MCMC) process, it analyzes a set of distributions that maximize the posterior probability of tree topology, divergence times, rates, and substitution model simultaneously (Ho et al. 2005; Drummond et al. 2006). Yet, extensive application of this method to various data sets is lacking, and so is a critical evaluation of its performance compared to other timing methods. Furthermore, restrictions in the parameter definition of calibration nodes limit its wide applicability at this time.

Molecular clocks that assume a fixed (i.e., user-defined) phylogeny instead rest on the assumption that closely related lineages are likely to share similar substitution rates because of heritable traits linked to physiological, chemical, and evolutionary characteristics (Sanderson 1997; Thorne, Kishino, and Painter 1998). This autocorrelation parameter can be estimated in a Bayesian or a maximum likelihood framework. In the first case, the rate of the “descendant” branch is drawn from a lognormal distribution whose mean is the rate of the “parent” branch. The initial conditions for this estimation are indicated by a set of priors on the ingroup root node, which generates a set of posteriors that are the divergence times (Thorne, Kishino, and Painter 1998; Ho et al. 2005). In a maximum likelihood framework, a smoothing parameter is applied to penalize larger rate changes between related branches more heavily. Whether the smoothing parameter is fixed, as in nonparametric rate smoothing (Sanderson 1997), or estimated, as in penalized likelihood (Sanderson 2002), the effect is a damping of large rate changes and the consequent distribution of rates and times among branches.

Regardless of these methodological differences, all molecular clock methods allow the use of calibration points (i.e., nodes in the phylogeny for which the time is known from other sources) to set an absolute scale that will be used in the estimation of the free nodes. One important feature of the local clock methods discussed above is the flexibility with which calibration points can be defined. The incomplete geologic record of prokaryotes allows only the definition of soft bounds, either maxima (i.e., estimated times can be younger but not older) or minima (i.e., estimated times can be older but not younger). This allows the time for the free nodes and for the calibration(s) to be estimated according to a distribution (e.g., uniform, lognormal, exponential) that accounts for the uncertainty of the geologic record (Hedges 2002; Hedges and Kumar 2004).
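To make two of the ideas in this section concrete, the following minimal Python sketch contrasts a global (single-rate) clock with the autocorrelated lognormal rate model described above. It is an illustration under stated assumptions, not the software or settings used in this thesis: the function names, the `log_variance` parameter, and all numeric values are hypothetical and chosen only for the example.

```python
import math
import random


def strict_clock_rate(pairwise_distance, calibration_age):
    """Per-lineage substitution rate implied by a calibrated divergence.

    The pairwise distance accumulates along both descendant lineages,
    so the per-lineage rate is distance / (2 * age).
    """
    return pairwise_distance / (2.0 * calibration_age)


def strict_clock_time(pairwise_distance, rate):
    """Divergence time of two lineages under a single global rate."""
    return pairwise_distance / (2.0 * rate)


def draw_descendant_rate(parent_rate, log_variance, rng):
    """Draw a descendant-branch rate from a lognormal distribution
    whose mean equals the parent-branch rate (autocorrelated rates).

    A lognormal with parameters (mu, sigma) has mean
    exp(mu + sigma**2 / 2), so mu is shifted by -sigma**2 / 2.
    `log_variance` (sigma**2) is a hypothetical tuning value here.
    """
    mu = math.log(parent_rate) - log_variance / 2.0
    return rng.lognormvariate(mu, math.sqrt(log_variance))


if __name__ == "__main__":
    # Hypothetical calibration: distance 0.8 substitutions/site, age 2.0 Ga.
    rate = strict_clock_rate(pairwise_distance=0.8, calibration_age=2.0)
    print("calibrated rate:", rate)
    print("age of an uncalibrated node:", strict_clock_time(1.2, rate))

    # Rates drifting from parent to descendant along one lineage.
    rng = random.Random(1)
    rates = [rate]
    for _ in range(4):
        rates.append(draw_descendant_rate(rates[-1], 0.1, rng))
    print("autocorrelated rates:", rates)
```

The first two functions show why a single rate is fragile: any rate change between the calibration and the node of interest translates directly into an error in the estimated time. The third function shows the sense in which rates on related branches are expected to be similar without being forced to be identical.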

CHAPTER 2

A genomic timescale of prokaryote evolution: insights into the origin of methanogenesis, phototrophy, and the colonization of land

Fabia U. Battistuzzi¹, Andreia Feijão², S. Blair Hedges¹

¹ NASA Astrobiology Institute and Department of Biology, 208 Mueller Laboratory, The Pennsylvania State University, University Park, PA 16802, USA

² European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany

Note: published in BMC Evolutionary Biology 4: 44 (2004). AF assembled and aligned the dataset and conducted initial analyses. FUB conducted phylogenetic and molecular clock analyses and co-drafted the manuscript. SBH directed the research and co-drafted the manuscript.

2.1 Abstract

Background: The timescale of prokaryote evolution has been difficult to reconstruct because of a limited fossil record and complexities associated with molecular clocks and deep divergences. However, the relatively large number of genome sequences currently available has provided a better opportunity to control for potential biases such as horizontal gene transfer and rate differences among lineages. We assembled a data set of sequences from 32 proteins (~7600 amino acids) common to 72 species and estimated phylogenetic relationships and divergence times with a local clock method.

Results: Our phylogenetic results support most of the currently recognized higher-level groupings of prokaryotes. Of particular interest is a well-supported group of three major lineages of eubacteria (Actinobacteria, Deinococcus, and Cyanobacteria) that we call Terrabacteria and associate with an early colonization of land. Divergence time estimates for the major groups of eubacteria fall between 2.5 and 3.2 Ga, while those for archaebacteria are mostly between 3.1 and 4.1 Ga. The time estimates suggest a Hadean origin of life (prior to 4.1 Ga), an early origin of methanogenesis (3.8–4.1 Ga), an origin of anaerobic methanotrophy after 3.1 Ga, an origin of phototrophy prior to 3.2 Ga, an early colonization of land 2.8–3.1 Ga, and an origin of aerobic methanotrophy 2.5–2.8 Ga.

Conclusions: Our early time estimates for methanogenesis support the consideration of methane, in addition to carbon dioxide, as a greenhouse gas responsible for the early warming of the Earth’s surface. Our divergence times for the origin of anaerobic methanotrophy are compatible with highly depleted carbon isotopic values found in rocks dated 2.8–2.6 Ga. An early origin of phototrophy is consistent with the earliest bacterial mats and structures identified as stromatolites, but a 2.6 Ga origin of cyanobacteria suggests that those Archean structures, if biologically produced, were made by anoxygenic photosynthesizers. The resistance to desiccation of Terrabacteria and their elaboration of photoprotective compounds suggest that the common ancestor of this group inhabited land. If true, then oxygenic photosynthesis may owe its origin to terrestrial adaptations.

2.2 Introduction

The evolutionary history of prokaryotes includes both horizontal and vertical inheritance of genes (Gogarten, Doolittle, and Lawrence 2002; Wolf et al. 2002; Boucher et al. 2003). Horizontal gene transfer (HGT) events are of great interest in themselves, for their roles in creating functionally new combinations of genes (Raymond et al. 2002), but they pose problems for investigating the phylogenetic history and divergence times of organisms. The existence of a core of genes that has not been transferred is still under debate, as HGTs have been detected in genes previously considered to be immune to these events (Olsen, Woese, and Overbeek 1994; Doolittle 1999; Brochier, Philippe, and Moreira 2000; Nesbo, Boucher, and Doolittle 2001; Gogarten, Doolittle, and Lawrence 2002; Koonin 2003; Lawrence and Hendrickson 2003; Philippe and Douady 2003). Although a complete absence of HGT appears to be unlikely, genes belonging to different functional categories seem to be horizontally transferred with different frequencies (Jain, Rivera, and Lake 1999; Hansmann and Martin 2000; Lawrence and Hendrickson 2003). Genes forming complex interactions with other cellular components (e.g., translational proteins) have a lower frequency of HGT and are generally more conserved among organisms. Recent studies based on analyses of these genes have obtained similar phylogenies, suggesting an underlying phylogenetic signal (Brown et al. 2001; Daubin, Gouy, and Perrière 2002; Wolf et al. 2002; Brown 2003; Daubin, Moran, and Ochman 2003). If we accept the use of core genes for phylogeny reconstruction, then they should also be of use for time estimation with molecular clocks.
Moreover, the increasing number of prokaryotic genomes available has facilitated the detection of HGT through more accurate detection of orthology, paralogy, and monophyletic groups, and the concatenation of gene and protein sequences has helped increase the confidence of nodes and decrease the variance of time estimates (Brown et al. 2001; Hedges 2002; Brown 2003; Hedges and Kumar 2003). Temporal information concerning prokaryote evolution has come from diverse sources. For eukaryotes, the fossil record provides an abundant source of such data, but this has not been true for prokaryotes, which are difficult to identify as fossils (Benton 1993; Altermann and Kazmierczak 2003). Limited information on specific groups or metabolites has been obtained from analyses of isotopic concentrations (Hinrichs 2002) and detection of biomarkers (Summons et al. 1999; Brocks et al. 2003). By making some simple assumptions – e.g., that aerobic organisms evolved after oxygen became available (Blank 2004)– it is possible to constrain some nodes in the prokaryote timescale, but only in a coarse sense. However, most information on the timescale of prokaryote evolution has come from analysis of DNA and amino acid sequence data with molecular clocks (Ochman and Wilson 1987; Doolittle et al. 1996; Feng, Cho, and Doolittle 1997; Hedges et al. 2001; Sheridan, Freeman, and Brenchley 2003). The detection of evolutionary patterns in metabolic innovations, as a consequence of a phylogeny not dominated by HGT events, allows more detailed constraints on a prokaryote timescale. In contrast to conventional interpretations of cyanobacteria as being among the most ancient of life forms on Earth (Nisbet and Sleep 2001), these studies have consistently found a late origin of cyanobacteria (Feng, Cho, and Doolittle 1997; Hedges et al. 2001), nearly contemporaneous with the major Proterozoic rise in oxygen at 2.3 Ga, termed the Great Oxidation Event (GOE) (Holland 2002).

In this study we have assembled a data set of amino acid sequences from 32 proteins common to 72 species of prokaryotes and eukaryotes and estimated phylogenetic relationships and divergence times with a local clock method. These results in turn have been used to investigate the origin of metabolic pathways of importance in the evolution of the biosphere.

2.3 Methods

2.3.1 Data Assembly

We assembled a dataset that maximized the number of taxa and proteins from available organisms with complete genome sequences of prokaryotes and selected eukaryotes. In doing so, we omitted a few taxa (e.g., Agrobacterium tumefaciens Cereon str. C58 and Halobacterium sp. NRC-1) whose addition to the data set would have resulted in a substantial reduction in the total number of proteins. Data assembly began with the Clusters of Orthologous Groups of Proteins (COG) (Tatusov et al. 2001), which consisted of 84 proteins common to 43 species. To that initial dataset we added other species from among completed microbial genomes (NCBI; National Center for Biotechnology Information), assisted by BLAST and PSI-BLAST (Altschul et al. 1997). In total, 72 species were included in the study (54 eubacteria, 15 archaebacteria and three eukaryotes).

The species of Archaebacteria and their accession numbers are: Aeropyrum pernix K1 (NC_000854), Archaeoglobus fulgidus (NC_000917), Methanothermobacter thermoautotrophicus str. Delta H (NC_000916), Methanococcus jannaschii (NC_000909), Methanopyrus kandleri AV19 (NC_003551), Methanosarcina acetivorans str. C2A (NC_003552), Methanosarcina mazei Goe1 (NC_003901), Pyrobaculum aerophilum (NC_003364), Pyrococcus abyssi (NC_000868), Pyrococcus furiosus DSM 3638 (NC_003413), Pyrococcus horikoshii (NC_000961), Sulfolobus solfataricus (NC_002754), Sulfolobus tokodaii (NC_003106), Thermoplasma acidophilum (NC_002578), Thermoplasma volcanium (NC_002689).

The species of Eubacteria are: Aquifex aeolicus (NC_000918), Bacillus halodurans (NC_002570), Bacillus subtilis (NC_000964), Borrelia burgdorferi (NC_001318), Brucella melitensis (NC_003317, NC_003318), Buchnera aphidicola str. APS (Acyrthosiphon pisum) (NC_002528), Campylobacter jejuni (NC_002163), Caulobacter crescentus CB15 (NC_002696), Chlamydia muridarum (NC_002620), Chlamydia trachomatis (NC_000117), Chlamydophila pneumoniae CWL029 (NC_000922), Chlorobium tepidum str. TLS (NC_002932), Clostridium acetobutylicum (NC_003030), Clostridium perfringens (NC_003366), Corynebacterium glutamicum ATCC 13032 (NC_003450), Deinococcus radiodurans (NC_001263, NC_001264), Escherichia coli O157:H7 EDL933 (NC_002655), Fusobacterium nucleatum subsp. nucleatum ATCC 25586 (NC_003454), Haemophilus influenzae Rd (NC_000907), Helicobacter pylori 26695 (NC_000915), Lactococcus lactis subsp. lactis (NC_002662), Listeria innocua (NC_003212), Listeria monocytogenes EGD-e (NC_003210), Mesorhizobium loti (NC_002678), Mycobacterium leprae (NC_002677), Mycobacterium tuberculosis H37Rv (NC_000962), Mycoplasma genitalium G-37 (NC_000908), Mycoplasma pneumoniae (NC_000912), Mycoplasma pulmonis (NC_002771), Neisseria meningitidis MC58 (NC_003112), Nostoc sp. PCC7120 (NC_003272), Pasteurella multocida (NC_002663), Pseudomonas aeruginosa PAO1 (NC_002516), Ralstonia solanacearum (NC_003295), Rickettsia conorii (NC_003103), Rickettsia prowazekii (NC_000963), Salmonella enterica subsp. enterica serovar Typhi (NC_003198), Salmonella typhimurium LT2 (NC_003197), Sinorhizobium meliloti (NC_003047), Staphylococcus aureus Mu50 (NC_002758), Streptococcus pneumoniae TIGR4 (NC_003028), Streptococcus pyogenes M1 GAS (NC_002737), Streptomyces coelicolor A3(2) (NC_003888), Synechocystis PCC6803 (NC_000911), Thermoanaerobacter tengcongensis (NC_003869), Thermosynechococcus elongatus BP-1 (NC_004113), Thermotoga maritima (NC_000853), Treponema pallidum subsp. pallidum str. Nichols (NC_000919), Ureaplasma parvum serovar 3 str. ATCC 700970 (NC_002162), Vibrio cholerae O1 biovar eltor str. N16961 (NC_002505, NC_002506), Xanthomonas campestris pv. campestris str. ATCC 33913 (NC_003902), Xanthomonas axonopodis pv. citri str. 306 (NC_003919), Xylella fastidiosa 9a5c (NC_002488), Yersinia pestis (NC_003143). The eukaryotes were Arabidopsis thaliana, Drosophila melanogaster, and Homo sapiens. Accession numbers for proteins are presented elsewhere.

This dataset consisted of 60 proteins that were individually analyzed as a step in orthology determination. The proteins were aligned with CLUSTALW (Thompson, Higgins, and Gibson 1994). Then phylogenetic trees of each protein were built and visually inspected. Initial trees were constructed using Minimum Evolution (ME), with MEGA version 2.1 (Kumar et al. 2001). The major criterion that we used in determining which genes to include or exclude was the monophyly of domains. We rejected genes with domains (archaebacteria and eubacteria) that were non-monophyletic, as these would be the best examples of HGT; this amounted to 61% of the genes rejected. Some other genes were omitted if there were detectable cases of HGT within a domain, such as the deep nesting of a species from one Phylum within a clade of another Phylum. Otherwise we did not eliminate genes that had a different branching order of phyla within a domain or different relationships of groups of lower taxonomic categories.
Admittedly, ancient cases of HGT might be an explanation for some of those topological differences, but they are not detectable. However, we further tested the effectiveness of our criteria by examining the stability of individual protein trees, using different gamma values (α=1, 0.5 and 0.3). We kept only the genes that were stable to such perturbations (in terms of remaining in that category of non-HGT genes). The position of eukaryotes, which varies depending on the gene, was not considered in assessing monophyly of eubacteria and archaebacteria. The 32 remaining proteins were concatenated for analysis. The α parameters used during the tree building process were estimated with the program PAML (Empirical+F+gamma model) (Yang 1997). From the concatenation, trees were constructed with ME, Maximum Likelihood (ML) (Strimmer and von Haeseler 1996) and Bayesian (Ronquist and Huelsenbeck 2003) methods. The phylogenies obtained with ME, ML and Bayesian methods were similar, differing only at nodes without significant support as assessed by the bootstrap (BP) method (Felsenstein 1985), with only one significant exception: the position of M. kandleri in the Bayesian phylogeny.
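
The domain-monophyly criterion described above can be illustrated with a short sketch. It is not the procedure used in this study (the protein trees were inspected visually); it only shows, under stated assumptions, what the test amounts to. The nested-tuple tree encoding, the function names, and the toy taxon labels are all hypothetical. Eukaryotes are pruned before the test, reflecting the statement that their position was not considered.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree (hypothetical encoding)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def prune(tree, drop):
    """Remove the leaves named in `drop`, collapsing nodes left with one child."""
    if isinstance(tree, str):
        return None if tree in drop else tree
    kept = [p for p in (prune(child, drop) for child in tree) if p is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)


def clades(tree):
    """Yield the leaf set of every node in the tree, tips included."""
    yield leaves(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def domains_separated(tree, archaea, eubacteria, ignore=()):
    """True if a single branch separates the two domains once `ignore` is pruned."""
    pruned = prune(tree, set(ignore))
    if pruned is None:
        return False
    present = leaves(pruned)
    a = frozenset(archaea) & present
    b = frozenset(eubacteria) & present
    if not a or not b or (a | b) != present:
        return False
    found = set(clades(pruned))
    # The root position is arbitrary, so one domain forming a clade is sufficient.
    return a in found or b in found


# Toy gene trees with made-up labels: one compatible with vertical descent,
# one in which an archaebacterium nests among eubacteria (HGT-like pattern).
archaea = {"arch1", "arch2"}
eubacteria = {"eub1", "eub2", "eub3"}
tree_kept = ((("arch1", "arch2"), "euk1"), ("eub1", ("eub2", "eub3")))
tree_rejected = (("arch1", "eub1"), ("arch2", ("eub2", "eub3")))

print(domains_separated(tree_kept, archaea, eubacteria, ignore={"euk1"}))
print(domains_separated(tree_rejected, archaea, eubacteria))
```

A gene whose tree fails this test would fall in the rejected category; cases of HGT within a domain would need a further, finer-grained check that this sketch does not attempt.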

2.3.2 Time estimation

Time estimation was conducted separately within each domain (Archaebacteria and Eubacteria) using reciprocal rooting and several calibration points. All time estimates were calculated with a Bayesian local clock approach (Thorne, Kishino, and Painter 1998) utilizing concatenated data sets of multiple proteins and a Jones Taylor Thornton (JTT)+gamma model of substitution (Jones, Taylor, and Thornton 1992; Hedges and Kumar 2003; Hedges et al. 2004; Hedges and Kumar 2004). The following settings were used: numsamp (10,000), burnin (100,000), and sampfreq (100). This method permitted rates to vary on different branches, which was necessary given the known rate variation among prokaryote and eukaryote nuclear protein sequences (Kollman and Doolittle 2000; Hedges et al. 2001). Calibration of rate in this method was implemented by assigning constraints to nodes in the phylogeny. Five different initial settings (prior distributions) were used in each domain (Appendix A Table A2). These were chosen at intervals of 0.5 Ga starting from 4.5 Ga, which is approximately the age of the Earth and Solar System, to 2.5 Ga, which is slightly before the major rise in oxygen (Great Oxidation Event; GOE) as recorded in the geologic record (Holland 2002) and related to the presence of oxygenic cyanobacteria. Those constraints pertained to the ingroup root, or deepest divergence in the tree excluding the outgroup. Because of the relatively small number of duplicate genes available for rooting the tree of life, we were unable to estimate the time of the last common ancestor (the divergence of Eubacteria and Archaebacteria). For the archaebacterial data set, we included eukaryotes for calibration purposes because reliable calibration points were unavailable among those prokaryotes. In doing so, only proteins in which eukaryotes clustered with Archaebacteria were included (Hedges et al. 2001). An outgroup was used that consisted of representatives of the major groups of Eubacteria. We used the fossil and molecular times (separately) of the plant-animal divergence as calibration points, for comparison.
The fossil calibration was the first appearance of a representative of the plant lineage (red algae) at 1.198 ± 0.022 Ga (Butterfield 2000). The molecular time estimate for this divergence was 1.609 ± 0.060 Ga from a study of 143 rate-constant proteins (Hedges et al. 2004). We used the minimum and maximum bounds for these calibration times as constraints in the Bayesian analysis. Although the results of these two different calibrations are provided for comparison, our preferred calibration is the 1.2 Ga fossil calibration because it has the best justification (supporting evidence). Therefore, our summary time estimates for Archaebacteria, presented in the timetree (Fig. 2-3), use only this fossil calibration.

For the eubacterial data set, we used four internal time constraints in separate analyses, all involving the origin of Cyanobacteria. The first and most conservative constraint was a fixed origin (minimum and maximum bounds) at 2.3 Ga, which corresponds to the GOE. For the second constraint we used 2.3 Ga as a minimum bound, with no maximum bound. For the third constraint we used a previous molecular time estimate (2.56 Ga) for the divergence of Cyanobacteria from closest living relatives among Eubacteria, and fixed the minimum (2.04 Ga) and maximum (3.08 Ga) values to the 95% confidence limits of that time estimate (Hedges et al. 2001). The fourth constraint for the origin of Cyanobacteria was set at 2.7 Ga (minimum constraint) based on biomarker evidence for the presence of 2α-methylhopanes (Summons et al. 1999). We did not consider the fossil record of Cyanobacteria because the earliest indisputable fossils (Sergeev, Gerasimenko, and Zavarzin 2002) are younger (~2 Ga) than the indirect evidence (GOE) for the presence of these oxygen-producing organisms. Older fossils of Cyanobacteria are known but are disputed (Sergeev, Gerasimenko, and Zavarzin 2002; Knoll 2003a). The use of these four alternative constraints for the origin of Cyanobacteria considers most of the widely discussed hypotheses but does not rule out an origin prior to 2.7 Ga. Although the results of the four different calibrations are provided for comparison, our preferred calibration is the 2.3 Ga (minimum) geologic calibration because it has the best justification (supporting evidence). Therefore, our summary time estimates for Eubacteria, presented in the timetree (Fig. 2-3), use only this geologic calibration.

For each of these calibration points, all five initial settings were applied, resulting in 15 and 20 analyses for the Archaebacteria and Eubacteria (respectively). The effects of the different initial settings on the analyses were found to be minimal. A 44% difference in the priors, in fact, generated a maximum 2.7% (average of all significant nodes) difference in the time estimates (fossil calibration point) in the Archaebacteria and a maximum 3.5% (average of all significant nodes) difference in the Eubacteria (molecular calibration point) (Appendix A Table A3).
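
The arithmetic behind the analysis design for Eubacteria can be laid out in a few lines. This is a minimal sketch, not part of the original analysis: the variable names and constraint labels are hypothetical, and the chain-length formula assumes that sampling starts after the burn-in and that one sample is kept every sampfreq cycles.

```python
# Ingroup-root priors: 4.5 Ga down to 2.5 Ga in steps of 0.5 Ga (from the text).
root_priors_ga = [4.5 - 0.5 * i for i in range(5)]

# Relative spread between the oldest and youngest prior (the "44% difference").
prior_spread = (max(root_priors_ga) - min(root_priors_ga)) / max(root_priors_ga)

# The four cyanobacterial constraints as (minimum, maximum) bounds in Ga;
# None marks a missing maximum bound. The labels are illustrative only.
eubacterial_constraints = {
    "GOE, fixed": (2.3, 2.3),
    "GOE, minimum only": (2.3, None),
    "molecular estimate, 95% limits": (2.04, 3.08),
    "2-alpha-methylhopane biomarker, minimum only": (2.7, None),
}
n_eubacterial_runs = len(eubacterial_constraints) * len(root_priors_ga)

# Markov chain settings reported in the text.
numsamp, burnin, sampfreq = 10_000, 100_000, 100
# Assumption: total cycles = burn-in plus one kept sample every `sampfreq` cycles.
total_cycles = burnin + numsamp * sampfreq

print(f"root priors (Ga): {root_priors_ga}")
print(f"prior spread: {prior_spread:.0%}")
print(f"eubacterial analyses: {n_eubacterial_runs}")
print(f"cycles per chain: {total_cycles:,}")
```

Under these assumptions each chain runs for 1,100,000 cycles, and the four constraints combined with five priors give the 20 eubacterial analyses reported above.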

2.4 Results

2.4.1 Data set

The majority (81%) of the 32 proteins that were used are classified in the “information storage and processing” functional category of the COG. The other categories represented are “cellular processes” (10%), “metabolism” (3%), and “information storage and processing” + “metabolism” (proteins with combined functions; 6%). Other studies that have analyzed prokaryote genome sequence data for phylogeny have found a similarly high proportion of proteins in the “information storage and processing” functional category, presumably because HGT is more difficult with such genes that are vital for the survival of the cell (Brochier et al. 2002; Hedges 2002; Wolf et al. 2002; Jackson and Dugas 2003). The concatenated and aligned data set of 32 proteins contained 27,205 amino acid sites (including insertions and deletions). With alignment gaps removed, the two data sets analyzed were 7,338 amino acid sites (Archaebacteria) and 7,597 amino acid sites (Eubacteria). The data sets were complete in the sense that sequences of all taxa were present for all proteins.
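
The reduction from 27,205 aligned sites to roughly 7,000 to 8,000 analyzed sites reflects the removal of gapped positions. The sketch below shows one common way of doing this, complete deletion, in which a column is dropped if any taxon has a gap there. Whether this exact rule was applied here is an assumption; the function name and the toy sequences are hypothetical.

```python
def remove_gapped_columns(alignment, gap_chars="-"):
    """Drop every alignment column in which at least one sequence has a gap."""
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    if len({len(seq) for seq in seqs}) > 1:
        raise ValueError("sequences must be aligned to the same length")
    # zip(*seqs) walks the alignment column by column.
    kept = [col for col in zip(*seqs) if not any(c in gap_chars for c in col)]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


# Toy alignment with made-up taxa: columns 2 and 3 contain gaps and are removed.
toy = {"taxon_a": "MK-LV", "taxon_b": "MKALV", "taxon_c": "M-ALI"}
cleaned = remove_gapped_columns(toy)
print(cleaned)
```

Because one gapped taxon is enough to remove a column, the number of retained sites depends on which taxa are in the alignment, which is consistent with the two domains yielding different site counts from the same proteins.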

2.4.2 Phylogeny

The phylogeny of Eubacteria (Fig. 2-1) shows significant bootstrap support for most of the major groups and subgroups. All Proteobacteria form a monophyletic group (support values 95/47/99 for ME, ML and Bayesian respectively) with the following relationships of the subgroups: (Epsilon (Alpha (Beta, Gamma))). There has been debate about the effect of base composition and substitution rate on the phylogenetic position of the endosymbiont Buchnera among Gamma-proteobacteria (Itoh, Martin, and Nei 2002; Canback, Tamas, and Andersson 2004). Its position here (Fig. 2-1) differs slightly from both studies; accordingly, any conclusions concerning its divergence time should be treated with caution. Spirochaetes cluster with Chlamydiae, Actinobacteria with Cyanobacteria and Deinococcus (support values for Cyanobacteria + Deinococcus are 92/80/99), and the hyperthermophiles (Thermotoga, Aquifex) branch basally in the tree. These groups and relationships are similar to those found previously with analyses of prokaryote genome sequences (Brochier et al. 2002; Hedges 2002; Wolf et al. 2002; Jackson and Dugas 2003).

The phylogeny of Archaebacteria (Fig. 2-2) agrees with some but not all aspects of previous phylogenetic analyses of prokaryote genomes using sequence data (Brown et al. 2001; Hedges et al. 2001; Wolf et al. 2001; Hedges 2002; Wolf et al. 2002; Brochier, Forterre, and Gribaldo 2004) and the presence and absence of genes (Snel, Bork, and Huynen 1999; Tekaia, Lazcano, and Dujon 1999; Wolf et al. 2001; House, Runnegar, and Fitz-Gibbon 2003). For example, each of the two major clades of Archaebacteria (excluding Korarchaeota and Nanoarchaeota, which were not represented) is monophyletic. This is consistent with some analyses (Brown et al. 2001; Hedges 2002) but not others (Wolf et al. 2002). Also, the position of Crenarchaeota as closest relatives of eukaryotes (Fig. 2-2), instead of Euryarchaeota, has been debated (Rivera and Lake 1992; Cammarano et al. 1999; Brown et al. 2001; Hedges et al. 2001; Hedges 2002). The faster rate of evolution in eukaryotes (Fig. 2-2), as noted elsewhere (Kollman and Doolittle 2000; Hedges et al. 2001), requires some caution in drawing conclusions regarding their phylogenetic position. Methanogens were found to be monophyletic in some previous analyses (Wolf et al. 2002; House, Runnegar, and Fitz-Gibbon 2003) but were paraphyletic in other analyses (Forterre, Brochier, and Philippe 2002; Matte-Tailliez et al. 2002; Brochier, Forterre, and Gribaldo 2004) and in our analysis (Fig. 2-2). The phylogenetic position of one species of methanogen in particular, Methanopyrus kandleri, has differed among previous studies (Burggraf et al. 1991; Rivera and Lake 1996; Slesarev et al. 2002). However, it is difficult to make direct comparisons among various studies because they have included different sets of taxa.

2.4.3 Time estimation

Times of divergence were estimated for all nodes in the phylogenies of Eubacteria (Fig. 2-1) and Archaebacteria (Fig. 2-2) using the alternative constraints (calibrations) described in the Methods. The Eubacteria time estimates show an average 7% increase from the molecular to the geologic (2.3 Ga minimum) calibration point. Two additional geologic calibration points were used in the analyses (see Methods), 2.3 Ga fixed and 2.7 Ga minimum, which showed respectively 10% younger and 11% older time estimates compared with the 2.3 Ga minimum calibration point. The times estimated with the fossil calibration point in the Archaebacteria data set were on average only 10% younger than the ones estimated with the molecular calibration. Moreover, there was an even smaller effect on the time estimates of the deepest nodes, which were the ones of interest in this study (node M 3.2%, node N 2.1%, node O 1.8% and node P 1.3%). This variation is due not only to the different calibration times but also to the type of constraints used (i.e., minimum bounds only vs. minimum and maximum bounds).

A single timetree (Fig. 2-3) was constructed from the phylogenetic and divergence time data. The time estimates summarized in that tree derive only from the best-justified calibrations. For Eubacteria, the 2.3 Ga minimum calibration (constraint), from the geologic record, was chosen because it encompasses all of the hypothesized time estimates for the origin of Cyanobacteria. For Archaebacteria, the 1.2 Ga calibration (minimum 1.174 Ga, maximum 1.222 Ga), from the red algae fossil record, was selected because it provides a conservative constraint on the divergence of plants and animals. Time estimates and 95% credibility intervals for all nodes under all calibrations are presented elsewhere (Appendix A Table A1, Figures A1 and A2), and those data are summarized for selected nodes and calibrations for Eubacteria and Archaebacteria (Table 2-1). Although some undetected HGT could be a source of bias in the time estimates, the direction of the bias (raising or lowering the estimate) would depend on the specific node and groups involved, and it is unlikely to have had a major effect on the results, even if present.

Divergence times within Eubacteria (Fig. 2-3, Table 2-1, nodes A–K) show a pattern seen previously (Hedges et al. 2001) whereby most major groups diverge from one another (nodes B–I excluding node D) in a relatively limited time interval, approximately between 2.5–3.2 Ga. The position of the hyperthermophiles has been debated, with some studies showing them in a basal position whereas others place them in a more derived position. The high G-C composition of these taxa is believed to be responsible for this difficulty in phylogenetic placement. Here, they branch basally (node J, 3.17–4.13 Ga and node K, 3.43–4.46 Ga), but this should be interpreted with caution for this reason. The divergence of Escherichia coli from Salmonella typhimurium (Fig. 2-3, Table 2-1, node A; 0.06–0.18 Ga) is consistent with the time estimated previously from consideration of mammalian host evolution (0.12–0.16 Ga) (Ochman and Wilson 1987). On the other hand, the divergence of unicellular (Thermosynechococcus elongatus) and heterocyst-forming (Nostoc sp.) Cyanobacteria is inconsistent with the fossil record. Our time estimate for this divergence is 0.70–1.41 Ga (Fig. 2-3, Table 2-1, node D) while microfossils of both groups have been identified in Mesoproterozoic (1.5–1.3 Ga) and Paleoproterozoic (2.12–2.02 Ga) rocks (Golubic, Sergeev, and Knoll 1995; Amard and Bertrand-Sarfati 1997; Sergeev, Gerasimenko, and Zavarzin 2002). However, the identification of these latter fossils has been debated (Amard and Bertrand-Sarfati 1997).
Branch lengths of Cyanobacteria in our protein tree and in SSU rRNA trees (Jackson and Dugas 2003) do not suggest obvious substitutional biases or rate changes, as they are neither unusually long nor unusually short. The reason for the discrepancy between the molecular and fossil times remains unclear but a possible misinterpretation of the fossil record cannot be dismissed.

Divergence times of most internal nodes among Archaebacteria (Fig. 2-3, Table 2-1, nodes L–P) are closely spaced in time and relatively ancient, approximately between 3.1–4.1 Ga, regardless of the initial setting (prior) for the ingroup root. Node P is the earliest divergence, separating Euryarchaeota from Crenarchaeota+eukaryotes. Node O represents the common ancestor of the methanogens in our analysis (Methanopyrus kandleri, Methanothermobacter thermoautotrophicus, Methanococcus jannaschii, Archaeoglobus fulgidus, Methanosarcina mazei and M. acetivorans). Therefore, methanogenesis presumably arose between nodes P and O, or between 4.11 Ga (3.31–4.49 Ga) and 3.78 Ga (3.05–4.16 Ga) (Fig. 2-3, Table 2-1). If the position of Methanopyrus kandleri is not considered, in light of the current debate concerning its relationships (noted above), node N (Fig. 2-3, Table 2-1) becomes the relevant node and the minimum time for the origin of methanogenesis drops only slightly, from 3.78 (3.05–4.16) Ga to 3.57 (2.88–3.95) Ga.

2.5 Discussion

2.5.1 Origin of life

Neither the time for the origin of life, nor the divergence of Archaebacteria and Eubacteria, was estimated directly in this study. Nonetheless, one divergence within Archaebacteria was estimated to be as old as 4.11 Ga (Node P), suggesting even earlier dates for the last common ancestor of living organisms and the origin of life. This is in agreement with previous molecular clock analyses using mostly different data sets and methodology (Feng, Cho, and Doolittle 1997; Hedges et al. 2001). A Hadean (4.5–4.0 Ga) origin for life on Earth is also consistent with the early establishment of a hydrosphere (Mojzsis, Harrison, and Pidgeon 2001; Nisbet and Sleep 2001). Nevertheless, the earliest geologic and fossil evidence for life has been debated (Schopf 1993; Mojzsis et al. 1996; Brasier et al. 2002; Kazmierczak and Altermann 2002; Schopf et al. 2002; Altermann and Kazmierczak 2003; Brasier et al. 2004), leaving no direct support for such old time estimates.

2.5.2 Methanogenesis

The lower luminosity of the sun during the Hadean and Archean predicts that surface water would have been frozen during that time. Instead there is evidence of liquid water and moderate to high surface temperatures (Schwartzman 1999; Kasting and Catling 2003). The long-term carbon cycle (carbonate-silicate cycle), which acts as a temperature buffer, combined with greenhouse gases, probably explains this “Faint Young Sun Paradox” (Kasting and Catling 2003). Arguments have been made in support of either methane (Pavlov et al. 2000; Kasting, Pavlov, and Siefert 2001; Pavlov et al. 2003) or carbon dioxide (Ohmoto, Watanabe, and Kumazawa 2004) as the major greenhouse gas involved. If methane were important, it would have necessarily come from organisms (methanogens), given the volume required. Archaebacteria are the only prokaryotes known to produce methane. Our time estimate of between 4.11 (3.31–4.49) Ga and 3.78 (3.05–4.16) Ga for the origin of methanogenesis suggests that methanogens were present on Earth during the Archean, consistent with the methane greenhouse theory (Pavlov et al. 2003). Nonetheless, this does not rule out the alternative (carbon dioxide) explanation (Ohmoto, Watanabe, and Kumazawa 2004).

2.5.3 Anaerobic methanotrophy

Anaerobic methanotrophy, or anaerobic oxidation of methane (AOM), is a metabolism associated with anoxic marine sediments rich in methane. This metabolism is characterized by the coupling of two reactions, oxidation of methane and sulfate reduction. The methane oxidizers are represented by Archaebacteria phylogenetically related to the Order Methanosarcinales, while the sulfate reducers, when present, are eubacterial members of the Class Delta-proteobacteria (Orphan et al. 2001a). These two groups of prokaryotes have been found associated in syntrophies, thus suggesting the coupling of these two pathways (Boetius et al. 2000; DeLong 2000; Orphan et al. 2001a; Orphan et al. 2001b). Archaebacteria have also been found isolated in monospecific clusters, oxidizing methane through an unknown reaction. It has been suggested that they may use elements of both the methanogenesis and sulfate-reducing pathways (Orphan et al. 2002). An example of coexistence of genes from both of these pathways is Archaeoglobus fulgidus. The particular condition of this archaebacterium has been explained by an ancient horizontal gene transfer from a eubacterial lineage, most likely a deltaproteobacterium (Klenk et al. 1997; Klein et al. 2001).

The phylogenetic position of the anaerobic methanotrophs with the Methanosarcinales places the maximum date for the origin of this metabolism at 3.09 (2.47–3.51) Ga (Fig. 2-3, Table 2-1, node M). The minimum time estimate of 0.23 (0.12–0.39) Ga (Fig. 2-3, Table 2-1, node L), probably a substantial underestimate of the true time, results from the limited phylogenetic sampling available for this group.

2.5.4 Aerobic methanotrophy

Aerobic methanotrophs are represented in the Alpha and Gamma classes of the Proteobacteria. This suggests an origin for this metabolism between node C (2.80 Ga; 2.45–3.22 Ga) and node B (2.51 Ga; 2.15–2.93 Ga) (Fig. 2-3, Table 2-1). Shared genes from this pathway and from methanogenesis also have been found in the Order Planctomycetales (Chistoserdova et al. 2004). This has suggested a revision of the direction of the HGT, usually considered to be from Archaebacteria to Eubacteria (Boucher et al. 2003), that presumably has spread these genes in the two domains. However, the absence of Planctomycetales from our dataset and its controversial phylogenetic position (Jenkins and Fuerst 2001) do not allow us to discriminate among these possibilities. Both anaerobic and aerobic methanotrophy have been used to explain the highly depleted carbon isotopic values found in 2.8–2.6 Ga geologic formations (Hayes 1994; Hinrichs 2002). Our time estimates for these two metabolisms are both compatible with the isotopic record. Molecular clock methods have estimated the origin of Cyanobacteria at 2.56 (2.04–3.08) Ga (Hedges et al. 2001). Because oxygenic photosynthesis would have been necessary for aerobic methanotrophy (Hayes 1994), an anaerobic metabolism seems more likely to explain the isotopic record.

2.5.5 Phototrophy

The ability to utilize light as an energy source (photosynthesis) is restricted to Eubacteria among prokaryotes. Phototrophic Eubacteria are found in five major phyla (groups), including Proteobacteria, green sulfur bacteria, green filamentous bacteria, gram positive Heliobacteria, and Cyanobacteria (Xiong et al. 2000; Raymond et al. 2002). Only Cyanobacteria produce oxygen. There are three explanations for this broad taxonomic distribution of phototrophic metabolism: it evolved in one lineage of Eubacteria and spread at a later time to other lineages by horizontal transfer, the common ancestor of these groups possessed this metabolism and genetic machinery, or there was a combination of horizontal transfer and vertical inheritance (Raymond et al. 2002). Because two of the three explanations require a phototrophic common ancestor, and because some features of the Archean geologic record require this metabolism if biologically produced (DesMarais 2000), we have assumed here that the common ancestor (Node I) was phototrophic. Therefore, we estimate that phototrophy evolved prior to 3.19 (2.80–3.63) Ga (Fig. 2-3, Table 2-1, node I). Because the hyperthermophiles Aquifex and Thermotoga are not phototrophic and branch more basally, 3.64 (3.17–4.13) Ga (Node J) can be considered a maximum date for phototrophy. However, if those hyperthermophiles instead occupy a more derived position on the tree, as some analyses have indicated (Brochier and Philippe 2002), then the maximum date is no longer constrained in this analysis.

2.5.6 The colonization of land

The evolution of phototrophy was most likely linked to the evolution of other features essential to survival in stressful environments. Considerable biological damage can occur from exposure to ultraviolet radiation, especially prior to the GOE and later formation of the protective ozone layer (Cockell and Horneck 2001). The synthesis of pigments such as carotenoids, which function as photoprotective compounds against the reactive oxygen species created by UV radiation (Gotz et al. 1999), is an ability present in all the photosynthetic Eubacteria and in groups that are partly or mostly associated with terrestrial habitats such as the Actinobacteria, Cyanobacteria, and Deinococcus-Thermus. Pigmentation was probably a fundamental step in the colonization of surface environments (Wynn-Williams et al. 2002). Besides the sharing of photoprotective compounds, these three groups (Cyanobacteria, Actinobacteria, and Deinococcus) also share a high resistance to dehydration (Potts 1994; Mattimore and Battista 1996; Rokitko et al. 2001; Shirkey et al. 2003), which further suggests that their common ancestor was adapted to land environments. Therefore we propose the name Terrabacteria (L. terra, land or earth) for the group that includes the bacterial phyla Actinobacteria, Cyanobacteria, and Deinococcus-Thermus. An early colonization of land is inferred to have occurred after the divergence of this terrestrial lineage from Firmicutes (Fig. 2-3, Table 2-1, node H), 3.05 (2.70–3.49) Ga, and prior to the divergence of Actinobacteria from Cyanobacteria + Deinococcus (Fig. 2-3, Table 2-1, node F), 2.78 (2.49–3.20) Ga. These molecular time estimates are compatible with time estimates (2.6–2.7 Ga) based on geological evidence for the earliest colonization of land by organisms (prokaryotes) (Watanabe, Martini, and Ohmoto 2000). Many groups of prokaryotes currently inhabit terrestrial environments, indicating that land has been colonized multiple times in different lineages.

2.5.7 Oxygenic photosynthesis

From the above analyses and discussion, some of the early steps leading to oxygenic photosynthesis apparently were acquisition of protective pigments, phototrophy, and the colonization of land. Currently, hundreds of terrestrial species of Cyanobacteria are known, broadly distributed among the orders, with species occurring in some of the driest environments on Earth. It is possible that a terrestrial ancestry of Cyanobacteria, where stresses resulting from desiccation and solar radiation were severe, may have played a part in the evolution of oxygenic photosynthesis. Nonetheless, there is ample evidence that horizontal gene transfer also has played an important role in the assembly of the photosynthetic machinery (Raymond et al. 2002).

Although we have used the origin of Cyanobacteria as a calibration (2.3 Ga, geologic time based on GOE), such minimum constraints permit the estimated time to be much older in a Bayesian analysis. However, in this case, the time estimated for node E (2.56 Ga; 2.31–2.97 Ga; Fig. 2-3, Table 2-1) was not much older than the constraint itself. It also agrees with an earlier molecular time estimate (2.56 Ga; 2.04–3.08 Ga) based on a largely different data set and methods (Hedges et al. 2001). When we used the older minimum constraint of 2.7 Ga, corresponding to 2α-methylhopane evidence considered to represent a biomarker of Cyanobacteria (Summons et al. 1999), the estimated time was likewise only slightly older (Appendix A Table A1). The oldest time estimates for oxygenic photosynthesis that we obtained are still considerably younger than has generally been assumed in the geologic literature (Buick 1992; Nisbet and Sleep 2001; Holland 2002). This suggests that carbon isotope excursions, microfossils, microbial mats, stromatolites, and other pre-3 Ga evidence ascribed to Cyanobacteria should be re-evaluated.

2.6 Conclusions

The analyses presented here are based on the assumption, still under debate, that historical information (phylogenies and divergence times) can be retrieved from genes in the prokaryote genome that have not been affected by horizontal gene transfer. Our prokaryotic timeline shows deep divergences within both the eubacterial and archaebacterial domains, indicating a long evolutionary history. The early evolution of life (>4.1 Ga) and early origin of several important metabolic pathways (phototrophy, methanogenesis; but not oxygenic photosynthesis) suggest that organisms have influenced the Earth’s environment since early in the history of the planet (Fig. 2-4). An inferred early presence of methanogens (3.8–4.1 Ga) is consistent with models suggesting that methane was important in keeping the Earth’s surface warm in the Archean but does not rule out the possibility that carbon dioxide may have been equally (or more) important. In contrast to many classical interpretations of the early evolution of life, we find no compelling evidence for a pre-3 Ga evolution of Cyanobacteria and oxygenic photosynthesis. This unique metabolism apparently evolved relatively late in the radiation of eubacterial clades, shortly before the Great Oxidation Event (~2.3 Ga). The evolution of oxygenic photosynthesis may have involved a combination of adaptations to stressful terrestrial environments as well as acquisition of genes through horizontal transfer.

2.7 Acknowledgements We thank Prachi Shah for programming assistance, Hidemi Watanabe for providing alignment tools, and Jaime E. Blair, Robert E. Blankenship, James G. Ferry, Davide Pisani and Fabienne Thomarat for discussion. This work was supported by grants to SBH from the NASA Astrobiology Institute and the National Science Foundation. AF was supported by a Director’s Travel Scholar grant from NASA Astrobiology Institute.

Table 2-1. Time estimates for selected nodes in the tree of Eubacteria (A-K) and Archaebacteria (L-P). Letters refer to Fig. 2-3.

Node    Time (Ma)a    CIb
A        102            57–176
B       2508          2154–2928
C       2800          2452–3223
D       1039           702–1408
E       2558          2310–2969
F       2784          2490–3203
G       2923          2587–3352
H       3054          2697–3490
I       3186          2801–3634
J       3644          3172–4130
K       3977          3434–4464
L        233           118–386
M       3085          2469–3514
N       3566          2876–3948
O       3781          3047–4163
P       4112          3314–4486

a Averages of the divergence times estimated using the 2.3 Ga minimum constraint and the five ingroup root constraints (nodes A–K) and using the 1.198 ± 0.022 Ga constraint and the five ingroup root constraints (nodes L–P).
b Credibility interval (minimum and maximum averages of the analyses under the five ingroup root constraints).

CHAPTER 3

Progressive colonization of environments on the early Earth by prokaryotes

Fabia U. Battistuzzi1 and S. Blair Hedges1

1NASA Astrobiology Institute and Department of Biology, 208 Mueller Laboratory, The Pennsylvania State University, University Park, PA 16802, USA

3.1 Abstract The relationships of the classes and phyla of prokaryotes are unresolved. The availability of hundreds of genomes is insufficient to resolve most relationships and comparisons among studies are hindered by the use of different sets of species. Here we have applied a broad array of phylogenetic methods to two data sets: a protein data set of 25 common genes and a ribosomal data set comprising the small subunit (SSU) and large subunit (LSU) ribosomal RNA (rRNA). Our aim is to identify trends shared between the two data sets and among the different methods used to define a backbone phylogeny for prokaryotes. Furthermore, we obtained time estimates for the divergences of classes and higher-level taxonomic groups. We identified metabolically and physiologically defined groups that can be related to the colonization of Earth. A parallel path depicted by geologic and planetary events and by phylogenetic history shows a progressive colonization from submarine high temperature environments to mesophilic photic zones to terrestrial and specialized niches. These events occurred in a restricted period of time (between 3.5–2.5 billion years ago, Ga) and highlight a rapid adaptive radiation in Eubacteria that can contribute to explaining the difficulties in defining their phylogenetic history.

3.2 Introduction The phylogenetic relationships among classes and phyla of prokaryotes are unresolved. Different methods of analyzing the same or similar data sets have produced different phylogenies (e.g., Brochier and Philippe 2002; Wolf et al. 2002; Ciccarelli et al. 2006; Pisani, Cotton, and McInerney 2007). As each method and data set has its strengths and weaknesses, it is often difficult to evaluate a new topology and objectively decide if it is more likely to reflect the evolutionary history of the organisms compared to alternative ones. This difficulty has left the major aspects of the prokaryote evolutionary tree in an undetermined state, except for a few well-established groups (e.g., monophyly of Proteobacteria and monophyly of Firmicutes). Two explanations for the difficulty in resolving the early evolution of prokaryotes are HGT and the long period of time that has elapsed. Both of these factors contribute to diminished phylogenetic signal in the DNA and protein sequences contributing to a lack of resolution in the trees. The unresolved phylogenetic history of prokaryotes has a direct effect also on the establishment of a timeline for their evolution. Information on the time of origin of lineages comes mostly from the geologic record (e.g., fossil, geochemical) and from molecular clock methodologies. The latter are useful to integrate the poor geologic record of prokaryotes but they depend on phylogenetic and evolutionary rate

information for calibration and rate estimation. Few studies have been carried out using molecular clock approaches (Feng, Cho, and Doolittle 1997; Hedges et al. 2001; Sheridan, Freeman, and Brenchley 2003; Battistuzzi, Feijão, and Hedges 2004) and they show differences in the time of origin of major groups. One of the most debated issues concerning the relationships of prokaryotes is the influence of HGT on estimates of phylogeny. Opinions have ranged from almost no role (Choi and Kim 2007) to a significant role (Lerat et al. 2005) and have left many researchers wondering if a tree-like structure is the best representation for the evolution of prokaryotes (Doolittle 1999; Bapteste et al. 2004; Bapteste et al. 2005; Dagan and Martin 2006; Dagan and Martin 2007). Nonetheless, phylogenies from carefully selected core genes show clusters consistent with currently recognized taxonomic groups (i.e., monophyly of classes) based on cytological and biochemical data (Holt 1984). This suggests that the phylogenetic history retained in a few genes that are not transferred (or have low transfer rates) reflects that of the organisms that bear them (Battistuzzi, Feijão, and Hedges 2004; Bern and Goldberg 2005; Ciccarelli et al. 2006) although there are alternative opinions (Dagan and Martin 2006). The detection of horizontally transferred genes and their deletion from the data set has proven useful to increase the signal in phylogenies based on protein sequences, although it is not sufficient to obtain a stable phylogeny of prokaryotes at high taxonomic levels (i.e., above class). The SSU rRNA gene (16S rRNA) was originally used to estimate a tree of life because of its ubiquity and slow rate of evolution (Woese and Fox 1977; Fox et al. 1980). Because of their complex interactions with other cellular constituents, rRNA genes are considered among the genes with low horizontal transfer rates (Rivera et al.
1998; Jain, Rivera, and Lake 1999), potentially providing an alternative HGT-free data set to protein based sequences. Recently, however, horizontally transferred genes forming the ribosome have been discovered (Lawrence and Hendrickson 2003) suggesting that careful screening of gene relationships should be applied to protein as well as rRNA sequences. In this study, we address the phylogeny and timing of prokaryote evolution through an analysis of protein and rRNA data sets. In particular, we focus on the relationships and times of origin of classes and higher-level clusters. During the last decade, hundreds of genomes have become available for phylogenetic studies, and data sets with multiple protein sequences have been deemed more reliable than rRNA (nucleotide) data given the elapsed evolutionary time (billions of years) and the larger amount of information from multiple genes. However, the availability of hundreds of sequences for both the SSU and LSU rRNA genes offers, now, an alternative to protein based studies. Although rRNA sequences can be more subject to some phylogenetic biases (e.g., GC content, saturation), especially for deep divergences (Hasegawa and Hashimoto 1993; Foster and Hickey 1999; Hedges 2002), a comparison of trees obtained from different data sets is expected to show some common patterns.

3.3 Methods 3.3.1 Data assembly and phylogenetic analyses 3.3.1.1 Protein data set Sequences from previously identified core genes (Battistuzzi, Feijão, and Hedges 2004) in Escherichia coli were used as queries for a similarity search (Altschul et al. 1997) against 311 fully sequenced genomes of Eubacteria and Archaebacteria (see

Appendix B Table B2 for a list of species and classification). We chose classes as our working taxonomic level because species of the same class are recovered in most phylogenies in highly supported monophyletic clusters (e.g., Ciccarelli et al. 2006; Pisani, Cotton, and McInerney 2007). The only exception is the phylum Bacteroidetes, which is represented by two classes in each data set: Bacteroidetes and Sphingobacteria in the protein data set and Sphingobacteria and Flavobacteria in the rRNA data set. These are discussed at the phylum level as this is always found with significant bootstrap support. The retrieved sequences were aligned for each gene by ClustalX (Thompson, Higgins, and Gibson 1994). Single gene phylogenies were built in the program MEGA4 (Tamura et al. 2007) with Neighbor-Joining (NJ; JTT model with gamma shape parameters of 0.5, 1, and 1.5; complete deletion of gaps) to check for orthology and possible HGT events. Seven genes were deleted from the data set, either because they were absent from all the genomes representing a class (three genes) or because they supported, with high bootstrap values in at least one of the phylogenies from the three gamma values, anomalous clusters at the class or higher level (i.e., paraphyly of the domains or deep nesting of one class within another one) (four genes).
Our final gene set was composed of 25 genes: 15 ribosomal proteins (RPL1, 2, 3, 5, 6, 11, 13, 16; RPS2, 3, 4, 5, 7, 9, 11), four genes (RNA polymerase alpha, beta, and gamma subunits, Transcription antitermination factor NusG) of the transcription category (K), three genes (Elongation factor G, Elongation factor Tu, Translation initiation factor IF2) of the translation, ribosomal structure and biogenesis category (J), one gene (DNA polymerase III, beta subunit) of the DNA replication, recombination and repair category (L), one gene (Preprotein translocase SecY) of the cell motility and secretion category (N), and one gene (O-sialoglycoprotein endopeptidase) of the posttranslational modification, protein turnover, chaperones category (O) (Tatusov et al. 2001). Of the original 311 species, 28 were omitted because they were not represented in all genes, resulting in a data set of 283 species and 18,586 amino acid sites in concatenation. After removal of multiple strains of the same species, GBlocks 0.91b (Castresana 2000) was applied to delete poorly aligned sites (i.e., sites with gaps in more than 50% of the species and conserved in less than 50% of the species). Phylogenetic trees with multiple stringency levels were built and the level selected was the one that maximized the number of monophyletic eubacterial classes and their bootstrap support (see Appendix B for detailed description). Preliminary phylogenetic analyses showed a potential bias caused by the presence in the data set of one species of the Phylum Deinococcus-Thermus, most likely owing to its thermophilic adaptations (Omelchenko et al. 2005). We decided to remove this species, so that the final data set included 218 species and 6,884 sites (37% of the original alignment). This data set was analyzed with ML (RAxML v. 2.2.1, JTT+gamma distribution) (Stamatakis 2006) and distance methods (NJ) (MEGA4, NJ, JTT+gamma distribution, pairwise deletion of gaps). 
In the ML analysis the data set was partitioned in order to allow the optimization of parameters for each gene. The alpha values (gamma distribution) for each gene were estimated (ML), averaged, and used in the NJ analysis. Phylogenetic confidence was estimated with 100 bootstrap replicates in both methods. Furthermore, a consensus tree from the 25 ML single gene trees was built using the program Consense of the Phylip package (Felsenstein 1989). This tree showed a generally poor phylogenetic signal in single genes for backbone relationships among classes and supported the use of a concatenation of genes to increase the signal to noise ratio (Appendix B, Fig. B3).
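The site-filtering criterion quoted above (gaps in more than 50% of the species, conservation in less than 50% of the species) can be illustrated with a minimal per-column sketch. This is not GBlocks itself, which works on blocks of contiguous positions with several further parameters; the function name `filter_columns`, its thresholds as keyword arguments, and the toy alignment are assumptions made only for illustration.

```python
from collections import Counter


def filter_columns(alignment, max_gap_fraction=0.5, min_conserved_fraction=0.5):
    """Keep alignment columns that pass a gap and a conservation threshold.

    alignment: list of equal-length strings; '-' marks a gap.
    Returns the filtered alignment and the indices of the kept columns.
    Hypothetical simplification: each column is judged independently.
    """
    n_seqs = len(alignment)
    kept = []
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment]
        gap_fraction = column.count("-") / n_seqs
        residues = [c for c in column if c != "-"]
        # Fraction of all sequences that share the most common residue
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        conserved_fraction = top / n_seqs
        if gap_fraction <= max_gap_fraction and conserved_fraction >= min_conserved_fraction:
            kept.append(col)
    filtered = ["".join(seq[c] for c in kept) for seq in alignment]
    return filtered, kept


# Toy alignment (not thesis data)
toy = ["MKV-A", "MKL-A", "MRI-G", "M-TQA"]
filtered, kept = filter_columns(toy)

# Fraction of sites retained in the final protein data set, from the text
retained = 6884 / 18586
```

With the toy alignment, the variable third column and the gap-rich fourth column are dropped. The retained fraction computed from the site counts in the text rounds to the 37% reported.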

Additional analyses were carried out on a data set created by applying the Slow-Fast (SF) method (Brinkmann and Philippe 1999; Philippe et al. 2000) to the original concatenation and building the phylogeny as described above. This method progressively eliminates variable sites from the data set (i.e., sites with a number of changes above a threshold), leaving only slowly evolving positions to estimate the phylogeny. The stringency of the SF method was selected with the same criterion used for GBlocks (i.e., maximization of monophyletic classes) (Appendix B). Furthermore, two additional data sets were built: one with only temperature mesophiles (i.e., species with growth temperature between 11 and 45 °C) (Mesophiles-only data set), and another with all species but a higher stringency for HGT and rate variation detection (Strict data set) (detailed information about these data sets is in Appendix B). These new data sets were analyzed with the same procedure applied to the entire data set (i.e., GBlocks, ML and NJ analyses).
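The idea of discarding sites whose number of changes exceeds a threshold can be sketched as follows. This is a simplified illustration, not the published SF procedure. It assumes predefined groups of species and uses the number of distinct states within each group, minus one, as a lower bound on the number of changes at a site. The names `changes_per_site` and `slow_sites` and the toy sequences are hypothetical.

```python
def changes_per_site(alignment, groups):
    """Lower bound on the number of changes at each site.

    alignment: dict mapping species name -> aligned sequence.
    groups: list of lists of species names (assumed predefined groups).
    For each group, (distinct non-gap states - 1) is the minimum number
    of changes needed within that group; values are summed over groups.
    """
    length = len(next(iter(alignment.values())))
    counts = []
    for col in range(length):
        total = 0
        for group in groups:
            states = {alignment[name][col] for name in group} - {"-"}
            total += max(len(states) - 1, 0)
        counts.append(total)
    return counts


def slow_sites(alignment, groups, max_changes):
    """Indices of sites with at most max_changes inferred changes."""
    counts = changes_per_site(alignment, groups)
    return [i for i, c in enumerate(counts) if c <= max_changes]


# Toy data (not thesis data)
toy = {"a1": "ACGT", "a2": "ACGA", "a3": "ACTC", "b1": "ACGG", "b2": "ATGG"}
toy_groups = [["a1", "a2", "a3"], ["b1", "b2"]]
counts = changes_per_site(toy, toy_groups)
```

Lowering `max_changes` step by step yields progressively smaller data sets that contain only the slowest positions, which mirrors the progressive elimination described in the text.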

3.3.1.2 Nucleotide data set

SSU and LSU sequences available at the European Ribosomal Database (Van de Peer et al. 2000; Wuyts, Perriere, and Van de Peer 2004) were used in their aligned form. The alignment was based on the secondary structure of rRNA using Methanococcus jannaschii and Sulfolobus acidocaldarius as models (Van de Peer et al. 2000). Two sequences for Archaebacteria, Methanopyrus kandleri and Nanoarchaeum equitans, were added and manually aligned because they were absent from the original data set. The sequences for the two subunits were concatenated. As for the protein data set, GBlocks was applied to remove non-conserved sites and the stringency level was chosen using a criterion based on monophyly of eubacterial classes (Appendix B). The final data set was composed of 189 species and 3,786 sites (approximately 60% of the original alignment) (see Appendix B, Table B3 for a complete list of species and classification). ML and NJ trees were built with RAxML and MEGA using GTR+gamma (the two subunits were partitioned) and Tamura-Nei+gamma (with the alpha value of the gamma distribution set to the average of the values estimated by the ML analysis), respectively. An additional data set was created using the SF method (Appendix B) and analyzed as explained above.

3.3.2 Time estimation 3.3.2.1 Protein data set The ML phylogeny obtained with the final data set analysis was used to estimate the times of class divergences. Analyses of Eubacteria and Archaebacteria were carried out separately using reciprocal rooting. One representative per class was chosen for a total of 21 ingroup eubacterial species and ten ingroup archaebacterial species. Five additional data sets were created using randomly chosen eubacterial species to test for sampling bias. Divergence times were estimated with a Bayesian method, Multidivtime T3 (Thorne and Kishino 2002), both with partitioned (T3p) and non partitioned (T3np) genes, and rate smoothing methods: nonparametric rate smoothing (NPRS) and penalized likelihood (PL) (Sanderson 1997). An appropriate rate smoothing factor could not be determined under penalized likelihood and therefore this method was substituted by the Langley-Fitch (LF) method. In the Bayesian and rate smoothing methods the branch lengths were estimated with a JTT+gamma model using the programs Estbranches (Thorne and Kishino 2002),

and PAML (Yang 1997) for the two methods respectively. Variable substitution rates among branches were allowed, with variations happening in an autocorrelated fashion. Multiple calibration points were used in both the eubacterial and archaebacterial data sets. We used three calibrations within Eubacteria. The first was a maximum boundary for the ingroup root node at 4.2 Ga, which is the mid-point of the time range estimated for the last ocean-vaporizing event (Sleep et al. 1989). The second is a minimum boundary for the divergence of Chlorobia and Bacteroidetes at 1.64 Ga, based on biomarker evidence for chlorobactane in the Barney Creek Formation of the MacArthur Group, Northern Australia (Brocks et al. 2005). The third is a minimum boundary for the divergence of Gamma- and Betaproteobacteria at 1.64 Ga, which comes from biomarker evidence of okenane in the Barney Creek Formation of the MacArthur Group, Northern Australia (Brocks et al. 2005). For the primary time estimation analyses, we avoided additional calibrations that included cyanobacteria or involved oxygen metabolism so that we could draw inferences about those organisms and metabolisms. However, two additional calibrations were used to test the robustness of the time estimates. One was a minimum at 2.3 Ga for the divergence of Cyanobacteria and Dehalococcoidetes (Phylum Chloroflexi), corresponding to the presence of oxygen in the atmosphere (Holland 2002). The other was a maximum of 4.0 Ga for the earliest land-dwelling taxa, corresponding to the presence of continents (Rosing et al. 2006). The small number of calibration points available for Archaebacteria is a reflection of the poor geologic record of these organisms. Fluid inclusions in dykes of the Dresser Formation (North Pole area, Pilbara craton, Western Australia) have a content of methane highly depleted in the heavy carbon isotope 13C. 
This depletion is comparable to that produced by methanogenic prokaryotes, offering a calibration point for the origin of these organisms at a minimum of 3.46 Ga (Bapteste, Brochier, and Boucher 2005; Ueno et al. 2006). A second calibration point is determined by the time of the last ocean-vaporizing event, inferred to have happened at 4.2 (maximum boundary) Ga (Sleep et al. 1989) on the ingroup root node.
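The calibration points described above can be collected in a small data structure, which makes it easy to check whether a given node age is compatible with its minimum or maximum boundary. The sketch below only records the boundaries stated in the text. The class `Calibration`, its method `allows`, and the node labels are hypothetical names and do not reflect the input format of Multidivtime or any other dating program.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Calibration:
    """A minimum and/or maximum age boundary (in Ga) on one node."""
    node: str
    min_ga: Optional[float] = None
    max_ga: Optional[float] = None

    def allows(self, age_ga: float) -> bool:
        """True if age_ga violates neither boundary."""
        if self.min_ga is not None and age_ga < self.min_ga:
            return False
        if self.max_ga is not None and age_ga > self.max_ga:
            return False
        return True


# Boundaries as stated in the text; node labels are illustrative only.
EUBACTERIA_PRIMARY = [
    Calibration("ingroup root", max_ga=4.2),
    Calibration("Chlorobia/Bacteroidetes", min_ga=1.64),
    Calibration("Gammaproteobacteria (okenane)", min_ga=1.64),
]
EUBACTERIA_ADDITIONAL = [
    Calibration("Cyanobacteria/Dehalococcoidetes", min_ga=2.3),
    Calibration("earliest land-dwelling taxa", max_ga=4.0),
]
ARCHAEBACTERIA = [
    Calibration("origin of methanogens", min_ga=3.46),
    Calibration("ingroup root", max_ga=4.2),
]
```

For example, `EUBACTERIA_ADDITIONAL[0].allows(2.56)` is true, whereas an age of 2.0 Ga for the same node would violate the 2.3 Ga minimum.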

3.3.2.2 Nucleotide data set The methods used in the analysis of the protein data set were applied also to the ML phylogeny of the combined SSU and LSU rRNA data set with a difference in the phylogenetic model used. To estimate branch lengths we used the Felsenstein 84 (F84) model (Kishino and Hasegawa 1989; Felsenstein and Churchill 1996) with estimation of gamma distribution and transition/transversion ratio.

3.4 Results For each domain we compare topology and timing from the protein and rRNA data set. In evaluating different topologies we favor those that maintain monophyly of classes and phyla, and those that show similar relationships with each other and previous studies. It was not possible to define a criterion to favor one timing method over the others. The Bayesian method and NPRS performed as expected but PL showed inconsistent results caused by the evolutionary rates of the data sets. The monotonic decrease in square-errors with increasing smoothing factor obtained under this method, in fact, suggests either a constant rate throughout the tree or rate variations that do not follow a specific pattern (Sanderson 2002). When this case occurs, use of the constant

rate molecular clock (LF) is favored, although the reliability of these time estimates remains unclear under the circumstances of uncorrelated rate variations. However, in the absence of other evidence, none of the methods can be excluded. Time estimates will therefore be discussed as averages of the four methods and their range of point estimates. Furthermore, our timetrees show a rapid radiation of Eubacteria in the late Archean (3.5−2.7 Ga). We use a half-range mode estimate with 95% confidence interval (Hedges and Shah 2003) to represent this peak in diversification.
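Reporting a node age as the average of the four methods together with the range of their point estimates amounts to the small calculation sketched below. The function name `summarize_node` and the example values (in Ma) are hypothetical and are not taken from the thesis tables.

```python
from statistics import mean


def summarize_node(point_estimates):
    """Average and range of point estimates from several dating methods.

    point_estimates: dict mapping method name -> point estimate (Ma).
    Returns (average, youngest estimate, oldest estimate).
    """
    values = list(point_estimates.values())
    return mean(values), min(values), max(values)


# Hypothetical point estimates for one node, in Ma
example = {"T3p": 2900, "T3np": 3000, "NPRS": 3100, "LF": 3400}
average, youngest, oldest = summarize_node(example)
```

Here the node would be reported as 3100 Ma with a range of 2900 to 3400 Ma. The range reflects disagreement among methods, not a statistical confidence interval.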

3.4.1 Eubacteria

3.4.1.1 Protein data set

ML and NJ phylogenies showed very similar clusters (Appendix B, Fig. B4a and b), with the only differences regarding the position of Fusobacteria and Solibacteres, and the break of the Actinobacteria/Deinococci group. These classes formed clusters with low bootstrap values in both phylogenies. Because of the similarity of the phylogenies we will focus on the results from ML. This method allowed an analysis with more complex parameters (e.g., gene partition) that, given the range of evolutionary distances present in the data set, are more likely to model the evolutionary history of Eubacteria accurately. Of the nodes representing clusters of classes or higher level taxonomic groups, 32% have a bootstrap support of 100%, while 63% have a bootstrap of 80% or higher. All classes represented by two or more species are monophyletic, with significant bootstrap support (95% or higher) for 13 out of 15 classes (Fig. 3-1) (see Appendix B for a discussion on the classification of Symbiobacterium thermophilum, herein considered a member of the Class Clostridia). All eubacterial classes, except for Aquificae, Thermotogae, and Fusobacteria, are part of a dichotomous group that clusters, on one side, low and high GC species and all gram positives, and, on the other, Proteobacteria and a group formed by organisms with very diverse metabolisms (e.g., parasites, phototrophs). In the former group, Firmicutes (low GC) are monophyletic with 100% BP support and show a closer relation of Bacilli and Mollicutes, with Clostridia as their closest relative. Cyanobacteria are related to Dehalococcoidetes (Phylum Chloroflexi) and form a higher cluster with Firmicutes (BP 48%). The closest relatives of this group are the classes Deinococci and Actinobacteria (high GC) (BP 53%). The confidence in the nodes relating Cyanobacteria, Actinobacteria, and Deinococci does not exclude a possible clustering of these as was previously found (Wolf et al. 
2002; Battistuzzi, Feijão, and Hedges 2004). In the other branch of the major dichotomy, Proteobacteria are monophyletic (BP 87%) with the following taxonomic relationships among classes: ((((Gamma, Beta), Alpha), (Delta, Solibacteres)), Epsilon). The Deltaproteobacteria form a cluster with Solibacteres, a class of the phylum Acidobacteria, as was found by Ciccarelli et al. (2006). Spirochaetes and Chlamydiae appear in the same group, although not as closest relatives, along with Bacteroidetes, Chlorobia, and Planctomycetacia (BP 82%). This group was not found in its entirety by other studies although subsets are documented in the literature (Brown et al. 2001; Battistuzzi, Feijão, and Hedges 2004; Ciccarelli et al. 2006; Pisani, Cotton, and McInerney 2007). It is the closest relative of the Phylum Proteobacteria (BP 89%). The hyperthermophiles Aquifex and Thermotoga are at the base of the tree. These phylogenetic relationships are similar to those found in previous studies, although with differences in the deeper nodes (Brown

et al. 2001; Wolf et al. 2001; Battistuzzi, Feijão, and Hedges 2004; Gophna, Doolittle, and Charlebois 2005; Ciccarelli et al. 2006; Pisani, Cotton, and McInerney 2007). The phylogeny from the SF method (ML) was identical except for the positioning of Solibacteres between Delta and (Appendix B, Fig. B6) and higher bootstrap values for most nodes. Phylogenies from the Mesophiles-only and Strict data sets were also comparable to the previous ones. The major differences were the positions of Deinococci and Actinobacteria, which stemmed from the base of the tree instead of clustering with the Firmicutes/Cyanobacteria group (Appendix B, Figs. B7-B8). There is an overall good agreement among all phylogenies, which increases our confidence in the backbone structure of the eubacterial tree. In the timing analyses, estimates from T3p and T3np are similar, differing by less than 4% (stdev = 0.04) (Appendix B, Table B4). T3np also shows wider credibility intervals, spanning on average 588 million years instead of 246 million years for T3p (Appendix B, Table B4). The T3np analysis was carried out with three and five calibration points to check for robustness of time estimates with increasing time constraints. Because the two analyses showed very similar results (2% average difference, stdev = 0.01) we decided to use only three calibration points because this allowed us to draw conclusions on the origin of Cyanobacteria and oxygenic metabolism. We also tested the effect of selecting different species to represent each class. Time estimates of the original data set and five other data sets showed similar results (coefficient of variation = 2.37%, stdev = 2.04), indicating that the use of different species did not result in major differences in time estimates. The two rate smoothing methods, NPRS and LF, show similar results with an average increase in the LF time estimates of 6% (stdev = 0.03). 
Interestingly, estimated divergence times show, for most nodes, an increase from the T3p to the LF method (i.e., from a complex to a simpler method), possibly explained by an increasing difficulty of simpler methods to depict the true chronological history of Eubacteria. Half-range modes with 95% confidence intervals were estimated for each of the timing methods and the average shows a peak of divergences in the late Archean (2.75 ± 0.12 Ga). This peak coincides with the divergence of Cyanobacteria between 2.84 (2.60−3.12) Ga and 2.57 (2.42−2.8) Ga. Other time estimates of interest include the origin of the two branches of the major dichotomy: (i) (((((Gamma, Beta), Alpha),(Delta, Solibacteres)), Epsilon), ((Spirochaetes, Planctomycetacia), (Chlorobia, Bacteroidetes), Chlamydiae)) (cluster A in Fig. 3-2), and (ii) ((Firmicutes, (Cyanobacteria, Chloroflexi)), (Deinococci, Actinobacteria)) (cluster B in Fig. 3-2). The ancestor of these two clusters diverged 3.18 (2.83−3.54) Ga. Thus, the evolution of the two clusters happened between 3.18 (2.83−3.54) Ga and 3.03 (2.69−3.4) Ga for cluster A, and between 3.18 (2.83−3.54) Ga and 2.97 (2.66−3.28) Ga for cluster B. Within cluster A, the origin of Planctomycetacia is estimated between 2.96 (2.62−3.34) and 2.56 (2.41−2.86) Ga. Estimated divergence times for the hyperthermophiles are 4.16 (4.05−4.2) Ga for Thermotoga and 4.0 (3.93−4.17) Ga for Aquifex. A timetree with the divergence of all nodes is shown in Fig. 3-2.
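The sampling test reported above, in which time estimates from data sets with different representative species are compared through a coefficient of variation, can be illustrated with a short sketch. It assumes the sample standard deviation, since the text does not say which form was used. The function name `coefficient_of_variation` and the six example estimates (in Ma) are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return stdev(values) / mean(values) * 100


# Hypothetical estimates (Ma) of one node from six species samplings
estimates = [2500, 2550, 2450, 2600, 2400, 2500]
cv_percent = coefficient_of_variation(estimates)
```

For these illustrative values the coefficient of variation is about 2.8%. A low value of this kind is what supports the statement that the choice of representative species has little effect on the time estimates.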

3.4.1.2 Ribosomal RNA data set As for the protein data set, the phylogeny of the concatenation of SSU and LSU rRNA was estimated with both ML and NJ. The two phylogenies are similar overall

(Appendix B, Fig. B5a and b), differing only in the position of poorly supported groups (Fibrobacteres, Chlorobia/Bacteroidetes, Actinobacteria, and Clostridia). However, the NJ tree has an overall lower support (44% of the nodes have a bootstrap value ≥ 80% versus 65% of the ML phylogeny), and Spirochaetes do not form a monophyletic group. The SF phylogeny also differs from the previous ones at poorly supported nodes (Chlorobia/Bacteroidetes, Cyanobacteria, Actinobacteria) (Appendix B, Fig. B9). For these reasons, and because the phylogeny of Archaebacteria is also problematic (see below) in the distance tree, we focus on the ML phylogeny from the complete nucleotide data set for discussion of phylogenetic relationships and timing. All classes represented by more than one species are monophyletic, with 75% of them having a bootstrap value ≥ 95% (Fig. 3-3) (see Appendix B for a discussion on the classification of Zoogloea ramigera herein considered an alphaproteobacterium). The Phylum Proteobacteria is monophyletic with the same relationships among classes as in the protein data set (Solibacteres is not represented in this data set) (BP 95%). Also cluster C (Fig. 3-2) is present, although the relationships within the group show some differences compared to the protein data set. Fibrobacter succinogenes (Phylum Fibrobacteres) clusters within this group, albeit with low bootstrap support. A comparable position for this phylum was found by Ciccarelli and colleagues (Ciccarelli et al. 2006). The association of cluster C with Proteobacteria is highly supported (BP 89%) as in the protein data set. Firmicutes are monophyletic and show the same taxonomic relationships as previously described in the protein data set. The hyperthermophiles are at the base of the tree followed by the Deinococci class and by Cyanobacteria, both represented by a single species. The position of T. thermophilus (Class Deinococci) and Synechocystis sp. 
(Phylum Cyanobacteria) is similar to previous rRNA studies (Olsen, Woese, and Overbeek 1994) but contradicts a later emergence of these lineages suggested by our and other protein based analyses (Wolf et al. 2002; Battistuzzi, Feijão, and Hedges 2004; Ciccarelli et al. 2006; Pisani, Cotton, and McInerney 2007). An estimation of GC content in the conserved sites of the rRNA sequences of Aquifex aeolicus, Thermotoga maritima, and Thermus thermophilus showed the same enrichment (73-74%) in GC caused by their hyperthermophilic and thermophilic lifestyle. This could explain the position of the Class Deinococci, as phylogenies from DNA sequences are known to be affected by compositional biases (Hasegawa and Hashimoto 1993; Foster and Hickey 1999). Numerous lines of evidence from protein-based analyses have shown a general association of Cyanobacteria with gram positive lineages (i.e., Bacilli, Clostridia, Chloroflexi, Actinobacteria, and Deinococci) (Wolf et al. 2002; Zhaxybayeva and Gogarten 2002; Raymond et al. 2003; Battistuzzi, Feijão, and Hedges 2004; Ciccarelli et al. 2006). It is possible that a combination of the deep position of Thermus thermophilus and low discriminating power of the rRNA sequences for these lineages is forcing this phylum into a deep position and breaking the cluster of terrestrial lineages (cluster B in Fig. 3-2) found in our protein data set. Divergence times were estimated with the same procedure used for the protein data set. The average differences between T3p/T3np and NPRS/LF were less than 5% (stdev < 0.05), with the LF method producing generally older times than NPRS. Analyses with three and five calibrations produced very similar results (average difference = 0.1%, stdev = 0.001). Credibility intervals for T3p and T3np were also very similar, spanning an average of 436 million years for the first one and 431 million years for the latter

(Appendix B, Table B4). Average mode estimates of the time distribution for each method (T3p, T3np, NPRS and LF) show a peak in the early Proterozoic (average = 2.44 ± 0.12 Ga), with the mode from the LF method being approximately 6% older than the one from the more complex T3p method (the same trend was found in the protein data set although increased in magnitude). However, in this case, it is not possible to make a comparison between this burst of divergence and the origin of the Cyanobacteria due to their deep phylogenetic position, which, if correct, would place their origin 2.90 (2.84−2.95) Ga. Divergence time for cluster A' (Fig. 3-2) is between 2.77 (2.71−2.85) Ga and 2.61 (2.54−2.74) Ga, while the origin of Planctomycetacia is between 2.41 (2.35−2.58) Ga and 2.27 (2.18−2.42) Ga. The hyperthermophiles diverged between 4.16 (4.12−4.2) Ga and 3.64 (3.53−3.72) Ga for Aquifex and Thermotoga respectively. A timetree of this phylogeny is represented in Fig. 3-2. Because of slightly different topologies, it was not possible to make a direct comparison of time estimates between the rRNA and protein data sets. However, for those branches that subtend the same or similar groups, time estimates from the rRNA data set are generally younger (e.g., cluster A diverged a minimum of 3.03 Ga in the protein data set, cluster A' diverged 2.61 Ga in the rRNA data set).
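The GC-content comparison used above to explain the deep position of thermophilic lineages in the rRNA tree rests on a simple per-sequence calculation, sketched here. The function name `gc_content` and the example sequences are hypothetical. Gaps and ambiguous characters are ignored, which is an assumption about how conserved sites would be counted.

```python
def gc_content(sequence):
    """Fraction of G and C among unambiguous bases of an aligned sequence.

    Gap characters and ambiguity codes are ignored; U and T are both
    accepted so that RNA and DNA alignments can be used.
    """
    bases = [b for b in sequence.upper() if b in "ACGTU"]
    if not bases:
        return 0.0
    gc = sum(1 for b in bases if b in "GC")
    return gc / len(bases)


# Hypothetical aligned rRNA fragments (not thesis data)
thermophile_like = "GGCCGGCAUA"
with_gaps = "GC-GC-AU"
```

Applied to each species' conserved sites, values that cluster by growth temperature (for example near the 73-74% reported for the three thermophiles) would point to a compositional signal that can compete with the phylogenetic one.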

3.4.3 Archaebacteria 3.4.3.1 Protein data set Recent analyses of the phylogeny of Archaebacteria have found that the two phyla in this domain, Euryarchaeota and Crenarchaeota, are monophyletic (Brown et al. 2001; Battistuzzi, Feijão, and Hedges 2004; Ciccarelli et al. 2006; Pisani, Cotton, and McInerney 2007). In contrast, the NJ analysis of our protein data set did not support monophyly of Euryarchaeota, and most of the nodes were poorly supported (63% of the nodes have a bootstrap support of less than 50%, and only one node is significantly supported). The ML analysis, instead, produced a tree showing monophyly of the two phyla, with 50% of the nodes supported by a bootstrap value ≥ 95% (Fig. 3-4 and Appendix B, Fig. B4a and b). Because this same phylogeny is supported also by the SF method (Appendix B, Fig. B6) we will use it to discuss the evolution of Archaebacteria. All classes are monophyletic and with significant support, except for Methanomicrobia (BP 69%). Excluding Nanoarchaeota, the relationships are identical to those found previously (Gribaldo and Brochier-Armanet 2006) and show a division of methanogens in two groups, Class I and Class II (Bapteste, Brochier, and Boucher 2005), although monophyly of methanogens have been proposed based on uniquely shared proteins (Gao and Gupta 2007). The classes Thermoplasmata, Archaeoglobi, Methanomicrobia, and Halobacteria are strongly clustered together, stemming in this order from older to younger, while Thermococci is the first diverging lineage of Euryarchaeota. The position of Nanoarchaeum equitans, the only representative of Nanoarchaeota – a putative third phylum in the Domain Archaebacteria, is debated. Recent studies, based on archaebacterial sequences only, suggest its positioning within the Euryarchaeota (Brochier et al. 2005; Makarova and Koonin 2005) but phylogenetic studies including also Eubacteria and Eukaryotes have not found this result (Waters et al. 2003; Ciccarelli et al. 
2006; Pisani, Cotton, and McInerney 2007). Given the highly reduced genome of this organism and its apparently parasitic lifestyle, it cannot be excluded that its position in our and other analyses results from different substitution rates and genomic rearrangements (Makarova and Koonin 2005).

The same timing procedure used for Eubacteria was applied to Archaebacteria. The divergence times estimated with T3p and T3np differed by less than 10%, with the majority of nodes being older for the first method (Appendix B, Table B4). Most divergence times are in the Archean, with an average mode among all methods of 3.32 ± 0.3 Ga. This is in agreement with a previous estimate of archaebacterial divergences using a calibration point within eukaryotes (Battistuzzi, Feijão, and Hedges 2004). Excluding the position of Nanoarchaeum, given the debate mentioned above, the first divergence is between Crenarchaeota and Euryarchaeota at 4.01 (3.90−4.19) Ga, which would imply an even earlier origin of life. The cluster including Thermoplasmata, Archaeoglobi, Halobacteria, and Class II methanogens (cluster A, Fig. 3-6) originated 3.14 (3.1−3.19) Ga.

3.4.3.2 Ribosomal RNA data set

As in the protein data set, the NJ tree of the rRNA data set shows the Phylum Crenarchaeota deeply nested within the Phylum Euryarchaeota, suggesting a potential bias (Appendix B, Fig. B5a and b). In contrast, the two phyla are found to be monophyletic in the ML tree (Fig. 3-5), and the cluster formed by Thermoplasmata, Archaeoglobi, Halobacteria, and Methanomicrobia is identical to that of the protein data set (BP 97%). The overall phylogeny resembles that obtained previously with SSU rRNA (Olsen, Woese, and Overbeek 1994) and shows the thermophilic methanogens (the classes Methanopyri, Methanobacteria, and Methanococci) stemming in a ladder-like fashion at the base of the tree of Euryarchaeota, along with the Class Thermococci (a similar phylogeny is found with the SF method; Figure S7). This result may be explained by a lack of phylogenetic signal, given the fewer sites available compared to the protein data set and the low bootstrap values supporting these nodes (Brochier, Forterre, and Gribaldo 2004). The position of Nanoarchaeum is weakly supported as a sister group of Crenarchaeota (BP 60%). The Crenarchaeota are all members of the same class, Thermoprotei, and the relationships of the orders within that class agree with those found previously (Gribaldo and Brochier-Armanet 2006). As in the protein data set, T3p time estimates were generally older than T3np times, with smaller credibility intervals for T3p (on average 719 million years for T3p and 873 million years for T3np) (Appendix B, Table B4). The first divergence in the Archaebacteria rRNA tree is the division of Euryarchaeota and Crenarchaeota/Nanoarchaeum, which is timed at 3.98 (3.71−4.05) Ga, suggesting, as in the protein data set, an early origin of life (before ~4.0 Ga). Methanopyrus kandleri is deep in the tree of Euryarchaeota, a placement that has been questioned on the basis of phylogenies from larger data sets (Brochier, Forterre, and Gribaldo 2004).
As expected, its time of origin is older than the one obtained from the protein data set because of its deeper phylogenetic position (3.72 Ga; 3.47−3.9 Ga). The only cluster common to protein and rRNA phylogenies (clusters A and A', Fig. 3-6) is younger in the latter, with a time estimate of 2.76 (2.51−2.96) Ga. The average mode estimate for the rRNA divergence times is similar to that estimated with the protein data set (approximately 1% younger). Nonetheless, some nodes show a relatively large difference in divergence time (e.g., Halobacteria/Methanomicrobiales: 2.05 Ga with the protein data set and 1.49 Ga with the rRNA data set). This is most likely because of differences in the relationships obtained with the two data sets.
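The size of these discrepancies can be made explicit with a short calculation. The sketch below is only an illustration: the function name is hypothetical, and the input values are the point estimates quoted above for the protein and rRNA data sets.

```python
def relative_difference(protein_ga, rrna_ga):
    """Fraction by which the rRNA estimate is younger than the protein estimate."""
    return (protein_ga - rrna_ga) / protein_ga


# Point estimates (Ga) quoted in the text: (protein, rRNA)
estimates = {
    "cluster A / A'": (3.14, 2.76),
    "Halobacteria/Methanomicrobiales": (2.05, 1.49),
}

for node, (protein_ga, rrna_ga) in estimates.items():
    fraction = relative_difference(protein_ga, rrna_ga)
    print(f"{node}: rRNA estimate is {fraction:.1%} younger")
```

With these inputs, the rRNA estimate is roughly 12% younger for the shared cluster and roughly 27% younger for the Halobacteria/Methanomicrobiales node, which shows how much larger the discrepancy is for the node whose relationships differ between the two trees.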

3.5 Discussion

Multiple studies have highlighted the difficulty of obtaining a consistent picture of prokaryote evolution, even with the amount of data available from whole genomes. Comparisons of previous studies carried out with single methods have started to show general trends in taxonomic clusters (Wolf et al. 2002), but different species complements and methods often make these comparisons difficult to perform. To compensate for these obstacles, we have applied the same phylogenetic procedures to a set of prokaryote species represented in two different data sets: (i) a protein-based data set from 25 core genes (32 classes), and (ii) an rRNA data set from the small and large subunits (30 classes). SSU rRNA was widely used at the beginning of phylogenetic studies because of its universal presence in all organisms, its slow rate of evolution, and its ease of sequencing (Woese and Fox 1977; Fox et al. 1980). Unfortunately, as more genes became available, contrasting phylogenies were found, and these eroded the initial confidence in the reliability of the topology estimated from SSU sequences (Gogarten, Doolittle, and Lawrence 2002). LSU rRNA sequences have been used in addition to SSU sequences occasionally to build trees of life (De Rijk et al. 1995) and more frequently in narrower taxonomic groups to increase phylogenetic signal and provide an additional point of view on unresolved phylogenetic issues (e.g., Mallatt and Winchell 2002; da Silva, Muschner, and Bonatto 2007; Le Gall and Saunders 2007). The availability of a large number of LSU prokaryotic sequences in databases such as the European Ribosomal Database provided us with the opportunity to update the phylogenetic tree of prokaryotes based on these genes and to make comparisons of phylogeny and timing.

3.5.1 Protein vs. ribosomal RNA, maximum likelihood vs. distance

Ideally, different data sets and phylogenetic methods should produce identical results. We have analyzed the backbone phylogeny of prokaryotes with various methods (ML, NJ, and SF for each of the protein and rRNA data sets) and compared the results. We based the evaluation of the performance of each method and data set on the number of monophyletic classes and phyla that resulted, confidence values, and agreement among methods. NJ phylogenies generally showed lower bootstrap values and a non-monophyletic class and phylum (Class Spirochaetes and Phylum Euryarchaeota) in most of the phylogenies. Furthermore, in the case of the archaebacterial phylogenies, the SF topologies show higher similarity with ML than with NJ. When possible, the models used under ML and NJ were identical (i.e., in the protein data set JTT was used for both ML and NJ; in the rRNA data set GTR was used for ML and Tamura-Nei for NJ), but the concatenation of genes was treated differently under ML and NJ. In the former it was partitioned into single genes so that the parameters of the likelihood function were optimized for each gene separately. In NJ, instead, the concatenation was used as a “supergene” (i.e., one gene formed by the fusion of all single genes), effectively averaging the parameters among all genes. The higher complexity of the ML model used is more likely to reliably represent evolutionary histories (Pupko et al. 2002). For these reasons ML topologies were favored.
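The difference between a partitioned analysis and a "supergene" analysis can be illustrated with a toy calculation. The sketch below is not the procedure used in this study: it substitutes a simple proportion of differing sites (p-distance) for the JTT and GTR model parameters, and the gene names and sequences are invented for illustration. It only shows that a per-gene treatment keeps gene-specific values, whereas concatenation yields a single, length-weighted average.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


# Hypothetical aligned sequence pairs for two genes (illustrative only)
genes = {
    "gene1": ("ACGTACGTAC", "ACGTACGTAC"),  # slowly evolving: 0 of 10 sites differ
    "gene2": ("ACGTA", "TCGAA"),            # faster evolving: 2 of 5 sites differ
}

# Partitioned treatment: one estimate per gene
partitioned = {name: p_distance(a, b) for name, (a, b) in genes.items()}

# "Supergene" treatment: concatenate all genes, then estimate once
supergene_a = "".join(a for a, _ in genes.values())
supergene_b = "".join(b for _, b in genes.values())
supergene = p_distance(supergene_a, supergene_b)

print(partitioned)  # per-gene values differ between genes
print(supergene)    # one value: 2 differences over 15 sites
```

In this toy case the per-gene values are 0.0 and 0.4, while the concatenation reports a single value of 2/15, which describes neither gene well. This is the sense in which concatenation averages the parameters among genes.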

Some clusters supported by high bootstrap values are shared by the ML phylogenies of the two data sets, providing good evidence of their reliability. The deeper branches of the rRNA trees in both domains are the ones that show the strongest differences with the protein data set. Even if the accuracy of the deeper part of the rRNA tree cannot be disproved, some evidence points to a higher reliability of the protein data set. The deep positions of Deinococci, Cyanobacteria (Domain Eubacteria), and Methanopyri (Domain Archaebacteria), for example, can be explained by artifacts caused by compositional biases and the lower resolving power of the smaller rRNA data set. The eubacterial phylogenies (ML, NJ, and SF) from the protein data set show high similarity regardless of the phylogenetic method used, suggesting a robustness of topology absent from the rRNA tree. Finally, the protein data set provides a larger amount of information with an almost doubled number of sites and numerous functional categories represented by the core genes. Given the evidence, in the discussion below we favor ML topologies in both data sets, use the agreement between protein and rRNA ML phylogenies to highlight well supported groups, and favor the discussion of the protein topology in cases when disagreements exist.

3.5.2 Taxonomy of prokaryotes

At the lowest taxonomic level considered in this study (i.e., class), protein and rRNA phylogenies are characterized by monophyly of all classes of Eubacteria and Archaebacteria, and of the major phyla (Proteobacteria, with the exception of the Class Solibacteres; Firmicutes; Euryarchaeota; and Crenarchaeota) (Figs. 3-1, 3-3, 3-4, 3-5). This confirms each class to be a coherent group, as found by physiological and cytological analyses (Holt 1984), with the exception of two species found clustering outside their assigned category. Symbiobacterium thermophilum, present in the protein data set only, was originally assigned to the Class Actinobacteria, but our analysis consistently places it within the Phylum Firmicutes (Class Clostridia) (Ueda et al. 2001; Ueda et al. 2004; Gao, Paramanathan, and Gupta 2006; Pisani, Cotton, and McInerney 2007). Zoogloea ramigera, represented only in the rRNA data set, was considered a betaproteobacterium, but our analysis shows its relation to the Order Rhizobiales (Shin, Hiraishi, and Sugiyama 1993) (Appendix B).

3.5.2.1 Eubacteria

The deepest nodes in the protein and rRNA phylogenies are occupied by the hyperthermophiles Aquifex and Thermotoga, a position that is debated because of compositional biases in these sequences (Brochier and Philippe 2002). Nonetheless, the agreement of the two data sets and the consistency of this positioning regardless of the phylogenetic method used increase the confidence in an early origin of hyperthermophiles. As mentioned above, the rRNA phylogeny of the deep nodes presents a ladder-like structure that breaks a clade (cluster B, Fig. 3-2) composed mostly of gram positives and lineages with terrestrial adaptations. Although in agreement with previous SSU rRNA studies (Olsen, Woese, and Overbeek 1994), this topology, and the position of Cyanobacteria in particular, is in contrast with evidence from protein studies, which identify an association of this phylum with gram positives (Zhaxybayeva and Gogarten 2002; Raymond et al. 2003; Pisani, Cotton, and McInerney 2007). For this and other reasons (§ 3.5.1) we favor the phylogeny obtained from the protein data set for cluster B. The other branch of the major dichotomy in the protein data set (cluster A, Fig. 3-2), and the relations therein, is also represented in the rRNA tree (cluster A', Fig. 3-2).

Based on the phylogenies obtained in this study and on metabolic considerations (see below), we propose redefinitions of some existing taxa, and new taxa. We propose the following five new names of major taxa of prokaryotes: (i) the Superphylum Bacterobi for the group that includes the Phyla Bacteroidetes and Chlorobi, (ii) the Superphylum Spiroplancti for the group that includes the classes Spirochaetes and Planctomycetacia, (iii) the Infrakingdom Spirochlamydiae for the group that includes Bacterobi, Chlamydiae, and Spiroplancti, (iv) the Subkingdom Hydrobacteria (from the Greek hydro, water, in allusion to the moist environment favored by these species) (cluster A and A', Fig. 3-2) that includes Spirochlamydiae and Proteobacteria, and (v) the Kingdom Heliocytes (from the Greek helios, sun, and kytos, cell, in allusion to the innovation of photosynthesis in this group) for the group that includes the subkingdoms Hydrobacteria and Terrabacteria. Besides those new names, we elevate Terrabacteria (Battistuzzi, Feijão, and Hedges 2004) to the rank of subkingdom (cluster B, Fig. 3-2) and redefine it to include the original phyla Cyanobacteria, Actinobacteria, and Deinococcus-Thermus, and two additional phyla: Chloroflexi and Firmicutes. We also elevate the current Phylum Proteobacteria to the rank of Infrakingdom, the Phylum Chlamydiae to the rank of Superphylum, and the phyla Thermotogae and Aquificae to the rank of Kingdom. Bacterobi, Spirochlamydiae, and Hydrobacteria are supported by both data sets (proteins and rRNA) with bootstrap supports > 80% (except for Spirochlamydiae in the rRNA, BP 13%).
Fusobacteria is a phylum represented by a single complete genome that, in our protein phylogeny, is placed deep in the tree after Thermotogae and Aquificae. This phylum is absent from the rRNA data set. Although this lineage is generally considered a close relative of Firmicutes (Mira et al. 2004), alternative positions have been found (Gupta 2003; Ciccarelli et al. 2006; Pisani, Cotton, and McInerney 2007). Furthermore, many features of the Fusobacterium nucleatum genome seem to have originated from extensive HGT events with other lineages (Mira et al. 2004). In light of these uncertainties, the position of Fusobacteria should be considered with caution, as should its exclusion from the Kingdom Heliocytes (Appendix B, Fig. B10).

3.5.2.2 Archaebacteria

Although claims have been made for a robust phylogeny of Archaebacteria (Gribaldo and Brochier-Armanet 2006), recent analyses have shown variable phylogenies within this domain (e.g., Wolf et al. 2002; Gophna, Doolittle, and Charlebois 2005; Ciccarelli et al. 2006; Pisani, Cotton, and McInerney 2007). Given the lower number of genomes available and the limited known diversity of Archaebacteria compared to Eubacteria, this variability is comparable to the one present in Eubacteria, as virtually no class-level cluster is common to all phylogenies. Especially variable has been the position of Halobacteria and Thermoplasmata, which are influenced by their high percentage of horizontally transferred genes with Eubacteria and Crenarchaeota, respectively (Gribaldo and Brochier-Armanet 2006).

Despite the use of LSU in addition to SSU rRNA sequences, our rRNA phylogeny shows substantially the same topology as previous studies (Olsen, Woese, and Overbeek 1994; Schleper, Jurgens, and Jonuscheit 2005). In particular, the position of Methanopyri as the deepest lineage of the Phylum Euryarchaeota has been questioned by recent studies, which instead place it close to the other hyperthermophilic methanogens, Methanobacteria and Methanococci (Brochier, Forterre, and Gribaldo 2004). Moreover, the rRNA topology is poorly supported, except for two groups (Fig. 3-5), further decreasing confidence in its reliability. As discussed in § 3.5.1, features of the protein data set and its phylogeny argue in favor of the protein topology. Nevertheless, the two phylogenies share a common group, (Thermoplasmata, (Archaeoglobi, (Halobacteria, Methanomicrobia))) (clusters A and A' in Fig. 3-6), supported by significant bootstrap values (BP 97% in the rRNA, BP 100% in the protein data set). The high confidence in this cluster lends further support to the paraphyly of methanogens, which are clustered in two monophyletic groups by the core genes, Class I and Class II (Bapteste, Brochier, and Boucher 2005).
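The statements about monophyly above can be checked mechanically on a rooted tree. The following sketch is an assumption-laden illustration, not the software used in this study: the tree is written as nested tuples, the shared cluster follows the parenthetical notation given above, and the arrangement of the remaining lineages (including the branching order within Class I) is a simplified, hypothetical placement chosen only to make the example runnable.

```python
def leaves(tree):
    """Return the set of taxon names at the tips of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    tips = set()
    for child in tree:
        tips |= leaves(child)
    return tips


def is_monophyletic(tree, taxa):
    """True if some subtree of the rooted tree contains exactly the given taxa."""
    taxa = set(taxa)
    if leaves(tree) == taxa:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, taxa) for child in tree)


# Shared cluster, as written in the text
cluster_a = ("Thermoplasmata", ("Archaeoglobi", ("Halobacteria", "Methanomicrobia")))

# Hypothetical, simplified placement of the other euryarchaeal classes
class_i = ("Methanopyri", ("Methanobacteria", "Methanococci"))
toy_tree = ("Thermococci", (class_i, cluster_a))

methanogens = {"Methanopyri", "Methanobacteria", "Methanococci", "Methanomicrobia"}

print(is_monophyletic(toy_tree, leaves(cluster_a)))  # shared cluster forms a clade
print(is_monophyletic(toy_tree, leaves(class_i)))    # Class I forms a clade
print(is_monophyletic(toy_tree, methanogens))        # all methanogens do not
```

Under this toy topology, the shared cluster and Class I are each recovered as clades, whereas the methanogens as a whole are not, because Methanomicrobia (Class II) is nested within the shared cluster alongside non-methanogenic classes.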

3.5.3 Evolution and adaptation of prokaryotes on early Earth

Because of the differences between protein and rRNA phylogenies, unless otherwise stated, the time estimates reported here are the average between the two data sets when the taxonomy coincides and the times from the protein data set when they differ (see above). It is reasonable to postulate that part of the difficulty in determining the phylogenetic relationships among eubacterial classes is due to a rapid diversification that this group underwent early in its evolutionary history (Gribaldo and Brochier-Armanet 2006). Indeed, our timetrees show a rapid radiation of Eubacteria that follows the divergence of the hyperthermophiles Thermotoga and Aquifex. Although the peak of the divergences happened in the late Archean (average mode: 2.59 ± 0.12 Ga), closely spaced speciation events happened throughout the middle to late Archean, from 3.5 Ga to 2.5 Ga. The overall picture that is represented in our timetrees shows a progressive adaptation of prokaryotes from submarine hyperthermophilic and thermophilic environments (stage 1) to the mesophilic photic zone of oceans (stage 2) and terrestrial habitats (stage 3) (Fig. 3-7). The geologic record of prokaryotes mirrors this habitat expansion.
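The reporting rule stated at the start of this section can be written as a small function. This is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, and the example values are point estimates quoted elsewhere in this chapter, used here only to show how the rule behaves, not to assert which value is reported for any particular node.

```python
def reported_time(protein_ga, rrna_ga=None, same_grouping=False):
    """Apply the reporting rule: average the two data sets when the grouping
    coincides, otherwise fall back on the protein estimate."""
    if same_grouping and rrna_ga is not None:
        return (protein_ga + rrna_ga) / 2
    return protein_ga


# Grouping recovered in both data sets (e.g., clusters A and A' of Archaebacteria)
print(round(reported_time(3.14, 2.76, same_grouping=True), 2))

# Grouping that differs between data sets: the protein estimate is used as is
print(reported_time(2.56, 2.41, same_grouping=False))
```

The first call returns the midpoint of the two estimates, and the second returns the protein estimate unchanged, regardless of the rRNA value supplied.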

3.5.3.1 Stage 1: Early presence of prokaryotes in submarine high temperature habitats (4.2−3.4 Ga)

The first diverging lineages of Eubacteria (Aquificae and Thermotogae) and Archaebacteria (Nanoarchaeota and Crenarchaeota) are non-phototrophic hyperthermophiles. Time estimates place these lineages between 4.0 Ga and 4.2 Ga. These early lineages are often isolated from submarine environments, a habitat that would have provided protection against the frequent meteoritic impacts of the Hadean era (4.5−3.8 Ga) (Kasting and Catling 2003). However, we set a maximum boundary so that the times can only be younger than the last ocean-vaporizing impact (4.1−4.3 Ga, midpoint 4.2 Ga), as it would have been unlikely for life to survive an event of that magnitude (Sleep et al. 1989). As mentioned above (§ 3.5.2.1), the position of hyperthermophiles at the base of the eubacterial tree has been criticized on the grounds of potential artifacts caused by their compositional biases and outgroup rooting (Brochier and Philippe 2002). The confidence in a deep phylogenetic position of these taxa is increased by the agreement of our data sets, but a more derived position cannot be discounted given the limited genomic information available on these lineages (i.e., Aquificae and Thermotogae are each represented by a single species in the rRNA and protein data sets). If the latter option is shown to be correct, the ancestor of Eubacteria would be a phototrophic mesophile (see below) and, consequently, its presence on Earth would have to be at a more recent time (after the end of the heavy bombardment period, after 3.8 Ga). Our time estimate for the divergence of the first lineage after the hyperthermophiles is in agreement with this scenario: in both data sets this divergence happens at approximately 3.4 Ga, with Fusobacteria (3.02−3.67 Ga) and Deinococci (3.34−3.49 Ga). A mesophilic origin of Archaebacteria seems less probable, as the deepest branches of both Crenarchaeota and Euryarchaeota are occupied by hyperthermophilic organisms. Although our knowledge of the diversity of Archaebacteria is limited and it is theoretically possible that mesophilic deep-branching species will be discovered, the current phylogenetic status suggests a high-temperature environment for the ancestor of the deepest lineages. Taken together, this evidence suggests an early presence of life (> 4.0 Ga) in high temperature submarine habitats, represented by both Archaebacteria and Eubacteria or by Archaebacteria only.

3.5.3.2 Stage 2: Colonization of mesophilic photic zones (3.4−3.2 Ga)

The ancestor of Heliocytes was, arguably, a mesophile with phototrophic capabilities. Recent analyses of the photic zones of oceans and seas have revealed the phylogenetic diversity encompassed by phototrophic organisms (Sabehi et al. 2005; Frigaard et al. 2006; McCarren and DeLong 2007). These are classified in two groups. The first is composed of strict photosynthesizers, which use bacteriochlorophyll and chlorophyll to convert carbon dioxide into cellular biomass with the aid of electron transport chains. The second is composed of those phototrophs that use type I rhodopsin genes (e.g., bacteriorhodopsin, proteorhodopsin, sensory rhodopsins) to store chemical energy in the form of a proton gradient or to guide their phototactic movements (Bryant and Frigaard 2007). Photosynthesizers are confined to six lineages: Cyanobacteria (the only oxygenic photosynthesizers), Chloroflexi (green non sulfur or filamentous bacteria), Clostridia (only in the Family Heliobacteriaceae), Chlorobia (green sulfur bacteria), Alpha/Betaproteobacteria and Gammaproteobacteria (purple non sulfur and sulfur bacteria, respectively), and the newly discovered Acidobacteria (Bryant et al. 2007). Rhodopsin-mediated phototrophy, instead, is more widespread, also including the Phylum Bacteroidetes, the Classes Actinobacteria, Bacilli, and Planctomycetacia, and various Archaebacteria (Frigaard et al. 2006; Bryant and Frigaard 2007; McCarren and DeLong 2007). The current spread of phototrophic genes among distant phylogenetic lineages, whether related to the reaction centers in photosynthesizers or to the rhodopsins in strict phototrophs, was obtained through HGT events within and among domains (Raymond et al. 2002; Ruiz-Gonzalez and Marin 2004; Zhaxybayeva et al. 2006). Nevertheless, there is some evidence that the ancestor of all photosynthesizers, which corresponds to the ancestor of Heliocytes in our phylogeny, harbored the genes for reaction center I.
This implies that multiple losses as well as HGTs led to the current patchy distribution (Mix, Haig, and Cavanaugh 2005; Sadekar, Raymond, and Blankenship 2006). From a geologic perspective, there is evidence of photosynthesis-mediated deposition of sediments in the Buck Reef Chert in South Africa, dated at 3.4 Ga (Tice 2006; Allen and Martin 2007), and of anoxygenic photosynthetic ecosystems in the early Archean (Canfield, Rosing, and Bjerrum 2006). Furthermore, re-analyses of oxygen isotopes of the early Archean (approximately 3.3 Ga) have rescaled temperatures in this period from 70 ± 15 °C (Knauth and Lowe 2003; Robert and Chaussidon 2006) to temperate (10−32 °C) (Kasting and Howard 2006; Jaffres 2007). This is in agreement with an ancestor of Heliocytes living in mesophilic photic zones. Considering all the previous geologic and phylogenetic evidence, we infer an early evolution of phototrophy between 3.4 Ga and 3.2 Ga (Fig. 3-7). The evolution of this new metabolism implies the colonization of new niches (e.g., mesophilic photic zones of the oceans and land), which could have triggered the rapid radiation of Eubacteria outlined in our timetree.

The time distribution of Archaebacteria shows that most divergences in this domain happened during this same time period (average mode: 3.3 ± 0.3 Ga). Genes involved in methanogenesis and methylotrophy (i.e., methanopterin/methanofuran-linked C1 transfer genes) are shared by methanogens and at least two groups of Eubacteria (the Phylum Proteobacteria and the Class Planctomycetacia) (Chistoserdova et al. 2003). Based on this finding, an evolutionary scenario has been proposed in which early diverging Planctomycetacia evolved these genes to reduce formaldehyde and then spread them via HGT to methanogens and methylotrophs (Chistoserdova et al. 2004). Our phylogeny and timetree contradict this hypothesis, as Planctomycetacia are not deep diverging organisms.
Furthermore, the presence of biomarkers for methanogenesis at 3.46 Ga is not reconcilable with an evolution of these genes in late-emerging Planctomycetacia. However, the presence of this pathway in the ancestor of Eubacteria and Archaebacteria cannot be discarded, especially in light of the recent discovery of these genes in yet to be classified lineages (Chistoserdova et al. 2004; Kalyuzhnaya and Chistoserdova 2005), although there are different opinions (Bauer et al. 2004).

3.5.3.3 Stage 3: Colonization of land and specialized niches (3.3−2.5 Ga)

Shortly after the evolution of mesophilic phototrophs, two major lineages diverged, Terrabacteria and Hydrobacteria. Metabolic and physiological characteristics suggest that Terrabacteria evolved a series of adaptations that favored their colonization of land environments. Desiccation and UV radiation, the latter particularly severe on land before the formation of the ozone layer, cause damage to the DNA and protein complements of organisms (Potts 1994; Billi and Potts 2002). To cope with these extreme conditions, all major phyla within the Terrabacteria harbor special adaptations. These include the synthesis of photoprotective pigments (e.g., carotenoids), the ability to form resting stages (e.g., endospores), and the accumulation of trehalose and sucrose to prevent protein denaturation and loss of membrane structure (Potts 1994; Mattimore and Battista 1996; Nicholson et al. 2000; Rokitko et al. 2001; Billi and Potts 2002; Shirkey et al. 2003; Alpert 2006). Time estimates for the evolution of Terrabacteria are between 3.35 Ga and 3.12 Ga, which postdates the formation of continents on Earth (Hawkesworth and Kemp 2006; Rosing et al. 2006) and is consistent with a possible colonization of them by these organisms. The only lineage within the Terrabacteria that does not show any terrestrial adaptation is the Class Mollicutes (Phylum Firmicutes). This is not surprising considering that this class is composed of specialized species with highly reduced genomes (1.36−0.58 Mb), a reduction that led to the loss of the cell wall and to adaptations to a parasitic lifestyle (Wolf et al. 2004).

The evolution of phototrophy and adaptations to a terrestrial environment are the steps that preceded the origin of Cyanobacteria and oxygenic photosynthesis. Our protein timetree brackets this metabolism between 2.84 Ga and 2.57 Ga, an interval that includes the peak of the eubacterial divergences (2.75 ± 0.12 Ga). It is reasonable to infer that the evolution of oxygenic photosynthesis and its effect on Earth’s local and global environment (Holland 2002) triggered a series of adaptive radiations in Eubacteria that gave rise to most of the lineages within Terrabacteria and Hydrobacteria. It is worth noting that the divergence time of Cyanobacteria in the rRNA phylogeny, which places this group deep in the tree, is not much older, 2.90 Ga. This suggests that, regardless of the phylogenetic position of this phylum, oxygenic photosynthesis is unlikely to be responsible for early Archean geologic formations. Our time estimates are in agreement with putative biomarker evidence for Cyanobacteria from the Pilbara Craton, Northern Australia (Brocks et al. 1999; Summons et al. 1999), although these have recently been questioned (Kopp et al. 2005). One of the issues posed by an evolution of oxygenic Cyanobacteria between 2.8 Ga and 2.6 Ga is the delay in increased oxygen concentrations, registered only 300 million years later. A recent analysis of the evolution of submarine and subaerial volcanism offers a possible explanation, as reducing (submarine) volcanic gases would have acted efficiently as an oxygen sink, virtually preventing an oxygen buildup in the atmosphere (Kump and Barley 2007).
The change from a strongly reduced submarine volcanism to a less reduced subaerial volcanism would have removed this sink, allowing oxygen to accumulate in the atmosphere.

Although many lineages within Hydrobacteria have evolved adaptations to terrestrial environments, gram negative bacteria are known to be more sensitive to desiccation (Rokitko et al. 2001). Species within the phyla Proteobacteria, Acidobacteria, and Bacteroidetes, and the Class Planctomycetacia have been identified, among others belonging to Terrabacteria and Crenarchaeota, in desert environments (Chanal et al. 2006), but the spread of terrestrial adaptations in Proteobacteria and Spirochlamydiae is not known. Resistance to desiccation is modulated by different mechanisms (e.g., dormant stages, increased trehalose/sucrose concentrations, secretion of extracellular polysaccharides), which often interact with each other and provide protection against multiple stresses (e.g., ionizing radiation, high salinity) (Shukla et al. 2007). Until more data are available on the spread and phylogenetic history of genes involved in stress adaptations, the presence of a gram negative cell membrane and the general absence of dormant stages (except in Myxococcus and Myxobacter spp., Class Deltaproteobacteria) in Hydrobacteria suggest that the ancestor of this group was adapted to aquatic or moist environments. This implies that species within this group that are highly resistant to multiple stresses have gained these abilities as secondary adaptations through HGT, mutations, or a combination of the two mechanisms.

An unresolved issue within the Infrakingdom Spirochlamydiae is the position of Planctomycetacia. This is a metabolically diverse lineage isolated from both aquatic and terrestrial ecosystems and even hot springs. Its phylogenetic position has varied from deep branching (Brochier and Philippe 2002) to the closest relative of Chlamydiae (Teeling et al. 2004; Wagner and Horn 2006). Our phylogenies support the latter with its placement within Spirochlamydiae. However, the closest relative varies from Spirochaetes in the protein data set to Chlamydiae in the rRNA data set. Both of these clusters are present in previous analyses (e.g., Ciccarelli et al. 2006; Wagner and Horn 2006). We favor the placement in the protein data set (§ 3.5.1), with a divergence time of 2.56 Ga.

Approximately in this time period, Archaebacteria evolved the ability to colonize thermoacidophilic environments (pH < 3; Temp. > 50 °C). This metabolism is mainly confined to the Domain Archaebacteria, although less strictly constrained representatives within Eubacteria have been identified (Goto et al. 2002), and is present in only two lineages: the classes Thermoplasmata (Phylum Euryarchaeota) and Thermoprotei (Phylum Crenarchaeota) (Bertoldo, Dock, and Antranikian 2004; Angelov and Liebl 2006). All known members of the Thermoplasmata (Genera Thermoplasma, Ferroplasma, and Picrophilus) are thermoacidophiles (Garrity 2001), the only exception being one species of Ferroplasma that is a mesophile (Edwards et al. 2000). Given this distribution, this metabolism was likely present in the ancestor of the class. There is evidence for extensive HGTs between Thermoplasmata and Thermoprotei (She et al. 2001; Angelov and Liebl 2006), but the direction of these transfers is unclear. Because we have not timed the divergence among orders, it is not possible to constrain the evolution of this metabolism with an upper and lower limit, since it is unknown if the Sulfolobales, the only order represented in the protein data set for the Class Thermoprotei, originated earlier than the Thermoplasmata. However, the divergence of Thermoplasmata between 3.49 (3.46−3.57) Ga and 3.14 (3.10−3.18) Ga suggests that this metabolism had evolved by at least the mid-Archean.

3.6 Conclusions

The purpose of this study was to analyse the phylogeny of prokaryote classes by comparing different phylogenetic methods (ML, NJ, and SF) for each of two data sets: core genes (protein) and rRNAs (SSU+LSU). The two data sets found similar associations for approximately 60% of the classes considered, with significant bootstrap values for ~63% of them. In the Domain Eubacteria, these highly supported groups include the Hydrobacteria and relationships within this subkingdom. For the Archaebacteria, Thermoplasmata, Archaeoglobi, Halobacteria, and Methanomicrobia form a highly supported cluster (BP > 95%) in both data sets. Based on the evidence from previous studies and on considerations regarding the performance of our data sets, we have expanded the definition of Terrabacteria to include the Phyla Firmicutes and Chloroflexi (besides Cyanobacteria, Actinobacteria, and the Class Deinococci). Our timetrees show a rapid evolution of these groups (3.4−2.5 Ga), after their divergence from the hyperthermophiles, which is mirrored in the evolution of Earth. The timing of the evolution of phototrophy (3.4−3.2 Ga) in the ancestor of Terrabacteria and Hydrobacteria and the colonization of photic habitats agrees with geologic evidence of anoxygenic photosynthesis. The evolution of physiological and genetic adaptations to resist the stresses of land environments enabled the colonization of land by 3.0 Ga, which logically follows the formation of continents at 4.0−3.8 Ga. Finally, by using calibration points unrelated to oxygen metabolism, we were able to reconfirm the evolution of Cyanobacteria in the late Archean, in agreement with the inferred evolution of Earth’s atmospheric composition.


3.7 Acknowledgements

We thank Dr. D.A. Bryant for helpful discussion. This work was supported by grants to SBH from NASA Astrobiology Institute and National Science Foundation.


CHAPTER 4

Concluding remarks

In this work, we have addressed the early evolution of prokaryotes from the perspective of phylogeny and divergence times. The evidence of prokaryote communities in the mid-Archean geologic record suggests a long history of co-evolution between these organisms and our planet. This provides the means to further the understanding of Earth’s early history by studying the evolution of its earliest life forms, their metabolic innovations, and their patterns of colonization of different environments. To address the timeline of prokaryote evolution, it is necessary to evaluate the phylogenetic relationships of the major groups, a task that has proven difficult because of the complex nature of their genomes and how they evolved. Previous studies have shown some general trends but have not agreed on the major groups of classes and phyla. To improve on these results, we carried out a series of studies on genes believed to have evolved vertically rather than horizontally. Through the comparison of phylogenies obtained with different methods and data sets, we were able to identify some highly supported major groups of prokaryotes. A dichotomy was found in Eubacteria separating two lineages that gave rise to the majority of prokaryote classes (52% of them in the Hydrobacteria, 33% in Terrabacteria). Terrabacteria were also identified as the first lineage to evolve terrestrial adaptations. In the Domain Archaebacteria, we found that the methanogens are not a natural group (i.e., they are non-monophyletic). In addition, we estimated divergence times with molecular clock methods in a Bayesian and likelihood framework, which indicated a mid- to late-Archean (3.3–2.6 Ga) radiation for Archaebacteria and Eubacteria, respectively. The evolution of different types of metabolism, such as methanogenesis (Archaebacteria) and phototrophy (Eubacteria), was timed to the mid-Archean (~3.6–3.4 Ga) and is in agreement with the geologic evidence (i.e., isotopic compositions and geologic structures).
Following the evolution of phototrophy, eubacterial lineages evolved terrestrial adaptations that allowed their colonization of land environments by ~3.0 Ga, and oxygenic photosynthesis evolved by ~2.6 Ga.

From an astrobiological perspective, the implications of a prokaryote evolutionary timeline span from the origin of life to changes in the redox state of Earth’s atmosphere. The phylogenetic distribution of metabolisms and the timeframe for their evolution have been used to identify possible early metabolic pathways. For example, the early evolution of methanogenesis, supported by molecular clocks and the geologic record, provides evidence for an early establishment of CO metabolism and its role in the evolution of an energy-conserving pathway that could have functioned in a primitive cell (Ferry and House 2006). On the other hand, the timing of the evolution of Cyanobacteria, while in agreement between molecular clocks and the geologic record (Summons et al. 1999), suggests the presence of geologic mechanisms that delayed the rise in atmospheric oxygen concentrations by approximately 400 million years (Kump and Barley 2007). A timeline for the evolution of life during early Earth history and the geologic record thus provide complementary evidence that, together, can improve our understanding of planetary habitability.

References

Allen, J. F., and W. Martin. 2007. Evolutionary biology: out of thin air. Nature 445:610-612.

Alpert, P. 2006. Constraints of tolerance: why are desiccation-tolerant organisms so small or rare? Journal of Experimental Biology 209:1575-1584.

Altermann, W., and J. Kazmierczak. 2003. Archean microfossils: a reappraisal of early life on Earth. Research in Microbiology 154:611-617.

Altschul, S. F., T. L. Madden, A. A. Schaffer, J. H. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25:3389-3402.

Amard, B., and J. Bertrand-Sarfati. 1997. Microfossils in 2000 Ma old cherty stromatolites of the Franceville Group, Gabon. Precambrian Research 81:197-221.

Angelov, A., and W. Liebl. 2006. Insights into extreme thermoacidophily based on genome analysis of Picrophilus torridus and other thermoacidophilic archaea. Journal of Biotechnology 126:3-10.

Banerjee, N. R., A. Simonetti, H. Furnes, K. Muehlenbachs, H. Staudigel, L. Heaman, and M. J. Van Kranendonk. 2007. Direct dating of Archean microbial ichnofossils. Geology 35:487-490.

Bapteste, E., Y. Boucher, J. Leigh, and W. F. Doolittle. 2004. Phylogenetic reconstruction and lateral gene transfer. Trends in Microbiology 12:406-411.

Bapteste, E., C. Brochier, and Y. Boucher. 2005. Higher-level classification of the Archaea: evolution of methanogenesis and methanogens. Archaea 1:353-363.

Bapteste, E., E. Susko, J. Leigh, D. MacLeod, R. L. Charlebois, and W. F. Doolittle. 2005. Do orthologous gene phylogenies really support tree-thinking? BMC Evolutionary Biology 5:-.

Battistuzzi, F. U., A. Feijão, and S. B. Hedges. 2004. A genomic timescale of prokaryote evolution: insights into the origin of methanogenesis, phototrophy, and the colonization of land. BMC Evol Biol 4:44.

Bauer, M., T. Lombardot, H. Teeling, N. L. Ward, R. I. Amann, and F. O. Glockner. 2004. Archaea-like genes for C1-transfer enzymes in Planctomycetes: phylogenetic implications of their unexpected presence in this phylum. J Mol Evol 59:571-586.

Benton, M. J. 1993. The Fossil Record 2. Chapman and Hall, London.

Bern, M., and D. Goldberg. 2005. Automatic selection of representative proteins for bacterial phylogeny. BMC Evol Biol 5:34.

Bertoldo, C., C. Dock, and G. Antranikian. 2004. Thermoacidophilic microorganisms and their novel biocatalysts. Eng. Life Sci. 4:521-532.

Billi, D., and M. Potts. 2002. Life and death of dried prokaryotes. Research in Microbiology 153:7-12.

Blank, C. E. 2004. Evolutionary timing of the origins of mesophilic sulphate reduction and oxygenic photosynthesis: a phylogenomic dating approach. Geobiology 2:1-20.

Boetius, A., K. Ravenschlag, C. J. Schubert, D. Rickert, F. Widdel, A. Gieseke, R. Amann, B. B. Jorgensen, U. Witte, and O. Pfannkuche. 2000. A marine microbial consortium apparently mediating anaerobic oxidation of methane. Nature 407:623-626.

Boucher, Y., C. J. Douady, R. T. Papke, D. A. Walsh, M. E. Boudreau, C. L. Nesbo, R. J. Case, and W. F. Doolittle. 2003. Lateral gene transfer and the origins of prokaryotic groups. Annu Rev Genet 37:283-328.

Brasier, M., O. Green, J. Lindsay, and A. Steele. 2004. Earth's oldest (~3.5 Ga) fossils and the 'Early Eden hypothesis': Questioning the evidence. Origins of Life and Evolution of the Biosphere 34:257-269.

Brasier, M., N. McLoughlin, O. Green, and D. Wacey. 2006. A fresh look at the fossil evidence for early Archaean cellular life. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences 361:887-902.

Brasier, M. D., O. R. Green, A. P. Jephcoat, A. K. Kleppe, M. J. Van Kranendonk, J. F. Lindsay, A. Steele, and N. V. Grassineau. 2002. Questioning the evidence for Earth's oldest fossils. Nature 416:76-81.

Brinkmann, H., and H. Philippe. 1999. Archaea sister group of Bacteria? Indications from tree reconstruction artifacts in ancient phylogenies. Molecular Biology and Evolution 16:817-825.

Brochier, C., E. Bapteste, D. Moreira, and H. Philippe. 2002. Eubacterial phylogeny based on translational apparatus proteins. Trends in Genetics 18:1-5.

Brochier, C., P. Forterre, and S. Gribaldo. 2004. Archaeal phylogeny based on proteins of the transcription and translation machineries: tackling the Methanopyrus kandleri paradox. Genome Biology 5:-.

Brochier, C., S. Gribaldo, Y. Zivanovic, F. Confalonieri, and P. Forterre. 2005. Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales? Genome Biology 6:R42.

Brochier, C., and H. Philippe. 2002. Phylogeny - A non-hyperthermophilic ancestor for bacteria. Nature 417:244-244.

Brochier, C., H. Philippe, and D. Moreira. 2000. The evolutionary history of ribosomal protein RpS14: horizontal gene transfer at the heart of the ribosome. Trends Genet 16:529-533.

Brocks, J. J., R. Buick, R. E. Summons, and G. A. Logan. 2003. A reconstruction of Archean biological diversity based on molecular fossils from the 2.78 to 2.45 billion-year-old Mount Bruce Supergroup, Hamersley Basin, Western Australia. Geochimica Et Cosmochimica Acta 67:4321-4335.

Brocks, J. J., G. A. Logan, R. Buick, and R. E. Summons. 1999. Archean molecular fossils and the early rise of eukaryotes. Science 285:1033-1036.

Brocks, J. J., G. D. Love, R. E. Summons, A. H. Knoll, G. A. Logan, and S. A. Bowden. 2005. Biomarker evidence for green and purple sulphur bacteria in a stratified Palaeoproterozoic sea. Nature 437:866-870.

Brocks, J. J., and A. Pearson. 2005. Building the biomarker tree of life. Molecular Geomicrobiology, Reviews in Mineralogy & Geochemistry 59:233-258.

Brown, J. R. 2003. Ancient horizontal gene transfer. Nature Reviews Genetics 4:121-132.

Brown, J. R., C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope. 2001. Universal trees based on large combined protein data sets. Nature Genetics 28:281-285.

Brown, J. R., and K. K. Koretke. 2000. Universal trees: discovering the archaeal and bacterial legacies in F. Priest and M. Goodfellow, eds. Applied Microbial Systematics. Kluwer Academic Publishers.

Bryant, D. A., A. M. Costas, J. A. Maresca, A. G. Chew, C. G. Klatt, M. M. Bateson, L. J. Tallon, J. Hostetler, W. C. Nelson, J. F. Heidelberg, and D. M. Ward. 2007. Candidatus Chloracidobacterium thermophilum: an aerobic phototrophic Acidobacterium. Science 317:523-526.

Bryant, D. A., and N.-U. Frigaard. 2007. Prokaryotic photosynthesis and phototrophy illuminated. Trends in Microbiology 14:488-496.

Buick, R. 1992. The antiquity of oxygenic photosynthesis: evidence from stromatolites in sulphate-deficient Archaean lakes. Science 255:74-77.

Burggraf, S., K. O. Stetter, P. Rouviere, and C. R. Woese. 1991. Methanopyrus kandleri - an archaeal methanogen unrelated to all other known methanogens. Systematic and Applied Microbiology 14:346-351.

Butterfield, N. J. 2000. Bangiomorpha pubescens n. gen., n. sp.: implications for the evolution of sex, multicellularity, and the Mesoproterozoic/Neoproterozoic radiation of eukaryotes. Paleobiology 26:386-404.

Cammarano, P., R. Creti, A. M. Sanangelantoni, and P. Palm. 1999. The Archaea monophyly issue: a phylogeny of translational elongation factor G(2) sequences inferred from an optimized selection of alignment positions. Journal of Molecular Evolution 49:524-537.

Canback, B., I. Tamas, and S. G. Andersson. 2004. A phylogenomic study of endosymbiotic bacteria. Molecular Biology and Evolution 21:1110-1122.

Canfield, D. E., M. T. Rosing, and C. Bjerrum. 2006. Early anaerobic metabolisms. Philosophical Transactions of the Royal Society B-Biological Sciences 361:1819-1836.

Castresana, J. 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Molecular Biology and Evolution 17:540-552.

Chanal, A., V. Chapon, K. Benzerara, M. Barakat, R. Christen, W. Achouak, F. Barras, and T. Heulin. 2006. The desert of Tataouine: an extreme environment that hosts a wide diversity of microorganisms and radiotolerant bacteria. Environmental Microbiology 8:514-525.

Chistoserdova, L., S. W. Chen, A. Lapidus, and M. E. Lidstrom. 2003. Methylotrophy in Methylobacterium extorquens AM1 from a genomic point of view. Journal of Bacteriology 185:2980-2987.

Chistoserdova, L., C. Jenkins, M. Kalyuzhnaya, C. J. Marx, A. Lapidus, J. A. Vorholt, J. T. Staley, and M. E. Lidstrom. 2004. The enigmatic Planctomycetes may hold a key to the origins of methanogenesis and methylotrophy. Molecular Biology and Evolution 21:1234-1241.

Choi, I. G., and S. H. Kim. 2007. Global extent of horizontal gene transfer. Proc Natl Acad Sci U S A 104:4489-4494.

Ciccarelli, F. D., T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork. 2006. Toward automatic reconstruction of a highly resolved tree of life. Science 311:1283-1287.

Cockell, C. S., and G. Horneck. 2001. The history of the UV radiation climate of the earth - Theoretical and space-based observations. Photochemistry and Photobiology 73:447-451.

Curtis, T. P., I. M. Head, M. Lunn, S. Woodcock, P. D. Schloss, and W. T. Sloan. 2006. What is the extent of prokaryotic diversity? Philosophical Transactions of the Royal Society B-Biological Sciences 361:2023-2037.

da Silva, F. B., V. C. Muschner, and S. L. Bonatto. 2007. Phylogenetic position of Placozoa based on large subunit (LSU) and small subunit (SSU) rRNA genes. Genetics and Molecular Biology 30:127-132.

Dagan, T., and W. Martin. 2007. Ancestral genome sizes specify the minimum rate of lateral gene transfer during prokaryote evolution. Proc Natl Acad Sci U S A 104:870-875.

Dagan, T., and W. Martin. 2006. The tree of one percent. Genome Biology 7:-.

Daubin, V., M. Gouy, and G. Perriere. 2002. A phylogenomic approach to bacterial phylogeny: Evidence of a core of genes sharing a common history. Genome Research 12:1080-1090.

Daubin, V., M. Gouy, and G. Perrière. 2002. Bacterial phylogeny using supertree approach. Genome Informatics 12:155-164.

Daubin, V., N. A. Moran, and H. Ochman. 2003. Phylogenetics and the cohesion of bacterial genomes. Science 301:829-832.

De Rijk, P., Y. Van de Peer, I. Van den Broeck, and R. De Wachter. 1995. Evolution according to large ribosomal-subunit RNA. Journal of Molecular Evolution 41:366-375.

DeLong, E. F. 2000. Microbiology - Resolving a methane mystery. Nature 407:577-579.

Delsuc, F., H. Brinkmann, and H. Philippe. 2005. Phylogenomics and the reconstruction of the tree of life. Nature Reviews Genetics 6:361-375.

DesMarais, D. J. 2000. When did photosynthesis emerge on Earth? Science 289:1703-1705.

Doolittle, R. F., D.-F. Feng, S. Tsang, G. Cho, and E. Little. 1996. Determining divergence times of the major kingdoms of living organisms with a protein clock. Science 271:470-477.

Doolittle, W. F. 1999. Phylogenetic classification and the universal tree. Science 284:2124-2128.

Drummond, A. J., S. Y. Ho, M. J. Phillips, and A. Rambaut. 2006. Relaxed phylogenetics and dating with confidence. PLoS Biol 4:e88.

Edwards, K. J., P. L. Bond, T. M. Gihring, and J. F. Banfield. 2000. An archaeal iron-oxidizing extreme acidophile important in acid mine drainage. Science 287:1796-1799.

Felsenstein, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783-791.

Felsenstein, J. 1989. Phylogeny Inference Package (Version 3.2). Cladistics 5:164-166.

Felsenstein, J., and G. A. Churchill. 1996. A Hidden Markov Model approach to variation among sites in rate of evolution. Molecular Biology and Evolution 13:93-104.

Feng, D.-F., G. Cho, and R. F. Doolittle. 1997. Determining divergence times with a protein clock: update and reevaluation. Proceedings of the National Academy of Sciences (U.S.A.) 94:13028-13033.

Ferry, J. G., and C. H. House. 2006. The stepwise evolution of early life driven by energy conservation. Molecular Biology and Evolution 23:1286-1292.

Field, D., G. Wilson, and C. van der Gast. 2006. How do we compare hundreds of bacterial genomes? Current Opinion in Microbiology 9:499-504.

Forterre, P., C. Brochier, and H. Philippe. 2002. Evolution of the archaea. Theoretical Population Biology 61:409-422.

Foster, P. G., and D. A. Hickey. 1999. Compositional bias may affect both DNA-based and protein-based phylogenetic reconstructions. J Mol Evol 48:284-290.

Fox, G. E., E. Stackebrandt, R. B. Hespell, J. Gibson, J. Maniloff, T. A. Dyer, R. S. Wolfe, W. E. Balch, R. S. Tanner, L. J. Magrum, L. B. Zablen, R. Blakemore, R. Gupta, L. Bonen, B. J. Lewis, D. A. Stahl, K. R. Luehrsen, K. N. Chen, and C. R. Woese. 1980. The phylogeny of prokaryotes. Science 209:457-463.

Frigaard, N. U., A. Martinez, T. J. Mincer, and E. F. DeLong. 2006. Proteorhodopsin lateral gene transfer between marine planktonic Bacteria and Archaea. Nature 439:847-850.

Fuerst, J. A., and R. I. Webb. 1991. Membrane-bounded nucleoid in the eubacterium Gemmata obscuriglobus. Proc Natl Acad Sci U S A 88:8184-8188.

Gao, B., and R. S. Gupta. 2007. Phylogenomic analysis of proteins that are distinctive of Archaea and its main subgroups and the origin of methanogenesis. BMC Genomics 8:86.

Gao, B., R. Paramanathan, and R. S. Gupta. 2006. Signature proteins that are distinctive characteristics of Actinobacteria and their subgroups. Antonie Van Leeuwenhoek 90:69-91.

Garrity, G. M. 2001. Bergey's manual of systematic bacteriology, 2nd ed. Springer, New York.

Gogarten, J. P., W. F. Doolittle, and J. G. Lawrence. 2002. Prokaryotic evolution in light of gene transfer. Molecular Biology and Evolution 19:2226-2238.

Golubic, S., V. N. Sergeev, and A. H. Knoll. 1995. Mesoproterozoic Archaeoellipsoides: akinetes of heterocystous cyanobacteria. Lethaia 28:285-298.

Gophna, U., W. F. Doolittle, and R. L. Charlebois. 2005. Weighted genome trees: Refinements and applications. Journal of Bacteriology 187:1305-1316.

Goto, K., Y. Tanimoto, T. Tamura, K. Mochida, D. Arai, M. Asahara, M. Suzuki, H. Tanaka, and K. Inagaki. 2002. Identification of thermoacidophilic bacteria and a new Alicyclobacillus genomic species isolated from acidic environments in Japan. Extremophiles 6:333-340.

Gotz, T., U. Windhovel, P. Boger, and G. Sandmann. 1999. Protection of photosynthesis against ultraviolet-B radiation by carotenoids in transformants of the cyanobacterium Synechococcus PCC7942. Plant Physiology 120:599-604.

Gribaldo, S., and C. Brochier-Armanet. 2006. The origin and evolution of Archaea: a state of the art. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences 361:1007-1022.

Grotzinger, J. P., and D. H. Rothman. 1996. An abiotic model for stromatolite morphogenesis. Nature 383:423-425.

Gupta, R. S. 2003. Evolutionary relationships among photosynthetic bacteria. Photosynth Res 76:173-183.

Gupta, R. S., K. Aitken, M. Falah, and B. Singh. 1994. Cloning of Giardia lamblia heat shock protein HSP70 homologs: implications regarding origin of eukaryotic cells and of endoplasmic reticulum. Proc Natl Acad Sci U S A 91:2895-2899.

Gupta, R. S., and B. Singh. 1994. Phylogenetic analysis of 70 kD heat shock protein sequences suggests a chimeric origin for the eukaryotic cell nucleus. Current Biology 4:1104-1114.

Hansmann, S., and W. Martin. 2000. Phylogeny of 33 ribosomal and six other proteins encoded in an ancient gene cluster that is conserved across prokaryotic genomes: influence of excluding poorly alignable sites from analysis. Int J Syst Evol Microbiol 50 Pt 4:1655-1663.

Hartman, H., and A. Fedorov. 2002. The origin of the eukaryotic cell: a genomic investigation. Proc Natl Acad Sci U S A 99:1420-1425.

Hasegawa, M., and T. Hashimoto. 1993. Ribosomal RNA trees misleading? Nature 361:23.

Hawkesworth, C. J., and A. I. S. Kemp. 2006. Evolution of the continental crust. Nature 443:811-817.

Hayes, J. M. 1994. Global methanotrophy at the Archean-Proterozoic transition. Pp. 220-236 in S. Bengtson, ed. Early life on Earth. Columbia University Press, New York.

Hedges, S. B. 2002. The origin and evolution of model organisms. Nature Reviews Genetics 3:838-849.

Hedges, S. B., J. E. Blair, M. L. Venturi, and J. L. Shoe. 2004. A molecular timescale of eukaryote evolution and the rise of complex multicellular life. BMC Evolutionary Biology 4:2.

Hedges, S. B., H. Chen, S. Kumar, D. Y. Wang, A. S. Thompson, and H. Watanabe. 2001. A genomic timescale for the origin of eukaryotes. BMC Evolutionary Biology 1:4.

Hedges, S. B., and S. Kumar. 2003. Genomic clocks and evolutionary timescales. Trends in Genetics 19:200-206.

Hedges, S. B., and S. Kumar. 2004. Precision of molecular time estimates. Trends in Genetics 20:242-247.

Hedges, S. B., and P. Shah. 2003. Comparison of mode estimation methods and application in molecular clock analysis. BMC Bioinformatics 4:31.

Hinrichs, K. U. 2002. Microbial fixation of methane carbon at 2.7 Ga: Was an anaerobic mechanism possible? Geochemistry Geophysics Geosystems 3:-.

Ho, S. Y., M. J. Phillips, A. J. Drummond, and A. Cooper. 2005. Accuracy of rate estimation using relaxed-clock models with a critical focus on the early metazoan radiation. Molecular Biology and Evolution 22:1355-1363.

Hofmann, H. J., K. Grey, A. H. Hickman, and R. I. Thorpe. 1999. Origin of 3.45 Ga coniform stromatolites in Warrawoona Group, Western Australia. Geological Society of America Bulletin 111:1256-1262.

Holland, H. D. 2002. Volcanic gases, black smokers, and the Great Oxidation Event. Geochimica Et Cosmochimica Acta 21:3811-3826.

Holt, J. G. 1984. Bergey's manual of systematic bacteriology, 1st ed. Williams & Wilkins, Baltimore.

Horita, J. 2005. Some perspective on isotope biosignatures for early life. Chemical Geology 218:171-186.

House, C. H., B. Runnegar, and S. T. Fitz-Gibbon. 2003. Geobiological analysis using whole genome-based tree building applied to the Bacteria, Archaea and Eukarya. Geobiology 1:15-26.

House, C. H., J. W. Schopf, and K. O. Stetter. 2003. Carbon isotopic fractionation by archaeans and other thermophilic prokaryotes. Organic Geochemistry 34:345-356.

Huang, J., and J. P. Gogarten. 2006. Ancient horizontal gene transfer can benefit phylogenetic reconstruction. Trends Genet 22:361-366.

Itoh, T., W. Martin, and M. Nei. 2002. Acceleration of genomic evolution caused by enhanced mutation rate in endocellular symbionts. Proc Natl Acad Sci U S A 99:12944-12948.

Jackson, C. R., and S. L. Dugas. 2003. Phylogenetic analysis of bacterial and archaeal arsC gene sequences suggests an ancient, common origin for arsenate reductase. BMC Evol Biol 3:18.

Jaffres, J. B. D., G. A. Shields, and K. Wallmann. 2007. The oxygen isotope evolution of seawater: a critical review of a long-standing controversy and an improved geological water cycle model for the past 3.4 billion years. Earth-Science Reviews 83:83-122.

Jain, R., M. C. Rivera, and J. A. Lake. 1999. Horizontal gene transfer among genomes: The complexity hypothesis. Proceedings of the National Academy of Sciences (U.S.A.) 96:3801-3806.

Jenkins, C., and J. A. Fuerst. 2001. Phylogenetic analysis of evolutionary relationships of the planctomycete division of the domain bacteria based on amino acid sequences of elongation factor Tu. J Mol Evol 52:405-418.

Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8:275-282.

Kalyuzhnaya, M. G., and L. Chistoserdova. 2005. Community-level analysis: genes encoding methanopterin-dependent enzymes. Methods in Enzymology 397:443-454.

Kasting, J. F., and D. Catling. 2003. Evolution of a habitable planet. Annual Review of Astronomy and Astrophysics 41:429-463.

Kasting, J. F., and M. T. Howard. 2006. Atmospheric composition and climate on the early Earth. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences 361:1733-1741; discussion 1741-1732.

Kasting, J. F., A. A. Pavlov, and J. L. Siefert. 2001. A coupled ecosystem-climate model for predicting the methane concentration in the archean atmosphere. Origins of Life and Evolution of the Biosphere 31:271-285.

Kazmierczak, J., and W. Altermann. 2002. Neoarchean biomineralization by benthic cyanobacteria. Science 298:2351-2351.

Kimura, M. 1983. The neutral theory of molecular evolution. Cambridge University Press, Cambridge, UK.

Kishino, H., and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in hominoidea. J Mol Evol 29:170-179.

Kishino, H., J. L. Thorne, and W. J. Bruno. 2001. Performance of a divergence time estimation method under a probabilistic model of rate evolution. Molecular Biology and Evolution 18:352-361.

Klein, M., M. Friedrich, A. J. Roger, P. Hugenholtz, S. Fishbain, H. Abicht, L. L. Blackall, D. A. Stahl, and M. Wagner. 2001. Multiple lateral transfers of dissimilatory sulfite reductase genes between major lineages of sulfate-reducing prokaryotes. Journal of Bacteriology 183:6028-6035.

Klenk, H. P., R. A. Clayton, J. F. Tomb, O. White, K. E. Nelson, K. A. Ketchum, R. J. Dodson, M. Gwinn, E. K. Hickey, J. D. Peterson, D. L. Richardson, A. R. Kerlavage, D. E. Graham, N. C. Kyrpides, R. D. Fleischmann, J. Quackenbush, N. H. Lee, G. G. Sutton, S. Gill, E. F. Kirkness, B. A. Dougherty, K. McKenney, M. D. Adams, B. Loftus, S. Peterson, C. Reich, L. McNeil, J. Badger, A. Glodek, L. Zhou, R. Overbeek, J. Gocayne, J. Weidman, L. McDonald, T. Utterback, M. Cotton, T. Spriggs, P. Artiach, B. Kaine, S. Sykes, P. Sadow, K. D'Andrea, C. Bowman, C. Fujii, S. Garland, T. Mason, G. Olsen, C. Fraser, H. Smith, C. Woese, and J. C. Venter. 1997. The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus. Nature 390:364-370.

Knauth, P., and D. R. Lowe. 2003. High Archean climatic temperature inferred from oxygen isotope geochemistry of cherts in the 3.5 Ga Swaziland Supergroup, South Africa. GSA Bulletin 115:566-580.

Knoll, A. H. 2003a. The geobiological consequences of evolution. Geobiology 1:3-14.

Knoll, A. H. 2003b. Life on a young planet: the first three billion years of evolution on Earth. Princeton University Press, Princeton, NJ.

Kollman, J. M., and R. F. Doolittle. 2000. Determining the relative rates of change for prokaryotic and eukaryotic proteins with anciently duplicated paralogs. Journal of Molecular Evolution 51:173-181.

Koonin, E. V. 2003. Horizontal gene transfer: the path to maturity. Molecular Microbiology 50:725-727.

Kopp, R. E., J. L. Kirschvink, I. A. Hilburn, and C. Z. Nash. 2005. The Paleoproterozoic snowball Earth: A climate disaster triggered by the evolution of oxygenic photosynthesis. Proceedings of the National Academy of Sciences of the United States of America 102:11131-11136.

Kumar, S. 2006. Molecular clocks: four decades of evolution. Nature Reviews Genetics 6:654-662.

Kumar, S., K. Tamura, I. B. Jakobsen, and M. Nei. 2001. MEGA2: molecular evolutionary genetics analysis software. Bioinformatics 17:1244-1245.

Kump, L. R., and M. E. Barley. 2007. Increased subaerial volcanism and the rise of atmospheric oxygen 2.5 billion years ago. Nature 448:1033-1036.

Lawrence, J. G., and H. Hendrickson. 2003. Lateral gene transfer: when will adolescence end? Molecular Microbiology 50:739-749.

Le Gall, L., and G. W. Saunders. 2007. A nuclear phylogeny of the Florideophyceae (Rhodophyta) inferred from combined EF2, small subunit and large subunit ribosomal DNA: establishing the new red algal subclass Corallinophycidae. Molecular Phylogenetics and Evolution 43:1118-1130.

Lerat, E., V. Daubin, H. Ochman, and N. A. Moran. 2005. Evolutionary origins of genomic repertoires in bacteria. PLoS Biol 3:e130.

Madigan, M. T., J. M. Martinko, and J. Parker. 2003. Brock Biology of microorganisms. Prentice-Hall Inc., New Jersey.

Madsen, E. L. 2005. Identifying microorganisms responsible for ecologically significant biogeochemical processes. Nature Reviews Microbiology 3:439-446.

Makarova, K. S., and E. V. Koonin. 2005. Evolutionary and functional genomics of the Archaea. Current Opinion in Microbiology 8:586-594.

Mallatt, J., and C. J. Winchell. 2002. Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes. Molecular Biology and Evolution 19:289-301.

Margoliash, E. 1963. Primary Structure and Evolution of Cytochrome C. Proc Natl Acad Sci U S A 50:672-679.

Margulis, L. 1970. Origin of eukaryotic cells. Yale University Press, New Haven, CT.

Martin, H., F. Albarède, P. Claeys, M. Gargaud, B. Marty, A. Morbidelli, and D. L. Pinti. 2006. Building a habitable planet. Earth, Moon, and Planets 98:97-151.

Martin, W., and E. V. Koonin. 2006. A positive definition of prokaryotes. Nature 442:868.

Martiny, J. B. H., and D. Field. 2005. Ecological perspectives on the sequenced genome collection. Ecology Letters 8:1334-1345.

Matte-Tailliez, O., C. Brochier, P. Forterre, and H. Philippe. 2002. Archaeal phylogeny based on ribosomal proteins. Molecular Biology and Evolution 19:631-639.

Mattimore, V., and J. R. Battista. 1996. Radioresistance of Deinococcus radiodurans: Functions necessary to survive ionizing radiation are also necessary to survive prolonged desiccation. Journal of Bacteriology 178:633-637.

McCarren, J., and E. F. DeLong. 2007. Proteorhodopsin photosystem gene clusters exhibit co-evolutionary trends and shared ancestry among diverse marine microbial phyla. Environ Microbiol 9:846-858.

Mira, A., R. Pushker, B. A. Legault, D. Moreira, and F. Rodriguez-Valera. 2004. Evolutionary relationships of Fusobacterium nucleatum based on phylogenetic analysis and comparative genomics. BMC Evol Biol 4:50.

Mix, L. J., D. Haig, and C. M. Cavanaugh. 2005. Phylogenetic analyses of the core antenna domain: investigating the origin of photosystem I. J Mol Evol 60:153-163.

Mojzsis, S. J., G. Arrhenius, K. D. McKeegan, T. M. Harrison, A. P. Nutman, and C. R. Friend. 1996. Evidence for life on Earth before 3,800 million years ago. Nature 384:55-59.

Mojzsis, S. J., T. M. Harrison, and R. T. Pidgeon. 2001. Oxygen-isotope evidence from ancient zircons for liquid water at the Earth's surface 4,300 Myr ago. Nature 409:178-181.

Moran, N. A. 1996. Accelerated evolution and Muller's ratchet in endosymbiotic bacteria. Proc Natl Acad Sci U S A 93:2873-2878.

Nesbo, C. L., Y. Boucher, and W. F. Doolittle. 2001. Defining the core of nontransferable prokaryotic genes: the euryarchaeal core. J Mol Evol 53:340-350.

Nicholson, W. L., N. Munakata, G. Horneck, H. J. Melosh, and P. Setlow. 2000. Resistance of Bacillus endospores to extreme terrestrial and extraterrestrial environments. Microbiology and Molecular Biology Reviews 64:548-572.

Nisbet, E. G., and N. H. Sleep. 2001. The habitat and nature of early life. Nature 409:1083-1091.

Ochman, H., and A. C. Wilson. 1987. Evolution in bacteria: evidence for a universal substitution rate in cellular genomes. Journal of Molecular Evolution 26:74-86.

Ohmoto, H., Y. Watanabe, and K. Kumazawa. 2004. Evidence from massive siderite beds for a CO(2)-rich atmosphere before ~1.8 billion years ago. Nature 429:395-399.

Olsen, G. J., C. R. Woese, and R. Overbeek. 1994. The winds of (evolutionary) change: breathing new life into microbiology. Journal of Bacteriology 176:1-6.

Omelchenko, M. V., Y. I. Wolf, E. K. Gaidamakova, V. Y. Matrosova, A. Vasilenko, M. Zhai, M. J. Daly, E. V. Koonin, and K. S. Makarova. 2005. Comparative genomics of Thermus thermophilus and Deinococcus radiodurans: divergent routes of adaptation to thermophily and radiation resistance. BMC Evol Biol 5:57.

Orphan, V. J., K. U. Hinrichs, W. Ussler, 3rd, C. K. Paull, L. T. Taylor, S. P. Sylva, J. M. Hayes, and E. F. DeLong. 2001a. Comparative analysis of methane-oxidizing archaea and sulfate-reducing bacteria in anoxic marine sediments. Applied and Environmental Microbiology 67:1922-1934.

Orphan, V. J., C. H. House, K. U. Hinrichs, K. D. McKeegan, and E. F. DeLong. 2001b. Methane-consuming archaea revealed by directly coupled isotopic and phylogenetic analysis. Science 293:484-487.

Orphan, V. J., C. H. House, K. U. Hinrichs, K. D. McKeegan, and E. F. DeLong. 2002. Multiple archaeal groups mediate methane oxidation in anoxic cold seep sediments. Proceedings of the National Academy of Sciences (U.S.A.) 99:7663-7668.

Pace, N. R. 2006. Time for a change. Nature 441:289.

Pavlov, A. A., M. T. Hurtgen, J. F. Kasting, and M. A. Arthur. 2003. Methane-rich Proterozoic atmosphere? Geology 31:87-90.

Pavlov, A. A., J. F. Kasting, L. L. Brown, K. A. Rages, and R. Freedman. 2000. Greenhouse warming by CH4 in the atmosphere of early Earth. Journal of Geophysical Research 105:11981-11990.

Philippe, H., F. Delsuc, H. Brinkmann, and N. Lartillot. 2005. Phylogenomics. Annual Review of Ecology Evolution and Systematics 36:541-562.

Philippe, H., and C. J. Douady. 2003. Horizontal gene transfer and phylogenetics. Current Opinion in Microbiology 6:498-505.

Philippe, H., P. Lopez, H. Brinkmann, K. Budin, A. Germot, J. Laurent, D. Moreira, M. Muller, and H. Le Guyader. 2000. Early-branching or fast-evolving eukaryotes? An answer based on slowly evolving positions. Proceedings of the Royal Society of London Series B-Biological Sciences 267:1213-1221.

Pisani, D., J. A. Cotton, and J. O. McInerney. 2007. Supertrees disentangle the chimerical origin of eukaryotic genomes. Molecular Biology and Evolution 24:1752-1760.

Potts, M. 1994. Desiccation tolerance of prokaryotes. Microbiol Rev 58:755-805.

60 Pupko, T., D. Huchon, Y. Cao, N. Okada, and M. Hasegawa. 2002. Combining multiple data sets in a likelihood analysis: which models are the best? Molecular Biology and Evolution 19:2294-2307. Raymond, J., O. Zhaxybayeva, J. P. Gogarten, and R. E. Blankenship. 2003. Evolution of photosynthetic prokaryotes: a maximum-likelihood mapping approach. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences 358:223-230. Raymond, J., O. Zhaxybayeva, J. P. Gogarten, S. Y. Gerdes, and R. E. Blankenship. 2002. Whole-genome analysis of photosynthetic prokaryotes. Science 298:1616- 1620. Rivera, M. C., R. Jain, J. E. Moore, and J. A. Lake. 1998. Genomic evidence for two functionally distinct gene classes. Proc Natl Acad Sci U S A 95:6239-6244. Rivera, M. C., and J. A. Lake. 1992. Evidence that eukaryotes and eocyte prokaryotes are immediate relatives. Science 257:74-76. Rivera, M. C., and J. A. Lake. 1996. The phylogeny of Methanopyrus kandleri. International Journal of Systematic Bacteriology 46:348-351. Robert, F., and M. Chaussidon. 2006. A palaeotemperature for the Precambrian oceans based on silicon isotopes in cherts. Nature 443:969-972. Rokitko, P. V., V. A. Romanovskaya, Y. R. Malashenko, N. A. Chernaya, N. I. Gushcha, and A. N. Mikheev. 2001. Soil drying as a model for the action of stress factors on natural bacterial populations. Microbiology 72:756-761. Ronaghi, M., M. Uhlen, and P. Nyren. 1998. A sequencing method based on real-time pyrophosphate. Science 281:363, 365. Ronquist, F., and J. P. Huelsenbeck. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572-1574. Rosing, M. T., D. K. Bird, N. H. Sleep, W. Glassley, and F. Albarede. 2006. The rise of continents - An essay on the geologic consequences of photosynthesis. Palaeogeography, Palaeoclimatology, Palaeoecology 232:99-113. Ruiz-Gonzalez, M. X., and I. Marin. 2004. New insights into the evolutionary history of type 1 rhodopsins. 
J Mol Evol 58:348-358. Sabehi, G., A. Loy, K. H. Jung, R. Partha, J. L. Spudich, T. Isaacson, J. Hirschberg, M. Wagner, and O. Beja. 2005. New insights into metabolic properties of marine bacteria encoding proteorhodopsins. PLoS Biol 3:e273. Sadekar, S., J. Raymond, and R. E. Blankenship. 2006. Conservation of distantly related membrane proteins: photosynthetic reaction centers share a common structural core. Molecular Biology and Evolution 23:2001-2007. Sanderson, M. 1997. A nonparametric approach to estimating divergence times in the absence of rate constancy. Molecular Biology and Evolution 14:1218-1231. Sanderson, M. 2002. Estimating absolute rates of molecular evolution and divergence times: a penalized likelihood approach. Molecular Biology and Evolution 19:101- 109. Schleper, C., G. Jurgens, and M. Jonuscheit. 2005. Genomic studies of uncultivated archaea. Nature Reviews Microbiology 3:479-488. Schloss, P. D., and J. Handelsman. 2004. Status of the microbial census. Microbiology and Molecular Biology Reviews 68:686-691.

61 Schopf, J. W. 1993. Microfossils of the Early Archean Apex Chert - New Evidence of the Antiquity of Life. Science 260:640-646. Schopf, J. W. 2006. Fossil evidence of Archaean life. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences 361:869-885. Schopf, J. W., A. B. Kurdryavtsev, D. G. Agresti, T. J. Wdowiak, and A. D. Czaja. 2002. Laser-Raman imagery of Earth's earliest fossils. Nature 416:73-76. Schwartzman, D. 1999. Life, temperature, and the Earth. Columbia University Press, New York. Sergeev, V. N., L. M. Gerasimenko, and G. A. Zavarzin. 2002. The Proterozoic history and present state of cyanobacteria. Microbiology 71:623-637. She, Q., R. K. Singh, F. Confalonieri, Y. Zivanovic, G. Allard, M. J. Awayez, C. C. Chan-Weiher, I. G. Clausen, B. A. Curtis, A. De Moors, G. Erauso, C. Fletcher, P. M. Gordon, I. Heikamp-de Jong, A. C. Jeffries, C. J. Kozera, N. Medina, X. Peng, H. P. Thi-Ngoc, P. Redder, M. E. Schenk, C. Theriault, N. Tolstrup, R. L. Charlebois, W. F. Doolittle, M. Duguet, T. Gaasterland, R. A. Garrett, M. A. Ragan, C. W. Sensen, and J. Van der Oost. 2001. The complete genome of the crenarchaeon Sulfolobus solfataricus P2. Proc Natl Acad Sci U S A 98:7835- 7840. Sheridan, P. P., K. H. Freeman, and J. E. Brenchley. 2003. Estimated minimal divergence times of the major bacterial and archaeal phyla. Geomicrobiology Journal 20:1- 14. Shin, Y. K., A. Hiraishi, and J. Sugiyama. 1993. Molecular systematics of the genus Zoogloea and emendation of the genus. Int J Syst Bacteriol 43:826-831. Shirkey, B., N. J. McMaster, S. C. Smith, D. J. Wright, H. Rodriguez, P. Jaruga, M. Birincioglu, R. F. Helm, and M. Potts. 2003. Genomic DNA of Nostoc commune (Cyanobacteria) becomes covalently modified during long-term (decades) desiccation but is protected from oxidative damage and degradation. Nucleic Acids Research 31:2995-3005. Shukla, M., R. Chaturvedi, D. Tamhane, P. Vyas, G. Archana, S. Apte, J. Bandekar, and A. Desai. 2007. 
Multiple-stress tolerance of ionizing radiation-resistant bacterial isolates obtained from various habitats: correlation between stresses. Current Microbiology 54:142-148. Slesarev, A. I., K. V. Mezhevaya, K. S. Makarova, N. N. Polushin, O. V. Shcherbinina, V. V. Shakhova, G. I. Belova, L. Aravind, D. A. Natale, I. B. Rogozin, R. L. Tatusov, Y. I. Wolf, K. O. Stetter, A. G. Malykh, E. V. Koonin, and S. A. Kozyavkin. 2002. The complete genome of Methanopyrus kandleri AV19 and monophyly of archaeal methanogens. Proceedings of the National Academy of Sciences U.S.A. 99:4644-4649. Sleep, N. H., K. J. Zahnle, J. F. Kasting, and H. J. Morowitz. 1989. Annihilation of ecosystems by large asteroid impacts on the early Earth. Nature 342:139-142. Snel, B., P. Bork, and M. A. Huynen. 1999. Genome phylogeny based on gene content. Nature Genetics 21:108-110.

62 Stamatakis, A. 2006. RAxML-VI-HPC: Maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22:2688-2690. Strimmer, K., and A. vonHaeseler. 1996. Quartet puzzling: A quartet maximum- likelihood method for reconstructing tree topologies. Molecular Biology and Evolution 13:964-969. Summons, R. E., L. L. Jahnke, J. M. Hope, and G. A. Logan. 1999. 2-Methylhopanoids as biomarkers for cyanobacterial oxygenic photosynthesis. Nature 400:554-557. Tamura, K., J. Dudley, M. Nei, and S. Kumar. 2007. MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Molecular Biology and Evolution 24:1596-1599. Tatusov, R. L., D. A. Natale, I. V. Garkavtsev, T. A. Tatusova, U. T. Shankavaram, B. S. Rao, B. Kiryutin, M. Y. Galperin, N. D. Fedorova, and E. V. Koonin. 2001. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Research 29:22-28. Teeling, H., T. Lombardot, M. Bauer, W. Ludwig, and F. O. Glockner. 2004. Evaluation of the phylogenetic position of the planctomycete 'Rhodopirellula baltica' SH 1 by means of concatenated ribosomal protein sequences, DNA-directed RNA polymerase subunit sequences and whole genome trees. Int J Syst Evol Microbiol 54:791-801. Tekaia, F., A. Lazcano, and B. Dujon. 1999. The genomic tree as revealed from whole proteome comparisons. Genome Research 9:550-557. Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22:4673-4680. Thorne, J. L., and H. Kishino. 2002. Divergence time and evolutionary rate estimation with multilocus data. Syst Biol 51:689-702. Thorne, J. L., H. Kishino, and I. S. Painter. 1998. Estimating the rate of evolution of the rate of molecular evolution. Molecular Biology and Evolution 15:1647-1657. Tice, M. 
M., Lowe, D.R. 2006. Hydrogen-based carbon fixation in the earliest known photosynthetic organisms. Geology 34:37-40. Ueda, K., M. Ohno, K. Yamamoto, H. Nara, Y. Mori, M. Shimada, M. Hayashi, H. Oida, Y. Terashima, M. Nagata, and T. Beppu. 2001. Distribution and diversity of symbiotic , Symbiobacterium thermophilum and related bacteria, in natural environments. Applied and Environmental Microbiology 67:3779-3784. Ueda, K., A. Yamashita, J. Ishikawa, M. Shimada, T. O. Watsuji, K. Morimura, H. Ikeda, M. Hattori, and T. Beppu. 2004. Genome sequence of Symbiobacterium thermophilum, an uncultivable bacterium that depends on microbial commensalism. Nucleic Acids Res 32:4937-4944. Ueno, Y., K. Yamada, N. Yoshida, S. Maruyama, and Y. Isozaki. 2006. Evidence from fluid inclusions for microbial methanogenesis in the early Archaean era. Nature 440:516-519. Van de Peer, Y., P. De Rijk, J. Wuyts, T. Winkelmans, and R. De wachter. 2000. The European small subunit ribosomal RNA database. Nucleic Acids Res 28:175-176.

63 van Niftrik, L. A., J. A. Fuerst, J. S. Sinninghe Damste, J. G. Kuenen, M. S. Jetten, and M. Strous. 2004. The anammoxosome: an intracytoplasmic compartment in anammox bacteria. FEMS Microbiology Letters 233:7-13. Wagner, M., and M. Horn. 2006. The Planctomycetes, Verrucomicrobia, Chlamydiae and sister phyla comprise a superphylum with biotechnological and medical relevance. Current Opinion in Biotechnology 17:241-249. Watanabe, Y., J. E. Martini, and H. Ohmoto. 2000. Geochemical evidence for terrestrial ecosystems 2.6 billion years ago. Nature 408:574-578. Waters, E., M. J. Hohn, I. Ahel, D. E. Graham, M. D. Adams, M. Barnstead, K. Y. Beeson, L. Bibbs, R. Bolanos, M. Keller, K. Kretz, X. Lin, E. Mathur, J. Ni, M. Podar, T. Richardson, G. G. Sutton, M. Simon, D. Soll, K. O. Stetter, J. M. Short, and M. Noordewier. 2003. The genome of Nanoarchaeum equitans: insights into early archaeal evolution and derived parasitism. Proc Natl Acad Sci U S A 100:12984-12988. Woese, C. R., and G. E. Fox. 1977. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proceedings of the National Academy of Sciences (U.S.A.) 74:5088-5090. Wolf, M., T. Muller, T. Dandekar, and J. D. Pollack. 2004. Phylogeny of Firmicutes with special reference to Mycoplasma (Mollicutes) as inferred from phosphoglycerate kinase amino acid sequence data. Int J Syst Evol Microbiol 54:871-875. Wolf, Y. I., I. B. Rogozin, N. V. Grishin, and E. V. Koonin. 2002. Genome trees and the tree of life. Trends in Genetics 18:472-479. Wolf, Y. I., I. B. Rogozin, N. V. Grishin, R. L. Tatusov, and E. V. Koonin. 2001. Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evolutionary Biology 1:8. Wuyts, J., G. Perriere, and Y. V. de Peer. 2004. The European ribosomal RNA database. Nucleic Acids Research 32:D101-D103. Wynn-Williams, D. D., H. G. Edwards, E. M. Newton, and J. M. Holder. 2002. 
Pigmentation as a survival strategy for ancient and modern photosynthetic microbes under high ultraviolet stress on planetary surfaces. International journal of Astrobiology 1:39-49. Xiong, J., W. M. Fischer, K. Inoue, M. Nakahara, and C. E. Bauer. 2000. Molecular evidence for the early evolution of photosynthesis. Science 289:1724-1730. Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. CABIOS 13:555-556. Zhaxybayeva, O., and J. P. Gogarten. 2002. Bootstrap, Bayesian probability and maximum likelihood mapping: exploring new tools for comparative genome analyses. BMC Genomics 3:4. Zhaxybayeva, O., J. P. Gogarten, R. L. Charlebois, W. F. Doolittle, and R. T. Papke. 2006. Phylogenetic analyses of cyanobacterial genomes: quantification of horizontal gene transfer events. Genome Research 16:1099-1108. Zuckerkandl, E., and L. Pauling. 1962. Molecular disease, evolution, and genetic heterogeneity. Pp. 189-225 in M. Marsha, and B. Pullman, eds. Horizons in Biochemistry. Academic Press, New York.

Appendix A

Supplementary information for Chapter 2

Table A1. Estimated times (Ma) for each node and calibration for Eubacteria and Archaebacteria. Node numbers refer to Figures A1 (Eubacteria) and A2 (Archaebacteria) in Appendix A. Eubacteria: nodes 54-106; Archaebacteria: nodes 18-34.

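| Calibration (Ma) | rttm (Ma) | Node 54 Time | Node 54 St.Dev. | Node 54 CI | Node 55 Time | Node 55 St.Dev. | Node 55 CI | Node 56 Time | Node 56 St.Dev. | Node 56 CI | Node 57 Time | Node 57 St.Dev. | Node 57 CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2300 | 2500 | 32 | 19 | 13-84 | 831 | 138 | 584-1122 | 1150 | 159 | 845-1479 | 1403 | 167 | 1067-1737 |
| 2300 | 3000 | 32 | 18 | 12-78 | 835 | 137 | 582-1113 | 1158 | 159 | 856-1465 | 1414 | 168 | 1082-1734 |
| 2300 | 3500 | 31 | 18 | 13-79 | 837 | 133 | 599-1119 | 1160 | 153 | 977-1472 | 1415 | 162 | 1107-1739 |
| 2300 | 4000 | 31 | 19 | 13-80 | 833 | 135 | 579-1112 | 1155 | 154 | 862-1457 | 1411 | 160 | 1101-1719 |
| 2300 | 4500 | 30 | 18 | 12-76 | 809 | 135 | 550-1077 | 1123 | 156 | 810-1425 | 1374 | 164 | 1042-1687 |
| 2300 | Average | 31 | | 13-79 | 829 | | 579-886 | 1149 | | 870-1460 | 1403 | | 1080-1723 |
| 2040-3080 | 2500 | 33 | 21 | 12-90 | 842 | 162 | 558-1194 | 1164 | 197 | 810-1596 | 1420 | 218 | 1030-1901 |
| 2040-3080 | 3000 | 32 | 20 | 12-85 | 844 | 164 | 553-1198 | 1171 | 197 | 813-1597 | 1430 | 217 | 1042-1901 |
| 2040-3080 | 3500 | 33 | 21 | 12-87 | 858 | 171 | 558-1228 | 1187 | 208 | 815-1630 | 1449 | 230 | 1033-1938 |
| 2040-3080 | 4000 | 34 | 22 | 13-90 | 877 | 167 | 583-1234 | 1215 | 201 | 862-1646 | 1480 | 222 | 1085-1951 |
| 2040-3080 | 4500 | 33 | 20 | 13-83 | 871 | 165 | 577-1218 | 1207 | 198 | 849-1624 | 1474 | 220 | 1076-1934 |
| 2040-3080 | Average | 33 | | 12-87 | 858 | | 566-1214 | 1189 | | 830-1619 | 1451 | | 1053-1925 |
| 2300 min | 2500 | 36 | 23 | 14-95 | 918 | 166 | 616-1265 | 1269 | 197 | 895-1680 | 1545 | 213 | 1141-1993 |
| 2300 min | 3000 | 37 | 25 | 14-102 | 938 | 173 | 620-1305 | 1293 | 203 | 916-1716 | 1571 | 218 | 1165-2021 |
| 2300 min | 3500 | 36 | 23 | 13-98 | 918 | 166 | 626-1276 | 1272 | 192 | 921-1677 | 1550 | 205 | 1174-1979 |
| 2300 min | 4000 | 36 | 21 | 14-94 | 926 | 162 | 630-1261 | 1282 | 192 | 924-1677 | 1564 | 208 | 1170-1994 |
| 2300 min | 4500 | 36 | 23 | 14-100 | 934 | 168 | 627-1290 | 1292 | 198 | 923-1701 | 1575 | 213 | 1178-2012 |
| 2300 min | Average | 36 | | 14-98 | 927 | | 624-1279 | 1282 | | 916-1690 | 1561 | | 1166-2000 |
| 2700 min | 2500 | 43 | 29 | 15-121 | 1051 | 185 | 705-1424 | 1444 | 211 | 1030-1860 | 1753 | 221 | 1314-2181 |
| 2700 min | 3000 | 42 | 28 | 15-115 | 1043 | 176 | 710-1387 | 1435 | 201 | 1043-1823 | 1745 | 209 | 1328-2149 |
| 2700 min | 3500 | 45 | 30 | 15-126 | 1060 | 176 | 704-1396 | 1455 | 200 | 1033-1826 | 1765 | 207 | 1312-2140 |
| 2700 min | 4000 | 42 | 28 | 15-112 | 1050 | 175 | 712-1403 | 1448 | 199 | 1055-1843 | 1760 | 206 | 1349-2163 |
| 2700 min | 4500 | 43 | 29 | 16-120 | 1051 | 173 | 730-1407 | 1445 | 196 | 1069-1841 | 1755 | 205 | 1356-2157 |
| 2700 min | Average | 43 | | 15-119 | 1051 | | 712-1403 | 1445 | | 1046-1839 | 1756 | | 1332-2158 |
| Tot Average | | 36 | | 14-96 | 916 | | 620-1196 | 1266 | | 915-1652 | 1543 | | 1158-1561 |

| Calibration (Ma) | rttm (Ma) | Node 58 Time | Node 58 St.Dev. | Node 58 CI | Node 59 Time | Node 59 St.Dev. | Node 59 CI | Node 60 Time | Node 60 St.Dev. | Node 60 CI | Node 61 Time | Node 61 St.Dev. | Node 61 CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2300 | 2500 | 294 | 80 | 168-482 | 641 | 136 | 411-951 | 1633 | 164 | 1295-1948 | 151 | 39 | 91-243 |
| 2300 | 3000 | 296 | 79 | 169-474 | 646 | 135 | 413-934 | 1646 | 166 | 1306-1954 | 151 | 37 | 89-234 |
| 2300 | 3500 | 295 | 79 | 171-479 | 645 | 134 | 421-943 | 1646 | 161 | 1332-1956 | 153 | 38 | 91-239 |
| 2300 | 4000 | 295 | 81 | 169-483 | 642 | 136 | 417-937 | 1644 | 153 | 1338-1933 | 153 | 38 | 95-239 |
| 2300 | 4500 | 282 | 76 | 160-458 | 616 | 130 | 392-903 | 1607 | 158 | 1283-1907 | 147 | 36 | 89-232 |
| 2300 | Average | 292 | | 167-475 | 638 | | 411-934 | 1635 | | 1311-1940 | 151 | | 91-237 |
| 2040-3080 | 2500 | 299 | 90 | 160-510 | 649 | 155 | 393-999 | 1652 | 228 | 1256-2150 | 153 | 42 | 87-249 |
| 2040-3080 | 3000 | 298 | 88 | 166-506 | 648 | 152 | 402-992 | 1668 | 227 | 1264-2155 | 157 | 41 | 91-247 |
| 2040-3080 | 3500 | 303 | 91 | 165-512 | 659 | 161 | 399-1023 | 1689 | 241 | 1258-2197 | 159 | 43 | 92-261 |
| 2040-3080 | 4000 | 308 | 89 | 174-515 | 673 | 154 | 422-1029 | 1722 | 233 | 1316-2210 | 164 | 44 | 97-268 |
| 2040-3080 | 4500 | 307 | 89 | 170-519 | 670 | 154 | 416-1015 | 1717 | 232 | 1302-2204 | 160 | 42 | 94-255 |
| 2040-3080 | Average | 303 | | 167-512 | 660 | | 406-1012 | 1690 | | 1279-2183 | 159 | | 92-256 |
| 2300 min | 2500 | 325 | 97 | 174-550 | 705 | 163 | 429-1070 | 1799 | 216 | 1397-2257 | 171 | 46 | 100-276 |
| 2300 min | 3000 | 335 | 100 | 182-570 | 727 | 170 | 447-1108 | 1821 | 221 | 1406-2276 | 171 | 48 | 95-280 |
| 2300 min | 3500 | 323 | 90 | 186-540 | 705 | 156 | 456-1061 | 1804 | 210 | 1419-2239 | 168 | 43 | 98-265 |
| 2300 min | 4000 | 328 | 92 | 180-538 | 713 | 157 | 442-1058 | 1823 | 212 | 1425-2257 | 171 | 45 | 98-277 |
| 2300 min | 4500 | 327 | 94 | 179-548 | 713 | 162 | 446-1078 | 1834 | 217 | 1427-2274 | 175 | 45 | 102-277 |
| 2300 min | Average | 328 | | 180-549 | 713 | | 444-1075 | 1816 | | 1415-2261 | 171 | | 99-275 |
| 2700 min | 2500 | 378 | 113 | 200-645 | 813 | 187 | 491-1228 | 2033 | 215 | 1600-2444 | 199 | 51 | 118-317 |
| 2700 min | 3000 | 371 | 108 | 204-622 | 802 | 178 | 499-1188 | 2028 | 203 | 1619-2415 | 199 | 50 | 119-313 |
| 2700 min | 3500 | 380 | 103 | 212-607 | 819 | 171 | 510-1171 | 2044 | 202 | 1598-2414 | 198 | 49 | 116-305 |
| 2700 min | 4000 | 376 | 107 | 211-603 | 812 | 177 | 513-1214 | 2042 | 199 | 1640-2427 | 200 | 52 | 118-318 |
| 2700 min | 4500 | 374 | 107 | 213-639 | 809 | 176 | 522-1222 | 2036 | 200 | 1632-2420 | 195 | 50 | 113-309 |
| 2700 min | Average | 376 | | 208-623 | 811 | | 507-1205 | 2037 | | 1618-2424 | 198 | | 117-312 |
| Tot Average | | 325 | | 181-540 | 705 | | 442-1057 | 1794 | | 1406-2202 | 170 | | 100-270 |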

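| Calibration (Ma) | rttm (Ma) | Node 62 Time | Node 62 St.Dev. | Node 62 CI | Node 63 Time | Node 63 St.Dev. | Node 63 CI | Node 64 Time | Node 64 St.Dev. | Node 64 CI | Node 65 Time | Node 65 St.Dev. | Node 65 CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2300 | 2500 | 884 | 152 | 615-1194 | 1345 | 165 | 1031-1669 | 2073 | 128 | 1802-2315 | 957 | 142 | 664-1217 |
| 2300 | 3000 | 887 | 147 | 599-1189 | 1352 | 165 | 1025-1672 | 2089 | 133 | 1816-2341 | 971 | 140 | 681-1224 |
| 2300 | 3500 | 893 | 146 | 626-1194 | 1357 | 160 | 1052-1679 | 2088 | 129 | 1823-2339 | 962 | 138 | 683-1222 |
| 2300 | 4000 | 895 | 139 | 645-1188 | 1359 | 149 | 1081-1663 | 2089 | 116 | 1849-2297 | 973 | 132 | 702-1218 |
| 2300 | 4500 | 866 | 136 | 612-1154 | 1324 | 149 | 1040-1620 | 2062 | 119 | 1823-2287 | 953 | 129 | 698-1199 |
| 2300 | Average | 885 | | 619-1184 | 1347 | | 1046-1661 | 2080 | | 1823-2316 | 963 | | 686-1216 |
| 2040-3080 | 2500 | 893 | 171 | 597-1267 | 1359 | 206 | 1002-1810 | 2099 | 232 | 1729-2617 | 979 | 168 | 674-1338 |
| 2040-3080 | 3000 | 912 | 166 | 610-1259 | 1381 | 203 | 1020-1813 | 2126 | 237 | 1738-2639 | 989 | 172 | 676-1352 |
| 2040-3080 | 3500 | 924 | 176 | 620-1307 | 1399 | 216 | 1026-1863 | 2149 | 246 | 1745-2661 | 995 | 174 | 674-1363 |
| 2040-3080 | 4000 | 946 | 174 | 643-1324 | 1427 | 211 | 1056-1873 | 2182 | 245 | 1768-2686 | 1022 | 177 | 697-1388 |
| 2040-3080 | 4500 | 933 | 170 | 625-1297 | 1417 | 209 | 1043-1855 | 2179 | 245 | 1768-2688 | 1011 | 171 | 706-1370 |
| 2040-3080 | Average | 922 | | 619-1291 | 1397 | | 1029-1843 | 2147 | | 1750-2126 | 999 | | 685-1362 |
| 2300 min | 2500 | 986 | 176 | 667-1350 | 1490 | 201 | 1118-1913 | 2284 | 200 | 1940-2712 | 1068 | 169 | 748-1404 |
| 2300 min | 3000 | 989 | 186 | 649-1373 | 1498 | 212 | 1112-1933 | 2298 | 209 | 1927-2734 | 1065 | 164 | 755-1401 |
| 2300 min | 3500 | 978 | 172 | 669-1336 | 1487 | 198 | 1117-1899 | 2292 | 200 | 1943-2709 | 1063 | 168 | 743-1406 |
| 2300 min | 4000 | 992 | 174 | 671-1355 | 1506 | 199 | 1128-1914 | 2318 | 203 | 1954-2734 | 1080 | 167 | 763-1420 |
| 2300 min | 4500 | 1011 | 178 | 679-1374 | 1524 | 205 | 1130-1940 | 2331 | 208 | 1956-2754 | 1098 | 170 | 781-1439 |
| 2300 min | Average | 991 | | 667-1358 | 1501 | | 1121-1920 | 2305 | | 1944-2729 | 1075 | | 758-1414 |
| 2700 min | 2500 | 1133 | 189 | 786-1521 | 1696 | 207 | 1295-2110 | 2557 | 166 | 2216-2876 | 1203 | 175 | 849-1535 |
| 2700 min | 3000 | 1136 | 179 | 804-1488 | 1698 | 193 | 1324-2071 | 2561 | 156 | 2249-2863 | 1214 | 166 | 887-1532 |
| 2700 min | 3500 | 1130 | 182 | 781-1483 | 1695 | 197 | 1299-2072 | 2565 | 163 | 2230-2875 | 1211 | 168 | 875-1524 |
| 2700 min | 4000 | 1137 | 186 | 797-1515 | 1701 | 197 | 1317-2087 | 2568 | 156 | 2265-2872 | 1206 | 175 | 848-1528 |
| 2700 min | 4500 | 1123 | 183 | 760-1484 | 1689 | 198 | 1293-2069 | 2565 | 159 | 2252-2874 | 1207 | 173 | 854-1528 |
| 2700 min | Average | 1132 | | 786-1498 | 1696 | | 1306-2082 | 2563 | | 2242-2872 | 1208 | | 863-1529 |
| Tot Average | | 982 | | 673-1333 | 1485 | | 1126-1877 | 2274 | | 1940-2511 | 1061 | | 748-1380 |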

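| Calibration (Ma) | rttm (Ma) | Node 66 Time | Node 66 St.Dev. | Node 66 CI | Node 67 Time | Node 67 St.Dev. | Node 67 CI | Node 68 Time | Node 68 St.Dev. | Node 68 CI | Node 69 Time | Node 69 St.Dev. | Node 69 CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2300 | 2500 | 1913 | 140 | 1609-2162 | 2423 | 94 | 2250-2612 | 2642 | 88 | 2504-2835 | 665 | 115 | 452-891 |
| 2300 | 3000 | 1930 | 143 | 1611-2183 | 2443 | 103 | 2258-2666 | 2665 | 101 | 2513-2895 | 643 | 118 | 427-877 |
| 2300 | 3500 | 1924 | 142 | 1629-2184 | 2441 | 102 | 2262-2659 | 2663 | 100 | 2507-2895 | 647 | 116 | 436-879 |
| 2300 | 4000 | 1930 | 127 | 1659-2159 | 2438 | 86 | 2277-2613 | 2657 | 84 | 2514-2841 | 660 | 117 | 440-893 |
| 2300 | 4500 | 1907 | 127 | 1641-2140 | 2425 | 87 | 2265-2609 | 2649 | 83 | 2510-2836 | 664 | 113 | 449-883 |
| 2300 | Average | 1921 | | 1630-2166 | 2434 | | 2262-2632 | 2655 | | 2510-2860 | 656 | | 441-885 |
| 2040-3080 | 2500 | 1942 | 226 | 1563-2438 | 2452 | 244 | 2097-2994 | 2669 | 260 | 2300-3248 | 684 | 140 | 445-991 |
| 2040-3080 | 3000 | 1965 | 233 | 1570-2470 | 2485 | 252 | 2105-3022 | 2705 | 267 | 2304-3271 | 695 | 146 | 444-1010 |
| 2040-3080 | 3500 | 1984 | 240 | 1569-2477 | 2511 | 258 | 2112-3037 | 2736 | 274 | 2316-3291 | 692 | 143 | 443-1005 |
| 2040-3080 | 4000 | 2018 | 240 | 1605-2518 | 2540 | 259 | 2126-3063 | 2762 | 273 | 2326-3304 | 699 | 151 | 438-1019 |
| 2040-3080 | 4500 | 2013 | 240 | 1601-2508 | 2546 | 260 | 2124-3067 | 2774 | 275 | 2330-3311 | 702 | 146 | 451-1017 |
| 2040-3080 | Average | 1984 | | 1582-2482 | 2507 | | 2113-3077 | 2729 | | 2315-3285 | 694 | | 444-1008 |
| 2300 min | 2500 | 2115 | 201 | 1752-2533 | 2663 | 197 | 2356-3089 | 2895 | 205 | 2581-3335 | 743 | 147 | 483-1058 |
| 2300 min | 3000 | 2123 | 206 | 1754-2555 | 2676 | 206 | 2348-3106 | 2908 | 214 | 2574-3348 | 774 | 151 | 493-1083 |
| 2300 min | 3500 | 2117 | 201 | 1751-2535 | 2679 | 198 | 2360-3096 | 2917 | 206 | 2585-3341 | 747 | 152 | 474-1064 |
| 2300 min | 4000 | 2143 | 203 | 1765-2560 | 2706 | 203 | 2363-3115 | 2942 | 210 | 2592-3357 | 756 | 149 | 480-1063 |
| 2300 min | 4500 | 2160 | 209 | 1779-2579 | 2717 | 206 | 2375-3135 | 2953 | 213 | 2601-3377 | 758 | 154 | 488-1084 |
| 2300 min | Average | 2132 | | 1760-2552 | 2688 | | 2360-3108 | 2923 | | 2587-3352 | 756 | | 484-1070 |
| 2700 min | 2500 | 2366 | 175 | 2001-2697 | 2961 | 131 | 2720-3230 | 3205 | 123 | 2993-3467 | 863 | 149 | 578-1155 |
| 2700 min | 3000 | 2377 | 165 | 2046-2693 | 2968 | 128 | 2737-3231 | 3211 | 124 | 2994-3470 | 872 | 155 | 583-1174 |
| 2700 min | 3500 | 2376 | 173 | 2019-2702 | 2969 | 131 | 2730-3237 | 3214 | 125 | 2998-3473 | 858 | 155 | 558-1162 |
| 2700 min | 4000 | 2377 | 168 | 2032-2699 | 2973 | 128 | 2743-3239 | 3216 | 124 | 3000-3474 | 869 | 152 | 583-1176 |
| 2700 min | 4500 | 2376 | 170 | 2017-2698 | 2976 | 129 | 2741-3240 | 3222 | 124 | 3000-3475 | 858 | 156 | 559-1165 |
| 2700 min | Average | 2374 | | 2023-2698 | 2969 | | 2734-3235 | 3214 | | 2997-3472 | 864 | | 572-1166 |
| Tot Average | | 2103 | | 1749-2475 | 2650 | | 2367-3013 | 2880 | | 2602-3242 | 742 | | 485-1032 |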

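| Calibration (Ma) | rttm (Ma) | Node 70 Time | Node 70 St.Dev. | Node 70 CI | Node 71 Time | Node 71 St.Dev. | Node 71 CI | Node 72 Time | Node 72 St.Dev. | Node 72 CI | Node 73 Time | Node 73 St.Dev. | Node 73 CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2300 | 2500 | 916 | 134 | 656-1166 | 2300 | 1 | 2299-2301 | 246 | 69 | 139-409 | 821 | 121 | 589-1065 |
| 2300 | 3000 | 888 | 140 | 626-1156 | 2300 | 1 | 2299-2301 | 243 | 69 | 136-406 | 810 | 121 | 570-1044 |
| 2300 | 3500 | 893 | 137 | 634-1153 | 2300 | 1 | 2299-2301 | 243 | 69 | 134-405 | 809 | 123 | 566-1051 |
| 2300 | 4000 | 909 | 137 | 646-1167 | 2300 | 1 | 2299-2301 | 239 | 66 | 138-390 | 805 | 119 | 577-1036 |
| 2300 | 4500 | 915 | 133 | 652-1159 | 2300 | 1 | 2299-2301 | 239 | 64 | 140-389 | 808 | 115 | 590-1040 |
| 2300 | Average | 904 | | 643-1160 | 2300 | | 2299-2301 | 242 | | 137-400 | 811 | | 578-1047 |
| 2040-3080 | 2500 | 939 | 170 | 645-1308 | 2328 | 225 | 2049-2851 | 249 | 78 | 134-436 | 829 | 156 | 556-1178 |
| 2040-3080 | 3000 | 955 | 179 | 644-1339 | 2361 | 234 | 2052-2887 | 255 | 79 | 137-449 | 848 | 161 | 565-1209 |
| 2040-3080 | 3500 | 952 | 177 | 635-1325 | 2383 | 239 | 2054-2894 | 254 | 77 | 137-436 | 850 | 157 | 566-1193 |
| 2040-3080 | 4000 | 962 | 185 | 636-1352 | 2407 | 242 | 2055-2921 | 259 | 78 | 142-442 | 864 | 163 | 589-1222 |
| 2040-3080 | 4500 | 965 | 180 | 649-1347 | 2414 | 244 | 2058-2925 | 256 | 75 | 140-432 | 860 | 158 | 582-1198 |
| 2040-3080 | Average | 955 | | 642-1334 | 2379 | | 2054-2896 | 255 | | 138-439 | 850 | | 572-1200 |
| 2300 min | 2500 | 1021 | 177 | 699-1390 | 2531 | 177 | 2307-2942 | 275 | 83 | 151-474 | 915 | 159 | 632-1257 |
| 2300 min | 3000 | 1061 | 179 | 715-1418 | 2556 | 188 | 2309-2973 | 284 | 85 | 158-486 | 939 | 161 | 655-1282 |
| 2300 min | 3500 | 1028 | 183 | 691-1401 | 2550 | 182 | 2309-2966 | 278 | 84 | 153-476 | 926 | 163 | 640-1273 |
| 2300 min | 4000 | 1040 | 181 | 696-1406 | 2570 | 186 | 2311-2970 | 278 | 83 | 152-474 | 927 | 160 | 641-1272 |
| 2300 min | 4500 | 1043 | 185 | 707-1423 | 2581 | 190 | 2312-2992 | 282 | 84 | 158-481 | 934 | 161 | 651-1288 |
| 2300 min | Average | 1039 | | 702-1408 | 2558 | | 2310-2969 | 279 | | 154-478 | 928 | | 644-1274 |
| 2700 min | 2500 | 1183 | 172 | 843-1503 | 2835 | 104 | 2705-3085 | 327 | 94 | 182-545 | 1063 | 157 | 764-1379 |
| 2700 min | 3000 | 1193 | 178 | 844-1526 | 2841 | 106 | 2706-3089 | 327 | 96 | 179-554 | 1064 | 164 | 755-1398 |
| 2700 min | 3500 | 1179 | 180 | 816-1511 | 2838 | 106 | 2705-3088 | 323 | 94 | 183-549 | 1059 | 157 | 764-1378 |
| 2700 min | 4000 | 1191 | 176 | 852-1528 | 2844 | 107 | 2706-3099 | 328 | 97 | 178-560 | 1065 | 163 | 750-1395 |
| 2700 min | 4500 | 1176 | 180 | 820-1517 | 2843 | 107 | 2706-3092 | 325 | 99 | 177-557 | 1059 | 168 | 740-1406 |
| 2700 min | Average | 1184 | | 835-1517 | 2840 | | 2706-3091 | 326 | | 180-553 | 1062 | | 755-1391 |
| Tot Average | | 1020 | | 706-1355 | 2519 | | 2342-2814 | 276 | | 152-468 | 913 | | 637-1228 |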

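| Calibration (Ma) | rttm (Ma) | Node 74 Time | Node 74 St.Dev. | Node 74 CI | Node 75 Time | Node 75 St.Dev. | Node 75 CI | Node 76 Time | Node 76 St.Dev. | Node 76 CI | Node 77 Time | Node 77 St.Dev. | Node 77 CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2300 | 2500 | 1222 | 133 | 970-1481 | 2511 | 52 | 2432-2629 | 2764 | 98 | 2609-2980 | 114 | 53 | 49-253 |
| 2300 | 3000 | 1207 | 133 | 927-1453 | 2525 | 60 | 2437-2664 | 2792 | 113 | 2622-3049 | 118 | 54 | 51-253 |
| 2300 | 3500 | 1205 | 137 | 926-1467 | 2523 | 59 | 2434-2665 | 2790 | 112 | 2614-3043 | 116 | 49 | 51-243 |
| 2300 | 4000 | 1203 | 131 | 945-1458 | 2519 | 51 | 2436-2633 | 2782 | 95 | 2619-2985 | 113 | 49 | 52-244 |
| 2300 | 4500 | 1208 | 128 | 962-1463 | 2515 | 50 | 2435-2631 | 2776 | 92 | 2617-2980 | 115 | 52 | 52-249 |
| 2300 | Average | 1209 | | 946-1464 | 2519 | | 2435-2644 | 2781 | | 2616-3007 | 115 | | 51-248 |
| 2040-3080 | 2500 | 1237 | 192 | 906-1669 | 2538 | 244 | 2214-3090 | 2791 | 273 | 2401-3388 | 120 | 57 | 52-273 |
| 2040-3080 | 3000 | 1263 | 200 | 922-1713 | 2574 | 252 | 2217-3124 | 2828 | 279 | 2406-3413 | 121 | 56 | 52-260 |
| 2040-3080 | 3500 | 1267 | 198 | 922-1698 | 2601 | 258 | 2226-3138 | 2862 | 286 | 2420-3433 | 114 | 49 | 51-239 |
| 2040-3080 | 4000 | 1285 | 205 | 936-1739 | 2627 | 259 | 2231-3153 | 2888 | 284 | 2429-3442 | 123 | 55 | 54-266 |
| 2040-3080 | 4500 | 1282 | 202 | 930-1717 | 2636 | 261 | 2237-3159 | 2902 | 286 | 2435-3448 | 118 | 51 | 52-253 |
| 2040-3080 | Average | 1267 | | 923-1707 | 2595 | | 2225-3133 | 2854 | | 2418-3425 | 119 | | 52-258 |
| 2300 min | 2500 | 1361 | 189 | 1032-1771 | 2757 | 191 | 2486-3179 | 3024 | 214 | 2689-3472 | 130 | 60 | 57-283 |
| 2300 min | 3000 | 1393 | 194 | 1051-1804 | 2774 | 200 | 2482-3205 | 3037 | 222 | 2687-3489 | 132 | 55 | 58-269 |
| 2300 min | 3500 | 1375 | 197 | 1033-1803 | 2778 | 193 | 2491-3197 | 3050 | 214 | 2694-3482 | 125 | 55 | 55-270 |
| 2300 min | 4000 | 1380 | 191 | 1037-1794 | 2801 | 197 | 2494-3204 | 3074 | 218 | 2704-3493 | 128 | 59 | 55-282 |
| 2300 min | 4500 | 1389 | 198 | 1038-1817 | 2812 | 201 | 2499-3229 | 3085 | 220 | 2713-3514 | 133 | 56 | 57-274 |
| 2300 min | Average | 1380 | | 1038-1798 | 2784 | | 2490-3203 | 3054 | | 2697-3490 | 130 | | 56-276 |
| 2700 min | 2500 | 1572 | 176 | 1237-1926 | 3068 | 111 | 2896-3318 | 3339 | 126 | 3117-3599 | 153 | 70 | 66-333 |
| 2700 min | 3000 | 1575 | 186 | 1224-1949 | 3073 | 113 | 2896-3318 | 3344 | 127 | 3113-3602 | 155 | 70 | 65-337 |
| 2700 min | 3500 | 1567 | 178 | 1225-1934 | 3073 | 113 | 2897-3321 | 3347 | 128 | 3119-3608 | 151 | 68 | 66-332 |
| 2700 min | 4000 | 1575 | 183 | 1219-1942 | 3077 | 113 | 2900-3332 | 3350 | 127 | 3119-3606 | 157 | 72 | 65-342 |
| 2700 min | 4500 | 1566 | 189 | 1195-1942 | 3080 | 113 | 2901-3325 | 3357 | 126 | 3125-3607 | 150 | 68 | 66-324 |
| 2700 min | Average | 1571 | | 1220-1939 | 3074 | | 2898-3323 | 3347 | | 3119-3604 | 153 | | 66-334 |
| Tot Average | | 1357 | | 1032-1727 | 2743 | | 2512-3076 | 3009 | | 2713-3382 | 129 | | 56-279 |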

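| Calibration (Ma) | rttm (Ma) | Node 78 Time | Node 78 St.Dev. | Node 78 CI | Node 79 Time | Node 79 St.Dev. | Node 79 CI | Node 80 Time | Node 80 St.Dev. | Node 80 CI | Node 81 Time | Node 81 St.Dev. | Node 81 CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2300 | 2500 | 527 | 173 | 287-950 | 1675 | 170 | 1326-2005 | 2514 | 110 | 2317-2738 | 1116 | 174 | 782-1466 |
| 2300 | 3000 | 545 | 174 | 292-957 | 1698 | 181 | 1330-2035 | 2545 | 123 | 2334-2822 | 1141 | 184 | 804-1540 |
| 2300 | 3500 | 537 | 161 | 291-914 | 1695 | 172 | 1328-2000 | 2543 | 119 | 2334-2802 | 1134 | 176 | 806-1503 |
| 2300 | 4000 | 524 | 158 | 296-910 | 1676 | 169 | 1313-1980 | 2531 | 101 | 2346-2744 | 1127 | 179 | 791-1483 |
| 2300 | 4500 | 531 | 167 | 294-946 | 1680 | 170 | 1327-2003 | 2526 | 103 | 2339-2742 | 1115 | 169 | 787-1449 |
| 2300 | Average | 533 | | 292-935 | 1685 | | 1325-2005 | 2532 | | 2334-2770 | 1127 | | 794-1488 |
| 2040-3080 | 2500 | 549 | 184 | 291-1003 | 1708 | 232 | 1302-2205 | 2546 | 256 | 2167-3108 | 1152 | 207 | 803-1612 |
| 2040-3080 | 3000 | 556 | 185 | 292-991 | 1726 | 243 | 1277-2225 | 2577 | 263 | 2164-3126 | 1148 | 212 | 768-1602 |
| 2040-3080 | 3500 | 532 | 164 | 292-926 | 1725 | 240 | 1290-2213 | 2603 | 268 | 2179-3146 | 1165 | 212 | 797-1632 |
| 2040-3080 | 4000 | 564 | 180 | 303-999 | 1763 | 243 | 1321-2257 | 2632 | 268 | 2194-3162 | 1174 | 219 | 789-1640 |
| 2040-3080 | 4500 | 548 | 171 | 297-958 | 1753 | 247 | 1293-2254 | 2642 | 270 | 2195-3169 | 1170 | 216 | 794-1635 |
| 2040-3080 | Average | 550 | | 295-975 | 1735 | | 1297-2231 | 2600 | | 2180-3142 | 1162 | | 790-1624 |
| 2300 min | 2500 | 595 | 192 | 324-1070 | 1849 | 220 | 1431-2306 | 2757 | 208 | 2415-3200 | 1234 | 213 | 846-1692 |
| 2300 min | 3000 | 603 | 179 | 330-1010 | 1865 | 217 | 1450-2306 | 2771 | 213 | 2422-3205 | 1238 | 210 | 854-1683 |
| 2300 min | 3500 | 580 | 181 | 314-1026 | 1843 | 226 | 1396-2289 | 2773 | 209 | 2418-3198 | 1216 | 212 | 841-1669 |
| 2300 min | 4000 | 588 | 190 | 318-1066 | 1865 | 225 | 1433-2315 | 2799 | 211 | 2435-3217 | 1259 | 216 | 867-1712 |
| 2300 min | 4500 | 609 | 185 | 325-1037 | 1893 | 224 | 1455-2327 | 2816 | 212 | 2450-3242 | 1275 | 216 | 871-1716 |
| 2300 min | Average | 595 | | 322-1042 | 1863 | | 1433-2309 | 2783 | | 2428-3212 | 1244 | | 856-1694 |
| 2700 min | 2500 | 688 | 217 | 372-1199 | 2069 | 211 | 1642-2468 | 3053 | 136 | 2806-3334 | 1404 | 222 | 986-1846 |
| 2700 min | 3000 | 698 | 219 | 370-1224 | 2083 | 213 | 1644-2475 | 3060 | 137 | 2803-3331 | 1422 | 219 | 989-1853 |
| 2700 min | 3500 | 682 | 213 | 373-1213 | 2073 | 207 | 1654-2472 | 3061 | 136 | 2810-3337 | 1416 | 230 | 988-1879 |
| 2700 min | 4000 | 702 | 223 | 368-1251 | 2075 | 217 | 1603-2477 | 3063 | 137 | 2803-3338 | 1391 | 213 | 987-1827 |
| 2700 min | 4500 | 681 | 211 | 377-1189 | 2072 | 212 | 1630-2465 | 3068 | 134 | 2818-3337 | 1415 | 225 | 1001-1877 |
| 2700 min | Average | 690 | | 372-1215 | 2074 | | 1635-2471 | 3061 | | 2808-3335 | 1410 | | 990-1856 |
| Tot Average | | 592 | | 320-1042 | 1839 | | 1423-2254 | 2744 | | 2438-3115 | 1236 | | 858-1666 |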

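| Calibration (Ma) | rttm (Ma) | Node 82 Time | Node 82 St.Dev. | Node 82 CI | Node 83 Time | Node 83 St.Dev. | Node 83 CI | Node 84 Time | Node 84 St.Dev. | Node 84 CI | Node 85 Time | Node 85 St.Dev. | Node 85 CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2300 | 2500 | 470 | 103 | 284-683 | 619 | 124 | 386-871 | 1377 | 165 | 1033-1684 | 219 | 96 | 97-464 |
| 2300 | 3000 | 480 | 102 | 300-697 | 631 | 123 | 408-884 | 1402 | 165 | 1068-1724 | 225 | 102 | 100-486 |
| 2300 | 3500 | 472 | 100 | 297-683 | 622 | 121 | 407-866 | 1390 | 160 | 1069-1701 | 219 | 97 | 99-469 |
| 2300 | 4000 | 479 | 99 | 304-684 | 630 | 119 | 413-868 | 1398 | 156 | 1077-1689 | 228 | 100 | 101-484 |
| 2300 | 4500 | 474 | 96 | 297-677 | 623 | 116 | 405-856 | 1386 | 152 | 1080-1668 | 221 | 95 | 100-467 |
| 2300 | Average | 475 | | 296-685 | 625 | | 404-869 | 1391 | | 1065-1693 | 222 | | 99-474 |
| 2040-3080 | 2500 | 487 | 112 | 292-735 | 641 | 137 | 397-938 | 1418 | 204 | 1051-1861 | 238 | 106 | 101-509 |
| 2040-3080 | 3000 | 484 | 113 | 294-728 | 636 | 138 | 399-931 | 1418 | 210 | 1036-1859 | 224 | 100 | 100-467 |
| 2040-3080 | 3500 | 491 | 114 | 294-741 | 647 | 140 | 398-947 | 1442 | 214 | 1046-1886 | 237 | 107 | 101-508 |
| 2040-3080 | 4000 | 491 | 116 | 296-744 | 647 | 141 | 402-945 | 1449 | 216 | 1055-1895 | 233 | 108 | 101-506 |
| 2040-3080 | 4500 | 489 | 115 | 290-733 | 644 | 141 | 396-937 | 1443 | 218 | 1040-1888 | 227 | 101 | 97-485 |
| 2040-3080 | Average | 488 | | 293-736 | 643 | | 398-940 | 1434 | | 1046-1878 | 232 | | 80-495 |
| 2300 min | 2500 | 517 | 112 | 322-755 | 681 | 137 | 437-965 | 1517 | 196 | 1141-1920 | 244 | 105 | 106-510 |
| 2300 min | 3000 | 520 | 115 | 312-762 | 684 | 139 | 422-966 | 1521 | 198 | 1132-1922 | 243 | 105 | 107-506 |
| 2300 min | 3500 | 516 | 115 | 315-756 | 680 | 140 | 427-967 | 1518 | 199 | 1133-1921 | 244 | 107 | 108-530 |
| 2300 min | 4000 | 532 | 117 | 328-781 | 700 | 142 | 446-997 | 1554 | 203 | 1164-1967 | 257 | 117 | 110-551 |
| 2300 min | 4500 | 538 | 118 | 328-783 | 708 | 144 | 448-1000 | 1575 | 203 | 1187-1981 | 263 | 114 | 114-544 |
| 2300 min | Average | 525 | | 321-767 | 691 | | 436-979 | 1537 | | 1151-1942 | 250 | | 109-528 |
| 2700 min | 2500 | 591 | 125 | 361-847 | 776 | 150 | 489-1074 | 1711 | 195 | 1291-2073 | 285 | 124 | 120-602 |
| 2700 min | 3000 | 607 | 124 | 384-865 | 796 | 148 | 523-1090 | 1739 | 190 | 1353-2103 | 303 | 138 | 127-660 |
| 2700 min | 3500 | 600 | 125 | 379-863 | 788 | 149 | 516-1089 | 1732 | 190 | 1354-2103 | 293 | 130 | 128-617 |
| 2700 min | 4000 | 591 | 122 | 372-852 | 776 | 146 | 508-1079 | 1712 | 188 | 1325-2070 | 286 | 123 | 126-592 |
| 2700 min | 4500 | 600 | 129 | 369-868 | 788 | 154 | 505-1101 | 1729 | 198 | 1341-2113 | 296 | 136 | 125-649 |
| 2700 min | Average | 598 | | 373-859 | 785 | | 508-1087 | 1725 | | 1333-2092 | 293 | | 125-624 |
| Tot Average | | 521 | | 321-762 | 686 | | 437-969 | 1522 | | 1149-1901 | 249 | | 103-530 |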

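| Calibration (Ma) | rttm (Ma) | Node 86 Time | Node 86 St.Dev. | Node 86 CI | Node 87 Time | Node 87 St.Dev. | Node 87 CI | Node 88 Time | Node 88 St.Dev. | Node 88 CI | Node 89 Time | Node 89 St.Dev. | Node 89 CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2300 | 2500 | 1825 | 150 | 1507-2109 | 1321 | 160 | 992-1621 | 94 | 39 | 42-190 | 483 | 117 | 287-738 |
| 2300 | 3000 | 1854 | 154 | 1547-2165 | 1340 | 162 | 1023-1675 | 97 | 43 | 44-204 | 405 | 126 | 296-790 |
| 2300 | 3500 | 1846 | 147 | 1545-2143 | 1331 | 158 | 1015-1632 | 94 | 41 | 42-197 | 484 | 120 | 292-754 |
| 2300 | 4000 | 1846 | 138 | 1563-2109 | 1335 | 148 | 1049-1622 | 95 | 42 | 44-195 | 491 | 121 | 302-767 |
| 2300 | 4500 | 1835 | 136 | 1568-2099 | 1327 | 145 | 1045-1612 | 95 | 40 | 44-196 | 488 | 116 | 308-764 |
| 2300 | Average | 1841 | | 1546-2125 | 1331 | | 1025-1632 | 95 | | 43-196 | 470 | | 297-763 |
| 2040-3080 | 2500 | 1865 | 222 | 1497-2355 | 1343 | 199 | 989-1773 | 98 | 46 | 42-216 | 499 | 139 | 288-824 |
| 2040-3080 | 3000 | 1877 | 230 | 1481-2365 | 1363 | 199 | 1017-1794 | 99 | 45 | 43-211 | 504 | 134 | 296-818 |
| 2040-3080 | 3500 | 1905 | 233 | 1500-2391 | 1375 | 205 | 1012-1812 | 98 | 43 | 43-207 | 504 | 134 | 294-816 |
| 2040-3080 | 4000 | 1919 | 235 | 1510-2400 | 1393 | 207 | 1027-1837 | 101 | 46 | 44-219 | 515 | 138 | 301-835 |
| 2040-3080 | 4500 | 1919 | 237 | 1492-2398 | 1386 | 208 | 1019-1831 | 98 | 43 | 44-208 | 506 | 135 | 301-823 |
| 2040-3080 | Average | 1897 | | 1496-2382 | 1372 | | 1013-1809 | 99 | | 43-212 | 506 | | 296-823 |
| 2300 min | 2500 | 2006 | 199 | 1652-2424 | 1450 | 186 | 1117-1832 | 102 | 42 | 46-209 | 527 | 129 | 322-815 |
| 2300 min | 3000 | 2014 | 201 | 1649-2436 | 1460 | 191 | 1110-1863 | 107 | 46 | 47-221 | 542 | 134 | 325-840 |
| 2300 min | 3500 | 2009 | 205 | 1621-2433 | 1452 | 202 | 1077-1874 | 105 | 47 | 46-229 | 537 | 142 | 315-874 |
| 2300 min | 4000 | 2050 | 206 | 1670-2467 | 1480 | 195 | 1120-1886 | 106 | 45 | 48-224 | 545 | 138 | 329-862 |
| 2300 min | 4500 | 2073 | 206 | 1690-2485 | 1510 | 195 | 1148-1911 | 111 | 50 | 49-239 | 563 | 144 | 334-890 |
| 2300 min | Average | 2030 | | 1656-2449 | 1470 | | 1114-1873 | 106 | | 47-224 | 543 | | 325-856 |
| 2700 min | 2500 | 2249 | 180 | 1867-2588 | 1641 | 194 | 1271-2024 | 122 | 55 | 53-262 | 615 | 159 | 367-976 |
| 2700 min | 3000 | 2270 | 175 | 1920-2609 | 1656 | 193 | 1282-2042 | 127 | 59 | 54-281 | 625 | 163 | 364-1005 |
| 2700 min | 3500 | 2269 | 177 | 1921-2620 | 1661 | 202 | 1282-2066 | 127 | 61 | 53-284 | 629 | 168 | 368-1015 |
| 2700 min | 4000 | 2249 | 173 | 1897-2589 | 1630 | 188 | 1263-2006 | 119 | 51 | 53-244 | 601 | 148 | 364-939 |
| 2700 min | 4500 | 2266 | 179 | 1909-2611 | 1650 | 190 | 1279-2021 | 124 | 54 | 55-262 | 619 | 154 | 373-968 |
| 2700 min | Average | 2261 | | 1903-2603 | 1648 | | 1275-2032 | 124 | | 54-267 | 618 | | 367-981 |
| Tot Average | | 2007 | | 1650-2390 | 1455 | | 1107-1837 | 106 | | 47-225 | 534 | | 249-856 |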

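| Calibration (Ma) | rttm (Ma) | Node 90 Time | Node 90 St.Dev. | Node 90 CI | Node 91 Time | Node 91 St.Dev. | Node 91 CI | Node 92 Time | Node 92 St.Dev. | Node 92 CI | Node 93 Time | Node 93 St.Dev. | Node 93 CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2300 | 2500 | 192 | 42 | 123-287 | 5 | 3 | 1-12 | 91 | 26 | 52-154 | 371 | 71 | 246-524 |
| 2300 | 3000 | 194 | 44 | 126-297 | 5 | 3 | 1-12 | 93 | 28 | 53-162 | 377 | 75 | 251-549 |
| 2300 | 3500 | 193 | 43 | 124-294 | 5 | 3 | 1-12 | 92 | 27 | 51-155 | 374 | 74 | 246-534 |
| 2300 | 4000 | 193 | 42 | 128-287 | 5 | 3 | 1-12 | 92 | 27 | 53-154 | 375 | 71 | 256-531 |
| 2300 | 4500 | 192 | 41 | 126-284 | 5 | 3 | 1-12 | 92 | 26 | 52-152 | 372 | 70 | 254-526 |
| 2300 | Average | 193 | | 125-290 | 5 | | 1-12 | 92 | | 52-155 | 374 | | 251-533 |
| 2040-3080 | 2500 | 196 | 48 | 122-309 | 6 | 3 | 1-13 | 94 | 29 | 51-164 | 380 | 83 | 244-570 |
| 2040-3080 | 3000 | 197 | 49 | 121-311 | 6 | 3 | 1-13 | 94 | 30 | 51-167 | 382 | 85 | 245-573 |
| 2040-3080 | 3500 | 198 | 48 | 122-309 | 6 | 3 | 1-13 | 95 | 29 | 52-165 | 385 | 85 | 247-573 |
| 2040-3080 | 4000 | 202 | 48 | 127-317 | 6 | 3 | 1-13 | 97 | 30 | 53-169 | 392 | 84 | 252-584 |
| 2040-3080 | 4500 | 200 | 48 | 125-315 | 6 | 3 | 1-13 | 95 | 29 | 53-166 | 386 | 84 | 248-578 |
| 2040-3080 | Average | 199 | | 123-312 | 6 | | 1-13 | 95 | | 52-166 | 385 | | 247-576 |
| 2300 min | 2500 | 209 | 47 | 134-316 | 6 | 3 | 1-13 | 100 | 29 | 56-169 | 406 | 81 | 269-581 |
| 2300 min | 3000 | 213 | 48 | 135-323 | 6 | 3 | 1-13 | 102 | 30 | 56-174 | 412 | 83 | 271-595 |
| 2300 min | 3500 | 212 | 50 | 133-327 | 6 | 3 | 1-14 | 101 | 31 | 56-176 | 410 | 88 | 265-612 |
| 2300 min | 4000 | 215 | 49 | 137-329 | 6 | 3 | 1-14 | 103 | 31 | 58-177 | 417 | 86 | 276-611 |
| 2300 min | 4500 | 221 | 52 | 141-342 | 6 | 3 | 1-14 | 106 | 32 | 59-182 | 428 | 88 | 283-624 |
| 2300 min | Average | 214 | | 136-327 | 6 | | 1-14 | 102 | | 57-176 | 415 | | 273-605 |
| 2700 min | 2500 | 244 | 57 | 155-378 | 7 | 4 | 2-16 | 119 | 36 | 65-206 | 472 | 97 | 313-686 |
| 2700 min | 3000 | 246 | 57 | 156-377 | 7 | 4 | 2-16 | 120 | 37 | 65-208 | 476 | 96 | 313-685 |
| 2700 min | 3500 | 245 | 57 | 154-376 | 7 | 4 | 2-16 | 119 | 37 | 66-207 | 474 | 97 | 312-689 |
| 2700 min | 4000 | 239 | 53 | 153-361 | 7 | 6 | 2-15 | 115 | 34 | 64-197 | 462 | 90 | 307-660 |
| 2700 min | 4500 | 243 | 55 | 156-371 | 7 | 6 | 2-15 | 118 | 35 | 66-200 | 472 | 93 | 316-678 |
| 2700 min | Average | 243 | | 155-373 | 7 | | 2-16 | 118 | | 65-204 | 471 | | 312-680 |
| Tot Average | | 212 | | 135-326 | 6 | | 1-14 | 102 | | 57-175 | 411 | | 271-599 |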

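| Calibration (Ma) | rttm (Ma) | Node 94 Time | Node 94 St.Dev. | Node 94 CI | Node 95 Time | Node 95 St.Dev. | Node 95 CI | Node 96 Time | Node 96 St.Dev. | Node 96 CI | Node 97 Time | Node 97 St.Dev. | Node 97 CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2300 | 2500 | 627 | 97 | 442-824 | 721 | 106 | 517-932 | 772 | 110 | 557-990 | 1245 | 146 | 946-1520 |
| 2300 | 3000 | 636 | 102 | 456-859 | 731 | 110 | 533-970 | 784 | 114 | 576-1028 | 1265 | 151 | 981-1588 |
| 2300 | 3500 | 632 | 100 | 445-834 | 727 | 108 | 521-942 | 778 | 112 | 563-998 | 1255 | 115 | 966-1535 |
| 2300 | 4000 | 633 | 94 | 466-835 | 729 | 102 | 544-944 | 781 | 106 | 588-1002 | 1260 | 136 | 1004-1531 |
| 2300 | 4500 | 629 | 93 | 463-825 | 724 | 100 | 541-934 | 776 | 104 | 587-992 | 1253 | 133 | 1000-1520 |
| 2300 | Average | 631 | | 454-835 | 726 | | 531-944 | 778 | | 574-1002 | 1256 | | 979-1539 |
| 2040-3080 | 2500 | 640 | 116 | 438-895 | 736 | 128 | 512-1016 | 788 | 134 | 552-1075 | 1270 | 188 | 946-1675 |
| 2040-3080 | 3000 | 645 | 119 | 443-906 | 742 | 130 | 518-1025 | 796 | 136 | 563-1092 | 1285 | 188 | 967-1693 |
| 2040-3080 | 3500 | 651 | 120 | 447-906 | 749 | 132 | 524-1030 | 803 | 138 | 567-1095 | 1297 | 192 | 963-1708 |
| 2040-3080 | 4000 | 662 | 119 | 459-923 | 762 | 131 | 538-1048 | 816 | 137 | 580-1115 | 1315 | 193 | 976-1728 |
| 2040-3080 | 4500 | 653 | 118 | 451-916 | 752 | 130 | 529-1036 | 806 | 134 | 570-1098 | 1305 | 191 | 973-1707 |
| 2040-3080 | Average | 650 | | 448-909 | 748 | | 524-1031 | 802 | | 566-1095 | 1294 | | 965-1702 |
| 2300 min | 2500 | 685 | 111 | 488-918 | 789 | 121 | 571-1041 | 845 | 126 | 621-1106 | 1365 | 171 | 1061-1719 |
| 2300 min | 3000 | 694 | 113 | 493-943 | 798 | 124 | 578-1065 | 855 | 129 | 623-1134 | 1379 | 175 | 1059-1751 |
| 2300 min | 3500 | 692 | 121 | 481-959 | 796 | 132 | 562-1085 | 853 | 138 | 606-1150 | 1372 | 187 | 1026-1767 |
| 2300 min | 4000 | 703 | 117 | 502-958 | 809 | 128 | 585-1085 | 866 | 133 | 632-1154 | 1397 | 180 | 1068-1775 |
| 2300 min | 4500 | 720 | 119 | 509-976 | 827 | 130 | 596-1106 | 886 | 135 | 647-1173 | 1424 | 181 | 1086-1805 |
| 2300 min | Average | 699 | | 495-951 | 804 | | 578-1076 | 861 | | 626-1143 | 1387 | | 1060-1763 |
| 2700 min | 2500 | 789 | 127 | 566-1060 | 905 | 137 | 659-1195 | 968 | 142 | 711-1263 | 1549 | 181 | 1216-1909 |
| 2700 min | 3000 | 795 | 126 | 568-1061 | 913 | 136 | 661-1195 | 977 | 140 | 715-1264 | 1561 | 179 | 1217-1927 |
| 2700 min | 3500 | 794 | 127 | 567-1064 | 912 | 137 | 662-1195 | 976 | 143 | 717-1269 | 1566 | 186 | 1222-1942 |
| 2700 min | 4000 | 777 | 118 | 560-1029 | 892 | 128 | 653-1163 | 955 | 132 | 709-1235 | 1535 | 171 | 1204-1879 |
| 2700 min | 4500 | 790 | 121 | 573-1042 | 907 | 130 | 671-1172 | 971 | 134 | 725-1241 | 1558 | 174 | 1225-1896 |
| 2700 min | Average | 789 | | 567-1051 | 906 | | 661-1184 | 969 | | 715-1254 | 1554 | | 1217-1911 |
| Tot Average | | 692 | | 491-937 | 796 | | 574-1059 | 853 | | 620-1124 | 1373 | | 1055-1729 |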

                             Node 98                   Node 99                   Node 100                  Node 101
Calibration (Ma)  rttm (Ma)  Time  St.Dev.  CI         Time  St.Dev.  CI         Time  St.Dev.  CI         Time  St.Dev.  CI
2300              2500       1574  155      1256-1864  1751  153      1436-2039  2262  122      2003-2501  2530  108      2334-2750
2300              3000       1597  158      1286-1931  1776  156      1461-2098  2294  130      2051-2561  2561  120      2364-2835
2300              3500       1588  152      1274-1883  1768  150      1458-2063  2289  124      2055-2557  2558  117      2365-2819
2300              4000       1592  141      1312-1862  1769  139      1491-2034  2281  109      2065-2501  2548  99       2368-2755
2300              4500       1583  139      1316-1849  1760  137      1494-2023  2272  109      2059-2499  2541  98       2360-2754
2300              Average    1587           1289-1878  1765           1468-2051  2280           2047-2524  2548           2358-2783
2040-3080         2500       1602  210      1240-2057  1781  219      1413-2263  2298  240      1923-2827  2562  256      2183-3126
2040-3080         3000       1622  213      1255-2082  1802  224      1420-2286  2321  248      1921-2844  2592  263      2184-3146
2040-3080         3500       1639  219      1258-2092  1823  229      1425-2301  2351  251      1945-2870  2623  268      2202-3165
2040-3080         4000       1659  219      1274-2115  1842  230      1444-2317  2372  253      1944-2879  2648  268      2207-3176
2040-3080         4500       1653  219      1265-2116  1840  230      1430-2314  2379  253      1949-2883  2660  269      2215-3180
2040-3080         Average    1635           1258-2092  1818           1426-2296  2344           1936-2861  2617           2198-3159
2300 min          2500       1727  190      1393-2125  1921  197      1576-2334  2482  200      2152-2910  2772  206      2443-3206
2300 min          3000       1739  194      1382-2147  1931  200      1568-2351  2493  204      2150-2916  2785  211      2441-3219
2300 min          3500       1730  206      1346-2163  1923  209      1524-2359  2490  206      2127-2912  2788  207      2439-3210
2300 min          4000       1764  200      1401-2180  1962  205      1585-2391  2527  205      2162-2941  2819  209      2459-3231
2300 min          4500       1793  199      1426-2199  1989  205      1614-2405  2546  206      2179-2960  2834  211      2476-3249
2300 min          Average    1751           1390-2163  1945           1573-2368  2508           2154-2928  2800           2452-3223
2700 min          2500       1947  187      1583-2317  2160  184      1796-2516  2763  149      2469-3059  3070  133      2822-3345
2700 min          3000       1962  185      1601-2331  2175  180      1816-2532  2774  146      2490-3061  3078  133      2828-3347
2700 min          3500       1967  193      1601-2350  2179  188      1822-2552  2776  148      2491-3068  3080  134      2827-3351
2700 min          4000       1937  180      1573-2294  2152  178      1798-2503  2765  144      2486-3050  3078  132      2828-3343
2700 min          4500       1959  183      1597-2309  2173  181      1813-2521  2778  147      2494-3065  3087  132      2845-3352
2700 min          Average    1954           1591-2320  2168           1809-2525  2771           2486-3061  3079           2830-3348
Tot Average                  1732           1382-2113  1924           1569-2310  2476           2156-2844  2761           2460-3128

                             Node 102                  Node 103                  Node 104                  Node 105
Calibration (Ma)  rttm (Ma)  Time  St.Dev.  CI         Time  St.Dev.  CI         Time  St.Dev.  CI         Time  St.Dev.  CI
2300              2500       2721  106      2548-2951  2805  109      2627-3038  2889  115      2704-3140  3319  164      3044-3686
2300              3000       2752  121      2564-3029  2835  125      2645-3118  2921  132      2718-3219  3362  183      3067-3772
2300              3500       2750  120      2560-3016  2834  125      2638-3108  2920  131      2712-3211  3363  182      3053-3766
2300              4000       2741  102      2566-2957  2825  106      2642-3055  2912  113      2717-3157  3361  167      3069-3723
2300              4500       2734  99       2560-2951  2818  103      2638-3047  2905  109      2715-3145  3353  160      3066-3699
2300              Average    2740           2560-2981  2823           2638-3073  2909           2713-3174  3352           3060-3729
2040-3080         2500       2750  271      2353-3339  2832  279      2425-3436  2915  287      2495-3536  3339  334      2828-4038
2040-3080         3000       2785  277      2358-3362  2868  284      2429-3456  2953  292      2499-3557  3385  338      2836-4057
2040-3080         3500       2818  283      2376-3383  2903  291      2446-3481  2990  299      2520-3580  3431  344      2866-4081
2040-3080         4000       2844  282      2385-3393  2930  289      2458-3489  3017  297      2528-3586  3462  342      2874-4092
2040-3080         4500       2859  284      2391-3401  2945  292      2463-3500  3033  299      2536-3593  3482  344      2884-4100
2040-3080         Average    2811           2373-3376  2896           2444-3472  2982           2516-3570  3420           2858-4074
2300 min          2500       2977  214      2633-3427  3065  218      2715-3520  3154  224      2792-3619  3604  260      3158-4110
2300 min          3000       2990  221      2631-3437  3078  227      2708-3533  3167  232      2787-3630  3619  270      3146-4125
2300 min          3500       3002  214      2639-3433  3093  219      2720-3533  3185  224      2799-3627  3648  261      3172-4129
2300 min          4000       3027  217      2653-3447  3117  222      2730-3547  3208  228      2808-3641  3671  263      3188-4139
2300 min          4500       3039  220      2665-3463  3128  224      2742-3561  3218  229      2821-3653  3679  264      3198-4147
2300 min          Average    3007           2644-3441  3096           2723-3539  3186           2801-3634  3644           3172-4130
2700 min          2500       3286  130      3053-3555  3379  131      3142-3647  3471  133      3227-3739  3932  154      3619-4209
2700 min          3000       3292  131      3051-3553  3385  132      3139-3649  3477  134      3226-3741  3937  155      3629-4219
2700 min          3500       3295  131      3057-3556  3388  133      3147-3649  3480  134      3233-3741  3942  154      3634-4217
2700 min          4000       3297  130      3056-3554  3391  132      3144-3651  3483  133      3230-3742  3948  154      3632-4223
2700 min          4500       3304  129      3065-3560  3398  131      3153-3653  3491  133      3240-3743  3957  151      3650-4222
2700 min          Average    3295           3056-3556  3388           3145-3650  3480           3231-3741  3943           3633-4218
Tot Average                  2963           2658-3339  3051           2738-3434  3139           2815-3530  3590           3181-4038

                             Node 106
Calibration (Ma)  rttm (Ma)  Time  St.Dev.  CI
2300              2500       3639  209      3277-4099
2300              3000       3690  229      3309-4201
2300              3500       3696  226      3300-4200
2300              4000       3709  221      3319-4202
2300              4500       3699  210      3330-4152
2300              Average    3687           3307-4171
2040-3080         2500       3652  370      3060-4412
2040-3080         3000       3705  373      3077-4426
2040-3080         3500       3758  377      3119-4448
2040-3080         4000       3794  374      3130-4452
2040-3080         4500       3817  375      3143-4454
2040-3080         Average    3745           3106-4438
2300 min          2500       3931  287      3409-4452
2300 min          3000       3945  295      3401-4459
2300 min          3500       3983  287      3440-4467
2300 min          4000       4010  287      3452-4471
2300 min          4500       4016  286      3467-4472
2300 min          Average    3977           3434-4464
2700 min          2500       4246  166      3885-4487
2700 min          3000       4250  164      3891-4487
2700 min          3500       4256  161      3896-4489
2700 min          4000       4267  160      3907-4489
2700 min          4500       4274  156      3921-4491
2700 min          Average    4259           3900-4489
Tot Average                  3917           3437-4391

                             Node 18                   Node 19
Calibration (Ma)  rttm (Ma)  Time  St.Dev.  CI         Time  St.Dev.  CI
1609              2500       1547  227      1034-1914  3167  200      2709-3470
1609              3000       1547  235      1034-1928  3175  196      2714-3464
1609              3500       1519  236      1004-1913  3141  225      2599-3464
1609              4000       1531  245      985-1925   3178  196      2735-3471
1609              4500       1540  239      1014-1923  3182  209      2635-3469
1609              Average    1537           1014-1921  3169           2678-3468
1489-1729         2500       1523  232      1013-1910  3132  222      2612-3468
1489-1729         3000       1534  239      994-1925   3155  210      2671-3472
1489-1729         3500       1532  244      982-1925   3167  206      2678-3472
1489-1729         4000       1552  234      1026-1929  3191  192      2738-3475
1489-1729         4500       1548  238      1018-1936  3198  188      2745-3475
1489-1729         Average    1538           1007-1925  3169           2689-3472
1174-1222         2500       1323  221      859-1724   2925  242      2377-3308
1174-1222         3000       1327  229      849-1755   2931  252      2341-3324
1174-1222         3500       1335  222      892-1737   2967  225      2447-3328
1174-1222         4000       1336  225      889-1754   2983  210      2510-3324
1174-1222         4500       1341  226      894-1761   2995  209      2519-3333
1174-1222         Average    1332           877-1746   2960           2439-3323
Tot Average                  1469           966-1864   3099           2602-3554

Node 20 Node 21 Node 22 Node 23

Calibration (Ma)  rttm (Ma)  Time  St.Dev.  Time  St.Dev.  Time  St.Dev.  Time  St.Dev.  CI         Time  St.Dev.  CI
1609              2500       3668  214      3668  214      3668  214      1609  1        1608-1610  3957  245      3339-4267
1609              3000       3680  200      3680  200      3680  200      1609  1        1608-1610  3970  229      3402-4264
1609              3500       3642  249      3642  249      3642  249      1609  1        1608-1610  3927  286      3187-4264
1609              4000       3691  204      3691  204      3691  204      1609  1        1608-1610  3985  234      3404-4270
1609              4500       3691  227      3691  227      3691  227      1609  1        1608-1610  3985  261      3219-4271
1609              Average    3674           3674           3674           1609           1608-1611  3965           3310-4267
1489-1729         2500       3632  248      3632  248      3632  248      1571  63       1492-1712  3924  285      3214-4275
1489-1729         3000       3660  225      3660  225      3660  225      1571  63       1492-1711  3956  257      3324-4277
1489-1729         3500       3677  220      3677  220      3677  220      1572  63       1492-1712  3976  251      3337-4282
1489-1729         4000       3702  201      3702  201      3702  201      1571  63       1492-1711  4004  229      3408-4280
1489-1729         4500       3714  196      3714  196      3714  196      1572  64       1492-1711  4019  223      3443-4283
1489-1729         Average    3677           3677           3677           1571           1492-1711  3976           3345-4279
1174-1222         2500       3451  271      3451  271      3451  271      1196  14       1175-1220  3808  323      3025-4224
1174-1222         3000       3462  283      3462  283      3462  283      1196  14       1175-1220  3824  334      2987-4228
1174-1222         3500       3508  243      3508  243      3508  243      1196  14       1175-1221  3880  286      3158-4229
1174-1222         4000       3534  217      3534  217      3534  217      1196  14       1175-1220  3911  254      3283-4232
1174-1222         4500       3546  216      3546  216      3546  216      1195  14       1173-1221  3926  252      3295-4233
1174-1222         Average    3500           3500           3500           1196           1175-1220  3870           3150-4229
Tot Average                  3617           3617           3617           1459           1425-1514  3937           3268-4258

                             Node 24                   Node 25                   Node 26                   Node 27
Calibration (Ma)  rttm (Ma)  Time  St.Dev.  CI         Time  St.Dev.  CI         Time  St.Dev.  CI         Time  St.Dev.  CI
1609              2500       423   197      146-895    870   348      336-1656   3215  220      2700-3572  3447  228      2899-3792
1609              3000       426   201      149-916    878   357      340-1702   3226  219      2717-3571  3459  223      2933-3790
1609              3500       403   181      146-841    834   322      334-1564   3178  258      2534-3554  3412  270      2722-3784
1609              4000       432   202      151-919    888   355      351-1697   3238  224      2729-3592  3472  229      2936-3809
1609              4500       410   186      145-850    849   332      332-1601   3223  237      2599-3573  3460  247      2785-3795
1609              Average    419            147-884    864            339-1644   3216           2656-3572  3450           2855-3794
1489-1729         2500       429   196      153-901    880   341      355-1655   3203  250      2618-3586  3434  263      2803-3810
1489-1729         3000       441   203      150-933    905   356      343-1693   3232  237      2687-3598  3464  245      2890-3821
1489-1729         3500       430   195      153-902    888   343      353-1644   3246  232      2702-3599  3479  240      2900-3826
1489-1729         4000       426   195      148-901    880   346      341-1678   3260  219      2746-3599  3497  224      2956-3826
1489-1729         4500       436   203      151-929    900   356      350-1703   3282  214      2787-3617  3518  219      2989-3838
1489-1729         Average    432            151-913    891            348-1675   3245           2708-3600  3478           2908-3824
1174-1222         2500       336   146      137-703    709   272      313-1359   3078  277      2456-3506  3326  296      2644-3754
1174-1222         3000       340   146      139-696    718   271      322-1371   3091  291      2389-3516  3340  311      2583-3765
1174-1222         3500       337   142      142-694    714   265      324-1344   3132  255      2523-3517  3386  270      2720-3770
1174-1222         4000       341   148      143-705    720   275      325-1363   3155  237      2619-3535  3413  246      2839-3782
1174-1222         4500       338   145      143-701    715   269      329-1355   3165  233      2624-3527  3425  242      2844-3771
1174-1222         Average    338            141-700    715            323-1358   3124           2522-3520  3378           2726-3768
Tot Average                  397            146-832    823            337-1559   3195           2629-3564  3435           2830-3795

                             Node 28                   Node 29                   Node 30                   Node 31
Calibration (Ma)  rttm (Ma)  Time  St.Dev.  CI         Time  St.Dev.  CI         Time  St.Dev.  CI         Time  St.Dev.  CI
1609              2500       564   251      240-1208   273   84       126-448    2733  247      2217-3183  3169  229      2669-3564
1609              3000       581   287      239-1335   283   85       128-462    2752  270      2169-3230  3182  243      2639-3584
1609              3500       562   265      228-1239   275   84       123-451    2708  291      2024-3191  3135  278      2465-3557
1609              4000       578   284      236-1344   281   84       130-454    2758  275      2171-3243  3191  249      2646-3603
1609              4500       570   267      234-1253   277   84       121-450    2744  278      2095-3191  3179  257      2535-3569
1609              Average    571            235-1276   278            126-453    2739           2135-3208  3171           2591-3575
1489-1729         2500       601   305      241-1400   277   85       133-459    2742  277      2161-3239  3166  265      2567-3798
1489-1729         3000       594   273      250-1288   285   84       135-459    2771  265      2203-3230  3195  251      2630-3603
1489-1729         3500       600   288      245-1349   284   86       131-466    2780  268      2202-3247  3211  249      2633-3607
1489-1729         4000       561   247      240-1181   280   83       132-450    2774  259      2211-3225  3214  238      2677-3600
1489-1729         4500       594   283      251-1349   284   84       137-461    2807  253      2266-3260  3243  231      2720-3629
1489-1729         Average    590            245-1313   282            134-459    2775           2209-3240  3206           2645-3647
1174-1222         2500       463   196      231-998    231   72       117-389    2588  278      2002-3083  3039  285      2405-3499
1174-1222         3000       464   190      229-977    233   70       116-382    2600  291      1948-3094  3051  300      2342-3512
1174-1222         3500       463   179      234-923    232   70       117-386    2631  272      2031-3108  3093  267      2464-3517
1174-1222         4000       465   192      240-967    235   71       117-390    2654  261      2088-3118  3118  249      2557-3524
1174-1222         4500       454   174      241-924    232   69       121-384    2654  251      2116-3109  3124  244      2579-3518
1174-1222         Average    462            235-958    233            118-386    2625           2037-3102  3085           2469-3514
Tot Average                  541            238-1182   264            126-433    2713           2127-3183  3154           2568-3579

                             Node 32                   Node 33                   Node 34
Calibration (Ma)  rttm (Ma)  Time  St.Dev.  CI         Time  St.Dev.  CI         Time  St.Dev.  CI
1609              2500       3626  234      3058-3955  3833  246      3219-4159  4149  266      3477-4472
1609              3000       3637  224      3101-3951  3846  231      3282-4157  4164  247      3550-4469
1609              3500       3590  276      2874-3946  3798  288      3049-4153  4117  310      3310-4472
1609              4000       3650  229      3104-3965  3860  236      3285-4163  4182  253      3549-4477
1609              4500       3642  254      2931-3959  3855  264      3096-4163  4182  283      3349-4477
1609              Average    3629           3014-3955  3838           3186-4159  4159           3447-4473
1489-1729         2500       3607  273      2944-3973  3808  287      3096-4175  4119  310      3341-4486
1489-1729         3000       3637  250      3042-3981  3840  261      3205-4180  4153  279      3468-4487
1489-1729         3500       3655  245      3053-3984  3860  255      3222-4183  4175  272      3482-4489
1489-1729         4000       3677  226      3111-3984  3886  233      3293-4187  4206  248      3566-4489
1489-1729         4500       3696  221      3144-3992  3904  229      3326-4188  4223  242      3592-4491
1489-1729         Average    3654           3059-3983  3860           3228-4183  4175           3490-4488
1174-1222         2500       3510  311      2775-3934  3720  329      2933-4152  4042  356      3171-4481
1174-1222         3000       3525  324      2723-3944  3736  341      2886-4158  4061  368      3137-4485
1174-1222         3500       3575  281      2876-3954  3791  294      3047-4168  4123  315      3319-4486
1174-1222         4000       3604  253      3002-3957  3822  263      3186-4169  4158  280      3469-4489
1174-1222         4500       3618  250      3002-3953  3838  260      3183-4169  4177  276      3474-4489
1174-1222         Average    3566           2876-3948  3781           3047-4163  4112           3314-4486
Tot Average                  3617           2983-3962  3826           3154-4168  4149           3417-4482

Table A2 Mean of the prior distribution for the rate of molecular evolution of the ingroup root node (rtrate) in Eubacteria and Archaebacteria.

EUBACTERIA                                   ARCHAEBACTERIA
Rttm (ingroup root constraint)   Rtrate      Rttm (ingroup root constraint)   Rtrate
2500 Ma                          0.034       2500 Ma                          0.026
3000 Ma                          0.028       3000 Ma                          0.022
3500 Ma                          0.024       3500 Ma                          0.019
4000 Ma                          0.02        4000 Ma                          0.016
4500 Ma                          0.019       4500 Ma                          0.014

Table A3 Divergence time estimates and percentage difference due to different ingroup root constraints used under each calibration point. Node numbers refer to Figure A1 (Eubacteria) and A2 (Archaebacteria) in Appendix A.

EUBACTERIA
Calibration (Ma)  rttm (Ma)  Node 68   Node 70   Node 71   Node 75   Node 76   Node 92   Node 100  Node 101  Node 104  Node 105  Node 106
                             Time (Ma) Time (Ma) Time (Ma) Time (Ma) Time (Ma) Time (Ma) Time (Ma) Time (Ma) Time (Ma) Time (Ma) Time (Ma)
2300              2500       2642      916       2300      2511      2764      91        2262      2530      2889      3319      3639
2300              3000       2665      888       2300      2525      2792      93        2294      2561      2921      3362      3690
2300              3500       2663      893       2300      2523      2790      92        2289      2558      2920      3363      3696
2300              4000       2657      909       2300      2519      2782      92        2281      2548      2912      3361      3709
2300              4500       2649      915       2300      2515      2776      92        2272      2541      2905      3353      3699
Difference        44%        0.60%     3.10%     0.00%     0.50%     1.00%     2.20%     1.40%     1.20%     1.10%     1.30%     1.90%
2040-3080         2500       2669      939       2328      2538      2791      94        2298      2562      2915      3339      3652
2040-3080         3000       2705      955       2361      2574      2828      94        2321      2592      2953      3385      3705
2040-3080         3500       2736      952       2383      2601      2862      95        2351      2623      2990      3431      3758
2040-3080         4000       2762      962       2407      2627      2888      97        2372      2648      3017      3462      3794
2040-3080         4500       2774      965       2414      2636      2902      95        2379      2660      3033      3482      3817
Difference        44%        3.80%     2.70%     3.60%     2.40%     3.80%     3.10%     3.40%     3.70%     3.90%     4.10%     4.30%
2300 min          2500       2895      1021      2531      2757      3024      100       2482      2772      3154      3604      3931
2300 min          3000       2908      1061      2556      2774      3037      102       2493      2785      3167      3619      3945
2300 min          3500       2917      1028      2550      2778      3050      101       2490      2788      3185      3648      3983
2300 min          4000       2942      1040      2570      2801      3074      103       2527      2819      3208      3671      4010
2300 min          4500       2953      1043      2581      2812      3085      106       2546      2834      3218      3679      4016
Difference        44%        2.00%     3.80%     1.90%     2.00%     2.00%     5.70%     2.50%     2.20%     2.00%     2.00%     2.10%
2700 min          2500       3205      1183      2835      3068      3339      119       2763      3070      3471      3932      4246
2700 min          3000       3211      1193      2841      3073      3344      120       2774      3078      3477      3937      4250
2700 min          3500       3214      1179      2838      3073      3347      119       2776      3080      3480      3942      4256
2700 min          4000       3216      1191      2844      3077      3350      115       2765      3078      3483      3948      4267
2700 min          4500       3222      1176      2843      3080      3357      118       2778      3087      3491      3957      4274
Difference        44%        0.50%     1.40%     0.30%     0.40%     0.50%     4.20%     0.50%     0.50%     0.50%     0.60%     0.70%

ARCHAEBACTERIA
Calibration (mya)  rttm (mya)  Node 29   Node 31   Node 32   Node 33   Node 34
                               Time      Time      Time      Time      Time
1609               2500        273       3169      3626      3833      4149
1609               3000        283       3182      3637      3846      4164
1609               3500        275       3135      3590      3798      4117
1609               4000        281       3191      3650      3860      4182
1609               4500        277       3179      3642      3855      4182
Difference         44%         3.5%      1.8%      1.6%      1.6%      1.6%
1489-1729          2500        277       3166      3607      3808      4119
1489-1729          3000        285       3195      3637      3840      4153
1489-1729          3500        284       3211      3655      3860      4175
1489-1729          4000        280       3214      3677      3886      4206
1489-1729          4500        284       3243      3696      3904      4223
Difference         44%         2.8%      2.4%      2.4%      2.5%      2.5%
1174-1222          2500        231       3039      3510      3720      4042
1174-1222          3000        233       3051      3525      3736      4061
1174-1222          3500        232       3093      3575      3791      4123
1174-1222          4000        235       3118      3604      3822      4158
1174-1222          4500        232       3124      3618      3838      4177
Difference         44%         1.7%      2.7%      3.0%      3.1%      3.2%
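
The caption of Table A3 does not state how the percentage difference was computed. The following is a minimal sketch, assuming that each value is the spread between the largest and smallest estimate across the five rttm constraints, expressed relative to the largest estimate. This assumption reproduces the 44% reported for rttm and the 3.5% reported for archaebacterial Node 29 under the 1609 Ma calibration, but it may not be the exact formula used, and a few tabulated values differ slightly from it. The function name is hypothetical.

```python
def percent_difference(estimates):
    """Spread of the estimates as a percentage of the largest estimate.

    Assumed formula (not stated in the table caption):
    100 * (max - min) / max
    """
    largest = max(estimates)
    return 100.0 * (largest - min(estimates)) / largest


# Ingroup root constraints (rttm) used throughout Tables A1 and A3
rttm = [2500, 3000, 3500, 4000, 4500]

# Archaebacteria, Node 29, calibration 1609 (values from Table A3)
node29 = [273, 283, 275, 281, 277]

print(round(percent_difference(rttm)))        # 44
print(round(percent_difference(node29), 1))   # 3.5
```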

Appendix B

Supplementary information for Chapter 3

Methods

Data Assembly and phylogenetic analyses

Protein data set

Gblocks (Castresana, 2000) was applied with the following parameters: minimum number of sequences for a conserved position: 110; minimum number of sequences for a flank position: 110; maximum number of contiguous non-conserved positions: 32000; allowed gap positions: with half. The balance between eliminating non-conserved sites and keeping enough sites to recover a phylogenetic signal was explored by altering the "minimum length of a block" parameter. This was increased from a minimum of two to a maximum of 80 to obtain data sets retaining 40% (the longest alignment obtainable with the parameters chosen), 30%, 20%, 10%, 5%, and 2% of the original alignment. A phylogeny was built with MEGA3 (Kumar, 2004) (NJ, JTT+gamma, with the alpha parameter estimated by the program RAxML (Stamatakis, 2006)), and the number of monophyletic classes, their bootstrap support, and the monophyly of the phyla Proteobacteria (excluding the position of Solibacteres) and Firmicutes were compared. In the evaluation of monophyly the position of Symbiobacterium thermophilum was not considered (see below). The monophyly of classes was used as a control for the recovery of phylogenetic signal because classes are defined on the basis of phylogenetic and physiological characters (Holt, 1984). Increasing stringency decreased the bootstrap support for the monophyly of classes because fewer sites were available, but it had no apparent effect on the number of classes recovered as monophyletic. We therefore selected the 40% stringency level, which maximized the length of the alignment while keeping most classes monophyletic (Fig. B1).

Slow-Fast data set: A data set with only slow-evolving positions was built following the procedure outlined by Philippe and co-workers (Philippe, 2000; Brinkmann, 1999). PAUP* v.4 beta10 (Swofford, 1998) was used to calculate the number of changes per site in each class represented by multiple species (a maximum of six species representing different genera was used when available). Archaebacteria were treated at the domain level because only one class was represented by more than three species. The threshold between slow- and fast-evolving sites was based on the sum of changes across all phylogenetic categories for a given site: any site showing fewer changes than the selected threshold was considered slow evolving and retained in the alignment. Distance trees (NJ, JTT+gamma, with the alpha parameter estimated by the program RAxML) were built for each data set, with thresholds of 45, 30, 15, ten, five, and two changes per site. Increasing the threshold stringency resulted in paraphyly of classes and phyla and in loss of phylogenetic signal. We selected a threshold of 45 changes because it maximized the number of monophyletic classes and phyla (Fig. B1).

Mesophiles-only data set: Of the original 311 species considered, 40 were categorized as temperature extremophiles, and for nine species information on growth temperature could not be found. The growth temperature limits for defining psychro-, meso-, thermo-, and hyperthermophiles were -1 to 10 ºC, 11 to 45 ºC, 46 to 75 ºC, and >75 ºC, respectively. Temperature ranges for each species were defined using information in GenBank (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi), Bergey's Manual of Systematic Bacteriology (Holt, 1984; Garrity, 2001), and species-specific literature (Table B1). In two cases the sources disagreed and the species were deleted. The 49 extremophilic and undetermined species were deleted from the data set and each gene was re-aligned with ClustalX (Thompson, 1994). After application of Gblocks the final data set was composed of 177 species and 7,132 amino acid sites. Distance and ML phylogenies were built with MEGA3 and RAxML, respectively.

Strict data set: Each gene was analyzed with more conservative criteria for HGT identification and for rate constancy violation of each species. Three "monophyletic criteria" were designed to detect anomalously positioned species in single-gene distance trees. First, species belonging to Class X and Phylum Y that clustered outside of their class and phylum were deleted. Second, if a class was represented by only one sequence and it caused the paraphyly of another class, both were deleted. Third, if all species of a lineage could be grouped together only at the phylum level (e.g., Cyanobacteria) and this phylum was not monophyletic, only the group containing more than 50% of the species was kept in the data set. The remaining species were then analyzed with a branch length test for rate constancy violations with the program LinTree (Takezaki, 1995). An amino acid model with gamma was used to build NJ trees with 100 bootstrap replications, and species evolving significantly (99% confidence level) slower or faster than the average root-to-tip distance were identified. These stricter criteria led to a new data set composed of 110 species and 7,336 amino acid sites (after application of Gblocks). This data set represents a significant reduction in phylogenetic diversity compared with the original data set, as only 20 of 32 taxonomic categories are represented. Nevertheless, it was used to build distance and ML phylogenies, and these results were compared to those from the other data sets (Slow-Fast, Mesophiles-only, and original).
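
The growth temperature limits used for the Mesophiles-only data set can be expressed as a simple classification rule. The sketch below is illustrative only: the function name is hypothetical, and two choices are assumptions not stated in the text, namely that values falling between the stated integer limits (e.g., 10.5 ºC) are assigned to the lower category, and that temperatures below -1 ºC are left unclassified. The example temperatures are GenBank values from Table B1.

```python
def temperature_class(growth_temp_c):
    """Assign a temperature category from a growth temperature in ºC.

    Limits follow the text: psychrophiles -1 to 10, mesophiles 11 to 45,
    thermophiles 46 to 75, hyperthermophiles >75.
    Assumption: values between the stated limits go to the lower category.
    """
    if growth_temp_c < -1:
        return None  # assumption: outside the ranges defined in the text
    if growth_temp_c <= 10:
        return "psychrophile"
    if growth_temp_c <= 45:
        return "mesophile"
    if growth_temp_c <= 75:
        return "thermophile"
    return "hyperthermophile"


# GenBank growth temperatures listed in Table B1
examples = {
    "Desulfotalea psychrophila LSv54": 7,
    "Chlorobium tepidum TLS": 48,
    "Aquifex aeolicus VF5": 96,
}
for species, temp in examples.items():
    print(species, temperature_class(temp))
```

Species classified as anything other than "mesophile", or lacking a temperature, would be the ones removed before re-alignment.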

Nucleotide data set

The same procedure used for proteins was applied to this data set. The Gblocks parameters were: minimum number of sequences for a conserved position: 95; minimum number of sequences for a flank position: 95; maximum number of contiguous non-conserved positions: 32000; allowed gap positions: with half. The "minimum length of a block" parameter was progressively increased to obtain data sets retaining 60%, 50%, 40%, 30%, 20%, and 10% of the original alignment (columns with only gaps were deleted at the beginning of the analysis). A phylogeny was built with MEGA3 (NJ, Tamura-Nei+gamma, with the alpha parameter estimated by the program RAxML), and the number of monophyletic classes, their bootstrap support, and the monophyly of Proteobacteria and Firmicutes were calculated. In the evaluation of monophyly the position of Zoogloea ramigera was not considered (see below). Higher stringency levels decreased the number of monophyletic classes (Gamma- and Deltaproteobacteria, Spirochaetes, and Bacilli became paraphyletic) as well as the bootstrap support of the classes that remained monophyletic. The two phyla were unaffected. We selected a stringency of 60% to maximize the number of sites (Fig. B2).
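
For orientation, the Gblocks settings listed for the protein and nucleotide data sets can be written as command-line argument lists. This is a sketch under stated assumptions rather than the commands actually run: the flag names (-t, -b1 to -b5) are given as recalled from the Gblocks command-line documentation and should be checked against the installed version, the alignment file names are hypothetical, and the block length of 2 shown for the nucleotide alignment is only an illustrative starting value. The commands are assembled but not executed.

```python
def gblocks_args(alignment, seq_type, min_seqs, min_block_length):
    """Build a Gblocks argument list (assembled only, not executed).

    seq_type: "p" for protein, "d" for DNA (assumed flag values).
    min_seqs: minimum number of sequences for a conserved and a flank position.
    min_block_length: the "minimum length of a block" parameter varied in the text.
    """
    return [
        "Gblocks", alignment,
        f"-t={seq_type}",
        f"-b1={min_seqs}",          # conserved position
        f"-b2={min_seqs}",          # flank position
        "-b3=32000",                # max contiguous non-conserved positions
        f"-b4={min_block_length}",  # minimum length of a block
        "-b5=h",                    # allowed gap positions: with half
    ]


# Protein data set: block length varied between 2 and 80 (hypothetical file name)
protein_runs = [gblocks_args("protein_alignment.fasta", "p", 110, n) for n in (2, 80)]
# Nucleotide data set: 95 sequences; block length here is illustrative only
nucleotide_run = gblocks_args("rrna_alignment.fasta", "d", 95, 2)

for args in protein_runs + [nucleotide_run]:
    print(" ".join(args))
```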

Slow-Fast data set: The number of changes per site in each eubacterial class represented by multiple species was calculated using the program PAUP* v.4 beta10. Archaebacteria were treated at the domain level because only two classes were represented by more than three species. A maximum of six species was used in each class, spanning different genera when available. As for the protein data set, the number of changes within each class was summed across the two domains to obtain an estimate of the variability of each site. Based on this, four threshold levels were tested: 15, 10, 5, and 3 changes per site. Distance trees (NJ, JTT+gamma, with the alpha parameter estimated by the program RAxML) were built for each of these levels, and the monophyly of classes and phyla and their bootstrap supports were calculated. Increasing stringency (i.e., a lower threshold) resulted in paraphyly of many classes and phyla and in lower bootstrap supports. We selected a threshold of 15 changes because it maximized the number of monophyletic classes and phyla and their bootstrap values. This new data set includes 60% of the variable sites present in the original data set (Fig. B2).
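
The site selection step of the Slow-Fast procedure can be sketched as follows, assuming that the per-site change counts for each group have already been obtained (e.g., from PAUP*). The function name, group labels, and counts are hypothetical and serve only to illustrate the rule described above: changes are summed across groups for each site, and a site is retained when the sum is lower than the chosen threshold.

```python
def slow_sites(changes_by_group, threshold):
    """Return indices of sites whose summed changes are below the threshold.

    changes_by_group: dict mapping a group name to a list of per-site
    change counts (all lists must have the same length).
    """
    per_group = list(changes_by_group.values())
    n_sites = len(per_group[0])
    totals = [sum(counts[i] for counts in per_group) for i in range(n_sites)]
    return [i for i, total in enumerate(totals) if total < threshold]


# Hypothetical change counts for four alignment sites in three groups
toy_changes = {
    "Alphaproteobacteria": [0, 3, 1, 9],
    "Bacilli":             [1, 4, 0, 8],
    "Archaebacteria":      [0, 2, 1, 7],
}

print(slow_sites(toy_changes, 15))  # [0, 1, 2]
print(slow_sites(toy_changes, 3))   # [0, 2]
```

Lowering the threshold retains fewer sites, which corresponds to the increase in stringency described in the text.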

Results

Symbiobacterium thermophilum

This species, present only in the protein data set, is a thermophilic bacterium that depends on microbial commensalism for growth (Ohno, 2000). It was classified as an actinobacterium based on its high GC content (Ueda, 2001), but recent studies have shown its affiliation with Firmicutes based on genome characteristics, indels, and the absence of proteins uniquely shared with Actinobacteria (Ueda, 2004; Gao, 2005; Gao, 2006). A recent supertree analysis also showed S. thermophilum clustering with Clostridia (Pisani, 2007), as in our phylogeny (ML and NJ, BP 68% and 58%, respectively). Given this evidence, we consider this species a misclassified actinobacterium and the first high-GC member of the Class Clostridia.

Zoogloea ramigera

The original classification placed this species within the Gammaproteobacteria (Shin, 1993). A more detailed analysis of various strains revealed this misclassification and placed the type strain within the Betaproteobacteria. Nonetheless, some strains did not cluster with the type strain in an SSU phylogenetic tree and were also found to lack the rhodoquinone-8 (RQ-8) synthesized by the type strain. The putatively misclassified strains were shown to cluster within the Alphaproteobacteria, close to Agrobacterium tumefaciens (Shin, 1993). The same position is found in our phylogenetic tree of rRNA subunits (BP 100), which suggests that the sequence named Z. ramigera X88894 in the European Ribosomal Database belongs to one of the misclassified strains. We thus consider it an alphaproteobacterium.

Table B1 Growth temperature of non-mesophilic Eubacteria and Archaebacteria species. Psychrophiles: -1 to 10 ºC, thermophiles: 46 to 75 ºC, hyperthermophiles: >75 ºC. Dash: information not available. Bergey's: Bergey's Manual of Systematic Bacteriology.

                                                                             Growth temperature (ºC)
Eubacteria                                            Classification         Bergey's          GenBank
                                                                             (1st & 2nd eds.)
Anabaena variabilis ATCC 29413                        Cyanobacteria          -                 -
Anaplasma marginale str. St. Maries                   Alphaproteobacteria    -                 -
Aquifex aeolicus VF5                                  Aquificae              85                96
Bacillus clausii KSM-K16                              Firmicutes/Bacilli     -                 -
Burkholderia sp. 383                                  Betaproteobacteria     -                 -
Candidatus Blochmannia pennsylvanicus str. BPEN       Gammaproteobacteria    -                 -
Carboxydothermus hydrogenoformans Z-2901              Firmicutes/Clostridia  -                 78
Chlorobium tepidum TLS                                Chlorobia              47-48             48
Colwellia psychrerythraea 34H                         Gammaproteobacteria    psychro           8
Dechloromonas aromatica RCB                           Betaproteobacteria     -                 -
Desulfotalea psychrophila LSv54                       Deltaproteobacteria    10                7
Ehrlichia canis str. Jake                             Alphaproteobacteria    -                 -
Geobacillus kaustophilus HTA426                       Firmicutes/Bacilli     -                 thermo
Methylococcus capsulatus str. Bath                    Gammaproteobacteria    45                45
Moorella thermoacetica ATCC 39073                     Firmicutes/Clostridia  -                 58
Nitrosococcus oceani ATCC 19707                       Gammaproteobacteria    25-30             -
Novosphingobium aromaticivorans DSM 12444             Alphaproteobacteria    -                 -
Photobacterium profundum SS9                          Gammaproteobacteria    8-12              15
Pseudoalteromonas haloplanktis TAC125                 Gammaproteobacteria    25                psychro
Psychrobacter arcticus 273-4                          Gammaproteobacteria    -                 -2.5-20
Streptococcus thermophilus CNRZ1066                   Firmicutes/Bacilli     37 (max 52)       45
Streptococcus thermophilus LMG 18311                  Firmicutes/Bacilli     37 (max 52)       45
Symbiobacterium thermophilum IAM 14863                Actinobacteria         -                 60
Synechococcus sp. JA-2-3B'a(2-13)                     Cyanobacteria          -                 thermo
Synechococcus sp. JA-3-3Ab                            Cyanobacteria          -                 thermo
Thermoanaerobacter tengcongensis MB4                  Firmicutes/Clostridia  -                 75
Thermobifida fusca YX                                 Actinobacteria         -                 50-55
Thermosynechococcus elongatus BP-1                    Cyanobacteria          -                 55
Thermotoga maritima MSB8                              Thermotogae            80                80
Thermus thermophilus HB27                             Deinococci             70                68
Thermus thermophilus HB8                              Deinococci             70                thermo

                                                                             Growth temperature (ºC)
Archaebacteria                                        Classification         Bergey's          GenBank
                                                                             (1st & 2nd eds.)
Aeropyrum pernix K1                                   Thermoprotei           90-95             90-95
Archaeoglobus fulgidus DSM 4304                       Archaeoglobi           83                83
Methanocaldococcus jannaschii DSM 2661                Methanococci           85                85
Methanopyrus kandleri AV19                            Methanopyri            98                98
Methanothermobacter thermoautotrophicus str. Delta H  Methanobacteria        55-65             65-70
Nanoarchaeum equitans Kin4-M                          Nanoarchaeota          -                 hyper
Natronomonas pharaonis DSM 2160                       Halobacteria           45                -
Picrophilus torridus DSM 9790                         Thermoplasmata         60                60
Pyrobaculum aerophilum str. IM2                       Thermoprotei           100               100
Pyrococcus abyssi GE5                                 Thermococci            100               103
Pyrococcus furiosus DSM 3638                          Thermococci            100               100
Pyrococcus horikoshii OT3                             Thermococci            98                98
Sulfolobus acidocaldarius DSM 639                     Thermoprotei           70-75             70-75
Sulfolobus solfataricus P2                            Thermoprotei           85                85
Sulfolobus tokodaii str. 7                            Thermoprotei           -                 80
Thermococcus kodakarensis KOD1                        Thermococci            -                 85
Thermoplasma acidophilum DSM 1728                     Thermoplasmata         59                59
Thermoplasma volcanium GSS1                           Thermoplasmata         60                60

Table B2 List of Eubacteria and Archaebacteria species used in the protein data set and their classification (genome accession numbers can be found at http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi)

Species name    Classification

EUBACTERIA
Acinetobacter sp. ADP1    Gammaproteobacteria
Agrobacterium tumefaciens str. C58    Alphaproteobacteria
Anabaena variabilis ATCC 29413    Cyanobacteria
Anaeromyxobacter dehalogenans 2CP-C    Deltaproteobacteria
Anaplasma marginale str. St. Maries    Alphaproteobacteria
Anaplasma phagocytophilum HZ    Alphaproteobacteria
Aquifex aeolicus VF5    Aquificae
Aster yellows witches'-broom phytoplasma AYWB    Firmicutes/Mollicutes
Azoarcus sp. EbN1    Betaproteobacteria
Bacillus anthracis str. 'Ames Ancestor'    Firmicutes/Bacilli
Bacillus anthracis str. Ames    Firmicutes/Bacilli
Bacillus anthracis str. Sterne    Firmicutes/Bacilli
Bacillus cereus ATCC 10987    Firmicutes/Bacilli
Bacillus cereus ATCC 14579    Firmicutes/Bacilli
Bacillus cereus E33L    Firmicutes/Bacilli
Bacillus clausii KSM-K16    Firmicutes/Bacilli
Bacillus halodurans C-125    Firmicutes/Bacilli
Bacillus licheniformis ATCC 14580    Firmicutes/Bacilli
Bacillus subtilis subsp. subtilis str. 168    Firmicutes/Bacilli
Bacillus thuringiensis serovar konkukian str. 97-27    Firmicutes/Bacilli
Bacteroides fragilis NCTC 9343    Bacteroidetes
Bacteroides fragilis YCH46    Bacteroidetes
Bacteroides thetaiotaomicron VPI-5482    Bacteroidetes
henselae str. Houston-1    Alphaproteobacteria
str. Toulouse    Alphaproteobacteria
Bdellovibrio bacteriovorus HD100    Deltaproteobacteria
Bifidobacterium longum NCC2705    Actinobacteria
Bordetella bronchiseptica RB50    Betaproteobacteria
12822    Betaproteobacteria
Tomaha I    Betaproteobacteria
Borrelia burgdorferi B31    Spirochaetes
Borrelia garinii Pbi    Spirochaetes
Bradyrhizobium japonicum USDA 110    Alphaproteobacteria
biovar 1 str. 9-941    Alphaproteobacteria
Brucella melitensis 16M    Alphaproteobacteria
Brucella melitensis biovar Abortus 2308    Alphaproteobacteria
Brucella suis 1330    Alphaproteobacteria
Buchnera aphidicola str. APS    Gammaproteobacteria
Buchnera aphidicola str. Bp    Gammaproteobacteria

Buchnera aphidicola str. Sg    Gammaproteobacteria
ATCC 23344    Betaproteobacteria
Burkholderia pseudomallei 1710b    Betaproteobacteria
Burkholderia pseudomallei K96243    Betaproteobacteria
Burkholderia sp. 383    Betaproteobacteria
Burkholderia thailandensis E264    Betaproteobacteria
Campylobacter jejuni RM1221    Epsilonproteobacteria
Campylobacter jejuni subsp. Jejuni NCTC 11168    Epsilonproteobacteria
Candidatus Blochmannia floridanus    Gammaproteobacteria
Candidatus Blochmannia pennsylvanicus str. BPEN    Gammaproteobacteria
Candidatus Pelagibacter ubique HTCC1062    Alphaproteobacteria
Candidatus Protochlamydia amoebophila UWE25    Chlamydiae
Carboxydothermus hydrogenoformans Z-2901    Firmicutes/Clostridia
Caulobacter crescentus CB15    Alphaproteobacteria
Chlamydia muridarum Nigg    Chlamydiae
Chlamydia trachomatis A/HAR-13    Chlamydiae
Chlamydia trachomatis D/UW-3/CX    Chlamydiae
Chlamydophila abortus S26/3    Chlamydiae
Chlamydophila caviae GPIC    Chlamydiae
Chlamydophila felis Fe/C-56    Chlamydiae
Chlamydophila pneumoniae AR39    Chlamydiae
Chlamydophila pneumoniae CWL029    Chlamydiae
Chlamydophila pneumoniae J138    Chlamydiae
Chlamydophila pneumoniae TW-183    Chlamydiae
Chlorobium chlorochromatii CaD3    Chlorobia
Chlorobium tepidum TLS    Chlorobia
Chromobacterium violaceum ATCC 12472    Betaproteobacteria
Clostridium acetobutylicum ATCC 824    Firmicutes/Clostridia
Clostridium perfringens str. 13    Firmicutes/Clostridia
Clostridium tetani E88    Firmicutes/Clostridia
Colwellia psychrerythraea 34H    Gammaproteobacteria
Corynebacterium diphtheriae NCTC 13129    Actinobacteria
Corynebacterium efficiens YS-314    Actinobacteria
Corynebacterium glutamicum ATCC 13032    Actinobacteria
Corynebacterium jeikeium K411    Actinobacteria
RSA 493    Gammaproteobacteria
Dechloromonas aromatica RCB    Betaproteobacteria
Dehalococcoides ethenogenes 195    Chloroflexi/Dehalococcoidetes
Dehalococcoides sp. CBDB1    Chloroflexi/Dehalococcoidetes
Deinococcus radiodurans R1    Deinococci
Desulfitobacterium hafniense Y51    Firmicutes/Clostridia
Desulfotalea psychrophila LSv54    Deltaproteobacteria
Desulfovibrio desulfuricans G20    Deltaproteobacteria
Desulfovibrio vulgaris subsp. vulgaris str. Hildenborough    Deltaproteobacteria
Ehrlichia canis str. Jake    Alphaproteobacteria
str. Arkansas    Alphaproteobacteria
Ehrlichia ruminantium str. Gardel    Alphaproteobacteria

Ehrlichia ruminantium str. Welgevonden    Alphaproteobacteria
Enterococcus faecalis V583    Firmicutes/Bacilli
Erwinia carotovora subsp. atroseptica SCRI1043    Gammaproteobacteria
Erythrobacter litoralis HTCC2594    Alphaproteobacteria
Escherichia coli CFT073    Gammaproteobacteria
Escherichia coli K12    Gammaproteobacteria
Escherichia coli O157:H7    Gammaproteobacteria
Escherichia coli O157:H7 EDL933    Gammaproteobacteria
Escherichia coli W3110    Gammaproteobacteria
subsp. holarctica    Gammaproteobacteria
Francisella tularensis subsp. tularensis SCHU S4    Gammaproteobacteria
Frankia sp. CcI3    Actinobacteria
Fusobacterium nucleatum subsp. nucleatum ATCC 25586    Fusobacteria
Geobacillus kaustophilus HTA426    Firmicutes/Bacilli
Geobacter metallireducens GS-15    Deltaproteobacteria
Geobacter sulfurreducens PCA    Deltaproteobacteria
Gloeobacter violaceus PCC 7421    Cyanobacteria
Gluconobacter oxydans 621H    Alphaproteobacteria
35000HP    Gammaproteobacteria
86-028NP    Gammaproteobacteria
Haemophilus influenzae Rd KW20    Gammaproteobacteria
Hahella chejuensis KCTC 2396    Gammaproteobacteria
ATCC 51449    Epsilonproteobacteria
Helicobacter pylori 26695    Epsilonproteobacteria
Helicobacter pylori J99    Epsilonproteobacteria
Idiomarina loihiensis L2TR    Gammaproteobacteria
Jannaschia sp. CCS1    Alphaproteobacteria
Lactobacillus acidophilus NCFM    Firmicutes/Bacilli
Lactobacillus johnsonii NCC 533    Firmicutes/Bacilli
Lactobacillus plantarum WCFS1    Firmicutes/Bacilli
Lactobacillus sakei subsp. sakei 23K    Firmicutes/Bacilli
Lactococcus lactis subsp. Lactis Il1403    Firmicutes/Bacilli
str. Lens    Gammaproteobacteria
Legionella pneumophila str. Paris    Gammaproteobacteria
Legionella pneumophila subsp. pneumophila str. Philadelphia 1    Gammaproteobacteria
Leifsonia xyli subsp. xyli str. CTCB07    Actinobacteria
Leptospira interrogans serovar Copenhageni str. Fiocruz L1-130    Spirochaetes
Leptospira interrogans serovar Lai str. 56601    Spirochaetes
Listeria innocua Clip11262    Firmicutes/Bacilli
Listeria monocytogenes EGD-e    Firmicutes/Bacilli
Listeria monocytogenes str. 4b F2365    Firmicutes/Bacilli
Magnetospirillum magneticum AMB-1    Alphaproteobacteria
Mannheimia succiniciproducens MBEL55E    Gammaproteobacteria
Mesoplasma florum L1    Firmicutes/Mollicutes
Mesorhizobium loti MAFF303099    Alphaproteobacteria
Methylococcus capsulatus str. Bath    Gammaproteobacteria
Moorella thermoacetica ATCC 39073    Firmicutes/Clostridia

Mycobacterium avium subsp. paratuberculosis K-10 Actinobacteria Mycobacterium bovis AF2122/97 Actinobacteria Mycobacterium leprae TN Actinobacteria Mycobacterium tuberculosis CDC1551 Actinobacteria Mycobacterium tuberculosis H37Rv Actinobacteria Mycoplasma capricolum subsp. capricolum ATCC 27343 Firmicutes/Mollicutes Mycoplasma gallisepticum R Firmicutes/Mollicutes Mycoplasma genitalium G37 Firmicutes/Mollicutes Mycoplasma hyopneumoniae 232 Firmicutes/Mollicutes Mycoplasma hyopneumoniae 7448 Firmicutes/Mollicutes Mycoplasma hyopneumoniae J Firmicutes/Mollicutes Mycoplasma mobile 163K Firmicutes/Mollicutes Mycoplasma mycoides subsp. mycoides SC str. PG1 Firmicutes/Mollicutes Mycoplasma penetrans HF-2 Firmicutes/Mollicutes Mycoplasma pneumoniae M129 Firmicutes/Mollicutes Mycoplasma pulmonis UAB CTIP Firmicutes/Mollicutes Mycoplasma synoviae 53 Firmicutes/Mollicutes FA 1090 Betaproteobacteria Neisseria meningitidis MC58 Betaproteobacteria Neisseria meningitidis Z2491 Betaproteobacteria sennetsu str. Miyayama Alphaproteobacteria Nitrobacter winogradskyi Nb-255 Alphaproteobacteria Nitrosococcus oceani ATCC 19707 Gammaproteobacteria Nitrosomonas europaea ATCC 19718 Betaproteobacteria Nitrosospira multiformis ATCC 25196 Betaproteobacteria Nocardia farcinica IFM 10152 Actinobacteria Nostoc sp. PCC 7120 Cyanobacteria Novosphingobium aromaticivorans DSM 12444 Alphaproteobacteria Oceanobacillus iheyensis HTE831 Firmicutes/Bacilli Onion yellows phytoplasma OY-M Firmicutes/Mollicutes Pasteurella multocida subsp. multocida str. Pm70 Gammaproteobacteria Pelobacter carbinolicus DSM 2380 Deltaproteobacteria Pelodictyon luteolum DSM 273 Chlorobia Photobacterium profundum SS9 Gammaproteobacteria Photorhabdus luminescens subsp. laumondii TTO1 Gammaproteobacteria Porphyromonas gingivalis W83 Bacteroidetes Prochlorococcus marinus str. MIT 9312 Cyanobacteria Prochlorococcus marinus str. MIT 9313 Cyanobacteria Prochlorococcus marinus str. NATL2A Cyanobacteria Prochlorococcus marinus subsp. marinus str. CCMP1375 Cyanobacteria Prochlorococcus marinus subsp. pastoris str. CCMP1986 Cyanobacteria Propionibacterium acnes KPA171202 Actinobacteria Pseudoalteromonas haloplanktis TAC125 Gammaproteobacteria Pseudomonas aeruginosa PAO1 Gammaproteobacteria Pseudomonas fluorescens Pf-5 Gammaproteobacteria Pseudomonas fluorescens PfO-1 Gammaproteobacteria Pseudomonas putida KT2440 Gammaproteobacteria

Pseudomonas syringae pv. phaseolicola 1448A Gammaproteobacteria Pseudomonas syringae pv. syringae B728a Gammaproteobacteria Pseudomonas syringae pv. tomato str. DC3000 Gammaproteobacteria Psychrobacter arcticus 273-4 Gammaproteobacteria Ralstonia eutropha JMP134 Betaproteobacteria Ralstonia solanacearum GMI1000 Betaproteobacteria Rhizobium etli CFN 42 Alphaproteobacteria Rhodobacter sphaeroides 2.4.1 Alphaproteobacteria Rhodoferax ferrireducens DSM 15236 Betaproteobacteria Rhodopirellula baltica SH1 Planctomycetacia Rhodopseudomonas palustris CGA009 Alphaproteobacteria Rhodopseudomonas palustris HaA2 Alphaproteobacteria Rhodospirillum rubrum ATCC 11170 Alphaproteobacteria str. Malish 7 Alphaproteobacteria URRWXCa12 Alphaproteobacteria Rickettsia prowazekii str. Madrid E Alphaproteobacteria str. Wilmington Alphaproteobacteria Salinibacter ruber DSM 13855 Bacteroidetes Salmonella enterica subsp. enterica serovar Choleraesuis str. SC-B67 Gammaproteobacteria Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150 Gammaproteobacteria Salmonella enterica subsp. enterica serovar Typhi Ty2 Gammaproteobacteria Salmonella enterica subsp. enterica serovar Typhi str. CT18 Gammaproteobacteria Salmonella typhimurium LT2 Gammaproteobacteria Shewanella oneidensis MR-1 Gammaproteobacteria Sb227 Gammaproteobacteria Sd197 Gammaproteobacteria 2a str. 2457T Gammaproteobacteria Shigella flexneri 2a str. 301 Gammaproteobacteria Ss046 Gammaproteobacteria Silicibacter pomeroyi DSS-3 Alphaproteobacteria Sinorhizobium meliloti 1021 Alphaproteobacteria Sodalis glossinidius str. 'morsitans' Gammaproteobacteria Solibacter usitatus Ellin6076 Acidobacteria/Solibacteres Staphylococcus aureus RF122 Firmicutes/Bacilli Staphylococcus aureus subsp. aureus COL Firmicutes/Bacilli Staphylococcus aureus subsp. aureus MRSA252 Firmicutes/Bacilli Staphylococcus aureus subsp. aureus MSSA476 Firmicutes/Bacilli Staphylococcus aureus subsp.
aureus MW2 Firmicutes/Bacilli Staphylococcus aureus subsp. aureus Mu50 Firmicutes/Bacilli Staphylococcus aureus subsp. aureus N315 Firmicutes/Bacilli Staphylococcus aureus subsp. aureus NCTC 8325 Firmicutes/Bacilli Staphylococcus aureus subsp. aureus USA300 Firmicutes/Bacilli Staphylococcus epidermidis ATCC 12228 Firmicutes/Bacilli Staphylococcus epidermidis RP62A Firmicutes/Bacilli Staphylococcus haemolyticus JCSC1435 Firmicutes/Bacilli Staphylococcus saprophyticus subsp. saprophyticus ATCC 15305 Firmicutes/Bacilli Streptococcus agalactiae 2603V/R Firmicutes/Bacilli

Streptococcus agalactiae A909 Firmicutes/Bacilli Streptococcus agalactiae NEM316 Firmicutes/Bacilli Streptococcus mutans UA159 Firmicutes/Bacilli Streptococcus pneumoniae R6 Firmicutes/Bacilli Streptococcus pneumoniae TIGR4 Firmicutes/Bacilli Streptococcus pyogenes M1 GAS Firmicutes/Bacilli Streptococcus pyogenes MGAS10394 Firmicutes/Bacilli Streptococcus pyogenes MGAS315 Firmicutes/Bacilli Streptococcus pyogenes MGAS5005 Firmicutes/Bacilli Streptococcus pyogenes MGAS6180 Firmicutes/Bacilli Streptococcus pyogenes MGAS8232 Firmicutes/Bacilli Streptococcus pyogenes SSI-1 Firmicutes/Bacilli Streptococcus thermophilus CNRZ1066 Firmicutes/Bacilli Streptococcus thermophilus LMG 18311 Firmicutes/Bacilli Streptomyces avermitilis MA-4680 Actinobacteria Streptomyces coelicolor A3(2) Actinobacteria Symbiobacterium thermophilum IAM 14863 Actinobacteria Synechococcus elongatus PCC 6301 Cyanobacteria Synechococcus elongatus PCC 7942 Cyanobacteria Synechococcus sp. CC9605 Cyanobacteria Synechococcus sp. CC9902 Cyanobacteria Synechococcus sp. JA-2-3B'a(2-13) Cyanobacteria Synechococcus sp. JA-3-3Ab Cyanobacteria Synechococcus sp. WH 8102 Cyanobacteria Synechocystis sp. PCC 6803 Cyanobacteria Thermoanaerobacter tengcongensis MB4 Firmicutes/Clostridia Thermobifida fusca YX Actinobacteria Thermosynechococcus elongatus BP-1 Cyanobacteria Thermotoga maritima MSB8 Thermotogae Thermus thermophilus HB27 Deinococci Thermus thermophilus HB8 Deinococci Thiobacillus denitrificans ATCC 25259 Betaproteobacteria Thiomicrospira crunogena XCL-2 Gammaproteobacteria Thiomicrospira denitrificans ATCC 33889 Epsilonproteobacteria Treponema denticola ATCC 35405 Spirochaetes Treponema pallidum subsp. pallidum str. Nichols Spirochaetes Tropheryma whipplei TW08/27 Actinobacteria Ureaplasma parvum serovar 3 str. ATCC 700970 Firmicutes/Mollicutes Vibrio cholerae O1 biovar eltor str.
N16961 Gammaproteobacteria Vibrio fischeri ES114 Gammaproteobacteria RIMD 2210633 Gammaproteobacteria CMCP6 Gammaproteobacteria Vibrio vulnificus YJ016 Gammaproteobacteria Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis Gammaproteobacteria Wolbachia Alphaproteobacteria Wolinella succinogenes DSM 1740 Epsilonproteobacteria Xanthomonas axonopodis pv. citri str. 306 Gammaproteobacteria

Xanthomonas campestris pv. campestris str. 8004 Gammaproteobacteria Xanthomonas campestris pv. campestris str. ATCC 33913 Gammaproteobacteria Xanthomonas campestris pv. vesicatoria str. 85-10 Gammaproteobacteria Xanthomonas oryzae pv. oryzae KACC10331 Gammaproteobacteria Xylella fastidiosa 9a5c Gammaproteobacteria Xylella fastidiosa Temecula1 Gammaproteobacteria Yersinia pestis CO92 Gammaproteobacteria Yersinia pestis KIM Gammaproteobacteria Yersinia pestis biovar Medievalis str. 91001 Gammaproteobacteria Yersinia pseudotuberculosis IP 32953 Gammaproteobacteria Zymomonas mobilis subsp. mobilis ZM4 Alphaproteobacteria

ARCHAEBACTERIA Aeropyrum pernix K1 Crenarchaeota/Thermoprotei Archaeoglobus fulgidus DSM 4304 Euryarchaeota/Archaeoglobi Haloarcula marismortui ATCC 43049 Euryarchaeota/Halobacteria Halobacterium sp. NRC-1 Euryarchaeota/Halobacteria Methanocaldococcus jannaschii DSM 2661 Euryarchaeota/Methanococci Methanococcus maripaludis S2 Euryarchaeota/Methanococci Methanopyrus kandleri AV19 Euryarchaeota/Methanopyri Methanosarcina acetivorans C2A Euryarchaeota/Methanomicrobia Methanosarcina barkeri str. Fusaro Euryarchaeota/Methanomicrobia Methanosarcina mazei Go1 Euryarchaeota/Methanomicrobia Methanosphaera stadtmanae DSM 3091 Euryarchaeota/Methanobacteria Methanospirillum hungatei JF-1 Euryarchaeota/Methanomicrobia Methanothermobacter thermoautotrophicus str. Delta H Euryarchaeota/Methanobacteria Nanoarchaeum equitans Kin4-M Nanoarchaeota Natronomonas pharaonis DSM 2160 Euryarchaeota/Halobacteria Picrophilus torridus DSM 9790 Euryarchaeota/Thermoplasmata Pyrobaculum aerophilum str. IM2 Crenarchaeota/Thermoprotei Pyrococcus abyssi GE5 Euryarchaeota/Thermococci Pyrococcus furiosus DSM 3638 Euryarchaeota/Thermococci Pyrococcus horikoshii OT3 Euryarchaeota/Thermococci Sulfolobus acidocaldarius DSM 639 Crenarchaeota/Thermoprotei Sulfolobus solfataricus P2 Crenarchaeota/Thermoprotei Sulfolobus tokodaii str. 7 Crenarchaeota/Thermoprotei Thermococcus kodakarensis KOD1 Euryarchaeota/Thermococci Thermoplasma acidophilum DSM 1728 Euryarchaeota/Thermoplasmata Thermoplasma volcanium GSS1 Euryarchaeota/Thermoplasmata

Table B3 List of Eubacteria and Archaebacteria species used in the ribosomal RNA data set (shared by SSU and LSU) and their classification

Species Classification

EUBACTERIA Acetobacter europaeus AJ012698 Alphaproteobacteria Acetobacter intermedius AJ012697 Alphaproteobacteria Acetobacter xylinum X75619 Alphaproteobacteria Acinetobacter calcoaceticus M34139 Gammaproteobacteria AF099021 Gammaproteobacteria Agrobacterium radiobacter AJ130719 Alphaproteobacteria Agrobacterium rubi D12787 Alphaproteobacteria Agrobacterium tumefaciens D12784 Alphaproteobacteria Agrobacterium vitis D12795 Alphaproteobacteria Alcaligenes faecalis AF155147 Betaproteobacteria Aquifex aeolicus AE000751 Aquificae Bacillus alcalophilus AF078812 Firmicutes/Bacilli Bacillus anthracis AF155951 Firmicutes/Bacilli Bacillus cereus AF155952 Firmicutes/Bacilli Bacillus globisporus X68415 Firmicutes/Bacilli Bacillus halodurans D AP001507 Firmicutes/Bacilli Bacillus licheniformis AF234844 Firmicutes/Bacilli Bacillus stearothermophilus AJ005760 Firmicutes/Bacilli Bacillus subtilis B K00637 Firmicutes/Bacilli Bacillus thuringiensis AF155954 Firmicutes/Bacilli M65249 Alphaproteobacteria AF177666 Betaproteobacteria Bordetella bronchiseptica U04948 Betaproteobacteria Bordetella parapertussis U04949 Betaproteobacteria Bordetella pertussis AF142326 Betaproteobacteria Borrelia burgdorferi X85202 Spirochaetes Bradyrhizobium japonicum Z35330 Alphaproteobacteria Bradyrhizobium lupini U69636 Alphaproteobacteria Brevundimonas diminuta AB021415 Alphaproteobacteria Brucella melitensis AF220148 Alphaproteobacteria Buchnera aphidicola L18927 Gammaproteobacteria Burkholderia gladioli AB012916 Betaproteobacteria Burkholderia mallei AF110187 Betaproteobacteria Burkholderia pseudomallei Betaproteobacteria Campylobacter coli L04312 Epsilonproteobacteria Campylobacter hyoilei L19738 Epsilonproteobacteria Campylobacter jejuni AL139074 Epsilonproteobacteria Campylobacter lari L04316 Epsilonproteobacteria Carsonella ruddii AF211123 Gammaproteobacteria Chlamydia muridarum aA16S AE002280 Chlamydiae Chlamydia trachomatis AE001347 Chlamydiae Chlamydophila abortus U76710 Chlamydiae Chlamydophila
felis U68457 Chlamydiae Chlamydophila pecorum U68434 Chlamydiae Chlamydophila pneumoniae aA16S AE002256 Chlamydiae Chlamydophila psittaci U68447 Chlamydiae Chlorobium limicola Y10640 Chlorobia AJ233408 Gammaproteobacteria Clostridium botulinum A L37586 Firmicutes/Clostridia

Clostridium histolyticum M59094 Firmicutes/Clostridia Clostridium tyrobutyricum L08062 Firmicutes/Clostridia Coxiella burnetii D89791 Gammaproteobacteria Enterococcus faecalis AB012212 Firmicutes/Bacilli Erysipelothrix rhusiopathiae AB034200 Firmicutes/Mollicutes Erysipelothrix tonsillarum AB034201 Firmicutes/Mollicutes Escherichia coli B AE000471 Gammaproteobacteria Fibrobacter succinogenes M62683 Fibrobacteres Flavobacterium odoratum D14019 Bacteroidetes/Flavobacteria Flexibacter flexilis M62794 Bacteroidetes/Sphingobacteria Frankia sp. M55343 Actinobacteria Haemophilus influenzae D U32847 Gammaproteobacteria Helicobacter pylori A AE000620 Epsilonproteobacteria AB004753 Gammaproteobacteria Lactobacillus amylolyticus Y17361 Firmicutes/Bacilli Lactobacillus confusus M23036 Firmicutes/Bacilli Lactobacillus delbrueckii AB007908 Firmicutes/Bacilli Lactococcus lactis X64887 Firmicutes/Bacilli Leptospira interrogans M71241 Spirochaetes Leuconostoc carnosum AB022925 Firmicutes/Bacilli Leuconostoc lactis M23031 Firmicutes/Bacilli Leuconostoc mesenteroides AB023243 Firmicutes/Bacilli Leuconostoc oenos M35820 Firmicutes/Bacilli Leuconostoc paramesenteroides M23033 Firmicutes/Bacilli Leucothrix mucor X87277 Gammaproteobacteria Listeria grayi X56150 Firmicutes/Bacilli Listeria innocua S55473 Firmicutes/Bacilli Listeria ivanovii X98529 Firmicutes/Bacilli Listeria monocytogenes U84150 Firmicutes/Bacilli Listeria murrayi X56154 Firmicutes/Bacilli Listeria seeligeri X56148 Firmicutes/Bacilli Listeria welshimeri X56149 Firmicutes/Bacilli Microbispora bispora U58524 Actinobacteria Micrococcus luteus AF234843 Actinobacteria Mycobacterium avium M29573 Actinobacteria Mycobacterium kansasii M29575 Actinobacteria Mycobacterium leprae X55022 Actinobacteria Mycobacterium paratuberculosis M61680 Actinobacteria Mycobacterium phlei M29566 Actinobacteria Mycobacterium smegmatis AJ131761 Actinobacteria Mycobacterium tuberculosis X55588 Actinobacteria Mycoplasma flocculare X63377 Firmicutes/Mollicutes Mycoplasma gallisepticum L08897 Firmicutes/Mollicutes Mycoplasma genitalium A16S U39694 Firmicutes/Mollicutes Mycoplasma hyopneumoniae Y00149 Firmicutes/Mollicutes Nannocystis exedens AJ233946 Deltaproteobacteria Neisseria gonorrhoeae AF146369 Betaproteobacteria Neisseria meningitidis AF059671 Betaproteobacteria Paracoccus denitrificans AJ288159 Alphaproteobacteria Peptococcus niger X55797 Firmicutes/Clostridia Pirellula marina X62912 Planctomycetacia M59159 Gammaproteobacteria Propionibacterium freudenreichii AJ009989 Actinobacteria Pseudomonas aeruginosa AF023658 Gammaproteobacteria Pseudomonas fluorescens AF068010 Gammaproteobacteria Pseudomonas stutzeri AF038653 Gammaproteobacteria

Ralstonia pickettii AB004790 Betaproteobacteria Ralstonia solanacearum AB024604 Betaproteobacteria Renibacterium salmoninarum AB017538 Actinobacteria Rhizobium galegae AF025853 Alphaproteobacteria Rhizobium leguminosarum D12782 Alphaproteobacteria Rhizobium tropici D11344 Alphaproteobacteria Rhodobacter capsulatus D13474 Alphaproteobacteria Rhodobacter sphaeroides B X53854 Alphaproteobacteria Rhodococcus erythropolis AJ237967 Actinobacteria Rhodococcus fascians X81932 Actinobacteria Rhodopseudomonas palustris AB017261 Alphaproteobacteria Rhodospirillum rubrum D30778 Alphaproteobacteria L36099 Alphaproteobacteria L36101 Alphaproteobacteria Rickettsia bellii L36103 Alphaproteobacteria Rickettsia canada L36104 Alphaproteobacteria Rickettsia conorii L36105 Alphaproteobacteria L36673 Alphaproteobacteria Rickettsia prowazekii AJ235272 Alphaproteobacteria Rickettsia rhipicephali L36216 Alphaproteobacteria U11021 Alphaproteobacteria D38628 Alphaproteobacteria Rickettsia typhi L36221 Alphaproteobacteria Ruminobacter amylophilus AB004908 Gammaproteobacteria Salmonella typhi U88545 Gammaproteobacteria Serpulina hyodysenteriae U14931 Spirochaetes Serpulina innocens U14924 Spirochaetes Simkania negevensis U68460 Chlamydiae Staphylococcus aureus AF076030 Firmicutes/Bacilli Staphylococcus carnosus AB009934 Firmicutes/Bacilli Staphylococcus condimenti Y15750 Firmicutes/Bacilli Staphylococcus piscifermentans Y15754 Firmicutes/Bacilli Stigmatella aurantiaca AJ233935 Deltaproteobacteria Streptococcus macedonicus Z94012 Firmicutes/Bacilli Streptococcus oralis S70359 Firmicutes/Bacilli Streptococcus parauberis X89967 Firmicutes/Bacilli Streptococcus thermophilus X59028 Firmicutes/Bacilli Streptococcus uberis AB002527 Firmicutes/Bacilli Streptomyces ambofaciens M27245 Actinobacteria Streptomyces coelicolor A AL356612 Actinobacteria Streptomyces griseus B AB030568 Actinobacteria Streptomyces lividans AB037565 Actinobacteria Streptomyces rimosus F X62884 Actinobacteria Synechocystis
sp. D64000 Cyanobacteria Thermomonospora chromogena AF002261 Actinobacteria Thermotoga maritima aA16S AE001703 Thermotogae Thermus thermophilus L09659 Deinococcus-Thermus Treponema pallidum AE001208 Spirochaetes Tropheryma whippelii AF190688 Actinobacteria Ureaplasma urealyticum AE002127 Firmicutes/Mollicutes Vibrio cholerae AE004096 Gammaproteobacteria Vibrio vulnificus X56582 Gammaproteobacteria Waddlia chondrophila AF042496 Chlamydiae Wolbachia pipientis AF179630 Alphaproteobacteria Wolinella succinogenes M26636 Epsilonproteobacteria Xylella fastidiosa aA16S AE003870 Gammaproteobacteria

M59292 Gammaproteobacteria Zoogloea ramigera D14254 Betaproteobacteria Zymobacter palmae AF211871 Gammaproteobacteria Zymomonas mobilis C AF117351 Alphaproteobacteria

ARCHAEBACTERIA Aeropyrum pernix AB019552 Crenarchaeota/Thermoprotei Archaeoglobus fulgidus AE000965 Euryarchaeota/Archaeoglobi Desulfurococcus mobilis M36474 Crenarchaeota/Thermoprotei Haloarcula marismortui AF034620 Euryarchaeota/Halobacteria Halobacterium halobium AJ002949 Euryarchaeota/Halobacteria Halobacterium marismortui X61689 Euryarchaeota/Halobacteria Halococcus morrhuae D11106 Euryarchaeota/Halobacteria Haloferax mediterranei D11107 Euryarchaeota/Halobacteria Methanobacterium thermoautotrop AE000940 Euryarchaeota/Methanobacteria Methanococcus jannaschii B U67517 Euryarchaeota/Methanococci Methanococcus vannielii M36507 Euryarchaeota/Methanococci Methanospirillum hungatei M60880 Euryarchaeota/Methanomicrobia Natronobacterium magadii X72495 Euryarchaeota/Halobacteria Pyrobaculum islandicum L07511 Crenarchaeota/Thermoprotei Pyrococcus abyssi AJ248283 Euryarchaeota/Thermococci Pyrococcus horikoshii AP000001 Euryarchaeota/Thermococci Sulfolobus acidocaldarius U05018 Crenarchaeota/Thermoprotei Sulfolobus shibatae M32504 Crenarchaeota/Thermoprotei Sulfolobus solfataricus X90483 Crenarchaeota/Thermoprotei Thermococcus celer M21529 Euryarchaeota/Thermococci Thermofilum pendens X14835 Crenarchaeota/Thermoprotei Thermoplasma acidophilum M38637 Euryarchaeota/Thermoplasmata

Table B4 Time estimates for ML phylogenies of protein and rRNA data sets. Abbreviations: Act: Actinobacteria, Dcc: Deinococci, Cnb: Cyanobacteria, Chf: Chloroflexi, Bcl: Bacilli, Mll: Mollicutes, Cls: Clostridia, Firmi: Firmicutes, Chb: Chlorobia, Bct: Bacteroidetes, Chd: Chlamydiae, Sbc: Solibacteres, Fbc: Fibrobacteres, Delta: Deltaproteobacteria, Alpha: Alphaproteobacteria, Beta: Betaproteobacteria, Gamma: Gammaproteobacteria, Fsb: Fusobacteria, Aqf: Aquificae, Tmt: Thermotogae, Eubact: Eubacteria, Mtb: Methanobacteria, Mcc: Methanococci, Mpy: Methanopyri, Mmb: Methanomicrobia, Hlb: Halobacteria, Agb: Archaeoglobi, Tpm: Thermoplasmata, Tcc: Thermococci, Tpt: Thermoprotei, Nan: Nanoarchaeota, Cren: Crenarchaeota, Eury: Euryarchaeota, Archaebact: Archaebacteria, CI: credibility interval, StDev: standard deviation. Times are in billion years (Ga).

Calibration nodes: Tmt/Eubact max 4.2 Ga; Gamma/Beta min 1.64 Ga; Chb/Bct min 1.64 Ga; Cnb/Chf min 2.3 Ga; Terrabacteria max 4.0 Ga

Protein data set (T3np)
Eubacteria Nodes      Time   StDev  CI
Act/Dcc               2.664  0.135  2.427, 2.953
Cnb/Chf               2.503  0.130  2.312, 2.796
Bcl/Mll               1.876  0.162  1.537, 2.187
Cls/Bcl               2.422  0.138  2.182, 2.717
Firmi/Cnb             2.752  0.129  2.535, 3.036
Terrabacteria         2.872  0.131  2.642, 3.151
Spiroplancti          2.443  0.157  2.135, 2.762
Chb/Bct               2.126  0.176  1.769, 2.469
Chd/Chb               2.701  0.138  2.457, 2.996
Spirochlamydiae       2.841  0.133  2.604, 3.123
Delta/Sbc             2.507  0.128  2.285, 2.785
Gamma/Beta            1.756  0.095  1.644, 1.993
Alpha/Gamma           2.369  0.117  2.178, 2.638
Gamma/Delta           2.666  0.127  2.447, 2.946
Proteobacteria        2.777  0.130  2.548, 3.060
Hydrobacteria         2.911  0.132  2.670, 3.190
Heliocytes            3.065  0.135  2.812, 3.341
Fsb/Eubact            3.235  0.136  2.965, 3.496
Aqf/Eubact            3.968  0.145  3.622, 4.169
Tmt/Eubact            4.063  0.120  3.756, 4.195

rRNA data set (T3np)
Eubacteria Nodes      Time   StDev  CI
Bcl/Mll               2.145  0.154  1.822, 2.433
Cls/Bcl               2.281  0.135  2.017, 2.545
Chd/Pct               2.235  0.113  2.015, 2.459
Chb/Bct               1.926  0.126  1.682, 2.174
Chd/Chb               2.356  0.103  2.160, 2.564
Fbs/Chd               2.428  0.103  2.236, 2.639
Spirochlamydiae       2.478  0.103  2.282, 2.689
Beta/Gamma            1.674  0.033  1.640, 1.760
Alpha/Gamma           2.050  0.074  1.916, 2.206
Delta/Gamma           2.292  0.090  2.126, 2.477
Proteobacteria        2.393  0.095  2.216, 2.591
Hydrobacteria         2.545  0.102  2.357, 2.757
Proteo/Act            2.712  0.113  2.502, 2.946
Proteo/Firmi          2.764  0.116  2.546, 2.997
Cnb/Eubact            2.847  0.124  2.619, 3.095
Dcc/Eubact            3.431  0.152  3.146, 3.740
Tmt/Eubact            3.668  0.158  3.379, 3.999
Aqf/Eubact            4.118  0.073  3.925, 4.197


rRNA data set
Calibration nodes: Tmt/Eubact max 4.2 Ga; Gamma/Beta min 1.64 Ga; Chb/Bct min 1.64 Ga

                      T3p                         T3np
Eubacteria Nodes      Time   StDev  CI            Time   StDev  CI            NPRS   LF
Bcl/Mll               2.166  0.142  1.881, 2.440  2.142  0.155  1.819, 2.431  2.277  2.281
Cls/Bcl               2.287  0.133  2.026, 2.552  2.277  0.137  2.006, 2.542  2.367  2.360
Chd/Pct               2.248  0.115  2.029, 2.477  2.232  0.114  2.010, 2.461  2.179  2.422
Chb/Bct               1.923  0.120  1.692, 2.164  1.923  0.124  1.687, 2.168  1.781  2.038
Chd/Chb               2.354  0.108  2.152, 2.577  2.352  0.104  2.157, 2.567  2.347  2.576
Fbs/Chd               2.436  0.107  2.237, 2.657  2.425  0.104  2.235, 2.638  2.444  2.642
Spirochlamydiae       2.482  0.108  2.281, 2.702  2.474  0.104  2.285, 2.686  2.512  2.686
Beta/Gamma            1.674  0.032  1.640, 1.756  1.675  0.033  1.640, 1.763  1.640  1.640
Alpha/Gamma           2.049  0.078  1.903, 2.209  2.047  0.075  1.911, 2.208  2.029  2.039
Delta/Gamma           2.297  0.094  2.123, 2.491  2.290  0.091  2.124, 2.480  2.288  2.349
Proteobacteria        2.405  0.101  2.220, 2.613  2.390  0.097  2.212, 2.592  2.409  2.506
Hydrobacteria         2.566  0.107  2.367, 2.783  2.541  0.104  2.354, 2.758  2.594  2.738
Proteo/Act            2.739  0.118  2.518, 2.976  2.707  0.115  2.495, 2.944  2.769  2.851
Proteo/Firmi          2.790  0.120  2.563, 3.033  2.758  0.117  2.543, 3.000  2.831  2.889
Cnb/Eubact            2.887  0.131  2.644, 3.150  2.842  0.125  2.612, 3.098  2.919  2.946
Dcc/Eubact            3.489  0.164  3.187, 3.823  3.424  0.155  3.138, 3.740  3.419  3.344
Tmt/Eubact            3.719  0.164  3.417, 4.054  3.659  0.161  3.366, 4.002  3.639  3.525
Aqf/Eubact            4.119  0.074  3.924, 4.197  4.118  0.073  3.930, 4.197  4.200  4.200

Archaebacteria Nodes
Hlb/Mmb               3.526  0.207  3.093, 3.895  3.576  0.268  3.026, 4.052  1.613  1.982
Agb/Hlb               1.247  0.179  0.920, 1.623  1.104  0.193  0.751, 1.514  2.474  2.675
Tpm/Hlb               2.252  0.209  1.852, 2.665  1.999  0.244  1.528, 2.486  2.882  2.959
Mcc/Hlb               2.695  0.205  2.298, 3.107  2.508  0.245  2.032, 3.000  3.171  3.112
Mtb/Hlb               3.061  0.195  2.687, 3.452  2.936  0.236  2.484, 3.402  3.371  3.203
Tcc/Hlb               3.280  0.195  2.916, 3.673  3.182  0.235  2.750, 3.656  3.559  3.287
Mpy/Hlb               3.493  0.179  3.168, 3.857  3.417  0.223  3.028, 3.874  3.901  3.466
Nan/Tpt               3.709  0.156  3.472, 4.031  3.783  0.205  3.474, 4.206  3.797  3.365
Eury/Cren             3.949  0.161  3.618, 4.188  4.053  0.213  3.638, 4.381  4.200  3.706


Protein data set
Calibration nodes: Tmt/Eubact max 4.2 Ga; Gamma/Beta min 1.64 Ga; Chb/Bct min 1.64 Ga

                      T3p                         T3np
Eubacteria Nodes      Time   StDev  CI            Time   StDev  CI            NPRS   LF
Act/Dcc               2.507  0.076  2.357, 2.659  2.610  0.161  2.306, 2.939  2.825  2.977
Cnb/Chf               2.420  0.082  2.258, 2.583  2.435  0.172  2.104, 2.781  2.629  2.803
Bcl/Mll               1.864  0.083  1.702, 2.030  1.825  0.184  1.445, 2.178  2.088  2.206
Cls/Bcl               2.286  0.076  2.135, 2.432  2.360  0.173  2.022, 2.709  2.558  2.662
Firmi/Cnb             2.604  0.073  2.460, 2.748  2.694  0.158  2.402, 3.019  2.954  3.123
Terrabacteria         2.663  0.070  2.524, 2.798  2.819  0.154  2.533, 3.134  3.098  3.278
Spiroplancti          2.444  0.077  2.291, 2.597  2.410  0.161  2.109, 2.747  2.538  2.857
Chb/Bct               1.933  0.074  1.788, 2.081  2.109  0.173  1.768, 2.458  2.086  2.272
Chd/Chb               2.526  0.070  2.389, 2.664  2.663  0.149  2.399, 2.983  2.875  3.146
Spirochlamydiae       2.618  0.068  2.486, 2.751  2.798  0.147  2.530, 3.109  3.061  3.335
Delta/Sbc             2.360  0.065  2.237, 2.491  2.477  0.135  2.247, 2.774  2.597  2.691
Gamma/Beta            1.664  0.023  1.640, 1.722  1.747  0.092  1.643, 1.983  1.640  1.640
Alpha/Gamma           2.224  0.055  2.121, 2.337  2.345  0.121  2.149, 2.621  2.402  2.441
Gamma/Delta           2.451  0.061  2.333, 2.572  2.632  0.137  2.394, 2.934  2.819  2.949
Proteobacteria        2.562  0.064  2.439, 2.690  2.738  0.143  2.485, 3.048  2.978  3.175
Hydrobacteria         2.692  0.066  2.565, 2.821  2.865  0.148  2.594, 3.174  3.152  3.403
Heliocytes            2.831  0.067  2.702, 2.965  3.017  0.152  2.729, 3.326  3.333  3.542
Fsb/Eubact            3.016  0.065  2.889, 3.146  3.190  0.151  2.888, 3.483  3.511  3.670
Aqf/Eubact            4.169  0.021  4.115, 4.196  3.941  0.161  3.555, 4.164  3.930  3.975
Tmt/Eubact            4.185  0.015  4.143, 4.199  4.047  0.136  3.695, 4.194  4.200  4.200

Archaebacteria Nodes
Mtb/Mcc               3.119  0.060  2.998, 3.230  2.363  0.430  1.433, 3.117  3.036  2.823
Mpy/Mcc               3.339  0.038  3.260, 3.410  2.851  0.354  2.088, 3.459  3.310  3.176
Mmb/Hlb               2.378  0.088  2.205, 2.546  1.491  0.410  0.758, 2.352  2.147  2.180
Agb/Hlb               2.851  0.066  2.714, 2.973  2.450  0.400  1.609, 3.168  2.791  2.745
Tpm/Hlb               3.127  0.052  3.019, 3.224  3.183  0.255  2.595, 3.589  3.143  3.104
Methanogenesis        3.468  0.008  3.460, 3.489  3.566  0.092  3.463, 3.801  3.460  3.460
Tcc/Archaebact        3.567  0.041  3.489, 3.648  3.726  0.143  3.503, 4.035  3.645  3.610
Tpt/Archaebact        4.187  0.009  4.164, 4.198  4.025  0.105  3.777, 4.173  3.921  3.896
Nan/Archaebact        4.194  0.006  4.176, 4.199  4.105  0.083  3.887, 4.196  4.200  4.200
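
As a rough illustration of how a row of Table B4 might be read programmatically, the following Python sketch parses one row of the three-method tables (node label, then time, standard deviation and credibility interval for T3p and for T3np, then two further point estimates) and checks it against a minimum-age calibration. The names `Estimate`, `parse_row` and `respects_minimum` are hypothetical helpers introduced here, and treating the last two columns as two separate point estimates is an assumption based on the column header.

```python
import re
from dataclasses import dataclass


@dataclass
class Estimate:
    time: float   # estimated divergence time (Ga)
    stdev: float  # standard deviation (Ga)
    ci: tuple     # (lower, upper) credibility interval (Ga)


def parse_row(row: str):
    """Split one Table B4 row into a node label and its numeric fields.

    Assumed layout (hypothetical helper, not part of the thesis):
    label, T3p time/stdev/CI, T3np time/stdev/CI, two point estimates.
    """
    label = row.split()[0]
    numbers = [float(x) for x in re.findall(r"\d+\.\d+", row)]
    if len(numbers) != 10:
        raise ValueError(f"expected 10 numeric fields, got {len(numbers)}")
    t3p = Estimate(numbers[0], numbers[1], (numbers[2], numbers[3]))
    t3np = Estimate(numbers[4], numbers[5], (numbers[6], numbers[7]))
    return label, t3p, t3np, (numbers[8], numbers[9])


def respects_minimum(est: Estimate, minimum: float) -> bool:
    """True if the whole credibility interval lies at or above a minimum age."""
    return est.ci[0] >= minimum


# Gamma/Beta row of the protein data set; calibration minimum is 1.64 Ga.
row = "Gamma/Beta 1.664 0.023 1.640, 1.722 1.747 0.092 1.643, 1.983 1.640 1.640"
label, t3p, t3np, point_estimates = parse_row(row)
print(label, t3p.time, t3np.ci, respects_minimum(t3p, 1.64))
```

Requiring exactly ten numeric fields makes the sketch fail loudly on rows in which two values have run together, rather than silently shifting columns.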


Vita − Fabia Ursula Battistuzzi
208 Mueller Laboratory, The Pennsylvania State University, University Park, PA 16802
Phone: (814) 863-0278 Fax: (814) 865-9131

Education
Doctor of Philosophy in Biology (Molecular Evolution) and Astrobiology, The Pennsylvania State University, Eberly College of Science, PA (USA) − December 2007
Bachelor in Biology, Università degli Studi Roma Tre, Rome (Italy) − July 2001

Research interests
Astrobiology, molecular and evolutionary biology, evolution of early life (prokaryotes and origin of eukaryotes) and early Earth history

Publications
S.B. Hedges, F.U. Battistuzzi, J.E. Blair. (2006). Molecular timescale of evolution in the Proterozoic. Pp 199-229 in S. Xiao and A. J. Kaufman (Eds.), Neoproterozoic Geobiology and Paleobiology, Springer, New York.
A. Riccardi, S. Domagal-Goldman, F.U. Battistuzzi, V. Cameron. Astrobiology influx or in flux? (2006) Astrobiology 6(3)
F.U. Battistuzzi, A. Feijao, S.B. Hedges. A genomic timescale of prokaryote evolution: insights into the origin of methanogenesis, phototrophy, and the colonization of land (2004) BMC Evolutionary Biology, 4: 44
F.U. Battistuzzi, A. Feijao, S.B. Hedges. A genomic timescale of prokaryote evolution and early Earth history. (abstract published in International Journal of Astrobiology Supplement 1, March 2004: 50)

Research presentations
Center for Tropical & Emerging Global Diseases & Department of Genetics, University of Georgia, Athens, GA – June 20, 2007
Society for Molecular Biology and Evolution Meeting SMBE2006, Tempe, AZ – May 24-28, 2006
National Workshop in Astrobiology, Capri, Italy – Oct 26-28, 2005
Congresso dei Biologi Evoluzionisti Italiani 2005, Ferrara, Italy – Aug 24-26, 2005
NASA Astrobiology Institute Science Conference, NASA Ames, Moffett Field, CA – March 29 - April 1, 2004

Professional associations
European Astrobiology Network Association (EANA), Italian Society for Evolutionary Biology (ISEB), Society for Molecular Biology and Evolution (SMBE), Penn State Astrobiology Research Center

Teaching experience
Teaching assistant for Bio 427: Evolution, The Pennsylvania State University, Fall 2005 - 2006
Teaching assistant for Bio 110: basic concepts and biodiversity, The Pennsylvania State University, Fall 2003
Training of undergraduate students during the Astrobiology Summer Program, The Pennsylvania State University, Dept. of Biology, Summer 2003 - 2005