Gene transfer history of carbon fixation proteins constrains marine cyanobacteria divergence times

by

Makayla N. Betts

B.S., Microbiology University of California, Davis, 2016

Submitted to the MIT Department of Earth, Atmospheric and Planetary Sciences in Partial Fulfillment of the Requirements for the Degree

of

Master of Science in Earth and Planetary Sciences

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2018

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Signature of Author: MIT School of Science, Department of Earth, Atmospheric and Planetary Sciences, May 23, 2018

Certified by: Greg Fournier, Assistant Professor of Geobiology, Department of Earth, Atmospheric and Planetary Sciences, Thesis Supervisor

Accepted by: Robert D. Van der Hilst, Schlumberger Professor of Earth Sciences, Head of the Department of Earth, Atmospheric and Planetary Sciences

Gene transfer history of carbon fixation proteins constrains marine cyanobacteria divergence times

by

Makayla N. Betts

Submitted to the Department of Earth, Atmospheric and Planetary Sciences on May 23, 2018 in Partial Fulfillment of the Requirements for the Degree of Master of Science in Earth and Planetary Sciences

ABSTRACT

Carboxysomes provide an avenue for narrowing the timing of evolutionary events in groups of cyanobacteria that are ecologically dominant in modern marine environments - groups that may have had an integral role in oxygenating the Earth's atmosphere. Here I show that using concatenated phylogenies of carbon fixation proteins better informs the horizontal gene transfer event that brought carboxysomes from purple sulfur bacteria into marine cyanobacteria and that this gene history aids in constraining the evolutionary timing of carbon fixation. Genes encoding the proteins for the α-carboxysomal shell as well as RuBisCO and carbonic anhydrase are co-located on the genomes of various cyanobacteria in the Prochlorococcus and Synechococcus groups. Previous studies have shown that these genes were likely horizontally transferred together from Chromatiales (purple sulfur bacteria), a group of phototrophic γ-proteobacteria. While many of these genes are highly conserved and thus yield poorly resolved phylogenies, their concatenation clarifies a shared evolutionary history. This work integrates gene transfer with molecular clock calibration methods to determine divergence times. Accordingly, I evaluate the relationship between atmospheric evolution and the ecology of important groups of phototrophs.

Thesis Supervisor: Gregory Fournier Title: Assistant Professor of Geobiology


INTRODUCTION

Molecular data offer an exciting window into the relationships between microbial communities and ancient Earth environments. Combining phylogenetic reconstruction methods with calibrated molecular clocks allows divergences of organisms to be pinned to a timeline of Earth's history and evolutionary events to be correlated with the changing planet. Recent studies have explored the timing of major metabolisms such as oxygenic photosynthesis and methanogenesis (Rothman et al., 2014; Wolfe and Fournier, 2018). No matter the question, reaching confident conclusions in this realm of research requires targeted searching of genomic databases and well-resolved phylogenetic histories of genes that have important evolutionary functions.

The evolution of oxygenic photosynthesis and the timing of its subsequent effects on the Earth's climate are not well understood, but cyanobacteria are believed to have had pivotal roles (Schirrmeister et al., 2013). The appearance of abundant marine planktonic cyanobacteria in particular may have instigated the oxygenation of the ocean, with major impacts on biogeochemical cycling. Previous studies that use molecular clock methods to estimate divergence times in cyanobacteria suggest variable age distributions for the evolution of the marine planktonic cyanobacteria (Magnabosco et al., 2018; Sánchez-Baracaldo, 2015; Shih et al., 2017). These distributions are broad and cover a dynamic period in Earth's history. Improved geochronology techniques and greater sampling have enabled more precise dating of geological features in stratigraphy throughout the Neoproterozoic (Macdonald et al., 2010). This is a time during which global glaciation events are believed to have coincided with large perturbations in the carbon cycle. The exact triggers and feedbacks influencing these events are debated. The relative timing of Neoproterozoic global glaciations, changing ocean biogeochemistry, the rise of planktonic marine cyanobacteria, and the consequences for the biological carbon pump remains enigmatic (Anbar and Knoll, 2002; Blank et al., 2010).

Carbon dioxide (CO2) concentrating mechanisms (CCMs) in cyanobacteria provide an opportunity to study adaptations that potentially arose in response to changing CO2 levels in organisms engaging in oxygenic photosynthesis. If CCMs contribute significantly to cyanobacterial calcification, then their role in establishing a biological carbon pump could be significant (Merz, 1992; Riding, 2006). The carboxysome is an exciting case because the suite of genes that encodes it shows a clear, shared history of horizontal transfer. Carboxysomes are microcompartments present in all photosynthetically competent species of cyanobacteria and some autotrophic bacteria. The organisms concentrate CO2 inside the proteinaceous shell of the carboxysome, which contains ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO) and carbonic anhydrase. In cyanobacteria, carboxysomes exist in two forms, α and β, which correspond to the type of RuBisCO they encapsulate and with which they have undergone convergent evolution (Burnap et al., 2015; Rae et al., 2013; Yeates et al., 2014). While it is believed that at least one gene that contributes to the structure and function of α-carboxysomes was horizontally transferred from purple phototrophic γ-proteobacteria, the rest of their evolutionary history is relatively obscure (Abdul-Rahman et al., 2013; Cai et al., 2015; Marin et al., 2007). Combining the phylogenies of additional carboxysomal components yields enough evolutionary information for subsequent molecular clock analyses to make more robust inferences about the evolution of the α-cyanobacteria. Adding calibrations from the Fournier lab analyses of cyanobacterial and proteobacterial divergence times contributes further parameters to constrain the molecular clock. This allows for an evaluation of the appearance of marine planktonic cyanobacteria in relation to the timing of major shifts in Earth's climate.

Earth's History of Oxygenation

The oxygenation of Earth's atmosphere is a topic of vibrant discussion in scientific communities across a wide range of expertise. Current research areas of particular interest include how oxygen levels could have been influenced by the movement of the continents, the effects of hydrogen escape from the atmosphere, and the evolution of oxygenic photosynthetic organisms (Blank et al., 2010). Cyanobacteria are broadly believed to have produced the abundance of oxygen that eventually accumulated after various geological feedback mechanisms. There have been stepwise increases in oxygen, even more recently than the Great Oxidation Event (GOE) - it has been suggested that oxygen levels stayed quite low at least through the mid-Proterozoic (Planavsky et al., 2014). The causes of these increases are also unknown. Changes in biological systems may have been a major, if not primary, influence on these increases. Previous phylogenetic studies have suggested that extant marine cyanobacteria in the Synechococcus/Prochlorococcus (SynPro) clade diverged relatively recently within cyanobacterial evolution. These organisms numerically dominate the ocean today, and Prochlorococcus is particularly prolific in oligotrophic water (Flombaum et al., 2013). Accordingly, they may be associated with the oxygenation of the ocean as well as the secondary Neoproterozoic rise in atmospheric oxygen.

[Figure 1: schematic of the ocean and its sediments in three panels: Archean, 1850 - 1250 Ma, and Phanerozoic.]

Figure 1. From Anbar and Knoll (2002), Proterozoic Ocean Chemistry and Evolution: A Bioinorganic Bridge?, illustrating the transition from a largely anoxic Archean ocean to the oxygenated ocean of today.

Furthermore, the timing of colonization of the open ocean by oxygenic photosynthetic organisms and the resultant influence on the biological pump, on the precipitation of carbonate and, ultimately, on the climate, is uncertain. One model has been that pelagic cyanobacteria had already colonized the open ocean in the Archaean (Kasting, 1987), and that a modern biological pump was established by the Proterozoic (Canfield, 1998). However, others argue that a modern biological pump was not established until the Neoproterozoic (Lenton et al., 2014; Logan et al., 1995), and that there was little or no oxygenic primary productivity in the open ocean until then (Johnston et al., 2009; Johnston et al., 2010; Sánchez-Baracaldo, 2015). Accordingly, gaining better resolution on the divergences of the SynPro clade and their radiation following the acquisition of carboxysomes could yield important insight into the co-evolution of the biosphere and the Earth's climate. Improved calibration strategies for molecular clocks offer a better opportunity to answer these questions by narrowing the time at which organisms acquired key evolutionary innovations.

The Evolution and Ecology of the SynPro Clade

Cyanobacteria are hypothesized to have evolved in freshwater environments and subsequently moved into marine and open ocean environments - this is based both on the phylogenetic distribution of habitat in modern cyanobacteria and on physiological experiments of salinity tolerance in various lineages (Hermann and Gehrinnger, 2017; Blank and Sanchez-Baracaldo, 2010). Cyanobacteria have also been taken up multiple times in symbiotic photosynthetic relationships across a broad range of taxa that then went on to have major impacts on Earth's ecosystems and climate. They operate as endosymbionts and as external facultative symbionts with sponges. They have also been engulfed by single eukaryotic cells to become the ancestors of plastids in algae, plants, and amoebae (Paulinella chromatophora). Algae themselves have subsequently evolved symbiotic relationships with a variety of biogeochemically important organisms, most notably corals. Algae are even known to be involved in additional symbiotic relationships with other eukaryotic organisms, including diatoms and radiolaria.

Besides an efficient CO2 concentrating mechanism, there are numerous ecological factors that contribute to the success of marine planktonic cyanobacteria. Close associations exist between these cyanobacteria and other bacteria that may be critical to their success and contemporary oxygen production - between Prochlorococcus and SAR11, for example (Braakman et al., 2017). The SynPro group as a whole, and particularly Prochlorococcus, has also exhibited strong genome streamlining (Billing et al., 2015; Coleman et al., 2006). In combination with the separation of the lineages into ecological niches in the upper surface waters of the ocean, distinct ecotypes form (Rocap et al., 2003). Elucidating the relationships between the climate and oxygenic photosynthetic organisms during a dynamic period in Earth's history is important to understanding the carbon cycle of the planet.

Carboxysome Structure and Evolution

The carboxysome is but one example of an important evolutionary adaptation: the spatial and temporal separation of metabolic processes, accomplished through compartmentation within cells. Major radiations of organisms and consequent changes in their environment therefore have the potential to be characterized by evolutionary innovations that involve the adaptation of compartmentation and increased regulation. This is well understood in the eukaryotic realm of organisms, particularly in the broader understanding of stages in their evolution (e.g., the acquisition of mitochondria, chloroplasts, and other organelles). Compartmentation is frequently noted as a distinguishing trait of eukaryotic organisms in comparison to the other two Domains of life, Bacteria and Archaea. Yet compartmentation of enzyme activity and nutrient storage occurs in bacteria as well. Though less well understood, such features in bacterial lineages hold promise for understanding how life on Earth has co-evolved with its environment and for advancing efficiency in future biotechnology. A range of bacterial microcompartments is currently known; they exist across a wide array of groups and have a range of metabolic functions (Abdul-Rahman et al., 2013; Kerfeld et al., 2018). Genes encoding the structural proteins of these microcompartments as well as the active enzymes that reside within them are most often found in a single locus on an organism's genome and are co-transcribed.

Of particular interest is the carboxysome, a key player in the CO2 concentrating mechanism of photosynthetic microorganisms, including cyanobacteria. The carboxysome is a proteinaceous microcompartment that, in cyanobacteria, houses RuBisCO and carbonic anhydrase for oxygenic photosynthesis. A CO2 concentrating mechanism is necessary within carbon fixing cells because the setup increases the partial pressure of CO2 around key enzymes. Such concentrating mechanisms possibly became necessary once falling CO2 levels or rising O2 levels posed challenges for the efficiency of RuBisCO. Knowing the timing of carboxysome evolution in these groups of organisms would therefore add to our understanding of how photosynthetic organisms like the oxygenic photosynthetic cyanobacteria, confidently implicated in the oxygenation of Earth's atmosphere, fit into the co-evolution of life and the planet. There are two types of carboxysomes present in modern cyanobacteria, α- and β-carboxysomes (Kerfeld et al., 2018). α-carboxysomes are present in cyanobacteria groups that are primarily marine, while the β-carboxysomes are distributed more broadly in environment. α- and β-carboxysomes are distinct in their structure, assemblage, and evolutionary history.

[Figure 2: schematic comparing the α-carboxysome (proteins CsoS1a,b,c, CsoS2, CsoS3, and CsoS4; Form 1A Rubisco) with the β-carboxysome (proteins CcmK, CcmL, CcmO, and CcmN; Form 1B Rubisco), including their Rubisco packing.]

Figure 2. From Rae et al., 2013, illustrating the two types of carboxysomes and their components - shell proteins, carbonic anhydrase, Rubisco - as well as their predicted internal packing structure.

Much of the work that contributes to our current understanding of the α-carboxysome stems from physiological experiments geared toward their future use in synthetic biology and industrial practices. This work includes physiological experiments analyzing their usage, flexibility, environmental distribution, and concomitant assembly (Badger and Price, 1994; Badger et al., 2002; Kupriyanova et al., 2013; Kerfeld et al., 2018; Rae et al., 2013; Whitehead et al., 2014). Studies quantifying the optimization of metabolism via compartmentalization further support the size of the carboxysome and the average density of the enzymes it encapsulates as essential factors in their success, and less so the arrangement of the enzymes within it (Hinzpeter et al., 2017).

Previous phylogenetic analyses based on the main structural protein, CsoS2, suggested that α-carboxysomes were transferred into the cyanobacteria via horizontal gene transfer from Chromatiales (purple sulfur bacteria) (Cai et al., 2015). The carboxysome is encoded by a set of genes co-located on bacterial genomes in an operon (Cai et al., 2009). Some of these genes encode structural proteins (pentameric and hexameric structural proteins that include highly conserved Pfam domains) (Cai et al., 2009). Others encode the enzymes that are active within this proteinaceous shell, including both the large and small subunits of RuBisCO and carbonic anhydrase (So et al., 2003). Proximity on the genome tends to be positively correlated with metabolic relatedness because genes that sit together are transcribed and translated together. Though not perfect, similar patterns of gene synteny can generally be used as evidence for evolutionary relatedness.

[Figure 3: phylogenetic tree with clades labeled Cyanobacteria, Purple Bacteria, α-Proteobacteria, γ-Proteobacteria, and Actinobacteria, among others.]

Figure 3. Phylogeny of the CsoS2 carboxysomal structural protein from the study by Cai et al., 2015, highlighting the likelihood that carboxysome proteins were horizontally transferred into Cyanobacteria from Purple Sulfur Bacteria.

Proximity on the genome is also positively correlated with the probability of genes to be horizontally transferred together. As discussed, horizontal gene transfer occurs through various means, including the formation and exchange of plasmids and viral gene capture. When genes are closer together on the genome, they are more likely to be taken up and transferred together. This might be convenient for the retention of large packets of horizontally transferred genes (an organism might be more likely to retain a large packet of operational genes than a series of partial and consequently non-useful ones).

METHODOLOGICAL THEORY

In order to analyze the history of genes and genomes, informative and robust sequence-based phylogenies can be generated. Today, this is a multi-faceted challenge that includes targeted searches of functional genes, sampling a useful set of taxa, making concrete biological interpretations of conservation and speciation, and drawing from an array of bioinformatics software that use different evolutionary models and assumptions. There are a few basic terms to know when discussing a phylogenetic tree. A clade is defined by a node and includes all of the taxa grouped by that node. Crown groups refer to a select set of extant taxa that can be monophyletic (all taxa of a single clade), paraphyletic (some but not all taxa of a single clade), or polyphyletic (taxa drawn from multiple clades, excluding their common ancestor). Stem groups are, for a given crown group, all of the lineages of organisms that are known to have existed or could have existed along the branch preceding that crown group. Unrooted trees are rooted by identifying an outgroup - single or multiple taxa that diverged before the rest of the taxa in the tree.

Phylogenetic Construction Methods

Bioinformatics analyses for phylogenetic inference use genome sequences collected from extant organisms in the environment and follow an established pipeline. Once the nucleotide sequence of a genome is determined, it is generally deposited and made available on the National Center for Biotechnology Information (NCBI) online databases. Protein sequences inferred from translated nucleotide sequences of genes are also included in these databases and can be queried via NCBI's BLAST (Basic Local Alignment Search Tool) to search for all matches in the database that are related to a gene of interest (Altschul et al., 1990).

In order to construct the most likely evolutionary relationships between organisms, a best-fitting evolutionary model is first determined for the aligned sequences. Such models can take into account a variety of parameters, primarily an amino acid substitution matrix. Amino acid substitution matrices include LG (Le and Gascuel, 2008), WAG (Whelan and Goldman, 2001), JTT (Jones, Taylor, Thornton, 1992), Dayhoff (Dayhoff, 1978), and Blosum62 (Henikoff and Henikoff, 1992), each of which proposes different relative rates at which one amino acid substitutes for another, based on understandings of how amino acid properties affect protein structure. For example, there is a higher relative rate by which a hydrophobic amino acid is replaced by another hydrophobic amino acid rather than a hydrophilic one. Evolutionary models may also incorporate non-stationary nucleotide substitutions to allow for evolving nucleotide frequencies over time and empirical amino acid frequencies. A proportion of sites can be allowed to remain unchanged, or held invariant, and heterogeneity of the evolutionary rate distribution among the sites can be included in models.
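As a minimal sketch of how a "best-fitting" model might be chosen, the Python below ranks candidate models by the Akaike information criterion (AIC = 2k - 2 ln L). The use of AIC here is an assumption for illustration, since the text does not say which criterion was applied, and the log-likelihoods and parameter counts are hypothetical values, not results from this study. Parameter counts list only the extra free parameters of each model (for example, +G for among-site rate heterogeneity, +I for invariant sites), assuming the branch lengths are shared across candidates.

```python
def aic(log_likelihood: float, n_free_params: int) -> float:
    """Akaike information criterion; lower values indicate a better trade-off
    between fit (log-likelihood) and complexity (number of free parameters)."""
    return 2.0 * n_free_params - 2.0 * log_likelihood


def best_model(fits: dict) -> str:
    """Return the name of the model with the lowest AIC.

    fits maps a model name to a (log_likelihood, n_free_params) tuple.
    """
    return min(fits, key=lambda name: aic(*fits[name]))


# Hypothetical fits for a single protein alignment (illustrative numbers only).
candidate_fits = {
    "LG": (-12345.6, 0),
    "LG+G": (-12100.2, 1),
    "WAG+G": (-12180.9, 1),
    "LG+I+G": (-12099.9, 2),
}

if __name__ == "__main__":
    for name, (lnl, k) in candidate_fits.items():
        print(f"{name:8s} lnL={lnl:10.1f} k={k} AIC={aic(lnl, k):.1f}")
    print("selected:", best_model(candidate_fits))
```

In this toy comparison the small likelihood gain from adding invariant sites does not offset the penalty for the extra parameter, which is the kind of trade-off a model-selection step is meant to expose.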

Sequence collection (GenBank; BLAST) -> Alignment (MUSCLE; evaluate with GUIDANCE) -> Evolutionary model (ProtTest) -> Tree building (RAxML/PhyloBayes) -> Analysis (evaluate HGT; add calibrations for molecular clock analysis)

Figure 4. Basic outline of methods used in the Fournier Lab to analyze molecular data.

Maximum likelihood methods (refs) construct the most likely evolutionary relationships between taxa. The number of possibilities in which a group of taxa can be related increases quickly due to the mathematics behind rooted and unrooted trees.

For a rooted tree with n taxa, there are:

$$(2n - 3)!! = \frac{(2n - 3)!}{2^{n-2}\,(n - 2)!}, \quad \text{for } n \geq 2$$

For an unrooted tree with n taxa, there are:

$$(2n - 5)!! = \frac{(2n - 5)!}{2^{n-3}\,(n - 3)!}, \quad \text{for } n \geq 3$$
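To give a feel for how quickly these counts grow, the short Python sketch below evaluates the double-factorial expressions for the number of rooted and unrooted bifurcating trees. The function names are chosen here for illustration only.

```python
def double_factorial(k: int) -> int:
    """Return k!! = k * (k - 2) * (k - 4) * ..., taken as 1 when k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_tree_count(n_taxa: int) -> int:
    """Number of rooted bifurcating trees on n_taxa labeled tips: (2n - 3)!!."""
    if n_taxa < 2:
        raise ValueError("a rooted tree needs at least 2 taxa")
    return double_factorial(2 * n_taxa - 3)


def unrooted_tree_count(n_taxa: int) -> int:
    """Number of unrooted bifurcating trees on n_taxa labeled tips: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("an unrooted tree needs at least 3 taxa")
    return double_factorial(2 * n_taxa - 5)


if __name__ == "__main__":
    for n in (4, 10, 20, 50):
        print(n, rooted_tree_count(n), unrooted_tree_count(n))
```

Four taxa give 15 rooted and 3 unrooted trees, while ten taxa already give 34,459,425 rooted and 2,027,025 unrooted trees.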

As more taxa are included, the set of possible phylogenies, which can be thought of as "tree space," grows rapidly to extremely large numbers. Such numbers are non-trivial for contemporary computing power. While large, the tree space is generally non-uniform in probability, in that the establishment of initial bifurcating nodes affects the rest of the groupings. This means that the likelihood surface over tree space can exhibit local optima: trees that are the best solution given certain nodes in place, but that are less likely than the true, most likely tree at the global optimum of the tree space. To deal with these impossibly large and non-uniform tree spaces, statistical programs use strategies such as parallel Markov chain Monte Carlo to explore tree space more effectively (Larget and Simon, 1999). In this approach, the software evaluates the likelihood of the observed sequence data under different trees, seeking the tree that maximizes this value. The "true" tree is difficult to obtain through statistical analysis of extant organisms - consistent patterns of evolution that are well-supported by sequence data allow us to draw conclusions about the most likely solution.

The Imperfect Yet Useful Molecular Clock

Once phylogenetic topologies are constructed, the divergences of organisms can be placed onto an absolute timeframe. The concept of the molecular clock was established (Zuckerkandl and Pauling, 1962; Zuckerkandl and Pauling, 1965) with the understanding that DNA mutates at a consistent rate. With this set rate, the time at which two species diverged can be calculated from the number of mutations between them (in the form of genetic distance between organisms). This method of using flat rates to determine the timing of evolutionary events, however, has proved unreasonable given the biological and geological context in which we know life to have existed. There are many reasons that a molecular clock with a flat rate over time is a faulty method - these include the relationship between mutation rates and speciation events, differences between lab and natural environments, actual change in the rates of mutation due to external energy fluxes or adaptation to mutate less given equivalent external energy fluxes, varying mutation rates between species or whole groups of species, and accelerated rates of mutation in species when symbiosis between organisms leads to a rapid reduction of the genome.
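The flat-rate calculation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a method used in this work: the genetic distance is taken as substitutions per site between two extant taxa, the rate as substitutions per site per million years along a single lineage, and the factor of two reflects the assumption that both lineages accumulate changes independently after they split. All numbers are hypothetical.

```python
def strict_clock_divergence_time(distance: float, rate_per_lineage: float) -> float:
    """Divergence time (Myr) under a constant-rate clock.

    distance: pairwise genetic distance in substitutions per site.
    rate_per_lineage: substitutions per site per Myr along one lineage.
    The distance is shared between two lineages, hence the factor of 2.
    """
    if rate_per_lineage <= 0:
        raise ValueError("rate must be positive")
    return distance / (2.0 * rate_per_lineage)


def implied_rate(distance: float, divergence_time: float) -> float:
    """Per-lineage rate implied by a distance and a known divergence time (Myr)."""
    if divergence_time <= 0:
        raise ValueError("divergence time must be positive")
    return distance / (2.0 * divergence_time)


if __name__ == "__main__":
    d = 0.30  # hypothetical distance, substitutions per site
    for rate in (0.00025, 0.0005, 0.001):  # hypothetical rates
        t = strict_clock_divergence_time(d, rate)
        print(f"rate={rate} -> divergence about {t:.0f} Myr ago")
```

Because the inferred time scales inversely with the assumed rate, a twofold error in the rate yields a twofold error in the date, which is why a single flat rate is so fragile across lineages with different mutation regimes.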

Calibration Methods for the Molecular Clock

To manage these inconsistencies, calibration of the molecular clock is a necessary tool for placing phylogenies onto a timescale. Once a maximum likelihood tree is constructed, the probability distribution of ages is fitted for each node based on a calibrated molecular clock model using Bayesian statistics and input priors. These priors include the maximum likelihood tree and any number of age constraints for the root and various nodes in the rest of the tree. Age constraints can be lower bounds (a node must be older than this many years), upper bounds (a node must be younger than this many years), or both (a node must be both older than one age and younger than another). With all of these bounds, the probability of the node's age in relation to that age constraint can be specified in different ways. A "flat" prior implies that the node is equally likely to fall anywhere within the defined age constraint. There could also be a shaped probability, for example a normal or Gaussian probability distribution, imparted to account for interpretations based on current understanding of biological systems and the evolution of species.
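The difference between a flat and a shaped prior on a node age can be made concrete with a small Python sketch. It is illustrative only: the bounds, mean, and standard deviation are hypothetical, ages are expressed in Ga, min_age denotes the "older than" bound and max_age the "younger than" bound, and the truncated normal is just one possible shaped prior, not necessarily the form used by any particular dating program.

```python
import math


def flat_age_prior(age: float, min_age: float, max_age: float) -> float:
    """Uniform density: every age between the two bounds is equally likely."""
    if min_age <= age <= max_age:
        return 1.0 / (max_age - min_age)
    return 0.0


def normal_cdf(x: float, mean: float, sd: float) -> float:
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def truncated_normal_age_prior(age: float, mean: float, sd: float,
                               min_age: float, max_age: float) -> float:
    """Normal density centered on `mean`, renormalized to lie within the bounds."""
    if not (min_age <= age <= max_age):
        return 0.0
    density = math.exp(-0.5 * ((age - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
    mass_inside = normal_cdf(max_age, mean, sd) - normal_cdf(min_age, mean, sd)
    return density / mass_inside


if __name__ == "__main__":
    # Hypothetical node constrained to be older than 2.0 Ga and younger than 3.0 Ga.
    for age in (2.1, 2.4, 2.9, 3.2):
        flat = flat_age_prior(age, 2.0, 3.0)
        shaped = truncated_normal_age_prior(age, 2.4, 0.2, 2.0, 3.0)
        print(f"age={age} Ga flat={flat:.3f} shaped={shaped:.3f}")
```

Both priors assign zero density outside the bounds; the shaped prior additionally concentrates probability near the age the analyst considers most plausible.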

There are several categories of calibration points, each with arguments for and against its use and with numerous methods of incorporating it. In order to discuss their validity, we must first articulate what event we are trying to date and define several terms. Many current studies using calibrated molecular clocks do not agree on exactly which calibrations are useful and/or valid to include. I group common calibrations here into four broad categories: astrophysical and geological, geochemical, fossil, and molecular. I will discuss arguments for and against these calibration types in analyses of microbial history, along with a few specific cases that are of particular relevance to my work and are frequently under discussion.

ASTROPHYSICAL and GEOLOGICAL CALIBRATIONS

Using reasonable age constraints that consider astrophysical bounds can provide conservative restrictions on phylogenies. The first appearance of zircons in the rock record around 4.4 Ga is seen as evidence for the existence of liquid water on the Earth's surface (Wilde et al., 2001). Since the origin and sustainability of life as we know it requires water, this is sometimes used as a constraint for the last universal common ancestor (LUCA) or the last bacterial common ancestor (LBCA) (Magnabosco et al., 2018). Other astrophysical and geological events have been proposed to have influenced evolutionary rates throughout the history of life, including changing fluxes of UV radiation (with accordingly variable rates of DNA mutation) and changing continental extent. There is a large amount of uncertainty in the influence of these processes on evolution, so constraints such as the one described above are the more conservative choice. In molecular clock calibrations for phylogenies of large metazoans, geological processes such as continental splits, river formations, and climatic conditions can be used to support reasonable arguments for biogeographical speciation events and physiological constraints.

GEOCHEMICAL CALIBRATIONS

There are multiple types of geochemical calibrations that can be considered - isotopic fractionation signatures, organic molecules, and the redox state of minerals or the environment. Physiological studies of microbial metabolism can quantify isotopic fractionations by enzymes and biochemical pathways. Corresponding fractionations can sometimes be observed in the rock record and in ancient gases. In these interpretations, it is critical to understand the influences that species variation, physiological flexibility, and responses to reservoir variance have on the resulting fractionations observed. Organic molecules and their degradation products can be used as "biomarkers" to identify microbial groups or even sometimes specific species (Brocks et al., 2009; Love et al., 2009; Gold et al., 2017). Lipid biomarkers are particularly useful in their tendency to retain identifying structural features for a wide array of microorganisms over long timespans (Brocks et al., 2005). Choosing events independent of the hypothesis being tested is important to avoid circular reasoning.

FOSSIL CALIBRATIONS

Fossilized organisms found in the rock record can be used to calibrate the molecular clock. Their use requires confident interpretation based on morphological similarity as well as solid geochronology to date their placement. For microscopic organisms, fossils become increasingly difficult to assign confidently to crown groups and their identity is often debated. The possibility that the fossil belongs to a stem lineage must also be considered. This factor is amplified with microbial species due to the frequency of morphological polyphyly and convergence, and due to the uncertainty in how likely microbial stem lineages are to produce similar morphologies under influences on convergence such as similarity in environmental conditions or ecological niche. Even without this ambiguity, the preservation of rocks decreases further back in Earth's history. A significant proportion of microbial evolution is expected to have taken place during ancient intervals of which we have little record, however uneventful that time may seem from a macroscopic perspective.

Altogether, the incorporation of fossil calibrations into molecular clocks for microbial species can be a powerful tool, but the associated risks make sound reasoning critical. A clever integration of the fossil record to calibrate molecular clocks is the use of macroscopic fossils that are associated with the growth of microscopic organisms. Considering the aforementioned uncertainties, this allows for the conservative yet important timing of major divergences. For example, the first appearance of microbialites has been used as a lower bound (minimum age) for the LBCA, with the reasoning that microbially induced stromatolite growth would not have occurred until the community ecology of bacteria was established (stromatolites from the Warrawoona Group at ~3.5 Ga; Hofmann et al., 1999).

MOLECULAR CALIBRATIONS

The principle of cross-cutting relationships can be used to understand the relative ages of geologic features. The relationships of species and their genes, as recorded in molecular data, can be used similarly to understand the relative ages of evolutionary events. As discussed, horizontal gene transfer (HGT) occurs frequently in the history of life, often bringing key evolutionary innovations to biogeochemically important species that subsequently flourished. When these transfer events are well-resolved in phylogenetic analyses and the directionality from donor to recipient is clear, they can be readily integrated as relative constraints on the molecular clock (Davin et al., 2018; Magnabosco et al., 2018; Wolfe and Fournier, 2018).
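One way to picture such a relative constraint is as a filter on sampled node ages, sketched below in Python. The rule encoded here is an assumption made for illustration: because the donor lineage must already exist when the gene is transferred, and the transfer must precede the diversification of the recipient clade, the node at the base of the donor lineage's stem is required to be at least as old as the recipient clade's crown node. The node names, ages, and function names are hypothetical and do not describe the interface of any particular dating program.

```python
def satisfies_hgt_constraint(node_ages: dict, donor_stem: str, recipient_crown: str) -> bool:
    """True if the donor stem node is at least as old as the recipient crown node.

    node_ages maps node labels to ages (larger numbers are older).
    """
    return node_ages[donor_stem] >= node_ages[recipient_crown]


def filter_by_hgt_constraint(samples: list, donor_stem: str, recipient_crown: str) -> list:
    """Keep only the sampled sets of node ages that respect the transfer order."""
    return [ages for ages in samples
            if satisfies_hgt_constraint(ages, donor_stem, recipient_crown)]


if __name__ == "__main__":
    # Hypothetical posterior samples of node ages in Ma.
    samples = [
        {"donor_stem": 1800.0, "recipient_crown": 900.0},
        {"donor_stem": 700.0, "recipient_crown": 950.0},   # violates the constraint
        {"donor_stem": 1500.0, "recipient_crown": 1100.0},
    ]
    kept = filter_by_hgt_constraint(samples, "donor_stem", "recipient_crown")
    print(f"{len(kept)} of {len(samples)} samples are consistent with the transfer")
```

The constraint is relative: it fixes no absolute date by itself, but it links the ages of two otherwise independent parts of the tree, so that a calibration on one side also informs the other.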

Furthermore, phylogenetic analyses can reveal anomalies in the evolutionary history of proteins in comparison to the microbial species that host them. Gene trees that differ from their host species' trees are referred to as having reticulate histories - originating from the Latin word "reticulum", a fine network or netlike structure. This terminology illustrates how the frequency of horizontal gene transfer (HGT) complicates the traditional vertical inheritance of genetic information for microbial life. Phylogenetic resolution and simple evolutionary histories can support the directionality of transfer events - genes suddenly appear in "recipient" clades, grouping most closely with the "donor" clades that they were transferred from. As calibrations for the imperfect molecular clock, well-resolved HGTs can help narrow the timing of evolutionary events in species' evolution. Here, I use the shared gene transfer history of carbon fixation proteins to constrain divergence times in marine cyanobacteria. Because these genes are influential on major metabolic processes, their appearance may characterize the time when marine cyanobacteria began to have a large ecological and biogeochemical impact. A better understanding of their timing in relation to the Earth's history could provide insight into the environmental conditions supporting this group's proliferation. Another method of using molecular data for calibrations involves the tracking of gene duplications, coined "cross-calibration" (Shih and Matzke, 2013).

Concatenating Gene Histories and Integrating Multiple Calibration Methods

In using genes to evaluate the history of organisms and their metabolic pathways, phylogenetic resolution is derived from sites that are neither so conserved that they carry too little information distinguishing species nor so divergent that too little similarity remains to resolve phylogenetic relationships. This is also important when considering molecular clock rates for timing evolutionary events in a phylogeny. Using uncalibrated molecular clock rates on conserved regions can consequently underestimate divergence times, while analyses based on fast-evolving regions can overestimate them. Few genes fall within this range of conservation and divergence and are also long enough to provide well-resolved relationships between the species sampled. Concatenating genes with shared histories makes more informative sites available and thereby increases phylogenetic resolution (as expressed in bootstrap values within a maximum likelihood reconstruction).
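The concatenation idea can be illustrated with a minimal sketch. It assumes each gene has already been aligned and that every alignment covers the same set of taxa; the function name, the toy sequences, and the taxon labels are hypothetical and are not taken from the thesis pipeline. The sketch also records the column range of each gene, which is useful when a separate substitution model is applied to each partition.

```python
from typing import Dict, List, Tuple

Alignment = Dict[str, str]  # taxon label -> aligned sequence


def concatenate_alignments(
    genes: Dict[str, Alignment],
) -> Tuple[Alignment, List[Tuple[str, int, int]]]:
    """Join per-gene alignments end to end, taxon by taxon.

    Returns the concatenated alignment and a list of
    (gene, first_column, last_column) partitions, 1-indexed and inclusive.
    """
    names = list(genes)
    if not names:
        raise ValueError("no alignments supplied")
    taxa = set(genes[names[0]])
    supermatrix = {taxon: "" for taxon in sorted(taxa)}
    partitions = []
    start = 1
    for name in names:
        alignment = genes[name]
        if set(alignment) != taxa:
            raise ValueError(f"{name}: taxon set differs from the first gene")
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"{name}: sequences are not all the same length")
        width = lengths.pop()
        for taxon in supermatrix:
            supermatrix[taxon] += alignment[taxon]
        partitions.append((name, start, start + width - 1))
        start += width
    return supermatrix, partitions


# Toy input: two short, made-up alignments for two placeholder taxa.
toy_genes = {
    "CbbL": {"taxonA": "MK-L", "taxonB": "MKAL"},
    "CbbS": {"taxonA": "GT", "taxonB": "G-"},
}
supermatrix, partitions = concatenate_alignments(toy_genes)
print(supermatrix)
print(partitions)
```

Requiring identical taxon sets keeps every row of the concatenated matrix fully populated; an alternative design would pad missing genes with gap characters instead.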

With both geochemical and fossil calibration categories, it is particularly important to consider the potential contribution of stem lineages to fractionation signatures, molecular compounds, or morphological similarity, so conclusions that connect these data to crown groups should have robust supporting evidence. Though conservative bounds may lack precision, calibrating on less certain events may push the phylogeny toward false divergence times if the underlying assumption is wrong. For this reason, combining multiple conservative calibrations and emphasizing molecular calibration methods (HGT events and cross-calibration) could provide a better strategy for dating divergence times than single calibration strategies. There are many degrees and sources of uncertainty in using statistical methods of phylogenetic analysis. Even in the best scenario, analyses rely on the assumption that evolution has taken the most likely path. That most likely path, however probable, may not be true to what took place - incorporating Bayesian statistics, geobiological data, and molecular clock calibrations to better inform the most likely path reduces the chance that it departs from what actually occurred.

METHODS

Bioinformatics analyses were performed using an established pipeline of the Fournier Lab and its extensive dedicated computing resources at the MGHPCC cluster facility. The numerous studies that have catalogued the protein constituents of α- and β-carboxysomes (Cai et al., 2015; Cai et al., 2009; Price et al., 1993; Rae et al., 2012; So et al., 2004; Sutter et al., 2015) were compiled into a comprehensive list of genes known to be involved in carboxysome structure and function, including CsoS1A-E, CsoS2, CsoS4A, and CsoS4B (shell proteins), CbbL and CbbS (the large and small subunits of RuBisCO, respectively), and CsoS3 (carbonic anhydrase; also known as CsoSCA and closely associated with the shell). Using this resource as a set of queries for NCBI BLAST, databases of related protein sequences across organisms known to have the α-carboxysome were generated. Then, sequences of taxa that contained every one of the following genes were compiled: CbbL, CbbS, CsoS2 (main shell protein), CsoS3, and CsoS4A (vertex protein).
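The final filtering step, which keeps only taxa that carry all five genes, amounts to a set intersection over the per-gene hit lists. The following is a minimal sketch under that assumption; the function name, the dictionary layout, and the taxon labels are hypothetical, and the real BLAST output would first need to be parsed into such lists.

```python
from typing import Dict, Iterable, Set, Tuple

# The five genes named in the text as required for inclusion.
REQUIRED_GENES: Tuple[str, ...] = ("CbbL", "CbbS", "CsoS2", "CsoS3", "CsoS4A")


def taxa_with_all_genes(
    hits: Dict[str, Iterable[str]],
    required: Tuple[str, ...] = REQUIRED_GENES,
) -> Set[str]:
    """Return the taxa that appear in the hit list of every required gene."""
    missing = [gene for gene in required if gene not in hits]
    if missing:
        raise KeyError(f"no hit list supplied for: {', '.join(missing)}")
    shared = set(hits[required[0]])
    for gene in required[1:]:
        shared &= set(hits[gene])
    return shared


# Toy hit lists with placeholder taxon labels (not real search results).
toy_hits = {
    "CbbL": ["taxonA", "taxonB", "taxonC"],
    "CbbS": ["taxonA", "taxonB", "taxonC"],
    "CsoS2": ["taxonA", "taxonB"],
    "CsoS3": ["taxonA", "taxonB", "taxonD"],
    "CsoS4A": ["taxonA", "taxonC", "taxonB"],
}
complete_taxa = taxa_with_all_genes(toy_hits)
print(sorted(complete_taxa))
```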

Sequences were aligned using MUSCLE (Multiple Sequence Comparison by Log-Expectation) (Edgar, 2004), creating a matrix of specific amino acid states with shared histories that can be used to discern protein evolutionary histories via phylogenetic inference. Using ProtTest (Darriba et al., 2011), the best-fitting evolutionary models for the individual proteins were determined. Tested models incorporated several amino acid substitution matrices (LG, WAG, Dayhoff, JTT, and BLOSUM62), heterogeneity in site rates (gamma distribution), invariance at specific sites, and empirical amino acid frequencies, as described previously.

Phylogenies for carboxysomal protein families were built using maximum-likelihood techniques, specifically, RAxML-HPC (Randomized Accelerated Maximum Likelihood for High Performance Computing) (Stamatakis, 2014). For unresolved polytomies involving strains of the SynPro group, representative taxa were chosen. Extremely long branches that exhibited the propensity to distort other phylogenetic relationships via long branch attraction were removed. Outgroups for the maximum likelihood phylogenies of individual and concatenated proteins were chosen by comparison of the gene trees to published species trees of purple sulfur bacteria (of the order Chromatiales) and Proteobacteria (Eddie et al., 2016; Imhoff et al., 1998; Nupur et al., 2017; Yang et al., 2017) and by maximizing the grouping of purple sulfur bacteria.

Divergence time estimates were calculated using PhyloBayes 3.3 with the C20 set of site-specific substitution models and the uncorrelated gamma distribution (ugam) relaxed molecular clock rate model (Drummond et al., 2006; Lartillot et al., 2009; Lartillot et al., 2013). A broad root prior was given to account for the uncertainty in the ancestral group of the purple sulfur bacteria. Secondary calibrations from published chronograms of cyanobacteria by Magnabosco et al. (2018), as well as from unpublished results in work on Proteobacteria by Wolfe and Fournier, were applied as uniform priors in concert across various models. Models were also run testing different root priors to ensure that the root prior was broad enough and would not artificially push the ages. All models run using PhyloBayes 3.3 with the various root priors and secondary calibration inputs as described are summarized (Appendix, Table 3). The congruent node of the cyanobacteria from Magnabosco et al. (2018) was the ancestral node from which the taxon Candidatus Synechococcus spongiarum 142 diverged, with an age constraint of between 909.761 and 294.145 mya ("SynPro" calibration). From the unpublished work by Wolfe and Fournier, two internal nodes within Chromatiales were applied with the following age constraints: the ancestral node of the group that includes Marichromatium and Allochromatium lineages, between 963.599 and 137.86 mya ("internal PSB 1" calibration), and the ancestral node of the group that includes Thioalkalivibrio and Ectothiorhodospira lineages, between 1507.52 and 220.946 mya ("internal PSB 2" calibration). Furthermore, the molecular clock for each model was run under the prior to check that adding sequence information informs the posterior and changes the age estimate - these include all of the even-numbered model runs, and each is paired with the preceding odd-numbered run. The distributions of estimated divergence times of the SynPro group were subsequently plotted using the APE (Analyses of Phylogenetics and Evolution in R language) and HDInterval (Highest (Posterior) Density Intervals) packages in R (Paradis et al., 2004; Meredith and Kruschke, 2016; R Core Team, 2013).
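To make the interval summary concrete, a highest density interval can be computed from sampled node ages as the narrowest window that contains a chosen share of the samples. The sketch below is a plain Python illustration of that idea, not the HDInterval implementation used in this work, and its handling of ties and rounding may differ from the R package; the function name and the toy ages are hypothetical.

```python
import math
from typing import Sequence, Tuple


def highest_density_interval(
    samples: Sequence[float], mass: float = 0.95
) -> Tuple[float, float]:
    """Return the narrowest interval holding at least `mass` of the samples."""
    if not 0 < mass < 1:
        raise ValueError("mass must lie strictly between 0 and 1")
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        raise ValueError("no samples supplied")
    # Number of consecutive sorted samples each candidate window must cover.
    k = max(1, math.ceil(mass * n))
    best_low, best_high = ordered[0], ordered[k - 1]
    for i in range(1, n - k + 1):
        low, high = ordered[i], ordered[i + k - 1]
        if high - low < best_high - best_low:
            best_low, best_high = low, high
    return best_low, best_high


# Toy node ages in mya (made-up values, not results from this study).
toy_ages = [700, 620, 610, 600, 590, 400, 900, 615]
print(highest_density_interval(toy_ages, mass=0.5))
```

Unlike an equal-tailed interval, this window shifts toward the densest part of a skewed distribution, which is why it suits the asymmetric age distributions that bounded calibrations tend to produce.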

RESULTS

BLAST searches yielded an abundance of taxa containing carbon fixation proteins homologous to those of the SynPro groups, both in the marine α-cyanobacteria themselves and in closely related organisms such as Cyanobium gracile, Candidatus Synechococcus spongiarum 142, and the cyanobacterial ancestor to the plastid of an amoeba, Paulinella chromatophora. Many species of purple sulfur bacteria, belonging to the order Chromatiales in Gammaproteobacteria, hosted the genes and were included, along with various lineages of Gammaproteobacteria that do not belong to the order Chromatiales. A few species identified as Alphaproteobacteria, Betaproteobacteria, Actinobacteria, and Acidithiobacillia were also present. Taxa that contained every one of the genes CbbL, CbbS, CsoS2, CsoSCA (CsoS3), and CsoS4A are reported (Appendix, Table 1). The best-fitting evolutionary models for these individual proteins are reported (Appendix, Table 2). The RuBisCO subunits tended to follow the LG model with invariant sites and gamma distribution of site-rate heterogeneity, while the shell proteins exhibited more variation. CsoS4A behaved more like the RuBisCO subunits. CsoS2 (main structural protein) and CsoSCA (carbonic anhydrase) were best predicted by the WAG model with invariant sites, gamma distribution of site-rate heterogeneity, and empirical amino acid frequencies estimated from the alignment.

Maximum likelihood trees of individual and concatenated protein alignments of the five selected shared genes were constructed via RAxML (Appendix, Figure 1 (concatenated protein alignment) and Figures 5 - 9 (individual proteins CbbL, CbbS, CsoS2, CsoSCA, and CsoS4A)). The HGT event of CsoS2 from purple sulfur bacteria to the α-cyanobacteria was confirmed and well-resolved. This history was also well-resolved for the individual RuBisCO subunits, carbonic anhydrase, and the vertex protein CsoS4A. The shell-associated proteins CsoS1D, CsoS1, and CsoS4B produced trees with topologies that could be similar and suggest the same shared history, but the resolution of the transfer was low, as represented by lower bootstrap values, and so these proteins were not included. Extremely long branches (Bradyrhizobium sp. ORS 278) that exhibited the propensity to distort true relationships via long branch attraction were removed. Higher bootstrap values were achieved in the phylogeny that used the concatenated alignment of the five shared genes (Appendix, Figure 2). Taxa are colored as follows in the phylogenetic trees reported (Appendix, Figures 1, 3, 4, and 5 - 9): green - marine α-cyanobacteria; purple - Chromatiales (purple sulfur bacteria); pink - Gammaproteobacteria that do not belong to the order Chromatiales; red - Alphaproteobacteria; blue - Betaproteobacteria; brown - Acidithiobacillia; orange - Actinobacteria.

Comparisons of the incorporation of root priors and secondary calibrations into divergence times calculated using PhyloBayes 3.3 with the C20 set of site-specific substitution models and the uncorrelated gamma distribution (ugam) relaxed molecular clock rate model are reported as a summary of the SynPro ancestor age distributions and means (Appendix, Tables 3 and 4). The resulting probability distribution of the root effectively covers the broad time interval applied for the broad root prior. The plotted distributions of estimated divergence times of the SynPro group aid in the visualization of the influence of the prior on various secondary calibration inclusions as well as the influence of various root priors (Appendix, Figure 2). Chronograms for model run 7, which includes a broad root prior and secondary calibrations both for internal nodes of the purple sulfur bacteria and for the ancestral node of the SynPro clade, are reported in detail (Appendix, Figures 3 and 4). Shortened and full names are detailed (Appendix, Table 5).

DISCUSSION

The resultant phylogenies of the individual proteins and concatenated protein alignment confirm a clear HGT event of the α-carboxysome shell proteins and their associated enzymes for carbon fixation from the purple sulfur bacteria into the α-cyanobacteria, as previously suggested. The increased resolution, represented by higher bootstrap values, on the phylogeny of the concatenated protein alignment supports the concatenation as an improvement on the resolution of this transfer for subsequent molecular clock analyses. This is important because amino acid matrix selection in evolutionary model prediction can be complicated by conserved or short sequences as well as by the frequencies of the amino acids (Darriba et al., 2011; Keane et al., 2006), a case particularly true of bacterial microcompartments (Kerfeld et al., 2018).

The topologies of the SynPro and the purple sulfur bacteria taxa within these individual and concatenated phylogenies are scrambled yet relatively consistent. The HGT of the carbon fixation proteins and the carboxysome into SynPro is still clearly outlined, though the relationships of the taxa within these groups are sometimes inconsistent with reported species trees, both for the Proteobacteria and the SynPro lineages (Eddie et al., 2016; Imhoff et al., 1998; Nupur et al., 2017; Yang et al., 2017). This suggests that the innovation of the compartmentalization of the carbon fixation pathway via the carboxysome was important for two distinct metabolisms - anoxygenic and oxygenic photosynthesis - that have transformed our planet. It may further be that biased gene transfer - the preferential occurrence of horizontal transfer between more closely related lineages - is diluting the species tree of the Proteobacteria and SynPro species such that the phylogenetic pattern ultimately observed reflects both vertical inheritance and HGT, an effect observed in other molecular data as well (Andam et al., 2010; Andam et al., 2011).

As the resulting probability distribution of the root effectively covers the broad time interval applied for the root prior, the broad root prior appropriately accounts for the uncertainty in the ancestral group of the purple sulfur bacteria. Because we observe that the medium-age root prior (model runs 13 and 14) included in the root prior comparison model runs (model runs 9-14) yields ages closer to those of the oldest-age turtle prior (model runs 9 and 10), which pushes back divergence ages against a hard bound, we can confirm that imposing an older root prior stretches divergence ages arbitrarily old. The broad and young priors report similar means, suggesting that the older age ranges of the broad root prior are not compatible with the tree and calibrations, and so are not abundant in the posterior. Together, this is evidence that the older bounds implied by internal node calibrations of the purple sulfur bacteria topology are important for inferring the ages, and that our prior is broad enough that it is not forcing the result. The ages are, rather, being actively influenced by the secondary calibrations. Additionally, comparison with the even-numbered runs, in which the molecular clock was run under the prior, shows that adding sequence information informs the posterior and changes the age estimates. The posterior age constraints resulting from the additions of the SynPro and internal PSB calibrations show that they help to further narrow the age constraint. This suggests that they are neither in discord with the branch lengths nor wholly driven by any one calibration. The maximum ages on the ancestral nodes to the marine planktonic cyanobacteria experience the most change in response to the additions of calibrations.

In the most informative model (model run 7, also known as "TurtleRun7"), the estimated divergence times of the SynPro clade fall between 895.2 mya and 379.6 mya in the 95% confidence interval, with a mean of 619.6 mya. This age distribution is narrower than previously reported and supports hypotheses that the marine planktonic cyanobacteria evolved prior to or around the dynamic Neoproterozoic. The mean divergence times of the Synechococcus and the Prochlorococcus are younger, 258 mya and 403.2 mya, respectively, suggesting that their subsequent divergence was much later and that the streamlining of Prochlorococcus genomes has been rapid. Interestingly, the ancestral node to all of the α-cyanobacteria, represented by the divergence of Candidatus Synechococcus spongiarum 142 at a mean age of 735.9 mya, contains the bulk of its predicted divergence times prior to the global glaciation events. This supports the hypothesis that the proliferation of marine planktonic cyanobacteria had a significant effect on the carbon cycle. The split of the ecotypes of Prochlorococcus appears to have occurred more recently. With this sampling, High Light ecotypes split between 529.9 and 198.9 mya, age ranges that occur after the global glaciation events.

The environment's limitations on a group of organisms and how life adapts vis-à-vis those stresses inform our understanding of evolution. Within that model, it is necessary to remember that the adaptation of organisms is captured by the evolution of life in its trend toward optimization in the face of competition - a process that is impeded or accelerated by ecosystem interactions, depending on the network of connections and the timescale of changes. The SynPro group exemplifies this optimization trend, and the acquisition of the carboxysome via HGT from purple sulfur bacteria could have supported it. It is possible that ecological interactions enabled the acquisition of the carboxysome into the α-cyanobacteria from purple sulfur bacteria, propelling expansion of the SynPro group into marine environments and accelerating ocean oxygenation. These interactions could have occurred in freshwater or brackish environments, or the open ocean itself - the scrambled topology caused by frequent HGT events of these genes complicates the interpretation of this evidence. Corroborating previous work (Magnabosco et al., 2018; Sánchez-Baracaldo, 2015; Shih et al., 2017), these analyses support the divergence times of the ancestor to the SynPro group during the dynamic Neoproterozoic. The compartmentalization of carbon fixation via the carboxysome likely aided the marine planktonic cyanobacteria, and their co-associates, in expanding into higher energy regimes in an oligotrophic environment.

CONCLUSION

The narrowing of divergence times enabled by the concatenation of carboxysomal proteins and other carbon fixation proteins advances our understanding of how the SynPro clade and the evolution of planktonic marine cyanobacteria relate to Earth history. This work shows that the bulk probability density for the ages of the SynPro clade using the most descriptive model lies right around the range of the Neoproterozoic glaciations. Future work may clarify whether they appeared before, between, or after the major glaciations by quantifying the relative probabilities of divergences with respect to the timing of various geological events. It remains unclear whether the carboxysome evolved as a necessary response to low atmospheric CO2 concentrations or as an optimization process to gain better access to low nutrient concentrations before CO2 concentrations were low enough to require its existence. In either scenario, the carboxysome acquisition was perhaps spurred by ecological interactions with other microbes. Future physiological experiments and reconstruction of ancestral proteins may provide insight into the exact nature of their requirements for various atmospheric conditions. This work would require factoring in complexities of physiological flexibility.
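One way such relative probabilities could be quantified is by counting the share of sampled divergence ages that fall before, within, or after a dated interval. The sketch below illustrates only the bookkeeping; the function name is hypothetical, the toy ages are made up, and the interval bounds are placeholders that would need to be replaced by published ages for the geological event of interest.

```python
from typing import Dict, Sequence


def fractions_relative_to_interval(
    ages_mya: Sequence[float], onset_mya: float, end_mya: float
) -> Dict[str, float]:
    """Share of sampled ages older than, within, or younger than an interval.

    Ages are in millions of years ago, so the onset is the larger number.
    """
    if onset_mya <= end_mya:
        raise ValueError("onset_mya must be older (larger) than end_mya")
    if not ages_mya:
        raise ValueError("no samples supplied")
    n = len(ages_mya)
    before = sum(1 for age in ages_mya if age > onset_mya)
    after = sum(1 for age in ages_mya if age < end_mya)
    during = n - before - after
    return {"before": before / n, "during": during / n, "after": after / n}


# Toy ages and placeholder bounds, neither taken from this study nor from
# the geological literature.
toy_ages = [900, 750, 650, 620, 500]
fractions = fractions_relative_to_interval(toy_ages, onset_mya=700, end_mya=600)
print(fractions)
```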

The spatial and temporal separation of metabolic processes is critical for a functional physiology. The degree to which the sustainability of the organism is tied to others scales up the competitive system of "self" to a community and an ecosystem, proportionately. This regulated, specialized functionalization can be achieved via compartmentalization within a single organism or increased ecosystem network interactions between organisms. As such, evolutionary innovations in compartmentalization tend to be retained in speciation events and therefore stand out as key divergence events in the history of life on Earth, from the origin of eukaryotes to the adaptations induced by eukaryotic organelles and bacterial microcompartments. Carboxysomes serve as an important and intuitive area of research for using these methods of molecular clock and calibrations. An understanding of their evolution may clarify outstanding questions regarding the relationship of Earth's evolving biosphere and climate.

ACKNOWLEDGEMENTS

I would like to thank the many people who helped make my thesis possible and my time at MIT fulfilling. Greg Fournier for his unwavering mentorship, inspirational drive to discover, and an enriching lab environment. Jo, Danielle, Sarah, Thiberio, and the rest of the Fournier Lab for their camaraderie, technical knowledge, and enthusiasm for a broad spectrum of geobiology. Rogier Braakman for the helpful discussions and productive collaboration. My committee members for their attentive feedback and guidance. My fellow students and the rest of the postdocs, administration, and faculty of the EAPS department, who together cultivate a place to flourish in science and career. Finally, my family for their continual support and encouragement.

REFERENCES

Abdul-Rahman, F., Petit, E., Blanchard, J.L. (2013) The Distribution of Polyhedral Bacterial Microcompartments Suggests Frequent Horizontal Transfer and Operon Reassembly. Journal of Phylogenetics & Evolutionary Biology. 1(4): 1000118.

Abramov, O., Mojzsis, S.J. (2009). Microbial habitability of the Hadean Earth during the late heavy bombardment. Nature. 459: 419 - 422.

Anbar, A.D., Knoll, A.H. (2002). Proterozoic Ocean Chemistry and Evolution: A Bioinorganic Bridge? Science. 297(5584): 1137 - 1142.

Andam, C.P., Gogarten, J.P. (2011). Biased gene transfer in microbial evolution. Nature Reviews. 12: 543 - 555.

Andam, C.P., Williams, D., Gogarten, J.P. (2010) Biased gene transfer mimics patterns created through shared ancestry. PNAS. 107(23): 10679 - 10684.

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic Local Alignment Search Tool. Journal of Molecular Biology. 215(3): 403 - 410.

Badger, M.R., Hanson, D., Price, G.D. (2002). Evolution and diversity of CO2 concentrating mechanisms in cyanobacteria. Funct. Plant Biol. 29: 161 - 173.

Badger, M.R., Price, G.D. (1994) The Role of Carbonic Anhydrase in Photosynthesis. Annu. Rev. Plant Physiol. Plant Mol. Biol. 45: 369 - 392.

Badger, M.R., Price, G.D. (2006). The environmental plasticity and ecological genomics of the cyanobacterial CO2 concentrating mechanism. Journal of Experimental Botany. 57(2): 249 - 265.

Biller, S.J., Berube, P.M., Lindell, D., Chisholm, S.W. (2015) Prochlorococcus: the structure and function of collective diversity. Nature Reviews. 13: 13 - 27.

Blank, C.E., Sánchez-Baracaldo, P. (2010). Timing of morphological and ecological innovations in the cyanobacteria - a key to understanding the rise in atmospheric oxygen. Geobiology. 8: 1 - 23.

Braakman, R., Follows, M.J., Chisholm, S.W. (2017). Metabolic evolution and the self-organization of ecosystems. PNAS. E3091-E3100.

Brocks, J.J., Schaeffer, P. (2008). Okenane, a biomarker for purple sulfur bacteria (Chromatiaceae), and other new carotenoid derivatives from the 1640 Ma Barney Creek Formation. Geochimica et Cosmochimica Acta. 72: 1396 - 1414.

Brocks, J.J., Pearson, A. (2005). Building the Biomarker Tree of Life. Reviews in Mineralogy & Geochemistry. 59: 233 - 258.

Burnap, R.L., Hagemann, M., Kaplan, A. (2015) Regulation of CO2 Concentrating Mechanism in Cyanobacteria. Life. 5: 348-371.

Cai, F., Dou, Z., Bernstein, S.L., Leverenz, R., Williams, E.B., Heinhorst, S., Shively, J., Cannon, G.C., Kerfeld, C.A. (2015) Advances in Understanding Carboxysome Assembly in Prochlorococcus and Synechococcus Implicate CsoS2 as a Critical Component. Life. 5: 1141-1171.

Cai, F., Menon, B.B., Cannon, G.C., Curry, K.J., Shively, J.M., Heinhorst, S. (2009) The Pentameric Vertex Proteins Are Necessary for the Icosahedral Carboxysome Shell to Function as a CO2 Leakage Barrier. PLoS ONE. 4(10): e7521.

Canfield, D.E. (1998). A new model for Proterozoic ocean chemistry. Nature. 396: 450 - 453.

Coleman, M.L., Sullivan, M.B., Martiny, A.C., Steglich, C., Barry, K., DeLong, E.F., Chisholm, S.W. (2006) Genomic Islands and the Ecology and Evolution of Prochlorococcus. Science. 311: 1768-1770.

Darriba, D., Taboada, G.L., Doallo, R., Posada, D. (2011). ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics Applications Note. 27(8): 1164 - 1165.

Davin, A.A., Tannier, E., Williams, T.A., Boussau, B., Daubin, V., Szöllősi, G.J. (2018). Gene transfers can date the tree of life. Nature Ecology & Evolution. 2: 904 - 909.

Dayhoff, M.O., Schwartz, R., Orcutt, B. (1978). A model of evolutionary change in proteins. In: Dayhoff MO, editor. Atlas of protein sequence and structure. National Biomedical Research Foundation. 5(3): 345 - 352.

Drummond, A.J., Ho, S.Y.W., Phillips, M.J., Rambaut, A. (2006). Relaxed phylogenetics and dating with confidence. PLoS Biology. 4(5): 699 - 710.

Edgar, R. C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research. 32: 1792-1797.

Eddie, B.J., Wang, Z., Malanoski, A.P., Hall, R.J., Oh, S.D., Heiner, C., Lin, B., Strycharz-Glaven, S.M. (2016). 'Candidatus Tenderia electrophaga', an uncultivated electroautotroph from a biocathode enrichment. International Journal of Systematic and Evolutionary Microbiology. 66: 2178 - 2185.

Flombaum, P., Gallegos, J.L., Gordillo, R.A., Rincón, J., Zabala, L.L., Jiao, N., Karl, D.M., Li, W.K.W., Lomas, M.W., Veneziano, D., Vera, C.S., Vrugt, J.A., Martiny, A.C. (2013). Present and future global distributions of the marine Cyanobacteria Prochlorococcus and Synechococcus. PNAS. 110(24): 9824 - 9829.

Gold, D.A., Caron, A., Fournier, G.P., Summons, R.E. (2017). Paleoproterozoic sterol biosynthesis and the rise of oxygen. Nature. 543: 420 - 423.

Henikoff, S., Henikoff, J.G. (1992). Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. 89: 10915 - 10919.

Herrmann, A., Gehringer, M.M. (2017). Could cyanobacteria have made the salinity transition during the later Archean? bioRxiv pre-print.

Hintzpeter, F., Gerland, U., Tostevin, F. (2017). Optimal Compartmentalization Strategies for Metabolic Microcompartments. Biophysical Journal. 112: 767 - 779.

Hofmann, H.J., Grey, K., Hickman, A.H., Thorpe, R.I. (1999). Origin of 3.45 Ga coniform stromatolites in Warrawoona Group, Western Australia. GSA Bulletin. 111(8): 1256 - 1262.

Imhoff, J.F., Süling, J., Petri, R. (1998). Phylogenetic relationships among the Chromatiaceae, their taxonomic reclassification and description of the new genera Allochromatium, Halochromatium, Isochromatium, Marichromatium, Thiococcus, Thiohalocapsa and Thermochromatium. International Journal of Systematic Bacteriology. 48: 1129 - 1143.

Jones, D.T., Taylor, W.R., Thornton, J.M. (1992). The rapid generation of mutation data matrices from protein sequences. Bioinformatics. 8(3): 275 - 282.

Johnston, D.T., Wolfe-Simon, F., Pearson, A., Knoll, A.H. (2009). Anoxygenic photosynthesis modulated Proterozoic oxygen and sustained Earth's middle age. PNAS. 106(40): 16925 - 16929.

Johnston, D.T., Poulton, S.W., Dehler, C., Porter, S., Husson, J., Canfield, D.E., Knoll, A.H. (2010). An emerging picture of Neoproterozoic ocean chemistry: Insights from the Chuar Group, Grand Canyon, USA. Earth and Planetary Science Letters. 290: 64 - 73.

Kasting, J.F. (1997). Theoretical constraints on oxygen and carbon dioxide concentrations in the Precambrian atmosphere. Precambrian Research. 34(3-4): 205 - 229.

Keane, T.M., Creevey, C.J., Pentony, M.M., Naughton, T.J., McInerney, J.O. (2006). Assessment of methods for amino acid matrix selection and their use on empirical data shows that ad hoc assumptions for choice of matrix are not justified. BMC Evolutionary Biology. 6: 29.

Kerfeld, C.A., Aussignargues, C., Zarzycki, J., Cai, F., Sutter, M. (2018). Bacterial microcompartments. Nature Reviews in Microbiology. doi: 10.1038.

Kupriyanova, E.V., Sinetova, M.A., Cho, S.M., Park, Y.-I., Los, D.A., Pronina, N.A. (2013). CO2-concentrating mechanism in cyanobacterial photosynthesis: organization, physiological role, and evolutionary origin. Photosynthesis Res. 117: 133-146.

Larget, B., Simon, D.L. (1999). Markov Chain Monte Carlo Algorithms for the Bayesian Analysis of Phylogenetic Trees. Mol. Biol. Evol. 16(6): 750 - 759.

Lartillot, N., Lepage, T., Blanquart, S. (2009). PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics Application Note. 25(17): 2286 - 2288.

Lartillot, N., Rodrigue, N., Stubbs, D., Richer, J. (2013). PhyloBayes MPI: phylogenetic reconstruction with infinite mixtures of profiles in a parallel environment. Syst. Biol. 62: 611-615.

Le, S.Q., Gascuel, O. (2008). An Improved General Amino Acid Replacement Matrix. Mol. Biol. Evol. 25(7): 1307 - 1320.

Lenton, T.M., Daines, S.J. (2017). Biogeochemical Transformations in the History of the Ocean. Annu. Rev. Mar. Sci. 9: 31 - 58.

Logan, G.A., Hayes, J.M., Hieshima, G.B., Summons, R.E. (1995). Terminal Proterozoic reorganization of biogeochemical cycles. Nature. 376: 53 - 56.

Love, G.D., Grosjean, E., Stalvies, C., Fike, D.A., Grotzinger, J.P., Bradley, A.S., Kelly, A.E., Bhatia, M., Meredith, W., Snape, C.E., Bowring, S.A., Condon, D.J., Summons, R.E. (2009). Fossil steroids record the appearance of Demospongiae during the Cryogenian period. Nature. 457: 718 - 721.

Lyons, T.W., Reinhard, C.T., Planavsky, N.J. (2014). The rise of oxygen in Earth's early ocean and atmosphere. Nature. 506: 307-315.

Magnabosco, C., Moore, K.R., Wolfe, J.M., Fournier, G.P. (2018). Dating phototrophic microbial lineages with reticulate gene histories. Geobiology. 16: 179 - 189.

Marin, B., Nowack, E.C.M., Glöckner, G., Melkonian, M. (2007). The ancestor of the Paulinella chromatophore obtained a carboxysomal operon by horizontal gene transfer from a Nitrococcus-like gamma-proteobacterium. BMC Evolutionary Biology. 7: 85.

Macdonald, F.A., Schmitz, M.D., Crowley, J.L., Roots, C.F., Jones, D.S., Maloof, A.C., Strauss, J.V., Cohen, P.A., Johnston, D.T., Schrag, D.P. (2010). Calibrating the Cryogenian. Science. 327: 1241 - 1243.

Meredith, M., and Kruschke, J. (2016). HDInterval: Highest (Posterior) Density Intervals. R package version 0.1.3. Available online at: https://CRAN.R-project.org/package=HDInterval.

Merz, M.U.E. (1992). The biology of carbonate precipitation by cyanobacteria. Facies. 26(1): 81 - 101.

Nupur, N., Saini, M.K., Singh, P.K., Korpole, S., Tanuku, N.R.S., Takaichi, S., Pinnaka, A.K. (2017). Imhoffiella gen. nov., a marine phototrophic member of the family Chromatiaceae including the description of Imhoffiella purpurea sp. nov. and the reclassification of Thiorhodococcus bheemlicus Anil Kumar et al. 2007 as Imhoffiella bheemlica comb. nov. Int. J. Syst. Evol. Microbiol. 67: 1949 - 1956.

Paradis, E., Claude, J., Strimmer, K. (2004). APE: analyses of phylogenetics and evolution in R language. Bioinformatics. 20(2): 289-290. Available online at: https://cran.r-project.org/web/packages/ape/index.html.

Planavsky, N.J., Reinhard, C.T., Wang, X., Thomson, D., McGoldrick, P., Rainbird, R.H., Johnson, T., Fischer, W.W., Lyons, T.W. (2014). Low Mid-Proterozoic atmospheric oxygen levels and the delayed rise of animals. Science. 346(6209): 635 - 638.

Price, G.D., Badger, M.R., Woodger, F.J., Long, B.M. (2008). Advances in understanding the cyanobacterial CO2-concentrating-mechanisms (CCM): functional components, Ci transporters, diversity, genetic regulation and prospects for engineering into plants. Journal of Experimental Botany. 59(7): 1441 - 1461.

Price, G.D., Howitt, S.M., Harrison, K., Badger, M.R. (1993). Analysis of a Genomic DNA Region from the Cyanobacterium Synechococcus sp. Strain PCC7942 Involved in Carboxysome Assembly and Function. Journal of Bacteriology. 175: 2871-2879.

R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Rae, B.D., Long, B.M., Badger, M.R., Dean Price, G. (2012). Structural Determinants of the Outer Shell of β-Carboxysomes in Synechococcus elongatus PCC 7942: Roles for CcmK2, K3-K4, CcmO, and CcmL. PLoS ONE. 7(8): e43871.

Rae, B.D., Long, B.M., Badger, M.R., Dean Price, G. (2013) Functions, Compositions, and Evolution of the Two Types of Carboxysomes: Polyhedral Microcompartments That Facilitate CO2 Fixation in Cyanobacteria and Some Proteobacteria. Microbiology and Molecular Biology Reviews. 77: 357-359.

Riding, R. (2006). Cyanobacterial calcification, carbon dioxide concentrating mechanisms, and Proterozoic-Cambrian changes in atmospheric composition. Geobiology. 4: 299 - 316.

Rocap, G., Larimer, F.W., Lamerdin, J., Malfatti, S., Chain, P., Ahlgren, N.A., Arellano, A., Coleman, M., Hauser, L., Hess, W.R., Johnson, Z.I., Land, M., Lindell, D., Post, A.F., Regala, W., Shah, M., Shaw, S.L., Steglich, C., Sullivan, M.B., Ting, C.S., Tolonen, A., Webb, E.A., Zinser, E.R., Chisholm, S.W. (2003) Genome divergence in two Prochlorococcus ecotypes reflects oceanic niche differentiation. Nature. 424: 1042 - 1047.

Rothman, D.H., Fournier, G.P., French, K.L., Alm, E.J., Boyle, E.A., Cao, C., Summons, R.E. (2014). Methanogenic burst in the end-Permian carbon cycle. PNAS. 111(15): 5462 - 5467.

Sánchez-Baracaldo, P. (2015) Origin of marine planktonic cyanobacteria. Nature Scientific Reports. 5: 17418.

Schirrmeister, B.E., de Vos, J.M., Antonelli, A., Bagheri, H.C. (2013). Evolution of multicellularity coincided with increased diversification of cyanobacteria and the Great Oxidation Event. PNAS. 110(5): 1791 - 1796.

Shih, P.M., Hemp, J., Ward, L.M., Matzke, N.J., Fischer, W.W. (2017). Crown group Oxyphotobacteria postdate the rise of oxygen. Geobiology. 15: 19 - 29.

Shih, P.M., Matzke, N.J. (2013). Primary endosymbiosis events date to the later Proterozoic with cross-calibrated phylogenetic dating of duplicated ATPase proteins. PNAS. 110(30): 12355 - 12360.

So, A.K.-C., Espie, G.S., Williams, E.B., Shively, J.M., Heinhorst, S., Cannon, G.C. (2003) A Novel Evolutionary Lineage of Carbonic Anhydrase (Epsilon Class) Is a Component of the Carboxysome Shell. Journal of Bacteriology. 186: 623 - 630.

Stamatakis, A. (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 30(9): 1312 - 1313.

Whelan, S., Goldman, N. (2001). A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using a Maximum-Likelihood Approach. Mol. Biol. Evol. 18(5): 691 - 699.

Whitehead, L., Long, B.M., Dean Price, G., Badger, M.R. (2014) Comparing the in Vivo Function of α-Carboxysomes and β-Carboxysomes in Two Model Cyanobacteria. Plant Physiology. 165: 398-411.

Wilde, S., Valley, J.W., Peck, W.H., Graham, C.M. (2001). Evidence from detrital zircons for the existence of continental crust and oceans on the Earth 4.4 Gyr ago. Nature. 409: 176 - 178.

Wolfe, J.M., Fournier, G.P. (2018). Horizontal gene transfer constrains the timing of methanogen evolution. Nature Ecology & Evolution. 2: 897 - 903.

Yang, L., Tang, L., Liu, L., Salam, N., Li, W.-J., Zhang, Y. (2017). Aquichromatium aeriopus gen. nov., sp. nov., A Non-phototrophic Aerobic Chemoheterotrophic Bacterium, and Proposal of Aquichromatiaceae fam. nov. in the Order Chromatiales. Curr. Microbiol. 74: 972 - 978.

Yeates, T.O., Kerfeld, C.A., Heinhorst, S., Cannon, G.C., Shively, J.M. (2008) Protein-based organelles in bacteria: carboxysomes and related microcompartments. Nature Reviews: Microbiology. 6: 681-691.

Zuckerkandl, E., Pauling, L. (1962). Molecular disease, evolution and genetic heterogeneity. In Kasha, M., and Pullman, B. (eds.), Horizons in Biochemistry. New York: Academic, pp. 189 - 225.

Zuckerkandl, E., Pauling, L. (1965). Evolutionary divergence and convergence in proteins. In Bryson, V., and Vogel, H.J. (eds.), Evolving Genes and Proteins. New York: Academic, pp. 97 - 166.

APPENDIX

Table 1. Carbon fixation genes present in taxa. CbbL and CbbS (large and small subunits of RuBisCO), CsoS2 (main shell protein), CsoSCA (carbonic anhydrase, also known as CsoS3), and CsoS4A (vertex protein).

Species
>Prochlorococcus_marinus_str._MIT_9515
>Prochlorococcus_marinus_str._MIT_9313
>Prochlorococcus_marinus_str._MIT_9303
>Synechococcus_sp._RCC307
>Synechococcus_sp._CC9311
>Synechococcus_sp._CC9902
>Synechococcus_sp._WH_8102
>Synechococcus_sp._CC9605  NA NA NA
>Synechococcus_sp._WH_7803
>Cyanobium_gracile_PCC_6307
>Paulinella_chromatophora
>Microcystis_aeruginosa_PCC_7941  NA NA NA
>Thermosynechococcus_elongatus_BP-1  NA NA NA
>Synechococcus_sp._PCC_7002
>Candidatus_Synechococcus_spongiarum
>Synechococcus_sp._CB0205
>Synechococcus_sp._CB0101
>Synechococcus_sp._WH_5701
>Synechococcus_sp._WH_7805
>Synechococcus_sp._RS9917
>Synechococcus_sp._RS9916
>Synechococcus_sp._WH_8109
>Cyanobium_sp._PCC_7001
>Leptolyngbya_boryana
>Leptolyngbya_sp._PCC.. 5
NA NA NA
>Synechococcus_sp._KORDI-49
>Prochlorococcus_sp._MIT_0801
>Prochlorococcus_sp._MIT_0601
>Thiohalocapsa_sp._ML1
>Thioflavicoccus_mobilis_8321
>Thiorhodovibrio_sp._970
>Ectothiorhodospira_sp._BSL-9
>Halorhodospira_halochloris_str._A
>Thiocapsa_roseopersicina
>Thiocapsa_marina_5811
>Ectothiorhodospira_manna  NA NA

(Table 1, continued).

>Imhoffiella_purpurea
>Candidatus_Tenderia_electrophaga
>Thioalkalivibrio_nitratireducens
>Thioalkalivibrio_paradoxus
>Nitrococcus_mobilis
>Thiohalospira_halophila
>Thiohalorhabdus_denitrificans
>Thioalkalivibrio_versutus
>Thioalkalivibrio_sulfidiphilus
>Sulfurivirga_caldicuralii
>Nitrosomonas_eutropha
>Thiorhodococcus_drewsii
>Halothiobacillus_neapolitanus_c2
>Allochromatium_vinosum_DSM_180
>Halothiobacillus_sp._LS2
>Hydrogenovibrio_marinus  NA
>Nitrobacter_sp._Nb-311A
>Hydrogenovibrio_kuenenii
>Nitrobacter_winogradskyi
>Hydrogenovibrio_crunogenus
>Acidithiobacillus_caldus
>Nitrobacter_vulgaris
>Nitrobacter_hamburgensis
>Acidiferrobacter_thiooxydans
>Acidithiobacillus_thiooxidans
>Thiomicrorhabdus_chilensis
>Hydrogenovibrio_halophilus
>Thiomicrospira_sp._WB1
>Pseudonocardia_thermophila  NA NA NA
>Bradyrhizobium_sp._BR_10303  NA NA NA
>Marichromatium_purpuratum_984
>Marichromatium_gracile
>Acidimicrobium_ferrooxidans
>Ferrithrix_thermotolerans
>Acidithiobacillus_ferrooxidans
>Synechococcus_sp._WH_8103  NA
>Thioalkalivibrio_halophilus
>Prochlorococcus_marinus_str._MIT_9311
>Thermosynechococcus_vulcanus  NA
>Prochlorococcus_marinus_str._MIT_9312  NA
>Acidithiobacillus_ferrivorans
>Prochlorococcus_marinus_str._MIT_1342
>Bradyrhizobium_sp._ORS_278

Table 2. Best-fitting evolutionary models for individual proteins (left), with the three most likely models reported under the Bayesian Information Criterion (BIC). Partitioning of the concatenation, with the models used (right).

Fitting of Evolutionary Models

Individual Proteins

Protein   Most Likely   Second Most   Third Most
CbbL      LG+I+G        LG+G          LG+I+G+F
CbbS      LG+G          LG+I+G        WAG+I+G
WAG+G+F LG+I+G+F
CsoS2     LG+I+G
CsoS4A    LG+G          JTT+G

Partitioning of Concatenations

Genes               Model
CbbL/CbbS/CsoS4A    LG+I+G+F
CsoS2/CsoSCA
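The ranking of candidate models by BIC in Table 2 can be illustrated with a minimal Python sketch. It assumes the standard definition BIC = k ln(n) - 2 ln(L), where k is the number of free parameters, n the number of alignment sites, and L the maximized likelihood. The log-likelihoods, parameter counts, and alignment length below are hypothetical placeholders, not values from this analysis, and the model-testing software used for Table 2 may differ in detail.

```python
import math

def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_sites) - 2.0 * log_likelihood

# Hypothetical (log-likelihood, free-parameter count) pairs for three models.
candidates = {
    "LG+G": (-10250.0, 101),
    "LG+I+G": (-10240.0, 102),
    "LG+I+G+F": (-10235.0, 121),
}
n_sites = 470  # hypothetical alignment length

scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in candidates.items()}
ranked = sorted(scores, key=scores.get)  # best (lowest BIC) first

for name in ranked:
    print(f"{name}: BIC = {scores[name]:.2f}")
```

Because the penalty term grows with the number of parameters, a model with a slightly higher likelihood (here the hypothetical LG+I+G+F) can still rank below a simpler one.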

Table 3. Description of PhyloBayes models run with root priors and secondary calibrations. For each model, the estimated probability distribution of SynPro divergence times is summarized by the maximum and minimum ages of the 95% Confidence Interval (CI) and by the mean. "Turtle" refers to the shape of the broad root prior created by imposing a wide, truncated standard distributed root prior.

Distribution of Divergence Times for the SynPro Group Congruent to Magnabosco et al., 2018 - Model Summary

Model        Max 95% CI (Ma)  Mean (Ma)  Min. 95% CI (Ma)  Calibrations                                       Root prior  Root truncation
TurtleRun1   1525.55          789.324    417.249           broad turtle root                                  2000 1000   3000 1000
TurtleRun2   1613.12          982.6172   521.571           broad turtle root prior                            2000 1000   3000 1000
TurtleRun3   908.333          647.3001   412.969           broad turtle root + SynPro                         2000 1000   3000 1000
TurtleRun4   881.871          691.8555   480.952           broad turtle root + SynPro prior                   2000 1000   3000 1000
TurtleRun5   929.31           615.514    386.853           broad turtle root + internal PSBs                  2000 1000   3000 1000
TurtleRun6   1034.92          710.737    484.601           broad turtle root + internal PSBs prior            2000 1000   3000 1000
TurtleRun7   895.213          619.5655   379.566           broad turtle root + SynPro + internal PSBs         2000 1000   3000 1000
TurtleRun8   908.863          886.1465   457.767           broad turtle root + SynPro + internal PSBs         2000 1000   3000 1000
TurtleRun9   902.881          811.0743   719.062           old turtle root + SynPro + internal PSBs           2500 500    3000 2000
TurtleRun10  909.198          880.752    825.524           old turtle root + SynPro + internal PSBs prior     2500 500    3000 2000
TurtleRun11  871.158          590.472    391.408           young turtle root + SynPro + internal PSBs         1500 500    2000 1000
TurtleRun12  900.48           678.2647   494.519           young turtle root + SynPro + internal PSBs prior   1500 500    2000 1000
TurtleRun13  894.771          716.3656   557.895           medium turtle root + SynPro + internal PSBs        2000 500    2500 1500
TurtleRun14  909.38           797.993    688.777           medium turtle root + SynPro + internal PSBs prior  2000 500    2500 1500
TurtleRun15  1115.92          655.8054   405.792           broad turtle root + internal PSB1                  2000 1000   3000 1000
TurtleRun16  1310.59          748.634    484.724           broad turtle root + internal PSB1                  2000 1000   3000 1000
TurtleRun17  982.769          642.262    398.736           broad turtle root + internal PSB2                  2000 1000   3000 1000
TurtleRun18  1045.82          731.873    471.964           broad turtle root + internal PSB2 prior            2000 1000   3000 1000
TurtleRun19  907.934          609.32     402.115           broad turtle root + SynPro + internal PSB1         2000 1000   3000 1000
TurtleRun20  904.074          664.69     447.352           broad turtle root + SynPro + internal PSB1 prior   2000 1000   3000 1000
TurtleRun21  882.069          632.4395   396.334           broad turtle root + SynPro + internal PSB2         2000 1000   3000 1000
TurtleRun22  894.962          683.9879   497.113           broad turtle root + SynPro + internal PSB2 prior   2000 1000   3000 1000
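The three summary statistics reported per model in Table 3 (maximum and minimum of the 95% interval, and the mean) can be sketched in Python as follows. This is an illustration only: it assumes the sampled ages for a node are already available as a list of numbers in Ma, it takes the 2.5th and 97.5th percentiles as the interval bounds, and it uses synthetic draws in place of real PhyloBayes output. The function name `summarize_ages` is hypothetical.

```python
import random
import statistics

def summarize_ages(ages):
    """Return the 95% interval bounds (2.5th and 97.5th percentiles) and the mean."""
    # quantiles(..., n=40) returns 39 cut points at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(ages, n=40)
    return {
        "min_95": cuts[0],
        "mean": statistics.fmean(ages),
        "max_95": cuts[-1],
    }

# Synthetic stand-in for sampled node ages (Ma); not data from this thesis.
rng = random.Random(0)
sampled_ages = [rng.gauss(620.0, 130.0) for _ in range(5000)]

summary = summarize_ages(sampled_ages)
print(f"Max 95%: {summary['max_95']:.1f} Ma, "
      f"Mean: {summary['mean']:.1f} Ma, "
      f"Min 95%: {summary['min_95']:.1f} Ma")
```

Percentile-based bounds are one common convention; a highest-density interval would give different bounds for skewed distributions.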

Table 4. Divergence times of select nodes reported for Model 7 (TurtleRun7 in Table 3, above). Nodes include the older and younger bounds for the HGT event from the purple sulfur bacteria (Chromatiales) into the cyanobacteria, the SynPro group congruent to the group reported in Magnabosco et al., 2018, and ancestors to the SynPro group, Prochlorococcus, and High Light Prochlorococcus groups, as described.

Model 7 (TurtleRun7) Divergence Times of Main Nodes

Table 5. The shortened names (left) corresponding to the full names (right) of the taxa included in these analyses.

Shortened Name     Full Name
>Af_thiooxy        >Acidiferrobacter_thiooxydans
>Am_ferroox        >Acidimicrobium_ferrooxidans
>Acaldus           >Acidithiobacillus_caldus
>At_ferrivo        >Acidithiobacillus_ferrivorans
>At_ferroox        >Acidithiobacillus_ferrooxidans
>A_thiooxi         >Acidithiobacillus_thiooxidans
>ADSM180           >Allochromatium_vinosum_DSM_180
>CaSyn_142         >Candidatus_Synechococcus_spongiarum_142
>CaTen_elec        >Candidatus_Tenderia_electrophaga
>CyPCC6307         >Cyanobium_gracile_PCC_6307
>CyPCC7001         >Cyanobium_sp._PCC_7001
>E_magna           >Ectothiorhodospira_magna
>E_marina          >Ectothiorhodospira_marina
>E_BSL9            >Ectothiorhodospira_sp._BSL-9
>F_thermo          >Ferrithrix_thermotolerans
>Halo_ha_A         >Halorhodospira_halochloris_str._A
>Halo_ne_c2        >Halothiobacillus_neapolitanus_c2
>Halo_LS2          >Halothiobacillus_sp._LS2
>Hcrunogenus       >Hydrogenovibrio_crunogenus
>H_halophilus      >Hydrogenovibrio_halophilus
>H_kuenenii        >Hydrogenovibrio_kuenenii
>H_marinus         >Hydrogenovibrio_marinus
>Ipurpurea         >Imhoffiella_purpurea
>Mgracile          >Marichromatium_gracile
>Mpurp_984         >Marichromatium_purpuratum_984
>N_hamburgensis    >Nitrobacter_hamburgensis
>N_Nb311A          >Nitrobacter_sp._Nb-311A
>N_vulgaris        >Nitrobacter_vulgaris
>N_winogradskyi    >Nitrobacter_winogradskyi
>N_mobilis         >Nitrococcus_mobilis
>N_eutropha        >Nitrosomonas_eutropha
>P_chromato        >Paulinella_chromatophora
>Pr_MIT9303        >Prochlorococcus_marinus_str._MIT_9303
>Pr_MIT9311        >Prochlorococcus_marinus_str._MIT_9311
>Pr_MIT9515        >Prochlorococcus_marinus_str._MIT_9515
>Pr_MIT0601        >Prochlorococcus_sp._MIT_0601
>Pr_MIT0801        >Prochlorococcus_sp._MIT_0801
>Scaldicur         >Sulfurivirga_caldicuralii
>SyBL107           >Synechococcus_sp._BL107
>SyCB0101          >Synechococcus_sp._CB0101
>SyCB0205          >Synechococcus_sp._CB0205
>SyCC9311          >Synechococcus_sp._CC9311
>SyCC9605          >Synechococcus_sp._CC9605

(Table 5, continued).

>SyCC9902          >Synechococcus_sp._CC9902
>SyKORDI49         >Synechococcus_sp._KORDI-49
>SyRCC307          >Synechococcus_sp._RCC307
>Sy_RS9916         >Synechococcus_sp._RS9916
>SyRS9917          >Synechococcus_sp._RS9917
>Sy_WH5701         >Synechococcus_sp._WH_5701
>SyWH7803          >Synechococcus_sp._WH_7803
>SyWH7805          >Synechococcus_sp._WH_7805
>SyWH8102          >Synechococcus_sp._WH_8102
>SyWH8103          >Synechococcus_sp._WH_8103
>SyWH8109          >Synechococcus_sp._WH_8109
>T_halo_1          >Thioalkalivibrio_halophilus
>T_nitratir        >Thioalkalivibrio_nitratireducens
>Tparadoxu         >Thioalkalivibrio_paradoxus
>T_sulfidi         >Thioalkalivibrio_sulfidiphilus
>T_versutus        >Thioalkalivibrio_versutus
>T_mar_5811        >Thiocapsa_marina_5811
>T_roseoper        >Thiocapsa_roseopersicina
>Tmob_8321         >Thioflavicoccus_mobilis_8321
>T_ML1             >Thiohalocapsa_sp._ML1
>T_denitrif        >Thiohalorhabdus_denitrificans
>Thalo_2           >Thiohalospira_halophila
>T_chilens         >Thiomicrorhabdus_chilensis
>T_WB1             >Thiomicrospira_sp._WB1
>T_drewsii         >Thiorhodococcus_drewsii
>T_970             >Thiorhodovibrio_sp._970

Figure 1. Maximum Likelihood tree (constructed with RAxML) of concatenated carbon fixation proteins. Green - marine cyanobacteria; purple - Chromatiales; pink - other Gammaproteobacteria; blue - Betaproteobacteria; red - Alphaproteobacteria; brown - Acidithiobacillia; orange - Actinobacteria.

Figure 2. Probability distributions plotted for the age of the ancestral node to the SynPro clade (congruent to Magnabosco et al., 2018). Divergence times estimated using an uncorrelated relaxed molecular clock with gamma site rate distributions using PhyloBayes. Models with inputs of root prior and secondary calibrations described in Table 3.

[Figure 2 panels, each titled "Distribution of SynPro Divergence Times": Models 1 and 2; Models 3 and 4; Models 5 and 6; Models 7 and 8; Models 9 and 10; Models 11 and 12; Models 13 and 14; Models 15 and 16; Models 17 and 18; Models 19 and 20; Models 21 and 22; Models 1, 3, 5, 7 (two panels); Models 7, 9, 11, 13 (two panels). Horizontal axis: Age (Ma), 0 to 2000.]

Figure 3. Chronogram constructed using PhyloBayes; model run 7 (TurtleRun7) as described in Table 3. Mean divergence times of nodes reported here. Green - marine cyanobacteria; purple - Chromatiales; pink - other Gammaproteobacteria; blue - Betaproteobacteria; red - Alphaproteobacteria; brown - Acidithiobacillia; orange - Actinobacteria.

Figure 4. Chronogram constructed using PhyloBayes; model run 7 (TurtleRun7) as described in Table 3. Min. and max. divergence times of nodes reported here. Green - marine cyanobacteria; purple - Chromatiales; pink - other Gammaproteobacteria; blue - Betaproteobacteria; red - Alphaproteobacteria; brown - Acidithiobacillia; orange - Actinobacteria.

Figure 5. Maximum Likelihood tree (constructed with RAxML) of CbbL protein. Green - marine cyanobacteria; purple - Chromatiales; pink - other Gammaproteobacteria; blue - Betaproteobacteria; red - Alphaproteobacteria; brown - Acidithiobacillia; orange - Actinobacteria.

Figure 6. Maximum Likelihood tree (constructed with RAxML) of CbbS protein. Green - marine cyanobacteria; purple - Chromatiales; pink - other Gammaproteobacteria; blue - Betaproteobacteria; red - Alphaproteobacteria; brown - Acidithiobacillia; orange - Actinobacteria.

Figure 7. Maximum Likelihood tree (constructed with RAxML) of CsoS2 protein. Green - marine cyanobacteria; purple - Chromatiales; pink - other Gammaproteobacteria; blue - Betaproteobacteria; red - Alphaproteobacteria; brown - Acidithiobacillia; orange - Actinobacteria.

Figure 8. Maximum Likelihood tree (constructed with RAxML) of CsoSCA (CsoS3) protein. Green - marine cyanobacteria; purple - Chromatiales; pink - other Gammaproteobacteria; blue - Betaproteobacteria; red - Alphaproteobacteria; brown - Acidithiobacillia; orange - Actinobacteria.
