THE UNIVERSITY OF SYDNEY

Copyright and use of this thesis

This thesis must be used in accordance with the provisions of the Copyright Act 1968.

Reproduction of material protected by copyright may be an infringement of copyright and copyright owners may be entitled to take legal action against persons who infringe their copyright.

Section 51 (2) of the Copyright Act permits an authorized officer of a university library or archives to provide a copy (by communication or otherwise) of an unpublished thesis kept in the library or archives, to a person who satisfies the authorized officer that he or she requires the reproduction for the purposes of research or study.

The Copyright Act grants the creator of a work a number of moral rights, specifically the right of attribution, the right against false attribution and the right of integrity.

You may infringe the author’s moral rights if you:

- fail to acknowledge the author of this thesis if you quote sections from the work

- attribute this thesis to another author

- subject this thesis to derogatory treatment which may prejudice the author’s reputation

For further information contact the University’s Copyright Service.

sydney.edu.au/copyright

Genome evolution in Prochlorococcus and marine Synechococcus

Camilla Lan-Chih Ip BSc (Hons), MApplSc (Bioinf)

A thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy

School of Biological Sciences The University of Sydney New South Wales 2006 Australia

July 2010

Statement of authorship

I hereby declare that the work presented in this thesis is, to the best of my knowledge, original and my own work, except as acknowledged in the text; and that the material has not been submitted, either in whole or in part, for a degree at this or any other university.

Acknowledgements

A PhD is a long and difficult journey. During my candidature, many people have been generous with their time, good counsel, and resources. I am honoured that my supervisors Anthony Larkum, Michael Charleston and Murray Henwood thought I was worthy of such a significant investment of their time and energy. For the hundreds of hours collectively spent with me during my candidature, their patience and friendship, and their continued support of my efforts to complete my candidature in the face of numerous obstacles, I am profoundly grateful. Without the wisdom and guidance of my supervisors, I would not have reached my destination.

I warmly thank Min Chen for welcoming me into her laboratory group, for her insightful discussions on the biochemistry of , and for demonstrating by example how to be an efficient research scientist. I extend my thanks to Lars Jermiin, under whose supervision I started this candidature, for taking me on as a student and guiding my early research.

I thank past and present members of the Bioinformatics Laboratory (Isabel Hyman, Shauna Murray, Faisal Ababneh, Vivek Jayaswal, Nick Ho, Kwok Wai (Rex) Lau, Julia Jones and Tanya Golubchik), the Marine Algae Laboratory (Yinan Zhang, Martin Schliep, Ray Ritchie, Kathie Donohoe, Ross McC. Lilley and Moacir Torres) and the Plant Systematics Laboratory (Trevor Wilson, Matt Pye, Endymion Cooper, Kerry Gibbons and Andrew Perkins) at the School of Biological Sciences, the Bioinformatics Interest Group (Joshua Ho, Yung Shwen Ho, Jeremy Sumner, Michael Woodhams, Ilana Lichtenstein and Selma Klanten) at the School of Information Technologies, and Leon Poladian from the School of Mathematics and Statistics. I thank Luke Henderson from the School of Medical Sciences in the Faculty of Medicine for generously providing access to additional computing resources.

I thank the University of Sydney for funding the University Postgraduate Award scholarship that subsidised my living expenses during my candidature and allowed me to conduct my studies on a full-time basis. Similarly, I was able to attend international conferences and workshops with the assistance of the Postgraduate Research Support Scheme at the University of Sydney, and Marine Europe, who generously relaxed their rules to allow a student from outside the EU to attend. These opportunities to interact with researchers in my field were critical for the development of my research skills.

The final part of the thesis write-up was conducted while working as a post-doc. I warmly thank Rory Bowden, Derrick Crook, Tim Peto and Rosalind Harding for not putting additional pressure on me during this difficult period and generously allowing me to take time off to complete the thesis.

Throughout the entire PhD journey, my partner Danny Yee has been there to support me. He has been an active sounding board for ideas, provided invaluable technical support for the installation of computer software and hardware, put his own plans on hold in order to be there for me, and unfailingly believed that I could complete the candidature. Without his love and companionship, it would have been a dark and lonely journey. I thank my siblings Amy, Daniel and Lilian for all their love, support and advice during my candidature.

This work is dedicated to my mother Dana, and my late father Percy, who nurtured my quest for wisdom through knowledge.

Abstract

The primary aim of this work was to explore the relative importance of vertical and non-vertical inheritance in one group of closely related cyanobacteria, and from this to gain insights into the processes of genome evolution associated with the diversification of photosynthetic systems in . Recovering the genome-level events associated with the evolutionary diversification of photosynthetic systems is difficult. The information that could be used to recover the evolutionary relationships among taxa has been eroded by a long series of gains, losses, duplications, rearrangements and mutations in the primary sequence. Furthermore, the main technique for inferring the evolutionary relationships among taxa, phylogenetic inference, is compromised by stochastic and systematic error, a particular problem if the sequences being studied have experienced lateral gene transfer (LGT).

Cyanobacteria from the genus Prochlorococcus and the marine clade of Synechococcus are small, unicellular, marine organisms that are ecologically important on a global scale. The Prochlorococcus genus is unusual in using a membrane-intrinsic light-harvesting antenna system consisting of divinyl chlorophylls a and b, while their closest relative, the marine Synechococcus clade, uses the system most commonly found in the cyanobacteria, a membrane-extrinsic system based on biliproteins organised into phycobilisomes. There is extensive evidence that Prochlorococcus and marine Synechococcus have evolved from a common ancestor, but the genomic processes involved are not well understood. The evidence of common ancestry is difficult to reconcile with evidence of extensive LGT. Analyses based on different genes have found different phylogenies and it is not clear whether this is due to stochastic or systematic error arising from the phylogenetic inference procedure or is indicative of different evolutionary histories.

Prochlorococcus and marine Synechococcus were chosen for this study because they are closely related, so their genomes are very similar, while still being sufficiently different to be used for a study of genome evolution. Furthermore, as cyanobacteria they are the extant survivors of the pioneering oxyphotobacteria in which oxygenic photosynthesis originally evolved, they have genetic similarities with the anoxygenic photosynthetic bacteria, and they share a common ancestry with the plastids of the eukaryotes. Hence a better understanding of the mechanisms of genome evolution in cyanobacteria may lead to insights into the processes by which photosynthetic systems have diversified.

The approach adopted in this study has been to prepare and select sequence data to prevent or minimise sources of error, so as to produce phylogenies that are a better representation of the evolutionary history. Each step of the phylogenetic inference process was scrutinised to identify potential sources of error and extensive preliminary studies were conducted to identify the most appropriate methods for data preparation and analysis.

Unsupervised inference of orthologous sets of protein-coding genes from diverse eubacteria was a necessary but particularly error-prone step, so preliminary studies were carried out to identify the best BLAST parameters to use in pairwise comparisons and several clustering methods were trialled before the Markov Clustering (MCL) method was selected. Different phylogenetic methods, and in particular, phylogenetic network methods, were evaluated for suitability and utility. Maximum-likelihood was selected for tree-based phylogenetic inference, and the Neighbor-Net analyses for highlighting conflicting site-patterns in gene alignments.

A set of 70 diverse eubacteria consisting of 31 cyanobacteria, 12 anoxyphotobacteria and 27 non-photosynthetic eubacteria was selected for analysis. A “core” of 439 protein-coding genes was identified. Of these, the 280 that could be used for phylogenetic analyses were filtered to identify a “reduced core” of 62 genes which did not exhibit large variation in amino-acid frequencies. Functional annotation was allocated to each set by cross-referencing with the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, which showed that 12 of the 21 functional categories relevant to bacteria were covered.

Phylogenetic trees were inferred using the maximum-likelihood method implemented in TREEFINDER. Neighbor-Net analyses were conducted using the implementation in SPLITSTREE. Sets of putatively orthologous genes that yielded phylogenies with suspected LGT events were investigated for alternate lines of evidence to support or refute this hypothesis, such as positive selection, recombination, local gene order, or insertion/deletion events.

Additional phylogenetic analyses of the 16S rDNA gene for the same 70 bacteria were conducted using Bayesian Inference. Stochastic error effects on the phylogenies were explored with maximum-likelihood trees inferred using subsamples of the original 70 eubacterial sequences and with a tree with over 1,000 16S rDNA sequences.

Although the phylogenies were sufficiently robust to explore the evolutionary origins of the Prochlorococcus and marine Synechococcus genomes in a eubacterial context, they were not able to recover the relationships among the bacterial phyla or to recover the precise position of individual taxa within phyla. This lack of resolution in the phylogenies was likely due to a combination of the loss of historical signal over very long time scales, stochastic error arising from taxon sampling effects, and systematic error arising from the application of a single model to disparate time scales.

Phylogenetic analyses of protein-coding genes that span the biochemical pathways shared by all Eubacteria indicate that the core metabolisms of Prochlorococcus and marine Synechococcus have been inherited vertically from a cyanobacterial common ancestor, most likely an ancestor of Cyanobium. Despite the fact that all groups share photosynthesis genes with a common origin, the genomic hosts of these genes, spread across the cyanobacteria and five groups of anoxygenic photosynthetic bacteria, do not appear to be closely related. For the small number of reduced core phylogenies that could have been interpreted as evidence of LGT in the cyanobacteria, only one instance was corroborated by alternate lines of evidence, in this case, by an insertion/deletion event.

On the basis of these explorations, I conclude that on these very long time scales, inheritance of core metabolic genes in cyanobacteria is largely vertical. It is possible that the Prochlorococcus and marine Synechococcus lineages experienced a high level of LGT in their early history, but even detailed phylogenetic analyses based on carefully prepared data do not have the resolution to interpret minor topological differences as evidence of LGT. Hence, the level of LGT among the Prochlorococcus and marine Synechococcus is likely to be less than previously claimed.

The most likely explanation for the distribution of lineages with a membrane-intrinsic light-harvesting system among the cyanobacteria, and of photosynthetic lineages among the Eubacteria, is that most of these lineages have acquired the genes for the novel light-harvesting system, or the ability to perform photosynthesis, by lateral gene transfer. Thus, the photosynthesis-related genes appear to be a self-contained genetic module that does not require specialised metabolic support from a new eubacterial host. Furthermore, the photosynthesis module itself may consist of smaller modules, for example for the construction of a light-harvesting antenna system. It is this modularity, which is likely to be an evolved characteristic of these metabolic pathways, that promotes the ongoing diversification of the photosynthetic machinery in new lineages of Eubacteria and Eukarya.

Contents

Statement of authorship
Acknowledgements
Abstract
List of figures
List of tables
Abbreviations

1 General Introduction
1.1 The process of photosynthesis
1.1.1 Photosynthesis is an ancient energy conversion process
1.1.2 Photosynthesis in the tree of life
1.2 The cyanobacteria
1.2.1 Cyanobacteria are significant contributors to global nutrient cycles
1.2.2 Oxygenic photosynthesis evolved in the ancestors of the cyanobacteria
1.2.3 Light-harvesting antenna systems in cyanobacteria
1.2.4 The significance of the various chlorophylls in the evolution of light harvesting systems
1.2.5 Cyanobacteria and the endosymbiotic origin of the plastids
1.2.6 The evolutionary history of the cyanobacteria
1.3 Genome evolution in bacteria
1.3.1 Bacterial genomes
1.3.2 Mechanisms of inheritance
1.3.3 Extending the species concept
1.4 The Prochlorococcus spp.
1.4.1 Microbes of global ecological importance
1.4.2 The closely related marine Synechococcus spp.
1.4.3 Diversification of LH-antenna systems
1.4.4 Prochlorococcus and the Paulinella chromatophore
1.4.5 Model organisms for studying genome evolution in cyanobacteria
1.5 Genome evolution in Prochlorococcus and marine Synechococcus
1.5.1 Evidence of LGT
1.5.2 Evidence of divergence from a common ancestor
1.5.3 Evaluating the case for LGT
1.6 Objectives of this thesis
1.7 Outline of this thesis

2 Orthologue Inference: Methods and Preliminary Studies
2.1 Introduction
2.1.1 A working definition of orthology
2.1.2 Orthologue inference using graph theory
2.1.3 Functional annotation of most-similar clusters of protein-coding genes
2.1.4 Motivation for preliminary studies
2.1.5 Aims
2.2 Preliminary studies
2.2.1 Finding suitable BLAST parameters to detect homologous protein-coding genes
2.2.2 Iterative clustering using paraclique topologies of RBH relationships
2.2.3 Refining the criteria for identifying the most similar pairs of proteins among two genomes
2.2.4 Iterative clustering using seed clusters, set expansion and pruning
2.2.5 Single-step clustering with the MCL algorithm
2.2.6 Inferring functional annotation for sets of orthologues
2.3 Data storage and manipulation
2.3.1 Computational requirements
2.3.2 A relational database for data storage
2.3.3 Software developed during this work
2.4 Discussion
2.4.1 Sources of error in orthologue inference
2.4.2 Heuristic methods for dealing with error in orthologue inference
2.4.3 Future directions
2.4.4 Conclusions

3 Phylogenetic Inference: Methods and Preliminary Studies
3.1 Introduction
3.1.1 Stochastic error from sampling
3.1.2 Systematic error from non-historical signals
3.1.3 Aims
3.2 Methodologies and preliminary studies
3.2.1 Taxon selection
3.2.2 Orthologue inference
3.2.3 Selection of nucleotide or amino-acid data
3.2.4 Multiple sequence alignment
3.2.5 Removal of sites with uncertain homology
3.2.6 Screening for large variation in amino-acid composition
3.2.7 Single versus concatenated gene phylogenies
3.2.8 Selection of evolutionary substitution models
3.2.9 Tree-based phylogenetic inference methods
3.2.10 Network-based phylogenetic inference methods
3.3 Discussion
3.3.1 Sources of error in phylogenetic inference
3.3.2 Strategies for reducing sources of error
3.3.3 Conclusions

4 Selecting data for exploring the origins of Prochlorococcus and marine Synechococcus genomes in a eubacterial context
4.1 Introduction
4.1.1 The origins of the PS-group genomes are not known
4.1.2 LGT between cyanobacteria may be common
4.1.3 Desirable properties of data for this study
4.1.4 Aims
4.2 Methods
4.2.1 Selection of taxa
4.2.2 Inference of orthologous, eubacterial-core protein-coding genes
4.2.3 Selection of a reduced eubacterial core for phylogenetic analysis
4.2.4 Representation of gene functional groups in the reduced eubacterial core
4.2.5 The phylogenetic signal in the reduced and rejected eubacterial cores
4.2.6 Codon alignments of the reduced and rejected eubacterial cores
4.2.7 Evidence of diversifying selection in the reduced eubacterial core
4.2.8 Addition of Acaryochloris and Heliobacterium to the study
4.2.9 Inclusion of eubacteria may reveal instances of LGT
4.3 Results
4.3.1 The eubacterial taxa selected
4.3.2 The reduced eubacterial core
4.3.3 Representation of gene functional groups in the reduced eubacterial core
4.3.4 Supertrees and supernetworks of the reduced and rejected eubacterial cores
4.3.5 Codon alignments of the reduced and rejected eubacterial cores
4.3.6 Pairwise estimates of dN/dS in the reduced and rejected eubacterial cores
4.4 Discussion
4.4.1 Selection of taxa
4.4.2 Orthologue inference
4.4.3 The reduced eubacterial core
4.4.4 Phylogenetic signal in the reduced eubacterial core
4.4.5 Evidence of selection in the reduced and rejected eubacterial cores
4.4.6 Conclusions

5 The origin of the core genomes of the PS-group taxa
5.1 Introduction
5.1.1 Conflicting phylogenies in cyanobacteria: artefacts or LGT?
5.1.2 Phylogenetic artefacts arising from taxon sampling
5.1.3 Phylogenetic artefacts arising from LBA effects
5.1.4 Distinguishing artefactual error from LGT
5.1.5 Aims
5.2 Methods
5.2.1 Inference of a 16S rDNA reference tree
5.2.2 Inference of ML trees from the reduced eubacterial core
5.2.3 Visualisation of phylogenetic conflict with Neighbor-Net analyses
5.2.4 Quantification of cyanobacterial over-sampling effects
5.2.5 Estimation of cyanobacterial under-sampling effects
5.2.6 Positive selection or recombination as an alternative to LGT
5.2.7 Investigation of gene order support for suspected LGT
5.3 Results
5.3.1 Bayesian phylogeny inferred from the 16S rDNA gene
5.3.2 ML phylogenies inferred from the reduced eubacterial core
5.3.3 Neighbor-Net analyses of the 16S rDNA gene and the reduced eubacterial core
5.3.4 The effects of cyanobacterial over-sampling on tree topologies
5.3.5 The effects of cyanobacterial under-sampling on the position of the PS-group
5.3.6 Evidence of positive selection or recombination in the reduced eubacterial core
5.3.7 Evidence of conserved gene order around sequences suspected of LGT
5.4 Discussion
5.4.1 Origins of the PS-group genomes in the eubacteria
5.4.2 Evolution of the Chl-based LH-antenna systems
5.4.3 The effect of taxon sampling on cyanobacterial phylogenies
5.4.4 Origin of the PS-group within the cyanobacteria
5.4.5 Distinguishing LGT from other evolutionary processes for four eubacterial-core protein-coding genes
5.4.6 Conclusions

6 General Discussion
6.1 Overview
6.2 Use of a reduced eubacterial core
6.3 Resolution of the eubacterial phyla
6.4 Inference of intra- LGT using phylogenetic methods
6.5 Origins of the Prochlorococcus and marine Synechococcus genomes
6.6 Mechanisms of genome evolution in cyanobacteria
6.7 Evolution of photosynthesis in bacteria
6.8 Conclusions
6.9 Future directions

Appendices
A Selected relational database tables used in this study
B Properties of chromosomes and of the reduced core
C Properties of proteins of the reduced core

Bibliography

List of Figures

1.1 Time line of our Solar System
1.2 A scheme for the origin of plastids through multiple endosymbioses
2.1 Gene evolution showing orthologous, paralogous and xenologous genes
2.2 All-against-all BLAST comparison
2.3 The inference of clusters of RBH matches
2.4 Number of BLASTP query hits among three Prochlorococcus “genomes”
2.5 Sample BLAST output for Prochlorococcus CCMP1986 vs. CCMP1375
2.6 Inference of orthologous sets of protein-coding genes from RBH relationships
2.7 Histograms of the number of organisms and sequences in RBH fans
2.8 Annotation of 52 protein-coding genes of an RBH fan with 36 organisms
2.9 Possible BLAST-hit relationships between protein-coding genes in two genomes
2.10 “Expand and prune” clustering output for high-light inducible protein 11
2.11 “Expand and prune” clustering for high-light inducible protein C
2.12 Neighbor-Net analysis of the hli gene family
3.1 Phylogenetic trees inferred from the amino-acid sequence of polyphosphate kinase
3.2 Phylogenetic networks inferred from the amino-acid sequence of polyphosphate kinase
4.1 SRH test failure rate vs. amino-acid alignment length
4.2 KEGG gene functional categories
4.3 Number of taxa in each of the reduced and rejected eubacterial core protein-coding gene sets
4.4 Supertrees inferred from the ML trees from the reduced and rejected eubacterial cores
4.5 Phylogenetic supernetworks inferred from the reduced and rejected eubacterial cores
4.6 Rate of non-synonymous (dN) and synonymous (dS) substitutions
4.7 Pairwise dN/dS ratios within the reduced and rejected eubacterial cores
4.8 Pairwise dN/dS ratios within the eubacterial core
5.1 Phylogenies inferred from the 16S rDNA gene for 70 bacteria
5.2 Phylogenies inferred from the secY gene
5.3 Phylogenies inferred for the ribH gene
5.4 Phylogenies inferred for the apt gene
5.5 Phylogenies inferred for the ahcY gene
5.6 Phylogenies inferred for the rpe gene
5.7 Phylogenies inferred for the bcp gene
5.8 Parameters of the discrete Γ distribution inferred for ML analyses
5.9 The effect of taxon sampling on the position of the PS-group
5.10 ML tree inferred from 16S rDNA genes of 1,128 cyanobacteria
5.11 Protein-coding genes on either side of the rpe gene in the cyanobacteria
5.12 Protein-coding genes on either side of the apt genes of the cyanobacteria
5.13 Distribution of GC% in clades of the rpe and apt genes

List of Tables

2.1 BLAST parameters suggested on the NCBI website
2.2 BLASTP matches for Prochlorococcus CCMP1986 protein-coding genes in str. CCMP1375
2.3 Uni-directional best hits between Synechocystis PCC6803 and Prochlorococcus MIT9313
2.4 Taxa common to both the CyOG database and the present study
2.5 Default parameters for MCL v1.008
2.6 Gene function categories in the KEGG and COG databases
3.1 Contingency table of nucleotide counts in two nucleotide sequences
4.1 Whole bacterial genomes used in this study
4.2 The reduced bacterial core of 62 protein-coding genes
4.3 Properties of the supertrees of the reduced and rejected eubacterial core
4.4 Number of pairwise dN/dS comparisons of each taxon group retained for analysis in the reduced and rejected eubacterial cores
5.1 Clade support and PS-group positions inferred from phylogenetic analyses of the 62 protein-coding genes of the reduced eubacterial core
5.2 Evidence of positive selection and/or recombination in the reduced eubacterial core
5.3 Strength of evidence for positive selection and/or recombination in the ahcY, rpe, bcp and apt protein-coding genes
5.4 Summary of evidence for positive selection and/or recombination in the ahcY, rpe, bcp and apt protein-coding genes
A.1 Selected relational database tables used in this study
B.1 Properties of the chromosomes and plasmids of the reduced core
C.1 Properties of the alignments and parameters used in the phylogenetic analyses of the reduced core

Abbreviations

AIC          Akaike Information Criterion
AICc         Corrected Akaike Information Criterion
BChl         bacteriochlorophyll
BI           Bayesian Inference
BIC          Bayesian Information Criterion
BLH          branch length heterogeneity
BPP          Bayesian Posterior Probability
BS           bootstrap
Chl          chlorophyll
core         an ambiguous term referring to a set of genes, thus in the text, one of the following terms is always used to clarify the intended meaning: the “genomic backbone” (or “clonal frame”); the “taxon-specific core”; the “vertical core”; or the “metabolic core”
genome       all the heritable genetic information in an organism
“genome”     the list of all known protein-coding genes in an organism
GI           GenBank Identifier
Gya          giga/billion years ago
hLRTs        hierarchical likelihood ratio tests
indel        insertion/deletion
KEGG         Kyoto Encyclopedia of Genes and Genomes database
LBA          long branch attraction
LGT          lateral/horizontal gene transfer
LH           light-harvesting
lnL          log-likelihood
LR-ELW       Local Rearrangement Expected-Likelihood Weight
LRT          likelihood ratio test
Mbp          mega (million) base pairs
MCL          Markov Clustering (algorithm)
MSH-cluster  a cluster of protein-coding genes related by MSH-matches
MSH-match    a pair of highly similar protein-coding genes
MSH          membrane-spanning helices
ML           maximum likelihood
mS           marine Synechococcus
Mya          million years ago
NJ           Neighbor-Joining (tree)
OEC          Oxygen-evolving complex
PBS          phycobilisome
pcb          prochlorophyte chlorophyll a/b binding protein-coding genes
Pcc          Prochlorococcus
PSI          Photosystem I
PSII         Photosystem II
PS-group     the Prochlorococcus and marine Synechococcus species
RBH          reciprocal BLAST hit
RBT          Refined Buneman Tree
SRH          stationary, reversible and homogeneous (conditions)

Chapter 1

General Introduction

1.1 The process of photosynthesis

1.1.1 Photosynthesis is an ancient energy conversion process

Photosynthesis is the biological process that converts the energy in light into chemical energy stored in the bonds of organic molecules (Blankenship, 2002). It is easy to take this process for granted because it occurs inconspicuously all around us with a minimum of fuss at room temperature. However, this process provides the organic carbon for all organisms in the ecosystem. Photosynthesis is an oxidation-reduction reaction that relies on electron transport across a membrane. It can generally be expressed as:

$$\mathrm{CO_2 + 2H_2A + light \xrightarrow{\text{pigment}} (CH_2O) + H_2O + 2A}$$

where the catalyst for the reaction is a pigment (Falkowski and Raven, 2007). All photosynthetic bacteria, except the cyanobacteria, perform this process under anaerobic conditions using an electron donor such as sulfur. The cyanobacteria, and all eukaryotic photosynthetic organisms, are oxygenic and use water as the electron donor, releasing oxygen. The basic parts of the equation “carbon dioxide plus water and nutrients plus light yield increase in plant mass plus oxygen” were not recognised until about the year 1800 and the term “photosynthesis” was first used by C. MacMillan in 1893 (Govindjee and Krogmann, 2004).

Oxygenic photosynthesis is now defined as

$$\mathrm{CO_2 + H_2O + light \longrightarrow (CH_2O) + O_2}$$

where (CH2O) refers to incorporation of carbon dioxide into an organic form, in this case a sugar, and the process occurs in the photosynthetic lamellae of cyanobacteria or the chloroplasts of algae or land plants, under the catalytic action of the pigment chlorophyll.

This process is important to the entire biosphere because the energy used by life in most ecosystems on our planet is derived from it (Blankenship, 2002; Falkowski, 2006; Raymond and Segre, 2006).
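As an illustrative aside, the oxygenic form can be read as a special case of the general equation given earlier. The sketch below assumes that the generic electron donor H2A is water, so that A stands for an oxygen atom and the 2A product is released as O2. The step labels are added here for illustration only.

```latex
\begin{align*}
\mathrm{CO_2 + 2\,H_2A + light} &\longrightarrow \mathrm{(CH_2O) + H_2O + 2A}
  && \text{general form} \\
\mathrm{CO_2 + 2\,H_2O + light} &\longrightarrow \mathrm{(CH_2O) + H_2O + O_2}
  && \text{assuming } \mathrm{H_2A = H_2O},\ \mathrm{2A = O_2} \\
\mathrm{CO_2 + H_2O + light} &\longrightarrow \mathrm{(CH_2O) + O_2}
  && \text{cancelling one } \mathrm{H_2O} \text{ from each side}
\end{align*}
```

On the same reading, an anoxygenic system that used hydrogen sulfide as the donor (H2A = H2S) would deposit elemental sulfur (2S) in place of O2.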

Photosynthesis is important on a global scale and on geological time scales. On a timeline of our Solar System (Figure 1.1), Earth appeared about 4.5 billion years ago (Gya) (Lugmair and Shukolyukov, 2001; Watson and Harrison, 2005). Life is thought to have appeared following a period of intense asteroid bombardment ~3.8 Gya (Nisbet, 1987; Schidlowski, 1988; Sleep et al., 1989; Chyba, 1993; Mojzsis et al., 1996; Nisbet and Fowler, 1996; Nisbet and Sleep, 2001). However, life may have appeared earlier (Gogarten-Boekels et al., 1995), the asteroid bombardment providing the selection for extreme thermophiles seen at the root of the Archaea and Bacteria, but not for the last universal common ancestor (LUCA) (Galtier et al., 1999; Boussau et al., 2008; Fournier and Gogarten, 2010). Anoxygenic photosynthesis evolved about 3.5 Gya (Buick et al., 1981; Schidlowski, 1988; Awramik, 1992; Schidlowski and Aharon, 1992).

Figure 1.1: Time line of our Solar System. Image adapted from http://en.wikipedia.org/wiki/Earth; dates from Falkowski and Raven (2007).

Oxygenic photosynthesis, the more complex system that appears to require two reaction centres, evolved ~2.7 Gya with estimates between 3.5 and 2.3 Gya (Rye and Holland, 1998; Brocks et al., 1999; Hedges et al., 2001; Xiong and Bauer, 2002; Brocks et al., 2003a,b; Bekker et al., 2004; Cavalier-Smith, 2006; Anbar et al., 2007; Kaufman et al., 2007; Knoll et al., 2007). The oxygen produced as a by-product first accumulated in the atmosphere in significant quantities between 2.4 and 2.3 Gya (Farquhar et al., 2000; Bekker et al., 2004; Canfield, 2005; Holland, 2006), in what is known as the Great Oxidation Event (GOE), and facilitated the evolution of the multicellular eukaryotes (Blankenship, 2002; Xiong and Bauer, 2002; Falkowski, 2006; Raymond and Segre, 2006).

1.1.2 Photosynthesis in the tree of life

Photosynthetic organisms that use light to drive an ATP process are found in the Eubacteria and the Eukarya (Blankenship, 2002). The evolution of photosynthesis occurred within the bacteria (Blankenship, 1992). There are currently six bacterial phyla known to contain photosynthetic taxa: (i) the α-Proteobacteria; (ii) the Chlorobi; (iii) the Chloroflexi; (iv) the Firmicutes; (v) the Acidobacteria; and (vi) the Cyanobacteria (Balows et al., 1992; Madigan et al., 1997; Bryant et al., 2007b). The first five groups contain either Photosystem I (PSI) or Photosystem II (PSII) and perform what is known as anoxygenic photosynthesis. The cyanobacteria contain both PSI and PSII connected in series and are the only group capable of generating sufficient electron potential to split water molecules and release oxygen in the oxygen-evolving complex (OEC) of PSII (Blankenship, 2002; Raymond and Blankenship, 2008).

Chloroplasts, the organelles in which photosynthesis occurs in algae (photosynthetic protists) and land plants, have their own DNA, distinct from the nuclear DNA, and are believed to be remnants of free-living cyanobacteria that were captured and “domesticated” as endosymbionts (Palmer, 2003; Lake, 2009). These organelles, collectively known as plastids, evolved through a complex series of primary, secondary, tertiary and even higher order endosymbiotic events (Archibald and Keeling, 2002) (Figure 1.2). When this history is combined with lateral gene transfer (LGT) and gene loss events, it is difficult to infer their evolutionary relationships accurately (Martin, 1999; Archibald and Keeling, 2002).

Figure 1.2: A scheme for the origin of plastids through multiple endosymbioses. Figure adapted from Archibald and Keeling (2002). [Labels recoverable from the figure: primary endosymbiosis; cryptomonads, heterokonts, haptophytes, apicomplexa, ciliates, dinoflagellates, euglenids, chlorarachniophytes; alveolates; chromalveolates.]

1.2 The cyanobacteria

1.2.1 Cyanobacteria are significant contributors to global nutrient cycles

The cyanobacteria are a large, ancient, extraordinarily diverse and successful group of bacteria. They are found in almost all environments that receive enough sunlight for photosynthesis to occur: in terrestrial environments, including ice sheets and deserts, and in saltwater and freshwater environments (Thajuddin and Subramanian, 2005). In aquatic environments, the zone of water which receives at least 1% of incident sunlight, typically the upper 200 m of the surface waters, is known as the euphotic zone (Falkowski and Raven, 2007) and is the home of many cyanobacteria and algae, and some seagrasses and hydrophytes which have evolved from land plants and returned to the aquatic environment.
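The 1% light criterion for the euphotic zone can be turned into a depth with a short calculation. The sketch below is illustrative only: it assumes simple exponential attenuation of light with depth and a single, depth-independent attenuation coefficient, neither of which is stated in the text, and the function name `euphotic_depth`, the parameter `k_d` and the example value are hypothetical.

```python
import math


def euphotic_depth(k_d, light_fraction=0.01):
    """Depth (m) at which light falls to `light_fraction` of its surface value.

    Assumes simple exponential attenuation, I(z) = I0 * exp(-k_d * z),
    with one depth-independent attenuation coefficient k_d (per metre).
    This is an idealisation, not a model taken from the thesis.
    """
    if k_d <= 0:
        raise ValueError("k_d must be positive")
    if not 0 < light_fraction < 1:
        raise ValueError("light_fraction must lie strictly between 0 and 1")
    return -math.log(light_fraction) / k_d


# Illustrative value only: k_d of about 0.023 per metre corresponds to a
# 1% light depth of roughly 200 m, the typical figure quoted in the text.
depth = euphotic_depth(0.023)
print(f"{depth:.0f} m")
```

Under this assumption the 1% depth is ln(100)/k_d, so more turbid water (larger `k_d`) gives a proportionally shallower euphotic zone.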

The cyanobacteria are recognised as plentiful and productive primary producers that are ecologically important on a global scale for their role in nutrient cycling (Field et al., 1998; Partensky et al., 1999; Whitman et al., 1998). They have potential uses: (i) for their ability to break down harmful industrial waste; (ii) as a source of food for people or livestock; (iii) as a source of medicines; (iv) in carbon sequestration; (v) as a source of clean energy; and (vi) in life support systems for space exploration missions (Thajuddin and Subramanian, 2005).

1.2.2 Oxygenic photosynthesis evolved in the ancestors of the cyanobacteria

The photosynthetic reaction centre (RC) complexes are divided into two classes, known as Type I and Type II (Blankenship, 1992; Larkum, 2006; Howe et al., 2008), which are descended from a distant common ancestor (Williams et al., 1984; Michel, 1988; Fromme and Krauss, 1994; Schubert et al., 1998; Larkum, 2006; Sadekar et al., 2006). Reaction centre I (RCI) is based on an electron acceptor system involving iron-sulphur clusters and reaction centre II (RCII) on a quinone reduction system (Blankenship, 2002). At some point a critical event occurred where these two complexes united to provide the electron potential to split water, although we do not fully understand how this occurred (Larkum, 2006; Blankenship and Raymond, 2007). There is consensus that the photosynthetic machinery in all extant photosynthetic organisms is sufficiently similar that it must have been inherited from a common ancestor (Blankenship and Hartman, 1998; des Marais, 2000; Larkum and Kuhl, 2005; Zhang et al., 2007; Shi and Falkowski, 2008). A better knowledge of how photosynthesis has diversified over time will therefore aid our understanding of biologically-catalysed energy conversion, and perhaps answer questions about the early evolution of life.

Since oxygenic photosynthesis appears to require this complex biochemical machinery to generate sufficient electron potential to split water, it is thought to have evolved only once (des Marais, 2000). Thus the Cyanobacteria, the only extant group of oxygen-producing photosynthetic prokaryotes, are almost certainly the descendants of the ancient photosynthetic bacteria (oxyphotobacteria) that evolved the biochemical process of oxygenic photosynthesis. It is uncertain when this happened: no more than 3.5 Gya and no less than 2.4 Gya (Larkum, 2006).

1.2.3 Light-harvesting antenna systems in cyanobacteria

All photosynthetic organisms collect light energy with a light-harvesting antenna system, then funnel the energy across a pigment bed to a reaction centre where the energy conversion process takes place. All known photosynthetic organisms possess light-harvesting (LH) pigment protein-coding genes (Blankenship, 2002) and light harvesting appears to be an essential part of photosynthesis. An antenna system maximises the light-harvesting properties of a photosynthetic system while economising on the number of reaction centres.

The majority of cyanobacteria collect light through biliprotein pigments organised into antenna complexes called phycobilisomes (PBSs) (Bryant, 1991). These structures are described as “membrane-extrinsic LH antenna systems” because the phycobilisome structure is located on the outer side of thylakoid membranes (Falkowski and Knoll, 2007).

Four genera of cyanobacteria are unusual in that they lack functioning phycobilisomes, instead depending on intrinsic LH chlorophyll proteins (LHCs). These LHCs are structurally quite different from those based on phycobilisomes because the chlorophyll proteins are embedded within the thylakoid membranes, i.e., they are intrinsic membrane proteins.

The four genera that use membrane-intrinsic LHCs are: (i) Prochloron (Lewin, 1977), an obligate endosymbiont of didemnid ascidians, which uses Chl b as the main photosynthetic pigment; (ii) Prochlorothrix (Burger-Wiersma et al., 1986), a free-living freshwater species which uses Chl b as the main photosynthetic pigment; (iii) Prochlorococcus (Chisholm et al., 1988), which uses divinyl Chl a and divinyl Chl b (Rippka et al., 2000), does not contain Chl a (although some strains have Chl b), and lacks phycobilisomes (Ting et al., 2002); and (iv) Acaryochloris (Miyashita et al., 2003), which uses Chl d (Chen et al., 2002a). The three genera that contain Chl (or divinyl Chl) b are commonly referred to as “prochlorophytes” (Lewin, 1976). Although they share this unusual LH-antenna system, none of these genera appears to be closely related to each other (Palenik and Haselkorn, 1992; Urbach et al., 1992; Miyashita et al., 2003). Hence, the evolutionary events that have led to membrane-intrinsic LH-antenna systems are not clear and may well have involved LGT (Chen et al., 2005b). Of these genera, Prochlorococcus and Acaryochloris contain phycobiliproteins (Hess et al., 1996; Tomitani et al., 1999; Hess et al., 2001; Miyashita et al., 2003).

1.2.4 The significance of the various chlorophylls in the evolution of light harvesting systems

Four types of chlorophylls exist in oxygenic photosynthetic organisms: Chl a, Chl b, Chl c and Chl d. These chlorophylls are based on a similar porphyrin ring binding magnesium, with different side groups in rings 1 and 2. Until recently it was assumed that Chl a was the only chlorophyll which could participate in the generation of radical pairs in the primary processes of oxygenic photosynthesis. However, it is now firmly established that in this process in Acaryochloris marina, Chl d participates in PSI and, probably, in PSII (Hu et al., 1998; Tomo et al., 2007).

Both Chl a and Chl d play a role in light harvesting as well as RC processes. The roles of Chl b and Chl c are solely as light harvesting pigments. Chlorophyll d extends the light-harvesting properties, compared with Chl a, into the near infra-red region of the electromagnetic spectrum. Chl b extends the LH properties, compared to Chl a, in the y-transition band, pushing out the peak from around 436 nm to around 465 nm, while at the same time shifting the x-transition peak in the red from around 670 nm to 650 nm. Chlorophyll c is somewhat similar but only extends the peak to 450 nm, while at the same time the x-transition peak in the red is much reduced and shifted to 630 nm.

Chlorophyll b occurs in prochlorophyte algae (see below) and clearly evolved at the prokaryotic level. Chlorophyll c, in the strict sense, has only been found in chloroplasts and it has been assumed that it evolved at an early stage in the evolution of eukaryotic cells (Howe et al., 2008). However, the very similar pigment magnesium-3,8-divinyl phaeoporphyrin a5 monomethyl ester (or 8-vinyl-chlorophyllide) is now regarded as a true Chl c (Larkum, 2006) and is present in cyanobacteria and photosynthetic anoxygenic bacteria. Therefore Chl c in the broad sense was present in cyanobacteria and their forebears, the oxyphotobacteria.

Clearly Chl b and Chl c evolved after Chl a as light harvesting pigments. Chlorophyll d is thought to have evolved from Chl a, but when is unknown.

1.2.5 Cyanobacteria and the endosymbiotic origin of the plastids

The hypothesis that plastids are vestiges of cyanobacteria that were acquired through endosymbiosis was first proposed, briefly, in 1883 (Schimper, 1883) and, in more detail, in 1905 (Mereschkowsky, 1905; Martin and Kowallik, 1999). Since the discovery of circular DNA in chloroplasts (Ris and Plaut, 1962; Chun et al., 1963; Kirk, 1963) there has no longer been any doubt that the cyanobacteria and plastids share a common ancestry and that the ancestors of the higher plants and algae gained the ability to perform photosynthesis by acquiring an ancient cyanobacterium (Echlin, 1966; Margulis, 1970; Raven and Allen, 2003). Approximately 1.2 Gya, at least one proto-eukaryote captured at least one proto-cyanobacterium in what is termed a primary endosymbiotic event (Bhattacharya and Medlin, 1995; Delwiche and Palmer, 1996; Bhattacharya and Medlin, 1998; Butterfield, 2000; Stiller et al., 2003; Rodriguez-Ezpeleta et al., 2005; Larkum et al., 2007). Subsequently, there were secondary, tertiary and further order endosymbiosis events in which non-photosynthetic eukaryotes captured plastid-containing eukaryotes and retained the plastid (Delwiche, 1999; Archibald and Keeling, 2002).

There are now many lineages in the tree of life that contain photosynthetic organisms (e.g., Fehling et al., 2007). Some, such as the apicomplexan parasite Plasmodium that causes malaria, retain only a very limited set of genes in what was once their photosynthetic organelle, the apicoplast (McFadden and Waller, 1997; Okamoto and McFadden, 2008). Since there has been such a complex sequence of transfer and loss of genetic material within the photosynthetic eukaryotes, there are many theories of how these lineages are related to each other (e.g., Cavalier-Smith, 1999; Archibald and Keeling, 2002; Falkowski et al., 2004; Keeling, 2010), but no consensus.

1.2.6 The evolutionary history of the cyanobacteria

Geochemical evidence indicates that the biochemical process of oxygenic photosynthesis evolved possibly as early as 2.8 Gya (Nisbet and Sleep, 2001) and no later than 2.3 Gya (Rasmussen et al., 2008). It is unlikely to have evolved more than once (des Marais, 2000), so the extant group of oxyphotobacteria, the members of the phylum Cyanobacteria, have inherited the genetic information required to perform oxygenic photosynthesis from the ancient oxyphotobacteria. The age of the extant radiation of cyanobacteria is not known (Larkum et al., 2007), however, and there is little evidence of what their ancient ancestors were like or how closely they resemble the modern cyanobacteria (Larkum et al., 2007; Howe et al., 2008). We do not know whether the extant cyanobacteria are representative of the diversity of the ancient oxyphotobacterial lineage or are a single surviving lineage of it.

The extant cyanobacteria are genetically and ecologically diverse yet form a monophyletic clade (Tomitani et al., 2006). Larkum and colleagues argue that the extant descendants of such an ancient lineage that diversified at least 2.8 Gya (Summons et al., 1999) are unlikely to be monophyletic unless they are the only surviving lineage of a more recent bottleneck, perhaps caused by a global snowball event (e.g., Kirschvink, 1992; Hoffman et al., 1998b,a; Nisbet and Sleep, 2001). If this is true, the present diversity of the extant cyanobacteria tells us little about the diversity of ancient oxyphotobacteria or the rate of diversification over time because we do not know when or how much diversity has been lost. It is possible that the modern cyanobacteria diversified from a lineage of the ancient oxyphotobacteria much more recently than the earliest oxyphotobacterial lineages.

1.3 Genome evolution in bacteria

1.3.1 Bacterial genomes

In bacteria, the heritable material is located in one or more circular or linear chromosomes and possibly one or more plasmids. In this work, any reference to the genome refers to all the heritable genetic information in an organism. Many of the analyses were based on the list of all known protein-coding genes in a genome; this set of protein-coding genes is sometimes referred to as a “genome” for conciseness (e.g., Raymond et al., 2003a), and whenever this use is intended it is indicated by double quotes.

A gene is any segment of a chromosome that encodes a ribosomal, transfer or messenger RNA (abbreviated to rRNA, tRNA or mRNA, respectively) molecule. Messenger RNA molecules are translated into proteins, which are enzymatic catalysts for biochemical reactions or form the structure of biological organisms.

A biochemical reaction is the transformation of one or more metabolites into other metabolites. The reaction may occur spontaneously when all the input metabolites are in close proximity or may require the presence of an inorganic catalyst or a protein that acts as a catalyst. There are many biochemical reactions occurring in each living cell at any given instant and the subset of all possible biochemical reactions occurring at any instant depends on the state of the organism and the environmental conditions in which it finds itself.

Bacteria may live as free-living cells or in communities with closely or more distantly related individuals, commonly referred to as bacterial colonies, filaments, or microbial mats. Since a single-celled organism must contain the genes necessary to survive in its environment, the biochemical capabilities of a bacterium reflect how it interacts with its environment in terms of nutrient assimilation and responses to environmental stresses.

The comparison of closely and more distantly-related bacteria can be used to infer the evolutionary history of their genomes and, hence, the genomic changes that result in the genomes we observe today.

The biochemical capability of bacteria has evolved through the following discrete genome-level events: (i) gene rearrangement; (ii) gene acquisition; (iii) gene loss; and (iv) gene duplication (Abby and Daubin, 2007). At the level of the gene, changes accumulate through: (i) point mutations; (ii) breaks in the DNA and loss of a gene fragment; (iii) fusion of two gene fragments; and (iv) recombination of two homologous gene fragments to form a new gene that is a mosaic of genes with different evolutionary histories (Lawrence and Hendrickson, 2005; Koonin, 2009).

1.3.2 Mechanisms of inheritance

In eukaryotes, the genetic material in an individual is generally inherited “vertically” from two parents. Bacteria reproduce very differently. The genome in a bacterial cell replicates, then two daughter cells inherit a copy of the parental genome when the parent undergoes binary fission. The daughter cells are clones of the parent, with small differences in the genomic information due to errors or point mutations that occurred during the replication process.

In addition, bacterial cells may acquire genetic information in the form of gene fragments via three separate mechanisms (Frost et al., 2005): (i) conjugation, in which two bacteria become connected by a tube called a pilus and genetic material from a bacterial chromosome or plasmid is sent from one bacterium to the other (in effect, this is a form of sexual reproduction since there is a chance of crossing over of portions of the two chromosomes); (ii) transduction, where genes from a bacterium infected by a virus are incorporated into the viral progeny and transported to a subsequent bacterial host; and (iii) transformation, where naked DNA (e.g., a plasmid) from the environment is transported into the cell and incorporated into the bacterial chromosome (Ochman et al., 2000). These three mechanisms are the natural, bacterial analogues of the “genetic modification” of eukaryotes reported in the popular press, such as the artificial transfer of a GFP gene for the green fluorescent protein from a jellyfish to a “genetically modified” mouse (Hadjantonakis and Nagy, 2001).

The theoretical framework for phylogenetic analysis was originally developed for use with organisms which inherit their genetic material vertically. Its application to bacteria is complicated by their additional, non-vertical modes of inheritance.

Since bacterial lineages have evolved through a complex series of events, comparative genomic research has focused on the comparisons of closely related genomes, such as those of Prochlorococcus and marine Synechococcus. In any such study, a fundamental problem is that we do not understand the processes of genome evolution well enough to model them accurately (Dutilh et al., 2007). Rates of vertical and horizontal transfer could be very different for any given set of bacteria. The historical signal in the primary sequence may have been eroded by evolution over long time scales or accelerated evolutionary rates, hidden paralogy, LGT or homologous recombination (e.g., Galtier and Daubin, 2008). Additional sources of information can be used to distinguish relationships between bacteria, such as insertions or deletions (“indels”) and gene order, but the historical signal in these data may also be unclear due to a complex series of genomic evolutionary events. Furthermore, the results of any phylogenetic or other analysis may have some degree of error due to the use of data that are not compatible with the assumptions of the method, or data that confound it and produce topologies that do not reflect the evolutionary history.

1.3.3 Extending the species concept

In eukaryotes, species are most commonly defined as a group of individuals able to exchange genes (Templeton, 1989). A significant problem in the study of bacterial evolution is re-defining the species concept for bacteria (Wayne et al., 1987; Cohan, 2006; Ward et al., 2008) or questioning whether a species concept is applicable (Wood et al., 1992; Doolittle, 1999). The central concept of a species is a stable “gene pool” that is shared by a group of organisms over a relatively long period of time (Woese et al., 2000). For bacteria, where rates of LGT may be relatively high, generation times short, and population sizes extraordinarily large, this definition could be applied to groups of bacteria that share a very recent common ancestor and have not experienced any LGT, but such a use of the term would not be helpful in a discussion of the evolution of bacterial lineages because every sample of bacteria taken could be considered to belong to a separate “species”.

Many studies in the literature refer to a “core” set of genes. This term can be used to refer to: (i) the “genomic backbone” or “clonal frame” (“stretches of DNA descended from a single ancestral molecule and bounded by recombinational borders” (Milkman and Bridges, 1990)); (ii) the taxon-specific core, the set of genes that are characteristic of, or evolutionarily stable within, a taxon; (iii) the vertical core, a set of genes that cannot undergo xenologous replacement and thus show a vertical pattern of transmission; or (iv) the metabolic core, the set of genes responsible for fundamental bacteria-specific metabolic pathways. In this work, I have qualified each use of “core” to indicate the meaning intended.

The classification of bacteria is based on the hypothesis that, irrespective of any LGT that may have occurred, there is a central “core” of genes that are inherited vertically and form the genomic backbone of the genome (Lan and Reeves, 2000; Boucher et al., 2003; Tettelin et al., 2005; Lawrence and Hendrickson, 2005; Medini et al., 2005). There is empirical evidence that there are some genes, the metabolic core, that are so central to the viability of a bacterial cell that xenologous replacement of that gene is unlikely to be successful (Jain et al., 1999). The 16S rDNA gene has been found to be universal across the prokaryotes and phylogenies inferred from this gene agree well with a large number of genes from the host genomes (Woese and Fox, 1977).
In particular, genes that have been described as “informational genes”, that is, genes involved in translation, transcription and replication (Rivera et al., 1998), have been found to have a largely consistent evolutionary history that agrees with that of the 16S rDNA molecule (Olsen and Woese, 1996). Also, there is empirical evidence in support of “the complexity hypothesis”, that genes which belong to complex biochemical pathways are less prone to LGT because it is harder to incorporate a foreign gene into such pathways (Jain et al., 1999).

On shorter time-scales, the use of the 16S rRNA molecule as the “gold standard” for bacterial classification can be criticised because, while some genes are not prone to transfer by LGT, possibly due to gene-dosage toxicity issues (Sorek et al., 2007), there is extensive evidence that the 16S rRNA molecule has experienced LGT, in particular between close relatives followed by homologous recombination (Gogarten et al., 2002). Thus, the congruence between 16S rRNA phylogenies and those based on gene content may be due to both being representative of the mosaic nature of the gene and the genomes being investigated (Gogarten et al., 2002).

On longer time-scales, some researchers who have tried to uncover the origins of life or the relationship between the Eubacteria, Archaea and Eukarya (Woese and Fox, 1977) argue that LGT has been so widespread in the evolutionary history of bacteria that it has erased all record of the deepest branches in the tree of life (Philippe and Forterre, 1999; Doolittle, 1999; Martin, 1999).

Our understanding of the origin of life is hazy at best but, prior to the evolution of cells, it is hypothesised that consortia of “organisms” lived together and engaged in a free exchange of genetic material (Woese, 2002; Kandler, 1994). Since then, there has been a progressive reduction of LGT between organisms towards a system where only organisms which we would classify as “species”, or organisms that belong to the same lineage, are able to exchange genetic material (Templeton, 1989). When viewed in this context, LGT is a “natural” process that was once ubiquitous and is still a significant mechanism of genome evolution in prokaryotes. If we reflect that the vast majority of life on Earth is prokaryotic, perhaps it is the organisms that do not engage in LGT that are unusual.
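The gene-content view of a “core” set of genes discussed in this section can be illustrated with a small set computation. This is a minimal sketch, not the method used in this thesis: it assumes each genome has already been reduced to a set of gene-family identifiers, and it treats presence in every genome as a crude proxy for a taxon-specific core. Presence alone says nothing about whether a gene was inherited vertically. The function name, genome names and family labels are all hypothetical.

```python
def core_and_variable_genes(gene_content):
    """Split gene families into those shared by every genome and the rest.

    `gene_content` maps a genome name to the set of gene-family identifiers
    found in that genome (assumed to have been assigned beforehand).
    Returns (shared, variable), where `shared` holds the families present in
    all genomes and `variable` the families missing from at least one.
    """
    if not gene_content:
        raise ValueError("at least one genome is required")
    gene_sets = [set(genes) for genes in gene_content.values()]
    shared = set.intersection(*gene_sets)
    variable = set.union(*gene_sets) - shared
    return shared, variable


# Hypothetical toy data: genome names and family labels are placeholders.
genomes = {
    "genome_A": {"fam1", "fam2", "fam3", "fam4"},
    "genome_B": {"fam1", "fam2", "fam3", "fam5"},
    "genome_C": {"fam1", "fam2", "fam3"},
}
shared, variable = core_and_variable_genes(genomes)
print(sorted(shared))    # families found in all three toy genomes
print(sorted(variable))  # families absent from at least one toy genome
```

Distinguishing the vertical or metabolic core from this shared set would require additional evidence, such as congruent gene phylogenies or functional annotation.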
There is no model of bacterial evolution, then, that can easily capture the evolutionary history of the heritable genetic material in the bacterial genome. If any gene or gene fragment has experienced LGT there is no single tree that can accurately describe the evolutionary history of the genome. It is thus more meaningful to try to infer the evolutionary history of bacterial genomes than to search for a phylogeny of bacterial “species” (Doolittle, 1999). The challenge of bacterial comparative genomics is to infer how evolutionary processes have assembled modern bacterial genomes into their present form (Martin and Kowallik, 1999).

Since bacterial evolution is not understood well enough to be modelled accurately, research into the evolution of bacterial genomes has focused on lineages that have diverged relatively recently and retain a high degree of similarity, allowing easier inference of genomic history. One example is the cyanobacteria from the genus Prochlorococcus and the marine strains of Synechococcus.

1.4 The Prochlorococcus spp.

1.4.1 Microbes of global ecological importance

Members of the genus Prochlorococcus are slow-growing unicellular cyanobacteria found in the upper 200 m of oligotrophic aerobic marine surface waters in latitudes from approximately 40° N to 40° S (Partensky et al., 1999; Coleman and Chisholm, 2007). The Prochlorococcus genus has high- and low-light adapted ecotypes, some of which have reduced genomes (Rocap et al., 2003), and contains the smallest known oxygen-evolving autotroph (Chisholm et al., 1988). These are tiny cells with an equivalent spherical diameter of 0.5-0.7 µm (Morel et al., 1993) and typical abundances of 10^4-10^5 Prochlorococcus cells per ml of seawater (Campbell et al., 1994; Partensky et al., 1999; Coleman and Chisholm, 2007), making them the numerically dominant primary producer in the temperate and tropical oceans (Partensky et al., 1999).

Due to their abundance, they are an important contributor to global primary production and may be responsible for up to 50% of CO2 fixation in some regions (Liu et al., 1997).

1.4.2 The closely related marine Synechococcus spp.

The evolutionary relationships of cyanobacteria with membrane-intrinsic LH-antenna systems are not clear, with phylogenies based on standard bacterial markers concluding they are not closely related to each other within the cyanobacteria (Palenik and Haselkorn, 1992; Urbach et al., 1992; Miyashita et al., 2003). Instead, each of the four lineages is more closely related to other lineages of cyanobacteria with a membrane-extrinsic LH-antenna system.

The closest relatives of the Prochlorococcus spp. appear to be the marine strains of Synechococcus (e.g., Turner et al., 1989; Kishino et al., 1990; Lockhart et al., 1992a; Palenik and Haselkorn, 1992; Urbach et al., 1992; Hess et al., 1995; Ferris and Palenik, 1998; Moore et al., 1998; Urbach et al., 1998; Partensky et al., 1999; Hess et al., 2001; Ting et al., 2001; Zeidner et al., 2003; Ashby and Houmard, 2006) from marine subcluster 5.1 unicellular cyanobacteria (Herdman et al., 2001). These Synechococcus spp. were found to be abundant in seawater in 1979 (Waterbury et al., 1979), have a diameter of 0.6-0.9 µm (Albertano et al., 1997) and collect light with membrane-extrinsic phycobilisomes (Partensky et al., 1997).

Although existing evidence suggests that Prochlorococcus and marine Synechococcus spp. share a common ancestor, the mechanisms by which they have diverged are not understood.

1.4.3 Diversification of LH-antenna systems

Although the prochlorophytes were initially thought to have evolved from a recent common ancestor because they share a Chl-based LH-system, phylogenetic studies have concluded otherwise. Phylogenetic analyses based on the 16S rDNA gene indicate that the three prochlorophyte genera are polyphyletic within the cyanobacteria (Palenik and Haselkorn, 1992; Urbach et al., 1992). In agreement, phylogenetic studies based on 16S rDNA, metabolic core bacterial genes (e.g., rpoC1 and atpB/E) and photosynthesis genes present in all cyanobacteria (e.g., psbA/B) have consistently concluded that the closest relatives of Prochlorococcus are the marine strains of Synechococcus (e.g., Turner et al., 1989; Kishino et al., 1990; Lockhart et al., 1992a; Palenik and Haselkorn, 1992; Urbach et al., 1992; Hess et al., 1995; Ferris and Palenik, 1998; Moore et al., 1998; Partensky et al., 1999; Hess et al., 2001; Ting et al., 2001; Zeidner et al., 2003; Ashby and Houmard, 2006). Similarly, two cyanobacterial symbionts of didemnid ascidians that use different LH-antenna systems, Prochloron spp. and Synechocystis trididemni, evolved from a common ancestor that possessed phycobiliproteins (Shimada et al., 2003).

In all cases, the intrinsic proteins binding Chl a and b belong to a family that has six membrane-spanning α-helices (MSH) (Zhang et al., 2007). Interestingly, members of this family are also found in the cyanobacterium Acaryochloris marina, where more than 95% of the chlorophyll is in the form of Chl d. An analysis of this family of Chl-binding protein-coding genes from all four genera, including the Chl d LH protein, concluded that the genes for these proteins had been gained by LGT, probably in association with the genes for Chl b or Chl d formation (Chen et al., 2005b). Thus all the evidence so far suggests that the Prochlorococcus and marine Synechococcus group (PS-group) is not closely related to the other three genera of cyanobacteria which contain membrane-intrinsic Chl-based LH-systems.

The age of the Prochlorococcus spp. is not clear. An early investigation of the origins of prochlorophytes based on the 16S rDNA gene interpreted the short branch lengths of the Prochlorococcus cluster in parsimony and Neighbor-Joining (NJ) trees as evidence that they are a recent radiation (Urbach et al., 1998). More recent studies based on a much broader group of genes common to the PS-group taxa have also concluded that Prochlorococcus diverged from the marine Synechococcus relatively recently, but experienced accelerated evolution (Dufresne et al., 2005).

On the assumption that the genera possessing Chl b must be closely related to each other and related to the plastids, they were allocated a new order, Prochlorales, and referred to as “prochlorophytes” (Lewin, 1976). However, phylogenetic studies have found that the three genera of “prochlorophytes” are not closely related to each other (Palenik and Haselkorn, 1992; Urbach et al., 1992), or to Acaryochloris, which has a membrane-intrinsic system based on Chl d (Miyashita et al., 2003), or to the plastids (La Roche et al., 1996), so the relationship of the four lineages that have LH-antenna systems based on chlorophylls is unclear.

It is still unclear whether the taxonomic distribution of the novel pigments in prochlorophytes is best explained by the “Chl b-loss theory” (Tomitani et al., 1999), where the ancestor of the cyanobacteria had both Chl b and phycobilins, or the “Chl b-gain theory” (Palenik and Haselkorn, 1992; Urbach et al., 1992), where Chl b developed or was acquired several times in various lineages of cyanobacteria and higher plants (La Roche et al., 1996; Chen et al., 2005b). The simplest explanation for most extant cyanobacteria having a PBS-based LH-antenna system, however, is that they inherited it from a common ancestor. Hence, the chlorophyll-based LH-antenna system in the prochlorophytes and Acaryochloris could have evolved: (i) once, then been dispersed to other cyanobacterial lineages through LGT and diverged within the new host genomes; (ii) multiple times in the different lineages; or (iii) through some combination of these options.

However, the Chl-biosynthesis pathways are at least as old as the process of anoxygenic photosynthesis (Larkum, 2008; Raymond and Blankenship, 2008), so it is possible that the Chl-based system predates that based on phycobilisomes. That most cyanobacteria use a phycobilisome system could be due to a successful, recent radiation of that innovation, or to the loss of most of the diversity in the chlorophyll-using lineages.

1.4.4 Prochlorococcus and the Paulinella chromatophore

There is no longer debate that the plastids share a common ancestor with the cyanobacteria, but their evolutionary history is not well understood. From 1.2 Gya, and maybe much earlier, eukaryotic protists acquired the ability to perform oxygenic photosynthesis (Butterfield, 2000) through one (Bhattacharya and Medlin, 1995; Rodriguez-Ezpeleta et al., 2005) or multiple (Delwiche and Palmer, 1996; Stiller et al., 2003; Larkum et al., 2007) endosymbiotic events involving ancestors of the modern cyanobacteria (Bhattacharya and Medlin, 1998).

When cyanobacteria that use a chlorophyll-based LH-antenna system were first discovered in 1976, they were hailed as the missing link between the cyanobacteria and the plastids of higher plants and algae (Lewin, 1976). The hypothesis that the prochlorophytes are the closest relatives of the plastids was refuted by the finding that the Chl a/b binding (pcb) proteins of the prochlorophytes do not belong to the same protein-coding gene family as the eukaryotic Chl a/b and Chl a/c light-harvesting protein-coding genes (La Roche et al., 1996). Even the term “prochlorophyte” is no longer thought appropriate and the original proponent of the Prochlorophyta has suggested it be subsumed into the Cyanophyta or Cyanobacteria (Lewin, 2002).

Recent work on the evolutionary relationships of the Paulinella chromatophore, however, has re-ignited this issue. Phylogenetic studies of genes from the Paulinella chromatophore have concluded that the chromatophore shares a common ancestor with the clade consisting of Prochlorococcus and marine Synechococcus spp. (Marin et al., 2005, 2007). This is evidence that there have been at least two primary endosymbiotic events in the history of plastids and plastid-like organelles (Marin et al., 2005, 2007). The evolutionary histories of Prochlorococcus spp. are thus of particular interest because this genus may share a closer relationship to some plastids than do other cyanobacteria.

1.4.5 Model organisms for studying genome evolution in cyanobacteria

Prochlorococcus and the related marine Synechococcus have been studied intensively since their discovery. There are both low- and high-light ecotypes of Prochlorococcus, adapted to different depths in the water column (Rocap et al., 2002; Moore and Chisholm, 1999; Partensky et al., 1999). Cyanophage-assisted LGT may be a significant mechanism by which the Prochlorococcus genomes evolve (Mann et al., 2003; Sullivan et al., 2003). Adaptations to different nutrient environments are evident in marine Synechococcus, with some strains living in high-nutrient coastal regions and others in oligotrophic open oceans (Fuller et al., 2006; Zwirglmaier et al., 2007).

Thus, efforts have focused on genomic differences between Prochlorococcus ecotypes and marine Synechococcus strains which explain their light and nutrient adaptations, and on how genomic structure reveals the mechanisms by which these organisms have evolved (Coleman and Chisholm, 2007).

1.5 Genome evolution in Prochlorococcus and marine Synechococcus

1.5.1 Evidence of LGT

The evolutionary history of the Prochlorococcus and marine Synechococcus genomes appears to be relatively straightforward when considered in isolation. Phylogenetic analyses of taxon-specific core genes that are unlikely to have experienced LGT indicate that Prochlorococcus and marine Synechococcus evolved from a common ancestor (e.g., Palenik and Haselkorn, 1992; Urbach et al., 1992). Much of the genome of each lineage appears to have been inherited from this common ancestor (Kettler et al., 2007; Dufresne et al., 2008), but there is strong evidence that photosynthesis genes could have been transferred between the two groups by cyanophage-assisted LGT (Mann, 2003; Sullivan et al., 2003; Lindell et al., 2004). However, the relationship of these two genomic lineages to those of the rest of the cyanobacteria, the anoxygenic photosynthetic bacteria and the rest of the eubacteria is not so clear.

Phylogenomic studies of multiple genes from the first complete genomes of Prochlorococcus (Dufresne et al., 2003; Rocap et al., 2003) and marine Synechococcus (Palenik et al., 2003) suggest that their genomes, like those of other bacteria, have been subject to a significant level of LGT (Rivera et al., 1998; Jain et al., 1999).

The standard molecular marker used for the classification of bacteria, the 16S rDNA gene, was chosen because it is ubiquitous in bacteria and not prone to LGT because it is functionally constrained by the complex biochemical pathways to which it belongs (Shi and Falkowski, 2008). However, a recent study that concluded the 16S rDNA gene has a mosaic origin in Acaryochloris (Ueda et al., 1999; Yap et al., 1999; Miller et al., 2005) suggests even this gene is not immune to LGT and there are many studies that have demonstrated that the 16S rDNA molecule has been subject to LGT (for a review, see Gogarten et al., 2002).

Studies based on cyanobacterial-specific core genes from full genomes have found conflicting phylogenetic relationships among individual cyanobacteria and concluded that there must have been extensive LGT within this group and between cyanobacteria and other phyla (Raymond et al., 2002, 2003b; Zhaxybayeva et al., 2006). Another possibility, mentioned by Dufresne et al. (2005), is that phylogenies based on taxon-specific core protein sequences generally do not agree because the phylogenies are biased by the accelerated evolutionary rates of taxon-specific core genes in Prochlorococcus species. However, the most prevalent opinion in the literature is that this phylogenetic conflict in taxon-specific core genes is the result of extensive LGT within the cyanobacteria.

The photosynthesis-related genes of cyanobacteria are unreliable phylogenetic markers because they are part of the environment-specific genes of the “accessory” genome acquired by LGT (Mulkidjanian et al., 2006; Kettler et al., 2007). There is evidence that even key components of the photosynthetic machinery in Prochlorococcus and marine Synechococcus have been acquired through inter-specific, cyanophage-assisted LGT (Mann, 2003; Sullivan et al., 2003; Lindell et al., 2004).

1.5.2 Evidence of divergence from a common ancestor

Comparative studies of the Prochlorococcus and marine Synechococcus genomes have concluded that these taxa diverged from their common ancestor as recently as 150 million years ago (Mya) (Rappe and Giovannoni, 2003; Dufresne et al., 2005).

The entire genomes of the Prochlorococcus spp. have been subject to accelerated evolutionary rates, most likely due to loss of DNA-repair genes (Dufresne et al., 2003, 2005). Most of their metabolic core genome was inherited from their common ancestor and the taxon-specific core genes of 12 Prochlorococcus and 4 marine Synechococcus contain sufficient genes to encode a functioning cell (Kettler et al., 2007). The “accessory” genes perform non-essential functions or assist niche-adaptation (Coleman and Chisholm, 2007; Kettler et al., 2007; Dufresne et al., 2008). The finding that accessory genes identified from 11 Synechococcus and 3 Prochlorococcus were more frequently found in “genomic islands” of unusual GC content has led to the conclusion that genes for local niche adaptation were acquired relatively recently through cyanophage-assisted LGT (Coleman et al., 2006; Dufresne et al., 2008).

1.5.3 Evaluating the case for LGT

Whether LGT really is a significant mechanism by which cyanobacterial genomes evolve is still unclear. All evidence gathered so far strongly suggests that the cyanobacteria have had a complex evolutionary history. Like all bacteria, they reproduce through binary fission and the daughter cells inherit their genetic material vertically from their parent. However, it is also possible that selected genes, or even sizeable biochemical pathways, have been acquired by various lineages of the cyanobacteria by LGT events. The frequency of these events is probably much higher than the retention rate of genes acquired through LGT. However, over longer time periods that span species divergences, it seems reasonable to assume that any given cyanobacterial genome is a mosaic of vertically and horizontally acquired genes that have also evolved within the host genome to operate efficiently in concert with each other to enable the organism to survive in its environment.

Theoretical and empirical research into phylogenetic methods suggests that the phylogenetic conflict observed for the cyanobacteria could be an artefact of the phylogenetic inference method arising from properties of these data (Felsenstein, 2004). Unfortunately, without being sure whether phylogenetic analyses of taxon-specific core genes are accurate, it is difficult to draw conclusions about the mechanisms through which the cyanobacterial genomes have evolved.

Much of the evidence for our understanding of genome evolution in cyanobacteria has come from studies of Prochlorococcus and marine Synechococcus. There is more full genomic information available for these two groups of cyanobacteria than for any other, and since existing evidence suggests they have evolved relatively recently from their common ancestor, it is more likely that this recent evolutionary history can be recovered. When placed in the broad context of cyanobacterial evolution, however, in which LGT may or may not be affecting the metabolic core, the evolutionary history of Prochlorococcus and the marine Synechococcus is less clear.

If, as suggested by previous studies, the rate of LGT between these two lineages is even higher than that observed within the cyanobacteria as a whole (Zhaxybayeva et al., 2009), how can the Prochlorococcus and marine Synechococcus spp. maintain their identity while sharing vast amounts of genetic information with each other? If genetic material is being acquired by the Prochlorococcus and/or marine Synechococcus, is the exchange occurring primarily between the two lineages or are there also donors from the other cyanobacteria, the anoxyphotosynthetic bacterial lineages or non-photosynthetic bacteria? Or has the extent of LGT between cyanobacterial genomes been over-estimated?

Before being able to answer questions on the mechanisms of bacterial evolution in cyanobacteria, it is first necessary to evaluate whether the phylogenetic conflict observed for cyanobacteria in taxon-specific core genes is due to LGT or some other process.

1.6 Objectives of this thesis

The overall objective of this work is to infer the contributions of vertical and lateral gene inheritance in the evolution of the Prochlorococcus and hence the evolutionary processes by which the genomes of Prochlorococcus and marine Synechococcus have come to be assembled into their current form.

Existing attempts to address these questions using phylogenetic methods have found conflicts that could be due either to artefactual error from the inference process or to processes of genome evolution such as LGT.

The aims of this study are thus to:

1. perform preliminary investigations to evaluate the best method for obtaining and storing data on bacterial orthologues;

2. select a set of orthologues present across the eubacteria that are unlikely to result in phylogenies hindered by artefactual error;

3. infer the origins of the Prochlorococcus and marine Synechococcus genomes using phylogenetic methods;

4. estimate the effects that data and methodological issues have on the results and conclusions of these analyses; and

5. comment on the processes of genome evolution that have been important in the evolution of the Prochlorococcus and marine Synechococcus genomes, and more generally, of cyanobacteria.

1.7 Outline of this thesis

This thesis has six chapters. Preliminary analyses conducted to find a suitable orthologue identification method are described in Chapter 2. Methodological considerations and preliminary analyses for phylogenetic inference are discussed in Chapter 3. The methodology used to identify a reduced set of eubacterial-core protein-coding genes is described and defended in Chapter 4. The phylogenetic analyses and tests for alternative interpretations to LGT are described in Chapter 5. Lastly, the relevance and implications of the present work to our understanding of the evolution of the Prochlorococcus and marine Synechococcus genomes, and more broadly, of the cyanobacteria, are discussed in Chapter 6.

Chapter 2

Orthologue Inference: Methods and Preliminary Studies

2.1 Introduction

The starting point for many comparative genomic studies is an accurately-inferred set of orthologous genes (Altenhoff and Dessimoz, 2009), as defined in the next section. In this work, where the primary aim is to explore the evolutionary relationships of the Prochlorococcus and marine Synechococcus through phylogenetic inference, accurate sets of orthologues are crucial.

In this chapter, I define the terminology that will be used to describe evolutionary relationships in the remainder of this work, describe the issues that hinder accurate orthologue inference, the framework within which orthologue inference is performed, and the preliminary analyses conducted to formulate a suitable method.

2.1.1 A working definition of orthology

The terminology used to describe the evolutionary relationships of genes differs between studies, so I begin by explaining the terms used in this work. The definitions of Fitch (2000) have been generally adopted. Homology is used to describe the relationship of genes that have a common ancestor. It is an abstract concept which we cannot know definitively, but only infer with more or less certainty. Orthology is a relationship where the evolutionary history of the genes follows the pattern of speciation. That is, if you were to map the relationship of the genes onto the species tree, the most recent common ancestor of the genes would occur at a speciation event. Paralogy is the relationship where sequence divergence follows gene duplication (Fitch, 1970).

An example that demonstrates these relationships is provided in Figure 2.1. The three species, A, B and C, are descended from a common ancestor. Genes within these species have undergone speciation events, denoted by a fork, or duplication events, denoted by a horizontal bar. Any genes whose last common ancestor coincides with a speciation event are orthologues, for example, A1 and B1, A1 and B2, or A1 and C1. Any genes whose last common ancestor coincides with a duplication event are paralogues, for example, B1 and C2, B2 and C1, or C2 and C3. Thus, gene B1 is orthologous to C1, but B1 is paralogous to both C2 and C3. The red arrow indicates the lateral transfer of gene B1 into species A, now denoted AB1. Gene AB1 is xenologous to all other genes in this diagram.
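The rule described above (orthologues meet at a speciation node, paralogues at a duplication node) can be illustrated with a short sketch. The gene tree below is a hypothetical reconstruction chosen only to be consistent with the example pairs listed in the text; it is not copied from Figure 2.1, it omits the xenologue AB1, and the names `GENE_TREE`, `lca_event` and `relationship` are illustrative assumptions.

```python
# Hypothetical gene tree (an assumption, not a copy of Figure 2.1).
# Internal nodes are (event, left, right); leaves are gene names.
GENE_TREE = (
    "speciation",                          # species A splits from the B/C ancestor
    "A1",
    ("duplication",                        # duplication in the B/C ancestor
     ("speciation", "B1", "C1"),           # copy 1 follows the B/C speciation
     ("speciation", "B2",                  # copy 2 follows the B/C speciation
      ("duplication", "C2", "C3"))),       # later duplication within species C
)


def leaves(node):
    """Return the set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def lca_event(node, gene_1, gene_2):
    """Return the event label at the last common ancestor of two genes."""
    event, left, right = node
    for child in (left, right):
        # Descend while a single internal child still contains both genes.
        if not isinstance(child, str) and {gene_1, gene_2} <= leaves(child):
            return lca_event(child, gene_1, gene_2)
    return event


def relationship(gene_1, gene_2, tree=GENE_TREE):
    """Classify a pair of distinct genes by the event at their last common ancestor."""
    if lca_event(tree, gene_1, gene_2) == "speciation":
        return "orthologues"
    return "paralogues"


if __name__ == "__main__":
    for pair in [("A1", "B1"), ("B1", "C1"), ("B1", "C2"), ("C2", "C3")]:
        print(pair, relationship(*pair))
```

Under this assumed tree the classification agrees with every pair named in the text; a different but equally valid tree for Figure 2.1 could give different answers for pairs the text does not mention.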

Issues in orthologue inference

A key assumption of phylogenetic analyses of single genes is that the genes have been inherited from a common ancestor in such a way that the evolutionary relationships among the genes follow speciation (Fitch, 2000). Consequently, phylogenetic analyses will yield information on the evolutionary relationships among the taxa being studied. With the definitions above, the sequences used in phylogenetic analyses must be orthologous to one another.

Figure 2.1: Gene evolution showing orthologous, paralogous and xenologous genes. For a discussion of the relationships, refer to §2.1.1. Adapted from Fitch (2000).

The most common way to estimate homology is through sequence similarity (Needleman and Wunsch, 1970). If two sequences are so similar that it is virtually impossible that the similarity could be due to anything except descent from a common ancestor, the sequences are likely to be homologous (Fitch, 2000).

In the literature, the reciprocal BLAST hit (RBH) method is widely used to infer sets of genes that are an approximation of the set of orthologues across multiple genomes (Bork et al., 1998; Montague and Hutchison, 2000). In this method, one of the programs from the BLAST family is used to perform an “all-against-all” comparison of every gene against every other gene in each “genome” (Figure 2.2). This comparison is usually conducted using genes that have been translated into their corresponding amino-acid sequences. The best significant hit for every query sequence is saved. Significance is usually measured by the Expectation Value (E-value), which is an estimate of the number of matches at least this good that would be expected by chance in a search of a database of the same size (Altschul et al., 1990). All of these uni-directional relationships between pairs of protein-coding genes are then used to collate bi-directional best and significant hits between pairs of genes (Figure 2.3C). Pairs of protein-coding genes related by a reciprocal best BLAST hit relationship (Figure 2.3C) are then collated into clusters of protein-coding genes (Figure 2.3D).

Figure 2.2: All-against-all BLAST comparison. The BLAST program uses a given query gene and a given target “genome” (the list of all known protein-coding genes, denoted by the large coloured circles) to find the hits that are significant at a given E-value threshold. An all-against-all comparison involves using every gene in every “genome” as a query sequence to find the significant hits in every other “genome”. In this way, each “genome” is, in effect, compared with every other “genome” (black arrows).
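The pairing step of the RBH method can be sketched as follows. This is a minimal illustration, not the procedure used in this work: it assumes the hits for each search direction are already available as (query, target, E-value, bit score) tuples, that the best hit is the one with the highest bit score, and that ties are broken by keeping the first hit seen. The function names and gene identifiers are hypothetical.

```python
def best_hits(hits, max_evalue=1e-5):
    """Map each query to its best significant target for one search direction.

    hits: iterable of (query, target, evalue, bitscore) tuples.
    Ties on bit score keep the first hit encountered (an arbitrary choice).
    """
    best = {}
    for query, target, evalue, bitscore in hits:
        if evalue > max_evalue:
            continue  # not significant at the chosen threshold
        if query not in best or bitscore > best[query][1]:
            best[query] = (target, bitscore)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue=1e-5):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b, max_evalue)
    reverse = best_hits(b_vs_a, max_evalue)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    # Made-up hits between two hypothetical "genomes" A and B.
    a_vs_b = [("a1", "b1", 1e-50, 200.0), ("a1", "b2", 1e-10, 60.0),
              ("a2", "b2", 1e-30, 120.0), ("a3", "b3", 0.5, 20.0)]
    b_vs_a = [("b1", "a1", 1e-48, 198.0), ("b2", "a1", 1e-9, 58.0),
              ("b2", "a2", 1e-29, 118.0), ("b3", "a3", 0.4, 21.0)]
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

The tie-breaking rule matters in cases such as the equal-scoring hits shown in Figure 2.3C; keeping the first hit is only one of several defensible choices, and a more conservative analysis might discard tied queries altogether.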

The possibility that protein-coding genes may have been duplicated or lost, been acquired through xenologous replacement, or may be short or low-complexity sequences between which the BLAST programs cannot detect significant similarity, means that the method by which the clusters are collated is somewhat subjective. It relies on a definition of what properties a set of RBHs should have in order to be considered a cluster, and this definition depends on what the data will be used for. If the sets of RBHs are to be used for phylogenetic analysis, it is preferable to choose a method that is conservative in its identification and avoids false-positive RBHs.

Figure 2.3: The inference of clusters of RBH matches. (A) Two “genomes” (large circles) can be represented as two sets of genes (small circles). (B) The edges represent BLAST hit relationships between two genes. A uni-directional edge denotes a BLAST hit relationship. A bi-directional edge denotes a reciprocal BLAST hit relationship. (C) The RBH relationships can be abstractly represented as a graph where nodes denote genes and edges denote relationships among genes. In most cases, there are two protein-coding genes that are each other's RBH. However, in this case, the red protein-coding gene in the blue “genome” has equal-scoring hits to both the red and the orange genes in the green “genome”. (D) From this graph, we can attempt to recover sets of genes related by an RBH relationship.

Functional annotation of the potential clusters of orthologues is also problematic. A widely accepted method is based on the transitive allocation of annotation from one significantly similar protein-coding gene to another (for a review, refer to Altenhoff and Dessimoz, 2009). In practice, a sequence of unknown annotation is used as the query sequence for a BLAST search of a database of sequences of known annotation, such as the COG or KEGG databases, and the annotation of the best target sequence is transferred to that of the query sequence. Although this method has been criticised because it has the potential to perpetuate incorrect annotations from one sequence to another (Brown and Sjolander, 2006), this method is used routinely in the annotation of newly-sequenced genes and genomes.

Orthology is all the more difficult to infer because genes can be duplicated before or after a speciation event, be acquired through lateral gene transfer, or have evolved to be similar due to selective pressures that promote convergent evolution (Fitch, 2000). The most problematic histories, however, involve the loss of genes in some lineages, making it hard or impossible to infer whether two genes are, in fact, orthologous. In the example in Figure 2.1, if only genes B1 and C2 were sampled, it would be impossible to know that the true orthologue of B1 in species C is C1.

The most common way to overcome the above obstacle is to compare the function of the genes in question. If they are so similar that they must share a common ancestor and the genes still perform the same function in the species from which they were sampled, they are assumed to be orthologous because the simplest explanation for the conserved gene function is that the function has been preserved across speciation events without duplication. This is also an inference: it is possible that the gene duplication has been so recent that both copies have retained the same function, but in either case these two genes would be orthologous. Genes C2 and C3 in Figure 2.1 provide an example: both would be orthologous to sequence B2, and hence, the gene tree should be the same as the species tree for this set of taxa. More recently, whole-genome sequencing has been used to identify single-copy genes which are, rightly or wrongly, assumed to be orthologues.

2.1.2 Orthologue inference using graph theory

The application of graph theory (Bondy and Murty, 2007) can conceptually simplify the task of finding groups of the most similar sets of genes, which can be used as approximations for sets of homologous or orthologous protein-coding genes, across multiple genomes. For a set of “genomes”, where each “genome” has been simplified to be the set of protein-coding genes, each protein-coding gene can be represented as a node in a graph (Figure 2.3A) and each homologous relationship a directed edge corresponding to a significant BLAST hit from one protein-coding gene to another (Figure 2.3B). In such a system, a set of homologous protein-coding genes in different organisms would correspond to a set of protein-coding genes from different “genomes” related by a dense coverage of edges connecting protein-coding genes in the set (Figure 2.3C). Thus, the task of finding a set of orthologues becomes one of finding a clustering operation that reliably retrieves sets of orthologues without a high rate of false positives or negatives (Figure 2.3D).

Existing studies have mostly used variants of the RBH method to infer sets of genes that are an approximation of the set of orthologues across the taxa being examined. This involves using a program from the BLAST family to compare every protein-coding gene in every “genome” with every other protein-coding gene in every other “genome” in what is referred to as an “all-against-all” comparison (Figure 2.2). From these, the reciprocal best and significant hits are retrieved, and a clustering method is used to retrieve groups of protein-coding genes which are suitable for use.
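One simple clustering operation on such a graph is to take connected components of the undirected RBH edges and then keep only the components with exactly one gene from every “genome”. The sketch below illustrates that idea only; it is an assumed example rather than the clustering method evaluated in this work, and the `genome|gene` identifier format and function names are hypothetical.

```python
from collections import defaultdict


def cluster_rbh_pairs(rbh_pairs):
    """Group genes into connected components of the undirected RBH graph."""
    parent = {}

    def find(gene):
        # Union-find lookup with path halving.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for gene_a, gene_b in rbh_pairs:
        root_a, root_b = find(gene_a), find(gene_b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for gene in list(parent):
        clusters[find(gene)].add(gene)
    return sorted(sorted(members) for members in clusters.values())


def single_copy_clusters(clusters, n_genomes):
    """Keep clusters with exactly one gene from each of n_genomes genomes."""
    kept = []
    for cluster in clusters:
        genomes = {gene.split("|")[0] for gene in cluster}
        if len(cluster) == n_genomes and len(genomes) == n_genomes:
            kept.append(cluster)
    return kept


if __name__ == "__main__":
    # Made-up RBH pairs among three hypothetical genomes A, B and C.
    pairs = [("A|g1", "B|g1"), ("B|g1", "C|g1"), ("A|g1", "C|g1"),
             ("A|g2", "B|g2"),
             ("A|g3", "B|g3"), ("B|g3", "C|g3"), ("C|g3", "A|g4")]
    clusters = cluster_rbh_pairs(pairs)
    print(single_copy_clusters(clusters, n_genomes=3))
```

Connected components are permissive, since a single spurious edge merges two clusters; the single-copy filter is what makes the result conservative, at the cost of discarding clusters that contain recent duplicates or are missing a genome.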

2.1.3 Functional annotation of most-similar clusters of protein-coding genes

The only way to be certain that a set of protein-coding genes genuinely encodes proteins with the same function is to experimentally verify each one. Since such experiments are prohibitively expensive, only a small proportion of such proteins have been verified in this way. It is possible to attempt to infer function via the folded structure, domains, surface features or motifs present in a protein. However, the vast majority of genes are assigned the functional annotation of significantly similar genes in a curated database of functionally classified genes, such as the COG or the KEGG databases.

Every protein-coding gene in a completed genome available in the NCBI REFSEQ database has a functional annotation which has, in most cases, been acquired through sequence similarity and the quality of the annotation can vary quite dramatically (Ogata et al., 1999). In some cases, the annotation is quite detailed, stating the full name of the gene and the gene product, and the identifier of the match in the curated database. In other cases, much less information is available. The gene in question may be similar to, but quite divergent from a known gene, so it is described as being “like” that protein-coding gene. Genes of unknown function that are conserved in different organisms may be described as a “highly conserved gene of unknown function”. If no significantly similar genes can be found, it can be described as having “unknown” function or as a “hypothetical protein”.

Allocating a consensus functional annotation to a set of RBH clusters is a routine but problematic operation in genomic studies. Genes in newly-sequenced genomes are usually allocated the annotation of the most similar protein-coding gene or cluster of protein-coding genes in a curated database, but the metric of similarity is not flawless (Koski et al., 2001), especially if a protein-coding gene in the curated database is a paralogue to the unknown protein-coding gene, so erroneous annotations can creep into trusted online databases. Taking the consensus annotation of several protein-coding genes that have been annotated by this “transitive” method only compounds the error, and without any other point of reference these errors cannot be detected.

One example of how unreliable annotation can be is a comparison of annotations of the Mycoplasma genitalium genome. Three separate groups attempted to annotate 458 genes in this organism. They managed to annotate 340 of the 458 genes, but 8% of the annotations from the three groups did not agree, and hence are suspect. And the cases where different groups arrived at the same, but incorrect, annotation are not taken into account, so the true error rate is probably somewhat higher (Brenner, 1999).

In response to all these issues and problems, I tried several strategies before I decided on the best method for my data.

2.1.4 Motivation for preliminary studies

At the beginning of my candidature, my overall aim was to investigate genome evolution in Prochlorococcus and marine Synechococcus in comparison with other cyanobacteria.

Previous studies had concluded that these two groups had evolved relatively recently and rapidly from a common ancestor. At the genomic scale, several Prochlorococcus genomes had experienced accelerated evolutionary rates and extensive gene loss. Genetic innovations in at least one of these groups appeared to have been achieved through cyanophage-assisted acquisition of genes that enabled the invasion and success of environmental niches that had not been accessible beforehand. The extent to which these two groups were genetically distinct was thus unclear.

The xenologous acquisition of genes could have been a one-off, chance occurrence that marked the start of the divergence and speciation towards two genetically distinct groups that have different environmental adaptations, based on a diverging complement of biochemical pathways (Rocap et al., 2002). It is also possible that the two lineages have experienced an ordinary evolutionary rate over longer periods of time, but recombination between the two lineages, or with other cyanobacteria, has helped to retain the phylogenetic coherence of this lineage (Zhaxybayeva et al., 2009). This scenario is appealing in that it would explain the conflict in phylogenies inferred from different genes.

The two obvious approaches to a study of the origins of the Prochlorococcus and marine Synechococcus genomes were: (i) a comparison of the evolutionary relationships of shared genes; and (ii) examination of the biochemical capabilities conferred by the unique genes in either group.

In either case, the analyses would require a basic data set of accurately inferred orthologues and the use of robust and accurate phylogenetic methods. Both would require accurate functional annotation of the inferred orthologues. Lastly, an efficient and extensible framework for the manipulation and storage of data would be required.

Since orthology cannot be observed directly, BLAST-based methods for inferring sets of genes related by RBH relationships were used as an approximation for sets of orthologues. Thus, in the early stages of this work, preliminary investigations were carried out to familiarise myself with the domain-specific problems with the data and to identify methods that would be suitable for its further preparation and analysis.

2.1.5 Aims

In this chapter, I describe preliminary investigations that were used to:

1. evaluate the suitability of BLAST comparisons as the basis for inferring RBH pairs of protein-coding genes and the most suitable thresholds for significance;

2. find the most suitable clustering method for inferring clusters of genes related by RBH relationships;

3. identify a suitable method for the functional annotation of RBH clusters;

4. devise a set of relational database tables for storing the information; and

5. develop a software framework for manipulating the data in further work.

2.2 Preliminary studies

2.2.1 Finding suitable BLAST parameters to detect homologous protein-coding genes

Background

In a search for two DNA or amino-acid sequences that are likely to be orthologous, a reasonable starting point is the set of genes that are more similar to each other than to any other genes. Such similarity is routinely used as a proxy for shared ancestry when the significance of the similarity is so high that it is virtually impossible for the two sequences to have evolved independently (Needleman and Wunsch, 1970; Fitch, 2000).

The BLAST family of programs is the most widely used and accepted method for identifying significant similarity between nucleotide or amino-acid sequences (Cameron et al., 2006) and has been shown to be an effective way to infer pairs of genes that are likely to be homologous (Altenhoff and Dessimoz, 2009). It is generally accepted that the substitution matrix and the gap opening and gap extension costs are the BLAST parameters with the most influence on the alignment generated between two sequences, and hence, on the resulting Expectation value (E-value) (Hall, 2004).

There are two main series of substitution matrices: PAM (Point Accepted Mutation) (Dayhoff et al., 1978; Schwartz and Dayhoff, 1978) and BLOSUM (BLOcks Substitution Matrices) (Henikoff and Henikoff, 1992). Different substitution matrices within the same series are tuned to detect similarities between sequences with differing degrees of divergence (Altschul, 1991; States et al., 1991; Altschul, 1993). The NCBI website contains a page on BLAST substitution matrices that suggests matrices should be chosen on the basis of the length of the query sequence (Table 2.1) (NCBI website, 2005). The best general purpose substitution matrix for detecting most weak protein-coding gene similarities is BLOSUM-62, named for the 62% identity threshold used to cluster the aligned sequences from which it was derived (page 10,916 in Henikoff and Henikoff, 1992).

Table 2.1: BLAST parameters suggested on the NCBI website

Query length            Substitution matrix    BLASTP default gap costs (opening, extension)
less than 35            PAM-30                 (9, 1)
35-50                   PAM-70                 (10, 1)
50-85                   BLOSUM-80              (10, 1)
~85                     BLOSUM-62              (11, 1)
much greater than 85    BLOSUM-45              (14, 2)

The BLOSUM-62 matrix with gap opening and extension costs of 11 and 1, respectively, is the default option programmed into BLASTP version 2.2.8 (Table 2.1). Most studies of bacterial protein-coding genes have used these default BLAST parameters with E-value thresholds ranging from 10^-20 to 10^-4 (e.g., Tatusov et al., 1997; Zhaxybayeva and Gogarten, 2002; Millard et al., 2004), with 10^-5 the most widely used (Hall, 2004).

The BLASTP parameters chosen must return significant query hits only when the query and target sequences are likely to be homologous, without an unacceptably high proportion of false positives (Type I errors) or false negatives (Type II errors).

This section describes work done to identify the substitution matrix, associated default gap opening and extension costs (Table 2.1), and E-value cut-off that, when used with BLASTP, will return accurate pairs of homologous protein-coding genes among three Prochlorococcus “genomes”.

Methods

The sets of translated protein-coding genes of three closely-related Prochlorococcus marinus genomes, strains CCMP1986 (also known as MED4), CCMP1375 (also known as SS120) and MIT9313, were downloaded from the NCBI REFSEQ database as FASTA-format files.

These three genomes were chosen because they have a small, intermediate and large number of protein-coding genes, respectively.

An all-against-all comparison of the sets of protein-coding genes was conducted using the BLASTP program with five substitution matrices from the PAM and BLOSUM series, each with the default gap opening and extension costs programmed in BLASTP version 2.2.8 (Table 2.1). The E-values were calculated by the BLASTP program (Altschul et al., 1990, 1997) and are based on the size of the target database, which in the present study contained one sequence per protein-coding gene. All query hits with an E-value of up to 10 were retained, and matches were examined at E-value thresholds from 10^-5 to 10. The lower cut-off was chosen because it is a relatively permissive cut-off that should detect the most similar matches. The overly-permissive cut-off of 10 was chosen to check whether using a level that would not normally be considered significant could result in matches between pairs of very low similarity. The output of the BLASTP searches was inspected for: (i) the number of matches returned; (ii) the quality of alignments and E-values; and (iii) the differences that E-value thresholds make.
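The tallying step described above can be sketched as follows. This is a minimal illustration under stated assumptions: hits are supplied as (query, target, E-value, bit score) tuples, the thresholds are taken to be the seven powers of ten from 10^-5 to 10, and a "best hit" is counted once per query that has at least one hit passing the threshold. The names `THRESHOLDS` and `tally_hits` are hypothetical and the data are made up.

```python
# Assumed thresholds: seven powers of ten from 1e-5 to 10.
THRESHOLDS = (1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0)


def tally_hits(hits, thresholds=THRESHOLDS):
    """Return {threshold: (total_hits, best_hits)}.

    total_hits counts every hit with an E-value at or below the threshold;
    best_hits counts the queries that have at least one such hit.
    """
    tally = {}
    for threshold in thresholds:
        passing = [query for query, _, evalue, _ in hits if evalue <= threshold]
        tally[threshold] = (len(passing), len(set(passing)))
    return tally


if __name__ == "__main__":
    # Made-up hits for three hypothetical query genes.
    hits = [("q1", "t1", 1e-30, 150.0), ("q1", "t2", 1e-4, 40.0),
            ("q2", "t3", 0.5, 22.0), ("q3", "t4", 5.0, 18.0)]
    for threshold, (total, best) in tally_hits(hits).items():
        print(threshold, total, best)
```

Running the same tally once per substitution matrix would give the kind of matrix-by-threshold counts that the inspection in (i) and (iii) compares.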

Results

The number of matches significant at seven E-value thresholds between 10-5 and 10 (Figure 2.4) was plotted for each of the five substitution matrices. The total number of BLASTP hits significant at a specified E-value cut-off was approximately equal across the five substitution matrices (Figure 2.4A), although the BLOSUM-62 matrix consistently identified slightly more matches than the other matrices (Figure 2.4B). The number of matches generated with an E-value threshold of 10 was approximately 5 times the number generated with the standard similarity threshold used for homology inference of 10~5 (Figure 2.4B), with most of the additional matches having E-values between 1 and 10 (Figure 2.4). The number of best hits for the same set of substitution matrices and E-value cut-offs was also plotted (Figure 2.4C,D). Only about 20% of the hits correspond to best hits, the remainder being lower quality hits to other protein-coding genes in the target “genome”. Individual inspection of the raw BLAST output files revealed that the lower scoring hits were often not as good as the top hit. For example, the search for the most similar protein­ coding genes from Prochlorococcus strain CCMP1986 in strain CCMP1375 recovered two 44 A 120000 B 120000

Figure 2.4: Number of BLASTP query hits among three Prochlorococcus “genomes”. The number of query hits using one of five substitution matrices (PAM30, PAM70, BLOSUM-80, BLOSUM-62 and BLOSUM-45) and E-value cut-offs between 10⁻⁵ and 10 was generated using BLASTP v2.2.8. The total number of query hits is shown grouped by (A) matrix and (B) E-value. The number of best query hits is shown grouped by (C) matrix and (D) E-value. [Bar charts; x-axes: substitution matrix (A, C) and E-value cut-off (B, D).]

hits for the protein-coding gene with GI 33861487 (Figure 2.5). The hits to both GI 33240216 and 33240378 would be considered significant at the E-value = 10⁻⁵ level. However, the first hit is clearly a much better match for the query sequence than the second, in E-value, score, alignment and annotation (Figure 2.5). A similar pattern was found in manual inspection of the raw BLAST output for the six possible pairwise searches among the three Prochlorococcus “genomes”. Therefore, all further analyses were conducted with only the best hits.

Based on previous analyses, an E-value between 10⁻⁵ and 10⁻³ was expected to be the most suitable for detecting the most similar sequences. To test this, the best hits significant at E-values between 10⁻⁵ and 10 were inspected for BLASTP runs with query sequences from Prochlorococcus str. CCMP1986 against Prochlorococcus str. CCMP1375 (1,882 protein-coding genes). The BLOSUM-62 matrix was used for the comparison since it was found to consistently yield more matches at every E-value tested than the alternatives (Figure 2.4D).

The query genome Prochlorococcus str. CCMP1986 has 1,712 protein-coding genes and the target Prochlorococcus str. CCMP1375 has 1,882 protein-coding genes. Using the BLOSUM-62 matrix, blastp recovered 1,493 significant matches at the E = 10⁻⁵ level, increasing to 1,514 at the E = 10⁻³ level and 1,708, or almost all of the query sequences, at the E = 10 level (Table 2.2). Most of the alignments with an E-value less than 10⁻⁵ were very good. The additional 21 matches at the E = 10⁻³ level were for query sequences that were shorter, contained low complexity regions, or were simply lower similarity matches. In some cases they appeared to be good matches; in others, they probably were not. A few photosynthesis protein-coding genes such as “Photosystem II protein PsbK” (GI 33860830 and 33239756) were detected with an E-value of 3 × 10⁻⁴ (data not shown). For a conservative search, it appears that an E-value of 10⁻⁵ is suitable. For a more permissive search, better suited to matching more divergent sequences, an E-value of 10⁻³ could be

Query= MDJJ00927 gi|33861487|ref|NP_893048.1| Pyruvate dehydrogenase E1 beta subunit [Prochlorococcus marinus subsp. pastoris str. CCMP1986]
         (327 letters)

Database: Prochlorococcus marinus subsp. marinus str. CCMP1375
           1882 sequences; 516,756 total letters

Searching....done

                                                                      Score    E
Sequences producing significant alignments:                           (bits) Value

MGWN00764 gi|33240216|ref|NP_875158.1| Pyruvate dehydrogenase E1...    570   e-164
MGWN00926 gi|33240378|ref|NP_875320.1| Deoxyxylulose-5-phosphate...     55   2e-09

>MGWN00764 gi|33240216|ref|NP_875158.1| Pyruvate dehydrogenase E1 component beta subunit [Prochlorococcus marinus subsp. marinus str. CCMP1375]
          Length = 327

 Score = 570 bits (1469), Expect = e-164
 Identities = 283/324 (87%), Positives = 300/324 (92%)

Query: 1   MASTLLFTALKEAIDEEMANDVNVCIMGEDVGQYGGSYKVTKDLYEKYGELRVLDTPIAE 60
           MA TLLF AL+EAIDEEMA D +VC+MGEDVGQYGGSYKVTKDLYEKYGELRVLDTPIAE
Sbjct: 1   MAGTLLFNALREAIDEEMARDPHVCVMGEDVGQYGGSYKVTKDLYEKYGELRVLDTPIAE 60

[remaining alignment rows, query positions 61 to 324, not reproduced]

>MGWN00926 [Prochlorococcus marinus subsp. marinus str. CCMP1375]
          Length = 643

 Score = 55 bits (132), Expect = 2e-09
 Identities = 55/249 (22%), Positives = 97/249 (38%), Gaps = 24/249 (9%)

[alignment rows, query positions 43 onwards, not reproduced]

Figure 2.5: Sample BLAST output for Prochlorococcus CCMP1986 vs. CCMP1375. Raw output from blastp showing query sequence GI 33861487 from Prochlorococcus str. CCMP1986 and the two hits in Prochlorococcus str. CCMP1375 that are significant at the E = 10⁻⁵ level.

used.

Table 2.2: BLASTP matches for Prochlorococcus CCMP1986 protein-coding genes in str. CCMP1375

E-value cut-off    Number of best hits using BLOSUM-62
10⁻⁵               1,493
10⁻⁴               1,504
10⁻³               1,514
10⁻²               1,527
10⁻¹               1,566
1                  1,708
10                 1,708

To test the effect of the substitution matrix on the matches found with BLASTP, the matches found with the BLOSUM-62 matrix across the six pairwise comparisons between the three Prochlorococcus “genomes” were compared with those found with the BLOSUM-80 matrix, which returned the next highest number of matches (Figure 2.4D). There were 4,471 matches found with the BLOSUM-62 matrix and 4,435 found with the BLOSUM-80 matrix. The 48 matches that were only found using BLOSUM-62 appeared to be homologues, as judged by visual inspection. The 12 matches only found using BLOSUM-80 were generally of lower quality and uncertain homologues. The 14 cases where the matrices found different targets for the same query arose mostly because the two matrices ordered the top two matches differently. However, manual inspection revealed that it was the BLOSUM-62 matrix that most accurately identified the target sequence which was most similar and annotated as the same protein-coding gene.

Comparison of BLOSUM-62 and other matrices at the E = 10⁻⁵ level revealed similar trends. That is, in most cases the substitution matrix and associated gap opening and extension costs did not affect the sequences identified as the best hit in the target “genome”. The E-value reported for the match could vary by several orders of magnitude, due to differences in the alignment or use of a different substitution matrix, but the difference was usually small enough that the match would be deemed significant at the same E-value significance threshold. For the cases where the E-value reported was higher than the threshold, the values were usually found to be less than 10⁻³.
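The comparison between two matrices reduces to a few set operations on the query-to-best-target mappings from each run. The following is a minimal Python sketch of that bookkeeping, not the code used in this study; the function name compare_best_hits and the toy identifiers are hypothetical. As a consistency check on the counts reported above, 4,471 - 48 and 4,435 - 12 both give 4,423 queries matched under both matrices.

```python
def compare_best_hits(first, second):
    """Compare two {query: best_target} mappings from separate BLASTP runs."""
    shared = first.keys() & second.keys()
    return {
        # Queries with a best hit under one matrix only
        "only_first": set(first) - set(second),
        "only_second": set(second) - set(first),
        # Queries matched under both matrices, split by agreement on the target
        "different_target": {q for q in shared if first[q] != second[q]},
        "same_target": {q for q in shared if first[q] == second[q]},
    }


# Toy data with hypothetical identifiers
blosum62 = {"q1": "t1", "q2": "t2", "q3": "t3"}
blosum80 = {"q1": "t1", "q2": "t9", "q4": "t4"}
summary = compare_best_hits(blosum62, blosum80)
```

The "different_target" group corresponds to the cases that needed manual inspection, where the two matrices ordered the top matches differently.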

Conclusions

Ideally, the BLASTP comparison would be run using the substitution matrix and gap costs most suited to the length of each query sequence. However, I called blastp through the NCBIStandalone.blastall function from the Biopython package, and the same matrix and gap costs must be used for the entire set of query sequences against the target “genome” sequences, so this would not be possible without partitioning the input sequences into groups by length. Thus, the best strategy for the identification of significantly similar pairs of amino acid sequences with a single run of BLASTP is to use only: (i) the best, most significant hits; (ii) the default BLOSUM-62 matrix with the default gap opening cost of 11 and gap extension cost of 1; and (iii) an E-value threshold between 10⁻⁵ and 10⁻³.
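The "best hit only" strategy can be illustrated with a short Python sketch. It assumes, purely for illustration, that the search results are available in the 12-column tab-delimited BLAST format (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score) rather than through the Biopython interface used in this study. The function name best_hits is hypothetical, and in the demo rows only the identifiers, percent identity, alignment length, E-value and bit score are taken from Figure 2.5; the remaining columns are zero placeholders.

```python
def best_hits(lines, max_evalue=1e-5):
    """Return {query: (target, evalue, bitscore)} keeping the top-scoring hit per query."""
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, target = fields[0], fields[1]
        evalue, bits = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # not significant at the chosen threshold
        if query not in best or bits > best[query][2]:
            best[query] = (target, evalue, bits)
    return best


# Two hits for GI 33861487 (values from Figure 2.5; zero columns are placeholders)
demo = [
    "33861487\t33240216\t87.0\t324\t0\t0\t0\t0\t0\t0\t1e-164\t570",
    "33861487\t33240378\t22.0\t249\t0\t0\t0\t0\t0\t0\t2e-09\t55",
]
top = best_hits(demo)
```

Both hits pass the 10⁻⁵ threshold, but only the higher-scoring hit to GI 33240216 is retained as the best hit.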

Running BLASTP several times with different matrices and gap costs may discover superior alignments with lower E-values, but the parameters above are sufficient to identify highly similar amino acid sequences across the Prochlorococcus spp.

Once the protein-coding genes and BLAST hits are mapped to nodes and directed edges in a graph, one still has to find a reliable and efficient method for inferring sets of orthologues in the graph. Problematically, a pairwise comparison is not always symmetric, in that the E-value and/or the top hit can differ between the two search directions. However, in most cases, the E-value is significant in both directions when using a suitable threshold between 10⁻⁵ and 10⁻³. The older fully-sequenced genomes also suffer from gene-calling and annotation errors (for discussion, see Poptsova and Gogarten, 2010). The caveat is that any method that constructs sets of RBH matches must be robust to missing relationships and potentially confusing relationships between protein-coding gene families. These problems are addressed in the following studies.

2.2.2 Iterative clustering using paraclique topologies of RBH relationships

Background

Sets of eubacterial-core orthologues were required for phylogenetic analysis. It is reasonable to expect that the protein-coding genes shared across a divergent set of eubacteria will: (i) be relatively few in number; (ii) consist of essential housekeeping genes that are necessary for the survival of any eubacterium; (iii) be highly conserved across evolutionary time; (iv) be part of complex biochemical pathways that are resistant to xenologous replacement; and (v) have highly consistent annotation since they are well studied and characterised.

In graph theory, a graph in which every node is connected to every other node is known as a “completely connected graph”; if a subgraph has all its nodes connected to every other node in the subgraph, then it is a “clique” (Bondy and Murty, 2007). In this section, I describe seed-based methods for inferring sets of orthologues related by a clique, or a clique approximation (“paraclique”), of RBHs.

Methods

A set of 36 eubacteria (consisting of 17 cyanobacteria, 6 green sulfur bacteria, 1 green non-sulfur bacterium, 2 purple bacteria and 10 non-photosynthetic bacteria) and 4 cyanophages was selected for the study. The complement of translated protein-coding genes for each bacterium was downloaded from the NCBI REFSEQ database (Pruitt et al., 2005) between 26 and 29 October 2005.

An all-against-all comparison was conducted using the blastp program with a relatively permissive E-value significance threshold of 10⁻³. The RBHs were collated for protein-coding genes in every possible pair of “genomes”. For every protein-coding gene in every “genome”, the set of protein-coding genes related by one RBH was collated.
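The collation of RBHs for one pair of “genomes” can be sketched in a few lines of Python. This is an illustrative sketch rather than the program used here; it assumes the best hit of every query has already been extracted into a dictionary for each search direction, and the function name and toy identifiers are hypothetical.

```python
def reciprocal_best_hits(a_to_b, b_to_a):
    """Return the set of (a, b) pairs that are each other's best hit.

    a_to_b maps each gene in "genome" A to its best hit in "genome" B,
    and b_to_a does the same for the reverse search.
    """
    return {(a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a}


# Toy example: a2 and a3 both hit b2, but b2 only hits a3 back
a_to_b = {"a1": "b1", "a2": "b2", "a3": "b2"}
b_to_a = {"b1": "a1", "b2": "a3"}
rbh = reciprocal_best_hits(a_to_b, b_to_a)
```

In the toy example the pair (a2, b2) is dropped because the relationship holds in one direction only, which is the kind of asymmetry that later causes missing edges in the RBH graph.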

To identify all maximal completely-connected subgraphs in the graph formed by RBH relationships among the associated protein-coding genes, I implemented the polynomial-time algorithm of Stix (2004) in Python. Unfortunately, even after some simple optimisations, the method was too slow to run on these data.

An alternate method was thus devised based on set intersection. Since the query protein-coding gene and the protein-coding genes to which it is connected by RBH relationships form a pattern that resembles the sector of a folding fan, these sets were termed “fans”. A single file containing the number of chromosomes, the number of sequences, the protein-coding gene at the centre of the fan, and the list of sequences (including the central protein-coding gene) in alphabetical order was collated for each protein-coding gene. Unique fans which contained exactly N organisms and N sequences (where N can be a number between 1 and 40, but more typically between 1 and 36 because most protein-coding genes did not occur in the 4 phage “genomes”), that is, having only one representative per organism, were easily identified because the alphabetically-sorted sets would be identical. Abstractly,

Figure 2.6: Inference of orthologous sets of protein-coding genes from RBH relationships. In a scenario with five “genomes”, the RBH relationships of a seed protein-coding gene to protein-coding genes in the other “genomes” can be described as a fan (A). If the fans from each of the target protein-coding genes in (A) are added, the protein-coding genes and RBH edges form a network (B). If an orthologous set of protein-coding genes is defined as one that is connected by RBHs that form a clique, then in this example, the set formed by RBH matches would have four members (C). If this criterion is relaxed to permit a modest number of missing edges, the set formed by RBH matches would have five members and form a paraclique (D).

this corresponds to finding sets of protein-coding genes where the fan from each of the member nodes overlaps exactly (Figure 2.6). For an example with five “genomes”, the protein-coding genes and RBHs from a single query form a fan (Figure 2.6A). The fans for all protein-coding genes in the set formed by the original query sequence and the protein-coding genes connected by an RBH form a network (Figure 2.6B). In the example featuring five “genomes” in Figure 2.6, the largest clique that includes the source protein-coding gene has four protein-coding genes (Figure 2.6C). In addition, subgraphs that were paracliques, in that they were missing only a few of the RBH connections, were explored (Figure 2.6D).
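The fan-based search for cliques can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the original implementation: fans are held in memory as frozensets rather than written to files, and the names build_fans, clique_sets and genome_of are hypothetical. A set is reported only when every member has the identical fan and each “genome” contributes a single member, which is equivalent to the members forming a clique with no RBH edges leaving the set.

```python
from collections import defaultdict


def build_fans(rbh_edges):
    """Map each gene to its fan: the gene itself plus all genes one RBH away."""
    neighbours = defaultdict(set)
    for u, v in rbh_edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    return {gene: frozenset(nbrs | {gene}) for gene, nbrs in neighbours.items()}


def clique_sets(fans, genome_of):
    """Return fans shared by all of their members, with one gene per genome."""
    centres_by_fan = defaultdict(set)
    for gene, fan in fans.items():
        centres_by_fan[fan].add(gene)
    result = []
    for fan, centres in centres_by_fan.items():
        one_per_genome = len({genome_of[g] for g in fan}) == len(fan)
        if centres == set(fan) and one_per_genome:
            result.append(fan)
    return result


# Toy data: a2, b2, c2 form a clique; d1 is attached to a1 only,
# so the fans of a1, b1 and c1 do not overlap exactly.
edges = [("a1", "b1"), ("a1", "c1"), ("b1", "c1"), ("a1", "d1"),
         ("a2", "b2"), ("a2", "c2"), ("b2", "c2")]
genome_of = {g: g[0] for pair in edges for g in pair}
cliques = clique_sets(build_fans(edges), genome_of)
```

The toy data also show how strict the exact-overlap criterion is: the single extra edge from a1 to d1 is enough to exclude the otherwise complete set a1, b1, c1.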

Results

Since the analysis was based on 36 bacterial genomes and 4 cyanophages, and many genes common to the bacteria would not be expected to be present in the cyanophages, the most useful sets of protein-coding genes would be those represented here as fans with exactly 36 sequences found in the 36 main chromosomes of the bacteria. The analysis yielded 104,315 fans, where each fan contained up to 111 sequences (Figure 2.7B) from 43 chromosomes (Figure 2.7A). The distribution of chromosomes represented in each fan shows that while many fans contain representatives from a small number of chromosomes, which are not useful for further analysis, there is still a sizeable proportion with representatives from close to 36 organisms (Figure 2.7A). Unfortunately, some fans contained many representative sequences from the same “genomes”: 651 of the 104,315 fans (0.62%) of size 2 or more contained between 40 and 111 sequences (Figure 2.7B). These large fans contained protein-coding genes from large gene families that have experienced gene duplication and divergence.

Figure 2.7: Histograms of the number of organisms and sequences in RBH fans. Frequency distribution of (A) number of chromosomes and (B) number of sequences in the fans collated from 36 eubacteria and 4 cyanophages. [Histograms; x-axes: number of chromosomes per fan (A) and number of sequences per fan (B); y-axes: frequency.]

These latter fans were so large, contained so many duplicate members from the same “genome” and were so inter-connected by RBH relationships with other fans, that it would be very hard, if not impossible, to reliably separate them into sets that contained only one member per “genome” (Figure 2.8). Once these large fans were discarded, however, the remaining fans suggested that a paraclique clustering topology might be suitable for my eubacteria.
Through inspection of the fans of a limited sample of genes, I found that if a single protein-coding gene was selected, and its fan augmented by another level of RBH relationships, the resulting set contained very few protein-coding genes that were not connected to every other protein-coding gene in the set, and was very close to the clique formed by the set of protein-coding genes in the original fan. For the cases where there was more than one protein-coding gene from the same “genome”, examination of the multiple alignment, Neighbor-Joining tree and/or the annotation was able to distinguish the protein-coding gene that was most similar to the other protein-coding genes in the set.

Unfortunately, closer inspection revealed problems preventing automation on a large scale. There was no single E-value threshold that could reliably identify sets of proteins likely to be homologous. Also, the clique criterion was too strict, excluding any potentially orthologous sets with coverage of fewer than 36 eubacteria. The input data, consisting of RBHs, are imperfect in that they may contain edges that are false positives or false negatives. For a sample of, say, 50 “genomes”, a clique requires 1,225 (= 50 × 49 / 2) RBHs. If a single one of these is missing because the BLAST method ranked one of two high-ranking hits first in some comparisons but second in others, that individual protein-coding gene would be eliminated from further consideration. In this preliminary study, many of the fans contained only 33 to 35 organisms (Figure 2.7A). Closer inspection revealed that the overly strict clique criterion resulted in many

>gi|15606302|ref|213681.1| Hsp70 chaperone DnaK [Aquifex aeolicus VF5]
>gi|15643141|ref|228184.1| dnaK protein [Thermotoga maritima MSB8]
>gi|15927160|ref|374693.1| DnaK protein [Staphylococcus aureus subsp. aureus N315]
>gi|16329715|ref|440443.1| DnaK protein [Synechocystis sp. PCC 6803]
>gi|16331261|ref|441989.1| DnaK protein [Synechocystis sp. PCC 6803]
>gi|16331492|ref|442220.1| DnaK protein [Synechocystis sp. PCC 6803]
>gi|17229234|ref|485782.1| DnaK-type molecular chaperone [Nostoc sp. PCC 7120]
>gi|17229938|ref|486486.1| DnaK-type molecular chaperone [Nostoc sp. PCC 7120]
>gi|17230482|ref|487030.1| DnaK-type molecular chaperone [Nostoc sp. PCC 7120]
>gi|17934041|ref|530831.1| DNAK Protein [Agrobacterium tumefaciens str. C58]
>gi|21673475|ref|661540.1| DnaK protein [Chlorobium tepidum TLS]
>gi|22298333|ref|681580.1| chaperone protein [Thermosynechococcus elongatus BP-1]
>gi|22299276|ref|682523.1| DnaK protein 2 [Thermosynechococcus elongatus BP-1]
>gi|22299693|ref|682940.1| DnaK protein 3 [Thermosynechococcus elongatus BP-1]
>gi|23125088|ref|00107038.1| COG0443: Molecular chaperone [Nostoc punctiforme PCC 73102]
>gi|23129420|ref|00111247.1| COG0443: Molecular chaperone [Nostoc punctiforme PCC 73102]
>gi|23130153|ref|00111972.1| COG0443: Molecular chaperone [Nostoc punctiforme PCC 73102]
>gi|25029185|ref|739239.1| putative heat shock protein DnaK [Corynebacterium efficiens YS-314]
>gi|26245936|ref|751975.1| Chaperone protein dnaK [Escherichia coli CFT073]
>gi|33241320|ref|876262.1| Molecular chaperone, DnaK [Prochlorococcus str. CCMP1375]
>gi|33598002|ref|885645.1| molecular chaperone [Bordetella parapertussis 12822]
>gi|33862260|ref|893821.1| Molecular chaperone DnaK2, heat shock protein hsp70-2 [Prochlorococcus str. CCMP1986]
>gi|33864519|ref|896079.1| Molecular chaperone DnaK2, heat shock protein hsp70-2 [Prochlorococcus str. MIT 9313]
>gi|33867038|ref|898597.1| Molecular chaperone DnaK2, heat shock protein hsp70-2 [Synechococcus sp. WH 8102]
>gi|37523833|ref|927210.1| molecular chaperone [Gloeobacter violaceus PCC 7421]
>gi|39933410|ref|945686.1| heat shock protein DnaK (70) [Rhodopseudomonas palustris CGA009]
>gi|45509933|ref|00162265.1| COG0443: Molecular chaperone [Anabaena variabilis ATCC 29413]
>gi|46191788|ref|00008041.2| COG0443: Molecular chaperone [Rhodobacter sphaeroides 2.4.1]
>gi|47529836|ref|021185.1| chaperone protein dnak [Bacillus anthracis str. 'Ames Ancestor']
>gi|51473385|ref|067142.1| chaperone protein DnaK [Rickettsia typhi str. Wilmington]
>gi|51594962|ref|069153.1| chaperone Hsp70 in DNA biosynthesis/cell division [Yersinia pseudotuberculosis IP 32953]
>gi|53764031|ref|00159660.2| COG0443: Molecular chaperone [Anabaena variabilis ATCC 29413]
>gi|53764403|ref|00160213.2| COG0443: Molecular chaperone [Anabaena variabilis ATCC 29413]
>gi|53765167|ref|00161387.2| COG0443: Molecular chaperone [Anabaena variabilis ATCC 29413]
>gi|53795503|ref|00356578.1| COG0443: Molecular chaperone [Chloroflexus aurantiacus]
>gi|53798849|ref|00359141.1| COG0443: Molecular chaperone [Chloroflexus aurantiacus]
>gi|56751539|ref|172240.1| DnaK protein [Synechococcus elongatus PCC 6301]
>gi|56751645|ref|172346.1| DnaK protein [Synechococcus elongatus PCC 6301]
>gi|67918978|ref|00512568.1| Heat shock protein Hsp70 [Chlorobium limicola DSM 245]
>gi|67920502|ref|00514022.1| Heat shock protein Hsp70 [Crocosphaera watsonii WH 8501]
>gi|67920816|ref|00514335.1| Heat shock protein Hsp70 [Crocosphaera watsonii WH 8501]
>gi|67935436|ref|00528457.1| Heat shock protein Hsp70 [Chlorobium phaeobacteroides DSM 266]
>gi|67938845|ref|00531363.1| Heat shock protein Hsp70 [Chlorobium phaeobacteroides BS1]
>gi|71673729|ref|00671477.1| Heat shock protein Hsp70 [Trichodesmium erythraeum IMS101]
>gi|71674655|ref|00672402.1| Heat shock protein Hsp70 [Trichodesmium erythraeum IMS101]
>gi|71677290|ref|00675028.1| Heat shock protein Hsp70 [Trichodesmium erythraeum IMS101]
>gi|72383150|ref|292505.1| Heat shock protein Hsp70 [Prochlorococcus str. NATL2A]
>gi|78185870|ref|378304.1| Heat shock protein Hsp70 [Synechococcus sp. CC9902]
>gi|78186504|ref|374547.1| Heat shock protein Hsp70 [Pelodictyon luteolum DSM 273]
>gi|78189848|ref|380186.1| Heat shock protein Hsp70 [Chlorobium chlorochromatii CaD3]
>gi|78214175|ref|382954.1| Heat shock protein Hsp70 [Synechococcus sp. CC9605]
>gi|78780182|ref|398294.1| Heat shock protein Hsp70 [Prochlorococcus str. MIT 9312]

Figure 2.8: Annotation of 52 protein-coding genes of an RBH fan with 36 organisms. A listing of the 52 protein-coding genes in an RBH fan from the Synechococcus sp. WH 8102 query sequence with GI 33867038.

individual protein-coding genes being excluded from a set because of a few missing RBH edges, when both annotation and similarity evident through multiple sequence alignment of the entire set indicated that the entire set consisted of orthologues.

I had hoped to find a large set of protein-coding genes that were shared across all taxa so that the phylogenies of these taxon-specific core protein-coding genes could be more easily compared. I thus experimented with sets of protein-coding genes which were related by a paraclique set of relationships. Through iterative examination of connectedness, I found that sets of protein-coding genes that I would consider orthologous following manual curation typically had up to 5% of the total number of RBH relationships missing. When I implemented this rule for identifying orthologues, however, I found that in many cases, the majority of the missing edges arose from the presence of a single protein-coding gene which was poorly connected to the rest of the protein-coding genes in the set.

Further analyses resulted in a reasonably robust rule for defining orthology. A set of protein-coding genes formed a paraclique if all of the following three conditions were met: (i) it contained at least 90% of the edges required for a clique; (ii) every protein-coding gene was connected to at least 90% of the others; and (iii) at least 90% of the protein-coding genes had no missing edges.
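The three-part paraclique rule translates directly into a test on a candidate set of genes and the RBH edges among them. The following Python sketch is one possible reading of the rule, not the original implementation; the function name is_paraclique and the min_fraction parameter are hypothetical, and sets with fewer than two genes are rejected by assumption.

```python
from itertools import combinations


def is_paraclique(genes, rbh_edges, min_fraction=0.9):
    """Apply the three 90% conditions to a candidate set of genes."""
    genes = list(genes)
    n = len(genes)
    if n < 2:
        return False  # assumption: a single gene is not a set of orthologues
    edges = {frozenset(edge) for edge in rbh_edges}
    degree = {g: 0 for g in genes}
    present = 0
    for u, v in combinations(genes, 2):
        if frozenset((u, v)) in edges:
            present += 1
            degree[u] += 1
            degree[v] += 1
    required = n * (n - 1) // 2  # number of edges in a clique of n nodes
    # (i) at least 90% of the edges required for a clique
    enough_edges = present >= min_fraction * required
    # (ii) every gene connected to at least 90% of the others
    well_connected = all(d >= min_fraction * (n - 1) for d in degree.values())
    # (iii) at least 90% of the genes have no missing edges
    complete = sum(1 for d in degree.values() if d == n - 1)
    mostly_complete = complete >= min_fraction * n
    return enough_edges and well_connected and mostly_complete
```

For a set of 30 genes, for example, one missing RBH leaves 28 genes with no missing edges and the set still qualifies, whereas two missing RBHs between four different genes leave only 26 such genes and the set fails condition (iii).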

Discussion

This preliminary analysis demonstrated that the clique-based topology of RBHs was too stringent a criterion for finding eubacterial-core orthologues. I had hoped this method would offer a simple solution to finding eubacterial-core orthologues, because they should be present in all the “genomes” and would be long enough to be useful for phylogenetic analysis (since they are detectable by BLAST), and because protein-coding genes that had been duplicated in an ancient ancestor should separate into different clusters. However, this was not the case, largely because the BLAST relationships were not symmetric.

In agreement with previous studies (e.g., Hulsen et al., 2006; Chen et al., 2007; Altenhoff and Dessimoz, 2009), these analyses thus suggest that RBHs can detect homologous relationships, but that the requirement that all protein-coding genes in a set must form a clique to be considered orthologous is too strict. The clique criterion, while intuitively accurate in modelling the relationships that ought to exist between metabolic-core protein-coding genes in different “genomes” that have been inherited from a common ancestor and retained in each descendant, is not appropriate for real data where the highest scoring BLAST hit may not be the most similar sequence (Koski and Golding, 2001) and relationships may not be symmetric (Chiu et al., 2006). That is, the E-value of a BLAST hit from protein-coding gene A to B is not the same as from B to A when there is more than one highly similar sequence in either or both the query and target “genomes”. The non-symmetry arises from the BLAST algorithm itself, as the step that extends the matched region until a stop criterion is met means that a search in one direction does not give exactly the same result as the search in the reverse direction.

Furthermore, the presence of closely-related homologues may also cause problems. For example, some protein-coding genes may form an exclusive RBH pair (Figure 2.9A). The presence of an equally high scoring BLAST relationship does not obscure the two most similar pairs of protein-coding genes in the two “genomes” (Figure 2.9B). However, non-reciprocal RBH relationships can obscure one of the most similar pairs (Figure 2.9C), and the most similar pairs of proteins may not be identifiable from a dense network of RBH relationships, as could be recovered among protein-coding genes of a gene family (Figure 2.9D).

A better variation of the clique-based method was tested, where a seed protein-coding gene is augmented with highly similar protein-coding genes and then pruned to remove

Figure 2.9: Possible BLAST-hit relationships between protein-coding genes in two genomes (A) An exclusive RBH relationship between two pairs of protein-coding genes. (B) If an additional significant BLAST relationship is detected between the protein-coding genes, the most similar pairs of protein-coding genes can still be easily identified. (C) However, a denser network of blast relationships, where some are not reciprocated, can obscure the inference of the most similar pairs of proteins between the genomes. (D) And the most similar pairs of proteins between the genomes can be almost impossible to infer from a network of RBH relationships among closely-related proteins.

individual protein-coding genes that are not related to the rest of the graph by enough connections for them to be considered orthologues.

Conclusions

The inference of reliable clusters of RBH matches, with associated functional annotation, is not straightforward. Intuitively, a set of protein-coding genes that would approximate the set of orthologues among the starting genomes could be expected to be connected by RBH relationships that form a clique topology. But the nature of the BLAST-based similarity procedure makes this criterion overly strict. Requiring a set of potentially orthologous protein-coding genes to be connected by a dense set of RBHs that forms a paraclique is sufficient to identify sets of eubacterial-core protein-coding genes and shows promise for the more problematic non-core protein-coding genes. This approach is thus developed further in the following section.

2.2.3 Refining the criteria for identifying the most similar pairs of proteins among two genomes

Background

The previous sections suggest that it might be possible to define a general-purpose, rule-based system for finding sets of the most similar sequences shared by an arbitrary set of randomly-chosen eubacteria. The most promising system was based on the properties of sets of protein-coding genes within one RBH (“radius 1”) of a seed protein-coding gene. The choice of a radius of 1 was arbitrary, and the moderate success of a method using a radius of 1 suggested that the supergraph formed by the pan-genome of a set of eubacteria may consist of relatively isolated clusters of the most similar protein-coding genes, each corresponding to an approximation of a set of orthologues.

In my original clique-based method for finding sets of the most similar proteins, I used moderately stringent criteria (an RBH relationship based on E-values < 10⁻⁵). This conservative inference method returned good candidate sets of protein-coding genes because any sets of protein-coding genes that conformed to the strict clique topology of RBHs were all very similar to each other. The relaxation of the topology to permit paracliques means that any false positive RBH edges in the graph, due to recent duplications, could result in the inclusion of some protein-coding genes that had a lower level of similarity to the rest of the set. I therefore experimented with improvements to the stringency of the pairwise similarity criteria. In this section, I describe how a more specific test for the inference of the most similar pairs of protein-coding genes was developed.

Methods

First, I compared pairs of protein-coding genes from two cyanobacterial genomes inferred with a combination of BLAST parameters. During an initial comparison of BLAST hits between Prochlorococcus strain CCMP1375 (also known as strain SS120) and MIT9313 (data not shown), I found that the E-value, the percentage of positive matches between the query and target sequences (posit%) and the percent identity (ident%) were the best indicators of a high level of similarity. The 7,389 significant hits from query sequences in Pcc CCMP1375 in Pcc MIT9313 were partitioned into the 8 possible categories of Boolean criteria, where: (i) E-value < 10⁻⁵; (ii) posit% > 75%; and (iii) ident% > 35%, and the thresholds were those suggested in Gill et al. (2005). Through visual inspection of the alignments returned in the BLASTP output, I found that the hits that satisfied all three criteria were always what I would consider to be the most similar proteins in the two genomes. The ident% had little effect on the hits retained, and the threshold of ident% > 35% seemed adequate. The threshold for posit%, however, was too strict.

Relaxing the criteria from permitting only best hits to permitting significant hits was not a beneficial strategy because, in general, the best hit was clearly the most similar to the query sequence.

A more systematic investigation was conducted between the protein-coding gene complement of Prochlorococcus strain MIT9313 and Synechocystis sp. PCC6803, two more divergent “genomes” with 2,269 and 3,174 protein-coding genes respectively. The number of uni-directional best hits from one “genome” to another was tabulated for the 8 possible Boolean categories based on E-value, ident% and posit%, for all combinations of E-value thresholds of 10⁻⁵, 10⁻⁴ and 10⁻³ and posit% thresholds of 60, 65, 70 and 75, with the ident% threshold fixed at > 35%.
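The tabulation into 8 Boolean categories can be sketched as a three-character code per hit followed by a count. This Python sketch is illustrative only; the function names category and tabulate are hypothetical, the default thresholds are one of the tested combinations, and the digit order (E-value, ident%, posit%) is an assumption chosen to match the three-digit column codes of Table 2.3. The two example hits use the E-value, identity and positives reported in Figure 2.5.

```python
from collections import Counter


def category(evalue, ident, posit, max_evalue=1e-5, min_ident=35.0, min_posit=65.0):
    """Return a 3-digit code: 1 where a criterion is passed, 0 where it is failed."""
    flags = (evalue < max_evalue, ident > min_ident, posit > min_posit)
    return "".join("1" if flag else "0" for flag in flags)


def tabulate(hits, **thresholds):
    """Count hits per Boolean category; hits are (evalue, ident%, posit%) tuples."""
    return Counter(category(e, i, p, **thresholds) for e, i, p in hits)


# The two hits of Figure 2.5: (E-value, ident%, posit%)
hits = [(1e-164, 87.0, 92.0), (2e-09, 22.0, 38.0)]
counts = tabulate(hits)
```

Here the first hit passes all three criteria ("111"), while the second passes only the E-value criterion ("100"), in line with the manual assessment that it is a much poorer match.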

A Python program was written to classify the pairs of protein-coding genes from BLASTP hits from “genome” A to B and “genome” B to A into 5 categories: (1) pairs of orthologues; (2) paralogues of existing orthologue pairs; (3) protein-coding genes unique in “genome” A; (4) protein-coding genes unique in “genome” B; and (5) unclassified. The most suitable thresholds and criteria for identifying the most similar pairs of protein-coding genes in pairs of genomes were determined through an iterative approach.

Results

The number of uni-directional best hits produced by each combination of the three criteria is listed in Table 2.3. The 8 possible combinations of the E-value, ident% and posit% criteria are indicated by a 3-digit number. For example, the value in the “010” column corresponds to the number of uni-directional hits that failed the E-value threshold, passed the ident% threshold and failed the posit% threshold. The best hits for matches of query sequences in Synechocystis sp. PCC6803 in target “genome” Prochlorococcus sp. MIT9313 are listed in (A). The reverse hits are listed in (B).

Comparison of the differences between each of the sets revealed that higher E-value thresholds resulted in the inclusion of protein-coding genes that shared a lower level of similarity to the rest of the protein-coding genes in the set, hence the most accurate E-value threshold was 10⁻⁵. By similar reasoning, a threshold for posit% between 65 and 70% was most accurate.

Pairs of protein-coding genes were considered highly similar if they passed the following tests. Let labels ‘E’, ‘I’ and ‘P’ mark an E-value, ident% or posit% better than a threshold value, ‘B’ a reciprocal best hit and ‘O’ an only significant hit, with labels duplicated for tests that held in both forward and reverse comparisons. Then pairs of protein-coding genes were considered good matches if they scored “EEIIPPBBOO”, “EEIIPPBB”, “EEIIPPB”, “EEIIBB”, “EEBB” or “EIPB”. The best matches passed all five “EIPBO” tests, but the minimum requirement for a significantly similar pair of protein-coding genes was either that there be a reciprocally significant similarity measured by E-value and a reciprocal best hit (“EEBB”), or that the similarity in one direction was significant as measured by E-value, had a high enough ident% and posit%, and the other protein-coding gene was the best hit (“EIPB”). This choice of tests was determined by individual inspection of several genomic comparisons between taxa of varying divergences.
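The labelling scheme can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than the program used in this study: the names pair_label, is_good_match and ACCEPTED_LABELS are hypothetical, each direction is represented as the set of test letters that held, and only the six label strings listed above are accepted, which is the most literal reading of the rule.

```python
# The six label strings listed in the text as good matches
ACCEPTED_LABELS = {"EEIIPPBBOO", "EEIIPPBB", "EEIIPPB", "EEIIBB", "EEBB", "EIPB"}


def pair_label(forward, reverse):
    """Build the label string for a pair of protein-coding genes.

    forward and reverse are sets of test letters ('E', 'I', 'P', 'B', 'O')
    that held in each search direction; a letter appears once per direction.
    """
    return "".join(t * ((t in forward) + (t in reverse)) for t in "EIPBO")


def is_good_match(forward, reverse):
    """Accept the pair only if its label is one of the listed strings."""
    return pair_label(forward, reverse) in ACCEPTED_LABELS


best_case = pair_label(set("EIPBO"), set("EIPBO"))
one_way = pair_label(set("EIPB"), set())
```

A pair that is significant by E-value in both directions but is not a best hit in either would be labelled "EE" and rejected under this reading.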

Table 2.3: Uni-directional best hits between Synechocystis PCC6803 and Prochlorococcus MIT9313

The number of uni-directional blastp hits is shown for all combinations of (E-value, identity%, positives%) with different thresholds. The leftmost column lists the thresholds used. The following eight columns show how many of the total number of hits passed each possible combination of the three threshold criteria. For example, when the pairwise comparisons were run for an E-value threshold of 10⁻³, identity > 35% and positives > 60% (“e3-35-60”), there were 890 hits that did not pass any threshold, 4 that only passed the positives% threshold, and 1071 that passed all three thresholds. (A) shows results for Synechocystis sp. PCC6803 in target “genome” Prochlorococcus sp. MIT9313 and (B) shows hits in the opposite direction.

(A)
Thresholds   000   001   010   011   100   101   110   111
e3-35-60     890     4   346   643    77    12   129  1071
e3-35-65     670     0   566   647    39     2   167  1081
e3-35-70     451     0   785   647    25     2   181  1081
e3-35-75     285     0   951   647    10     0   196  1083
e4-35-60     888     4   340   598    79    12   135  1116
e4-35-65     669     0   559   602    40     2   174  1126
e4-35-70     450     0   778   602    26     2   188  1126
e4-35-75     284     0   944   602    11     0   203  1128
e5-35-60     882     3   337   567    85    13   138  1147
e5-35-65     664     0   555   570    45     2   178  1158
e5-35-70     448     0   771   570    28     2   195  1158
e5-35-75     283     0   936   570    12     0   211  1160

(B)
Thresholds   000   001   010   011   100   101   110   111
e3-35-60     830     2   305   398    66     9    72   587
e3-35-65     606     0   529   400    39     1    99   595
e3-35-70     420     0   715   400    23     0   115   596
e3-35-75     267     0   868   400    10     0   128   596
e4-35-60     825     2   304   380    71     9    73   605
e4-35-65     601     0   528   382    44     1   100   613
e4-35-70     419     0   710   382    24     0   120   614
e4-35-75     266     0   863   382    11     0   133   614
e5-35-60     821     1   300   367    75    10    77   618
e5-35-65     598     0   523   368    47     1   105   627
e5-35-70     417     0   704   368    26     0   126   628
e5-35-75     265     0   856   368    12     0   140   628
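The 3-digit codes that head the columns of Table 2.3 can be illustrated with a short sketch. This is not the program used in the thesis: the function names, the default thresholds and the strictness of the comparisons (`<` and `>`) are assumptions chosen to match the description of the table.

```python
def threshold_code(evalue, ident_pct, posit_pct,
                   max_evalue=1e-5, min_ident=35.0, min_posit=65.0):
    """Return a 3-character code such as "010" for one BLAST hit.

    Digit 1 is the E-value test, digit 2 the ident% test and digit 3 the
    posit% test; "1" means the hit passed that test, "0" that it failed.
    """
    passed = (
        evalue < max_evalue,     # assumed strict comparison
        ident_pct > min_ident,
        posit_pct > min_posit,
    )
    return "".join("1" if ok else "0" for ok in passed)


def tally_codes(hits, **thresholds):
    """Count hits per code; `hits` yields (evalue, ident%, posit%) tuples."""
    counts = {format(i, "03b"): 0 for i in range(8)}  # "000" ... "111"
    for evalue, ident_pct, posit_pct in hits:
        counts[threshold_code(evalue, ident_pct, posit_pct, **thresholds)] += 1
    return counts
```

Each row of Table 2.3 would then correspond to one call of `tally_codes` with a different set of thresholds, for example `tally_codes(hits, max_evalue=1e-3, min_posit=60.0)` for the “e3-35-60” row.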

Conclusions

Good criteria for inferring whether two protein-coding genes could be orthologous are:

(i) whether they have significant similarity, as measured by the E-value, percent identity, and percent of positive matches in the query and target sequence; and (ii) whether they are the best and/or only significant hits in each other’s “genomes”.

Conservative thresholds for pairwise estimation of similarity are an E-value < 10⁻⁵, identity > 35% and positive matches in the query and target sequences covering > 65%.

More extensive testing of suitable thresholds revealed, however, that all three criteria need not apply for similar pairs of protein-coding genes. The most important criterion was a significant E-value and a best hit relationship. As long as there was either a strong reciprocal relationship involving both those criteria, or a uni-directional relationship that matched most criteria, the two protein-coding genes being considered were the most similar pair of protein-coding genes among the two host genomes.

The most similar pairs of protein-coding genes were those that were related either by an “RBH” or an “EIPB” relationship. Henceforth, a pair of such proteins is referred to as a “most-similar” or “MSH-pair” and a cluster of such proteins as an “MSH-cluster”.
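The label scheme and the minimum requirement for an MSH-pair can be sketched as follows. This is an illustrative reading of the rules in this section rather than the code used in the thesis: the representation of each direction as a set of passed tests, and the function names, are assumptions.

```python
TESTS = "EIPBO"  # E-value, ident%, posit%, best hit, only significant hit


def label(forward, reverse):
    """Build a label such as "EEIIPPBB".

    `forward` and `reverse` are sets of the single-letter tests that held
    in each direction; a letter is repeated once per direction it held in.
    """
    return "".join(t * ((t in forward) + (t in reverse)) for t in TESTS)


def is_msh_pair(forward, reverse):
    """Apply the minimum requirement: "EEBB" or a one-directional "EIPB"."""
    reciprocal = {"E", "B"} <= (forward & reverse)
    one_way = any({"E", "I", "P", "B"} <= d for d in (forward, reverse))
    return reciprocal or one_way
```

For example, `label(set("EB"), set("EB"))` gives `"EEBB"`, which `is_msh_pair` accepts, whereas a pair that passes only the E-value test in both directions is rejected.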

2.2.4 Iterative clustering using seed clusters, set expansion and pruning

Background

Paraclique topologies are sufficient to identify HS-clusters of eubacterial-core protein-coding genes. Non-taxon-specific-core protein-coding genes are more problematic, however, because they may belong to gene families that cause more error in the BLAST-based metrics of similarity.

In this section, I explore the possibility of using clusters of non-taxon-specific-core protein-coding genes from a curated database as seeds for a search for clusters of protein-coding genes that are more similar to each other than to genes not included in the cluster.

Methods

Fans were compiled from relationships between HS-pairs using the criteria found in §2.2.3, for the taxa used in §2.2.2. Each protein-coding gene for which a cluster was required was also collated into a file. A C program was written to identify MSH-clusters of protein-coding genes for each protein-coding gene in the file, based on relationships specified in the file of fans, using an “expand and prune” clustering method. The algorithm works as follows.

    for each protein-coding gene of interest:
        set v = {protein-coding genes within radius r}
        set e = {edges between any elements in v}
        sort v by the number of adjacent edges
        iterate through v, deleting elements with the lowest connectivity
            until each element of the remaining protein-coding genes has a
            connectivity that is at least as good as a pre-defined set of criteria
        return the set of most-similar protein-coding genes

    where the radius r, the number of MSHs separating two proteins,
    is typically 2 or 3, and the connectivity threshold is acceptable if:
        the number of protein-coding genes > 3, and
        the number of edges connecting the protein-coding genes > 90% of
            the maximum number of possible undirected edges, and
        the number of protein-coding genes in the subgraph with more than
            two missing edges < 2, and
        the maximum number of edges missing from any protein-coding gene
            is < 10%
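The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the C program written for the thesis. It assumes an undirected graph stored as a dictionary of neighbour sets with no self-loops, and it interprets the final “< 10%” criterion as the largest number of missing edges for any member divided by the number of members, which matches the pctEM column of the diagnostic output in Figure 2.10. The function names and the tie-breaking rule are hypothetical.

```python
def expand(graph, seed, radius):
    """Return all nodes within `radius` edges of `seed` (breadth-first)."""
    members = {seed}
    frontier = {seed}
    for _ in range(radius):
        frontier = {n for v in frontier for n in graph.get(v, ())} - members
        members |= frontier
    return members


def acceptable(nodes, graph):
    """Check the four connectivity criteria for a candidate set."""
    n = len(nodes)
    if n <= 3:
        return False
    # Edges each member lacks to the other n - 1 members.
    missing = {v: (n - 1) - len(graph.get(v, set()) & (nodes - {v}))
               for v in nodes}
    max_edges = n * (n - 1) // 2
    num_edges = max_edges - sum(missing.values()) // 2
    return (num_edges > 0.9 * max_edges
            and sum(1 for m in missing.values() if m > 2) < 2
            and max(missing.values()) < 0.1 * n)  # assumed denominator: n


def expand_and_prune(graph, seed, radius=2):
    """Expand around `seed`, then delete the least connected members."""
    nodes = expand(graph, seed, radius)
    while len(nodes) > 3 and not acceptable(nodes, graph):
        # Ties are broken alphabetically (an arbitrary, assumed choice).
        worst = min(sorted(nodes), key=lambda v: len(graph.get(v, set()) & nodes))
        nodes = nodes - {worst}
    return nodes if acceptable(nodes, graph) else set()
```

Note that nothing in this sketch protects the seed from being pruned, so the returned set may exclude it or be empty.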

Each protein-coding gene of the reference genome, Prochlorococcus marinus strain MIT9313, was used as the seed for a search for sets of most similar protein-coding genes in other genomes. Protein-coding genes for which the most similar match could not be found, such as those from the photosynthesis linker family, were searched for individually. The CyOG database of “Cyanobacterial COGs” of photosynthesis-related genes (Mulkidjanian et al., 2006) was downloaded from the supplementary information of the journal website. The CyOGs had been inferred by a method inspired by that used by the COG database (Tatusov et al., 2000, 2003). Protein-coding genes from three cyanobacterial “genomes” were clustered into triplets, which were augmented by protein-coding genes from 11 additional cyanobacterial “genomes” in a method analogous to that used by the COG database (Mulkidjanian et al., 2006). Eight of the cyanobacteria used in the CyOG database overlapped with my own selection of taxa (Table 2.4). The CyOG clusters were limited to protein-coding genes from genomes in common with my study, and used as the seeds in the “expand and prune” method for inferring sets of most-similar protein-coding genes.

Table 2.4: Taxa common to both the CyOG database and the present study

Bacterial/Photosynthetic group        Organism
Cyanobacteria                         Gloeobacter violaceus PCC 7421
                                      Prochlorococcus marinus str. CCMP1375 (also known as str. SS120)
                                      Prochlorococcus sp. WH 8102
                                      Prochlorococcus marinus str. MIT 9313
                                      Synechococcus elongatus PCC 6301
                                      Synechocystis sp. PCC 6803
Purple bacteria (α-proteobacteria)    Rhodopseudomonas palustris CGA009

In either case, each seed protein-coding gene was increased to a set of protein-coding genes within a radius of r, for values of r between 1 and 4. For each initial set, various criteria for pruning the least connected protein-coding genes in the set were applied, resulting in a potential set of orthologues for each seed protein-coding gene.

To ascertain the most appropriate values of the radius r and the terminating condition for pruning, a small number of eubacterial-core and photosynthesis-related genes were examined in detail and the most appropriate parameters decided from these test cases.

Results

Using the method explained in §2.2.2, I tested values of radius r from 1 to 4, inclusive.

In general, the best overall value for the radius was 2, or failing that 3.

An example of a successful application of this procedure is shown in Figure 2.10. The hli11 protein is part of a family of hli (high-light inducible) proteins which are not always annotated sufficiently in cyanobacteria. Using the annotated hli11 protein from Prochlorococcus str. NATL2A as the seed for the search, the initial set consisted of all protein-coding genes within radius 2 of the seed protein-coding gene. The missing MSH-matches are indicated by a ‘0’ in the adjacency matrix. The order in which the protein-coding genes with the poorest connectivity are removed is shown, along with diagnostic output of the number of protein-coding genes (numN), number of edges (numE), the maximum number of edges possible for numN protein-coding genes (expE), the number of edges expressed as a percentage of the maximum number of edges (pctE), the number of protein-coding genes without any edges missing (numNwM), the number of protein-coding genes with up to 2 edges missing (numNwMTO), the maximum number of edges missing among the entire set (maxNumEM) and this value expressed as a percentage (pctEM). The last column of the table flags the selection criteria that failed: ‘!’ for the criterion pctE > 90%, a second symbol for pctNwMTO < 10% (although this does not occur in this example) and ‘#’ for maxpctEM < 10%. The annotations of the final pruned set are displayed. The labels of the form “XXXXnnnnn” are unique identifiers used in this study, with “XXXX” an organism code, making protein-coding genes from the same organism visible even without the full annotation.

Unfortunately, in many cases the expansion of an initial seed protein-coding gene converges to a cluster that does not include the seed protein-coding gene. Such a case is illustrated in Figure 2.11 for the high-light inducible protein C (hliC) when an expansion of radius 2 was used. Here, the expanded set contained a subset of protein-coding genes that have a high level of pairwise potential orthology, visible as an area with a scarcity of zeros in Figure 2.11. The other protein-coding genes, including the seed protein-coding gene with identifier MEIK00627, have a low level of pairwise potential orthology across the set. In cases like these, the pruning procedure would preferentially remove the loosely connected protein-coding genes, including the seed protein-coding gene, leaving the cluster of strongly connected protein-coding genes.
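The derived columns of the diagnostic output follow from simple counts. The sketch below recomputes expE, pctE and pctEM from numN, numE and maxNumEM. The function name is hypothetical, and dividing maxNumEM by numN is an assumption inferred from the values reported in Figures 2.10 and 2.11.

```python
def prune_diagnostics(num_nodes, num_edges, max_missing):
    """Return (expE, pctE, pctEM) for one row of the diagnostic table."""
    exp_edges = num_nodes * (num_nodes - 1) // 2      # complete undirected graph
    pct_edges = 100.0 * num_edges / exp_edges         # pctE
    pct_missing = 100.0 * max_missing / num_nodes     # pctEM (assumed denominator)
    return exp_edges, round(pct_edges, 2), round(pct_missing, 2)


# Starting row of Figure 2.10: 14 protein-coding genes, 32 edges,
# and at most 13 edges missing for any one gene.
print(prune_diagnostics(14, 32, 13))   # (91, 35.16, 92.86)
```

The same calculation reproduces the final row of Figure 2.11, where 21 genes, 205 edges and at most 2 missing edges give expE = 210, pctE = 97.62 and pctEM = 9.52.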

Discussion

The “expand and prune” clustering method presented yields high quality sets of MSH-clusters. The method works particularly well for seed protein-coding genes that are ubiquitous among a diverse range of taxa, such as those from the eubacterial core. But when the seed protein-coding gene belongs to a gene family, it is possible for the method to find a cluster to which the seed protein-coding gene does not belong, or to produce an empty set of protein-coding genes. So, the process relies heavily on manual verification and is thus not scalable (Altenhoff and Dessimoz, 2009).

Protein MLAP00940: Starting matrix where missing edges are denoted with '0'

>MLAP00940 gi|72382780|ref|YP_292135.1| high light inducible protein hli11 [Prochlorococcus str. NATL2A]

[14 × 14 adjacency matrix for MEDN00010, MEDN00012, MEDN00050, MK0000104, MK0001793, MLAP00940, MRFT01511, MRFT01525, MSXU01748, MTYQ00815, MTYQ01380, MWYJ00687, MWYJ01510 and MXBC01221; the matrix entries are not legible in the source.]

Proteins deleted from initial set and properties of the putative set of orthologues

Protein    [DelProt]    numN numE expE    pctE numNwM numNwMTO pctNwMTO maxnumEM maxpctEM
MLAP00940  [Start]        14   32   91   35.16     14        0     0.00       13    92.86  !.#
MLAP00940  [MEDN00010]    13   31   78   39.74     13        0     0.00       12    92.31  !.#
MLAP00940  [MK0000104]    12   30   66   45.45     12        0     0.00       11    91.67  !.#
MLAP00940  [MRFT01525]    11   29   55   52.73     11        0     0.00       10    90.91  !.#
MLAP00940  [MTYQ00815]    10   28   45   62.22     10        0     0.00        7    70.00  !.#
MLAP00940  [MEDN00050]     9   25   36   69.44      9        0     0.00        6    66.67  !.#
MLAP00940  [MWYJ00687]     8   24   28   85.71      6        0     0.00        2    25.00  !.#
MLAP00940  [MEDN00012]     7   20   21   95.24      2        0     0.00        1    14.29  ..#
MLAP00940  [MWYJ01510]     6   15   15  100.00      0        0     0.00        0     0.00  ...
MLAP00940  [End]           6   15   15  100.00      0        0     0.00        0     0.00  ...

>MEDN00010 gi|84517411|ref|ZP_01004760.1| possible high light inducible protein [Prochlorococcus marinus str. MIT 9211]
>MK0000104 gi|124024816|ref|YP_001013932.1| hypothetical protein NATL1_01031 [Prochlorococcus marinus str. NATL1A]
>MRFT01525 gi|33240977|ref|NP_875919.1| High light inducible protein hli12 [Prochlorococcus marinus subsp. marinus str. CCMP1375]
>MTYQ00815 gi|33861375|ref|NP_892936.1| possible high light inducible protein [Prochlorococcus marinus subsp. pastoris str. CCMP1986]
>MEDN00050 gi|84517451|ref|ZP_01004800.1| high light inducible protein [Prochlorococcus marinus str. MIT 9211]
>MWYJ00687 gi|78779072|ref|YP_397184.1| high light inducible protein-like [Prochlorococcus marinus str. MIT 9312]
>MEDN00012 gi|84517413|ref|ZP_01004762.1| high light inducible protein [Prochlorococcus marinus str. MIT 9211]
>MWYJ01510 gi|78779895|ref|YP_398007.1| hypothetical protein PMT9312_1511 [Prochlorococcus marinus str. MIT 9312]

Final set of putatively orthologous proteins

MK0001793, MLAP00940, MRFT01511, MSXU01748, MTYQ01380, MXBC01221

Annotation of final set of proteins

>MK0001793 gi|124026505|ref|YP_001015620.1| hypothetical protein NATL1_18001 [Prochlorococcus str. NATL1A]
>MLAP00940 gi|72382780|ref|YP_292135.1| high light inducible protein hli11 [Prochlorococcus str. NATL2A]
>MRFT01511 gi|33240963|ref|NP_875905.1| High light inducible protein hli11 [Prochlorococcus str. CCMP1375]
>MSXU01748 gi|123966982|ref|YP_001012063.1| Possible high light inducible protein [Prochlorococcus str. MIT 9515]
>MTYQ01380 gi|33861940|ref|NP_893501.1| possible high light inducible protein [Prochlorococcus str. CCMP1986]
>MXBC01221 gi|123968757|ref|YP_001009615.1| Possible high light inducible protein [Prochlorococcus str. AS9601]

Figure 2.10: “Expand and prune” clustering output for high-light inducible protein 11. This diagnostic output illustrates how a seed protein-coding gene can be used to successfully identify the cluster of hli11 protein-coding genes using an expansion of radius 2 and pruning of the least connected protein-coding genes in the expanded set of protein-coding genes. For further explanatory notes, refer to the main text.

Protein MEIK00627: Starting matrix where missing edges are denoted with '0'

>MEIK00627 gi|87301154|ref|ZP_01083995.1| possible high light inducible polypeptide HliC [Synechococcus sp. WH 5701]

[33 × 33 adjacency matrix for the expanded set; the matrix entries are not legible in the source.]

Proteins deleted from initial set and properties of the putative set of orthologues

Protein    [DelProt]    numN numE expE    pctE numNwM numNwMTO pctNwMTO maxnumEM maxpctEM
MEIK00627  [Start]        33  266  528   50.38     33        3     9.09       31    93.94  !.#
MEIK00627  [MAHD00667]    32  264  496   53.23     32        2     6.25       30    93.75  !.#
MEIK00627  [MANU03721]    31  262  465   56.34     31        1     3.23       29    93.55  !.#
MEIK00627  [MOOQ03534]    30  261  435   60.00     30        0     0.00       27    90.00  !.#
MEIK00627  [MEIK00627]    29  258  406   63.55     29        0     0.00       24    82.76  !.#
MEIK00627  [MEDA05758]    28  253  378   66.93     28        0     0.00       23    82.14  !.#
MEIK00627  [MOME02214]    27  248  351   70.66     27        0     0.00       20    74.07  !.#
MEIK00627  [MANU00448]    26  241  325   74.15     26        0     0.00       18    69.23  !.#
MEIK00627  [MOME01521]    25  234  300   78.00     24        0     0.00       17    68.00  !.#
MEIK00627  [MXCH02522]    24  226  276   81.88     23        0     0.00       15    62.50  !.#
MEIK00627  [MIV000445]    23  220  253   86.96     22        0     0.00       12    52.17  !.#
MEIK00627  [MAHN01443]    22  211  231   91.34     18        0     0.00       10    45.45  ..#
MEIK00627  [MKUR03074]    21  205  210   97.62      7        0     0.00        2     9.52  ...
MEIK00627  [End]          21  205  210   97.62      7        0     0.00        2     9.52  ...

Final set of putatively orthologous proteins

MAHD00239, MBBP01688, MEDA05040, MEDN01449, MFLB01269, MK0000162, MLAP01455, MPGP00363, MQYC01025, MRFT00110, MRKW02692, MSXU00106, MSZZ00496, MTYQ00093, MWHL02173, MWYJ00097, MXBC00110, MXLT02205, MYJB00243, MYRD01634, MZGA00085

Annotation of final set of proteins

>MAHD00239 gi|124021953|ref|YP_001016260.1| hypothetical protein P9303_02401 [Prochlorococcus str. MIT 9303]
>MBBP01688 gi|113475630|ref|YP_721691.1| CAB ELIP HLIP family protein [Trichodesmium erythraeum IMS101]
>MEDA05040 gi|67925182|ref|ZP_00518551.1| possible high light inducible polypeptide HliC [Crocosphaera watsonii WH 8501]
>MEDN01449 gi|84518850|ref|ZP_01006199.1| High light inducible protein [Prochlorococcus str. MIT 9211]
>MFLB01269 gi|56751279|ref|YP_171980.1| possible high light inducible polypeptide HliC [Synechococcus elongatus PCC 6301]
>MIV000445 gi|22297989|ref|NP_681236.1| CAB/ELIP/HLIP superfamily protein [Thermosynechococcus elongatus BP-1]
>MK0000162 gi|124024874|ref|YP_001013990.1| possible high light inducible protein [Prochlorococcus str. NATL1A]
>MLAP01455 gi|72383295|ref|YP_292650.1| possible high light inducible protein [Prochlorococcus str. NATL2A]
>MPGP00363 gi|78183947|ref|YP_376382.1| possible high light inducible protein [Synechococcus sp. CC9902]
>MQYC01025 gi|16330195|ref|NP_440923.1| CAB/ELIP/HLIP superfamily [Synechocystis sp. PCC 6803]
>MRFT00110 gi|33239562|ref|NP_874504.1| High light inducible protein hli4 [Prochlorococcus str. CCMP1375]
>MRKW02692 gi|86607435|ref|YP_476198.1| CAB/ELIP/HLIP family protein [Synechococcus sp. JA-3-3Ab]
>MSXU00106 gi|123965340|ref|YP_001010421.1| possible high light inducible protein [Prochlorococcus str. MIT 9515]
>MSZZ00496 gi|86608000|ref|YP_476762.1| CAB/ELIP/HLIP family protein [Synechococcus sp. JA-2-3B'a2-13]
>MTYQ00093 gi|33860653|ref|NP_892214.1| possible high light inducible protein [Prochlorococcus str. CCMP1986]
>MWHL02173 gi|33866712|ref|NP_898271.1| possible high light inducible protein [Synechococcus sp. WH 8102]
>MWYJ00097 gi|78778482|ref|YP_396594.1| high light inducible protein-like [Prochlorococcus str. MIT 9312]
>MXBC00110 gi|123967646|ref|YP_001008504.1| possible high light inducible protein [Prochlorococcus str. AS9601]
>MXLT02205 gi|87125241|ref|ZP_01081087.1| possible high light inducible protein [Synechococcus sp. RS9917]
>MYJB00243 gi|81299054|ref|YP_399262.1| possible high light inducible polypeptide HliC [Synechococcus elongatus PCC 7942]
>MYRD01634 gi|33863907|ref|NP_895467.1| possible high light inducible protein [Prochlorococcus str. MIT 9313]
>MZGA00085 gi|88807162|ref|ZP_01122674.1| possible high light inducible protein [Synechococcus sp. WH 7805]

Figure 2.11: “Expand and prune” clustering for high-light inducible protein C. This diagnostic output illustrates how “expand and prune” can result in a cluster that does not include the initial seed protein-coding gene. For further explanatory notes, refer to the main text.

For proteins that belong to a closely-related gene family, the putatively orthologous sub-clusters can often be identified with the assistance of phylogenetic methods. If a tree of the entire set of protein-coding genes within a radius of 2, 3 or even 4 is generated, the relationships between the annotation and the clusters become much clearer. In the most problematic cases, detailed annotation for a large gene family is available for very few of the protein-coding genes. Since the annotation of newly sequenced genomes is usually inferred and transferred from existing annotated genomes, the quality of the new annotation is heavily dependent on the quality of existing genome annotations. If the precise function of a protein-coding gene cannot be inferred with certainty, it is usually annotated as part of a gene family without further information. The “expand and prune” method is based on selecting a seed protein-coding gene with trusted annotation and finding a highly similar cluster of protein-coding genes that includes the seed protein-coding gene. If there are very few sequences in a family annotated to a specific protein-coding gene, and there are inaccuracies in the specific annotation, the “expand and prune” method does not perform well.

The hli gene family was one of the many protein-coding gene families examined in detail due to these sorts of problems. Only 26 of the 337 protein-coding genes were annotated to the gene, rather than the family, level (Figure 2.12). All protein-coding genes within a radius of 2 MSH relationships to the 26 annotated protein-coding genes were aligned with CLUSTALX (Jeanmougin et al., 1998; Larkin et al., 2007) and displayed as a phylogenetic network with the Neighbor-Net method using Protein ML distances (Bryant and Moulton, 2002, 2004) in SplitsTree v4.1.9 (Huson and Bryant, 2006) (Figure 2.12). The potentially orthologous clusters are closely related to each other. The scarcity of annotated protein-coding genes (marked with green circles) and the close proximity of different protein-coding genes in closely related clusters (e.g., the protein-coding genes hli12, hli11 and hli8 in Figure 2.12) make it difficult to determine the best annotation for nearby protein-coding genes.

Figure 2.12: Neighbor-Net analysis of the hli gene family. The high-light inducible (hli) gene family, estimated as the protein-coding genes within a radius of 2 MSH relationships of any precisely annotated protein-coding gene (marked by circles), is large. The annotation of clusters is based on the consensus of precisely labelled protein-coding genes (green circles). In some cases, it was difficult to separate a cluster of one protein-coding gene (e.g., hli11) from sequences labelled as another protein-coding gene (e.g., hli8) on the basis of the phylogenetic pattern or the alignment of the amino-acid sequences (data not shown). Clades are marked in roman font and protein-coding genes in italics. Questionable annotations appear in red.

For the case of the hliC and hliA protein-coding genes, my best estimation of orthology places a small cluster of hliC protein-coding genes within a larger cluster of hliA protein-coding genes. These were allocated on the basis of the phylogenetic pattern and the highly-conserved nature of the sequence alignment (data not shown) of all sequences within the cluster denoted as hliA. In other cases, sequences annotated as the same protein-coding gene, such as the two hli1 protein-coding genes, occur in a larger group that is separated by protein-coding genes that are annotated as CEH proteins. Thus, it is possible that the entire group encodes the same protein-coding genes and should have the same name, or that there are actually three, rather than two, sub-groups of hli1 protein-coding genes (Figure 2.12).

Conclusions

Clusters within a radius of 2 (but sometimes 3) MSH relationships of a seed protein-coding gene with a trusted annotation, pruned until the set consists only of protein-coding genes with a significant similarity to every other protein-coding gene in the set, appeared to be orthologous if the pruning phase converged to a set that included the seed protein-coding gene.

The “expand and prune” clustering method was developed around an empirical observation of the MSH relationships among sets of protein-coding genes.

The method works well for protein-coding genes which do not belong to gene families. Protein-coding genes ubiquitous across the eubacteria fall largely into this category. Those that do form gene families, such as the photosynthesis-related proteins, require extensive manual verification. Visual inspection of sequence alignments and phylogenies is helpful for classifying different proteins, but this is a time-consuming task that cannot realistically be employed for genomic analyses.

Experience performing manual verification of problematic clustering cases led me to suspect that the clustering operation should be done de novo, without a seed protein as a reference, and without supervision. Furthermore, the clustering operation should be applied to cluster all protein-coding genes at the same time rather than in a pairwise or local fashion. So in the next section I explore a general purpose clustering method that operates on the entire complement of protein-coding genes from the genomes being analysed.

2.2.5 Single-step clustering with the MCL algorithm

Background

A variety of clustering methods for inferring orthologues has been tested. Although all have had some successes, distinguishing protein-coding genes that belong to gene families has been problematic and the subsequent need for manual verification to resolve these issues is an ongoing problem for genome-scale analyses.

The Markov Clustering (MCL) algorithm is a general-purpose clustering algorithm based on simulation of stochastic flow in graphs (van Dongen, 2000a,b). Clusters are inferred from a graph by the iterative application of “expansion” and “inflation” operations that reinforce and weaken connections between nodes in turn, terminating when all nodes in the graph have been resolved into clusters of the granularity specified by the “inflation parameter” (van Dongen, 2000a,b).
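The alternation of expansion and inflation can be illustrated with a toy version written in plain Python. This is a didactic sketch only, not the MCL program: it uses dense matrices, a fixed number of iterations instead of a convergence test, and no pruning, and the addition of self-loops and all function names are assumptions made for illustration.

```python
def normalise(matrix):
    """Scale every column to sum to 1 (a column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    return [[matrix[r][c] / sums[c] if sums[c] else 0.0 for c in range(n)]
            for r in range(n)]


def expansion(matrix):
    """Expansion: square the matrix so that flow spreads along longer paths."""
    n = len(matrix)
    return [[sum(matrix[r][k] * matrix[k][c] for k in range(n))
             for c in range(n)] for r in range(n)]


def inflation(matrix, power):
    """Inflation: raise each entry to `power` and re-normalise the columns,
    which strengthens strong flows and weakens weak ones."""
    return normalise([[x ** power for x in row] for row in matrix])


def toy_mcl(adjacency, power=2.0, iterations=50):
    """Cluster a small undirected graph given as a 0/1 adjacency matrix."""
    n = len(adjacency)
    with_loops = [[adjacency[r][c] + (1.0 if r == c else 0.0)
                   for c in range(n)] for r in range(n)]
    m = normalise(with_loops)
    for _ in range(iterations):
        m = inflation(expansion(m), power)
    # Each row that still carries flow lists the members of one cluster.
    clusters = {frozenset(c for c in range(n) if m[r][c] > 1e-6)
                for r in range(n)}
    return sorted(sorted(c) for c in clusters if c)
```

Larger values of `power` correspond to a larger inflation parameter and would be expected to give finer-grained clusters.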

This algorithm can be applied to graphs with either uni- or bi-directional edges, and with Boolean- or real-valued edge weights, but has been shown to perform best on graphs with undirected (bi-directional) edges (van Dongen, 2000a,b). Since the graph formed by protein-coding genes connected by reciprocal best BLAST hit (RBH) or most-similar hit (MSH) relationships is such a graph (§2.1.2), the “orthology inference by similarity” problem is a good match for the MCL algorithm. Other studies (e.g., Li et al., 2003) have demonstrated that it can be successfully applied to infer sets of potential orthologues from data similar to mine. Furthermore, the author’s implementation of the MCL algorithm can process extensive data sets with millions of nodes and edges, on a computer with modest computational and memory capacity, in a matter of minutes, due to efficient implementation of the matrix operations required for the expansion and inflation operations (van Dongen, 2000a,b).

The following sections describe how the MCL program was evaluated for its potential as an unsupervised method for inferring orthologues shared by a diverse group of eubacteria.

Methods

The 68 eubacteria that would form the basis of future analyses (Table 4.1) were selected for this exploratory analysis. Reciprocal best BLAST hits were collated from pairwise “genome” analyses using the criteria for the most similar hit (MSH) established in §2.2.3. These pairwise relationships formed a symmetric graph (i.e., where the relationship from node A to node B is the same as from node B to node A) with all edge weights set to the value of one.
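As an illustration of how such a graph might be assembled, the sketch below collapses a list of MSH pairs into unique undirected edges of weight one and formats them as tab-separated lines. The three-column label, label, weight layout and the function names are assumptions for illustration; the input format actually supplied to MCL is not stated here.

```python
def symmetric_unit_edges(msh_pairs):
    """Return sorted, unique undirected edges (a, b, 1.0) with a < b.

    A pair reported in both directions yields a single edge, and
    self-matches are ignored.
    """
    edges = set()
    for a, b in msh_pairs:
        if a != b:
            edges.add((min(a, b), max(a, b)))
    return [(a, b, 1.0) for a, b in sorted(edges)]


def to_lines(edges):
    """Format edges as "label<TAB>label<TAB>weight" lines."""
    return [f"{a}\t{b}\t{w:g}" for a, b, w in edges]
```

For example, the pairs `("MRFT01511", "MLAP00940")` and `("MLAP00940", "MRFT01511")` collapse to the single line `MLAP00940<TAB>MRFT01511<TAB>1`.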

Table 2.5: Default parameters for MCL v1.008

Parameter                 Default value   Command-line syntax
Prune number              10000           [-P n]
Selection number          1100            [-S n]
Recovery number           1400            [-R n]
Recovery percentage       90              [-pct n]
nx (x window index)       4               [-nx n]
ny (y window index)       7               [-ny n]
nj (jury window index)    7               [-nj n]
adapt-exponent            2.00            [-ae f]
adapt-factor              4.00            [-af f]
warn-pct                  30              [-warn-pct k]
warn-factor               50              [-warn-factor k]
dumpstem                                  [-dump-stem str]
Initial loop length       0               [-l n]
Main loop length          10000           [-L n]
Initial inflation         2.0             [-i f]
Main inflation            2.0             [-I f]

The MCL program (van Dongen, 2000a,b) is a command-line program with an extensive selection of configurable parameters. The most important is the inflation parameter (‘-I’) that controls the granularity of the clusters returned. The MCL manual advises that the best values for inflation parameter I are usually in the range [1.2, 5.0], where a value around 1.2 would result in a relatively coarse-grained clustering and 5.0 in a fine-grained clustering.

Clusters were generated using default values for all parameters, including the inflation parameter I = 2.0, using MCL v1.008 (Table 2.5). The clusters recovered were inspected manually and compared with the clusters recovered through the “expand and prune” method described in §2.2.4 and with the fans inferred from the MSH relationships used as input to the MCL program.

Results

The initial graph consisted of 179,200 nodes and 2,097,279 edges. The MCL program found 21,034 clusters, of which 546 (2.6%) were trivial clusters containing just one protein-coding gene.

Although the input to MCL consisted of 68 eubacteria, these were split into the 68 bacterial chromosomes and 38 plasmids, resulting in 106 separate “genomes”, each of which had been allocated a unique “organism code” (orgcode).

The largest cluster appeared to consist of a family of membrane transport proteins. Thus, both previous clustering methods described earlier in this chapter and the MCL method appear to be unable to discriminate the separate genes in this closely-related gene family.

These clusters were discarded prior to further analysis.

Some desirable properties of a good clustering are: (i) no protein-coding gene is represented in more than one cluster; and (ii) each cluster contains only one representative protein-coding gene from each organism. The first property was tested and found to be true. The second was not.
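Both properties can be checked mechanically. The sketch below assumes that each cluster is a list of identifiers of the form “XXXXnnnnn” and that the first four characters are the organism code; the function names are hypothetical and this is not the code used in the thesis.

```python
from collections import Counter


def orgcode(gene_id):
    """Organism code: the first four characters of an "XXXXnnnnn" identifier."""
    return gene_id[:4]


def genes_in_multiple_clusters(clusters):
    """Property (i): list the genes that appear in more than one cluster."""
    counts = Counter(g for cluster in clusters for g in set(cluster))
    return sorted(g for g, n in counts.items() if n > 1)


def clusters_with_repeated_organisms(clusters):
    """Property (ii): indices of clusters with several genes from one organism."""
    flagged = []
    for i, cluster in enumerate(clusters):
        per_organism = Counter(orgcode(g) for g in cluster)
        if any(n > 1 for n in per_organism.values()):
            flagged.append(i)
    return flagged
```

An empty result from the first function corresponds to property (i) holding, and any index returned by the second marks a cluster that violates property (ii).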

The maximum number of orgcodes in a cluster was 75 and the maximum number of sequences in a cluster was 149. Simple properties of the clusters, such as the number of organisms covered and the number of sequences, were calculated. The difference between the number of sequences and the number of orgcodes was calculated for the 428 clusters with representatives from at least 75% of the orgcodes.
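These summary properties are straightforward to compute. A minimal sketch follows, again assuming that the orgcode is the first four characters of each identifier; the function names and the `min_fraction` parameter are hypothetical.

```python
def cluster_summary(cluster):
    """Return (number of sequences, number of orgcodes, their difference)."""
    n_seq = len(cluster)
    n_org = len({gene_id[:4] for gene_id in cluster})
    return n_seq, n_org, n_seq - n_org


def widely_shared(clusters, total_orgcodes, min_fraction=0.75):
    """Keep clusters represented in at least `min_fraction` of the orgcodes."""
    return [c for c in clusters
            if len({g[:4] for g in c}) >= min_fraction * total_orgcodes]
```

A difference of zero indicates exactly one sequence per orgcode, while a positive difference counts the extra sequences that would need to be resolved.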

A randomly-selected proportion of these clusters were examined for accuracy, with the assistance of the fans and clusters prepared previously. They were found to be approximately equivalent. In many cases, extra protein-coding genes in a cluster were so similar to each other that the orthologous sequence could not be identified: (i) by conserved sequence alignment; (ii) by the presence of indels; or (iii) by inspection of a phylogeny.

Conclusions

The MCL algorithm yields high quality clusters of eubacterial-core genes that are approximately equivalent to those found through manual collation of sets from fans of MSHs and from the “expand and prune” method.

The MCL method is superior because it is a completely automated, reproducible method that does not rely on manual intervention after preliminary testing, and does not rely on a third-party source of reliable clusters, yet produces biologically-meaningful clusters, both in this study and in others (Enright et al., 2002; Kunin and Ouzounis, 2003; Harlow et al., 2004).

2.2.6 Inferring functional annotation for sets of orthologues

Background

Once clusters of potentially orthologous genes had been identified, the annotated function of the genes was needed to guide the methodology or interpretation of further analyses.

The only practical approach for performing this operation for a large number of genes was to use an accurate, unsupervised, bioinformatic method. As explained more fully in §2.1.3, methods based on significant similarity to a trusted database of annotated genes are effective in many, but not all, cases.

The following sections describe how a method for inferring functional annotation was developed for use in future analyses.

Methods and Results

Initially, I observed that clusters of eubacterial-core genes found by the methods outlined in §2.2.2 were highly conserved, with well-characterised function. Thus, it might be sufficient to transfer the annotation from a protein-coding gene in a reference “genome” to the rest of the set. I tried this method using a reference “genome”, Prochlorococcus strain MIT9313, selected because it had the largest number of protein-coding genes of any Prochlorococcus available at the time.

However, this did not prove to be very effective. There were too many cases where the annotation for the MIT9313 sequence was different to the majority of the sequences in the cluster, casting doubt on its accuracy. There were also too many cases where the MIT9313 protein-coding gene was annotated with an uninformative description, such as “unknown function” or “hypothetical protein”, even though other members of the set had consistent and informative annotation. Hence, this approach was abandoned.

By one estimate, based on 300,000 protein-coding genes in the UniProt database, only 3% of the protein-coding genes with an informative annotation have been experimentally verified; the remaining 97% are based only on bioinformatic methods (Brown and Sjolander, 2006). In such a system, genomes that were completed earlier could be expected to have lower quality annotation than those completed more recently. Nevertheless, the earlier annotations are effectively assigned more weight when new genomes are annotated through bioinformatic means, because the consensus of the existing annotation is weighted in favour of the original annotation of that cluster, which may be incomplete or inaccurate. This suggests that a method based on consensus, rather than transfer, should be used to infer the functional annotation of the cluster. Another approach that could be taken, and is advocated for future studies, is re-annotation of all the finished genomes to yield a uniform quality of annotation (e.g., SEED as provided through the RAST server).

Next, I explored the possibility of using the consensus of the annotation for a set of potentially orthologous protein-coding genes. Even when most members of a set are well-annotated, it is difficult to automate the allocation of a consensus annotation because even minor differences in annotation, due to the use of different gene names or descriptive language, or to typographical errors, are difficult to reconcile without manual curation. Even in cases where it would be possible to reliably infer the consensus annotation in a manner that could scale to hundreds of orthologous sets of protein-coding genes, the consensus annotation may still be inaccurate in cases where it has clearly been transferred from one potentially mis-annotated gene to another. Hence, I decided to explore options involving verification through third-party curated databases of genes.

The most popular curated databases of protein-coding genes are the Clusters of Orthologous Groups (COG) database (Tatusov et al., 2000, 2003) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Kanehisa, 1997; Ogata et al., 1999).

The COG database is a manually-curated database of potentially orthologous sets of protein-coding genes. All-against-all BLAST comparisons were conducted for 21 taxa selected for the initial database. Triplets of genes that were significantly similar were grouped into COGs. Any triplet COGs which shared two genes were merged into a single COG. Subsequently, genes from additional taxa were allocated to COGs by the COGNITOR program (Tatusov et al., 2000, 2001) on the basis of similarity to the existing COGs.
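The triplet-merging step described for the COG database can be illustrated with a short Python sketch. It is a simplified reading of the description above, not the procedure published by the COG authors: the functions `find_triangles` and `merge_triangles`, the requirement that the three genes of a triplet come from three different genomes, and the repeated application of the "shares two genes" rule to groups that have already been merged are assumptions of the sketch.

```python
from itertools import combinations


def find_triangles(edges, genome_of):
    """Triplets of mutually similar genes drawn from three different genomes.

    `edges` is an iterable of (gene, gene) pairs judged significantly similar;
    `genome_of` maps each gene to its genome.
    """
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    triangles = []
    # Exhaustive enumeration is cubic in the number of genes: fine for a sketch.
    for a, b, c in combinations(sorted(adj), 3):
        mutually_similar = b in adj[a] and c in adj[a] and c in adj[b]
        if mutually_similar and len({genome_of[a], genome_of[b], genome_of[c]}) == 3:
            triangles.append((a, b, c))
    return triangles


def merge_triangles(triangles):
    """Repeatedly merge any two groups that share at least two genes."""
    groups = [set(t) for t in triangles]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(groups)), 2):
            if len(groups[i] & groups[j]) >= 2:
                groups[i] |= groups[j]
                del groups[j]
                merged = True
                break  # indices are stale after the deletion, so rescan
    return sorted(sorted(g) for g in groups)
```

Two triplets that share a side, such as (a1, b1, c1) and (a1, b1, d1), collapse into a single four-gene group, while a triplet that shares at most one gene with any other group stays separate.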

The KEGG database project started in 1995 with the aim of being a computerised representation of biological systems (Kanehisa et al., 2006). It consists of a set of inter-related “resources”, or databases, that capture all the information, such as gene names, consistent naming of orthologous groups, metabolites, and regulatory mechanisms, obtained from complete genome data and selected partial genomes, and integrated using both automated methods and manual curation. In the design of the resources related to the annotation of genes and orthologous sets, the authors realised that the annotation would, in part, depend on the sample of genomes included at any time point. The GENES database is first populated from complete genome data from GENBANK (Benson et al., 1998). The unique KEGG Enzyme Commission (EC) identifier (Kotera et al., 2004; NC-IUBMB, 2010), that corresponds to a manually-curated, consistent, hierarchical naming system for the functional description of genes, is allocated using the GFIT program (Bono et al., 1998). It is then continuously re-evaluated against the continuously updated SWISS-PROT and other such databases (Kanehisa, 1997).

The gene function categories in the COG and KEGG databases are similar, the main difference being that membrane transport protein-coding genes that are spread through the COG categories are grouped into a separate category in KEGG (Table 2.6). I explored the possibility of using either or both of the COG and the KEGG databases to annotate my clusters of potential orthologues, but concluded that using the KEGG database alone gave superior results. First, the KEGG databases included more of the complete cyanobacterial genomes that would be used in my study. Second, the gene and orthologue annotation in the KEGG databases had been designed to be more robust to errors inherent in the transitive allocation of annotation. In particular, because the annotation of clusters, rather than of individual genes, is continuously re-evaluated as the database expands to incorporate more genomes that more closely approximate the true diversity of life in the environment, the databases should become progressively more accurate.

Table 2.6: Gene function categories in the KEGG and COG databases

KEGG                                                       COG
1110  Carbohydrate Metabolism                              A  RNA processing & modification
1120  Energy Metabolism                                    B  Chromatin structure & dynamics
1130  Lipid Metabolism                                     C  Energy production & conversion
1140  Nucleotide Metabolism                                D  Cell cycle control, cell division & chromosome partitioning
1150  Amino Acid Metabolism                                E  Amino acid transport & metabolism
1160  Metabolism of Other Amino Acids                      F  Nucleotide transport & metabolism
1170  Glycan Biosynthesis & Metabolism                     G  Carbohydrate transport & metabolism
1180  Biosynthesis of Polyketides & Nonribosomal Peptides  H  Coenzyme transport & metabolism
1190  Metabolism of Cofactors & Vitamins                   I  Lipid transport & metabolism
1195  Biosynthesis of Secondary Metabolites                J  Translation, ribosomal structure & biogenesis
1196  Xenobiotics Biodegradation & Metabolism              K  Transcription
1210  Transcription                                        L  Replication, recombination & repair
1220  Translation                                          M  Cell wall/membrane/envelope biogenesis
1230  Folding, Sorting & Degradation                       N  Cell motility
1240  Replication & Repair                                 O  Posttranslational modification, protein turnover, chaperones
1310  Membrane Transport                                   P  Inorganic ion transport & metabolism
1320  Signal Transduction                                  Q  Secondary metabolites biosynthesis, transport & catabolism
1330  Signaling Molecules & Interaction                    R  General function prediction only
1410  Cell Motility                                        S  Function unknown
1420  Cell Growth & Death                                  T  Signal transduction mechanisms
1430  Cell Communication                                   U  Intracellular trafficking, secretion & vesicular transport
1440  Development                                          V  Defense mechanisms
1450  Behavior                                             W  Extracellular structures
                                                           Y  Nuclear structure
                                                           Z  Cytoskeleton

Conclusions

The transfer of annotation from a reference “genome” was found to be unreliable for annotating the cluster of potential orthologues due to the reliance on a single “genome” annotation, which will have some proportion of unannotated or mis-annotated genes. A method based on the consensus annotation of the set of genes was also unsuitable because it is biased by the chain of annotation transfer already present in the set of annotations and because of practical difficulties in comparing free-text annotations that do not adhere to a strict naming convention. The method based on consensus of a set of curated annotations solves both these problems.

The KEGG ORTHOLOGY database was used as the basis of functional annotation because it is based on a biologically meaningful hierarchy of gene functional groups, described with a consistent ontology.
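The principle of taking a consensus over curated identifiers, rather than over free-text descriptions, can be sketched as a simple majority vote. The function name `consensus_annotation`, the 50% support threshold and the identifiers `ko_A` and `ko_B` are hypothetical placeholders chosen for this illustration; they are not real database identifiers, and the rule is not necessarily the one applied in this work.

```python
from collections import Counter


def consensus_annotation(curated_ids, min_fraction=0.5):
    """Majority vote over the curated identifiers of the genes in one cluster.

    `curated_ids` holds one identifier per gene, or None where the curated
    database has no entry for that gene. Returns (identifier or None, support),
    where support is the fraction of all genes carrying the winning identifier.
    """
    votes = Counter(k for k in curated_ids if k is not None)
    if not votes:
        return None, 0.0
    top, n = votes.most_common(1)[0]
    support = n / len(curated_ids)
    # Only accept the winner when a strict majority of the cluster agrees.
    return (top, support) if support > min_fraction else (None, support)
```

Because the vote is over identifiers from a controlled vocabulary, differences in spelling or descriptive language between genome annotations cannot split the vote, which was the practical obstacle to a consensus over free-text annotations.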

2.3 Data storage and manipulation

2.3.1 Computational requirements

I expected that I would be mining potentially hundreds of bacterial “genomes” for orthologues, cross-referencing against several data sources, and conducting computationally expensive phylogenetic analyses. As well as all the analyses required for the results presented in this work, I would need storage and computational resources for the preliminary and aborted analyses, and intermediate results. I needed to set up a framework where new software could be acquired and installed cheaply and easily, and where my research tools could be chosen by their merit, rather than their cost.

Early in the project, I realised that I would need a computing platform that would facilitate: (i) software development; (ii) database management; (iii) data storage; (iv) computationally intensive analyses that may run for several weeks; (v) compilation and installation of recently-developed third party software; (vi) bibliographic library management; and (vii) word processing facilities for my thesis. The only computing platform that would enable me to achieve all these goals at an affordable cost was a PC running Linux.

2.3.2 A relational database for data storage

The management of genomic data was going to be a significant aspect of the project. All data would have to be stored in a way that lent itself to the efficient organisation, storage, retrieval and cross-referencing of data. Most importantly, however, data had to be kept in a way that made it difficult to corrupt or create conflicting records.

The obvious solution was to use a relational database. I chose MySQL because it is a freely available, mature product, which is often used commercially. It is under active development and has important features such as rollback of transactions, which allows a previous state of the database to be recovered if the database becomes corrupt. Furthermore, MySQL databases can be accessed from Python using the MySQLdb module.

All the primary genomic data downloaded from the NCBI database, secondary data such as BLAST hits and sets of orthologues, and data from the KEGG, COG, CyOG and CYANORAK databases were stored in relational database tables in a MySQL database. The most important tables in the database are provided in Appendix A.1.

Although most data were stored in the relational database structure, operations that required frequent database lookups were impractically slow. In these cases, a significant improvement in speed could be achieved by first dumping the contents of a table to a file, and reading the information from there.
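The two ideas in this section, enforcing integrity through relational constraints and replacing repeated lookups with a table dumped to a file, can be illustrated with a self-contained sketch. To keep the example runnable without a database server it uses the `sqlite3` module from the Python standard library as a stand-in for MySQL and MySQLdb; the table names, column names and example rows are hypothetical and do not reproduce the schema in Appendix A.1.

```python
import csv
import os
import sqlite3
import tempfile

# In-memory SQLite database used as a stand-in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE genome (orgcode TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE gene (
    gene_id TEXT PRIMARY KEY,
    orgcode TEXT NOT NULL REFERENCES genome(orgcode),
    product TEXT
);
""")

# `with conn` commits the transaction on success and rolls it back on error.
with conn:
    conn.execute("INSERT INTO genome VALUES (?, ?)", ("orgA", "hypothetical genome A"))
    conn.executemany("INSERT INTO gene VALUES (?, ?, ?)",
                     [("g1", "orgA", "product 1"), ("g2", "orgA", "product 2")])

# A gene that refers to an unknown genome is rejected, so no conflicting
# record can be created.
try:
    with conn:
        conn.execute("INSERT INTO gene VALUES (?, ?, ?)", ("g3", "orgZ", "orphan"))
except sqlite3.IntegrityError:
    rejected = True
else:
    rejected = False

# Dump one table to a tab-separated file, then read it back into a dictionary
# so that later lookups do not touch the database.
dump_path = os.path.join(tempfile.mkdtemp(), "gene.tsv")
with open(dump_path, "w", newline="") as handle:
    rows = conn.execute("SELECT gene_id, orgcode FROM gene ORDER BY gene_id")
    csv.writer(handle, delimiter="\t").writerows(rows)
with open(dump_path, newline="") as handle:
    gene_to_orgcode = {gene: org for gene, org in csv.reader(handle, delimiter="\t")}
```

The dictionary lookup trades memory for speed, and the dump has to be regenerated whenever the underlying table changes.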

2.3.3 Software developed during this work

Most of the data manipulation conducted during this project was to explore some aspect of the data, to reformat data or to automate and manage the execution of repetitive tasks for multiple data sets.

In most cases, short Python scripts were rapidly developed. For tasks where more intensive computation was required, such as re-formatting the fans into a matrix representation, the script was prototyped in Python, then re-implemented in C for improved runtime speed.

2.4 Discussion

2.4.1 Sources of error in orthologue inference

The concept of orthology is simple (§2.1.1) and sets of orthologues are the basis of comparative genomic analyses, but the development of better methods for accurate inference of orthologues is still an active field of research. As discussed in the introduction to this chapter, the most widely-used methods are based on the use of significant similarity as a proxy for homology. Since the BLAST family of programs is the most widely used and respected general-purpose tool for identifying significant similarity between primary sequences, BLAST-based methods of orthologue inference dominate the field.

The BLAST programs have well-documented strengths (Altschul et al., 1997; She et al., 2009). They are good at identifying significant similarity between reasonably long, well-conserved genes, at either the nucleotide or amino-acid level. Extensive testing of BLAST parameters has identified good default parameters for common searches, whether they be “typical” searches or searches involving short, low-complexity sequences. As a result, BLAST is considered the best general-purpose tool for identifying pairs of the most similar sequences in two genomes.

If inter-genomic comparisons were conducted for genomes where there had been no recent gene duplications or losses, and all genes were long enough and without long, low-complexity regions, then MS-clusters could be inferred with little difficulty. Unfortunately, this idealised scenario bears little resemblance to real life. MS-clusters are generally inferred through the following process: (i) pairs of the most similar sequences are inferred; (ii) the pairs are assembled into a network of sequences connected by MS-pair relationships; and (iii) MS-clusters of genes are inferred through application of a pre-defined set of rules that define the level of similarity that might be expected to be shared by orthologous genes.
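The three-step process can be made concrete with a minimal Python sketch. Two simplifying assumptions are made that are not taken from this work: a most-similar pair is defined as a reciprocal best hit between two genomes, ranked by a similarity score such as a bit score, and the rule set in step (iii) is reduced to "every connected component with at least `min_size` genes is a cluster". The function names and the input format are likewise hypothetical.

```python
def ms_pairs(hits, org_of):
    """Step (i): reciprocal best hits between genes of different genomes.

    `hits` is an iterable of (query, target, score); `org_of` maps gene to genome.
    """
    best = {}  # (query gene, target genome) -> (best target gene, score)
    for query, target, score in hits:
        if org_of[query] == org_of[target]:
            continue  # ignore within-genome hits
        key = (query, org_of[target])
        if key not in best or score > best[key][1]:
            best[key] = (target, score)
    pairs = set()
    for (query, _), (target, _) in best.items():
        back = best.get((target, org_of[query]))
        if back is not None and back[0] == query:
            pairs.add(tuple(sorted((query, target))))
    return pairs


def ms_clusters(pairs, min_size=2):
    """Steps (ii) and (iii): build the network, then report connected components."""
    adj = {}
    for a, b in pairs:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, clusters = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adj[node] - component)
        seen |= component
        if len(component) >= min_size:
            clusters.append(sorted(component))
    return clusters
```

A recently duplicated gene that is the second-best hit in its genome never forms a reciprocal pair under this definition, which illustrates how the choice of rule determines which paralogues are excluded from a cluster.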

The method fails to result in perfect sets of truly orthologous genes because errors accumulate at every step of the process. Thus, in this work, the resulting sets have been referred to as most-similar clusters of genes.

Significant pairwise similarity may go undetected if either the query or the target sequence is too short to contain sufficient information to result in a statistically significant measure of similarity, if the target sequence has broken into two short genes, or if either the query or the target sequence contains a proportionally large low-complexity region.

Significant pairwise similarity may be inferred erroneously due to erosion of the historical signal from the accumulation of multiple substitutions at individual sites along the gene, convergent evolution, or the presence of closely-related paralogous sequences.

The process of finding MS-clusters of protein-coding genes assumes that genes likely to be orthologous are connected by a denser and/or stronger set of pairwise relationships than paralogous genes. Thus, the MS-cluster can be identified through application of a rule-based set of criteria based on pre-defined thresholds. However, the comparison of genomes that are closely-related and have evolved through gene duplication and gene loss events would confound this methodology. It is also very difficult to define a single set of criteria for identifying MS-clusters of genes across an arbitrary set of genomes which may have diverged recently or a long time ago.

2.4.2 Heuristic methods for dealing with error in orthologue inference

The present study demonstrates that the problems that could be expected to occur do indeed cause error in BLAST-based orthologue inference based on clustering. Fine-tuning the methodology, through exploration of parameters or schemes for defining pairwise homology or different clustering strategies, can reduce the error rate. However, there are also persistent errors inherent in the methodology that cannot be removed.

In general, however, clustering of significantly similar pairs of eubacterial-core sequences, which are thought to have experienced less recent duplication and loss than accessory protein-coding genes, is probably fairly accurate.

The semi-supervised clustering method based on seed protein-coding genes (§2.2.4) was motivated by the idea that unsupervised clustering, while appealing because it is unbiased by potentially subjective information, can sometimes be enhanced by the use of domain-specific information (e.g., Keswani and Hall, 2002). But I found that unsupervised clustering methods that perform the clustering operation on the entire graph rather than on a localised scale perform much better (§2.2.5). This could be because access to additional information, in the form of differential density of relationships among different genes in a gene family (van Dongen, 2008), aids the separation of the genes into clusters that correspond to different genes. Thus, methods that perform the clustering operation on the entire graph are preferred.

The problem of inferring most-similar clusters of protein-coding genes among a set of genomes is very similar to that of inferring functional annotation of a cluster. The preliminary analyses performed in this chapter demonstrate that methods that depend on the transfer of annotation from a single, curated gene to statistically significantly similar genes, are problematic for the same reasons outlined for orthology inference.
The best method identified here relied on a consensus of annotations, rather than transfer from a pairwise comparison. As a result of the work carried out in this chapter, the procedures detailed in the conclusions (§2.4.4) were adopted for future work.

2.4.3 Future directions

In the future, the identification of MS-clusters of protein-coding genes or inference of functional annotation can be improved by the incorporation of additional information in the form of phylogenetic groupings and/or gene order.

During the exploration and verification of MS-clusters of protein-coding genes, phylogenetic methods were found to be particularly useful for separating genes from gene families. Since this study was started, software such as PhyloGena (Hanekamp et al., 2007) has taken the approach of combining orthologue inference, multiple alignment and phylogenetic inference to aid functional inference. Similar tools, such as the “Level of Orthology from Trees” (LOFT) program (van der Heijden et al., 2007), also use phylogenetic methods to infer orthology.

2.4.4 Conclusions

The main aim of this chapter was to plan for the data and methodological issues related to the inference, storage and manipulation of orthologous sets of protein-coding genes inferred from whole bacterial genomes. These preliminary studies guided design decisions that would aid the analyses in subsequent chapters. On the basis of the investigations carried out in this chapter, I have reached the following conclusions:

1. A simple but effective method for inferring most-similar clusters of eubacterial-core protein-coding genes was to use BLAST similarity measures as a proxy for homology, model the homologous relationships as nodes and edges in a graph and use the MCL method to find clusters of protein-coding genes that are an approximation to the set of genes that may have evolved from a common ancestor without duplication.

2. The most accurate and efficient method for inferring the functional annotation of a potential set of orthologues was by a consensus method with a good, curated database like the KEGG database.

3. The use of a MySQL database helped enforce a strict relational structure and maintain data integrity.

4. In most cases, Python scripts were sufficient for manipulating the genomic data. For computationally intensive tasks, however, programs were re-implemented in the C programming language.

Chapter 3

Phylogenetic Inference: Methods and Preliminary Studies

3.1 Introduction

The first scientific attempts to classify organisms were based on comparison of morphological features (Hillis and Wiens, 2000). These were successful in recovering many relationships, but could not resolve population structures or reliably infer evolutionary histories (Ridley, 2004). Molecular data, either sequences of nucleotides or of amino-acids, evolve through a well-understood mechanism that can be mathematically described by a “substitution model” and can be used to make inferences as to how these sequences may have changed over time (Page and Holmes, 2003; San Mauro and Agorreta, 2010). This allows more accurate and finer-grain evolutionary relationships to be estimated.

Phylogenetic analyses based on molecular sequence data form the basis of bacterial classification (Garrity, 2001). These methods are subject to errors, which can be categorised into two classes: stochastic or sampling errors arising from the analysis of sequence alignments of finite length, and systematic error arising from non-historical signals in the data that result from a mismatch between the estimated and actual models of evolution. Error of either type can result in phylogenies that do not accurately represent the evolutionary relationships of the sequences. Systematic errors are particularly problematic when studying deep divergences. The historical signal over the scale of eubacterial evolution is potentially quite weak - a consequence of the long period of time over which, and the complex evolutionary processes through which, these sequences have evolved.

Phylogenetic analyses involve a long series of steps: (i) selection of taxa; (ii) inference of orthologous sequences from those taxa; (iii) multiple sequence alignment; (iv) selection of a phylogenetic inference method that is appropriate for the expected divergence in the sequences; (v) selection of an evolutionary model (if the method requires one); (vi) if the method does not have a built in termination condition, running the implementation for long enough to converge on a good solution; and (vii) estimation of the confidence of the inferred phylogeny. At each step, artefactual error can accumulate in the analysis due to poor or subjective choices, or to data unsuitable for the phylogenetic inference method if not properly handled.

3.1.1 Stochastic error from sampling

Models of sequence evolution assume that the taxonomic coverage of sequences selected for study is unbiased and represents the true diversity in nature, and that the sites in the aligned sequences are an unbiased sample of sites from an infinite theoretical alignment of the sequences. If either of these is false, the discrepancy between the model estimated for the data and the “true” model of evolution may lead to error of sufficient magnitude to change the topology or the branch length estimates in the resulting phylogeny. This sort of error may arise from samples that are too limited, are oversampled from one or more groups, are undersampled from one or more groups, have too many or too few out-group sequences, or contain sequences that are too short.

Stochastic error is generally addressed by more careful sampling or the addition of more data, for example, adding more taxa to even out biases or concatenating gene alignments.

Unfortunately, these techniques can yield other problems. In statistics, an estimator is said to be asymptotically consistent if the addition of more data brings the estimate closer to the true value. A phylogenetic estimator is said to be statistically consistent, given a model of site substitution, if the probability of its returning the true phylogeny approaches one as more characters (that is, sites of the input alignment) are added, irrespective of the parameters of the model (Steel and Penny, 2000). Unfortunately, no single method of phylogenetic analysis is consistent under all conditions (Felsenstein, 1978; Hendy and Penny, 1989; Huelsenbeck and Hillis, 1993; Yang, 1994; Huelsenbeck, 1995; Lockhart et al., 1996; Gascuel et al., 2001; Huelsenbeck and Lander, 2003; Susko et al., 2004; Philippe et al., 2005).
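Stated formally, and using notation introduced only for this illustration (it does not appear in the cited works in this form), let $T$ be the true phylogeny, $\theta$ the parameters of the substitution model and $\hat{T}_n$ the phylogeny estimated from an alignment of $n$ sites. Statistical consistency then requires:

```latex
\lim_{n \to \infty} \Pr\left( \hat{T}_n = T \mid T, \theta \right) = 1
\qquad \text{for every } T \text{ and every } \theta .
```

An estimator is inconsistent if there is some combination of tree and parameter values for which this probability does not converge to one, so that adding sites can strengthen support for a wrong tree.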

A common method for checking the robustness of a phylogeny is the bootstrap, a method whereby the strength of the support for each clade is estimated from the frequency with which the same grouping of taxa is recovered in a set of trees generated from pseudo-replicate alignments (Mueller and Ayala, 1982; Felsenstein, 1985a; Penny and Hendy, 1985, 1986). However, sampling or systematic error can mislead phylogenetic inference methods in such a way that there is 100% bootstrap support for an incorrect phylogeny (Phillips et al., 2004).

Hence, the bootstrap method is not fail-safe for evaluating the accuracy of a phylogeny.
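The resampling step at the heart of the bootstrap can be sketched as follows. Tree inference itself is outside the scope of the sketch: it is assumed that some external program turns each pseudo-replicate alignment into a tree, and that each tree has been summarised as a set of clades. The function names and data structures are assumptions made for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """One pseudo-replicate: sample alignment columns with replacement.

    `alignment` maps a taxon name to its aligned sequence; all sequences must
    have the same length. The replicate has the same number of columns.
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def clade_support(replicate_clades, clade):
    """Fraction of replicate trees that contain `clade`.

    `replicate_clades` holds one set of frozensets (the clades) per replicate tree.
    """
    clade = frozenset(clade)
    return sum(clade in clades for clades in replicate_clades) / len(replicate_clades)


# Usage: generate 100 pseudo-replicates from a toy alignment.
toy = {"t1": "ACGTAC", "t2": "ACGAAC", "t3": "TCGAAT"}
rng = random.Random(42)  # fixed seed so that the replicates are reproducible
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
```

Because every replicate is drawn from the same alignment, any bias present in the original sites is inherited by all replicates, which is why high bootstrap support cannot rule out systematic error.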

3.1.2 Systematic error from non-historical signals

The great challenges in - understanding how life evolved, how the three domains of life are related to each other, how the main lineages within the three domains are related to each other, and the evolutionary processes through which life at all taxonomic levels evolves - share a common problem: the information required to answer these questions may have been lost, making it very difficult to infer the evolutionary relationships between the sequences being studied because of non-historical signals in the data that lead to artefactual error.

Phylogenetic methods based on molecular sequence data attempt to infer the speciation events that gave rise to the observed sequences. In the best scenario, the sequences being analysed have only accumulated a moderate amount of change, such that the alignments recovered unambiguously reflect the changes that have occurred since the common ancestor and contain sufficient information to resolve the relationships between the sequences. Unfortunately, such a scenario occurs only rarely in actual analyses. In the worst case, the method returns a star-like pattern of divergences, indicating that each sequence is equally different from every other sequence. This pattern could arise if either very little change had occurred since the sequences had diverged from the ancestral sequence, or if there has been so much change since the ancestor that they are effectively random (Susko et al., 2005).

Despite many attempts, the relationships between the eubacterial phyla generally display a star-like pattern of divergences or a tree-like pattern of divergences with weak clade support values (e.g., Koonin, 2003; Creevey et al., 2004; Pisani et al., 2007; Bapteste et al., 2008). It is possible that this pattern is the legacy of a Biological Big Bang (Koonin, 2007) of phyla from the common ancestor of the Eubacteria, and that a star-like radiation is an accurate representation of their evolutionary history. However, it is also possible that the phylogenetic signal, that is, the information to discriminate relationships among lineages using primary sequence data, may have been lost through the accumulation of multiple changes at the same site of a sequence, referred to as “substitutional saturation”. This could result in similarities between the sequences that were not inherited from the common ancestor and, hence, are not indicative of shared evolutionary history (Felsenstein, 1978; Doolittle, 1999; Heath et al., 2008a). Thus much phylogenetic information may have been lost as a result of the ancient nature of the divergences or the rapid accumulation of changes.
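One standard way to see how multiple changes at the same site erase signal is the Jukes-Cantor (JC69) model of nucleotide substitution. The model is not used in this chapter and is introduced here purely as an illustrative assumption; the function names are hypothetical. Under JC69, the expected proportion of sites that differ between two sequences, p, depends on the true number of substitutions per site, d, as p = 3/4 (1 - exp(-4d/3)).

```python
import math


def jc69_expected_p(d):
    """Expected proportion of differing sites after d substitutions per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc69_distance(p):
    """Invert the relationship: estimate d from an observed proportion p."""
    if p >= 0.75:
        # At or beyond saturation the observed difference carries no
        # information about the amount of change.
        raise ValueError("sequences are saturated; distance is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The observed difference lags behind the true amount of change and
# flattens out as it approaches 0.75.
for d in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"d = {d:>4}: expected p = {jc69_expected_p(d):.4f}")
```

As d grows, p approaches 0.75, the value expected for two unrelated random nucleotide sequences, so that very different amounts of change become indistinguishable. This corresponds to the situation in which the sequences are effectively random with respect to each other.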

Alternatively, the sequences being analysed may have evolved under different evolutionary pressures and evolved shared characters that reflect a process that is interpreted by the phylogenetic method as a historical signal. This “convergent” evolution is sometimes referred to as “homoplasy”. For example, some genomes are known to have a higher or lower GC content. If two genomes have a higher GC content than others in the set being analysed, their sequences may share more similarities at the nucleotide or amino-acid level. These branches would appear together in a phylogeny in a phenomenon known as “long branch attraction” (Felsenstein, 1978).
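A simple first check for this kind of compositional bias is to compare the GC content of the sequences in a data set. The helper below is a hypothetical illustration and is not part of the analyses in this work; ambiguous characters and gaps are ignored.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous nucleotides of a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


# Usage: sequences whose GC content stands out from the rest of the set are
# candidates for artefactual grouping in a nucleotide-level analysis.
sequences = {"taxon1": "ATGGCGCCGGCC", "taxon2": "ATGAAATTTACT", "taxon3": "ATGGCCGGGCCC"}
composition = {name: gc_content(seq) for name, seq in sequences.items()}
```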

3.1.3 Aims

In this chapter, I outline the problems that can be expected in the inference of eubacterial phylogenies and describe the reasoning and preliminary investigations performed to choose the methodologies for subsequent analyses.

3.2 Methodologies and preliminary studies

3.2.1 Taxon selection

In a phylogenetic analysis, the first decision is often which taxa to include in the study. Good taxon sampling is essential for accurate phylogenetic analyses (Heath et al., 2008b). An increase in the overall number of taxa generally improves the accuracy of both the topology and the branch length estimates (e.g., Lecointre et al., 1993; Philippe and Douzery, 1994; Hillis, 1996; Yang, 1997; Hillis, 1998; Pollock et al., 2002; Hedtke et al., 2006). Having more taxa increases the amount of information from which to infer the evolutionary model, in part by providing more sequences from which to infer changes that would otherwise be “unobserved” (Felsenstein, 1978; Pollock et al., 1999; Sullivan et al., 1999; Pollock and Bruno, 2000; Pollock et al., 2002). The chance of long branch attraction (LBA) effects is also reduced because an increase in taxa bisects long branches (Hendy and Penny, 1989; Graybeal, 1998; Poe and Swofford, 1999; Poe, 2003). The use of parametric methods, like maximum-likelihood, can also improve the analysis as the models used in these analyses account for unobserved substitutions (Hillis et al., 1994; Pollock et al., 2002).

Most phylogenetic analyses yield unrooted trees, in which the deepest branching is unknown. If a phylogeny were generated with only the sequences of interest, the “in-groups”, the most basal of the in-group taxa would be unknown. The addition of a few taxa from a closely-related sister group of the taxa of interest results in a phylogeny with two large sub-clades separated by a relatively long branch. The branch that separates the in-groups from the “out-groups” forms the root of the in-groups (Kitching et al., 1998). The choice of the out-groups is important because the inclusion of too few could lead to the incorrect inference of the location of the root (Holland et al., 2003), while too many or too divergent out-groups could bias the evolutionary model estimated for the analysis (Smith, 1994; Hillis, 1998; Rannala et al., 1998). The general principle for selecting out-groups is that there should be a small number of them (Smith, 1994), perhaps three, that form a small out-group clade separated by short internal branches (Graybeal, 1998), and that they should be as closely related to the in-groups as possible, while still distant enough to form a distinctly separate clade (Hall, 2004).

In short, a sensible selection of taxa is one of the easiest ways that the accuracy of a phylogenetic analysis can be improved (Zwickl and Hillis, 2002). For a more comprehensive review of the issues related to taxon sampling, refer to Heath et al. (2008b).

3.2.2 Orthologue inference

Kimura’s neutral theory predicts that the rate of evolution of a gene remains the same in different organisms if the gene also has the same biological function (Kimura, 1983; Koonin, 2005). Only wet laboratory experiments can test whether two genes perform the same biological function. This can be costly, time-consuming, and simply unfeasible in the current genomic era. The best approximation is to assume that genes inferred to be orthologous have also retained their biological function. Hence, accurate phylogenetic analyses require alignments of genes that are not only homologous, but orthologous. In an exploration of eubacterial phylogenies, where the historical signal in the data may be weak, the accurate inference of orthologues is particularly important.

Homology is a term used to describe any genes that share a common origin (Fitch, 1970; Koonin, 2005). Orthology is an evolutionary concept used to describe genes that originate from a single ancestral gene in the last common ancestor of the genomes (Fitch and Markowitz, 1970; Koonin, 2005). And paralogy is used to describe the evolutionary relationship between genes that are related through a duplication event (Fitch, 1970; Koonin, 2005).

The identification of most-similar clusters of genes that could be used as an approximation to the set of orthologues across a set of genomes is usually based on using similarity metrics to infer relationships of putative orthology. Since this step of the phylogenetic analysis process was deemed to be a critical step in the present work, extensive preliminary analyses were conducted to ensure that the putatively orthologous sets of protein-coding genes used as the basis of analysis were of high quality. These analyses are described in Chapter 2.

3.2.3 Selection of nucleotide or amino-acid data

Phylogenetic analyses of protein-coding genes can be performed using either nucleotide sequences or the translated amino-acid sequences. The genetic code is degenerate, in that there is a many-to-one mapping between codons and amino acids. Many changes at the third codon position are “synonymous” in that they do not change the resulting amino acid (Agosti et al., 1996). Mutations tend to accumulate in the third codon position, providing good information for recent divergences, while those at the first two positions are more suitable for discriminating more ancient divergences (Ren and Paulsen, 2005). Thus, over long time scales, the historical signal at the third codon position can become degraded through multiple substitutions at these sites and, overall, changes should accumulate more slowly at the amino-acid level than at the nucleotide level. Since I expected that a diverse range of eubacteria would be subject to phylogenetic analysis in this work, I decided to perform all phylogenetic analyses of protein-coding genes at the amino-acid level to limit the effects of saturation.
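As an illustration of the degeneracy described above, the following Python sketch uses the standard genetic code to estimate, for each codon position, the fraction of single-base changes that leave the encoded amino acid unchanged. It is a minimal sketch and not part of the analyses in this work: the function name `synonymous_fraction` is hypothetical, all sense codons are weighted equally (an assumption that ignores codon usage), and changes to a stop codon are counted as non-synonymous.

```python
# Standard genetic code in compact form: bases are ordered T, C, A, G at each
# codon position and AMINO_ACIDS gives the one-letter translation ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def synonymous_fraction(position):
    """Fraction of single-base changes at a codon position (0, 1 or 2) that
    keep the amino acid unchanged, taken over all sense codons."""
    same = total = 0
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid == "*":
            continue  # start from sense codons only
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            same += CODON_TABLE[mutant] == amino_acid
    return same / total


for pos in range(3):
    print(f"codon position {pos + 1}: {synonymous_fraction(pos):.3f}")
```

Under these assumptions the third position has by far the largest synonymous fraction, which is consistent with the argument that third-position changes are largely invisible at the amino-acid level.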

3.2.4 Multiple sequence alignment

Although alignment-free approaches for phylogenetic analysis exist and are gaining in popularity (Vinga and Almeida, 2003; Hohl and Ragan, 2007; Jun et al., 2010; Sims et al., 2009; Wu et al., 2009), multiple sequence alignment (MSA) forms the basis of most phylogenetic analyses and it is important to perform it as accurately as possible, as inaccuracies could affect the conclusions of the analyses (Wong et al., 2008). The aim of MSA is to align sites in different homologous nucleotide or amino-acid sequences so that the similarities and differences at each site reflect the mutation, insertion and deletion histories that have accumulated in each of the sequences since their divergence from the common ancestral sequence.

Many programs are available to automate MSA (Edgar and Batzoglou, 2006). Popular choices include CLUSTALW (Larkin et al., 2007), which is most commonly run via the graphical interface CLUSTALX (Jeanmougin et al., 1998; Larkin et al., 2007), MAFFT (Katoh et al., 2002), DIALIGN-T (Subramanian et al., 2005), MUSCLE (Edgar, 2004), T-COFFEE (Notredame et al., 2000) and PROBCONS (Do et al., 2005). Most of these will perform well for sequences with little variation. However, some perform better for data that have evolved under particular evolutionary pressures.

I considered a program to be good if it aligned blocks of sites that were clearly homologous, inserted gaps to show indels, and did a reasonable job of aligning nucleotides or amino acids at the boundaries of gaps. In a review by Edgar and Batzoglou (2006), MAFFT, MUSCLE and PROBCONS were found to be the most accurate, in that order, but PROBCONS was generally too slow. On the recommendation of Golubchik et al. (2007), I suspected that MAFFT and DIALIGN-T would produce the most accurate alignment. To check this, however, I generated alignments of a diverse set of Eubacteria for common bacterial markers, such as the genes for ATP synthases or elongation factors, using five programs that I considered to be stable and promising: CLUSTALW, DIALIGN-T, MAFFT, MUSCLE and T-COFFEE, all with default parameters.

I found that the MAFFT program produced the best overall alignments, in that the clearly homologous regions were aligned correctly, the amino acids adjacent to gaps were generally aligned sensibly, and the alignments were performed quickly. Hence, in subsequent analyses, MAFFT was used for MSAs.

The most popular program for performing multiple alignments, CLUSTALW, also produced sensible alignments quickly. If I needed to generate a test alignment to quickly view the relationships between sequences, I often used the graphical CLUSTALX interface.

3.2.5 Removal of sites with uncertain homology

When the number of changes has been small, MSA can be performed accurately. But as the number of indels increases, there are invariably regions of the MSA that cannot be aligned with any certainty, whether by eye or by some automated process, and that contain sites that do not represent the evolutionary relationships between the sequences being analysed. Whether the errors are due to greater divergence or to errors in the alignment process, poorly-aligned regions can bias an analysis either because they do not reflect the evolutionary history of the sequences or because they are effectively random (Susko et al., 2005). Hence, it is generally recommended that such regions be removed prior to phylogenetic analysis because they introduce errors which can affect the topology of the result (Lake, 1991; Olsen and Woese, 1993; Swofford et al., 1996; Susko et al., 2005).

Since one of the key aims of this study was to infer phylogenies with a minimum of error, the removal of poorly aligned sites was an essential part of the data preparation. In many studies, the identification of poorly-aligned regions is performed by eye (e.g., Webster et al., 2002; Zinner et al., 2009). While this may be effective, it is time-consuming, and therefore not practical for large-scale analyses, and subjective, and therefore hard to reproduce reliably or to explain to others.

Initially, I selected a sample of alignments of eubacterial-core genes and identified sites which I felt were poorly aligned. These contained so many gaps that the positions of individual amino acids in individual sequences were haphazard and did not accurately reflect inheritance from the ancestral sequence. Although I tried to be consistent in the removal of sites, it was a tedious and time-consuming process and I noticed I was becoming more ruthless in my removal of sites as time went by, so I started searching for a reliable, objective and fast method for automating this task.

The GBLOCKS program (Castresana, 2000) appeared to be exactly what I required: it was a command-line program that removed regions with too many indels, leaving blocks of highly-conserved sites. I ran GBLOCKS with default parameters, except that sites containing up to 50% of the gap state were permitted (by default, all sites with at least one gap are removed). Comparison of alignments where sites had been removed by hand and by GBLOCKS revealed that GBLOCKS removed sites that I too had considered poorly aligned, yet retained enough sites in each alignment for phylogenetic analysis. Henceforth, sites of uncertain homology were removed from alignments with the application of GBLOCKS.
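The gap criterion mentioned above can be sketched in a few lines of Python. This is an illustration of the gap-fraction rule only, not a re-implementation of GBLOCKS, which additionally applies conservation thresholds and minimum block lengths. The function name `drop_gappy_columns` and the toy alignment are hypothetical, and treating exactly 50% gaps as acceptable is an assumption based on the wording "up to 50%".

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5, gap="-"):
    """Return the alignment with every column removed whose fraction of gap
    characters exceeds max_gap_fraction.

    `alignment` is a list of equal-length strings (one per sequence).
    """
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    if any(len(seq) != n_cols for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        col
        for col in range(n_cols)
        if sum(seq[col] == gap for seq in alignment) / n_seqs <= max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in alignment]


# Toy alignment: the third column is 75% gaps and is removed,
# the fourth column is 25% gaps and is kept.
toy = ["MK-LV", "MK-LI", "MR--V", "MKALV"]
print(drop_gappy_columns(toy))
```

A filter of this kind is deterministic, which addresses the reproducibility concern raised about trimming by eye.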

3.2.6 Screening for large variation in amino-acid composition

One of the major aims of this study was to evaluate whether the conflicting position of individual cyanobacterial sequences in previous phylogenetic studies was due to differences in the evolutionary history of these genes or due to variability from stochastic or systematic error. Significant variation in the distribution of nucleotides or amino acids among a set of sequences, sometimes referred to as compositional heterogeneity, has been recognised as a potential problem in phylogenetic analyses (e.g., Lockhart et al., 1992b; Foster et al., 1997; Mooers and Holmes, 2000; Jermiin et al., 2004) because it, along with convergent evolution, can lead to long branch attraction (LBA) effects (Felsenstein, 1978). Hence, methods for the removal of such aligned amino-acid sequences or individual amino-acid sequences were explored.

The TREE-PUZZLE v5.2 program (Strimmer and von Haeseler, 1996; Schmidt et al., 2002) provides quite a lengthy output file of properties of the amino-acid or nucleotide alignment. The χ² (chi-square) test output compares the amino-acid composition of each sequence to the frequency distribution assumed in the maximum-likelihood model. The null hypothesis is that the distributions are not significantly different. The test returns a p-value, which is the probability of observing a deviation at least as large as the one seen if the null hypothesis were true. If the p-value is less than a pre-determined significance threshold of, say, 0.05, the sequence is reported to have failed the test, indicating that it has a composition that is significantly different from that of the frequency distribution assumed by the ML model.
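The form of such a composition test can be sketched as follows. This is a minimal illustration of a chi-square goodness-of-fit statistic, not the TREE-PUZZLE implementation: the function name, the toy sequences and the uniform expected frequencies are all hypothetical, and nucleotides are used rather than amino acids to keep the example short. The critical value 7.815 corresponds to 3 degrees of freedom (4 states minus 1) at a significance threshold of 0.05.

```python
from collections import Counter


def composition_chi_square(sequence, expected_freqs):
    """Chi-square statistic comparing the state counts in one sequence with
    the counts expected under `expected_freqs` (a dict of state -> frequency).
    Characters not listed in `expected_freqs` (e.g. gaps) are ignored."""
    counts = Counter(ch for ch in sequence if ch in expected_freqs)
    n = sum(counts.values())
    return sum(
        (counts[state] - n * freq) ** 2 / (n * freq)
        for state, freq in expected_freqs.items()
    )


CRITICAL_DF3_P05 = 7.815  # chi-square critical value, 3 d.f., threshold 0.05

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed model frequencies
balanced = "ACGT" * 25
skewed = "A" * 70 + "C" * 10 + "G" * 10 + "T" * 10

for name, seq in (("balanced", balanced), ("skewed", skewed)):
    stat = composition_chi_square(seq, uniform)
    verdict = "failed" if stat > CRITICAL_DF3_P05 else "passed"
    print(f"{name}: chi-square = {stat:.1f}, {verdict}")
```

For amino-acid data the same calculation applies with 20 states and 19 degrees of freedom, in which case a different critical value is required.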

Another test, implemented in the HOMO program (Jermiin, unpublished), uses a matched pairs test of symmetry to test whether the sequences in the alignment are likely to have evolved under both a reversible and stationary process, by testing for internal and marginal symmetry, respectively (Ababneh et al., 2006). This works as follows.

An alignment can be thought of as an ordered sequence of patterns. In an alignment of r sequences of length n, where each sequence is an ordered series of characters sampled from a set of k character states, there are k^r possible unique patterns. For example, an alignment of 2 sequences of nucleotides which do not contain gaps and have a length of 100 could be thought of as an ordered series of 100 patterns, where each column is one of the 4^2 = 16 possible unique patterns (AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG or TT) that can be constructed when each position in a pattern must contain one of 4 character states (A, C, G or T).

The number of occurrences of each of the k^r unique patterns in the alignment can be summarised in a contingency table. A contingency table for two nucleotide sequences, Seq1 and Seq2, looks like Table 3.1. It is a 4-by-4 table where the (i,j)-th cell contains the number of occurrences in the alignment of the column consisting of nucleotide i in sequence 1 and nucleotide j in sequence 2, plus an extra column to the right and an extra row at the base of the table to store the row and column totals. The row totals in the extra column are the numbers of nucleotides of each type in Seq1, while the column totals at the base are the numbers of nucleotides of each type in Seq2.

Table 3.1: Contingency table of nucleotide counts in two nucleotide sequences

Nucleotides in   Nucleotides in Sequence 2
Sequence 1       A          C          G          T          Total
A                count_AA   count_AC   count_AG   count_AT   total1_A
C                count_CA   count_CC   count_CG   count_CT   total1_C
G                count_GA   count_GC   count_GG   count_GT   total1_G
T                count_TA   count_TC   count_TG   count_TT   total1_T
Total            total2_A   total2_C   total2_G   total2_T   seqlength
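A table of the kind shown in Table 3.1 can be built directly from a pair of aligned sequences. The Python sketch below is illustrative only; the function names `contingency_table` and `margins` are hypothetical, and columns containing a gap or any character outside the four nucleotides are skipped, which is an assumption about how such columns should be handled.

```python
STATES = "ACGT"


def contingency_table(seq1, seq2, states=STATES):
    """Return counts[i][j], the number of alignment columns with state i in
    seq1 and state j in seq2 (rows and columns ordered as in `states`)."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    index = {state: pos for pos, state in enumerate(states)}
    counts = [[0] * len(states) for _ in states]
    for a, b in zip(seq1, seq2):
        if a in index and b in index:  # skip gaps and ambiguous characters
            counts[index[a]][index[b]] += 1
    return counts


def margins(counts):
    """Row totals (composition of seq1) and column totals (composition of seq2)."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    return row_totals, col_totals


table = contingency_table("AACGTT", "ACCGTA")
for state, row in zip(STATES, table):
    print(state, row)
print(margins(table))
```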

If the pair of sequences has evolved under a reversible process, the contingency table will have “internal symmetry” across an axis from the top left to the bottom right of the table. That is, if the probability of nucleotide i changing to nucleotide j is the same as the probability of nucleotide j changing to nucleotide i (i.e., the process under which the sequences evolved was reversible), there will be an equal number of pattern ij and ji in the alignment and the corresponding counts in the contingency table will be the same.

Similarly, if a pair of sequences has evolved under a stationary process, the contingency table will have “marginal symmetry”. That is, the distribution of character states in the two sequences will be the same. Another way of expressing this idea is that the total number of nucleotide i in Seq1 will be the same as the total number of nucleotide i in Seq2. Hence the margin totals of the contingency table should be the same.

Bowker’s test (Bowker, 1948) returns a single probability value (p-value) for the null hypothesis that the contingency table has both internal and marginal symmetry. Thus, a p-value greater than a chosen significance threshold such as 0.05 indicates no support for the rejection of the null hypothesis: the data are consistent with the two sequences having evolved under a process that was both reversible and stationary.
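For readers who wish to see the calculation, the following Python sketch computes Bowker's statistic from a square table of counts. It is a minimal sketch rather than the HOMO implementation: the statistic is the sum over all pairs i < j of (n_ij - n_ji)^2 / (n_ij + n_ji), and the degrees of freedom are taken to be the number of pairs with a non-zero total, which is one common convention and is an assumption here. The helper `chi2_sf` uses the closed-form upper-tail probability for integer degrees of freedom so that no external library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term = total = 1.0  # i = 0 term of the finite series
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    total = sum(
        half ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


def bowker_test(counts):
    """Return (statistic, degrees of freedom, p-value) for a square table."""
    stat, df = 0.0, 0
    k = len(counts)
    for i in range(k):
        for j in range(i + 1, k):
            pair_total = counts[i][j] + counts[j][i]
            if pair_total > 0:  # pairs never observed carry no information
                stat += (counts[i][j] - counts[j][i]) ** 2 / pair_total
                df += 1
    p_value = chi2_sf(stat, df) if df > 0 else 1.0
    return stat, df, p_value


# Hypothetical table: 8 A->C columns but only 2 C->A columns.
example = [[10, 8, 0, 0], [2, 10, 0, 0], [0, 0, 10, 0], [0, 0, 0, 10]]
print(bowker_test(example))
```

With small off-diagonal counts the chi-square approximation becomes unreliable, so a sketch of this kind should be read as indicative only.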

The method for comparing 2 sequences described above can be extended to the comparison of r sequences by performing pair-wise comparisons and analysing the proportion of successful tests. I decided that if at least 90% of the pair-wise (or “matched-pairs”) tests returned a p-value greater than the chosen significance threshold of 0.05, then the set of aligned sequences was likely to have evolved under reversible and stationary conditions and was thus suitable for analysis with phylogenetic inference methods that assume this. Although the example above used nucleotide sequences with 4 character states, the approach is the same for amino-acid sequences with 20 character states.
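The 90% decision rule described above is simple to express in code. In the Python sketch below, the function names are hypothetical and the p-values are assumed to have been computed elsewhere, one per pair of sequences; for r sequences there are r(r-1)/2 such pairs.

```python
from math import comb


def n_pairwise_tests(r):
    """Number of distinct sequence pairs among r sequences."""
    return comb(r, 2)


def passes_symmetry_screen(p_values, alpha=0.05, min_pass_fraction=0.9):
    """True if at least `min_pass_fraction` of the pairwise tests returned
    a p-value greater than `alpha`."""
    if not p_values:
        raise ValueError("no pairwise tests supplied")
    passed = sum(p > alpha for p in p_values)
    return passed / len(p_values) >= min_pass_fraction


print(n_pairwise_tests(60))  # pairs to test for a 60-taxon alignment
print(passes_symmetry_screen([0.6, 0.3, 0.9, 0.2, 0.8, 0.4, 0.7, 0.5, 0.1, 0.01]))
```

Because many tests are performed on the same alignment, some failures are expected by chance alone, which is one motivation for tolerating a small proportion of failed pairs rather than requiring every pair to pass.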

The χ² test implemented in TREE-PUZZLE is effectively the same as the test for marginal symmetry in the matched-pairs test of symmetry. However, since the matched-pairs test of symmetry performs both tests, I chose to use it in subsequent analyses to determine whether the set of sequences is likely to have evolved under a process that is both stationary and reversible. The output of the pairwise comparisons can also be used to identify individual sequences that have a very different amino-acid composition. It is possible, using this test, to remove the offending sequences, then test the remaining sequences in the set. In this way, fewer putatively orthologous sets of proteins need be discarded due to problematic amino-acid frequency distributions.

3.2.7 Single versus concatenated gene phylogenies

Molecular phylogenetics has traditionally been used to infer evolutionary trees, or networks, from single genes. The evolutionary history of several genes could be summarised into a single consensus tree or network if each phylogeny covers the same taxa, and into a super-tree or super-network if there is a large, but not perfect, overlap in the taxa of each phylogeny.

As extensive amounts of genomic data have become available, a new field, “phylogenomics”, has appeared (Eisen and Fraser, 2003). This loosely covers any method by which the evolutionary history of whole genomes, rather than single markers such as the 16S rDNA gene for bacteria, is explored (Daubin et al., 2002; Delsuc et al., 2005; Ciccarelli et al., 2006; Dutilh et al., 2007). One of the popular methods by which this is achieved is to concatenate alignments of homologous genes and generate a single tree from the concatenated alignment. The use of more sequence data should provide more information with which to discern divergences between taxa and provide stronger statistical support for tree topologies (Hillis et al., 1994; Graybeal, 1994; Rannala et al., 1998; Delsuc et al., 2005; Heath et al., 2008b). Furthermore, the increase in the number of characters with respect to the number of taxa should reduce stochastic error (Phillips et al., 2004; Delsuc et al., 2005; Heath et al., 2008b).

There is a vast literature on the benefits of using concatenated gene analyses based on whole genome data (e.g., Dutilh et al., 2007). In general, these methods advocate the alignment of homologous genes, the concatenation of the alignments of multiple genes, and the phylogenetic analysis of the concatenated gene set to yield a single phylogeny (Delsuc et al., 2005). Earlier analyses used a single evolutionary model for the entire concatenated gene set (e.g., Bern and Goldberg, 2005), which is clearly not a good fit for the data, as different genes are likely to have evolved under different evolutionary models. However, even analyses where an appropriate evolutionary model is applied to each gene in the concatenated alignment and used to infer a single phylogeny are fundamentally flawed because some genes in the super-alignment may have genuinely different evolutionary histories due to the effects of homologous recombination (Suerbaum et al., 1998; Feil et al., 2001) or other LGT processes. Hence, forcing the data to yield a single phylogeny can serve to dilute the historical signal in the set of genes being analysed. On this basis, the analysis of concatenated alignments was deliberately avoided in the present work.

By having access to individual gene phylogenies, I would have the maximum amount of information on the evolutionary history of each gene. Only if multiple eubacterial-core protein-coding genes were found to support the same topology would I consider analysing them as a concatenated data set, or combining their individual trees to yield a consensus phylogeny.

3.2.8 Selection of evolutionary substitution models

Since the present study includes a diverse range of Eubacteria, the most appropriate phylogenetic methods are those that use amino-acid substitution models that account for unobserved or multiple substitutions (Heath et al., 2008b). Systematic error may result through use of an inappropriate model or inappropriate model parameters (Gojobori et al., 1982; Bruno and Halpern, 1999; Farris, 1999; Pollock et al., 2002; Keane et al., 2006; Hugall and Lee, 2007). Hence, careful model selection is required to minimise these sources of error. Models are usually evaluated using statistical methods that compare the fit of the proposed model with the observed data. The tests most widely used to compare the fit of a hierarchy of nested alternate models are hierarchical Likelihood Ratio Tests (hLRTs) (Huelsenbeck and Crandall, 1997) or information criteria such as the Akaike Information Criterion (AIC) (Akaike, 1974), the corrected Akaike Information Criterion (AICc) (Hurvich and Tsai, 1989), and the Bayesian Information Criterion (BIC) (Schwarz, 1978). These tests are implemented for nucleotide alignments in MODELTEST (Posada and Crandall, 1998) and for amino-acid alignments in PROTTEST (Abascal et al., 2005). Since the AIC favours a more general model with more parameters than is thought to be warranted in many cases, I decided that the most reliable test for selecting a model was the AICc, which penalises extra parameters more strongly (Hurvich and Tsai, 1989).
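The three information criteria named above have simple closed forms, sketched here in Python for a model with maximised log-likelihood lnL, K free parameters and sample size n. The log-likelihoods and parameter counts in the example are invented for illustration and are not results from this study, and taking n to be the number of alignment columns is an assumption (it is one common choice in phylogenetics). Lower values indicate a preferred model.

```python
import math


def aic(ln_l, k):
    """Akaike Information Criterion: 2K - 2 lnL."""
    return 2 * k - 2 * ln_l


def aicc(ln_l, k, n):
    """Corrected AIC: AIC + 2K(K + 1) / (n - K - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > K + 1")
    return aic(ln_l, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(ln_l, k, n):
    """Bayesian Information Criterion: K ln(n) - 2 lnL."""
    return k * math.log(n) - 2 * ln_l


# Hypothetical candidates for one alignment of n = 300 columns:
# (model label, maximised lnL, number of free parameters).
candidates = [("simple", -5230.0, 120), ("richer", -5205.0, 140)]
n_sites = 300
for label, ln_l, k in candidates:
    print(label, round(aic(ln_l, k), 1), round(aicc(ln_l, k, n_sites), 1),
          round(bic(ln_l, k, n_sites), 1))
```

The correction term grows rapidly as K approaches n, which is why the AICc penalises parameter-rich models more strongly than the AIC when alignments are short relative to the number of parameters.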

Programs like MODELTEST or PROTTEST, which employ a hierarchical system for evaluating models, use a statistic for testing the goodness of fit of a simpler null model and an alternate more complex and parameter-rich model, where the null model is a special case of the alternate model (Posada and Crandall, 1998). A limitation of these tests is that any models that do not fit into the nested hierarchy are not evaluated (Shapiro et al., 2006). In this study, where model accuracy was an important consideration, I considered using phylogenetic inference methods based on a more complex, parameter-rich model such as the extension of the Barry and Hartigan model implemented in Jayaswal et al. (2007). Unfortunately, while the use of a more complex model can improve the fit to the data, it can also increase the error in the estimated parameters (Huelsenbeck and Hillis, 1993; Penny et al., 1994; Bruno and Halpern, 1999). It has been demonstrated that use of more sophisticated models can reduce the chance of recovering the underlying tree and, in fact, that a set of parameters can be found to support any tree topology (Steel et al., 1994; Yang et al., 1995; Steel and Penny, 2000). Hence the use of over-parameterised methods is generally not recommended (San Mauro and Agorreta, 2010). Since there was also advice from the author of this method that it would not be computationally feasible to run the program for 60 to 100 taxa (V. Jayaswal, personal communication), I did not pursue use of this or similar models.

Since standard models of amino-acid substitution evaluated by the PROTTEST program are expected to operate well for the eubacterial-core protein-coding genes used in this study, they were used for subsequent analyses.

In addition to the phylogenetic analyses of eubacterial-core protein-coding genes, I expected that I would infer phylogenies based on the 16S rDNA gene. Since the RNA molecule has a strong secondary structure, mostly consisting of stem regions of paired nucleotides and loop regions consisting of independent sites, it is recommended that stem-loop models be used to account for this secondary structure (Telford et al., 2005). While nucleotide DNA substitution models generally have 4 states (or 5 if they treat the gap condition as an additional state), RNA substitution models for paired sites can have up to 16 states, one for each of the 16 possible pairs of 4 bases. The most general RNA model available in the PHASE program (Jow et al., 2002), the RNA7A model with 7 states, 21 rate parameters and 7 frequencies (Higgs, 2000; Savill et al., 2001; Jow et al., 2002), was used for stem regions and a general-purpose nucleotide model for the loop regions. In preliminary studies, the most popular general-purpose model recommended was the General Time Reversible (GTR or REV) model (Tavare, 1986).
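The state counts quoted above can be checked with a few lines of Python. The sketch assumes, as is usual for seven-state paired-site models, that the 7 states are the six canonical pairs (the Watson-Crick pairs and the GU wobble pairs, in both orientations) plus a single lumped mismatch class, labelled MM here; this grouping is an assumption of the illustration and is not spelled out in the text. Under a reversible model, one exchange-rate parameter per unordered pair of states gives 7 x 6 / 2 = 21 rate parameters.

```python
from itertools import product
from math import comb

BASES = "ACGU"

# All ordered base pairs that a stem position could in principle contain.
all_pairs = ["".join(pair) for pair in product(BASES, repeat=2)]

# Assumed grouping for a seven-state model: six canonical pairs plus one
# lumped mismatch class ("MM") for the remaining ten pairs.
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}
mismatches = [pair for pair in all_pairs if pair not in CANONICAL]
seven_states = sorted(CANONICAL) + ["MM"]

# One exchangeability per unordered pair of distinct states.
n_rate_parameters = comb(len(seven_states), 2)

print(len(all_pairs), len(mismatches), len(seven_states), n_rate_parameters)
```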

3.2.9 Tree-based phylogenetic inference methods

Introduction

I expected this study to involve large data sets. Inferring the origin of Prochlorococcus and marine Synechococcus sequences in a eubacterial context would entail analysing potentially hundreds of eubacterial-core genes, each with between 60 and 100 sequences, with multiple phylogenetic methods. So the phylogenetic methods used had to return accurate results quickly on the computational resources available.

The most common model for species divergence is a bifurcating tree. The most widely-used phylogenetic tree inference programs can be grouped by the optimisation criteria: (i) maximum parsimony (MP), which finds the tree with the minimum number of evolutionary events (Eck and Dayhoff, 1966; Fitch, 1971; Hartigan, 1973; Edwards, 1996); (ii) minimum evolution (ME), which finds the tree with the minimum evolutionary distance across the tree (Edwards and Cavalli-Sforza, 1963; Kidd and Sgaramella-Zonta, 1971; Rzhetsky and Nei, 1992a,b, 1993); (iii) neighbor-joining (NJ), which uses a clustering method to recover a tree using an estimate of the evolutionary distances between every pair of sequences (Saitou and Nei, 1987; Studier and Keppler, 1988); (iv) maximum-likelihood (ML), which finds the tree with the maximum likelihood estimate, given the observed data and the substitution model (Edwards and Cavalli-Sforza, 1963; Neyman, 1971; Kashyap and Subas, 1974; Felsenstein, 1981); and (v) Bayesian inference (BI), which finds trees with a high posterior probability, given the observed data, a model of evolution and prior knowledge (Rannala and Yang, 1996; Mau, 1996; Li, 1996; Yang and Rannala, 1997; Archibald et al., 2003a). In addition to the above, there are more conservative tree inference methods that only resolve edges with strong support.

Initially, I expected that parametric phylogenetic inference methods such as ML or BI, which are computationally expensive but more accurate, would be required as many sites in the sequences would have experienced multiple substitutions and some LGT was expected. However, since phylogenetic analyses would be central to my work, and previous studies had suggested that accuracy would be an issue for the taxa being studied, all these methods were considered. They were evaluated theoretically and through preliminary analyses to establish whether their use would be practical.

Background

Maximum parsimony is often rejected outright as an option for more ancient divergences because it does not utilise an explicit model of evolution and there are well-documented evolutionary scenarios that can bias the method towards incorrect phylogenies (Steel and Penny, 2000). However, the situation is not really that simple. Careful comparison of the optimisation criteria of MP and ML has revealed that some “flavours” of ML, such as the maximum average likelihood (MauL) method, return exactly the same trees as the maximum parsimony method in some circumstances (Penny et al., 1994; Tuffley and Steel, 1997; Steel and Penny, 2000). The ML method can also be inconsistent, whether the model used to generate the sequences is the same as or different from the model used to infer the tree (Huelsenbeck and Hillis, 1993; Waddell, 1996; Yang, 1996; Huelsenbeck, 1998; Siddall, 1998; Bruno and Halpern, 1999). Furthermore, there are some evolutionary scenarios where the MP method returns the correct tree while ML does not (Steel and Penny, 2000). I was open to the idea of using MP in this study but, after weighing up the evidence, decided that I too was worried about the lack of an evolutionary model, and about the possibility that, even if the topologies inferred were correct, the poor handling of unobserved substitutions by the MP method might distort branch length estimates. Since short internal branch lengths were expected, the most accurate estimates of branch lengths would be required. Hence, I decided that MP would not be used. For similar reasons, ME was also rejected.

The ML method (Felsenstein, 1981) is well-regarded and considered superior for its use of a well-defined model of evolution (Felsenstein, 1981; Strimmer and von Haeseler, 1996). All possibilities of obtaining the data are considered and the likelihood function is a consistent and powerful technique in statistical inference (Holder and Lewis, 2003). The sound statistical framework can give accurate results for divergent sequence data or cases where the historical signal has been partially obscured by multiple substitutions (San Mauro and Agorreta, 2010). Hence, various implementations were tested.

The main difference between ML and BI methods is that BI allows the user to specify prior information, usually in the form of an increased probability for some tree topologies, that is taken into consideration in the tree inference process (Archibald et al., 2003b). If no prior information is available, all trees are specified as having equal probability. If such a “flat prior” is used, the method is sometimes referred to as “maximum integrated likelihood” (MIL) because, from a theoretical point of view, the method is effectively an ML method that uses the same optimality criterion but a different method to search the possible space of trees and model parameters (Huelsenbeck et al., 2002; Archibald et al., 2003a). In addition, BI permits searching over alternative models and topologies using Markov Chain Monte Carlo (MCMC) methods and the estimation of a posterior probability for each subtree. Although I expected that ML and BI trees would be very similar, I also evaluated the feasibility of using BI in further analyses.

The Neighbor-Joining method is different to the other methods mentioned so far, in that pairs of sequences are compared to yield a “distance matrix”, which, in turn, is used in a clustering process that yields a tree (Nei and Kumar, 2000). Both the choice of distance metric and the clustering method can affect the accuracy of the inferred tree, but in general, NJ trees tend to be quite similar to trees inferred with more computationally expensive methods such as ML (Page and Holmes, 2003). However, to check whether there were any major differences with the ML phylogenies, preliminary analyses were conducted using NJ methods.

A problem with the tree-based methods described above is that they tend to yield the best possible fully-resolved tree, thus hiding any conflicting signals in the site-patterns. Some methods, like the Buneman Tree (BT) (Buneman, 1971), only resolve edges with strong support. Its successor, the Refined Buneman Tree (RBT) (Moulton and Steel, 1999; Brodal et al., 2003), resolves more edges than the BT method but is still more conservative than the NJ method. The RBT method was selected for further testing.
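To make the distance-matrix idea behind Neighbor-Joining concrete, the Python sketch below computes pairwise distances and then selects the first pair of taxa that the NJ criterion would join, using Q(i,j) = (n - 2) d(i,j) - sum_k d(i,k) - sum_k d(j,k) (Saitou and Nei, 1987; Studier and Keppler, 1988). It is deliberately incomplete: only the first joining step is shown, and the uncorrected p-distance is used as a stand-in for model-based distances such as those derived from the JTT or WAG matrices. All function names are hypothetical.

```python
def p_distance(a, b, gap="-"):
    """Proportion of differing sites among columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("sequences share no ungapped columns")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise p-distances with zeros on the diagonal."""
    n = len(seqs)
    return [
        [0.0 if i == j else p_distance(seqs[i], seqs[j]) for j in range(n)]
        for i in range(n)
    ]


def first_nj_pair(d):
    """Indices (i, j), i < j, of the pair with the smallest Q value."""
    n = len(d)
    row_sums = [sum(row) for row in d]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    return best[1], best[2]


# A small hand-made distance matrix for five taxa (a to e).
toy = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(first_nj_pair(toy))
```

In this toy matrix the pair with the smallest raw distance (the fourth and fifth taxa) is not the pair joined first, which shows how the Q criterion corrects for taxa that are distant from everything else.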

Methods

Twenty-one sets of eubacterial-core, putatively orthologous, protein-coding genes were inferred from the same set of taxa used to test the iterative clustering methods described in §2.2.2.

Neighbor-Joining trees were inferred using programs from PHYLIP v3.63 (Felsenstein, 2005) using 1,000 bootstrap replicates and the JTT (Jones et al., 1992) and WAG (Whelan and Goldman, 2001) substitution matrices, without a discrete Γ distribution, and with an unrooted consensus tree inferred using the “Majority rule (extended)” method.

Exploration of the literature on available software suggested that the most accurate, yet fast, maximum-likelihood programs were PHYML (Guindon and Gascuel, 2003), TREE-PUZZLE (Strimmer and von Haeseler, 1996; Schmidt et al., 2002), and TREEFINDER (Jobb et al., 2004). PHYML was run with the JTT and WAG substitution models. Since substitution rates are not equal for all sites in a protein-coding gene, a proportion of sites (I) is assumed to be invariant, that is, to have a zero rate of change, and a discrete Γ distribution is used to model rate heterogeneity across sites (Hasegawa et al., 1987; Felsenstein, 2004; San Mauro and Agorreta, 2010). In this analysis, the proportion of invariant sites I was estimated from the alignment, four discrete Γ rate categories were used, the α parameter was estimated using an ML method, an initial BIONJ tree (Gascuel, 1997) was used, and bootstrap clade support was estimated using 200 pseudo-replicates. TREEFINDER was run with JTT and WAG models, and the α parameter of the discrete Γ distribution estimated from the data. Edge confidence values were estimated with 100, 200 and 1,000 replicates using the BS (Felsenstein, 1985b) and LR-ELW methods (Strimmer and Rambaut, 2002). TREE-PUZZLE was run using 10,000 puzzling steps and parameter estimation based on an NJ tree.
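The bootstrap pseudo-replicates referred to above are generated by resampling alignment columns with replacement (Felsenstein, 1985b). The following Python sketch shows only that resampling step, under the assumption that the alignment is held as a list of equal-length strings; the function names and the fixed seed are hypothetical choices for the illustration. Each replicate would then be passed to the tree inference program, and clade support taken as the proportion of replicate trees containing the clade.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample the columns of an alignment with replacement, keeping the
    original number of columns and the original row order."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[col] for col in picks) for seq in alignment]


def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Generate `n_replicates` pseudo-replicate alignments reproducibly."""
    rng = random.Random(seed)
    return [bootstrap_replicate(alignment, rng) for _ in range(n_replicates)]


toy = ["MKLV", "MKLI", "MRLV"]
for replicate in bootstrap_replicates(toy, 3):
    print(replicate)
```

Because whole columns are resampled, each site pattern in a replicate is one that occurred in the original alignment; only the pattern frequencies vary between replicates.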

The most respected BI programs available at the time were MRBAYES (Huelsenbeck et al., 2001; Ronquist and Huelsenbeck, 2003) and BEAST (Drummond and Rambaut, 2007). Although BEAST has now become the standard program to use for Bayesian analyses, only MRBAYES was chosen for testing at this time because it was reputedly easier to use and the more complex analyses available in BEAST would not be required. The MRBAYES program was run with 4 chains, 2,200,000 generations sampled every 100 generations, 10% of the generations discarded as the burn-in period, four discrete Γ distribution categories, and a model with stationary state frequencies and substitution rates based on the WAG substitution model (Whelan and Goldman, 2001).

The RBT method implemented in SPLITSTREE v3.2 (Huson, 1998) was also tested on these data.

Results

In general, the trees inferred for each protein-coding gene using the NJ, ML and BI methods were similar. A characteristic feature is that they have short internal branches towards the centre of the tree and a tree-like structure within the clades with more taxa, such as the cyanobacteria and the green sulfur bacteria. The different methods and programs produced trees with minor differences in the precise position of leaf nodes, but were otherwise similar. A typical tree, inferred for the ppk gene that encodes the polyphosphate kinase protein, is shown in Figure 3.1A. In many cases, the trees inferred by the NJ and ML methods were identical both for the distantly-related taxa from different phyla and the more closely related taxa within the cyanobacteria. Where they were not, the variation was usually a minor re-ordering of closely-related taxa. Furthermore, the trees were robust to the use of different substitution matrices.

The maximum log-likelihood (lnL) values returned for the trees generated using different implementations of the ML method were comparable. For example, the lnL values of the ppk gene trees inferred by PHYML, TREE-PUZZLE and TREEFINDER were -26,142, -27,778 and -27,843, respectively. Overall, however, the lnL values of the PHYML trees were consistently higher than those of the TREE-PUZZLE trees, by approximately 5%. The lnL values for the TREEFINDER trees were more variable, but most fell within 5% of the values of the PHYML trees.

Figure 3.1: Phylogenetic trees inferred from the amino-acid sequence of polyphosphate kinase

The trees were inferred using: (A) the maximum-likelihood method implemented in PHYML, with the number of the 200 bootstrap trees that support key branches marked; and (B) the Refined Buneman Tree method implemented in SPLITSTREE. Photosynthetic taxa are highlighted and labelled.

The speed with which the various methods could compute a tree with clade support values was an important consideration. With the time and computational resources available, PHYML was able to compute trees for up to 200 BS replicates and TREE-PUZZLE could compute trees for a respectable 10,000 puzzling steps. In the TREEFINDER trees, the support values inferred using the BS and LR-ELW weights were comparable, but the BS calculations were much slower. Analysis of pseudo-replicates was limited to a maximum of 100 using the BS, but up to 1,000 using the LR-ELW method, with the computational resources available.

The MRBAYES program took a relatively long time to run: on the order of 2 to 4 weeks for each alignment tested. At the time, I had access to only two computers, so I decided that it was not practical to rely on this method, and that I would only consider its use if the maximum-likelihood analyses were suspected of inaccuracy and further computing resources became available. For the few trees that were inferred (data not shown), however, the topologies were very similar to those of the ML trees. So, based on these limited preliminary analyses, the use of BI trees would not have altered my conclusions.

Test trees were inferred using the RBT method because I hoped it would provide a useful visualisation of the clades that had high support. Unfortunately, the resulting trees were star-like (Figure 3.1B).

Conclusions

In general, the NJ and ML methods produced phylogenetic trees that were similar in topology, branch lengths, and log-likelihood values. The use of different ML methods and the use of different substitution matrices had very little effect on the resulting topologies or branch lengths. The robust nature of the phylogenies suggests that they are at least capturing a consistent signal in the data, even if it cannot be proven to be historical.

The different ML methods produced similar results, but TREEFINDER performed the best overall because it was faster, while producing trees of similar quality to the other methods, as measured by log-likelihood values. The speed was important because analyses of pseudo-replicate data sets were required to gauge the strength of the signal in the resulting eubacterial phylogenies. Since TREEFINDER was the only ML program for which analysis of 1,000 pseudo-replicate data sets was possible, it was the preferred ML program for future analyses.

The tree puzzling method constructs four-taxon trees (“quartets”) then combines them into an ML tree that agrees with the quartets (Strimmer and von Haeseler, 1996; Ranwez and Gascuel, 2001). TREE-PUZZLE was originally chosen for evaluation because it promised to be an efficient method for yielding more robust results for distantly-related taxa. It is a popular and widely used method, but there were concerns that the tree-puzzling method at the heart of the algorithm may itself introduce biases into the analyses (Ranwez and Gascuel, 2001; St John et al., 2003). Since the other ML programs performed as well as, if not better than, TREE-PUZZLE, it was not used in further analyses.
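To give a sense of the scale of the quartet step, the sketch below enumerates the four-taxon subsets of a small taxon set and the three unrooted topologies possible for each. It is only an illustration of the combinatorics: it does not evaluate likelihoods or perform puzzling steps as TREE-PUZZLE does, and the function name and the choice of taxon labels (borrowed from Figure 3.2) are assumptions made for the example.

```python
from itertools import combinations
from math import comb


def quartet_topologies(a, b, c, d):
    """Return the three possible unrooted topologies for four taxa, each as two sister pairs."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]


# Illustrative taxon labels only.
taxa = ["P-9313", "S-8102", "Gloeo", "Anabaena", "Ecoli"]

quartets = list(combinations(taxa, 4))
for quartet in quartets:
    print(quartet, "->", len(quartet_topologies(*quartet)), "candidate topologies")

# The number of quartets grows as "n choose 4", so it rises steeply with taxon number.
print("quartets for 30 taxa:", comb(30, 4))
```

For n taxa there are n(n-1)(n-2)(n-3)/24 quartets, which is why the cost of the quartet step depends strongly on the number of taxa analysed.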

The RBT method was trialled with hopes that it might reveal when other tree inference programs had resolved branching orders which lacked enough support to be resolved. Unfortunately, whether due to the conservative nature of the RBT method, the weak historical signal in the eubacterial protein-coding gene alignments, or the conflicting signals in sites of the alignments, the RBT trees were not able to provide such information.

On the basis of the above analyses, the TREEFINDER ML program was chosen for future phylogenetic analyses. The BI and RBT methods were rejected due to practical considerations. The NJ method was not pursued because the trees inferred were consistently so similar to the ML trees that whatever knowledge might be gained would be of insufficient value to warrant the extra computation.

3.2.10 Network-based phylogenetic inference methods

Introduction

Phylogenetic networks are generalisations of phylogenetic trees that can represent conflicting signals or alternative phylogenetic histories in the data (Fitch, 1997; Bryant and Moulton, 2004). They have become more popular in the genomic era because a network is a more appropriate model of evolution than a tree for evolutionary histories that involve hybridisation, LGT or recombination (Huson, 2005).

In the present study, I expected there might be more than one signal in the sites of eubacterial-core protein-coding gene alignments. I was looking for an easy way to visualise how much variation there was and whether this variation was localised to certain taxa or clades. Methods that could reveal instances of LGT or conflicting signals in the data were required, so both tree and network-based methods were explored for the different perspectives they might yield.

Several phylogenetic network methods were also evaluated for their ability to test the “tree-ness” of the phylogeny, identify instances of LGT, or visualise the conflicting signals in the sites of the aligned sequences.

Background

A “split” is a partition of the taxa into two non-empty sets. Split decomposition methods decompose site-patterns into bipartitions of the taxa and represent them as a “splits graph”. The intention of a split network is to capture the information in a sequence alignment or a set of trees as a set of weighted splits that can be combined to yield a phylogenetic network (Huson and Bryant, 2006). Splits graphs are thus a generalisation of a set of phylogenetic trees, able to represent incompatibilities within and between data sets and the evolutionary distances between taxa (Bryant and Moulton, 2004).
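The notion of a split, and of incompatibility between splits, can be illustrated with a short sketch. It uses the standard definition that two splits are compatible when at least one of the four intersections between their sides is empty; the function names and the five hypothetical taxa A to E are assumptions made for the example and do not correspond to any SPLITSTREE interface.

```python
from itertools import combinations


def make_split(side_a, taxa):
    """Represent a split as a pair of frozensets (A, B) that partition the taxa."""
    a = frozenset(side_a)
    b = frozenset(taxa) - a
    if not a or not b:
        raise ValueError("both sides of a split must be non-empty")
    return (a, b)


def compatible(s1, s2):
    """Two splits are compatible if at least one of the four side intersections is empty."""
    return any(not (x & y) for x in s1 for y in s2)


def is_tree_like(splits):
    """A set of splits fits on a single tree only if every pair of splits is compatible."""
    return all(compatible(s, t) for s, t in combinations(splits, 2))


taxa = {"A", "B", "C", "D", "E"}  # hypothetical taxa
s_ab = make_split({"A", "B"}, taxa)
s_abc = make_split({"A", "B", "C"}, taxa)
s_bc = make_split({"B", "C"}, taxa)

print(is_tree_like([s_ab, s_abc]))        # nested groupings, displayable on one tree
print(is_tree_like([s_ab, s_abc, s_bc]))  # AB|CDE conflicts with BC|ADE
```

When a pair of splits is incompatible, a splits graph draws the conflict as a box (a set of parallel edges) rather than forcing a single branching order.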

The three split network methods that seemed promising for my data were: (i) the median-joining (MJ) network, based on sequence data (Guenoche, 1986; Bandelt, 1992; Bandelt et al., 1995; Huber et al., 2001; Holland et al., 2005); (ii) the Neighbor-Net (NN) method, based on distances (Bryant and Moulton, 2002, 2004); and (iii) the Split Decomposition (SD) network, based on distances (Bandelt and Dress, 1992; Wetzel, 1994; Dress and Huson, 2004).

On further investigation, the MJ network was deemed unsuitable for the data in this study as it is most suited for closely-related sequences sampled from populations (Bandelt et al., 1999).

The Neighbor-Net method computes a set of weighted splits which can be converted into a splits graph using a drawing algorithm (Bryant and Moulton, 2004). If the historical signal in the data is perfectly congruent, the method returns a strictly bifurcating tree. If not, the non-compatible splits are used to generate a representation using a circular arrangement of the taxa and branches that are proportional to the evolutionary distance separating taxa (Bryant and Moulton, 2002, 2004; Huson and Bryant, 2006).

The Split Decomposition method produces a set of weakly compatible splits (Huson, 1998) which identify related groups of taxa (Bandelt and Dress, 1992) and can be visualised as a “splits graph” (Bryant and Moulton, 2004). It was trialled as an alternative to the Neighbor-Net method, as it has been used extensively in the literature.

Methods

The same data used with the methods in §3.2.9 were used to test phylogenetic network methods. The most stable implementations of the split-based phylogenetic network methods were those in SPLITSTREE (Dress et al., 1996; Huson, 1998; Huson and Bryant, 2006). Splits graphs were inferred using the Neighbor-Net and Split Decomposition methods implemented in SPLITSTREE v3.2. Distances were computed using the ProteinMLdist method implemented in SPLITSTREE, which calculates maximum-likelihood protein distance estimates using a specified model (Swofford et al., 1996; Huson and Bryant, 2007). A proportion of invariant sites (I) and a discrete Γ distribution of rates across sites (with shape parameter α) were assumed in the calculations to account for among-site rate variation (ASRV) effects. The values for I and α were estimated by PROTTEST. Networks were inferred using both the JTT and WAG substitution models to test if this had a significant effect on the topology. Networks based on alignments with and without poorly-aligned regions were also inferred, to evaluate the effect that this had on the “tree-ness” of the resulting networks. The feasibility of performing bootstrap analyses on the Neighbor-Net networks was also investigated.
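The among-site rate variation model described above can be illustrated by simulating relative site rates. This is a minimal sketch under stated assumptions: it draws from a continuous gamma distribution rather than the discretised version used in the analyses, and the values chosen for the proportion of invariant sites (0.2) and the shape parameter (0.8) are hypothetical, not estimates returned by PROTTEST.

```python
import random


def sample_site_rates(n_sites, p_invariant, alpha, rng):
    """Draw relative substitution rates under a Gamma+I model.

    A site is invariant (rate 0) with probability p_invariant; otherwise its
    rate is drawn from a gamma distribution with shape alpha and mean 1.
    """
    rates = []
    for _ in range(n_sites):
        if rng.random() < p_invariant:
            rates.append(0.0)
        else:
            # gammavariate(shape, scale) has mean shape * scale, so scale = 1 / alpha gives mean 1.
            rates.append(rng.gammavariate(alpha, 1.0 / alpha))
    return rates


rng = random.Random(42)  # fixed seed so the example is repeatable
rates = sample_site_rates(20000, p_invariant=0.2, alpha=0.8, rng=rng)

variable = [r for r in rates if r > 0.0]
print("fraction of invariant sites:", 1.0 - len(variable) / len(rates))
print("mean rate of variable sites:", sum(variable) / len(variable))
```

Small values of the shape parameter concentrate most sites at low rates with a few fast-evolving sites, which is the pattern the Γ component is intended to capture.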

Results

In general, the splits graphs inferred with the Neighbor-Net method exhibited poor resolution at the centre of the phylogeny, visible as regions shown with boxes indicating alternate branch orderings in the alignment (Figure 3.2A). However, the lineages that emanate from the central area of the network were generally tree-like, with a relatively low level of uncertainty in the order of the branches (Figure 3.2A).

Figure 3.2: Phylogenetic networks inferred from the amino-acid sequence of polyphosphate kinase. The networks were inferred using: (A) the Neighbor-Net algorithm implemented in SPLITSTREE; and (B) the Split Decomposition method implemented in SPLITSTREE. Photosynthetic taxa are highlighted and labelled.

The relationship between the Neighbor-Net analyses and the ML trees for a particular protein-coding gene was clear. In general, a Neighbor-Net analysis yielded a tree-like network that was similar to the topology of the corresponding ML tree, but with some areas highlighted as uncertain. For example, the PHYML tree inferred for the ppk gene (Figure 3.1A) indicates that the basal branches of the Prochlorococcus branch had the support of 172 and 177 of the 200 bootstrap replicates, corresponding to quite reasonable bootstrap support values of 86 and 89, respectively. The Neighbor-Net analysis, however, reveals that there is considerable variation in the branching order of the Prochlorococcus sequences in this clade, depicted as the large rectangular areas at the base of the Prochlorococcus clade (Figure 3.2A). This pattern is repeated for other clades in the ppk tree and Neighbor-Net network, suggesting that the low bootstrap support values in the ML tree and the uncertainty in the branching order of taxa visible in the Neighbor-Net are different representations of the same signal in the data.
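The conversion from bootstrap counts to support values quoted above is simple arithmetic, sketched below. The function name is an illustrative assumption. Rounding is done explicitly with integer arithmetic so that halves round up, because Python's built-in `round` rounds halves to the nearest even number.

```python
def bootstrap_support(n_supporting: int, n_replicates: int) -> int:
    """Convert a count of supporting bootstrap trees to a whole-number percentage (halves round up)."""
    if n_replicates <= 0 or not 0 <= n_supporting <= n_replicates:
        raise ValueError("need 0 <= n_supporting <= n_replicates and n_replicates > 0")
    # floor(100 * k / n + 1/2), written with integers to avoid floating-point rounding.
    return (200 * n_supporting + n_replicates) // (2 * n_replicates)


# Counts for the two basal Prochlorococcus branches in the ppk tree, out of 200 replicates.
for count in (172, 177):
    print(count, "of 200 ->", bootstrap_support(count, 200))
```

The two counts give 86 and 89 (from 86.0% and 88.5%), matching the values in the text.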

The graph-drawing method implemented in SPLITSTREE v3.2 reports a “fit index” for each representation of a Neighbor-Net analysis, indicating the percentage of the information in the distance matrix that could be captured by the graph (Salemi et al., 2003). The values from the 21 alignments where poorly aligned sites were retained had a mean of 96.6%, median of 96.7% and a standard deviation of 1%. Removal of poorly aligned sites had no significant effect: the mean and median were 96.3% with a standard deviation of 1%. The topologies of the Neighbor-Net phylogenies were also robust to the removal of poorly aligned sites. Bootstrapping with the Neighbor-Net method was found to be time-consuming, but not so slow as to be prohibitive for obtaining clade support values with around 100 replicates.
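The idea behind a fit index can be sketched as follows. The definition used here, 100 × (1 − Σ|d − s| / Σd), where d is an input distance and s is the distance implied by the weighted splits, is an assumption made for illustration and may differ in detail from the calculation performed by SPLITSTREE v3.2. The taxa, split weights and distances are hypothetical toy values.

```python
def split_distance(weighted_splits, x, y):
    """Sum the weights of all splits that separate taxa x and y."""
    return sum(weight for side, weight in weighted_splits if (x in side) != (y in side))


def fit_index(distances, weighted_splits):
    """Percentage of the input distances represented by the weighted splits.

    Assumed definition: 100 * (1 - sum|d - s| / sum d) over all taxon pairs.
    """
    total = sum(distances.values())
    residual = sum(
        abs(d - split_distance(weighted_splits, x, y)) for (x, y), d in distances.items()
    )
    return 100.0 * (1.0 - residual / total)


# Toy example: four trivial splits plus the split AB|CD, each with weight 1.
weighted_splits = [
    (frozenset("A"), 1.0), (frozenset("B"), 1.0),
    (frozenset("C"), 1.0), (frozenset("D"), 1.0),
    (frozenset("AB"), 1.0),
]
tree_like = {("A", "B"): 2.0, ("A", "C"): 3.0, ("A", "D"): 3.0,
             ("B", "C"): 3.0, ("B", "D"): 3.0, ("C", "D"): 2.0}
noisy = dict(tree_like)
noisy[("A", "C")] = 3.5  # one distance the splits cannot reproduce

print(fit_index(tree_like, weighted_splits))
print(fit_index(noisy, weighted_splits))
```

Under this definition, distances that are reproduced exactly by the splits give a fit of 100%, and any unexplained distance lowers the value.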

The splits graphs inferred using the Split Decomposition method (Figure 3.2B) were similar to the trees inferred using the Refined Buneman Tree method (Figure 3.1B) in that the main lineages formed a star-like radiation from the centre of the graph, indicating the inability of the method to provide meaningful indications of the relationships between them.

Both the Neighbor-Net and Split Decomposition methods were robust to the use of different substitution matrices.

Conclusions

The Neighbor-Net method was the only phylogenetic network method that was able to capture the conflicting groupings between sites of the same alignment and return phylogenetically informative groupings of taxa.

3.3 Discussion

3.3.1 Sources of error in phylogenetic inference

This chapter has documented an extensive list of theoretical reasons to support the empirical evidence that stochastic and systematic error can be large enough to affect the topology and evolutionary distances of phylogenies.

A key problem in the phylogenetic analysis process is the large number of steps (§3.1): error can be introduced at each step (§3.2), and these effects are cumulative. Even the best choices of data, models and algorithms cannot guarantee that the phylogeny inferred is a true representation of the evolutionary history of the sequences being analysed. Furthermore, the methods used to mitigate the effects of known problems can, unintentionally, introduce new errors (e.g., §3.2.7 and §3.2.8).

My overall conclusion is that it is unrealistic to expect that phylogenetic analyses of divergent sequences with a weak or eroded historical signal, or which involve divergent sequences that have evolved via different evolutionary processes for at least some of their evolutionary history, will ever be untainted by error. The only reasonable approach is thus to adopt the best practices for the data being explored to minimise non-historical signals.

Many of the results in the literature on the testing of different phylogenetic inference methods appear to be based on four-taxon trees, or trees with fewer than 10 taxa, and this trend continues even in quite recent studies (e.g., Wong et al., 2008). Since the effects of systematic errors appear to be more pronounced in phylogenies with fewer taxa (Heath et al., 2008b), I believe the conclusions of these studies are, themselves, affected by sampling and stochastic error. It is hoped that these tests will be repeated soon with more taxa and more recent variants of the phylogenetic methods available. Given that the trend in phylogenetics is to use large data sets, and that understanding the magnitude of artefactual error in an analysis is important for assessing the confidence of any results, it is vitally important that the studies which guide experimental design are accurate.

During these preliminary investigations, I was concerned that the evolutionary models being used were not sufficiently complex to model the processes under which the data may have evolved. Sequences in any set of putatively orthologous protein-coding genes could have evolved under some sort of covarion/covariotide evolutionary model. However, with no reliable way to estimate what the correct model might be, I decided that the risk of over-fitting the data was too great, and that the more conservative approach of using standard models, which effectively use the best “average” model for all sites and all branches of the phylogeny, was the most sensible.

One possible source of error which I did not properly investigate here was branch length heterogeneity (BLH) (Tuffley and Steel, 1998; Galtier, 2001; Huelsenbeck, 2002; Wang et al., 2007). Since the analysis of sets of sequences which have large variation in terminal or internal branch lengths can distort the phylogenetic inference process by introducing systematic error (Lyons-Weiler and Takahashi, 1999), I should probably have explored methodologies for screening out these sequences here.

3.3.2 Strategies for reducing sources of error

Despite the problems with, and limitations of, the phylogenetic inference methodologies described, the preliminary analyses conducted suggest a data preparation and analysis methodology that will support the overall aims of this study.

This study would involve phylogenetic analysis of deeply divergent eubacteria and potentially rapidly-evolving cyanobacterial sequences. Thus, the main criterion for a method was that it perform well with sequences that contain multiple substitutions. The preliminary analyses conducted with tree-based phylogenetic inference methods concur with the literature (e.g., Steel and Penny, 2000) that maximum-likelihood is the most suitable method to use in this situation, based on accuracy, robustness and speed. Bayesian inference could also have been adopted, but due to the limitation of available computational resources this was not possible (§3.2.9).

Through exploration of the literature and investigations carried out in this chapter, it has become clear that any aspect of the phylogenetic inference process could contribute to stochastic or systematic errors of sufficient magnitude to affect the conclusions.

However, the adoption of appropriate data preparation methodologies, based on careful taxon selection, rigorous orthologue inference, consistent multiple sequence alignment, and removal of poorly aligned sites that dilute the historical signal in the data, could improve the overall accuracy of phylogenetic analyses. Screening for properties of the data that are known to confound the phylogenetic inference methods that will be used, an additional step that has not been used routinely in previous work, may also help. The combined use of tree- and network-based phylogenetic inference methods that complement each other may also help to evaluate the confidence of the groupings of taxa recovered for a particular gene. Analysis of concatenated gene sets is to be avoided as this unnecessarily imposes a single evolutionary history on sets of genes that may genuinely have different histories (§3.2.7).

The lack of resolution at the centre of the eubacterial phylogenies was a disappointing, but not unexpected (e.g., Creevey et al., 2004), preliminary finding. However, it does not affect the overall objectives of this study. Individual lineages that emanate from the central part of the phylogeny generally display a topology that is robust to use of different substitution models or phylogenetic inference methods or implementations (§3.2.9). Hence, there does appear to be a useful signal in the data. The lack of significant topological differences between phylogenies inferred with and without poorly aligned sites suggests that the errors associated with poorly aligned sites are not large enough to change the topology of the resulting phylogeny (§3.2.10). However, given that the error in phylogenetic analysis is cumulative, I decided that the benefit of removing such sites would outweigh the loss of characters for the analyses.

3.3.3 Conclusions

The main aim of the work in this chapter was to anticipate the data and methodological issues that would be associated with the extensive phylogenetic analyses described in the remainder of the work. The investigations carried out led to the following conclusions:

1. Poor choices of data and methods can contribute to sampling and stochastic errors. Disentangling cause and effect, and gauging the extent of individual effects, is difficult because the effects of these artefactual errors could be cumulative and not entirely predictable.

2. The preliminary analyses conducted indicate that, despite all these problems, the phylogenies inferred from individual protein-coding genes are robust to variation in the sites, evolutionary model, and inference method or implementation. The consistency of the phylogenetic groupings, which tend to follow accepted taxonomic lines, suggests that these phylogenies are suitable for exploring the more ancient origins of essential house-keeping genes in the Prochlorococcus and marine Synechococcus genomes.

3. The maximum-likelihood implementation in TREEFINDER was selected for future analyses because it returned trees with topologies that were very similar to those found by the respected PHYML program, but was much faster, to the extent that running analyses using 1,000 replicates to infer clade support values would be possible. Although bootstrap replicates are available in TREEFINDER, the much faster LR-ELW clade support method was chosen for future analyses because it returns similar results but is less computationally demanding.

4. Neighbor-Net analyses were found to be a suitable way to visualise incompatible or ambiguous signals in the data.

Chapter 4

Selecting data for exploring the origins of Prochlorococcus and marine Synechococcus genomes in a eubacterial context

4.1 Introduction

In a phylogenetic analysis, the choice of data and the method by which data are prepared can introduce error that affects the resulting phylogeny. I suspected that previous studies that have found a high level of phylogenetic conflict for the cyanobacteria have been affected by such errors. Hence, the careful selection and preparation of data were a major concern for my phylogenetic analysis of the cyanobacteria. In this chapter, I outline how data were selected to examine the origins of the Prochlorococcus and marine Synechococcus genomes in relation to the eubacteria.

4.1.1 The origins of the PS-group genomes are not known

Phylogenetic studies based on single genes have found that Prochlorococcus and marine Synechococcus spp. are more similar to each other than to other cyanobacteria and are derived from a common ancestor (e.g., Turner et al., 1989; Kishino et al., 1990; Lockhart et al., 1992a; Palenik and Haselkorn, 1992; Urbach et al., 1992; Hess et al., 1995; Ferris and Palenik, 1998; Moore et al., 1998; Partensky et al., 1999; Hess et al., 2001; Ting et al., 2001; Zeidner et al., 2003; Ashby and Houmard, 2006). The large number of shared protein-coding genes (Dufresne et al., 2008), the similarity of the primary sequences and the conservation of gene order in alignments of genomes of Prochlorococcus and marine Synechococcus (Dufresne et al., 2005) suggest that these two genera may have diverged as recently as 150 Mya (Rappe and Giovannoni, 2003; Dufresne et al., 2005) through genome-level rearrangements, gene duplication, divergence through random processes of mutation and LGT (Zhaxybayeva et al., 2009).

However, the more ancient origins of the genomes of the PS-group are not known. Phylogenetic studies of different cyanobacterial-core genes have not found evidence of a single shared evolutionary history of the PS-group taxa (Zhaxybayeva et al., 2004b, 2006; Beiko et al., 2005; Zhaxybayeva et al., 2009). It is possible that the conflicting phylogenies are due to the comparison of different taxa, the use of different data or different methods of data preparation, or the use of different phylogenetic methods. However, on the implicit assumption that each gene phylogeny bears at least some relationship to the true representation of the evolutionary history of the gene, it has been concluded that these cyanobacterial-core protein-coding genes have been subject to LGT and that LGT is more prevalent between Prochlorococcus and marine Synechococcus than between these taxa and other eubacterial phyla (Raymond et al., 2002, 2003b; Zhaxybayeva et al., 2006).

4.1.2 LGT between cyanobacteria may be common

The mechanisms of viral replication and infection, and the potential for the incorporation of genetic material from one host in another are well understood (Canchaya et al., 2003). Viruses that infect cyanobacteria, now commonly referred to as cyanophages, were discovered several decades ago (Safferman and Morris, 1963). However, the possibilities for cyanophage-assisted LGT only began to be fully appreciated with intensive study of viruses that infect the PS-group (Mann et al., 2003; Sullivan et al., 2003). The population size of Prochlorococcus in the wild appears to be a function of growth rate and predation (Monger et al., 1999; Mann and Chisholm, 2000; Landry et al., 2003; Worden and Binder, 2003). Predation by protists may account for 50% of bacterial cell deaths in marine environments (Fuhrman and Noble, 1995). Viral infection could also be a significant contributor to Prochlorococcus mortality, since most interactions between viruses and bacteria living in offshore waters result in infection (Suttle and Chan, 1994). Cyanophages are known to be plentiful in marine environments (Suttle and Chan, 1994; Angly et al., 2006) and it is estimated that 20-40% of cyanobacterial cell death may be due to cyanophages (Suttle and Chan, 1994; Fuhrman, 1999), making them a significant factor in cyanobacterial mortality (Wilcox and Fuhrman, 1994; Sano et al., 2004; Suttle, 2005; Sime-Ngando and Colombet, 2009). Co-infection of cyanobacterial hosts results in the passing of viral genes through a series of host cyanobacteria (Suttle, 2005). Genes may also have been acquired several times from other hosts (Lindell et al., 2004). This process can be thought of as bacterial genomes “sampling” genes from the bacterial “pan-genome” (all the genes in a set of genomes), retaining those that are beneficial and losing those that are not useful (Lawrence, 1999; Ochman et al., 2000). This process may explain the large diversity in the ecotypes observed in Prochlorococcus and marine Synechococcus (Sullivan et al., 2003).

Experiments have demonstrated that cyanophage-assisted LGT is possible: cyanophages can cross-infect Prochlorococcus and marine Synechococcus (Sullivan et al., 2003) and transfer the psbA and psbD photosynthesis protein-coding genes of cyanobacterial origin to the new host (Mann et al., 2003; Mann, 2003; Sullivan et al., 2003; Lindell et al., 2004; Millard et al., 2004) where they are expressed (Lindell et al., 2005). This system benefits the cyanophage because photosynthesis decreases after viral infection (Padan and Shilo, 1973; Zaitlin and Hull, 1987), but production of phage progeny in infected cyanobacteria depends on photosynthesis continuing until just before lysis (Mackenzie and Haselkorn, 1972; Sherman, 1976), resulting in selection for cyanophages that encode functional photosynthesis genes (Lindell et al., 2004). However, since other metabolic genes have been found in phage genomes (Sullivan et al., 2003; Weigele et al., 2007), there are good reasons to suspect that metabolic core genes are transferred as well.

Random cyanobacterial DNA fragments can easily be incorporated into viral genomes on a transient basis because of the way viruses replicate in their hosts (e.g., Rohwer and Thurber, 2009). In studies of closely-related Escherichia coli genomes, rates of transfer of several hundred genes every four million years have been estimated (Perna et al., 2001; Hayashi et al., 2001; Welch et al., 2002), but the transferred genes are present in the new host genome only briefly and are not retained (Daubin and Perriere, 2003). On a very small number of these occasions, however, the metabolic core genes could have been retained by the cyanobacterial hosts for longer. In an ecological environment populated by a vast number of cyanobacteria and cyanophages over millions of years, in which random fragments of DNA are transported between cyanobacterial lineages, some protein-coding genes in cyanobacteria have probably been acquired by cyanophage-assisted LGT.

The benefit to the cyanophage in this system is clear: the cyanobacterial host is necessary for the cyanophage to replicate and the death of the host is necessary for the survival of the viral progeny. However, this seemingly parasitic relationship may be stable and persist over evolutionary time periods. Although most cyanobacteria have a single chromosome, some have additional circular or linear chromosomes as well as one or more plasmids (e.g., the cyanobacterium Acaryochloris marina has one circular chromosome and nine plasmids (Swingley et al., 2008)). The picocyanobacteria of the PS-group have a single chromosome without additional plasmids and the high-light ecotypes of Prochlorococcus have a reduced genome (Rocap et al., 2003). Cyanophages may serve as a repository of a random assortment of potentially useful protein-coding genes (Zeidner et al., 2005). The acceleration in evolutionary rate experienced by genes in the viral host may be an auxiliary mechanism by which gene families increase in size in cyanobacteria (Lindell et al., 2004).

There is evidence that cyanophages more frequently cross-infect closely-related cyanobacteria (Sullivan et al., 2003) and that genetic material from a close relative can more easily be incorporated into a new host genome (Medrano-Soto et al., 2004). It is thus possible that some cyanobacterial core genes have experienced xenologous replacement, resulting in the conflicting phylogenetic relationships between the PS-group taxa and the cyanobacterial phylum observed in previous studies.

4.1.3 Desirable properties of data for this study

While LGT is acknowledged as an important process in bacterial genome evolution (Ochman et al., 2000; Doolittle et al., 2003), another possibility has not been fully explored. Each phylogeny contains some error in the branching pattern of the taxa due to the cumulative effects of errors at each step of the phylogenetic inference process, such as mis-identification of orthologues, alignment of non-homologous sites in the alignment, use of sequences with a high variation in nucleotide or amino-acid frequency distribution, model misspecification, or inclusion of sequences with an unusually high evolutionary rate. (For a comprehensive review of phylogenetic inference methods and problems with phylogenetic inference, see Felsenstein (2004); Heath et al. (2008a)). In some cases, these errors could be large enough to alter the relationships among taxa. The error could be manifest as the recovery of different phylogenetic relationships in the phylogenies inferred from different protein-coding genes. This may explain the incongruent phylogenies observed in previous studies (e.g., Raymond et al., 2003b), interpreted as evidence that the cyanobacterial core genes have different evolutionary histories.

Such sources of error have not been ignored in the past (see, for example, Zhaxybayeva et al., 2006, 2009). Nevertheless, I have paid special attention to these problems and have considered a greater number of possibilities than previous studies.

In phylogenetic analysis, another potential source of error results from the inclusion of homologous sequences undergoing unusually high evolutionary rates that may have accumulated substitutionally saturated sites, making them more similar to each other. In a phylogeny, these sequences often appear on unusually long terminal branches sharing a common ancestor, so this phenomenon is known as “long branch attraction” (LBA) (Felsenstein, 1978). In this study, I compare the evidence for diversifying selection in the reduced and rejected eubacterial-core protein-coding gene alignments and discuss the suitability of the reduced eubacterial-core for phylogenetic analysis.

I investigate whether some of the observed phylogenetic conflict could be due to analysis of alignments of protein-coding genes that have properties that have misled the phylogenetic inference process.

My approach has been to investigate the origins of the PS-group core genome using data that are representative of the PS-group core genomes, with properties that minimise the possibility of artefactual error in the resulting phylogenies.

In this section, I describe how I prepared a set of amino-acid alignments in which each alignment: (i) consists of orthologues; (ii) covers most of the taxa selected in my study; (iii) has low variance in amino-acid frequencies across the taxa; and (iv) contains no ambiguously-aligned sites.

Furthermore, this set of amino-acid alignments has: (i) good representation of gene functional categories, to ensure that my conclusions are based on phylogenies from protein-coding genes representative of the core genome of the PS-group taxa; (ii) a phylogenetic signal (information that permits inference of the evolutionary history of the taxa) that is representative of the phylogenetic signal of the eubacterial core; and (iii) low rates of diversifying selection across the taxa, to minimise the possibility of LBA effects.
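The requirement for low variance in amino-acid frequencies across the taxa can be illustrated with a simple calculation. The sketch below is one possible way to quantify compositional variation and is not necessarily the test applied in this study; the function names, the toy sequences and the taxon labels are all hypothetical.

```python
from collections import Counter
from statistics import pvariance

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(sequence):
    """Relative frequency of each of the 20 amino acids, ignoring gaps and ambiguity codes."""
    counts = Counter(c for c in sequence.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no unambiguous amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def frequency_variance(alignment):
    """Variance of each amino acid's frequency across the taxa in one alignment."""
    freqs = [aa_frequencies(seq) for seq in alignment.values()]
    return {aa: pvariance([f[aa] for f in freqs]) for aa in AMINO_ACIDS}


# Toy alignment with hypothetical taxa; taxon3 is compositionally biased towards alanine.
toy_alignment = {
    "taxon1": "MKTAYIAKQR",
    "taxon2": "MKTAYLAKQR",
    "taxon3": "MRSAAAAKAA",
}

variances = frequency_variance(toy_alignment)
most_variable = max(variances, key=variances.get)
print("most variable residue:", most_variable, "variance:", variances[most_variable])
```

An alignment in which one or more taxa have a markedly different composition, such as the alanine-rich sequence in the toy example, would show a high variance for the affected residues and could be flagged for closer inspection.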

4.1.4 Aims

The aims of this part of the study were to:

1. identify a set of “core” protein-coding genes common to most of the eubacteria selected for study;

2. identify a subset of these eubacterial-core protein-coding genes that are not expected to induce phylogenetic artefacts; and

3. verify that the reduced eubacterial core is suitable for answering questions on the origins of the PS-group genomes because it is representative of the entire eubacterial core and not just a subset of it.

4.2 Methods

4.2.1 Selection of taxa

Since phylogenetic inference methods assume that the sequences being analysed are orthologues, and orthologues are most reliably inferred from similarity searches between whole genomes (Lake and Rivera, 2004), only bacteria for which whole genomes were available were considered for this study.

Bacteria were selected from those for which full genomes were available in the NCBI REFSEQ database (Pruitt et al., 2005) in August 2007. All available genomes from the cyanobacteria and the anoxygenic photosynthetic groups were included, along with a sampling of non-photosynthetic bacteria from as many phyla as possible.

The initial selection of 68 bacteria did not include any representatives from the anoxygenic Heliobacteria (within the Firmicutes) or the recently discovered Chloracidobacteria (within the Acidobacteria) (Bryant et al., 2007a) because full genomes were not then available (these were added later).

4.2.2 Inference of orthologous, eubacterial-core protein-coding genes

The initial data for my analysis consisted of the complete complement of translated protein-coding genes from 68 selected bacteria, downloaded from the NCBI REFSEQ database.

These protein-coding genes were partitioned into the 71 chromosomes and 35 plasmids present in the 68 bacteria. Each of the protein-coding genes was used as the query sequence of a BLASTP (Altschul et al., 1990, 1997) search against the 105 other chromosomes and plasmids to identify similar sequences. The separation into chromosomes and plasmids facilitated the identification of homologous genes at different loci of the genome, and in particular of genes carried in plasmids. Sets of protein-coding genes in different genomes that are thought to be orthologous were inferred using the following two-step approach.

I inferred pairs of homologous protein-coding genes using a combination of BLAST similarity metrics similar to those used in previous studies (Raymond et al., 2002; Edgar and Batzoglou, 2006; Shi and Falkowski, 2008) and thresholds tested for their suitability for my data (§2.2.3). I did this with strict criteria in order to minimise the accidental inclusion of paralogues (Zhaxybayeva et al., 2006). I performed the similarity search using the default parameter values of the BLOSUM62 matrix (Henikoff and Henikoff, 1992), gap opening and extension penalties of 11 and 1, respectively, and an E-value threshold of 0.001. The occasional false-positive hits due to the use of a relatively permissive E-value were addressed by the more stringent criteria used later to infer pairs of orthologous sequences. I considered a pair of protein-coding genes orthologous only if: (i) they were each other’s best BLASTP hit with E-value < 10⁻⁵, score > 160 and identity > 35%; (ii) the alignment of the two sequences covered at least 65% of the length in both directions; and (iii) they were each other’s only significant hit in the other chromosome or plasmid. These criteria are similar to those used in studies such as Gill et al. (2005), which used an E-value < 10⁻⁵, identity > 35% and match proportion > 75% in both directions.

In a study involving whole genomes from many organisms, the only practical way to infer a large number of sets of orthologues is to choose criteria for a set to be orthologous and to use an algorithm with those criteria to find candidate sets. Here, the protein-coding genes and the suspected homologous relationships between pairs of protein-coding genes can be represented as a graph, where each protein-coding gene is a node in the graph and each homologous relationship is an edge.
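The pairwise criteria above can be illustrated with a minimal Python sketch. It is not the code used in this study: the Hit record, its field names and the function names are hypothetical, and a "significant" hit is assumed to mean one that passes the E-value, score, identity and coverage thresholds. The returned pairs are the edges of the graph described above.

```python
from collections import defaultdict, namedtuple

# One BLASTP hit; field names are illustrative, not a real BLAST output schema.
Hit = namedtuple(
    "Hit", "query subject q_replicon s_replicon evalue score identity q_cov s_cov"
)

def passes_thresholds(hit, max_evalue=1e-5, min_score=160,
                      min_identity=35.0, min_cov=0.65):
    """Criteria (i) and (ii): E-value, score, identity and two-way coverage."""
    return (hit.evalue < max_evalue and hit.score > min_score
            and hit.identity > min_identity
            and hit.q_cov >= min_cov and hit.s_cov >= min_cov)

def orthologous_pairs(hits):
    """Criterion (iii): keep pairs that are each other's only significant hit
    in the other chromosome or plasmid (replicon)."""
    significant = defaultdict(set)   # (gene, other replicon) -> genes hit there
    replicon_of = {}
    for h in hits:
        replicon_of[h.query] = h.q_replicon
        replicon_of[h.subject] = h.s_replicon
        if h.q_replicon != h.s_replicon and passes_thresholds(h):
            significant[(h.query, h.s_replicon)].add(h.subject)
    pairs = set()
    for (gene, replicon), partners in significant.items():
        if len(partners) != 1:
            continue                 # more than one candidate: possible paralogues
        (partner,) = partners
        if significant.get((partner, replicon_of[gene]), set()) == {gene}:
            pairs.add(frozenset((gene, partner)))
    return pairs
```

Requiring a unique reciprocal hit makes the best-hit condition automatic, which is why no explicit ranking of hits appears in the sketch.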

The Markov Clustering (MCL) algorithm is a general-purpose clustering algorithm based on simulation of stochastic flow in graphs (van Dongen, 2000a,b). Following a preliminary investigation to determine parameters for the clustering operation that would yield clusters that are approximately equivalent to hand-curated clusters collated from the raw BLASTP results (see §2.2.1), I chose the MCL program with default parameters to infer clusters of protein-coding genes from the graph of homologous pairwise relationships. Clusters containing sequences from at least 75% of the 68 bacteria, but with no more than 7 sequences per bacterium, were retained for further analysis. Each of the clusters was checked to confirm that each sequence belonged to only one cluster. The threshold of 7 sequences (approximately 10% of the number of bacteria) from the same bacterium was chosen to eliminate clusters with many similar paralogues.
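The cluster-retention rule can be sketched as follows. This is an illustrative Python fragment rather than the scripts used in the study; it assumes each cluster is available as a list of (bacterium, gene identifier) tuples, and the function names are hypothetical.

```python
from collections import Counter

def keep_cluster(cluster, n_bacteria=68, min_fraction=0.75, max_per_bacterium=7):
    """Retain a cluster only if it covers at least 75% of the bacteria and no
    bacterium contributes more than 7 sequences."""
    per_bacterium = Counter(bacterium for bacterium, _gene in cluster)
    covers_enough_taxa = len(per_bacterium) >= min_fraction * n_bacteria
    few_paralogues = max(per_bacterium.values(), default=0) <= max_per_bacterium
    return covers_enough_taxa and few_paralogues

def genes_in_multiple_clusters(clusters):
    """Return gene identifiers that occur in more than one cluster
    (expected to be empty for a valid partition)."""
    seen = Counter(gene for cluster in clusters for _bacterium, gene in cluster)
    return {gene for gene, count in seen.items() if count > 1}
```

With 68 bacteria, the 75% threshold corresponds to sequences from at least 51 bacteria.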

4.2.3 Selection of a reduced eubacterial core for phylogenetic analysis

Each of the protein-coding genes selected for further analysis was aligned by MAFFT v5.861 (Katoh et al., 2002) using default parameter values. Each data set was manually allocated a consensus annotation that included the most complete gene or gene product description plus any keywords that would serve as links between the consensus annotation and alternate descriptions. Each alignment was visually inspected and adjusted. Sequences that were markedly different from the rest of the alignment were further scrutinised. Clusters containing many sequences from the same organism were usually found to belong to large families of protein-coding genes related through paralogy (i.e., a duplication event). Any sequences assumed to be paralogous because they were notably divergent in an otherwise highly conserved alignment, or pseudogenes because they contained a long deletion, usually at the start or the end of the sequence, were removed from the data set. The alignment and annotation for each remaining sequence were checked for quality and consistency. However, if the remaining sequences from the same “genome” were so similar that the putative orthologue could not be distinguished, all remaining sequences were retained. This may have resulted in the inclusion of paralogues or pseudogenes in the analysis, but I argue that inclusion of such similar sequences would not affect the relative position of that taxon in the resulting phylogeny.

The remaining sets of protein-coding genes were re-aligned with MAFFT v5.861. Sites of uncertain homology were removed with GBLOCKS v0.9.1b (Castresana, 2000) using default parameter values with the exception that sites containing up to 50% of the gap state were permitted (whereas by default, all sites with at least one gap are removed).

Since the use of aligned molecular sequences that do not meet the assumptions of the phylogenetic inference methods has been shown to produce biased results (e.g., Ho and Jermiin, 2004), I reduced the eubacterial core set further. Alignments with fewer than 100 amino acids were removed from further analysis because they were considered too short for accurate phylogenetic analysis. Sets that had not evolved under stationary, reversible and homogeneous (SRH) conditions were identified by applying the matched-pairs test of symmetry (Ababneh et al., 2006). Rather than considering the alignment as a whole, the method compares all unique pairs of sequences in the alignment. The null hypothesis of the test is that the two sequences have been sampled from the same distribution. Hence, a p-value greater than the significance threshold indicates lack of support for the alternative hypothesis. An alignment was considered acceptable for phylogenetic analysis if at least 90% of the tests indicated no support for the alternative hypothesis (i.e., produced a p-value > 0.05). All protein-coding gene alignments that did not meet these criteria were removed from further analysis.
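The pairwise test and the 90% acceptance rule can be sketched in Python. The statistic below is Bowker's matched-pairs statistic, which is assumed here to be the form of the test of symmetry that was applied; the function names and the set of characters treated as missing data are illustrative. The p-value would be taken from a chi-squared distribution with the returned degrees of freedom, a step omitted here to keep the sketch free of external libraries.

```python
from collections import Counter
from itertools import combinations

MISSING = {"-", "?", "X"}  # characters skipped when counting site patterns (assumption)

def bowker_statistic(seq_a, seq_b):
    """Return (statistic, degrees of freedom) for one pair of aligned sequences.

    counts[(i, j)] is the number of sites with state i in seq_a and j in seq_b;
    symmetry predicts counts[(i, j)] == counts[(j, i)] on average.
    """
    counts = Counter((a, b) for a, b in zip(seq_a, seq_b)
                     if a not in MISSING and b not in MISSING)
    states = sorted({state for pair in counts for state in pair})
    statistic, df = 0.0, 0
    for i, j in combinations(states, 2):
        n_ij, n_ji = counts[(i, j)], counts[(j, i)]
        if n_ij + n_ji > 0:          # cells with no observations carry no information
            statistic += (n_ij - n_ji) ** 2 / (n_ij + n_ji)
            df += 1
    return statistic, df

def alignment_is_acceptable(p_values, alpha=0.05, min_fraction=0.90):
    """Accept an alignment if at least 90% of pairwise tests give p > alpha."""
    passed = sum(1 for p in p_values if p > alpha)
    return passed >= min_fraction * len(p_values)
```

For an alignment of n sequences, the rule is applied to the n(n-1)/2 p-values obtained from all unique pairs.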

4.2.4 Representation of gene functional groups in the reduced eubacterial core

In an investigation of the origins of entire genomes, it helps to use as many eubacterial core protein-coding genes as possible within the constraints of computational power and the ability to identify a eubacterial-core set of protein-coding genes. I tried to identify a set of highly conserved eubacterial-core genes from my diverse set of 70 eubacteria, which would include representative genes from different biochemical pathways.

I now checked whether the sets of protein-coding genes of my “reduced” eubacterial core are representative of the biochemical pathways present in the entire bacterial genome core. This was done by allocating each of the gene products to the corresponding functional categories (Table 4.2) from the BRITE and ORTHOLOGY databases of KEGG (Ogata et al., 1999).

To check whether the reduced eubacterial core was representative of the entire eubacterial core in terms of alignment length and SRH properties, a scatterplot of these two properties was generated using the R statistical package (R Development Core Team, 2009).

Similarly, to check whether the use of alignment length and SRH failure rate biased the gene functional groups represented in the reduced eubacterial core, I collated the list of gene functional categories to which protein-coding genes in the entire eubacterial core belong and plotted them as a scatterplot of SRH failure against post-GBLOCKS alignment length.

Each of the protein-coding genes in an alignment that is present in the KEGG database belongs to a gene functional group. In some cases, the protein-coding genes in an alignment may not belong to the same gene functional group, so some gene sets could map to more than one KEGG gene category. The best way to resolve these discrepancies would be to inspect each group manually and to decide on a consensus allocation. However, since only an approximate comparison of the full and reduced eubacterial core was required, I did not do this: all unique tuples based on length, SRH failure rate and KEGG category were retained, even if a single set of protein-coding genes mapped to more than one KEGG category.

4.2.5 The phylogenetic signal in the reduced and rejected eubacterial cores

A possible problem is that the protein-coding genes rejected on the basis of SRH testing could, for any number of reasons, have a different phylogenetic signal to those of the reduced eubacterial core. If so, conclusions about the evolutionary history of the PS-group genomes based on the reduced eubacterial core might not be valid. In order to compare the phylogenetic signal present in the reduced and rejected eubacterial cores, I generated ML trees from each of the amino-acid alignments and combined these trees using supertree and supernetwork methods.

Supertree and supernetwork methods combine the information present in a set of input trees into a single phylogeny or network (for a review, see Bininda-Emonds et al., 2002). While a consensus method is able to combine the information in trees that contain exactly the same taxa, supertree and supernetwork methods are able to combine trees with partially overlapping taxa. This is more practical for genomic studies since many genes will be present in almost all, rather than all, taxa. Thus, supertree or supernetwork methods permit use of the information in gene trees that would otherwise need to be discarded.

A more detailed description of supertree and supernetwork methods is provided in the results section (§4.3.4). I chose to use both a supertree method and a supernetwork method to compare the phylogenetic signal in the reduced and rejected eubacterial cores.

In order to prevent the inclusion of paralogues in the analysis, each of the post-GBLOCKS amino-acid alignments of the reduced eubacterial core that had not been augmented with sequences from Acaryochloris or Heliobacterium was filtered to remove sequences from any chromosome or plasmid that occurred more than once in that alignment. For example, if there were n bacterial chromosomes in an alignment, but (n+1) sequences, due to there being two representatives from a particular chromosome, the two sequences from the same chromosome were removed prior to further analysis.
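The filtering rule in the preceding paragraph can be written as a short Python sketch. It assumes an alignment is held as a list of (chromosome or plasmid identifier, aligned sequence) tuples; the function name and data layout are illustrative only.

```python
from collections import Counter

def drop_duplicated_replicons(alignment):
    """Remove every sequence whose chromosome or plasmid occurs more than once.

    Both (or all) copies are removed, not just the extra ones, because the
    orthologue cannot be told apart from its putative paralogue.
    """
    occurrences = Counter(replicon for replicon, _sequence in alignment)
    return [(replicon, sequence) for replicon, sequence in alignment
            if occurrences[replicon] == 1]
```

For example, an alignment with n chromosomes but (n+1) sequences is reduced to (n-1) sequences.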

Maximum-likelihood trees were inferred for each of the 280 amino-acid alignments using TREEFINDER with the WAG+Γ4+I model, where the α and I parameters were estimated from the data, with empirical amino-acid frequencies. Clade support values were estimated using the LR-ELW method (Strimmer and Rambaut, 2002) using 100 replicates in order to obtain a more robust tree than would be obtained by a single iteration. However, clade support values were ignored in the generation of the supertrees and supernetworks.

Supertrees were inferred from the ML trees estimated from the reduced eubacterial core and the protein-coding genes of the rejected eubacterial core using the CLANN v3.2.2 program (Creevey and McInerney, 2005). Prior to analysis, one or two representatives from each of the three smaller chromosomes of Ralstonia, Agrobacterium and Dehalococcoides were removed from the input trees to prevent problems with the bootstrapping procedure in CLANN. The best supertree was found using the AVCON (Average Consensus) method (Lapointe and Cucumel, 1997; Levasseur et al., 2003), which was chosen over the five other methods available because it infers a supertree topology with branch lengths. The best supertree was found by performing a search through the supertree space using the heuristic SPR (Subtree Prune and Re-graft) method (Swofford and Olsen, 1990).

Clade support values were obtained by performing input tree bootstrapping (Creevey et al., 2004; Burleigh et al., 2006) using 100 subsampled tree sets. Heuristic search was used to find the best tree for each of the bootstrap tree sets. The bootstrap trees that contained all unique taxa in the original set of input trees (also referred to as the universally distributed source trees) were used to create the consensus tree and only relationships with 50% support or greater (including congruent minor components) were included in the consensus.

A comparison of the distribution of the scores of the best supertree from each of the bootstrap replicates and the score of the best supertree from the original data gives an indication of whether the real data contains a signal that is better than random (Creevey and McInerney, 2005). The relative standard deviation of the distribution of scores (%SD) is defined as the absolute value of the coefficient of variation. It gives an indication of variation in the scores obtained through the bootstrapping procedure. If the pruned supertree and the source tree are the same, the score takes the value of zero. If the supertrees obtained through the bootstrapping procedure are similar, the value of the %SD will be small. Finally, clades of the best supertree (that has branch lengths) that were also present in the consensus tree (that has clade support values) were annotated with the support values from the consensus tree. Due to computational constraints, bootstrapping was not conducted for the supertree inferred from the rejected eubacterial core.
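As a small worked illustration of the %SD defined above, the following Python sketch computes the absolute coefficient of variation of a list of bootstrap supertree scores, expressed as a percentage. The use of the sample (n-1) standard deviation is an assumption, and the function name is hypothetical.

```python
import statistics

def percent_sd(scores):
    """Relative standard deviation (%SD): |standard deviation / mean| x 100."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)    # sample standard deviation (assumption)
    return abs(sd / mean) * 100.0
```

Scores of 9, 10 and 11 have a mean of 10 and a sample standard deviation of 1, giving a %SD of 10.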

Supernetworks were inferred from the protein-coding genes of the reduced eubacterial core and the protein-coding genes of the rejected eubacterial core using the Z-closure method (Huson et al., 2004a) implemented in SPLITSTREE v4.1.9 (Huson and Bryant, 2006). The inputs were the ML tree topologies and the associated branch length information. Following a preliminary investigation to determine the suitable parameters, the supernetworks were generated using the TreeSizeWeightedMean method to generate the edge weights of the network, in a single run, using the refine heuristic and a maximum dimension of 2.

The maximum dimension d for the splits graph is defined as the maximum dimension of cubes in the splits graph (Huson et al., 2004b). In my analysis, the value of d was set to 2, resulting in a simpler supernetwork. During a preliminary investigation, I found that the supernetworks generated from my relatively large number of input trees with the default maximum value of d (4) displayed so much conflict in the phylogeny that it was difficult to see the underlying groupings of taxa. Hence, the maximum value of d was reduced to 2, a value found to return an indicative graphical summary of the multiple phylogenies of the input sets from the reduced and rejected eubacterial cores.

Network clade support values were obtained by performing input tree bootstrapping using 100 subsampled tree sets for the reduced eubacterial core. Due to computational limitations, bootstrapping was not possible for the rejected eubacterial core.

4.2.6 Codon alignments of the reduced and rejected eubacterial cores

One possible mechanism by which eubacterial core protein-coding genes may have experienced elevated evolutionary rates is diversifying selection. A robust method for detecting adaptive evolution in a protein-coding gene is to calculate the ratio of the non-synonymous substitution rate (dN) to the synonymous substitution rate (dS). Since the neutral theory predicts that most mutations have no fitness advantage and are equally likely, the rate ratio ω = dN/dS can take values <1, =1 or >1 which indicate purifying selection, neutral evolution or positive selection, respectively (Yang and Swanson, 2002).

In order to obtain the most accurate information for the dN/dS tests, I calculated the corresponding codon alignment of the original amino-acid sequence alignments corresponding to both the reduced and rejected core sets.

The amino-acid sequence data for this study were originally downloaded from the NCBI REFSEQ database as translated protein-coding gene sequences in FASTA-format files, the translation having been performed at the NCBI using Bacterial Translation Table 11.

In order to obtain the DNA sequences that corresponded to the amino-acid sequences, I downloaded the full DNA sequence of each chromosome, the protein table start-end-strand information for each protein-coding gene on the chromosomes, and Bacterial Codon Translation Table 11 from the NCBI REFSEQ database in September 2009.

I used the above information to perform an initial extraction of the original DNA version of the protein-coding genes and compared these with the original amino-acid sequences downloaded from the REFSEQ database to ensure they were identical. Each error was investigated. Most were due to: (i) a minor error in the start or end specifier of the gene; (ii) the inclusion of the stop codon at the “end” causing a mismatch with the original amino-acid sequence which lacked a stop codon; or (iii) start-end-strand information that did not capture the information required for proteins that are assembled from more than one exon. For example, a simple protein-coding gene encoded by one exon might appear “174250..177162”, while a protein-coding gene encoded by an exon that appeared on the opposite strand and spanned both sides of location “1” would appear “complement(join(407856..408101,1..60))”, but the data downloaded from the GENBANK protein tables would only have “407856..60”. In all these cases, I either repaired the start-end specifiers or downloaded the full location specifier for multi-exon genes, re-extracted them, and confirmed they were the same as the translation performed by the NCBI database.
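The extraction-and-check step can be sketched in Python as follows. This is an illustration rather than the code used in the study. It assumes location strings of the two forms quoted above, an unambiguous A/C/G/T chromosome sequence, and that the amino-acid assignments of Translation Table 11 match the standard code, with the listed alternative start codons read as methionine at the first position. All function and variable names are hypothetical.

```python
import re

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
START_CODONS = {"ATG", "GTG", "TTG", "CTG", "ATT", "ATC", "ATA"}  # table 11 starts
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def extract_cds(chromosome, location):
    """Extract a coding sequence given a 1-based, inclusive location string."""
    segments = re.findall(r"(\d+)\.\.(\d+)", location)
    dna = "".join(chromosome[int(start) - 1:int(end)] for start, end in segments)
    if location.startswith("complement"):
        dna = dna.translate(COMPLEMENT)[::-1]   # reverse complement of the join
    return dna

def translate(dna):
    """Translate a coding sequence, dropping a terminal stop codon if present."""
    codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
    protein = [CODON_TABLE[codon] for codon in codons]
    if codons and codons[0] in START_CODONS:
        protein[0] = "M"                        # initiator codon read as Met
    if protein and protein[-1] == "*":
        protein.pop()                           # the protein FASTA has no stop
    return "".join(protein)

def matches_reference(chromosome, location, reference_protein):
    """True if the re-extracted gene translates to the downloaded protein."""
    return translate(extract_cds(chromosome, location)) == reference_protein
```

A gene whose check fails would then be examined for the three classes of error listed above.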

Lastly, the DNA versions of the protein-coding genes were aligned by codons as per the original, pre-GBLOCKS amino-acid data sets aligned with MAFFT.
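Aligning the DNA "by codons" can be illustrated with a minimal Python sketch that threads each coding sequence through its row of the amino-acid alignment, writing three gap characters for every gapped amino-acid site. The function name is hypothetical, and the sketch assumes the coding sequence starts in frame with the first aligned residue.

```python
def thread_codons(aligned_protein, cds):
    """Back-translate one aligned amino-acid row into an aligned codon row."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    n_residues = sum(1 for residue in aligned_protein if residue != "-")
    if n_residues > len(codons):
        raise ValueError("coding sequence is shorter than the aligned protein")
    row, next_codon = [], 0
    for residue in aligned_protein:
        if residue == "-":
            row.append("---")                 # one gapped site spans three columns
        else:
            row.append(codons[next_codon])
            next_codon += 1
    return "".join(row)                       # a trailing stop codon is not emitted
```

Applying the function to every row of an amino-acid alignment yields a codon alignment three times as long, with gaps in the corresponding positions.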

4.2.7 Evidence of diversifying selection in the reduced eubacterial core

Since the inclusion of sequences with an elevated evolutionary rate increases the possibility of recovering unusual phyletic patterns due to LBA effects, I performed a conservative estimation of evolutionary rates to check that the protein-coding genes from the post-GBLOCKS alignments of the reduced eubacterial core, the eubacterial core rejected by the SRH test and the eubacterial core rejected for having short alignments, were all suitable for phylogenetic analysis.

I examined the alignments of the protein-coding genes of the reduced and rejected eubacterial core for evidence of unusual evolutionary rates by looking for evidence of variation in ω within each set of protein-coding genes. My aim was to establish whether the reduced eubacterial core is more, less or equally likely to be affected by phylogenetic artefacts when compared with the eubacterial core that was rejected by the matched-pairs test of symmetry. If the reduced eubacterial core were suitable for phylogenetic analysis, I would expect ω values within each of the protein-coding gene sets to be <1, the variation in ω within an alignment and between sequences in different alignments to be relatively small, and the variation in ω between data sets also to be relatively small. In order to demonstrate that the reduced eubacterial core is at least as suitable for phylogenetic analysis as the rejected eubacterial core, those properties would have to be better or at least no worse.

Maximum-likelihood trees were inferred for each of the sets of amino-acid sequences of the reduced and rejected eubacterial cores using TREEFINDER (Jobb et al., 2004) with the WAG substitution model (Whelan and Goldman, 2001), a discrete Γ distribution with four rate categories (Γ4) and a non-zero proportion of invariant sites (I). Clade support values were estimated with the LR-ELW method (Strimmer and Rambaut, 2002), using 100 replicates in order to obtain a more robust tree than would be obtained by a single iteration.

An estimate of ω for every pair of protein-coding genes in each of the reduced and rejected eubacterial core sets was obtained using the CODEML program from the PAML v4.1 package (Yang, 1997, 2007), which was modified to encode Bacterial Translation Table 11 from GENBANK. The branch length and clade support values were removed from all the ML trees for this analysis. A pairwise analysis was performed for each codon alignment and the ML tree (parameters runmode=-2, model=0, NSsites=0, seqtype=1, CodonFreq=2, ndata=1, clock=0, icode=10, Mgene=0, fix_kappa=0, kappa=2, fix_omega=0, omega=1, fix_alpha=0, alpha=0, Malpha=0, ncatG=10, getSE=0, RateAncestor=1, Small_Diff=0.5×10⁻⁶, cleandata=1 and method=0) (for a full description of these parameters, refer to the PAML manual).
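For readers wishing to reproduce a comparable run, the parameter list above can be written out as a CODEML control file. The Python sketch below is illustrative only: the file names are hypothetical, the values of seqtype, ndata, omega, RateAncestor and cleandata are read as 1, and Small_Diff is read as 0.5e-6, on the assumption that these are the values intended in the parameter list.

```python
# Parameter values transcribed from the text; see the assumptions in the lead-in.
CODEML_SETTINGS = {
    "runmode": -2, "model": 0, "NSsites": 0, "seqtype": 1, "CodonFreq": 2,
    "ndata": 1, "clock": 0, "icode": 10, "Mgene": 0,
    "fix_kappa": 0, "kappa": 2, "fix_omega": 0, "omega": 1,
    "fix_alpha": 0, "alpha": 0, "Malpha": 0, "ncatG": 10,
    "getSE": 0, "RateAncestor": 1, "Small_Diff": "0.5e-6",
    "cleandata": 1, "method": 0,
}

def render_control_file(seqfile, treefile, outfile, settings=CODEML_SETTINGS):
    """Return the text of a control file as 'name = value' lines."""
    lines = [f"seqfile = {seqfile}",      # codon alignment (hypothetical path)
             f"treefile = {treefile}",    # ML tree topology (hypothetical path)
             f"outfile = {outfile}"]      # results file (hypothetical path)
    lines += [f"{name} = {value}" for name, value in settings.items()]
    return "\n".join(lines) + "\n"
```

Calling render_control_file("gene.codon.phy", "gene.tre", "gene.out") returns text that could be saved as a control file for one codon alignment; these three file names are placeholders.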

The resulting ω values for pairs of sequences in the reduced and rejected eubacterial core sets were filtered to remove values having: (i) an unreliable dS value (dS < 0.0001 and ω > 5, or dS undefined); or (ii) an insufficient number of sites left in the codon alignment once gap sites had been removed by the CODEML program (N < 20 or S > 20); or (iii) a dN or dS value that indicated that, on average, the sites in the alignment had reached substitutional saturation (dN > 2 and/or dS > 2). Data sets with fewer than 8 remaining ω values were discarded.
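Filters (i) and (iii) and the minimum of 8 retained values can be expressed as a short Python sketch. The site-count filter (ii) is deliberately left out, and the function names and the (dN, dS, ω) tuple layout are assumptions made for illustration.

```python
import math

def is_reliable(dn, ds, omega):
    """Apply filters (i) and (iii) to one pairwise estimate."""
    if ds is None or math.isnan(ds):
        return False                     # (i) dS undefined
    if ds < 0.0001 and omega > 5:
        return False                     # (i) dS too small to trust the ratio
    if dn > 2 or ds > 2:
        return False                     # (iii) substitutional saturation
    return True

def filter_data_set(estimates, min_values=8):
    """Return the retained omega values for one data set, or None if fewer
    than min_values remain and the data set is to be discarded."""
    kept = [omega for dn, ds, omega in estimates if is_reliable(dn, ds, omega)]
    return kept if len(kept) >= min_values else None
```

A data set is therefore represented in later comparisons only by pairwise estimates that survive every filter.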

A scatterplot of dN against dS was generated for the reduced eubacterial core and the rejected eubacterial core to check whether the ratios differed between the two cores. Boxplots of the ω values were generated to visualise the distribution of ω values across the reduced and rejected eubacterial cores, and for each protein-coding gene within the two cores. Unpaired two-sample t-tests for equality of means were performed at the 95% confidence level to evaluate whether the mean of ω was significantly different in different eubacterial core sets. All tests and graphs were generated with the R statistical package (R Development Core Team, 2009).

4.2.8 Addition of Acaryochloris and Heliobacterium to the study

After 62 bacterial protein-coding genes had been identified for phylogenetic analysis, the genomes of two additional photosynthetic bacteria that would enhance the value of my study became publicly available in the NCBI REFSEQ database. The first was that of the cyanobacterium Acaryochloris marina MBIC11017 (Swingley et al., 2008), which like Prochlorococcus spp. mainly uses chlorophyll, albeit Chl d, for light harvesting (Chen et al., 2002b,a). The second was that of Heliobacterium modesticaldum Ice1 (Sattley et al., 2008), a representative of the anoxygenic photosynthetic Heliobacteria, a group which uniquely uses bacteriochlorophyll (BChl) g in a Type I reaction centre and which was not yet represented in my study. Since these two bacteria were important enough to warrant inclusion in my study, but I did not wish to repeat the entire data selection process, I augmented my 62 data sets with genes from the two new bacteria by the following method.

All chromosomes and plasmids for Acaryochloris and Heliobacterium were obtained from the NCBI REFSEQ database. The BLASTP results for 70 bacteria (Table 4.2) (which now consisted of 73 chromosomes and 44 plasmids) were compiled. A full listing of putative relationships of orthology was collated using the same E-value, score, percent identity and proportion-aligned criteria used previously. The MCL program was used to identify all clusters of putatively orthologous protein-coding genes. The original clusters of 62 protein-coding genes were augmented with the sequences from Acaryochloris and Heliobacterium present in the re-computed clusters (Table 4.1). Each of the 62 augmented clusters of protein-coding genes was aligned with MAFFT and then visually inspected for sequences from the additional bacteria that appeared to be paralogous. Any such sequences were deleted, and the remaining sequences were realigned with MAFFT. Next, the ambiguously aligned sites in each alignment were deleted using GBLOCKS. Finally, I verified that the 62 augmented alignments contained at least 100 amino acids each and had evolved under SRH conditions. At each step of the above process, the same parameters and selection criteria that were used in the previous analysis were used.

Following analyses to verify that the inclusion of two extra taxa did not affect the position of the other taxa in my phylogenetic study (data not shown), all further analyses were conducted on the augmented data sets, now containing 70 bacteria.

4.2.9 Inclusion of eubacteria may reveal instances of LGT

To test whether components of the Prochlorococcus and marine Synechococcus genomes have been acquired relatively more recently from other bacterial lineages rather than inherited from the ancestor of the PS-group taxa, we need to understand the origins of the PS-group genomes in a eubacterial context. Unfortunately, phylogenetic analyses of bacterial-core genes have consistently failed to infer robust, well-supported evolutionary relationships among bacterial lineages (Beiko et al., 2005; Pisani et al., 2007; Bapteste et al., 2008). These phylogenies are notorious for being poorly resolved. Since this could be due either to loss of historical signal during evolution from the common ancestor, artefacts from the phylogenetic inference methods, or the inclusion of paralogues (Zhaxybayeva et al., 2004b), the preparation of data is important if we are to improve on previous studies.

4.3 Results

4.3.1 The eubacterial taxa selected

The first step of the phylogenetic inference process was the selection of cyanobacteria, anoxyphotobacteria and non-photosynthetic bacteria to place my study of the PS-group taxa into a eubacterial context.

The 70 bacteria selected for my study are listed in Table 4.1. This selection includes 30 representatives from the cyanobacteria, 4 from the anoxygenic photosynthetic green sulfur bacteria (a subset of the Chlorobi), 3 from the anoxygenic photosynthetic green non-sulfur bacteria (a subset of the Chloroflexi) and 4 from the anoxygenic photosynthetic purple bacteria (a subset of the α-proteobacteria).

Comparison of my taxa with those currently documented in the NCBI database (Wheeler et al., 2004) shows that I have sampled from 20 of the 25 known eubacterial phyla (Table 4.1). A complete listing showing REFSEQ identifiers, GC content, number of genes and number of protein-coding genes (“proteins”) is provided in Appendix B.1.

Table 4.1: Whole bacterial genomes used in this study

Bacterial group
  Photosynthetic group (a)  Full name  Abbreviated name (b)

Acidobacteria
  -  Acidobacteria bacterium Ellin345          Aba
  -  Solibacter usitatus Ellin6076             Sus
Actinobacteria
  -  Corynebacterium efficiens YS-314          Cef
Aquificae
  -  Aquifex aeolicus VF5                      Aae
Bacteroidetes
  -  Bacteroides fragilis NCTC 9343            Bfr
Chlamydiae
  -  Chlamydophila caviae GPIC                 Cca
Chloracidobacteria
     No photosynthetic representatives available
Chlorobi
  S  Chlorobium chlorochromatii CaD3           Cch
  S  Chlorobium tepidum TLS                    Cte
  S  Pelodictyon luteolum DSM 273              Plu
  S  Pelodictyon phaeoclathratiforme BU-1      Pph
Chloroflexi
  N  Chloroflexus aurantiacus J-10-fl          Cau
  -  Dehalococcoides ethenogenes 195           Det
  N  Herpetosiphon aurantiacus ATCC 23779      Hau
  N  Roseiflexus sp. RS-1                      Ros.
Cyanobacteria
  C  Acaryochloris marina MBIC11017 (c)        Ama
  C  Anabaena variabilis ATCC 29413            Ava
  C  Crocosphaera watsonii WH 8501             Cwa
  C  Gloeobacter violaceus PCC 7421            Gvi
  C  Nostoc punctiforme PCC 73102              Npu
  C  Nostoc sp. PCC 7120                       Nos.
  C  Prochlorococcus marinus str. AS9601       Pma-AS9601
  C  Prochlorococcus marinus str. MIT 9211     Pma-MIT9211
  C  Prochlorococcus marinus str. MIT 9301     Pma-MIT9301
  C  Prochlorococcus marinus str. MIT 9303     Pma-MIT9303
  C  Prochlorococcus marinus str. MIT 9312     Pma-MIT9312
  C  Prochlorococcus marinus str. MIT 9313     Pma-MIT9313
  C  Prochlorococcus marinus str. MIT 9515     Pma-MIT9515
  C  Prochlorococcus marinus str. NATL1A       Pma-NATL1A
  C  Prochlorococcus marinus str. NATL2A       Pma-NATL2A
  C  Prochlorococcus marinus str. CCMP1375     Pma-CCMP1375
  C  Prochlorococcus marinus str. CCMP1986     Pma-CCMP1986
  C  Synechococcus elongatus PCC 6301          Sel-PCC6301
  C  Synechococcus elongatus PCC 7942          Sel-PCC7942
  C  Synechococcus sp. CC9311                  Syn.CC9311
  C  Synechococcus sp. CC9605                  Syn.CC9605
  C  Synechococcus sp. CC9902                  Syn.CC9902
  C  Synechococcus sp. JA-2-3B’a(2-13)         Syn.JA2
  C  Synechococcus sp. JA-3-3Ab                Syn.JA3
  C  Synechococcus sp. RS9917                  Syn.RS9917
  C  Synechococcus sp. WH 5701                 Syn.WH5701
  C  Synechococcus sp. WH 7805                 Syn.WH7805
  C  Synechococcus sp. WH 8102                 Syn.WH8102
  C  Synechocystis sp. PCC 6803                Scy.PCC6803
  C  Thermosynechococcus elongatus BP-1        Tel
  C  Trichodesmium erythraeum IMS 101          Ter
Deinococcus
  -  Deinococcus radiodurans R1                Dra
Firmicutes
  -  Bacillus anthracis str. “Ames Ancestor”   Ban
  H  Heliobacterium modesticaldum Ice1 (c)     Hmo
  -  Staphylococcus aureus subsp. aureus N315  Sau
  -  Thermoanaerobacter tengcongensis MB4      Tte
Fusobacteria
  -  Fusobacterium nucleatum ATCC 25586        Fnu
Planctomycetes
  -  Rhodopirellula baltica SH 1               Rba
α-proteobacteria
  -  Agrobacterium tumefaciens str. C58        Atu
  P  Rhodobacter sphaeroides ATCC 17025        Rsp-ATCC17025
  P  Rhodobacter sphaeroides ATCC 17029        Rsp-ATCC17029
  P  Rhodopseudomonas palustris BisB18         Rpa-BisB18
  P  Rhodopseudomonas palustris CGA009         Rpa-CGA009
  -  Rickettsia typhi str. Wilmington          Rty
β-proteobacteria
  -  Bordetella parapertussis 12822            Bpa
  -  Neisseria gonorrhoeae FA 1090             Ngo
  -  Ralstonia metallidurans CH34              Rme
γ-proteobacteria
  -  Buchnera aphidicola str. Bp               Bap
  -  Escherichia coli CFT073                   Eco
  -  Xanthomonas campestris str. 8004          Xca
  -  Yersinia pseudotuberculosis IP 32953      Yps
δ-proteobacteria
  -  Geobacter sulfurreducens PCA              Gsu
ε-proteobacteria
  -  Helicobacter acinonychis str. Sheeba      Hac
Spirochaetes
  -  Borrelia garinii PBi                      Bga
Thermotogae
  -  Thermotoga maritima MSB8                  Tma
Thermus
  -  Thermus thermophilus HB8                  Tth

(a) Indicates the photosynthetic bacterial group to which some of the taxa belong, where: P = purple bacteria, S = green sulfur bacteria, N = green non-sulfur bacteria, C = cyanobacteria and H = heliobacteria. Non-photosynthetic bacteria are marked with a dash.
(b) The abbreviated organism name used in the annotation of figures. For the names used to distinguish chromosomes or plasmids from the same bacterium, refer to Appendix B.1.
(c) This bacterium was added to the analysis after the core set of bacterial protein-coding genes had been selected from the original 68 bacteria.

4.3.2 The reduced eubacterial core

The all-vs-all BLASTP runs on the unaugmented data identified 2,097,279 pairs of putative relationships of orthology. These relationships connected 179,200 protein-coding genes within the 71 chromosomes and 35 plasmids from 68 bacterial genomes. A graph formed from the nodes corresponding to protein-coding genes, and edges corresponding to putatively orthologous relationships between pairs of nodes, was compiled from the BLASTP data. The MCL program was used to find clusters of putatively homologous protein-coding genes from the graph by simulating flow in the graph (for more information on how the MCL method works, refer to Sections 4.2.2 and 4.4.2).
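The idea of clustering by simulated flow can be illustrated with a toy version of Markov clustering. The sketch below is not the MCL program itself; it is a minimal pure-Python illustration that assumes an unweighted adjacency matrix, an inflation value of 2 and a simple convergence check, and the function names `normalise_columns` and `mcl_sketch` are hypothetical.

```python
def normalise_columns(m):
    """Rescale every column so that it sums to 1 (a column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl_sketch(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    """Toy Markov clustering: alternate expansion and inflation until stable."""
    n = len(adjacency)
    # Self-loops are added so that every column has a non-zero sum.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalise_columns(m)
    for _ in range(max_iter):
        # Expansion: one matrix squaring, i.e. flow along two-step random walks.
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: element-wise power, then renormalise; strong flow is boosted.
        inflated = normalise_columns([[x ** inflation for x in row]
                                      for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    # Each row that still carries flow lists the members of one cluster.
    clusters = {frozenset(j for j in range(n) if row[j] > 1e-6) for row in m}
    return sorted(sorted(c) for c in clusters if c)


# Two disconnected triangles of putatively orthologous genes.
triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(mcl_sketch(triangles))
```

On a graph of the size analysed here, a dense matrix of this kind would be impractical; the sketch is intended only to show the expansion and inflation steps.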

From these data, the MCL method recovered 9,592 clusters containing at least four sequences. The 439 clusters containing sequences from at least 75% of the 68 bacteria but no more than 7 sequences per bacterium were selected for further analysis. Of these, 396 were retained following manual verification. A further 116 were removed because they had fewer than 100 amino acids and a further 218 were removed because they were not found to have evolved under stationary, reversible and homogeneous (SRH) conditions, leaving a total of 62 data sets (Table 4.2) suitable for phylogenetic analysis.
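The cluster-selection criteria above can be expressed as a short filter. This is a sketch under stated assumptions rather than the code used in the study: a cluster is represented as a list of (bacterium, gene identifier) pairs, "at least 75% of the 68 bacteria" is taken to mean at least 51 bacteria, and the function name `passes_core_filter` is hypothetical.

```python
import math
from collections import Counter


def passes_core_filter(cluster, n_bacteria=68, min_fraction=0.75,
                       max_per_bacterium=7, min_size=4):
    """Return True if a cluster meets the size, coverage and copy-number limits.

    cluster: list of (bacterium, gene_id) pairs (an assumed representation).
    """
    if len(cluster) < min_size:
        return False
    counts = Counter(bacterium for bacterium, _ in cluster)
    # Coverage: sequences from at least 75% of the bacteria (51 of 68).
    if len(counts) < math.ceil(min_fraction * n_bacteria):
        return False
    # Copy number: no bacterium contributes more than 7 sequences.
    return max(counts.values()) <= max_per_bacterium


single_copy = [(f"bacterium{i}", f"gene{i}") for i in range(51)]
print(passes_core_filter(single_copy))       # 51 bacteria, one gene each
print(passes_core_filter(single_copy[:50]))  # only 50 bacteria represented
print(396 - 116 - 218)                       # data sets left after both rejections
```

The last line restates the arithmetic of the paragraph: 396 verified clusters, less 116 short alignments and 218 SRH failures, leaves the 62 data sets of the reduced core.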

My final collection of 62 sets of core bacterial protein-coding genes contained representatives from 12 of the 21 KEGG functional categories that are relevant to bacteria (Table 4.2). The 9 categories without representatives were “Metabolism of other amino acids”, “Biosynthesis of polyketides and nonribosomal peptides”, “Biosynthesis of secondary metabolites”, “Xenobiotics biodegradation and metabolism”, “Membrane transport”, “Signal transduction”, “Signaling molecules and interactions”, “Cell motility” and “Cell communication”.

Table 4.2: The reduced bacterial core of 62 protein-coding genes

                                                                         No.     No.     Missing Photo  % Failed
Fn^a  Product                                                    Gene    Seqs^b  Bact^c  Groups^d       Tests^e
A     S-adenosyl L-homocysteine hydrolase                        ahcY    56      56                     1.17
A     Argininosuccinate synthase                                 argG    65      65                     4.62
A     Chorismate synthase                                        aroC    69      68                     2.64
A     Imidazoleglycerol-phosphate dehydratase                    hisB    63      63                     7.94
A     S-adenosylmethionine synthetase                            metK    69      68                     3.37
A     GTP-binding protein TypA                                   typA    64      64                     6.10
B     Acetyl-CoA carboxylase carboxyl transferase subunit alpha  accA    63      62                     7.58
B     Ribosome recycling factor                                  frr     70      69      -              5.47
B     Acetolactate synthase III small subunit                    ilvH    64      63      -              3.67
B     Ribose-phosphate pyrophosphokinase                         prsA    74      69      -              4.55
B     Ribulose-phosphate 3-epimerase                             rpe     70      68      -              4.89
C     Rod shape-determining protein                              mreB    63      58      H              4.51
E     ATP synthase F1 subunit A                                  atpA    67      66      -              3.93
E     ATP synthase alpha subunit                                 atpB    69      65      -              0.72
E     ATP synthase F1 subunit B                                  atpD    67      66      -              3.17
E     Recombinase A                                              recA    74      68      -              3.92
F     Thioredoxin peroxidase                                     ahpC    56      55      H              1.10
F     Bacterioferritin comigratory protein                       bcp     61      61      -              3.55
F     Chaperonin GroEL, HSP60 family                             groEL   74      68      -              3.81
F     Preprotein translocase SecY subunit                        secY    70      69      -              5.76
F     SsrA-binding protein                                       smpB    70      70      -              8.12
G     UDP-N-acetylglucosamine acyltransferase                    lpxA    59      56      G              5.61
L     Enoyl-[acyl-carrier protein] reductase (NADH)              fabI    62      62      -              3.33
L     4-hydroxy-3-methylbut-2-en-1-yl diphosphate synthase^f     gcpE    63      63      G              3.89
N     Adenine phosphoribosyltransferase                          apt     62      62      H              1.27
N     Nucleoside-diphosphate kinase                              ndk     67      67      -              0.68
N     Uridylate kinase                                           pyrH    68      68      -              5.75
N     FeS assembly protein SufB                                  sufB    54      54      H,G            7.13
M     ATP-dependent Clp protease, proteolytic subunit            clpR    70      69      -              4.31
M     ATP-dependent Clp protease, ATP-binding subunit            clpX    68      68      -              4.74
M     Coenzyme A biosynthesis protein                            coaD    67      67      -              6.06
M     6,7-dimethyl-8-ribityllumazine synthase                    ribH    66      66      -              5.69
M     Thiamine biosynthesis protein ThiC                         thiC    60      59      N              5.59
M     Predicted transcriptional regulator (Zn-ribbon/ATP-cone)   trnR^g  56      56      G              3.12
R     Recombination protein RecR                                 recR    66      66      -              4.94
R     Crossover junction endodeoxyribonuclease                   ruvC    65      65      -              7.02
T     Translation initiation factor IF-3                         infC    70      70      -              1.57
T     DNA-directed RNA polymerase subunit alpha                  rpoA    70      70      -              1.61
S     Elongation factor EF-P                                     efp     73      70      -              2.05
S     Elongation factor EF-G                                     fusA    74      69      -              4.78
S     GTP-binding protein LepA                                   lepA    71      70      -              6.36
S     Cytochrome b6                                              petB    55      55      -              0.47
S     50S ribosomal protein L2                                   rplB    69      69                     5.67
S     50S ribosomal protein L4                                   rplD    70      70                     1.82
S     50S ribosomal protein L5                                   rplE    70      70                     2.98
S     50S ribosomal protein L6                                   rplF    70      70                     2.36
S     50S ribosomal protein L11                                  rplK    69      69                     0.85
S     50S ribosomal protein L13                                  rplM    69      69                     0.51
S     50S ribosomal protein L14                                  rplN    71      70                     1.17
S     50S ribosomal protein L16                                  rplP    70      70                     1.28
S     50S ribosomal protein L20                                  rplT    70      70                     5.67
S     30S ribosomal protein S2                                   rpsB    70      70                     5.51
S     30S ribosomal protein S3                                   rpsC    70      70                     4.35
S     30S ribosomal protein S4                                   rpsD    70      70                     1.41
S     30S ribosomal protein S5                                   rpsE    69      69                     0.77
S     30S ribosomal protein S7                                   rpsG    71      70                     3.14
S     30S ribosomal protein S8                                   rpsH    70      70                     1.95
S     30S ribosomal protein S11                                  rpsK    68      68                     1.54
S     30S ribosomal protein S12                                  rpsL    68      68                     0.04
S     30S ribosomal protein S13                                  rpsM    70      70                     1.45
S     Elongation factor Ts                                       tsf     70      70      -              2.48
S     Elongation factor Tu                                       tufA    75      70      -              1.59

a The genes are grouped by the corresponding KEGG gene functional category where: A - Amino acid metabolism, B - Carbohydrate metabolism, C - Cell growth and death, E - Energy metabolism, F - Folding sorting and degradation, G - Glycan biosynthesis and metabolism, L - Lipid metabolism, N - Nucleotide metabolism, M - Metabolism of cofactors and vitamins, R - Replication and repair, T - Transcription and S - Translation.

b The number of sequences in the data set.

c The number of bacteria (rather than the number of chromosomes and plasmids) represented in the data set.

d The photosynthetic bacterial groups which are not represented in this data set. The codes indicate: H = heliobacteria, G = green sulfur bacteria and N = green non-sulfur bacteria. Data sets which cover all 4 groups of non-photosynthetic bacteria are marked with a dash.

e Percent of matched-pairs tests of symmetry that failed at the p = 0.05 level. That is, the test returned p < 0.05, indicating lack of support for the hypothesis that the two sequences have been sampled from the same amino-acid frequency distribution and, hence, support for the alternate hypothesis that the two sequences have different amino-acid frequency distributions.

f Also known as 1-hydroxy-2-methyl-2-(E)-butenyl 4-diphosphate synthase (ispG).

g This is a temporary gene name used only in this study.
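The matched-pairs test of symmetry referred to in footnote e of Table 4.2 can be sketched as follows. The sketch assumes the Bowker form of the test, in which the aligned residue pairs of two sequences are tallied into a divergence matrix and the off-diagonal counts are compared; the function name `bowker_symmetry` is hypothetical, and gapped columns are simply skipped. The function returns the test statistic and its degrees of freedom; the p-value would then be read from a chi-square distribution with that many degrees of freedom.

```python
from collections import Counter
from itertools import combinations


def bowker_symmetry(seq1, seq2, gap="-"):
    """Bowker statistic and degrees of freedom for two aligned sequences."""
    # Divergence matrix: how often residue a in seq1 is aligned with b in seq2.
    pairs = Counter((a, b) for a, b in zip(seq1, seq2)
                    if a != gap and b != gap)
    alphabet = sorted({residue for pair in pairs for residue in pair})
    stat, df = 0.0, 0
    for a, b in combinations(alphabet, 2):
        n_ab, n_ba = pairs[(a, b)], pairs[(b, a)]
        if n_ab + n_ba > 0:
            # Under symmetry, a-to-b and b-to-a differences are equally frequent.
            stat += (n_ab - n_ba) ** 2 / (n_ab + n_ba)
            df += 1
    return stat, df


print(bowker_symmetry("AAAAKKKK", "AAKKKKKK"))
```

In the toy example, alanine is aligned with lysine twice but lysine is never aligned with alanine, so the single off-diagonal pair contributes (2 - 0)^2 / 2 = 2.0 on one degree of freedom.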

4.3.3 Representation of gene functional groups in the reduced eubacterial core

The scatterplot of SRH failure rate and post-GBLOCKS alignment length for the 396 protein-coding genes of the eubacterial core is shown in Figure 4.1. The scatterplot is divided into three regions: (A) the 62 protein-coding genes of the reduced eubacterial core, selected on the basis of having a post-GBLOCKS alignment length > 100 amino acids and not more than 10% failures of the pairwise SRH tests; (B) the 218 protein-coding genes rejected because more than 10% of the sequence pairs failed the SRH test; and (C) the 116 protein-coding genes rejected because the post-GBLOCKS alignment length was < 100 amino acids.

Figure 4.1: SRH test failure rate vs. amino-acid alignment length. In the scatterplot generated for the 396 protein-coding genes of the eubacterial core, region (A) contains the 62 protein-coding genes of the reduced eubacterial core with post-GBLOCKS alignment length > 100 amino acids and up to 10% failures of the pairwise SRH tests; region (B) contains the 218 protein-coding genes rejected on the basis of the SRH test; and region (C) contains the 116 protein-coding genes rejected on the basis of being too short for phylogenetic analysis.

Figure 4.1 shows that most proteins of the 396-protein eubacterial core had a post-GBLOCKS alignment length of between 0 and 600 amino acids, with a few outliers extending the range to almost 1,000 amino acids. The 62-protein reduced eubacterial core is representative of the entire eubacterial core in containing protein-coding genes that span the same length range, with the exception of the longer outliers; it is not biased towards unusually short or long protein-coding genes. The 396-protein eubacterial core contained alignments where up to 28% of the sequence pairs had a different amino-acid composition as evaluated by the SRH test (Figure 4.1). The regular distribution of SRH failure rates for the protein-coding genes of all post-GBLOCKS alignment lengths shows no strong relationship between these two properties of the alignments. Hence, the use of both alignment length and SRH failure rates has yielded a reduced eubacterial core that is representative of the entire eubacterial genomic core.

The 396 sets of eubacterial core protein-coding genes mapped to 611 points in the scatterplot (Figure 4.1), each with a unique combination of length, SRH failure rate and KEGG gene category, and together covering 23 KEGG categories. Since plotting 611 points onto a single scatterplot obscured many of the points that were located closely together, I plotted separate scatterplots for each of the KEGG categories that were represented (Figure 4.2). Points that mapped to KEGG gene function categories that are not represented in bacteria sensu KEGG are denoted by crosses rather than dots. Since the representatives of each KEGG category are spread across the reduced eubacterial core (region A of Figure 4.1) and the rejected eubacterial core (region B of Figure 4.1), I concluded that the reduced eubacterial core of 62 protein-coding genes was representative of the biochemical pathways of the entire eubacterial core.

The resulting reduced eubacterial core may be unrepresentative of the entire eubacterial core for other reasons. For example, it may consist of more highly integrated genes that are less prone to LGT, or ribosomal proteins may be over-represented.

Figure 4.2: KEGG gene functional categories. Points were drawn from the 396 protein-coding genes of the eubacterial core. Crosses indicate allocations to categories not relevant to bacteria sensu KEGG (see §4.2.4). Panels: 1, Carbohydrate Metabolism; 2, Energy Metabolism; 3, Lipid Metabolism; 4, Nucleotide Metabolism; 5, Amino Acid Metabolism; 6, Metabolism of Other Amino Acids; 7, Glycan Biosynthesis & Metabolism; 8, Biosynthesis of Polyketides & Nonribosomal Peptides; 9, Metabolism of Cofactors & Vitamins; 10, Biosynthesis of Secondary Metabolites; 11, Xenobiotics Biodegradation & Metabolism; 12, Transcription; 13, Translation; 14, Folding Sorting & Degradation; 15, Replication & Repair; 16, Membrane Transport; 17, Signal Transduction; 18, Cell Growth & Death; 19, Endocrine System; 20, Immune System; 21, Neurodegenerative Disorders; 22, Metabolic Disorders; 23, Cancers.

I argue, however, that all sampling decisions involve compromises and are approximations to the actual population of data, and that this reduced eubacterial core is suitable for this analysis because it has been selected in a sufficiently rigorous manner to yield a set of genes that are representative of essential eubacterial metabolism, yet less prone to phylogenetic artefacts.

4.3.4 Supertrees and supernetworks of the reduced and rejected eubacterial cores

Before embarking on an analysis of the evolutionary history of the PS-group genomes based only on the reduced eubacterial core, it is necessary to establish that the phylogenetic signal in the reduced eubacterial core is representative of that of the full eubacterial core.

My 280 sets of eubacterial core protein-coding genes contained representatives from most, but not all, of the bacteria selected for my study. The supertree, of a set of trees that contain overlapping but non-identical taxa, is the tree that contains all the species from the input trees and is consistent with most of them (Felsenstein, 2004). A supernetwork is an extension of the supertree concept that attempts to capture the phylogenetic relationships between taxa in several trees, up to a specified complexity (Huson et al., 2004a). The supernetwork approach has the advantage that it does not impose a tree-like model of evolution onto the data (Huson and Bryant, 2006) and thus is suitable for estimating the congruence of the phylogenetic signal in the reduced and rejected eubacterial core sets of genes.

Figure 4.3: Number of taxa in each of the reduced and rejected eubacterial core protein-coding gene sets. Axes: number of taxa in tree (x) against frequency (y); the bars distinguish the reduced core from the rejected core.

Table 4.3: Properties of the supertrees of the reduced and rejected eubacterial core

                                  Reduced   Rejected
                                  Core      Core
Number of input trees             62        218
Number of unique taxa             68        71
Score of best tree                12.117    14.395
Average %SD of score              7.295     7.613
Number of bootstrap trees used    10        100

The frequency distribution of the number of sequences in the reduced and rejected eubacterial cores (Figure 4.3) was generated to check for bias between the two cores. The most frequent number of sequences in the reduced eubacterial core is 68 and that in the rejected eubacterial core is 66. The ranges of the two groups, and the overall distributions of values between their upper and lower bounds, are similar, so I concluded that biases due to sample sizes should not be an issue.

The best supertrees inferred from the reduced and rejected eubacterial cores using the Average Consensus (AVCON) method (Lapointe and Cucumel, 1997; Levasseur et al., 2003) implemented in CLANN v3.2.2 (Creevey and McInerney, 2005) contained a similar number of taxa, had similar tree scores (§4.2.5), and had a low relative standard deviation (%SD) (§4.2.5) for the distribution of scores produced by the bootstrapping procedure (Table 4.3).

Both supertrees recovered a similar topology (Figure 4.4). The cyanobacteria formed a single clade in which individual cyanobacteria followed a similar pattern of divergence, and the PS-group formed a single clade within the cyanobacteria, demonstrating that the reduced and rejected eubacterial cores have a similar phylogenetic signal for the cyanobacteria.

Figure 4.4: Supertrees inferred from the ML trees from the reduced and rejected eubacterial cores. The supertrees were inferred using the AVCON method applied to: (A) the 62 protein-coding genes of the reduced eubacterial core; and (B) the 218 protein-coding genes of the rejected eubacterial core. Clade support values were inferred from 10 and 100 input bootstrap trees, respectively, and are indicated by colour: <50 purple, <65 green, <80 yellow, <95 orange and 95-100 red; terminal edges and clades that did not appear in the consensus of the bootstrap trees are black. Photosynthetic taxa are labelled with italic font (see Table 4.1). The scale bar in each panel represents 1.0 substitution per site.

The rest of the eubacterial taxa show a star-like pattern of divergences from which a reliable evolutionary relationship cannot be inferred (Figure 4.4). Since the supertree method attempts to form a phylogeny that is consistent with the input trees, this indicates that the input trees of both the reduced and rejected eubacterial cores contained a range of different topologies, which, when combined into a supertree, cannot resolve a tree-like pattern of divergence. Since the same effect was observed for both the reduced and the rejected eubacterial core, the reduced eubacterial core can be said to be a representative sample of the entire eubacterial core that should lead to a better representation of the phylogeny of the whole group.

In both Figures 4.4B and 4.5B, the supertree and supernetwork contain two sequences from Deinococcus radiodurans R1. This is not a mistake; it is a by-product of the decision to include more than one sequence from a taxon where there was more than one candidate sequence for the orthologue and the candidates could not be discriminated (§4.2.3). One of the Deinococcus sequences was found in the expected position, near the Thermus sequence, while the other was found in a completely different position that could be interpreted as evidence for LGT. This suggests that the choice to retain all such sequences was justified, in that it protects against the incorrect inference of LGT events due to error in the orthologue selection process.

The AVCON supertree method selected for this analysis returns branch length estimates. The supertrees inferred from the reduced and rejected eubacterial cores are similar in that taxa on short branches in one supertree are also on short branches in the other, and a similar pattern occurs for taxa on long branches. Furthermore, the scale of both figures shows the branch lengths recovered for both the reduced and rejected eubacterial cores are similar in length (Figure 4.4).

The consensus feature in CLANN is able to infer a consensus tree from the set of universally distributed source trees, that is, the trees that contain all the taxa (Creevey, 2004). Since only 10 of the bootstrap data sets sampled from the trees of the reduced eubacterial core contained all 68 taxa, the clade support values of the best tree inferred from the reduced eubacterial core were based on these 10 bootstrap replicates. The clade support values at the centre of both phylogenies were between 50 and 80 (Figure 4.4), indicating relatively poor support for the recovered main clades and the order of their divergence.

The supernetworks were inferred using the Z-closure method with a maximum dimension of 2. The supernetwork inferred from the 62 ML trees from the reduced eubacterial core used the information in 182 splits to generate a network containing 71 taxa (Figure 4.5A).

Similarly, that inferred from the 218 ML trees of the rejected eubacterial core used the information in 183 splits to generate a network containing 71 taxa (Figure 4.5B).

Both supernetworks distinguish the cyanobacteria from the rest of the eubacteria and exhibit a star-like radiation of eubacteria from the centre of the phylogeny (Figure 4.5).

The clade supports for branches at the centre of the phylogeny are relatively low, having values between 50 and 80 (Figure 4.5A).

4.3.5 Codon alignments of the reduced and rejected eubacterial cores

Nucleotide sequences, aligned by the codons that correspond to the amino-acid sequences aligned earlier with MAFFT, were generated for the 280 data sets of the reduced and rejected eubacterial cores. The original 280 data sets contained 17,748 protein-coding genes from 77 chromosomes (from 71 bacteria and 6 plasmids). Of these, the DNA codon alignments were obtained for 14,361 or 80.9%, spanning 62 chromosomes (from 58 bacteria and 3 plasmids).
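The step of aligning nucleotide sequences by codon against an existing amino-acid alignment can be sketched as below. This is an illustration rather than the software used in this study: it assumes that the coding sequence starts in frame at the first aligned residue, that any trailing stop codon is ignored, and that the function name `thread_codons` is hypothetical.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Map an unaligned coding sequence onto one row of a protein alignment."""
    n_residues = sum(1 for residue in aligned_protein if residue != gap)
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is shorter than the protein it encodes")
    codons, position = [], 0
    for residue in aligned_protein:
        if residue == gap:
            # One alignment gap in the protein becomes three gaps in the DNA.
            codons.append(gap * 3)
        else:
            codons.append(cds[position:position + 3])
            position += 3
    return "".join(codons)


print(thread_codons("M-K", "ATGAAATAA"))
```

A record whose nucleotide coordinates cannot be retrieved has no `cds` to thread, which is how superseded records reduce the number of codon alignments obtained.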

Figure 4.5: Phylogenetic supernetworks inferred from the reduced and rejected eubacterial cores. The supernetworks were obtained by applying the Z-closure method to: (A) 62 protein-coding genes of the reduced eubacterial core; and (B) 218 protein-coding genes of the rejected eubacterial core. The supernetworks were generated with a maximum dimension of 2. Thick branches in (A) indicate clades with support >50% inferred from 100 runs. The scale bar represents 0.01 substitutions per site.

The shortfall was due to some of the protein-coding gene records being superseded by a newer record with a new GENBANK Identifier (GI) between the downloading of the FASTA-format amino-acid sequences in 2007 and the downloading of the protein table start-end-strand specifiers in 2009. In particular, four bacterial chromosomes had been completely replaced by a new RefSeq record with new GIs for each sequence (Crocosphaera watsonii WH 8501, Neisseria gonorrhoeae FA 1090, Synechococcus sp. RS9917 and Synechococcus sp. WH 7805), and several chromosomes had several hundred protein-coding genes that had been superseded by records with a new GI. I did not pursue these missing records before continuing with the analysis, because it would have been time-consuming to write new software to retrieve each of the location specifiers for the old GIs, and because I judged the existing data to be representative of the entire data set, so that the missing data would not bias my results.

4.3.6 Pairwise estimates of dN/dS in the reduced and rejected eubacterial cores

Pairwise estimates of the ratio of non-synonymous to synonymous substitution rates were inferred for all pairs of protein-coding genes in the reduced and rejected eubacterial cores. The CODEML analysis required 356,357 comparisons between pairs of protein-coding genes in the 280 eubacterial core protein-coding genes. Following the filtering process, 3,708 or 1% of the values were retained for further analysis. Of these, 1,243 were from the reduced eubacterial core and 2,465 from the rejected eubacterial core; they were spread across the four taxonomic groups of interest: (i) the Prochlorococcus; (ii) the PS-group; (iii) the cyanobacteria; and (iv) the Eubacteria (Table 4.4). After the eubacterial cores were filtered to remove data sets with fewer than 8 sequences, the reduced and rejected eubacterial cores consisted of 60 and 206 protein-coding genes, respectively.

Table 4.4: Number of pairwise dN/dS comparisons of each taxon group retained for analysis in the reduced and rejected eubacterial cores

                   Reduced   Rejected   Total
Prochlorococcus    471       1,350      1,821
PS-group           214       185        399
Cyanobacteria      214       587        801
Eubacteria         344       343        687
Total              1,243     2,465      3,708

The scatterplots of the average rate of non-synonymous substitutions to the average rate of synonymous substitutions illustrate that the properties of the reduced eubacterial core (Figure 4.6A) are very similar to those of the rejected eubacterial core (Figure 4.6B).

From visual inspection of the boxplots, dN/dS values (ω) for sequences in the reduced eubacterial core do not appear to be significantly different from ω values in the rejected eubacterial core (Figure 4.7). The boxplots of ω values within each codon alignment in the reduced and rejected eubacterial cores were plotted to estimate the level of variation in the ω values between data sets. Visual inspection of these values (Figures 4.8A and B) reveals no systematic difference between the two cores. To verify this observation, an unpaired two-sample t-test for equality of means was used to compare the mean ω values within the 60 protein-coding gene sets of the reduced eubacterial core (3,540 = 60 × 59 comparisons) and within the 206 protein-coding gene sets of the rejected eubacterial core (42,230 = 206 × 205 comparisons). The t-test indicated that 69% of the mean ω values within the reduced eubacterial core were not significantly different at the 95% significance level. The results for the rejected eubacterial core were the same: 69% of the mean ω value pairs compared were not found to be significantly different from each other. Therefore, I concluded that the reduced eubacterial core was a representative sample of the entire core.
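The comparison counts and the form of the test can be illustrated as follows. The text does not state whether equal variances were assumed, so the sketch below assumes the Welch (unequal-variance) form of the unpaired two-sample t-test; the function names `ordered_pairs` and `welch_t` are hypothetical, and the ω values in the usage example are invented purely for illustration.

```python
from math import sqrt
from statistics import mean, variance


def ordered_pairs(n_sets):
    """Number of ordered pairs of distinct gene sets, e.g. 60 x 59."""
    return n_sets * (n_sets - 1)


def welch_t(x, y):
    """Welch t statistic and approximate degrees of freedom for two samples."""
    vx = variance(x) / len(x)  # squared standard error of the first mean
    vy = variance(y) / len(y)  # squared standard error of the second mean
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


print(ordered_pairs(60), ordered_pairs(206))

# Illustrative (made-up) omega values for two gene sets.
t, df = welch_t([0.02, 0.05, 0.04], [0.03, 0.06, 0.05])
print(round(t, 3), round(df, 1))
```

The p-value for each pair would be obtained from a t distribution with `df` degrees of freedom; that step is omitted here to keep the sketch free of external dependencies.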

Figure 4.6: Rate of non-synonymous (dN) and synonymous (dS) substitutions. After filtering of the dN and dS values, the protein-coding gene sets consisted of (A) 60 protein-coding genes for the reduced eubacterial core; and (B) 206 protein-coding genes for the rejected eubacterial core.

Figure 4.7: Pairwise dN/dS ratios within the reduced and rejected eubacterial cores. The boxplots show the distribution of the dN/dS ratios within all pairs of sequences within proteins of the reduced and rejected eubacterial cores. After filtering of the dN and dS values, the reduced eubacterial core consisted of 60 protein-coding gene sets and the rejected eubacterial core of 206 protein-coding gene sets. The boxplots show the dN/dS ratio for pairwise comparisons within each core.

4.4 Discussion

The main aim of this thesis was to analyse the origin of the Prochlorococcus and marine Synechococcus taxa in relation to the rest of the cyanobacteria and to place this evolutionary history in the broader context of the Eubacteria. To do this, large data sets had to be compared and this, as always, may lead to artefacts caused by the analytical procedures. In this chapter I have described how a set of comparable genes was chosen and how sources of error were addressed.

4.4.1 Selection of taxa

One of the first decisions to be made in a phylogenetic analysis is the selection of taxa that should be included. Taxa must be chosen for their potential to serve the aims of the study, because inappropriate choices may compromise the analysis. In my case, I required a set of taxa from which I could infer orthologous sets of protein-coding genes that could reveal the origins of the taxon-specific core genes of the Prochlorococcus/Synechococcus-group in the context of the eubacteria. In an ideal situation, without limitations on data or computational resources, I would have chosen to use a random sample of all extant eubacteria. A large enough random sample would contain enough PS-group taxa to test my hypotheses, be representative of the true diversity of the extant eubacteria and not be biased by over- or under-sampling of any bacterial lineages. Unfortunately, such an ideal is not, and may never be, possible. My selection of taxa is an approximation to this ideal, constrained by the resources available to me. Since the data would be used for phylogenetic analysis, for which I needed accurate sets of orthologues (Chiu et al., 2006), I was restricted to eubacteria for which whole genomes had been publicly released. I chose to include all oxy- and anoxy-photosynthetic bacterial taxa and a representative sample of non-photosynthetic bacteria. Since I expected that any phylogenetic analyses would be computationally “expensive”, I tried to select a minimal set of non-photosynthetic taxa that could reveal LGT between distantly related bacterial lineages, yet would still be computationally tractable.

Figure 4.8: Pairwise dN/dS ratios within the eubacterial core. The boxplots show the distribution of the dN/dS ratios within each of the sets of core protein-coding genes of (A) the reduced eubacterial core; and (B) the rejected eubacterial core. After filtering of the dN and dS values, the reduced eubacterial core consisted of 60 protein-coding gene sets and the rejected eubacterial core of 206 protein-coding gene sets. One outlier value of dN/dS, of 2.67, was removed from (B) to facilitate an easier comparison with the values in (A).

Initially, I experimented with selecting one representative from each genus of the non-photosynthetic bacteria. But after some initial investigations, I realised my data would be biased in that they would not contain taxa that represent the diversity of the eubacteria at the genus level. Furthermore, phylogenetic analyses using maximum-likelihood methods might have been prohibitively slow due to the large number of sequences in each data set. Hence, I chose to sample only one or two representatives from each of the phyla for which representatives were available.

Since bacterial research has traditionally been funded by the medical and energy sectors, the bacterial genomes available are heavily biased towards bacteria that are: (i) human pathogens; (ii) associated with commercially important livestock or crops; or (iii) commercially important for biotechnology, energy production, or bioremediation. In addition, the available genomes are skewed towards bacteria which can be cultured (Rappé and Giovannoni, 2003). Hence, my selection may not be truly representative of the extant eubacteria.

I acknowledge that the lack of multiple taxa from each phylum may lead to taxa on relatively long terminal branches, increasing the possibility of LBA effects. However, my sample contains a representative sample from the PS-group, enough other cyanobacteria to test the coherence of the PS-group core genome, representatives from four of the five anoxyphotobacterial groups to test the evolutionary relationships of the photosynthetic bacterial groups, and a fair sample of eubacterial phyla to explore the evolution of the photosynthetic bacteria (and especially the cyanobacteria) in a eubacterial context.

4.4.2 Orthologue inference

A key requirement of data used to infer gene phylogenies is that the aligned sequences are orthologues sampled from different organisms that follow species divergence patterns (Blair et al., 2005; Ciccarelli et al., 2006; Kuzniar et al., 2008). The inclusion of paralogues or xenologues may lead to incorrect conclusions on the relationships between taxa (Altenhoff and Dessimoz, 2009). Hence, having an accurate orthologue-inference method that is suitable for the data is important.

The BLAST family of programs is the most popular set of tools for finding homologous pairs of nucleotide or amino-acid sequences (Moreno-Hagelsieb and Latimer, 2008). They are suitable for use with a majority of genes (McGinnis and Madden, 2004), easy to use, and familiar because they are the default method for querying online sequence databases such as NCBI GENBANK. The BLAST programs rely on “local alignment” to find similar sequences. The query sequence is broken into small substrings which are matched against the database. The most promising matches in the database are then extended on either side of the small query substring. A limitation of the BLAST programs, caused by the local alignment strategy, is the inability to detect significant similarity between sequences that: (i) are more divergent; (ii) are short; or (iii) contain repeats (e.g., Li et al., 2000; Wang and Dunbrack, 2003). Since my study required sets of orthologues that were: (i) well conserved across long periods of evolution but not part of large gene families; (ii) longer than the lengths at which BLAST does not perform well; and (iii) protein-coding without many repeat regions, BLAST was a suitable method.

My approach, based on all-against-all BLAST comparisons, inferring homologous pairs through the reciprocal BLAST hit (RBH) method (Tatusov et al., 1997; Bork et al., 1998), and inferring orthologous pairs, was developed to minimise both the false positive (type I) and the false negative (type II) error rates in the inference of sets of orthologues. The all-against-all BLASTP comparisons were generated with the permissive E-value threshold of 10⁻³ to reduce the false negative rate, ensuring that more distant homologous relationships were retained to provide a more comprehensive set of similarity relationships for further analysis. The stricter criteria used to define an orthologous pair, based on the E-value, score and percent identity, were used to reduce the number of false positive relationships.
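The reciprocal best hit step can be illustrated with a short sketch. It is not the pipeline used in this study: the function name `reciprocal_best_hits`, the tuple layout (query, subject, E-value, bit score), the `genome_of` lookup and the toy gene identifiers are all assumptions made for illustration, and the stricter score and percent-identity criteria are omitted.

```python
def reciprocal_best_hits(hits, genome_of, max_evalue=1e-3):
    """Return reciprocal best hit pairs as a set of frozensets.

    hits      -- iterable of (query, subject, evalue, bitscore) tuples
    genome_of -- dict mapping each gene identifier to its genome
    """
    best = {}  # (query gene, subject genome) -> (bitscore, subject gene)
    for query, subject, evalue, bitscore in hits:
        # Discard weak hits and hits within the same genome.
        if evalue > max_evalue or genome_of[query] == genome_of[subject]:
            continue
        key = (query, genome_of[subject])
        if key not in best or bitscore > best[key][0]:
            best[key] = (bitscore, subject)

    pairs = set()
    for (query, _), (_, subject) in best.items():
        # Keep the pair only if the best hit in the other direction agrees.
        back = best.get((subject, genome_of[query]))
        if back is not None and back[1] == query:
            pairs.add(frozenset((query, subject)))
    return pairs


# Toy data (hypothetical): genes a1, a2 from genome A; b1, b2 from genome B.
genome_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
hits = [
    ("a1", "b1", 1e-50, 200.0),
    ("a1", "b2", 1e-10, 80.0),
    ("b1", "a1", 1e-50, 200.0),
    ("b2", "a1", 1e-10, 80.0),
    ("a2", "b2", 1e-2, 30.0),  # fails the permissive E-value threshold
]
rbh_pairs = reciprocal_best_hits(hits, genome_of)
```

In this toy example only a1 and b1 are each other's best hit, so b2 (whose best hit a1 prefers b1) and a2 (whose only hit is too weak) are left unpaired.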

An alternative to the RBH method is the Reciprocal Shortest Distances (RSD) method (Wall et al., 2003), which uses an evolutionary measure of distance to rank the protein-coding gene homologues found by BLASTP (Moreno-Hagelsieb and Latimer, 2008). It has been argued that a method based on evolutionary distances might be better than one based solely on BLAST similarity metrics, but a study comparing the two found that RSD has no advantage over RBH: the number of orthologues inferred is approximately the same but the error rate of RSD is higher than that of RBH (Moreno-Hagelsieb and Latimer, 2008).

The MCL algorithm was selected as an accurate, computationally efficient solution to finding cliques in a graph of almost 180,000 protein-coding genes connected by over 2,000,000

putative relationships of orthology. Since the best BLAST hit may not always correspond

to the orthologue in the presence of in-paralogues, out-paralogues or xenologues (Galperin

and Koonin, 1998; Brenner, 1999; Koski and Golding, 2001; Moreno-Hagelsieb and Latimer, 2008), my initial graph may contain a proportion of false positive relationships of

orthology. In this situation, methods that start with a “seed” protein-coding gene and

then use transitive relationships of orthology may result in erroneous groupings. The MCL

algorithm finds clusters of highly-connected protein-coding genes across the entire input

graph through repeated application of an “expansion” operation to simulate flow across

the graph and an “inflation” operation to reinforce strong connections and decrease the

weight of weak connections (van Dongen, 2000a,b). Although edge weights in the input graph may take real values, on the recommendation of the MCL authors and my own preliminary investigations (data not shown), I found that a Boolean-weighted, symmetric input matrix corresponding to reciprocal relationships between two protein-coding genes was the best approach and produced sensible clusters.
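The alternation of expansion and inflation can be sketched in a few lines of pure Python. This is a simplified illustration of the idea rather than the MCL program itself: the function names, the added self-loops, the fixed iteration count and the 1e-6 reporting threshold are assumptions, and the pruning and convergence checks of the real implementation are left out.

```python
def normalise_columns(matrix):
    """Scale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(adjacency, inflation=2.0, iterations=50):
    """Cluster a Boolean, symmetric adjacency matrix by expansion and inflation."""
    n = len(adjacency)
    # Self-loops are added before normalising (an assumed convention here).
    m = [[1.0 if i == j else float(adjacency[i][j]) for j in range(n)]
         for i in range(n)]
    m = normalise_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix to simulate flow along longer paths.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power and renormalise, which
        # strengthens strong connections and weakens weak ones.
        m = normalise_columns([[x ** inflation for x in row] for row in m])
    # Each row that still carries flow lists the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters


# Toy graph (hypothetical): two triangles of reciprocal relationships.
triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = mcl(triangles)
```

The toy graph is deliberately trivial so that the result is unambiguous; the inflation parameter matters most where groups are joined by a few spurious edges, which is the situation described above.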

A potentially controversial decision was the inclusion of multiple protein-coding genes from the same organism if the protein-coding genes in question were so similar, and aligned so well to the protein-coding genes from other bacteria, that I could not decide which was the putative orthologue. I have adopted the following definition of orthology: “the relationship of two homologous characters whose common ancestor lies in the cenancestor of the taxa from which the two sequences were obtained” where the cenancestor is defined as “the most recent common ancestor of the taxa under consideration” (Fitch and Markowitz, 1970; Fitch, 2000). Hence, I argue that the inclusion of multiple copies of protein-coding genes from the same organism is acceptable in my study because the one or two extra protein-coding genes in an alignment of ~70 bacteria will not introduce significant bias into the substitution model selected for the analysis, and the resulting phylogeny would not be compromised by errors in topology.

My data preparation methodology differs from that of other studies in that I have attempted to select data that will produce more accurate phylogenies. The researchers who have done the most work on the evolutionary history of PS-group genomes are Zhaxybayeva and colleagues. Early studies (Zhaxybayeva and Gogarten, 2002, 2003b; Zhaxybayeva et al., 2004b, 2006) used an all-against-all BLASTP method to find homologues and selected quartets of orthologues based on RBHs. The authors acknowledged the possibility that earlier studies could have been affected by LBA effects in the quartets (Zhaxybayeva and Gogarten, 2003b) and questioned whether the conflict observed between gene phylogenies could be due to the orthologue selection method or to taxon sampling (Zhaxybayeva et al., 2004b). They argue that RBH is suitable for inferring orthologues because it is a conservative method that excludes a high level of false negatives from the analysis (Zhaxybayeva et al., 2004b). In the most recent study (Zhaxybayeva et al., 2009), more attempts have been made to refine the orthologue inference procedure.

The authors used the all-against-all BLAST method to find homologues, but then utilised the MCL method with an inflation parameter of 1.1 to find gene clusters, followed by the branchclust program (Poptsova and Gogarten, 2007) to find gene clusters within the gene families. Finally, the data were “cleaned” using GBLOCKS. Their methods are similar to mine in that they used superior clustering methods to find the sets of putative orthologues and a method to eliminate sites of uncertain homology. My method, however, additionally eliminates data sets based on length and SRH conditions, and excludes more questionable data from analysis.

It is difficult to evaluate the accuracy of MS-clusters that are assumed to be an approximation to a set of orthologues both from a theoretical and a practical point of view.

Different thresholds of similarity that might be expected for orthologous proteins result in different classifications. LGT of a gene from a closely-related bacterium followed by loss of the orthologue is difficult to detect through sequence similarity alone. The sheer amount of data makes manual verification difficult.

Since an objective of this study was to use the highest quality data to investigate relationships among cyanobacteria, I manually curated each of the 438 sets of orthologues of my initial eubacterial core set. This involved viewing alignments, refining alignments, removing individual sequences and rejecting alignments with sequences that might not be orthologous. Manual curation of the MCL reduced eubacterial core clusters was performed by comparison with clusters I individually collated from the raw BLAST results.

For each MCL cluster, I chose a reliably-annotated “seed” protein-coding gene, then collated the set of all protein-coding genes within 2 (or sometimes 3) edges of the seed protein-coding gene. Each set was pruned to remove individual protein-coding genes that were probably not orthologous because they were not as highly connected to the rest of the subgraph, their alignment indicated missing domains or large indels, or they had an annotation that suggested they were a different protein-coding gene. For clusters with multiple representatives from the same chromosome or the same bacterium, NJ trees and/or Neighbor-Net networks were generated to evaluate whether the cluster may have consisted of two closely-related sets of orthologous protein-coding genes, or to distinguish which of the in-paralogues was likely to be the orthologue.

4.4.3 The reduced eubacterial core

Another limitation of my reduced eubacterial core of 62 protein-coding genes is that it is relatively small. The initial eubacterial core was reduced from 438 to 396 (~90%) based on putative orthology, further reduced to 280 (~71%) based on length, and finally reduced to 62 (~14%) based on the SRH test (Ababneh et al., 2006). Of the genomes in my study, 62 protein-coding genes represent ~2% of the protein-coding genes, and only ~14% of the initial eubacterial core of 438 protein-coding genes. My approach could thus be criticised on the basis that any conclusions drawn from the evolutionary history of such a small proportion of the genome would constitute a “tree of one percent” (Dagan and Martin, 2006) or that such a reduced eubacterial core is not representative of the true phylogeny of the entire genome. The “tree of one percent” is a term discussed in a recent paper by Dagan and Martin (2006), questioning the conclusions of a study by Ciccarelli et al. (2006). At face value, this criticism suggests that the latter’s conclusions are based on 1% of the genes of an organism which have been chosen and not randomly selected, and that the other 99% of genes might yield a very different phylogenetic tree or, in fact, no tree at all. However, I disagree with their reasoning.

In their study, Ciccarelli and colleagues: (i) identified a eubacterial core set of protein-coding genes from the three domains of life that produced trees that did not conflict; (ii) performed a phylogenetic analysis on the concatenated alignment of these protein-coding genes; (iii) found a eubacterial core of 31 protein-coding genes from 191 species that supported a tree; and (iv) concluded there is support for a tree-like phylogeny of life. On the other hand, Dagan and Martin assert that: (i) 31 protein-coding genes corresponds to approximately 1% of that present in a bacterial genome; (ii) 99% of the genes must show conflict; (iii) this is overwhelming support for a network, rather than a tree, of life; and (iv) thus LGT appears to play a significant role in the evolution of bacteria. The methodology of the former study is questionable. The arguments of the latter are not valid because a lack of a clear signal cannot be interpreted as evidence of conflict.

My study differed from that of Ciccarelli et al. (2006) and other such studies in that I exhaustively searched for a set of eubacterial core protein-coding genes that would result in more accurate phylogenies. Rather than generating trees and rejecting protein-coding genes that give conflicting trees, I have selected a reduced eubacterial core of protein-coding genes that may yield phylogenies with less error. While it might be preferable to have more than 62 protein-coding genes, my approach, and almost any approach, is always going to result in a reduced eubacterial core. However, I argue that although my reduced eubacterial core may represent only 2% of the genomes it cannot result in a “tree of one percent” as it is meant in the Dagan and Martin study, i.e., that it ignores the other 98-99% of the genes in the genome.

The study of Ciccarelli and colleagues found a reduced eubacterial core of protein-coding genes which gave trees that did not conflict with each other. My study identified a reduced eubacterial core of protein-coding genes which, though smaller in number, were chosen to yield accurate phylogenies. In the approach of Ciccarelli and colleagues, the lack of conflict could be due to systematic biases of the phylogenetic method or to LGT. But my approach is an attempt to find the best data for phylogenetic analysis. Hence, my eubacterial core is not the “1%” of the genome that is tree-like, but rather the 2% of the eubacterial core that could yield more accurate phylogenies. Furthermore, I have demonstrated that my reduced eubacterial core is representative of the entire eubacterial core by checking that the reduced eubacterial core contains a range of KEGG gene functional categories and that the sampling of the reduced eubacterial core is similar to that of the entire eubacterial core. Similarly, the phylogenetic signal in the reduced eubacterial core was demonstrated to be approximately similar to that of the rejected eubacterial core in my supertree and supernetwork analysis. Hence, my reduced eubacterial core is representative of the whole eubacterial core and so conclusions about the evolutionary history of the reduced eubacterial core can be extended to that of the whole eubacterial core.

4.4.4 Phylogenetic signal in the reduced eubacterial core

The supertrees and supernetworks were inferred in this study to give a rough indication of the phylogenetic signal in the reduced and rejected eubacterial cores. In particular, I wanted to know if the signal in the two eubacterial cores was markedly different, which would have suggested that the reduced eubacterial core was not representative of the core protein-coding genes that were rejected by the SRH tests.

Supernetworks, as inferred by the implementation in the SplitsTree software, are not a representation of the evolutionary history of the taxa, but rather a graphical summary of multiple input phylogenies (Huson et al., 2004a). Better bootstrap clade support data for the supernetworks would have been preferred, but my analysis still provides a view of the consensus of the phylogenetic signal in the reduced and rejected eubacterial cores. The general groupings of the supernetworks and, in particular, the lack of resolution at the centre of the phylogenies, indicate considerable disagreement among ML trees of both eubacterial cores.

Since the reduced and rejected eubacterial cores display the same characteristics when examined under two super-phylogeny methods, I conclude that the phylogenetic signal of the reduced eubacterial core is representative of the entire eubacterial core. Hence, any conclusions drawn from phylogenetic analysis of the reduced eubacterial core can be extended to the larger eubacterial core, which contained more variation in the amino-acid frequencies of the protein-coding gene sequences.

4.4.5 Evidence of selection in the reduced and rejected eubacterial cores

The filtering applied during preparation of the ω values for this study retained only 1% of the pairwise comparisons generated from both the reduced and the rejected eubacterial cores. Though I would have preferred to have retained a larger portion of the original pairwise ω estimates, my data contained representatives from each of the four taxonomic

groups of interest (Table 4.4) and the 3,708 values retained were sufficiently numerous to

give an indication of the selective pressures acting on all 280 eubacterial core data sets.

The problems experienced in preparation of the data were shared by a recent study which

compared sequences from only the PS-group taxa (Hu and Blanchard, 2009). They too

found that most of the comparisons were substitutionally saturated and eliminated all

those where dS > 1.5. I eliminated all comparisons with ω > 2.0 for the same reason.

The conclusions of the Hu and Blanchard study were based on ω values estimated by an approximation to the ML method (Yang and Nielsen, 2000) implemented in the YN00 program from the PAML package. Although the author of PAML recommends that the ML method implemented in CODEML should be used in preference to that implemented in YN00 (Yang, 2008), this does not appear to have affected the analyses greatly as both studies reached the same conclusion: that there is no evidence for diversifying selection in eubacterial core genes.

My analysis of the non-synonymous to synonymous substitution rates between pairs of sequences within the reduced eubacterial core of 62 protein-coding genes and the rejected eubacterial core of 218 protein-coding genes did not find evidence of diversifying selection. There does not appear to be a significant difference between selection pressures in the reduced and rejected eubacterial core sets of protein-coding genes. On this basis, I conclude that my reduced eubacterial core is representative of the entire eubacterial core of 280 protein-coding genes and is suitable for phylogenetic analysis. The use of these 62 eubacterial core protein-coding genes in constructing phylogenetic trees is preferable to using the expanded eubacterial core of 280 protein-coding genes because it: (i) should yield more reliable trees; and (ii) is computationally more tractable.

4.4.6 Conclusions

The main aim of this thesis is to investigate the origins of the Prochlorococcus and marine

Synechococcus genomes. In this chapter, I described how I have selected data that could be used to investigate the origins of the PS-group genomes through the phylogenetic analysis of protein-coding genes that are ubiquitous across the eubacteria. Following the investigations carried out in this chapter, I have come to the following conclusions:

1. The preparation of data for phylogenetic analysis involves many steps, at each of which errors can be introduced, which will lead to error in the final phylogenies. My hypothesis is that previous studies, which have concluded that eubacterial core genes have been subject to high levels of LGT during their evolutionary history, may have overstated the role of LGT because their conclusions were based on data with properties that mislead the phylogenetic inference process. My approach has been to evaluate the sources of error at each step of the data preparation process and take all reasonable steps to eliminate data that could lead to error in the resulting phylogenies.

2. The final data consisted of 62 amino-acid alignments of protein-coding genes sampled from the 70 eubacteria selected for this study. The sequences of each set were selected from bacteria for which whole genomes were available, to maximise the possibility of inferring sets of sequences which are true orthologues. Each alignment retained for phylogenetic analysis had a length of at least 100 amino acids and low variation of the amino-acid frequency between the sequences of the same alignment. The data sets included a representative sample of biochemical pathways present in the bacteria selected for study.

3. One possible criticism of my methodology is that my data, which consists of 62 eubacterial core protein-coding genes, is too small to be representative of the bacterial genomes from which they have been selected. If this were the case, any conclusions on the evolutionary history of the reduced eubacterial core could not be extended to that of the entire eubacterial core or to the bacterial genomes as a whole. I maintain that this criticism is not directly applicable to my study. A small eubacterial core was inevitable in a study involving representatives from diverse eubacteria. The reduced eubacterial core was intentionally selected to result in more accurate phylogenies. Finally, I demonstrated that the reduced eubacterial core is representative of the entire eubacterial core, and hence, suitable for drawing conclusions on the origins of the PS-group genomes through the phylogenetic analyses described in the following chapter.

Chapter 5

The origin of the core genomes of the PS-group taxa

5.1 Introduction

The evolutionary events that shaped the Prochlorococcus spp. genomes we observe today are not known. While phylogenetic analyses of standard bacterial markers have consistently concluded that Prochlorococcus and marine Synechococcus evolved from a common ancestor (e.g., Palenik and Haselkorn, 1992), other studies suggest that significant portions of their genome - both the novel chlorophyll-based light-harvesting system and many metabolic-core genes - have been acquired through cyanophage-assisted LGT (Raymond et al., 2002; Mann et al., 2003; Sullivan et al., 2003; Raymond et al., 2003b; Zhaxybayeva

et al., 2006). Hence, the relative contributions of vertical and horizontal inheritance to

the Prochlorococcus genomes are not clear.

The primary aim of this study is to infer origins of the Prochlorococcus and marine Synechococcus genomes with respect to the eubacteria, taking into account the problems both of artefactual error in phylogenetic inference and of LGT. In this chapter, I present phylogenetic analyses based on protein-coding genes from my reduced eubacterial core and additional analyses conducted to test possible instances of LGT. These consider non-phylogenetic lines of evidence such as gene order and possible artefactual errors from evolutionary processes such as positive selection and homologous recombination, where two alleles of the same gene are combined to form a composite.

5.1.1 Conflicting phylogenies in cyanobacteria: artefacts or LGT?

The methods underpinning phylogenetic analyses have a solid theoretical foundation built up over several decades (for a review, see Felsenstein, 2004). However, a significant problem in accurately inferring bacterial evolutionary histories is distinguishing artefactual error from LGT.

A phylogenetic analysis consists of a series of steps involving choices of: (i) the in-group and out-group taxa for the study; (ii) the orthologous genes or gene fragments to use; (iii) an alignment method for the sequences; (iv) the sites that should be used; (v) a suitable substitution model; (vi) a phylogenetic method, along with parameters for its search procedure; and (vii) a method for evaluating the certainty or quality of the taxon groupings.

At each stage, an inappropriate decision can change the inferred topology and/or branch lengths, which can lead to an incorrect inference of the evolutionary history. In addition, there are fundamental problems which can lead to error in phylogenetic analyses. Some of the most common ones are outlined in the following sections.

5.1.2 Phylogenetic artefacts arising from taxon sampling

The choice of both in-groups and out-groups is known to affect the position of individual taxa in a phylogeny (for a recent review, see Heath et al., 2008a). In general, it is recommended that in-groups should consist of taxa expected to result in terminal edges of approximately equal length, in order to avoid long branch attraction (LBA) effects

(Hendy and Penny, 1989; Graybeal, 1998; Poe and Swofford, 1999), and that multiple out-group taxa separated by short internal branches be included, to reduce phylogenetic error (Graybeal, 1998).

For my set of 70 bacteria, I was concerned that the inclusion of many distantly-related non-cyanobacterial out-groups and oversampling of the PS-group, relative to the rest of the cyanobacterial in-groups, would result in the oversampled in-groups being drawn toward the base of the tree (due to model parameters estimated for a biased sample of the taxa). These conflicting phylogenies might then be taken as evidence of LGT.

In this chapter, the possible effects of taxon sampling are explored by repeating phylogenetic analyses with different sub-samples of the taxa. Using this approach, I hope to

determine whether the phylogenetic results inferred on the basis of the original alignments

were affected by over-sampling of some taxonomic groups and inclusion of distantly-related

non-cyanobacterial groups.

Another risk is the choice of cyanobacterial sequences that do not approximate the full

diversity of the extant cyanobacteria, which could also produce taxonomic groupings that

could be interpreted as LGT.

5.1.3 Phylogenetic artefacts arising from LBA effects

Another possible explanation for the differing - or congruent - position of individual cyanobacteria in phylogenies inferred from different eubacterial core genes is that some of the sequences have been affected by LBA artefacts.

Some sets of sequences cannot be analysed without the risk of LBA artefacts creeping into the analysis. For example, if a set of sequences contains a subset whose GC content or amino-acid frequencies differ from those of the rest of the group, those unusual taxa may be grouped into a clade. However, this artificial grouping may not reflect the evolutionary relationships between the sequences.

Positive selection across a relatively distantly-related set of sequences could result in long terminal branches on an otherwise ordinary phylogeny. However, most phylogenetic models and methods will incorrectly infer that these distantly-related sequences share a common ancestor. So phylogenies inferred from sets of sequences which had, or had not, experienced positive selection may be significantly different, possibly leading to the incorrect conclusion that LGT had played a part in the evolutionary history of some genes.

Lastly, it is possible that some genes have experienced homologous recombination with DNA from other sequences. This scenario can be thought of as a form of LGT that involves only segments of genes. Here, gene segments genuinely have different evolutionary histories and phylogenetic analysis may recover trees that more closely reflect the evolutionary history of the source of the newly-acquired sequences rather than the original segments of the sequence.

5.1.4 Distinguishing artefactual error from LGT

The main aim of this study was to infer phylogenies with a minimum of phylogenetic artefacts and, hence, to reveal the origins of the Prochlorococcus and marine Synechococcus genomes.

Earlier studies focusing on the PS-group (e.g., Hess et al., 2001) did not answer this question because they: (i) were based on a single molecular marker (which may have been subject to LGT); (ii) produced different phylogenies from the same genes; and/or (iii) did not contain enough non-cyanobacterial out-groups. More recent studies (Mulkidjanian et al., 2006; Kettler et al., 2007; Dufresne et al., 2008) did not adequately address this question either, because they: (i) did not contain enough cyanobacterial taxa; (ii) did not include a broad enough sample from other bacterial phyla; or (iii) focused on the presence and absence of non-cyanobacterial-core genes.

In this study, I use both the 16S rDNA gene and a curated subset of eubacterial core protein-coding genes to infer the origins and evolution of the Prochlorococcus and marine

Synechococcus genomes and assess whether the conflicts among previously inferred phylogenies (Raymond et al., 2002; Zhaxybayeva et al., 2006) could be due to a combination of model misspecification (e.g., Jermiin et al., 2008), taxon sampling (e.g., Heath et al., 2008a), positive selection, recombination or LGT (e.g., Raymond et al., 2002). In addition,

I explore alternate lines of evidence, such as the gene order surrounding protein-coding genes purported to have experienced LGT.

5.1.5 Aims

The aims of this study were to:

1. infer the evolutionary histories of the protein-coding genes of the reduced eubacterial

core using a reliable phylogenetic method;

2. visualise the phylogenetic conflict in the alignments of the reduced eubacterial core

protein-coding genes;

3. estimate the effects of taxon sampling on phylogenetic analyses of cyanobacteria, and

hence, on conclusions about the evolutionary relationships between cyanobacteria;

4. evaluate whether alleged instances of LGT can be explained by other evolutionary

processes, such as accelerated evolutionary rates or recombination, or by properties

of the data that may lead to phylogenetic artefacts, such as variation in GC content;

5. evaluate the importance of LGT in the evolution of eubacterial core protein-coding

genes in Prochlorococcus and marine Synechococcus genomes.

5.2 Methods

In the previous chapter, I identified a reduced set of 62 eubacterial core protein-coding genes that would lead to reliable phylogenies (from 70 diverse eubacteria). Here, I describe how these were used along with the 16S rDNA set to explore the origins of the PS-group genomes.

5.2.1 Inference of a 16S rDNA reference tree

The 16S rDNA gene is the reference phylogeny by which bacteria are classified (Fox et al.,

1977; Woese and Fox, 1977; Garrity, 2001). To extend previous analyses to include the

70 eubacteria selected for this study and to incorporate the information encoded in the

secondary structure of the rDNA molecule, a Bayesian phylogeny was inferred using the PHASE package v2.0-β (Jow et al., 2002).

The RNA sequences were identified by their annotation from the NCBI REFSEQ database of whole genomes. The nucleotide sequences were aligned with MAFFT using default parameter values and annotated with the corresponding stem-loop structure of three reference sequences downloaded from the EUROPEAN RIBOSOMAL RNA database (Wuyts et al., 2004). The stem-loop annotation was used to manually improve the initial MAFFT alignment and partition the sites into those in stem or loop regions. As in other recent analyses (e.g., Murray et al., 2005), the Bayesian analysis was performed using the paired-site RNA7A model for sites in the stem regions. The most appropriate model suggested by use of PAUP* v4.0b10 (Swofford, 2003) and MODELTEST v3.7 (Posada and Crandall, 1998) on the basis of the BIC (Schwarz, 1978) was used for all other sites.

Following a preliminary analysis to estimate the parameters for the Markov chain Monte Carlo search, I used the MCMCPHASE and MCMCSUMMARIZE programs to run the analysis for 1,000,000 iterations. The first 200,000 iterations were discarded as the burn-in period. A consensus tree was obtained from the trees sampled every 50 iterations thereafter. The convergence of the search following the burn-in period was investigated by plotting the log-likelihood values using the TRACER program v1.3 (Rambaut and Drummond, 2007). If the search had converged, the log-likelihood values should have reached a stable plateau by the end of the burn-in period and stayed at that plateau during the entire sampling phase.

The OPTIMIZER program was used to calculate the ML estimates of the branch lengths. The Bayesian Posterior Probability (BPP) of the clades was found by computing the support for each clade in the 10,000 sampled trees with the Majority rule (extended) feature of the CONSENSE program of the PHYLIP package v3.63 (Felsenstein, 1989, 2005). The support value for each clade was scaled to a probability value between 0 and 1.
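The scaling of clade support to a probability can be illustrated as follows. This is a minimal sketch and not the CONSENSE implementation: it assumes each sampled tree has already been decomposed into its clades, represented as frozensets of taxon labels, and the function name `clade_support` and the toy taxa A, B and C are hypothetical.

```python
from collections import Counter


def clade_support(sampled_trees):
    """Return the fraction of sampled trees that contain each clade.

    sampled_trees -- list of trees, each given as a collection of clades,
                     where a clade is a frozenset of taxon labels.
    """
    counts = Counter()
    for clades in sampled_trees:
        counts.update(set(clades))  # count each clade once per tree
    total = len(sampled_trees)
    return {clade: count / total for clade, count in counts.items()}


# Toy sample of four trees on the taxa A, B and C.
ab = frozenset("AB")
bc = frozenset("BC")
abc = frozenset("ABC")
support = clade_support([{ab, abc}, {ab, abc}, {bc, abc}, {ab, abc}])
```

Here the clade (A, B) appears in three of the four sampled trees and so receives a support of 0.75, while the clade containing all three taxa receives 1.0.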

5.2.2 Inference of ML trees from the reduced eubacterial core

Maximum-likelihood trees were inferred from each of the 62 data sets using TREEFINDER (version of June 2007) (Jobb et al., 2004). The analyses were run using the substitution model recommended by PROTTEST v1.2.7 (Abascal et al., 2005), on the basis of the hLRT (Huelsenbeck and Crandall, 1997), AIC (Akaike, 1974) and AICc (Hurvich and Tsai, 1989). In general the models recommended by each of these tests were very similar for a given gene; in cases where they were not, the model with fewer parameters was preferred. For protein-coding genes where rate heterogeneity across sites was implied, a discrete Γ distribution was used, with shape parameter α and four rate categories (denoted Γ4), and a proportion of invariant sites (denoted I), estimated by TREEFINDER. The confidence of inferred edges was estimated by the LR-ELW method with 1,000 replicates (Strimmer and Rambaut, 2002).
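The information criteria named above follow standard formulas, sketched here for reference. This is not the PROTTEST code: the function names, the example log-likelihoods and parameter counts are invented for illustration, and taking the sample size n to be the alignment length is a common convention that is assumed here.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion for a model with k free parameters."""
    return 2 * k - 2 * log_likelihood


def aicc(log_likelihood, k, n):
    """AIC with the small-sample correction for sample size n."""
    return aic(log_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(log_likelihood, k, n):
    """Bayesian (Schwarz) information criterion."""
    return k * math.log(n) - 2 * log_likelihood


def best_model(candidates, criterion):
    """Return the name of the candidate with the lowest criterion value."""
    return min(candidates, key=lambda name: criterion(*candidates[name]))


# Hypothetical fits: model name -> (log-likelihood, number of parameters).
candidates = {"WAG": (-5000.0, 137), "WAG+G": (-4900.0, 138)}
chosen = best_model(candidates, aic)
```

With these invented values the extra shape parameter of the second model is easily justified by its gain in likelihood, so it has the lower AIC.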

5.2.3 Visualisation of phylogenetic conflict with Neighbor-Net analyses

A Neighbor-Net analysis has an advantage over tree-based representations because it can represent the conflicting groupings of taxa (with bands of parallel edges) while retaining the display of evolutionary distances (Bryant and Moulton, 2002, 2004). It was chosen for its potential in exploring site pattern incompatibilities between ancient or rapidly-evolved lineages (Whitfield and Lockhart, 2007).

I inferred Neighbor-Net phylogenies for the 16S rDNA (without the structural information) and the 62 eubacterial core protein-coding gene alignments using SPLITSTREE v4.1.9 (Huson and Bryant, 2006). The most appropriate model, the proportion of invariant sites and the α parameter of the discrete Γ distribution were estimated by MODELTEST for the 16S rDNA alignment and by PROTTEST for the protein-coding gene alignments. The confidence of inferred clades was estimated from 1,000 bootstrap replicates.

5.2.4 Quantification of cyanobacterial over-sampling effects

The choice of taxa is an important consideration for any phylogenetic analysis because it can affect the topology and branch lengths of the resulting phylogeny and lead to incorrect conclusions (Heath et al., 2008a). Hence, the variability in the eubacterial core gene phylogenies inferred in previous studies could be due to the use of different in-group and out-group taxa in different studies. To estimate the magnitude of tree topology differences that can be obtained by changing the proportion of in-group and out-group taxa, I generated new 16S rDNA alignments for subsamples of the original taxa used for the original phylogeny (Table 4.1). Analyses were performed to infer: (i) the tree for the original cyanobacteria plus the 5 non-cyanobacterial out-groups having the shortest branches relative to the cyanobacteria; (ii) the “Majority Rule (extended)” consensus tree for the 153 possible data sets composed of a sub-sample of 2 of the 18 Prochlorococcus and marine Synechococcus sequences plus the rest of the cyanobacteria and the same 5 non-cyanobacterial out-groups; and (iii) the consensus tree for 10 trees containing 3 randomly-selected PS-group taxa, the rest of the cyanobacteria and the same 5 non-cyanobacterial out-groups.
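The sub-sampling designs in (ii) and (iii) can be enumerated with a few lines of code. The sketch below only builds the taxon lists and is an illustration under stated assumptions: the function names and the placeholder labels are hypothetical (the real taxa are those of Table 4.1), and the number of other cyanobacteria is arbitrary.

```python
from itertools import combinations
import random


def taxon_sets_for_pairs(ps_taxa, other_cyanobacteria, outgroups):
    """Yield one taxon list for every possible pair of PS-group taxa."""
    for pair in combinations(ps_taxa, 2):
        yield list(pair) + list(other_cyanobacteria) + list(outgroups)


def random_taxon_sets(ps_taxa, other_cyanobacteria, outgroups,
                      size, count, seed=0):
    """Return `count` taxon lists, each with `size` random PS-group taxa."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return [
        rng.sample(ps_taxa, size) + list(other_cyanobacteria) + list(outgroups)
        for _ in range(count)
    ]


# Placeholder labels only (hypothetical).
ps_taxa = [f"PS{i:02d}" for i in range(1, 19)]      # 18 PS-group taxa
other_cyano = [f"CYA{i:02d}" for i in range(1, 4)]  # arbitrary count
outgroups = [f"OUT{i}" for i in range(1, 6)]        # 5 out-groups

pair_sets = list(taxon_sets_for_pairs(ps_taxa, other_cyano, outgroups))
triple_sets = random_taxon_sets(ps_taxa, other_cyano, outgroups,
                                size=3, count=10)
```

Choosing 2 of 18 taxa gives 18 × 17 / 2 = 153 combinations, which is why design (ii) can be exhaustive while design (iii) relies on random draws.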

Bayesian trees were inferred with the PHASE program using the same method as described in §5.2.1. In a similar way, 100 new alignments were generated for each of the 62 eubacterial core protein-coding gene data sets. Each contained 2 randomly chosen PS-group taxa, the rest of the cyanobacteria, and the 5 non-cyanobacterial out-groups used for the 16S rDNA resampling tests.

ML trees were inferred with TREEFINDER using the same substitution model used for the original alignments. LR-ELW replicates were limited to 100 to reduce computation time. The “Majority Rule (Extended)” consensus tree was generated using the CONSENSE program from the PHYLIP package (Felsenstein, 1989, 2005).
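The core of a majority-rule consensus can be sketched as counting how often each clade occurs across the input trees and keeping those found in more than half of them. The sketch below assumes each tree is already reduced to a set of clades (each clade a frozenset of taxon names); it does not reproduce CONSENSE, whose extended option additionally adds less frequent groups that are compatible with the majority groups.

```python
from collections import Counter


def majority_rule_clades(trees, threshold=0.5):
    """Map each clade found in more than `threshold` of the trees to its frequency.

    trees: list of sets of clades; each clade is a frozenset of taxon names.
    """
    counts = Counter(clade for tree in trees for clade in set(tree))
    n = len(trees)
    return {clade: c / n for clade, c in counts.items() if c / n > threshold}


# Three toy trees over hypothetical taxa A-D.
trees = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABD")},
    {frozenset("AC"), frozenset("ABC")},
]
consensus = majority_rule_clades(trees)
```

Here the clades {A, B} and {A, B, C} each occur in two of the three trees and are retained, while {A, B, D} and {A, C} are not.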

5.2.5 Estimation of cyanobacterial under-sampling effects

It is possible that the position of the PS-group taxa within the cyanobacteria varies across gene trees because the cyanobacteria used in this study are not representative of the true diversity of the extant cyanobacteria, leading to biases in the resulting phylogenies. Thus, to infer the position of the PS-group taxa with respect to the rest of the cyanobacteria, I inferred a cyanobacterial tree from 16S rDNA sequences that were more representative of cyanobacterial diversity.

All full-length cyanobacterial 16S rDNA sequences having a length between 1,400 and 1,500 nucleotides were downloaded from the NCBI GENBANK databases on 19 November 2008. The sequences were aligned using CLUSTALW v1.83 (Thompson et al., 1994). The alignment was refined to agree with the structural alignment of the original sequences as described in §5.2.1. Sequences that were missing a domain at the beginning or the end of the alignment were discarded. Ambiguously aligned sites were removed using GBLOCKS, with default parameter values except that sites containing up to 50% of the gap state were permitted.
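Two of the filtering steps above, the length window and the 50% gap allowance, can be illustrated with a short sketch. It is an approximation only: GBLOCKS also applies conservation and block-length criteria that are not modelled here, and the function names are hypothetical.

```python
def filter_by_length(seqs, min_len=1400, max_len=1500):
    """Keep unaligned sequences whose length lies in [min_len, max_len]."""
    return {name: s for name, s in seqs.items() if min_len <= len(s) <= max_len}


def drop_gappy_columns(alignment, max_gap_fraction=0.5, gap="-"):
    """Remove alignment columns in which more than max_gap_fraction of rows are gaps."""
    rows = list(alignment.values())
    n = len(rows)
    keep = [
        i for i in range(len(rows[0]))
        if sum(r[i] == gap for r in rows) / n <= max_gap_fraction
    ]
    return {name: "".join(s[i] for i in keep) for name, s in alignment.items()}


# Toy alignment: column 2 is 75% gaps and is removed; column 1 (50% gaps) is kept.
toy = {"a": "AC-G", "b": "A--G", "c": "ACTG", "d": "A--G"}
trimmed = drop_gappy_columns(toy)
```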

An ML tree was inferred using TREEFINDER with the GTR model (Tavaré, 1986) with 4 discrete Γ rate categories and a proportion of invariant sites. The radial and rectangular phylograms were viewed using DENDROSCOPE (Huson et al., 2007), a tree-viewing program that can handle large data sets.

5.2.6 Positive selection or recombination as an alternative to LGT

Another possible explanation for the unusual phylogenies recovered by the ML trees and the phylogenetic conflict evident in the Neighbor-Net analyses is weaker purifying selection, positive selection or recombination. If some sequences in a homologous alignment have experienced positive selection leading to elevated substitution rates, the resulting phylogeny may be unusual because it is biased by LBA effects (Philippe and Laurent, 1998). Similarly, if some sequences in an alignment of homologous protein-coding genes have experienced homologous recombination, this can mislead the phylogenetic inference process (Posada and Crandall, 2002) and lead to incorrect conclusions (Schierup and Hein, 2000b,a). In this section, I explain how the alignments of the reduced eubacterial core were examined for evidence of positive or negative selection, or of recombination. Robust evidence of positive selection or recombination would constitute an alternate explanation for unusual ML phylogenies or conflict in the Neighbor-Net analyses.

The data for the analyses were the reduced eubacterial core alignments prepared for the pairwise dN/dS (= ω) ratio analysis in §4.2.7. These consisted of the codon alignments corresponding to the pre-GBLOCKS alignments of the 62 protein-coding genes of the reduced eubacterial core.

Evidence of non-neutral evolution in sites of each of the 62 codon alignments was calculated using the HYPHY package (Pond et al., 2005). This software currently implements three likelihood-based methods for identifying sites under positive selection. The Single Likelihood Ancestor Counting (SLAC) method infers ancestral sequences and estimates the number of non-synonymous and synonymous changes that have occurred at each codon (Pond and Frost, 2005c). The Random Effects Likelihood (REL) method fits a distribution of rates across sites then infers the rate for each site (Pond and Frost, 2005c). The Fixed Effects Likelihood (FEL) method fits substitution rates on a site-by-site basis (Pond and Frost, 2005c). The HYPHY package also implements the PARRIS method (Scheffler et al., 2006), which indicates if there is evidence of positive selection across all the sites in the alignment by using a likelihood ratio test (LRT) to distinguish between models with and without positive selection. Lastly, the GABranch method (Pond and Frost, 2005b) uses a genetic algorithm to assign a fixed number of different classes of ω to lineages and identifies lineages which have experienced positive selection.

Due to computational constraints, only the FEL method was used to infer positive selection in all 62 protein-coding genes of the reduced eubacterial core. The FEL method was chosen for this analysis because previous studies have shown it performs better than the counting methods or random effects models (Pond and Frost, 2005c) and it allows substitution rates to vary on a site-by-site basis without the need to specify site classes prior to analysis (Pond and Frost, 2005c). The analyses were conducted assuming the “universal” genetic code, since the bacterial code was not available for use, using an NJ tree inferred from the codon alignment, the REV (also known as the GTR) substitution model with empirical nucleotide frequencies, and a significance level of 0.1 for the p-value cut-off for the LRT (with a single degree of freedom) to classify a site as positively or negatively selected.
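The site-wise decision rule can be illustrated as follows. This is a sketch under stated assumptions and not the HYPHY implementation: it assumes that the null model constrains the non-synonymous and synonymous rates at a site to be equal, that the test statistic is twice the log-likelihood difference, and that it is referred to a chi-squared distribution with one degree of freedom, whose upper tail equals erfc(sqrt(x/2)). The function names and the log-likelihood values are hypothetical.

```python
import math


def lrt_pvalue_1df(lnl_null, lnl_alt):
    """p-value of a likelihood ratio test with one degree of freedom."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Upper tail of chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def classify_site(dn, ds, lnl_null, lnl_alt, alpha=0.1):
    """Label a site from its rate estimates and the LRT at significance alpha."""
    p = lrt_pvalue_1df(lnl_null, lnl_alt)
    if p >= alpha or dn == ds:
        return "neutral"
    return "positive" if dn > ds else "negative"


# Hypothetical per-site values: (dN, dS, lnL under null, lnL under alternative).
label_a = classify_site(2.0, 0.5, -100.0, -97.0)   # LRT statistic 6.0
label_b = classify_site(0.1, 1.0, -100.0, -97.0)   # LRT statistic 6.0
label_c = classify_site(2.0, 0.5, -100.0, -99.5)   # LRT statistic 1.0
```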

In addition, the SLAC, REL and PARRIS analyses were run for the four eubacterial core protein-coding genes where cyanobacteria were implicated in possible LGT events. The SLAC analyses used the REV substitution model, with the global ω value estimated starting from a value of 1.0, ambiguous nucleotides resolved using a weighted averaging method that does not count gap or missing characters (Pond and Frost, 2005c), and a significance level of 0.1 for the p-value of the two-tail extended binomial test used to infer whether a site had been under positive or negative selection. The REL analyses were conducted using the REV substitution model and a threshold of 50 for the Bayes factor (so values over 50 provide evidence for the hypothesis) (Pond and Frost, 2005c). The PARRIS analyses were conducted under the REV substitution model with a significance level of 0.1 for the p-value of the LRT with 2 degrees of freedom, where the null model has no sites under selection and the alternate model has a proportion with positive selection (Pond and Frost, 2005c).
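The two thresholds used for the REL and PARRIS analyses can be made concrete with a small sketch. It assumes the Bayes factor is the ratio of posterior odds to prior odds for a site belonging to the selected class, and that the PARRIS statistic is referred to a chi-squared distribution with two degrees of freedom, whose upper tail is exp(-x/2). The function names and numerical inputs are hypothetical and the code is not taken from HYPHY.

```python
import math


def bayes_factor(posterior, prior):
    """Posterior odds divided by prior odds for the hypothesis of interest."""
    return (posterior / (1.0 - posterior)) / (prior / (1.0 - prior))


def lrt_pvalue_2df(lnl_null, lnl_alt):
    """p-value of a likelihood ratio test with two degrees of freedom."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Upper tail of chi-squared with 2 df: P(X > x) = exp(-x / 2).
    return math.exp(-stat / 2.0)


# Hypothetical site: posterior 0.99 of being selected, prior 0.5.
bf = bayes_factor(0.99, 0.5)
site_supported = bf > 50            # threshold used for the REL analyses

# Hypothetical alignment-wide test for the PARRIS-style comparison.
p_alignment = lrt_pvalue_2df(-100.0, -97.0)
alignment_supported = p_alignment < 0.1
```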

The Genetic Algorithm Recombination Detection (GARD) method (Pond et al., 2006) implemented in the HYPHY package was chosen to screen the 62 codon alignments for recombination. It was chosen because simulation studies have shown that this method has good power and accuracy for detecting recombination as it does not rely on a sliding window approach (e.g., Holmes et al., 1999; Archibald and Roger, 2002), which is known to be sensitive to the parameters used for window size and to the way in which the window is moved, and for its potential to return the breakpoint locations (Pond et al., 2006).

The GARD method relies on use of the AIC score to indicate how well the data fit a given tree topology, rate parameters and set of branch lengths. The AIC score is first calculated for the NJ tree and ML estimates of the rate parameters and branch lengths inferred from the entire alignment. For each possible breakpoint, the AIC is calculated for the fragment block on either side of the breakpoint, using the rate parameters and branch lengths inferred for the entire alignment. If there is at least one breakpoint in the alignment which has an AIC for the left or right fragment block that is lower than the AIC for the entire alignment, some of the sequences in the alignment are recombinant (Pond et al., 2006). Expressed in another way, if a fragment of the original sequence fits the NJ tree and associated rate parameters and branch lengths better than the entire sequence, that is taken as evidence of a recombination breakpoint. The analyses were conducted using an NJ tree inferred from the codon alignment, the REV substitution model with empirical nucleotide frequencies, no site-to-site rate variation, and 2 rate classes.
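The role of the AIC in this kind of comparison can be sketched with the standard definition AIC = 2k - 2 ln L, where k is the number of estimated parameters and ln L the maximised log-likelihood. The comparison below is a simplification chosen for illustration: it treats the two fragments as one partitioned model by summing their log-likelihoods and parameter counts, which is an assumption of the sketch rather than a description of the GARD implementation, and all numerical values are hypothetical.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion; lower values indicate a better trade-off."""
    return 2.0 * n_params - 2.0 * log_likelihood


def breakpoint_supported(whole, left, right):
    """Compare one model for the whole alignment with a two-fragment model.

    Each argument is a (log_likelihood, n_params) tuple.
    """
    aic_whole = aic(*whole)
    aic_split = aic(left[0] + right[0], left[1] + right[1])
    return aic_split < aic_whole


# Hypothetical fits: the split model gains 30 log-likelihood units for 10 extra parameters.
strong = breakpoint_supported((-1000.0, 10), (-480.0, 10), (-490.0, 10))
# Here the gain is only 3 units, so the extra parameters are not justified.
weak = breakpoint_supported((-1000.0, 10), (-498.0, 10), (-499.0, 10))
```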

The HYPHY package also implements the Single Breakpoint (SBP) recombination method. This procedure concludes there is evidence of recombination in the alignment if at least one breakpoint can be found. Due to computational limitations and my preference for identifying the possible locations of breakpoints, only the GARD method was run for the 62 protein-coding genes of the reduced eubacterial core. However, SBP analyses were run for eubacterial core protein-coding genes where cyanobacteria were implicated in possible LGT events, in order to check for additional evidence of recombination. The SBP analyses were conducted using the REV substitution model, no site-to-site rate variation and 2 rate classes. Finally, the alignments for the eubacterial core protein-coding genes that exhibited an unusual cyanobacterial phyletic pattern were examined for any indels that would support or refute the HYPHY test results. All HYPHY analyses were executed via the DATAMONKEY (Pond and Frost, 2005a) web interface to a cluster running a distributed implementation of the HYPHY package.

5.2.7 Investigation of gene order support for suspected LGT

The ML trees and Neighbor-Net analyses of several reduced eubacterial core protein-coding genes recovered highly unusual groupings with relatively high clade support. Without any further information, these could be interpreted as clear instances of LGT. Since the primary concern of this study is the evolutionary history of the PS-group taxa, I performed additional analyses on those protein-coding gene sets for which LGT was suspected because PS-group taxa had also been found in an unusual position in the ML trees and Neighbor-Net phylogenies inferred in the previous sections. My aim was to check if gene order or properties of the sequences lent further support to the initial suspicion of LGT.

Firstly, I tested whether the order of the genes around the gene of interest was conserved. The PROTTAB tables for each of the main bacterial chromosomes in our study were downloaded from the Microbial Genome Project information pages at the NCBI website (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi). The PROTTAB table for each bacterial chromosome contains the start and end indices of the location of the nucleotide sequence of each protein-coding gene, and whether it occurs on the leading or lagging strand of the DNA helix. For each of the bacterial chromosomes that appeared in the set of orthologous protein-coding genes being examined, the 20 genes on either side of the gene of interest were extracted and collated into a table with rows corresponding to taxa and columns to the genes in the order that they occur on the chromosome. Each of the protein-coding genes in the gene order table was cross-referenced with the MCL clusters inferred in §4.2.2. Although the clusters may contain false positives, they are a good enough approximation to infer conserved gene order. Each cluster represented in the PS-group taxa was allocated a colour if it appeared at least twice in the gene order table. Colours were re-used within a table if it would be clear from the context that they denoted a different set of putative orthologues. Genes left white denote those that did not appear in an MCL cluster, only appeared once in the gene order table collated here, or only appeared in the non-PS-group cyanobacteria. Conserved gene order across groups of taxa was considered to indicate a lack of evidence for LGT.
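The construction of the gene order table can be sketched as two steps: extracting a window of neighbouring genes for each taxon, then finding the clusters that occur at least twice across the table. The sketch assumes each gene is given as a (name, start position) pair and clips the window at the ends of the list rather than wrapping around a circular chromosome; these choices, the function names and the toy identifiers are all assumptions made for illustration.

```python
from collections import Counter


def neighbourhood(genes, target, flank=20):
    """Names of the target gene and up to `flank` genes on either side.

    genes: iterable of (name, start_position) pairs for one chromosome.
    """
    names = [name for name, _ in sorted(genes, key=lambda g: g[1])]
    i = names.index(target)
    return names[max(0, i - flank): i + flank + 1]


def shared_clusters(table, cluster_of):
    """Clusters that occur at least twice in the gene order table.

    table: dict mapping taxon -> list of gene names (one table row).
    cluster_of: dict mapping gene name -> cluster identifier.
    """
    counts = Counter(
        cluster_of[g] for row in table.values() for g in row if g in cluster_of
    )
    return {c for c, n in counts.items() if n >= 2}


# Toy chromosome with 100 genes, 1,000 nucleotides apart.
toy_genes = [(f"g{i}", i * 1000) for i in range(100)]
window = neighbourhood(toy_genes, "g50")

toy_table = {"taxon1": ["a1", "b1"], "taxon2": ["a2", "c2"]}
toy_clusters = {"a1": "C1", "a2": "C1", "b1": "C2"}
coloured = shared_clusters(toy_table, toy_clusters)
```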

Secondly, I tested whether the atypical phylogeny could be due to the grouping of sequences with atypical GC% content. The unusual phylogeny was annotated with the GC% of each protein-coding gene. A significant difference between sub-groups of the PS-group taxa could account for the strange placement of these sub-groups and would offer an alternative explanation to LGT.
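The GC% of a gene can be computed as the percentage of G and C among its unambiguous nucleotides. The sketch below ignores gaps and ambiguity codes, which is an assumption about how such characters would be treated.

```python
def gc_percent(seq):
    """Percentage of G and C among the A, C, G and T characters of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no unambiguous nucleotides")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


example = gc_percent("ATGC")
```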

5.3 Results

5.3.1 Bayesian phylogeny inferred from the 16S rDNA gene

As described in §5.2.1, a phylogeny based on 16S rDNA was generated to extend previous studies to include the 70 bacteria listed in Table 4.1 and complement the main analyses on eubacterial core genes.

Prior to analysis, the matched-pairs test of symmetry (Ababneh et al., 2006) was used to evaluate whether the sequences have evolved under SRH conditions. Of the sequence pairs, 13.1% failed at the p=0.05 level. This is higher than the maximum of 10% permitted in the selection of the eubacterial core protein-coding genes, so an additional degree of caution is warranted when interpreting the phylogenetic result.
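A matched-pairs test of symmetry compares, for a pair of aligned sequences, the number of sites with state i in the first sequence and j in the second against the number with j in the first and i in the second. The sketch below computes a Bowker-style statistic and its degrees of freedom, and the fraction of pairs failing at a given level; that this is the exact form of the statistic used in the study is an assumption, the p-value calculation is omitted, and the function names are hypothetical.

```python
from itertools import combinations


def bowker_statistic(seq1, seq2, alphabet="ACGT"):
    """Return (statistic, degrees of freedom) for a pair of aligned sequences."""
    counts = {(a, b): 0 for a in alphabet for b in alphabet}
    for x, y in zip(seq1, seq2):
        if (x, y) in counts:          # skip gaps and ambiguity codes
            counts[(x, y)] += 1
    stat, df = 0.0, 0
    for a, b in combinations(alphabet, 2):
        n_ab, n_ba = counts[(a, b)], counts[(b, a)]
        if n_ab + n_ba > 0:
            stat += (n_ab - n_ba) ** 2 / (n_ab + n_ba)
            df += 1
    return stat, df


def fraction_failing(p_values, alpha=0.05):
    """Proportion of sequence pairs whose p-value falls below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


symmetric = bowker_statistic("AACC", "CCAA")    # equal A->C and C->A counts
asymmetric = bowker_statistic("AAAA", "CCCG")   # all changes in one direction
```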

The RNA molecule consists of stem regions, where nucleotides in the RNA strand are bonded to the complementary nucleotide on another part of the molecule, and loop regions, where each site in the molecule is independent (Linderstrøm-Lang, 1952; Garrett and Grisham, 2005). Most nucleotide substitution models assume that each site evolves independently. Since the di-nucleotide pairs of the stem regions evolve together, it is more appropriate to use a di-nucleotide substitution model that takes this into account, and an independent model for the loop regions (Higgs, 1998).
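The partition into paired stem sites and unpaired loop sites can be illustrated with a secondary structure written in dot-bracket notation, where matching parentheses mark base pairs and dots mark unpaired sites. The use of dot-bracket notation here is an assumption for illustration and does not reproduce the input format of any particular program.

```python
def partition_structure(dot_bracket):
    """Split alignment columns into stem pairs and loop sites.

    Returns (pairs, loops): pairs is a sorted list of (i, j) column indices
    that are base-paired, loops is a list of unpaired column indices.
    """
    stack, pairs, loops = [], [], []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at column {i}")
            pairs.append((stack.pop(), i))
        else:
            loops.append(i)
    if stack:
        raise ValueError(f"unmatched '(' at column {stack[-1]}")
    return sorted(pairs), loops


stem_pairs, loop_sites = partition_structure("((..))")
```

The paired columns would be analysed jointly under a di-nucleotide model and the loop columns under a site-independent model.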

The Bayesian inference program PHASE was chosen for the phylogenetic analysis because it permits partitioning of the data into stem and loop regions, and the use of a di-nucleotide substitution model for the stem regions and a site-independent model for the loop regions. The trees were inferred using the RNA7A substitution model (0.95 in Figure 5.1A. However, the relatively short internal edges separating the bacterial phyla and the many low BPP values around the base of the bacterial phyla (green, yellow and orange edges, Figure 5.1A) show that the precise relationship among these phyla could not be inferred from the 16S rDNA alignment.

Figure 5.1: Phylogenies inferred from the 16S rDNA gene for 70 bacteria. (A) The Bayesian tree; (B) the PS-group clade of the Bayesian tree; and (C) the Neighbor-Net phylogeny. BPP clade support values in the Bayesian tree are indicated by colour: <0.50 purple, <0.65 green, <0.80 yellow, <0.95 orange, 0.95-1.00 red and terminal edges black. Clades with bootstrap support >50 in the Neighbor-Net analysis are indicated by thick branches. Bacterial lineages and photosynthetic taxa are labelled (see Table 4.1).

The four groups of anoxygenic photosynthetic bacteria represented in my data were polyphyletic within the eubacteria and none were particularly closely related to the cyanobacteria (Figure 5.1A).

5.3.2 ML phylogenies inferred from the reduced eubacterial core

Given an aligned set of sequences and a substitution model, the maximum-likelihood estimate of a phylogeny is the tree that is most likely to have produced the observed sequences (Edwards and Cavalli-Sforza, 1963; Neyman, 1971; Kashyap and Subas, 1974; Felsenstein, 1981). A maximum-likelihood inference method was chosen for this study because it is accepted as a robust and reliable method for phylogenetic inference (Page and Holmes, 2003; Felsenstein, 2004).

Prior to phylogenetic inference, the most suitable amino-acid substitution model for each amino-acid alignment was estimated using the PROTTEST program (Abascal et al., 2005). For most of the alignments (59 of 62), the recommendation was the WAG model (Whelan and Goldman, 2001) with a discrete Γ distribution to account for substitutional rate heterogeneity and a proportion of invariant sites (I). These are summarised in Table 5.1.

Table 5.1: Clade support and PS-group positions inferred from phylogenetic analyses of the 62 protein-coding genes of the reduced eubacterial core

Clade support values (e) for Cyan, Pcc/mS and Pcc genus are given as ML tree LR-ELW / Neighbor-Net BS. Cases 1 to 3 give the PS-group position (g).

Fn (a) | Gene (b) | Model (c) | Cyan in unusual pos'ns (d) | Cyan | Pcc/mS | Pcc genus | Basal (f) | Case 1 | Case 2 | Case 3
A | ahcY | WAG+Γ+I | Cy | - / - | 100 / 97.2 | - / - | Pcc | - | B | B
A | argG | WAG+Γ+I | | 89.4 / 94.5 | 100 / 74.9 | - / 2.4 | Pcc | T | T | W
A | aroC | WAG+Γ+I | G | - / - | 95.0 / 75.5 | - / - | Pcc | B | B | B
A | hisB | WAG+Γ | | 66.6 / 97.2 | 100 / 5.4 | - / 6.2 | mS | B | B | B
A | metK | WAG+Γ+I | | 99.5 / 97.1 | 98.7 / 85.2 | - / - | Pcc | B | B | B
A | typA | WAG+Γ+I | | 100 / 100 | 99.8 / 95.6 | - / - | Pcc | B | B | W
B | accA | WAG+Γ+I | | 97.5 / 96.3 | 74.5 / 96.4 | 60.1 / - | mS | T | T | W
B | frr | WAG+Γ | | 77.0 / 96.8 | 100 / 98.4 | - / - | Pcc | T | T | W
B | ilvH | JTT+Γ | | 79.3 / 81.5 | 100 / 98.4 | - / - | mS | T | T | W
B | prsA | WAG+Γ | | 100 / 99.6 | 98.9 / 96.3 | - / 78.0 | Pcc | B | T | W
B | rpe | WAG+Γ | Pcc/mS | - / - | - / - | - / - | - | B | T | W
C | mreB | WAG+Γ+I | | 52.6 / 83.9 | 100 / 99.8 | 98.1 / 50.7 | mS | T | T | W
E | atpA | WAG+Γ+I | | 100 / 98.5 | 99.9 / 79.3 | 99.3 / 76.8 | ? | T | T | W
E | atpB | WAG+Γ+I | | 70.5 / 87.5 | 53.4 / 95.1 | 100 / 73 | ? | T | B | W
E | atpD | WAG+Γ+I | | 100 / 98.0 | 99.9 / 99.1 | 90.5 / 75.3 | ? | T | T | W
E | recA | WAG+Γ+I | | 100 / 99.8 | 99.8 / 98.5 | - / - | mS | T | T | W
F | ahpC | WAG+Γ+I | G | - / - | 76.4 / 37.7 | - / - | mS | B | B | W
F | bcp | WAG+Γ+I | Cy | - / - | 71.1 / 51.0 | - / - | mS | T | T | T
F | groEL | WAG+Γ+I | | 99.6 / 99.0 | 100 / 99.5 | 98.8 / 79.2 | mS | T | T | W
F | secY | WAG+Γ+I | | 100 / 100 | 100 / 100 | - / - | Pcc | T | T | W
F | smpB | WAG+Γ | | 83.7 / 27.6 | 99.1 / 96.6 | - / - | Pcc | T | T | W
G | lpxA | WAG+Γ+I | | 78.8 / 86.1 | 86.1 / 81.8 | - / - | mS | B | W | B
L | fabI | WAG+Γ+I | | 96.0 / 54.1 | 99.9 / 96.2 | 68.4 / - | mS | T | T | W
L | gcpE | WAG+Γ+I | | 100 / 98.8 | - / - | 93.6 / 20.4 | mS | B | T | W
N | apt | WAG+Γ | Pcc/mS | - / - | - / - | - / - | - | W | W | B
N | ndk | WAG+Γ | G | - / 37.7 | 33.3 / 93.1 | 65.3 / - | mS | T | T | T
N | pyrH | WAG+Γ+I | | 99.5 / 92.8 | 99.3 / 95.7 | 90.5 / 38.8 | mS | T | T | W
N | sufB | WAG+Γ | G | - / - | 100 / 91.7 | - / - | Pcc | B | T | W
M | clpP | WAG+Γ | | 92.8 / 63.1 | 99.3 / 90.8 | - / - | Pcc | T | W | W
M | clpX | WAG+Γ+I | | 100 / 98.7 | 100 / 98.5 | - / - | Pcc | T | T | W
M | coaD | WAG+Γ | | 99.6 / 91.5 | 99.1 / 97.1 | - / - | mS | T | T | W
M | ribE | WAG+Γ+I | | 99.9 / 91.6 | 100 / 92.4 | - / - | mS | T | B | W
M | thiC | WAG+Γ+I | | 100 / 99.2 | 99.8 / 77.6 | 98.4 / - | mS | T | T | W
M | tm K e | WAG+Γ | | 99.3 / 94.6 | 99.9 / 36.5 | - / - | Pcc | T | T | T
R | recR | WAG+Γ+I | | 77.7 / 73.5 | 99.9 / 94.8 | - / - | Pcc | B | T | W
R | ruvC | WAG+Γ | | 75.8 / 61.0 | 99.0 / 81.3 | - / - | Pcc | T | T | W
T | infC | WAG+Γ | | 99.8 / 95.1 | 99.9 / 99.3 | - / - | Pcc | T | T | T
T | rpoA | WAG+Γ | | 100 / 100 | 100 / 100 | 85.1 / - | mS | T | T | W
S | efp | WAG+Γ | | 99.7 / 92.8 | 99.7 / 93.3 | - / - | Pcc | T | T | W
S | fusA | WAG+Γ+I | | 100 / 99.9 | 100 / 99.8 | - / - | Pcc | T | T | W
S | lepA | WAG+Γ+I | | 100 / 99.4 | 100 / 100 | 57.0 / 29.2 | mS | T | T | W
S | petB | WAG+Γ | | 100 / 100 | 99.8 / 96.9 | 46.2 / - | ? | T | T | T
S | rplB | WAG+Γ+I | | 100 / 97.3 | 100 / 99.6 | 78.4 / - | mS | T | T | W
S | rplD | WAG+Γ+I | | 81.8 / 90.6 | 100 / 83.3 | - / - | mS | T | T | W
S | rplE | WAG+Γ+I | | 93.7 / 71.3 | 100 / 95.2 | - / - | Pcc | T | T | W
S | rplF | WAG+Γ | G | - / 64.8 | 99.9 / 97.8 | - / - | ? | B | W | W
S | rplK | WAG+Γ | | 98.0 / 72.8 | 99.6 / 98.6 | - / - | ? | T | B | B
S | rplM | WAG+Γ | | 97.6 / 78.0 | 68.2 / 47.7 | - / - | Pcc | T | T | W
S | rplN | WAG+Γ | | 99.9 / 94.5 | 99.8 / 95.1 | - / - | Pcc | T | T | B
S | rplP | WAG+Γ | | 75.0 / 62.8 | 99.4 / 95.1 | - / - | Pcc | B | B | B
S | rplT | WAG+Γ+I | | 100 / 93.2 | 78.8 / 82.7 | - / - | Pcc | T | T | T
S | rpsB | WAG+Γ+I | | 100 / 99.9 | 100 / 100 | 88.5 / - | ? | T | T | W
S | rpsC | WAG+Γ | | 100 / 99.6 | 100 / 99.8 | - / 59.5 | Pcc | T | T | W
S | rpsD | WAG+Γ | G | - / - | 97.2 / 84.2 | 80.5 / 5.6 | mS | T | T | W
S | rpsE | WAG+Γ | | 99.3 / 100 | 100 / 99.2 | - / - | mS | T | T | W
S | rpsG | JTT+Γ | | 95.4 / 96.8 | 98.6 / 95.4 | - / - | Pcc | T | T | W
S | rpsH | WAG+Γ | | 85.4 / 86.0 | 99.9 / 98.7 | - / - | Pcc | B | T | W
S | rpsK | JTT+Γ | | 80.0 / 96.7 | 63.0 / 98.8 | - / - | Pcc | B | B | B
S | rpsL | WAG+Γ | | 93.3 / 84.4 | 98.2 / 85.2 | - / - | ? | T | T | T
S | rpsM | WAG+Γ | | 98.4 / 85.6 | 100 / 99.7 | - / - | Pcc | T | T | W
S | tsf | WAG+Γ | | 98.5 / 86.0 | 100 / 94.4 | - / - | Pcc | T | T | W
S | tufA | WAG+Γ+I | | 99.8 / 97.7 | 99.8 / 98.1 | - / - | mS | B | W | W
Count | | | 10 | 52 / 53 | 59 / 59 | 17 / 13 | Num T | 44 | 46 | 7
Average | | | | 92.9 / 87.9 | 94.4 / 88.9 | 82.3 / 45.8 | Num W | 1 | 5 | 44
 | | | | | | | Num B | 16 | 11 | 11

(a) The genes are grouped by the corresponding KEGG gene functional category where: A - Amino acid metabolism, B - Carbohydrate metabolism, C - Cell growth and death, E - Energy metabolism, F - Folding, sorting and degradation, G - Glycan biosynthesis and metabolism, L - Lipid metabolism, N - Nucleotide metabolism, M - Metabolism of cofactors and vitamins, R - Replication and repair, T - Transcription and S - Translation.
(b) See Table 4.2 for the full names of the gene products.
(c) The substitution model used for each analysis is listed (for the α parameter value used in the discrete Γ distribution and the proportion of invariant sites I, refer to Appendix C.1).
(d) Protein-coding genes for which an LGT event was suspected are marked with the taxa which are located in an unusual position: Cy - cyanobacteria, Pcc/mS - for Prochlorococcus and/or marine Synechococcus, and G - Gloeobacter. The total number at the base of the table is only for those genes marked.
(e) Clade support values for the monophyly of the cyanobacteria, the PS-group and the Prochlorococcus genus were inferred from 1,000 replicates using LR-ELW for the ML trees and the bootstrap values for the Neighbor-Net analyses.
(f) The most basal taxa for the PS-clade were indicated, where applicable, by ‘Pcc’ (Prochlorococcus), ‘mS’ (marine Synechococcus), ‘?’ where unclear, and ‘-’ when not applicable for the data set.
(g) The position of the PS-group within the cyanobacteria for three subsamples of taxa: Case 1 - the original set of 70 bacteria, Case 2 - the original cyanobacteria with only 5 non-cyanobacterial out-groups, and Case 3 - 2 PS-group taxa, the rest of the cyanobacteria and 5 non-cyanobacterial out-groups. The position of the PS-group within the cyanobacteria is indicated as being: T - at the tip, W - within or B - at the base of the cyanobacterial radiation.

Across the 62 eubacterial core protein-coding genes used in this study, there was a high level of variability in the relative position of bacterial phyla and the position of individual taxa. No single pattern of evolutionary divergences emerged from phylogenetic analyses of the 62 eubacterial core protein-coding genes. Nevertheless, there were clear trends in the data that are indicative of the evolutionary history of the cyanobacteria being studied; these are summarised in Table 5.1.

The ML method was able to recover the main bacterial phyla, but the short internal branches at the centre of these radiations indicate uncertainty in the relationships between phyla.

The tree inferred from the secY gene, which encodes the “Preprotein translocase SecY subunit” (Figure 5.2), is typical of the broad trends evident in the trees inferred from the eubacterial core protein-coding genes. The bacterial lineages radiate from the centre of the phylogeny. Each has a relatively long branch when compared to the internal branches which connect them. The bootstrap clade support values of internal branches are indicated by colour. The centre of the phylogeny contains many clades with LR-ELW support <80, shown in purple, green and yellow.

Figure 5.2: Phylogenies inferred from the secY gene. (A) The ML tree; (B) the PS-group clade of the ML tree; and (C) the Neighbor-Net phylogeny. Clade support values are indicated by colour: <50 purple, <65 green, <80 yellow, <95 orange, 95-100 red, and terminal edges black. Bacterial phyla and photosynthetic taxa are labelled (see Table 4.1).

The tree inferred from the ribH gene, which encodes the “6,7-dimethyl-8-ribityllumazine synthase”, is a typical example of the tree structure in many of the reduced eubacterial core protein-coding genes (Figure 5.3A). Here the tree has relatively longer internal edges, with better resolution of the radiation of bacterial phyla. However, the same trends are evident as in the secY tree: the bacterial lineages radiate on relatively long branches and LR-ELW clade support is only moderately high.

Figure 5.3: Phylogenies inferred for the ribH gene. (A) The ML tree; (B) the PS-group clade of the ML tree; and (C) the Neighbor-Net phylogeny. Clade support values are indicated by colour: <50 purple, <65 green, <80 yellow, <95 orange, 95-100 red, and terminal edges black. Photosynthetic taxa are labelled (see Table 4.1).

In all trees inferred from the reduced eubacterial core, the four groups of anoxyphotobacteria were dispersed among the eubacteria and the purple bacteria did not form a monophyletic group (Figure 5.2A). This is in agreement with previous studies (e.g., Kondratieva et al., 1992; Woese, 1992; Olsen et al., 1994; Yurkov and Beatty, 1998).

The position of cyanobacteria in trees of the 62 eubacterial core protein-coding genes (Table 5.1) was mostly in agreement with the patterns of divergence inferred from the 16S rDNA gene (Figure 5.1). In most trees, the oxyphotobacterial cyanobacteria were separated from the rest of the cyanobacteria by a relatively long internal branch (e.g., Figures 5.2A and 5.3A). The monophyly of the cyanobacteria was supported by 52 of the trees with an average LR-ELW of 92.9 (Table 5.1). Prochlorococcus and the marine Synechococcus formed a monophyletic clade in the phylogenies of 60 of the 62 protein-coding genes with an average LR-ELW of 94.4 (Table 5.1). Marine Synechococcus was at the base of the PS-group in 29 of the trees and Prochlorococcus was at the base in 24, while for 9 of the trees the most basal taxon was not clear (Table 5.1). The Prochlorococcus genus was found to be monophyletic in only 17 trees with average LR-ELW support of 82.3 (Table 5.1).

However, the low support of the deeper branching clades (e.g., Figure 5.2B) indicates lack of strong consensus on the position of individual taxa within the PS-group.

Although the trees inferred from the reduced eubacterial core were mostly in agreement with the 16S rDNA tree, there were some important exceptions. The trees inferred from 10 of the 62 eubacterial core protein-coding genes did not group the cyanobacteria into a monophyletic group (Table 5.1).

In six of these cases, for the aroC, ahpC, ndk, sufB, rpsD and rplF proteins, Gloeobacter was more closely related to non-photosynthetic bacteria than to other cyanobacteria (Table 5.1). Since it is well documented in the literature that Gloeobacter may be a more divergent member of the cyanobacteria (Nakamura et al., 2003) and my primary concern was the position of the PS-group, I did not investigate this further.

In one case, for the apt gene, the Spirochaetes and the Actinobacteria were closely related to the PS-group taxa within the cyanobacteria (Figure 5.4).

In the remaining three cases, the cyanobacteria did not appear as a single group. For the ahcY protein, the PS-group and Acaryochloris were located in different positions within the eubacteria (Figure 5.5). For the rpe protein, some members of the PS-group were found in the eubacteria (Figure 5.6). Similarly, for the bcp protein, Nostoc, Anabaena, Crocosphaera and Synechocystis were found in a clade within the other eubacteria (Figure 5.7).

In previous studies, such unusual phyletic patterns have been interpreted as LGT. Since the protein-coding genes of my reduced eubacterial core were chosen to minimise the possibility of error in the resulting phylogenies (see §4.2), my results offer strong evidence that the four protein-coding genes with unusual cyanobacterial phyletic patterns indicate

210 Prochlorococcus and marine Synechococcus CYANOBACTERIA Sel-PCC6301 Syn.JA2 Sel-PCC7942 Syn.JA3 CHLOROFLEvl Tel

Nos. Ava Cau Npu Scy.PCC6803 FIRMICUTES Cwa Ó-PROTEOBACTERIA f Ter Gvi THERMOTOGAE Y-PROTEOBACTERIA^ Pma-NATLI A 1 Prochlorococcus Pma-NATL2A I a-PROTEOBACTERIA /Rsp-ATCC 17025' SPIROCHAETES Rsp -ATCC17029

Rpa-BisBI 8 R p a-C G A 009

P-PROTEOBACTERIA Prochlorococcus e-PROTEOBACTERIA CHLOROBI Y-PROTEOBACTERIA PLANCTOMYCETES FUSOBACTERIA FIRMICUTES BACTEROIDETES

Figure 5.4: Phytogenies inferred for the apt gene The (A) M L tree and the (B) Neighbor-Net phytogeny show the Spirochaetes and Actinobacteria in the cyanobacteria. Clade support values are indicated by colour (see Figure 5.2). Bacterial phyla and photosynthetic taxa are labelled (see Table 4.1). 211 THERMOTOGAE

CYANOBACTERIA . ------' 0.1 substitutions per site

CYANOBACTERIA Sel-PCC6301 Sel-PCC7942 Ter Qvi ACIDOBACTERIA Scy.PCC6803 CHLOROBI y-PROTEOBACTERIA Npu Nos Ava - ^ <~'Wa BACTEROIDETES PLANCTOMYCETES f o-PROTEOBACTERIA Syn.JA2 Syn.JA3 Cch Rsp-ATCC17025 CHLOROFLEXI Rsp-ATCC17029 Hau Rpa-BisB18 Ros Rpa-CG A009 ' ,u.

‘rochlorococcus and marine Synechococcus AQUIFICAI ACIDOBACTERIA

3-PROTEOBACTERIA

ACTINOBACTERIA

FIRMICUTES

THERMOTOGAE Ama

CYANOBACTERIA 0.1 substitutions per site

Figure 5.5: Phylogenies inferred for the ahcY gene The (A) ML tree and the (B) Neighbor-Net phylogeny show the PS-group are separated from the rest of the cyanobacteria by a very long branch with high clade support. Clade support values are indicated by colour (see Figure 5.2). Bacterial phyla and some photosynthetic taxa are labelled (see Table 4.1). 212 Figure 5.6: Phylogenies inferred for the rpe gene The (A ) ML tree and the (B ) the Neighbor-Net analysis show a subset of the Pcc on a long branch, separated from the rest of the PS-group and the rest of the cyanobacteria. Clade support values are indicated by colour (see Figure 5.2). Bacterial phyla and photosynthetic taxa are labelled (see Table 4.1). 213 FIRMICUTES y-PROTEOBACTERIA

Figure 5.7: Phylogenies inferred for the bcp gene. The (A) ML tree and the (B) Neighbor-Net analysis indicate that a group of cyanobacteria are more closely related to non-photosynthetic eubacteria than to the rest of the cyanobacteria. Clade support values are indicated by colour (see Figure 5.2). Bacterial phyla and photosynthetic taxa are labelled (see Table 4.1). Scale bars: 0.1 substitutions per site.

LGT.

To test whether the unusual phyletic patterns are instead due to poor quality data in the alignments, I checked whether the four protein-coding genes in question had an unusually low number of representative bacteria or an unusually low number of variable sites. No correlation was found (see Appendix C.1), ruling this out as an alternative explanation.
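The data-quality check described above can be illustrated with a minimal sketch that counts variable sites in an alignment. The function name, the gap characters and the toy alignment are assumptions made for illustration; they are not taken from the analyses reported here.

```python
def count_variable_sites(alignment, gap_chars="-?"):
    """Count alignment columns that contain more than one distinct non-gap state.

    `alignment` maps taxon names to aligned sequences of equal length.
    """
    if not alignment:
        return 0
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_sites = lengths.pop()
    gaps = set(gap_chars)
    variable = 0
    for col in range(n_sites):
        # Collect the states seen in this column, ignoring gap characters.
        states = {seq[col].upper() for seq in alignment.values()} - gaps
        if len(states) > 1:
            variable += 1
    return variable


# Toy alignment (hypothetical taxa and sequences).
toy = {"taxon1": "ACGT-A", "taxon2": "ACGTTA", "taxon3": "ATGTTC"}
print(len(toy), "sequences,", count_variable_sites(toy), "variable sites")
```

A gene whose alignment has very few sequences or very few variable sites relative to the other genes would be a candidate for poor-quality data.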

Genes that have been laterally transferred may reside in “genomic islands” having a GC% that is atypical for the surrounding DNA (Dufresne et al., 2008). The ahcY and rpe genes are not located in such genomic islands (data not shown), so this criterion provides no evidence for LGT here.
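One simple way to look for regions of atypical GC% is a sliding-window scan. The sketch below is illustrative only: the window size, step and threshold are arbitrary assumptions, the toy sequence is invented, and this is not the procedure used by Dufresne et al. (2008).

```python
def gc_percent(seq):
    """GC content (in percent) of the A, C, G and T bases in `seq`."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return float("nan")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def atypical_windows(genome, window, step, threshold):
    """Return (start, GC%) for windows whose GC% differs from the
    genome-wide GC% by more than `threshold` percentage points."""
    overall = gc_percent(genome)
    hits = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_percent(genome[start:start + window])
        if abs(gc - overall) > threshold:
            hits.append((start, gc))
    return hits


# Toy sequence (hypothetical): an AT-rich background with a GC-rich insert.
toy_genome = "AT" * 20 + "GC" * 5 + "AT" * 20
hits = atypical_windows(toy_genome, window=10, step=10, threshold=50.0)
print(hits)
```

A gene lying inside a flagged window would be a candidate for residing in a genomic island; a gene outside all flagged windows gives no such signal.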

The results of these analyses are consistent with the apt, ahcY, rpe and bcp genes having experienced LGT. There remain other explanations for the data, however, which are evaluated in the following sections.

Conflicting phylogenetic signals in sites of a single protein-coding gene are visualised through Neighbor-Net analyses in §5.3.3. Evidence of LGT or lack of LGT is investigated in the gene order surrounding the four protein-coding genes of interest in §5.3.7. The possibility that the unusual phylogenies are due to increased evolutionary rates caused by positive selection or by recombination is investigated in §5.3.6.

5.3.3 Neighbor-Net analyses of the 16S rDNA gene and the reduced eubacterial core

Neighbor-Net analyses conducted by the SPLITSTREE program were used to visualise the congruence or lack of congruence of the phylogenetic signal at the sites in each of the protein-coding genes of the reduced eubacterial core. Although divergences with low support are evident in a tree, a Neighbor-Net analysis has the potential to reveal which taxa are responsible for the low clade support values in the corresponding tree.

A Neighbor-Net phylogeny was generated from the 16S rDNA nucleotide alignment using empirical nucleotide base frequencies, the GTR model (α = 0.3196 and I = 0.7164) without ML distances, and 1,000 BS replicates (Figure 5.1C). Clades with BS support >50 are indicated by thick branches. This Neighbor-Net phylogeny (Figure 5.1C) recovered the main bacterial phyla, but the relationships among phyla are different from those recovered by the Bayesian tree (Figure 5.1A). The two methods largely agree, however, in that clades with lower BS support in the tree (Figure 5.1A) correspond to a more uncertain position in the network (Figure 5.1C). Considerable conflict in the position of individual taxa is evident within the cyanobacteria, in particular within the PS-group and between Thermosynechococcus, the two strains of Synechococcus from Octopus Springs, JA-2-3B’a(2-13) and JA-3-3Ab, and Gloeobacter (Figure 5.1C).

The Neighbor-Net phylogenies for the reduced eubacterial core were inferred using the same substitution model used for the ML analyses (Table C.1), but with the α and I parameters estimated by PROTTEST rather than TREEFINDER. In all cases, the values estimated by PROTTEST were higher than those estimated by TREEFINDER. This is illustrated in a scatterplot of the α values in PROTTEST vs. TREEFINDER (Figure 5.8). Since most α points lie on a line with a slope > 1 (Figure 5.8A), the discrepancy appears to be due to a systematic bias in the PROTTEST program to infer higher values. On the other hand, the more irregular variation in the values of I (the proportion of invariant sites) estimated by the two programs (Figure 5.8B) suggests a different cause. I do not have an explanation for why the estimates differ, but I deemed the differences not so great as to significantly affect the results.
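The observation that the α estimates lie on a line with slope greater than 1 can be checked with a least-squares fit. The sketch below assumes a line through the origin, which is one possible choice and not necessarily how the comparison was made here; the α values are hypothetical and invented purely for illustration.

```python
def slope_through_origin(x, y):
    """Least-squares slope b of the line y = b * x (no intercept term)."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    sxx = sum(xi * xi for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least one non-zero value")
    return sum(xi * yi for xi, yi in zip(x, y)) / sxx


# Hypothetical alpha estimates, NOT the values plotted in Figure 5.8.
alpha_treefinder = [0.5, 1.0, 1.5, 2.0]
alpha_prottest = [0.6, 1.2, 1.8, 2.4]

slope = slope_through_origin(alpha_treefinder, alpha_prottest)
print("slope > 1:", slope > 1)
```

A slope consistently above 1 across genes would be in keeping with a systematic difference between the two programs, whereas scattered points with no clear trend (as described for I) would not.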
Figure 5.8: Parameters of the discrete Γ distribution inferred for ML analyses. (A) The α shape parameter of the discrete Γ distribution and (B) the proportion of invariant sites (I) inferred by PROTTEST (y-axis) for the Neighbor-Net analyses and TREEFINDER (x-axis) for the ML analyses. The dotted line indicates the idealised position of points, if both programs had inferred the same α and I values.

The groupings recovered by the ML trees for the 62 eubacterial core protein-coding genes were mostly recovered by the Neighbor-Net analyses (Table 5.1). The monophyly of the cyanobacteria was supported by 53 of the Neighbor-Net phylogenies with an average BS support of 87.9 and the clade of Prochlorococcus and marine Synechococcus was supported by 59 of the Neighbor-Net phylogenies with BS support of 88.9 (Table 5.1). As was observed for the 16S rDNA tree (Figure 5.1A), the cyanobacteria are separated from the other eubacteria by a relatively long edge and the anoxygenic photosynthetic bacteria are polyphyletic within the eubacteria (e.g., Figure 5.2C).

The relationship between BS support values in Neighbor-Net phylogenies and those inferred for trees is not clear. While the network method permits conflicting branching patterns among taxa, lowering individual clade support values, many more networks can contain the same splits, thus potentially increasing the BS support value.

In general, the advantage of the Neighbor-Net method is that it reveals the taxa responsible for the low clade support values observed in the corresponding ML tree. For example, the Neighbor-Net phylogeny inferred for the ribH gene (Figure 5.3A) suggests that the reason for the low support values of the bacterial phyla in the ML tree is the uncertainty in the branching order of the phyla containing the α-proteobacteria (Figure 5.3C). Similarly, although the taxa of the PS-group are reasonably well supported in the ML tree (Figure 5.3B), the Neighbor-Net phylogeny (Figure 5.2C) reveals many conflicting taxonomic groupings in the site patterns of the PS-group and the rest of the cyanobacteria.

The unusual phylogenetic patterns in the ML trees inferred from the apt, ahcY, rpe and bcp proteins are also evident in the Neighbor-Net analyses. The pattern of divergences in the ML tree for the ahcY protein-coding gene (Figure 5.5A) is evident in the Neighbor-Net phylogeny (Figure 5.5B). The low clade support values in the ML tree, such as the branch with a support value of at least 50 but less than 65 leading to the Prochlorococcus, Spirochaete and Actinobacteria taxa (Figure 5.5A), are echoed in the large region of variable branching orders in the Neighbor-Net analysis (Figure 5.5B). The two unusual groupings of taxa in the ML tree for the ahcY protein are also visible in the Neighbor-Net analysis with strong support (Figure 5.5). The Neighbor-Net analyses for both the rpe gene (Figure 5.6B) and the bcp gene (Figure 5.7B) provide a clearer indication of the magnitude of the uncertainty in the entire central region of the tree, and highlight the star-like structure of the phylogenies.

5.3.4 The effects of cyanobacterial over-sampling on tree topologies

In order to evaluate the effect that an oversampling of non-cyanobacterial taxa and of sequences of Prochlorococcus and marine Synechococcus might have on the position of the PS-group within the cyanobacteria, I generated trees from subsamples of the original taxa and compared them to the trees inferred from the original set of taxa (Table 5.1).

Figure 5.9: The effect of taxon sampling on the position of the PS-group. The "majority rule (extended)" consensus tree topologies inferred from the (A) 16S rDNA and (B) secY genes when the 70 bacterial taxa have been reduced to include only 5 non-cyanobacterial outgroups and two PS-group taxa. The 16S rDNA tree was inferred from the 153 possible trees having a subsample of 2 of the 18 PS-group taxa. The secY tree was calculated from 100 randomly chosen pairs of PS-group taxa. Clade support values and photosynthetic taxa are labelled as in Figures 5.1 and 5.2.
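The figure of 153 subsampled trees follows from the number of ways of choosing 2 taxa from 18. The sketch below enumerates all pairs and draws 100 of them at random; the taxon labels are placeholders, and sampling without replacement with a fixed seed is an assumption, since the text does not state how the random pairs were drawn.

```python
import itertools
import math
import random

# Placeholder labels for the 18 PS-group taxa (not the real strain names).
ps_taxa = [f"PS{i:02d}" for i in range(1, 19)]

# Exhaustive subsampling: every unordered pair of PS-group taxa.
all_pairs = list(itertools.combinations(ps_taxa, 2))
print(len(all_pairs), math.comb(18, 2))

# Random subsampling: 100 distinct pairs (assumed: without replacement).
rng = random.Random(0)  # fixed seed so the draw is reproducible
sampled_pairs = rng.sample(all_pairs, 100)
print(len(sampled_pairs))
```

Each pair would then be combined with the remaining cyanobacteria and the 5 out-groups to form one subsampled alignment.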

The position of the PS-group in the cyanobacteria was affected in the same way for both the subsampled 16S rDNA and eubacterial core protein-coding gene alignments (Table 5.1).

For the original set of 70 bacteria, the PS-group was found at the tip (i.e., furthest from the base) of the cyanobacterial radiation in the trees inferred from the 16S rDNA gene (Figure 5.1A). The same pattern was found for 44 of the 62 eubacterial core protein-coding genes (Table 5.1), including the ML trees inferred for the secY gene (Figure 5.2A) and the ribH gene (Figure 5.3A).

When the trees were inferred from the same alignments with all but 5 non-cyanobacteria removed, the position of the PS-group did not change either for the 16S rDNA tree or for most of the trees of the eubacterial core protein-coding genes (Table 5.1). However, when consensus trees were generated from the same 16S rDNA alignment containing only two PS-group taxa, Acaryochloris was found closer to the PS-group (Figure 5.9A).

For the eubacterial core protein-coding genes, the PS-group moved nearer the base of the cyanobacterial radiation between the basal taxa of Gloeobacter and the two thermophilic strains of Synechococcus, JA-2-3B’a(2-13) and JA-3-3Ab, and the clade containing Nostoc and Anabaena (Table 5.1). This is clearly visible for the secY gene analyses in Figures 5.2A and 5.9B.

In summary, the sub-sampling tests carried out in this section demonstrate that an over-sampling of PS-group taxa does not affect the recovery of taxonomic groupings at the phylum level, but can affect the position of the PS-group or any other individual taxon within the cyanobacteria. Hence, the topologies of the ML trees described in §5.3.2 could contain some errors in the position of individual cyanobacterial taxa.

5.3.5 The effects of cyanobacterial under-sampling on the position of the PS-group

To see whether a lack of representatives from the diversity of the cyanobacterial genera was hiding the true position of the PS-group, I generated an ML tree from an alignment of 1,024 sites of the 16S rDNA gene for 1,128 cyanobacteria (including those in my original alignment) and the 5 non-cyanobacterial out-groups used above (Figure 5.10A). The tree was inferred by TREEFINDER with the GTR+Γ4+I model (α = 0.478, I = 0.083; the subscript 4 indicates there are 4 rate classes) and LR-ELW clade support was calculated from 1,000 replicates.

The main lineages in Figure 5.10A have been labelled. Clade A consists of the non-cyanobacterial out-groups. The genera with a Chl-based LH-antenna system are located in clades B (Prochlorales cyanobacterium EV-7), D (Acaryochloris), F (Prochloron) and H (Prochlorothrix and Prochlorococcus). The clade consisting mostly of Prochlorococcus, marine Synechococcus and Cyanobium is located in clade H5. The tree (lnL = -54,497) was inferred by TREEFINDER using the GTR+Γ4+I model (α = 0.478, I = 0.083). Coloured internal edges indicate LR-ELW clade support values inferred from 1,000 replicates (<50 purple, <65 green, <80 yellow, <95 orange and 95-100 red). Terminal edges are black. Photosynthetic taxa are labelled (for full taxon names, refer to Table 4.1).

For completeness, the most common genera in clades A to Q, each followed by the number of sequences from that genus in parentheses, are: A - non-cyanobacterial out-groups (5), Oscillatoria (3) and Prochlorales cyanobacterium EV-7 (1); B - Synechococcus (10) and Gloeobacter (2); C - Synechococcus (1); D - Oscillatoria (21) and Pseudanabaena (8); E - Oscillatoria (12), Microcoleus (10), Arthrospira (10), Trichodesmium (7), Phormidium (7) and Planktothrix (5); F - Microcystis (96), Prochloron (25), Synechococcus (11), Cyanothece (10), Synechocystis (7), Pleurocapsa (6) and Spirulina (5); G - Microcoleus (14), Chroococcidiopsis (10) and Symploca (6); H1 - Leptolyngbya (18) and Phormidium (7); H2 - Leptolyngbya (18) and Phormidium (5); H3 - Limnothrix (7) and Planktothrix (1); H4 - Synechococcus (3) and Microcystis (2); H5 - Synechococcus (102), Cyanobium (48), Prochlorococcus (20), Chroococcales (7), Microcystis (2) and Merismopedia (2); J - Trichocoleus (1) and Synechococcus (1); K - Calothrix (28), Fischerella (8) and Rivularia (7); L - Stigonematales (1); M - Cylindrospermum (2); N - Nostoc (163) and Anabaena (11); P - Umezakia (1) and Cylindrospermum (1); and Q - Anabaena (121), Nodularia (46), Aphanizomenon (24), Cylindrospermopsis (20), Trichormus (8) and Nostoc (7).
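The colour scheme used for LR-ELW clade support (<50 purple, <65 green, <80 yellow, <95 orange and 95-100 red) amounts to a simple binning rule. A minimal sketch follows; the function name is hypothetical, and treating each bin as closed at its lower bound is an assumption about how boundary values were handled.

```python
def support_colour(support):
    """Map a clade support value (0-100) to the colour of its internal edge."""
    if not 0 <= support <= 100:
        raise ValueError("support must lie between 0 and 100")
    if support < 50:
        return "purple"
    if support < 65:
        return "green"
    if support < 80:
        return "yellow"
    if support < 95:
        return "orange"
    return "red"


for value in (42, 50, 79.9, 94, 95, 100):
    print(value, support_colour(value))
```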

The best tree, with lnL = -54,497, is shown in Figure 5.10A. As before, violation of the assumption of evolution under SRH conditions was evaluated with the matched-pairs test of symmetry (Ababneh et al., 2006). I found that 11.7% of pair-wise comparisons yielded p < 0.05. This is lower than the 13.1% for the eubacterial 16S rDNA analysis, as would be expected for an alignment of a more closely related group, and implies that the tree does not contain serious artefacts arising from model misspecification (such errors have previously been found to affect the 16S rDNA phylogeny; see Jermiin et al. (2008) and references therein).
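For readers unfamiliar with the matched-pairs test of symmetry, the sketch below computes Bowker's statistic for a single pair of aligned nucleotide sequences and converts it to a p-value. It is a simplified illustration, not the software used in this study: sites with gaps or ambiguous characters are simply skipped, the function names are hypothetical, and the chi-squared tail probability is evaluated with a closed-form expression for integer degrees of freedom.

```python
import math
from itertools import combinations


def bowker_symmetry(seq1, seq2, alphabet="ACGT"):
    """Return (statistic, degrees of freedom) for the matched-pairs test of
    symmetry on the divergence matrix of two aligned sequences."""
    index = {state: k for k, state in enumerate(alphabet)}
    n = len(alphabet)
    counts = [[0] * n for _ in range(n)]
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in index and b in index:  # skip gaps and ambiguity codes
            counts[index[a]][index[b]] += 1
    stat, df = 0.0, 0
    for i, j in combinations(range(n), 2):
        total = counts[i][j] + counts[j][i]
        if total > 0:  # only informative off-diagonal pairs contribute
            stat += (counts[i][j] - counts[j][i]) ** 2 / total
            df += 1
    return stat, df


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variate with integer df."""
    if df <= 0:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    total = math.erfc(math.sqrt(half))
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (i - 0.5) / math.gamma(i + 0.5)
    return total


# Toy pair (hypothetical): all differences are A -> C, none are C -> A.
stat, df = bowker_symmetry("AAAACCCC", "CCCCCCCC")
p_value = chi2_sf(stat, df)
print(stat, df, p_value < 0.05)
```

Applying such a test to every pair of sequences and reporting the fraction with p < 0.05 gives a summary of the kind quoted above (11.7% and 13.1%).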

The genera with a Chl-based LH-antenna system, Prochlorococcus, Prochloron, Prochlorothrix and Acaryochloris, were found in the earlier-branching clades H, F, H and D, respectively (Figure 5.10A). In agreement with previous studies, these four genera do not share a most recent common ancestor and are located towards the tip of the clades in which they were found (Figure 5.10A), suggesting a relatively recent origin. Prochlorococcus and marine Synechococcus formed part of a larger clade that included sequences of Cyanobium (subclade H5, Figures 5.10A and B).

In addition to the prochlorophytes mentioned above, the large 16S rDNA tree included an unpublished sequence (GI 66815028) for an unclassified prochlorophyte tentatively named Prochlorales cyanobacterium EV-7, isolated from a sawgrass rhizome growing in a freshwater environment in the Everglades, Florida, USA (Gantar et al., 2005). In keeping with the distribution of the other prochlorophytes, this taxon is not located close to any of the other prochlorophytes (clade A of Figure 5.10). However, the position of this sequence on a long branch, near Gloeobacter, close to the non-cyanobacterial out-groups of Bacillus, Dehalococcoides, Geobacter, Heliobacterium and Staphylococcus (clade A, Figure 5.10A), makes its classification suspect. Although the discoverers of this strain checked there was no eukaryotic contamination, the classification of this strain cannot be further investigated because the culture has died (Miroslav Gantar and Chris Sinigalliano, personal communication).

Figure 5.10: ML tree inferred from 16S rDNA genes of 1,128 cyanobacteria. The ML tree inferred from the 16S rDNA genes of 1,128 cyanobacteria and 5 non-cyanobacterial out-groups showing (A) the deep-branching lineages within the cyanobacteria and (B) the subtree containing the Prochlorococcus sequences. Each of the main lineages has been labelled from A to Q. The clade containing Prochlorococcus has been further broken into parts H1 to H5 (see main text for a full listing). Clade support values are shown in braces and are labelled as in Figure 5.1.

5.3.6 Evidence of positive selection or recombination in the reduced eubacterial core

In this section, I investigate whether the four protein-coding genes could have unusual trees due to LBA effects from elevated evolutionary rates induced by positive selection and/or from recombination.

All 62 protein-coding genes of the reduced eubacterial core were initially scanned for evidence of positive selection and recombination using the FEL and the GARD tests, respectively (§5.2.6). Although other methods are implemented in the HYPHY package, due to computational limitations only these two were performed on the entire reduced eubacterial core.

Positive selection was considered to have been possible if at least 3 sites showed evidence of it. Recombination was considered to have been possible if at least one recombination breakpoint was found and the KH test showed topological incongruence on either side of the breakpoint significant at the p=0.05 level. Four of the 62 eubacterial core protein-coding genes tested positive for positive selection, but they were not the same four genes for which LGT was suspected (Table 5.2). The condition for recombination was satisfied by 28 of the 62 eubacterial core protein-coding genes, including those of the ahcY and the rpe proteins (Table 5.2).
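The two decision rules stated above can be written down explicitly. The sketch below is illustrative: the function and parameter names are hypothetical, the inputs stand for summaries of FEL, GARD and KH output rather than any actual HYPHY interface, and "significant at the p=0.05 level" is interpreted as p < 0.05.

```python
def positive_selection_possible(n_selected_sites, min_sites=3):
    """Positive selection is considered possible if at least `min_sites`
    sites show evidence of it."""
    return n_selected_sites >= min_sites


def recombination_possible(n_breakpoints, kh_p_values, alpha=0.05):
    """Recombination is considered possible if at least one breakpoint was
    found and the KH test is significant across at least one breakpoint."""
    return n_breakpoints >= 1 and any(p < alpha for p in kh_p_values)


# Hypothetical summaries for three genes.
print(positive_selection_possible(2))          # too few sites
print(recombination_possible(1, [0.01]))       # breakpoint and significant KH
print(recombination_possible(2, [0.20, 0.30])) # breakpoints, KH not significant
```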

Table 5.2: Evidence of positive selection and/or recombination in the reduced eubacterial core

Columns, in order: Fun cat(a); Gene name; Cyan taxa in unusual position(b); No. seq.; No. sites; No. codons; Positive selection possible (FEL)?(c); Recombination possible (GARD)?(d)

A   ahcY     Cy       42    1590   530     /
A   argG              49    1575   525     /
A   aroC     G        54    1602   534     /
A   hisB              47    1308   436     /
A   metK              51    1743   581     /
A   typA              50    2088   696     /
B   accA              47    1437   479     /
B   frr               55    609    203
B   ilvH              47    723    241
B   prsA              56    1506   502
B   rpe      Pcc/mS   55    1152   384     /
C   mreB              45    1434   478
E   atpA              50    1836   612     /
E   atpB              52    1845   615
E   atpD              50    1749   583     /
E   recA              53    1434   478
F   ahpC     G        48    708    236     /
F   bcp      Cy       46    756    252
F   groEL             57    1755   585     /
F   secY              55    1764   588
F   smpB              55    693    231
G   lpxA              49    1092   364     /
L   fabI              49    1134   378
L   gcpE(e)           51    2562   854     /
M   clpP              55    843    281     /
M   clpX              54    1680   560     /
M   coaD              53    792    264
M   ribH              51    1017   339     /
M   thiC              49    2646   882     /
M   trnR(f)           44    684    228
N   apt      Pcc/mS   48    1383   461
N   ndk      G        52    819    273
N   pyrB              52    1095   365     /
N   sufB     G        40    1965   655     /
R   recR              52    909    303
R   ruvC              51    819    273
S   efp               56    906    302
S   fusA              58    2382   794     /
S   lepA              57    2166   722     /
S   petB              42    3144   1048    /
S   rplB              55    984    328     /
S   rplB              55    1107   369
S   rplE              55    699    233     /
S   rplF     G        54    765    255     /
S   rplK              56    669    223     /
S   rplM              52    738    246
S   rplN              55    456    152
S   rplB              55    555    185
S   rplT              55    429    143
S   rpsB              52    1665   555     /
S   rpsC              55    1269   423
S   rpsD     G        55    702    234
S   rpsE              54    807    269
S   rpsG              57    507    169
S   rpsH              55    546    182
S   rpsK              55    441    147
S   rpsL              51    576    192     /
S   rpsM              54    558    186
S   tsf               54    1731   577     /
S   tufA              59    1269   423     /
T   infC              55    1047   349
T   rpoA              55    1464   488     /

(a) The genes are grouped by the corresponding KEGG gene functional category where: A - Amino acid metabolism, B - Carbohydrate metabolism, C - Cell growth and death, E - Energy metabolism, F - Folding sorting and degradation, G - Glycan biosynthesis and metabolism, L - Lipid metabolism, N - Nucleotide metabolism, M - Metabolism of cofactors and vitamins, R - Replication and repair, T - Transcription and S - Translation.
(b) Cyanobacterial taxa that are in an unusual position in the ML tree: Pcc/mS - Prochlorococcus and/or marine Synechococcus, G - Gloeobacter, and Cy - other cyanobacteria.
(c) Positive selection was considered possible if at least 3 sites showed evidence of positive selection.
(d) Recombination was considered possible if the KH test showed significant topological incongruence at the p=0.05 level.
(e) Also known as 1-hydroxy-2-methyl-2-(E)-butenyl 4-diphosphate synthase (ispG).
(f) This is a temporary gene name used only in this study.

The results of additional HYPHY tests performed for the four protein-coding genes for which LGT was suspected are shown in Table 5.3. On the basis of the FEL, SLAC and PARRIS tests, only the rpe gene shows evidence of positive selection. Both the GARD and SBP tests found evidence of recombination in the four genes, and in the case of the ahcY and the rpe proteins, the KH test indicates that there were significant topological incongruences across at least one breakpoint.

To check whether the topological differences affect the groupings for which LGT is suspected, the topology of the NJ trees on either side of the breakpoint boundary was visually inspected. The unusual topology was found on either side of the breakpoint boundary for the four genes (Table 5.3). Thus, it is possible that recombination has occurred in the ahcY and the rpe genes, but this does not explain the unusual phylogenetic groupings observed for these genes, as the unusual grouping of taxa was present in the trees on either side of the breakpoint (data not shown).

Table 5.3: Strength of evidence for positive selection and/or recombination in the ahcY, rpe, bcp and apt protein-coding genes

Fun     Gene                                   Signif         Unusual tree
cat(a)  name   FEL    SLAC   PARRIS   GARD     KH(b)    SBP   across bkpt(c)
A       ahcY   Poor   Poor   No       No       -        /     /
B       rpe    /      No     /        /        /        /     /
F       bcp    Poor   No     No       /        /        /     /
N       apt    Poor   No     /        No       -        /     /

(a) The corresponding KEGG functional group of the gene, with codes as used in Table 5.2.
(b) Indicates whether the KH test found significant topological incongruence for the NJ trees on either side of the recombination breakpoints.
(c) Indicates whether an unusual phyletic pattern was observed for the NJ trees inferred for each of the segments between the breakpoints. A positive value indicates that the phyletic pattern was different on either side of the breakpoint, in support of the hypothesis that this is a recombination breakpoint.

Lastly, the codon alignments of the four genes were examined for evidence of insertions or deletions (“indels”) that might support the unusual grouping of taxa observed in the ML trees. Support for LGT via indels was only found for the ahcY gene: an indel at approximately position 440 to 550 of the nucleotide sequences (data not shown) is shared by the PS-group taxa and the non-photosynthetic taxa (Table 5.4 and Figure 5.5). The nucleotide alignment for the rpe protein (data not shown) indicates that the Group 2 PS-taxa shown in Figure 5.11 share an indel, but this does not support the position of the Group 2 clade in the non-photosynthetic bacteria (Table 5.4) because no other proximal taxa share the indel.
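A shared indel of the kind described above can be screened for by asking which taxa are mostly gaps over a given range of alignment columns. The sketch below is a minimal illustration on an invented toy alignment; the function name, the gap-fraction threshold and the taxon labels are assumptions, and the real alignments were inspected by eye rather than with this code.

```python
def taxa_sharing_gap(alignment, start, end, min_gap_fraction=0.9):
    """Return the sorted names of taxa whose aligned sequence consists mostly
    of gap characters in the half-open column range [start, end)."""
    width = end - start
    if width <= 0:
        raise ValueError("end must be greater than start")
    shared = []
    for name, seq in alignment.items():
        region = seq[start:end]
        if region.count("-") / width >= min_gap_fraction:
            shared.append(name)
    return sorted(shared)


# Toy alignment (hypothetical): two taxa share a 3-column deletion.
toy_alignment = {
    "ps_taxon": "ACG---TT",
    "nonphoto_taxon": "ACG---TA",
    "other_cyano": "ACGGGATT",
}
print(taxa_sharing_gap(toy_alignment, 3, 6))
```

If the set of taxa sharing the gap matches an unusual clade in the ML tree, the indel supports that grouping; if only one sub-group carries it and no neighbouring taxa do, it does not.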

Table 5.4: Summary of evidence for positive selection and/or recombination in the ahcY, rpe, bcp and apt protein-coding genes

Fun     Gene   Positive     Recomb-    Indel supports          Possible reason for unusual clade
cat(a)  name   selection?   ination?   unusual clade groups?   groupings in the ML tree?
A       ahcY   No           No         /                       LGT
B       rpe    /            No         No                      LBA arising from positive selection
F       bcp    No           No         No                      Loss of historical signal
N       apt    No           No         No                      Loss of historical signal

(a) The corresponding KEGG functional group of the gene, with codes as used in Table 5.2.

The most plausible explanations for the unusual topologies of the ahcY, rpe, bcp and apt genes are summarised in Table 5.4. Although the tests implemented in the HYPHY program have been found to be reliable indicators of non-neutral selection and recombination (Pond and Frost, 2005c), they would only refute the LGT hypothesis if they were able to discriminate the ahcY, rpe, bcp and apt genes from the rest of the reduced eubacterial core. In Table 5.2, there was no correlation between the four genes and the results from the FEL and GARD tests. Caution seems warranted when interpreting the results of the recombination tests: a surprisingly high proportion, almost half, of the protein-coding genes (28 of 62) tested positive (Table 5.2).

However, when all the results are considered together a plausible explanation can be seen. For the ahcY gene, there was no reliable support for positive selection or recombination (Table 5.3), and the presence of an indel supporting the unusual grouping of the PS-group with the non-photosynthetic taxa suggests the most reasonable explanation is that the ancestor of the PS-group acquired this gene from a non-photosynthetic bacterium (Table 5.4).

The rpe gene tested positive for positive selection, but negative for recombination (Table 5.3), and there were no indels to reliably establish a relationship between taxa in the Group 2 PS-taxa and the nearby non-photosynthetic bacteria (Figures 5.11A and C). Hence, a possible explanation for the unusual phylogeny for the rpe gene is that the ancestral sequence of the Group 2 PS-taxa experienced elevated evolutionary rates, which in turn led to LBA effects, resulting in the clade being positioned closer to the non-photosynthetic bacteria than the rest of the PS-group or the other cyanobacteria (Table 5.4). On the other hand, the unusual phylogenies inferred for the bcp and apt proteins could be due to loss of phylogenetic signal in the sequence (Table 5.4), since there is no conclusive evidence to support or refute LGT, and the trees are generally star-like in structure (a topology generally associated with loss of phylogenetic signal).

5.3.7 Evidence of conserved gene order around sequences suspected of LGT

Two protein-coding genes of the reduced eubacterial core produced ML trees with such unusual groupings for the PS-group taxa that LGT events were suspected. In the tree inferred from the rpe gene, one group of PS-group taxa was located with the cyanobacteria while the remaining eight PS-group taxa formed a small clade on a long branch in the non-photosynthetic bacteria (Figures 5.6A and 5.11A). Without any additional information, the topology of the rpe tree could be interpreted as evidence that the ancestor of the eight Prochlorococcus had acquired the rpe gene from the non-photosynthetic bacteria. Similarly, the PS-group was split into three groups in the tree inferred from the apt protein-coding genes and the cyanobacteria included “intruder” sequences from the Actinobacteria and Spirochaetes (Figures 5.4A and 5.12A). This topology could be interpreted as evidence that an apt gene from the cyanobacteria had been acquired by the non-photosynthetic Spirochaetes and Actinobacteria through LGT.

As bacteria that share a common ancestor diverge, genes may relocate to the other strand of the DNA helix, move to another part of the chromosome, and be deleted or acquired (Lawrence and Hendrickson, 2005). The cyanobacteria have already diverged to the extent that conserved gene order observed in distantly-related cyanobacteria may be more indicative of LGT than of shared ancestry (Zhaxybayeva et al., 2004b). The PS-group taxa, however, are a sub-group of the cyanobacteria that do have a highly conserved gene order, probably due to common ancestry (Dufresne et al., 2008; Kettler et al., 2007).

If gene order is generally conserved across the PS-group taxa, one would expect that the protein-coding genes on either side of any given gene should be very similar and appear in a similar order. Discernible sub-groups of PS-group taxa with different genes on either side could be interpreted as evidence that the common ancestor of the sub-group has acquired the protein-coding gene through LGT from a different part of the chromosome. If these sub-groups were also correlated with the unusual groupings observed in phylogenetic trees, this would be good evidence of an LGT event.

The ML tree of the rpe gene shows the PS-group taxa split into two groups, denoted Group 1 and Group 2 (Figure 5.11A). A different set of flanking genes in these groups would be good evidence of a different evolutionary history for the rpe gene (Figure 5.11A). For the apt gene, a different set of flanking genes or a very different ordering of genes, that correlate with Groups 1, 2 or 3 of the PS-taxa, would also be evidence of a different evolutionary history (Figure 5.12A). Conversely, if the gene order flanking the rpe and apt genes was the same for all PS-group taxa, it could be regarded as lack of evidence for LGT of a subset of the PS-group taxa.
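The comparison of flanking genes between groups can be made concrete by scoring the overlap between the sets of genes found on either side of the gene of interest. The sketch below uses the Jaccard index on invented gene orders; the function names, the neighbourhood size `k`, the treatment of the chromosome as linear, and the use of the Jaccard index itself are all assumptions made for illustration and do not describe how Figures 5.11 and 5.12 were prepared.

```python
def flanking_genes(gene_order, target, k):
    """Return the set of up to `k` genes on each side of `target` in a list
    of ortholog identifiers (the chromosome is treated as linear)."""
    i = gene_order.index(target)
    return set(gene_order[max(0, i - k):i] + gene_order[i + 1:i + 1 + k])


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical gene orders around a gene labelled "rpe".
taxon1 = ["a", "b", "rpe", "c", "d"]
taxon2 = ["a", "b", "rpe", "c", "e"]
taxon3 = ["x", "y", "rpe", "z", "w"]

same_group = jaccard(flanking_genes(taxon1, "rpe", 2), flanking_genes(taxon2, "rpe", 2))
diff_group = jaccard(flanking_genes(taxon1, "rpe", 2), flanking_genes(taxon3, "rpe", 2))
print(same_group, diff_group)
```

High similarity within a group combined with low similarity between groups would be the pattern expected if a sub-group had a different evolutionary history for the gene; uniformly high similarity across all PS-group taxa would provide no such evidence.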

The gene order table shows that both Groups 1 and 2 have highly conserved gene order (Figure 5.11C), with the exception of some gene gains and losses in individual PS-group taxa and some genes appearing on either the leading or lagging strand. The gene order to the right of the rpe gene is particularly well conserved in Group 2, suggesting that this group diverged from their common ancestor more recently than Group 1. Interestingly, only eight genes from the genes flanking the Group 1 and 2 rpe genes are present in the rest of the cyanobacteria.

Figure 5.11: Protein-coding genes on either side of the rpe gene in the cyanobacteria. (A) The ML tree inferred from the rpe protein-coding genes of the reduced eubacterial core. (B) The two clades of the PS-group taxa. (C) The gene order in the 20 protein-coding genes surrounding rpe genes of the cyanobacteria. The location of the gene is indicated by '+' for the leading or '-' for the lagging strand. Orthologous sets of protein-coding genes are indicated by colour. Colours have been re-used in some cases. White squares indicate protein-coding genes in the PS-taxa which did not have any homologues in the rest of the cyanobacteria in this reduced eubacterial core set. For more information, refer to §5.3.7.

Although little similarity in gene order was expected across the cyanobacteria since the cyanobacteria are more diverse than the PS-group taxa, genome rearrangements are an important aspect in the evolution of cyanobacterial genomes. Since the gene order cannot be used to discriminate between the two sub-groups of PS-taxa observed in the ML tree for the rpe gene (Figure 5.11), the unusual split of the PS-group taxa could be due to phylogenetic artefacts rather than to LGT.

The PS-taxa are split into 3 sub-groups in the ML tree inferred from the apt gene (Figure 5.12A). Group 1 (Figure 5.12B), located close to the rest of the cyanobacteria, has a gene order that is conserved across the group (Figure 5.12D). Group 2 has a very similar set of protein-coding genes and gene order to Group 1 (Figure 5.12D). On the other hand, the genes flanking the apt gene in Group 3 (Figure 5.12C) are not very similar to those of Groups 1 or 2 (Figure 5.12D). Four strains of Prochlorococcus (CCMP1375 also known as SS120, CCMP1986 also known as MED4, MIT9312 and MIT9515) have a very similar gene order downstream of the apt gene on the negative DNA strand, but share a very limited number of genes with any other cyanobacteria upstream of the apt gene (Figure 5.12D). As observed with the rpe gene (Figure 5.11), very few of the genes flanking the apt gene in the PS-group taxa are present near the apt gene in the other cyanobacteria (Figure 5.12).

Groups 1, 2 and 3 in Figure 5.12 are separated from each other by relatively long branches, indicating a higher number of changes per site in the aligned sequences. The gene orders within each of Groups 1, 2 and 3 are similar because the taxa within each group are closely related. By a similar argument, the gene orders between Groups 1, 2 and 3 are not similar because the groups have diverged from each other to the extent that the gene order is as different between these groups as it is across the entire cyanobacteria. A sub-clade of PS-taxa located in an unusual part of the ML tree could be interpreted as evidence that the gene in question had been acquired through LGT by the common ancestor of the sub-clade. Furthermore, Group 3 has a distinctly different set of flanking genes to either Groups 1 or 2. However, I concluded that it is equally plausible that the unusual phyletic pattern is due to increased evolutionary rates and genome rearrangements rather than to LGT.

Figure 5.12: Protein-coding genes on either side of the apt genes of the cyanobacteria. (A) The ML tree inferred from the apt genes of the reduced eubacterial core (scale bar: 0.1 substitutions per site). (B) The three clades of PS-group taxa. (C) The gene order in the 20 protein-coding genes surrounding the apt genes of the cyanobacteria. The location of each gene is indicated by '+' for the leading or '-' for the lagging strand. Orthologous sets of genes in the PS-taxa are indicated by colour. Colours have been re-used in some cases. White squares indicate protein-coding genes which did not have any homologues in the cyanobacteria in this reduced eubacterial core set. For more information, refer to the text of §5.3.7.

The representatives of the Spirochaetes and the Actinobacteria that were located within Groups 2 and 3 of the PS-taxa in the ML tree (Figure 5.12A) were included in this study to test whether LGT between them and the cyanobacteria could have occurred. While the sequences from Spirochaetes and the Actinobacteria do share a very limited number of individual genes with the PS-group taxa and the other cyanobacteria, there is no conserved gene order either with each other, or with any other cyanobacteria in this study. Since these two bacterial groups occur on very long branches, it is possible that they have been grouped together with the cyanobacteria through LBA effects.

To check whether differences in GC content correlate with the unusual groupings of taxa recovered in the ML phylogenies of the rpe and apt genes, I generated boxplots of the GC% in each of the relevant groups. For the rpe gene, the variation in GC% was plotted for taxa of PS-group 1, PS-group 2, the rest of the cyanobacteria and the rest of the eubacteria (Figure 5.13A). Similarly, boxplots were generated for the apt gene for PS-groups 1 to 3, the rest of the cyanobacteria, the two non-photosynthetic bacteria that are unusually placed within the cyanobacterial clade, and the rest of the eubacteria (Figure 5.13B).
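The values underlying such boxplots can be computed as sketched below. This is an illustration under stated assumptions rather than the code used to produce Figure 5.13: the function names and the toy sequences are hypothetical, gaps and ambiguity codes are simply ignored, and quartiles are taken with the "inclusive" method of Python's `statistics.quantiles`, which may differ slightly from the convention of a given plotting package.

```python
from statistics import quantiles


def gc_percent(seq):
    """GC content of a nucleotide sequence as a percentage, ignoring gaps and ambiguity codes."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no unambiguous nucleotides")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile and maximum of `values`."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


# Hypothetical sequences grouped by clade; real input would be the aligned genes.
groups = {
    "PS-group 1": ["ATGGCGCC", "ATGGCACC"],
    "PS-group 2": ["ATGAAATT", "ATGAATTA"],
}
for name, seqs in groups.items():
    print(name, five_number_summary([gc_percent(s) for s in seqs]))
```

The five values returned for each group are those drawn as the whiskers, box edges and median line of a simple boxplot (without outlier handling), so they can be passed to any plotting library.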

For both the rpe and apt genes, the split of the PS-taxa into sub-groups could be due to a difference in GC%. For rpe, in Figure 5.13A, the GC% of Group 2 is different to that of the other PS-taxa, the other cyanobacteria and the rest of the eubacteria. Similarly, for apt in Figure 5.13B, the GC% of Group 2 and Group 3 are very different from that of the other groups. Since the GC content of the clades correlates with the clade groupings in the trees, the clade groupings could reflect shared GC content rather than shared ancestry. So the unusual topologies observed in the ML trees inferred from the rpe and apt genes could be due to phylogenetic artefacts related to LBA effects arising from GC content, rather than to LGT. Although there are two other genes for which LGT of cyanobacteria is suspected, that is, the ahcY and bcp genes (Figures 5.5 and 5.7), similar analyses could not be conducted to check if the gene order supports the unusual groupings because gene order is not conserved enough across the cyanobacteria to be informative.

Figure 5.13: Distribution of GC% in clades of the rpe and apt genes. Boxplots of the GC content in sub-groups of (A) the rpe genes shown in Figure 5.11; and (B) the apt genes shown in Figure 5.12. The taxon group codes indicate: 1, 2, 3 - Groups 1 to 3 of the PS-taxa; C - the other cyanobacteria; U - non-photosynthetic bacteria that unusually group with the cyanobacteria; and E - all other eubacteria.

5.4 Discussion

The main aim of this study was to use the reduced eubacterial core of 62 protein-coding genes identified in Chapter 4 to investigate the origins of the Prochlorococcus and marine Synechococcus genomes. In this chapter, I describe the results of phylogenetic analyses and additional tests conducted to ascertain whether unusual phyletic patterns recovered are instances of LGT or due to artefactual error arising from other evolutionary processes. I discuss the implications of these findings for cyanobacterial evolution.

5.4.1 Origins of the PS-group genomes in the eubacteria

Most phylogenies inferred from the core bacterial genes selected in Chapter 4 recovered the grouping of taxa into their recognised bacterial phyla (Figure 5.2A). Since these conclusions were based on data and models carefully selected to minimise the possibility of artefactual error, the phylogenies represent the evolutionary relatedness of the taxa being examined. Unfortunately, the mechanism which gave rise to this pattern of relatedness cannot be discriminated without further analyses. The observed pattern of taxonomic groupings could be due to vertical inheritance of eubacterial-core genes within a phylum (e.g., Daubin et al., 2003; Beiko et al., 2005; Kunin et al., 2005; Susko et al., 2006; Galtier and Daubin, 2008), or to a high level of LGT of eubacterial-core genes within a phylum (e.g., Andam et al., 2010).

My analyses confirm that the five photosynthetic bacterial lineages represented are not closely related to each other. They show that the closest relatives of Prochlorococcus are the marine strains of Synechococcus and that these form a monophyletic clade within the cyanobacteria (Table 5.1). There is no evidence that shared genomic ancestry with the anoxyphotobacterial groups has influenced the evolution of the PS-group genomes. Hence, despite any LGT that may have occurred between these genera and other prokaryotes, Prochlorococcus spp. and marine Synechococcus spp. appear to have evolved from a relatively recent common ancestor.

The 62 gene data sets on which I based my conclusions included sequences from the genome of the Chl d-containing Acaryochloris marina. The taxon-specific core genomes of Prochlorococcus and marine Synechococcus may be more closely related to Acaryochloris than to other cyanobacteria like Nostoc or Gloeobacter (Figure 5.10A), but there is no evidence that there has been substantial LGT of core bacterial genes between these two groups (Figure 5.2A).

This is the first analysis, using data that have been carefully selected to minimise the introduction of artefactual bias to the phylogenetic inference process, to confirm that this approach generates a tree which is congruent with that based on the 16S rDNA analysis.

5.4.2 Evolution of the Chl-based LH-antenna systems

The Chl-biosynthesis pathways are at least as old as the process of anoxygenic photosynthesis (Larkum, 2006; Raymond and Blankenship, 2008). It is thus possible that a Chl-based LH system evolved prior to that based on PBSs. However, unless more cyanobacteria with novel LH-antenna systems are discovered, the most reasonable hypothesis is that the extant Chl-based LH-antenna system is a recent innovation. Since the core genome of the PS-group evolved from an ancestor within the cyanobacteria, the ancestral LH-antenna system of this group was probably based on PBSs. The suggestion that Prochlorococcus and marine Synechococcus evolved from an ancestor that contained phycobiliproteins is consistent with the identification of degenerate phycoerythrin genes in Prochlorococcus (Ting et al., 2001). The results here complement a similar relationship inferred for the Chl b-using Prochloron sp. and the phycobilisome-using Synechocystis trididemni (Shimada et al., 2003).

5.4.3 The effect of taxon sampling on cyanobacterial phylogenies

While the position of individual cyanobacterial clades, such as the PS-group, was not consistent across phylogenies inferred from different bacterial-core genes for the 70 bacteria selected (Table 5.1), most bacterial-core genes recovered similar groupings for the cyanobacterial taxa. For example, Gloeobacter and the two strains of thermophilic Synechococcus from Octopus Springs are most commonly located at the base of the cyanobacterial radiation, the clade consisting of Nostoc and Anabaena is closely related to Crocosphaera and Synechocystis, Prochlorococcus and marine Synechococcus almost always form a clade, and Acaryochloris and Thermosynechococcus are usually located on branches that start at an intermediate position of the cyanobacterial radiation (Figure 5.9).

While model misspecification might be responsible for the instances of “minor” phylogenetic conflict observed within the cyanobacteria in previous studies, this study suggests that they may be due to phylogenetic artefacts arising from taxon sampling.

My data were collated with the primary aim of determining the origins of bacterial-core genes in the PS-group genomes. The trees inferred from the 62 eubacterial-core protein-coding genes, while suitable for exploring the relationship between bacterial phyla, were not suitable for determining the precise position of the PS-group within the cyanobacteria because: (i) they contained too many representatives from the closely related Prochlorococcus and marine Synechococcus, which could affect the resulting phylogeny through long branch attraction; (ii) there were too many non-cyanobacterial out-groups; and (iii) these out-groups were too divergent from the cyanobacteria. Maximum-likelihood trees inferred from subsamples of PS-group taxa in eubacterial-core protein-coding gene alignments suggest that oversampling of in-group sequences has the most effect on the position of individual cyanobacteria (Table 5.1).
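The subsampling step referred to above can be illustrated as follows. This is a sketch under stated assumptions, not the procedure used to generate Table 5.1: the function name, taxon labels and toy alignment are hypothetical, the taxa to retain are chosen uniformly at random, and the subsequent ML tree inference on each reduced alignment is omitted.

```python
import random


def subsample_ingroup(alignment, ingroup, keep, seed=0):
    """Return a copy of `alignment` retaining only `keep` randomly chosen in-group taxa.

    `alignment` maps taxon name to aligned sequence; taxa outside `ingroup` are always kept.
    """
    present = sorted(t for t in alignment if t in ingroup)
    if keep > len(present):
        raise ValueError("cannot keep more in-group taxa than are present")
    rng = random.Random(seed)  # fixed seed so that a subsample can be reproduced
    chosen = set(rng.sample(present, keep))
    return {t: s for t, s in alignment.items() if t not in ingroup or t in chosen}


# Hypothetical toy alignment: four PS-group taxa and two other bacteria.
alignment = {
    "Pro_A": "MKT-LV", "Pro_B": "MKTALV", "Syn_A": "MKSALV", "Syn_B": "MRSALV",
    "Nostoc": "MRSGLI", "Ecoli": "MQSGII",
}
ps_group = {"Pro_A", "Pro_B", "Syn_A", "Syn_B"}
reduced = subsample_ingroup(alignment, ps_group, keep=2, seed=1)
print(sorted(reduced))
```

Repeating the call with different seeds gives a set of reduced alignments whose trees can be compared for the position of the in-group.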

The consensus position of the PS-group in the original 16S rDNA and eubacterial core protein-coding gene alignments was within the cyanobacterial radiation (Figure 5.9A). Oversampling of non-cyanobacterial out-groups had little effect on the position of individual cyanobacteria (Table 5.1) but oversampling of PS-group cyanobacteria resulted in long-branch attraction effects that moved the PS-group to within or to the base of the cyanobacterial radiation (Table 5.1, Figure 5.9). Hence, the presence of competing but similar branching patterns for cyanobacteria in my study could be due to phylogenetic artefacts arising from oversampling of the PS-group taxa.

I conclude that without access to a broad sample of gene sequences from diverse cyanobacterial genera, it is not possible to reliably distinguish phylogenetic conflicts due to phylogenetic artefacts from genuine instances of LGT among the cyanobacteria. Previous studies have found that ~61% of analysed genes have significant conflicting signals within the cyanobacteria that could be indicative of LGT (Zhaxybayeva et al., 2006). However, in contrast to previous studies (Raymond et al., 2002, 2003b; Beiko et al., 2005; Zhaxybayeva et al., 2006), I conclude that while LGT of bacterial-core genes has occurred between cyanobacteria and other bacterial lineages, minor variations in the relative position of cyanobacteria could be due to artefacts arising from other evolutionary processes.

5.4.4 Origin of the PS-group within the cyanobacteria

Since previous studies have concluded that the 16S rDNA gene is a good marker for cyanobacterial phylogenies (Dufresne et al., 2005) and the trends in my 16S rDNA and eubacterial core protein-coding gene phylogenies were similar (Table 5.1), the ML phylogeny inferred from over 1,000 cyanobacterial 16S rDNA sequences can give a good indication of the relationship of the PS-group taxa within the cyanobacteria and reflect the evolutionary history of the PS-group genomes.

The Prochlorococcus and marine Synechococcus group appears at the tip of a clade that branches near the base of the cyanobacterial radiation, as illustrated in the ML tree of 1,128 cyanobacteria sampled from 81 of the 86 known cyanobacterial genera (Figure 5.10A). The position of Prochlorococcus and marine Synechococcus at the tips of established clades with many extant members suggests that the PS-group evolved relatively recently, in agreement with many previous studies (e.g., Dufresne et al., 2005).

Further, this tree indicates that Prochlorococcus, Prochloron, Prochlorothrix and Acaryochloris are not closely related to each other. As was found previously (Crosbie et al., 2003; Sanchez-Baracaldo et al., 2008), the PS-group is a sister clade to that of the Cyanobium, and together they form a distinct clade within the cyanobacteria (Figure 5.10B).

I conclude that Prochlorococcus and marine Synechococcus evolved from a Cyanobium-like ancestor. The evolutionary changes that have contributed to the success of this lineage may have occurred in their common ancestor. Hence, the comparison of the genomes from all three groups may reveal the genome changes required to switch to a Chl-based LH-antenna system.

5.4.5 Distinguishing LGT from other evolutionary processes for four eubacterial-core protein-coding genes

Of the 62 genes of my reduced bacterial core, 52 (or ~84%) recovered the cyanobacteria as a monophyletic group (Table 5.1, Figure 5.2). This proportion is comparable to other studies that have found ~77% of their genes support monophyly of the cyanobacteria (Zhaxybayeva et al., 2006).
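To make the criterion explicit, a check for monophyly on a rooted tree can be sketched as below. This is a simplified illustration, not the method used to compile Table 5.1: trees are represented as nested tuples with taxon names as leaves, the function names and toy trees are hypothetical, and for an unrooted tree the corresponding test would be for a bipartition rather than a clade.

```python
def leaf_set(tree):
    """Leaves beneath a node; a leaf is a string, an internal node is a tuple of subtrees."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def is_monophyletic(tree, taxa):
    """True if some clade of the rooted `tree` contains exactly the taxa in `taxa`."""
    target = frozenset(taxa)
    if leaf_set(tree) == target:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, target) for child in tree)


# Hypothetical gene trees with three cyanobacteria and two other bacteria.
cyanobacteria = {"Pro", "Syn", "Nostoc"}
tree_a = ((("Pro", "Syn"), "Nostoc"), ("Ecoli", "Bsub"))
tree_b = (("Pro", "Ecoli"), (("Syn", "Nostoc"), "Bsub"))

print(is_monophyletic(tree_a, cyanobacteria))
print(is_monophyletic(tree_b, cyanobacteria))

# Proportion of gene trees recovering monophyly, using the counts given in the text.
print(round(100 * 52 / 62))
```

Applied to one tree per gene, the fraction of trees for which the check succeeds gives the proportion reported above.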

In the remaining 10 cases, the phylogenies exhibited such unusual phyletic patterns that I suspected that these sequences had experienced LGT (Table 5.1). Two of the 10 cases implicated PS-group taxa in the LGT event: (i) the phylogeny of the rpe gene (Figure 5.6) divides the PS-group into two groups, one more closely related to other eubacterial phyla than to the rest of the PS-group or other cyanobacteria; and (ii) in the phylogeny of apt, the PS-taxa are split into three groups and the Spirochaetes and the Actinobacteria are closely related to the PS-group taxa within the cyanobacterial group (Figure 5.4).

Although these unusual groupings are well supported and robust in both the trees and the network (see §5.3.2 and §5.3.3), and taxon sampling artefacts are only large enough to cause variation in the branching order of taxa within a phylum (§5.3.4), additional analyses suggest that while LGT is possible, it is not the only explanation for these unusual phylogenies.

Firstly, the four genes do not lie in the genomic islands identified in another study (Dufresne et al., 2008) (§5.3.2), thus providing no corroborating support for LGT.

Secondly, the unusual phylogenies for the rpe and apt genes could be explained by elevated evolutionary rates due, respectively, to positive selection and loss of the historical signal. Phylogenetic artefacts arising from variation in GC content could be a contributing factor for both protein-coding genes, as the unusual groupings observed correlate with a significantly different GC content in the groups (Figure 5.13).

The unusual phylogenies were originally thought to be clear instances of LGT because they were fairly robustly supported, but further investigations suggest other explanations. So the conservative conclusion is that LGT has not affected the evolution of the eubacterial core genes in the PS-group.

Another possibility is that my data selection method selected some genes that are not orthologous. For example, organisms such as Prochlorococcus spp., which are thought to have accelerated evolutionary rates across the entire genome (Dufresne et al., 2005), may also acquire a second copy of a protein-coding gene through LGT. In such a scenario, it is possible that my data selection method preferentially selected the xenologous copy of the gene because it is less divergent, overall, from the rest of the homologous sequences. A phylogeny inferred from a set of genes when not all are orthologous could result in a phylogeny that does not agree with the more commonly-observed pattern of taxonomic groupings and could be incorrectly interpreted as evidence of LGT. However, I do not believe this scenario is possible in my analyses because I did not identify multiple copies of genes from the PS-group bacteria for the genes of my reduced eubacterial core. Furthermore, if there were two very similar copies of the same gene in the same bacterium, they were intentionally included in the analysis (see §4.2.3) to alert me to the possibility of orthologue misspecification.

5.4.6 Conclusions

The main aim of this chapter was to investigate the origins of the Prochlorococcus and marine Synechococcus genomes using protein-coding genes shared across the eubacteria, chosen to be representative of the evolutionary history of the genomes, and prepared to minimise artefactual error in the resulting phylogenies. As a result of the investigations conducted, I have come to the following conclusions:

1. Although the Neighbor-Net analyses of eubacterial core protein-coding genes indicate that sites within each alignment contain variation in the phylogenetic signal, the ML phylogenetic analyses reveal a fairly consistent evolutionary history for the Prochlorococcus and the marine Synechococcus taxa which represents the true evolutionary relationships between the taxa because the data on which these conclusions are based were carefully selected to minimise the possibility of artefactual error.

2. Despite sharing phenotypic similarities with other genera of prochlorophytes, namely Prochloron and Prochlorothrix, with Acaryochloris, and with the four lineages of anoxygenic photosynthetic bacteria included in this study, no evidence of a shared ancestry between the genomes of Prochlorococcus or marine Synechococcus and these groups was uncovered. Hence, the genomes of Prochlorococcus and marine Synechococcus appear to have evolved from a common ancestor within the cyanobacteria, possibly an ancestor of Cyanobium.

3. On the basis of the 16S rDNA gene, which is largely congruent with the eubacterial core phylogenies, the PS-group taxa appear to have evolved relatively recently within an early-branching clade of the cyanobacteria.

4. Although the PS-group taxa may have acquired genetic material through LGT, I found only one unusual phylogenetic grouping involving PS-group taxa that could not be explained by other properties of the sequences. Taxon sampling, variation in GC%, and elevated evolutionary rates are suspected of affecting the topology of the resulting phylogenies. Hence, any genetic material that has been acquired by the PS-group taxa through LGT does not appear to have consisted of eubacterial-core genes.

5. Many of the incongruent phylogenies observed in previous studies of cyanobacterial evolution may be artefacts of the complex evolutionary processes that have acted on these sequences in situ, rather than representing instances of LGT. Phylogenetic analyses of cyanobacteria, and other lineages of bacteria, ought to be carried out with additional analyses that test for evolutionary processes or properties of the data that may result in misleading phylogenies.

Chapter 6

General Discussion

6.1 Overview

The primary aim of this work was to explore the relative importance of vertical and non-vertical modes of inheritance in one group of closely-related cyanobacteria, and thus, explore the processes of genome evolution associated with the diversification of photosynthetic systems in bacteria (§1.6).

Recovering the genome-level events associated with the diversification of the photosynthetic apparatus is difficult. Historical information is eroded following a long sequence of gain, loss, duplication, rearrangement and mutation (§1.3.1). Phylogenetic inference is compromised by stochastic (§3.1.1) and systematic error (§3.1.2) and potentially extensive lateral transfer of genetic material in the Eubacteria (§1.3.2).

The genomes of Prochlorococcus and marine Synechococcus were chosen for this study. As cyanobacteria, they are the extant survivors of the lineage in which oxygenic photosynthesis originally evolved (§1.2.2), and they are related to the anoxygenic photosynthetic bacterial lineages and also to the plastids of the Eukaryotes (§1.1.2). Since they may have diverged from their common ancestor as recently as 150 Mya, their genomes are still similar in gene content and order, but have diversified to use different light-harvesting antenna systems specialised for different light and nutrient conditions (§1.5.2). Their genomes retain sufficient similarity and enough differences to offer insights into the processes of genome evolution associated with use of a structurally different light-harvesting antenna system and, more generally, into mechanisms of genome evolution in bacteria (§1.5.2).

In the rest of this chapter, I discuss how the present work complements that of previous studies to reveal a clearer picture of the mechanisms of genome evolution in cyanobacteria that have facilitated the diversification of photosynthetic systems.

6.2 Use of a reduced eubacterial core

Phylogenies spanning the divergence of the Eubacteria contain at least three signals: (i) the historical signal, which records the pattern of divergence events of taxa in the PS-group; (ii) stochastic error, due to use of data that are not a truly random, fair subsample of all possible data (§3.1.1); and (iii) systematic error, due to a mismatch between the estimated and actual model of evolution (§3.1.2). The historical signal may be quite weak compared with the two sources of error due to erosion through multiple, independent substitutions at the same site and the methodological problems that follow (§1.3.2). Furthermore, there is compelling evidence that the historical signal in Prochlorococcus and marine Synechococcus has been eroded due to accelerated evolutionary rates in these lineages and a complex series of LGT events (§1.4.3). There is a large body of evidence that systematic error accumulated during the phylogenetic inference process can affect the conclusions of a study (§3.3.1). So, how successful have the methods implemented here been in reducing artefactual error?

The approach adopted in this study has been to prepare and select sequence data to prevent or minimise sources of error, to yield better representations of the evolutionary history of the PS-group in a eubacterial context (Chapter 2).

A key and novel step was the pre-emptive screening of alignments of protein-coding genes to identify and remove those which had a high variation in amino-acid usage or contained individual sequences which increased the variation in amino-acid usage over a pre-defined threshold of acceptability (§4.2.3). Other studies (e.g., Zhaxybayeva et al., 2009) have chosen to perform this screening at the nucleotide level. Since I had decided to perform phylogenetic analyses on translated protein-coding genes (§3.2.3), I reasoned that the tests for variation in amino-acid use between homologous sequences should also be conducted at the amino-acid level (§3.2.6). The matched-pairs test of symmetry of Ababneh et al. (2006) was used for this purpose because it tests for both internal and marginal symmetry (§3.2.6), where a positive test for a high variation in amino-acid frequencies suggests the sequences have evolved through evolutionary processes that are not modelled by the phylogenetic inference programs available.
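The core calculation behind a matched-pairs test of symmetry can be sketched as follows. This is an illustration only and not the implementation used in this study: it computes Bowker's overall test statistic for one pair of aligned sequences, the function names and toy sequences are hypothetical, sites with gaps or unknown residues are assumed to be skipped, and the separate tests of marginal and internal symmetry are not shown.

```python
from collections import Counter
from itertools import combinations


def divergence_matrix(seq_a, seq_b, gap_chars="-?X"):
    """Count site patterns (state in seq_a, state in seq_b), skipping gapped or unknown sites."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return Counter(
        (a, b) for a, b in zip(seq_a, seq_b)
        if a not in gap_chars and b not in gap_chars
    )


def bowker_symmetry(counts):
    """Bowker's statistic and degrees of freedom for a matched-pairs test of symmetry."""
    states = sorted({state for pair in counts for state in pair})
    stat, df = 0.0, 0
    for i, j in combinations(states, 2):
        n_ij, n_ji = counts[(i, j)], counts[(j, i)]
        if n_ij + n_ji > 0:  # only off-diagonal cell pairs with observations contribute
            stat += (n_ij - n_ji) ** 2 / (n_ij + n_ji)
            df += 1
    return stat, df


# Hypothetical pair of aligned amino-acid sequences.
stat, df = bowker_symmetry(divergence_matrix("AAAAKKKL", "AKKKKAAL"))
print(stat, df)
```

Under the null hypothesis of symmetry the statistic is approximately chi-square distributed with `df` degrees of freedom, so a p-value for each pair of sequences can be obtained from that distribution; an alignment is then judged on the proportion of pairwise comparisons that are significant.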

A possible problem with the application of this test is suggested by the finding that the unusual taxonomic groupings recovered for the rpe and apt genes correlated with unusual GC content (Figure 5.13), indicating that the matched-pairs test did not detect the GC variation hidden within the amino-acid encoding in these genes (§5.3.7). It is possible that the matched-pairs test is more effective when applied at the nucleotide, rather than at the amino-acid level, or that permitting up to 10% of pairwise comparisons to have significant variation in amino-acid frequency was too high. Further investigation of the best use and parameters for the test could improve its use in future studies.

When used in this study, a surprising 218 of the 280 sets of potentially orthologous protein-coding genes from the eubacterial core that were long enough to be used for phylogenetic analyses did not pass this test (§4.2.3). From this, I conclude that studies which have based their conclusions on an entire eubacterial core that was not screened for this potential source of systematic error should be treated with caution and possibly re-analysed. However, it is equally possible that the matched-pairs test of symmetry is too stringent and/or has a high false-positive rate. Further investigation of this test and its statistical properties would be valuable.

My reduced eubacterial core of 62 house-keeping protein-coding genes (§4.3.2) is small when compared with the entire eubacterial core, estimated in the present study as approximately 450 protein-coding genes (§4.3.2), and with the PS-group genomes, which are estimated to have between 1,700 and 3,000 protein-coding genes (Appendix B.1). All further phylogenetic analyses were based on the reduced eubacterial core in order to increase the historical signal-to-noise ratio.

6.3 Resolution of the eubacterial phyla

Phylogenetic inference based on molecular sequence data is the foundation of our understanding of bacterial evolution (Garrity, 2001), used both to infer the evolutionary relationships among vertically inherited sequences and to detect instances of LGT (e.g. Zhaxybayeva et al., 2006). The procedure is based on statistical theory that models the stochastic nature of sequence divergence over time and has been developed and refined over many decades (Felsenstein, 2004). A problem is that the phylogenetic inference process is not robust when the ratio of phylogenetic signal to artefactual error is relatively low, as error accumulated at any step of the process could be large enough to affect the topology, or the evolutionary distances between taxa within the resulting phylogeny (Steel and Penny, 2000). Despite repeated attempts through phylogenetic analysis (e.g. Pisani et al., 2007; Bapteste et al., 2008) or the assimilation of biogeochemical evidence (e.g. Cavalier-Smith et al., 2006), there is no consensus on the age of the extant eubacterial phyla or the relationships among them (Wolf et al., 2002; Iyer et al., 2004; Bern and Goldberg, 2005). Given the difficulties experienced here, what additional information was gained from analysis of eubacterial phylogenies inferred using a reduced eubacterial core of protein-coding genes?

Attempts to infer the relationships of the eubacterial phyla through analysis of metabolic core house-keeping genes can only be successful if they assume: (i) the core metabolism of bacteria has contained the same genes, performing the same function, within the same biochemical pathways, since the Eubacteria diverged from their common ancestor; (ii) the historical signal is encoded in mutations in the primary sequence; (iii) these mutations have accumulated so slowly that sites in the primary sequence have not become substitutionally saturated; and (iv) inheritance is largely vertical.

My analysis indicates that the bacterial phyla themselves can be recovered. Since the data on which these conclusions were based were selected to minimise artefactual error, I conclude that the eubacterial phylogeny is a good approximation of the historical signal in the reduced eubacterial core (§4.4.3).

That a phylogenetic analysis will almost always yield a phylogeny is both comforting and problematic, as it is rarely clear to what extent the resulting phylogeny has been affected by artefactual error. As an example, short internal branches could indicate that the sequences are related across a short period of evolutionary time, or that the historical signal is weak. Similarly, long terminal branches may represent rapid evolution from a common ancestor, or undue similarity arising from independent multiple substitutions rather than divergence from a single ancestral sequence (Lyons-Weiler and Takahashi, 1999). The stability of recovered clades can be estimated with methods based on pseudo-replicates, such as the bootstrap (Felsenstein, 1985b), but such methods cannot distinguish similarity due to common ancestry from convergence of more distantly-related sequences through independent events.
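The generation of bootstrap pseudo-replicates mentioned above amounts to resampling alignment columns with replacement, as sketched below. This is a generic illustration rather than the software used in this study; the function name and the toy alignment are hypothetical, and the tree inference on each pseudo-replicate is omitted.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all sequences must have the same aligned length")
    # The same column indices are used for every taxon so that sites stay intact.
    columns = [rng.randrange(length) for _ in range(length)]
    return {taxon: "".join(seq[c] for c in columns) for taxon, seq in alignment.items()}


# Hypothetical toy alignment.
alignment = {"taxon_1": "ACGTAC", "taxon_2": "ACGTTC", "taxon_3": "TCGTAA"}
rng = random.Random(42)
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

A tree is inferred from each pseudo-replicate, and the support for a clade is the proportion of these trees in which it appears. Because every pseudo-replicate is drawn from the same observed sites, any systematic bias in those sites is resampled along with the historical signal, which is why high support does not exclude convergence.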

The star-like radiation of the bacterial phyla is consistent with several hypotheses. First, that there is insufficient historical signal left in the data. That is, so many mutations have accumulated in the primary sequence of the data that the evidence of shared ancestry can no longer be retraced (Page and Holmes, 2003). Second, that there was extensive LGT prior to the emergence of the extant eubacterial phyla. The house-keeping genes thus have different evolutionary histories and a tree-like pattern of divergences between the eubacteria does not exist (e.g. Vetsigian et al., 2006). Third, that the eubacterial phyla genuinely emerged rapidly within short time periods. The idea of punctuated equilibrium, that new lineages emerge rapidly after periods of relative stasis (Eldredge and Gould, 1972; Gould and Eldredge, 1977), has been re-formulated recently by Koonin (2007) as the “Biological Big Bang” model, which hypothesises that new biological entities emerge from a rapid phase of evolution, followed by a slower, more tree-like pattern of evolutionary change. Fourth, that the star-like pattern is all that we can recover because all other lineages have become extinct. Thus, the extant eubacterial phyla do not share a single, most recent common ancestor. Fifth, that the lack of resolution is due to the finite age of each gene allele. The metaphor of a genome as a “rope” and genes as “strands” in the rope was originally suggested by G. Olsen (as cited in Zhaxybayeva et al., 2004b). Here, no strand spans the length of the rope yet the rope has continuity and coherence along its length. If orthologous genes suitable for phylogenetic analysis do not span the divergence from the common ancestor to the extant taxa, then phylogenies may appear as star-like radiations. Sixth, that some combination of the above applies.

This study was not designed to address such complex issues in evolutionary theory, and there is insufficient evidence to rigorously test the above hypotheses. The high level of conflict at the centre of the super-networks inferred from the reduced and rejected eubacterial core genes (§4.3.4) and the evidence from pairwise sequence comparisons suggest, however, that most of the third codon sites may have reached substitutional saturation (§4.2.7 and §4.3.6). While this is not sufficient in itself to indicate that the amino acid sequence is also saturated (for that, one would need to map the number of substitutions at conserved sites in the amino acid sequence alignment against inferred divergence), it does suggest that there may be insufficient historical signal in the primary sequences of eubacteria to discriminate the evolutionary relationships among the distantly-related eubacterial phyla.
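One simple way to look for substitutional saturation in pairwise comparisons is sketched below. It is an illustration under stated assumptions and not the analysis of §4.2.7: the function names and sequences are hypothetical, the coding sequences are assumed to be aligned in frame, and the Jukes-Cantor (JC69) correction is used purely as a familiar example of a distance that becomes undefined as the observed proportion of differences approaches 0.75, the value expected for unrelated sequences with equal base frequencies.

```python
import math


def third_codon_sites(cds):
    """Every third nucleotide of an in-frame coding sequence."""
    return cds[2::3]


def p_distance(seq_a, seq_b):
    """Proportion of compared sites that differ, skipping sites with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(p):
    """Jukes-Cantor corrected distance; infinite once p reaches the 0.75 saturation limit."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical in-frame coding sequences.
cds_1 = "ATGAAACCCGGT"
cds_2 = "ATGAAGCCAGGA"
p = p_distance(third_codon_sites(cds_1), third_codon_sites(cds_2))
print(p, jc69_distance(p))
```

Plotting the observed proportion of differences against a corrected distance for all pairs of sequences shows saturation as a plateau: the observed differences stop increasing while the corrected distance continues to grow.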

Alternatively, the information to resolve the relationships among the eubacterial phyla may not be encoded in the random mutations of primary sequences. The most commonly used models of sequence evolution assume that change at any particular site occurs randomly, independently of the state of any other site, and through the same process in all lineages (Dayhoff et al., 1972, 1978; Gonnet et al., 1992; Jones et al., 1992; Muller and Vingron, 2000). These assumptions, while largely valid for closely-related taxa that have not experienced high levels of evolutionary change, are not appropriate for sequences with even a modest amount of divergence (Crooks and Brenner, 2005), let alone for sequences spread across a bacterial phylum or across the Eubacteria. Evolutionary processes are different on short and long time scales (Gonnet et al., 1992; Benner et al., 1994; Muller and Vingron, 2000). Any evolutionary scenario in which unique changes to the primary sequences have been lost through multiple substitutions, convergent changes, or unusually high or low evolutionary rates in some lineages, can lead to artefactual error with unpredictable effects on the phylogeny (§3.1.2). There is accumulating evidence that the standard model of amino-acid sequence substitution does not apply at longer time scales (Benner et al., 1994; Crooks and Brenner, 2005) because there are effectively two evolutionary processes occurring: single base mutations between neighbouring codons on a short time scale, and selection of chemically and structurally compatible residues at longer time scales (Benner et al., 1994). Crooks and Brenner (2005) have suggested a site-specific, continuous time, zeroth order Markov chain model which takes into account the different distribution of amino acids that may occur at different sites in a protein.

I hypothesise that the lack of resolution of the eubacterial phyla observed here (§5.3.2) and previously (e.g., Pisani et al., 2007; Bapteste et al., 2008) is partially due to the use of first order Markov models of substitution that are saturated over longer time scales. Thus, composite models that use a model like that of Crooks and Brenner (2005) to provide a scaffold for the phylogeny at longer time scales, and a first order Markov model to refine the initial phylogeny with more precise estimates of the relationships among individual taxa, may be a better fit for the data.
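The contrast between the two classes of model can be written out explicitly. The following is an illustrative sketch under stated assumptions, not necessarily the parameterisation of Crooks and Brenner (2005): the symbols $Q$, $\pi^{(s)}$ and $\delta_{ij}$, and the unit replacement rate, are introduced here for illustration. In a first order model the probability of observing residue $j$ after time $t$ depends on the starting residue $i$ through a rate matrix $Q$ shared by all sites. In a zeroth order, site-specific model, each replacement at site $s$ is drawn from that site's own residue distribution $\pi^{(s)}$, regardless of the residue being replaced.

```latex
% First order: one rate matrix Q for every site
P(t) = e^{Qt}, \qquad P_{ij}(t) = \Pr(j \text{ at time } t \mid i \text{ at time } 0)

% Zeroth order, site-specific (assumed unit replacement rate):
% with probability e^{-t} no replacement has occurred; otherwise the
% residue is a draw from the site-specific distribution \pi^{(s)}
P^{(s)}_{ij}(t) = e^{-t}\,\delta_{ij} + \left(1 - e^{-t}\right)\pi^{(s)}_{j}
```

At large $t$ the second form tends to $\pi^{(s)}_{j}$, so distantly-related sequences are still expected to share the residue preferences of each site even after the identity of the ancestral residue has been lost.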

6.4 Inference of intra-phylum LGT using phylogenetic methods

There is ample evidence from both natural systems and laboratory experiments that homologous recombination and the lateral transfer of genes or gene fragments is a common and important mechanism by which genomes evolve in prokaryotes (e.g., Doolittle, 1999; Nelson et al., 1999; Garcia-Vallve et al., 2000; Koonin et al., 2001; Boucher et al., 2003; Jain et al., 2003). Independent analyses suggest that cyanobacterial genomes, including those of the Prochlorococcus and marine Synechococcus taxa, can be partitioned into a “stable core” (or vertical core) and a “variable shell” of genes, which have experienced different levels of LGT in their evolutionary histories (Coleman et al., 2006; Kettler et al., 2007; Dufresne et al., 2008; Shi and Falkowski, 2008). There is strong support for cyanophage-assisted LGT as the mechanism that explains the presence of accessory genes such as those involved in photosynthesis and, possibly, house-keeping genes (Mann, 2003; Sullivan et al., 2003; Lindell et al., 2004, 2005; Sullivan et al., 2005; Zeidner et al., 2005; Sullivan et al., 2006; Sharon et al., 2007; Weigele et al., 2007; Dammeyer et al., 2008; Sandaa, 2008). Additional support for LGT in the accessory genes of Prochlorococcus (Coleman et al., 2006) and marine Synechococcus comes from the over-representation of accessory genes in genomic islands of unusual GC content (Dufresne et al., 2008). Thus, it is generally accepted that LGT has played a significant role in the evolutionary history of accessory protein-coding genes in cyanobacteria.
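The idea of a genomic island of unusual GC content can be made concrete with a sliding-window calculation. This is a minimal sketch and not the method of Coleman et al. (2006) or Dufresne et al. (2008): the function names, the window size, the step and the deviation threshold are all hypothetical choices made for illustration.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return (seq.count("G") + seq.count("C")) / counted


def windowed_gc(seq, window, step):
    """GC content of each full-length window, as (start, gc) pairs."""
    return [(start, gc_content(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]


def atypical_windows(seq, window, step, threshold):
    """Windows whose GC content differs from the whole-sequence value
    by more than `threshold` (an arbitrary, illustrative cut-off)."""
    overall = gc_content(seq)
    return [(start, gc) for start, gc in windowed_gc(seq, window, step)
            if abs(gc - overall) > threshold]
```

In practice the window would span thousands of bases, and a flagged region would be only a candidate island: compositional deviation alone does not establish that the region was acquired laterally.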

The case for “extensive” LGT of metabolic-core protein-coding genes within the cyanobacteria has been promoted over many years by Gogarten, Zhaxybayeva and colleagues (Gogarten et al., 2002; Raymond et al., 2002; Zhaxybayeva and Gogarten, 2002; Raymond et al., 2003b; Zhaxybayeva and Gogarten, 2003a; Zhaxybayeva et al., 2004a,b, 2006, 2009). This hypothesis is controversial because, if true, it would mean that the fundamental bacterial metabolism in cyanobacteria is relatively unstable.

The earlier studies of this group (Raymond et al., 2002; Zhaxybayeva and Gogarten, 2002; Raymond et al., 2003b; Zhaxybayeva and Gogarten, 2003a; Zhaxybayeva et al., 2004a) were based predominantly on a very limited number, typically four or five, of divergent bacterial genomes which were not adequately screened for data that might confound the phylogenetic inference process. The quartets from such a sample of Eubacteria are essentially eubacterial phylogenies with four samples. My analyses indicate that phylogenies on this scale of divergence are not robust to the effects of artefactual error (§5.3.2, 5.4.3 and 5.4.5). Hence, conclusions that there has been either transfer of chlorophyll biosynthesis genes relative to the plurality consensus, or extensive LGT of eubacterial-core house-keeping protein-coding genes between the anoxyphotobacterial lineages and the cyanobacteria, are weak.

The most recent studies by Zhaxybayeva et al. (2006) and Zhaxybayeva et al. (2009) have focused solely on the PS-group genomes. They are more similar to my analyses in that sources of error have been minimised so that the resulting phylogenies are a more accurate representation of the evolutionary signal in the data (§4.4). By limiting their study to just the PS-group taxa, they may have reduced the artefactual errors associated with discrepancies between the actual and estimated models of sequence evolution (Zhaxybayeva et al., 2006, 2009). However, these analyses could still contain some residual variability in the position of individual taxa and, thus, may be over-estimating the number of gene families that have experienced LGT because they count any statistically significant deviation from the supertree topology as evidence of LGT, in effect leaving no margin for error due to residual artefactual error or loss of historical signal. Despite these problems, these studies are the strongest argument yet that there has been extensive lateral transfer of eubacterial-core house-keeping genes among members of the PS-group genomes. The present work was designed to test whether there is evidence of LGT, or whether the phylogenetic differences observed in the studies of Gogarten, Zhaxybayeva, Raymond and colleagues could be induced by methodological decisions in the phylogenetic inference process.

My analyses involved a rigorous evaluation of each step of the phylogenetic inference process (Chapters 2 and 3), preliminary analyses to identify the best methods and parameters for selecting protein-coding genes related by putative relationships of orthology (Chapters 2 and 3), the selection of a reduced set of eubacterial-core genes that minimises the possibility of phylogenies that suffer from LBA effects or other artefactual error (Chapter 4), careful phylogenetic analyses with both tree- and network-based methods (Chapter 5), and further analyses to investigate putative instances of LGT (Chapter 5). I found that methodological decisions in the phylogenetic inference process can affect the resulting topology. In particular, the choice of taxa (§5.3.4) can induce variation in topology within the cyanobacteria, and variation in GC content (§5.3.7) may have induced topological patterns that could be interpreted as evidence of inter-phylum LGT (§5.3.2 and 5.3.3). The ML analyses were able to recover groups of taxa belonging to the same bacterial phylum, but not the relationships among phyla (§5.3.2). Within a phylum, the taxa displayed a tree-like pattern of divergences, but the Neighbor-Net analyses suggested that sites of eubacterial-core house-keeping genes contain conflicting phylogenetic patterns within the cyanobacteria (§5.3.3). These findings were corroborated by the lack of support for a single phylogeny from the ML trees (§5.3.2). Furthermore, the taxon sampling tests indicated that these analyses are probably affected by sampling (stochastic) error (§5.3.4). From these observations, I concluded that the historical signal is so weak, compared with the non-historical signals in the data, that the precise position of a taxon within a phylum is uncertain (§5.4.3). Hence, a topological difference in the position of a taxon between two phylogenies is not sufficient to conclude that genes have experienced LGT (§5.4.6).

On the basis of these investigations, I concluded that phylogenetic evidence alone is rarely sufficient to deduce LGT: at least one alternative line of evidence is required (§5.4.6). Non-phylogenetic evidence that could verify potential instances of LGT was investigated for a limited number of protein-coding genes of the reduced eubacterial core. Tests to detect positive selection through the ratio of non-synonymous to synonymous substitutions in protein-coding genes are not possible across the eubacteria as most third codon sites are saturated and no longer contain historical information (§4.3.6 and Table 4.4). Other studies have found this to be the case for many sites in genes shared by the closely-related Prochlorococcus and marine Synechococcus (Hu and Blanchard, 2009), possibly because these genomes have experienced elevated evolutionary rates (Dufresne et al., 2005). Thus, these tests are only of limited practical use.

Tests of recombination were also suspect in that almost half of the aligned sequences of the reduced eubacterial core tested positive for positive selection (§5.3.6 and Table 5.2); however, I was not able to identify any correlation between these results and the phylogenetic patterns inferred in the ML or Neighbor-Net analyses (Table 5.1), properties of the alignment such as number of sequences or length (Table C.1), or which genes were anomalous (Table 5.1). Gene order is a contentious indicator of LGT because it is possible that high similarity is indicative of LGT rather than shared ancestry (Zhaxybayeva et al., 2004b). Gene order is also only of limited practical use, because it is only preserved among very closely-related organisms such as the PS-group taxa (§5.3.7, Figures 5.11 and 5.12), and not even across the cyanobacteria (Zhaxybayeva et al., 2004b). From my detailed investigation of two genes for which LGT was suspected for PS-group taxa, indels provided the clearest evidence of LGT (§5.3.7). On the basis of the observations in this study, I conclude that, of the alternate non-sequence-based lines of evidence explored, rare genomic events such as indels provide the best, positive indication of LGT.

6.5 Origins of the Prochlorococcus and marine Synechococcus genomes

Existing studies of the evolutionary origins of Prochlorococcus and marine Synechococcus had indicated that they were closely related lineages which share a common ancestor (§1.4.2). The evidence that LGT plays a key role in the evolutionary history of the eubacterial core and accessory genome suggested that a significant portion of the genomes of Prochlorococcus and marine Synechococcus were a mosaic of genes from their common ancestor, more distantly related cyanobacteria and other eubacteria (§1.5.1). However, existing studies were not able to provide strong evidence for the lateral transfer of metabolic core house-keeping genes in the Prochlorococcus and marine Synechococcus genomes because differences in their data selection and preparation methodologies could have been responsible for the observed “phylogenetic conflict” (§5.4.5). So this study set out to test whether phylogenies with a higher historical signal-to-noise ratio (§4.4) can yield more definitive evidence for the role of vertical and non-vertical mechanisms of inheritance in eubacterial-core genes in Prochlorococcus and marine Synechococcus.

Extensive preliminary analyses were conducted to find an accurate method for inferring putatively orthologous sets of eubacterial-core genes (Chapter 2) and to identify possible sources of stochastic and systematic error in the phylogenetic inference process (Chapter 3).

In order to explore the possibility that the eubacterial core genes of Prochlorococcus or marine Synechococcus might have a closer relationship to some of the anoxygenic photosynthetic bacterial groups than to the other cyanobacteria, I included representatives from the green sulfur bacteria, green non-sulfur bacteria, purple bacteria and heliobacteria in my study (Table 4.1). Furthermore, since the number of taxon-specific core genes decreases as the diversity of host genomes increases, I included 27 non-photosynthetic bacteria both to facilitate the investigation of broader evolutionary trends and to identify the most ancient, conserved vertical core genes (Table 4.2).

On the assumption that the eubacterial core set from 70 eubacteria would be ancient key genes with conserved function from different biochemical pathways, the set of genes common to most of the 70 eubacteria was identified (§4.2.1). However, a significant issue here was the choice of a metabolic core set of protein-coding genes that would have a high signal-to-noise ratio, and thus yield phylogenies that are a better representation of the relationships among these taxa. To reduce the possibility of including paralogues in my eubacterial core protein-coding gene alignments, I chose eubacterial core sets of protein-coding genes that: (i) were inferred from strong pair-wise homologies among members of the cluster; (ii) formed a tightly-knit cluster, but not so strictly as to exclude more diverged protein-coding genes; and (iii) did not contain highly divergent sequences. Since my selection of 70 bacteria from 20 phyla (Table 4.1) would result in phylogenies with deep divergences, more easily affected by phylogenetic artefacts, it was not sufficient to simply infer phylogenies from alignments of putatively orthologous protein-coding genes. To minimise the possibility of artefacts affecting my analyses, I restricted these to eubacterial core sets of protein-coding genes that: (i) did not contain ambiguously aligned sites; (ii) contained sequences of sufficient length for reliable phylogenetic analysis after the removal of ambiguously aligned sites; and (iii) appeared to be consistent with evolution under time-reversible conditions (Table 4.2). My final selection of 62 protein-coding genes included the best candidates for phylogenetic analysis from 12 of the 21 KEGG gene categories that are relevant to bacteria (Table 4.2).

Recent studies have compared 12 Prochlorococcus and 4 marine Synechococcus genomes (Kettler et al., 2007), 11 marine Synechococcus and Prochlorococcus (Dufresne et al., 2008), and 15 cyanobacteria (Mulkidjanian et al., 2006). My study extends these by: (i) addressing the potential for LGT with the anoxyphotosynthetic and non-photosynthetic bacteria; (ii) doing phylogenetic analysis of bacterial core genes rather than using the presence or absence of accessory genes; and (iii) carefully selecting data that minimise the possibility of phylogenies with artefacts. My finding that approximately 55% of the 396 eubacterial-core protein-coding gene alignments tested appear to have evolved under non-time-reversible conditions (§4.3.2) suggests that this care in data selection is well warranted. An implication of this result is that conclusions drawn from previous phylogenetic studies using these sequences may be unjustified. The partial overlap of my reduced eubacterial core (Table 4.2) with eubacterial core sets in other studies of prokaryotic evolution chosen using different criteria - for example, typA, atpAB, clpX, rpoA, infC and ¿5/ (Daubin et al., 2002); metK, recA and rpoA (Zeigler, 2003); and atpAD, rpoEFKN, rpsBDGH and secY (Bapteste et al., 2008) - suggests that my reduced eubacterial core set is suitable for an investigation of bacterial evolution.

My phylogenetic analysis of 62 protein-coding genes contained a sample of 11 Prochlorococcus and 7 marine Synechococcus and recovered these as a clade within the cyanobacteria, with no convincing evidence of a shared ancestry with other anoxyphotobacterial lineages or any other non-photosynthetic bacteria (§5.4.1). The most plausible explanation for such a consistent phylogenetic pattern for protein-coding genes ubiquitous and essential for the survival of eubacterial cells is that it represents the strong historical signal in the primary sequence data over time scales that span the eubacterial phyla (§5.4.1). Only two protein-coding genes, rpe and apt, yielded phylogenies that could have been interpreted as inter-phylum LGT (§5.3.2, §5.3.3, Figures 5.4 and 5.6). Since further investigations found that the unusual groupings correlated with shared GC content, I concluded that the unusual groupings were more representative of artefactual error than LGT events (§5.4.5).

Since there was some variation in the precise position of the PS-group within the cyanobacteria (Table 5.1) and taxon re-sampling tests had shown that artefactual error arising from the choice of taxa is large enough to account for such variation (§5.3.4), I concluded that the precise position of the PS-group within the cyanobacteria could not be determined accurately from phylogenetic analyses of a limited sample of cyanobacteria, unrepresentative of the true diversity of the extant cyanobacteria (§5.4.3). Since the phylogenies inferred from the reduced eubacterial core were largely concordant with the phylogeny inferred from the 16S rDNA molecule (§5.3.2), I further concluded that a 16S rDNA phylogeny inferred with a much broader sample of the diversity of the cyanobacteria would provide a good estimate of the position of the PS-group within the cyanobacterial radiation (§5.2.5).

This analysis suggests that in the context of the extant radiation of cyanobacteria, the PS-group have evolved relatively recently from a Cyanobium-like ancestor on an early branching clade of the cyanobacteria (§5.3.5). This study was not designed to answer questions on the absolute age of the extant cyanobacteria. However, if the extant cyanobacteria are, themselves, a recent radiation, this would also indicate a recent origin of the PS-group taxa. On the other hand, if the extant cyanobacteria are an ancient lineage, the PS-group could be old in absolute terms yet still relatively young within that ancient lineage.

The present study recovered the Prochlorococcus as a monophyletic group for approximately half the cases (Table 5.1). This could be interpreted as evidence of independent evolution of the same membrane-intrinsic LH-antenna systems, based on divinyl Chl a and divinyl Chl b, which are found uniquely in this genus (Chen et al., 2005a). However, given the lack of support for LGT in my analyses, I concluded that phylogenies that located some low-light Prochlorococcus species at the base of the marine Synechococcus could be due to residual artefactual error or erosion of the historical signal (§5.4.1 and §5.4.3).

The present analysis is not the first to come to any of the conclusions above about the evolutionary history of the Prochlorococcus and marine Synechococcus genomes. It is the first analysis, however, to confirm that most eubacterial-core protein-coding genes share the same evolutionary history as the 16S rDNA molecule, using well-verified data that provide a better representation of the evolutionary history of the PS-group taxa (§5.4.1).

The implications of this for understanding the processes of genome evolution in the cyanobacteria and of photosynthesis are discussed in the following sections.

6.6 Mechanisms of genome evolution in cyanobacteria

Addressing the relative importance of vertical inheritance and lateral gene transfer is unavoidable in any exploration of the evolutionary relationships among bacteria. The rate of lateral gene transfer within and between populations of bacteria is, without doubt, high (Ochman et al., 2000). However, there is a growing body of evidence that vertical inheritance is the predominant mechanism by which eubacterial genomes evolve over time frames that span divergences from the level of species to phylum.

Studies by Beiko et al. (2005), and Kunin et al. (2005) concluded that inheritance in the prokaryotes is largely vertical and tree-like, but that there have been “highways” or “vines” of gene sharing between some lineages, with some acting as “hubs” in a scale-free network, facilitating the rapid transfer of information across the “web”, sensu Beiko, or “network”, sensu Kunin, of life. A third key study by Ge et al. (2005) concluded that the mean genome-specific rate of LGT is about 2%, and hence, that the bacterial phylogeny is essentially tree-like, but with some links between branches “much like cobwebs hanging from tree branches”.

On the other hand, the hypothesis that house-keeping genes in cyanobacteria have experienced extensive lateral transfer has been proposed over many years, primarily by Gogarten, Zhaxybayeva, Raymond and their colleagues, as outlined in §6.4. While PS-group genomes may have been exchanging genetic material, I argue that Zhaxybayeva et al. (2009) may have over-estimated the number of gene families affected. The foundation of their conclusions is that any variation among gene phylogenies, whether that be in a clade of closely-related species, within a phylum, or across the Eubacteria, indicates a different evolutionary history of the gene. This is too liberal, as it does not allow for any residual artefactual error, even though the authors are aware of the multitude of confounding methodological problems with phylogenetic analyses, and in their later studies have been comprehensive and rigorous in their attempts to screen out data that could induce artefactual error (Zhaxybayeva et al., 2006, 2009). This study indicates that the assumption that any variation between a quartet and a reference phylogeny indicates a genuine discrepancy in the evolutionary history is unrealistically optimistic. So the extent of LGT among the cyanobacteria may have been over-stated. However, there has probably been some level of LGT among the PS-group taxa and these two results are complementary: on longer time scales that span the cyanobacteria or the eubacteria, there is little reliable evidence for the lateral transfer of house-keeping genes (§5.4.3 and §5.4.5) but on the shorter time scales of the PS-group there has been some LGT, even if it is not as extensive as claimed.

Partly as a result of analyses of the cyanobacteria, and in particular of the genomes of the PS-group taxa that have been available since 2003 (Dufresne et al., 2003; Rocap et al., 2003), the Zeitgeist regarding the relative importance of LGT in bacterial genome evolution has changed significantly in the last decade.

Lateral gene transfer may once have been understood as a stochastic process limited by the opportunity for exchange, but the process is decidedly non-random, affected by a large number of organism-specific effects such as different phage and plasmid host ranges, host population structure, differential integration into cellular networks, and cost-benefit effects related to host fitness. The accepted model now envisages high rates of transfer over short time-frames, but much lower ones over longer evolutionary scales (e.g., Beiko et al., 2005; Ge et al., 2005; Kunin et al., 2005). Thus, the high rate of genetic exchange at the population level is not visible when examined on taxa with more distant shared ancestry because the foreign genes do not establish themselves in the recipient population on a long term basis.

Even the most appropriate model for phylogenetic analysis can be compromised if it is applied to taxa spanning both recent and ancient divergences, since they will have evolved under different evolutionary constraints which cannot be accurately approximated by a single model. For example, the diversity of ecotypes in the PS-group taxa suggests that this group of organisms has diversified rapidly and comparatively recently (Rocap et al., 2003; Dufresne et al., 2005). At this time scale, the amount of variation that has accumulated would be much higher than in groups that have been subject to purifying selection over much longer periods of time. Any single model chosen for the phylogenetic analysis of a set of diverse Eubacteria may be a poor match either for the changes that occur on the scale of eubacterial evolution or for those that have occurred in a single lineage within the cyanobacteria. Effects such as these could result in artefactual error in any such phylogenetic analysis.

This study, which has relied solely on phylogenetic analyses of the Eubacteria, could thus be criticised for its choice of taxa, but I believe such criticism is not warranted. The choice to explore the evolutionary history of the PS-group genomes in a eubacterial phylogeny was deliberate and necessary for the aims of the research.

When this study began, it was not known whether the LGT of eubacterial-core protein-coding genes was largely limited to exchanges within the PS-group or perhaps the cyanobacteria, or extended even further. I hypothesised that the Prochlorococcus lineage might have achieved its spectacular ecological success because the membrane-intrinsic LH-antenna system, which could have been acquired by LGT from other cyanobacterial lineages, had been accompanied by laterally-acquired metabolic core genes. These might have promoted the rapid assimilation of the novel accessory genes into the new host genome, and increased the likelihood of a successful non-transient incorporation of the new LH-antenna system into the Prochlorococcus lineage. Since it was these sorts of inter-phylum LGT events that I was most interested in detecting, and the phylogenies inferred in this study have been generally reliable at the scale of phyla, the conclusions regarding inter-phylum LGT in this study should be robust.

On the basis of this study, I conclude that inheritance of metabolic core genes within the radiation of the extant cyanobacteria has been largely vertical. Lateral gene transfer of accessory genes that aid niche-adaptation could have occurred even between distantly-related cyanobacteria. Even though two lineages may have diverged and become specialised for different environmental conditions and could, in some sense, be considered to exhibit sufficient phenotypic differences as to be considered different species, they may still behave like a single population, exchanging genetic material either through homologous recombination of shared genes or through the lateral gene transfer of entire genes or gene fragments. So, what does this suggest about the diversification of photosynthetic systems in the cyanobacteria?

6.7 Evolution of photosynthesis in bacteria

Research into the evolutionary history of the Prochlorococcus and marine Synechococcus suggests that these two lineages evolved within the cyanobacteria (Palenik and Haselkorn, 1992; Urbach et al., 1992; Rocap et al., 2002) from a Cyanobium-like ancestral population located in an early-branching lineage (§5.4.4). The Prochlorococcus lineage diversified from a population that acquired the genes to produce a membrane-intrinsic light-harvesting antenna system based on divinyl chlorophylls a and b. This population evolved into two lineages: the Prochlorococcus, specialised for the different light conditions found at different depths of the marine euphotic zone (Moore and Chisholm, 1999), and the marine Synechococcus, specialised for different nutrient conditions, from coastal waters to the oligotrophic oceans (Dufresne et al., 2008). The acquisition of new environment-specific accessory genes in the lineages that led to Prochlorococcus, possibly via LGT, destabilised the genome. Key DNA repair genes were lost first, then more non-essential genes, and gene mutations rapidly accumulated (Dufresne et al., 2003, 2005) as the genomes evolved to a more efficient state for the environmental conditions. Similarly, the marine Synechococcus may also have experienced elevated evolutionary rates (Dufresne et al., 2005; Hu and Blanchard, 2009), as the entire genome underwent selection for optimal operation in the new niche.

My analyses confirm that Prochlorococcus and marine Synechococcus acquired the majority of their eubacterial house-keeping genes from their common ancestor (§5.3.2). The non-cyanobacterial out-groups selected for this study included representatives from five of the six known anoxyphotobacterial groups and a diverse set of non-photosynthetic taxa, to test whether any metabolic core genes of the cyanobacteria share an unusually close relationship with these other lineages, which would suggest they have been acquired by LGT. There were only four unusual phylogenies in non-Gloeobacter cyanobacterial taxa that could be interpreted as instances of LGT (Figure 5.1). Of these, there was strong evidence in the form of an indel for only one gene, ahcY, that encodes S-adenosyl-L-homocysteine hydrolase (§5.3.2, §5.3.3, Figure 5.5, §5.3.6 and Table 5.4). Thus, the majority of the genes of the reduced eubacterial core were inherited from the common ancestor of the PS-group taxa (§5.4.1) and, in agreement with existing studies (described in §1.4 and §1.5), the Prochlorococcus and marine Synechococcus spp. are sister lineages (§5.4.1).

The origins of the membrane-intrinsic chlorophyll-based LH-antenna systems found in the Prochloron, Prochlorothrix and Acaryochloris genera are not known (Larkum, 2006). However, phylogenies based on the 16S rDNA molecule (§5.2.1 and §5.2.5), which were found to track the tree-like evolutionary history of the eubacterial-core house-keeping genes well in the present study (§5.3.2), concur with the suggestion of van der Staay et al. (1998) and Chen et al. (2005b) that the different LH-antenna systems in all these lineages could have evolved through a similar process, namely, the acquisition and successful incorporation of the genes required to construct the chlorophyll-based LH-antenna system.

The relationship between Prochlorococcus and marine Synechococcus is analogous to that of Prochloron and Synechocystis trididemni, which share a common ancestor that probably used phycobiliproteins (Shimada et al., 2003). There is no reason to suspect that the reverse is not possible - that a cyanobacterium that uses a Chl-based LH-antenna system could switch to using phycobilisomes (A. Larkum, personal communication).

The successful evolution of the Prochlorococcus lineage within the cyanobacteria (§5.4.1), the paucity of inter-phylum LGT of essential eubacterial house-keeping genes in the PS-group genomes (§5.4.4), and the possibility that there has been a modest amount of LGT of house-keeping genes among the PS-group taxa (Zhaxybayeva et al., 2006, 2009; Andam et al., 2010) suggest that the successful integration of a new light-harvesting system was not dependent on the simultaneous acquisition of supporting house-keeping genes from the same donor.

Phylogenies based on the 16S rDNA molecule (e.g., §5.3.5) and inferred from the reduced eubacterial core (§5.3.2) confirm earlier conclusions that the five known anoxyphotobacterial lineages and the single oxyphotobacterial lineage did not evolve from a recent common ancestor. Hence, the photosynthesis-related biosynthesis pathways could have been acquired in the ancestors of those lineages by LGT (§5.4.4). The diversity of the phyla in which these successful photosynthetic lineages have appeared suggests that specialised biochemical capabilities in the new host are not a pre-requisite for successful integration of photosynthetic modules. The photosynthetic machinery may consist of a modular unit that is potentially functional in any eubacterium. Previous researchers have noted that cyanobacteria are quite flexible in their use of different chlorophyll types for light harvesting (Satoh et al., 2001; Chen et al., 2005b). Thus, the photosynthetic unit could itself be composed of modular units, such as that which encodes the LH-antenna system.

While geochemical evidence suggests an ancient origin of the process of oxygenic photosynthesis (§1.1.1), the age of the extant radiation of the cyanobacteria is not known. From the fossil record, the cyanobacteria are generally regarded as dating back to 2.4 to 2.9 Gya (§1.2.6). However, as there have been various bottlenecks in evolutionary history, including several “Snowball” periods (Larkum et al., 2007), the extant cyanobacteria may be relatively young, perhaps less than 0.8 Gya. Protein-coding genes shared by all cyanobacteria, such as the oxygen-evolving proteins D1, D2, psbO and so forth, should have evolved with the inception of the cyanobacteria and thus could be used for a molecular clock analysis (A. Larkum, personal communication), but since these genes belong to the accessory genome, there would be doubts as to the reliability of such analyses. However, since the metabolic core house-keeping genes appear to be inherited vertically, basing analyses on a suitable subset of the cyanobacterial metabolic core may be profitable.

The complex evolutionary histories of photosynthetic prokaryotes, plastids, and other photosynthetic organelles (Keeling, 2010) suggest that the photosynthesis-related biochemical pathways could have evolved to be a relatively host-independent genomic module shortly after the first primary endosymbiotic events occurred, and that functional modularity of the photosynthetic machinery facilitates the successful lateral transfer and retention of photosynthesis genes in distant lineages of Eubacteria and Eukarya.

A marine cyanobacterium, Cyanobacterium UCYN-A, was found that lacks the oxygen-producing photosystem II complex (Zehr et al., 2008; Tripp et al., 2010). It is still unclear whether the extant anoxyphotobacterial lineages are descendants of the cyanobacterial lineage which lost a photosystem, as suggested by Mulkidjanian et al. (2006), or whether the cyanobacteria formed from the fusion of two older lineages with PSI and PSII (Mathis, 1990; Blankenship, 1992; Xiong and Bauer, 2002). However, the discovery of this cyanobacterium is firm evidence that at least some of the anoxyphotobacterial lineages could have evolved through a similar process. Second, the shared ancestry of the Paulinella chromatophore and the PS-group is evidence that there have been at least two primary endosymbiotic events involving a photosynthetic bacterium and a eukaryote, and that the photosynthetic apparatus can be incorporated into a distant lineage in its entirety. Third, the recently discovered animal-plant photosynthetic sea slug, Elysia chlorotica (Rumpho et al., 2008), where photosynthesis genes have been transferred to the host genome, suggests that novel symbioses between Eubacteria and Eukarya continue to emerge.

6.8 Conclusions

1. A significant proportion of eubacterial-core orthologues have a high level of variation in amino-acid frequencies and this has been shown to be a source of long-branch attraction effects that can lead to systematic error in phylogenetic topologies. This is probably responsible for some of the conflicts observed by others between different eubacterial-core phylogenies. To avoid this problem, in this study over 75% of the 280 putatively orthologous sets of sequences that could have been used for phylogenetic analysis were excluded, resulting in phylogenies less compromised by artefactual error.

2. The historical signal in eubacterial sequences is strong enough to recover lineages that share a common ancestor at the phylum level and at the level of major lineages within a phylum, but not to resolve the relative position of individual taxa. The variation observed within phylogenies could be due to a combination of the loss of the historical signal and systematic error arising from the application of a single model of evolution to all evolutionary time scales represented by the diversity of taxa in the study.

3. Despite there being problems with the accuracy of the relationship among the eubacterial phyla and the individual position of cyanobacterial taxa, the ML phylogenies inferred from the reduced eubacterial core were sufficiently robust to explore the evolutionary origins of genes in the Prochlorococcus and marine Synechococcus genomes.

4. The stochastic error associated with sampling and the systematic errors arising from non-historical signals in the data are of sufficient magnitude to induce topological differences that could be interpreted as evidence of LGT. Phylogenetic evidence alone does not constitute sufficient evidence of LGT. The most convincing evidence found in this study was in the form of indels shared by sequences located in unusual positions in the phylogeny. Alternative explanations to LGT, such as positive selection, recombination, unusual nucleotide or amino-acid composition, or shared gene order, are useful and merit investigation.

5. Phylogenetic analyses of protein-coding genes that span the biochemical pathways shared by all Eubacteria indicate that the fundamental bacterial metabolism of Prochlorococcus and marine Synechococcus was inherited from a common cyanobacterial ancestor. During the period over which these two lineages have diverged from their common ancestor, metabolic-core genetic material has predominantly been passed from generation to generation through vertical inheritance. This is the first analysis to confirm that the eubacterial-core protein-coding genes share the same evolutionary history as the 16S rDNA molecule, using carefully selected data that provide a better representation of the evolutionary history of the PS-group taxa.

6. It is possible that some of the variation in the position of individual Prochlorococcus or marine Synechococcus sequences within the PS-group clade could be due to homologous recombination or other forms of lateral gene transfer of house-keeping protein-coding genes between closely related members of this lineage. The conclusions of this study do not conflict with this hypothesis. The longer time scales explored in the present analysis, which extend to the taxonomic level of phylum, indicate that the PS-group taxa form a well-defined lineage that has evolved primarily by vertical descent, over relatively recent times, when examined on time scales that span the diversification of the Eubacteria. Other studies suggest that in the early phase of divergence, two lineages can continue to exchange genetic material of the metabolic-core genome while genomic changes that result in the eventual speciation are accumulating in the accessory genome. In my opinion, however, there is a case for some level of lateral gene transfer of metabolic-core genes among the PS-group genomes but its extent may have been over-estimated.

7. The unusual membrane-intrinsic LH-antenna system shared by the Prochlorococcus appears to have been acquired from another lineage in the cyanobacteria that shares a similar system. It is unknown whether this was the lineage of Prochloron, Prochlorothrix or Acaryochloris, or an as yet undiscovered lineage with a chlorophyll-based light-harvesting antenna system, or a lineage that is now extinct. Access to more full genomes, such as those of Prochloron or Prochlorothrix, might resolve this question.

8. If the vertical and horizontal processes of inheritance observed for the closely related Prochlorococcus and marine Synechococcus can be extrapolated to the more ancient relationships among the photosynthetic lineages dispersed throughout the Eubacteria, this suggests that the photosynthesis-related genes form a mobile, self-contained genetic module that does not require specialised metabolic support from a new eubacterial host. Furthermore, the photosynthesis genetic module itself appears to consist of smaller modules, such as that necessary for the construction of a light-harvesting antenna system.

9. The evolution of the Prochlorococcus lineage could have been triggered by the chance acquisition of photosynthesis-related accessory genes in the ancestors of the PS-group taxa. While the genomes diversified to become separate lineages with differentiated ecotypes, the populations exchanged genetic material through LGT. However, when considered on a time scale that spans the evolution of the cyanobacteria or the eubacteria, inheritance has been primarily vertical. This suggests that during speciation, the level of lateral gene transfer among the diverging lineages tapers off, leaving two divergent lineages which are defined at the level of a genus by their genetic history, but at the species level, by their signature, accessory genes.

10. The modularity of the photosynthesis-related genes could have evolved in the Eubacteria and have contributed to the ongoing diversification of the photosynthesis machinery in new lineages of Eubacteria and Eukarya.
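
Conclusion 1 above attributes part of the systematic error to variation in amino-acid frequencies among sequences. As a purely illustrative sketch, and not the screening procedure used in this study, one simple way to quantify such variation is a chi-square statistic on a table of amino-acid counts per sequence. The function name `composition_chi_square` and the choice of statistic are assumptions made for illustration; the statistic ignores the phylogenetic correlation among sequences, so it should be read only as a rough indicator.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_chi_square(sequences):
    """Chi-square statistic for a sequences-by-residues table of counts.

    Hypothetical helper: returns (statistic, degrees_of_freedom). Gaps and
    ambiguous residues are ignored. A large statistic relative to the
    degrees of freedom hints at heterogeneous amino-acid composition.
    """
    counts = [Counter(c for c in seq.upper() if c in AMINO_ACIDS)
              for seq in sequences]
    row_totals = [sum(c.values()) for c in counts]
    if len(sequences) < 2 or any(total == 0 for total in row_totals):
        raise ValueError("need at least two sequences with counted residues")
    col_totals = {aa: sum(c[aa] for c in counts) for aa in AMINO_ACIDS}
    grand_total = sum(row_totals)
    # Only residues that occur somewhere contribute to the statistic.
    used = [aa for aa in AMINO_ACIDS if col_totals[aa] > 0]
    statistic = 0.0
    for c, row_total in zip(counts, row_totals):
        for aa in used:
            expected = row_total * col_totals[aa] / grand_total
            statistic += (c[aa] - expected) ** 2 / expected
    dof = (len(sequences) - 1) * (len(used) - 1)
    return statistic, dof


if __name__ == "__main__":
    # Toy sequences, invented for the example.
    print(composition_chi_square(["MKTAYIAKQR", "MKTAYIGKQR", "GGGGPPPPGG"]))
```

Data sets whose statistic is large relative to the degrees of freedom would be candidates for exclusion under a filter of this kind.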

6.9 Future directions

This study has demonstrated that reducing the sources of systematic error in phylogenetic analyses is a useful approach for improving their accuracy. Until there are sufficient bacterial genomes available that are truly representative of the diversity of this domain, it will be hard to estimate the effect that non-representative sampling has on our models of bacterial evolution. The lack of data at this level will be remedied relatively soon due to recent technological advances in high-throughput sequencing, enabling population-level exploration of individual bacterial lineages, and a merging of all data to explore the evolutionary mechanisms that drive bacterial evolution on longer time scales.

In a discipline like physics, there is a wealth of models that accurately explain and predict observable properties of the physical world. In biology, however, the field of “theoretical biology” struggles to find models that will adequately explain how even a small system works in isolation, such as how a particular protein interacts in the presence of others. Predicting the emergence of new photosynthetic lineages is not, perhaps, crucially important on the time scale of human evolution. However, lateral gene transfer is a mechanism by which pathogenic viruses and bacteria evolve resistance or increased infectivity or severity, and being able to predict whether the strain of bacteria that has emerged in a new outbreak is likely to die out as rapidly as it began, or spread to become a new “super-bug”, would be invaluable. Lateral gene transfer may be ubiquitous among prokaryotes, but it is not yet clear why some genes acquired by lateral transfer are integrated only transiently in the resulting sub-population, while others are retained over longer periods of time and become a stable part of the genome (Marri et al., 2007). On the basis of the analyses in the present work, I have concluded that the integration of foreign genes acquired through LGT is not dependent on the concomitant acquisition of core metabolism genes. However, it is possible that in some cases, core genes are required for integration of accessory genes.

I found evidence of only one potentially laterally transferred metabolic-core gene. Rather than being a statistical anomaly, it is possible that the lateral transfer of core genes may be an important factor in the successful adaptation of cyanobacteria to new environmental conditions. The evolution of a bacterial genome towards a state that is optimised following acquisition of a new beneficial gene may be analogous to the simulated annealing algorithm in computer science where both “better” and “worse” solutions are explored leading to more rapid convergence to a better solution than a search that explores only strictly “better” solutions (Kirkpatrick et al., 1983; Cerny, 1985), or to the genome “annealing” and “refinement” that is hypothesised to have occurred following the reduction in LGT between organisms and the divergence of the Bacteria, Archaea and Eukarya (Woese, 1998; Doolittle, 1999). It is possible that new genes that facilitate adaptation to novel environmental conditions interact with genes from the pathways for fundamental eubacterial metabolic functions. Experimentation with laterally transferred metabolic-core genes acquired around the same time as foreign accessory genes may enable more rapid convergence of the whole genome to a state that optimises the efficiency of the new complement of biochemical pathways, thus increasing the chance that the novel accessory genes are successfully integrated into the new host genome. Hence, while the majority of the cyanobacterial genes that are conserved across longer periods of evolution have been inherited vertically, the assimilation of foreign metabolic-core genes, that may increase the efficiency of protein-protein interactions in the new host, may have been the key to the successful diversification of the cyanobacteria. This could be tested by checking whether there is a significant correlation between the pathways to which the putatively laterally acquired metabolic-core house-keeping genes belong and the pathways to which recently acquired accessory genes belong.
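
The contrast drawn above between a search that accepts only strictly “better” solutions and simulated annealing can be sketched on a toy problem. The one-dimensional cost landscape, the cooling schedule and all function names below are hypothetical choices made for illustration; the sketch is not a model of genome evolution and makes no claim about the analyses in this study.

```python
import math
import random

# Hypothetical cost landscape: a local minimum at index 1 (cost 3) is
# separated by a barrier from the global minimum at index 8 (cost 0).
LANDSCAPE = [5, 3, 4, 6, 8, 6, 3, 1, 0, 2]


def cost(i):
    return LANDSCAPE[i]


def neighbours(i):
    """Adjacent positions that lie inside the landscape."""
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]


def greedy_descent(start):
    """Accept only strictly better neighbours; stops at the first local minimum."""
    current = start
    while True:
        better = [j for j in neighbours(current) if cost(j) < cost(current)]
        if not better:
            return current
        current = min(better, key=cost)


def simulated_annealing(start, steps=2000, t_start=5.0, t_end=0.05, seed=1):
    """Accept worse neighbours with probability exp(-delta / temperature)."""
    rng = random.Random(seed)
    current = best = start
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = rng.choice(neighbours(current))
        delta = cost(candidate) - cost(current)
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
            if cost(current) < cost(best):
                best = current
    return best


if __name__ == "__main__":
    print("greedy descent ends at index", greedy_descent(1))
    print("simulated annealing ends at index", simulated_annealing(1))
```

Starting from index 1, the greedy search cannot leave the local minimum because both neighbours are worse, whereas the annealing search is allowed to cross the barrier while the temperature is high.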

A unified evolutionary framework that can capture the stability and flexibility of prokaryotic genomes at a range of evolutionary time scales does not exist. One hypothesis that would explain the loss of historical signal over recent and longer time scales observed in the present study, the potentially extensive LGT among recently emerged lineages such as the PS-group (Zhaxybayeva et al., 2009), the potentially high evolutionary rates over recent time scales for the PS-group (Dufresne et al., 2005) and the variation in evolutionary rates observed over different time scales (Ho et al., 2005) is the Biological Big Bang (BBB) model proposed by Koonin (2007). He suggests that the major transitions in biological evolution have occurred through a bi-phasic model. The first phase is characterised by the “explosive” emergence of new biological lineages where genetic information exchange through LGT, recombination, fusion, fission and the spread of mobile elements is frequent. This is followed by a more sedate period where genetic information exchange occurs more slowly, multiple lineages of the new type emerge and evolution is much more tree-like. In the terminology of this model, the emergence and LGT of genetic material among the PS-group taxa corresponds to the first phase of the bi-phasic model of evolution. On longer time scales, the emergence of each of the eubacterial phyla could also be understood as having originated in a BBB.

The BBB theory is appealing for its ability to synthesise many of the troubling observations in bacterial evolution and may provide a framework for mathematical models to describe bacterial evolution on a grand scale using measures for genome stability. However, it is also possible that the BBB theory is a metaphor that cannot be turned into a model. Lineages that go extinct can rarely be observed at the genetic level, so what appear to be explosive radiations of bacterial lineages may merely correspond to the extant lineages that have evolved gradually and at a relatively consistent pace. Whichever the case, this hypothesis deserves further exploration.

Appendices

Appendix A

Selected relational database tables used in this study

A brief listing of the main relational database tables used in this study is provided below.

Table A.1: Selected relational database tables used in this study

Table                                         Description
MGCOMP                                        Completed microbial genomes at NCBI
MGPROG                                        Microbial genome projects in progress at NCBI
MGINFO                                        Information on all microbial genome projects at NCBI
RAWFILE                                       Source of original data file
PROJ                                          List of all chromosomes, plasmids or plastids in the genome project
ORGANISM                                      Name and taxonomic information
CHROM                                         Properties of a chromosome, plasmid, plastid or virus
CHROM2PROJ                                    Maps each chromosome to a project
PROTFAA                                       Translated DNA sequences downloaded from the NCBI RefSeq database
TAX                                           Mapping from Genbank Identifier to NCBI TaxID
CHROM2MGINFO                                  Mapping of chromosomes to NCBI TaxID
PROTTAB                                       The start, end and strand specifiers for the DNA sequence of the proteins
PROTDNA                                       The DNA sequence of a chromosome
BESTHIT                                       Best BLAST hits
BESTRECIPHIT                                  Reciprocal and best BLAST hits
SIGNIFHIT                                     Significant BLAST hits
RECIPSIGNIFHIT                                Reciprocal significant BLAST hits
BSBHH                                         Best and significant BLAST hits with hard complexity filtering
BSBHS                                         Best and significant BLAST hits with soft complexity filtering
COGL1, COGL2, COGL3, COGL21, COGL32           Three levels of the COG functional classification hierarchy
KOL1, KOL2, KOL3, KOL4, KOL12, KOL23, KOL34   KEGG ORTHOLOGY database hierarchy
KTL1, KTL2, KTL3, KTL4, KTL12, KTL23, KTL34   KEGG TAXONOMY database hierarchy
KGENES                                        KEGG GENES database of functionally annotated genes
KO2KOL                                        Mapping of KEGG GENES to level 4 of the KEGG ORTHOLOGY hierarchy
KORGCODEMAP                                   Mapping of KEGG organism codes to my organism codes
CYOGL2, CYOGL3, CYOGL32                       The gene function hierarchy from Mulkidjanian et al. (2006)
MULKALLOC                                     The mapping of proteins to level 3 of the gene functional hierarchy of Mulkidjanian et al. (2006)
CYANORAK                                      The mapping of proteins to clusters of proteins in the CYANORAK database of Dufresne et al. (2008)
CYOG2COG                                      Mapping from Level 3 of the CyOG annotation hierarchy to Level 3 of the COG hierarchy
MCLOG2KEGG                                    Mapping from the CyOG to the KEGG functional hierarchy
MCLOG2KGENES                                  Mapping from the CyOG to the KEGG functional hierarchy
CIPALLOC                                      My clusters of putatively orthologous proteins, grouped by the functional hierarchy of Mulkidjanian et al. (2006)
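
The distinction between the BESTHIT and BESTRECIPHIT tables listed above can be illustrated with a minimal sketch of how reciprocal best hits are commonly derived from two sets of BLAST results. The tuple layout `(query, subject, bit_score)`, the use of bit score as the ranking criterion and the function names are assumptions for illustration; the actual schema and selection criteria of these tables are not shown here.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, bit_score) tuples, a
    hypothetical layout chosen for this sketch.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    # Toy hit lists between two genomes, invented for the example.
    a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 50.0),
              ("a2", "b2", 120.0), ("a3", "b1", 90.0)]
    b_vs_a = [("b1", "a1", 198.0), ("b2", "a2", 118.0), ("b2", "a1", 40.0)]
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In this toy case protein a3 has b1 as its best hit, but b1 points back to a1, so a3 is not part of a reciprocal pair.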

Appendix B

Properties of chromosomes and plasmids of the reduced core

The following is a full list of the bacteria from which the reduced core was inferred, grouped by photosynthetic group or phylum. Each chromosome and plasmid has been allocated unique full and abbreviated names. The NCBI REFSEQ accession number, length in nucleotides, GC%, number of genes and number of proteins are provided for each chromosome and plasmid. The final set of taxa used in our study consisted of 73 chromosomes and 44 plasmids across 70 bacteria, of which 43 were from the Cyanobacteria. Note that the chromosomes and plasmids from Acaryochloris and Heliobacterium were added to the analysis after an initial selection of 68 bacteria and the subset of 62 protein-coding genes from the 68 bacteria had been selected for phylogenetic analysis (for a full explanation, see Section 4.2).

Bacterial/Photosynthetic group*
Full Name                                              Abbreviated  NCBI RefSeq    Length     GC    Gene   Prot
                                                       Name         Accession Num  (bp)       (%)   Num    Num

[...]
Chlamydophila caviae GPIC [Plasmid pCpGP1]             Cca p1       NC_004720.1    7,966      33.0

Chlorobi (includes green sulfur bacteria)
Chlorobium chlorochromatii CaD3                        Cch          NC_007514.1    2,572,079  44.0  2,047  2,002
Chlorobium tepidum TLS                                 Cte          NC_002932.3    2,154,946  56.0  2,344  2,252
Pelodictyon luteolum DSM 273                           Plu          NC_007512.1    2,364,842  57.0  2,137  2,083

[...]
Deinococcus radiodurans R1 [Chromosome 1]              Dra c1       NC_001263.1    2,648,638  67.0  2,687  2,629
Deinococcus radiodurans R1 [Chromosome 2]              Dra c2       NC_001264.1    412,348    66.0  369    368
Deinococcus radiodurans R1 [Plasmid MP1]               Dra pMP1     NC_000958.1    177,466    63.0  148    145
Deinococcus radiodurans R1 [Plasmid CP1]               Dra pCP1     NC_000959.1    45,704     56.0  40     39

ε-proteobacteria
Helicobacter acinonychis str. Sheeba [Chromosome]      Hac          NC_008229.1    1,553,927  38.2  1,654  1,612
Helicobacter acinonychis str. Sheeba [Plasmid pHac1]   Hac pHac1    NC_008230.1    3,661      34.0  6      6

[...]

Appendix C

Properties of proteins of the reduced core

For each data set, Table C.1 lists the KEGG functional category to which each gene belongs and the gene name. These are followed by properties used in the selection of the data sets, such as the number of sequences and the number of bacteria represented, and by properties of the alignment, such as the pre- and post-GBLOCKS lengths and the numbers of invariant and varying sites. The substitution model, the shape parameter (α) of the 4-category discrete Γ distribution used to model rate heterogeneity, and the proportion of invariant sites (I) used in the ML and Neighbor-Net analyses are also listed. Note that the α and I parameters used in the ML analyses were estimated by the TREEFINDER program, while those used in the Neighbor-Net analyses were estimated by PROTTEST and provided to SPLITSTREE before analysis. In both cases, the nucleotide frequencies were estimated from the data. Finally, proteins for which an LGT event was suspected are marked.
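To make the role of the shape parameter concrete, the following Python sketch computes the relative rates of a 4-category discrete Γ distribution for a given α. It assumes the equal-probability, category-mean discretisation with the distribution scaled to a mean rate of 1; the text does not state which discretisation TREEFINDER or PROTTEST apply, and all function names are illustrative only.

```python
import math


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma function P(a, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a          # n = 0 term of sum x^n / (a (a+1) ... (a+n))
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def gamma_quantile(a, p):
    """Quantile of a Gamma(shape=a, scale=1) distribution, found by bisection."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(a, hi) < p:   # expand until the quantile is bracketed
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(a, mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def discrete_gamma_rates(alpha, k=4):
    """Mean relative rate of each of k equal-probability categories.

    Rates follow Gamma(shape=alpha, rate=alpha), so the overall mean is 1.
    Cut points are expressed in the scaled variable x = alpha * rate.
    """
    cuts = [0.0] + [gamma_quantile(alpha, i / k) for i in range(1, k)] + [math.inf]
    rates = []
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        upper = 1.0 if math.isinf(hi) else reg_lower_gamma(alpha + 1.0, hi)
        lower = reg_lower_gamma(alpha + 1.0, lo)
        rates.append(k * (upper - lower))
    return rates


# Example: the ML shape parameter listed for ahcY in Table C.1.
print([round(r, 3) for r in discrete_gamma_rates(1.004)])
```

Smaller values of α place more sites in the slowest categories and so correspond to stronger rate heterogeneity across sites.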

Table C.1: Properties of the alignments and parameters used in the phylogenetic analyses of the reduced core

Fn[a]  Gene[b]  Num      Num      Orig  Final  Num    Num  Subst  α[e]            I[f]            Suspected
                Seqs[c]  Bact[d]  Len   Len    Invar  Var  Model  ML / nnet       ML / nnet       LGT[g]

A      ahcY     56       56       552   335    70     265  WAG    1.004 / 1.194   0.148 / 0.163   ✓
A      argG     65       65       525   331    25     306  WAG    1.014 / 1.138   0.051 / 0.056
A      aroC     69       68       532   218    24     194  WAG    1.013 / 1.148   0.081 / 0.085   G
A      hisB     63       63       436   177    21     156  WAG    0.568 / 0.652   - / -
A      metK     69       68       578   295    53     242  WAG    0.704 / 1.030   0.077 / 0.174
A      typA     64       64       692   534    88     446  WAG    1.092 / 1.289   0.137 / 0.144
B      accA     63       62       479   293    31     262  WAG    0.932 / 1.028   0.073 / 0.077
B      frr      70       69       203   167    8      159  WAG    1.087 / 1.312   - / -
B      ilvH     64       63       240   123    2      121  JTT    1.116 / 1.250   - / -
B      prsA     74       69       501   236    10     226  WAG    0.915 / 0.989   - / -
B      rpe      70       68       383   163    9      154  WAG    0.875 / 0.977   - / -           ✓
C      mreB     63       58       476   286    33     253  WAG    1.160 / 1.423   0.074 / 0.098
E      atpA     67       66       621   422    90     332  WAG    0.957 / 1.121   0.159 / 0.172
E      atpB     69       65       612   103    7      96   WAG    1.965 / 2.591   0.067 / 0.067
E      atpD     67       66       589   441    100    341  WAG    0.713 / 0.788   0.129 / 0.153
E      recA     74       68       490   284    58     226  WAG    0.973 / 1.179   0.161 / 0.179
F      ahpC     56       55       238   155    6      149  WAG    1.432 / 1.750   0.017 / 0.029   G
F      bcp      61       61       252   125    14     111  WAG    1.385 / 1.733   0.102 / 0.105   ✓
F      groEL    74       68       593   486    71     415  WAG    0.987 / 1.118   0.091 / 0.098
F      secY     70       69       585   238    22     216  WAG    1.162 / 1.340   0.059 / 0.064
F      smpB     70       70       232   132    11     121  WAG    0.849 / 1.005   - / -
G      lpxA     59       56       364   190    10     180  WAG    1.304 / 1.499   0.041 / 0.042
L      fabI     62       62       378   195    14     181  WAG    1.364 / 1.550   0.055 / 0.057
L      gcpE[h]  63       63       854   241    29     212  WAG    1.025 / 1.176   0.089 / 0.098
N      apt      62       62       457   118    21     97   WAG    0.603 / 0.692   - / -           ✓
N      ndk      67       67       273   115    9      106  WAG    0.801 / 0.910   - / -           G
N      pyrH     68       68       365   215    19     196  WAG    0.950 / 1.171   0.057 / 0.081
N      sufB     54       54       655   426    29     397  WAG    0.830 / 0.873   - / -           G
M      clpP     70       69       288   182    23     159  WAG    0.750 / 0.855   - / -
M      clpX     68       68       548   315    79     236  WAG    0.770 / 1.047   0.188 / 0.240
M      coaD     67       67       271   132    11     121  WAG    0.849 / 0.937   - / -
M      ribH     66       66       338   119    7      112  WAG    1.483 / 1.676   0.050 / 0.051
M      thiC     60       59       879   395    50     345  WAG    0.717 / 1.374   0.000 / 0.140
M      trnFC    56       56       227   132    11     121  WAG    0.782 / 0.851   - / -
R      recR     66       66       308   150    9      141  WAG    0.828 / 1.019   0.021 / 0.056
R      ruvC     65       65       274   127    8      119  WAG    0.891 / 1.013   - / -
T      infC     70       70       347   150    9      141  WAG    0.934 / 1.101   - / -
T      rpoA     70       70       510   180    15     165  WAG    0.916 / 1.042   - / -
S      efp      73       70       326   143    2      141  WAG    1.297 / 1.544   - / -
S      fusA     74       69       796   628    92     536  WAG    0.842 / 1.103   0.088 / 0.144
S      lepA     71       70       721   561    87     474  WAG    0.938 / 1.058   0.125 / 0.130
S      petB     55       55       990   164    11     153  WAG    1.404 / 1.643   - / -
S      rplB     69       69       328   261    51     210  WAG    1.141 / 1.393   0.165 / 0.182
S      rplD     70       70       389   100    7      93   WAG    1.284 / 1.464   0.061 / 0.062
S      rplE     70       70       233   174    19     155  WAG    0.807 / 1.072   0.015 / 0.063
S      rplF     70       70       254   128    10     118  WAG    0.770 / 0.852   - / -
S      rplK     69       69       224   122    15     107  WAG    0.786 / 0.841   - / -
S      rplM     69       69       247   110    11     99   WAG    0.743 / 0.832   - / -
S      rplN     71       70       152   119    15     104  WAG    0.637 / 0.731   - / -
S      rplP     70       70       185   125    15     110  WAG    0.770 / 0.847   - / -
S      rplT     70       70       143   105    11     94   WAG    1.431 / 1.716   0.090 / 0.093
S      rpsB     70       70       566   205    18     187  WAG    0.922 / 1.211   0.043 / 0.083
S      rpsC     70       70       388   178    20     158  WAG    0.710 / 0.809   - / -
S      rpsD     70       70       234   140    11     129  WAG    0.764 / 0.894   - / -
S      rpsE     69       69       267   138    18     120  WAG    0.727 / 0.820   - / -
S      rpsG     71       70       170   149    18     131  JTT    0.671 / 0.725   - / -
S      rpsH     70       70       180   102    11     91   WAG    0.839 / 0.907   - / -
S      rpsK     68       68       146   113    16     97   JTT    0.721 / 0.763   - / -
S      rpsL     68       68       192   116    41     75   WAG    0.363 / 0.419   - / -
S      rpsM     70       70       186   109    16     93   WAG    0.743 / 0.862   - / -
S      tsf      70       70       621   152    9      143  WAG    0.963 / 1.121   - / -
S      tufA     75       70       423   374    113    261  WAG    0.830 / 0.957   0.233 / 0.248

[a] The KEGG gene function category, where: A - Amino acid metabolism; B - Carbohydrate metabolism; C - Cell growth and death; E - Energy metabolism; F - Folding, sorting and degradation; G - Glycan biosynthesis and metabolism; L - Lipid metabolism; M - Metabolism of cofactors and vitamins; R - Replication and repair; T - Transcription; and S - Translation.
[b] See Table 4.2 for the full name of the gene products.
[c] The number of sequences in the data set.
[d] The number of bacteria (rather than the number of chromosomes and plasmids) represented in the data set.
[e] Shape parameter (α) of the discrete Γ distribution used to model rate heterogeneity across sites, inferred by TREEFINDER for the ML analyses and PROTTEST for the Neighbor-Net analyses.
[f] Proportion of invariant sites (I) inferred by TREEFINDER for the ML analyses and PROTTEST for the Neighbor-Net analyses.
[g] Phylogenies which show Gloeobacter in an unusual position are marked with a ‘G’. Proteins for which an LGT event was suspected are marked with a tick.
[h] Also known as 1-hydroxy-2-methyl-2-(E)-butenyl 4-diphosphate synthase (ispG).
[i] This gene name is a temporary name used only in this study.
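The alignment columns of Table C.1 allow a simple consistency check: the post-GBLOCKS length should equal the number of invariant sites plus the number of varying sites, and should not exceed the original length. The Python sketch below applies this check to a few rows copied from the table; the tuple layout and the function name are illustrative assumptions and not part of the original analysis.

```python
# Each row: (gene, original length, post-GBLOCKS length, invariant sites, varying sites).
# Values are copied from Table C.1; only a few rows are shown.
ROWS = [
    ("ahcY", 552, 335, 70, 265),
    ("atpB", 612, 103, 7, 96),
    ("rpsL", 192, 116, 41, 75),
    ("tufA", 423, 374, 113, 261),
]


def inconsistent_rows(rows):
    """Return the genes whose alignment columns are not internally consistent."""
    bad = []
    for gene, orig_len, final_len, n_invar, n_var in rows:
        sums_ok = (n_invar + n_var == final_len)   # invariant + varying = final length
        trimmed_ok = (final_len <= orig_len)       # trimming never lengthens an alignment
        if not (sums_ok and trimmed_ok):
            bad.append(gene)
    return bad


print(inconsistent_rows(ROWS))  # prints [] for the rows above
```

A check of this kind is mainly useful for catching transcription errors when the table is re-entered by hand.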

Bibliography

Ababneh, F., L. S. Jermiin, C. S. Ma, and J. Robinson. 2006. Matched-pairs tests of homogeneity with applications to homologous nucleotide sequences. Bioinformatics 22:1225-1231.

Abascal, F., R. Zardoya, and D. Posada. 2005. ProtTest: selection of best-fit models of protein evolution. Bioinformatics 21:2104-2105.

Abby, S. and V. Daubin. 2007. Comparative genomics and the evolution of prokaryotes. Trends in Microbiology 15:135-141.

Agosti, D., D. Jacobs, and R. DeSalle. 1996. On combining protein sequences and nucleic acid sequences in phylogenetic analysis: The homeobox protein case. Cladistics-the International Journal of the Willi Hennig Society 12:65-82.

Akaike, H. 1974. New look at statistical-model identification. IEEE Transactions on Automatic Control AC19:716-723.

Albertano, P., D. DiSomma, and E. Capucci. 1997. Cyanobacterial picoplankton from the Central Baltic Sea: cell size classification by image-analyzed fluorescence microscopy. Journal of Plankton Research 19:1405-1416.

Altenhoff, A. M. and C. Dessimoz. 2009. Phylogenetic and functional assessment of orthologs inference: projects and methods. PLoS Computational Biology 5:11.

Altschul, S. F. 1991. Amino-acid substitution matrices from an information theoretic perspective. Journal of Molecular Biology 219:555-565.

Altschul, S. F. 1993. A protein alignment scoring system sensitive at all evolutionary distances. Journal of Molecular Evolution 36:290-300.

Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic Local Alignment Search Tool. Journal of Molecular Biology 215:403-410.

Altschul, S. F., T. L. Madden, A. A. Schaffer, J. H. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25:3389-3402.

Anbar, A. D., Y. Duan, T. W. Lyons, G. L. Arnold, B. Kendall, R. A. Creaser, A. J. Kaufman, G. W. Gordon, C. Scott, J. Garvin, and R. Buick. 2007. A whiff of oxygen before the Great Oxidation Event? Science 317:1903-1906.

Andam, C., W. Williams, and J. Gogarten. 2010. Biased gene transfer mimics patterns created through shared ancestry. Proceedings of the National Academy of Sciences of the United States of America 107:10679-10684.

Angly, F. E., B. Felts, M. Breitbart, P. Salamon, R. A. Edwards, C. Carlson, A. M. Chan, M. Haynes, S. Kelley, H. Liu, J. M. Mahaffy, J. E. Mueller, J. Nulton, R. Olson, R. Parsons, S. Rayhawk, C. A. Suttle, and F. Rohwer. 2006. The marine viromes of four oceanic regions. PLoS Biology 4:e368.

Archibald, J. K., M. E. Mort, and D. J. Crawford. 2003a. Bayesian inference of phylogeny: a non-technical primer. Taxon 52:187-191.

Archibald, J. M. and P. J. Keeling. 2002. Recycled plastids: a “green movement” in eukaryotic evolution. Trends in Genetics 18:577-584.

Archibald, J. M. and A. J. Roger. 2002. Gene conversion and the evolution of euryarchaeal chaperonins: a maximum likelihood-based method for detecting conflicting phylogenetic signals. Journal of Molecular Evolution 55:232-245.

Archibald, J. M., M. B. Rogers, M. Toop, K. Ishida, and P. J. Keeling. 2003b. Lateral gene transfer and the evolution of plastid-targeted proteins in the secondary plastid-containing alga Bigelowiella natans. Proceedings of the National Academy of Sciences of the United States of America 100:7678-7683.

Ashby, M. K. and J. Houmard. 2006. Cyanobacterial two-component proteins: structure, diversity, distribution, and evolution. Microbiology and Molecular Biology Reviews 70:472-509.

Awramik, S. M. 1992. The Oldest Records of Photosynthesis. Photosynthesis Research 33:75-89.

Balows, A., H. Truper, M. Dworkin, W. Harder, and K.-H. Schliefer, eds. 1992. The Prokaryotes. 2nd ed. Springer-Verlag, Berlin.

Bandelt, H.-J. 1992. Generating median graphs from Boolean matrices. Pages 305-309 in L1-Statistical Analysis and related methods (Y. Dodge, ed.). Elsevier, Amsterdam.

Bandelt, H. J. and A. W. M. Dress. 1992. A canonical decomposition-theory for metrics on a finite set. Advanced Mathematics 92:47-105.

Bandelt, H. J., P. Forster, and A. Rohl. 1999. Median-joining networks for inferring intraspecific phylogenies. Molecular Biology and Evolution 16:37-48.

Bandelt, H. J., P. Forster, B. C. Sykes, and M. B. Richards. 1995. Mitochondrial portraits of human populations using median networks. Genetics 141:743-753.

Bapteste, E., E. Susko, J. Leigh, I. Ruiz-Trillo, J. Bucknam, and W. F. Doolittle. 2008. Alternative methods for concatenation of core genes indicate a lack of resolution in deep nodes of the prokaryotic phylogeny. Molecular Biology and Evolution 25:83-91.

Beiko, R. G., T. J. Harlow, and M. A. Ragan. 2005. Highways of gene sharing in prokaryotes. Proceedings of the National Academy of Sciences of the United States of America 102:14332-14337.

Bekker, A., H. D. Holland, P. L. Wang, D. Rumble, H. J. Stein, J. L. Hannah, L. L. Coetzee, and N. J. Beukes. 2004. Dating the rise of atmospheric oxygen. Geochimica et Cosmochimica Acta 68:A780-A780.

Benner, S. A., M. A. Cohen, and G. H. Gonnet. 1994. Amino acid substitution during functionally constrained divergent evolution of protein sequences. Protein Engineering 7:1323-1332.

Benson, D. A., M. S. Boguski, D. J. Lipman, J. Ostell, and B. F. F. Ouellette. 1998. GenBank. Nucleic Acids Research 26:1-7.

Bern, M. and D. Goldberg. 2005. Automatic selection of representative proteins for bacterial phylogeny. BMC Evolutionary Biology 5.

Bhattacharya, D. and L. Medlin. 1995. The phylogeny of plastids: a review based on comparisons of small-subunit ribosomal-RNA coding regions. Journal of Phycology 31:489-498.

Bhattacharya, D. and L. Medlin. 1998. Algal phylogeny and the origin of land plants. Plant Physiology 116:9-15.

Bininda-Emonds, O. R. P., J. L. Gittleman, and M. A. Steel. 2002. The (super)tree of life: procedures, problems, and prospects. Annual Review of Ecology and Systematics 33:265-289.

Blair, J. E., P. Shah, and S. B. Hedges. 2005. Evolutionary sequence analysis of complete eukaryote genomes. BMC Bioinformatics 6:53.

Blankenship, R. E. 1992. Origin and early evolution of photosynthesis. Photosynthesis Research 33:91-111.

Blankenship, R. E. 2002. Molecular mechanisms of photosynthesis. Blackwell Science, Oxford.

Blankenship, R. E. and H. Hartman. 1998. The origin and evolution of oxygenic photosynthesis. Trends in Biochemical Sciences 23:94-97.

Blankenship, S. S., R.E. and J. Raymond. 2007. The evolutionary transition from anoxygenic to oxygenic photosynthesis. chap. 3, Pages 21-35 in Evolution of Primary Producers in the Sea (P. Falkowski and A. Knoll, eds.). Elsevier Academic Press.

Bondy, J. and U. Murty. 2007. Graph Theory. Springer.

Bono, H., H. Ogata, S. Goto, and M. Kanehisa. 1998. Reconstruction of amino acid biosynthesis pathways from the complete genome sequence. Genome Research 8:203-210.

Bork, P., T. Dandekar, Y. Diaz-Lazcoz, F. Eisenhaber, M. Huynen, and Y. P. Yuan. 1998. Predicting function: from genes to genomes and back. Journal of Molecular Biology 283:707-725.

Boucher, Y., C. J. Douady, R. T. Papke, D. A. Walsh, M. E. R. Boudreau, C. L. Nesbo, R. J. Case, and W. F. Doolittle. 2003. Lateral gene transfer and the origins of prokaryotic groups. Annual Review of Genetics 37:283-328.

Boussau, B., S. Blanquart, A. Necsulea, N. Lartillot, and M. Gouy. 2008. Parallel adaptations to high temperatures in the Archaean eon. Nature 456:942-U74.

Bowker, A. H. 1948. A test for symmetry in contingency tables. Journal of the American Statistical Association 43:572-574.

Brenner, S. 1999. Errors in genome annotation. Trends in Genetics 15:132-133.

Brocks, J. J., R. Buick, G. A. Logan, and R. E. Summons. 2003a. Composition and syngeneity of molecular fossils from the 2.78 to 2.45 billion-year-old Mount Bruce Supergroup, Pilbara Craton, Western Australia. Geochimica et Cosmochimica Acta 67:4289-4319.

Brocks, J. J., R. Buick, R. E. Summons, and G. A. Logan. 2003b. A reconstruction of Archean biological diversity based on molecular fossils from the 2.78 to 2.45 billion-year-old Mount Bruce Supergroup, Hamersley Basin, Western Australia. Geochimica et Cosmochimica Acta 67:4321-4335.

Brocks, J. J., G. A. Logan, R. Buick, and R. E. Summons. 1999. Archean molecular fossils and the early rise of eukaryotes. Science 285:1033-1036.

Brodal, G. S., R. Fagerberg, A. Ostlin, C. N. S. Pedersen, and S. S. Rao. 2003. Computing refined Buneman trees in cubic time. Pages 259-270 in Algorithms in Bioinformatics, Proceedings vol. 2812 of Lecture Notes In Bioinformatics.

Brown, D. and K. Sjolander. 2006. Functional classification using phylogenomic inference. PLoS Computational Biology 2:e77.

Bruno, W. J. and A. L. Halpern. 1999. Topological bias and inconsistency of maximum likelihood using wrong models. Molecular Biology and Evolution 16:564-566.

Bryant, D. 1991. Cyanobacterial phycobilisomes: progress toward complete structural and functional analysis via molecular genetics. Pages 257-300 in The Photosynthetic Apparatus: Molecular Biology and Operation (L. Bogorad and K. Vasil, eds.). Academic Press, Boston, MA.

Bryant, D., A. G. Costas, J. Heidelberg, and D. Ward. 2007a. Candidatus Chloracidobacterium thermophilum: an aerobic phototrophic Acidobacterium with chlorosomes and type 1 reaction centers. Photosynthesis Research 91:269-270.

Bryant, D. and V. Moulton. 2002. NeighborNet: An agglomerative method for the construction of planar phylogenetic networks. Pages 375-391 in Algorithms in Bioinformatics, Proceedings vol. 2452 of Lecture Notes in Computer Science.

Bryant, D. and V. Moulton. 2004. Neighbor-Net: an agglomerative method for the construction of phylogenetic networks. Molecular Biology and Evolution 21:255-265.

Bryant, D. A., A. M. G. Costas, J. A. Maresca, A. G. M. Chew, C. G. Klatt, M. M. Bateson, L. J. Tallon, J. Hostetler, W. C. Nelson, J. F. Heidelberg, and D. M. Ward. 2007b. Candidatus Chloracidobacterium thermophilum: an aerobic phototrophic acidobacterium. Science 317:523-526.

Buick, R., J. S. R. Dunlop, and D. I. Groves. 1981. Stromatolite recognition in ancient rocks: an appraisal of irregularly laminated structures in an early Archean chert-barite unit from North Pole, Western Australia. Alcheringa 5:164-181.

Buneman, P. 1971. The recovery of trees from measures of dissimilarity. Pages 387-395 in Mathematics in Archaeological and Historical Sciences (D. Kendall and P. Tautu, eds.). Edinburgh University Press, Oxford.

Burger-Wiersma, T., M. Veenhuis, H. J. Korthals, C. C. M. Vandewiel, and L. R. Mur. 1986. A new prokaryote containing chlorophyll a and chlorophyll b. Nature 320:262-264.

Burleigh, J. G., A. C. Driskell, and M. J. Sanderson. 2006. Supertree bootstrapping methods for assessing phylogenetic variation among genes in genome-scale data sets. Systematic Biology 55:426-440.

Butterfield, N. J. 2000. Bangiomorpha pubescens n. gen., n. sp.: implications for the evolution of sex, multicellularity, and the Mesoproterozoic/Neoproterozoic radiation of eukaryotes. Paleobiology 26:386-404.

Cameron, M., H. E. Williams, and A. Cannane. 2006. A deterministic finite automaton for faster protein hit detection in BLAST. Journal of Computational Biology 13:965-978.

Campbell, L., H. A. Nolla, and D. Vaulot. 1994. The importance of Prochlorococcus to community structure in the Central North Pacific Ocean. Limnology and Oceanography 39:954-961.

Canchaya, C., G. Fournous, S. Chibani-Chennoufi, M. L. Dillmann, and H. Brussow. 2003. Phage as agents of lateral gene transfer. Current Opinion in Microbiology 6:417-424.

Canfield, D. E. 2005. The early history of atmospheric oxygen: homage to Robert A. Garrels. Annual Review of Earth and Planetary Sciences 33:1-36.

Castresana, J. 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Molecular Biology and Evolution 17:540-552.

Cavalier-Smith, T. 1999. Principles of protein and lipid targeting in secondary symbiogenesis: euglenoid, dinoflagellate, and sporozoan plastid origins and the eukaryote family tree. Journal of Eukaryotic Microbiology 46:347-366.

Cavalier-Smith, T. 2006. Cell evolution and Earth history: stasis and revolution. Philosophical Transactions of the Royal Society B-Biological Sciences 361:969-1006.

Cavalier-Smith, T., M. Brasier, and T. M. Embley. 2006. Introduction: how and when did microbes change the world? Philosophical Transactions of the Royal Society B-Biological Sciences 361:845-850.

Cerny, V. 1985. Thermodynamical approach to the traveling salesman problem: an efficient simulation algorithm. Journal of Optimization Theory and Applications 45:41-51.

Chen, F., A. J. Mackey, J. K. Vermunt, and D. S. Roos. 2007. Assessing performance of orthology detection strategies applied to eukaryotic genomes. PLoS One 2:e383.

Chen, M., L. L. Eggink, J. K. Hoober, and A. W. D. Larkum. 2005a. Influence of structure on binding of chlorophylls to peptide ligands. Journal of the American Chemical Society 127:2052-2053.

Chen, M., R. G. Hiller, C. J. Howe, and A. W. D. Larkum. 2005b. Unique origin and lateral transfer of prokaryotic chlorophyll b and chlorophyll d light-harvesting systems. Molecular Biology and Evolution 22:21-28.

Chen, M., R. G. Quinnell, and A. W. D. Larkum. 2002a. Chlorophyll d as the major photopigment in Acaryochloris marina. Journal of Porphyrins and Phthalocyanines 6:763-773.

Chen, M., R. G. Quinnell, and A. W. D. Larkum. 2002b. The major light-harvesting pigment protein of Acaryochloris marina. FEBS Letters 514:149-152.

Chisholm, S. W., R. J. Olson, E. R. Zettler, R. Goericke, J. B. Waterbury, and N. A. Welschmeyer. 1988. A novel free-living prochlorophyte abundant in the oceanic euphotic zone. Nature 334:340-343.

Chiu, J. C., E. K. Lee, M. G. Egan, I. N. Sarkar, G. M. Coruzzi, and R. DeSalle. 2006. OrthologID: automation of genome-scale ortholog identification within a parsimony framework. Bioinformatics 22:699-707.

Chun, E. H. L., M. H. Vaughan, and A. Rich. 1963. Isolation and characterization of DNA associated with chloroplast preparations. Journal of Molecular Biology 7:130-141.

Chyba, C. F. 1993. The violent environment of the origin of life: progress and uncertainties. Geochimica et Cosmochimica Acta 57:3351-3358.

Ciccarelli, F. D., T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork. 2006. Toward automatic reconstruction of a highly resolved Tree of Life. Science 311:1283-1287.

Cohan, F. M. 2006. Towards a conceptual and operational union of bacterial systematics, ecology, and evolution. Philosophical Transactions of the Royal Society B-Biological Sciences 361:1985-1996.

Coleman, M. L. and S. W. Chisholm. 2007. Code and context: Prochlorococcus as a model for cross-scale biology. Trends in Microbiology 15:398-407.

Coleman, M. L., M. B. Sullivan, A. C. Martiny, C. Steglich, K. Barry, E. F. DeLong, and S. W. Chisholm. 2006. Genomic islands and the ecology and evolution of Prochlorococcus. Science 311:1768-1770.

Creevey, C. 2004. CLANN Manual (version 3.0.0). National University of Ireland.

Creevey, C. J., D. A. Fitzpatrick, G. K. Philip, R. J. Kinsella, M. J. O’Connell, M. M. Pentony, S. A. Travers, M. Wilkinson, and J. O. McInerney. 2004. Does a tree-like phylogeny only exist at the tips in the prokaryotes? Proceedings of the Royal Society B-Biological Sciences 271:2551-2558.

Creevey, C. J. and J. O. McInerney. 2005. Clann: investigating phylogenetic information through supertree analyses. Bioinformatics 21:390-392.

Crooks, G. E. and S. E. Brenner. 2005. An alternative model of amino acid replacement. Bioinformatics 21:975-980.

Crosbie, N. D., M. Pockl, and T. Weisse. 2003. Dispersal and phylogenetic diversity of nonmarine picocyanobacteria, inferred from 16S rRNA gene and cpcBA-intergenic spacer sequence analyses. Applied and Environmental Microbiology 69:5716-5721.

Dagan, T. and W. Martin. 2006. The tree of one percent. Genome Biology 7.

Dammeyer, T., E. Hofmann, and N. Frankenberg-Dinkel. 2008. Phycoerythrobilin synthase (pebS) of a marine virus: crystal structures of the biliverdin complex and the substrate-free form. Journal of Biological Chemistry 283:27547-27554.

Daubin, V., M. Gouy, and G. Perriere. 2002. A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history. Genome Research 12:1080-1090.

Daubin, V., N. A. Moran, and H. Ochman. 2003. Phylogenetics and the cohesion of bacterial genomes. Science 301:829-832.

Daubin, V. and G. Perriere. 2003. G+C3 structuring along the genome: a common feature in prokaryotes. Molecular Biology and Evolution 20:471-483.

Dayhoff, M., R. Eck, and C. Park. 1972. A model of evolutionary change in proteins. Atlas of Protein Sequences and Structure 5:89-99.

Dayhoff, M., R. Schwartz, and B. Orcutt. 1978. A model of evolutionary change in proteins. Pages 345-352 in Atlas of Protein Sequence and Structure (D. Dayhoff, ed.) vol. 5. suppl. 3. National Biomedical Research Foundation, Washington, D.C.

Delsuc, F., H. Brinkmann, and H. Philippe. 2005. Phylogenomics and the reconstruction of the tree of life. Nature Reviews Genetics 6:361-375.

Delwiche, C. F. 1999. Tracing the thread of plastid diversity through the tapestry of life. American Naturalist 154:S164-S177.

Delwiche, C. F. and J. D. Palmer. 1996. Rampant horizontal transfer and duplication of rubisco genes in eubacteria and plastids. Molecular Biology and Evolution 13:873-882.

des Marais, D. J. 2000. Evolution - when did photosynthesis emerge on Earth? Science 289:1703-1705.

Do, C. B., M. S. P. Mahabhashyam, M. Brudno, and S. Batzoglou. 2005. ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Research 15:330-340.

Doolittle, W. F. 1999. Phylogenetic classification and the universal tree. Science 284:2124-2128.

Doolittle, W. F., Y. Boucher, C. L. Nesbo, C. J. Douady, J. O. Andersson, and A. J. Roger. 2003. How big is the iceberg of which organellar genes in nuclear genomes are but the tip? Philosophical Transactions of the Royal Society of London Series B-Biological Sciences 358:39-57.

Dress, A., D. Huson, and V. Moulton. 1996. Analyzing and visualizing sequence and distance data using SPLITSTREE. Discrete Applied Mathematics 71:95-109.

Dress, A. W. M. and D. H. Huson. 2004. Constructing splits graphs. IEEE-ACM Transactions on Computational Biology and Bioinformatics 1:109-115.

Drummond, A. J. and A. Rambaut. 2007. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology 7.

Dufresne, A., L. Garczarek, and F. Partensky. 2005. Accelerated evolution associated with genome reduction in a free-living prokaryote. Genome Biology 6:R14.

Dufresne, A., M. Ostrowski, D. J. Scanlan, L. Garczarek, S. Mazard, B. P. Palenik, I. T. Paulsen, N. T. de Marsac, P. Wincker, C. Dossat, S. Ferriera, J. Johnson, A. F. Post, W. R. Hess, and F. Partensky. 2008. Unraveling the genomic mosaic of a ubiquitous genus of marine cyanobacteria. Genome Biology 9:16.

Dufresne, A., M. Salanoubat, F. Partensky, F. Artiguenave, I. M. Axmann, V. Barbe, S. Duprat, M. Y. Galperin, E. V. Koonin, F. Le Gall, K. S. Makarova, M. Ostrowski, S. Oztas, C. Robert, I. B. Rogozin, D. J. Scanlan, N. T. de Marsac, J. Weissenbach, P. Wincker, Y. I. Wolf, and W. R. Hess. 2003. Genome sequence of the cyanobacterium Prochlorococcus marinus SS120, a nearly minimal oxyphototrophic genome. Proceedings of the National Academy of Sciences of the United States of America 100:10020-10025.

Dutilh, B. E., V. van Noort, R. van der Heijden, T. Boekhout, B. Snel, and M. A. Huynen. 2007. Assessment of phylogenomic and orthology approaches for phylogenetic inference. Bioinformatics 23:815-824.

Echlin, P. 1966. The cyanophytic origins of higher plant chloroplasts. British Phycological Bulletin 3:150-151.

Eck, R. and M. Dayhoff. 1966. Atlas of protein sequence and structure. National Biomedical Research Foundation, Silver Springs, MD.

Edgar, R. C. 2004. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

Edgar, R. C. and S. Batzoglou. 2006. Multiple sequence alignment. Current Opinion in Structural Biology 16:368-373.

Edwards, A. and L. Cavalli-Sforza. 1963. The reconstruction of evolution. Heredity 18:553.

Edwards, A. W. F. 1996. The origin and early development of the method of minimum evolution for the reconstruction of phylogenetic trees. Systematic Biology 45:79-91.

Eisen, J. A. and C. M. Fraser. 2003. Phylogenomics: intersection of evolution and genomics. Science 300:1706-1707.

Eldredge, N. and S. Gould. 1972. Punctuated equilibria: an alternative to phyletic gradualism. Pages 82-115 in Models in Paleobiology. Freeman Cooper.

Enright, A. J., S. Van Dongen, and C. A. Ouzounis. 2002. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research 30:1575-1584.

Falkowski, P. and A. H. Knoll, eds. 2007. Evolution of primary producers in the sea. Elsevier Academic Press, Burlington, MA.

Falkowski, P. and J. Raven, eds. 2007. Aquatic Photosynthesis. 2nd ed. Princeton University Press, Princeton, NJ.

Falkowski, P. G. 2006. Evolution: tracing oxygen’s imprint on Earth’s metabolic evolution. Science 311:1724-1725.

Falkowski, P. G., M. E. Katz, A. H. Knoll, A. Quigg, J. A. Raven, O. Schofield, and F. J. R. Taylor. 2004. The evolution of modern eukaryotic phytoplankton. Science 305:354-360.

Farquhar, J., H. M. Bao, and M. Thiemens. 2000. Atmospheric influence of Earth’s earliest sulfur cycle. Science 289:756-758.

Farris, J. S. 1999. Likelihood and inconsistency. Cladistics 15:199-204.

Fehling, J., D. Stoecker, and S. Baldauf. 2007. Photosynthesis and the Eukaryote Tree of Life. chap. 6, Pages 75-107 in Evolution of Primary Producers in the Sea. Academic Press.

Feil, E. J., E. C. Holmes, D. E. Bessen, M. S. Chan, N. P. J. Day, M. C. Enright, R. Goldstein, D. W. Hood, A. Kalla, C. E. Moore, J. J. Zhou, and B. G. Spratt. 2001. Recombination within natural populations of pathogenic bacteria: short-term empirical estimates and long-term phylogenetic consequences. Proceedings of the National Academy of Sciences of the United States of America 98:182-187.

Felsenstein, J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Systematic Zoology 27:401-410.

Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum-likelihood approach. Journal of Molecular Evolution 17:368-376.

Felsenstein, J. 1985a. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 34:300-311.

Felsenstein, J. 1985b. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783-791.

Felsenstein, J. 1989. PHYLIP: Phylogeny Inference Package (Version 3.2). Cladistics 5:164-166.

Felsenstein, J. 2004. Inferring phylogenies. Sunderland, Mass.

Felsenstein, J. 2005. PHYLIP (Phylogeny Inference Package) version 3.6.3. Distributed by the author (www.phylip.com). Department of Genome Sciences, University of Washington, Seattle.

Ferris, M. J. and B. Palenik. 1998. Niche adaptation in ocean cyanobacteria. Nature 396:226-228.

Field, C. B., M. J. Behrenfeld, J. T. Randerson, and P. Falkowski. 1998. Primary production of the biosphere: integrating terrestrial and oceanic components. Science 281:237-240.

Fitch, W. M. 1970. Distinguishing homologous from analogous proteins. Systematic Zoology 19:99-113.

Fitch, W. M. 1971. Nonidentity of invariable positions in cytochromes C of different species. Biochemical Genetics 5:231-241.

306 Fitch, W. M. 1997. Networks and viral evolution. Journal of Molecular Evolution 44:S65- S75. Fitch, W. M. 2000. Homology: a personal view on some of the problems. Trends in Genetics 16:227-231. Fitch, W. M. and E. Markowitz. 1970. An improved method for determining codon vari­ ability in a gene and its application to rate of fixation of mutations in evolution. Bio­ chemical Genetics 4:579-593. Foster, P. G., L. S. Jermiin, and D. A. Hickey. 1997. Nucleotide composition bias affects amino acid content in proteins coded by animal mitochondria. Journal of Molecular Evolution 44:282-288. Fournier, G. P. and J. P. Gogarten. 2010. Rooting the Ribosomal Tree of Life. Molecular Biology and Evolution 27:1792-1801. Fox, G. E., K. R. Pechman, and C. R. Woese. 1977. Comparative cataloging of 16s ri­ bosomal ribonucleic acid: molecular approach to procaryotic systematics. International Journal of Systematic Bacteriology 27:44-57. Fromme, S. W., P. and N. Krauss. 1994. Structure of photosystem-1: suggestions on the docking sites for plastocyanin, ferredoxin and the coordination of P700. Biochimica et Biophysica Sinica 1187:99-105. Frost, L. S., R. Leplae, A. O. Summers, and A. Toussaint. 2005. Mobile genetic elements: the agents of open source evolution. Nature Reviews Microbiology 3:722-732. Fuhrman, J. A. 1999. Marine viruses and their biogeochemical and ecological effects. Nature 399:541-548. Fuhrman, J. A. and R. T. Noble. 1995. Viruses and protists cause similar bacterial mor­ tality in coastal seawater. Limnology and Oceanography 40:1236-1242. Fuller, N. J., G. A. Tarran, M. Yallop, K. M. Orcutt, and D. J. Scanlan. 2006. Molecu­ lar analysis of picocyanobacterial community structure along an Arabian Sea transect reveals distinct spatial separation of lineages. Limnology and Oceanography 51:2515- 2526.

Galperin, M. and E. Koonin. 1998. Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption. In Silico Biology 1:55-67.

Galtier, N. 2001. Maximum-likelihood phylogenetic analysis under a covarion-like model. Molecular Biology and Evolution 18:866-874.

Galtier, N. and V. Daubin. 2008. Dealing with incongruence in phylogenomic analyses. Philosophical Transactions of the Royal Society B-Biological Sciences 363:4023-4029.

Galtier, N., N. Tourasse, and M. Gouy. 1999. A nonhyperthermophilic common ancestor to extant life forms. Science 283:220-221.

Gantar, M., C. Sinigalliano, T. Collins, J. Sierra-Montes, C. Herrero, and J. Landrum. 2005. Prochlorales cyanobacterium EV-7 16S ribosomal RNA gene, partial sequence (Genbank Identifier 66815028). http://www.ncbi.nlm.nih.gov/nuccore/66815028.

Garcia-Vallve, S., A. Romeu, and J. Palau. 2000. Horizontal gene transfer in bacterial and archaeal complete genomes. Genome Research 10:1719-1725.

Garrett, R. and C. Grisham. 2005. Biochemistry. 3rd ed. Brooks Cole, Pacific Grove, CA.

Garrity, G., ed. 2001. Bergey’s manual of systematic bacteriology. Springer.

Gascuel, O. 1997. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Molecular Biology and Evolution 14:685-695.

Gascuel, O., D. Bryant, and F. Denis. 2001. Strengths and limitations of the minimum evolution principle. Systematic Biology 50:621-627.

Ge, F., L. S. Wang, and J. Kim. 2005. The cobweb of life revealed by genome-scale estimates of horizontal gene transfer. PLoS Biology 3:1709-1718.

Gill, S. R., D. E. Fouts, G. L. Archer, E. F. Mongodin, R. T. DeBoy, J. Ravel, I. T. Paulsen, J. F. Kolonay, L. Brinkac, M. Beanan, R. J. Dodson, S. C. Daugherty, R. Madupu, S. V. Angiuoli, A. S. Durkin, D. H. Haft, J. Vamathevan, H. Khouri, T. Utterback, C. Lee, G. Dimitrov, L. X. Jiang, H. Y. Qin, J. Weidman, K. Tran, K. Kang, I. R. Hance, K. E. Nelson, and C. M. Fraser. 2005. Insights on evolution of virulence and resistance from the complete genome analysis of an early methicillin-resistant Staphylococcus aureus strain and a biofilm-producing methicillin-resistant Staphylococcus epidermidis strain. Journal of Bacteriology 187:2426-2438.

Gogarten, J. P., W. F. Doolittle, and J. G. Lawrence. 2002. Prokaryotic evolution in light of gene transfer. Molecular Biology and Evolution 19:2226-2238.

Gogarten-Boekels, M., E. Hilario, and J. P. Gogarten. 1995. The effects of heavy meteorite bombardment on the early evolution - the emergence of the three domains of life. Origins of Life and Evolution of the Biosphere 25:251-264.

Gojobori, T., K. Ishii, and M. Nei. 1982. Estimation of average number of nucleotide substitutions when the rate of substitution varies with nucleotide. Journal of Molecular Evolution 18:414-423.

Golubchik, T., M. J. Wise, S. Easteal, and L. S. Jermiin. 2007. Mind the gaps: evidence of bias in estimates of multiple sequence alignments. Molecular Biology and Evolution 24:2433-2442.

Gonnet, G. H., M. A. Cohen, and S. A. Benner. 1992. Exhaustive matching of the entire protein sequence database. Science 256:1443-1445.

Gould, S. and N. Eldredge. 1977. Punctuated equilibria: the tempo and mode of evolution reconsidered. Paleobiology 3:115-151.

Govindjee and D. Krogmann. 2004. Discoveries in oxygenic photosynthesis (1727-2003): a perspective. Photosynthesis Research 80:15-57.

Graybeal, A. 1994. Evaluating the phylogenetic utility of genes: a search for genes informative about deep divergences among vertebrates. Systematic Biology 43:174-193.

Graybeal, A. 1998. Is it better to add taxa or characters to a difficult phylogenetic problem? Systematic Biology 47:9-17.

Guenoche, A. 1986. Graphical representation of a Boolean array. Computational Humanities 20:277-281.

Guindon, S. and O. Gascuel. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology 52:696-704.

Hadjantonakis, A. K. and A. Nagy. 2001. The color of mice: in the light of GFP-variant reporters. Histochemistry and Cell Biology 115:49-58.

Hall, B. 2004. Phylogenetic trees made easy: a how-to manual. 2nd ed. Sunderland, MA.

Hanekamp, K., U. Bohnebeck, B. Beszteri, and K. Valentin. 2007. PhyloGena: a user-friendly system for automated phylogenetic annotation of unknown sequences. Bioinformatics 23:793-801.

Harlow, T. J., J. P. Gogarten, and M. A. Ragan. 2004. A hybrid clustering approach to recognition of protein families in 114 microbial genomes. BMC Bioinformatics 5:45.

Hartigan, J. 1973. Minimum evolution fits to a given tree. Biometrics 29:53-65.

Hasegawa, M., H. Kishino, and T. Yano. 1987. Man’s place in Hominoidea as inferred from molecular clocks of DNA. Journal of Molecular Evolution 22:132-147.

Hayashi, T., K. Makino, M. Ohnishi, K. Kurokawa, K. Ishii, K. Yokoyama, C. G. Han, E. Ohtsubo, K. Nakayama, T. Murata, M. Tanaka, T. Tobe, T. Iida, H. Takami, T. Honda, C. Sasakawa, N. Ogasawara, T. Yasunaga, S. Kuhara, T. Shiba, M. Hattori, and H. Shinagawa. 2001. Complete genome sequence of enterohemorrhagic Escherichia coli O157:H7 and genomic comparison with a laboratory strain K-12. DNA Research 8:11-22.

Heath, T. A., S. M. Hedtke, and D. M. Hillis. 2008a. Taxon sampling and the accuracy of phylogenetic analyses. Journal of Systematics and Evolution 46:239-257.

Heath, T. A., D. J. Zwickl, J. Kim, and D. M. Hillis. 2008b. Taxon sampling affects inferences of macroevolutionary processes from phylogenetic trees. Systematic Biology 57:160-166.

Hedges, S. B., H. Chen, S. Kumar, D. Wang, A. Thompson, and H. Watanabe. 2001. A genomic timescale for the origin of eukaryotes. BMC Evolutionary Biology 1:4.

Hedtke, S., T. Townsend, and D. Hillis. 2006. Resolution of phylogenetic conflict in large data sets by increased taxon sampling. Systematic Biology 55:522-529.

Hendy, M. D. and D. Penny. 1989. A framework for the quantitative study of evolutionary trees. Systematic Zoology 38:297-309.

Henikoff, S. and J. G. Henikoff. 1992. Amino-acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America 89:10915-10919.

Herdman, M., R. Castenholz, J. Waterbury, and R. Rippka. 2001. Form-Genus XIII. Synechococcus. Pages 508-512 in Bergey’s Manual of Systematic Bacteriology (E. Boon and R. Castenholz, eds.) vol. 1 2nd ed. Springer-Verlag, New York.

Hess, W. R., F. Partensky, G. W. M. van der Staay, J. M. Garcia-Fernandez, T. Borner, and D. Vaulot. 1996. Coexistence of phycoerythrin and a chlorophyll a/b antenna in a marine prokaryote. Proceedings of the National Academy of Sciences of the United States of America 93:11126-11130.

Hess, W. R., G. Rocap, C. S. Ting, F. Larimer, S. Stilwagen, J. Lamerdin, and S. W. Chisholm. 2001. The photosynthetic apparatus of Prochlorococcus: insights through comparative genomics. Photosynthesis Research 70:53-71.

Hess, W. R., A. Weihe, S. Loiseauxdegoer, F. Partensky, and D. Vaulot. 1995. Characterization of the single psbA gene of Prochlorococcus marinus CCMP1375 (Prochlorophyta). Plant Molecular Biology 27:1189-1196.

Higgs, P. G. 1998. Compensatory neutral mutations and the evolution of RNA. Genetica 102-3:91-101.

Higgs, P. G. 2000. RNA secondary structure: physical and computational aspects. Quarterly Reviews of Biophysics 33:199-253.

Hillis, D., J. Huelsenbeck, and C. Cunningham. 1994. Application and accuracy of molecular phylogenies. Science 264:671-677.

Hillis, D. and J. Wiens. 2000. Molecules versus morphology in systematics: conflicts, artifacts, and misconceptions. Pages 1-19 in Phylogenetic analysis of morphological data (J. Wiens, ed.). Smithsonian Institution Press, Washington, DC.

Hillis, D. M. 1996. Inferring complex phylogenies. Nature 383:130-131.

Hillis, D. M. 1998. Taxonomic sampling, phylogenetic accuracy, and investigator bias. Systematic Biology 47:3-8.

Ho, S. Y. W. and L. S. Jermiin. 2004. Tracing the decay of the historical signal in biological sequence data. Systematic Biology 53:623-637.

Ho, S. Y. W., M. J. Phillips, A. Cooper, and A. J. Drummond. 2005. Time dependency of molecular rate estimates and systematic overestimation of recent divergence times. Molecular Biology and Evolution 22:1561-1568.

Hoffman, P. F., A. J. Kaufman, G. P. Halverson, and D. P. Schrag. 1998a. A Neoproterozoic Snowball Earth. Science 281:1342-1346.

Hoffman, P. F., D. P. Schrag, G. P. Halverson, and J. A. Kaufman. 1998b. An early Snowball Earth? Response. Science 282:1645-1646.

Hohl, M. and M. A. Ragan. 2007. Is multiple-sequence alignment required for accurate inference of phylogeny? Systematic Biology 56:206-221.

Holder, M. and P. O. Lewis. 2003. Phylogeny estimation: Traditional and Bayesian approaches. Nature Reviews Genetics 4:275-284.

Holland, B. R., F. Delsuc, and V. Moulton. 2005. Visualizing conflicting evolutionary hypotheses in large collections of trees: using consensus networks to study the origins of placentals and hexapods. Systematic Biology 54:66-76.

Holland, B. R., D. Penny, and M. D. Hendy. 2003. Outgroup misplacement and phylogenetic inaccuracy under a molecular clock: a simulation study. Systematic Biology 52:229-238.

Holland, H. 2006. The oxygenation of the atmosphere and oceans. Philosophical Transactions of the Royal Society of London Series B-Biological Sciences 361:903-915.

Holmes, E. C., M. Worobey, and A. Rambaut. 1999. Phylogenetic evidence for recombination in dengue virus. Molecular Biology and Evolution 16:405-409.

Howe, C. J., A. C. Barbrook, R. E. R. Nisbet, P. J. Lockhart, and A. W. D. Larkum. 2008. The origin of plastids. Philosophical Transactions of the Royal Society of London Series B-Biological Sciences 363:2675-2685.

Hu, J. H. and J. L. Blanchard. 2009. Environmental sequence data from the Sargasso Sea reveal that the characteristics of genome reduction in Prochlorococcus are not a harbinger for an escalation in genetic drift. Molecular Biology and Evolution 26:5-13.

Hu, Q., H. Miyashita, I. Iwasaki, N. Kurano, S. Miyachi, M. Iwaki, and S. Itoh. 1998. A photosystem I reaction center driven by chlorophyll d in oxygenic photosynthesis. Proceedings of the National Academy of Sciences of the United States of America 95:13319-13323.

Huber, K. T., V. Moulton, P. Lockhart, and A. Dress. 2001. Pruned median networks: a technique for reducing the complexity of median networks. Molecular Phylogenetics and Evolution 19:302-310.

Huelsenbeck, J. 1998. Systematic bias in phylogenetic analysis: is the strepsiptera problem solved? Systematic Biology 47:519-537.

Huelsenbeck, J. 2002. Testing a covariotide model of DNA substitution. Molecular Biology and Evolution 19:698-707.

Huelsenbeck, J. P. 1995. Performance of phylogenetic methods in simulation. Systematic Biology 44:17-48.

Huelsenbeck, J. P. and K. A. Crandall. 1997. Phylogeny estimation and hypothesis testing using maximum likelihood. Annual Review of Ecology and Systematics 28:437-466.

Huelsenbeck, J. P. and D. M. Hillis. 1993. Success of phylogenetic methods in the four-taxon case. Systematic Biology 42:247-264.

Huelsenbeck, J. P. and K. M. Lander. 2003. Frequent inconsistency of parsimony under a simple model of cladogenesis. Systematic Biology 52:641-648.

Huelsenbeck, J. P., B. Larget, R. E. Miller, and F. Ronquist. 2002. Potential applications and pitfalls of Bayesian inference of phylogeny. Systematic Biology 51:673-688.

Huelsenbeck, J. P., F. Ronquist, R. Nielsen, and J. P. Bollback. 2001. Evolution: Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294:2310-2314.

Hugall, A. F. and M. S. Y. Lee. 2007. The likelihood node density effect and consequences for evolutionary studies of molecular rates. Evolution 61:2293-2307.

Hulsen, T., M. A. Huynen, J. de Vlieg, and P. M. A. Groenen. 2006. Benchmarking ortholog identification methods using functional genomics data. Genome Biology 7:R31.

Hurvich, C. M. and C. L. Tsai. 1989. Regression and time-series model selection in small samples. Biometrika 76:297-307.

Huson, D. 2005. ISMB-Tutorial: Introduction to Phylogenetic Networks.

Huson, D. and D. Bryant. 2007. User manual for SplitsTree4 V4.8.

Huson, D. H. 1998. SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics 14:68-73.

Huson, D. H. and D. Bryant. 2006. Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution 23:254-267.

Huson, D. H., T. Dezulian, T. Klopper, and M. A. Steel. 2004a. Phylogenetic super-networks from partial trees. Pages 388-399 in Proceedings of Algorithms in Bioinformatics vol. 3240 of Lecture Notes In Computer Science.

Huson, D. H., T. Dezulian, T. Klopper, and M. A. Steel. 2004b. Phylogenetic super-networks from partial trees. IEEE-ACM Transactions on Computational Biology and Bioinformatics 1:151-158.

Huson, D. H., D. C. Richter, C. Rausch, T. Dezulian, M. Franz, and R. Rupp. 2007. Dendroscope: an interactive viewer for large phylogenetic trees. BMC Bioinformatics 8:6.

Iyer, L. M., E. V. Koonin, and L. Aravind. 2004. Evolution of bacterial RNA polymerase: implications for large-scale bacterial phylogeny, domain accretion, and horizontal gene transfer. Gene 335:73-88.

Jain, R., M. C. Rivera, and J. A. Lake. 1999. Horizontal gene transfer among genomes: the complexity hypothesis. Proceedings of the National Academy of Sciences of the United States of America 96:3801-3806.

Jain, R., M. C. Rivera, J. E. Moore, and J. A. Lake. 2003. Horizontal gene transfer accelerates genome innovation and evolution. Molecular Biology and Evolution 20:1598-1602.

Jayaswal, V., J. Robinson, and L. Jermiin. 2007. Estimation of phylogeny and invariant sites under the general Markov model of nucleotide sequence evolution. Systematic Biology 56:155-162.

Jeanmougin, F., J. D. Thompson, M. Gouy, D. G. Higgins, and T. J. Gibson. 1998. Multiple sequence alignment with Clustal X. Trends in Biochemical Sciences 23:403-405.

Jermiin, L., V. Jayaswal, F. Ababneh, and J. Robinson. 2008. Phylogenetic Model Evaluation. Pages 331-364 in Bioinformatics, Vol. 1: Data, Sequence Analysis, and Evolution, vol 452 (J. Keith, ed.). Springer Science and Business Media, Totowa, NJ.

Jermiin, L. S., S. Y. W. Ho, F. Ababneh, J. Robinson, and A. W. D. Larkum. 2004. The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Systematic Biology 53:638-643.

Jobb, G., A. von Haeseler, and K. Strimmer. 2004. TREEFINDER: a powerful graphical analysis environment for molecular phylogenetics (version of June 2007). BMC Evolutionary Biology 4:18. Available from www.treefinder.de.

Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Computer Applications in the Biosciences 8:275-282.

Jow, H., C. Hudelot, M. Rattray, and P. G. Higgs. 2002. Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution. Molecular Biology and Evolution 19:1591-1601.

Jun, S. R., G. E. Sims, G. H. A. Wu, and S. H. Kim. 2010. Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution. Proceedings of the National Academy of Sciences of the United States of America 107:133-138.

Kandler, O. 1994. The early diversification of life. Pages 152-509 in Early Life on Earth, Nobel Symposium (S. Bengston, ed.). Columbia University Press.

Kanehisa, M. 1997. A database for post-genome analysis. Trends in Genetics 13:375-376.

Kanehisa, M., S. Goto, M. Hattori, K. F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki, and M. Hirakawa. 2006. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Research 34:D354-D357.

Kashyap, R. L. and S. Subas. 1974. Statistical estimation of parameters in a phylogenetic tree using a dynamic model of substitutional process. Journal of Theoretical Biology 47:75-101.

Katoh, K., K. Misawa, K. Kuma, and T. Miyata. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 30:3059-3066.

Kaufman, A. J., D. T. Johnston, J. Farquhar, A. L. Masterson, T. W. Lyons, S. Bates, A. D. Anbar, G. L. Arnold, J. Garvin, and R. Buick. 2007. Late Archean biospheric oxygenation and atmospheric evolution. Science 317:1900-1903.

Keane, T. M., C. J. Creevey, M. M. Pentony, T. J. Naughton, and J. O. McInerney. 2006. Assessment of methods for amino acid matrix selection and their use on empirical data shows that ad hoc assumptions for choice of matrix are not justified. BMC Evolutionary Biology 6:29.

Keeling, P. J. 2010. The endosymbiotic origin, diversification and fate of plastids. Philosophical Transactions of the Royal Society of London Series B-Biological Sciences 365:729-748.

Keswani, G. and L. O. Hall. 2002. Text classification with enhanced semi-supervised fuzzy clustering. Pages 621-626 in KDD-2007 Proceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Kettler, G. C., A. C. Martiny, K. Huang, J. Zucker, M. L. Coleman, S. Rodrigue, F. Chen, A. Lapidus, S. Ferriera, J. Johnson, C. Steglich, G. M. Church, P. Richardson, and S. W. Chisholm. 2007. Patterns and implications of gene gain and loss in the evolution of Prochlorococcus. PLoS Genetics 3:2515-2528.

Kidd, K. and L. Sgaramella-Zonta. 1971. Phylogenetic analysis: concepts and methods. American Journal of Human Genetics 23:235-252.

Kimura, M. 1983. The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, UK.

Kirk, J. T. O. 1963. Deoxyribonucleic acid of broad bean chloroplasts. Biochimica et Biophysica Acta 76:417-424.

Kirkpatrick, S., C. D. Gelatt, and M. P. Vecchi. 1983. Optimization by simulated annealing. Science 220:671-680.

Kirschvink, J. L. 1992. Late Proterozoic low-latitude global glaciation: the Snowball Earth. Pages 51-52 in The Proterozoic Biosphere: a multidisciplinary study (J. W. Schopf, ed.). Cambridge University Press.

Kishino, H., T. Miyata, and M. Hasegawa. 1990. Maximum-likelihood inference of protein phylogeny and the origin of chloroplasts. Journal of Molecular Evolution 31:151-160.

Kitching, I., P. Forey, C. Humphries, and D. Williams. 1998. Cladistics: The theory and practice of parsimony analysis. 2nd ed. Oxford University Press, Oxford.

Knoll, A., W. J. Summons, R.E., and Z. J.E. 2007. The geological succession of primary producers in the oceans, chap. 8, Pages 133-163 in Evolution of Primary Producers in the Sea (P. Falkowski and A. Knoll, eds.). Elsevier Academic Press, Burlington MA.

Kondratieva, E., N. Pfennig, and H. Truper. 1992. The phototrophic prokaryotes. Pages 312-330 in The Prokaryotes (A. Balows, H. Truper, M. Dworkin, W. Harder, and K. Schleifer, eds.) 2nd ed. Springer, NY.

Koonin, E. V. 2003. Horizontal gene transfer: the path to maturity. Molecular Microbiology 50:725-727.

Koonin, E. V. 2005. Orthologs, paralogs, and evolutionary genomics. Annual Review of Genetics 39:309-338.

Koonin, E. V. 2007. The Biological Big Bang model for the major transitions in evolution. Biology Direct 2:21.

Koonin, E. V. 2009. Evolution of genome architecture. International Journal of Biochemistry and Cell Biology 41:298-306.

Koonin, E. V., K. S. Makarova, and L. Aravind. 2001. Horizontal gene transfer in prokaryotes: quantification and classification. Annual Review of Microbiology 55:709-742.

Koski, L. B. and G. B. Golding. 2001. The closest BLAST hit is often not the nearest neighbor. Journal of Molecular Evolution 52:540-542.

Koski, L. B., R. A. Morton, and G. B. Golding. 2001. Codon bias and base composition are poor indicators of horizontally transferred genes. Molecular Biology and Evolution 18:404-412.

Kotera, M., Y. Okuno, M. Hattori, S. Goto, and M. Kanehisa. 2004. Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions. Journal of the American Chemical Society 126:16487-16498.

Kunin, V., L. Goldovsky, N. Darzentas, and C. A. Ouzounis. 2005. The net of life: reconstructing the microbial phylogenetic network. Genome Research 15:954-959.

Kunin, V. and C. A. Ouzounis. 2003. The balance of driving forces during genome evolution in prokaryotes. Genome Research 13:1589-1594.

Kuzniar, A., R. van Ham, S. Pongor, and J. A. M. Leunissen. 2008. The quest for orthologs: finding the corresponding gene across genomes. Trends in Genetics 24:539-551.

La Roche, J., G. W. M. van der Staay, F. Partensky, A. Ducret, R. Aebersold, R. Li, S. S. Golden, R. G. Hiller, P. M. Wrench, A. W. D. Larkum, and B. R. Green. 1996. Independent evolution of the prochlorophyte and green plant chlorophyll a/b light-harvesting proteins. Proceedings of the National Academy of Sciences of the United States of America 93:15244-15248.

Lake, J. A. 1991. The order of sequence alignment can bias the selection of tree topology. Molecular Biology and Evolution 8:378-385.

Lake, J. A. 2009. Evidence for an early prokaryotic endosymbiosis. Nature 460:967-971.

Lake, J. A. and M. C. Rivera. 2004. Deriving the genomic tree of life in the presence of horizontal gene transfer: conditioned reconstruction. Molecular Biology and Evolution 21:681-690.

Lan, R. T. and P. R. Reeves. 2000. Intraspecies variation in bacterial genomes: the need for a species genome concept. Trends in Microbiology 8:396-401.

Landry, M. R., S. L. Brown, J. Neveux, C. Dupouy, J. Blanchot, S. Christensen, and R. R. Bidigare. 2003. Phytoplankton growth and microzooplankton grazing in high-nutrient, low-chlorophyll waters of the equatorial Pacific: community and taxon-specific rate assessments from pigment and flow cytometric analyses. Journal of Geophysical Research - Oceans 108:8142.

Lapointe, F. J. and G. Cucumel. 1997. The average consensus procedure: combination of weighted trees containing identical or overlapping sets of taxa. Systematic Biology 46:306-312.

Larkin, M. A., G. Blackshields, N. P. Brown, R. Chenna, P. A. McGettigan, H. McWilliam, F. Valentin, I. M. Wallace, A. Wilm, R. Lopez, J. D. Thompson, T. J. Gibson, and D. G. Higgins. 2007. Clustal W and Clustal X version 2.0. Bioinformatics 23:2947-2948.

Larkum, A. 2006. The evolution of chlorophylls and photosynthesis. Pages 261-282 in Chlorophylls and Bacteriochlorophylls: Biochemistry, Biophysics, Functions and Applications (B. Grimm, R. Porra, W. Rudiger, and H. Scheer, eds.). Springer, The Netherlands.

Larkum, A. 2008. The evolution of photosynthesis, chap. 22 in Primary Processes of Photosynthesis: Principles and Apparatus (G. Renger, ed.) vol. 8, Part 2 of Comprehensive Series in Photochemical and Photobiological Sciences. Cambridge.

Larkum, A. W. D. and M. Kuhl. 2005. Chlorophyll d: the puzzle resolved. Trends in Plant Science 10:355-357.

Larkum, A. W. D., P. J. Lockhart, and C. J. Howe. 2007. Shopping for plastids. Trends in Plant Science 12:189-195.

Lawrence, J. G. 1999. Gene transfer, speciation, and the evolution of bacterial genomes. Current Opinion in Microbiology 2:519-523.

Lawrence, J. G. and H. Hendrickson. 2005. Genome evolution in bacteria: order beneath chaos. Current Opinion in Microbiology 8:572-578.

Lecointre, G., H. Philippe, H. Van Le, and H. Le Guyader. 1993. Species sampling has a major impact on phylogenetic inference. Molecular Phylogenetics and Evolution 2:205-224.

Levasseur, C., P. A. Landry, V. Makarenkov, J. A. W. Kirsch, and F. J. Lapointe. 2003. Incomplete distance matrices, supertrees and bat phylogeny. Molecular Phylogenetics and Evolution 27:239-246.

Lewin, R. 1977. Prochloron, type genus of the Prochlorophyta. Phycologia 16:217.

Lewin, R. A. 1976. Prochlorophyta as a proposed new division of algae. Nature 261:697-698.

Lewin, R. A. 2002. Prochlorophyta: a matter of class distinctions. Photosynthesis Research 73:59-61.

Li, L., C. J. Stoeckert, and D. S. Roos. 2003. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Research 13:2178-2189.

Li, W. Z., F. Pio, K. Pawlowski, and A. Godzik. 2000. Saturated BLAST: an automated multiple intermediate sequence search used to detect distant homology. Bioinformatics 16:1105-1110.

Li, W.-H. and X. Gu. 1996. Estimating evolutionary distances between DNA sequences. U.K. edition, London.

Lindell, D., J. D. Jaffe, Z. I. Johnson, G. M. Church, and S. W. Chisholm. 2005. Photosynthesis genes in marine viruses yield proteins during host infection. Nature 438:86-89.

Lindell, D., M. B. Sullivan, Z. I. Johnson, A. C. Tolonen, F. Rohwer, and S. W. Chisholm. 2004. Transfer of photosynthesis genes to and from Prochlorococcus viruses. Proceedings of the National Academy of Sciences of the United States of America 101:11013-11018.

Linderstrøm-Lang, K. 1952. Proteins and enzymes vol. 6. Stanford University Press.

Liu, H. B., H. A. Nolla, and L. Campbell. 1997. Prochlorococcus growth rate and contribution to primary production in the equatorial and subtropical North Pacific Ocean. Aquatic Microbial Ecology 12:39-47.

Lockhart, P. J., T. J. Beanland, C. J. Howe, and A. W. D. Larkum. 1992a. Sequence of Prochloron didemni atpBE and the inference of chloroplast origins. Proceedings of the National Academy of Sciences of the United States of America 89:2742-2746.

Lockhart, P. J., C. J. Howe, D. A. Bryant, T. J. Beanland, and A. W. D. Larkum. 1992b. Substitutional bias confounds inference of cyanelle origins from sequence data. Journal of Molecular Evolution 34:153-162.

Lockhart, P. J., A. W. D. Larkum, M. A. Steel, P. J. Waddell, and D. Penny. 1996. Evolution of chlorophyll and bacteriochlorophyll: the problem of invariant sites in sequence analysis. Proceedings of the National Academy of Sciences of the United States of America 93:1930-1934.

Lugmair, G. W. and A. Shukolyukov. 2001. Early solar system events and timescales. Meteoritics and Planetary Science 36:1017-1026.

Lyons-Weiler, J. and K. Takahashi. 1999. Branch length heterogeneity leads to nonindependent branch length estimates and can decrease the efficiency of methods of phylogenetic inference. Journal of Molecular Evolution 49:392-405.

Mackenzie, J. and R. Haselkorn. 1972. Photosynthesis and development of blue-green algal virus SM-1. Virology 49:517-521.

Madigan, M., J. Martinko, and J. Parker. 1997. Brock Biology of Microorganisms. Prentice-Hall, Upper Saddle River, NJ.

Mann, E. L. and S. W. Chisholm. 2000. Iron limits the cell division rate of Prochlorococcus in the eastern equatorial Pacific. Limnology and Oceanography 45:1067-1076.

Mann, N. H. 2003. Phages of the marine cyanobacterial picophytoplankton. FEMS Microbiology Reviews 27:17-34.

Mann, N. H., A. Cook, A. Millard, S. Bailey, and M. Clokie. 2003. Marine ecosystems: bacterial photosynthesis genes in a virus. Nature 424:741-741.

Margulis, L. 1970. Origin of the eukaryotic cell. Yale University Press, New Haven, CT.

Marin, B., E. C. M. Nowack, G. Glöckner, and M. Melkonian. 2007. The ancestor of the Paulinella chromatophore obtained a carboxysomal operon by horizontal gene transfer from a Nitrococcus-like γ-proteobacterium. BMC Evolutionary Biology 7:85-98.

Marin, B., E. C. M. Nowack, and M. Melkonian. 2005. A plastid in the making: primary endosymbiosis. Protist 156:425-432.

Marri, P. R., W. Hao, and G. B. Golding. 2007. The role of laterally transferred genes in adaptive evolution. BMC Evolutionary Biology 7:S8.

Martin, W. 1999. Mosaic bacterial chromosomes: a challenge on route to a tree of genomes. Bioessays 21:99-104.

Martin, W. and K. V. Kowallik. 1999. Annotated English translation of Mereschkowsky’s 1905 paper “Über Natur und Ursprung der Chromatophoren im Pflanzenreiche”. European Journal of Phycology 34:287-295.

Mathis, P. 1990. Compared structure of plant and bacterial photosynthetic reaction centers: evolutionary implications. Biochimica et Biophysica Acta 1018:163-167.

Mau, B. 1996. Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Ph.D. thesis University of Wisconsin, Madison.

McFadden, G. I. and R. F. Waller. 1997. Plastids in parasites of humans. Bioessays 19:1033-1040.

McGinnis, S. and T. L. Madden. 2004. BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Research 32:W20-W25.

Medini, D., C. Donati, H. Tettelin, V. Masignani, and R. Rappuoli. 2005. The microbial pan-genome. Current Opinion in Genetics and Development 15:589-594.

Medrano-Soto, A., G. M. Moreno-Hagelsieb, P. Vinuesa, J. A. Christen, and J. Collado-Vides. 2004. Successful lateral transfer requires codon usage compatibility between foreign genes and recipient genomes. Molecular Biology and Evolution 21:1884-1894.

Mereschkowsky, C. 1905. Über Natur und Ursprung der Chromatophoren im Pflanzenreiche. Biol Centralbl 25:593-604.

Michel, D. J., H. 1988. Relevance of the photosynthetic reaction center from purple bacteria to the structure of photosystem II. Biochemistry 27:1-7.

Milkman, R. and M. M. Bridges. 1990. Molecular Evolution of the Escherichia-coli Chromosome. 3. Clonal Frames. Genetics 126:505-517.

Millard, A., M. R. J. Clokie, D. A. Shub, and N. H. Mann. 2004. Genetic organization of the psbAD region in phages infecting marine Synechococcus strains. Proceedings of the National Academy of Sciences of the United States of America 101:11007-11012.

Miller, S. R., S. Augustine, T. Le Olson, R. E. Blankenship, J. Selker, and A. M. Wood. 2005. Discovery of a free-living chlorophyll d-producing cyanobacterium with a hybrid proteobacterial/cyanobacterial small-subunit rRNA gene. Proceedings of the National Academy of Sciences of the United States of America 102:850-855.

Miyashita, H., H. Ikemoto, N. Kurano, S. Miyachi, and M. Chihara. 2003. Acaryochloris marina gen. et sp nov (Cyanobacteria), an oxygenic photosynthetic prokaryote containing Chl d as a major pigment. Journal of Phycology 39:1247-1253.

Mojzsis, S. J., G. Arrhenius, K. D. McKeegan, T. M. Harrison, A. P. Nutman, and C. R. L. Friend. 1996. Evidence for life on Earth before 3,800 million years ago. Nature 384:55-59.

Monger, B. C., M. R. Landry, and S. L. Brown. 1999. Feeding selection of heterotrophic marine nanoflagellates based on the surface hydrophobicity of their picoplankton prey. Limnology and Oceanography 44:1917-1927.

Montague, M. G. and C. A. Hutchison. 2000. Gene content phylogeny of herpesviruses. Proceedings of the National Academy of Sciences of the United States of America 97:5334-5339.

Mooers, A. O. and E. C. Holmes. 2000. The evolution of base composition and phylogenetic inference. Trends in Ecology and Evolution 15:365-369.

Moore, L. R. and S. W. Chisholm. 1999. Photophysiology of the marine cyanobacterium Prochlorococcus: ecotypic differences among cultured isolates. Limnology and Oceanography 44:628-638.

Moore, L. R., G. Rocap, and S. W. Chisholm. 1998. Physiology and molecular phylogeny of coexisting Prochlorococcus ecotypes. Nature 393:464-467.

Morel, A., Y. H. Ahn, F. Partensky, D. Vaulot, and H. Claustre. 1993. Prochlorococcus and Synechococcus: a comparative study of their optical properties in relation to their size and pigmentation. Journal of Marine Research 51:617-649.

Moreno-Hagelsieb, G. and K. Latimer. 2008. Choosing BLAST options for better detection of orthologs as reciprocal best hits. Bioinformatics 24:319-324.

Moulton, V. and M. Steel. 1999. Retractions of finite distance functions onto tree metrics. Discrete Applied Mathematics 91:215-233.

Mueller, L. and F. Ayala. 1982. Estimation and interpretation of genetic distance in empirical studies. Genetical Research 40:127-137.

Mulkidjanian, A. Y., E. V. Koonin, K. S. Makarova, S. L. Mekhedov, A. Sorokin, Y. I. Wolf, A. Dufresne, F. Partensky, H. Burd, D. Kaznadzey, R. Haselkorn, and M. Y. Galperin. 2006. The cyanobacterial genome core and the origin of photosynthesis. Proceedings of the National Academy of Sciences of the United States of America 103:13126-31.

Muller, T. and M. Vingron. 2000. Modeling amino acid replacement. Journal of Computational Biology 7:761-776.

Murray, S., M. Flo Jorgensen, S. Y. W. Ho, D. J. Patterson, and L. S. Jermiin. 2005. Improving the analysis of dinoflagellate phylogeny based on rDNA. Protist 156:269-286.

Nakamura, Y., T. Kaneko, S. Sato, M. Mimuro, H. Miyashita, T. Tsuchiya, S. Sasamoto, A. Watanabe, K. Kawashima, Y. Kishida, C. Kiyokawa, M. Kohara, M. Matsumoto, A. Matsuno, N. Nakazaki, S. Shimpo, C. Takeuchi, M. Yamada, and S. Tabata. 2003. Complete genome structure of Gloeobacter violaceus PCC 7421, a cyanobacterium that lacks thylakoids. DNA Research 10:137-145.

NC-IUBMB. 2010. Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) http://www.chem.qmul.ac.uk/iubmb/enzyme/.

NCBI website. 2005. BLAST information http://www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html, accessed 12 August 2005.

Needleman, S. and C. D. Wunsch. 1970. A general method applicable to search for similarities in amino acid sequence of two proteins. Journal of Molecular Biology 48:443-453.

Nei, M. and S. Kumar. 2000. Molecular Evolution and Phylogenetics. Oxford University Press.

Nelson, K. E., R. A. Clayton, S. R. Gill, M. L. Gwinn, R. J. Dodson, D. H. Haft, E. K. Hickey, L. D. Peterson, W. C. Nelson, K. A. Ketchum, L. McDonald, T. R. Utterback, J. A. Malek, K. D. Linher, M. M. Garrett, A. M. Stewart, M. D. Cotton, M. S. Pratt, C. A. Phillips, D. Richardson, J. Heidelberg, G. G. Sutton, R. D. Fleischmann, J. A. Eisen, O. White, S. L. Salzberg, H. O. Smith, J. C. Venter, and C. M. Fraser. 1999. Evidence for lateral gene transfer between Archaea and Bacteria from genome sequence of Thermotoga maritima. Nature 399:323-329.

Neyman, J. 1971. Molecular studies of evolution: a source of novel statistical problems. Pages 1-27 in Statistical Decision Theory and Related Topics (S. Gupta and J. Yackel, eds.). Academic Press, New York.

Nisbet, E. G. 1987. The Young Earth: An Introduction to Archaean Geology. Cambridge Univ. Press, Cambridge.

Nisbet, E. G. and C. M. R. Fowler. 1996. Early life - some liked it hot. Nature 382:404-405.

Nisbet, E. G. and N. H. Sleep. 2001. The habitat and nature of early life. Nature 409:1083-1091.

Notredame, C., D. G. Higgins, and J. Heringa. 2000. T-Coffee: a novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology 302:205-217.

Ochman, H., J. G. Lawrence, and E. A. Groisman. 2000. Lateral gene transfer and the nature of bacterial innovation. Nature 405:299-304.

Ogata, H., S. Goto, K. Sato, W. Fujibuchi, H. Bono, and M. Kanehisa. 1999. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 27:29-34.

Okamoto, N. and G. I. McFadden. 2008. The mother of all parasites. Future Microbiology 3:391-395.

Olsen, G. J. and C. R. Woese. 1993. Ribosomal RNA: a key to phylogeny. The Federation of American Societies for Experimental Biology Journal 7:113-123.

Olsen, G. J. and C. R. Woese. 1996. Lessons from an Archaeal genome: what are we learning from Methanococcus jannaschii? Trends in Genetics 12:377-379.

Olsen, G. J., C. R. Woese, and R. Overbeek. 1994. The winds of (evolutionary) change: breathing new life into microbiology. Journal of Bacteriology 176:1-6.

Padan, E. and M. Shilo. 1973. Cyanophages: viruses attacking blue-green algae. Bacteriological Reviews 37:343-370.

Page, R. and E. Holmes. 2003. Molecular evolution: a phylogenetic approach. Blackwell Publishing, Oxford.

Palenik, B., B. Brahamsha, F. W. Larimer, M. Land, L. Hauser, P. Chain, J. Lamerdin, W. Regala, E. E. Allen, J. McCarren, I. Paulsen, A. Dufresne, F. Partensky, E. A. Webb, and J. Waterbury. 2003. The genome of a motile marine Synechococcus. Nature 424:1037-1042.

Palenik, B. and R. Haselkorn. 1992. Multiple evolutionary origins of prochlorophytes, the chlorophyll b-containing prokaryotes. Nature 355:265-267.

Palmer, J. D. 2003. The symbiotic birth and spread of plastids: how many times and whodunit? Journal of Phycology 39:4-11.

Partensky, F., W. R. Hess, and D. Vaulot. 1999. Prochlorococcus, a marine photosynthetic prokaryote of global significance. Microbiology and Molecular Biology Reviews 63:106-127.

Partensky, F., J. LaRoche, K. Wyman, and P. G. Falkowski. 1997. The divinyl-chlorophyll a/b-protein complexes of two strains of the oxyphototrophic marine prokaryote Prochlorococcus: characterization and response to changes in growth irradiance. Photosynthesis Research 51:209-222.

Penny, D. and M. Hendy. 1985. Testing methods of evolutionary tree construction. Cladistics 1:266-278.

Penny, D. and M. Hendy. 1986. Estimating the reliability of evolutionary trees. Molecular Biology and Evolution 3:403-417.

Penny, D., P. Lockhart, M. Steel, and M. Hendy. 1994. The role of models in reconstructing evolutionary trees. Pages 211-230 in Models in Phylogenetic Reconstruction (R. Scotland, D. Siebert, and D. Williams, eds.). Clarendon Press, Oxford.

Perna, N. T., G. Plunkett, V. Burland, B. Mau, J. D. Glasner, D. J. Rose, G. F. Mayhew, P. S. Evans, J. Gregor, H. A. Kirkpatrick, G. Posfai, J. Hackett, S. Klink, A. Boutin, Y. Shao, L. Miller, E. J. Grotbeck, N. W. Davis, A. Lim, E. T. Dimalanta, K. D. Potamousis, J. Apodaca, T. S. Anantharaman, J. Y. Lin, G. Yen, D. C. Schwartz, R. A. Welch, and F. R. Blattner. 2001. Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature 409:529-533.

Philippe, H. and E. Douzery. 1994. The pitfalls of molecular phylogeny based on four species, as illustrated by the cetacea/artiodactyla relationships. Journal of Mammalian Evolution 2:133-152.

Philippe, H. and P. Forterre. 1999. The rooting of the universal Tree of Life is not reliable. Journal of Molecular Evolution 49:509-523.

Philippe, H. and J. Laurent. 1998. How good are deep phylogenetic trees? Current Opinion in Genetics and Development 8:616-623.

Philippe, H., Y. Zhou, H. Brinkmann, N. Rodrigue, and F. Delsuc. 2005. Heterotachy and long-branch attraction in phylogenetics. BMC Evolutionary Biology 5.

Phillips, M. J., F. Delsuc, and D. Penny. 2004. Genome-scale phylogeny and the detection of systematic biases. Molecular Biology and Evolution 21:1455-1458.

Pisani, D., J. A. Cotton, and J. O. McInerney. 2007. Supertrees disentangle the chimerical origin of eukaryotic genomes. Molecular Biology and Evolution 24:1752-1760.

Poe, S. 2003. Evaluation of the strategy of long-branch subdivision to improve the accuracy of phylogenetic methods. Systematic Biology 52:423-428.

Poe, S. and D. L. Swofford. 1999. Taxon sampling revisited. Nature 398:299-300.

Pollock, D. D. and W. J. Bruno. 2000. Assessing an unknown evolutionary process: effect of increasing site-specific knowledge through taxon addition. Molecular Biology and Evolution 17:1854-1858.

Pollock, D. D., W. R. Taylor, and N. Goldman. 1999. Coevolving protein residues: maximum likelihood identification and relationship to structure. Journal of Molecular Biology 287:187-198.

Pollock, D. D., D. J. Zwickl, J. A. McGuire, and D. M. Hillis. 2002. Increased taxon sampling is advantageous for phylogenetic inference. Systematic Biology 51:664-671.

Pond, S. L. K. and S. D. W. Frost. 2005a. Datamonkey: rapid detection of selective pressure on individual sites of codon alignments. Bioinformatics 21:2531-2533. Available from www.datamonkey.org.

Pond, S. L. K. and S. D. W. Frost. 2005b. A genetic algorithm approach to detecting lineage-specific variation in selection pressure. Molecular Biology and Evolution 22:478-485.

Pond, S. L. K. and S. D. W. Frost. 2005c. Not so different after all: A comparison of methods for detecting amino acid sites under selection. Molecular Biology and Evolution 22:1208-1222.

Pond, S. L. K., S. D. W. Frost, and S. V. Muse. 2005. HyPhy: hypothesis testing using phylogenies. Bioinformatics 21:676-679.

Pond, S. L. K., D. Posada, M. B. Gravenor, C. H. Woelk, and S. D. W. Frost. 2006. Automated phylogenetic detection of recombination using a genetic algorithm. Molecular Biology and Evolution 23:1891-1901.

Poptsova, M. S. and J. P. Gogarten. 2007. The power of phylogenetic approaches to detect horizontally transferred genes. BMC Evolutionary Biology 7:17.

Poptsova, M. S. and J. P. Gogarten. 2010. Using comparative genome analysis to identify problems in annotated microbial genomes. Microbiology-SGM 156:1909-1917.

Posada, D. and K. A. Crandall. 1998. MODELTEST: testing the model of DNA substitution. Bioinformatics 14:817-818.

Posada, D. and K. A. Crandall. 2002. The effect of recombination on the accuracy of phylogeny estimation. Journal of Molecular Evolution 54:396-402.

Pruitt, K. D., T. Tatusova, and D. R. Maglott. 2005. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research 33:D501-D504.

R Development Core Team. 2009. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Rambaut, A. and A. Drummond. 2007. Tracer v1.4. Available from http://beast.bio.ed.ac.uk/Tracer.

Rannala, B., J. P. Huelsenbeck, Z. H. Yang, and R. Nielsen. 1998. Taxon sampling and the accuracy of large phylogenies. Systematic Biology 47:702-710.

Rannala, B. and Z. H. Yang. 1996. Probability distribution of molecular evolutionary trees: A new method of phylogenetic inference. Journal of Molecular Evolution 43:304-311.

Ranwez, V. and O. Gascuel. 2001. Quartet-based phylogenetic inference: improvements and limits. Molecular Biology and Evolution 18:1103-1116.

Rappe, M. S. and S. J. Giovannoni. 2003. The uncultured microbial majority. Annual Review of Microbiology 57:369-394.

Rasmussen, B., I. R. Fletcher, J. J. Brocks, and M. R. Kilburn. 2008. Reassessing the first appearance of eukaryotes and cyanobacteria. Nature 455:1101-U9.

Raven, J. A. and J. F. Allen. 2003. Genomics and chloroplast evolution: what did cyanobacteria do for plants? Genome Biology 4(3):209.

Raymond, J. and R. E. Blankenship. 2008. The origin of the oxygen-evolving complex. Coordination Chemistry Reviews 252:377-383.

Raymond, J. and D. Segre. 2006. The effect of oxygen on biochemical networks and the evolution of complex life. Science 311:1764-1767.

Raymond, J., O. Zhaxybayeva, J. P. Gogarten, and R. E. Blankenship. 2003a. Evolution of photosynthetic prokaryotes: a maximum-likelihood mapping approach. Philosophical Transactions of the Royal Society of London Series B-Biological Sciences 358:223-230.

Raymond, J., O. Zhaxybayeva, J. P. Gogarten, and R. E. Blankenship. 2003b. Evolution of photosynthetic prokaryotes: a maximum-likelihood mapping approach. Philosophical Transactions of the Royal Society of London Series B-Biological Sciences 358:223-230.

Raymond, J., O. Zhaxybayeva, J. P. Gogarten, S. Y. Gerdes, and R. E. Blankenship. 2002. Whole-genome analysis of photosynthetic prokaryotes. Science 298:1616-1620.

Ren, Q. H. and A. T. Paulsen. 2005. Comparative analyses of fundamental differences in membrane transport capabilities in prokaryotes and eukaryotes. PLoS Computational Biology 1:190-201.

Ridley, M. 2004. Evolution. 3rd ed. Blackwell Publishing, Oxford.

Rippka, R., T. Coursin, W. Hess, C. Lichtle, D. J. Scanlan, K. A. Palinska, I. Iteman, F. Partensky, J. Houmard, and M. Herdman. 2000. Prochlorococcus marinus Chisholm et al. 1992 subsp pastoris subsp nov strain PCC 9511, the first axenic chlorophyll a(2)/b(2)-containing cyanobacterium (Oxyphotobacteria). International Journal of Systematic and Evolutionary Microbiology 50:1833-1847.

Ris, H. and W. Plaut. 1962. Ultrastructure of DNA-containing areas in chloroplast of Chlamydomonas. Journal of Cell Biology 13:383-391.

Rivera, M. C., R. Jain, J. E. Moore, and J. A. Lake. 1998. Genomic evidence for two functionally distinct gene classes. Proceedings of the National Academy of Sciences of the United States of America 95:6239-6244.

Rocap, G., D. L. Distel, J. B. Waterbury, and S. W. Chisholm. 2002. Resolution of Prochlorococcus and Synechococcus ecotypes by using 16S-23S ribosomal DNA internal transcribed spacer sequences. Applied and Environmental Microbiology 68:1180-1191.

Rocap, G., F. W. Larimer, J. Lamerdin, S. Malfatti, P. Chain, N. A. Ahlgren, A. Arellano, M. Coleman, L. Hauser, W. R. Hess, Z. I. Johnson, M. Land, D. Lindell, A. F. Post, W. Regala, M. Shah, S. L. Shaw, C. Steglich, M. B. Sullivan, C. S. Ting, A. Tolonen, E. A. Webb, E. R. Zinser, and S. W. Chisholm. 2003. Genome divergence in two Prochlorococcus ecotypes reflects oceanic niche differentiation. Nature 424:1042-1047.

Rodriguez-Ezpeleta, N., H. Brinkmann, S. C. Burey, B. Roure, G. Burger, W. Loffelhardt, H. J. Bohnert, H. Philippe, and B. F. Lang. 2005. Monophyly of primary photosynthetic eukaryotes: green plants, red algae, and glaucophytes. Current Biology 15:1325-1330.

Rohwer, F. and R. V. Thurber. 2009. Viruses manipulate the marine environment. Nature 459:207-212.

Ronquist, F. and J. P. Huelsenbeck. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572-1574.

Rumpho, M. E., J. M. Worful, J. Lee, K. Kannan, M. S. Tyler, D. Bhattacharya, A. Moustafa, and J. R. Manhart. 2008. Horizontal gene transfer of the algal nuclear gene psbO to the photosynthetic sea slug Elysia chlorotica. Proceedings of the National Academy of Sciences of the United States of America 105:17867-17871.

Rye, R. and H. D. Holland. 1998. Paleosols and the evolution of atmospheric oxygen: a critical review. American Journal of Science 298:621-672.

Rzhetsky, A. and M. Nei. 1992a. A simple method for estimating and testing minimum-evolution trees. Molecular Biology and Evolution 9:945-967.

Rzhetsky, A. and M. Nei. 1992b. Statistical properties of the ordinary least-squares, generalized least-squares, and minimum-evolution methods of phylogenetic inference. Journal of Molecular Evolution 35:367-375.

Rzhetsky, A. and M. Nei. 1993. Theoretical foundation of the minimum-evolution method of phylogenetic inference. Molecular Biology and Evolution 10:1073-1095.

Sadekar, S., J. Raymond, and R. E. Blankenship. 2006. Conservation of distantly related membrane proteins: photosynthetic reaction centers share a common structural core. Molecular Biology and Evolution 23:2001-2007.

Safferman, R. S. and M. E. Morris. 1963. Algal Virus: isolation. Science 140(3567):679-680.

Saitou, N. and M. Nei. 1987. The Neighbor-Joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4:406-425.

Salemi, M., T. De Oliveira, V. Courgnaud, V. Moulton, B. Holland, S. Cassol, W. M. Switzer, and A. M. Vandamme. 2003. Mosaic genomes of the six major primate lentivirus lineages revealed by phylogenetic analyses. Journal of Virology 77:7202-7213.

San Mauro, D. and A. Agorreta. 2010. Molecular systematics: a synthesis of the common methods and the state of knowledge. Cellular and Molecular Biology Letters 15:311-341.

Sanchez-Baracaldo, P., B. A. Handley, and P. K. Hayes. 2008. Picocyanobacterial community structure of freshwater lakes and the Baltic Sea revealed by phylogenetic analyses and clade-specific quantitative PCR. Microbiology-SGM 154:3347-3357.

Sandaa, R. A. 2008. Burden or benefit? Virus-host interactions in the marine environment. Research in Microbiology 159:374-381.

Sano, E., S. Carlson, L. Wegley, and F. Rohwer. 2004. Movement of viruses between biomes. Applied and Environmental Microbiology 70:5842-5846.

Satoh, S., M. Ikeuchi, M. Mimuro, and A. Tanaka. 2001. Chlorophyll b expressed in cyanobacteria functions as a light harvesting antenna in photosystem I through flexibility of the proteins. Journal of Biological Chemistry 276:4293-4297.

Sattley, W. M., M. T. Madigan, W. D. Swingley, P. C. Cheung, K. M. Clocksin, A. L. Conrad, L. C. Dejesa, B. M. Honchak, D. O. Jung, L. E. Karbach, A. Kurdoglu, S. Lahiri, S. D. Mastrian, L. E. Page, H. L. Taylor, Z. T. Wang, J. Raymond, M. Chen, R. E. Blankenship, and J. W. Touchman. 2008. The genome of Heliobacterium modesticaldum, a phototrophic representative of the Firmicutes containing the simplest photosynthetic apparatus. Journal of Bacteriology 190:4687-4696.

Savill, N. J., D. C. Hoyle, and P. G. Higgs. 2001. RNA sequence evolution with secondary structure constraints: Comparison of substitution rate models using maximum-likelihood methods. Genetics 157:399-411.

Scheffler, K., D. P. Martin, and C. Seoighe. 2006. Robust inference of positive selection from recombining coding sequences. Bioinformatics 22:2493-2499.

Schidlowski, M. 1988. A 3,800-million-year isotopic record of life from carbon in sedimentary rocks. Nature 333:313-318.

Schidlowski, M. and P. Aharon. 1992. Early Organic Evolution: Implications for Mineral and Energy Resources Pages 147-175. Springer, Berlin.

Schierup, M. H. and J. Hein. 2000a. Consequences of recombination on traditional phylogenetic analysis. Genetics 156:879-891.

Schierup, M. H. and J. Hein. 2000b. Recombination and the molecular clock. Molecular Biology and Evolution 17:1578-1579.

Schimper, A. 1883. Ueber die Entwickelung der Chlorophyllkoerner und Farbkoerper. Botanische Zeitung 41:105-113.

Schmidt, H. A., K. Strimmer, M. Vingron, and A. von Haeseler. 2002. TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18:502-504.

Schubert, W. D., O. Klukas, W. Saenger, H. T. Witt, P. Fromme, and N. Krauss. 1998. A common ancestor for oxygenic and anoxygenic photosynthetic systems: a comparison based on the structural model of photosystem I. Journal of Molecular Biology 280:297-314.

Schwartz, R. and M. Dayhoff. 1978. Matrices for detecting distant relationships. Pages 353-358 in Atlas of Protein Sequence and Structure (D. Dayhoff, ed.) vol. 5. suppl. 3. National Biomedical Research Foundation, Washington, D.C.

Schwarz, G. 1978. Estimating dimension of a model. Annals of Statistics 6:461-464.

Shapiro, B., A. Rambaut, and A. J. Drummond. 2006. Choosing appropriate substitution models for the phylogenetic analysis of protein-coding sequences. Molecular Biology and Evolution 23:7-9.

Sharon, I., S. Tzahor, S. Williamson, M. Shmoish, D. Man-Aharonovich, D. B. Rusch, S. Yooseph, G. Zeidner, S. S. Golden, S. R. Mackey, N. Adir, U. Weingart, D. Horn, J. C. Venter, Y. Mandel-Gutfreund, and O. Beja. 2007. Viral photosynthetic reaction center genes and transcripts in the marine environment. The ISME Journal 1:492-501.

She, R., J. S. C. Chu, K. Wang, J. Pei, and N. S. Chen. 2009. genBlastA: enabling BLAST to identify homologous gene sequences. Genome Research 19:143-149.

Sherman, L. A. 1976. Infection of Synechococcus cedrorum by Cyanophage AS-1M. III: cellular-metabolism and phage development. Virology 71(1):199-206.

Shi, T. and P. G. Falkowski. 2008. Genome evolution in cyanobacteria: the stable core and the variable shell. Proceedings of the National Academy of Sciences of the United States of America 105:2510-2515.

Shimada, A., N. Yano, S. Kanai, R. A. Lewin, and T. Maruyama. 2003. Molecular phylogenetic relationship between two symbiotic photo-oxygenic prokaryotes, Prochloron sp and Synechocystis trididemni. Phycologia 42:193-197.

Siddall, M. 1998. Success of parsimony in the four-taxon case: long branch repulsion by likelihood in the Farris Zone. Cladistics 14:209-220.

Sime-Ngando, T. and J. Colombet. 2009. Viruses and prophages in aquatic ecosystems. Canadian Journal of Microbiology 55:95-109.

Sims, G. E., S. R. Jun, G. A. Wu, and S. H. Kim. 2009. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proceedings of the National Academy of Sciences of the United States of America 106:2677-2682.

Sleep, N. H., K. J. Zahnle, J. F. Kasting, and H. J. Morowitz. 1989. Annihilation of ecosystems by large asteroid impacts on the early Earth. Nature 342:139-142.

Smith, A. B. 1994. Rooting molecular trees: problems and strategies. Biological Journal of the Linnean Society 51:279-292.

Sorek, R., Y. W. Zhu, C. J. Creevey, M. P. Francino, P. Bork, and E. M. Rubin. 2007. Genome-wide experimental determination of barriers to horizontal gene transfer. Science 318:1449-1452.

St John, K., T. Warnow, B. M. E. Moret, and L. Vawter. 2003. Performance study of phylogenetic methods: (unweighted) quartet methods and neighbor-joining. Journal of Algorithms 48:173-193.

States, D., W. Gish, and S. Altschul. 1991. Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods 3:66-70.

Steel, M. and D. Penny. 2000. Parsimony, likelihood, and the role of models in molecular phylogenetics. Molecular Biology and Evolution 17:839-850.

Steel, M., L. Szekely, and M. Hendy. 1994. Reconstructing trees from sequences whose sites evolve at variable rates. Journal of Computational Biology 1:153-163.

Stiller, J. W., D. C. Reel, and J. C. Johnson. 2003. A single origin of plastids revisited: convergent evolution in organellar genome content. Journal of Phycology 39:95-105.

Stix, V. 2004. Finding all maximal cliques in dynamic graphs. Computational Optimization and Applications 27:173-186.

Strimmer, K. and A. Rambaut. 2002. Inferring confidence sets of possibly misspecified gene trees. Proceedings of the Royal Society of London Series B-Biological Sciences 269:137-142.

Strimmer, K. and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Molecular Biology and Evolution 13:964-969.

Studier, J. A. and K. J. Keppler. 1988. A note on the Neighbor-Joining algorithm of Saitou and Nei. Molecular Biology and Evolution 5:729-731.

Subramanian, A. R., J. Weyer-Menkhoff, M. Kaufmann, and B. Morgenstern. 2005. DIALIGN-T: an improved algorithm for segment-based multiple sequence alignment. BMC Bioinformatics 6:66.

Suerbaum, S., J. M. Smith, K. Bapumia, G. Morelli, N. H. Smith, E. Kunstmann, I. Dyrek, and M. Achtman. 1998. Free recombination within Helicobacter pylori. Proceedings of the National Academy of Sciences of the United States of America 95:12619-12624.

Sullivan, J., D. L. Swofford, and G. J. P. Naylor. 1999. The effect of taxon sampling on estimating rate heterogeneity parameters of maximum-likelihood models. Molecular Biology and Evolution 16:1347-1356.

Sullivan, M. B., M. L. Coleman, P. Weigele, F. Rohwer, and S. W. Chisholm. 2005. Three Prochlorococcus cyanophage genomes: signature features and ecological interpretations. PLoS Biology 3:790-806.

Sullivan, M. B., D. Lindell, J. A. Lee, L. R. Thompson, J. P. Bielawski, and S. W. Chisholm. 2006. Prevalence and evolution of core photosystem II genes in marine cyanobacterial viruses and their hosts. PLoS Biology 4:1344-1357.

Sullivan, M. B., J. B. Waterbury, and S. W. Chisholm. 2003. Cyanophages infecting the oceanic cyanobacterium Prochlorococcus. Nature 424:1047-1051.

Summons, R. E., L. L. Jahnke, J. M. Hope, and G. A. Logan. 1999. 2-Methylhopanoids as biomarkers for cyanobacterial oxygenic photosynthesis. Nature 400:554-557.

Susko, E., Y. Inagaki, and A. J. Roger. 2004. On inconsistency of the Neighbor-Joining, least squares, and minimum evolution estimation when substitution processes are incorrectly modeled. Molecular Biology and Evolution 21:1629-1642.

Susko, E., J. Leigh, W. F. Doolittle, and E. Bapteste. 2006. Visualizing and assessing phylogenetic congruence of core gene sets: a case study of the γ-proteobacteria. Molecular Biology and Evolution 23:1019-1030.

Susko, E., M. Spencer, and A. J. Roger. 2005. Biases in phylogenetic estimation can be caused by random sequence segments. Journal of Molecular Evolution 61:351-359.

Suttle, C. A. 2005. Viruses in the sea. Nature 437:356-361.

Suttle, C. A. and A. M. Chan. 1994. Dynamics and distribution of cyanophages and their effect on marine Synechococcus spp. Applied and Environmental Microbiology 60:3167-3174.

Swingley, W. D., M. Chen, P. C. Cheung, A. L. Conrad, L. C. Dejesa, J. Hao, B. M. Honchak, L. E. Karbach, A. Kurdoglu, S. Lahiri, S. D. Mastrian, H. Miyashita, L. Page, P. Ramakrishna, S. Satoh, W. M. Sattley, Y. Shimada, H. L. Taylor, T. Tomo, T. Tsuchiya, Z. T. Wang, J. Raymond, M. Mimuro, R. E. Blankenship, and J. W. Touchman. 2008. Niche adaptation and genome expansion in the chlorophyll d-producing cyanobacterium Acaryochloris marina. Proceedings of the National Academy of Sciences of the United States of America 105:2005-2010.

Swofford, D. 2003. PAUP*: Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sinauer Associates, Sunderland, Massachusetts, http://paup.csit.fsu.edu/paupfaq/faq.html.

Swofford, D., G. Olsen, P. Waddell, and D. Hillis. 1996. Phylogenetic inference. Pages 407-514 in Molecular Systematics (D. Hillis, C. Moritz, and B. Mable, eds.). Sinauer, Sunderland, Mass.

Swofford, D. L. and G. Olsen. 1990. Phylogeny Reconstruction chap. 11, Pages 411-501. Sinauer Associates.

Tatusov, R. L., N. D. Fedorova, J. D. Jackson, A. R. Jacobs, B. Kiryutin, E. V. Koonin, D. M. Krylov, R. Mazumder, S. L. Mekhedov, A. N. Nikolskaya, B. S. Rao, S. Smirnov, A. V. Sverdlov, S. Vasudevan, Y. I. Wolf, J. J. Yin, and D. A. Natale. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4.

Tatusov, R. L., M. Y. Galperin, D. A. Natale, and E. V. Koonin. 2000. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Research 28:33-36.

Tatusov, R. L., E. V. Koonin, and D. J. Lipman. 1997. A genomic perspective on protein families. Science 278:631-637.

Tatusov, R. L., D. A. Natale, I. V. Garkavtsev, T. A. Tatusova, U. T. Shankavaram, B. S. Rao, B. Kiryutin, M. Y. Galperin, N. D. Fedorova, and E. V. Koonin. 2001. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Research 29:22-28.

Tavare, S. 1986. Some probabilistic and statistical problems on the analysis of DNA sequences. Lectures on Mathematics in the Life Sciences 17:57-86.

Telford, M. J., M. J. Wise, and V. Gowri-Shankar. 2005. Consideration of RNA secondary structure significantly improves likelihood-based estimates of phylogeny: examples from the bilateria. Molecular Biology and Evolution 22:1129-1136.

Templeton, A. 1989. The meaning of species and speciation: a genetic perspective. Pages 3-27 in Speciation and its Consequences. Sinauer: Sunderland, Mass.

Tettelin, H., V. Masignani, M. J. Cieslewicz, C. Donati, D. Medini, N. L. Ward, S. V. Angiuoli, J. Crabtree, A. L. Jones, A. S. Durkin, R. T. DeBoy, T. M. Davidsen, M. Mora, M. Scarselli, I. M. Y. Ros, J. D. Peterson, C. R. Hauser, J. P. Sundaram, W. C. Nelson, R. Madupu, L. M. Brinkac, R. J. Dodson, M. J. Rosovitz, S. A. Sullivan, S. C. Daugherty, D. H. Haft, J. Selengut, M. L. Gwinn, L. W. Zhou, N. Zafar, H. Khouri, D. Radune, G. Dimitrov, K. Watkins, K. J. B. O'Connor, S. Smith, T. R. Utterback, O. White, C. E. Rubens, G. Grandi, L. C. Madoff, D. L. Kasper, J. L. Telford, M. R. Wessels, R. Rappuoli, and C. M. Fraser. 2005. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proceedings of the National Academy of Sciences of the United States of America 102:16530-16530.

Thajuddin, N. and G. Subramanian. 2005. Cyanobacterial biodiversity and potential applications in biotechnology. Current Science 89:47-57.

Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. Clustal-W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22:4673-4680.

Ting, C. S., G. Rocap, J. King, and S. W. Chisholm. 2001. Phycobiliprotein genes of the marine photosynthetic prokaryote Prochlorococcus: evidence for rapid evolution of genetic heterogeneity. Microbiology-SGM 147:3171-3182.

Ting, C. S., G. Rocap, J. King, and S. W. Chisholm. 2002. Cyanobacterial photosynthesis in the oceans: the origins and significance of divergent light-harvesting strategies. Trends in Microbiology 10:134-142.

Tomitani, A., A. H. Knoll, C. M. Cavanaugh, and T. Ohno. 2006. The evolutionary diversification of cyanobacteria: molecular-phylogenetic and paleontological perspectives. Proceedings of the National Academy of Sciences of the United States of America 103:5442-5447.

Tomitani, A., K. Okada, H. Miyashita, H. C. P. Matthijs, T. Ohno, and A. Tanaka. 1999. Chlorophyll b and phycobilins in the common ancestor of cyanobacteria and chloroplasts. Nature 400:159-162.

Tomo, T., T. Okubo, S. Akimoto, M. Yokono, H. Miyashita, T. Tsuchiya, T. Noguchi, and M. Mimuro. 2007. Identification of the special pair of photosystem II in a chlorophyll d-dominated cyanobacterium. Proceedings of the National Academy of Sciences of the United States of America 104:7283-7288.

Tripp, H. J., S. R. Bench, K. A. Turk, R. A. Foster, B. A. Desany, F. Niazi, J. P. Affourtit, and J. P. Zehr. 2010. Metabolic streamlining in an open-ocean nitrogen-fixing cyanobacterium. Nature 464:90-94.

Tuffley, C. and M. Steel. 1997. Links between maximum likelihood and maximum parsimony under a simple model of site substitution. Bulletin of Mathematical Biology 59:581-607.

Tuffley, C. and M. Steel. 1998. Modeling the covarion hypothesis of nucleotide substitution. Mathematical Biosciences 147:63-91.

Turner, S., T. Burger-Wiersma, S. J. Giovannoni, L. R. Mur, and N. R. Pace. 1989. The relationship of a prochlorophyte Prochlorothrix hollandica to green chloroplasts. Nature 337:380-382.

Ueda, K., T. Seki, T. Kudo, T. Yoshida, and M. Kataoka. 1999. Two distinct mechanisms cause heterogeneity of 16S rRNA. Journal of Bacteriology 181:78-82.

Urbach, E., D. L. Robertson, and S. W. Chisholm. 1992. Multiple evolutionary origins of prochlorophytes within the cyanobacterial radiation. Nature 355:267-270.

Urbach, E., D. J. Scanlan, D. L. Distel, J. B. Waterbury, and S. W. Chisholm. 1998. Rapid diversification of marine picophytoplankton with dissimilar light-harvesting structures inferred from sequences of Prochlorococcus and Synechococcus (Cyanobacteria). Journal of Molecular Evolution 46:188-201.

van der Heijden, R. T. J. M., B. Snel, V. van Noort, and M. A. Huynen. 2007. Orthology prediction at scalable resolution by phylogenetic tree analysis. BMC Bioinformatics 8:83.

van der Staay, G. W. M., N. Yurkova, and B. R. Green. 1998. The 38 kDa chlorophyll a/b protein of the prokaryote Prochlorothrix hollandica is encoded by a divergent pcb gene. Plant Molecular Biology 36:709-716.

van Dongen, S. 2000a. A cluster algorithm for graphs. Tech. Rep. INS-R0010 National Research Institute for Mathematics and Computer Science in the Netherlands.

van Dongen, S. 2000b. Graph Clustering by Flow Simulation. Ph.D. thesis University of Utrecht, Utrecht, Netherlands.

van Dongen, S. 2008. Graph clustering via a discrete uncoupling process. SIAM Journal on Matrix Analysis and Applications 30:121-141.

Vetsigian, K., C. Woese, and N. Goldenfeld. 2006. Collective evolution and the genetic code. Proceedings of the National Academy of Sciences of the United States of America 103:10696-10701.

Vinga, S. and J. Almeida. 2003. Alignment-free sequence comparison: a review. Bioinformatics 19:513-523.

Waddell, P. 1996. Statistical methods of phylogenetic analysis. Ph.D. thesis Massey University, Palmerston North, New Zealand.

Wall, D. P., H. B. Fraser, and A. E. Hirsh. 2003. Detecting putative orthologs. Bioinformatics 19:1710-1711.

Wang, G. L. and R. L. Dunbrack. 2003. PISCES: a protein sequence culling server. Bioinformatics 19:1589-1591.

Wang, H., M. Spencer, E. Susko, and A. Roger. 2007. Testing for covarion-like evolution in protein sequences. Molecular Biology and Evolution 24:294-305.

Ward, D. M., F. M. Cohan, D. Bhaya, J. F. Heidelberg, M. Kuhl, and A. Grossman. 2008. Genomics, environmental genomics and the issue of microbial species. Heredity 100:207-219.

Waterbury, J. B., S. W. Watson, R. R. L. Guillard, and L. E. Brand. 1979. Widespread occurrence of a unicellular, marine, planktonic, cyanobacterium. Nature 277:293-294.

Watson, E. B. and T. M. Harrison. 2005. Zircon thermometer reveals minimum melting conditions on earliest Earth. Science 308:841-844.

Wayne, L. G., D. J. Brenner, R. R. Colwell, P. A. D. Grimont, O. Kandler, M. I. Krichevsky, L. H. Moore, W. E. C. Moore, R. G. E. Murray, E. Stackebrandt, M. P. Starr, and H. G. Truper. 1987. Report of the ad hoc committee on reconciliation of approaches to bacterial systematics. International Journal of Systematic Bacteriology 37:463-464.

Webster, M. T., N. G. C. Smith, and H. Ellegren. 2002. Microsatellite evolution inferred from human-chimpanzee genomic sequence alignments. Proceedings of the National Academy of Sciences of the United States of America 99:8748-8753.

Weigele, P. R., W. H. Pope, M. L. Pedulla, J. M. Houtz, A. L. Smith, J. F. Conway, J. King, G. F. Hatfull, J. G. Lawrence, and R. W. Hendrix. 2007. Genomic and structural analysis of Syn9, a cyanophage infecting marine Prochlorococcus and Synechococcus. Environmental Microbiology 9:1675-1695.

Welch, R. A., V. Burland, G. Plunkett, P. Redford, P. Roesch, D. Rasko, E. L. Buckles, S. R. Liou, A. Boutin, J. Hackett, D. Stroud, G. F. Mayhew, D. J. Rose, S. Zhou, D. C. Schwartz, N. T. Perna, H. L. T. Mobley, M. S. Donnenberg, and F. R. Blattner. 2002. Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli. Proceedings of the National Academy of Sciences of the United States of America 99:17020-17024.

Wetzel, R. 1994. The split decomposition algorithm. Ph.D. thesis University of Bielefeld, Bielefeld, Germany.

Wheeler, D. L., D. M. Church, R. Edgar, S. Federhen, W. Helmberg, T. L. Madden, J. U. Pontius, G. D. Schuler, L. M. Schriml, E. Sequeira, T. O. Suzek, T. A. Tatusova, and L. Wagner. 2004. Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Research 32:D35-D40.

Whelan, S. and N. Goldman. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Molecular Biology and Evolution 18:691-699.

Whitfield, J. B. and P. J. Lockhart. 2007. Deciphering ancient rapid radiations. Trends in Ecology and Evolution 22:258-265.

Whitman, W. B., D. C. Coleman, and W. J. Wiebe. 1998. Prokaryotes: the unseen majority. Proceedings of the National Academy of Sciences of the United States of America 95:6578-6583.

Wilcox, R. M. and J. A. Fuhrman. 1994. Bacterial viruses in coastal seawater: lytic rather than lysogenic production. Marine Ecology Progress Series 114:35-45.

Williams, J., L. Steiner, G. Feher, and M. Simon. 1984. Primary structure of the L-subunit of the reaction center from Rhodopseudomonas sphaeroides. Proceedings of the National Academy of Sciences of the United States of America 81:7303-7307.

Woese, C. 1992. Prokaryote systematics: the evolution of a science. Pages 3-18 in The Prokaryotes (A. Balows, H. Truper, M. Dworkin, W. Harder, and K. H. Schleifer, eds.) 2nd ed. Springer, New York.

Woese, C. R. 1998. The universal ancestor. Proceedings of the National Academy of Sciences of the United States of America 95:9710-9710.

Woese, C. R. 2002. On the evolution of cells. Proceedings of the National Academy of Sciences of the United States of America 99:8742-8747.

Woese, C. R. and G. E. Fox. 1977. Phylogenetic structure of prokaryotic domain: primary kingdoms. Proceedings of the National Academy of Sciences of the United States of America 74:5088-5090.

Woese, C. R., G. J. Olsen, M. Ibba, and D. Söll. 2000. Aminoacyl-tRNA synthetases, the genetic code, and the evolutionary process. Microbiology and Molecular Biology Reviews 64:202-236.

Wolf, Y. I., I. B. Rogozin, N. V. Grishin, and E. V. Koonin. 2002. Genome trees and the Tree of Life. Trends in Genetics 18:472-479.

Wong, K. M., M. A. Suchard, and J. P. Huelsenbeck. 2008. Alignment uncertainty and genomic analysis. Science 319:473-476.

Wood, A. M., T. Leatham, J. R. Manhart, and R. M. McCourt. 1992. The species concept in phytoplankton ecology. Journal of Phycology 28:723-729.

Worden, A. Z. and B. J. Binder. 2003. Application of dilution experiments for measuring growth and mortality rates among Prochlorococcus and Synechococcus populations in oligotrophic environments. Aquatic Microbial Ecology 30:159-174.

Wu, G. A., S. R. Jun, G. E. Sims, and S. H. Kim. 2009. Whole-proteome phylogeny of large dsDNA virus families by an alignment-free method. Proceedings of the National Academy of Sciences of the United States of America 106:12826-12831.

Wuyts, J., G. Perriere, and Y. V. de Peer. 2004. The European ribosomal RNA database. Nucleic Acids Research 32:D101-D103.

Xiong, J. and C. E. Bauer. 2002. Complex evolution of photosynthesis. Annual Review of Plant Biology 53:503-521.

Yang, Z. 1996. Phylogenetic analysis using parsimony and likelihood methods. Journal of Molecular Evolution 42:294-307.

Yang, Z. 2008. User Guide for PAML: phylogenetic analysis by maximum likelihood. 4.1 ed.

Yang, Z., N. Goldman, and A. Friday. 1995. Maximum likelihood trees from DNA sequences: a peculiar statistical estimation problem. Systematic Biology 44:384-399.

Yang, Z. H. 1994. Statistical properties of the maximum-likelihood method of phylogenetic estimation and comparison with distance matrix-methods. Systematic Biology 43:329-342.

Yang, Z. H. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Computer Applications in the Biosciences 13:555-556.

Yang, Z. H. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution 24:1586-1591.

Yang, Z. H. and R. Nielsen. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Molecular Biology and Evolution 17:32-43.

Yang, Z. H. and B. Rannala. 1997. Bayesian phylogenetic inference using DNA sequences: a Markov Chain Monte Carlo method. Molecular Biology and Evolution 14:717-724.

Yang, Z. H. and W. J. Swanson. 2002. Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes. Molecular Biology and Evolution 19:49-57.

Yap, W. H., Z. S. Zhang, and Y. Wang. 1999. Distinct types of rRNA operons exist in the genome of the actinomycete Thermomonospora chromogena and evidence for horizontal transfer of an entire rRNA operon. Journal of Bacteriology 181:5201-5209.

Yurkov, V. V. and J. T. Beatty. 1998. Aerobic anoxygenic phototrophic bacteria. Microbiology and Molecular Biology Reviews 62:695-724.

Zaitlin, M. and R. Hull. 1987. Plant virus-host interactions. Annual Review of Plant Physiology and Plant Molecular Biology 38:291-315.

Zehr, J. P., S. R. Bench, B. J. Carter, I. Hewson, F. Niazi, T. Shi, H. J. Tripp, and J. P. Affourtit. 2008. Globally distributed uncultivated oceanic N2-fixing cyanobacteria lack oxygenic photosystem II. Science 322:1110-1112.

Zeidner, G., J. P. Bielawski, M. Shmoish, D. J. Scanlan, G. Sabehi, and O. Beja. 2005. Potential photosynthesis gene recombination between Prochlorococcus and Synechococcus via viral intermediates. Environmental Microbiology 7:1505-1513.

Zeidner, G., C. M. Preston, E. F. Delong, R. Massana, A. F. Post, D. J. Scanlan, and O. Beja. 2003. Molecular diversity among marine picophytoplankton as revealed by psbA analyses. Environmental Microbiology 5:212-216.

Zeigler, D. R. 2003. Gene sequences useful for predicting relatedness of whole genomes in bacteria. International Journal of Systematic and Evolutionary Microbiology 53:1893-1900.

Zhang, Y. N., M. Chen, B. B. Zhou, L. S. Jermiin, and A. W. D. Larkum. 2007. Evolution of the inner light-harvesting antenna protein family of cyanobacteria, algae, and plants. Journal of Molecular Evolution 64:321-331.

Zhaxybayeva, O., W. F. Doolittle, R. T. Papke, and J. P. Gogarten. 2009. Intertwined evolutionary histories of marine Synechococcus and Prochlorococcus marinus. Genome Biology and Evolution 2009:325-339.

Zhaxybayeva, O. and J. P. Gogarten. 2002. Bootstrap, Bayesian probability and maximum likelihood mapping: exploring new tools for comparative genome analyses. BMC Genomics 3:4.

Zhaxybayeva, O. and J. P. Gogarten. 2003a. An improved probability mapping approach to assess genome mosaicism. BMC Genomics 4:37.

Zhaxybayeva, O. and J. P. Gogarten. 2003b. Spliceosomal introns: new insights into their evolution. Current Biology 13:R764-R766.

Zhaxybayeva, O., J. P. Gogarten, R. L. Charlebois, W. F. Doolittle, and R. T. Papke. 2006. Phylogenetic analyses of cyanobacterial genomes: quantification of horizontal gene transfer events. Genome Research 16:1099-1108.

Zhaxybayeva, O., L. Hamel, J. Raymond, and J. P. Gogarten. 2004a. Visualization of the phylogenetic content of five genomes using dekapentagonal maps. Genome Biology 5.

Zhaxybayeva, O., P. Lapierre, and J. P. Gogarten. 2004b. Genome mosaicism and organismal lineages. Trends in Genetics 20:254-260.

Zinner, D., M. L. Arnold, and C. Roos. 2009. Is the new primate genus Rungwecebus a baboon? PLoS One 4:e4859.

Zwickl, D. J. and D. M. Hillis. 2002. Increased taxon sampling greatly reduces phylogenetic error. Systematic Biology 51:588-598.

Zwirglmaier, K., J. L. Heywood, K. Chamberlain, E. M. S. Woodward, M. V. Zubkov, and D. J. Scanlan. 2007. Basin-scale distribution patterns of picocyanobacterial lineages in the Atlantic Ocean. Environmental Microbiology 9:1278-1290.
