1 TITLE

2 An updated phylogeny of the reveals that the parasitic 3 and Holosporales have independent origins

4 AUTHORS

5 Sergio A. Muñoz-Gómez1,2, Sebastian Hess1,2, Gertraud Burger3, B. Franz Lang3, 6 Edward Susko2,4, Claudio H. Slamovits1,2*, and Andrew J. Roger1,2*

7 AUTHOR AFFILIATIONS

8 1 Department of Biochemistry and Molecular Biology; Dalhousie University; Halifax, 9 Nova Scotia, B3H 4R2; Canada.

10 2 Centre for Comparative Genomics and Evolutionary Bioinformatics; Dalhousie 11 University; Halifax, Nova Scotia, B3H 4R2; Canada.

12 3 Department of Biochemistry, Robert-Cedergren Center in Bioinformatics and 13 Genomics, Université de Montréal, Montreal, Quebec, Canada.

14 4 Department of Mathematics and Statistics; Dalhousie University; Halifax, Nova Scotia, 15 B3H 4R2; Canada.

16

17 *Correspondence to: Department of Biochemistry and Molecular Biology; Dalhousie 18 University; Halifax, Nova Scotia, B3H 4R2; Canada; 1 902 494 2881, [email protected]

1

19 ABSTRACT

20 The Alphaproteobacteria is an extraordinarily diverse and ancient group of . 21 Previous attempts to infer its deep phylogeny have been plagued with methodological 22 artefacts. To overcome this, we analyzed a dataset of 200 single-copy and conserved 23 and employed diverse strategies to reduce compositional artefacts. Such 24 strategies include using novel dataset-specific profile mixture models and recoding 25 schemes, and removing sites, genes and taxa that are compositionally biased. We 26 show that the Rickettsiales and Holosporales (both groups of intracellular parasites of 27 ) are not sisters to each other, but instead, the Holosporales has a derived 28 position within the . A synthesis of our results also leads to an updated 29 proposal for the higher-level of the Alphaproteobacteria. Our robust 30 consensus phylogeny will serve as a framework for future studies that aim to place 31 mitochondria, and novel environmental diversity, within the Alphaproteobacteria.

32 KEYWORDS: , Holosporales, mitochondria, origin, Finniella inopinata, 33 Stachyamoba, Peranema, , Rhodospirillales, Azospirillaceae, 34 Rhodovibriaceae.

2

35 INTRODUCTION

36 The Alphaproteobacteria is an extraordinarily diverse and disparate group of bacteria 37 and well-known to most biologists for also encompassing the mitochondrial lineage 38 (Williams, Sobral, and Dickerman 2007; Roger, Muñoz-Gómez, and Kamikawa 2017). 39 The Alphaproteobacteria has massively diversified since its origin, giving rise to, for 40 example, some of the most abundant (e.g., Pelagibacter ubique) and metabolically 41 versatile (e.g., ) cells on Earth (Giovannoni 2017; Madigan, 42 Jung, and Madigan 2009). The basic structure of the tree of the Alphaproteobacteria 43 has largely been inferred through the analyses of 16S rRNA genes and several 44 conserved proteins (Garrity 2005; Lee et al. 2005; Rosenberg et al. 2014; Fitzpatrick, 45 Creevey, and McInerney 2006; Williams, Sobral, and Dickerman 2007; Brindefalk et al. 46 2011; Georgiades et al. 2011; Thrash et al. 2011; Luo 2015). Today, eight major orders 47 are well recognized, namely the , Rhizobiales, , 48 , , Rhodospirillales, Holosporales and Rickettsiales 49 (the latter two formerly grouped into the Rickettsiales sensu lato), and their 50 interrelationships have also recently become better understood (Viklund, Ettema, and 51 Andersson 2012; Viklund et al. 2013; Rodríguez-Ezpeleta and Embley 2012; Wang and 52 Wu 2014, 2015). These eight orders were grouped into two subclasses by Ferla et al. 53 (2013): the subclass Rickettsiidae comprising the order Rickettsiales and 54 Pelagibacterales, and the subclass Caulobacteridae comprising all other orders.

55 The great diversity of the Alphaproteobacteria itself presents a challenge to deciphering 56 the deepest divergences within the group. Such diversity encompasses a broad 57 spectrum of (nucleotide) and proteome () compositions (e.g., the 58 A+T%-rich Pelagibacterales versus the G+C%-rich ) and molecular 59 evolutionary rates (e.g., the fast-evolving Pelagibacteriales, Rickettsiales or 60 Holosporales versus many slow-evolving in the Rhodospirillales) (Ettema and 61 Andersson 2009). This diversity may lead to pervasive artefacts when inferring the 62 phylogeny of the Alphaproteobacteria, e.g., long-branch attraction (LBA) between the 63 Rickettsiales and Pelagibacterales, especially when including mitochondria (Rodríguez- 64 Ezpeleta and Embley 2012; Viklund, Ettema, and Andersson 2012; Viklund et al. 2013;

3

65 Luo 2015). Moreover, there are still important unknowns about the deep phylogeny of 66 the Alphaproteobacteria (Williams, Sobral, and Dickerman 2007; Ferla et al. 2013), for 67 example, the divergence order among the Rhizobiales, Rhodobacterales and 68 Caulobacterales (Williams, Sobral, and Dickerman 2007), the monophyly of the 69 Pelagibacterales (Viklund et al. 2013) and the Rhodospirillales (Ferla et al. 2013), and 70 the precise placement of the Rickettsiales and its relationship to the Holosporales 71 (Wang and Wu 2015; Martijn et al. 2018).

72 Systematic errors stemming from using over-simplified evolutionary models (which often 73 do not fit complex data as well by, for example, not accounting for compositional 74 heterogeneity across sites or branches) are perhaps the major confounding and limiting 75 factor to inferring deep evolutionary relationships; the number of taxa and genes (or 76 sites) can also be important factors. Previous multi- tree studies of the 77 Alphaproteobacteria were compromised by at least one of these problems, namely, 78 simpler or less realistic evolutionary models (because they were not available at the 79 time; e.g., Williams, Sobral, and Dickerman 2007 used the simple WAG+Γ4 model that 80 cannot account for compositional heterogeneity across sites), poor or uneven taxon 81 sampling (because the focus was too narrow or few were available; e.g., 82 Williams, Sobral, and Dickerman 2007 had very few rhodospirillaleans and no 83 holosporaleans; Georgiades et al. 2011 included only 42 alphaproteobacteria with only 84 one pelagibacteralean) or a small number of genes (because the focus was 85 mitochondria; e.g., Rodríguez-Ezpeleta and Embley 2012 used 24 genes; Wang and 86 Wu 2015 relied on 29 genes; Martijn et al. 2018 also used 24 genes; or because only a 87 small set of 28 compositionally homogeneous genes was used, e.g., Luo 2015). The 88 most recent study on the phylogeny of the Alphaproteobacteria, and mitochondria, 89 attempted to counter systematic errors (or phylogenetic artefacts) by reducing amino 90 acid compositional heterogeneity (Martijn et al. 2018). Even though some deep 91 relationships were not robustly resolved, these analyses suggested that the 92 Pelagibacterales, Rickettsiales and Holosporales, which have compositionally-biased 93 genomes, are not each other’s closest relatives (Martijn et al. 2018). A resolved and 94 robust phylogeny of the Alphaproteobacteria is fundamental to addressing questions 95 such as how streamlined bacteria, intracellular parasitic bacteria, or mitochondria

4

96 evolved from their alphaproteobacterial ancestors. Therefore, a systematic study of the 97 different biases affecting the phylogeny of the Alphaproteobacteria, and its underlying 98 data, is much needed.

99 Here, we revised the phylogeny of the Alphaproteobacteria by using a large dataset of 100 200 conserved single-copy genes and employing carefully designed strategies aimed at 101 alleviating phylogenetic artefacts. We found that amino acid compositional 102 heterogeneity, and more generally long-branch attraction, were major confounding 103 factors in estimating phylogenies of the Alphaproteobacteria. In order to counter these 104 biases, we used novel dataset-specific profile mixture models and recoding schemes 105 (both specifically designed to ameliorate compositional heterogeneity), and removed 106 sites, genes and taxa that were compositionally biased. We also present three draft 107 genomes for endosymbiotic alphaproteobacteria belonging to the Rickettsiales and 108 Holosporales: (1) an undescribed midichloriacean endosymbiont of Peranema 109 trichophorum, (2) an undescribed rickettsiacean endosymbiont of Stachyamoeba 110 lipophora, and (3) the holosporalean ‘Candidatus Finniella inopinata’, an endosymbiont 111 of the rhizarian amoeboflagellate Viridiraptor invadens (Hess, Suthaus, and Melkonian 112 2015). Our results provide the first strong evidence that the Holosporales is unrelated to 113 the Rickettsiales and originated instead from within the Rhodospirillales. We incorporate 114 these and other insights regarding the deep phylogeny of the Alphaproteobacteria into 115 an updated taxonomy.

116 RESULTS

117 The genomes and phylogenetic positions of three novel endosymbiotic 118 alphaproteobacteria (Rickettsiales and Holosporales)

119 We sequenced the genomes of the novel holosporalean ‘Candidatus Finniella 120 inopinata’, an endosymbiont of the rhizarian amoeboflagellate Viridiraptor invadens 121 (Hess, Suthaus, and Melkonian 2015), and two undescribed rickettsialeans, one 122 associated with the heterolobosean amoeba Stachyamoeba lipophora and the other 123 with the euglenoid flagellate Peranema trichophorum. The three genomes are small with 124 a reduced gene number and high A+T% content, strongly suggesting an endosymbiotic 125 lifestyle (Table 1). Comparisons of their rRNA genes show that these genomes are truly

5

126 novel, being considerably divergent from other described alphaproteobacteria. As of 127 February 2018, the closest 16S rRNA gene to that of the Stachyamoeba-associated 128 rickettsialean belongs to massiliae str. AZT80, with only 88% identity. On the 129 other hand, the closest 16S rRNA gene to that of the Peranema-associated 130 rickettsialean belongs to an endosymbiont of Acanthamoeba sp. UWC8, which is only 131 92% identical. Phylogenetic analysis of both the 16S rRNA gene and a dataset that 132 comprises 200 single-copy conserved marker genes (see below) confirm that each 133 species belongs to different families and orders within the Alphaproteobacteria 134 (Supplementary file 1 and Figure 2-figure supplement 1). ‘Candidatus Finniella 135 inopinata’ belongs to the recently described ‘Candidatus Paracaedibacteraceae’ in the 136 Holosporales (Hess, Suthaus, and Melkonian 2015), whereas the Stachyamoeba- 137 associated rickettsialean belongs to the , and the Peranema-associated 138 rickettsialean belongs to the ‘Candidatus ’, in the Rickettsiales.

6

139 Table 1. Genome features for the three novel rickettsialeans sequenced in this study. 140 See Supplementary file 1 as well.

Species ‘Candidatus Finniella Stachyamoeba- Peranema- inopinata’ associated associated rickettsialean rickettsialean 1,792,168 bp 1,738,386 bp 1,375,759 bp N50 174,737 bp 1,738,386 bp 28,559 bp Contig number 28 1 125 Gene number1 1,741 1,588 1,223 A+T% content 56.58% 67.01% 59.13% Family Paracaedibacteraeae Rickettsiaceae ‘Candidatus Midicloriaceae’ Order Holosporales Rickettsiales Rickettsiales Completeness2 94.96% 97.12% (=100%) 92.08% Redundancy2 0.0% 0.0% 2.1% 141 1 as predicted by Prokka v.1.13 (rRNA genes were searched with BLAST). 142 2 as estimated by Anvi’o v.2.4.0 using the Campbell et al., 2013 marker gene set.

7

143 Compositional heterogeneity appears to be a major confounding factor affecting 144 phylogenetic inference of the Alphaproteobacteria

145 The average-linkage clustering of amino acid compositions shows that the Rickettsiales, 146 Pelagibacterales (together with alphaproteobacterium sp. HIMB59) and Holosporales 147 are clearly distinct from other alphaproteobacteria. This indicates that these three taxa 148 have divergent proteome amino acid compositions (Figure 1A). These taxa also have 149 the lowest GARP:FIMNKY amino acid ratios in all the Alphaproteobacteria (Figure 1A; 150 GARP amino acids are encoded by G+C%-rich codons, whereas FIMNKY amino acids 151 are encoded by A+T%-rich codons. Proteomes that have low GARP:FIMNKY ratios are 152 compositionally biased and therefore come from A+T%-rich genomes); the 153 Pelagibacterales (including alphaproteobacterium sp. HIMB59) being the most 154 divergent, followed by the Rickettsiales and then the Holosporales. Such biased amino 155 acid compositions appear to be the consequence of genome nucleotide compositions 156 that are strongly biased towards high A+T%—a scatter plot of genome G+C% and 157 proteome GARP:FIMNKY ratios shows a similar clustering of the Rickettsiales, 158 Pelagibacterales (including alphaproteobacterium sp. HIMB59) and Holosporales 159 (Figure 1B). This compositional similarity in the proteomes of the Rickettsiales, 160 Pelagibacterales (plus alphaproteobacterium sp. HIMB59) and Holosporales, which also 161 turn out to be the longest-branching alphaproteobacterial groups in previously published 162 phylogenies (e.g., Wang and Wu 2015), could be the outcome of either a shared 163 evolutionary history (i.e., the groups are most closely related to one another), or 164 alternatively, evolutionary convergence (e.g., because of similar lifestyles or 165 evolutionary trends toward small cell and genome sizes).

8

166 167

9

168 Figure 1. Compositional heterogeneity in the Alphaproteobacteria is a major factor that 169 confounds phylogenetic inference. There are great disparities in the genome G+C% 170 content and amino acid compositions of the Rickettsiales, Pelagibacterales (including 171 alphaproteobacterium sp. HIMB59) and Holosporales with all other alphaproteobacteria. 172 A. A UPGMA (average-linkage) clustering of amino acid compositions (based on the 173 200 gene set for the Alphaproteobacteria) shows that the Rickettsiales (brown), 174 Pelagibacterales (maroon), and Holosporales (light blue) all have very similar proteome 175 amino acid compositions. At the tips of the tree, GARP:FIMNKY amino acid ratio values 176 are shown as bars. B. A scatterplot depicting the strong correlation between G+C% 177 (nucleotide compositions) and GARP:FIMNKY ratios (amino acid composition) for the 178 120 taxa in the Alphaproteobacteria (and outgroup) shows a similar clustering of the 179 Rickettsiales, Pelagibacterales (including alphaproteobacterium sp. HIMB59) and 180 Holosporales.

10

181 As a first step to discriminate between these two alternatives, we used maximum 182 likelihood to estimate a tree on a dataset that comprised 200 single-copy and rarely 183 laterally-transferred marker genes for the Alphaproteobacteria (as determined by Phyla- 184 AMPHORA; see Methods for more details; Wang and Wu 2013) under the site- 185 heterogenous model LG+PMSF(ES60)+F+R6. The resulting tree united the 186 Rickettsiales, Pelagibacterales (with alphaproteobacterium sp. HIMB59 at its base) and 187 Holosporales in a fully supported (Figure 2A; see Figure 2-figure supplement 1 for 188 labeled trees). The clustering of these three groups is suggestive of a phylogenetic 189 artefact (e.g., long-branch attraction or LBA); indeed, such a pattern resembles the one 190 seen in the tree of proteome amino acid compositions (see Figure 1A). This is because 191 the three groups have the longest branches in the Alphaproteobacteria tree and have 192 compositionally-biased and fast-evolving genomes (see Figure 2). If evolutionary 193 convergence in amino acid compositions is confounding phylogenetic inference for the 194 Alphaproteobacteria, methods aimed at reducing compositional heterogeneity might 195 disrupt the clustering of the Rickettsiales, Pelagibacterales and Holosporales.

196 To further test whether the clustering of the Rickettsiales, Pelagibacterales and 197 Holosporales is real or artefactual we used several different strategies to reduce the 198 compositional heterogeneity of our dataset (see Figure 2-figure supplement 2 for the 199 diverse strategies employed). When removing the 50% most compositionally-biased 200 (heterogeneous) sites according to ɀ (a novel metric that measures amino acid 201 compositional disparity at a site; see Methods), the clustering between the Rickettsiales, 202 Pelagibacterales, alphaproteobacterium sp. HIMB59 and Holosporales is disrupted 203 (Figure 2B; see also Figure 2-figure supplement 3). The new more derived placements 204 for the Pelagibacterales, alphaproteobacterium sp. HIMB59 and Holosporales are well 205 supported (further described below), and support tends to increase as compositionally- 206 biased sites are removed (Supplementary file 2-table S1). Furthermore, when each of 207 these long-branching and compositionally-biased taxa is analyzed in isolation (i.e., in 208 the absence of the other), and compositionally heterogeneity is further decreased, new 209 phylogenetic patterns emerge that are incompatible, or in conflict, with their clustering 210 (Figure 2-figure supplement 4 and Figure 3-figure supplement 1-5). Various strategies 211 to reduce compositional heterogeneity, such as removing the most compositionally-

11

212 biased sites, recoding the data into reduced character-state alphabets, or using only the 213 most compositionally homogeneous genes, converge to very similar phylogenetic 214 patterns for the Alphaproteobacteria in which the clustering of the Rickettsiales, 215 Pelagibacterales, alphaprotobacterium sp. HIMB59 and Holosporales is disrupted; the 216 Pelagibacterales, alphaproteobacterium sp. HIMB59 and Holosporales have much more 217 derived phylogenetic placements (e.g., Figure 3, Figure 2-figure supplement 4 and 218 Figure 3-figure supplement 1-5). On the other hand, removing fast-evolving sites does 219 not disrupt the clustering of these three long-branching groups (Supplementary file 2- 220 table S2), suggesting that high evolutionary rates per site are not a major confounding 221 factor when inferring the phylogeny of the Alphaproteobacteria.

12

222 223 224

13

225 Figure 2. Decreasing compositional heterogeneity by removing compositionally-biased 226 sites disrupts the clustering of the Rickettsiales, Pelagibacterales (including 227 alphaprotobacterium sp. HIMB59) and Holosporales. All branch support values are 228 100% SH-aLRT and 100% UFBoot unless annotated. A. A maximum-likelihood tree 229 inferred under the LG+PMSF(ES60)+F+R6 model and from the untreated dataset which 230 is highly compositionally heterogeneous. The three long-branching orders, the 231 Rickettsiales, Pelagibacterales (including alphaprotobacterium sp. HIMB59) and 232 Holosporales, that have similar amino acid compositions form a clade. B. A maximum- 233 likelihood tree inferred under the LG+PMSF(ES60)+F+R6 model and from a dataset 234 whose compositional heterogeneity has been decreased by removing 50% of the most 235 biased sites according to ɀ. In this phylogeny the clustering of the Rickettsiales, 236 Pelagibacterales and Holosporales is disrupted. The Pelagibacterales is sister to the 237 Rhodobacterales, Caulobacterales and Rhizobiales. The Holosporales, and 238 alphaproteobacterium sp. HIMB59, become sister to the Rhodospirillales. The 239 Rickettsiales remains as the sister to the Caulobacteridae. See Figure 2-supplement 240 figure 1 for taxon names. See Figure 2-figure supplement 3 for the Bayesian consensus 241 trees inferred in PhyloBayes MPI v1.7 under the CAT-Poisson+Γ4 model. See also 242 Figure 2-figure supplement 2 and 4-7.

243

14

244 The Holosporales is unrelated to the Rickettsiales and is instead most likely derived 245 within the Rhodospirillales

246 The Holosporales has traditionally been considered part of the Rickettsiales sensu lato 247 because it appears as sister to the Rickettsiales in many trees (e.g., Hess, Suthaus, and 248 Melkonian 2015; Montagna et al. 2013; Santos and Massard 2014). It is exclusively 249 composed of endosymbiotic bacteria living within diverse eukaryotes, and such a 250 lifestyle is shared with all other members of the Rickettsiales (with the possible 251 exception of a recently reported ectosymbiotic rickettialean; see Castelli et al. 2018). 252 When we decrease, and then account for, compositional heterogeneity, we recover tree 253 topologies in which the Holosporales moves away from the Rickettsiales (e.g., Figure 254 2B, Figure 2-figure supplement 4B and 4D). For example, the Holosporales becomes 255 sister to all free-living alphaproteobacteria (the Caulobacteridae) when only the 40 most 256 homogeneous genes are used (Figure 2-figure supplement 4D) or when 10% of the 257 most compositionally-biased sites are removed (Supplementary file 2-table S1). When 258 compositional heterogeneity is further decreased by removing 50% of the most 259 compositionally-biased sites, the Holosporales becomes sister to the Rhodospirillales 260 (Figure 2B and Supplementary file 2-table S1; and see also Figure 2-figure supplement 261 4B).

262 Similarly, when the long-branching and compositionally-biased Rickettsiales, 263 Pelagibacterales, and alphaproteobacterium sp. HIMB59 (plus the extremely long- 264 branching genera Holospora and ‘Candidatus Hepatobacter’) are removed, after 265 compositional heterogeneity had been decreased through site removal, the 266 Holosporales move to a much more derived position well within the Rhodospirillales 267 (Figure 3A, Figure 3-figure supplement 1B, 1C and Figure 3-figure supplement 6). If the 268 very compositionally-biased and fast-evolving Holospora and ‘Candidatus Hepatobacter’ 269 are left in, the Holosporales are pulled away from its derived position and the whole 270 clade moves closer to the base of the tree (Figure 3-figure supplement 7). The same 271 pattern in which the Holosporales is derived within the Rhodospirillales is seen when 272 these same taxa are removed, and the data are then recoded into four- or six-character 273 states (Figure 3B, Figure 3-figure supplement 6 and Figure 3-figure supplement 8).

15

274 Specifically, the Holosporales now consistently branches as sister to a subgroup of 275 rhodospirillaleans that includes, among others, the epibiotic predator Micavibrio 276 aeruginosavorus and the purple nonsulfur bacterium Rhodocista centenaria (the 277 Azospirillaceae, see below) (Figure 3). This new placement of the Holosporales has 278 nearly full support under both maximum likelihood (>95% UFBoot; see Figure 3) and 279 Bayesian inference (>0.95 posterior probability; see Figure 3-figure supplement 6).Thus, 280 three different analyses independently converge to the same pattern and support a 281 derived origin of the Holosporales within the Rhodospirillales: (1) removal of 282 compositionally-biased sites (Figure 3A), (2) data recoding into four-character states 283 using the dataset-specific scheme S4 (Figure 3B and Figure 3-figure supplement 7), 284 and (3) data recoding into six-character states using the dataset-specific scheme S6 285 (Figure 3-figure supplement 8); each of these strategies had to be combined with the 286 removal of the Pelagibacterales, alphaproteobacterium sp. HIMB59, and Rickettsiales to 287 recover this phylogenetic position for the Holosporales.

16

288 289 290

17

291 Figure 3. The Holosporales (renamed and lowered in rank to the Holosporaceae family 292 here) branches in a derived position within the Rhodospirillales when compositional 293 heterogeneity is reduced and the long-branching and compositionally-biased 294 Rickettsiales, Pelagibacterales, and alphaproteobacterium sp. HIMB59 are removed. 295 Branch support values are 100% SH-aLRT and 100% UFBoot unless annotated. A. A 296 maximum-likelihood tree, inferred under the LG+PMSF(ES60)+F+R6 model, to place 297 the Holosporaceae in the absence of the Rickettsiales, Pelagibacterales, and 298 alphaproteobacterium sp. HIMB59 and when compositional heterogeneity has been 299 decreased by removing 50% of the most biased sites. The Holosporaceae is sister to 300 the Azospirillaceae fam. nov. within the Rhodospirillales. B. A maximum-likelihood tree, 301 inferred under the GTR+ES60S4+F+R6 model, to place the Holosporaceae in the 302 absence of the Rickettsiales, Pelagibacterales, and alphaproteobacterium sp. HIMB59, 303 and when the data have been recoded into a four-character state alphabet (the dataset- 304 specific recoding scheme S4: ARNDQEILKSTV GHY CMFP W) to reduce compositional 305 heterogeneity. This phylogeny shows a pattern that matches that inferred when 306 compositional heterogeneity has been alleviated through site removal. See Figure 3- 307 figure supplement 6 for the Bayesian consensus trees inferred in PhyloBayes MPI v1.7 308 and under the and the CAT-Poisson+Γ4 model. See also Figure 3-figure supplement 1- 309 5 and 7-8.

18

310 A fourth independent analysis further supports a derived placement of the Holosporales 311 nested within the Rhodospirillales. Bayesian inference using the CAT-Poisson+Γ4 312 model, on a dataset whose compositional heterogeneity had been decreased by 313 removing 50% of the most compositionally-biased sites but for which no taxon had been 314 removed, also recovered the Holosporales as sister to the Azospirillaceae (see Figure 315 2-figure supplement 3).

316 The Rhodospirillales is a diverse order and comprises five well-supported families

317 The Rhodospirillales is an ancient and highly diversified group, but unfortunately this is 318 rarely obvious from published phylogenies because most studies only include a few 319 species for this order (Williams, Sobral, and Dickerman 2007; Georgiades et al. 2011; 320 Ferla et al. 2013). We have included a total of 31 Rhodospirillales taxa to better cover 321 its diversity. Such broad sampling reveals trees with five clear subgroups within the 322 Rhodospirillales that are well-supported in most of our analyses (e.g., Figure 2B and 3). 323 First is the Acetobacteraceae which comprises acetic acid (e.g., 324 oboediens), acidophilic (e.g., rubrifaciens), and photosynthesizing 325 (-containing; e.g., Rubritepida flocculans) bacteria. The 326 Acetobacteraceae is strongly supported and relatively divergent from all other families 327 within the Rhodospirillales. Sister to the Acetobacteraceae is another subgroup that 328 comprises many photosynthesizing bacteria, including the species for the 329 Rhodospirillales, , as well as the magnetotactic bacterial genera 330 , Magnetovibrio and Magnetospira (Figure 3). This subgroup best 331 corresponds to the poorly defined and paraphyletic family. We amend 332 the Rhodospirillaceae taxon and restrict it to the clade most closely related to the 333 Acetobacteraceae. As described above, when artefacts are accounted for, the 334 Holosporales most likely branches within the Rhodospirillales and therefore we suggest 335 the Holosporales sensu Szokoli et al. (2016) be lowered in rank to the family 336 Holosporaceae (containing e.g., Caedibacter sp. 37-49 and ‘Candidatus 337 Paracaedibacter symbiosus’), which is sister to the Azospirillaceae (Figure 3). The 338 Azospirillaceae fam. nov. contains the purple bacterium Rhodocista centenaria and the 339 epibiotic (neither periplasmic nor intracellular) predator Micavibrio aeruginosavorus,

19

340 among others. The Holosporaceae and the Azospirillaceae appear to be sister to 341 the Rhodovibriaceae fam. nov. (Figure 3), a well-supported group that comprises the 342 purple nonsulfur bacterium Rhodovibrio salinarum, the aerobic heterotroph Kiloniella 343 laminariae, and the marine bacterioplankter ‘Candidatus Puniceispirillum marinum’ (or 344 the SAR116 clade). Each of these subgroups and their interrelationships—with the 345 exception of the Holosporaceae that branches within the Rhodospirillales only after 346 compositional heterogeneity is countered—are strongly supported in nearly all of our 347 analyses (e.g., see Figure 2B and 3).

348 The Geminicoccaceae might be sister to all other free-living alphaproteobacteria (the 349 Caulobacteridae)

350 The Geminicocacceae is a recently proposed family within the Rhodospirillales 351 (Proença et al. 2017). It is currently represented by only two genera, and 352 Arboriscoccus (Foesel et al. 2007; Proença et al. 2017). In most of our trees, however, 353 mobilis is often sister to Geminicoccus roseus with full statistical support (e.g., 354 Figure 2B and 3A, but see Figure 3-figure supplement 6 for an exception) and we 355 therefore consider it to be part of the Geminococcaceae. Interestingly, the 356 Geminicoccaceae tends to have two alternative stable positions in our analyses, either 357 as sister to all other families of the Rhodospirillales (e.g., Figure 2A and 3A), or as sister 358 to all other orders of the Caulobacteridae (i.e., representing the most basal lineage of 359 free-living alphaproteobacteria; Figure 2B and Figure 2-figure supplement 3B, Figure 3B 360 and Figure 3-figure supplement 6, or Figure 2-figure supplement 4B, Figure 3-figure 361 supplement 1C, Figure 3-figure supplement 2B-D, Figure 3-figure supplement 3B, 3D, 362 and Figure 3-figure supplement 5C). Our analyses designed to alleviate compositional 363 heterogeneity, specifically site removal and recoding (without taxon removal), favor the 364 latter position for the Geminicoccaceae (Figure 2B and 3B). Moreover, as 365 compositionally-biased sites are progressively removed, support for the affiliation of the 366 Geminicoccaceae with the Rhodospirillales decreases, and after 50% of the sites have 367 been removed, the Geminicoccaceae emerges as sister to all other free-living 368 alphaproteobacteria with strong support (>95% UFBoot; Supplementary file 2-table S1). 369 In further agreement with this trend, the much simpler model LG4X places the

20

370 Geminicocacceae in a derived position as sister to the Acetobacteraceae (Figure 2- 371 figure supplement 5), but as model complexity increases, and compositional 372 heterogeneity is reduced, the Geminicoccaceae moves closer to the base of the 373 Alphaproteobacteria (Figure 2A and 3A). Such a placement suggests that the 374 Geminicoccaceae may be a novel and independent order-level lineage in the 375 Alphaproteobacteria. However, because of the uncertainty in our results we opt here for 376 conservatively keeping the Geminicoccaceae as the sixth family of the Rhodospirillales 377 (Figure 3A).

378 Other deep relationships in the Alphaproteobacteria (Pelagibacterales, Rickettsiales, 379 alphaproteobacterium sp. HIMB59)

380 The clustering of the Pelagibacterales (formerly the SAR11 clade) with the Rickettsiales 381 and Holosporales is more easily disrupted than that of the Holosporales, either when 382 long-branching (or compositionally-biased) taxon removal is performed to control for 383 compositional attractions or not. The removal of compositionally-biased sites (from 30% 384 on; 16,320 out of 54,400 sites; see Supplementary file 2-table S1, Figure 2B, Figure 2- 385 figure supplement 3B and Figure 3-figure supplement 4B), data recoding into four- 386 character states (Figure 3-figure supplement 4C), and a set of the most compositionally 387 homogeneous genes (Figure 3-figure supplement 4D), all support a derived placement 388 of the Pelagibacterales as sister to the Rhodobacterales, Caulobacterales and 389 Rhizobiales. Attempts to account for compositional heterogeneity both across sites 390 (e.g., Rodríguez-Ezpeleta and Embley 2012; Viklund, Ettema, and Andersson 2012; 391 Viklund et al. 2013, and Martijn et al. 2018) and taxa (e.g., Luo et al. 2013; Luo 2015) 392 tend to disrupt the potentially artefactual clustering of the Pelagibacterales and the 393 Rickettsiales (in contrast to the studies of e.g., Williams, Sobral, and Dickerman 2007; 394 Thrash et al. 2011; and Georgiades et al. 2011 that did not account for compositional 395 heterogeneity). The Caulobacterales is sister to the Rhizobiales, and the 396 Rhodobacterales sister to both (e.g., Figure 2B and 3). This is consistent throughout 397 most of our results and such interrelationships become very robustly supported as 398 compositional heterogeneity is increasingly alleviated (Supplementary file 2-table S1). 399 The placement of the Rickettsiales as sister to the Caulobacteridae (i.e., all other

21

400 alphaproteobacteria) remains stable across different analyses (see Supplementary file 401 2-table S1, and also Figure 2B and Figure 3-figure supplement 2); this is also true when 402 the other long-branching taxa, the Pelagibacterales, alphaproteobacterium sp. HIMB59 403 and Holosporales, and even the Beta- outgroup, are removed 404 (see Figure 3-figure supplement 2 and Figure 3-figure supplement 3). Yet, the 405 interrelationships inside the Rickettsiales order remain uncertain; the ‘Candidatus 406 Midichloriaceae’ becomes sister to the when fast-evolving sites are 407 removed (Supplementary file 2-table S2), but to the Rickettsiaceae when 408 compositionally-biased sites are removed (Supplementary file 2-table S1). The 409 placement of alphaproteobacterium sp. HIMB59 is uncertain (e.g., see Figure 2 and 410 Figure 2-figure supplement 3, and Figure 2-figure supplement 4 and Figure 3-figure 411 supplement 5; in contrast to Grote et al. 2012); taxon-removal analyses suggest that 412 alphaproteobacterium sp. HIMB59 is sister to the Caulobacteridae (Figure 3-figure 413 supplement 5), but the inclusion of any other long-branching group immediately 414 destabilizes this position (e.g., see Figure 2 and Figure 2-figure supplement 2, and 415 Figure 2-figure supplement 4). This is consistent with previous reports that suggest that 416 alphaproteobacterium sp. HIMB59 is not closely related to the Pelagibacterales (Viklund 417 et al. 2013; Martijn et al. 2018).

418 DISCUSSION

419 We have employed a diverse set of strategies to investigate the phylogenetic signal 420 contained within 200 genes for the Alphaproteobacteria. Specifically, such strategies 421 were primarily aimed at reducing amino acid compositional heterogeneity among taxa— 422 a phenomenon that permeates our dataset (Figure 1). Compositional heterogeneity is a 423 clear violation of the phylogenetic models used in our, and previous, analyses, and 424 known to cause phylogenetic artefacts (Foster 2004). In the absence of more 425 sophisticated models for inferring deep phylogeny (i.e., those that best fit complex data), 426 the only way to counter artefacts caused by compositional heterogeneity is by removing 427 compositionally-biased sites or taxa, or recoding amino acids into reduced alphabets 428 (e.g., see Susko and Roger 2007; Heiss et al. 2018; Viklund, Ettema, and Andersson 429 2012). A combination of these strategies reveals that the Rickettsiales sensu lato (i.e.,

22

430 the Rickettsiales and Holosporales) is polyphyletic. Our analyses suggest that the 431 Holosporales is derived within the Rhodospirillales, and that therefore this taxon should 432 be lowered in rank and renamed the Holosporaceae family (see Figure 2B and 3). The 433 same methods suggest that the Rhodospirillales might indeed be a paraphyletic order 434 and that the Geminicoccaceae could be a separate lineage that is sister to the 435 Caulobacteridae (e.g., Figure 2B). These two results, combined with our broader 436 sampling, reorganize the internal phylogenetic structure of the Rhodospirillales and 437 show that its diversity can be grouped into at least five well-supported major families 438 (Figure 3).

439 In 16S rRNA gene trees, the Holosporales has most often been allied to the 440 Rickettsiales (Montagna et al. 2013; Hess, Suthaus, and Melkonian 2015). The 441 apparent diversity of this group has quickly increased in recent years as more and more 442 intracellular bacteria living within have been described (e.g., Hess, Suthaus, and 443 Melkonian 2015; Szokoli et al. 2016; Eschbach et al. 2009; Boscaro et al. 2013). An 444 endosymbiotic lifestyle is shared by all members of the Holosporales and is also shared 445 with all those that belong to the Rickettsiales. Thus, it had been reasonable to accept 446 their shared ancestry as suggested by some 16S rRNA gene trees (e.g., Montagna et 447 al. 2013; Santos and Massard 2014; Hess, Suthaus, and Melkonian 2015). Apparent 448 strong support for the monophyly of the Rickettsiales and the Holosporales recently 449 came from some multi-gene trees by Wang and Wu (2014, 2015) who expanded 450 sampling for the Holosporales. However, an alternative placement for the Holosporales 451 as sister to the Caulobacteridae has been reported by Ferla et al. (2013) based on 452 rRNA genes, by Georgiades et al., (2011) based on 65 genes, by Schulz et al., (2015) 453 based on 139 genes, as well as by Wang and Wu (2015) based on 26, 29, or 200 genes 454 (see the supplementary information in Wang and Wu, 2015). This placement was 455 acknowledged by Szokoli et al. (2015), who formally established the order 456 Holosporales. Most recently, Martijn et al. 2018, who used strategies to reduce 457 compositional heterogeneity, and similarly to Wang and Wu (2015), recovered a number 458 of placements for the Holosporales within the Alphaproteobacteria; however these 459 different placements for the Holosporales were poorly supported. Here we provide 460 strong evidence for the hypothesis that the Holosporales is not related to the

23

461 Rickettsiales, as suggested earlier (Georgiades et al. 2011; Ferla et al. 2013; Szokoli et 462 al. 2016). The Rickettsiales sensu lato is polyphyletic. We show that the Holosporales is 463 artefactually attracted to the Rickettsiales (e.g., Figure 2A), but as compositional bias is 464 increasingly alleviated (through site removal and recoding), they move further away 465 from them (Figure 2B). The Holosporales is placed within the Rhodopirillales as sister to 466 the family Azospirillaceae (Figure 3). The similar lifestyles of the Holosporales and 467 Rickettsiales, as well as other features like the presence of an ATP/ADP translocase 468 (Wang and Wu 2014), are therefore likely the outcome of convergent evolution.

469 A derived origin of the Holosporales has important implications for understanding the 470 origin of mitochondria and the nature of their ancestor. Wang and Wu (2014, 2015) 471 proposed that mitochondria are phylogenetically embedded within the Rickettsiales 472 sensu lato. In their trees, mitochondria were sister to a clade formed by the 473 Rickettsiaceae, Anaplasmataceae and ‘Candidatus Midichloriaceae’, and the 474 Holosporales was itself sister to all of them. This phylogenetic placement for 475 mitochondria suggested that the ancestor of mitochondria was an 476 (Wang and Wu, 2014). But if the Holosporales is a derived group of rhodospirillaleans 477 as we show here (see Figure 3), then the argument that mitochondria necessarily 478 evolved from parasitic alphaproteobacteria no longer holds. While the sisterhood of 479 mitochondria and the Rickettsiales sensu stricto is still a possibility, such a relationship 480 does not imply that the two groups shared a parasitic common ancestor (i.e., a parasitic 481 ancestry for mitochondria). The most recent analyses done by Martijn et al. (2018) 482 suggest that mitochondria are sister to all known alphaproteobacteria, also suggesting 483 their non-parasitic ancestry. Our study, and that of Martijn et al., thus complement each 484 other and support the view that mitochondria most likely evolved from ancestral free- 485 living alphaproteobacteria (contra Sassera et al. 2011; Wang and Wu 2014, 2015).

486 The order Rhodospirillales is quite diverse and includes many purple nonsulfur bacteria 487 as well as all magnetotactic bacteria within the Alphaproteobacteria. The 488 Rhodospirillales is sister to all other orders in the Caulobacteridae, and has historically 489 been subdivided into two families: the Rhodospirillaceae and the Acetobacteraceae. 490 Recently, a new family, the Geminicoccaceae, was established for the Rhodospirillales

24

491 (Proença et al. 2017). However, some of our analyses suggest that the 492 Geminicoccaceae might be sister to all other Caulobacteridae (e.g., Figure 2B and 3B). 493 This phylogenetic pattern, therefore, suggests that the Rhodospirillales may be a 494 paraphyletic order. The placement of the Geminicoccaceae as sister to the 495 Caulobacteridae needs to be further tested once more sequenced diversity for this 496 group becomes available; if it were to be confirmed, the Geminicoccaceae should be 497 elevated to the order level. Whereas the Acetobacteraceae is phylogenetically well- 498 defined, there has been considerable uncertainty about the Rhodospirillaceae (e.g., 499 Ferla et al., 2013), primarily because of poor sampling and a lack of resolution provided 500 by the 16S rRNA gene. We subdivide the Rhodospirillaceae sensu lato into three 501 subgroups (Figure 3). We restrict the Rhodospirillaceae sensu stricto to the subgroup 502 that is sister to the Acetobacteracae (Figure 3). The other two subgroups are the 503 Rhodovibriaceae and the Azospirillaceae; the latter is sister to the Holosporaceae 504 (Figure 3).

505 Based on our fairly robust phylogenetic patterns, we have updated the higher-level 506 taxonomy of the Alphaproteobacteria (Figure 4). We exclude the Magnetococcales from 507 the Alphaproteobacteria class because of its divergent nature (e.g., see Figure 1 in 508 Esser, Martin, and Dagan 2007 which shows that many of Magnetococcus’ genes are 509 more similar to those of beta-, and gammaproteobacteria). In agreement with its 510 intermediate phylogenetic placement, we endorse the Magnetococcia class as 511 proposed by Parks et al. (2018). At the highest level we define the Alphaproteobacteria 512 class as comprising two subclasses sensu Ferla et al. (2013), the Rickettsidae and the 513 Caulobacteridae. The former contains the Rickettsiales, and the latter contains all other 514 orders, which are primarily and ancestrally free-living alphaproteobacteria. The order 515 Rickettsiales comprises three families as previously defined, the Rickettsiaceae, the 516 Anaplasmataceae, and the ‘Candidatus Midichloriaceae’. On the other hand, the 517 Caulobacteridae is composed of seven phylogenetically well-supported orders: the 518 Rhodospirillales, Sneathiellales, Sphingomonadales, Pelagibacterales, 519 Rhodobacterales, Caulobacterales and Rhizobiales. Among the many species claimed 520 to represent new order-level lineages on the basis of 16S rRNA gene trees (Cho and 521 Giovannoni 2003; Kwon et al. 2005; Kurahashi et al. 2008; Wiese et al. 2009; Harbison

25

522 et al. 2017), only Sneathiella deserves order-level status (Kurahashi et al. 2008), since 523 all others have derived placements in our trees and those published by others (Williams 524 et al. 2012; Bazylinski et al. 2013; Venkata Ramana et al. 2013; Harbison et al. 2017). 525 The Rhodospirillales order comprises six families, three of which are new, namely the 526 Holosporaceae, Azospirillaceae and Rhodovibriaceae (Figure 4). This new higher-level 527 classification of the Alphaproteobacteria updates and expands those presented by Ferla 528 et al. (2013), the ‘Bergey’s Manual of Systematics of Archaea and Bacteria’ (Garrity 529 2005; Whitman 2015), and ‘The Prokaryotes’ (Rosenberg et al. 2014). The classification 530 scheme proposed here could be partly harmonized with that recently proposed by Parks 531 et al., (2018) by elevating the six families within the Rhodospirllales to the order level; 532 the phylograms by Parks et al., (2018), however, are in conflict with those shown here 533 and many of their proposed taxa are as well.

26

534

535 Figure 4. A higher-level classification scheme for the Alphaproteobacteria and the 536 Magnetococcia classes within the , and the Rickettsiales and 537 Rhodospirillales orders within the Alphaproteobacteria. 538

27

539 CONCLUSIONS

540 We employed a combination of methods to decrease compositional heterogeneity in 541 order to disrupt artefacts that arise when inferring the phylogeny of the 542 Alphaproteobacteria. This is an example of the complex nature of the historical signal 543 contained in modern genomes and the limitations of our current evolutionary models to 544 capture these signals. A robust phylogeny of the Alphaproteobacteria is a precondition 545 for placing the mitochondrial lineage. This is because including mitochondria certainly 546 exacerbates the already strong biases in the data, and therefore represents additional 547 sources of artefacts in phylogenetic inference (as seen in Wang and Wu, 2015 where 548 the Holosporales is attracted by both mitochondria and the Rickettsiales). The robust 549 phylogenetic framework developed here will serve as a reference for future studies that 550 aim to place mitochondria and novel not-yet-cultured environmental diversity within the 551 Alphaproteobacteria.

552 TAXON DESCRIPTIONS

553 Rickettsidae emend. (Alphaproteobacteria) Rickettsia is the type of the subclass. 554 The Rickettsidae subclass is here amended by redefining its circumscription so it 555 remains monophyletic by excluding the Pelagibacterales order. The emended 556 Rickettsidae subclass within the Alphaproteobacteria class is defined based on 557 phylogenetic analyses of 200 genes which are predominantly single-copy and vertically- 558 inherited (unlikely laterally transferred) when compositionally heterogeneity was 559 decreased by site removal or recoding. Phylogenetic (node-based) definition: the least 560 inclusive clade containing phagocytophilum HZ, Wilmington, 561 and ‘Candidatus mitochondrii’ IricVA. The Rickettsidae does not include: 562 Pelagibacter sp. HIMB058, ‘Candidatus Pelagibacter sp.’ IMCC9063, 563 alphaproteobacterium sp. HIMB59, Caedibacter sp. 37-49, ‘Candidatus Nucleicultrix 564 amoebiphila’ FS5, ‘Candidatus Finniella lucida’, Holospora obtusa F1, Sneathiella 565 glossodoripedis JCM 23214, wittichii, and subvibrioides 566 ATCC 15264.

567 Caulobacteridae emend. (Alphaproteobacteria) Caulobacter is the type genus of the 568 subclass. The Caulobacteridae subclass is here amended by redefining its

28

569 circumscription so it remains monophyletic by including the Pelagibacterales order. The 570 emended Caulobacteridae subclass within the Alphaproteobacteria class is defined 571 based on phylogenetic analyses of 200 genes which are predominantly single-copy and 572 vertically-inherited (unlikely laterally transferred) when compositionally heterogeneity 573 was decreased by site removal or recoding. Phylogenetic (node-based) definition: the 574 least inclusive clade containing Pelagibacter sp. HIMB058, ‘Candidatus Pelagibacter 575 sp.’ IMCC9063, alphaproteobacterium sp. HIMB59, Caedibacter sp. 37-49, ‘Candidatus 576 Nucleicultrix amoebiphila’ FS5, ‘Candidatus Finniella lucida’, Holospora obtusa F1, 577 Sneathiella glossodoripedis JCM 23214, , and Brevundimonas 578 subvibrioides ATCC 15264. The Caulobacteridae does not include: Anaplasma 579 phagocytophilum HZ, Rickettsia typhi Wilmington, and ‘Candidatus Midichloria 580 mitochondrii’ IricVA

581 Azospirillaceae fam. nov. (Rhodospirillales, Alphaproteobacteria) is the 582 type genus of the family. This new family within the Rhodospirillales order is defined 583 based on phylogenetic analyses of 200 genes which are predominantly single-copy and 584 vertically-inherited (unlikely laterally transferred). Phylogenetic (node-based) definition: 585 the least inclusive clade containing Micavibrio aeruginoavorus ARL-13, Rhodocista 586 centenaria SW, and limosus DSM 16000. The Azospirillaceae does not 587 include: Rhodovibrio salinarum DSM 9154, ‘Candidatus Puniceispirillum marinum’ IMCC 588 1322, Rhodospirillum rubrum ATCC 11170, pusilla DSM 6293, 589 angustum ATCC 49957, and Elioraea tepidiphila DSM 17972.

590 Rhodovibriaceae fam. nov. (Rhodospirillales, Alphaproteobacteria) Rhodovibrio is the 591 type genus of the family. This new family within the Rhodospirillales order is defined 592 based on phylogenetic analyses of 200 genes which are predominantly single-copy and 593 vertically-inherited (unlikely laterally transferred). Phylogenetic (node-based) definition: 594 the least inclusive clade containing Rhodovibrio salinarum DSM 9154, Kiloniella 595 laminariae DSM 19542, Oceanibaculum indicum P24, Thalassobaculum salexigens 596 DSM 19539 and ‘Candidatus Puniceispirillum marinum’ IMCC 1322. The 597 Rhodovobriaceae does not include: Rhodospirillum rubrum ATCC 11170, Terasakiella

29

598 pusilla DSM 6293, Rhodocista centenaria SW, Micavibrio aeruginoavorus ARL-13, 599 Acidiphilium angustum ATCC 49957, and Elioraea tepidiphila DSM 17972.

600 Rhodospirillaceae emend. (Rhodospirillales, Alphaproteobacteria) Rhodospirillum is the 601 type genus of the family. The Rhodospirillaceae family is here amended by redefining its 602 circumscription so it remains monophyletic. The emended Rhodospirillaceae family 603 within the Rhodospirillales order is defined based on phylogenetic analyses of 200 604 genes which are predominantly single-copy and vertically-inherited (unlikely laterally 605 transferred). Phylogenetic (node-based) definition: the least inclusive clade containing 606 Rhodospirillum rubrum ATCC 11170, parvum 930l, Magnetospirillum 607 magneticum AMB-1 and Terasakiella pusilla DSM 6293. The Rhodospirillaceae does 608 not include: Rhodocista centenaria SW, Micavibrio aeruginoavorus ARL-13, ‘Candidatus 609 Puniceispirillum marinum’ IMCC 1322, Rhodovibrio salinarum DSM 9154, Elioraea 610 tepidiphila DSM 17972, and Acidiphilium angustum ATCC 49957.

611 Holosporaceae (Rhodospirillales, Alphaproteobacteria) Holospora is the type genus of 612 the family. The Holosporaceae family as defined here has the same taxon 613 circumscription as the Holosporales order sensu Szokoli et al., (2016), but it is here 614 lowered to the family level and placed within the Rhodospirillales order. The new family 615 rank-level for this group is based on the phylogenetic analysis of 200 genes, which are 616 predominantly single-copy and vertically-inherited (unlikely laterally transferred), when 617 compositionally heterogeneity was decreased by site removal or recoding (and coupled 618 to the removal of the long-branching taxa Pelagibacterales and Rickettsiales). The 619 family contains three subfamilies (lowered in rank from a former family level) and one 620 formally-undescribed clade, namely, the Holosporodeae, and ‘Candidatus 621 Paracaedibacteriodeae’, ‘Candidatus Hepatincolodeae’, and the Caedibacter- 622 Nucleicultrix clade.

623 METHODS

624 Genome sequencing

625 Cultures of Viridiraptor invadens strain Virl02, the host of ‘Candidatus Finniella 626 inopinata’, were grown on the filamentous green alga Zygnema pseudogedeanum strain

30

627 CCAC 0199 as described in (Hess and Melkonian 2013). Once the algal food was 628 depleted, Viridiraptor cells were harvested by filtration through a cell strainer (mesh size 629 40 µm to remove algal cell walls) and centrifugation (~1,000 g for 15 min). For short- 630 read sequencing, DNA extraction of total gDNA was carried out with the ZR 631 Fungal/Bacterial DNA MicroPrep Kit (Zymo Research) using a BIO101/Savant FastPrep 632 FP120 high-speed bead beater and 20 uL of proteinase K (20 mg/mL). A sequencing 633 library was made using the NEBNext Ultra II DNA Library Prep Kit (New England 634 Biolabs). Paired-end DNA sequencing libraries were sequenced with an Illumina MiSeq 635 instrument (Dalhousie University; Canada). (number of reads=3,006,282, read 636 length=150 bp). For long-read sequencing, DNA extraction was performed using a 637 CTAB and phenol-chloroform method. Total gDNA was further cleaned through a 638 QIAGEN Genomic-Tip 20/G. A sequencing library was made using the Nanopore 639 Ligation Sequencing Kit 1D (SQK-LSK108). Sequencing was done on a portable 640 MinION instrument (Oxford Nanopore Technologies). (total bases=191,942,801 bp, 641 number of reads=73,926, longest read=32,236 bp, mean read length=2,596 bp, mean 642 read quality=9.4).

643 Peranema trichophorum strain CCAP 1260/1B was obtained from the Culture Collection 644 of Algae and (CCAP, Oban, Scotland) and grown in liquid Knop media plus 645 yolk crystals. Total gDNA was extracted following Lang and Burger (2007). A 646 paired-end sequencing library was made using a TruSeq DNA Library Prep Kit 647 (Illumina). DNA sequencing libraries were sequenced with an Illumina MiSeq instrument 648 (Genome Quebec Innovation Centre; Canada). (number of reads=4,157,475, read 649 length=300 bp).

650 Stachyamoeba lipophora strain ATCC 50324 cells feeding on were 651 harvested and then broken up with pestle and mortar in the presence of glass beads (< 652 450 µm diameter). Total gDNA was extracted using the QIAGEN Genomic G20 Kit. A 653 paired-end sequencing library was made using a TruSeq DNA Library Prep Kit 654 (Illumina). DNA sequencing libraries were sequenced with an Illumina MiSeq instrument 655 (Genome Quebec Innovation Centre; Canada). (number of reads=35,605,415, read 656 length=100 bp).

31

657 Genome assembly and annotation

658 Short sequencing reads produced in an Illumina MiSeq from Viridiraptor invadens, 659 Peranema trichophorum, and Stachyamoeba lipophora were first assessed with 660 FASTQC v0.11.6 and then, based on its reports, trimmed with Trimmomatic v0.32 661 (Bolger, Lohse, and Usadel 2014) using the options: HEADCROP:16 LEADING:30 662 TRAILING:30 MINLEN:36. Illumina adapters were similarly removed with Trimmomatic 663 v0.32 using the option ILLUMINACLIP. Long sequencing reads produced in a Nanopore 664 MinION instrument from Viridiraptor invadens were basecalled with Albacore v2.1.7, 665 adapters were removed with Porechop v0.2.3, lambda phage reads were removed with 666 NanoLyse v0.5.1, quality filtering was done with NanoFilt v2.0.0 (with the options ‘-- 667 headcrop 50 -q 8 -l 1000’), and identity filtering against the high-quality short Illumina 668 reads was done with Filtlong v0.2.0 (and the options ‘--keep_percent 90 --trim --split 500 669 --length_weight 10 --min_length 1000’). Statistics were calculated throughout the read 670 processing workflow with NanoStat v0.8.1 and NanoPlot v1.9.1. A hybrid co-assembly 671 of both processed Illumina short reads and Nanopore long reads from Viridiraptor 672 invadens was done with SPAdes v3.6.2 (Bankevich et al. 2012). Assemblies of the 673 Illumina short reads from Peranema trichophorum and Stachyamobea lipophora were 674 separately done with SPAdes v3.6.2 (Bankevich et al. 2012). The resulting assemblies 675 for both Viridiraptor invadens and Peranema trichophorum were later separately 676 processed with the Anvi’o v2.4.0 pipeline (Eren et al. 2015) and refined genome bins 677 corresponding to ‘Candidatus Finniella inopinata’ and the Peranema-associated 678 rickettsialean were isolated primarily based on tetranucleotide sequence composition 679 and taxonomic affiliation of its contigs. A single contig corresponding to the genome of 680 the Stachyamoeba-associated rickettsialean was obtained from its assembly and this 681 was circularized by collapsing the overlapping ends of the contig. Gene prediction and 682 genome annotation was carried out with Prokka v.1.13 (see Table 1).

683 Dataset assembly (taxon and gene selection)

684 The selection of 120 taxa was largely based on the phylogenetically diverse set of 685 alphaproteobacteria determined by Wang and Wu (2015). To this set of taxa, recently 686 sequenced and divergent unaffiliated alphaproteobacteria were added, as well as those

32

687 claimed to constitute novel order-level taxa. Some other groups, like the 688 Pelagibacterales, Rhodospirillales and the Holosporales, were expanded to better 689 represent their diversity. A set of four and four 690 gammaproteobacterial were used as outgroup (see Figure 2-figure supplement 6 for 691 taxon names; see Supplementary file 2-table S3 for accession numbers).

692 A set of 200 gene markers (54,400 sites; 9.03% missing data, see Figure 2-figure 693 supplement 6) defined by Phyla-AMPHORA was used (Wang and Wu 2013). The genes 694 are single-copy and predominantly vertically-inherited as assessed by congruence 695 among them (Wang and Wu 2013). In brief, Phyla-AMPHORA searches for each 696 marker gene using a profile Hidden Markov Model (HMM), then aligns the best hits to 697 the profile HMM using hmmalign of the HMMER suite, and then trims the alignments 698 using pre-computed quality scores (the mask) previously generated using the 699 probabilistic masking program ZORRO (Wu, Chatterji, and Eisen 2012; Wang and Wu 700 2013). Phylogenetic trees for each marker gene were inferred from the trimmed multiple 701 alignments in IQ-TREE v1.5.5 (Minh, Nguyen, and von Haeseler 2013; Nguyen et al. 702 2015) and under the model LG4X+F model. Single-gene trees were examined 703 individually to remove distant paralogues, contaminants or laterally transferred genes. 704 All this was done before concatenating the single-gene alignments into a supermatrix 705 with SequenceMatrix v 1.8 (Vaidya, Lohman, and Meier 2011). Another smaller dataset 706 of 40 compositionally-homogenous genes (5,570 sites; 5.98% missing data) was built 707 by selecting the least compositionally heterogeneous genes from the larger 200 gene 708 set according compositional homogeneity tests performed in P4 (Foster 2004; see 709 Supplementary file 2-table S4 for a list of the 40 most compositionally homogenous 710 genes). This was done as an alternative way to overcome the strong compositional 711 heterogeneity observed in datasets for the Alphaproteobacteria with a broad selection of 712 taxa. In brief, the P4 tests rely on simulations based on a provided tree (here inferred for 713 each gene under the model LG4X+F in IQ-TREE) and a model (LG+F+G4 available in 714 P4) to obtain proper null distributions to which to compare the X2 statistic. Most 715 standard tests for compositional homogeneity (those that do not rely on simulate the 716 data on a given tree) ignore correlation due to phylogenetic relatedness, and can suffer 717 from a high probability of false negatives (Foster 2004).

33

718 Variations of our full set were made to specifically assess the placement of each long- 719 branching and compositionally-biased group individually. In other words, each group 720 with comparatively long branches (the Rickettsiales, Pelagibacterales, Holosporales, 721 and alphaproteobacterium sp. HIMB59) was analyzed in isolation, i.e., in the absence of 722 other long-branching and compositionally-biased taxa. This was done with the purpose 723 of reducing the potential artefactual attraction among these groups. Taxon removal was 724 done in addition to compositionally-biased site removal and data recoding into reduced 725 character-state alphabets (for a summary of the different methodological strategies 726 employed see Figure 2-figure supplement 2).

727 Removal of compositionally-biased and fast-evolving sites

728 As an effort to reduce artefacts in phylogenetic inference from our dataset (which might 729 stem from extreme divergence in the evolution of the Alphaproteobacteria), we removed 730 sites estimated to be highly compositionally heterogeneous or fast evolving. The 731 compositional heterogeneity of a site was estimated by using a metric intended to 732 measure the degree of disparity between the most %AT-rich taxa and all others. Taxa 733 were ordered from lowest to highest proteome GARP:FIMNKY ratios; ‘GARP’ amino 734 acids are encoded by %GC-rich codons, whereas ‘FIMNKY’ amino acids are encoded 735 by %AT-rich codons. The resulting plot was visually inspected and a GARP:FIMNKY 736 ratio cutoff of 1.06 (which represented a discontinuity or gap in the distribution which 737 separated the long-branching and compositionally-biased taxa Pelagibacterales, 738 Holosporales and Rickettsiales from all others) was chosen to divide the dataset into 739 low GARP:FMINKY (or %AT-rich) and higher GARP:FIMNKY (or ‘GC-rich’) taxa (Figure 740 2-figure supplement 7). Next, we determined the degree of compositional bias per site 741 (ɀ) for the frequencies of both FIMNKY and GARP amino acids between the %AT-rich 742 and all other (‘GC-rich’) alphaproteobacteria. To calculate this metric for each site the 743 following formula was used:

ɀ = (휋퐹퐼푀푁퐾푌%AT-rich − 휋퐹퐼푀푁퐾푌%GC-rich) + (휋퐺퐴푅푃%GC-rich − 휋퐺퐴푅푃%AT-rich)

744 where 휋퐹퐼푀푁퐾푌 and 휋퐺퐴푅푃 are the sum of the frequencies for FIMNKY and GARP 745 amino acids at a site, respectively, for either ‘%AT-rich’ or ‘%GC-rich’ taxa. According to 746 this metric, higher values measure a greater disparity between %AT-rich 34

747 alphaproteobacteria and all others; a measure of compositional heterogeneity or bias 748 per site. The most compositionally heterogeneous sites according to ɀ were 749 progressively removed using the software SiteStripper (Verbruggen 2018) in increments 750 of 10%. We also progressively removed the fastest evolving sites in increments of 10%. 751 Conditional mean site rates were estimated under the LG+C60+F+R6 model in IQ- 752 TREE v1.5.5 using the ‘-wsr’ flag (Nguyen et al. 2015).

753 Data recoding

754 Our datasets were recoded into four- and six-character state amino acid alphabets 755 using dataset-specific recoding schemes aimed at minimizing compositional 756 heterogeneity in the data (Susko and Roger 2007). The program minmax-chisq, 757 which implements the methods of Susko and Roger (2007), was used to find the best 758 recoding schemes—please see Figure 3, Figure 2-figure supplement 4 and Figure 3- 759 figure supplement 1-6, and Figure 3-figure supplement 8 legends for the specific 760 recoding schemes used for each dataset. The approach uses the chi-squared (X2) 761 statistic for a test of homogeneity of frequencies as a criterion function for determining

762 the best recoding schemes. Let 휋푖 denote the frequency of bin 푖 for the recoding 763 scheme currently under consideration. For instance, suppose the amino acids were 764 recoded into four bins,

765 RNCM EHIPTWV ADQLKS GFY.

766 Then 휋4would be the frequency with which the amino acids G, F or Y were observed. 2 767 Let 휋푖푠 be the frequency of bin 푖 for the 푠th taxa. Then the X statistic for the null 768 hypothesis that the frequencies are constant, over taxa, against the unrestricted 769 hypothesis is

2 푡푠 = ∑ (휋푖푠 − 휋푖) /휋푖 푖푠

770 The X2 statistic provides a measure of how different the frequencies for the 푠th taxa are

771 from the average frequencies. The maximum 푡푠 over 푠 is taken as an overall measure of 772 how heterogeneous the frequencies are for a given recoding scheme. The minmax-

35

773 chisq program searches through recoding schemes, moving amino acids from one bin

774 to another, to try to minimize the 푚푎푥 푡푠 (Susko and Roger 2007).

775 Phylogenetic inference

776 The inference of phylogenies was primarily done under the maximum likelihood 777 framework and using IQ-TREE v1.5.5 (Minh, Nguyen, and von Haeseler 2013; Nguyen 778 et al. 2015). ModelFinder in IQ-TREE v1.5.5 (Kalyaanamoorthy et al. 2017) was used to 779 assess the best-fitting amino acid empirical matrix (e.g., JTT, WAG, and LG), on a 780 maximum-likelihood tree, to our full dataset of 120 taxa and 200 conserved single-copy 781 marker genes (see Supplementary file 2-table S5 and Supplementary file 2-table S6). 782 We first inferred guide trees (for a PMSF analysis) with a model that comprises the LG 783 empirical matrix, with empirical frequencies estimated from the data (F), six rates for the 784 FreeRate model to account for rate heterogeneity across sites (R6), and a mixture 785 model with 60 amino acid profiles (C60) to account for compositional heterogeneity 786 across sites—LG+C60+F+R6. Because the computational power and time required to 787 properly explore the whole tree space (given such a big dataset and complex model) 788 was too high, constrained tree searches were employed to obtain these initial guide 789 trees (see Figure 2-figure supplement 6 for the constraint tree). Many shallow nodes 790 were constrained if they received maximum UFBoot and SH-aLRT support in a 791 LG+PMSF(C60)+F+R6 analysis. All deep nodes, those relevant to the questions 792 addressed here, were left unconstrained (Figure 2-figure supplement 6). The guide 793 trees were then used together with a dataset-specific mixture model ES60 to estimate 794 site-specific amino acid profiles, or a PMSF (Posterior Mean Site Frequency Profiles) 795 model, that best account for compositional heterogeneity across sites (Wang et al. 796 2018). The dataset-specific empirical mixture model ES60 also has 60 categories but, 797 unlike the general C60, was directly estimated from our large dataset of 200 genes and 798 120 alphaproteobacteria (and outgroup) using the methods described in (Susko, 799 Lincker, and Roger 2018; ModelFinder (Kalyaanamoorthy et al. 2017) suggests that the 800 LG+ES60+F+R6 model is the best-fitting model; the R6 model component, however, 801 considerably increases computational burden; see Supplementary file 2-table S6 and 802 Supplementary file 2-table S7). Final trees were inferred using the

36

803 LG+PMSF(ES60)+F+R6 model and a fully unconstrained tree search. Those datasets 804 that produced the most novel topologies under maximum likelihood were further 805 analyzed under a Bayesian framework using PhyloBayes MPI v1.7 and the CAT- 806 Poisson+Γ4 model (Lartillot and Philippe 2004; Lartillot, Lepage, and Blanquart 2009). 807 This model allows for a very large number of classes to account for compositional 808 heterogeneity across sites and, unlike in the more complex CAT-GTR+Γ4 model, also 809 allows for convergence to be more easily achieved between MCMC chains. PhyloBayes 810 MCMC chains were run for at least 10,000 cycles until convergence between the chains 811 was achieved and the largest discrepancy (i.e., maxdiff parameter) was ≤ 0.4 (except 812 for the untreated dataset analyzed in Figure 2-figure supplement 3A; see 813 Supplementary file 2-table S8 for several summary statistics for each PhyloBayes 814 MCMC chain, including discrepancy and effective sample size values). A consensus 815 tree was generated from two PhyloBayes MCMC chains using a burn-in of 500 trees 816 and sub-sampling every 10 trees.

817 Phylogenetic analyses of recoded datasets into four-character state alphabets were 818 analyzed using IQ-TREE v1.5.5 and the model GTR+ES60S4+F+R6. ES60S4 is an 819 adaptation of the dataset-specific empirical mixture model ES60 to four-character 820 states. It is obtained by adding the frequencies of the amino acids that belong to each 821 bin in the dataset-specific four-character state scheme S4 (see Data Recoding for 822 details). analyses of recoded datasets into six-character state alphabets 823 were analyzed using PhyloBayes MPI v1.7 and the CAT-Poisson+Γ4 model. Maximum- 824 likelihood analyses with a six-state recoding scheme could not be performed because 825 IQ-TREE currently only supports amino acid datasets recoded into four-character 826 states.

827 Other analyses

828 The 16S rRNA genes of ‘Candidatus Finniella inopinata’, and the presumed 829 endosymbionts of Peranema trichophorum and Stachyamoeba lipophora were identified 830 with RNAmmer 1.2 server and BLAST searches. A set of 16S rRNA genes for diverse 831 rickettsialeans and holosporaleans, and other alphaproteobacteria as outgroup, were 832 retrieved from NCBI GenBank. The selection was based on Hess et al., (2016), Szokoli

37

833 et al., (2016) and Wang and Wu (2015). Environmental sequences for uncultured and 834 undescribed rickettsialeans were retrieved by keeping the 50 best hits resulting from a 835 BLAST search of our three novel 16S rRNA genes against the NCBI GenBank non- 836 redundant (nr) database. The sequences were aligned with the SILVA aligner SINA 837 v1.2.11 and all-gap sites were later removed. Phylogenetic analyses on this alignment 838 were performed on IQ-TREE v1.5.5 using the GTR+F+R8 model.

839 A UPGMA (average-linkage) clustering of amino acid compositions based on the 200 840 gene set for the Alphaproteobacteria was built in MEGA 7 (Kumar, Stecher, and Tamura 841 2016) from a matrix of Euclidean distances between amino acid compositions of 842 sequences exported from the phylogenetic software P4 (Foster 2004; 843 http://p4.nhm.ac.uk/index.html).

844 Data availability

845 The genome of ‘Candidatus Finniella inopinata’, endosymbiont of Peranema 846 trichophorum strain CCAP 1260/1B and endosymbiont of Stachyamoeba lipophora 847 strain ATCC 50324 were deposited in NCBI GenBank under the BioProject 848 PRJNA501864. Raw sequencing reads were deposited on the NCBI SRA archive under 849 the BioProject PRJNA501864. Multi-gene datasets as well as phylogenetic trees 850 inferred in this study were deposited at Mendeley Data under the DOI: 851 http://dx.doi.org/10.17632/75m68dxd83.1.

852 ACKNOWLEDGEMENTS

853 Sergio A. Muñoz-Gómez is supported by a Killam Predoctoral Scholarship and a Nova 854 Scotia Graduate Scholarship. This work was supported by Natural Sciences and 855 Engineering Research (NSERC) Discovery Grants 2016-06792 to A.J.R, RGPIN/05754- 856 2015 to C.H.S, RGPIN/05286-2014 to G.B., and RGPIN-2017-05411 to B.F.L. We thank 857 Jon Jerlström Hultqvist and Gina Filloramo (both at Dalhousie University) for advice on 858 long-read sequencing with the Nanopore MinION. We also thank Camilo A. Calderón- 859 Acevedo for advice about taxonomic issues, and Franziska Szokoli for reading and 860 commenting on a late version of this manuscript. Bruce Curtis (Dalhousie University) 861 and Peter G. Foster (Natural History Museum of London) kindly provided technical help

38

862 with bioinformatics and with the software P4, respectively. We thank Joanny Roy, 863 Georgette Kiethega, Matus Valach, and Shona Teijeiro (all at the Université de 864 Montréal), and Drahomira Faktora (University of South Bohemia), for help with the 865 culturing, DNA preparation and sequencing of the endosymbiont of Peranema 866 trichophorum. Some of the genome data used in this study were produced by the US 867 Department of Energy Joint Genome Institute (http://www.jgi.doe.gov/) in collaboration 868 with the user community.

869 COMPETING INTERESTS

870 No competing interests declared by the authors.

871 REFERENCES

872 Bankevich, Anton, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin, Alexander 873 S. Kulikov, Valery M. Lesin, et al. 2012. “SPAdes: A New Genome Assembly Algorithm 874 and Its Applications to Single-Cell Sequencing.” Journal of Computational Biology 19 (5): 875 455–77. https://doi.org/10.1089/cmb.2012.0021. 876 Bazylinski, Dennis A., Timothy J. Williams, Christopher T. Lefèvre, Denis Trubitsyn, Jiasong 877 Fang, Terrence J. Beveridge, Bruce M. Moskowitz, et al. 2013. “Magnetovibrio 878 Blakemorei Gen. Nov., Sp. Nov., a Magnetotactic Bacterium (Alphaproteobacteria: 879 Rhodospirillaceae) Isolated from a Salt Marsh.” International Journal of Systematic and 880 Evolutionary 63 (5): 1824–33. https://doi.org/10.1099/ijs.0.044453-0. 881 Bolger, Anthony M., Marc Lohse, and Bjoern Usadel. 2014. “Trimmomatic: A Flexible Trimmer 882 for Illumina Sequence Data.” Bioinformatics 30 (15): 2114–20. 883 https://doi.org/10.1093/bioinformatics/btu170. 884 Boscaro, Vittorio, Sergei I. Fokin, Martina Schrallhammer, Michael Schweikert, and Giulio 885 Petroni. 2013. “Revised Systematics of Holospora-like Bacteria and Characterization of 886 ‘Candidatus Gortzia Infectiva’, a Novel Macronuclear Symbiont of 887 Jenningsi.” Microbial Ecology 65 (1): 255–67. https://doi.org/10.1007/s00248-012-0110- 888 2. 889 Brindefalk, Björn, Thijs J. G. Ettema, Johan Viklund, Mikael Thollesson, and Siv G. E. 890 Andersson. 2011. “A Phylometagenomic Exploration of Oceanic Alphaproteobacteria 891 Reveals Mitochondrial Relatives Unrelated to the SAR11 Clade.” PLoS ONE 6 (9): 892 e24457. https://doi.org/10.1371/journal.pone.0024457. 893 Castelli, Michele, Elena Sabaneyeva, Olivia Lanzoni, Natalia Lebedeva, Anna Maria Floriano, 894 Stefano Gaiarsa, Konstantin Benken, et al. 2018. “The Extracellular Association of the 895 Bacterium ‘Candidatus Deianiraea Vastatrix’ with the Ciliate Paramecium Suggests an 896 Alternative Scenario for the Evolution of Rickettsiales.” BioRxiv, November, 479196. 897 https://doi.org/10.1101/479196. 898 Cho, Jang-Cheon, and Stephen J. Giovannoni. 2003. “ Bermudensis Gen. Nov., 899 Sp. Nov., a Marine Bacterium That Forms a Deep Branch in the Alpha-Proteobacteria.” 900 International Journal of Systematic and Evolutionary Microbiology 53 (Pt 4): 1031–36. 901 https://doi.org/10.1099/ijs.0.02566-0. 902 Eren, A. Murat, Özcan C. Esen, Christopher Quince, Joseph H. Vineis, Hilary G. Morrison, 903 Mitchell L. Sogin, and Tom O. Delmont. 2015. “Anvi’o: An Advanced Analysis and

39

904 Visualization Platform for ‘omics Data.” PeerJ 3 (October): e1319. 905 https://doi.org/10.7717/peerj.1319. 906 Eschbach, Erik, Martin Pfannkuchen, Michael Schweikert, Denja Drutschmann, Franz Brümmer, 907 Sergei Fokin, Wolfgang Ludwig, and Hans-Dieter Görtz. 2009. “‘Candidatus 908 Paraholospora Nucleivisitans’, an Intracellular Bacterium in Paramecium Sexaurelia 909 Shuttles between the Cytoplasm and the Nucleus of Its Host.” Systematic and Applied 910 Microbiology 32 (7): 490–500. https://doi.org/10.1016/j.syapm.2009.07.004. 911 Esser, Christian, William Martin, and Tal Dagan. 2007. “The Origin of Mitochondria in Light of a 912 Fluid Prokaryotic Chromosome Model.” Biology Letters 3 (2): 180–84. 913 https://doi.org/10.1098/rsbl.2006.0582. 914 Ettema, Thijs J. G., and Siv G. E. Andersson. 2009. “The Alpha-Proteobacteria: The Darwin 915 Finches of the Bacterial World.” Biology Letters 5 (3): 429–32. 916 https://doi.org/10.1098/rsbl.2008.0793. 917 Ferla, Matteo P., J. Cameron Thrash, Stephen J. Giovannoni, and Wayne M. Patrick. 2013. 918 “New RRNA Gene-Based Phylogenies of the Alphaproteobacteria Provide Perspective 919 on Major Groups, Mitochondrial Ancestry and Phylogenetic Instability.” PLoS ONE 8 920 (12): e83383. https://doi.org/10.1371/journal.pone.0083383. 921 Fitzpatrick, David A., Christopher J. Creevey, and James O. McInerney. 2006. “Genome 922 Phylogenies Indicate a Meaningful Alpha-Proteobacterial Phylogeny and Support a 923 Grouping of the Mitochondria with the Rickettsiales.” Molecular Biology and Evolution 23 924 (1): 74–85. https://doi.org/10.1093/molbev/msj009. 925 Foesel, Bärbel U., Anita S. Gössner, Harold L. Drake, and Andreas Schramm. 2007. 926 “Geminicoccus Roseus Gen. Nov., Sp. Nov., an Aerobic Phototrophic 927 Alphaproteobacterium Isolated from a Marine Aquaculture Biofilter.” Systematic and 928 Applied Microbiology 30 (8): 581–86. https://doi.org/10.1016/j.syapm.2007.05.005. 929 Foster, Peter G. 2004. “Modeling Compositional Heterogeneity.” Systematic Biology 53 (3): 930 485–95. 931 Garrity, George, ed. 2005. Bergey’s Manual of Systematic Bacteriology: Volume 2 : The 932 Proteobacteria. 2nd ed. Springer US. //www.springer.com/gp/book/9780387950402. 933 Georgiades, Kalliopi, Mohammed-Amine Madoui, Phuong Le, Catherine Robert, and Didier 934 Raoult. 2011. “Phylogenomic Analysis of Odyssella Thessalonicensis Fortifies the 935 Common Origin of Rickettsiales, Pelagibacter Ubique and Reclimonas Americana 936 .” PLoS ONE 6 (9): e24857. https://doi.org/10.1371/journal.pone.0024857. 937 Giovannoni, Stephen J. 2017. “SAR11 Bacteria: The Most Abundant Plankton in the Oceans.” 938 Annual Review of Marine Science 9 (January): 231–55. https://doi.org/10.1146/annurev- 939 marine-010814-015934. 940 Grote, Jana, J. Cameron Thrash, Megan J. Huggett, Zachary C. Landry, Paul Carini, Stephen J. 941 Giovannoni, and Michael S. Rappé. 2012. “Streamlining and Core Genome 942 Conservation among Highly Divergent Members of the SAR11 Clade.” MBio 3 (5): 943 e00252-12. https://doi.org/10.1128/mBio.00252-12. 944 Harbison, Austin B., Laiken E. Price, Michael D. Flythe, and Suzanna L. Bräuer. 2017. 945 “Micropepsis Pineolensis Gen. Nov., Sp. Nov., a Mildly Acidophilic 946 Alphaproteobacterium Isolated from a Poor Fen, and Proposal of Micropepsaceae Fam. 947 Nov. within Micropepsales Ord. Nov.” International Journal of Systematic and 948 Evolutionary Microbiology 67 (4): 839–44. https://doi.org/10.1099/ijsem.0.001681. 949 Heiss, Aaron A., Martin Kolisko, Fleming Ekelund, Matthew W. Brown, Andrew J. Roger, and 950 Alastair G. B. Simpson. 2018. “Combined Morphological and Phylogenomic Re- 951 Examination of Malawimonads, a Critical Taxon for Inferring the Evolutionary History of 952 Eukaryotes.” Royal Society Open Science 5 (4): 171707. 953 https://doi.org/10.1098/rsos.171707.

40

954 Hess, Sebastian, and Michael Melkonian. 2013. “The Mystery of Clade X: Orciraptor Gen. Nov. 955 and Viridiraptor Gen. Nov. Are Highly Specialised, Algivorous Amoeboflagellates 956 (Glissomonadida, Cercozoa).” 164 (5): 706–47. 957 https://doi.org/10.1016/j.protis.2013.07.003. 958 Hess, Sebastian, Andreas Suthaus, and Michael Melkonian. 2015. “‘Candidatus Finniella’ 959 (Rickettsiales, Alphaproteobacteria), Novel Endosymbionts of Viridiraptorid 960 Amoeboflagellates (Cercozoa, Rhizaria).” Applied and Environmental Microbiology 82 961 (2): 659–70. https://doi.org/10.1128/AEM.02680-15. 962 Kalyaanamoorthy, Subha, Bui Quang Minh, Thomas K. F. Wong, Arndt von Haeseler, and Lars 963 S. Jermiin. 2017. “ModelFinder: Fast Model Selection for Accurate Phylogenetic 964 Estimates.” Nature Methods 14 (6): 587–89. https://doi.org/10.1038/nmeth.4285. 965 Kumar, Sudhir, Glen Stecher, and Koichiro Tamura. 2016. “MEGA7: Molecular Evolutionary 966 Genetics Analysis Version 7.0 for Bigger Datasets.” Molecular Biology and Evolution 33 967 (7): 1870–74. https://doi.org/10.1093/molbev/msw054. 968 Kurahashi, Midori, Yukiyo Fukunaga, Shigeaki Harayama, and Akira Yokota. 2008. “Sneathiella 969 Glossodoripedis Sp. Nov., a Marine Alphaproteobacterium Isolated from the Nudibranch 970 Glossodoris Cincta, and Proposal of Sneathiellales Ord. Nov. and Sneathiellaceae Fam. 971 Nov.” International Journal of Systematic and Evolutionary Microbiology 58 (Pt 3): 548– 972 52. https://doi.org/10.1099/ijs.0.65328-0. 973 Kwon, Kae Kyoung, Hee-Soon Lee, Sung Hyun Yang, and Sang-Jin Kim. 2005. “Kordiimonas 974 Gwangyangensis Gen. Nov., Sp. Nov., a Marine Bacterium Isolated from Marine 975 Sediments That Forms a Distinct Phyletic Lineage (Kordiimonadales Ord. Nov.) in the 976 ‘Alphaproteobacteria.’” International Journal of Systematic and Evolutionary Microbiology 977 55 (5): 2033–37. https://doi.org/10.1099/ijs.0.63684-0. 978 Lang, B. Franz, and Gertraud Burger. 2007. “Purification of Mitochondrial and Plastid DNA.” 979 Nature Protocols 2 (3): 652–60. https://doi.org/10.1038/nprot.2007.58. 980 Lartillot, Nicolas, Thomas Lepage, and Samuel Blanquart. 2009. “PhyloBayes 3: A Bayesian 981 Software Package for Phylogenetic Reconstruction and Molecular Dating.” 982 Bioinformatics 25 (17): 2286–88. https://doi.org/10.1093/bioinformatics/btp368. 983 Lartillot, Nicolas, and Hervé Philippe. 2004. “A Bayesian Mixture Model for Across-Site 984 Heterogeneities in the Amino-Acid Replacement Process.” Molecular Biology and 985 Evolution 21 (6): 1095–1109. https://doi.org/10.1093/molbev/msh112. 986 Lee, Kyung-Bum, Chi-Te Liu, Yojiro Anzai, Hongik Kim, Toshihiro Aono, and Hiroshi Oyaizu. 987 2005. “The Hierarchical System of the ‘Alphaproteobacteria’: Description of 988 Fam. Nov., Fam. Nov. and 989 Fam. Nov.” International Journal of Systematic and Evolutionary Microbiology 55 (Pt 5): 990 1907–19. https://doi.org/10.1099/ijs.0.63663-0. 991 Luo, Haiwei. 2015. “Evolutionary Origin of a Streamlined Marine Bacterioplankton Lineage.” The 992 ISME Journal 9 (6): 1423–33. https://doi.org/10.1038/ismej.2014.227. 993 Luo, Haiwei, Miklós Csűros, Austin L. Hughes, and Mary Ann Moran. 2013. “Evolution of 994 Divergent Life History Strategies in Marine Alphaproteobacteria.” MBio 4 (4): e00373-13. 995 https://doi.org/10.1128/mBio.00373-13. 996 Madigan, Michael T., Deborah O. Jung, and Michael T. Madigan. 2009. “An Overview of Purple 997 Bacteria: Systematics, , and Habitats.” In The Purple Phototrophic Bacteria, 998 1–15. Advances in Photosynthesis and Respiration. Springer, Dordrecht. 999 https://doi.org/10.1007/978-1-4020-8815-5_1. 1000 Martijn, Joran, Julian Vosseberg, Lionel Guy, Pierre Offre, and Thijs J. G. Ettema. 2018. “Deep 1001 Mitochondrial Origin Outside the Sampled Alphaproteobacteria.” Nature 557 (7703): 1002 101–5. https://doi.org/10.1038/s41586-018-0059-5.

41

1003 Minh, Bui Quang, Minh Anh Thi Nguyen, and Arndt von Haeseler. 2013. “Ultrafast 1004 Approximation for Phylogenetic Bootstrap.” Molecular Biology and Evolution 30 (5): 1005 1188–95. https://doi.org/10.1093/molbev/mst024. 1006 Montagna, Matteo, Davide Sassera, Sara Epis, Chiara Bazzocchi, Claudia Vannini, Nathan Lo, 1007 Luciano Sacchi, Takema Fukatsu, Giulio Petroni, and Claudio Bandi. 2013. “‘Candidatus 1008 Midichloriaceae’ Fam. Nov. (Rickettsiales), an Ecologically Widespread Clade of 1009 Intracellular Alphaproteobacteria.” Applied and Environmental Microbiology 79 (10): 1010 3241–48. https://doi.org/10.1128/AEM.03971-12. 1011 Nguyen, Lam-Tung, Heiko A. Schmidt, Arndt von Haeseler, and Bui Quang Minh. 2015. “IQ- 1012 TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood 1013 Phylogenies.” Molecular Biology and Evolution 32 (1): 268–74. 1014 https://doi.org/10.1093/molbev/msu300. 1015 Parks, Donovan H., Maria Chuvochina, David W. Waite, Christian Rinke, Adam Skarshewski, 1016 Pierre-Alain Chaumeil, and Philip Hugenholtz. 2018. “A Standardized Bacterial 1017 Taxonomy Based on Genome Phylogeny Substantially Revises the Tree of Life.” Nature 1018 Biotechnology 36 (10): 996–1004. https://doi.org/10.1038/nbt.4229. 1019 Proença, Diogo N., William B. Whitman, Neha Varghese, Nicole Shapiro, Tanja Woyke, Nikos 1020 C. Kyrpides, and Paula V. Morais. 2017. “Arboriscoccus Pini Gen. Nov., Sp. Nov., an 1021 Endophyte from a Pine Tree of the Class Alphaproteobacteria, Emended Description of 1022 Geminicoccus Roseus, and Proposal of Geminicoccaceae Fam. Nov.” Systematic and 1023 Applied Microbiology, December. https://doi.org/10.1016/j.syapm.2017.11.006. 1024 Rodríguez-Ezpeleta, Naiara, and T. Martin Embley. 2012. “The SAR11 Group of Alpha- 1025 Proteobacteria Is Not Related to the Origin of Mitochondria.” PLoS ONE 7 (1): e30520. 1026 https://doi.org/10.1371/journal.pone.0030520. 1027 Roger, Andrew J., Sergio A. Muñoz-Gómez, and Ryoma Kamikawa. 2017. “The Origin and 1028 Diversification of Mitochondria.” Current Biology 27 (21): R1177–92. 1029 https://doi.org/10.1016/j.cub.2017.09.015. 1030 Rosenberg, Eugene, Edward F. DeLong, Stephen Lory, Erko Stackebrandt, and Fabiano 1031 Thompson, eds. 2014. The Prokaryotes: Alphaproteobacteria and Betaproteobacteria. 1032 4th ed. Berlin Heidelberg: Springer-Verlag. 1033 //www.springer.com/gp/book/9783642301964. 1034 Santos, Huarrisson Azevedo, and Carlos Luiz Massard. 2014. “The Family Holosporaceae.” In 1035 The Prokaryotes: Alphaproteobacteria and Betaproteobacteria, edited by Eugene 1036 Rosenberg, Edward F. DeLong, Stephen Lory, Erko Stackebrandt, and Fabiano 1037 Thompson, 237–46. Berlin, Heidelberg: Springer Berlin Heidelberg. 1038 https://doi.org/10.1007/978-3-642-30197-1_264. 1039 Sassera, Davide, Nathan Lo, Sara Epis, Giuseppe D’Auria, Matteo Montagna, Francesco 1040 Comandatore, David Horner, et al. 2011. “Phylogenomic Evidence for the Presence of a 1041 and Cbb3 Oxidase in the Free-Living Mitochondrial Ancestor.” Molecular 1042 Biology and Evolution 28 (12): 3285–96. https://doi.org/10.1093/molbev/msr159. 1043 Susko, Edward, Léa Lincker, and Andrew J. Roger. 2018. “Accelerated Estimation of Frequency 1044 Classes in Site-Heterogeneous Profile Mixture Models.” Molecular Biology and Evolution 1045 may026. https://doi.org/10.1093/molbev/msy026. 1046 Susko, Edward, and Andrew J. Roger. 2007. “On Reduced Amino Acid Alphabets for 1047 Phylogenetic Inference.” Molecular Biology and Evolution 24 (9): 2139–50. 1048 https://doi.org/10.1093/molbev/msm144. 1049 Szokoli, Franziska, Michele Castelli, Elena Sabaneyeva, Martina Schrallhammer, Sascha 1050 Krenek, Thomas G. Doak, Thomas U. Berendonk, and Giulio Petroni. 2016. 1051 “Disentangling the Taxonomy of Rickettsiales and Description of Two Novel Symbionts 1052 (‘Candidatus Bealeia Paramacronuclearis’ and ‘Candidatus Fokinia Cryptica’) Sharing

42

1053 the Cytoplasm of the Ciliate Protist .” Applied and Environmental 1054 Microbiology 82 (24): 7236–47. https://doi.org/10.1128/AEM.02284-16. 1055 Thrash, J. Cameron, Alex Boyd, Megan J. Huggett, Jana Grote, Paul Carini, Ryan J. Yoder, 1056 Barbara Robbertse, Joseph W. Spatafora, Michael S. Rappé, and Stephen J. 1057 Giovannoni. 2011. “Phylogenomic Evidence for a Common Ancestor of Mitochondria 1058 and the SAR11 Clade.” Scientific Reports 1 (June): 13. 1059 https://doi.org/10.1038/srep00013. 1060 Vaidya, Gaurav, David J Lohman, and Rudolf Meier. 2011. “SequenceMatrix: Concatenation 1061 Software for the Fast Assembly of Multi-gene Datasets with Character Set and Codon 1062 Information.” Cladistics 27 (2): 171–80. https://doi.org/10.1111/j.1096- 1063 0031.2010.00329.x. 1064 Venkata Ramana, V., S. Kalyana Chakravarthy, E. V. V. Ramaprasad, V. Thiel, J. F. Imhoff, Ch 1065 Sasikala, and Ch V. Ramana. 2013. “Emended Description of the Genus 1066 Rhodothalassium Imhoff et Al., 1998 and Proposal of Rhodothalassiaceae Fam. Nov. 1067 and Rhodothalassiales Ord. Nov.” Systematic and Applied Microbiology 36 (1): 28–32. 1068 https://doi.org/10.1016/j.syapm.2012.09.003. 1069 Verbruggen, Heroen. 2018. “SiteStripper Version 1.03.” http://www.phycoweb.net/software. 1070 Viklund, Johan, Thijs J. G Ettema, and Siv G. E Andersson. 2012. “Independent Genome 1071 Reduction and Phylogenetic Reclassification of the Oceanic SAR11 Clade.” Molecular 1072 Biology and Evolution 29 (2): 599–615. https://doi.org/10.1093/molbev/msr203. 1073 Viklund, Johan, Joran Martijn, Thijs J. G. Ettema, and Siv G. E. Andersson. 2013. “Comparative 1074 and Phylogenomic Evidence That the Alphaproteobacterium HIMB59 Is Not a Member 1075 of the Oceanic SAR11 Clade.” PLoS ONE 8 (11): e78858. 1076 https://doi.org/10.1371/journal.pone.0078858. 1077 Wang, Huai-Chun, Bui Quang Minh, Edward Susko, and Andrew J. Roger. 2018. “Modeling Site 1078 Heterogeneity with Posterior Mean Site Frequency Profiles Accelerates Accurate 1079 Phylogenomic Estimation.” Systematic Biology 67 (2): 216–35. 1080 https://doi.org/10.1093/sysbio/syx068. 1081 Wang, Zhang, and Martin Wu. 2013. “A Phylum-Level Bacterial Phylogenetic Marker Database.” 1082 Molecular Biology and Evolution 30 (6): 1258–62. 1083 https://doi.org/10.1093/molbev/mst059. 1084 ———. 2014. “Phylogenomic Reconstruction Indicates Mitochondrial Ancestor Was an Energy 1085 Parasite.” PLoS ONE 9 (10): e110685. https://doi.org/10.1371/journal.pone.0110685. 1086 ———. 2015. “An Integrated Phylogenomic Approach toward Pinpointing the Origin of 1087 Mitochondria.” Scientific Reports 5 (January): 7949. https://doi.org/10.1038/srep07949. 1088 Whitman, William B., ed. 2015. Bergey’s Manual of Systematics of Archaea and Bacteria. Wiley. 1089 https://onlinelibrary.wiley.com/doi/book/10.1002/9781118960608. 1090 Wiese, Jutta, Vera Thiel, Andrea Gärtner, Rolf Schmaljohann, and Johannes F. Imhoff. 2009. 1091 “Kiloniella Laminariae Gen. Nov., Sp. Nov., an Alphaproteobacterium from the Marine 1092 Macroalga Laminaria Saccharina.” International Journal of Systematic and Evolutionary 1093 Microbiology 59 (2): 350–56. https://doi.org/10.1099/ijs.0.001651-0. 1094 Williams, Kelly P., Bruno W. Sobral, and Allan W. Dickerman. 2007. “A Robust Species Tree for 1095 the Alphaproteobacteria.” Journal of Bacteriology 189 (13): 4578–86. 1096 https://doi.org/10.1128/JB.00269-07. 1097 Williams, Timothy J., Christopher T. Lefèvre, Weidong Zhao, Terry J. Beveridge, and Dennis A. 1098 Bazylinski. 2012. “Magnetospira Thiophila Gen. Nov., Sp. Nov., a Marine Magnetotactic 1099 Bacterium That Represents a Novel Lineage within the Rhodospirillaceae 1100 (Alphaproteobacteria).” International Journal of Systematic and Evolutionary 1101 Microbiology 62 (10): 2443–50. https://doi.org/10.1099/ijs.0.037697-0.

43

1102 Wu, Martin, Sourav Chatterji, and Jonathan A. Eisen. 2012. “Accounting for Alignment 1103 Uncertainty in Phylogenomics.” PloS One 7 (1): e30288. 1104 https://doi.org/10.1371/journal.pone.0030288. 1105

44

1106 SUPPLEMENTARY FIGURE LEGENDS

1107 Figure 2-figure supplement 1. A labeled version showing taxon names for Figure 2. 1108 Branch support values are 100% SH-aLRT and 100% UFBoot unless annotated.

1109 Figure 2-figure supplement 2. A diagram of the strategies and phylogenetic analyses 1110 employed in this study.

1111 Figure 2-figure supplement 3. Bayesian consensus trees inferred with PhyloBayes 1112 MPI v1.7 and the CAT-Poisson+Γ4 model. Branch support values are 1.0 posterior 1113 probabilities unless annotated. A. Bayesian consensus tree inferred from the full dataset 1114 which is highly compositionally heterogeneous. B. Bayesian consensus tree inferred 1115 from a dataset whose compositional heterogeneity has been decreased by removing 1116 50% of the most biased sites according to ɀ. See Figs. 2A and 2B for the most likely 1117 trees inferred in IQ-TREE v1.5.5 and the LG+PMSF(C60)+F+R6 model.

1118 Figure 2-figure supplement 4. Maximum-likelihood trees to assess the placements of 1119 the Holosporales, Rickettsiales, Pelagibacterales and alphaproteobacterium sp. 1120 HIMB59 when all four groups are included. Branch support values are 100% SH-aLRT 1121 and 100% UFBoot unless annotated. A. A tree that results from the analysis of the 1122 untreated dataset. B. A tree that results from the analysis of a dataset from which the 1123 50% most compositionally-biased sites have been removed. C. A tree that results from 1124 the analysis of a dataset that has been recoded into the four-character state recoding 1125 scheme S4 (recoding scheme: RNCM EHIPTWV ADQLKS GFY). D. A tree that results 1126 from the analysis of a dataset that only comprises the 40 most compositionally 1127 homogeneous genes.

1128 Figure 3-figure supplement 1. Maximum-likelihood trees to assess the placement of 1129 the Holosporales in the absence of the Rickettsiales, Pelagibacterales and 1130 alphaproteobacterium sp. HIMB59. Branch support values are 100% SH-aLRT and 1131 100% UFBoot unless annotated. A. A tree that results from the analysis of the untreated 1132 dataset. B. A tree that results from the analysis of a dataset from which the 50% most 1133 compositionally-biased sites have been removed. C. A tree that results from the 1134 analysis of a dataset that has been recoded into the four-character state recoding 1135 scheme S4 (recoding scheme: ARNDQEILKSTV GHY CMFP W). D. A tree that results 1136 from the analysis of a dataset that only comprises the 40 most compositionally 1137 homogeneous genes.

1138 Figure 3-figure supplement 2. Maximum-likelihood trees to assess the placement of 1139 the Rickettsiales in the absence of the Holosporales, Pelagibacterales, and 1140 alphaproteobacterium sp. HIMB59. Branch support values are 100% SH-aLRT and 1141 100% UFBoot unless annotated. A. A tree that results from the analysis of the untreated 1142 dataset. B. A tree that results from the analysis of a dataset from which the 50% most 1143 compositionally-biased sites have been removed. C. A tree that results from the 1144 analysis of a dataset that has been recoded into the four-character state recoding 1145 scheme S4 (recoding scheme: PY RNMF GHLKTW ADCQEISV). D. A tree that results

45

1146 from the analysis of a dataset that only comprises the 40 most compositionally 1147 homogeneous genes.

1148 Figure 3-figure supplement 3. Maximum-likelihood trees to assess the placement of 1149 the Rickettsiales in the absence of the Holosporales, Pelagibacterales, 1150 alphaproteobacterium sp. HIMB59 and the Beta-, and Gammaproteobacteria outgroup. 1151 Branch support values are 100% SH-aLRT and 100% UFBoot unless annotated. A. A 1152 tree that results from the analysis of the untreated dataset. B. A tree that results from 1153 the analysis of a dataset from which the 50% most compositionally-biased sites have 1154 been removed. C. A tree that results from the analysis of a dataset that has been 1155 recoded into the four-character state recoding scheme S4 (recoding scheme: RNMF 1156 GHLKTW ADCQEISV PY). D. A tree that results from the analysis of a dataset that only 1157 comprises the 40 most compositionally homogeneous genes. Branch support values 1158 are 100% SH-aLRT and 100% UFBoot unless annotated.

1159 Figure 3-figure supplement 4. Maximum-likelihood trees to assess the placement of 1160 the Pelagibacterales in the absence of the Holosporales, Rickettsiales and 1161 alphaproteobacterium sp. HIMB59. Branch support values are 100% SH-aLRT and 1162 100% UFBoot unless annotated. A. A tree that results from the analysis of the untreated 1163 dataset. B. A tree that results from the analysis of a dataset from which the 50% most 1164 compositionally-biased sites have been removed. C. A tree that results from the 1165 analysis of a dataset that has been recoded into the four-character state recoding 1166 scheme S4 (recoding scheme: EGIV ARNDQHKMPSY LFT CW). D. A tree that results 1167 from the analysis of a dataset that only comprises the 40 most compositionally 1168 homogeneous genes.

1169 Figure 3-figure supplement 5. Maximum-likelihood trees to assess the placement of 1170 alphaproteobacterium sp. HIMB59 in the absence of the Holosporales, Rickettsiales and 1171 Pelagibacterales. Branch support values are 100% SH-aLRT and 100% UFBoot unless 1172 annotated. A. A tree that results from the analysis of the untreated dataset. B. A tree 1173 that results from the analysis of a dataset from which the 50% most compositionally- 1174 biased sites have been removed. C. A tree that results from the analysis of a dataset 1175 that has been recoded into the four-character state recoding scheme S4 (recoding 1176 scheme: RLKMT ANDQEIPSV CW GHFY). D. A tree that results from the analysis of a 1177 dataset that only comprises the 40 most compositionally homogeneous genes.

1178 Figure 3-figure supplement 6. Bayesian consensus trees inferred with PhyloBayes 1179 MPI v1.7 and the CAT-Poisson+Γ4 model. Branch support values are 1.0 posterior 1180 probabilities unless annotated. A. Bayesian consensus tree inferred to place the 1181 Holosporales in the absence of the Rickettsiales and the Pelagibacterales and when 1182 compositional heterogeneity has been decreased by removing 50% of the most biased 1183 sites according to ɀ. B. Bayesian consensus tree inferred to place the Holosporales in 1184 the absence of the Rickettsiales and the Pelagibacterales and when the data have been 1185 recoded into a four-character state alphabet (the dataset-specific recoding scheme S4: 1186 ARNDQEILKSTV GHY CMFP W) to reduce compositional heterogeneity. See Figs. 2A

46

1187 and 2B for the most likely trees inferred in IQ-TREE v1.5.5 and the 1188 LG+PMSF(C60)+F+R6 and GTR+ES60S4+F+R6 models, respectively.

1189 Figure 3-figure supplement 7. Maximum-likelihood trees to assess the placement of 1190 the Holosporales when the fast-evolving Holospora and ‘Candidatus Hepatobacter’ are 1191 also included in the absence of the Rickettsiales, Pelagibacterales and 1192 alphaproteobacterium sp. HIMB59. Branch support values are 100% SH-aLRT and 1193 100% UFBoot unless annotated.

1194 Figure 3-figure supplement 8. Bayesian consensus tree inferred to place the 1195 Holosporales in the absence of the Pelagibacterales, alphaproteobacterium sp. 1196 HIMB59, and Rickettsiales, and when the data have been recoded into a six-character 1197 state alphabet (the dataset-specific recoding scheme S6: AQEHISV RKMT PY DCLF 1198 NG W) to reduce compositional heterogeneity. Branch support values are 1.0 posterior 1199 probabilities unless annotated.

1200 Figure 2-figure supplement 5. Maximum-likelihood tree from the untreated dataset 1201 from which no taxon has been removed and analyzed under simpler LG4X model. In 1202 this tree, derived from an analysis using a model that does not account for 1203 compositional heterogeneity across sites, the Geminicoccaceae has a more derived 1204 placements within the Rhodospirillales as sister to the Acetobacteraceae.

1205 Figure 2-figure supplement 6. Constraint tree, used for IQ-TREE analyses, labeled 1206 with taxon names and also degree of missing data per taxon. Magnetococcales in gray; 1207 Rickettsiales in brown; Pelagibacterales in maroon; Holosporales in light blue; 1208 Rhizobiales in green; Caulobacterales in ; Rhodobacterales in red; Sneathiellales 1209 in pink; Rhodospirillales in purple; Beta- and Gammaproteobacteria in black.

1210 Figure 2-figure supplement 7. GARP:FIMNKY ratios across the proteomes of the 120 1211 alphaproteobacteria and outgroup used in this study.

1212

1213

1214

1215 SUPPLEMENTARY FILES

1216 Supplementary file 1. A 16S rRNA gene maximum-likelihood tree of the Rickettsiales 1217 and Holosporales that phylogenetically places the three endosymbionts whose 1218 genomes were sequenced in this study: (1) ‘Candidatus Finniella inopinata’ 1219 endosymbiont of Viridiraptor invadens strain Virl02, (2) an alphaproteobacterium 1220 associated with Peranema trichophorum strain CCAP 1260/1B, and (3) an 1221 alphaproteobacterium associated with Stachyamoeba lipophora strain ATCC 50324. 1222 Branch support values are SH-aLRT and UFBoot.

1223

47

1224

1225 Supplementary file 2-table S1. Ultrafast bootstrap (UFBoot) variation for several 1226 clades discussed in this study as compositionally-biased sites, according to ɀ, are 1227 progressively removed in steps of 10%.

1228 Supplementary file 2-table S2. Ultrafast bootstrap (UFBoot) variation for several 1229 clades discussed in this study as the fastest sites are progressively removed in steps of 1230 10%.

1231 Supplementary file 2-table S3. GenBank assembly accession numbers for the 120 1232 alphaproteobacterial and outgroup genomes used in this study.

1233 Supplementary file 2-table S4. A list of the least compositionally heterogeneous genes 1234 out of the 200 single-copy and vertically-inherited genes used in this study.

1235 Supplementary file 2-table S5. Model fit of amino acid replacement matrices as 1236 components of simple models that do not account for compositional heterogeneity 1237 across sites. Models are ordered from lowest to highest BIC. -LnL: log-likelihood; df: 1238 degrees of freedom or number of free parameters; AIC: Akaike information criterion; 1239 AICc: corrected Akaike information criterion; BIC: Bayesian information criterion.

1240 Supplementary file 2-table S6. Model fit of amino acid replacement matrices as 1241 components of complex models that account for compositional heterogeneity across 1242 sites. Models are ordered from lowest to highest BIC. -LnL: log-likelihood; df: degrees of 1243 freedom or number of free parameters; AIC: Akaike information criterion; AICc: 1244 corrected Akaike information criterion; BIC: Bayesian information criterion.

1245 Supplementary file 2-table S7. Model fit of LG+ES60+F for which the model 1246 component that accounts for rate heterogeneity across sites varies. Models are ordered 1247 from lowest to highest BIC. -LnL: log-likelihood; df: degrees of freedom or number of 1248 free parameters; AIC: Akaike information criterion; AICc: corrected Akaike information 1249 criterion; BIC: Bayesian information criterion.

1250 Supplementary file 2-table S8. Several summary statistics for the PhyloBayes MCMC 1251 chains run for each analysis under the CAT-Poisson+Γ4.

48

A. 2.36 B. Rickettsiales Pelagibacterales + alphaproteobacterium sp. HIMB59 80 Holosporales other Alphaproteobacteria

0.45 70

60

50 G+C%

40

Rickettsiales Pelagibacterales + alphaproteobacterium sp. HIMB59 30 Holosporales other Alphaproteobacteria

20 0.006 0.2 0.7 1.2 1.7 2.2 GARP:FIMNKY A. 90/92 B.

Rickettsiales Rickettsiales 78.3/89 Geminicoccaceae 3.8/51 99.4/99 Pelagibacterales 95.9/93

alphaproteobacterium HIMB59 99.9/100 Holosporales Rhizobiales 99.2/98 84.2/88

Caulobacterales 93.5/94

Rhizobiales 98.6/99 Rhodobacterales 100/94

98.6/97 Pelagibacterales Caulobacterales 100/94 93.2/90 Sphingomonadales 100/94 Rhodobacterales Sneathiellales 80.9/84 Sphingomonadales Holosporales 98.6/99 Sneathiellales 100/92 Geminicoccaceae alphaproteobacterium HIMB59 98.7/98 100/93 99.9/100 99.9/100 Rhodospirillales 87.8/94 99.8/99 Rhodospirillales 60.6/87 99.9/100 99.8/100 99.9/100 99.4/98 91.6/93 99.4/100

0.4 0.2 A. B. Geminicoccaceae

Rhizobiales

98.8/99 44.3/89 Caulobacterales

100/100 Rhodobacterales

52.3/57 99.9/100 Sphingomonadales Sneathiellales Geminicoccaceae Sneathiellales Caedibacter sp. 37-49 Caedibacter sp. 37-49

Holosporaceae 99.6/99 96/97 ‘Ca. Paracaedibacter symbiosus’ 99.7/99 ‘Ca. Paracaedibacter symbiosus’ 99.3/100 Micavibrio aeruginosavorus 99.9/99 Micavibrio aeruginosavorus 37.2/93 Rhodocista centenaria Rhodocista centenaria 77.3/83 Azospirillaceae ‘Ca. Puniceispirillum marinum’ ‘Ca. Puniceispirillum marinum’ 98.8/99 100/99

Kiloniella laminariae Rhodovibriaceae Kiloniella laminariae Rhodovibrio salinarum Rhodovibrio salinarum 98/99

Rhodospirillum rubrum 70.2/79 Rhodospirillum rubrum Magnetospirillum magneticum Rhodospirillaceae Magnetospirillum magneticum 13.8/78

100/99 99.8/99 99.4/99 67/87 Rubritepida flocculans Rubritepida flocculans 99.9/100 Acetobacteraceae 96/100 Acidisphaera rubrifaciens Acidisphaera rubrifaciens 92.5/87 99.7/97 14.9/83 90.9/99 12.8/75

100/100

0.2 0.2 Class 1. Alphaproteobacteria Garrity et al. 2005 Subclass 1. Rickettsidae Ferla et al. 2013 emend. Muñoz-Gómez et al. 2018 Order 1. Rickettsiales Gieszczykiewicz 1939 emend. Dumler et al. 2001 Family 1. Anaplasmataceae Philip 1957 Family 2. 'Candidatus Midichloriaceae' Montagna et al. 2013 Family 3. Rickettsiaceae Pinkerton 1936 Subclass 2. Caulobacteridae Ferla et al. 2013 emend. Muñoz-Gómez et al. 2018 Order 1. Rhodospirillales Pfennig and Truper 1971 emend. Muñoz-Gómez et al. 2018 Family 1. Acetobacteriacae ex Henrici 1939 Gillis and De Ley 1980 Family 2. Rhodospirillaceae Pfennig and Truper 1971 emend. Muñoz-Gómez et al. 2018 Family 3. Azospirillaceae fam. nov. Muñoz-Gómez et al. 2018 Family 4. Holosporaceae Szokoli et al. 2016 Family 5. Rhodovibriaceae fam. nov. Muñoz-Gómez et al. 2018 Family 6. Geminicoccaceae Proença et al. 2017 Order 2. Sneathiellales Kurahashi et al. 2008 Order 3. Sphingomonadales Yabuuchi et al. 1990 Order 4. Pelagibacterales Grote et al. 2012 Order 5. Rhodobacterales Garrity et al. 2006 Order 6. Caulobacterales Henrici and Johnson 1935 Order 7. Rhizobiales Kuykendall 2006 Class 2. Magnetococcia Parks et al. 2018 Order 1. Magnetococcales Bazylinki et al. 2013 MC-1 Magnetococcus marinus MC-1 Magnetofaba australis IT-1 Magnetofaba australis IT-1 Anaplasma phagocytophilum HZ Anaplasma phagocytophilum HZ canis Jake Jake 90/92 endosymbiont of Culex quinquefasciatus Pel Wolbachia endosymbiont of Culex quinquefasciatus Pel Wolbachia endosymbiont of Onchocerca ochengi Wolbachia endosymbiont of Onchocerca ochengi 78.3/89 sennetsu Miyayama Miyayama Candidatus Jidaibacter acanthamoeba Candidatus Arcanobacter lacustris endosymbiont of Peranema tsutsugamushi Boryong 3.8/51 99.4/99 Candidatus Midichloria mitochondrii IricVA Rickettsia typhi Wilmington Candidatus Arcanobacter lacustris endosymbiont of Stachyamoeba Boryong Candidatus Jidaibacter acanthamoeba Rickettsia typhi Wilmington endosymbiont of Peranema endosymbiont of Stachyamoeba Candidatus Midichloria mitochondrii IricVA Candidatus Pelagibacter sp. IMCC9063 Geminicoccus roseus DSM 18922 alphaproteobacterium HIMB114 Tistrella mobilis KA081020-065 alphaproteobacterium SCGC AAA280-P20 Ahrensia sp. R2A130 95.9/93Candidatus Pelagibacter ubique HTCC1002 quintana RM-11 Candidatus Pelagibacter ubique HTCC8051 Chelativorans sp. J32 99.9/100 alphaproteobacterium HIMB5 Rhizobium leguminosarum bv. trifolii WSM1689 alphaproteobacterium SCGC AAA288-N07 Polymorphum gilvum SL003B-26A1 Pelagibacter sp. HIMB058 Roseibium sp. TrichSKD4 alphaproteobacterium HIMB59 Pseudovibrio sp. FO-BEG1 Caedibacter sp. 37-49 Pelagibacterium halotolerans B2 84.2/88 Caedibacter sp. 38-128 extorquens AM1 99.2/98 Candidatus Caedibacter acanthamoebae BL2 Caedibacter varicaedens Methylocystis sp. SC2 Candidatus Nucleicultrix amoebiphila FS5 palustris TIE-1 Candidatus Finniella lucida Xanthobacter autotrophicus Py2 Candidatus Odyssella thessalonicensis L13 Hyphomicrobium denitrificans ATCC 51888 Candidatus Paracaedibacter acanthamoebae isolate PRA3 vannielii ATCC 17100 Candidatus Paracaedibacter symbiosus Thermopetrobacter sp. TC1 Candidatus Hepatobacter penaei Parvibaculum lavamentivorans DS-1 Holospora obtusa F1 alphaproteobacterium Mf105b01 Holospora undulata HU1 alphaproteobacterium IMCC14465 93.5/94 Ahrensia sp. R2A130 alphaproteobacterium RS24 RM-11 excentricus CB 48 Chelativorans sp. J32 Brevundimonas subvibrioides ATCC 15264 Rhizobium leguminosarum bv. trifolii WSM1689 alphaproteobacterium L41A 98.6/99 Polymorphum gilvum SL003B-26A1 zucineum HLK1 Roseibium sp. TrichSKD4 baltica ATCC 49814 Pseudovibrio sp. FO-BEG1 neptunium ATCC 15444 Pelagibacterium halotolerans B2 maris MCS10 Methylobacterium extorquens AM1 sp. HTCC2633 Methylocella silvestris BL2 HTCC2503 Methylocystis sp. SC2 Jannaschia sp. EhC01 Rhodopseudomonas palustris TIE-1 Octadecabacter antarcticus 307 Xanthobacter autotrophicus Py2 Ruegeria sp. ANG-R Hyphomicrobium denitrificans ATCC 51888 stellata E-37 100/94 ATCC 17100 Ketogulonicigenium vulgare WSH-001 Thermopetrobacter sp. TC1 Rubellimicrobium thermophilum DSM 16684 Parvibaculum lavamentivorans DS-1 PD1222 alphaproteobacterium Mf105b01 Rhodobacter sphaeroides 2.4.1 98.6/97 alphaproteobacterium IMCC14465 Meganema perideroedes DSM 15528 alphaproteobacterium RS24 Candidatus Pelagibacter sp. IMCC9063 Asticcacaulis excentricus CB 48 alphaproteobacterium HIMB114 100/94 Brevundimonas subvibrioides ATCC 15264 alphaproteobacterium SCGC AAA280-P20 alphaproteobacterium L41A Candidatus Pelagibacter ubique HTCC1002 Phenylobacterium zucineum HLK1 Candidatus Pelagibacter ubique HTCC8051 Hirschia baltica ATCC 49814 alphaproteobacterium HIMB5 93.2/90 Hyphomonas neptunium ATCC 15444 alphaproteobacterium SCGC AAA288-N07 100/94 Maricaulis maris MCS10 Pelagibacter sp. HIMB058 Oceanicaulis sp. HTCC2633 sp. JLT1363 Parvularcula bermudensis HTCC2503 alphaproteobacterium LLX12A Jannaschia sp. EhC01 Sphingomonas wittichii Octadecabacter antarcticus 307 sub mobilis ATCC 10988 Ruegeria sp. ANG-R alphaproteobacterium AAP81b E-37 Kordiimonas gwangyangensis DSM 19435 - JCM 12864 Ketogulonicigenium vulgare WSH-001 alphaproteobacterium Q-1 Rubellimicrobium thermophilum DSM 16684 Sneathiella glossodoripedis JCM 23214 Paracoccus denitrificans PD1222 Caedibacter sp. 37-49 Rhodobacter sphaeroides 2.4.1 Caedibacter sp. 38-128 80.9/84 Meganema perideroedes DSM 15528 Candidatus Caedibacter acanthamoebae Citromicrobium sp. JLT1363 Caedibacter varicaedens 98.6/99 alphaproteobacterium LLX12A Candidatus Nucleicultrix amoebiphila FS5 Sphingomonas wittichii Candidatus Finniella lucida Zymomonas mobilis sub mobilis ATCC 10988 Candidatus Odyssella thessalonicensis L13 alphaproteobacterium AAP81b Candidatus Paracaedibacter acanthamoebae isolate PRA3 100/92 Kordiimonas gwangyangensis DSM 19435 - JCM 12864 Candidatus Paracaedibacter symbiosus alphaproteobacterium Q-1 Candidatus Hepatobacter penaei Sneathiella glossodoripedis JCM 23214 Holospora obtusa F1 Geminicoccus roseus DSM 18922 Holospora undulata HU1 Tistrella mobilis KA081020-065 alphaproteobacterium HIMB59 Caenispirillum salinarum AK4 Candidatus Puniceispirillum marinum IMCC1322 strain 930I Thalassobaculum salexigens DSM 19539 Pararhodospirillum photometricum DSM 122 alphaproteobacterium BAL199 Rhodospirillum rubrum ATCC 11170 Oceanibaculum indicum P24 98.7/98 Magnetospirillum magneticum AMB-1 Kiloniella laminariae DSM 19542 Phaeospirillum fulvum MGU-K5 Rhodovibrio salinarum DSM 9154 Terasakiella pusilla DSM 6293 DSM 16000 Thalassospira profundimaris WP0211 Rhodocista sp. MIMtkB3 100/93 Candidatus Puniceispirillum marinum IMCC1322 alphaproteobacterium AAP38 99.9/100 Thalassobaculum salexigens DSM 19539 Rhodospirillum centenum SW alphaproteobacterium BAL199 Micavibrio aeruginosavorus ARL-13 87.8/94 Oceanibaculum indicum P24 Caenispirillum salinarum AK4 99.9/100 Kiloniella laminariae DSM 19542 Roseospirillum parvum strain 930I Rhodovibrio salinarum DSM 9154 Pararhodospirillum photometricum DSM 122 99.8/99 Inquilinus limosus DSM 16000 Rhodospirillum rubrum ATCC 11170 60.6/87 Rhodocista sp. MIMtkB3 Magnetospirillum magneticum AMB-1 alphaproteobacterium AAP38 Phaeospirillum fulvum MGU-K5 Rhodospirillum centenum SW Terasakiella pusilla DSM 6293 99.9/100 Micavibrio aeruginosavorus ARL-13 Thalassospira profundimaris WP0211 Elioraea tepidiphila DSM 17972 Elioraea tepidiphila DSM 17972 moabensis DSM 16746 Belnapia moabensis DSM 16746 99.8/100 Rubritepida flocculans DSM 14296 Rubritepida flocculans DSM 14296 cervicalis ATCC 49957 Roseomonas cervicalis ATCC 49957 Acidiphilium angustum ATCC 35903 Acidiphilium angustum ATCC 35903 99.9/100 Acidisphaera rubrifaciens HS-AP3 Acidisphaera rubrifaciens HS-AP3 bethesdensis CGDNIH1 Granulibacter bethesdensis CGDNIH1 99.4/98 intestini A911 Commensalibacter intestini A911 91.6/93 diazotrophicus PA1 5 Gluconacetobacter diazotrophicus PA1 5 oxydans H24 Gluconobacter oxydans H24 99.4/100 Alteromonas lipolytica Alteromonas lipolytica Enterobacter soli ATCC BAA-2102 Enterobacter soli ATCC BAA-2102 Congregibacter litoralis KT71 Congregibacter litoralis KT71 Spongiibacter tropicus DSM 19543 Spongiibacter tropicus DSM 19543 Burkholderia thailandensis E264 Burkholderia thailandensis E264 Ralstonia solanacearum GMI1000 Ralstonia solanacearum GMI1000 Methylovorus glucosetrophus SIP3-4 Methylovorus glucosetrophus SIP3-4 Nitrosospira multiformis ATCC 25196 Nitrosospira multiformis ATCC 25196 0.4 0.2 120 taxa & 200 genes dataset

Set of 40 most Site removal Recoding compositionally Taxon in steps of 10% into S4 removal homogeneous genes

Phylogenetic Phylogenetic Phylogenetic analyses analyses analyses by compositional by bias rate

Taxon Taxon removal removal Phylogenetic Phylogenetic analyses analyses

Phylogenetic Phylogenetic analyses analyses Taxon removal

Phylogenetic analyses A. B. Rhizobiales Rhizobiales

Caulobacterales Caulobacterales

0.99 Rhodobacterales Rhodobacterales

0.99 Sphingomonadales Pelagibacterales Sneathiellales 0.99 0.99 Sphingomonadales Sneathiellales 0.99 0.99 Holosporales 0.99 Rhodospirillales 0.99 0.61

0.991 0.97 alphaproteobacterium sp. HIMB59 0.5 0.99 0.98 0.98

0.99 0.97 0.5 Holosporales Rhodospirillales

Pelagibacterales alphaproteobacterium sp. HIMB59 Geminicoccaceae Rickettsiales Rickettsiales Geminicoccaceae 0.99 Magnetococcales Magnetococcales Beta- & Gammaproteobacteria Beta- & Gammaproteobacteria

0.4 0.2 A. B. Rhizobiales Rhizobiales

99/97 93.5/94

Caulobacterales Caulobacterales

Rhodobacterales 100/94 Rhodobacterales

Sphingomonadales 100/94 Pelagibacterales

100/94 Sneathiellales Sphingomonadales Sneathiellales Rhodospirillales 98.6/99

99.7/100 Rhodospirillales

Geminicoccaceae 100/93 Rickettsiales Holosporales 100/92 Pelagibacterales alphaproteobacterium sp. HIMB59 alphaproteobacterium sp. HIMB59 Geminicoccaceae Holosporales Rickettsiales Magnetococcales Magnetococcales Beta- & Gammaproteobacteria Beta- & Gammaproteobacteria

0.4 0.2 C. D. Rhizobiales 99.1/100 Rhizobiales

99.8/100 79.2/91

Caulobacterales Caulobacterales

Rhodobacterales 98.5/100 Rhodobacterales

Sphingomonadales Sphingomonadales 94.5/98 Sneathiellales Sneathiellales 99.9/100 Geminicoccaceae 99.9/100 98.4/99

76.8/100 Rhodospirillales Geminicoccaceae Rhodospirillales

Rickettsiales Holosporales

Pelagibacterales Rickettsiales alphaproteobacterium sp. HIMB59

97.2/99 Holosporales Pelagibacterales alphaproteobacterium sp. HIMB59 Magnetococcales Magnetococcales Beta- & Gammaproteobacteria Beta- & Gammaproteobacteria 96.3/100 99.3/100

0.4 0.3 Rhizobiales

Caulobacterales 99.2/98 Rhodobacterales 95.1/82 Sphingomonadales Sneathiellales

85.3/82 100/96 98.5/97 100/97 36.9/88 Rhodospirillales

100/97

99.2/100 100/52 64.7/50 99.4/99 100/97 Geminicoccaceae Rickettsiales

97.7/46 Pelagibacterales

alphaproteobacteriumHolospora obtusa F1 sp. HIMB59 Holospora undulata HU1 Candidatus Hepatobacter penaei

100/46 Holosporales Magnetococcales Beta- & Gammaproteobacteria

0.3 Magnetococcus marinus MC-1 Magnetofaba australis IT-1 Anaplasma phagocytophilum HZ Ehrlichia canis Jake Wolbachia endosymbiont of Onchocerca ochengi Wolbachia endosymbiont of Culex quinquefasciatus Pel Neorickettsia sennetsu Miyayama Candidatus Jidaibacter acanthamoeba endosymbiont of Peranema Candidatus Midichloria mitochondrii IricVA Candidatus Arcanobacter lacustris Orientia tsutsugamushi Boryong Rickettsia typhi Wilmington endosymbiont of Stachyamoeba Candidatus Pelagibacter sp. IMCC9063 alphaproteobacterium HIMB114 alphaproteobacterium SCGC AAA280-P20 Candidatus Pelagibacter ubique HTCC1002 Candidatus Pelagibacter ubique HTCC8051 alphaproteobacterium HIMB5 alphaproteobacterium SCGC AAA288-N07 Pelagibacter sp. HIMB058 alphaproteobacterium HIMB59 Caedibacter sp. 37-49 Caedibacter sp. 38-128 Candidatus Caedibacter acanthamoebae Caedibacter varicaedens Candidatus Nucleicultrix amoebiphila FS5 Candidatus Finniella lucida Candidatus Odyssella thessalonicensis L13 Candidatus Paracaedibacter acanthamoebae isolate PRA3 Candidatus Paracaedibacter symbiosus Candidatus Hepatobacter penaei Holospora obtusa F1 Holospora undulata HU1 Ahrensia sp. R2A130 Bartonella quintana RM-11 Chelativorans sp. J32 Rhizobium leguminosarum bv. trifolii WSM1689 Polymorphum gilvum SL003B-26A1 Roseibium sp. TrichSKD4 Pseudovibrio sp. FO-BEG1 Pelagibacterium halotolerans B2 Methylobacterium extorquens AM1 Methylocella silvestris BL2 Methylocystis sp. SC2 Rhodopseudomonas palustris TIE-1 Xanthobacter autotrophicus Py2 Hyphomicrobium denitrificans ATCC 51888 Rhodomicrobium vannielii ATCC 17100 Thermopetrobacter sp. TC1 Parvibaculum lavamentivorans DS-1 alphaproteobacterium Mf105b01 alphaproteobacterium IMCC14465 alphaproteobacterium RS24 Asticcacaulis excentricus CB 48 Brevundimonas subvibrioides ATCC 15264 alphaproteobacterium L41A Phenylobacterium zucineum HLK1 Hirschia baltica ATCC 49814 Hyphomonas neptunium ATCC 15444 Maricaulis maris MCS10 Oceanicaulis sp. HTCC2633 Parvularcula bermudensis HTCC2503 Jannaschia sp. EhC01 Octadecabacter antarcticus 307 Ruegeria sp. ANG-R Sagittula stellata E-37 Ketogulonicigenium vulgare WSH-001 Rubellimicrobium thermophilum DSM 16684 Paracoccus denitrificans PD1222 Rhodobacter sphaeroides 2.4.1 Meganema perideroedes DSM 15528 Citromicrobium sp. JLT1363 alphaproteobacterium LLX12A Sphingomonas wittichii Zymomonas mobilis sub mobilis ATCC 10988 alphaproteobacterium AAP81b Kordiimonas gwangyangensis DSM 19435 - JCM 12864 alphaproteobacterium Q-1 Sneathiella glossodoripedis JCM 23214 Geminicoccus roseus DSM 18922 Tistrella mobilis KA081020-065 Caenispirillum salinarum AK4 Roseospirillum parvum strain 930I Pararhodospirillum photometricum DSM 122 Rhodospirillum rubrum ATCC 11170 Magnetospirillum magneticum AMB-1 Phaeospirillum fulvum MGU-K5 Terasakiella pusilla DSM 6293 Thalassospira profundimaris WP0211 Candidatus Puniceispirillum marinum IMCC1322 Thalassobaculum salexigens DSM 19539 alphaproteobacterium BAL199 Oceanibaculum indicum P24 Kiloniella laminariae DSM 19542 Rhodovibrio salinarum DSM 9154 Inquilinus limosus DSM 16000 Rhodocista sp. MIMtkB3 alphaproteobacterium AAP38 Rhodospirillum centenum SW Micavibrio aeruginosavorus ARL-13 Elioraea tepidiphila DSM 17972 Belnapia moabensis DSM 16746 Rubritepida flocculans DSM 14296 Roseomonas cervicalis ATCC 49957 Acidiphilium angustum ATCC 35903 Acidisphaera rubrifaciens HS-AP3 Granulibacter bethesdensis CGDNIH1 Commensalibacter intestini A911 Gluconacetobacter diazotrophicus PA1 5 Gluconobacter oxydans H24 Alteromonas lipolytica Enterobacter soli ATCC BAA-2102 Congregibacter litoralis KT71 Spongiibacter tropicus DSM 19543 Burkholderia thailandensis E264 Ralstonia solanacearum GMI1000 Methylovorus glucosetrophus SIP3-4 Nitrosospira multiformis ATCC 25196

0 27,200 54,400 (0%) (50%) (100%) 2.5 o i

t 2 r a

1.5 K Y N M I

F 1 P : R

A 0.5 G i i i l I i s s 3 2 1 3 9 6 2 5 8 9 1 6 1 0 0 1 a 3 5 1 2 1 5 4 9 1 e 1 8 3 8 8 2 1 4 1 1 a s 0 3 4 7 1 9 4 8 0 1 4 2 1 3 1 6 a n 5 1 4 0 4 2 Z 4 3 2 5 1 3 a 1 1 4 4 2 7 7 2 9 1 3 3 4 2 0 a 2 4 0 8 8 7 g 3 1 4 1 4 A 5 3 b 4 1 2 0 A R g i e A e . - - -

0 - W - - - - b n 0 0 2 0 5 3 3 5 0 4 5 4 o 8 1 1 4 1 6 0 6 d 6 y 2 2 5 3 k e 4 1 0 r 6 L 2 0 5 8 0 n 3 2 1 c a 7 0 6 1 6 9 1 2 9 7 2 2 8 1 9 7 2 0 b 9 0 0 6 4 3 F a 8 1 4 2 2 K 1 K K A i 2 A P B 1 S B C G D H U H n C m a 1 3 -

4 m t M 3 - - - P T c h S e C E -

I J 3 0 0 3 6 Q B 1 s u 9 3 8 2 1 9 9 1 1 3 1 6 2 1 e 2 0 b 4 4 5 S b i 8 L 5 1 2 e 0 9 . 1 1 1 5 0 7 6 2 5 8 0 0 0 k B 8 2 9 P 5 2 B 9 P 1 T P I 4 G N B P I L S c i e S 6 s t

n c V H C A F

T

A K g - H t S 7 L e

R 9 - M - s t B E A P E J a

. i y t o A -

e 1

5

0 o d 6 - 5 2 9 2 9 8 6 6 3 9 9 9 1 2 1 9

2 1 0 y o 2 s l 7 1 - 5

L 8 5 1 1 1 M 4 7 L o 6 5 4 5 9 t D A U

N I

2 E B r M B i h P A I K . s M A

T y a n H C m u s . R n I e s r X 8 s

n s a i N M S - 0 l 3 s s P

0

a m B e 8 i R - i 1 I m u 1 P 1 n 1 2 1 1 1 i H 0 1 1 1 o 5 C 1 M 1 3 1 1 1 2

1 4 a 1 M s 4 s

c u m o b

T S A i l P I s 2 C u C C

D D c h a C p C

L

E i A C

M H R

l m

u

s m - m

A

G

2 t

I

S i . a r s

M o n A M t

3 s i a w s i A s A s l t s p

i

8 t u 8 M M A B r M i u

p

n

i m M S r 1 S

i y a i a n o a I c u I L e s B a C s p s

u a e H R m m P c u s p m r C i i a a H . A i C . l h f m n c h C i C C 0 l a i n

a C l C i B r a t C C C S 2 C C n C n

. W O c a 2 C u A r l C G M C s 3 s i s t r l

M r J L M

a M r M b M r M

i

S M M i M M M h M l l e M l e

c u s

u S e s i d s G H e H m

m a y a a r D a h r . m u

1

l s p u s r i T i d a i p l C r i n T i T h s u s T n

s s t n A W M o e

a F t A C e M u C u . C i

0 C

c t e

l S C . M C C S r C r i C S C a S r C S i B S s t S M p S S i P n s m c u s t s p S m W s t

h t e c u s

m D e r i e c a r u d . s y m i m I s p u D o

c e i s p i T u

8 I p c a r e

r n s c i a . o r H

W m H d n h

H 0 e v e c t H

n

e

i

o b

t M t i n n T

i u o J C T

q J C T r T a T T T i T A e c h e T n n T r

i r A r s p

a n r l f n m

s h c t x y d a c u m u D u D t r t D u I D a D m D u r

D l D n D D 0

s

e e a o a r h u r i t c t t l v a a .

i r r o p d i r a i i

e s p a i i e r a m n n v o i C

n s p s t a L c t t . f s o i l o

r o a a

- r o

A u s p a n t r i i m A

r A n A n l A A a a l A A i A A e r m e n A e

A i

s s i u o u e p r m e e c t l i e a s u h c i s i e A o s i

c a

i i s

t l a

s a n s t r c a s p s

a s s m v u i e u

t o c t c t v o n r b

i

a s

o o u o o r t i c e S a

t n o e e c i

i c e l c u

a m C l a r a c y s t r b r e l e r a i m a e s p u u p m c y t u

f s i

v o c h r i o e u s f u i

m s b i s a m e a F n 5 o s l a a m a s o s p c t e m s r u c t i c t n a S g r n c a e n c t

r e d a K c t x t i o t a b a i i m a u r a i l C s i e

i o C s i T s i u s i t c h o l n t o n v u m e l

h m r

m h q u o i q a l l l t e o e e n r m u o b i s c u y p z u i f c a e 3 r s u i n i

m h a t r f u e c t e a q a b c t e a n o d t e r c t a i s l e r i s t u b b s a l o u i e a t a i s c h c t i t a e u f

e i s p g d c t a n i n n t l

r i a e l t m l o r s g q i

l u

m r r u i

g s s u t i a i e r u

i G a r x c e o m G m A

t

i r e y l a i a i p r d a m s e a i g e e a b b i 4 o s e t o o b l g c t t o c t c t b b a a r b c b c t a o g o r t e b n a

n u p a o l b l c i c h n l n i a i a p v u u s a e n a b s p e p e h u a i d e p r a i b r c t i n f m

c t l m i u i i a i i n c a i z o b a l

e l n c a e e u s t

r a m m t s d 9 h

d h h c e b a o a e p b o o a a C b x t n e e o C b l i o u l n r n u a n e n v u o e m u a n b b v . l x i l i e i t u l u b o b m m o g i t t c u d o o o a d o i a o m l c t r c u

p i h

g r e t l e

r e u d i a

i r n a u i f r f i v i u b e i e u

o c t m u o f o u p i u o h r c a u a 1 v i E c o i c a h s u b t b b o i o u e n b e c u b t S r l i n c t r h i s o S b e r o s e o s i i l r s p b u l d e a n r m i i m t e h v a a e i t a o i

r a a g o i t p l

e r b e t

l t r b i c t r i e R o r d o t

t l p a o

i d e O r m H a a q p

r i t

l H t l r o g

s a s o

r d h u g a i o

c r m e s a e e l o a o i t e A e r d i m t

l e o r

r i y l

c a o d i e l l

n o u p t

o

i t o l u m b C g i m b t l a o g v a t e i o p a i d a r b t c o o m r a l s f S

i r a A e

t l e i p l e d r e a M C r M r A b m a n u i o i n c t r i m a a a e i s c t r p s y m b e n x t u J a m u r o m f

m b m r d m e e h g h s a c a r l o e o e s p b S a

p o s e o o c t e r l o c t c e o n s u h t a o r v i o

b e s a n l m r a u o

u e t n l

l m m C t k i c u n i a s p

a a a i m e s i

a e e m l t u r p u r g r c t o t

l r h r t p c t P S a o u l e

i e r u s a s o r u i i

u o r

t n b s a

o l a h e m n o e e a d e M a a c t n R s e a c a t p l a t b l m u c u r a i e c a h m i b c t a t c k e H

s b d e r o l p

g a p a o d p a C d p e s p s o l i b a h u d s u l i c o r r t i p n l

a r e e e n r a a a l i a l u a r a i d

r b b c t h s m l b r l D

u n a c t l d u g r t m d p s e R u i l s a i a u t e a i r e t a i s a s e p u i m o n i a o o t i s c h c u o h c h P a i u p a n i a e h n i a b r h

l J i a h b p o M n i t i a a i e l e e a l e r e r o a s c e l h s u r B M c l i i p o u o v i e

R a C u b l c e d d a o a r c o s a a s y m a P u d g p b l g i u s n i l a i C i h i o p b h c i l a a m h g e i r t h p e l i a P i s p h i p n r T a a i p i b g o d n

r r i g h s e o t a r r i C b m l o c t u a m c a s l c t l o i a

c k e q c t o c o d i a O G f c a s o o n y s e i u s p o a a a m p a O d s p h b o p n e p p C c u i p n i H a s i v o k h i e o p o C e n a s p l l p o y l l s t n b s p l i a n t l a h u m C u l a a d g n r c u o r b i o s i o l s p i N l i d o R d l h a a i y l n a a o a a i i c u i a t h l r h o u d n o d M n c t p t m s p l a i e l e I m r r r

a r c r a T o s t a M o l l n i i a o t l u a s t b p s e r b o n o b v i u t d n o m i l e n a b e i n T h e a s e r t o l l h O i b a p R e P i o a n E s m i P d t e b P d t a y l O e a m b o c e u n A s o o h o e e l X l s O n d u e y m B o e o c e e s o a i P a b t o d i K A g a m a c i m u p o B o e l C n s y m C u u a v i a i d o n h n u h P u e t l s e p c a o o g r l R s h N s n t i t d a o t n a s p t s e i R o A c t m G R o h c r d h t a o C s g i r g a u g E P n v u i h R o a i c i u S c r u M e n t o u d t o a a o o r a b d t t i t a a d d a i e e n i a a M P r d s o r R C d l h A y a d R i P y p a d P r a n b

o e m a d a e a R M n i n p a p i n N m g M c o d l d M o T i e h o d d P o d a K H i n l n G a s C a i i d

u m h l n n a l h d a n s y m R h u r h d S d C e a v u u e a m a t h i i a o n e G p a p C n n o b l a T y p r l w r b C a d c a a a C u c h a d a g a n B i H a C C C a z o R y m e r d P i s

b a Z l h n a a i o P a n R o C s c h W u m a t i i b a l d d r o i o d W K n a C Taxon A. B.

Rhizobiales Rhizobiales

99.7/100 98.8/99

Caulobacterales Caulobacterales

Rhodobacterales Rhodobacterales

99.9/100 Sphingomonadales Sphingomonadales

Sneathiellales Sneathiellales Holosporales

99.9/99

99.7/99 100/99 Rhodospirillales

99.7/97 Rhodospirillales 100/10092.5/87

99.9/100

77.3/83 97/96 99.4/99 70.2/79 Holosporales 100/99 Geminicoccaceae Geminicoccaceae Magnetococcales Magnetococcales Beta- & Gammaproteobacteria Beta- & Gammaproteobacteria

0.3 0.2 C. D.

Rhizobiales 99/100 Rhizobiales

66.8/87 91.3/96

Caulobacterales Caulobacterales

Rhodobacterales 99.2/100 Rhodobacterales

98.9/100 Sphingomonadales Sphingomonadales Sneathiellales Sneathiellales Holosporales 99.8/99 98.5/99

63.8/90 96.8/9699.7/99 99.8/100 Rhodospirillales 96.7/98

77.8/95 73.8/89 99.9/100 Rhodospirillales 98.9/99 3.1/64 99.1/99

98/99 39.4/87 61.7/88 Holosporales 99.7/100 Geminicoccaceae 99.9/100 Geminicoccaceae Magnetococcales Magnetococcales Beta- & Gammaproteobacteria Beta- & Gammaproteobacteria 84.3/92 99.3/99

0.2 0.2 A. B.

Rhizobiales Rhizobiales

99.4/99 98.6/98

Caulobacterales Caulobacterales

Rhodobacterales Rhodobacterales

Sphingomonadales Sphingomonadales Sneathiellales Sneathiellales

Rhodospirillales Rhodospirillales

84.2/84

Geminicoccaceae Geminicoccaceae

Rickettsiales Rickettsiales

Magnetococcales Magnetococcales Beta- & Gammaproteobacteria Beta- & Gammaproteobacteria

0.3 0.2 C. D.

Rhizobiales 99.3/100 Rhizobiales

94.9/99 82.6/91

Caulobacterales Caulobacterales

Rhodobacterales 99.3/100 Rhodobacterales

Sphingomonadales Sphingomonadales Sneathiellales Sneathiellales 87.2/88 66.9/85

100/97 Rhodospirillales Rhodospirillales

98.1/95 Geminicoccaceae 98.9/100 Geminicoccaceae

Rickettsiales Rickettsiales

Magnetococcales Magnetococcales Beta- & Gammaproteobacteria Beta- & Gammaproteobacteria 63.2/87 99.1/99

0.4 0.3 A. B.

Rhizobiales Rhizobiales

99.2/99

Caulobacterales Caulobacterales

Rhodobacterales Rhodobacterales

Sphingomonadales Sphingomonadales

Sneathiellales Sneathiellales 99.8/100

Rhodospirillales Rhodospirillales

55.3/70

Geminicoccaceae 99.9/100 Geminicoccaceae

Rickettsiales Rickettsiales

Magnetococcales Magnetococcales

0.4 0.2 C. D.

Rhizobiales 98.2/98 Rhizobiales

88.1/100 92.3/94

Caulobacterales Caulobacterales

Rhodobacterales 99.4/100 Rhodobacterales

75.6/21 Sphingomonadales Sphingomonadales Sneathiellales Geminicoccaceae Sneathiellales 13.6/60

98.8/100

98.5/24 Rhodospirillales Rhodospirillales 99.8/100

98.1/99 Geminicoccaceae

Rickettsiales Rickettsiales

Magnetococcales Magnetococcales

0.4 0.3 A. B.

Rhizobiales Rhizobiales

93.2/94

Caulobacterales Caulobacterales

100/59 Rhodobacterales Rhodobacterales

Sphingomonadales Pelagibacterales

Sphingomonadales 100/59 Sneathiellales Rhodospirillales

Rhodospirillales 100/59

Geminicoccaceae 90.4/92 Sneathiellales Pelagibacterales Geminicoccaceae Magnetococcales Magnetococcales Beta- & Gammaproteobacteria Beta- & Gammaproteobacteria

0.3 0.2 C. D.

Rhizobiales 99.2/100 Rhizobiales

97.8/98

Caulobacterales 98.8/99 Caulobacterales

19.3/98

Rhodobacterales 94.6/92 Rhodobacterales

Pelagibacterales 97.9/94 Pelagibacterales

98.9/96

61.2/88 Sphingomonadales Sphingomonadales Sneathiellales Geminicoccaceae Sneathiellales

99.9/100 Rhodospirillales 79.5/95 Rhodospirillales

96.7/98

99.4/100 Geminicoccaceae Magnetococcales Magnetococcales Beta- & Gammaproteobacteria Beta- & Gammaproteobacteria 98.1/100 97.9/98

0.3 0.2 A. B.

Rhizobiales Rhizobiales

99.2/99

Caulobacterales Caulobacterales

Rhodobacterales Rhodobacterales

Sphingomonadales Sphingomonadales

Sneathiellales Sneathiellales

99.8/97

Rhodospirillales Rhodospirillales

99.8/100 99.5/97

Geminicoccaceae alphaproteobacterium sp. HIMB59 Geminicoccaceae alphaproteobacterium sp. HIMB59 Magnetococcales Magnetococcales Beta- & Gammaproteobacteria Beta- & Gammaproteobacteria

0.3 0.2 C. D.

Rhizobiales 99.5/99 Rhizobiales

99.8/100 95.2/96

Caulobacterales Caulobacterales

Rhodobacterales 98.4/100 Rhodobacterales

99.1/100 Sphingomonadales Sphingomonadales

Sneathiellales Sneathiellales 95.6/99

99.1/91

Rhodospirillales 99.9/96 Rhodospirillales 97.4/91

87.8/86

86.4/90 99.9/100 Geminicoccaceae alphaproteobacterium sp. HIMB59 Geminicoccaceae alphaproteobacterium sp. HIMB59 Magnetococcales Magnetococcales Beta- & Gammaproteobacteria Beta- & Gammaproteobacteria 89.1/100 98.4/99

0.3 0.2 A. B.

Rhizobiales Rhizobiales

Caulobacterales Caulobacterales

0.96

Rhodobacterales Rhodobacterales

Sphingomonadales Sphingomonadales Sneathiellales Sneathiellales Holosporales Holosporales 0.98

0.92 0.97

0.99 0.97

Rhodospirillales 0.98 0.97 Rhodospirillales 0.95 0.55

0.99 0.99 0.99 0.99 0.99 0.98 0.99 Geminicoccaceae Geminicoccaceae Magnetococcales 0.8 Magnetococcales 0.99 0.97 Beta- & Gammaproteobacteria 0.98 Beta- & Gammaproteobacteria 0.99

0.3 0.2 Rhizobiales

98.4/98

Caulobacterales

Rhodobacterales

Sphingomonadales

Sneathiellales

99.7/97 Rhodospirillales

78.5/89 34.1/57 Holosporales

Holospora obtusa F1 Holospora undulata HU1 Candidatus Hepatobacter penaei Geminicoccaceae Magnetococcales Beta- & Gammaproteobacteria

0.2 Rhizobiales

0.96

Caulobacterales

Rhodobacterales

Sphingomonadales Sneathiellales Holosporales 0.83

0.99 0.98

0.99

0.92 0.95 0.61 Rhodospirillales

0.99 0.99 0.88

0.99 0.99 Geminicoccaceae 0.99 Magnetococcales 0.98

0.99 0.99 Beta- & Gammaproteobacteria 0.99

0.4