<<

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1 TITLE

2 An updated phylogeny of the reveals that the parasitic 3 and Holosporales have independent origins

4 AUTHORS

5 Sergio A. Muñoz-Gómez1,2, Sebastian Hess1,2, Gertraud Burger3, B. Franz Lang3, 6 Edward Susko2,4, Claudio H. Slamovits1,2*, and Andrew J. Roger1,2*

7 AUTHOR AFFILIATIONS

8 1 Department of Biochemistry and Molecular Biology; Dalhousie University; Halifax, 9 Nova Scotia, B3H 4R2; Canada.

10 2 Centre for Comparative Genomics and Evolutionary Bioinformatics; Dalhousie 11 University; Halifax, Nova Scotia, B3H 4R2; Canada.

12 3 Department of Biochemistry, Robert-Cedergren Center in Bioinformatics and 13 Genomics, Université de Montréal, Montreal, Quebec, Canada.

14 4 Department of Mathematics and Statistics; Dalhousie University; Halifax, Nova Scotia, 15 B3H 4R2; Canada.

16

17 *Correspondence to: Department of Biochemistry and Molecular Biology; Dalhousie 18 University; Halifax, Nova Scotia, B3H 4R2; Canada; 1 902 494 7894, 19 [email protected]

20 *Correspondence to: Department of Biochemistry and Molecular Biology; Dalhousie 21 University; Halifax, Nova Scotia, B3H 4R2; Canada; 1 902 494 2881, [email protected]

1

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

22 ABSTRACT

23 The Alphaproteobacteria is an extraordinarily diverse and ancient group of . 24 Previous attempts to infer its deep phylogeny have been plagued with methodological 25 artefacts. To overcome this, we analyzed a dataset of 200 single-copy and conserved 26 and employed diverse strategies to reduce compositional artefacts. Such 27 strategies include using novel dataset-specific profile mixture models and recoding 28 schemes, and removing sites, genes and taxa that are compositionally biased. We 29 show that the Rickettsiales and Holosporales (both groups of intracellular parasites of 30 ) are not sisters to each other, but instead, the Holosporales has a derived 31 position within the . Furthermore, we find that the Rhodospirillales might 32 be paraphyletic and that the Geminicoccaceae could be sister to all ancestrally free- 33 living alphaproteobacteria. Our robust phylogeny will serve as a framework for future 34 studies that aim to place mitochondria, and novel environmental diversity, within the 35 Alphaproteobacteria.

36 KEYWORDS: , Holosporales, mitochondria, origin, Finniella inopinata, 37 Stachyamoba, Peranema, , Rhodospirillales, Azospirillaceae, 38 Rhodovibriaceae.

2

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

39 INTRODUCTION

40 The Alphaproteobacteria is an extraordinarily diverse and disparate group of bacteria 41 and well-known to most biologists for also encompassing the mitochondrial lineage 42 (Williams, Sobral, and Dickerman 2007; Roger, Muñoz-Gómez, and Kamikawa 2017). 43 The Alphaproteobacteria has massively diversified since its origin, giving rise to, for 44 example, some of the most abundant (e.g., Pelagibacter ubique) and metabolically 45 versatile (e.g., sphaeroides) cells on Earth (Giovannoni 2017; Madigan, 46 Jung, and Madigan 2009). The basic structure of the tree of the Alphaproteobacteria 47 has largely been revealed through the analyses of 16S rRNA genes and several 48 conserved proteins (Garrity 2005; Lee et al. 2005; Rosenberg et al. 2014; Fitzpatrick, 49 Creevey, and McInerney 2006; Williams, Sobral, and Dickerman 2007; Brindefalk et al. 50 2011; Georgiades et al. 2011; Thrash et al. 2011; Luo 2015). Today, eight major orders 51 are well recognized, namely the , Rhizobiales, , 52 , , Rhodospirillales, Holosporales and Rickettsiales 53 (the latter two formerly grouped into the Rickettsiales sensu lato), and their 54 interrelationships have also recently become better understood (Viklund, Ettema, and 55 Andersson 2012; Viklund et al. 2013; Rodríguez-Ezpeleta and Embley 2012; Wang and 56 Wu 2014, 2015). These eight orders were grouped into two subclasses by Ferla et al. 57 (2013): the subclass Rickettsiidae comprising the order Rickettsiales and 58 Pelagibacterales, and the subclass Caulobacteridae comprising all other orders.

59 The great diversity of the Alphaproteobacteria itself presents a challenge to deciphering 60 the deepest divergences within the group. Such diversity encompasses a broad 61 spectrum of genome (nucleotide) and proteome (amino acid) compositions (e.g., the 62 A+T%-rich Pelagibacterales versus the G+C%-rich ) and molecular 63 evolutionary rates (e.g., the fast-evolving Pelagibacteriales, Rickettsiales or 64 Holosporales versus many slow-evolving in the Rhodospirillales) (Ettema and 65 Andersson 2009). This diversity may lead to pervasive artefacts when inferring the 66 phylogeny of the Alphaproteobacteria, e.g., long-branch attraction (LBA) between the 67 Rickettsiales and Pelagibacterales, especially when including mitochondria (Rodríguez- 68 Ezpeleta and Embley 2012; Viklund, Ettema, and Andersson 2012; Viklund et al. 2013).

3

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

69 Moreover, there are still important unknowns about the deep phylogeny of the 70 Alphaproteobacteria (Williams, Sobral, and Dickerman 2007; Ferla et al. 2013), for 71 example, the divergence order among the Rhizobiales, Rhodobacterales and 72 Caulobacterales (Williams, Sobral, and Dickerman 2007), the monophyly of the 73 Pelagibacterales (Viklund et al. 2013) and the Rhodospirillales (Ferla et al. 2013), and 74 the precise placement of the Rickettsiales and its relationship to the Holosporales 75 (Wang and Wu 2015; Martijn et al. 2018).

76 Systematic errors stemming from using over-simplified evolutionary models are perhaps 77 the major confounding and limiting factor to inferring deep evolutionary relationships; 78 the number of taxa and genes (or sites) can also be important factors. Previous multi- 79 tree studies of the Alphaproteobacteria were compromised by at least one of 80 these problems, namely, simple or unrealistic evolutionary models (because they were 81 not available at the time; e.g., Williams, Sobral, and Dickerman 2007), poor taxon 82 sampling (because the focus was too narrow or few genomes were available; e.g., 83 Williams, Sobral, and Dickerman 2007; Georgiades et al. 2011; Martijn et al. 2015) or a 84 small number of genes (because the focus was mitochondria; e.g., Rodríguez-Ezpeleta 85 and Embley 2012; Wang and Wu 2015; Martijn et al. 2018). The most recent study on 86 the phylogeny of the Alphaproteobacteria, and mitochondria, attempted to counter 87 systematic errors (or phylogenetic artefacts) by reducing amino acid compositional 88 heterogeneity (Martijn et al. 2018). Even though some deep relationships were not 89 robustly resolved, these analyses suggested that the Pelagibacterales, Rickettsiales 90 and Holosporales, which have compositionally biased genomes, are not each other’s 91 closest relatives (Martijn et al. 2018). A resolved and robust phylogeny of the 92 Alphaproteobacteria is fundamental to addressing questions such as how streamlined 93 bacteria, intracellular parasitic bacteria, or mitochondria evolved from their 94 alphaproteobacterial ancestors. Therefore, a systematic study of the different biases 95 affecting the phylogeny of the Alphaproteobacteria, and its underlying data, is much 96 needed.

97 Here, we revised the phylogeny of the Alphaproteobacteria by using a large dataset of 98 200 conserved single-copy genes and employing carefully designed strategies aimed at

4

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

99 alleviating phylogenetic artefacts. We found that amino acid compositional 100 heterogeneity, and more generally long-branch attraction, were major confounding 101 factors in estimating phylogenies of the Alphaproteobacteria. In order to counter these 102 biases, we used novel dataset-specific profile mixture models and recoding schemes 103 (both specifically designed to ameliorate compositional heterogeneity), and removed 104 sites, genes and taxa that were compositionally biased. We also present three draft 105 genomes for endosymbiotic alphaproteobacteria belonging to the Rickettsiales and 106 Holosporales: (1) an undescribed midichloriacean of Peranema 107 trichophorum, (2) an undescribed rickettsiacean endosymbiont of Stachyamoeba 108 lipophora, and (3) the holosporalean ‘Candidatus Finniella inopinata’, an endosymbiont 109 of the rhizarian amoeboflagellate Viridiraptor invadens (Hess, Suthaus, and Melkonian 110 2015). Our results provide the first strong evidence that the Holosporales are unrelated 111 to the Rickettsiales and originated instead from within the Rhodospirillales. We 112 incorporate these and other insights regarding the deep phylogeny of the 113 Alphaproteobacteria into an updated .

114 RESULTS

115 The genomes and phylogenetic positions of three novel endosymbiotic 116 alphaproteobacteria (Rickettsiales and Holosporales)

117 We sequenced the genomes of the novel holosporalean ‘Candidatus Finniella 118 inopinata’, an endosymbiont of the rhizarian amoeboflagellate Viridiraptor invadens 119 (Hess, Suthaus, and Melkonian 2015), and two undescribed rickettsialeans, one 120 associated with the heterolobosean amoeba Stachyamoeba lipophora and the other 121 with the euglenoid flagellate Peranema trichophorum. The three genomes are small with 122 a reduced gene number and high A+T% content, strongly suggesting an endosymbiotic 123 lifestyle (Table 1). Comparisons of their rRNA genes show that these genomes are truly 124 novel, being considerably divergent from other described alphaproteobacteria. As of 125 February 2018, the closest 16S rRNA gene to that of the Stachyamoeba-associated 126 rickettsialean belongs to massiliae str. AZT80, with only 88% identity. On the 127 other hand, the closest 16S rRNA gene to that of the Peranema-associated 128 rickettsialean belongs to an endosymbiont of Acanthamoeba sp. UWC8, which is only

5

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

129 92% identical. Phylogenetic analysis of both the 16S rRNA gene and the 200-gene set 130 confirm that each species belongs to different families and orders within the 131 Alphaproteobacteria (Fig. S4). ‘Candidatus Finniella inopinata’ belongs to the recently 132 described ‘Candidatus Paracaedibacteraceae’ in the Holosporales (Hess, Suthaus, and 133 Melkonian 2015), whereas the Stachyamoeba-associated rickettsialean belongs to the 134 , and the Peranema-associated rickettsialean belongs to the ‘Candidatus 135 ’, in the Rickettsiales.

6

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

136 Table 1. Genome features for the three novel rickettsialeans sequenced in this study.

Species ‘Candidatus Finniella Stachyamoeba- Peranema- inopinata’ associated associated rickettsialean rickettsialean Genome size 1,792,168 bp 1,738,386 bp 1,375,759 bp N50 174,737 bp 1,738,386 bp 28,559 bp Contig number 28 1 125 Gene number1 1,741 1,588 1,223 A+T% content 56.58% 67.01% 59.13% Family Paracaedibacteraeae Rickettsiaceae ‘Candidatus Midicloriaceae’ Order Holosporales Rickettsiales Rickettsiales Completeness2 94.96% 97.12% (=100%) 92.08% Redundancy2 0.0% 0.0% 2.1% 137 1 as predicted by prokka v.1.13 (rRNA genes were searched with BLAST). 138 2 as estimated by anvi’o v.2.4.0 using the Campbell et al., 2013 marker set.

7

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

139 Compositional heterogeneity appears to be a major confounding factor affecting 140 phylogenetic inference of the Alphaproteobacteria

141 The average-linkage clustering of amino acid compositions shows that the Rickettsiales, 142 Pelagibacterales and Holosporales are clearly distinct from other alphaproteobacteria. 143 This indicates that these three taxa have divergent proteome amino acid compositions 144 (Fig. 1A). These taxa also have the lowest GARP:FIMNKY ratios in all the 145 Alphaproteobacteria (Fig. 1A); the Pelagibacterales being the most divergent, followed 146 by the Rickettsiales and then the Holosporales. Such biased amino acid compositions 147 appear to be the consequence of genome nucleotide compositions that are strongly 148 biased towards high A+T%—a scatter plot of genome G+C% and proteome 149 GARP:FIMNKY ratios shows a similar clustering of the Rickettsiales, Pelagibacterales 150 and Holosporales (Fig. 1B). This compositional similarity in the proteomes of the 151 Rickettsiales, Pelagibacterales and Holosporales, which also turn out to be the longest- 152 branching alphaproteobacterial groups in previously published phylogenies (e.g., Wang 153 and Wu 2015), could be the outcome of either a shared evolutionary history (i.e., the 154 groups are most closely related to one another), or alternatively, evolutionary 155 convergence (e.g., because of similar lifestyles or evolutionary trends toward small cell 156 and genome sizes).

8

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

157 158

9

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

159 Figure 1. Compositional heterogeneity in the Alphaproteobacteria is a major factor that 160 confounds phylogenetic inference. There are great disparities in the genome G+C% 161 content and amino acid compositions of the Rickettsiales, Pelagibacterales and 162 Holosporales with all other alphaproteobacteria. A. A UPGMA (average-linkage) 163 clustering of amino acid compositions (based on the 200 gene set for the 164 Alphaproteobacteria) shows that the Rickettsiales (brown), Pelagibacterales (gray), and 165 Holosporales (light blue) all have very similar proteome amino acid compositions. At the 166 tips of the tree, GARP:FIMNKY ratio values are shown as bars. B. A scatterplot 167 depicting the strong correlation between G+C% (nucleotide compositions) and 168 GARP:FIMNKY ratios (amino acid composition) for the 120 taxa in the 169 Alphaproteobacteria shows a similar clustering of the Rickettsiales, Pelagibacterales 170 and Holosporales.

10

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

171 As a first step to discriminate between these two alternatives, we used maximum 172 likelihood to estimate a tree on our 200-gene dataset for the Alphaproteobacteria under 173 the site-heterogenous model LG+PMSF(ES60)+F+R6. The resulting tree united the 174 Rickettsiales, Pelagibacterales and Holosporales in a fully supported clade (Fig. 2A). 175 The clustering of these three groups is suggestive of a phylogenetic artefact (e.g., long- 176 branch attraction or LBA); indeed, such a pattern resembles the one seen in the tree of 177 proteome amino acid compositions (see Fig. 1A). This is because the three groups have 178 the longest branches in the Alphaproteobacteria tree and have compositionally biased 179 and fast-evolving genomes (see Fig. 2). If evolutionary convergence in amino acid 180 compositions is confounding phylogenetic inference for the Alphaproteobacteria, 181 methods aimed at reducing compositional heterogeneity might disrupt the clustering of 182 the Rickettsiales, Pelagibacterales and Holosporales.

183 To further test whether the clustering of the Rickettsiales, Pelagibacterales and 184 Holosporales is real or artefactual we used several different strategies to reduce the 185 compositional heterogeneity of our dataset (see Fig. S2 for the diverse strategies 186 employed). When removing the 50% most compositionally biased (heterogeneous) sites 187 according to ɀ, the clustering between the Rickettsiales, Pelagibacterales and 188 Holosporales is disrupted (Fig. 2B). The new more derived placements for the 189 Pelagibacterales and Holosporales are well supported (further described below), and 190 support tends to increase as compositionally biased sites are removed (Fig. S8). 191 Furthermore, when each of these long-branching taxa is analyzed in isolation (i.e., in 192 the absence of the other two), and compositionally heterogeneity is decreased, new 193 phylogenetic patterns emerge that are incompatible, or in conflict, with their clustering 194 (Fig. S10-S13). In other words, removing the most compositionally biased sites, 195 recoding the data into reduced character-state alphabets to minimize compositional 196 bias, or using only the most compositionally homogeneous genes, converge to very 197 similar phylogenetic patterns for the Alphaproteobacteria that are however incompatible 198 with the clustering of the Rickettsiales, Pelagibacterales and Holosporales (e.g., Fig. 199 S10-S13, and 3). On the other hand, removing fast-evolving sites does not disrupt the 200 clustering of the these three long-branching groups (Fig. S7), suggesting that high

11

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

201 evolutionary rates per site are not a major confounding factor when inferring the 202 phylogeny of the Alphaproteobacteria.

12

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

203 204

13

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

205 Figure 2. Decreasing compositional heterogeneity by removing compositionally biased 206 sites disrupts the clustering of the Rickettsiales, Pelagibacterales and Holosporales. All 207 branch support values are 100% SH-aLRT and 100% UFBoot unless annotated. A. A 208 maximum-likelihood tree inferred under the LG+PMSF(ES60)+F+R6 model and from 209 the intact dataset which is highly compositionally heterogeneous. The three long- 210 branching orders, the Rickettsiales Pelagibacterales and Holosporales, that have similar 211 amino acid compositions form a clade. B. A maximum-likelihood tree inferred under the 212 LG+PMSF(ES60)+F+R6 model and from a dataset whose compositional heterogeneity 213 has been decreased by removing 50% of the most biased sites according to ɀ. In this 214 phylogeny the clustering of the Rickettsiales, Pelagibacterales and Holosporales is 215 disrupted. The Pelagibacterales is sister to the Rhodobacterales, Caulobacterales and 216 Rhizobiales. The Holosporales becomes sister to the Rhodospirillales. The Rickettsiales 217 retains its basal position as sister to the Caulobacteridae. See Fig. S6 for the Bayesian 218 consensus trees inferred in PhyloBayes MPI v1.7 under the CAT-Poisson+Γ4 model.

14

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

219 The Holosporales is unrelated to the Rickettsiales and is instead most likely derived 220 within the Rhodospirillales

221 The Holosporales has traditionally been considered part of the Rickettsiales sensu lato 222 because it appears as sister to the Rickettsiales in many trees. It is exclusively 223 composed of endosymbiotic bacteria living within diverse eukaryotes, and such a 224 lifestyle is shared with all other members of the Rickettsiales. When we decrease, and 225 then account for compositional heterogeneity, we recover tree topologies in which the 226 Holosporales moves away from the Rickettsiales (e.g., Fig. 2B and S13A). For example, 227 the Holosporales becomes sister to all free-living alphaproteobacteria (the 228 Caulobacteridae) when only the 40 most homogeneous genes are used (Fig. S13A) or 229 when 10% of the most compositionally biased sites are removed (Fig. S8). When 230 compositional heterogeneity is further decreased by removing 50% of the most 231 compositionally biased sites, the Holosporales becomes sister to the Rhodospirillales 232 (Fig. 2B and S8; and see also Fig. S10B and S13B).

233 Similarly, when the long-branching Rickettsiales and Pelagibacterales groups (plus the 234 extremely long-branching genera Holospora and ‘Candidatus Hepatobacter’) are 235 removed, after compositional heterogeneity had been decreased through site removal, 236 the Holosporales move to a much more derived position well within the Rhodospirillales 237 (Fig. 3A). If the very compositionally biased and fast-evolving Holospora and 238 ‘Candidatus Hepatobacter’ are left in, the Holosporales are pulled away from its derived 239 position and the whole clade moves closer to the base of the tree (data not shown). The 240 same behaviour is seen when these same taxa are removed, and the data are then 241 recoded into four- or six-character states (Fig. 3B and S14). Specifically, the 242 Holosporales now consistently branches as sister to a subgroup of rhodospirillaleans 243 that includes, among others, the epibiotic predator Micavibrio aeruginosavorus and the 244 purple nonsulfur bacterium Rhodocista centenaria (the Azospirillaceae, see below). This 245 new placement of the Holosporales has nearly full support under both maximum 246 likelihood and Bayesian inference (e.g., >95% UFBoot; see Fig. 3).Thus, three different 247 analyses independently converge to the same pattern and support a derived origin of 248 the Holosporales within the Rhodospirillales: (1) removal of compositionally biased sites

15

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

249 (Fig. 3A), (2) data recoding into four-character states using the dataset-specific scheme 250 S4 (Fig. 3B), and (3) data recoding into six-character states using the dataset-specific 251 scheme S6 (Fig. S14); each of these strategies had to be combined with the removal of 252 the Pelagibacterales and Rickettsiales to recover this phylogenetic position for the 253 Holosporales.

16

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

254 255

17

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

256 Figure 3. The Holosporales branches in a derived position within the Rhodospirillales 257 when compositional heterogeneity is reduced and the long-branching Rickettsiales and 258 Pelagibacterales are removed. Branch support values are 100% SH-aLRT and 100% 259 UFBoot unless annotated. A. A maximum-likelihood tree, inferred under the 260 LG+PMSF(ES60)+F+R6 model, to place the Holosporales in the absence of the 261 Rickettsiales and the Pelagibacterales and when compositional heterogeneity has been 262 decreased by removing 50% of the most biased sites. The Holosporales is sister to the 263 Azospirillaceae fam. nov. within the Rhodospirillales. B. A maximum-likelihood tree, 264 inferred under the GTR+ES60S4+F+R6 model, to place the Holosporales in the 265 absence of the Rickettsiales and the Pelagibacterales and when the data have been 266 recoded into a four-character state alphabet (the dataset-specific recoding scheme S4: 267 ARNDQEILKSTV GHY CMFP W) to reduce compositional heterogeneity. This 268 phylogeny shows a pattern that matches that inferred when compositional heterogeneity 269 has been alleviated through site removal. See Fig. S9 for the Bayesian consensus trees 270 inferred in PhyloBayes MPI v1.7 and under the and the CAT-Poisson+Γ4 model.

18

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

271 A fourth independent analysis further supports a derived placement of the Holosporales 272 nested within the Rhodospirillales. Bayesian inference using the CAT-Poisson+Γ4 273 model, on a dataset whose compositional heterogeneity had been decreased by 274 removing 50% of the most compositionally biased sites but for which no taxa had been 275 removed, also recovered the Holosporales as sister to the Azospirillaceae (see Fig. S6).

276 The Rhodospirillales is a diverse order and comprises five well-supported families

277 The Rhodospirillales is an ancient and highly diversified group, but unfortunately this is 278 rarely obvious from published phylogenies because most studies only include a few 279 species for this order (Williams, Sobral, and Dickerman 2007; Georgiades et al. 2011; 280 Ferla et al. 2013). We have included a total of 31 Rhodospirillales taxa to better cover 281 its diversity. Such broad sampling reveals trees with five clear subgroups within the 282 Rhodospirillales that are well-supported in most of our analyses (e.g., Fig. 2B and 3). 283 First is the Acetobacteraceae which comprises acetic acid, acidophilic, and 284 photosynthesizing (-containing) bacteria. The Acetobacteraceae is 285 strongly supported and relatively divergent from all other families within the 286 Rhodospirillales. Sister to the Acetobacteraceae is another subgroup that comprises 287 many photosynthesizing bacteria, including the type species for the Rhodospirillales, 288 , as well as the magnetotactic bacterial genera , 289 Magnetovibrio and Magnetospira (Fig. 3). This subgroup best corresponds to the poorly 290 defined and paraphyletic family. We amend the Rhodospirillaceae 291 taxon and restrict it to the clade most closely related to the Acetobacteraceae. As 292 described above, when artefacts are accounted for, the Holosporales most likely 293 branches within the Rhodospirillales and therefore we suggest the Holosporales sensu 294 Szokoli et al. (2016) is lowered in rank to the family Holosporaceae, which is sister to 295 the Azospirillaceae (Fig. 3). The Azospirillaceae fam. nov. contains the purple 296 bacterium Rhodocista centenaria and the epibiotic (neither periplasmic nor intracellular) 297 predator Micavibrio aeruginosavorus, among others. The Holosporaceae and the 298 Azospirillaceae clades appear to be sister to the Rhodovibriaceae fam. nov. (Fig. 3), a 299 well-supported group that comprises the purple nonsulfur bacterium Rhodovibrio 300 salinarum, the aerobic heterotroph Kiloniella laminariae, and the marine

19

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

301 bacterioplankter ‘Candidatus Puniceispirillum marinum’ (or the SAR116 clade). Each of 302 these subgroups and their interrelationships—with the exception of the Holosporaceae 303 that branches within the Rhodospirillales only after compositional heterogeneity is 304 countered—are strongly supported in nearly all of our analyses (e.g., see Fig. 2B and 305 3).

306 The Geminicoccaceae might be basal to all other free-living alphaproteobacteria (the 307 Caulobacteridae)

308 The Geminicocacceae is a recently proposed family within the Rhodospirillales 309 (Proença et al. 2017). It is currently represented by only two genera, and 310 Arboriscoccus (Foesel et al. 2007; Proença et al. 2017). In most of our trees, however, 311 mobilis is often sister to Geminicoccus roseus with full statistical support (e.g., 312 Fig. 2B and 3A, but see Fig. S9 for an exception) and we therefore consider it to be part 313 of the Geminococcaceae. Interestingly, the Geminicoccaceae tends to have two 314 alternative stable positions in our analyses, either as sister to all other families of the 315 Rhodospirillales (e.g., Fig. 2A and 3A), or alternatively, as sister to all other orders of 316 the Caulobacteridae (i.e., representing the most basal lineage of free-living 317 alphaproteobacteria; Fig. 2B and 3B, or Fig. S11C and S12C). Our analyses designed 318 to alleviate compositional heterogeneity, specifically site removal and recoding (without 319 taxon removal), favor the latter position for the Geminicoccaceae (Fig. 2B and 3B). 320 Moreover, as compositionally biased sites are progressively removed, support for the 321 affiliation of the Geminicoccaceae with the Rhodospirillales decreases, and after 50% of 322 the sites have been removed, the Geminicoccaceae emerges as sister to all other free- 323 living alphaproteobacteria with strong support (>95% UFBoot; Fig. S8). In further 324 agreement with this trend, the much simpler model LG4X places the Geminicocacceae 325 in a derived position as sister to the Acetobacteraceae (data not shown), but as model 326 complexity increases, and compositional heterogeneity is reduced, the 327 Geminicoccaceae moves closer to the base of the Alphaproteobacteria (Fig. 2A and 328 3A). Such a placement suggests that the Geminicoccaceae may be a novel and 329 independent order-level lineage in the Alphaproteobacteria. However, because of the

20

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

330 uncertainty in our results we opt here for conservatively keeping the Geminicoccaceae 331 as the sixth family of the Rhodospirillales (Fig. 3A).

332 Other deep relationships in the Alphaproteobacteria (Pelagibacterales, Rickettsiales, 333 alphaproteobacterium sp. HIMB59)

334 The clustering of the Pelagibacterales (formerly the SAR11 clade) with the Rickettsiales 335 and Holosporales is more easily disrupted than that of the Holosporales. The removal of 336 compositionally biased sites (from 30% on; 16,320 out of 54,400 sites; Fig. S8), data 337 recoding into four-character states (Fig. S12D), and a set of the most compositionally 338 homogeneous genes (Fig. S13D), all support a derived placement of the 339 Pelagibacterales as sister to the Rhodobacterales, Caulobacterales and Rhizobiales (in 340 agreement with Rodríguez-Ezpeleta and Embley 2012; Viklund, Ettema, and Andersson 341 2012; Viklund et al. 2013; Martijn et al. 2018). The Caulobacterales is sister to the 342 Rhizobiales, and the Rhodobacterales sister to both (e.g., Fig. 2B and 3). This is 343 consistent throughout most of our results and such interrelationships become very 344 robustly supported as compositional heterogeneity is increasingly alleviated (Fig. S8). 345 The placement of the Rickettsiales as sister to the Caulobacteridae (i.e., all other 346 alphaproteobacteria) remains stable across different analyses (Fig. 2B, S10C, S11C, 347 S12C and S13D); this is also true when the other long-branching taxa, the 348 Pelagibacterales and Holosporales, are removed. Yet, the interrelationships inside the 349 Rickettsiales order remain uncertain; the ‘Candidatus Midichloriaceae’ becomes sister 350 to the when fast sites are removed (Fig. S7), but to the 351 Rickettsiaceae when compositionally biased sites are removed (Fig. S8). The 352 placement of alphaproteobacterium sp. HIMB59 is entirely uncertain (e.g., see Fig. 2B 353 and S11E; in contrast to Grote et al. 2012); it does not have a stable position in the 354 Alphaproteobacteria across our diverse analyses. This is consistent with previous 355 reports that suggest that alphaproteobacterium sp. HIMB59 is not closely related to the 356 Pelagibacterales (Viklund et al. 2013; Martijn et al. 2018).

357 DISCUSSION

358 We have employed a diverse set of strategies to investigate the phylogenetic signal 359 contained within 200 genes for the Alphaproteobacteria. Specifically, such strategies

21

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

360 were primarily aimed at reducing amino acid compositional heterogeneity among taxa— 361 a phenomenon that permeates our dataset (Fig. 1). Compositional heterogeneity is a 362 clear violation of the phylogenetic models used in our, and previous, analyses, and 363 known to cause phylogenetic artefacts (Foster 2004). In the absence of more 364 sophisticated models for inferring deep phylogeny, the only way to counter artefacts 365 caused by compositional heterogeneity is by removing compositionally biased sites or 366 taxa, or recoding amino acids into reduced alphabets. A combination of these strategies 367 reveals that the Rickettsiales sensu lato (i.e., the Rickettsiales and Holosporales) is 368 polyphyletic. Our analyses suggest that the Holosporales is derived within the 369 Rhodospirillales, and that therefore this taxon should be lowered in rank and renamed 370 the Holosporaceae family (see Fig. 2B and 3). The same methods suggest that the 371 Rhodospirillales might indeed be a paraphyletic order and that the Geminicoccaceae 372 could be a separate lineage that is sister to the Caulobacteridae (e.g., Fig. 2B). These 373 two results, combined with our broader sampling, reorganize the internal phylogenetic 374 structure of the Rhodospirillales and show that its diversity can be grouped into at least 375 five well-supported major families (Fig. 3).

376 In 16S rRNA gene trees, the Holosporales has most often been allied to the 377 Rickettsiales (Montagna et al. 2013; Hess, Suthaus, and Melkonian 2015). The 378 apparent diversity of this group has quickly increased in recent years as more and more 379 intracellular bacteria living within protists have been described (e.g., Hess, Suthaus, and 380 Melkonian 2015; Szokoli et al. 2016; Eschbach et al. 2009; Boscaro et al. 2013). An 381 endosymbiotic lifestyle is shared by all members of the Holosporales and is also shared 382 with all those that belong to the Rickettsiales. Thus, it had been reasonable to accept 383 their shared ancestry as suggested by some 16S rRNA gene trees (e.g., Montagna et 384 al. 2013; Santos and Massard 2014; Hess, Suthaus, and Melkonian 2015). Apparent 385 strong support for the monophyly of the Rickettsiales and the Holosporales recently 386 came from some multi-gene trees by Wang and Wu (2014, 2015) who expanded 387 sampling for the Holosporales. However, an alternative placement for the Holosporales 388 as sister to the Caulobacteridae has been reported by Ferla et al. (2013) based on 389 rRNA genes, by Georgiades et al., (2011) based on 65 genes, by Schulz et al., (2015) 390 based on 139 genes, as well as by Wang and Wu (2015) based on 26, 29, or 200 genes

22

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

391 (see the supplementary information in Wang and Wu, 2015). This placement was 392 acknowledged by Szokoli et al. (2015), who formally established the order 393 Holosporales. Most recently, Martijn et al. 2018, who used strategies to reduce 394 compositional heterogeneity, and similarly to Wang and Wu (2015), recovered a number 395 of placements for the Holosporales within the Alphaproteobacteria; however these 396 different placements for the Holosporales were poorly supported. Here we provide 397 strong evidence for the hypothesis that the Holosporales is not related to the 398 Rickettsiales, as suggested earlier (Georgiades et al. 2011; Ferla et al. 2013; Szokoli et 399 al. 2016). The Rickettsiales sensu lato is polyphyletic. We show that the Holosporales is 400 artefactually attracted to the Rickettsiales (e.g., Fig. 2A), but as compositional bias is 401 increasingly alleviated (through site removal and recoding), they move further away 402 from them (Fig. 2B). The Holosporales is placed within the Rhodopirillales as sister to 403 the family Azospirillaceae (Fig. 3). The similar lifestyles of the Holosporales and 404 Rickettsiales, as well as other features like the presence of an ATP/ADP translocase 405 (Wang and Wu 2014), are therefore likely the outcome of .

406 A derived origin of the Holosporales has important implications for understanding the 407 origin of mitochondria and the nature of their ancestor. Wang and Wu (2014, 2015) 408 proposed that mitochondria are phylogenetically embedded within the Rickettsiales 409 sensu lato. In their trees, mitochondria were sister to a clade formed by the 410 Rickettsiaceae, Anaplasmataceae and ‘Candidatus Midichloriaceae’, and the 411 Holosporales was itself sister to all of them. This phylogenetic placement for 412 mitochondria suggested that the ancestor of mitochondria was an intracellular parasite 413 (Wang and Wu, 2014). But if the Holosporales is a derived group of rhodospirillaleans 414 as we show here (see Fig. 3), then the argument that mitochondria necessarily evolved 415 from parasitic alphaproteobacteria no longer holds. While the sisterhood of mitochondria 416 and the Rickettsiales sensu stricto is still a possibility, such a relationship does not imply 417 that the two groups shared a parasitic common ancestor (i.e., a parasitic ancestry for 418 mitochondria). The most recent analyses done by Martijn et al. (2018) suggest that 419 mitochondria are sister to all known alphaproteobacteria, also suggesting their non- 420 parasitic ancestry. Our study, and that of Martijn et al., thus complement each other and

23

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

421 support the view that mitochondria most likely evolved from ancestral free-living 422 alphaproteobacteria (contra Sassera et al. 2011; Wang and Wu 2014, 2015).

423 The order Rhodospirillales is quite diverse and includes many purple nonsulfur bacteria 424 as well as all within the Alphaproteobacteria. The 425 Rhodospirillales is sister to all other orders in the Caulobacteridae, and has historically 426 been subdivided into two families: the Rhodospirillaceae and the Acetobacteraceae. 427 Recently, a new family, the Geminicoccaceae, was established for the Rhodospirillales 428 (Proença et al. 2017). However, some of our analyses suggest that the 429 Geminicoccaceae might be sister to all other Caulobacteridae (e.g., Fig. 2B and 3B). 430 This phylogenetic pattern, therefore, suggests that the Rhodospirillales may be a 431 paraphyletic order. The placement of the Geminicoccaceae as sister to the 432 Caulobacteridae needs to be further tested once more sequenced diversity for this 433 group becomes available; if it were to be confirmed, the Geminicoccaceae should be 434 elevated to the order level. Whereas the Acetobacteraceae is phylogenetically well- 435 defined, there has been considerable uncertainty about the Rhodospirillaceae (e.g., 436 Ferla et al., 2013), primarily because of poor sampling and a lack of resolution provided 437 by the 16S rRNA gene. We subdivide the Rhodospirillaceae sensu lato into three 438 subgroups (Fig. 3). We restrict the Rhodospirillaceae sensu stricto to the subgroup that 439 is sister to the Acetobacteracae (Fig. 3). The other two subgroups are the 440 Rhodovibriaceae and the Azospirillaceae; the latter is sister to the Holosporaceae (Fig. 441 3).

442 Based on our fairly robust phylogenetic patterns, we have updated the higher-level 443 taxonomy of the Alphaproteobacteria (Fig. 4). We exclude the Magnetococcales from 444 the Alphaproteobacteria class because of its divergent nature (e.g., see Fig. 1 in Esser, 445 Martin, and Dagan 2007 which shows that many of Magnetococcus’ genes are more 446 similar to those of beta-, and ). In agreement with its intermediate 447 phylogenetic placement, we endorse the Magnetococcia class as proposed by Parks et 448 al. (2018). At the highest level we define the Alphaproteobacteria class as comprising 449 two subclasses sensu Ferla et al. (2013), the Rickettsidae and the Caulobacteridae. 450 The former contains the Rickettsiales, and the latter contains all other orders, which are

24

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

451 primarily and ancestrally free-living alphaproteobacteria. The order Rickettsiales 452 comprises three families as previously defined, the Rickettsiaceae, the 453 Anaplasmataceae, and the ‘Candidatus Midichloriaceae’. On the other hand, the 454 Caulobacteridae is composed of seven phylogenetically well-supported orders: the 455 Rhodospirillales, Sneathiellales, Sphingomonadales, Pelagibacterales, 456 Rhodobacterales, Caulobacterales and Rhizobiales. Among the many species claimed 457 to represent new order-level lineages on the basis of 16S rRNA gene trees (Cho and 458 Giovannoni 2003; Kwon et al. 2005; Kurahashi et al. 2008; Wiese et al. 2009; Harbison 459 et al. 2017), only Sneathiella deserves order-level status (Kurahashi et al. 2008), since 460 all others have derived placements in our trees and those published by others (Williams 461 et al. 2012; Bazylinski et al. 2013; Venkata Ramana et al. 2013; Harbison et al. 2017). 462 The Rhodospirillales order comprises six families, three of which are new, namely the 463 Holosporaceae, Azospirillaceae and Rhodovibriaceae (Fig. 4). This new higher-level 464 classification of the Alphaproteobacteria updates and expands those presented by Ferla 465 et al. (2013), the ‘Bergey’s Manual of Systematics of and Bacteria’ (Garrity 466 2005; Whitman 2015), and ‘The ’ (Rosenberg et al. 2014). The classification 467 scheme proposed here could be partly harmonized with that recently proposed by Parks 468 et al., (2018) by elevating the six families within the Rhodospirllales to the order level; 469 the phylograms by Parks et al., (2018), however, are in conflict with those shown here 470 and many of their proposed taxa are as well.

25

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

471

472 Figure 4. A higher-level classification scheme for the Alphaproteobacteria and the 473 Magnetococcia classes within the , and the Rickettsiales and 474 Rhodospirillales orders within the Alphaproteobacteria. 475

26

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

476 CONCLUSIONS

477 We employed a combination of methods to decrease compositional heterogeneity in 478 order to disrupt artefacts that arise when inferring the phylogeny of the 479 Alphaproteobacteria. This is an example of the complex nature of the historical signal 480 contained in modern genomes and the limitations of our current evolutionary models to 481 capture these signals. A robust phylogeny of the Alphaproteobacteria is a precondition 482 for placing the mitochondrial lineage. This is because including mitochondria certainly 483 exacerbates the already strong biases in the data, and therefore represents additional 484 sources for artefact in phylogeny inference (as seen in Wang and Wu, 2015 where the 485 Holosporales is attracted by both mitochondria and the Rickettsiales). Future endeavors 486 should take into account the new phylogenetic framework developed in here. The 487 incorporation of non-culturable 'environmental' diversity recovered from metagenomes 488 will certainly expand the known diversity of the Alphaproteobacteria and improve 489 phylogenetic inference.

490 TAXON DESCRIPTIONS

491 Rickettsidae emend. (Alphaproteobacteria) Rickettsia is the type of the subclass. 492 The Rickettsidae subclass is here amended by redefining its circumscription so it 493 remains monophyletic by excluding the Pelagibacterales order. The emended 494 Rickettsidae subclass within the Alphaproteobacteria class is defined based on 495 phylogenetic analyses of 200 genes which are predominantly single-copy and vertically- 496 inherited (unlikely laterally transferred) when compositionally heterogeneity was 497 decreased by site removal or recoding. Phylogenetic (node-based) definition: the least 498 inclusive clade containing phagocytophilum HZ, Wilmington, 499 and ‘Candidatus mitochondrii’ IricVA. The Rickettsidae does not include: 500 Pelagibacter sp. HIMB058, ‘Candidatus Pelagibacter sp.’ IMCC9063, 501 alphaproteobacterium HIMB59, Caedibacter sp. 37-49, ‘Candidatus Nucleicultrix 502 amoebiphila’ FS5, ‘Candidatus Finniella lucida’, Holospora obtusa F1, Sneathiella 503 glossodoripedis JCM 23214, wittichii, and subvibrioides 504 ATCC 15264.

27

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

505 Caulobacteridae emend. (Alphaproteobacteria) is the type genus of the 506 subclass. The Caulobacteridae subclass is here amended by redefining its 507 circumscription so it remains monophyletic by including the Pelagibacterales order. The 508 emended Caulobacteridae subclass within the Alphaproteobacteria class is defined 509 based on phylogenetic analyses of 200 genes which are predominantly single-copy and 510 vertically-inherited (unlikely laterally transferred) when compositionally heterogeneity 511 was decreased by site removal or recoding. Phylogenetic (node-based) definition: the 512 least inclusive clade containing Pelagibacter sp. HIMB058, ‘Candidatus Pelagibacter 513 sp.’ IMCC9063, alphaproteobacterium HIMB59, Caedibacter sp. 37-49, ‘Candidatus 514 Nucleicultrix amoebiphila’ FS5, ‘Candidatus Finniella lucida’, Holospora obtusa F1, 515 Sneathiella glossodoripedis JCM 23214, , and Brevundimonas 516 subvibrioides ATCC 15264. The Caulobacteridae does not include: Anaplasma 517 phagocytophilum HZ, Rickettsia typhi Wilmington, and ‘Candidatus Midichloria 518 mitochondrii’ IricVA

519 Azospirillaceae fam. nov. (Rhodospirillales, Alphaproteobacteria) is the 520 type genus of the family. This new family within the Rhodospirillales order is defined 521 based on phylogenetic analyses of 200 genes which are predominantly single-copy and 522 vertically-inherited (unlikely laterally transferred). Phylogenetic (node-based) definition: 523 the least inclusive clade containing Micavibrio aeruginoavorus ARL-13, Rhodocista 524 centenaria SW, and limosus DSM 16000. The Azospirillaceae does not 525 include: Rhodovibrio salinarum DSM 9154, ‘Candidatus Puniceispirillum marinum’ IMCC 526 1322, Rhodospirillum rubrum ATCC 11170, pusilla DSM 6293, 527 angustum ATCC 49957, and Elioraea tepidiphila DSM 17972.

528 Rhodovibriaceae fam. nov. (Rhodospirillales, Alphaproteobacteria) Rhodovibrio is the 529 type genus of the family. This new family within the Rhodospirillales order is defined 530 based on phylogenetic analyses of 200 genes which are predominantly single-copy and 531 vertically-inherited (unlikely laterally transferred). Phylogenetic (node-based) definition: 532 the least inclusive clade containing Rhodovibrio salinarum DSM 9154, Kiloniella 533 laminariae DSM 19542, Oceanibaculum indicum P24, Thalassobaculum salexigens 534 DSM 19539 and ‘Candidatus Puniceispirillum marinum’ IMCC 1322. The

28

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

535 Rhodovobriaceae does not include: Rhodospirillum rubrum ATCC 11170, Terasakiella 536 pusilla DSM 6293, Rhodocista centenaria SW, Micavibrio aeruginoavorus ARL-13, 537 Acidiphilium angustum ATCC 49957, and Elioraea tepidiphila DSM 17972.

538 Rhodospirillaceae emend. (Rhodospirillales, Alphaproteobacteria) Rhodospirillum is the 539 type genus of the family. The Rhodospirillaceae family is here amended by redefining its 540 circumscription so it remains monophyletic. The emended Rhodospirillaceae family 541 within the Rhodospirillales order is defined based on phylogenetic analyses of 200 542 genes which are predominantly single-copy and vertically-inherited (unlikely laterally 543 transferred). Phylogenetic (node-based) definition: the least inclusive clade containing 544 Rhodospirillum rubrum ATCC 11170, parvum 930l, Magnetospirillum 545 magneticum AMB-1 and Terasakiella pusilla DSM 6293. The Rhodospirillaceae does 546 not include: Rhodocista centenaria SW, Micavibrio aeruginoavorus ARL-13, ‘Candidatus 547 Puniceispirillum marinum’ IMCC 1322, Rhodovibrio salinarum DSM 9154, Elioraea 548 tepidiphila DSM 17972, and Acidiphilium angustum ATCC 49957.

549 Holosporaceae (Rhodospirillales, Alphaproteobacteria) Holospora is the type genus of 550 the family. The Holosporaceae family as defined here has the same taxon 551 circumscription as the Holosporales order sensu Szokoli et al., (2016), but it is here 552 lowered to the family level and placed within the Rhodospirillales order. The new family 553 rank-level for this group is based on the phylogenetic analysis of 200 genes, which are 554 predominantly single-copy and vertically-inherited (unlikely laterally transferred), when 555 compositionally heterogeneity was decreased by site removal or recoding (and coupled 556 to the removal of the long-branching taxa Pelagibacterales and Rickettsiales). The 557 family contains three subfamilies (lowered in rank from a former family level) and one 558 formally-undescribed clade, namely, the Holosporodeae, and ‘Candidatus 559 Paracaedibacteriodeae’, ‘Candidatus Hepatincolodeae’, and the Caedibacter- 560 Nucleicultrix clade.

561 METHODS

562 Genome sequencing

29

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

563 Axenic cultures of Viridiraptor invadens strain Virl02, the host of ‘Candidatus Finniella 564 inopinata’, were grown on the filamentous green alga Zygnema pseudogedeanum strain 565 CCAC 0199 as described in (Hess and Melkonian 2013). Once the algal food was 566 depleted, Viridiraptor cells were harvested by filtration through a cell strainer (mesh size 567 40 µm to remove algal cell walls) and centrifugation (~1,000 g for 15 min). For short- 568 read sequencing, DNA extraction of total gDNA was carried out with the ZR 569 Fungal/Bacterial DNA MicroPrep Kit (Zymo Research) using a BIO101/Savant FastPrep 570 FP120 high-speed bead beater and 20 uL of proteinase K (20 mg/mL). A sequencing 571 library was made using the NEBNext Ultra II DNA Library Prep Kit (New England 572 Biolabs). DNA sequencing libraries were sequenced with an Illumina MiSeq instrument 573 (Dalhousie University; Canada). For long-read sequencing, DNA extraction was 574 performed using a CTAB and phenol-chloroform method. Total gDNA was further 575 cleaned through a QIAGEN Genomic-Tip 20/G. A sequencing library was made using 576 the Nanopore Ligation Sequencing Kit 1D (SQK-LSK108). Sequencing was done on a 577 portable MinION instrument (Oxford Nanopore Technologies).

578 Peranema trichophorum strain CCAP 1260/1B was obtained from the Culture Collection 579 of Algae and (CCAP, Oban, Scotland) and grown in liquid Knop media plus 580 egg yolk crystals. Total gDNA was extracted following Lang and Burger 2007. A 581 sequencing library was made using a TruSeq DNA Library Prep Kit (Illumina). DNA 582 sequencing libraries were sequenced with an Illumina MiSeq instrument (Genome 583 Quebec Innovation Centre; Canada).

584 Stachyamoeba lipophora strain ATCC 50324 cells feeding on were 585 harvested and these were broken up with pestle and mortar in the presence of glass 586 beads (< 450 µm diameter). Total gDNA was extracted using the QIAGEN Genomic 587 G20 Kit. A sequencing library was made using a TruSeq DNA Library Prep Kit 588 (Illumina). DNA sequencing libraries were sequenced with an Illumina MiSeq instrument 589 (Genome Quebec Innovation Centre; Canada).

590 Genome assembly and annotation

591 Short sequencing reads produced in an Illumina MiSeq from Viridiraptor invadens, 592 Peranema trichophorum, and Stachyamoeba lipophora were first assessed with

30

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

593 FASTQC v0.11.6 and then, based on its reports, trimmed with Trimmomatic v0.32 594 (Bolger, Lohse, and Usadel 2014) using the options: HEADCROP:16 LEADING:30 595 TRAILING:30 MINLEN:36. Illumina adapters were similarly removed with Trimmomatic 596 v0.32 using the option ILLUMINACLIP. Long sequencing reads produced in an 597 Nanopore MinION instrument from Viridiraptor invadens were basecalled with Albacore 598 v2.1.7, adapters were removed with Porechop v0.2.3, lambda phage reads were 599 removed with NanoLyse v0.5.1, quality filtering was done with NanoFilt v2.0.0 (with the 600 options ‘--headcrop 50 -q 8 -l 1000’), and identity filtering against the high-quality short 601 Illumina reads was done with Filtlong v0.2.0 (and the options ‘--keep_percent 90 --trim -- 602 split 500 --length_weight 10 --min_length 1000’). Statistics were calculated throughout 603 the read processing workflow with NanoStat v0.8.1 and NanoPlot v1.9.1. A hybrid co- 604 assembly of both processed Illumina short reads and Nanopore long reads from 605 Viridiraptor invadens was done with SPAdes v3.6.2 (Bankevich et al. 2012). Assemblies 606 of the Illumina short reads from Peranema trichophorum and Stachyamobea lipophora 607 were separately done with SPAdes v3.6.2 (Bankevich et al. 2012). The resulting 608 assemblies for both Viridiraptor invadens and Peranema trichophorum were later 609 separately processed with the Anvi’o v2.4.0 pipeline (Eren et al. 2015) and refined 610 genome bins corresponding to ‘Candidatus Finniella inopinata’ and the Peranema- 611 associated rickettsialean were isolated primarily based on tetranucleotide sequence 612 composition and taxonomic affiliation of its contigs. A single contig corresponding to the 613 genome of the Stachyamoeba-associated rickettsialean was obtained from its assembly 614 and this was circularized by collapsing the overlapping ends of the contig. Gene 615 prediction and genome annotation was carried out with Prokka v.1.13 (see Table 1).

616 Taxon and gene selection

617 The selection of 120 taxa was largely based on the phylogenetically diverse set of 618 alphaproteobacteria determined by Wang and Wu (2015). To this set of taxa, recently 619 sequenced and divergent unaffiliated alphaproteobacteria were added, as well as those 620 claimed to constitute novel order-level taxa. Some other groups, like the 621 Pelagibacterales, Rhodospirillales and the Holosporales, were expanded to better 622 represent their diversity (see Fig. S1).

31

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

623 A set of 200 gene markers (54,400 sites; 9.03% missing data, see Fig. S1) defined by 624 Phyla-AMPHORA was used (Wang and Wu 2013). The genes are single-copy and 625 predominantly vertically-inherited as assessed by congruence among them (Wang and 626 Wu 2013). Another smaller dataset of 40 compositionally-homogenous genes (5,570 627 sites; 5.98% missing data) was built by selecting the least compositionally 628 heterogeneous genes from the larger 200 gene set according compositional 629 homogeneity tests performed in P4 (Table S1). This was done as an alternative way to 630 overcome the strong compositional heterogeneity observed in datasets for the 631 Alphaproteobacteria with a broad selection of taxa. In brief, the P4 tests rely on 632 simulations based on a provided tree (here inferred for each gene under the model 633 LG4X+F in IQ-TREE) and a model (LG+F+G4 available in P4) to obtain proper null 634 distributions to which to compare the X2 statistic. Most standard tests for compositional 635 homogeneity (those that do not rely on simulate the data on a given tree) ignore 636 correlation due to phylogenetic relatedness, and can suffer from a high probability of 637 false negatives (Foster 2004).

638 Variations of our full set were made to specifically assess the placement of each long- 639 branching groups individually. In other words, each group with comparatively long 640 branches (the Rickettsiales, Pelagibacterales, Holosporales, and alphaproteobacterium 641 sp. HIMB59) was analyzed in isolation, i.e., in the absence of other long-branching taxa. 642 This was done with the purpose of removing the potential artefactual attraction among 643 these groups. Taxon removal was done in addition to compositionally biased site 644 removal and data recoding into reduced character-state alphabets (for a summary of the 645 different methodological strategies employed see Fig. S2).

646 Removal of compositionally biased and fast sites

647 As an effort to reduce artefacts in phylogenetic inference from our dataset (which might 648 stem from extreme divergence in the evolution of the Alphaproteobacteria), we removed 649 sites estimated to be highly compositionally heterogeneous or fast evolving. The 650 compositional heterogeneity of a site was estimated by using a metric intended to 651 measure the degree of disparity between the most %AT-rich taxa and all others. Taxa 652 were ordered from lowest to highest proteome GARP:FIMNKY ratios; ‘GARP’ amino

32

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

653 acids are encoded by %GC-rich codons, whereas ‘FIMNKY’ amino acids are encoded 654 by %AT-rich codons. The resulting plot was visually inspected and a GARP:FIMNKY 655 ratio cutoff of 1.06 (which represented a discontinuity or gap in the distribution) was 656 chosen to divide the dataset into low GARP:FMINKY (or %AT-rich) and higher 657 GARP:FIMNKY (or ‘GC-rich’) taxa (Fig. S3). Next, we determined the degree of 658 compositional bias per site (ɀ) for the frequencies of both FIMNKY and GARP amino 659 acids between the %AT-rich and all other (‘GC-rich’) alphaproteobacteria. To calculate 660 this metric for each site the following formula was used:

661 ɀ = (휋퐹퐼푀푁퐾푌%AT-rich − 휋퐹퐼푀푁퐾푌%GC-rich) + (휋퐺퐴푅푃%GC-rich − 휋퐺퐴푅푃%AT-rich)

662 where 휋퐹퐼푀푁퐾푌 and 휋퐺퐴푅푃 are the sum of the frequencies for FIMNKY and GARP 663 amino acids at a site, respectively, for either ‘%AT-rich’ or ‘%GC-rich’ taxa. According to 664 this metric, higher values measure a greater disparity between %AT-rich 665 alphaproteobacteria and all others; a measure of compositional heterogeneity or bias 666 per site. The most compositionally heterogeneous sites according to ɀ were 667 progressively removed using the software SiteStripper (Verbruggen 2018) in increments 668 of 10%. We also progressively removed the fastest evolving sites in increments of 10%. 669 Conditional mean site rates were estimated under the LG+C60+F+R6 model in IQ- 670 TREE v1.5.5 using the ‘-wsr’ flag (Nguyen et al. 2015).

671 Data recoding

672 Our datasets were recoded into four- and six-character state amino acid alphabets 673 using dataset-specific recoding schemes aimed at minimizing compositional 674 heterogeneity in the data (Susko and Roger 2007). The program minmax-chisq, 675 which implements the methods of Susko and Roger (2007), was used to find the best 676 recoding schemes—please see Fig. 3, S12 and S14 legends for the specific recoding 677 schemes used for each dataset. The approach uses the chi-squared (X2) statistic for a 678 test of homogeneity of frequencies as a criterion function for determining the best

679 recoding schemes. Let 휋푖 denote the frequency of bin 푖 for the recoding scheme 680 currently under consideration. For instance, suppose the amino acids were recoded into 681 four bins,

33

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

682 RNCM EHIPTWV ADQLKS GFY.

683 Then 휋4would be the frequency with which the amino acids G, F or Y were observed. 2 684 Let 휋푖푠 be the frequency of bin 푖 for the 푠th taxa. Then the X statistic for the null 685 hypothesis that the frequencies are constant, over taxa, against the unrestricted 686 hypothesis is

2 687 푡푠 = ∑ (휋푖푠 − 휋푖) /휋푖 푖푠

688 The X2 statistic provides a measure of how different the frequencies for the 푠th taxa are

689 from the average frequencies. The maximum 푡푠 over 푠 is taken as an overall measure of 690 how heterogeneous the frequencies are for a given recoding scheme. The minmax- 691 chisq program searches through recoding schemes, moving amino acids from one bin

692 to another, to try to minimize the 푚푎푥 푡푠 (Susko and Roger 2007).

693 Phylogenetic inference

694 The inference of phylogenies was primarily done under the maximum likelihood (ML) 695 framework and using IQ-TREE v1.5.5 (Minh, Nguyen, and von Haeseler 2013; Nguyen 696 et al. 2015). We first inferred guide trees (for a PMSF analysis) with a model that 697 comprises the LG empirical matrix, with empirical frequencies estimated from the data 698 (F), six rates for the FreeRate model to account for rate heterogeneity across sites (R6), 699 and a mixture model with 60 amino acid profiles (C60) to account for compositional 700 heterogeneity across sites—LG+C60+F+R6. Because the computational power and 701 time required to properly explore the whole tree space (given such a big dataset and 702 complex model) was too high, constrained tree searches were employed to obtain these 703 initial guide trees (see Fig. S1 for the constraint tree). Many shallow nodes were 704 constrained if they received maximum UFBoot and SH-aLRT support in a 705 LG+PMSF(C60)+F+R6 analysis. All deep nodes, those relevant to the questions 706 addressed here, were left unconstrained (Fig. S1). The guide trees were then used 707 together with a dataset-specific mixture model ES60 to estimate site-specific amino acid 708 profiles, or a PMSF (Posterior Mean Site Frequency Profiles) model, that best account 709 for compositional heterogeneity across sites (Wang et al. 2018). The dataset-specific

34

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

710 empirical mixture model ES60 also has 60 categories but, unlike the general C60, was 711 directly estimated from our large dataset of 200 genes and 120 alphaproteobacteria 712 using the methods described in (Susko, Lincker, and Roger 2018). Final trees were 713 inferred using the LG+PMSF(ES60)+F+R6 model and a fully unconstrained tree search. 714 Those datasets that produced the most novel topologies under ML were further 715 analyzed under a Bayesian framework using PhyloBayes MPI v1.7 and the CAT- 716 Poisson+Γ4 model (Lartillot and Philippe 2004; Lartillot, Lepage, and Blanquart 2009). 717 This model allows for a very large number of classes to account for compositional 718 heterogeneity across sites and, unlike in the more complex CAT-GTR+Γ4 model, also 719 allows for convergence to be more easily achieved between MCMC chains. PhyloBayes 720 MCMC chains were run for more than 10,000 cycles until convergence between the 721 chains was achieved and the largest discrepancy (i.e., maxdiff parameter) was < 0.1. A 722 consensus tree was generated from two PhyloBayes MCMC chains using a burn-in of 723 500 trees and sub-sampling every 10 trees.

724 Phylogenetic analyses of recoded datasets into four-character state alphabets were 725 analyzed using IQ-TREE v1.5.5 and the model GTR+ES60S4+F+R6. ES60S4 is an 726 adaptation of the dataset-specific empirical mixture model ES60 to four-character 727 states. It is obtained by adding the frequencies of the amino acids that belong to each 728 bin in the dataset-specific four-character state scheme S4 (see Data Recoding for 729 details). Phylogenetics analyses of recoded datasets into six-character state alphabets 730 were analyzed using PhyloBayes MPI v1.7 and the CAT-Poisson+Γ4 model. Maximum- 731 likelihood analyses with a six-state recoding scheme could not be performed because 732 IQ-TREE currently only supports amino acid datasets recoded into four-character 733 states.

734 Other analyses

735 The 16S rRNA genes of ‘Candidatus Finniella inopinata’, and the presumed 736 of Peranema trichophorum and Stachyamoeba lipophora were identified 737 with RNAmmer 1.2 server and BLAST searches. A set of 16S rRNA genes for diverse 738 rickettsialeans and holosporaleans, and other alphaproteobacteria as , were 739 retrieved from NCBI GenBank. The selection was based on Hess et al., (2016), Szokoli

35

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

740 et al., (2016) and Wang and Wu (2015). Environmental sequences for uncultured and 741 undescribed rickettsialeans were retrieved by keeping the 50 best hits resulting from a 742 BLAST search of our three novel 16S rRNA genes against the NCBI GenBank non- 743 redundant (nr) database. The sequences were aligned with the SILVA aligner SINA 744 v1.2.11 and all-gap sites were later removed. Phylogenetic analyses on this alignment 745 were performed on IQ-TREE v1.5.5 using the GTR+F+R8 model.

746 A UPGMA (average-linkage) clustering of amino acid compositions based on the 200 747 gene set for the Alphaproteobacteria was built in MEGA 7 (Kumar, Stecher, and Tamura 748 2016) from a matrix of Euclidean distances between amino acid compositions of 749 sequences exported from the phylogenetic software P4 (http://p4.nhm.ac.uk/index.html).

750 Data availability

751 The genome of ‘Candidatus Finniella inopinata’, endosymbiont of Peranema 752 trichophorum strain CCAP 1260/1B and endosymbiont of Stachyamoeba lipophora 753 strain ATCC 50324 were deposited in NCBI GenBank under the BioProject 754 PRJNA501864. Raw sequencing reads were deposited on the NCBI SRA archive under 755 the BioProject PRJNA501864. Multi-gene datasets as well as phylogenetic trees 756 inferred in this study were deposited at Mendeley Data under the DOI: 757 http://dx.doi.org/10.17632/75m68dxd83.1.

758 ACKNOWLEDGEMENTS

759 Sergio A. Muñoz-Gómez is supported by a Killam Predoctoral Scholarship and a Nova 760 Scotia Graduate Scholarship. This work was supported by Natural Sciences and 761 Engineering Research (NSERC) Discovery Grants 2016-06792 to A.J.R, RGPIN/05754- 762 2015 to C.H.S, RGPIN/05286-2014 to G.B., and RGPIN-2017-05411 to B.F.L. We thank 763 Jon Jerlström Hultqvist and Gina Filloramo (both at Dalhousie University) for advice on 764 long-read sequencing with the Nanopore MinION. We also thank Camilo A. Calderón- 765 Acevedo for advice about taxonomic issues, and Franziska Szokoli for reading and 766 commenting on a late version of this manuscript. Bruce Curtis (Dalhousie University) 767 and Peter G. Foster (Natural History Museum of London) kindly provided technical help 768 with bioinformatics and with the software P4, respectively. We thank Joanny Roy,

36

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

769 Georgette Kiethega, Matus Valach, and Shona Teijeiro (all at the Université de 770 Montréal), and Drahomira Faktora (University of South Bohemia), for help with the 771 culturing, DNA preparation and sequencing of the endosymbiont of Peranema 772 trichophorum. Some of the genome data used in this study were produced by the US 773 Department of Energy Joint Genome Institute (http://www.jgi.doe.gov/) in collaboration 774 with the user community.

775 COMPETING INTERESTS

776 No competing interests declared by the authors.

777 REFERENCES

778 Bankevich, Anton, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin, 779 Alexander S. Kulikov, Valery M. Lesin, et al. 2012. “SPAdes: A New Genome 780 Assembly Algorithm and Its Applications to Single-Cell Sequencing.” Journal of 781 Computational Biology 19 (5): 455–77. https://doi.org/10.1089/cmb.2012.0021. 782 Bazylinski, Dennis A., Timothy J. Williams, Christopher T. Lefèvre, Denis Trubitsyn, 783 Jiasong Fang, Terrence J. Beveridge, Bruce M. Moskowitz, et al. 2013. 784 “Magnetovibrio Blakemorei Gen. Nov., Sp. Nov., a Magnetotactic Bacterium 785 (Alphaproteobacteria: Rhodospirillaceae) Isolated from a Salt Marsh.” 786 International Journal of Systematic and Evolutionary Microbiology 63 (5): 1824– 787 33. https://doi.org/10.1099/ijs.0.044453-0. 788 Bolger, Anthony M., Marc Lohse, and Bjoern Usadel. 2014. “Trimmomatic: A Flexible 789 Trimmer for Illumina Sequence Data.” Bioinformatics 30 (15): 2114–20. 790 https://doi.org/10.1093/bioinformatics/btu170. 791 Boscaro, Vittorio, Sergei I. Fokin, Martina Schrallhammer, Michael Schweikert, and 792 Giulio Petroni. 2013. “Revised Systematics of Holospora-like Bacteria and 793 Characterization of ‘Candidatus Gortzia Infectiva’, a Novel Macronuclear 794 Symbiont of Jenningsi.” Microbial Ecology 65 (1): 255–67. 795 https://doi.org/10.1007/s00248-012-0110-2. 796 Brindefalk, Björn, Thijs J. G. Ettema, Johan Viklund, Mikael Thollesson, and Siv G. E. 797 Andersson. 2011. “A Phylometagenomic Exploration of Oceanic 798 Alphaproteobacteria Reveals Mitochondrial Relatives Unrelated to the SAR11 799 Clade.” PLoS ONE 6 (9): e24457. https://doi.org/10.1371/journal.pone.0024457. 800 Cho, Jang-Cheon, and Stephen J. Giovannoni. 2003. “ Bermudensis Gen. 801 Nov., Sp. Nov., a Marine Bacterium That Forms a Deep Branch in the Alpha- 802 Proteobacteria.” International Journal of Systematic and Evolutionary 803 Microbiology 53 (Pt 4): 1031–36. https://doi.org/10.1099/ijs.0.02566-0. 804 Eren, A. Murat, Özcan C. Esen, Christopher Quince, Joseph H. Vineis, Hilary G. 805 Morrison, Mitchell L. Sogin, and Tom O. Delmont. 2015. “Anvi’o: An Advanced 806 Analysis and Visualization Platform for ‘omics Data.” PeerJ 3 (October): e1319. 807 https://doi.org/10.7717/peerj.1319.

37

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

808 Eschbach, Erik, Martin Pfannkuchen, Michael Schweikert, Denja Drutschmann, Franz 809 Brümmer, Sergei Fokin, Wolfgang Ludwig, and Hans-Dieter Görtz. 2009. 810 “‘Candidatus Paraholospora Nucleivisitans’, an Intracellular Bacterium in 811 Paramecium Sexaurelia Shuttles between the Cytoplasm and the Nucleus of Its 812 Host.” Systematic and Applied Microbiology 32 (7): 490–500. 813 https://doi.org/10.1016/j.syapm.2009.07.004. 814 Esser, Christian, William Martin, and Tal Dagan. 2007. “The Origin of Mitochondria in 815 Light of a Fluid Prokaryotic Chromosome Model.” Biology Letters 3 (2): 180–84. 816 https://doi.org/10.1098/rsbl.2006.0582. 817 Ettema, Thijs J. G., and Siv G. E. Andersson. 2009. “The Alpha-Proteobacteria: The 818 Darwin Finches of the Bacterial World.” Biology Letters 5 (3): 429–32. 819 https://doi.org/10.1098/rsbl.2008.0793. 820 Ferla, Matteo P., J. Cameron Thrash, Stephen J. Giovannoni, and Wayne M. Patrick. 821 2013. “New RRNA Gene-Based Phylogenies of the Alphaproteobacteria Provide 822 Perspective on Major Groups, Mitochondrial Ancestry and Phylogenetic 823 Instability.” PLoS ONE 8 (12): e83383. 824 https://doi.org/10.1371/journal.pone.0083383. 825 Fitzpatrick, David A., Christopher J. Creevey, and James O. McInerney. 2006. “Genome 826 Phylogenies Indicate a Meaningful Alpha-Proteobacterial Phylogeny and Support 827 a Grouping of the Mitochondria with the Rickettsiales.” Molecular Biology and 828 Evolution 23 (1): 74–85. https://doi.org/10.1093/molbev/msj009. 829 Foesel, Bärbel U., Anita S. Gössner, Harold L. Drake, and Andreas Schramm. 2007. 830 “Geminicoccus Roseus Gen. Nov., Sp. Nov., an Aerobic Phototrophic 831 Alphaproteobacterium Isolated from a Marine Aquaculture Biofilter.” Systematic 832 and Applied Microbiology 30 (8): 581–86. 833 https://doi.org/10.1016/j.syapm.2007.05.005. 834 Foster, Peter G. 2004. “Modeling Compositional Heterogeneity.” Systematic Biology 53 835 (3): 485–95. 836 Garrity, George, ed. 2005. Bergey’s Manual of Systematic Bacteriology: Volume 2 : The 837 Proteobacteria. 2nd ed. Springer US. 838 //www.springer.com/gp/book/9780387950402. 839 Georgiades, Kalliopi, Mohammed-Amine Madoui, Phuong Le, Catherine Robert, and 840 Didier Raoult. 2011. “Phylogenomic Analysis of Odyssella Thessalonicensis 841 Fortifies the Common Origin of Rickettsiales, Pelagibacter Ubique and 842 Reclimonas Americana .” PLoS ONE 6 (9): e24857. 843 https://doi.org/10.1371/journal.pone.0024857. 844 Giovannoni, Stephen J. 2017. “SAR11 Bacteria: The Most Abundant Plankton in the 845 Oceans.” Annual Review of Marine Science 9 (January): 231–55. 846 https://doi.org/10.1146/annurev-marine-010814-015934. 847 Grote, Jana, J. Cameron Thrash, Megan J. Huggett, Zachary C. Landry, Paul Carini, 848 Stephen J. Giovannoni, and Michael S. Rappé. 2012. “Streamlining and Core 849 Genome Conservation among Highly Divergent Members of the SAR11 Clade.” 850 MBio 3 (5): e00252-12. https://doi.org/10.1128/mBio.00252-12. 851 Harbison, Austin B., Laiken E. Price, Michael D. Flythe, and Suzanna L. Bräuer. 2017. 852 “Micropepsis Pineolensis Gen. Nov., Sp. Nov., a Mildly Acidophilic 853 Alphaproteobacterium Isolated from a Poor Fen, and Proposal of

38

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

854 Micropepsaceae Fam. Nov. within Micropepsales Ord. Nov.” International 855 Journal of Systematic and Evolutionary Microbiology 67 (4): 839–44. 856 https://doi.org/10.1099/ijsem.0.001681. 857 Hess, Sebastian, and Michael Melkonian. 2013. “The Mystery of Clade X: Orciraptor 858 Gen. Nov. and Viridiraptor Gen. Nov. Are Highly Specialised, Algivorous 859 Amoeboflagellates (Glissomonadida, Cercozoa).” Protist 164 (5): 706–47. 860 https://doi.org/10.1016/j.protis.2013.07.003. 861 Hess, Sebastian, Andreas Suthaus, and Michael Melkonian. 2015. “‘Candidatus 862 Finniella’ (Rickettsiales, Alphaproteobacteria), Novel Endosymbionts of 863 Viridiraptorid Amoeboflagellates (Cercozoa, Rhizaria).” Applied and 864 Environmental Microbiology 82 (2): 659–70. https://doi.org/10.1128/AEM.02680- 865 15. 866 Kumar, Sudhir, Glen Stecher, and Koichiro Tamura. 2016. “MEGA7: Molecular 867 Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets.” Molecular 868 Biology and Evolution 33 (7): 1870–74. https://doi.org/10.1093/molbev/msw054. 869 Kurahashi, Midori, Yukiyo Fukunaga, Shigeaki Harayama, and Akira Yokota. 2008. 870 “Sneathiella Glossodoripedis Sp. Nov., a Marine Alphaproteobacterium Isolated 871 from the Nudibranch Glossodoris Cincta, and Proposal of Sneathiellales Ord. 872 Nov. and Sneathiellaceae Fam. Nov.” International Journal of Systematic and 873 Evolutionary Microbiology 58 (Pt 3): 548–52. https://doi.org/10.1099/ijs.0.65328- 874 0. 875 Kwon, Kae Kyoung, Hee-Soon Lee, Sung Hyun Yang, and Sang-Jin Kim. 2005. 876 “Kordiimonas Gwangyangensis Gen. Nov., Sp. Nov., a Marine Bacterium 877 Isolated from Marine Sediments That Forms a Distinct Phyletic Lineage 878 (Kordiimonadales Ord. Nov.) in the ‘Alphaproteobacteria.’” International Journal 879 of Systematic and Evolutionary Microbiology 55 (5): 2033–37. 880 https://doi.org/10.1099/ijs.0.63684-0. 881 Lang, B. Franz, and Gertraud Burger. 2007. “Purification of Mitochondrial and Plastid 882 DNA.” Nature Protocols 2 (3): 652–60. https://doi.org/10.1038/nprot.2007.58. 883 Lartillot, Nicolas, Thomas Lepage, and Samuel Blanquart. 2009. “PhyloBayes 3: A 884 Bayesian Software Package for Phylogenetic Reconstruction and Molecular 885 Dating.” Bioinformatics 25 (17): 2286–88. 886 https://doi.org/10.1093/bioinformatics/btp368. 887 Lartillot, Nicolas, and Hervé Philippe. 2004. “A Bayesian Mixture Model for Across-Site 888 Heterogeneities in the Amino-Acid Replacement Process.” Molecular Biology and 889 Evolution 21 (6): 1095–1109. https://doi.org/10.1093/molbev/msh112. 890 Lee, Kyung-Bum, Chi-Te Liu, Yojiro Anzai, Hongik Kim, Toshihiro Aono, and Hiroshi 891 Oyaizu. 2005. “The Hierarchical System of the ‘Alphaproteobacteria’: Description 892 of Fam. Nov., Fam. Nov. and 893 Fam. Nov.” International Journal of Systematic and 894 Evolutionary Microbiology 55 (Pt 5): 1907–19. https://doi.org/10.1099/ijs.0.63663- 895 0. 896 Luo, Haiwei. 2015. “Evolutionary Origin of a Streamlined Marine Bacterioplankton 897 Lineage.” The ISME Journal 9 (6): 1423–33. 898 https://doi.org/10.1038/ismej.2014.227.

39

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

899 Madigan, Michael T., Deborah O. Jung, and Michael T. Madigan. 2009. “An Overview of 900 : Systematics, Physiology, and .” In The Purple 901 Phototrophic Bacteria, 1–15. Advances in Photosynthesis and Respiration. 902 Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-8815-5_1. 903 Martijn, Joran, Frederik Schulz, Katarzyna Zaremba-Niedzwiedzka, Johan Viklund, 904 Ramunas Stepanauskas, Siv G. E. Andersson, Matthias Horn, Lionel Guy, and 905 Thijs J. G. Ettema. 2015. “Single-Cell Genomics of a Rare Environmental 906 Alphaproteobacterium Provides Unique Insights into Rickettsiaceae Evolution.” 907 The ISME Journal 9 (11): 2373–85. https://doi.org/10.1038/ismej.2015.46. 908 Martijn, Joran, Julian Vosseberg, Lionel Guy, Pierre Offre, and Thijs J. G. Ettema. 2018. 909 “Deep Mitochondrial Origin Outside the Sampled Alphaproteobacteria.” Nature 910 557 (7703): 101–5. https://doi.org/10.1038/s41586-018-0059-5. 911 Minh, Bui Quang, Minh Anh Thi Nguyen, and Arndt von Haeseler. 2013. “Ultrafast 912 Approximation for Phylogenetic Bootstrap.” Molecular Biology and Evolution 30 913 (5): 1188–95. https://doi.org/10.1093/molbev/mst024. 914 Montagna, Matteo, Davide Sassera, Sara Epis, Chiara Bazzocchi, Claudia Vannini, 915 Nathan Lo, Luciano Sacchi, Takema Fukatsu, Giulio Petroni, and Claudio Bandi. 916 2013. “‘Candidatus Midichloriaceae’ Fam. Nov. (Rickettsiales), an Ecologically 917 Widespread Clade of Intracellular Alphaproteobacteria.” Applied and 918 Environmental Microbiology 79 (10): 3241–48. 919 https://doi.org/10.1128/AEM.03971-12. 920 Nguyen, Lam-Tung, Heiko A. Schmidt, Arndt von Haeseler, and Bui Quang Minh. 2015. 921 “IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum- 922 Likelihood Phylogenies.” Molecular Biology and Evolution 32 (1): 268–74. 923 https://doi.org/10.1093/molbev/msu300. 924 Parks, Donovan H., Maria Chuvochina, David W. Waite, Christian Rinke, Adam 925 Skarshewski, Pierre-Alain Chaumeil, and Philip Hugenholtz. 2018. “A 926 Standardized Based on Genome Phylogeny Substantially 927 Revises the Tree of Life.” Nature 36 (10): 996–1004. 928 https://doi.org/10.1038/nbt.4229. 929 Proença, Diogo N., William B. Whitman, Neha Varghese, Nicole Shapiro, Tanja Woyke, 930 Nikos C. Kyrpides, and Paula V. Morais. 2017. “Arboriscoccus Pini Gen. Nov., 931 Sp. Nov., an Endophyte from a Pine Tree of the Class Alphaproteobacteria, 932 Emended Description of Geminicoccus Roseus, and Proposal of 933 Geminicoccaceae Fam. Nov.” Systematic and Applied Microbiology, December. 934 https://doi.org/10.1016/j.syapm.2017.11.006. 935 Rodríguez-Ezpeleta, Naiara, and T. Martin Embley. 2012. “The SAR11 Group of Alpha- 936 Proteobacteria Is Not Related to the Origin of Mitochondria.” PLoS ONE 7 (1): 937 e30520. https://doi.org/10.1371/journal.pone.0030520. 938 Roger, Andrew J., Sergio A. Muñoz-Gómez, and Ryoma Kamikawa. 2017. “The Origin 939 and Diversification of Mitochondria.” Current Biology 27 (21): R1177–92. 940 https://doi.org/10.1016/j.cub.2017.09.015. 941 Rosenberg, Eugene, Edward F. DeLong, Stephen Lory, Erko Stackebrandt, and 942 Fabiano Thompson, eds. 2014. The Prokaryotes: Alphaproteobacteria and 943 . 4th ed. Berlin Heidelberg: Springer-Verlag. 944 //www.springer.com/gp/book/9783642301964.

40

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

945 Santos, Huarrisson Azevedo, and Carlos Luiz Massard. 2014. “The Family 946 Holosporaceae.” In The Prokaryotes: Alphaproteobacteria and 947 Betaproteobacteria, edited by Eugene Rosenberg, Edward F. DeLong, Stephen 948 Lory, Erko Stackebrandt, and Fabiano Thompson, 237–46. Berlin, Heidelberg: 949 Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-30197-1_264. 950 Sassera, Davide, Nathan Lo, Sara Epis, Giuseppe D’Auria, Matteo Montagna, 951 Francesco Comandatore, David Horner, et al. 2011. “Phylogenomic Evidence for 952 the Presence of a and Cbb3 Oxidase in the Free-Living Mitochondrial 953 Ancestor.” Molecular Biology and Evolution 28 (12): 3285–96. 954 https://doi.org/10.1093/molbev/msr159. 955 Susko, Edward, Léa Lincker, and Andrew J. Roger. 2018. “Accelerated Estimation of 956 Frequency Classes in Site-Heterogeneous Profile Mixture Models.” Molecular 957 Biology and Evolution may026. https://doi.org/10.1093/molbev/msy026. 958 Susko, Edward, and Andrew J. Roger. 2007. “On Reduced Amino Acid Alphabets for 959 Phylogenetic Inference.” Molecular Biology and Evolution 24 (9): 2139–50. 960 https://doi.org/10.1093/molbev/msm144. 961 Szokoli, Franziska, Michele Castelli, Elena Sabaneyeva, Martina Schrallhammer, 962 Sascha Krenek, Thomas G. Doak, Thomas U. Berendonk, and Giulio Petroni. 963 2016. “Disentangling the Taxonomy of Rickettsiales and Description of Two 964 Novel Symbionts (‘Candidatus Bealeia Paramacronuclearis’ and ‘Candidatus 965 Fokinia Cryptica’) Sharing the Cytoplasm of the Ciliate Protist Paramecium 966 Biaurelia.” Applied and Environmental Microbiology 82 (24): 7236–47. 967 https://doi.org/10.1128/AEM.02284-16. 968 Thrash, J. Cameron, Alex Boyd, Megan J. Huggett, Jana Grote, Paul Carini, Ryan J. 969 Yoder, Barbara Robbertse, Joseph W. Spatafora, Michael S. Rappé, and 970 Stephen J. Giovannoni. 2011. “Phylogenomic Evidence for a Common Ancestor 971 of Mitochondria and the SAR11 Clade.” Scientific Reports 1 (June): 13. 972 https://doi.org/10.1038/srep00013. 973 Venkata Ramana, V., S. Kalyana Chakravarthy, E. V. V. Ramaprasad, V. Thiel, J. F. 974 Imhoff, Ch Sasikala, and Ch V. Ramana. 2013. “Emended Description of the 975 Genus Rhodothalassium Imhoff et Al., 1998 and Proposal of Rhodothalassiaceae 976 Fam. Nov. and Rhodothalassiales Ord. Nov.” Systematic and Applied 977 Microbiology 36 (1): 28–32. https://doi.org/10.1016/j.syapm.2012.09.003. 978 Verbruggen, Heroen. 2018. “SiteStripper Version 1.03.” 979 http://www.phycoweb.net/software. 980 Viklund, Johan, Thijs J. G Ettema, and Siv G. E Andersson. 2012. “Independent 981 Genome Reduction and Phylogenetic Reclassification of the Oceanic SAR11 982 Clade.” Molecular Biology and Evolution 29 (2): 599–615. 983 https://doi.org/10.1093/molbev/msr203. 984 Viklund, Johan, Joran Martijn, Thijs J. G. Ettema, and Siv G. E. Andersson. 2013. 985 “Comparative and Phylogenomic Evidence That the Alphaproteobacterium 986 HIMB59 Is Not a Member of the Oceanic SAR11 Clade.” PLoS ONE 8 (11): 987 e78858. https://doi.org/10.1371/journal.pone.0078858. 988 Wang, Huai-Chun, Bui Quang Minh, Edward Susko, and Andrew J. Roger. 2018. 989 “Modeling Site Heterogeneity with Posterior Mean Site Frequency Profiles

41

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

990 Accelerates Accurate Phylogenomic Estimation.” Systematic Biology 67 (2): 216– 991 35. https://doi.org/10.1093/sysbio/syx068. 992 Wang, Zhang, and Martin Wu. 2013. “A -Level Bacterial Phylogenetic Marker 993 Database.” Molecular Biology and Evolution 30 (6): 1258–62. 994 https://doi.org/10.1093/molbev/mst059. 995 ———. 2014. “Phylogenomic Reconstruction Indicates Mitochondrial Ancestor Was an 996 Energy Parasite.” PLoS ONE 9 (10): e110685. 997 https://doi.org/10.1371/journal.pone.0110685. 998 ———. 2015. “An Integrated Phylogenomic Approach toward Pinpointing the Origin of 999 Mitochondria.” Scientific Reports 5 (January): 7949. 1000 https://doi.org/10.1038/srep07949. 1001 Whitman, William B., ed. 2015. Bergey’s Manual of Systematics of Archaea and 1002 Bacteria. Wiley. 1003 https://onlinelibrary.wiley.com/doi/book/10.1002/9781118960608. 1004 Wiese, Jutta, Vera Thiel, Andrea Gärtner, Rolf Schmaljohann, and Johannes F. Imhoff. 1005 2009. “Kiloniella Laminariae Gen. Nov., Sp. Nov., an Alphaproteobacterium from 1006 the Marine Macroalga Laminaria Saccharina.” International Journal of Systematic 1007 and Evolutionary Microbiology 59 (2): 350–56. 1008 https://doi.org/10.1099/ijs.0.001651-0. 1009 Williams, Kelly P., Bruno W. Sobral, and Allan W. Dickerman. 2007. “A Robust Species 1010 Tree for the Alphaproteobacteria.” Journal of Bacteriology 189 (13): 4578–86. 1011 https://doi.org/10.1128/JB.00269-07. 1012 Williams, Timothy J., Christopher T. Lefèvre, Weidong Zhao, Terry J. Beveridge, and 1013 Dennis A. Bazylinski. 2012. “Magnetospira Thiophila Gen. Nov., Sp. Nov., a 1014 Marine Magnetotactic Bacterium That Represents a Novel Lineage within the 1015 Rhodospirillaceae (Alphaproteobacteria).” International Journal of Systematic 1016 and Evolutionary Microbiology 62 (10): 2443–50. 1017 https://doi.org/10.1099/ijs.0.037697-0. 1018

42

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1019 SUPPLEMENTARY FIGURE LEGENDS

1020 Figure S1. Constraint tree, used for IQ-TREE analyses, labeled with taxon names and 1021 also degree of missing data per taxon.

1022 Figure S2. A diagram of the strategies employed in this study.

1023 Figure S3. GARP:FIMNKY ratios across the proteomes of the 120 alphaproteobacteria 1024 used in this study.

1025 Figure S4. A 16S rRNA gene maximum-likelihood tree of the Rickettsiales and 1026 Holosporales that phylogenetically places the three endosymbionts whose genomes 1027 were sequenced in this study: (1) ‘Candidatus Finniella inopinata’ endosymbiont of 1028 Viridiraptor invadens strain Virl02, (2) an alphaproteobacterium associated with 1029 Peranema trichophorum strain CCAP 1260/1B, and (3) an alphaproteobacterium 1030 associated with Stachyamoeba lipophora strain ATCC 50324. Branch support values 1031 are SH-aLRT and UFBoot.

1032 Figure S5. A labeled version of Figure 2. Branch support values are 100% SH-aLRT and 1033 100% UFBoot unless annotated.

1034 Figure S6. Bayesian consensus trees inferred with PhyloBayes MPI v1.7 and the CAT- 1035 Poisson+Γ4 model. Branch support values are 1.0 posterior probabilities unless 1036 annotated. A. Bayesian consensus phylogram inferred from the full dataset which is 1037 highly compositionally heterogeneous. B. Bayesian consensus phylogram inferred from 1038 a dataset whose compositional heterogeneity has been decreased by removing 50% of 1039 the most biased sites according to ɀ. See Figs. 2A and 2B for the most likely trees 1040 inferred in IQ-TREE v1.5.5 and the LG+PMSF(C60)+F+R6 model.

1041 Figure S7. Ultrafast bootstrap (UFBoot) variation as the fastest sites are progressively 1042 removed in steps of 10%.

1043 Figure S8. Ultrafast bootstrap (UFBoot) variation as compositionally biased sites, 1044 according to ɀ, are progressively removed in steps of 10%.

1045 Figure S9. Bayesian consensus trees inferred with PhyloBayes MPI v1.7 and the CAT- 1046 Poisson+Γ4 model. Branch support values are 1.0 posterior probabilities unless 1047 annotated. A. Bayesian consensus phylogram inferred to place the Holosporales in the 1048 absence of the Rickettsiales and the Pelagibacterales and when compositional 1049 heterogeneity has been decreased by removing 50% of the most biased sites according 1050 to ɀ. B. Bayesian consensus phylogram inferred to place the Holosporales in the 1051 absence of the Rickettsiales and the Pelagibacterales and when the data have been 1052 recoded into a four-character state alphabet (the dataset-specific recoding scheme S4: 1053 ARNDQEILKSTV GHY CMFP W) to reduce compositional heterogeneity. See Figs. 2A 1054 and 2B for the most likely trees inferred in IQ-TREE v1.5.5 and the 1055 LG+PMSF(C60)+F+R6 and GTR+ES60S4+F+R6 models, respectively.

43

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

1056 Figure S10. Maximum-likelihood phylograms obtained from the full dataset and after 1057 removing long-branching taxa. A. A phylogram for which all long-branching taxa are 1058 included. B. A phylogram to assess the placement of the Holosporales. C. A phylogram 1059 to assess the placement of the Rickettsiales. D. A phylogram to assess the placement 1060 of the Pelagibacterales. E. A phylogram to assess the placement of 1061 alphaproteobacterium sp. HIMB59.

1062 Figure S11. Maximum-likelihood phylograms obtained after removing 50% of the most 1063 compositionally biased sites and removing long-branching taxa. A. A phylogram for 1064 which all long-branching taxa are included. B. A phylogram to assess the placement of 1065 the Holosporales. C. A phylogram to assess the placement of the Rickettsiales. D. A 1066 phylogram to assess the placement of the Pelagibacterales. E. A phylogram to assess 1067 the placement of alphaproteobacterium sp. HIMB59.

1068 Figure S12. Maximum-likelihood phylograms obtained after recoding the data into S4 1069 and removing long-branching taxa. A. A phylogram for which all long-branching taxa are 1070 included (recoding scheme: RNCM EHIPTWV ADQLKS GFY). B. A phylogram to 1071 assess the placement of the Holosporales (recoding scheme: ARNDQEILKSTV GHY 1072 CMFP W). C. A phylogram to assess the placement of the Rickettsiales (recoding 1073 scheme: PY RNMF GHLKTW ADCQEISV). D. A phylogram to assess the placement of 1074 the Pelagibacterales. E. A phylogram to assess the placement of alphaproteobacterium 1075 sp. HIMB59 (recoding scheme: RLKMT ANDQEIPSV CW GHFY).

1076 Figure S13. Maximum-likelihood phylograms obtained from the 40 most compositionally 1077 homogeneous genes and removing long-branching taxa. A. A phylogram for which all 1078 long-branching taxa are included. B. A phylogram to assess the placement of the 1079 Holosporales. C. A phylogram to assess the placement of the Rickettsiales. D. A 1080 phylogram to assess the placement of the Pelagibacterales. E. A phylogram to assess 1081 the placement of alphaproteobacterium sp. HIMB59.

1082 Figure S14. Bayesian consensus phylogram inferred to place the Holosporales in the 1083 absence of the Rickettsiales and the Pelagibacterales and when the data have been 1084 recoded into a six-character state alphabet (the dataset-specific recoding scheme S6: 1085 AQEHISV RKMT PY DCLF NG W) to reduce compositional heterogeneity. Branch 1086 support values are 1.0 posterior probabilities unless annotated.

1087 Table S1. A list of the least compositionally heterogeneous genes out of the 200 single- 1088 copy and vertically-inherited genes used in this study.

44

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

A. 2.36 B. Rickettsiales Pelagibacterales 80 Holosporales other Alphaproteobacteria

0.45 70

60

50 G+C%

40

Rickettsiales Pelagibacterales 30 Holosporales other Alphaproteobacteria

20 0.006 0.2 0.7 1.2 1.7 2.2 GARP:FIMNKY A. 90/92 B.

Rickettsiales Rickettsiales 78.3/89 3.8/51 bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was Geminicoccaceae not certified by peer review) is the author/funder, who has granted bioRxiv a license99.4/99 to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Pelagibacterales 95.9/93

alphaproteobacterium HIMB59 99.9/100 Holosporales Rhizobiales 99.2/98 84.2/88

Caulobacterales 93.5/94

Rhizobiales 98.6/99 Rhodobacterales 100/94

98.6/97 Pelagibacterales Caulobacterales 100/94 93.2/90 Sphingomonadales 100/94 Rhodobacterales Sneathiellales 80.9/84 Sphingomonadales Holosporales 98.6/99 Sneathiellales 100/92 Geminicoccaceae alphaproteobacterium HIMB59 98.7/98 100/93 99.9/100 99.9/100 Rhodospirillales 87.8/94 99.8/99 Rhodospirillales 60.6/87 99.9/100 99.8/100 99.9/100 99.4/98 91.6/93 99.4/100

0.4 0.2 A. B. Geminicoccaceae

bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International98.8/99 license. 44.3/89

100/100

52.3/57 99.9/100 Geminicoccaceae

Holosporaceae 99.6/99 96/97

99.7/99 99.3/100 99.9/99 37.2/93 77.3/83 Azospirillaceae 98.8/99 100/99 Rhodovibriaceae 98/99

70.2/79 Rhodospirillaceae 13.8/78

100/99 99.8/99 99.4/99 67/87

99.9/100 Acetobacteraceae 96/100

92.5/87 99.7/97 14.9/83 90.9/99 12.8/75

100/100

0.2 0.2 bioRxiv preprint doi: https://doi.org/10.1101/462648; this version posted November 5, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Class 1. Alphaproteobacteria Garrity et al. 2005 Subclass 1. Rickettsidae Ferla et al. 2013 emend. Muñoz-Gómez et al. 2018 Order 1. Rickettsiales Gieszczykiewicz 1939 emend. Dumler et al. 2001 Family 1. Anaplasmataceae Philip 1957 Family 2. 'Candidatus Midichloriaceae' Montagna et al. 2013 Family 3. Rickettsiaceae Pinkerton 1936 Subclass 2. Caulobacteridae Ferla et al. 2013 emend. Muñoz-Gómez et al. 2018 Order 1. Rhodospirillales Pfennig and Truper 1971 emend. Muñoz-Gómez et al. 2018 Family 1. Acetobacteriacae ex Henrici 1939 Gillis and De Ley 1980 Family 2. Rhodospirillaceae Pfennig and Truper 1971 emend. Muñoz-Gómez et al. 2018 Family 3. Azospirillaceae fam. nov. Muñoz-Gómez et al. 2018 Family 4. Holosporaceae Szokoli et al. 2016 Family 5. Rhodovibriaceae fam. nov. Muñoz-Gómez et al. 2018 Family 6. Geminicoccaceae Proença et al. 2017 Order 2. Sneathiellales Kurahashi et al. 2008 Order 3. Sphingomonadales Yabuuchi et al. 1990 Order 4. Pelagibacterales Grote et al. 2012 Order 5. Rhodobacterales Garrity et al. 2006 Order 6. Caulobacterales Henrici and Johnson 1935 Order 7. Rhizobiales Kuykendall 2006 Class 2. Magnetococcia Parks et al. 2018 Order 1. Magnetococcales Bazylinki et al. 2013