tRNA signatures reveal polyphyletic origins of streamlined SAR11 genomes among the Alphaproteobacteria

Katherine C.H. Amrine1, Wesley D. Swingley1,2, David H. Ardell1,* October 3, 2018

1. Program in Quantitative and Systems Biology, 5200 North Lake Road, University of California, Merced, Merced, CA 95343

2. Current Address: Department of Biological Sciences, Northern Illinois University, DeKalb, IL 60115

* Corresponding Author: David H. Ardell, (209) 228-2953, [email protected].

Abstract

Phylogenomic analyses are subject to bias from convergence in macromolecular compositions and noise from horizontal gene transfer (HGT). Accordingly, compositional convergence leads to contradictory results on the phylogeny of taxa such as the ecologically dominant SAR11 group of Alphaproteobacteria, which have extremely streamlined, A+T-biased genomes. While careful modeling can reduce bias artifacts caused by convergence, the most consistent and robust phylogenetic signal in genomes may lie distributed among encoded functional features that govern macromolecular interactions. Here we develop a novel phyloclassification method based on signatures derived from bioinformatically defined tRNA Class-Informative Features (CIFs). tRNA CIFs are enriched for features that underlie tRNA-protein interactions. Using a simple tRNA-CIF-based phyloclassifier, we obtained results consistent with bias-corrected whole proteome phylogenomic studies, rejecting monophyly of SAR11 and affiliating most strains with Rhizobiales with strong statistical support. Yet, as expected from their elevated genomic A+T contents, SAR11 and Rickettsiales tRNA genes are also similarly and distinctly A+T-rich within Alphaproteobacteria. Using conventional supermatrix methods on total tRNA sequence data, we could recover the artifactual result of a monophyletic SAR11 grouping with Rickettsiales. Thus tRNA CIF-based phyloclassification is more robust to base content convergence than supermatrix phylogenomics with whole tRNA sequences. Also, given the notoriously promiscuous HGT rates of aminoacyl-tRNA synthetase genes, tRNA CIF-based phyloclassification may be at least partly robust to HGT of network components. We describe how unique features of the tRNA-protein interaction network facilitate mining of traits governing macromolecular interactions from genomic data, and discuss why interaction-governing traits may be especially useful to solve difficult problems in microbial classification and phylogeny.

arXiv:1305.7256v1 [q-bio.PE] 30 May 2013

Author Summary

In this study, we describe a new way to classify living things using information from whole genomes. First, for a group of related organisms, we bioinformatically predict features by which specific classes of tRNAs are recognized by certain proteins or complexes. Second, we train an artificial neural network to recognize which group a new, unknown genome belongs to. We apply our method to SAR11, one of the most abundant groups of bacteria in the world’s oceans. We find that different strains of SAR11 are more distantly related, both to each other and to mitochondria, than previously thought. However, with more traditional treatments of whole tRNA sequence data, we obtain different results, best explained as artifacts of base content convergence. Our tRNA features are therefore more robust to genomic base content convergence than the tRNAs in which they are embedded; this is additional evidence of their functional importance. The tRNA features we study form a clade-specific and slowly diverging “feature network” that underlies a universally conserved macromolecular interaction network. We discuss on theoretical grounds why traits governing macromolecular interactions may be especially well-suited to resolve deep relationships in the Tree of Life.

Introduction

What parts of genomes are most robust to compositional convergence? What information is most faithfully inherited vertically? The key assumptions of compositional stationarity and consistency in gene histories underpin most current approaches in phylogenomics and are frequently violated (reviewed in e.g. [1]). HGT is so widespread that the very existence of a “Tree of Life” has been questioned [2, 3]. Better understanding of ancient phylogenetic relationships requires discovery of new universal, slowly-evolving phylogenetic markers that are robust to compositional convergence and HGT.

The controversial phylogeny of Ca. Pelagibacter ubique (SAR11) is a case in point. SAR11 make up between a fifth and a third of the bacterial biomass in marine and freshwater ecosystems [4]. Adaptations to extreme environmental nutrient limitation may explain why SAR11 have very small cell and genome sizes and small fractions of intergenic DNA [5]. While some recent phylogenomic studies define a clade among SAR11, the largely endoparasitic Rickettsiales, and the alphaproteobacterial ancestor of mitochondria [6, 7, 8], others argue that this placement of SAR11 is an artifact of independent convergence towards increased genomic A+T content, and that SAR11 belongs closer to other free-living Alphaproteobacteria such as the Rhizobiales and Rhodobacteraceae [9, 10, 11]. Monophyly of SAR11 was also recently rejected [10].

Nonstationary macromolecular compositions are a known source of bias in phylogenomics [12, 13]. Widespread variation in macromolecular compositions may be associated with loss of DNA repair pathways in reduced genomes [14, 11], unveiling an inherent A+T-bias of mutation in bacteria [15] and elevating genomic A+T content [16, 17]. A process such as this has likely altered protein and RNA compositions genome-wide in SAR11, and if such effects are accounted for, the placement of SAR11 with Rickettsiales drops away as an apparent artifact [10, 11]. Consistent with this interpretation, SAR11 strain HTCC1062 shares a surprising and unique codivergence of tRNAHis and histidyl-tRNA synthetase (HisRS) with a clade of free-living Alphaproteobacteria [18, 19] that likely arose only once in bacteria [20]. This synapomorphy contradicts the placement of SAR11 with Rickettsiales.

[Figure 1 about here.]

The motivation for this work was to determine whether the entire system of tRNA-protein interactions could be exploited to address the phylogeny of bacteria, particularly SAR11. The highly conserved tRNA-protein interaction network (Fig. 1) has special advantages for comparative systems biological study from genomic data. First, the components and interactions of this network are highly conserved. Second, bioinformatic mining of interaction-determining traits from genomic tRNA data is favorable because tRNA structures are highly conserved not just across extant taxa but also across different functional classes of tRNAs (“conformity” [21]). Yet each functional class of tRNA must maintain a hierarchy of increasingly specific interactions with various proteins and other factors (“identity” [22]). The conflicting requirements of conformity and identity allow structural comparison and contrast to predict class-informative traits of tRNAs from sequence data by relatively simple bioinformatic methods [19]. The features that govern tRNA-protein interactions diverge across the three domains of life (reviewed in [23]) and also within the domain of bacteria [20].

In prior work, we developed “function logos” to predict, at the level of individual nucleotides before post-transcriptional modification, what genetically templated information in tRNA gene sequences is associated with specific functional identity classes [24]. We now call these function-logo-based predictions Class-Informative Features (CIFs). A tRNA CIF answers a question like: “if a tRNA gene from a group of related genomes carries a specific nucleotide at a specific structural position, how much information do we gain about that tRNA’s specific function?” Such information estimates are corrected for biased sampling of functional classes and sample size effects [24], and their statistical significance may be calculated [19].
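The question that a tRNA CIF answers can be illustrated with a deliberately simplified Python sketch. It assumes that the information gained from a feature is the class entropy minus the class entropy among tDNAs carrying that feature, apportioned among classes by their conditional frequencies. It omits the corrections for biased class sampling and finite sample size applied in [24], so it is not the published estimator; the function name `cif_heights` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cif_heights(tdnas):
    """Simplified, uncorrected feature heights in bits.

    tdnas: list of (functional_class, {position: nucleotide}) pairs.
    Returns {(nucleotide, position, functional_class): height}.
    """
    class_counts = Counter(cls for cls, _ in tdnas)
    total = sum(class_counts.values())
    prior_entropy = -sum((k / total) * math.log2(k / total)
                         for k in class_counts.values())

    # Count, for every feature (nucleotide, position), the classes carrying it.
    by_feature = defaultdict(Counter)
    for cls, sites in tdnas:
        for position, nucleotide in sites.items():
            by_feature[(nucleotide, position)][cls] += 1

    heights = {}
    for (nucleotide, position), counts in by_feature.items():
        n = sum(counts.values())
        conditional_entropy = -sum((k / n) * math.log2(k / n)
                                   for k in counts.values())
        # Assumption: negative gains are clipped to zero in this sketch.
        gain = max(prior_entropy - conditional_entropy, 0.0)
        for cls, k in counts.items():
            heights[(nucleotide, position, cls)] = gain * k / n
    return heights


# Hypothetical toy data: two classes, two Sprinzl positions.
toy_tdnas = [
    ("H", {"73": "A", "1": "G"}),
    ("H", {"73": "A", "1": "G"}),
    ("D", {"73": "G", "1": "G"}),
    ("D", {"73": "G", "1": "G"}),
]
toy_heights = cif_heights(toy_tdnas)
```

In this toy case, A73 is carried only by class H and so is fully informative, whereas G1 is shared by both classes and carries no information about class.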
Although an individual bacterial genome does not present enough data to generate a function logo, related genome data may be lumped, weakly assuming homogeneity of tRNA identity rules (although heterogeneity generally reduces signal). Function logos recover known tRNA identity elements (i.e. features that govern the specificity of interactions between tRNAs and proteins) [23], and more generally, predict features governing interactions with class-specific network partners such as amidotransferases [25]. A recent molecular dynamics study on a tRNAGlu-GluRS (glutamyl-tRNA synthetase) complex identified tRNA functional sites involved in intra- and inter-molecular allosteric signalling within GluRS that couples substrate recognition to reaction catalysis [26]. The predicted sites are correlated with those from proteobacterial function logos [27].

In this work, we show first that tRNA CIFs have diverged among Alphaproteobacteria in a phylogenetically informative manner. Second, as phylogenetic markers, tRNA CIFs are more robust to compositional convergence than the tRNA bodies in which they are embedded. Using our tRNA-CIF-based phyloclassification approach, we confirm that SAR11 are polyphyletic, with the majority of strains clustering with the free-living Alphaproteobacteria. Our results have implications for how best to mine genomic data for phylogenetic signals.

Results

We reannotated Alphaproteobacterial tDNA data from tRNAdb-CE 2011 [28] and other prepublication genomic data, and split them into two groups according to whether or not their source genome contained the uniquely derived synapomorphic traits previously described [20]: a gene for tRNAHis containing A73 (using “Sprinzl coordinates”, [29]) and lacking templated −1G. We could thereby partition the data into an RRCH clade (Rhodobacteraceae, Rhizobiales, Caulobacterales, Hyphomonadaceae), which present the uniquely derived tRNAHis, and the RSR grade (Rhodospirillales, Sphingomonadales, and Rickettsiales, excluding SAR11), which present “normal” bacterial tRNAHis with C73 and genomically templated −1G. In all, data from 214 Alphaproteobacterial genomes represented 11644 predicted tRNA sequences (8773 sequences unique within genomes and 3064 total unique sequences). Our final dataset contained 147 genomes (8597 tRNAs) for the RRCH clade, 59 genomes (2792 tRNAs) for the RSR grade, and 8 genomes (255 tRNAs) of SAR11 strains.
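The trait test behind this split can be sketched in Python as follows. This is only an illustration of the A73 and −1G criterion, not the partition actually used, which is documented in the supplementary data. The function names, the dictionary representation of a tRNAHis gene keyed by Sprinzl coordinate, and the use of "-" for a gap are all assumptions made for the example.

```python
def has_derived_trnahis(trnahis_genes):
    """True if any tRNAHis gene carries A73 and lacks a templated -1G.

    trnahis_genes: list of {Sprinzl position: nucleotide} dicts,
    with "-" standing for a gap (hypothetical representation).
    """
    return any(
        gene.get("73") == "A" and gene.get("-1", "-") != "G"
        for gene in trnahis_genes
    )


def partition_genomes(genomes):
    """Split genomes by presence of the derived tRNAHis traits."""
    groups = {"RRCH-like": [], "RSR-like": []}
    for name, genes in genomes.items():
        key = "RRCH-like" if has_derived_trnahis(genes) else "RSR-like"
        groups[key].append(name)
    return groups


# Hypothetical toy genomes, one tRNAHis gene each.
toy_trait_genomes = {
    "genome_1": [{"-1": "-", "73": "A"}],  # derived state
    "genome_2": [{"-1": "G", "73": "C"}],  # ancestral bacterial state
}
toy_groups = partition_genomes(toy_trait_genomes)
```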

[Figure 2 about here.]

The unique traits of the RRCH tRNAHis are perfectly associated with substitutions of key residues in the motif IIb tRNA-binding loops of HisRS involved in tRNA recognition [20]. Seven of eight SAR11 strains exhibited the unique tRNAHis/HisRS codivergence traits in common with RRCH genomes. In contrast, strain HIMB59 presented ancestral bacterial characters in both tRNAHis and HisRS (Fig. S1). These results immediately suggest that HIMB59 is not monophyletic with the other SAR11 strains, consistent with [10].

[Figure 3 about here.]

We computed function logos [24] of the RRCH clade and RSR grade to form the basis of a tRNA-CIF-based binary phyloclassifier as shown schematically in Fig. 2. To reduce bias, we used a Leave-One-Out Cross-Validation (LOOCV) approach. For comparison, we also performed LOOCV phyloclassification using sequence profiles of entire tRNAs, with typical results shown in Fig. 3B. Although the tRNA-CIF-based phyloclassifier (Fig. 3A) was biased positively by the much larger RRCH sample size, it achieved better phylogenetic separation of genomes than the total-tRNA-sequence-based phyloclassifier (Fig. 3B). The Sphingomonadales and Rhodospirillales separated in scores from the Rickettsiales in both classifiers. Most importantly, the tRNA-CIF-based phyloclassifier placed all eight SAR11 genomes closer to the RRCH clade and far away from the Rickettsiales, with HIMB59 overlapping the Rhodospirillales, while the total-tRNA-sequence-based phyloclassifier placed all eight SAR11 genomes closer to the Rickettsiales. Fig. S2 shows the effects of different treatments of missing data in the total-tRNA-sequence-based classifier. Method “zero,” shown in Fig. 3B, is most analogous to the method used to generate Fig. 3A. Method “skip” (Fig. S2B) shows that SAR11 tRNAs share sequence characters in common with the RSR grade that are not seen in the RRCH clade. Methods “small” and “pseudo” (Figs. S2C and S2D) show that SAR11 have sequence traits not observed in either RSR or RRCH.

[Figure 4 about here.]

Many other tRNA classes besides tRNAHis contribute to the differentiated classification of RRCH and RSR genomes by the CIF-based binary classifier (Fig. 4). Other tRNA classes are also differentiated between these two groups, including tRNACys, tRNAAsp, tRNAGlu, tRNAIle(LAU) (symbolized “J”), tRNALys, and tRNATyr. These results extend the observations of [18], who discovered unusual base-pair features of tRNAGlu in the RRCH clade. In classes for which the RRCH and RSR groups are well-differentiated, HIMB59 uniquely groups with RSR while other strains group with RRCH, while for other tRNA classes, all putative SAR11 strains lie outside the RRCH and RSR distributions. This implies that more diverse Alphaproteobacterial genomic data are necessary to completely resolve the phylogenetic affiliation of SAR11 strains, but strongly contradicts a monophyletic affiliation of SAR11 with Rickettsiales.

[Figure 5 about here.]

The increases in genomic A+T contents in SAR11 and Rickettsiales have also driven elevated A+T contents of their tRNA genes (Fig. 5A). Rickettsiales and SAR11 tRNA genes are both notably elevated in both A and T, and share an overall similarity in composition distinct from other Alphaproteobacteria. Hierarchical clustering of Alphaproteobacterial taxa based on tRNA gene base contents closely groups SAR11 and Rickettsiales together (Fig. 5B).

[Figure 6 about here.]

Nonstationary tRNA base content (convergence to greater A+T content) causes all eight SAR11 strains in our dataset to group with Rickettsiales using phylogenomic approaches based on total tRNA sequence evidence. In a “supermatrix” phylogenomic approach, concatenating genes for 28 isoacceptor classes from 169 species (2156 total sites) and using the GTR+Gamma model in RAxML, we estimated a Maximum Likelihood tree in which all eight putative SAR11 strains branch together with Rickettsiales (Fig. S3). For this analysis, in 31% of instances when isoacceptor genes were picked from a genome, we randomly picked one gene from a set of isoacceptor paralogs. However, our results did not depend on which paralog we picked. Using a distance-based approach with FastTree, we computed a consensus cladogram over 100 replicate alignments, each representing different randomized picks over paralogs. As the consensus cladogram shows (Fig. S4), each replicate distance tree placed all eight putative SAR11 strains together with Rickettsiales. The recently introduced tRNA-specific FastUniFrac-based method for microbial classification [31] also places all SAR11 strains together with Rickettsiales (Fig. 6).

[Figure 7 about here.]

However, as shown in Fig. 7, a multiway classifier based on tRNA CIFs bins all SAR11 strains with the Rhizobiales except for HIMB59, which bins with the Rhodospirillales, consistent with the results of [10]. These results use a multilayer perceptron (MLP) classifier implemented in WEKA [32] and only seven taxon-specific CIF-based summary scores. The MLP is the simplest non-linear classifier able to handle the interdependent signals in the CIF-based scores for tree-like data [33]. In a Leave-One-Out cross-classification, all other genomes scored consistently with NCBI Taxonomy except three placed in Rhodobacteraceae based on 16S ribosomal RNA evidence: Stappia aggregata, Labrenzia alexandrii and the denitrifying Pseudovibrio sp. JE062. None of these genomes scored strongly against Rhodobacteraceae except Pseudovibrio, which scored four times greater against the Rhizobiales.

To assess the robustness of our results we performed two controls: we bootstrapped sites of tRNA data in each genome to be classified, and we filtered away small CIFs with Gorodkin heights < 0.5 bits from our models, retrained the classifier and bootstrapped sites again. Generally, bootstrap support values correspond to original classification probabilities. All SAR11 strains have support values > 80% as Rhizobiales, majority bootstrap values as Rhizobiales (HIMB114 at 70% with Rickettsiales at 15% and HTCC7211 at 54% with Rickettsiales at 13%), or a plurality bootstrap value as Rhizobiales (HIMB5 at 48% with Rickettsiales at 18%), except HIMB59, which had a bootstrap support value of 87% to be in the Rhodospirillales. Full bootstrap statistics with these models are provided in Table S1.
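The site-bootstrap control can be sketched generically in Python. The sketch assumes an alignment given as equal-length strings and a caller-supplied `classify` function returning a label; both are hypothetical stand-ins for the scoring and MLP steps, and `toy_classify` is not the classifier used in this study.

```python
import random
from collections import Counter


def bootstrap_sites(alignment, rng):
    """Resample alignment columns with replacement, keeping the length."""
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in columns) for seq in alignment]


def bootstrap_support(alignment, classify, replicates=100, seed=0):
    """Fraction of bootstrap replicates assigned to each label."""
    rng = random.Random(seed)
    votes = Counter(classify(bootstrap_sites(alignment, rng))
                    for _ in range(replicates))
    return {label: count / replicates for label, count in votes.items()}


def toy_classify(alignment):
    # Hypothetical stand-in classifier: labels by A+T fraction only.
    joined = "".join(alignment)
    at_fraction = sum(joined.count(base) for base in "AT") / len(joined)
    return "AT-rich" if at_fraction > 0.5 else "GC-rich"


toy_support = bootstrap_support(["AATTA", "ATTAA"], toy_classify)
```

Support values computed this way are frequencies of labels over replicates, which is how the percentages quoted above should be read.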

Discussion

Our results provide strong, albeit unconventional, evidence that most SAR11 strains are affiliated with Rhizobiales, while strain HIMB59 is affiliated with Rhodospirillales. These results are entirely consistent with comprehensive phylogenomic studies that control for nonstationary macromolecular compositions in Alphaproteobacteria [9, 10, 11] or a site-rate-filtered analysis [34]. Our CIF-based method works even though SAR11 and Rickettsiales tRNAs have converged in base content, so that total tRNA sequence-based phylogenomics gives opposite results. tRNA CIFs must be at least partly robust to compositional convergence of the tRNA bodies in which they are embedded.

It is well known that aminoacyl-tRNA synthetases (aaRS) are highly prone to HGT [35, 36, 37, 38, 39], including in Alphaproteobacteria [20, 40, 41]. We hypothesize that our tRNA-CIF-based phyloclassifiers are also robust to HGT of components of the tRNA-protein interaction network, consistent with [42], who argued that a horizontally transferred aaRS is more likely to functionally ameliorate to a tRNA-protein network into which it has been transferred rather than remodel that network to accommodate itself. HGT of aaRSs may also perturb a network so as to cause a distinct pattern of divergence ([20] and this work). Wang et al. [18] discuss the possibility that RRCH tRNAHis and HisRS were co-transferred into an ancestral SAR11 genome. However, this fails to explain the correlations of many other tRNA traits of SAR11 genomes with the RRCH clade reported here. Further study is needed to address the robustness of our method to component HGT.

A more distant relationship between most SAR11 strains and Rickettsiales actually strengthens the genome streamlining hypothesis [5].
If SAR11 were a true branch within Rickettsiales, it would become more difficult to claim that genome reduction in SAR11 occurred by a selection-driven evolutionary process distinct from the drift-dominated erosion of genomes in the Rickettsiales [43, 16, 44]. By the same token, polyphyly of nominal SAR11 strains implies that the extensive similarity in genome structure and other traits between HIMB59 and SAR11 reported by [45] may have originated independently. Perhaps convergence in some traits is consistent with streamlining, which could also explain trait-sharing between SAR11 and Prochlorococcus, marine cyanobacteria also argued to have undergone streamlining [46]. Clear signs of data-limitation in our study should be taken to mean that better taxonomic sampling will improve our results and could ultimately resolve more than two origins of SAR11-type genomes among Alphaproteobacteria.

We extracted accurate and robust phylogenetic signals from tRNA gene sequences by first integrating within genomes to identify features likely to govern functional interactions with other macromolecules. Unlike small molecule interactions, macromolecular interactions are mediated by genetically determined structural and dynamic complementarities. These are intrinsically relative; a large neutral network [47] of interaction-determining features should be compatible with the same interaction network. Coevolutionary divergence (turnover) of features that mediate macromolecular interactions, while conserving network architecture, has been described in the transcriptional networks of yeast [48, 49] and worms [50] and in post-translational modifications underlying protein-protein interactions [51]. This work demonstrates that divergence of interaction-governing features is phylogenetically informative.

It remains open how such features diverge, with possibilities including compensatory nearly neutral mutations [52], fluctuating selection [53], adaptive reversals [54], and functionalization of pre-existent variation [55]. Major changes to interaction interfaces may be sufficient to induce genetic isolation between related lineages, as discussed for the 16S rRNA- and 23S rRNA-based standard model of the “Tree of Life,” in which many important and deep branches associate with large, rare macromolecular changes (“signatures”) in ribosome structure and function [56, 57, 58].

Interaction-mediating features of macromolecules may be systems biology’s answer to the phylogeny problem. Perhaps no other traits of genomes are vertically inherited more consistently than those that mediate functional interactions with other macromolecules in the same lineage. In fact, the structural and dynamic basis of interaction among macromolecular components, which is essential to their collaborative function in a system, may define a lineage better than any of those components can themselves, either alone or in ensemble.

Materials and Methods

Supplementary data packages are provided to reproduce all figures from raw data and enable third-party classification of alphaproteobacterial genomes.

Data

The 2011 release of the tRNAdb-CE database [28] was downloaded on August 24, 2011. From this master database, we selected Alphaproteobacteria data as specified by NCBI Taxonomy data (downloaded September 24, 2010, [59]). Also using NCBI Taxonomy, we further tripartitioned Alphaproteobacterial tRNAdb-CE data into those from the RRCH clade, the RSR grade (excluding SAR11), and three SAR11 genomes, as documented in Supplementary data for figure 2. Five additional SAR11 genomes (for strains HIMB59, HIMB5, HIMB114, IMCC9063 and HTCC9565) were obtained from J. Cameron Thrash courtesy of the lab of S. Giovannoni. We custom annotated tRNA genes in these genomes as the union of predictions from tRNAscan-SE version 1.3.1 (with -B option, [60]) and Aragorn version 1.2.34 [61]. We classified initiator tRNAs and tRNAIle(CAU) using TFAM version 1.4 [62] with a previously created model based on identifications in [63], provided as supplementary data. We aligned tRNAs with covea version 2.4.4 [64] and the prokaryotic tRNA covariance model [60], removed sites with more than 97% gaps with a bioperl-based utility [65], and edited the alignment manually in Seaview 4.1 [66] to remove CCA tails and sequences with unusual secondary structures. We mapped sites to Sprinzl coordinates manually [29] and verified by spot-checks against tRNAdb [67]. We added a gap in the −1 position for all sequences and −1G for tRNAHis in the RSR group [18].

tRNA CIF Estimation and Binary Classifiers

Our tRNA-CIF-based binary phyloclassifier with Leave-One-Out Cross-Validation (LOOCV) is computed directly from function logos, estimated from tDNA alignments as described in [24]. Here, we define a feature $f \in F$ as a nucleotide $n \in N$ at a position $l \in L$ in a structurally aligned tDNA, where $N = \{A, C, G, T\}$ and $L$ is the set of all Sprinzl coordinates [29]. The set $F$ of all possible features is the Cartesian product $F = N \times L$. A functional class or class of a tDNA is denoted $c \in C$, where $C = \{A, C, D, E, F, G, H, I, J, K, L, M, N, P, Q, R, S, T, V, W, X, Y\}$ is the universe of functions we here consider, symbolized by IUPAC one-letter amino acid codes (for aminoacylation classes), X for initiator tRNAs, and J for tDNAIle(LAU). A taxon set of genomes or just taxon set $S \in \mathcal{P}(G)$ is a set of genomes, where $G$ is the set of all genomes, and $\mathcal{P}(G)$ is the power set of $G$. In this work a genome $G$ is represented by the multiset of tDNA sequences it contains, denoted $T_G$. The functional information of features is computed with a map $h : (F \times C \times \mathcal{P}(G)) \to \mathbb{R}_{\geq 0}$ from the Cartesian product of features, classes and taxon sets to non-negative real numbers. For a feature $f \in F$, class $c \in C$ and taxon set $S \in \mathcal{P}(G)$, $h(f, c, S)$ is the fraction of functional information or “Gorodkin height” [68], measured in bits, associated with that feature, class and taxon set. In this work, for a given taxon set $S$, a function logo $H(S)$ is the tuple:

$$H(S) = \{(\alpha, \beta) \mid \beta = h(\alpha, S),\ \forall \alpha \in (F \times C)\}. \tag{1}$$

Furthermore the set $I(S) \subset (F \times C)$ of tRNA Class-Informative Features for taxon set $S$ is defined:

$$I(S) = \{\alpha \in (F \times C) \mid h(\alpha, S) > 0\}. \tag{2}$$

Briefly, a tRNA Class-Informative Feature is a tRNA structural feature that is informative about the functional classes it associates with, given the context of tRNA structural features that actually co-occur among a taxon set of related cells, and corrected for biased sampling of classes and finite sampling of sequences [24].

Let $A$ denote a set of Alphaproteobacterial genomes partitioned into three disjoint subsets $X$, $Y$ and $Z$ with $X \cup Y \cup Z = A$, representing genomes from the RRCH clade, the RSR grade, and the eight nominal Ca. Pelagibacter strains respectively. To execute Leave-One-Out Cross-Validation of a tRNA CIF-based binary phyloclassifier for a genome $G \in A$, we compute a score $S_C(G, S_1, S_2)$, averaging contributions from the multiset $T_G$ of tDNAs in $G$ scored against two function logos $H(S_1)$ and $H(S_2)$ computed respectively from two disjoint taxon sets $S_1 \subset A$ and $S_2 \subset A$, with $G \notin S_1 \cup S_2$. In this study, those sets are $X \setminus \{G\}$ and $Y \setminus \{G\}$, denoted $X_G$ and $Y_G$ respectively. Each tDNA $t \in T_G$ presents a set of features $F_t \subset F$ and has a functional class $c_t \in C$ associated with it. The score $S_C(G, X_G, Y_G)$ is then defined:

$$S_C(G, X_G, Y_G) \equiv \frac{1}{|T_G|} \sum_{t \in T_G} \sum_{f \in F_t} \big( h(f, c_t, X_G) - h(f, c_t, Y_G) \big). \tag{3}$$

As controls, we implemented four total-tDNA-sequence-based binary phyloclassifiers to score a genome $G$. All are slight variations in which a tDNA $t \in T_G$ of class $c_t$ contributes a score that is a difference in log relative frequencies of the features it shares in class-specific profile models generated from $X_G$ and $Y_G$. The default “zero” scoring scheme $S_T^Z(G, X_G, Y_G)$ shown in Fig. 3B is defined as:

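$$S_T^Z(G, X_G, Y_G) \equiv \frac{1}{|T_G|} \sum_{t \in T_G} \sum_{f \in F_t} \log_2 \frac{p^*(f \mid c_t, X_G)}{p^*(f \mid c_t, Y_G)}, \tag{4}$$

where

$$p^*(f \mid c, S) \equiv \begin{cases} \#\{f, c, S\} / \#\{c, S\} & \#\{f, c, S\} > 0 \\ 1 & \#\{f, c, S\} = 0 \end{cases}, \tag{5}$$

$\#\{f, c, S\}$ is the observed frequency of feature $f$ in tDNAs of class $c$ in set $S$, and $\#\{c, S\}$ is the frequency of tDNAs of class $c$ in set $S$.

Method “skip” corresponds to scoring scheme $S_T^K(G, X_G, Y_G)$ defined as:

$$S_T^K(G, X_G, Y_G) \equiv \frac{1}{|T_G|} \sum_{t \in T_G} \sum_{f \in F_t} s^k(f, c_t, X_G, Y_G), \tag{6}$$

where

$$s^k(f, c, S, T) \equiv \begin{cases} \log_2 \dfrac{p(f \mid c, S)}{p(f \mid c, T)} & \#\{f, c, S\} > 0 \wedge \#\{f, c, T\} > 0 \\ 0 & \#\{f, c, S\} = 0 \vee \#\{f, c, T\} = 0 \end{cases}, \tag{7}$$

and $p(f \mid c, R) \equiv \#\{f, c, R\} / \#\{c, R\}$ for $R \in \{S, T\}$ as before.

Methods “pseudo” and “small” correspond to scoring schemes $S_T^I(G, X_G, Y_G)$:

$$S_T^I(G, X_G, Y_G) \equiv \frac{1}{|T_G|} \sum_{t \in T_G} \sum_{f \in F_t} \log_2 \frac{p^I(f \mid c_t, X_G)}{p^I(f \mid c_t, Y_G)}, \tag{8}$$

where

$$p^I(f \mid c, S) \equiv \begin{cases} o/t & \forall n \in N : \#\{(n, l), c, S\} > 0 \\ \dfrac{o + I}{t + 4I} & \exists n \in N : \#\{(n, l), c, S\} = 0 \end{cases}, \tag{9}$$

where $f = (n, l)$, $o \equiv \#\{f, c, S\}$, $t \equiv \#\{c, S\}$, $I = 1$ for method “pseudo,” and, for method “small,” $I = 1/T_A$, where $T_A = \sum_{G \in A} |T_G|$ is the total number of tDNAs in $A$.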
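The three treatments of unobserved features in Eqs. 4 to 9 can be compared side by side in a short Python sketch. The profile representation (a dictionary of class counts and feature counts), the function names, and the toy counts are assumptions made for illustration; method “small” would correspond to calling `p_pseudo` with a pseudocount of one over the total number of tDNAs.

```python
import math

NUCLEOTIDES = "ACGT"


def p_zero(f, c, prof):
    """Eq. 5: relative frequency, or 1 when the feature is unobserved."""
    o = prof["features"].get((f, c), 0)
    return o / prof["classes"][c] if o > 0 else 1.0


def s_skip(f, c, prof_s, prof_t):
    """Eq. 7: log ratio, or 0 when the feature is unobserved in either set."""
    o_s = prof_s["features"].get((f, c), 0)
    o_t = prof_t["features"].get((f, c), 0)
    if o_s == 0 or o_t == 0:
        return 0.0
    p_s = o_s / prof_s["classes"][c]
    p_t = o_t / prof_t["classes"][c]
    return math.log2(p_s / p_t)


def p_pseudo(f, c, prof, pseudocount):
    """Eq. 9: add pseudocounts when any nucleotide is unobserved at the site."""
    _, position = f
    o = prof["features"].get((f, c), 0)
    t = prof["classes"].get(c, 0)
    all_seen = all(prof["features"].get(((n, position), c), 0) > 0
                   for n in NUCLEOTIDES)
    return o / t if all_seen else (o + pseudocount) / (t + 4 * pseudocount)


def genome_score(tdnas, contribution):
    """Average over tDNAs of summed per-feature contributions (Eqs. 4, 6, 8)."""
    return sum(contribution(f, c) for c, feats in tdnas for f in feats) / len(tdnas)


# Hypothetical toy profiles for one class "H" in taxon sets X and Y.
prof_x = {"classes": {"H": 4},
          "features": {(("A", "73"), "H"): 4, (("G", "1"), "H"): 2}}
prof_y = {"classes": {"H": 2},
          "features": {(("C", "73"), "H"): 2, (("G", "1"), "H"): 2}}
genome = [("H", [("A", "73"), ("G", "1")])]

score_zero = genome_score(
    genome, lambda f, c: math.log2(p_zero(f, c, prof_x) / p_zero(f, c, prof_y)))
score_skip = genome_score(genome, lambda f, c: s_skip(f, c, prof_x, prof_y))
score_pseudo = genome_score(
    genome,
    lambda f, c: math.log2(p_pseudo(f, c, prof_x, 1) / p_pseudo(f, c, prof_y, 1)))
```

In this toy case the feature A73 is absent from Y, so it contributes nothing under “zero” and “skip” but contributes positively under “pseudo,” which is the kind of difference the four control classifiers are meant to expose.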

Analysis of tRNA Base Composition

We computed the base composition of tRNAs aggregated by clades using bioperl-based [65] scripts, and transformed them by the centered log ratio transformation [30] with a custom script provided as supplementary data. We then computed Euclidean distances on the transformed composition data, and then performed hierarchical clustering by UPGMA on those distances as implemented in the program NEIGHBOR from Phylip 3.6b [69] and visualized in FigTree v.1.4.
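The transformation and distance steps can be sketched in Python as follows; the clustering itself is left to NEIGHBOR as described. The function names and the toy compositions are hypothetical and are not values from this study.

```python
import math


def clr(composition):
    """Centered log-ratio transform of a strictly positive composition."""
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(compositions):
    """Pairwise Euclidean distances between clr-transformed compositions."""
    names = sorted(compositions)
    transformed = {name: clr(compositions[name]) for name in names}
    matrix = [[euclidean(transformed[a], transformed[b]) for b in names]
              for a in names]
    return names, matrix


# Illustrative A, C, G, T proportions only; not values from this study.
toy_compositions = {
    "clade_1": [0.20, 0.30, 0.30, 0.20],
    "clade_2": [0.30, 0.20, 0.20, 0.30],
    "clade_3": [0.21, 0.29, 0.29, 0.21],
}
toy_names, toy_distances = distance_matrix(toy_compositions)
```

Because the transform works on log ratios, it is unchanged when all components are multiplied by a constant, so raw counts and proportions give the same distances.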

Supermatrix and FastUniFrac Analysis

For supermatrix approaches, we created concatenated tRNA alignments from 169 Alphaproteobacteria genomes (117 RRCH, 44 RSR, 8 PEL) that all shared the same 28 isoacceptors with 77 sites per gene (2156 total sites). In cases where a species contained more than one gene for an isoacceptor, one was chosen at random. Using a GTR+Gamma model, we ran RAxML by means of The iPlant Collaborative project RAxML server (http://www.iplantcollaborative.org, [70]) on January 23, 2013 with their installment of RAxML version 7.2.8-Alpha (executable raxmlHPC-SSE3, a sequential version of RAxML optimized for parallelization). We tested the robustness of our result to random picking of isoacceptors by creating 100 replicate concatenated alignments and running them through FastTree [71]. For the FastUniFrac analysis we used the FastUniFrac [72] web-server at http://bmf2.colorado.edu/fastunifrac/ to accommodate our large dataset. We removed two genomes from our dataset for containing fewer than 20 tRNAs, and following [31] removed anticodon sites. Following [31] deliberately, we computed an approximate ML tree based on Jukes-Cantor distances using FastTree [71]. We then queried the FastUniFrac webserver with this tree, defining environments as genomes. We then computed a UPGMA tree based on the server’s output FastUniFrac distance matrix in NEIGHBOR from Phylip 3.6b [69].
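The construction of replicate concatenated alignments with random paralog picks can be sketched in Python. The nested-dictionary input, the function names and the toy sequences are assumptions for illustration; with 28 isoacceptors of 77 aligned sites each, every row would have 28 × 77 = 2156 sites.

```python
import random


def build_supermatrix(genomes, isoacceptors, rng):
    """Concatenate one randomly picked aligned gene per isoacceptor.

    genomes: {genome name: {isoacceptor: [aligned gene sequences]}}
    """
    return {
        name: "".join(rng.choice(genes[iso]) for iso in isoacceptors)
        for name, genes in genomes.items()
    }


def replicate_supermatrices(genomes, isoacceptors, replicates=100, seed=0):
    """Independent supermatrices differing only in the paralog picks."""
    rng = random.Random(seed)
    return [build_supermatrix(genomes, isoacceptors, rng)
            for _ in range(replicates)]


# Hypothetical toy input: genome_1 has two paralogs for one isoacceptor.
toy_alignment_genomes = {
    "genome_1": {"Ala_TGC": ["GGGC"], "His_GTG": ["ACCA", "ACCG"]},
    "genome_2": {"Ala_TGC": ["GGGT"], "His_GTG": ["GCCA"]},
}
toy_replicates = replicate_supermatrices(
    toy_alignment_genomes, ["Ala_TGC", "His_GTG"], replicates=5)
```

Each replicate can then be passed to a tree-building program, and agreement among the resulting trees indicates whether the paralog choice matters.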

Multiway Classifier

All tDNA data from the RSR grade and RRCH clade were partitioned into one of seven monophyletic clades: orders Rickettsiales (N = 40 genomes), Rhodospirillales (N = 10), Sphingomonadales (N = 9), Rhizobiales (N = 91), and Caulobacterales (N = 6), or families Rhodobacteraceae (N = 43) or Hyphomonadaceae (N = 4) as specified by NCBI taxonomy (downloaded September 24, 2010, [59]) and documented in supplementary data for figure 7. We withheld data from the eight nominal SAR11 strains, as well as from three genera Stappia, Pseudovibrio, and Labrenzia, based on preliminary analysis of tDNA and CIF sequence variation. Following a strategy similar to that of the binary classifier, we computed, for each genome, seven tRNA-CIF-based scores, one for each of the seven Alphaproteobacterial clades as represented by their function logos, using the principle of Leave-One-Out Cross-Validation (LOOCV), that is, excluding data from the genome to be scored. Function logos were computed for each clade as described in [24]. For each taxon set $X_G$ (with genome $G$ left out if it occurs), genome $G$ obtains a score $S^M(G, X_G)$ defined by:

$$S^M(G, X_G) \equiv \frac{1}{|T_G|} \sum_{t \in T_G} \sum_{f \in F_t} h(f, c_t, X_G). \tag{10}$$

Each genome $G$ is then represented by a vector of seven scores, one for each taxon set modeled. These labeled vectors were then used to train a multilayer perceptron classifier in WEKA 3.7.7 (downloaded January 24, 2012, [32]) with default settings through the command-line interface, which include a ten-fold cross-validation procedure. We bootstrap resampled sites in genomic tRNA alignment data (100 replicates) and also bootstrap resampled a reduced (and retrained) model including only CIFs with a Gorodkin height [24] ≥ 0.5 bits.
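The score of Eq. 10 and the resulting score vector can be sketched in Python. The sketch assumes that each function logo is supplied as a dictionary of Gorodkin heights already estimated with the scored genome left out, and that Eq. 3 is read as a sum of per-feature differences, in which case the binary score is the difference of two Eq. 10 scores. All names and toy values are hypothetical.

```python
def multiway_score(tdnas, logo):
    """Eq. 10: mean over tDNAs of the summed heights of their features.

    tdnas: list of (functional_class, [features]) pairs for one genome.
    logo:  {(feature, functional_class): height in bits}; features that
           are not class-informative are absent and count as zero.
    """
    total = sum(logo.get((f, c), 0.0) for c, feats in tdnas for f in feats)
    return total / len(tdnas)


def binary_score(tdnas, logo_x, logo_y):
    """Eq. 3 read as a difference of two Eq. 10 scores (an assumption)."""
    return multiway_score(tdnas, logo_x) - multiway_score(tdnas, logo_y)


def score_vector(tdnas, logos_by_clade):
    """One score per clade, in sorted clade-name order."""
    return [multiway_score(tdnas, logos_by_clade[clade])
            for clade in sorted(logos_by_clade)]


# Hypothetical toy logos and genome.
toy_logos = {
    "clade_x": {(("A", "73"), "H"): 1.5},
    "clade_y": {(("C", "73"), "H"): 1.0},
}
toy_genome = [("H", [("A", "73")]), ("D", [("G", "1")])]
toy_vector = score_vector(toy_genome, toy_logos)
toy_binary = binary_score(toy_genome, toy_logos["clade_x"], toy_logos["clade_y"])
```

With seven clade logos, `score_vector` would return the seven-element vector that is passed, together with its taxonomic label, to the classifier.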

Acknowledgments

We thank J. Cameron Thrash and Stephen Giovannoni for sharing data in advance of publication, and Harish Bhat, Torgeir Hvidsten, Carolin Frank and Suzanne Sindi for helpful suggestions.

References

[1] Gribaldo S, Philippe H (2002) Ancient phylogenetic relationships. Theor Popul Biol 61: 391–408.

[2] Gogarten JP, Doolittle WF, Lawrence JG (2002) Prokaryotic evolution in light of gene transfer. Mol Biol Evol 19: 2226–2238.

[3] Bapteste E, O'Malley MA, Beiko RG, Ereshefsky M, Gogarten JP, et al. (2009) Prokaryotic evolution and the tree of life are two different things. Biol Direct 4: 34.

[4] Morris RM, Rappé MS, Connon SA, Vergin KL, Siebold WA, et al. (2002) SAR11 clade dominates ocean surface bacterioplankton communities. Nature 420: 806–810.

[5] Giovannoni SJ (2005) Genome Streamlining in a Cosmopolitan Oceanic Bacterium. Science 309: 1242–1245.

[6] Williams K, Sobral B, Dickerman A (2007) A Robust Species Tree for the Alphaproteobacteria. J Bacteriol 189: 4578.

[7] Georgiades K, Madoui MA, Le P, Robert C, Raoult D (2011) Phylogenomic Analysis of Odyssella thessalonicensis Fortifies the Common Origin of Rickettsiales, Pelagibacter ubique and Reclimonas americana Mitochondrion. PLoS ONE 6: e24857.

[8] Thrash JC, Boyd A, Huggett MJ, Grote J, Carini P, et al. (2011) Phylogenomic evidence for a common ancestor of mitochondria and the SAR11 clade. Sci Rep 1.

[9] Brindefalk B, Ettema TJG, Viklund J, Thollesson M, Andersson SGE (2011) A Phylometagenomic Exploration of Oceanic Alphaproteobacteria Reveals Mitochondrial Relatives Unrelated to the SAR11 Clade. PLoS ONE 6: e24457.

[10] Rodríguez-Ezpeleta N, Embley TM (2012) The SAR11 Group of Alpha-Proteobacteria Is Not Related to the Origin of Mitochondria. PLoS ONE 7: e30520.

[11] Viklund J, Ettema TJG, Andersson SGE (2012) Independent genome reduction and phylogenetic reclassification of the oceanic SAR11 clade. Mol Biol Evol 29: 599–615.

[12] Foster PG (2004) Modeling compositional heterogeneity. Systematic Biology 53: 485–495.

[13] Losos JB, Hillis DM, Greene HW (2012) Who speaks with a forked tongue? Science 338: 1428–1429.

[14] Dale C, Wang B, Moran N, Ochman H (2003) Loss of DNA recombinational repair enzymes in the initial stages of genome degeneration. Mol Biol Evol 20: 1188–1194.

[15] Hershberg R, Petrov DA (2010) Evidence that mutation is universally biased towards AT in bacteria. PLoS Genet 6.

[16] Moran NA (2002) Microbial minimalism: genome reduction in bacterial pathogens. Cell 108: 583–586.

[17] Lind PA, Andersson DI (2008) Whole-genome mutational biases in bacteria. Proceedings of the National Academy of Sciences 105: 17878–17883.

[18] Wang C, Sobral BW, Williams KP (2007) Loss of a Universal tRNA Feature. J Bacteriol 189: 1954–1962.

[19] Ardell DH (2010) Computational analysis of tRNA identity. FEBS Lett 584: 325–333.

[20] Ardell DH, Andersson SGE (2006) TFAM detects co-evolution of tRNA identity rules with lateral transfer of histidyl-tRNA synthetase. Nucleic Acids Res 34: 893–904.

[21] Wolfson A, LaRiviere F, Pleiss J, Dale T, Asahara H, et al. (2001) tRNA conformity. Cold Spring Harbor Symposia on Quantitative Biology 66: 185–194.

[22] Giegé R (2008) Toward a more complete view of tRNA biology. Nat Struct Mol Biol 15: 1007–1014.

[23] Giegé R, Sissler M, Florentz C (1998) Universal rules and idiosyncratic features in tRNA identity. Nucleic Acids Res 26: 5017.

[24] Freyhult E, Moulton V, Ardell DH (2006) Visualizing bacterial tRNA identity determinants and antideterminants using function logos and inverse function logos. Nucleic Acids Research 34: 905–916.

[25] Bailly M, Giannouli S, Blaise M, Stathopoulos C, Kern D, et al. (2006) A single tRNA base pair mediates bacterial tRNA-dependent biosynthesis of asparagine. Nucleic Acids Res 34: 6083–6094.

[26] Sethi A, Eargle J, Black AA, Luthey-Schulten Z (2009) Dynamical networks in tRNA:protein complexes. Proceedings of the National Academy of Sciences 106: 6620–6625.

[27] Freyhult E, Cui Y, Nilsson O, Ardell DH (2007) New computational methods reveal tRNA identity element divergence between Proteobacteria and Cyanobacteria. Biochimie 89: 1276–1288.

[28] Abe T, Ikemura T, Sugahara J, Kanai A, Ohara Y, et al. (2011) tRNADB-CE 2011: tRNA gene database curated manually by experts. Nucleic Acids Research 39: D210–3.

[29] Sprinzl M, Horn C, Brown M, Ioudovitch A, Steinberg S (1998) Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Research 26: 148–153.

[30] Aitchison J (1986) The Statistical Analysis of Compositional Data. Monographs on Statistics and Applied Probability. New York: Chapman and Hall.

[31] Widmann J, Harris JK, Lozupone C, Wolfson A, Knight R (2010) Stable tRNA-based phylogenies using only 76 nucleotides. RNA 16: 1469–1477.

[32] Hall M, Frank E, Holmes G, Pfahringer B (2009) The WEKA data mining software: an update. ACM SIGKDD.

[33] Duda R, Hart P, Stork D (2012) Pattern Classification. Wiley, second edition.

[34] Gupta RS, Mok A (2007) Phylogenomics and signature proteins for the alpha proteobacteria and its main groups. BMC Microbiol 7: 106.

[35] Doolittle RF, Handy J (1998) Evolutionary anomalies among the aminoacyl-tRNA synthetases. Current opinion in genetics & development 8: 630–636.

[36] Brown JR, Doolittle WF (1999) Gene descent, duplication, and horizontal transfer in the evolution of glutamyl- and glutaminyl-tRNA synthetases. J Mol Evol 49: 485–495.

[37] Wolf YI, Aravind L, Grishin NV, Koonin EV (1999) Evolution of aminoacyl-tRNA synthetases–analysis of unique domain architectures and phylogenetic trees reveals a complex history of horizontal gene transfer events. Genome Res 9: 689–710.

[38] Woese CR, Olsen GJ, Ibba M, Söll D (2000) Aminoacyl-tRNA synthetases, the genetic code, and the evolutionary process. Microbiol Mol Biol Rev 64: 202–236.

[39] Andam CP, Gogarten JP (2011) Biased gene transfer in microbial evolution. Nat Rev Micro 9: 543–555.

[40] Dohm J, Vingron M (2006) Horizontal gene transfer in aminoacyl-tRNA synthetases including leucine-specific subtypes. J Mol Evol.

[41] Brindefalk B, Viklund J, Larsson D, Thollesson M, Andersson SGE (2006) Origin and evolution of the mitochondrial aminoacyl-tRNA synthetases. Mol Biol Evol 24: 743–756.

[42] Shiba K, Motegi H (1997) Maintaining genetic code through adaptations of tRNA synthetases to taxonomic domains. Trends in biochemical sciences 22: 453–457.

[43] Andersson SG, Kurland CG (1998) Reductive evolution of resident genomes. Trends in microbiology 6: 263–268.

[44] Itoh T, Martin W, Nei M (2002) Acceleration of genomic evolution caused by enhanced mutation rate in endocellular symbionts. Proc Natl Acad Sci USA 99: 12944–12948.

[45] Grote J, Thrash JC, Huggett MJ, Landry ZC, Carini P, et al. (2012) Streamlining and core genome conservation among highly divergent members of the SAR11 clade. MBio 3.

[46] Dufresne A, Garczarek L, Partensky F (2005) Accelerated evolution associated with genome reduction in a free-living prokaryote. Genome Biol 6: R14–R14.

[47] Schuster P, Fontana W, Stadler PF, Hofacker IL (1994) From sequences to shapes and back: a case study in RNA secondary structures. Proc Biol Sci 255: 279–284.

[48] Kuo D, Licon K, Bandyopadhyay S, Chuang R, Luo C, et al. (2010) Coevolution within a transcriptional network by compensatory trans and cis mutations. Genome Res 20: 1672–1678.

[49] Baker CR, Tuch BB, Johnson AD (2011) Extensive DNA-binding specificity divergence of a conserved transcription regulator. Proceedings of the National Academy of Sciences 108: 7493–7498.

[50] Barrière A, Gordon KL, Ruvinsky I (2012) Coevolution within and between Regulatory Loci Can Preserve Promoter Function Despite Evolutionary Rate Acceleration. PLoS Genet 8: e1002961.

[51] Beltrao P, Albanèse V, Kenner LR, Swaney DL, Burlingame A, et al. (2012) Systematic functional prioritization of protein posttranslational modifications. Cell 150: 413–425.

[52] Hartl DL, Taubes CH (1996) Compensatory nearly neutral mutations: selection without adaptation. Journal of Theoretical Biology 182: 303–309.

[53] He BZ, Holloway AK, Maerkl SJ, Kreitman M (2011) Does Positive Selection Drive Transcription Factor Binding Site Turnover? A Test with Drosophila Cis-Regulatory Modules. PLoS Genet 7: e1002053.

[54] Bullaughey K (2012) Multidimensional adaptive evolution of a feed-forward network and the illusion of compensation. Evolution 67: 49–65.

[55] Haag ES, Molla MN (2005) Compensatory evolution of interacting gene products through multifunctional intermediates. Evolution 59: 1620–1632.

[56] Winker S, Woese CR (1991) A definition of the domains Archaea, Bacteria and Eucarya in terms of small subunit ribosomal RNA characteristics. Systematic and Applied Microbiology 14: 305–310.

[57] Roberts E, Sethi A, Montoya J, Woese CR, Luthey-Schulten Z (2008) Molecular signatures of ribosomal evolution. Proceedings of the National Academy of Sciences 105: 13953–13958.

[58] Chen K, Eargle J, Sarkar K, Gruebele M, Luthey-Schulten Z (2010) Functional role of ribosomal signatures. Biophys J 99: 3930–3940.

[59] Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, et al. (2010) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 38: D5–16.

[60] Lowe TM, Eddy SR (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25: 955–964.

[61] Laslett D, Canback B (2004) ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Research 32: 11–16.

[62] Tåquist H, Cui Y, Ardell DH (2007) TFAM 1.0: an online tRNA function classifier. Nucleic Acids Research 35: W350–3.

[63] Silva FJ, Belda E, Talens SE (2006) Differential annotation of tRNA genes with anticodon CAT in bacterial genomes. Nucleic Acids Research 34: 6015–6022.

[64] Eddy SR, Durbin R (1994) RNA sequence analysis using covariance models. Nucleic Acids Research 22: 2079–2088.

[65] Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, et al. (2002) The Bioperl toolkit: Perl modules for the life sciences. Genome Res 12: 1611–1618.

[66] Gouy M, Guindon S, Gascuel O (2010) Seaview version 4: a multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Molecular Biology and Evolution 27: 221–224.

[67] Jühling F, Mörl M, Hartmann R, Sprinzl M, Stadler P, et al. (2009) tRNAdb 2009: compilation of tRNA sequences and tRNA genes. Nucleic Acids Research 37: D159–D162.

[68] Gorodkin J, Heyer LJ, Brunak S, Stormo GD (1997) Displaying the information contents of structural RNA alignments: the structure logos. Computer Applications In the Biosciences: CABIOS 13: 583–586.

[69] Felsenstein J (2005) PHYLIP (Phylogeny Inference Package) version 3.6. University of Washington, Seattle: Department of Genome Sciences.

[70] Stamatakis A, Hoover P, Rougemont J (2008) A rapid bootstrap algorithm for the RAxML web servers. Systematic Biology 57: 758–771.

[71] Price MN, Dehal PS, Arkin AP (2010) FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS ONE.

[72] Hamady M, Lozupone C, Knight R (2010) Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data. The ISME Journal 4: 17–27.

Figure Legends

Figure 1 A universal schema for tRNA-protein interaction networks.

Figure 2 Function logos [24] for two groups of alphaproteobacteria and overview of tRNA-CIF-based binary phyloclassification.

Figure 3 Leave-One-Out Cross-Validation (LOO-CV) scores of alphaproteobacterial genomes under two different binary phyloclassifiers. A. tRNA-CIF-based phyloclassifier B. Total tRNA sequence-based phyloclassifier.

Figure 4 Breakout of class contributions to scores under the tRNA CIF-based binary phyloclassifier.

Figure 5 Base compositions of alphaproteobacterial tRNAs showing convergence between Rickettsiales and SAR11. A. Stacked bar graphs of tRNA base composition by clade. B. UPGMA clustering of clades based on Euclidean distances of tRNA base compositions under the centered log ratio transformation [30].

Figure 6 FastUniFrac-based phylogenetic tree of alphaproteobacteria using tRNA data computed according to the methods of [31].

Figure 7 Seven-way tRNA-CIF-based phyloclassification of alphaproteobacterial genomes by the default multilayer perceptron in WEKA. Bootstrap support values under resampling of tRNA sites against (left) all tRNA CIFs and (right) CIFs with Gorodkin heights ≥ 0.5 bits and model retraining (100 replicates). All support values correspond to the most probable clade as shown, except for Stappia and Labrenzia, for which they correspond to Rhizobiales.

[Figure 1 graphic omitted. Diagram labels: Class-Specific Binding Proteins; tRNA Functional Classes with Class Informative Features; Multi-Class Binding Proteins (e.g. tRNA Modification Enzymes); General Translation Factors; Identity; Discrimination; Recognition; Conformity.]

Figure 1: A universal schema for tRNA-protein interaction networks.

[Figure 2 graphic omitted. Panels: function logos for Rhizobiales + Rhodobacteraceae + Caulobacterales + Hyphomonadaceae (RRCH) and for Rickettsiales + Sphingomonadales + Rhodospirillales (RSR). Vertical axes: bits (0 to 4). Horizontal axes: tRNA sequence positions (-1 to 73, including 17a, 20a, and 20b), with regions labeled ACC STEM, D STEM, D LOOP, AC STEM, AC, T STEM, and T LOOP. Workflow labels: genome; functionally classified tRNAs; Score each tRNA sequence for taxon-specific functional Class-Informative Features (CIFs); Average scores across tRNAs.]

Figure 2: Function logos [24] for two groups of alphaproteobacteria and overview of tRNA-CIF-based binary phyloclassification.

[Figure 3 graphic omitted. Title: Classifier (RRCH - RSR). Panel A: bits difference score, "functional" sites included; vertical axis: genome classification bit scores. Panel B: logodds difference score, all sites included; vertical axis: genome classification logodds scores. Horizontal axes: frequency. Clade legend: Rhizobiales, Rhodobacteraceae, Caulobacterales, Hyphomonadaceae (RRCH); Ca. Pelagibacter/SAR11; Sphingomonadales; Rhodospirillales; Rickettsiales.]

Figure 3: Leave-One-Out Cross-Validation (LOO-CV) scores of alphaproteobacterial genomes under two different binary phyloclassifiers. A. tRNA-CIF-based phyloclassifier B. Total tRNA sequence-based phyloclassifier.

[Figure 4 graphic omitted. Legend: Ca. P. ubique strains HIMB114, HTCC1062, HIMB5, HTCC7211, HIMB59, HTCC9565, HTCC1002, and IMCC9063; RRCH; RSR. Vertical axis: Score Contributions (-20 to 40). Horizontal axis: tRNA functional class (A C D E F G H I J K L M N P Q R S T V W X Y).]

Figure 4: Breakout of class contributions to scores under the tRNA CIF-based binary phyloclassifier.

[Figure 5 graphic omitted. Panel A: stacked bars of base fractions (A, T, C, G; axis 0.0 to 1.0) for Rhizobiales, Rhodobacteraceae, Caulobacterales, Sphingomonadales, Rhodospirillales, Rickettsiales, and SAR11. Panel B: dendrogram with tips Rhodobacteraceae, Rhodospirillales, Pelagibacter/SAR11, Sphingomonadales, Rhizobiales, Rickettsiales, and Caulobacterales; scale bar 0.02.]

Figure 5: Base compositions of alphaproteobacterial tRNAs showing convergence between Rickettsiales and SAR11. A. Stacked bar graphs of tRNA base composition by clade. B. UPGMA clustering of clades based on Euclidean distances of tRNA base compositions under the centered log ratio transformation [30].
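The distance named in the Figure 5 legend can be illustrated with a short Python sketch: each base composition is mapped by the centered log ratio (log of each part minus the mean log over parts), and Euclidean distance is taken between the transformed vectors. This is a generic sketch of that standard transformation, not the authors' code. The function names and the toy compositions are hypothetical, and the sketch assumes every component is strictly positive.

```python
import math


def clr(composition):
    """Centered log ratio: log of each part minus the mean of the logs."""
    if any(x <= 0 for x in composition):
        raise ValueError("all components must be strictly positive")
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def clr_distance(p, q):
    """Euclidean distance between two compositions after the clr transform."""
    if len(p) != len(q):
        raise ValueError("compositions must have the same number of parts")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(p), clr(q))))


# Hypothetical (A, C, G, T) fractions, invented for illustration only.
at_rich = [0.30, 0.20, 0.20, 0.30]
gc_rich = [0.20, 0.30, 0.30, 0.20]
print(clr_distance(at_rich, gc_rich))
```

Because the transform subtracts the mean log, rescaling a composition (for example, using counts instead of fractions) leaves the distance unchanged. A matrix of such pairwise distances between clades is the kind of input that a UPGMA clustering takes.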

[Figure 6 graphic omitted. Tree whose tip labels are alphaproteobacterial genome names, including strains of Wolbachia, Ehrlichia, Anaplasma, Neorickettsia, Orientia, Rickettsia, Pelagibacter ubique, Rhodobacter, Roseobacter, Caulobacter, Methylobacterium, Brucella, Bartonella, Rhodopseudomonas, Rhizobium, Sinorhizobium, and Agrobacterium; the order of the labels was not recoverable from the extracted text.]

Figure 6: FastUniFrac-based phylogenetic tree of alphaproteobacteria using tRNA data computed according to the methods of [31].

[Figure 7 image: chart listing alphaproteobacterial genomes (including Wolbachia, Rickettsia, Orientia, Neorickettsia, Ehrlichia, Anaplasma, Pelagibacter ubique, Rhizobium, Brucella, Methylobacterium, Roseobacter, and Rhodobacter strains), each shown with two values in the range 0 to 100. Legend, clades: Hyphomonadaceae, Rhizobiales, Sphingomonadales, Rhodobacteraceae, Rhodospirillales, Caulobacterales, Rickettsiales. Scale: classification probability, 0.00 to 1.00.]

Figure 7: FastUniFrac-based phylogenetic tree of alphaproteobacteria using tRNA data computed according to the methods of [31].
