Syst. Biol. 56(1):1–16, 2007
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI: 10.1080/10635150601109759

From Phylogenetics to Phylogenomics: The Evolutionary Relationships of Insect Endosymbiotic γ-Proteobacteria as a Test Case

IÑAKI COMAS, ANDRÉS MOYA, AND FERNANDO GONZÁLEZ-CANDELAS
Institut Cavanilles de Biodiversitat i Biologia Evolutiva, Universitat de València, Valencia, Spain; E-mail: [email protected] (I.C.); [email protected] (A.M.); [email protected] (F.G.-C.)

Abstract.—The increasing availability of complete genome sequences and the development of new, faster methods for phylogenetic reconstruction allow the exploration of the set of evolutionary trees for each gene in the genome of any species. This has led to the development of new phylogenomic methods. Here, we have compared different phylogenetic and phylogenomic methods in the analysis of the monophyletic origin of insect endosymbionts from the γ-Proteobacteria, a hotly debated issue with several recent, conflicting reports. We have obtained the phylogenetic tree for each of the 579 identified protein-coding genes in the genome of the primary endosymbiont of carpenter ants, Blochmannia floridanus, after determining their presumed orthologs in 20 additional Proteobacteria genomes. A reference phylogeny reflecting the monophyletic origin of insect endosymbionts was further confirmed with different approaches, which led us to consider it as the presumed species tree. Remarkably, only 43 individual genes produced exactly the same topology as this presumed species tree. Most discrepancies between this tree and those obtained from individual genes or by concatenation of different genes were due to the grouping of Xanthomonadales with β-Proteobacteria and not to uncertainties over the monophyly of insect endosymbionts. As previously noted, operational genes were more prone to reject the presumed species tree than those included in information-processing categories, but caution should be exercised when selecting genes for phylogenetic inference on the basis of their functional category assignment. We have obtained strong evidence in support of the monophyletic origin of γ-Proteobacteria insect endosymbionts by a combination of phylogenetic and phylogenomic methods. In our analysis, the use of concatenated genes has proven to be a valuable tool for analyzing primary phylogenetic signals coded in the genomes. Nevertheless, other phylogenomic methods such as supertree approaches were useful in revealing alternative phylogenetic signals and should be included in comprehensive phylogenomic studies. [Concatenate alignment; incongruence; microbial evolution; phylome; supermatrix alignment; supertree analysis.]

The evolution of bacteria is strongly influenced by the fluid nature of their genomes. Processes such as duplication, gene gain, and gene loss limit the inference of their evolutionary relationships. However, in the last decade the genomic revolution has represented not only a change in scale for the analysis of sequences but also for phylogenetic inference. The phylogenomic approach, initially proposed as the application of phylogenetic analysis to help annotate complete genome sequences (Eisen, 1998), has transformed into the (possibly) most appropriate way to derive the phylogenetic history of organisms through genome-scale phylogenetic inference (O’Brien and Stanyon, 1999; Sicheritz-Pontén and Andersson, 2001; Eisen and Fraser, 2003). This leap from phylogenetics to phylogenomics has made it possible to avoid some of the inherent problems in the inference of species trees from single gene phylogenies. However, it has also brought new issues in the reconstruction of species trees and even questioned the existence of such a single tree for all the genes in the genome of a bacterial species (Doolittle, 1999; Bapteste et al., 2005).

The single-gene approach reached its peak with the use of ribosomal RNA sequences as phylogenetic markers for microorganisms (Woese, 1987). These genes have been, and still are, a powerful tool in bacterial phylogenetic analysis. Their properties as good markers for phylogenetic inference, such as universal presence and evolutionary conservation, have enabled the proposal of a universal tree of life (Woese et al., 1990) and the classification and reconstruction of evolutionary relationships for the three domains in the absence of genomic data (Woese and Fox, 1977). However, even ribosomal markers have been shown to be laterally transferred in some cases (Asai et al., 1999; Yap et al., 1999).

The advantages of multiple gene approaches versus those based on single genes are a priori evident. In theory, we can evade the single gene evolutionary histories in favor of a common “true” phylogenetic signal; it is possible to avoid problems derived from insufficient sample size by addition of more sites from multiple genes or to compensate for biased base compositions. In practice, some of these problems will also affect phylogenies reconstructed from large data sets, but others will be substantially reduced and/or easily diagnosed. Several alternative methods to single-gene tree phylogenies have been proposed recently (reviewed in Delsuc et al., 2005). These methods are based on different kinds of sequence characters and genome structure, such as gene content (Snel et al., 1999; Gu et al., 2004; Huson and Steel, 2004), gene order (Wolf et al., 2001; Korbel et al., 2002; Belda et al., 2005), concatenated sequences (Brown et al., 2001; Rokas et al., 2003), or gene tree–based techniques (Bininda-Emonds et al., 2002; Bininda-Emonds, 2004b). In this article we focus on methods based on direct or indirect analyses of genomic sequence data.

The concatenation approach has proven useful in many cases and its use is increasingly common (Bapteste et al., 2002; Rokas et al., 2003). The method has been justified in cases where single-gene phylogenetic signals are insufficient (Herniou et al., 2001), when there is heterogeneity in the evolutionary rates (Bapteste et al., 2002; Gontcharov et al., 2004), or when a high influence of nonstandard evolutionary patterns in the shaping of gene trees, such as horizontal gene transfer, is to
be expected (Brochier et al., 2002). This method increases phylogenetic signal by joining the sequences from multiple genes, thus creating a supermatrix of characters, and generally recovers accepted (and presumably correct) phylogenies with highly supported nodes. It thus evades many of the above pitfalls (for instance, insufficient amounts of informative sites or particular gene histories) as long as the correct model of evolution is used, although it may not overcome problems associated with systematic biases in the data (Phillips et al., 2004).

Contrary to the concatenation approach, consensus and supertree approaches have an indirect association with the genome sequence. The consensus method is based on the integration of multiple source trees into a single topology. When the initial data set does not include the same taxa in all the gene trees, a supertree is constructed by combining the overlapping topologies (Bininda-Emonds, 2004a). All these methods have proven useful for phylogenetic inference (Daubin et al., 2002; Rokas et al., 2003), but they have different possible associated errors. For instance, the strength of the supermatrix approach decreases when the number of shared orthologous genes is low or when many of the concatenated genes are influenced by horizontal gene transfer events or hidden paralogies (Daubin et al., 2002). On the other hand, the supertree approach also has its own sources of problems. The use of wrong source trees, the indirect relationship with molecular data, taxon heterogeneity among the trees, or the lack of a universal methodology for assessing the reliability of the nodes are among the most common problems encountered by consensus and supertree approaches (Gatesy et al., 2002; Creevey, 2004; Bininda-Emonds, 2004a).

Our primary concerns in this work are to carry out a comparative analysis of different phylogenetic methodologies and to derive, as far as possible, a presumed species tree for a group of Proteobacteria, although this last concept itself is controversial for bacteria. Opinions range from those who consider nonvertical evolutionary events, mainly horizontal gene transfer, as background noise that, after appropriate filtering, still allows recovery of the true phylogenetic relationships among the species (Jain et al., 1999; Kurland et al., 2003) to those who sustain the impossibility of describing bacterial genome evolution through a single tree-like phylogeny (Doolittle, 1999; Creevey et al., 2004; Bapteste et al., 2005). We will not directly address this question in this work, although we will use the term “species tree” to refer to the phylogenetic hypothesis that encompasses most of the vertically transmitted phylogenetic signal. To study the evolution of bacteria we must identify the different vertical and nonvertical phylogenetic signals; the former, if sufficiently strong, will be reflected in the species tree.

In this work, we have approached a contentious issue in bacterial phylogenetics through the use of different phylogenetic and phylogenomic methods. Apart from allowing the comparison of the problems and advantages of the different methodologies employed, the particular set of Proteobacteria used in this study also poses some interesting evolutionary questions. In particular, there is an ongoing debate on whether bacterial endosymbionts of insects from the gamma subdivision of Proteobacteria form a monophyletic group (Gil et al., 2003; Lerat et al., 2003; Canback et al., 2004) or are paraphyletic, with their grouping resulting from artefacts in the phylogeny reconstruction process (Charles et al., 2001; Herbeck et al., 2004; Belda et al., 2005). This study investigates the evolutionary relationships of five γ-Proteobacteria endosymbionts, including the three Buchnera strains sequenced so far. Several evolutionary features of these genomes, such as their high evolutionary rates, low G+C content, and genome disintegration, lead to different methodological problems (Moreira and Philippe, 2000; Sanderson and Shaffer, 2002).

Our work is based on the determination of the phylome (Sicheritz-Pontén and Andersson, 2001)—the set of phylogenetic trees for each protein-coding gene in the genome—of Blochmannia floridanus (Gil et al., 2003), the primary endosymbiont of carpenter ants. We have used it as a starting point to explore the phylogenetic landscape of γ-Proteobacteria. In this work we have used an array of phylogenetic and phylogenomic techniques to infer the phylogenetic relationships among 21 Proteobacteria. Consequently, we have analyzed several phylogenetic hypotheses for these species through the examination of the phylogenies derived from the set of protein-coding genes in Blochmannia and their comparison with topologies obtained from the 16S rDNA and different phylogenomic methods. Our aim has been not only to test alternative species tree hypotheses but also to analyze the advantages and pitfalls associated with each class of approach. Additionally, we have attempted an approximation to the genomic core concept (Nesbø et al., 2001; Gil et al., 2004; Strobel and Arnold, 2004) through the analysis of trees and sequences, which has allowed us to question the relevance of the informational/operational gene distinction in phylogenetic inference and in the definition of a potential bacterial core set of genes.

MATERIALS AND METHODS

Genomes and Homologous Genes Selection

In this study we have used 21 complete genome sequences of Proteobacteria (Table 1) retrieved from GenBank (Benson et al., 2004). We included three β-Proteobacteria, Neisseria meningitidis MC58, Neisseria meningitidis Z2491, and Ralstonia solanacearum; one α-Proteobacteria, Rickettsia prowazekii; and 17 γ-Proteobacteria, including five insect endosymbionts: the three Buchnera species sequenced so far, Wigglesworthia glossinidia brevipalpis, and Blochmannia floridanus, which is our initial genome. A complete and general scheme of the analyses carried out in this study and the relationships among them is presented in Figure 1.

The procedure to obtain the putative orthologs for each gene in the Blochmannia floridanus genome started from an initial reference tree. Because it is not possible to be certain about the truly orthologous nature of a gene until the phylogenetic analysis is completed, we considered
the homologous genes found in the procedure described below as putative orthologs, which may eventually prove to be true orthologs, paralogs, or xenologs, depending on their inferred evolutionary history.

TABLE 1. Complete genome sequences of Proteobacteria used in this study.

Species (strain)                                         Accession no.  Division          Order
Rickettsia prowazekii                                    NC_000963      α-Proteobacteria  Rickettsiales
Neisseria meningitidis MC58                              NC_003112      β-Proteobacteria  Neisseriales
Neisseria meningitidis Z2491                             NC_003116      β-Proteobacteria  Neisseriales
Ralstonia solanacearum                                   NC_003295      β-Proteobacteria  Burkholderiales
Blochmannia floridanus                                   NC_005061      γ-Proteobacteria  Enterobacteriales
Buchnera aphidicola BPI                                  NC_004545      γ-Proteobacteria  Enterobacteriales
Buchnera aphidicola SGI                                  NC_004061      γ-Proteobacteria  Enterobacteriales
Buchnera sp. APS                                         NC_002528      γ-Proteobacteria  Enterobacteriales
Escherichia coli K12                                     NC_000913      γ-Proteobacteria  Enterobacteriales
Escherichia coli O157:H7 EDL933                          NC_002655      γ-Proteobacteria  Enterobacteriales
Haemophilus influenzae Rd                                NC_000907      γ-Proteobacteria  Pasteurellales
Pasteurella multocida                                    NC_002663      γ-Proteobacteria  Pasteurellales
Pseudomonas aeruginosa                                   NC_002516      γ-Proteobacteria  Pseudomonadales
Salmonella enterica subsp. enterica serovar Typhi        NC_003198      γ-Proteobacteria  Enterobacteriales
Salmonella typhimurium LT2                               NC_003197      γ-Proteobacteria  Enterobacteriales
Vibrio cholerae                                          NC_002505      γ-Proteobacteria  Vibrionales
Wigglesworthia glossinidia brevipalpis                   NC_004344      γ-Proteobacteria  Enterobacteriales
Xanthomonas axonopodis pv. citri strain 306              NC_003919      γ-Proteobacteria  Xanthomonadales
Xanthomonas campestris pv. campestris strain ATCC 33913  NC_003902      γ-Proteobacteria  Xanthomonadales
Xylella fastidiosa                                       NC_002488      γ-Proteobacteria  Xanthomonadales
Yersinia pestis KIM                                      NC_004088      γ-Proteobacteria  Enterobacteriales

The reference tree was obtained as described in Gil et al. (2003). Briefly, we obtained the orthologs for 60 informational genes present in all the genomes considered. Protein sequences were aligned with CLUSTALW 1.8 (Thompson et al., 1994) and processed with GBLOCKS (Castresana, 2000) with default parameters to eliminate areas of uncertain homology or low phylogenetic content before concatenation. The resulting concatenate was used to obtain a
maximum likelihood tree using the quartet method implemented in TREEPUZZLE 5.1 (Schmidt et al., 2002) with the following options: JTT model of substitution (Jones et al., 1992), proportion of invariant sites (I) estimated from the data, eight discrete categories to approximate a gamma distribution accounting for evolutionary rate heterogeneity across sites (G), empirical amino acid frequencies (F), and 4000 puzzling steps. We also obtained phylogenetic trees by Bayesian inference using MrBayes 3.0 (Ronquist and Huelsenbeck, 2003) (JTT+I+F+G and 100,000 generations). This tree is an expanded version of the tree reported in Gil et al. (2003) with additional sequences from the non–γ-Proteobacteria genomes. With this reference tree, we assigned each genome to one of nine different groups (see Fig. 2) in order to reduce the BLAST database and to speed up and refine the searches (see below).

FIGURE 1. General diagram of the different methodologies used for assessing the phylome of Blochmannia floridanus as well as the phylogenomic analyses and testing. 1sKH: one-sided Kishino-Hasegawa test; SH: Shimodaira-Hasegawa test; ELW: expected likelihood weight test; MRP: matrix representation using parsimony; dfit: most similar supertree; qfit: maximum quartet fit; sfit: maximum splits fit.

FIGURE 2. Guide species tree obtained with a trimmed alignment of 60 concatenated proteins. For clarity, only genus names are represented except for cases with more than one species or strain (see Table 1). Numbers in nodes indicate support values in the form of proportion of quartets and Bayesian a posteriori probabilities for the corresponding inner branch. The symbol next to each species indicates the group to which it was assigned for BLAST searches. Species with underlined names were used as initial target species for BLAST searches of the corresponding groups. The taxonomic classification (order and Proteobacteria division) of each species is also indicated according to the NCBI taxonomy database.

The B. floridanus genome contains 579 annotated protein coding genes (Gil et al., 2003) and each of these was used as query for a BLASTP (Altschul et al., 1997) search against the remaining genomes. To retrieve the homologous genes from the other genomes, we first performed a BLAST search using the representative genome from each of the nine previously described groups as target (Fig. 2). Next, we used the best hit from each representative species of the nine groups in the first BLAST search as query for a second BLAST against the remaining genomes in the corresponding group. This strategy was slightly modified for those B. floridanus genes for which no reliable homolog (see below for a more detailed description of the criteria used) was found in the representative species of some group. In this case, a new BLAST search was performed against all the genomes in the group and the best hit was used as query for a second BLAST to retrieve the homologs from the remaining species in the group as described. In all cases, we used an E-value < 1E-3 as threshold for considering a matching sequence as a putative ortholog in the BLAST searches. Once homologs for each of the 579 genes in B. floridanus were retrieved we performed a new filtering to ensure
that these genes corresponded to putative orthologs. We first proceeded to obtain a multiple alignment and the corresponding gene tree as described below. With this information we considered five different criteria for each case: annotation (coincidence in functional annotation), multiple alignment (similarity extended over the complete sequence and not a short fragment), gene tree (whether it conflicted strongly with the reference tree), BLAST significance, and information from the Microbial Genome Database for Comparative Analysis (Uchiyama, 2003). Homologous genes were accepted as putative orthologs after simultaneous consideration of all five criteria. Because these are independent, they allowed us to analyze orthology in the absence of a good annotation or when the position in the gene tree was unexpected. Had we relied strictly and only on annotation or position in the gene tree, we would have biased the data set towards genes with the most congruent phylogenies or best annotations. In most cases the identification of putative orthologs was easy. In the most difficult cases, which corresponded to nonannotated proteins or to gene trees very divergent from the usually accepted species phylogenies, we considered as putative orthologs those genes that were supported by at least three of the above-mentioned criteria. This resulted in a set of 200 genes in B. floridanus with reliable orthologs in all of the other 20 genomes considered. We will refer to this subset as the ‘200-gene’ data set. Whenever possible, and in order to test the effect of missing data, we performed the analyses described below with the remaining genes with reliable orthologs missing from at least one genome (‘379-gene’ data set) and also with the complete set of coding genes in the B. floridanus genome (‘579-gene’ data set).

Obtaining the Phylome of Blochmannia floridanus

After obtaining putative orthologs for each protein of the B. floridanus genome we proceeded to obtain its phylome, the set of phylogenetic trees for each protein coding gene. The alignment for each set of homologous proteins was obtained using CLUSTALW 1.8 (Thompson et al., 1994) and the result was trimmed with GBLOCKS (Castresana, 2000) using default parameters. The maximum likelihood tree for each multiple alignment was inferred with the program PHYML 2.1b (Guindon and Gascuel, 2003) as it implements one of the fastest and most accurate algorithms for heuristic searches. For all the alignments we used JTT + I + G + F as model and parameters of evolution. Some of the initial alignments and trees were modified based on the a posteriori selection of homologs explained above. We decided not to obtain support values by bootstrap resampling, both to save computation time and to avoid biasing the analysis towards the best supported topologies.

We used four phylogenetic approaches to obtain a 16S rDNA topology for the 21 species: maximum parsimony (MP), maximum likelihood (ML), and two different distances with the neighbor-joining (NJ) algorithm (Saitou and Nei, 1987). MP analysis was performed using PAUP* 4b10 (Swofford, 2002) with a heuristic search based on tree-bisection-reconnection (TBR) as branch-swapping algorithm and 1000 replicates to assess bootstrap support values. The best model for ML inference was selected using the ModelTest program (Posada and Crandall, 1998) following the Akaike information criterion (AIC) and implemented in PHYML 2.1b with 500 bootstrap replicates. For NJ we used two models that could handle heterogeneous base composition among lineages. First, we used MEGA 3 (Kumar et al., 2004) to implement the LogDet distance modified by Tamura and Kumar (Tamura and Kumar, 2002) to account for substitution pattern heterogeneity. Second, we used PHYLO_WIN (Galtier et al., 1996), implementing the Galtier and Gouy distance (Galtier and Gouy, 1995) for reducing nucleotide bias. A 1000-replicate bootstrap was carried out for both NJ analyses.

Phylogenomic Analysis

We used three different approaches for phylogenomic analysis. Concatenated sequences, consensus trees, and supertrees were used to explore the phylogenetic landscape of γ-Proteobacteria.

Concatenates.—The analysis of concatenates was based on the subset of putative orthologous genes shared by the 21 genomes, the ‘200-gene’ data set. We concatenated the GBLOCKS-trimmed alignments of these proteins and analyzed the resulting alignment using several methods. PAUP* 4b10 (Swofford, 2002) was used for parsimony analysis with stepwise addition and tree bisection-reconnection for heuristic search and 500 bootstrap replicates. PHYML and TREEPUZZLE were used for maximum likelihood inference. Due to computational limitations, we had to reduce the number of parameters in the evolutionary models. In the analysis with TREEPUZZLE we used a JTT model with four gamma categories and 4000 puzzling steps. The JTT model of substitution with estimated proportion of invariant sites and 100 bootstrap replicates was used for inference with PHYML.

We also used this alignment of the ‘200-gene’ data set to explore the possible influence of heterogeneous base composition in the retrieval of the different identified clades, especially those involving endosymbiotic bacteria. We removed from the alignment those positions that included the amino acids most strongly affected by nucleotide bias (FYMINK) (Singer and Hickey, 2000). With this data set we carried out a Bayesian analysis using MrBayes 3.0 (Ronquist and Huelsenbeck, 2003) (four chains, 1,000,000 generations, 10% of burn-in) and a maximum likelihood analysis with PHYML (500 bootstrap pseudoreplicates) with the same evolutionary model described above (JTT+I).

Finally, we obtained a concatenate tree using all available sequences (‘579-gene’ data set) by setting as “missing character” those amino acid positions that corresponded to a protein absent from a given genome. It was analyzed with TREEPUZZLE using the JTT model of substitution with invariants estimated from the data set.

Alternative, more complete methods for analyzing this concatenate were not feasible due to computational limitations.

Consensus Trees.—A majority rule consensus tree was obtained with the program CONSENSE from the PHYLIP package (Felsenstein, 2005) with the gene trees from the ‘200-gene’ data set.

Supertrees.—For the ‘200-gene’ data set and for the set of gene trees with an incomplete number of genomes represented we used a supertree approach (Sanderson et al., 1998). A supertree integrates all the topological information present in the source gene trees. In this case, we applied supertree reconstruction to three different data sets: (i) genes present in all the genomes (‘200-gene’ data set, the same set for which we obtained a consensus tree); (ii) genes missing in at least one genome (‘379-gene’ data set); and (iii) the complete phylome (‘579-gene’ data set). We applied the different optimization algorithms implemented in the program CLANN (Phillips et al., 2004; Creevey and McInerney, 2005) to our data set. The most commonly applied method in supertree reconstruction is matrix representation using parsimony (MRP) (Loomis and Smith, 1990; Baum, 1992; Ragan, 1992). This method considers each node of the source trees as a binary character, assigning a “1” to the taxa contained in the clade defined by the internal node, a “0” to the taxa not present in this clade, and a “?” to the taxa not present in the tree. The resulting matrix is analyzed by parsimony. The other methods implemented and used in our work were the most similar supertree method (dfit), maximum quartet fit (qfit), and maximum split fit (sfit) (Creevey and McInerney, 2005). They are based on the optimization of distances between nodes, shared quartets, or shared partitions, respectively, between the supertrees proposed by the heuristic search and the source trees.

Nonparametric bootstrap by sampling with replacement over gene trees was used to obtain an indicative measure of the support of the derived clades in each supertree from the corresponding data. Support for each node in the supertrees was evaluated from 100 pseudoreplicate samples for each data set used to derive the supertrees.

Presumed Species Tree versus Gene Trees: Likelihood-Based Test of Topologies

Likelihood-based tests of competing phylogenetic hypotheses were carried out once we obtained the complete phylome of B. floridanus. For each alignment we compared the topologies of the presumed species tree and the corresponding gene tree by means of three different tests implemented in TREEPUZZLE (Schmidt et al., 2002): the one-sided Kishino-Hasegawa test (1sKH) (Kishino and Hasegawa, 1989; Goldman et al., 2000), the Shimodaira-Hasegawa test (SH) (Shimodaira and Hasegawa, 1999), and the expected likelihood weight test (ELW) (Strimmer and Rambaut, 2002). For the first two tests, the null hypothesis assumes that the difference between the likelihoods associated with each topology is not significantly different from zero. The first test to be described was the KH test, but its validity when one of the competing hypotheses was derived from the alignment has been questioned (Goldman et al., 2000). In this study we used the one-sided version of the test because the gene tree is the maximum likelihood tree and therefore we expected its likelihood to always be higher than or equal to the likelihood associated with the species tree. In any case we have used it for comparative purposes and not as the reference test because of its problems with multiple comparisons corrections. The SH test compares multiple hypotheses simultaneously. Even though it is not free from errors (Goldman et al., 2000), it overcomes some of the problems associated with the KH test. We used it as our reference test for examining the rejection of the presumed species tree by each gene alignment. For both tests the resampling-estimated log likelihood (RELL) bootstrapping method with 1000 replicates was used. The ELW test is an alternative way of topology testing. It creates a confidence set of phylogenies that could include (acceptance) or not (rejection) our presumed species tree topology. The usual α = 0.05 level was used to delimit the acceptance/rejection region.

RESULTS

Reference Tree and Search for Homologs

The construction of a phylogenetic tree for a set of species from their genome sequences needs, as a starting point, a phylogeny on which to decide which genes should be accepted or rejected for this kind of analysis on the basis of their true orthology. This is a circular question that we have approached by obtaining an initial or reference tree, testing its reliability, and then using it as benchmark for further decisions. For this, we started by obtaining the phylogenetic tree from a subset of conserved protein coding genes usually accepted as providing the most robust, reliable phylogenies when concatenated (Jain et al., 1999).

The concatenation of 60 informational genes after their independent alignment and trimming of residues of uncertain homology resulted in an alignment with 8067 amino acid positions. Figure 2 shows the phylogenetic tree obtained by maximum likelihood and Bayesian inference methods. This reference tree was identical to the one reported in Gil et al. (2003) except for the addition of four genomes from Proteobacteria in the α- and β-divisions. The tree reflected monophyly not only for the three Buchnera species but also for a wider group including the other insect endosymbionts, Blochmannia and Wigglesworthia. This group is a sister clade to the YESS (Yersinia-Escherichia-Shigella-Salmonella) cluster. Due to its concordance with the accepted taxonomy, the high support values for the nodes, and the fact that it recovers the monophyly of Buchnera species, we considered this as a good reference tree for the ensuing BLAST searches. We further verified that this tree was obtained in most cases after concatenation of 60 random genes chosen among those present in the 21 genomes (results not shown).
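The MRP coding described under Supertrees in Materials and Methods can be illustrated with a short Python sketch. This is not the CLANN implementation: the tree representation (a set of taxa plus a list of clades, one per internal node), the function name `mrp_matrix`, and the taxon labels A to E are assumptions chosen for illustration, and the parsimony search on the resulting matrix is left out.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumed representation of a source tree:
# (taxa present in the tree, clades defined by its internal nodes).
Tree = Tuple[FrozenSet[str], List[FrozenSet[str]]]


def mrp_matrix(trees: List[Tree], all_taxa: List[str]) -> Dict[str, str]:
    """Code every internal node of every source tree as one binary character."""
    columns: List[Dict[str, str]] = []
    for taxa, clades in trees:
        for clade in clades:
            column: Dict[str, str] = {}
            for taxon in all_taxa:
                if taxon not in taxa:
                    column[taxon] = "?"  # taxon absent from this source tree
                elif taxon in clade:
                    column[taxon] = "1"  # taxon inside the clade
                else:
                    column[taxon] = "0"  # taxon in the tree but outside the clade
            columns.append(column)
    return {taxon: "".join(col[taxon] for col in columns) for taxon in all_taxa}


# Two hypothetical source trees with overlapping but different taxon sets.
tree1: Tree = (frozenset("ABCD"), [frozenset("AB")])
tree2: Tree = (frozenset("ABCE"), [frozenset("AB"), frozenset("ABC")])
matrix = mrp_matrix([tree1, tree2], ["A", "B", "C", "D", "E"])
for taxon, row in matrix.items():
    print(taxon, row)
# A 111
# B 111
# C 001
# D 0??
# E ?00
```

Each row of this matrix would then be passed to a parsimony program together with the others; the "?" entries are what allow source trees with different taxon sets to be combined.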

FIGURE 3. Histogram summarizing the distribution of putative orthologs for each protein coding gene (579 proteins of known function) in the Blochmannia floridanus genome among the 21 genomes considered in this study. Note that 15 genes from the last column (those present in all the species) were considered of dubious orthology and were analyzed in the 379-gene data set.
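A tally of the kind summarized in Figure 3 can be sketched in a few lines of Python. The gene and genome labels below are toy placeholders, not data from this study, and the function names are hypothetical; the only assumption is that the ortholog search yields, for each gene, the set of genomes in which a putative ortholog was found.

```python
from collections import Counter
from typing import Dict, Set


def presence_histogram(orthologs: Dict[str, Set[str]]) -> Counter:
    """Map 'number of genomes containing the gene' to 'number of such genes'."""
    return Counter(len(genomes) for genomes in orthologs.values())


def fraction_at_least(hist: Counter, k: int) -> float:
    """Fraction of genes found in at least k genomes."""
    total = sum(hist.values())
    return sum(n for count, n in hist.items() if count >= k) / total


# Toy input (hypothetical): four genes searched in three genomes.
orthologs = {
    "geneA": {"g1", "g2", "g3"},
    "geneB": {"g1", "g2"},
    "geneC": {"g1", "g2", "g3"},
    "geneD": {"g1"},
}
hist = presence_histogram(orthologs)
print(sorted(hist.items()))        # [(1, 1), (2, 1), (3, 2)]
print(fraction_at_least(hist, 2))  # 0.75
```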

The similarity searches for homologs in the 21 genomes resulted in 215 protein-coding genes present in all of them. After their initial multiple alignment and phylogenetic tree reconstruction, we proceeded to verify their reliability as putative orthologs according to the criteria indicated above. This resulted in the exclusion of 15 proteins with doubtful orthology for at least one γ-Proteobacteria species different from the insect endosymbionts, thus forming the '200-gene' data set. However, in order to analyze the complete B. floridanus phylome, we added these 15 proteins, present in all 21 taxa but of doubtful orthology, to the data set of proteins absent from at least one taxon. Therefore, this set was composed of 379 genes ('379-gene' data set) with a heterogeneous distribution of presence in the different genomes, summarized in Figure 3. Each protein was present on average in 16 of the 21 genomes. Nearly 60% of the proteins were present in at least 16 genomes and only 14% in fewer than 10. A table with the distribution of B. floridanus protein-coding genes in the other 20 genomes is available as supplementary information from the Systematic Biology website at http://systematicbiology.org.

The Phylome of B. floridanus and the 16S rDNA Tree

We obtained all the phylogenetic trees derived from the set of protein-coding genes of B. floridanus as explained in Materials and Methods. In the analysis of the '200-gene' data set, we identified only three genes (rpoC, pheT, and mopA) with the same topology (Robinson-Foulds distance = 0) as the reference tree (Table 2). In the '379-gene' data set, we found 40 genes whose topology was fully congruent with that of the reference tree, although this number was reduced to 29 when only those present in at least 11 species were considered.

Apart from trees based on protein sequences, we obtained the tree based on the 16S rDNA. The four methods used (see Materials and Methods) resulted in placing Xanthomonadales in the β-Proteobacteria branch. For the endosymbiotic bacteria almost all methods retrieved a monophyletic group (see Fig. 4 ML tree), although for the Galtier and Gouy distance this relationship was paraphyletic (see Fig. 4 NJ tree). Other taxa like Vibrio cholerae or Pseudomonas aeruginosa also had a variable position depending on the model or method used (Fig. 4 and data not shown).

TABLE 2. Summary of tests for topologies. For each test (1sKH, one-sided Kishino-Hasegawa; SH, Shimodaira-Hasegawa; ELW, expected-weighted likelihoods), the percentage of cases in which the species tree topology was rejected is presented. The last column indicates the number of cases in which the species and gene tree topologies are identical.

Group                    1sKH (%)   SH (%)   ELW (%)   No. of genes
ALL (579 genes)          30.4       29.5     31.4      43
21 spp. (200 genes)      30.5       29       32        3
<21 spp. (379 genes)     30.3       29.8     31.1      40

Phylogenomic Analysis

Concatenation Analysis.—We obtained an alignment of 52,029 amino acid positions, 40,576 of which were variable and 32,921 parsimony informative, from the concatenation of the '200-gene' data set. The phylogenetic analysis resulted in a common topology for all the reconstruction methods used, with high support values for most of the nodes and methods (Fig. 5). This topology agreed with the reference tree established by the preliminary analysis (see above) and the phylogenetic reconstruction in Gil et al. (2003) as well as the most commonly accepted phylogenetic history for these species. In consequence, we adopted this topology as the
presumed species tree on which to base further comparisons. This topology included a monophyletic clade with the insect endosymbionts, which received high support from the three different methods used (bootstrap analysis for maximum likelihood and parsimony, and quartet-puzzling maximum likelihood). This high support was common to most nodes in this tree; in fact, there were only two cases in which one method provided support lower than 70%. One corresponded to a large group of γ-Proteobacteria, including the Enterobacteriales, H. influenzae, and P. multocida, which was recovered with only 69% support by quartet-puzzling analysis. The second case was the bootstrap support (59%) received by the three β-Proteobacteria using parsimony analysis. In both cases, as well as for all other nodes, the bootstrap support of maximum likelihood analysis equaled 100%. It is also remarkable that the Xanthomonadales were placed at the base of γ-Proteobacteria with very high support with the three methods used.

FIGURE 4. Phylogenies of the 16S rDNA gene inferred using maximum likelihood inference and neighbor joining with Galtier and Gouy distance. Only bootstrap values higher than 80% are shown. Divisions of Proteobacteria different from the gamma division are indicated next to the corresponding species.

We also obtained the same topology in two other concatenate analyses. Firstly, we used the same concatenate alignment of the '200-gene' data set but removed all the positions with those amino acids most influenced by a biased GC content (Singer and Hickey, 2000). After removal of positions with amino acids FYIMNK, there were 16,772 positions left for analysis. We retrieved the same topology with even higher support for the nodes (data not shown). Secondly, we also carried out the analysis of the concatenated alignment of the whole data set ('579-gene' data set), incorporating "missing data" as necessary, and again obtained the same topology as the presumed species tree. The resulting concatenate included 137,301 positions with an average of 14.59% sites encoded as "missing" per genome (excluding Blochmannia floridanus).

Consensus and Supertree Analyses.—Two kinds of methods dealing with the gene trees were used. Trees derived from the '200-gene' data set were analyzed by traditional majority rule consensus and supertree approaches. Both analyses recovered the same topology (Fig. 6), which deviated from the presumed species tree in one of the nodes. This divergence placed the Xanthomonadales group and P. aeruginosa with the β-Proteobacteria. These alternative groupings received relatively low bootstrap support and, in the most frequent alternative to the majority rule consensus tree, P. aeruginosa returned to its position in the presumed species tree (data not shown). Although the remaining clusters in the consensus were coincident with those in the presumed species tree, including the monophyly of insect endosymbionts, the frequency of each clade in the source trees was generally low. Even well-defined groups such as free-living Enterobacteria received lower support than expected.

For the remaining gene trees ('379-gene' data set) from the B. floridanus genome, we applied only the supertree
approach due to the heterogeneity in the number of taxa represented in each tree. The four algorithms used resulted in the same topology (Fig. 6). This topology was very close to the consensus and supertree topologies obtained from the '200-gene' data set (Fig. 6). The Xanthomonadales group was also outside the γ-Proteobacteria but P. aeruginosa was recovered in its "natural" group.

FIGURE 6. Alternative topologies to the one shown in Figure 5 found in the analyses of 200 genes by consensus and supertree methods and in the 379-gene supertree. Pseudomonas is retrieved in different positions in the 200-gene (dashed line) and 379-gene (dotted line) analyses. Percentage of cases in the majority rule tree (above) and bootstrap support values for the 200-gene supertree (below) supporting each node are indicated.

Finally, we also obtained a supertree from the complete phylome (579 gene trees) of B. floridanus. Interestingly, the topology obtained by heuristic search using the MRP algorithm agreed with the presumed species tree (Fig. 5). Therefore, the complete phylome analysis recovered the position of Xanthomonadales as γ-Proteobacteria, although with low bootstrap support. In fact, the bootstrap consensus tree (obtained after 100 pseudoreplicates of 579 gene trees analyzed by MRP) again showed Xanthomonadales grouping with β-Proteobacteria. This is a clear indication of the presence of conflicting phylogenetic signals in this data set.

FIGURE 5. Common topology obtained from the analysis of the 200-gene concatenate and the 579-gene supertree. This topology was adopted as the presumed species tree. Values above the nodes indicate support values for the 200-gene concatenate obtained by maximum likelihood (bootstrapping), maximum parsimony (bootstrapping), and maximum likelihood (quartet-puzzling). Values below the nodes indicate bootstrap support for the 579-gene supertree. The asterisk indicates the node that appears in a different position in the 579-gene bootstrap consensus tree. Divisions of Proteobacteria different from the gamma are indicated next to the corresponding species.

Presumed Species Tree versus Gene Trees: Likelihood-Based Test of Topologies

We next tested whether the presumed species tree was significantly worse than each gene tree in the B. floridanus phylome. Table 2 summarizes the proportion of
cases in which the former was rejected for the three likelihood-based tests used. As expected, the SH test was more conservative than the other two, although the results of the three tests were quite similar. Globally we observed a 30% rate of rejections with no noticeable difference between genes present in all ('200-gene' data set) or absent from some of the 21 genomes ('379-gene' data set). We also found slight differences in the rate of rejection between informational and operational genes (21.4% and 32.5%, respectively).

DISCUSSION

Comparison of Methods: From Phylogenetics to Phylogenomics

We have used complete genome sequences of several Proteobacteria to analyze the phylogenetic relationships of a group of Bacteria characterized by their endosymbiotic relationship with different insects. Starting from the genome of one of these bacteria, Blochmannia floridanus, the endosymbiont of carpenter ants, we have obtained the phylogenetic trees for every protein-coding gene in this species, by using single-gene, standard phylogenetic methods, and we have also analyzed this data set with phylogenomic approaches.

Single-gene phylogenies are representative of traditional phylogenetic analyses. They have several advantages over multigene approaches. Taxon sampling is widest and the acquisition of raw data in the laboratory and from public databases is easy and cheap. However, our analysis with 16S rDNA exemplifies the low robustness of this marker for phylogenetic reconstruction, because different topologies were obtained when the evolutionary model or reconstruction method was changed (see also Herbeck et al., 2004). Furthermore, the analysis of the B. floridanus phylome has revealed the presence of a large number of different topologies from single gene trees, in agreement with other studies (Canbäck et al., 2004). As a case in point, only three of the genes present in all genomes had exactly the same topology as the presumed species tree obtained by the concatenate analysis (Fig. 5). This high variability in reconstructed topologies is the result of limitations in the phylogenetic methods, sampling error due to the limited length of single-gene alignments, the action of evolutionary forces, and model misspecification that may result in single-gene phylogenies not coincident with the presumed species tree. Therefore, the topological heterogeneity obtained by using a single-gene approach, both for the 16S rDNA and the phylome, does not justify in most cases their use in the inference of the evolutionary relationships among species, especially when whole genome data are available. However, very often these are still the only kind of data available for phylogenetic analysis. In this case, congruence with the known taxonomy and convergence with other phylogenetic markers available are reassuring, but incongruence does not necessarily mean that a whole new evolutionary history has to be envisioned. Furthermore, the joint analysis of many single-gene phylogenies provides a better, more accurate picture of the evolutionary history of the whole genome, in which different evolutionary trajectories may be revealed once noisy and/or unreliable signals are identified and considered appropriately.

On the other hand, phylogenomic methods, which can be divided into supermatrix, consensus, and supertree approaches, have their own problems (Bininda-Emonds and Sanderson, 2001; Bininda-Emonds et al., 2002; Bininda-Emonds, 2004b). Although the concatenates are most useful for retrieving a main phylogenetic signal, techniques based on the analysis of underlying gene phylogenies may reveal the divergent signals encoded in the single genes that are usually ignored by concatenate analysis (see Gatesy and Baker, 2005).

In this study we have used the traditional consensus (with the '200-gene' data set) and supertree (with the '200-gene,' '379-gene,' and '579-gene' data sets) approaches. Except for the complete phylome supertree, all the methods agreed in placing the Xanthomonadales with the β-Proteobacteria due to the high incidence of gene phylogenies with this placement instead of the expected gamma placement. This case has been studied in detail elsewhere (Comas et al., 2006a). Deviations in supertrees from the presumed species tree could arise from two methodological sources: errors due to heterogeneity in the taxa used for analysis or differences in their most common topological position in the individual trees (Creevey, 2004). However, the former problem can be excluded in this case because we have worked with a very homogeneous data set and the conflicting result appears even in the two analyses based on the '200-gene' trees that are common to all the genomes considered. Thus, the three different supertree analyses ('200-gene,' '379-gene,' and '579-gene') have revealed a large amount of conflicting signals in these genes about the phylogenetic positioning of Xanthomonadales. This group is also the one responsible for most discrepancies between the consensus and supertree approaches, especially over the support values of the nodes. It has to be noted that consensus trees are summaries of the observed clades in the underlying gene trees, all obtained from the complete set of the same number of taxa, whereas support values in a supertree analysis are not a mere recount of observed clades but a measure of the support, derived from bootstrap resampling (Burleigh et al., 2006), that the source gene trees provide to their most compatible topology.

Additionally, the use of supertrees is suitable for many situations where supermatrix or consensus may be limited or nonapplicable. For example, in cases where the evolutionary distance among taxa is very high and the amount of shared genetic information is very low (Sanderson and Driskell, 2003), it is highly likely that sparse data matrices are obtained, leading to concatenate alignments with many missing characters and individual trees with incomplete sets of taxa. Furthermore, supertrees are useful in cases, such as this work, where more than one phylogenetic signal is present because they can reveal alternative signals appearing in single gene phylogenies. Finally, supertrees stand as the best choice method when phylogenetic data sources are heterogeneous; for instance, when combining phylogenies based on morphological and molecular data.

The other phylogenomic method commonly used is concatenation of shared sequences. With this approach we derived a phylogenetic tree that was identical to the reference tree and in agreement with the most commonly accepted phylogeny for the taxa considered. This led us to adopt it as the "presumed species tree," a hypothesis that received further support when alternative reconstruction methods were applied. But the concatenation approach cannot be considered free from errors and/or biases, the most important one being the assumption that a single, main phylogenetic signal exists and that it will be revealed by the simultaneous consideration of as many data as possible, regardless of their (evolutionary) congruence in terms of history or mode of evolution (Daubin et al., 2002).

Further problems arise in the concatenation approach because the number of genes shared in all the genomes decreases when the evolutionary distance between taxa increases (Charlebois and Doolittle, 2004). In this case we have used only a subset of 200 genes from the 579 present in the B. floridanus phylome. Although we have analyzed the concatenate of the complete genome, this approach also presents some problems, mainly the large number of missing characters in the supermatrix and the heavy computational load it imposes, with serious limitations on the methods and programs that can be used for its analysis. At least for the taxa we have studied, the subset of 200 common genes seems to be enough for recovering the main phylogenetic signal. In fact, even smaller subsets are still able to recover it, but in these cases there is more uncertainty in the inferred trees since these depend on the specific genes, and their evolutionary histories, chosen for analysis (Comas et al., 2006b).

Contrary to the approaches based on gene trees, the concatenation methodology seems to favor mainly one signal, the most common or least incongruent with all the gene partitions in the alignment. The presence in the concatenates of genes that, when analyzed separately, reject the reference tree implies that incongruent or conflicting phylogenetic signals are hidden at least in alignments corresponding to a large fraction of the genomes (Gatesy and Baker, 2005). Despite these potential problems, particularly important in bacterial phylogenetics, concatenation is the most popular phylogenomic method. Tools like the Microbial Genome Database for Comparative Analysis (Uchiyama, 2003) allow recovering the shared gene content from the growing number of completely sequenced bacterial genomes, thus enabling an easier and faster analysis of concatenates. Similarly, recent works propose algorithms for obtaining maximum concatenated sequences from databases (Driskell et al., 2004) and the analysis of the required number of genes (Rokas et al., 2003) or the impact of missing data to obtain a good phylogeny (Philippe et al., 2004). These works are allowing the formalization of a concatenation methodology.

As we have shown for this particular data set, supertree and supermatrix methods allow exploring all the phylogenetic signals contained in a bacterial genome. Both analyses are valid, complementary starting points in the evaluation of the incidence of events such as vertical transmission, horizontal gene transfers, duplications, or phylogenetic noise. Supermatrix methods are a double-edged sword, because they allow recovering the strongest phylogenetic signal even if the supermatrix is composed of alignments where this signal is hidden; i.e., it is not the strongest one (Gatesy 1999, Gatesy and Baker, 2005), but, at the same time, this could result in the masking of other alternative signals. The presence of these alternative signals is more easily revealed by analyses based on the underlying gene topologies. As shown in this work, these signals usually arise in a supertree or consensus framework in the form of incongruence around some taxa. For instance, the supermatrix approach failed to reveal the phylogenetic incongruence around the position of Xanthomonadales at the base of the γ-Proteobacteria tree (Fig. 2), whereas both the '200-gene' tree consensus and the supertree of the '379- and 579-gene' trees placed them as β-Proteobacteria or as a low-supported γ-Proteobacteria clade (Comas et al., 2006a).

Many factors have to be considered in selecting a methodology for whole genome phylogenetic analyses. We have shown that for intermediate phylogenetic depths, it is possible to recover the primary phylogenetic signal of the genomes through the concatenation of shared genetic content, although caution must be taken because usually a single model of evolution is applied to the whole alignment; only recently has the use of mixed models begun to be evaluated (Pagel and Meade, 2004). Meanwhile, the supertree approach seems to be better suited to reveal conflicting gene tree signals (i.e., the positioning of Xanthomonadales) and when data are sparse (Sanderson and Driskell, 2003), which is usually the case for large evolutionary distances. One such example is the supertree obtained from 730 gene trees of 45 organisms from the three domains of life (Daubin et al., 2002). The gene tree approach is suitable when working with species without complete genome sequences or for fast, exploratory phylogenetic analyses. These techniques are not incompatible but complementary, allowing the detection of possible sources of error and pointing towards interesting evolutionary problems. As we have shown, the use of different methodologies, from single-gene to genome-scale phylogenies, allows the identification of the different phylogenetic signals encoded in bacterial genomes.

The Phylogenetic Landscape of Proteobacteria: The Placement of Xanthomonadales and Insect Endosymbionts

Previous reports and this phylogenomic analysis of Proteobacteria have identified two main problems in the phylogeny of these bacteria. On one hand, the monophyletic or polyphyletic origin of insect endosymbionts has generated a debate. Most studies using a multigene approach have placed this group as a monophyletic clade sister to the YESS cluster (Lerat et al., 2003; Canbäck et al., 2004). However, studies with single-gene phylogenies have pointed out that the group is not monophyletic (Charles et al., 2001; Herbeck et al., 2004), with the Buchnera clade as sister of the cluster YESS and suggesting that the other two endosymbionts (B. floridanus and Wigglesworthia brevipalpis) could have an independent origin related with secondary endosymbionts. On the other hand, the concatenate analysis, although presenting a strong primary phylogenetic signal, revealed some secondary topologies with incongruence in the placement of Xanthomonadales. In consequence, the topologies obtained in this study have been analyzed and discussed based on the position and evolutionary relatedness of the Xanthomonadales and insect endosymbionts groups.

The most frequently used phylogenetic marker in bacterial systematics is 16S rDNA (Woese, 1987). In this particular case, the 16S rDNA topology was not robust to different methods of phylogenetic reconstruction, as the result obtained depended on the method and model of evolution used for analysis. However, despite these discrepancies, all 16S rDNA topologies clustered the Xanthomonadales with the β-Proteobacteria. A screening of the gene trees reveals that the three Xanthomonadales usually form a monophyletic group whose placement in the Proteobacteria tree is unclear (see also Comas et al., 2006a).

As we have described, all the concatenate analyses revealed a main phylogenetic signal in which the Xanthomonadales are placed at the base of γ-Proteobacteria. However, a secondary phylogenetic signal was also observed. This signal also appeared in some single gene phylogenies in which the Xanthomonadales were excluded from the gamma clade. Therefore, although the use of concatenates of different numbers of genes (60, 200, and 579) has allowed us to establish the main phylogenetic signal, this approach also indicated the existence of alternative signals through different analyses evaluating the support for the main topology. Obviously, the larger the number of genes incorporating the alternative signal, the higher the support obtained for conflicting phylogenies.

Moreover, this secondary signal appeared more strongly in the consensus and supertree analyses. The majority rule consensus tree and the supertree of the 200 common genes, as well as the supertree obtained from the other 379 gene trees, again placed the Xanthomonadales with the β-Proteobacteria. These methods are only indirectly related to the sequence data because they are based on the gene trees derived from them. Because of this, they reflect what was already hinted at in the single-gene analysis, that an important fraction of genes exclude the Xanthomonadales from the gamma subtree, thus revealing the existence of either nonvertical evolutionary events, or methodological problems, or both, in the placement of Xanthomonadales in the source trees (but see Comas et al., 2006a).

The evolutionary relationship of γ-Proteobacteria endosymbionts remains a controversial issue. The first attempts to address this problem using single-gene phylogenies produced conflicting results. For example, Heddi et al. (1998) and Schroder et al. (1996) supported the common ancestor hypothesis whereas Charles et al. (2001) defended a paraphyletic origin for insect endosymbionts. More recently, even with the availability of genomic data, the same conflicting results have been reproduced. Three other phylogenomic analyses support the common ancestor hypothesis: Gil et al. (2003) obtained this result after maximum likelihood and Bayesian analyses of 61 conserved proteins involved in translation; Lerat et al. (2003) obtained it from the common core of proteins, those present in the common ancestor and not involved in lateral transfers, for 13 γ-Proteobacteria species including one Buchnera and one Wigglesworthia species; and Canbäck et al. (2004) also examined the relationship of Proteobacteria including two Buchnera and W. brevipalpis using three different methods. These three studies agree with ours in postulating a common ancestor for the endosymbionts and placing them as sister group to the YESS cluster.

However, two other studies have reached different conclusions: Herbeck et al. (2004) analyzed the 16S rDNA of endosymbionts and concluded that GC bias has influenced most published phylogenies. Using Galtier and Gouy's (1995) maximum likelihood method, they evaluated an array of possible phylogenies obtained by standard methods and models as well as other alternatives derived from the permutation of the Buchnera position in these source trees. The topologies preferred by Galtier and Gouy's method were those showing a paraphyletic origin of endosymbionts. This same strategy was used with the groEL gene and they reached the same conclusion. Although the two single-gene analyses supported the paraphyly of γ-Proteobacteria insect endosymbionts, the corresponding topologies were incongruent and some taxa appeared in unlikely positions. A similar result has been obtained by Belda et al. (2005) using an alternative method for phylogeny reconstruction based on whole genome analysis. These authors have analyzed the number of pairwise genome rearrangements and gene order from 244 genes common to 30 γ-Proteobacterial genomes. Their analysis concluded that insect endosymbionts do not form a monophyletic cluster, with B. floridanus being closer to Escherichia coli and separate from Wigglesworthia and Buchnera, genera that do not show a common origin themselves. Nevertheless, the lack of monophyly for insect endosymbionts is the only common result from these two studies, because the retrieved topologies are not compatible.

All our phylogenetic and phylogenomic analyses recover a single clade for endosymbionts, with the notable exception of the 16S rDNA analysis with the Galtier and Gouy model, which corrects for GC content biases. In consequence, we have addressed the possible influence of the GC content in the topologies recovered with the aim of discarding a convergence artifact in the rest of the analyses. Although we have based our phylome analysis on amino acid sequences, these are not free from base composition artifacts (Foster and Hickey, 2002). From the concatenate alignment of the 200 proteins common to all the genomes, we removed those positions most likely to be influenced by GC bias. The trees obtained with two different methods were coincident and identical to the species tree shown in Fig. 5. Furthermore, the support values obtained for each node were even higher. Consequently, our results are robust to GC bias effects and they support the existence of a common ancestor for insect endosymbionts.

The Genomic Core and the Interpretation of Incongruence in Bacterial Genomes

One controversial issue in current evolutionary genomics studies of bacteria is the existence of a core of genes that share a common (i.e., vertical) evolutionary history (Makarova et al., 1999; Nesbø et al., 2001; Bapteste et al., 2005; Susko et al., 2006). It has been argued that the vast majority of components of this core belong to the informational category of genes. Most of these genes are important because of their central role in essential metabolic pathways or because of their high connectivity in terms of macromolecular interactions. The consequence of this division is that informational genes are less prone to horizontal gene transfer (HGT) than operational ones (Jain et al., 1999).

One remarkable result from our phylogenetic analyses is that a substantial fraction (21.4%) of informational genes reject the presumed species tree, whereas for operational ones the corresponding value is not much larger (32.5%). To prove the existence of a genomic core, Daubin et al. (2002) also computed the Robinson-Foulds distances for all pairs of gene trees from an initial pool of 310 trees. Next, they represented these distances in a two-dimensional space by means of principal coordinates analysis and chose as a core those gene trees in the highest density zone. Using this method they selected 121 genes to establish their set of core genes. Their most relevant result is that nearly half of these genes were operational. Therefore, it seems that the division between operational and informational genes is not adequate to explain the existence of incongruences and of a core of genes in bacterial genomes. These results call for some caution when selecting genes for phylogenetic analysis on the basis of their functional category assignment (see also Comas et al., 2006b). In fact, in our concatenate analysis we have shown that the presence of genes that reject the presumed species tree and the mixture of operational and informational genes are not sufficient obstacles to recover the main evolutionary history of these species.

It is also relevant to note that most of the genes we have analyzed are present in the genomes as single-copy genes. This means that possible incongruences not related with methodological problems between gene and presumed species trees for these genes must be classified as cases of nonorthologous gene displacement (horizontal gene transfer, hidden paralogies). Lerat et al. (2003) carried out a similar approach to estimate the incidence of incongruence in their data set of 206 common genes derived from 13 γ-Proteobacteria and detected only two cases of HGT (but see Susko et al., 2006). The wider taxon sampling used in this analysis has led us to detect more cases of incongruence with a similar number of gene trees.

However, it is worth noting that our topology tests do not directly address the source of incongruence, which is analyzed in more detail elsewhere (Comas et al., 2006a); they are aimed only at revealing the importance of considering the different phylogenetic signals present in bacterial genomes. This study, which is mainly centred on analyzing the evolutionary relationships of endosymbionts, has not been designed to specifically detect horizontal gene transfers. Endosymbionts have not undergone gene exchange since they established their close intracellular relationship with the insect hosts (Akman et al., 2002; Tamas et al., 2002; Gil et al., 2003; van Ham et al., 2003). However, our analysis has pointed towards taxa where phylogenetic incongruence is particularly pervasive. A closer look at Xanthomonadales (Comas et al., 2006a) has allowed distinguishing between phylogenetic noise and nonvertical evolutionary events, and similar analyses (Beiko et al., 2005) might also be revealing of the complexities of the evolutionary processes in bacterial lineages.

ACKNOWLEDGEMENTS

We express our gratitude to the associate editor, Ron DeBry, and the reviewers, Amy Briskell and James McInerney, for their many valuable comments and suggestions that have greatly helped us. This work has been supported by projects BMC2003-00305 from Spanish Ministerio de Ciencia y Tecnología and Grupos 03/2004 from Generalitat Valencia. IC is recipient of a predoctoral fellowship from the Spanish Ministerio de Educación y Ciencia.

REFERENCES

Akman, L., A. Yamashita, H. Watanabe, K. Oshima, T. Shiba, M. Hattori, et al. 2002. Genome sequence of the endocellular obligate symbiont of tsetse flies, Wigglesworthia glossinidia. Nat. Genet. 32:402–407.
Altschul, S. E., T. L. Madden, A. A. Schäffer, J. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.
Asai, T., D. Zaporojets, C. Squires, and C. L. Squires. 1999. An Escherichia coli strain with all chromosomal rRNA operons inactivated: Complete exchange of rRNA genes between bacteria. Proc. Natl. Acad. Sci. USA 96:1971–1976.
Bapteste, E., H. Brinkmann, J. A. Lee, D. V. Moore, C. W. Sensen, P. Gordon, et al. 2002. The analysis of 100 genes supports the grouping of three highly divergent amoebae: Dictyostelium, Entamoeba, and Mastigamoeba. Proc. Natl. Acad. Sci. USA 99:1414–1419.
Bapteste, E., E. Susko, J. Leigh, D. Macleod, R. L. Charlebois, and W. F. Doolittle. 2005. Do orthologous gene phylogenies really support tree-thinking? BMC Evol. Biol. 5:e33.
Baum, B. R. 1992. Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees. Taxon 41:3–10.
Beiko, R. G., T. J. Harlow, and M. A. Ragan. 2005. Highways of gene sharing in prokaryotes. Proc. Natl. Acad. Sci. USA 102:14332–14337.
Belda, E., A. Moya, and F. J. Silva. 2005. Genome rearrangement distances and gene order phylogeny in γ-Proteobacteria. Mol. Biol. Evol. 22:1456–1467.
Benson, D. A., I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler. 2004. GenBank: Update. Nucleic Acids Res. 32:D23–D26.
Bininda-Emonds, O. R. P. 2004a. New uses for old phylogenies. Pp. 3–14 in Phylogenetic supertrees: Combining information to reveal the tree of life (O. R. P. Bininda-Emonds, ed.). Kluwer Academic Publishers, Dordrecht.
Bininda-Emonds, O. R. P. 2004b. Trees versus characters and the supertree/supermatrix "paradox." Syst. Biol. 53:356–359.
Bininda-Emonds, O. R. P., J. L. Gittleman, and M. A. Steel. 2002. The (super)tree of life: Procedures, problems, and prospects. Annu. Rev. Ecol. Syst. 33:265–289.
Bininda-Emonds, O. R. P., and M. J. Sanderson. 2001. Assessment of the accuracy of matrix representation with parsimony analysis supertree construction. Syst. Biol. 50:565–579.
Brochier, C., E. Bapteste, D. Moreira, and H. Philippe. 2002. Eubacterial phylogeny based on translational apparatus proteins. Trends Genet. 18:1–5.
Brown, J. R., C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope. 2001. Universal trees based on large combined protein sequence data sets. Nat. Genet. 28:281–285.
Burleigh, J., A. C. Driskell, and M. Sanderson. 2006. Supertree bootstrapping methods for assessing phylogenetic variation among genes in genome-scale data sets. Syst. Biol. 55:426–440.
Canbäck, B., I. Tamas, and S. G. E. Andersson. 2004. A phylogenomic study of endosymbiotic bacteria. Mol. Biol. Evol. 21:1110–1122.
Castresana, J. 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17:540–552.
Charlebois, R. L., and W. F. Doolittle. 2004. Computing prokaryotic gene ubiquity: Rescuing the core from extinction. Genome Res. 14:2469–2477.
Charles, H., A. Heddi, and Y. Rahbe. 2001. A putative insect intracellular endosymbiont stem clade, within the , inferred from phylogenetic analysis based on a heterogeneous model of DNA evolution. Comptes Rendus de l'Academie des Sciences, Series III–Sciences de la Vie 324:489–494.

Comas, I., A. Moya, R. K. Azad, J. G. Lawrence, and F. González-Candelas. 2006a. The evolutionary origin of Xanthomonadales genomes and the nature of the horizontal gene transfer process. Mol. Biol. Evol. 23:2049–2057.
Comas, I., A. Moya, and F. González-Candelas. 2006b. Phylogenetic signal and functional categories in Proteobacteria genomes. BMC Evol. Biol. In press.
Creevey, C. J. 2004. Clann: Construction of supertrees and exploration of phylogenomic information from partially overlapping datasets. Program manual. Available at http://bioinf.may.ie/software/clann/.
Creevey, C. J., D. A. Fitzpatrick, G. K. Philip, R. J. Kinsella, M. J. O’Connell, M. M. Pentony, et al. 2004. Does a tree-like phylogeny only exist at the tips in the prokaryotes? Proc. R. Soc. Lond. B 271:2551–2558.
Creevey, C. J., and J. O. McInerney. 2005. Clann: Investigating phylogenetic information through supertree analyses. Bioinformatics 21:390–392.
Daubin, V., M. Gouy, and G. Perriere. 2002. A phylogenomic approach to bacterial phylogeny: Evidence of a core of genes sharing a common history. Genome Res. 12:1080–1090.
Delsuc, F., H. Brinkmann, and H. Philippe. 2005. Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. 6:361–375.
Doolittle, W. F. 1999. Phylogenetic classification and the universal tree. Science 284:2124–2128.
Driskell, A. C., C. Ane, J. G. Burleigh, M. M. McMahon, B. C. O’Meara, and M. J. Sanderson. 2004. Prospects for building the tree of life from large sequence databases. Science 306:1172–1174.
Eisen, J. A. 1998. Phylogenomics: Improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 8:163–167.
Eisen, J. A., and C. M. Fraser. 2003. Phylogenomics: Intersection of evolution and genomics. Science 300:1706–1707.
Felsenstein, J. 2005. PHYLIP: Phylogenetic Inference Package. Release 3.6. Department of Genome Sciences, University of Washington, Seattle.
Foster, P. G., and D. A. Hickey. 2002. Compositional bias may affect both DNA-based and protein-based phylogenetic reconstructions. J. Mol. Evol. 48:284–290.
Galtier, N., and M. Gouy. 1995. Inferring phylogenies from DNA sequences of unequal base compositions. Proc. Natl. Acad. Sci. USA 92:11317–11321.
Galtier, N., M. Gouy, and C. Gautier. 1996. SEAVIEW and PHYLO_WIN: Two graphic tools for sequence alignment and molecular phylogeny. Comput. Appl. Biosci. 12:543–548.
Gatesy, J., and R. H. Baker. 2005. Hidden likelihood support in genomic data: Can forty-five wrongs make a right? Syst. Biol. 54:483–492.
Gatesy, J., C. Matthee, R. DeSalle, and C. Hayashi. 2002. Resolution of a supertree/supermatrix paradox. Syst. Biol. 51:652–664.
Gatesy, J., M. Milinkovitch, V. Waddell, and M. Stanhope. 1999. Stability of cladistic relationships between Cetacea and higher-level artiodactyl taxa. Syst. Biol. 48:6–20.
Gil, R., F. J. Silva, J. Pereto, and A. Moya. 2004. Determination of the core of a minimal bacterial gene set. Microbiol. Mol. Biol. Rev. 68:518–537.
Gil, R., F. J. Silva, E. Zientz, F. Delmotte, F. González-Candelas, A. Latorre, et al. 2003. The genome sequence of Blochmannia floridanus: Comparative analysis of reduced genomes. Proc. Natl. Acad. Sci. USA 100:9388–9393.
Goldman, N., J. P. Anderson, and A. G. Rodrigo. 2000. Likelihood-based tests of topologies in phylogenetics. Syst. Biol. 49:652–670.
Gontcharov, A. A., B. Marin, and M. Melkonian. 2004. Are combined analyses better than single gene phylogenies? A case study using SSU rDNA and rbcL sequence comparisons in the Zygnematophyceae (Streptophyta). Mol. Biol. Evol. 21:612–624.
Gu, X., and H. Zhang. 2004. Genome phylogenetic analysis based on extended gene contents. Mol. Biol. Evol. 21:1401–1408.
Guindon, S., and O. Gascuel. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52:696–704.
Heddi, A., H. Charles, C. Khatchadourian, G. Bonnot, and P. Nardon. 1998. Molecular characterization of the principal symbiotic bacteria of the weevil Sitophilus oryzae: A peculiar G + C content of an endocytobiotic DNA. J. Mol. Evol. 47:52–61.
Herbeck, J. T., P. H. Degnan, and J. J. Wernegreen. 2004. Nonhomogeneous model of sequence evolution indicates independent origins of primary endosymbionts within the Enterobacteriales (γ-Proteobacteria). Mol. Biol. Evol. 22:520–532.
Herniou, E. A., T. Luque, X. Chen, J. M. Vlak, D. Winstanley, J. S. Cory, et al. 2001. Use of whole genome sequence data to infer Baculovirus phylogeny. J. Virol. 75:8117–8126.
Huson, D. H., and M. Steel. 2004. Phylogenetic trees based on gene content. Bioinformatics 20:2044–2049.
Jain, R., M. C. Rivera, and J. A. Lake. 1999. Horizontal gene transfer among genomes: The complexity hypothesis. Proc. Natl. Acad. Sci. USA 96:3801–3806.
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8:275–282.
Kishino, H., and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol. 29:170–179.
Korbel, J., B. Snel, M. Huynen, and P. Bork. 2002. SHOT: A web server for the construction of genome phylogenies. Trends Genet. 18:158–162.
Kumar, S., K. Tamura, and M. Nei. 2004. MEGA3: Integrated software for molecular evolutionary genetics analysis and sequence alignment. Brief. Bioinform. 5:150–163.
Kurland, C. G., B. Canbäck, and O. G. Berg. 2003. Horizontal gene transfer: A critical view. Proc. Natl. Acad. Sci. USA 100:9658–9662.
Lerat, E., V. Daubin, and N. Moran. 2003. From gene trees to organismal phylogeny in prokaryotes: The case of γ-proteobacteria. PLoS Biol. 1:1–9.
Loomis, W. F., and D. W. Smith. 1990. Molecular phylogeny of Dictyostelium discoideum by protein sequence comparison. Proc. Natl. Acad. Sci. USA 87:9093–9097.
Makarova, K. S., L. Aravind, M. Y. Galperin, N. V. Grishin, R. L. Tatusov, Y. I. Wolf, et al. 1999. Comparative genomics of the Archaea (Euryarchaeota): Evolution of conserved protein families, the stable core, and the variable shell. Genome Res. 9:608–628.
Moreira, D., and H. Philippe. 2000. Molecular phylogeny: Pitfalls and progress. Int. Microbiol. 3:9–16.
Nesbø, C. L., Y. Boucher, and W. F. Doolittle. 2001. Defining the core of nontransferable prokaryotic genes: The euryarchaeal core. J. Mol. Evol. 53:340–350.
O’Brien, S. J., and R. Stanyon. 1999. Phylogenomics: Ancestral primate viewed. Nature 402:365–366.
Pagel, M., and A. Meade. 2004. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst. Biol. 53:571–581.
Philippe, H., E. A. Snell, E. Bapteste, P. Lopez, P. W. H. Holland, and D. Casane. 2004. Phylogenomics of eukaryotes: Impact of missing data on large alignments. Mol. Biol. Evol. 21:1740–1752.
Phillips, M. J., F. Delsuc, and D. Penny. 2004. Genome-scale phylogeny and the detection of systematic biases. Mol. Biol. Evol. 21:1455–1458.
Posada, D., and K. A. Crandall. 1998. ModelTest: Testing the model of DNA substitution. Bioinformatics 14:917–918.
Ragan, M. A. 1992. Matrix representation in reconstructing phylogenetic relationships among the eukaryotes. Biosystems 28:47–55.
Rokas, A., B. L. Williams, N. King, and S. B. Carroll. 2003. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425:798–804.
Ronquist, F., and J. P. Huelsenbeck. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572–1574.
Saitou, N., and M. Nei. 1987. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–425.
Sanderson, M. J., and A. C. Driskell. 2003. The challenge of constructing large phylogenetic trees. Trends Plant Sci. 8:374–379.
Sanderson, M. J., A. Purvis, and C. Henze. 1998. Phylogenetic supertrees: Assembling the trees of life. Trends Ecol. Evol. 13:105–109.
Sanderson, M. J., and H. B. Shaffer. 2002. Troubleshooting molecular phylogenetic analyses. Annu. Rev. Ecol. Syst. 33:49–72.
Schmidt, H. A., K. Strimmer, M. Vingron, and A. von Haeseler. 2002. TREE-PUZZLE: Maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18:502–504.

Schroder, D., H. Deppisch, M. Obermayer, G. Krohne, E. Stackebrandt, B. Holldobler, et al. 1996. Intracellular endosymbiotic bacteria of Camponotus species (carpenter ants): Systematics, evolution and ultrastructural characterization. Mol. Microbiol. 21:479–489.
Shimodaira, H., and M. Hasegawa. 1999. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol. Biol. Evol. 16:1114–1116.
Sicheritz-Pontén, T., and S. G. E. Andersson. 2001. A phylogenomic approach to microbial evolution. Nucleic Acids Res. 29:545–552.
Singer, G. A. C., and D. A. Hickey. 2000. Nucleotide bias causes a genomewide bias in the amino acid composition of proteins. Mol. Biol. Evol. 17:1581–1588.
Snel, B., P. Bork, and M. A. Huynen. 1999. Genome phylogeny based on gene content. Nat. Genet. 21:108–110.
Strimmer, K., and A. Rambaut. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. R. Soc. Lond. Ser. B 269:137–142.
Strobel, G. L., and J. Arnold. 2004. Essential eukaryotic core. Evolution 58:441–446.
Susko, E., J. Leigh, W. F. Doolittle, and E. Bapteste. 2006. Visualizing and assessing phylogenetic congruence of core gene sets: A case study of the γ-Proteobacteria. Mol. Biol. Evol. 23:1019–1030.
Swofford, D. L. 2002. PAUP*. Phylogenetic analysis using parsimony (*and other methods). Release 4.0beta. Sinauer Associates, Sunderland, Massachusetts.
Tamas, I., L. Klasson, B. Canbäck, A. K. Naslund, A. S. Eriksson, J. J. Wernegreen, et al. 2002. 50 million years of genomic stasis in endosymbiotic bacteria. Science 296:2376–2379.
Tamura, K., and S. Kumar. 2002. Evolutionary distance estimation under heterogeneous substitution pattern among lineages. Mol. Biol. Evol. 19:1727–1736.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.
Uchiyama, I. 2003. MBGD: Microbial genome database for comparative analysis. Nucleic Acids Res. 31:58–62.
van Ham, R. C. H. J., J. Kamerbeek, C. Palacios, C. Rausell, F. Abascal, U. Bastolla, et al. 2003. Reductive genome evolution in Buchnera aphidicola. Proc. Natl. Acad. Sci. USA 100:581–586.
Woese, C. R. 1987. Bacterial evolution. Microbiol. Rev. 51:221–271.
Woese, C. R., and G. E. Fox. 1977. Phylogenetic structure of the prokaryotic domain: The primary kingdoms. Proc. Natl. Acad. Sci. USA 74:5088–5090.
Woese, C. R., O. Kandler, and M. L. Wheelis. 1990. Towards a natural system of organisms: Proposal for the Domains Archaea, Bacteria, and Eucarya. Proc. Natl. Acad. Sci. USA 87:4576–4579.
Wolf, Y. I., I. B. Rogozin, N. V. Grishin, R. L. Tatusov, and E. V. Koonin. 2001. Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evol. Biol. 1:8.
Yap, W. H., Z. Zhang, and Y. Wang. 1999. Distinct types of rRNA operons exist in the genome of the Actinomycete Thermomonospora chromogena and evidence for horizontal transfer of an entire rRNA operon. J. Bacteriol. 181:5201–5209.

First submitted 13 May 2005; reviews returned 17 August 2005; final acceptance 19 October 2006
Associate Editor: Ron DeBry

Camponotus ants on bamboo. Photo courtesy of Dr. Heike Feldhaar.