Open Access Method2008WuVolume and 9,Eisen Issue 10, Article R151 A simple, fast, and accurate method of phylogenomic inference Martin Wu* and Jonathan A Eisen*†‡

Addresses: *Genome Center, University of California, One Shields Avenue, Davis, CA 95616, USA. †Section of Evolution and Ecology, College of Biological Sciences, University of California, One Shields Avenue, Davis, CA 95616, USA. ‡Department of Medical Microbiology and Immunology, School of Medicine, University of California, One Shields Avenue, Davis, CA 95616, USA.

Correspondence: Martin Wu. Email: [email protected]

Published: 13 October 2008 Received: 12 August 2008 Revised: 26 September 2008 Genome Biology 2008, 9:R151 (doi:10.1186/gb-2008-9-10-r151) Accepted: 13 October 2008 The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2008/9/10/R151

© 2008 Wu and Eisen; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. logeneticAMPHORA

An automated inference.

for phylogenomic pipeline for analysis phylogenomic analysis (AMPHORA) is presented that overcomes existing limits to large-scale protein phy-

Abstract

The explosive growth of genomic data provides an opportunity to make increased use of protein markers for phylogenetic inference. We have developed an automated pipeline for phylogenomic analysis (AMPHORA) that overcomes the existing bottlenecks limiting large-scale protein phylogenetic inference. We demonstrated its high throughput capabilities and high quality results by constructing a genome tree of 578 bacterial species and by assigning phylotypes to 18,607 protein markers identified in metagenomic data collected from the Sargasso Sea.

Background been well documented that evolutionarily distant SSU rRNA Since the 1970s the use of small subunit (SSU) rRNA (SSU genes that are similar in nucleotide composition have been rRNA) sequences has revolutionized microbial classification, consistently - but nevertheless incorrectly - placed close systematics, and ecology. The SSU rRNA gene has become the together in phylogenetic trees [1,2,1]. Furthermore, inferring most sequenced gene, with hundreds of thousands of its the phylogeny of organisms from any single gene carries some sequences now deposited in public databases. It has become risks and must be corroborated by the use of other phyloge- the current 'gold standard' in microbial diversity studies, and netic markers. Many researchers turned to protein encoding for good reasons. For one, it is present in all microbial organ- genes such as EF-Tu, rpoB, recA, and HSP70 [3]. Because isms. For another, the gene sequence is highly conserved at protein sequences are conserved at the amino acid level both ends. This enables one to obtain nearly full-length SSU instead of at the nucleotide level, phylogenetic analyses of rRNA gene sequences by polymerase chain reaction amplifi- protein sequences are in general less prone to the nucleotide cation using 'universal' primers and without having to isolate compositional bias seen in SSU rRNA [2,4-6]. In addition, the and culture the organism in question. Until very recently, the less constrained variation at the third codon position allows vast majority of microbes were identified and classified only these genes to be used in studies of more closely related by recovering and sequencing their SSU rRNA genes. This organisms. However, because of difficulties in cloning protein single sequence of approximately 1.5 kilobases is often the encoding genes from diverse species, SSU rRNA remained the only information we have about the organism from which it gold standard. came - the only way we know that it exists in the natural environment. The situation changed with the advent of genomic sequenc- ing. Each complete genome sequence brings with it the Although the SSU rRNA gene has been extremely valuable for sequences for all protein encoding genes in that organism. phylogenetic studies, it has its limitations. For example, it has Now, not only can one build gene trees based on a favorite

Genome Biology 2008, 9:R151 http://genomebiology.com/2008/9/10/R151 Genome Biology 2008, Volume 9, Issue 10, Article R151 Wu and Eisen R151.2

protein encoding gene, but also one has the option to concate- accurately generate highly reproducible multiple sequence nate multiple gene sequences to construct trees on the alignments for a set of selected phylogenetic markers. More 'genome level'. Possessing more phylogenetic signals, such importantly, unlike previous automated methods [9] it can 'genome trees' or 'super-matrix trees' are less susceptible to mask the alignments with quality equivalent to that of cura- the stochastic errors than those built from a single gene [7]. tion by humans. Recent studies attempting to reconstruct the tree of life have demonstrated the power of this approach [8,9] (for review The same pipeline can also be applied to metagenomic data [10]). Likewise, genome trees have also been used success- analyses. In metagenomics or environmental genomics, nat- fully to reassess the phylogenetic positions of individual spe- ural populations of microbes are collected from the environ- cies [11,12]. It is worth pointing out, however, that the ment; their DNAs are cloned and directly sequenced. One genome trees are still susceptible to systematic errors caused fundamental goal of metagenomics is to determine who is by compositional biases, unrealistic evolutionary models, and present in the community and what they are doing. Phyloge- inadequate taxonomic sampling [7,13,14]. netic analysis of markers present in these collected samples can be very informative in revealing who is there. If the Despite its demonstrated usefulness, phylogenetic inference marker happens to be part of a larger assembled sequence based on protein markers has been limited in application, fragment, then the entire fragment can be anchored by that mainly because of the formidable technical difficulties inher- marker to a specific taxonomic clade. In this way, environ- ent in this approach. Typically, molecular phylogenetic infer- mental shotgun sequences can be sorted into taxon-specific ence involves three steps: retrieval of homologous sequences, 'bins' in silico, thereby allowing us to determine who is doing creation of multiple sequence alignments, and phylogenetic what. tree construction. Because only characters of common ances- try can be used to infer the evolutionary history, the most crit- The most striking finding to date from this approach was the ical step is sequence alignment, in which sequences are discovery of a proteorhodopsin gene in , a homolog of overlaid horizontally on each other in such a way that, ideally, the bacteriorhodopsin gene previously found only in some each column in the alignment would only contain homolo- archaea. In this case, the gene could be anchored within the gous characters (amino acids or nucleotides). To ensure this bacteria because it was found to be associated with a bacterial positional homology, the alignments must be curated - a SSU rRNA gene [18]. However, because the SSU rRNA gene process that evaluates the probable homology of each column constitutes only a tiny fraction of any genome, the probability or position in the alignments. that any given sequence fragment can be anchored to a spe- cific taxonomic clade by using this one gene is small. Thus, Positions for which the assignment of homology is uncertain phylotyping of metagenomic data can greatly benefit from the are then excluded from further analysis by masking [15]. use of alternative phylogenetic markers such as the multiple Judicious masking increases the signal-to-noise ratio and protein markers described below. often improves the discriminatory power of the phylogenetic methods [16]. Unfortunately, curation requires skilled man- In this paper, we introduce AMPHORA (a pipeline for Auto- ual intervention, thus making it impractical to process suita- Mated PHylogenOmic infeRence) and demonstrate two sig- bly the massive amount of genome sequence data now nificant applications: building a genome tree from 578 available. Frequently, masking is simply ignored. Automated complete bacterial genomes that are available at the time of masking to remove alignment positions that contain gaps or the study and identifying bacterial phylotypes from metagen- that have a low degree of conservation has not been satisfac- omic data collected from the Sargasso Sea. tory. For example, using these criteria and given a set of ad hoc parameters such as the minimum block length, GBLOCKS automatically selects conserved blocks from mul- Results and discussion tiple sequence alignments for phylogenetic analysis. How- The AMPHORA pipeline ever, trees constructed using GBLOCKS-treated alignments Introduction have been found to have dramatically weaker support, possi- With the rapid increase in available genomic sequence data, bly because of excessive removal of informative sites by the there is an ever-urgent need for automated phylogenetic anal- program [17]. In addition, although many programs are avail- yses using protein sequences. However, automation is fre- able to automate the creation of multiple sequence align- quently accompanied by reduced quality. We introduce here ments, their use for the de novo alignment of a large protein a fully automated method that is not only fast but also is of family is still fairly time consuming. high quality. The main components of our approach are shown in Figure 1, and their implementation is described in To overcome these problems, we have developed an auto- detail in the Material and methods section (below). Designed mated pipeline for building concatenated genome trees using to align and trim protein sequences rapidly, reliably, and multiple protein markers, thus making this powerful method reproducibly, AMPHORA eliminates one of the tightest bot- applicable on a larger scale. Our pipeline can rapidly and tlenecks in large-scale protein phylogenetic inference. It can

Genome Biology 2008, 9:R151 http://genomebiology.com/2008/9/10/R151 Genome Biology 2008, Volume 9, Issue 10, Article R151 Wu and Eisen R151.3

Query sequences

seed 1 VKVNLDWIESEIFEKED--PAPFLEHVNGILVPGGFG seed 2 VKVNLDWVESETFEGDEGAAAARLENAHAIMVPGGFG iPhylogenetic Marker Database seed 3 AKVSIRWVDAENVHDEE--AESLLGGVDGILVPGGFG seed 4 ARVKLAFIDSTKLE-EG--DLSDLDKVDAILVPGGFG seed 5 CRVVLTYLDSERIESEG--IGSSFDDIDAILVPGGFG i Marker mask 1111111111100000000000011111111111111 multiple HMM model sequence alignment

Select markers

Align and mask Search against representative seed 1 genomes VKVNLDWIESE---IFEKED--PAPFLEHVNGILVPGGFG seed 2 VKVNLDWVESE---TFEGDEGAAAARLENAHAIMVPGGFG seed 3 AKVSIRWVDAE---NVHDEE--AESLLGGVDGILVPGGFG seed 4 ARVKLAFIDST---KLE-EG--DLSDLDKVDAILVPGGFG seed 5 CRVVLTYLDSE---RIESEG--IGSSFDDIDAILVPGGFG Multiple sequence query 1 TRVDIHWVDSE---KIEERG--AEALLGDCDSVLVAGGFG alignment query 2 TRVNIKWIDSE---ILVDN----LALLYDVDSLLIPGGFG query 3 MKVDIEWIDSEDLEKADDEK--LDEIFNEVSGILVAGGFG query 4 TKVELKWVDSE---KLENME--SSEVFKDVSGILVAGGFG mask 1111111111100000000000000011111111111111

Build masks Trim

query 1 TRVDIHWVDSELGDCDSVLVAGGFG query 2 TRVNIKWIDSELYDVDSLLIPGGFG query 3 MKVDIEWIDSEFNEVSGILVAGGFG HMM query 4 TKVELKWVDSEFKDVSGILVAGGFG

SStStepseps iin bbuildinguiildiinga a Tree inference Phylogenetic Marker Database

AFigure flowchart 1 illustrating the major components of AMPHORA A flowchart illustrating the major components of AMPHORA. The marker protein sequences from representative genomes are retrieved, aligned, and masked. Profile hidden Markov models (HMMs) are then built from those 'seed' alignments. New sequences of interest are rapidly and accurately aligned to the trusted seed alignments through HMMs. Predefined masks embedded within the 'seed' alignment are then applied to trim off regions of ambiguity before phylogenetic inference. Alignment columns marked with '1' or '0' were included or excluded, respectively, during further phylogenetic analysis. be used for phylogenetic analyses of single genes or whole Protein phylogenetic marker database genomes. The core of AMPHORA is a protein phylogenetic marker database that contains curated protein sequence alignments with trimming masks and corresponding profile hidden Markov models (HMMs). Thirty-one protein encoding

Genome Biology 2008, 9:R151 http://genomebiology.com/2008/9/10/R151 Genome Biology 2008, Volume 9, Issue 10, Article R151 Wu and Eisen R151.4

phylogenetic marker genes (dnaG, frr, infC, nusA, pgk, pyrG, comparing them only once each to the HMM model. As a rplA, rplB, rplC, rplD, rplE, rplF, rplK, rplL, rplM, rplN, result, the computational cost increases linearly with the rplP, rplS, rplT, rpmA, rpoB, rpsB, rpsC, rpsE, rpsI, rpsJ, number of sequences to be aligned. In contrast, the computa- rpsK, rpsM, rpsS, smpB, and tsf) from representatives of tional cost of a pair-wise alignment approach increases poly- complete bacterial genomes were individually aligned using nomially and can soon become prohibitively expensive. CLUSTALW. The alignments were curated and trimming masks were added manually by visually inspecting the align- Application I: Bacterial genome trees ments. We selected these proteins because they are univer- Constructing a 'genome' tree sally distributed in bacteria; the vast majority of them exist as We downloaded 578 complete bacterial genomes available at single copy genes within each genome; and they are house- the time of our study from the National Center for Biotechnol- keeping genes that are involved in information processing ogy Information (NCBI) RefSeq collections (Additional data (replication, transcription, and translation) or central metab- file 1). Protein marker sequences for 31 proteins were olism, and thus are thought to be relatively recalcitrant to lat- retrieved, aligned, trimmed, and concatenated as described in eral gene transfer [19]. the Materials and methods section (see below). This resulted in a mega-alignment of 5,591 good amino acid positions (col- High quality and highly reproducible sequence alignments umns) by 578 species (rows). A maximum likelihood genome Molecular phylogenetic inference assumes common ancestry, tree was constructed from this mega-alignment (Additional or homology, for every single column of a multiple sequence data file 2). A bootstrapped maximum likelihood genome tree alignment. When this assumption is violated, phylogenetic of 310 representatives is shown in Figure 2. signal can be obscured by noise. It has been shown that align- ment quality can have greater impact on the final tree than As with trees built from SSU rRNA data, all of the major bac- does the tree building method employed [20]. Therefore, pre- terial phyla are well separated into their own monophyletic paring high quality sequence alignments is a most critical part groups, even though the relationships among some of them of any molecular phylogenetic analysis. This preparation typ- remain unclear. Strikingly, unlike the SSU rRNA tree, the ically involves careful but tedious manual editing and trim- bushy area (intermediate levels) of our tree is highly resolved. ming of the generated alignments, and thus remains the In the γ-, for example, the nodes separating greatest challenge to automation. When scaling up this proc- taxa into different orders, families, and genera receive gener- ess, the trimming step is often simply ignored. Automated ally excellent bootstrapping support, whereas uncertainty is trimming based on the number of gaps in each column or high in the corresponding regions of the SSU rRNA tree each column's conservation score can be used to select con- (Additional data file 3). Highly robust organismal phyloge- served blocks, but this still is not satisfactory when a high nies of γ-proteobacteria and α-proteobacteria have been quality tree is required [17]. inferred previously using hundreds of commonly shared genes [21,22] and are congruent to our genome tree. This We overcame this problem by taking advantage of a unique reflects the much-reduced stochastic noise present in the con- feature of profile HMM-based multiple sequence alignments. catenated protein sequences compared with that of a single, When using HMMs to align sequences, new sequences can be slowly evolving SSU rRNA gene. This uncertainty in the SSU mapped back, residue by residue, onto the 'seed' alignment rRNA tree - the backbone of modern microbial systematics - from which that HMM originated. When the seed alignment often prevents microbial taxonomists from placing new spe- includes an accurate human curated mask, the newly gener- cies or genera within higher taxa, particularly at these inter- ated alignments can be automatically trimmed accordingly, mediate levels [23]. When such assignments were thus producing high quality alignments without requiring nevertheless made for these problematic taxa, inconsistency further human intervention. In addition, the HMM model is was introduced into the taxonomic nomenclature. For exam- the only variable in this automated alignment and trimming. ple, taxa assigned to the orders Alteromondales, Pseudomon- When the same model is used, the alignments generated adales, and Oceanospirillales in Bergey's Taxonomic Outline thereby are completely additive and reproducible, thus ena- of [23] are intermingled and paraphyletic in our bling meaningful comparison of the results from different genome tree. It is our view that the needs to be phylogenetic studies or different researchers. revisited and possibly revised in such cases.

Speed Genome-based microbial taxonomy Another big advantage of using an HMM-based approach is Although use of SSU rRNA was a landmark advancement in speed. For example, AMPHORA needs only 0.5 minutes on molecular microbial systematics, genome sequences provide an average desktop computer (Intel Pentium CPU 3.2 GHz) to an important alternative and complement [11,12]. Phyloge- align 340 sequences of the rpoB family. In comparison, the netic trees built from multiple genes are more robust in same job takes de novo pair-wise alignment methods such as resolving taxonomic relationships below the phylum level CLUSTALW and MUSCLE 120 and 12 minutes, respectively. and hence provide an excellent alternative phylogenetic This is because our HMM-based method aligns sequences by framework for microbial systematics. Until many more

Genome Biology 2008, 9:R151 http://genomebiology.com/2008/9/10/R151 Genome Biology 2008, Volume 9, Issue 10, Article R151 Wu and Eisen R151.5

Prochlorococcus marinus subsp pastoris str CCMP1986

P1375

Mycobacterium avium subsp paratuberculosis K 10

Prochlorococcus marinus str MIT 9515

Synechococcus elongatus PCC 6301 Thermosynechococcus elongatus BP 1

Trichodesmium erythraeum IMS101 Herpetosiphon aurantiacus ATCC 23779

arinus str CCM

Anabaena variabilis ATCC 29413

Synechococcus sp JA 2 3B a 2 13 14863 Roseiflexus castenholzii DSM 13941 Acaryochloris Microcystismarina MBIC11017 aeruginosa NIES 843 Mycobacterium smegmatis str MC2 155 M Corynebacterium diphtheriae NCTC 13129 G

loeobacter violaceus P Chloroflexus aurantiacus J 10 fl Synechocystis sp PCC 6803 IA Saccharopolyspora erythraea NRRL 2338

Corynebacterium Dehalococcoides sp CBDB1 Nocardia farcinica IFM 10152 Nostoc sp PCC 7120 ophilum C orynebacterium Salinispora arenicola CNS 205 Rhodococcus sp RHA1 therm

m Salinispora tropica CNB 440 8052 Renibacterium salmoninarum ATCC 33209 B StreptomycesAcidothermus avermitilis cellulolyticus MA 4680 11B IM C 27343 efficiens YS 314 C CC 7421 TC A Leifsonia xyli subsp xyli str C Frankia alni ACN14a jeikeium m ThermobifidaFrankia spfusca EAN1pec YX biobacteriu Kineococcus radiotolerans SRS30216 phytofermentans ISDg Prochlorococcus marinus str NATL1A Arthrobacter aurescens TC1 Prochlorococcus marinus str MIT 9211 Prochlorococcus marinus subsp m Frankia sp CcI3 Synechococcus sp RCC307 Prochlorococcus marinus str MIT 9303 Prochlorococcus marinus str MIT 9313 Synechococcus sp WH 7803 Synechococcus sp CC9311 ym Synechococcus sp CC9902 Synechococcus sp CC9605 Propionibacterium acnes KPA171202 Synechococcus sp WH 8102 B Syntrophomonas wolfei subsp wolfei str Goettingen S Arthrobacter sp FB24 Moorella thermoacetica ATCC 39073 ifidobacterium Desulfitobacterium hafniense Y51 K411 Carboxydothermus hydrogenoformans Z 2901 beijerinckii N Pelotomaculum thermopropionicum SI Desulfotomaculum reducens MI 1 novyi NT Clavibacter m Caldicellulosiruptor saccharolyticus DSM 8903 Thermoanaerobacter sp X514 Thermoanaerobacter tengcongensis MB4 Clostridium thermocellum ATCC 27405 subsp capricolu Clostridium Rubrobacter xylanophilus DSM 9941 Clostridium difficile 630 Alkaliphilus oremlandii OhILAs Tropheryma whipplei TW08 27 Alkaliphilus metalliredigens QYMF adolescentisN ocardioides A sp JS Clostridium perfringens SM101 Fervidobacterium nodosum Rt17 B1 C lostridium 2 ichiganensis Clostridium Clostridium acetobutylicum ATCC 824 Thermosipho melanesiensis BI429 Clostridium tetani E88 a capricolum TCB Clostridium botulinum A str ATCC 19397 Clostridium kluyveri DSM 555 Fusobacterium nucleatum subsp nucleatum ATCC 25586 Aquifex aeolicus VF5 07 Acholeplasma laidlawii PG 8A TC Aster yellows witches broom phytoplasma AYWB Thermotoga lettingae TMO ycoplasm a synoviae 53 Thermotoga maritima MSB8 C 15703 Mesoplasma florum L1 a agalactiae PG D 614 M einococcus geotherm Mycoplasma hyopneumoniae 7448 Deinococcus radiodurans R1 Mycoplasma mobile 163K Petrotoga mobilis SJ95 Mycoplasma pulmonis UAB CTIP Buchnera aphidicola str Bp Baizongia pistaciae M ycoplasm 831 Buchnera aphidicolaB uchnera str aphidicolaAPS Acyrthosiphon pisum M ycoplasm TE Thermus thermophilus HB8 Mycoplasma penetrans HF 2 Buchnera aphidicola str Sg Schizaphis graminum Ureaplasma parvum serovar 3 str ATCC 700970 Mycoplasma gallisepticum R alis D Mycoplasma pneumoniae M129 Mycoplasma genitalium G37 Wigglesworthia glossinidia endosymbiontstr C SM 11300 Geobacillus kaustophilus HTA426 62 c C Bacillus subtilis subsp subtilis str 168 in Bacillus weihenstephanensis KBAB4 CandidatusCandidatus Blochmannia Blochmannia pennsylvanicus floridanusara cedri O ceanobacillus iheyensis Clip112 H Bacillus haloduransocua C 125 Bacillus clausii KSM K16 Staphylococcusria inn aureus subsp aureus USA300 Baumannia cicadellinicola Liste Sodalis glossinidius str morsitans Enterococcus faecalis V583 O157 H7 str Sakai Streptococcus gordonii str Challis substr CH1 Lactococcus lactis subsp lactis Il1403 Erwinia carotovora SCRI1043 sakei subsp sakei 23K Serratia proteamaculans 568 Lactobacillus casei ATCC 334 Lactobacillus delbrueckii ATCC BAA 365 8081 Photorhabdus luminescens TTO1 Lactobacillus helveticus DPC 4571 Lactobacillus johnsoniiSU NCC 1 533 MannheimiaActinobacillus succiniciproducens succinogenes MBEL55E 130Z Lactobacillus salivarius subsp salivarius UCC118 Leuconostoc mesenteroides ATCC 8293 O enococcus oeni P Haem ophilus influenzae PittEE Lactobacillus reuteri F275 somnus 129PT pentosaceus ATCC 25745 Lactobacillus plantarum WCFS1 str Pm70 Lactobacillus brevis ATCC 367 35000HP Rhodopirellula baltica SH 1 pleuropneumoniae L20 Candidatus Protochlamydia amoebophila UWE25 Chlamydia trachomatis A HAR 13 Vibrio fischeri ES114 Chlamydophila abortus S26 3 Photobacterium profundum SS9 Chlamydophila pneumoniae J138 Aeromonas salmonicida subsp A449 Leptospira borgpetersenii JB197 ATCC 7966 Borrelia garinii PBi ii 37 Treponema denticola ATCC 35405 Psychrom onas ingraham Treponema pallidum subsp pallidum str Nichols Shewanella pealeana ATCC 700345 Chlorobium tepidum TLS Chlorobium phaeobacteroides DSM 266 Colwellia psychrerythraea 34H Chlorobium chlorochromatii CaD3 Pseudoalteromonas haloplanktis TAC125 Pelodictyon luteolum DSM 273 Idiomarina loihiensis L2TR Prosthecochloris vibrioformis DSM 265 MWYL1 Salinibacter ruber DSM 13855 Pseudoalteromonas atlantica T6c M arinomonas sp Cytophaga hutchinsonii ATCC 33406 Bacteroides fragilis NCTC 9343 Parabacteroides distasonis ATCC 8503 ChromohalobacterHahella chejuensis salexigens KCTC DSM 30432396 Porphyromonas gingivalis W83 C andidatus S Marinobacter aquaeolei VT8 Gramella forsetii KT0803 onas stutzeri A1501 Flavobacteriumulcia johnsoniae m UW101 Saccharophagus degradans 2 40 uelleri G P seudom Flavobacterium psychrophilum JIP02 86 bacterium Ellin345W Psychrobacter arcticus 273 4 SS Psychrobacter sp PRwf 1 Solibacter usitatus Ellin6076 Bdellovibrio bacteriovorus HD100 Sorangium cellulosum So ce 56 Acinetobacter sp ADP1 Myxococcus xanthus DK 1622 ATCC 17978 Anaeromyxobacter dehalogenans 2CP C borkumensis SK2 Anaeromyxobacter sp Fw109 5 P orby Pelobacterelob propionicus DSM 2379 Candidatus Ruthia magnifica Geobacteracte uraniireducens Rf4 Candidatus Vesicomyosocius okutanii HA L1 Geobacter metallireducensr carbinolicu GS 15 Thiomicrospira crunogenaophila XCL str C 2 G eobacter sulfurreducens PCA Coxiellaeum burnetii RSA 493LHE 1 Syntrophus aciditrophicus SB Syntrophobacter fumaroxidans MPOBs D ella pn Desulfococcus oleovorans Hxd3 SM subsp novicida U112 Desulfotalea psychrophila LSv54 2 L egionXylella fastidiosa Temecula1 Lawsonia intracellularis PHE MN1 00 380 Desulfovibrio desulfuricans G20 nicola ehrlichei M Desulfovibrio vulgaris subsp vulgaris str Hildenborough H alorhodospira halophila S Desulfovibrio vulgaris subsp vulgaris DP4 N Alkalilim 666 W itratiruptor sp SB Helicobacterolinella succinogenes acinonychis str D Sheeba 118 Sulfurovum sp NBC37 1 Xanthomonas campestrisNitrosococcus pv vesicatoria oceani AcidovoraxATCCstr 85 1970710 sp JS42 Arcobacter butzleri RM4018 Delftia acidovorans SPH 1 S Dichelobacter nodosus VCS1703A Campylobacterulfurim hominis ATCC BAA 381 onas sp JS 400 RM1221 Methylococcus capsulatus str Bath MI1000 Campylobacter jejuni subsp jejuni 81116 G Campylobacter fetus subsp fetus 82 40 155 2 Campylobacter curvus onas525 92 denitrificans D Campylobacter concisus 13826 P olarom Magnetococcus sp MC 1 Candidatus Pelagibacter ubique HTCC1062 Boryong R Verminephrobacter eiseniae EF01 2 str Madrid E S Neorickettsiaickettsia sennetsu bellii str O Miyayama M 1740 Acidovorax avenae subsphodoferax citrulli ferrireducens AAC00 1 T W Polaromonas naphthalenivorans CJ2 2 Wolbachia endosymbiont strain TRS of Brugia malayi R Bordetella petrii Anaplasm Anaplasma phagocytophilumolb HZ Ehrlichia ruminantium str Welgevonden Methylibium petroleiphilum PM1 Ehrlichia canis str Jake str Arkansas Magnetospirillum magneticum AMB 1 ach Azoarcus sp EbN1 Rhodospirillum rubrum ATCC 11170 Acidiphilium cryptum JF 5 Azoarcus sp BH72 Granulibacter bethesdensis CGDNIH1 SM C 1247 Gluconacetobacter diazotrophicus PAl 5 urkholderia xenovorans LB G Sphingomonas wittichii RW1 ia end Zymomonas mobilis subsp mobilis ZM4 R alstonia solanacearum Sphingopyxis alaskensis RB2256 B Novosphingobium arom Erythrobacter litoralis HTCC2594 1251 Maricaulis maris MCS10 Caulobacter crescentus CB15 TC Hyphomonas neptunium ATCC 15444 luconobacter oxydans 621H Paracoccus denitrificans PD1222 Jannaschia sp CCS1 Rhodobacter sphaeroides 2 4 1 PolynucleobacterHerminiimonas sp QLW P1DMWA arsenicoxydans 1 Dinoroseobacter shibae DFL 12 a m Janthinobacterium sp Marseille SU osym ultiformis ATCC 25196um A SM419 arginale str St M 85 389 Nitrosomonas eutropha C91 Dechloromonas aromatica RCB bio

violace nt of D Methylobacillus flagellatus KT edicae W rium m Nitrosomonas europaea ATCC 19718 rosophila m Nitrosospira m FA 1090 aries quintana str Toulouse Mesorhizobium sp BNC1 Thiobacillus denitrificansobacte ATCC 25259Bartonella bacilliformis KC583 str Houston 1 aticivorans DSM 12444 Bartonella tribocorum CIP 105476 Bradyrhizobium sp BTAi1 Silicibacter sp TM1040

hrom Bradyrhizobium sp ORS278 Mesorhizobium loti MAFF303099 C elan inorhizobium

Silicibacter pomeroyi DSS 3 Ochrobactrum anthropi ATCC 49188 Nitrobacter hamburgensis X14 S Brucella melitensis biovar AbortusAgrobacterium 2308 tumefaciens str C58 ogaste Xanthobacter autotrophicus Py2

Methylobacterium extorquens PA1

Azorhizobium caulinodans ORS 571 Roseobacter denitrificans OCh 114 Parvibaculum lavamentivorans DS 1 r Rhizobium leguminosarum bv viciae 3841 Rhodopseudomonas palustris CGA009

Gammaproteobacteria Acidobacteria Deinococcus/Thermus /Chlorobi / Deltaproteobacteria

AnFigure unrooted 2 maximum likelihood bacterial genome tree An unrooted maximum likelihood bacterial genome tree. The tree was constructed from concatenated protein sequence alignments derived from 31 housekeeping genes. All major phyla are separated into their monophyletic groups and are highlighted by color. The branches with bootstrap support of over 80 (out of 100 replicates) are indicated with black dots. Although the relationships among the phyla are not strongly supported, those below the phylum level show very respectable support. The radial tree was generated using iTOL [42]. genomes have been sequenced, however, a hybrid approach full genome sequences can be placed by comparing their SSU may be most fruitful. A genome tree built from sequenced rRNA sequences with those of sequenced species. genomes can be used as a scaffold; species for which we lack

Genome Biology 2008, 9:R151 http://genomebiology.com/2008/9/10/R151 Genome Biology 2008, Volume 9, Issue 10, Article R151 Wu and Eisen R151.6

Average rate of protein evolution in bacterial genomes be anchored. The genome of one SAR11 member, namely The average rates of protein evolution are proportional to the Candidatus Pelagibacter ubique, was subsequently branch lengths of our genome tree. The branch length varies sequenced, which allows for much finer phylotyping now. In widely among different lineages. For example, as has been our phylotyping analyses, 8,656 marker sequences (46.5% of previously reported, bacteria that have adopted an intracellu- the total) form outgroups to only P. ubique. We have assigned lar lifestyle have, in general, evolved more rapidly [24], with them to the SAR11 clade because their closest neighbor in the Wigglesworthia glossinidia (the endosymbiont of Glossina tree is P. ubique and their dominance in the population is brevipalpis) and Neorickettsia sennetsu str. Miyayama consistent with previous quantitative estimations by fluores- evolving at the fastest pace. The slowest rates are found in a ence in situ hybridization that, on average, members of the group of spore forming bacteria such as Carboxydothermus SAR11 clade account for one-third of the ocean surface bacte- hydrogenoformans, Moorella thermoacetica, Clostridium rioplankton communities [27]. spp., and Bacillus spp. These slow rates are of particular interest because it has been suggested that they might be Strategically located reference genomes related to the longer generation times for organisms that The use of metagenomics to phylotype communities has been spend a significant fraction of their time as dormant spores. limited by the lack of sequenced genomes from many taxo- Our data for spore forming bacteria are consistent with that nomic groups. To help fill some of these gaps, we sequenced hypothesis and differ strikingly from the findings of a recent representatives of several phyla for which genome sequences study [25], which identified no generation time effect in these were not previously available. For example, we recently organisms. sequenced the genomes of Dictyoglomus thermophilum and Thermomicrobium roseum as part of a US National Science Application II: metagenomic phylotyping Foundation (NSF) funded 'Tree of life' project (Eisen JA and Reanalysis of the phylotypes reported in the Sargasso Sea coworkers, unpublished data). To demonstrate the usefulness We used our automated pipeline to reanalyze the environ- of these additional genomes for improved phylotyping, we mental shotgun sequencing data collected from the Sargasso analyzed metagenomic data from a Yellowstone hot spring Sea and phylotyped in a previous study [26]. The approxi- community. From the 8,341 Sanger sequence reads obtained, mately 1.1 million predicted genes yielded a total of 18,607 we identified 59 reads that match the marker sequences genes that corresponded to our 31 protein markers and that present in our database. For 20 of these reads, their closest were long enough for phylogenetic analysis. Figure 3 illus- neighbors by phylogenetic analysis are D. thermophilum or T. trates the distribution of each of the 31 protein markers roseum (ten reads each), thus demonstrating the usefulness among the major phylotypes. Our analysis identifies the α- of these genomes for phylotyping their close relatives in the proteobacteria as the most abundant group, because more Yellowstone community (see Additional data file 4 for one than half of the marker sequences were assigned to this such example). This highlights the need to increase taxo- group. Notably, the various individual protein markers nomic sampling by selecting bacteria for sequencing based on present remarkably consistent microbial diversity profiles, their phylogenetic positions. thus suggesting that results for different markers may be additive. Selecting reference sequences For best phylotyping results, the more reference sequences It was noted that SSU rRNA gives significantly different esti- the better. Therefore, theoretically, the greater number of mates of microbial composition than those by protein mark- marker sequences identifiable from a more comprehensive ers [26]. This is believed to be caused by large variations in database such as the NCBI nonredundant protein sequence rRNA gene copy numbers among different species. The pro- (nr) database would be preferable to the lesser number tein markers used in our study are nearly all single-copy obtainable from complete genomes. However, taxonomic genes and thus should, theoretically, give a more accurate sampling bias of the reference sequences has a great impact estimation of the microbial composition. One factor affecting on the resulting phylotype assignments (see below). To be our analysis is that the peptides are based on assemblies able to make meaningful comparisons among the results rather than sequence reads. Therefore, our method will obtained using different markers, the taxonomic sampling underestimate those organisms that have deep coverage in must be controlled. In this regard, a complete genome data- assemblies. This is one major reason why depth of coverage base, in which every marker was sampled equally, would be should be provided with metagenomics assemblies and preferable to the NCBI nr database, in which each marker was annotation. sampled to a different extent.

Members of the α-proteobacterial SAR11 clade are the most With very few exceptions such as gyrB [28], the protein dominant micro-organisms in the Sargasso Sea [27]. At the marker sequences with species information in the nr database time of the Sargasso Sea metagenomics study, there were no were mostly derived from genome sequencing projects. This complete genome sequences available for members of the is because it is very difficult to obtain protein encoding genes SAR11 clade, and thus many of the SAR11 sequences could not by polymerase chain reaction amplification because their

Genome Biology 2008, 9:R151 http://genomebiology.com/2008/9/10/R151 Genome Biology 2008, Volume 9, Issue 10, Article R151 Wu and Eisen R151.7

dnaG 0.8 frr infC nusA pgk 0.7 pyrG rplA rplB 0.6 rplC rplD rplE 0.5 rplF rplK rplL rplM 0.4 rplN rplP rplS 0.3 rplT

Relative abundance Relative rpmA rpoB rpsB 0.2 rpsC rpsE rpsI 0.1 rpsJ rpsK rpsM 0 rpsS smpB tsf

Aquificae Chlorobi FirmicutesChloroflexi Chlamydiae ThermotogaeFusobacteria Spirochaetes Bacteroidetes CyanobacteriaAcidobacteria Actinobacteria Planctomycetes BetaproteobacteriaDeltaproteobacteria Alphaproteobacteria Unclassified bacteria GammaproteobacteriaEpsilonproteobacteria

Unclassified proteobacteria

FigureMajor phylotypes 3 identified in Sargasso Sea metagenomic data Major phylotypes identified in Sargasso Sea metagenomic data. The metagenomic data previously obtained from the Sargasso Sea was reanalyzed using AMPHORA and the 31 protein phylogenetic markers. The microbial diversity profiles obtained from individual markers are remarkably consistent. The breakdown of the phylotyping assignments by markers and major taxonomic groups is listed in Additional data file 5. sequences are not conserved at the nucleotide level [29]. As a value to define the top hits. Because individual bacterial result, the nr database does not actually contain many more genomes and proteins can evolve at very different rates, a uni- protein marker sequences that can be used as references than versal cut-off that works under all conditions does not exist. those available from complete genome sequences. As a result, the final results can be very subjective.

Comparison of phylogeny-based and similarity-based phylotyping In contrast, our tree-based bracketing algorithm places the Although our phylogeny-based phylotyping is fully auto- query sequence within the context of a phylogenetic tree and mated, it still requires many more steps than, and is slower only assigns it to a taxonomic level if that level has adequate than, similarity based phylotyping methods such as a sampling (see Materials and methods [below] for details of MEGAN [30]. Is it worth the trouble? Similarity based phylo- the algorithm). With the well sampled species Prochlorococ- typing works by searching a query sequence against a refer- cus marinus, for example, our method can distinguish closely ence database such as NCBI nr and deriving taxonomic related organisms and make taxonomic identifications at the information from the best matches or 'hits'. When species species level. Our reanalysis of the Sargasso Sea data placed that are closely related to the query sequence exist in the ref- 672 sequences (3.6% of the total) within a P. marinus clade. erence database, similarity-based phylotyping can work well. On the other hand, for sparsely sampled clades such as However, if the reference database is a biased sample or if it Aquifex, assignments will be made only at the phylum level. contains no closely related species to the query, then the top Thus, our phylogeny-based analysis is less susceptible to data hits returned could be misleading [31]. Furthermore, similar- sampling bias than a similarity based approach, and it makes ity-based methods require an arbitrary similarity cut-off

Genome Biology 2008, 9:R151 http://genomebiology.com/2008/9/10/R151 Genome Biology 2008, Volume 9, Issue 10, Article R151 Wu and Eisen R151.8

1 1 0.9 0.9 0.8 0.8 0.7 0.7 0.6 0.6 AMPHORA MEGAN 0.5 0.5 0.4 0.4 Sensitivity Specificity 0.3 0.3 0.2 0.2 0.1 0.1 0 0 phylum class order family genus species phylum class order family genus species

ComparisonFigure 4 of the phylotyping performance by AMPHORA and MEGAN Comparison of the phylotyping performance by AMPHORA and MEGAN. The sensitivity and specificity of the phylotyping methods were measured across taxonomic ranks using simulated Sanger shotgun sequences of 31 genes from 100 representative bacterial genomes. The figure shows that AMPHORA significantly outperforms MEGAN in sensitivity without sacrificing specificity. sequence assignments only at the taxonomic levels that are omic phylotyping, the marker genes do not have to be single- supported by the available data. copy or universal, but they must have been reasonably well sampled, have sufficient phylogenetic signal, and not be fre- To compare quantitatively the performance of the phylogeny quently exchanged between distantly related lineages. Until based and the similarity based phylotyping, we carried out a we learn more about the extent of lateral gene transfer in nat- simulation study. We determined the sensitivity and specifi- ural microbial communities, we caution against using every city of the taxonomic assignments made by AMPHORA and protein sequence collected in metagenomics studies for MEGAN using 3,088 simulated shotgun sequences of 31 phy- microbial diversity study. logenetic marker genes identified from 100 known bacterial genomes as benchmarks. The 100 genomes were chosen in More reference genomes such a way that maximizes their representation of the phylo- We have shown that adding representatives of novel phyla genetic diversity and thus decreases the impact of the data can facilitate metagenomic phylotyping. More reference sampling bias of current genome sequencing efforts on our genomes are needed for optimal performance. Although the results. Figure 4 compares the sensitivity and specificity of sequencing of thousands of microbial genomes is underway, the phylotyping assignments at the phylum, class, order, fam- the organisms chosen are a biased sample and thus are not ily, and genus level using AMPHORA and MEGAN. The gen- truly representative of the total microbial diversity. We see a eral trend toward decreasing sensitivity seen in the figure need to select microbes systematically for sequencing based from the phylum to the species level simply reflects the fact mainly on their phylogenetic positions, thus maximizing their that the amount of reference data available for taxonomic value for comparative genomics and phylogenomic studies. assignment is decreasing. However, AMPHORA significantly outperformed MEGAN in sensitivity at all taxonomic ranks. Both methods performed extremely well in specificity at all Conclusion levels (>0.97) except at the species level, where AMPHORA Currently, SSU rRNA is still the most powerful phylogenetic (0.63) outperformed MEGAN (0.43) by a large margin. marker because of the number of sequences available and the scope of taxonomic coverage. However, the imminent arrival Future issues of thousands of microbial genome sequences will vastly Additional markers expand the amount of data available for alternative protein We are in the process of adding more proteins to our initial phylogenetic markers, thus presenting us with both a chal- database of 31 markers, including the commonly used protein lenge and an opportunity. We have developed AMPHORA, a markers RecA, HSP70, and EF-Tu. Ideally, a probability fully automated method for phylogenetic inference using based method that evaluates the positional homology of the multiple protein markers. AMPHORA offers speed, reliabil- multiple sequence alignment could be developed to automate ity, and high quality analyses. By eliminating the need for fully the process of masking. Major expansion will also time consuming manual curation of sequence alignments, it require systematic assessment of many other protein families removes one of the tightest bottlenecks in large-scale protein for their suitability as phylogenetic markers. For metagen- phylogenetic inference. We demonstrated its usefulness for

Genome Biology 2008, 9:R151 http://genomebiology.com/2008/9/10/R151 Genome Biology 2008, Volume 9, Issue 10, Article R151 Wu and Eisen R151.9

automating both the construction of genome trees and the marker and treated it as 'missing' in that particular genome. assignment of phylotypes to environmental metagenomic It has been shown that as long as there is sufficient data, a few data. We believe such a phylogenomic approach will be 'holes' in the dataset will not compromise the resulting tree valuable in helping us to make sense of rapidly accumulating [36]. microbial genomic data. Phylotyping by phylogenetic analyses (AMPHORA) The protein markers used to construct the bacterial genome Materials and methods tree (see above) and the resultant genome tree were used as Protein phylogenetic marker database the reference sequences and the reference tree for phylotyp- For each marker, we first identified their protein sequences ing metagenomic data from the Sargasso Sea or the simulated from representative bacterial genomes. The amino acid sequences described below. Each marker sequence identified sequences were aligned using CLUSTALW [32] and then from the metagenomic data or simulated sequences was indi- manually edited and masked using the GDE package [33]. vidually aligned to its corresponding reference sequences and The mask is a text string of '1' and '0', where reliably aligned trimmed using the method described above. Then it was columns were labeled '1' and ambiguous columns were inserted into the reference tree using a maximum parsimony labeled '0'. Next, we used HMMer [34] to make local profile method of RAXML [37], constraining the topology of the tree HMMs from these 'seed' alignments (Figure 1). to that of the genome tree. This tree construction procedure was extremely fast, and 100 bootstrap replicates were run for Automated sequence alignment and trimming each query sequence to assess the confidence of the branching Subsequent steps are carried out by a Perl script joining mul- orders. The trees were rooted arbitrarily using Deinococcus tiple automated processes (Figure 1). First, HMMer effi- radiodurans as the outgroup. Tree branch lengths were cal- ciently aligns the query amino acid sequences onto the culated using the neighbor joining algorithm with a fixed tree trusted and fixed seed alignments. The Perl script then reads topology. the masks embedded in the seed alignments and automati- cally trims the query alignments accordingly. A tree-based bracketing algorithm was then employed to assign a phylotype to the query sequence (Figure 5), as fol-

Bacterial genome tree construction lows. Starting from the immediate ancestor n0 of the query Homologs of each of the 31 phylogenetic marker genes were sequence and moving toward the root of the tree, the first identified from the 578 complete bacterial genomes by internal node n1 whose bootstrap support exceeded a cut-off BLASTP searches (using marker sequences of Escherichia (for example, 70%) was identified. The common NCBI taxon- coli as query sequences and a cut-off E-value of 0.1) followed omy t1 that was shared by all descendants of the node n1 rep- by HMMer searches (cut-off E-value 1 × e-10). The corre- resented the most conservative taxonomic prediction for the sponding protein sequences were retrieved, aligned, and query sequence. Using the branch length information, finer trimmed as described above, and then concatenated by spe- scale phylotyping was carried out by comparing the normal- cies into a mega-alignment. A maximum likelihood tree was ized branch length from n0 to n1 with these between taxo- then constructed from the mega-alignment using PHYML nomic ranks that had been tallied from the bacterial genome [35]. The model selected based on the likelihood ratio test was tree. Based on this comparison, a taxonomic rank below or the WAG model of amino acid substitution with γ-distributed equal to t1 was assigned to the node n0. The taxonomy of the rate variation (five categories) and a proportion of invariable sister node of the query at this rank was then assigned to the sites. The shape of the γ-distribution and the proportion of the query. All tree branch lengths were normalized by dividing invariable sites were estimated by the program. them by the lengths of the root-to-tip branches of their partic- ular lineages. This was done to make the tree more clock-like, To speed up bootstrapping analyses, very closely related taxa and therefore the branch lengths would be much more were removed from the original mega-alignment, which left informative in inferring the time of evolution. In the simula- us with 310 taxa. Maximum likelihood trees were made from tion study, the query sequence itself was removed from the 100 bootstrapped replicates of this reduced dataset using reference dataset before the analyses. PHYML with the same parameters described above. Phylotyping by similarity-based analyses (MEGAN) With very few exceptions, the marker genes are single-copy A total of 3,088 simulated phylogenetic marker gene genes in all of the bacterial genomes analyzed. In those rare sequences described below were searched against a database cases in which two or more homologs were identified within a of complete bacterial genomes using BLASTX. The query single species, a tree-guided approach was used to resolve the sequence itself was discarded from the BLAST hits before redundancy. If the redundancy resulted from a species-spe- feeding the BLAST results into the software MEGAN [30] for cific duplication event, then one homolog was randomly cho- similarity-based phylotyping. A top per cent cut-off of 20% sen as the representative. In all other cases, to avoid potential was used to retain only those hits whose matching scores are complications such as lateral gene transfer, we excluded that at least 80% of the best matching score. This cut-off was

Genome Biology 2008, 9:R151 http://genomebiology.com/2008/9/10/R151 Genome Biology 2008, Volume 9, Issue 10, Article R151 Wu and Eisen R151.10

SSU rRNA tree construction Order Family Genus SSU rDNA sequences were extracted from complete I genomes, aligned, and trimmed using an online tool MyRDP H [40]. When multiple copies of SSU rRNA genes were present G within a single genome, one representative was randomly F chosen. Maximum likelihood tree was constructed using PHYML [35], applying the GTR model of substitution, with a E γ-distribution (α estimated by the program) of rates of five D n0 categories of variable sites and a proportion of invariable sites C (proportion estimated by the program). n1 query* B A Availability The AMPHORA package and the simulation study data can be AFigure tree based 5 bracketing algorithm for phylotyping a query sequence downloaded from [41]. A tree based bracketing algorithm for phylotyping a query sequence. To assign a phylotype to the query sequence, its immediate ancestor n0 and the first internal node n1 with ≥70% bootstrapping support were identified. The known descendant leaf nodes of n1, namely A through D, are used to Abbreviations infer the taxonomy of the query, in conjunction with the normalized AMPHORA: AutoMated PHylogenOmic infeRence; HMM: branch length information. The dashed timelines delimiting various hidden Markov model; NCBI: National Center for Biotech- taxonomic ranks were inferred from a clock that had been calibrated from the bacterial genome tree. nology Information; nr: nonredundant protein sequence; SSU: small subunit NSF: US National Science Foundation. chosen to match the one used in a similar phylotyping simu- lation study described in the original MEGAN report [30]. All Authors' contributions other parameters of MEGAN were set as default values except MW designed the study, developed the method, and per- that the min-support (the minimum number of sequence formed the analyses. JAE advised on method design and test- reads that must be assigned to a taxon) is set to 1, because in ing. MW and JAE wrote the paper. our simulation study each query sequence was assigned a phylotype independently. Additional data files Phylotyping simulation study The following additional data are available with the online To assess the performance of the phylotyping methods, a sim- version of this paper. Additional data file 1 is a table listing the ulation study was carried out. One hundred representative 578 complete bacterial genomes downloaded from the NCBI genomes maximizing the phylogenetic diversity of the 578 RefSeq database for this study. Additional data file 2 is a fig- complete bacterial genomes were selected using the genome ure of a maximum likelihood genome tree of 578 bacterial tree and an algorithm described in the report by Steel [38]. species; major taxonomic groups are highlighted by color. From each of the 31 phylogenetic marker genes identified Additional data file 3 provides a figure that compares γ-pro- from the 100 bacterial genomes, a DNA sequence fragment of teobacterial phylogenetic trees made from a super-matrix of 300 to 900 base pairs in length was randomly chosen, which 31 protein phylogenetic markers and from the SSU rDNA; resulted in a total of 3,088 simulated shotgun sequences that bootstrap support values are shown along their correspond- were used as benchmark query sequences in phylotyping ing branches. Additional data file 4 is a figure of a maximum (some markers are missing in some of the genomes). By com- likelihood tree of rpoB; adding a novel genome (Thermomi- paring the predicted taxa with the known taxa, the sensitivity crobium roseum) to the reference tree helped anchor a and specificity of phylotyping methods were calculated as sequence read (ZAVAM73TF) from a Yellowstone hotspring described in the report by Krause and coworkers [39]. Briefly, metagenomic study. Additional data file 5 is a table listing phylotypes breakdown of the Sargasso Sea metagenomic for a taxon i, let Pi be the number of query sequences from i, sequence data by phylogenetic markers and major taxonomic TPi be the number of sequences that are correctly assigned to groups. i, and FPi be the number of sequences that are incorrectly Additional578SeqPresenteddownloadedClickMaximumbacterialComparisontrees(A)shownahelpedhotspringPhylotypesdataSeamajor novel andmetagenomicdatabasecomplete madehere taxonomic along anchor genomefrom species. metagenomic isfor likelihooddatafrombreakdown afromof their the bacterialfilefigure table aγ file( -proteobacterialsequencea Thermomicrobium SSUMajor groups. super-matrixsequencethecorresponding 12345listing ofthatof treegenomeNCBIrDNA aofgenomesa taxonomic study. maximummaximum comparesthe of readphylotypesthe dataRefSeqrpoB(B). Sargasso tr578 (ZAVAM73TF) eeof downloadedbyBootstrapphylogenetic branches. 31 of complete groups likelihoodγdatabasephylogeneticlikelihood -proteobacterialproteinroseum578 breakdownSea bacterial aremetagenomic support )bacterialphylogenetic for from totreehighlighted treesgenomefrom the thismarkers ofof speciesthe values referencea therpoB. study.phylogenetic genomesYellowstone NCBItree Sargasso sequence markersbyand areAdding of color.Ref- 578tree assigned to i. The sensitivity TPi/Pi measures the proportion of query sequences that are correctly classified. The specifi- Acknowledgements city TPi/(TPi + FPi) measures the reliability of the phylotyping assignments. The initial development of this work was supported in part by NSF grant DEB-0228651 to JAE. The final development and testing was funded by the Gordon and Betty Moore Foundation (grant #1660 to JAE).

Genome Biology 2008, 9:R151 http://genomebiology.com/2008/9/10/R151 Genome Biology 2008, Volume 9, Issue 10, Article R151 Wu and Eisen R151.11

References tively constant despite spore dormancy. Evolution 2007, 1. Woese CR, Achenbach L, Rouviere P, Mandelco L: Archaeal phyl- 61:280-288. ogeny: reexamination of the phylogenetic position of Archae- 26. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen oglobus fulgidus in light of certain composition-induced JA, Wu D, Paulsen I, Nelson KE, Nelson W, Fouts DE, Levy S, Knap artifacts. Syst Appl Microbiol 1991, 14:364-371. AH, Lomas MW, Nealson K, White O, Peterson J, Hoffman J, Parsons 2. Hasegawa M, Hashimoto T: Ribosomal RNA trees misleading? R, Baden-Tillson H, Pfannkoch C, Rogers YH, Smith HO: Environ- Nature 1993, 361:23. mental genome shotgun sequencing of the Sargasso Sea. Sci- 3. Ludwig W, Klenk H-P: Overview: A phylogenetic backbone and ence 2004, 304:66-74. taxonomic framework for procaryotic systematics. In Bergey's 27. Morris RM, Rappe MS, Connon SA, Vergin KL, Siebold WA, Carlson Manual of Systematic Bacteriology Volume 1. 2nd edition. Edited by: CA, Giovannoni SJ: SAR11 clade dominates ocean surface bac- Boone DR, Castenholz RW, Garrity GM. New York, NY: Springer- terioplankton communities. Nature 2002, 420:806-810. Verlag; 2000:49-65. 28. Kasai H, Watanabe K, Gasteiger E, Bairoch A, Isono K, Yamamoto S, 4. Loomis WF, Smith DW: Molecular phylogeny of Dictyostelium Harayama S: Construction of the gyrB database for the identi- discoideum by protein sequence comparison. Proc Natl Acad Sci fication and classification of bacteria. Genome Inform Ser Work- USA 1990, 87:9093-9097. shop Genome Inform 1998, 9:13-21. 5. Lockhart PJ, Howe CJ, Bryant DA, Beanland TJ, Larkum AW: Substi- 29. Santos SR, Ochman H: Identification and phylogenetic sorting tutional bias confounds inference of cyanelle origins from of bacterial lineages with universally conserved genes and sequence data. J Mol Evol 1992, 34:153-162. proteins. Environ Microbiol 2004, 6:754-759. 6. Baldauf SL, Roger AJ, Wenk-Siefert I, Doolittle WF: A - 30. Huson DH, Auch AF, Qi J, Schuster SC: MEGAN analysis of level phylogeny of eukaryotes based on combined protein metagenomic data. Genome Res 2007, 17:377-386. data. Science 2000, 290:972-977. 31. Koski LB, Golding GB: The closest BLAST hit is often not the 7. Jeffroy O, Brinkmann H, Delsuc F, Philippe H: Phylogenomics: the nearest neighbor. J Mol Evol 2001, 52:540-542. beginning of incongruence? Trends Genet 2006, 22:225-231. 32. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving 8. Brown JR, Douady CJ, Italia MJ, Marshall WE, Stanhope MJ: Univer- the sensitivity of progressive multiple sequence alignment sal trees based on large combined protein sequence data through sequence weighting, position-specific gap penalties sets. Nat Genet 2001, 28:281-285. and weight matrix choice. Nucleic Acids Res 1994, 22:4673-4680. 9. Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P: 33. Smith SW, Overbeek R, Woese CR, Gilbert W, Gillevet PM: The Toward automatic reconstruction of a highly resolved tree genetic data environment: an expandable GUI for multiple of life. Science 2006, 311:1283-1287. sequence analysis. Comput Appl Biosci 1994, 10:671-675. 10. Delsuc F, Brinkmann H, Philippe H: Phylogenomics and the 34. Eddy SR: Profile hidden Markov models. Bioinformatics 1998, reconstruction of the tree of life. Nat Rev Genet 2005, 6:361-375. 14:755-763. 11. Wu M, Ren Q, Durkin AS, Daugherty SC, Brinkac LM, Dodson RJ, 35. Guindon S, Gascuel O: A simple, fast, and accurate algorithm Madupu R, Sullivan SA, Kolonay JF, Haft DH, Nelson WC, Tallon LJ, to estimate large phylogenies by maximum likelihood. Syst Jones KM, Ulrich LE, Gonzalez JM, Zhulin IB, Robb FT, Eisen JA: Life Biol 2003, 52:696-704. in hot carbon monoxide: the complete genome sequence of 36. Wiens JJ: Missing data, incomplete taxa, and phylogenetic Carboxydothermus hydrogenoformans Z-2901. PLoS Genet 2005, accuracy. Syst Biol 2003, 52:528-538. 1:e65. 37. Stamatakis A: RAxML-VI-HPC: maximum likelihood-based 12. Badger JH, Eisen JA, Ward NL: Genomic analysis of Hyphomonas phylogenetic analyses with thousands of taxa and mixed neptunium contradicts 16S rRNA gene-based phylogenetic models. Bioinformatics 2006, 22:2688-2690. analysis: implications for the taxonomy of the orders 'Rhodo- 38. Steel M: Phylogenetic diversity and the greedy algorithm. Syst bacterales' and Caulobacterales. Int J Syst Evol Microbiol 2005, Biol 2005, 54:527-529. 55:1021-1026. 39. Krause L, Diaz NN, Goesmann A, Kelley S, Nattkemper TW, Rohwer 13. Brocchieri L: Phylogenetic inferences from molecular F, Edwards RA, Stoye J: Phylogenetic classification of short envi- sequences: review and critique. Theor Popul Biol 2001, 59:27-40. ronmental DNA fragments. Nucleic Acids Res 2008, 14. Foster PG, Hickey DA: Compositional bias may affect both 36:2230-2239. DNA-based and protein-based phylogenetic 40. Cole JR, Chai B, Farris RJ, Wang Q, Kulam SA, McGarrell DM, Garrity reconstructions. J Mol Evol 1999, 48:284-290. GM, Tiedje JM: The Ribosomal Database Project (RDP-II): 15. Gatesy J, DeSalle R, Wheeler W: Alignment-ambiguous nucle- sequences and tools for high-throughput rRNA analysis. otide sites and the exclusion of systematic data. Mol Phylogenet Nucleic Acids Res 2005, 33:D294-D296. Evol 1993, 2:152-157. 41. AMPHORA [http://bobcat.genomecenter.ucdavis.edu/ 16. Eisen JA: Phylogenomics: improving functional predictions for AMPHORA] uncharacterized genes by evolutionary analysis. Genome Res 42. Letunic I, Bork P: Interactive Tree Of Life (iTOL): an online 1998, 8:163-167. tool for phylogenetic tree display and annotation. Bioinformat- 17. Castresana J: Selection of conserved blocks from multiple ics 2007, 23:127-128. alignments for their use in phylogenetic analysis. Mol Biol Evol 2000, 17:540-552. 18. Béjà O, Aravind L, Koonin EV, Suzuki MT, Hadd A, Nguyen LP, Jovanovich SB, Gates CM, Feldman RA, Spudich JL, Spudich EN, DeLong EF: Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science 2000, 289:1902-1906. 19. Jain R, Rivera MC, Lake JA: Horizontal gene transfer among genomes: the complexity hypothesis. Proc Natl Acad Sci USA 1999, 96:3801-3806. 20. Morrison DA, Ellis JT: Effects of nucleotide sequence alignment on phylogeny estimation: a case study of 18S rDNAs of apicomplexa. Mol Biol Evol 1997, 14:428-441. 21. Williams KP, Sobral BW, Dickerman AW: A robust species tree for the alphaproteobacteria. J Bacteriol 2007, 189:4578-4586. 22. Lerat E, Daubin V, Moran NA: From gene trees to organismal phylogeny in prokaryotes: the case of the gamma-proteo- bacteria. PLoS Biol 2003, 1:E19. 23. Taxonomic outline of the prokaryotes. [http://141.150.157.80/ bergeysoutline/main.htm] 24. Moran NA: Accelerated evolution and Muller's rachet in endosymbiotic bacteria. Proc Natl Acad Sci USA 1996, 93:2873-2878. 25. Maughan H: Rates of molecular evolution in bacteria are rela-

Genome Biology 2008, 9:R151